Sorry if this is OLD, but I programmed the SH2 on the Saturn (16-bit instructions, 32-bit bus), so if your memory-fetch instruction sat on a 32-bit-aligned address, it was 1 cycle faster. Important for people writing audio mixers: your 16-bit sample fetch can be sped up.
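For anyone who wants to check this on their own code, here is a rough sketch in C of the alignment test, not anything from the original toolchain. The function names and the example addresses are made up for illustration; it only assumes what is stated above, that instructions are 16 bits wide and the fast case is a fetch instruction starting on a 32-bit boundary.

```c
#include <stdint.h>

/* SH2 instructions are 16 bits wide, so an instruction address is always even.
   These helpers are hypothetical, for illustration only. */

/* Returns 1 if the instruction at addr starts on a 32-bit boundary, else 0. */
static int on_longword_boundary(uint32_t addr)
{
    return (addr & 3u) == 0u;
}

/* Number of 16-bit filler slots needed in front of the instruction at addr
   to push it onto a 32-bit boundary. For an even addr this is 0 or 1. */
static unsigned pad_slots_needed(uint32_t addr)
{
    return (addr & 2u) >> 1;
}
```

You would feed it the address of the sample-fetch instruction from a map file or disassembly. If it says one slot is needed, put the filler ahead of the loop (or reorder an instruction) rather than inside it, otherwise you pay for the filler on every pass.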
I got 1001 speedups for every core. When I stopped, I was at a company of over 800 people and I was THE assembly-language guy.
I also recommend someone take a look at, and optimise, the felide constructors on the C compiler. I did, and it made a huge difference.
PS Memory writes didn't matter since the cache was mixed.