Using DMA versus not using it doesn't seem to have any real effect on performance, but different hardware has a large effect. PIGPIO is the performance winner on all platforms, but it was causing signal 27 (SIGPROF) problems when running inside my game emulator. I rebuilt PIGPIO with the EMBEDDED_IN_VM define to disable logging, and that solved the signal 27 problem. It introduces a new problem, though: if you don't properly shut down the PIGPIO library (e.g. your program quits in the middle or you kill it with ctrl-z), it leaves threads running and prevents you from initializing it again.
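One way to reduce the risk of leaving the library in that state is to catch the usual termination signals, set a flag, and let the main loop fall through to the library's shutdown call. This is only a minimal sketch, assuming POSIX sigaction; install_quit_handlers and quit_requested are hypothetical helper names, not part of PIGPIO.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Set by the signal handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t g_quit = 0;

static void on_quit_signal(int sig)
{
    (void)sig;
    g_quit = 1; /* only set a flag; the real cleanup runs in normal context */
}

/* Hypothetical helper: route SIGINT, SIGTERM and SIGHUP to the flag.
 * Returns 0 on success, -1 if any sigaction call fails. */
int install_quit_handlers(void)
{
    static const int signals[] = { SIGINT, SIGTERM, SIGHUP };
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_quit_signal;
    sigemptyset(&sa.sa_mask);

    for (size_t i = 0; i < sizeof signals / sizeof signals[0]; i++) {
        if (sigaction(signals[i], &sa, NULL) != 0)
            return -1;
    }
    return 0;
}

/* Hypothetical helper: poll this from the main loop. */
int quit_requested(void)
{
    return g_quit != 0;
}
```

The main loop would then be something like `while (!quit_requested()) { ... }` followed by a call to gpioTerminate(). SIGKILL cannot be caught, and ctrl-z only suspends the process (SIGTSTP) rather than ending it, so neither case is covered here. PIGPIO appears to install its own signal handlers during gpioInitialise(), so handlers like these would probably need to be installed after that call; treat that ordering as an assumption to verify.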
As for performance:
With the SPI bus set to 32 MHz, filling the framebuffer with a solid color (writing in large blocks) reaches 17 FPS on my RPi 3B. On my RPi 0W, the same code reaches 34 FPS. It seems odd that a slower machine is able to run twice as fast. I remember reading somewhere that someone else had encountered the same problem. Maybe someone here can shed some light on it.
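A rough back-of-the-envelope check may help frame those numbers. The sketch below assumes a full 240x320 frame at 16 bits per pixel (RGB565) and ignores command overhead and gaps between transfers; the function names are made up for illustration. If the test filled a smaller region or used a different pixel format, the figures change.

```c
#include <assert.h>

/* Upper bound on frames per second for a given SPI clock, assuming every
 * clock cycle carries one bit of pixel data (no overhead at all). */
double spi_fps_ceiling(double spi_hz, unsigned width, unsigned height,
                       unsigned bits_per_pixel)
{
    assert(spi_hz > 0.0 && width > 0 && height > 0 && bits_per_pixel > 0);
    double bits_per_frame = (double)width * height * bits_per_pixel;
    return spi_hz / bits_per_frame;
}

/* Minimum SPI clock (in Hz) needed to sustain a given frame rate under the
 * same no-overhead assumption. */
double spi_hz_needed(double fps, unsigned width, unsigned height,
                     unsigned bits_per_pixel)
{
    assert(fps > 0.0 && width > 0 && height > 0 && bits_per_pixel > 0);
    double bits_per_frame = (double)width * height * bits_per_pixel;
    return fps * bits_per_frame;
}
```

Under these assumptions a frame is 1,228,800 bits, so 32 MHz tops out at about 26 FPS. 17 FPS corresponds to roughly 20.9 Mbit/s of effective throughput, while 34 FPS would need about 41.8 MHz. If the assumptions hold, that would suggest the clock actually produced on the 0W is not the 32 MHz that was requested, which might be a lead worth checking.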
I added hardware scrolling support to my code and it works quite well. I'm going to add it to my game emulator this week and see whether side scrollers like Super Mario improve with it.
My ili9341 project now includes support for working with the display in portrait or landscape mode (including text and tile drawing). The scrolling is only possible in the portrait direction due to the simple hardware implementation.
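For anyone curious what the hardware scrolling involves, the ILI9341 exposes it through two commands: Vertical Scrolling Definition (0x33) and Vertical Scrolling Start Address (0x37). Both operate along the 320-line axis of the frame memory, which is why scrolling only works in one direction. The sketch below only builds the parameter bytes for those commands; the helper names are hypothetical and this is not necessarily how the project's code does it.

```c
#include <assert.h>
#include <stdint.h>

enum {
    ILI9341_VSCRDEF  = 0x33, /* Vertical Scrolling Definition */
    ILI9341_VSCRSADD = 0x37, /* Vertical Scrolling Start Address */
    ILI9341_LINES    = 320   /* lines along the scrolling axis */
};

/* Fill out[6] with the 0x33 parameters: top fixed area, scroll area and
 * bottom fixed area, each as a big-endian 16-bit line count. The three
 * areas must add up to 320 lines, so the scroll area is derived here. */
void ili9341_scroll_def_params(uint16_t top_fixed, uint16_t bottom_fixed,
                               uint8_t out[6])
{
    assert((unsigned)top_fixed + bottom_fixed <= ILI9341_LINES);
    uint16_t scroll_area = (uint16_t)(ILI9341_LINES - top_fixed - bottom_fixed);

    out[0] = (uint8_t)(top_fixed >> 8);
    out[1] = (uint8_t)(top_fixed & 0xFF);
    out[2] = (uint8_t)(scroll_area >> 8);
    out[3] = (uint8_t)(scroll_area & 0xFF);
    out[4] = (uint8_t)(bottom_fixed >> 8);
    out[5] = (uint8_t)(bottom_fixed & 0xFF);
}

/* Fill out[2] with the 0x37 parameter: the frame memory line shown at the
 * top of the scroll area, wrapped into the 0..319 range. */
void ili9341_scroll_start_params(uint16_t line, uint8_t out[2])
{
    line = (uint16_t)(line % ILI9341_LINES);
    out[0] = (uint8_t)(line >> 8);
    out[1] = (uint8_t)(line & 0xFF);
}
```

In use, the definition command would be sent once, and then only the two-byte start address needs to be rewritten to move the picture, which is far less SPI traffic than redrawing the frame.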
I'll publish my ili9341 code soon and shortly after my game emulator. It's coming along well and shouldn't take too much more time.
Here's a video of Super Mario Brothers (NES) running at 60 FPS without hardware scrolling support. There is a slight lag when the title banner is scrolling, but otherwise it holds a steady framerate. Once I enable hardware scrolling, this should be a rock-solid 60 FPS.