I could be wrong about this, but is the DMA code in the bcm2708_fb driver missing code to do L1 cache cleaning? Without it, most operations will work fine, because the data has usually been written back to memory by the time the DMA runs. If you get unlucky, though, small regions (on the order of 8k pixels on a 16-bit display, i.e. about one L1 cache's worth) might come out corrupted.
Presumably it needs something to clean (write back) the cache for the area being copied from before the DMA starts, and to invalidate it for the destination region so the CPU doesn't read stale lines afterwards.
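To illustrate the ranges involved: clean and invalidate work on whole cache lines, so the source and destination areas would each need widening to line boundaries first. This is only a user-space sketch of that arithmetic, not driver code. The 32-byte line size is my assumption for the ARM1176, and `line_range` and `widen_to_lines` are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed L1 line size in bytes (32 on the ARM1176, as far as I know). */
#define L1_LINE_BYTES 32u

struct line_range {
    uintptr_t start; /* rounded down to a line boundary */
    uintptr_t end;   /* rounded up to a line boundary (exclusive) */
};

/* Hypothetical helper: widen [addr, addr + len) to whole cache lines. */
static struct line_range widen_to_lines(uintptr_t addr, size_t len)
{
    struct line_range r;
    const uintptr_t mask = (uintptr_t)L1_LINE_BYTES - 1;

    /* The masking below only works for a power-of-two line size. */
    assert((L1_LINE_BYTES & (L1_LINE_BYTES - 1)) == 0);

    r.start = addr & ~mask;
    r.end = (addr + len + mask) & ~mask;
    return r;
}
```

Note that the widened destination range can cover bytes outside the copy, so invalidating a partial line at either edge could throw away unrelated dirty data; those edge lines would probably want cleaning first.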
It might actually be quicker just to flush the entire L1 cache, given that it's quite small (16k L1) and the driver only does DMA for largish regions. Possibly a call to dmac_flush_user_all()?
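For what it's worth, here is the arithmetic behind the 8k-pixel figure, plus a guess at when a full flush might win. It is a user-space sketch only. The 16 KB size comes from the figure above; the "copy is at least as big as the cache" threshold and the function names are my own assumptions, not anything measured or taken from the driver. It also assumes the depth is a multiple of 8 bits and at least 8.

```c
#include <stddef.h>

/* Assumed L1 data cache size, taken from the 16k figure above. */
#define L1_DCACHE_BYTES (16u * 1024u)

/* Number of whole pixels that fit in the L1 data cache.
 * Assumes bits_per_pixel is a multiple of 8 and at least 8. */
static size_t pixels_in_l1(unsigned bits_per_pixel)
{
    return L1_DCACHE_BYTES / (bits_per_pixel / 8u);
}

/* Hypothetical heuristic: returns 1 when the copied region is at least
 * as large as the cache, so flushing everything touches no more lines
 * than walking the range would. */
static int prefer_full_flush(size_t width_px, size_t height_px,
                             unsigned bits_per_pixel)
{
    size_t bytes = width_px * height_px * (bits_per_pixel / 8u);
    return bytes >= L1_DCACHE_BYTES;
}
```

So at 16 bits per pixel the cache holds 8192 pixels, which matches the size of the corrupted regions, and any copy of that many pixels or more already covers a cache's worth of data.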