Qualcomm and ARM GPUs can render to part of the front buffer without loading and saving the rest of it, which is slow. (Qualcomm has the QCOM_tiled_rendering extension, and ARM uses scissor boxes.) It also looks like Eric Anholt's driver supports this using scissoring.
Do the Broadcom drivers for the VC4 provide such a facility?