When it comes to GPGPU programming for the VideoCore 6 in your Pi 4, there are some changes compared to the old VideoCore 4 found in the Pi 0 to Pi 3. Most of these changes (QPU: symmetric register banks A and B, new 16-bit vector float format, …) have already been described elsewhere. Others (e.g. cache-related ones) are largely unknown.
If you are willing to do some "research" on the array of vector-processing units in your "new" Pi 4, take a look at:
where you will find a QPU assembler embedded in Python for the VideoCore 6 in your Pi 4 (examples included). For the first steps, only basic Python knowledge is required.