Incidentally, I found a single C routine which, according to the profiler, gets called a fair bit. It's straightforward, and there's a NEON-accelerated version but nothing else, so if there are any ARM experts reading I'd like some feedback on it. Essentially it takes two pointers, one to a list of floats and the other to a list of ints, plus an integer counter and a float multiplier.
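For anyone trying to picture that routine, here is a minimal scalar sketch of one plausible reading: each int is converted to float and scaled by the multiplier. The function name, the argument order and the direction of the conversion (ints in, floats out) are all assumptions, since the description above does not give them; this is not the actual routine.

```c
#include <stddef.h>

/*
 * Hypothetical sketch, not the real routine: the name, the argument
 * order and the int -> float direction are assumed for illustration.
 * dst: output list of floats
 * src: input list of ints
 * mul: float multiplier applied to every element
 * len: integer counter (number of elements to process)
 */
static void int_to_float_scaled(float *dst, const int *src, float mul, int len)
{
    for (int i = 0; i < len; i++) {
        /* Convert one int to float and scale it. */
        dst[i] = (float)src[i] * mul;
    }
}
```

A loop like this has no dependency between iterations, which is presumably why it lends itself to a NEON version that handles several elements per instruction.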
Just one quick question: which profiler are you using? My first attempts with valgrind resulted in errors, and I would also like to spot where my code spends the most time.
A quick update on my GPU efforts: after I got all motion compensation working, the performance was not that impressive. I hit 21-23 fps; the shader for motion compensation is just too complex. Since I had no clear idea where to optimize it, I froze that work and pursued a different idea.
In the last two weeks I wrote a patch for libav (aka ffmpeg) which adds the ability to transcode MPEG-2 (only frame pictures, but I did not see any others in DVB) to MPEG-4 ASP (this is Part 2, aka DivX with B-frames, and not H.264, aka Part 10). As far as I understand, this one is also licensed.
If I turn off writing to disk, I consistently get above 29 fps on a high-bitrate DVB-C sample (25 fps, 720x576). That is on the old Debian; I will try it on Raspbian at the weekend. The good thing is that the code is not optimized at all: the naive first implementation already gives me this frame rate, so there is lots of room for optimization. (All of this is without threading, without demuxing and without passing to OMX, but it is designed in a way that libav can write directly to OMX buffers.)
The drawback: gray bars are added to the left and right of the picture to simulate intra macroblocks in B-frames (a feature missing in MPEG-4). I still have to find some little bugs and do some optimization, then I will probably post the code.
Marten