rgriggs wrote: ↑
Thu Aug 30, 2018 1:47 am
My thought was that if the GPU could simultaneously decode all those streams (up to about 30 Mbit/s max), it should be easy for the CPU to grab a frame
You got me curious so I did a frame-grab test. I used the latest version of FFmpeg to extract JPEG images from a completely standard H.264 1080p video. The video I used was a "pre-made" MP4 that I previously downloaded from YouTube (it was the brief neutron star video from NASA that I reference in my tutorial). The point is, there was no streaming or "live input" going on, nor was there any video display or any other software running on the system – so my Raspberry 3B+ was free to devote almost all of its resources to this one task.
After extracting exactly 1,269 frames, the average frame-grabbing rate came out to only 5.2 FPS. Keep in mind this was for only ONE MP4 video. Speed, of course, is in the eye of the beholder – but that kind of software-based frame rate is very typical for FFmpeg on the Raspberry. In other words, it's perfectly normal – yet utterly pathetic at the same time if your needs are 10 or 100 times greater.
Here's a brief summary of my Terminal session:
ffmpeg -i video.mp4 -vf fps=30 %04d.jpg
ffmpeg version 4.0.2 Copyright (c) 2000-2018 the FFmpeg developers
built with gcc 6.3.0 (Raspbian 6.3.0-18+rpi1+deb9u1)
encoder: Lavc58.18.100 mjpeg
frame= 1269 fps=5.2 q=24.8 Lsize=N/A time=00:00:42.30 bitrate=N/A speed=0.173x
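For anyone who wants to sanity-check that 5.2 FPS figure, here's a quick bit of Python arithmetic using the numbers straight from the FFmpeg session above (1,269 frames, 42.30 s of video, 0.173x real-time speed):

```python
# Sanity-check the frame-grab rate reported by FFmpeg.
# Figures taken from the session above.
frames = 1269
video_seconds = 42.30
speed = 0.173                          # fraction of real time

wall_clock = video_seconds / speed     # actual elapsed time, ~244.5 s
fps = frames / wall_clock              # effective frame-grab rate

print(round(wall_clock, 1))            # ~244.5 seconds of wall-clock time
print(round(fps, 1))                   # ~5.2 FPS
```

So FFmpeg's own reported numbers are internally consistent: grinding out those 1,269 JPEGs took about four minutes of wall-clock time.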
THE 25% REALITY:
As you can see, FFmpeg is using the standard "Lavc" library of encoders to convert to JPEG. Lavc is short for libavcodec. My build of FFmpeg is also using the latest version of libavcodec – 58.18.100.
Like so many things in life, there's theory and then there's reality.
For example, in theory the Raspberry's CPU has 4 cores that can speed everything up!
In reality, the CPU effectively has only ONE core – or at most "1.4 cores" – for almost everything it does. I say 1.4 cores because I rarely see overall usage go above 35%, and 35% of 4 cores is 1.4 cores. I was actually impressed that when FFmpeg exported the JPGs, it hovered between 33 and 35%.
There are a few exceptions to what I'm saying – such as the amazing GCC compiler or the x264 encoder with NEON support. They make full 25% x 4 = 100% use of the CPU.
But I look at my CPU monitor all the time, and by far the most common reading I see when it's tasked with a big job is 25%. This is even true for "native" applications like LibreOffice Writer that come standard with Raspbian. Just do a simple search and replace on a giant text document and you'll see the CPU max out at only 25%.

That's because it's very difficult to write genuine parallel code that's also customized for a specific processor architecture. ARM is also a bit of a stepchild in the computer world. A lot of developers who take the extra time to create custom parallel code for Intel/AMD-based systems simply aren't going to bother doing the same thing for ARM-based systems. Let's face it – ARM is dominant in the "tiny pocket computer" world; Intel/AMD are dominant in the "big boy" world. Things are slowly improving on the ARM side, but it makes sense that for heavy-duty nerd applications like video encoding, only so many developers are going to do the extremely difficult extra work to customize their code for ARM.
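To make the 25% point concrete, here's a minimal sketch – my own generic illustration, nothing to do with FFmpeg's actual internals – of why one core stays pegged while the other three sit idle: a CPU-bound loop runs on a single core unless the programmer explicitly splits the work across worker processes. (Assumes Linux's default "fork" start method for multiprocessing.)

```python
from multiprocessing import Pool

def busy(n):
    # CPU-bound stand-in for one chunk of real work (e.g. one frame range)
    total = 0
    for i in range(n):
        total += i * i
    return total

chunks = [200_000] * 4

# Serial version: ONE core grinds through all four chunks -> ~25% CPU
# on a quad-core Pi. Nothing happens in parallel automatically.
serial = sum(busy(n) for n in chunks)

# Parallel version: the SAME work, but explicitly farmed out to 4 worker
# processes. The programmer has to structure the code this way on purpose.
with Pool(4) as pool:
    parallel = sum(pool.map(busy, chunks))

assert serial == parallel   # identical result, ~4x the core utilization
```

The loop itself is trivial – the point is that the 100% version only happens because the work was deliberately chopped up and handed to a process pool, which is exactly the kind of extra effort many developers skip for ARM.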
In fact, here's a shocking but very common example of what I'm talking about:
When I use the highly regarded GIMP imaging program on my Raspberry to export a single PNG from a 1920 x 1080 image, it takes SEVERAL SECONDS to process! Now it's true that PNG exporting is much more computationally intensive than JPG, so this is certainly a bit of an apples-and-oranges comparison. It's also true that the "computational complexity" of images can vary dramatically, which has a big impact on the exporting time. Nonetheless, it still gives a good idea of how a seemingly simple task can eat up a tremendous amount of CPU time.
Here's the relevant timestamp data – the inode creation time (when the file was born) vs. the file modification time (when the last bit was written to it):
crtime: 2018-08-30 18:58:29.747153460
mtime: 2018-08-30 18:58:41.245832484
As you can see, this particular PNG – a single image – took about 11.5 seconds to export. That's a frame rate of only ~0.087 FPS, or roughly 345 times slower than real time (30 FPS)!
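Recomputing directly from the two timestamps above (fractional seconds truncated to microseconds, since that's all Python's datetime keeps):

```python
from datetime import datetime

# Timestamps from the output above, truncated to microsecond precision
crtime = datetime.fromisoformat("2018-08-30 18:58:29.747153")
mtime  = datetime.fromisoformat("2018-08-30 18:58:41.245832")

export_seconds = (mtime - crtime).total_seconds()   # ~11.5 s for ONE frame
fps = 1 / export_seconds                            # ~0.087 FPS
slowdown = 30 * export_seconds                      # vs. 30 FPS real time

print(round(export_seconds, 1))   # 11.5 seconds per frame
print(round(fps, 3))              # 0.087 FPS
print(round(slowdown))            # ~345x slower than real time
```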
And guess what? The entire time the CPU was stuck at a very typical 25%.
THE "REAL" CPU:
In theory, the Raspberry 3 has an impressive 64-bit ARMv8 CPU. In practice, it's treated like a 32-bit ARMv6 / ARMv7. Why do I say this? For understandable reasons, the Raspberry Pi Foundation wants to maintain backward compatibility with older models. They also don't want the nightmare of having to maintain two completely separate operating systems (32-bit and 64-bit) – one for older systems and one for newer systems. Remember – it's a non-profit, not a multi-billion dollar corporation.

Unfortunately, since the operating system is the "middle man" between you and the CPU, that has the effect of "watering down" the true capabilities of the CPU. When I compile FFmpeg and x264, for example, the operating system prevents me from making anything more than 32-bit ARMv6 optimizations. On top of that, those optimizations only apply to a limited subset of FFmpeg's total features. So for many things, it's not getting any specialized CPU acceleration.
SIMPLE ISN'T SIMPLE:
In theory, all that FFmpeg has to do is simply "export the JPG". Now I'm certainly not an expert on the inner workings of FFmpeg's source code or the libavcodec suite, but my guess is that there's a whole bunch of "pre-processing" that has to take place before it can even think about the actual JPG exporting process.

A classic example: unlike MJPEG video streams – where each frame is a completely normal "standalone" picture – H.264 video is vastly more complex. Instead of being a stitched-together series of simple pictures, it actually consists of (I)ntra-coded frames, (P)redicted frames and (B)i-directionally predicted frames. The I frames are fairly simple, because they're the rough equivalent of a standalone JPEG image. But P and B frames are not complete pictures. Instead, they contain some combination of motion-vector displacements and residual image data. So FFmpeg almost certainly has to perform a considerable analysis of the "incomplete" P and B frames in the larger context of the video before it can process them into legible images that are suitable for export.

Some might say, "oh, but the fancy hardware-based H.264 decoder in the GPU has already done that part." But keep in mind that's only relevant to FFmpeg if the developers have taken the time to write custom code so that FFmpeg can directly communicate with the internal workings of the Raspberry's proprietary VideoCore GPU. I seriously doubt that is the case. So FFmpeg is probably forced to look at the stream completely "fresh" through its own eyes and do everything in software on the CPU.
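To illustrate why a P frame can't simply be exported on its own, here's a deliberately over-simplified toy model – my own sketch, with nothing to do with real H.264 syntax – where each "frame" is just three pixel values and P frames store only deltas against the previously decoded frame:

```python
# Toy model of I vs. P frames. I frames carry a complete picture;
# P frames carry only differences against the previous decoded frame.
stream = [
    ("I", [10, 10, 10]),   # standalone picture
    ("P", [+1, 0, -2]),    # deltas only -- meaningless without frame 1
    ("P", [0, +3, 0]),     # deltas against frame 2
]

def decode(stream):
    frames, prev = [], None
    for kind, data in stream:
        if kind == "I":
            cur = list(data)   # complete on its own, like a JPEG
        else:
            # A P frame only becomes a picture by applying its deltas
            # to the frame decoded before it -- state must be carried.
            cur = [p + d for p, d in zip(prev, data)]
        frames.append(cur)
        prev = cur
    return frames
```

Calling `decode(stream)` yields `[[10, 10, 10], [11, 10, 8], [11, 13, 8]]` – the second and third pictures only exist because the decoder carried the earlier frames forward. That stateful reconstruction work (vastly more elaborate in real H.264, with motion vectors and B frames referencing both directions) is exactly what the CPU has to churn through before a single JPG can be written.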
What that person did in displaying 9 simultaneous videos (3 x 3) with a single Raspberry was truly impressive and I would have no clue how to do that myself. But when you dig into the details, it's not quite what it seems. Here's what I mean. In his final update on the link you shared, he reported these stats on a non-overclocked Raspberry:
1x1 1080p 50 fps
2x2 1080p 12 fps but screen is blank
3x3 640x480 32 fps
Now let's consider for a moment how the "video wall" compares to my tutorial. In other words, how many pixels can each one pump out each second? As you know, my tutorial cranks out 60 frames per second at 1080p. So here's the math:
MY TUTORIAL BUILD – 1920 x 1080 x 1 @ 60 FPS:
1920 x 1080 = 2,073,600 pixels per frame. 2,073,600 x 60 FPS = 124,416,000 pixels per second
3 X 3 VIDEO WALL – 640 x 480 x 9 @ 32 FPS:
640 x 480 = 307,200. 307,200 x 9 = 2,764,800 pixels per frame. 2,764,800 x 32 FPS = 88,473,600 pixels per second
That means my customized build of FFmpeg / mpv is actually about 40% faster. It's pumping out 36 MILLION more pixels per second than the video wall!
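Here's that same pixel arithmetic as a few lines of Python, for anyone who wants to double-check it:

```python
# Pixel-throughput comparison: my tutorial build vs. the 3x3 video wall
tutorial = 1920 * 1080 * 1 * 60    # one 1080p stream at 60 FPS
wall     = 640 * 480 * 9 * 32      # nine 640x480 streams at 32 FPS

print(tutorial)                          # 124416000 pixels per second
print(wall)                              # 88473600 pixels per second
print(tutorial - wall)                   # 35942400 -- the "36 million" gap
print(round((tutorial / wall - 1) * 100, 1))   # 40.6 -> "about 40% faster"
```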
That's why I maintain that if anyone wants to decode and display 8 simultaneous 1080p streams while also exporting JPEG frame grabs from each independent video, they definitely should be thinking of a system a lot closer to $3,500 than $35. Maybe a brand-new $2,000 system – but I can't imagine going much lower than that!