Malvineous
Posts: 59
Joined: Wed Mar 07, 2012 10:31 am
Contact: Website

Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Sat Jul 24, 2021 6:25 am

Hi all,

I'm trying to work out whether it's possible to get hardware-accelerated playback on an RPi4 with a recent 64-bit OS/kernel. It looks like there are about a million different ways this is supposed to work, but as yet I haven't come across one that actually does. Many of them use OMX or MMAL, which aren't available on 64-bit, and everything seems to be moving towards v4l2m2m - but if I'm wrong, please correct me.

What I want to do is to show four H264 video streams across two monitors. Can anyone confirm whether this is possible?

If I use ffplay normally to do this, I only get around 10 fps in each of the four videos, all four CPU cores are pegged at 100%, and I get thermal throttling, so it looks like it's doing software decoding.

This ffmpeg command is the only one that seems to allow me to read the video stream in real time, at around 40% CPU usage for a single stream:

Code:

$ ffmpeg -probesize 32 -codec:v h264_v4l2m2m -i udp://224.0.1.4:5004 -c:v rawvideo -f avi - | cat > /dev/null

  Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(progressive), 972x1296, 30 fps, 30 tbr, 90k tbn, 180k tbc
[h264_v4l2m2m @ 0x55a507f3d0] Using device /dev/video10
[h264_v4l2m2m @ 0x55a507f3d0] driver 'bcm2835-codec' on card 'bcm2835-codec-decode' in mplane mode
[h264_v4l2m2m @ 0x55a507f3d0] requesting formats: output=H264 capture=YU12
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (h264_v4l2m2m) -> rawvideo (native))
However, if I do the same thing with ffplay, it uses over 200% CPU and still only gives me maybe 20 fps:

Code:

$ ffplay -codec:v h264_v4l2m2m -probesize 32 udp://224.0.1.4:5004

[h264_v4l2m2m @ 0x7f4402a4c0] Using device /dev/video10
[h264_v4l2m2m @ 0x7f4402a4c0] driver 'bcm2835-codec' on card 'bcm2835-codec-decode' in mplane mode
[h264_v4l2m2m @ 0x7f4402a4c0] requesting formats: output=H264 capture=YU12
I think the problem is that the hardware decoder is outputting the pixel data in YV12/yuv420 format, while RGB is needed to draw on the screen. I'm guessing it's the software conversion between the two colour spaces that's causing the slow framerate.

Is there a way to tell the hardware decoder to output in RGB, or to get a hardware accelerated method for drawing YUV420 pixel data?
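In case it helps, this is how I've been checking what the decoder will consume and produce (v4l2-ctl is in the v4l-utils package; /dev/video10 is the device the ffmpeg log above reports):

Code:

v4l2-ctl -d /dev/video10 --list-formats-out   # compressed (input) side
v4l2-ctl -d /dev/video10 --list-formats       # decoded (output) side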

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11645
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Sat Jul 24, 2021 10:32 am

The decoder currently only supports producing YU12, YV12, NV12, NV21, and RGB565.
H264 (and H263, MPEG4, MPEG2, VC-1, and MJPEG) all encode YUV format data, so any conversion to RGB is a secondary step. RGB565 will look quite blocky, so is best avoided.

The hardware on the Pi4 will quite happily render YUV planes direct to the display, and the 3D GL block can convert them to textures. It's been supported by the Linux kernel for a good number of years, but I don't believe ffmpeg/ffplay have been updated to use the efficient paths. x86 platforms do it with pure grunt, and optimisations for lower-powered systems haven't always followed.
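If you want to see what the display hardware will accept directly, modetest from libdrm should list each plane and the formats it takes - something like this, run from a console rather than under X:

Code:

modetest -M vc4 -p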
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

Malvineous
Posts: 59
Joined: Wed Mar 07, 2012 10:31 am
Contact: Website

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Sat Jul 24, 2021 3:35 pm

Many thanks for the info. I found your post about listing M2M devices, which was very interesting. You mentioned there that /dev/video10 is the hardware decoder and 11 is the encoder. I followed your instructions and found the RGB565 output as you mentioned above, but I couldn't work out how to get ffmpeg to select it to see whether it made any difference performance-wise [EDIT2: figured it out, see below]. It looks like the format may be selected automatically based on the output, but if I add "-pix_fmt rgba" or similar to the end of the command line, it just tells me "[swscaler @ 0x5598d90f20] No accelerated colorspace conversion found from yuv420p to rgba", which suggests it's converting the pixel data in software rather than configuring the decoder to produce a different pixel format.

I also see /dev/video12 and 13, which support a lot more pixel formats, but I'm not sure what they're for. I was hoping they might be hardware-accelerated colourspace converters, given the wide variety of pixel formats shown by the commands from your post (none of them appear to be codecs), but from your message here it sounds like that's not the case. What are /dev/video12 and 13 for?

When you say the 3D GL block can render YUV planes, is it as simple as using a media player that can render to an OpenGL surface? I'm not even sure I have 3D acceleration available, as running glxgears at 1920x1200 only manages 35 fps at 245% CPU use. Maybe that's expected for the hardware? I assumed CPU use would be lower and the framerate higher, but I have no idea what kind of performance to expect. All the docs I've seen say to just install Mesa, but surely you still have to configure something to get hardware 3D - I always thought Mesa provided a software rendering implementation of OpenGL.

EDIT: I think I've figured out OpenGL. I had to add "dtoverlay=vc4-fkms-v3d" to /boot/config.txt. Without it, the vc4 kernel module wasn't being loaded and the Xorg log complained that /dev/dri/card0 didn't exist; "glxinfo | grep Broadcom" also returned nothing. Adding this line also got both HDMI displays working, whereas before I got /dev/fb0 and /dev/fb1 and had to configure Xorg manually to use both displays. Now the vc4 kernel module is loaded and in use, glxinfo reports Broadcom as the vendor, /dev/dri/card[01] exist, and Xorg reports using them. So it looks like all I need to solve is the colour conversion issue and it should all work.
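For anyone else checking the same things, these are roughly the commands I used to confirm it (glxinfo comes from the mesa-utils package):

Code:

lsmod | grep vc4          # is the vc4 module loaded?
ls /dev/dri/              # card0/card1 should exist
glxinfo | grep -i vendor  # should report Broadcom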

EDIT2: Worked out how to select the output pixel format: just add "-pix_fmt X" to the end of the ffmpeg command line, where X is yuv420p, nv12, nv21 or rgb565, for YU12, NV12, NV21 or RGBP respectively. I couldn't work out which pix_fmt corresponds to YV12 - it seems ffmpeg may not support it, but it's the same as YU12 with the colour planes swapped, so no big deal. It turns out RGB565 performs the worst of them all (and the colours are very wrong). The default YU12 is the best, even though it takes close to 90% CPU to convert to RGB for display. So the hunt is on for an OpenGL solution, I guess...
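For anyone following along, the full command ends up looking something like this (nv12 as an example):

Code:

ffmpeg -probesize 32 -c:v h264_v4l2m2m -i udp://224.0.1.4:5004 -pix_fmt nv12 -c:v rawvideo -f avi - | cat > /dev/null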

Pouuet
Posts: 1
Joined: Wed Jul 28, 2021 1:47 pm

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Mon Aug 02, 2021 9:22 am

Sharing my tests here too.

==
Hardware: RPi 4 (4GB RAM), dual HDMI screens at 1080p

Software: latest stable Buildroot release, kernel branch pi-5.10.y from the RPi GitHub repo, plus the latest rpi-firmware main branch.
Mainline FFmpeg, SDL2, Mesa 3D, libdrm.

(FFmpeg configured and built with everything needed for RPi H264 hardware decode support.)
No specific tweaks or modifications.

gpu_mem_256=100
gpu_mem_512=100
gpu_mem_1024=100

dtoverlay=vc4-kms-v3d

==
Playing a 1080p H264 video:
omxplayer: 0 to 1% CPU usage, smooth

FFplay with h264_v4l2m2m: 25% CPU (100% of one core) and more, dropped frames, not smooth
FFplay with software decode: even higher CPU usage, but smooth
h264_mmal: errors

Custom test app based on libav/SDL2: same results as FFplay, for both h264_v4l2m2m and software decode.

Trying to play two smaller-resolution videos with h264_v4l2m2m quickly triggers a kernel oops around videobuf2-core.c (with two 1080p videos, the oops is immediate).


Still trying to figure out how to get H264 hardware decode/playback working correctly with FFplay or a custom app.
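In case the trace is useful to anyone, I'm capturing the oops from the kernel log while reproducing it, with something like:

Code:

dmesg -w | tee videobuf2-oops.log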
Last edited by Pouuet on Wed Aug 11, 2021 9:12 pm, edited 1 time in total.

Malvineous
Posts: 59
Joined: Wed Mar 07, 2012 10:31 am
Contact: Website

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Tue Aug 03, 2021 2:21 pm

That's interesting that you're getting a kernel oops. I currently have two videos playing on a dual-HDMI Pi4 (both on the same screen; ultimately I want four videos across both screens, two on each) and I haven't yet managed to get the Pi to stay up for a full 24 hours. It seems to reboot randomly around once a day. I have the watchdog timer enabled, so I guess it could be a kernel oops followed by the watchdog resetting it.

I'm also noticing that sometimes the video just stops decoding - ffplay goes from 75% CPU per video to 1% CPU and there are no more screen updates. It's as if the codec just stops returning frames. Running "killall ffmpeg" causes my systemd scripts to restart both ffmpeg and ffplay, and then everything starts working again.

So it looks like there could be some bugs in the hardware H264 decoder, but regardless, something seems to cause the Pi to reboot after a few hours of hardware decoding. With only two videos playing at the same time I'm not seeing thermal throttling, so I don't think it's that, but I'll see if I can hook up a serial console to catch any messages when it crashes.
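For what it's worth, this is how I'm checking for throttling (throttled=0x0 means no throttling events):

Code:

vcgencmd get_throttled
vcgencmd measure_temp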

Malvineous
Posts: 59
Joined: Wed Mar 07, 2012 10:31 am
Contact: Website

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Wed Aug 04, 2021 2:47 pm

I've done some more experimenting. It seems that the mystical omxplayer is able to play H264 video with close to 0% CPU usage, but it only works on a 32-bit OS; on 64-bit, the lowest I can get is 72% of one CPU core.
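For reference, the 32-bit baseline I'm comparing against is simply:

Code:

omxplayer testfile.h264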

First I tried mplayer, as I know it has an OpenGL output method, so I hoped it could do hardware-accelerated colourspace conversion. Unfortunately mplayer doesn't support V4L2M2M, so I did the hardware H264 decode with ffmpeg and piped the result to mplayer using its OpenGL output, so that the YUV conversion would happen on the GPU:

Code:

ffmpeg -c:v h264_v4l2m2m -i testfile.h264 -c:v rawvideo -f avi - | mplayer -vo gl -
Unfortunately this resulted in ffmpeg at 57% CPU and mplayer at 44% CPU, so almost 100% of one CPU core per video. If I tweaked mplayer's OpenGL setting to choose different YUV conversion algorithms I could get it to use more CPU, but not less.

I then saw in the omxplayer git README that it has been deprecated in favour of VLC, so I tried VLC:

Code:

cvlc --no-video-title-show --no-mouse-events --no-audio -A dummy testfile.h264
This resulted in 75% CPU for one video, so slightly better than ffmpeg+mplayer but still nowhere near the famed omxplayer level of performance.

Going back to the basic ffplay:

Code:

ffplay testfile.h264
This used 77% CPU, so only slightly worse than VLC - however it's not using hardware acceleration. Enabling that:

Code:

ffplay -codec:v h264_v4l2m2m testfile.h264
This dropped CPU usage down to 72% - the lowest yet on a 64-bit Pi, but still miles behind what the hardware is supposedly capable of.

So by the looks of things, as of mid-2021, a 32-bit OS with omxplayer is still the only viable option for fully hardware-accelerated video playback?

egnor
Posts: 11
Joined: Fri Aug 06, 2021 6:04 am

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Wed Aug 11, 2021 12:34 am

Malvineous wrote:
Wed Aug 04, 2021 2:47 pm
So by the looks of things, as of mid-2021, a 32-bit OS with omxplayer is still the only viable option for fully hardware-accelerated video playback?
I don't know about ffmpeg/ffplay specifically, but using "real" KMS I have been able to use gstreamer (with v4l2h264dec and kmssink) to do hardware-accelerated video playback, getting substantially better performance than you observe with the other options (though not as good as omxplayer). However, kmssink cannot coexist with the X windows desktop.
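The pipeline was something like this (from memory, so treat it as a sketch rather than a tested command):

Code:

gst-launch-1.0 filesrc location=testfile.h264 ! h264parse ! v4l2h264dec ! kmssink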

cleverca22
Posts: 4390
Joined: Sat Aug 18, 2012 2:33 pm

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Wed Aug 11, 2021 1:18 am

egnor wrote:
Wed Aug 11, 2021 12:34 am
However, kmssink cannot coexist with the X windows desktop.
there is something in the DRM api, where you get a magic token by opening the DRM node as a client
then you pass that token to the master (Xorg), and the master authorizes you to do certain things

then you can directly use the DRM api to render into an X11 window
2021-08-03 05:48:40< pq> Then you need to ask for a DRM lease on the HDMI connector+CRTC from your normal desktop's display server.
2021-08-03 05:52:49< pq> The point of DRM master concept is that only one program at a time can control the display hardware of one DRM device (gfx card)). DRM leases are way for that program to give some hardware pieces to another program to control, so that they can work simultaneously if they are lucky enough.
2021-08-03 05:58:52< pq> are you talking to a Wayland or a X11 display server that the normal desktop runs with?
2021-08-03 05:59:14< pq> or both?
2021-08-03 05:59:17< meatloaf> i've tried with both
2021-08-03 05:59:35< pq> each has a special protocol interface for asking of a DRM lease
2021-08-03 05:59:44< pq> *for a DRM lease
2021-08-03 06:00:02< meatloaf> is there any documentation of how to set it up?
2021-08-03 06:00:31< pq> the Wayland interface is being developed at https://gitlab.freedesktop.org/wayland/ ... equests/67
hmmm, and that seems to only be for getting exclusive control of the entire display, no compositing....

egnor
Posts: 11
Joined: Fri Aug 06, 2021 6:04 am

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Wed Aug 11, 2021 6:04 am

cleverca22 wrote:
Wed Aug 11, 2021 1:18 am
there is something in the DRM api, where you get a magic token by opening the DRM node as a client
then you pass that token to the master (Xorg), and the master authorizes you to do certain things

then you can directly use the DRM api to render into an X11 window
...
hmmm, and that seems to only be for getting exclusive control of the entire display, no compositing....
I don't know the details at all, but I *think* under X11 you'd use DRI to get access to DRM buffers in a cooperative way, and then from there you could probably use V4L2 and get zero-copy video going to an X window.

In fact, for all I know, this is already possible, maybe using VA-API (which will perhaps wrap DRI/DRM and use V4L2)? I am not knowledgeable enough to find the appropriate magic gstreamer (or other) incantation, though. And I certainly haven't tried any of this in 64-bit.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11645
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Wed Aug 11, 2021 12:37 pm

egnor wrote:
Wed Aug 11, 2021 6:04 am
I don't know the details at all, but I *think* under X11 you'd use DRI to get access to DRM buffers in a cooperative way, and then from there you could probably use V4L2 and get zero-copy video going to an X window.
No. DRI only allows a single master at a time. X will be that, and whilst you can lease out a full crtc (display), that evicts X from it.
To compose into an X window you need to use GL, otherwise things go wrong if you drag another window over the top of your decoded video.
egnor wrote:
In fact, for all I know, this is already possible, maybe using VA-API (which will perhaps wrap DRI/DRM and use V4L2)? I am not knowledgeable enough to find the appropriate magic gstreamer (or other) incantation, though. And I certainly haven't tried any of this in 64-bit.
VA-API is predominantly an alternate decode API. There are ways to map buffers from it (or from V4L2) to be used with DRM or GL.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

cleverca22
Posts: 4390
Joined: Sat Aug 18, 2012 2:33 pm

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Wed Aug 11, 2021 12:49 pm

6by9 wrote:
Wed Aug 11, 2021 12:37 pm
No. DRI only allows a single master at a time. X will be that, and whilst you can lease out a full crtc (display), that evicts X from it.
To compose into an X window you need to use GL, otherwise things go wrong if you drag another window over the top of your decoded video.
how does that deal with a YUV layer coming out of the video decoder then? I've heard that X11 only deals with RGB
does GL transform it into RGB? and does that happen within the video player's GL or the X11 compositor's GL?

egnor
Posts: 11
Joined: Fri Aug 06, 2021 6:04 am

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Wed Aug 11, 2021 3:21 pm

6by9 wrote:
Wed Aug 11, 2021 12:37 pm
No. DRI only allows a single master at a time. X will be that, and whilst you can lease out a full crtc (display), that evicts X from it.
To compose into an X window you need to use GL, otherwise things go wrong if you drag another window over the top of your decoded video.
I *think* you *may* be confusing DRM (the kernel interface, which is one-master-at-a-time) with DRI (the X11 system, which coordinates access from multiple clients to direct rendering).

DRI is normally used for efficient OpenGL under X11, not video playback, but since DRI is fundamentally about ways to set up DRM buffers and share them between X11 server and clients, I was wondering if there was some way to use that to avoid the whole single-master issue while retaining hardware decoding and rendering performance.

I *believe* VA-API, which layers on top of these various other systems, is built to do exactly that when running under X11. But I am well out of my depth here.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11645
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Thu Aug 12, 2021 8:21 am

Multiple very similar acronyms :-(
And seeing as the kernel reports the state via /sys/kernel/debug/dri/0/state, I'm not sure there is a real difference. The same goes for KMS vs DRM, as they're incredibly closely linked (mode setting vs putting planes on those chosen modes).

From what I know, X never uses more than 2 planes on a crtc - the primary and cursor ones. Many platforms (predominantly x86) support very few planes (e.g. the i915 on my laptop here has just primary, cursor, and a sprite), so it's typically SoCs that support overlay planes, and the desktop hasn't really been optimised for them.

https://en.wikipedia.org/wiki/Direct_Re ... astructure
Nothing prevents DRI from being used to implement accelerated 2D direct rendering within an X client.[3] Simply no one has had the need to do so because the 2D indirect rendering performance was good enough.
There is an efficient way to pass dmabuf video frames in via eglCreateImageKHR. See https://github.com/6by9/drm_mmal/blob/x11/drm_mmal.c for a (potentially bit-rotted) example. The vc6 v3d has an input block called the Texture Formatter Unit (TFU) which can happily convert RGB or YUV into the texture format (UIF) used by the rest of the 3D pipeline.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

cleverca22
Posts: 4390
Joined: Sat Aug 18, 2012 2:33 pm

Re: Is hardware accelerated playback possible with v4l2m2m and ffmpeg/ffplay?

Thu Aug 12, 2021 1:12 pm

6by9 wrote:
Thu Aug 12, 2021 8:21 am
From what I know, X never uses more than 2 planes from a crtc - the primary and cursor ones. Many platforms (predominantly x86) don't support more than 2 planes (eg the i915 on my laptop here has primary, cursor, and a sprite), so it's typically SoCs that support overlay planes, and desktop hasn't overly been optimised for that.
I think the old XVideo extension bumped that up to 3 planes

a hard max of one XVideo plane, which would be composited in with a chroma-key on the base layer, so menus and the mouse cursor can render over the XVideo plane
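If anyone wants to see what Xv actually exposes on their setup, xvinfo (from x11-utils) dumps the adaptors and their formats:

Code:

xvinfo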
