MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Stressing USB3 affects H264 encoding performance (Update).

Thu Apr 01, 2021 7:14 am

Hi.

Wondering if somebody may be able to explain the Pi bus architecture to me. In particular, how LAN, USB3<--->memory (DMA), and GPU (H264 encoder)<--->memory access are interleaved on the various bus(es) within a Pi 4B. It's a USB3 camera providing images, compress on GPU, then stream over wired LAN.

Are there any publicly available documents (or information that anybody could kindly provide) that specify/document what the timings/limits are?

Are there any bus interleaving/timing policies implemented on the Pi that can be set?

Thanks in advance.
Last edited by MarkDarcy on Tue Apr 20, 2021 9:36 am, edited 1 time in total.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance.

Tue Apr 20, 2021 9:34 am

As nobody was able to offer any information given the question as originally asked, I have provided more detail regarding the usage scenario. I hope the extra information proves useful. I may not be able to provide any further details as the project is commercially related; it will depend upon what I'm asked. Again, if anybody can offer any explanation I would be grateful.

Hardware is a Pi 4B, Buster Lite (i.e., headless), HDMI ports disabled, no keyboard/mouse, 1 USB-3 camera, 1 USB-2 serial cable (tty), not overclocked (but force_turbo is 1), wireless/bluetooth disabled, wired LAN connected.

Code: Select all

$ uname -a
Linux raspberrypi 5.4.51-v7l+ #1333 SMP Mon Aug 10 16:51:40 BST 2020 armv7l GNU/Linux
I have developed a technique in C of invoking multiple H.264 encoders on the GPU and performing parallel encoding. A test program confirmed that N encoders each fed frames at 30fps can encode at 30*N frames/sec. This was done by creating N encoder instances, then taking a physical 30fps stream from a camera (V4L2 mmap'd buffers, 1080p packed YUV422 ⇛ 995 Mbit/sec) and submitting each physical frame to each of the N encoder instances.
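
In outline, the capture/fan-out loop looks something like the sketch below (heavily simplified; all V4L2/OMX setup is omitted and submit_frame_to_encoder() is a hypothetical stand-in for the real per-encoder hand-off):

Code: Select all

/* Sketch only: assumes the V4L2 device is already configured (S_FMT,
 * REQBUFS with V4L2_MEMORY_MMAP, buffers mmap()ed and queued, STREAMON). */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

#define NUM_ENCODERS 8

/* Hypothetical stub: hands one frame to encoder instance enc_idx. */
extern void submit_frame_to_encoder(int enc_idx, const void *data, size_t len);

int capture_and_fan_out(int cam_fd, void *mapped[])
{
    for (;;) {
        struct v4l2_buffer buf;
        memset(&buf, 0, sizeof(buf));
        buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;

        if (ioctl(cam_fd, VIDIOC_DQBUF, &buf) < 0)   /* wait for the next frame */
            return -1;

        /* One physical 30fps frame, handed to every encoder instance. */
        for (int i = 0; i < NUM_ENCODERS; i++)
            submit_frame_to_encoder(i, mapped[buf.index], buf.bytesused);

        if (ioctl(cam_fd, VIDIOC_QBUF, &buf) < 0)    /* recycle the V4L2 buffer */
            return -1;
    }
}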

I successfully ran 1080p at N == 8 (i.e., 8 encodings per frame ⇛ 240fps). The resulting encoded output was successfully streamed using a bespoke protocol (3.5 Mbps/encoder).

However, when I then attempted to stream from the camera at 30*N fps, feeding each physical frame to the encoders in a "round-robin" style so as not to stress any encoder over 30fps, the overall throughput dramatically fell after USB load exceeded around 850 Mbit. This 850 Mbit number was derived from the following tests:

Code: Select all

                 |   |  USB load (Mbit)  |     Encoding     |   Encoder    |
      Input      | N | per frame | total | Throughput (fps) | Input (Mbit) | Notes
-----------------+---+-----------+-------+------------------+--------------+-------
1080p   @  30fps | 1 |   33.18   |  995  |        30        |      995     | Ok
1080p   @  60fps | 2 |   33.18   | 1991  |        30        |      995     | Slow
 720p   @  60fps | 2 |   14.75   |  885  |       ~57.5      |      848     | Slow
640x480 @ 120fps | 4 |    4.92   |  590  |       120        |      590     | Ok
640x480 @ 240fps | 8 |    4.92   | 1180  |      ~175.5      |      863     | Slow
Note: image format packed YUV422 in all cases. Observed CPU usage during all runs was ~15%.

USB line speed tests confirmed I could stream the camera's limit of 1080p @ 90fps (packed YUV422 ⇛ 2.99 Gbit). Thus, running USB line speed at over 1 Gbit, or the Pi's ability to keep up, doesn't appear to be the problem.

The network output is only 3.5*N Mbps which, being numerically small compared to the 1 Gbit the PHY can run at, doesn't appear to be interfering with memory throughput.

No undervoltage warnings appear in syslog during encoding so it appears the GPU isn't momentarily halting/slowing down.

Having read up briefly on the Pi 4's DRAM bandwidth (e.g., this previous discussion) I understand that the "worst-case maximum" total memory bus activity is about 4GB/sec (32 Gbit). Using a naive per-frame memory lifetime model of:

Code: Select all

USB Read -> DMA Store -> RAM read -> GPU write -> encode -> GPU read -> NET write
(1 frame)   (1 frame)    (1 frame)   (1 frame)      ???     (3.5 Mbps)  (3.5 Mbps)
i.e., approximately 4 frame moves per frame processed, at 1080p (33.18 Mbit/frame) this gives a theoretical maximum of ~240fps if memory access was at saturation (and "???" was zero).
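
For example, taking the ~32 Gbit/sec worst-case figure and the 4-move model above as givens, the arithmetic is simply:

Code: Select all

/* Back-of-envelope check of the "4 frame moves" model above. */
#include <stdio.h>

int main(void)
{
    const double bus_bits_per_sec = 32e9;              /* ~4 GB/s SDRAM figure */
    const double bits_per_frame   = 1920 * 1080 * 16;  /* packed YUV422, 33.18 Mbit */
    const double moves_per_frame  = 4;                 /* USB -> DMA -> RAM -> GPU */

    printf("theoretical max: %.0f fps\n",
           bus_bits_per_sec / (bits_per_frame * moves_per_frame));  /* ~241 */
    return 0;
}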

Furthermore, if I take my original test program's confirmed throughput derived from 1080p @ 30fps and N == 8, the per-frame memory lifetime model was:

Code: Select all

USB Read -> DMA Store -> 8 x [[ RAM read -> GPU write -> encode -> GPU read -> NET write ]]
(1 frame)   (1 frame)           (1 frame)   (1 frame)      ???     (3.5 Mbps)  (3.5 Mbps)
which gave an actual memory bandwidth of (597 Mbit x 30 times/sec) + NET ⇛ ~18 Gbit. In addition to being well under the 32 Gbit practical limit, it also appears to show that the GPU is comfortably able to cope with handling I/O with respect to all encoders simultaneously while they're running, so neither GPU I/O nor memory bandwidth appear to be the problem.

And yet, it can't do 640x480 @ 240fps in the above table which has only around 4.71 GBit of memory-related bus utilisation; just 8% of the 32 Gbit limit. Regardless of camera frame rate or encoder instances, around 850 Mbit USB loading is the maximum.

So, does anybody know what the cause of the cliff at 850 Mbit USB load might be? Is there any per-peripheral "bus bandwidth reservation" policy? Is there any bus activity during GPU encoding (the "???" above)?

Thanks in advance.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11496
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Apr 20, 2021 11:38 am

A couple of issues that you may not be aware of.

- I'm amazed you can get 8 1080p30 streams encoding simultaneously, as the H264 block is specified for 1 1080p30 (level 4.0) stream.

- The front end of the encoder uses the ISP block to convert whatever format you care to throw at it into the internal YUV420 based format that the H264 blocks use, so another read of YUV422 and write of YUV420.

- UVC isn't as simple as you might hope. It is presented with USB Request Buffers (URBs) which contain fragments of the overall video frame, and the kernel has to memcpy (using the ARM cores) the video data from the URB to the V4L2 buffer. On a previous project I actually hit the limit where this memcpy was slow enough that it resulted in the USB subsystem running out of URBs and dropping USB3 packets, and this was on an x86 processor so not really underpowered. You may need to mess with cache flushing because the CPU is doing this memcpy.
Memory says it's the copy at https://elixir.bootlin.com/linux/latest ... eo.c#L1115 that's of note.

You don't say if you're using DMABUF or not for your V4L2 nodes, or even which API you're using for the encoder.
Dmabufs are supported on the V4L2 M2M encoder and avoid a memcpy from the UVC V4L2 buffer to the encoder buffer. You will need to allocate from the encoder and import within UVC, as the encoder requires physically contiguous buffers whilst UVC is more flexible (mainly due to that memcpy). In theory you could use MMAP on the encoder and USERPTR on UVC, but that causes even more headaches pinning pages and remapping page tables.
If using MMAL, then there are ways of using vcsm to import a contiguous dmabuf for use by the VPU, otherwise it will be doing a copy of the data to get it from ARM memory to GPU memory.
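
For what it's worth, the V4L2 side of that dmabuf sharing would look roughly like the sketch below (buffer types shown single-plane for brevity, the real encoder node may use the _MPLANE variants; format negotiation and error handling omitted):

Code: Select all

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

#define NBUFS 4

int share_buffers(int enc_fd, int uvc_fd, int dmabuf_fd[NBUFS])
{
    struct v4l2_requestbuffers req;

    /* 1. Allocate contiguous buffers on the encoder's input (OUTPUT) queue. */
    memset(&req, 0, sizeof(req));
    req.count  = NBUFS;
    req.type   = V4L2_BUF_TYPE_VIDEO_OUTPUT;
    req.memory = V4L2_MEMORY_MMAP;
    if (ioctl(enc_fd, VIDIOC_REQBUFS, &req) < 0)
        return -1;

    /* 2. Export each buffer as a dmabuf fd. */
    for (int i = 0; i < NBUFS; i++) {
        struct v4l2_exportbuffer exp;
        memset(&exp, 0, sizeof(exp));
        exp.type  = V4L2_BUF_TYPE_VIDEO_OUTPUT;
        exp.index = i;
        if (ioctl(enc_fd, VIDIOC_EXPBUF, &exp) < 0)
            return -1;
        dmabuf_fd[i] = exp.fd;
    }

    /* 3. Have UVC capture directly into those buffers. */
    memset(&req, 0, sizeof(req));
    req.count  = NBUFS;
    req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    req.memory = V4L2_MEMORY_DMABUF;
    if (ioctl(uvc_fd, VIDIOC_REQBUFS, &req) < 0)
        return -1;

    for (int i = 0; i < NBUFS; i++) {
        struct v4l2_buffer buf;
        memset(&buf, 0, sizeof(buf));
        buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_DMABUF;
        buf.index  = i;
        buf.m.fd   = dmabuf_fd[i];
        if (ioctl(uvc_fd, VIDIOC_QBUF, &buf) < 0)
            return -1;
    }
    return 0;
}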
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Apr 20, 2021 2:01 pm

Thanks for the suggestions.
- I'm amazed you can get 8 1080p30 streams encoding simultaneously, as the H264 block is specified for 1 1080p30 (level 4.0) stream.
As a result of our previous discussions on threading and deadlock, I was able to craft a very lightweight and highly-parallelised OpenMAX-based implementation. Luckily, it also turned out that the way the H.264 block is implemented via OpenMAX is sufficiently clean and modular that multiple encoders can be created and driven in parallel once deadlocking and other API inconsistencies have been taken care of in application code. This behaviour might be accidental but it's reliable and predictable.

"Slippery" multi-threaded algorithms are my thing. Always have been ;-)
You don't say if you're using DMABUF or not for your V4L2 nodes, or even which API you're using for the encoder.
V4L2 to the camera is opened, then mmap() is called with MAP_SHARED to allocate the buffers. On the OpenMAX side, I allocate buffers using vcos_malloc_aligned() with the OMX-suggested alignment set on the port. On each V4L2 frame de-queued I then memcpy() from the received MMAP buffer to the VCOS buffer and then call OMX_EmptyThisBuffer(). All standard OMX stuff. The copy+empty takes ~15ms per frame per encoder for 1080p packed YUV422. I'm not specifically implementing any DMA transfer logic at application level; I trust that V4L2 or OpenMAX uses it at its discretion.
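
For reference, the per-frame hand-off is essentially the following (simplified; omx_handle/omx_buf come from the usual component and buffer setup, e.g. OMX_UseBuffer over the vcos-allocated memory, which isn't shown):

Code: Select all

#include <string.h>
#include <IL/OMX_Core.h>
#include <IL/OMX_Component.h>

OMX_ERRORTYPE submit_frame(OMX_HANDLETYPE omx_handle,
                           OMX_BUFFERHEADERTYPE *omx_buf,
                           const void *v4l2_frame, size_t frame_len)
{
    if (frame_len > omx_buf->nAllocLen)
        return OMX_ErrorBadParameter;

    /* The copy that takes ~15 ms/frame per encoder at 1080p YUYV. */
    memcpy(omx_buf->pBuffer, v4l2_frame, frame_len);
    omx_buf->nOffset    = 0;
    omx_buf->nFilledLen = frame_len;
    omx_buf->nFlags     = OMX_BUFFERFLAG_ENDOFFRAME;

    return OMX_EmptyThisBuffer(omx_handle, omx_buf);
}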
- The front end of the encoder uses the ISP block to convert whatever format you care to throw at it into the internal YUV420 based format that the H264 blocks use, so another read of YUV422 and write of YUV420.
Understood. So, naive frame memory lifetime model would now look like this?

Code: Select all

USB Read -> DMA Store -> RAM read -> GPU write -> YUV422 Read -> YUV420 Write -> encode -> GPU read -> NET write
(1 frame)   (1 frame)    (1 frame)   (1 frame)     (1 frame)      (1 frame)        ???     (3.5 Mbps)  (3.5 Mbps)
Even at six frame moves per frame, 640x480 @ 240fps would only come up at about 7 Gbit bus traffic; comfortably below the 32 Gbit theoretical maximum. Additionally, the test program which read USB at 30fps but then cycled each frame through N encoders did not have any trouble driving the encoders with that amount of data.

In terms of USB loading, I have managed to receive USB at 3 Gbit and simultaneously transmit that raw data back out over the LAN at 1 GBit while buffering in RAM (i.e., LAN Tx time == 3 x USB receive time). In this case, V4L2 worked fine, no USB stalls, all OK. It's just when I try to go via GPU that this USB "throttle" appears to kick in.

Clutching at straws here, but are there any throttles on USB when the GPU is running? For example, an AXI bus arbitration policy between GPU and USB when both are simultaneously moving data around? In my particular use case, data will be arriving on USB while the GPU is encoding. It is not a serialised "read -> encode -> read -> encode -> ..." processing model; USB and GPU data activity will be interleaved on the bus. Might this be the cause of what is being observed?

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11496
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Apr 20, 2021 2:52 pm

You can run multiple encodes simultaneously, but I'm surprised that your overall throughput was significantly above about 120MPix/s in total (1080p60/Level 4.2 can be achieved if everything aligns).

My tot up of memory transactions is:
- USB write (URB)
- CPU read (URB) CPU write (V4L2 buffer) - uvcvideo memcpy
- CPU read (V4L2 buffer) CPU write (OMX buffer) - app memcpy
- DMA read (OMX buffer) DMA write (gpu_mem buffer) - ILCS/VCHI
- ISP read (gpu_mem buffer) ISP write (internal video_encoder buffer) - video_encode
- H264 read (new frame and reference frame), H264 write (reference frame) - video_encode
- H264 write (encoded data) - video_encode
- DMA read (encoded data gpu_mem) DMA write (encoded data ARM mem) - ILCS/VCHI
- CPU read (encoded data) and does something with it.

OK it's spread across multiple hardware blocks, but I make that 6 reads of each raw frame, and 6 writes of each raw frame. 4 of each are of YUV422, with the internal video_encode ones being YUV420.
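
Purely as a rough illustration (ignoring the encoded bitstream and treating the internal column format as plain YUV420 for sizing), that tally works out to roughly:

Code: Select all

#include <stdio.h>

int main(void)
{
    const double w = 1920, h = 1080, fps = 30;
    /* 4 reads + 4 writes at YUV422 (2 bytes/px), 2 reads + 2 writes at
     * YUV420 (1.5 bytes/px) -> ~22 bytes moved per pixel per frame. */
    const double bytes_per_frame = (8 * 2.0 + 4 * 1.5) * w * h;

    printf("~%.1f Gbit/s of SDRAM traffic per 1080p30 stream\n",
           bytes_per_frame * fps * 8 / 1e9);   /* ~10.9 */
    return 0;
}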
DMAbufs (not available with IL) would allow you to get rid of the app memcpy and the DMA copy in ILCS (IL Component Service)/VCHI (VideoCore Host Interface). They are the Linux kernel thing for sharing raw memory allocations between multiple kernel subsystems in a zero copy manner.

All peripherals hang on largely the same AXI bus. There are arbiter priorities that can be tweaked if really needed, but it's not something that you can do on a generic system. It also becomes a real balancing act over setting the AXI priorities at which different peripherals panic, what triggers those panics, and a load of other stuff. I won't claim it is totally optimally tuned, but it covers most use cases.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Apr 21, 2021 10:33 am

You can run multiple encodes simultaneously, ... (1080p60/Level 4.2 can be achieved if everything aligns).
It's good to hear that it is possible in principle, as this was one of the key features I have been aiming for. The 12 memory transactions you detailed also help a great deal in understanding if and where optimisations can be made. Actually, in your explanation you said
- ISP read (gpu_mem buffer) ISP write (internal video_encoder buffer) - video_encode
- H264 read (new frame and reference frame), H264 write (reference frame) - video_encode
These are normal memcpy()'s and not DMA transfers, right?

I've been working under the assumption that all 12 frame copy operations will be happening at "n-bits per clock" on the AXI bus. However, this will probably only happen for DMA transfers. If not all memory transactions within the ISP block are DMA, that would obviously cause a CPU-side slow-down.

Would you happen to know off the top of your head what the time ratio is between a DMA transfer and the C-runtime memcpy() implementation?

However, what might this have to do with USB being active? When USB transfers are not happening during encoding, the slowdown doesn't occur...
... I make that 6 reads of each raw frame, and 6 writes of each raw frame. 4 of each are of YUV422, with the internal video_encode ones being YUV420.
Incidentally, where in your memory transaction list is the YUV422 -> YUV420 conversion performed? If my application can supply YUV420 directly to the ISP block (e.g., by sourcing a suitable camera), will it save a couple of copy operations?
There are arbiter priorities that can be tweaked if really needed, but it's not something that you can do on a generic system.
Understood.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11496
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Apr 21, 2021 11:11 am

MarkDarcy wrote:
Wed Apr 21, 2021 10:33 am
- ISP read (gpu_mem buffer) ISP write (internal video_encoder buffer) - video_encode
- H264 read (new frame and reference frame), H264 write (reference frame) - video_encode
These are normal memcpy()'s and not DMA transfers, right?
The ISP (Image Sensor Pipeline) and H264 encoder are hardware blocks, so AXI masters. They are pretty optimised to be making efficient AXI burst requests.
MarkDarcy wrote:I've been working under the assumption that all 12 frame copy operations will be happening at "n-bits per clock" on the AXI bus. However, this will probably only happen for DMA transfers. If not all memory transactions within the ISP block are DMA, that would obviously cause a CPU-side slow-down.

Would you happen to know off the top of your head what the time ratio is between a DMA transfer and the C-runtime memcpy() implementation?
Sorry, no idea.
MarkDarcy wrote:However, what might this have to do with USB being active? When USB transfers are not happening during encoding, the slowdown doesn't occur...
... I make that 6 reads of each raw frame, and 6 writes of each raw frame. 4 of each are of YUV422, with the internal video_encode ones being YUV420.
Incidentally, where in your memory transaction list is the YUV422 -> YUV420 conversion performed? If my application can supply YUV420 directly to the ISP block (e.g., by sourcing a suitable camera), will it save a couple of copy operations?
It's done in the ISP as part of the video_encode component.
The H264 blocks need the frames in a weird column format(*), and also a second 2x2 subsampled version of the image to do a coarse motion search on. The ISP can produce both these images efficiently, and there isn't an easy way to configure the outside world to produce and pass in this pair of images simultaneously.

(*) If you divide your image into 128 column wide strips with both the luma and respective U/V (NV12) interleaved chroma, and then glue these strips together end on end, that's about right. The subsampled image is either planar or a similar column format but 32 pixels wide. Cleverer people than me designed it for optimised SDRAM access patterns.
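
Purely as illustrative arithmetic based on that description (ignoring whatever alignment/padding the real format adds, so this is not a definition of the layout), the full-resolution buffer size works out to something like:

Code: Select all

#include <stddef.h>

size_t approx_column_format_size(unsigned width, unsigned height)
{
    const unsigned strip_w = 128;
    const unsigned nstrips = (width + strip_w - 1) / strip_w;

    size_t luma   = (size_t)nstrips * strip_w * height;       /* Y, 128-wide strips */
    size_t chroma = (size_t)nstrips * strip_w * (height / 2); /* interleaved UV (NV12-style) */
    return luma + chroma;
}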
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Fri Apr 23, 2021 2:08 pm

Thanks for your reply. There are some other things I wanted to ask but they rely on me doing some more tests and unfortunately I wasn't able to grab the time today. I'll ask again next week if that's OK.

There was one quick thing...
You can run multiple encodes simultaneously, ... (1080p60/Level 4.2 can be achieved if everything aligns).
Is this theoretical maximum confirmed via a USB path or is it only confirmed for the CSI path? It may be possible with the Pi camera but doesn't that feed data via CSI directly into ISP so probably saving four of the 12 copy operations in our frame memory lifetime model...?

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11496
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Fri Apr 23, 2021 2:22 pm

MarkDarcy wrote:
Fri Apr 23, 2021 2:08 pm
There was one quick thing...
You can run multiple encodes simultaneously, ... (1080p60/Level 4.2 can be achieved if everything aligns).
Is this theoretical maximum confirmed via a USB path or is it only confirmed for the CSI path? It may be possible with the Pi camera but doesn't that feed data via CSI directly into ISP so probably saving four of the 12 copy operations in our frame memory lifetime model...?
Only with the legacy camera stack.

When using the legacy camera stack with MMAL (MMAL_ENCODING_OPAQUE) or IL tunnels, the ISP processing step of taking the Bayer image (that has been received over CSI2 and stored in SDRAM) also produces the two versions of the image that the H264 block requires. You therefore only have:
- CSI2 rx: write Bayer image
- ISP: read Bayer image
- ISP: write pair of YUV420 images
- H264 read pair of YUV420 images and reference frame
- H264 write reference frame.
- H264 write encoded bitstream

Bayer is generally only 10bpp (12bpp on HQ camera) and single plane, so w*h*10 bits instead of the w*h*16 bits of your YUV422 image, so that saves some SDRAM bandwidth, and not having to copy the images about is a huge saving.
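
For a sense of scale, per read or write pass of one 1080p30 stream:

Code: Select all

#include <stdio.h>

int main(void)
{
    const double px_per_sec = 1920.0 * 1080 * 30;

    printf("Bayer 10bpp : %.0f Mbit/s\n", px_per_sec * 10 / 1e6);  /* ~622 */
    printf("YUV422 16bpp: %.0f Mbit/s\n", px_per_sec * 16 / 1e6);  /* ~995 */
    return 0;
}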

1080p50 YUYV (422) has been tested from a TC358743 HDMI to CSI2 bridge chip, and I believe that did keep up on Pi4. I don't remember trying 1080p60 as that needs the 4 lane version of the bridge board (which I have, but have never tried in that mode).
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

cleverca22
Posts: 3979
Joined: Sat Aug 18, 2012 2:33 pm

Re: Stressing USB3 affects H264 encoding performance (Update).

Fri Apr 23, 2021 3:25 pm

MarkDarcy wrote:
Thu Apr 01, 2021 7:14 am
Wondering if somebody may be able to explain the Pi bus architecture to me. In particular, how LAN, USB3<--->memory (DMA), and GPU (H264 encoder)<--->memory access are interleaved on the various bus(es) within a Pi 4B. It's a USB3 camera providing images, compress on GPU, then stream over wired LAN.

Code: Select all

root@pi400:~# grep axi /boot/config.txt 
dtparam=axiperf
root@pi400:~# cd /sys/kernel/debug/raspberrypi_axi_monitor
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat VPU/data 
     Bus   |    Atrans    Atwait      AMax    Wtrans    Wtwait      WMax    Rtrans    Rtwait      RMax
======================================================================================================
 VPU1_D_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU0_D_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU1_I_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU0_I_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 SYSTEM_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
  L2_FLUSH |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    DMA_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU1_D_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU0_D_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU1_I_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 VPU0_I_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 SYSTEM_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    L2_OUT |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    DMA_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
     SDRAM |        0K        0K        0K        0K        0K        0K        0K        0K        0K
     L2_IN |        0K        0K        0K        0K        0K        0K        0K        0K        0K
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/data 
     Bus   |    Atrans    Atwait      AMax    Wtrans    Wtwait      WMax    Rtrans    Rtwait      RMax
======================================================================================================
    DMA_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
     TRANS |        0K        0K        0K        0K        0K        0K        0K        0K        0K
      JPEG |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 SYSTEM_UC |        1K        0K        0K        0K        0K        0K        1K        0K        0K
    DMA_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
 SYSTEM_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    CCP2TX |      128K        0K        0K        0K        0K        0K     2063K        0K        0K
   MPHI_RX |        0K        0K        0K        0K        0K        0K        0K        0K        0K
   MPHI_TX |        0K        0K        0K        0K        0K        0K        0K        0K        0K
       HVS |        5K        0K        0K        1K        0K        0K        4K        0K        0K
      H264 |        1K        0K        0K        2K        0K        0K        1K        0K        0K
       ISP |        0K        0K        0K        0K        0K        0K        0K        0K        0K
       V3D |        0K        0K        0K        0K        0K        0K        0K        0K        0K
PERIPHERAL |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    CPU_UC |        0K        0K        0K        0K        0K        0K        0K        0K        0K
    CPU_L2 |        0K        0K        0K        0K        0K        0K        0K        0K        0K
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# ls System/
data  enable  filter  sample_time
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/enable 
65535
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/filter 
0
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/sample_time 
100
this driver might also be of some use

In its default config, it's reporting transaction counters per destination, measuring each one for 100ms

Code: Select all

root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# echo 11 > System/filter 
root@pi400:/sys/kernel/debug/raspberrypi_axi_monitor# cat System/data 

Monitoring transactions from ISP only
the filter file lets you limit what source increments the counters, so you can then see only reads/writes caused by the ISP for example

the enable file is a bit-mask to not count certain destinations, allowing the samples to update faster (it reads each destination for 100ms, so 16 destinations means 1600ms to update all)
I suspect the driver isn't working 100% correctly on bcm2711 though; the VPU counters aren't working right

the part where it may become useful, is that it can report how many reads/writes are having to wait because the bus was too busy

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Apr 28, 2021 9:55 am

Apologies for the delay in getting back.
cleverca22 wrote: this driver might also be of some use
...
the part where it may become useful, is that it can report how many reads/writes are having to wait because the bus was too busy
Hi, cleverca22, and thanks very much for the tip. Much appreciated. I'm sure it will show something, good or bad...!
6by9 wrote: Bayer is generally only 10bpp (12bpp on HQ camera) and single plane, so w*h*10 bits instead of the w*h*16 bits of your YUV422 image, so that saves some SDRAM bandwidth, and not having to copy the images about is a huge saving.
Understood. I was aware of the heaviness of YUV422 from the start but the choice of camera dictates it. The only formats lighter than this that are supported as input to video_encode are planar formats and unfortunately the cameras available to me don't output planar YUV, only packed.

As you mention it, I have a 12-bit Bayer camera. However, even though a 12-bit bayer format appears in the OMX header as a vendor extension format (0x7F000004), it is not available when enumerating the port formats acceptable to the input port of video_encode:

Code: Select all

[0x00000014] OMX_COLOR_FormatYUV420PackedPlanar
[0x7F000007] OMX_COLOR_FormatYVU420PackedPlanar
[0x00000027] OMX_COLOR_FormatYUV420PackedSemiPlanar
[0x7F000008] OMX_COLOR_FormatYVU420PackedSemiPlanar
[0x00000006] OMX_COLOR_Format16bitRGB565
[0x0000000C] OMX_COLOR_Format24bitBGR888
[0x0000000B] OMX_COLOR_Format24bitRGB888
[0x7F000001] OMX_COLOR_Format32bitABGR8888
[0x00000010] OMX_COLOR_Format32bitARGB8888
[0x00000019] OMX_COLOR_FormatYCbYCr
[0x0000001A] OMX_COLOR_FormatYCrYCb
[0x0000001B] OMX_COLOR_FormatCbYCrY
[0x0000001C] OMX_COLOR_FormatCrYCbY
[0x7F000003] OMX_COLOR_FormatYUVUV128
[0x00000017] OMX_COLOR_FormatYUV422PackedPlanar
[0x7F000005] OMX_COLOR_FormatBRCMEGL
By the way, there is a custom colour format the video_encode block supports called OMX_COLOR_FormatYUVUV128 (0x7F000003 in the above dump). Can you explain the layout of this format? Is it the special YUV format you mentioned with the full/reduced images packed together?

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11496
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Apr 28, 2021 5:14 pm

MarkDarcy wrote:
Wed Apr 28, 2021 9:55 am
6by9 wrote: Bayer is generally only 10bpp (12bpp on HQ camera) and single plane, so w*h*10 bits instead of the w*h*16 bits of your YUV422 image, so that saves some SDRAM bandwidth, and not having to copy the images about is a huge saving.
Understood. I was aware of the heaviness of YUV422 from the start but the choice of camera dictates it. The only formats lighter than this that are supported as input to video_encode are planar formats and unfortunately the cameras available to me don't output planar YUV, only packed.

As you mention it, I have a 12-bit Bayer camera. However, even though a 12-bit bayer format appears in the OMX header as a vendor extension format (0x7F000004), it is not available when enumerating the port formats acceptable to the input port of video_encode:
Bayer data normally involves quite significant additional image processing, eg white balance, lens shading, denoise, etc. That is what the full ISP component is there for. video_encode just happens to make use of the hardware block for a simple conversion.
MarkDarcy wrote:

Code: Select all

[0x00000014] OMX_COLOR_FormatYUV420PackedPlanar
[0x7F000007] OMX_COLOR_FormatYVU420PackedPlanar
[0x00000027] OMX_COLOR_FormatYUV420PackedSemiPlanar
[0x7F000008] OMX_COLOR_FormatYVU420PackedSemiPlanar
[0x00000006] OMX_COLOR_Format16bitRGB565
[0x0000000C] OMX_COLOR_Format24bitBGR888
[0x0000000B] OMX_COLOR_Format24bitRGB888
[0x7F000001] OMX_COLOR_Format32bitABGR8888
[0x00000010] OMX_COLOR_Format32bitARGB8888
[0x00000019] OMX_COLOR_FormatYCbYCr
[0x0000001A] OMX_COLOR_FormatYCrYCb
[0x0000001B] OMX_COLOR_FormatCbYCrY
[0x0000001C] OMX_COLOR_FormatCrYCbY
[0x7F000003] OMX_COLOR_FormatYUVUV128
[0x00000017] OMX_COLOR_FormatYUV422PackedPlanar
[0x7F000005] OMX_COLOR_FormatBRCMEGL
By the way, there is a custom colour format the video_encode block supports called OMX_COLOR_FormatYUVUV128 (0x7F000003 in the above dump). Can you explain the layout of this format? Is it the special YUV format you mentioned with the full/reduced images packed together?
OMX_COLOR_FormatYUVUV128 is just the full resolution version image in the column-based stripes.

It's only possible to pass the pair of images when using a suitable source component (I believe exclusively the camera), and tunneling with OMX, or using MMAL_ENCODING_OPAQUE.
It's not possible to create suitable buffers from the ARM.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed May 12, 2021 3:29 pm

Hi 6by9,

First, I must apologise for not dropping a line sooner. Something came up for about a week, and then I have been trying to develop a more reliable test environment.

While I have been "away", I managed to revise my test environment so that the encoding is now done on planar YUV420 as opposed to YUYV. This was done by using the 8-bit greyscale image received from the camera as the Y plane, appending two statically-prepared U/V planes set to "zero", and then submitting the result to OMX as planar YUV420 (a minimal sketch follows the list below). This had two effects:

  • It reduced the USB traffic by half compared to YUYV.
  • It reduced the traffic the encoder deals with by 25% compared to YUYV.
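
In outline, the packing is simply the following (illustrative sketch; in the real program the destination buffer is allocated once and the chroma planes are filled once up front, and 128 is the nominally neutral chroma value):

Code: Select all

#include <stdlib.h>
#include <string.h>

unsigned char *make_yuv420_from_grey(const unsigned char *grey,
                                     unsigned w, unsigned h)
{
    size_t y_size  = (size_t)w * h;
    size_t uv_size = y_size / 2;               /* U plane + V plane together */
    unsigned char *frame = malloc(y_size + uv_size);

    if (frame) {
        memcpy(frame, grey, y_size);           /* Y  = camera's greyscale data */
        memset(frame + y_size, 128, uv_size);  /* UV = constant "no colour" value */
    }
    return frame;
}
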
Here are the results of the new set of tests. Each test was designed to complete in 20 seconds (so 2400 frames at 120fps and 4800 frames at 240fps). I have included a summary of the old YUYV tests for comparison.

First the previous results for 640x480 YUYV:

Code: Select all

YUYV in (614400 bytes/frame) / YUYV enc (614400 bytes/frame): (640×480×16×11) = 54,067,200 bits/frame

     USB TX RATE (BIT/SEC)   |  ENCODE RATE (PIX/SEC)  | BUS LOAD: BIT/SEC |           NOTES
  ---------------------------+-------------------------+-------------------+---------------------------
  120fps (  589,824,000 bps) | 120fps (36,864,000 pix) | 6,496,064,000 bps | (stream network)
  240fps (1,179,648,000 bps) | 175fps (53,760,000 pix) | 9,469,760,000 bps | (stream network)

  Note: Bus load totals include encoded result (2 Mbps) over four stages (8 Mbps total, exc. network TX)

Next, there are the results for 640x480 8-bit greyscale/YUV420 hybrid:

Code: Select all

GREY in (307200 bytes/frame) / YUV420 enc (460800 bytes/frame): (640×480×8×5)+(640×480×12×6) = 34,406,400 bits/frame

    USB TX RATE: BIT/SEC   |  ENCODE RATE (PIX/SEC)  | BUS LOAD: BIT/SEC |             NOTES
  -------------------------+-------------------------+-------------------+-----------------------------
  120fps (294,912,000 bps) | 120fps (36,864,000 pix) | 4,136,768,000 bps | (no network, no SD card)
  120fps (294,912,000 bps) | 120fps (36,864,000 pix) | 4,136,768,000 bps | (stream network, no SD card)
  240fps (589,824,000 bps) | 240fps (73,728,000 pix) | 8,265,536,000 bps | (no network, no SD card)
  240fps (589,824,000 bps) | 240fps (73,728,000 pix) | 8,265,536,000 bps | (stream network, no SD card)

  Note: Bus load totals include encoded result (2 Mbps) over four stages (8 Mbps total, exc. network TX)

And here are the results for 720x540 8-bit greyscale/YUV420 hybrid:

Code: Select all

GREY in (388800 bytes/frame) / YUV420 enc (583200 bytes/frame): (720×540×8×5)+(720×540×12×6) =  43,545,600 bits/frame

    USB TX RATE: BIT/SEC   |  ENCODE RATE (PIX/SEC)  | BUS LOAD: BIT/SEC |             NOTES
  -------------------------+-------------------------+-------------------+-----------------------------
  120fps (294,912,000 bps) | 120fps (36,864,000 pix) | 5,233,472,000 bps | (no network, no SD card)
  120fps (294,912,000 bps) | 120fps (36,864,000 pix) | 5,233,472,000 bps | (stream network, no SD card)
  240fps (589,824,000 bps) | 150fps (46,080,000 pix) | 6,539,840,000 bps | (no network, no SD card)
  240fps (589,824,000 bps) | 145fps (44,544,000 pix) | 6,322,112,000 bps | (stream network, no SD card)
  240fps (589,824,000 bps) | 140fps (43,008,000 pix) | 6,104,384,000 bps | (stream network + SD card)

  Note: Bus load totals include encoded result (2 Mbps) over four stages (8 Mbps total, exc. network TX)

The bits/frame calculations above were based on your earlier algorithm of:
6by9 wrote: My tot up of memory transactions is:
- USB write (URB)
- CPU read (URB) CPU write (V4L2 buffer) - uvcvideo memcpy
- CPU read (V4L2 buffer) CPU write (OMX buffer) - app memcpy
- DMA read (OMX buffer) DMA write (gpu_mem buffer) - ILCS/VCHI
- ISP read (gpu_mem buffer) ISP write (internal video_encoder buffer) - video_encode
- H264 read (new frame and reference frame), H264 write (reference frame) - video_encode
- H264 write (encoded data) - video_encode
- DMA read (encoded data gpu_mem) DMA write (encoded data ARM mem) - ILCS/VCHI
- CPU read (encoded data) and does something with it.
Thus for the original YUYV it is 11 stages at full YUYV, whilst for grey/YUV420-planar it is 5 stages as greyscale and 6 stages as YUV420 planar. Finally, four stages for handling the encoded output per frame simply adds (4 x 2Mbps = 8 Mbps) to the totals in each case.

I would be grateful if you let me know if any of my numbers look incorrect.

The SD card writing was so that I could log the AXI bus metrics while the encoding was running. The command used was:

Code: Select all

bash$ while true; do sudo cat /sys/kernel/debug/raspberrypi_axi_monitor/System/data >> /tmp/__log; sleep 0.2; done

You can see from the last table in the 720x540 test above that streaming 2 Mbps to network appeared to reduce the frame rate by about 5 fps, and writing to SD card also reduced the frame rate by about 5fps. When streaming and logging to SD card, a total of 10 fps slowdown was observed.

All of these new tests were done while monitoring the AXI performance counters. Two of these captures I have attached as text files. The first log at 480p was a full speed 240fps capture and there was no slowdown. The second log is a 540p capture and it equates to the last entry in the last table (140fps throughput).

In summary, this new set of tests suggests that:

  1. The theoretical "worst case" maximum memory performance of 32 Gbit is not being approached. The most throughput I have achieved is ~10 Gbps.
  2. The theoretical maximum throughput of the encoder you suggested earlier was 120 Mpixels/sec. I am only managing around half of that, with a peak of 73 Mpixels/sec.
I apologise in advance as this represents a lot of information to take in. However, given this more comprehensive information, if possible, could you please explain:

  1. What might be causing me not to hit peak theoretical performance?
  2. Does the information in the attached AXI performance logs indicate unnecessary waits/delays?

Thanks in advance,
Attachments
AXI-Performance-Logs.zip
(14.34 KiB)

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Thu May 13, 2021 8:43 am

Sorry, but there was a recurring error in that last post. Each of the results tables had this comment:

Code: Select all

Note: Bus load totals include encoded result (2 Mbps) over four stages (8 Mbps total, exc. network TX)

This is wrong as it was streaming 2 Mbps per encoder. This means that the total memory bandwidth for all encoders returning the encoded stream to the application layer wasn't 8 Mbps total but a multiple of 8 Mbps. For 120fps this means that the Bus Load figures are short by 24 Mbps (three encoders' worth), while for 240fps the Bus Load figures are short by 56 Mbps (seven encoders' worth). In the grand scheme of things it doesn't really change much but I thought it was best you know in case the numbers didn't seem to add up.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Thu May 20, 2021 1:42 pm

Hi,

Did the information I posted last time (in particular the AXI performance monitor logs) yield any new information?

Thanks in advance,

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 29044
Joined: Sat Jul 30, 2011 7:41 pm

Re: Stressing USB3 affects H264 encoding performance (Update).

Fri Jun 11, 2021 9:55 am

Can I ask specifically what you are trying to achieve, and what is the current issue? From the first post you are attempting to encode multiple 1080p30 streams? That is too much for the HW encoder, which has a limit of just over 1080p30. Irrespective of the number of streams, there is a maximum number of pixels per second that the encoder can manage. You can multiplex the encode, but you cannot exceed the maximum pixels per second of the encoder, or you will slow down the frame rate.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Working in the Application's Team.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Mon Jun 14, 2021 2:58 am

Thanks for your enquiry.

I have managed to get multiple encoding sessions working. What I am not seeing is anywhere near the throughput that has been cited as being possible for the Pi 4, either in terms of pixels/second processed or memory bandwidth. What I am trying to establish is whether the shortfall in performance is due to insufficient program performance or insufficient hardware performance.

To summarise the thread for you thus far,

1) Theoretical Maximum Encoding Throughput

This I understand to be around 120 megapixels/sec.
6by9 wrote:
Tue Apr 20, 2021 2:52 pm
You can run multiple encodes simultaneously, but I'm surprised that your overall throughput was significantly above about 120MPix/s in total (1080p60/Level 4.2 can be achieved if everything aligns).
2) Theoretical AXI bus maximum memory-bound throughput.

This I understand, in practical terms, to be 4GB/second (32 Gbit) (source: this discussion, which is based on this article from MagPi).

3) The Objective

120 megapixels/second is a significant amount of throughput and many resolution/frame-rate combinations should be theoretically possible if my assumptions about hardware are correct. 1080p@60fps is one combination. However, this is not a combination I am interested in. Other combinations that theoretically should fit into 120 Mpixels/sec are 640x480@240fps (73 Mpixels), 720x540@240fps (93 Mpixels), 720p@120fps (110 Mpixels), etc.

I have conducted several tests. Here are the results, repeated from an earlier post, for one of the tests I performed. It's a 720x540@240fps 8-bit greyscale video stream (93 Mpixels) received over USB 3 via V4L2 submitted to the GPU via OMX as planar YUV420.

Code: Select all

GREY in (388800 bytes/frame) / YUV420 enc (583200 bytes/frame): (720×540×8×5)+(720×540×12×6) =  43,545,600 bits/frame

    USB TX RATE: BIT/SEC   |  ENCODE RATE (PIX/SEC)  | BUS LOAD: BIT/SEC |             NOTES
  -------------------------+-------------------------+-------------------+------------------------------------
  120fps (294,912,000 bps) | 120fps (36,864,000 pix) | 5,233,472,000 bps | (no network, no SD card access)
  120fps (294,912,000 bps) | 120fps (36,864,000 pix) | 5,233,472,000 bps | (stream network, no SD card access)
  240fps (589,824,000 bps) | 150fps (46,080,000 pix) | 6,539,840,000 bps | (no network, no SD card access)
  240fps (589,824,000 bps) | 145fps (44,544,000 pix) | 6,322,112,000 bps | (stream network, no SD card access)
  240fps (589,824,000 bps) | 140fps (43,008,000 pix) | 6,104,384,000 bps | (stream network + SD card access)

The calculation for the "bits/frame" metric is based on this earlier post. At the time the encoding is being performed, there is no other significant activity that my program is responsible for. As you can see, 120fps works fine but 240fps is not being achieved. However, the maximum encoded pixel rate being achieved is far short of the 120 megapixels/sec limit that has been cited as possible, and this does not look to be caused by memory-bandwidth saturation as the calculated bus load is also far short of the 32 Gbit limit previously cited by the above source(s).

There seems to be a very large discrepancy between the throughput I understand to be achievable and what is actually being achieved. I am trying to establish where the errors are in my assumptions about the hardware or hardware-related drivers (e.g., USB).

Thanks in advance.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11496
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Mon Jun 14, 2021 8:57 am

Theoretical limits can very rarely be reached.

1) My comment about reaching anything up to 1080p60 already means exceeding the quoted design spec.
The design spec is 1080p30 (as per the product brief page 3), but there is a significant overhead potentially available, and I seem to recall getting 1080p50 through the TC358743 HDMI to CSI2 bridge, but haven't tested that recently. I thought I'd had 1080p60 too.
720p120 from imx219 was tested on Pi2 or 3, and was achievable with an overclock. There was a thread on it at the time.

2) The key thing with that SDRAM bandwidth analysis is that it is using 1MB blocks, therefore the access patterns to RAM are almost ideal.

When scanning an image for motion estimation, access patterns are far from ideal. Prediction can be from any of the surrounding macroblocks from the previous frame, so for each 16x16 block you're pulling in 48x48 pixels. After each line you're skipping a chunk of memory to get to the same start place on the next line, so the actual contiguous read from memory is 48 bytes, and you now need 48 of them. That's not an ideal access pattern, and you will get frequent page swaps which reduce the bandwidth available from SDRAM.
There is likely to be some minimal caching as the search progresses horizontally across the image, but that then results in only 16 bytes being read off each line for the next macroblock search.
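
To put a rough number on that (worst case, assuming no caching at all):

Code: Select all

#include <stdio.h>

int main(void)
{
    const int window_bytes = 48 * 48;   /* 2304 bytes of luma, fetched as 48-byte reads */
    const int block_bytes  = 16 * 16;   /* 256 bytes of luma in the macroblock itself */

    printf("worst-case read amplification: %.1fx\n",
           (double)window_bytes / block_bytes);      /* 9.0x */
    return 0;
}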

What needs to be profiled is ensuring that nothing is ever waiting for a buffer to fill except the source (USB in your case). With pipelining you generally want at least 3 buffers on each link, and with higher frame rates generally more to compensate for any latency in thread switching.

You say there is no other significant activity that your program is responsible for. Do you have a display connected, and if so what resolution is it running at? That's a moderate memory bandwidth hit, and because it has to be real-time it has higher AXI arbiter priority.

Hitting 720x540@240 through the encoder I would expect to be achievable at 367200 Macroblocks/sec (level 4.0 being max 245760, and level 4.2 being max 522240). Benchmarking the encoder in isolation would be a recommended first step.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Jun 15, 2021 10:31 am

There are a lot of valid points you raised. I would like to get some sort of resolution on each of them before tackling any new points if that's OK?
6by9 wrote:
Mon Jun 14, 2021 8:57 am
Theoretical limits can very rarely be reached.
2) The key thing with that SDRAM bandwidth analysis is that it is using 1MB blocks, therefore the access patterns to RAM are almost ideal.

When scanning an image for motion estimation, access patterns are far from ideal. Prediction can be from any of the surrounding macroblocks from the previous frame, so for each 16x16 block you're pulling in 48x48 pixels. After each line you're skipping a chunk of memory to get to the same start place on the next line, so the actual contiguous read from memory is 48 bytes, and you now need 48 of them. That's not an ideal access pattern, and you will get frequent page swaps which reduce the bandwidth available from SDRAM.
There is likely to be some minimal caching as the search progresses horizontally across the image, but that then results in only 16 bytes being read off each line for the next macroblock search.
This I understood from early on and I explicitly tested for it. In previous tests there was a small drop (~3%) in encoding performance when shooting a completely still, well-lit scene at 640x480 compared to shooting a continuously, rapidly and randomly oscillating object at the same resolution/frame rate under the same lighting conditions. However, even with the 3% drop, the resulting rate was still well above the required 240 fps and therefore not of any concern.

I had therefore already assumed that memory access patterns due to motion vector fluctuations, while causing measurable degradation in performance, were not the cause of the near 50% shortfall in theoretical performance in this instance. Is my assumption still valid?

Incidentally, how are memory access patterns affected when two or more encoders are run in parallel? Are my assumptions about maximum throughput only valid when considering a single encoder instance?
6by9 wrote:
Mon Jun 14, 2021 8:57 am
You say there is no other significant activity that your program is responsible for. Do you have a display connected, and if so what resolution is it running at? That's a moderate memory bandwidth hit, and because it has to be real-time it has higher AXI arbiter priority.
The system I am running is headless; no keyboard, no mouse, no monitor. It is the "lite" version of Raspbian so all UI-related components aren't installed. Network access at 20Mbps (in a separate thread to the capture thread) plus whatever SD card activity the OS is doing (I am doing none).

I noted in an earlier post that random SD card access causes delays to occur and the system log (journalctl) seems to be logging messages about FIPS random seed generators, V4L2 status messages, etc. that I can't disable. Could the SD card access be causing delays in the encoder's memory access patterns? Could interrupts from the SD card controller be causing the system to wait?
6by9 wrote:
Mon Jun 14, 2021 8:57 am
Hitting 720x540@240 through the encoder I would expect to be achievable at 367200 Macroblocks/sec (level 4.0 being max 245760, and level 4.2 being max 522240). Benchmarking the encoder in isolation would be a recommended first step.
I agree. I did post some performance counter logs of the AXI bus activity during the above 720x540 run in this previous post. Is there any information in these logs that could point to the source of any potential delay?

Thanks in advance.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Jun 22, 2021 2:44 pm

Hi,

I know you're busy but I would be grateful if you have any further information in response to my earlier queries.

Also, I benchmarked a single encoder as discussed and here is what I observed.

Code: Select all

Selected FPS | Frames |  Time (s)  | Actual FPS | Notes
-------------+--------+------------+------------+-------
      60     |  72000 |  1199.768  |   60.011   | Pass
     120     | 144000 |  1297.957  |  110.940   | Fail

Both tests were conducted at 720x540, 8-bit greyscale from camera, submitted to OMX as YUV422 planar, under the same lighting conditions and with the same object being shot. Test time was 20 minutes. With regards to single encoder benchmarking, is there any other information you require?

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Jun 23, 2021 9:35 am

I've added 240fps for completeness (these are the only frame rates selectable on the camera at this resolution):

Code: Select all

Selected FPS | Frames |  Time (s)  | Actual FPS | Notes
-------------+--------+------------+------------+-------
      60     |  72000 |  1199.768  |   60.011   |  OK
     120     | 144000 |  1297.957  |  110.940   |
     240     | 288000 |  2396.805  |  120.160   |

As you can see, 120 fps is the limit (720x540x120 = 46.65 Mpixel). The program can comfortably handle significantly more than 240fps given more encoders and a lower resolution, so the program itself failing to keep up does not appear to be the issue.

As an observation, I would expect the maximum encoding rate to be approximately constant once it's been hit.

I would appreciate if you could please advise as to what other metrics I should be watching when conducting these tests.

Thanks in advance.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 11496
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Stressing USB3 affects H264 encoding performance (Update).

Wed Jun 23, 2021 1:09 pm

Without working through exactly where things are blocking, it's hard to comment further.
How many IL buffers do you have allocated on the encoder ports? The encode pipeline needs to be kept filled if you want optimum encode rate.

MMAL test app at https://github.com/6by9/mmal_encode_example.
Run on a CM4 I'm getting 200fps or greater.
720x540 is 45x34 macroblocks, or 1530 macroblocks/frame. 200fps means 306000 macroblocks/s, which is in excess of the 245760 requirement for level 4.0/4.1.

Run at 1920x1080 I get 49fps, which is 399840 macroblocks/s. OK this is 33% higher than the numbers for 720x540, but I am expecting there to be per frame overheads.
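
For reference, those macroblock/s figures are just:

Code: Select all

#include <stdio.h>

static long mb_per_sec(int width, int height, int fps)
{
    long mb = ((width + 15) / 16) * (long)((height + 15) / 16);  /* macroblocks per frame */
    return mb * fps;
}

int main(void)
{
    printf("720x540  @ 200fps: %ld macroblocks/s\n", mb_per_sec(720, 540, 200));   /* 306000 */
    printf("1920x1080 @ 49fps: %ld macroblocks/s\n", mb_per_sec(1920, 1080, 49));  /* 399840 */
    return 0;
}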

Add force_turbo=1 to config.txt and I get around 335fps at 720x540, and 71fps at 1080p.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Fri Jun 25, 2021 6:03 am

Hi,

Thanks very much for taking the time to knock up that test program. I conducted several tests with it; here's what I observed. Apologies for the long report but its findings are significant I think.

1) Buffers

First, I did try increasing the number of buffers in my program to six per port like your test program as I currently only run two buffers (as is the suggestion when reading the port definition via OMX). Performance of my program didn't change.

2) example_basic

When running your program standalone I was able to get around 362fps (force_turbo=1). Here are the summaries from five consecutive back-to-back runs:

Code: Select all

stop encoding 500 frames took 1377518 usecs or 362 fps
stop encoding 500 frames took 1377109 usecs or 363 fps
stop encoding 500 frames took 1377610 usecs or 362 fps
stop encoding 500 frames took 1374784 usecs or 363 fps
stop encoding 500 frames took 1373651 usecs or 363 fps

However, as you know, when I first opened this thread I reported that I too was able to confirm with my own program that pure encoding speed was not the issue. It is the encoding speed when USB is being stressed that I surmised was the issue (i.e., when the camera is actually running). Although the video output by your program appears to have unique frames (I observed a sort of "colour cycle"), your program as it stands does not stress USB.

So, I stubbed just the encoding out of my program so that it reads the raw camera frames and then sends just the first raster of the image (i.e., 720 bytes) over the network. The point of this was to benchmark my entire program minus the encoding (i.e., USB/network load/inter-thread communications, everything).

I reproduced my original test environment with your test program. I ran a 30fps capture in the background for one minute with my program while I ran your test program in parallel sometime during that minute. Again, five consecutive runs, here's what happened:

Code: Select all

stop encoding 500 frames took 1378235 usecs or 362 fps
stop encoding 500 frames took 1375062 usecs or 363 fps
stop encoding 500 frames took 1377501 usecs or 362 fps
stop encoding 500 frames took 1378169 usecs or 362 fps
stop encoding 500 frames took 1377012 usecs or 363 fps

Here's what my program said at the end of its minute run:

Code: Select all

2021/06/25 14:14:09.007 [0x00000000b6fbec50] frames(#/time/rate) [1800/59.985s/30.008fps] WAIT-ANOTHER

Your test program in this instance and my original test program behave almost identically at 30fps. Your program doesn't read USB in-between each frame sent to the encoder and my original test program took each 30fps frame and sent the same frame to multiple encoders. Our respective programs each managed to achieve high frame rates.

I also started a 240fps camera capture (720x520 8-bit greyscale) and ran your test program in parallel. Again, five consecutive runs, here's what happened:

Code: Select all

stop encoding 500 frames took 1713867 usecs or 291 fps
stop encoding 500 frames took 1979794 usecs or 252 fps
stop encoding 500 frames took 1865524 usecs or 268 fps
stop encoding 500 frames took 1647906 usecs or 303 fps
stop encoding 500 frames took 1836703 usecs or 272 fps

Here is the output from my encoding-less program while these tests were running:

Code: Select all

2021/06/25 13:10:36.149 [0x00000000b6fbec50] frames(#/time/rate) [14400/60.017s/239.931fps] WAIT-ANOTHER

Averaging your program's encode rate (277.2fps) that's a 24% drop in performance. Just to make sure, I upped the capture rate to 720x540 @ 480fps and re-ran the test. Here again are the five runs:

Code: Select all

stop encoding 500 frames took 2008887 usecs or 248 fps
stop encoding 500 frames took 1960778 usecs or 255 fps
stop encoding 500 frames took 1947228 usecs or 256 fps
stop encoding 500 frames took 1911526 usecs or 261 fps
stop encoding 500 frames took 1801305 usecs or 277 fps

and here's the output from my program:

Code: Select all

2021/06/25 13:39:14.624 [0x00000000b6fbec50] frames(#/time/rate) [28800/66.618s/432.312fps] WAIT-ANOTHER

Averaging again (259.4fps) we now have a 30% drop in your program's performance.

It is curious to note here that my program isn't getting full 480fps throughput. Monitoring with top gave a status similar to this for the entire minute the 480fps capture was running:

Code: Select all

top - 13:40:48 up 1 day, 21:07,  2 users,  load average: 1.45, 1.42, 1.24
Tasks: 117 total,   4 running, 113 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.8 us, 16.0 sy,  0.0 ni, 80.1 id,  0.0 wa,  0.0 hi,  1.1 si,  0.0 st
MiB Mem :   1688.6 total,   1463.9 free,     53.0 used,    171.8 buff/cache
MiB Swap:    100.0 total,    100.0 free,      0.0 used.   1558.3 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND            
15845 root       0 -20       0      0      0 R  23.4   0.0   0:12.31 kworker/u9:5+uvcvideo
15895 root       0 -20       0      0      0 R  21.4   0.0   0:16.96 kworker/u9:0+uvcvideo
15589 root      20   0  141748   5668   3752 S  17.2   0.3   2:33.08 testprog

The "testprog" is my program. It doesn't appear to be over-stressing the system. However, when I ran my encoding-less program again but did not run your test program during that time, here's what my program said:

Code: Select all

2021/06/25 13:47:12.004 [0x00000000b6fbec50] frames(#/time/rate) [28800/60.147s/478.825fps] WAIT-ANOTHER

It's not exactly 480fps but it's close enough (camera is not externally triggered so its clock probably drifts slightly). However, the result illustrates the issue as originally reported. USB loading in itself does not appear to be the problem and the performance of my program with respect to capturing frames, dequeuing, interthread-comms, network transmission, etc. also doesn't appear to be an issue.

What looks to be an issue, though, is that whenever encoding is performed in combination with traffic over USB (i.e., during capture), performance declines significantly. Furthermore, with the aid of your program this has now been confirmed even when the encoding and the USB transfer are performed in two completely separate processes, so can we not now eliminate buffer management within my program as the sole cause of the slow-down?

Incidentally, your test program presently doesn't do any buffer shifting between a camera driver and the encoder, so I wonder what would happen if V4L2 support were added to it. Without a full implementation we won't know.


Do the results above convince you that the encoding alone isn't the problem, USB loading itself isn't the problem, but that the system-wide combination of the two looks to be the problem?

I appreciate you're very busy so if you don't have a method at your end of quickly running up some kind of USB transfer (>= 750 Mbit) and then monitoring your test program while that transfer is running, would you be able to advise me as to the kind of metrics I should be watching so I can do the tests for you? Memory access/interleaving/bus timing appears to be at play here, but what should I be looking for?

Many thanks in advance,

MarkDarcy
Posts: 48
Joined: Thu Sep 20, 2018 8:23 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Jun 29, 2021 1:58 pm

Hi,

Apologies if the previous post wasn't clear, but the focus is on confirming how the theoretical maximum 32 Gbit of memory bandwidth is being utilised. It's not about the performance of the encoder under "normal" memory conditions; this has never been in doubt.

To summarise:

1) In the absence of USB traffic, your test program gets 362fps. Input is YUYV so using our previous discussion regarding memory copies per frame as a basis, I make it approx. 18 Gbit/sec total memory bandwidth:

Code: Select all

720 x 540  pixels
    x 16   YUYV (16 bits/pixel)
    x 362  fps
    x 8    memory copies/frame

= 18,015,436,800 bits/sec

2) My test program reads frames via V4L2 and then discards them. According to our previous discussion I surmise three copies per frame: USB write (URB), CPU read (URB), CPU write (V4L2 buffer). At 720x540 greyscale this gives:

Code: Select all

720 x 540  pixels
    x 8    GREY (8 bits/pixel)
    x 3    memory copies/frame

= 9,331,200 bits/frame

 FPS  |   Total memory bandwidth
------+--------------------------
  30  |         0.280 Gbit
 240  |         2.239 Gbit
 432  |         4.031 Gbit

3) At 240fps USB traffic your program showed a 24% drop in performance and managed 277fps. This is approx. 13.8 Gbit encoding throughput plus the 2.239 Gbit of the 240fps V4L2 copy overhead within my test program yielding approx. 16.0 Gbit total memory utilisation.

4) At 480fps USB traffic your program showed a 30% drop in performance and managed 260fps whilst my program showed a 10% drop in performance managing 432fps. This is approx. 12.9 Gbit encoding throughput plus the 4.031 Gbit of the 432fps V4L2 copy overhead within my test program yielding approx. 17.0 Gbit total memory utilisation.

Are the assumptions about the hardware correct in these calculations?

If they are then both of these numbers (16.0 and 17.0) are approximately half of the 32 Gbit memory bandwidth that is cited as being available. I appreciate that caching strategy and in-process memory layout may be causing delays. If that is the case I would just like some way of being able to observe the delay so I can confirm whether what I am trying to achieve is actually possible or not. It might well not be possible.

Is there any metric that can be observed that would pin-point either way where/why the hold-ups are arising?

Thanks again.

ejolson
Posts: 7604
Joined: Tue Mar 18, 2014 11:47 am

Re: Stressing USB3 affects H264 encoding performance (Update).

Tue Jun 29, 2021 3:39 pm

MarkDarcy wrote:
Tue Jun 29, 2021 1:58 pm
Is there any metric that can be observed that would pin-point either way where/why the hold-ups are arising?
I'm posting, not so much with an answer, but an observation about the memory on the Raspberry Pi. The graph at

viewtopic.php?p=1644489#p1644489

indicates aggregate memory bandwidth can decrease by a significant factor when multiple CPU processes simultaneously access different pages. As per

viewtopic.php?p=1644747#p1644747

the apparent cause for this slowdown is a hardware limit on the number of rows of RAM that can be open at once. As far as I can tell, since the Pi has only one memory chip, the row limit is more noticeable than on two-chip designs and, of course, much more noticeable than on the multi-channel architectures used on servers. Recent IBM Power systems seem especially good, for example, in this regard.

At any rate, sorry I don't have a solution to help track down the stalls on the memory bus. I find this an interesting discussion and am looking forward to hearing of any more discoveries that might be found.
