mikerr
Posts: 2774
Joined: Thu Jan 12, 2012 12:46 pm
Location: UK

Pi4 Sysbench results

Tue Jun 25, 2019 2:39 pm

I previously ran sysbench to compare the Pi 3B and the Asus Tinkerboard, so I've added the Pi 4 to the mix:

Code: Select all

 sysbench --num-threads=4 --test=cpu --cpu-max-prime=20000 --validate run

Number of threads: 4
Pi 3B

Code: Select all

    total time:                          129.6265s
   
    per-request statistics:
         min:                                 47.69ms
         avg:                                 48.12ms
Asus Tinkerboard

Code: Select all

    total time:                          82.5418s
   
    per-request statistics:
         min:                                 32.63ms
         avg:                                 33.00ms
Pi4B

Code: Select all

    total time:                          62.6426s
    
    per-request statistics:
         min:                                 24.95ms
         avg:                                 25.05ms

goodburner
Posts: 41
Joined: Sun Jun 16, 2019 3:20 am

Re: Pi4 Sysbench results

Fri Jun 28, 2019 10:31 pm

Raspberry Pi 3B+

Code: Select all

$ sysbench --num-threads=4 --test=cpu --cpu-max-prime=20000 --validate run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4
Additional request validation enabled.


Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          95.4187s
    total number of events:              10000
    total time taken by event execution: 381.6150
    per-request statistics:
         min:                                 32.57ms
         avg:                                 38.16ms
         max:                                126.70ms
         approx.  95 percentile:              38.36ms

Threads fairness:
    events (avg/stddev):           2500.0000/10.51
    execution time (avg/stddev):   95.4037/0.01


Raspberry Pi 4B

Code: Select all

$ sysbench --num-threads=4 --test=cpu --cpu-max-prime=20000 --validate run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4
Additional request validation enabled.


Doing CPU performance benchmark

Threads started!
 Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          62.8740s
    total number of events:              10000
    total time taken by event execution: 251.4166
    per-request statistics:
         min:                                 24.94ms
         avg:                                 25.14ms
         max:                                115.22ms
         approx.  95 percentile:              25.08ms

Threads fairness:
    events (avg/stddev):           2500.0000/4.58
    execution time (avg/stddev):   62.8541/0.01


zerschranzer
Posts: 1
Joined: Sun Sep 15, 2019 1:33 pm

Re: Pi4 Sysbench results

Sun Sep 15, 2019 1:40 pm

Pi 4 @ 1.75 GHz

Code: Select all

$ sysbench --num-threads=4 --test=cpu --cpu-max-prime=20000 --validate run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 4
Additional request validation enabled.


Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          53.7144s
    total number of events:              10000
    total time taken by event execution: 214.8269
    per-request statistics:
         min:                                 21.38ms
         avg:                                 21.48ms
         max:                                 72.57ms
         approx.  95 percentile:              21.48ms

Threads fairness:
    events (avg/stddev):           2500.0000/1.22
    execution time (avg/stddev):   53.7067/0.01

ejolson
Posts: 3400
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi4 Sysbench results

Sun Sep 15, 2019 8:42 pm

Rather than overclocking, you're better off running sysbench using a 64-bit operating system. Here are results from Gentoo:

Code: Select all

$ sysbench --threads=4 --cpu-max-prime=20000 --validate cpu run
sysbench 1.1.0-74f3b6b (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Validation checks: on.

Initializing random number generator from current time


Prime numbers limit: 20000

Initializing worker threads...

Threads started!

CPU speed:
    events per second:  2329.02

Throughput:
    events/s (eps):                      2329.0193
    time elapsed:                        10.0012s
    total number of events:              23293

Latency (ms):
         min:                                    1.70
         avg:                                    1.72
         max:                                    7.22
         95th percentile:                        1.70
         sum:                                39993.38

Threads fairness:
    events (avg/stddev):           5823.2500/6.30
    execution time (avg/stddev):   9.9983/0.00
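Thus, measured in events per second, the 64-bit sysbench cpu test at stock speeds is about 12.5 times faster than the overclocked 32-bit result and about 14.6 times faster than 32-bit at stock speeds. (The total run times differ by factors of only 5.37 and 6.29, but the 64-bit run is time-limited to 10 seconds and completed 23293 events rather than 10000.)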

As real programs run anywhere from 20 percent faster to 20 percent slower on 64-bit, one conclusion is that the sysbench cpu test is nearly useless. In my opinion, a slightly more meaningful measurement of processor performance can be obtained using this Pi pie chart program.

jdonald
Posts: 388
Joined: Fri Nov 03, 2017 4:36 pm

Re: Pi4 Sysbench results

Sun Sep 15, 2019 9:03 pm

I wouldn't say sysbench CPU is useless, but I acknowledge that a prime number sieve is not representative of everyday desktop usage.

I missed the sysbench discussion 1.5 years ago on this forum, but I believe I resolved any open questions here: https://www.raspberrypi.org/forums/view ... 3#p1518193

Restated, if you compile with -march=armv8-a+crc+simd -mtune=cortex-a72, the difference between 32-bit and 64-bit code disappears. In fact you need not even tune for Cortex-A72. I recall it even worked to compile with -march=armv8-a+crc+simd -mtune=cortex-a7, because that's sufficient for the udiv/sdiv instructions.

I know I've said this a lot, but people should be very careful when comparing ARMv6 code against anything more modern. This particular case is especially insidious because it doesn't resolve by installing the Debian armhf binary alone, as that is tuned for Cortex-A9.

ejolson
Posts: 3400
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi4 Sysbench results

Sun Sep 15, 2019 9:34 pm

Those optimization settings for 32-bit binaries are interesting. I wonder what difference they make for Pi pie charts.

For the record, the sysbench CPU test does not perform a prime sieve but rather uses trial division. Moreover, the divisions are performed with 64-bit arithmetic even though the maximum prime is set to 20000. From what I understand, the sysbench cpu code was originally a placeholder that was never meant to be widely distributed.

A parallel prime sieve is one of the tests in the Pi pie chart program.

ejolson
Posts: 3400
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi4 Sysbench results

Mon Sep 16, 2019 5:29 pm

I've started a thread here to discuss whether the optimization settings -march=armv8-a+crc+simd affect the speed of 64-bit division during 32-bit mode operation on Cortex-A72 processors. So far I'm not seeing much improvement.

pi-tastic
Posts: 89
Joined: Mon Jul 29, 2019 6:34 pm

Re: Pi4 Sysbench results

Mon Sep 16, 2019 6:23 pm

sysbench --num-threads=4 --test=cpu --cpu-max-prime=20000 --validate run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
WARNING: --num-threads is deprecated, use --threads instead
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Additional request validation enabled.

Initializing random number generator from current time


Prime numbers limit: 20000

Initializing worker threads...

Threads started!

CPU speed:
events per second: 1470.80

General statistics:
total time: 10.0020s
total number of events: 14713

Latency (ms):
min: 2.44
avg: 2.72
max: 17.61
95th percentile: 2.71
sum: 40001.49

Threads fairness:
events (avg/stddev): 3678.2500/3.77
execution time (avg/stddev): 10.0004/0.00

ejolson
Posts: 3400
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi4 Sysbench results

Mon Sep 16, 2019 7:33 pm

Am I to understand the above run was performed using a Pi 4B running Raspbian Buster? Could the compiler be so smart that it optimized the 64-bit divisions as 32-bit divisions?

ejolson
Posts: 3400
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi4 Sysbench results

Mon Sep 16, 2019 8:07 pm

I've confirmed [Edit: I was unable to confirm] the above execution times on Raspbian Buster with sysbench compiled from source [Edit: what compilation options do I need?] using the commands:

Code: Select all

$ git clone https://github.com/akopytov/sysbench.git
$ cd sysbench
$ ./autogen.sh
$ CFLAGS="-march=armv8-a+crc+simd -mtune=cortex-a72" ./configure --prefix=/home/pi/sysbench --without-mysql
$ make -j4
$ make install
$ cd /home/pi/sysbench/bin
$ ./sysbench --threads=4 --cpu-max-prime=20000 --validate cpu run
The speedup is very strange, because the code in question sb_cpu.c reads

Code: Select all

int cpu_execute_event(sb_event_t *r, int thread_id)
{
  unsigned long long c;
  unsigned long long l;
  double t;
  unsigned long long n=0;

  (void)thread_id; /* unused */
  (void)r; /* unused */

  /* So far we're using very simple test prime number tests in 64bit */

  for(c=3; c < max_prime; c++)
  {
    t = sqrt((double)c);
    for(l = 2; l <= t; l++)
      if (c % l == 0)
        break;
    if (l > t )
      n++;
  }

  return 0;
}
which was clearly written with the intent to perform 64-bit divisions.

My conjecture is that the unusually fast run time results from a compiler aggressively optimizing a trivial benchmark [Edit: However, I'm unable to confirm this, because my Raspbian 32-bit binary still runs about 10 times slower].
Last edited by ejolson on Mon Sep 16, 2019 11:55 pm, edited 5 times in total.

ejolson
Posts: 3400
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi4 Sysbench results

Mon Sep 16, 2019 8:12 pm

ejolson wrote:
Mon Sep 16, 2019 8:07 pm
My conjecture is that the unusually fast run time results from a compiler aggressively optimizing a trivial benchmark.
Performing objdump -S with debugging information turned on yields

Code: Select all

000000ac <cpu_execute_event>:
  for(c=3; c < max_prime; c++)
  ac:   e3003000    movw    r3, #0
  b0:   e3403000    movt    r3, #0
{
  b4:   e16d42f4    strd    r4, [sp, #-36]! ; 0xffffffdc
  for(c=3; c < max_prime; c++)
  b8:   e5930000    ldr r0, [r3]
{
  bc:   e1cd60f8    strd    r6, [sp, #8]
  c0:   e1cda1f8    strd    sl, [sp, #24]
  for(c=3; c < max_prime; c++)
  c4:   e3a0b000    mov fp, #0
  c8:   e35b0000    cmp fp, #0
{
  cc:   e1cd81f0    strd    r8, [sp, #16]
  d0:   e58de020    str lr, [sp, #32]
  for(c=3; c < max_prime; c++)
  d4:   03500003    cmpeq   r0, #3
{
  d8:   ed2d8b04    vpush   {d8-d9}
  dc:   e24dd004    sub sp, sp, #4
  for(c=3; c < max_prime; c++)
  e0:   9a00002a    bls 190 <cpu_execute_event+0xe4>
    for(l = 2; l <= t; l++)
  e4:   ed9f9b35    vldr    d9, [pc, #212]  ; 1c0 <cpu_execute_event+0x114>
  e8:   e1a0a000    mov sl, r0
  for(c=3; c < max_prime; c++)
  ec:   e3a06003    mov r6, #3
  f0:   e3a07000    mov r7, #0
  f4:   e2966001    adds    r6, r6, #1
  f8:   e2a77000    adc r7, r7, #0
  fc:   e157000b    cmp r7, fp
 100:   0156000a    cmpeq   r6, sl
 104:   0a000021    beq 190 <cpu_execute_event+0xe4>
    t = sqrt((double)c);
 108:   e1a00006    mov r0, r6
 10c:   e1a01007    mov r1, r7
 110:   ebfffffe    bl  0 <__aeabi_ul2d>
 114:   ec410b10    vmov    d0, r0, r1
 118:   eeb50b40    vcmp.f64    d0, #0.0
 11c:   eeb18bc0    vsqrt.f64   d8, d0
 120:   eef1fa10    vmrs    APSR_nzcv, fpscr
 124:   4a000022    bmi 1b4 <cpu_execute_event+0x108>
    for(l = 2; l <= t; l++)
 128:   eeb48bc9    vcmpe.f64   d8, d9
 12c:   eef1fa10    vmrs    APSR_nzcv, fpscr
 130:   baffffef    blt f4 <cpu_execute_event+0x48>
      if (c % l == 0)
 134:   e2068001    and r8, r6, #1
 138:   e3a09000    mov r9, #0
 13c:   e1983009    orrs    r3, r8, r9
 140:   0affffeb    beq f4 <cpu_execute_event+0x48>
    for(l = 2; l <= t; l++)
 144:   e3a04002    mov r4, #2
 148:   e1a05009    mov r5, r9
 14c:   e2944001    adds    r4, r4, #1
 150:   e2a55000    adc r5, r5, #0
 154:   e1a00004    mov r0, r4
 158:   e1a01005    mov r1, r5
 15c:   ebfffffe    bl  0 <__aeabi_ul2d>
 160:   ec410b17    vmov    d7, r0, r1
      if (c % l == 0)
 164:   e1a02004    mov r2, r4
 168:   e1a03005    mov r3, r5
 16c:   e1a00006    mov r0, r6
 170:   e1a01007    mov r1, r7
    for(l = 2; l <= t; l++)
 174:   eeb47bc8    vcmpe.f64   d7, d8
 178:   eef1fa10    vmrs    APSR_nzcv, fpscr
 17c:   8affffdc    bhi f4 <cpu_execute_event+0x48>
      if (c % l == 0)
 180:   ebfffffe    bl  0 <__aeabi_uldivmod>
 184:   e1923003    orrs    r3, r2, r3
 188:   1affffef    bne 14c <cpu_execute_event+0xa0>
 18c:   eaffffd8    b   f4 <cpu_execute_event+0x48>
}
 190:   e3a00000    mov r0, #0
 194:   e28dd004    add sp, sp, #4
 198:   ecbd8b04    vpop    {d8-d9}
 19c:   e1cd40d0    ldrd    r4, [sp]
 1a0:   e1cd60d8    ldrd    r6, [sp, #8]
 1a4:   e1cd81d0    ldrd    r8, [sp, #16]
 1a8:   e1cda1d8    ldrd    sl, [sp, #24]
 1ac:   e28dd020    add sp, sp, #32
 1b0:   e49df004    pop {pc}        ; (ldr pc, [sp], #4)
    t = sqrt((double)c);
 1b4:   ebfffffe    bl  0 <sqrt>
 1b8:   eaffffda    b   128 <cpu_execute_event+0x7c>
 1bc:   e320f000    nop {0}
 1c0:   00000000    .word   0x00000000
 1c4:   40000000    .word   0x40000000
which doesn't have any divide instructions or equivalent subroutine calls [Edit: Actually it does. At address 180 there is a call to __aeabi_uldivmod as pointed out below. This is also not so surprising, because the binary I created doesn't run any faster than the usual Raspbian one].
Last edited by ejolson on Mon Sep 16, 2019 11:50 pm, edited 4 times in total.

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 5310
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: Pi4 Sysbench results

Mon Sep 16, 2019 9:38 pm

ejolson wrote:
Mon Sep 16, 2019 8:12 pm
which doesn't have any divide instructions or equivalent subroutine calls.
bl 0 <__aeabi_uldivmod> ?

ejolson
Posts: 3400
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi4 Sysbench results

Mon Sep 16, 2019 9:48 pm

dom wrote:
Mon Sep 16, 2019 9:38 pm
bl 0 <__aeabi_uldivmod> ?
Hm. I must be blind. It seems I missed that. So why is it running so fast? [Edit: Okay, my compilation isn't running fast. What's going on with the claim that -mcpu=cortex-a72 makes sysbench run fast in 32-bit mode?]
Last edited by ejolson on Mon Sep 16, 2019 11:44 pm, edited 1 time in total.
