gkreidl
Posts: 5999
Joined: Thu Jan 26, 2012 1:07 pm
Location: Germany

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 8:22 pm

ejolson wrote:...

As far as I know most video encoding software uses integer arithmetic that may not generate as much heat as the floating point arithmetic used when solving linear algebra problems. While the video may play fine, it is difficult to claim there were no errors unless you have encoded the same stream twice and done a bit-level compare on the output to ensure that both encodes are identical.
I think it was JamesH who reported here once that an optimized H264 decoder using NEON would execute about 12 times faster (if I remember correctly). And current VLC also has some NEON optimizations built in.
Minimal Kiosk Browser (kweb)
Slim, fast webkit browser with support for audio+video+playlists+youtube+pdf+download
Optional fullscreen kiosk mode and command interface for embedded applications
Includes omxplayerGUI, an X front end for omxplayer

lb
Posts: 260
Joined: Sat Jan 28, 2012 8:07 pm

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 9:10 pm

Earlier I reported that one of my boards was able to run the benchmark with over_voltage=2. I've now done a longer test and sadly it still fails sometimes (9 out of 10 runs are fine) with incorrect results.
The other board even fails with over_voltage=4. And raising the voltage like mad cannot be a solution anyway, the Pi 3 has enough trouble with heat and high power consumption.

clivem
Posts: 79
Joined: Sun Aug 03, 2014 11:18 am

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 9:13 pm

Curious.... I just checked that my +2'd board is still running cpuburn-a53 and hasn't locked. Same, with the board that requires +4. It's still crunching..... Tomorrow I'll actually run a few files through my NEON transcode and bit compare the results with a file created on another SBC, that is known to generate good output.
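
For anyone wanting to try the bit compare mentioned above, here is a minimal Python sketch. The function name first_difference and the chunk size are made up for illustration; on the command line, cmp does the same job. It reports the offset of the first differing byte, which also helps when two files are expected to differ only in a header field.

```python
def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact byte inside the mismatching chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte found: one file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:
                # Both files ended at the same point without a mismatch.
                return None
            offset += len(a)
```

A result of None means the two encodes are byte-for-byte identical.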

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 5288
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 9:19 pm

lb wrote:Earlier I reported that one of my boards was able to run the benchmark with over_voltage=2. I've now done a longer test and sadly it still fails sometimes (9 out of 10 runs are fine) with incorrect results.
The other board even fails with over_voltage=4. And raising the voltage like mad cannot be a solution anyway, the Pi 3 has enough trouble with heat and high power consumption.
The other option is to decrease the frequency. Remove the over_voltage line and try "arm_freq=1000" for example (raise or lower it as appropriate to find a working value).

At the moment we're just gathering information on what helps. Eventually the plan will be that the frequency/voltage is adjusted as appropriate based on temperature and load.
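
When comparing notes between boards it helps to know exactly which of these settings are active. The sketch below pulls key=value pairs out of config.txt-style text. It is an illustration only: read_settings is a hypothetical helper, it treats everything after "#" as a comment, skips [section] filter lines, and keeps only the last value when a key is repeated.

```python
def read_settings(config_text):
    """Collect key=value settings, ignoring comments, blanks and [section] lines."""
    settings = {}
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("[") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Example text: over_voltage is commented out, arm_freq is lowered to 1000.
example = """
# Pi 3 stability experiment
#over_voltage=2
arm_freq=1000
"""
print(read_settings(example))
```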

joyrider3774
Posts: 19
Joined: Sun Mar 13, 2016 12:21 pm

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 9:29 pm

lb wrote:Earlier I reported that one of my boards was able to run the benchmark with over_voltage=2. I've now done a longer test and sadly it still fails sometimes (9 out of 10 runs are fine) with incorrect results.
The other board even fails with over_voltage=4. And raising the voltage like mad cannot be a solution anyway, the Pi 3 has enough trouble with heat and high power consumption.
How much longer did you have it running? I've been running it for about an hour now with over_voltage=2, and all tests completed so far have passed.
My Pi 3 doesn't have a heat sink, but it isn't sitting in a small case either; it sits in my Picade. I'm going to stop the test, though, because the CPU temperature doesn't have enough time to settle: the maximum temperature it reaches seems to rise the more tests I run (as does the lowest temperature between tests). I saw in a GitHub issue that they might lower the CPU to frequencies even below 600 MHz if needed, to let it settle down. I'm also tempted to buy a heat sink. That said, when I last tested in my Picade, the only system that pushed the CPU temperature to the throttling limit was PSP emulation using PPSSPP, where temperatures would reach just about 80 degrees after playing a 3D game for some time. I'll have to test with over_voltage=2 to see whether, under "normal" circumstances (playing a 3D game on PPSSPP), the temperature stays below the limit, since throttling would definitely have a very negative impact on emulation speed.

ejolson
Posts: 3260
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 9:40 pm

clivem wrote:Curious.... I just checked that my +2'd board is still running cpuburn-a53 and hasn't locked. Same, with the board that requires +4. It's still crunching..... Tomorrow I'll actually run a few files through my NEON transcode and bit compare the results with a file created on another SBC, that is known to generate good output.
It may be that the initial burst of running NEON instructions at 1200 MHz is what creates the errors and once the processor is throttled further errors don't occur. At any rate, since cpuburn doesn't actually verify the results of the calculation, it is impossible to tell whether there were errors or not.

Rather than running cpuburn-a53 at throttled speeds for long periods of time, you might consider writing a script

Code:

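#!/bin/bash
# Run cpuburn-a53 in 2-second bursts with 10-second pauses in between.
while sleep 10
do
    killall -CONT cpuburn-a53   # resume the paused process
    sleep 2                     # let it load the CPU briefly
    killall -STOP cpuburn-a53   # pause it again so the chip can cool
done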
to periodically pause and continue the cpuburn process so the CPU stays cool enough to remain at 1200 MHz. Note this kind of CPU use also more accurately reflects the bursts of computation that occur in most interactive applications.
Last edited by ejolson on Tue Mar 15, 2016 9:42 pm, edited 1 time in total.

lb
Posts: 260
Joined: Sat Jan 28, 2012 8:07 pm

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 9:42 pm

joyrider3774 wrote:how much longer did you have it running ? i'm running it for about an hour now with over_voltage=2, all tests done passed so far.
Not particularly long - around 30 minutes.

@dom
ARM core frequency reduction definitely helps: both boards are stable at 1100 MHz. It doesn't really look like temperature plays a big role in stability, though. At stock settings, one of the boards crashes as quickly as 10 s into the test, so there isn't really much time for it to heat up, as there's a heatsink. I've seen crashes while "vcgencmd measure_temp" reported less than 60 degC...
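
To log the temperature alongside each test run, the output of "vcgencmd measure_temp" (a line such as temp=59.1'C) can be parsed as in the sketch below. The helper names parse_temp and read_temp are made up for illustration, and read_temp only works on a Pi where vcgencmd is installed.

```python
import re
import subprocess


def parse_temp(output):
    """Extract degrees Celsius from text such as "temp=59.1'C"."""
    match = re.search(r"temp=([0-9]+(?:\.[0-9]+)?)'C", output)
    if match is None:
        raise ValueError("unexpected output: %r" % output)
    return float(match.group(1))


def read_temp():
    """Run vcgencmd on a Pi and return the reported temperature."""
    out = subprocess.check_output(["vcgencmd", "measure_temp"])
    return parse_temp(out.decode("ascii", "replace"))
```

Calling read_temp() just before and just after a failing run would show whether the crash really happened well below the throttling threshold.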

jdb
Raspberry Pi Engineer & Forum Moderator
Posts: 2035
Joined: Thu Jul 11, 2013 2:37 pm

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 10:33 pm

clivem wrote:
dom wrote:Eventually the plan will be that the frequency/voltage is adjusted as appropriate based on temperature and load.
This is starting to feel like a good Monty Python sketch.

User: "I bought a 64 bit SBC".
PiF: "Yeah, but we never said that the proprietary firmware would support 64 bit, or that the kernel would".
This doesn't stop the Pi3 you bought today from being a 64-bit machine tomorrow. Currently, ARMv7 is the only supported software stack unless you want to try experimental features that are not yet fully tested or implemented, and thus are expected to be unstable.
clivem wrote:
User: "I bought a board that was sold as being capable of 1.2GHz CPU speed."
PiF: "Well, it is if you void your warranty by giving it an extra volt or two, and don't forget to supply your own liquid cooling solution. We felt that the typical educational use would mean that the processor wouldn't need to run at 1.2GHz for more than 10 seconds, so we didn't even supply a 10 cent passive heatsink. Anyway, our solution is to limit operation to 1GHz. Problem solved!"
Disingenuous hyperbole. Clearly many Pi 3 customers have satisfactory performance (we've probably shipped tens of thousands by now) and as far as I can tell the only people complaining are those running multicore benchmarks designed to stress hardware and a fraction of those boards in their possession are failing.

Failure at stock frequencies on running a real-world application is a big deal. Running a benchmark application that thermally stresses the chip on purpose is less of an issue, but we're still going to look at it. Extrapolating failure in all cases from a tiny minority of stress tests is nonsense.
Rockets are loud.
https://astro-pi.org

clivem
Posts: 79
Joined: Sun Aug 03, 2014 11:18 am

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 10:43 pm

jdb wrote: Failure at stock frequencies on running a real-world application is a big deal.
I am glad you said that.

If the numbers generated from my NEON FFT functions on Pi3B, which differ from the numbers being generated by 4 other ARMv7 boards, and the same code running on Pi2B are any indication, it looks as if you are going to have a bigger problem on your hands than people moaning about CPU temps and throttling......

First thing tomorrow morning, I will ask the company I work for, for permission to share my code, which isn't open source, with PiF, because this is a serious issue, and this isn't a synthetic test measuring cpu speed or designed to stress a CPU. It is a real world app, transcoding audio from one format, sample rate, (or both), to another.

jdb
Raspberry Pi Engineer & Forum Moderator
Posts: 2035
Joined: Thu Jul 11, 2013 2:37 pm

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 10:46 pm

clivem wrote:
jdb wrote: Failure at stock frequencies on running a real-world application is a big deal.
I am glad you said that.

If the numbers generated from my NEON FFT functions, which differ from the numbers being generated by 4 other ARMv7 boards, and the same code running on Pi2B are any indication, it looks as if you are going to have a bigger problem on your hands than people moaning about CPU temps and throttling......

First thing tomorrow morning, I will ask the company I work for, for permission to share my code, which isn't open source, with PiF, because this is a serious issue, and this isn't a synthetic test measuring cpu speed or designed to stress a CPU. It is a real world app, transcoding audio from one format, sample rate, (or both), to another.
Please do. If you have ARMv7 code that worked on a previous generation of Pi that doesn't work on the latest generation then we'd like to get to the bottom of the issue.

lb
Posts: 260
Joined: Sat Jan 28, 2012 8:07 pm

Re: Pi3 incorrect results under load (possibly heat related)

Tue Mar 15, 2016 10:48 pm

Note that I also see random internal compiler errors from time to time when compiling software. That's not a synthetic benchmark or anything like that. In fact, these random compiler errors brought me here to investigate further. For the record, I used clang 3.7.1 to compile OpenCV 3.1. So nothing special at all.

joyrider3774
Posts: 19
Joined: Sun Mar 13, 2016 12:21 pm

Re: Pi3 incorrect results under load (possibly heat related)

Wed Mar 16, 2016 3:56 am

I don't actually like reaching the CPU throttling temperature (80-85°C) in *normal* use, but to be fair I have only reached it when running a 3D game for some time in the PPSSPP emulator from RetroPie, and when building from source with make -j4 on the Pi itself (when (re)building PPSSPP). It did not crash the Pi 3, but it did affect performance. It's not as if I'll be doing that much, though, and I have no clue what other kinds of software might push it to these limits (as in, examples to give).

ejolson
Posts: 3260
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Wed Mar 16, 2016 5:38 am

jdb wrote:Clearly many Pi 3 customers have satisfactory performance (we've probably shipped tens of thousands by now) and as far as I can tell the only people complaining are those running multicore benchmarks designed to stress hardware and a fraction of those boards in their possession are failing.

Failure at stock frequencies on running a real-world application is a big deal. Running a benchmark application that thermally stresses the chip on purpose is less of an issue, but we're still going to look at it. Extrapolating failure in all cases from a tiny minority of stress tests is nonsense.
Extrapolating failure from stress tests is exactly why such tests have been developed and used in best engineering practices. Failure to acknowledge engineering faults translates into failure to fix them. Ignoring problems reported by engineers got NASA into trouble many years ago with the O-rings on the rockets for the space shuttle; assuming the public won't notice when the emissions of a vehicle fail to meet specifications recently got Volkswagen in trouble with their turbo diesels.

This thread started with the observation that the NEON optimized version of the OpenBLAS linear algebra subroutine library yields wrong answers on the Raspberry Pi 3B but not on the Pi 2B. This library is used by R, Octave, Python and many engineering and scientific applications. All the test program does is solve systems of linear equations using Gaussian elimination with partial pivoting. This is such a common and simple algorithm that it has become a standard benchmark for many people involved in scientific computation.

People are reporting crashing web browsers, gcc internal compiler errors and crashes when processing audio. The Linpack benchmark discussed here provides a quick and reliable test to verify whether hardware is experiencing errors. Such errors are much more difficult to diagnose in the context of a multi-threaded web browser, but that doesn't mean they do not occur. As far as I know, not a single Pi 3B using stock voltage and frequency settings has been able to reliably solve systems of linear equations using the NEON Linpack binary while every Pi 2B tested can run the program without error.
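
To make the kind of check discussed here concrete, below is a small pure-Python sketch of Gaussian elimination with partial pivoting followed by a residual test. It is not the OpenBLAS or HPL code: the function names solve and max_residual are made up for illustration, and the residual is a plain max |Ax - b| rather than the scaled residual a real Linpack run reports.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data is untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x


def max_residual(a, b, x):
    """Largest absolute entry of a x - b; near zero for a correct solve."""
    return max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
               for row, bi in zip(a, b))
```

On healthy hardware the residual stays within rounding error of zero. A corrupted arithmetic result anywhere in the elimination shows up as a residual that is far too large, which is why this kind of test is a convenient hardware check.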
Last edited by ejolson on Wed Mar 16, 2016 9:23 am, edited 1 time in total.

gkreidl
Posts: 5999
Joined: Thu Jan 26, 2012 1:07 pm
Location: Germany

Re: Pi3 incorrect results under load (possibly heat related)

Wed Mar 16, 2016 6:20 am

I repeated my HandBrake test (described above, converting a 1.3 GB ts video to 350MB h264 mkv) twice and compared the files: except for a time stamp in the header the files are identical. The test runs for about an hour using all four cores.
The last time I added a small heat sink (from the PiHut) before I ran the test. Without the heat sink it throttled down to 860-920 MHz under full load, with the heat sink it was running slightly above and below 1100 MHz. So a heat sink gives about 200-220 MHz performance boost under full load.

BTW, I've done a lot of compiling lately (packages like VLC) running on multiple cores and never had a compiler error.

And regarding reported browser crashes: These are well known and reproducible bugs and memory leaks in the patched webkit3 engine and the gstreamer1.0 libraries. They happen on all systems (B+, RPi2 and 3). There's a real problem as nobody seems to care about it, but it is not related to the Pi3 hardware.

I don't underestimate that we may have a real problem here regarding the benchmark test, but some people here are exaggerating it and using it for an overall attack.

Fidelius
Posts: 438
Joined: Wed Jan 01, 2014 8:40 pm
Location: Germany

Re: Pi3 incorrect results under load (possibly heat related)

Wed Mar 16, 2016 8:47 am

dom wrote:At the moment we're just gathering information on what helps. Eventually the plan will be that the frequency/voltage is adjusted as appropriate based on temperature and load.
Just out of interest (and I am no Kernel/Linux expert, just an application programmer) : what part in a Linux system like Raspbian would do such an on-the-fly adjusting? The kernel as such, or some Pi module ("overlay" module?) ?

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23083
Joined: Sat Jul 30, 2011 7:41 pm

Re: Pi3 incorrect results under load (possibly heat related)

Wed Mar 16, 2016 9:41 am

I've removed some clearly concern trolling posts that help no-one. Please keep to the facts rather than speculation or wondering why things are not done in a certain way. If there is an issue it will be found and sorted.


With regard to prerelease testing, boards have existed for some months, and have been tested in house and externally. But testing NEVER finds everything, especially when the issues depend on silicon variations.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
"My grief counseller just died, luckily, he was so good, I didn't care."

RaTTuS
Posts: 10381
Joined: Tue Nov 29, 2011 11:12 am
Location: North West UK

Re: Pi3 incorrect results under load (possibly heat related)

Wed Mar 16, 2016 9:53 am

oops sorry - stupid keyboard mash.
How To ask Questions :- http://www.catb.org/esr/faqs/smart-questions.html
WARNING - some parts of this post may be erroneous YMMV

1QC43qbL5FySu2Pi51vGqKqxy3UiJgukSX
Covfefe

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 5288
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: Pi3 incorrect results under load (possibly heat related)

Wed Mar 16, 2016 1:03 pm

Fidelius wrote:Just out of interest (and I am no Kernel/Linux expert, just an application programmer) : what part in a Linux system like Raspbian would do such an on-the-fly adjusting? The kernel as such, or some Pi module ("overlay" module?) ?
There is a cpufreq driver on the ARM which sends requests for high or low frequencies depending on the CPU usage.
These requests are handled by the GPU through the mailbox interface.
The GPU also monitors the temperature of the chip and will cap the ARM frequency when the temperature exceeds a threshold (currently 80°C), and removes turbo when it exceeds a second threshold (currently 85°C).
Removing turbo mode means the core voltage is reduced to 1.2V and the ARM and GPU frequencies are reduced to non-turbo.
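
The two-threshold behaviour described above can be summarised in a few lines of Python. This is only a sketch of the decision as stated in this post, not the actual GPU firmware logic: the function name thermal_state and the state labels are invented for illustration, and only the 80°C and 85°C figures come from the description.

```python
SOFT_LIMIT_C = 80.0  # above this the ARM frequency is capped
HARD_LIMIT_C = 85.0  # above this turbo mode is removed


def thermal_state(temp_c):
    """Map a chip temperature to the action described in the post."""
    if temp_c > HARD_LIMIT_C:
        return "turbo removed"
    if temp_c > SOFT_LIMIT_C:
        return "arm frequency capped"
    return "unrestricted"
```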

ejolson
Posts: 3260
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Wed Mar 16, 2016 3:30 pm

dom wrote:The GPU also monitors the temperature of the chip and will cap the ARM frequency when the temperature exceeds a threshold (currently 80°C), and removes turbo when it exceeds a second threshold (currently 85°C).
Removing turbo mode means the core voltage is reduced to 1.2V and the ARM and GPU frequencies are reduced to non-turbo.
Given the fact that the CPU seems to run reliably after it's been throttled, maybe it's turbo mode, or more likely the switch from turbo mode to non-turbo mode, which is incompatible with NEON optimized code. Is the logic in the GPU that turns turbo mode on and off user accessible? In particular, is there a way to turn turbo off before running some program? I wonder if reducing the voltage more gradually when entering non-turbo mode would avoid errors.

It should be possible to inspect the LU decomposition of A to determine approximately when during the run of Lapack an error was made. If this time correlates to the switch from turbo to non-turbo mode, that would be interesting.

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 5288
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: Pi3 incorrect results under load (possibly heat related)

Wed Mar 16, 2016 5:51 pm

ejolson wrote:Is the logic in the GPU that turns turbo mode on and off user accessible? In particular, is there a way to turn turbo off before running some program? I wonder if reducing the voltage more gradually when entering non-turbo mode would avoid errors.
This is controlled by the cpufreq driver. Try:
echo powersave | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

powersave will never use turbo.
ondemand is the default - turbo when any core is above 50% busy
performance - turbo mode always

(although turbo mode may be limited/disabled due to temperature).
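
When reporting test results it is worth recording which governor was active. The Python sketch below reads the same sysfs file used in the commands above and looks up the turbo behaviour listed in this post. The names read_governor, describe and TURBO_BEHAVIOUR are made up for illustration; the descriptions are simply the three lines above.

```python
GOVERNOR_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"

# Turbo behaviour per governor, as listed in the post above.
TURBO_BEHAVIOUR = {
    "powersave": "never uses turbo",
    "ondemand": "turbo when any core is above 50% busy (the default)",
    "performance": "turbo always",
}


def read_governor(path=GOVERNOR_PATH):
    """Return the name of the active cpufreq governor."""
    with open(path) as f:
        return f.read().strip()


def describe(governor):
    """Look up the turbo behaviour for a governor name."""
    return TURBO_BEHAVIOUR.get(governor, "not covered in this thread")
```

On a Pi, describe(read_governor()) gives a one-line summary to paste next to a benchmark result.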

clivem
Posts: 79
Joined: Sun Aug 03, 2014 11:18 am

Re: Pi3 incorrect results under load (possibly heat related)

Thu Mar 17, 2016 3:02 pm

deater wrote: It's definitely not out-of-memory. The benchmark finishes, it's just the correctness (residual) checks fail.
Would you mind humouring me, and recompile your OpenBLAS library and test again?
Build OpenBLAS with "make TARGET=ARMV6 DYNAMIC_ARCH=0 USE_THREAD=1 USE_OPENMP=0".
From multiple runs, do you still see any residual failures on Pi3B?

ejolson
Posts: 3260
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Thu Mar 17, 2016 8:29 pm

clivem wrote:
deater wrote: It's definitely not out-of-memory. The benchmark finishes, it's just the correctness (residual) checks fail.
Would you mind humouring me, and recompile your OpenBLAS library and test again?
Build OpenBLAS with "make TARGET=ARMV6 DYNAMIC_ARCH=0 USE_THREAD=1 USE_OPENMP=0".
From multiple runs, do you still see any residual failures on Pi3B?
I think an ARMv6 version of OpenBLAS is part of the Raspbian/Jessie distribution. You should be able to install with apt-get. On x86 the standard Debian binary is multi-threaded; however, I haven't checked the Raspbian binary.

Yggdrasil
Posts: 138
Joined: Sun Aug 26, 2012 8:45 pm

Re: Pi3 incorrect results under load (possibly heat related)

Fri Mar 18, 2016 1:13 am

clivem wrote:Would you mind humouring me, and recompile your OpenBLAS library and test again?
Build OpenBLAS with "make TARGET=ARMV6 DYNAMIC_ARCH=0 USE_THREAD=1 USE_OPENMP=0".
From multiple runs, do you still see any residual failures on Pi3B?
@Deater: And please post the information on which benchmark you have used. I've only found 'N=8000' but no info about the script/test used.

ejolson
Posts: 3260
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3 incorrect results under load (possibly heat related)

Fri Mar 18, 2016 3:55 am

Yggdrasil wrote:@Deater: And please post the information on which benchmark you have used. I've only found 'N=8000' but no info about the script/test used.
The Linpack benchmark is described on Wikipedia. Essentially, one times how long it takes the computer to solve a system of N linear equations in N variables using Gaussian elimination with partial pivoting, and then divides the operation count (2/3)N³ + 2N² by that time to obtain a flops (floating point operations per second) rating. There are no requirements on which programming language to use; however, optimized assembler as exemplified by the OpenBLAS library is common.
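
As a worked version of that formula, the rating can be computed as in the short Python sketch below. The function name linpack_flops is made up for illustration; n is the number of equations and seconds is the measured solve time.

```python
def linpack_flops(n, seconds):
    """Floating point operations per second for an n x n Linpack solve."""
    operations = 2.0 * n**3 / 3.0 + 2.0 * n**2
    return operations / seconds
```

For the N=8000 case mentioned above, the cubic term dominates, so the rating is roughly proportional to N³ divided by the run time.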

deater
Posts: 27
Joined: Fri Mar 11, 2016 3:58 pm
Location: 45N

Re: Pi3 incorrect results under load (possibly heat related)

Fri Mar 18, 2016 4:35 am

clivem wrote:
Would you mind humouring me, and recompile your OpenBLAS library and test again?
Build OpenBLAS with "make TARGET=ARMV6 DYNAMIC_ARCH=0 USE_THREAD=1 USE_OPENMP=0".
From multiple runs, do you still see any residual failures on Pi3B?
What exactly are you getting at here?

I've run this HPL/OpenBLAS combo on over 40 different machines, including each version of Pi hardware. The only one that has ever failed is the Pi3. That includes running on a dragon board which also has a Cortex A53 in it.

And when I up the voltage in config.txt it runs fine.

I'm not sure what compiling for single thread armv6 would do at all, except potentially stress the CPU a bit less.
