Gnyueh wrote: ↑Fri Jul 31, 2020 3:19 pm
jamesh wrote: ↑Fri Jul 31, 2020 3:09 pm
I cannot be particularly detailed (NDA), but it is worth noting that everyday memory bandwidth usage will, on average, be higher than the pathological test-case memory tests seen here. We would probably expect anywhere between 20 and 40% better performance in real-world usage than that shown in some memory test systems. It would certainly be possible, given what we know about how everything works, to 'game' these benchmarks and get much higher speeds reported, along the lines of that 20-40% figure, but we don't do that. I cannot be sure about other manufacturers' ethics on this.
Welp, that is interesting. The BW test is of course not that real-world, but it indicates the performance of the memory controller in one specific dimension, which is meaningful. And how about the BW scaling with increasing thread count in internal tests? I would rather believe these disappointing results are not real.
From reading the thread
viewtopic.php?t=271121
it is clear that different people obtained and confirmed the same result, so there is no error in the sense of reproducibility. Instructions are also given in that thread, if you want to run the test yourself.
As I described earlier:
ejolson wrote: ↑Tue Apr 21, 2020 3:51 pm
While stream is definitely a synthetic RAM benchmark, in my opinion copying memory around is not uncommon, nor are sections of code that add and rescale vectors of numbers. The stream kernels were originally developed to represent the kinds of vector operations used in a real-world ocean-modelling program and to explain why the Cray supercomputers performed so well when performing such calculations. More information is available at
http://www.cs.virginia.edu/~mccalpin/ST ... -01-25.pdf
which appears in
viewtopic.php?p=1647290#p1647290
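For anyone who has not looked at the benchmark itself, the four STREAM kernels (copy, scale, add, triad) can be sketched roughly as below. This is only an illustration in pure Python: the function name stream_kernels, the array length and the scalar are my own choices, and interpreter overhead dominates here, so it shows what the kernels compute and how bytes are counted rather than measuring real bandwidth.

```python
from array import array


def stream_kernels(n=1000, scalar=3.0):
    """Run the four STREAM-style kernels once on small arrays.

    Hypothetical helper for illustration only; the real benchmark is
    written in C/Fortran and times each kernel over much larger arrays.
    """
    a = array("d", [1.0] * n)
    b = array("d", [2.0] * n)
    c = array("d", [0.0] * n)
    word = a.itemsize  # bytes per double-precision element

    # Copy: c = a (one read, one write per element)
    for i in range(n):
        c[i] = a[i]

    # Scale: b = scalar * c (one read, one write per element)
    for i in range(n):
        b[i] = scalar * c[i]

    # Add: c = a + b (two reads, one write per element)
    for i in range(n):
        c[i] = a[i] + b[i]

    # Triad: a = b + scalar * c (two reads, one write per element)
    for i in range(n):
        a[i] = b[i] + scalar * c[i]

    # Bytes moved per kernel, counted the way STREAM counts them.
    bytes_moved = {
        "copy": 2 * word * n,
        "scale": 2 * word * n,
        "add": 3 * word * n,
        "triad": 3 * word * n,
    }
    return a, b, c, bytes_moved
```

Dividing the bytes moved by the measured time of each loop gives the bandwidth figure that STREAM reports; each kernel does almost no arithmetic per byte, which is why the result reflects the memory system rather than the cores.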
When performing similar operations in practical code, it becomes important to understand the hardware well enough to know whether memory contention will make the elapsed time of the parallel computation longer or shorter.
Note that scheduling is not a problem as these are trivially parallel operations. The difficulty is that even though there are many cores, they all share the same memory bus. Moreover, the nature of the stream vector kernels allows a single fast core to max out available bandwidth on the 4B's memory bus.
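A very crude way to picture this is a shared-bus model in which the aggregate bandwidth is the smaller of what the threads could draw together and what the bus can deliver. The sketch below is my own simplification, not a description of how any particular memory controller behaves, and the function names and bandwidth figures are hypothetical placeholders rather than measurements.

```python
def aggregate_bandwidth(threads, per_core_gbs, bus_gbs):
    """Aggregate bandwidth in GB/s under a simple shared-bus model.

    Assumption: every thread can draw per_core_gbs on its own, and all
    threads together are capped by bus_gbs.
    """
    return min(threads * per_core_gbs, bus_gbs)


def elapsed_seconds(total_gb, threads, per_core_gbs, bus_gbs):
    """Time to stream total_gb of data split evenly across the threads."""
    return total_gb / aggregate_bandwidth(threads, per_core_gbs, bus_gbs)


if __name__ == "__main__":
    # Hypothetical numbers: one core already saturates the bus.
    for t in (1, 2, 4):
        print(t, elapsed_seconds(8.0, t, per_core_gbs=4.0, bus_gbs=4.0))
    # Hypothetical numbers: the bus has four times the single-core rate.
    for t in (1, 2, 4, 8):
        print(t, elapsed_seconds(8.0, t, per_core_gbs=1.0, bus_gbs=4.0))
```

In the first case adding threads does not shorten the elapsed time at all, which is the behaviour described above when one core can already max out the bus; in the second, scaling continues until the bus limit is reached. Real hardware can also get slower with more threads because of contention, which this model ignores.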
This is one reason the extra memory channels on high core-count EPYC and Power9 processors are attractive for parallel scaling. It is also why the high-bandwidth memory of the Fujitsu A64FX led to a machine that absolutely trounces other systems in per-core efficiency on problems similar to the high-performance conjugate gradient. The advantage of such designs becomes most obvious when compared to the memory controllers used in the ShenWei SW26010 and, of course, the Raspberry Pi.