bensimmo wrote: ↑Sun Nov 25, 2018 2:47 pm
TX2 is 6-core, 2xDenver2-ARM and 4x-A57 (all at 2GHz).
Can it run on the CUDA cores, would that improve things?
(Given the TX2 is more powerful than the Switch Console)
Update: By default Jetson boots with the Denver cores disabled for power savings. The tests above only involved the A57 cores. The Denver cores are about 60% faster on the Pi Chart benchmark. A new set of runs that include the Denver cores appears here.
From what I can tell, the Ubuntu Linux running on the Jetson TX2 boots with cores 0, 3, 4 and 5 active. I haven't spent enough time on it to know which kinds of cores those are or whether they switch dynamically based on load. In particular, the above tests were performed with only 4 cores active.
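For anyone who wants to check which cores their own board brought up, reading the standard Linux sysfs file is enough. The little host-side program below is just a sketch of that check, nothing Jetson-specific:

/* Print which CPU cores the Linux kernel currently has online.
   On the TX2 described above this would show something like 0,3-5. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/devices/system/cpu/online", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    char buf[64];
    if (fgets(buf, sizeof buf, f) != NULL)
        printf("online cores: %s", buf);
    fclose(f);
    return 0;
}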
I actually translated the parallel merge sort into CUDA last year to use in an algorithm for finding differences between sets of point-cloud data gathered from a LIDAR. While a work-span analysis of general parallel sorting algorithms places merge sort near the top, to use hundreds of CUDA cores efficiently people sometimes employ a radix sort customized to the data. In the case of the point-cloud differencing algorithm, the real-number radix sort from the CUB sorting library was, surprisingly, slower than my general merge sort running on a P100 GPU. When running the same code on the Pascal GPU of the Jetson TX2, the radix sort performed better.
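For reference, calling the CUB radix sort on float keys takes only a few lines. The sketch below follows the usual two-pass pattern of querying the temporary storage size and then sorting; the function name and pointer names are mine, and error checking is omitted:

// Sketch: sort an array of float keys already resident on the GPU with CUB.
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void sort_float_keys(const float *d_keys_in, float *d_keys_out, int num_items)
{
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    // First call computes how much temporary storage the sort needs.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Second call performs the sort; CUB handles the IEEE-float bit ordering.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaFree(d_temp_storage);
}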
The prime sieve is embarrassingly parallel, with the added benefit of being cache friendly; for this reason the parallel algorithm also runs faster on single-core computers. The algorithm employs a lot of bitwise operations, so it would be interesting to see what happens on CUDA. I haven't tried.
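If someone did try the sieve on CUDA, the bitwise marking could be expressed as a kernel along these lines. This is only a naive one-thread-per-prime sketch of mine, not the benchmark code; the bit array is assumed to start out with every bit set:

// Each thread clears the bits for the multiples of one small prime.
// Several primes can hit the same 32-bit word, hence the atomic.
__global__ void mark_composites(unsigned int *bits, const unsigned int *primes,
                                int nprimes, unsigned long long limit)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nprimes) return;
    unsigned long long p = primes[i];
    for (unsigned long long m = p * p; m < limit; m += p)
        atomicAnd(&bits[m >> 5], ~(1u << (m & 31)));
}

The load balance of that sketch is poor since the small primes account for most of the marking, so a segmented sieve with one segment per thread block would probably map better onto hundreds of CUDA cores.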
In my opinion, the fact that Nvidia offered good FFT and BLAS libraries for CUDA greatly increased its popularity in the beginning. The Fourier transform coded in C for the Pi Chart benchmark uses a cache-agnostic recursive parallel implementation. Newer versions of CUDA support this type of dynamic parallelism, and it would be interesting to see how the vendor library compares. When I converted the merge sort to CUDA, it was necessary to code the non-parallel part of the recursion explicitly using arrays and goto statements to prevent too many stack frames from overwhelming the simple GPU memory manager. I expect a similar technique would be needed when translating the C code used in the Fourier transform benchmark.
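For completeness, the vendor-library call that the recursive transform would be compared against is short. Below is a rough sketch using cuFFT on a double-complex array already on the GPU, with plan caching and error checks left out:

// Sketch: in-place forward FFT of length n using the vendor library.
#include <cufft.h>

void vendor_fft(cufftDoubleComplex *d_data, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_Z2Z, 1);                 // one 1-D transform
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);   // in place
    cufftDestroy(plan);
}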
The Lorenz 96 simulation might be the easiest to translate to CUDA because of its computational simplicity and the fine-grained parallelism available. The CPU version uses an overlapping boundary approach based on domains of dependence to obtain coarser-grained computational blocks that contain enough work for parallel speedup on 64-core systems. GPUs typically have many more processors. It is also worth noting that the phase space consists of 32K double-precision words and entirely fits into the cache. This is small for a parallel problem but quite large compared to typical Lorenz 96 simulations.
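To illustrate the fine-grained mapping, here is a one-thread-per-variable sketch of a single Lorenz 96 time step. I use forward Euler only to keep it short; the actual benchmark uses a different integrator and the blocking scheme described above:

// One explicit time step of dx_j/dt = (x_{j+1} - x_{j-2}) x_{j-1} - x_j + F
// with periodic indices; one thread updates one state variable.
__global__ void lorenz96_step(const double *x, double *x_new,
                              int n, double F, double dt)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    int jm1 = (j - 1 + n) % n;
    int jm2 = (j - 2 + n) % n;
    int jp1 = (j + 1) % n;
    double dxdt = (x[jp1] - x[jm2]) * x[jm1] - x[j] + F;
    x_new[j] = x[j] + dt * dxdt;
}

Since the phase space is only 32K words, each launch like this contains little work, so fusing several steps into one kernel with the same domains-of-dependence idea in shared memory would probably be needed to keep the GPU busy.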
Since CUDA is not supported on any current Raspberry Pi models, I don't see a pressing need to translate the existing OpenMP and Cilk parallel C codes to CUDA. At the same time, the performance characteristics of GPUs are quite different from those of CPUs, so the comparison would still be interesting. If you decide to write a GPU version yourself, I'll contribute my CUDA merge sort code as needed.