MiKuSite 
LINUX Stuff
Work in progress - here you can find my recent efforts in 64-bit ARM coding on 64-bit Linux.
FracNEON_sp/dp_mt V1.0 (25/06/2022)

After coding efficient single-core Mandelbrot variants for ARMv8, it took some time to make them work in multicore and big/little CPU environments so that all cores are maxed out as well. For the multithreading I use the POSIX threads (pthreads) library from C++. The required atomic code is written in assembler.

For the graphical output I'm using SDL2. The computation time displayed covers only the calculation, not the graphical output, even though the latter is negligible.

There are 3 versions with different optimisation levels, each for single and for double precision.

Without any command line options, the code first reads the number of available concurrent CPU threads and sets the number of threads used accordingly (usually this seems to be the core count). It then calibrates the timing, and finally the benchmark runs for about 10 seconds.
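The startup steps described above could look roughly like the following C++ sketch. It is not taken from the FracNEON sources: the function names are hypothetical, and the 10 second target is simply passed in as a parameter. It only uses standard library calls (`std::thread::hardware_concurrency` and `std::chrono::steady_clock`).

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <thread>

// Hypothetical helper: number of worker threads when no argument is given.
// hardware_concurrency() may return 0 if the value is unknown, so fall back
// to a single thread in that case.
unsigned default_thread_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}

// Hypothetical helper: time one call of any callable with a monotonic clock.
template <typename F>
std::chrono::duration<double> time_once(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    return std::chrono::steady_clock::now() - t0;
}

// Hypothetical helper: from the duration of one calibration run, pick how
// many repeats are needed to fill the target benchmark time (at least one).
unsigned repeats_for_target(std::chrono::duration<double> one_run,
                            std::chrono::duration<double> target) {
    if (one_run.count() <= 0.0) return 1u;
    double r = std::ceil(target.count() / one_run.count());
    r = std::min(std::max(r, 1.0), 1.0e9);  // keep the cast below in range
    return static_cast<unsigned>(r);
}
```

With these helpers, a calibration run that takes 0.5 seconds and a target of 10 seconds gives 20 repeats.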

You can specify command line arguments to play with the number of threads and the number of repeats:
  • FracNEON_sp_mt <number of threads> <number of repeats>
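As an illustration of how two such optional arguments can be read, here is a small C++ sketch. The `Settings` struct, the `parse_args` function and the rule that invalid values fall back to the defaults are assumptions made for this example; the real program may handle its arguments differently.

```cpp
#include <cstdlib>

// Hypothetical settings record for this sketch.
struct Settings {
    unsigned threads;
    unsigned repeats;
};

// Parse "<threads> <repeats>". Missing or invalid values keep the defaults
// (an assumption of this sketch, not documented behaviour of FracNEON).
Settings parse_args(int argc, const char* const* argv, Settings defaults) {
    Settings s = defaults;
    auto parse = [](const char* text, unsigned fallback) {
        char* end = nullptr;
        unsigned long v = std::strtoul(text, &end, 10);
        // Accept only a fully consumed, positive, reasonably small number.
        bool ok = end != text && *end == '\0' && v > 0 && v <= 1000000ul;
        return ok ? static_cast<unsigned>(v) : fallback;
    };
    if (argc > 1) s.threads = parse(argv[1], defaults.threads);
    if (argc > 2) s.repeats = parse(argv[2], defaults.repeats);
    return s;
}
```

Called from `main(int argc, char** argv)` with the detected core count and the calibrated repeat count as defaults, this leaves both values untouched when no arguments are given.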
The 3 optimisation variants calculate the exact same result for the Mandelbrot set but differ heavily in the assembler implementation to max out - if available - the multiple execution ports, the out-of-order architecture and especially the out-of-order windows of modern ARM cores. In this code I use the NEON extension in single and double precision. A brief description of the 3 variants:
  • opt1 - 1 instruction block, loop unrolling 3 times
  • opt2 - 2 independent instruction blocks, loop unrolling 1 time
  • opt3 - 3 independent instruction blocks, loop unrolling 1 time
To max out speed on all cores, the multithreading code assigns one Mandelbrot set line at a time to each available core. When a core finishes its line, it increments the global line counter and takes the next available line, until the set is complete. This ensures that no core ever runs idle, since each line may take a different time to calculate due to the iterative nature of the computation, especially on big/little cores. The parallelisation therefore reaches something like 99%.

If you have any questions about it, just contact me. It's also my first Linux application, so there might be better ways to code the C++ part or the SDL2 implementation. I might also have missed some possible speed-ups in the assembler code. Benchmark results in table and graph:
Download
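The line-by-line work distribution described in this section can be sketched in portable C++ as follows. This is only an illustration of the idea: it uses `std::thread` and `std::atomic` in place of the pthreads calls and the assembler atomics of the actual program, a plain scalar inner loop in place of the NEON code, and the function names and the image coordinates are made up for the example.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Iteration count for one point c = (cr, ci), plain scalar version.
static std::uint32_t mandel_point(float cr, float ci, std::uint32_t max_iter) {
    float zr = 0.0f, zi = 0.0f;
    std::uint32_t i = 0;
    while (i < max_iter && zr * zr + zi * zi <= 4.0f) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++i;
    }
    return i;
}

// Hypothetical renderer: every worker repeatedly fetches the next free line
// from a shared atomic counter until all lines are done.
std::vector<std::uint32_t> render(int width, int height,
                                  std::uint32_t max_iter, unsigned threads) {
    std::vector<std::uint32_t> image(static_cast<std::size_t>(width) * height);
    std::atomic<int> next_line{0};  // global line counter

    auto worker = [&]() {
        for (;;) {
            // Atomically claim one line; no two workers get the same y.
            int y = next_line.fetch_add(1, std::memory_order_relaxed);
            if (y >= height) break;
            float ci = -1.25f + 2.5f * y / height;
            for (int x = 0; x < width; ++x) {
                float cr = -2.0f + 3.0f * x / width;
                image[static_cast<std::size_t>(y) * width + x] =
                    mandel_point(cr, ci, max_iter);
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();  // join also publishes the image writes
    return image;
}
```

Because a fast core simply comes back for another line sooner than a slow one, no fixed split of the image between big and little cores is needed. With GCC on Linux such code is typically built with `-pthread`.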
FracNEON_sp_opt V1.0 (11/06/2021)

Based on my efforts on 32-bit x86, I coded 64-bit ARMv8 assembler versions of my Mandelbrot benchmark. The archive contains 3 executables for Linux (tested on Kali 64-bit Linux and Ubuntu) and the sources. For the results and graphical output I'm using some C++ and SDL2 code. Text-only versions are also included. The computation time displayed as a result covers only the calculation, not the graphical output, even though the latter is negligible.

The 3 versions calculate the exact same result and number of iterations but differ in the assembler implementation to max out - if available - the multiple execution ports, the out-of-order architecture and especially the out-of-order windows of modern ARM cores. In this case I use the NEON extension in single precision on a single core only. Future versions are planned to use multiple cores, and I plan to add double precision and VFP versions. A brief description of the 3 versions:
  • opt1 - 1 instruction block, loop unrolling 3 times
  • opt2 - 2 independent instruction blocks, loop unrolling 1 time
  • opt3 - 3 independent instruction blocks, loop unrolling 1 time
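To give an idea of what "independent instruction blocks" means, here is a scalar C++ sketch. It is not the author's assembler code: the real versions work on NEON registers holding several floats each, while this example advances two single points in one loop, and the function names are invented for illustration. The point is only that the two blocks do not depend on each other, so an out-of-order core can overlap their latencies.

```cpp
#include <array>
#include <cstdint>

// Reference: a single dependency chain, one point at a time.
std::uint32_t iterate_one(float cr, float ci, std::uint32_t max_iter) {
    float zr = 0.0f, zi = 0.0f;
    std::uint32_t n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0f) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

// Two independent blocks in the same loop body. Block 0 never reads a value
// written by block 1 and vice versa.
std::array<std::uint32_t, 2> iterate_two(const std::array<float, 2>& cr,
                                         const std::array<float, 2>& ci,
                                         std::uint32_t max_iter) {
    float zr0 = 0.0f, zi0 = 0.0f, zr1 = 0.0f, zi1 = 0.0f;
    std::uint32_t n0 = 0, n1 = 0;
    bool live0 = true, live1 = true;
    for (std::uint32_t i = 0; i < max_iter && (live0 || live1); ++i) {
        // Escape tests for both blocks.
        live0 = live0 && (zr0 * zr0 + zi0 * zi0 <= 4.0f);
        live1 = live1 && (zr1 * zr1 + zi1 * zi1 <= 4.0f);
        if (live0) {  // block 0
            float t = zr0 * zr0 - zi0 * zi0 + cr[0];
            zi0 = 2.0f * zr0 * zi0 + ci[0];
            zr0 = t;
            ++n0;
        }
        if (live1) {  // block 1
            float t = zr1 * zr1 - zi1 * zi1 + cr[1];
            zi1 = 2.0f * zr1 * zi1 + ci[1];
            zr1 = t;
            ++n1;
        }
    }
    return {{n0, n1}};
}
```

Both functions are meant to return the same iteration counts per point; only the amount of independent work available to the core per loop pass differs.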
If you have any questions about it, just contact me. It's also my first Linux application, so there might be better ways to code the C++ part or the SDL2 implementation. I might also have missed some possible speed-ups in the assembler code. Benchmark results in table and graph:
Download
FracNEON_dp_opt V1.0 (03/08/2021)
Basically the same effort as the single precision version, now using double precision floats. One NEON register can therefore hold only 2 floats compared to the 4 floats in single precision, achieving more or less half the speed. Benchmark results in table and graph:
Download
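The lane arithmetic behind the single versus double precision comparison can be written down as a tiny C++ check. The names are invented for this example; the only fact used is that a NEON vector register is 128 bits (16 bytes) wide.

```cpp
#include <cstddef>

// A NEON vector register is 128 bits wide, i.e. 16 bytes.
constexpr std::size_t kNeonRegisterBytes = 16;

// Hypothetical helper: how many values of type T fit into one register.
template <typename T>
constexpr std::size_t lanes_per_register() {
    return kNeonRegisterBytes / sizeof(T);
}

static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "sketch assumes IEEE 754 float and double sizes");
static_assert(lanes_per_register<float>() == 4, "4 single precision lanes");
static_assert(lanes_per_register<double>() == 2, "2 double precision lanes");
```

With half as many points processed per instruction, roughly half the throughput is what one would expect if everything else stays equal.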