## How to use 100 percent of the raspberry pi cpu in C programs

ErezDahan
Posts: 1
Joined: Mon Mar 13, 2017 10:47 pm

### How to use 100 percent of the raspberry pi cpu in C programs

I wrote a long C program for audio processing.
When I ran the program on the Raspberry Pi, I saw that CPU usage was only 25 percent.
The execution time was 50 seconds, which is too long for my application.
My question is:
What do I need to do to use 100 percent of the CPU when I run my programs?

LdB
Posts: 1648
Joined: Wed Dec 07, 2016 2:29 pm

### Re: How to use 100 percent of the raspberry pi cpu in C prog

Let me guess: you are on a Pi 2 or Pi 3.

They have 4 cores, and your C code runs only one of them flat out, which shows up as 25% in a desktop CPU usage app.

If that is the case, it is not a C problem. You have to write code for multiple processors if you want to use all the cores, and that is a whole learning exercise.

However, that aside, you need to look at what is taking all the time in your code before you can even work out how to tackle it. We have no idea whether it's complex maths, disk I/O, memory transfer... You need to isolate why and where it's slow. This is usually called program profiling, and to help we need to understand where all the time is going.

EULERPI
Posts: 51
Joined: Sun May 15, 2016 2:44 pm

### Re: How to use 100 percent of the raspberry pi cpu in C prog

Hi,

I echo LdB's guidance and would just add that once you've worked out whether the programme can go any faster (profiling) you can look at 'multiprocessing'.

Assuming your Pi has four cores and your program is capable of being split into parallel tasks, then you can use threading in C. There may be a cost in time for starting each thread and transferring the data to and from it, but if your programme's operation is CPU intensive there should be an overall speed increase in the parallel section of the programme code.

My C textbook didn't cover threading so I learnt from trying the examples I found online.

Also, you might get some speed-up if the most intensive parts of the computation can use data types and calculations that the processor handles faster.

Regards

Nick

stephj
Posts: 80
Joined: Thu Jun 21, 2012 1:20 pm
Location: Lancashire, UK

### Re: How to use 100 percent of the raspberry pi cpu in C prog

This will probably require an awful lot of thinking about first, before you decide whether it is actually worth the effort.

Can the problem be broken into tasks that can run completely independently of each other?

You could have four threads, each processing its own quarter of the file concurrently, but if you end up with four separate output files that you have to somehow “stick back together” at the end, is there any advantage? If all four cores are carrying out read-process-write cycles, this may also create an input/output pinch point on the SD card.

Chapter 4 of Advanced Linux Programming may come in useful.

Chapters 3 and 5 might also be handy.

jahboater
Posts: 6300
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

### Re: How to use 100 percent of the raspberry pi cpu in C prog

If you aren't using a Pi 3, then I suggest upgrading to one.

If it's an earlier Pi, then you might get some benefit from overclocking it.

Otherwise as noted previously, try to use all 4 cores with threads or multiple processes (easier said than done).

mikerr
Posts: 2826
Joined: Thu Jan 12, 2012 12:46 pm
Location: UK

### Re: How to use 100 percent of the raspberry pi cpu in C prog

I wrote this ultra simple code to show someone processes/cores on single core vs quad core
so I'll drop it in here

coretest.c

Code: Select all

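#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    volatile int c;   /* volatile so the compiler keeps the empty loop */
    int p, procs=1;
    pid_t pid;

    if (argc > 1)  procs = atoi(argv[1]);

    /* Start one child per requested process; each child counts, then exits */
    for (p=0;p<procs;p++) {
        pid = fork ();
        if (pid < 0) { perror("fork"); break; }
        if (pid == 0) {
            for (c=0;c<1000*1000000;c++);
            _exit(0);
        }
    }

    /* The parent waits for every child, so "time" covers all of them */
    while (wait(NULL) > 0);
    return 0;
}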

That just counts to 1000 million on each process.

On a single core CPU like Pi B+ or Pi Zero it takes around 15 seconds per process, and running 2 in parallel ends up taking twice as long (since only one core!)

Code: Select all

pi@raspi:~ $ time ./coretest 1
real    0m14.901s
pi@raspi:~ $ time ./coretest 2
real    0m31.004s

Running on a 4 core Pi 3 takes about the same time for 1 or 4 processes in parallel:

Code: Select all

pi@pi3:~ $ time ./coretest 1
real    0m10.738s
pi@pi3:~ $ time ./coretest 2
real    0m11.714s
pi@pi3:~ $ time ./coretest 4
real    0m13.038s

You have to push it to 8 processes to double total time.

[Image: total CPU usage for the Pi 3 runs with 1, 2, 3 and 4 processes, showing 25/50/75/100% CPU (from the Android app Pi Healthcheck)]

gordon@drogon.net
Posts: 2023
Joined: Tue Feb 07, 2012 2:14 pm
Location: Devon, UK

### Re: How to use 100 percent of the raspberry pi cpu in C prog

ErezDahan wrote:
> I wrote a long C program for audio processing.
> When I ran the program on the Raspberry Pi, I saw that CPU usage was only 25 percent.
> The execution time was 50 seconds, which is too long for my application.
> My question is:
> What do I need to do to use 100 percent of the CPU when I run my programs?

I think others have given you the answers, but just to check: open a terminal window, run top, then press the '1' key. That will show you the CPU usage of each of the 4 cores.

-Gordon
--
Gordons projects: https://projects.drogon.net/

stephj
Posts: 80
Joined: Thu Jun 21, 2012 1:20 pm
Location: Lancashire, UK

### Re: How to use 100 percent of the raspberry pi cpu in C prog

Let’s use a practical example, although whether you consider number bashing a worthy cause is another matter. This code calculates pi to a defined number of places, 10,000 in this example, using the Taylor series expansion of arctan in 16*arctan(1/5) - 4*arctan(1/239). There are quicker ways to calculate pi, admittedly, but the original code was written in Fortran and ran on an IBM 4381 mid-range mainframe. A modern i7 processor can run rings round it, and could probably emulate it as fast as the original machine, if not faster. The 4381 had up to 16 MB of memory. Wow!

https://www-03.ibm.com/ibm/history/exhi ... P4381.html

It's the upright blue box with the black panel in it, about the same size as an American-style upright fridge/freezer/ice dispenser.

Over the years it has been re-written in C and been tweaked several times to improve performance. As the arrays are processed they fill up from the left with leading zeros. The log10() functions figure out how many of the leading elements are zeros and skip over them saving processing time. As more features have been added to x86 and finally x64 architectures, it has been tweaked further. Multi-core support was added, but not without some serious reworking.

The latest version is not this one, but an x64 Windows version that uses x64 assembler subroutines for the add(), subtract(), and divide() functions, allowing 18 digits of the result to be held in an __int64 array element. This version is restricted to working on 9 digits at a time.

The original program used a pi[3][MAX_NO_ELEMENTS] array. To enable four threads to run simultaneously this is increased to pi[12][MAX_NO_ELEMENTS]. Each thread operates on three of the twelve arrays and doesn’t touch the others.

arctan(x) = x - x^3/3 + x^5/5 - x^7/7 + x^9/9 - ...

This calculation is split into two tasks, one sums all the positive terms, the other sums all the negative terms. Two such threads are created to process arctan(1/5), and another two to process arctan(1/239) .

Once the original setup has been completed, main() creates the four threads described above. No further processing is possible until all four threads have finished, so main() then waits until they are all complete before continuing. The results of all four threads are then combined to get the final value of pi. The result is written to the file P10000.txt. I would ignore the last few digits of the answer.

As it runs, all four cores initially run flat out. arctan(1/239) converges much faster than arctan(1/5), so those two threads finish before the others, and the remaining two cores continue flat out (50% total usage) until they complete.

If you want more decimal places, change no_of_places=10000; in main().

You may also have to change #define MAX_NO_ELEMENTS 100000, but there is a rudimentary run-time check in main() that will detect an impending segmentation fault.

Compile it with: gcc pi.c -o pi -lm -lpthread
and run it with ./pi

The code will run on single core Raspberry Pi machines, but you won’t see the performance boost of four cores.

PeterO, I have checked that this code will run correctly on an x64 Linux release.

The file is written out as one continuous line of characters. If you want a newline every n characters, say, pass n as the second parameter of output(pi[2],100); in main() (currently 100 chars) and uncomment the two commented-out lines in output().

P.S. GNU licence bit added. Feel free to amend or do whatever you will with it.

Code: Select all

// pi.c
//
// Calculate pi to the required number of places using
// the expansion of 16*arctan(1/5) - 4*arctan(1/239)
//
// Where arctan(x) = x - x^3/3 + x^5/5 - x^7/7 + x^9/9 - ...
//
// Each thread will sum the positive or negative terms
// for the expansion of arctan(1/5) or arctan(1/239)
// At completion they will be combined to provide the final value for pi.
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
#include <stdio.h>
#include <math.h>
#include <pthread.h>
#include <inttypes.h>

#define MAX_NO_ELEMENTS 100000

int32_t pi[12][MAX_NO_ELEMENTS];
int32_t no_of_places, elmend;

void divide(int32_t *arr1,int32_t *arr2,int32_t elmstr,int32_t elmend,uint64_t value);
void add(int32_t *arr1,int32_t *arr2,int32_t elmstr,int32_t elmend);
void subtract(int32_t *arr1,int32_t *arr2,int32_t elmstr,int32_t elmend);
void output(int32_t *arr1, int32_t linelength);

void * arctan_plus(void *arg);
void * arctan_minus(void *arg);

int main()
{
    pthread_t tid[4];
    uint32_t ul5=5;
    uint32_t ul239=239;
    int i;

    no_of_places=10000;
    elmend=(no_of_places/9)+2;

    if(elmend>=MAX_NO_ELEMENTS)
    {
        printf("Abort! Number of places too big! Increase MAX_NO_ELEMENTS.\n");
        return 1;
    }

    // pi = 16*arctan(1/5) - 4*arctan(1/239)
    // Initialize positive arctan(1/5) arrays
    *pi[0]=*pi[2]=16;
    divide(pi[0],pi[0],0,elmend, (uint64_t) 5);
    divide(pi[2],pi[2],0,elmend, (uint64_t) 5);
    // Initialize negative arctan(1/5) arrays (5^3 =125)
    *pi[3]=*pi[5]=16;
    divide(pi[3],pi[3],0,elmend, (uint64_t) 125);
    divide(pi[3],pi[5],0,elmend, (uint64_t) 3);
    // Initialize positive arctan(1/239) arrays
    *pi[6]=*pi[8]=4;
    divide(pi[6],pi[6],0,elmend, (uint64_t) 239);
    divide(pi[8],pi[8],0,elmend, (uint64_t) 239);
    // Initialize negative arctan(1/239) arrays (239^3 = 13651919)
    *pi[9]=*pi[11]=4;
    divide(pi[9],pi[9],0,elmend, (uint64_t) 13651919);
    divide(pi[9],pi[11],0,elmend,(uint64_t) 3);

    // Create the four threads: positive and negative terms
    // for arctan(1/5), then the same for arctan(1/239).
    if(pthread_create(&tid[0],NULL,arctan_plus,&ul5)   ||
       pthread_create(&tid[1],NULL,arctan_minus,&ul5)  ||
       pthread_create(&tid[2],NULL,arctan_plus,&ul239) ||
       pthread_create(&tid[3],NULL,arctan_minus,&ul239))
    {
        printf("Abort! Can't create the threads.\n");
        return 1;
    }

    // Wait until all four threads complete.
    for(i=0;i<4;i++)
        pthread_join(tid[i],NULL);

    // Set pi[2] to have the value 16*arctan(1/5)
    subtract(pi[5],pi[2],0,elmend);
    // - 4*arctan(1/239)
    // Subtract the sum of the all positive terms.....,
    subtract(pi[8],pi[2],0,elmend);
    // then add the sum of the all negative terms
    add(pi[11],pi[2],0,elmend);
    // Complete.
    // The array pi[2] now holds the value of pi
    output(pi[2],100);
    return 0;
}

void * arctan_plus(void *arg)
{
    uint32_t n;
    uint64_t n_4;
    uint64_t next_odd_number=1;
    double logn4;
    int32_t no_of_zero_elements=0;
    int32_t temp_zero_elements;
    int32_t *p0,*p1,*p2;

    n=*(uint32_t *) arg;
    n_4=(uint64_t)n*n*n*n;
    logn4=log10((double)n_4);

    // p0 = current power term, p1 = p0/odd number, p2 = running sum
    p0 = (n==5)? pi[0] : pi[6];
    p1 = (n==5)? pi[1] : pi[7];
    p2 = (n==5)? pi[2] : pi[8];

    while(no_of_zero_elements<elmend)
    {
        next_odd_number+=4;

        // Skip the leading elements that are already zero,
        // keeping one element in hand as a safety margin.
        temp_zero_elements=no_of_zero_elements-1;

        if(temp_zero_elements<0)
            temp_zero_elements=0;

        divide(p0,p0,temp_zero_elements,elmend,n_4);
        divide(p0,p1,temp_zero_elements,elmend,next_odd_number);
        add(p1,p2,0,elmend);

        // Each pass divides the term by n^4, which removes log10(n^4)
        // decimal digits; nine digits make up one array element.
        no_of_zero_elements=(int32_t)((double)(next_odd_number/4)*logn4/9.0);
    }
    return arg;
}

void * arctan_minus(void *arg)
{
    uint32_t n;
    uint64_t n_4;
    uint64_t next_odd_number=3;
    double logn4;
    int32_t no_of_zero_elements=0;
    int32_t temp_zero_elements;
    int32_t *p0,*p1,*p2;

    n=*(uint32_t *) arg;
    n_4=(uint64_t)n*n*n*n;
    logn4=log10((double)n_4);

    // p0 = current power term, p1 = p0/odd number, p2 = running sum
    p0 = (n==5)? pi[3] : pi[9];
    p1 = (n==5)? pi[4] : pi[10];
    p2 = (n==5)? pi[5] : pi[11];

    while(no_of_zero_elements<elmend)
    {
        next_odd_number+=4;

        // Skip the leading elements that are already zero,
        // keeping one element in hand as a safety margin.
        temp_zero_elements=no_of_zero_elements-1;

        if(temp_zero_elements<0)
            temp_zero_elements=0;

        divide(p0,p0,temp_zero_elements,elmend,n_4);
        divide(p0,p1,temp_zero_elements,elmend, next_odd_number);
        add(p1,p2,0,elmend);

        // Each pass divides the term by n^4, which removes log10(n^4)
        // decimal digits; nine digits make up one array element.
        no_of_zero_elements=(int32_t)((double)(next_odd_number/4)*logn4/9.0);
    }
    return arg;
}

// arr2 = arr2 + arr1, working from the least significant element upwards.
// Each element holds nine decimal digits (base 1000000000).
void add(int32_t *arr1,int32_t *arr2,int32_t elmstr,int32_t elmend)
{
    int32_t i,carry=0;

    for(i=elmend;i>=elmstr;i--)
    {
        arr2[i]=arr2[i]+arr1[i]+carry;
        carry=arr2[i]/1000000000;
        arr2[i]%=1000000000;
    }
}

// arr2 = arr2 - arr1, borrowing from the next element up when needed.
void subtract(int32_t *arr1,int32_t *arr2,int32_t elmstr,int32_t elmend)
{
    int32_t i,carry=0;

    for(i=elmend;i>=elmstr;i--)
    {
        arr2[i]=arr2[i]-arr1[i]-carry;
        if(arr2[i]<0)
        {
            carry=1;
            arr2[i]+=1000000000;
        }
        else
            carry=0;
    }
}

// arr2 = arr1 / value, long division from the most significant element down.
// Elements before elmstr are assumed to be zero and are skipped.
void divide(int32_t *arr1,int32_t *arr2,int32_t elmstr,int32_t elmend, uint64_t value)
{
    uint64_t product,carry=0;
    int32_t i;

    for(i=elmstr;i<elmend;i++)
    {
        product=1000000000*carry+arr1[i];
        carry=product%value;
        product/=value;
        arr2[i]=product;
    }
}

// Write the result to P<no_of_places>.txt, nine digits per element.
void output(int32_t *arr1, int32_t linelength)
{
    int32_t i,j;
    int32_t charsout;
    FILE *pi_text;
    char pi_file[20];
    char buffer[15];

    sprintf(pi_file,"P%d.txt",no_of_places);

    if((pi_text=fopen(pi_file,"w"))==NULL)
    {
        printf("Can't open %s for output!\n",pi_file);
        return;
    }
    fprintf(pi_text,"%d.",arr1[0]);
    charsout=2;
    for(i=1;i<elmend-1;i++)
    {
        sprintf(buffer,"%09ld",(long) arr1[i]);
        for(j=0;j<9;j++)
        {
            fputc(buffer[j],pi_text);
            charsout++;
            charsout%=linelength;
            // if(charsout==0)
            //	  fputc('\n',pi_text);
        }
    }
    fputc('\n',pi_text);
    fclose(pi_text);
}