Killertechno
Posts: 180
Joined: Wed Jan 02, 2013 8:28 am

Little doubt on floating point precision

Fri Dec 28, 2018 9:54 pm

Hi to all, I'm porting Python code to C code and I have alittle doubt.

On my Python code I have:

Code: Select all

a = 18781
b = 0.997649
c = a*b
print("%f" % (c))
result is 18736.851729 while in C code:

Code: Select all

        int a = 18781;
        float b = 0.997649;
        double c = 0.997649;


        printf("%f\r\n", a*b);
        printf("%f\r\n", (float)(a)*b);
        printf("%f\r\n", (double)(a)*b);
result is 18736.846132 regardless operation.

OpenOffice (and calc) says: 18781 * 0.997649 = 18736.845869

Python result is so near to calc instead C result. Is this normal or do I need to set some particular flag during C compiling (gcc -o main main.c)?
(I'm doing digital filtering, so at the end these small errors would add to each other....)
Thanks.

jahboater
Posts: 4772
Joined: Wed Feb 04, 2015 6:38 pm

Re: Little doubt on floating point precision

Fri Dec 28, 2018 10:53 pm

Killertechno wrote:
Fri Dec 28, 2018 9:54 pm
OpenOffice (and calc) says: 18781 * 0.997649 = 18736.845869
Yes that's right.
It is also exactly what I get from Python and C.

Code: Select all

$ python3
Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 18781 * 0.997649
18736.845869
>>> 
and in C:

Code: Select all

cat test.c
#include <stdio.h>
int main( void )
{
  int a = 18781;
  double b = 0.997649;
  printf( "%f\n", a * b );
}
$ gcc test.c -o test
$ ./test
18736.845869
So OpenOffice, calc, Python, Python3, and C (using "double") all give consistent identical results.
I don't know why your Python differs?

Killertechno
Posts: 180
Joined: Wed Jan 02, 2013 8:28 am

Re: Little doubt on floating point precision

Fri Dec 28, 2018 11:14 pm

Here another example code:

Code: Select all

#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
#include <dirent.h>
#include <libgen.h>
#include <math.h>
#include <netinet/in.h>
#include <sys/stat.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <syslog.h>
#include <time.h>
#include <unistd.h>




#define hpf_filterorder 	1		// ordine del filtro passa-alto per la rimozione dell'offset
#define max_components  	1


#define filter_order 		hpf_filterorder	

struct FilterBuffer
{
	float input[filter_order+1];
	float output[filter_order+1];
};


struct FilterBuffer FilterValues[max_components];			// contiene i valori
struct FilterBuffer FilterCoeff;				// contiene i coefficienti in base al tipo di filtro



float f_HPF_0R075_100_b[filter_order+1] = {0.99764934, -1.0};
float f_HPF_0R075_100_a[filter_order+1] = {1.0, 0.99529868};


void Filter (void);




unsigned char maxFilterOrder;

void Filter (void)
{
	
	unsigned char ch;
	unsigned char n;
        float accu;

	for (ch=0; ch<max_components; ch++)
	{
		printf("\t\t\t\t\t\t[%f\t%f] (input buffer at beginning)\r\n",FilterValues[ch].input[0], FilterValues[ch].input[1] );
		printf("multiply by %f (value should be %f)\r\n", FilterCoeff.input[0], FilterValues[ch].input[0] * FilterCoeff.input[0]);
		FilterValues[ch].input[0] = FilterValues[ch].input[0] * FilterCoeff.input[0];	
		printf("\t\t\t\t\t\t[%f\t%f] (b)\r\n",FilterValues[ch].input[0], FilterValues[ch].input[1] );   // here  value of FilterValues[ch].input[0]
		
		accu = 4.0;
		printf("\t\t\t\t\t\t[%f\t%f] (p)\r\n",FilterValues[ch].input[0], FilterValues[ch].input[1] );   // here FilterValues[ch].input[0] has changed
		accu = 0.0;
		printf("\t\t\t\t\t\t[%f\t%f] (m)\r\n",FilterValues[ch].input[0], FilterValues[ch].input[1] );
		for( n=filter_order; n >0; n--)
		{
			printf("\t\t\t\t\t\t[%f\t%f] (r)\r\n",FilterValues[ch].input[0], FilterValues[ch].input[1] );
			printf("%d\r\n", n);
			printf("\t\t\t\t\t\t[%f\t%f] (c)\r\n",FilterValues[ch].input[0], FilterValues[ch].input[1] );
			accu += FilterValues[ch].input[n]*FilterCoeff.input[n];
			printf("\t\t\t\t\t\t[%f\t%f] (d)\r\n",FilterValues[ch].input[0], FilterValues[ch].input[1] );
			printf("\taccu = %f\r\n", accu);
			printf("\t\t\t\t\t\t[%f\t%f] (e)\r\n",FilterValues[ch].input[0], FilterValues[ch].input[1] );
			accu += FilterValues[ch].output[n]*FilterCoeff.output[n];
			printf("\t\t\t\t\t\t[%f\t%f] (f)\r\n",FilterValues[ch].input[0], FilterValues[ch].input[1] );
			printf("\taccu = %f\r\n", accu);


			FilterValues[ch].input[n] = FilterValues[ch].input[n-1];
			FilterValues[ch].output[n] = FilterValues[ch].output[n-1];
			
		}
		printf("\t\tadding %f\r\n", FilterValues[ch].input[0]);
		
		accu += FilterValues[ch].input[0];
			printf("\taccu = %f\r\n", accu);
		
		FilterValues[ch].output[0] = accu;
		FilterValues[ch].output[1] = accu;	// shifto già per il prossimo passaggio
		printf("%f\r\n", accu);
		
	}



	
	
}



int main(int argc, char *argv[]){
	unsigned char i;
	unsigned char ch;
	
	// filter init 
	for (i=0; i<(filter_order+1); i++)
	{
		printf("loading coeff for x(n-%d)\r\n", i);
		FilterCoeff.input[i] = f_HPF_0R075_100_b[i];
		FilterCoeff.output[i] = f_HPF_0R075_100_a[i];
	}
	
	printf("\r\n");
	for (ch=0; ch<max_components; ch++)
	{
		printf("zeroing values for component %d\r\n", ch);
		for (i=0; i<(filter_order+1); i++)
		{
			printf("%d ", i);
			FilterValues[ch].input[i] = 0.0;
			FilterValues[ch].output[i] = 0.0;
		}
	}	
	printf("\r\n\r\n\r\n");
	printf("start filtering...\r\n");
	printf("\r\n\r\n\r\n");	
	
	
	FilterValues[0].input[0] = 18781.0;
	Filter();
	
	FilterValues[0].input[0] = 18831.0;
	Filter();
	
	
	
	
	
	return 0;
}
gcc -o testfilter testfilter.c -O2 (I get slightly different results compiling without -O2 but I suppose it would be normal)


Then this is the output:

Code: Select all

~/myc/test_filter$ ./testfilter 
loading coeff for x(n-0)
loading coeff for x(n-1)

zeroing values for component 0
0 1 


start filtering...



						[18781.000000	0.000000] (input buffer at beginning)
multiply by 0.997649 (value should be 18736.851729)
						[18736.851729	0.000000] (b)
						[18736.851562	0.000000] (p)
						[18736.851562	0.000000] (m)
						[18736.851562	0.000000] (r)
1
						[18736.851562	0.000000] (c)
						[18736.851562	0.000000] (d)
	accu = 0.000000
						[18736.851562	0.000000] (e)
						[18736.851562	0.000000] (f)
	accu = 0.000000
		adding 18736.851562
	accu = 18736.851562
18736.851562
						[18831.000000	18736.851562] (input buffer at beginning)
multiply by 0.997649 (value should be 18786.734195)
						[18786.734195	18736.851562] (b)
						[18786.734375	18736.851562] (p)
						[18786.734375	18736.851562] (m)
						[18786.734375	18736.851562] (r)
1
						[18786.734375	18736.851562] (c)
						[18786.734375	18736.851562] (d)
	accu = -18736.851562
						[18786.734375	18736.851562] (e)
						[18786.734375	18736.851562] (f)
	accu = -88.087868
		adding 18786.734375
	accu = 18698.646507
18698.646484
I can't understand WHY value changes from (b) and (p) lines!!!!!!!!!

Killertechno
Posts: 180
Joined: Wed Jan 02, 2013 8:28 am

Re: Little doubt on floating point precision

Fri Dec 28, 2018 11:27 pm

Sorry, taking a look to Python code "a" is not an integer as I supposed but iy says <type 'numpy.ndarray'>.

Killertechno
Posts: 180
Joined: Wed Jan 02, 2013 8:28 am

Re: Little doubt on floating point precision

Fri Dec 28, 2018 11:57 pm

This not working:

Code: Select all

#include <stdio.h>


#define hpf_filterorder 	1		// ordine del filtro passa-alto per la rimozione dell'offset
#define max_components  	1


#define filter_order 		hpf_filterorder	

struct FilterBuffer
{
	float input[filter_order+1];
	float output[filter_order+1];
};


struct FilterBuffer FilterValues[max_components];			// contiene i valori
struct FilterBuffer FilterCoeff;				// contiene i coefficienti in base al tipo di filtro



float f_HPF_0R075_100_b[filter_order+1] = {0.99764934, -1.0};
float f_HPF_0R075_100_a[filter_order+1] = {1.0, 0.99529868};




void Filter2 (void);


void Filter2 (void)
{
	unsigned char cha;
	unsigned char n;
	
	cha = 0;
	printf("\t\t\t\t\t\t[%f\t%f] (input buffer at beginning)\r\n",FilterValues[cha].input[0], FilterValues[cha].input[1] );
	printf("multiply %f by %f (value should be %f)\r\n", FilterValues[cha].input[0], FilterCoeff.input[0], FilterValues[cha].input[0] * FilterCoeff.input[0]);
	
	FilterValues[cha].input[0] = FilterValues[cha].input[0] *  FilterCoeff.input[0];
	printf("\t\t\t\t\t\t[%f\t%f] (b)\r\n",FilterValues[cha].input[0], FilterValues[cha].input[1] );
	printf("\t\t\t\t\t\t[%f\t%f] (c)\r\n",FilterValues[cha].input[0], FilterValues[cha].input[1] );
}



int main(int argc, char *argv[]){
	unsigned char i;
	unsigned char ch;

	// filter init 
	for (i=0; i<(filter_order+1); i++)
	{
		printf("loading coeff for x(n-%d)\r\n", i);
		FilterCoeff.input[i] = f_HPF_0R075_100_b[i];
		FilterCoeff.output[i] = f_HPF_0R075_100_a[i];
	}
	
	printf("\r\n");
	for (ch=0; ch<max_components; ch++)
	{
		printf("zeroing values for component %d\r\n", ch);
		for (i=0; i<(filter_order+1); i++)
		{
			printf("%d ", i);
			FilterValues[ch].input[i] = 0.0;
			FilterValues[ch].output[i] = 0.0;
		}
	}	
	printf("\r\n\r\n\r\n");
	printf("start filtering...\r\n");
	printf("\r\n\r\n\r\n");	
	
	
	FilterValues[0].input[0] = 18781.0;
	Filter2();

	
	return 0;
}

Output is:

Code: Select all

start filtering...



						[18781.000000	0.000000] (input buffer at beginning)
multiply 18781.000000 by 0.997649 (value should be 18736.851729)
						[18736.851729	0.000000] (b)
						[18736.851562	0.000000] (c)    First value array has changed!!!!!!!!!!!

Now I change from float to double:

Code: Select all

struct FilterBuffer
{
	double input[filter_order+1];
	double output[filter_order+1];
};


struct FilterBuffer FilterValues[max_components];			// contiene i valori
struct FilterBuffer FilterCoeff;				// contiene i coefficienti in base al tipo di filtro



double f_HPF_0R075_100_b[filter_order+1] = {0.99764934, -1.0};
double f_HPF_0R075_100_a[filter_order+1] = {1.0, 0.99529868};

And result looks correct:

Code: Select all

start filtering...



						[18781.000000	0.000000] (input buffer at beginning)
multiply 18781.000000 by 0.997649 (value should be 18736.852255)
						[18736.852255	0.000000] (b)
						[18736.852255	0.000000] (c) First value array has not changed!!!!!

Am I missing something about programming or compiling?

User avatar
Paeryn
Posts: 2706
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Little doubt on floating point precision

Sat Dec 29, 2018 3:50 am

Floats only have 24 bits of significand precision, doubles have 53. Python should be using doubles for all floating point values.

0.997649 cannot be represented exactly by either a float or a double, but a double is a lot closer. Floats and doubles are in binary format, a lot of exact decimal fractions cannot be represented as exact binary fractions, even 0.1 has no finite binary representation.

These aren't exact values but it shows you how close the float and double representations of 0.997649 are.

Code: Select all

float  b = 0.99764901399612426758
double c = 0.99764900000000000001
Mix in the fact that you then multiply that by 18781 then if you do that entirely in floats your answer is going to be further out than if you convert the float to a double and do the multiplication in doubles or use doubles for the whole thing.
She who travels light — forgot something.

User avatar
jojopi
Posts: 3085
Joined: Tue Oct 11, 2011 8:38 pm

Re: Little doubt on floating point precision

Sat Dec 29, 2018 4:23 am

The Pi/Raspbian use IEEE754 binary floating point representation. The float type has 24 bits of precision, which is equivalent to around 7 significant digits in decimal. The double type has 53 bits, equivalent to 15 digits.

A number such as 18736.845869 will be accurate enough if you calculate and store it using double. Using float, the best you can hope for is that it rounds correctly in the seventh digit, giving 18736.85.

The "%f" format specifier for printf requests a fixed-point output, and it defaults to six digits after the decimal point. If you pass a float and there is more than one digit before the decimal point, then there simply is not the precision for there also to be six correct digits after. So you get something like 18736.85XXXX, where the last four digits are essentially garbage. (Actually the decimal representation of trailing zeros in the binary.)

Unfortunately there is no easy way to avoid printing more digits than you have using "%f". A format such as "%.7g" or "%.6e" may be a better choice if you do not know in advance how big the numbers are going to be.

I cannot reproduce your behaviour where printing the same variable twice gives a slightly different result. Since float arguments to printf have to be promoted to double anyway, maybe sometimes the compiler is passing an actual double instead of truncating to float and promoting back? I doubt that that would be standards-compliant, but it also would not be a problem if you were not trying to print more digits than you have.

Just use double. You are not likely to see much performance improvement in single precision, unless you are dealing with so much data that the size of the arrays in memory becomes significant.

Incidentally, in C the newline sequence is "\n", not "\r\n". A carriage return will be added for you on platforms that want it. Linux is not one of those.

jahboater
Posts: 4772
Joined: Wed Feb 04, 2015 6:38 pm

Re: Little doubt on floating point precision

Sat Dec 29, 2018 8:06 am

Paeryn wrote:
Sat Dec 29, 2018 3:50 am
Floats only have 24 bits of significand precision, doubles have 53. Python should be using doubles for all floating point values.

0.997649 cannot be represented exactly by either a float or a double, but a double is a lot closer. Floats and doubles are in binary format, a lot of exact decimal fractions cannot be represented as exact binary fractions, even 0.1 has no finite binary representation.

These aren't exact values but it shows you how close the float and double representations of 0.997649 are.

Code: Select all

float  b = 0.99764901399612426758
double c = 0.99764900000000000001
Mix in the fact that you then multiply that by 18781 then if you do that entirely in floats your answer is going to be further out than if you convert the float to a double and do the multiplication in doubles or use doubles for the whole thing.
Yes,

Further confused by having two limits DBL_DIG (15) and DBL_DECIMAL_DIG (17) (for doubles) :(
which is only FLT_DIG (6) and FLT_DECIMAL_DIG (9) (for float).

In C99 hex, 0.997649 is 0x1.fecbd987c6328p-1

The multiplication "18781.0 * 0.997649" raises the FE_INEXACT exception ...

IEEE 754-2008 specifies decimal arithmetic "_Decimal64" etc if you want exact results (available in C).

Return to “C/C++”