My demo looks like this:

A:

    float aa=1,bb=1,cc=1,dd=1;
    for(int i;i<10000000;i++){
        aa=aa+aa;
        bb=bb+bb;
        cc=cc+cc;
        dd=dd+dd;
    }

B:

    float32x4_t ff={1,1,1,1};
    for(int i;i<10000000;i++){
        ff=vaddq_f32(ff,ff);
    }

Code A and code B take the same amount of time to run.

There appear to be two bugs: The first bug is that the counter va...
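For comparison, here is a minimal sketch of how loops A and B could be set up for timing. It differs from the snippets above in two ways that are my own assumptions: the loop counter `i` is explicitly initialised to 0 (in the snippets it is declared as `int i;` with no initial value), and each loop writes its result to an output array so the work is observable. The names `scalar_double`, `vector_double`, and `time_call` are hypothetical helpers, not part of the original demo. The NEON path is guarded by `__ARM_NEON` and falls back to the scalar loop elsewhere, only so that the sketch builds on non-ARM machines.

```c
#include <assert.h>
#include <time.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Loop A: four independent scalar additions per iteration.
   The counter starts at 0 (assumption: that is what the demo intended).
   Results go to out[] so the loop has an observable effect. */
void scalar_double(int iters, float out[4]) {
    assert(iters >= 0);
    float aa = 1, bb = 1, cc = 1, dd = 1;
    for (int i = 0; i < iters; i++) {
        aa = aa + aa;
        bb = bb + bb;
        cc = cc + cc;
        dd = dd + dd;
    }
    out[0] = aa; out[1] = bb; out[2] = cc; out[3] = dd;
}

/* Loop B: one 4-lane vaddq_f32 per iteration when NEON is available.
   Without NEON this simply calls the scalar version (build fallback only). */
void vector_double(int iters, float out[4]) {
    assert(iters >= 0);
#if defined(__ARM_NEON)
    float32x4_t ff = vdupq_n_f32(1.0f);   /* all four lanes = 1 */
    for (int i = 0; i < iters; i++) {
        ff = vaddq_f32(ff, ff);
    }
    vst1q_f32(out, ff);                   /* store the four lanes to out[] */
#else
    scalar_double(iters, out);
#endif
}

/* Hypothetical helper: CPU time in seconds spent in one call of fn. */
double time_call(void (*fn)(int, float[4]), int iters, float out[4]) {
    clock_t start = clock();
    fn(iters, out);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A caller would run `time_call(scalar_double, 10000000, out)` and `time_call(vector_double, 10000000, out)` and then print both the times and the contents of `out`. Even with the results stored, an optimising compiler may still transform either loop, so it is worth inspecting the generated assembly before drawing conclusions from the timings.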