Issue
I am implementing a mathematical library in C which heavily uses multiplications. Initially, all my multiplications were done using uint16_t
. Recently I change many of them to uint32_t
and I saw my code runtime became almost double. I was confused as I thought in Intel x64 processors, 32 and 16bit multiplications take same clock cycles.
I wrote a diagnosis code, please find it below
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include "cpucycles.c"
#define REPEAT 10000
#define OUT_REPEAT 100000
void main(){
uint16_t a_16[REPEAT], b_16[REPEAT], c_16[REPEAT];
uint32_t a_32[REPEAT], b_32[REPEAT], c_32[REPEAT];
int32_t i,j;
uint64_t clock1, clock2, CLOCK16, CLOCK32;
uint64_t acc=0;
time_t t;
srand((unsigned) time(&t));
clock1=clock2=CLOCK16=CLOCK32=0;
for(j=0;j<OUT_REPEAT;j++){
for(i=0;i<REPEAT;i++){
a_16[i]=rand()& ( (1<<13) 1); //need 13bit integers only
b_16[i]=rand()& ( (1<<13) 1);
a_32[i]=rand()&( (1<<19) 1);
b_32[i]=rand()&( (1<<19) 1); //need 19bit integers only
}
clock1=cpucycles();
for(i=0;i<REPEAT;i++){
c_16[i]=a_16[i]*b_16[i];
}
clock2=cpucycles();
CLOCK16=CLOCK16+(clock2clock1);
clock1=cpucycles();
for(i=0;i<REPEAT;i++){
c_32[i]=a_32[i]*b_32[i];
}
clock2=cpucycles();
CLOCK32=CLOCK32+(clock2clock1);
for(i=0;i<REPEAT;i++){
acc=(acc+(c_32[i](uint32_t)c_16[i])); //this is just to prevent compiler optimization
}
printf("Iteration: %d, acc:%llu\n", j, acc);
acc=0;
}
printf("\n\n");
printf("Time for 16 bit multiplication : %llu\n", CLOCK16/OUT_REPEAT);
printf("Time for 32 bit multiplication : %llu\n", CLOCK32/OUT_REPEAT);
printf("\n\n");
}
The cpucycles code is from ECRYPT and given below,
#include "cpucycles.h"
long long cpucycles(void)
{
unsigned long long result;
asm volatile(".byte 15;.byte 49;shlq $32,%%rdx;orq %%rdx,%%rax"
: "=a" (result) :: "%rdx");
return result;
}
Result from one sample run, using single core and hyperthreading/TurboBoost disabled

Time for 16 bit multiplication : 2795
Time for 32 bit multiplication : 4190

and finally my cpuinfo (excerpt) as given by lscpu
Architecture: x86_64
CPU opmode(s): 32bit, 64bit
Model name: Intel(R) Core(TM) i77700 CPU @ 3.60GHz
Now my question is,

Is it correct that in x64 platforms 16bit multiplications take almost half of the total time than 32bit multiplications? Or I am doing something wrong.

If yes, can you please show me some reference to justify this behavior?
I appreciate your help.
Solution
Is it correct that in x64 platforms 16 bit multiplications take almost half of the total time than 32 bit multiplications? Or I am doing something wrong.
No, that isn’t correct by itself. Respectively it’s not what you have actually tested.
Your short loops are trivially vectorizable, so that’s just what the compiler did. Depending on the target CPU generation, that means there is a 128, 256 or 512bit vector type available which can be partitioned into different word sizes (8bit, 16bit, 32bit, 64bit, 128bit), and can then perform vectorized multiplications on multiple elements at once. Not just the multiplication, but also loading and storing the numbers from and to memory is fully vectorized and doesn’t operate just on single elements.
To put it simple, you can fit twice as many 16bit integers in the same vector, compared to 32bit. And your code actually isn’t limited by the multiplications either – it’s limited purely by load / stores, so you have only successfully measured that 16bit integers are half as big as 32bit integers, so when vectorized and bound by load/stores you can load twice as many elements in the same time.
If you whish to benchmark a specific instruction (in this case single element multiplication), would you need to explicitly use that specific instruction by inline assembly. You would also need to be aware of all the side effects and preconditions which effect performance, pipelined superscalar architectures are not trivial to benchmark in general.
Otherwise the compiler is expected to optimize (vectorize, fold, inline etc.) your code as much as possible.
Answered By – Ext3h
Answer Checked By – Gilberto Lyons (BugsFixing Admin)