Issue
I was playing with the code in this answer, slightly modifying it:
BITS 64
GLOBAL _start
SECTION .text
_start:
    mov ecx, 1000000
.loop:
    ;T is a symbol defined with the CLI (-DT=...)
    TIMES T imul eax, eax
    lfence
    TIMES T imul edx, edx
    dec ecx
    jnz .loop
    mov eax, 60    ;sys_exit
    xor edi, edi
    syscall
Without the lfence, the results I get are consistent with the static analysis in that answer.
When I introduce a single lfence, I'd expect the CPU to execute the imul edx, edx sequence of the k-th iteration in parallel with the imul eax, eax sequence of the next ((k+1)-th) iteration. Something like this (calling A the imul eax, eax sequence and D the imul edx, edx one):
|
| A
| D A
| D A
| D A
| ...
| D A
| D
|
V time
This should take more or less the same number of cycles, except for the one unpaired parallel execution.
When I measure the number of cycles for the original and modified versions with taskset -c 2 ocperf.py stat -r 5 -e cycles:u '-x ' ./main-$T for T in the range below, I get:
T    Cycles:u (lfence)    Cycles:u (no lfence)    Delta
10 42047564 30039060 12008504
15 58561018 45058832 13502186
20 75096403 60078056 15018347
25 91397069 75116661 16280408
30 108032041 90103844 17928197
35 124663013 105155678 19507335
40 140145764 120146110 19999654
45 156721111 135158434 21562677
50 172001996 150181473 21820523
55 191229173 165196260 26032913
60 221881438 180170249 41711189
65 250983063 195306576 55676487
70 281102683 210255704 70846979
75 312319626 225314892 87004734
80 339836648 240320162 99516486
85 372344426 255358484 116985942
90 401630332 270320076 131310256
95 431465386 285955731 145509655
100 460786274 305050719 155735555
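To make the trend easier to see, here is a throwaway Python snippet (the names and row selection are mine, it just summarizes the table above) that converts the counts into per-iteration costs:

# Throwaway sketch: divide the measured cycle counts by the 1,000,000 loop
# iterations to get per-iteration costs. Values copied from the table above
# (only a few rows shown).
rows = [            # (T, cycles with lfence, cycles without lfence)
    (10, 42047564, 30039060),
    (50, 172001996, 150181473),
    (100, 460786274, 305050719),
]

ITERS = 1_000_000
for t, with_lf, without_lf in rows:
    print(f"T={t:3}: lfence {with_lf/ITERS:6.1f} cyc/iter, "
          f"no lfence {without_lf/ITERS:6.1f} cyc/iter, "
          f"delta {(with_lf - without_lf)/ITERS:5.1f}")

The no-lfence cost stays at roughly 3*T cycles per iteration, while the delta per iteration keeps growing with T instead of staying constant.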
How can the values of Cycles:u (lfence) be explained? I would have expected them to be similar to those of Cycles:u (no lfence), since a single lfence should only prevent the first iteration of the two blocks from being executed in parallel. I don't think it's due to lfence overhead, as I believe that should be constant for all values of T.
I'd like to fix what's wrong with my mental model when doing static analysis of code.
Supporting repository with source files.
Solution
I'll present an analysis for the case where T = 1 for both codes (with and without lfence). You can then extend this to other values of T. You can refer to Figure 2.4 of the Intel Optimization Manual for a visual.
Because there is only a single easily predicted branch, the frontend will only stall if the backend stalls. The frontend is 4-wide in Haswell, which means up to 4 fused uops can be issued per cycle from the IDQ (instruction decode queue, which is just a queue that holds in-order fused-domain uops, also called the uop queue) to the reservation station (RS) entries of the scheduler. Each imul is decoded into a single uop that cannot be fused. The instructions dec ecx and jnz .loop get macrofused in the frontend into a single uop. One of the differences between microfusion and macrofusion is that when the scheduler dispatches a macrofused uop (that is not microfused) to the execution unit it's assigned to, it gets dispatched as a single uop. In contrast, a microfused uop needs to be split into its constituent uops, each of which must be separately dispatched to an execution unit. (However, splitting microfused uops happens on entrance to the RS, not on dispatch; see Footnote 2 in @Peter's answer.) lfence is decoded into 6 uops. Recognizing microfusion only matters in the backend, and in this case, there is no microfusion in the loop.
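As a small bookkeeping sketch (purely illustrative; the per-instruction counts are the ones stated above), the fused-domain uop budget per iteration for T = 1 works out as follows:

# Fused-domain uops per loop iteration for T = 1, using the decoding rules
# described above (illustrative bookkeeping only).
uops = {
    "imul eax, eax": 1,         # single uop, cannot be fused
    "imul edx, edx": 1,         # single uop, cannot be fused
    "dec ecx + jnz .loop": 1,   # macrofused in the frontend into one uop
}
lfence_uops = 6                 # lfence decodes into 6 uops

no_lfence = sum(uops.values())         # 3 uops per iteration
with_lfence = no_lfence + lfence_uops  # 9 uops per iteration

iterations = 1_000_000
print(no_lfence * iterations)    # ~3 million issued uops, as seen in perf below
print(with_lfence * iterations)  # 6 more uops per iteration than the no-lfence run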
Since the loop branch is easily predictable and since the number of iterations is relatively large, we can just assume, without compromising accuracy, that the allocator will always be able to allocate 4 uops per cycle. In other words, the scheduler will receive 4 uops per cycle. Since there is no microfusion, each uop will be dispatched as a single uop.
imul can only be executed by the Slow Int execution unit (see Figure 2.4). This means that the only choice for executing the imul uops is to dispatch them to port 1. In Haswell, the Slow Int is nicely pipelined so that one imul can be dispatched per cycle. But it takes three cycles for the result of the multiplication to be available to any instruction that requires it (the writeback stage is the third cycle from the dispatch stage of the pipeline). So for each dependency chain, at most one imul can be dispatched per 3 cycles.
Because dec/jnz is predicted taken, the only execution unit that can execute it is the Primary Branch unit on port 6.
So at any given cycle, as long as the RS has space, it will receive 4 uops. But what kind of uops? Let’s examine the loop without lfence:
imul eax, eax
imul edx, edx
dec ecx/jnz .loop (macrofused)
There are two possibilities (a small illustrative sketch follows this list):
- Two imuls from the same iteration, one imul from a neighboring iteration, and one dec/jnz from one of those two iterations.
- One dec/jnz from one iteration, two imuls from the next iteration, and one dec/jnz from that same (next) iteration.
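Here is a small sketch (Python, purely for intuition) that slices the in-order fused-domain uop stream into the 4-uop groups the allocator issues each cycle; the printed groups alternate between the two possibilities above with a period of 3 cycles:

# Illustrative only: build the in-order fused-domain uop stream for a few
# iterations (3 uops per iteration when T = 1) and cut it into the 4-uop
# groups that a 4-wide allocator issues, one group per cycle.
stream = []
for it in range(1, 5):
    stream += [f"imul eax (it {it})", f"imul edx (it {it})", f"dec/jnz (it {it})"]

for cycle, start in enumerate(range(0, len(stream) - 3, 4), start=1):
    print(f"cycle {cycle}: {stream[start:start + 4]}")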
So at the beginning of any cycle, the RS will receive at least one dec/jnz and at least one imul from each chain. At the same time, in the same cycle and from those uops that are already in the RS, the scheduler will do one of two things:
- Dispatch the oldest dec/jnz to port 6 and dispatch the oldest imul that is ready to port 1. That's a total of 2 uops.
- Because the Slow Int has a latency of 3 cycles but there are only two chains, for one cycle out of every 3 cycles no imul in the RS will be ready for execution. However, there is always at least one dec/jnz in the RS, so the scheduler can dispatch that. That's a total of 1 uop.
Now we can calculate the expected number of uops in the RS, X_N, at the end of any given cycle N:

X_N = X_{N-1} + (the number of uops to be allocated into the RS at the beginning of cycle N) - (the expected number of uops that will be dispatched at the beginning of cycle N)
    = X_{N-1} + 4 - ((0+1)*1/3 + (1+1)*2/3)
    = X_{N-1} + 12/3 - 5/3
    = X_{N-1} + 7/3, for all N > 0

The initial condition for the recurrence is X_0 = 4. This is a simple recurrence that can be solved by unfolding X_{N-1}:

X_N = 4 + 7/3 * N ≈ 4 + 2.3 * N, for all N >= 0
The RS in Haswell has 60 entries. We can determine the first cycle in which the RS is expected to become full:
60 = 4 + 7/3 * N
N = 56/2.3 = 24.3
So at the end of cycle 24.3, the RS is expected to be full. This means that at the beginning of cycle 25.3, the RS will not be able to receive any new uops. Now the number of iterations, I, under consideration determines how you should proceed with the analysis. Since a dependency chain will require at least 3*I cycles to execute, it takes about 8.1 iterations to reach cycle 24.3. So if the number of iterations is larger than 8.1, which is the case here, you need to analyze what happens after cycle 24.3.
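As a sanity check on this arithmetic, here is a minimal sketch (not a cycle-accurate simulator; the 2-2-1 dispatch pattern and the 60-entry RS size are taken from the discussion above) that tracks RS occupancy cycle by cycle:

# Minimal sketch of the RS-occupancy model above: the allocator adds 4 uops per
# cycle while there is room, and the scheduler dispatches 2, 2, 1 uops in a
# repeating 3-cycle pattern (two chains plus dec/jnz, then a cycle with no
# ready imul).
RS_SIZE = 60
dispatch_pattern = [2, 2, 1]

occupancy = 4          # X_0 = 4: the first group is already in the RS
cycle = 0
while occupancy < RS_SIZE:
    cycle += 1
    occupancy += 4                                   # allocate 4 uops
    occupancy -= dispatch_pattern[(cycle - 1) % 3]   # dispatch 2, 2, or 1 uops

print(f"RS expected to be full at the end of cycle {cycle}")   # ~24

This prints cycle 24, in line with the 56/2.3 ≈ 24.3 estimate above.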
The scheduler dispatches uops at the following rates every cycle (as discussed above): 1, 2, 2, 1, 2, 2, 1, 2, ...

But the allocator will not allocate any uops into the RS unless there are at least 4 available entries. Otherwise, it will not waste power on issuing uops at a sub-optimal throughput. However, it is only at the beginning of every 4th cycle that there are at least 4 free entries in the RS. So starting from cycle 24.3, the allocator is expected to be stalled 3 out of every 4 cycles.
Another important observation for the code being analyzed is that it never happens that there are more than 4 uops that can be dispatched, which means that the average number of uops that leave their execution units per cycle is not larger than 4. At most 4 uops can be retired from the ReOrder Buffer (ROB). This means that the ROB can never be on the critical path. In other words, performance is determined by the dispatch throughput.
We can calculate the IPC (instructions per cycle) fairly easily now. The ROB entries look something like this:
imul eax, eax - N
imul edx, edx - N + 1
dec ecx/jnz .loop - M
imul eax, eax - N + 3
imul edx, edx - N + 4
dec ecx/jnz .loop - M + 1
The column on the right shows the cycles in which the instructions can be retired. Retirement happens in order and is bounded by the latency of the critical path. Here both dependency chains have the same length, so each constitutes a critical path of 3 cycles. So every 3 cycles, 4 instructions can be retired, giving an IPC of 4/3 ≈ 1.33 and a CPI of 3/4 = 0.75. This is much smaller than the theoretical optimal IPC of 4 (even without considering micro- and macro-fusion). Because retirement happens in order, the retirement behavior will be the same.
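As a quick back-of-the-envelope prediction (plain arithmetic based on the numbers above, nothing measured) for 1 million iterations of the no-lfence loop:

# Prediction for the no-lfence, T = 1 loop: each iteration is bounded by the
# 3-cycle imul chain and retires 4 instructions.
ITERATIONS = 1_000_000
CYCLES_PER_ITER = 3      # imul latency bound
INSNS_PER_ITER = 4       # imul, imul, dec, jnz

cycles = ITERATIONS * CYCLES_PER_ITER   # 3,000,000
insns = ITERATIONS * INSNS_PER_ITER     # 4,000,000
print(cycles, insns, round(insns / cycles, 2))   # 3000000 4000000 1.33

These numbers line up with the perf output below.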
We can check our analysis using both perf and IACA. I'll discuss perf here; I have a Haswell CPU.
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-nolfence
Performance counter stats for './main-1-nolfence' (10 runs):
3,001,556 cycles:u ( +- 0.00% )
4,000,005 instructions:u # 1.33 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
2,342,246 UOPS_ISSUED.ANY ( +- 0.26% )
2,249,892 RESOURCE_STALLS.RS ( +- 0.00% )
0.001061681 seconds time elapsed ( +- 0.48% )
There are 1 million iterations, each taking about 3 cycles. Each iteration contains 4 instructions, and the IPC is 1.33. RESOURCE_STALLS.ROB shows the number of cycles in which the allocator was stalled due to a full ROB. This of course never happens. UOPS_ISSUED.ANY can be used to count the number of uops issued to the RS and the number of cycles in which the allocator was stalled (for any reason). The first is straightforward (not shown in the perf output): 1 million * 3 = 3 million, plus small noise. The latter is much more interesting. It shows that about 73% of all the time the allocator was stalled due to a full RS, which matches our analysis. RESOURCE_STALLS.RS counts the number of cycles in which the allocator was stalled due to a full RS. This is close to UOPS_ISSUED.ANY because the allocator does not stall for any other reason (although the difference could be proportional to the number of iterations for some reason; I'll have to see the results for T>1).
The analysis of the code without lfence can be extended to determine what happens if an lfence is added between the two imuls. Let's check out the perf results first (IACA unfortunately does not support lfence):
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-lfence
Performance counter stats for './main-1-lfence' (10 runs):
13,255,451 cycles:u ( +- 0.01% )
5,000,007 instructions:u # 0.38 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
10,384,640 UOPS_ISSUED.ANY ( +- 0.04% )
0 RESOURCE_STALLS.RS
0.004163500 seconds time elapsed ( +- 0.41% )
Observe that the number of cycles has increased by about 10 million, or 10 cycles per iteration. The number of cycles by itself does not tell us much. The number of retired instructions has increased by a million, which is expected. We already know that the lfence will not make instructions complete any faster, so RESOURCE_STALLS.ROB should not change. UOPS_ISSUED.ANY and RESOURCE_STALLS.RS are particularly interesting. In this output, UOPS_ISSUED.ANY counts cycles, not uops. The number of uops can also be counted (using cpu/event=0x0E,umask=0x1,name=UOPS_ISSUED.ANY/u instead of cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u) and has increased by 6 uops per iteration (no fusion). This means that the lfence placed between the two imuls was decoded into 6 uops. The million-dollar question is now what these uops do and how they move around in the pipe.
RESOURCE_STALLS.RS is zero. What does that mean? It indicates that when the allocator sees an lfence in the IDQ, it stops allocating until all current uops in the ROB retire. In other words, the allocator will not allocate entries in the RS past an lfence until the lfence retires. Since the loop body contains only 3 other uops, the 60-entry RS will never be full. In fact, it will always be almost empty.
The IDQ in reality is not a single simple queue. It consists of multiple hardware structures that can operate in parallel. The number of uops an lfence requires depends on the exact design of the IDQ. When the allocator, which also consists of many different hardware structures, sees that there is an lfence uop at the front of any of the structures of the IDQ, it suspends allocation from that structure until the ROB is empty. So different uops are used with different hardware structures.
UOPS_ISSUED.ANY shows that the allocator does not issue any uops for about 9-10 cycles per iteration. What is happening here? Well, one of the uses of lfence is that it can tell us how much time it takes to retire an instruction and allocate the next instruction. The following assembly code can be used to do that:
TIMES T lfence
The performance event counters will not work well for small values of T. For sufficiently large T, and by measuring UOPS_ISSUED.ANY, we can determine that it takes about 4 cycles to retire each lfence. That's because UOPS_ISSUED.ANY will be incremented about 4 times every 5 cycles. So after every 4 cycles, the allocator issues another lfence (it doesn't stall), then it waits for another 4 cycles, and so on. That said, instructions that produce results may require one or a few more cycles to retire, depending on the instruction. IACA always assumes that it takes 5 cycles to retire an instruction.
Our loop looks like this:
imul eax, eax
lfence
imul edx, edx
dec ecx
jnz .loop
At any cycle at the lfence boundary, the ROB will contain the following instructions, starting from the top of the ROB (the oldest instruction):
imul edx, edx - N
dec ecx/jnz .loop - N
imul eax, eax - N+1
Where N denotes the cycle number at which the corresponding instruction was dispatched. The last instruction that is going to complete (reach the writeback stage) is imul eax, eax, and this happens at cycle N+4. The allocator stall cycle count will be incremented during cycles N+1, N+2, N+3, and N+4. However, it will take about 5 more cycles until imul eax, eax retires. In addition, after it retires, the allocator needs to clean up the lfence uops from the IDQ and allocate the next group of instructions before they can be dispatched in the next cycle. The perf output tells us that it takes about 13 cycles per iteration and that the allocator stalls (because of the lfence) for 10 out of these 13 cycles.
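As a quick check of that claim against the perf numbers quoted above (simple arithmetic on the already-shown counts, nothing new measured):

# Per-iteration breakdown of the lfence run, using the perf output shown above.
ITERATIONS = 1_000_000
cycles = 13_255_451        # cycles:u
stall_cycles = 10_384_640  # UOPS_ISSUED.ANY with cmask=1,inv=1: allocator stall cycles

print(round(cycles / ITERATIONS, 1))                   # ~13.3 cycles per iteration
print(round(stall_cycles / ITERATIONS, 1))             # ~10.4 stall cycles per iteration
print(round((cycles - stall_cycles) / ITERATIONS, 1))  # ~2.9 non-stall cycles per iteration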
The graph from the question shows only the number of cycles for up to T=100. However, there is another (final) knee at this point. So it would be better to plot the cycles for up to T=120 to see the full pattern.
Answered By – Hadi Brais