I was playing with the code in this answer, slightly modifying it:
BITS 64 GLOBAL _start SECTION .text _start: mov ecx, 1000000 .loop: ;T is a symbol defined with the CLI (-DT=...) TIMES T imul eax, eax lfence TIMES T imul edx, edx dec ecx jnz .loop mov eax, 60 ;sys_exit xor edi, edi syscall
lfence I the results I get are consistent with the static analysis in that answer.
When I introduce a single
lfence I’d expect the CPU to execute the
imul edx, edx sequence of the k-th iteration in parallel with the
imul eax, eax sequence of the next (k+1-th) iteration.
Something like this (calling A the
imul eax, eax sequence and D the
imul edx, edx one):
| | A | D A | D A | D A | ... | D A | D | V time
Taking more or less the same number of cycles but for one unpaired parallel execution.
When I measure the number of cycles, for the original and modified version, with
taskset -c 2 ocperf.py stat -r 5 -e cycles:u '-x ' ./main-$T for
T in the range below I get
T Cycles:u Cycles:u Delta lfence no lfence 10 42047564 30039060 12008504 15 58561018 45058832 13502186 20 75096403 60078056 15018347 25 91397069 75116661 16280408 30 108032041 90103844 17928197 35 124663013 105155678 19507335 40 140145764 120146110 19999654 45 156721111 135158434 21562677 50 172001996 150181473 21820523 55 191229173 165196260 26032913 60 221881438 180170249 41711189 65 250983063 195306576 55676487 70 281102683 210255704 70846979 75 312319626 225314892 87004734 80 339836648 240320162 99516486 85 372344426 255358484 116985942 90 401630332 270320076 131310256 95 431465386 285955731 145509655 100 460786274 305050719 155735555
How can the values of
Cycles:u lfence be explained?
I would have expected them to be similar to those of
Cycles:u no lfence since a single
lfence should prevent only the first iteration from being executed in parallel for the two blocks.
I don’t think it’s due to the
lfence overhead as I believe that should be constant for all
I’d like to fix what’s wrong with my forma mentis when dealing with the static analysis of code.
I’ll present an analysis for the case where T = 1 for both codes (with and without
lfence). You can then extend this for other values of T. You can refer to Figure 2.4 of the Intel Optimization Manual for a visual.
Because there is only a single easily predicted branch, the frontend will only stall if the backend stalled. The frontend is 4-wide in Haswell, which means up to 4 fused uops can be issued from the IDQ (instruction decode queue, which is just a queue that holds in-order fused-domain uops, also called the uop queue) to the reservation station (RS) entires of the scheduler. Each
imul is decoded into a single uop that cannot be fused. The instructions
dec ecx and
jnz .loop get macrofused in the frontend to a single uop. One of the differences between microfusion and macrofusion is that when the scheduler dispatches a macrofused uop (that are not microfused) to the execution unit it’s assigned to, it gets dispatched as a single uop. In contrast, a microfused uop needs to be split into its constituent uops, each of which must be separately dispatched to an execution unit. (However, splitting microfused uops happens on entrance to the RS, not on dispatch, see Footnote 2 in @Peter’s answer).
lfence is decoded into 6 uops. Recognizing microfusion only matters in the backend, and in this case, there is no microfusion in the loop.
Since the loop branch is easily predictable and since the number of iterations is relatively large, we can just assume without compromising accuracy that the allocator will always be able to allocate 4 uops per cycle. In other words, the scheduler will receive 4 uops per cycle. Since there is no micorfusion, each uop will be dispatched as a single uop.
imul can only be executed by the Slow Int execution unit (see Figure 2.4). This means that the only choice for executing the
imul uops is to dispatch them to port 1. In Haswell, the Slow Int is nicely pipelined so that a single
imul can be dispatched per cycle. But it takes three cycles for the result of the multiplication be available for any instruction that requires (the writeback stage is the third cycle from the dispatch stage of the pipeline). So for each dependence chain, at most one
imul can be dispatched per 3 cycles.
dec/jnz is predicted taken, the only execution unit that can execute it is Primary Branch on port 6.
So at any given cycle, as long as the RS has space, it will receive 4 uops. But what kind of uops? Let’s examine the loop without lfence:
imul eax, eax imul edx, edx dec ecx/jnz .loop (macrofused)
There are two possibilities:
imuls from the same iteration, one
imulfrom a neighboring iteration, and one
dec/jnzfrom one of those two iterations.
dec/jnzfrom one iteration, two
imuls from the next iteration, and one
dec/jnzfrom the same iteration.
So at the beginning of any cycle, the RS will receive at least one
dec/jnz and at least one
imul from each chain. At the same time, in the same cycle and from those uops that are already there in the RS, the scheduler will do one of two actions:
- Dispatch the oldest
dec/jnzto port 6 and dispatch the oldest
imulthat is ready to port 1. That’s a total of 2 uops.
- Because the Slow Int has a latency of 3 cycles but there are only two chains, for each cycle of 3 cycles, no
imulin the RS will be ready for execution. However, there is always at least one
dec/jnzin the RS. So the scheduler can dispatch that. That’s a total of 1 uop.
Now we can calculate the expected number of uops in the RS, XN, at the end of any given cycle N:
XN = XN-1 + (the number of uops to be allocated in the RS at the beginning of cycle N) – (the expected number of uops that will be dispatched at the beginning of cycle N)
= XN-1 + 4 – ((0+1)*1/3 + (1+1)*2/3)
= XN-1 + 12/3 – 5/3
= XN-1 + 7/3 for all N > 0
The initial condition for the recurrence is X0 = 4. This is a simple recurrence that can be solved by unfolding XN-1.
XN = 4 + 2.3 * N for all N >= 0
The RS in Haswell has 60 entries. We can determine the first cycle in which the RS is expected to become full:
60 = 4 + 7/3 * N
N = 56/2.3 = 24.3
So at the end of cycle 24.3, the RS is expected to be full. This means that at the beginning of cycle 25.3, the RS will not be able to receive any new uops. Now the number of iterations, I, under consideration determines how you should proceed with the analysis. Since a dependency chain will require at least 3*I cycles to execute, it takes about 8.1 iterations to reach cycle 24.3. So if the number of iterations is larger than 8.1, which is the case here, you need to analyze what happens after cycle 24.3.
The scheduler dispatches instructions at the following rates every cycle (as discussed above):
1 2 2 1 2 2 1 2 . .
But the allocator will not allocate any uops in the RS unless there are at least 4 available entries. Otherwise, it will not waste power on issuing uops at a sub-optimal throughput. However, it is only at the beginning of every 4th cycle are there at least 4 free entries in the RS. So starting from cycle 24.3, the allocator is expected to get stalled 3 out of every 4 cycles.
Another important observation for the code being analyzed is that it never happens that there are more than 4 uops that can be dispatched, which means that the average number of uops that leave their execution units per cycle is not larger than 4. At most 4 uops can be retired from the ReOrder Buffer (ROB). This means that the ROB can never be on the critical path. In other words, performance is determined by the dispatch throughput.
We can calculate the IPC (instructions per cycles) fairly easily now. The ROB entries look something like this:
imul eax, eax - N imul edx, edx - N + 1 dec ecx/jnz .loop - M imul eax, eax - N + 3 imul edx, edx - N + 4 dec ecx/jnz .loop - M + 1
The column to the right shows the cycles in which the instruction can be retired. Retirement happens in order and is bounded by the latency of the critical path. Here each dependency chain have the same path length and so both constitute two equal critical paths of length 3 cycles. So every 3 cycles, 4 instructions can be retired. So the IPC is 4/3 = 1.3 and the CPI is 3/4 = 0.75. This is much smaller than the theoretical optimal IPC of 4 (even without considering micro- and macro-fusion). Because retirement happens in-order, the retirement behavior will be the same.
We can check our analysis using both
perf and IACA. I’ll discuss
perf. I’ve a Haswell CPU.
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-nolfence Performance counter stats for './main-1-nolfence' (10 runs): 30,01,556 cycles:u ( +- 0.00% ) 40,00,005 instructions:u # 1.33 insns per cycle ( +- 0.00% ) 0 RESOURCE_STALLS.ROB 23,42,246 UOPS_ISSUED.ANY ( +- 0.26% ) 22,49,892 RESOURCE_STALLS.RS ( +- 0.00% ) 0.001061681 seconds time elapsed ( +- 0.48% )
There are 1 million iterations each takes about 3 cycles. Each iteration contains 4 instructions and the IPC is 1.33.
RESOURCE_STALLS.ROB shows the number of cycles in which the allocator was stalled due to a full ROB. This of course never happens.
UOPS_ISSUED.ANY can be used to count the number of uops issued to the RS and the number of cycles in which the allocator was stalled (no specific reason). The first is straightforward (not shown in the
perf output); 1 million * 3 = 3 million + small noise. The latter is much more interesting. It shows that about 73% of all time the allocator stalled due to a full RS, which matches our analysis.
RESOURCE_STALLS.RS counts the number of cycles in which the allocator was stalled due to a full RS. This is close to
UOPS_ISSUED.ANY because the allocator does not stall for any other reason (although the difference could be proportional to the number of iterations for some reason, I’ll have to see the results for T>1).
The analysis of the code without
lfence can be extended to determine what happens if an
lfence was added between the two
imuls. Let’s check out the
perf results first (IACA unfortunately does not support
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-lfence Performance counter stats for './main-1-lfence' (10 runs): 1,32,55,451 cycles:u ( +- 0.01% ) 50,00,007 instructions:u # 0.38 insns per cycle ( +- 0.00% ) 0 RESOURCE_STALLS.ROB 1,03,84,640 UOPS_ISSUED.ANY ( +- 0.04% ) 0 RESOURCE_STALLS.RS 0.004163500 seconds time elapsed ( +- 0.41% )
Observe that the number of cycles has increased by about 10 million, or 10 cycles per iteration. The number of cycles does not tell us much. The number of retired instruction has increased by a million, which is expected. We already know that the
lfence will not make instruction complete any faster, so
RESOURCE_STALLS.ROB should not change.
RESOURCE_STALLS.RS are particularly interesting. In this output,
UOPS_ISSUED.ANY counts cycles, not uops. The number of uops can also be counted (using
cpu/event=0x0E,umask=0x1,name=UOPS_ISSUED.ANY/u instead of
cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u) and has increased by 6 uops per iteration (no fusion). This means that an
lfence that was placed between two
imuls was decoded into 6 uops. The one million dollar question is now what these uops do and how they move around in the pipe.
RESOURCE_STALLS.RS is zero. What does that mean? This indicates that the allocator, when it sees an
lfence in the IDQ, it stops allocating until all current uops in the ROB retire. In other words, the allocator will not allocate entries in the RS past an
lfence until the
lfence retires. Since the loop body contains only 3 other uops, the 60-entry RS will never be full. In fact, it will be always almost empty.
The IDQ in reality is not a single simple queue. It consists of multiple hardware structures that can operate in parallel. The number of uops an
lfence requires depends on the exact design of the IDQ. The allocator, which also consists of many different hardware structures, when it see there is an
lfence uops at the front of any of the structures of the IDQ, it suspends allocation from that structure until the ROB is empty. So different uops are usd with different hardware structures.
UOPS_ISSUED.ANY shows that the allocator is not issuing any uops for about 9-10 cycles per iteration. What is happening here? Well, one of the uses of
lfence is that it can tell us how much time it takes to retire an instruction and allocate the next instruction. The following assembly code can be used to do that:
TIMES T lfence
The performance event counters will not work well for small values of
T. For sufficiently large T, and by measuring
UOPS_ISSUED.ANY, we can determine that it takes about 4 cycles to retire each
lfence. That’s because
UOPS_ISSUED.ANY will be incremented about 4 times every 5 cycles. So after every 4 cycles, the allocator issues another
lfence (it doesn’t stall), then it waits for another 4 cycles, and so on. That said, instructions that produce results may require 1 or few more cycle to retire depending on the instruction. IACA always assume that it takes 5 cycles to retire an instruction.
Our loop looks like this:
imul eax, eax lfence imul edx, edx dec ecx jnz .loop
At any cycle at the
lfence boundary, the ROB will contain the following instructions starting from the top of the ROB (the oldest instruction):
imul edx, edx - N dec ecx/jnz .loop - N imul eax, eax - N+1
Where N denotes the cycle number at which the corresponding instruction was dispatched. The last instruction that is going to complete (reach the writeback stage) is
imul eax, eax. and this happens at cycle N+4. The allocator stall cycle count will be incremented during cycles, N+1, N+2, N+3, and N+4. However it will about 5 more cycles until
imul eax, eax retires. In addition, after it retires, the allocator needs to clean up the
lfence uops from the IDQ and allocate the next group of instructions before they can be dispatched in the next cycle. The
perf output tells us that it takes about 13 cycles per iteration and that the allocator stalls (because of the
lfence) for 10 out of these 13 cycles.
The graph from the question shows only the number of cycles for up to T=100. However, there is another (final) knee at this point. So it would be better to plot the cycles for up to T=120 to see the full pattern.
Answered By – Hadi Brais
Answer Checked By – Gilberto Lyons (BugsFixing Admin)