[SOLVED] x86 Assembly pushad/popad: how fast is it?

Issue

I’m trying to write a very fast, calculation-based program in x86 assembly,
but I need to push the accumulator, counter, and data registers before calling a procedure. Is it faster to push them manually:

push eax
push ecx
push edx

or to just use

pushad

and the same with popping? Thanks.

Solution

If you care about performance, pusha / popa are almost never useful. They’re only useful when optimizing for code-size at the expense of speed, e.g. to save/restore registers around a function call. But that’s pretty inconvenient for non-void functions, because popa reloads all registers, so you have to store the return value in memory (e.g. over the stack slot that will be loaded into eax, or somewhere else to be reloaded after popad).

Only push the registers that need saving, or that you want to pass as function args. Or, in inline assembly, just let the compiler manage registers for you by declaring "=r"(dummy1) dummy output operands for any temp regs, or use clobbers on specific registers. Normally the compiler can pick registers that it can let you clobber without saving. (Or in the clunky MSVC-style inline asm, the compiler can’t allocate registers for you, so you have to pick manually. The compiler parses your asm to find clobbers.)
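To make the dummy-output idea concrete, here is a minimal sketch in GNU C inline asm (AT&T syntax, assuming GCC or Clang targeting x86). The function name mul5_add and the operand names are made up for illustration and are not from the question. On other targets the sketch falls back to plain C so that it still compiles.

```c
#include <stdint.h>

/* Hypothetical helper (illustration only): returns 5*x + y, wrapping mod 2^32. */
static inline uint32_t mul5_add(uint32_t x, uint32_t y)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t result;
    uint32_t scratch;  /* dummy output: exists only to reserve a temp register */
    __asm__ (
        "movl  %[x], %[tmp]\n\t"    /* tmp = x              */
        "shll  $2, %[tmp]\n\t"      /* tmp = 4*x            */
        "addl  %[x], %[tmp]\n\t"    /* tmp = 5*x            */
        "movl  %[tmp], %[res]\n\t"  /* res = 5*x            */
        "addl  %[y], %[res]"        /* res = 5*x + y        */
        : [res] "=&r"(result),      /* early-clobber: written before y is read */
          [tmp] "=&r"(scratch)      /* early-clobber: written before x's last read */
        : [x] "r"(x), [y] "r"(y)
        : "cc");                    /* shll/addl modify the flags */
    return result;
#else
    /* Portable fallback so the sketch builds on non-x86 targets. */
    return 5u * x + y;
#endif
}
```

Because scratch is declared as an output, the compiler picks whichever register is free at that point, and no push/pop appears in the asm itself. For example, mul5_add(3, 4) evaluates to 19.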

You normally don’t need to save/restore eax; for performance you should probably mov esi, eax / call / use the value in esi, if you can’t calculate the value in esi in the first place. i.e. use call-preserved registers for values that need to survive a call, so a store/reload of your important value isn’t on the critical path. Instead, the store/reload is on the critical path of one of the caller’s call-preserved registers which you (or the compiler) push/pop around the whole function, outside of any loops.

See more about call-preserved vs. call-clobbered registers and how saving/restoring should normally work. And what makes a good calling convention, e.g. how x86-64 System V was designed, and also this Q&A about how many args should be passed in registers, and why not also use XMM registers for integer args. Of course, helper functions can use custom calling conventions.


pusha / popa are slow on most CPUs

Even if you did want to push all 8 integer registers (including esp!), using 8 separate push instructions is actually faster on modern CPUs. pusha/popa are microcoded, which can be a problem for the front-end. (Although 8 single-byte instructions could be a problem for the uop-cache, too. But in real code, you usually only need to push a few registers, not all of them.)

If you’re optimizing for obsolete CPUs (like original in-order Pentium, and Pentium II/III), pusha/popa are as fast as 8 push r or 8 pop r, and actually take fewer uops, because those CPUs have no stack engine to eliminate the ESP-update uops of individual push/pop instructions.

From Agner Fog’s instruction tables: modern CPUs have single-uop push reg and pop reg, because those instructions are used all the time by compilers and thus are important for performance. push/pop throughput typically matches store/load throughput (often 1 store per clock or 2 loads per clock). But pusha / popa are not used by compilers, so CPU designers don’t have special support to make them fast. popa throughput is limited to only 1 load per clock if just running popa. (I think on Intel CPUs, the most likely explanation for the measured performance is that popa doesn’t use the stack engine, so it bottlenecks on a dependency on esp.)

Intel:

  • Skylake: pusha: 11 uops / 8c throughput. popa: 18 uops / 8c throughput.
  • Sandybridge: pusha: 16 uops / 8c throughput. popa: 18 uops / 9c throughput.
  • Nehalem: pusha: 18 uops / 8c throughput. popa: 10 uops / 8c throughput.
  • Silvermont/KNL: pusha: 10 uops / 10c throughput. popa: 17 uops / 14c throughput.
  • Pentium4: pusha: 4/10 uops / 19c throughput. popa: 4/16 uops / 14c throughput.
  • P5 Pentium 1 / MMX: 5-9 cycles, non-pairable. "9 if SP divisible by 4 (imperfect pairing)."

AMD: pusha/popa are surprisingly good on some AMD CPUs, especially K8.

  • Ryzen: pusha: 9 uops / 8c throughput. popa: 9 uops / 4c throughput. (Unlike Intel, AMD’s new design has popa no worse than 8x pop.)
  • Jaguar: pusha: 9 uops / 8c throughput. popa: 9 uops / 8c throughput. (Jaguar can only do one load per clock normally.)
  • Piledriver: pusha: 9 uops / 9c throughput. popa: 14 uops / 8c throughput. (Agner lists regular pop reg throughput as 1 per clock for Bulldozer-family, although I think they do have a stack engine and can do 2 loads per clock. Maybe the stack engine can only handle one stack instruction at a time?)
  • K8: pusha: 9 uops / 4c throughput!! (IDK how this is possible, either this is an error or typo in the table, or K8 merges 32-bit registers and does four 64-bit stores). popa: 9 uops / 4c throughput. These numbers do seem to be real: InstLatx86 measurements agree with 4c throughput for pushad / popad on Clawhammer (the first-gen K8 microarchitecture). So clearly AMD put some effort into optimizing pushad.

You tagged this inline-assembly. Usually you should avoid using call inside inline asm and make the call from C instead, so the C compiler knows about the call.

And let the compiler worry about registers; just tell it which ones you modify (GNU C asm("..." ::: "eax", "ecx") or whatever), or in MSVC-style inline asm it parses your asm and knows which registers were written. If that includes any call-preserved registers, the compiler will save/restore those at the start/end of the whole function, even if the asm statement is in a loop. (It might need to spill and/or reload some local vars before/after the asm statement or block, but would use mov, not push/pop for that.)
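As a rough sketch of the clobber-list approach (GNU C, AT&T syntax, x86 assumed), the asm below uses ebx as a hard-coded temporary inside a loop and declares it as clobbered. sum_doubled is a hypothetical function for illustration, not something from the question, and non-x86 targets get a plain-C fallback. ebx is call-preserved in the common 32-bit and 64-bit x86 calling conventions, so the compiler is expected to save and restore it once around the function rather than on every iteration. Check the generated asm for your compiler and options to confirm.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example (illustration only): returns the sum of 2*v[i]. */
static uint32_t sum_doubled(const uint32_t *v, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t doubled;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
        __asm__ (
            "movl %[in], %%ebx\n\t"   /* ebx = v[i]                    */
            "addl %%ebx, %%ebx\n\t"   /* ebx = 2 * v[i]                */
            "movl %%ebx, %[out]"      /* copy the result to the output */
            : [out] "=r"(doubled)
            : [in] "r"(v[i])
            : "ebx", "cc");           /* tell the compiler what we modify */
#else
        /* Portable fallback so the sketch builds on non-x86 targets. */
        doubled = v[i] + v[i];
#endif
        total += doubled;
    }
    return total;
}
```

The asm contains no push or pop at all. Listing "ebx" in the clobbers is what lets the compiler decide where, and how often, the register gets saved.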

Answered By – Peter Cordes

Answer Checked By – Cary Denson (BugsFixing Admin)
