jgrahamc an hour ago

In my 6502 hacking days, the presence of an exclusive OR was a sure-fire indicator you’d either found the encryption part of the code, or some kind of sprite routine.

Yeah, sadly the 6502 didn't allow you to do EOR A, while the Z80 did allow XOR A. If I remember correctly XOR A was AF and LD A, 0 was 3E 01[1]. So it saved a whole byte! And I think the XOR was 3 clock cycles faster than the LD. So less space taken up by the instruction and faster.

I have a very distinct memory in my first job (writing x86 assembly) of the CEO walking up behind my desk and pointing out that I'd done MOV AX, 0 when I could have done XOR AX, AX.

[1] 3E 00
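
A rough sketch of the two Z80 forms as I remember them (byte counts and T-states per the usual Z80 references, so worth double-checking):

  XOR A        ; AF     - 1 byte, 4 T-states, also clears the carry flag
  LD  A, 0     ; 3E 00  - 2 bytes, 7 T-states, leaves the flags alone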

  • wavemode an hour ago

    > CEO walking up behind my desk and pointing out that I'd done MOV AX, 0 when I could have done XOR AX, AX

    Now that's what I call micromanagement.

    (sorry couldn't resist)

    • jgrahamc an hour ago

      He was right though. We were memory and cycle constrained and I'd wasted both!

    • xigoi 42 minutes ago

      The real joke is that a CEO had actual technical knowledge instead of just being there for decoration.

  • anonzzzies 14 minutes ago

    3E 00 : I was on MSX and never had an assembler back then, so I only remember the hex and never actually knew the instructions; I wrote programs/games as data 3E,00,CD,etc. without comments saying LD A, as I never knew those at the time.

  • vanderZwan an hour ago

    Hah, we commented on the exact same paragraph within a minute of each other! My memory agrees with your memory, although I think that should be 3E 00. Let me look that up:

    https://jnz.dk/z80/ld_r_n.html

    https://jnz.dk/z80/xor_r.html

    Yep, if I'm reading this right that's 3E 00, since the second byte is the immediate value.

    One difference between XOR and LD is that LD A, 0 does not affect flags, which sometimes mattered.

    • jgrahamc an hour ago

      You're right. Of course, it's 3E 00. Not sure how I remembered 3E 01. My only excuse is that it was 40 years ago!

daeken 2 hours ago

Back in 2005 or 2006, I was working at a little startup with "DVD Jon" Johansen and we'd have Quake 3 tournaments to break up the monotony of reverse-engineering and juggling storage infrastructure. His name was always "xor eax,eax" and I always just had to laugh at the idea of getting zeroed out by someone with that name. (Which happened a lot -- I was good, but he was much better!)

pansa2 2 hours ago

> Unlike other partial register writes, when writing to an e register like eax, the architecture zeros the top 32 bits for free.

I’m familiar with 32-bit x86 assembly from writing it 10-20 years ago. So I was aware of the benefit of xor in general, but the above quote was new to me.

I don’t have any experience with 64-bit assembly - is there a guide anywhere that teaches 64-bit specifics like the above? Something like “x64 for those who know x86”?

  • sparkie 2 hours ago

    It's not only xor that does this, but most 32-bit operations zero-extend the result into the 64-bit register. AMD did this for backward compatibility, so existing programs would mostly continue working, unlike Intel's earlier attempt at 64 bits, which was an entirely new design.

    The reason `xor eax,eax` is preferred to `xor rax,rax` is due to how the instructions are encoded - it saves one byte which in turn reduces instruction cache usage.

    When using 64-bit operations, a REX prefix is required on the instruction (byte 0x40..0x4F), which serves two purposes: the MSB of the low nybble (W) being set (i.e., REX prefixes 0x48..0x4F) indicates a 64-bit operation, and the low 3 bits of the low nybble allow using registers r8-r15 by providing an extra bit each for the ModRM register field and for the base and index fields in the SIB byte, as only 3 bits (8 registers) are provided by x86.

    A recent addition, APX, adds an additional 16 registers (r16-r31), which need 2 additional bits. There's a REX2 prefix for this (0xD5 ...), which is a two-byte prefix to the instruction. REX2 replaces the REX prefix when accessing r16-r31 and still contains the W bit, but it also includes an `M0` bit that selects which of the two main opcode maps to use, replacing the 0x0F prefix, so it has no additional cost over the REX prefix when accessing the second opcode map.
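
    To make the size difference concrete, here's a sketch of the encodings (per the standard x86-64 encoding rules; easy to verify with any assembler or disassembler):

      31 C0        xor eax, eax    ; 2 bytes, no prefix needed
      48 31 C0     xor rax, rax    ; 3 bytes, REX.W selects 64-bit width
      45 31 C0     xor r8d, r8d    ; 3 bytes, REX.R + REX.B select r8 in reg/rm
      4D 31 C0     xor r8, r8      ; 3 bytes, REX.W + REX.R + REX.B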

    • cesarb an hour ago

      > It's not only xor that does this, but most 32-bit operations zero-extend the result into the 64-bit register. AMD did this for backward compatibility.

      It's not just that: zero-extending or sign-extending the result is also better for out-of-order implementations. If parts of the output register are preserved, the instruction needs an extra dependency on the original value.

grimgrin 6 minutes ago

I'd like to learn about the earliest pronunciations of these instructions, only because, watching a video earlier, I heard "MOV" pronounced "MAUV", not "MOVE".

Not sure exactly how I could dig up pronunciations, except finding the oldest recordings

eb0la 2 hours ago

I remember a lot of code zeroing registers, dating back at least to the IBM PC XT days (before the 80286).

If you decode the instruction, it makes sense to use XOR:

- mov ax, 0 - needs 4 bytes (66 b8 00 00)
- xor ax,ax - needs 3 bytes (66 31 c0)

This extra byte did matter in a machine with less than 1 megabyte of memory.

In 386 processors it was also:

- mov eax,0 - needs 5 bytes (b8 00 00 00 00)
- xor eax,eax - needs 2 bytes (31 c0)

Here Intel made the decision to use only 2 bytes. I bet this helps both the instruction decoder and (of course) saves more memory than the old 8086 instruction.

  • Sharlin 41 minutes ago

    As the author says, a couple of extra bytes still matter, perhaps more than 20ish years ago. There are vast amounts of RAM, sure, but it's glacially slow, and there's only a few tens of kBs of L1 instruction cache.

    Never mind the fact that, as the author also mentions, the xor idiom takes essentially zero cycles to execute because nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.

    • cogman10 22 minutes ago

      L1 instruction cache is backed by L2 and L3 caches.

      For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core) and 64MB of L3 (shared, 128 if you have the X3D version).

      I won't say it doesn't matter, but it doesn't matter as much as it once did. CPU caches have gotten huge while the instructions remain the same size.

      The more important part, at this point, is that it's idiomatic. That means hardware designers are much more likely to put in specialty logic to make sure it's fast. It's a common enough operation to deserve its own special cases. You can fit a lot of 8-byte instructions into 1280kb of memory. And as it turns out, it's pretty common for applications to spend a lot of their time in small chunks of instructions. The slow part of a lot of code will be that `for` loop with the 30 AVX instructions doing magic. That's why you'll often see compilers burn `NOP` instructions to align a loop. That's to avoid splitting a cache line.

      • Sharlin a minute ago

        > For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core)

        Well, that's certainly a ridiculous amount of L1. My 7700, which is way way closer to what normal people have in their machines, has 96kB (32+64) per core, so I used that as a point of reference.

  • vardump 2 hours ago

    > - mov ax, 0 - needs 4 bytes (66 b8 00 00)
    > - xor ax,ax - needs 3 bytes (66 31 c0)

    You don't need operand size prefix 0x66 when running 16 bit code in Real Mode. So "mov ax, 0" is 3 bytes and "xor ax, ax" is just 2 bytes.
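
    A sketch of the real-mode bytes (worth double-checking with a 16-bit assembler):

      B8 00 00     mov ax, 0      ; 3 bytes in 16-bit code, no 0x66 prefix
      31 C0        xor ax, ax     ; 2 bytes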

    • eb0la an hour ago

      My fault: I just compiled the instruction with an assembler instead of looking up the actual instruction from documentation.

      It makes much more sense: resetting ax and bx (xor ax,ax ; xor bx,bx) will be 4 octets, DWORD aligned, and a bit faster to fetch by the x86 than the 3-octet version I wrote before.

  • Someone an hour ago

    > If you decode the instruction, it makes sense to use XOR:

    > - mov ax, 0 - needs 4 bytes (66 b8 00 00)
    > - xor ax,ax - needs 3 bytes (66 31 c0)

    Except, apparently, on the Pentium Pro, according to this comment: https://randomascii.wordpress.com/2012/12/29/the-surprising-..., which says:

    “But there was at least one out-of-order design that did not recognize xor reg, reg as a special case: the Pentium Pro. The Intel Optimization manuals for the Pentium Pro recommended “mov” to zero a register.”

  • Anarch157a an hour ago

    I don't know enough about the 8086 to say whether this works the same, but on the Z80 (which means it was probably true for the 8080 too), XOR A would also clear pretty much all bits in the flag register, meaning the flags would be in a known state before doing something that could affect them.

    • vanderZwan an hour ago

      Which I guess is the same reason why modern Intel CPU pipelines can rely on it for pipelining.

  • RHSeeger 2 hours ago

    > the IBM PC XT days (before the 80286)

    Fun fact - the IBM PC XT also came in a 286 model (the XT 286).

    • eb0la 2 hours ago

      You're right. I forgot that!

charles_f 16 minutes ago

> By using a slightly more obscure instruction, we save three bytes every time we need to set a register to zero

Meanwhile, most "apps" we get nowadays contain half of npmjs neatly bundled in electron. I miss the days when default was native and devs had constraints to how big their output could be.

pclmulqdq 2 hours ago

In modern CPUs, a lot of these are recognized as zeroing idioms and they end up doing the same thing (often a register renaming trick). Using the shortest one makes sense. If you use a really weird zeroing pattern, you can also see it execute as a backend uop, while many of these zeroing idioms are elided by the frontend on some cores.

vanderZwan an hour ago

> In my 6502 hacking days, the presence of an exclusive OR was a sure-fire indicator you’d either found the encryption part of the code, or some kind of sprite routine.

Meanwhile, people like me who got started with a Z80 instead immediately knew why, since XOR A is the smallest and fastest way to clear the accumulator and flag register. Funny how that also shows how specific this is to a particular CPU lineage or its offshoots.

fooker 2 hours ago

It's funny how machine code is a high level language nowadays, for this example the CPU recognizes the zeroing pattern and does something quite a bit different.

  • Reubensson an hour ago

    What do you mean, the CPU does something different? Isn't the CPU doing what is being asked, namely an xor, with the consequence of zeroing when given two identical values?

    • IsTom an hour ago

      I think OP means that it has come a long way from the simple mental model of µops directly executing the operations, what with all the register renaming and so on.

HackerThemAll 16 minutes ago

> Interestingly, when zeroing the “extended” numbered registers (like r8), GCC still uses the d (double width, ie 32-bit) variant.

Of course. I might have some data stored in the higher dword of that register.

  • rfl890 11 minutes ago

    Which will still be zeroed.
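
    A quick sketch of what happens (hypothetical values, just to illustrate the zero-extension rule):

      mov r8, 0x1122334455667788
      xor r8d, r8d        ; a 32-bit write to r8d also zeroes bits 63:32
                          ; r8 is now 0, exactly as if you had written xor r8, r8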

omnicognate 2 hours ago

It happens to be the first instruction of the first snippet in the wonderful xchg rax,rax.

https://www.xorpd.net/pages/xchg_rax/snip_00.html

  • dooglius 2 hours ago

    Not sure what I am looking at here. Is this just a bunch of different ways to zero registers?

    • omnicognate 2 hours ago

      It's a collection of interesting assembly snippets ("gems and riddles" in the author's words) presented without commentary. People have posted annotated "solutions" online, but figuring out what the snippets do and why they are interesting is the fun of it.

      It's also available as an inscrutable printed book on Amazon.

flohofwoe 24 minutes ago

The actually surprising part to me is that such an important instruction uses a two byte encoding instead of one byte :)

jabedude an hour ago

Similarly, IIRC, on (some generations of) x86 chips, NOP is sugar around `XCHG EAX, EAX`, which is effectively a do-nothing operation

  • bitwize 9 minutes ago

    This is pretty much all x86 chips as far as I'm aware: opcode 0x90 which is equivalent to XCHG EAX,EAX.

    The 8080 and Z80's NOP was at opcode 0. Which was neat because you could make a "NOP slide" simply by zeroing out memory.

Dwedit 2 hours ago

Because "sub eax,eax" looks stupid. (and also clears the carry flag, unlike "xor eax, eax")

  • HackerThemAll 15 minutes ago

    If I remember correctly, sub used to be slower than xor on some ancient architectures.

  • tom_ 2 hours ago

    xor clears the carry as well? In fact, looks like xor and sub affect the same set of flags!

    xor:

    > The OF and CF flags are cleared; the SF, ZF, and PF flags are set according to the result. The state of the AF flag is undefined.

    sub:

    > The OF, SF, ZF, AF, PF, and CF flags are set according to the result.

    (I don't have an x64 system handy, but hopefully the reference manual can be trusted. I dimly remembered this, or something like it, tripping me up after coming from programming for the 6502.)

    • trollbridge 27 minutes ago

      This is a good thing because the pipeline now doesn't have to track the previous state of the flags, since they all got zeroed.

BiraIgnacio 21 minutes ago

Also cool that this got to be the top item on the HN front page

sixthDot an hour ago

I wrote a lot of `xor al,al` in my youth.

silverfrost 2 hours ago

Back on the Z80, 'xor a' is the shortest sequence to zero A.

dintech an hour ago

My brain read this as "Why not ear wax?"

bitwize an hour ago

Because mov eax, 0 requires fetching a constant and prolongs instruction fetching/execution. XOR A was a trick I learned back in the Z80 days.

fortran77 an hour ago

Back when I did IBM 370 BAL Assembly Language, we did the same thing to clear a register to zero.

  XR   15,15         XOR REGISTER 15 WITH REGISTER 15
vs

  L    15,=F'0'      LOAD REGISTER 15 WITH 0
This was alleged to be faster on the 370 because XR operated entirely within the CPU registers, while L (Load) fetched its data from memory (i.e., the constant came from program memory).

snvzz 2 hours ago

Because, unlike RISC-V, x86 has no x0 register.

  • crote 2 hours ago

    And the other way around: RISC-V doesn't have a move instruction so that's done as "dst = src + 0", and it doesn't have a nop instruction so that's done as "x0 = x0 + 0". There's like a dozen of them.

    It's quite interesting what neat tricks roll out once you've got a guaranteed zero register - it greatly reduces the number of distinct instructions you need for what is basically the same operation.

    • dist1ll 2 hours ago

      Another one is "jalr x0, imm(x0)", which turns an indirect branch into a direct jump to address "imm" in a single instruction w/o clobbering a register. Pretty neat.

  • jabl 2 hours ago

    From your past posting history, I presume that you're implying this makes RISC-V better?

    Do we have any data showing that having a dedicated zero register is better than a short and canonical instruction for zeroing an arbitrary register?

    • phire an hour ago

      The zero register helps RISC-V (and MIPS before it) really cut down on the number of instructions, and hardware complexity.

      You don't need a mov instruction; you just OR with $zero. You don't need a load-immediate instruction; you just ADDI/ORI with $zero. You don't need a NEG instruction; you just SUB from $zero. All your compare-and-branch instructions get a compare-with-$zero variant for free.
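
      Concretely, a sketch of the usual assembler pseudo-instruction expansions built on x0 (the standard aliases, so easy to verify against the spec):

        mv   t0, t1       # addi t0, t1, 0
        li   t0, 42       # addi t0, x0, 42   (for small immediates)
        neg  t0, t1       # sub  t0, x0, t1
        beqz t0, label    # beq  t0, x0, label
        nop               # addi x0, x0, 0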

      I refuse to say this "zero register" approach is better; it is part of a wider design with many interacting features. But once you have 31 registers, it's quite cheap to allocate one register to be zero, and it may actually save encoding space elsewhere. (And encoding space is always an issue with fixed-width instructions.)

      AArch64 takes the concept further: they have a register that sometimes acts as the zero register (when used in ALU instructions) and other times is the stack pointer (when used in memory instructions and a few special stack instructions).

      • phkahler an hour ago

        >> The zero register helps RISC-V (and MIPS before it) really cut down on the number of instructions, and hardware complexity.

        Which is funny because IMHO the RISC-V instruction encoding is garbage. It was all optimized around the idea of fixed-length 32-bit instructions. This leads to weird-sized immediates (12 bits?) and 2 instructions to load a 32-bit constant. No support for 64-bit immediates. Then they decided to have "compressed" instructions that are 16 bits, so it's somewhat variable length anyway.
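
        For example, loading a full 32-bit constant ends up as the classic two-instruction lui/addi pair (a sketch; note that addi sign-extends its 12-bit immediate, so the split needs adjusting for some values):

          lui  t0, 0x12345        # t0 = 0x12345000
          addi t0, t0, 0x678      # t0 = 0x12345678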

        IMHO once all the vector, AI, and graphics instructions are nailed down, they should make a RISC-VI where it's almost the same but with the instructions re-encoded. Have sensible 16-bit ones and 32-bit ones, and put immediate constants after the opcodes. It seems like there is a lot they could do to clean it up - obviously not as much as x86 ;-)

        • zozbot234 26 minutes ago

          There's not a strong case for redoing the RISC-V encoding with a new RISC-VI unless they run out of 32-bit encoding space outright, due to e.g. extensive new vector-like or AI-like instructions. And then they could free up a huge amount of encoding space trivially by moving to a 2-address format throughout with Rd=Rs1 and using a simple instruction fusion approach MOV Rd ← Rs1; OP Rd ← etc. for the former 3-address case.

          (Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach.)

        • adgjlsfhk1 24 minutes ago

          IMO the RISC-V decoding is really elegant (arguably excepting the C extension). Things like 64-bit immediates are almost certainly a bad idea (as opposed to just having a load from memory). Most 64-bit constants in use can be sign-extended from much smaller values, and for those that can't, supporting 72-bit (or bigger) instructions just to be able to load a 64-bit immediate will necessarily bloat the instruction cache, stall your instruction decoder (or limit parallelism), and will only be 2 cycles faster than an L1 cache load (if the instruction is hot). A 32-bit immediate would be kind of nice, but the benefit is pretty small. An x86 instruction with a 32-bit immediate is 6 bytes, while the 2 RISC-V instructions are 8 bytes. There have been proposals to add 48-bit instructions, which would let RISC-V have 32-bit immediate support with the same 6 bytes as x86 (and 12-byte two-instruction 64-bit loads vs 10 bytes for x86, in the very rare situations where doing so will be faster than a load).

          ISA design is always a tradeoff, https://ics.uci.edu/~swjun/courses/2023F-CS250P/materials/le... has some good details, but the TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA.

    • wongarsu an hour ago

      MIPS for example also has one, along with a similar number of registers (~32). So it's not like RISC-V took a radical new position here; they were able to look back at what worked and what didn't, and decided that for their target a zero register was the right tradeoff. It's certainly the more "elegant" solution. A zero register is useful as an input or output register for all kinds of operations, not just for zeroing.

    • kevin_thibedeau 2 hours ago

      It's a definite liability on a machine with only 8 general purpose registers. Losing 12% of the register space for a constant would be a waste of hardware.

      • menaerus 2 hours ago

        8 registers? Ever heard of register renaming?

        • Polizeiposaune an hour ago

          Ever heard of a loop that needed to keep more than 7 variables live? Register renaming helps with pipelining and out-of-order execution, but instructions in the program can only reference the architectural registers - go beyond that and you end up needing to spill some values to (architectural) memory.

          There's a reason why AMD added r8-r15 to the architecture, and why Intel is adding r16-r31.

          • menaerus 21 minutes ago

            I have, but that was not the point? My first point was exactly that there are more ISA registers than only 8, hence the question mark. My second point was about register renaming which, contrary to what you say, does mitigate the effects of running out of registers and spilling variables to stack memory. It does this by eliminating false dependencies between variables/registers, and xor eax, eax is a great candidate for that.

        • account42 an hour ago

          That's irrelevant; the zero register would take up a slot in the limited register-addressing bits in instructions, not replace a physical register on the chip.

    • gruez 2 hours ago

      It's basically the eternal debate of RISC vs CISC (x86). RISC proponents claim RISC is better because it's simpler to decode. CISC proponents retort that CISC means code can be more compact, which helps with cache hits.

      • bluGill an hour ago

        In the real world there is no CISC or RISC anymore. RISC is always extended with some new feature and suddenly becomes more complex. Meanwhile CISC is just a decoder over a RISC processor. Either way you get the best of both worlds: simple hardware (the RISC internals) and CISC instructions that do what you need.

        Don't get too carried away with the above; x86 is still a lot more complex than ARM or RISC-V. However, the complexity is only a tiny part of a CPU and so it doesn't matter.

    • dooglius 2 hours ago

      I think one could just pick a convention where a particular GP register is zeroed at program startup and just make your own zero register that way, getting all the benefits at very small cost. The microarchitecture AIUI has a dedicated zero register so any processor-level optimizations would still apply.

      • pklausler an hour ago

        That’s what was done on the CDC 6600 with two handy values, B0 (0) and B1 (1).

  • gpderetta an hour ago

    x86 doesn't need a zero register as it can encode constants in the instruction itself.

sylware 2 hours ago

Remnant of RISC attempt without a zero register.

OgsyedIE 2 hours ago

The page crashes after 3 seconds, 100% of the time, on the latest version of Android Chrome and works fine on Brave, fyi.

  • robmccoll 2 hours ago

    This is not my experience on the latest version of Chrome Android (142.0.7444.171). It did not crash for me.