 high memory     +------------------------+
                 |          ....          |
                 |          ....          |
 virtual frame   |       argument n       |
 pointer $(fp) ->+------------------------+ \
               / |                        |  \
  frame offset   |  local & temporaries   |   \
               \ |                        |    \
                \+------------------------+     \
                 |    saved registers     |      frame size
                 | (including returnreg)  |     /
                 +------------------------+    /
                 |          ....          |   /
 stack           |     argument build     |  /
 pointer $(sp) ->+------------------------+ /
 (framereg)      |          ....          |
                 |          ....          |
                 |                        |
                 |                        |
                 |                        |
 low memory      +------------------------+
.frame framereg, framesize, returnreg
The virtual frame pointer is a frame pointer as used in other compiler systems but has no register allocated for it. It consists of the framereg ($sp, in most cases) added to the framesize. The returnreg specifies the register containing the return address (usually $ra).
.mask bitmask, frameoffset
The .mask directive specifies the registers to be stored and where they are stored. A bit should be on in bitmask for each register saved (for example, if register $31 is saved, bit 31 should be ‘1’ in bitmask). Bits are set in bitmask in little-endian order, even if the machine configuration is big-endian. The frameoffset is the offset from the virtual frame pointer (this number is usually negative).
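The bit-per-register rule is easy to get wrong by one position, so here is a minimal Python sketch of it. The helper names build_mask and decode_mask are made up for illustration; they are not part of any assembler.

```python
def build_mask(saved_regs):
    """Return the .mask/.fmask bitmask for an iterable of register numbers (0..31)."""
    mask = 0
    for reg in saved_regs:
        if not 0 <= reg <= 31:
            raise ValueError(f"register number out of range: {reg}")
        mask |= 1 << reg  # bit N set means register $N is saved
    return mask


def decode_mask(mask):
    """Return the sorted list of register numbers whose bits are set in mask."""
    return [reg for reg in range(32) if mask & (1 << reg)]


print(hex(build_mask([30])))    # mask for saving $30 ($fp) only
print(decode_mask(0x80010000))  # registers named by bits 16 and 31
```

Saving only $30 gives 0x40000000, which is the value that appears in a `.mask 0x40000000,-8` line.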
.fmask bitmask, frameoffset
Notice that saving floating-point registers is identical to saving general registers except we use the .fmask pseudo-op instead of .mask, and the stores are of floating-point singles or doubles.
    .text
    .cfi_sections .debug_frame
    .align 2
    .globl main
    .cfi_startproc
    .ent main
    .type main, @function
main:
    .frame $fp,32,$31 # vars= 16, regs= 1/0, args= 0, gp= 0
    .mask 0x40000000,-8
    .fmask 0x00000000,0
    .set noreorder
    .set nomacro
    daddiu $sp,$sp,-32
    .cfi_def_cfa_offset 32
    sd $fp,24($sp)
    .cfi_offset 30, -8
    move $fp,$sp
    .cfi_def_cfa_register 30
    move $2,$4
    sd $5,8($fp)
    sll $2,$2,0
    sw $2,0($fp)
    .loc 1 6 12
    move $2,$0
    .loc 1 7 1
    move $sp,$fp
    .cfi_def_cfa_register 29
    ld $fp,24($sp)
    daddiu $sp,$sp,32
    .cfi_restore 30
    .cfi_def_cfa_offset 0
    jr $31
    nop
    .set macro
    .set reorder
    .end main
    .cfi_endproc
    .size main, .-main
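The numbers in this listing can be cross-checked with a little arithmetic: the virtual frame pointer is the frame register plus 32, and the save area starts 8 bytes below it, i.e. at offset 24 from the frame register, which matches the `sd $fp,24($sp)` store. The Python sketch below does that calculation. The function name is hypothetical, and the ordering for several saved registers (highest-numbered first, then descending addresses) is an assumption; check it against your assembler's manual.

```python
def save_slot_offsets(mask, frame_size, frame_offset, slot_size=8):
    """Map each saved register number to its offset from the frame register.

    Assumption: the highest-numbered register sits at
    (virtual frame pointer + frame_offset) and the rest follow at
    descending addresses, slot_size bytes apart.
    """
    slots = {}
    offset = frame_size + frame_offset  # relative to framereg
    for reg in range(31, -1, -1):
        if mask & (1 << reg):
            slots[reg] = offset
            offset -= slot_size
    return slots


# Values taken from ".frame $fp,32,$31" and ".mask 0x40000000,-8".
print(save_slot_offsets(0x40000000, 32, -8))
```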
(This post uses x86-64 for illustration throughout. The fundamentals are similar for other platforms but will need some translation that I don’t cover here.)
Despite compilers getting better over time, it’s still the case that hand-written assembly can be worthwhile for certain hot-spots. Sometimes there are special CPU instructions for the thing that you’re trying to do, sometimes you need detailed control of the resulting code and, to some extent, it remains possible for some people to out-optimise a compiler.
But hand-written assembly doesn’t automatically get some of the things that the compiler generates for normal code, such as debugging information. Perhaps your assembly code never crashes (although any function that takes a pointer can suffer from bugs in other code) but you probably still care about accurate profiling information. In order for debuggers to walk up the stack in a core file, or for profilers to correctly account for CPU time, they need to be able to unwind call frames.
Unwinding used to be easy as every function would have a standard prologue:
    push rbp
    mov rbp, rsp
This would make the stack look like this (remember that stacks grow downwards in memory):
So, upon entry to a function, the CALL instruction that jumped to the function in question will have pushed the previous program counter (from the RIP register) onto the stack. Then the function prologue saves the current value of RBP on the stack and copies the current value of the stack pointer into RBP. From this point until the function is complete, RBP won’t be touched.
This makes stack unwinding easy because RBP always points to the call frame for the current function. That gets you the saved address of the parent call and the saved value of its RBP and so on.
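To make the "and so on" concrete, here is a small Python sketch of that walk over a toy memory image. Everything in it is made up for illustration: the dictionary standing in for memory, the addresses, and the convention that a saved RBP of zero marks the outermost frame.

```python
def walk_frame_pointers(memory, rbp, max_frames=64):
    """Return the return addresses found by following the saved-RBP chain.

    memory maps addresses to 64-bit values (a stand-in for reading a core
    file). A saved RBP of 0 is treated as the end of the chain.
    """
    return_addresses = []
    while rbp != 0 and len(return_addresses) < max_frames:
        saved_rbp = memory[rbp]        # written by "push rbp"
        return_addr = memory[rbp + 8]  # written by the CALL instruction
        return_addresses.append(return_addr)
        rbp = saved_rbp
    return return_addresses


memory = {
    0x7000: 0x7020, 0x7008: 0x401234,  # innermost frame
    0x7020: 0x7040, 0x7028: 0x401100,
    0x7040: 0x0,    0x7048: 0x401000,  # outermost frame
}
print([hex(a) for a in walk_frame_pointers(memory, 0x7000)])
```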
The problems with this scheme are that a) the function prologue can be excessive for small functions and b) we would like to be able to use RBP as a general purpose register to avoid spills. Which is why the GCC documentation says that “-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging”. This means that you can’t depend on being able to unwind stacks like this. A process can be comprised of various shared libraries, any of which might be compiled with optimisations.
To be able to unwind the stack without depending on this convention, additional debugging tables are needed. The compiler will generate these automatically (when asked) for code that it generates, but it’s something that we need to worry about when writing assembly functions ourselves if we want profilers and debuggers to work.
The reference for the assembly directives that we’ll need is here, but they are very lightly documented. You can understand more by reading the DWARF spec, which documents the data that is being generated. Specifically see sections 6.4 and D.6. But I’ll try to tie the two together in this post.
The tables that we need the assembler to emit for us are called Call Frame Information (CFI). (Not to be confused with Control Flow Integrity, which is very different.) Based on that name, all the assembler directives begin with .cfi_.
Next we need to define the Canonical Frame Address (CFA). This is the value of the stack pointer just before the CALL instruction in the parent function. In the diagram above, it’s the value indicated by “RSP value before CALL”. Our first task will be to define data that allows the CFA to be calculated for any given instruction.
The CFI tables allow the CFA to be expressed as a register value plus an offset. For example, immediately upon function entry the CFA is RSP + 8. (The eight byte offset is because the CALL instruction will have pushed the previous RIP on the stack.)
As the function executes, however, the expression will probably change. If nothing else, after pushing a value onto the stack we would need to increase the offset.
So one design for the CFI table would be to store a (register, offset) pair for every instruction. Conceptually that’s what we do but, to save space, only changes from instruction to instruction are stored.
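As a rough sketch of that space saving, the Python below takes one (register, offset) pair per instruction and keeps only the rows where the pair changes, then looks up the rule for an arbitrary program counter. The function names and the instruction addresses are illustrative assumptions, not anything the assembler exposes.

```python
import bisect


def compress_rows(per_instruction):
    """Keep only the rows where the (register, offset) pair changes.

    per_instruction is a list of (address, (register, offset)) sorted by address.
    """
    changes = []
    previous = None
    for address, rule in per_instruction:
        if rule != previous:
            changes.append((address, rule))
            previous = rule
    return changes


def cfa_rule_at(changes, pc):
    """Return the rule in force at pc: the last change at or before pc."""
    addresses = [address for address, _ in changes]
    index = bisect.bisect_right(addresses, pc) - 1
    if index < 0:
        raise LookupError(f"no CFA rule covers address {pc:#x}")
    return changes[index][1]


per_instruction = [
    (0x0, ("rsp", 8)), (0x1, ("rsp", 16)), (0x4, ("rbp", 16)),
    (0x7, ("rbp", 16)), (0xA, ("rbp", 16)), (0xE, ("rbp", 16)),
    (0xF, ("rsp", 8)),
]
changes = compress_rows(per_instruction)
print(changes)
print(cfa_rule_at(changes, 0xA))
```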
It’s time for an example, so here’s a trivial assembly function that includes CFI directives and a running commentary.
    .globl square
    .type square,@function
    .hidden square
square:
This is a standard preamble for a function that’s unrelated to CFI. Your assembly code should already be full of this.
    .cfi_startproc
Our first CFI directive. This is needed at the start of every annotated function. It causes a new CFI table for this function to be initialised.
.cfi_def_cfa rsp, 8
This is defining the CFA expression as a register plus offset. One of the things that you’ll see compilers do is express the registers as numbers rather than names. But, at least with GAS, you can write names. (I’ve included a table of DWARF register numbers and names below in case you need it.)
Getting back to the directive, this is just specifying what I discussed above: on entry to a function, the CFA is at RSP + 8.
    push rbp
    .cfi_def_cfa rsp, 16
After pushing something to the stack, the value of RSP will have changed so we need to update the CFA expression. It’s now RSP + 16, to account for the eight bytes we pushed.
    mov rbp, rsp
    .cfi_def_cfa rbp, 16
This function happens to have a standard prologue, so we’ll save the frame pointer in RBP, following the old convention. Thus, for the rest of the function we can define the CFA as RBP + 16 and manipulate the stack without having to worry about it again.
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    imul eax, DWORD PTR [rbp-4]
    pop rbp
    .cfi_def_cfa rsp, 8
We’re getting ready to return from this function and, after restoring RBP from the stack, the old CFA expression is invalid because the value of RBP has changed. So we define it as RSP + 8 again.
    ret
    .cfi_endproc
At the end of the function we need to trigger the CFI table to be emitted. (It’s an error if a CFI table is left open at the end of the file.)
The CFI tables for an object file can be dumped with objdump -W and, if you do that for the example above, you’ll see two tables: something called a CIE and something called an FDE.
The CIE (Common Information Entry) table contains information common to all functions and it’s worth taking a look at it:
… CIE
  Version:               1
  Augmentation:          "zR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     1b

  DW_CFA_def_cfa: r7 (rsp) ofs 8
  DW_CFA_offset: r16 (rip) at cfa-8
You can ignore everything until the DW_CFA_… lines at the end. They define CFI directives that are common to all functions (that reference this CIE). The first is saying that the CFA is at RSP + 8, which is what we had already defined at function entry. This means that you don’t need a CFI directive at the beginning of the function. Basically RSP + 8 is already the default.
The second directive is something that we’ll get to when we discuss saving registers.
If we look at the FDE (Frame Description Entry) for the example function that we defined, we see that it reflects the CFI directives from the assembly:
… FDE cie=…
  DW_CFA_advance_loc: 1 to 0000000000000001
  DW_CFA_def_cfa: r7 (rsp) ofs 16
  DW_CFA_advance_loc: 3 to 0000000000000004
  DW_CFA_def_cfa: r6 (rbp) ofs 16
  DW_CFA_advance_loc: 11 to 000000000000000f
  DW_CFA_def_cfa: r7 (rsp) ofs 8
The FDE describes the range of instructions that it’s valid for and is a series of operations to either update the CFA expression, or to skip over the next n bytes of instructions. Fairly obvious.
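If it isn’t obvious, the following Python sketch replays the operations from the dump above and rebuilds the table of (address, CFA rule) rows. It is a toy interpreter that models only the two operations shown, starting from the CIE’s RSP + 8 rule; the tuple representation and the function name are assumptions made for illustration.

```python
def run_cfi_program(initial_rule, ops):
    """Return [(address, (register, offset)), ...] for a toy CFI program.

    Only ("advance_loc", delta) and ("def_cfa", register, offset) are modelled.
    """
    address = 0
    rule = initial_rule
    rows = [(address, rule)]
    for op in ops:
        if op[0] == "advance_loc":
            address += op[1]               # start a new row further along
            rows.append((address, rule))
        elif op[0] == "def_cfa":
            rule = (op[1], op[2])          # update the current row's rule
            rows[-1] = (address, rule)
        else:
            raise ValueError(f"unsupported operation: {op[0]}")
    return rows


fde_ops = [
    ("advance_loc", 1), ("def_cfa", "rsp", 16),
    ("advance_loc", 3), ("def_cfa", "rbp", 16),
    ("advance_loc", 11), ("def_cfa", "rsp", 8),
]
rows = run_cfi_program(("rsp", 8), fde_ops)
for address, rule in rows:
    print(hex(address), rule)
```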
Optimisations for CFA directives
There are some shortcuts when writing CFA directives:
Firstly, you can update just the offset, or just the register, with cfi_def_cfa_offset and cfi_def_cfa_register respectively. This not only saves typing in the source file, it saves bytes in the table too.
Secondly, you can update the offset with a relative value using cfi_adjust_cfa_offset. This is useful when pushing lots of values to the stack as the offset will increase by eight each time.
Here’s the example from above, but using these directives and omitting the first directive that we don’t need because of the CIE:
    .globl square
    .type square,@function
    .hidden square
square:
    .cfi_startproc
    push rbp
    .cfi_adjust_cfa_offset 8
    mov rbp, rsp
    .cfi_def_cfa_register rbp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    imul eax, DWORD PTR [rbp-4]
    pop rbp
    .cfi_def_cfa rsp, 8
    ret
    .cfi_endproc
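To check that the shortcut directives produce the same CFA expressions as the long-hand version, here is a tiny Python model of the four directives. The class is hypothetical and only tracks the (register, offset) pair; it assumes the RSP + 8 default that the CIE provides on x86-64.

```python
class CfaState:
    """Tracks the current CFA rule as a (register, offset) pair."""

    def __init__(self, register="rsp", offset=8):
        self.register = register
        self.offset = offset

    def def_cfa(self, register, offset):
        self.register, self.offset = register, offset

    def def_cfa_register(self, register):
        self.register = register          # offset is left unchanged

    def def_cfa_offset(self, offset):
        self.offset = offset              # absolute offset, register unchanged

    def adjust_cfa_offset(self, delta):
        self.offset += delta              # relative to the previous offset

    def rule(self):
        return (self.register, self.offset)


state = CfaState()
state.adjust_cfa_offset(8)       # after "push rbp"
after_push = state.rule()
state.def_cfa_register("rbp")    # after "mov rbp, rsp"
after_mov = state.rule()
state.def_cfa("rsp", 8)          # after "pop rbp"
after_pop = state.rule()
print(after_push, after_mov, after_pop)
```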
Consider a profiler that is unwinding the stack after a profiling signal. It calculates the CFA of the active function and, from that, finds the parent function. Now it needs to calculate the parent function’s CFA and, from the CFI tables, discovers that it’s related to RBX. Since RBX is a callee-saved register, that’s reasonable, but the active function might have stomped RBX. So, in order for the unwinding to proceed it needs a way to find where the active function saved the old value of RBX. So there are more CFI directives that let you document where registers have been saved.
Registers can either be saved at an offset from the CFA (i.e. on the stack), or in another register. Most of the time they’ll be saved on the stack though because, if you had a caller-saved register to spare, you would be using it first.
To indicate that a register is saved on the stack, use cfi_offset. In the same example as above (see the stack diagram at the top) the caller’s RBP is saved at CFA – 16 bytes. So, with saved registers annotated too, it would start like this:
square: .cfi_startproc push rbp .cfi_adjust_cfa_offset 8 .cfi_offset rbp, -16
If you need to save a register in another register for some reason, see the documentation for cfi_register.
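Here is roughly what an unwinder does with that information, as a Python sketch over a toy register file and memory image. The function name, the addresses, and the dictionary representation are all assumptions for illustration; the offsets are the ones discussed above (caller’s RBP at CFA – 16, return address at CFA – 8).

```python
def unwind_one_frame(regs, memory, cfa_rule, saved_at):
    """Return the caller's registers given the callee's state.

    cfa_rule is (register, offset); saved_at maps a register name to the
    offset from the CFA at which its previous value was stored.
    """
    base, offset = cfa_rule
    cfa = regs[base] + offset
    caller = dict(regs)
    for name, saved_offset in saved_at.items():
        caller[name] = memory[cfa + saved_offset]
    caller["rsp"] = cfa  # the CFA is the caller's RSP just before the CALL
    return caller


# State inside square after "push rbp; mov rbp, rsp" (values invented).
regs = {"rsp": 0x7FE0, "rbp": 0x7FE0, "rip": 0x400010}
memory = {0x7FE0: 0x8010,     # caller's RBP, at CFA - 16
          0x7FE8: 0x401234}   # return address, at CFA - 8
caller = unwind_one_frame(regs, memory, ("rbp", 16), {"rbp": -16, "rip": -8})
print({name: hex(value) for name, value in caller.items()})
```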
If you get all of that correct then your debugger should be able to unwind crashes correctly, and your profiler should be able to avoid recording lots of detached functions. However, I’m afraid that I don’t know of a better way to test this than to zero RBP, add a crash in the assembly code, and check whether GDB can go up correctly.
(None of this works for Windows. But Per Vognsen, via Twitter, notes that there are similar directives in MASM.)
New in version three of the DWARF standard are CFI Expressions. These define a stack machine for calculating the CFA value and can be useful when your stack frame is non-standard (which is fairly common in assembly code). However, there’s no assembler support for them that I’ve been able to find, so one has to use cfi_escape and provide the raw DWARF data in a .s file. As an example, see this kernel patch.
Since there’s no assembler support, you’ll need to read section 2.5 of the standard, then search for DW_CFA_def_cfa_expression and, perhaps, search for cfi_directive in OpenSSL’s perlasm script for x86-64 and the places in OpenSSL where that is used. Good luck.
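To give a flavour of what writing the raw bytes involves, here is a Python sketch that assembles a DW_CFA_def_cfa_expression and prints it as a .cfi_escape line. The opcode values are the ones I believe the DWARF standard assigns (check them against the spec before relying on this), and the frame layout in the example, CFA = *(RSP + 8) + 8, is invented purely for illustration.

```python
DW_CFA_def_cfa_expression = 0x0F
DW_OP_deref = 0x06
DW_OP_plus_uconst = 0x23
DW_OP_breg0 = 0x70  # DW_OP_breg0 .. DW_OP_breg31 are consecutive


def uleb128(value):
    """Encode a non-negative integer as unsigned LEB128 bytes."""
    if value < 0:
        raise ValueError("uleb128 needs a non-negative value")
    out = []
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return out


def sleb128(value):
    """Encode a signed integer as signed LEB128 bytes."""
    out = []
    while True:
        byte = value & 0x7F
        value >>= 7  # Python's >> is an arithmetic shift
        sign_bit = bool(byte & 0x40)
        done = (value == 0 and not sign_bit) or (value == -1 and sign_bit)
        out.append(byte if done else byte | 0x80)
        if done:
            return out


def def_cfa_expression(expression):
    """Format a DWARF expression as a .cfi_escape directive."""
    body = [DW_CFA_def_cfa_expression] + uleb128(len(expression)) + list(expression)
    return ".cfi_escape " + ",".join(f"0x{b:02x}" for b in body)


# DW_OP_breg7 8; DW_OP_deref; DW_OP_plus_uconst 8  (register 7 is RSP)
expression = [DW_OP_breg0 + 7] + sleb128(8) + [DW_OP_deref, DW_OP_plus_uconst] + uleb128(8)
line = def_cfa_expression(expression)
print(line)
```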
(I suggest testing by adding some instructions that write to NULL in the assembly code and checking that gdb can correctly step up the stack and that info reg shows the correct values for callee-saved registers in the parent frame.)
CFI register numbers
In case you need to use or read the raw register numbers, here they are for a few architectures:
(may be EBP on MacOS) (may be ESP on MacOS)
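For x86-64 specifically, the numbering can be written as a small lookup table. This Python sketch follows the System V x86-64 psABI numbering as I understand it (note that RDX and RCX are not in the order you might guess); the list name is made up, and you should check the ABI document before relying on any entry.

```python
# Index = DWARF register number, value = register name (x86-64, System V).
X86_64_DWARF_REGISTERS = [
    "rax", "rdx", "rcx", "rbx", "rsi", "rdi", "rbp", "rsp",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
    "rip",  # 16 is the return address column, shown as rip by objdump
]


def dwarf_number(name):
    """Return the DWARF register number for an x86-64 register name."""
    return X86_64_DWARF_REGISTERS.index(name)


print(dwarf_number("rbp"), dwarf_number("rsp"), dwarf_number("rip"))
```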
.cfi_sections may be used to specify whether CFI directives should emit .eh_frame section and/or .debug_frame section. If section_list is .eh_frame, .eh_frame is emitted, if section_list is .debug_frame, .debug_frame is emitted. To emit both use .eh_frame, .debug_frame. The default if this directive is not used is .cfi_sections .eh_frame.
On targets that support compact unwinding tables these can be generated by specifying .eh_frame_entry instead of .eh_frame.
Some targets may support an additional name, such as .c6xabi.exidx which is used by the target.
The .cfi_sections directive can be repeated, with the same or different arguments, provided that CFI generation has not yet started. Once CFI generation has started however the section list is fixed and any attempts to redefine it will result in an error.
.cfi_startproc is used at the beginning of each function that should have an entry in .eh_frame. It initializes some internal data structures. Don’t forget to close the function by .cfi_endproc.
Unless .cfi_startproc is used along with parameter simple it also emits some architecture dependent initial CFI instructions.
.cfi_endproc is used at the end of a function where it closes its unwind entry previously opened by .cfi_startproc, and emits it to .eh_frame.
.cfi_personality encoding [, exp]
.cfi_personality defines personality routine and its encoding. encoding must be a constant determining how the personality should be encoded. If it is 255 (DW_EH_PE_omit), second argument is not present, otherwise second argument should be a constant or a symbol name. When using indirect encodings, the symbol provided should be the location where personality can be loaded from, not the personality routine itself. The default after .cfi_startproc is .cfi_personality 0xff, no personality routine.
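The encoding constant is a DW_EH_PE_* byte: the low four bits give the value format, bits 4 to 6 say what the value is relative to, and the top bit marks an indirect encoding. As a hedged illustration, the Python sketch below decodes such a byte; the constant values follow the Linux Standard Base description of DWARF exception-header encodings as best I know them, and the function name is hypothetical.

```python
DW_EH_PE_omit = 0xFF
DW_EH_PE_indirect = 0x80

VALUE_FORMATS = {
    0x00: "absptr", 0x01: "uleb128", 0x02: "udata2", 0x03: "udata4",
    0x04: "udata8", 0x09: "sleb128", 0x0A: "sdata2", 0x0B: "sdata4",
    0x0C: "sdata8",
}
APPLICATIONS = {
    0x00: "", 0x10: "pcrel", 0x20: "textrel", 0x30: "datarel",
    0x40: "funcrel", 0x50: "aligned",
}


def describe_encoding(encoding):
    """Return a readable description of a DW_EH_PE_* encoding byte."""
    if encoding == DW_EH_PE_omit:
        return "omit"
    parts = []
    if encoding & DW_EH_PE_indirect:
        parts.append("indirect")
    application = APPLICATIONS[encoding & 0x70]
    if application:
        parts.append(application)
    parts.append(VALUE_FORMATS[encoding & 0x0F])
    return "|".join(parts)


print(describe_encoding(0x1B))
print(describe_encoding(0x9B))
```

For instance, an augmentation data byte of 1b in a CIE dump decodes to a PC-relative signed 4-byte value.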
.cfi_personality_id defines a personality routine by its index as defined in a compact unwinding format. Only valid when generating compact EH frames (i.e. with .cfi_sections .eh_frame_entry).
.cfi_fde_data [opcode1 [, …]]
.cfi_fde_data is used to describe the compact unwind opcodes to be used for the current function. These are emitted inline in the .eh_frame_entry section if small enough and there is no LSDA, or in the .gnu.extab section otherwise. Only valid when generating compact EH frames (i.e. with .cfi_sections .eh_frame_entry).
.cfi_lsda encoding [, exp]
.cfi_lsda defines LSDA and its encoding. encoding must be a constant determining how the LSDA should be encoded. If it is 255 (DW_EH_PE_omit), the second argument is not present, otherwise the second argument should be a constant or a symbol name. The default after .cfi_startproc is .cfi_lsda 0xff, meaning that no LSDA is present.
.cfi_inline_lsda marks the start of an LSDA data section and switches to the corresponding .gnu.extab section. Must be preceded by a CFI block containing a .cfi_lsda directive. Only valid when generating compact EH frames (i.e. with .cfi_sections .eh_frame_entry).
The table header and unwinding opcodes will be generated at this point, so that they are immediately followed by the LSDA data. The symbol referenced by the .cfi_lsda directive should still be defined in case a fallback FDE based encoding is used. The LSDA data is terminated by a section directive.
The optional align argument of .cfi_inline_lsda specifies the alignment required. The alignment is specified as a power of two, as with the .p2align directive.
.cfi_def_cfa register, offset
.cfi_def_cfa defines a rule for computing CFA as: take address from register and add offset to it.
.cfi_def_cfa_register modifies a rule for computing CFA. From now on register will be used instead of the old one. Offset remains the same.
.cfi_def_cfa_offset modifies a rule for computing CFA. Register remains the same, but offset is new. Note that it is the absolute offset that will be added to a defined register to compute CFA address.
.cfi_adjust_cfa_offset is the same as .cfi_def_cfa_offset, but offset is a relative value that is added to or subtracted from the previous offset.
.cfi_offset register, offset
Previous value of register is saved at offset offset from CFA.
.cfi_val_offset register, offset
Previous value of register is CFA + offset.
.cfi_rel_offset register, offset
Previous value of register is saved at offset offset from the current CFA register. This is transformed to .cfi_offset using the known displacement of the CFA register from the CFA. This is often easier to use, because the number will match the code it’s annotating.
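The transformation amounts to subtracting the current CFA offset. A minimal Python sketch of the arithmetic, with a made-up function name, assuming the CFA rule is currently "CFA register + cfa_offset":

```python
def rel_offset_to_cfa_offset(rel_offset, cfa_offset):
    """Convert a .cfi_rel_offset argument to the equivalent .cfi_offset argument.

    rel_offset is measured from the current CFA register; cfa_offset is the
    current CFA offset, i.e. CFA = cfa_register + cfa_offset.
    """
    return rel_offset - cfa_offset


# With CFA = rsp + 16, a register stored at 0(rsp) lives at CFA - 16.
print(rel_offset_to_cfa_offset(0, 16))
# With CFA = rsp + 32, a register stored at 8(rsp) lives at CFA - 24.
print(rel_offset_to_cfa_offset(8, 32))
```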
.cfi_register register1, register2
Previous value of register1 is saved in register register2.
.cfi_restore says that the rule for register is now the same as it was at the beginning of the function, after all initial instructions added by .cfi_startproc were executed.
.cfi_undefined says that from now on the previous value of register can’t be restored anymore.
.cfi_same_value says that the current value of register is the same as in the previous frame, i.e. no restoration is needed.
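The register rules described so far differ only in how an unwinder turns them into the caller's value. The Python sketch below models that for a toy register file and memory image; the rule tuples, the function name, and the sample values are assumptions made for illustration, not GAS or DWARF data structures.

```python
def recover_register(name, rule, cfa, regs, memory):
    """Return the caller's value of register `name`, or None if unrecoverable."""
    kind = rule[0]
    if kind == "offset":       # .cfi_offset: value is stored at CFA + N
        return memory[cfa + rule[1]]
    if kind == "val_offset":   # .cfi_val_offset: value is CFA + N itself
        return cfa + rule[1]
    if kind == "register":     # .cfi_register: value lives in another register
        return regs[rule[1]]
    if kind == "same_value":   # .cfi_same_value: register was not modified
        return regs[name]
    if kind == "undefined":    # .cfi_undefined: value cannot be recovered
        return None
    raise ValueError(f"unknown rule kind: {kind}")


cfa = 0x7FF0
regs = {"rbx": 5, "r11": 0x99}
memory = {0x7FE0: 0x8010}
print(hex(recover_register("rbp", ("offset", -16), cfa, regs, memory)))
print(hex(recover_register("rsp", ("val_offset", 0), cfa, regs, memory)))
```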
.cfi_remember_state and .cfi_restore_state
.cfi_remember_state pushes the set of rules for every register onto an implicit stack, while .cfi_restore_state pops them off the stack and places them in the current row. This is useful for situations where you have multiple .cfi_* directives that need to be undone due to the control flow of the program. For example, we could have something like this (assuming the CFA is the value of rbp):
    je label
    popq %rbx
    .cfi_restore %rbx
    popq %r12
    .cfi_restore %r12
    popq %rbp
    .cfi_restore %rbp
    .cfi_def_cfa %rsp, 8
    ret
label:
    /* Do something else */
Here, we want the .cfi directives to affect only the rows corresponding to the instructions before label. This means we’d have to add multiple .cfi directives after label to recreate the original save locations of the registers, as well as setting the CFA back to the value of rbp. This would be clumsy, and result in a larger binary size. Instead, we can write:
    je label
    popq %rbx
    .cfi_remember_state
    .cfi_restore %rbx
    popq %r12
    .cfi_restore %r12
    popq %rbp
    .cfi_restore %rbp
    .cfi_def_cfa %rsp, 8
    ret
label:
    .cfi_restore_state
    /* Do something else */
That way, the rules for the instructions after label will be the same as before the first .cfi_restore without having to use multiple .cfi directives.
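The implicit stack can be modelled in a few lines of Python. This is only a sketch: the class and the save offsets are invented, and it assumes that the CFA rule is remembered together with the register rules, as the example above implies.

```python
class CfiRows:
    """Current CFA rule and register save rules, plus the implicit state stack."""

    def __init__(self, cfa, saved):
        self.cfa = cfa              # (register, offset)
        self.saved = dict(saved)    # register name -> offset from CFA
        self._stack = []

    def remember_state(self):
        self._stack.append((self.cfa, dict(self.saved)))

    def restore_state(self):
        self.cfa, self.saved = self._stack.pop()

    def restore(self, register):
        # Back to the initial rule, which here means "not saved anywhere".
        self.saved.pop(register, None)

    def def_cfa(self, register, offset):
        self.cfa = (register, offset)


rows = CfiRows(("rbp", 16), {"rbx": -40, "r12": -32, "rbp": -16})
rows.remember_state()
for register in ("rbx", "r12", "rbp"):
    rows.restore(register)
rows.def_cfa("rsp", 8)
at_ret = (rows.cfa, dict(rows.saved))       # rules in force at "ret"
rows.restore_state()
at_label = (rows.cfa, dict(rows.saved))     # rules in force after "label:"
print(at_ret)
print(at_label)
```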
.cfi_return_column changes the return column register, i.e. the return address is either directly in register or can be accessed by rules for register.
.cfi_signal_frame marks the current function as a signal trampoline.
.cfi_window_save records that the SPARC register window has been saved.
.cfi_escape expression[, …]
Allows the user to add arbitrary bytes to the unwind info. One might use this to add OS-specific CFI opcodes, or generic CFI opcodes that GAS does not yet support.
.cfi_val_encoded_addr register, encoding, label
The current value of register is label. The value of label will be encoded in the output file according to encoding; see the description of .cfi_personality for details on this encoding.
The usefulness of equating a register to a fixed label is probably limited to the return address register. Here, it can be useful to mark a code segment that has only one return address which is reached by a direct branch and no copy of the return address exists in memory or another register.
A very non-intuitive property of the Alpha processor is that it allows the following behavior:
Initially: p = & x, x = 1, y = 0

    Thread 1        |  Thread 2
    --------------------------------
    y = 1           |
    memoryBarrier   |  i = *p
    p = & y         |
    --------------------------------

Can result in: i = 0
This behavior means that the reader needs to perform a memory barrier in lazy initialization idioms (e.g., Double-checked locking) and creates issues for synchronization-free immutable objects (e.g., ensuring that other threads see the correct value for fields of a String object).
Kourosh Gharachorloo wrote a note explaining how it can actually happen on an Alpha multiprocessor:
The anomalous behavior is currently only possible on a 21264-based system. And obviously you have to be using one of our multiprocessor servers. Finally, the chances that you actually see it are very low, yet it is possible.
Here is what has to happen for this behavior to show up. Assume T1 runs on P1 and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1 which causes an “invalidate y” to be sent to P2. This invalidate goes into the incoming “probe queue” of P2; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without doing an MB on P2. The invalidate is acknowledged right away at this point (i.e., you don’t wait for it to actually invalidate the copy in P2’s cache before sending the acknowledgment). Therefore, P1 can go through its MB. And it proceeds to do the write to p. Now P2 proceeds to read p. The reply for read p is allowed to bypass the probe queue on P2 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, P2 can dereference p to read the old value of y that is sitting in its cache (the inval y in P2’s probe queue is still sitting there).
How does an MB on P2 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of P, you do an MB which pulls in the inval to y for sure. And you can no longer see the old cached value for y.
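The sequence of events can be replayed with a deliberately tiny Python model: a cache, an incoming probe queue that is only drained by a memory barrier, and data replies that bypass the queue. This is a sketch of the scenario as described, not of real 21264 hardware; the class, the function, and the use of strings as "pointers" are all assumptions.

```python
class ToyCpu:
    """A cache plus an incoming probe queue that is drained only by an MB."""

    def __init__(self, memory):
        self.memory = memory
        self.cache = {}
        self.probe_queue = []

    def load(self, location):
        if location not in self.cache:
            # Cache miss: the data reply bypasses the probe queue.
            self.cache[location] = self.memory[location]
        return self.cache[location]

    def memory_barrier(self):
        # Service every pending invalidate before continuing.
        for location in self.probe_queue:
            self.cache.pop(location, None)
        self.probe_queue.clear()


def run(reader_uses_barrier):
    memory = {"x": 1, "y": 0, "p": "x"}
    p2 = ToyCpu(memory)
    p2.load("y")                 # P2 is caching y with value 0

    # P1: y = 1; MB; p = &y
    memory["y"] = 1
    p2.probe_queue.append("y")   # invalidate is queued and acknowledged at once
    memory["p"] = "y"

    # P2: i = *p
    target = p2.load("p")        # reply for p bypasses the queued invalidate
    if reader_uses_barrier:
        p2.memory_barrier()
    return p2.load(target)


print(run(reader_uses_barrier=False))  # stale cached y
print(run(reader_uses_barrier=True))   # invalidate serviced first
```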
Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you setup the caching properly, P2 will likely have ample opportunity to service the messages (i.e., inval) in its probe queue before it receives the data reply for “read p”. Nonetheless, if you get into a situation where you have placed many things in P2’s probe queue ahead of the inval to y, then it is possible that the reply to p comes back and bypasses this inval. It would be difficult for you to set up the scenario though and actually observe the anomaly.
The above addresses how current Alpha’s may violate what you have shown. Future Alpha’s can violate it due to other optimizations. One interesting optimization is value prediction.