ARM Assembly on the Pi Pico: Stack attack

This time round, I’ll wrap up my coverage of the key ARMv6-M Thumb instructions and mnemonics that you can use to command the Raspberry Pi RP2040. There are not many instructions left that were not covered in parts one and two, and I won’t be including all the remaining mnemonics, only those you’re likely to use frequently.

The handful of more specialised system management ops that remain will have to wait for another day.

Stack Ops

I’m currently working on a program that takes advantage of the RP2040’s I²C functionality to drive a display. That’s going to take some figuring out, so as a first step I’m using some of my existing C code to handle the I²C interactions and to talk to an LED matrix driven by the Holtek HT16K33 controller. A future post will cover doing I²C fully in ARM assembly.

Meantime, I’m using assembly code to drive a pixel around the display. I need to store the pixel’s x and y co-ordinates plus flags that determine whether those co-ordinates increase or decrease each pass through the loop. The stack is a good place to store these variables, which are accessed frequently.

The stack, for folks new to it, is a sequence of zero or more 32-bit words somewhere in memory. The RP2040 has a register, R13 aka SP (for “Stack Pointer”), which tracks the current top of the stack. As the contents of other registers are added to the stack, this address is reduced by four bytes per word; pulling a value off the stack increases it by the same amount. In short, the stack grows downward through memory.

Use the PUSH mnemonic to add words to the stack; removing them uses one called POP. Each instruction’s operand is a list of registers in curly brackets. For example:

PUSH {R0-R4, R7}

As you can see, you can use ranges in the list, so you don’t have to name each register. Pushing a register copies its value. The order in which you write the list doesn’t matter: the registers are always stored in number order, with the lowest-numbered register at the lowest address, which is the top of the stack.

POP reads its list the same way: the lowest-numbered register is loaded from the lowest address. So

POP {R0-R4, R7}

puts every value back in the register it came from. R0 takes the top-most stack value, which is the one stacked from R0, and R7 takes the last one, which was stacked from R7. The reversal you would expect from a last in, first out structure only appears when you push and pop registers with separate instructions. Either way, it’s up to you as the programmer to know what’s at the top of the stack so you pop values into the correct registers for your application.
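A quick way to see the ordering rules is to compare a register list with single-register pushes. This is a minimal sketch rather than code from the repo, and the register choices and values are arbitrary:

```asm
    movs r0, #1             // R0 = 1
    movs r7, #7             // R7 = 7

    // One instruction, one list: each value returns to its own register
    push {r0, r7}           // R0 is stored at the lower address, R7 above it
    pop  {r0, r7}           // R0 = 1 and R7 = 7 again

    // Separate instructions: last in, first out applies
    push {r0}               // 1 goes on first
    push {r7}               // 7 goes on last, so it is now at the top
    pop  {r0}               // R0 = 7, the most recently pushed value
    pop  {r7}               // R7 = 1
```

After the second pair of pops the two registers have traded values, which is the effect of pushing and popping one register at a time.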

The stack is commonly used to hold values passed into a function, and as temporary storage during an operation. For example, you might swap two registers this way:

PUSH {R0}           // R0 -> Stack
MOV R0, R1          // R1 -> R0
POP {R1}            // Stack (old R0) -> R1

As you can see, just because a register’s value is pushed, you don’t have to pop it back into the same register.

The Stack Pointer is like other registers in that you can change its value with other instructions. This allows you to pop a value from deep in the stack: you add the depth, in bytes, to the current value of SP and pop the value out. Don’t forget to restore SP to the value it held before you messed with it, or values closer to the top may not be popped correctly. Manipulating the stack this way requires a lot of care.
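As a rough sketch of that technique (the register choices are mine, not taken from the repo), the following pops the third word down and then restores SP. One reason for care: while SP is raised, the words below it are no longer protected, because on a Cortex-M core an interrupt stacks its context at the current SP and would overwrite them. An SP-relative load, shown last, reads the same word without moving SP at all.

```asm
    mov  r4, sp             // Preserve the current SP in R4
    add  sp, #8             // Skip over the two top-most words
    pop  {r0}               // R0 = third word down; SP rises four more bytes
    mov  sp, r4             // Restore SP so later pops behave as expected

    ldr  r0, [sp, #8]       // Alternative: read the same word, SP untouched
```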

Take a look at the code in my pico-asm GitHub repo: in particular, the file source/asm_i2c/main.S.

Lines 55-61 set up space for and fill four 32-bit words. I’m setting two base values rather than four, so I just push each of them twice. I also record the address of the first word in R9.

How are the values used? In lines 69-74, the code sets the SP to the value preserved in R9 then pops the first two values: the pixel’s current X and Y co-ordinates. R10 is used to preserve the current value of SP, which is put back afterwards and before the plot function is called.

Because the stack works on a last in, first out basis, and popping a value moves SP up four bytes, it’s important to understand the order in which values will be popped. Having loaded the base address from R9, we subtract eight from it so that Y is popped first. That pop moves the SP up four bytes, so it’s ready to pop the X value. After getting X, SP is left back at the base address.

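So to pop a value from the stack space, we set SP to the address of the value to be popped; the pop then leaves SP at the address right after it. Pushing works the other way around: PUSH subtracts four from SP before it stores, so to fill a slot you set SP to the address right after that slot. You can see this in the code at label addx, which updates the X direction flag, DX. Later we set SP to the base address, ready to store the X co-ordinate, which is held in R0, in the word just below it.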
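Put together, the access pattern looks something like the sketch below. It illustrates the idea rather than reproducing the code in main.S: it assumes R9 holds the base address, that X lives in the word at base - 4 and Y in the word at base - 8, and the use of R0 and R1 and the increment are my own choices.

```asm
    mov  r10, sp            // Preserve the working SP
    mov  sp, r9             // SP = base address
    sub  sp, #8             // SP = address of the Y slot
    pop  {r1}               // R1 = Y; SP = base - 4
    pop  {r0}               // R0 = X; SP = base

    adds r0, #1             // Example update: X = X + 1

    push {r0}               // SP drops to base - 4, then X is stored there
    mov  sp, r10            // Restore the working SP before calling anything
```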

Always preserve and then restore the current value of SP so you don’t impact use of the stack for other functions, such as plot.

In plot, we push LR, the Link Register, and later pop the PC. LR is the only high register (R8 or above) you can push to the stack. Likewise, PC is the only high register you can pop. The code shows why. The bl op stores the address of the instruction after the branch in the LR so you can BX LR to return from the function. Now plot makes function calls of its own, so the value in LR would be overwritten and the original value lost if we didn’t preserve it on the stack. This we do, and pop it out and into the PC at the end to effect the return.

push {lr}
...
pop {pc}

You should note that this only works if the functions called by plot balance any pushes of their own with pops, otherwise the Stack Pointer won’t be where you expect it to be when PC is popped and the code will jump somewhere else. The stack doesn’t know a value is right or wrong; it just pushes and pops when asked to. You have to watch the values on its behalf.
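The whole pattern looks something like this. It is a sketch, not the repo’s plot function: my_plot and helper are hypothetical names, and R4 is included only to show that low registers can share the same PUSH and POP.

```asm
my_plot:
    push {r4, lr}           // Save R4 and the return address
    movs r4, r0             // Keep the first argument safe across the call
    bl   helper             // BL overwrites LR with a new return address
    movs r0, r4             // Recover the argument
    pop  {r4, pc}           // Restore R4 and return by loading PC

helper:
    bx   lr                 // Leaf function: LR is intact, so branch back
```

Because helper pushes nothing, SP is back where my_plot left it when the final POP runs, so the saved return address is the value that lands in PC.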

Introducing fixed-size ‘gaps’ within the stack to hold variables is called ‘stack framing’, and it’s a handy way of passing bulk data to functions because, by the standard ARM calling convention, only R0 through R3 are used for parameters. If you need more than four parameters, you can pass the address of a stack frame.
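One way to build such a frame and hand it to a function is sketched below. The function name draw_pixel, the frame layout and the values are hypothetical, chosen only for illustration.

```asm
    sub  sp, #16            // Reserve a frame of four 32-bit words
    movs r0, #3
    str  r0, [sp, #0]       // frame[0] = X
    movs r0, #5
    str  r0, [sp, #4]       // frame[1] = Y
    movs r0, #1
    str  r0, [sp, #8]       // frame[2] = DX
    str  r0, [sp, #12]      // frame[3] = DY
    mov  r0, sp             // R0 = address of the frame: the only parameter
    bl   draw_pixel         // Callee reads fields with LDR Rn, [R0, #offset]
    add  sp, #16            // Release the frame
```

Here SP stays below the data throughout, so the frame is accessed with SP-relative stores rather than by moving SP around.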

Note: You don’t have to use a stack frame here. Take a look at the file main_no_stack.S in the same repo directory, which shows you how to store and access variables in a different way: in the program’s .data section. There’ll be more on program sections in an upcoming post.

Exchange Ops

‘Exchange’ here means ‘exchange instruction sets’, and is signified by suffixing supported instructions with X. For RP2040, that means BX and BLX. They work like B and BL, except that they take the target address from a register, and they are able to manage jumps to code written in a different ARM instruction set. Remember, the RP2040’s Cortex-M0+ cores both use the Thumb instruction set, a largely 16-bit, pared down version of the full 32-bit ARMv6 instruction set. BX and BLX allow the CPU to jump smoothly from a block of code written using one set to code written in another. A function doesn’t know which set it was called by, and unless you wrote both caller and called, neither do you. So it’s good practice to use the X form when you’re branching back. Examples of functions you haven’t written are the ones in the Pico SDK.

You might think this unnecessary since both the Pico SDK and your code all get compiled or assembled to Thumb instructions: that’s all that will run on the RP2040. However, the SDK uses BLX and BX a lot. These instructions treat bit 0 of the address they jump to as the instruction set flag, and on a Thumb-only core it must be set; the return address that BL and BLX store in the LR has bit 0 set too. To stop that causing problems, you need to anticipate it. You do so by including the .thumb_func directive ahead of each function label, so the label’s address is generated with bit 0 set and the notional instruction set exchange is handled properly.
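In practice that means marking each function entry point, along these lines. The function name my_func is hypothetical; the directives are standard GNU assembler ones.

```asm
.syntax unified
.cpu cortex-m0plus
.thumb

.global my_func             // Make the symbol visible to C and SDK code
.thumb_func                 // Mark the next label as a Thumb function
my_func:
    adds r0, r0, #1         // Return the argument plus one
    bx   lr                 // Branch back via the address left in LR
```

The directive applies only to the label that follows it, so each function that may be reached through BX or BLX needs its own .thumb_func.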

Extend Ops

ARM instructions work with 32-bit numbers. What if you load an 8- or 16-bit value, maybe read from a peripheral, that’s signed? How do you ensure the 32-bit value the Cortex-M0+ will work on is itself correctly signed? On an 8-bit value, the sign bit is bit 7, but bit 31 is the 32-bit value’s sign bit. How, in short, do you carry bit 7 up to bit 31?

You use SXTB. This op copies bit 7 into every bit from 8 through 31, so a negative byte becomes the same negative number in 32 bits. Its partner, SXTH, does the same for a signed 16-bit value, but this time it’s bit 15 that gets copied up through bit 31.
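A short worked example may help. The values are arbitrary ones picked to have their sign bits set, and the register choices are mine.

```asm
    movs r1, #0xF6          // R1 = 0x000000F6: the byte 0xF6 is -10 if signed
    sxtb r0, r1             // R0 = 0xFFFFFFF6, which is -10 in 32 bits

    ldr  r1, =0x8001        // R1 = 0x00008001: the halfword is -32767 if signed
    sxth r0, r1             // R0 = 0xFFFF8001, which is -32767 in 32 bits
```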

UXTB and UXTH perform similar magic for unsigned bytes and halfwords. This time all the bits above the source value (8-31 and 16-31, respectively) are zeroed. This ensures that any bits set in a register’s existing value are wiped when the lower byte or halfword is added in. Loading a byte into a register with LDRB automatically zeroes the upper bits, but there can be situations when you get the 8-bit value into the register in a different way. UXTB is a quick way of zeroing the upper bits without having to set up a mask to AND out bits 8 through 31.
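To make the comparison concrete, here is a small sketch of both routes. The test value and register choices are arbitrary.

```asm
    ldr  r0, =0x12345678    // R0 holds unwanted bits above the low byte
    uxtb r1, r0             // R1 = 0x00000078: bits 8-31 zeroed in one op

    movs r2, #0xFF          // The longer route: build a mask in R2...
    ands r0, r2             // ...and AND it in. R0 = 0x00000078

    ldr  r0, =0x12345678    // Reload the test value
    uxth r1, r0             // R1 = 0x00005678: bits 16-31 zeroed
```

The mask route costs an extra instruction and a spare register, and overwrites the source; UXTB can write its result to a different register and leave the source alone.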

You can see these two instructions in the disassembly of the I2C demo code. You’ll find it after a build, in the build/source/asm_i2c directory — look for the file I2C_DEMO.dis. Open it in your text editor and scroll down to the <ht16k33_plot> section.
