ARM assembly-language programming for the Raspberry Pi

9. Using comparisons and branches to create loops

The ability to repeat some set of actions a number of times, or until some condition is met, is one of the most fundamental operations in procedural programming. All high-level languages have multiple constructs for controlling looping. In ARM assembler, there is just one -- a conditional branch.

Before considering the example, we need to introduce the various roles of the stack in assembly programming, as it will become increasingly important.

The stack

A stack in programming is conceptually the same as a stack in any other area of life -- a region of storage that can be extended by adding things to the top, and compacted by removing them from the top. Unless we're very careful, we can only get to the bottom of the stack by removing all the items above it. So a stack is a store that expands and contracts in particular directions as stuff is added and removed.

In programming, the stack is an area of memory that expands and contracts as data is added and removed. In almost all programming paradigms, adding data to the stack is called 'pushing', while removing it is called 'popping'.

The stack is a hugely useful construct in applications that are built up from nested functions. As each function is called, it adds new data to the stack. When the function returns, it removes its own working data. Only one function is active at a time (in a specific thread, anyway), so a function only cares about its own specific part of the stack. When a function is called there might be megabytes of data on the stack already, pushed there by other functions; but until those functions get control back, nobody cares about that data. The push-on, pop-off model is ideal for this style of programming.

A very common use of the stack -- and the one that this example demonstrates -- is as a way to store the values of registers that a function will manipulate. It's generally good practice for a function to leave all the registers in the state it finds them, except for the ones used to return data to the caller. So if a function manipulates registers r0, r1, and r2, a simple way to protect those registers from being modified is to push their values onto the stack on entry, and pop them off at exit. The push and the pop need to be in opposite orders -- a fact that ARM assembly language rather obscures.

If all the program does is push data onto the stack and pop it off again, the programmer need not care much about how and where the stack is implemented. However, there is another important use of the stack, and that is to store working data used by a function, that won't fit into registers. If there aren't enough registers to store all the data a function needs, the function can expand the stack by as much as will be needed to store that data. When the function returns the data will be lost as the stack shrinks, but that's OK -- it was always temporary.

Using the stack this way does require a little knowledge about memory organization. The present example only uses the stack to push and pop registers, but later examples will use it in more sophisticated ways, so it's worth understanding the details.

On Linux, the code and data sections of the program occupy the lowest part of the program's address space, typically starting at address zero. Above those sections is an allocatable region of memory called the heap. Above that is the stack. Because the stack is at the top (highest-numbered addresses) of the address space, it can't expand upwards. That is, pushing items onto the stack can't increase the address of the top of the stack. So on Linux the stack typically grows downwards. Expanding the stack amounts to subtracting from the address of the top of the stack. It's very easy to get this wrong, and end up in a mess.

This stack organization is called full descending. The ARM CPU supports both ascending and descending stacks, but there are shortcuts in assembly language for the full-descending case because it's so widely used.

ARM stack instructions

The simplest instructions to push and pop registers are of this form:

  push   {%r0} // Put r0 on the stack, expanding it
  pop    {%r1} // Put the value at the top of the stack into r1
               //   and contract the stack

These instructions are concepturally easy to understand. The position of the top of the stack is maintained by the stack pointer, sp, but there's no need to know this to use simple push and pop instructions. However, if you want to use the stack to store arbibtrary data, you can use add and sub instructions to move sp around -- I will demonstrate this in a later example.

It's tacitly understood that the push and pop instructions apply to a full-descending stack. In fact, these instructions are really shorthand for specific kinds of store multiple and load multiple operations. These operations store a set of register values to contiguous areas of memory, either with ascending or descending addresses. This means that we can push and pop multiple registers together. In the example, I use these instructions to preserve the contents of registers that the function changes:

 push     {r4-r5}
 ...
 pop      {r4-r5}

These instructions push and pop the r4 and r5 registers together. The way the instructions are written obscures the fact that the push and pop must be done in the right order. It's tempting to write

 push     {r4-r5}
 ...
 pop      {r5-r4}

because we know that the last register to be pushed must be the first popped, but the assembler will reject this -- the registers must be specified in numerical order. The CPU knows that pop should be done in the reverse order from push.

The ARM instruction set allows for any register to hold the base address for a multiple store or multiple load operation -- not just the stack pointer. Another way to write a push operation with the stack pointer as the base is

  stmfd sp!, {r4-r5}

stmfd is store multiple full descending.

The corresponding pop is

  ldmfd sp!, {r4-r5}

and you can probably guess what the ld is an abbreviation for by this point. Although you'll often see ldmfd sp! and stmfd sp! in ARM assembly code, I don't think any of the large number of other variants is much used.

The example

I will only show here the complete implementation of the strlen function -- the rest of the program is the same as the previous example.

/* =========================== strlen ======================================*/
// Calulate the length of a null-terminated string
//   On entry, r0 is the address of the string
//   On exit, r0 is the length of the string, not including
//   the terminating null
strlen:
   push     {r4-r5}    // Save the values of %r4 and %r5, which we will use
   mov      %r4, $0    // Use %r4 as the character count; initially 0
strlen_0:
   ldrb     %r5,[r0]   // Read into %r5 the value in memory location %r0
   cmp      %r5, #0    // Compare to zero, the end-of-line terminator
   beq      strlen_1   // If it's equal to zero, jump out of loop
   add      %r0, $1    // If not zero, add one to the character count...
   add      %r4, $1    //   and to the address we are looking at
   b        strlen_0   // Then do the loop again
strlen_1:
   mov      %r0, %r4   // Transfer the character count to %r0 for return
   pop      {r4-r5}    // Restore the temporary registers 
   bx       lr

I feel I must point out from the start that this implementation is extremely inefficient. I'm using it for ease of understanding, not as an example of production-quality code. I'll explain why later.

The function works by storing 0 in register r4, then looping around looking for the byte 0 that marks the end of the string. The r0 register stores the address of the character under consideration. This register is initialized on entry to the start of the string, and then incremented on each repeat of the loop, so it moves along the string, one byte at a time. On each loop, r4 is incremented, so it ends up holding the number of bytes in the string, not including the null.

The `ldrb` instruction

We've already seen the ldr function, which reads a 32-bit number from memory into a register. ldrb has the same effect, except that it only reads one byte (that's the b on the end of the instruction name).

In the example, ldrb is used like this:

   ldrb     %r5,[r0]

A single byte at the address indicated by the r0 register is read into register r5. The is a 32-bit register, as all the ARM registers are, and it's worth thinking what happens to the other 24 bits that aren't read. They are initialized to zero -- which is one of the reasons my simple implementation is so inefficient. More on this point later.

Comparisons and branches

The cmp operation compares a register to a number, or to another register. In practice, the comparison is implemented as a subtraction. So when we do

   cmp     %r0,$N

the effect is as if the CPU subtracted N from the value in r0. This subtraction sets various flags, which the following branch operation tests. The subtraction isn't actually performed -- the value of the register isn't actually changed: the CPU just sets the flags as if it were.

There are many branches that can follow the comparison instruction. All the branch instructions start with b; a suffix indicates what kind of test to apply. I won't list them all -- they are well-documented. In the example I use beq -- branch if equal. Even if you've never seen ARM assembly before, you can probably guess what some of the others are.

Why is the implementation inefficient?

It's a general principle of machine-level programming that register operations are much, much faster the memory operations. The 32-bit ARM CPUs read memory in 32-bit blocks, directly into a 32-bit register. A loop that repeats ldrb over a block of contiguous bytes is inherently wasteful -- the CPU can't actually read a byte. Instead, it reads a four-byte block and sets three of the bytes to zero. This means that the same block of memory gets read four times, with only byte being used each time.

A production-quality implementation would read the string in four-byte blocks, and then examine the individual bytes within the blocks. There are instructions for extracting specific bit groups from within a register.

A note on predication

'Predication' is the term used for a feature that the ARM instruction set has, which allows all instructions to be branches. I don't want to linger on predication, because I don't think it's widely used, and I suspect it will disappear from newer ARM cores. However, the term is widely used in the ARM documentation, and it helps to know what it is.

I already mentioned that the branch (b) instruction could have a variety of different suffices, according to what kind of test to perform. In fact the same is true of all ARM instructions. For example, the add instruction adds two registers and stores the result in a third. We can add the eq suffix I described above, and turn it into addeq. This means that the add instruction will only take effect if the last comparison was an equality (technically, if the zero flag is set).

Predication offers lots of opportunities to remove branches completely from simple tests. Consider the following kind of operation:

  if a = last then 
    a = 0
  end

Using conventional conditional branches, this test might be implemented in ARM assembly with a conditional jump to a label:

  cmp %r0, %r1
  beq not_last
  add %r0,$1
not_last:
  ...

By using predication we can write the test in a compact way, that is considerably faster to execute:

  cmp %r0, %r1
  addeq %r0,$1

Predication is a powerful technique, but one that I suspect is only employed when hand-crafting highly optimized assembler routines. I'm not sure if any compiler generates code that uses predication.

Summary

The stack can be used to protect the values of registers that are altered during a function call.
The cmp instruction performs what amounts to a subtraction.
Various conditional branches can take action that depends on the outcome of the comparison.
Predication can be used -- but mostly is not -- to eliminate many branches and improve efficiency.