ARM assembly-language programming for the Raspberry Pi
9. Using comparisons and branches to create loops
The ability to repeat some set of actions a number of times, or until some condition is met, is one of the most fundamental operations in procedural programming. All high-level languages have multiple constructs for controlling looping. In ARM assembler, there is just one -- a conditional branch.
Before considering the example, we need to introduce the various roles of the stack in assembly programming, as it will become increasingly important.
The stack
A stack in programming is conceptually the same as a stack in any other area of life -- a region of storage that can be extended by adding things to the top, and compacted by removing them from the top. Unless we're very careful, we can only get to the bottom of the stack by removing all the items above it. So a stack is a store that expands and contracts in particular directions as stuff is added and removed.
In programming, the stack is an area of memory that expands and contracts as data is added and removed. In almost all programming paradigms, adding data to the stack is called 'pushing', while removing it is called 'popping'.
The stack is a hugely useful construct in applications that are built up from nested functions. As each function is called, it adds new data to the stack. When the function returns, it removes its own working data. Only one function is active at a time (in a specific thread, anyway), so a function only cares about its own specific part of the stack. When a function is called there might be megabytes of data on the stack already, pushed there by other functions; but until those functions get control back, nobody cares about that data. The push-on, pop-off model is ideal for this style of programming.
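To make this concrete, here is a minimal sketch of how nested calls use the stack. The function names `outer` and `inner` are hypothetical, and `bl` (branch with link, which stores the return address in `lr`) hasn't been introduced yet -- this is only an illustration of the push-on, pop-off model, not production code.

```asm
// Sketch only: outer calls inner, so outer must save lr (its own
// return address) on the stack, because the bl would overwrite it.
outer:
    push {lr}        // outer's return address goes onto the stack
    bl   inner       // call inner; lr now holds the return point in outer
    pop  {lr}        // restore outer's own return address
    bx   lr          // return to outer's caller
inner:
    bx   lr          // inner pushes nothing; it just returns
```

While `inner` runs, it neither knows nor cares that `outer`'s return address is sitting on the stack below its own data.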
A very common use of the stack -- and the one that this example demonstrates -- is as a way to store the values of registers that a function will manipulate. It's generally good practice for a function to leave all the registers in the state in which it found them, except for the ones used to return data to the caller. So if a function manipulates registers r0, r1, and r2, a simple way to protect those registers from being modified is to push their values onto the stack on entry, and pop them off at exit. The push and the pop need to be in opposite orders -- a fact that ARM assembly language rather obscures.
If all the program does is push data onto the stack and pop it off again, the programmer need not care much about how and where the stack is implemented. However, there is another important use of the stack, and that is to store working data used by a function that won't fit into registers. If there aren't enough registers to store all the data a function needs, the function can expand the stack by as much as will be needed to store that data. When the function returns, the data will be lost as the stack shrinks, but that's OK -- it was always temporary.
Using the stack this way does require a little knowledge about memory organization. The present example only uses the stack to push and pop registers, but later examples will use it in more sophisticated ways, so it's worth understanding the details.
On Linux, the code and data sections of the program occupy the lowest part of the program's address space, starting at a low address. Above those sections is an allocatable region of memory called the heap. Above that is the stack. Because the stack is at the top (highest-numbered addresses) of the address space, it can't expand upwards. That is, pushing items onto the stack can't increase the address of the top of the stack. So on Linux the stack grows downwards: expanding the stack amounts to subtracting from the address of the top of the stack. It's very easy to get this wrong, and end up in a mess.
This stack organization is called full descending. The ARM CPU supports both ascending and descending stacks, but there are shortcuts in assembly language for the full-descending case because it's so widely used.
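As a sketch of what 'expanding the stack' looks like on a full-descending stack -- this anticipates the later examples the text refers to, and the 16-byte size is just an arbitrary choice for illustration:

```asm
// Sketch: reserve 16 bytes of working space on a full-descending stack
sub sp, sp, $16      // expand the stack downwards by 16 bytes
str %r0, [sp]        // store a value in the reserved space
ldr %r1, [sp]        // read it back later
add sp, sp, $16      // contract the stack, discarding the space
```

The `sub` must be matched by an `add` of the same amount before the function returns, or the stack pointer ends up pointing at the wrong place -- exactly the kind of mess mentioned above.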
ARM stack instructions
The simplest instructions to push and pop registers are of this form:
push {%r0}    // Put r0 on the stack, expanding it
pop {%r1}     // Put the value at the top of the stack into r1
              //   and contract the stack
These instructions are conceptually easy to understand. The position of the top of the stack is maintained by the stack pointer, sp, but there's no need to know this to use simple push and pop instructions. However, if you want to use the stack to store arbitrary data, you can use add and sub instructions to move sp around -- I will demonstrate this in a later example.
It's tacitly understood that the push and pop instructions apply to a full-descending stack. In fact, these instructions are really shorthand for specific kinds of store multiple and load multiple operations. These operations store a set of register values to contiguous areas of memory, with either ascending or descending addresses. This means that we can push and pop multiple registers together. In the example, I use these instructions to preserve the contents of registers that the function changes:
push {r4-r5}
...
pop {r4-r5}

These instructions push and pop the r4 and r5 registers together. The way the instructions are written obscures the fact that the push and pop must be done in the right order. It's tempting to write
push {r4-r5}
...
pop {r5-r4}
because we know that the last register to be pushed must be the first popped, but the assembler will reject this -- the registers must be specified in numerical order. The CPU knows that pop should be done in the reverse order from push.
The ARM instruction set allows for any register to hold the base address for a multiple store or multiple load operation -- not just the stack pointer. Another way to write a push operation with the stack pointer as the base is
stmfd sp!, {r4-r5}
stmfd is store multiple full descending. The corresponding pop is
ldmfd sp!, {r4-r5}
and you can probably guess what the ld is an abbreviation for by this point. Although you'll often see ldmfd sp! and stmfd sp! in ARM assembly code, I don't think any of the large number of other variants is much used.
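Since any register can serve as the base, the same family of instructions can be used for things other than the stack. Here is a hedged sketch -- the buffer addresses in %r1 and %r0 are assumed to have been set up already:

```asm
// Sketch: stm/ldm with ordinary registers as the base.
// stmia is 'store multiple, increment after'; the ! writes the
// updated address back into the base register.
stmia %r1!, {r2-r3}   // store r2 at [r1], r3 at [r1+4]; r1 advances by 8
ldmia %r0, {r4-r5}    // load r4 from [r0] and r5 from [r0+4]
```

`push` and `pop` are just the special case where the base is sp, the addresses descend, and the writeback is implied.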
The example
I will only show here the complete implementation of the strlen function -- the rest of the program is the same as the previous example.
/* =========================== strlen ======================================*/
// Calculate the length of a null-terminated string
// On entry, r0 is the address of the string
// On exit, r0 is the length of the string, not including
//   the terminating null
strlen:
    push {r4-r5}      // Save the values of %r4 and %r5, which we will use
    mov %r4, $0       // Use %r4 as the character count; initially 0
strlen_0:
    ldrb %r5,[r0]     // Read into %r5 the byte at memory location %r0
    cmp %r5, #0       // Compare to zero, the end-of-string terminator
    beq strlen_1      // If it's equal to zero, jump out of the loop
    add %r0, $1       // If not zero, add one to the address we are looking at...
    add %r4, $1       //   and to the character count
    b strlen_0        // Then do the loop again
strlen_1:
    mov %r0, %r4      // Transfer the character count to %r0 for return
    pop {r4-r5}       // Restore the temporary registers
    bx lr
I feel I must point out from the start that this implementation is extremely inefficient. I'm using it for ease of understanding, not as an example of production-quality code. I'll explain why later.
The function works by storing 0 in register r4, then looping around looking for the byte 0 that marks the end of the string. The r0 register stores the address of the character under consideration. This register is initialized on entry to the start of the string, and then incremented on each repeat of the loop, so it moves along the string, one byte at a time. On each loop, r4 is incremented, so it ends up holding the number of bytes in the string, not including the null.
The ldrb instruction
We've already seen the ldr instruction, which reads a 32-bit number from memory into a register. ldrb has the same effect, except that it only reads one byte (that's the b on the end of the instruction name). In the example, ldrb is used like this:
ldrb %r5,[r0]
A single byte at the address indicated by the r0 register is read into register r5. This is a 32-bit register, as all the ARM registers are, and it's worth thinking about what happens to the other 24 bits that aren't read. They are initialized to zero -- which is one of the reasons my simple implementation is so inefficient. More on this point later.
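A quick sketch makes the zero-filling visible. The 0xFF000000 value is just arbitrary nonzero data chosen for the illustration:

```asm
// Sketch: ldrb zero-fills the upper 24 bits of the destination
mov  %r5, $0xFF000000   // put nonzero data in the top byte of r5
ldrb %r5, [r0]          // afterwards, bits 8-31 of r5 are zero;
                        //   only bits 0-7 come from memory
```

Whatever was in r5 beforehand is gone -- ldrb always replaces the whole register, not just its low byte.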
Comparisons and branches
The cmp operation compares a register to a number, or to another register. In practice, the comparison is implemented as a subtraction. So when we do

cmp %r0,$N

the effect is as if the CPU subtracted N from the value in r0.
This subtraction sets various flags, which the following branch operation tests. The subtraction isn't actually performed -- the value of the register isn't changed: the CPU just sets the flags as if it were.
There are many branches that can follow the comparison instruction. All the branch instructions start with b; a suffix indicates what kind of test to apply. I won't list them all -- they are well documented. In the example I use beq -- branch if equal. Even if you've never seen ARM assembly before, you can probably guess what some of the others are.
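As an illustration, here are a few of the common suffixes following a single comparison. The labels are hypothetical, and the value 10 is arbitrary:

```asm
cmp %r0, $10        // set the flags as if computing r0 - 10
beq is_equal        // branch if r0 == 10
bne not_equal       // branch if r0 != 10
blt is_less         // branch if r0 < 10 (signed comparison)
bge is_geq          // branch if r0 >= 10 (signed comparison)
```

Only one comparison is needed; each branch simply tests a different combination of the flags it set.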
Why is the implementation inefficient?
It's a general principle of machine-level programming that register operations are much, much faster than memory operations. The 32-bit ARM CPUs read memory in 32-bit blocks, directly into a 32-bit register. A loop that repeats ldrb over a block of contiguous bytes is inherently wasteful -- the CPU can't actually read a single byte. Instead, it reads a four-byte block and sets three of the bytes to zero. This means that the same block of memory gets read four times, with only one byte being used each time.
A production-quality implementation would read the string in four-byte blocks, and then examine the individual bytes within the blocks. There are instructions for extracting specific bit groups from within a register.
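As a sketch of the idea -- not a tuned implementation -- the inner loop might read a word at a time and test each byte with tst, which performs a bitwise AND and sets the flags without storing a result. The labels are hypothetical, and the code assumes the string starts on a word-aligned address; a real implementation must also handle an unaligned start, and would use cleverer bit tricks to test all four bytes at once.

```asm
// Sketch only: read four bytes per iteration, then test each byte.
strlen_word_0:
    ldr %r5, [r0], $4     // read four bytes, then advance r0 by 4
    tst %r5, $0xFF        // is the lowest byte zero?
    beq found_0
    tst %r5, $0xFF00      // the second byte?
    beq found_1
    tst %r5, $0xFF0000    // the third?
    beq found_2
    tst %r5, $0xFF000000  // the fourth?
    beq found_3
    b strlen_word_0
```

This touches each four-byte block of memory once, rather than four times.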
A note on predication
'Predication' is the term used for a feature of the ARM instruction set that allows almost any instruction to be executed conditionally, in the same way that a conditional branch is. I don't want to linger on predication, because I don't think it's widely used, and I suspect it will disappear from newer ARM cores. However, the term is widely used in the ARM documentation, and it helps to know what it is.
I already mentioned that the branch (b) instruction could have a variety of different suffixes, according to what kind of test to perform. In fact the same is true of all ARM instructions. For example, the add instruction adds two registers and stores the result in a third. We can add the eq suffix I described above, and turn it into addeq. This means that the add instruction will only take effect if the last comparison was an equality (technically, if the zero flag is set).
Predication offers lots of opportunities to remove branches completely from simple tests. Consider the following kind of operation:
if a = last then a = 0 end

Using conventional conditional branches, this test might be implemented in ARM assembly with a conditional jump to a label:
cmp %r0, %r1       // Compare a (in %r0) with last (in %r1)
bne not_last       // If they differ, skip the assignment
mov %r0, $0        // They were equal, so set a to zero
not_last:
...

By using predication we can write the test in a compact way that is considerably faster to execute:
cmp %r0, %r1       // Compare a with last
moveq %r0, $0      // Set a to zero, but only if they were equal

Predication is a powerful technique, but one that I suspect is only employed when hand-crafting highly optimized assembler routines. I'm not sure whether any compiler generates code that uses predication.
Summary
The stack can be used to protect the values of registers that are altered during a function call.
The cmp instruction performs what amounts to a subtraction. Various conditional branches can take action that depends on the outcome of the comparison.
Predication can be used -- but mostly is not -- to eliminate many branches and improve efficiency.