ARM assembly-language programming for the Raspberry Pi
6. Using the sys_write syscall to output text
And so we arrive, at last, at "Hello, World". This example demonstrates
how to use sys_write to write to the console, and
introduces some other new assembly-language features.
I will show the example in its entirety, but some of the code is
the same as in the previous example.
Example
Here is the code. It just outputs "Hello, World" to the console.
    // Outputs a simple message using sys_write
    .text

    SYS_EXIT = 1
    SYS_WRITE = 4
    STDOUT = 1

    .global _start

    // Exit the program.
    // On entry, r0 should hold the exit code
    exit:
        mov %r7, $SYS_EXIT
        swi $0

    _start:
        // Use the sys_write syscall to output a string
        mov %r7, $SYS_WRITE
        mov %r0, $STDOUT
        ldr %r1, =msg // Store the address of the message in r1
        mov %r2, $13 // Store the length of the message in r2
        swi $0
        // Now exit
        mov %r0, $0
        b exit

    msg:
        .ascii "Hello, World\n"
Defining data
The text message "Hello, World" is a piece of data larger than a single
number. We've already seen how an integer number can be loaded
directly into a register using an immediate instruction like
mov %r0, $32. However, we can't load a whole string of
text into a 32-bit register. We can, and will, load the address
of the string into a register, but to do that we have to define the
string, and know its address.
The assembler provides a straightforward way to introduce data of various types into the object file. My example uses this method for a text string:
    msg:
        .ascii "Hello, World\n"
As in most other programming languages, \n is a code that
means 'new line'. Although it is written as two symbols in the source
-- '\' and 'n' -- it only occupies one byte in memory.
msg is just a label. When the program is assembled, references
to the label msg will be replaced with its address.
The assembler supports many other data types --
.byte, .word, etc.
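As a rough sketch of how some of these directives might look in practice, here are a few data definitions. The label names are made up for illustration and do not appear in the example program. The word is placed first so that, if it follows 4-byte instructions, it stays on a 4-byte boundary.

```asm
// Hypothetical data definitions; the label names are illustrative only
counter:
    .word 1000000            // one 32-bit word (four bytes)
flag:
    .byte 0x7f               // a single byte
greeting:
    .ascii "Hi\n"            // three bytes: 'H', 'i', and the newline
GREETING_LEN = . - greeting  // '.' is the current address, so this is 3
```

The last line asks the assembler to work out the length of the text by subtracting the label's address from the current position. A symbol defined like this can stand in for a hand-counted constant such as the 13 in the example.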
The sys_write syscall
The sys_write syscall (number 4 in ARM Linux) is a little
more complicated than sys_exit. It takes three arguments:
- r0 -- the file descriptor. This is an integer that identifies the file or device to write to. "Standard out" will always be file 1 on Linux terminals or consoles. Standard error is file 2.
- r1 -- the address in memory of the data to write.
- r2 -- the number of bytes to write.
As with all ARM Linux syscalls, the syscall number (4) goes into
r7.
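To show how the three arguments fit together, here is a minimal sketch of a variation on the example that writes to standard error (file 2) instead of standard out, and then exits with code 1. The names STDERR and errmsg are my own choices for illustration. The syntax is assumed to be the same GNU assembler style as the main example.

```asm
// Sketch: write a short message to standard error rather than standard out
    .text

SYS_EXIT = 1
SYS_WRITE = 4
STDERR = 2                   // file descriptor 2 is standard error

    .global _start

_start:
    mov %r7, $SYS_WRITE      // r7: the syscall number
    mov %r0, $STDERR         // r0: the file descriptor
    ldr %r1, =errmsg         // r1: the address of the data
    mov %r2, $6              // r2: the number of bytes ("Error" plus newline)
    swi $0

    mov %r0, $1              // exit code 1
    mov %r7, $SYS_EXIT
    swi $0

errmsg:
    .ascii "Error\n"
```

The only change that matters is the value in r0. The kernel does not care what the bytes mean; it simply writes r2 bytes, starting at the address in r1, to whichever file r0 identifies.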
The ldr instruction
ldr stands for 'load register'. In this
example, ldr is used in a way that
is conceptually the same as the immediate mode of mov:
to transfer a number into a register. This instruction:
    ldr %r1, =msg
transfers to register r1 the numerical address of the label msg.
ldr also has an indirect mode, like this:
    ldr %r1, [%r4]
In this mode, the value of the register r4 is treated
as an address in memory, and r1 is loaded with the
data in memory at that address. It is the square brackets that
indicate the indirect mode of operation.
We don't need to use the indirect form of ldr in this
example, but will need it later.
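For the curious, here is a minimal sketch of the indirect form in a complete program. It is not part of the "Hello, World" example, and the label answer is a name I have invented for illustration. The program loads the address of a word into r4, reads the word stored at that address into r0, and uses it as the exit code.

```asm
// Sketch: load a word from memory with the indirect form of ldr
    .text
    .global _start

_start:
    ldr %r4, =answer         // r4 = the address of the word labeled 'answer'
    ldr %r0, [%r4]           // r0 = the word stored at that address
    mov %r7, $1              // sys_exit, with r0 as the exit code
    swi $0

answer:
    .word 42
```

If this assembles and links as expected, running the program and then checking the shell's exit status (echo $? in most shells) should show the value that was stored at answer.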
ldr is not what it seems
If the use of ldr in this example is conceptually the
same as mov, then why not just use mov as
we did previously? Answering this question requires delving into
the internal operation of the assembler, but it's necessary to do
this, in order
to write efficient code.
So why could we not, in the present example, use this instead of
ldr?
    mov %r1, =msg
After all, I've already said that the immediate modes of mov
and ldr are conceptually equivalent. The
reason for not using mov is that the immediate
operand to mov is of limited size. I already touched on
this back in example 1, and hinted at it again in example 3. The
register can store a 32-bit number, but only 12 bits of the mov
instruction are available to encode the immediate operand.
It isn't the case that we can encode any 12-bit number, either. Those
bits hold an 8-bit value together with a 4-bit rotation, so the numbers
that can be encoded are those formed by rotating an 8-bit value by an
even number of bit positions. The assembler will stop
with an error if you try to use an immediate number that can't be
encoded using the CPU's rules.
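As a rough illustration, here are some constants that do and do not fit. This assumes the classic 32-bit ARM encoding, in which a mov immediate is an 8-bit value rotated right by an even number of bits. The values are chosen by me purely as examples, and this is a fragment rather than a complete program.

```asm
// Sketch: which constants fit a mov immediate (classic ARM encoding assumed)
    mov %r0, $255            // fits: 0xFF with no rotation
    mov %r0, $0x3FC          // fits: 0xFF rotated right by 30 bits (0xFF shifted left by 2)
    mov %r0, $0xFF000000     // fits: 0xFF rotated right by 8 bits
    ldr %r0, =0x101          // 257 has set bits nine positions apart, so no rotated 8-bit value matches
    ldr %r0, =0x12345678     // an arbitrary 32-bit value, loaded with ldr
```

Note that it is the spread of the set bits that matters, not the size of the number: 0xFF000000 is large but fits, while 257 is small and does not.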
In practice, the address of the message labeled msg:
might fit into a mov -- it's just about
possible, because the program is so small. However, it's unwise to
rely on this in a real program.
On the other hand, the ldr operation can encode any
32-bit number at all. If you're wondering how we can encode a
32-bit number into an instruction which is only 32 bits in total,
the answer is, of course: we can't. It's impossible.
The fact is that ldr's immediate mode is an illusion.
ldr has no immediate mode -- only an indirect mode,
where data is loaded from an address in memory. An instruction like this:
    ldr %r1, =42
is actually a pseudo-instruction. The assembler converts this instruction into something like:
    ldr %r1, [foo]
    foo:
        .word 42
That is, the assembler simulates an immediate operand by storing the
operand's value in memory, and generating an indirect access to the
stored value. That's how ldr can load a 32-bit value
using a 32-bit instruction code.
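To make the trick concrete, here is a sketch of "Hello, World" with the pseudo-instruction written out by hand. The label msg_addr is my own invention. I am also assuming that the GNU assembler accepts a bare label as the operand of ldr, meaning "load the word stored at this nearby label". This is an illustration of the idea, not necessarily the exact code the assembler generates.

```asm
// Sketch: a hand-written equivalent of 'ldr %r1, =msg'
    .text
    .global _start

_start:
    mov %r7, $4              // sys_write
    mov %r0, $1              // standard out
    ldr %r1, msg_addr        // load the word stored at msg_addr, i.e. the address of msg
    mov %r2, $13             // length of the message
    swi $0

    mov %r0, $0              // exit code 0
    mov %r7, $1              // sys_exit
    swi $0

msg_addr:
    .word msg                // the 32-bit address of msg, stored as data
msg:
    .ascii "Hello, World\n"
```

The GNU assembler documentation also notes that when the constant in an ldr pseudo-instruction can be generated by a mov or mvn instruction, the assembler uses that instruction instead of storing the value in memory. So a small constant like 42 may well end up as a plain mov, whichever way you write it.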
The downside, and the reason we prefer to use mov if
we can, is that executing a pseudo-immediate ldr will
take much longer than the truly immediate mov. As well
as the CPU having to read and decode the instruction code itself
from memory,
which is all that mov requires, using ldr
requires some additional arithmetic and then a further read from
memory. Using mov, where possible, is faster, as
well as using less storage. To be fair, we won't notice the
difference of less than a microsecond in a trivial program like this, but
those microseconds add up when there are millions of them.
In short: mov is an immediate instruction -- all
the data it needs is in the instruction itself. ldr
is an indirect instruction that reads from memory, but the
assembler simulates an immediate mode for ldr because
mov has a range limitation.
Where is the data?
You may have noticed that the data that forms the "Hello, World" message is just tacked on the end of the program code. This is a reasonable thing to do, but a little unconventional -- usually the program's constant data will be placed in a separate memory segment. I'll illustrate this in the next example.
One disadvantage of using the .text (program code) section
for data is that if you try to disassemble the code, or use an
interactive debugger, the tools won't be able to tell the difference between
genuine program code and your data. This won't do any harm, but it will
make the tools confusing to use.
Summary
- The sys_write syscall outputs data to a file or device.
- The ldr instruction reads data from memory into a register.
- ldr can be used to overcome the range limitation in the mov instruction, but mov -- where it can be used -- is faster and uses less storage.
