ARM assembly-language programming for the Raspberry Pi
6. Using the sys_write syscall to output text
And so we arrive, at last, at "Hello, World". This example demonstrates
how to use sys_write to write to the console, and
introduces some other new assembly-language features.
I will show the example in its entirety, but some of the code is
the same as in the previous example.
Example
Here is the code. It just outputs "Hello, World" to the console.
// Outputs a simple message using sys_write

.text

SYS_EXIT = 1
SYS_WRITE = 4
STDOUT = 1

.global _start

// Exit the program.
// On entry, r0 should hold the exit code
exit:
    mov    %r7, $SYS_EXIT
    swi    $0

_start:
    // Use the sys_write syscall to output a string
    mov    %r7, $SYS_WRITE
    mov    %r0, $STDOUT
    ldr    %r1, =msg   // Store the address of the message in r1
    mov    %r2, $13    // Store the length of the message in r2
    swi    $0

    // Now exit
    mov    %r0, $0
    b      exit

msg:
    .ascii "Hello, World\n"
Defining data
The text message "Hello, World" is a piece of data larger than a single
number. We've already seen how an integer number can be loaded
directly into a register using an immediate instruction like
mov %r0, $32. However, we can't load a whole string of
text into a 32-bit register. We can, and will, load the address
of the string into a register, but to do that we have to define the
string, and know its address.
The assembler provides a straightforward way to introduce data of various types into the object file. My example uses this method for a text string:
msg: .ascii "Hello, World\n"
As in most other programming languages, \n is a code that
means 'new line'. Although it is written as two symbols in the source
-- '\' and 'n' -- it occupies only one byte in memory. Counting that
newline, the message is 13 bytes long, which is the length the example
passes to sys_write.
msg is just a label. When the program is assembled, references
to the label msg will be replaced with its address.
The assembler supports many other data types -- .byte, .word, etc.
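Counting the bytes of a message by hand is easy to get wrong. As a minimal sketch, assuming the GNU assembler, the length can instead be computed at assembly time by subtracting the label's address from the current location, which is written as a dot. The names greeting and GREETING_LEN are my own and are not part of the original example. The length is loaded with ldr, which is explained later in this section, because the symbol is only defined after the point where it is used.

```asm
// Sketch: let the assembler work out the message length
.text

SYS_EXIT = 1
SYS_WRITE = 4
STDOUT = 1

.global _start

_start:
    mov    %r7, $SYS_WRITE
    mov    %r0, $STDOUT
    ldr    %r1, =greeting        // Address of the text
    ldr    %r2, =GREETING_LEN    // Length, computed by the assembler
    swi    $0

    mov    %r0, $0               // Exit code 0
    mov    %r7, $SYS_EXIT
    swi    $0

greeting:
    .ascii "Hello, World\n"
// "." is the current address, so this is the number of bytes
// between the label and this point (13 here)
GREETING_LEN = . - greeting
```

With this arrangement the text can be edited without having to remember to update a separate length constant.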
The sys_write syscall
The sys_write syscall (number 4 in ARM Linux) is a little
more complicated than sys_exit. It takes three arguments:
- r0 -- the file descriptor. This is an integer that identifies the file or device to write to. Standard output will always be file 1 on Linux terminals or consoles. Standard error is file 2.
- r1 -- the address in memory of the data to write.
- r2 -- the number of bytes to write.
As with all ARM Linux syscalls, the syscall number (4) goes into r7.
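To show that only the file descriptor in r0 decides where the output goes, here is a small sketch that writes to standard error rather than standard output. It follows the same pattern as the main example. The label errmsg, the constant STDERR and the message text are my own choices for illustration.

```asm
// Sketch: write a message to standard error (file descriptor 2)
.text

SYS_EXIT = 1
SYS_WRITE = 4
STDERR = 2

.global _start

_start:
    mov    %r7, $SYS_WRITE   // Syscall number goes in r7
    mov    %r0, $STDERR      // r0: file descriptor
    ldr    %r1, =errmsg      // r1: address of the data
    mov    %r2, $6           // r2: number of bytes in "Oops!\n"
    swi    $0

    mov    %r0, $1           // Exit with a non-zero code
    mov    %r7, $SYS_EXIT
    swi    $0

errmsg:
    .ascii "Oops!\n"
```

If the program is run from a shell with standard error redirected (for example by appending 2>/dev/null to the command), the message should no longer appear on the terminal.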
The ldr instruction
ldr is "load register". In this example, ldr is used in a way that
is conceptually exactly the same as the immediate mode of mov:
to transfer a number into a register. This instruction:
ldr %r1, =msg
transfers to register r1 the numerical address labeled by msg.
ldr also has an indirect mode, like this:
ldr %r1, [r4]
In this mode, the value of the register r4 is treated
as an address in memory, and r1 is loaded with the
data in memory at that address. It is the square brackets that
indicate the indirect mode of operation.
We don't need to use the indirect form of ldr in this
example, but will need it later.
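As a minimal sketch of the two forms working together, the program below first loads the address of a word of data into r4, then uses the indirect form to fetch the word itself. The label value and the number 7 are arbitrary choices of mine. The fetched number is used as the exit code so that the result can be observed from the shell.

```asm
// Sketch: the indirect form of ldr
.text

SYS_EXIT = 1

.global _start

_start:
    ldr    %r4, =value       // r4 = address of the data
    ldr    %r0, [%r4]        // r0 = the word stored at that address
    mov    %r7, $SYS_EXIT    // Exit, using the loaded value as the exit code
    swi    $0

value:
    .word 7
```

After running the program, echo $? in the shell should report 7.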
ldr is not what it seems
If the use of ldr in this example is conceptually the
same as mov, then why not just use mov as
we did previously? Answering this question requires delving into
the internal operation of the assembler, but it's necessary to do
this in order to write efficient code.
So why could we not, in the present example, use this instead of ldr?
mov %r1, =msg
After all, I've already said that the immediate modes of mov
and ldr are conceptually equivalent. The
reason for not using mov is that the immediate
operand to mov is of limited size. I already touched on
this back in example 1, and hinted at it again in example 3. The
register can store a 32-bit number, but the instruction has only
12 bits in which to encode the operand.
It isn't the case that we can encode any 12-bit number, either. The
12 bits hold an 8-bit value and a 4-bit rotation, so the only numbers
that can be encoded are those that can be formed by rotating an 8-bit
value right by an even number of bit positions. The assembler will stop
with an error if you try to use an immediate number that can't be
encoded using the CPU's rules.
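The sketch below illustrates the kind of constant that does and does not fit. It rests on my understanding of the classic 32-bit ARM encoding, in which a mov immediate is an 8-bit value rotated right by an even number of bit positions. The particular constants are my own examples. I would expect the commented-out line to be rejected when assembling for the classic ARM instruction set, though I have not verified this for every assembler version.

```asm
// Sketch: which constants fit in a mov immediate?
.text

SYS_EXIT = 1

.global _start

_start:
    mov    %r0, $255          // Fits: an 8-bit value, no rotation
    mov    %r1, $0x3FC        // Fits: 0xFF moved up by two bit positions
    mov    %r2, $0xFF000000   // Fits: 0xFF rotated into the top byte
    // mov %r3, $0x101        // Does not fit: the set bits span 9 positions
    ldr    %r3, =0x101        // ldr can supply any 32-bit value

    mov    %r0, $0            // Exit code 0
    mov    %r7, $SYS_EXIT
    swi    $0
```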
In practice, the address of the message labeled msg
might fit into a mov -- it's just about
possible, because the program is so small. However, it's unwise to
rely on this in a real program.
On the other hand, the ldr operation can encode any
32-bit number at all. If you're wondering how we can encode a
32-bit number into an instruction which is only 32 bits in total,
the answer is, of course: we can't. It's impossible.
The fact is that ldr's immediate mode is an illusion.
ldr has no immediate mode -- only an indirect mode,
where data is loaded from an address in memory. An instruction like this:
ldr %r1, =42
is actually a pseudo-instruction. The assembler converts this instruction into something like:
ldr %r1, [foo]
foo: .word 42
That is, the assembler simulates an immediate operand by storing the
operand's value in memory, and generating an indirect access to the
stored value. That's how ldr can load a 32-bit value
using a 32-bit instruction code.
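The same effect can be produced by hand, which may make the mechanism clearer. In this sketch, which assumes the GNU assembler, the constant is stored in memory under a label and ldr is given the label directly, so the assembler generates a load relative to the program counter. The label answer is my own name. The loaded value is used as the exit code so the result is visible from the shell.

```asm
// Sketch: doing by hand what the ldr pseudo-instruction does
.text

SYS_EXIT = 1

.global _start

_start:
    ldr    %r0, answer       // Load the word stored at the label
    mov    %r7, $SYS_EXIT    // Exit, using the loaded value as the exit code
    swi    $0

answer:
    .word 42                 // The "immediate" value, stored as data
```

After running the program, echo $? in the shell should report 42.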
The downside, and the reason we prefer to use mov if
we can, is that executing a pseudo-immediate ldr will
take much longer than the truly immediate mov. As well
as the CPU having to read and decode the instruction code itself
from memory, which is all that mov requires, using ldr
requires some additional arithmetic and then a further read from
memory. Using mov, where possible, is faster, as
well as using less storage. To be fair, we won't notice the
difference of less than a microsecond in a trivial program like this, but
those microseconds add up when there are millions of them.
In short: mov is an immediate instruction -- all
the data it needs is in the instruction itself. ldr
is an indirect instruction that reads from memory, but the
assembler simulates an immediate mode for ldr because
mov has a range limitation.
Where is the data?
You may have noticed that the data that forms the "Hello, World" message is just tacked on the end of the program code. This is a reasonable thing to do, but a little unconventional -- usually the program's constant data will be placed in a separate memory segment. I'll illustrate this in the next example.
One disadvantage of using the .text (program code) section
for data is that if you try to disassemble the code, or use an
interactive debugger, the tools won't be able to tell the difference between
genuine program code and your data. This won't do any harm, but it will
make the tools confusing to use.
Summary
- The sys_write syscall outputs data to a file or device.
- The ldr instruction reads data from memory into a register.
- ldr can be used to overcome the range limitation in the mov instruction, but mov -- where it can be used -- is faster and uses less storage.