C development for Linux without a standard library


Why???

This article discusses some of the challenges involved in creating C applications without the use of a standard C library or, indeed, any dependencies at all except a Linux kernel. It's certainly reasonable to ask why you'd want to do such a thing, given that there are many substantial and well-maintained standard C libraries around. I can think of a number of possibilities -- working on embedded systems where storage is critical, for example, or simply learning more about how Linux works under the hood.

I will illustrate the various roles played by a C standard library by implementing a very simple Linux shell called shnolib. There's nothing particularly remarkable, or even interesting, about the program, except that it forces us to think about what the standard library does, and how we can compensate for its absence.

shnolib is not an impressive shell -- it just prompts the user to enter a line of text, parses it, searches the $PATH for a matching program, and executes it. It can also execute a file of commands, line by line, to demonstrate the principles of buffered input (of which, more below).

There is one slightly interesting feature of shnolib, which can be seen by running top -p on it:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3083 kevin     20   0     184      8      0 S   0.0  0.0   0:00.00 shnolib
In case it's not clear, that's 8 kilobytes of mapped physical memory, in a total address space of 184 kB. In this article I'll be making many references to the implementation of shnolib; the source code is available on github.
Note:
This article primarily describes the AMD64 architecture, although only about 30 lines of assembler code are actually architecture-specific. AMD64 is probably the platform that least needs the techniques described in this article; but it's ubiquitous, making experimentation easy. The source code also includes an assembler module for ARMv7 (e.g., the Raspberry Pi).

What a standard C library actually does

Most C developers don't think much about the standard library; many have only a vague notion what it does. It's really only when working in embedded systems, or non-standard Linux variants like Android, that we really have to think about the standard library.

If you work with GCC, then there's a 99.9% probability that you're using the GNU standard library, glibc. This is such a fundamental part of Linux that it's almost considered part of the kernel. It's rare to find a desktop or server Linux that does not have glibc installed and in use by just about every other piece of software.

But glibc isn't the only standard library, even for Linux. Android doesn't use it by default -- it has its own standard C library called Bionic. Bionic implements a subset of the functionality of glibc, but in a way that is appropriate for the Android platform. A number of embedded Linux systems use uClibc, which attempts to be much smaller than glibc, while offering most of the same fundamental features.

But what does the standard C library actually do? Some or all of the following things, and perhaps others:

  • providing the program's entry point, and setting up its environment;

  • wrapping kernel syscalls in C functions;

  • providing useful utility functions, such as string manipulation;

  • memory management -- malloc() and free();

  • buffered input and output;

  • supporting variable-length argument lists;

  • providing arithmetic operations that the CPU does not support directly.

These are all features that most C programmers -- in fact, most programmers in any language -- take completely for granted. You probably don't realize how much the standard library is doing, until it's gone.

The "useful functions" that shnolib requires mostly centre on manipulating text strings. The program needs a way to parse a command line, tokenize the $PATH, and construct filenames. It needs a way to generate text from system call error codes. It doesn't really need buffered input and output, but implementing it is a useful exercise.

In the sections that follow, I will demonstrate each of the standard-library features described above, with reference to the shnolib example. However, before getting into the details, we need a brief digression about C-language calling conventions.

A note about calling conventions

A 'calling convention' is a specification for passing arguments to functions, and returning values from functions. The developer rarely needs to worry about this -- the compiler generates the necessary code. It's really only something we have to face when writing code that calls C from assembler, or vice versa.

There are two main ways in which arguments are passed in C code.

  • The arguments are pushed onto the stack by the calling function, usually in reverse order so the first argument is at the top of the stack. The called function reads them from the stack.

  • The arguments are passed from the calling function to the called function in CPU registers.

These methods are often used in combination. Passing arguments in registers is much quicker than using the stack, but CPUs have a limited set of registers. The calling convention needs to define which registers are used for which types of data.

For the record, on AMD64 GCC uses the following registers for integer and pointer data, in this order: RDI, RSI, RDX, RCX, R8, and R9. If a function takes more than six arguments, the remaining arguments are pushed onto the stack. Different registers are used for floating-point numbers, but that need not concern us here.

The return value from a function, if it is an integer or pointer, is passed in the RAX register.

Now, here's the complication: the calling convention on AMD64 for Linux kernel syscalls is almost the same as the GCC calling convention. But not exactly the same. Specifically, syscalls use R10 rather than RCX for the fourth argument. In addition, the kernel requires the specific syscall to be identified by a number in the EAX register. We'll see why this is important when we discuss calling kernel syscalls from C later.

A note about assembly language

There are a few places in the implementation of the shnolib program where we need assembler code. However, I've tried to minimize the use of assembly language, to those small parts of the example that really need it -- where there's no alternative. Although it's possible to embed assembly code in C, I prefer to keep these things separate, so it's clearer what needs to be changed to suit a different architecture.

The assembly language snippets in this article, and in the shnolib sample, are in GNU syntax, intended to be processed with the GNU as assembler.

Providing an entry point and environment

One of the main functions of a standard C library is to start the main program with a suitable environment. When building an executable file, which will be in ELF format on Linux, we specify the starting address in memory for program execution. The GNU linker ld will set the starting address to a function called _start, which has to exist. In principle, we could provide the _start function in C, but it needs to do various things that C can't easily do. Most obviously, the library needs to pass the command-line arguments and environment ($PATH, etc.) to the C program.

In all architectures, the kernel passes the command-line and environment on the stack. In compiler set-ups which use the stack exclusively for argument passing, we could just call main() directly from _start. However, this use of the stack for calling isn't guaranteed in C and, in fact, code generated by GCC for AMD64 uses CPU registers for at least some arguments. So we need to extract the relevant data from the stack, and put it into registers before calling main().

As I mentioned before, a C function needs its first argument passed in the RDI register, and the second in RSI. The function main is conventionally defined like this:

int main (int argc, char **argv)
So we need to pass argc in RDI and argv in RSI. This is easily done using the following snippet of assembler:
.global _start

_start:
    mov 0x0(%rsp),%rdi
    lea 0x8(%rsp),%rsi
    call __main
    ...

What we're doing here is moving the number at the top of the stack into RDI, and the address of the top of the stack into RSI. We need the address because the kernel doesn't push a char** onto the stack -- it pushes the individual char* values for the command-line arguments.

By way of comparison, here is the same code for ARM. If you compare the two implementations you'll see that, although the registers are different (ARM uses r0 and r1 for the first two arguments, and these are 32-bit registers), and the operand syntax is a little different, the AMD64 and ARM implementations do exactly the same thing. This is only to be expected, since the kernel passes the command line and environment the same way, regardless of architecture.

.global _start

_start:
   ldr    r0, [sp]
   add    r1, sp, #4
   bl     __main

We could call main() directly from _start but, in fact, I call a C function called __main(), which eventually calls main(). __main() carries out various initialization steps, including initializing the environment.

But where is the environment? Well, the kernel just pushes the environment strings, in the form FOO=bar, onto the stack, directly under the command-line arguments. The __main() function can find the location of the environment in memory simply by reading 'off the end' of the argv passed by the assembler code. We don't need the environment at start-up time, but we'll need it whenever the program needs an environment variable. The shnolib program uses the environment variables HOME and PATH.
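To make this concrete, here is a sketch of how __main() might record the environment, and how a getenv() lookup could then work. The kernel pushes a null pointer after the argv entries, and the environment pointers follow immediately after it; the details of shnolib's actual implementation may differ.

static char **environ; // our own pointer to the environment block

void __main (int argc, char **argv)
  {
  // The argv pointers are followed by a NULL; the environment
  // pointers start immediately after that.
  environ = argv + argc + 1;
  // ... other initialization, then call main() ...
  }

char *getenv (const char *name)
  {
  for (char **p = environ; *p; p++)
    {
    const char *e = *p, *n = name;
    while (*n && *e == *n) // match the "NAME" part of "NAME=value"
      { e++; n++; }
    if (*n == 0 && *e == '=')
      return (char *) e + 1; // return the part after the '='
    }
  return 0; // variable not set
  }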

My __main() method also sets up buffers for buffered I/O. This would be a good place to initialize the memory management system, if it were complicated enough to need initialization -- mine isn't.

Handling syscalls

Let's first consider the kinds of things that a C program will need to ask the kernel to do: reading and writing files and the console, starting new processes, allocating memory, and so on.

These functions are all provided by syscalls -- numbered entry points to the kernel.

We'll see that some syscalls are very rudimentary, compared to the sophistication that a C programmer would expect from a standard library. There's no malloc() in the kernel, for example -- programs are expected to manage their own address space.

A C function like write(), however, is actually a fairly thin wrapper around a kernel call, sys_write. In C, write() is usually defined like this:

int write (int fd, const void *buffer, size_t size);

That is, it takes two integer arguments and one pointer argument. These arguments will all need to be passed to the kernel in registers, as described above.

The Linux kernel is so closely bound to the C programming language that you'll find that the kernel parameters are a more-or-less exact match for the corresponding C functions in the standard library, for most of the syscalls that have C wrappers. That is, the kernel takes its arguments in the same order as the corresponding C function in the standard library. This means that we can implement a general syscall() method that calls an arbitrary syscall. However, syscalls indicate errors by returning a negative value in the RAX register, while C functions usually set errno to indicate failure. So there needs to be a little low-level data manipulation.

Let's assume for now that we have a C function that can call an arbitrary syscall. With that, we could write the write() function like this:

int write (int fd, const void *buff, int l)
  {
  int r = syscall (SYS_WRITE, fd, buff, l);
  if (r < 0)
    {
    errno = -r;
    return -1;
    }
  else
    {
    errno = 0;
    return r;
    }
  }

SYS_WRITE is the numeric code for the syscall. This number will not necessarily be the same for different Linux architectures -- in fact, it's not even the same for x86 and AMD64 systems. The syscalls are documented in various places, most obviously in the kernel source. For the record, on AMD64, SYS_WRITE is syscall "1".
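For illustration, a few AMD64 syscall numbers could be defined like this (the values are fixed by the kernel's AMD64 ABI; the names just mimic the conventional ones):

#define SYS_READ    0
#define SYS_WRITE   1
#define SYS_OPEN    2
#define SYS_CLOSE   3
#define SYS_BRK    12
#define SYS_EXIT   60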

This is all very well, but we don't have a syscall() function. This is something that really needs to be implemented in assembly, because its entire purpose is to juggle register values around. Here is the implementation from shnolib. There's nothing clever about it: it just aligns the kernel register order with the C register order. On some architectures, or with compilers which use the stack more for argument passing, this code would need to manipulate the stack rather than registers.

.global syscall

syscall:
    mov %rdi, %rax
    mov %rsi, %rdi
    mov %rdx, %rsi
    mov %rcx, %rdx
    mov %r8, %r10
    mov %r9, %r8
    syscall
    ret

Of the several hundred syscalls that are currently defined for the Linux kernel, my simple shell needs ten of them.

It's worth bearing in mind that syscalls are inherently slow to execute, compared to application code. It's always worth asking whether multiple operations can be aggregated into a single syscall.

To follow C standard library conventions, or not?

Suppose your program needs a way to copy a null-terminated character string from one place in memory to another. Standard libraries provide a function strcpy for this. If you're implementing this function yourself, without a library, then you can use any function name you like, taking whatever arguments you like. You could even choose to implement text strings as something other than null-terminated character arrays. A good case can be made, for example, for implementing strings so that the first few bytes of the string encode the string's length. A lot of nasty buffer-overrun attacks could have been prevented if the C language designers had worked this way from the start.

You could even choose to implement strings using multi-byte characters from the very start, as the designers of Java did.

My preference, though, is to follow the names and function prototypes of C standard libraries, when I'm writing code to provide comparable functionality. Not only does this make the code easier for other people to follow, but it makes the code more portable -- in both directions. To an experienced C developer, it's fairly obvious what a function called atoi (for example) does, and it would be confusing if a function with that name did something unexpected.
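As an example of following convention, a strcpy with the standard prototype might look like this -- a minimal sketch, without the optimizations a real library would use:

char *strcpy (char *dst, const char *src)
  {
  char *ret = dst;
  while ((*dst++ = *src++))
    ; // copy up to, and including, the terminating null
  return ret; // conventionally, strcpy returns its destination
  }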

There's one wrinkle with strings in particular -- however you choose to represent text strings in your program, the Linux kernel expects null-terminated byte arrays for filenames.

Memory management

C developers rarely think about how complex the task of memory management is, and how little help the kernel provides. It's not obvious, but a C library function like malloc() has no kernel equivalent -- at least, not an efficient one. It's possible to use the kernel's sys_mmap function to map pages of memory into the application's address space, but this is hopelessly inefficient for small, frequent memory allocations.

The basic facility provided by the kernel for managing memory is the sys_brk syscall, which gets or sets the end of the program's data segment. Increasing the size of the data segment effectively allocates memory to the program; reducing the size frees memory.

The argument to sys_brk is an address in virtual memory, not a size. If the argument is zero, the existing address of the end of the data segment is returned, otherwise the address is set. Conventionally the C standard library wraps the syscall in a function called brk(), and I have followed that convention in my sample program.

It's important to understand that the kernel cannot be forced to set the end of the data segment to a specific address, even if the memory is available -- the kernel will typically allocate memory in pages of a fixed size. Provided that enough memory is available, the sys_brk syscall will set the data segment to at least the requested address, but it might be higher. The return value from sys_brk is the address actually allocated.

In order to allocate memory using the sys_brk syscall, we need to know where the start of the data segment is, so we can determine where the end should be (at least approximately, as discussed above). Conventionally the data segment starts after the storage allocated to static variables and constants. For example, if we define static int x;, the compiler can reserve space for x even though its value is not known at compile time (it will be initialized to a default value, however). The linker defines a symbol called end whose location is just past the end of the allocated data area. So, peculiar as it may sound, the address &end is the start of the data segment. If we know the start of the data segment, we can always calculate the required end by adding the amount of memory required to the start address.
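In code, finding the start of the data segment looks something like this. The symbol end comes from the toolchain; heap_init() and heap_start are hypothetical names of my own:

extern char end; // defined by the toolchain: just past the static data

static char *heap_start;

void heap_init (void)
  {
  heap_start = &end; // the heap begins where static storage stops
  // To reserve N bytes, we could now call brk() with heap_start + N.
  }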

However, in most cases we don't actually care where the end of the data segment is -- we just want to move it up or down by particular amounts, according to the program's memory demands. The C standard library has a function called sbrk that expands the data segment by a specified number of bytes. In practice, C developers rarely call brk or sys_brk directly; of brk and sbrk, sbrk is the more useful. If we have a function brk() then we can implement sbrk like this:

void *sbrk (intptr_t increment)
  {
  void *old = (void *) ((uintptr_t)brk (0));
  brk ((uintptr_t)old + increment);
  return old; // by convention, sbrk returns the previous break
  }

The ugly casts are necessary because the kernel does not distinguish between an integer value and a pointer, but the C compiler does. I should point out that types like intptr_t are defined in the header files of the standard library, and will depend on the architecture's pointer size. Without a standard library we have to make those definitions ourselves, but using the conventional type names allows us to keep track of what changes need to be made to suit different architectures. GCC's pre-defined macros, such as __SIZEOF_POINTER__ and __LP64__, can help with this.

So, in summary, we can use the sys_brk syscall to expand, and perhaps contract, the amount of memory allocated to the program.

This is all very well if the program just manipulates one huge block of memory that all functions share. In practice, this isn't very convenient -- we need functions equivalent to malloc() and free() for doing fine-grained memory management.

shnolib uses a very crude memory allocation strategy, which I wouldn't recommend for any serious application -- but it does work: malloc() carves blocks out of the data segment, extending it with sbrk() when necessary, and free() simply marks blocks as available for reuse.

The implementation is inefficient for a whole host of reasons. Most obviously, it makes no attempt to match the sizes of memory requests to the sizes of available blocks -- memory can easily become fragmented. Still, it does work well enough to demonstrate the principle.

A number of highly efficient memory allocators have been developed over the years, some in C and some in assembler. The only possible defence for my crude implementation is that it requires only about 400 bytes of code.
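For illustration, here is a sketch of the kind of crude scheme described above. It is not shnolib's actual code -- just a minimal first-fit allocator, built on the sbrk() from the previous section, with a small header in front of each block:

#include <stddef.h> // freestanding headers, provided by the compiler
#include <stdint.h>

extern void *sbrk (intptr_t increment); // from the previous section

typedef struct header
  {
  size_t size;
  int free;
  struct header *next;
  } header;

static header *head = 0;

void *malloc (size_t size)
  {
  // First fit: take any free block that is big enough, however poor
  // the size match -- hence the fragmentation mentioned above.
  for (header *h = head; h; h = h->next)
    {
    if (h->free && h->size >= size)
      {
      h->free = 0;
      return h + 1; // the caller's memory follows the header
      }
    }
  // No suitable block: extend the data segment.
  header *h = sbrk (sizeof (header) + size);
  if (h == (void *)-1) return 0; // assuming sbrk signals failure this way
  h->size = size;
  h->free = 0;
  h->next = head;
  head = h;
  return h + 1;
  }

void free (void *p)
  {
  if (p)
    ((header *)p - 1)->free = 1; // just mark the block as reusable
  }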

Buffered I/O

The C standard library uses buffered I/O to implement functions like fprintf() and fgets(). These functions all take a FILE * as an argument; the FILE structure maintains memory and indexes for buffering data.

But why do we need this additional complexity? The reason, which I've already alluded to, is that there are significant overheads involved in making kernel calls. If you need to read or write a large file, it's hugely more efficient to do it in a few large blocks than in many small I/O operations.

If we do buffer I/O, then we need a mechanism to flush the buffers at specific points, and when the file or stream is closed. This flushing can be done automatically (when an end-of-line is read, for example), or we can provide specific functions for it. The C standard library provides both methods.

There's nothing particularly interesting about implementing buffered I/O -- it's just a case of hacking code to manipulate the buffers. There are some subtleties involved in making a really optimal implementation, as there always are. shnolib uses its own implementation of fputs() and fgets() for writing and reading the console. This isn't strictly necessary, since a shell -- even a crude one -- only works at the user's speed. However, it's an interesting exercise to implement these functions.
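As a sketch of the principle, a minimal buffered writer for standard output might look like this. The names are my own, not shnolib's; write() is the wrapper from the syscall section.

extern int write (int fd, const void *buff, int l); // from earlier

#define OBUFSZ 4096
static char obuf[OBUFSZ];
static int opos = 0;

void flush_stdout (void)
  {
  if (opos > 0)
    {
    write (1, obuf, opos); // one syscall writes many characters
    opos = 0;
    }
  }

void putc_stdout (char c)
  {
  obuf[opos++] = c;
  if (opos == OBUFSZ || c == '\n') // flush when full, or at end of line
    flush_stdout ();
  }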

Variable-length argument lists

It's idiomatic in C programming to define functions that take a variable number of arguments. Everybody is familiar with printf(), for example. Usually the called function determines how many arguments were passed in one of two ways. It might use the first argument to determine what the others are (as printf does) or it might use some special value, usually NULL, as an indication where the arguments stop. This is what execl does. These techniques are necessary because C calling conventions do not stipulate that the caller must indicate explicitly how many arguments have been passed -- the called function has to work it out from the context.

Developers do not usually have to worry about the mechanics of handling variable-length argument lists -- the standard library provides macros that do all the work. It's easy, therefore, to forget how difficult the process actually is. In the old days, when argument passing was invariably by the stack, it was relatively straightforward -- if the called function wanted to read the n'th argument, it would look n places down the stack. We knew where the top of the stack was, because the first argument passed to the function -- which is always named explicitly -- is on the top of the stack. So if we read 'off the end' of the first argument, we get the other arguments. This was straightforward whether it was done by the C developer, as it was back in the day, or by features in the standard library.

These days we don't normally use the stack alone for passing arguments. Modern compilers typically use some mixture of stack-based and register-based argument passing. It's no longer possible to find the arguments just by reading off the end of the first argument. On AMD64 the first argument is not even in memory -- it's in the RDI register.

When we use macros like va_start to read variable-length argument lists, the implementation comes from the standard library's (or the compiler's) headers. The implementation is complicated, architecture-specific, and expressed at least partly in assembly language or compiler built-ins. So there needs to be a different implementation for each architecture. In fact, I can't think of any way of handling variable-length arguments, however ugly, that isn't architecture-specific.
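For what it's worth, GCC exposes this machinery as compiler built-ins, so a freestanding program can get working variable-length argument support without writing the assembly itself -- the architecture-specific work is simply delegated to the compiler. A minimal sketch:

// GCC's built-ins stand in for the library's stdarg.h:
typedef __builtin_va_list va_list;
#define va_start(v,l) __builtin_va_start (v, l)
#define va_arg(v,t)   __builtin_va_arg (v, t)
#define va_end(v)     __builtin_va_end (v)

int sum (int count, ...) // the first argument says how many follow
  {
  va_list ap;
  int total = 0;
  va_start (ap, count);
  for (int i = 0; i < count; i++)
    total += va_arg (ap, int);
  va_end (ap);
  return total;
  }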

It's undeniably useful to implement functions that provide functionality like printf and fprintf. However, adding the code needed to support this kind of operation would have doubled the size of the shnolib program, and made it even less portable than it already is. When working without a standard library, I've learned to work around the lack of variable-length argument support, rather than implementing it myself.

Arithmetic

CPUs differ in their built-in support for arithmetic. Most CPUs have limited support for floating-point math, but even integer arithmetic is not guaranteed to be supported in full. 64-bit x86 CPUs have a full set of integer operations (add, subtract, multiply, divide, modulus) for sizes up to 64 bits. Since gcc takes a long to be 64 bits on AMD64, the compiler can generate inline instructions for arithmetic on both int and long variables.

The 32-bit ARMv7, in contrast, has 64-bit addition, subtraction, and multiplication support, but no such support for division or modulus. So, while the compiler will take care of some arithmetic operations, for others it will output a function call. Thus, for example, some forms of division result in the generation of a call to __aeabi_idiv.

If you're working without a standard library, you'll need to provide implementations of these functions, or reorganize your code so they aren't necessary. It's not difficult, in principle, to implement division and modulus using subtraction, but it's difficult to do well. shnolib only uses these operations for converting error codes to ASCII, and my implementations do not bear close scrutiny.
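For example, unsigned division and modulus can be computed together by shift-and-subtract long division. This is a sketch of the principle, not shnolib's code; behaviour for a zero divisor is undefined, just as with the / operator:

unsigned udivmod (unsigned n, unsigned d, unsigned *rem)
  {
  unsigned q = 0, r = 0;
  for (int i = 31; i >= 0; i--)
    {
    r = (r << 1) | ((n >> i) & 1); // bring down the next bit of n
    if (r >= d)
      {
      r -= d;       // the divisor 'goes' at this position...
      q |= 1u << i; // ...so record a 1 in the quotient
      }
    }
  if (rem)
    *rem = r; // the remainder gives us modulus for free
  return q;
  }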

Compilation issues

Back in the day, C compilers were completely independent of the standard C library. That clean separation between what the compiler does and what the standard library does is an enduring benefit for those of us who have to work without libraries and dependencies.

More recently, the line between compiler and library has become blurred in the GCC compiler. For example, if you use a function like open(), the compiler will warn you to #include the relevant header if you forget. It will also warn you if you aren't using the arguments to printf correctly.

While useful for routine programming, all this sophistication just gets in the way when working on embedded systems. You can disable a lot of it using the -fno-builtin switch to gcc.

If you look at the source for shnolib, you'll see that I created a single source file and a single header file for all my general-purpose functions. You could, in principle, use the same source and header for multiple projects that required the same functionality. However, in applications like this, storage is likely to be critical, and we wouldn't want to include function definitions that aren't actually used.

The solution here is to compile the code using the -ffunction-sections -fdata-sections switches. This forces gcc to output a separate section in the object file for each definition. While this is slightly wasteful by itself, you can then specify --gc-sections to the linker. The linker will remove entire sections that have no references to them.
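Putting it all together, a build might look something like this. The file names are hypothetical; -nostdlib tells gcc not to link the standard library or its usual start-up files, which is why we must supply our own _start:

gcc -nostdlib -fno-builtin -ffunction-sections -fdata-sections -c shell.c compat.c
as start.s -o start.o
ld --gc-sections -o shnolib start.o shell.o compat.o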

Closing remarks

There are very few reasons these days, other than when working on embedded systems, to write C code without using a standard library. Libraries like glibc are well-maintained, efficient, and relatively compact for the amount of functionality they provide. However, learning how to work without a standard library provides useful insights into the operation of the platform, and is worth attempting, even if only for educational purposes.