Why 'int x = 0' is uninitialized data to the GNU C compiler
This article is about a curiosity of gcc that will interest almost nobody.
I came across it whilst implementing a program loader that could load binary code into RAM on a Raspberry Pi Pico. I suspect that this is the only application where the initialization behaviour of the C compiler is of any significance whatsoever. Still, I thought it worth writing up because, apart from anything else, it's such bizarre behaviour that I can imagine myself being caught by it again later, when my own intracranial RAM needs to be refreshed.
I noticed the problem when different (global) variables in my programs, although of the same type, were being treated differently at runtime. For example, I had variables defined like this:
int max_columns = 80;
int starting_column = 0;
I couldn't work out why max_columns seemed to take the correct value, 80, but starting_column had a crazy value.
Now, I'm writing the program loader myself so, of course, I could well understand why all the global variables would end up with crazy values. It would just be a bug in my loader. I could even understand why variables of a particular type or storage class might have ended up with crazy values. However, after much experimentation I realized that the crazy values were all assigned to variables which were initialized to zero in my code. It turned out I could set any value except zero.
Let's look at a simple example to see why this happens. Although I'm working on an embedded device, the problem can be seen quite easily using the ordinary Linux gcc.
Compile and link this trivial program:
int aa1 = 42;
int aa2 = 6;
int aa3 = 0;
int aa4;
int main() { return 0; }
$ gcc -o test test.c
I've chosen these "aa" names just to make it easy to find the variables in the ELF file created by gcc:
$ objdump --all test | grep aa
The output, on my system, is:
000000000040401c g     O .data  0000000000000004 aa1
0000000000404020 g     O .data  0000000000000004 aa2
0000000000404028 g     O .bss   0000000000000004 aa3
000000000040402c g     O .bss   0000000000000004 aa4
Notice that aa1 and aa2 -- variables that are assigned non-zero values in my program -- are in .data segments. But aa3, which is assigned the value 0, and aa4, which is not assigned a value at all, are in .bss segments.
The problem was that my program loader was not initializing the BSS segments correctly. So variables whose values were stored in those segments were not getting zeroed. Since aa3 specifically had to be zero, not zeroing it was a significant error.
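As a rough illustration of what a loader has to do, here is a minimal sketch. It assumes the image format records, for each region, both the number of bytes present in the file and the total size the region occupies in RAM (ELF program headers do this with p_filesz and p_memsz). The struct load_region and the function load_region_into_ram are hypothetical names invented for this example; they are not taken from my loader.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one loadable region of a program image. */
struct load_region {
    const unsigned char *file_bytes; /* bytes actually stored in the image */
    size_t file_size;                /* how many bytes are stored */
    size_t mem_size;                 /* size in RAM; the excess is BSS */
};

/* Copy the stored bytes into RAM, then zero whatever the file did not
 * supply. Skipping the memset is the kind of mistake that leaves
 * zero-initialized variables holding stale RAM contents. */
void load_region_into_ram(unsigned char *dest, const struct load_region *r)
{
    assert(r->mem_size >= r->file_size);
    memcpy(dest, r->file_bytes, r->file_size);
    memset(dest + r->file_size, 0, r->mem_size - r->file_size);
}
```

The point of the sketch is only that the zeroing is a separate, explicit step: nothing in the image itself will overwrite whatever happened to be in that part of RAM.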
But why are different variables of the same type in different segments?
What's so special about the value zero?
The fact is that I don't really know. I presume that this is some kind of optimization carried out by gcc, in an attempt to save a few bytes somewhere.
Conventionally, 'data' segments are used to store values that have been initialized specifically by the programmer. It matters what the values are, and they are stored in the executable file generated by the compiler.
'BSS' segments, however, are traditionally used to "store" variables that have not explicitly been given an initial value. "Store" isn't really the right word here: the linker does not have to allocate any space in the executable file for values of these variables -- they are just placeholders.
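The difference in file size is easy to see with a large array. The following is a small sketch, not a benchmark: the names zero_table, data_table and count_nonzero are made up for this example, and the behaviour described assumes gcc's defaults on a target with a BSS section.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE (1024 * 1024)

/* Every element is zero, so by default gcc can place this in .bss,
 * where it takes up no space in the object file. */
unsigned char zero_table[TABLE_SIZE] = {0};

/* A single non-zero element forces the whole array into .data, so all
 * of its bytes have to be stored in the file. */
unsigned char data_table[TABLE_SIZE] = {1};

/* Count the non-zero bytes in a table. At runtime both arrays hold
 * exactly what the source says, wherever they were placed. */
size_t count_nonzero(const unsigned char *table, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (table[i] != 0)
            ++n;
    }
    return n;
}
```

Compiling this with gcc -c and running the size utility on the object file should report each array under a different heading (data and bss), and the object file itself should be about one megabyte rather than two.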
Although no values are stored in the executable, values still have to be set in memory: the C language standards stipulate that global variables with no explicit initializer take the value zero at runtime. So somebody has to initialize them.
Setting these values to zero is not the responsibility of the compiler -- at least, it is not in the GCC world. Instead, some start-up code, executed before main() is invoked, has to zero all this data.
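On a bare-metal target, that start-up code is often little more than a loop between two addresses. The sketch below shows the general idea. In real start-up code the two bounds normally come from symbols defined in the linker script, whose names vary between toolchains, so here they are simply passed in as parameters. The function name zero_bss is an assumption for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Zero every word in the half-open range [bss_start, bss_end).
 * Assumes both bounds are word-aligned, which linker scripts
 * usually arrange. */
void zero_bss(uint32_t *bss_start, uint32_t *bss_end)
{
    assert(bss_start <= bss_end);
    for (uint32_t *p = bss_start; p < bss_end; ++p)
        *p = 0;
}
```

A program loader that bypasses the usual start-up code has to do the equivalent of this itself before jumping to the loaded program.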
I can only imagine that gcc assumes that, if the programmer sets a global variable to zero, it doesn't actually need to store the zero value in the executable. If it assigns it to a BSS segment, the start-up code will zero it along with all the other BSS data. And, frankly, neither the program nor the programmer usually cares about the exact addresses in memory where data is stored. So this slightly odd behaviour potentially results in a modest saving of executable size, at least if large numbers of variables are involved.
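It isn't an accident that gcc behaves this way. On targets that support a BSS section, gcc places zero-initialized variables there by default, and it has a specific switch to control this behaviour: -fzero-initialized-in-bss. Passing the negated form, -fno-zero-initialized-in-bss, keeps such variables in the data section instead.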
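The placement can also be requested for a single variable, using gcc's section attribute. The following sketch reuses the two variables from the start of this article. It is gcc-specific, it assumes an ELF target where the data section is called .data, and the helper columns_remaining is invented for the example.

```c
#include <assert.h>

/* Non-zero initializer: placed in .data without any help. */
int max_columns = 80;

/* Zero initializer: would go to .bss by default. The gcc-specific
 * section attribute asks for .data instead, so the zero is stored in
 * the executable like any other initial value. */
int starting_column __attribute__((section(".data"))) = 0;

/* Hypothetical helper, here only so that both variables are used. */
int columns_remaining(void)
{
    return max_columns - starting_column;
}
```

Running objdump on the result, as above, is the way to confirm where each variable actually ended up on a particular toolchain.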
As I said, this is an oddity that will affect almost nobody. The default behaviour seems, well, wrong to me; but, unless you're implementing a program loader, you probably won't even notice.