Jens Gustedt's Blog

June 24, 2011

Name mangling in C

Filed under: C99 — Jens Gustedt @ 19:01

Most will know that C++ mangles external names in a compiler-specific way, so that they encode the types of function parameters and the nesting of classes and namespaces. People are probably less aware that most C compilers also mangle some names to make them unique inside a compilation unit.

Namely, most compilers create “local” symbols for static variables, and since the names of static variables in function scope are not unique within a compilation unit, the compiler has to mangle them. gcc, e.g., does that by appending a dot and a unique number to the name. Since the dot cannot be part of any valid identifier, this ensures that there is no name clash. Let us have a look at two simple variables:

  static int f = 0;
  static int e = 42;

and the symbols in the compiled object that are visible with nm. For gcc I have

0000000000000000 d e.1504
0000000000000004 d f.1503

so this is the original name (the human-readable part) followed by a dot and a serial number.
(The long numbers are the offsets of the objects in the object file; the lowercase “d” says that it is a local symbol in the initialized data section.)
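To make the need for mangling concrete, here is a minimal sketch with two functions that each declare a static variable with the same source-level name. The function names next_even and next_odd are made up for this illustration and do not come from the example above.

```c
#include <assert.h>

/* Two function-scope statics share the source-level name "count".
   Both have static storage duration, so the compiler has to emit two
   distinct local symbols for them in the same object file. */

/* Returns 2, 4, 6, ... on successive calls. */
int next_even(void) {
  static int count = 0;
  count += 2;
  return count;
}

/* Returns 1, 3, 5, ... on successive calls. */
int next_odd(void) {
  static int count = 1;
  int result = count;
  assert(result % 2 == 1);  /* this counter only ever holds odd values */
  count += 2;
  return result;
}
```

Compiling this with gcc -c and running nm on the object file should list two distinct local symbols whose names both start with count; the exact suffixes depend on the compiler and its version.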

icc does it differently:

0000000000000004 d e.3.0.2
0000000000000000 d f.3.0.2

so here we have, as above, the original name, followed by something that identifies the scope inside the compilation unit. Observe that icc relies here on the fact that e and f are unique in their respective scopes; we will come back to that later.
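A scheme of the form "name plus scope identifier" also has to cope with the same name appearing in different block scopes of a single function. The following sketch shows such a case; the function count_sign and its variable names are hypothetical and only serve as an illustration.

```c
#include <assert.h>
#include <limits.h>

/* Each branch declares its own static named "hits". The two objects are
   distinct, so a symbol name built from "hits" plus a scope identifier
   is unique only because the two block scopes differ. */
int count_sign(int x) {
  if (x < 0) {
    static int hits = 0;     /* counts calls with a negative argument */
    assert(hits < INT_MAX);  /* guard against signed overflow */
    return ++hits;
  } else {
    static int hits = 0;     /* counts calls with a non-negative argument */
    assert(hits < INT_MAX);  /* guard against signed overflow */
    return ++hits;
  }
}
```

Both counters advance independently, which is only possible if the compiler keeps two separate storage locations, and thus two separate local symbols, for them.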

Things become more obscure when the compiler supports universal characters in identifiers. Depending on the environment, the compiler has to mangle such identifiers, since, e.g., the loader might not support them in the symbol table.

icc chooses to replace such a character by something like _uXXXX, where XXXX is the hex representation of the character. For icc, this results in two subtle compiler bugs. First, this mangling produces valid identifiers that the user is also allowed to use, so global symbols may clash with identifiers from the same compilation unit or even from other units.

int o\u03ba;
int o_u03ba;

The symbols that gcc produces for these objects (placed in file scope) are straightforward, namely oκ and o_u03ba, which also shows you that the Unicode character at position 03ba is a Greek kappa. In contrast, icc produces the same external name o_u03ba for both objects. Even if the two are placed in different compilation units, there is a clash when they are linked together.
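The clash can be reproduced with a small model of the _uXXXX replacement described above. This is only a sketch under assumptions: the functions mangle_ucn and mangled_clash are hypothetical, identifiers are passed as arrays of code points, and nothing here claims to be what icc actually does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the _uXXXX scheme, NOT icc's real implementation.
   Writes the mangled form of the n code points in cp to out, which has
   room for cap bytes including the terminating null character.
   Returns 0 on success and -1 if out is too small. */
int mangle_ucn(const unsigned long *cp, size_t n, char *out, size_t cap) {
  size_t used = 0;
  assert(out && cap > 0);
  out[0] = '\0';
  for (size_t i = 0; i < n; ++i) {
    int len;
    if (cp[i] < 0x80)   /* basic characters are copied unchanged */
      len = snprintf(out + used, cap - used, "%c", (int)cp[i]);
    else                /* everything else becomes _u plus hex digits */
      len = snprintf(out + used, cap - used, "_u%04lx", cp[i]);
    if (len < 0 || (size_t)len >= cap - used) return -1;
    used += (size_t)len;
  }
  return 0;
}

/* Returns 1 if the two identifiers have the same mangled name,
   0 if they differ, and -1 if a mangled name does not fit the buffer. */
int mangled_clash(const unsigned long *a, size_t na,
                  const unsigned long *b, size_t nb) {
  char ma[128], mb[128];
  if (mangle_ucn(a, na, ma, sizeof ma) || mangle_ucn(b, nb, mb, sizeof mb))
    return -1;
  return strcmp(ma, mb) == 0;
}
```

Feeding the code points of o\u03ba and the plain characters of o_u03ba to mangled_clash reports a clash, because the replacement maps into the set of identifiers that a user may legally write. A scheme that inserts a character which cannot occur in an identifier, such as the dot that gcc uses for local statics, would not have this problem.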

Second, icc even mixes up its own internal naming. All goes well as long as the local variables that we declare are auto, e.g. in the local scope of some function:

int \u03ba;
int _u03ba;

Here _u03ba is a valid name inside a function: one leading underscore followed by a lowercase letter is allowed. As long as we define the variables like that, icc distinguishes them internally and all goes fine. But if we declare them as static, icc’s mangling convention backfires. The internal names are both folded onto the name _u03ba.83.0.1. Remember that icc needs the “real name” of the variables to distinguish its local statics.

The code still behaves somewhat as expected if the compiler can optimize the access to the static location away and hold the values in temporaries. It then crashes at seemingly random points when the optimizer can’t keep track of the value anymore or when the variables are also declared volatile.

#include <stdio.h>
int o\u03ba;
int o_u03ba;
int main(int argc, char*argv[]) {
  printf("addresses %p and %p\n", (void*)&o\u03ba, (void*)&o_u03ba);
  static int volatile \u03ba;
  static int volatile _u03ba;
  \u03ba = !!argc;
  _u03ba = !argc;
  printf("values %d and %d\n", \u03ba, _u03ba);
  return \u03ba == _u03ba;
}

This little example program shows correct output with gcc:

addresses 0x60103c and 0x601038
values 1 and 0

and goes fundamentally wrong with icc:

addresses 0x604a44 and 0x604a44
values 0 and 0


5 Comments »

  1. Hi Jens.

    Great article. I need a way to build deterministically, so that the unique numbers added by gcc’s C compiler are always the same for the same static variable. Do you know if this is possible?

    Regards, Micke.

    Comment by Micke — November 25, 2011 @ 12:59

    • Hm, I am not sure what you want to achieve. Folding several local static variables on top of each other? I don’t think you should do such a thing, it can only cause you headaches.

      In any case, in C static variables should always be defined in .c files. So if you’d have to have several functions that “share” the same variable, why not just have a global static variable in the corresponding C file with a “good” name that makes clear that the purpose is to be shared between the functions.

      Jens

      Comment by Jens Gustedt — November 25, 2011 @ 13:14

  2. That’s true: it’s not that universally known that C compilers can also mangle names. I guess they didn’t some long time ago.
    So, thanks for reminding about it.

    Comment by Alexey Ivanov — June 8, 2012 @ 10:33

    • Even before we had support for complicated character sets, C compilers already did rudimentary mangling, I think: prefixing names with an underscore, shortening long names, mapping to lower or upper case.

      Comment by Jens Gustedt — June 8, 2012 @ 11:13

