Jens Gustedt's Blog

July 15, 2013

a praise of size_t and other unsigned types

Filed under: C11, C99, integers, language — Jens Gustedt @ 16:17

Again I had a discussion with someone from a C++ background who claimed that one should use signed integer types where possible, and who also claimed that the unsignedness of size_t is merely a historical accident and would never be defined as such nowadays. I strongly disagree with that, so I decided to write this up, for once.

What I write here will only work with C, and can possibly extended to C++ and other languages that implement unsigned integer types, e.g good old Pascal had a cardinal type.

Let’s get practical and look into what type to use for array indexing, let’s call it T. A typical use of such an index type is something like

for (T i = 0; i < array_size; ++i) {
   A[i] = something_nice;
}

Here the first “problem” might the comparision i < array_size. Your compiler might already wine here, if i and array_size are of different signedness (and he is right). Everybody knows that index type and bounds should be of the same type, and if it were a variable, array_size should have been declared as

T array_size = something;

What if it is not a variable, but an expression? Well that expression should be of the same type. Now the difficulty arises, the type of expressions in C is a subtle subject. In general, you can only find out of the concrete type of an expression by looking up the types of all of its components. If we suppose that they are all of the same width and

  • all signed, the result is signed
  • all unsigned, the result is unsigned
  • mixed, the result is unsigned

(might be a bit different on some historical platforms, but should be a safe bet for the last 20 years.)

Now bound expressions often contain operands that are unsigned, for the simple reason that size_t is a unsigned type: a typical “trick” to compute the number of elements in an array is

sizeof A / sizeof A[0]

which is of type size_t. E.g for the above simple loop we have

for (T i = 0; i < sizeof A / sizeof A[0]; ++i) {
   A[i] = something_nice;
}

and so T should be an unsigned type.

Another typical thing to do with indices is to check if they are in bounds. Usually the lower bound is 0 and the upper bound is some value. Typical code for a more complicated iteration would be

for (T i = start_value; in_bounds(i); i = update(i)) {
   A[i] = something_nice;
}

How would you then implement in_bounds? If T is signed you’d do

bool in_bounds(T i) {
   return i >= 0 && i < some_value;
}

whereas if it is unsigned you’d have something simpler:

bool in_bounds(T i) {
   return i < some_value;
}

This works independently of update(i), even if that would decrement i. Provided that our array A has at least 42 elements the following works like a charm with size_t.

for (size_t i = 41; i < sizeof A / sizeof A[0]; --i) {
   A[i] = something_nice;
}

Such computations always have well defined behavior, unsigned types just wrap around below 0 or above their maximal value. No traps, no signals, no exceptions.

In general index calculations with unsigned types just work, whereas for signed types you’d often have to think of casting some of the boundary expressions and treat special cases. So using an unsigned type for size_t is not a historical accident, it is a necessity if you want to have things simple. And other than most other modern programming languages, C is made to be simple and straight.

Now why use size_t and not unsigned? Because it has the width that you’d need to address any kind of object on your platform. E.g doing complicated index computations for matrices easily overflows if you use a narrow unsigned type.

About these ads

16 Comments »

  1. There are cases where using size_t for loop index is less convenient than using an int. Iterating through a sub-array skipping N elements at the beginning and M elements at the end is simpler with an int (note that sub-array may be empty):

    for (int i=N; i < (int)array_size-M; ++i)
    

    vs. size_t:

    if (N <= (int)array_size-M)
        for (size_t i=N; i < array_size-M; ++i)
    

    Comment by Paul Jurczak — July 16, 2013 @ 06:59

    • Paul, this is tempting, but I can’t completely agree with your analysis.

      The only assumption that must be made for the size_t version to work is that M <= array_size, which is a reasonable assumption. The size_t version, if you make that assumption explicit, then reads

      if (M <= array_size)
          for (size_t i=N; i < array_size-M; ++i)
      

      which is fairly simply, and as readable as the int version with the cast clutter in the middle of an expression. And in many cases where you have an assertion for the size of M the if clause wouldn’t even be necessary.

      BTW, your int version has a lot of implicit assumptions about M. It shouldn’t be negative (if it is an int, too) or you must cast it also if it is a size_t expression.

      Comment by Jens Gustedt — July 16, 2013 @ 08:19

      • You are correct, M <= array_size is sufficient, but requiring an if instruction to accompany for loop in case of some ranges, makes it an irregular pattern.

        Additionally, using unsigned may rely on integral overflow for range testing, which requires modulo operation for precise mathematical description (not needed for equivalent for loop with int). Iterating with unsigned creates “magic” jumps on the number axis as opposed to orderly steps in case of int.

        None of these approaches is perfect – I tried both and decided that int is a lesser evil.

        Comment by Paul Jurczak — July 16, 2013 @ 11:48

        • Paul, perhaps I made my point not clear enough, but the test for the size_t is as necessary as would be proper testing for the int version. If M is int it should read

          if (M >= 0)
            for (int i=N; i < (int)array_size-M; ++i)
          

          if it is size_t you’d have to do

          if (INT_MIN+(int)M <= (int)array_size)
            for (int i=N; i < (int)array_size-(int)M; ++i)
          

          (if you’d argue that these usually aren’t necessary, I could do the same for the check in my unsigned variant)

          And then unsigned types never overflow, they wrap around in a very simple and controlled way.

          As you can see in the refined analysis above, the int version not only may wrap around if you aren’t careful but may raise a signal or provoke undefined behavior.

          So frankly I don’t see much of an advantage, only that you have int casts all over: one big plus for properly written code with unsigned types is that you usually don’t need casts.

          Comment by Jens Gustedt — July 16, 2013 @ 12:15

      • Paul: even if you prefer signed to unsigned, you shouldn’t use an “int”.
        The last note still applies: “Because it has the width that you’d need to address any kind of object on your platform. E.g doing complicated index computations for matrices easily overflows if you use a narrow unsigned type.”

        POSIX has ssize_t as a solution, but that’s of course non-portable outside of POSIX.
        A portable alternative is ptrdiff_t:

        http://en.cppreference.com/w/c/types/ptrdiff_t

        http://en.cppreference.com/w/cpp/types/ptrdiff_t

        It’s probably also a good idea to typedef this to something descriptive (of course, another problem with “int” is that it’s not descriptive, either, anything can be an “int”, not every “int” can represent size, not every size can be represented by “int”).

        Personally, I just use size_t, found it simpler and more self-documenting :-)

        Comment by MD — July 16, 2013 @ 13:53

        • If you have to use a signed type, then you are right, probably ptrdiff_t is a good choice. On any reasonable architecture I would expect this to be the same type as ssize_t, anyhow.

          Comment by Jens Gustedt — July 16, 2013 @ 14:25

  2. I admit, there is a lot of merit in your approach. I find the line below elegant and disturbing at the same time:

    for (size_t i = v.size()-1; i < v.size(); --i)
    

    but I find it cleaner then this pattern:

    for (size_t i = v.size()-1; i != (size_t)-1; --i)
    

    Comment by Paul Jurczak — July 16, 2013 @ 14:28

    • Yes, I also clearly prefer the first version, it more clearly states the range of iteration. And if your size() thing is really a function, I would go for something like

      for (size_t size = v.size(), i = size-1; i < size; --i)
      

      For the “disturbing” part, just wait until you get used to it :)

      Comment by Jens Gustedt — July 16, 2013 @ 14:36

  3. I am definitely in the signed camp myself (and I code mainly in C++ rather than C). With unsigned types you are always at greater risk of bumping up against the boundary condition at 0 where it’s all too easy to accidentally put in a >= 0 check that will always be true (although compilers should warn of this) or for a number to wrap around beyond 0 and suddenly become 4 billion. Signed types keep you more safely in the middle of the range of available numbers.

    I find that using unsigned types actually ends up introducing a lot *more* messy casts into code because signed and unsigned values often need to interact. I personally wish size_t and sizeof returned signed types so that I could avoid casts in these situations too (and really, is the ‘sizeof’ something ever likely to be bigger than half the range of a machine word?). Another reason for my preference for signed integers wherever possible is that literal integers in code are, by default, signed and having to apply a ‘u’ suffix is a bit ugly.

    I would argue that your in_bounds check is less explicit about the requirement for a number to be >= 0 to somebody reading the code, especially if the type is a typedef or a macro or template parameter. You can always cast to unsigned if you want the optimisation and this will alert the reader of the code that clearly it is intentional to omit the >= 0 check and that it wasn’t a mistake. I prefer readability over premature optimisation.

    I’ve also heard the argument that using unsigned types for values is a hint to the reader that values shouldn’t or cannot be negative. In those cases I think it is better to implement explicit range checks with an assert or other check for *both* ends of the valid range, not just at the zero end..

    Another argument for using signed representations is that subtracting two unsigned values can produce a negative number. Then you start having a mix of unsigned and signed numbers in a piece of code and the likelihood of needing casts or of mistakenly mixing up signed and unsigned types increases. Going for ‘int’ by default for everything keeps things a bit simpler, you just don’t need to think about it. For the same reason, I never use any types smaller than an int for anything except storage of values in memory (and also to avoid the compiler generating extra unnecessary masking instructions).

    The only time I will use unsigned types if I know the value might be bigger than 0x7F, 0x7FFF or 07FFFFFFF etc or for bitfields or other values on which arithmetic will not be performed.

    Comment by Matt Shepcar — July 17, 2013 @ 21:36

    • Matt, thanks for your opinion.

      Much of what you say is just that, opinion, coding habits and style. And that is completely ok, much about good coding is about a consequent use of style, that eases reading and code review. Here I just present a different one.

      Now to your particular arguments that you give.

      • You say that you’d wished that you could avoid casts, and this is a very good idea. With my style I usually don’t have casts at all, none for integer types, and only a few for pointers.
      • You say that literals are int? They aren’t in general. The rule for that is relatively complicated and fills a whole page in the standard. Basically it is a first fit rule, the smallest type that fits with width at least int is used. Decimal constants can be signed or unsigned, octal and hexadecimal constants are signed or always unsigned. Thus it depends on the platform if a given constant is signed or unsigned. Don’t rely upon that a constant with a several digits has a certain type, you may have surprising resuls.
      • If you are using subtraction, effectively you should usually use a signed type. The type that I mostly need for that is ptrdiff_t because it most naturally arises when I subtract two pointers to array elements. But I found in many cases that subtraction is just artificial. To come back to bounds checking: many people “naturally” come up with things like x < N - len where x + len < N would be completely sufficient.

      The discussion above already shows some cases, where bounds checking if done consequently with size_t is much easier and more readable because you avoid casts. Just give it a try.

      Comment by Jens Gustedt — July 17, 2013 @ 22:08

      • Thanks for replying. I guess if I had to choose just one argument in favour of signed variables it would be that I feel things become less complicated and more consistent if everything is signed. Maybe it’s a bit of a woolly argument based on opinion (and experience!) but I think it simplifies the reading, writing and maintainability of code if you know almost all values are always signed. I do agree that some of your examples are simpler, but these things, like sizeof and pointer subtraction are relatively infrequent compared to general day-to-day code involving calling your own functions and deciding whether function arguments and variables should be signed or unsigned. In many cases in C++ these low level operations will be hidden behind some interface such as std::vector or std:::array or an iterator anyway.

        I’m sure the rules are quite complicated for literal values but I’ve encountered numerous problems dealing with C++’s type deduction when mixes of signed and unsigned types occur (e.g. std::max/std::min with an unsigned var vs a literal) or dealing with C++ resolving function overloads when you’ve used a signed literal and there are only unsigned and floating point overloads etc. When everything is signed this becomes a bit simpler. Maybe this is less of a problem in C where there are no templates or overloads and maybe you don’t mind casting so much in C anyway.

        I do kind of agree with your point that in many cases ‘subtraction is artificial’. The assembly language programmer in me says that we should only have ints and floats (and other machine register formats) and only care about signedness when we perform operations that care about the sign bit such as comparisons or multiplies, e.g. jg/jl vs ja/jb in x86. However, your example of x < N – len vs x + len < N is a prime example of where an error can creep in with unsigned values due to values being close to boundary values (i.e. 0). This is something that, in my opinion, you shouldn't really need to care or think about because they should be mathematically equivalent. Using the more natural form has got to be preferable when in the majority of cases, where the values are not near boundaries, they are equivalent. This is only a simple example, more complicated expressions might be easier to foul up.

        Comment by Matt Shepcar — July 17, 2013 @ 23:37

        • First, please everybody don’t try to argue with that or that feature in C++. This is a site dedicated to C. Some things might apply to C++, too, but I wouldn’t know C++ enough to argue about these things. I abandoned C++ several years ago for good reasons. If you want to discuss such things about C++, please use a different site. And please, if you post something here, try to take the cultural differences between the C++ and C community into account.

          For your argument of “simplified reading”, this is easily turned around. If you only have size_t everywhere, the same reasoning applies.

          My primary use for integers is as indices, scanning through an array, doing matrix multiplication and stuff like that. I don’t have much use of negation, differences or things like that. 99% is incrementing or decrementing values, addressing and bounds checking. Using size_t as sole integer type works flawlessly, I don’t need any casts.

          Then your statement that errors might creep in for values close to 0, I simply didn’t get.

          Comment by Jens Gustedt — July 18, 2013 @ 09:54

      • Wouldn’t life be much easier if only signed integral types were available in C++? I don’t think there is more than a handful of cases, where 2^31-1 is too small, but 2^32-1 is large enough and int64 can’t be used.

        Comment by Paul Jurczak — July 18, 2013 @ 03:57

        • Definitively not for C. It is a conceptual thing. Cardinal values don’t need signs, bit operations on signed types is suspicious, and overflow secure code is much easier to write with unsigned types. So for a low level language like C this is not an option. I made this post to convince you that it is not just an accident and that a completely different line of thinking is possible, that leads to easier, clearer and more efficient code. From your comment here I see that this was not a success.

          Comment by Jens Gustedt — July 18, 2013 @ 06:50

  4. Hi, actually you are wrong in the statement “Such computations always have well defined behavior, unsigned types just wrap around below 0 or above their maximal value. No traps, no signals, no exceptions.”.

    Integer overflows result in an undefined behavior. Please take a look in the following posts:

    http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html (Chris Lattner, one of the creators of LLVM)
    http://blog.regehr.org/archives/213 (John Regehr, compiler researcher in University of Utah, USA)

    Comment by Raphael Rodrigues — July 18, 2013 @ 08:10

    • Actually it is you who is wrong, arithmetic with unsigned types is always well defined. Integer overflows only occur for signed types in C, not for unsigned types. For unsigned types everything is well defined, and the values just wrap around. From the C standard, 6.2.5 “Types”:

      A computation involving unsigned operands can never overflow,
      because a result that cannot be represented by the resulting unsigned integer type is
      reduced modulo the number that is one greater than the largest value that can be
      represented by the resulting type.

      So the inverse is true, using signed types gives you all that trouble, one could even say trouble is a builtin feature of signed types, whereas unsigned types can generally be used to detect potential overflow. Just as an exercise, try to write a test for two int values x and y to see if their sum would overflow, without producing overflow yourself.

      (And if you look into the examples on the pages that you linked to, you will notice that they all use signed values, not unsigned ones.)

      Comment by Jens Gustedt — July 18, 2013 @ 09:35


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Silver is the New Black Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 26 other followers

%d bloggers like this: