Jens Gustedt's Blog

March 9, 2017

Unicode operators for C

Filed under: C11, feature request, Modular C, rants — Jens Gustedt @ 16:54

C11 has added a certain level of Unicode support to C, but I think for C2x it will be time to go a step further and put C in line with general usage of special characters as they are normalized by Unicode. In particular, it is time to get rid of restrictions in operator naming that stem from the limited availability of special characters 30 years ago, when all of this was invented.

Let us first review the different levels of support that the C language provides nowadays. I really mean the core language, not library interfaces such as locale. Neglecting some subtleties such as endianness, this gives:

  1. Some punctuator characters that might not be present on a keyboard can be represented by trigraphs. For example, the sequence ??/ can be used for a \ character. This replacement is done regardless of the lexical context.
  2. Unicode code points can be entered as \uXXXX or \UXXXXXXXX, where the X are the hexadecimal digits of the short or long form of the code point. For example, the code \u03C0 corresponds to the Greek letter π. This replacement is done regardless of the lexical context.
  3. Some operators have another spelling as digraphs. For example, the token <: is equivalent to the token [. This replacement is done on a syntax level, after tokenization, so it only occurs where the corresponding operators may occur. It is not performed inside strings.
  4. Wide characters and strings can be used to deal with special characters internally in a C program. For example, the wide character L'\u03C0' should result in a single entity of type wchar_t containing the character π expressed in the encoding of the execution environment.
  5. Prefixes u and U introduce UTF-16 (or UCS-2) and UTF-32 characters and strings. They can thereby be used to ensure a specific encoding of characters. For example, u'π' should have the character π represented in the UTF-16 encoding, that is the value 0x03C0.
  6. UTF-8 encoding of strings can be enforced with the special prefix u8. For example, u8"π" is an array of three char with contents 0xCF 0x80 0x00.
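To make points 2., 4., 5. and 6. concrete, here is a minimal sketch that spells π as \u03C0, so that it does not depend on the source encoding. The names pi_wide, pi_utf16, pi_utf32, pi_utf8 and print_pi_encodings are made up for this illustration, and the block assumes a compiler in C11 mode or later.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. \u03C0 is GREEK SMALL LETTER PI. */
static const wchar_t       pi_wide   = L'\u03C0';  /* execution wide encoding, implementation-defined */
static const unsigned      pi_utf16  = u'\u03C0';  /* one UTF-16 code unit */
static const unsigned long pi_utf32  = U'\u03C0';  /* one UTF-32 code unit */
static const char          pi_utf8[] = u8"\u03C0"; /* two UTF-8 bytes plus the terminating 0 */

/* Print the code units of each representation in hexadecimal. */
static void print_pi_encodings(void) {
    printf("wchar_t: 0x%lX\n", (unsigned long)pi_wide);
    printf("UTF-16 : 0x%04X\n", pi_utf16);
    printf("UTF-32 : 0x%08lX\n", pi_utf32);
    printf("UTF-8  :");
    for (size_t i = 0; i < sizeof pi_utf8; ++i)
        printf(" 0x%02X", (unsigned)(unsigned char)pi_utf8[i]);
    printf("\n");
}
```

Calling print_pi_encodings() from main shows the UTF-16, UTF-32 and UTF-8 code units; only the wchar_t line may differ between platforms, because that encoding is left to the implementation.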

Points 1. to 4. are independent of any encoding  in which we may write our source file. Using 4. ensures the transition from the source encoding to the execution environment, which may in fact be different.

Points 5. and 6. were added by C11 to ensure interoperability, such that we know which encoding is used for a particular character or string. Note that this implies in particular that a conforming compiler implements conversions between different encodings:

  • the source encoding
  • the execution environment encoding
  • UTF-16 (or UCS-2), UTF-32 and UTF-8.

For example, even if your platform has some weird historic source encoding, as long as it includes the character ö, the strings u"söng", U"söng" and u8"söng" have well-defined contents that are independent of the execution environment.

Now let us have a look at the user side. Here is my wish list for a comfortable C coding experience:

  1. Arbitrary code points in strings and comments.
  2. Arbitrary code points in identifiers, within the allowed range demanded by the C standard.
  3. Source (native) encoding of operator characters. This should be provided for operators where C still has multi-character tokens, e.g. ≤ for <=, ≥ for >=, ∧ for &&, or where a character is used with a non-standard meaning, e.g. × instead of * for multiplication.
  4. Wide character encoding support for sources.
  5. UTF-8 source code support.

Sometimes 1. is “implemented by negligence” because UTF-8 and similar encodings can just be used as such and so the compiler just takes the bytes and stuffs them into strings. Subtle bugs may be triggered by this, such as character arrays being longer than expected or UTF-8 sequences cut in half, in particular if the library support for these things is incomplete.
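The "longer than expected" effect is easy to reproduce. The following is a sketch only, assuming valid UTF-8 input; utf8_codepoints is a hypothetical helper, not a standard library function. It counts code points by skipping continuation bytes, which have the bit pattern 10xxxxxx.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of code points in a valid UTF-8 string.
   Every byte that is not a continuation byte (10xxxxxx) starts a new code point. */
static size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p)
        if ((*p & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the bytes of "söng" in UTF-8, written portably as "s\xC3\xB6ng", strlen reports 5 while utf8_codepoints reports 4. Truncating such a string after 2 bytes would cut the ö in half.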

Generally, points 1. and 2. are guaranteed by the long list above, but in a very weird form, namely by using \uXXXX encoding all over the place. If you have to write C code with strings in a human language other than English, this is effectively not usable. Maintaining strings such as L"\u0395\u1f50\u03ba\u03bb\u03b5\u03af\u03b4\u03b7\u03c2" for Εὐκλείδης must be a nightmare.

Point 2. has the same disadvantages, but becomes even weirder when seen alone. Even if the \uXXXX encoding allows us to use the code point for π in an identifier, such a specification serves no real purpose if we have to write \u03C0.

In my personal view, supporting real Unicode identifiers in C code is important. First of all, who are we westerners to impose the usage of the overly restricted Latin alphabet on the rest of the world? Aren’t we a bit arrogant? Then, my personal usage for these things is not for German or French, but for mathematics. Mathematics often uses Greek letters, some special characters etc. to express complicated facts, and I want to use these in my daily writing.

C code should map as closely as possible to different national or professional cultures.   Everything else is simply not acceptable in the 21st century.

Unfortunately, many compilers are stuck and don’t go beyond point 2. This is so even though, as we have seen above, all the components to deal with different encodings must be present in a C11 compiler. Providers of such compilers really should get their objectives straight and think just a tiny little bit in terms of usability and comfort for their clients. As we will see below, implementing 3., 4. and 5. is a piece of cake. Basically something that is doable by an intern.

Since 5. implies 4., let us concentrate on 5., that is, UTF-8 source code encoding. Other encodings should be similarly easy. The “only” thing that is needed to implement UTF-8 source encoding for a conforming C11 compiler is

  • tools-dump8, a tool to transform UTF-8 multibyte characters into \uXXXX or \UXXXXXXXX notation.

Modular C offers a simple tool that does just this. It is a stand-alone utility programmed in about 100 lines of C code. You can simply integrate it in your tool-chain as a source-to-source pre-compiler. It is fast and easy, so there is no excuse not to integrate such a tool in your compiler tool-chain.
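To give an idea of how little is involved, here is a sketch of the core transformation. This is not the Modular C tool itself; utf8_to_ucn is a hypothetical function written for this illustration. It assumes well-formed input apart from the checks shown, does not reject overlong forms or surrogates, and ignores the rule that C does not allow universal character names for most code points below 00A0.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: copy `in` to `out`, rewriting every non-ASCII UTF-8
   sequence as \uXXXX or \UXXXXXXXX. Returns 0 on success, -1 if the input
   is malformed or `out` (capacity `cap`) is too small. */
static int utf8_to_ucn(const char *in, char *out, size_t cap) {
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;
    if (cap == 0) return -1;
    while (*p) {
        unsigned long cp;
        int extra;                       /* number of continuation bytes expected */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1;                  /* stray continuation or invalid lead byte */
        ++p;
        for (int i = 0; i < extra; ++i, ++p) {
            if ((*p & 0xC0) != 0x80) return -1;   /* truncated sequence */
            cp = (cp << 6) | (*p & 0x3F);
        }
        int n;
        if (cp < 0x80)         n = snprintf(out + used, cap - used, "%c", (int)cp);
        else if (cp <= 0xFFFF) n = snprintf(out + used, cap - used, "\\u%04lX", cp);
        else                   n = snprintf(out + used, cap - used, "\\U%08lX", cp);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    out[used] = '\0';
    return 0;
}
```

Applied to the UTF-8 bytes of söng, this yields s\u00F6ng, which any conforming C11 compiler has to accept inside a string literal.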

So now that we have settled 4. and 5., let’s have a look at 3. C abuses a lot of weird characters and character combinations for operators that have clear descriptions as code points in Unicode. These code points have satisfactory support on all platforms, in text processing, fonts etc. Being stuck in the character restrictions of ancient machines that have been out of service for 30 years is not acceptable, either. Let’s put up a list:

  • \u00AC, glyph ¬, operator !
  • \u00D7, glyph ×, operator *, binary only
  • \u00F7, glyph ÷, operator /, binary only
  • \u2026, glyph …, operator ...
  • \u2192, glyph →, operator ->
  • \u2227, glyph ∧, operator &&
  • \u2228, glyph ∨, operator ||
  • \u2229, glyph ∩, operator &, binary only
  • \u222A, glyph ∪, operator |, binary only
  • \u2254, glyph ≔, operator =, assignment
  • \u2A74, glyph ⩴, operator =, initialization
  • \u2260, glyph ≠, operator !=
  • \u2261, glyph ≡, operator ==
  • \u2264, glyph ≤, operator <=
  • \u2265, glyph ≥, operator >=
  • \u2AA1, glyph ⪡, operator <<
  • \u2AA2, glyph ⪢, operator >>

Some points could probably be discussed in detail, but I hope the idea is clear. There is a well-established character for “less or equal”, namely ≤, so let’s simply use it.

Again, implementing this in a compiler tool-chain is quite easy. Modular C does this with a multi-stage process (hiding characters and strings, decoding, unhiding of characters and strings), but you could certainly come up with an appropriate way to do this directly in your tool.
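As a rough sketch of the direct approach, the following rewrites a few of the operators from the list above into their traditional spellings while leaving string and character literals alone. This is not how Modular C does it; op_map, ops and replace_ops are hypothetical names, the input is assumed to be UTF-8, only part of the list is covered, and comments are not treated specially.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table: UTF-8 bytes of an operator glyph and its ASCII spelling. */
struct op_map { const char *utf8; const char *ascii; };

static const struct op_map ops[] = {
    { "\xC2\xAC",     "!"  },   /* U+00AC ¬ */
    { "\xE2\x86\x92", "->" },   /* U+2192 → */
    { "\xE2\x88\xA7", "&&" },   /* U+2227 ∧ */
    { "\xE2\x88\xA8", "||" },   /* U+2228 ∨ */
    { "\xE2\x89\xA0", "!=" },   /* U+2260 ≠ */
    { "\xE2\x89\xA4", "<=" },   /* U+2264 ≤ */
    { "\xE2\x89\xA5", ">=" },   /* U+2265 ≥ */
};

/* Copy src to dst, replacing mapped glyphs outside of string and character
   literals. Returns 0 on success, -1 if dst (capacity cap) is too small. */
static int replace_ops(const char *src, char *dst, size_t cap) {
    size_t used = 0;
    char quote = 0;                  /* 0 outside literals, else '"' or '\'' */
    while (*src) {
        const char *piece = src;     /* what to copy for this step */
        size_t in_len = 1, out_len = 1;
        if (quote) {
            if (*src == '\\' && src[1]) in_len = out_len = 2;  /* keep escapes intact */
            else if (*src == quote) quote = 0;                 /* literal ends */
        } else if (*src == '"' || *src == '\'') {
            quote = *src;                                      /* literal starts */
        } else {
            for (size_t i = 0; i < sizeof ops / sizeof ops[0]; ++i) {
                size_t n = strlen(ops[i].utf8);
                if (strncmp(src, ops[i].utf8, n) == 0) {
                    piece = ops[i].ascii;
                    in_len = n;
                    out_len = strlen(piece);
                    break;
                }
            }
        }
        if (used + out_len >= cap) return -1;   /* keep room for the terminator */
        memcpy(dst + used, piece, out_len);
        used += out_len;
        src += in_len;
    }
    if (used >= cap) return -1;
    dst[used] = '\0';
    return 0;
}
```

With this, the line if (a ≤ b ∧ p→x ≠ 0) comes out as if (a <= b && p->x != 0), whereas a ≤ inside a string literal is left untouched.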

Integrating these operators in the standard would not be very difficult, either. They can just be added as “punctuators” (6.4.6). Then the replaced tokens could be added to p3 of that section as “digraphs”. E.g. we could simply add || to the list of digraphs and say that its meaning is ∨.


3 Comments »

  1. One thing that strikes me as odd about the C11 specification for UTF-8 string literals is that their type is specified as char, but that an implementation is free to treat char as either signed char or unsigned char by default. This makes bitwise operations awkward. String literals can’t be modified of course, but work on UTF-8 string literals requires a copy to unsigned char so that they can be portably operated on with bitwise operators. This isn’t really obvious, since the standard says that UTF-8 string literals can be represented by char, when char can be (and popularly is) signed by default. C2X, in my opinion, should fix UTF-8 literals to be represented as unsigned char.

    In my personal view, supporting real Unicode identifiers in C code is important.

    I wholeheartedly agree with you. Plan 9 has done this for decades; it’s time for ISO to follow. Especially for UTF-8 identifiers.

    However, I might disagree about requiring UTF-8 for various operators for a few reasons:

    First of all, various fonts have ligature support, so it’s entirely possible for a font to render <= as ≤ on many modern systems; this could be turned into entirely a rendering problem. Secondly, C supports digraphs and trigraphs, with C99 even introducing new ones. This (scarily) indicates that currently-used sources probably still rely on input methods / operating systems / compilers that probably couldn’t possibly support UTF-8. Input methods for UTF-8 sequences are annoying; it’s still easier for me to type | on my keyboard than it is for me to memorize and type some sequence that results in (arguably is more correct for this, which leaves an operator for || missing).

    I think we’d need at least one standard to introduce them, and also to sunset the ??! and ??- trigraphs.

    Finally, there’s an argument to be made that ≡ should imply type-equal, which C doesn’t have. There are multiple arrows other than , why not ? I’m not convinced that ... should be represented by , but mostly for rendering reasons that defeat my ligature argument. (On that note, I’d like to see case ranges in C2X.)

    But I definitely like the idea. This is probably the closest I’ll get to contributing to the standards body, so take it for what you will. Would love to hear your thoughts on these comments!

    Comment by dhobsd — April 19, 2017 @ 04:44

    • One thing that strikes me as odd about the C11 specification for UTF-8 string literals is that their type is specified as char, …

      Yes, this is certainly one of the oddities of all this inherited mixup between character types (for strings etc.) and plain bytes. I think the reason that the standard does it is that this makes UTF-8 string literals compatible with other strings and with all the functions from stdio.h. ‘u8’ was invented just such that you usually don’t have to manipulate individual bytes, so the idea is that you’d have to do that rarely.

      But then, I also think that your assertion is wrong that a UTF-8 string has to be copied such that its bytes can be manipulated correctly as unsigned char. Actually, any object type can be viewed through a pointer to unsigned char, and its bytes can be manipulated through it.

      However, I might disagree about requiring UTF-8 for various operators for a few reasons:

      I really didn’t say this. I said Unicode, and for a good reason. For 3. the particular encoding for these code points should not be a concern. Whatever source encoding is used on a particular platform, they should be able to have code points for these operators nowadays. AFAICS, the only non-ASCII encoding that is still in use is EBCDIC, and depending on the code page they seem to have partial support for these code points.

      I also don’t agree that this is a font rendering problem; C sources should be precise and unambiguous. I merely see this as a “keyboard” problem. With my editor there is no problem at all to enter → or α, this is just a question of defining proper shortcuts.

      For the choice of the particular code points, as I said this is certainly still to be discussed. My idea is to stick as close as possible to the semantic definitions that the Unicode code points have. The only “real” inventions in that list are ⪡ and ⪢, because there isn’t any code point that represents a shift operation. So I chose some that come close graphically and that are not opening or closing parentheses. For the others that you mention, the → character is “RIGHTWARDS ARROW”. I think this is the correct translation of -> into Unicode. “…” is “HORIZONTAL ELLIPSIS” and I can’t think of any other choice for this.

      Comment by Jens Gustedt — April 19, 2017 @ 06:22

      • But then, I also think that your assertion is wrong that a UTF-8 string has to be copied such that its bytes can be manipulated correctly as unsigned char. Actually, any object type can be viewed through a pointer to unsigned char, and its bytes can be manipulated through it.

        Well, I didn’t say this, I said “string literal”, so you have to copy if you want to modify. Anyway, I totally buy the argument that one only has to do it rarely. And I hadn’t really considered making it simple to send into e.g. stdio.h functions, which makes sense. Thanks for this. I’ve only tended to use unsigned char * for working on UTF-8 strings, so I was a little surprised to find this out.

        I merely see this as a “keyboard” problem. With my editor there is no problem at all to enter → or α, this just a question of defining proper shortcuts.

        Your other points are well-taken. I’m not fully convinced this is easy to do in all environments in which people actually write C, but maybe it is. There’s also the education factor; for example, ∩ is no longer ‘bitwise and’, it’s ‘intersection’. So there’s a re-terminology that happens which could end up being rather confusing for some folks, especially neophytes. (Though I quite like using or × for multiplication as reduces burden on a severely overloaded operator, which will probably result in less confusion amongst neophytes.)

        I also missed the part reading this last night where you mentioned moving the “replaced” operators over to the digraph section. Not sure how, but I did. Anyway, I think my (long-winded) point is that I agree with you that this is a good idea 🙂

        Comment by dhobsd — April 19, 2017 @ 15:11

