TL;DR I wanted to retrofit simplistic sum types into the C language. I implemented a proof-of-concept based on pycparser. The proof-of-concept works by preprocessing custom syntax into plain C if additional type-checking passes.
Sum types are known by many names. One of them is “tagged union”. I’m sure there are subtle differences with different usages1 so let me first make it clear what I mean in the context of this writing: Tagged unions with compiler support for enabling exhaustive pattern matching. Powerful destructuring with pattern matching often goes together with sum types, but I will keep it simpler.
My goal was to hack together a simplistic proof-of-concept as a transformer from enhanced C to plain C. What I’m presenting here is not a robust generic solution. It has just enough functionality to handle the toy examples I’m presenting.
As a motivating example, we’ll first take a look at how unions work in C, how one could make them tagged, and how these compare against the type system of OCaml. After that we’ll take a closer look at the proof-of-concept implementation with some rationale. Lastly, I’ll present some contrasting examples handled by the modified pycparser.
Specifying and using unions in C looks like this:
#include <stdio.h>
union numeric {
int i;
float f;
};
int main(void) {
union numeric v;
v.i = 123;
printf("the value is an int: %d\n", v.i);
v.f = 3.1415;
printf("the value is a float: %.4f\n", v.f);
}
which prints
the value is an int: 123
the value is a float: 3.1415
The C99 standard2 has many things to say about unions
and I would condense it like this: Unless you know what you’re doing,
you should use the union value which you last set, because otherwise you
may be surprised by unspecified values. With the example above, if you
set the integer
value and decide to read the
float
value, the standard says you may get an “unspecified”
value back. There’s even a footnote which says you may get back a “trap
representation”. All in all, cautiousness is warranted.3
We can derive an example from the union shown above:
#include <stdio.h>
#include <stdlib.h>
enum numeric_kind {
NUMERIC_INT,
NUMERIC_FLOAT,
};
struct numeric {
enum numeric_kind kind;
union {
int i;
float f;
} data;
};
void print_numeric(const struct numeric *num) {
switch (num->kind) {
case NUMERIC_INT:
printf("the value is an int: %d\n", num->data.i);
break;
case NUMERIC_FLOAT:
printf("the value is a float: %.4f\n", num->data.f);
break;
}
}
int main(void) {
struct numeric v, v2;
v.kind = NUMERIC_INT;
v.data.i = 123;
v2.kind = NUMERIC_FLOAT;
v2.data.f = 3.1415;
print_numeric(&v);
print_numeric(&v2);
}
which again prints
the value is an int: 123
the value is a float: 3.1415
Now that we have kind
boxed along with the actual union
value, we can look up the last set value. But what if we have these
kinds of decisions sprinkled around a larger code base and then we
decide to modify enum numeric_kind
? Ideally every
use is properly updated by the developer. As a mitigation, we can use
loud default
cases like this:
switch (kind) {
...
default:
printf("no idea what this kind is: %d\n", kind);
abort();
}
What we saw above is a practical way to implement the type-dependent branching required for matching types. We do need something more to become type-aware of what we’re actually matching.
Above we specified a union called numeric
which could
hold an int
or float
value. We also gave it a
tag to indicate which type is currently contained in the nested union,
so we could use a switch
to do something based on the
kind.
Compared to most mainstream languages, OCaml has had a very rich type system for a long time which is made even more impressive by type inference. Naturally we can easily implement something similar to the tagged union example above with OCaml’s data types4:
type numeric = Int of int | Float of float
let string_of_numeric n = match n with
| Int i -> "the value is an int: " ^ (string_of_int i)
| Float f -> "the value is a float: " ^ (string_of_float f)
let () =
let (v, v2) = (Int 123, Float 3.1415) in
print_endline (string_of_numeric v);
print_endline (string_of_numeric v2)
which prints
the value is an int: 123
the value is a float: 3.1415
What’s the difference between how OCaml handles
type numeric
and the C compiler sees
struct numeric
? Many things, but in this specific case,
let’s say we add a new kind of numeric variant called Str
and run the otherwise identical code:
type numeric = Int of int | Float of float | Str of string
The compiler says this:
File "./numeric2.ml", lines 3-5, characters 26-61:
3 | ..........................match n with
4 | | Int i -> "the value is an int: " ^ (string_of_int i)
5 | | Float f -> "the value is a float: " ^ (string_of_float f)
Warning 8 [partial-match]: this pattern-matching is not exhaustive.
Here is an example of a case that is not matched:
Str _
the value is an int: 123
the value is a float: 3.1415
Wouldn’t it be nice to have something like this in the C language, too?
I needed to parse new syntax, type-check the sum type usage and generate suitable C code. The code-generation would be based on the tagged union approach we saw above.
pycparser is a pure-Python C99 parser. It has no external dependencies, although it does make good use of bundled PLY.
The important modifications are in
_c_ast.cfg
for new AST
typesc_generator.py
for handling new type-checking and
generating the plain C code for custom typesc_lexer.py
for a few new grammar itemsc_parser.py
for handling the custom syntaxexamples/c_files/data*.c
for the test codeFor the whole diff, see here.
As the name implies, pycparser is a parser implementation, so it mostly doesn’t have to worry about the type details of the translation units. As proper parsing of C requires awareness of typedefs, I had to handle the new types similarly.
Instead of adding a new stage to pycparser to handle our custom
types, I modified c_generator.py
to construct the necessary
type-awareness and perform additional type-checking required by our new
match
statement. For this proof-of-concept, it doesn’t
matter much, but a separate stage for AST transformations would be
cleaner and easier to extend.
The ability to declare self-referential or recursive types is a typical feature of ML languages. An example of a linked list of integers in OCaml could be like this:
type list_of_ints = Nil | IntItem of (int * list_of_ints)
C supports self-referential structs with pointers:
struct st {
int kind;
union {
int i;
float f;
struct st *s;
} data;
};
but the C way does not permit self-referenced structs with plain values. This can be contrasted with languages which use some kind of a boxed representation for data structures, that is, essentially pointers.
So, to use our “tagged union” model in a self-referential manner, we would have to use pointers, which would have meant one or more of the following:
This was a side-quest I didn’t want to solve, so I simplified the problem: no self-referential types. In any case, since I’m solving a toy problem, leaking memory would’ve been just fine…
I didn’t spend that much thought on coming up with a solution that’s really in harmony with the C syntax. I was mainly interested in making a thing which sort of works, and I prioritized making the new syntax easy to parse. I needed four language constructs to have means to
As may be seen below, I used string comparison for the union tags. This is to make things more explicit.
I aped the OCaml syntax:
data SimpleType int | float | struct st;
When generating plain C code, the above is transformed into
struct __data_SimpleType {
const char *kind;
union {
int d_int;
float d_float;
struct st d_s_st;
} data;
};
This I hacked as a typedef-like addition to the parser:
SimpleType one, two, three, four;
which is transformed into
struct __data_SimpleType one;
struct __data_SimpleType two;
struct __data_SimpleType three;
struct __data_SimpleType four;
Here I’m feeling slight pain because I would’ve wanted to use regular
assignment operations. struct
values are set as compound
literals, so it’s necessary to include the explicit type cast. This is
mainly an artifact of my dirty parser modifications.
dataset one <- int 10;
dataset two <- float 123.2 + 1;
dataset three <- struct st (struct st){ .val = 256, .str = "Hello World!" };
With slight editing for clarity, the above is transformed into
one.kind = "int";
one.data.d_int = 10;
two.kind = "float";
two.data.d_float = 123.2 + 1;
three.kind = "struct st";
three.data.d_s_st = (struct st){.val = 256, .str = "Hello World!"};
This is again pretty much aped from OCaml. I did not implement nested destructuring which would be extremely nice to have for actual use.
void match_printer(SimpleType val) {
match val
int i {
printf("the value is an int: %d\n", i);
}
| float f {
printf("the value is a float: %f\n", f);
}
| struct st s {
printf("the value is a struct st: %d, %s\n", s.val, s.str);
};
}
Finally, and again slightly edited for clarity, the above is transformed6 into
void match_printer(struct __data_SimpleType val)
{
if (!strcmp(val.kind, "int")) {
int i = val.data.d_int;
{
printf("the value is an int: %d\n", i);
}
} else if (!strcmp(val.kind, "float")) {
float f = val.data.d_float;
{
printf("the value is a float: %f\n", f);
}
} else if (!strcmp(val.kind, "struct st")) {
struct st s = val.data.d_s_st;
{
printf("the value is a struct st: %d, %s\n", s.val, s.str);
}
}
}
We should be getting several errors here:
data Numeric int | float | unsigned | long
int main(void) {
Numeric nt;
dataset nt <- float 3.1415;
match nt
int i {
printf("val is an int: %d\n", i);
};
}
Indeed, this is the case:
$ python examples/c-to-c.py examples/c_files/data3.c
Missing matcher for float
Missing matcher for unsigned
Missing matcher for long
Traceback (most recent call last):
...
RuntimeError: non-exhaustive `match`: missing 3 match cases
Sometimes we will selectively want to handle only some specific types and forget the rest. I added the ellipsis matcher to permit this:
data Numeric int | float | unsigned | long;
int main(void) {
Numeric nt;
dataset nt <- float 3.1415;
match nt
int i { printf("val is an int: %d\n", i); }
| ... { puts("it's who knows"); };
}
The generated plain C can be seen here. After compiling and running, we get this:
$ {
echo '#include <stdio.h>';
echo '#include <string.h>';
python examples/c-to-c.py examples/c_files/data2.c;
} > test2.c && gcc test2.c && ./a.out
it's who knows
This was the benchmark I wanted to compile successfully:
struct st {
int val;
const char *str;
};
data SimpleType int | float | struct st;
void match_printer_ptr(SimpleType *ptr) {
match *ptr
float f {
printf("the *ptr is a float: %f\n", f);
dataset *ptr <- float f * 2;
}
| int i { printf("the *ptr is an int: %d\n", i); }
| struct st s { printf("the *ptr is a struct st: %d, %s\n", s.val, s.str); }
| ... { puts("ptr is something else!\n"); };
}
void match_printer(SimpleType val) {
match val
int i { printf("the value is an int: %d\n", i); }
| float f { printf("the value is a float: %f\n", f); }
| struct st s { printf("the value is a struct st: %d, %s\n", s.val, s.str); };
}
int main(void) {
SimpleType one, two, three, four;
dataset one <- int 10;
dataset two <- float 123.2 + 1;
dataset three <- struct st (struct st){ .val = 256, .str = "Hello World!" };
match_printer_ptr(&one);
match_printer_ptr(&two);
match_printer_ptr(&two);
match_printer(three);
match_printer(four);
}
The generated plain C can be seen here. When compiled and run on my current machine, I get the following output:
$ {
echo '#include <stdio.h>';
echo '#include <string.h>';
python examples/c-to-c.py examples/c_files/data.c;
} > test.c && gcc test.c && ./a.out
the *ptr is an int: 10
the *ptr is a float: 124.199997
the *ptr is a float: 248.399994
the value is a struct st: 256, Hello World!
zsh: segmentation fault ./a.out
Well, you can’t have everything!
This was again one of those “How hard can it be?” type of ideas, which had been bothering me for a while. If we’d like to tie this into the bigger picture, I do think it makes much sense to look at C and think about its limitations. Also, after again refreshing my memory about OCaml, I would really like to use it more.
The fork is available here.
I’m not well-versed enough in type theory to discuss the theoretical basis much further, so I’ll be sticking to practical matters. Those interested in the more details can check out, for example, the article on algebraic data types.↩︎
We could also look at C89/90/11 and the relevant parts would be pretty similar for our practical purposes in this context. Anyhow, the final ISO draft for C99 can be found here: https://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf↩︎
There is a longer story here.↩︎
It could be some other ML or ML-inspired language, too. I’m not an OCaml expert either, but it’s the ML language I’ve used most.↩︎
I don’t mean real nested destructuring in this case.↩︎
I used if
and else
without
much thought. For a real system, switch
statements against
integer-typed tags would probably bring performance improvements as
simple switch
statements provide easier opportunities for
optimizing via lookup tables, for example.↩︎