Sum Types in C

⟵

TL;DR I wanted to retrofit simplistic sum types into the C language. I implemented a proof-of-concept based on pycparser. The proof-of-concept works by preprocessing custom syntax into plain C if additional type-checking passes.

Introduction

Sum types are known by many names. One of them is “tagged union”. I’m sure there are subtle differences with different usages¹ so let me first make it clear what I mean in the context of this writing: Tagged unions with compiler support for enabling exhaustive pattern matching. Powerful destructuring with pattern matching often goes together with sum types, but I will keep it simpler.

My goal was to hack together a simplistic proof-of-concept as a transformer from enhanced C to plain C. What I’m presenting here is not a robust generic solution. It has just enough functionality to handle the toy examples I’m presenting.

As a motivating example, we’ll first take a look at how unions work in C, how one could make them tagged, and how these compare against the type system of OCaml. After that we’ll take a closer look at the proof-of-concept implementation with some rationale. Lastly, I’ll present some contrasting examples handled by the modified pycparser.

Unions in C

Specifying and using unions in C looks like this:

#include <stdio.h>

union numeric {
    int i;
    float f;
};

int main(void) {
    union numeric v;

    v.i = 123;
    printf("the value is an int: %d\n", v.i);

    v.f = 3.1415;
    printf("the value is a float: %.4f\n", v.f);
}

which prints

the value is an int: 123
the value is a float: 3.1415

The C99 standard² has many things to say about unions and I would condense it like this: Unless you know what you’re doing, you should use the union value which you last set, because otherwise you may be surprised by unspecified values. With the example above, if you set the integer value and decide to read the float value, the standard says you may get an “unspecified” value back. There’s even a footnote which says you may get back a “trap representation”. All in all, cautiousness is warranted.³

What about tagged unions?

We can derive an example from the union shown above:

#include <stdio.h>
#include <stdlib.h>

enum numeric_kind {
  NUMERIC_INT,
  NUMERIC_FLOAT,
};

struct numeric {
  enum numeric_kind kind;
  union {
    int i;
    float f;
  } data;
};

void print_numeric(const struct numeric *num) {
  switch (num->kind) {
  case NUMERIC_INT:
    printf("the value is an int: %d\n", num->data.i);
    break;
  case NUMERIC_FLOAT:
    printf("the value is a float: %.4f\n", num->data.f);
    break;
  }
}

int main(void) {
  struct numeric v, v2;

  v.kind = NUMERIC_INT;
  v.data.i = 123;

  v2.kind = NUMERIC_FLOAT;
  v2.data.f = 3.1415;

  print_numeric(&v);
  print_numeric(&v2);
}

which again prints

the value is an int: 123
the value is a float: 3.1415

Now that we have kind boxed along with the actual union value, we can look up the last set value. But what if we have these kinds of decisions sprinkled around a larger code base and then we decide to modify enum numeric_kind? Ideally every use is properly updated by the developer. As a mitigation, we can use loud default cases like this:

  switch (kind) {
  ...
  default:
    printf("no idea what this kind is: %d\n", kind);
    abort();
  }

What we saw above is a practical way to implement the type-dependent branching required for matching types. We do need something more to become type-aware of what we’re actually matching.

Sum types in OCaml

Above we specified a union called numeric which could hold an int or float value. We also gave it a tag to indicate which type is currently contained in the nested union, so we could use a switch to do something based on the kind.

Compared to most mainstream languages, OCaml has had a very rich type system for a long time which is made even more impressive by type inference. Naturally we can easily implement something similar to the tagged union example above with OCaml’s data types⁴:

type numeric = Int of int | Float of float

let string_of_numeric n = match n with
  | Int i -> "the value is an int: " ^ (string_of_int i)
  | Float f -> "the value is a float: " ^ (string_of_float f)

let () =
  let (v, v2) = (Int 123, Float 3.1415) in
    print_endline (string_of_numeric v);
    print_endline (string_of_numeric v2)

which prints

the value is an int: 123
the value is a float: 3.1415

What’s the difference between how OCaml handles type numeric and the C compiler sees struct numeric? Many things, but in this specific case, let’s say we add a new kind of numeric variant called Str and run the otherwise identical code:

type numeric = Int of int | Float of float | Str of string

The compiler says this:

File "./numeric2.ml", lines 3-5, characters 26-61:
3 | ..........................match n with
4 |   | Int i -> "the value is an int: " ^ (string_of_int i)
5 |   | Float f -> "the value is a float: " ^ (string_of_float f)
Warning 8 [partial-match]: this pattern-matching is not exhaustive.
Here is an example of a case that is not matched:
Str _
the value is an int: 123
the value is a float: 3.1415

Wouldn’t it be nice to have something like this in the C language, too?

Defining and implementing the proof-of-concept

I needed to parse new syntax, type-check the sum type usage and generate suitable C code. The code-generation would be based on the tagged union approach we saw above.

Hacking pycparser

pycparser is a pure-Python C99 parser. It has no external dependencies, although it does make good use of bundled PLY.

The important modifications are in

_c_ast.cfg for new AST types
c_generator.py for handling new type-checking and generating the plain C code for custom types
c_lexer.py for a few new grammar items
c_parser.py for handling the custom syntax
examples/c_files/data*.c for the test code

For the whole diff, see here.

Type-checking sum types

As the name implies, pycparser is a parser implementation, so it mostly doesn’t have to worry about the type details of the translation units. As proper parsing of C requires awareness of typedefs, I had to handle the new types similarly.

Instead of adding a new stage to pycparser to handle our custom types, I modified c_generator.py to construct the necessary type-awareness and perform additional type-checking required by our new match statement. For this proof-of-concept, it doesn’t matter much, but a separate stage for AST transformations would be cleaner and easier to extend.

Self-referential types

The ability to declare self-referential or recursive types is a typical feature of ML languages. An example of a linked list of integers in OCaml could be like this:

type list_of_ints = Nil | IntItem of (int * list_of_ints)

C supports self-referential structs with pointers:

struct st {
  int kind;
  union {
    int i;
    float f;
    struct st *s;
  } data;
};

but the C way does not permit self-referenced structs with plain values. This can be contrasted with languages which use some kind of a boxed representation for data structures, that is, essentially pointers.

So, to use our “tagged union” model in a self-referential manner, we would have to use pointers, which would have meant one or more of the following:

leaking memory
manual (de)allocations
introducing some kind of garbage collection

This was a side-quest I didn’t want to solve, so I simplified the problem: no self-referential types. In any case, since I’m solving a toy problem, leaking memory would’ve been just fine…

Syntax

I didn’t spend that much thought on coming up with a solution that’s really in harmony with the C syntax. I was mainly interested in making a thing which sort of works, and I prioritized making the new syntax easy to parse. I needed four language constructs to have means to

define a new sum type
declare and define variables of some sum type
set a value of specific kind to some sum-typed variable
match the current value and destructure⁵ its contents

As may be seen below, I used string comparison for the union tags. This is to make things more explicit.

Defining new sum types

I aped the OCaml syntax:

data SimpleType int | float | struct st;

When generating plain C code, the above is transformed into

struct __data_SimpleType {
  const char *kind;
  union {
    int d_int;
    float d_float;
    struct st d_s_st;
  } data;
};

Declaring and defining variables of some sum type

This I hacked as a typedef-like addition to the parser:

SimpleType one, two, three, four;

which is transformed into

struct __data_SimpleType one;
struct __data_SimpleType two;
struct __data_SimpleType three;
struct __data_SimpleType four;

Setting a value of specific kind to a sum-typed variable

Here I’m feeling slight pain because I would’ve wanted to use regular assignment operations. struct values are set as compound literals, so it’s necessary to include the explicit type cast. This is mainly an artifact of my dirty parser modifications.

dataset one <- int 10;
dataset two <- float 123.2 + 1;
dataset three <- struct st (struct st){ .val = 256, .str = "Hello World!" };

With slight editing for clarity, the above is transformed into

one.kind = "int";
one.data.d_int = 10;

two.kind = "float";
two.data.d_float = 123.2 + 1;

three.kind = "struct st";
three.data.d_s_st = (struct st){.val = 256, .str = "Hello World!"};

Destructuring the contents

This is again pretty much aped from OCaml. I did not implement nested destructuring which would be extremely nice to have for actual use.

void match_printer(SimpleType val) {
  match val
  int i {
    printf("the value is an int: %d\n", i);
  }
  | float f {
    printf("the value is a float: %f\n", f);
  }
  | struct st s {
    printf("the value is a struct st: %d, %s\n", s.val, s.str);
  };
}

Finally, and again slightly edited for clarity, the above is transformed⁶ into

void match_printer(struct __data_SimpleType val)
{
  if (!strcmp(val.kind, "int")) {
    int i = val.data.d_int;
    {
      printf("the value is an int: %d\n", i);
    }
  } else if (!strcmp(val.kind, "float")) {
    float f = val.data.d_float;
    {
      printf("the value is a float: %f\n", f);
    }
  } else if (!strcmp(val.kind, "struct st")) {
    struct st s = val.data.d_s_st;
    {
      printf("the value is a struct st: %d, %s\n", s.val, s.str);
    }
  }
}

Results

Missing matchers

We should be getting several errors here:

data Numeric int | float | unsigned | long

int main(void) {
  Numeric nt;
  dataset nt <- float 3.1415;

  match nt
  int i {
    printf("val is an int: %d\n", i);
  };
}

Indeed, this is the case:

$ python examples/c-to-c.py examples/c_files/data3.c
Missing matcher for float
Missing matcher for unsigned
Missing matcher for long
Traceback (most recent call last):
...
RuntimeError: non-exhaustive `match`: missing 3 match cases

Catch all

Sometimes we will selectively want to handle only some specific types and forget the rest. I added the ellipsis matcher to permit this:

data Numeric int | float | unsigned | long;

int main(void) {
  Numeric nt;
  dataset nt <- float 3.1415;

  match nt
    int i { printf("val is an int: %d\n", i); }
  | ... { puts("it's who knows"); };
}

The generated plain C can be seen here. After compiling and running, we get this:

$ {
    echo '#include <stdio.h>';
    echo '#include <string.h>';
    python examples/c-to-c.py examples/c_files/data2.c;
  } > test2.c && gcc test2.c && ./a.out
it's who knows

The benchmark

This was the benchmark I wanted to compile successfully:

struct st {
  int val;
  const char *str;
};

data SimpleType int | float | struct st;

void match_printer_ptr(SimpleType *ptr) {
  match *ptr
  float f {
    printf("the *ptr is a float: %f\n", f);
    dataset *ptr <- float f * 2;
  }
  | int i { printf("the *ptr is an int: %d\n", i); }
  | struct st s { printf("the *ptr is a struct st: %d, %s\n", s.val, s.str); }
  | ... { puts("ptr is something else!\n"); };

}

void match_printer(SimpleType val) {
  match val
  int i { printf("the value is an int: %d\n", i); }
  | float f { printf("the value is a float: %f\n", f); }
  | struct st s { printf("the value is a struct st: %d, %s\n", s.val, s.str); };
}

int main(void) {
  SimpleType one, two, three, four;

  dataset one <- int 10;
  dataset two <- float 123.2 + 1;
  dataset three <- struct st (struct st){ .val = 256, .str = "Hello World!" };

  match_printer_ptr(&one);
  match_printer_ptr(&two);
  match_printer_ptr(&two);
  match_printer(three);
  match_printer(four);
}

The generated plain C can be seen here. When compiled and run on my current machine, I get the following output:

$ {
    echo '#include <stdio.h>';
    echo '#include <string.h>';
    python examples/c-to-c.py examples/c_files/data.c;
  } > test.c && gcc test.c && ./a.out
the *ptr is an int: 10
the *ptr is a float: 124.199997
the *ptr is a float: 248.399994
the value is a struct st: 256, Hello World!
zsh: segmentation fault  ./a.out

Well, you can’t have everything!

Afterthoughts

This was again one of those “How hard can it be?” type of ideas, which had been bothering me for a while. If we’d like to tie this into the bigger picture, I do think it makes much sense to look at C and think about its limitations. Also, after again refreshing my memory about OCaml, I would really like to use it more.

The fork is available here.

Changelog

2022/12/06: Initial version online
2022/12/07: Several clarifications & additions and shifting text around
2024/01/28: Change font size

I’m not well-versed enough in type theory to discuss the theoretical basis much further, so I’ll be sticking to practical matters. Those interested in the more details can check out, for example, the article on algebraic data types.↩︎
We could also look at C89/90/11 and the relevant parts would be pretty similar for our practical purposes in this context. Anyhow, the final ISO draft for C99 can be found here: https://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf ↩︎
There is a longer story here.↩︎
It could be some other ML or ML-inspired language, too. I’m not an OCaml expert either, but it’s the ML language I’ve used most.↩︎
I don’t mean real nested destructuring in this case.↩︎
I used if and else without much thought. For a real system, switch statements against integer-typed tags would probably bring performance improvements as simple switch statements provide easier opportunities for optimizing via lookup tables, for example.↩︎