The case for a modern language (part 1)

There is a difference of opinion among many programmers regarding the idea of replacing C, either in newly written code or altogether. I come down on the side that it makes little sense to try to replace a legacy codebase, and I still find C useful in some contexts (particularly in the realm of microcontrollers). That said, I think a strong case can be made for using one of the several more modern languages that have sprung up in the systems programming space, at least for newly written code. I make no secret that I love both Rust and Zig, for many of the same reasons. But I also find Nim impressive and have heard great things about Odin. There is, frankly, room for more than one.

I'm going to start this off with a small example: parsing a string into an integer. It's about as fundamental as tasks get, something many command line programs have to do just to read their input. Let's look at Rust first, for no particular reason.

// pretend that this was passed in on the command line
let my_number_string = String::from("42");
// If we just want to bubble up errors
let my_number: u8 = my_number_string.parse()?;
assert_eq!(my_number, 42);
// If we might like to panic!
let my_number: u8 = my_number_string.parse().unwrap();
assert_eq!(my_number, 42);
// If we're a good Rustacean and check for errors before trying to use the data
if let Ok(my_number) = my_number_string.parse::<u8>() {
    assert_eq!(my_number, 42);
}

There are lots of ways to handle the error case. None of them are particularly high friction, we can choose whichever fits or whichever we're comfortable with, and the error handling strategy is consistent across the entire standard library, so you can pick for yourself. The method can fail, but it can't fail silently. If we choose to use unwrap(), the program may well panic if given garbage data, but we're never silently using the wrong data. Cool. How about Zig, then?

const std = @import("std");

const myNumberString = "42";
// Bubble up the errors just like in Rust, third arg is the radix (10)
const myNumber = try std.fmt.parseInt(u8, myNumberString, 10);
// But we can do better
const myOtherNumber = std.fmt.parseInt(u8, myNumberString, 10) catch 42;
// Or even better
if (std.fmt.parseInt(u8, myNumberString, 10)) |val| {
    // do something with val
    std.debug.print("{d} is THE answer\n", .{val});
} else |err| {
    std.debug.print("Error getting THE answer: {s}\n", .{err});
}

Ok then, the syntax is really different, but we're using some very familiar concepts here. One might cite parallel evolution: the two languages have functionally equivalent constructs. Now how 'bout good old C?

#include <stdlib.h> // atoi

char *forty_two = "42";
int i = atoi(forty_two);

Are you cringing yet? You should be. You can pass literally anything to atoi and it will probably give you what you want if the input is valid. If the input isn't valid, tough, you have no way of knowing. Why does this function exist at all, you might ask? And well you might. It exists because it became part of the standard library way back when a PDP-7 was an advanced computer, and nobody has had the bollocks to say: go ahead and break the old code, because this just doesn't belong in there. There is of course a better way. Well, sort of better. You'll see.
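Before we get there, it's worth seeing just how silent the silent failure is. A quick sketch (the inputs are made up, the behavior is not):

#include <stdio.h>  // printf
#include <stdlib.h> // atoi

printf("%d\n", atoi("0"));    // prints 0, a perfectly legitimate parse
printf("%d\n", atoi("junk")); // also prints 0, with no error reported anywhere

A real zero and complete garbage produce identical output, and there's no errno, no return code, nothing to check. Now, the sort-of-better way: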

#include <stdlib.h> // strtol

char *forty_two = "42";
long i = strtol(forty_two, NULL, 10);

This is marginally better, but we never checked whether we got what we think we got. You'll see strtol used this way even in codebases that are otherwise considered good, because this is the lowest friction way to call the function, and we're all pretty lazy. But hey, it's better than atoi, right? Well, no, not when it's used like this. So how can we do better?

#include <stdio.h>  // perror
#include <stdlib.h> // strtol
#include <errno.h>  // errno

char *forty_two = "42";
errno = 0; // Initialize errno in case we tripped an error previously
long i = strtol(forty_two, NULL, 10);
if (errno != 0)
    perror("Error parsing integer from string: ");

Ok, so we're good now, right?

Right?

No. We're nowhere near good yet. Checking errno does catch some error conditions, but it turns out the only ones it catches are underflow and overflow. We could have passed in "42b" instead of just "42" and gotten the exact same output as before, without errno ever being set.
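For contrast, here's the kind of failure errno does catch. A quick sketch (the oversized literal is arbitrary, anything outside the range of long will do):

#include <stdio.h>  // perror
#include <stdlib.h> // strtol
#include <errno.h>  // errno, ERANGE

char *too_big = "99999999999999999999";
errno = 0;
long i = strtol(too_big, NULL, 10); // returns LONG_MAX and sets errno to ERANGE
if (errno == ERANGE)
    perror("Error parsing integer from string"); // this one actually fires

Overflow and underflow get reported. Trailing garbage sails right through. Confused yet? Don't worry, you'll be more confused later. Let's try something else and really blow your mind.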

#include <stdio.h>  // printf, perror
#include <stdlib.h> // strtol
#include <errno.h>  // errno

char *one = "one";
errno = 0; // remember errno?

long i = strtol(one, NULL, 10);
if (errno != 0)
    perror("Error parsing integer from string");
else
    printf("%ld\n", i);

That code prints 0 to your terminal without any errors. So why is that, you wonder? And how do we fix it? Do we check for 0 and consider that a failure? If so, what if the string given to the function actually was "0"? Well, remember when I said that we're all pretty lazy? What's that NULL that we've been feeding the function as its second parameter?

Ah, now we're asking the right question. That's a pointer. Maybe. If we pass it something real, then we can use it.

#include <stdio.h>  // perror, fprintf
#include <stdlib.h> // strtol
#include <errno.h>  // errno

char *one = "one";
char *end;
errno = 0; // remember errno?
long i = strtol(one, &end, 10);
if (errno != 0) {
    perror("Error parsing integer from string");
} else if (i == 0 && end == one) {
    fprintf(stderr, "Error: invalid input: %s\n", one);
}

If you look at the function signature for strtol, that second parameter is the endptr: strtol sets it to point at the first character it could not convert. It advances past every valid character, so if we pass in the string "one" as above, strtol still returns 0, but by checking whether the pointer has moved we can see that it hasn't, and therefore that we gave it garbage input. So we're good, right?

We're good, right?

Bueller...

Bueller...

No. No we're not good yet. Because we could pass it something like this:

char *forty_two_bee = "42b";
char *end;
errno = 0; // remember errno?

long i = strtol(forty_two_bee, &end, 10);

This will return 42, will not trip errno (remember errno?), and the pointer will have moved forward by two bytes, so our logic from the previous example would not catch it. We also have to check what's sitting at the pointer. Because strings in C are of type char *, what we have is a null-terminated array of bytes. So our final check is that the character at the end pointer is the null terminator.

Note: When this post went live there was, somewhat ironically, a logical error in this final iteration. See further down for the explanation.

#include <stdio.h>    // perror, fprintf
#include <stdlib.h>   // strtol
#include <errno.h>    // errno
#include <heathers.h> // f__kMeGently

char *one = "one";
char *end;
errno = 0; // remember errno?
long i = strtol(one, &end, 10);
if (errno != 0) {
    perror("Error parsing integer from string: ");
} else if (i == 0 && end == one) {
    fprintf(stderr, "Error: invalid input: %s\n", one);
} else if (*end != '\0') {
    f__kMeGently(with_a_chainsaw);
}

And that, right there, is the bare minimum you have to do in order to use strtol safely. This is, and I can't stress this enough, a horrible interface. Is there anything out there that's better? Well, if you're on BSD, then the OpenBSD folks had your back in 2004 with the release of OpenBSD 3.6, which came with the shiny new strtonum function: a wrapper around the awful interface described above that is much more usable and, dare I say it, sane? But alas, as with many things BSD, even when they do something that is obviously better, it just never really gets adopted on Linux. And that company from Redmond? They've been happily overflowing buffers with C++ for some decades now, and are too busy patching and force-rebooting to have noticed something that happened in a hobbyist operating system 18 years ago. It's not in the POSIX standard. It made it into the other BSDs, and into libbsd, but unless you want to make your C code non-portable, best not to use it. Also, did I mention that it's just a wrapper around that crap function strtoll to begin with? Have I mentioned that BSD libc is in every way superior to glibc? But I digress...
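For the record, here's roughly what strtonum looks like in use. A sketch, assuming OpenBSD, or Linux with libbsd (where the header is <bsd/stdlib.h> instead):

#include <stdio.h>  // fprintf
#include <stdlib.h> // strtonum (on OpenBSD)

const char *errstr;
long long i = strtonum("42", 0, 100, &errstr);
if (errstr != NULL) // set to "invalid", "too small", or "too large" on failure
    fprintf(stderr, "Error parsing integer: input is %s\n", errstr);

One call, mandatory range bounds, and a human-readable error string. That's the entire interface.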

Is this an outlier?

No. This is the sort of thing that C programmers have been living with for decades, and there are plenty of other examples. I don't dislike C, per se, so much as I dislike the cruft that's accumulated around it, and the fact that libc looks and feels very much like it was hacked together by a bunch of random dudes over the course of a few decades rather than designed. Which is, of course, exactly what happened. I have some other complaints about C, revolving around the lack of any kind of official error handling strategy, the issue of null in general, etc., but what it really comes down to is this: if we keep the language, we should at the very least modernize libc and standardize the rest of the tooling around it.
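To make that concrete, here's a sketch of the sort of interface I have in mind. The name and shape are entirely hypothetical; it's just the strtol dance from above wrapped up so the caller can't skip any of the checks:

#include <stdlib.h>  // strtol
#include <errno.h>   // errno, ERANGE
#include <stdbool.h> // bool

// Hypothetical: writes to *out and returns true only on a complete, in-range parse
bool parse_long(const char *s, long *out) {
    char *end;
    errno = 0;
    long val = strtol(s, &end, 10);
    if (errno == ERANGE || end == s || *end != '\0')
        return false; // out of range, nothing converted, or trailing garbage
    *out = val;
    return true;
}

Then the call site is just: long n; if (parse_long("42", &n)) { ... }. One Boolean covers all three failure modes we just walked through, and there's no way to get at the value without going through it.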

This is turning into something that I don't think should be just one post. So it's probably time to call it for tonight, mark this as part one, and pick things back up around the concept of language tooling in another post.

Bonus - don't get cocky

When this post originally went live, I didn't expect it to get much attention. Indeed, this blog wasn't even being indexed by any search engines yet. It sparked a bit of discussion on Mastodon, and it was apparently shared to Reddit (a first for me), where one user pointed out a couple of mistakes.

I noticed two things:

  • let my_number: u8 = &my_number_string.parse()?; - the dereference & is not needed iiuc

  • and a little amusingly: else if (i == 0 && *end != '\0') - the i == 0 check actually breaks this, because in the case we're trying to check here, some of the string was actually converted

Well, he was absolutely correct. In the case of my Rust code, the compiler would have told me immediately that this was wrong, so it's really beside the point, other than that it would have been a compile-time error rather than a runtime error, which is one of the many things to love about Rust.

In the case of the C code, having that i == 0 in the check means the garbage test would only run if we got 0 back, so it would never have tripped when we managed to parse part of the string into any number other than 0. I've fixed the code above in case someone comes by wanting to do a copypasta (not recommended anyway), but I wanted to acknowledge the issue.
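To spell the failure out, a quick sketch of the broken check against the very input that slips past it:

#include <stdlib.h> // strtol

char *forty_two_bee = "42b";
char *end;
long i = strtol(forty_two_bee, &end, 10); // i is 42, end points at the 'b'
// Broken: i is 42, not 0, so the trailing garbage is never even examined
if (i == 0 && *end != '\0') { /* never reached for "42b" */ }
// Fixed: check the end pointer unconditionally
if (*end != '\0') { /* correctly flags the 'b' */ }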

So what have we learned? Well, I never claimed to be an expert C programmer, but this feeds right back into my point: this interface is notoriously difficult to reason about and use correctly. Also, if you're ever feeling too full of yourself, just post some C code you've written on the internet.