Another Rust Parser

Parsing text input is a common programming task. If you intend to write any kind of software, you are eventually going to be writing some sort of low level parser, because you just can't always count on having a library ready made for the task. For today's post, I'm going to take some lessons learned in writing the gemtext parser for GemView, a few things you're likely to encounter a lot in C code, and write another parser in Rust.

The program and problem space

The program that's being worked on has a working name of shitbox and is, as you might guess from that moniker, a multi-call binary inspired by Busybox. The basic idea of a multi-call binary has the advantage of allowing quite a lot of code re-use. This is great from a Rust standpoint, as it partially mitigates one of Rust's known weaknesses - massive binary size.

When dealing with Unix shell utilities something that comes up quite a lot is permissions. The parser in question is going to handle parsing command line input such as you would get for the chmod utility, but it's going to also be used for mkdir, mknod, mkfifo and a number of others. Some of the supporting code will be a minimalist bitflags implementation and will see usage in a lot of other locations as well.

We can expect this input to come in one of two forms. We might be getting a string representing a number in Octal format, which is easy to deal with by just using i32::from_str_radix(<string>, 8). Things get trickier when you bump into the other possibility, symbolic representation, eg u+rwx,go=r.

Supporting bits - bitflags

Unix permissions are actually set using bitwise operations, allowing a single u32 number to represent a series of on/off boolean switches. You'll see this in C all of the time, but not as much in higher level languages. That's a shame, because it's an inherently efficient method of storing and processing state. Let's look at some psuedo-code to compare the two approaches.

// First a memory inefficient way
enum Flags {
    A,
    B,
    C,
}

struct State {
    flags: Vec<Flags>,
}

impl Default for State {
    fn default() -> Self {
        Self {
            flags: vec![A,B,B],
        }
    }
}

Notice that B appears twice. I did that on purpose to illustrate a weakness - we probably would have to sort and dedup this Vec<Flags> before using it. Another weakness is that we're using up a lot more memory than we need to in order to represent this, and we're potentially allocating when we don't need to. Now let's look at doing this using bitflags.

enum Flags {
    A = 0b100,
    B = 0b10,
    C = 0b1,
}

struct State {
    flags: u32,
}

impl Default for State {
    fn default() -> Self {
        Self {
            flags: 0b110,
        }
    }
}

Normally Rust, along with most languages which have enums, will automatically assign an integer value to each enum member. But we can assign those integer values manually, which is what we're doing here. Ob1 is the number 1 in binary. You can do this with Octal pretty easily, too, but binary makes it really easy to see what is going on. Since our Default impl sets flags to 0b110 we can see that the bits which are toggled on include the same bits as both A and B. If we wanted to add C to this number, we could use the bitwise or operator.

let mut state = State::default();
state.flags |= Flags::C as u32;
assert!(state.flags & Flags::C as u32 != 0);

I snuck in that last line to show how we can tell programmatically if state has the Flags::C flag set. But it's going to get messy and difficult to track if we have to keep throwing around as u32, or worse state.flags & Flags::C as u32 != 0 all over our code. You'll see just that sort of thing in a lot of old code anyway, but we can abstract it a little to make our lives easier. First we'll get rid of the as u32 part. Note that this time, I'm going to show you code from the actual parser.

use std::ops::{BitAnd, BitAndAssign, BitOr, BitOrAssign};

#[derive(PartialEq)]
/// The granularity of the given permissions
pub enum Who {
    /// applies for the current user
    User = 0b100,
    /// applies for the current group
    Group = 0b10,
    /// applies for everyone else
    Other = 0b1,
}

impl BitAnd<Who> for u32 {
    type Output = u32;

    fn bitand(self, rhs: Who) -> Self::Output {
        self & rhs as u32
    }
}

impl BitAnd<u32> for Who {
    type Output = u32;

    fn bitand(self, rhs: u32) -> Self::Output {
        self as u32 & rhs
    }
}

impl BitAndAssign<Who> for u32 {
    fn bitand_assign(&mut self, rhs: Who) {
        *self = *self & rhs;
    }
}

impl BitOr<Who> for u32 {
    type Output = u32;

    fn bitor(self, rhs: Who) -> Self::Output {
        self | rhs as u32
    }
}

impl BitOrAssign<Who> for u32 {
    fn bitor_assign(&mut self, rhs: Who) {
        *self = *self | rhs;
    }
}

I'm going to manually generate these impl's for each data type I need them for. You could probably do it with a macro, but it won't actually make the compiled code any smaller or faster and it will increase compilation times. The only reason for a macro would be to reduce boilerplate, or if our implementation was going into a library that had to be ergonomically usable for other people's random types.

Next, we're groing to create a trait with a single function, contains, which will allow easy inspection to see if a given u32 contains a flag.

//! Minimal bitflags type implementation allowing to check if a set of flags
//! contains a specific flag. The type must implement Copy and Bitand with u32.
use core::ops::BitAnd;

pub trait BitFlags<T> {
    fn contains(&self, rhs: T) -> bool;
}

impl<T, U> BitFlags<T> for U
where
    U: BitAnd<T> + Copy,
    T: BitAnd<U>,
    <U as BitAnd<T>>::Output: PartialEq<u32>,
{
    fn contains(&self, rhs: T) -> bool {
        *self & rhs != 0
    }
}

Our type must be Copy and implement BitAnd with the above constraints. But we can freely use it on any given enum we want to if we implement those traits.

Note: There are a lot more operations that a real bitflags implementation would want to support. But this is just a minimal implementation for the internal use of one program (albeit one that can do a bunch of different things).

It should be apparent that the Who enum represents the who that a permissions bit applies to. But we need to know the 'what' and the 'where', what in this case being the permissions themselves and where being where they are going (are they being added or removed?). For that we need two more enums. Our Op enum isn't going to use the concepts I've talked about so far, it's just a simple enum representing three choices (Add, Remove or Equals). I'm not going to show the code for that. I'm also not going to show the code for the Bit enum, which specifies each of the permission bits in the set.

The Parser struct

pub struct Parser {
    mode: u32,
    op: Option<Op>,
    who: u32,
    bits: u32,
}

impl Default for Parser {
    fn default() -> Self {
        let umask = get_umask();
        let mut mode = 0o0666;
        mode &= umask;
        Self {
            mode,
            op: None,
            who: 0,
            bits: 0,
        }
    }
}

impl Parser {
    pub fn new(mode: u32) -> Self {
        Self {
            mode,
            op: None,
            who: 0,
            bits: 0,
        }
    }
}

Like the gemtext parser mentioned earlier, our Parser struct is going to be use via &mut self references, so as it processes the input it will modify itself accordingly, returning a single u32 representing the mode bits at the end. The Default implementation is for creating new files, and begins with the default setting of a=rw (everybody can read or write) modified by our umask. But we have another constructor, new(mode: u32) -> Self, which takes an initial mode as a starting point. This is going to be used for, say, chmod, where we're modifying existing files.

The actual parsing process is going to work roughly as follows:

see if the entire input string can be parsed as an Octal number. If successful, this is the return value. If it's an error, go on to step 2.
parse the rest of the string a single character at a time, until a comma is encountered. As we encounter appropriate characters, we will set self.op to Some(Op), add flags to self.who and to self.bits.
once a comma is encountered, evaluate the input as a set of bits and apply them with the appropriate actions to self.mode
clear out self.op by setting it back to None. Clear out self.who by returning it to zero, and begin with the next slice of input at step 2.
once the entire string is processed, return self.mode

Doing the parsing

pub fn parse(&mut self, value: &str) -> Result<u32, ParseError> {
    match Self::parse_octal(value) {
        Ok(mode) => {
            self.mode = mode;
            return Ok(mode);
        }
        Err(e) => {
            if e == ParseError::OutsideRange {
                return Err(e);
            }
        }
    }
    for c in value.chars() {
        match c {
            'u' => self.add_who(Who::User)?,
            'g' => self.add_who(Who::Group)?,
            'o' => self.add_who(Who::Other)?,
            'a' => {
                self.add_who(Who::User)?;
                self.add_who(Who::Group)?;
                self.add_who(Who::Other)?;
            }
            '-' => self.set_op(Op::Remove)?,
            '+' => self.set_op(Op::Add)?,
            '=' => self.set_op(Op::Equals)?,
            'r' => self.push_read_bits()?,
            'w' => self.push_write_bits()?,
            'x' => self.push_exec_bits()?,
            's' => self.push_suid_sgid()?,
            't' => self.push_sticky()?,
            ',' => {
                self.set_bits()?;
                self.reset();
            }
            _ => return Err(ParseError::InvalidChar),
        }
    }
    self.set_bits()?;
    self.reset();
    Ok(self.mode)
}

Obviously some of the work is broken out into helper functions such as self.add_who or self.set_op. Those functions also perform most of our sanity checking, such as making sure that we have an Op set before allowing us to add any permission bits to the parser. If we encounter a character other than the ones in the big match statement, or a character that would be legal but is not expected at this time (such as if multiple operations are specified) the function will return an error. Let's take a peak at some of these helper functions.

    fn add_who(&mut self, who: Who) -> Result<(), ParseError> {
        if self.op.is_some() || !self.bits == 0 {
            Err(ParseError::InvalidChar)
        } else {
            self.who |= who;
            Ok(())
        }
    }

    fn set_op(&mut self, op: Op) -> Result<(), ParseError> {
        if self.op.is_some() || !self.bits == 0 {
            Err(ParseError::InvalidChar)
        } else {
            self.op = Some(op);
            if self.who == 0 {
                self.who |= 0o111;
            }
            Ok(())
        }
    }

    fn push_read_bits(&mut self) -> Result<(), ParseError> {
        if self.op.is_none() {
            Err(ParseError::NoOpSet)
        } else {
            if self.who.contains(Who::User) {
                self.bits |= Bit::URead;
            }
            if self.who.contains(Who::Group) {
                self.bits |= Bit::GRead;
            }
            if self.who.contains(Who::Other) {
                self.bits |= Bit::ORead;
            }
            Ok(())
        }
    }
    // skipping some functions for brevity

    fn add_bits(&mut self) {
        self.mode |= self.bits;
    }

    fn remove_bits(&mut self) {
        self.mode &= !self.bits;
    }

    fn set_bits(&mut self) -> Result<(), ParseError> {
        match self.op {
            Some(Op::Add) => self.add_bits(),
            Some(Op::Remove) => self.remove_bits(),
            Some(Op::Equals) => {
                if self.who.contains(Who::User) {
                    self.mode &= !(0o4700);
                }
                if self.who.contains(Who::Group) {
                    self.mode &= !(0o2070);
                }
                if self.who.contains(Who::Other) {
                    self.mode &= !(0o1007);
                }
                self.add_bits();
            }
            None => return Err(ParseError::NoOpSet),
        }
        Ok(())
    }

Looking at these little helper functions you can see how much cleaner the code is than it would be without the prep work we did earlier. We're able to perform bitwise operations directly between our enum members and u32 numbers, and we do a lot of checking using our Bitflags trait using the contains method. The other nice thing about breaking out our helper functions like this is that we have all small, easy to reason about and maintain functions. The longest function is the parse method, which is a total of 40 lines. True there was some boilerplate required to get to this point, but it will pay dividends as so much of this code will get re-used throughout the codebase, not just this parser.

Code for this post is at the codeberg repo. The bitflags module is at src/bitflags while everything else mentioned is under src/mode. As of this writing shitbox has a total of 42 applets. Which seemed like a good place to write something about it..

Tags for this post: Programming Rust GemView BitFlags

2023-01-23T18:00:00Z