Another Rust Parser
Parsing text input is a common programming task. If you intend to write any kind of software, you are eventually going to be writing some sort of low level parser, because you just can't always count on having a library ready made for the task. For today's post, I'm going to take some lessons learned in writing the gemtext parser for GemView, a few things you're likely to encounter a lot in C code, and write another parser in Rust.
The program and problem space
The program that's being worked on has a working name of shitbox and is, as you might guess from that moniker, a multi-call binary inspired by Busybox. The basic idea of a multi-call binary has the advantage of allowing quite a lot of code re-use. This is great from a Rust standpoint, as it partially mitigates one of Rust's known weaknesses - massive binary size.
When dealing with Unix shell utilities something that comes up quite a
lot is permissions. The parser in question is going to handle parsing
command line input such as you would get for the chmod
utility, but it's going to also be used for mkdir,
mknod, mkfifo and a number of others. Some of
the supporting code will be a minimalist bitflags
implementation and will see usage in a lot of other locations as well.
We can expect this input to come in one of two forms. We might be
getting a string representing a number in Octal format, which is easy
to deal with by just using i32::from_str_radix(<string>,
8). Things get trickier when you bump into the other
possibility, symbolic representation, eg u+rwx,go=r.
Supporting bits - bitflags
Unix permissions are actually set using bitwise operations, allowing a single u32 number to represent a series of on/off boolean switches. You'll see this in C all of the time, but not as much in higher level languages. That's a shame, because it's an inherently efficient method of storing and processing state. Let's look at some psuedo-code to compare the two approaches.
// First a memory inefficient way enum Flags { A, B, C, } struct State { flags: Vec<Flags>, } impl Default for State { fn default() -> Self { Self { flags: vec![A,B,B], } } }
Notice that B appears twice. I did that on purpose to illustrate a
weakness - we probably would have to sort and dedup this
Vec<Flags> before using it. Another weakness is that
we're using up a lot more memory than we need to in order to represent
this, and we're potentially allocating when we don't need to. Now let's
look at doing this using bitflags.
enum Flags { A = 0b100, B = 0b10, C = 0b1, } struct State { flags: u32, } impl Default for State { fn default() -> Self { Self { flags: 0b110, } } }
Normally Rust, along with most languages which have enums, will
automatically assign an integer value to each enum member. But we can
assign those integer values manually, which is what we're doing here.
Ob1 is the number 1 in binary. You can do
this with Octal pretty easily, too, but binary makes it really easy to
see what is going on. Since our Default impl sets
flags to 0b110 we can see that the bits which
are toggled on include the same bits as both
A and B. If we wanted to add C
to this number, we could use the bitwise or operator.
let mut state = State::default(); state.flags |= Flags::C as u32; assert!(state.flags & Flags::C as u32 != 0);
I snuck in that last line to show how we can tell programmatically if
state has the Flags::C flag set. But it's going to get
messy and difficult to track if we have to keep throwing around
as u32, or worse state.flags & Flags::C as u32 !=
0 all over our code. You'll see just that sort of thing in a lot
of old code anyway, but we can abstract it a little to make our lives
easier. First we'll get rid of the as u32 part. Note that
this time, I'm going to show you code from the actual parser.
use std::ops::{BitAnd, BitAndAssign, BitOr, BitOrAssign}; #[derive(PartialEq)] /// The granularity of the given permissions pub enum Who { /// applies for the current user User = 0b100, /// applies for the current group Group = 0b10, /// applies for everyone else Other = 0b1, } impl BitAnd<Who> for u32 { type Output = u32; fn bitand(self, rhs: Who) -> Self::Output { self & rhs as u32 } } impl BitAnd<u32> for Who { type Output = u32; fn bitand(self, rhs: u32) -> Self::Output { self as u32 & rhs } } impl BitAndAssign<Who> for u32 { fn bitand_assign(&mut self, rhs: Who) { *self = *self & rhs; } } impl BitOr<Who> for u32 { type Output = u32; fn bitor(self, rhs: Who) -> Self::Output { self | rhs as u32 } } impl BitOrAssign<Who> for u32 { fn bitor_assign(&mut self, rhs: Who) { *self = *self | rhs; } }
I'm going to manually generate these impl's for each data type I need them for. You could probably do it with a macro, but it won't actually make the compiled code any smaller or faster and it will increase compilation times. The only reason for a macro would be to reduce boilerplate, or if our implementation was going into a library that had to be ergonomically usable for other people's random types.
Next, we're groing to create a trait with a single function, contains,
which will allow easy inspection to see if a given u32
contains a flag.
//! Minimal bitflags type implementation allowing to check if a set of flags //! contains a specific flag. The type must implement Copy and Bitand with u32. use core::ops::BitAnd; pub trait BitFlags<T> { fn contains(&self, rhs: T) -> bool; } impl<T, U> BitFlags<T> for U where U: BitAnd<T> + Copy, T: BitAnd<U>, <U as BitAnd<T>>::Output: PartialEq<u32>, { fn contains(&self, rhs: T) -> bool { *self & rhs != 0 } }
Our type must be Copy and implement BitAnd
with the above constraints. But we can freely use it on any given enum
we want to if we implement those traits.
Note: There are a lot more operations that a real
bitflagsimplementation would want to support. But this is just a minimal implementation for the internal use of one program (albeit one that can do a bunch of different things).
It should be apparent that the Who enum represents the who
that a permissions bit applies to. But we need to know the 'what' and
the 'where', what in this case being the permissions themselves and
where being where they are going (are they being added or removed?).
For that we need two more enums. Our Op enum isn't going
to use the concepts I've talked about so far, it's just a simple enum
representing three choices (Add, Remove or Equals). I'm not going to
show the code for that. I'm also not going to show the code for the
Bit enum, which specifies each of the permission bits in
the set.
The Parser struct
pub struct Parser { mode: u32, op: Option<Op>, who: u32, bits: u32, } impl Default for Parser { fn default() -> Self { let umask = get_umask(); let mut mode = 0o0666; mode &= umask; Self { mode, op: None, who: 0, bits: 0, } } } impl Parser { pub fn new(mode: u32) -> Self { Self { mode, op: None, who: 0, bits: 0, } } }
Like the gemtext parser mentioned earlier, our Parser
struct is going to be use via &mut self references, so
as it processes the input it will modify itself accordingly, returning
a single u32 representing the mode bits at the end. The
Default implementation is for creating new files, and
begins with the default setting of a=rw (everybody can
read or write) modified by our umask. But we have another constructor,
new(mode: u32) -> Self, which takes an initial mode as
a starting point. This is going to be used for, say, chmod, where we're
modifying existing files.
The actual parsing process is going to work roughly as follows:
- see if the entire input string can be parsed as an Octal number. If successful, this is the return value. If it's an error, go on to step 2.
- parse the rest of the string a single character at a time, until a
comma is encountered. As we encounter appropriate characters, we will
set
self.optoSome(Op), add flags toself.whoand toself.bits. - once a comma is encountered, evaluate the input as a set of bits
and apply them with the appropriate actions to
self.mode - clear out
self.opby setting it back toNone. Clear outself.whoby returning it tozero, and begin with the next slice of input at step 2. - once the entire string is processed, return
self.mode
Doing the parsing
pub fn parse(&mut self, value: &str) -> Result<u32, ParseError> { match Self::parse_octal(value) { Ok(mode) => { self.mode = mode; return Ok(mode); } Err(e) => { if e == ParseError::OutsideRange { return Err(e); } } } for c in value.chars() { match c { 'u' => self.add_who(Who::User)?, 'g' => self.add_who(Who::Group)?, 'o' => self.add_who(Who::Other)?, 'a' => { self.add_who(Who::User)?; self.add_who(Who::Group)?; self.add_who(Who::Other)?; } '-' => self.set_op(Op::Remove)?, '+' => self.set_op(Op::Add)?, '=' => self.set_op(Op::Equals)?, 'r' => self.push_read_bits()?, 'w' => self.push_write_bits()?, 'x' => self.push_exec_bits()?, 's' => self.push_suid_sgid()?, 't' => self.push_sticky()?, ',' => { self.set_bits()?; self.reset(); } _ => return Err(ParseError::InvalidChar), } } self.set_bits()?; self.reset(); Ok(self.mode) }
Obviously some of the work is broken out into helper functions such as
self.add_who or self.set_op. Those functions
also perform most of our sanity checking, such as making sure that we
have an Op set before allowing us to add any permission
bits to the parser. If we encounter a character other than the ones in
the big match statement, or a character that would be
legal but is not expected at this time (such as if multiple operations
are specified) the function will return an error. Let's take a peak at
some of these helper functions.
fn add_who(&mut self, who: Who) -> Result<(), ParseError> { if self.op.is_some() || !self.bits == 0 { Err(ParseError::InvalidChar) } else { self.who |= who; Ok(()) } } fn set_op(&mut self, op: Op) -> Result<(), ParseError> { if self.op.is_some() || !self.bits == 0 { Err(ParseError::InvalidChar) } else { self.op = Some(op); if self.who == 0 { self.who |= 0o111; } Ok(()) } } fn push_read_bits(&mut self) -> Result<(), ParseError> { if self.op.is_none() { Err(ParseError::NoOpSet) } else { if self.who.contains(Who::User) { self.bits |= Bit::URead; } if self.who.contains(Who::Group) { self.bits |= Bit::GRead; } if self.who.contains(Who::Other) { self.bits |= Bit::ORead; } Ok(()) } } // skipping some functions for brevity fn add_bits(&mut self) { self.mode |= self.bits; } fn remove_bits(&mut self) { self.mode &= !self.bits; } fn set_bits(&mut self) -> Result<(), ParseError> { match self.op { Some(Op::Add) => self.add_bits(), Some(Op::Remove) => self.remove_bits(), Some(Op::Equals) => { if self.who.contains(Who::User) { self.mode &= !(0o4700); } if self.who.contains(Who::Group) { self.mode &= !(0o2070); } if self.who.contains(Who::Other) { self.mode &= !(0o1007); } self.add_bits(); } None => return Err(ParseError::NoOpSet), } Ok(()) }
Looking at these little helper functions you can see how much cleaner
the code is than it would be without the prep work we did earlier.
We're able to perform bitwise operations directly between our enum
members and u32 numbers, and we do a lot of checking using
our Bitflags trait using the contains method.
The other nice thing about breaking out our helper functions like this
is that we have all small, easy to reason about and maintain functions.
The longest function is the parse method, which is a total
of 40 lines. True there was some boilerplate required to get to this
point, but it will pay dividends as so much of this code will get
re-used throughout the codebase, not just this parser.
Code for this post is at the codeberg repo. The
bitflags module is at src/bitflags while
everything else mentioned is under src/mode. As of this
writing shitbox has a total of 42 applets.
Which seemed like a good place to write something about it..
Tags for this post: Programming Rust GemView BitFlags