Another parser in Rust
Parsing text input is a common programming task. If you intend to write any kind of software, you are eventually going to be writing some sort of low level parser, because you just can't always count on having a library ready made for the task. For today's post, I'm going to take some lessons learned in writing the gemtext parser for GemView, a few things you're likely to encounter a lot in C code, and write another parser in Rust.
The program and problem space
The program that's being worked on has a working name of shitbox and is, as you might guess from that moniker, a multi-call binary inspired by Busybox. The basic idea of a multi-call binary has the advantage of allowing quite a lot of code re-use. This is great from a Rust standpoint, as it partially mitigates one of Rust's known weaknesses - massive binary size.
When dealing with Unix shell utilities something that comes up quite a lot is
permissions. The parser in question is going to handle parsing command line input
such as you would get for the chmod
utility, but it's going to also be used for
mkdir
, mknod
, mkfifo
and a number of others. Some of the supporting code
will be a minimalist bitflags
implementation and will see usage in a lot of
other locations as well.
We can expect this input to come in one of two forms. We might be getting a string
representing a number in Octal format, which is easy to deal with by just using
i32::from_str_radix(<string>, 8)
. Things get trickier when you bump into the
other possibility, symbolic representation, eg u+rwx,go=r
.
Supporting bits - bitflags
Unix permissions are actually set using bitwise operations, allowing a single u32 number to represent a series of on/off boolean switches. You'll see this in C all of the time, but not as much in higher level languages. That's a shame, because it's an inherently efficient method of storing and processing state. Let's look at some psuedo-code to compare the two approaches.
// First a memory inefficient way
enum Flags {
A,
B,
C,
}
struct State {
flags: Vec<Flags>,
}
impl Default for State {
fn default() -> Self {
Self {
flags: vec![A,B,B],
}
}
}
Notice that B appears twice. I did that on purpose to illustrate a weakness - we
probably would have to sort and dedup this Vec<Flags>
before using it. Another
weakness is that we're using up a lot more memory than we need to in order to
represent this, and we're potentially allocating when we don't need to. Now let's
look at doing this using bitflags.
enum Flags {
A = 0b100,
B = 0b10,
C = 0b1,
}
struct State {
flags: u32,
}
impl Default for State {
fn default() -> Self {
Self {
flags: 0b110,
}
}
}
Normally Rust, along with most languages which have enums, will automatically
assign an integer value to each enum member. But we can assign those integer
values manually, which is what we're doing here. Ob1
is the number 1
in binary.
You can do this with Octal pretty easily, too, but binary makes it really easy
to see what is going on. Since our Default
impl sets flags
to 0b110
we can
see that the bits which are toggled on
include the same bits as both A
and B
.
If we wanted to add C
to this number, we could use the bitwise or
operator.
let mut state = State::default();
state.flags |= Flags::C as u32;
assert!(state.flags & Flags::C as u32 != 0);
I snuck in that last line to show how we can tell programmatically if state has
the Flags::C
flag set. But it's going to get messy and difficult to track if
we have to keep throwing around as u32
, or worse state.flags & Flags::C as u32 != 0
all over our code. You'll see just that sort of thing in a lot of old code anyway,
but we can abstract it a little to make our lives easier. First we'll get rid of
the as u32
part. Note that this time, I'm going to show you code from the actual
parser.
use std::ops::{BitAnd, BitAndAssign, BitOr, BitOrAssign};
#[derive(PartialEq)]
/// The granularity of the given permissions
pub enum Who {
/// applies for the current user
User = 0b100,
/// applies for the current group
Group = 0b10,
/// applies for everyone else
Other = 0b1,
}
impl BitAnd<Who> for u32 {
type Output = u32;
fn bitand(self, rhs: Who) -> Self::Output {
self & rhs as u32
}
}
impl BitAnd<u32> for Who {
type Output = u32;
fn bitand(self, rhs: u32) -> Self::Output {
self as u32 & rhs
}
}
impl BitAndAssign<Who> for u32 {
fn bitand_assign(&mut self, rhs: Who) {
*self = *self & rhs;
}
}
impl BitOr<Who> for u32 {
type Output = u32;
fn bitor(self, rhs: Who) -> Self::Output {
self | rhs as u32
}
}
impl BitOrAssign<Who> for u32 {
fn bitor_assign(&mut self, rhs: Who) {
*self = *self | rhs;
}
}
I'm going to manually generate these impl's for each data type I need them for. You could probably do it with a macro, but it won't actually make the compiled code any smaller or faster and it will increase compilation times. The only reason for a macro would be to reduce boilerplate, or if our implementation was going into a library that had to be ergonomically usable for other people's random types.
Next, we're groing to create a trait with a single function, contains, which will
allow easy inspection to see if a given u32
contains a flag.
//! Minimal bitflags type implementation allowing to check if a set of flags
//! contains a specific flag. The type must implement Copy and Bitand with u32.
use core::ops::BitAnd;
pub trait BitFlags<T> {
fn contains(&self, rhs: T) -> bool;
}
impl<T, U> BitFlags<T> for U
where
U: BitAnd<T> + Copy,
T: BitAnd<U>,
<U as BitAnd<T>>::Output: PartialEq<u32>,
{
fn contains(&self, rhs: T) -> bool {
*self & rhs != 0
}
}
Our type must be Copy
and implement BitAnd
with the above constraints. But we
can freely use it on any given enum we want to if we implement those traits.
Note: There are a lot more operations that a real
bitflags
implementation would want to support. But this is just a minimal implementation for the internal use of one program (albeit one that can do a bunch of different things).
It should be apparent that the Who
enum represents the who that a permissions bit
applies to. But we need to know the 'what' and the 'where', what in this case being
the permissions themselves and where being where they are going (are they being added
or removed?). For that we need two more enums. Our Op
enum isn't going to use
the concepts I've talked about so far, it's just a simple enum representing three
choices (Add, Remove or Equals). I'm not going to show the code for that. I'm also
not going to show the code for the Bit
enum, which specifies each of the permission
bits in the set.
The Parser struct
pub struct Parser {
mode: u32,
op: Option<Op>,
who: u32,
bits: u32,
}
impl Default for Parser {
fn default() -> Self {
let umask = get_umask();
let mut mode = 0o0666;
mode &= umask;
Self {
mode,
op: None,
who: 0,
bits: 0,
}
}
}
impl Parser {
pub fn new(mode: u32) -> Self {
Self {
mode,
op: None,
who: 0,
bits: 0,
}
}
}
Like the gemtext parser mentioned earlier, our Parser
struct is going to be
use via &mut self
references, so as it processes the input it will modify itself
accordingly, returning a single u32 representing the mode bits at the end. The
Default
implementation is for creating new files, and begins with the default
setting of a=rw
(everybody can read or write) modified by our umask. But we
have another constructor, new(mode: u32) -> Self
, which takes an initial mode
as a starting point. This is going to be used for, say, chmod, where we're modifying
existing files.
The actual parsing process is going to work roughly as follows:
- see if the entire input string can be parsed as an Octal number. If successful, this is the return value. If it's an error, go on to step 2.
- parse the rest of the string a single character at a time, until a comma is
encountered. As we encounter appropriate characters, we will set
self.op
toSome(Op)
, add flags toself.who
and toself.bits
. - once a comma is encountered, evaluate the input as a set of bits and apply
them with the appropriate actions to
self.mode
- clear out
self.op
by setting it back toNone
. Clear outself.who
by returning it tozero
, and begin with the next slice of input at step 2. - once the entire string is processed, return
self.mode
Doing the parsing
pub fn parse(&mut self, value: &str) -> Result<u32, ParseError> {
match Self::parse_octal(value) {
Ok(mode) => {
self.mode = mode;
return Ok(mode);
}
Err(e) => {
if e == ParseError::OutsideRange {
return Err(e);
}
}
}
for c in value.chars() {
match c {
'u' => self.add_who(Who::User)?,
'g' => self.add_who(Who::Group)?,
'o' => self.add_who(Who::Other)?,
'a' => {
self.add_who(Who::User)?;
self.add_who(Who::Group)?;
self.add_who(Who::Other)?;
}
'-' => self.set_op(Op::Remove)?,
'+' => self.set_op(Op::Add)?,
'=' => self.set_op(Op::Equals)?,
'r' => self.push_read_bits()?,
'w' => self.push_write_bits()?,
'x' => self.push_exec_bits()?,
's' => self.push_suid_sgid()?,
't' => self.push_sticky()?,
',' => {
self.set_bits()?;
self.reset();
}
_ => return Err(ParseError::InvalidChar),
}
}
self.set_bits()?;
self.reset();
Ok(self.mode)
}
Obviously some of the work is broken out into helper functions such as self.add_who
or self.set_op
. Those functions also perform most of our sanity checking, such
as making sure that we have an Op
set before allowing us to add any permission
bits to the parser. If we encounter a character other than the ones in the big match
statement, or a character that would be legal but is not expected at this time (such
as if multiple operations are specified) the function will return an error. Let's
take a peak at some of these helper functions.
fn add_who(&mut self, who: Who) -> Result<(), ParseError> {
if self.op.is_some() || !self.bits == 0 {
Err(ParseError::InvalidChar)
} else {
self.who |= who;
Ok(())
}
}
fn set_op(&mut self, op: Op) -> Result<(), ParseError> {
if self.op.is_some() || !self.bits == 0 {
Err(ParseError::InvalidChar)
} else {
self.op = Some(op);
if self.who == 0 {
self.who |= 0o111;
}
Ok(())
}
}
fn push_read_bits(&mut self) -> Result<(), ParseError> {
if self.op.is_none() {
Err(ParseError::NoOpSet)
} else {
if self.who.contains(Who::User) {
self.bits |= Bit::URead;
}
if self.who.contains(Who::Group) {
self.bits |= Bit::GRead;
}
if self.who.contains(Who::Other) {
self.bits |= Bit::ORead;
}
Ok(())
}
}
// skipping some functions for brevity
fn add_bits(&mut self) {
self.mode |= self.bits;
}
fn remove_bits(&mut self) {
self.mode &= !self.bits;
}
fn set_bits(&mut self) -> Result<(), ParseError> {
match self.op {
Some(Op::Add) => self.add_bits(),
Some(Op::Remove) => self.remove_bits(),
Some(Op::Equals) => {
if self.who.contains(Who::User) {
self.mode &= !(0o4700);
}
if self.who.contains(Who::Group) {
self.mode &= !(0o2070);
}
if self.who.contains(Who::Other) {
self.mode &= !(0o1007);
}
self.add_bits();
}
None => return Err(ParseError::NoOpSet),
}
Ok(())
}
Looking at these little helper functions you can see how much cleaner the code
is than it would be without the prep work we did earlier. We're able to perform
bitwise operations directly between our enum members and u32
numbers, and we
do a lot of checking using our Bitflags
trait using the contains
method. The
other nice thing about breaking out our helper functions like this is that we have
all small, easy to reason about and maintain functions. The longest function is
the parse
method, which is a total of 40 lines. True there was some boilerplate
required to get to this point, but it will pay dividends as so much of this code
will get re-used throughout the codebase, not just this parser.
Code for this post is at the codeberg repo.
The bitflags
module is at src/bitflags
while everything else mentioned is
under src/mode
. As of this writing shitbox has a total of 42 applets. Which
seemed like a good place to write something about it..