The New Gemtext Parser (part 2)

In my last post I described the reasons behind and implementation of the new Gemtext parser for GemView. In the summary, I pointed out a few areas of "low hanging fruit" which might make for some nice improvements over both the old parser and the new in it's then state. Since this is a personal project, and thus not subject to deadlines or reviews of any sort, I decided to go for it. This in turn wound up being a more significant refactor than one might expect, but I think the result is well worth the effort.

Grouping consecutive blockquotes

First on the agenda was the grouping of consecutive blockquote lines, in the same way that preformatted lines are now grouped. In order to do that, we have to go from two possible parser states to three. This means that instead of using a simple boolean expression, preformatted, to track our state, an enum is now more appropriate.

enum State {
    Normal,
    Preformatted,
    Quote,
}

There, that's the easy part. Now instead of a simple if preformatted..else block we need to match against the current state and react appropriately. Now that's simple enough, but let's take a peak at what a BlockQuote looks like in Gemtext compared with a preformatted block.

> This is a blockquote
> This is another one, which we will group with the first
```
This is preformatted text
```

When we start a preformatted block, we have three backticks on a line of it's own, and don't have to worry about the rest of that line. To close the block, we use another line of only three backticks. With blockquotes, however, we have a '>' symbol at the beginning of each line to denote that it's a blockquote. Which means that we don't know whether a line is going to continue being a blockquote or be a completely different kind of line until we're actually parsing it. Which makes our match arm for State::Quote have to do a bit more work than our match arm for State::Preformatted. In fact, we have to do everything that we would do in Normal mode. And since, in our first go around, the !preformatted branch of the if..else block was by far the bulk of the code, we're just about doubling the size of the function.

Now, it's not terrible. It was around 70 lines or so when I first got it working, but can we split this up? Because I like nice small chunks of code, as I said in the previous post. Well of course we can, but we have a number of state related items to track now.

  • Our State enum
  • The preblk string
  • A new quoteblk string
  • The lines Vec of GemtextNode elements
  • The line we are currently parsing

At this point we would be passing five distinct pieces of data into each function if we just made separate functions for each match arm. As you might guess, I'm not a fan of this approach, so let's just go ahead and combine all of that into a single struct and write methods on that struct.

#[derive(Default)]
struct Parser {
    state: State,
    preblk: String,
    quoteblk: String,
    lines: Vec<GemtextNode<'a>>,
}

impl Parser {
    // Runs the loop over the full document
    fn parse(mut self, text: &str) -> Vec<GemtextNode<'a>> {
        for line in text.lines() {
            match self.state {
                State::Preformatted => self.parse_preformatted(line),
                State::Quote => self.parse_quote(line),
                State::Normal => self.parse_normal(line),
            }
        }
        self.lines
    }

Nice and concise, if I do say so myself. So let's look at a bit more.

    // impl Parser
    fn link(&mut self, line: &str) {
        self.lines.push(GemtextNode::parse_link(line));
    }

    fn prompt(&mut self, line: &str) {
        self.lines.push(GemtextNode::parse_prompt(line));
    }

    fn heading(&mut self, line: &str) {
        self.lines.push(GemtextNode::parse_heading(line));
    }

    fn list_item(&mut self, line: &str) {
        self.lines.push(GemtextNode::parse_list_item(line));
    }

    // Runs when the parser state is `State::Normal`
    fn parse_normal(&mut self, line: &str) {
        match line {
            s if s.starts_with("=>") => self.link(s),
            s if s.starts_with("=:") => self.prompt(s),
            s if s.starts_with('#') => self.heading(s),
            s if s.starts_with('*') => self.list_item(s),
            s if s.starts_with('>') => match GemtextNode::parse_blockquote(s) {
// and so on

Once this was working and some new tests were written to ensure that block quotes are being grouped appropriately, it was time to move on to the next task.

Capturing the alt text for preformatted blocks

Now this one really was simple.

    // impl Parser
    fn enter_preformatted(&mut self, line: &str) {
        self.state = State::Preformatted;
        if line.len() > 3 {
            self.pre_alt = Some(line[3..].to_string());
        }
    }

    // Runs when the parser state is `State::Normal`
    fn parse_normal(&mut self, line: &str) {
        match line {
	    // most of the match arms are omitted for brevity
            s if s.starts_with("```") => self.enter_preformatted(s),
            s => self.lines.push(GemtextNode::Text(s.to_string())),
        }
    }

    // Runs when the parser state is `State::Preformatted`
    fn parse_preformatted(&mut self, line: &str) {
        if line.starts_with("```") {
            self.lines.push(GemtextNode::Preformatted(
                self.preblk.trim_end().to_string(),
                self.pre_alt.clone(),
            ));
            self.state = State::Normal;
            self.pre_alt = None;
            self.preblk.clear();
// and so on

We do the same thing in parse_quote if we find that the current line is the start of a preformatted block. We call enter_preformatted with the current line and if there is any alt text it gets captured in the new field pre_alt of our Parser struct, which now looks like this.

#[derive(Default)]
struct Parser {
    state: State,
    preblk: String,
    pre_alt: Option<String>,
    quoteblk: String,
    lines: Vec<GemtextNode>,
}

So yeah, this part really was simple. But let's go for some extra credit at this point and see if we can avoid some String allocations by adjusting our GemtextNode enum variants. We're going to make some of them store a string slice reference, or &str, rather than an allocated String. And for that, we have to resort to something that tends to strike fear into a lot of intermediate Rust programmers (myself included) - lifetimes.

Reducing allocations with named lifetimes

Lifetimes are hard. But they are doable, and the compiler will halp get you to the finish line. But before we go changing one of our enum variants from a Variant(String to Variant(&'a str) it pays to think about what that will mean, from the perspective of the actual memory used.

Right now, our parser input is chunked up into the appropriate enum variants and new Strings are allocated for each, which are all newly allocated memory. Once we have finished parsing, the input &str is then free to be dropped by the compiler once it goes out of scope.

The moment we try to store a string slice things change completely. Let's say we start here, with just text lines.

#[derive(Debug, Eq, PartialEq, Clone)]
/// A singular gemtext node.
pub enum GemtextNode<'a> {
    /// A pure text block. The string contained within is the entire line of text
    Text(&'a str),

Now the slice stored in GemtextNode::Text(&'a str) is only a reference to a slice of the memory storing our original parser input. That means that the input string must continue to live on until the &'a str goes out of scope, which happens when that particular line is rendered onto a page in the GemView widget. Once every GemtextNode::Text(&'a str) goes out of scope, then our original input can be dropped.

But that's not all. We're storing each of those elements in a Vec<GemtextNode>, which is a field of our Parser struct. That means that our parser will have to remain alive until every element is consumed as well.

And now, if that's not enough, the <'a> lifetime is now considered to be part of the type of GemtextNode and Parser, and the compiler will expect us to annotate those items and their implementations appropriately.

struct Parser<'a> {}
impl<'a> Parser<'a> {}

Ok, so if the fact that the lifetime being part of the type of an item sounds odd to you, you're probably not alone. But let's think of it this way. We've got a Parser struct, which chunks up some input text and churns out references to portions of the original, which are in fact pointers to the underlying data. The fact that the Parser struct has to live at least as long as the original data is a fundamental property of the Struct, and none of the impl methods that are defined for it would be possible if it weren't. So the impl block is only valid for a Parser struct with that lifetime, and we annotate not only the struct, but also it's impl block. So, starting with just our GemtextNode enum, here we go.

#[derive(Debug, Eq, PartialEq, Clone)]
pub enum GemtextNode<'a> {
    Text(&'a str),
    Link(Link<'a>),
    Prompt(Link<'a>),
    H1(&'a str),
    H2(&'a str),
    H3(&'a str),
    ListItem(&'a str),
    Blockquote(String),
    Preformatted(String, Option<String>),
}

In for a penny, in for a pound. We're using this technique for most of the line types now, only stopping short because when trying to do the same for the BlockQuote and Preformatted variants, the actual methods defined were resulting in more than one mutable reference, which obviously is not going to be allowed by the borrow checker. Let's take a look at what sorts of things also have to change when we get to the impl block for GemtextNode.

impl<'a> GemtextNode<'a> {
    fn parse_link(text: &'a str) -> Self {
        if let Ok(link) = Link::try_from(text) {
            Self::Link(link)
        } else {
            Self::Text(text)
        }
    }

    fn parse_heading(text: &'a str) -> Self {
        if let Some((h, s)) = text.split_once(' ') {
            match h {
                "#" => GemtextNode::H1(s),
                "##" => GemtextNode::H2(s),
                "###" => GemtextNode::H3(s),
                _ => GemtextNode::Text(text),
            }
        } else {
            GemtextNode::Text(text)
        }
    }
/// additional methods not shown

Notice that apart from annotating impl<'a> GemtextNode<'a>, we also have to specify that the text parameter has an 'a lifetime. Not shown, but also important, is that we've given our Link struct a similar treatment. We've also been able to remove all of the to_string() calls on our output.

On to the Parser.

#[derive(Default)]
struct Parser<'a> {
    state: State,
    preblk: String,
    pre_alt: Option<String>,
    quoteblk: String,
    lines: Vec<GemtextNode<'a>>,
}

Here in addition to annotating Parser we have to specify that the lines field has our lifetime, because it is ultimately our output. How about the impl block for Parser?

impl<'a> Parser<'a> {
    // Runs the loop over the full document
    fn parse(mut self, text: &'a str) -> Vec<GemtextNode<'a>> {
        for line in text.lines() {
            match self.state {
                State::Preformatted => self.parse_preformatted(line),
                State::Quote => self.parse_quote(line),
                State::Normal => self.parse_normal(line),
            }
        }
        self.lines
    }

    fn link(&mut self, line: &'a str) {
        self.lines.push(GemtextNode::parse_link(line));
    }

    fn prompt(&mut self, line: &'a str) {
        self.lines.push(GemtextNode::parse_prompt(line));
    }

    fn heading(&mut self, line: &'a str) {
        self.lines.push(GemtextNode::parse_heading(line));
    }

    fn list_item(&mut self, line: &'a str) {
        self.lines.push(GemtextNode::parse_list_item(line));
    }

    // Runs when the parser state is `State::Normal`
    fn parse_normal(&mut self, line: &'a str) {
        match line {
            s if s.starts_with("=>") => self.link(s),
            s if s.starts_with("=:") => self.prompt(s),
            s if s.starts_with('#') => self.heading(s),
            s if s.starts_with('*') => self.list_item(s),
            s if s.starts_with('>') => match GemtextNode::parse_blockquote(s) {
                GemtextNode::Blockquote(q) => {
                    self.quoteblk.push_str(&q);
                    self.quoteblk.push('\n');
                    self.state = State::Quote;
                }
                GemtextNode::Text(t) => self.lines.push(GemtextNode::Text(t)),
                _ => unreachable!(),
            },
            s if s.starts_with("```") => self.enter_preformatted(s),
            s => self.lines.push(GemtextNode::Text(s)),
        }
    }
// Additional methods not shown

This is similar to what we had to do with impl GemtextNode. We annotate the impl Parser as impl<'a> Parser<'a> and give our functions inputs the 'a lifetime. Our parser's output is the Parser.lines Vec, which we already gave the correct annotation earlier.

Final thoughts - what is gained?

In the end, after making the changes mentioned in this article, our parser module has grown to be almost exactly the size of the original parser. However, we have additional capability in that we're correctly grouping both preformatted and blockquote blocks of text together. We've managed to capture the alt_text for any preformatted blocks, which GemView is currently not using, but which opens up some interesting avenues for further development. We've also managed to eliminate the vast majority of string allocations that existed in the old parser and in the first draft of this one. I call it a win. I also call it a win that Rust's lifetimes were successfully conquered here. When I first began poking at Rust that was one of the few concepts which I wasn't entirely sure I would ever attempt to tackle in my own code. Having done so now, my confidence has gotten another major boost.

Additionally, the code is now broken up into small and manageable functions and methods. The longest continuous block of code is a total of 34 lines, whereas the original parser I inherited was 180 or so lines and very unwieldy. It's hard to overstate the importance of this sort of thing. The technical debt has been largely paid at this point, and I don't expect any issues in this code for the foreseeable future.