Rewriting Gemview's Gemtext Parser Part 2
In my last post I described the reasons behind and implementation of the new Gemtext parser for GemView. In the summary, I pointed out a few areas of "low hanging fruit" which might make for some nice improvements over both the old parser and the new in it's then state. Since this is a personal project, and thus not subject to deadlines or reviews of any sort, I decided to go for it. This in turn wound up being a more significant refactor than one might expect, but I think the result is well worth the effort.
Grouping consecutive blockquotes
First on the agenda was the grouping of consecutive blockquote lines,
in the same way that preformatted lines are now grouped. In order to do
that, we have to go from two possible parser states to three. This
means that instead of using a simple boolean expression,
preformatted, to track our state, an enum is now more
appropriate.
enum State { Normal, Preformatted, Quote, }
There, that's the easy part. Now instead of a simple if
preformatted..else block we need to match against the current
state and react appropriately. Now that's simple enough, but let's take
a peak at what a BlockQuote looks like in Gemtext compared with a
preformatted block.
> This is a blockquote > This is another one, which we will group with the first
This is preformatted text
When we start a preformatted block, we have three backticks on a line
of it's own, and don't have to worry about the rest of that line. To
close the block, we use another line of only three backticks. With
blockquotes, however, we have a '>' symbol at the beginning of each
line to denote that it's a blockquote. Which means that we don't know
whether a line is going to continue being a blockquote or be a
completely different kind of line until we're actually parsing it.
Which makes our match arm for State::Quote have to do a
bit more work than our match arm for State::Preformatted.
In fact, we have to do everything that we would do in
Normal mode. And since, in our first go around, the
!preformatted branch of the if..else block
was by far the bulk of the code, we're just about doubling the size of
the function.
Now, it's not terrible. It was around 70 lines or so when I first got it working, but can we split this up? Because I like nice small chunks of code, as I said in the previous post. Well of course we can, but we have a number of state related items to track now.
- Our
Stateenum - The
preblkstring - A new
quoteblkstring - The
linesVec of GemtextNode elements - The line we are currently parsing
At this point we would be passing five distinct pieces of data into each function if we just made separate functions for each match arm. As you might guess, I'm not a fan of this approach, so let's just go ahead and combine all of that into a single struct and write methods on that struct.
#[derive(Default)] struct Parser { state: State, preblk: String, quoteblk: String, lines: Vec<GemtextNode<'a>>, } impl Parser { // Runs the loop over the full document fn parse(mut self, text: &str) -> Vec<GemtextNode<'a>> { for line in text.lines() { match self.state { State::Preformatted => self.parse_preformatted(line), State::Quote => self.parse_quote(line), State::Normal => self.parse_normal(line), } } self.lines }
Nice and concise, if I do say so myself. So let's look at a bit more.
// impl Parser fn link(&mut self, line: &str) { self.lines.push(GemtextNode::parse_link(line)); } fn prompt(&mut self, line: &str) { self.lines.push(GemtextNode::parse_prompt(line)); } fn heading(&mut self, line: &str) { self.lines.push(GemtextNode::parse_heading(line)); } fn list_item(&mut self, line: &str) { self.lines.push(GemtextNode::parse_list_item(line)); } // Runs when the parser state is `State::Normal` fn parse_normal(&mut self, line: &str) { match line { s if s.starts_with("=>") => self.link(s), s if s.starts_with("=:") => self.prompt(s), s if s.starts_with('#') => self.heading(s), s if s.starts_with('*') => self.list_item(s), s if s.starts_with('>') => match GemtextNode::parse_blockquote(s) { // and so on
Once this was working and some new tests were written to ensure that block quotes are being grouped appropriately, it was time to move on to the next task.
Capturing the alt text for preformatted blocks
Now this one really was simple.
// impl Parser fn enter_preformatted(&mut self, line: &str) { self.state = State::Preformatted; if line.len() > 3 { self.pre_alt = Some(line[3..].to_string()); } } // Runs when the parser state is `State::Normal` fn parse_normal(&mut self, line: &str) { match line { // most of the match arms are omitted for brevity s if s.starts_with("```") => self.enter_preformatted(s), s => self.lines.push(GemtextNode::Text(s.to_string())), } } // Runs when the parser state is `State::Preformatted` fn parse_preformatted(&mut self, line: &str) { if line.starts_with("```") { self.lines.push(GemtextNode::Preformatted( self.preblk.trim_end().to_string(), self.pre_alt.clone(), )); self.state = State::Normal; self.pre_alt = None; self.preblk.clear(); // and so on
We do the same thing in parse_quote if we find that the
current line is the start of a preformatted block. We call
enter_preformatted with the current line and if there is
any alt text it gets captured in the new field pre_alt of
our Parser struct, which now looks like this.
#[derive(Default)] struct Parser { state: State, preblk: String, pre_alt: Option<String>, quoteblk: String, lines: Vec<GemtextNode>, }
So yeah, this part really was simple. But let's go for some extra
credit at this point and see if we can avoid some String allocations by
adjusting our GemtextNode enum variants. We're going to make some of
them store a string slice reference, or &str, rather
than an allocated String. And for that, we have to resort to something
that tends to strike fear into a lot of intermediate Rust programmers
(myself included) - lifetimes.
Reducing allocations with named lifetimes
Lifetimes are hard. But they are doable, and the compiler will halp get
you to the finish line. But before we go changing one of our enum
variants from a Variant(String to Variant(&'a
str) it pays to think about what that will mean, from the
perspective of the actual memory used.
Right now, our parser input is chunked up into the appropriate enum
variants and new Strings are allocated for each, which are
all newly allocated memory. Once we have finished parsing, the input
&str is then free to be dropped by the compiler once
it goes out of scope.
The moment we try to store a string slice things change completely. Let's say we start here, with just text lines.
#[derive(Debug, Eq, PartialEq, Clone)] /// A singular gemtext node. pub enum GemtextNode<'a> { /// A pure text block. The string contained within is the entire line of text Text(&'a str),
Now the slice stored in GemtextNode::Text(&'a
str) is only a reference to a slice of the memory storing our
original parser input. That means that the input string must continue
to live on until the &'a str goes out of scope, which
happens when that particular line is rendered onto a page in the
GemView widget. Once every GemtextNode::Text(&'a str)
goes out of scope, then our original input can be dropped.
But that's not all. We're storing each of those elements in a
Vec<GemtextNode>, which is a field of our
Parser struct. That means that our parser will have to
remain alive until every element is consumed as well.
And now, if that's not enough, the <'a> lifetime is
now considered to be part of the type of
GemtextNode and Parser, and the compiler will
expect us to annotate those items and their implementations
appropriately.
struct Parser<'a> {} impl<'a> Parser<'a> {}
Ok, so if the fact that the lifetime being part of the type of an item sounds odd to you, you're probably not alone. But let's think of it this way. We've got a Parser struct, which chunks up some input text and churns out references to portions of the original, which are in fact pointers to the underlying data. The fact that the Parser struct has to live at least as long as the original data is a fundamental property of the Struct, and none of the impl methods that are defined for it would be possible if it weren't. So the impl block is only valid for a Parser struct with that lifetime, and we annotate not only the struct, but also it's impl block. So, starting with just our GemtextNode enum, here we go.
#[derive(Debug, Eq, PartialEq, Clone)] pub enum GemtextNode<'a> { Text(&'a str), Link(Link<'a>), Prompt(Link<'a>), H1(&'a str), H2(&'a str), H3(&'a str), ListItem(&'a str), Blockquote(String), Preformatted(String, Option<String>), }
In for a penny, in for a pound. We're using this technique for most of the line types now, only stopping short because when trying to do the same for the BlockQuote and Preformatted variants, the actual methods defined were resulting in more than one mutable reference, which obviously is not going to be allowed by the borrow checker. Let's take a look at what sorts of things also have to change when we get to the impl block for GemtextNode.
impl<'a> GemtextNode<'a> { fn parse_link(text: &'a str) -> Self { if let Ok(link) = Link::try_from(text) { Self::Link(link) } else { Self::Text(text) } } fn parse_heading(text: &'a str) -> Self { if let Some((h, s)) = text.split_once(' ') { match h { "#" => GemtextNode::H1(s), "##" => GemtextNode::H2(s), "###" => GemtextNode::H3(s), _ => GemtextNode::Text(text), } } else { GemtextNode::Text(text) } } /// additional methods not shown
Notice that apart from annotating impl<'a>
GemtextNode<'a>, we also have to specify that the
text parameter has an 'a lifetime. Not shown,
but also important, is that we've given our Link struct a similar
treatment. We've also been able to remove all of the
to_string() calls on our output.
On to the Parser.
#[derive(Default)] struct Parser<'a> { state: State, preblk: String, pre_alt: Option<String>, quoteblk: String, lines: Vec<GemtextNode<'a>>, }
Here in addition to annotating Parser we have to specify that the lines field has our lifetime, because it is ultimately our output. How about the impl block for Parser?
impl<'a> Parser<'a> { // Runs the loop over the full document fn parse(mut self, text: &'a str) -> Vec<GemtextNode<'a>> { for line in text.lines() { match self.state { State::Preformatted => self.parse_preformatted(line), State::Quote => self.parse_quote(line), State::Normal => self.parse_normal(line), } } self.lines } fn link(&mut self, line: &'a str) { self.lines.push(GemtextNode::parse_link(line)); } fn prompt(&mut self, line: &'a str) { self.lines.push(GemtextNode::parse_prompt(line)); } fn heading(&mut self, line: &'a str) { self.lines.push(GemtextNode::parse_heading(line)); } fn list_item(&mut self, line: &'a str) { self.lines.push(GemtextNode::parse_list_item(line)); } // Runs when the parser state is `State::Normal` fn parse_normal(&mut self, line: &'a str) { match line { s if s.starts_with("=>") => self.link(s), s if s.starts_with("=:") => self.prompt(s), s if s.starts_with('#') => self.heading(s), s if s.starts_with('*') => self.list_item(s), s if s.starts_with('>') => match GemtextNode::parse_blockquote(s) { GemtextNode::Blockquote(q) => { self.quoteblk.push_str(&q); self.quoteblk.push('\n'); self.state = State::Quote; } GemtextNode::Text(t) => self.lines.push(GemtextNode::Text(t)), _ => unreachable!(), }, s if s.starts_with("```") => self.enter_preformatted(s), s => self.lines.push(GemtextNode::Text(s)), } } // Additional methods not shown
This is similar to what we had to do with impl
GemtextNode. We annotate the impl Parser as
impl<'a> Parser<'a> and give our functions
inputs the 'a lifetime. Our parser's output is the
Parser.lines Vec, which we already gave the correct annotation earlier.
Final thoughts - what is gained?
In the end, after making the changes mentioned in this article, our parser module has grown to be almost exactly the size of the original parser. However, we have additional capability in that we're correctly grouping both preformatted and blockquote blocks of text together. We've managed to capture the alt_text for any preformatted blocks, which GemView is currently not using, but which opens up some interesting avenues for further development. We've also managed to eliminate the vast majority of string allocations that existed in the old parser and in the first draft of this one. I call it a win. I also call it a win that Rust's lifetimes were successfully conquered here. When I first began poking at Rust that was one of the few concepts which I wasn't entirely sure I would ever attempt to tackle in my own code. Having done so now, my confidence has gotten another major boost.
Additionally, the code is now broken up into small and manageable functions and methods. The longest continuous block of code is a total of 34 lines, whereas the original parser I inherited was 180 or so lines and very unwieldy. It's hard to overstate the importance of this sort of thing. The technical debt has been largely paid at this point, and I don't expect any issues in this code for the foreseeable future.
Tags for this post: Programming GemView Gemini GemText Parsers Rust Lifetimes