The New Gemtext Parser (part 2)
In my last post I described the reasons behind and implementation of the new Gemtext parser for GemView. In the summary, I pointed out a few areas of "low hanging fruit" which might make for some nice improvements over both the old parser and the new in it's then state. Since this is a personal project, and thus not subject to deadlines or reviews of any sort, I decided to go for it. This in turn wound up being a more significant refactor than one might expect, but I think the result is well worth the effort.
Grouping consecutive blockquotes
First on the agenda was the grouping of consecutive blockquote lines, in the same
way that preformatted lines are now grouped. In order to do that, we have to go
from two possible parser states to three. This means that instead of using a
simple boolean expression, preformatted
, to track our state, an enum is now
more appropriate.
enum State {
Normal,
Preformatted,
Quote,
}
There, that's the easy part. Now instead of a simple if preformatted..else
block we need to match against the current state and react appropriately. Now
that's simple enough, but let's take a peak at what a BlockQuote looks like in
Gemtext compared with a preformatted block.
> This is a blockquote > This is another one, which we will group with the first ``` This is preformatted text ```
When we start a preformatted block, we have three backticks on a line of it's
own, and don't have to worry about the rest of that line. To close the block, we
use another line of only three backticks. With blockquotes, however, we have a
'>' symbol at the beginning of each line to denote that it's a blockquote. Which
means that we don't know whether a line is going to continue being a blockquote
or be a completely different kind of line until we're actually parsing it. Which
makes our match arm for State::Quote
have to do a bit more work than our match
arm for State::Preformatted
. In fact, we have to do everything that we would
do in Normal
mode. And since, in our first go around, the !preformatted
branch of the if..else
block was by far the bulk of the code, we're just about
doubling the size of the function.
Now, it's not terrible. It was around 70 lines or so when I first got it working, but can we split this up? Because I like nice small chunks of code, as I said in the previous post. Well of course we can, but we have a number of state related items to track now.
- Our
State
enum - The
preblk
string - A new
quoteblk
string - The
lines
Vec of GemtextNode elements - The line we are currently parsing
At this point we would be passing five distinct pieces of data into each function if we just made separate functions for each match arm. As you might guess, I'm not a fan of this approach, so let's just go ahead and combine all of that into a single struct and write methods on that struct.
#[derive(Default)]
struct Parser {
state: State,
preblk: String,
quoteblk: String,
lines: Vec<GemtextNode<'a>>,
}
impl Parser {
// Runs the loop over the full document
fn parse(mut self, text: &str) -> Vec<GemtextNode<'a>> {
for line in text.lines() {
match self.state {
State::Preformatted => self.parse_preformatted(line),
State::Quote => self.parse_quote(line),
State::Normal => self.parse_normal(line),
}
}
self.lines
}
Nice and concise, if I do say so myself. So let's look at a bit more.
// impl Parser
fn link(&mut self, line: &str) {
self.lines.push(GemtextNode::parse_link(line));
}
fn prompt(&mut self, line: &str) {
self.lines.push(GemtextNode::parse_prompt(line));
}
fn heading(&mut self, line: &str) {
self.lines.push(GemtextNode::parse_heading(line));
}
fn list_item(&mut self, line: &str) {
self.lines.push(GemtextNode::parse_list_item(line));
}
// Runs when the parser state is `State::Normal`
fn parse_normal(&mut self, line: &str) {
match line {
s if s.starts_with("=>") => self.link(s),
s if s.starts_with("=:") => self.prompt(s),
s if s.starts_with('#') => self.heading(s),
s if s.starts_with('*') => self.list_item(s),
s if s.starts_with('>') => match GemtextNode::parse_blockquote(s) {
// and so on
Once this was working and some new tests were written to ensure that block quotes are being grouped appropriately, it was time to move on to the next task.
Capturing the alt text for preformatted blocks
Now this one really was simple.
// impl Parser
fn enter_preformatted(&mut self, line: &str) {
self.state = State::Preformatted;
if line.len() > 3 {
self.pre_alt = Some(line[3..].to_string());
}
}
// Runs when the parser state is `State::Normal`
fn parse_normal(&mut self, line: &str) {
match line {
// most of the match arms are omitted for brevity
s if s.starts_with("```") => self.enter_preformatted(s),
s => self.lines.push(GemtextNode::Text(s.to_string())),
}
}
// Runs when the parser state is `State::Preformatted`
fn parse_preformatted(&mut self, line: &str) {
if line.starts_with("```") {
self.lines.push(GemtextNode::Preformatted(
self.preblk.trim_end().to_string(),
self.pre_alt.clone(),
));
self.state = State::Normal;
self.pre_alt = None;
self.preblk.clear();
// and so on
We do the same thing in parse_quote
if we find that the current line is the
start of a preformatted block. We call enter_preformatted
with the current
line and if there is any alt text it gets captured in the new field pre_alt
of
our Parser struct, which now looks like this.
#[derive(Default)]
struct Parser {
state: State,
preblk: String,
pre_alt: Option<String>,
quoteblk: String,
lines: Vec<GemtextNode>,
}
So yeah, this part really was simple. But let's go for some extra credit at this
point and see if we can avoid some String allocations by adjusting our GemtextNode
enum variants. We're going to make some of them store a string slice reference,
or &str
, rather than an allocated String. And for that, we have to resort to
something that tends to strike fear into a lot of intermediate Rust programmers
(myself included) - lifetimes.
Reducing allocations with named lifetimes
Lifetimes are hard. But they are doable, and the compiler will halp get you to
the finish line. But before we go changing one of our enum variants from a
Variant(String
to Variant(&'a str)
it pays to think about what that will
mean, from the perspective of the actual memory used.
Right now, our parser input is chunked up into the appropriate enum variants and
new String
s are allocated for each, which are all newly allocated memory. Once
we have finished parsing, the input &str
is then free to be dropped by the
compiler once it goes out of scope.
The moment we try to store a string slice things change completely. Let's say we start here, with just text lines.
#[derive(Debug, Eq, PartialEq, Clone)]
/// A singular gemtext node.
pub enum GemtextNode<'a> {
/// A pure text block. The string contained within is the entire line of text
Text(&'a str),
Now the slice stored in GemtextNode::Text(&'a str)
is only a reference to a
slice of the memory storing our original parser input. That means that the input
string must continue to live on until the &'a str
goes out of scope, which
happens when that particular line is rendered onto a page in the GemView widget.
Once every GemtextNode::Text(&'a str)
goes out of scope, then our original
input can be dropped.
But that's not all. We're storing each of those elements in a Vec<GemtextNode>
,
which is a field of our Parser
struct. That means that our parser will have to
remain alive until every element is consumed as well.
And now, if that's not enough, the <'a>
lifetime is now considered to be part
of the type of GemtextNode
and Parser
, and the compiler will expect us to
annotate those items and their implementations appropriately.
struct Parser<'a> {}
impl<'a> Parser<'a> {}
Ok, so if the fact that the lifetime being part of the type of an item sounds odd to you, you're probably not alone. But let's think of it this way. We've got a Parser struct, which chunks up some input text and churns out references to portions of the original, which are in fact pointers to the underlying data. The fact that the Parser struct has to live at least as long as the original data is a fundamental property of the Struct, and none of the impl methods that are defined for it would be possible if it weren't. So the impl block is only valid for a Parser struct with that lifetime, and we annotate not only the struct, but also it's impl block. So, starting with just our GemtextNode enum, here we go.
#[derive(Debug, Eq, PartialEq, Clone)]
pub enum GemtextNode<'a> {
Text(&'a str),
Link(Link<'a>),
Prompt(Link<'a>),
H1(&'a str),
H2(&'a str),
H3(&'a str),
ListItem(&'a str),
Blockquote(String),
Preformatted(String, Option<String>),
}
In for a penny, in for a pound. We're using this technique for most of the line types now, only stopping short because when trying to do the same for the BlockQuote and Preformatted variants, the actual methods defined were resulting in more than one mutable reference, which obviously is not going to be allowed by the borrow checker. Let's take a look at what sorts of things also have to change when we get to the impl block for GemtextNode.
impl<'a> GemtextNode<'a> {
fn parse_link(text: &'a str) -> Self {
if let Ok(link) = Link::try_from(text) {
Self::Link(link)
} else {
Self::Text(text)
}
}
fn parse_heading(text: &'a str) -> Self {
if let Some((h, s)) = text.split_once(' ') {
match h {
"#" => GemtextNode::H1(s),
"##" => GemtextNode::H2(s),
"###" => GemtextNode::H3(s),
_ => GemtextNode::Text(text),
}
} else {
GemtextNode::Text(text)
}
}
/// additional methods not shown
Notice that apart from annotating impl<'a> GemtextNode<'a>
, we also have to
specify that the text
parameter has an 'a
lifetime. Not shown, but also
important, is that we've given our Link struct a similar treatment. We've also
been able to remove all of the to_string()
calls on our output.
On to the Parser.
#[derive(Default)]
struct Parser<'a> {
state: State,
preblk: String,
pre_alt: Option<String>,
quoteblk: String,
lines: Vec<GemtextNode<'a>>,
}
Here in addition to annotating Parser we have to specify that the lines field has our lifetime, because it is ultimately our output. How about the impl block for Parser?
impl<'a> Parser<'a> {
// Runs the loop over the full document
fn parse(mut self, text: &'a str) -> Vec<GemtextNode<'a>> {
for line in text.lines() {
match self.state {
State::Preformatted => self.parse_preformatted(line),
State::Quote => self.parse_quote(line),
State::Normal => self.parse_normal(line),
}
}
self.lines
}
fn link(&mut self, line: &'a str) {
self.lines.push(GemtextNode::parse_link(line));
}
fn prompt(&mut self, line: &'a str) {
self.lines.push(GemtextNode::parse_prompt(line));
}
fn heading(&mut self, line: &'a str) {
self.lines.push(GemtextNode::parse_heading(line));
}
fn list_item(&mut self, line: &'a str) {
self.lines.push(GemtextNode::parse_list_item(line));
}
// Runs when the parser state is `State::Normal`
fn parse_normal(&mut self, line: &'a str) {
match line {
s if s.starts_with("=>") => self.link(s),
s if s.starts_with("=:") => self.prompt(s),
s if s.starts_with('#') => self.heading(s),
s if s.starts_with('*') => self.list_item(s),
s if s.starts_with('>') => match GemtextNode::parse_blockquote(s) {
GemtextNode::Blockquote(q) => {
self.quoteblk.push_str(&q);
self.quoteblk.push('\n');
self.state = State::Quote;
}
GemtextNode::Text(t) => self.lines.push(GemtextNode::Text(t)),
_ => unreachable!(),
},
s if s.starts_with("```") => self.enter_preformatted(s),
s => self.lines.push(GemtextNode::Text(s)),
}
}
// Additional methods not shown
This is similar to what we had to do with impl GemtextNode
. We annotate the
impl Parser
as impl<'a> Parser<'a>
and give our functions inputs the 'a
lifetime. Our parser's output is the Parser.lines Vec, which we already gave the
correct annotation earlier.
Final thoughts - what is gained?
In the end, after making the changes mentioned in this article, our parser module has grown to be almost exactly the size of the original parser. However, we have additional capability in that we're correctly grouping both preformatted and blockquote blocks of text together. We've managed to capture the alt_text for any preformatted blocks, which GemView is currently not using, but which opens up some interesting avenues for further development. We've also managed to eliminate the vast majority of string allocations that existed in the old parser and in the first draft of this one. I call it a win. I also call it a win that Rust's lifetimes were successfully conquered here. When I first began poking at Rust that was one of the few concepts which I wasn't entirely sure I would ever attempt to tackle in my own code. Having done so now, my confidence has gotten another major boost.
Additionally, the code is now broken up into small and manageable functions and methods. The longest continuous block of code is a total of 34 lines, whereas the original parser I inherited was 180 or so lines and very unwieldy. It's hard to overstate the importance of this sort of thing. The technical debt has been largely paid at this point, and I don't expect any issues in this code for the foreseeable future.
2022-05-20