Rewriting GemView's Gemtext Parser

When I first began the GemView project, I had chosen to use an external crate to handle most gemini related tasks, such as connecting to the server, retreiving documents, and parsing gemtext. GemView would only really have to handle rendering the parsed GemText. Or so I thought. In reality, I've had to take ownership of the code, due to the crate author not responding to bug reports and pull requests, import it into GemView, and rewrite a number of things that were either broken, such as the parser panicking with malformed blockquote lines, or just not ideal, such as rustls being too picky about only accepting certificates which adhere 100% to the spec, causing a number of the commonly self signed certs that you will find in gemini space to be rejected. In the latter case, this actually affected the only known search engine in geminispace.

The old parser

Gemtext is a very line based format, which makes parsing actually quite easy. The original parser took each line and went through the first few characters, char by char, to see what type of line it was, before pushing the line into a vector of lines. This required the parser to maintain state at each character advance using this massive enum.

#[derive(Debug)]
enum ParseState {
    Searching,
    Text,
    FirstLinkChar,
    SecondLinkChar,
    SecondPromptChar,
    LinkLink,
    PromptLink,
    LinkDesc,
    PromptDesc,
    ListWaitForSpace,
    ListItem,
    FirstTick,
    SecondTick,
    PreformattedTextType,
    HeadingStart,
    Heading,
    SubHeadingStart,
    SubHeading,
    SubSubHeadingStart,
    SubSubHeading,
    BlockquoteStart,
    Blockquote,
}

The parsing function itself, in it's current form, is a 184 line function. And that's after I did some condensing of match arms and collapsed some whitespace. It's also the type of code that is very difficult to split up into separate functions because we've basically got a few really lengthy match statements to see what our current state is and then put the parser into the appropriate state depending on the next character. We also have an inner and an outerr loop. Our outer loop is a loop over each line in the document, and our inner loop is over the characters of the line. Here's a small snippet showing the start of the inner loop.

        // A simple enum to keep our parsing state
        let mut current_parse_state: ParseState = ParseState::Searching;
        let mut temp1 = String::new();
        let mut temp2 = String::new();
        // Go character by character and set our state accordingly
        for c in line.chars() {
            match current_parse_state {
                ParseState::Searching => match c {
                    '=' => current_parse_state = ParseState::FirstLinkChar,
                    '*' => current_parse_state = ParseState::ListWaitForSpace,
                    '`' => current_parse_state = ParseState::FirstTick,
                    '#' => current_parse_state = ParseState::HeadingStart,
                    '>' => current_parse_state = ParseState::BlockquoteStart,
                    _ => current_parse_state = ParseState::Text,
                },
                //=====
                //Text parsing
                //=====
                ParseState::Text => break,
                //=====
                //Link parsing
                //=====
                ParseState::FirstLinkChar => match c {

This type of code is, frankly, very C like. It's also both difficult to maintain and easy to introduce errors into. And since we have to match against all 22 members of the above enum and then react appropriately for each match arm, it makes for a long function which is quite difficult to break up into smaller chunks. And, common to large functions (and common in C code), we wind up nested quite deep here, with a number of our match arms having additional match statements nested within them (and remember this is our inner loop; we're already nested within the outer loop at this point). I wanted to see if I could come up with something that is a little more idiomatic and easily maintained, and that required a wholly different strategy.

The strategy

When defining a strategy it's helpful to have an idea of what one wants and what one does not want. So first, what do we want in the new parser? This is basically my wish list.

Less overall code
Shorter functions that are easier to debug and maintain
Reuse of code where possible
No nested loops
Move the code left by at least one indentation level

That's a good start. Now, what do we not want?

A lot of allocations
Something inefficient compared with the previous parser
Any repeating of old mistakes
Significant lacks in test coverage

Rust is known for having very powerful pattern matching and some great string manipulation utility. We're going to leverage both to avoid having to maintain all of the state that the old parser did, by judicious use of std::str::starts_with to determine what our line type ought to be and sending the line to smaller helper functions. We'll be able to fall back to a text line if we were wrong in our original assumption.

In GemView, we aren't concerned with only the gemini protocol, either. We have support for gemini, gopher, finger, file, data and spartan url's. So when we parse, say, a link line, let's reuse the same struct that we would be using for other protocols. Let's also see if we can make use of any traits in std so that it's obvious to a casual glance what a piece of code is supposed to accomplish.

Our line types

In Gemtext we have text, link, list item, quote, link, and preformatted line types. Only preformatted is ever really expected to flow across more than one line (we could with block quotes, but we're not going to). So we have the following enum, which is largely similar to the original parser, but with some naming adjusted to my own taste.

#[derive(Debug, Eq, PartialEq, Clone)]
pub enum GemtextNode {
    Text(String),
    Link(Link),
    Prompt(Link),
    H1(String),
    H2(String),
    H3(String),
    ListItem(String),
    Blockquote(String),
    Preformatted(String, Option<String>),
}

Our Link struct

This is pretty simple. We have a url, and optionally a display string to display in lieu of the url. It's also the same struct that we're using in a few other places in our codebase.

#[derive(Debug, Eq, PartialEq, Clone)]
pub struct Link {
    url: String,
    display: Option<String>,
}

Now link lines are a good start. If we just test that our line starts with the string => then we can send that to a function which attempts tp parse it into a link struct. This is a good fit for std::convert::TryFrom.

impl TryFrom<&str> for Link {
    type Error = &'static str;

    fn try_from(text: &str) -> Result<Self, Self::Error> {
        let mut split = text.split_whitespace().skip(1);
        if let Some(url) = split.next() {
            let display = split.collect::<Vec<&str>>().join(" ");
            if display.is_empty() {
                Ok(Link {
                    url: url.to_string(),
                    display: None
                })
            } else {
                Ok(Link {
                    url: url.to_string(),
                    display: Some(display)
                })
            }
        } else {
            Err("Not a hyperlink")
        }
    }
}

Of course this is fallible, because we might not have had any whitespace in between the => string and the url (a malformed link line) or maybe the => string is the only thing in the line. In the case that Link::try_from returns an error, the main loop of our parser will push the entire line as a GemtextNode::Text element instead of a GemtextNode::Link. Which is exactly the behavior we want. This is inside of an impl GemtextNode block.

impl GemtextNode {
    fn parse_link(text: &str) -> Self {
        if let Ok(link) = Link::try_from(text) {
            Self::Link(link)
        } else {
            Self::Text(text.to_string())
        }
    }

Headings

Following this same strategy, we'll take any line starting with a # character and see if it's a heading.

    // Further down in our `impl GemtextNode` block
    fn parse_heading(text: &str) -> Self {
        if let Some((h, s)) = text.split_once(' ') {
            match h {
                "#" => GemtextNode::H1(s.to_string()),
                "##" => GemtextNode::H2(s.to_string()),
                "###" => GemtextNode::H3(s.to_string()),
                _ => GemtextNode::Text(text.to_string()),
            }
        } else {
            GemtextNode::Text(text.to_string())
        }
    }

Other elements

Our other elements all follow a very similar strategy, so it would be redundant to show all of that code. Basically, in our main loop over each line, we are going to check to see if the line starts with any of our identifier strings and then send it off to a helper function if so. We'll see what that looks like when I show the main parser loop. We'll also see how preformatted blocks are handled. And remember, preformatted blocks can span lines, making them the only exception to everything being line based.

The main parser

pub fn parse_gemtext(text: &str) -> Vec<GemtextNode> {
    let mut lines = vec![];
    let mut preformatted = false;
    let mut preblk = String::new();
    for line in text.lines() {
        if preformatted && line.starts_with("```") {
            lines.push(GemtextNode::Preformatted(preblk.trim_end().to_string(), None));
            preformatted = false;
            preblk.clear();
        } else if preformatted {
            preblk.push_str(&line);
            preblk.push('\n');
        } else {
            match line {
                s if s.starts_with("=>") => {
                    lines.push(GemtextNode::parse_link(s));
                },
                s if s.starts_with("=:") => {
                    lines.push(GemtextNode::parse_prompt(s));
                },
                s if s.starts_with('#') => {
                    lines.push(GemtextNode::parse_heading(s));
                },
                s if s.starts_with('*') => {
                    lines.push(GemtextNode::parse_list_item(s));
                },
                s if s.starts_with('>') => {
                    lines.push(GemtextNode::parse_blockquote(s));
                },
                s if s.starts_with("```") => preformatted = true,
                s => lines.push(GemtextNode::Text(s.to_string())),
            }
        }
    }
    lines
}

Remember when I said earlier that the original parser was a 184 line function? This version is 35 lines. The old parser also had 22 different states that it could be in at any given time. We now have exactly two possible states. It's either in preformatted mode or it's not. The nested loop is gone. Some of the parsing takes place in nice, small helper functions. If we modify one of those functions we don't risk putting the parser into an invalid state, so we've effectively eliminated a lot of unwanted side effects that might happen while maintaining the code.

I want to point out the pattern matching. As I mentioned earlier, Rust has very powerful pattern matching, and here we get to see some of it. Instead of looping over characters we do a match with arms like: s if s.starts_with(<pattern>) =>. Easy peasy.

How did we do in reducing the overall size? According to Tokei, the original parser module is 382 LOC, while the rewrite is 268 LOC. The test suite in the old parser takes up 141 lines, while the one in the new parser is 142 lines. Note that those last two figures are including comments and blank lines, so it's not a perfect comaprison. Any way you slice it though, we've definitely cut down the overall size of the module.

How did we do in moving the code to the left? Well, while we still get a little ways over to the right, just by not having an outer loop our maximum indentation is five levels here, whereas the old code got to six in several places. And in reality, I could cut this back to a maximum of four levels of indentation if I was willing to let the match arms in the main loop run over the long line marker since the each run a single line of code, but I'm not willing to do that.

Some possible improvements, and when will it ship?

So the Gemtext spec is a little open to interpretation in a couple of places, and in it's current state I've erred on the side of simplicity. But there are two places where this could be given some extra functionality without undue complication.

Regarding preformatted blocks, the spec says that anything after the third tick should be considered "alt text" and clients are free to do whatever they want with it, including ignoring it completely (which is what we're currently doing). In Markdown of course, this is often used to indicate a language for syntax highlighting source code. It would be easy enough to capture the "alt text" in the parser. Actually implementing source code highlighting in GemView is less trivial, but not off the table.

The second bit of low hanging fruit is around block quotes. The spec again leaves it up to the client how they want to handle them, if at all. Now, when I write gemtext, I treat blockquote lines just the same as normal lines in that I write the entire quote as a single line, leaving it to the editor to wrap things during my writing session. I've worked under the assumption in GemView that this is how it should be done, but spend any time in Gemini and you'll see that a lot of people treat block quotes the way that they do preformatted blocks. They start a new line where they think it should go, beginning with the > character. This leads to a situation in Eva where you get what was intended as a single quote appearing in separate consecutive boxes. It would not be terribly difficult to change our current model of a simple boolean preformatted state to one where we have a three member enum, like so.

enum State {
    Normal,
    Preformatted,
    Quote,
}

Then we would change our simple if preformatted..else construct to a match state block. I am leaning towards this at the moment, because the current way is beginning to feel like swimming upstream.

One last little improvement I'm considering is replacing all of the String's in the GemtextNode elements with a std::borrow::Cow type, to get rid of having to allocate a new String for each line. It's not a huge gain, but since we pass a string slice reference into Gtk when we actually render the elements there isn't anywhere we actually need an owned String. This would be yet another small improvement over the original.

Integrating this new parser into GemView is going to be relatively straightforward. The module is not a completely drop in replacement, due to the enum GemtextNode having some naming differences and one less member (The original had a separate case for blank lines, which we just treat as an empty string). But it isn't a large amount of work. I plan to have it incorporated into GemView by Saturday, and sometime in the next couple of weeks I'll make releases of both GemView and Eva to get the new features into the hands of actual users.

2022-05-19

Jean G3nie