Raivo Laanemets - Parsing Markdown in Prolog

Parsing Markdown in Prolog

On 2013-08-06

The last couple of days I have been working on a Markdown parser. I have been using Textile for a couple of years, and I have grown to hate that line breaks in my source become line breaks in HTML. I like to do manual text layout in my text files for better readability, but layout in HTML should be done automatically. I have thus switched over to Markdown and also wanted a Markdown parser to be integrated with this blog.

I could not find any Markdown parser already written in Prolog. Then I decided to write my own parser. The basic Markdown format does not look complex, but there is no formal grammar. Many existing Markdown parsers use nest of regexes, some are based on parser generators (see the StackOverflow question). Another interesting source of information were articles in these series. As with many parsers, I also decided to split parsing into two phases: block level and span level. Block level parser recognizes structures such as blockquotes, lists, headings, code blocks and paragraphs. Span level parser deals with inline elements such as strong and emphasis formatting and links.

Parsing in Prolog

The main parsing tool in Prolog are DCG rules. For example, here is a rule for parsing a blockquote:

blockquote(blockquote(Opt)) -->
    "> ", any(Text), ((ln, white, ln) ; at_end), !,
    { phrase(strip_bq(Stripped), Text, "") }, !,
    { phrase(blocks(Quote), Stripped, "") },
    { element_opt(Quote, Opt) }.

The rule tries to match a text block starting with a token "> " and ending with either an empty line (ln, white, ln) or a stream end (at_end). The rule any(Text) lazily matches as much input as needed (for the end condition to match). It has basic definition:

any([]) --> [].
any([C|Cs]) --> [C], any(Cs).

Anything between {} in DCG rules are ordinary Prolog predicate calls. In the blockquote parsing rule, the call phrase(strip_bq(Stripped), Text, "") (which is also done with the help of DCG rules) removes possible blockquote tokens ("> ") from the block and phrase(blocks(Quote), Stripped, "") invokes the Markdown parser recursively as the blockquote can contain nested Markdown. While it would be possible to parse whole nested block in one pass I decided not to do so as one must keep around extensive context information about the current level of block elements. The last call element_opt(Quote, Opt) removes excessive paragraph (<p>) elements from the syntax tree.

The span-level parser is written in the similar way. It is invoked at the lowest block-level elements, which are paragraphs.

Syntax tree

The output tree of the parser is for direct usage with SWI-Prolog's html//1. It takes away the pain from emitting HTML, but unfortunately limits compatibility with other Prolog implementations. At first, I used my own specification for the tree, but later figured out it takes a lot less code to directly output a suitable tree for the html//1 predicate than convert my tree into it.

Supported Markdown features

Not all features listed in the specification are implemented. Ordered lists are not supported yet (but I'm working on it). The line break rule with two spaces at the end to insert a <br> is also not working yet. I do not like this rule, as some text editors remove excessive space (which they think are unnecessary) from line ends.

I would like to implement GitHub-styled code blocks and maybe less-strict rules for indentation. Currently, I have stuck to 4 spaces or tab indentation rule.

What is implemented so far is covered by unit tests. There are currently 75 test cases. Almost all of them test single features. It means that combination of some elements might lead to unexpected results.

Parser package

I attempted to package the code for the parser in the new package (pack) format for SWI-Prolog. I have not yet submitted the package as my documentation is incomplete yet. The source code can be downloaded from GitHub. I will soon provide means for downloading the package over the net.