Parsing Markdown in Prolog
Last couple of days I have been working on a Markdown parser. I have been using Textile for couple of years and I have grown to hate that line breaks in my source become line breaks in HTML. I like to do manual text layout in my text files for better readability but layout in HTML should be done automatically. I have thus switched over to Markdown and also wanted a Markdown parser to be integrated with this blog.
I could not find any Markdown parser already written in Prolog. I then decided to write my own parser. The basic Markdown format does not look complex but there is no formal grammar. Many existing Markdown parsers use nest of regexes, some are based on parser generators (see the StackOverflow question). Another interesting source of information were articles in these series. As with many parsers I also decided to split parsing into two phases: block level and span level. Block level parser recognizes structures such as blockquotes, lists, headings, code blocks and paragraphs. Span level parser deals with inline elements such as strong and emphasis formatting and links.
Parsing in Prolog
The main parsing tool in Prolog are DCG rules. For example, here is a rule for parsing a blockquote:
blockquote(blockquote(Opt)) -->
"> ", any(Text), ((ln, white, ln) ; at_end), !,
{ phrase(strip_bq(Stripped), Text, "") }, !,
{ phrase(blocks(Quote), Stripped, "") },
{ element_opt(Quote, Opt) }.
The rule tries to match a text block starting with a token "> "
and ending with either an empty line (ln, white, ln
) or a stream end (at_end
). The rule any(Text)
lazily matches as much input as needed (for the end condition to match).
It has very simple definition:
any([]) --> [].
any([C|Cs]) --> [C], any(Cs).
Anything between {}
in DCG rules are ordinary Prolog predicate calls. In the blockquote parsing rule, the call
phrase(strip_bq(Stripped), Text, "")
(which is also done with the help of DCG rules) removes possible blockquote tokens
("> "
) from the block and phrase(blocks(Quote), Stripped, "")
invokes the Markdown parser
recursively as the blockquote can contain nested Markdown. While it would be possible to parse whole nested block in one pass I decided to not do so
as one must keep around extensive context information about the current level of block elements. The last call
element_opt(Quote, Opt)
removes excessive paragraph (<p>
) elements from the syntax tree.
The span-level parser is written in the similar way. It is invoked at the lowest block-level elements which are paragraphs.
Syntax tree
The output tree of the parser is for direct usage with SWI-Prolog's html//1.
It takes away the pain from emitting HTML but unfortunately limits compatibility with other Prolog implementations. At first I used my own
specification for the tree but later figured out it takes a lot less code to directly output suitable tree for the html//1
predicate
than convert my tree into it.
Supported Markdown features
Not all features listed in the specification are implemented. Ordered lists are not
supported yet (but I'm working on it). The line break rule with two spaces at the end to insert a <br>
is also not working
yet. I do not like this rule as some text editors remove excessive space (which they think are unnecessary) from line ends.
I would like to implement GitHub-styled code blocks and maybe less-strict rules for indentation. Right now I have sticked to 4 spaces or tab indentation rule.
What is implemented so far is covered by unit tests. There are currently 75 test cases. Almost all of them test single features. It means that combination of some elements might lead to unexpected results.
Parser package
I attempted to package the code for the parser in the new package (pack) format for SWI-Prolog. I have not yet submitted the package as my documentation is not complete yet. The source code can be downloaded from GitHub. I will soon provide means for downloading the package over the net.
I have updated my blog to use the new parser. This is the first article written using it.