md

Version: v0.20.1
Note: this package is not in the latest version of its module.
Published: Feb 14, 2024 License: BSD-2-Clause Imports: 9 Imported by: 1

Documentation

Overview

Package md implements a Markdown parser.

To use this package, call Render with one of the Codec implementations:

  • HTMLCodec converts Markdown to HTML. This is used in src.elv.sh/website/cmd/md2html, part of Elvish's website toolchain.

  • FmtCodec formats Markdown. This is used in src.elv.sh/cmd/elvmdfmt, used for formatting Markdown files in the Elvish repo.

  • TTYCodec renders Markdown in the terminal. This will be used in a help system that can be used directly from Elvish to render documentation of Elvish modules.

Why another Markdown implementation?

The Elvish project uses Markdown in the documentation ("elvdoc") for the functions and variables defined in builtin modules. These docs are then converted to HTML as part of the website; for example, you can read the docs for builtin functions and variables at https://elv.sh/ref/builtin.html.

We used to use Pandoc to convert the docs from their Markdown sources to HTML. However, we would also like to expand the elvdoc system in two ways:

  • We would like to support elvdocs in user-defined modules, not just builtin modules.

  • We would like to allow users to read elvdocs directly from the Elvish program, in the terminal, without needing a browser or an Internet connection.

With these requirements, Elvish itself needs to know how to parse Markdown sources and render them in the terminal, so we need a Go implementation instead. There is a good Go implementation, github.com/yuin/goldmark, but it is quite large: linking it into Elvish will increase the binary size by more than 1MB. (There is another popular Markdown implementation, github.com/russross/blackfriday/v2, but it doesn't support CommonMark.)

With its narrower focus, this package is much smaller than goldmark and can be easily optimized for Elvish's use cases. Compared with more than 1MB for goldmark, including Render and HTMLCodec in Elvish increases the binary size by only about 150KB. That said, the functionality provided by this package is kept as general as possible, and may be useful to other people interested in a small Markdown implementation.

Besides elvdocs, Pandoc was also used to convert all the other content on the Elvish website (https://elv.sh) to HTML. Additionally, Prettier was previously used to format all the Markdown files in the repo. Now that Elvish has its own Markdown implementation, we can use it not just for rendering elvdocs in the terminal, but also to replace Pandoc and Prettier. These external tools are decent, but using them still came with some friction:

  • Even though both are relatively easy to set up, they can still be a hindrance to casual contributors.

  • Since the behavior of these tools can change between versions, we explicitly specify their versions in both CI configurations and contributing instructions. But this creates another problem: every time these tools release new versions, we have to manually bump the versions, and every contributor also needs to manually update them in their development environments.

Replacing external tools with this package removes these frictions.

Additionally, this package is very easy to extend and optimize to suit Elvish's needs:

  • We used to customize Pandoc using a mix of shell scripts, templates and Lua scripts. While these customization options of Pandoc are well documented, they are not something people are likely to be familiar with.

    With this implementation, everything is now done with Go code.

  • The Markdown formatter is much faster than Prettier, so it's now feasible to run the formatter every time a Markdown file is saved.

Which Markdown variant does this package implement?

This package implements a large subset of the CommonMark spec, with the following omissions:

  • "\r" and "\r\n" are not supported as line endings. This can be easily worked around by converting them to "\n" first.

  • Tabs are not supported for defining block structures; use spaces instead. Tabs in other contexts are supported.

  • Among HTML entities, only a few are supported: &lt; &gt; &quot; &apos; &amp;. This is because the full list of HTML entities is very large and would inflate the binary size.

    If full support for HTML entities is desirable, this can be done by overriding the UnescapeHTML variable with html.UnescapeString.

    (Numeric character references like &#9; and &#x20; are fully supported.)

  • Setext headings are not supported; use ATX headings instead.

  • Reference links are not supported; use inline links instead.

  • Lists are always considered loose.

These omitted features are never used in Elvish's Markdown sources.
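The line-ending workaround mentioned in the list above can be sketched with the standard library alone. The helper name normalizeNewlines is hypothetical and not part of this package; the idea is simply to run it on the Markdown source before handing the text to Render.

```go
package main

import (
	"fmt"
	"strings"
)

// newlineReplacer rewrites "\r\n" and lone "\r" to "\n". "\r\n" is listed
// first so that it takes priority over the bare "\r" at the same position.
var newlineReplacer = strings.NewReplacer("\r\n", "\n", "\r", "\n")

// normalizeNewlines is a hypothetical helper (not part of package md) that
// converts all line endings to "\n", the only line ending the parser supports.
func normalizeNewlines(src string) string {
	return newlineReplacer.Replace(src)
}

func main() {
	src := "# Title\r\n\r\nSome text.\rMore text.\n"
	fmt.Printf("%q\n", normalizeNewlines(src))
}
```

Because the replacement is done in a single pass, a "\r\n" pair is never turned into two newlines.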

All implemented features pass their relevant CommonMark spec tests. See testutils_test.go for a complete list of which spec tests are skipped.

Note: the spec tests were taken from the CommonMark spec Git repo on 2022-09-26. This version is almost identical to the latest released version, CommonMark 0.30 (released 2021-06-09), with two minor changes in the syntax of HTML blocks and inline HTML comments. Once CommonMark 0.31 is released, the spec tests will be updated to follow that instead.

Is this package useful outside Elvish?

Yes! Well, hopefully. Assuming you don't use the features this package omits, it can be useful in at least the following ways:

  • The implementation is quite lightweight, so you can use it instead of a more full-featured Markdown library if small binary size is important.

    As shown above, the increase in binary size when using this package in Elvish is about 150KB, compared to more than 1MB when using github.com/yuin/goldmark. Your mileage may vary though, since the binary size increase depends on which packages the binary already includes.

  • The formatter implemented by FmtCodec is heavily fuzz-tested to ensure that it does not alter the semantics of the Markdown.

    Markdown formatting is fraught with tricky edge cases. For example, if a formatter standardizes all bullet markers to "-", it might reformat "* --" to "- --", but the latter will now be parsed as a thematic break.

    Thanks to Go's builtin fuzzing support, the formatter is able to handle many such corner cases (at least all the corner cases found by the fuzzer; take a look and try them on other formatters!). Two areas, namely nested and consecutive emphasis or strong emphasis, are so tricky to get 100% right that the formatter is not guaranteed to be correct there; the fuzz test explicitly skips those cases.

    Nonetheless, if you are writing a Markdown formatter and care about correctness, the corner cases will be interesting, regardless of which language you are using to implement the formatter.
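To make the bullet-marker pitfall above concrete, here is a small standalone sketch. It does not use this package: isThematicBreak is a hypothetical helper that follows the CommonMark rule for thematic breaks (at most three leading spaces, then three or more matching "-", "_" or "*" characters, optionally separated by spaces or tabs), and naiveSwapBullet is a deliberately naive, hypothetical marker rewrite of the kind a careless formatter might perform.

```go
package main

import (
	"fmt"
	"strings"
)

// isThematicBreak reports whether a single line is a CommonMark thematic
// break. It is a hypothetical helper for illustration only.
func isThematicBreak(line string) bool {
	trimmed := strings.TrimLeft(line, " ")
	// More than three leading spaces would make this an indented code block.
	if len(line)-len(trimmed) > 3 || trimmed == "" {
		return false
	}
	marker := trimmed[0]
	if marker != '-' && marker != '_' && marker != '*' {
		return false
	}
	count := 0
	for i := 0; i < len(trimmed); i++ {
		switch trimmed[i] {
		case marker:
			count++
		case ' ', '\t':
			// Spaces and tabs may appear between or after the markers.
		default:
			return false
		}
	}
	return count >= 3
}

// naiveSwapBullet rewrites a leading "* " bullet marker to "- " without
// checking whether the result changes meaning.
func naiveSwapBullet(line string) string {
	if strings.HasPrefix(line, "* ") {
		return "- " + line[2:]
	}
	return line
}

func main() {
	before := "* --"
	after := naiveSwapBullet(before)
	fmt.Printf("%q is a thematic break: %v\n", before, isThematicBreak(before))
	fmt.Printf("%q is a thematic break: %v\n", after, isThematicBreak(after))
}
```

A formatter that wants to preserve semantics has to detect this situation and pick a different marker, which is the kind of fallback FmtCodec documents for its bullet lists.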

Variables

var UnescapeHTML = unescapeHTML

UnescapeHTML is used by the parser to unescape HTML entities and numeric character references.

The default implementation supports numeric character references, plus a minimal set of entities that are necessary for writing valid HTML or can appear in the output of FmtCodec. It can be set to html.UnescapeString for better CommonMark compliance.
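As a rough illustration of what the fuller alternative decodes, the sketch below calls the standard library's html.UnescapeString directly. It does not import this package: the local variable unescape is only a stand-in for UnescapeHTML, on the assumption (implied by the text above) that the variable has type func(string) string.

```go
package main

import (
	"fmt"
	"html"
)

// unescape stands in for the package-level UnescapeHTML variable in this
// sketch. The type func(string) string is an assumption based on the
// documentation saying it can be set to html.UnescapeString.
var unescape func(string) string = html.UnescapeString

func main() {
	// &copy; is a named entity outside the minimal default set; the two
	// numeric references both denote U+2603.
	fmt.Println(unescape("&lt;b&gt; &amp; &copy; &#9731; &#x2603;"))
}
```

In a real program the equivalent change would be a single assignment of html.UnescapeString to UnescapeHTML before rendering, at the cost of linking in the full entity table.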

Functions

func Render

func Render(text string, codec Codec)

Render parses markdown and renders it with a Codec.

func RenderInlineContentToHTML

func RenderInlineContentToHTML(sb *strings.Builder, ops []InlineOp)

RenderInlineContentToHTML renders inline content to HTML, writing to a strings.Builder. This is useful for implementing an alternative HTML-outputting Codec.

func RenderString

func RenderString(text string, codec StringerCodec) string

RenderString calls Render(text, codec) and returns codec.String(). This can be a bit more convenient to use than Render.

Types

type Codec

type Codec interface {
	Do(Op)
}

Codec is used to render output.

type FmtCodec

type FmtCodec struct {
	Width int
	// contains filtered or unexported fields
}

FmtCodec is a codec that formats Markdown in a specific style.

The only supported configuration option is the text width.

The formatted text uses the following style:

  • Blocks are always separated by a blank line.

  • Thematic breaks use "***" where possible, falling back to "---" if using the former is problematic.

  • Code blocks are always fenced, never indented.

  • Code fences use backquotes (like "```") wherever possible, falling back to "~~~" if using the former is problematic.

  • Continuation markers of container blocks ("> " for blockquotes and spaces for list items) are never omitted; in other words, lazy continuation is never used.

  • Blockquotes use "> ", never omitting the space.

  • Bullet lists use "-" as markers where possible, falling back to "*" if using the former is problematic.

  • Ordered lists use "X." (X being a number) where possible, falling back to "X)" if using the former is problematic.

  • Bullet lists and ordered lists are indented 4 spaces where possible.

  • Emphasis always uses "*".

  • Strong emphasis always uses "**".

  • Hard line breaks always use an explicit "\".
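The "where possible, falling back" wording in the style rules above can be illustrated with the code fence rule. The sketch below is not taken from FmtCodec; chooseFence and longestRun are hypothetical helpers, and the two conditions they check are assumptions about what "problematic" could mean, derived from CommonMark: a backquote fence cannot have a backquote in its info string, and the opening fence must be longer than any run of the fence character in the code so that no line inside the block can close it early.

```go
package main

import (
	"fmt"
	"strings"
)

// longestRun returns the length of the longest run of consecutive c bytes
// found within any single line.
func longestRun(lines []string, c byte) int {
	longest := 0
	for _, line := range lines {
		run := 0
		for i := 0; i < len(line); i++ {
			if line[i] == c {
				run++
				if run > longest {
					longest = run
				}
			} else {
				run = 0
			}
		}
	}
	return longest
}

// chooseFence is a hypothetical helper that picks a fence string for a code
// block. It prefers backquotes and falls back to tildes when the info string
// contains a backquote, then makes the fence long enough to be unambiguous.
func chooseFence(info string, lines []string) string {
	c := byte('`')
	if strings.Contains(info, "`") {
		c = '~'
	}
	n := longestRun(lines, c) + 1
	if n < 3 {
		n = 3
	}
	return strings.Repeat(string(c), n)
}

func main() {
	fmt.Println(chooseFence("go", []string{"x := 1"}))
	fmt.Println(chooseFence("a`b", []string{"x := 1"}))
	fmt.Println(chooseFence("", []string{"```", "nested fence"}))
}
```

Making the fence longer than every run in the content is more conservative than strictly necessary, but it keeps the check simple.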

func (*FmtCodec) Do

func (c *FmtCodec) Do(op Op)

func (*FmtCodec) String

func (c *FmtCodec) String() string

func (*FmtCodec) Unsupported

func (c *FmtCodec) Unsupported() *FmtUnsupported

Unsupported returns information about use of unsupported features that may make the output incorrect. It returns nil if there is no use of unsupported features.

type FmtUnsupported

type FmtUnsupported struct {
	// Input contains emphasis or strong emphasis nested in another emphasis or
	// strong emphasis (not necessarily of the same type).
	NestedEmphasisOrStrongEmphasis bool
	// Input contains emphasis or strong emphasis that follows immediately after
	// another emphasis or strong emphasis (not necessarily of the same type).
	ConsecutiveEmphasisOrStrongEmphasis bool
}

FmtUnsupported contains information about use of unsupported features.

type HTMLCodec

type HTMLCodec struct {
	strings.Builder
	// If non-nil, will be called for each code block. The return value is
	// inserted into the HTML output and should be properly escaped.
	ConvertCodeBlock func(info, code string) string
}

HTMLCodec converts markdown to HTML.

func (*HTMLCodec) Do

func (c *HTMLCodec) Do(op Op)

type InlineOp

type InlineOp struct {
	Type InlineOpType
	// OpText, OpCodeSpan, OpRawHTML, OpAutolink: Text content
	// OpLinkStart, OpLinkEnd, OpImage: title text
	Text string
	// OpLinkStart, OpLinkEnd, OpImage, OpAutolink
	Dest string
	// For OpImage
	Alt string
}

InlineOp represents an inline operation.

func (InlineOp) String

func (op InlineOp) String() string

String returns the text content of the InlineOp.

type InlineOpType

type InlineOpType uint

InlineOpType enumerates possible types of an InlineOp.

const (
	// Text elements. Embedded newlines in OpText are turned into OpNewLine, but
	// OpRawHTML can contain embedded newlines. OpCodeSpan never contains
	// embedded newlines.
	OpText InlineOpType = iota
	OpCodeSpan
	OpRawHTML
	OpNewLine

	// Inline markup elements.
	OpEmphasisStart
	OpEmphasisEnd
	OpStrongEmphasisStart
	OpStrongEmphasisEnd
	OpLinkStart
	OpLinkEnd
	OpImage
	OpAutolink
	OpHardLineBreak
)

func (InlineOpType) String

func (i InlineOpType) String() string

type Op

type Op struct {
	Type OpType
	// For OpOrderedListStart (the start number) or OpHeading (as the heading
	// level)
	Number int
	// For OpHeading and OpCodeBlock
	Info string
	// For OpCodeBlock and OpHTMLBlock
	Lines []string
	// For OpParagraph and OpHeading
	Content []InlineOp
}

Op represents an operation for the Codec.

type OpType

type OpType uint

OpType enumerates possible types of an Op.

const (
	// Leaf blocks.
	OpThematicBreak OpType = iota
	OpHeading
	OpCodeBlock
	OpHTMLBlock
	OpParagraph

	// Container blocks.
	OpBlockquoteStart
	OpBlockquoteEnd
	OpListItemStart
	OpListItemEnd
	OpBulletListStart
	OpBulletListEnd
	OpOrderedListStart
	OpOrderedListEnd
)

Possible output operations.
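The container block operations come in Start/End pairs, which suggests that a Codec sees a flat stream of operations and tracks nesting itself. The sketch below illustrates that pattern with simplified local stand-ins: opType, op and outlineCodec are hypothetical types defined here, not the package's own, and the text field replaces the real Content field for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// opType and op are simplified local stand-ins for a few OpType values and
// for Op. They exist only for this sketch.
type opType uint

const (
	opHeading opType = iota
	opParagraph
	opBlockquoteStart
	opBlockquoteEnd
)

type op struct {
	typ    opType
	number int    // heading level, for opHeading
	text   string // plain text, for opHeading and opParagraph
}

// outlineCodec writes one line per block, indented by container depth.
type outlineCodec struct {
	strings.Builder
	depth int
}

// do handles one operation, adjusting the depth on Start/End pairs.
func (c *outlineCodec) do(o op) {
	indent := strings.Repeat("  ", c.depth)
	switch o.typ {
	case opBlockquoteStart:
		c.WriteString(indent + "blockquote\n")
		c.depth++
	case opBlockquoteEnd:
		c.depth--
	case opHeading:
		fmt.Fprintf(c, "%sh%d: %s\n", indent, o.number, o.text)
	case opParagraph:
		fmt.Fprintf(c, "%sp: %s\n", indent, o.text)
	}
}

// outline feeds a sequence of operations to a fresh codec and returns the
// accumulated output.
func outline(ops []op) string {
	var c outlineCodec
	for _, o := range ops {
		c.do(o)
	}
	return c.String()
}

func main() {
	fmt.Print(outline([]op{
		{typ: opHeading, number: 1, text: "Title"},
		{typ: opBlockquoteStart},
		{typ: opParagraph, text: "quoted"},
		{typ: opBlockquoteEnd},
		{typ: opParagraph, text: "after"},
	}))
}
```

Embedding strings.Builder, as HTMLCodec and TraceCodec do, gives the codec a String method for free.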

func (OpType) String

func (i OpType) String() string

type SmartPunctsCodec

type SmartPunctsCodec struct{ Inner Codec }

SmartPunctsCodec wraps another codec, converting certain ASCII punctuation marks to nicer Unicode counterparts:

  • A straight double quote (") is converted to a left double quote (“) when it follows a whitespace, or a right double quote (”) when it follows a non-whitespace.

  • A straight single quote (') is converted to a left single quote (‘) when it follows a whitespace, or a right single quote or apostrophe (’) when it follows a non-whitespace.

  • A run of two dashes (--) is converted to an en-dash (–).

  • A run of three dashes (---) is converted to an em-dash (—).

  • A run of three dots (...) is converted to an ellipsis (…).

The start of a line is treated as whitespace.
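The conversion rules above can be sketched as a plain string transformation. This is a standalone illustration, not the codec's implementation: smartPuncts is a hypothetical function that works on a string rather than on Ops, and its handling of runs longer than three dashes (consuming three at a time) is an assumption, since the rules above do not cover that case.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// smartPuncts applies the documented punctuation rules to a string. It is a
// hypothetical helper for illustration only.
func smartPuncts(s string) string {
	var sb strings.Builder
	afterSpace := true // the start of a line counts as whitespace
	for i := 0; i < len(s); {
		rest := s[i:]
		out, size := "", 1
		switch {
		case strings.HasPrefix(rest, "---"):
			out, size = "\u2014", 3 // em dash
		case strings.HasPrefix(rest, "--"):
			out, size = "\u2013", 2 // en dash
		case strings.HasPrefix(rest, "..."):
			out, size = "\u2026", 3 // ellipsis
		case rest[0] == '"' && afterSpace:
			out = "\u201c" // left double quote
		case rest[0] == '"':
			out = "\u201d" // right double quote
		case rest[0] == '\'' && afterSpace:
			out = "\u2018" // left single quote
		case rest[0] == '\'':
			out = "\u2019" // right single quote or apostrophe
		default:
			// Copy any other character through unchanged.
			_, size = utf8.DecodeRuneInString(rest)
			out = rest[:size]
		}
		sb.WriteString(out)
		// Remember whether the character just consumed was whitespace.
		r, _ := utf8.DecodeRuneInString(rest)
		afterSpace = unicode.IsSpace(r)
		i += size
	}
	return sb.String()
}

func main() {
	fmt.Println(smartPuncts(`"Hi" -- it's ... done---ok`))
}
```

Checking for three dashes before two is what keeps "---" from being read as an en dash followed by a stray hyphen.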

func (SmartPunctsCodec) Do

func (c SmartPunctsCodec) Do(op Op)

type StringerCodec

type StringerCodec interface {
	Codec
	String() string
}

StringerCodec is a Codec that also implements the String method.

type TTYCodec

type TTYCodec struct {
	Width int
	// If non-nil, will be called to highlight the content of code blocks.
	HighlightCodeBlock func(info, code string) ui.Text
	// If non-nil, will be called for each relative link destination.
	ConvertRelativeLink func(dest string) string
	// contains filtered or unexported fields
}

TTYCodec renders Markdown in a terminal.

The rendered text uses the following style:

  • Adjacent blocks are always separated with one blank line.

  • Thematic breaks are rendered as "────" (four U+2500 "box drawing light horizontal").

  • Headings are rendered like "# Heading" in bold, with the same number of hashes as in Markdown.

  • Code blocks are indented two spaces. The HighlightCodeBlock callback can be supplied to highlight the content of the code block.

  • HTML blocks are ignored.

  • Paragraphs are always reflowed to fit the given width.

  • Blockquotes start with "│ " (U+2502 "box drawing light vertical", then a space) on each line.

  • Bullet list items start with "• " (U+2022 "bullet", then a space) on the first line. Continuation lines are indented two spaces.

  • Ordered list items start with "X. " (where X is a number) on the first line. Continuation lines are indented three spaces.

  • Code spans are underlined.

  • Emphasis makes the text italic. (Some terminal emulators turn italic text into inverse text, which is not ideal but fine.)

  • Strong emphasis makes the text bold.

  • Links are rendered with their text content underlined. If the link is absolute (starts with scheme:), the destination is rendered like " (https://example.com)" after the text content.

    Relative link destinations are not shown by default, since they are usually not useful in a terminal. If the ConvertRelativeLink callback is non-nil, it is called for each relative link and non-empty return values are shown.

    The link description is ignored for now since Elvish's Markdown sources never use one.

  • Images are rendered like "Image: alt text (https://example.com/a.png)".

  • Autolinks have their text content rendered.

  • Raw HTML is mostly ignored, except that text between <kbd> and </kbd> becomes inverse video.

  • Hard line breaks are respected.

The structure of the implementation closely mirrors FmtCodec in a lot of places, without the complexity of handling all edge cases correctly, but with the slight complexity of handling styles.
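The link rule in the style list above can be sketched as a small standalone function. It is an illustration rather than TTYCodec's code: renderLink and schemePrefix are hypothetical names, underlining is left out, and showing a converted relative destination in the same " (dest)" form as an absolute one is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// schemePrefix matches a URI scheme followed by a colon, such as "https:".
var schemePrefix = regexp.MustCompile(`^[A-Za-z][A-Za-z0-9+.-]*:`)

// renderLink is a hypothetical helper that returns the plain-text form of a
// link. convertRelative plays the role of the ConvertRelativeLink callback
// and may be nil.
func renderLink(text, dest string, convertRelative func(string) string) string {
	shown := dest
	if !schemePrefix.MatchString(dest) {
		// Relative destinations are hidden unless the callback maps them
		// to something non-empty.
		shown = ""
		if convertRelative != nil {
			shown = convertRelative(dest)
		}
	}
	if shown == "" {
		return text
	}
	return text + " (" + shown + ")"
}

func main() {
	toWebsite := func(dest string) string { return "https://elv.sh/" + dest }
	fmt.Println(renderLink("Elvish", "https://elv.sh", nil))
	fmt.Println(renderLink("builtin", "ref/builtin.html", nil))
	fmt.Println(renderLink("builtin", "ref/builtin.html", toWebsite))
}
```

A callback like toWebsite is one way a help system could turn relative links between docs into something a terminal user can act on.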

func (*TTYCodec) Do

func (c *TTYCodec) Do(op Op)

Do processes an Op.

func (*TTYCodec) String

func (c *TTYCodec) String() string

String returns the rendering result as a string with ANSI escape sequences.

func (*TTYCodec) Text

func (c *TTYCodec) Text() ui.Text

Text returns the rendering result as a ui.Text.

type TextBlock

type TextBlock struct {
	Text string
	Code bool
}

TextBlock is a text block dumped by TextCodec.

type TextCodec

type TextCodec struct {
	// contains filtered or unexported fields
}

TextCodec is a codec that dumps the pure text content of Markdown.

func (*TextCodec) Blocks

func (c *TextCodec) Blocks() []TextBlock

func (*TextCodec) Do

func (c *TextCodec) Do(op Op)

type TraceCodec

type TraceCodec struct {
	strings.Builder
	// contains filtered or unexported fields
}

TraceCodec is a Codec that records all the Ops passed to its Do method.

func (*TraceCodec) Do

func (c *TraceCodec) Do(op Op)

func (*TraceCodec) Ops

func (c *TraceCodec) Ops() []Op

Directories

Path Synopsis
Command mdrun can be used to test the md package.