md

package module
v0.0.0-...-3a729f6 Latest
Published: Oct 31, 2024 License: MIT Imports: 17 Imported by: 0

README

html-to-markdown

Gopher, the mascot of Golang, is wearing a party hat and holding a balloon. Next to the Gopher is a machine that converts characters associated with HTML to characters associated with Markdown.

Convert HTML into Markdown with Go. The library uses an HTML parser and avoids regular expressions as much as possible. This prevents many edge-case failures and makes it suitable for input whose structure is not known in advance.

Installation

go get github.com/tomkosm/html-to-markdown

Usage

package main

import (
	"fmt"
	"log"

	md "github.com/tomkosm/html-to-markdown"
)

func main() {
	// "" is the domain used to resolve relative URLs; true enables the default CommonMark rules.
	converter := md.NewConverter("", true, nil)

	html := `<strong>Important</strong>`

	markdown, err := converter.ConvertString(html)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("md ->", markdown)
}

If you are already using goquery you can pass a selection to Convert.

// Convert returns only the markdown string, without an error value.
markdown := converter.Convert(selec)

Using it on the command line

If you want to use html-to-markdown on the command line without writing any Go code, check out html2md, a CLI wrapper for html-to-markdown that has all of the following options and plugins built in.

Options

The third parameter to md.NewConverter is *md.Options.

For example, you can change the delimiter placed around bold text from the default "**" to "__" by setting StrongDelimiter.

opt := &md.Options{
  StrongDelimiter: "__", // default: **
  // ...
}
converter := md.NewConverter("", true, opt)

For all available options, see the godoc. For a worked example, see the example.

Adding Rules

converter.AddRules(
  md.Rule{
    Filter: []string{"del", "s", "strike"},
    Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
      // You need to return a pointer to a string (md.String is just a helper function).
      // If you return nil the next function for that html element
      // will be picked. For example you could only convert an element
      // if it has a certain class name and fallback if not.
      content = strings.TrimSpace(content)
      return md.String("~" + content + "~")
    },
  },
  // more rules
)

For more information have a look at the example add_rules.

Using Plugins

If you want plugins (GitHub Flavored Markdown features like strikethrough, tables, ...) you can pass them to Use.

import "github.com/tomkosm/html-to-markdown/plugin"

// Use the `GitHubFlavored` plugin from the `plugin` package.
converter.Use(plugin.GitHubFlavored())

If you only want the Strikethrough plugin, you can use it on its own. You can change the character that marks crossed-out text by setting the first argument to a different value (for example "~~" instead of "~").

converter.Use(plugin.Strikethrough(""))

For more information have a look at the example github_flavored.


These are the plugins located in the plugin folder which you can use by importing "github.com/tomkosm/html-to-markdown/plugin".

Name Description
GitHubFlavored GitHub Flavored Markdown, which contains TaskListItems, Strikethrough and Table.
TaskListItems (Included in GitHubFlavored). Converts <input> checkboxes into - [x] Task.
Strikethrough (Included in GitHubFlavored). Converts <strike>, <s>, and <del> to the ~~ syntax.
Table (Included in GitHubFlavored). Converts a <table> into something like this...
TableCompat
VimeoEmbed
YoutubeEmbed
ConfluenceCodeBlock Converts <ac:structured-macro> elements that are used in Atlassian’s Wiki "Confluence".
ConfluenceAttachments Converts <ri:attachment ri:filename=""/> elements.

These are the plugins in other repositories:

Name Description
[Plugin Name](Your Link) A short description

If you write a plugin, feel free to open a PR that adds your plugin to this list.

Writing Plugins

Have a look at the plugin folder for a reference implementation. The most basic one is Strikethrough.

Security

This library produces markdown that is readable and can be changed by humans.

Once you convert this markdown back to HTML (e.g. using goldmark or blackfriday) you need to be careful of malicious content.

This library does NOT sanitize untrusted content. Use an HTML sanitizer such as bluemonday before displaying the HTML in the browser.

Other Methods

Godoc

func (c *Converter) Keep(tags ...string) *Converter

Determines which elements are to be kept and rendered as HTML.

func (c *Converter) Remove(tags ...string) *Converter

Determines which elements are to be removed altogether i.e. converted to an empty string.

Escaping

Some characters have a special meaning in markdown. For example, the character "*" can be used for lists, emphasis and dividers. By placing a backslash before that character (e.g. "\*") you can "escape" it. The character will then render as a raw "*" without the "markdown meaning" applied.

But why is "escaping" even necessary?

Paragraph 1
-
Paragraph 2

The markdown above doesn't seem that problematic. But "Paragraph 1" (with only one hyphen below) will be recognized as a setext heading.

<h2>Paragraph 1</h2>
<p>Paragraph 2</p>

A well-placed backslash character would prevent that...

Paragraph 1
\-
Paragraph 2
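The idea behind that backslash can be sketched with the standard library alone. The helper below, escapeSetextUnderline, is a hypothetical name and not part of html-to-markdown; it only illustrates the technique under the assumption that any line made up solely of "-" or "=" characters after a non-blank line should be escaped.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeSetextUnderline is a hypothetical helper, not the library's code.
// It prefixes a backslash to every line that consists only of "-" or "="
// characters and directly follows a non-blank line, so that a Markdown
// parser no longer reads the pair as a setext heading.
func escapeSetextUnderline(markdown string) string {
	lines := strings.Split(markdown, "\n")
	for i := 1; i < len(lines); i++ {
		prev := strings.TrimSpace(lines[i-1])
		cur := strings.TrimSpace(lines[i])
		if prev == "" || cur == "" {
			continue
		}
		if strings.Trim(cur, "-") == "" || strings.Trim(cur, "=") == "" {
			lines[i] = `\` + lines[i]
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(escapeSetextUnderline("Paragraph 1\n-\nParagraph 2"))
}
```

A hyphen on the very first line is left alone in this sketch, because there is no preceding text it could turn into a heading.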

How to configure escaping? Depending on the EscapeMode option, the markdown output is going to be different.

opt = &md.Options{
	EscapeMode: "basic", // default
}

Let's try it out with this HTML input:

input: <p>fake **bold** and real <strong>bold</strong></p>

With EscapeMode "basic"
output: fake \*\*bold\*\* and real **bold**
rendered: fake **bold** and real bold

With EscapeMode "disabled"
output: fake **bold** and real **bold**
rendered: fake bold and real bold

With basic escaping, we get some escape characters (the backslash "\") but it renders correctly.

With escaping disabled, the fake and real bold can't be distinguished in the markdown, so both will render as bold.
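To make the difference concrete, here is a minimal sketch of character escaping for plain text. It is not the library's escape package: escapeText and its three-character set are assumptions chosen only to reproduce the "fake bold" case, and the library's own escaping may cover more cases.

```go
package main

import (
	"fmt"
	"strings"
)

// markdownEscaper covers a deliberately small, hypothetical character set.
var markdownEscaper = strings.NewReplacer(
	`\`, `\\`,
	`*`, `\*`,
	`_`, `\_`,
)

// escapeText escapes text that came from an HTML text node, so that it
// cannot be mistaken for markdown syntax.
func escapeText(text string) string {
	return markdownEscaper.Replace(text)
}

func main() {
	fake := escapeText("fake **bold**") // text node: escaped
	strong := "**" + "bold" + "**"      // <strong> element: delimiters added unescaped
	fmt.Println(fake + " and real " + strong)
}
```

Only text content goes through the escaper; the delimiters a rule adds for a real element are written as they are, which is what keeps the two cases apart.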


So now you know the purpose of escaping. However, if you encounter some content where the escaping breaks, you can manually disable it. But please also open an issue!

Issues

If you find HTML snippets (or even full websites) that don't produce the expected results, please open an issue!

Contributing & Testing

Please first discuss the change you wish to make, by opening an issue. I'm also happy to guide you to where a change is most likely needed.

Note: The outside API should not change because of backwards compatibility...

You don't have to be afraid of breaking the converter, since there are many "Golden File Tests":

Add your problematic HTML snippet to one of the input.html files in the testdata folder. Then run go test -update and check in Git which .golden files changed.

You can now change the internal logic and inspect what impact your change has by running go test -update again.

Note: Before submitting your change as a PR, make sure that you run those tests and check the files into Git...
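For readers unfamiliar with golden file tests, the general pattern looks roughly like the sketch below. It does not reproduce this repository's test code: checkGolden, runDemo and the file name output.golden are hypothetical, and the update argument stands in for what an -update flag would toggle.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// checkGolden compares got with the contents of the golden file at path.
// When update is true it rewrites the golden file instead of comparing.
func checkGolden(path string, got []byte, update bool) error {
	if update {
		return os.WriteFile(path, got, 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if !bytes.Equal(got, want) {
		return fmt.Errorf("output differs from %s:\n got: %q\nwant: %q", path, got, want)
	}
	return nil
}

// runDemo records a golden file in a temporary directory, then checks a
// matching and a non-matching output against it.
func runDemo() error {
	dir, err := os.MkdirTemp("", "golden")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	golden := filepath.Join(dir, "output.golden") // hypothetical file name
	if err := checkGolden(golden, []byte("**Important**"), true); err != nil {
		return err
	}
	if err := checkGolden(golden, []byte("**Important**"), false); err != nil {
		return err
	}
	if err := checkGolden(golden, []byte("__Important__"), false); err == nil {
		return fmt.Errorf("expected a mismatch to be reported")
	}
	return nil
}

func main() {
	if err := runDemo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("golden file check behaves as expected")
}
```

Because updating rewrites the expected output, the diff of the golden files in version control becomes the review of a behavior change.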

License

This project is licensed under the terms of the MIT license.

Documentation

Overview

Package md converts html to markdown.

converter := md.NewConverter("", true, nil)

html := `<strong>Important</strong>`

markdown, err := converter.ConvertString(html)
if err != nil {
  log.Fatal(err)
}
fmt.Println("md ->", markdown)

Or if you are already using goquery:

markdown := converter.Convert(selec)

Constants

This section is empty.

Variables

var Timeout = time.Second * 10

Timeout for the http client

Functions

func AddSpaceIfNessesary

func AddSpaceIfNessesary(selec *goquery.Selection, markdown string) string

AddSpaceIfNessesary adds spaces to the text based on its neighbors. This makes sure that there is always a space next to the delimiter, so that the delimiter can be recognized.

func CalculateCodeFence

func CalculateCodeFence(fenceChar rune, content string) string

CalculateCodeFence can be passed the content of a code block and it returns how many fence characters (` or ~) should be used.

This is useful if the HTML content itself contains the fence characters, for example ```. See https://stackoverflow.com/a/49268657
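One common way to pick a safe fence is to make it longer than the longest run of the fence character inside the content. The function below, fenceFor, is a sketch of that technique under the assumption of a three-character minimum; it is not a copy of CalculateCodeFence and may differ from it in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// fenceFor returns a fence that is longer than any run of fenceChar found
// in content, with a minimum length of three characters.
func fenceFor(fenceChar rune, content string) string {
	longest, current := 0, 0
	for _, r := range content {
		if r == fenceChar {
			current++
			if current > longest {
				longest = current
			}
		} else {
			current = 0
		}
	}
	n := 3
	if longest >= n {
		n = longest + 1
	}
	return strings.Repeat(string(fenceChar), n)
}

func main() {
	fmt.Println(fenceFor('`', "fmt.Println(1)")) // no backticks: three are enough
	fmt.Println(fenceFor('`', "a ``` b"))        // run of three inside: use four
}
```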

func CollectText

func CollectText(n *html.Node) string

CollectText returns the text of the node and all its children

func DefaultGetAbsoluteURL

func DefaultGetAbsoluteURL(selec *goquery.Selection, rawURL string, domain string) string

DefaultGetAbsoluteURL is the default function and can be overridden through `GetAbsoluteURL` in the options.
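As an illustration of what such a function has to do, the sketch below resolves a relative URL against a domain with net/url. The name absoluteURL is hypothetical, the goquery selection of the real signature is left out, and both the bare-host form of domain and the "http" scheme are assumptions of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURL is a hypothetical stand-in for a GetAbsoluteURL function.
// It assumes domain is a bare host such as "domain.com" and that "http"
// is an acceptable scheme.
func absoluteURL(rawURL string, domain string) string {
	u, err := url.Parse(rawURL)
	if err != nil || domain == "" || u.IsAbs() {
		// Leave unparsable or already absolute URLs untouched.
		return rawURL
	}
	base := &url.URL{Scheme: "http", Host: domain}
	return base.ResolveReference(u).String()
}

func main() {
	fmt.Println(absoluteURL("/page.html", "domain.com"))
	fmt.Println(absoluteURL("https://other.org/a.png", "domain.com"))
}
```

A custom function with the real signature could apply the same idea and, for instance, rewrite image URLs to point at a proxy.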

func DomainFromURL

func DomainFromURL(rawURL string) string

DomainFromURL returns `u.Host` from the parsed url.
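Reading the host from a parsed URL has one consequence worth seeing in code: input without a scheme is parsed as a path and yields an empty host. The sketch below uses only net/url; hostOf is a hypothetical name, and returning "" on a parse error is an assumption, not the documented behavior of DomainFromURL.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostOf returns the Host field of the parsed URL, or "" if parsing fails.
func hostOf(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	return u.Host
}

func main() {
	fmt.Printf("%q\n", hostOf("https://example.com/page.html"))
	fmt.Printf("%q\n", hostOf("example.com/page.html")) // no scheme: parsed as a path
}
```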

func EscapeMultiLine

func EscapeMultiLine(content string) string

EscapeMultiLine deals with multiline content inside a link

func IndentMultiLineListItem

func IndentMultiLineListItem(opt *Options, text string, spaces int) string

IndentMultiLineListItem makes sure that multiline list items are properly indented.
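The reason indentation matters is that continuation lines of a list item must be indented to stay part of that item. A minimal sketch of the idea follows; indentContinuation is a hypothetical helper that omits the Options parameter of the real function and simply assumes that blank lines stay empty.

```go
package main

import (
	"fmt"
	"strings"
)

// indentContinuation indents every non-blank line after the first by the
// given number of spaces, so the lines remain inside the list item.
func indentContinuation(text string, spaces int) string {
	lines := strings.Split(text, "\n")
	indent := strings.Repeat(" ", spaces)
	for i := 1; i < len(lines); i++ {
		if strings.TrimSpace(lines[i]) != "" {
			lines[i] = indent + lines[i]
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	item := indentContinuation("first line\n\nsecond paragraph", 2)
	fmt.Println("- " + item)
}
```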

func IndexWithText

func IndexWithText(s *goquery.Selection) int

IndexWithText is similar to goquery's Index function but returns the index of the current element while NOT counting the empty elements beforehand.

func IsInlineElement

func IsInlineElement(e string) bool

IsInlineElement can be used to check whether a node name (goquery.NodeName) is an HTML inline element and not a block element. It is used in the rule for the p tag to check whether the text is inside a block element.

func String

func String(text string) *string

String is a helper function to return a pointer.

func TrimTrailingSpaces

func TrimTrailingSpaces(text string) string

TrimTrailingSpaces removes unnecessary spaces from the end of lines.

func TrimpLeadingSpaces

func TrimpLeadingSpaces(text string) string

TrimpLeadingSpaces removes spaces from the beginning of a line but makes sure that list items and code blocks are not affected.

Types

type AdvancedResult

type AdvancedResult struct {
	Header   string
	Markdown string
	Footer   string
}

AdvancedResult is used, for example, for links. If you set LinkStyle to "referenced", the link href is placed at the bottom of the generated markdown (Footer).
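How inline text and footer entries might come together can be sketched as follows. The result type only mirrors the three fields of AdvancedResult; referencedLink and assemble are hypothetical, Header is ignored for brevity, and the way the converter actually joins the parts is not shown in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// result mirrors the three fields of AdvancedResult for illustration.
type result struct {
	Header   string
	Markdown string
	Footer   string
}

// referencedLink is a hypothetical rule body: the link text stays inline
// and the href moves to the footer.
func referencedLink(text, href string, index int) result {
	return result{
		Markdown: fmt.Sprintf("[%s][%d]", text, index),
		Footer:   fmt.Sprintf("[%d]: %s", index, href),
	}
}

// assemble joins the inline parts and appends the collected footers at
// the bottom. This sketch ignores the Header field.
func assemble(parts []result) string {
	var body strings.Builder
	var footers []string
	for _, p := range parts {
		body.WriteString(p.Markdown)
		if p.Footer != "" {
			footers = append(footers, p.Footer)
		}
	}
	out := body.String()
	if len(footers) > 0 {
		out += "\n\n" + strings.Join(footers, "\n")
	}
	return out
}

func main() {
	fmt.Println(assemble([]result{
		{Markdown: "Read the "},
		referencedLink("docs", "https://example.com/docs", 1),
		{Markdown: " first."},
	}))
}
```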

type Afterhook

type Afterhook func(markdown string) string

Afterhook runs after the converter and can be used to transform the resulting markdown

type BeforeHook

type BeforeHook func(selec *goquery.Selection)

BeforeHook runs before the converter and can be used to transform the original html

type Converter

type Converter struct {
	// contains filtered or unexported fields
}

Converter is initialized by NewConverter.

func NewConverter

func NewConverter(domain string, enableCommonmark bool, options *Options) *Converter

NewConverter initializes a new converter and holds all the rules.

  • `domain` is used for links and images to convert relative urls ("/image.png") to absolute urls.
  • CommonMark is the default set of rules. Set enableCommonmark to false if you want to customize everything using AddRules and do not want to fall back to the default rules.

func (*Converter) AddRules

func (conv *Converter) AddRules(rules ...Rule) *Converter

AddRules adds the rules that are passed in to the converter.

By default it overrides the rule for that html tag. You can fall back to the default rule by returning nil.
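The override-then-fall-back behavior can be pictured as a chain of replacement functions. The sketch below is standalone and does not use the library: the replacement type drops the goquery selection and options of the real signature, and the "most recently added rule first" order is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// replacement mirrors the shape of a rule's Replacement function, minus
// the goquery selection and options that the real signature carries.
type replacement func(content string, class string) *string

// str plays the role of md.String: it returns a pointer to its argument.
func str(s string) *string { return &s }

// apply tries the most recently added replacement first and moves on to
// the next one whenever a replacement returns nil.
func apply(rules []replacement, content, class string) string {
	for i := len(rules) - 1; i >= 0; i-- {
		if out := rules[i](content, class); out != nil {
			return *out
		}
	}
	return content
}

// demoRules returns a stand-in default rule followed by a custom rule.
func demoRules() []replacement {
	return []replacement{
		func(content, class string) *string { return str("~~" + content + "~~") },
		func(content, class string) *string {
			if class != "fancy" {
				return nil // fall back to the rule above
			}
			return str("~" + strings.TrimSpace(content) + "~")
		},
	}
}

func main() {
	fmt.Println(apply(demoRules(), " old ", "fancy")) // handled by the custom rule
	fmt.Println(apply(demoRules(), "old", ""))        // custom rule returns nil
}
```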

func (*Converter) After

func (conv *Converter) After(hooks ...Afterhook) *Converter

After registers a hook that is run after the conversion. It can be used to transform the markdown document that is about to be returned.

For example, the default after hook trims the returned markdown.
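An after hook is just a function from markdown to markdown, so a chain of hooks is a simple fold over the document. The sketch below is independent of the library: afterHook mirrors the Afterhook type, while runAfter, trimHook, addNewline and the registration-order execution are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// afterHook mirrors the Afterhook type: it receives the markdown and
// returns a transformed version.
type afterHook func(markdown string) string

// runAfter applies the hooks in the order they were passed in.
func runAfter(markdown string, hooks ...afterHook) string {
	for _, hook := range hooks {
		markdown = hook(markdown)
	}
	return markdown
}

// trimHook stands in for a default hook that trims the result.
func trimHook(markdown string) string { return strings.TrimSpace(markdown) }

// addNewline is a custom hook that ends the document with one newline.
func addNewline(markdown string) string { return markdown + "\n" }

func main() {
	fmt.Printf("%q\n", runAfter("\n# Title\n\n", trimHook, addNewline))
}
```

With this model, clearing the hooks and registering only your own ones amounts to calling runAfter with a different list.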

func (*Converter) Before

func (conv *Converter) Before(hooks ...BeforeHook) *Converter

Before registers a hook that is run before the conversion. It can be used to transform the original goquery html document.

For example, the default before hook adds an index to every link, so that the `a` tag rule (for LinkStyle "referenced" with LinkReferenceStyle "full") can use an incremental number.

func (*Converter) ClearAfter

func (conv *Converter) ClearAfter() *Converter

ClearAfter clears the current after hooks (including the default after hooks).

func (*Converter) ClearBefore

func (conv *Converter) ClearBefore() *Converter

ClearBefore clears the current before hooks (including the default before hooks).

func (*Converter) Convert

func (conv *Converter) Convert(selec *goquery.Selection) string

Convert returns the content from a goquery selection. If you have a goquery document just pass in doc.Selection.

func (*Converter) ConvertBytes

func (conv *Converter) ConvertBytes(bytes []byte) ([]byte, error)

ConvertBytes returns the content from a html byte array.

func (*Converter) ConvertReader

func (conv *Converter) ConvertReader(reader io.Reader) (bytes.Buffer, error)

ConvertReader converts the HTML read from a reader and returns the result as a buffer.

func (*Converter) ConvertResponse

func (conv *Converter) ConvertResponse(res *http.Response) (string, error)

ConvertResponse returns the content from a html response.

func (*Converter) ConvertString

func (conv *Converter) ConvertString(html string) (string, error)

ConvertString returns the content from a html string. If you already have a goquery selection use `Convert`.

func (*Converter) ConvertURL

func (conv *Converter) ConvertURL(url string) (string, error)

ConvertURL returns the content from the page with that url.

func (*Converter) InitializeCommonMarkRules

func (c *Converter) InitializeCommonMarkRules() []Rule

func (*Converter) Keep

func (conv *Converter) Keep(tags ...string) *Converter

Keep certain html tags in the generated output.

func (*Converter) Remove

func (conv *Converter) Remove(tags ...string) *Converter

Remove certain html tags from the source.

func (*Converter) Use

func (conv *Converter) Use(plugins ...Plugin) *Converter

Use can be used to add additional functionality to the converter. It is used when rules alone are not sufficient, for example in plugins.

type Options

type Options struct {
	// "setext" or "atx"
	// default: "atx"
	HeadingStyle string

	// Any Thematic break
	// default: "* * *"
	HorizontalRule string

	// "-", "+", or "*"
	// default: "-"
	BulletListMarker string

	// "indented" or "fenced"
	// default: "indented"
	CodeBlockStyle string

	// ``` or ~~~
	// default: ```
	Fence string

	// _ or *
	// default: _
	EmDelimiter string

	// ** or __
	// default: **
	StrongDelimiter string

	// inlined or referenced
	// default: inlined
	LinkStyle string

	// full, collapsed, or shortcut
	// default: full
	LinkReferenceStyle string

	// basic, disabled
	// default: basic
	EscapeMode string

	// GetAbsoluteURL parses the `rawURL` and adds the `domain` to convert relative (/page.html)
	// urls to absolute urls (http://domain.com/page.html).
	//
	// The default is `DefaultGetAbsoluteURL`, unless you override it. That can also
	// be useful if you want to proxy the images.
	GetAbsoluteURL func(selec *goquery.Selection, rawURL string, domain string) string
	// contains filtered or unexported fields
}

Options customize the output. For example, you can change the delimiter that is used for strong text.

type Plugin

type Plugin func(conv *Converter) []Rule

Plugin can be used to extend functionality beyond what is offered by CommonMark.

type Rule

type Rule struct {
	Filter              []string
	Replacement         func(content string, selec *goquery.Selection, options *Options) *string
	AdvancedReplacement func(content string, selec *goquery.Selection, options *Options) (res AdvancedResult, skip bool)
}

Rule to convert certain html tags to markdown.

md.Rule{
  Filter: []string{"del", "s", "strike"},
  Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
    // You need to return a pointer to a string (md.String is just a helper function).
    // If you return nil the next function for that html element
    // will be picked. For example you could only convert an element
    // if it has a certain class name and fallback if not.
    return md.String("~" + content + "~")
  },
}

Directories

Path Synopsis
Package escape escapes characters that are commonly used in markdown like the * for strong/italic.
examples
Package plugin contains all the rules that are not part of Commonmark like GitHub Flavored Markdown.
