README
html-to-markdown
Convert HTML into Markdown with Go. It uses an HTML parser to avoid regular expressions as much as possible.
That prevents many edge cases and allows it to be used where the input is completely unknown.
Installation
go get github.com/JohannesKaufmann/html-to-markdown
Usage
import md "github.com/JohannesKaufmann/html-to-markdown"
converter := md.NewConverter("", true, nil)
html := `<strong>Important</strong>`
markdown, err := converter.ConvertString(html)
if err != nil {
	log.Fatal(err)
}
fmt.Println("md ->", markdown)
If you are already using goquery, you can pass a selection to Convert, which returns the Markdown string directly.
markdown := converter.Convert(selec)
Options
The third parameter to md.NewConverter is *md.Options.
For example, you can change the delimiter around bold text from "**" (the default) to "__" by setting StrongDelimiter.
opt := &md.Options{
	StrongDelimiter: "__", // default: **
	// ...
}
converter := md.NewConverter("", true, opt)
For all available options, look at the godocs, and for an example, look at the example.
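To make the effect of StrongDelimiter concrete, here is a small stand-alone sketch. It does not use the library: `options` and `renderStrong` are hypothetical names, and treating a nil or empty option as the documented default `**` is an assumption.

```go
package main

import "fmt"

// options is a hypothetical stand-in for md.Options with a single field.
type options struct {
	StrongDelimiter string
}

// renderStrong wraps content in the configured delimiter.
// Assumption: a nil options value or an empty delimiter selects the
// documented default "**".
func renderStrong(content string, opt *options) string {
	delim := "**"
	if opt != nil && opt.StrongDelimiter != "" {
		delim = opt.StrongDelimiter
	}
	return delim + content + delim
}

func main() {
	fmt.Println(renderStrong("Important", nil))
	fmt.Println(renderStrong("Important", &options{StrongDelimiter: "__"}))
}
```

The same pattern (an optional struct whose zero values mean "use the default") is what lets you pass nil as the third argument when you do not need to customize anything.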
Adding Rules
converter.AddRules(
	md.Rule{
		Filter: []string{"del", "s", "strike"},
		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			// You need to return a pointer to a string (md.String is just a helper function).
			// If you return nil the next function for that html element
			// will be picked. For example you could only convert an element
			// if it has a certain class name and fallback if not.
			content = strings.TrimSpace(content)
			return md.String("~" + content + "~")
		},
	},
	// more rules
)
For more information have a look at the example add_rules.
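The nil-means-fallback behaviour described in the comments above can be modelled without the library. The following is only a sketch: `element`, `replacement`, `apply`, `onlyHighlighted` and the `==` delimiter are hypothetical, and the order in which the real converter consults its rules is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// element is a hypothetical stand-in for *goquery.Selection.
type element struct {
	Tag   string
	Class string
}

// replacement returns a pointer to the Markdown text, or nil to signal
// "not handled, try the next function".
type replacement func(content string, el element) *string

// str is a tiny helper in the spirit of md.String.
func str(s string) *string { return &s }

// apply tries each replacement in order and returns the first non-nil result.
// Assumption: the real converter may consult its rules in a different order.
func apply(content string, el element, fns ...replacement) string {
	for _, fn := range fns {
		if res := fn(content, el); res != nil {
			return *res
		}
	}
	return content
}

// onlyHighlighted handles an element only when it has the class "highlight".
func onlyHighlighted(content string, el element) *string {
	if el.Class != "highlight" {
		return nil // fall back to the next function
	}
	return str("==" + strings.TrimSpace(content) + "==")
}

// strike is the fallback and behaves like the rule shown above.
func strike(content string, el element) *string {
	return str("~" + strings.TrimSpace(content) + "~")
}

func main() {
	fmt.Println(apply(" old ", element{Tag: "del"}, onlyHighlighted, strike))
	fmt.Println(apply(" new ", element{Tag: "del", Class: "highlight"}, onlyHighlighted, strike))
}
```

Returning a pointer rather than a plain string is what makes this possible: an empty string is a valid conversion result, so nil is needed as a separate "I decline" signal.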
Using Plugins
If you want plugins (GitHub Flavored Markdown features such as strikethrough, tables, ...), you can pass them to Use.
import "github.com/JohannesKaufmann/html-to-markdown/plugin"
// Use the `GitHubFlavored` plugin from the `plugin` package.
converter.Use(plugin.GitHubFlavored())
Or, if you only want the Strikethrough plugin, pass it on its own. Its first argument sets the delimiter
that surrounds crossed-out text (for example "~~" instead of "~").
converter.Use(plugin.Strikethrough(""))
For more information have a look at the example github_flavored.
Writing Plugins
Have a look at the plugin folder for a reference implementation. The most basic one is Strikethrough.
Other Methods
func (c *Converter) Keep(tags ...string) *Converter
Determines which elements are to be kept and rendered as HTML.
func (c *Converter) Remove(tags ...string) *Converter
Determines which elements are to be removed altogether, i.e. converted to an empty string.
Issues
If you find HTML snippets (or even full websites) that don't produce the expected results, please open an issue!
Related Projects
- turndown (js), a very good library written in JavaScript.
- lunny/html2md, which uses regex instead of goquery. I came across a few edge cases when using it (leftover HTML comments, ...), so I wrote my own.
Documentation
Overview
Package md converts html to markdown.
converter := md.NewConverter("", true, nil)
html := `<strong>Important</strong>`
markdown, err := converter.ConvertString(html)
if err != nil {
	log.Fatal(err)
}
fmt.Println("md ->", markdown)
Or if you are already using goquery:
markdown := converter.Convert(selec)
Index
- Variables
- func DomainFromURL(rawURL string) string
- func IsBlockElement(e string) bool
- func IsInlineElement(e string) bool
- func String(text string) *string
- type AdvancedResult
- type Converter
- func (c *Converter) AddRules(rules ...Rule) *Converter
- func (c *Converter) Convert(selec *goquery.Selection) string
- func (c *Converter) ConvertBytes(bytes []byte) ([]byte, error)
- func (c *Converter) ConvertReader(reader io.Reader) (bytes.Buffer, error)
- func (c *Converter) ConvertResponse(res *http.Response) (string, error)
- func (c *Converter) ConvertString(html string) (string, error)
- func (c *Converter) ConvertURL(url string) (string, error)
- func (c *Converter) Keep(tags ...string) *Converter
- func (c *Converter) Remove(tags ...string) *Converter
- func (c *Converter) Sanitize(html string) string
- func (c *Converter) Use(plugins ...Plugin) *Converter
- type Options
- type Plugin
- type Rule
Variables
var Timeout = time.Second * 10
Timeout for the http client
Functions
func DomainFromURL
DomainFromURL removes the path from the url.
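The idea can be sketched with net/url alone. `domainFromURL` below is a hypothetical re-implementation, not the package's code; the description above does not say whether the scheme is kept, so returning the bare host is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// domainFromURL returns only the host part of rawURL, dropping the scheme,
// path and query. Hypothetical sketch, not the package's implementation.
func domainFromURL(rawURL string) string {
	rawURL = strings.TrimSpace(rawURL)
	if u, err := url.Parse(rawURL); err == nil && u.Host != "" {
		return u.Host
	}
	// Without a scheme, url.Parse treats the input as a path,
	// so retry with a scheme prepended.
	if u, err := url.Parse("http://" + rawURL); err == nil {
		return u.Host
	}
	return ""
}

func main() {
	fmt.Println(domainFromURL("https://example.com/some/path?x=1"))
	fmt.Println(domainFromURL("example.com/path"))
}
```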
func IsBlockElement
func IsInlineElement
Types
type AdvancedResult
type Converter
type Converter struct {
	Before func(selec *goquery.Selection)
	// contains filtered or unexported fields
}
Converter is initialized by NewConverter.
func NewConverter
NewConverter initializes a new converter and holds all the rules.
- `domain` is used for links and images to convert relative URLs ("/image.png") to absolute URLs.
- CommonMark is the default set of rules. Set enableCommonmark to false if you want to customize everything using AddRules and don't want to fall back to the default rules.
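What "convert relative URLs to absolute URLs" means for the `domain` parameter can be sketched with net/url. `absoluteURL` is a hypothetical helper, and prefixing the domain with the `http` scheme is an assumption; the converter itself may choose differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURL resolves rawURL against domain.
// Assumptions: domain carries no scheme and "http" is used for it.
func absoluteURL(domain, rawURL string) (string, error) {
	base, err := url.Parse("http://" + domain)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	// ResolveReference leaves already-absolute URLs untouched.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	for _, link := range []string{"/image.png", "https://cdn.example.org/logo.png"} {
		abs, err := absoluteURL("example.com", link)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(link, "->", abs)
	}
}
```

With an empty domain, as in the usage examples above, there is nothing to resolve against, which is why passing a domain matters when the output will be read outside the original site.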
func (*Converter) Convert
Convert returns the content from a goquery selection. If you have a goquery document just pass in doc.Selection.
func (*Converter) ConvertBytes
ConvertBytes returns the content from a html byte array.
func (*Converter) ConvertReader
ConvertReader returns the content from a reader and returns a buffer.
func (*Converter) ConvertResponse
ConvertResponse returns the content from a html response.
func (*Converter) ConvertString
ConvertString returns the content from a html string. If you already have a goquery selection use `Convert`.
func (*Converter) ConvertURL
ConvertURL returns the content from the page with that url.
type Options
type Options struct {
	PreSanitize bool // sanitise the input before the conversion

	// "setext" or "atx"
	// default: "atx"
	HeadingStyle string

	// Any Thematic break
	// default: "* * *"
	HorizontalRule string

	// "-", "+", or "*"
	// default: "-"
	BulletListMarker string

	// "indented" or "fenced"
	// default: "indented"
	CodeBlockStyle string

	// ``` or ~~~
	// default: ```
	Fence string

	// _ or *
	// default: _
	EmDelimiter string

	// ** or __
	// default: **
	StrongDelimiter string

	// inlined or referenced
	// default: inlined
	LinkStyle string

	// full, collapsed, or shortcut
	// default: full
	LinkReferenceStyle string
}
Options to customize the output. You can change stuff like the character that is used for strong text.
type Rule
type Rule struct {
	Filter              []string
	Replacement         func(content string, selec *goquery.Selection, options *Options) *string
	AdvancedReplacement func(content string, selec *goquery.Selection, options *Options) (res AdvancedResult, skip bool)
}
Rule to convert certain html tags to markdown.
md.Rule{
	Filter: []string{"del", "s", "strike"},
	Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
		// You need to return a pointer to a string (md.String is just a helper function).
		// If you return nil the next function for that html element
		// will be picked. For example you could only convert an element
		// if it has a certain class name and fallback if not.
		return md.String("~" + content + "~")
	},
}
Directories
Path | Synopsis |
---|---|
escape | Package escape escapes characters that are commonly used in markdown like the * for strong/italic. |
examples | |
plugin | Package plugin contains all the rules that are not part of Commonmark like GitHub Flavored Markdown. |