Documentation
Overview
Package readability is a Go package that finds the main readable content of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.
This package is based on Readability.js by Mozilla and was ported line by line so that it looks and behaves as similarly as possible. As a result, web pages that Readability.js can parse should also be parseable by go-readability.
Functions
func Check
Check checks whether the input is readable without parsing the whole thing. It is a wrapper for `Parser.Check()`, useful if you only need the default parser.
func CheckDocument
CheckDocument checks whether the document is readable without parsing the whole thing. It is a wrapper for `Parser.CheckDocument()`, useful if you only need the default parser.
func WithUserAgent
func WithUserAgent(userAgent string) requestWith
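The signature suggests a functional-option style request modifier. The sketch below illustrates that pattern with the standard library only. It assumes `requestWith` is a function type that mutates an outgoing `*http.Request`, which the documentation above does not confirm; `requestOption`, `withUserAgent`, and `newRequest` are local stand-ins, not the package's own code.

```go
package main

import (
	"fmt"
	"net/http"
)

// requestOption is a local stand-in for the package's requestWith type.
// Assumption: it is a function that modifies a request before it is sent.
type requestOption func(*http.Request)

// withUserAgent mirrors the documented signature of WithUserAgent.
func withUserAgent(userAgent string) requestOption {
	return func(r *http.Request) {
		r.Header.Set("User-Agent", userAgent)
	}
}

// newRequest builds a GET request and applies each option in order.
func newRequest(url string, opts ...requestOption) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	for _, opt := range opts {
		opt(req)
	}
	return req, nil
}

func main() {
	req, err := newRequest("https://example.com/article", withUserAgent("my-crawler/1.0"))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("User-Agent")) // prints "my-crawler/1.0"
}
```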
Types
type Article
type Article struct {
    Title         string
    Byline        string
    Node          *html.Node
    Content       string
    TextContent   string
    Length        int
    Excerpt       string
    SiteName      string
    Image         string
    Favicon       string
    Language      string
    PublishedTime *time.Time
    ModifiedTime  *time.Time
    HttpCode      int
}
Article is the final readable content.
func FromDocument
FromDocument parses a document and returns the readable content. It is a wrapper for `Parser.ParseDocument()`, useful if you only need the default parser.
func FromReader
FromReader parses an `io.Reader` and returns the readable content. It is a wrapper for `Parser.Parse()`, useful if you only need the default parser.
type Parser
type Parser struct {
    // MaxElemsToParse is the max number of nodes supported by this
    // parser. Default: 0 (no limit)
    MaxElemsToParse int
    // NTopCandidates is the number of top candidates to consider when
    // analysing how tight the competition is among candidates.
    NTopCandidates int
    // CharThresholds is the default number of chars an article must
    // have in order to return a result
    CharThresholds int
    // ClassesToPreserve are the classes that readability sets itself.
    ClassesToPreserve []string
    // KeepClasses specify whether the classes should be stripped or not.
    KeepClasses bool
    // TagsToScore is element tags to score by default.
    TagsToScore []string
    // Debug determines if the log should be printed or not. Default: false.
    Debug bool
    // DisableJSONLD determines if metadata in JSON+LD will be extracted
    // or not. Default: false.
    DisableJSONLD bool
    // AllowedVideoRegex is a regular expression that matches video URLs that should be
    // allowed to be included in the article content. If undefined, it will use default filter.
    AllowedVideoRegex *regexp.Regexp
    // contains filtered or unexported fields
}
Parser is the parser that parses the page to get the readable content.
func NewParser
func NewParser() Parser
NewParser returns a new Parser set up with default values.
func (*Parser) CheckDocument
CheckDocument checks whether the document is readable without parsing the whole thing.