Documentation ¶
Overview ¶
Example ¶
// URL to extract contents (title, description, images, ...)
url := "https://en.wikipedia.org/wiki/Lego"

// Default option
opt := readability.NewOption()

// You can modify some option values if needed.
opt.ImageRequestTimeout = 3000 // ms

content, err := readability.Extract(url, opt)
if err != nil {
	log.Fatal(err)
}

log.Println(content.Title)
log.Println(content.Description)
log.Println(content.Images)
Types ¶
type Content ¶
Content contains the primary readable content of a webpage.
func ExtractFromDocument ¶
ExtractFromDocument returns the extracted Content on success, or an error otherwise. reqURL is required to convert relative image paths to absolute URLs.
If you already have a *goquery.Document from an earlier HTTP request, use this function; otherwise use Extract(reqURL, opt).
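The role of reqURL can be illustrated with the standard library alone. The following is a minimal sketch of resolving an img src value against the page URL; absoluteImageURL is a hypothetical helper written for this illustration, not a function of this package, and the package's own resolution logic may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteImageURL resolves src (as found in an img tag) against reqURL.
// Hypothetical helper for illustration only; not part of the package API.
func absoluteImageURL(reqURL, src string) (string, error) {
	base, err := url.Parse(reqURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	// ResolveReference follows the RFC 3986 rules for relative references.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://en.wikipedia.org/wiki/Lego"
	// The src values below are made-up examples.
	sources := []string{
		"/static/images/logo.png",      // path-absolute
		"images/brick.jpg",             // relative to the page path
		"//upload.wikimedia.org/a.png", // scheme-relative
		"https://example.com/b.png",    // already absolute
	}
	for _, src := range sources {
		abs, err := absoluteImageURL(page, src)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Without the page URL, the first three src values above could not be turned into URLs that can be requested, which is why reqURL must accompany an already-parsed document.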
type Option ¶
type Option struct {
	// RetryLength is the minimum length for a page description.
	// Extraction is retried with a more liberal rule
	// if the extracted description is shorter than this value.
	RetryLength int

	// MinTextLength is the minimum length of the inner text of a tag.
	// If a tag's inner text is shorter than MinTextLength,
	// the text is discarded from the page description candidates.
	MinTextLength int

	// RemoveUnlikelyCandidates is a flag whether to remove some tags
	// if they are considered relatively unimportant.
	RemoveUnlikelyCandidates bool

	// WeightClasses is a flag whether to give more/less weight to some tags
	// if they contain some positive/negative words in id/class value.
	WeightClasses bool

	// CleanConditionally is a flag whether to remove some tags
	// using various rules in conditionalCleanReason().
	CleanConditionally bool

	// RemoveEmptyNodes is a flag whether to remove tags which have empty inner text.
	RemoveEmptyNodes bool

	// MinImageWidth is the minimum width (pixels) for choosing images.
	MinImageWidth uint32

	// MinImageHeight is the minimum height (pixels) for choosing images.
	MinImageHeight uint32

	// MaxImageCount is the maximum number of images for a web page.
	MaxImageCount int

	// CheckImageLoopCount is the number of images
	// for parallel requests to fetch the image size.
	// For example, if this value is set to 10,
	// the first 10 img src URLs without width/height attributes
	// will be requested over the network.
	// (img tags with both width/height attributes (pixels in int) are not counted,
	// since they are not requested over the network to get the image size.)
	CheckImageLoopCount uint

	// ImageRequestTimeout is the timeout (ms) for a single image request.
	ImageRequestTimeout uint

	// IgnoreImageFormat is an array of strings for ignoring some images.
	// If an image URL contains at least one of the strings in this array, the image will be ignored.
	IgnoreImageFormat []string

	// DescriptionAsPlainText is a flag whether to strip all tags in a description value.
	DescriptionAsPlainText bool

	// DescriptionExtractionTimeout is the timeout (ms) for extracting the description of a page.
	DescriptionExtractionTimeout uint
}
Option contains a variety of options for extracting page content and images.
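The substring rule documented for IgnoreImageFormat can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: shouldIgnore and filterImages are hypothetical names, the URLs are made up, and skipping empty entries is a choice of this sketch (an empty string is contained in every URL).

```go
package main

import (
	"fmt"
	"strings"
)

// shouldIgnore reports whether imgURL contains at least one of the
// substrings in ignore. Hypothetical helper, not part of the package API.
func shouldIgnore(imgURL string, ignore []string) bool {
	for _, s := range ignore {
		// Assumption of this sketch: empty entries are skipped,
		// because strings.Contains(x, "") is always true.
		if s != "" && strings.Contains(imgURL, s) {
			return true
		}
	}
	return false
}

// filterImages returns the URLs that are not ignored, in their original order.
func filterImages(urls, ignore []string) []string {
	kept := make([]string, 0, len(urls))
	for _, u := range urls {
		if !shouldIgnore(u, ignore) {
			kept = append(kept, u)
		}
	}
	return kept
}

func main() {
	// Example values only; choose substrings that fit the pages you extract.
	ignore := []string{".gif", "/ads/"}
	urls := []string{
		"https://example.com/img/photo.jpg",
		"https://example.com/img/spinner.gif",
		"https://example.com/ads/banner.png",
	}
	for _, u := range filterImages(urls, ignore) {
		fmt.Println(u)
	}
}
```

Because the match is a plain substring test rather than an extension check, an entry such as ".gif" also matches URLs where that text appears in the path or query string.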