Documentation
Types
type LogFlag
type LogFlag uint
LogFlag is a bit-flag enum that specifies which information is logged.
const (
	// If LogEverything is set DistillerLogger will enable all logs.
	LogEverything LogFlag = LogExtraction | LogVisibility | LogPagination | LogTiming
	// If LogExtraction is set DistillerLogger will print info of each process when extracting article.
	LogExtraction LogFlag = 1 << iota
	// If LogVisibility is set DistillerLogger will print info on why an element is visible.
	LogVisibility
	// If LogPagination is set DistillerLogger will print info of pagination process.
	LogPagination
	// If LogTiming is set DistillerLogger will print info of duration of each process when extracting article.
	LogTiming
)
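Because each flag occupies its own bit, flags can be combined with `|` and tested with `&`. The sketch below redeclares the constants locally, so that it compiles without the package, and prints the resulting values. Since `LogEverything` takes the first `iota` slot, `LogExtraction` evaluates to 2 rather than 1. The `has` helper is hypothetical and not part of the package.

```go
package main

import "fmt"

// LogFlag is a local copy of the documented type, so the sketch compiles on its own.
type LogFlag uint

const (
	LogEverything LogFlag = LogExtraction | LogVisibility | LogPagination | LogTiming
	LogExtraction LogFlag = 1 << iota // iota is 1 here, so the value is 2
	LogVisibility                     // 4
	LogPagination                     // 8
	LogTiming                         // 16
)

// has reports whether every bit of want is set in flags.
// Hypothetical helper, for illustration only.
func has(flags, want LogFlag) bool {
	return flags&want == want
}

func main() {
	flags := LogExtraction | LogTiming

	fmt.Println(has(flags, LogTiming))     // true
	fmt.Println(has(flags, LogPagination)) // false
	fmt.Println(uint(LogEverything))       // 30
}
```

Requiring all bits of `want` (rather than any bit) means `has(flags, LogEverything)` is true only when every category is enabled.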
type Options
type Options struct {
	// Flags to specify which info to dump to log.
	LogFlags LogFlag
	// Original URL of the page, which is used in the heuristics in detecting
	// next/prev page links. Will be ignored if Options is used in ApplyForURL.
	OriginalURL *nurl.URL
	// Set to true to skip process for finding pagination.
	SkipPagination bool
	// Algorithm to use for next page detection.
	PaginationAlgo PaginationAlgo
}
Options is the configuration for the distiller.
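As a minimal sketch of how these fields fit together, the example below fills in an Options value from a raw URL string. It uses local copies of the documented types so that it compiles without the package, and it assumes `nurl` is an alias for the standard `net/url` package. The `newOptions` helper and the example URL are hypothetical.

```go
package main

import (
	"fmt"
	nurl "net/url"
)

// Local copies of the documented declarations, so the sketch compiles on its own.
type LogFlag uint

const (
	LogEverything LogFlag = LogExtraction | LogVisibility | LogPagination | LogTiming
	LogExtraction LogFlag = 1 << iota
	LogVisibility
	LogPagination
	LogTiming
)

type PaginationAlgo uint

const (
	PrevNext PaginationAlgo = iota
	PageNumber
)

type Options struct {
	LogFlags       LogFlag
	OriginalURL    *nurl.URL
	SkipPagination bool
	PaginationAlgo PaginationAlgo
}

// newOptions is a hypothetical helper that builds Options for a page URL.
// It logs extraction steps and timings and selects the PageNumber algorithm.
func newOptions(rawURL string) (*Options, error) {
	u, err := nurl.ParseRequestURI(rawURL)
	if err != nil {
		return nil, fmt.Errorf("parse original URL: %w", err)
	}
	return &Options{
		LogFlags:       LogExtraction | LogTiming,
		OriginalURL:    u,
		PaginationAlgo: PageNumber,
	}, nil
}

func main() {
	opts, err := newOptions("https://example.com/article?page=2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(opts.OriginalURL.Host, opts.SkipPagination)
}
```

Leaving `SkipPagination` at its zero value (`false`) keeps pagination detection enabled, so `OriginalURL` remains relevant to the next/prev heuristics.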
type PaginationAlgo
type PaginationAlgo uint
PaginationAlgo is the algorithm used to find the pagination links.
const (
	// PrevNext is the algorithm to find pagination links that works by scoring each anchor
	// in the document using various heuristics on its href, text, class name and ID. It's quite
	// accurate and is used as the default algorithm. Unfortunately it uses a lot of regular
	// expressions, so it's a bit slow.
	PrevNext PaginationAlgo = iota
	// PageNumber is the algorithm to find pagination links that works by collecting groups of
	// adjacent plain text numbers and outlinks with numeric anchor text. A lot faster than
	// PrevNext, but also less accurate.
	PageNumber
)
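Since `PrevNext` is declared with `iota` at position zero, it is also the zero value of PaginationAlgo, which is consistent with the comment describing it as the default. The sketch below shows this with a local copy of the declarations; the `String` method is a hypothetical addition for readable output and is not documented as part of the package.

```go
package main

import "fmt"

// PaginationAlgo is a local copy of the documented type, so the sketch compiles on its own.
type PaginationAlgo uint

const (
	PrevNext PaginationAlgo = iota
	PageNumber
)

// String returns a readable name for the algorithm.
// Hypothetical method, for illustration only.
func (a PaginationAlgo) String() string {
	switch a {
	case PrevNext:
		return "PrevNext"
	case PageNumber:
		return "PageNumber"
	default:
		return fmt.Sprintf("PaginationAlgo(%d)", uint(a))
	}
}

func main() {
	var algo PaginationAlgo // zero value, never assigned

	fmt.Println(algo)       // PrevNext
	fmt.Println(PageNumber) // PageNumber
}
```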
type Result
type Result struct {
	// URL is the URL of the processed page.
	URL string
	// Title is the title of the processed page.
	Title string
	// MarkupInfo is the metadata of the page. The metadata is extracted following three markup
	// specifications: OpenGraphProtocol, IEReadingView and SchemaOrg. For now, OpenGraph protocol
	// takes precedence because it uses specific meta tags and is hence the fastest. The other
	// specifications are used as fallbacks in case some metadata is not found.
	MarkupInfo data.MarkupInfo
	// TimingInfo is the record of the time it takes to do each step in the process of content extraction.
	TimingInfo data.TimingInfo
	// PaginationInfo contains links to the previous and next partial pages. This is useful for long
	// articles that may be partitioned into several partial pages by the webmaster.
	PaginationInfo data.PaginationInfo
	// WordCount is the count of words within the document.
	WordCount int
	// Node is the *html.Node which contains the distilled content.
	Node *html.Node
	// Text is the string which contains the distilled content in text format.
	Text string
	// ContentImages is the list of image URLs used within the distilled content.
	ContentImages []string
}
Result is the final output of the distiller.
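A caller will typically read a few of these fields after distillation. The sketch below is one way to summarize a Result, under the assumption that only the plain-typed fields are needed. The local `Result` type deliberately omits `MarkupInfo`, `TimingInfo`, `PaginationInfo` and `Node`, and the `summarize` helper is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Result mirrors only the plain-typed fields of the documented struct,
// so the sketch compiles without the package or its data types.
type Result struct {
	URL           string
	Title         string
	WordCount     int
	Text          string
	ContentImages []string
}

// summarize renders a one-line overview of a distilled page.
// Hypothetical helper, for illustration only.
func summarize(r *Result) string {
	if r == nil {
		return "no result"
	}
	title := strings.TrimSpace(r.Title)
	if title == "" {
		title = "(untitled)"
	}
	return fmt.Sprintf("%s <%s>: %d words, %d images",
		title, r.URL, r.WordCount, len(r.ContentImages))
}

func main() {
	r := &Result{
		URL:           "https://example.com/a",
		Title:         " Hello ",
		WordCount:     120,
		ContentImages: []string{"a.png", "b.png"},
	}
	fmt.Println(summarize(r)) // Hello <https://example.com/a>: 120 words, 2 images
}
```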
func ApplyForFile
ApplyForFile runs the distiller for the specified file.
func ApplyForReader
ApplyForReader runs the distiller for the specified io.Reader.