package distiller

v0.0.0-...-977eb4a Published: Feb 10, 2023 License: MIT Imports: 14 Imported by: 0

Repository

github.com/omnivore-app/go-domdistiller

Documentation ¶

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type LogFlag ¶

type LogFlag uint

LogFlag is an enum that specifies the logging level.

const (
	// If LogEverything is set, DistillerLogger enables all logs.
	LogEverything LogFlag = LogExtraction | LogVisibility | LogPagination | LogTiming

	// If LogExtraction is set, DistillerLogger prints info about each step of article extraction.
	LogExtraction LogFlag = 1 << iota

	// If LogVisibility is set, DistillerLogger prints info on why an element is visible.
	LogVisibility

	// If LogPagination is set, DistillerLogger prints info about the pagination process.
	LogPagination

	// If LogTiming is set, DistillerLogger prints the duration of each step of article extraction.
	LogTiming
)
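
LogFlag values are bit flags, so they combine with | and are tested with &. The sketch below copies the constant block into a standalone program so it runs without the package. Because LogEverything occupies the first slot of the block, iota is already 1 when LogExtraction is declared, so the individual flags are 2, 4, 8 and 16 rather than starting at 1. The has helper is hypothetical and not part of the package.

```go
package main

import "fmt"

// LogFlag and its constants are copied from the declaration above so that
// this sketch is self-contained; real code would use the package's own.
type LogFlag uint

const (
	LogEverything LogFlag = LogExtraction | LogVisibility | LogPagination | LogTiming
	LogExtraction LogFlag = 1 << iota
	LogVisibility
	LogPagination
	LogTiming
)

// has reports whether every bit of want is set in flags.
// It is a hypothetical helper, not part of the package.
func has(flags, want LogFlag) bool {
	return flags&want == want
}

func main() {
	// Enable extraction and timing logs only.
	flags := LogExtraction | LogTiming

	fmt.Println(uint(flags))               // 18
	fmt.Println(has(flags, LogTiming))     // true
	fmt.Println(has(flags, LogPagination)) // false
	fmt.Println(has(LogEverything, flags)) // true
}
```

A combined value such as the one above is what would be placed in the LogFlags field of Options.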

type Options ¶

type Options struct {
	// Flags to specify which info to dump to log.
	LogFlags LogFlag

	// Original URL of the page, which is used by the heuristics that detect
	// next/prev page links. Ignored if Options is used in ApplyForURL.
	OriginalURL *nurl.URL

	// Set to true to skip the process of finding pagination.
	SkipPagination bool

	// Algorithm to use for next page detection.
	PaginationAlgo PaginationAlgo
}

Options is the configuration for the distiller.
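
OriginalURL is a *nurl.URL, so callers that start from a plain string need to parse it first. The sketch below assumes nurl is an alias for the standard net/url package and that an absolute URL (scheme plus host) is what the pagination heuristics want; neither point is stated in the documentation above. parseOriginalURL is a hypothetical helper, not part of the package.

```go
package main

import (
	"errors"
	"fmt"
	nurl "net/url" // assumed to be what the package aliases as nurl
)

// parseOriginalURL is a hypothetical helper that turns a raw string into an
// absolute *nurl.URL suitable for the OriginalURL field of Options.
func parseOriginalURL(raw string) (*nurl.URL, error) {
	u, err := nurl.Parse(raw)
	if err != nil {
		return nil, err
	}
	// Relative references parse without error, so reject them explicitly.
	if u.Scheme == "" || u.Host == "" {
		return nil, errors.New("original URL must be absolute: " + raw)
	}
	return u, nil
}

func main() {
	u, err := parseOriginalURL("https://example.com/articles/long-read?page=2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// With the real package, u would be assigned to Options.OriginalURL.
	fmt.Println(u.Host, u.Path)

	// A relative path is rejected by the helper.
	_, err = parseOriginalURL("/articles/long-read")
	fmt.Println(err != nil)
}
```

Since the field is documented as ignored by ApplyForURL, this step only matters when calling Apply, ApplyForFile or ApplyForReader.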

type PaginationAlgo ¶

type PaginationAlgo uint

PaginationAlgo is the algorithm to find the pagination links.

const (
	// PrevNext is the algorithm that finds pagination links by scoring each anchor
	// in the document using various heuristics on its href, text, class name and ID. It is quite
	// accurate and is used as the default algorithm. Unfortunately, it uses a lot of regular
	// expressions, so it is a bit slow.
	PrevNext PaginationAlgo = iota

	// PageNumber is the algorithm that finds pagination links by collecting groups of adjacent plain
	// text numbers and outlinks with numeric anchor text. It is a lot faster than PrevNext, but also
	// less accurate.
	PageNumber
)
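
Since PrevNext is declared with iota at zero, it is the zero value of PaginationAlgo, which is consistent with it being the default: an Options value that never sets PaginationAlgo selects PrevNext. The sketch below copies the two constants into a standalone program to show this. The String method and the pickAlgo helper are hypothetical additions for illustration and are not listed in the package documentation.

```go
package main

import "fmt"

// PaginationAlgo and its constants are copied from the declaration above so
// that this sketch is self-contained.
type PaginationAlgo uint

const (
	PrevNext PaginationAlgo = iota
	PageNumber
)

// String is a hypothetical method added here for readable output.
func (a PaginationAlgo) String() string {
	switch a {
	case PrevNext:
		return "PrevNext"
	case PageNumber:
		return "PageNumber"
	default:
		return fmt.Sprintf("PaginationAlgo(%d)", uint(a))
	}
}

// pickAlgo is a hypothetical helper that trades accuracy for speed on request.
func pickAlgo(preferSpeed bool) PaginationAlgo {
	if preferSpeed {
		return PageNumber
	}
	return PrevNext
}

func main() {
	var zero PaginationAlgo // an unset field holds this value

	fmt.Println(zero)            // PrevNext
	fmt.Println(pickAlgo(true))  // PageNumber
	fmt.Println(pickAlgo(false)) // PrevNext
}
```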

type Result ¶

type Result struct {
	// URL is the URL of the processed page.
	URL string

	// Title is the title of the processed page.
	Title string

	// MarkupInfo is the metadata of the page. The metadata is extracted following three markup
	// specifications: OpenGraphProtocol, IEReadingView and SchemaOrg. For now, OpenGraphProtocol
	// takes precedence because it uses specific meta tags and is hence the fastest. The other
	// specifications are used as fallbacks when some metadata is not found.
	MarkupInfo data.MarkupInfo

	// TimingInfo is the record of the time it takes to do each step in the process of content extraction.
	TimingInfo data.TimingInfo

	// PaginationInfo contains links to the previous and next partial pages. This is useful for long
	// articles that may be partitioned into several partial pages by the webmaster.
	PaginationInfo data.PaginationInfo

	// WordCount is the count of words within the document.
	WordCount int

	// Node is the *html.Node that contains the distilled content.
	Node *html.Node

	// Text is the string that contains the distilled content in text format.
	Text string

	// ContentImages is the list of image URLs used within the distilled content.
	ContentImages []string
}

Result is the final output of the distiller.

func Apply ¶

func Apply(doc *html.Node, opts *Options) (*Result, error)

Apply runs the distiller for the specified parsed document.

func ApplyForFile ¶

func ApplyForFile(path string, opts *Options) (*Result, error)

ApplyForFile runs the distiller for the specified file.

func ApplyForReader ¶

func ApplyForReader(r io.Reader, opts *Options) (*Result, error)

ApplyForReader runs the distiller for the specified io.Reader.

func ApplyForURL ¶

func ApplyForURL(url string, timeout time.Duration, opts *Options) (*Result, error)

ApplyForURL runs the distiller for the specified URL.
