markup

package

Version: v0.0.0-...-25b8d04 (latest)
Published: Sep 26, 2024
License: MIT
Imports: 6
Imported by: 0

Repository

github.com/markusmobius/go-domdistiller

Documentation ¶

Index ¶

type Accessor
type Parser
- func NewParser(root *html.Node, timingInfo *data.TimingInfo) *Parser

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Accessor ¶

type Accessor interface {
	// Title returns the markup title of the document, empty if none.
	Title() string

	// Type returns the markup type of the document, empty if none.
	Type() string

	// URL returns the markup url of the document, empty if none.
	URL() string

	// Images returns the properties of all markup images in the document.
	// The first image is the dominant (i.e. top or salient) one.
	Images() []data.MarkupImage

	// Description returns the markup description of the document, empty if none.
	Description() string

	// Publisher returns the markup publisher of the document, empty if none.
	Publisher() string

	// Copyright returns the markup copyright of the document, empty if none.
	Copyright() string

	// Author returns the full name of the markup author, empty if none.
	Author() string

	// Article returns the properties of the markup "article" object, nil if none.
	Article() *data.MarkupArticle

	// OptOut returns true if page owner has opted out of distillation.
	OptOut() bool
}

Accessor is the interface that all parsers must implement so that Parser can retrieve their properties.
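As an illustration of what implementing this interface might involve, here is a minimal sketch of an accessor backed by fixed values. It is not taken from the package: `MarkupImage` and `MarkupArticle` are local stand-ins for `data.MarkupImage` and `data.MarkupArticle` with assumed fields, and `staticAccessor` and `newExampleAccessor` are hypothetical names.

```go
package main

import "fmt"

// MarkupImage and MarkupArticle are local stand-ins for data.MarkupImage and
// data.MarkupArticle. Their fields are assumptions, not the real definitions.
type MarkupImage struct {
	URL    string
	Width  int
	Height int
}

type MarkupArticle struct {
	Section string
	Authors []string
}

// Accessor mirrors the documented interface, using the stand-in types above.
type Accessor interface {
	Title() string
	Type() string
	URL() string
	Images() []MarkupImage
	Description() string
	Publisher() string
	Copyright() string
	Author() string
	Article() *MarkupArticle
	OptOut() bool
}

// staticAccessor is a hypothetical implementation that returns fixed values.
// Fields left at their zero value model the "empty if none" cases.
type staticAccessor struct {
	title, pageType, url    string
	description, publisher  string
	copyright, author       string
	images                  []MarkupImage
	article                 *MarkupArticle
	optOut                  bool
}

func (a *staticAccessor) Title() string           { return a.title }
func (a *staticAccessor) Type() string            { return a.pageType }
func (a *staticAccessor) URL() string             { return a.url }
func (a *staticAccessor) Images() []MarkupImage   { return a.images }
func (a *staticAccessor) Description() string     { return a.description }
func (a *staticAccessor) Publisher() string       { return a.publisher }
func (a *staticAccessor) Copyright() string       { return a.copyright }
func (a *staticAccessor) Author() string          { return a.author }
func (a *staticAccessor) Article() *MarkupArticle { return a.article }
func (a *staticAccessor) OptOut() bool            { return a.optOut }

// Compile-time check that staticAccessor satisfies Accessor.
var _ Accessor = (*staticAccessor)(nil)

// newExampleAccessor builds an accessor with a title and one image,
// and no article, author, or opt-out flag.
func newExampleAccessor() Accessor {
	return &staticAccessor{
		title:  "Example title",
		url:    "https://example.com/post",
		images: []MarkupImage{{URL: "https://example.com/a.png", Width: 640, Height: 480}},
	}
}

func main() {
	acc := newExampleAccessor()
	fmt.Println("title:", acc.Title())
	fmt.Println("images:", len(acc.Images()))
	fmt.Println("has article:", acc.Article() != nil)
}
```

The example leaves the article unset so that `Article()` returns nil, which matches the "nil if none" contract in the interface comments.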

type Parser ¶

type Parser struct {
	// contains filtered or unexported fields
}

Parser loads the different parsers that are based on different markup specifications, and allows retrieval of different distillation-related markup properties from a document. It retrieves the requested properties from one or more parsers. If necessary, it may merge the information from multiple parsers.

Currently, three markup formats are supported: OpenGraphProtocol, IEReadingView and SchemaOrg. For now, OpenGraphProtocolParser takes precedence because it uses specific meta tags and hence extracts information the fastest; it also demands conformance to rules. If those rules are broken, or the properties retrieved are nil or empty, Parser falls back to SchemaOrg and then to IEReadingView.

The properties that matter to distilled content are:

- individual properties: title, page type, page url, description, publisher, author, copyright
- dominant and inline images and their properties: url, secure_url, type, caption, width, height
- article and its properties: section name, published time, modified time, expiration time, authors

TODO: for some properties, e.g. dominant and inline images, we might want to retrieve from multiple parsers; IEReadingViewParser provides more information as it scans all images in the document. If we do so, we would need to merge the multiple versions in a meaningful way.
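The precedence described above (OpenGraph first, then SchemaOrg, then IEReadingView) can be pictured as a "first non-empty value wins" lookup over an ordered list of accessors. The sketch below is one way to express that idea and is not the package's actual implementation: the reduced `accessor` interface, the `fixed` type, and the `firstNonEmpty` and `exampleChain` helpers are all hypothetical.

```go
package main

import "fmt"

// accessor is a reduced, hypothetical stand-in for the Accessor interface,
// limited to two string properties to keep the sketch short.
type accessor interface {
	Title() string
	Author() string
}

// fixed is a hypothetical accessor that returns preset values.
type fixed struct{ title, author string }

func (f fixed) Title() string  { return f.title }
func (f fixed) Author() string { return f.author }

// firstNonEmpty walks the accessors in priority order and returns the first
// non-empty value produced by get, or "" if every accessor returns "".
func firstNonEmpty(ordered []accessor, get func(accessor) string) string {
	for _, a := range ordered {
		if v := get(a); v != "" {
			return v
		}
	}
	return ""
}

// exampleChain returns accessors in the documented precedence order.
func exampleChain() []accessor {
	return []accessor{
		fixed{title: "OG title"},                    // stands in for OpenGraphProtocol
		fixed{title: "Schema title", author: "Ann"}, // stands in for SchemaOrg
		fixed{author: "Bob"},                        // stands in for IEReadingView
	}
}

func main() {
	chain := exampleChain()
	fmt.Println(firstNonEmpty(chain, accessor.Title))
	fmt.Println(firstNonEmpty(chain, accessor.Author))
}
```

In this sketch the title comes from the first accessor, while the author falls through to the second because the first returns an empty string. Merging values from several parsers, as the TODO suggests for images, would need a different strategy than this simple fallback.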

func NewParser ¶

func NewParser(root *html.Node, timingInfo *data.TimingInfo) *Parser

func (*Parser) Article ¶

func (ps *Parser) Article() *data.MarkupArticle

func (*Parser) Author ¶

func (ps *Parser) Author() string

func (*Parser) Copyright ¶

func (ps *Parser) Copyright() string

func (*Parser) Description ¶

func (ps *Parser) Description() string

func (*Parser) Images ¶

func (ps *Parser) Images() []data.MarkupImage

func (*Parser) MarkupInfo ¶

func (ps *Parser) MarkupInfo() data.MarkupInfo

func (*Parser) OptOut ¶

func (ps *Parser) OptOut() bool

func (*Parser) Publisher ¶

func (ps *Parser) Publisher() string

func (*Parser) Title ¶

func (ps *Parser) Title() string

func (*Parser) Type ¶

func (ps *Parser) Type() string

func (*Parser) URL ¶

func (ps *Parser) URL() string

Directories ¶

Path	Synopsis
iereader
opengraph
schemaorg
