markup

package
v0.0.0-...-25b8d04
Warning

This package is not in the latest version of its module.

Published: Sep 26, 2024 License: MIT Imports: 6 Imported by: 0

Documentation


Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Accessor

type Accessor interface {
	// Title returns the markup title of the document, empty if none.
	Title() string

	// Type returns the markup type of the document, empty if none.
	Type() string

	// URL returns the markup url of the document, empty if none.
	URL() string

	// Images returns the properties of all markup images in the document.
	// The first image is the dominant (i.e. top or salient) one.
	Images() []data.MarkupImage

	// Description returns the markup description of the document, empty if none.
	Description() string

	// Publisher returns the markup publisher of the document, empty if none.
	Publisher() string

	// Copyright returns the markup copyright of the document, empty if none.
	Copyright() string

	// Author returns the full name of the markup author, empty if none.
	Author() string

	// Article returns the properties of the markup "article" object, nil if none.
	Article() *data.MarkupArticle

	// OptOut returns true if the page owner has opted out of distillation.
	OptOut() bool
}

Accessor is the interface that all parsers must implement so that Parser can retrieve their properties.
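
As an illustration of what implementing this contract looks like, here is a minimal sketch. It does not use the real Accessor: because the definitions of data.MarkupImage and data.MarkupArticle are not shown here, the sketch declares a hypothetical subset named basicAccessor with only the methods that use built-in types. The metaParser type and its map keys are likewise assumptions made for the example.

```go
package main

import "fmt"

// basicAccessor is a hypothetical subset of Accessor, limited to methods
// whose signatures use only built-in types.
type basicAccessor interface {
	Title() string
	Type() string
	URL() string
	Description() string
	OptOut() bool
}

// metaParser is a hypothetical parser backed by a map of already-extracted
// meta values. The key names are illustrative only.
type metaParser struct {
	meta map[string]string
}

// Each string accessor returns "" when the key is absent, which matches the
// "empty if none" convention in the interface comments.
func (p *metaParser) Title() string       { return p.meta["title"] }
func (p *metaParser) Type() string        { return p.meta["type"] }
func (p *metaParser) URL() string         { return p.meta["url"] }
func (p *metaParser) Description() string { return p.meta["description"] }

// OptOut reports true only when the hypothetical "opt-out" key is "true".
func (p *metaParser) OptOut() bool { return p.meta["opt-out"] == "true" }

// Compile-time check that *metaParser satisfies basicAccessor.
var _ basicAccessor = (*metaParser)(nil)

func main() {
	p := &metaParser{meta: map[string]string{"title": "Hello", "type": "article"}}
	fmt.Printf("%q %q %q %v\n", p.Title(), p.Type(), p.URL(), p.OptOut())
}
```

The blank-identifier assignment makes the compiler reject the program if a method is missing, which is a common way to keep a parser in step with the interface it is meant to satisfy.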

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser loads the parsers for the different markup specifications and retrieves distillation-related markup properties from a document. It takes each requested property from one or more parsers and, if necessary, may merge the information from several of them.

Currently, three markup formats are supported: OpenGraphProtocol, IEReadingView and SchemaOrg. For now, OpenGraphProtocolParser takes precedence because it uses specific meta tags and hence extracts information the fastest; it also demands conformance to rules. If the rules are broken or the properties retrieved are nil or empty, we fall back to SchemaOrg and then IEReadingView.
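
The precedence rule described above can be sketched as a "first non-empty value wins" loop. This is only an illustration under stated assumptions, not the package's actual code: titleAccessor, staticParser and firstTitle are hypothetical names, and the sketch covers a single string property.

```go
package main

import "fmt"

// titleAccessor is a hypothetical, trimmed-down stand-in for Accessor
// that exposes only the Title property.
type titleAccessor interface {
	Title() string
}

// staticParser is a hypothetical parser that returns a fixed title.
type staticParser struct{ title string }

func (p staticParser) Title() string { return p.title }

// firstTitle walks the parsers in precedence order and returns the first
// non-empty title, or "" if every parser comes up empty.
func firstTitle(parsers ...titleAccessor) string {
	for _, p := range parsers {
		if t := p.Title(); t != "" {
			return t
		}
	}
	return ""
}

func main() {
	openGraph := staticParser{title: ""} // e.g. no usable title found
	schemaOrg := staticParser{title: "Title from SchemaOrg"}
	ieReadingView := staticParser{title: "Title from IEReadingView"}

	// Order of arguments encodes precedence: OpenGraph, SchemaOrg, IEReadingView.
	fmt.Println(firstTitle(openGraph, schemaOrg, ieReadingView))
}
```

Passing the parsers in precedence order keeps the fallback policy in one place, so changing the order or adding a format would not require touching each property getter.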

The properties that matter to distilled content are:

  • individual properties: title, page type, page url, description, publisher, author, copyright
  • dominant and inline images and their properties: url, secure_url, type, caption, width, height
  • article and its properties: section name, published time, modified time, expiration time, authors

TODO: for some properties, e.g. dominant and inline images, we might want to retrieve from multiple parsers; IEReadingViewParser provides more information as it scans all images in the document. If we do so, we would need to merge the multiple versions in a meaningful way.

func NewParser

func NewParser(root *html.Node, timingInfo *data.TimingInfo) *Parser

func (*Parser) Article

func (ps *Parser) Article() *data.MarkupArticle

func (*Parser) Author

func (ps *Parser) Author() string

func (*Parser) Copyright

func (ps *Parser) Copyright() string

func (*Parser) Description

func (ps *Parser) Description() string

func (*Parser) Images

func (ps *Parser) Images() []data.MarkupImage

func (*Parser) MarkupInfo

func (ps *Parser) MarkupInfo() data.MarkupInfo

func (*Parser) OptOut

func (ps *Parser) OptOut() bool

func (*Parser) Publisher

func (ps *Parser) Publisher() string

func (*Parser) Title

func (ps *Parser) Title() string

func (*Parser) Type

func (ps *Parser) Type() string

func (*Parser) URL

func (ps *Parser) URL() string
