Documentation ¶
Index ¶
- type Accessor
- type Parser
- func (ps *Parser) Article() *data.MarkupArticle
- func (ps *Parser) Author() string
- func (ps *Parser) Copyright() string
- func (ps *Parser) Description() string
- func (ps *Parser) Images() []data.MarkupImage
- func (ps *Parser) MarkupInfo() data.MarkupInfo
- func (ps *Parser) OptOut() bool
- func (ps *Parser) Publisher() string
- func (ps *Parser) Title() string
- func (ps *Parser) Type() string
- func (ps *Parser) URL() string
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Accessor ¶
type Accessor interface {
	// Title returns the markup title of the document, empty if none.
	Title() string

	// Type returns the markup type of the document, empty if none.
	Type() string

	// URL returns the markup url of the document, empty if none.
	URL() string

	// Images returns the properties of all markup images in the document.
	// The first image is the dominant (i.e. top or salient) one.
	Images() []data.MarkupImage

	// Description returns the markup description of the document, empty if none.
	Description() string

	// Publisher returns the markup publisher of the document, empty if none.
	Publisher() string

	// Copyright returns the markup copyright of the document, empty if none.
	Copyright() string

	// Author returns the full name of the markup author, empty if none.
	Author() string

	// Article returns the properties of the markup "article" object, null if none.
	Article() *data.MarkupArticle

	// OptOut returns true if page owner has opted out of distillation.
	OptOut() bool
}
Accessor is the interface that all parsers must implement so that Parser can retrieve their properties.
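As an illustration of what implementing Accessor involves, here is a minimal sketch of a parser backed by fixed values. To keep it self-contained, MarkupImage and MarkupArticle are local stand-ins for data.MarkupImage and data.MarkupArticle, and their field names are assumptions rather than the package's actual definitions. The staticAccessor type is hypothetical and is not part of this package.

```go
package main

import "fmt"

// MarkupImage is a local stand-in for data.MarkupImage.
// The field names are assumptions made for this sketch.
type MarkupImage struct {
	URL    string
	Width  int
	Height int
}

// MarkupArticle is a local stand-in for data.MarkupArticle.
// The field names are assumptions made for this sketch.
type MarkupArticle struct {
	Section string
	Authors []string
}

// Accessor mirrors the interface documented above, using the stand-in types.
type Accessor interface {
	Title() string
	Type() string
	URL() string
	Images() []MarkupImage
	Description() string
	Publisher() string
	Copyright() string
	Author() string
	Article() *MarkupArticle
	OptOut() bool
}

// staticAccessor is a hypothetical Accessor that returns fixed values.
// Properties it does not know about are reported as empty or nil.
type staticAccessor struct {
	title  string
	url    string
	author string
	images []MarkupImage
}

func (s *staticAccessor) Title() string           { return s.title }
func (s *staticAccessor) Type() string            { return "" }
func (s *staticAccessor) URL() string             { return s.url }
func (s *staticAccessor) Images() []MarkupImage   { return s.images }
func (s *staticAccessor) Description() string     { return "" }
func (s *staticAccessor) Publisher() string       { return "" }
func (s *staticAccessor) Copyright() string       { return "" }
func (s *staticAccessor) Author() string          { return s.author }
func (s *staticAccessor) Article() *MarkupArticle { return nil }
func (s *staticAccessor) OptOut() bool            { return false }

// Compile-time check that staticAccessor satisfies Accessor.
var _ Accessor = (*staticAccessor)(nil)

func main() {
	var acc Accessor = &staticAccessor{
		title:  "Example page",
		url:    "https://example.com/post",
		author: "Jane Doe",
		images: []MarkupImage{{URL: "https://example.com/top.png", Width: 800, Height: 600}},
	}

	fmt.Println(acc.Title())
	// The first image is treated as the dominant one.
	if imgs := acc.Images(); len(imgs) > 0 {
		fmt.Println(imgs[0].URL)
	}
	// A missing article is reported as nil.
	fmt.Println(acc.Article() == nil)
}
```

The compile-time assertion is a common Go idiom: the build fails if a method is missing, so a new parser cannot silently fall out of step with the interface.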
type Parser ¶
type Parser struct {
// contains filtered or unexported fields
}
Parser loads the different parsers that are based on different markup specifications, and allows retrieval of different distillation-related markup properties from a document. It retrieves the requested properties from one or more parsers. If necessary, it may merge the information from multiple parsers.
Currently, three markup formats are supported: OpenGraphProtocol, IEReadingView and SchemaOrg. For now, OpenGraphProtocolParser takes precedence because it uses specific meta tags and hence extracts information the fastest; it also demands conformance to rules. If those rules are broken, or the retrieved properties are nil or empty, Parser falls back to SchemaOrg and then to IEReadingView.
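The precedence order described above can be sketched as a lookup that walks the parsers in order and returns the first non-empty value. This is only an illustration of the idea, not the package's actual implementation: the accessor interface is a trimmed-down, hypothetical version of Accessor with two properties, and fixed and firstNonEmpty are names invented for this example.

```go
package main

import "fmt"

// accessor is a trimmed-down, hypothetical version of Accessor
// that exposes only two string properties.
type accessor interface {
	Title() string
	Author() string
}

// fixed is a hypothetical parser result holding fixed values.
type fixed struct {
	title  string
	author string
}

func (f fixed) Title() string  { return f.title }
func (f fixed) Author() string { return f.author }

// firstNonEmpty walks the accessors in precedence order and returns the
// first non-empty value produced by get, or "" if every accessor is empty.
func firstNonEmpty(accessors []accessor, get func(accessor) string) string {
	for _, a := range accessors {
		if v := get(a); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// Precedence order taken from the text: OpenGraphProtocol first,
	// then SchemaOrg, then IEReadingView.
	openGraph := fixed{title: "OG title"}
	schemaOrg := fixed{title: "Schema title", author: "Jane Doe"}
	ieReadingView := fixed{author: "J. Doe"}
	ordered := []accessor{openGraph, schemaOrg, ieReadingView}

	fmt.Println(firstNonEmpty(ordered, accessor.Title))  // prints "OG title"
	fmt.Println(firstNonEmpty(ordered, accessor.Author)) // prints "Jane Doe"
}
```

Here the title comes from the highest-precedence source, while the author falls through to the second source because the first one has none.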
The properties that matter to distilled content are:
- individual properties: title, page type, page url, description, publisher, author, copyright
- dominant and inline images and their properties: url, secure_url, type, caption, width, height
- article and its properties: section name, published time, modified time, expiration time, authors.
TODO: for some properties, e.g. dominant and inline images, we might want to retrieve from multiple parsers; IEReadingViewParser provides more information as it scans all images in the document. If we do so, we would need to merge the multiple versions in a meaningful way.
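One possible shape for the merge mentioned in the TODO is sketched below. It is not something the package does today, and it is only one of several reasonable policies: lists are taken in precedence order, images are matched by URL, and a later duplicate only fills in fields the earlier entry left empty. The image type and mergeImages function are hypothetical names for this example, with fields chosen from the image properties listed above.

```go
package main

import "fmt"

// image is a hypothetical stand-in for data.MarkupImage with a few of the
// properties listed above; the field names are assumptions.
type image struct {
	URL     string
	Caption string
	Width   int
	Height  int
}

// mergeImages combines image lists given in precedence order.
// The first occurrence of a URL keeps its position, so the dominant image
// of the highest-precedence list stays first. Later duplicates only fill
// in fields that are still empty.
func mergeImages(lists ...[]image) []image {
	index := make(map[string]int) // URL -> position in merged
	var merged []image

	for _, list := range lists {
		for _, img := range list {
			i, seen := index[img.URL]
			if !seen {
				index[img.URL] = len(merged)
				merged = append(merged, img)
				continue
			}
			if merged[i].Caption == "" {
				merged[i].Caption = img.Caption
			}
			if merged[i].Width == 0 {
				merged[i].Width = img.Width
			}
			if merged[i].Height == 0 {
				merged[i].Height = img.Height
			}
		}
	}
	return merged
}

func main() {
	fromMetaTags := []image{{URL: "https://example.com/top.png"}}
	fromDocumentScan := []image{
		{URL: "https://example.com/top.png", Caption: "Top", Width: 800, Height: 600},
		{URL: "https://example.com/inline.png", Width: 320, Height: 240},
	}

	for _, img := range mergeImages(fromMetaTags, fromDocumentScan) {
		fmt.Printf("%s %q %dx%d\n", img.URL, img.Caption, img.Width, img.Height)
	}
}
```

Matching by URL is the simplest key available; a real merge might also need to normalize URLs or reconcile url and secure_url before comparing.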
func (*Parser) Article ¶
func (ps *Parser) Article() *data.MarkupArticle
func (*Parser) Description ¶
func (ps *Parser) Description() string
func (*Parser) Images ¶
func (ps *Parser) Images() []data.MarkupImage
func (*Parser) MarkupInfo ¶
func (ps *Parser) MarkupInfo() data.MarkupInfo