readability

v1.0.0 (latest) · Published: Jul 2, 2019 · License: Apache-2.0 · Imports: 10 · Imported by: 17

Repository

github.com/cixtor/readability

README

Readability

Readability is a library written in Go (golang) to parse, analyze and convert HTML pages into readable content. Originally an Arc90 Experiment, it is now incorporated into Safari’s Reader View.

Despite the ubiquity of reading on the web, readers remain a neglected audience. Much of our talk about web design revolves around a sense of movement: users are thought to be finding, searching, skimming, looking. We measure how frequently they click but not how long they stay on the page. We concern ourselves with their travel and participation–how they move from page to page, who they talk to when they get there–but forget the needs of those whose purpose is to be still. Readers flourish when they have space–some distance from the hubbub of the crowds–and as web designers, there is yet much we can do to help them carve out that space.

In Defense Of Readers, by Mandy Brown

Evolution of Readability Web Engines

Product	Year	Shutdown
Instapaper	2008	N/A
Arc90 Readability	2009	Sep 30, 2016
Apple Readability	2010	N/A
Microsoft Reading View	2014	N/A
Mozilla Readability	2015	N/A
Mercury Reader	2016	Apr 15, 2019

Reader Mode Parser Diversity

All modern web browsers, except for Google Chrome, include an option to parse, analyze, and extract the main content from web pages to provide what is commonly known as “Reading Mode”. Reading Mode is a separate web rendering mode that strips out repeated and irrelevant content, which allows the web browser to extract the main content and display it cleanly and consistently to the user.

Vendor	Product	Parser	Environments
Mozilla	Firefox	Mozilla Readability	Desktop and Android
GNOME	Web	Mozilla Readability	Desktop
Vivaldi	Vivaldi	Mozilla Readability	Desktop
Yandex	Browser	Mozilla Readability	Desktop
Samsung	Browser	Mozilla Readability	Android
Apple	Safari	Safari Reader	macOS and iOS
Maxthon	Maxthon	Maxthon Reader	Desktop
Microsoft	Edge	EdgeHTML	Windows and Windows Mobile
Microsoft	Edge Mobile	Chrome DOM Distiller	Android
Google	Chrome	Chrome DOM Distiller	Android
Postlight	Mercury Reader	Web Reader	Web / browser extension
Instant Paper	Instapaper	Instaparser	Web / browser extension
Mozilla	Pocket	Unknown	Web / browser extension

Ref: https://web.archive.org/web/20150817073201/http://lab.arc90.com/2009/03/02/readability/

Documentation

Index

type Article
type Readability
- func New() *Readability
- func (r *Readability) IsReadable(input io.Reader) bool
- func (r *Readability) Parse(input io.Reader, pageURL string) (Article, error)

Types

type Article

type Article struct {
	// Title is the heading that precedes the article’s content, and the basis
	// for the article’s page name and URL. It indicates what the article is
	// about, and distinguishes it from other articles. The title may simply
	// be the name of the subject of the article, or it may be a description
	// of the topic.
	Title string

	// Byline is a printed line of text accompanying a news story, article, or
	// the like, giving the author’s name.
	Byline string

	// Dir is the direction of the text in the article.
	//
	// Either Left-to-Right (LTR) or Right-to-Left (RTL).
	Dir string

	// Content is the relevant text in the article with HTML tags.
	Content string

	// TextContent is the relevant text in the article without HTML tags.
	TextContent string

	// Excerpt is the summary for the relevant text in the article.
	Excerpt string

	// SiteName is the name of the original publisher website.
	SiteName string

	// Favicon (short for favorite icon) is a file containing one or more small
	// icons, associated with a particular website or web page. A web designer
	// can create such an icon and upload it to a website (or web page) by
	// several means, and graphical web browsers will then make use of it.
	Favicon string

	// Image is an image URL which represents the article’s content.
	Image string

	// Length is the number of characters in the article.
	Length int

	// Node is the first element in the HTML document.
	Node *html.Node
}

Article represents the metadata and content of the article.
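
As a rough illustration of how a caller might consume these fields, the sketch below builds a one-line header and picks a text direction from an Article-like value. The local `article` struct mirrors only a few of the documented fields so that the example compiles without the package. `headline` and `dirAttr` are hypothetical helpers, not part of the library. The exact strings the parser stores in `Dir` are an assumption, which is why the comparison is case-insensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// article mirrors a subset of the exported Article fields documented above.
// It is a local stand-in for this sketch, not the package's own type.
type article struct {
	Title    string
	Byline   string
	SiteName string
	Dir      string
}

// headline joins the non-empty metadata fields into one display line.
// Hypothetical helper, not part of the package.
func headline(a article) string {
	parts := []string{}
	for _, s := range []string{a.Title, a.Byline, a.SiteName} {
		if s = strings.TrimSpace(s); s != "" {
			parts = append(parts, s)
		}
	}
	return strings.Join(parts, " | ")
}

// dirAttr maps Dir to a value suitable for an HTML dir attribute.
// Anything other than an explicit right-to-left marker falls back to "ltr".
// The accepted spellings of Dir are assumed, not taken from the package.
func dirAttr(dir string) string {
	if strings.EqualFold(strings.TrimSpace(dir), "rtl") {
		return "rtl"
	}
	return "ltr"
}

func main() {
	a := article{Title: "In Defense Of Readers", Byline: "Mandy Brown", Dir: "LTR"}
	fmt.Println(headline(a))
	fmt.Println(dirAttr(a.Dir))
}
```

Falling back to left-to-right when `Dir` is empty is a design choice of this sketch; a caller could equally omit the attribute and let the browser decide.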

type Readability

type Readability struct {

	// MaxElemsToParse is the optional maximum number of HTML nodes to parse
	// from the document. If the number of elements in the document is higher
	// than this number, the operation immediately errors.
	MaxElemsToParse int

	// NTopCandidates is the number of top candidates to consider when the
	// parser is analysing how tight the competition is among candidates.
	NTopCandidates int

	// CharThresholds is the default number of chars an article must have in
	// order to return a result.
	CharThresholds int

	// ClassesToPreserve are the classes that readability sets itself.
	ClassesToPreserve []string

	// TagsToScore are the element tags to score by default.
	TagsToScore []string
	// contains filtered or unexported fields
}

Readability is an HTML parser that reads and extracts relevant content.

func New

func New() *Readability

New returns a new Readability with sane defaults to parse simple documents.

func (*Readability) IsReadable

func (r *Readability) IsReadable(input io.Reader) bool

IsReadable decides whether the document is usable or not without parsing the whole thing. In the original `mozilla/readability` library, this method is located in `Readability-readable.js`.

func (*Readability) Parse

func (r *Readability) Parse(input io.Reader, pageURL string) (Article, error)

Parse parses input and finds the main readable content.
