readability

v0.1.0 | Published: May 18, 2024 | License: Apache-2.0 | Imports: 17 | Imported by: 1

Repository

github.com/giulianopz/go-readability

README ¶

go-readability

A Go port of Mozilla's Readability.js, the heuristic that powers Firefox Reader View. It offers a distraction-free reading experience for articles, blog posts, and other text-heavy web pages by stripping ads, GDPR-compliant cookie banners, and other unsolicited junk.

This port uses only the minimal DOM parser bundled with the original library, rather than relying on Go's net/html package (golang.org/x/net/html, which is not part of the standard library). The rest of the source code is aligned with commit 97db40b, the latest on the upstream main branch at the time of writing.

A Bit of History

Mozilla's Readability.js is based on a JavaScript bookmarklet developed in 2009 by Arc90 Lab, an organization that has since left no trace; its main contributor was Chris Dary (@umbrae).

The source code was released under the Apache 2.0 license on Google Code, then abandoned in 2010 when the project was repackaged as a web service, Readability.com, which was itself discontinued in 2016.

Most modern browsers still use one of the available forks of the Arc90 original implementation when displaying web pages in reading mode.

For a historical and detailed analysis of the reading mode offered by current browsers, please read this excellent series of articles by Daniel Aleksandersen.

Basic usage

Add a dependency for the package:

go get -u github.com/giulianopz/go-readability

Get text content from a web page article:

package main

import (
	"fmt"

	"github.com/giulianopz/go-readability"
)

func main() {

	var htmlSource = `<!DOCTYPE html>
<html>

<head>
	<meta charset="utf-8" />
	<title>
		Redis will remain BSD licensed - &lt;antirez&gt;
	</title>
	<link href="/rss" rel="alternate" type="application/rss+xml" />
</head>

<body>
	<div id="container">
		<header>
			<h1><a href="/">&lt;antirez&gt;</a></h1>
		</header>
		<div id="content">
			<section id="newslist">
				<article data-news-id="120">
					<h2><a href="/news/120">Redis will remain BSD licensed</a></h2>
				</article>
			</section>
			<article class="comment" style="margin-left:0px" data-comment-id="120-" id="120-"><span class="info"><span
						class="username"><a href="/user/antirez">antirez</a></span> 2095 days ago.
					170643 views. </span>
				<pre>Today a page about the new Common Clause license in the Redis Labs web site was interpreted as if Redis itself switched license. This is not the case, Redis is, and will remain, BSD licensed. However in the era of [edit] uncontrollable spreading of information, my attempts to provide the correct information failed, and I’m still seeing everywhere “Redis is no longer open source”. The reality is that Redis remains BSD, and actually Redis Labs did the right thing supporting my effort to keep the Redis core open as usually.

				[...]

We at Redis Labs are sorry for the confusion generated by the Common Clause page, and my colleagues are working to fix the page with better wording.</pre>
			</article>
		</div>
	</div>
</body>

</html>`

	isReaderable := readability.IsProbablyReaderable(htmlSource)
	fmt.Printf("Contains any text?: %t\n", isReaderable)

	reader, err := readability.New(
		htmlSource,
		"http://antirez.com/news/120",
		readability.ClassesToPreserve("caption"),
	)
	if err != nil {
		panic(err)
	}

	result, err := reader.Parse()
	if err != nil {
		panic(err)
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Author: %s\n", result.Byline)
	fmt.Printf("Length: %d\n", result.Length)
	fmt.Printf("Excerpt: %s\n", result.Excerpt)
	fmt.Printf("SiteName: %s\n", result.SiteName)
	fmt.Printf("Lang: %s\n", result.Lang)
	fmt.Printf("PublishedTime: %s\n", result.PublishedTime)
	fmt.Printf("Content: %s\n", result.Content)
	fmt.Printf("TextContent: %s\n", result.TextContent)
}

Documentation ¶

Index ¶

func IsProbablyReaderable(htmlSource string, opts ...Option) bool
type Option
type Options
type Readability
- func New(htmlSource, uri string, opts ...Option) (*Readability, error)
- func (r *Readability) Parse() (*Result, error)
type Result

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func IsProbablyReaderable ¶

func IsProbablyReaderable(htmlSource string, opts ...Option) bool

Decides whether the document is likely to be readerable without parsing the whole thing. Options:

options.minContentLength, set with MinContentLength (default 140): the minimum node content length used to decide if the document is readerable
options.minScore, set with MinScore (default 20): the minimum cumulative score used to decide if the document is readerable
options.visibilityChecker, set with VisibilityChecker (default isNodeVisible): the function used to determine if a node is visible
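
The interplay of the two thresholds can be sketched as follows. This is a simplified illustration, assuming the port follows the upstream Readability.js check, in which each sufficiently long candidate node adds the square root of its excess length to a running score. The function name probablyReaderable is hypothetical, and the sketch takes plain text lengths instead of walking a DOM, so it ignores the visibility check and any filtering of unlikely nodes.

```go
package main

import (
	"fmt"
	"math"
)

// probablyReaderable is a hypothetical, simplified stand-in for the real check.
// textLengths holds the text length of each candidate node (e.g. <p>, <pre>).
func probablyReaderable(textLengths []int, minContentLength int, minScore float64) bool {
	score := 0.0
	for _, n := range textLengths {
		if n < minContentLength {
			continue // too short to count as article text
		}
		// Longer nodes contribute more, but with diminishing returns.
		score += math.Sqrt(float64(n - minContentLength))
		if score > minScore {
			return true // enough evidence, stop early
		}
	}
	return false
}

func main() {
	// Two short snippets never reach the minimum content length.
	fmt.Println(probablyReaderable([]int{50, 120}, 140, 20))
	// Two medium paragraphs together push the score past the threshold.
	fmt.Println(probablyReaderable([]int{400, 500}, 140, 20))
}
```

Raising the minimum content length or the minimum score makes the check stricter; lowering them makes short pages more likely to pass.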

Types ¶

type Option ¶

type Option func(*Options)

func AllowedVideoRegex ¶

func AllowedVideoRegex(rgx *regexp.Regexp) Option

func CharThreshold ¶

func CharThreshold(n int) Option

func ClassesToPreserve ¶

func ClassesToPreserve(classes ...string) Option

func DisableJSONLD ¶

func DisableJSONLD(b bool) Option

func KeepClasses ¶

func KeepClasses(b bool) Option

func LogLevel ¶

func LogLevel(l slog.Level) Option

func MaxElemsToParse ¶

func MaxElemsToParse(n int) Option

func MinContentLength ¶

func MinContentLength(len int) Option

func MinScore ¶

func MinScore(score float64) Option

func NTopCandidates ¶

func NTopCandidates(n int) Option

func Serializer ¶

func Serializer(f func(*node) string) Option

func VisibilityChecker ¶

func VisibilityChecker(f func(*html.Node) bool) Option
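
Each function above returns an Option, that is, a function that mutates an Options value. The following self-contained sketch shows how this functional options pattern typically works. The lowercase types, the two fields, and the default values are assumptions made for illustration; the real Options struct has unexported fields that are not documented here.

```go
package main

import "fmt"

// options is a hypothetical stand-in for the package's Options struct.
type options struct {
	charThreshold int
	keepClasses   bool
}

// option has the same shape as readability.Option: func(*Options).
type option func(*options)

// charThreshold returns an option that overrides the character threshold.
func charThreshold(n int) option {
	return func(o *options) { o.charThreshold = n }
}

// keepClasses returns an option that toggles class preservation.
func keepClasses(b bool) option {
	return func(o *options) { o.keepClasses = b }
}

// newOptions starts from illustrative defaults and applies each option in order,
// so later options win over earlier ones.
func newOptions(opts ...option) *options {
	o := &options{charThreshold: 500, keepClasses: false}
	for _, apply := range opts {
		apply(o)
	}
	return o
}

func main() {
	o := newOptions(charThreshold(200), keepClasses(true))
	fmt.Println(o.charThreshold, o.keepClasses)
}
```

The pattern lets a constructor such as New accept any number of settings without a long parameter list, and callers that pass no options get the defaults.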

type Options ¶

type Options struct {
	// contains filtered or unexported fields
}

type Readability ¶

type Readability struct {
	// contains filtered or unexported fields
}

func New ¶

func New(htmlSource, uri string, opts ...Option) (*Readability, error)

New is the public constructor of Readability. It supports the following options:

options.debug
options.maxElemsToParse (MaxElemsToParse)
options.nbTopCandidates (NTopCandidates)
options.charThreshold (CharThreshold)
options.classesToPreserve (ClassesToPreserve)
options.keepClasses (KeepClasses)
options.serializer (Serializer)
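
The uri argument is not described above. In upstream Readability.js the document URI serves as the base for turning relative links into absolute ones, and this sketch assumes the port uses it the same way. It relies only on the standard net/url package; resolveLink is a hypothetical helper, not part of this package.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper that resolves href against the page URI,
// the way a relative link in an article would be made absolute.
func resolveLink(pageURI, href string) (string, error) {
	base, err := url.Parse(pageURI)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := resolveLink("http://antirez.com/news/120", "/user/antirez")
	if err != nil {
		panic(err)
	}
	fmt.Println(abs) // prints http://antirez.com/user/antirez
}
```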

func (*Readability) Parse ¶

func (r *Readability) Parse() (*Result, error)

Runs readability. Workflow:

1. Prep the document by removing script tags, CSS, etc.
2. Build readability's DOM tree.
3. Grab the article content from the current DOM tree.
4. Replace the current DOM tree with the new one.
5. Read peacefully.

type Result ¶

type Result struct {
	// article title
	Title string
	// HTML string of processed article Content
	Content string
	// text content of the article, with all the HTML tags removed
	TextContent string
	// length of an article, in characters (runes)
	Length int
	// article description, or short excerpt from the content
	Excerpt string
	// author metadata
	Byline string
	// content direction
	Dir string
	// name of the site
	SiteName string
	// content language
	Lang string
	// published time
	PublishedTime string
}
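
Because Length counts characters (runes) rather than bytes, it can differ from Go's built-in len for text containing non-ASCII characters. The sketch below shows the difference using the standard unicode/utf8 package; runeLength is a hypothetical helper, and it is an assumption that the package computes Length in an equivalent way.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLength is a hypothetical helper that counts characters (runes), not bytes.
func runeLength(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	text := "na\u00efve" // "naïve": the ï takes two bytes in UTF-8
	fmt.Println(len(text))        // byte count: 6
	fmt.Println(runeLength(text)) // rune count: 5
}
```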
