readability

Version: v0.1.0
Published: May 18, 2024 License: Apache-2.0 Imports: 17 Imported by: 1

README

go-readability

A Go port of Mozilla's Readability.js, the heuristic that powers Firefox Reader View. It offers a distraction-free reading experience for articles, blog posts, and other text-heavy web pages by removing ads, GDPR-compliant cookie banners, and other unsolicited junk.

This port uses only the minimal DOM parser bundled with the original library, without relying on the golang.org/x/net/html package. The rest of the source code is aligned with the latest commit (97db40b) on the main branch.

A Bit of History

Readability.js, now maintained by Mozilla, is based on a JavaScript bookmarklet developed in 2009 by Arc90 Lab, an organization that has left no trace behind but whose main contributor was Chris Dary (@umbrae).

The source code was released under the Apache 2.0 license on Google Code, then abandoned in 2010 when it was repackaged as a web service called Readability.com, which was itself discontinued in 2016.

Most modern browsers still use a fork of the original Arc90 implementation to display web pages in reading mode.

For a historical and detailed analysis of the reading mode offered by current browsers, please read this excellent series of articles by Daniel Aleksandersen.

Basic usage

Add a dependency for the package:

go get -u github.com/giulianopz/go-readability

Get text content from a web page article:

package main

import (
	"fmt"

	"github.com/giulianopz/go-readability"
)

func main() {

	var htmlSource = `<!DOCTYPE html>
<html>

<head>
	<meta charset="utf-8" />
	<title>
		Redis will remain BSD licensed - &lt;antirez&gt;
	</title>
	<link href="/rss" rel="alternate" type="application/rss+xml" />
</head>

<body>
	<div id="container">
		<header>
			<h1><a href="/">&lt;antirez&gt;</a></h1>
		</header>
		<div id="content">
			<section id="newslist">
				<article data-news-id="120">
					<h2><a href="/news/120">Redis will remain BSD licensed</a></h2>
				</article>
			</section>
			<article class="comment" style="margin-left:0px" data-comment-id="120-" id="120-"><span class="info"><span
						class="username"><a href="/user/antirez">antirez</a></span> 2095 days ago.
					170643 views. </span>
				<pre>Today a page about the new Common Clause license in the Redis Labs web site was interpreted as if Redis itself switched license. This is not the case, Redis is, and will remain, BSD licensed. However in the era of [edit] uncontrollable spreading of information, my attempts to provide the correct information failed, and I’m still seeing everywhere “Redis is no longer open source”. The reality is that Redis remains BSD, and actually Redis Labs did the right thing supporting my effort to keep the Redis core open as usually.

				[...]

We at Redis Labs are sorry for the confusion generated by the Common Clause page, and my colleagues are working to fix the page with better wording.</pre>
			</article>
		</div>
	</div>
</body>

</html>`

	// Cheap pre-check: guess whether the page holds an article before parsing it fully.
	isReaderable := readability.IsProbablyReaderable(htmlSource)
	fmt.Printf("Is probably readerable?: %t\n", isReaderable)

	reader, err := readability.New(
		htmlSource,
		"http://antirez.com/news/120",
		readability.ClassesToPreserve("caption"),
	)
	if err != nil {
		panic(err)
	}

	result, err := reader.Parse()
	if err != nil {
		panic(err)
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Author: %s\n", result.Byline)
	fmt.Printf("Length: %d\n", result.Length)
	fmt.Printf("Excerpt: %s\n", result.Excerpt)
	fmt.Printf("SiteName: %s\n", result.SiteName)
	fmt.Printf("Lang: %s\n", result.Lang)
	fmt.Printf("PublishedTime: %s\n", result.PublishedTime)
	fmt.Printf("Content: %s\n", result.Content)
	fmt.Printf("TextContent: %s\n", result.TextContent)
}
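
The example above embeds the HTML as a string literal. When the page lives on the web, the source has to be downloaded first. The following is a minimal sketch of one way to do that with the standard library only; fetchHTML is a hypothetical helper (not part of this package), the 10-second timeout is an arbitrary choice, and the body is assumed to be UTF-8. A local test server stands in for a real site so that the sketch runs offline.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchHTML is a hypothetical helper: it downloads a page and returns its
// body as a string, suitable as the htmlSource argument of readability.New.
func fetchHTML(url string) (string, error) {
	client := &http.Client{Timeout: 10 * time.Second} // arbitrary timeout
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}

	// Assumption: the response body is UTF-8 encoded.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// A local server stands in for a real website.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body><p>Hello</p></body></html>")
	}))
	defer srv.Close()

	htmlSource, err := fetchHTML(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println(htmlSource)
}
```

The returned string and the page URL can then be passed to readability.New as in the example above.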

Documentation

Functions

func IsProbablyReaderable

func IsProbablyReaderable(htmlSource string, opts ...Option) bool

Decides whether the document is probably readerable without parsing the whole thing. Options:

  • options.minContentLength (default 140, set with MinContentLength): the minimum content length a node needs to count toward the decision
  • options.minScore (default 20, set with MinScore): the minimum cumulative score needed to consider the document readerable
  • options.visibilityChecker (default isNodeVisible, set with VisibilityChecker): the function used to determine whether a node is visible
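
To make the role of the two thresholds concrete, here is a simplified sketch of the scoring used by the upstream Readability.js check, on the assumption that this port follows it. It works on plain text blocks instead of DOM nodes and skips the visibility and unlikely-candidate filters; probablyReaderable is a hypothetical name, not this package's function.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode/utf8"
)

// probablyReaderable is a simplified, hypothetical stand-in for the real
// check. Each block is the text of one candidate node (e.g. a <p> or <pre>).
func probablyReaderable(blocks []string, minContentLength int, minScore float64) bool {
	score := 0.0
	for _, b := range blocks {
		n := utf8.RuneCountInString(strings.TrimSpace(b))
		// Blocks shorter than the minimum length do not contribute.
		if n < minContentLength {
			continue
		}
		// Longer blocks contribute the square root of their excess length.
		score += math.Sqrt(float64(n - minContentLength))
		// Stop early as soon as the cumulative score passes the threshold.
		if score > minScore {
			return true
		}
	}
	return false
}

func main() {
	short := []string{"Hello", "World"}
	long := []string{strings.Repeat("a", 600)}

	fmt.Println(probablyReaderable(short, 140, 20))
	fmt.Println(probablyReaderable(long, 140, 20))
}
```

With the defaults, a single 600-character block scores sqrt(460), roughly 21.4, which is enough on its own; several shorter blocks can also add up to pass the threshold.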

Types

type Option

type Option func(*Options)

func AllowedVideoRegex

func AllowedVideoRegex(rgx *regexp.Regexp) Option

func CharThreshold

func CharThreshold(n int) Option

func ClassesToPreserve

func ClassesToPreserve(classes ...string) Option

func DisableJSONLD

func DisableJSONLD(b bool) Option

func KeepClasses

func KeepClasses(b bool) Option

func LogLevel

func LogLevel(l slog.Level) Option

func MaxElemsToParse

func MaxElemsToParse(n int) Option

func MinContentLength

func MinContentLength(len int) Option

func MinScore

func MinScore(score float64) Option

func NTopCandidates

func NTopCandidates(n int) Option

func Serializer

func Serializer(f func(*node) string) Option

func VisibilityChecker

func VisibilityChecker(f func(*html.Node) bool) Option

type Options

type Options struct {
	// contains filtered or unexported fields
}
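
The Option type and the constructors listed above follow the functional options pattern. Since the fields of Options are unexported, the following is only a sketch of how such a pattern typically works: the lowercase options struct, its fields, and the with* helpers are hypothetical stand-ins, and the defaults (140 and 20) are the ones documented for IsProbablyReaderable.

```go
package main

import "fmt"

// options is a hypothetical stand-in for the package's Options struct,
// whose real fields are unexported and not shown in this documentation.
type options struct {
	minContentLength int
	minScore         float64
}

// option mirrors the documented shape: type Option func(*Options).
type option func(*options)

// withMinContentLength returns an option that overrides one field.
func withMinContentLength(n int) option {
	return func(o *options) { o.minContentLength = n }
}

// withMinScore returns an option that overrides another field.
func withMinScore(s float64) option {
	return func(o *options) { o.minScore = s }
}

// newOptions starts from the defaults and applies each option in order,
// so later options win over earlier ones.
func newOptions(opts ...option) *options {
	o := &options{minContentLength: 140, minScore: 20}
	for _, apply := range opts {
		apply(o)
	}
	return o
}

func main() {
	o := newOptions(withMinContentLength(200))
	fmt.Println(o.minContentLength, o.minScore)
}
```

The appeal of this design is that callers only name the settings they want to change, and new settings can be added later without breaking existing call sites.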

type Readability

type Readability struct {
	// contains filtered or unexported fields
}

func New

func New(htmlSource, uri string, opts ...Option) (*Readability, error)

New is the public constructor of Readability and it supports the following options:

  • options.debug
  • options.maxElemsToParse
  • options.nbTopCandidates
  • options.charThreshold
  • options.classesToPreserve
  • options.keepClasses
  • options.serializer

func (*Readability) Parse

func (r *Readability) Parse() (*Result, error)

Runs readability. Workflow:

  1. Prep the document by removing script tags, css, etc.
  2. Build readability's DOM tree.
  3. Grab the article content from the current DOM tree.
  4. Replace the current DOM tree with the new one.
  5. Read peacefully.
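
As a rough illustration of what step 1 of the workflow above achieves, the sketch below strips script and style elements from an HTML string with regular expressions. This is deliberately crude and is not how the package does it (the port works on its own DOM tree); stripScriptsAndStyles is a hypothetical helper, and regular expressions are not a robust way to process arbitrary HTML.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// (?is): case-insensitive, and "." also matches newlines.
	scriptRe = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script\s*>`)
	styleRe  = regexp.MustCompile(`(?is)<style\b[^>]*>.*?</style\s*>`)
)

// stripScriptsAndStyles is a hypothetical helper that removes <script> and
// <style> elements, including their contents, from an HTML string.
func stripScriptsAndStyles(htmlSource string) string {
	out := scriptRe.ReplaceAllString(htmlSource, "")
	return styleRe.ReplaceAllString(out, "")
}

func main() {
	in := `<p>Hi</p><script>alert(1)</script><style>p{color:red}</style><p>Bye</p>`
	fmt.Println(stripScriptsAndStyles(in))
}
```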

type Result

type Result struct {
	// article title
	Title string
	// HTML string of processed article Content
	Content string
	// text content of the article, with all the HTML tags removed
	TextContent string
	// length of an article, in characters (runes)
	Length int
	// article description, or short excerpt from the content
	Excerpt string
	// author metadata
	Byline string
	// content direction
	Dir string
	// name of the site
	SiteName string
	// content language
	Lang string
	// published time
	PublishedTime string
}
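
The Length field is documented as a count of characters (runes), which differs from Go's built-in len on strings whenever the text contains non-ASCII characters. The sketch below shows the difference, assuming Length matches the rune count of the text; runeLength is a hypothetical helper used only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLength is a hypothetical helper that counts characters (runes)
// rather than bytes.
func runeLength(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	// "naïve café", written with escapes so the encoding is unambiguous.
	textContent := "na\u00efve caf\u00e9"

	fmt.Println("bytes:", len(textContent))
	fmt.Println("runes:", runeLength(textContent))
}
```

Here the byte length is 12 while the rune count is 10, because each of the two accented letters takes two bytes in UTF-8.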
