readability

package module

v0.0.0-...-4217e20 Latest Latest Go to latest Published: May 31, 2018 License: MIT Imports: 15 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/requaos/goreadability

Links

Open Source Insights

README ¶

goreadability

goreadability is a tool for extracting the primary readable content of a webpage. It is a Go port of arc90's readability project, based on ruby-readability.

Install

go get github.com/philipjkim/goreadability

Example

// URL to extract contents (title, description, images, ...)
url := "https://en.wikipedia.org/wiki/Lego"

// Default option
opt := readability.NewOption()

// You can modify some option values if needed.
opt.ImageRequestTimeout = 3000 // ms

content, err := readability.Extract(url, opt)
if err != nil {
    log.Fatal(err)
}

log.Println(content.Title)
log.Println(content.Description)
log.Println(content.Images)

Testing

go test

# or if you want to see verbose logs:
DEBUG=true go test

Command Line Tool

TODO

ruby-readability is the base of this project.
fastimage finds the type and/or size of a remote image given its uri, by fetching as little as needed.

Potential Issues

TODO

License

MIT

Documentation ¶

Overview ¶

Example ¶

// URL to extract contents (title, description, images, ...)
url := "https://en.wikipedia.org/wiki/Lego"

// Default option
opt := readability.NewOption()

// You can modify some option values if needed.
opt.ImageRequestTimeout = 3000 // ms

content, err := readability.Extract(url, opt)
if err != nil {
	log.Fatal(err)
}

log.Println(content.Title)
log.Println(content.Description)
log.Println(content.Images)

Output:

Index ¶

func Debug()
type Content
- func Extract(reqURL string, opt *Option) (*Content, error)
- func ExtractFromDocument(doc *goquery.Document, reqURL string, opt *Option) (*Content, error)
type Image
- func (i Image) String() string
type Option
- func NewOption() *Option

Examples ¶

Package

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Debug ¶

func Debug()

Debug enables debug logging of the operations done by the library. If called, lots of information will be print to stdout.

Types ¶

type Content ¶

type Content struct {
	Title       string
	Description string
	Author      string
	Images      []Image
}

Content contains primary readable content of a webpage.

func Extract ¶

func Extract(reqURL string, opt *Option) (*Content, error)

Extract requests to reqURL then returns contents extracted from the response.

func ExtractFromDocument ¶

func ExtractFromDocument(doc *goquery.Document, reqURL string, opt *Option) (*Content, error)

ExtractFromDocument returns Content when extraction succeeds, otherwise error. reqURL is required for converting relative image paths to absolute.

If you already have *goquery.Document after requesting HTTP, use this function, otherwise use Extract(reqURL, opt).

type Image ¶

type Image struct {
	URL  string
	Size *fastimage.ImageSize
}

Image contains URL and Size (width and height in pixel).

func (Image) String ¶

func (i Image) String() string

type Option ¶

type Option struct {
	// RetryLength is minimum length for a page description.
	// It will retry to extract page description with more liberal rule
	// if extracted description length is less than this value.
	RetryLength int

	// MinTextLength is minimum length of an inner text for a tag.
	// If a tag has short inner text (length is less than MinTextLength),
	// the text will be discarded from the page description candidates.
	MinTextLength int

	// RemoveUnlikelyCandidates is a flag whether to remove some tags
	// if they are considered relatively unimportant.
	RemoveUnlikelyCandidates bool

	// WeightClasses is a flag whether to give more/less weight to some tags
	// if they contain some positive/negative words in id/class value.
	WeightClasses bool

	// CleanConditionally is a flag whether to remove some tags
	// using various rules in conditionalCleanReason().
	CleanConditionally bool

	// RemoveEmptyNodes is a flag whether to remove some tags which have empty inner text.
	RemoveEmptyNodes bool

	// MinImageWidth is the minimum width (pixel) for choosing images.
	MinImageWidth uint32

	// MinImageHeight is the minimum height (pixel) for choosing images.
	MinImageHeight uint32

	// MaxImageCount is the maximum number of images for a web page.
	MaxImageCount int

	// CheckImageLoopCount is the number of images
	// for parallel requests to fetch the image size.
	// For example, if this value is set to 10,
	// the first 10 img src URLs without width/height attributes
	// will be requested over network.
	// (img tags with both width/height attributes (pixels in int) are not conunted,
	// since they are not requested over network to get image size.)
	CheckImageLoopCount uint

	// ImageRequestTimeout is timeout(ms) for a single image request.
	ImageRequestTimeout uint

	// IgnoreImageFormat is an array of strings for ignoring some images.
	// If an image URL contains at least one of strings in this array, the image will be ignored.
	IgnoreImageFormat []string

	// DescriptionAsPlainText is a flag whether to strip all tags in a description value.
	DescriptionAsPlainText bool

	// DescriptionExtractionTimeout is timeout(ms) for extracting description for a page.
	DescriptionExtractionTimeout uint
}

Option contains variety of options for extracting page content and images.

func NewOption ¶

func NewOption() *Option

NewOption returns the default option.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL