readability

package module

v0.0.0-...-a3db0f1 Published: Oct 28, 2016 License: MIT Imports: 12 Imported by: 0

Repository

github.com/virusvn/goreadability

README

goreadability

goreadability is a tool for extracting the primary readable content of a webpage. It is a Go port of arc90's readability project, based on ruby-readability.

Install

go get github.com/philipjkim/goreadability

Example

package main

import (
	"log"

	readability "github.com/philipjkim/goreadability"
)

func main() {
	// URL of the page to extract content from (title, description, images, ...)
	url := "https://en.wikipedia.org/wiki/Lego"

	// Start from the default options.
	opt := readability.NewOption()

	// Override individual option values if needed.
	opt.ImageRequestTimeout = 3000 // ms

	// Fetch the page and extract its readable content.
	content, err := readability.Extract(url, opt)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(content.Title)
	log.Println(content.Description)
	log.Println(content.Images)
}

Command Line Tool

TODO

Related Projects

ruby-readability is the base of this project.
fastimage finds the type and/or size of a remote image given its URI, by fetching as little as needed.

Potential Issues

TODO

License

This code is under the Apache License 2.0. See http://www.apache.org/licenses/LICENSE-2.0.

Documentation

Index

type Content
- func Extract(reqURL string, opt *Option) (*Content, error)
- func ExtractFromDocument(doc *goquery.Document, reqURL string, opt *Option) (*Content, error)
type Image
- func (i Image) String() string
type Option
- func NewOption() *Option

Types

type Content

type Content struct {
	Title       string
	Description string
	Content     string
	Author      string
	Images      []Image
}

Content contains the primary readable content of a webpage.

func Extract

func Extract(reqURL string, opt *Option) (*Content, error)

Extract sends a request to reqURL and returns the content extracted from the response.

func ExtractFromDocument

func ExtractFromDocument(doc *goquery.Document, reqURL string, opt *Option) (*Content, error)

ExtractFromDocument returns a Content when extraction succeeds, and an error otherwise. reqURL is required for converting relative image paths to absolute URLs.

If you already have a *goquery.Document from your own HTTP request, use this function; otherwise use Extract(reqURL, opt).

type Image

type Image struct {
	URL  string
	Size *fastimage.ImageSize
}

Image contains the URL and Size (width and height in pixels) of an image.

func (Image) String

func (i Image) String() string

type Option

type Option struct {
	// RetryLength is the minimum length for a page description.
	// Extraction of the page description is retried with a more liberal rule
	// if the extracted description is shorter than this value.
	RetryLength int

	// MinTextLength is the minimum length of a tag's inner text.
	// If a tag has short inner text (shorter than MinTextLength),
	// the text is discarded from the page description candidates.
	MinTextLength int

	// RemoveUnlikelyCandidates is a flag whether to remove some tags
	// if they are considered relatively unimportant.
	RemoveUnlikelyCandidates bool

	// WeightClasses is a flag whether to give more/less weight to some tags
	// if they contain some positive/negative words in id/class value.
	WeightClasses bool

	// CleanConditionally is a flag whether to remove some tags
	// using various rules in conditionalCleanReason().
	CleanConditionally bool

	// RemoveEmptyNodes is a flag whether to remove some tags which have empty inner text.
	RemoveEmptyNodes bool

	// MinImageWidth is the minimum width (in pixels) for choosing images.
	MinImageWidth uint32

	// MinImageHeight is the minimum height (in pixels) for choosing images.
	MinImageHeight uint32

	// MaxImageCount is the maximum number of images for a web page.
	MaxImageCount int

	// CheckImageSize is a flag whether to check image sizes.
	CheckImageSize bool

	// CheckImageLoopCount is the number of images whose sizes are fetched with parallel requests.
	// For example, if this value is set to 10,
	// the first 10 image sources found in img tags will be requested.
	CheckImageLoopCount uint

	// ImageRequestTimeout is the timeout (in ms) for a single image request.
	ImageRequestTimeout uint

	// IgnoreImageFormat is a list of strings for ignoring some images.
	// If an image URL contains at least one of the strings in this list, the image will be ignored.
	IgnoreImageFormat []string

	// DescriptionAsPlainText is a flag whether to strip all tags from the description value.
	DescriptionAsPlainText bool

	// DescriptionExtractionTimeout is the timeout (in ms) for extracting the description of a page.
	DescriptionExtractionTimeout uint
}

Option contains a variety of options for extracting page content and images.

func NewOption

func NewOption() *Option

NewOption returns the default option.

Source Files

readability.go
