justext

v0.0.0-...-be571e3 Latest
Published: Nov 6, 2022 License: MIT Imports: 13 Imported by: 51

README

justext

A Go package that implements the JusText boilerplate removal algorithm (http://code.google.com/p/justext/)

Install

go get github.com/JalfResi/justext

And import:

import "github.com/JalfResi/justext"

Usage

Supports all stoplist files available at http://code.google.com/p/justext/source/browse/#svn%2Ftrunk%2Fjustext%2Fstoplists

Justext expects valid HTML; it is your responsibility to ensure that valid HTML is passed to Justext. To make this easier, I have written a cgo wrapper around libtidy, available at github.com/JalfResi/GoTidy. In the future, once exp/html is part of the standard packages, I will refactor JusText to accept only valid HTML documents/strings.

Justext uses the reader-writer idiom, allowing you to set up the reader with a common configuration and then pump out articles to the writer.

Example usage:

// Create a justext reader from another reader
reader := justext.NewReader(os.Stdin)

// Configure the reader
reader.LengthLow = 70
reader.LengthHigh = 200
reader.Stoplist = stoplist // The stoplist map[string]bool
reader.StopwordsLow = 0.3
reader.StopwordsHigh = 0.32
reader.MaxLinkDensity = 0.2
reader.MaxHeadingDistance = 200
reader.NoHeadings = false

// Read from the reader to generate a paragraph set
paragraphSet, err := reader.ReadAll()
if err != nil {
	log.Fatal(err) // requires the standard "log" package
}

// Create a writer from another writer
writer := justext.NewWriter(os.Stdout)
// Write the paragraph set to the writer
if err := writer.WriteAll(paragraphSet); err != nil {
	log.Fatal(err)
}
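To give a feel for what the thresholds configured above control, here is a minimal sketch of the context-free classification step as described for the original jusText algorithm. It is not this package's code: the thresholds type and the classify function are hypothetical, and the class names ("good", "neargood", "short", "bad") follow the original algorithm's description.

```go
package main

import "fmt"

// thresholds mirrors the Reader configuration fields; the type itself is
// hypothetical and exists only for this sketch.
type thresholds struct {
	LengthLow, LengthHigh       int
	StopwordsLow, StopwordsHigh float64
	MaxLinkDensity              float64
}

// classify applies the context-free rules from the jusText algorithm
// description to a single paragraph's measurements.
func classify(length int, stopwordDensity, linkDensity float64, t thresholds) string {
	switch {
	case linkDensity > t.MaxLinkDensity:
		return "bad" // mostly links, e.g. navigation
	case length < t.LengthLow:
		if linkDensity > 0 {
			return "bad"
		}
		return "short" // too short to decide without context
	case stopwordDensity >= t.StopwordsHigh:
		if length > t.LengthHigh {
			return "good"
		}
		return "neargood"
	case stopwordDensity >= t.StopwordsLow:
		return "neargood"
	default:
		return "bad"
	}
}

func main() {
	t := thresholds{
		LengthLow: 70, LengthHigh: 200,
		StopwordsLow: 0.30, StopwordsHigh: 0.32,
		MaxLinkDensity: 0.2,
	}
	fmt.Println(classify(250, 0.40, 0.05, t)) // prints: good
	fmt.Println(classify(40, 0.40, 0.00, t))  // prints: short
	fmt.Println(classify(100, 0.10, 0.50, t)) // prints: bad
}
```

In the original algorithm a second, context-sensitive pass then resolves "short" and "neargood" paragraphs from their neighbours; that pass is omitted here.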

Documentation


Constants

const (
	MODE_DEFAULT  = 1
	MODE_DETAILED = 2
)

Variables

This section is empty.

Functions

func CopyNode

func CopyNode(node *html.Node, deep bool) *html.Node

func DefaultTemplate

func DefaultTemplate() []byte

DefaultTemplate returns the binary data of the default template file.

func DetailedTemplate

func DetailedTemplate() []byte

DetailedTemplate returns the binary data of the detailed template file.

func GetStoplist

func GetStoplist(language string) (map[string]bool, error)

func IsGood

func IsGood(args ...interface{}) (result bool)

func ReadStoplist

func ReadStoplist(filename string) (map[string]bool, error)

func RegisterStoplist

func RegisterStoplist(name string, resourceFunc ResourceFunc)

Types

type Paragraph

type Paragraph struct {
	DomPath         string
	TextNodes       []string
	WordCount       int
	LinkedCharCount int
	TagCount        int
	Text            string
	StopwordCount   int
	StopwordDensity float64
	LinkDensity     float64
	Heading         bool
	CfClass         string
	Class           string
}
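The StopwordDensity and LinkDensity fields are ratios. The sketch below shows one plausible way such values could be derived, assuming stopword density is stopwords divided by total words and link density is linked characters divided by total characters. Both functions are hypothetical and are not taken from this package's source.

```go
package main

import (
	"fmt"
	"strings"
)

// stopwordDensity returns the fraction of whitespace-separated words in text
// that appear in stoplist. Hypothetical helper for illustration only.
func stopwordDensity(text string, stoplist map[string]bool) float64 {
	words := strings.Fields(text)
	if len(words) == 0 {
		return 0
	}
	count := 0
	for _, w := range words {
		if stoplist[strings.ToLower(w)] {
			count++
		}
	}
	return float64(count) / float64(len(words))
}

// linkDensity returns the fraction of the text's characters that sit inside
// links. Hypothetical helper for illustration only.
func linkDensity(text string, linkedCharCount int) float64 {
	n := len([]rune(text))
	if n == 0 {
		return 0
	}
	return float64(linkedCharCount) / float64(n)
}

func main() {
	stoplist := map[string]bool{"the": true, "on": true}
	fmt.Println(stopwordDensity("The cat sat on the mat", stoplist)) // prints: 0.5
	fmt.Println(linkDensity("abcd", 1))                              // prints: 0.25
}
```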

type Reader

type Reader struct {
	LengthLow          int
	LengthHigh         int
	Stoplist           map[string]bool
	StopwordsLow       float64
	StopwordsHigh      float64
	MaxLinkDensity     float64
	MaxHeadingDistance int
	NoHeadings         bool
	// contains filtered or unexported fields
}

func NewReader

func NewReader(r io.Reader) *Reader

func (*Reader) ReadAll

func (r *Reader) ReadAll() ([]*Paragraph, error)

type ResourceFunc

type ResourceFunc func() ([]byte, error)
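The signatures of RegisterStoplist, GetStoplist and ResourceFunc suggest a name-to-loader registry. The sketch below shows how such a registry might work; it is an assumption about the design, not the package's actual implementation. The register and lookup functions, the registry variable, and the "Tiny" stoplist are all hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// ResourceFunc has the same signature as the type documented above.
type ResourceFunc func() ([]byte, error)

// registry is a hypothetical name-to-loader table for this sketch.
var registry = map[string]ResourceFunc{}

// register stores a loader under a name (stand-in for RegisterStoplist).
func register(name string, f ResourceFunc) { registry[name] = f }

// lookup loads a registered resource and parses it as one stopword per line
// (stand-in for GetStoplist).
func lookup(name string) (map[string]bool, error) {
	f, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("stoplist %q is not registered", name)
	}
	data, err := f()
	if err != nil {
		return nil, err
	}
	stoplist := make(map[string]bool)
	scanner := bufio.NewScanner(bytes.NewReader(data))
	for scanner.Scan() {
		if w := scanner.Text(); w != "" {
			stoplist[w] = true
		}
	}
	return stoplist, scanner.Err()
}

func main() {
	register("Tiny", func() ([]byte, error) { return []byte("a\nthe\n"), nil })

	sl, err := lookup("Tiny")
	fmt.Println(len(sl), err) // prints: 2 <nil>

	_, err = lookup("Missing")
	fmt.Println(err) // prints: stoplist "Missing" is not registered
}
```

Deferring the load behind a function means the stoplist bytes are only produced when a language is actually requested.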

type Writer

type Writer struct {
	Mode          int
	NoBoilerplate bool
	Stoplist      map[string]bool
	// contains filtered or unexported fields
}

func NewWriter

func NewWriter(w io.Writer) *Writer

func (*Writer) OutputDebug

func (w *Writer) OutputDebug(paragraphs []*Paragraph)

func (*Writer) WriteAll

func (w *Writer) WriteAll(paragraphs []*Paragraph) error

Directories

Path Synopsis
example
