package html
v0.0.0-...-0906917 Latest
Warning

This package is not in the latest version of its module.

Published: Aug 6, 2021 License: MIT Imports: 7 Imported by: 0

Documentation


Constants

const (
	// We remember a few special node types when descending into their
	// children.
	AncestorArticle = 1 << iota
	AncestorAside
	AncestorBlockquote
	AncestorList
)
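
Because the flags are declared with 1 << iota, each one occupies its own bit (1, 2, 4, 8), which is what allows several of them to be combined in a single int such as Chunk.Ancestors. The following is a minimal sketch of building and testing such a mask. The constants are redeclared locally so the example compiles on its own, and hasAncestor is a hypothetical helper, not part of this package.

```go
package main

import "fmt"

// Local copies of the ancestor flags so this sketch is self-contained.
const (
	AncestorArticle = 1 << iota // 1
	AncestorAside               // 2
	AncestorBlockquote          // 4
	AncestorList                // 8
)

// hasAncestor is a hypothetical helper: it reports whether flag is set in mask.
func hasAncestor(mask, flag int) bool {
	return mask&flag != 0
}

func main() {
	// A chunk found inside a list that is itself inside an article.
	mask := AncestorArticle | AncestorList

	fmt.Println(hasAncestor(mask, AncestorList))  // true
	fmt.Println(hasAncestor(mask, AncestorAside)) // false
}
```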
const (
	IterNext = iota // Keep going.
	IterSkip        // Skip the current subtree, proceed with the next sibling.
	IterStop        // Skip everything.
)
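
The package does not export the traversal function that consumes these values, so the following is only a sketch of how a callback-driven tree walk could honor them. The node type, walk, sampleTree and visitedNames are all hypothetical; node stands in for *html.Node.

```go
package main

import "fmt"

// Local copies of the iteration constants so this sketch is self-contained.
const (
	IterNext = iota // Keep going.
	IterSkip        // Skip the current subtree, proceed with the next sibling.
	IterStop        // Skip everything.
)

// node is a hypothetical minimal tree node.
type node struct {
	name     string
	children []*node
}

// walk visits n and its descendants in document order. It returns false
// as soon as the callback has returned IterStop, which ends the whole walk.
func walk(n *node, visit func(*node) int) bool {
	switch visit(n) {
	case IterStop:
		return false
	case IterSkip:
		return true // do not descend, but let the caller continue with siblings
	}
	for _, c := range n.children {
		if !walk(c, visit) {
			return false
		}
	}
	return true
}

// sampleTree builds body -> [nav -> [a], p -> [em], footer].
func sampleTree() *node {
	return &node{name: "body", children: []*node{
		{name: "nav", children: []*node{{name: "a"}}},
		{name: "p", children: []*node{{name: "em"}}},
		{name: "footer"},
	}}
}

// visitedNames records every visited node, skipping the subtree of the node
// named skip and stopping entirely at the node named stop.
func visitedNames(root *node, skip, stop string) []string {
	var names []string
	walk(root, func(n *node) int {
		names = append(names, n.name)
		switch n.name {
		case skip:
			return IterSkip
		case stop:
			return IterStop
		}
		return IterNext
	})
	return names
}

func main() {
	fmt.Println(visitedNames(sampleTree(), "nav", "p")) // [body nav p]
}
```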

Variables

var (
	ErrNoParent = errors.New("no parent")
	ErrNoText   = errors.New("no text")
	ErrNoBlock  = errors.New("no block node after parent")
)

Errors returned by the NewChunk function.

var (
	ErrNoHTML = errors.New("missing html element")
	ErrNoHead = errors.New("missing head element")
	ErrNoBody = errors.New("missing body element")
)

Errors returned during Document parsing.
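
Since these are sentinel values created with errors.New, callers can tell them apart with errors.Is. The sketch below redeclares the parsing errors locally and uses a hypothetical checkStructure function in place of the real parser. Whether the package wraps its errors before returning them is not documented here; errors.Is works either way.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of the sentinel errors so this sketch is self-contained.
var (
	ErrNoHTML = errors.New("missing html element")
	ErrNoHead = errors.New("missing head element")
	ErrNoBody = errors.New("missing body element")
)

// checkStructure is a hypothetical stand-in for document parsing. It wraps
// the sentinel error to show that errors.Is still matches wrapped values.
func checkStructure(hasHTML, hasHead, hasBody bool) error {
	switch {
	case !hasHTML:
		return fmt.Errorf("checkStructure: %w", ErrNoHTML)
	case !hasHead:
		return fmt.Errorf("checkStructure: %w", ErrNoHead)
	case !hasBody:
		return fmt.Errorf("checkStructure: %w", ErrNoBody)
	}
	return nil
}

// describe maps an error to a short label by comparing against the sentinels.
func describe(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrNoHTML):
		return "no html"
	case errors.Is(err, ErrNoHead):
		return "no head"
	case errors.Is(err, ErrNoBody):
		return "no body"
	}
	return "other error"
}

func main() {
	fmt.Println(describe(checkStructure(true, false, true)))
	fmt.Println(describe(checkStructure(true, true, true)))
}
```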

Functions

This section is empty.

Types

type Chunk

type Chunk struct {
	Prev      *Chunk     // previous chunk
	Next      *Chunk     // next chunk
	Text      *util.Text // text of this chunk
	Base      *html.Node // element node which contained this chunk
	Block     *html.Node // parent block node of base node
	Container *html.Node // parent block node of block node
	Classes   []string   // list of classes this chunk belongs to
	Ancestors int        // bitmask of the ancestors of this chunk
	LinkText  float32    // link text to normal text ratio.
}

A Chunk is a run of consecutive text found in the HTML document. It combines the content of one or more text nodes (nodes of type html.TextNode). Leading, trailing and repeated whitespace is discarded, but the separation between words is preserved. Each Chunk therefore contains actual text; text nodes that consist only of whitespace do not produce a Chunk.
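
The whitespace rule can be illustrated with strings.Fields, which splits on any run of whitespace and drops empty results. This is a sketch under assumptions: chunkText is a hypothetical helper, the real util.Text type may normalize differently, and joining every node with a space is a simplification that would also separate words split across adjacent text nodes.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkText is a hypothetical helper. It merges the contents of several text
// nodes into one string with single spaces between words. The boolean is
// false when the nodes hold only whitespace, in which case no chunk results.
func chunkText(nodes []string) (string, bool) {
	var words []string
	for _, n := range nodes {
		words = append(words, strings.Fields(n)...)
	}
	if len(words) == 0 {
		return "", false
	}
	return strings.Join(words, " "), true
}

func main() {
	text, ok := chunkText([]string{"  Hello,\n", "\t", "world!  "})
	fmt.Printf("%q %v\n", text, ok) // "Hello, world!" true

	_, ok = chunkText([]string{" \n ", "\t"})
	fmt.Println(ok) // false
}
```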

func NewChunk

func NewChunk(doc *Document, n *html.Node) (*Chunk, error)

func (*Chunk) GetChildTypes

func (ch *Chunk) GetChildTypes() []string

GetChildTypes returns the HTML element types of the Chunk's children as a slice of strings.

func (*Chunk) GetSiblingTypes

func (ch *Chunk) GetSiblingTypes() []string

GetSiblingTypes returns the HTML element types of the Chunk's siblings as a slice of strings.

func (*Chunk) IsHeading

func (ch *Chunk) IsHeading() bool

type Document

type Document struct {
	Title  *util.Text // the <title>...</title> text.
	Chunks []*Chunk   // all chunks found in this document.
	// contains filtered or unexported fields
}

Document is a parsed HTML document. It holds the document title, the chunks found in the document, and unexported pointers to the html, head and body nodes.

func NewDocument

func NewDocument(r io.Reader) (*Document, error)

NewDocument parses the HTML data provided through an io.Reader interface.

func (*Document) GetClassStats

func (doc *Document) GetClassStats() map[string]*TextStat

GetClassStats groups the document chunks by their classes (defined by the class attribute of HTML nodes) and calculates TextStats for each class.
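
The grouping described above can be sketched as a map keyed by class name, where a chunk with several classes contributes to each of them. This is not the package's implementation: chunk is a reduced, hypothetical stand-in for Chunk with pre-computed word and sentence counts, and classStats is a hypothetical free function.

```go
package main

import "fmt"

// TextStat mirrors the exported struct so this sketch is self-contained.
type TextStat struct {
	Words     int
	Sentences int
	Count     int
}

// chunk is a hypothetical, reduced stand-in for Chunk.
type chunk struct {
	Classes   []string
	Words     int
	Sentences int
}

// classStats accumulates one TextStat per class name. A chunk that carries
// several classes is counted once under each of them.
func classStats(chunks []chunk) map[string]*TextStat {
	stats := make(map[string]*TextStat)
	for _, ch := range chunks {
		for _, class := range ch.Classes {
			ts, ok := stats[class]
			if !ok {
				ts = &TextStat{}
				stats[class] = ts
			}
			ts.Words += ch.Words
			ts.Sentences += ch.Sentences
			ts.Count++
		}
	}
	return stats
}

func main() {
	stats := classStats([]chunk{
		{Classes: []string{"post", "intro"}, Words: 10, Sentences: 2},
		{Classes: []string{"post"}, Words: 5, Sentences: 1},
		{Classes: []string{"nav"}, Words: 3, Sentences: 0},
	})
	fmt.Printf("%+v\n", *stats["post"]) // {Words:15 Sentences:3 Count:2}
}
```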

func (*Document) GetClusterStats

func (doc *Document) GetClusterStats() map[*Chunk]*TextStat

GetClusterStats groups the document chunks by common ancestors and calculates TextStats for each group of chunks.

type TextStat

type TextStat struct {
	Words     int // total number of words
	Sentences int // total number of sentences
	Count     int // number of texts used to calculate these stats
}

TextStat contains the number of words and sentences found in text.
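
How the package counts words and sentences is not documented on this page. As an illustration only, the sketch below treats whitespace-separated tokens as words and counts a sentence for every word that ends in ".", "!" or "?". The add method is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// TextStat mirrors the exported struct so this sketch is self-contained.
type TextStat struct {
	Words     int // total number of words
	Sentences int // total number of sentences
	Count     int // number of texts used to calculate these stats
}

// add is a hypothetical method that folds one text into the statistics.
// Assumption: a sentence ends at any word whose last character is '.', '!' or '?'.
func (ts *TextStat) add(text string) {
	words := strings.Fields(text)
	ts.Words += len(words)
	for _, w := range words {
		if strings.HasSuffix(w, ".") || strings.HasSuffix(w, "!") || strings.HasSuffix(w, "?") {
			ts.Sentences++
		}
	}
	ts.Count++
}

func main() {
	var ts TextStat
	ts.add("Hello world. How are you?")
	ts.add("Fine!")
	fmt.Printf("%+v\n", ts) // {Words:6 Sentences:3 Count:2}
}
```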
