Documentation ¶
Constants ¶
const (
	// We remember a few special node types when descending into their
	// children.
	AncestorArticle = 1 << iota
	AncestorAside
	AncestorBlockquote
	AncestorList
)
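Because the Ancestor constants are single-bit flags, several of them can be combined into one int, which is how the Ancestors field of a Chunk is described further down. The following is a minimal sketch of setting and testing such a mask. The constants are copied locally so the example compiles on its own, and hasAncestor is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// Local copies of the package's ancestor flags, reproduced here so the
// sketch compiles on its own.
const (
	AncestorArticle = 1 << iota
	AncestorAside
	AncestorBlockquote
	AncestorList
)

// hasAncestor is a hypothetical helper (not part of the package). It
// reports whether every bit of flag is set in mask.
func hasAncestor(mask, flag int) bool {
	return mask&flag == flag
}

func main() {
	// A chunk inside a list that is itself inside an article.
	mask := AncestorArticle | AncestorList

	fmt.Println(hasAncestor(mask, AncestorArticle))    // true
	fmt.Println(hasAncestor(mask, AncestorBlockquote)) // false
	fmt.Println(mask)                                  // 9
}
```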
const (
	IterNext = iota // Keep going.
	IterSkip        // Skip the current subtree, proceed with the next sibling.
	IterStop        // Skip everything.
)
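The Iter constants read like return values of a visitor callback that steers a tree traversal. The page does not show which function consumes them, so the following is only a sketch of the idea under that assumption: node, walk, visitedNames and sampleTree are hypothetical names, and node is a stand-in for *html.Node.

```go
package main

import "fmt"

// Local copies of the package's iteration constants.
const (
	IterNext = iota // Keep going.
	IterSkip        // Skip the current subtree, proceed with the next sibling.
	IterStop        // Skip everything.
)

// node is a hypothetical stand-in for *html.Node.
type node struct {
	name     string
	children []*node
}

// walk visits n and its descendants in document order. The callback's
// return value steers the traversal. walk returns false once the
// callback has asked to stop.
func walk(n *node, visit func(*node) int) bool {
	switch visit(n) {
	case IterStop:
		return false
	case IterSkip:
		return true // do not descend, but let the caller continue
	}
	for _, c := range n.children {
		if !walk(c, visit) {
			return false
		}
	}
	return true
}

// visitedNames records the visited node names, skipping script subtrees
// and stopping the whole traversal at the first footer.
func visitedNames(root *node) []string {
	var names []string
	walk(root, func(n *node) int {
		switch n.name {
		case "script":
			return IterSkip
		case "footer":
			return IterStop
		}
		names = append(names, n.name)
		return IterNext
	})
	return names
}

// sampleTree builds a small tree for the demonstration.
func sampleTree() *node {
	return &node{name: "body", children: []*node{
		{name: "script", children: []*node{{name: "code"}}},
		{name: "p", children: []*node{{name: "em"}}},
		{name: "footer", children: []*node{{name: "small"}}},
		{name: "div"},
	}}
}

func main() {
	fmt.Println(visitedNames(sampleTree())) // [body p em]
}
```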
Variables ¶
var (
	ErrNoParent = errors.New("no parent")
	ErrNoText   = errors.New("no text")
	ErrNoBlock  = errors.New("no block node after parent")
)
Errors returned by the NewChunk function.
var (
	ErrNoHTML = errors.New("missing html element")
	ErrNoHead = errors.New("missing head element")
	ErrNoBody = errors.New("missing body element")
)
Errors returned during Document parsing.
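Sentinel errors like these are usually compared with errors.Is, which also matches them after they have been wrapped. The sketch below assumes nothing about how the package returns them: the variables are local copies, and describe is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of the sentinel errors listed above.
var (
	ErrNoHTML = errors.New("missing html element")
	ErrNoHead = errors.New("missing head element")
	ErrNoBody = errors.New("missing body element")
)

// describe is a hypothetical helper that maps a parse error to a message.
func describe(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrNoHTML), errors.Is(err, ErrNoHead), errors.Is(err, ErrNoBody):
		return "incomplete document: " + err.Error()
	default:
		return "unexpected error: " + err.Error()
	}
}

func main() {
	// Wrapping with %w keeps the sentinel reachable through errors.Is.
	wrapped := fmt.Errorf("parsing page: %w", ErrNoBody)
	fmt.Println(describe(wrapped))
	fmt.Println(describe(errors.New("boom")))
}
```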
Functions ¶
This section is empty.
Types ¶
type Chunk ¶
type Chunk struct {
	Prev      *Chunk     // previous chunk
	Next      *Chunk     // next chunk
	Text      *util.Text // text of this chunk
	Base      *html.Node // element node which contained this chunk
	Block     *html.Node // parent block node of base node
	Container *html.Node // parent block node of block node
	Classes   []string   // list of classes this chunk belongs to
	Ancestors int        // bitmask of the ancestors of this chunk
	LinkText  float32    // link text to normal text ratio.
}
A Chunk is a run of consecutive text found in the HTML document. It combines the content of one or more text nodes (html.TextNode). Whitespace is ignored, but inter-word separation is preserved. Consequently, every Chunk contains actual text, and whitespace-only text nodes do not produce Chunks.
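The whitespace rule can be illustrated with the standard library alone. The package stores text in util.Text, whose API is not shown on this page, so the sketch below is only an approximation of the described behavior: normalize and chunkTexts are hypothetical helpers built on strings.Fields.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize collapses every run of whitespace into a single space and
// trims both ends, so words stay separated but layout whitespace is lost.
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// chunkTexts normalizes each raw text-node string and drops those that
// contain nothing but whitespace.
func chunkTexts(raw []string) []string {
	var out []string
	for _, s := range raw {
		if t := normalize(s); t != "" {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	raw := []string{"  Hello,\n\tworld  ", " \n ", "second   chunk"}
	fmt.Printf("%q\n", chunkTexts(raw)) // ["Hello, world" "second chunk"]
}
```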
func (*Chunk) GetChildTypes ¶
GetChildTypes returns a list of strings containing the HTML element types of the Chunk's children.
func (*Chunk) GetSiblingTypes ¶
GetSiblingTypes returns a list of strings containing the HTML element types of the Chunk's siblings.
type Document ¶
type Document struct {
	Title  *util.Text // the <title>...</title> text.
	Chunks []*Chunk   // all chunks found in this document.
	// contains filtered or unexported fields
}
Document is a parsed HTML document. It stores the extracted document title and holds unexported pointers to the html, head and body nodes.
func NewDocument ¶
NewDocument parses the HTML data provided through an io.Reader interface.
func (*Document) GetClassStats ¶
GetClassStats groups the document chunks by their classes (defined by the class attribute of HTML nodes) and calculates TextStats for each class.
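The grouping step can be sketched as a map keyed by class name. The fields of TextStats are not shown on this page, so the example below uses hypothetical stand-ins: chunk is a reduced Chunk with a plain string text, and classStat counts only chunks and words. It illustrates the grouping idea, not the package's actual statistics.

```go
package main

import (
	"fmt"
	"strings"
)

// chunk is a hypothetical, reduced stand-in for Chunk.
type chunk struct {
	text    string
	classes []string
}

// classStat is a hypothetical stand-in for TextStats.
type classStat struct {
	Chunks int // number of chunks carrying the class
	Words  int // total words across those chunks
}

// classStats groups chunks by class. A chunk with several classes
// contributes to each of them.
func classStats(chunks []chunk) map[string]*classStat {
	stats := make(map[string]*classStat)
	for _, c := range chunks {
		words := len(strings.Fields(c.text))
		for _, class := range c.classes {
			s, ok := stats[class]
			if !ok {
				s = &classStat{}
				stats[class] = s
			}
			s.Chunks++
			s.Words += words
		}
	}
	return stats
}

func main() {
	chunks := []chunk{
		{text: "Breaking news today", classes: []string{"article", "lead"}},
		{text: "More details follow in the text", classes: []string{"article"}},
		{text: "Share this", classes: []string{"social"}},
	}
	stats := classStats(chunks)
	fmt.Println(*stats["article"]) // {2 9}
	fmt.Println(*stats["social"])  // {1 2}
}
```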
func (*Document) GetClusterStats ¶
GetClusterStats groups the document chunks by common ancestors and calculates TextStats for each group of chunks.