package html

v0.0.0-...-dd9c64c Published: Mar 13, 2015 License: MIT Imports: 7 Imported by: 0

Repository

github.com/slyrz/newscat

Documentation ¶

Index ¶

Constants
Variables
type Chunk
- func NewChunk(doc *Document, n *html.Node) (*Chunk, error)
- func (ch *Chunk) GetChildTypes() []string
- func (ch *Chunk) GetSiblingTypes() []string
- func (ch *Chunk) IsHeading() bool
type Document
- func NewDocument(r io.Reader) (*Document, error)
- func (doc *Document) GetClassStats() map[string]*TextStat
- func (doc *Document) GetClusterStats() map[*Chunk]*TextStat
type TextStat

Constants ¶

const (
	// We remember a few special node types when descending into their
	// children.
	AncestorArticle = 1 << iota
	AncestorAside
	AncestorBlockquote
	AncestorList
)
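
Because each constant is declared with `1 << iota`, every ancestor type occupies its own bit, so several of them can be combined in a single int such as the Chunk.Ancestors field. The following is a minimal sketch of how such a bitmask might be tested. The constants are redeclared locally so the example runs on its own, and `hasAncestor` is a hypothetical helper that is not part of the package.

```go
package main

import "fmt"

// Local copies of the package constants, redeclared so that the sketch is
// self-contained.
const (
	AncestorArticle = 1 << iota // 1
	AncestorAside               // 2
	AncestorBlockquote          // 4
	AncestorList                // 8
)

// hasAncestor is a hypothetical helper (not part of the package). It reports
// whether every bit of flag is set in mask.
func hasAncestor(mask, flag int) bool {
	return mask&flag == flag
}

func main() {
	// A chunk that sits inside a list within an article.
	mask := AncestorArticle | AncestorList

	fmt.Println(hasAncestor(mask, AncestorArticle)) // true
	fmt.Println(hasAncestor(mask, AncestorAside))   // false
}
```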

const (
	IterNext = iota // Keep going.
	IterSkip        // Skip the current subtree, proceed with the next sibling.
	IterStop        // Skip everything.
)
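
The comments on these constants suggest that they are return values for a callback used during tree traversal. The package's own iterator is not shown in this documentation, so the sketch below only illustrates the general pattern: `node`, `walk` and `visitedNames` are hypothetical names, and the constants are redeclared locally.

```go
package main

import "fmt"

// Local copies of the package constants, so the sketch compiles on its own.
const (
	IterNext = iota // Keep going.
	IterSkip        // Skip the current subtree, proceed with the next sibling.
	IterStop        // Skip everything.
)

// node is a hypothetical stand-in for *html.Node.
type node struct {
	name     string
	children []*node
}

// walk calls visit on n and then on its descendants in document order.
// It returns false once visit has returned IterStop.
func walk(n *node, visit func(*node) int) bool {
	switch visit(n) {
	case IterStop:
		return false
	case IterSkip:
		return true // leave the children alone, let the caller continue
	}
	for _, c := range n.children {
		if !walk(c, visit) {
			return false
		}
	}
	return true
}

// visitedNames walks root, skipping nav subtrees and stopping at footer.
func visitedNames(root *node) []string {
	var names []string
	walk(root, func(n *node) int {
		names = append(names, n.name)
		switch n.name {
		case "nav":
			return IterSkip
		case "footer":
			return IterStop
		}
		return IterNext
	})
	return names
}

func main() {
	root := &node{name: "body", children: []*node{
		{name: "nav", children: []*node{{name: "a"}}},
		{name: "p", children: []*node{{name: "em"}}},
		{name: "footer"},
		{name: "script"},
	}}
	fmt.Println(visitedNames(root)) // [body nav p em footer]
}
```

The child of nav is never visited because of IterSkip, and the script node is never reached because IterStop ends the whole traversal.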

Variables ¶

var (
	ErrNoParent = errors.New("no parent")
	ErrNoText   = errors.New("no text")
	ErrNoBlock  = errors.New("no block node after parent")
)

Errors returned by the NewChunk function.

var (
	ErrNoHTML = errors.New("missing html element")
	ErrNoHead = errors.New("missing head element")
	ErrNoBody = errors.New("missing body element")
)

Errors returned during Document parsing.

Functions ¶

This section is empty.

Types ¶

type Chunk ¶

type Chunk struct {
	Prev      *Chunk     // previous chunk
	Next      *Chunk     // next chunk
	Text      *util.Text // text of this chunk
	Base      *html.Node // element node which contained this chunk
	Block     *html.Node // parent block node of base node
	Container *html.Node // parent block node of block node
	Classes   []string   // list of classes this chunk belongs to
	Ancestors int        // bitmask of the ancestors of this chunk
	LinkText  float32    // link text to normal text ratio.
}

A Chunk is a chunk of consecutive text found in the HTML document. It combines the content of one or more text nodes (nodes of type html.TextNode). Whitespace is ignored, but inter-word separation is preserved. Each Chunk therefore contains actual text; whitespace-only text nodes do not result in Chunks.
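
The whitespace rule can be illustrated with the standard library alone. The sketch below is an assumption about the behavior described above, not the package's actual code: `chunkText` is a hypothetical helper that collapses whitespace runs to single spaces and reports whether any text remains.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkText is a hypothetical helper (not part of the package). It collapses
// every run of whitespace into a single space, trims both ends, and reports
// whether any actual text is left.
func chunkText(s string) (string, bool) {
	text := strings.Join(strings.Fields(s), " ")
	return text, text != ""
}

func main() {
	text, ok := chunkText("  Breaking\n\tnews   today ")
	fmt.Printf("%q %v\n", text, ok) // "Breaking news today" true

	// A whitespace-only text node yields no text, so no chunk would be built.
	text, ok = chunkText(" \n\t ")
	fmt.Printf("%q %v\n", text, ok) // "" false
}
```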

func NewChunk ¶

func NewChunk(doc *Document, n *html.Node) (*Chunk, error)

func (*Chunk) GetChildTypes ¶

func (ch *Chunk) GetChildTypes() []string

GetChildTypes returns a list of strings containing the HTML element types of the Chunk's children.

func (*Chunk) GetSiblingTypes ¶

func (ch *Chunk) GetSiblingTypes() []string

GetSiblingTypes returns a list of strings containing the HTML element types of the Chunk's siblings.

func (*Chunk) IsHeading ¶

func (ch *Chunk) IsHeading() bool

type Document ¶

type Document struct {
	Title  *util.Text // the <title>...</title> text.
	Chunks []*Chunk   // all chunks found in this document.
	// contains filtered or unexported fields
}

Document is a parsed HTML document. It exposes the document title and the chunks found in the document, and holds unexported pointers to the html, head and body nodes.

func NewDocument ¶

func NewDocument(r io.Reader) (*Document, error)

NewDocument parses the HTML data provided through an io.Reader interface.

func (*Document) GetClassStats ¶

func (doc *Document) GetClassStats() map[string]*TextStat

GetClassStats groups the document chunks by their classes (defined by the class attribute of HTML nodes) and calculates TextStats for each class.
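
The grouping described above can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: `chunk` is a hypothetical simplified type that already carries word and sentence counts, `classStats` is a hypothetical function, and Count is assumed to increase by one per chunk.

```go
package main

import "fmt"

// TextStat mirrors the fields of the package's TextStat type.
type TextStat struct {
	Words     int
	Sentences int
	Count     int
}

// chunk is a hypothetical, simplified stand-in for the package's Chunk.
type chunk struct {
	classes   []string
	words     int
	sentences int
}

// classStats is a hypothetical function that accumulates one TextStat per
// class name. A chunk with several classes contributes to each of them.
func classStats(chunks []chunk) map[string]*TextStat {
	stats := make(map[string]*TextStat)
	for _, ch := range chunks {
		for _, class := range ch.classes {
			st, ok := stats[class]
			if !ok {
				st = &TextStat{}
				stats[class] = st
			}
			st.Words += ch.words
			st.Sentences += ch.sentences
			st.Count++
		}
	}
	return stats
}

func main() {
	stats := classStats([]chunk{
		{classes: []string{"article", "lead"}, words: 20, sentences: 2},
		{classes: []string{"article"}, words: 35, sentences: 3},
		{classes: []string{"footer"}, words: 5, sentences: 1},
	})
	fmt.Println(*stats["article"]) // {55 5 2}
	fmt.Println(*stats["lead"])    // {20 2 1}
}
```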

func (*Document) GetClusterStats ¶

func (doc *Document) GetClusterStats() map[*Chunk]*TextStat

GetClusterStats groups the document chunks by common ancestors and calculates TextStats for each group of chunks.

type TextStat ¶

type TextStat struct {
	Words     int // total number of words
	Sentences int // total number of sentences
	Count     int // number of texts used to calculate these stats
}

TextStat contains the number of words and sentences found in text.
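
How the package counts words and sentences is not documented here. As a rough illustration of what such a statistic might hold, the sketch below uses a deliberately naive rule: words are whitespace-separated fields, and a sentence ends at a word that finishes with ".", "!" or "?". The `countText` function and this rule are assumptions, not the package's method.

```go
package main

import (
	"fmt"
	"strings"
)

// TextStat mirrors the fields of the package's TextStat type.
type TextStat struct {
	Words     int // total number of words
	Sentences int // total number of sentences
	Count     int // number of texts used to calculate these stats
}

// countText is a hypothetical helper that fills a TextStat for one text
// using a naive end-of-sentence rule.
func countText(text string) TextStat {
	st := TextStat{Count: 1}
	for _, w := range strings.Fields(text) {
		st.Words++
		if strings.HasSuffix(w, ".") || strings.HasSuffix(w, "!") || strings.HasSuffix(w, "?") {
			st.Sentences++
		}
	}
	return st
}

func main() {
	fmt.Println(countText("Go is fun. Is it? Yes!")) // {6 3 1}
}
```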
