Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
	ErrNoChunks    = errors.New("document contains no chunks")
	ErrEmptyResult = errors.New("nothing found")
)
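Assuming these are the errors returned by (*Extractor).Extract (this page does not state it explicitly), a caller could tell them apart roughly as in the fragment below. It presumes an "errors" import and an extractor ext plus a parsed document doc already in scope, with the package qualifier omitted.

	chunks, err := ext.Extract(doc)
	switch {
	case errors.Is(err, ErrNoChunks):
		// the document could not be split into text chunks at all
	case errors.Is(err, ErrEmptyResult):
		// chunks were found, but none were classified as relevant
	case err != nil:
		// some other failure
	}
	_ = chunks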
Functions ¶
This section is empty.
Types ¶
type Extractor ¶
type Extractor struct {
Labels []bool
}
Extractor utilizes the trained model to extract relevant html.Chunks from an html.Document.
func NewExtractor ¶
func NewExtractor() *Extractor
NewExtractor creates and initializes a new Extractor.
func (*Extractor) Extract ¶
Extract returns a list of relevant text chunks found in doc.
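Neither the signature of Extract nor the html package's API is shown on this page, so the following end-to-end sketch rests on assumptions: that the sibling html package offers a NewDocument constructor taking an io.Reader, that Extract accepts the resulting *html.Document and returns ([]*html.Chunk, error), and that a Chunk carries its text in a Text field. The import paths and package names are placeholders, not the real module path.

	package main

	import (
		"fmt"
		"log"
		"strings"

		"example.com/extractor/html"  // placeholder import paths; substitute
		"example.com/extractor/model" // the real module path of this package
	)

	func main() {
		page := `<html><body>
			<nav><a href="/">Home</a></nav>
			<article><p>The actual article text lives here.</p></article>
		</body></html>`

		// Assumption: the html package parses a page from an io.Reader.
		doc, err := html.NewDocument(strings.NewReader(page))
		if err != nil {
			log.Fatal(err)
		}

		ext := model.NewExtractor()

		// Assumption: Extract takes the parsed document and returns the chunks
		// classified as relevant, or ErrNoChunks / ErrEmptyResult on failure.
		chunks, err := ext.Extract(doc)
		if err != nil {
			log.Fatal(err)
		}
		for _, chunk := range chunks {
			fmt.Println(chunk.Text) // assumption: Text holds the chunk's text
		}
	}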
How it works ¶
This function creates a feature vector for each chunk found in doc. A feature vector is a numerical representation of a chunk's properties, such as its HTML element type, its parent's element type, its word count, its sentence count, and so on.
A logistic regression model calculates a score for each of these feature vectors. Then, in a stacked (meta/ensemble) learning step, a second kind of feature vector is built from these scores. This second vector is fed to a random forest, and the random forest's predictions determine which chunks end up in the result.
By now you might have noticed that I'm exceptionally bad at naming and describing things properly.
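To make that two-stage flow concrete, here is a toy sketch of the idea. Nothing in it comes from this package: the features, the logistic-regression weights, the choice of neighbor scores as the second feature vector, and the two hard-coded "trees" standing in for a random forest are all invented for illustration.

	package main

	import (
		"fmt"
		"math"
	)

	// features is an invented stage-one feature vector for a single chunk.
	type features struct {
		isParagraph float64 // 1 if the chunk's element is a <p>
		words       float64 // word count
		sentences   float64 // sentence count
	}

	// logisticScore stands in for stage one: a logistic regression maps the
	// chunk's feature vector to a relevance score in (0, 1). The weights are
	// arbitrary placeholders, not the package's trained model.
	func logisticScore(f features) float64 {
		z := 1.5*f.isParagraph + 0.05*f.words + 0.3*f.sentences - 2.0
		return 1.0 / (1.0 + math.Exp(-z))
	}

	// forestPredict stands in for stage two: a second feature vector, built
	// here from the chunk's score and its neighbors' scores, goes to an
	// ensemble of decision trees. A chunk is kept if at least one of the two
	// toy trees votes for it.
	func forestPredict(prev, cur, next float64) bool {
		tree1 := cur > 0.5
		tree2 := cur > 0.35 && (prev > 0.5 || next > 0.5)
		return tree1 || tree2
	}

	func main() {
		chunks := []features{
			{isParagraph: 0, words: 4, sentences: 1},  // e.g. a navigation link
			{isParagraph: 1, words: 60, sentences: 4}, // article text
			{isParagraph: 1, words: 45, sentences: 3}, // article text
		}

		// Stage one: score every chunk individually.
		scores := make([]float64, len(chunks))
		for i, f := range chunks {
			scores[i] = logisticScore(f)
		}

		// Stage two: build the second feature vector from neighboring scores
		// and let the "forest" decide which chunks are relevant.
		for i, score := range scores {
			prev, next := 0.0, 0.0
			if i > 0 {
				prev = scores[i-1]
			}
			if i < len(scores)-1 {
				next = scores[i+1]
			}
			fmt.Printf("chunk %d: score=%.2f relevant=%v\n", i, score, forestPredict(prev, score, next))
		}
	}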