Documentation ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BlockProximityFusion ¶
type BlockProximityFusion struct {
// contains filtered or unexported fields
}
BlockProximityFusion fuses adjacent blocks if the distance between them (measured in blocks) does not exceed a given limit. This probably only makes sense when an upstream filter has already removed some blocks.
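The idea of distance-limited fusion can be sketched as follows. This is a minimal illustration and not the package's implementation: the `block` type, its `first`/`last` position fields, the `fuseByProximity` function, and the `maxDistance` parameter are all hypothetical names introduced here, and the sketch assumes each block remembers its position in the original document.

```go
package main

import "fmt"

// block is a hypothetical stand-in for a text block; it is not the
// package's real type. first and last are the block's positions in the
// original document, before any upstream filter removed blocks.
type block struct {
	text        string
	first, last int
}

// fuseByProximity merges a block into the previous output block when the
// gap between them, counted in original block positions, is at most
// maxDistance (an assumed parameter).
func fuseByProximity(blocks []block, maxDistance int) []block {
	var out []block
	for _, b := range blocks {
		if n := len(out); n > 0 && b.first-out[n-1].last <= maxDistance {
			out[n-1].text += "\n" + b.text
			out[n-1].last = b.last
			continue
		}
		out = append(out, b)
	}
	return out
}

func main() {
	blocks := []block{
		{"intro", 0, 0},
		{"body", 1, 1},
		{"footer", 5, 5}, // positions 2..4 were removed upstream
	}
	for _, b := range fuseByProximity(blocks, 1) {
		fmt.Printf("%d-%d %q\n", b.first, b.last, b.text)
	}
}
```

With a limit of 1, "intro" and "body" are fused because they are direct neighbours, while "footer" stays separate because the removed blocks leave a gap of 4 positions.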
func NewBlockProximityFusion ¶
func NewBlockProximityFusion(postFiltering bool) *BlockProximityFusion
func (*BlockProximityFusion) Process ¶
func (f *BlockProximityFusion) Process(doc *webdoc.TextDocument) bool
type DocumentTitleMatch ¶
type DocumentTitleMatch struct {
// contains filtered or unexported fields
}
DocumentTitleMatch marks TextBlocks which contain parts of the HTML `title` tag, using some heuristics which are quite specific to the news domain.
func NewDocumentTitleMatch ¶
func NewDocumentTitleMatch(wc stringutil.WordCounter, titles ...string) *DocumentTitleMatch
func (*DocumentTitleMatch) Process ¶
func (f *DocumentTitleMatch) Process(doc *webdoc.TextDocument) bool
type ExpandTitleToContent ¶
type ExpandTitleToContent struct{}
ExpandTitleToContent marks as "content" every TextBlock that lies between the headline and the part that is already marked as content, provided the block carries label.MightBeContent. This filter is quite specific to the news domain.
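The expansion described above might look roughly like the sketch below. The `block` type with its boolean fields and the `expandTitleToContent` function are hypothetical stand-ins for illustration; how the real filter locates the headline and the start of the content is an assumption here.

```go
package main

import "fmt"

// block is a hypothetical stand-in for a text block and its labels.
type block struct {
	text           string
	isTitle        bool // assumed: set by an earlier title-matching step
	isContent      bool
	mightBeContent bool // stands in for label.MightBeContent
}

// expandTitleToContent marks blocks between the headline and the first
// content block as content if they are flagged mightBeContent.
// It reports whether any block changed.
func expandTitleToContent(blocks []block) bool {
	title, contentStart := -1, -1
	for i, b := range blocks {
		if contentStart == -1 && b.isTitle {
			title = i
		}
		if contentStart == -1 && b.isContent {
			contentStart = i
		}
	}
	// Nothing to do without a headline that precedes the content.
	if title == -1 || contentStart <= title {
		return false
	}
	changed := false
	for i := title; i < contentStart; i++ {
		if blocks[i].mightBeContent {
			blocks[i].isContent = true
			changed = true
		}
	}
	return changed
}

func main() {
	blocks := []block{
		{text: "Site navigation"},
		{text: "Headline", isTitle: true},
		{text: "Lead paragraph", mightBeContent: true},
		{text: "Advertisement"},
		{text: "Article body", isContent: true},
	}
	fmt.Println(expandTitleToContent(blocks))
	for _, b := range blocks {
		fmt.Printf("%-16s content=%v\n", b.text, b.isContent)
	}
}
```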
func NewExpandTitleToContent ¶
func NewExpandTitleToContent() *ExpandTitleToContent
func (*ExpandTitleToContent) Process ¶
func (f *ExpandTitleToContent) Process(doc *webdoc.TextDocument) bool
type HeadingFusion ¶
type HeadingFusion struct{}
HeadingFusion fuses headings with the blocks after them. If the heading was marked as boilerplate, the fused block will be labeled to prevent BlockProximityFusion from merging through it.
func NewHeadingFusion ¶
func NewHeadingFusion() *HeadingFusion
func (*HeadingFusion) Process ¶
func (f *HeadingFusion) Process(doc *webdoc.TextDocument) bool
type KeepLargestBlock ¶
type KeepLargestBlock struct {
// contains filtered or unexported fields
}
KeepLargestBlock keeps only the largest TextBlock, measured by word count. If several blocks share the largest word count, the first of them is chosen. All discarded blocks are marked "not content" and flagged as `label.MightBeContent`. Note that, by default, only TextBlocks marked as "content" are taken into consideration.
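The selection rule can be sketched as below. The `block` type and the `keepLargest` function are hypothetical and only model the behaviour described in this paragraph; the effect of the `expandToSiblings` option is not documented here and is left out.

```go
package main

import "fmt"

// block is a hypothetical stand-in for a text block.
type block struct {
	words          int
	isContent      bool
	mightBeContent bool // stands in for label.MightBeContent
}

// keepLargest keeps the content block with the most words and demotes
// every other block. It returns the index of the kept block, or -1 if
// no block was marked as content.
func keepLargest(blocks []block) int {
	largest := -1
	for i, b := range blocks {
		// Strict '>' means the first block wins a tie.
		if b.isContent && (largest == -1 || b.words > blocks[largest].words) {
			largest = i
		}
	}
	for i := range blocks {
		if i != largest {
			blocks[i].isContent = false
			blocks[i].mightBeContent = true
		}
	}
	return largest
}

func main() {
	blocks := []block{
		{words: 10, isContent: true},
		{words: 50, isContent: true},
		{words: 50, isContent: true}, // same size as the previous block
		{words: 80},                  // larger, but not marked as content
	}
	fmt.Println("kept index:", keepLargest(blocks))
}
```

The strict comparison is what makes the first of several equally large blocks win, and skipping non-content candidates mirrors the default noted above.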
func NewKeepLargestBlock ¶
func NewKeepLargestBlock(expandToSiblings bool) *KeepLargestBlock
func (*KeepLargestBlock) Process ¶
func (f *KeepLargestBlock) Process(doc *webdoc.TextDocument) bool
type LargeBlockAroundTagLevelToContent ¶
type LargeBlockAroundTagLevelToContent struct{}
LargeBlockAroundTagLevelToContent marks as content all blocks that:
  - are on the same or an adjacent tag level as very likely main content (usually the level of the largest block)
  - have a significant number of words, currently at least 100
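A minimal sketch of these two conditions follows. The `block` type, the `tagLevel` field, the `minWords` constant, and the `markLargeBlocksNearLevel` function are hypothetical, and the sketch assumes the reference level is simply the tag level of the largest content block.

```go
package main

import "fmt"

// minWords is the word threshold named in the description above.
const minWords = 100

// block is a hypothetical stand-in for a text block.
type block struct {
	words     int
	tagLevel  int // assumed: depth of the block in the DOM tree
	isContent bool
}

// markLargeBlocksNearLevel marks large blocks as content when their tag
// level is within one step of the largest content block's level.
// It reports whether any block changed.
func markLargeBlocksNearLevel(blocks []block) bool {
	largest := -1
	for i, b := range blocks {
		if b.isContent && (largest == -1 || b.words > blocks[largest].words) {
			largest = i
		}
	}
	if largest == -1 {
		return false // no main content to compare against
	}
	level := blocks[largest].tagLevel
	changed := false
	for i := range blocks {
		b := &blocks[i]
		diff := b.tagLevel - level
		if !b.isContent && b.words >= minWords && diff >= -1 && diff <= 1 {
			b.isContent = true
			changed = true
		}
	}
	return changed
}

func main() {
	blocks := []block{
		{words: 300, tagLevel: 5, isContent: true}, // main content
		{words: 120, tagLevel: 6},                  // adjacent level, large
		{words: 150, tagLevel: 8},                  // large, but too deep
		{words: 40, tagLevel: 5},                   // same level, too short
	}
	fmt.Println(markLargeBlocksNearLevel(blocks))
	for _, b := range blocks {
		fmt.Printf("words=%d level=%d content=%v\n", b.words, b.tagLevel, b.isContent)
	}
}
```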
func NewLargeBlockAroundTagLevelToContent ¶
func NewLargeBlockAroundTagLevelToContent() *LargeBlockAroundTagLevelToContent
func (*LargeBlockAroundTagLevelToContent) Process ¶
func (f *LargeBlockAroundTagLevelToContent) Process(doc *webdoc.TextDocument) bool
type ListAtEnd ¶
type ListAtEnd struct{}
ListAtEnd marks nested list-item blocks after the end of the main content.
func NewListAtEnd ¶
func NewListAtEnd() *ListAtEnd
type SimilarSiblingContent ¶
type SimilarSiblingContent struct {
	AllowCrossTitles   bool
	AllowCrossHeadings bool
	AllowMixedTags     bool
	MaxLinkDensity     float64
	MaxBlockDistance   int
}
SimilarSiblingContent marks "siblings" of content as content if they are "similar" enough.
This calculates "siblings" by finding a "canonical" DOM node for each TextBlock. This node is the highest ancestor of the TextBlock's first contained node that does not contain (in its subtree) the last node of the previous TextBlock or the first node of the next TextBlock.
If a content block and a non-content block are siblings and are "similar" enough, then the non-content block is marked as content. The "similarity" test is configurable in various ways.
func NewSimilarSiblingContentExpansion ¶
func NewSimilarSiblingContentExpansion() *SimilarSiblingContent
func (*SimilarSiblingContent) Process ¶
func (f *SimilarSiblingContent) Process(doc *webdoc.TextDocument) bool