Documentation
Constants
const (
	// If the numeric value of a link's anchor text is greater than this number,
	// we don't think it represents the page number of the link.
	MaxNumForPageParam = 100
)
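As a rough illustration of how a threshold like this might be applied, the sketch below checks whether a link's anchor text could plausibly be a page number. The helper plausiblePageNumber is hypothetical and not part of this package; rejecting negative values is also an assumption, since the documentation only describes the upper bound.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// MaxNumForPageParam mirrors the documented constant.
const MaxNumForPageParam = 100

// plausiblePageNumber is a hypothetical helper (not part of the package API).
// It reports whether anchorText parses as an integer that is no greater than
// MaxNumForPageParam. Rejecting negative numbers is an assumption.
func plausiblePageNumber(anchorText string) (int, bool) {
	n, err := strconv.Atoi(strings.TrimSpace(anchorText))
	if err != nil || n < 0 || n > MaxNumForPageParam {
		return 0, false
	}
	return n, true
}

func main() {
	for _, text := range []string{"3", "100", "2024", "next"} {
		n, ok := plausiblePageNumber(text)
		fmt.Printf("%q: value=%d plausible=%v\n", text, n, ok)
	}
}
```

With this check, an anchor such as "2024" (likely a year) is not treated as a page number, while "3" and "100" are.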
Variables
This section is empty.
Functions
This section is empty.
Types
type PageNumberFinder
type PageNumberFinder struct {
// contains filtered or unexported fields
}
PageNumberFinder parses the document to collect groups of adjacent plain-text numbers and outlinks with numeric anchor text.
func NewPageNumberFinder
func NewPageNumberFinder(wc stringutil.WordCounter, timingInfo *data.TimingInfo, logger logutil.Logger) *PageNumberFinder
func (*PageNumberFinder) FindOutlink
func (pnf *PageNumberFinder) FindOutlink(root *html.Node, pageURL *nurl.URL) *info.PageParamInfo
FindOutlink parses the document to collect outlinks with numeric anchor text, along with the numeric text around them. It always returns a non-nil PageParamInfo. If no page parameter is detected, or none is determined to be the best, its Type is info.Unset.
func (*PageNumberFinder) FindPagination
func (pnf *PageNumberFinder) FindPagination(root *html.Node, pageURL *nurl.URL) (pagination data.PaginationInfo)
type PrevNextFinder
type PrevNextFinder struct {
// contains filtered or unexported fields
}
PrevNextFinder finds the next and previous page links for the distilled document. The functionality for next page links is migrated from readability.getArticleTitle() in the Chromium codebase's third_party/readability/js/readability.js, and was then expanded to cover previous page links; boilerpipe has no such capability.
First, it determines the prefix URL of the document. Then, for each anchor in the document, its href and text are compared to the prefix URL and examined for information related to next or previous paging. If the anchor passes, its score is determined by applying various heuristics to its href, text, class name and ID. Lastly, the highest-scoring link is accepted as the next or previous page link, provided its score is at least 50.
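The final selection step described above can be sketched as follows. This is a minimal illustration only: the candidate type, the bestLink function and the minConfidenceScore name are hypothetical, and the scoring heuristics themselves are not reproduced. Only the threshold of 50 comes from the description.

```go
package main

import "fmt"

// minConfidenceScore is the documented minimum score (50) a link needs
// before it is trusted as a next or previous page link.
// The constant name itself is an assumption.
const minConfidenceScore = 50

// candidate is a hypothetical pairing of a link with its heuristic score.
type candidate struct {
	Href  string
	Score int
}

// bestLink returns the href of the highest-scoring candidate whose score
// is at least minConfidenceScore. The boolean is false if none qualifies.
func bestLink(cands []candidate) (string, bool) {
	var best candidate
	found := false
	for _, c := range cands {
		if c.Score < minConfidenceScore {
			continue // not enough confidence
		}
		if !found || c.Score > best.Score {
			best, found = c, true
		}
	}
	return best.Href, found
}

func main() {
	cands := []candidate{
		{Href: "/article?page=2", Score: 75},
		{Href: "/about", Score: 10},
		{Href: "/article?page=3", Score: 60},
	}
	if href, ok := bestLink(cands); ok {
		fmt.Println("next page link:", href)
	} else {
		fmt.Println("no confident next page link")
	}
}
```

Returning a boolean alongside the href lets the caller distinguish "no confident link" from an empty href without relying on a sentinel value.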
func NewPrevNextFinder
func NewPrevNextFinder(logger logutil.Logger) *PrevNextFinder
func (*PrevNextFinder) FindOutlink
func (*PrevNextFinder) FindPagination
func (pnf *PrevNextFinder) FindPagination(root *html.Node, pageURL *nurl.URL) data.PaginationInfo