Documentation ¶
Index ¶
- func OpenGraphResolver(article *Article) string
- func ReadLinesOfFile(filename string) []string
- func RegSplit(text string, reg *regexp.Regexp) []string
- func WebPageResolver(article *Article) string
- type Article
- type Cleaner
- type Configuration
- type ContentExtractor
- type Crawler
- type Goose
- type Helper
- type Parser
- type StopWords
- type VideoExtractor
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func OpenGraphResolver ¶
func OpenGraphResolver(article *Article) string
OpenGraphResolver returns OpenGraph properties
func ReadLinesOfFile ¶
func ReadLinesOfFile(filename string) []string
ReadLinesOfFile returns the lines from a file as a slice of strings
func WebPageResolver ¶
func WebPageResolver(article *Article) string
WebPageResolver fetches the main image from the HTML page
Types ¶
type Article ¶
type Article struct {
	Title           string             `json:"title,omitempty"`
	CleanedText     string             `json:"content,omitempty"`
	MetaDescription string             `json:"description,omitempty"`
	MetaLang        string             `json:"lang,omitempty"`
	MetaFavicon     string             `json:"favicon,omitempty"`
	MetaKeywords    string             `json:"keywords,omitempty"`
	CanonicalLink   string             `json:"canonicalurl,omitempty"`
	Domain          string             `json:"domain,omitempty"`
	TopNode         *goquery.Selection `json:"-"`
	TopImage        string             `json:"image,omitempty"`
	Tags            *set.Set           `json:"tags,omitempty"`
	Movies          *set.Set           `json:"movies,omitempty"`
	FinalURL        string             `json:"url,omitempty"`
	LinkHash        string             `json:"linkhash,omitempty"`
	RawHTML         string             `json:"rawhtml,omitempty"`
	Doc             *goquery.Document  `json:"-"`
	Links           []string           `json:"links,omitempty"`
	PublishDate     string             `json:"publishdate,omitempty"`
	AdditionalData  map[string]string  `json:"additionaldata,omitempty"`
	Delta           int64              `json:"delta,omitempty"`
}
Article is a collection of properties extracted from the HTML body
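The struct tags determine how an Article serialises to JSON: some fields are renamed (CleanedText becomes `content`, FinalURL becomes `url`) and empty fields are dropped because of `omitempty`. The sketch below shows this with a trimmed-down mirror of the struct. The type `articleSummary` and the helper `encode` are hypothetical and exist only so the example runs with the standard library alone; the tags are copied from the definition above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// articleSummary is a hypothetical, trimmed-down mirror of Article that keeps
// only a few string fields, with the same JSON tags as the real struct.
type articleSummary struct {
	Title           string `json:"title,omitempty"`
	CleanedText     string `json:"content,omitempty"`
	MetaDescription string `json:"description,omitempty"`
	FinalURL        string `json:"url,omitempty"`
}

// encode marshals the summary to JSON and returns "" on error.
func encode(a articleSummary) string {
	b, err := json.Marshal(a)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	a := articleSummary{
		Title:       "Hello",
		CleanedText: "Body text",
		FinalURL:    "https://example.com/post",
	}
	// MetaDescription is empty, so omitempty leaves it out of the output.
	fmt.Println(encode(a))
	// prints {"title":"Hello","content":"Body text","url":"https://example.com/post"}
}
```

The fields tagged `json:"-"` in the real struct (TopNode and Doc) are never written to JSON, which keeps the parsed document tree out of serialised output.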
type Cleaner ¶
type Cleaner struct {
// contains filtered or unexported fields
}
Cleaner removes menus, ads, sidebars, etc. and leaves the main content
func NewCleaner ¶
func NewCleaner(config Configuration) Cleaner
NewCleaner returns a new instance of a Cleaner
type Configuration ¶
type Configuration struct {
// contains filtered or unexported fields
}
Configuration is a wrapper for various config options
func GetDefaultConfiguration ¶
func GetDefaultConfiguration(args ...string) Configuration
GetDefaultConfiguration returns safe default configuration options
type ContentExtractor ¶
type ContentExtractor struct {
// contains filtered or unexported fields
}
ContentExtractor can parse the HTML and fetch various properties
func NewExtractor ¶
func NewExtractor(config Configuration) ContentExtractor
NewExtractor returns a configured HTML parser
type Crawler ¶
type Crawler struct {
	RawHTML string
	// contains filtered or unexported fields
}
Crawler can fetch the target HTML page
func NewCrawler ¶
func NewCrawler(config Configuration, url string, RawHTML string) Crawler
NewCrawler returns a crawler object initialised with the URL and the [optional] raw HTML body
type Goose ¶
type Goose struct {
// contains filtered or unexported fields
}
Goose is the main entry point of the program
func (Goose) ExtractFromRawHTML ¶
ExtractFromRawHTML returns an article object from the raw HTML content
func (Goose) ExtractFromURL ¶
ExtractFromURL follows the URL, fetches the HTML page and returns an article object
type Helper ¶
type Helper struct {
// contains filtered or unexported fields
}
Helper is a utility struct to clean up URLs and charsets
func NewRawHelper ¶
NewRawHelper converts the text to UTF-8
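To make the idea of charset conversion concrete, here is a minimal sketch that converts ISO-8859-1 (Latin-1) bytes to a UTF-8 string. The function `latin1ToUTF8` is hypothetical, and the choice of Latin-1 as the source encoding is an assumption for illustration only; how the helper detects and converts charsets is not described in this documentation.

```go
package main

import "fmt"

// latin1ToUTF8 is a hypothetical helper. ISO-8859-1 maps each byte value
// directly to the Unicode code point with the same number, so widening every
// byte to a rune and converting to string yields valid UTF-8.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// "café" encoded as Latin-1: the final byte 0xE9 is "é".
	latin1 := []byte{0x63, 0x61, 0x66, 0xE9}
	fmt.Println(latin1ToUTF8(latin1)) // prints café
}
```

Other source encodings need real decoding tables, for which a dedicated encoding package would be the usual choice.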
type Parser ¶
type Parser struct{}
Parser is an HTML parser specialised in extraction of main content and other properties
type StopWords ¶
type StopWords struct {
// contains filtered or unexported fields
}
StopWords implements a simple language detector
func NewStopwords ¶
func NewStopwords() StopWords
NewStopwords returns an instance of a stop words detector
func (StopWords) SimpleLanguageDetector ¶
SimpleLanguageDetector returns the language code for the text, based on its stop words
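Stop-word based detection can be sketched as follows: count how many words of the text appear in each language's stop-word list and pick the language with the most hits. The function `detectLanguage`, the tiny word lists, and the tie-breaking order are all assumptions made for this example and are not taken from the package.

```go
package main

import (
	"fmt"
	"strings"
)

// languages fixes the evaluation order so that ties resolve deterministically.
var languages = []string{"en", "es", "fr"}

// stopWords holds tiny illustrative lists (hypothetical data, far smaller
// than anything a real detector would use).
var stopWords = map[string][]string{
	"en": {"the", "and", "is", "of", "to", "in"},
	"es": {"el", "la", "y", "es", "de", "en"},
	"fr": {"le", "la", "et", "est", "de", "un"},
}

// detectLanguage returns the code of the language whose stop words occur
// most often in text, or "" if no stop word matches at all.
func detectLanguage(text string) string {
	words := strings.Fields(strings.ToLower(text))
	best, bestCount := "", 0
	for _, lang := range languages {
		known := make(map[string]bool)
		for _, w := range stopWords[lang] {
			known[w] = true
		}
		count := 0
		for _, w := range words {
			// Strip surrounding punctuation before the lookup.
			if known[strings.Trim(w, ".,;:!?\"'")] {
				count++
			}
		}
		if count > bestCount {
			best, bestCount = lang, count
		}
	}
	return best
}

func main() {
	fmt.Println(detectLanguage("The cat is in the garden and the dog is in the house."))
	fmt.Println(detectLanguage("El gato es de la casa y el perro es de la calle."))
}
```

Short texts give few stop-word hits, so a detector of this kind is more reliable on full article bodies than on titles or snippets.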
type VideoExtractor ¶
type VideoExtractor struct {
// contains filtered or unexported fields
}
VideoExtractor can extract the main video from an HTML page
func NewVideoExtractor ¶
func NewVideoExtractor() VideoExtractor
NewVideoExtractor returns a new instance of an HTML video extractor
func (*VideoExtractor) GetVideos ¶
func (ve *VideoExtractor) GetVideos(article *Article) *set.Set
GetVideos returns the video tags embedded in the article