Documentation ¶
Overview ¶
Package document parses URLs and the HTML of a webpage.
Index ¶
- Variables
- func ExtractDomain(u *url.URL) (string, error)
- func Languages(supported []language.Tag) []language.Tag
- func ValidateURL(lnk string) (*url.URL, error)
- type Content
- type Document
- func (d *Document) SchemeHost() string
- func (d *Document) SetCanonical(ch chan string) *Document
- func (d *Document) SetContent(bot string, maxLinks int, ch chan string, ...) error
- func (d *Document) SetCrawled(t time.Time) *Document
- func (d *Document) SetHeader(h http.Header) *Document
- func (d *Document) SetPolicyFromHeader(bot string) *Document
- func (d *Document) SetStatusCode(code int) *Document
- func (d *Document) SetTokenizer(b io.Reader) error
- type ElasticSearch
- type Policy
Constants ¶
This section is empty.
Variables ¶
var Matcher = language.NewMatcher(available) // globals...ugh!
Matcher is a package-level language matcher. It will need to change if we can figure out language customization (see note above).
Functions ¶
func ExtractDomain ¶
ExtractDomain extracts the domain from a *url.URL, e.g. "example.com" from "https://www.example.com/path/somewhere".
func Languages ¶
Languages will verify that the given languages are supported (not yet implemented). An empty slice of supported languages implies that you support every available language. Open question: how do we make this configurable? When we crawl a document in a language we don't support, it goes to a matcher, which simply matches the first supported language. Tricky. Once we are ready, look at the wikipedia package implementation.
Types ¶
type Content ¶
type Content struct {
	StatusCode  int          `json:"status,omitempty"`
	Canonical   bool         `json:"canonical,omitempty"`
	Language    language.Tag `json:"-"`
	Date        string       `json:"date,omitempty"`
	Title       string       `json:"title,omitempty"`
	Keywords    string       `json:"keywords,omitempty"`
	Description string       `json:"description,omitempty"`
	Policy
	// contains filtered or unexported fields
}
Content is set from the response
type Document ¶
type Document struct {
	ID        string   `json:"id"` // store ID also as a field as sorting on document ID is not advised in Elasticsearch
	URL       *url.URL `json:"-"`
	Scheme    string   `json:"scheme,omitempty"`
	Host      string   `json:"host,omitempty"`       // not HostName()...we want the port for the robots.txt file
	Domain    string   `json:"domain,omitempty"`     // tld+1 -> example.com
	TLD       string   `json:"tld,omitempty"`        // com, org, uk, etc (we don't want co.uk just uk)
	PathParts string   `json:"path_parts,omitempty"` // https://api.example.com/path/to/something -> "path to something"
	Crawled   string   `json:"crawled,omitempty"`
	MIME      string   `json:"mime,omitempty"`
	Content
	Votes int `json:"-"`
	// contains filtered or unexported fields
}
Document is the URL and parsed content of the page. Note: since we want just a couple of fields from *url.URL (Scheme, Host), we set those explicitly, which is much easier than a custom MarshalJSON method.
func (*Document) SchemeHost ¶
SchemeHost simply concatenates the Scheme, '://', and Host
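A minimal sketch of the described concatenation, using a hypothetical trimmed-down doc type rather than the real Document. Building a robots.txt location is one plausible use, suggested by the comment on the Host field; it is not confirmed by this documentation.

```go
package main

import "fmt"

// doc is a hypothetical, trimmed-down stand-in for Document.
type doc struct {
	Scheme string
	Host   string
}

// schemeHost mirrors the described behaviour: Scheme + "://" + Host.
func (d *doc) schemeHost() string {
	return d.Scheme + "://" + d.Host
}

func main() {
	d := &doc{Scheme: "https", Host: "example.com:8080"}
	// Keeping the port in Host means the result addresses the right server.
	fmt.Println(d.schemeHost() + "/robots.txt") // https://example.com:8080/robots.txt
}
```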
func (*Document) SetCanonical ¶
SetCanonical sets Canonical to true if the Document's ID is the canonical URL
func (*Document) SetContent ¶
func (d *Document) SetContent(bot string, maxLinks int, ch chan string, truncateTitle, truncateKeywords, truncateDescription int) error
SetContent parses the HTML, sets the language, title, and description, extracts links, etc.
func (*Document) SetCrawled ¶
SetCrawled marks the date the doc was crawled
func (*Document) SetPolicyFromHeader ¶
SetPolicyFromHeader sets the index and follow policy of a document from the response header. A bot-specific directive should override a general robots directive, but processing the bot directive is still a TODO. We process the X-Robots-Tag header first, so we may never get to the meta tag found in the HTML. See https://developers.google.com/search/reference/robots_meta_tag and https://stackoverflow.com/a/18330818/776942 (end of the answer).
func (*Document) SetStatusCode ¶
SetStatusCode sets the HTTP status code.
type ElasticSearch ¶
ElasticSearch holds connection and index settings.
func (*ElasticSearch) Analyzer ¶
func (e *ElasticSearch) Analyzer(lang language.Tag) (string, error)
Analyzer returns the appropriate analyzer for a given language.
func (*ElasticSearch) IndexName ¶
func (e *ElasticSearch) IndexName(a string) string
IndexName returns the language-specific index e.g. "search-english", "search-french"
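How Analyzer and IndexName fit together can be sketched as follows. This is a simplified illustration, not the package's implementation: the helpers analyzerFor and indexName are hypothetical, base language codes are passed as plain strings instead of language.Tag to keep the example dependency-free, and the "search-" prefix is inferred from the examples above. The analyzer names used are Elasticsearch built-in language analyzers.

```go
package main

import "fmt"

// analyzers maps base language codes to Elasticsearch built-in
// language analyzers. The selection here is illustrative only.
var analyzers = map[string]string{
	"en": "english",
	"fr": "french",
	"de": "german",
}

// analyzerFor returns the analyzer for a base language code,
// or an error when no analyzer is configured for it.
func analyzerFor(base string) (string, error) {
	a, ok := analyzers[base]
	if !ok {
		return "", fmt.Errorf("no analyzer for language %q", base)
	}
	return a, nil
}

// indexName builds the language-specific index name from an analyzer name.
func indexName(analyzer string) string {
	return "search-" + analyzer
}

func main() {
	for _, base := range []string{"en", "fr", "xx"} {
		a, err := analyzerFor(base)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println(indexName(a))
	}
}
```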
func (*ElasticSearch) Setup ¶
func (e *ElasticSearch) Setup() error
Setup creates our main search index and the language-specific indices for the content.