document

package

v0.0.0-...-73de0e8 Published: Feb 2, 2018 License: Apache-2.0 Imports: 16 Imported by: 0

Repository

github.com/clanstyles/jivesearch

Documentation ¶

Overview ¶

Package document parses URLs and the HTML of a webpage.

Index ¶

Variables
func ExtractDomain(u *url.URL) (string, error)
func Languages(supported []language.Tag) []language.Tag
func ValidateURL(lnk string) (*url.URL, error)
type Content
type Document
- func New(lnk string) (*Document, error)
type ElasticSearch
type Policy

Constants ¶

This section is empty.

Variables ¶

var Matcher = language.NewMatcher(available) // globals...ugh!

Matcher is the package-level language matcher, built from the package's available languages. It will need to change if language customization is worked out (see the note under Languages).

Functions ¶

func ExtractDomain ¶

func ExtractDomain(u *url.URL) (string, error)

ExtractDomain extracts the domain from a *url.URL, e.g. "example.com" from "https://www.example.com/path/somewhere".
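
As a rough illustration of the idea, here is a minimal sketch that uses only the standard library. It assumes the domain is simply the last two labels of the hostname, which is wrong for multi-label public suffixes such as "co.uk"; the package itself may well consult a public suffix list instead. The names naiveDomain and domainOf are hypothetical and not part of this package.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// naiveDomain is a hypothetical stand-in for ExtractDomain. It keeps the
// last two labels of the hostname, so it is only right for single-label
// suffixes such as .com or .org.
func naiveDomain(u *url.URL) (string, error) {
	host := strings.ToLower(u.Hostname()) // Hostname drops any port
	labels := strings.Split(host, ".")
	n := len(labels)
	if n < 2 || labels[n-1] == "" || labels[n-2] == "" {
		return "", errors.New("no registrable domain in host: " + host)
	}
	return strings.Join(labels[n-2:], "."), nil
}

// domainOf parses a link and applies naiveDomain to it.
func domainOf(lnk string) (string, error) {
	u, err := url.Parse(lnk)
	if err != nil {
		return "", err
	}
	return naiveDomain(u)
}

func main() {
	d, err := domainOf("https://www.example.com/path/somewhere")
	fmt.Println(d, err)
}
```

A production version would need a public suffix list to handle hosts such as "www.example.co.uk" correctly.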

func Languages ¶

func Languages(supported []language.Tag) []language.Tag

Languages verifies (eventually) that the given languages are supported. An empty slice of supported languages implies that every available language is supported. Open question: how do we make this configurable? When we crawl a document in a language we don't support, it goes to a matcher that simply matches the first supported language, which is tricky. Once we are ready, look at the wikipedia package's implementation.

func ValidateURL ¶

func ValidateURL(lnk string) (*url.URL, error)

ValidateURL validates a link and returns a *url.URL. Note: there seems to be a lot of overlap between this and handleLink().
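
The exact validation rules are not documented here. The following sketch shows one plausible shape for such a check, assuming that only absolute http and https links with a non-empty host are acceptable. Both that rule set and the name checkURL are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// checkURL is a hypothetical stand-in for ValidateURL. The rules here
// (http or https only, non-empty host) are assumptions, not the package's.
func checkURL(lnk string) (*url.URL, error) {
	u, err := url.Parse(strings.TrimSpace(lnk))
	if err != nil {
		return nil, err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Hostname() == "" {
		return nil, fmt.Errorf("missing host in %q", lnk)
	}
	return u, nil
}

func main() {
	links := []string{"https://example.com/a", "ftp://example.com", "/relative/path"}
	for _, lnk := range links {
		_, err := checkURL(lnk)
		fmt.Printf("%-24s valid=%t\n", lnk, err == nil)
	}
}
```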

Types ¶

type Content ¶

type Content struct {
	StatusCode int `json:"status,omitempty"`

	Canonical   bool         `json:"canonical,omitempty"`
	Language    language.Tag `json:"-"`
	Date        string       `json:"date,omitempty"`
	Title       string       `json:"title,omitempty"`
	Keywords    string       `json:"keywords,omitempty"`
	Description string       `json:"description,omitempty"`
	Policy
	// contains filtered or unexported fields
}

Content is set from the response.

type Document ¶

type Document struct {
	ID        string   `json:"id"` // store ID also as a field as sorting on document ID is not advised in Elasticsearch
	URL       *url.URL `json:"-"`
	Scheme    string   `json:"scheme,omitempty"`
	Host      string   `json:"host,omitempty"`       // not HostName()...we want the port for the robots.txt file
	Domain    string   `json:"domain,omitempty"`     // tld+1 -> example.com
	TLD       string   `json:"tld,omitempty"`        // com, org, uk, etc (we don't want co.uk just uk)
	PathParts string   `json:"path_parts,omitempty"` // https://api.example.com/path/to/something -> "path to something"
	Crawled   string   `json:"crawled,omitempty"`

	MIME string `json:"mime,omitempty"`

	Content
	Votes int `json:"-"`
	// contains filtered or unexported fields
}

Document is the URL and parsed content of the page. Note: since we want just a couple of fields from *url.URL (Scheme, Host), we set those explicitly, which is much easier than writing a custom MarshalJSON method.
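
To make the note about explicit fields concrete, here is a small sketch built on a trimmed-down copy of the struct. The type miniDoc and the helpers newMiniDoc and toJSON are hypothetical, and using the URL string as the ID is an assumption. The TLD and PathParts derivations follow the field comments above. The point is that the `json:"-"` tag keeps the *url.URL out of the output while the copied fields are serialized normally.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// miniDoc is a hypothetical, trimmed-down Document with the same JSON tags.
type miniDoc struct {
	ID        string   `json:"id"`
	URL       *url.URL `json:"-"` // never serialized
	Scheme    string   `json:"scheme,omitempty"`
	Host      string   `json:"host,omitempty"`
	TLD       string   `json:"tld,omitempty"`
	PathParts string   `json:"path_parts,omitempty"`
}

// newMiniDoc copies the fields of interest out of the parsed URL.
// Using the URL string as the ID is an assumption.
func newMiniDoc(lnk string) (*miniDoc, error) {
	u, err := url.Parse(lnk)
	if err != nil {
		return nil, err
	}
	labels := strings.Split(u.Hostname(), ".")
	parts := strings.FieldsFunc(u.Path, func(r rune) bool { return r == '/' })
	return &miniDoc{
		ID:        u.String(),
		URL:       u,
		Scheme:    u.Scheme,
		Host:      u.Host,                // keeps the port, unlike Hostname()
		TLD:       labels[len(labels)-1], // last label only: "uk", not "co.uk"
		PathParts: strings.Join(parts, " "),
	}, nil
}

// toJSON marshals the document; the URL field is skipped because of its tag.
func toJSON(lnk string) (string, error) {
	d, err := newMiniDoc(lnk)
	if err != nil {
		return "", err
	}
	b, err := json.Marshal(d)
	return string(b), err
}

func main() {
	s, err := toJSON("https://api.example.com:8443/path/to/something")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```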

func New ¶

func New(lnk string) (*Document, error)

New creates a new Document from a link and validates the URL.

func (*Document) SchemeHost ¶

func (d *Document) SchemeHost() string

SchemeHost simply concatenates the Scheme, '://', and Host

func (*Document) SetCanonical ¶

func (d *Document) SetCanonical(ch chan string) *Document

SetCanonical sets Canonical to true if the Document's ID is the canonical URL.

func (*Document) SetContent ¶

func (d *Document) SetContent(bot string, maxLinks int, ch chan string,
	truncateTitle, truncateKeywords, truncateDescription int) error

SetContent parses the HTML, sets the language, title, and description, extracts links, etc.

func (*Document) SetCrawled ¶

func (d *Document) SetCrawled(t time.Time) *Document

SetCrawled marks the date the document was crawled.

func (*Document) SetHeader ¶

func (d *Document) SetHeader(h http.Header) *Document

SetHeader sets the Document's header to the response header.

func (*Document) SetPolicyFromHeader ¶

func (d *Document) SetPolicyFromHeader(bot string) *Document

SetPolicyFromHeader sets the indexing and follow policy of a document from the response header. A specific bot directive is meant to override a general robots directive, but this is still TODO. The X-Robots-Tag header is processed first, so we may never get to the meta tag found in the HTML.

See https://developers.google.com/search/reference/robots_meta_tag and https://stackoverflow.com/a/18330818/776942 (end of the answer).

TODO: Process the bot directive.
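
The general-directive part of this can be sketched as follows. The sketch is illustrative rather than the package's implementation: robotsPolicy, policyFromHeader, and policyFromValues are hypothetical names, the Follow flag is an assumed counterpart to the exported Index field, and values scoped to a named bot are skipped entirely since bot handling is still TODO.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// robotsPolicy is a hypothetical result type. The package's Policy exports
// only Index; the Follow flag here is an assumed counterpart.
type robotsPolicy struct {
	Index  bool
	Follow bool
}

// policyFromHeader reads general X-Robots-Tag directives. Values scoped to
// a named bot ("somebot: noindex") are skipped in this sketch.
func policyFromHeader(h http.Header) robotsPolicy {
	p := robotsPolicy{Index: true, Follow: true} // permissive default
	for _, v := range h.Values("X-Robots-Tag") {
		tokens := strings.Split(strings.ToLower(v), ",")
		// A leading "name:" that is not unavailable_after marks a bot-specific value.
		if name, _, ok := strings.Cut(tokens[0], ":"); ok &&
			strings.TrimSpace(name) != "unavailable_after" {
			continue
		}
		for _, d := range tokens {
			switch strings.TrimSpace(d) {
			case "noindex":
				p.Index = false
			case "nofollow":
				p.Follow = false
			case "none": // shorthand for noindex, nofollow
				p.Index, p.Follow = false, false
			}
		}
	}
	return p
}

// policyFromValues builds a header from raw values, for convenience.
func policyFromValues(values ...string) robotsPolicy {
	h := http.Header{}
	for _, v := range values {
		h.Add("X-Robots-Tag", v)
	}
	return policyFromHeader(h)
}

func main() {
	fmt.Printf("%+v\n", policyFromValues("noindex, nofollow"))
}
```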

func (*Document) SetStatusCode ¶

func (d *Document) SetStatusCode(code int) *Document

SetStatusCode sets the HTTP status code.

func (*Document) SetTokenizer ¶

func (d *Document) SetTokenizer(b io.Reader) error

SetTokenizer sets the HTML tokenizer and MIME type from the response's body (UTF-8 encoded). It is the caller's responsibility to close the response body.
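
How the MIME type is determined is not stated. One common standard-library approach, shown here purely as an assumption, is to peek at the start of the body and pass it to http.DetectContentType, leaving the stream intact for a tokenizer. The helpers sniffMIME and sniffString are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// sniffMIME is a hypothetical helper. It peeks at up to 512 bytes (the most
// http.DetectContentType inspects) and returns a reader that still yields
// the whole body, so a tokenizer could consume it afterwards.
func sniffMIME(body io.Reader) (string, io.Reader, error) {
	br := bufio.NewReader(body)
	head, err := br.Peek(512)
	if err != nil && err != io.EOF { // io.EOF just means a short body
		return "", nil, err
	}
	return http.DetectContentType(head), br, nil
}

// sniffString runs sniffMIME on in-memory input and reads back the rest.
func sniffString(s string) (string, string, error) {
	mime, r, err := sniffMIME(strings.NewReader(s))
	if err != nil {
		return "", "", err
	}
	b, err := io.ReadAll(r)
	return mime, string(b), err
}

func main() {
	mime, rest, err := sniffString("<!DOCTYPE html><html><head><title>t</title></head></html>")
	if err != nil {
		panic(err)
	}
	fmt.Println(mime, len(rest))
}
```

As with SetTokenizer, closing the underlying response body stays with the caller in this sketch.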

type ElasticSearch ¶

type ElasticSearch struct {
	Client *elastic.Client
	Index  string
	Type   string
}

ElasticSearch holds the connection and index settings.

func (*ElasticSearch) Analyzer ¶

func (e *ElasticSearch) Analyzer(lang language.Tag) (string, error)

Analyzer returns the appropriate analyzer for a given language.

func (*ElasticSearch) IndexName ¶

func (e *ElasticSearch) IndexName(a string) string

IndexName returns the language-specific index, e.g. "search-english" or "search-french".
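
Judging by the example names, the language-specific index looks like the main index name joined to an analyzer name with a hyphen. The sketch below works under that assumption. The lookup table, the use of plain BCP 47 strings instead of language.Tag, and the names analyzerFor and indexName are all hypothetical; "english", "french", and "german" are names of Elasticsearch built-in language analyzers.

```go
package main

import (
	"fmt"
	"strings"
)

// analyzers maps a base language code to an Elasticsearch built-in language
// analyzer. The table is illustrative, not the package's own.
var analyzers = map[string]string{
	"en": "english",
	"fr": "french",
	"de": "german",
}

// analyzerFor is a hypothetical stand-in for Analyzer that takes a BCP 47
// string instead of a language.Tag.
func analyzerFor(tag string) (string, error) {
	base, _, _ := strings.Cut(strings.ToLower(tag), "-") // "fr-CA" -> "fr"
	a, ok := analyzers[base]
	if !ok {
		return "", fmt.Errorf("no analyzer for language %q", tag)
	}
	return a, nil
}

// indexName joins the main index name and an analyzer name with a hyphen.
func indexName(index, analyzer string) string {
	return index + "-" + analyzer
}

func main() {
	a, err := analyzerFor("fr-CA")
	if err != nil {
		panic(err)
	}
	fmt.Println(indexName("search", a))
}
```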

func (*ElasticSearch) Setup ¶

func (e *ElasticSearch) Setup() error

Setup creates the main search index and the language-specific indices for the content.

type Policy ¶

type Policy struct {
	Index bool `json:"index,omitempty"` // are we allowed to index the page?
	// contains filtered or unexported fields
}

Policy tells us whether we can index the content and store the links.
