Documentation ¶
Index ¶
Constants ¶
const (
	DefaultUserAgent = "Hermes Bot (github.com/jtaylor32/hermes)"
)
Variables ¶
var (
	// ErrNilHostParameter defines you cannot have a nil elasticsearch host address
	ErrNilHostParameter = errors.New("missing host parameter")

	// ErrNilIndexParameter defines you cannot have a nil elasticsearch index name
	ErrNilIndexParameter = errors.New("missing index parameter")

	// ErrNilTypeParameter defines you cannot have a nil elasticsearch type name
	ErrNilTypeParameter = errors.New("missing type parameters")

	// ErrNegativeNParameter defines you cannot have a negative value of documents
	ErrNegativeNParameter = errors.New("n parameter cannot be negative")
)
Functions ¶
This section is empty.
Types ¶
type CustomSettings ¶
type CustomSettings struct {
	RootLink       string   `json:"link"`
	Tags           []string `json:"tags"`
	Subdomain      bool     `json:"subdomain"`
	TopLevelDomain bool     `json:"top_level_domain"`
}
CustomSettings is a struct to model the custom settings we want to use when scraping a specific page
type Document ¶
type Document struct {
	ID          string    `json:"id"`
	Title       string    `json:"title"`
	Description string    `json:"description"`
	Content     string    `json:"content"`
	Link        string    `json:"link"`
	Tag         string    `json:"tag"`
	Time        time.Time `json:"time"`
}
Document is a struct to model the single "Document" we will ingest into the elasticsearch index/type
type Elasticsearch ¶
type Elasticsearch struct {
	Host, Index, Type string
}
The Elasticsearch struct models storage into a single Elasticsearch node. It must have a host, index, and type to ingest data into.
func (*Elasticsearch) Store ¶
func (e *Elasticsearch) Store(n int, docs []Document) error
Store takes the total number of documents (n) and the Documents to be ingested; the Elasticsearch host, index, and type come from the receiver. It returns an error if the ingestion faults, otherwise it prints stats on the ingestion process (Total, Requests/sec, Time to ingest).
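A minimal usage sketch, assuming the package import path github.com/jtaylor32/hermes and a local Elasticsearch node; the host, index, type, and document values below are illustrative only:

package main

import (
	"log"
	"time"

	"github.com/jtaylor32/hermes"
)

func main() {
	// Target a single Elasticsearch node, index, and type.
	es := &hermes.Elasticsearch{
		Host:  "http://localhost:9200",
		Index: "articles",
		Type:  "document",
	}

	// Build the Documents to ingest (normally produced by a crawl).
	docs := []hermes.Document{
		{
			ID:    "1",
			Title: "Example page",
			Link:  "https://example.com",
			Time:  time.Now(),
		},
	}

	// Store returns an error if the ingestion faults; otherwise it prints ingestion stats.
	if err := es.Store(len(docs), docs); err != nil {
		log.Fatal(err)
	}
}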
type IngestionDocument ¶
type IngestionDocument struct {
	Documents []Document
}
IngestionDocument is a struct to model our ingestion set of Documents, across multiple types, for our index
type Runner ¶
type Runner struct {
	// The CrawlDelay is the set time for the Runner to abide by.
	CrawlDelay time.Duration

	// The CancelDuration is the set time for the Runner to cancel immediately.
	CancelDuration time.Duration

	// The CancelAtURL is the specific URL that the Runner will cancel on.
	CancelAtURL string

	// The StopDuration is the set time for the Runner to stop at while still processing the remaining links in the queue.
	StopDuration time.Duration

	// The StopAtURL is the specific URL that the Runner will stop on. It will still process the remaining links in the queue.
	StopAtURL string

	// The MemStatsInterval is a set time for when the Runner will output memory statistics to standard output.
	MemStatsInterval time.Duration

	// The UserAgent is the Runner's user agent string name. Be polite and identify yourself for people to see.
	UserAgent string

	// The WorkerIdleTTL keeps a watch for an idle timeout. When the Runner has finished its total crawl,
	// it will exit after this timeout.
	WorkerIdleTTL time.Duration

	// AutoClose will make the Runner terminate and successfully exit after the WorkerIdleTTL if set to true.
	AutoClose bool

	// The URL is a reference pointer to a URL type.
	URL *url.URL

	// The Tags are the HTML tags you want to scrape with this Runner.
	Tags []string

	// If you want to limit how many documents the Runner will crawl/scrape, you can specify the size here.
	// If you don't have a specific preference you can leave it alone or set it to 0.
	MaximumDocuments int

	// The TopLevelDomain is a toggle to determine if you want to limit the Runner to a specific TLD (i.e. .com, .edu, .gov, etc.).
	// If it is set to true it will make sure it stays on the URL's specific TLD.
	TopLevelDomain bool

	// The Subdomain is a toggle to determine if you want to limit the Runner to a subdomain of the URL. If it is set to true
	// it will make sure it stays on the host's domain. Think of it like a wildcard -- *.github.com -- any link that has
	// github.com will be fetched.
	Subdomain bool
	// contains filtered or unexported fields
}
A Runner defines the parameters for running a single instance of the Hermes ETL.
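As a sketch of how these fields fit together, the snippet below configures a Runner for a single root URL (it assumes the net/url, time, and log imports and the hermes package); the values are illustrative, and the method that actually starts the crawl is not shown in this section:

u, err := url.Parse("https://example.com") // assumed root link
if err != nil {
	log.Fatal(err)
}

r := hermes.Runner{
	CrawlDelay:       1 * time.Second,
	UserAgent:        hermes.DefaultUserAgent,
	WorkerIdleTTL:    30 * time.Second,
	AutoClose:        true,
	URL:              u,
	Tags:             []string{"title", "h1", "p"},
	MaximumDocuments: 100,
	Subdomain:        true,
}
_ = r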
type Settings ¶
type Settings struct {
	ElasticsearchHost  string        `json:"es_host"`            // host address for the elasticsearch instance
	ElasticsearchIndex string        `json:"es_index"`           // index name you are going to ingest data into
	ElasticsearchType  string        `json:"es_type"`            // type name you are going to ingest data into
	CrawlDelay         time.Duration `json:"crawl_delay"`        // delay time for the crawler to abide by
	CancelDuration     time.Duration `json:"cancel_duration"`    // time duration for canceling the crawler (immediate cancel)
	CancelAtURL        string        `json:"cancel_url"`         // specific URL to cancel the crawler at
	StopDuration       time.Duration `json:"stop_duration"`      // time duration for stopping the crawler (processes links on queue after duration time)
	StopAtURL          string        `json:"stop_url"`           // specific URL to stop the crawler at for a specific "root"
	MemStatsInterval   time.Duration `json:"mem_stats_interval"` // display memory statistics at a given interval
	UserAgent          string        `json:"user_agent"`         // set the user agent string for the crawler... to be polite and identify yourself
	WorkerIdleTTL      time.Duration `json:"worker_timeout"`     // time-to-live for a host URL's goroutine
	AutoClose          bool          `json:"autoclose"`          // sets the application to terminate if the WorkerIdleTTL time is passed (must be true)
	EnableLogging      bool          `json:"enable_logging"`     // sets whether or not to log to a file
}
Settings struct to model the settings we want to run our hermes application with.
func ParseSettings ¶
func ParseSettings() Settings
ParseSettings will parse a local settings.json file that is in the same directory as the executable. The JSON file holds all the configuration set by the user for the application.
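A short sketch, assuming a valid settings.json sits next to the executable; it feeds the parsed values straight into the Elasticsearch storage type:

s := hermes.ParseSettings()

es := &hermes.Elasticsearch{
	Host:  s.ElasticsearchHost,
	Index: s.ElasticsearchIndex,
	Type:  s.ElasticsearchType,
}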
type Sources ¶
type Sources struct {
	Links []CustomSettings `json:"links"` // an array of the custom settings, one per root URL we want to start our crawler at
}
Sources is a struct to model the type we want to ingest into the elasticsearch index and the links we want to crawl/scrape for information to store in our index/type
func ParseLinks ¶
func ParseLinks() Sources
ParseLinks will parse the local data.json file that is in the same directory as the executable. The JSON file is the "master" list of links the application will crawl through.
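A sketch of how the parsed links might seed per-link crawl configuration, assuming a valid data.json next to the executable and the net/url, log, and hermes imports; the mapping simply mirrors the CustomSettings and Runner fields documented above:

src := hermes.ParseLinks()

for _, cs := range src.Links {
	u, err := url.Parse(cs.RootLink)
	if err != nil {
		log.Println("skipping invalid link:", err)
		continue
	}

	r := hermes.Runner{
		URL:            u,
		Tags:           cs.Tags,
		Subdomain:      cs.Subdomain,
		TopLevelDomain: cs.TopLevelDomain,
	}
	_ = r // the crawl entry point is not shown in this section
}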