Documentation ¶
Overview ¶
Package walker is an efficient, scalable, continuous crawler leveraging Go and Cassandra.
This package provides the core walker libraries. The development API is documented here. See http://github.com/iParadigms/walker or README.md for an overview of the project.
Index ¶
- Variables
- func GetTestFileDir() string
- func LoadTestConfig(filename string)
- func MustReadConfigFile(path string)
- func PostConfigHooks()
- func ReadConfigFile(path string) error
- func SetDefaultConfig()
- type ConfigStruct
- type Datastore
- type Dispatcher
- type FetchManager
- type FetchResults
- type HTMLParser
- type Handler
- type MockDatastore
- func (ds *MockDatastore) ClaimNewHost() string
- func (ds *MockDatastore) Close()
- func (ds *MockDatastore) KeepAlive() error
- func (ds *MockDatastore) LinksForHost(domain string) <-chan *URL
- func (ds *MockDatastore) StoreParsedURL(u *URL, fr *FetchResults)
- func (ds *MockDatastore) StoreURLFetchResults(fr *FetchResults)
- func (ds *MockDatastore) UnclaimAll() error
- func (ds *MockDatastore) UnclaimHost(host string)
- type MockDispatcher
- type MockHTTPHandler
- type MockHandler
- type MockRemoteServer
- type MockResponse
- type URL
- func (u *URL) Clone() *URL
- func (u *URL) Equal(other *URL) bool
- func (u *URL) EqualIgnoreLastCrawled(other *URL) bool
- func (u *URL) MakeAbsolute(base *URL)
- func (u *URL) Normalize()
- func (u *URL) NormalizedForm() *URL
- func (u *URL) PrimaryKey() (dom string, subdom string, path string, proto string, time time.Time, ...)
- func (u *URL) Subdomain() (string, error)
- func (u *URL) TLDPlusOneAndSubdomain() (string, string, error)
- func (u *URL) ToplevelDomainPlusOne() (string, error)
Constants ¶
This section is empty.
Variables ¶
var ConfigName = "walker.yaml"
ConfigName is the path (can be relative or absolute) to the config file that should be read.
var NotYetCrawled time.Time
NotYetCrawled is a convenience for time.Unix(0, 0), used as a crawl time in Walker for links that have not yet been fetched.
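For example, a datastore or handler can use it to tell seeded links apart from links that have actually been fetched. A minimal sketch (the package name and the neverFetched helper are illustrative, not part of walker; the URL type is documented below):

package example

import "github.com/iParadigms/walker"

// neverFetched reports whether a link has never been crawled.
func neverFetched(u *walker.URL) bool {
	return u.LastCrawled.Equal(walker.NotYetCrawled)
}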
Functions ¶
func GetTestFileDir ¶
func GetTestFileDir() string
GetTestFileDir returns the directory where shared test files are stored, for example test config files. It will panic if it could not get the path from the runtime.
func LoadTestConfig ¶
func LoadTestConfig(filename string)
LoadTestConfig loads the given test config yaml file. The given path is assumed to be relative to the `walker/test/` directory, the location of this file. This will panic if it cannot read the requested config file. If you expect an error or are testing ReadConfigFile, use `GetTestFileDir()` instead.
func MustReadConfigFile ¶
func MustReadConfigFile(path string)
MustReadConfigFile calls ReadConfigFile and panics on error.
func PostConfigHooks ¶
func PostConfigHooks()
PostConfigHooks allows code to set up data structures that depend on the config. It is always called right after the config file is consumed, but it is also public: if you modify the config in a test, you may need to call this function yourself. It is idempotent, so you can call it as many times as you like.
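For example, a test that tweaks the global config in memory might look like this (a sketch; the test name and user-agent value are illustrative, and the field shown is one of the ConfigStruct members documented below):

package walker_test

import (
	"testing"

	"github.com/iParadigms/walker"
)

func TestWithCustomUserAgent(t *testing.T) {
	// Start from defaults, modify the in-memory config, then let walker
	// rebuild any config-dependent state.
	walker.SetDefaultConfig()
	walker.Config.Fetcher.UserAgent = "TestBot/1.0"
	walker.PostConfigHooks() // idempotent, safe to call repeatedly

	// ... exercise code that reads walker.Config ...
}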
func ReadConfigFile ¶
func ReadConfigFile(path string) error
ReadConfigFile sets a new path to find the walker yaml config file and forces a reload of the config.
func SetDefaultConfig ¶
func SetDefaultConfig()
SetDefaultConfig resets the Config object to default values, regardless of what was set by any configuration file.
Types ¶
type ConfigStruct ¶
type ConfigStruct struct {
	Fetcher struct {
		MaxDNSCacheEntries       int      `yaml:"max_dns_cache_entries"`
		UserAgent                string   `yaml:"user_agent"`
		AcceptFormats            []string `yaml:"accept_formats"`
		AcceptProtocols          []string `yaml:"accept_protocols"`
		MaxHTTPContentSizeBytes  int64    `yaml:"max_http_content_size_bytes"`
		IgnoreTags               []string `yaml:"ignore_tags"`
		MaxLinksPerPage          int      `yaml:"max_links_per_page"`
		NumSimultaneousFetchers  int      `yaml:"num_simultaneous_fetchers"`
		BlacklistPrivateIPs      bool     `yaml:"blacklist_private_ips"`
		HTTPTimeout              string   `yaml:"http_timeout"`
		HonorMetaNoindex         bool     `yaml:"honor_meta_noindex"`
		HonorMetaNofollow        bool     `yaml:"honor_meta_nofollow"`
		ExcludeLinkPatterns      []string `yaml:"exclude_link_patterns"`
		IncludeLinkPatterns      []string `yaml:"include_link_patterns"`
		DefaultCrawlDelay        string   `yaml:"default_crawl_delay"`
		MaxCrawlDelay            string   `yaml:"max_crawl_delay"`
		PurgeSidList             []string `yaml:"purge_sid_list"`
		ActiveFetchersTTL        string   `yaml:"active_fetchers_ttl"`
		ActiveFetchersCacheratio float32  `yaml:"active_fetchers_cacheratio"`
		ActiveFetchersKeepratio  float32  `yaml:"active_fetchers_keepratio"`
		HTTPKeepAlive            string   `yaml:"http_keep_alive"`
		HTTPKeepAliveThreshold   string   `yaml:"http_keep_alive_threshold"`
		MaxPathLength            int      `yaml:"max_path_length"`
	} `yaml:"fetcher"`

	Dispatcher struct {
		MaxLinksPerSegment         int     `yaml:"num_links_per_segment"`
		RefreshPercentage          float64 `yaml:"refresh_percentage"`
		NumConcurrentDomains       int     `yaml:"num_concurrent_domains"`
		MinLinkRefreshTime         string  `yaml:"min_link_refresh_time"`
		DispatchInterval           string  `yaml:"dispatch_interval"`
		CorrectLinkNormalization   bool    `yaml:"correct_link_normalization"`
		EmptyDispatchRetryInterval string  `yaml:"empty_dispatch_retry_interval"`
	} `yaml:"dispatcher"`

	Cassandra struct {
		Hosts                 []string `yaml:"hosts"`
		Keyspace              string   `yaml:"keyspace"`
		ReplicationFactor     int      `yaml:"replication_factor"`
		Timeout               string   `yaml:"timeout"`
		CQLVersion            string   `yaml:"cql_version"`
		ProtoVersion          int      `yaml:"proto_version"`
		Port                  int      `yaml:"port"`
		NumConns              int      `yaml:"num_conns"`
		NumStreams            int      `yaml:"num_streams"`
		DiscoverHosts         bool     `yaml:"discover_hosts"`
		MaxPreparedStmts      int      `yaml:"max_prepared_stmts"`
		AddNewDomains         bool     `yaml:"add_new_domains"`
		AddedDomainsCacheSize int      `yaml:"added_domains_cache_size"`
		StoreResponseBody     bool     `yaml:"store_response_body"`
		StoreResponseHeaders  bool     `yaml:"store_response_headers"`
		NumQueryRetries       int      `yaml:"num_query_retries"`
		DefaultDomainPriority int      `yaml:"default_domain_priority"`
	} `yaml:"cassandra"`

	Console struct {
		Port                     int    `yaml:"port"`
		TemplateDirectory        string `yaml:"template_directory"`
		PublicFolder             string `yaml:"public_folder"`
		MaxAllowedDomainPriority int    `yaml:"max_allowed_domain_priority"`
	} `yaml:"console"`
}
ConfigStruct defines the available global configuration parameters for walker. It reads values straight from the config file (walker.yaml by default). See sample-walker.yaml for explanations and default values.
var Config ConfigStruct
Config is the configuration instance the rest of walker should access for global configuration values. See ConfigStruct for available config members.
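A typical program loads its configuration once at startup and then reads values through Config, for example (a sketch; the config file path is whatever your deployment uses):

package main

import (
	"fmt"

	"github.com/iParadigms/walker"
)

func main() {
	// Panics if the config file cannot be read or parsed.
	walker.MustReadConfigFile("walker.yaml")

	fmt.Println("user agent:", walker.Config.Fetcher.UserAgent)
	fmt.Println("cassandra hosts:", walker.Config.Cassandra.Hosts)
}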
type Datastore ¶
type Datastore interface {
	// ClaimNewHost returns a hostname that is now claimed for this crawler to
	// crawl. A segment of links for this host is assumed to be available.
	// Returns the domain of the segment it claimed, or "" if there are none
	// available.
	ClaimNewHost() string

	// UnclaimHost indicates that all links from `LinksForHost` have been
	// processed, so other work may be done with this host. For example the
	// dispatcher will be free to analyze the links and generate a new segment.
	UnclaimHost(host string)

	// LinksForHost returns a channel that will feed URLs for a given host.
	LinksForHost(host string) <-chan *URL

	// StoreURLFetchResults takes the return data/metadata from a fetch and
	// stores the visit. Fetchers will call this once for each link in the
	// segment being crawled.
	StoreURLFetchResults(fr *FetchResults)

	// StoreParsedURL stores a URL parsed out of a page (i.e. a URL we may not
	// have crawled yet). `u` is the URL to store. `fr` is the FetchResults
	// object for the fetch from which we got the URL, for any context the
	// datastore may want. A datastore implementation should handle `fr` being
	// nil, so links can be seeded without a fetch having occurred.
	//
	// URLs passed to StoreParsedURL should be absolute.
	//
	// This layer should handle efficiently deduplicating links (i.e. a
	// fetcher should be safe feeding the same URL many times).
	StoreParsedURL(u *URL, fr *FetchResults)

	// KeepAlive will be called periodically by the fetcher. This method should
	// notify the datastore that this fetcher is still alive.
	KeepAlive() error

	// Close will be called when no more Datastore calls will be made, allowing
	// any necessary cleanup to take place.
	Close()
}
Datastore defines the interface for an object to be used as walker's datastore.
Note that this is for link and metadata storage required to make walker function properly. It has nothing to do with storing fetched content (see `Handler` for that).
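As an illustration only, the sketch below shows the shape of a toy in-memory implementation that serves a single pre-claimed host (memDatastore is hypothetical, not part of walker; real implementations, such as the one in the cassandra subpackage, handle claiming, deduplication, and persistence):

package example

import "github.com/iParadigms/walker"

// memDatastore is a toy walker.Datastore that hands out one host once.
// It is a sketch for illustration, not part of the walker package.
type memDatastore struct {
	host    string
	links   []*walker.URL
	claimed bool
}

var _ walker.Datastore = (*memDatastore)(nil)

func (d *memDatastore) ClaimNewHost() string {
	if d.claimed {
		return "" // nothing left to claim
	}
	d.claimed = true
	return d.host
}

func (d *memDatastore) UnclaimHost(host string) {}

func (d *memDatastore) LinksForHost(host string) <-chan *walker.URL {
	ch := make(chan *walker.URL, len(d.links))
	for _, u := range d.links {
		ch <- u
	}
	close(ch)
	return ch
}

func (d *memDatastore) StoreURLFetchResults(fr *walker.FetchResults)          {}
func (d *memDatastore) StoreParsedURL(u *walker.URL, fr *walker.FetchResults) {}
func (d *memDatastore) KeepAlive() error                                      { return nil }
func (d *memDatastore) Close()                                                {}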
type Dispatcher ¶
type Dispatcher interface {
	// StartDispatcher should be a blocking call that starts the dispatcher. It
	// should return an error if it could not start or stop properly and nil
	// when it has safely shut down and stopped all internal processing.
	StartDispatcher() error

	// Stop signals the dispatcher to stop. It should block until all internal
	// goroutines have stopped.
	StopDispatcher() error
}
Dispatcher defines the calls a dispatcher should respond to. A dispatcher would typically be paired with a particular Datastore, and not all Datastore implementations may need a Dispatcher.
A basic crawl will likely run the dispatcher in the same process as the fetchers, but higher-scale crawl setups may run dispatchers separately.
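For the single-process case, the dispatcher is typically run in its own goroutine next to the FetchManager. A sketch (the runDispatcher helper is defined here for illustration; d is any concrete Dispatcher implementation, for example the one provided by the cassandra subpackage):

package example

import (
	"log"

	"github.com/iParadigms/walker"
)

// runDispatcher starts any walker.Dispatcher in its own goroutine and
// returns a function that shuts it down and waits for it to stop.
func runDispatcher(d walker.Dispatcher) (stop func()) {
	go func() {
		// StartDispatcher blocks until StopDispatcher is called.
		if err := d.StartDispatcher(); err != nil {
			log.Printf("dispatcher exited with error: %v", err)
		}
	}()
	return func() {
		if err := d.StopDispatcher(); err != nil {
			log.Printf("error stopping dispatcher: %v", err)
		}
	}
}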
type FetchManager ¶
type FetchManager struct {
	// Handler must be set to handle fetch responses.
	Handler Handler

	// Datastore must be set to drive the fetching.
	Datastore Datastore

	// Transport can be set to override the default network transport the
	// FetchManager is going to use. Good for faking remote servers for
	// testing.
	Transport http.RoundTripper

	// TransNoKeepAlive stores a RoundTripper with Keep-Alive set to 0 IF
	// http_keep_alive == "threshold". Otherwise it's nil.
	TransNoKeepAlive http.RoundTripper

	// Parsed duration of the string Config.Fetcher.HTTPKeepAliveThreshold
	KeepAliveThreshold time.Duration
	// contains filtered or unexported fields
}
FetchManager configures and runs the crawl.
The calling code must create a FetchManager, set a Datastore and a Handler, then call `Start()`.
func (*FetchManager) Start ¶
func (fm *FetchManager) Start()
Start starts a FetchManager. Always pair `go Start()` with a call to `Stop()`.
func (*FetchManager) Stop ¶
func (fm *FetchManager) Stop()
Stop notifies the fetchers to finish their current requests. It blocks until all fetchers have finished.
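Putting it together, a minimal crawl loop can be wired roughly like this (a sketch; ds and h stand for your own Datastore and Handler implementations, and crawlFor is defined here for illustration):

package example

import (
	"time"

	"github.com/iParadigms/walker"
)

// crawlFor runs a FetchManager for the given duration, then stops it.
func crawlFor(ds walker.Datastore, h walker.Handler, d time.Duration) {
	fm := &walker.FetchManager{
		Datastore: ds,
		Handler:   h,
	}

	// Start blocks, so run it in a goroutine; always pair it with Stop.
	go fm.Start()

	time.Sleep(d)

	// Stop lets in-flight requests finish and blocks until fetchers exit.
	fm.Stop()
}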
type FetchResults ¶
type FetchResults struct {
	// URL that was requested; will always be populated. If this URL redirects,
	// RedirectedFrom will contain a list of all requested URLs.
	URL *URL

	// A list of redirects. During this request cycle, the first request URL is
	// stored in URL. The second request (first redirect) is stored in
	// RedirectedFrom[0]. And the Nth request ((N-1)th redirect) will be stored
	// in RedirectedFrom[N-2], and this is the URL that furnished the
	// http.Response.
	RedirectedFrom []*URL

	// Response object; nil if there was a FetchError or ExcludedByRobots is
	// true. Response.Body may not be the same object the HTTP request actually
	// returns; the fetcher may have read in the response to parse out links,
	// replacing Response.Body with an alternate reader.
	Response *http.Response

	// If the user has set cassandra.store_response_body to true in the config
	// file, then the content of the link will be stored in Body (and
	// consequently stored in the body column of the links table). Otherwise
	// Body is the empty string.
	Body string

	// FetchError if the net/http request had an error (non-2XX HTTP response
	// codes are not considered errors)
	FetchError error

	// Time at the beginning of the request (if a request was made)
	FetchTime time.Time

	// True if we did not request this link because it is excluded by
	// robots.txt rules
	ExcludedByRobots bool

	// True if the page was marked as 'noindex' via a <meta> tag. Whether it
	// was crawled depends on the honor_meta_noindex configuration parameter
	MetaNoIndex bool

	// True if the page was marked as 'nofollow' via a <meta> tag. Whether it
	// was crawled depends on the honor_meta_nofollow configuration parameter
	MetaNoFollow bool

	// The Content-Type of the fetched page.
	MimeType string

	// Fingerprint of the response body computed with the fnv algorithm (see
	// hash/fnv in the standard library)
	FnvFingerprint int64

	// Fingerprint of the text parsed out of the response body, also computed
	// with fnv
	FnvTextFingerprint int64
}
FetchResults contains all relevant context and return data from an individual fetch. Handlers receive this to process results.
type HTMLParser ¶
type HTMLParser struct {
	// A concatenation of all text, excluding content from script/style tags
	Text []byte

	// A list of links found on the parsed page
	Links []*URL

	// true if <meta name="ROBOTS" content="noindex"> was found
	HasMetaNoIndex bool

	// true if <meta name="ROBOTS" content="nofollow"> was found
	HasMetaNoFollow bool
}
HTMLParser simply parses HTML passed from the fetcher. A new struct is intended to have Parse() called on it, which will populate its member variables for reading.
func (*HTMLParser) Parse ¶
func (p *HTMLParser) Parse(body []byte)
Parse parses the given content body as HTML and populates instance variables as it is able. Parse errors will cause the parser to finish with whatever it has found so far. This method will reset its instance variables if run repeatedly.
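A small sketch of driving the parser directly, outside the fetcher (the package name and parseExample function are illustrative):

package example

import (
	"fmt"

	"github.com/iParadigms/walker"
)

func parseExample() {
	body := []byte(`<html><body><a href="http://test.com/next">next</a></body></html>`)

	p := &walker.HTMLParser{}
	p.Parse(body) // repopulates Text, Links, HasMetaNoIndex and HasMetaNoFollow

	for _, link := range p.Links {
		fmt.Println("found link:", link)
	}
	fmt.Println("noindex:", p.HasMetaNoIndex, "nofollow:", p.HasMetaNoFollow)
}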
type Handler ¶
type Handler interface {
	// HandleResponse will be called by fetchers as they make requests.
	// Handlers can do whatever they want with responses. HandleResponse will
	// be called as long as the request successfully reached the remote server
	// and got an HTTP code. This means there should never be a FetchError set
	// on the FetchResults.
	HandleResponse(res *FetchResults)
}
Handler defines the interface for objects that will be set as handlers on a FetchManager.
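A minimal Handler sketch that only logs what was fetched (logHandler is illustrative, not part of walker; the fields read here are documented on FetchResults above):

package example

import (
	"log"

	"github.com/iParadigms/walker"
)

// logHandler implements walker.Handler by logging each response.
type logHandler struct{}

var _ walker.Handler = (*logHandler)(nil)

func (h *logHandler) HandleResponse(fr *walker.FetchResults) {
	status := 0
	if fr.Response != nil { // defensive; expected to be non-nil here
		status = fr.Response.StatusCode
	}
	log.Printf("fetched %v: status=%d mime=%q", fr.URL, status, fr.MimeType)
}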
type MockDatastore ¶
MockDatastore implements walker's Datastore interface for testing.
func (*MockDatastore) ClaimNewHost ¶
func (ds *MockDatastore) ClaimNewHost() string
ClaimNewHost implements walker.Datastore interface
func (*MockDatastore) Close ¶
func (ds *MockDatastore) Close()
func (*MockDatastore) KeepAlive ¶
func (ds *MockDatastore) KeepAlive() error
KeepAlive implements walker.Datastore interface
func (*MockDatastore) LinksForHost ¶
func (ds *MockDatastore) LinksForHost(domain string) <-chan *URL
func (*MockDatastore) StoreParsedURL ¶
func (ds *MockDatastore) StoreParsedURL(u *URL, fr *FetchResults)
func (*MockDatastore) StoreURLFetchResults ¶
func (ds *MockDatastore) StoreURLFetchResults(fr *FetchResults)
func (*MockDatastore) UnclaimAll ¶
func (ds *MockDatastore) UnclaimAll() error
UnclaimAll implements a method on cassandra.Datastore
func (*MockDatastore) UnclaimHost ¶
func (ds *MockDatastore) UnclaimHost(host string)
UnclaimHost implements walker.Datastore interface
type MockDispatcher ¶
MockDispatcher implements the walker.Dispatcher interface
func (*MockDispatcher) StartDispatcher ¶
func (d *MockDispatcher) StartDispatcher() error
StartDispatcher implements the walker.Dispatcher interface
func (*MockDispatcher) StopDispatcher ¶
func (d *MockDispatcher) StopDispatcher() error
StopDispatcher implements the walker.Dispatcher interface
type MockHTTPHandler ¶
type MockHTTPHandler struct {
// contains filtered or unexported fields
}
MockHTTPHandler implements http.Handler to serve mock requests.
It is not a mere mock.Mock object because using `.Return()` to return *http.Response objects is hard to do, and this provides conveniences in our tests.
It should be instantiated with `NewMockRemoteServer()`
func NewMockHTTPHandler ¶
func NewMockHTTPHandler() *MockHTTPHandler
NewMockHTTPHandler creates a new MockHTTPHandler
func (*MockHTTPHandler) ServeHTTP ¶
func (s *MockHTTPHandler) ServeHTTP(w http.ResponseWriter, r *http.Request)
ServeHTTP implements http.Handler interface
func (*MockHTTPHandler) SetResponse ¶
func (s *MockHTTPHandler) SetResponse(link string, r *MockResponse)
SetResponse sets a mock response for the server to return when it sees an incoming request matching the given link. The link should have a scheme and host (for example "http://test.com/stuff"). Empty fields on MockResponse will be filled in with default values (see MockResponse).
type MockHandler ¶
MockHandler implements the walker.Handler interface
func (*MockHandler) HandleResponse ¶
func (h *MockHandler) HandleResponse(fr *FetchResults)
type MockRemoteServer ¶
type MockRemoteServer struct {
	*MockHTTPHandler
	// contains filtered or unexported fields
}
MockRemoteServer wraps MockHTTPHandler to start a fake server for the user. Use `NewMockRemoteServer()`
func NewMockRemoteServer ¶
func NewMockRemoteServer() (*MockRemoteServer, error)
NewMockRemoteServer starts a server listening on port 80. It wraps MockHTTPHandler so mock return values can be set. Stop should be called at the end of the test to stop the server.
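A test might use it roughly as follows (a sketch: the test name and URLs are illustrative, the server binds port 80 so the test needs the corresponding privileges, and the Stop call is assumed to take no arguments, per the description above):

package walker_test

import (
	"testing"

	"github.com/iParadigms/walker"
)

func TestAgainstMockServer(t *testing.T) {
	rs, err := walker.NewMockRemoteServer()
	if err != nil {
		t.Fatal(err)
	}
	defer rs.Stop()

	// Serve a canned page for this exact link; unset MockResponse fields
	// fall back to their documented defaults.
	rs.SetResponse("http://test.com/page1.html", &walker.MockResponse{
		Status: 200,
		Body:   `<html><body><a href="http://test.com/page2.html">next</a></body></html>`,
	})

	// ... point a FetchManager's Transport at the fake server and crawl ...
}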
func (*MockRemoteServer) Headers ¶
Headers allows the user to inspect the headers included in the request object sent to MockRemoteServer. The triple (method, url, depth) selects which header to return. Here:
(a) method is the HTTP method (GET, POST, etc.)
(b) url is the full URL of the page that received the request.
(c) depth is an integer specifying which (of possibly many) headers for the given (method, url) pair to return. Use depth = -1 to get the latest header.
type MockResponse ¶
type MockResponse struct {
	// Status defaults to 200
	Status int

	// Method defaults to "GET"
	Method string

	// Body defaults to empty (no response body)
	Body string

	// Headers of response
	Headers http.Header

	// ContentType defaults to "text/html"
	ContentType string

	// ContentLength is how long the content is
	ContentLength int
}
MockResponse is the source object used to build fake responses in MockHTTPHandler.
type URL ¶
type URL struct {
	*url.URL

	// LastCrawled is the last time we crawled this URL, for example to use a
	// Last-Modified header.
	LastCrawled time.Time
}
URL is the walker URL object, which embeds *url.URL but has extra data and capabilities used by walker. Note that LastCrawled should not be set to its zero value; it should be set to NotYetCrawled.
func CreateURL ¶
CreateURL creates a walker URL from values usually pulled out of the datastore. subdomain may optionally include a trailing '.', and path may optionally include a prefixed '/'.
func MustParse ¶
MustParse is a helper for calling ParseURL when we know the string is a safe URL. It will panic if it fails.
func ParseAndNormalizeURL ¶
ParseAndNormalizeURL will walker.ParseURL the argument string, and then Normalize the resulting URL.
func ParseURL ¶
ParseURL is the walker.URL equivalent of url.Parse. Note that all URLs should be passed through this function so that we get consistency.
func (*URL) EqualIgnoreLastCrawled ¶
EqualIgnoreLastCrawled returns true if the URL portion of this link (excluding LastCrawled) is equal to `other`.
func (*URL) MakeAbsolute ¶
MakeAbsolute uses URL.ResolveReference to make this URL object an absolute reference (having Scheme and Host), if it is not one already. It is resolved using `base` as the base URL.
func (*URL) Normalize ¶
func (u *URL) Normalize()
Normalize will process the URL according to the current set of normalizing rules.
func (*URL) NormalizedForm ¶
NormalizedForm returns nil if u is normalized. Otherwise, it returns the normalized version of u.
func (*URL) PrimaryKey ¶
func (u *URL) PrimaryKey() (dom string, subdom string, path string, proto string, time time.Time, err error)
PrimaryKey returns the 5-tuple that is the primary key for this URL in the links table. The return values are (with Cassandra keys in parentheses):
(a) Domain (dom)
(b) Subdomain (subdom)
(c) Path part of the URL (path)
(d) Scheme of the URL (proto)
(e) Last update time of the link (time)
(f) Any error that occurred
func (*URL) Subdomain ¶
Subdomain provides the remaining subdomain after removing the ToplevelDomainPlusOne. For example http://www.bbc.co.uk/ will return 'www' as the subdomain (note that there is no trailing period). If there is no subdomain it will return "".
func (*URL) TLDPlusOneAndSubdomain ¶
TLDPlusOneAndSubdomain is a convenience function that calls ToplevelDomainPlusOne and Subdomain, returning an error if we could not get either one. The first return value is the TLD+1 and the second is the subdomain.
func (*URL) ToplevelDomainPlusOne ¶
ToplevelDomainPlusOne returns the Effective Toplevel Domain of this host as defined by https://publicsuffix.org/, plus one extra domain component.
For example the TLD of http://www.bbc.co.uk/ is 'co.uk', plus one is 'bbc.co.uk'. Walker uses these TLD+1 domains as the primary unit of grouping.
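For instance (a sketch assuming ParseURL returns (*URL, error), mirroring url.Parse; the package name and domainExample function are illustrative):

package example

import (
	"fmt"

	"github.com/iParadigms/walker"
)

func domainExample() error {
	u, err := walker.ParseURL("http://www.bbc.co.uk/news")
	if err != nil {
		return err
	}

	tld1, err := u.ToplevelDomainPlusOne()
	if err != nil {
		return err
	}
	sub, err := u.Subdomain()
	if err != nil {
		return err
	}

	fmt.Println(tld1, sub) // bbc.co.uk www
	return nil
}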
Source Files ¶
Directories ¶
Path | Synopsis
---|---
cassandra | Package cassandra implements walker.Datastore with the Cassandra database
cmd | Package cmd provides access to build on the walker CLI
console | Package console implements a web console for Walker in Go
dnscache | Package dnscache implements a Dial function that will cache DNS resolutions
mimetools | Package mimetools provides functions for matching against media types
simplehandler | Package simplehandler provides a basic walker handler implementation