Documentation ¶
Overview ¶
Package cassandra implements walker.Datastore with the Cassandra database
Index ¶
- Variables
- func CreateSchema() error
- func GetConfig() *gocql.ClusterConfig
- func GetSchema() string
- func GetTestDB() *gocql.Session
- type DQ
- type Datastore
- func (ds *Datastore) ClaimNewHost() string
- func (ds *Datastore) Close()
- func (ds *Datastore) FindDomain(domain string) (*DomainInfo, error)
- func (ds *Datastore) FindLink(u *walker.URL, collectContent bool) (*LinkInfo, error)
- func (ds *Datastore) InsertLink(link string, excludeDomainReason string) error
- func (ds *Datastore) InsertLinks(links []string, excludeDomainReason string) []error
- func (ds *Datastore) KeepAlive() error
- func (ds *Datastore) LinksForHost(domain string) <-chan *walker.URL
- func (ds *Datastore) ListDomains(query DQ) ([]*DomainInfo, error)
- func (ds *Datastore) ListLinkHistorical(u *walker.URL) ([]*LinkInfo, error)
- func (ds *Datastore) ListLinks(domain string, query LQ) ([]*LinkInfo, error)
- func (ds *Datastore) MaxPriority() int
- func (ds *Datastore) StoreParsedURL(u *walker.URL, fr *walker.FetchResults)
- func (ds *Datastore) StoreURLFetchResults(fr *walker.FetchResults)
- func (ds *Datastore) UnclaimAll() error
- func (ds *Datastore) UnclaimHost(host string)
- func (ds *Datastore) UpdateDomain(domain string, info *DomainInfo, cfg DomainInfoUpdateConfig) error
- type Dispatcher
- type DomainInfo
- type DomainInfoUpdateConfig
- type LQ
- type LinkInfo
- type LinkList
- type MockModelDatastore
- func (ds *MockModelDatastore) FindDomain(domain string) (*DomainInfo, error)
- func (ds *MockModelDatastore) FindLink(u *walker.URL, collectContent bool) (*LinkInfo, error)
- func (ds *MockModelDatastore) InsertLink(link string, excludeDomainReason string) error
- func (ds *MockModelDatastore) InsertLinks(links []string, excludeDomainReason string) []error
- func (ds *MockModelDatastore) ListDomains(query DQ) ([]*DomainInfo, error)
- func (ds *MockModelDatastore) ListLinkHistorical(u *walker.URL) ([]*LinkInfo, error)
- func (ds *MockModelDatastore) ListLinks(domain string, query LQ) ([]*LinkInfo, error)
- func (ds *MockModelDatastore) UpdateDomain(domain string, info *DomainInfo, cfg DomainInfoUpdateConfig) error
- type ModelDatastore
- type PriorityURL
- type SegmentGenerator
Constants ¶
This section is empty.
Variables ¶
var MaxPriorityPeriod time.Duration
Functions ¶
func CreateSchema ¶
func CreateSchema() error
CreateSchema creates the walker schema in the configured Cassandra database. It requires that the keyspace not already exist (so as to losing non-test data), with the exception of the walker_test schema, which it will drop automatically.
func GetConfig ¶
func GetConfig() *gocql.ClusterConfig
GetConfig returns a fresh ClusterConfig, configured against walker.Config
func GetSchema ¶
func GetSchema() string
GetSchema returns the CQL schema for this version of the cassandra datastore. Certain values, like keyspace and replication factor, are dynamically inserted.
Types ¶
type DQ ¶
type DQ struct { // When listing domains, the seed should be the domain preceding the // queried set. When paginating, use the last domain of the previous set as // the seed. // Default: select from the beginning Seed string // Limit the returned results, used for pagination. // Default: no limit Limit int // Set to true to get only dispatched domains // default: get all domains Working bool }
DQ is a domain query struct used for getting domains from cassandra. Zero-values mean use default behavior.
type Datastore ¶
type Datastore struct {
// contains filtered or unexported fields
}
Datastore is the primary walker Datastore implementation, using Apache Cassandra as a highly scalable backend. It provides extra access calls for the database for use in the console and other applications.
NewDatastore should be used to create one.
func NewDatastore ¶
NewDatastore creates a Cassandra session and initializes a Datastore
func (*Datastore) ClaimNewHost ¶
ClaimNewHost is documented on the walker.Datastore interface.
func (*Datastore) FindDomain ¶
func (ds *Datastore) FindDomain(domain string) (*DomainInfo, error)
func (*Datastore) InsertLink ¶
func (*Datastore) InsertLinks ¶
func (*Datastore) LinksForHost ¶
LinksForHost is documented on the walker.Datastore interface.
func (*Datastore) ListDomains ¶
func (ds *Datastore) ListDomains(query DQ) ([]*DomainInfo, error)
func (*Datastore) ListLinkHistorical ¶
func (*Datastore) MaxPriority ¶
func (*Datastore) StoreParsedURL ¶
func (ds *Datastore) StoreParsedURL(u *walker.URL, fr *walker.FetchResults)
StoreParsedURL is documented on the walker.Datastore interface.
func (*Datastore) StoreURLFetchResults ¶
func (ds *Datastore) StoreURLFetchResults(fr *walker.FetchResults)
StoreURLFetchResults is documented on the walker.Datastore interface.
func (*Datastore) UnclaimAll ¶
UnclaimAll iterates domains to unclaim them. Crawlers will unclaim domains by themselves, but this is used in case crawlers crash or are killed and have left domains claimed.
func (*Datastore) UnclaimHost ¶
UnclaimHost is documented on the walker.Datastore interface.
func (*Datastore) UpdateDomain ¶
func (ds *Datastore) UpdateDomain(domain string, info *DomainInfo, cfg DomainInfoUpdateConfig) error
type Dispatcher ¶
type Dispatcher struct {
// contains filtered or unexported fields
}
Dispatcher analyzes what we've crawled so far (generally on a per-domain basis) and updates the database. At minimum this means generating new segments to crawl in the `segments` table, but it can also mean updating domain_info if we find out new things about a domain.
This dispatcher has been designed to run simultaneously with the fetchmanager. Fetchers and dispatchers claim domains in Cassandra, so the dispatcher can operate on the domains not currently being crawled (and vice versa).
Always create a Dispatcher using NewDispatcher()
func NewDispatcher ¶
func NewDispatcher() (*Dispatcher, error)
func (*Dispatcher) StartDispatcher ¶
func (d *Dispatcher) StartDispatcher() error
func (*Dispatcher) StopDispatcher ¶
func (d *Dispatcher) StopDispatcher() error
StopDispatcher stops the dispatcher.
type DomainInfo ¶
type DomainInfo struct { // TLD+1 Domain string // Is this domain excluded from the crawl? Excluded bool // Why did this domain get excluded, or empty if not excluded ExcludeReason string // When did this domain last get queued to be crawled. Or TimeQueed.IsZero() if not crawled ClaimTime time.Time // What was the UUID of the crawler that last crawled the domain ClaimToken gocql.UUID // Number of (unique) links found in this domain NumberLinksTotal int // Number of (unique) links queued to be processed for this domain NumberLinksQueued int // Number of links not yet crawled NumberLinksUncrawled int // Priority of this domain Priority int }
DomainInfo defines a row from the domain_info table
type DomainInfoUpdateConfig ¶
type DomainInfoUpdateConfig struct { // Setting Exclude to true indicates that the ExcludeReason field of the // DomainInfo passed to UpdateDomain should be persisted to the database. Exclude bool // Setting Priority to true indicates that the Priority field of the // DomainInfo passed to UpdateDomain should be persisted to the database. Priority bool }
DomainInfoUpdateConfig is used to configure the method Datastore.UpdateDomain
type LQ ¶
type LQ struct { // When listing links, the seed should be the URL preceding the queried // set. When paginating, use the last URL of the previous set as the seed. // Default: select from the beginning Seed *walker.URL // Limit the returned results, used for pagination. // Default: no limit Limit int FilterRegex string }
LQ is a link query struct used for gettings links from cassandra. Zero-values mean use default behavior.
type LinkInfo ¶
type LinkInfo struct { // URL of the link URL *walker.URL // Status of the fetch Status int // When did this link get crawled CrawlTime time.Time // Any error reported when attempting to fetch the URL Error string // Was this excluded by robots RobotsExcluded bool // URL this link redirected to if it was a redirect RedirectedTo string // Whether this link was flagged for immediate fetching GetNow bool // Mime type (or Content-Type) of the returned data Mime string // FNV hash of the contents FnvFingerprint int64 // FNV hash of the text extracted from the page FnvTextFingerprint int64 // Body of request (if configured to be stored) Body string // Header of request (if configured to be stored) Headers http.Header }
LinkInfo defines a row from the link or segment table
type LinkList ¶
type LinkList []*LinkInfo
LinkList is a list of LinkInfos that implements sort.Interface, so we can easily sort and deduplicate it
type MockModelDatastore ¶
type MockModelDatastore struct {
walker.MockDatastore
}
MockModelDatastore implements walker/cassandra's ModelDatastore interface for testing.
func (*MockModelDatastore) FindDomain ¶
func (ds *MockModelDatastore) FindDomain(domain string) (*DomainInfo, error)
func (*MockModelDatastore) InsertLink ¶
func (ds *MockModelDatastore) InsertLink(link string, excludeDomainReason string) error
func (*MockModelDatastore) InsertLinks ¶
func (ds *MockModelDatastore) InsertLinks(links []string, excludeDomainReason string) []error
func (*MockModelDatastore) ListDomains ¶
func (ds *MockModelDatastore) ListDomains(query DQ) ([]*DomainInfo, error)
func (*MockModelDatastore) ListLinkHistorical ¶
func (ds *MockModelDatastore) ListLinkHistorical(u *walker.URL) ([]*LinkInfo, error)
func (*MockModelDatastore) ListLinks ¶
func (ds *MockModelDatastore) ListLinks(domain string, query LQ) ([]*LinkInfo, error)
func (*MockModelDatastore) UpdateDomain ¶
func (ds *MockModelDatastore) UpdateDomain(domain string, info *DomainInfo, cfg DomainInfoUpdateConfig) error
type ModelDatastore ¶
type ModelDatastore interface { walker.Datastore // FindDomain returns the DomainInfo for the specified domain FindDomain(domain string) (*DomainInfo, error) // ListDomains returns a slice of DomainInfo structs populated according to // the specified DQ (domain query) ListDomains(query DQ) ([]*DomainInfo, error) // UpdateDomain updates the given domain with fields from `info`. Which // fields will be persisted to the store from the argument DomainInfo is // configured from the DomainInfoUpdateConfig argument. For example, to // persist the Priority field in the info strut, one would pass // DomainInfoUpdateConfig{Priority: true} as the cfg argument to // UpdateDomain. UpdateDomain(domain string, info *DomainInfo, cfg DomainInfoUpdateConfig) error // FindLink returns a LinkInfo matching the given URL. Arguments to this // function are: (a) u is the url to find (b) collectContent, if true, // indicates that Body and Headers field of LinkInfo will be populated. FindLink(u *walker.URL, collectContent bool) (*LinkInfo, error) // ListLinks fetches links for the given domain according to the given LQ // (Link Query) ListLinks(domain string, query LQ) ([]*LinkInfo, error) // ListLinkHistorical gets the crawl history of a specific link ListLinkHistorical(u *walker.URL) ([]*LinkInfo, error) // InsertLink inserts the given link into the database, adding it's domain // if it does not exist. If excludeDomainReason is not empty, this domain // will be excluded from crawling marked with the given reason. InsertLink(link string, excludeDomainReason string) error // InsertLinks does the same as InsertLink with many potential errors. It // will insert as many as it can (it won't stop once it hits a bad link) // and only return errors for problematic links or domains. InsertLinks(links []string, excludeDomainReason string) []error }
ModelDatastore defines additional methods for querying and modifying domains and links in walker, and includes the walker.Datastore interface (which is intended only for the bare minimum that fetchers in the walker package need). This interface is good for use by the console and other tools that need CRUD-like capabilities.
type PriorityURL ¶
type PriorityURL []*LinkInfo
PriorityURL is a heap of URLs, where the next element Pop'ed off the list points to the oldest (as measured by LastCrawled) element in the list. This class is designed to be used with the container/heap package. This type is currently only used in generateSegments
func (PriorityURL) Less ¶
func (pq PriorityURL) Less(i, j int) bool
Return logical less-than between two items in this PriorityURL
func (*PriorityURL) Pop ¶
func (pq *PriorityURL) Pop() interface{}
Pop an item onto this PriorityURL
func (*PriorityURL) Push ¶
func (pq *PriorityURL) Push(x interface{})
Push an item onto this PriorityURL
type SegmentGenerator ¶
type SegmentGenerator struct { // A DB handle for the generator to use. Should be provided when // constructing a SegmentGenerator DB *gocql.Session // contains filtered or unexported fields }
SegmentGenerator is the dispatcher component for generating a segment of links for an individual domain. See the Generate() function.
func (*SegmentGenerator) Generate ¶
func (sg *SegmentGenerator) Generate(domain string) error
Generate reads links in for this domain, generates a segment for it, and inserts the domain into domains_to_crawl (assuming a segment is ready to go)