Documentation
Overview
Package crawler is a distributed web crawler.
Constants
This section is empty.
Variables

```go
var RobotsPath, _ = url.Parse("/robots.txt")
```

RobotsPath is the robots.txt path.
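Because RobotsPath is stored as an already-parsed *url.URL, it can be resolved against any page URL to locate that site's robots.txt. A minimal sketch (the variable is redeclared locally rather than imported, since the package's import path is not shown on this page):

```go
package main

import (
	"fmt"
	"net/url"
)

// RobotsPath mirrors the package variable: the relative robots.txt path.
var RobotsPath, _ = url.Parse("/robots.txt")

func main() {
	// Resolve the robots.txt location for a page we are about to crawl.
	page, err := url.Parse("https://example.com/blog/post-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(page.ResolveReference(RobotsPath)) // https://example.com/robots.txt
}
```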
Functions
This section is empty.
Types
type Backend

```go
type Backend interface {
	Setup() error
	CrawledAndCount(u, domain string) (time.Time, int, error) // gotta be a better name for this
	Upsert(*document.Document) error
}
```

Backend outlines the methods to save documents and count the documents a domain has.
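As a sketch of what an implementation involves, here is a hypothetical in-memory Backend of the kind one might use in tests. The Document type is a stand-in for the package's document.Document, which lives in a separate package, so only the fields the sketch needs are included:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Document stands in for document.Document; real documents carry more fields.
type Document struct {
	ID     string // the document's URL
	Domain string
}

// memoryBackend is a hypothetical in-memory Backend implementation.
type memoryBackend struct {
	mu      sync.Mutex
	crawled map[string]time.Time // URL -> time it was last crawled
	counts  map[string]int       // domain -> number of documents stored
}

// Setup initializes the maps, mirroring Backend's Setup method.
func (m *memoryBackend) Setup() error {
	m.crawled = map[string]time.Time{}
	m.counts = map[string]int{}
	return nil
}

// CrawledAndCount returns when u was last crawled (zero time if never)
// and how many documents its domain already has.
func (m *memoryBackend) CrawledAndCount(u, domain string) (time.Time, int, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.crawled[u], m.counts[domain], nil
}

// Upsert records the document and bumps the domain's count.
func (m *memoryBackend) Upsert(d *Document) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, seen := m.crawled[d.ID]; !seen {
		m.counts[d.Domain]++ // only count a document the first time we see it
	}
	m.crawled[d.ID] = time.Now()
	return nil
}

func main() {
	b := &memoryBackend{}
	_ = b.Setup()
	_ = b.Upsert(&Document{ID: "https://example.com/a", Domain: "example.com"})
	when, n, _ := b.CrawledAndCount("https://example.com/a", "example.com")
	fmt.Println(when, n) // the crawl time and a count of 1
}
```

The package's ElasticSearch type, documented below, is the concrete Backend used in practice.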
type Crawler

```go
type Crawler struct {
	HTTPClient *http.Client
	UserAgent
	Robots  robots.Cacher
	Queue   queue.Queuer
	Backend
	// contains filtered or unexported fields
}
```

Crawler holds crawler settings for our UserAgent, Seed URLs, etc.
type ElasticSearch

```go
type ElasticSearch struct {
	*document.ElasticSearch
	Bulk *elastic.BulkProcessor
	sync.Mutex
}
```

ElasticSearch satisfies the crawler's Backend interface.
func (*ElasticSearch) CrawledAndCount

CrawledAndCount returns the crawled date of the URL (if any) and the total number of links the domain has.
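A typical reason to ask for both values at once is to decide whether a URL is worth fetching again. The sketch below is not from the package: the shouldCrawl helper, its freshness window, and the per-domain cap are made-up policy values, and the interface is trimmed to the single method used.

```go
package main

import (
	"fmt"
	"time"
)

// counter is a trimmed-down stand-in for the Backend interface,
// declaring only the method this sketch calls.
type counter interface {
	CrawledAndCount(u, domain string) (time.Time, int, error)
}

// shouldCrawl is a hypothetical policy helper: skip a URL if it was crawled
// within maxAge, or if its domain already holds maxDocs documents.
func shouldCrawl(b counter, u, domain string, maxAge time.Duration, maxDocs int) (bool, error) {
	crawled, count, err := b.CrawledAndCount(u, domain)
	if err != nil {
		return false, err
	}
	if !crawled.IsZero() && time.Since(crawled) < maxAge {
		return false, nil // fetched recently enough; skip it
	}
	if count >= maxDocs {
		return false, nil // the domain has hit its document quota
	}
	return true, nil
}

// stub returns fixed values so the sketch runs on its own.
type stub struct{}

func (stub) CrawledAndCount(u, domain string) (time.Time, int, error) {
	return time.Now().Add(-48 * time.Hour), 3, nil // crawled two days ago, 3 docs
}

func main() {
	ok, err := shouldCrawl(stub{}, "https://example.com/a", "example.com", 24*time.Hour, 100)
	fmt.Println(ok, err) // true <nil>: stale enough and under the quota
}
```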
Directories

Path | Synopsis |
---|---|
 | Command crawler demonstrates how to run the crawler |
queue | Package queue manages the queue for a distributed crawler |
robots | Package robots handles caching robots.txt files |