Documentation ¶
Overview ¶
Package ant implements a web crawler.
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
    // UserAgent is the default user agent to use.
    //
    // The user agent is used by default when fetching
    // pages and robots.txt.
    UserAgent = StaticAgent("antbot")

    // DefaultFetcher is the default fetcher to use.
    //
    // It uses the default client and default user agent.
    DefaultFetcher = &Fetcher{
        Client:    DefaultClient,
        UserAgent: UserAgent,
    }
)
var DefaultClient = &http.Client{
    Transport: &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        DialContext: (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
            DualStack: true,
        }).DialContext,
        ForceAttemptHTTP2:     true,
        MaxIdleConns:          0,
        MaxIdleConnsPerHost:   1000,
        IdleConnTimeout:       90 * time.Second,
        TLSHandshakeTimeout:   10 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
    },
    Timeout: 10 * time.Second,
}
DefaultClient is the default client to use.
It is configured the same way as `http.DefaultClient` except for three changes:
- Timeout => 10s
- Transport.MaxIdleConns => infinity
- Transport.MaxIdleConnsPerHost => 1,000
Note that this default client is used for all robots.txt requests when they're enabled.
Functions ¶
This section is empty.
Types ¶
type Client ¶
type Client interface {
    // Do sends an HTTP request and returns an HTTP response.
    //
    // The method does not rely on the HTTP response code to return an
    // error. A non-nil error does not guarantee that the response is nil;
    // its body must be closed and read until EOF so that the underlying
    // resources may be reused.
    Do(req *http.Request) (*http.Response, error)
}
Client represents an HTTP client.
A client is used by the fetcher to turn URLs into pages; it is up to the client to decide how it manages the underlying connections, redirects, or cookies.
A client must be safe to use from multiple goroutines.
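Any type with a matching Do method satisfies the interface, which makes it easy to decorate an existing client. A minimal sketch of a decorator that stamps a header on every request; the type name and the wrapped client are illustrative assumptions:

import "net/http"

// headerClient is a hypothetical Client decorator; it stamps a
// header on every request before delegating to an underlying
// *http.Client such as ant.DefaultClient.
type headerClient struct {
    client *http.Client
    key    string
    value  string
}

// Do implements the Client interface by adding the header and
// delegating to the wrapped client.
func (c headerClient) Do(req *http.Request) (*http.Response, error) {
    req.Header.Set(c.key, c.value)
    return c.client.Do(req)
}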
type Deduper ¶
type Deduper interface {
    // Dedupe de-duplicates the given URLs.
    //
    // The method returns a new slice of URLs
    // that were not visited yet; it must be
    // thread-safe.
    //
    // The method is not required to normalize the URLs;
    // the engine normalizes them before calling it.
    //
    // If an error is returned that implements
    // `Temporary() bool` and returns true, the
    // engine will retry.
    Dedupe(ctx context.Context, urls URLs) (URLs, error)
}
Deduper represents a URL de-duplicator.
A deduper must be safe to use from multiple goroutines.
func DedupeBF ¶
DedupeBF returns a new deduper backed by a bloom filter.
The de-duplicator uses an in-memory bloom filter to check whether a URL has been visited. When `Dedupe()` is called with a set of URLs, it loops over them and checks whether each URL exists in the set; URLs that do not are added to the set and returned.
func DedupeMap ¶
func DedupeMap() Deduper
DedupeMap returns a new deduper backed by sync.Map.
The de-duplicator is inefficient and is meant to be used for smaller crawls; it keeps the URLs in memory.
If you're concerned about memory use, either supply your own de-duplicator implementation or use `DedupeBF()`.
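A custom de-duplicator is a small amount of code. A minimal sketch of a mutex-guarded map implementation, assuming the elements of `URLs` expose the standard `url.URL` String method and that the package lives at `github.com/yields/ant`:

import (
    "context"
    "sync"

    "github.com/yields/ant"
)

// mapDeduper is a sketch of a custom Deduper backed by a
// mutex-guarded set of visited URL strings.
type mapDeduper struct {
    mu   sync.Mutex
    seen map[string]struct{}
}

// newMapDeduper returns a ready-to-use mapDeduper.
func newMapDeduper() *mapDeduper {
    return &mapDeduper{seen: map[string]struct{}{}}
}

// Dedupe returns the subset of urls that were not seen before,
// marking them as seen.
func (d *mapDeduper) Dedupe(ctx context.Context, urls ant.URLs) (ant.URLs, error) {
    d.mu.Lock()
    defer d.mu.Unlock()
    var next ant.URLs
    for _, u := range urls {
        key := u.String()
        if _, ok := d.seen[key]; ok {
            continue
        }
        d.seen[key] = struct{}{}
        next = append(next, u)
    }
    return next, nil
}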
type Engine ¶
type Engine struct {
// contains filtered or unexported fields
}
Engine implements a web crawler engine.
type EngineConfig ¶
type EngineConfig struct {
    // Scraper is the scraper to use.
    //
    // If nil, NewEngine returns an error.
    Scraper Scraper

    // Deduper is the URL de-duplicator to use.
    //
    // If nil, DedupeMap is used.
    Deduper Deduper

    // Fetcher is the page fetcher to use.
    //
    // If nil, the default HTTP fetcher is used.
    Fetcher *Fetcher

    // Queue is the URL queue to use.
    //
    // If nil, the default in-memory queue is used.
    Queue Queue

    // Limiter is the rate limiter to use.
    //
    // The limiter is called with each URL before
    // it is fetched.
    //
    // If nil, no limits are used.
    Limiter Limiter

    // Matcher is the URL matcher to use.
    //
    // The matcher is called with a URL before it is queued;
    // if it returns false the URL is discarded.
    //
    // If nil, all URLs are queued.
    Matcher Matcher

    // Impolite skips any robots.txt checking.
    //
    // Note that it does not affect any configured
    // ratelimiters or matchers.
    //
    // By default the engine checks robots.txt; it uses
    // the default ant.UserAgent.
    Impolite bool

    // Workers specifies the amount of workers to use.
    //
    // Every worker the engine starts consumes URLs from the queue
    // and starts a goroutine for each URL.
    //
    // If <= 0, defaults to 1.
    Workers int

    // Concurrency is the maximum amount of URLs to process
    // at any given time.
    //
    // The engine uses a global semaphore to limit the amount
    // of goroutines started by the workers.
    //
    // If <= 0, there's no limit.
    Concurrency int
}
EngineConfig configures the engine.
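A sketch of wiring a config into a running engine. `NewEngine` is referenced above; the `Run` method, the `css` struct tag, and the `github.com/yields/ant` import path are assumptions not shown in this section:

package main

import (
    "context"
    "log"
    "os"

    "github.com/yields/ant"
)

// page is a hypothetical type for the JSON scraper documented
// below; the `css` struct tag is an assumption.
type page struct {
    Title string `css:"title"`
}

func main() {
    eng, err := ant.NewEngine(ant.EngineConfig{
        Scraper:     ant.JSON(os.Stdout, page{}), // one JSON line per page
        Matcher:     ant.MatchHostname("example.com"),
        Limiter:     ant.Limit(5), // 5 requests per second
        Workers:     4,
        Concurrency: 16,
    })
    if err != nil {
        log.Fatal(err)
    }

    // Run is assumed to take seed URLs and block until the crawl ends.
    if err := eng.Run(context.Background(), "https://example.com"); err != nil {
        log.Fatal(err)
    }
}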
type FetchError ¶
FetchError represents a fetch error.
func (FetchError) Temporary ¶
func (err FetchError) Temporary() bool
Temporary returns true if the HTTP status code generally means the error is temporary.
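Callers that implement their own retry logic can detect this with `errors.As`; a small sketch:

import (
    "errors"

    "github.com/yields/ant"
)

// retryable reports whether err is a fetch error that is
// classified as temporary.
func retryable(err error) bool {
    var fe ant.FetchError
    return errors.As(err, &fe) && fe.Temporary()
}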
type Fetcher ¶
type Fetcher struct {
    // Client is the client to use.
    //
    // If nil, ant.DefaultClient is used.
    Client Client

    // UserAgent is the user agent to use.
    //
    // It implements the fmt.Stringer interface
    // to allow user agent spoofing when needed.
    //
    // If nil, the client decides the user agent.
    UserAgent fmt.Stringer

    // MaxAttempts is the maximum request attempts to make.
    //
    // When <= 0, it defaults to 5.
    MaxAttempts int

    // MinBackoff to use when the fetcher retries.
    //
    // Must be less than MaxBackoff, otherwise
    // the fetcher returns an error.
    //
    // Defaults to `50ms`.
    MinBackoff time.Duration

    // MaxBackoff to use when the fetcher retries.
    //
    // Must be greater than MinBackoff, otherwise the
    // fetcher returns an error.
    //
    // Defaults to `1s`.
    MaxBackoff time.Duration
}
Fetcher implements a page fetcher.
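A fetcher with custom retry behavior, built only from the fields above; zero values fall back to the documented defaults:

import (
    "time"

    "github.com/yields/ant"
)

// fetcher retries up to 3 times with backoff between 100ms and 2s;
// Client and UserAgent are left nil so the documented defaults apply.
var fetcher = &ant.Fetcher{
    MaxAttempts: 3,
    MinBackoff:  100 * time.Millisecond,
    MaxBackoff:  2 * time.Second,
}

Such a fetcher is typically handed to the engine via EngineConfig.Fetcher.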
func (*Fetcher) Fetch ¶
Fetch fetches a page by URL.
The method uses the configured client to make a new request, parse the response, and return a page.
The method returns a nil page and nil error when the status code is 404.
The fetcher will retry the request when the status code indicates a temporary error or when a temporary network error occurs.
The returned page contains the response's body, the body must be read until EOF and closed so that the client can re-use the underlying TCP connection.
type Limiter ¶
type Limiter interface {
    // Limit blocks until a request is allowed to happen.
    //
    // The method receives a URL and must block until a request
    // to the URL is allowed to happen.
    //
    // If the given context is canceled, the method returns immediately
    // with the context's err.
    Limit(ctx context.Context, u *url.URL) error
}
Limiter controls how many requests can be made by the engine.
A limiter receives a context and a URL and blocks until a request is allowed to happen or returns an error if the context is canceled.
A limiter must be safe to use from multiple goroutines.
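Custom policies can be written as functions; a sketch of a per-host limiter built on `golang.org/x/time/rate`, assuming `LimiterFunc` (below) shares the `Limit` method's signature:

import (
    "context"
    "net/url"
    "sync"

    "github.com/yields/ant"
    "golang.org/x/time/rate"
)

// perHost returns a limiter that allows n requests per second
// to each distinct hostname.
func perHost(n int) ant.LimiterFunc {
    var (
        mu       sync.Mutex
        limiters = map[string]*rate.Limiter{}
    )
    return func(ctx context.Context, u *url.URL) error {
        mu.Lock()
        l, ok := limiters[u.Hostname()]
        if !ok {
            l = rate.NewLimiter(rate.Limit(n), n)
            limiters[u.Hostname()] = l
        }
        mu.Unlock()
        return l.Wait(ctx) // blocks until allowed or ctx is canceled
    }
}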
type LimiterFunc ¶
LimiterFunc implements a limiter.
func Limit ¶
func Limit(n int) LimiterFunc
Limit returns a new limiter.
The limiter allows `n` requests per second.
func LimitHostname ¶
func LimitHostname(n int, name string) LimiterFunc
LimitHostname returns a hostname limiter.
The limiter allows `n` requests for the hostname per second.
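For example, to stay polite toward a single site:

// Allow at most 2 requests per second to example.com;
// URLs with other hostnames are not limited by this limiter.
limiter := ant.LimitHostname(2, "example.com")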
func LimitPattern ¶
func LimitPattern(n int, pattern string) LimiterFunc
LimitPattern returns a pattern limiter.
The limiter allows `n` requests for any URLs that match the pattern per second.
The provided pattern is matched against a URL that does not contain the query string or the scheme.
func LimitRegexp ¶
func LimitRegexp(n int, expr string) LimiterFunc
LimitRegexp returns a new regexp limiter.
The limiter limits all URLs that match the regexp; the URL being matched does not contain the scheme or the query parameters.
type List ¶
List represents a list of nodes.
The list wraps an HTML node slice with helper methods to extract data and manipulate the list.
func (List) At ¶
At returns a list that contains the node at index i.
If a negative index is provided, the method returns the node from the end of the list.
func (List) Query ¶
Query returns a list of nodes matching selector.
If the selector is invalid, the method returns a nil list.
type Matcher ¶
type Matcher interface {
    // Match returns true if the URL matches.
    //
    // The method is called just before a URL is queued;
    // if it returns false, the URL will not be queued.
    Match(url *url.URL) bool
}
Matcher represents a URL matcher.
A matcher must be safe to use from multiple goroutines.
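Ad-hoc rules can be expressed directly as functions, assuming `MatcherFunc` (below) shares the `Match` method's signature:

import (
    "net/url"
    "strings"

    "github.com/yields/ant"
)

// skipAdmin queues every URL except those under /admin;
// the path rule is illustrative.
var skipAdmin ant.MatcherFunc = func(u *url.URL) bool {
    return !strings.HasPrefix(u.Path, "/admin")
}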
type MatcherFunc ¶
MatcherFunc implements a Matcher.
func MatchHostname ¶
func MatchHostname(host string) MatcherFunc
MatchHostname returns a new hostname matcher.
The matcher returns true for all URLs that match the host.
func MatchPattern ¶
func MatchPattern(pattern string) MatcherFunc
MatchPattern returns a new pattern matcher.
The matcher returns true for all URLs that match the pattern; the URL being matched does not contain the scheme or the query parameters.
func MatchRegexp ¶
func MatchRegexp(expr string) MatcherFunc
MatchRegexp returns a new regexp matcher.
The matcher returns true for all URLs that match the regexp; the URL being matched does not contain the scheme or the query parameters.
type Page ¶
Page represents a page.
func (*Page) Query ¶
Query returns all nodes matching selector.
The method returns an empty list if no nodes were found.
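A sketch of drilling into a page with Query and At; the selectors and variable names are illustrative:

// Select all anchors on the page, then narrow the list.
links := page.Query("a[href]")
first := links.At(0)         // list holding the first anchor
last := links.At(-1)         // negative indexes count from the end
spans := first.Query("span") // query within a list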
type Queue ¶
type Queue interface {
    // Enqueue enqueues the given set of URLs.
    //
    // The method returns io.EOF if the queue was
    // closed and a context error if the context was
    // canceled.
    //
    // Any other error will be treated as a critical
    // error and will be propagated.
    Enqueue(ctx context.Context, urls URLs) error

    // Dequeue dequeues a URL.
    //
    // The method returns a URL or an io.EOF error if
    // the queue was stopped.
    //
    // The method blocks until a URL is available or
    // until the queue is closed.
    Dequeue(ctx context.Context) (*URL, error)

    // Done acknowledges a URL.
    //
    // When a URL has been handled by the engine, the method
    // is called with the URL.
    Done(ctx context.Context, url *URL) error

    // Wait blocks until the queue is closed.
    //
    // When the engine encounters an error, or there are
    // no more URLs to handle, the method should unblock.
    Wait()

    // Close closes the queue.
    //
    // The method blocks until the queue is closed;
    // any queued URLs are discarded.
    Close(context.Context) error
}
Queue represents a URL queue.
A queue must be safe to use from multiple goroutines.
type Scraper ¶
type Scraper interface {
    // Scrape scrapes the given page.
    //
    // The method can return a set of URLs that should
    // be queued and scraped next.
    //
    // If the scraper returns an error and it implements
    // a `Temporary() bool` method that returns true, it will
    // be retried.
    Scrape(ctx context.Context, p *Page) (URLs, error)
}
Scraper represents a scraper.
A scraper must be safe to use from multiple goroutines.
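A sketch of a hand-written scraper, assuming List is a node slice as described above; it counts title nodes and returns no follow-up URLs, which keeps the crawl from expanding past the seed pages:

import (
    "context"
    "log"

    "github.com/yields/ant"
)

// titleScraper is a sketch of a custom scraper.
type titleScraper struct{}

// Scrape logs how many title nodes the page has and returns
// nil URLs, so no further pages are queued from this page.
func (titleScraper) Scrape(ctx context.Context, p *ant.Page) (ant.URLs, error) {
    list := p.Query("title")
    log.Printf("found %d title node(s)", len(list))
    return nil, nil
}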
func JSON ¶
JSON returns a new JSON scraper.
The scraper receives the writer to write JSON lines into, the type to scrape from pages, and optional selectors from which to extract the next set of pages to crawl.
The provided type `t` must be a struct; otherwise the scraper returns an error on the initial scrape and the crawl engine aborts.
The scraper uses the `encoding/json` package to encode the provided type into JSON; any errors received from the encoder are returned from the scraper.
If no selectors are provided, the scraper will return all valid URLs on the page.
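A sketch of a JSON scraper; the `product` type and its `css` struct tags are assumptions based on the scan package listed under Directories:

import (
    "os"

    "github.com/yields/ant"
)

// product is a hypothetical type to scrape from each page; the
// `css` struct tags are an assumption.
type product struct {
    Name  string `css:".name"`
    Price string `css:".price"`
}

// Write one JSON line per page to stdout and follow links that
// match the `a.next` selector to find the next pages to crawl.
var scraper = ant.JSON(os.Stdout, product{}, "a.next")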
Source Files ¶
Directories ¶
Path | Synopsis
---|---
_examples |
antcache | Package antcache implements an HTTP client that caches responses.
antcdp | Package antcdp is an experimental package that implements an `ant.Client` that performs HTTP requests using chrome and returns a rendered response.
anttest | Package anttest implements scraper test helpers.
internal |
internal/normalize | Package normalize provides URL normalization.
internal/robots | Package robots implements a higher-level robots.txt interface.
internal/scan | Package scan implements structures that can scan HTML into go values.
internal/selectors | Package selectors provides utilities to compile and cache CSS selectors.