# gospider

`gospider` is a concurrent web spider. By default, it respects robots.txt entries.
## Usage

Install with:

```sh
make install
```

`gospider` is designed to be used either from the CLI or in code.
### CLI

`gospider` can be run from the CLI:

```sh
gospider start -r "http://foo.bar/" > out.html
```

By default, `gospider` writes an HTML sitemap to stdout. Use `gospider --help` for more options.
### Code

The `spider.New` function follows the functional options pattern. The only required parameter is the root URL; all others default to sensible values if not supplied.
```go
uri, _ := url.Parse("http://foo.bar/")

s := spider.New(
	spider.WithRoot(uri),
	spider.WithConcurrency(5),
	spider.WithTimeout(time.Second*2),
)

if err := s.Run(); err != nil {
	log.Fatal("error running spider: ", err)
}

return s.Report(os.Stdout)
```
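For context, the functional options pattern takes a variadic list of functions, each of which mutates a private configuration struct. A minimal sketch of how an option like `WithConcurrency` could be implemented (illustrative only; gospider's internal types may differ):

```go
// Option mutates the spider's private configuration.
// (Sketch only: the real gospider types may differ.)
type Option func(*config)

type config struct {
	concurrency int
}

// WithConcurrency sets the number of workers in the pool.
func WithConcurrency(n int) Option {
	return func(c *config) { c.concurrency = n }
}
```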
## Modularity

`gospider` ships with a simple HTML reporter and uses the default HTTP client to make requests. However, any requester or reporter can be used by supplying a struct that implements the `Requester` or `Reporter` interface. For example, to make requests through a proxy you could do:
```go
type proxyRequester struct {
	client *http.Client
}

func (r *proxyRequester) Request(ctx context.Context, uri *url.URL) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, uri.String(), nil)
	if err != nil {
		return nil, err
	}
	res, err := r.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	return io.ReadAll(res.Body)
}
```
```go
s := spider.New(
	spider.WithRoot(uri),
	spider.WithRequester(&proxyRequester{
		client: &http.Client{
			// proxyURL is a placeholder for your proxy's *url.URL
			Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		},
	}),
)
```
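A custom reporter works the same way. The exact `Reporter` method signature isn't shown in this README, so the sketch below assumes, purely for illustration, that a reporter receives the visited URLs and a destination writer:

```go
// jsonReporter writes the visited URLs as a JSON array.
// (Illustrative only: gospider's actual Reporter interface may differ.)
type jsonReporter struct{}

func (jsonReporter) Report(w io.Writer, visited []*url.URL) error {
	out := make([]string, 0, len(visited))
	for _, u := range visited {
		out = append(out, u.String())
	}
	return json.NewEncoder(w).Encode(out)
}
```

It would then be wired in via the matching option, presumably a `WithReporter` counterpart to `WithRequester` above.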
## Concurrency

`gospider` uses a worker-pool concurrency model. As URLs are discovered, they are added to a queue. Each worker (the pool size is set by the concurrency option) polls the queue for work. Once the queue is empty, the worker pool is drained and the spider stops.
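The shape of that model, as a minimal standalone sketch (not gospider's actual implementation; in the real spider, workers also enqueue newly discovered URLs, which makes shutdown detection more involved):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const concurrency = 5

	queue := make(chan string)
	var wg sync.WaitGroup

	// Start a fixed pool of workers that poll the queue.
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for u := range queue {
				fmt.Printf("worker %d: fetching %s\n", id, u)
			}
		}(i)
	}

	// Feed the queue, then close it so the workers drain and exit.
	for _, u := range []string{"http://foo.bar/", "http://foo.bar/a"} {
		queue <- u
	}
	close(queue)

	wg.Wait() // pool drained: this is where the spider stops
}
```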