# GOV.UK Crawler Worker
This is a worker that will consume GOV.UK URLs
from a message queue and crawl them, saving the output to disk.
## Requirements

To run this worker you will need:

- [Go](https://golang.org/)
- [RabbitMQ](https://www.rabbitmq.com/)
## Development

You can run the tests locally by running the following:

```
go get -v -t ./...
go test -v ./...
```
Alternatively, to localise the dependencies, you can use `make`. This
will use the `third_party.go` tool to vendorise dependencies into a
folder within the project.
## Running

To run the worker you'll first need to build it using `go build` to
generate a binary. You can then run the built binary directly using
`./govuk_crawler_worker`. All configuration is injected using
environment variables. For details on this, look at the `main.go`
file.
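As an illustration of that pattern, here is a minimal sketch of reading configuration from the environment. The variable names and defaults are assumptions for this sketch; the authoritative list lives in `main.go`.

```go
package main

import (
	"log"
	"os"
)

// getEnvDefault returns the value of an environment variable,
// falling back to a default when it is unset.
func getEnvDefault(key, fallback string) string {
	if val := os.Getenv(key); val != "" {
		return val
	}
	return fallback
}

func main() {
	// These variable names are assumptions for this sketch; see
	// main.go for the configuration the worker actually reads.
	amqpAddress := getEnvDefault("AMQP_ADDRESS", "amqp://localhost:5672")
	exchangeName := getEnvDefault("AMQP_EXCHANGE", "govuk_crawler_exchange")

	log.Printf("connecting to %s, publishing to %s", amqpAddress, exchangeName)
}
```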
## How it works

This is a message queue worker that will consume URLs from a queue and
crawl them, saving the output to disk. While this is the main reason
for this worker to exist, it covers a few other activities in the
course of processing each URL before acknowledging it.
## Workflow

The workflow for the worker can be defined as the following set of
steps (an illustrative Go sketch follows the list):
- Read a URL from the queue, e.g. https://www.gov.uk/bank-holidays
- Crawl the received URL
- Write the body of the crawled URL to disk
- Extract any matching URLs from the HTML body of the crawled URL
- Publish the extracted URLs to the worker's own exchange
- Acknowledge that the URL has been crawled
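Here is an illustrative Go sketch of the fetch, write, and extract steps for a single URL. The file-naming scheme, the same-host filter, and all helper names are assumptions for illustration, not the worker's actual implementation; publishing the extracted URLs and acknowledging the message are left to the queue-handling code.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
	"strings"

	"golang.org/x/net/html"
)

// crawl performs the middle steps of the workflow above for one URL:
// fetch it, write the body to disk, and extract same-host URLs from
// the HTML. It returns the extracted URLs for publishing.
func crawl(rawURL, outputDir string) ([]string, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	// Derive a file name from the URL path and write the body to disk.
	parsed, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	name := strings.Trim(parsed.Path, "/")
	if name == "" {
		name = "index"
	}
	path := filepath.Join(outputDir, strings.ReplaceAll(name, "/", "_")+".html")
	if err := os.WriteFile(path, body, 0644); err != nil {
		return nil, err
	}

	return extractURLs(string(body), parsed), nil
}

// extractURLs pulls href attributes out of anchor tags and keeps only
// those that resolve to the same host as the crawled page.
func extractURLs(body string, base *url.URL) []string {
	var urls []string
	tokenizer := html.NewTokenizer(strings.NewReader(body))
	for {
		tt := tokenizer.Next()
		if tt == html.ErrorToken {
			return urls // end of document
		}
		if tt != html.StartTagToken {
			continue
		}
		token := tokenizer.Token()
		if token.Data != "a" {
			continue
		}
		for _, attr := range token.Attr {
			if attr.Key != "href" {
				continue
			}
			if ref, err := url.Parse(attr.Val); err == nil {
				if abs := base.ResolveReference(ref); abs.Host == base.Host {
					urls = append(urls, abs.String())
				}
			}
		}
	}
}

func main() {
	urls, err := crawl("https://www.gov.uk/bank-holidays", os.TempDir())
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("extracted %d URLs", len(urls))
}
```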
## The Interface

The public interface for the worker is the exchange labelled
`govuk_crawler_exchange`. When the worker starts, it creates this
exchange and binds it to its own queue for consumption.
If you provide user credentials for RabbitMQ that aren't on the root
vhost `/`, you may wish to bind a global exchange yourself for easier
publishing by other applications.
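As a hedged example of what another application might do to declare and bind against that interface, here is a sketch using the streadway/amqp client. The exchange name comes from this README; the connection URL, exchange kind, queue name, and binding key are assumptions for illustration.

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	// Connect to RabbitMQ; the URL here is an illustrative default.
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Declare the exchange the worker creates on startup.
	err = ch.ExchangeDeclare(
		"govuk_crawler_exchange", // name (from this README)
		"topic",                  // kind: an assumption for this sketch
		true,                     // durable
		false,                    // auto-delete
		false,                    // internal
		false,                    // no-wait
		nil,                      // arguments
	)
	if err != nil {
		log.Fatal(err)
	}

	// Declare a queue and bind it to the exchange; the queue name and
	// binding key are assumptions for this sketch.
	queue, err := ch.QueueDeclare("govuk_crawler_queue", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}
	if err := ch.QueueBind(queue.Name, "#", "govuk_crawler_exchange", false, nil); err != nil {
		log.Fatal(err)
	}

	log.Printf("bound %s to govuk_crawler_exchange", queue.Name)
}
```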