# GOV.UK Crawler Worker
This is a worker that will consume GOV.UK URLs
from a message queue and crawl them, saving the output to disk.
## Requirements

To run this worker you will need, at a minimum:

- A running RabbitMQ server to consume URLs from
- A working Go development setup to build the binary
## Development
You can run the tests locally by running `make`.
This project uses Godep to manage its dependencies. If you have a
working Go development setup, you should be able to install Godep by
running:

```
go get github.com/tools/godep
```
## Running
To run the worker you'll first need to build it using `go build` to
generate a binary. You can then run the built binary directly using
`./govuk_crawler_worker`. All configuration is injected using
environment variables. For details on this, look at the `main.go` file.
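As a minimal sketch of that pattern, configuration can be read with
`os.Getenv` and sensible defaults. The variable names below
(`AMQP_ADDRESS`, `MIRROR_ROOT`) and the helper are illustrative
assumptions, not the worker's actual settings; check `main.go` for the
real ones.

```go
package main

import (
	"fmt"
	"os"
)

// getEnvDefault returns the value of an environment variable,
// falling back to a default when it is unset.
func getEnvDefault(key, defaultVal string) string {
	if val := os.Getenv(key); val != "" {
		return val
	}
	return defaultVal
}

func main() {
	// Hypothetical variable names for illustration; see main.go
	// for the configuration the worker actually reads.
	amqpAddress := getEnvDefault("AMQP_ADDRESS", "amqp://localhost:5672/")
	mirrorRoot := getEnvDefault("MIRROR_ROOT", "/tmp/mirror")

	fmt.Println("AMQP:", amqpAddress, "Mirror root:", mirrorRoot)
}
```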
## How it works
This is a message queue worker that consumes URLs from a queue and
crawls them, saving the output to disk. While this is the worker's main
purpose, it carries out a few other activities before the page gets
written to disk.
## Workflow
The workflow for the worker can be defined as the following set of
steps (a Go sketch of this loop follows the list):
- Read a URL from the queue, e.g. https://www.gov.uk/bank-holidays
- Crawl the received URL
- Write the body of the crawled URL to disk
- Extract any matching URLs from the HTML body of the crawled URL
- Publish the extracted URLs to the worker's own exchange
- Acknowledge that the URL has been crawled
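The following is a minimal sketch of that loop, assuming the
`streadway/amqp` RabbitMQ client and `golang.org/x/net/html` for link
extraction. The queue name, mirror directory, routing key, and the
`extractURLs` helper are illustrative assumptions rather than the
worker's actual code.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
	"strings"

	"github.com/streadway/amqp"
	"golang.org/x/net/html"
)

// extractURLs parses an HTML document and collects href values that
// point at GOV.UK pages.
func extractURLs(body io.Reader) []string {
	var urls []string
	doc, err := html.Parse(body)
	if err != nil {
		return urls
	}
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" && strings.HasPrefix(attr.Val, "https://www.gov.uk/") {
					urls = append(urls, attr.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return urls
}

func main() {
	conn, err := amqp.Dial("amqp://localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// Queue name is an assumption; the worker binds its own queue
	// to govuk_crawler_exchange on startup.
	deliveries, err := ch.Consume("govuk_crawler_queue", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range deliveries {
		// 1. Read a URL from the queue.
		pageURL := string(d.Body)

		// 2. Crawl the received URL.
		resp, err := http.Get(pageURL)
		if err != nil {
			d.Nack(false, true) // requeue on failure
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		// 3. Write the body of the crawled URL to disk, mirroring
		// the URL path under a local root directory.
		parsed, _ := url.Parse(pageURL)
		path := filepath.Join("/tmp/mirror", parsed.Path)
		os.MkdirAll(filepath.Dir(path), 0755)
		os.WriteFile(path, body, 0644)

		// 4 & 5. Extract matching URLs from the HTML body and
		// publish them to the worker's own exchange. The routing
		// key choice here is illustrative.
		for _, u := range extractURLs(bytes.NewReader(body)) {
			ch.Publish("govuk_crawler_exchange", u, false, false,
				amqp.Publishing{ContentType: "text/plain", Body: []byte(u)})
		}

		// 6. Acknowledge that the URL has been crawled.
		d.Ack(false)
	}
}
```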
## The Interface
The public interface for the worker is the exchange labelled
`govuk_crawler_exchange`. When the worker starts it creates this
exchange and binds it to its own queue for consumption.
If you provide user credentials for RabbitMQ that aren't on the root
vhost `/`, you may wish to bind a global exchange yourself for easier
publishing by other applications.
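As a rough illustration of publishing into that interface from another
application, again assuming the `streadway/amqp` client; the connection
URL, exchange type, and routing key below are assumptions:

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	// Connection URL is an assumption; point this at your broker.
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// Declare the exchange the worker creates on startup, so the
	// publish succeeds even if the worker hasn't started yet. The
	// "topic" exchange type is an assumption.
	if err := ch.ExchangeDeclare("govuk_crawler_exchange", "topic", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	// Publish a URL for the worker to crawl.
	err = ch.Publish("govuk_crawler_exchange", "https://www.gov.uk/bank-holidays", false, false,
		amqp.Publishing{
			ContentType: "text/plain",
			Body:        []byte("https://www.gov.uk/bank-holidays"),
		})
	if err != nil {
		log.Fatal(err)
	}
}
```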
## Licence
MIT License