# GOV.UK Crawler Worker
This is a worker that will consume GOV.UK URLs
from a message queue and crawl them, saving the output to disk.
## Requirements

To run this worker you will need, at a minimum:

- A running RabbitMQ server to consume URLs from
- A working Go development setup to build the binary
## Development
You can run the tests locally by running `make`.
This project uses Godep to manage its dependencies. If you have a
working Go development setup, you should be able to install Godep by
running:

```
go get github.com/tools/godep
```
## Running
To run the worker you'll first need to build it using `go build` to
generate a binary. You can then run the built binary directly using
`./govuk_crawler_worker`. All configuration is injected using
environment variables. For details on this, look at the `main.go` file.
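As a minimal sketch of that pattern, configuration can be read with
`os.Getenv` and sensible defaults. The variable names below
(`AMQP_ADDRESS`, `MIRROR_ROOT`) and the helper are illustrative
assumptions, not the worker's actual settings; check `main.go` for the
real ones.

```go
package main

import (
	"fmt"
	"os"
)

// getEnvDefault returns the value of an environment variable,
// falling back to a default when it is unset.
func getEnvDefault(key, defaultVal string) string {
	if val := os.Getenv(key); val != "" {
		return val
	}
	return defaultVal
}

func main() {
	// Hypothetical variable names for illustration; see main.go
	// for the configuration the worker actually reads.
	amqpAddress := getEnvDefault("AMQP_ADDRESS", "amqp://localhost:5672/")
	mirrorRoot := getEnvDefault("MIRROR_ROOT", "/tmp/mirror")

	fmt.Println("AMQP:", amqpAddress, "Mirror root:", mirrorRoot)
}
```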
## How it works
This is a message queue worker that consumes URLs from a queue and
crawls them, saving the output to disk. While this is the worker's main
purpose, it carries out a few other activities before the page gets
written to disk.
## Workflow
The workflow for the worker can be defined as the following set of
steps (a Go sketch of this loop follows the list):
- Read a URL from the queue, e.g. https://www.gov.uk/bank-holidays
- Crawl the received URL
- Write the body of the crawled URL to disk
- Extract any matching URLs from the HTML body of the crawled URL
- Publish the extracted URLs to the worker's own exchange
- Acknowledge that the URL has been crawled
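The following is a minimal sketch of that loop, assuming the
`streadway/amqp` RabbitMQ client and `golang.org/x/net/html` for link
extraction. The queue name, mirror directory, routing key, and the
`extractURLs` helper are illustrative assumptions rather than the
worker's actual code.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
	"strings"

	"github.com/streadway/amqp"
	"golang.org/x/net/html"
)

// extractURLs parses an HTML document and collects href values that
// point at GOV.UK pages.
func extractURLs(body io.Reader) []string {
	var urls []string
	doc, err := html.Parse(body)
	if err != nil {
		return urls
	}
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" && strings.HasPrefix(attr.Val, "https://www.gov.uk/") {
					urls = append(urls, attr.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return urls
}

func main() {
	conn, err := amqp.Dial("amqp://localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// Queue name is an assumption; the worker binds its own queue
	// to govuk_crawler_exchange on startup.
	deliveries, err := ch.Consume("govuk_crawler_queue", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range deliveries {
		// 1. Read a URL from the queue.
		pageURL := string(d.Body)

		// 2. Crawl the received URL.
		resp, err := http.Get(pageURL)
		if err != nil {
			d.Nack(false, true) // requeue on failure
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		// 3. Write the body of the crawled URL to disk, mirroring
		// the URL path under a local root directory.
		parsed, _ := url.Parse(pageURL)
		path := filepath.Join("/tmp/mirror", parsed.Path)
		os.MkdirAll(filepath.Dir(path), 0755)
		os.WriteFile(path, body, 0644)

		// 4 & 5. Extract matching URLs from the HTML body and
		// publish them to the worker's own exchange. The routing
		// key choice here is illustrative.
		for _, u := range extractURLs(bytes.NewReader(body)) {
			ch.Publish("govuk_crawler_exchange", u, false, false,
				amqp.Publishing{ContentType: "text/plain", Body: []byte(u)})
		}

		// 6. Acknowledge that the URL has been crawled.
		d.Ack(false)
	}
}
```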
## The Interface
The public interface for the worker is the exchange labelled
`govuk_crawler_exchange`. When the worker starts it creates this
exchange and binds it to its own queue for consumption.
If you provide user credentials for RabbitMQ that aren't on the root
vhost `/`, you may wish to bind a global exchange yourself for easier
publishing by other applications.
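As a rough illustration of publishing into that interface from another
application, again assuming the `streadway/amqp` client; the connection
URL, exchange type, and routing key below are assumptions:

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	// Connection URL is an assumption; point this at your broker.
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// Declare the exchange the worker creates on startup, so the
	// publish succeeds even if the worker hasn't started yet. The
	// "topic" exchange type is an assumption.
	if err := ch.ExchangeDeclare("govuk_crawler_exchange", "topic", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	// Publish a URL for the worker to crawl.
	err = ch.Publish("govuk_crawler_exchange", "https://www.gov.uk/bank-holidays", false, false,
		amqp.Publishing{
			ContentType: "text/plain",
			Body:        []byte("https://www.gov.uk/bank-holidays"),
		})
	if err != nil {
		log.Fatal(err)
	}
}
```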
## Licence
MIT License