
MCrawl

This is a CLI-based web crawler written in Go, built as a take-home test for Monzo's interview process. The project aims to represent how I would structure and write code in production.

Brief

We'd like you to write a simple web crawler in a programming language you're familiar with. Given a starting URL, the crawler should visit each URL it finds on the same domain. It should print each URL visited, and a list of links found on that page. The crawler should be limited to one subdomain - so when you start with https://monzo.com/, it would crawl all pages within monzo.com, but not follow external links, for example to facebook.com or community.monzo.com.

We would like to see your own implementation of a web crawler. Please do not use frameworks like scrapy or go-colly which handle all the crawling behind the scenes or someone else's code. You are welcome to use libraries to handle things like HTML parsing.
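
As a concrete (and purely illustrative) reading of the same-subdomain rule, the check below compares only the hostnames of the start URL and a candidate link. The sameSubdomain helper is made up for this sketch and is not part of the project's code:

package main

import (
    "fmt"
    "net/url"
)

// sameSubdomain reports whether candidate has exactly the same host as start,
// so https://monzo.com/about matches but https://community.monzo.com does not.
// Illustrative only; the project's own filtering logic may differ.
func sameSubdomain(start, candidate string) bool {
    s, err := url.Parse(start)
    if err != nil {
        return false
    }
    c, err := url.Parse(candidate)
    if err != nil {
        return false
    }
    return s.Hostname() == c.Hostname()
}

func main() {
    fmt.Println(sameSubdomain("https://monzo.com/", "https://monzo.com/about"))          // true
    fmt.Println(sameSubdomain("https://monzo.com/", "https://community.monzo.com/help")) // false
    fmt.Println(sameSubdomain("https://monzo.com/", "https://facebook.com/monzo"))       // false
}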

Installation

Pre-Compiled Binaries

Pre-compiled mcrawl binaries for supported platforms can be found under the Releases section of this repository.

Build From Source

Clone this repository and build the binary from source using the provided Makefile (requires Go 1.15+):

$ make

This will compile and place the mcrawl binary into a local bin directory.

Usage

The web crawler can be invoked via the CLI:

$ mcrawl [<flags>] <url>

Help

Use the --help flag to view usage information for the CLI:

$ mcrawl --help
usage: mcrawl [<flags>] <url>

Flags:
  --help        Show context-sensitive help (also try --help-long and --help-man).
  --workers=10  Number of concurrent workers to use for crawling.
  --robotstxt   Respect the site's robots.txt file, if any, while crawling.
  --debug       Run the web crawler in debug mode.

Args:
  <url>  URL to start crawling from. Will only follow URLs belonging to the given URL's subdomain.

Example

$ mcrawl --workers 20 http://monzo.com/
...
https://monzo.com/i/loans/home-improvement-loans
  -> https://monzo.com/features/savings
  -> https://www.instagram.com/monzo
  ...
  -> https://monzo.com/about
  -> https://monzo.com/i/fraud
...
======================
Unique URLs crawled: 1675
Time taken: 325.539s

Output

All output is currently printed to stdout. To write the output to a file, use shell redirection.

Bash example:

$ mcrawl --workers 20 http://monzo.com/ >> output.txt

Development

The given Makefile provides some basic utilities to facilitate the development process:

  • To run the code formatter:

    $ make fmt
    
  • To run tests:

    $ make test
    
Release

Pre-compiled versions of the mcrawl binary for different platforms are automatically published to GitHub via GoReleaser and GitHub Actions whenever a new tag is pushed to the repository.

See .github/workflows/release.yaml and .goreleaser.yml for more information on how this automated release process works.

Approach

The figure below gives a high-level view of how the web crawler is designed.

[Architecture diagram]

The core components of the web crawler reside in the crawler package. Essentially, the crawler spawns a number of concurrent workers in the background that pick up URLs from a work queue, fetch the links found on each HTML page, and put them onto a results queue. At the same time, a separate worker processes the results: it stores the links found for each URL and enqueues those links back onto the work queue to be crawled. The crawler defines two interfaces, Fetcher and Excluder, which facilitate testing and customising of the crawler.
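
A rough, self-contained Go sketch of this worker/queue pattern is shown below. It is not the actual crawler package code: the worker count, channel sizes, and the canned fetch function are all assumptions made for the example.

package main

import (
    "fmt"
    "sync"
)

// result pairs a crawled URL with the links found on its page.
type result struct {
    url   string
    links []string
}

// fetch is a stand-in for the real HTTP fetcher; it returns a canned set of
// links so the example is self-contained and runnable offline.
func fetch(url string) []string {
    fake := map[string][]string{
        "https://example.com/":  {"https://example.com/a", "https://example.com/b"},
        "https://example.com/a": {"https://example.com/"},
    }
    return fake[url]
}

func main() {
    const workers = 4

    work := make(chan string, 1024)    // URLs waiting to be crawled
    results := make(chan result, 1024) // pages crawled, with their links
    var pending sync.WaitGroup         // URLs enqueued but not yet fully processed

    // Crawl workers: take a URL off the work queue, fetch its links,
    // and push a result onto the results queue.
    for i := 0; i < workers; i++ {
        go func() {
            for u := range work {
                results <- result{url: u, links: fetch(u)}
            }
        }()
    }

    // Result worker: record each page, then enqueue any unseen links
    // back onto the work queue.
    seen := map[string]bool{}
    go func() {
        for r := range results {
            fmt.Println(r.url)
            for _, l := range r.links {
                fmt.Println("  ->", l)
                if !seen[l] {
                    seen[l] = true
                    pending.Add(1)
                    work <- l
                }
            }
            pending.Done()
        }
    }()

    start := "https://example.com/"
    seen[start] = true
    pending.Add(1)
    work <- start

    pending.Wait() // every enqueued URL has been processed
    close(work)
    close(results)
    fmt.Println("Unique URLs crawled:", len(seen))
}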

The fetchers package defines a LinksFetcher, which fulfils the crawler.Fetcher interface and is responsible for making HTTP requests, parsing HTML content, and returning all unique resolved links found on a page. LinksFetcher also normalizes links, so that the crawler doesn't re-visit pages it has seen before.
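
The brief explicitly allows libraries for HTML parsing, so a minimal sketch of this fetch/parse/resolve/normalize step might use golang.org/x/net/html as below. The fetchLinks helper and its normalization rules are assumptions for illustration, not the package's actual API:

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"

    "golang.org/x/net/html"
)

// fetchLinks downloads pageURL, parses the HTML, and returns the unique links
// found in <a href> attributes, resolved against the page URL and lightly
// normalized (fragment stripped, trailing slash trimmed).
func fetchLinks(pageURL string) ([]string, error) {
    base, err := url.Parse(pageURL)
    if err != nil {
        return nil, err
    }

    resp, err := http.Get(pageURL)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    doc, err := html.Parse(resp.Body)
    if err != nil {
        return nil, err
    }

    seen := map[string]bool{}
    var links []string

    // Walk the node tree depth-first, collecting href attributes.
    var visit func(n *html.Node)
    visit = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key != "href" {
                    continue
                }
                ref, err := url.Parse(strings.TrimSpace(a.Val))
                if err != nil {
                    continue // skip malformed hrefs
                }
                abs := base.ResolveReference(ref)
                abs.Fragment = "" // #section links point at the same page
                norm := strings.TrimSuffix(abs.String(), "/")
                if norm != "" && !seen[norm] {
                    seen[norm] = true
                    links = append(links, norm)
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            visit(c)
        }
    }
    visit(doc)

    return links, nil
}

func main() {
    links, err := fetchLinks("https://monzo.com/")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    for _, l := range links {
        fmt.Println(l)
    }
}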

The excluders package defines a RobotsTxtExcluder, which fulfils the crawler.Excluder interface and excludes paths based on the rules defined in the requested site's robots.txt file (if any). This allows the crawler to respect robots.txt and avoid crawling paths it shouldn't. The feature is enabled by default, but can be turned off via a CLI flag.
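
For illustration only, a heavily simplified excluder that honours Disallow rules from the "User-agent: *" group could look like the sketch below. The real crawler.Excluder interface and the project's robots.txt handling may well differ (for example, Allow rules and wildcards are ignored here):

package main

import (
    "bufio"
    "fmt"
    "net/http"
    "strings"
)

// Excluder reports whether a path should be skipped while crawling.
// (Illustrative; the real crawler.Excluder interface may differ.)
type Excluder interface {
    Exclude(path string) bool
}

// robotsTxtExcluder holds Disallow prefixes that apply to all user agents.
type robotsTxtExcluder struct {
    disallowed []string
}

// newRobotsTxtExcluder fetches <site>/robots.txt and collects Disallow rules
// from the "User-agent: *" group. Real robots.txt handling is more involved;
// this is only a sketch.
func newRobotsTxtExcluder(site string) (*robotsTxtExcluder, error) {
    resp, err := http.Get(strings.TrimSuffix(site, "/") + "/robots.txt")
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    e := &robotsTxtExcluder{}
    if resp.StatusCode != http.StatusOK {
        return e, nil // no robots.txt: nothing is excluded
    }

    inStarGroup := false
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        switch {
        case strings.HasPrefix(strings.ToLower(line), "user-agent:"):
            agent := strings.TrimSpace(line[len("user-agent:"):])
            inStarGroup = agent == "*"
        case inStarGroup && strings.HasPrefix(strings.ToLower(line), "disallow:"):
            if path := strings.TrimSpace(line[len("disallow:"):]); path != "" {
                e.disallowed = append(e.disallowed, path)
            }
        }
    }
    return e, scanner.Err()
}

// Exclude returns true if the path falls under a collected Disallow rule.
func (e *robotsTxtExcluder) Exclude(path string) bool {
    for _, p := range e.disallowed {
        if strings.HasPrefix(path, p) {
            return true
        }
    }
    return false
}

func main() {
    excl, err := newRobotsTxtExcluder("https://monzo.com")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("exclude /docs/?", excl.Exclude("/docs/"))
}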

Improvements

  • Limit crawl depth
  • Implement retries
  • Write output to file
  • Add timeouts to HTTP requests
  • Render JavaScript (e.g. non-SSR SPAs)
  • Track URLs being processed to reduce duplicate work
  • Exclude non-HTML file-based links (e.g. .pdf, .json)
