ta-site-crawler

This repository is a solution to a test assignment that goes like this:

Implement a web crawler that recursively downloads a given site (following the links). The crawler should download the document at the given URL and continue downloading via the links found in that document.

The crawler should support resuming the download. The crawler should download only text documents - html, css, js (ignore images, videos, etc.). The crawler should download documents only from the same domain (ignore external links). The crawler should be multithreaded (which parts to parallelize is up to you).

Requirements are given informally on purpose. We want to see how you will make decisions on your own, what is more important and what is less.

We expect a working application that we can build and run. We do not expect correct handling of all error types and boundary cases; you should set the "good enough" bar yourself.

There are no restrictions on 3rd-party libraries.

Solution


Please see my reasoning in the docs/madr folder. MADR is a lean template to capture any decisions in a structured way.

I find that this kind of thing - the reasoning at a given point in time - usually slips through the cracks, and a few years later, in a different context, looking at different surrounding code, it is difficult to work out why certain things were done the way they were back then.

So I recently decided to look for a way to capture that reasoning alongside the code, and stumbled upon MADR.

But the most important bits — which are, IMHO, goals and non-goals — I will list right here:

  • Goals:
    • Recursively follow links within the same domain (the protocol and the www subdomain are ignored; see the sketch after this list)
    • Do not download the same document twice
    • Crawler work can be interrupted by Ctrl-C at any time (or it can crash...)
    • Crawler must be able to resume the crawling after such an event
    • Download only text documents - html, css, js
    • Download only from the same domain
    • Parse only statically present links (those that appear in the HTML returned by the server)
    • Have reasonable timeouts on all requests
    • Have a log file
    • Support running on *nix systems
    • Be multithreaded
    • Be able to react adequately to the following errors:
      • URL is unreachable (initial one or any of the found ones)
      • output directory is... not a directory, not writable, etc
      • the crawling subfolder in the output directory is not a directory, is not writable, contains content that did not come from our crawler, or contains content that seems to be broken
  • Possible goals:
    • Display what our threads are doing nicely in a console
    • Also, maybe show some runtime stats, like the number of documents downloaded, the number of links still to be processed, average (median?) server response time, download speed, etc.
    • Have a nice CLI interface
    • Support running on Windows (it would mostly involve path handling, I think)
  • Non goals:
    • Resuming the download of the same document (that would be important to support if we were to support media formats)
    • Support crawling SPA/Ajax sites (that would require a headless browser and a lot of headaches)
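
To make the first goal concrete, here is a minimal sketch of a same-domain check that ignores the protocol and the www subdomain. The helper names (normalizeHost, sameSite) are illustrative and not necessarily what the crawler itself uses:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeHost lowercases the host and strips an optional "www." prefix,
// so that http://www.example.com and https://example.com compare equal.
func normalizeHost(u *url.URL) string {
	host := strings.ToLower(u.Hostname())
	return strings.TrimPrefix(host, "www.")
}

// sameSite reports whether candidate belongs to the same site as root,
// ignoring the protocol and the www subdomain.
func sameSite(root, candidate *url.URL) bool {
	return normalizeHost(root) == normalizeHost(candidate)
}

func main() {
	root, _ := url.Parse("https://bbcgoodfood.com")
	a, _ := url.Parse("http://www.bbcgoodfood.com/recipes")
	b, _ := url.Parse("https://cdn.bbcgoodfood.com/img.png")
	fmt.Println(sameSite(root, a)) // true: only the protocol and www differ
	fmt.Println(sameSite(root, b)) // false: a different subdomain
}
```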

How to build and run

Building is easy: go build -o crawler ./cmd/crawler

Running is not so hard either: ./crawler --url https://bbcgoodfood.com --workers 10 --output-dir ~/crawled-sites -c

The crawler will create a subfolder inside the given output directory and download all the documents there. It will also be able to resume work if such a subfolder already exists.
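
Resuming relies on the crawler shutting down (or crashing) without corrupting its on-disk state. Here is a minimal sketch of how a Ctrl-C interrupt could be handled with signal.NotifyContext; this is an illustration of the approach, not necessarily how this crawler wires it up:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled on Ctrl-C (SIGINT) or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	for {
		select {
		case <-ctx.Done():
			// Persist or flush any in-memory state here so a later run can resume.
			fmt.Println("interrupted, state left on disk for resuming")
			return
		case <-time.After(1 * time.Second):
			fmt.Println("crawling...") // placeholder for real worker activity
		}
	}
}
```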

The --log-to-stdout/-c flag (c for console) makes the crawler log to STDOUT for better visibility. Without that flag, it logs to a file inside the output directory. You can also control the log level with the --log-level/-l flag (the default is debug).

You can also set the HTTP request timeout with the --http-timeout/-t flag (the default is 5 seconds).
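
Such a timeout flag maps naturally onto Go's http.Client timeout. A minimal sketch of that wiring, assuming the standard flag package (the real CLI may use a different flag library and short-flag handling):

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
	"time"
)

func main() {
	httpTimeout := flag.Duration("http-timeout", 5*time.Second, "timeout for every HTTP request")
	flag.Parse()

	// http.Client.Timeout covers the whole request: connecting, redirects,
	// and reading the response body.
	client := &http.Client{Timeout: *httpTimeout}

	resp, err := client.Get("https://bbcgoodfood.com")
	if err != nil {
		fmt.Println("request failed (or timed out):", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```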

Values I tried to demonstrate through this solution

  • code should be easy for devops to manage (flags, clear errors, logging)
  • practical balance between being bulletproof and not overly complex
  • comment everything that might not be clear in your code, and warn about possible failure scenarios / trade-offs
  • sometimes having one long method that you can read top to bottom is better than having five smaller methods you have to jump between to understand the code flow
  • when you add something new to your stack, spend some time to be sure you're adding the right thing
  • be aware of the corner cases and potential problems

Mistakes I made

Well, as was pointed out by the author of the test assignment, "already downloaded files" can be checked by looking in the "done" folder, and "files to be downloaded" can be stored as empty files in another directory. Effectively, we can use the filesystem as a simple KV storage in this case.
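
For illustration, here is a minimal sketch of that idea: a URL is queued as an empty marker file in a queue directory, and the marker is moved to a done directory once the document has been downloaded. The directory layout and function names are assumptions for this sketch, not the actual on-disk format of this crawler:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// key turns a URL into a filesystem-safe file name.
func key(rawURL string) string {
	sum := sha256.Sum256([]byte(rawURL))
	return hex.EncodeToString(sum[:])
}

// enqueue records a URL as "to be downloaded" unless it is already queued or done.
func enqueue(queueDir, doneDir, rawURL string) error {
	name := key(rawURL)
	if _, err := os.Stat(filepath.Join(doneDir, name)); err == nil {
		return nil // already downloaded
	}
	f, err := os.OpenFile(filepath.Join(queueDir, name), os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
	if err != nil {
		if os.IsExist(err) {
			return nil // already queued
		}
		return err
	}
	return f.Close()
}

// markDone moves the marker from the queue directory to the done directory.
func markDone(queueDir, doneDir, rawURL string) error {
	name := key(rawURL)
	return os.Rename(filepath.Join(queueDir, name), filepath.Join(doneDir, name))
}

func main() {
	queueDir, doneDir := "state/queue", "state/done"
	for _, d := range []string{queueDir, doneDir} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			panic(err)
		}
	}
	url := "https://bbcgoodfood.com/recipes"
	if err := enqueue(queueDir, doneDir, url); err != nil {
		panic(err)
	}
	if err := markDone(queueDir, doneDir, url); err != nil {
		panic(err)
	}
	fmt.Println("marker now lives in", doneDir)
}
```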
