# set up the Go workspace
$ cd ~
$ mkdir go
$ export GOPATH=$HOME/go
# download code
$ mkdir -p go/src/github.com/jkamenik
$ cd go/src/github.com/jkamenik
$ git clone https://github.com/jkamenik/crawler
# get the libraries
$ cd ~/go/src/github.com/jkamenik/crawler
$ go get .
# test and build
$ go test .
$ go install
# run
$ ~/go/bin/crawler <args>
Challenge
The goal is to provide a tool that takes a single command-line argument, a URL, and determines the content of that URL by crawling it.
The following requirements apply to this challenge (a minimal sketch of one possible approach follows the example output below):
The tool must download the HTML
The tool must parse and print all the links found in that HTML
The tool must allow for an optional depth argument (default 2) which controls how many levels of pages it will crawl for links.
The output should be the link's text followed by the link URL (see below).
A non-zero exit code must be returned if the main URL is not accessible; errors on second-level URLs can be ignored.
$ crawler http://somedomain.com
Home -> /
About Us -> /about_us.php
Careers -> http://otherdomain.com/somedomain.com
Home -> http://somedomain.com
Careers -> /somedomain.com
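
One way to satisfy these requirements is sketched below, under the assumption that link extraction uses the golang.org/x/net/html parser and that the depth option is exposed as a -depth flag (both are illustrations; the actual repository may be organized differently). The sketch downloads each page with net/http, prints every link as text -> URL, and exits non-zero only if the top-level URL cannot be fetched.

```go
// main.go -- a minimal, sequential sketch; the real crawler may differ.
package main

import (
	"flag"
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strings"

	"golang.org/x/net/html"
)

// linkText collects the visible text inside an <a> element.
func linkText(n *html.Node) string {
	var sb strings.Builder
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if c.Type == html.TextNode {
			sb.WriteString(c.Data)
		} else {
			sb.WriteString(linkText(c))
		}
	}
	return strings.TrimSpace(sb.String())
}

// extractLinks walks the parsed HTML tree and appends (text, href) pairs.
func extractLinks(n *html.Node, out *[][2]string) {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				*out = append(*out, [2]string{linkText(n), a.Val})
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		extractLinks(c, out)
	}
}

// crawl fetches a page, prints its links, and recurses until depth runs out.
func crawl(page string, depth int) error {
	resp, err := http.Get(page)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("GET %s: %s", page, resp.Status)
	}

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return err
	}
	base, err := url.Parse(page)
	if err != nil {
		return err
	}

	var links [][2]string
	extractLinks(doc, &links)

	for _, l := range links {
		fmt.Printf("%s -> %s\n", l[0], l[1])
		if depth > 1 {
			// Resolve relative hrefs against the current page before
			// recursing; errors below the top level are ignored.
			if ref, err := url.Parse(l[1]); err == nil {
				_ = crawl(base.ResolveReference(ref).String(), depth-1)
			}
		}
	}
	return nil
}

func main() {
	depth := flag.Int("depth", 2, "how many levels of pages to crawl")
	flag.Parse()
	if flag.NArg() != 1 {
		fmt.Fprintln(os.Stderr, "usage: crawler [-depth N] <url>")
		os.Exit(2)
	}
	if err := crawl(flag.Arg(0), *depth); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Relative links are resolved against the page that contained them before recursing, and failures below the top level are silently ignored, as the requirements allow.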
Extra credit (optional)
Parallelize the downloading, parsing, and collecting of links (a concurrency sketch follows this list)
Follow redirects of any page
Add debug output, which is off by default and can be enabled with "-v"
Control the level of debugging by repeating "-v" (i.e., "-vvvv"); a flag-counting sketch follows this list
Save the HTML in a folder matching the link title
Save any resources used by the page: CSS, JS, and images.
Rewrite the links and references in the HTML to be relative file paths
Enable JavaScript, using Selenium WebDriver or similar
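
For the parallelization item, a rough sketch of one approach: each discovered link is crawled in its own goroutine, with a mutex-guarded visited set so no URL is fetched twice. The fetchLinks parameter is a hypothetical stand-in for the download-and-parse logic from the earlier sketch.

```go
// A rough sketch of one way to parallelize the crawl; not the repository's
// actual implementation.
package main

import (
	"fmt"
	"sync"
)

type link struct{ text, href string }

// crawlConcurrent visits each page in its own goroutine, sharing a
// mutex-guarded visited set so no URL is fetched twice.
func crawlConcurrent(start string, depth int, fetchLinks func(string) ([]link, error)) {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		visited = map[string]bool{}
	)

	var visit func(page string, depth int)
	visit = func(page string, depth int) {
		defer wg.Done()

		mu.Lock()
		if depth <= 0 || visited[page] {
			mu.Unlock()
			return
		}
		visited[page] = true
		mu.Unlock()

		links, err := fetchLinks(page)
		if err != nil {
			return // per the challenge, errors below the top level are ignored
		}
		for _, l := range links {
			fmt.Printf("%s -> %s\n", l.text, l.href)
			wg.Add(1)
			go visit(l.href, depth-1)
		}
	}

	wg.Add(1)
	go visit(start, depth)
	wg.Wait()
}

func main() {
	// Wire in a real fetcher (the net/http + x/net/html logic from the
	// earlier sketch); this stub only demonstrates the call shape.
	crawlConcurrent("http://somedomain.com", 2, func(page string) ([]link, error) {
		return nil, nil
	})
}
```

Relative hrefs would still need to be resolved before being fetched. Note also that Go's default http.Client already follows redirects (up to ten consecutive ones), which covers the redirect item without extra code.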
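
For the repeated "-v" item, the standard flag package can count occurrences when the flag is registered with flag.Var and a custom flag.Value, as sketched below. This accepts "-v -v -v" but not the collapsed "-vvvv" form, which would need a different argument parser or a manual scan of os.Args.

```go
// A sketch of counting repeated -v flags with the standard flag package.
package main

import (
	"flag"
	"fmt"
	"strconv"
)

// verbosity implements flag.Value; Set is called once per -v occurrence.
type verbosity int

func (v *verbosity) String() string   { return strconv.Itoa(int(*v)) }
func (v *verbosity) Set(string) error { *v++; return nil }
func (v *verbosity) IsBoolFlag() bool { return true } // lets -v appear with no value

func main() {
	var verbose verbosity
	flag.Var(&verbose, "v", "increase debug output (repeatable)")
	flag.Parse()
	fmt.Println("verbosity level:", int(verbose))
}
```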