
# gocrawler
`gocrawler` is an API layer for crawling domains. It adds a domain to a worker queue configured with a given depth, so that crawling stops once that depth is reached. Crawling is restricted to the requested domain, since following links into external domains can lead to a near-infinite crawl; for example, when a crawl request is received for https://google.com, any child links outside of google.com are not added back to the task queue.
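
In practice, that restriction comes down to comparing each discovered link's host with the seed's host before enqueueing it. Below is a minimal sketch of such a check; the `sameDomain` helper is illustrative, not gocrawler's actual API, and it deliberately ignores details like subdomains and relative links:

```go
package main

import (
	"fmt"
	"net/url"
)

// sameDomain reports whether link points at the same host as seed.
// Hypothetical helper for illustration only; gocrawler's real check
// may differ (e.g. in how it handles subdomains or relative links).
func sameDomain(seed, link string) bool {
	s, err := url.Parse(seed)
	if err != nil {
		return false
	}
	l, err := url.Parse(link)
	if err != nil {
		return false
	}
	return s.Hostname() == l.Hostname()
}

func main() {
	fmt.Println(sameDomain("https://google.com", "https://google.com/about")) // true: enqueue
	fmt.Println(sameDomain("https://google.com", "https://example.com"))      // false: drop
}
```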
`gocrawler`:

- is concurrency-safe, using goroutines to achieve concurrency
- uses channels to pass references to data between goroutines
- uses channels to achieve throttled concurrency (see the sketch after this list)
- fetches `robots.txt` and adheres to the policies of the robots exclusion standard
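
Throttled concurrency with channels typically means using a buffered channel as a counting semaphore: each worker acquires a slot before fetching and releases it afterwards, so at most N fetches are in flight at once. Here is a minimal sketch of that pattern, assuming a limit of 2; the names and the limit are illustrative, not taken from gocrawler's internals:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	urls := []string{"/a", "/b", "/c", "/d", "/e"}

	// Buffered channel used as a semaphore: at most 2 fetches in flight.
	sem := make(chan struct{}, 2)
	var wg sync.WaitGroup

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks while 2 are busy)
			defer func() { <-sem }() // release the slot when done
			fmt.Println("fetching", u)
			time.Sleep(100 * time.Millisecond) // stand-in for the HTTP fetch
		}(u)
	}
	wg.Wait()
}
```

The buffer size is what caps concurrency: the send `sem <- struct{}{}` blocks whenever the buffer is full, and receiving from `sem` frees a slot for the next goroutine.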
## Getting started
### Pre-requisites
This README is written for macOS and mostly applies to Linux as well. On Windows, these instructions may vary significantly; I cannot say to what extent, as I do not have a Windows machine to test them.
- Go is needed to build `gocrawler`. Steps to install Go can be found here.
- GNU Make. macOS ships with `make`; if you are on a different OS, please consult this link for installation.
### Quick Start
Build and run `gocrawler`. All instructions below assume that you are in the `gocrawler` directory and have the pre-requisites installed.

```sh
# run tests and build the binary
make
```
Now let's start `gocrawler`:

```sh
./gocrawler -a 127.0.0.1 -p 8080
```
Accessing help is just an argument away:

```sh
./gocrawler -h
```
## API Docs

API docs are available at http://127.0.0.1:8080/docs, assuming that you have started `gocrawler` with the flags `-a 127.0.0.1 -p 8080`.
## Testing

To run the tests:

```sh
make test
```