# spidy

A simple web crawler written in Go.
## Description

This tool crawls through the links on any given page and lists the URLs it finds.
## Dependencies and Building

It uses [goquery](https://github.com/PuerkitoBio/goquery) to parse the HTML page and extract the links from it. There are no other external dependencies for the application.
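For illustration, here is a minimal sketch of how goquery can pull the anchor links out of a page. The `fetchLinks` helper is hypothetical, not spidy's actual function:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// fetchLinks is a hypothetical helper: it downloads a page and returns the
// href values of all anchor tags, roughly how a goquery-based crawler works.
func fetchLinks(pageURL string) ([]string, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			links = append(links, href)
		}
	})
	return links, nil
}

func main() {
	links, err := fetchLinks("https://xkcd.com")
	if err != nil {
		log.Fatal(err)
	}
	for _, link := range links {
		fmt.Println(link)
	}
}
```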
You can run `make dep` to install the dependencies.
Note: If you want to run `make lint` you need to have golint installed.
### Building

To build a binary for Mac (Darwin) you can simply run `make build`. But if you want to build the binary for Linux, run the command below on a Linux instance.
```
go build -i -o spidy
```
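Alternatively (this is not part of the current Makefile, just an assumption that the standard Go toolchain is available), Go's cross-compilation support should let you produce a Linux binary from a Mac:

```
GOOS=linux GOARCH=amd64 go build -o spidy
```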
## Running

Simply run the application with the URL you want to crawl as the first argument.

```
./spidy https://xkcd.com
```
It will print all the links which belong to the same site. So in the example above, all the links under the https://xkcd.com domain will be listed.
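The same-site check can be illustrated with a small sketch using the standard `net/url` package. The `sameSite` helper below is hypothetical and only compares hosts, which roughly matches the behaviour described above:

```go
package main

import (
	"fmt"
	"net/url"
)

// sameSite reports whether link points to the same host as start.
// Relative links have an empty host and are treated as same-site.
func sameSite(start, link string) bool {
	s, err := url.Parse(start)
	if err != nil {
		return false
	}
	l, err := url.Parse(link)
	if err != nil {
		return false
	}
	return l.Host == "" || l.Host == s.Host
}

func main() {
	fmt.Println(sameSite("https://xkcd.com", "https://xkcd.com/about/")) // true
	fmt.Println(sameSite("https://xkcd.com", "https://example.com/"))    // false
}
```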
## Considerations and Limitations

As of now, it doesn't limit the concurrency, but it is a good idea to limit the concurrency to the number of CPU cores available on the machine.

Also, while testing the application I quickly found that crawling through the links recursively can take hours or never finish, so it might be a great idea to limit the layers/depth of crawling.
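As a rough illustration of both points, the sketch below caps in-flight fetches at the number of CPU cores with a buffered channel and stops recursing past a maximum depth. It is an assumption about how this could be structured, not spidy's actual implementation; `fetch` stands in for a goquery-based link extractor such as the earlier `fetchLinks` sketch:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// crawl visits pages starting from startURL, following links up to maxDepth
// levels deep, while keeping the number of concurrent fetches at the number
// of CPU cores. fetch is any function that returns the links on a page.
func crawl(startURL string, maxDepth int, fetch func(string) ([]string, error)) {
	sem := make(chan struct{}, runtime.NumCPU()) // one slot per CPU core
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		visited = map[string]bool{}
	)

	var visit func(pageURL string, depth int)
	visit = func(pageURL string, depth int) {
		defer wg.Done()
		if depth > maxDepth {
			return
		}

		mu.Lock()
		seen := visited[pageURL]
		visited[pageURL] = true
		mu.Unlock()
		if seen {
			return
		}

		sem <- struct{}{} // acquire a concurrency slot
		links, err := fetch(pageURL)
		<-sem // release the slot
		if err != nil {
			return
		}

		fmt.Println(pageURL)
		for _, link := range links {
			wg.Add(1)
			go visit(link, depth+1)
		}
	}

	wg.Add(1)
	go visit(startURL, 0)
	wg.Wait()
}

func main() {
	// Placeholder fetcher for illustration; a real run would plug in the
	// goquery-based fetchLinks from the earlier sketch.
	fetch := func(string) ([]string, error) { return nil, nil }
	crawl("https://xkcd.com", 2, fetch)
}
```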
## Planned Enhancements
- Sanitise the user input and reject incorrect starting URLs.
- Limit the concurrency of the process.
- Set the depth until which the crawling should be done.
- Modify the result struct so that a sitemap of some kind can be printed.
- Add more unit tests.
- Set up a CI/CD system which runs some validations before merging.