
Dist-crawler

A proposal for a distributed web crawler built on top of an RPC communication protocol.

What is it, really?

Well, I'm learning Go. So, instead of building a CRUD application, I decided to build something I had in mind: a dead-simple web crawler that works on a Master -> Node architecture. It's still in development, so I have a few remarks on it:

  • It supports only a single node for now. In the future, the crawler should accept more than one node connection to enable parallel processing and load balancing;
  • It's not yet using the concurrency features that Go has to offer;
  • It uses plain RPC communication. In the future I might switch to gRPC to make the most of the enhanced version of the protocol (a sketch of the current RPC shape follows this list);
  • It should sanitize the URLs extracted from the pages, since they currently come with lots of trash (assets, mailto: links, etc.); a sketch of what that could look like also follows this list;
  • It might get a database to store indexed pages and their contents; think about storing the whole page or slicing it up as you see fit;
  • Someday it should have a UI application that can detect running instances and show their status.
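
To make the Master -> Node idea concrete, here is a minimal sketch of what the RPC shape could look like with the standard library's net/rpc. All names here (Crawler, CrawlArgs, CrawlReply) are hypothetical and only illustrate the idea; the port number comes from the usage example below, and the real types live under cmd and pkg in this repo.

// Node side: the RPC service a node could expose to the master.
package main

import (
	"log"
	"net"
	"net/rpc"
)

// CrawlArgs is what the master sends: the URL a node should fetch.
type CrawlArgs struct {
	URL string
}

// CrawlReply is what the node sends back: the links it extracted.
type CrawlReply struct {
	Links []string
}

// Crawler is the service the node registers with the RPC server.
type Crawler struct{}

// Crawl would fetch args.URL and fill reply.Links with extracted URLs.
func (c *Crawler) Crawl(args *CrawlArgs, reply *CrawlReply) error {
	reply.Links = []string{} // fetching and parsing omitted in this sketch
	return nil
}

func main() {
	// Register the service and accept connections from the master.
	rpc.Register(new(Crawler))
	ln, err := net.Listen("tcp", ":13372")
	if err != nil {
		log.Fatal(err)
	}
	rpc.Accept(ln) // blocks, serving each incoming connection
}

The master side would then boil down to a synchronous call such as:

client, err := rpc.Dial("tcp", "localhost:13372")
if err != nil {
	log.Fatal(err)
}
var reply CrawlReply
if err := client.Call("Crawler.Crawl", &CrawlArgs{URL: "http://damazio.dev"}, &reply); err != nil {
	log.Fatal(err)
}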

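On the sanitization remark: here is a rough sketch, not in the repo, of how the extracted URLs could be filtered with net/url. The helper name and the extension list are made up for illustration.

package sanitize

import (
	"net/url"
	"strings"
)

// assetExts lists extensions we'd treat as assets rather than pages.
var assetExts = []string{".css", ".js", ".png", ".jpg", ".gif", ".svg", ".ico"}

// Clean resolves href against base (the page the link was found on)
// and reports whether it is a crawlable page URL.
func Clean(base *url.URL, href string) (string, bool) {
	u, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	u = base.ResolveReference(u)
	if u.Scheme != "http" && u.Scheme != "https" { // drops mailto:, javascript:, etc.
		return "", false
	}
	lower := strings.ToLower(u.Path)
	for _, ext := range assetExts {
		if strings.HasSuffix(lower, ext) {
			return "", false
		}
	}
	u.Fragment = "" // #anchors point at the same page
	return u.String(), true
}
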
How to use

I know, this thing doesn't have tests, since it's my little toy, a breakable one. But if you want to venture into it, be sure to have Go 1.14, since that's the version I used to build it.

# Download the project
$ go get github.com/carlosdamazio/dist-crawler

# Build and start the node agent
$ go build -o dist-crawler ./cmd/dist-crawler/main.go
$ ./dist-crawler node

# On another instance, run the master
$ ./dist-crawler master --nodesAddr=localhost:13372 http://damazio.dev
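
Here the master reaches the node over RPC at localhost:13372 (the node's listening port, judging by the example) and hands it http://damazio.dev as the seed URL to crawl.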
