scrp

module
v0.0.0-...-26bc80d Latest
Published: Dec 20, 2019 License: Apache-2.0

README

What

A fully resumable, horizontally (infinitely) scalable web scraper in Go. Think thousands of machines scraping sites in a distributed way. Based on Docker Swarm, Cassandra, Traefik, colly, gRPC, and my other boilerplate.
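To make "resumable" and "horizontally scalable" concrete, here is a minimal sketch in Go. It is not taken from this repository: the names store, shardFor and run are hypothetical, the in-memory map only stands in for shared durable state (the stack above names Cassandra), and the fetch function is a placeholder for the real scraping call. The idea is that every worker hashes a URL the same way to decide who owns it, and a URL is marked done only after it has been fetched, so a restarted worker skips finished work.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// store is a stand-in for shared, durable state. It is an assumption made
// for illustration; a real deployment would keep this in a database.
type store struct {
	done map[string]bool
}

func newStore() *store { return &store{done: make(map[string]bool)} }

// shardFor maps a URL to one of n workers using a stable hash, so every
// machine agrees on which worker owns which URL.
func shardFor(url string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(url))
	return int(h.Sum32() % uint32(n))
}

// run fetches the URLs owned by worker id, skips those already recorded as
// done, and returns how many URLs it fetched during this call.
func run(s *store, urls []string, id, n int, fetch func(string) error) int {
	fetched := 0
	for _, u := range urls {
		if shardFor(u, n) != id || s.done[u] {
			continue
		}
		if err := fetch(u); err != nil {
			continue // left unmarked, so a later run retries it
		}
		s.done[u] = true
		fetched++
	}
	return fetched
}

func main() {
	urls := []string{
		"https://en.wikipedia.org/wiki/List_of_HTTP_status_codes",
		"https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol",
		"https://en.wikipedia.org/wiki/Web_scraping",
	}
	s := newStore()
	fetch := func(u string) error {
		fmt.Println("fetch", u) // placeholder for the real scrape
		return nil
	}

	const workers = 3
	for id := 0; id < workers; id++ {
		run(s, urls, id, workers, fetch)
	}
	// A second pass simulates a restart: completed URLs are not fetched again.
	for id := 0; id < workers; id++ {
		fmt.Println("worker", id, "refetched", run(s, urls, id, workers, fetch))
	}
}
```

Adding workers in this sketch only changes n. A production system would more likely use consistent hashing or a shared queue so that resizing the cluster does not reassign most URLs.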

Note

You could probably just use colly, especially if you don't care about scalability, or use a shell script (Example).

Why

I built this to distribute scraping across multiple servers so that it goes undetected. I could have used proxies, but I wanted to reuse the code for other distributed apps.

Instructions

Run

docker-compose up

Then, inside the scrp container (for example docker exec -it 045 bash, where 045 stands for the ID of your container), run gcli to send a scrape command to the service:

/app/scrp/gcli https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Dependencies

gRPC SSL Certificate (https://docs.traefik.io/v2.0/user-guides/grpc/)

To secure the gRPC server, we generate a self-signed certificate for each service URL:

openssl req -new -x509 -sha256 -newkey rsa:2048 -nodes -keyout backend.key -days 365 -out backend.cert -subj '/CN=backend.local'

openssl req -new -x509 -sha256 -newkey rsa:2048 -nodes -keyout frontend.key -days 365 -out frontend.cert -subj '/CN=frontend.local'

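Because -subj is passed, openssl fills in the subject without prompting. If you omit -subj, it prompts for the certificate fields, and the important answer is the Common Name, which must match the service hostname: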

Common Name (e.g. server FQDN or YOUR name) []: backend.local / frontend.local
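For cases where openssl is not available, the same kind of certificate can be produced with the Go standard library. This is a sketch under stated assumptions, not part of this repository: selfSigned and commonNameOf are hypothetical helper names, and the DNSNames entry is an addition that the openssl commands above do not make (Go's TLS client checks the subject alternative name rather than the Common Name).

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// selfSigned returns a PEM-encoded certificate and private key for the given
// Common Name: RSA 2048, SHA-256 signature, valid for 365 days, key unencrypted.
func selfSigned(commonName string) (certPEM, keyPEM []byte, err error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return nil, nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber:          serial,
		Subject:               pkix.Name{CommonName: commonName},
		DNSNames:              []string{commonName}, // addition: not set by the openssl commands
		NotBefore:             now,
		NotAfter:              now.AddDate(0, 0, 365),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
		SignatureAlgorithm:    x509.SHA256WithRSA,
	}
	// Template and parent are the same certificate, which makes it self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalPKCS8PrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// commonNameOf parses a PEM certificate and returns its subject Common Name.
func commonNameOf(certPEM []byte) (string, error) {
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return "", fmt.Errorf("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return "", err
	}
	return cert.Subject.CommonName, nil
}

func main() {
	for _, cn := range []string{"backend.local", "frontend.local"} {
		certPEM, _, err := selfSigned(cn)
		if err != nil {
			panic(err)
		}
		got, err := commonNameOf(certPEM)
		if err != nil {
			panic(err)
		}
		fmt.Println("generated certificate with CN", got)
	}
}
```

To use the output with the setup above, write certPEM and keyPEM to the same file names as the openssl commands (backend.cert, backend.key, and so on).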

Thanks

Cheers to the engineers of Cassandra, colly, gRPC, Consul, Traefik and protobuf, to name a few.

Directories

Path Synopsis
src
proto
Package proto is a generated protocol buffer package.
service
Example
