Yinyo: A wonderfully simple API-driven service to reliably execute many long-running scrapers in a super scalable way
- Easily run as many scrapers as you like across a cluster of machines without having to sweat the details. Powered by Kubernetes.
- Use the language and libraries you love for writing scrapers. Supports Python, JavaScript, Ruby, PHP and Perl via Heroku Buildpacks.
- Supports many different use cases through a simple yet flexible API that can operate synchronously or asynchronously.
- Made specifically for developers of scraper systems, be they open source or commercial. No chance of vendor lock-in because it's open source and Apache licensed.
Who is this README for?
This README is focused on getting developers of the core system up and running. It does not yet include
a guide for people who are just interested in being users of the API.
Development: Guide to getting up and running quickly
Main dependencies
- Minikube
- Helm >=3.0
  - nb: Helm is still releasing updates to 2.x; be sure to install the latest 3.x, not just the latest release
- Skaffold
- Go 1.13
- Yinyo's web interface needs to be accessible on http://localhost:8080/. If you have something already listening on this port, you won't get any errors, but you won't be able to connect to Yinyo to start a scraper. You'll need to free that port.
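Because a busy port 8080 produces no error, it can help to check it before starting anything. The following is a minimal sketch, not part of the Yinyo tooling: `check_port` is a hypothetical helper name, it relies on bash's `/dev/tcp` redirection (so it needs bash, not plain sh), and it only probes the IPv4 loopback address.

```shell
#!/usr/bin/env bash
# Report whether something already accepts TCP connections on a local port.
# check_port is a hypothetical helper, not something shipped with Yinyo.
check_port() {
  local port="$1"
  # Opening /dev/tcp/HOST/PORT succeeds only if a listener accepts the connection.
  if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
    echo "Port ${port} is in use"
  else
    echo "Port ${port} looks free"
  fi
}

check_port 8080
```

Run this before starting skaffold. Once Yinyo itself is up, the port is expected to show as in use.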
The main bit
First, follow the links to install the main dependencies
Start Minikube if you haven't already
make minikube
Let helm know where to find some of the development dependencies
helm repo add stable https://charts.helm.sh/stable
helm repo add bitnami https://charts.bitnami.com/bitnami
Run skaffold. This will build all the bits and pieces and deploy things to your local Kubernetes for you. The first time it builds everything, it takes a few minutes. After that, when you make any changes to the code, it does everything much faster.
make skaffold
Leave skaffold running and open a new terminal window.
Now compile the binary that lets you run a scraper and install it into your GOPATH
make install
Now you're ready to run your first scraper. The first time you run this it will take a little while.
yinyo test/scrapers/test-python --output data.sqlite
Now, if you run the same scraper again it should run significantly faster.
yinyo test/scrapers/test-python --output data.sqlite
Getting the website running locally
Dependencies
There are some extra dependencies required for building the website and associated API documentation.
Running a local development server for the website
Do this after you've installed the dependencies (above):
make website
Then point your web browser at http://localhost:1313.
The custom herokuish docker image
The project currently depends on a custom version of the herokuish Docker image, mlandauer/herokuish:for-morph-ng, which is built from the GitHub repo mlandauer/herokuish and pushed to Docker Hub manually.
There is an open pull request to try to get the bug fix in our modified version merged upstream.
If this PR doesn't get merged we could use a workaround used by Dokku.
Notes for debugging and testing
To run the tests
From the top level directory:
make test
To see what's on the blob storage (Minio)
Point your web browser at http://localhost:9000. Log in with the credentials in the file configs/secrets-minio.env.
To see what Kubernetes is doing
make dashboard
You'll want to look in the "default" and "yinyo-runs" namespaces.
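If you prefer the terminal to the dashboard, the same two namespaces can be inspected with kubectl. This is a minimal sketch assuming kubectl is pointed at your Minikube cluster; `list_yinyo_pods` is a hypothetical helper name.

```shell
# List the pods in each namespace mentioned above.
# list_yinyo_pods is a hypothetical helper, not part of the Yinyo Makefile.
# Usage: list_yinyo_pods
list_yinyo_pods() {
  for ns in default yinyo-runs; do
    echo "== namespace: ${ns} =="
    kubectl get pods --namespace "${ns}"
  done
}
```

Call `list_yinyo_pods` while a scraper is running to see which pods exist in each namespace.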
Accessing Redis
> kubectl exec -it redis-0 -- sh
/data # redis-cli
127.0.0.1:6379> auth changeme123
OK
127.0.0.1:6379> ping
PONG
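For a quick health check you may not need an interactive session. The sketch below sends a single PING in one command, assuming the pod name and development password from the session above; `redis_ping` is a hypothetical helper name. Passing the password with `-a` exposes it on the command line, which is only reasonable for a local development setup.

```shell
# Send one PING to the development Redis without opening a shell in the pod.
# redis_ping is a hypothetical helper; pod name and password are the
# development values from the interactive session above.
# Usage: redis_ping
redis_ping() {
  kubectl exec redis-0 -- redis-cli -a changeme123 ping
}
```

If Redis is healthy this should answer the same way as the interactive `ping` above.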
Testing callback URLs
Use webhook.site to see calls to a specific URL in real time. Very handy.
You can run the test scraper and get the events directed to webhook.site. For example:
yinyo test/scrapers/test-python --output data.sqlite --callback https://webhook.site/uuid-specific-to-you
Reclaiming disk space in Minikube
Sometimes, after a while of testing and debugging, the Minikube VM runs out of disk space. You'll see this either as Kubernetes refusing to run anything because the node is "tainted" or as Minio refusing to do anything because it doesn't have enough space. Luckily there is an easy way to clear space.
minikube ssh
docker system prune
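The two steps can also be run from the host in one go. This is a minimal sketch assuming your Minikube version accepts a command string after `minikube ssh`; `reclaim_minikube_disk` is a hypothetical helper name. It reports Docker's disk usage before and after, and `--force` skips the confirmation prompt, so only use it when you are sure you want to prune.

```shell
# Show Docker disk usage inside the Minikube VM, prune, then show it again.
# reclaim_minikube_disk is a hypothetical helper, not part of the Yinyo Makefile.
# Usage: reclaim_minikube_disk
reclaim_minikube_disk() {
  minikube ssh "docker system df"
  # --force skips the interactive "Are you sure?" prompt
  minikube ssh "docker system prune --force"
  minikube ssh "docker system df"
}
```

Comparing the two `docker system df` reports shows how much space the prune actually recovered.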
Continuous integration
We're using GitHub Actions to run the tests (make test), do some linting, measure coverage and build binaries of the yinyo client automatically on every push. Release binaries are also built automatically whenever a release is made on GitHub.