Collects and parses ads.txt
A Go program that scrapes sites for ads.txt and stores the significant details in a PostgreSQL database.
Give it a CSV file listing the sites to check, one rank,site.url pair per line. I use the top 1M sites from https://tranco-list.eu/top-1m.csv.zip. For demonstration, a smaller top-1k.csv is supplied.
The scraper first tries HTTPS and falls back to HTTP if the connection fails. The User-Agent is spoofed. The timeout is 5 seconds, set by the crawlerTimeout constant.
The user who runs this program must have a PostgreSQL role that allows SELECT, INSERT, and DELETE queries on the working database. The program connects to the database via a Unix socket. Adjust the dbConnectionString constant if you use TCP, a different database name, or a different authentication method.
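For orientation, a connection string in the libpq keyword/value style might look like the output of the sketch below. The README does not say which driver or exact string main.go uses, so connString, the socket directory /var/run/postgresql and the user name crawler are assumptions. PostgreSQL clients that follow libpq conventions treat a host value starting with "/" as a Unix socket directory.

```go
package main

import (
	"fmt"
	"strings"
)

// connString builds a keyword/value connection string.
// A host beginning with "/" selects a Unix socket directory;
// any other host selects TCP. An empty user is left out so the
// client falls back to its default (usually the OS user name).
func connString(host, dbname, user string) string {
	parts := []string{"host=" + host, "dbname=" + dbname}
	if user != "" {
		parts = append(parts, "user="+user)
	}
	return strings.Join(parts, " ")
}

func main() {
	// Unix socket, hypothetical socket directory.
	fmt.Println(connString("/var/run/postgresql", "adstxt", ""))
	// TCP, hypothetical user name.
	fmt.Println(connString("localhost", "adstxt", "crawler"))
}
```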
The PostgreSQL database is named adstxt. Create it with
sudo -u postgres psql -c 'CREATE DATABASE adstxt'
Create the tables in it with the mktables.sql script.
psql -d adstxt < mktables.sql
Run the program with
go run main.go top-1k.csv
or build the executable first
go build main.go
./main top-1k.csv
By default, 64 goroutines fetch ads.txt from sites concurrently. On a fast machine with a fast connection, this number can be increased with an optional argument after the file name.
go run main.go top-1k.csv 1000
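A fixed pool of goroutines like this is commonly built around a shared channel of jobs. The sketch below shows that pattern under stated assumptions: crawl, defaultWorkers and the fetch callback are hypothetical names, and the real program may organize its workers differently. The fetch function is passed in so the example runs without network access.

```go
package main

import (
	"fmt"
	"sync"
)

// defaultWorkers is the default goroutine count stated in the text.
const defaultWorkers = 64

// crawl calls fetch for every domain using a fixed number of goroutines
// and returns the results keyed by domain.
func crawl(domains []string, workers int, fetch func(string) string) map[string]string {
	if workers < 1 {
		workers = defaultWorkers
	}
	jobs := make(chan string)
	results := make(map[string]string, len(domains))
	var (
		mu sync.Mutex // guards results
		wg sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for d := range jobs {
				body := fetch(d)
				mu.Lock()
				results[d] = body
				mu.Unlock()
			}
		}()
	}
	for _, d := range domains {
		jobs <- d
	}
	close(jobs) // lets the workers' range loops end
	wg.Wait()
	return results
}

func main() {
	fake := func(domain string) string { return "ads.txt of " + domain }
	out := crawl([]string{"a.example", "b.example", "c.example"}, 2, fake)
	fmt.Println(len(out), "sites fetched")
}
```

Since the workers spend most of their time waiting on the network, the count can usefully exceed the number of CPU cores, which is why values like 1000 are reasonable on a fast connection.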
The third argument is a continuation flag. If the previous scraping run did not finish, you can resume it in the next run by specifying the flag c (continue). Because the arguments are positional, the goroutine count becomes mandatory when the continuation flag is used.
go run main.go top-1k.csv 64 c
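The positional arguments described above could be handled along these lines. This is a sketch, not the parsing code from main.go: the options type and parseArgs are hypothetical, and rejecting any third argument other than c is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// options holds the three positional arguments.
type options struct {
	File     string // CSV file with rank,site.url rows
	Workers  int    // number of goroutines, 64 if omitted
	Continue bool   // true when the third argument is "c"
}

// parseArgs interprets the arguments that follow the program name.
func parseArgs(args []string) (options, error) {
	opts := options{Workers: 64}
	if len(args) < 1 {
		return opts, errors.New("usage: main <sites.csv> [goroutines] [c]")
	}
	opts.File = args[0]
	if len(args) >= 2 {
		n, err := strconv.Atoi(args[1])
		if err != nil || n < 1 {
			return opts, fmt.Errorf("invalid goroutine count %q", args[1])
		}
		opts.Workers = n
	}
	if len(args) >= 3 {
		if args[2] != "c" {
			return opts, fmt.Errorf("unknown flag %q, expected c", args[2])
		}
		opts.Continue = true
	}
	return opts, nil
}

func main() {
	// In a real program this slice would be os.Args[1:].
	opts, err := parseArgs([]string{"top-1k.csv", "64", "c"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", opts)
}
```

With this layout, passing c as the second argument fails the integer conversion, which makes the requirement to give the goroutine count explicit.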