Statistics Pipeline Service
This repository contains code that processes NDT data and provides daily
aggregate metrics for standard global geographies and some national geographies.
The resulting aggregations are made available in JSON format for use by other
applications.
The stats-pipeline service is written in Go, runs on GKE, and generates and
updates daily aggregate statistics. Access is provided in public BigQuery tables
and in per-year JSON-formatted files hosted on GCS.
Documentation Provided for the Statistics Pipeline Service
General Recommendations for All Aggregations of NDT data
In general, our recommendations for research aggregating NDT data are:
- Don't oversimplify
- Aggregate by ASN in addition to time/date and location
- Be aware of, and illustrate, multimodal distributions
- Use histograms and logarithmic scales
- Take into account, and compensate for, client bias and population drift
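
As a rough illustration of the histogram and logarithmic-scale recommendations above, the following Go sketch counts samples into logarithmically spaced buckets. It is not taken from the stats-pipeline code: the function names (`logEdges`, `histogram`), the bucket boundaries, and the sample values are all hypothetical and chosen only for the example.

```go
package main

import "fmt"

// logEdges returns n+1 bucket edges starting at min, where each edge is
// the previous edge multiplied by factor (for example, factor 10 gives
// one bucket per decade). Hypothetical helper for illustration only.
func logEdges(min, factor float64, n int) []float64 {
	edges := make([]float64, n+1)
	edges[0] = min
	for i := 1; i <= n; i++ {
		edges[i] = edges[i-1] * factor
	}
	return edges
}

// histogram counts how many samples fall into each half-open bucket
// [edges[i], edges[i+1]). Samples outside the overall range are skipped.
func histogram(samples, edges []float64) []int {
	counts := make([]int, len(edges)-1)
	for _, s := range samples {
		for i := range counts {
			if s >= edges[i] && s < edges[i+1] {
				counts[i]++
				break
			}
		}
	}
	return counts
}

func main() {
	// Made-up throughput samples in Mbit/s, not real NDT measurements.
	samples := []float64{2.5, 7, 45, 60, 80, 300, 950}

	// Three decade-wide buckets: [1,10), [10,100), [100,1000).
	edges := logEdges(1, 10, 3)
	counts := histogram(samples, edges)

	for i, c := range counts {
		fmt.Printf("[%g, %g): %d\n", edges[i], edges[i+1], c)
	}
}
```

Reporting per-bucket counts like this, rather than a single mean or median, keeps a multimodal distribution visible. In practice the counts would also be grouped by ASN, date, and location, as recommended above.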
Roadmap
Below we list additional features, methods, geographies, etc. that may be
considered for future versioned releases of stats-pipeline.
Geographies
- US Zip Codes, US Congressional Districts, Block Groups, Blocks
- histogram_daily_stats.csv - Same data as the JSON, but in CSV. Useful for importing into a spreadsheet.
- histogram_daily_stats.sql - A SQL query that returns the same rows as the corresponding .json and .csv files. Useful for verifying the exported data against the source and for tweaking the query to fit different use cases.
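
For readers who would rather load the CSV export in Go than in a spreadsheet, here is a minimal sketch using the standard `encoding/csv` package. The helper name `parseRows` and the column names in the sample data (`date`, `bucket_min`, `bucket_max`, `dl_frac`) are placeholders invented for this example. They are not the actual schema of histogram_daily_stats.csv, so check the header row of a real file before relying on any field.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseRows reads CSV text whose first line is a header and returns one
// map per data row, keyed by column name. Hypothetical helper.
func parseRows(data string) ([]map[string]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	records, err := r.ReadAll() // errors if rows have inconsistent field counts
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	header := records[0]
	rows := make([]map[string]string, 0, len(records)-1)
	for _, rec := range records[1:] {
		row := make(map[string]string, len(header))
		for i, name := range header {
			row[name] = rec[i]
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	// Placeholder data: column names and values are assumptions,
	// not the real export format.
	sample := "date,bucket_min,bucket_max,dl_frac\n" +
		"2020-01-01,1,10,0.25\n" +
		"2020-01-01,10,100,0.75\n"

	rows, err := parseRows(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, row := range rows {
		fmt.Println(row["date"], row["bucket_min"], row["bucket_max"], row["dl_frac"])
	}
}
```

Keying each row by its header name, rather than by column position, means the code keeps working if the export adds or reorders columns.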