Predator
Predator (Profiler and Auditor) is a tool that provides statistical description and data quality checking of downstream data.
Predator consists of two components:
- Profile : collects basic table and column metrics and calculates data quality metrics.
- Audit : compares the data quality metrics against tolerance rules.
Requirements
- Go v1.18
- Postgres instance
  docker run -d -p 127.0.0.1:5432:5432/tcp --name predator-abcd -e POSTGRES_PASSWORD=secretpassword -e POSTGRES_DB=predator -e POSTGRES_USER=predator postgres
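  Optionally, you can verify that the instance accepts connections using the psql client bundled in the container:
  docker exec predator-abcd psql -U predator -d predator -c 'SELECT 1;'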
- Tolerance Store
  - Local directory
    Producing metrics with Profile and checking issues with Audit requires a tolerance specification. Each .yaml file in the local directory represents the tolerance specification for a BigQuery table. This option can be used for local testing: use a local directory as TOLERANCE_STORE_URL, for example:
    example/tolerance
  - Google Cloud Storage
    A Google Cloud Storage bucket is the preferred file-based tolerance spec storage for a Predator service, especially when combined with a git repository so that multiple users can collaborate on the tolerance spec files. Please refer to the Google Cloud documentation for creating a GCS bucket. The bucket can then be used as the tolerance storage configuration in TOLERANCE_STORE_URL, for example:
    gs://your-bucket/audit-spec
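    If you have the Cloud SDK installed, a bucket can also be created from the command line; a minimal example (the bucket name is a placeholder):
    gsutil mb gs://your-bucket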
- Unique Constraint Store (optional)
  A single CSV file providing the unique constraint column of each resource, used to calculate the unique count and duplication percentage metrics. This is an alternative for when the unique constraint column is not specified in the tolerance specification of each table. Please see the documentation below for details of the CSV content format.
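  As an illustration only, such a CSV might pair each table with its unique constraint column. The exact column layout shown here is an assumption; refer to the CSV format documentation for the authoritative schema:
  sample-project.sample_dataset.sample_table,field_1
  sample-project.sample_dataset.another_sample_table,sample_id_field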
- Publisher
  Predator publishes profile and audit data to Kafka for realtime data/event processing.
- Google Cloud credentials
  Google Cloud credentials are needed for Predator to access the BigQuery API.
  - Google Cloud personal account credentials
    With this credential type you can use your own Google Workspace email to access Google Cloud APIs, including the BigQuery API. It is the most suitable option for local testing/exploration purposes.
  - Application Default Credentials
    This credential type is needed to deploy Predator as a service, especially in a non-local environment.
    - Create Google Cloud application credentials
      Please read the Google Cloud documentation on creating application default credentials (ADC).
    - Set the local environment variable
      GOOGLE_APPLICATION_CREDENTIALS=/path/key.json
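      For personal-account credentials, ADC can typically be generated with the gcloud CLI; it writes a key file and prints its path, which the variable above can point to:
      gcloud auth application-default login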
How to Build
make build
How to Test
make test
How to run predator service
Create .env file
- Copy conf/.env.template to create a .env file
- Put the .env file in the root of the repository
- Set the env variables
Example of a config to run:
PORT=
DB_HOST=localhost
DB_PORT=5432
DB_NAME=predator
DB_USER=predator
DB_PASS=secretpassword
BIGQUERY_PROJECT_ID=sample-project
PROFILE_KAFKA_TOPIC=profile
AUDIT_KAFKA_TOPIC=audit
KAFKA_BROKER=localhost:6668
TOLERANCE_STORE_URL=example/tolerance
UNIQUE_CONSTRAINT_STORE_URL=example/uniqueconstraints.csv
MULTI_TENANCY_ENABLED=true
GIT_AUTH_PRIVATE_KEY_PATH=~/.ssh/private.key
TZ=UTC
Setup DB
Run the DB migration:
./predator migrate -e .env
Note: if any changes are made to the migration files, re-run this command to regenerate the migration resources:
make generate-db-resource
How to Run
./predator start -e .env
How to do Profile and Audit using API Call
Before beginning, decide on the profiling details below.
- URN
  Target table ID.
- Filter (optional)
  Filter expression in SQL syntax. This expression will be applied in the WHERE clause of the profiling query. For example: __PARTITION__ = '2021-01-01'.
- Group (optional)
  The field the result should be grouped by. Can be any field or __PARTITION__.
- Mode
  The profiling mode differentiates how the result will be visualized: complete presents the results as an independent data result, while incremental presents them as part of the results of the same group.
- Audit time
  Timestamp of when the audit happened.
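Putting these details together, a profile request payload looks like the following (the same shape as the curl example in the Local Testing Guide; values are illustrative, and the audit time appears as a CLI flag in the examples below rather than in this payload):
{
  "urn": "sample-project.sample_dataset.sample_table",
  "filter": "__PARTITION__ = '2021-01-01'",
  "group": "__PARTITION__",
  "mode": "complete"
}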
- Create a profile job: POST /v1beta1/profile. Please include the profiling details as the payload.
- Wait until the status becomes completed: call GET /v1beta1/profile/{profile_id} periodically until the status becomes completed.
- Audit the profiled data: POST /v1beta1/profile/{profile_id}/audit. (A scripted version of these steps is sketched below.)
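A minimal sketch of the poll-then-audit flow in shell, assuming the service runs on localhost:5000 (as in the Local Testing Guide) and that the profile response carries its state in a top-level status field (the field name is an assumption; check api/swagger.json for the actual response shape):
# poll until the profile completes, then trigger the audit
# NOTE: the '.status' field name is an assumption for illustration
PROFILE_ID=... # the profile_id from the POST /v1beta1/profile response
until [ "$(curl -s http://localhost:5000/v1beta1/profile/$PROFILE_ID | jq -r '.status')" = "completed" ]; do
  sleep 10
done
curl --request POST http://localhost:5000/v1beta1/profile/$PROFILE_ID/audit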
How to do Profile and Audit using CLI
First, build by running make build
Usage example:
predator profile_audit \
-s http://sample-predator-server \
-u sample-project.sample_dataset.sample_table \
-g "date(sample_timestamp_field)" \
-f "date(sample_timestamp_field) in (\"2020-12-02\",\"2020-12-01\",\"2020-11-30\")" \
-m complete \
-a "2020-12-02T07:00:00.000Z"
Usage example by using Docker:
docker run --rm -e SUB_COMMAND=profile_audit \
-e PREDATOR_URL=http://sample-predator-server \
-e URN=sample-project.sample_dataset.sample_table \
-e GROUP="date(sample_timestamp_field)" \
-e FILTER="__PARTITION__ = \"2020-11-01\"" \
-e MODE=complete \
-e AUDIT_TIME="2020-12-02T07:00:00.000Z" \
predator:latest
Local Testing Guide
Dependencies
When doing local testing, some external dependencies can be replaced with local files and folders. Here is the step-by-step guide for setting up the configuration and running Predator for local testing purposes.
- Tolerance Rules Configuration
  Use the yaml files in example/tolerance.
- Publisher
  For local testing, Apache Kafka is not required. The protobuf-serialised messages will be shown in the console log.
How to do local testing
- checkout the predator repository
- go to the predator repository directory
- build the predator binary by running the make build script
- create a .env file
- set up the postgres database; please follow the details in the Requirements section for a quick setup of the postgres db. Make sure to also run the db migration: ./predator migrate -e .env
- run the predator service: ./predator start -e .env
- prepare the tolerance spec file
- create a Profile job using an API call:
curl --location --request POST 'http://localhost:5000/v1beta1/profile' \
--header 'Content-Type: application/json' \
--data-raw '{
"urn": "sample-project.sample_dataset.sample_table",
"filter": "__PARTITION__ = '2020-03-01'",
"group": "__PARTITION__",
"mode": "complete"
}'
- API call to get the Profile job status and result; poll until the status becomes completed:
curl --location --request GET 'http://localhost:5000/v1beta1/profile/${profile_id}'
- API call to audit and get the result
curl --location --request POST 'http://localhost:5000/v1beta1/profile/${profile_id}/audit'
Register Entity (optional)
Predator provides an upload-tolerance-spec feature for better collaboration among users (using git) and within a multi-entity environment. Each entity can be registered with its own git URL; at upload time, Predator will clone the git repository to find the tolerance specs and upload them to the destination storage, where they are used when profiling and auditing.
- register an entity:
curl --location --request POST 'http://localhost:5000/v1/entity/entity-1' \
--header 'Content-Type: application/json' \
--data-raw '{
"entity_name": "sample-entity-1",
"git_url": "git@sample-url:sample-entity-1.git",
"environment" : "sample-env",
"gcloud_project_ids": [
"entity-1-project-1"
]
}'
Data Quality Spec
Specifying Data Quality Spec
The example below tolerates zero duplication at the table level (calculated over the given uniquefields) and at most 10% null values in field_1:
tableid: "sample-project.sample_dataset.sample_table"
tablemetrics:
  - metricname: "duplication_pct"
    tolerance:
      less_than_eq: 0
    metadata:
      uniquefields:
        - field_1
fields:
  - fieldid: "field_1"
    fieldmetrics:
      - metricname: "nullness_pct"
        tolerance:
          less_than_eq: 10.0
Data Quality Spec storage
Specs are stored in the tolerance store configured via TOLERANCE_STORE_URL: a local directory or a GCS bucket, as described in the Tolerance Store part of the Requirements section.
Upload Data Quality Spec
There are multiple ways to upload a data quality spec to Predator storage; one of them is the POST /v1beta1/spec/upload API. Predator also provides a CLI with the same functionality.
Upload through Predator CLI
usage: predator upload --host=HOST --git-url=GIT-URL [<flags>]
upload spec from git repository to storage
Flags:
--help Show context-sensitive help (also try --help-long and --help-man).
-h, --host=http://sample-predator-server predator server
-g, --git-url=git@sample-url:sample-entity.git url of git, the source of data quality spec
-c, --commit-id="[sample-commit-id]" specific git commit hash, default value will be empty and always upload latest commit
-p, --path-prefix="predator" path to root of predator specs directory, default will be empty
./predator upload \
--host http://sample-predator-server \
--path-prefix predator --git-url git@sample-url:sample-entity-1.git \
--commit-id sample-commit-id
Example of upload through API call, from git repository to tolerance store (optional):
curl --location --request POST 'http://localhost:5000/v1beta1/spec/upload' \
--header 'Content-Type: application/json' \
--data-raw '{
"git_url": "git@sample-url:sample-entity.git",
"commit_id": "sample-commit-id",
"path_prefix": "predator"
}'
API docs
See api/predator.postman_collection.json or api/swagger.json.
Tech Debt
- remove ProfileMetric type and use only Metric type
- remove Meta from MetricSpec and Metric
- better abstraction of QualityMetricProfiler
- better abstraction of BasicMetricProfiler
Monitoring
How to set up monitoring:
This step-by-step tutorial is adapted from the Cortex getting started tutorial.
Prometheus is not required, because it is only used as a metric collector for Cortex; in this setup, stats are pushed from Telegraf to Cortex directly using remote write (a config fragment is sketched at the end of the Telegraf section below).
Cortex
git clone https://github.com/cortexproject/cortex.git
cd cortex
go build ./cmd/cortex
./cortex -config.file=${PREDATOR_REPO_ROOT}/example/monitoring/single-process-config.yaml
Grafana
docker run --rm -d --name=grafana -p 3000:3000 grafana/grafana
In the Grafana UI (username/password admin/admin), add a Prometheus datasource for Cortex (http://host.docker.internal:9009/api/prom).
A dashboard config will be added later; import the dashboard by uploading that file.
Telegraf
cd ~/src
git clone https://github.com/influxdata/telegraf.git
cd ~/src/telegraf
make
./telegraf --config ${PREDATOR_REPO_ROOT}/example/monitoring/telegraf.conf
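The direct remote-write push described above corresponds to Telegraf's HTTP output. A minimal sketch of the relevant fragment, assuming Cortex's default single-process port 9009; the repo's example/monitoring/telegraf.conf is the authoritative config:
# hypothetical telegraf.conf fragment; see example/monitoring/telegraf.conf
[[outputs.http]]
  url = "http://localhost:9009/api/prom/push"
  data_format = "prometheusremotewrite"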