Capillaries
Capillaries is a distributed batch data processing framework.
TL;DR
What Capillaries is and what it is not, with a use case discussion and diagrams
Getting started guide with instructions on how to run a quick Docker-based demo without building any code
Why Capillaries?
| | BEFORE | AFTER |
|---|---|---|
| Data aggregation | SQL joins | Capillaries lookups in Cassandra + Go expressions (scalability, parallel execution) |
| Data filtering | SQL queries, custom code | Go expressions (scalability, maintainability) |
| Data transform | SQL expressions, custom code | Go expressions, Python formulas (parallel execution, maintainability) |
| Intermediate data storage | Files, relational databases | Cassandra keyspaces and tables created on the fly (scalability, maintainability) |
| Workflow execution | Shell scripts, custom code, workflow frameworks | RabbitMQ as the Single Source of Truth + workflow status stored in Cassandra (parallel execution, fault tolerance, incremental computing) |
| Workflow monitoring and interaction | Custom solutions | Capillaries API and Toolbelt utility (transparency, operator validation support) |
| Workflow management | Shell scripts, custom code | Capillaries script file with DAG |
Highlights
Incremental computing
Allows splitting the whole data processing pipeline into separate runs that can be started independently and re-run if needed.
Parallel processing
Splits large data volumes into smaller batches processed in parallel. Executes multiple data processing tasks (DAG nodes) in parallel.
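The batching idea can be illustrated with a minimal Go sketch. This is not Capillaries code: the helper names `splitIntoBatches`, `processInParallel` and `sumBatch` are hypothetical, rows are plain integers held in memory, and every batch runs in a goroutine of a single process, whereas Capillaries distributes batches across processing nodes.

```go
package main

import (
	"fmt"
	"sync"
)

// splitIntoBatches divides rows into consecutive batches of at most batchSize items.
// Hypothetical helper for illustration only.
func splitIntoBatches(rows []int, batchSize int) [][]int {
	if batchSize <= 0 {
		return nil
	}
	var batches [][]int
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		batches = append(batches, rows[start:end])
	}
	return batches
}

// processInParallel runs process on every batch in its own goroutine
// and returns the per-batch results in batch order.
func processInParallel(batches [][]int, process func([]int) int) []int {
	results := make([]int, len(batches))
	var wg sync.WaitGroup
	for i, b := range batches {
		wg.Add(1)
		go func(i int, b []int) {
			defer wg.Done()
			results[i] = process(b) // each goroutine writes only its own slot
		}(i, b)
	}
	wg.Wait()
	return results
}

// sumBatch stands in for real per-batch work.
func sumBatch(batch []int) int {
	total := 0
	for _, v := range batch {
		total += v
	}
	return total
}

func main() {
	rows := []int{1, 2, 3, 4, 5, 6, 7}
	batches := splitIntoBatches(rows, 3)
	fmt.Println(processInParallel(batches, sumBatch)) // prints [6 15 7]
}
```

Because each batch is independent, a failed batch can be re-processed on its own without repeating the others.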
Operator interaction
Allows human data validation for selected data processing stages.
Fault tolerance
Survives temporary underlying database connectivity issues and processing node software and hardware failures.
Works with structured data artifacts
Consumes and produces delimited text files, uses database tables internally. Provides ETL/ELT capabilities. Implements a subset of the relational algebra.
Use scenarios
Capable of processing large amounts of data within SLA time limits, efficiently utilizing powerful computational (hardware, VMs, containers) and storage (Cassandra) resources, with or without human monitoring, validation, or intervention.
Capillaries in depth
(C) 2022 kleines.hertz[at]protonmail.com