Data is a first-class citizen: Line's simple scheduling borrows from Hadoop's philosophy of moving compute to the data instead of the other way around.
Dataset: a git repository
Version: a git commit
Replica: a clone of a Dataset on a specific node (no two replicas of the same Dataset are on the same node)
Checkout: a local (mutable) working copy of a specific DatasetVersion
Replicas eventually converge to the same state; tasks require a replica at a specific version.
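
The definitions above can be sketched as a small data model. This is an illustration only, not Line's actual code: the names `Version`, `Replica`, `ReplicaRegistry`, and `nodes_at` are hypothetical, and the sketch assumes a replica is described by the single commit it currently points at.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    """A Version is a git commit, identified here by its hash."""
    commit: str


@dataclass(frozen=True)
class Replica:
    """A clone of a Dataset on one node, currently at some Version."""
    dataset: str
    node: str
    version: Version


@dataclass
class ReplicaRegistry:
    """Hypothetical registry that allows one replica per (dataset, node)."""
    replicas: dict = field(default_factory=dict)

    def add(self, replica: Replica) -> None:
        key = (replica.dataset, replica.node)
        if key in self.replicas:
            # No two replicas of the same dataset may share a node.
            raise ValueError(
                f"{replica.dataset} already has a replica on {replica.node}"
            )
        self.replicas[key] = replica

    def nodes_at(self, dataset: str, version: Version) -> list:
        """Nodes whose replica of `dataset` is at exactly `version`."""
        return sorted(
            r.node
            for r in self.replicas.values()
            if r.dataset == dataset and r.version == version
        )
```

Because replicas only eventually converge, `nodes_at` may return a subset of the nodes that hold the dataset; a task that needs a given version can only be considered for those nodes.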
Worker: a process that is responsible for:
reporting replica status
reporting alloc status
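
A worker report might look like the following sketch. The field names, the JSON encoding, and the state values are assumptions made for illustration; "alloc" is read here as the allocation of a task to this worker, which the text above does not spell out.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ReplicaStatus:
    dataset: str
    commit: str  # commit the local replica currently points at (assumed field)


@dataclass
class AllocStatus:
    task_id: str
    state: str  # assumed values, e.g. "pending", "running", "finished"


def build_report(node: str, replicas: list, allocs: list) -> str:
    """Serialize one worker report (replica and alloc status) as JSON."""
    return json.dumps(
        {
            "node": node,
            "replicas": [asdict(r) for r in replicas],
            "allocs": [asdict(a) for a in allocs],
        },
        sort_keys=True,
    )
```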
The main purpose of Line is to run a container that takes a specific Dataset Checkout as input and produces one or more versions of the dataset as output.
Task: a planned container that takes N inputs and produces M outputs
Input/Output: a local checkout of a certain dataset version. Inputs are read-only; outputs are writable and are committed when the task finishes.
A task can only run on a node where a replica of the dataset is present and the checkouts for its inputs and outputs can be placed.
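
The placement rule can be sketched as a simple predicate. This is a minimal illustration under stated assumptions, not Line's scheduler: `Checkout`, `Task`, `Node`, and `can_run` are hypothetical names, and "can be placed" is interpreted here as "the node has enough free disk space", which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkout:
    dataset: str
    commit: str
    size_bytes: int  # assumed: disk space the working copy needs


@dataclass
class Task:
    inputs: list   # read-only checkouts
    outputs: list  # writable checkouts, committed when the task finishes


@dataclass
class Node:
    name: str
    replicas: dict  # dataset name -> commit the local replica is at
    free_bytes: int


def can_run(task: Task, node: Node) -> bool:
    """True if every checkout has a matching replica and all checkouts fit."""
    checkouts = task.inputs + task.outputs
    for c in checkouts:
        # The node must hold a replica of the dataset at the required commit.
        if node.replicas.get(c.dataset) != c.commit:
            return False
    # Assumption: placement only depends on free disk space.
    return sum(c.size_bytes for c in checkouts) <= node.free_bytes
```

A scheduler built on this idea would filter nodes with `can_run` and start the container on one that passes, which keeps the compute next to the data.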