pachyderm

v1.10.0-rc5
Published: Mar 10, 2020 License: Apache-2.0 Imports: 0 Imported by: 3

README

Pachyderm: Data Versioning, Data Pipelines, and Data Lineage

Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts which do this in an ad-hoc fashion and you're looking for a way to "productionize" them, Pachyderm can make this easy for you.

Features

  • Containerized: Pachyderm is built on Docker and Kubernetes. Whatever languages or libraries your pipeline needs can run on Pachyderm, which can be deployed on any cloud provider or on-premises.
  • Version Control: Pachyderm version-controls your data as it's processed. You can always ask the system how data has changed, see a diff, and, if something doesn't look right, revert.
  • Provenance (aka data lineage): Pachyderm tracks where data comes from. Pachyderm keeps track of all the code and data that created a result.
  • Parallelization: Pachyderm can efficiently schedule massively parallel workloads.
  • Incremental Processing: Pachyderm understands how your data has changed and is smart enough to only process the new data.

Getting Started

Install Pachyderm locally or deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm:

Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support via:

  • Follow us on Twitter.
  • Join our community Slack channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! You can also check our GitHub issues for things labeled "help-wanted" as a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Join Us

WE'RE HIRING! Love Docker, Go and distributed systems? Learn more about our open positions or email us at jobs@pachyderm.io.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the METRICS environment variable to false in the pachd container.
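
As an illustration of how a Go process could interpret a boolean switch like METRICS, here is a minimal sketch. It is not pachd's actual code: the function name metricsEnabled, the injected lookup function, and the choice to treat an unset or unparsable value as "enabled" are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// metricsEnabled reports whether usage metrics should be sent, based on the
// METRICS environment variable. The lookup function is injected so the logic
// can be exercised without touching the real environment.
//
// Assumption (hypothetical, not taken from pachd): metrics stay enabled when
// the variable is unset or cannot be parsed as a boolean.
func metricsEnabled(lookup func(string) (string, bool)) bool {
	raw, ok := lookup("METRICS")
	if !ok {
		return true // variable not set: keep the default
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return true // unparsable value: fall back to the default
	}
	return enabled
}

func main() {
	fmt.Println("metrics enabled:", metricsEnabled(os.LookupEnv))
}
```

Running the sketch with METRICS=false in the environment reports that metrics are disabled; strconv.ParseBool also accepts spellings such as 0, f, and FALSE.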

Documentation

There is no documentation for this package.

Directories

Path Synopsis
etc
examples
src
client/limit
Package limit provides primitives to limit concurrency.
server/pkg/backoff
Package backoff implements backoff algorithms for retrying operations.
server/pkg/dlock
Package dlock implements a distributed lock on top of etcd.
server/pkg/exec
Package exec runs external commands.
server/pkg/pfsdb
Package pfsdb contains the database schema that PFS uses.
server/pkg/ppsconsts
Package ppsconsts contains constants relevant to PPS that are used across Pachyderm.
server/pkg/ppsdb
Package ppsdb contains the database schema that PPS uses.
server/pkg/ppsutil
Package ppsutil contains utilities for various PPS-related tasks, which are shared by both the PPS API and the worker binary.
server/pkg/serde
Package serde contains Pachyderm-specific data structures for marshalling and unmarshalling Go structs and maps to structured text formats (currently just JSON and YAML).
server/pkg/storage/fileset/tar
Package tar implements access to tar archives.
server/pkg/sync
Package sync provides utility functions similar to `git pull/push` for PFS.
server/pkg/transactiondb
Package transactiondb contains the database schema that Pachyderm transactions use.
server/pkg/watch
Package watch implements better watch semantics on top of etcd.
server/pps/server/githook
Package githook adds support for git-based sources in pipeline specs.
testing/loadtest/split/cmd/pipeline
main implements the user logic run by the "split" loadtest (in loadtest/loadtest.go).
