pachyderm

package module

Version: v1.2.0-RC2 (latest) | Published: Sep 21, 2016 | License: Apache-2.0 | Imports: 9 | Imported by: 0

Repository

github.com/poopoothegorilla/pachyderm

README ¶

Pachyderm: A Containerized Data Lake

Pachyderm is Git for Data Science. We offer complete version control for data and give data scientists the same first-class development tools as software developers. Pachyderm is ideal for building machine learning pipelines and ETL workflows because we version every model and output and track each one directly back to the raw input datasets that created it (a.k.a. provenance).

Pachyderm is built on Docker and Kubernetes. Since everything in Pachyderm is a container, data scientists can use any languages or libraries they want (e.g. R, Python, OpenCV, etc) without any additional infrastructure overhead.

Getting Started

Install Pachyderm locally or deploy Pachyderm on AWS/GCE (http://pachyderm.readthedocs.io/development/deploying_on_the_cloud.html) in about 5 minutes. You can also refer to our complete developer docs to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

What is Pachyderm?

Pachyderm is a software platform that supports the storage and processing of large data sets. Pachyderm is inspired by the Hadoop ecosystem but shares no code with it. Instead, we leverage the container ecosystem to provide the broad functionality of Hadoop with the ease of use of Docker.

Pachyderm was designed to enable everything from "weekend data science" projects to large-scale data collaboration, just like Git does for code.

What's new about Pachyderm? (How is it different from Hadoop?)

There are two bold new ideas in Pachyderm:

Containers as the core processing primitive
Version Control for data

These ideas lead directly to a system that's much more powerful, flexible and easy to use.

To process data, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it's all just going in a container! Pachyderm will take your container and inject data into it by way of a FUSE volume. We'll then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (Example: distributed grep).

Pachyderm also version controls all data using a commit-based distributed filesystem (PFS), similar to what Git does with code. Version control for data has far-reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and diffs, collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!

Version control is also very synergistic with our containerized processing engine. Pachyderm understands how your data changes and thus, as new data is ingested, can run your workload on the diff of the data rather than the whole thing. This means that there's no difference between a batched job and a streaming job: the same code will work for both!

Community

Keep up to date and get Pachyderm support via:

Twitter
Join our mailing list
Join our community Slack Channel to get help from the Pachyderm team and other users

Contributing

To get started, sign the Contributor License Agreement.

Send us PRs; we would love to see what you do! You can also check our GH issues for things labeled "noob-friendly" as a good place to start. We're sometimes bad about keeping that label up-to-date, so if you don't see any, just let us know.

Join Us

WE'RE HIRING! Love Docker, Go and distributed systems? Learn more about our team and email us at jobs@pachyderm.io.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the env variable METRICS to false in the pachd container.

Documentation ¶

Index ¶

func Asset(name string) ([]byte, error)
func AssetDir(name string) ([]string, error)
func AssetInfo(name string) (os.FileInfo, error)
func AssetNames() []string
func MustAsset(name string) []byte
func RestoreAsset(dir, name string) error
func RestoreAssets(dir, name string) error

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Asset ¶

func Asset(name string) ([]byte, error)

Asset loads and returns the asset for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetDir ¶

func AssetDir(name string) ([]string, error)

AssetDir returns the file names below a certain directory embedded in the file by go-bindata. For example if you run go-bindata on data/... and data contains the following hierarchy:

data/
  foo.txt
  img/
    a.png
    b.png

then

AssetDir("data") would return []string{"foo.txt", "img"}
AssetDir("data/img") would return []string{"a.png", "b.png"}
AssetDir("foo.txt") and AssetDir("notexist") would return an error
AssetDir("") will return []string{"data"}.

func AssetInfo ¶

func AssetInfo(name string) (os.FileInfo, error)

AssetInfo loads and returns the asset info for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetNames ¶

func AssetNames() []string

AssetNames returns the names of the assets.

func MustAsset ¶

func MustAsset(name string) []byte

MustAsset is like Asset but panics when Asset would return an error. It simplifies safe initialization of global variables.
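To show why a panicking variant is convenient for global variables, here is a minimal sketch of the pattern. It assumes a hypothetical in-memory assets map and a made-up asset name (config/default.json); the lower-case asset and mustAsset functions imitate the documented behavior and are not the generated implementation.

```go
package main

import "fmt"

// assets is a hypothetical in-memory store standing in for embedded data.
var assets = map[string][]byte{
	"config/default.json": []byte(`{"metrics": true}`),
}

// asset returns the named asset or an error if it does not exist.
func asset(name string) ([]byte, error) {
	data, ok := assets[name]
	if !ok {
		return nil, fmt.Errorf("asset %s not found", name)
	}
	return data, nil
}

// mustAsset is like asset but panics instead of returning an error.
func mustAsset(name string) []byte {
	data, err := asset(name)
	if err != nil {
		panic("asset: " + name + ": " + err.Error())
	}
	return data
}

// defaultConfig is initialized when the package loads. A single-value
// function is required here, which is what the panicking variant provides;
// a missing asset stops the program at startup.
var defaultConfig = mustAsset("config/default.json")

func main() {
	fmt.Printf("%s\n", defaultConfig)
}
```

The trade-off is that a missing asset becomes a startup crash rather than a handled error, which is usually acceptable for data compiled into the binary.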

func RestoreAsset ¶

func RestoreAsset(dir, name string) error

RestoreAsset restores an asset under the given directory.

func RestoreAssets ¶

func RestoreAssets(dir, name string) error

RestoreAssets restores an asset under the given directory recursively.

Types ¶

This section is empty.

Directories ¶

Path	Synopsis
doc
examples/fruit_stand/generate
examples/opencv_dominant_color
examples/word_count
src
client
client/pfs	Package pfs is a generated protocol buffer package.
client/pkg/discovery
client/pkg/grpcutil
client/pkg/require
client/pkg/shard	Package shard is a generated protocol buffer package.
client/pkg/uuid
client/pps	Package pps is a generated protocol buffer package.
client/version
server/cmd/job-shim
server/cmd/pachctl
server/cmd/pachctl-doc
server/cmd/pachctl/cmd
server/cmd/pachd
server/cmd/protofix
server/pfs
server/pfs/cmds
server/pfs/db
server/pfs/db/persist	Package persist is a generated protocol buffer package.
server/pfs/drive	Package drive provides the definitions for the low-level pfs storage drivers.
server/pfs/fuse	Package fuse is a generated protocol buffer package.
server/pfs/pretty
server/pfs/server
server/pkg/cache/groupcachepb	Package groupcachepb is a generated protocol buffer package.
server/pkg/cache/server
server/pkg/cmd
server/pkg/container	Package container provides functionality to interact with containers.
server/pkg/dag
server/pkg/deploy	Package deploy is a generated protocol buffer package.
server/pkg/deploy/assets
server/pkg/deploy/cmds
server/pkg/metrics	Package metrics is a generated protocol buffer package.
server/pkg/netutil
server/pkg/obj
server/pkg/pretty
server/pkg/protofix
server/pkg/provider
server/pkg/workload
server/pps	Package pps is a generated protocol buffer package.
server/pps/cmds
server/pps/example
server/pps/persist	Package persist is a generated protocol buffer package.
server/pps/persist/server
server/pps/persist/server/testing
server/pps/pretty
server/pps/server