Note: This is a Pachyderm pre-1.4 tutorial. It needs to be updated for the latest versions of Pachyderm.

Quick Start Guide: OpenCV Dominant Color Example

In this guide you're going to create a Pachyderm pipeline to run a dominant color algorithm written with the popular computer vision library OpenCV. The algorithm is written in C++ and is compiled and statically linked for easy distribution throughout the containers that run the Pachyderm pipeline. If you are interested in the implementation of the k-means clustering based algorithm, please refer to the OpenCV source code. We are also going to run the algorithm on 1000 images from the Flickr dataset, which are stored in an S3 bucket.
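The C++ implementation isn't reproduced in this guide, but the idea is straightforward: sample pixels from an image and cluster them with k-means; the cluster centers act as the dominant colors. Below is a minimal, pure-Go sketch of that idea (clustering in RGB, standard library only) to illustrate the technique. It is not the OpenCV implementation used by the pipeline, which works in HSV and is compiled as the dominantColor binary.

// Conceptual sketch only: k-means over sampled pixels to find dominant colors.
package main

import (
	"fmt"
	"image"
	_ "image/jpeg"
	"math"
	"math/rand"
	"os"
)

type rgb struct{ r, g, b float64 }

// squared distance is enough for nearest-center assignment
func dist(a, b rgb) float64 {
	return (a.r-b.r)*(a.r-b.r) + (a.g-b.g)*(a.g-b.g) + (a.b-b.b)*(a.b-b.b)
}

// kmeans runs a fixed number of Lloyd iterations over the sampled pixels
// and returns the k cluster centers, i.e. the "dominant" colors.
func kmeans(px []rgb, k, iters int) []rgb {
	centers := make([]rgb, k)
	for i := range centers {
		centers[i] = px[rand.Intn(len(px))]
	}
	for it := 0; it < iters; it++ {
		sums := make([]rgb, k)
		counts := make([]int, k)
		for _, p := range px {
			best, bestD := 0, math.MaxFloat64
			for c, ctr := range centers {
				if d := dist(p, ctr); d < bestD {
					best, bestD = c, d
				}
			}
			sums[best].r += p.r
			sums[best].g += p.g
			sums[best].b += p.b
			counts[best]++
		}
		for c := range centers {
			if counts[c] > 0 {
				n := float64(counts[c])
				centers[c] = rgb{sums[c].r / n, sums[c].g / n, sums[c].b / n}
			}
		}
	}
	return centers
}

func main() {
	f, err := os.Open(os.Args[1]) // e.g. im1.jpg
	if err != nil {
		panic(err)
	}
	defer f.Close()
	img, _, err := image.Decode(f)
	if err != nil {
		panic(err)
	}
	// Sample every 8th pixel to keep the sketch fast.
	var px []rgb
	b := img.Bounds()
	for y := b.Min.Y; y < b.Max.Y; y += 8 {
		for x := b.Min.X; x < b.Max.X; x += 8 {
			r, g, bl, _ := img.At(x, y).RGBA()
			px = append(px, rgb{float64(r >> 8), float64(g >> 8), float64(bl >> 8)})
		}
	}
	// 5 clusters, mirroring the top 5 dominant colors in the pipeline output.
	for _, c := range kmeans(px, 5, 10) {
		fmt.Printf("dominant RGB: %.1f %.1f %.1f\n", c.r, c.g, c.b)
	}
}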

Setup

This guide assumes that you already have a Pachyderm cluster running and have configured pachctl to talk to the cluster. Installation instructions can be found here.

As mentioned in the introduction, having a statically linked native OpenCV binary simplifies the dev environment and reduces the size of the Docker images required. Note: the approach taken for this guide was to build the binary (called dominantColor) in a Docker image such as this one and then copy the resulting binary into the containers that run the pipeline.

If you are running your own build with a Dockerfile, a Makefile is provided to download the statically linked binary:

$ make

This guide also assumes access to an S3 bucket for the dataset images, which are downloaded at the beginning of the pipeline. images.zip should be downloaded via the Google Drive link and uploaded to an S3 bucket for the purposes of this example. In our case, the bucket is named opencv-example.
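Any S3 client can be used for the upload (the AWS console or CLI works just as well). As one option, here is a hedged sketch using aws-sdk-go; the region, the local images/ directory, and the opencv-example bucket name are assumptions for this guide.

// Sketch: upload every .jpg in a local images/ directory (the unzipped
// images.zip) to the opencv-example bucket.
package main

import (
	"os"
	"path/filepath"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	uploader := s3manager.NewUploader(sess)

	files, err := filepath.Glob("images/*.jpg")
	if err != nil {
		panic(err)
	}
	for _, path := range files {
		f, err := os.Open(path)
		if err != nil {
			panic(err)
		}
		// Keys are just the file names (im1.jpg, im2.jpg, ...), matching the
		// entries in images.txt that the pipeline reads.
		_, err = uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String("opencv-example"),
			Key:    aws.String(filepath.Base(path)),
			Body:   f,
		})
		f.Close()
		if err != nil {
			panic(err)
		}
	}
}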

Create a Repo

A repo is the highest level primitive in pfs. Like all primitives in pfs, it shares its name with a primitive in Git and is designed to behave analogously. Generally, a repo should be dedicated to a single source of data such as log messages from a particular service. Repos are dirt cheap so don't be shy about making them very specific.

For this demo we'll simply create a repo called “images” to hold a list of the names of the images in the S3 bucket.

$ pachctl create-repo images
$ pachctl list-repo
images

Start a Commit

Now that we've created a repo we've got a place to add data. If you try writing to the repo right away though, it will fail because you can't write directly to a repo. In Pachyderm, you write data to an explicit commit. Commits are immutable snapshots of your data which give Pachyderm its version-control properties. Unlike Git though, commits in Pachyderm must be explicitly started and finished.

Let's start a new commit in the “images” repo:

$ pachctl start-commit images
e1b57763af0b41e5ad4ada262a392de1

This returns a brand new commit id. Yours should be different from mine. Now if we take a look inside our repo, we’ve created a directory for the new commit:

$ pachctl list-commit images
e1b57763af0b41e5ad4ada262a392de1

A new directory has been created for our commit and now we can start adding data. The data for this example is just a single file with a list of image names. The sample file images.txt includes the names of the 1000 images that the pipeline will process. Remember to unzip images.zip and upload the images to S3 before running the pipeline. We're going to write that data as a file called “images” in pfs.

# Write sample data into pfs
$ cat examples/opencv_dominant_color/images.txt | pachctl put-file images e1b57763af0b41e5ad4ada262a392de1 images

However, you'll notice that we can't read the file “images” yet. This is because the commit hasn't been finished yet, and in a distributed environment it could contain some dirty state.

$ pachctl get-file images e1b57763af0b41e5ad4ada262a392de1 images
File not found

Finish a Commit

Pachyderm won't let you read data from a commit until the commit is finished. This prevents reads from racing with writes. Furthermore, every write to pfs is atomic. Now let's finish the commit:

$ pachctl finish-commit images e1b57763af0b41e5ad4ada262a392de1

Now we can view the head of the file:

$ pachctl get-file images e1b57763af0b41e5ad4ada262a392de1 images | head
im1.jpg
im10.jpg
im100.jpg
im1000.jpg
im101.jpg
im102.jpg
im103.jpg
im104.jpg
im105.jpg

However, we've lost the ability to write to this commit since finished commits are immutable. In Pachyderm, a commit is always either write-only when it's been started and files are being added, or read-only after it's finished.

Create a Pipeline

Now that we've got some data in our repo it's time to do something with it. Pipelines are the core primitive of Pachyderm's processing system (pps) and they're specified with a JSON encoding. We're going to create a pipeline that computes the dominant colors of each of the images listed in “images.”

+---------------------------+     +------------------------------+
| image file names (images) | --> |opencv_dominant_color pipeline|
+---------------------------+     +------------------------------+

The pipeline we're creating can be found at examples/opencv_dominant_color/opencv-pipeline.json.

{
  "pipeline": {
    "name": "opencv"
  },
  "transform": {
    "image": "soheilsalehian/opencv_example:20170722-00",
    "cmd": ["sh"],
    "stdin": [
      "export AWS_ACCESS_KEY_ID=[your aws access key id]",
      "export AWS_SECRET_ACCESS_KEY=[your aws secret access key]",
      "./dominant_color -images_file /pfs/images/images -aws_bucket_name [your aws s3 bucket of images] >> /pfs/out/opencv"
    ]
  },
  "parallelism": "1",
  "inputs": [
    {
      "repo": {
        "name": "images"
      },
      "method": "map"
    }
  ]
}

As the pipeline uses S3 for image input storage, please fill in the AWS access key ID and AWS secret access key environment variables, along with the name of the S3 bucket (in this case 'opencv-example').

In this pipeline, we're using an already built Docker image that includes the dominant color algorithm. For a closer look at this image, please take a look at examples/opencv_dominant_color/Dockerfile, which can be found in this example directory. S3 credentials are passed in explicitly to keep the scope of the example small (Pachyderm also supports passing S3 credentials as Kubernetes secrets via manifests).

Finally, the interesting part of the pipeline: the dominant_color Go binary (based on examples/opencv_dominant_color/dominant_color.go, which is responsible for downloading the images from S3 and calling the C++ algorithm binary) is run with flags for the images file (which we added as a commit to the images repo) and the name of the AWS bucket that holds the images.

Another important section to notice is that we read data from /pfs/images/images (/pfs/[input_repo_name]) and write data to /pfs/out/opencv.
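The actual source lives in examples/opencv_dominant_color/dominant_color.go; the following is only a rough sketch of its shape, assuming aws-sdk-go for the S3 download and a simple exec of the C++ binary. The flag names and pfs paths follow the pipeline spec above, but the helper details and the way dominantColor is invoked are assumptions.

// Rough sketch of the shape of dominant_color.go (not the actual source).
// It reads the image list from pfs, downloads each image from S3, runs the
// statically linked C++ binary on it, and prints the results to stdout,
// which the pipeline spec above redirects into /pfs/out/opencv.
package main

import (
	"bufio"
	"flag"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	imagesFile := flag.String("images_file", "/pfs/images/images", "file listing the image names")
	bucket := flag.String("aws_bucket_name", "", "S3 bucket that holds the images")
	flag.Parse()

	// The SDK picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
	// environment, which is why the pipeline spec exports them in "stdin".
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	downloader := s3manager.NewDownloader(sess)

	f, err := os.Open(*imagesFile)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		name := scanner.Text() // e.g. im1.jpg
		local := filepath.Join(os.TempDir(), name)

		out, err := os.Create(local)
		if err != nil {
			panic(err)
		}
		if _, err := downloader.Download(out, &s3.GetObjectInput{
			Bucket: aws.String(*bucket),
			Key:    aws.String(name),
		}); err != nil {
			panic(err)
		}
		out.Close()

		// Run the C++ OpenCV binary, which prints the top 5 dominant HSV colors.
		colors, err := exec.Command("./dominantColor", local).Output()
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s | %s\n", name, strings.TrimSpace(string(colors)))
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}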

Now let's create the pipeline in Pachyderm:

$ pachctl create-pipeline -f examples/opencv_dominant_color/opencv-pipeline.json

What Happens When You Create a Pipeline

Creating a pipeline tells Pachyderm to run your code on every finished commit in a repo as well as all future commits that happen after the pipeline is created. Our repo already had a commit with the file “images” in it, so Pachyderm automatically launches a job to process those images.

You can view the job with:

$ pachctl list-job
ID                                 OUTPUT                                    STATE
09a7eb68995c43979cba2b0d29432073   opencv/2b43def9b52b4fdfadd95a70215e90c9   JOB_STATE_RUNNING

Depending on how quickly the job runs, you may see it as running or successful. For 1000 images and no parallelism, this may take up to 15 minutes.

Pachyderm jobs are implemented as Kubernetes jobs, so you can also see your job with:

$ kubectl get job
JOB                                CONTAINER(S)   IMAGE(S)             SELECTOR                                                         SUCCESSFUL
09a7eb68995c43979cba2b0d29432073   user           ubuntu:14.04   app in (09a7eb68995c43979cba2b0d29432073),suite in (pachyderm)   1

A good way to check on the pipeline is to look at its output repo. Every pipeline creates a corresponding repo with the same name where it stores its output results. In our example, the pipeline was named “opencv” so it created a repo called “opencv” which contains the final output.

Reading the Output

There are a couple of different ways to retrieve the output. We can read a single output file from the “opencv” repo in the same fashion that we read the input data:

$ pachctl list-file opencv 09a7eb68995c43979cba2b0d29432073 opencv
$ pachctl get-file opencv 09a7eb68995c43979cba2b0d29432073 opencv

Using get-file is good if you know exactly what file you're looking for, but for this example we just want to see the top 5 most dominant colors for each image. One great way to do this is to mount the distributed file system locally and then just poke around.

Mount the Filesystem

First create the mount point:

$ mkdir ~/pfs

And then mount it:

# We background this process because it blocks.
$ pachctl mount ~/pfs &

This will mount pfs on ~/pfs, and you can inspect the filesystem like you would any other local filesystem. Try:

$ ls ~/pfs
images opencv

You should see the images repo that we created and the opencv repo that Pachyderm created as part of the pipeline.

Now you can simply ls and cd around the file system. Each line of the output lists the top 5 dominant colors of an image in HSV space (from right to left in this case). Looking at the results:

$ head ~/pfs/opencv/09a7eb68995c43979cba2b0d29432073/opencv
im1.jpg |  20.948992   32.863491   216.865738  | 35.246891   60.671085   54.082661  | 97.668373   206.915863   55.697308  | 120.866508   25.080130   146.936523  | 19.569460   56.620155   137.499222  |

im10.jpg |  19.392403   17.160366   165.163666  | 149.043121   34.640427   91.192322  | 92.382164   203.258102   13.145894  | 130.047241   11.899960   194.779739  | 13.647389   55.731003   78.900078  |

im100.jpg |  32.702194   186.271851   159.835724  | 24.227354   79.743240   167.419662  | 38.063137   200.102142   215.346375  | 132.114319   68.211082   128.087021  | 38.833736   151.232437   107.771217  |

im1000.jpg |  7.989898   25.358656   153.719635  | 139.577148   11.809133   101.354256  | 13.005986   58.725502   223.807602  | 11.071401   19.911144   39.587875  | 109.780624   40.455315   50.042118  |

im101.jpg |  160.830017   46.397697   69.901993  | 28.465757   205.267059   25.863764  | 0.361928   5.317534   5.269302  | 13.312026   121.914291   85.093811  | 27.073235   83.940109   202.059784  |

As you can see, we have run the algorithm, which gives us the top 5 most dominant colors in the HSV color space for each image.

Processing More Data

Pipelines can be triggered manually, but they will also automatically process the data from new commits as they are created. Think of pipelines as being subscribed to any new commits that are finished on their input repo(s).

If we want to run the algorithm on more images, we can add new commits and Pachyderm takes care of running the algorithm on new images.

Next Steps

You've now got a working Pachyderm cluster with data and a pipeline!

In the next part of this guide, we will take a look at how to search for images that have a distinct dominant color, such as pink. Some ideas for how to do this include:

  • Parsing the opencv repo output and evaluating the results (see the sketch after this list).
  • Dumping the results of the algorithm into Elasticsearch and then querying it.
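As a starting point for the first idea, here is a hedged sketch that scans the pipeline output for images with a color in a rough “pink” hue band. The output path, the hue/saturation thresholds, and the definition of pink are all assumptions for illustration.

// Sketch: report images whose output row contains a roughly pink color.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open(os.Args[1]) // e.g. ~/pfs/opencv/<commit-id>/opencv
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		// Lines look like: "im1.jpg | H S V | H S V | ..." with the top 5
		// colors separated by "|".
		fields := strings.Split(line, "|")
		if len(fields) < 2 {
			continue
		}
		name := strings.TrimSpace(fields[0])
		for _, col := range fields[1:] {
			hsv := strings.Fields(col)
			if len(hsv) != 3 {
				continue
			}
			h, _ := strconv.ParseFloat(hsv[0], 64)
			s, _ := strconv.ParseFloat(hsv[1], 64)
			// OpenCV hue runs 0-179; pinks/magentas sit very roughly around
			// 140-170 with some saturation. These thresholds are a guess.
			if h >= 140 && h <= 170 && s >= 40 {
				fmt.Println("pink-ish:", name)
				break
			}
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}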

We'd love to help and see what you come up with, so submit any issues/questions you come across, or email us at info@pachyderm.io if you want to show off anything nifty you've created with OpenCV and Pachyderm!
