dvid

command module
v0.0.0-...-e545f54 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 6, 2014 License: BSD-3-Clause Imports: 21 Imported by: 0

README

DVID Picture

Status: In development, being tested at Janelia, and not ready for external use due to possible breaking changes.

GoDoc Build Status

Web app for 3d inspection being served from and sending requests to DVID

DVID is a distributed, versioned, image-oriented datastore written to support Janelia Farm Reseach Center's brain imaging, analysis and visualization efforts. DVID's initial focus is on efficiently storing and retrieving 3d grayscale and label data in a variety of ways, e.g., subvolumes, images in XY, XZ, and YZ orientation, multiscale 2d and 3d (similar to quadtree and octree forms), and sparse volumes determined by a label.

DVID aspires to be a "github for large image-oriented data" because each DVID server can manage multiple repositories, each of which contains an image-oriented dataset (e.g., an image volume, labels, skeletons, etc). The goal is to provide scientists with a github-like web client + server that can push/pull data to a collaborator's DVID server.

DVID is written in Go and supports different storage backends, a Level 2 REST HTTP API, command-line access, and a FUSE frontend to at least one of its data types. It has been tested on both MacOS X and Linux (Fedora 16, CentOS 6, Ubuntu) but not on Windows.

Command-line and HTTP API documentation is currently distributed over data types and can be found in help constants:

Table of Contents

Philosophy

DVID (Distributed, Versioned, Image­-oriented Datastore) was designed as a datastore that can be easily installed and managed locally, yet still scale to the available storage system and number of compute nodes. If data transmission or computer memory is an issue, it allows us to choose a local first­ class datastore that will eventually (or continuously) be synced with remote datastores. By “first class”, we mean that each DVID server, even on laptops, behaves identically to larger institutional DVID servers save for resource limitations like the size of the data that can be managed. Our vision is to have something like a github for image-oriented data, although there are a number of differences due to the size and typing of data as well as the approach to transferring versioned data between DVID servers. We hope to leverage the significant experience in crafting workflows and management tools for distributed, versioned operations.

DVID promotes the view of data as a collection of key­-value pairs where each key is composed of global identifiers for versioning and data identification as well as a datatype­-specific index (e.g., a spatial index) that allows large data to be broken into chunks. DVID focuses on how to break data into these key­-value pairs in a way that optimizes data access for various clients.

Why is distributed versioning central to DVID instead of a centralized approach?

  • Significant processing can occur in small subsets of the data or using alternative, compact representations: FlyEM mostly works in portions of the data after using full data to establish context. We can't even see the cells if we zoom out to the scale of our volumes. And if we want to work on neurons, it's a sparse volume that can be relatively compact and proofreading occurs in that sparse volume and its neighboring structures. Frequently, we can also transform voxel­-level data to more compact data structures like region adjacency graphs.
  • Research groups may want to silo data but eventually share and sync that data: It's not clear researchers want just "one" centralized datastore but might require one for each institution/group. Researchers don't always want to share data. So as soon as you support more than one centralized location, and think about syncing, you are basically looking at a distributed data problem or you'll be doing some ad hoc solution instead of more elegant git-­like techniques. And sometimes, researchers want to only share a particular version of their dataset, e.g., the state of the dataset at the time of a publication that requires open access to the data, yet they want to continue to work on the dataset privately.
  • As computers increase in power, forcing centralization leads to significant wasted resources: Since significant workflows require only relatively small subsets of data, we can move data to workstations and laptops and use graphics/computation resources of those systems. Also, allowing distributed data persistence lets us explore other funding mechanisms, not just having deep­-pocketed institutions footing the bill for all storage and computation.
  • Large, multi­tenancy datastores can be difficult to optimize for particular use cases and guarantee throughput/latency. Shared resources can be exhausted if many users hit the resource when working toward seasonal deadlines. Aside from this timing issue, certain applications require tight bounds on availability, e.g., image acquisition. Since data access optimization via caching and other techniques is very specific to an application, datastore systems should be (1) relatively simple so systems exclusive to an application can be created and (2) have well­-defined interfaces both to the storage engine and the client, so application­-specific optimizations can be made. A research group can buy servers and deploy a relatively simple system that is dedicated for a particular use case or run applications in lock­step so optimizations are easier to make, e.g., the formatting of data to suit a particular data access pattern. After a particular use case is addressed like image acquisition, some or all of the data can be synced to another DVID server that may be optimized for different uses like parallel proofreading operations in disjoint but small subvolumes.

Planned and Existing Features for DVID:

  • Distributed operation: Once a DVID dataset is created and loaded with data, it can be cloned to remote sites using user­-defined spatial extents. Each DVID server chooses how much of the data set is held locally. (Status: Planned Q2 2014)
  • Versioning: Each version of a DVID dataset corresponds to a node in a version DAG (Directed Acyclic Graph). Versions are identified through a UUID that can be composed locally yet are unique globally. Versioning and distribution follow patterns similar to distributed version control systems like git and mercurial. Provenance is kept in the DAG. (Status: Simple DAG and UUID support implemented.
    Versioned compression schemes have been worked out with implementation Q2-Q3 2014.
    )
  • Denormalized Views: For any node in the version DAG, we can choose to create denormalized views that accelerate particular access patterns. For example, quad trees can be created for XY, XZ, and YZ orthogonal views or sparse volumes can compactly describe a neuron. The extra denormalized data is kept in the datastore until a node is archived, which removes all denormalized key­-value pairs associated with that version node. Views of the same data can be eventually consistent. (Status: Multi-scale 2d images in XY, XZ, YZ, surface voxels and sparse volumes implemented. Multi-scale 3d, like an octree but also supporting arbitrary scaling levels for anisotropic data, is planned by Q3 2014. Framework for syncing of denormalized views planned Q3-4 2014.)
  • Flexible Data Types: DVID provides a well­-defined interface to data type code that can be easily added by users. A DVID server provides HTTP and RPC APIs, authentication, authorization, versioning, provenance, and storage engines. It delegates datatype­-specific commands and processing to data type code. As long as a DVID type can return data for its implemented commands, we don’t care how its implemented. (Status: Variety of voxel types, multiscale2d, label maps, and key-value implemented. FUSE interface for key-value type working but not heavily used. Authentication and authorization support planned Q3 2014, likely using Google or other provider authentication + tokens similar to github API.)
  • Scalable Storage Engine: Although DVID may support polyglot persistence (i.e., allow use of relational, graph, or NoSQL databases), we are initially focused on key­-value stores. DVID has an abstract key­-value interface to its swappable storage engine. We choose a key­-value interface because (1) there are a large number of high­-performance, open­-source implementations that run from embedded to clustered systems, (2) the surface area of the API is very small, even after adding important cases like bulk loads or sequential key read/write, and (3) novel technology tends to match key­-value interfaces, e.g., groupcache and Seagate's Kinetic Open Storage Platform. As storage becomes more log structured, the key/value API becomes a more natural fit. (Status: Tested with three leveldb variants: Google's open source version, HyperLevelDB, and the default Basho-tuned leveldb. Added Lightning MDB and also experimental use of Bolt, although neither have been tuned to work as well as the leveldb variants.)

A DVID server is limited to local resources and the user determines what datasets, subvolume, and versions are held within that DVID server. Overwrites are allowed but once a version is locked, no further edits are allowed on that particular version. This allows manual or automated editing to be done during a period without accumulation of unnecessary deltas.

Build Process

DVID uses the buildem system to automatically download and build the specified storage engine (e.g., leveldb), Go language support, and all required Go packages.

One-time Setup

Make sure you have the basic requirements:

Before downloading DVID, setup the proper directory structure that adheres to Go standards and clone the dvid repo:

% export GOPATH=/path/to/go/workspace
% export DVIDSRC=$GOPATH/src/github.com/janelia-flyem/dvid
% mkdir -p $DVIDSRC
% cd $DVIDSRC
% git clone https://github.com/janelia-flyem/dvid .

You should also have a BUILDEM_DIR, either an empty directory or your previous buildem directory where you'll compile all the required software and eventually place the compiled dvid executable. You'll want to set your environment variables like so:

% export BUILDEM_DIR=/path/to/buildem/dir
% export PATH=$BUILDEM_DIR/bin:$PATH

For Linux, export your library path:

% export LD_LIBRARY_PATH=$BUILDEM_DIR/lib:$LD_LIBRARY_PATH

For Mac, the library path is specified as DYLD_LIBRARY_PATH:

% export DYLD_LIBRARY_PATH=$BUILDEM_DIR/lib:$LD_LIBRARY_PATH

In the above, we are saying to use the executables created in the $BUILDEM_DIR/bin directory first, which should include the DVID executable, and also use buildem-created libraries.

Create an empty build directory used to build just dvid:

% cd $DVIDSRC
% mkdir build
% cd build
% cmake -D BUILDEM_DIR=$BUILDEM_DIR ..

The example above creates a build directory in the dvid repo directory, which also has a .gitignore that ignores all "build" and "build-*" files/directories.

If you haven't built with that buildem directory before, do the additional steps:

% make
% cmake -D BUILDEM_DIR=/path/to/buildem/dir ..

You can specify a particular storage engine for DVID by adding a -D DVID_BACKEND=... option to the above cmake command. It currently defaults to basholeveldb (the Basho-tuned leveldb) but could be run with the lmdb setting to install Lightning MDB.

Making and testing DVID

Make dvid:

% make dvid

This will install a DVID executable 'dvid' in the buildem bin directory.

Tests are run with gocheck:

% make test

Specifying your choice of web client using "-webclient":

% dvid -webclient=/path/to/dvid-webclient serve /path/to/datastore/dir

You can then modify the web client code and refresh the browser to see the changes.

Particularly when using leveldb variants, we recommend modifying default "max open files" and also avoiding extra disk head seeks by turning off noatime, which records the last accessed time for all files. See this explanation on the basho page

First, make sure you allow a sufficient number of open files. This can be checked via the "ulimit -n" command in Linux and Mac. We suggest raising this to 65535 on Linux and at least 8192 on a Mac. You might have to modify the (1024 has proven sufficient for Teravoxel datasets on 64-bit Linux using standard leveldb but this had to be raised to several thousand even for 50 Gigavoxel datasets on Mac.)

% ulimit -n 65535

Second, disable access-time updates for the mount with your DVID data. In Linux, you can add the noatime mounting option to /etc/fstab for the partition holding your data. The line for the mount holding your DVID data should like something like this:

/dev/mapper/vg0-lv_data    /dvid/data     xfs      *noatime*,nobarrier     1 2

Then remount the disk:

% mount /dvid/data -o remount

Simple Example

In this example, we'll initialize a new datastore, create grayscale data, load some images, and then use a web browser to see the results.

Getting help

If DVID was installed correctly, you should be able to ask dvid for help:

% dvid help
Create the datastore

Depending on the storage engine, DVID will store data into a directory. We must initialize the datastore by specifying a datastore directory.

% dvid init /path/to/datastore/dir
Start the DVID server

The "-debug" option lets you see how DVID is processing requests.

% dvid -debug serve /path/to/datastore/dir

If dvid wasn't compiled with a built-in web client, you'll see some complaints and ways you can specify a web client. For our purposes, though, we don't need the web console for this simple example. We will be accessing the standard HTTP API directly through a web browser.

Open another terminal and run "dvid help" again. You'll see more information because dvid can detect a running server and describe the data types available. You can also ask for just the data types supported by this DVID server:

% dvid types
Create a new dataset

One DVID server can manage many different datasets. We create a dataset like so:

% dvid datasets new

A hexadecimal string will be printed in response. Datasets, like versions of data within a dataset, are identified by a global UUID, usually printed and read in hexadecimal format (e.g., "c78a0"). When supplying a UUID, you only need enough letters to uniquely identify the UUID within that DVID server. For example, if there are only two datasets, each with only one version, and the root versions for each dataset are "c78a0..." and "da0a4...", respectively, then you can uniquely specify the datasets using "c7" and "da". (We might use one letter, but generally two or more letters are better.)

% dvid types

Entering the above command will show the new dataset but there are no data under it.

Create new data for a dataset

We can create an instance of a supported data type for the new dataset:

% dvid dataset c7 new grayscale8 mygrayscale

Replace "c7" with the first two letters of whatever UUID was printed when you created a new dataset. You can also see the UUIDs for datasets by using the "dvid types" command.

After adding new data, you can see the result via the "dvid types" command. It will show new data named "mygrayscale" that is of grayscale8 data type. DVID allows a variety of data types, each implemented by some code that determines that data type's commands, HTTP API, and method of storing and retrieving data. The data type implementation is uniquely identified by the URL of that implementation's package or file.

Get data type-specific help

Since each data type has its own set of commands and HTTP API, we need to know what's available from each type:

% dvid types grayscale8 help

The above asks the grayscale8 data type implementation for its usage. In the next step, we will use the "load local" command described in the help response.

Loading some sample data

Download a small stack of grayscale images from github.

You can either clone it using git or use the "Download ZIP" button on the right. Once you've downloaded that 250 x 250 x 250 grayscale volume, enter the following:

% dvid node c7 mygrayscale load 100,100,2600 "/path/to/sample/*.png"

Once again, replace the "c7" with a UUID string for your dataset. Note that you have to specify the full path to the PNG images. If you started the DVID server using the "-debug" option, you'll see a series of messages from the server on reading each image and storing it into the datastore. This command loads all image filenames in alphanumeric order. Since we specified "xy" (and could have used the multidimensional specification "0,1"), each succeeding image is loaded with an offset incremented by one in the Z (3rd) dimension.

After this command completes, the image data has been ingested into small chunks of data in DVID.

Use web browser to explore data

You might have noticed a few HTTP API calls listed in the grayscale8 help text. We are going to use one of these to look at slices of data orthogonal to the volume axes. Launch a web browser and enter the following URL:

<api URL>/node/c7/mygrayscale/xy/250_250/100_100_2600

where <api URL> is typically localhost:8000/api where the hostname can be set by the dvid option -web as in dvid -web=:8080 server /path/to/db.

You should see a small grayscale image appear in your browser. It will be 250 x 250 pixels, taken from data in the XY plane using an offset of (100,100,2600). Note that when we did the "load local" we specified "100,100,2600". This loaded the first image with (100,100,2600) as an offset. Also note the HTTP APIs replace the comma separator with an underscore because commas are reserved URI characters.

Change the Z offset to 2800 to see a different portion of the volume:

<api URL>/node/c7/mygrayscale/xy/250_250/100_100_2800

We can see the extent of the loaded image using the following resectioning:

<api URL>/node/c7/mygrayscale/xz/500_500/0_0_2500

A larger 500 x 500 pixel image should now appear in the browser with black areas surrounding your loaded data. This is a slice along XZ, an orientation not present in the originally loaded XY images.

Adding multi-scale 2d images

Let's precompute XY, XZ, and YZ multiscale2d for our grayscale image. First, we add an instance of a multiscale2d data type under our previous dataset UUID:

% dvid dataset c7 new multiscale2d mymultiscale2d source=mygrayscale TileSize=128

Note that we set type-specific parameters, "source" to "mygrayscale", which is the name of the data we wish to tile, and "TileSize" to "128", which causes all future tile generation to be 128x128 pixels. Now that we have multiscale2d data, we generate the multiscale2d using this command:

% dvid node c7 mymultiscale2d generate tilespec.json

This will kick off the tile precomputation (about 30 seconds on my MacBook Pro). Since our loaded grayscale is 250 x 250 x 250, we will have two different scales in the multiscale2d. The original scale is "0" and can be accessed through the multiscale2d HTTP API. Visit this URL in your browser:

<api URL>/node/c7/mymultiscale2d/tile/xy/0/0_0_0

This will return a 128x128 pixel PNG tile, basically the upper left quadrant of the first slice of our test data. By replacing the "0_0_0" (0,0,0) portion with "1_0_0", you can see the upper right quadrant (tile x=1, y=0) of the first 250x250 image. Tile space has its origin in the upper left corner.

We can zoom out a bit by going to scale "1" where returned multiscale2d have reduced the size of the original image by 2.

<api URL>/node/c7/mymultiscale2d/tile/xy/1/0_0_0

The above URL will return a 128x128 pixel tile that covers the original 250x250 image so you see a bit of black space at the edges. DVID automatically creates as many scales as necessary until one tile fits the source image extents. In this test case, we only need two scales because at scale "1", our specified tile size can cover the original data.

As an exercise, look at the XZ and YZ multiscale2d as well.

Documentation

Overview

DVID is a *distributed, versioned, image-oriented datastore* written in Go that supports different storage backends, a Level 2 REST HTTP API, command-line access, and a FUSE frontend to at least one of its data types. It has been tested on both MacOS X and Linux (Fedora 16, CentOS 6) but not on Windows.

The starting point for DVID documentation is the README.md file in the DVID repo: https://github.com/janelia-flyem/dvid#dvid

Directories

Path Synopsis
Package datastore provides versioning and persisting supported data types using one of the supported storage engines.
Package datastore provides versioning and persisting supported data types using one of the supported storage engines.
Package datatype handles registration and processing of arbitrary data types with the DVID datastore.
Package datatype handles registration and processing of arbitrary data types with the DVID datastore.
keyvalue
Package keyvalue implements DVID support for data using generic key/value.
Package keyvalue implements DVID support for data using generic key/value.
labelmap
Package labelmap implements DVID support for label->label mapping including spatial index tracking.
Package labelmap implements DVID support for label->label mapping including spatial index tracking.
labels
Package labels provides various indexing options for labels and label maps.
Package labels provides various indexing options for labels and label maps.
labels64
Package labels64 tailors the voxels data type for 64-bit labels and allows loading of NRGBA images (e.g., Raveler superpixel PNG images) that implicitly use slice Z as part of the label index.
Package labels64 tailors the voxels data type for 64-bit labels and allows loading of NRGBA images (e.g., Raveler superpixel PNG images) that implicitly use slice Z as part of the label index.
multichan16
Package multichan16 tailors the voxels data type for 16-bit fluorescent images with multiple channels that can be read from V3D Raw format.
Package multichan16 tailors the voxels data type for 16-bit fluorescent images with multiple channels that can be read from V3D Raw format.
multiscale2d
Package multiscale2d implements DVID support for multiscale2ds in XY, XZ, and YZ orientation.
Package multiscale2d implements DVID support for multiscale2ds in XY, XZ, and YZ orientation.
voxels
Package voxels implements DVID support for data using voxels as elements.
Package voxels implements DVID support for data using voxels as elements.
Package dvid provides types, constants, and functions that have no other dependencies and can be used by all packages within DVID.
Package dvid provides types, constants, and functions that have no other dependencies and can be used by all packages within DVID.
Package server provides web and rpc interfaces to DVID operations.
Package server provides web and rpc interfaces to DVID operations.
Package storage provides a unified interface to a number of storage engines.
Package storage provides a unified interface to a number of storage engines.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL