RexRegistry Dataset Delivery
Overview
rex_deliver_dataset is a command-line utility for uploading file-based
datasets into a RexRegistry system. It validates the format of the files in
your dataset and then uploads them to the system with all the required
metadata, automating the steps needed to satisfy the requirements for
delivering files to RexRegistry.
Installation
You can install rex_deliver_dataset on your system in any of the following
ways:
- Download from our GitHub Releases. Every release is available there with
  pre-built binaries for all the platforms we support. Unless you have a very
  specific reason not to, you should always download the most recent version.
- Homebrew. We provide a custom tap that allows Homebrew users to easily
  install this utility on macOS.
$ brew tap prometheusresearch/public
$ brew install rex_deliver_dataset
- Scoop. We provide a custom bucket that allows Scoop users to easily install
  this utility on Windows.
PS> scoop bucket add prometheusresearch https://github.com/prometheusresearch/scoop-public.git
PS> scoop install rex_deliver_dataset
- Compile from source. If you're comfortable building Go projects, you're
  welcome to retrieve the source and build it yourself.
$ git clone https://github.com/prometheusresearch/rex_deliver_dataset
$ cd rex_deliver_dataset
$ make init
$ make build
Usage
Once installed, you can use this utility by running the rex_deliver_dataset
command on your system's command line. It requires two parameters: one
specifies the location of the configuration file to use, and the other
specifies the directory that contains the files you wish to deliver. For
example:
$ rex_deliver_dataset --config=my_config_file.yaml /path/to/my/files
When executed, rex_deliver_dataset validates the structure and format of the
files in your dataset according to the dataset_type specified in your
configuration. If everything is valid, it then uploads those files to the
cloud storage container described in the storage section of your
configuration.
If you want to perform only the validation step on your dataset, and not
upload anything, use the --validate-only parameter. The tool will report any
issues it detects in your files and then terminate.
$ rex_deliver_dataset --config=my_config_file.yaml --validate-only /path/to/my/files
For more information about other parameters you can use, run
rex_deliver_dataset --help.
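The validate-only and full-delivery invocations described in this section can
be scripted. The following is a minimal Python sketch, not part of the tool
itself: the helper names build_command and validate_then_deliver are
hypothetical, and it assumes that rex_deliver_dataset is on your PATH and exits
with a non-zero status when it fails, which this document does not state.

```python
import subprocess
from typing import List


def build_command(config_path: str, dataset_dir: str,
                  validate_only: bool = False) -> List[str]:
    """Assemble the argument list for one rex_deliver_dataset invocation."""
    cmd = ["rex_deliver_dataset", f"--config={config_path}"]
    if validate_only:
        # Only check the files; do not upload anything.
        cmd.append("--validate-only")
    cmd.append(dataset_dir)
    return cmd


def validate_then_deliver(config_path: str, dataset_dir: str) -> None:
    """Run a validation-only pass first, then the full delivery.

    Assumption: the tool signals failure with a non-zero exit status, so
    check=True raises CalledProcessError and the upload is never attempted.
    """
    subprocess.run(build_command(config_path, dataset_dir, validate_only=True),
                   check=True)
    subprocess.run(build_command(config_path, dataset_dir), check=True)
```

Calling validate_then_deliver("my_config_file.yaml", "/path/to/my/files") would
run the two commands shown earlier in sequence. The full run validates the
files again before uploading, so the separate first pass is mainly useful when
you want to review reported issues before committing to a delivery.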
Configuration
rex_deliver_dataset requires a configuration file in order to do its job.
This configuration file is a YAML-formatted file that specifies a few
properties. An example configuration is as follows:
dataset_type: "omop:5.2:csv"
storage:
  kind: s3
  container: my-bucket-name
  access_key: SOME_ACCESS_KEY
  secret_key: YOUR_SUPER_SECRET_KEY
  region: us-east-1
The properties that this configuration file supports are:
dataset_type
The dataset_type property tells the tool what kind of dataset you're trying
to upload. Using this information, it performs a series of validations on
your files to ensure they are properly formatted. This property currently
allows the following values:
- omop:5.2:csv for CSV-formatted files representing OMOP CDM v5.2 tables
  (specifications)
storage
The storage property tells the tool where to upload the dataset. This
property consists of several child properties that contain the details of the
location:
kind
The kind property specifies the type of cloud storage container that is
being used. It is required, and currently allows the following values:
- s3 for an Amazon S3 bucket
- gs for a Google Cloud Storage container
container
The container property specifies the name of the container that should be
used. It is required.
access_key
The access_key property specifies the Access Key used to access the S3
bucket. It is required when using a kind of s3.
secret_key
The secret_key property specifies the Secret Key used to access the S3
bucket. It is required when using a kind of s3.
region
The region property specifies the AWS region that the S3 bucket is hosted in.
It is required when using a kind of s3.
credentials_json
The credentials_json property specifies the path to the JSON file containing
the GCS Application Credentials that will be used to access the Google Cloud
Storage container. It is required when using a kind of gs.
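The requirement rules for the storage properties can be summarized as a small
pre-flight check. The following Python sketch is illustrative only and is not
how rex_deliver_dataset itself validates configuration: the function name
check_storage and its messages are hypothetical, and it assumes the storage
section has already been loaded into a plain dictionary by a YAML parser of
your choice.

```python
from typing import Dict, List

# Properties every storage section needs, plus the extras each kind needs,
# taken from the property descriptions in this section.
COMMON_KEYS = ("kind", "container")
KEYS_BY_KIND = {
    "s3": ("access_key", "secret_key", "region"),
    "gs": ("credentials_json",),
}


def check_storage(storage: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a storage mapping (empty if none)."""
    problems = [
        f"missing required property: {key}"
        for key in COMMON_KEYS
        if not storage.get(key)
    ]
    kind = storage.get("kind")
    if kind and kind not in KEYS_BY_KIND:
        problems.append(f"unsupported kind: {kind}")
    # Check the properties that are required only for this particular kind.
    for key in KEYS_BY_KIND.get(kind, ()):
        if not storage.get(key):
            problems.append(f"missing required property: {key}")
    return problems
```

For example, a gs configuration that names a container but omits
credentials_json would yield a single problem naming that property, while the
s3 example configuration shown earlier would yield an empty list.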
Support
If you require technical support in using this tool to submit datasets to a
RexRegistry system, please contact your Registry coordinator directly. Do not
submit an issue in this GitHub Project.
If you believe you've found a bug with this utility, or would like to request
a new feature, feel free to submit a new issue in this GitHub
Project.
Contributing
You can contribute to this project by forking it, making your changes, and then
sending a Pull Request back to this project. To get started:
- Clone the repository.
- Run make init. This will retrieve all the dependencies of the project.
- Make your changes (please include tests!).
- Test your changes with make test.
License
rex_deliver_dataset is published under the terms of the GNU Affero General
Public License v3.0.