dataset

package module
v2.0.0-a3
Published: Jun 30, 2022 License: BSD-3-Clause Imports: 22 Imported by: 2

README

Dataset Project

The Dataset Project provides tools for working with collections of JSON documents stored on the local file system in a pairtree or in a SQL database supporting JSON columns. Two tools are provided by the project -- a command line interface (dataset) and a RESTful web service (datasetd).

dataset, a command line tool

dataset is a command line tool for working with collections of JSON documents. Collections can be stored on the file system in a pairtree directory structure or in a SQL database that supports JSON columns (currently SQLite3 and MySQL 8 are supported). Collections using the file system store the JSON documents in a pairtree as plain UTF-8 text. This means the objects can be accessed with common Unix text processing tools as well as most programming languages.

The dataset command line tool supports common data management operations such as initialization of collections; document creation, reading, updating and deleting; listing keys of JSON objects in the collection; and associating non-JSON documents (attachments) with specific JSON documents in the collection.

enhanced features include
  • aggregate objects into data frames
  • generate sample sets of keys and objects
  • clone a collection
  • clone a collection into training and test samples

See Getting started with dataset for a tour and tutorial.

datasetd, dataset as a web service

datasetd is a RESTful web service implementation of the dataset command line program. It provides a subset of the capabilities found in the command line tool. This allows dataset collections to be integrated safely into web applications or used concurrently by multiple processes. It achieves this by storing the dataset collection in a SQL database using JSON columns.

Design choices

dataset and datasetd are intended to be simple tools for managing collections of JSON object documents in a predictable, structured way.

dataset is guided by the idea that you should be able to work with JSON documents as easily as you can any plain text document on the Unix command line. dataset is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds creates a new collection called 'mycollection.ds').
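
The Go package provides the same minimal setup programmatically through the Init function documented below; a small sketch (the collection name is illustrative):

```

c, err := dataset.Init("mycollection.ds", "")
if err != nil {
   // ... handle error
}
defer c.Close()

```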

  • dataset and datasetd store JSON object documents in collections.
    • Storage of the JSON documents may be either in a pairtree on disk or in a SQL database using JSON columns (e.g. SQLite3 or MySQL 8)
    • dataset collections are made up of a directory containing collection.json and codemeta.json files.
    • collection.json is a metadata file describing the collection, e.g. storage type, name, description, whether versioning is enabled
    • codemeta.json is a codemeta file describing the nature of the collection, e.g. authors, description, funding
    • collection objects are accessed by their key, a unique identifier made of lowercase alphanumeric characters
    • collection names are usually lowercase and usually have a .ds extension for easy identification

dataset collection storage options

  • pairtree is the default disk organization of a dataset collection
    • the pairtree path is always lowercase
    • non-JSON attachments can be associated with a JSON document and found in directories organized by semver (semantic version number)
    • versioned JSON documents are created alongside the current JSON document but are named using both their key and semver
  • SQL store stores JSON documents in a JSON column
    • SQLite3 and MySQL 8 are the currently supported SQL databases
    • A "DSN URI" is used to identify and gain access to the SQL database
    • The DSN URI may be passed through the environment

datasetd is a web service

  • is intended as a back end web service run on localhost
    • by default it runs on localhost port 8485
    • supports collections that use the SQL storage engine
  • should never be used as a public facing web service
    • there are no user level access mechanisms
    • anyone with access to the web service end point has access to the dataset collection content

The choice of plain UTF-8 is intended to help future proof reading dataset collections. Care has been taken to keep dataset simple enough and lightweight enough that it will run on a machine as small as a Raspberry Pi Zero while being equally comfortable on a more resource rich server or desktop environment. dataset can be re-implemented in any programming language supporting file input and output, common string operations, and JSON encoding and decoding functions. The current implementation is in the Go language.

Features

dataset supports

  • Initialize a new dataset collection
    • Define metadata about the collection using a codemeta.json file
    • Define a keys file holding a list of allocated keys in the collection
    • Creates a pairtree for object storage
  • Codemeta file support for describing the collection contents
  • Simple JSON object versioning
  • Listing Keys in a collection
  • Object level actions
  • The ability to create data frames from whole collections or based on key lists
    • frames are defined using a list of keys and a list of dot paths describing what is to be pulled out of the stored JSON objects and into the frame
    • frame level actions
      • frames, list the frame names in the collection
      • frame, define a frame, does not overwrite an existing frame with the same name
      • frame-def, show the frame definition (in case we need it for some reason)
      • frame-keys, return a list of keys in the frame
      • frame-objects, return a list of objects in the frame
      • refresh, using the current frame definition reload all the objects in the frame given a key list
      • reframe, replace the frame definition then reload the objects in the frame using the existing key list
      • has-frame, check to see if a frame exists
      • delete-frame remove the frame

datasetd supports

  • Listing collections available from the web service
  • Listing or updating a collection's metadata
  • Listing a collection's keys
  • Object level actions (create, read, update, delete and attachments)
  • Creating data frames from whole collections or based on key lists

Both dataset and datasetd may be useful for general data science applications needing JSON object management or in implementing repository systems in research libraries and archives.

Limitations of dataset and datasetd

dataset has many limitations, some of which are listed below

  • the pairtree implementation is not a multi-process, multi-user data store
  • it is not a general purpose database system
  • it stores all keys in lower case in order to deal with file systems that are not case sensitive, a compatibility requirement of pairtrees
  • it stores collection names as lower case to deal with file systems that are not case sensitive
  • it does not have a built-in query language, search or sorting
  • it should NOT be used for sensitive or secret information

datasetd is a simple web service intended to run on "localhost:8485".

  • it does not include support for authentication
  • it does not support a query language, search or sorting
  • it does not support access control by users or roles
  • it does not provide auto key generation
  • it limits the size of JSON documents stored to the size supported by the host SQL database's JSON columns
  • it limits the size of attached files to less than 250 MiB
  • it does not support partial JSON record updates or retrieval
  • it does not provide an interactive Web UI for working with dataset collections
  • it does not support HTTPS or "at rest" encryption
  • it should NOT be used for sensitive or secret information

Authors and history

  • R. S. Doiel
  • Tommy Morrell

Releases

Compiled versions are provided for Linux (x86), Mac OS X (x86 and M1), Windows 11 (x86) and Raspberry Pi OS (ARM7).

github.com/caltechlibrary/dataset/releases

You can use dataset from Python via the py_dataset package.

Documentation

Overview

Package dataset includes the operations needed for processing collections of JSON documents and their attachments.

Authors R. S. Doiel, <rsdoiel@library.caltech.edu> and Tom Morrell, <tmorrell@library.caltech.edu>

Copyright (c) 2022, Caltech All rights not granted herein are expressly reserved by Caltech.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Dataset Project
===============

The Dataset Project provides tools for working with collections of JSON Object documents stored on the local file system or via a dataset web service. Two tools are provided, a command line interface (dataset) and a web service (datasetd).

dataset command line tool
-------------------------

_dataset_ is a command line tool for working with collections of JSON objects. Collections are stored on the file system in a pairtree directory structure or can be accessed via dataset's web service. For collections storing data in a pairtree, JSON objects are stored as plain UTF-8 text files. This means the objects can be accessed with common Unix text processing tools as well as most programming languages.

The _dataset_ command line tool supports common data management operations such as initialization of collections; document creation, reading, updating and deleting; listing keys of JSON objects in the collection; and associating non-JSON documents (attachments) with specific JSON documents in the collection.

### enhanced features include

- aggregate objects into data frames
- generate sample sets of keys and objects

datasetd, dataset as a web service
----------------------------------

_datasetd_ is a web service implementation of the _dataset_ command line program. It provides a subset of the capabilities found in the command line tool. This allows dataset collections to be integrated safely into web applications or used concurrently by multiple processes. It achieves this by storing the dataset collection in a SQL database using JSON columns.

Design choices
--------------

_dataset_ and _datasetd_ are intended to be simple tools for managing collections of JSON object documents in a predictable, structured way.

_dataset_ is guided by the idea that you should be able to work with JSON documents as easily as you can any plain text document on the Unix command line. _dataset_ is intended to be simple to use with minimal setup (e.g. `dataset init mycollection.ds` creates a new collection called 'mycollection.ds').

  • _dataset_ and _datasetd_ store JSON object documents in collections. The storage of the JSON documents differs.
  • dataset collections are defined in a directory containing a collection.json file
  • collection.json is a metadata file describing the collection, e.g. storage type, name, description, if versioning is enabled
  • collection objects are accessed by their key which is case insensitive
  • collection names are usually lowercase and usually have a `.ds` extension for easy identification; the collection directory name must be lowercase

_dataset_ stores JSON object documents in a pairtree

  • the pairtree path is always lowercase
  • a pairtree of JSON object documents
  • non-JSON attachments can be associated with a JSON document and found in a directories organized by semver (semantic version number)
  • versioned JSON documents are created in a sub directory incorporating a semver

_datasetd_ stores JSON object documents in a table named for the collection

  • objects are versioned into a collection history table by semver and key
  • attachments are not supported
  • can be exported to a collection using pairtree storage (e.g. a zip file will be generated holding a pairtree representation of the collection)

The choice of plain UTF-8 is intended to help future proof reading dataset collections. Care has been taken to keep _dataset_ simple enough and lightweight enough that it will run on a machine as small as a Raspberry Pi Zero while being equally comfortable on a more resource rich server or desktop environment. _dataset_ can be re-implemented in any programming language supporting file input and output, common string operations, and JSON encoding and decoding functions. The current implementation is in the Go language.

Features
--------

_dataset_ supports

- Initialize a new dataset collection

  • Define metadata about the collection using a codemeta.json file
  • Define a keys file holding a list of allocated keys in the collection
  • Creates a pairtree for object storage

- Listing _keys_ in a collection
- Object level actions

  • create
  • read
  • update
  • delete
  • Documents as attachments
  • attachments (list)
  • attach (create/update)
  • retrieve (read)
  • prune (delete)
  • The ability to create data frames from whole collections or based on key lists
  • frames are defined using a list of keys and a list of "dot paths" describing what is to be pulled out of the stored JSON objects and into the frame
  • frame level actions
  • frames, list the frame names in the collection
  • frame, define a frame, does not overwrite an existing frame with the same name
  • frame-def, show the frame definition (in case we need it for some reason)
  • frame-objects, return a list of objects in the frame
  • refresh, using the current frame definition reload all the objects in the frame
  • reframe, replace the frame definition then reload the objects in the frame using the old frame key list
  • has-frame, check to see if a frame exists
  • delete-frame remove the frame

_datasetd_ supports

- List collections available from the web service
- List or update a collection's metadata
- List a collection's keys
- Object level actions

  • create
  • read
  • update
  • delete
  • Documents as attachments
  • attachments (list)
  • attach (create/update)
  • retrieve (read)
  • prune (delete)
  • A means of importing to or exporting from pairtree based dataset collections
  • The ability to create data frames from whole collections or based on key lists
  • frames are defined using "dot paths" describing what is to be pulled out of the stored JSON objects

Both _dataset_ and _datasetd_ may be useful for general data science applications needing JSON object management or in implementing repository systems in research libraries and archives.

Limitations of _dataset_ and _datasetd_
-------------------------------------------

_dataset_ has many limitations, some of which are listed below

  • the pairtree implementation is not a multi-process, multi-user data store
  • it is not a general purpose database system
  • it stores all keys in lower case in order to deal with file systems that are not case sensitive, a compatibility requirement of pairtrees
  • it stores collection names as lower case to deal with file systems that are not case sensitive
  • it does not have a built-in query language, search or sorting
  • it should NOT be used for sensitive or secret information

_datasetd_ is a simple web service intended to run on "localhost:8485".

  • it is a RESTful service
  • it does not include support for authentication
  • it does not support a query language, search or sorting
  • it does not support access control by users or roles
  • it does not provide auto key generation
  • it limits the size of JSON documents stored to the size supported by the host SQL database's JSON columns
  • it limits the size of attached files to less than 250 MiB
  • it does not support partial JSON record updates or retrieval
  • it does not provide an interactive Web UI for working with dataset collections
  • it does not support HTTPS or "at rest" encryption
  • it should NOT be used for sensitive or secret information

Authors and history
-------------------

- R. S. Doiel
- Tommy Morrell

Index

Constants

const (

	// PTSTORE describes the storage type using a pairtree
	PTSTORE = "pairtree"

	// SQLSTORE describes the SQL storage type
	SQLSTORE = "sqlstore"
)
const (

	// License is a formatted form for dataset package based command line tools
	License = `` /* 1545-byte string literal not displayed */

)
const Version = "2.0.0-a2"

Version of package

Variables

This section is empty.

Functions

func Analyzer

func Analyzer(cName string, verbose bool) error

Analyzer checks the collection version and analyzes the current state of the collection, reporting on errors.

NOTE: the collection MUST BE CLOSED when Analyzer is called otherwise the results will not be accurate.
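
A minimal usage sketch (the collection name is illustrative):

```

if err := dataset.Analyzer("my_collection.ds", true); err != nil {
   // ... handle error
}

```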

func DecodeJSON

func DecodeJSON(src []byte, obj *map[string]interface{}) error

DecodeJSON provides a common method for decoding data for use in Dataset.

```

obj := map[string]interface{}{}
if err := DecodeJSON(src, &obj); err != nil {
   ...
}

```

func EncodeJSON

func EncodeJSON(obj map[string]interface{}) ([]byte, error)

EncodeJSON provides a common method for encoding data for use in Dataset.

```

src, err := EncodeJSON(obj)
if err != nil {
   ...
}

```

func FixMissingCollectionJson

func FixMissingCollectionJson(cName string) error

FixMissingCollectionJson will scan the collection directory and environment, making an educated guess as to the collection's storage type in order to restore a missing collection.json file.
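
A minimal usage sketch (the collection name is illustrative):

```

if err := dataset.FixMissingCollectionJson("my_collection.ds"); err != nil {
   // ... handle error
}

```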

func Migrate

func Migrate(srcName string, dstName string, verbose bool) error

Migrate a dataset v1 collection to a v2 collection. Both collections need to already exist. Records from v1 will be read out of v1 and created in v2.

NOTE: Migrate does not currently copy attachments.
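
A minimal usage sketch (the collection names are illustrative; both collections must already exist):

```

if err := dataset.Migrate("my_v1_collection.ds", "my_v2_collection.ds", true); err != nil {
   // ... handle error
}

```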

func Repair

func Repair(cName string, verbose bool) error

Repair takes a collection name, walks the pairtree and repairs collection.json as appropriate.

NOTE: the collection MUST BE CLOSED when repair is called otherwise the repaired collection may revert.
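
A minimal usage sketch (the collection name is illustrative; the collection must be closed):

```

if err := dataset.Repair("my_collection.ds", true); err != nil {
   // ... handle error
}

```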

Types

type Attachment

type Attachment struct {
	// Name is the filename and path to be used inside the generated tar file
	Name string `json:"name"`

	// Size remains to help us migrate pre v0.0.61 collections.
	// It should reflect the last size added.
	Size int64 `json:"size"`

	// Sizes is the sizes associated with the version being attached
	Sizes map[string]int64 `json:"sizes"`

	// Current holds the semver to the last added version
	Version string `json:"version"`

	// Checksum, currently implemented as an MD5 checksum for now
	// You should have one checksum per attached version.
	Checksums map[string]string `json:"checksums"`

	// HRef points at last attached version of the attached document
	// If you moved an object out of the pairtree it should be a URL.
	HRef string `json:"href"`

	// VersionHRefs is a map to all versions of the attached document
	// {
	//    "0.0.0": "... /photo.png",
	//    "0.0.1": "... /photo.png",
	//    "0.0.2": "... /photo.png"
	// }
	VersionHRefs map[string]string `json:"version_hrefs"`

	// Created a date string in RFC3339 format
	Created string `json:"created"`

	// Modified a date string in RFC3339 format
	Modified string `json:"modified"`

	// Metadata is a map for application specific metadata about attachments.
	Metadata map[string]interface{} `json:"metadata,omitempty"`
}

Attachment is a structure for holding non-JSON content metadata you wish to store alongside a JSON document in a collection. Attachments reside in their own pairtree within the collection directory (even when using a SQL store for the JSON documents). The attachment metadata is read as needed from disk where the collection folder resides.

type Collection

type Collection struct {
	// DatasetVersion of the collection
	DatasetVersion string `json:"dataset,omitempty"`

	// Name of collection
	Name string `json:"name"`

	// StoreType can be either "pairtree" (default or if attribute is
	// omitted) or "sqlstore".  If sqlstore the connection string, DSN URI,
	// will determine the type of SQL database being accessed.
	StoreType string `json:"storage_type,omitempty"`

	// DsnURI holds protocol plus dsn string. The protocol can be
	// "sqlite://", "mysql://" and the dsn conforming to the Golang
	// database/sql driver name in the database/sql package.
	DsnURI string `json:"dsn_uri,omitempty"`

	// Created
	Created string `json:"created,omitempty"`

	// Repaired
	Repaired string `json:"repaired,omitempty"`

	// PTStore points to the pairtree implementation of storage
	PTStore *ptstore.Storage `json:"-"`
	// SQLStore points to a SQL database with JSON column support
	SQLStore *sqlstore.Storage `json:"-"`

	// Versioning holds the type of versioning implemented in the collection.
	// It can be set to an empty string (the default) which means no versioning.
	// It can be set to "patch" which means objects and attachments are versioned by
	// a semver patch value (e.g. 0.0.X where X is incremented), "minor" where
	// the semver minor value is incremented (e.g. 0.X.0 where X is incremented),
	// or "major" where the semver major value is incremented (e.g. X.0.0 where X is
	// incremented). Versioning affects storage of JSON objects and their attachments
	// across the whole collection.
	Versioning string `json:"versioning,omitempty"`
	// contains filtered or unexported fields
}

Collection holds the operational metadata for collection level operations on collections of JSON objects. General metadata is stored in a codemeta.json file in the root directory along side the collection.json file.
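
A short sketch showing how the exported fields can be inspected after opening an existing collection (the collection name is illustrative):

```

c, err := dataset.Open("my_collection.ds")
if err != nil {
   // ... handle error
}
defer c.Close()
// StoreType holds one of the storage constants, e.g. PTSTORE or SQLSTORE.
if c.StoreType == dataset.PTSTORE {
   fmt.Printf("%q is a pairtree collection, versioning: %q\n", c.Name, c.Versioning)
}

```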

func Init

func Init(name string, dsnURI string) (*Collection, error)

Init - creates a new collection and opens it. It takes a name (e.g. the directory holding the collection.json and codemeta.json files) and an optional DSN in URI form. The default storage engine is a pairtree (i.e. PTSTORE) but some SQL storage engines are supported.

If the DSN URI is a non-empty string then the SQL storage engine is used. The database and user access in the SQL engine need to be set up before you can successfully initialize your dataset collection. Currently two SQL database engines are supported, SQLite3 and MySQL 8. You select the SQL storage engine by forming a URI consisting of a "protocol" (e.g. "sqlite", "mysql"), the protocol delimiter "://" and a Go SQL supported DSN based on the database driver implementation.

A MySQL 8 DSN URI would look something like

`mysql://DB_USER:DB_PASSWD@PROTOCOL_EXPR/DB_NAME`

The one for SQLite3

`sqlite://PATH_TO_DATABASE`

NOTE: The DSN URI is stored in the collection.json file. The file should NOT be world readable as that will expose your database password. You can remove the DSN URI after initializing your collection but will then need to provide the DATASET_DSN_URI environment variable so you can open your database successfully.

For PTSTORE the access value can be left blank.

```

var (
   c *Collection
   err error
)
name := "my_collection.ds"
c, err = dataset.Init(name, "")
if err != nil {
  // ... handle error
}
defer c.Close()

```

For a sqlstore collection we need to pass the "access" value, a DSN URI (or environment variables forming a DSN).

```

var (
   c *Collection
   err error
)
name := "my_collection.ds"
dsnURI := "sqlite://my_collection.ds/collection.db"
c, err = dataset.Init(name, dsnURI)
if err != nil {
  // ... handle error
}
defer c.Close()

```

func Open

func Open(name string) (*Collection, error)

Open reads in a collection's operational metadata and returns a new collection structure and error value.

```

var (
   c *dataset.Collection
   err error
)
c, err = dataset.Open("collection.ds")
if err != nil {
   // ... handle error
}
defer c.Close()

```

func (*Collection) AttachFile

func (c *Collection) AttachFile(key string, filename string) error

```

key, filename := "123", "report.pdf"
err := c.AttachFile(key, filename)
if err != nil {
   ...
}

```

func (*Collection) AttachStream

func (c *Collection) AttachStream(key string, filename string, buf io.Reader) error

AttachStream is for attaching a non-JSON file via an io.Reader. It requires the JSON document key, the filename and an io.Reader. It does not close the reader. If the collection is versioned then the attached document is automatically versioned per the collection versioning setting.

Example: attach the file "report.pdf" to JSON document "123"
in an open collection.

```

key, filename := "123", "report.pdf"
buf, err := os.Open(filename)
if err != nil {
   ...
}
err := c.AttachStream(key, filename, buf)
if err != nil {
   ...
}
buf.Close()

```

func (*Collection) AttachVersionFile

func (c *Collection) AttachVersionFile(key string, filename string, version string) error

AttachVersionFile attaches a file to a JSON document in the collection. This does NOT increment the version number of the attachment(s). It is used to explicitly replace an attached version of a file. It does not update the symbolic link to the "current" attachment.

```

key, filename, version := "123", "report.pdf", "0.0.3"
err := c.AttachVersionFile(key, filename, version)
if err != nil {
   ...
}

```

func (*Collection) AttachVersionStream

func (c *Collection) AttachVersionStream(key string, filename string, version string, buf io.Reader) error

AttachVersionStream is for attaching a non-JSON file buffer (via an io.Reader) to a specific version of an attached file. If the attached file exists it is replaced.

Example: attach the file "report.pdf", version "0.0.3" to
JSON document "123" in an open collection.

```

key, filename, version := "123", "helloworld.txt", "0.0.3"
buf, err := os.Open(filename)
if err != nil {
   ...
}
err := c.AttachVersionStream(key, filename, version, buf)
if err != nil {
   ...
}
buf.Close()

```

func (*Collection) AttachmentPath

func (c *Collection) AttachmentPath(key string, filename string) (string, error)

AttachmentPath takes a key and filename and returns the file system path to the attached file (if found). For versioned collections this is the path of the symbolic link for the "current" version.

```

key, filename := "123", "report.pdf"
docPath, err := c.AttachmentPath(key, filename)
if err != nil {
   ...
}

```

func (*Collection) AttachmentVersionPath

func (c *Collection) AttachmentVersionPath(key string, filename string, version string) (string, error)

AttachmentVersionPath takes a key, filename and semver returning the path to the attached versioned file (if found).

```

key, filename, version := "123", "report.pdf", "0.0.3"
docPath, err := c.AttachmentVersionPath(key, filename, version)
if err != nil {
   ...
}

```

func (*Collection) AttachmentVersions

func (c *Collection) AttachmentVersions(key string, filename string) ([]string, error)

AttachmentVersions returns a list of versions for an attached file to a JSON document in the collection.

Example: retrieve a list of versions of an attached file.
"key" is a key in the collection, filename is name of an
attached file for the JSON document referred to by key.

```

versions, err := c.AttachmentVersions(key, filename)
if err != nil {
   ...
}
for _, version := range versions {
   fmt.Printf("key: %q, filename: %q, version: %q", key, filename, version)
}

```

func (*Collection) Attachments

func (c *Collection) Attachments(key string) ([]string, error)

Attachments returns a list of filenames for a key name in the collection

Example: "c" is a dataset collection previously opened,
"key" is a string.  The "key" is for a JSON document in
the collection. It returns a slice of filenames and an error value.

```

filenames, err := c.Attachments(key)
if err != nil {
   ...
}
// Print the names of the files attached to the JSON document
// referred to by "key".
for _, filename := range filenames {
   fmt.Printf("key: %q, filename: %q", key, filename)
}

```

func (*Collection) Clone

func (c *Collection) Clone(cloneName string, cloneDsnURI string, keys []string, verbose bool) error

Clone initializes a new collection based on the list of keys provided. If the keys list is empty all the objects are copied from one collection to the other. The collections do not need to be the same storage type.

NOTE: The cloned copy is not open after cloning is complete.

```

newName, dsnURI :=
   "new-collection.ds", "sqlite://new-collection.ds/collection.db"
c, err := dataset.Open("old-collection.ds")
if err != nil {
    ... // handle error
}
defer c.Close()
err = c.Clone(newName, dsnURI, []string{}, false)
if err != nil {
    ... // handle error
}

```

func (*Collection) CloneSample

func (c *Collection) CloneSample(trainingName string, trainingDsnURI string, testName string, testDsnURI string, keys []string, sampleSize int, verbose bool) error

CloneSample initializes two new collections based on a training and test sampling of the keys in the original collection. If the keys list is empty all the objects are used for creating the training and test sample collections. The collections do not need to be the same storage type.

NOTE: The cloned copy is not open after cloning is complete.

```

trainingSetSize := 10000
trainingName, trainingDsnURI :=
   "training.ds", "sqlite://training.ds/collection.db"
testName, testDsnURI := "test.ds", "sqlite://test.ds/collection.db"
c, err := dataset.Open("old-collection")
if err != nil {
    ... // handle error
}
defer c.Close()
err = c.CloneSample(trainingName, trainingDsnURI,
                    testName, testDsnURI, []string{},
                    trainingSetSize, false)
if err != nil {
    ... // handle error
}

```

func (*Collection) Close

func (c *Collection) Close() error

Close closes a collection. For a pairtree that means flushing the keymap to disk. For a SQL store it means closing a database connection. Close is often called in conjunction with "defer" keyword.

```

c, err := dataset.Open("my_collection.ds")
if err != nil { /* .. handle error ... */ }
/* do some stuff with the collection */
defer func() {
  if err := c.Close(); err != nil {
     /* ... handle closing error ... */
  }
}()

```

func (*Collection) Codemeta

func (c *Collection) Codemeta() ([]byte, error)

Codemeta returns a copy of the codemeta.json file content found in the collection directory. The collection must be previously opened.

```

name := "my_collection.ds"
c, err := dataset.Open(name)
if err != nil {
   ...
}
defer c.Close()
src, err := c.Codemeta()
if err != nil {
   ...
}
ioutil.WriteFile("codemeta.json", src, 0664)

```

func (*Collection) Create

func (c *Collection) Create(key string, obj map[string]interface{}) error

Create stores an object in the collection. The object will get converted to JSON source then stored. The collection must be open. A Go `map[string]interface{}` is a common way to handle ad-hoc JSON data in Go. Use `CreateObject()` to store structured data.

```

key := "123"
obj := map[string]interface{}{ "one": 1, "two": 2 }
if err := c.Create(key, obj); err != nil {
   ...
}

```

func (*Collection) CreateObject

func (c *Collection) CreateObject(key string, obj interface{}) error

CreateObject is used to store structured data in a dataset collection. The object needs to be defined as a Go struct annotated appropriately with JSON struct tags.

```

import (
  "fmt"
  "os"

  // The dataset import path is assumed here; depending on how the module
  // is consumed it may need a major version suffix (e.g. ".../v2").
  "github.com/caltechlibrary/dataset"
)

type Record struct {
    ID string `json:"id"`
    Name string `json:"name,omitempty"`
    EMail string `json:"email,omitempty"`
}

func main() {
    c, err := dataset.Open("friends.ds")
    if err != nil {
         fmt.Fprintf(os.Stderr, "%s", err)
         os.Exit(1)
    }
    defer c.Close()

    obj := &Record{
        ID: "mojo",
        Name: "Mojo Sam",
        EMail: "mojo.sam@cosmic-cafe.example.org",
    }
    if err := c.CreateObject(obj.ID, obj); err != nil {
         fmt.Fprintf(os.Stderr, "%s", err)
         os.Exit(1)
    }
    fmt.Printf("OK\n")
    os.Exit(0)
}

```

func (*Collection) Delete

func (c *Collection) Delete(key string) error

Delete removes an object from the collection. If the collection is versioned then all versions are deleted. Any attachments to the JSON document are also deleted including any versioned attachments.

```

key := "123"
if err := c.Delete(key); err != nil {
   // ... handle error
}

```

func (*Collection) FrameClear

func (c *Collection) FrameClear(name string) error

FrameClear empties the frame's object and key lists but leaves in place the Frame definition. Use Reframe() to re-populate a frame based on a new key list.

```

frameName := "journals"
err := c.FrameClear(frameName)
if err != nil  {
   ...
}

```

func (*Collection) FrameCreate

func (c *Collection) FrameCreate(name string, keys []string, dotPaths []string, labels []string, verbose bool) (*DataFrame, error)

FrameCreate takes a set of collection keys, dot paths and labels, builds an ObjectList and assembles additional metadata, returning a new Frame associated with the collection as well as an error value. If there is a mis-match in the number of labels and dot paths an error will be returned. If the frame already exists an error will be returned.

Conceptually a frame is an ordered list of objects. Frames are associated with a collection and the objects in a frame can easily be refreshed. Frames also serve as the basis for indexing a dataset collection and provide the data paths (expressed as a list of "dot paths"), labels (aka attribute names), and type information needed for indexing and search.

If you need to update a frame's objects use FrameRefresh(). If you need to change a frame's objects or ordering use FrameReframe().

```

frameName := "journals"
keys := []string{ "123", "124", "125" }
dotPaths := []string{ ".title", ".description" }
labels := []string{ "Title", "Description" }
verbose := true
frame, err := c.FrameCreate(frameName, keys, dotPaths, labels, verbose)
if err != nil {
   ...
}

```

func (*Collection) FrameDef

func (c *Collection) FrameDef(name string) (map[string]interface{}, error)

FrameDef retrieves the frame definition, returning a map[string]interface{} and an error value.

```

definition := map[string]interface{}{}
frameName := "journals"
definition, err := c.FrameDef(frameName)
if err != nil {
   ..
}

```

func (*Collection) FrameDelete

func (c *Collection) FrameDelete(name string) error

FrameDelete removes a frame from a collection, returns an error if frame can't be deleted.

```

frameName := "journals"
err := c.FrameDelete(frameName)
if err != nil {
   ...
}

```

func (*Collection) FrameKeys

func (c *Collection) FrameKeys(name string) []string

FrameKeys retrieves a list of keys associated with a data frame

```

frameName := "journals"
keys := c.FrameKeys(frameName)

```

func (*Collection) FrameNames

func (c *Collection) FrameNames() []string

FrameNames retrieves a list of available frame names associated with a collection.

```

frameNames := c.FrameNames()
for _, name := range frameNames {
   // do something with each frame name
   objects, err := c.FrameObjects(name)
   ...
}

```

func (*Collection) FrameObjects

func (c *Collection) FrameObjects(fName string) ([]map[string]interface{}, error)

FrameObjects returns a copy of a DataFrame's object list given a collection's frame name.

```

var (
  err error
  objects []map[string]interface{}
)
frameName := "journals"
objects, err = c.FrameObjects(frameName)
if err != nil  {
   ...
}

```

func (*Collection) FrameRead

func (c *Collection) FrameRead(name string) (*DataFrame, error)

FrameRead retrieves a frame from a collection. Returns the DataFrame and an error value

```

frameName := "journals"
data, err := c.FrameRead(frameName)
if err != nil {
   ..
}

```

func (*Collection) FrameReframe

func (c *Collection) FrameReframe(name string, keys []string, verbose bool) error

FrameReframe **replaces** a frame's object list based on the keys provided. It uses the frame's existing definition.

```

frameName, verbose := "journals", false
keys := ...
err := c.FrameReframe(frameName, keys, verbose)
if err != nil {
   ...
}

```

func (*Collection) FrameRefresh

func (c *Collection) FrameRefresh(name string, verbose bool) error

FrameRefresh updates a DataFrame's object list based on the existing keys in the frame. It doesn't change the order of objects. It is used when objects in a collection that are included in the frame have been updated. It uses the frame's existing definition.

NOTE: If an object is missing in the collection it gets pruned from the object list.

```

frameName, verbose := "journals", true
err := c.FrameRefresh(frameName, verbose)
if err != nil {
   ...
}

```

func (*Collection) HasFrame

func (c *Collection) HasFrame(frameName string) bool

HasFrame checks if a frame is defined already. The collection needs to have been opened previously.

```

frameName := "journals"
if c.HasFrame(frameName) {
   ...
}

```

func (*Collection) HasKey

func (c *Collection) HasKey(key string) bool

HasKey takes a collection and checks if a key exists. NOTE: collection must be open otherwise false will always be returned.

```

key := "123"
if c.HasKey(key) {
   ...
}

```

func (*Collection) Keys

func (c *Collection) Keys() ([]string, error)

Keys returns an array of strings holding all the keys in the collection.

```

keys, err := c.Keys()
for _, key := range keys {
   ...
}

```

func (*Collection) Length

func (c *Collection) Length() int64

Length returns the number of objects in a collection. NOTE: Returns -1 (as int64) on error, e.g. collection not open or Length not available for the storage type.

```

var x int64
x = c.Length()

```

func (*Collection) ObjectList

func (c *Collection) ObjectList(keys []string, dotPaths []string, labels []string, verbose bool) ([]map[string]interface{}, error)

ObjectList (on a collection) takes a set of collection keys and builds an ordered array of objects from the array of keys, dot paths and labels provided.

```

var mapList []map[string]interface{}

keys := []string{ "123", "124", "125" }
dotPaths := []string{ ".title", ".description" }
labels := []string{ "Title", "Description" }
verbose := true
mapList, err = c.ObjectList(keys, dotPaths, labels, verbose)

```

func (*Collection) Prune

func (c *Collection) Prune(key string, filename string) error

Prune removes an attached document from the JSON record given a key and filename. NOTE: In versioned collections this includes removing all versions of the attached document.

```

key, filename := "123", "report.pdf"
err := c.Prune(key, filename)
if err != nil {
   ...
}

```

func (*Collection) PruneAll

func (c *Collection) PruneAll(key string) error

PruneAll removes all attachments from a JSON record in the collection. When the collection is versioned it removes all versions of all attachments as well.

```

key := "123"
err := c.PruneAll(key)
if err != nil {
   ...
}

```

func (*Collection) PruneVersion

func (c *Collection) PruneVersion(key string, filename string, version string) error

PruneVersion removes an attached version of a document.

```

key, filename, version := "123", "report.pdf", "0.0.3"
err := c.PruneVersion(key, filename, version)
if err != nil {
   ...
}

```

func (*Collection) Read

func (c *Collection) Read(key string, obj map[string]interface{}) error

Read retrieves a JSON document from the collection by key, unmarshals it and updates the map[string]interface{} passed in.

```

obj := map[string]interface{}{}

key := "123"
if err := c.Read(key, obj); err != nil {
   ...
}

```

func (*Collection) ReadObject

func (c *Collection) ReadObject(key string, obj interface{}) error

ReadObject retrieves structured data via Go's general interface{} type. The JSON document is retrieved from the collection, unmarshaled, and the variable holding the struct is updated.

```

type Record struct {
    ID string `json:"id"`
    Name string `json:"name,omitempty"`
    EMail string `json:"email,omitempty"`
}

// ...

obj := &Record{}

key := "123"
if err := c.ReadObject(key, obj); err != nil {
   // ... handle error
}

```

func (*Collection) ReadObjectVersion

func (c *Collection) ReadObjectVersion(key string, version string, obj interface{}) error

ReadObjectVersion retrieves a specific version of the given object from the collection.

```

type Record struct {
    // ... structure def goes here.
}

obj := &Record{}

key, version := "123", "0.0.1"
if err := c.ReadObjectVersion(key, version, obj); err != nil {
   ...
}

```

func (*Collection) ReadVersion

func (c *Collection) ReadVersion(key string, version string, obj map[string]interface{}) error

ReadVersion retrieves a specific version of the given object from the collection.

```

obj := map[string]interface{}{}

key, version := "123", "0.0.1"
if err := c.ReadVersion(key, version, obj); err != nil {
   ...
}

```

func (*Collection) RetrieveFile

func (c *Collection) RetrieveFile(key string, filename string) ([]byte, error)

RetrieveFile retrieves a file attached to a JSON document in the collection.

```

key, filename := "123", "report.pdf"
src, err := c.RetrieveFile(key, filename)
if err != nil {
   ...
}
err = ioutil.WriteFile(filename, src, 0664)
if err != nil {
   ...
}

```

func (*Collection) RetrieveStream

func (c *Collection) RetrieveStream(key string, filename string, out io.Writer) error

RetrieveStream takes a key, a filename and an io.Writer, writes the attached file's content to the writer and returns an error value. If the collection is versioned then the stream is for the "current" version of the attached file.

```

key, filename := "123", "report.pdf"
buf := bytes.NewBuffer(nil)
err := c.RetrieveStream(key, filename, buf)
if err != nil {
   ...
}
ioutil.WriteFile(filename, buf.Bytes(), 0664)

```

func (*Collection) RetrieveVersionFile

func (c *Collection) RetrieveVersionFile(key string, filename string, version string) ([]byte, error)

RetrieveVersionFile retrieves a file version attached to a JSON document in the collection.

```

key, filename, version := "123", "report.pdf", "0.0.3"
src, err := c.RetrieveVersionFile(key, filename, version)
if err != nil  {
   ...
}
err = ioutil.WriteFile(filename + "_" + version, src, 0664)
if err != nil {
   ...
}

```

func (*Collection) RetrieveVersionStream

func (c *Collection) RetrieveVersionStream(key string, filename string, version string, buf io.Writer) error

RetrieveVersionStream takes a key, filename, version and an io.Writer, writes that version of the attached file's content to the writer and returns an error value.

```

key, filename, version := "123", "helloworld.txt", "0.0.3"
buf := bytes.NewBuffer(nil)
err := c.RetrieveVersionStream(key, filename, version, buf)
if err != nil {
   ...
}
ioutil.WriteFile(filename + "_" + version, buf.Bytes(), 0664)

```

func (*Collection) Sample

func (c *Collection) Sample(size int) ([]string, error)

Sample takes a sample size and returns a list of randomly selected keys and an error. The sample size must be greater than zero and less than or equal to the number of keys in the collection. The collection needs to have been opened previously.

```

sampleSize := 1000
keys, err := c.Sample(sampleSize)
if err != nil {
   ...
}

```

func (*Collection) SaveFrame

func (c *Collection) SaveFrame(name string, f *DataFrame) error

SaveFrame saves a frame in a collection or returns an error.

```

frameName := "journals"
data, err := c.FrameRead(frameName)
if err != nil {
   ...
}
// do stuff with the frame's data
   ...
// Save the changed frame data
err = c.SaveFrame(frameName, data)

```

func (*Collection) SetVersioning

func (c *Collection) SetVersioning(versioning string) error

SetVersioning sets the versioning on a collection. The version string can be "major", "minor", "patch". Any other value (e.g. "", "off", "none") will turn off versioning for the collection.
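
A minimal sketch of toggling versioning (assuming c is an already open collection):

```

// Enable patch level versioning
if err := c.SetVersioning("patch"); err != nil {
   ...
}
// Turn versioning back off
if err := c.SetVersioning(""); err != nil {
   ...
}

```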

func (*Collection) Update

func (c *Collection) Update(key string, obj map[string]interface{}) error

Update replaces a JSON document in the collection with a new one. If the collection is versioned then it creates a new versioned copy and updates the "current" version to use it.

```

key := "123"
// obj is a map[string]interface{} previously read from the collection
obj["three"] = 3
if err := c.Update(key, obj); err != nil {
   ...
}

```

func (*Collection) UpdateMetadata

func (c *Collection) UpdateMetadata(fName string) error

UpdateMetadata imports new codemeta citation information into the collection, replacing the previous version. The collection must be open.

```

name := "my_collection.ds"
codemetaFilename := "../codemeta.json"
c, err := dataset.Open(name)
if err != nil {
   ...
}
defer c.Close()
if err := c.UpdateMetadata(codemetaFilename); err != nil {
   ...
}

```

func (*Collection) UpdateObject

func (c *Collection) UpdateObject(key string, obj interface{}) error

UpdateObject replaces a JSON document in the collection with a new one. If the collection is versioned then it creates a new versioned copy and updates the "current" version to use it.

```

type Record struct {
    // ... structure def goes here.
    Three int `json:"three"`
}

key := "123"
obj := &Record{
  Three: 3,
}
if err := c.UpdateObject(key, obj); err != nil {
   // ... handle error
}

```

func (*Collection) Versions

func (c *Collection) Versions(key string) ([]string, error)

Versions retrieves a list of versions available for a JSON document if versioning is enabled for the collection.

```

key := "123"
versions, err := c.Versions(key)
if err != nil {
   ...
}

```

func (*Collection) WorkPath

func (c *Collection) WorkPath() string

WorkPath returns the working path to the collection.
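
A minimal sketch (assuming c is an already open collection):

```

workPath := c.WorkPath()
fmt.Printf("collection is stored at %s\n", workPath)

```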

type DataFrame

type DataFrame struct {
	// Explicit at creation
	Name string `json:"frame_name"`

	// CollectionName holds the name of the collection the frame was generated from. In theory you could
	// define a frame in one collection and use its results in another. A DataFrame can be rendered as a JSON
	// document.
	CollectionName string `json:"collection_name"`

	// DotPaths is a slice holding the definitions of what each Object attribute's data source is.
	DotPaths []string `json:"dot_paths"`

	// Labels are new attribute names for fields created from the provided
	// DotPaths.  Typically this is used to surface a deeper dotpath's
	// value as something more useful in the frame's context (e.g.
	// first_title from an array of titles might be labeled "title")
	Labels []string `json:"labels"`

	// NOTE: Keys is an ordered list of object keys in the frame.
	Keys []string `json:"keys"`

	// NOTE: ObjectMap provides a quick lookup of an object by its key.
	ObjectMap map[string]interface{} `json:"object_map"`

	// Created is the date the frame was originally generated and defined
	Created time.Time `json:"created"`

	// Updated is the date the frame was last updated (e.g. reframed)
	Updated time.Time `json:"updated"`
}

DataFrame is the basic structure holding a list of objects as well as the definition of the list (so you can regenerate an updated list from a changed collection). It persists with the collection.
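
A minimal sketch inspecting a frame's definition after reading it from a collection (assuming c is an open collection with a frame named "journals"):

```

frameName := "journals"
f, err := c.FrameRead(frameName)
if err != nil {
   ...
}
fmt.Printf("frame %q has %d keys, dot paths %v, labels %v\n",
    f.Name, len(f.Keys), f.DotPaths, f.Labels)

```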

func (*DataFrame) Grid

func (f *DataFrame) Grid(includeHeaderRow bool) [][]interface{}

Grid returns a table representation of a DataFrame's object list.

```

frameName, includeHeader := "journals", true
data, err := c.FrameRead(frameName)
if err != nil {
   ...
}
rows := data.Grid(includeHeader)
... /* now do something with the rows */ ...

```

func (*DataFrame) Objects

func (f *DataFrame) Objects() []map[string]interface{}

Objects returns a copy of the DataFrame's object list (an array of map[string]interface{}).

```

frameName := "journals"
data, err := c.FrameRead(frameName)
if err != nil {
   ...
}
objectList := data.Objects()

```

func (*DataFrame) String

func (f *DataFrame) String() string

String renders the DataFrame data structure as JSON and returns it as a string.

```

frameName := "journals"
data, err := c.FrameRead(frameName)
if err != nil {
   ...
}
fmt.Printf("\n%s\n", data.String())

```

type StorageSystem

type StorageSystem interface {

	// Open opens the storage system and returns a storage struct and an
	// error value. It is passed a name; for a pairtree that would be the
	// path to collection.json and for a SQL store a file holding a DSN.
	//
	//  store, err := c.Store.Open(c.Access)
	//  if err != nil {
	//     ...
	//  }
	//
	Open(name string, dsnURI string) (*StorageSystem, error)

	// Close closes the storage system freeing resources as needed.
	//
	//   if err := storage.Close(); err != nil {
	//      ...
	//   }
	//
	Close() error

	// Create stores a new JSON object in the collection
	// It takes a string as a key and a byte slice of encoded JSON
	//
	//   err := storage.Create("123", []byte(`{"one": 1}`))
	//   if err != nil {
	//      ...
	//   }
	//
	Create(string, []byte) error

	// Read takes a string as a key and returns the encoded
	// JSON document from the collection
	//
	//   src, err := storage.Read("123")
	//   if err != nil {
	//      ...
	//   }
	//   obj := map[string]interface{}{}
	//   if err := json.Unmarshal(src, &obj); err != nil {
	//      ...
	//   }
	Read(string) ([]byte, error)

	// Versions returns a list of semver formatted version strings available for a JSON object
	Versions(string) ([]string, error)

	// ReadVersion takes a key and semver version string and return that version of the
	// JSON object.
	ReadVersion(string, string) ([]byte, error)

	// Update takes a key and encoded JSON object and updates a
	// JSON document in the collection.
	//
	//   key := "123"
	//   src := []byte(`{"one": 1, "two": 2}`)
	//   if err := storage.Update(key, src); err != nil {
	//      ...
	//   }
	//
	Update(string, []byte) error

	// Delete removes all versions and attachments of a JSON document.
	//
	//   key := "123"
	//   if err := storage.Delete(key); err != nil {
	//      ...
	//   }
	//
	Delete(string) error

	// Keys returns all keys in a collection as a slice of strings.
	//
	//   var keys []string
	//   keys, _ = storage.Keys()
	//   /* iterate over the keys retrieved */
	//   for _, key := range keys {
	//      ...
	//   }
	//
	Keys() ([]string, error)

	// HasKey returns true if collection is open and key exists,
	// false otherwise.
	HasKey(string) bool

	// Length returns the number of records in the collection
	Length() int64
}

StorageSystem describes the functions required to implement a dataset storage system. Currently two types of storage systems are supported -- pairtree and SQL storage (via JSON columns, e.g. MySQL 8). If the functions described are not supported by the storage system they must return a "Not Implemented" error value.
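
A minimal sketch of driving a storage implementation directly (the store value below is hypothetical and assumed to satisfy StorageSystem; in normal use the Collection type manages its storage for you):

```

// store is assumed to satisfy the StorageSystem interface.
if err := store.Create("123", []byte(`{"one": 1}`)); err != nil {
   ...
}
src, err := store.Read("123")
if err != nil {
   ...
}
fmt.Printf("%s\n", src)

```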

Directories

Path	Synopsis
api	api is a submodule of dataset.
cli	cli is a submodule of dataset.
cmd/dataset	dataset is a command line tool, Go package, shared library and Python package for working with JSON objects as collections on local disc.
cmd/datasetd	datasetd implements a web service for working with dataset collections.
config	config is a submodule of dataset.
dotpath	dotpath.go provides a convenient way of mapping JSON dot path notation to a nested map structure.
dsv1	dsv1 is a submodule of the dataset package.
tbl	tbl.go provides some utility functions for moving strings into and out of one and two dimensional slices.
pairtree	pairtree.go implements encoding/decoding of object identifiers and pairtree paths per https://confluence.ucop.edu/download/attachments/14254128/PairtreeSpec.pdf?version=2&modificationDate=1295552323000&api=v2
ptstore	ptstore is a submodule of the dataset package.
semver	semver is a semantic version number package used by dataset.
sqlstore	sqlstore is a submodule of the dataset package.
texts	texts is a submodule of dataset.
