Durostore is a persistent file storage library that can be used to serialize data to disk and access it randomly (without having to load all of it into memory). It is optimized for FIFO operations and for controlling how much memory durostore allocates.
Durostore allows random reads and sequential writes in the obvious way: it writes fixed-size entries to an index file that is loaded into memory in its entirety and contains enough information to randomly read the non-fixed-size data from a data file.
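To make that concrete, an index entry might look something like the following (a hypothetical sketch for illustration, not durostore's actual on-disk format):
type indexEntry struct {
	Index  uint64 //the logical index assigned to the item
	Offset int64  //byte offset of the item within the data file
	Size   int64  //size of the item's data in bytes
}
Because every entry is the same (fixed) size, the whole index file can be loaded into memory cheaply, and each entry points at a span of variable-size data in the data file.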
Durostore:
- Is safe for concurrent usage
- Implements file locking using github.com/gofrs/flock
- Provides the ability to read randomly using indexes
- Provides the ability to write one or more items simultaneously
Use cases
Durostore is useful when applying some variation of the producer/consumer design pattern and you want a queue/FIFO that can persist between runs, or a queue that can operate across memory spaces (using the disk as a medium).
Queue
There is a go-queue implementation that wraps durostore to try to get the best of both worlds; see the documentation at github.com/antonio-alexander/go-durostore/queue.
Configuration
Durostore can be configured using the New() function or the Load() function. Keep in mind that New() will panic if the configuration is invalid (or generates an error), while Load() returns an error. Below is a brief description of the configuration items:
- Directory: this is the directory where durostore reads/writes files
- Max Files: this is the maximum number of files to write to
- Max File Size: this is the maximum size of each individual file
- Max Chunk: this is the maximum amount of data to read into memory
- File Prefix: the prefix for all the files generated by durostore
- File Locking: whether or not to use the file locking
- Launch Indexer: whether or not to launch the indexer go routine
- Indexer Rate: how often to re-read the indexes
//Configuration describes the different options to configure
// an instance of durostore
type Configuration struct {
	Directory     string        `json:"directory" yaml:"directory"`           //directory to store files in
	MaxFiles      int           `json:"max_files" yaml:"max_files"`           //maximum number of files to generate
	MaxFileSize   int64         `json:"max_file_size" yaml:"max_file_size"`   //max size of each data file
	MaxChunk      int64         `json:"max_chunk" yaml:"max_chunk"`           //max chunk of data to read into memory at once
	FilePrefix    string        `json:"file_prefix" yaml:"file_prefix"`       //the prefix for files
	FileLocking   bool          `json:"file_locking" yaml:"file_locking"`     //whether or not to use file locking
	LaunchIndexer bool          `json:"launch_indexer" yaml:"launch_indexer"` //whether or not to launch the indexer
	IndexerRate   time.Duration `json:"indexer_rate" yaml:"indexer_rate"`     //how often to read indexes
}
This is an example configuration:
{
	"directory": "./temp",
	"max_files": 10,
	"max_file_size": 52428800,
	"max_chunk": 52428800,
	"file_prefix": "duro",
	"file_locking": true
}
As configured, durostore will create dat and idx files within ./temp; it will create a maximum of 10 files, with each file being 50MB (52428800 bytes), and will hold a maximum of 50MB in memory at a time. The file prefix of duro creates the following files (on initial write):
- duro_lock.lock
- duro_001.idx
- duro_001.dat
- duro_002.idx
- duro_002.dat
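Because the Configuration struct is tagged for JSON (and YAML), you could also read a configuration like the one above from disk rather than hard-coding it; a minimal sketch (the file name durostore.json is an assumption):
import (
	"encoding/json"
	"os"

	"github.com/antonio-alexander/go-durostore"
)

func loadConfig() (durostore.Configuration, error) {
	config := durostore.Configuration{}
	//read the serialized configuration and unmarshal it
	// using the struct's json tags
	bytes, err := os.ReadFile("./durostore.json")
	if err != nil {
		return config, err
	}
	if err := json.Unmarshal(bytes, &config); err != nil {
		return config, err
	}
	return config, nil
}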
Quickstart
An instance of durostore can be created with the following code snippet:
import (
	"fmt"

	"github.com/antonio-alexander/go-durostore"
)

func main() {
	filePrefix := "duro"
	//this will create a pointer and attempt to load it
	// with the given configuration
	store := durostore.New(durostore.Configuration{
		Directory:   "./temp",
		MaxFiles:    10,
		MaxFileSize: durostore.MegaByteToByte(10),
		MaxChunk:    durostore.MegaByteToByte(50),
		FilePrefix:  filePrefix,
	})
	//alternatively, you could create an unconfigured store
	// and load it separately:
	store = durostore.New()
	if err := store.Load(durostore.Configuration{
		Directory:   "./temp",
		MaxFiles:    10,
		MaxFileSize: durostore.MegaByteToByte(10),
		MaxChunk:    durostore.MegaByteToByte(50),
		FilePrefix:  filePrefix,
	}); err != nil {
		fmt.Println(err)
	}
}
Durostore provides the following interfaces to interact with the store; among other things, they provide the ability to sequentially write as well as randomly read and delete data.
//Owner is a set of functions that should only be used
// by the caller of the NewDurostore function
type Owner interface {
	//Close can be used to set the internal pointers to nil and prepare the underlying pointer
	// for garbage collection; the API doesn't guarantee that the store can be re-used
	Close() (err error)
}

//Manage, like Owner, is an interface that should only be used by the caller of the NewDurostore
// function. It can be used to start/stop the business logic.
type Manage interface {
	//Load can be used to configure and initialize the business logic for a store
	Load(config Configuration) (err error)
	//PruneDirectory can be used to completely delete all data on disk while maintaining all in-memory
	// data; this is useful in situations where there's data loss. A single optional index can be
	// provided; if none is provided, -1 is assumed and all files are deleted
	PruneDirectory(indices ...uint64) (err error)
}

//Reader is a subset of the Durostore interface that provides the ability to
// read from the store
type Reader interface {
	//Read will randomly read the data at the given index (a uint64); if no
	// index is provided, the internal read index is used
	Read(indices ...uint64) (bytes []byte, err error)
}

//Writer is a subset of the Durostore interface that provides the ability to
// write to (and delete from) the store
type Writer interface {
	//Write will sequentially write an item to the store and increment the
	// internal index by one; it accepts a slice of bytes or one or more
	// pointers/structs that implement the encoding.BinaryMarshaler interface
	Write(items ...interface{}) (index uint64, err error)
	//Delete will remove the index (but not delete the actual data) of a given
	// index if it exists
	Delete(indices ...uint64) (err error)
}

//Info is a subset of the Durostore interface that provides metadata about the store
type Info interface {
	//Length will return the total number of items available in the store
	Length() (size int)
	//Size will return the total size (in bytes) of the store's data
	Size() (size int64)
}
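For example, a minimal write/read/delete round trip using these interfaces might look like the following (a sketch assuming store was created as in the Quickstart):
//Write accepts bytes (or encoding.BinaryMarshaler implementations)
// and returns the index assigned to the item
index, err := store.Write([]byte("hello"))
if err != nil {
	fmt.Println(err)
}
//Read returns the raw bytes stored at the given index
bytes, err := store.Read(index)
if err != nil {
	fmt.Println(err)
}
fmt.Printf("read %q at index %d\n", bytes, index)
//Delete removes the index once the item has been consumed
if err := store.Delete(index); err != nil {
	fmt.Println(err)
}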
Implementation
Although the API is straightforward in its use, it is complicated through the use of variadics and empty interfaces. To actually write data, you have to provide either a slice of bytes or a pointer that implements encoding.BinaryMarshaler. The read function simply returns a slice of bytes that you can then decode into useful data. The index variadics behave differently depending on the function:
- Read() can only be used to read a single index; if no index is provided, the internal read index is used, which will be the oldest data that hasn't been deleted.
- Delete() can be used to delete one or more indexes; if no index is provided, the internal read index is used, which will be the oldest data that hasn't been deleted.
See example code below:
import (
	"encoding/json"

	"github.com/antonio-alexander/go-durostore"
)

type Data struct {
	Int    int    `json:"Int"`
	String string `json:"String"`
}

func (d *Data) MarshalBinary() ([]byte, error) {
	return json.MarshalIndent(d, "", " ")
}

func (d *Data) UnmarshalBinary(bytes []byte) error {
	return json.Unmarshal(bytes, d)
}

func Read(reader durostore.Reader, index ...uint64) *Data {
	data := &Data{}
	bytes, err := reader.Read(index...)
	if err != nil {
		return nil
	}
	if err := data.UnmarshalBinary(bytes); err != nil {
		return nil
	}
	return data
}
The Read function is worthy of note since it wraps the Read function of the API such that you can read data and decode it into the pointer (using the UnmarshalBinary function).
Memory management
The initial use case for this package was a low-level solution for data persistence that would allow control of memory and CPU usage for garbage collection. It does this through a couple of ideas/rules that are enforced via logic:
- when data is written, it's not stored in memory but committed to disk (there are better solutions if you want a kind of running cache)
- when the store is initially loaded, data is read in (sequentially) from the lowest index to the chunk size (or the highest index if no chunk size is configured)
- data is actively stored as pointers to slices of bytes (to simplify garbage collection and reduce the footprint of the data map)
- all indexes are read into memory
- when data is deleted, the index and data are removed from the maps
- when data is read, a new byte slice is created and data is copied from the internal pointer to that slice, to avoid mingling slices and hindering garbage collection
- garbage collection of the internal maps can be "forced" by using the Load() function, since it'll re-create all the data and maps (see the sketch after this list)
- when you delete all the data in memory and there's still data left on disk (i.e. there are still indexes), it'll load in data up to the chunk size
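As mentioned above, a "forced" garbage collection is just a re-load; a minimal sketch (assuming config is the same Configuration the store was originally loaded with):
//Load re-creates the internal data and index maps from disk, so the
// previous allocations become eligible for garbage collection
if err := store.Load(config); err != nil {
	fmt.Println(err)
}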
Persistent storage management
Writing is optimized (reading is not), meaning that you can write swaths of data at once. The only ways to "manage" your I/O operations when it comes to persistent storage are:
- group your writes and deletes into single operations, since each operation actually attempts to write files (see the sketch after this list)
- do whatever you need to do with the OS to get it to commit to disk in a reasonable amount of time (so you're less likely to lose data)
- manage how much data you read into memory
- it's better to have more, smaller files than fewer, larger files (due to inode size etc.); this can also simplify recovery
- deletes are practically soft until all data in a file has been deleted; smaller files should allow you to reduce your on-disk footprint more often (this happens automatically)
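To illustrate the first point, a single grouped Write is one file operation, while separate Writes are one each; a sketch (item1 through item3 are assumed to implement encoding.BinaryMarshaler):
//one file operation: all three items are written together
if _, err := store.Write(item1, item2, item3); err != nil {
	fmt.Println(err)
}

//three file operations: each Write touches the files separately
for _, item := range []interface{}{item1, item2, item3} {
	if _, err := store.Write(item); err != nil {
		fmt.Println(err)
	}
}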
Asynchronous usage
durostore is safe for concurrent usage within the same memory space; just keep in mind that writes are sequential, so even though there's a "lock", the index order will be determined by who locks first. durostore isn't written such that you need to know the indexes, but you could hypothetically create a database of indexes (if you wanted).
File locking IS included within durostore if it's configured to do so. File locking allows some magical implementations where you could have two applications using the same directory/file-prefix and communicating (e.g. producer/consumer). The producer would write only, while the consumer would periodically read the data (there's some magic missing for this use case right now).
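Within a single memory space, that producer/consumer split might look like the following (a minimal sketch; the fmt and time imports are assumed and shutdown/backoff are simplified):
//producer: writes items sequentially to the store
go func() {
	for i := 0; i < 10; i++ {
		if _, err := store.Write([]byte(fmt.Sprintf("item-%d", i))); err != nil {
			fmt.Println(err)
		}
	}
}()

//consumer: reads the oldest undeleted item (no index provided),
// then deletes its index once the item has been handled
go func() {
	for {
		bytes, err := store.Read()
		if err != nil {
			time.Sleep(time.Millisecond)
			continue
		}
		fmt.Println(string(bytes))
		if err := store.Delete(); err != nil {
			fmt.Println(err)
		}
	}
}()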
Catastrophic failure/recovery
Generally, catastrophic failures should only occur if there's no recovery logic within the implementation that uses durostore: for example, if you attempted to write data, that write failed, and you didn't account for the failure within your business logic. If for some reason a file becomes corrupted and you can't read that data using durostore, you can recover the data if:
- your data is homogeneous and you have a proper UnmarshalBinary function
- your data is heterogeneous and you implement some kind of wrapper to identify which function to use to UnmarshalBinary the data
Finally, if you just want to clear the data, post configuration, you can do so with the following code snippet (this is DESTRUCTIVE):
d := durostore.New(durostore.Configuration{
	Directory:   "./",
	MaxFiles:    2,
	MaxFileSize: 1024,
	MaxChunk:    -1,
	FilePrefix:  filePrefix,
})
if err := d.PruneDirectory(); err != nil {
	fmt.Println(err)
}
The PruneDirectory() function is a variadic to which you can provide an index; if none is provided, it will empty the configured directory of anything with the configured file prefix.
Read Wrappers
Reading is a bit of a pain with durostore since data is read in as bytes and you have to convert it yourself. The example below wraps the read function such that it converts the data after reading the bare bytes, using the UnmarshalBinary function.
type Data struct {
	Int    int    `json:"Int"`
	String string `json:"String"`
}

func (d *Data) UnmarshalBinary(bytes []byte) error {
	return json.Unmarshal(bytes, d)
}

func Read(reader durostore.Reader, index ...uint64) *Data {
	data := &Data{}
	bytes, err := reader.Read(index...)
	if err != nil {
		return nil
	}
	if err := data.UnmarshalBinary(bytes); err != nil {
		return nil
	}
	return data
}
Alternatively, you may find that you require something a bit more versatile. In situations where you need to support multiple data types, you'll need to create a wrapper data type that is first unmarshalled to output a type identifier (a string) and a slice of bytes. The type identifier can be used to determine what data type to unmarshal the bytes into.
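A sketch of such a wrapper (the Wrapper type and its "data" identifier are assumptions for illustration; Data is the type from the example above):
//Wrapper is a hypothetical envelope that pairs a type identifier
// with the marshaled bytes of the underlying data
type Wrapper struct {
	Type  string `json:"type"`
	Bytes []byte `json:"bytes"`
}

func ReadWrapped(reader durostore.Reader, index ...uint64) (interface{}, error) {
	bytes, err := reader.Read(index...)
	if err != nil {
		return nil, err
	}
	wrapper := &Wrapper{}
	if err := json.Unmarshal(bytes, wrapper); err != nil {
		return nil, err
	}
	//use the type identifier to pick the concrete type to unmarshal into
	switch wrapper.Type {
	case "data":
		data := &Data{}
		if err := data.UnmarshalBinary(wrapper.Bytes); err != nil {
			return nil, err
		}
		return data, nil
	default:
		return nil, fmt.Errorf("unsupported type: %s", wrapper.Type)
	}
}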
This is also a great use case for protocol buffers.