tabacco

package module
v0.0.0-...-dc55c6b
Published: Oct 19, 2024 License: GPL-3.0 Imports: 28 Imported by: 0

README

tabacco

A data backup manager for distributed environments, with a special focus on multi-tenant services. It backs up high-level datasets and maintains their association with the low-level hosts and files they live on.

Overview

The idea is to describe the data to be backed up in terms that make sense at the application layer: for instance, a web hosting provider may have datasets corresponding to the data and SQL databases of each individual user (e.g. data/user1, data/user2, sql/user1, etc.). The software then maps these dataset names to hosts and files, backs up the data, and lets you retrieve it in those same terms.

The following scenarios / use cases for retrieval are supported:

  • retrieve specific datasets, identified by their name
  • restore an entire host or dataset group, for administrative or maintenance-related reasons

In order to do this, tabacco must allow you to restore datasets based on either high-level or low-level identifiers.

To explain what this means in practice, consider a distributed services environment: we may not care exactly which host the service foo was running on, or where on the filesystem its data was stored; what we want is to be able to say "restore the data for service foo".

The tabacco system works using agents, running on all hosts you have data on, and a centralized metadata database (metadb) that stores information about each backup globally.

Usage

The tabacco command has a number of sub-commands to invoke various functions:

agent

The agent sub-command starts the backup agent. This is meant to run in the background as a daemon (managed by init), and it will invoke backup jobs periodically at their desired schedule.

The daemon will read its configuration from /etc/tabacco/agent.yml by default, along with the sources and handlers subdirectories of the directory containing it; this can be changed with the --config option.

The process will also start an HTTP listener on an address you specify with the --http-addr option, which is used to export monitoring and debugging endpoints.

metadb

The metadb sub-command starts the metadata server, the central database (with an HTTP API) that stores data about backups. This is a critical component: without it, you can't perform backups or restores. This process is meant to run in the background (managed by init).

The daemon will read its configuration from /etc/tabacco/metadb.yml by default (change using the --config option).

User's Guide

Let's look at some of the fundamental concepts: consider the backup manager as a gateway between data sources and the destination storage layer. Each high-level dataset is known as an atom.

There is often a trade-off to be made when backing up multi-tenant services: do we invoke a backup handler once per tenant, or do we dump everything once and just say it's made of multiple atoms? You can pick the best approach on a case-by-case basis, by grouping atoms into datasets. We'll look at examples later to clarify what this means.

Repository

The first thing your backup needs is a destination repository, that is, a way to archive data long-term. The current implementation uses restic, an encrypted, deduplicating backup tool that supports a large number of remote storage options.

The file handler

Every dataset has an associated handler, which is responsible for actually taking the backup or performing the restore. The most straightforward handler is built in and is called file: it simply backs up and restores files on the filesystem. It is configured with a single path attribute pointing at the location to back up or restore.

Builtin handlers (such as pipe described below) are usually used as templates for customized handlers. This is not the case with the file handler, which is so simple it can be used directly.

Datasets and atoms

Imagine a hosting provider with two FTP accounts on the local host. The first possibility is to treat each as its own dataset, in which case the backup command will be invoked twice:

- name: users/account1
  handler: file
  params:
    path: /users/account1
- name: users/account2
  handler: file
  params:
    path: /users/account2

Datasets that do not explicitly list atoms will implicitly be treated as if they contained a single, anonymous atom.

In the same scenario as above, it may be easier to simply dump all of /users, and just say that it contains account1 and account2:

- name: users
  handler: file
  params:
    path: /users
  atoms:
    - name: account1
    - name: account2

For datasets with one or more atoms explicitly defined, the final atom name is the concatenation of the dataset name and the atom name, so in this example we end up with exactly the same atoms as above: users/account1 and users/account2.

Dynamic data sources

It would be convenient to generate the list of atoms dynamically, and in fact it is possible to do so using an atoms_command:

- name: users
  handler: file
  params:
    path: /users
  atoms_command: dump_accounts.sh

The script will be called on each backup, and it should print atom names to its standard output, one per line.

Pre and post scripts

Suppose the data to back up isn't just a file on the filesystem, but rather data in a service that must be extracted somehow using tools. It's possible to run arbitrary commands before or after a backup.

Regardless of the handler selected, all sources can define commands to be run before and after backup or restore operations on the whole dataset. These attributes are:

  • pre_backup_command is invoked before a backup of the dataset
  • post_backup_command is invoked after a backup of the dataset
  • pre_restore_command is invoked before a restore of a dataset
  • post_restore_command is invoked after a restore of a dataset

The scripts are run through a shell so they support environment variable substitution and other shell syntax. The following special environment variables are defined:

  • BACKUP_ID - unique backup ID
  • DATASET_NAME - name of the dataset
  • ATOM_NAMES - names of all atoms, space-separated (only available for dataset-level scripts)
  • ATOM_NAME - atom name (only available for atom-level scripts)

So, for instance, this would be a way to make a backup of a MySQL database instance:

- name: sql
  handler: file
  pre_backup_command: "mysqldump > /var/backups/sql/dump.sql"
  post_restore_command: "mysql < /var/backups/sql/dump.sql"
  params:
    path: /var/backups/sql

or, if you have a clever MySQL dump tool that saves each database into a separate directory, named after the database itself, you could do something a bit better like:

- name: sql
  handler: file
  pre_backup_command: "cd /var/backups/sql && clever_mysql_dump $ATOM_NAMES"
  post_restore_command: "cd /var/backups/sql && clever_mysql_restore $ATOM_NAMES"
  params:
    path: /var/backups/sql
  atoms:
    - name: db1
    - name: db2

This has the advantage of having the appropriate atom metadata, so we can restore individual databases.

The pipe handler

The MySQL example just above has a major disadvantage: it requires writing the entire database to local disk in /var/backups/sql, just so that the backup tool can read it and send it to the repository. This intermediate step can be avoided by having the command pipe its output directly to the backup tool, using the pipe handler.

Unlike the file handler seen before, the pipe handler can't be used directly: it must be configured appropriately by creating a user-defined handler.

Since it's impractical to access individual items within a single data stream, pipe handlers operate on individual atoms: datasets containing multiple atoms are automatically converted into a list of datasets with one atom each. This is an internal mechanism with almost no practical consequences, except that reports and logs will show the multiple resulting data sources.

Configuration is performed by setting two parameters:

  • backup_command is the command to generate a backup of an atom on standard output
  • restore_command is the command used to restore an atom.

So, for instance, the MySQL example could be rewritten as this handler definition:

- name: mysql-pipe
  type: pipe
  params:
    backup_command: "mysqldump --databases ${atom.name}"
    restore_command: "mysql"

and this dataset source:

- name: sql
  handler: mysql-pipe
  atoms:
    - name: db1
    - name: db2

Runtime signals

The agent will reload its configuration on SIGHUP, and it will immediately trigger all backup jobs upon receiving SIGUSR1.

TODO

Things still to do:

  • The agent can currently do both backups and restores, but there is no way to trigger a restore. Some sort of authenticated API is needed for this.

Things not to do:

  • Global (cluster-wide) scheduling - that's the job of a global cron scheduler, that could then easily trigger backups.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetLogger

func GetLogger(ctx context.Context) *log.Logger

func JobWithLogPrefix

func JobWithLogPrefix(j jobs.Job, pfx string) jobs.Job

func WithLogPrefix

func WithLogPrefix(ctx context.Context, pfx string) context.Context

Types

type Agent

type Agent struct {
	// contains filtered or unexported fields
}

Agent holds a Manager and a Scheduler together, and runs periodic backup jobs for all known sources.

func NewAgent

func NewAgent(ctx context.Context, configMgr *ConfigManager, ms MetadataStore) (*Agent, error)

NewAgent creates a new Agent with the specified config.

func (*Agent) Close

func (a *Agent) Close()

Close the Agent and all associated resources.

func (*Agent) Handler

func (a *Agent) Handler() http.Handler

Handler returns an HTTP handler implementing the debug HTTP server.

func (*Agent) RunNow

func (a *Agent) RunNow()

RunNow starts all jobs right now, regardless of their schedule.

func (*Agent) StartHTTPServer

func (a *Agent) StartHTTPServer(addr string)

StartHTTPServer starts an HTTP server that exports Prometheus metrics and debug information.
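
As a sketch of how these pieces fit together in an agent-like program (the tabacco import path below is a placeholder, and the MetadataStore client is assumed to be constructed elsewhere; its constructor is not documented on this page):

package example

import (
	"context"

	tabacco "example.com/tabacco" // placeholder: substitute the module's real import path
)

// runAgent wires together the documented pieces: read the configuration,
// build a ConfigManager, create the Agent, expose the metrics/debug HTTP
// endpoint and trigger an immediate run of all backup jobs.
func runAgent(ctx context.Context, ms tabacco.MetadataStore) error {
	config, err := tabacco.ReadConfig("/etc/tabacco/agent.yml")
	if err != nil {
		return err
	}

	configMgr, err := tabacco.NewConfigManager(config)
	if err != nil {
		return err
	}
	defer configMgr.Close()

	agent, err := tabacco.NewAgent(ctx, configMgr, ms)
	if err != nil {
		return err
	}
	defer agent.Close()

	agent.StartHTTPServer(":14001") // monitoring address; the value is arbitrary here
	agent.RunNow()                  // normally you would just let the scheduler run

	<-ctx.Done() // run until the surrounding context is cancelled
	return nil
}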

type Atom

type Atom struct {
	// Name (path-like).
	Name string `json:"name"`

	// Special attribute for the 'file' handler (path relative to
	// source root path).
	Path string `json:"path,omitempty"`
}

An Atom is a bit of data that can be restored independently as part of a Dataset. Atoms are identified uniquely by their absolute path in the global atom namespace: this path is built by concatenating the source name, the dataset name, and the atom name.

type Backup

type Backup struct {
	// Unique identifier.
	ID string `json:"id"`

	// Timestamp (backup start).
	Timestamp time.Time `json:"timestamp"`

	// Host.
	Host string `json:"host"`

	// Datasets.
	Datasets []*Dataset `json:"datasets"`
}

Backup is the over-arching entity describing a high level backup operation. Backups are initiated autonomously by individual hosts, so each Backup belongs to a single Host.

type Config

type Config struct {
	Hostname             string                    `yaml:"hostname"`
	Queue                *jobs.QueueSpec           `yaml:"queue_config"`
	Repository           RepositorySpec            `yaml:"repository"`
	DryRun               bool                      `yaml:"dry_run"`
	DefaultNiceLevel     int                       `yaml:"default_nice_level"`
	DefaultIOClass       int                       `yaml:"default_io_class"`
	WorkDir              string                    `yaml:"work_dir"`
	RandomSeedFile       string                    `yaml:"random_seed_file"`
	MetadataStoreBackend *clientutil.BackendConfig `yaml:"metadb"`

	HandlerSpecs []*HandlerSpec
	SourceSpecs  []*SourceSpec
}

Config is the global configuration object. While the actual configuration is spread over multiple files and directories, this holds it all together.

func ReadConfig

func ReadConfig(path string) (*Config, error)

ReadConfig reads the configuration from the given path. Sources and handlers are read from the 'sources' and 'handlers' subdirectories of the directory containing the main configuration file.

Performs a first level of static validation.

type ConfigManager

type ConfigManager struct {
	// contains filtered or unexported fields
}

ConfigManager holds all runtime data derived from the configuration itself, so it can be easily reloaded by calling Reload(). Listeners should register themselves with Notify() in order to be updated when the configuration changes (there is currently no way to unregister).

func NewConfigManager

func NewConfigManager(config *Config) (*ConfigManager, error)

NewConfigManager creates a new ConfigManager.

func (*ConfigManager) Close

func (m *ConfigManager) Close()

Close the ConfigManager and all associated resources.

func (*ConfigManager) NewRuntimeContext

func (m *ConfigManager) NewRuntimeContext() RuntimeContext

NewRuntimeContext returns a new RuntimeContext, capturing current configuration and runtime assets.

func (*ConfigManager) Notify

func (m *ConfigManager) Notify() <-chan struct{}

Notify the caller when the configuration is reloaded.

func (*ConfigManager) Reload

func (m *ConfigManager) Reload(config *Config) error

Reload the configuration (at least, the parts of it that can be dynamically reloaded).
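
For illustration, here is a minimal sketch of a reload loop built on Reload and Notify, mirroring the agent's documented SIGHUP behaviour (the tabacco import path is a placeholder):

package example

import (
	"log"
	"os"
	"os/signal"
	"syscall"

	tabacco "example.com/tabacco" // placeholder: substitute the module's real import path
)

// watchConfig re-reads the configuration on SIGHUP and hands it to the
// ConfigManager; a separate goroutine observes Notify() to log each
// successful reload.
func watchConfig(configMgr *tabacco.ConfigManager, path string) {
	go func() {
		for range configMgr.Notify() {
			log.Printf("configuration reloaded")
		}
	}()

	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	for range hup {
		config, err := tabacco.ReadConfig(path)
		if err != nil {
			log.Printf("error reading %s: %v", path, err)
			continue
		}
		if err := configMgr.Reload(config); err != nil {
			log.Printf("error reloading configuration: %v", err)
		}
	}
}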

type Dataset

type Dataset struct {
	// Unique identifier.
	ID string `json:"id"`

	// Source is the name of the source that created this Dataset,
	// stored so that the restore knows what to do.
	Source string `json:"source"`

	// Atoms that are part of this dataset.
	Atoms []Atom `json:"atoms"`

	// Snapshot ID (repository-specific).
	SnapshotID string `json:"snapshot_id"`

	// Number of files in this dataset.
	TotalFiles int64 `json:"total_files"`

	// Number of bytes in this dataset.
	TotalBytes int64 `json:"total_bytes"`

	// Number of bytes that were added / removed in this backup.
	BytesAdded int64 `json:"bytes_added"`

	// Duration in seconds.
	Duration int `json:"duration"`
}

A Dataset describes a data set as a high-level structure containing one or more atoms. The 1-to-many scenario is justified by the following use case: imagine a SQL database server; we may want to back it up as a single operation, but it contains multiple databases (the atoms we're interested in), which we might want to restore independently.

type DatasetSpec

type DatasetSpec struct {
	Atoms        []Atom `yaml:"atoms"`
	AtomsCommand string `yaml:"atoms_command"`
}

DatasetSpec describes a dataset in the configuration.

func (*DatasetSpec) Check

func (spec *DatasetSpec) Check() error

Check syntactical validity of the DatasetSpec.

func (*DatasetSpec) Parse

func (spec *DatasetSpec) Parse(ctx context.Context, src *SourceSpec) (*Dataset, error)

Parse a DatasetSpec and return a Dataset.

type FindRequest

type FindRequest struct {
	Pattern string `json:"pattern"`

	Host        string    `json:"host"`
	NumVersions int       `json:"num_versions"`
	OlderThan   time.Time `json:"older_than,omitempty"`
	// contains filtered or unexported fields
}

FindRequest specifies search criteria for atoms.

type Handler

type Handler interface {
	BackupJob(RuntimeContext, *Backup, *Dataset) jobs.Job
	RestoreJob(RuntimeContext, *Backup, *Dataset, string) jobs.Job
}

Handler can backup and restore a specific class of datasets.

type HandlerSpec

type HandlerSpec struct {
	// Handler name (unique global identifier).
	Name string `yaml:"name"`

	// Handler type, one of the known types.
	Type string `yaml:"type"`

	Params Params `yaml:"params"`
}

HandlerSpec defines the configuration for a handler.

func (*HandlerSpec) Parse

func (spec *HandlerSpec) Parse(src *SourceSpec) (Handler, error)

Parse a HandlerSpec and return a Handler instance.

type JobStatus

type JobStatus struct {
	Host          string            `json:"host"`
	JobID         string            `json:"job_id"`
	BackupID      string            `json:"backup_id"`
	DatasetID     string            `json:"dataset_id"`
	DatasetSource string            `json:"dataset_source"`
	Status        *RunningJobStatus `json:"status"`
}

JobStatus has contextual information about a backup job that is currently running.

type Manager

type Manager interface {
	BackupJob(context.Context, *SourceSpec) (*Backup, jobs.Job, error)
	Backup(context.Context, *SourceSpec) (*Backup, error)
	RestoreJob(context.Context, *FindRequest, string) (jobs.Job, error)
	Restore(context.Context, *FindRequest, string) error
	Close() error

	// Debug interface.
	GetStatus() ([]jobs.Status, []jobs.Status, []jobs.Status)
}

Manager for backups and restores.

func NewManager

func NewManager(ctx context.Context, configMgr *ConfigManager, ms MetadataStore) (Manager, error)

NewManager creates a new Manager.
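
A sketch of driving a Manager directly (assumptions: the import path is a placeholder, the FindRequest pattern syntax is taken to be glob-like, and the final string argument to Restore is taken to be the restore target path):

package example

import (
	"context"

	tabacco "example.com/tabacco" // placeholder: substitute the module's real import path
)

// backupAndRestore backs up a single source, then restores the atoms
// matching a pattern. Pattern syntax and the meaning of the last Restore
// argument (taken here to be the restore target path) are assumptions.
func backupAndRestore(ctx context.Context, mgr tabacco.Manager, src *tabacco.SourceSpec) error {
	if _, err := mgr.Backup(ctx, src); err != nil {
		return err
	}

	req := &tabacco.FindRequest{
		Pattern:     "users/*",
		NumVersions: 1,
	}
	return mgr.Restore(ctx, req, "/restore")
}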

type MetadataStore

type MetadataStore interface {
	// Find the datasets that match a specific criteria. Only
	// atoms matching the criteria will be included in the Dataset
	// objects in the response.
	FindAtoms(context.Context, *FindRequest) ([]*Backup, error)

	// Add a dataset entry (the Backup might already exist).
	AddDataset(context.Context, *Backup, *Dataset) error

	// StartUpdates spawns a goroutine that periodically sends
	// active job status updates to the metadata server.
	StartUpdates(context.Context, func() *UpdateActiveJobStatusRequest)
}

MetadataStore is the client interface to the global metadata store.

type Params

type Params map[string]interface{}

Params are configurable parameters in a format friendly to YAML representation.

func (Params) Get

func (p Params) Get(key string) string

Get a string value for a parameter.

func (Params) GetBool

func (p Params) GetBool(key string) (bool, bool)

GetBool returns a boolean value for a parameter (may be a string). Returns value and presence.

func (Params) GetList

func (p Params) GetList(key string) []string

GetList returns a string list value for a parameter.
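
A small sketch of the accessors; the keys and values are purely illustrative and do not belong to any particular handler (the import path is a placeholder):

package example

import (
	"fmt"

	tabacco "example.com/tabacco" // placeholder: substitute the module's real import path
)

// paramsExample reads typed values back out of a Params map.
func paramsExample() {
	p := tabacco.Params{
		"path":    "/users",
		"exclude": []string{"*.tmp", "cache/"}, // illustrative; real values normally come from YAML
		"debug":   "true",
	}

	fmt.Println(p.Get("path"))        // string value
	fmt.Println(p.GetList("exclude")) // string list value
	if debug, ok := p.GetBool("debug"); ok {
		fmt.Println("debug:", debug) // boolean value ("may be a string")
	}
}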

type Repository

type Repository interface {
	Init(context.Context, RuntimeContext) error

	RunBackup(context.Context, *Shell, *Backup, *Dataset, string, []string) error
	RunStreamBackup(context.Context, *Shell, *Backup, *Dataset, string, string) error

	RunRestore(context.Context, *Shell, *Backup, *Dataset, []string, string) error
	RunStreamRestore(context.Context, *Shell, *Backup, *Dataset, string, string) error

	Close() error
}

Repository is the interface to a remote repository.

type RepositorySpec

type RepositorySpec struct {
	Name   string `yaml:"name"`
	Type   string `yaml:"type"`
	Params Params `yaml:"params"`
}

RepositorySpec defines the configuration of a repository.

func (*RepositorySpec) Parse

func (spec *RepositorySpec) Parse() (Repository, error)

Parse a RepositorySpec and return a Repository instance.
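
A sketch of building a Repository from its spec in code. The "restic" type name follows from the README; the parameter keys used here (uri, password) are invented for illustration and are not documented values:

package example

import tabacco "example.com/tabacco" // placeholder: substitute the module's real import path

// newRepository builds a Repository from its configuration spec.
func newRepository() (tabacco.Repository, error) {
	spec := tabacco.RepositorySpec{
		Name: "main",
		Type: "restic",
		Params: tabacco.Params{
			"uri":      "sftp:backup@host:/backups", // assumed parameter names
			"password": "secret",
		},
	}
	return spec.Parse()
}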

type RunningJobStatus

type RunningJobStatus resticStatusMessage

RunningJobStatus has information about a backup job that is currently running.

type RuntimeContext

type RuntimeContext interface {
	Shell() *Shell
	Repo() Repository
	QueueSpec() *jobs.QueueSpec
	Seed() int64
	WorkDir() string
	SourceSpecs() []*SourceSpec
	FindSource(string) *SourceSpec
	HandlerSpec(string) *HandlerSpec
	Close()
}

RuntimeContext provides access to runtime objects whose lifetime is ultimately tied to the configuration. Configuration can change during the lifetime of the process, but we want backup jobs to have a consistent view of the configuration while they execute, so access to the current version of the configuration is mediated by the ConfigManager.

type Shell

type Shell struct {
	// contains filtered or unexported fields
}

Shell runs commands, with some options (a global dry-run flag preventing all executions, nice level, i/o class). As one may guess by the name, commands are run using the shell, so variable substitutions and other shell features are available.

func NewShell

func NewShell(dryRun bool) *Shell

NewShell creates a new Shell.

func (*Shell) Output

func (s *Shell) Output(ctx context.Context, arg string) ([]byte, error)

Output runs a command and returns the standard output.

func (*Shell) Run

func (s *Shell) Run(ctx context.Context, arg string) error

Run a command. Log its standard output and error.

func (*Shell) RunWithStdoutCallback

func (s *Shell) RunWithStdoutCallback(ctx context.Context, arg string, stdoutCallback func([]byte)) error

RunWithStdoutCallback executes a command and invokes a callback on every line read from its standard output. Standard output and error are still logged normally as in Run().

func (*Shell) SetIOClass

func (s *Shell) SetIOClass(n int)

SetIOClass sets the ionice(1) i/o class.

func (*Shell) SetNiceLevel

func (s *Shell) SetNiceLevel(n int)

SetNiceLevel sets the nice(1) level.
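
A short sketch of driving a Shell directly; the commands themselves are arbitrary examples (the import path is a placeholder):

package example

import (
	"context"
	"fmt"

	tabacco "example.com/tabacco" // placeholder: substitute the module's real import path
)

// shellExample runs a couple of commands through a Shell. Passing true to
// NewShell would enable the global dry-run flag, preventing all executions.
func shellExample(ctx context.Context) error {
	sh := tabacco.NewShell(false)
	sh.SetNiceLevel(10) // run child processes at low CPU priority
	sh.SetIOClass(3)    // ionice(1) class 3 is "idle"

	// Run logs the command's output; Output captures stdout instead.
	if err := sh.Run(ctx, "tar -C /users -cf /tmp/users.tar ."); err != nil {
		return err
	}
	out, err := sh.Output(ctx, "hostname")
	if err != nil {
		return err
	}
	fmt.Printf("running on %s", out)
	return nil
}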

type SourceSpec

type SourceSpec struct {
	Name    string `yaml:"name"`
	Handler string `yaml:"handler"`

	// Schedule to run the backup on.
	Schedule string `yaml:"schedule"`

	// Define Datasets statically, or use a script to generate them
	// dynamically on every new backup.
	Datasets        []*DatasetSpec `yaml:"datasets"`
	DatasetsCommand string         `yaml:"datasets_command"`

	// Commands to run before and after operations on the source.
	PreBackupCommand   string `yaml:"pre_backup_command"`
	PostBackupCommand  string `yaml:"post_backup_command"`
	PreRestoreCommand  string `yaml:"pre_restore_command"`
	PostRestoreCommand string `yaml:"post_restore_command"`

	Params Params `yaml:"params"`

	// Timeout for execution of the entire backup operation.
	Timeout time.Duration `yaml:"timeout"`
}

SourceSpec defines the configuration for a data source. Data sources can dynamically or statically generate one or more Datasets, each containing one or more Atoms.

Handlers are launched once per Dataset, and they know how to deal with backing up / restoring individual Atoms.

func (*SourceSpec) Check

func (spec *SourceSpec) Check(handlers map[string]*HandlerSpec) error

Check syntactical validity of the SourceSpec. Not an alternative to validation at usage time, but it provides an early warning to the user. Checks the handler name against a string set of handler names.

func (*SourceSpec) Parse

func (spec *SourceSpec) Parse(ctx context.Context) ([]*Dataset, error)

Parse a SourceSpec and return one or more Datasets.
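
A sketch of building a source specification directly in Go, loosely mirroring the users example from the README (the import path is a placeholder, the schedule value is illustrative since its syntax is not documented on this page, and note that in the Go structures atoms are grouped under a DatasetSpec):

package example

import (
	"context"

	tabacco "example.com/tabacco" // placeholder: substitute the module's real import path
)

// usersSource builds a source with a single dataset containing two
// explicitly listed atoms, then expands it into concrete Datasets.
func usersSource(ctx context.Context) ([]*tabacco.Dataset, error) {
	spec := &tabacco.SourceSpec{
		Name:     "users",
		Handler:  "file",
		Schedule: "@daily", // illustrative; the schedule syntax is not documented here
		Datasets: []*tabacco.DatasetSpec{
			{
				Atoms: []tabacco.Atom{
					{Name: "account1"},
					{Name: "account2"},
				},
			},
		},
		Params: tabacco.Params{"path": "/users"},
	}
	return spec.Parse(ctx)
}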

type UpdateActiveJobStatusRequest

type UpdateActiveJobStatusRequest struct {
	Host       string       `json:"host"`
	ActiveJobs []*JobStatus `json:"active_jobs,omitempty"`
}

UpdateActiveJobStatusRequest is the periodic "ping" sent by agents (unique host names are assumed) containing information about currently running jobs.

Directories

Path Synopsis
cmd
metadb
