OctoSQL
OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
What is OctoSQL?
OctoSQL is a SQL query engine which allows you to write standard SQL queries on data stored in multiple SQL databases, NoSQL databases and files in various formats, pushing down as much of the work as possible to the source databases so that unnecessary data isn't transferred.
OctoSQL does that by creating an internal representation of your query and later translating parts of it into the query languages or APIs of the source databases. Whenever a datasource doesn't support a given operation, OctoSQL will execute it in memory, so you don't have to worry about the specifics of the underlying datasources.
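For instance, take a filter on a table backed by PostgreSQL (the table and column names below are made up for illustration):

octosql "SELECT p.name, p.city
FROM people p
WHERE p.city = 'Warren'"

Because PostgreSQL supports equality pushdown (see the Database Pushdown Operations table below), OctoSQL can ask the database for just the matching rows instead of fetching the whole table; any part of a query the source can't handle is computed in memory instead.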
With OctoSQL you don't need O(n) client tools or a large data analysis system deployment. Everything's contained in a single binary.
Why the name?
OctoSQL stems from Octopus SQL.
Octopus, because octopi have many arms, so they can grasp and manipulate multiple objects, like OctoSQL is able to handle multiple datasources simultaneously.
Installation
Either download the binary for your operating system (Linux, OS X and Windows are supported) from the Releases page, or install using the go command line tool:
go get -u github.com/cube2222/octosql/cmd/octosql
Quickstart
Let's say we have a CSV file with cats, and a Redis database with people (potential cat owners). Now we want to get a list of cities with the number of distinct cat names in them and the cumulative number of cat lives (as each cat has up to 9 lives left).
First, create a configuration file (see the Configuration section below for the syntax).
For example:
dataSources:
  - name: cats
    type: csv
    config:
      path: "~/Documents/cats.csv"
  - name: people
    type: redis
    config:
      address: "localhost:6379"
      password: ""
      databaseIndex: 0
      databaseKeyName: "id"
Then, set the OCTOSQL_CONFIG environment variable to point to the configuration file.
export OCTOSQL_CONFIG=~/octosql.yaml
You can also use the --config command line argument.
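For example, to point a single invocation at the configuration file without setting the environment variable:

octosql --config ~/octosql.yaml "SELECT c.name FROM cats c"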
Finally, query to your heart's desire:
octosql "SELECT p.city, FIRST(c.name), COUNT(DISTINCT c.name) cats, SUM(c.livesleft) catlives
FROM cats c JOIN people p ON c.ownerid = p.id
GROUP BY p.city
ORDER BY catlives DESC
LIMIT 9"
Example output:
+---------+--------------+------+----------+
| p.city | c.name_first | cats | catlives |
+---------+--------------+------+----------+
| Warren | Zoey | 68 | 570 |
| Gadsden | Snickers | 52 | 388 |
| Staples | Harley | 54 | 383 |
| Buxton | Lucky | 45 | 373 |
| Bethany | Princess | 46 | 366 |
| Noxen | Sheba | 49 | 361 |
| Yorklyn | Scooter | 45 | 359 |
| Tuttle | Toby | 57 | 356 |
| Ada | Jasmine | 49 | 351 |
+---------+--------------+------+----------+
You can choose between table, tabbed, json and csv output formats.
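For example, to get the results as JSON instead of a table (the output flag name is an assumption here; check the command line help for the exact spelling):

octosql --output json "SELECT c.name, c.livesleft FROM cats c"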
Configuration
The configuration file has the following form:
dataSources:
  - name: <table_name_in_octosql>
    type: <datasource_type>
    config:
      <datasource_specific_key>: <datasource_specific_value>
      <datasource_specific_key>: <datasource_specific_value>
      ...
  - name: <table_name_in_octosql>
    type: <datasource_type>
    config:
      <datasource_specific_key>: <datasource_specific_value>
      <datasource_specific_key>: <datasource_specific_value>
      ...
  ...
Supported Datasources
JSON
JSON file in one of the following forms:
- one record per line, no commas
- JSON list of records
options:
- path - path to file containing the data, required
- arrayFormat - whether the JSON list of records format should be used, defaults to false
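For example, the same two records in both accepted forms (field names are illustrative):

{"name": "Zoey", "livesleft": 9}
{"name": "Toby", "livesleft": 3}

or, with arrayFormat set to true:

[{"name": "Zoey", "livesleft": 9}, {"name": "Toby", "livesleft": 3}]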
CSV
CSV file with comma-separated values. The first row should contain column names.
options:
- path - path to file containing the data, required
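For example, a cats.csv matching the Quickstart query could start like this (the values are made up):

name,livesleft,ownerid
Zoey,9,1
Toby,3,2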
PostgreSQL
Single PostgreSQL database table.
options:
- address - address including port number, defaults to localhost:5432
- user - required
- password - required
- databaseName - required
- tableName - required
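A minimal sketch of a PostgreSQL entry using the options above (credentials and names are made up, and the type string is assumed to be postgres):

dataSources:
  - name: orders
    type: postgres   # assumed type name
    config:
      address: "localhost:5432"
      user: "octosql"
      password: "secret"
      databaseName: "mydb"
      tableName: "orders"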
MySQL
Single MySQL database table.
options:
- address - address including port number, defaults to localhost:3306
- user - required
- password - required
- databaseName - required
- tableName - required
Redis
Redis database with the given index. Currently only hashes are supported.
options:
- address - address including port number, defaults to localhost:6379
- password - defaults to ""
- databaseIndex - index number of Redis database, defaults to 0
- databaseKeyName - column name of Redis key in OctoSQL records, defaults to "key"
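For example, with databaseKeyName set to "id" as in the Quickstart, a person stored as a Redis hash (field names are illustrative):

redis-cli HSET 1 name "Jane" city "Warren"

would show up in OctoSQL as a single row with the columns id (the Redis key), name and city.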
Documentation
Documentation for the available functions: https://github.com/cube2222/octosql/wiki/Function-Documentation
Documentation for the available aggregates: https://github.com/cube2222/octosql/wiki/Aggregate-Documentation
The SQL dialect documentation: TODO ;) In short, though:
Available SQL constructs: Select, Where, Order By, Group By, Offset, Limit, Left Join, Right Join, Inner Join, Distinct, Union, Union All, Subqueries, Operators.
Available SQL types: Int, Float, String, Bool, Time, Duration, Tuple (array), Object (e.g. JSON)
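For example, reusing the Quickstart tables, a query combining a few of those constructs (Left Join, Distinct, Order By, Limit):

octosql "SELECT DISTINCT p.city, c.name
FROM people p LEFT JOIN cats c ON c.ownerid = p.id
ORDER BY p.city
LIMIT 5"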
Architecture
An OctoSQL invocation gets processed in multiple phases.
SQL AST
First, the SQL query gets parsed into an abstract syntax tree. This phase only rules out syntax errors.
Logical Plan
The SQL AST gets converted into a logical query plan. This conversion is still mostly syntactic: it's the most naive possible translation of the SQL query. However, the resulting plan already has more of a map-filter-reduce form.
If you wanted to add a new query language to OctoSQL, the only problem you'd have to solve is translating it to this logical plan.
Physical Plan
The logical plan gets converted into a physical plan. This conversion finds any semantic errors in the query. If this phase is reached successfully, the input is correct and OctoSQL will be able to execute it.
This phase already understands the specifics of the underlying datasources, so it's here that the optimizer iteratively transforms the plan, pushing computation nodes down to the datasources and deduplicating unnecessary parts.
The optimizer uses a pattern matching approach: it has rules that match parts of the physical plan tree and describe how those parts can be restructured into a more efficient form. The rules are meant to be as simple as possible and to make the smallest possible changes, for example pushing filters under maps if they don't use any mapped variables. The optimizer keeps iterating on the whole tree, trying to apply each rule in each possible place, until it can't change anything anymore. This ensures that the plan reaches a local performance minimum, and the rules should be structured so that this local minimum is equal, or close, to the global minimum (i.e. one optimization shouldn't make another, much more useful one, impossible).
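As a sketch of the filter-under-map rule (not actual optimizer output), consider the following query on the Quickstart tables. The WHERE condition doesn't use anything computed by the SELECT projection, so the filter can be moved below it; for a datasource like PostgreSQL the comparison could additionally be pushed down to the database itself, while CSV and JSON fall back to a scan (see the pushdown table below):

octosql "SELECT c.name FROM cats c WHERE c.livesleft > 3"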
(An example diagram of an optimized physical plan can be found in the repository README.)
Execution Plan
The physical plan gets materialized into an execution plan. This phase has to be able to connect to the actual datasources. It may initialize connections, open files, etc.
Stream
Starting the execution plan creates a stream, which underneath may hold more streams, or parts of the execution plan used to create streams in the future. This stream works in a pull-based model.
Database Pushdown Operations
+------------+-----------+-----------+-----------+
| Datasource | Equality  | In        | > < <= >= |
+------------+-----------+-----------+-----------+
| MySQL      | supported | supported | supported |
| PostgreSQL | supported | supported | supported |
| Redis      | supported | supported | scan      |
| JSON       | scan      | scan      | scan      |
| CSV        | scan      | scan      | scan      |
+------------+-----------+-----------+-----------+
Here, scan means that the whole table needs to be scanned for each access. We plan to add an in-memory index in the future, which would allow small tables to be stored in memory, saving a lot of unnecessary reads.
Roadmap
- Additional Datasources.
- SQL Constructs:
  - JSON Query
  - Window Functions
  - Polymorphic Table Functions (e.g. RANGE(1, 10) in table position)
  - HAVING, ALL, ANY
- Parallel expression evaluation.
- Custom SQL parser, so we can use sane function names and support new SQL constructs.
- Streams support (Kafka, Redis)
- Push down functions and aggregates to databases that support them.
- An in-memory index to cache subquery results and avoid rescanning tables which don't support a given operation, so they don't have to be recalculated each time.
- MapReduce style distributed execution mode.
- Runtime statistics
- Server mode
- Querying a JSON or CSV table from standard input.
- Integration test suite
- Tuple splitter, returning one row per tuple element, with the given element in place of the tuple.
- Describe-like functionality as in the diagram above.