V3IO Frames
![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)
V3IO Frames ("Frames") is a multi-model open-source data-access library, developed by Iguazio, which provides a unified high-performance DataFrame API for working with data in the data store of the Iguazio Data Science Platform ("the platform").
In This Document
Client Python API Reference
Overview
Initialization
To use Frames, you first need to import the v3io_frames Python library.
For example:
import v3io_frames as v3f
Then, you need to create and initialize an instance of the Client
class; see Client Constructor.
You can then use the client methods to perform different data operations on the supported backend types.
Backend Types
All Frames client methods receive a backend
parameter for setting the Frames backend type.
Frames supports the following backend types:
kv
— a platform NoSQL (key/value) table.
stream
— a platform data stream.
tsdb
— a time-series database (TSDB).
csv
— a comma-separated-value (CSV) file.
This backend type is used only for testing purposes.
Client
Methods
The Client
class features the following methods for supporting basic data operations:
create
— creates a new TSDB table or stream ("backend data").
delete
— deletes a table or stream or specific table items.
read
— reads data from a table or stream into pandas DataFrames.
write
— writes data from pandas DataFrames to a table or stream.
execute
— executes a backend-specific command on a table or stream.
Each backend may support multiple commands.
Note: Some methods or method parameters are backend-specific, as detailed in this reference.
User Authentication
When creating a Frames client, you must provide valid platform credentials for accessing the backend data, which Frames will use to identify the identity of the user.
This can be done by using any of the following alternative methods (documented in order of precedence):
-
Provide the authentication credentials in the Client
constructor parameters by using either of the following methods:
- Set the
token
constructor parameter to a valid platform access key with the required data-access permissions.
You can get the access key from the Access Keys window that's available from the user-profile menu of the platform dashboard, or by copying the value of the V3IO_ACCESS_KEY
environment variable in a platform web-shell or Jupyter Notebook service.
- Set the
user
and password
constructor parameters to the username and password of a platform user with the required data-access permissions.
Note: You cannot use both methods concurrently: setting both the token
and user
and password
parameters in the same constructor call will produce an error.
-
Set the authentication credentials in environment variables, by using either of the following methods:
-
Set the V3IO_ACCESS_KEY
environment variable to a valid platform access key with the required data-access permissions.
Note: The platform's Jupyter Notebook service automatically defines the V3IO_ACCESS_KEY
environment variable and initializes it to a valid access key for the running user of the service.
-
Set the V3IO_USERNAME
and V3IO_PASSWORD
environment variables to the username and password of a platform user with the required data-access permissions.
Note:
- When the client constructor is called with authentication parameters (option #1), the authentication-credentials environment variables (if defined) are ignored.
- When
V3IO_ACCESS_KEY
is defined, V3IO_USERNAME
and V3IO_PASSWORD
are ignored.
Client
Constructor
All Frames operations are executed via an object of the Client
class.
Syntax
Client(address=""[, data_url=""], container=""[, user="", password="", token=""])
Parameters and Data Members
-
address — The address of the Frames service (framesd
).
When running locally on the platform (for example, from a Jupyter Notebook service), set this parameter to framesd:8081
to use the gRPC (recommended) or to framesd:8080
to use HTTP.
When connecting to the platform remotely, set this parameter to the API address of a Frames platform service in the parent tenant.
You can copy this address from the API column of the V3IO Frames service on the Services platform dashboard page.
- Type:
str
- Requirement: Required
-
data_url — A web-API base URL for accessing the backend data.
By default, the client uses the data URL that's configured for the Frames service, which is typically the HTTPS URL of the web-APIs service of the parent platform tenant.
- Type:
str
- Requirement: Optional
-
container — The name of the platform data container that contains the backend data.
For example, "bigdata"
or "users"
.
- Type:
str
- Requirement: Required
-
user — The username of a platform user with permissions to access the backend data.
See User Authentication.
- Type:
str
- Requirement: Required when neither the
token
parameter or the authentication environment variables are set.
When the user
parameter is set, the password
parameter must also be set to a matching user password.
-
password — A platform password for the user configured in the user
parameter.
See User Authentication.
- Type:
str
- Requirement: Required when the
user
parameter is set.
-
token — A valid platform access key that allows access to the backend data.
See User Authentication.
- Type:
str
- Requirement: Required when neither the
user
or password
parameters or the authentication environment variables are set.
Return Value
Returns a new Frames Client
data object.
Examples
The following examples, for local platform execution, both create a Frames client for accessing data in the "users" container by using the authentication credentials of user "iguazio"; the first example uses access-key authentication while the second example uses username and password authentication (see User Authentication):
import v3io_frames as v3f
client = v3f.Client("framesd:8081", token="e8bd4ca2-537b-4175-bf01-8c74963e90bf", container="users")
import v3io_frames as v3f
client = v3f.Client("framesd:8081", user="iguazio", password="mypass", container="users")
Common Client
Method Parameters
All client methods receive the following common parameters; additional, method-specific parameters are described for each method.
-
backend — The backend data type for the operation.
See Backend Types.
- Type:
str
- Requirement: Required
- Valid Values:
"kv"
| "stream"
| "tsdb"
| "csv"
(for testing)
-
table — The relative path to the backend data — a directory in the target platform data container (as configured for the client object) that represents a TSDB or NoSQL table or a data stream.
For example, "mytable"
or "examples/tsdb/my_metrics"
.
- Type:
str
- Requirement: Required unless otherwise specified in the method-specific documentation
create
Method
Creates a new TSDB table or stream in a platform data container, according to the specified backend type.
The create
method is supported by the tsdb
and stream
backends, but not by the kv
backend, because NoSQL tables in the platform don't need to be created prior to ingestion; when ingesting data into a table that doesn't exist, the table is automatically created.
Syntax
create(backend, table, attrs=None, schema=None, if_exists=FAIL)
Common create
Parameters
All Frames backends that support the create
method support the following common parameters:
tsdb
Backend create
Parameters
The following create
parameters are specific to the tsdb
backend and are passed via the method's attrs
parameter; for more information about these parameters, see the V3IO TSDB documentation:
-
rate — The ingestion rate of the TSDB metric samples.
It's recommended that you set the rate to the average expected ingestion rate, and that the ingestion rates for a given TSDB table don't vary significantly; when there's a big difference in the ingestion rates (for example, x10), use separate TSDB tables.
- Type:
str
- Requirement: Required
- Valid Values: A string of the format
"[0-9]+/[smh]"
— where 's
' = seconds, 'm
' = minutes, and 'h
' = hours.
For example, "1/s"
(one sample per minute), "20/m"
(20 samples per minute), or "50/h"
(50 samples per hour).
-
aggregates — A list of aggregation functions for executing in real time during the samples ingestion ("pre-aggregation").
- Type:
str
- Requirement: Optional
- Valid Values: A string containing a comma-separated list of supported aggregation functions —
avg
| count
| last
| max
| min
| rate
| stddev
| stdvar
| sum
.
For example, "count,avg,min,max"
.
-
aggregation-granularity — Aggregation granularity; i.e., a time interval for applying pre-aggregation functions, if configured in the aggregates
parameter.
- Type:
str
- Requirement: Optional
- Valid Values: A string of the format
"[0-9]+[mhd]"
— where 'm
' = minutes, 'h
' = hours, and 'd
' = days.
For example, "30m"
(30 minutes), "2h"
(2 hours), or "1d"
(1 day).
- Default Value:
"1h"
(1 hour)
stream
Backend create
Parameters
The following create
parameters are specific to the stream
backend and are passed via the method's attrs
parameter; for more information about these parameters, see the platform's streams documentation:
-
shards — The number of stream shards to create.
- Type:
int
- Requirement: Optional
- Default Value:
1
- Valid Values: A positive integer (>= 1).
For example,
100
.
-
retention_hours — The stream's retention period, in hours.
- Type:
int
- Requirement: Optional
- Default Value:
24
- Valid Values: A positive integer (>= 1).
For example,
2
(2 hours).
create
Examples
tsdb
Backend
-
Create a TSDB table named "mytable" in the root directory of the client's data container with an ingestion rate of 10 samples per minute:
client.create("tsdb", "/mytable", attrs={"rate": "10/m"})
-
Create a TSDB table named "my_metrics" in a tsdb directory in the client's data container with an ingestion rate of 1 sample per second.
The table is created with the count
, avg
, min
, and max
aggregates and an aggregation granularity of 1 hour:
client.create("tsdb", "/tsdb/my_metrics", attrs={"rate": "1/s", "aggregates": "count,avg,min,max", "aggregation-granularity": "1h"})
stream
Backend
-
Create a stream named "mystream" in the root directory of the client's data container.
The stream has 6 shards and a retention period of 1 hour (default):
client.create("stream", "/mystream", attrs={"shards": 6})
-
Create a stream named "stream1" in a "my_streams" directory in the client's data container.
The stream has 24 shards (default) and a retention period of 2 hours:
client.create("stream", "my_streams/stream1", attrs={"retention_hours": 2})
write
Method
Writes data from a DataFrame to a table or stream in a platform data container, according to the specified backend type.
Syntax
write(backend, table, dfs, expression='', condition='', labels=None,
max_in_message=0, index_cols=None, partition_keys=None)
Note: The expression
and partition_keys
parameters aren't supported in the current release.
Common write
Parameters
All Frames backends that support the write
method support the following common parameters:
- dfs (Required) — A single DataFrame, a list of DataFrames, or a DataFrames iterator — One or more DataFrames containing the data to write.
(See the
tsdb
backend-specific parameters.)
- index_cols (Optional) (default:
None
) — []str
— A list of column (attribute) names to be used as index columns for the write operation, regardless of any index-column definitions in the DataFrame.
By default, the DataFrame's index columns are used.
Note: The significance and supported number of index columns is backend specific.
For example, the kv
backend supports only a single index column for the primary-key item attribute, while the tsdb
backend supports additional index columns for metric labels.
- labels (Optional) — This parameter is currently applicable only to the
tsdb
backend (although it's available for all backends) and is therefore documented as part of the write
method's tsdb
backend parameters.
- max_in_message (Optional) (default:
0
)
kv
Backend write
Parameters
The following write
parameters are specific to the kv
backend; for more information about these parameters, see the platform's NoSQL documentation:
- condition (Optional) (default:
None
) — A platform condition expression that defines conditions for performing the write operation.
For detailed information about platform condition expressions, see the platform documentation.
tsdb
Backend write
Parameters
The following write
parameter descriptions are specific to the tsdb
backend; for more information about these parameters, see the V3IO TSDB documentation:
-
dfs (Required) — A single DataFrame, a list of DataFrames, or a DataFrames iterator — One or more DataFrames containing the data to write.
This is a common write
parameter, but the following information is specific to the tsdb
backend:
- You must define one or more non-index DataFrame columns that represent the sample metrics; the name of the column is the metric name and its values is the sample data (i.e., the ingested metric).
- You must define a single index column whose value is the sample time of the data.
This column serves as the table's primary-key attribute.
Note that a TSDB DataFrame cannot have more than one index column of a time data type.
- You can optionally define string index columns that represent metric labels for the current DataFrame row.
Note that you can also define labels for all DataFrame rows by using the
labels
parameter (in addition or instead of using column indexes to apply labels to a specific row).
-
labels (Optional) (default: None
) — dict
— A dictionary of metric labels of the format {<label>: <value>[, <label>: <value>, ...]}
, which will be applied to all the DataFrame rows (i.e., to all the ingested metric samples).
For example, {"os": "linux", "arch": "x86"}
.
Note that you can also define labels for a specific DataFrame row by adding a string index column to the row (in addition or instead of using the labels
parameter to define labels for all rows), as explained in the description of the dfs
parameter.
write
Examples
kv
Backend
data = [["tom", 10, "TLV"], ["nick", 15, "Berlin"], ["juli", 14, "NY"]]
df = pd.DataFrame(data, columns = ["name", "age", "city"])
df.set_index("name", inplace=True)
client.write(backend="kv", table="mytable", dfs=df, condition="age>14")
read
Method
Reads data from a table or stream in a platform data container to a DataFrame, according to the configured backend.
Reads data from a backend.
Syntax
read(backend='', table='', query='', columns=None, filter='', group_by='',
limit=0, data_format='', row_layout=False, max_in_message=0, marker='',
iterator=False, **kw)
Note: The limit
, data_format
, row_layout
, and marker
parameters aren't supported in the current release.
Common read
Parameters
All Frames backends that support the read
method support the following common parameters:
-
iterator — (Optional) (default: False
) — bool
— True
to return a DataFrames iterator; False
to return a single DataFrame.
-
filter (Optional) — str
— A query filter.
For example, filter="col1=='my_value'"
.
This parameter cannot be used concurrently with the query
parameter of the tsdb
backend.
-
columns — []str
— A list of attributes (columns) to return.
This parameter cannot be used concurrently with the query
parameter of the tsdb
backend.
kv
Backend read
Parameters
The following read
parameters are specific to the kv
backend; for more information about these parameters, see the platform's NoSQL documentation:
- max_in_message —
int
— The maximum number of rows per message.
The following parameters are passed as keyword arguments via the kw
parameter:
-
reset_index — bool
— Determines whether to reset the index index column of the returned DataFrame: True
— reset the index column by setting it to the auto-generated pandas range-index column; False
(default) — set the index column to the table's primary-key attribute.
-
sharding_keys — []string
— A list of specific sharding keys to query, for range-scan formatted tables only.
tsdb
Backend read
Parameters
The following read
parameters are specific to the tsdb
backend; for more information about these parameters, see the V3IO TSDB documentation:
The following parameters are passed as keyword arguments via the kw
parameter:
-
start — str
— Start (minimum) time for the read operation, as a string containing an RFC 3339 time, a Unix timestamp in milliseconds, a relative time of the format "now"
or "now-[0-9]+[mhd]"
(where m
= minutes, h
= hours, and 'd'
= days), or 0 for the earliest time.
For example: "2016-01-02T15:34:26Z"
; "1451748866"
; "now-90m"
; "0"
.
The default start time is <end time> - 1h
.
-
end — str
— End (maximum) time for the read operation, as a string containing an RFC 3339 time, a Unix timestamp in milliseconds, a relative time of the format "now"
or "now-[0-9]+[mhd]"
(where m
= minutes, h
= hours, and 'd'
= days), or 0 for the earliest time.
For example: "2018-09-26T14:10:20Z"
; "1537971006000"
; "now-3h"
; "now-7d"
.
The default end time is "now"
.
-
step (Optional) — str
— The query step (interval), which determines the points over the query's time range at which to perform aggregations (for an aggregation query) or downsample the data (for a query without aggregators).
The default step is the query's time range, which can be configured via the start and end parameters.
-
aggregators (Optional) — str
— Aggregation information to return, as a comma-separated list of supported aggregation functions ("aggregators").
The following aggregation functions are supported for over-time aggregation (across each unique label set); for cross-series aggregation (across all metric labels), add "_all
" to the end of the function name:
avg
| count
| last
| max
| min
| rate
| stddev
| stdvar
| sum
This parameter cannot be used concurrently with the query
parameter.
-
aggregationWindow (Optional) — str
— Aggregation interval for applying over-time aggregation functions, if set in the aggregators
or query
parameters, as a string of the format "[0-9]+[mhd]"
where 'm
' = minutes, 'h
' = hours, and 'd
' = days.
The default aggregation window is the query's aggregation step.
When using the default aggregation window, the aggregation window starts at the aggregation step; when the aggregationWindow
parameter is set, the aggregation window ends at the aggregation step.
-
multi_index (Optional) — bool
— True
to receive the read results as multi-index DataFrames where the labels are used as index columns in addition to the metric sample-time primary-key attribute; False
(default) only the timestamp will function as the index.
stream
Backend read
Parameters
The following read
parameters are specific to the stream
backend and are passed as keyword arguments via the kw
parameter; for more information about these parameters, see the platform's streams documentation:
-
seek — str
(Required) — Seek type.
Valid values: "time"
| "seq"
| "sequence"
| "latest"
| "earliest"
.
When the "seq"
or "sequence"
seek type is set, you must set the sequence
parameter to the desired record sequence number.
When the time
seek type is set, you must set the start
parameter to the desired seek start time.
-
shard_id — str
(Required) The ID of the stream shard from which to read.
Valid values: "0"
... "<stream shard count> - 1"
.
-
sequence — int64
(Required when seek
= "sequence"
) — The sequence number of the record from which to start reading.
-
start — str
(Required when seek
= "time"
) — The earliest record ingestion time from which to start reading, as a string containing an RFC 3339 time, a Unix timestamp in milliseconds, a relative time of the format "now"
or "now-[0-9]+[mhd]"
(where m
= minutes, h
= hours, and 'd'
= days), or 0 for the earliest time.
For example: "2016-01-02T15:34:26Z"
; "1451748866"
; "now-90m"
; "0"
.
Return Value
- When the value of the
iterator
parameter is False
(default) — returns a single DataFrame.
- When the value of the
iterator
parameter is True
— returns a
DataFrames iterator.
read
Examples
kv
Backend
df = client.read(backend="kv", table="mytable", filter="col1>666")
tsdb
Backend
df = client.read(backend="tsdb", query="select avg(cpu) as cpu, avg(diskio), avg(network) from mytable where os='win'", start="now-1d", end="now", step="2h")
stream
Backend
df = client.read(backend="stream", table="mytable", seek="latest", shard_id="5")
delete
Method
Deletes a table or stream or specific table items from a platform data container, according to the specified backend type.
Syntax
delete(backend, table, filter='', start='', end='', if_missing=FAIL
kv
Backend delete
Parameters
tsdb
Backend delete
Parameters
The following delete
parameters are specific to the tsdb
backend; for more information about these parameters, see the V3IO TSDB documentation:
-
start — Start (minimum) time for the delete operation — i.e., delete only items whose data sample time is at or after (>=
) the specified start time.
- Type:
str
- Requirement: Optional
- Valid Values: A string containing an RFC 3339 time, a Unix timestamp in milliseconds, a relative time of the format
"now"
or "now-[0-9]+[mhd]"
(where m
= minutes, h
= hours, and 'd'
= days), or 0 for the earliest time.
For example: "2016-01-02T15:34:26Z"
; "1451748866"
; "now-90m"
; "0"
.
- Default Value:
""
when neither start
nor end
are set — to delete the entire table and its schema file (.schema) — and 0
when end
is set
-
end — str
— End (maximum) time for the delete operation — i.e., delete only items whose data sample time is before or at (<=
) the specified end time.
- Type:
str
- Requirement: Optional
- Valid Values: A string containing an RFC 3339 time, a Unix timestamp in milliseconds, a relative time of the format
"now"
or "now-[0-9]+[mhd]"
(where m
= minutes, h
= hours, and 'd'
= days), or 0 for the earliest time.
For example: "2018-09-26T14:10:20Z"
; "1537971006000"
; "now-3h"
; "now-7d"
.
- Default Value:
""
when neither start
nor end
are set — to delete the entire table and its schema file (.schema) — and 0
when start
is set
Note:
- When neither the
start
nor end
parameters are set, the entire TSDB table and its schema file are deleted.
- Only full table partitions within the specified time frame (as determined by the
start
and end
parameters) are deleted.
Items within the specified time frames that reside within partitions that begin before the delete start time or end after the delete end time aren't deleted.
The partition interval is calculated automatically based on the table's ingestion rate and is stored in the TSDB's partitionerInterval
schema field (see the .schema file).
delete
Examples
kv
Backend
client.delete(backend="kv", table="mytable", filter="age > 40")
tsdb
Backend
client.delete(backend="tsdb", table="mytable", start="now-1d", end="now-5h")
stream
Backend
client.delete(backend="stream", table="mystream")
execute
Method
Extends the basic CRUD functionality of the other client methods via backend-specific commands.
Note: Currently, no execute
commands are available for the tsdb
backend.
Syntax
execute(backend, table, command="", args=None)
Common execute
Parameters
All Frames backends that support the execute
method support the following common parameters:
kv
Backend execute
Commands
-
infer | inferschema — Infers the data schema of a given NoSQL table and creates a schema file for the table.
Example:
client.execute(backend="kv", table="mytable", command="infer")
stream
Backend execute
Commands
-
put — Adds records to a stream.
Example:
client.execute(backend="stream", table="mystream", command="put", args={"data": "this a record", "clientinfo": "some_info", "partition": "partition_key"})
Contributing
To contribute to V3IO Frames, you need to be aware of the following:
Components
The following components are required for building Frames code:
- Go server with support for both the gRPC and HTTP protocols
- Go client
- Python client
Development
The core is written in Go.
The development is done on the development
branch and then released to the master
branch.
Before submitting changes, test the code:
- To execute the Go tests, run
make test
.
- To execute the Python tests, run
make test-python
.
Adding and Changing Dependencies
- If you add Go dependencies, run
make update-go-deps
.
- If you add Python dependencies, update clients/py/Pipfile and run
make update-py-deps
.
Travis CI
Integration tests are run on Travis CI.
See .travis.yml for details.
The following environment variables are defined in the Travis settings:
- Docker Container Registry (Quay.io)
DOCKER_PASSWORD
— a password for pushing images to Quay.io.
DOCKER_USERNAME
— a username for pushing images to Quay.io.
- Python Package Index (PyPI)
V3IO_PYPI_PASSWORD
— a password for pushing a new release to PyPi.
V3IO_PYPI_USER
— a username for pushing a new release to PyPi.
- Iguazio Data Science Platform
-
V3IO_SESSION
— a JSON encoded map with session information for running tests.
For example:
'{"url":"45.39.128.5:8081","container":"mitzi","user":"daffy","password":"rabbit season"}'
Note: Make sure to embed the JSON object within single quotes ('{...}'
).
Docker Image
Building the Image
Use the following command to build the Docker image:
make build-docker
Running the Image
Use the following command to run the Docker image:
docker run \
-v /path/to/config.yaml:/etc/framesd.yaml \
quay.io/v3io/frames:unstable
LICENSE
Apache 2