README
Introduction
The Elastic proxy adds a compatibility layer on top of the Sneller query engine. This allows for an easier migration path to Sneller.
Elastic is great, but handling large amounts of data can become expensive: it requires a lot of resources, and the operational costs may become an issue. Sneller Cloud is fully serverless and doesn't require any maintenance. Large data volumes are also easier to manage in a self-hosted environment, because Sneller stores the data in object storage, which is cheap, rather than on the compute nodes. With our cloud offering, you only pay for the data that is scanned. This also allows for much longer retention periods without high costs.
Architecture
The Elastic proxy translates Elastic queries into Sneller compatible SQL queries and executes the query using the standard Sneller engine. The engine returns the results in binary ION format and the Elastic proxy translates this back into an Elastic compatible format.
graph TB
client[Client]--1. Elastic query --> ep((Elastic proxy))
ep-- 2. PartiQL --> sc((Sneller))
sc-- 3. ION response --> ep
ep-- 4. Elastic response --> client
Each Elastic query is translated to a single SQL query that may return multiple result sets in a single roundtrip. This ensures that the entire query is run against an atomic data set.
The Elastic proxy implements the Search and Count API endpoints. All other endpoints are not supported and return 404 (not found).
It is possible to add a backing Elastic that handles all other endpoints. This can be used in more complex scenarios, where the existing Elastic will still be used, but only some indexes are directed to Sneller. This can be very useful to move large indexes out of Elastic (and reduce costs and operational overhead), while still keeping smaller indexes (that may require updating) in Elastic.
graph TB
client[Client]--Elastic query --> ep((Elastic proxy))
ep-- PartiQL --> sc((Sneller))
sc-- ION response --> ep
ep-- Elastic response --> client
ep-- Elastic query --> el((Elastic))
el-- Elastic response --> ep
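The routing described above can be sketched in Go as follows. This is only an illustration under stated assumptions: the `route` function, the `Target` type and the way the path is split are hypothetical and are not taken from the Elastic proxy source. It assumes request paths of the form `/<index>/_search` or `/<index>/_count`.

```go
package main

import (
	"fmt"
	"strings"
)

// Target says where a request ends up (hypothetical type for illustration).
type Target string

const (
	TargetSneller  Target = "sneller"   // translated to PartiQL and run by Sneller
	TargetElastic  Target = "elastic"   // forwarded to the backing Elastic
	TargetNotFound Target = "not-found" // answered with 404 (not found)
)

// route decides how a request path such as "/my-index/_search" is handled.
// mapped holds the indexes configured for Sneller and hasBacking tells
// whether a backing Elastic is configured. Both inputs are assumptions.
func route(path string, mapped map[string]bool, hasBacking bool) Target {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) == 2 && mapped[parts[0]] &&
		(parts[1] == "_search" || parts[1] == "_count") {
		return TargetSneller
	}
	if hasBacking {
		return TargetElastic
	}
	return TargetNotFound
}

func main() {
	mapped := map[string]bool{"example-ip-logging": true}
	paths := []string{
		"/example-ip-logging/_search",
		"/example-ip-logging/_mapping",
		"/other-index/_count",
	}
	for _, p := range paths {
		fmt.Println(p, "->", route(p, mapped, true))
	}
}
```

With `hasBacking` set to `false`, the last two paths would resolve to `not-found` instead of `elastic`.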
Configuration
The Elastic proxy uses a single configuration file that is stored in `s3://<sneller-bucket>/db/elastic-proxy.json` and contains the following fields:
- `logPath` contains the S3 prefix that the Elastic proxy writes its logging to (e.g. `s3://sneller-bucket/log/elastic-proxy/`). The Elastic proxy dumps all requests every minute. Note that there are multiple nodes handling Elastic queries, so you may receive multiple objects per minute. The logging key appends `<yyyy>/<mm>/<dd>/<hh><mm><ss>-xxxxxxxxxx.ndjson.zst` to the configured prefix.
- `logFlags` determines what is being logged and may contain:
  - `logRequest` logs the actual Elastic query (optional, defaults to `true`).
  - `logQueryParameters` logs the query parameters that were passed in the URL (optional, defaults to `true`).
  - `logSQL` logs the generated PartiQL query that is actually being sent to Sneller (optional, defaults to `false`). This can also help when migrating from Elastic to Sneller, because the generated query may be used as a starting point for your own queries.
  - `logSnellerResult` logs the result that is being returned by the Sneller engine (optional, defaults to `false`).
  - `logPreprocessed` logs the result after pre-processing the Sneller data (optional, defaults to `false`).
  - `logResult` logs the actual Elastic query result (optional, defaults to `true`).

  Note that the Elastic proxy doesn't redact any values, so enabling logging may expose sensitive data in your logging and should be used with caution. Enabling `logSQL`, `logSnellerResult` or `logPreprocessed` is generally not useful. These flags exist only to diagnose issues with the Elastic proxy that are related to query/result translation.
- `mapping` contains the mapping between the Elastic index and the Sneller table. Each index that should be handled by Sneller must be defined in this mapping and has the following fields:
  - `database` is the name of the Sneller database that holds the table (required).
  - `table` is the name of the Sneller table that holds the actual data (required).
  - `ignoreTotalHits` omits the `hits.total` value. This value is often not used and query generation may be more efficient when it doesn't need to be calculated (optional, defaults to `false`). Enabling this optimization implicitly sets `ignoreSumOtherDocCount`.
  - `ignoreSumOtherDocCount` omits the `sum_other_doc_count` value in some bucket aggregations. This may result in more efficient query generation, so you may want to enable this option if you don't use the value (optional, defaults to `false`).
  - `typeMapping` annotates certain fields to allow proper PartiQL query translation. This may be needed for mapping integers to timestamps or to indicate that some fields should be treated as lists. More on this in the type mapping section.
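The fields above could be modeled in Go roughly as follows. This is a minimal sketch, not the Elastic proxy's own code: the type names (`Config`, `LogFlags`, `TableMapping`) and the helpers `boolOr` and `parseConfig` are hypothetical. Only the JSON field names and the documented defaults come from the description above. Pointer booleans are used so that an omitted flag can be told apart from an explicit `false`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LogFlags uses *bool so that "not set" can fall back to the documented default.
type LogFlags struct {
	LogRequest         *bool `json:"logRequest"`
	LogQueryParameters *bool `json:"logQueryParameters"`
	LogSQL             *bool `json:"logSQL"`
	LogSnellerResult   *bool `json:"logSnellerResult"`
	LogPreprocessed    *bool `json:"logPreprocessed"`
	LogResult          *bool `json:"logResult"`
}

// TableMapping describes one Elastic index handled by Sneller.
type TableMapping struct {
	Database               string                     `json:"database"`
	Table                  string                     `json:"table"`
	IgnoreTotalHits        bool                       `json:"ignoreTotalHits"`
	IgnoreSumOtherDocCount bool                       `json:"ignoreSumOtherDocCount"`
	TypeMapping            map[string]json.RawMessage `json:"typeMapping"`
}

// Config mirrors the top-level fields of elastic-proxy.json.
type Config struct {
	LogPath  string                  `json:"logPath"`
	LogFlags LogFlags                `json:"logFlags"`
	Mapping  map[string]TableMapping `json:"mapping"`
}

// boolOr returns *v, or def when the flag was omitted.
func boolOr(v *bool, def bool) bool {
	if v == nil {
		return def
	}
	return *v
}

// parseConfig decodes the JSON, checks the required fields and applies the
// rule that ignoreTotalHits implicitly sets ignoreSumOtherDocCount.
func parseConfig(data []byte) (*Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	for index, m := range c.Mapping {
		if m.Database == "" || m.Table == "" {
			return nil, fmt.Errorf("index %q: database and table are required", index)
		}
		if m.IgnoreTotalHits {
			m.IgnoreSumOtherDocCount = true
			c.Mapping[index] = m
		}
	}
	return &c, nil
}

func main() {
	raw := `{"mapping": {"idx": {"database": "db", "table": "t", "ignoreTotalHits": true}}}`
	cfg, err := parseConfig([]byte(raw))
	if err != nil {
		panic(err)
	}
	fmt.Println("logRequest:", boolOr(cfg.LogFlags.LogRequest, true))
	fmt.Println("logSQL:", boolOr(cfg.LogFlags.LogSQL, false))
	fmt.Println("ignoreSumOtherDocCount:", cfg.Mapping["idx"].IgnoreSumOtherDocCount)
}
```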
Type mapping
Sneller is schema-less, so it sometimes needs some help translating Elastic queries properly. The Elastic query may use an integer value for a timestamp and without an explicit mapping it will perform a numerical comparison instead of a timestamp comparison. To help the Elastic proxy, we use type mappings to indicate the type/format of a particular field.
The mapping looks like this:
{
  "timestamp": "unix_nano_seconds",
  "*.tags": "list",
  "message": {
    "type": "contains",
    "fields": {
      "raw": "text"
    }
  }
}
The `timestamp` and `*.tags` mappings specify only the type and use the compact definition, in which the value is just the type. The `message` field uses the extended definition, which also allows specifying the `fields`.
Note that a field name can contain wildcards. If there is an exact match, it is used. Otherwise, a matching wildcard is used. If multiple wildcards match, the most specific wildcard is used.
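The lookup order (exact match first, then the most specific wildcard) can be sketched as below. The text does not define "most specific", so this sketch assumes it means the pattern with the most literal (non-`*`) characters. That rule, and the `resolve` and `globMatch` helpers, are assumptions for illustration and may differ from what the Elastic proxy actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// globMatch reports whether name matches pattern, where '*' matches any
// (possibly empty) run of characters. No other wildcard is supported here.
func globMatch(pattern, name string) bool {
	parts := strings.Split(pattern, "*")
	if len(parts) == 1 {
		return pattern == name
	}
	if !strings.HasPrefix(name, parts[0]) {
		return false
	}
	name = name[len(parts[0]):]
	// Literal pieces between wildcards must appear in order.
	for _, mid := range parts[1 : len(parts)-1] {
		i := strings.Index(name, mid)
		if i < 0 {
			return false
		}
		name = name[i+len(mid):]
	}
	return strings.HasSuffix(name, parts[len(parts)-1])
}

// resolve returns the type for field: an exact match wins, otherwise the
// matching wildcard pattern with the most literal characters (assumed rule).
// Ties between equally specific patterns are resolved arbitrarily.
func resolve(mapping map[string]string, field string) (string, bool) {
	if t, ok := mapping[field]; ok {
		return t, true
	}
	best, bestLen := "", -1
	for pattern := range mapping {
		if !strings.Contains(pattern, "*") || !globMatch(pattern, field) {
			continue
		}
		if n := len(strings.ReplaceAll(pattern, "*", "")); n > bestLen {
			best, bestLen = pattern, n
		}
	}
	if bestLen < 0 {
		return "", false
	}
	return mapping[best], true
}

func main() {
	mapping := map[string]string{
		"timestamp": "unix_nano_seconds",
		"*.tags":    "list",
		"src.*":     "keyword",
	}
	for _, f := range []string{"timestamp", "src.tags", "src.ip", "message"} {
		t, ok := resolve(mapping, f)
		fmt.Println(f, t, ok)
	}
}
```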
Timestamps
All fields that should be treated as a timestamp should be included in the type mapping and specify one of the following types:
- `datetime` indicates that the field should be treated as a date/time (timestamp) value. Queries should use either the RFC3339 or RFC3339 (nano) syntax.
- `unix_seconds`, `unix_milli_seconds`, `unix_micro_seconds` or `unix_nano_seconds` indicates that the field should be treated as a date/time (timestamp) value. Queries should use a numerical or text value that denotes the timestamp in seconds (or one of the other units) since epoch (January 1st, 1970).
Text searching
Elastic is known for its extensive text-search features (including fuzzy searching). Sneller only supports a subset of this functionality. For ordinary text fields, the type mapping can specify one of the following types:
- `keyword` requires a full and exact match on the entire text.
- `keyword-ignore-case` requires a full and case-insensitive match.
- `text` requires a case-insensitive match on a word within the text.
- `contains` requires a case-insensitive match on a part of the text.
IMPORTANT: At this time, this setting is only used for query strings.
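The difference between the four text types can be illustrated with plain Go string predicates. This is only a sketch of the matching semantics as described above. The `matches` function is hypothetical, and the way words are split for the `text` type (on any character that is not a letter or digit) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// matches reports whether query matches text under the given type.
func matches(typ, text, query string) (bool, error) {
	switch typ {
	case "keyword":
		return text == query, nil
	case "keyword-ignore-case":
		return strings.EqualFold(text, query), nil
	case "text":
		// Assumed tokenization: split on anything that is not a letter or digit.
		words := strings.FieldsFunc(text, func(r rune) bool {
			return !unicode.IsLetter(r) && !unicode.IsDigit(r)
		})
		for _, w := range words {
			if strings.EqualFold(w, query) {
				return true, nil
			}
		}
		return false, nil
	case "contains":
		return strings.Contains(strings.ToLower(text), strings.ToLower(query)), nil
	}
	return false, fmt.Errorf("unknown text type %q", typ)
}

func main() {
	text := "Connection refused by host"
	for _, typ := range []string{"keyword", "keyword-ignore-case", "text", "contains"} {
		for _, q := range []string{"Connection refused by host", "REFUSED", "fus"} {
			ok, _ := matches(typ, text, q)
			fmt.Printf("%-20s %-28q %v\n", typ, q, ok)
		}
	}
}
```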
Elastic also supports `fields`, which allows multiple methods of searching the same text. The Elastic proxy supports this functionality using the `fields` keyword. You can define custom fields to allow searching for exact matches (`keyword`) and for just an occurrence of the specified search string (`contains`).
Lists
Lists require the type to be set to `list` to enable proper query generation that can search within lists.
Example
The following configuration creates the `example-ip-logging` index that maps to the `ip-logging` table in the `networking` database. It logs all the Elastic requests, query parameters and results. On top of that, it also logs the SQL query that is being generated.
{
  "logPath": "s3://sneller-cache/log/elastic-proxy/",
  "logFlags": {
    "logRequest": true,
    "logQueryParameters": true,
    "logSQL": true,
    "logResult": true
  },
  "mapping": {
    "example-ip-logging": {
      "database": "networking",
      "table": "ip-logging",
      "ignoreTotalHits": true,
      "typeMapping": {
        "timestamp": "unix_nano_seconds",
        "src.tags": "list",
        "dst.tags": "list",
        "message": {
          "type": "contains",
          "fields": {
            "raw": "text"
          }
        }
      }
    }
  }
}
The client doesn't use the `hits.total` field in the Elastic result, so this optimization is enabled by setting the `ignoreTotalHits` field.
Although Sneller stores the `timestamp` value using a native timestamp format, the actual Elastic query and result expect an integer value that represents the time in nanoseconds since epoch.
Both the `src.tags` and `dst.tags` fields are lists and are annotated as such. This allows Elastic queries to actually search within the list.
The `message` field contains text and uses the extended configuration format, where both `type` and `fields` are specified. The `type` is set to `contains`, which translates text queries to match a part of the text (instead of exact matches) when using query strings. It also defines the `raw` field, which should be treated as normal text.
Supported features
Text search
The Sneller query engine supports all standard SQL text search methods and also includes (limited) support for fuzzy matching. At this point, the Elastic proxy only supports partial, word or exact (either case sensitive or case insensitive) matches. Fuzzy search results are not yet supported.
Date/time formats and timezone
The Elastic proxy currently only supports the UTC timezone. Specifying other timezones is not flagged as an error, but is ignored.
Query string (Lucene syntax)
The Elastic proxy supports the Lucene query syntax, with the exception of fuzzy text matching. Fuzzy matching is not supported by the Elastic proxy yet, so such queries only return exact matches.
Scripting and runtime fields
The Elastic proxy doesn't support scripting. This also prevents the use of runtime mappings in queries.
Aggregations
Elastic supports many different kinds of aggregations. The Elastic proxy supports the most common ones. Aggregations are currently added only when customers make use of them. Note that not all aggregations are possible in Sneller due to limitations (e.g. scoring, scripting, ...). Please contact Sneller support if you need an aggregation that is currently not supported.
Bucket aggregations
Name | Supported | Remarks |
---|---|---|
Adjacency matrix | ❌ | |
Auto-interval date histogram | ❌ | |
Categorize text | ❌ | |
Children | ❌ | |
Composite | ❌ | |
Date histogram | ✅ | Week always starts on Sunday. When no documents match a certain date, then the date is omitted from the results. |
Date range | ❌ | |
Diversified sampler | ❌ | |
Filter | ✅ | |
Filters | ✅ | |
Frequent items | ❌ | |
Geo-distance | ❌ | |
Geohash grid | ❌ | |
Geohex grid | ❌ | |
Geotile grid | ✅ | |
Global | ❌ | |
Histogram | ✅ | When no documents match a certain value, then the value is omitted from the results. |
IP prefix | ❌ | |
IP range | ❌ | |
Missing | ❌ | |
Multi Terms | ✅ | |
Nested | ❌ | |
Parent | ❌ | |
Random sampler | ❌ | |
Range | ❌ | |
Rare terms | ❌ | |
Reverse nested | ❌ | |
Sampler | ❌ | |
Significant terms | ❌ | |
Significant text | ❌ | |
Terms | ✅ | |
Variable width histogram | ❌ | |
Subtleties of bucketing range fields | ❌ | |
Metric aggregations
Name | Supported | Remarks |
---|---|---|
Avg | ✅ | Missing value and histogram fields are not supported. |
Boxplot | ❌ | |
Cardinality | ✅ | Counts are always precise |
Extended stats | ❌ | |
Geo-bounds | ❌ | |
Geo-centroid | ✅ | |
Geo-Line | ❌ | |
Cartesian-bounds | ❌ | |
Cartesian-centroid | ❌ | |
Matrix stats | ❌ | |
Max | ✅ | Missing value and histogram fields are not supported. |
Median absolute deviation | ❌ | |
Min | ✅ | Missing value and histogram fields are not supported. |
Percentile ranks | ❌ | |
Percentiles | ❌ | |
Rate | ❌ | |
Scripted metric | ❌ | |
Stats | ❌ | |
String stats | ❌ | |
Sum | ✅ | |
T-test | ❌ | |
Top hits | ❌ | |
Top metrics | ❌ | |
Value count | ✅ | Counts are always precise |
Weighted avg | ❌ | |
Pipeline aggregations
Name | Supported | Remarks |
---|---|---|
Average bucket | ❌ | |
Bucket script | ❌ | |
Bucket count K-S test | ❌ | |
Bucket correlation | ❌ | |
Bucket selector | ❌ | |
Bucket sort | ❌ | |
Change point | ❌ | |
Cumulative cardinality | ❌ | |
Cumulative sum | ❌ | |
Derivative | ❌ | |
Extended stats bucket | ❌ | |
Inference bucket | ❌ | |
Max bucket | ❌ | |
Min bucket | ❌ | |
Moving function | ❌ | |
Moving percentiles | ❌ | |
Normalize | ❌ | |
Percentiles bucket | ❌ | |
Serial differencing | ❌ | |
Stats bucket | ❌ | |
Sum bucket | ❌ | |
Logging
The logging contains the following fields for each query that is executed via the Elastic proxy:
- `revision` is the revision and build date of the Elastic proxy (e.g. `c19d664b 2023-02-06T21:10:58Z`).
- `sourceIp` contains the IP address of the client.
- `tenantId` contains the tenant ID that belongs to the Sneller token.
- `queryId` is the query identifier. Each query that is executed by the Elastic proxy is assigned a unique identifier. When the query is executed successfully, this identifier is the same as the Sneller query identifier.
- `start` contains the timestamp (UTC) when the query was received by the Elastic proxy.
- `index` contains the index for which the query was issued.
- `duration` contains the duration (in nanoseconds) of the complete query.
- `httpStatusCode` contains the HTTP status code of the request.
- `sneller.endpoint` contains the Sneller endpoint that executed the actual query.
- `sneller.database` contains the name of the database that is associated with the `index`.
- `sneller.table` contains the name of the table that is associated with the `index`.
- `sneller.tokenLast4` contains the last 4 characters of the token that was used to execute the query.
- `sneller.cacheHits` contains the number of cache hits during query execution.
- `sneller.cacheMisses` contains the number of cache misses during query execution.
- `sneller.bytesScanned` contains the total number of bytes scanned to execute the query.
- `request` contains the actual Elastic request and typically is a nested object (only logged when the `logRequest` option is enabled).
- `queryParameters` contains a hash map that holds all the query parameters and their values that were passed to execute the query (only logged when the `logQueryParameters` option is enabled).
- `sql` contains the actual PartiQL query that is executed with the Sneller query engine (only logged when the `logSQL` option is enabled).
- `snellerResult` contains the raw result set that is returned by the Sneller query engine (only logged when the `logSnellerResult` option is enabled).
- `preprocessed` contains the result set that is returned by the Sneller query engine after it has been preprocessed for further handling (only logged when the `logPreprocessed` option is enabled).
- `result` contains the actual Elastic result that is returned by the Elastic proxy (only logged when the `logResult` option is enabled).
The Elastic proxy should have permission to write to the Elastic proxy logging bucket. The IAM role that is associated with the tenant is used to write the logging objects.
Analyze logging
The Elastic proxy logging is emitted in a format that can be ingested by Sneller for further analysis. Create a table definition that points to the Elastic proxy logging bucket and make sure the S3 event notification is set up correctly to automatically ingest the logging.
An example of the `definition.json` to ingest the logging looks like this:
{
  "input": [
    {
      "pattern": "s3://sneller-cache-bucket/log/elastic-proxy/*/*/*/*.ndjson.zst",
      "format": "json.zst"
    }
  ]
}
Limitations
The current Elastic proxy has some limitations:
- Only read-only operations are allowed. Due to the nature of Sneller, it is not possible to insert, update or delete records using Elastic queries.
- Although the Elastic proxy is highly optimized, it may still be more efficient to run PartiQL queries to reduce the amount of scanned data and reduce costs.
- Sneller is schema-less, whereas Elastic maintains a schema. The Elastic proxy can't tell if a column in the Elastic query is a number, timestamp or a text field. When an integer is used as a timestamp, then the Elastic proxy needs to be configured to parse that particular field as an Epoch-based timestamp.
- Elastic can apply a score to a result between 0 and 1. Sneller either returns a result or it doesn't. So scores are always 1.
Directories
Path | Synopsis |
---|---|
cmd | |
 | Package elastic_proxy provides facilities to translate queries for Elastic Search JSON and SQL-style queries for Sneller engine. |
 | Package helpers provides .env file parser. |
 | Package proxy_http provides common HTTP handlers to use in application. |