parquet

package
v1.33.1
Published: Jan 10, 2025 License: MIT Imports: 14 Imported by: 0

README

Parquet Output Plugin

This plugin writes metrics to Parquet files. By default, metrics are grouped by metric name and all written to the same file.

Important: If a metric's schema does not match the schema in the file, the metric is dropped.

To learn more about the Parquet format, check out the Parquet docs as well as a blog post on querying Parquet.

⭐ Telegraf v1.32.0 🏷️ datastore 💻 all

Global configuration options

In addition to the plugin-specific configuration settings, plugins support additional global and plugin configuration settings. These settings are used to modify metrics, tags, and fields, to create aliases, and to configure ordering, among other things. See CONFIGURATION.md for more details.

Configuration

# A plugin that writes metrics to parquet files
[[outputs.parquet]]
  ## Directory to write parquet files in. If a file already exists the output
  ## will attempt to continue using the existing file.
  # directory = "."

  ## Files are rotated after the time interval specified. When set to 0 no time
  ## based rotation is performed.
  # rotation_interval = "0h"

  ## Timestamp field name
  ## Field name to use to store the timestamp. If set to an empty string, then
  ## the timestamp is omitted.
  # timestamp_field_name = "timestamp"

Building Parquet Files

Schema

Parquet files require a schema when writing files. To generate a schema, Telegraf will go through all grouped metrics and generate an Apache Arrow schema based on the union of all fields and tags. If a field and tag have the same name then the field takes precedence.

A consequence of schema generation is that the first flush in which a metric is seen takes much longer, because Telegraf makes an additional pass over the metrics to generate the schema. Subsequent flushes are significantly faster.

When writing to a file, each column in the schema is looked up in the metric, and a null value is written if the metric has no value for that column. Conversely, fields that first appear after the initial metric flush are not part of the schema and are omitted.
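The schema rules above can be sketched in plain Go. This is a simplified illustration, not the plugin's code: the `metric` type and the `buildSchema`, `sortedColumns`, and `toRow` helpers are hypothetical, and type names stand in for the Apache Arrow schema the plugin actually generates.

```go
package main

import (
	"fmt"
	"sort"
)

// metric is a hypothetical, simplified stand-in for a Telegraf metric.
type metric struct {
	tags   map[string]string
	fields map[string]any
}

// buildSchema maps each column name to a type name, using the union of all
// tags and fields. Tags are registered first so that a field with the same
// name overrides the tag.
func buildSchema(metrics []metric) map[string]string {
	schema := make(map[string]string)
	for _, m := range metrics {
		for name := range m.tags {
			schema[name] = "string"
		}
	}
	for _, m := range metrics {
		for name, value := range m.fields {
			schema[name] = fmt.Sprintf("%T", value)
		}
	}
	return schema
}

// sortedColumns returns the schema's column names in a stable order.
func sortedColumns(schema map[string]string) []string {
	columns := make([]string, 0, len(schema))
	for name := range schema {
		columns = append(columns, name)
	}
	sort.Strings(columns)
	return columns
}

// toRow looks up every schema column in the metric. Columns the metric lacks
// stay nil (null); values that are not in the schema are silently omitted.
func toRow(columns []string, m metric) []any {
	row := make([]any, len(columns))
	for i, name := range columns {
		if value, ok := m.fields[name]; ok {
			row[i] = value
		} else if value, ok := m.tags[name]; ok {
			row[i] = value
		}
	}
	return row
}

func main() {
	// Metrics seen in the first flush define the schema.
	first := []metric{
		{tags: map[string]string{"host": "a"}, fields: map[string]any{"usage": 12.5}},
		{tags: map[string]string{"host": "b"}, fields: map[string]any{"idle": 80.0}},
	}
	schema := buildSchema(first)
	columns := sortedColumns(schema)
	fmt.Println(columns)

	// A later metric with a new field: "extra" is not in the schema.
	later := metric{tags: map[string]string{"host": "a"}, fields: map[string]any{"extra": int64(1)}}
	fmt.Println(toRow(columns, later))
}
```

In this sketch the later metric produces a row with nulls for `idle` and `usage`, and its `extra` field never reaches the file, which mirrors the behavior described above.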

Write

The plugin makes use of the buffered writer. This may buffer some metrics in memory before writing them to disk. This method is used because it can write multiple flushes of metrics more compactly into a single Parquet row group.

Additionally, the Parquet format requires a proper footer, so close must be called on the file to ensure it is properly formatted.

Close

Parquet files must close properly or the file will not be readable. The parquet format requires a footer at the end of the file and if that footer is not present then the file cannot be read correctly.

If Telegraf crashes while writing a Parquet file, the footer may never be written, leaving the file unreadable.

File Rotation

If a file with the same target name exists at start, the existing file is rotated to avoid overwriting it or conflicting with its schema.

File rotation is available via a time-based interval that a user can optionally set. Because a buffered writer is used, size-based rotation is not possible: the file may not actually receive data at each flush interval.

Explore Parquet Files

If a user wishes to explore a schema or data in a Parquet file quickly, then consider the options below:

CLI

The Arrow repo contains a Go CLI tool to read and parse Parquet files:

go install github.com/apache/arrow/go/v18/parquet/cmd/parquet_reader@latest
parquet_reader <file>

Python

Users can also use the pyarrow library to quickly open and explore Parquet files:

import pyarrow.parquet as pq

# Read the file into a pyarrow.Table, then print its column names and types.
table = pq.read_table('example.parquet')
print(table.schema)

Once the table is created, a user can look at the various pyarrow.Table functions to further explore the data.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Parquet

type Parquet struct {
	Directory          string          `toml:"directory"`
	RotationInterval   config.Duration `toml:"rotation_interval"`
	TimestampFieldName string          `toml:"timestamp_field_name"`
	Log                telegraf.Logger `toml:"-"`
	// contains filtered or unexported fields
}

func (*Parquet) Close

func (p *Parquet) Close() error

func (*Parquet) Connect

func (p *Parquet) Connect() error

func (*Parquet) Init

func (p *Parquet) Init() error

func (*Parquet) SampleConfig

func (*Parquet) SampleConfig() string

func (*Parquet) Write

func (p *Parquet) Write(metrics []telegraf.Metric) error
