blob

package
v0.40.1 Latest
Warning

This package is not in the latest version of its module.
Published: Feb 5, 2024 License: Apache-2.0 Imports: 30 Imported by: 0

README

runtime/blob

Package blob downloads batches of files ingested from remote sources such as S3 and GCS, using the Go CDK (https://pkg.go.dev/gocloud.dev), according to a user-supplied glob pattern.
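
To make the glob step concrete, here is a minimal, self-contained sketch of filtering listed object keys against a pattern while honouring a match-count limit and a total-size limit. It is not the package's implementation: the `object` type and `matchObjects` function are hypothetical, the standard library's `path.Match` is swapped in for whatever glob matcher the package uses (note that `path.Match` has no `**` support), and stopping quietly at a limit is an assumption; the package may well report an error instead.

```go
package main

import "path"

// object is a hypothetical stand-in for one entry returned by a bucket listing.
type object struct {
	Key  string
	Size int64
}

// matchObjects returns the objects whose key matches pattern, in listing order.
// It stops as soon as maxMatched objects have been collected or adding the next
// match would push the combined size above maxTotalSize bytes.
func matchObjects(objs []object, pattern string, maxMatched int, maxTotalSize int64) ([]object, error) {
	var out []object
	var total int64
	for _, o := range objs {
		ok, err := path.Match(pattern, o.Key)
		if err != nil {
			return nil, err // malformed pattern
		}
		if !ok {
			continue
		}
		if len(out) >= maxMatched || total+o.Size > maxTotalSize {
			break
		}
		out = append(out, o)
		total += o.Size
	}
	return out, nil
}
```

Called with the pattern `data/2024/*.csv`, this keeps only CSV keys directly under `data/2024/` and ignores everything else in the listing.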

How many files are downloaded, and how much of each file, is controlled by runtimev1.Source_ExtractPolicy. The package also supports ingesting partial files for some formats, such as Parquet and uncompressed csv/txt/tsv files.

It uses a planner to implement strategies for downloading.

A planner has a container which keeps track of files to be downloaded and a rowplanner which plans how much data per file needs to be downloaded.
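
The split between a file container and a row planner can be sketched roughly as follows. Everything here is hypothetical and only mirrors the shape of ExtractPolicy: the type names, the reading of "head" as "first" and "tail" as "last", and the treatment of RowsLimitBytes as a per-file byte cap are all assumptions rather than the package's documented behaviour.

```go
package main

// strategy mirrors the three ExtractPolicyStrategy values (local, hypothetical type).
type strategy int

const (
	strategyUnspecified strategy = iota
	strategyHead
	strategyTail
)

// policy mirrors the fields of ExtractPolicy.
type policy struct {
	RowsStrategy   strategy
	RowsLimitBytes uint64
	FilesStrategy  strategy
	FilesLimit     uint64
}

type file struct {
	Key  string
	Size uint64
}

// download describes the byte range planned for one file.
type download struct {
	Key    string
	Offset uint64
	Length uint64
}

// selectFiles plays the role of the container: it decides which files are kept.
func selectFiles(files []file, p policy) []file {
	n := uint64(len(files))
	if p.FilesStrategy == strategyUnspecified || p.FilesLimit >= n {
		return files
	}
	if p.FilesStrategy == strategyTail {
		return files[n-p.FilesLimit:]
	}
	return files[:p.FilesLimit]
}

// planRows plays the role of the row planner: it decides how much of a file to fetch.
func planRows(f file, p policy) download {
	if p.RowsStrategy == strategyUnspecified || p.RowsLimitBytes >= f.Size {
		return download{Key: f.Key, Offset: 0, Length: f.Size}
	}
	if p.RowsStrategy == strategyTail {
		return download{Key: f.Key, Offset: f.Size - p.RowsLimitBytes, Length: p.RowsLimitBytes}
	}
	return download{Key: f.Key, Offset: 0, Length: p.RowsLimitBytes}
}

// plan combines both steps into a list of ranged downloads.
func plan(files []file, p policy) []download {
	selected := selectFiles(files, p)
	out := make([]download, 0, len(selected))
	for _, f := range selected {
		out = append(out, planRows(f, p))
	}
	return out
}
```

A raw byte range is only a first approximation: a real partial read also has to respect record boundaries, such as line ends in csv files or row groups in Parquet files.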

For partial Parquet file ingestion it uses Apache Arrow for Go: https://github.com/apache/arrow/tree/master/go

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewIterator

func NewIterator(ctx context.Context, bucket *blob.Bucket, opts Options, l *zap.Logger) (drivers.FileIterator, error)

NewIterator returns an iterator for downloading objects matching a glob pattern and extract policy. The downloaded objects will be stored in a temporary directory with the same file hierarchy as in the bucket, enabling parsing of hive partitioning on the downloaded files. The client should call Close() once done to release all resources. Calling Close() on the iterator will also close the bucket.
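
Why preserving the bucket hierarchy matters for hive partitioning can be shown with a small standard-library sketch. The helpers `localPath` and `hivePartitions` are hypothetical illustrations, not functions of this package: the first maps an object key to a path under a temporary directory, the second reads `key=value` directory segments back out of that key.

```go
package main

import (
	"path/filepath"
	"strings"
)

// localPath places an object under tmpDir using the same hierarchy as its bucket key.
func localPath(tmpDir, key string) string {
	return filepath.Join(tmpDir, filepath.FromSlash(key))
}

// hivePartitions extracts key=value pairs from the directory segments of an object key.
// The final segment (the file name) is ignored.
func hivePartitions(key string) map[string]string {
	parts := map[string]string{}
	segs := strings.Split(key, "/")
	for _, seg := range segs[:len(segs)-1] {
		if k, v, ok := strings.Cut(seg, "="); ok && k != "" {
			parts[k] = v
		}
	}
	return parts
}
```

For a key such as `sales/year=2024/month=02/part-0.parquet`, the partition values `year` and `month` stay recoverable from the local path only because the directory layout is kept intact.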

Types

type ExtractPolicy added in v0.34.0

type ExtractPolicy struct {
	RowsStrategy   ExtractPolicyStrategy
	RowsLimitBytes uint64
	FilesStrategy  ExtractPolicyStrategy
	FilesLimit     uint64
}

func ParseExtractPolicy added in v0.34.0

func ParseExtractPolicy(cfg map[string]any) (*ExtractPolicy, error)

type ExtractPolicyStrategy added in v0.34.0

type ExtractPolicyStrategy int
const (
	ExtractPolicyStrategyUnspecified ExtractPolicyStrategy = 0
	ExtractPolicyStrategyHead        ExtractPolicyStrategy = 1
	ExtractPolicyStrategyTail        ExtractPolicyStrategy = 2
)

func (ExtractPolicyStrategy) String added in v0.34.0

func (s ExtractPolicyStrategy) String() string

type ObjectReader

type ObjectReader struct {
	// contains filtered or unexported fields
}

ObjectReader reads byte ranges from cloud objects. It implements the io.ReaderAt and io.Seeker interfaces.

func NewBlobObjectReader

func NewBlobObjectReader(ctx context.Context, bucket *blob.Bucket, obj *blob.ListObject) *ObjectReader

NewBlobObjectReader returns a new instance of ObjectReader.

func (*ObjectReader) Close

func (f *ObjectReader) Close() error

Close frees up resources, if any. Clients should call Close once done with the reader.

func (*ObjectReader) Read

func (f *ObjectReader) Read(p []byte) (int, error)

Read implements the io.Reader interface.

func (*ObjectReader) ReadAt

func (f *ObjectReader) ReadAt(p []byte, off int64) (int, error)

ReadAt implements the io.ReaderAt interface.

func (*ObjectReader) Seek

func (f *ObjectReader) Seek(offset int64, whence int) (int64, error)

Seek implements the io.Seeker interface.

func (*ObjectReader) Size

func (f *ObjectReader) Size() int64

Size returns the size of the object.
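
The following is a minimal sketch of how a reader with this method set could turn Read, ReadAt, and Seek calls into ranged fetches. It is not ObjectReader's implementation: `rangeReader`, its `fetch` callback, and `newMemoryReader` are hypothetical, and the callback reads from an in-memory slice where a real reader would presumably issue a ranged request against the bucket. Random access of this kind is what lets a Parquet reader fetch the footer at the end of an object without downloading the whole file.

```go
package main

import (
	"errors"
	"io"
)

// rangeReader serves reads by fetching only the requested byte range.
type rangeReader struct {
	size  int64
	pos   int64
	fetch func(off, length int64) ([]byte, error) // hypothetical ranged-fetch callback
}

// Compile-time checks that the sketch satisfies the standard interfaces.
var (
	_ io.Reader   = (*rangeReader)(nil)
	_ io.ReaderAt = (*rangeReader)(nil)
	_ io.Seeker   = (*rangeReader)(nil)
)

// newMemoryReader backs the reader with a byte slice, for demonstration only.
func newMemoryReader(data []byte) *rangeReader {
	return &rangeReader{
		size: int64(len(data)),
		fetch: func(off, length int64) ([]byte, error) {
			return data[off : off+length], nil
		},
	}
}

// ReadAt fetches up to len(p) bytes starting at off, without moving the cursor.
func (r *rangeReader) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, errors.New("negative offset")
	}
	if off >= r.size {
		return 0, io.EOF
	}
	length := int64(len(p))
	if off+length > r.size {
		length = r.size - off
	}
	b, err := r.fetch(off, length)
	if err != nil {
		return 0, err
	}
	n := copy(p, b)
	if n < len(p) {
		return n, io.EOF // io.ReaderAt requires an error on short reads
	}
	return n, nil
}

// Read reads from the current cursor position and advances it.
func (r *rangeReader) Read(p []byte) (int, error) {
	n, err := r.ReadAt(p, r.pos)
	r.pos += int64(n)
	return n, err
}

// Seek moves the cursor; no data is fetched until the next Read.
func (r *rangeReader) Seek(offset int64, whence int) (int64, error) {
	var abs int64
	switch whence {
	case io.SeekStart:
		abs = offset
	case io.SeekCurrent:
		abs = r.pos + offset
	case io.SeekEnd:
		abs = r.size + offset
	default:
		return 0, errors.New("invalid whence")
	}
	if abs < 0 {
		return 0, errors.New("negative position")
	}
	r.pos = abs
	return abs, nil
}

// Size returns the total size of the underlying object.
func (r *rangeReader) Size() int64 { return r.size }
```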

type Options

type Options struct {
	GlobMaxTotalSize      int64
	GlobMaxObjectsMatched int
	GlobMaxObjectsListed  int64
	GlobPageSize          int
	ExtractPolicy         *ExtractPolicy
	GlobPattern           string
	// StorageLimitInBytes is the total size the source may consume on disk. It is calculated upstream from
	// how much data the instance has already consumed across other sources and from the instance-level limits.
	// At this point it has the same implementation as GlobMaxTotalSize.
	StorageLimitInBytes int64
	// KeepFilesUntilClose retains downloaded files and deletes them only on Close
	KeepFilesUntilClose bool
	// BatchSizeBytes is the combined size of all files returned in one call to next()
	BatchSizeBytes int64
	// General blob format (json, csv, parquet, etc)
	Format string
}
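
As an illustration of what BatchSizeBytes implies, the sketch below groups files into batches whose combined size stays within a byte budget. The `file` type and `batches` function are hypothetical, and giving an oversized file a batch of its own is an assumption, not documented behaviour of the iterator.

```go
package main

// file is a hypothetical stand-in for a downloaded object and its size on disk.
type file struct {
	Key  string
	Size int64
}

// batches groups files, in order, so that each batch's combined size stays
// within batchSizeBytes. A single file larger than the budget forms its own batch.
func batches(files []file, batchSizeBytes int64) [][]file {
	var out [][]file
	var cur []file
	var curSize int64
	for _, f := range files {
		// Close the current batch if this file would push it over the budget.
		if len(cur) > 0 && curSize+f.Size > batchSizeBytes {
			out = append(out, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, f)
		curSize += f.Size
	}
	if len(cur) > 0 {
		out = append(out, cur)
	}
	return out
}
```

Under this reading, each call to the iterator's next step would hand back one such group, so a larger BatchSizeBytes means fewer, bigger batches.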
