blob

package
v0.52.1

Warning: This package is not in the latest version of its module.

Published: Dec 19, 2024 License: Apache-2.0 Imports: 31 Imported by: 0

README

runtime/blob

Package blob downloads batches of files from remote object stores such as S3 and GCS, selected by a user-supplied glob pattern. It is built on Google's Go CDK (https://pkg.go.dev/gocloud.dev).

runtimev1.Source_ExtractPolicy controls how many files are downloaded and how much data is downloaded from each file. The package also supports ingesting partial files for some formats, such as Parquet and unzipped csv/txt/tsv files.

It uses a planner to implement strategies for downloading.

A planner has a container, which keeps track of the files to be downloaded, and a rowplanner, which plans how much data needs to be downloaded per file.

For partial Parquet file ingestion it uses Apache Arrow for Go: https://github.com/apache/arrow/tree/master/go

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewIterator

func NewIterator(ctx context.Context, bucket *blob.Bucket, opts Options, l *zap.Logger) (drivers.FileIterator, error)

NewIterator returns an iterator for downloading objects matching a glob pattern and extract policy. The downloaded objects will be stored in a temporary directory with the same file hierarchy as in the bucket, enabling parsing of hive partitioning on the downloaded files. The client should call Close() once done to release all resources. Calling Close() on the iterator will also close the bucket.
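Because the downloaded files keep the bucket's directory layout, hive-style partition values can be recovered from a local path. The following is only a minimal sketch of that idea: hivePartitions and the example object key are hypothetical and are not part of this package.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// hivePartitions is a hypothetical helper (not part of package blob).
// It collects key=value directory segments from an object path and
// ignores the file name, which is the last segment.
func hivePartitions(objectPath string) map[string]string {
	parts := map[string]string{}
	dir := path.Dir(objectPath)
	for _, seg := range strings.Split(dir, "/") {
		key, value, ok := strings.Cut(seg, "=")
		if ok && key != "" {
			parts[key] = value
		}
	}
	return parts
}

func main() {
	// Hypothetical object key. A local copy under the temporary
	// directory would keep the same relative path.
	p := hivePartitions("sales/year=2024/month=01/data.parquet")
	fmt.Println(p["year"], p["month"]) // prints "2024 01"
}
```

Segments without an "=" (here "sales") are skipped, so only the partition columns end up in the map.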

Types

type Bucket added in v0.48.0

type Bucket struct {
	// contains filtered or unexported fields
}

Bucket wraps a blob.Bucket with functionality for implementing the drivers.ObjectStore interface. NOTE: It currently only supports listing objects, but eventually we should refactor NewIterator to a member function of this struct.

func NewBucket added in v0.48.0

func NewBucket(bucket *blob.Bucket, logger *zap.Logger) (*Bucket, error)

NewBucket wraps a *blob.Bucket. It takes ownership of the bucket and will close it when Close is called.

func (*Bucket) Close added in v0.48.0

func (b *Bucket) Close() error

Close closes the underlying bucket.

func (*Bucket) ListObjects added in v0.48.0

func (b *Bucket) ListObjects(ctx context.Context, glob string) ([]drivers.ObjectStoreEntry, error)

ListObjects lists objects in the bucket that match the given glob pattern. The glob pattern should be a valid path *without* scheme or bucket name. E.g. to list gs://my-bucket/path/to/files/*, the glob pattern should be "path/to/files/*".
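As an illustration of the rule above, a full object URL can be split into a bucket name and a bucket-relative glob before calling ListObjects. This is a sketch under stated assumptions: globFromURL is a hypothetical helper, and the standard library's path.Match stands in for whatever glob matcher the package actually uses, which may support a richer syntax.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// globFromURL is a hypothetical helper (not part of package blob).
// It splits a full object URL into the bucket name and a glob that
// carries neither the scheme nor the bucket name.
func globFromURL(raw string) (bucket, glob string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	return u.Host, strings.TrimPrefix(u.Path, "/"), nil
}

func main() {
	bucket, glob, err := globFromURL("gs://my-bucket/path/to/files/*")
	if err != nil {
		panic(err)
	}
	fmt.Println(bucket, glob) // prints "my-bucket path/to/files/*"

	// path.Match is only a stand-in for the package's real matcher.
	ok, _ := path.Match(glob, "path/to/files/a.csv")
	fmt.Println(ok) // prints "true"
}
```

Note that net/url treats "?" as the start of a query string, so a pattern that uses the "?" wildcard would need different handling than this sketch provides.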

type ExtractPolicy added in v0.34.0

type ExtractPolicy struct {
	RowsStrategy   ExtractPolicyStrategy
	RowsLimitBytes uint64
	FilesStrategy  ExtractPolicyStrategy
	FilesLimit     uint64
}
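One plausible reading of FilesStrategy and FilesLimit is "keep the first or last N files of the listing". The sketch below illustrates that reading only. The types are local mirrors of the documented ones so that the example compiles on its own, selectFiles is a hypothetical helper, and the treatment of a zero limit or an unspecified strategy is an assumption rather than documented behaviour.

```go
package main

import "fmt"

// Local mirrors of the documented types, so the sketch is self-contained.
type ExtractPolicyStrategy int

const (
	ExtractPolicyStrategyUnspecified ExtractPolicyStrategy = 0
	ExtractPolicyStrategyHead        ExtractPolicyStrategy = 1
	ExtractPolicyStrategyTail        ExtractPolicyStrategy = 2
)

type ExtractPolicy struct {
	RowsStrategy   ExtractPolicyStrategy
	RowsLimitBytes uint64
	FilesStrategy  ExtractPolicyStrategy
	FilesLimit     uint64
}

// selectFiles is a hypothetical helper (not part of package blob).
// Head keeps the first FilesLimit entries, Tail keeps the last ones.
// Assumption: a zero limit or any other strategy keeps every file.
func selectFiles(files []string, p ExtractPolicy) []string {
	if p.FilesLimit == 0 || p.FilesLimit >= uint64(len(files)) {
		return files
	}
	n := int(p.FilesLimit)
	switch p.FilesStrategy {
	case ExtractPolicyStrategyHead:
		return files[:n]
	case ExtractPolicyStrategyTail:
		return files[len(files)-n:]
	default:
		return files
	}
}

func main() {
	files := []string{"a.csv", "b.csv", "c.csv", "d.csv"}
	head := ExtractPolicy{FilesStrategy: ExtractPolicyStrategyHead, FilesLimit: 2}
	tail := ExtractPolicy{FilesStrategy: ExtractPolicyStrategyTail, FilesLimit: 2}
	fmt.Println(selectFiles(files, head)) // prints "[a.csv b.csv]"
	fmt.Println(selectFiles(files, tail)) // prints "[c.csv d.csv]"
}
```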

func ParseExtractPolicy added in v0.34.0

func ParseExtractPolicy(cfg map[string]any) (*ExtractPolicy, error)

type ExtractPolicyStrategy added in v0.34.0

type ExtractPolicyStrategy int
const (
	ExtractPolicyStrategyUnspecified ExtractPolicyStrategy = 0
	ExtractPolicyStrategyHead        ExtractPolicyStrategy = 1
	ExtractPolicyStrategyTail        ExtractPolicyStrategy = 2
)

func (ExtractPolicyStrategy) String added in v0.34.0

func (s ExtractPolicyStrategy) String() string
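A String method makes the type satisfy fmt.Stringer, so values print as labels rather than integers. The sketch below shows the general pattern on a local copy of the type. The label strings are illustrative assumptions, since the documentation does not state what the package's own method returns.

```go
package main

import "fmt"

// Local copy of the documented type, so the sketch is self-contained.
type ExtractPolicyStrategy int

const (
	ExtractPolicyStrategyUnspecified ExtractPolicyStrategy = 0
	ExtractPolicyStrategyHead        ExtractPolicyStrategy = 1
	ExtractPolicyStrategyTail        ExtractPolicyStrategy = 2
)

// String returns a human-readable label.
// Assumption: these labels are illustrative, not the package's output.
func (s ExtractPolicyStrategy) String() string {
	switch s {
	case ExtractPolicyStrategyHead:
		return "head"
	case ExtractPolicyStrategyTail:
		return "tail"
	default:
		return "unspecified"
	}
}

func main() {
	fmt.Println(ExtractPolicyStrategyTail)        // fmt calls String: prints "tail"
	fmt.Printf("%d\n", ExtractPolicyStrategyTail) // prints "2"
}
```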

type ObjectReader

type ObjectReader struct {
	// contains filtered or unexported fields
}

ObjectReader reads ranges of bytes from cloud objects. It implements the io.ReaderAt and io.Seeker interfaces.

func NewBlobObjectReader

func NewBlobObjectReader(ctx context.Context, bucket *blob.Bucket, obj *blob.ListObject) *ObjectReader

NewBlobObjectReader returns a new instance of ObjectReader.

func (*ObjectReader) Close

func (f *ObjectReader) Close() error

Close frees up resources, if any. Clients should call Close once they are done with the reader.

func (*ObjectReader) Read

func (f *ObjectReader) Read(p []byte) (int, error)

Read implements the io.Reader interface.

func (*ObjectReader) ReadAt

func (f *ObjectReader) ReadAt(p []byte, off int64) (int, error)

ReadAt implements the io.ReaderAt interface.

func (*ObjectReader) Seek

func (f *ObjectReader) Seek(offset int64, whence int) (int64, error)

Seek implements the io.Seeker interface.

func (*ObjectReader) Size

func (f *ObjectReader) Size() int64

Size returns the size of the object.
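Range reads through io.ReaderAt, combined with Size, are what make partial ingestion possible: for example, a Parquet file ends with the 4-byte magic "PAR1", so a reader can inspect the end of an object without downloading the rest. The sketch below shows the access pattern only. memObject is an in-memory stand-in for a cloud object, and sizedReaderAt and readTail are hypothetical names that are not part of this package.

```go
package main

import (
	"fmt"
	"io"
)

// sizedReaderAt is the subset of behaviour this sketch relies on.
type sizedReaderAt interface {
	io.ReaderAt
	Size() int64
}

// memObject is an in-memory stand-in for a cloud object.
type memObject []byte

func (m memObject) Size() int64 { return int64(len(m)) }

func (m memObject) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, fmt.Errorf("negative offset %d", off)
	}
	if off >= int64(len(m)) {
		return 0, io.EOF
	}
	n := copy(p, m[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

// readTail is a hypothetical helper. It reads the last n bytes of an
// object, or the whole object if it is shorter than n.
func readTail(r sizedReaderAt, n int64) ([]byte, error) {
	if n < 0 {
		n = 0
	}
	if n > r.Size() {
		n = r.Size()
	}
	buf := make([]byte, n)
	read, err := r.ReadAt(buf, r.Size()-n)
	if err != nil && err != io.EOF {
		return nil, err
	}
	return buf[:read], nil
}

func main() {
	obj := memObject("PAR1...column data and footer...PAR1")
	tail, err := readTail(obj, 4)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(tail)) // prints "PAR1"
}
```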

type Options

type Options struct {
	GlobMaxTotalSize      int64
	GlobMaxObjectsMatched int
	GlobMaxObjectsListed  int64
	GlobPageSize          int
	ExtractPolicy         *ExtractPolicy
	GlobPattern           string
	// KeepFilesUntilClose retains downloaded files and deletes them only on close
	KeepFilesUntilClose bool
	// RetainFiles retains files for debugging purposes
	RetainFiles bool
	// BatchSizeBytes is the combined size of all files returned in one call to next()
	BatchSizeBytes int64
	// General blob format (json, csv, parquet, etc)
	Format string
	// TempDir where temporary files should be stored
	TempDir string
}
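To make the BatchSizeBytes comment concrete, the sketch below groups a listing into batches whose combined size stays within a byte budget. It is one possible interpretation rather than the package's implementation: object and batches are hypothetical names, and giving an oversized object a batch of its own is an assumption.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a listed bucket entry.
type object struct {
	Path string
	Size int64
}

// batches groups objects in listing order so that the combined size of
// each batch stays within batchSizeBytes.
// Assumption: an object larger than the limit gets a batch of its own.
func batches(objs []object, batchSizeBytes int64) [][]object {
	var out [][]object
	var cur []object
	var curSize int64
	for _, o := range objs {
		// Start a new batch when adding o would exceed the budget.
		if len(cur) > 0 && curSize+o.Size > batchSizeBytes {
			out = append(out, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, o)
		curSize += o.Size
	}
	if len(cur) > 0 {
		out = append(out, cur)
	}
	return out
}

func main() {
	objs := []object{{"a", 40}, {"b", 50}, {"c", 30}, {"d", 120}, {"e", 10}}
	for _, b := range batches(objs, 100) {
		fmt.Println(len(b)) // prints 2, 1, 1, 1 on separate lines
	}
}
```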
