blob

package

v0.48.2 Latest Latest Go to latest Published: Aug 23, 2024 License: Apache-2.0 Imports: 31 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/rilldata/rill

Links

Open Source Insights

README ¶

`runtime/blob`

Package blob provides a way to download a batch of files ingested from remote sources like s3/gcs using google's go cdk (https://pkg.go.dev/gocloud.dev) as per user's glob pattern.

How many files are downloaded and how much data from a file is downloaded is controlled by runtimev1.Source_ExtractPolicy It also has support for ingesting partial files for some formats like parquet, unzipped csv/txt/tsv files.

It uses a planner to implement strategies for downloading.

A planner has a container which keeps track of files to be downloaded and a rowplanner which plans how much data per file needs to be downloaded.

For partial parquet file ingestion it uses apache arrow for go : https://github.com/apache/arrow/tree/master/go

Documentation ¶

Index ¶

func NewIterator(ctx context.Context, bucket *blob.Bucket, opts Options, l *zap.Logger) (drivers.FileIterator, error)
type Bucket
- func NewBucket(bucket *blob.Bucket, logger *zap.Logger) (*Bucket, error)
- func (b *Bucket) Close() error
- func (b *Bucket) ListObjects(ctx context.Context, glob string) ([]drivers.ObjectStoreEntry, error)
type ExtractPolicy
- func ParseExtractPolicy(cfg map[string]any) (*ExtractPolicy, error)
type ExtractPolicyStrategy
- func (s ExtractPolicyStrategy) String() string
type ObjectReader
- func NewBlobObjectReader(ctx context.Context, bucket *blob.Bucket, obj *blob.ListObject) *ObjectReader
type Options

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func NewIterator ¶

func NewIterator(ctx context.Context, bucket *blob.Bucket, opts Options, l *zap.Logger) (drivers.FileIterator, error)

NewIterator returns an iterator for downloading objects matching a glob pattern and extract policy. The downloaded objects will be stored in a temporary directory with the same file hierarchy as in the bucket, enabling parsing of hive partitioning on the downloaded files. The client should call Close() once done to release all resources. Calling Close() on the iterator will also close the bucket.

Types ¶

type Bucket ¶ added in v0.48.0

type Bucket struct {
	// contains filtered or unexported fields
}

Bucket wraps a blob.Bucket with functionality for implementing the drivers.ObjectStore interface. NOTE: It currently only supports listing objects, but eventually we should refactor NewIterator to a member function of this struct.

func NewBucket ¶ added in v0.48.0

func NewBucket(bucket *blob.Bucket, logger *zap.Logger) (*Bucket, error)

NewBucket wraps a *blob.Bucket. It takes ownership of the bucket and will close it when Close is called.

func (*Bucket) Close ¶ added in v0.48.0

func (b *Bucket) Close() error

Close the underlying bucket.

func (*Bucket) ListObjects ¶ added in v0.48.0

func (b *Bucket) ListObjects(ctx context.Context, glob string) ([]drivers.ObjectStoreEntry, error)

ListObjects lists objects in the bucket that match the given glob pattern. The glob pattern should be a valid path *without* scheme or bucket name. E.g. to list gs://my-bucket/path/to/files/*, the glob pattern should be "path/to/files/*".

type ExtractPolicy ¶ added in v0.34.0

type ExtractPolicy struct {
	RowsStrategy   ExtractPolicyStrategy
	RowsLimitBytes uint64
	FilesStrategy  ExtractPolicyStrategy
	FilesLimit     uint64
}

func ParseExtractPolicy ¶ added in v0.34.0

func ParseExtractPolicy(cfg map[string]any) (*ExtractPolicy, error)

type ExtractPolicyStrategy ¶ added in v0.34.0

type ExtractPolicyStrategy int

const (
	ExtractPolicyStrategyUnspecified ExtractPolicyStrategy = 0
	ExtractPolicyStrategyHead        ExtractPolicyStrategy = 1
	ExtractPolicyStrategyTail        ExtractPolicyStrategy = 2
)

func (ExtractPolicyStrategy) String ¶ added in v0.34.0

func (s ExtractPolicyStrategy) String() string

type ObjectReader ¶

type ObjectReader struct {
	// contains filtered or unexported fields
}

ObjectReader reads range of bytes from cloud objects implements io.ReaderAt and io.Seeker interfaces

func NewBlobObjectReader ¶

func NewBlobObjectReader(ctx context.Context, bucket *blob.Bucket, obj *blob.ListObject) *ObjectReader

NewBlobObjectReader returns new instance of ObjectReader

func (*ObjectReader) Close ¶

func (f *ObjectReader) Close() error

Close frees up resources if any clients should call Close once done with reader

func (*ObjectReader) Read ¶

func (f *ObjectReader) Read(p []byte) (int, error)

Read implements io.Reader interface

func (*ObjectReader) ReadAt ¶

func (f *ObjectReader) ReadAt(p []byte, off int64) (int, error)

ReadAt implements io.ReaderAt interface

func (*ObjectReader) Seek ¶

func (f *ObjectReader) Seek(offset int64, whence int) (int64, error)

Seek implements io.Seeker interface

func (*ObjectReader) Size ¶

func (f *ObjectReader) Size() int64

Size returns size of the object

type Options ¶

type Options struct {
	GlobMaxTotalSize      int64
	GlobMaxObjectsMatched int
	GlobMaxObjectsListed  int64
	GlobPageSize          int
	ExtractPolicy         *ExtractPolicy
	GlobPattern           string
	// Although at this point GlobMaxTotalSize and StorageLimitInBytes have same impl but
	// this is total size the source should consume on disk and is calculated upstream basis how much data one instance has already consumed
	// across other sources and the instance level limits
	StorageLimitInBytes int64
	// Retain files and only delete during close
	KeepFilesUntilClose bool
	// Retainfiles retains files for debugging purposes
	RetainFiles bool
	// BatchSizeBytes is the combined size of all files returned in one call to next()
	BatchSizeBytes int64
	// General blob format (json, csv, parquet, etc)
	Format string
	// TempDir where temporary files should be stored
	TempDir string
}

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL