dataflux

package

v1.46.0 Latest Latest Go to latest Published: Oct 31, 2024 License: Apache-2.0 Imports: 12 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/googleapis/google-cloud-go

Links

Open Source Insights

README ¶

Dataflux for Google Cloud Storage Go client library

Overview

The purpose of this client is to quickly list data stored in GCS.

Fast List

The fast list component of this client leverages GCS API to parallelize the listing of files within a GCS bucket. It does this by implementing a workstealing algorithm, where each worker in the list operation is able to steal work from its siblings once it has finished all currently stated listing work. This parallelization leads to a significant real world speed increase than sequential listing. Note that paralellization is limited by the machine on which the client runs.

Benchmarking has demonstrated that the larger the object count, the better Dataflux performs when compared to a linear listing. Around 100k objects, users will see improvemement in listing speed.

Example Usage

First create a storage.Client to use throughout your application:

ctx := context.Background()
client, err := storage.NewClient(ctx)
if err != nil {
    log.Fatal(err)
}


// storage.Query to filter objects that the user wants to list.
query := storage.Query{}
// Input for fast-listing.
dfopts := dataflux.ListerInput{
    BucketName:		"bucket",
    Parallelism:	500,
    BatchSize:		500000,
    Query:			query,
}

// Construct a dataflux lister.
df, close = dataflux.NewLister(sc, dfopts)
defer close()

// List objects in GCS bucket.
for {
    objects, err := df.NextBatch(ctx)

    if err == iterator.Done {
        // No more objects in the bucket to list.
        break
        }
    if err != nil {
        log.Fatal(err)
        }
    // TODO: process objects
}

Fast List Benchmark Results

VM used : n2d-standard-48 Region: us-central1-a NIC type: gVNIC

File Count	VM Core Count	List Time Without Dataflux	List Time With Dataflux
5000000 Obj	48 Core	319.72s	17.35s
1999032 Obj	48 Core	139.54s	8.98s
578703 Obj	48 Core	32.90s	5.71s
10448 Obj	48 Core	750.50ms	637.17ms

Documentation ¶

Overview ¶

Package dataflux provides an easy way to parallelize listing in Google Cloud Storage.

More information about Google Cloud Storage is available at https://cloud.google.com/storage/docs.

See https://pkg.go.dev/cloud.google.com/go for authentication, timeouts, connection pooling and similar aspects of this package.

NOTE: This package is in preview. It is not stable, and is likely to change.

Index ¶

type Lister
- func NewLister(c *storage.Client, in *ListerInput) *Lister
- func (c *Lister) Close()
- func (c *Lister) NextBatch(ctx context.Context) ([]*storage.ObjectAttrs, error)
type ListerInput

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Lister ¶

type Lister struct {
	// contains filtered or unexported fields
}

Lister is used for interacting with Dataflux fast-listing. The caller should initialize it with NewLister() instead of creating it directly.

func NewLister ¶

func NewLister(c *storage.Client, in *ListerInput) *Lister

NewLister creates a new dataflux Lister to list objects in the give bucket.

func (*Lister) Close ¶

func (c *Lister) Close()

Close closes the range channel of the Lister.

func (*Lister) NextBatch ¶

func (c *Lister) NextBatch(ctx context.Context) ([]*storage.ObjectAttrs, error)

NextBatch runs worksteal algorithm and sequential listing in parallel to quickly return a list of objects in the bucket. For smaller dataset, sequential listing is expected to be faster. For larger dataset, worksteal listing is expected to be faster.

type ListerInput ¶

type ListerInput struct {
	// BucketName is the name of the bucket to list objects from. Required.
	BucketName string

	// Parallelism is number of parallel workers to use for listing.
	// Default value is 10x number of available CPU. Optional.
	Parallelism int

	// BatchSize is the number of objects to list. Default value returns
	// all objects at once. The number of objects returned will be
	// rounded up to a multiple of gcs page size. Optional.
	BatchSize int

	// Query is the query to filter objects for listing. Default value is nil.
	// Use ProjectionNoACL for faster listing. Including ACLs increases
	// latency while fetching objects. Optional.
	Query storage.Query

	// SkipDirectoryObjects is to indicate whether to list directory objects.
	// Default value is false. Optional.
	SkipDirectoryObjects bool
}

ListerInput contains options for listing objects.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL