transfer module v0.0.0-rc1
Published: Sep 16, 2024 License: Apache-2.0

Transfer: Cloud Native Ingestion engine

🦫 Introduction

Transfer, built in Go, is an open-source cloud-native ingestion engine. Essentially, we are building a no-code (or low-code) EL(T) service that can scale data pipelines from several megabytes of data to dozens of petabytes without hassle.

Transfer provides a convenient way to transfer data between DBMSes, object stores, message brokers, or anything else that stores data. Our ultimate mission is to help you move data from any source to any destination with a fast, effective, and easy-to-use tool.

🚀 Try Transfer

1. Transfer Serverless Cloud

The fastest way to try Transfer is Double Cloud.

2. Using CLI

Build from source:

make build

3. Using a Docker container

docker pull ghcr.io/doublecloud/transfer:dev

To run Transfer quickly:

docker run ghcr.io/doublecloud/transfer:dev activate --help

🚀 Getting Started

Ingestion from OLTP
Streaming Ingestion
CDC Streaming into Kafka
Semi-structured Ingestion
Airbyte compatibility
Transformers
Data parsers
Scaling Snapshot
Scaling Replication
Performance

🚀 Why Transfer

  • Cloud-Native: Single binary and cloud-native as heck, just drop it into your k8s cluster and be happy.

  • High Performance: Go-built, with cutting-edge, high-speed vectorized execution. 👉 Bench.

  • Data Simplification: Streamlines data ingestion, no code needed. 👉 Data Loading.

  • Schema Inference: Automatically syncs not just data but also data schemas.

  • Format Flexibility: Supports multiple data formats and types, including JSON, CSV, Parquet, Proto, and more.

  • ACID Transactions: Ensures data integrity with atomic, consistent, isolated, and durable operations.

  • Schemafull: Type system enabling schema-full data storage with flexible data modeling.

  • Community-Driven: Join a welcoming community for a user-friendly cloud analytics experience.

⚡ Performance

Benchmark chart: Naive-s3-vs-airbyte

πŸ“ Architecture

Transfer is a pluggable Go package: each plugin is compiled into the transfer binary and registers itself with it. A transfer plugin can be one of:

  1. Storage - one-time data reader
  2. Sink - data writer
  3. Source - streaming data reader

A data pipeline is composed of two Endpoints: a Source and a Destination. Each data pipeline is essentially a link between a Source ({Storage|Source}) and a Destination ({Sink}). Transfer is a LOGICAL data transfer service: the minimum unit of data is a logical ROW (object). Between source and target we communicate via ChangeItems. These items are batched, and stateless Transformations may be applied to them. The whole pipeline is called a Transfer.

We can compose these primitives to create three main types of connection:

  1. {Storage} + {Sink} = Snapshot
  2. {Source} + {Sink} = Replication
  3. {Storage} + {Source} + {Sink} = Snapshot and Replication
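The plugin contract and the snapshot composition above can be sketched with toy in-memory types. The names (ChangeItem, Storage, Source, Sink) mirror the text, but the shapes are illustrative, not Transfer's actual interfaces:

```go
package main

import "fmt"

// ChangeItem is the logical unit of transfer: one row-level event.
type ChangeItem struct {
	Table string
	Kind  string // "insert", "update", "delete"
	Key   string
	Value string
}

// Storage is a one-time bulk reader (snapshot source).
type Storage interface {
	ReadAll(push func(ChangeItem) error) error
}

// Source is a streaming reader (replication source).
type Source interface {
	Run(push func(ChangeItem) error) error
}

// Sink consumes batches of change items.
type Sink interface {
	Push(items []ChangeItem) error
}

// Snapshot = {Storage} + {Sink}: drain the storage once into the sink.
func Snapshot(st Storage, sk Sink) error {
	var batch []ChangeItem
	if err := st.ReadAll(func(ci ChangeItem) error {
		batch = append(batch, ci)
		return nil
	}); err != nil {
		return err
	}
	return sk.Push(batch)
}

// memStorage and memSink are toy in-memory implementations for the sketch.
type memStorage struct{ rows []ChangeItem }

func (m *memStorage) ReadAll(push func(ChangeItem) error) error {
	for _, r := range m.rows {
		if err := push(r); err != nil {
			return err
		}
	}
	return nil
}

type memSink struct{ got []ChangeItem }

func (m *memSink) Push(items []ChangeItem) error {
	m.got = append(m.got, items...)
	return nil
}

func main() {
	st := &memStorage{rows: []ChangeItem{{Table: "users", Kind: "insert", Key: "1", Value: "alice"}}}
	sk := &memSink{}
	if err := Snapshot(st, sk); err != nil {
		panic(err)
	}
	fmt.Println(len(sk.got)) // 1
}
```

A Replication composition would look the same, except that Source.Run pushes an endless stream instead of terminating after one pass.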

These two directions (snapshot and replication) are conceptually different and place different requirements on specific storages. Snapshot and Replication stages can follow each other. Event channels are conceptually unaware of the database types they bind. We mainly build cross-system data connections (which we call Hetero replications), so we do not add storage-specific adjustments for them (type fitting or schema adjustment). For connections between storages of the same type, however, the system can tell Sources, Storages, and Sinks that they are homogeneous (a Homo replication) and apply some fine-tuning to improve accuracy. Apart from this, cross-db-type connections should NOT know what type of storage is on the other side.

Storage / SnapshotProvider

A primitive for reading data in large blocks. The resulting stream consists of events of a single type: row insertion. It can provide different levels of read-consistency guarantees, depending on the depth of integration with a particular database.

Row-level Guarantee

At the most primitive level, it is enough for the storage to read all logical rows from the source. In this case, the unit of consistency is the row itself. For example, if one row is one file on disk, then reading the directory guarantees consistency only within each individual file.

Table-level Guarantee

Rows are logically grouped into sets of homogeneous rows, usually tables. If the source can read a consistent snapshot of the rows of one table, then we can guarantee consistency at the level of the entire table. From the contract's point of view, table-level and row-level consistency are indistinguishable for us.

Whole Storage

Whole-storage consistency can be achieved if we can take a consistent snapshot and reuse it to read several tables (for example, by reading sequentially in one transaction, or by keeping a transaction pool pinned to a single database state).

Point of replication (Replication Slot)

If the source can atomically take a snapshot together with a mark for future replication, we can implement a consistent transition from the snapshot to the replica.
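A toy model of that handoff, with the "atomic" snapshot-plus-mark reduced to an in-memory copy (a real database would, for example, create a replication slot inside the snapshot transaction; all names here are illustrative):

```go
package main

import "fmt"

// SnapshotMark pairs a consistent snapshot with the log position at which
// future replication must resume to avoid gaps or double-applied events.
type SnapshotMark struct {
	Rows      map[string]string
	ResumeLSN int
}

// TakeSnapshot atomically captures the table state and the current log
// position; replication then replays only events after ResumeLSN.
func TakeSnapshot(table map[string]string, currentLSN int) SnapshotMark {
	copyRows := make(map[string]string, len(table))
	for k, v := range table {
		copyRows[k] = v
	}
	return SnapshotMark{Rows: copyRows, ResumeLSN: currentLSN}
}

func main() {
	table := map[string]string{"1": "a"}
	mark := TakeSnapshot(table, 42)
	table["2"] = "b" // a change after the mark: must arrive via replication
	fmt.Println(len(mark.Rows), mark.ResumeLSN) // 1 42
}
```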

Summary

From a contractual point of view, consistency at the table/row level is indistinguishable for us: we have no reliable way to determine what level of guarantee the data was read with.

Source / ReplicationProvider

A streaming primitive: an endless row-by-row stream of CRUD events. In logical replication there are conceptually only three event types: create, edit, and delete. For edits and deletes we need some way to identify the object being operated on, so to support such events we expect the source itself to be able to provide that identity.
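A minimal sketch of applying such a CRUD stream to a keyed table (the event shape is illustrative, not Transfer's actual ChangeItem):

```go
package main

import "fmt"

// Event mirrors the three logical replication event kinds.
type Event struct {
	Kind  string // "create", "edit", "delete"
	Key   string // edits and deletes must identify the target row
	Value string
}

// Apply replays a CRUD event stream onto an in-memory table keyed by primary key.
func Apply(table map[string]string, events []Event) {
	for _, e := range events {
		switch e.Kind {
		case "create", "edit":
			table[e.Key] = e.Value
		case "delete":
			delete(table, e.Key)
		}
	}
}

func main() {
	table := map[string]string{}
	Apply(table, []Event{
		{Kind: "create", Key: "1", Value: "a"},
		{Kind: "edit", Key: "1", Value: "b"},
		{Kind: "create", Key: "2", Value: "c"},
		{Kind: "delete", Key: "2"},
	})
	fmt.Println(table) // map[1:b]
}
```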

For some storages such events can be grouped into transactions.

Once the replication process starts, we apply this stream of actions to the target, trying to minimize the data lag between the source and target databases.

At the replication source level, we maintain different levels of consistency:

Row

This is the most basic level: if the source does not relate rows to each other, the guarantee exists only at the row level. An example is MongoDB in FullDocument mode, where each event concerns one row living in its own timeline. Events at this level of guarantee carry no transaction tag, and their logical source timestamps (LSNs) are either absent or not in strict order.

Table

If the rows live in a single timeline, we can provide consistency at the table level: applying the entire stream of events in the order received yields an eventually consistent slice of the table. Events at this level of guarantee carry no transaction stamp, but contain a source logical timestamp (LSN) and arrive in strict order.

Transaction

If the rows live in a single timeline and are attributed with transaction labels, and the transaction log is linearized (that is, all changes within one transaction are contiguous and the transactions themselves are logically ordered), we can provide consistency at both the table and transaction levels. Applying the entire stream of events in the same order, in the same (or larger) transaction batches, yields a consistent slice of the source at any moment in time.
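Assuming a linearized log where each row carries a transaction label, grouping the stream into whole-transaction batches is straightforward (illustrative types, not Transfer's API):

```go
package main

import "fmt"

// Row is one change attributed with a transaction label and a position
// in the linearized transaction log.
type Row struct {
	TxID string
	LSN  int
	Data string
}

// GroupByTx splits a linearized stream into whole-transaction batches,
// relying on the guarantee that all rows of one transaction are contiguous.
func GroupByTx(stream []Row) [][]Row {
	var batches [][]Row
	for _, r := range stream {
		n := len(batches)
		if n == 0 || batches[n-1][0].TxID != r.TxID {
			batches = append(batches, []Row{r})
		} else {
			batches[n-1] = append(batches[n-1], r)
		}
	}
	return batches
}

func main() {
	stream := []Row{
		{TxID: "tx1", LSN: 1, Data: "a"},
		{TxID: "tx1", LSN: 2, Data: "b"},
		{TxID: "tx2", LSN: 3, Data: "c"},
	}
	fmt.Println(len(GroupByTx(stream))) // 2
}
```

A sink that applies each batch atomically then reproduces the source's transactional slices.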

Sink / Target

Each of our Targets is a simple component that consumes a stream of events; at its level, the target can either uphold the source's guarantees or weaken them.

Primitive

At the most basic level, the target simply writes everything that arrives (classic examples are a message queue, a filesystem, or S3). At this level we guarantee nothing beyond the fact that everything that arrived was written; records may be duplicated.

Unique Key deduplication

The target can deduplicate rows by primary key, in which case we provide an additional guarantee: there will be no duplicate keys in the target.
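A toy illustration of key-based deduplication, where the last write per primary key wins (a hypothetical helper, not Transfer's sink code):

```go
package main

import "fmt"

// KV is a (primary key, value) pair arriving at the sink.
type KV struct {
	Key, Value string
}

// DedupByKey keeps the last value seen for each primary key, so the
// target never contains key duplicates even if the input stream does.
func DedupByKey(rows []KV) map[string]string {
	out := make(map[string]string)
	for _, r := range rows {
		out[r.Key] = r.Value // a later write for the same key overwrites the earlier one
	}
	return out
}

func main() {
	rows := []KV{{"1", "a"}, {"2", "b"}, {"1", "a2"}}
	fmt.Println(len(DedupByKey(rows))) // 2
}
```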

Logical clock deduplication

If the target can write to two tables in a single transaction, we can transactionally store the source logical timestamp in a separate table and discard rows that have already been written. In this case there will be no duplicates in the target, including for rows without keys.

Transaction boundaries

If the target can hold transactions open for an arbitrarily long time and apply transactions of arbitrary size, we can preserve transaction boundaries on writes. In this case the sink receives rows in the same (or larger) transactions, which gives an exact slice of the source at any point in time.

Summary

For maximum guarantees (an exact slice of the source at any point in time), both the source and the destination must provide their maximum guarantees.

For current storages, we have approximately the following matrix:

Storage Type S/Row S/Table S/DB S/Slot R/Row R/Table R/TX T/Rows T/Keys T/LSN T/TX
PG + + + + + + + + + + +
Mysql + + + + + + + + + +
Mongodb + + + +
Clickhouse + + +
Greenplum + + + + + + +
YDB + + + +
YT + + + + +
Airbyte + +/- +/- + +/-
Kafka + + +
EventHub + + +
LogBroker + + + +

🤝 Contributing

Transfer thrives on community contributions! Whether it's through ideas, code, or documentation, every effort helps in enhancing our project. As a token of our appreciation, once your code is merged, your name will be eternally preserved in the system.contributors table.

Here are some resources to help you get started:

👥 Community

For guidance on using Transfer, we recommend starting with the official documentation. If you need further assistance, explore the following community channels:

  • Slack (For live discussion with the Community)
  • GitHub (Feature/Bug reports, Contributions)
  • Twitter (Get the news fast)

πŸ›£οΈ Roadmap

Stay updated with Transfer's development journey. Here are our roadmap milestones:

📜 License

Transfer is released under the Apache License 2.0.

For more information, see the LICENSE file and Licensing FAQs.

Directories

Path Synopsis
cloud
library
go/core/log/zap/asynczap
Package asynczap implements asynchronous core for zap.
go/core/metrics
Package metrics provides interface collecting performance metrics.
go/core/resource
Package resource provides integration with RESOURCE and RESOURCE_FILES macros.
go/core/xerrors
Package xerrors is a drop-in replacement for the errors and golang.org/x/xerrors packages, and functionally for github.com/pkg/errors.
go/test/recipe
Package recipe contains helper functions for implementation of ya make recipes.
go/test/yatest
Package yatest provides access to testing context, when running under ya make -t.
transfer_manager
go/internal/metrics
Package metrics provides interface collecting performance metrics.
go/pkg/providers/clickhouse
Package ch cluster - it's like a stand-alone cluster with multimaster []*SinkServer - masters (AltHosts).
go/pkg/providers/clickhouse/httpclient
Code generated by MockGen.
go/pkg/providers/kafka
Package kafka is a generated GoMock package.
go/pkg/providers/yt/sink
Used only in sorted_table.
go/recipe/mongo
Basic instructions for Go recipes: https://docs.yandex-team.ru/devtools/test/environment#create-recipe
