postgres

package
v1.0.1-0...-1811ebb
Published: Jun 28, 2023 License: Apache-2.0 Imports: 23 Imported by: 0

README

HarbourBridge: PostgreSQL-to-Spanner Evaluation and Migration

HarbourBridge is a stand-alone open source tool for Cloud Spanner evaluation, using data from an existing PostgreSQL or MySQL database. This README provides details of the tool's PostgreSQL capabilities. For general HarbourBridge information, see the main README.

Example PostgreSQL Usage

HarbourBridge can either be used with pg_dump or it can be run directly on a PostgreSQL database (via Go's database/sql package).

The following examples assume a harbourbridge alias has been set up as described in the Installing HarbourBridge section of the main README.

Using HarbourBridge with pg_dump

The tool can be used to migrate schema from an existing pg_dump file:

harbourbridge schema -source=postgresql < my_pg_dump_file

You can use any of postgresql, postgres, or pg as the argument to the -source flag. They all specify PostgreSQL as the source database.

This will generate a session file with a session.json suffix. This file contains the schema mapping from source to destination. You will need to specify this file during data migration, along with the particular Spanner instance and database to use.

For example, run

harbourbridge data -session=mydb.session.json -source=pg -target-profile="instance=my-spanner-instance,dbName=my-spanner-database-name" < my_pg_dump_file

You can also run HarbourBridge in a schema-and-data mode, where it will perform both schema and data migration. This is useful for quick evaluation when the source database is small.

harbourbridge schema-and-data -source=pg -target-profile="instance=my-spanner-instance" < my_pg_dump_file

HarbourBridge generates a report file, a schema file, and a bad-data file (if there are bad-data rows). You can control where these files are written by specifying a file prefix. For example,

harbourbridge schema -prefix=mydb. -source=postgres < my_pg_dump_file

will write files mydb.report.txt, mydb.schema.txt, and mydb.dropped.txt. The prefix can also be a directory. For example,

harbourbridge schema -prefix=~/spanner-eval-mydb/ -source=postgres < my_pg_dump_file

would write the files into the directory ~/spanner-eval-mydb/. Note that HarbourBridge will not create directories as it writes these files.

Directly connecting to a PostgreSQL database

In this case, HarbourBridge connects directly to the PostgreSQL database to retrieve table schema and data. Set -source=postgres and the corresponding source-profile connection parameters host, port, user, dbName, and password.

For example, to perform schema conversion, run

harbourbridge schema -source=postgres -source-profile="host=<>,port=<>,user=<>,dbName=<>"

The port and password parameters are optional. port defaults to 5432 for a PostgreSQL source. The password can be provided at the password prompt.

Alternatively, you can also set environment variables PGHOST, PGPORT, PGUSER, PGDATABASE for direct access. Password can be specified either in the PGPASSWORD environment variable or provided at the password prompt.

Note that the various target-profile params described in the previous section are also applicable in direct connect mode.

Schema Conversion

The HarbourBridge tool maps PostgreSQL types to Spanner types as follows:

PostgreSQL Type Spanner Type Notes
BOOL BOOL
BIGINT INT64
BIGSERIAL INT64 a
BYTEA BYTES(MAX)
CHAR STRING(1) CHAR defaults to length 1
CHAR(N) STRING(N) c
DATE DATE
DOUBLE PRECISION FLOAT64
INTEGER INT64 s
NUMERIC NUMERIC p
REAL FLOAT64 s
SERIAL INT64 a, s
SMALLINT INT64 s
TEXT STRING(MAX)
TIMESTAMP TIMESTAMP t
TIMESTAMPTZ TIMESTAMP
VARCHAR STRING(MAX)
VARCHAR(N) STRING(N) c
JSON, JSONB JSON
ARRAY(pgtype) ARRAY(spannertype) if scalar type pgtype maps to spannertype

All other types map to STRING(MAX). Some of the mappings in this table represent potential changes of precision (marked p), dropped autoincrement functionality (marked a), differences in treatment of timezones (marked t), differences in treatment of fixed-length character types (marked c), and changes in storage size (marked s). We discuss these, as well as other limits and notes on schema conversion, in the following sections.
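As a rough illustration, the scalar part of this mapping can be sketched as a lookup. This is a simplified sketch, not the tool's actual implementation; the real code also records the conversion issues flagged in the notes above (p, a, t, c, s):

```go
package main

import (
	"fmt"
	"strings"
)

// spannerTypeFor sketches the scalar type mapping from the table above.
// mods holds type modifiers such as the N in VARCHAR(N).
func spannerTypeFor(pgType string, mods []int64) string {
	switch strings.ToLower(pgType) {
	case "bool", "boolean":
		return "BOOL"
	case "bigint", "bigserial", "integer", "serial", "smallint":
		return "INT64" // autoincrement (serial types) is dropped
	case "bytea":
		return "BYTES(MAX)"
	case "char", "character":
		if len(mods) > 0 {
			return fmt.Sprintf("STRING(%d)", mods[0])
		}
		return "STRING(1)" // CHAR defaults to length 1
	case "varchar", "character varying":
		if len(mods) > 0 {
			return fmt.Sprintf("STRING(%d)", mods[0])
		}
		return "STRING(MAX)"
	case "date":
		return "DATE"
	case "double precision", "real":
		return "FLOAT64"
	case "numeric":
		return "NUMERIC"
	case "text":
		return "STRING(MAX)"
	case "timestamp", "timestamptz":
		return "TIMESTAMP"
	case "json", "jsonb":
		return "JSON"
	default:
		return "STRING(MAX)" // all other types map to STRING(MAX)
	}
}

func main() {
	fmt.Println(spannerTypeFor("varchar", []int64{42})) // STRING(42)
	fmt.Println(spannerTypeFor("bigserial", nil))       // INT64
}
```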

NUMERIC

Spanner's NUMERIC type can store up to 29 digits before the decimal point and up to 9 after the decimal point. PostgreSQL's NUMERIC type can potentially support higher precision than this, so please verify that Spanner's NUMERIC support meets your application needs.
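If you want to check ahead of migration whether your NUMERIC values fit these limits, a sketch of such a check might look like the following (fitsSpannerNumeric is a hypothetical helper, not part of HarbourBridge):

```go
package main

import (
	"fmt"
	"strings"
)

// fitsSpannerNumeric reports whether a plain decimal string fits Spanner's
// NUMERIC range: at most 29 digits before the decimal point and at most 9
// significant digits after it. Hypothetical pre-migration check only.
func fitsSpannerNumeric(s string) bool {
	s = strings.TrimPrefix(s, "-")
	intPart, fracPart, _ := strings.Cut(s, ".")
	intPart = strings.TrimLeft(intPart, "0")   // leading zeros don't count
	fracPart = strings.TrimRight(fracPart, "0") // trailing zeros don't count
	return len(intPart) <= 29 && len(fracPart) <= 9
}

func main() {
	fmt.Println(fitsSpannerNumeric("123.456"))      // true
	fmt.Println(fitsSpannerNumeric("1.0123456789")) // false: 10 fractional digits
}
```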

BIGSERIAL and SERIAL

Spanner does not support autoincrementing types, so these both map to INT64 and the autoincrementing functionality is dropped.

TIMESTAMP

PostgreSQL has two timestamp types: TIMESTAMP and TIMESTAMPTZ. Both have an 8-byte data representation and provide microsecond resolution, but neither actually stores a timezone with the data. The key difference between the two types is how string literals are converted to timestamps and how data is returned by queries. For TIMESTAMP, all timezone information is dropped, and data is returned without a timezone. For TIMESTAMPTZ, string literals are converted to UTC, using the literal's timezone if it is specified, or PostgreSQL's timezone parameter if not. When data is printed, the stored data (in UTC) is converted to the timezone from the timezone parameter.

Spanner has a single timestamp type. Data is stored as UTC (there is no separate timezone). Spanner client libraries convert timestamps to UTC before sending them to Spanner, and data is always returned as UTC. Spanner's timestamp type is essentially the same as TIMESTAMPTZ, except that there is no analog of PostgreSQL's timezone parameter.

In other words, mapping PostgreSQL TIMESTAMPTZ to TIMESTAMP is fairly straightforward, but care should be taken with PostgreSQL TIMESTAMP data because Spanner clients will not drop the timezone.

CHAR(n) and VARCHAR(n)

The semantics of fixed-length character types differ between PostgreSQL and Spanner. The CHAR(n) type in PostgreSQL is padded with spaces. If a string value smaller than the limit is stored, spaces will be added to pad it out to the specified length. If a string longer than the specified length is stored, and the extra characters are all spaces, then it will be silently truncated. Moreover, trailing spaces are ignored when comparing two values. In contrast, Spanner does not give special treatment to spaces, and the specified length simply represents the maximum length that can be stored. This is close to the semantics of PostgreSQL's VARCHAR(n). However, there are some minor differences. For example, even VARCHAR(n) has some special treatment of spaces: strings longer than the specified length are silently truncated if the extra characters are all spaces.
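The comparison difference can be illustrated in Go. Here, charEqual is a hypothetical sketch of PostgreSQL's trailing-space-insensitive CHAR(n) comparison, while plain string equality matches Spanner's exact comparison:

```go
package main

import (
	"fmt"
	"strings"
)

// charEqual sketches PostgreSQL CHAR(n) comparison semantics: trailing
// spaces are ignored, so 'ab ' = 'ab'. Spanner STRING comparison is exact.
func charEqual(a, b string) bool {
	return strings.TrimRight(a, " ") == strings.TrimRight(b, " ")
}

func main() {
	fmt.Println(charEqual("ab ", "ab")) // true under PostgreSQL CHAR semantics
	fmt.Println("ab " == "ab")          // false: Spanner comparison is exact
}
```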

Storage Use

The tool maps several PostgreSQL types to Spanner types that use more storage. For example, SMALLINT is a two-byte integer, but it maps to Spanner's INT64, an eight-byte integer. This additional storage could be significant for large arrays.

Arrays

Spanner does not support multi-dimensional arrays. So while TEXT[4] maps to ARRAY<STRING(MAX)> and REAL ARRAY maps to ARRAY<FLOAT64>, TEXT[][] maps to STRING(MAX).

Also note that PostgreSQL supports array limits, but the PostgreSQL implementation ignores them. Spanner does not support array size limits, but since they have no effect anyway, the tool just drops them.

Primary Keys

Spanner requires primary keys for all tables. PostgreSQL recommends the use of primary keys for all tables, but does not enforce this. When converting a table without a primary key, HarbourBridge will create a new primary key of type INT64. By default, the name of the new column is synth_id. If there is already a column with that name, then a variation is used to avoid collisions.
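The collision-avoidance idea can be sketched as follows (synthKeyName and its numeric-suffix scheme are illustrative; the tool's actual variation scheme may differ):

```go
package main

import "fmt"

// synthKeyName returns a synthetic primary-key column name that does not
// collide with the table's existing columns. The suffix scheme here is a
// sketch, not HarbourBridge's exact algorithm.
func synthKeyName(existing map[string]bool) string {
	name := "synth_id"
	for i := 0; existing[name]; i++ {
		name = fmt.Sprintf("synth_id%d", i)
	}
	return name
}

func main() {
	cols := map[string]bool{"synth_id": true, "synth_id0": true}
	fmt.Println(synthKeyName(cols)) // synth_id1
}
```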

NOT NULL Constraints

The tool preserves NOT NULL constraints. Note that Spanner does not require primary key columns to be NOT NULL. However, in PostgreSQL, a primary key is a combination of NOT NULL and UNIQUE, and so primary key columns from PostgreSQL will be mapped to Spanner columns that are both primary keys and NOT NULL.

Foreign Keys

The tool maps PostgreSQL foreign key constraints into Spanner foreign key constraints, and preserves constraint names where possible. Note that Spanner requires foreign key constraint names to be globally unique (within a database), but in PostgreSQL they only have to be unique for a table, so we add a uniqueness suffix to a name if needed. Spanner doesn't support ON DELETE and ON UPDATE actions, so we drop these.
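The renaming idea can be sketched like this (uniqueFkName and its suffix scheme are illustrative, not the tool's exact algorithm):

```go
package main

import "fmt"

// uniqueFkName appends a numeric suffix when a foreign key constraint name
// is already used elsewhere in the database, making names globally unique.
func uniqueFkName(name string, used map[string]bool) string {
	candidate := name
	for i := 2; used[candidate]; i++ {
		candidate = fmt.Sprintf("%s_%d", name, i)
	}
	used[candidate] = true
	return candidate
}

func main() {
	used := map[string]bool{}
	fmt.Println(uniqueFkName("fk_order_customer", used)) // fk_order_customer
	fmt.Println(uniqueFkName("fk_order_customer", used)) // fk_order_customer_2
}
```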

Default Values

Spanner does not currently support default values. We drop these PostgreSQL features during conversion.

Secondary Indexes

The tool maps PostgreSQL secondary indexes to Spanner secondary indexes, preserving constraint names where possible. The tool also maps PostgreSQL UNIQUE constraints to Spanner UNIQUE secondary indexes. Check here for more details.

Other PostgreSQL features

PostgreSQL has many other features we haven't discussed, including functions, sequences, procedures, triggers, (non-primary) indexes and views. The tool does not support these and the relevant statements are dropped during schema conversion.

See Migrating from PostgreSQL to Cloud Spanner for a general discussion of PostgreSQL to Spanner migration issues. HarbourBridge follows most of the recommendations in that guide. The main difference is that we map a few more types to STRING(MAX).

Data Conversion

Timestamps and Timezones

As noted earlier when discussing schema conversion of TIMESTAMP, there are some subtle differences in how timestamps are handled in PostgreSQL and Spanner.

During data conversion, PostgreSQL TIMESTAMPTZ values are converted to UTC and stored in Spanner. The conversion proceeds as follows. If the value has a timezone, that timezone is respected during the conversion to UTC. If the value does not have a timezone, then we look for any set timezone statements in the pg_dump output and use the timezone specified. Otherwise, we use the TZ environment variable as the timezone, and failing that, we use the local system timezone default (as determined by Go).

In contrast, conversion of PostgreSQL TIMESTAMP values ignores any timezone information: the value is treated as UTC and stored in Spanner.

Strings, character set support and UTF-8

Spanner requires that STRING values be UTF-8 encoded. All Spanner functions and operators that act on STRING values operate on Unicode characters rather than bytes. Since we map many PostgreSQL types (including TEXT and CHAR types) to Spanner's STRING type, HarbourBridge is effectively a UTF-8 based tool.

Note that the tool itself does not do any encoding/decoding or UTF-8 checks: it passes through data from pg_dump to Spanner. Internally, we use Go's string type, which supports UTF-8.

Documentation

Overview

Package postgres handles schema and data migrations from Postgres.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ConvertData

func ConvertData(conv *internal.Conv, tableId string, colIds []string, vals []string) (string, []string, []interface{}, error)

ConvertData maps the source DB data in vals into Spanner data, based on the Spanner and source DB schemas. Note that since entries in vals may be empty, we also return the list of columns (empty cols are dropped).

func ProcessDataRow

func ProcessDataRow(conv *internal.Conv, tableId string, colIds, vals []string)

ProcessDataRow converts a row of data and writes it out to Spanner. tableId and colIds identify the source table and columns, and vals contains the string data to be converted to appropriate types and sent to Spanner. ProcessDataRow is only called in DataMode.

Types

type DbDumpImpl

type DbDumpImpl struct{}

DbDumpImpl Postgres specific implementation for DdlDumpImpl.

func (DbDumpImpl) GetToDdl

func (ddi DbDumpImpl) GetToDdl() common.ToDdl

GetToDdl functions below implement the common.DbDump interface

func (DbDumpImpl) ProcessDump

func (ddi DbDumpImpl) ProcessDump(conv *internal.Conv, r *internal.Reader) error

ProcessDump calls processPgDump to read a Postgres dump file

type InfoSchemaImpl

type InfoSchemaImpl struct {
	Db             *sql.DB
	SourceProfile  profiles.SourceProfile
	TargetProfile  profiles.TargetProfile
	IsSchemaUnique *bool
}

InfoSchemaImpl postgres specific implementation for InfoSchema.

func (InfoSchemaImpl) GetColumns

func (isi InfoSchemaImpl) GetColumns(conv *internal.Conv, table common.SchemaAndName, constraints map[string][]string, primaryKeys []string) (map[string]schema.Column, []string, error)

GetColumns returns a list of Column objects and names

func (InfoSchemaImpl) GetConstraints

func (isi InfoSchemaImpl) GetConstraints(conv *internal.Conv, table common.SchemaAndName) ([]string, map[string][]string, error)

GetConstraints returns a list of primary keys and by-column map of other constraints. Note: we need to preserve ordinal order of columns in primary key constraints. Note that foreign key constraints are handled in getForeignKeys.

func (InfoSchemaImpl) GetForeignKeys

func (isi InfoSchemaImpl) GetForeignKeys(conv *internal.Conv, table common.SchemaAndName) (foreignKeys []schema.ForeignKey, err error)

GetForeignKeys returns a list of all the foreign key constraints.

func (InfoSchemaImpl) GetIndexes

func (isi InfoSchemaImpl) GetIndexes(conv *internal.Conv, table common.SchemaAndName, colNameIdMap map[string]string) ([]schema.Index, error)

GetIndexes return a list of all indexes for the specified table. Note: Extracting index definitions from PostgreSQL information schema tables is complex. See https://stackoverflow.com/questions/6777456/list-all-index-names-column-names-and-its-table-name-of-a-postgresql-database/44460269#44460269 for background.

func (InfoSchemaImpl) GetRowCount

func (isi InfoSchemaImpl) GetRowCount(table common.SchemaAndName) (int64, error)

GetRowCount returns the number of rows in the given table.

func (InfoSchemaImpl) GetRowsFromTable

func (isi InfoSchemaImpl) GetRowsFromTable(conv *internal.Conv, tableId string) (interface{}, error)

GetRowsFromTable returns a sql Rows object for a table.

func (InfoSchemaImpl) GetTableName

func (isi InfoSchemaImpl) GetTableName(schema string, tableName string) string

GetTableName returns table name.

func (InfoSchemaImpl) GetTables

func (isi InfoSchemaImpl) GetTables() ([]common.SchemaAndName, error)

GetTables return list of tables in the selected database. TODO: All of the queries to get tables and table data should be in a single transaction to ensure we obtain a consistent snapshot of schema information and table data (pg_dump does something similar).

func (InfoSchemaImpl) GetToDdl

func (isi InfoSchemaImpl) GetToDdl() common.ToDdl

GetToDdl function below implements the common.InfoSchema interface.

func (InfoSchemaImpl) ProcessData

func (isi InfoSchemaImpl) ProcessData(conv *internal.Conv, tableId string, srcSchema schema.Table, colIds []string, spSchema ddl.CreateTable, additionalAttributes internal.AdditionalDataAttributes) error

ProcessData performs data conversion for source database 'db'. For each table, we extract data using a "SELECT *" query, convert the data to Spanner data (based on the source and Spanner schemas), and write it to Spanner. If we can't get/process data for a table, we skip that table and process the remaining tables.

Note that the database/sql library has a somewhat complex model for returning data from rows.Scan. Scalar values can be returned using the native value used by the underlying driver (by passing *interface{} to rows.Scan), or they can be converted to specific go types. Array values are always returned as []byte, a string encoding of the array values. This string encoding is database/driver specific. For example, for PostgreSQL, array values are returned in the form "{v1,v2,..,vn}", where each v1,v2,...,vn is a PostgreSQL encoding of the respective array value.

We choose to do all type conversions explicitly ourselves so that we can generate more targeted error messages: hence we pass *interface{} parameters to rows.Scan.
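A minimal parser for the simple, unquoted case of this array encoding might look like the following sketch (parsePgArray is illustrative; real PostgreSQL array literals can also contain quoted elements, escapes, and NULLs):

```go
package main

import (
	"fmt"
	"strings"
)

// parsePgArray sketches decoding of the "{v1,v2,...,vn}" string encoding
// that the PostgreSQL driver returns for array values. It handles only the
// simple unquoted case.
func parsePgArray(s string) []string {
	s = strings.TrimPrefix(s, "{")
	s = strings.TrimSuffix(s, "}")
	if s == "" {
		return nil // empty array
	}
	return strings.Split(s, ",")
}

func main() {
	fmt.Println(parsePgArray("{1.5,2.5,3.5}")) // [1.5 2.5 3.5]
}
```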

func (InfoSchemaImpl) StartChangeDataCapture

func (isi InfoSchemaImpl) StartChangeDataCapture(ctx context.Context, conv *internal.Conv) (map[string]interface{}, error)

StartChangeDataCapture is used for automatic triggering of Datastream job when performing a streaming migration.

func (InfoSchemaImpl) StartStreamingMigration

func (isi InfoSchemaImpl) StartStreamingMigration(ctx context.Context, client *sp.Client, conv *internal.Conv, streamingInfo map[string]interface{}) error

StartStreamingMigration is used for automatic triggering of Dataflow job when performing a streaming migration.

type ToDdlImpl

type ToDdlImpl struct {
}

ToDdlImpl Postgres specific implementation for ToDdl.

func (ToDdlImpl) ToSpannerType

func (tdi ToDdlImpl) ToSpannerType(conv *internal.Conv, spType string, srcType schema.Type) (ddl.Type, []internal.SchemaIssue)

ToSpannerType maps a scalar source schema type (defined by id and mods) into a Spanner type. This is the core source-to-Spanner type mapping. ToSpannerType returns the Spanner type and a list of type conversion issues encountered.
