HarbourBridge: Turnkey Postgres-to-Spanner Evaluation
HarbourBridge is a stand-alone open source tool for Cloud Spanner evaluation, using data from an existing PostgreSQL database. The tool ingests pg_dump output, automatically builds a Spanner schema, and creates a new Spanner database populated with data from pg_dump.
HarbourBridge is designed to simplify Spanner evaluation, and in particular to bootstrap the process by getting moderate-size PostgreSQL datasets into Spanner (up to a few GB). Many features of PostgreSQL, especially those that don't map directly to Spanner features, are ignored, e.g. (non-primary) indexes, functions, and sequences. Types such as integers, floats, char/text, bools, timestamps, and (some) array types map fairly directly to Spanner, but many other types do not and are instead mapped to Spanner's `STRING(MAX)`.
View HarbourBridge as a way to get up and running fast, so you can focus on critical things like tuning performance and getting the most out of Spanner. Expect that you'll need to tweak and enhance what HarbourBridge produces to complete your evaluation. For example, while HarbourBridge preserves primary keys, it does not currently translate other indexes. So, you'll need to add Spanner secondary indexes if your SQL queries rely on PostgreSQL indexes that have been dropped. HarbourBridge is not intended for production database migration.
To use the tool on a PostgreSQL database called mydb, run
pg_dump mydb | harbourbridge
HarbourBridge accepts pg_dump's standard plain-text format, but not archive or custom formats.
HarbourBridge automatically determines the cloud project and Spanner instance to use, and generates a new Spanner database name (prefixed with `pg_dump_` and today's date). Command-line flags can be used to explicitly set the Spanner instance or database name.
WARNING: Please check that permissions for the Spanner instance used by HarbourBridge are appropriate. Spanner manages access control at the database level, and the database created by HarbourBridge will inherit default permissions from the instance. All data written by HarbourBridge is visible to anyone who can access the created database.
As it processes the pg_dump data, HarbourBridge reports on progress, provides stats on the schema and data conversion steps, and an overall assessment of the quality of the conversion. It also generates a schema file and report file (and a bad-data file if data was dropped). See Files Generated by HarbourBridge. Details of how PostgreSQL is mapped to Spanner can be found in the Schema Conversion section.
This tool is part of the Cloud Spanner Ecosystem, a community contributed and supported open source repository. Please report issues and send pull requests. See the HarbourBridge Whitepaper for a discussion of our plans for the tool.
Note that the HarbourBridge tool is not an officially supported Google product and is not officially supported as part of the Cloud Spanner product.
Quickstart Guide
Before you begin
Complete the steps described in Set up, which covers creating and setting a default Google Cloud project, enabling billing, enabling the Cloud Spanner API, and setting up OAuth 2.0 to get authentication credentials to use the Cloud Spanner API.
In particular, ensure that you run
gcloud auth application-default login
to set up your local development environment with authentication credentials.
Set the GCLOUD_PROJECT environment variable to your Google Cloud project ID:
export GCLOUD_PROJECT=[MY_PROJECT_ID]
If you do not already have a Cloud Spanner instance, or you want to use a separate instance specifically for running HarbourBridge, then create a Cloud Spanner instance by following the "Create an instance" instructions on the Quickstart using the console guide. HarbourBridge will create a database for you, but it will not create a Spanner instance.
Install Go (download) on your development machine if it is not already installed, configure the GOPATH environment variable if it is not already configured, and test your installation.
Installing HarbourBridge
Download the tool to your machine and install it.
go get -u github.com/Dharmil-searce/testharbourbridge
The tool should now be installed as `$GOPATH/bin/harbourbridge`.
Running HarbourBridge
To use the tool on a PostgreSQL database called mydb, run
pg_dump mydb | $GOPATH/bin/harbourbridge
This command will use the cloud project specified by the `GCLOUD_PROJECT` environment variable, automatically determine the Cloud Spanner instance associated with this project, convert the PostgreSQL schema for mydb to a Spanner schema, create a new Cloud Spanner database with this schema, and finally, populate this new database with the data from mydb. The new Cloud Spanner database will have a name of the form `pg_dump_{DATE}_{RANDOM}`, where `{DATE}` is today's date and `{RANDOM}` is a random suffix for uniqueness.
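The naming scheme can be sketched in a few lines of Go. This is purely illustrative: the exact date format and random-suffix format used by HarbourBridge are assumptions here, not the tool's actual implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newDBName builds a name of the form pg_dump_{DATE}_{RANDOM}, mirroring
// the naming scheme described above. The date and suffix formats are
// assumptions for illustration only.
func newDBName() string {
	date := time.Now().Format("2006-01-02")
	return fmt.Sprintf("pg_dump_%s_%d", date, rand.Intn(1000000))
}

func main() {
	fmt.Println(newDBName())
}
```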
See the Troubleshooting Guide for help on debugging issues.
HarbourBridge also generates several files when it runs: a schema file, a report file (with detailed analysis of the conversion), and a bad data file (if any data was dropped).
Verifying Results
Once the tool has completed, you can verify the new database and its content using the Google Cloud Console. Go to the Cloud Spanner Instances page, select your Spanner instance, and then find the database created by HarbourBridge and select it. This will list the tables created by HarbourBridge. Select a table, and take a look at its schema and data. Next, go to the query page, and try some SQL statements. For example,

SELECT COUNT(*) from mytable

to check the number of rows in table mytable.
Next Steps
The tables created by HarbourBridge provide a starting point for evaluation of Spanner. While they preserve much of the core structure of your PostgreSQL schema and data, many key features have been dropped, including (non-primary) indexes, functions, sequences, procedures, triggers, and views.
As a result, the out-of-the-box performance you get from these tables could be slower than what you get from PostgreSQL. HarbourBridge does preserve primary keys, but we do not currently translate other indexes. If your SQL query performance relies on PostgreSQL indexes that are dropped, then the performance of the tables created by HarbourBridge could be significantly impaired.
To improve performance, consider adding Secondary Indexes to the tables created by HarbourBridge, using the existing PostgreSQL indexes as a guide. Also consider using Interleaved Tables to tune performance.
View HarbourBridge as a base set of functionality for Spanner evaluation that can be readily expanded. Consider forking and modifying the codebase to add the functionality you need. Please file issues and send PRs for fixes and new functionality. See our backlog of open issues. Our plans and aspirations for developing HarbourBridge further are outlined in the HarbourBridge Whitepaper.
You can also change the way HarbourBridge behaves by directly editing the pg_dump output. For example, suppose you want to try out different primary keys for a table. First run pg_dump and save the output to a file. Then modify (or add) the relevant `ALTER TABLE ... ADD CONSTRAINT ... PRIMARY KEY ...` statement in the pg_dump output file so that the primary keys match what you need. Then run HarbourBridge on the modified pg_dump output.
Files Generated by HarbourBridge
HarbourBridge generates several files as it runs:

- Schema file (ending in `schema.txt`): contains the generated Spanner schema, interspersed with comments that cross-reference to the relevant PostgreSQL schema definitions.
- Report file (ending in `report.txt`): contains a detailed analysis of the PostgreSQL to Spanner migration, including table-by-table stats and an analysis of PostgreSQL types that don't cleanly map onto Spanner types. Note that PostgreSQL types that don't have a corresponding Spanner type are mapped to STRING(MAX).
- Bad data file (ending in `dropped.txt`): contains details of pg_dump data that could not be converted and written to Spanner, including sample bad-data rows. If there is no bad data, this file is not written (and we delete any existing file with the same name from a previous run).

By default, these files are prefixed by the name of the Spanner database (with a dot separator). The file prefix can be overridden using the `-prefix` option.
Options
HarbourBridge accepts the following options:
`-dbname` Specifies the name of the Spanner database to create. This must be a new database. If dbname is not specified, HarbourBridge creates a new unique dbname.

`-instance` Specifies the Spanner instance to use. The new database will be created in this instance. If not specified, the tool automatically determines an appropriate instance using gcloud.

`-prefix` Specifies a file prefix for the report, schema, and bad-data files written by the tool. If no file prefix is specified, the name of the Spanner database (plus a '.') is used.

`-v` Specifies verbose mode. This will cause HarbourBridge to output detailed messages about the conversion.
Example Usage
The following examples assume `harbourbridge` has been added to your PATH environment variable.
To use HarbourBridge on a PostgreSQL database called mydb, run:
pg_dump mydb | harbourbridge
The tool can also be applied to an existing pg_dump file:
harbourbridge < my_pg_dump_file
To specify a particular Spanner instance to use, run:
pg_dump mydb | harbourbridge -instance my-spanner-instance
By default, HarbourBridge will generate a new Spanner database name to populate. You can override this and specify the database name to use by:
pg_dump mydb | harbourbridge -dbname my-spanner-database-name
HarbourBridge generates a report file, a schema file, and a bad-data file (if there are bad-data rows). You can control where these files are written by specifying a file prefix. For example,
pg_dump mydb | harbourbridge -prefix mydb.
will write files `mydb.report.txt`, `mydb.schema.txt`, and `mydb.dropped.txt`. The prefix can also be a directory. For example,

pg_dump mydb | harbourbridge -prefix ~/spanner-eval-mydb/

would write the files into the directory `~/spanner-eval-mydb/`. Note that HarbourBridge will not create directories as it writes these files.
Schema Conversion
The HarbourBridge tool maps PostgreSQL types to Spanner types as follows:
| PostgreSQL Type | Spanner Type | Notes |
|---|---|---|
| BOOL | BOOL | |
| BIGINT | INT64 | |
| BIGSERIAL | INT64 | a |
| BYTEA | BYTES(MAX) | |
| CHAR | STRING(1) | CHAR defaults to length 1 |
| CHAR(N) | STRING(N) | c |
| DATE | DATE | |
| DOUBLE PRECISION | FLOAT64 | |
| INTEGER | INT64 | s |
| NUMERIC | FLOAT64 | p |
| REAL | FLOAT64 | s |
| SERIAL | INT64 | a, s |
| SMALLINT | INT64 | s |
| TEXT | STRING(MAX) | |
| TIMESTAMP | TIMESTAMP | t |
| TIMESTAMPTZ | TIMESTAMP | |
| VARCHAR | STRING(MAX) | |
| VARCHAR(N) | STRING(N) | c |
| ARRAY(pgtype) | ARRAY(spannertype) | if scalar type pgtype maps to spannertype |
All other types map to STRING(MAX). Some of the mappings in this table represent loss of precision (marked p), dropped autoincrement functionality (marked a), differences in treatment of timezones (marked t), differences in treatment of fixed-length character types (marked c), and changes in storage size (marked s). We discuss these, as well as other limits and notes on schema conversion, in the following sections.
NUMERIC
Spanner does not support numeric types, so these are mapped to FLOAT64. For some numeric types (e.g. NUMERIC(7, 3)) this mapping will preserve precision. But for others, the numeric type does not fit in FLOAT64; HarbourBridge generates a warning in such cases. In general, mapping NUMERIC to FLOAT64 can be useful for evaluation purposes, but it is not recommended for production use.
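To make the precision caveat concrete, here is a small Go check. Spanner's FLOAT64 shares Go float64's 53-bit significand, so values with few significant digits (anything fitting NUMERIC(7,3)) round-trip exactly while larger values may not. This is an illustration, not code from the tool:

```go
package main

import "fmt"

// lossy reports whether an integer value survives a round trip through
// float64 (the representation behind Spanner's FLOAT64).
func lossy(v int64) bool {
	return int64(float64(v)) != v
}

func main() {
	// 1234567 has 7 significant digits and round-trips exactly;
	// 2^53+1 exceeds float64's 53-bit significand and does not.
	fmt.Println(lossy(1234567))          // false: preserved
	fmt.Println(lossy(9007199254740993)) // true: precision lost
}
```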
BIGSERIAL and SERIAL
Spanner does not support autoincrementing types, so these both map to INT64 and the autoincrementing functionality is dropped.
TIMESTAMP
PostgreSQL has two timestamp types: TIMESTAMP and TIMESTAMPTZ. Both have an 8-byte data representation and provide microsecond resolution, but neither actually stores a timezone with the data. The key difference between the two types is how string literals are converted to timestamps and how queries return data. For TIMESTAMP, all timezone information is dropped, and data is returned without a timezone. For TIMESTAMPTZ, string literals are converted to UTC, using the literal's timezone if it is specified, or PostgreSQL's timezone parameter if not. When data is printed, the stored data (in UTC) is converted to the timezone from the timezone parameter.
Spanner has a single timestamp type. Data is stored as UTC (there is no separate timezone). Spanner client libraries convert timestamps to UTC before sending them to Spanner. Data is always returned as UTC. Spanner's timestamp type is essentially the same as TIMESTAMPTZ, except that there is no analog of PostgreSQL's timezone parameter.
In other words, mapping PostgreSQL TIMESTAMPTZ to TIMESTAMP is fairly straightforward, but care should be taken with PostgreSQL TIMESTAMP data because Spanner clients will not drop the timezone.
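A quick Go sketch of why the TIMESTAMPTZ-to-TIMESTAMP mapping is straightforward: a literal carrying an explicit UTC offset converts to UTC unambiguously. The layout string and literal below are hypothetical examples for illustration, not formats taken from HarbourBridge:

```go
package main

import (
	"fmt"
	"time"
)

// toUTC parses a timestamp literal carrying a numeric UTC offset and
// normalizes it to UTC, the form in which Spanner stores timestamps.
func toUTC(lit string) time.Time {
	t, err := time.Parse("2006-01-02 15:04:05-07", lit)
	if err != nil {
		panic(err)
	}
	return t.UTC()
}

func main() {
	// 10:30 at UTC+5 is 05:30 UTC.
	fmt.Println(toUTC("2020-03-15 10:30:00+05")) // 2020-03-15 05:30:00 +0000 UTC
}
```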
CHAR(n) and VARCHAR(n)
The semantics of fixed-length character types differ between PostgreSQL and Spanner. The CHAR(n) type in PostgreSQL is padded with spaces. If a string value smaller than the limit is stored, spaces will be added to pad it out to the specified length. If a string longer than the specified length is stored, and the extra characters are all spaces, then it will be silently truncated. Moreover, trailing spaces are ignored when comparing two values. In contrast, Spanner does not give special treatment to spaces, and the specified length simply represents the maximum length that can be stored. This is close to the semantics of PostgreSQL's VARCHAR(n). However, there are some minor differences. For example, even VARCHAR(n) has some special treatment of spaces: strings longer than the specified length are silently truncated if the extra characters are all spaces.
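The comparison difference can be seen in a few lines of Go. Here `pgCharEqual` is a hypothetical helper that mimics PostgreSQL's trailing-space-insensitive CHAR(n) comparison, while Go's `==` matches Spanner's byte-exact STRING comparison:

```go
package main

import (
	"fmt"
	"strings"
)

// pgCharEqual mimics PostgreSQL CHAR(n) comparison, which ignores
// trailing spaces; plain == mirrors Spanner's exact STRING comparison.
func pgCharEqual(a, b string) bool {
	return strings.TrimRight(a, " ") == strings.TrimRight(b, " ")
}

func main() {
	padded := "ab   " // a CHAR(5) value, padded to length 5
	fmt.Println(padded == "ab")            // false: Spanner-style comparison
	fmt.Println(pgCharEqual(padded, "ab")) // true: PostgreSQL CHAR-style
}
```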
Storage Use
The tool maps several PostgreSQL types to Spanner types that use more storage. For example, SMALLINT is a two-byte integer, but it maps to Spanner's INT64, an eight-byte integer. This additional storage could be significant for large arrays.
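A back-of-the-envelope calculation of that overhead, counting raw element widths only and ignoring per-value overheads in either system:

```go
package main

import "fmt"

// rawBytes returns the raw storage for n elements of the given width,
// ignoring per-value overheads in both PostgreSQL and Spanner.
func rawBytes(n, width int) int { return n * width }

func main() {
	const n = 1000000 // a one-million element array
	fmt.Println(rawBytes(n, 2)) // 2000000 bytes as PostgreSQL SMALLINT
	fmt.Println(rawBytes(n, 8)) // 8000000 bytes as Spanner INT64: 4x larger
}
```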
Arrays
Spanner does not support multi-dimensional arrays. So while TEXT[4] maps to ARRAY<STRING(MAX)> and REAL ARRAY maps to ARRAY<FLOAT64>, TEXT[][] maps to STRING(MAX).

Also note that PostgreSQL supports array limits, but the PostgreSQL implementation ignores them. Spanner does not support array size limits, but since they have no effect anyway, the tool just drops them.
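The rule above can be expressed as a small helper, taking the Spanner type that the scalar element maps to and the number of array dimensions. This is a hypothetical sketch of the mapping rule, not HarbourBridge's actual code:

```go
package main

import "fmt"

// spannerArrayType applies the rule described above: one-dimensional
// arrays of a mappable scalar become ARRAY<...>; multi-dimensional
// arrays fall back to STRING(MAX).
func spannerArrayType(spannerScalar string, dims int) string {
	if dims == 1 {
		return "ARRAY<" + spannerScalar + ">"
	}
	return "STRING(MAX)"
}

func main() {
	fmt.Println(spannerArrayType("STRING(MAX)", 1)) // TEXT[4]    -> ARRAY<STRING(MAX)>
	fmt.Println(spannerArrayType("FLOAT64", 1))     // REAL ARRAY -> ARRAY<FLOAT64>
	fmt.Println(spannerArrayType("STRING(MAX)", 2)) // TEXT[][]   -> STRING(MAX)
}
```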
Primary Keys
Spanner requires primary keys for all tables. PostgreSQL recommends the use of primary keys for all tables, but does not enforce this. When converting a table without a primary key, HarbourBridge will create a new primary key of type INT64. By default, the name of the new column is `synth_id`. If there is already a column with that name, then a variation is used to avoid collisions.
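The collision-avoidance idea can be sketched as follows. The exact variation scheme shown (appending a counter) is an assumption for illustration, not necessarily what HarbourBridge does:

```go
package main

import "fmt"

// synthKeyName returns "synth_id", or a numbered variation if a column
// with that name already exists. The numbering scheme is hypothetical.
func synthKeyName(existing map[string]bool) string {
	name := "synth_id"
	for i := 0; existing[name]; i++ {
		name = fmt.Sprintf("synth_id%d", i)
	}
	return name
}

func main() {
	cols := map[string]bool{"id": true, "synth_id": true}
	fmt.Println(synthKeyName(cols))              // synth_id0
	fmt.Println(synthKeyName(map[string]bool{})) // synth_id
}
```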
NOT NULL Constraints
The tool preserves NOT NULL constraints. Note that Spanner does not require primary key columns to be NOT NULL. However, in PostgreSQL, a primary key is a combination of NOT NULL and UNIQUE, and so primary key columns from PostgreSQL will be mapped to Spanner columns that are both primary keys and NOT NULL.
Foreign Keys and Default Values
Spanner does not currently support foreign keys or default values. We drop these PostgreSQL features during conversion.
Other PostgreSQL features
PostgreSQL has many other features we haven't discussed, including functions, sequences, procedures, triggers, (non-primary) indexes, and views. The tool does not support these, and the relevant statements are dropped during schema conversion.
See Migrating from PostgreSQL to Cloud Spanner for a general discussion of PostgreSQL to Spanner migration issues. HarbourBridge follows most of the recommendations in that guide. The main difference is that we map a few more types to STRING(MAX).
Data Conversion
HarbourBridge converts PostgreSQL data to Spanner data based on the Spanner schema it constructs. Conversion for most data types is fairly straightforward, but several types deserve discussion. Note that HarbourBridge is not intended for databases larger than a couple of GB.
Timestamps and Timezones
As noted earlier when discussing schema conversion of TIMESTAMP, there are some subtle differences in how timestamps are handled in PostgreSQL and Spanner.
During data conversion, PostgreSQL TIMESTAMPTZ values are converted to UTC and stored in Spanner. The conversion proceeds as follows. If the value has a timezone, that timezone is respected during the conversion to UTC. If the value does not have a timezone, then we look for any `set timezone` statements in the pg_dump output and use the timezone specified. Otherwise, we use the TZ environment variable as the timezone, and failing that, we use the local system timezone default (as determined by Go).
In contrast, conversion of PostgreSQL TIMESTAMP values proceeds by ignoring any timezone information, treating the value as UTC, and storing it in Spanner.
Strings, character set support and UTF-8
Spanner requires that STRING values be UTF-8 encoded. All Spanner functions and operators that act on STRING values operate on Unicode characters rather than bytes. Since we map many PostgreSQL types (including TEXT and CHAR types) to Spanner's STRING type, HarbourBridge is effectively a UTF-8 based tool.
Note that the tool itself does not do any encoding/decoding or UTF-8 checks: it passes through data from pg_dump to Spanner. Internally, we use Go's string type, which supports UTF-8.
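Go's string handling, which underpins this pass-through behavior, distinguishes bytes from Unicode characters in the same way Spanner's STRING functions do. A short illustration:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount counts Unicode characters, as Spanner's STRING functions do;
// the built-in len counts bytes.
func runeCount(s string) int { return utf8.RuneCountInString(s) }

func main() {
	s := "héllo" // é occupies two bytes in UTF-8
	fmt.Println(len(s))              // 6 bytes
	fmt.Println(runeCount(s))        // 5 characters
	fmt.Println(utf8.ValidString(s)) // true: valid UTF-8
}
```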
Troubleshooting Guide
The following steps can help diagnose common issues encountered while running HarbourBridge.
1. Verify pg_dump configuration
First, check that pg_dump is correctly configured to connect to your PostgreSQL database. Note that pg_dump uses the same options as psql to connect to your database. See the psql and pg_dump documentation.
Access to a PostgreSQL database is typically configured using the following environment variables, which are standard across PostgreSQL utilities including pg_dump:
PGHOST
PGPORT
PGUSER
It is also possible to configure access via pg_dump's command-line options `--host`, `--port`, and `--username`.
2. Verify pg_dump output
Next, verify that pg_dump is generating plain-text output. If your database is small, try running
pg_dump > file
and look at the output file. It should be a plain-text file containing SQL commands. If your database is large, consider just dumping the schema via the `--schema-only` pg_dump command-line option.

pg_dump can export data in a variety of formats, but HarbourBridge only accepts plain format (aka plain-text). See the pg_dump documentation for details about formats.
3. Debugging HarbourBridge
The HarbourBridge tool can fail for a number of reasons.
3.1 No space left on device
HarbourBridge needs to read the pg_dump output twice, once to build a schema and once for data ingestion. When pg_dump output is directly piped to HarbourBridge, stdin is not seekable, and so we write the output to a temporary file. That temporary file is created via Go's ioutil.TempFile. On many systems, this creates a file in /tmp, which is sometimes configured with minimal space. A simple workaround is to separately run pg_dump and write its output to a file in a directory with sufficient space. For example, if the current working directory has space, then:

pg_dump > tmpfile
harbourbridge < tmpfile

Make sure you clean up the tmpfile after HarbourBridge has been run. Another option is to set the location of Go's TempFile, e.g. by setting the TMPDIR environment variable.
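The TMPDIR workaround relies on os.TempDir honoring the TMPDIR environment variable on Unix systems, which a short Go program can demonstrate (illustrative only):

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
)

// tempFileHonorsTMPDIR creates a temporary file after pointing TMPDIR at
// a fresh directory, and reports whether the file landed there. This is
// the mechanism behind the workaround described above (Unix only).
func tempFileHonorsTMPDIR() bool {
	dir, err := ioutil.TempDir("", "hb-demo")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)
	os.Setenv("TMPDIR", dir) // redirect where TempFile("", ...) writes

	f, err := ioutil.TempFile("", "harbourbridge-")
	if err != nil {
		return false
	}
	defer f.Close()
	return filepath.Dir(f.Name()) == dir
}

func main() {
	fmt.Println(tempFileHonorsTMPDIR()) // true on Unix
}
```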
3.2 Unparsable pg_dump
HarbourBridge uses the pg_query_go library. It is possible that the pg_dump output is corrupted or uses features that aren't parseable. Parsing errors should generate an error message of the form `Error parsing last 54321 line(s) of input`.
3.3 Credentials problems
HarbourBridge uses standard Google Cloud credential mechanisms for accessing Cloud Spanner. If this is mis-configured, you may see errors containing "unauthenticated", "cannot fetch token", or "could not find default credentials". You might need to run `gcloud auth application-default login`. See the Before you begin section for details.
3.4 Can't create database
In this case, the error message printed by the tool should help identify the cause. It could be an API permissions issue; for example, the Cloud Spanner API may not be appropriately configured. See the Before you begin section for details. Alternatively, you may have hit the limit on the number of databases per instance (currently 100). This can occur if you re-run the HarbourBridge tool many times, since each run creates a new database. In this case you'll need to delete some databases.
4. Database-Specific Issues
The schema, report, and bad-data files generated by HarbourBridge contain detailed information about the schema and data conversion process, including issues and problems encountered.
5. Reporting Issues
If you are having problems with HarbourBridge, please submit an issue.
Documentation
Overview
Package main implements HarbourBridge, a stand-alone tool for Cloud Spanner evaluation, using data from an existing PostgreSQL/MySQL database. See README.md for details.
Directories
| Path | Synopsis |
|---|---|
| | Package internal implements database-agnostic functionality for HarbourBridge. |
| | Package schema provides a generic representation of database schemas. |
| | Package spanner provides high-level abstractions for working with Cloud Spanner that are not available from the core Cloud Spanner libraries. |
| ddl | Package ddl provides a go representation of Spanner DDL as well as helpers for building and manipulating Spanner DDL. |