gonymizer

package module
v1.2.4-0...-91a2c17 Latest Latest
Published: Sep 3, 2020 License: Apache-2.0 Imports: 28 Imported by: 0

README

Gonymizer

Weird name, what does it do?

The Gonymizer project (Go + Anonymizer) was built at SmithRx in the hope of simplifying the QA process. Gonymizer is written in Golang and is meant to help database administrators and infrastructure folks easily anonymize production database dumps before loading this data into a QA environment.

We have built in support, and examples, for:

  • Kubernetes CRONJOB scheduling
  • AWS-S3 Storage processing and loading

We plan to have built-in:

  • CRONJOB BASH scripts to use local disk as storage (see tasks, we need help!)
  • AWS-Lambda Job scheduling (see tasks, we need help!)

Our API is an easy one to follow and we encourage others to join in by trying Gonymizer with their own development and staging environments either directly using the CLI or using the API. We include in our documentation: example configurations, best practices, Kubernetes CRONJOB examples, examples for AWS-Lambda, and other infrastructure tools. Please see the docs directory in this application to see a full how-to guide and where to get started.

Supported RDBMS

Currently Gonymizer only supports PostgreSQL 9.x-11.x. We have not tested Gonymizer on versions 12+, but plan to in the near future. If you would like to help by adding support for other database management systems or new processors, or if you have general questions, please see the CONTRIBUTING.md file in this repository.

Abbreviations and Definitions

  • HIPAA: Health Insurance Portability and Accountability Act of 1996
  • PCI DSS: Payment Card Industry Data Security Standard
  • PHI: Protected Health Information
  • PII: Personally identifiable information

In this document and codebase, we use the terms PHI and PII interchangeably.

Getting Started

If you are a seasoned Go veteran or already have an environment that contains Go >= 1.11, you can skip to the next section.

OSX

Gonymizer requires a complete install of Go >= 1.11. To install Go on OSX you can run the following:

brew install go

Once this is complete we will need to make sure our Go paths are set correctly in our BASH profile. NOTE: You may need to change the directories below to match your setup.

echo "
export GOPATH=~/go
export GOROOT=/usr/local/Cellar/go/1.11.2/libexec
export GO111MODULE=on
" >> ~/.profile

It is recommended to put all Go source code under ~/go. Once this is complete we can attempt to build the application:

cd ~/go/src/github.com/smithoss/gonymizer/scripts
./build.sh

The build script will build two binaries: one for macOS on the amd64 architecture and one for Linux on amd64. These binaries are stored under the Gonymizer/bin directory. Now that we have a built binary we can attempt to download a map file using our JSON configuration:

./gonymizer-darwin -c ~/conf/gonymizer-config-file.json dump

Debian 9.x / Ubuntu 18.04

Use the following steps to get up and going. Commands should be similar for Debian 9.x and Ubuntu 18.04.

  1. Install Golang and Git
sudo apt-get install golang git
  2. Add go path to profile
echo "
export GOPATH=~/go
export GO111MODULE=on
" >> ~/.bashrc

  3. Git checkout
mkdir -p ~/go/src/github.com/smithoss/
cd ~/go/src/github.com/smithoss/
git clone https://github.com/smithoss/Gonymizer.git gonymizer
  4. Build the project
cd gonymizer/scripts
bash build.sh

or

cd gonymizer/cmd/
go build -o ../bin/gonymizer .
  5. Run the binary
cd ../bin
./gonymizer-linux --help

Configuration

Gonymizer has many different configuration settings that can be enabled or disabled using the command line options. It is recommended that one run gonymizer --help or gonymizer CMD --help where CMD is one of the commands to see which options are available at any given time.

Below we give examples of both the CLI configuration as well as examples on how to create your map file.

CLI Configuration

Gonymizer was built using the Cobra + Viper Golang libraries to allow for easy configuration however you like it. We recommend using a JSON, YAML, or TOML file to configure Gonymizer. Below we will go over an example configuration for running Gonymizer.

For an example of how to set up a CLI configuration check our Dell Store 2 example in docs/demo/dellstore2/gonymizer_config.json

{
    "comment": "This example is viewable under docs/demo/dellstore2",
    "dump": {
        "database":             "store",
        "disable-ssl":          true,
        "dump-file":            "phi_dump.sql",
        "exclude-schema":       [
            "pg*",
            "information_schema"
        ],
        "host":                 "localhost",
        "port":                 5432,
        "schema":               ["public"],
        "row-count-file":       "row-counts.csv",
        "username":             "levi"
    }
}

comment: is used to leave comments for the reader and is not used by the application.

log-level: is the level the application uses to know what should be displayed to the screen. Choices are: FATAL, ERROR, WARN, INFO, DEBUG. We use the Logrus Golang library for logging so please read the documentation here for more information.

database: is the master database with PHI and PII that will be used for dumping a SQL dump file from.

host: is the hostname for the master database with PHI and PII that will be used for dumping a SQL dump file from.

port: is the host port that will be used to connect to the master database with PHI and PII.

username: is the username that will be used to connect to the master database with PHI and PII.

password: is the password that will be used to connect to the master database with PHI and PII.

disable-ssl: disables SSL for the connection to the master database with PHI and PII that will be used for dumping a SQL dump file from.

dump-file: is where Gonymizer will store the SQL statements from the dump command.

map-file: is the file that gonymizer uses to map out which columns need to be anonymized and how. When using the map command in conjunction with --map-file, or with the configuration above, a new file is created with a name similar to the map-file, but with skeleton in the name. More on this below in the map section.

exclude-table: is a list of tables that are not to be included during the pg_dump step of the extraction process. This allows us to focus only on tables that are needed for our base application to work. Using this option minimizes the size of our dump file and in turn decreases the amount of time needed for dumping, processing, and reloading. This option operates in the same fashion as pg_dump's --exclude-table option.

exclude-table-data: allows you to create a list of tables we would like to include in the pg_dump process but do not want to include any of the data (table schema only). The usage and advantages are the same as the exclude-table feature explained above and is identical to pg_dump's --exclude-table-data option.

schema: is a list of schemas that Gonymizer should dump from the master database. This option must be in the form of a list if you are using the configuration methods mentioned above.

exclude-schema: is a list of system level schemas that Gonymizer should ignore when adding CREATE SCHEMA statements to the dump file. These schemas may still be included in the --schema option, for example the public schema.

schema-prefix: is the prefix used in an environment where many schemas share a common name prefix. This is the same as a sharded architecture design, which is outside the scope of this article; it is recommended to read here if you are unfamiliar with this design paradigm. For example: [company_1, company_2, company_..., company_n-1, company_n] would be --schema-prefix=company_ --schemas=company

NOTE: Some arguments are not included here. It is recommended to use gonymizer --help and gonymizer [COMMAND] --help for more information and configuration options.
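
To make the shape of the dump section concrete, here is a minimal Go sketch that decodes the example configuration with encoding/json. The cliConfig and dumpConfig types and the parseConfig helper are hypothetical names used only for this illustration; Gonymizer itself reads its configuration through Cobra and Viper, so this is not how the application loads the file.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// dumpConfig mirrors the keys of the "dump" section in the example above.
// The type is hypothetical and exists only for this illustration.
type dumpConfig struct {
	Database      string   `json:"database"`
	DisableSSL    bool     `json:"disable-ssl"`
	DumpFile      string   `json:"dump-file"`
	ExcludeSchema []string `json:"exclude-schema"`
	Host          string   `json:"host"`
	Port          int      `json:"port"`
	Schema        []string `json:"schema"`
	RowCountFile  string   `json:"row-count-file"`
	Username      string   `json:"username"`
}

// cliConfig is the top-level object: a free-form comment plus one section per command.
type cliConfig struct {
	Comment string     `json:"comment"`
	Dump    dumpConfig `json:"dump"`
}

// parseConfig decodes a JSON configuration document.
func parseConfig(raw []byte) (cliConfig, error) {
	var c cliConfig
	err := json.Unmarshal(raw, &c)
	return c, err
}

const sampleConfig = `{
    "comment": "This example is viewable under docs/demo/dellstore2",
    "dump": {
        "database": "store",
        "disable-ssl": true,
        "dump-file": "phi_dump.sql",
        "exclude-schema": ["pg*", "information_schema"],
        "host": "localhost",
        "port": 5432,
        "schema": ["public"],
        "row-count-file": "row-counts.csv",
        "username": "levi"
    }
}`

func main() {
	conf, err := parseConfig([]byte(sampleConfig))
	if err != nil {
		log.Fatal(err)
	}
	d := conf.Dump
	fmt.Printf("%s@%s:%d/%s -> %s\n", d.Username, d.Host, d.Port, d.Database, d.DumpFile)
}
```

Note that schema and exclude-schema decode as lists, which matches the requirement above that these options be given in list form.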

Map File Configuration

Once one has created a skeleton map file it is recommended to create a new true map file which will be used to let gonymizer know which columns need to be anonymized in the database and which columns do not. There are two methods in which gonymizer map files work (inclusive and exclusive).

NOTE: Currently SmithRx is using an exclusive map file which can be found under map_files/prod_map.json

Available Fakers and Scramblers

Below is a list of fake data creators and scramblers. This table may not be up to date so please make sure to check processor.go for a full list.

| Processor Name | Use |
| --- | --- |
| AlphaNumericScrambler | Scrambles strings. If a number is in the string it will replace it with another random number |
| EmptyJson | Replaces a JSON with an empty one ({}) |
| FakeStreetAddress | Used to replace a real US address with a fake one |
| FakeCity | Used to replace a city column |
| FakeCompanyName | Used to replace a company name |
| FakeEmailAddress | Used to replace e-mail with a fake one |
| FakeFirstName | Used to replace a person's first name with a fake first name (non-gender specific) |
| FakeIPv4 | Used to replace an IP with a fake one |
| FakeLastName | Used to replace a person's last name with a fake last name |
| FakePhoneNumber | Used to replace a person's phone number with fake phone number |
| FakeState | Used to replace a state (full state name, non-abbreviated) |
| FakeStateAbbrev | Used to replace a state abbreviation |
| FakeUsername | Used to replace a username with a fake one |
| FakeZip | Used to replace a real zip code with another zip code |
| Identity | Used to notify Gonymizer not to anonymize the column (same as leaving the column out of the map file) |
| RandomBoolean | Randomizes boolean fields |
| RandomDate | Randomizes Day and Month, but keeps year the same (HIPAA only requires month and day be changed) |
| RandomDigits | Randomizes a string of digit(s), but keeps the same length |
| RandomUUID | Randomizes a UUID string, but keeps a mapping of the old UUID to the new UUID. If the old UUID is found elsewhere in the database the new UUID will be used instead of creating another one. Useful for UUID primary key mapping (relationships). |
| ScrubString | Replaces a string with *'s. Useful for password hashes. |

Inclusive Map Files

An inclusive map file is a map file which includes every column in every table that is contained in a list of schemas that is configurable by using the --schemas option. If you are using a sharded/group configuration only one copy of the column will be added to the file. An example map file can be found in map_files/example_db_map.json.

Once there is an up to date skeleton file one can then walk through the file and modify the "Processors"."Name" field for any column that needs to be anonymized. This can be done by simply replacing the "Identity" processor with one listed in the table above. For example to pick a fake first name for a column labeled first_name one would add the FakeFirstName to the "Processors"."Name" field like so:

{
    "TableSchema": "public",
    "TableName": "users",
    "ColumnName": "first_name",
    "DataType": "character varying",
    "ParentSchema": "",
    "ParentTable": "",
    "ParentColumn": "",
    "OrdinalPosition": 6,
    "IsNullable": false,
    "Processors": [
        {
            "Name": "FakeFirstName",
            "Max": 0,
            "Min": 0,
            "Variance": 0,
            "Comment": ""
        }
    ],
    "Comment": ""
}

Exclusive Map Files

An exclusive map file is a map file that contains only the columns that need to be anonymized. This is the only difference from the inclusive map file method and should make map files smaller and simpler to navigate since they will not contain any columns using the "Identity" processor. It is assumed that all columns that are not listed in the map file are considered to be OK to add to the dump file WITHOUT any scrambling or anonymization. This means that the user must add column definitions for every schema change that requires anonymization.

Pro Tip: An easy way to handle schema changes is to run the map command to create a new map file and copy/paste the new columns into your map file while adding the proper processors at the same time.
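
The exclusive behaviour described above (unlisted columns pass through unchanged) can be sketched in Go as a lookup with an Identity fallback. This is only an illustration under stated assumptions: the columnEntry type and the columnKey, buildIndex, and processorFor helpers are hypothetical and are not part of the Gonymizer API.

```go
package main

import "fmt"

// columnEntry holds the subset of a map-file column definition used here.
// It is a hypothetical type for this sketch.
type columnEntry struct {
	TableSchema string
	TableName   string
	ColumnName  string
	Processor   string
}

// columnKey builds a lookup key of the form schema.table.column.
func columnKey(schema, table, column string) string {
	return schema + "." + table + "." + column
}

// buildIndex turns the list of mapped columns into a lookup table.
func buildIndex(entries []columnEntry) map[string]string {
	idx := make(map[string]string, len(entries))
	for _, e := range entries {
		idx[columnKey(e.TableSchema, e.TableName, e.ColumnName)] = e.Processor
	}
	return idx
}

// processorFor returns the processor configured for a column. Columns that
// are absent from an exclusive map file fall back to "Identity", meaning
// they are copied to the output without anonymization.
func processorFor(idx map[string]string, schema, table, column string) string {
	if name, ok := idx[columnKey(schema, table, column)]; ok {
		return name
	}
	return "Identity"
}

func main() {
	idx := buildIndex([]columnEntry{
		{TableSchema: "public", TableName: "users", ColumnName: "first_name", Processor: "FakeFirstName"},
	})
	fmt.Println(processorFor(idx, "public", "users", "first_name")) // prints "FakeFirstName"
	fmt.Println(processorFor(idx, "public", "users", "created_at")) // prints "Identity"
}
```

The fallback is the reason every schema change that introduces sensitive columns needs a matching map-file entry: a column that is forgotten is silently treated as safe.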

Relationship Mapping

Relationship mapping allows the user to define columns that should remain congruent during the processing/anonymization step. For example, if a user is identified by a UUID that is used across multiple tables in the database, one may select the RandomUUID processor, which keeps a global hash map of OLD-UUID => NEW-UUID. The global hash map can then be used by the processor and can also be stored to disk for back-tracing values to debug the application. The only way to enable this type of logging is to edit the generator.go file and add a call to the writeDebugMap function. Adding this to your run-time is outside the scope of this documentation, and it is recommended to NEVER use this option when working with real PHI and PII data. If this file is compromised or stolen, an attacker will gain full access to the mapping of (PHI, PII) => (Non-PHI, Non-PII).

Currently we only allow for global mapping of the following processors (more may be added later):

  • AlphaNumericScrambler
  • RandomUUID

They can be found in the processor.go file:

var UUIDMap = map[uuid.UUID]uuid.UUID{}
var AlphaNumericMap = map[string]map[string]string{}

There are plans to add more globally aware processors in the future, but at this time only 2 are available.

To map a relationship one can do this quite easily by notifying Gonymizer that there is a parent table and column that exist that the column should be mapped to. Below is an example where we identify the parent schema, table, and column:

{
    "TableSchema": "public",
    "TableName": "credit_scores",
    "ColumnName": "ssn",
    "DataType": "integer",
    "ParentSchema": "public",
    "ParentTable": "user",
    "ParentColumn": "ssn",
    "OrdinalPosition": 6,
    "IsNullable": false,
    "Processors": [
        {
            "Name": "AlphaNumericScrambler",
            "Max": 0,
            "Min": 0,
            "Variance": 0,
            "Comment": ""
        }
    ],
    "Comment": ""
}

In the example above we are mapping the social security number (SSN) from the credit_scores table to the user table by simply notifying gonymizer that there exists a map for ssn that is tied to the user.ssn table and column. Gonymizer will see this and look the value up in the global AlphaNumericMap variable mentioned earlier. If the original SSN key does not exist in the map, Gonymizer will automatically scramble the SSN and add an entry in the map such that:

map["old SSN"]: "new value (new SSN)"

Every time gonymizer checks a value in the SSN column it will look up this value and replace it with the previously anonymized SSN. This allows us to map keys between tables.

Also make sure to add the parent table itself as a parent when creating a relationship mapping. From the example above the same would be true:

{
    "TableSchema": "public",
    "TableName": "user",
    "ColumnName": "ssn",
    "DataType": "integer",
    "ParentSchema": "public",
    "ParentTable": "user",
    "ParentColumn": "ssn",
    "OrdinalPosition": 6,
    "IsNullable": false,
    "Processors": [
        {
            "Name": "AlphaNumericScrambler",
            "Max": 0,
            "Min": 0,
            "Variance": 0,
            "Comment": ""
        }
    ],
    "Comment": ""
}

Notice that we added the column as a parent of itself. If this step is missing all other columns will be mapped to the correct value, but the parent column will not be mapped to the same hash map so it will contain different values than expected.

Note 1: Multiple tables can link back to the user table by simply adding the schema, table, and column names to the parent fields in the map file for the specified column.

Grouping and Schema Prefix Matching (sharding)

Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole. A full explanation is outside the scope of this README; more information can be found in this [Wikipedia article](https://en.wikipedia.org/wiki/Shard_(database_architecture)).

NOTE: When working with a database that contains many schemas matching the schema-prefix (shards), one will need to make sure that all tables and columns are identical across each schema. Managing the DDL for each schema is outside the scope of the Gonymizer project and should be done by external database administration tools.
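
As a small illustration of prefix matching, the Go sketch below selects the schemas that share a prefix from an in-memory list. The schemasWithPrefix helper is hypothetical; per its documentation, GetSchemaColumnsLike does its matching in the database with the SQL LIKE operator, so this is only a model of the idea.

```go
package main

import (
	"fmt"
	"strings"
)

// schemasWithPrefix returns the schemas whose names start with prefix,
// preserving their original order.
func schemasWithPrefix(all []string, prefix string) []string {
	var matched []string
	for _, s := range all {
		if strings.HasPrefix(s, prefix) {
			matched = append(matched, s)
		}
	}
	return matched
}

func main() {
	all := []string{"public", "company_1", "company_2", "audit"}
	fmt.Println(schemasWithPrefix(all, "company_")) // prints "[company_1 company_2]"
}
```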

Running Gonymizer

TL;DR Steps to anonymization (that's a word right?)

  1. Create a map file: gonymizer -c config/production-conf.json map
  2. Edit the map file to define which columns need to be anonymized.
  3. Create a PII-encumbered dump file: gonymizer -c config/prod-conf.json dump
  4. Use the process command to anonymize the PII dump file: gonymizer -c config/prod-conf.json process
  5. Use the load command to load the anonymized database file into the database: gonymizer -c config/staging.json load

Also check out our slides from Percona Live 2019 here

Detailed Steps

  • Step 1: Generate a Map Skeleton (only needed the first time or after schema changes)

    This will generate a new skeleton (defined, but empty) config file from scratch:

      ./gonymizer -c config/prod-conf.json map
    

    If you already have a map file and just need to update it due to migrations, schema changes, etc. (2nd -> nth runs), change the path to the real map file. The map command will NOT overwrite your map file; instead it will create a new file with "skeleton" in the name. This will also append new columns to the bottom:

      ./gonymizer -c config/prod-conf.json --map-file=db_mapper.prod_map.json map
    

    Will output a file named:

      db_mapper.prod_map.json.skeleton.json
    
  • Step 2: Copy the newly created skeleton file to a new production map file

    Pro Tip: It is recommended to leave OUT column definitions from your map file that are to be skipped by the gonymizer. This is to keep the map file simple and clean. The gonymizer will skip any column that is not in the map file and continue on. The purpose of the skeleton file is to use it as a base line and to copy/paste your anonymized columns from the skeleton file into your true map file. This map file will be used in the processing step later. See Map Configuration above for more information.

      mv db_mapper.prod_map.json.skeleton.json db_mapper.prod_map.json
    

    Edit every field (removing unneeded columns if going Pro Tip route). Add processors or Min/Max as necessary.

  • Step 3: Generate PHI & PII-encumbered dumpfile

    CAUTION!! This dump file will contain PII! Only do this on secure machines with encrypted block devices only!

      ./gonymizer -c config/prod-config.json dump --dump-file=dump-pii.sql
    
  • Step 4: Generate altered data using the dumpfile built in step 3

    If you've correctly configured db_mapper.j

      ./gonymizer -c config/prod-conf.json --map-file=db_mapper.prod_map.json \
       --dump-file=dump-pii.sql --s3-file-path=s3://my-bucket-name.s3.us-west-2.amazonaws.com/db-dump-processed.sql process
    
  • Step 5: Use the Load command to load the data into the database to verify that the data is correctly scrambled

    The processed SQL file can simply be imported using PSQL.

      ./gonymizer -c config/staging-conf.json --load-file=s3://my-bucket-name.s3.us-west-2.amazonaws.com/db-dump-processed.sql load
    

Creating Tests

Testing for Gonymizer is different than expected for typical projects. When adding a test to the project one will need to make sure the test is called from the main_test.go test harness file in the root directory of the project.

All tests should be added to the seqUnitTests function in the proper position in the test sequence. This sequence creates, imports, modifies, and drops the database in the local test database.

To run tests you will want to use the command (in the root directory of the project)

go test -v -run TestStart

Test Example

Let's assume we created a new processor function for anonymizing IP addresses as seen in #64. In this case we create the test using normal methods, but will need to register the function in main_test.go by adding the following line:

t.Run("ProcessorIPV4", TestProcessorIPv4)

Notices and License

Please make sure to read our license agreement here: LICENSE.txt. We may state throughout our documentation that we are using this application to anonymize data for HIPAA requirements, but this is in our own environment and we give NO guarantee this will be the same for others' uses. Considering that everyone's data set is completely different and the configuration of this application is very involved, we cannot guarantee that this application will provide compliance of any type. It is the application user's responsibility to verify with counsel that the dataset processed by the application is indeed HIPAA/PCI/PHI/PII compliant.

THERE IS ABSOLUTELY NO GUARANTEE THAT USING THIS SOFTWARE WILL COMPLETE A CORRECT ANONYMIZATION OF YOUR DATA SET FOR COMPLIANCE PURPOSES. PLEASE SEE LICENSE.txt FOR MORE INFORMATION.

Go Logo and Graphics

All graphics used in this project are released under the Creative Commons License 3.0

The Gonymizer Gophers logo was created by Levi Junkert. It uses the Go Gopher that Takuya Ueda made from the original Go Gopher design created by Renee French.

Documentation

Index

Constants

const (
	StateChangeTokenBeginCopy = "COPY"
	StateChangeTokenEndCopy   = "\\."
)

StateChangeTokenBeginCopy is the token used to notify the processor that we have hit SQL-COPY in the dump file. StateChangeTokenEndCopy is the token used to notify the processor that we are done with SQL-COPY.
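
The two tokens suggest a simple two-state scan over the dump file: outside a COPY block lines pass through, and inside one each line is a data row until the terminator is seen. The Go sketch below counts data rows per table to illustrate that state change. It is a minimal model and not the processor's actual implementation: the countCopyRows helper and the local token constants are hypothetical, and it relies on the pg_dump plain-text layout of "COPY table (columns) FROM stdin;" followed by rows and a closing "\." line.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Local copies of the tokens documented above, so the sketch is self-contained.
const (
	tokenBeginCopy = "COPY"
	tokenEndCopy   = "\\."
)

// countCopyRows scans a plain-text dump and returns the number of data rows
// found in each COPY block, keyed by the table name that follows COPY.
func countCopyRows(dump string) (map[string]int, error) {
	counts := map[string]int{}
	inCopy := false
	table := ""

	scanner := bufio.NewScanner(strings.NewReader(dump))
	for scanner.Scan() {
		line := scanner.Text()
		if !inCopy {
			// State change: a line beginning with COPY opens a data block.
			fields := strings.Fields(line)
			if len(fields) >= 2 && fields[0] == tokenBeginCopy {
				inCopy = true
				table = fields[1]
				counts[table] = 0
			}
			continue
		}
		// State change: the terminator line closes the data block.
		if line == tokenEndCopy {
			inCopy = false
			continue
		}
		// Inside a COPY block every line is one row; this is where a
		// processor would anonymize the column values.
		counts[table]++
	}
	return counts, scanner.Err()
}

const sampleDump = "CREATE TABLE public.users (id integer, name text);\n" +
	"COPY public.users (id, name) FROM stdin;\n" +
	"1\talice\n" +
	"2\tbob\n" +
	"\\.\n" +
	"SELECT 1;\n"

func main() {
	counts, err := countCopyRows(sampleDump)
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	fmt.Println(counts["public.users"]) // prints "2"
}
```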

Variables

var AlphaNumericMap = map[string]map[string]string{}

AlphaNumericMap is used to keep scrambled alphanumeric strings consistent. For example, when we scramble values such as Social Security Numbers, we keep track of the changes so that if we run across the same SSN again we can scramble it to the value we already have.

var CountryCodes []CountryCode

var IBANMap = map[string]string{}

IBANMap is the Global IBANs map for all IBANs we anonymize.

var ProcessorCatalog map[string]ProcessorFunc

ProcessorCatalog is the function map that maps each Processor to its entry function. All Processors are listed in this map.

var UUIDMap = map[uuid.UUID]uuid.UUID{}

UUIDMap is the Global UUID map for all UUIDs that we anonymize. Similar to AlphaNumericMap this map contains all UUIDs and what they are changed to. Some tables use UUIDs as the primary key and this allows us to keep consistency in the data set when anonymizing it.

Functions

func AddFileToS3

func AddFileToS3(sess *session.Session, inFile string, s3file *S3File) (err error)

AddFileToS3 will upload the supplied inFile to the supplied S3File.FilePath

func BuildDate

func BuildDate() time.Time

BuildDate will return the current unix time as the build date time for the application.

func BuildNumber

func BuildNumber() int64

BuildNumber will return the build number for the application.

func CheckIfDbExists

func CheckIfDbExists(db *sql.DB, dbName string) (exists bool, err error)

CheckIfDbExists checks to see if the database exists using the provided db connection.

func CreateDatabase

func CreateDatabase(conf PGConfig) error

CreateDatabase will create the database that is supplied in the PGConfig.

func CreateDumpFile

func CreateDumpFile(
	conf PGConfig,
	dumpfilePath,
	schemaPrefix string,
	excludeTables,
	excludeDataTables,
	excludeCreateSchemas,
	schemas []string,
) error

CreateDumpFile will create a PostgreSQL dump file from the specified PGConfig to the location, and with restrictions, that are provided by the inputs to the function.

func DropDatabase

func DropDatabase(conf PGConfig) error

DropDatabase will drop the database that is supplied in the PGConfig.

func DropPublicTables

func DropPublicTables(conf PGConfig) error

DropPublicTables drops all tables in the public schema.

func ExecPostgresCmd

func ExecPostgresCmd(name string, args ...string) error

ExecPostgresCmd executes the psql command, but first opens the db_test_*.log log files for debugging runtime issues using the psql command.

func ExecPostgresCommandOutErr

func ExecPostgresCommandOutErr(stdOut, stdErr io.Writer, name string, arg ...string) error

ExecPostgresCommandOutErr is the executing function for the psql -f command. It also closes the loaded files/buffers from the calling functions.

func GetAllProceduresInSchema

func GetAllProceduresInSchema(conf PGConfig, schema string) ([]string, error)

GetAllProceduresInSchema will return all procedures for the given schemas in SQL form.

func GetAllSchemaColumns

func GetAllSchemaColumns(db *sql.DB) (*sql.Rows, error)

GetAllSchemaColumns will return a row pointer to a list of table and column names for the given database connection.

func GetAllTablesInSchema

func GetAllTablesInSchema(conf PGConfig, schema string) ([]string, error)

GetAllTablesInSchema will return a list of database tables for a given database configuration.

func GetFileFromS3

func GetFileFromS3(sess *session.Session, s3file *S3File, loadFile string) (err error)

GetFileFromS3 will save the S3File to the loadFile destination.

func GetSchemaColumnEquals

func GetSchemaColumnEquals(db *sql.DB, schema string) (*sql.Rows, error)

GetSchemaColumnEquals returns a pointer to a list of database rows containing the names of tables and columns for the provided schema (using the SQL equals operator).

func GetSchemaColumnsLike

func GetSchemaColumnsLike(db *sql.DB, schemaPrefix string) (*sql.Rows, error)

GetSchemaColumnsLike will return a pointer to a list of database rows containing the names of tables and columns for the provided schema (using the SQL LIKE operator).

func GetSchemasInDatabase

func GetSchemasInDatabase(conf PGConfig, excludeSchemas []string) ([]string, error)

GetSchemasInDatabase returns a list of schemas for a given database configuration. If an excludeSchemas list is provided GetSchemasInDatabase will leave them out of the returned list of schemas.

func GetTableRowCountsInDB

func GetTableRowCountsInDB(conf PGConfig, schemaPrefix string, excludeTable []string) (*[]RowCounts, error)

GetTableRowCountsInDB collects the number of rows for each table in the supplied schema prefix, excluding any tables listed in the excludeTable list. It returns a list of tables and the number of rows for each.

func KillDatabaseConnections

func KillDatabaseConnections(db *sql.DB, dbName string) (err error)

KillDatabaseConnections will kill all connections to the provided database name.

func LoadFile

func LoadFile(conf PGConfig, filePath string) (err error)

LoadFile will load an SQL file into the specified PGConfig.

func OpenDB

func OpenDB(conf PGConfig) (*sql.DB, error)

OpenDB will open the database set in the PGConfig and return a pointer to the database connection.

func ProcessDumpFile

func ProcessDumpFile(mapper *DBMapper,
	src,
	dst,
	preProcessFile,
	postProcessFile string,
	generateSeed bool,
) error

ProcessDumpFile will process the supplied dump file according to the supplied database map file. GenerateSeed can also be set to true which will inform the function to use Go's built-in random number generator.

func ProcessorAddress

func ProcessorAddress(cmap *ColumnMapper, input string) (string, error)

ProcessorAddress will return a fake address string that is compiled from the fake library

func ProcessorAlphaNumericScrambler

func ProcessorAlphaNumericScrambler(cmap *ColumnMapper, input string) (string, error)

ProcessorAlphaNumericScrambler will receive the column metadata via ColumnMap and the column's actual data via the input string. The processor will scramble all alphanumeric digits and characters, but it will leave all non-alphanumerics the same without modification. These values are globally mapped and use the AlphaNumericMap to remap values once they are seen more than once.

Example: "PUI-7x9vY" = ProcessorAlphaNumericScrambler("ABC-1a2bC")

func ProcessorCity

func ProcessorCity(cmap *ColumnMapper, input string) (string, error)

ProcessorCity will return a real city name whose Jaro-Winkler similarity to the input is >= 0.4.

func ProcessorCompanyName

func ProcessorCompanyName(cmap *ColumnMapper, input string) (string, error)

ProcessorCompanyName will return a company name whose Jaro-Winkler similarity to the input is >= 0.4.

func ProcessorEmailAddress

func ProcessorEmailAddress(cmap *ColumnMapper, input string) (string, error)

ProcessorEmailAddress will return an e-mail address whose Jaro-Winkler similarity to the input is >= 0.4.

func ProcessorEmptyJson

func ProcessorEmptyJson(cmap *ColumnMapper, input string) (string, error)

ProcessorEmptyJson will return an empty JSON object regardless of the input.

func ProcessorFirstName

func ProcessorFirstName(cmap *ColumnMapper, input string) (string, error)

ProcessorFirstName will return a first name whose Jaro-Winkler similarity to the input is >= 0.4.

func ProcessorFullName

func ProcessorFullName(cmap *ColumnMapper, input string) (string, error)

ProcessorFullName will return a full name whose Jaro-Winkler similarity to the input is >= 0.4.

func ProcessorIBANScrambler

func ProcessorIBANScrambler(_ *ColumnMapper, input string) (string, error)

func ProcessorIPv4

func ProcessorIPv4(cmap *ColumnMapper, input string) (string, error)

func ProcessorIdentity

func ProcessorIdentity(cmap *ColumnMapper, input string) (string, error)

ProcessorIdentity will skip anonymization and leave output === input.

func ProcessorLastName

func ProcessorLastName(cmap *ColumnMapper, input string) (string, error)

ProcessorLastName will return a last name whose Jaro-Winkler similarity to the input is >= 0.4.

func ProcessorPhoneNumber

func ProcessorPhoneNumber(cmap *ColumnMapper, input string) (string, error)

ProcessorPhoneNumber will return a phone number whose Jaro-Winkler similarity to the input is >= 0.4.

func ProcessorRandomBoolean

func ProcessorRandomBoolean(cmap *ColumnMapper, input string) (string, error)

ProcessorRandomBoolean will return a random boolean value.

func ProcessorRandomCountryCode

func ProcessorRandomCountryCode(_ *ColumnMapper, _ string) (string, error)

func ProcessorRandomDate

func ProcessorRandomDate(cmap *ColumnMapper, input string) (string, error)

ProcessorRandomDate will return a random day and month, but keep year the same (See: HIPAA rules)

func ProcessorRandomDigits

func ProcessorRandomDigits(cmap *ColumnMapper, input string) (string, error)

ProcessorRandomDigits will return a random string of digit(s) keeping the same length of the input.

func ProcessorRandomUUID

func ProcessorRandomUUID(cmap *ColumnMapper, input string) (string, error)

ProcessorRandomUUID will generate a random UUID and replace the input with the new UUID. The input however will be mapped to the output so every occurrence of the input UUID will replace it with the same output UUID that was originally created during the first occurrence of the input UUID.
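
The first-occurrence mapping described above can be sketched in Go using only the standard library. This is an illustration, not the package's code: uuidMap here is keyed by strings rather than uuid.UUID values, and the newUUIDv4 and mapUUID helpers are hypothetical.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// uuidMap plays the role of the global UUIDMap: old UUID -> new UUID.
var uuidMap = map[string]string{}

// newUUIDv4 builds a random version 4 UUID string from 16 random bytes.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version field to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// mapUUID returns the replacement already recorded for old, or creates a
// new random UUID and records it so later occurrences get the same value.
func mapUUID(old string) (string, error) {
	if mapped, ok := uuidMap[old]; ok {
		return mapped, nil
	}
	fresh, err := newUUIDv4()
	if err != nil {
		return "", err
	}
	uuidMap[old] = fresh
	return fresh, nil
}

func main() {
	const old = "3f2b8c1e-9a4d-4e6f-8b7a-1c2d3e4f5a6b"
	first, err := mapUUID(old)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	second, _ := mapUUID(old)
	fmt.Println(first == second) // prints "true"
}
```

This is what lets a UUID primary key and the foreign keys that reference it end up with the same anonymized value.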

func ProcessorScrubString

func ProcessorScrubString(cmap *ColumnMapper, input string) (string, error)

ProcessorScrubString will replace the input string with asterisks (*). Useful for blanking out password fields.
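
As a tiny Go illustration, the sketch below replaces every character with an asterisk. Keeping the original length (counted in runes) is an assumption made for this example, and the scrub helper is hypothetical rather than the body of ProcessorScrubString.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// scrub returns one asterisk per character (rune) of the input.
func scrub(input string) string {
	return strings.Repeat("*", utf8.RuneCountInString(input))
}

func main() {
	fmt.Println(scrub("hunter2")) // prints "*******"
}
```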

func ProcessorState

func ProcessorState(cmap *ColumnMapper, input string) (string, error)

ProcessorState will return a state whose Jaro-Winkler similarity to the input is >= 0.4.

func ProcessorStateAbbrev

func ProcessorStateAbbrev(cmap *ColumnMapper, input string) (string, error)

ProcessorStateAbbrev will return a state abbreviation.

func ProcessorUserName

func ProcessorUserName(cmap *ColumnMapper, input string) (string, error)

ProcessorUserName will return a username whose Jaro-Winkler similarity to the input is >= 0.4.

func ProcessorZip

func ProcessorZip(cmap *ColumnMapper, input string) (string, error)

ProcessorZip will return a zip code whose Jaro-Winkler similarity to the input is >= 0.4.

func RenameDatabase

func RenameDatabase(db *sql.DB, fromName, toName string) (err error)

RenameDatabase will rename the database fromName to toName.

func SQLCommandFile

func SQLCommandFile(conf PGConfig, filepath string, ignoreErrors bool) error

SQLCommandFile will run psql -f on the file at filepath, executing any queries it contains. If ignoreErrors is true, psql will ignore errors in the file.

func VerifyRowCount

func VerifyRowCount(conf PGConfig, filePath string) (err error)

VerifyRowCount will verify that the row counts in the database described by the PGConfig match the supplied CSV file (see command/dump).

func Version

func Version() string

Version will return the version number for the application.

func WriteConfigSkeleton

func WriteConfigSkeleton(dbmap *DBMapper, filepath string) error

WriteConfigSkeleton will save the supplied DBMapper to filepath.

Types

type ColumnMapper

type ColumnMapper struct {
	Comment         string
	TableSchema     string
	TableName       string
	ColumnName      string
	DataType        string
	ParentSchema    string
	ParentTable     string
	ParentColumn    string
	OrdinalPosition int

	IsNullable bool

	Processors []ProcessorDefinition
}

ColumnMapper is the data structure that contains all gonymizer required information for the specified column.

type CountryCode

type CountryCode struct {
	Code, Name string
}

type DBMapper

type DBMapper struct {
	DBName       string
	SchemaPrefix string
	Seed         int64
	ColumnMaps   []ColumnMapper
}

DBMapper is the main structure for the map file JSON object and is used to map all database columns that will be anonymized.
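Since DBMapper backs the JSON map file, a round trip through encoding/json gives a feel for its shape. The sketch below uses trimmed local copies of the structs (dbMap, columnMap, processorDef are names invented here) so it compiles on its own. The JSON keys are simply the Go field names, and the processor name is an illustrative value; a real gonymizer map file may use different keys and names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed local copies of the documented structs, for illustration only.
type processorDef struct {
	Name string
}

type columnMap struct {
	TableSchema string
	TableName   string
	ColumnName  string
	Processors  []processorDef
}

type dbMap struct {
	DBName       string
	SchemaPrefix string
	Seed         int64
	ColumnMaps   []columnMap
}

// encodeMap serializes the map to indented JSON.
func encodeMap(m dbMap) ([]byte, error) {
	return json.MarshalIndent(m, "", "  ")
}

// decodeMap parses JSON produced by encodeMap back into a dbMap.
func decodeMap(data []byte) (dbMap, error) {
	var m dbMap
	err := json.Unmarshal(data, &m)
	return m, err
}

func main() {
	m := dbMap{
		DBName: "store",
		Seed:   42,
		ColumnMaps: []columnMap{{
			TableSchema: "public",
			TableName:   "users",
			ColumnName:  "password",
			Processors:  []processorDef{{Name: "ScrubString"}}, // illustrative name
		}},
	}
	data, err := encodeMap(m)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(data))
}
```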

func GenerateConfigSkeleton

func GenerateConfigSkeleton(conf PGConfig, schemaPrefix string, schemas, excludeTables []string) (*DBMapper, error)

GenerateConfigSkeleton will generate a column-map based on the supplied PGConfig and previously configured map file.

func LoadConfigSkeleton

func LoadConfigSkeleton(givenPathToFile string) (*DBMapper, error)

LoadConfigSkeleton will load the column-map into memory for use in dumping, processing, and loading of SQL files.

func (DBMapper) ColumnMapper

func (dbMap DBMapper) ColumnMapper(schemaName, tableName, columnName string) *ColumnMapper

ColumnMapper returns the address of the ColumnMapper object that matches the given parameters; if none matches, it returns nil. Special cases exist for sharded schemas using the schema-prefix. See documentation for details.
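The basic lookup can be pictured as a linear search that returns a pointer into the slice, or nil when nothing matches. This is a simplified sketch with local stand-in types (colMap and mapper are hypothetical names); it deliberately omits the schema-prefix handling for sharded schemas, whose rules are not described in this section.

```go
package main

import "fmt"

// Stand-in types for this sketch, not the gonymizer definitions.
type colMap struct {
	TableSchema string
	TableName   string
	ColumnName  string
}

type mapper struct {
	ColumnMaps []colMap
}

// find returns the address of the first exact match, or nil if there is none.
// The sharded-schema special cases are intentionally left out.
func (m mapper) find(schema, table, column string) *colMap {
	for i := range m.ColumnMaps {
		c := &m.ColumnMaps[i]
		if c.TableSchema == schema && c.TableName == table && c.ColumnName == column {
			return c
		}
	}
	return nil
}

func main() {
	m := mapper{ColumnMaps: []colMap{{"public", "users", "email"}}}
	fmt.Println(m.find("public", "users", "email") != nil) // true
	fmt.Println(m.find("public", "users", "phone") == nil) // true
}
```

Callers should check for nil before dereferencing the result, since an unmapped column is a normal outcome.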

func (*DBMapper) Validate

func (dbMap *DBMapper) Validate() error

Validate is used to verify that a database map is complete and correct.

type LineState

type LineState struct {
	LineNum     int64
	IsRow       bool
	SchemaName  string
	TableName   string
	ColumnNames []string
}

LineState contains all the required information for parsing a line in the SQL dump file.

func (*LineState) Clear

func (curLine *LineState) Clear()

Clear will clear out all known line state for the current LineState object.

type PGConfig

type PGConfig struct {
	Username string
	Pass     string
	Host     string

	DefaultDBName string

	SSLMode string
}

PGConfig is the main configuration structure for different PostgreSQL server configurations.

func (*PGConfig) BaseDSN

func (conf *PGConfig) BaseDSN() string

BaseDSN will return the base of the DSN in string form.

func (*PGConfig) BaseURI

func (conf *PGConfig) BaseURI() string

BaseURI will return the BaseDSN for the supplied PGConfig.

func (*PGConfig) DSN

func (conf *PGConfig) DSN() string

DSN will construct the data source name from the supplied data in the PGConfig. See: https://en.wikipedia.org/wiki/Data_source_name

func (*PGConfig) LoadFromCLI

func (conf *PGConfig) LoadFromCLI(host, username, password, database string, port int32, disableSSL bool)

LoadFromCLI will load the PostgreSQL configuration using the function input variables.

func (*PGConfig) LoadFromEnv

func (conf *PGConfig) LoadFromEnv(debugNum int64, prefix, suffix string)

LoadFromEnv uses environment variables to load the PGConfig.

func (*PGConfig) URI

func (conf *PGConfig) URI() string

URI returns a URI constructed from the supplied PGConfig.

type ProcessorDefinition

type ProcessorDefinition struct {
	Name string

	// optional helpers
	Max      float64
	Min      float64
	Variance float64

	Comment string
}

ProcessorDefinition is the processor data structure used to map database columns to their specified column processor.

type ProcessorFunc

type ProcessorFunc func(*ColumnMapper, string) (string, error)

ProcessorFunc is a simple function prototype for the ProcessorMap function pointers.
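To show how a function type like this is typically used, here is a sketch of a name-to-function registry. ColumnMapper is reduced to a one-field stand-in, and registry, apply, and processorUpper are hypothetical names for this example; the layout of gonymizer's own ProcessorMap is not shown in this section.

```go
package main

import (
	"fmt"
	"strings"
)

// ColumnMapper is a one-field stand-in for the documented struct.
type ColumnMapper struct {
	ColumnName string
}

// ProcessorFunc mirrors the documented signature.
type ProcessorFunc func(*ColumnMapper, string) (string, error)

// processorUpper is a made-up processor that upper-cases its input.
func processorUpper(_ *ColumnMapper, input string) (string, error) {
	return strings.ToUpper(input), nil
}

// registry maps processor names to functions (hypothetical layout).
var registry = map[string]ProcessorFunc{
	"Upper": processorUpper,
}

// apply looks up a processor by name and runs it on the input.
func apply(name string, cmap *ColumnMapper, input string) (string, error) {
	fn, ok := registry[name]
	if !ok {
		return "", fmt.Errorf("unknown processor %q", name)
	}
	return fn(cmap, input)
}

func main() {
	out, err := apply("Upper", &ColumnMapper{ColumnName: "city"}, "boston")
	fmt.Println(out, err)
}
```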

type RowCounts

type RowCounts struct {
	SchemaName *string
	TableName  *string
	Count      *int
}

RowCounts is used to keep track of the number of rows for a given schema and table.

type S3File

type S3File struct {
	Bucket   string
	FilePath string
	Region   string
	Scheme   string
	URL      *url.URL
}

S3File is the main structure holding S3 metadata for gonymizer files.

func (*S3File) ParseS3Url

func (s3f *S3File) ParseS3Url(s3url string) (err error)

ParseS3Url will parse the supplied S3 URI and load it into an S3File structure.
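For orientation, an s3://bucket/key style URI can be split with the standard net/url package, as in the sketch below. The s3Location type and parseS3 function are hypothetical, and the sketch assumes only the s3:// form with the bucket in the host position; which URL forms ParseS3Url accepts, and how it fills Region, is not stated here.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// s3Location is a stand-in for the documented S3File struct (Region omitted).
type s3Location struct {
	Bucket   string
	FilePath string
	Scheme   string
	URL      *url.URL
}

// parseS3 is a hypothetical helper, not gonymizer's implementation.
// It accepts only s3://bucket/key URIs.
func parseS3(raw string) (*s3Location, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	if u.Scheme != "s3" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return nil, errors.New("missing bucket name")
	}
	return &s3Location{
		Bucket:   u.Host,
		FilePath: strings.TrimPrefix(u.Path, "/"),
		Scheme:   u.Scheme,
		URL:      u,
	}, nil
}

func main() {
	loc, err := parseS3("s3://my-bucket/dumps/prod.sql")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(loc.Bucket, loc.FilePath)
}
```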
