bigquery

package

Version: v0.11.1 (latest) | Published: Dec 17, 2024 | License: Apache-2.0 | Imports: 31 | Imported by: 0

Repository

github.com/goto/meteor

README ¶

bigquery

Usage

source:
  name: bigquery
  config:
    project_id: google-project-id
    table_pattern: gofood.fact_
    max_preview_rows: 3
    exclude:
      datasets:
        - dataset_a
        - dataset_b
      tables:
        - dataset_c.table_a
    max_page_size: 100
    include_column_profile: true
    build_view_lineage: true
    # Only one of service_account_base64 / service_account_json is needed. 
    # If both are present, service_account_base64 takes precedence
    service_account_base64: _________BASE64_ENCODED_SERVICE_ACCOUNT_________________
    service_account_json:
      {
        "type": "service_account",
        "private_key_id": "xxxxxxx",
        "private_key": "xxxxxxx",
        "client_email": "xxxxxxx",
        "client_id": "xxxxxxx",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "xxxxxxx",
        "client_x509_cert_url": "xxxxxxx"
      }
    collect_table_usage: false
    usage_period_in_day: 7
    usage_project_ids:
      - google-project-id
      - other-google-project-id

Inputs

Key	Value	Example	Description	Required
`project_id`	`string`	`my-project`	BigQuery Project ID	required
`service_account_base64`	`string`	`____BASE64_ENCODED_SERVICE_ACCOUNT____`	Service Account in base64 encoded string. Takes precedence over `service_account_json` value	optional
`service_account_json`	`string`	`{"private_key": ..., "private_key_id": ...}`	Service Account in JSON string	optional
`table_pattern`	`string`	`gofood.fact_`	Regex pattern to filter which bigquery table to scan (whitelist)	optional
`max_page_size`	`int`	`100`	max page size hint used for fetching datasets/tables/rows from bigquery	optional
`include_column_profile`	`bool`	`true`	Set to `true` to profile column values, such as min, max, med, avg, top, and freq.	optional
`max_preview_rows`	`int`	`30`	Maximum number of preview rows to fetch. `0` skips preview fetching; `-1` omits the `preview_rows` key from the asset data. Defaults to `30`.	optional
`mix_values`	`bool`	`false`	Set to `true` to mix the column values with the preview rows. Defaults to `false`.	optional
`collect_table_usage`	`bool`	`false`	Toggles collection of table usage; `true` enables it. Defaults to `false`.	optional
`usage_period_in_day`	`int`	`7`	Collects logs from `(now - usage_period_in_day)` until `now`. Only applies when `collect_table_usage` is `true`. Defaults to `7`.	optional
`usage_project_ids`	`[]string`	`[google-project-id, other-google-project-id]`	collecting log from defined GCP Project IDs. Default to BigQuery Project ID.	optional
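
Because `table_pattern` is a regex, the example value `gofood.fact_` is looser than it looks: an unescaped `.` matches any character. The sketch below illustrates this with Go's `regexp` package. `matchesTablePattern` is a hypothetical helper, and matching against the string `datasetID.tableID` is an assumption suggested by the example value, not something this page confirms.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesTablePattern is a hypothetical helper, not part of the meteor package.
// Assumption: the pattern is tested against "datasetID.tableID", and an empty
// pattern lets every table through.
func matchesTablePattern(pattern, datasetID, tableID string) (bool, error) {
	if pattern == "" {
		return true, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, fmt.Errorf("compile table_pattern: %w", err)
	}
	return re.MatchString(datasetID + "." + tableID), nil
}

func main() {
	candidates := [][2]string{
		{"gofood", "fact_orders"},
		{"gofood", "dim_users"},
		{"gofood_fact_archive", "orders"},
	}
	// Compare the unescaped pattern with one that escapes the dot.
	for _, pattern := range []string{`gofood.fact_`, `gofood\.fact_`} {
		for _, c := range candidates {
			ok, err := matchesTablePattern(pattern, c[0], c[1])
			if err != nil {
				panic(err)
			}
			fmt.Printf("pattern=%q table=%s.%s match=%v\n", pattern, c[0], c[1], ok)
		}
	}
}
```

If only tables in the `gofood` dataset whose names start with `fact_` are wanted, escaping the dot and anchoring the pattern (for example `^gofood\.fact_`) is the stricter choice.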

Notes

Leaving service_account_json and service_account_base64 blank falls back to Google's default authentication. This is recommended when the Meteor instance runs inside the same Google Cloud environment as the BigQuery project.
The service account needs the bigquery.privateLogsViewer role to collect BigQuery audit logs.
Setting max_preview_rows to -1 omits the preview_rows key from the asset data.
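
The three documented modes of `max_preview_rows` (positive, `0`, and `-1`) can be summarised in a small Go sketch. `previewPlan` and `planPreview` are hypothetical names. Two points are assumptions rather than documented behaviour: that `0` still leaves a `preview_rows` key in the asset data, and that negative values other than `-1` are treated like `-1`.

```go
package main

import "fmt"

// previewPlan is a hypothetical summary of what a max_preview_rows value
// implies. It is not a type from the meteor package.
type previewPlan struct {
	Fetch  bool // whether preview rows are fetched from BigQuery
	AddKey bool // whether the preview_rows key is added to the asset data
	Limit  int  // maximum number of rows to fetch when Fetch is true
}

// planPreview maps a max_preview_rows value onto a previewPlan.
func planPreview(maxPreviewRows int) previewPlan {
	switch {
	case maxPreviewRows < 0:
		// Documented for -1 only; handling other negatives the same way
		// is an assumption of this sketch.
		return previewPlan{}
	case maxPreviewRows == 0:
		// Fetching is skipped. Keeping the key is an assumption.
		return previewPlan{AddKey: true}
	default:
		return previewPlan{Fetch: true, AddKey: true, Limit: maxPreviewRows}
	}
}

func main() {
	for _, n := range []int{30, 0, -1} {
		fmt.Printf("max_preview_rows=%d -> %+v\n", n, planPreview(n))
	}
}
```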

Outputs

Field	Sample Value	Description
`resource.urn`	`project_id.dataset_name.table_name`
`resource.name`	`table_name`
`resource.service`	`bigquery`
`description`	`table description`
`profile.total_rows`	`2100`
`profile.usage_count`	`15`
`profile.joins`	[]Join
`profile.filters`	[`"WHERE t.param_3 = 'the_param' AND t.column_1 = \"xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx\""`,`"WHERE event_timestamp >= TIMESTAMP(\"2021-10-29\", \"UTC\") AND event_timestamp < TIMESTAMP(\"2021-11-22T02:01:06Z\")"`]
`schema`	[]Column
`properties.partition_data`	`"partition_data": {"partition_field": "data_date", "require_partition_filter": false, "time_partition": {"partition_by": "DAY","partition_expire": 0 } }`	partition related data for time and range partitioning.
`properties.clustering_fields`	`['created_at', 'updated_at']`	list of fields on which the table is clustered
`properties.partition_field`	`created_at`	returns the field on which table is time partitioned

Partition Data

Field	Sample Value	Description
`partition_field`	`created_at`	Field on which the table is partitioned, by either TimePartitioning or RangePartitioning. If the field is empty for TimePartitioning, `_PARTITIONTIME` is returned instead of an empty value.
`require_partition_filter`	`true`	Boolean denoting whether every query on the BigQuery table must include at least one predicate that only references the partitioning column.
`time_partition.partition_by`	`HOUR`	Partition type: HOUR, DAY, MONTH, or YEAR.
`time_partition.partition_expire_seconds`	`0`	Time, in seconds, after which data in a partition expires. `0` means the data does not expire.
`range_partition.interval`	`10`	Width of an interval range.
`range_partition.start`	`0`	start value for partition inclusive of this value
`range_partition.end`	`100`	end value for partition exclusive of this value
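
To make the `range_partition` fields concrete, the sketch below computes which interval a value falls into, treating `start` as inclusive and `end` as exclusive as described in the table. `rangePartitionIndex` is a hypothetical helper written for illustration; it is not part of the extractor or of the BigQuery client.

```go
package main

import "fmt"

// rangePartitionIndex returns the zero-based interval that value falls into
// for a range partition covering [start, end) with the given interval width.
// ok is false when the interval is not positive or the value lies outside
// [start, end).
func rangePartitionIndex(value, start, end, interval int64) (idx int64, ok bool) {
	if interval <= 0 || value < start || value >= end {
		return 0, false
	}
	return (value - start) / interval, true
}

func main() {
	// Sample values from the table above: start=0, end=100, interval=10.
	for _, v := range []int64{0, 9, 10, 99, 100} {
		idx, ok := rangePartitionIndex(v, 0, 100, 10)
		fmt.Printf("value=%d index=%d inRange=%v\n", v, idx, ok)
	}
}
```

With the sample values, the range splits into ten intervals of width 10, and a value of `100` falls outside the range because `end` is exclusive.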

Column

Field	Sample Value
`name`	`total_price`
`description`	`item's total price`
`data_type`	`decimal`
`is_nullable`	`true`
`length`	`12,2`
`profile`	`{"min":...,"max": ...,"unique": ...}`

Join

Field	Sample Value
`urn`	`project_id.dataset_name.table_name`
`count`	`3`
`conditions`	[`"ON target.column_1 = source.column_1 and target.param_name = source.param_name"`,`"ON DATE(target.event_timestamp) = DATE(source.event_timestamp)"`]

Contributing

Refer to the contribution guidelines for information on contributing to this module.

Documentation ¶

Index ¶

func CreateClient(ctx context.Context, logger log.Logger, config *Config) (*bigquery.Client, error)
func IsExcludedDataset(datasetID string, excludedDatasets []string) bool
func IsExcludedTable(datasetID, tableID string, excludedTables []string) bool
type Config
type Exclude
type Extractor
- func New(logger log.Logger, newClient NewClientFunc, randFn randFn) *Extractor
- func (e *Extractor) Extract(ctx context.Context, emit plugins.Emit) error
- func (e *Extractor) Init(ctx context.Context, config plugins.Config) error
type NewClientFunc

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CreateClient ¶ added in v0.8.5

func CreateClient(ctx context.Context, logger log.Logger, config *Config) (*bigquery.Client, error)

CreateClient creates a bigquery client

func IsExcludedDataset ¶

func IsExcludedDataset(datasetID string, excludedDatasets []string) bool

func IsExcludedTable ¶

func IsExcludedTable(datasetID, tableID string, excludedTables []string) bool
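
Neither exclusion function is documented here, so the following is only a sketch of behaviour consistent with their signatures and with the `Exclude` comments (dataset IDs, and table names in `datasetID.tableID` form). The lowercase `isExcludedDataset` and `isExcludedTable` are hypothetical re-implementations, and exact string matching is an assumption.

```go
package main

import "fmt"

// isExcludedDataset reports whether datasetID appears in the excluded list.
// Hypothetical re-implementation; exact matching is an assumption.
func isExcludedDataset(datasetID string, excludedDatasets []string) bool {
	for _, d := range excludedDatasets {
		if d == datasetID {
			return true
		}
	}
	return false
}

// isExcludedTable reports whether "datasetID.tableID" appears in the
// excluded list. Hypothetical re-implementation; exact matching is an
// assumption.
func isExcludedTable(datasetID, tableID string, excludedTables []string) bool {
	name := datasetID + "." + tableID
	for _, t := range excludedTables {
		if t == name {
			return true
		}
	}
	return false
}

func main() {
	// Values taken from the exclude block in the usage example.
	datasets := []string{"dataset_a", "dataset_b"}
	tables := []string{"dataset_c.table_a"}

	fmt.Println(isExcludedDataset("dataset_a", datasets))
	fmt.Println(isExcludedTable("dataset_c", "table_a", tables))
	fmt.Println(isExcludedTable("dataset_c", "table_b", tables))
}
```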

Types ¶

type Config ¶

type Config struct {
	ProjectID string `mapstructure:"project_id" validate:"required"`
	// ServiceAccountBase64 takes precedence over ServiceAccountJSON field
	ServiceAccountBase64 string  `mapstructure:"service_account_base64"`
	ServiceAccountJSON   string  `mapstructure:"service_account_json"`
	MaxPageSize          int     `mapstructure:"max_page_size"`
	DatasetPageSize      int     `mapstructure:"dataset_page_size"`
	TablePageSize        int     `mapstructure:"table_page_size"`
	TablePattern         string  `mapstructure:"table_pattern"`
	Exclude              Exclude `mapstructure:"exclude"`
	IncludeColumnProfile bool    `mapstructure:"include_column_profile"`
	// MaxPreviewRows can also be set to -1 to restrict adding preview_rows key in asset data
	MaxPreviewRows      int      `mapstructure:"max_preview_rows" default:"30"`
	MixValues           bool     `mapstructure:"mix_values" default:"false"`
	IsCollectTableUsage bool     `mapstructure:"collect_table_usage" default:"false"`
	UsagePeriodInDay    int64    `mapstructure:"usage_period_in_day" default:"7"`
	UsageProjectIDs     []string `mapstructure:"usage_project_ids"`
	BuildViewLineage    bool     `mapstructure:"build_view_lineage" default:"false"`
	Concurrency         int      `mapstructure:"concurrency" default:"10"`
}

Config holds the set of configuration for the bigquery extractor

type Exclude ¶

type Exclude struct {
	// list of datasetIDs
	Datasets []string `mapstructure:"datasets"`
	// list of tableNames in format - datasetID.tableID
	Tables []string `mapstructure:"tables"`
}

type Extractor ¶

type Extractor struct {
	plugins.BaseExtractor
	// contains filtered or unexported fields
}

Extractor manages the communication with the bigquery service

func New ¶

func New(logger log.Logger, newClient NewClientFunc, randFn randFn) *Extractor

func (*Extractor) Extract ¶

func (e *Extractor) Extract(ctx context.Context, emit plugins.Emit) error

Extract checks if the table is valid and extracts the table schema

func (*Extractor) Init ¶

func (e *Extractor) Init(ctx context.Context, config plugins.Config) error

Init initializes the extractor

type NewClientFunc ¶ added in v0.8.5

type NewClientFunc func(ctx context.Context, logger log.Logger, config *Config) (*bigquery.Client, error)

Directories ¶

Path	Synopsis
auditlog
sqlparser
