substation

module
v0.6.0
Published: Nov 30, 2022 License: MIT

README

Substation

substation logo

Substation is a toolkit for creating highly configurable, no-maintenance, and cost-efficient serverless data pipelines.

What is Substation?

Originally designed to collect, normalize, and enrich security event data, Substation provides methods for achieving high quality data through interconnected, serverless data pipelines.

Substation also provides Go packages for filtering and modifying JSON data.
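
As a rough, hedged sketch of what "filtering and modifying JSON data" looks like in practice, the example below uses only the Go standard library rather than Substation's condition and process packages (whose APIs are not shown on this page); the event shape, field names, and condition are made up for illustration.

package main

import (
    "encoding/json"
    "fmt"
)

// event is a hypothetical security event; its fields are illustrative
// only and are not part of Substation's API.
type event struct {
    Action   string `json:"action"`
    SourceIP string `json:"source_ip,omitempty"`
    IP       string `json:"ip,omitempty"`
}

func main() {
    raw := [][]byte{
        []byte(`{"action":"login","ip":"198.51.100.1"}`),
        []byte(`{"action":"heartbeat"}`),
    }

    for _, b := range raw {
        var e event
        if err := json.Unmarshal(b, &e); err != nil {
            continue
        }

        // Filter: drop events that do not match the condition.
        if e.Action != "login" {
            continue
        }

        // Modify: rename the "ip" field to "source_ip".
        if e.IP != "" {
            e.SourceIP, e.IP = e.IP, ""
        }

        out, _ := json.Marshal(e)
        fmt.Println(string(out))
    }
}

Running the sketch prints only the normalized login event; the heartbeat is filtered out.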

Features

As an event-driven ingest, transform, and load application, Substation has these features:

  • real-time event filtering and processing
  • cross-dataset event correlation and enrichment
  • concurrent event routing to downstream systems
  • runs on containers, built for extensibility
    • support for new event filters and processors
    • support for new ingest sources and load destinations
    • supports creation of custom applications (e.g., multi-cloud)

As a package, Substation provides Go libraries for filtering and modifying JSON data; see the condition and process directories below.

Use Cases

Substation was originally designed to support the mission of achieving high quality data for threat hunting, threat detection, and incident response, but it can be used to move data between many distributed systems and services. Here are some example use cases:

  • data availability: sink data to an intermediary streaming service such as AWS Kinesis, then concurrently sink it to a data lake, data warehouse, and SIEM
  • data consistency: normalize data across every dataset using a permissive schema such as the Elastic Common Schema
  • data completeness: enrich data by integrating AWS Lambda functions and building self-populating AWS DynamoDB tables for low latency, real-time event context (see the sketch after this list)
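
To make the "data completeness" use case concrete, here is a minimal sketch of that enrichment pattern, not Substation's implementation: look up pre-computed context for an event in a DynamoDB table and merge it into the event before sending it downstream. The table name (event_context), key, and event fields are hypothetical, and the sketch calls the AWS SDK for Go directly rather than anything in Substation itself.

package main

import (
    "encoding/json"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
    "github.com/aws/aws-sdk-go/service/dynamodb/dynamodbattribute"
)

func main() {
    // Event to enrich; the shape is illustrative only.
    event := map[string]interface{}{
        "action": "login",
        "ip":     "198.51.100.1",
    }

    // Hypothetical metadata table keyed by IP address.
    svc := dynamodb.New(session.Must(session.NewSession()))
    out, err := svc.GetItem(&dynamodb.GetItemInput{
        TableName: aws.String("event_context"),
        Key: map[string]*dynamodb.AttributeValue{
            "ip": {S: aws.String(event["ip"].(string))},
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    // Merge any stored context into the event under a "context" key.
    if len(out.Item) > 0 {
        var meta map[string]interface{}
        if err := dynamodbattribute.UnmarshalMap(out.Item, &meta); err != nil {
            log.Fatal(err)
        }
        event["context"] = meta
    }

    enriched, _ := json.Marshal(event)
    fmt.Println(string(enriched))
}

This covers only the lookup half of the pattern; a "self-populating" table also has writers (e.g., Lambda functions) inserting context as new entities are observed.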

Example Data Pipelines

Simple

The simplest data pipeline is one with a single source (ingest), a single transform, and a single sink (load). The diagram below shows pipelines that ingest data from different sources and sink it unmodified to a data warehouse where it can be used for analysis.


graph TD
    sink(Data Warehouse)

    %% pipeline one
    source_a(HTTPS Source)
    processing_a[Transfer]

    %% flow
    subgraph pipeline X
    source_a ---|Push| processing_a
    end

    processing_a ---|Push| sink

    %% pipeline two
    source_b(Data Lake)
    processing_b[Transfer]

    %% flow
    subgraph pipeline Y
    source_b ---|Pull| processing_b
    end

    processing_b ---|Push| sink
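
The same source → transform → sink flow can be sketched directly in Go. The example below is a generic illustration of the pattern in the diagram, built from goroutines and channels; it is not Substation's implementation, and the events are made up.

package main

import (
    "fmt"
)

func main() {
    // Source (ingest): emit raw events.
    source := make(chan string)
    go func() {
        defer close(source)
        for _, e := range []string{"login alice", "logout bob"} {
            source <- e
        }
    }()

    // Transfer: in this diagram data passes through unmodified;
    // a real pipeline would transform the event here.
    transfer := make(chan string)
    go func() {
        defer close(transfer)
        for e := range source {
            transfer <- e
        }
    }()

    // Sink (load): push events to a downstream system (stdout here).
    for e := range transfer {
        fmt.Println(e)
    }
}
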
Complex

The complexity of a data pipeline, including its features and how it connects with other pipelines, is up to the user. The diagram below shows two complex data pipelines that have these features:

  • both pipelines write unmodified data to intermediary streaming data storage (e.g., AWS Kinesis) to support concurrent consumers and downstream systems
  • both pipelines transform data by enriching it from their own inter-pipeline metadata lookup (e.g., AWS DynamoDB)
  • pipeline Y additionally transforms data by enriching it from pipeline X's metadata lookup

graph TD

    %% pipeline a
    source_a_http(HTTPS Source)
    sink_a_streaming(Streaming Data Storage)
    sink_a_metadata(Metadata Lookup)
    sink_a_persistent[Data Warehouse]
    processing_a_http[Transfer]
    processing_a_persistent[Transform]
    processing_a_metadata[Transform]

    %% flow
    subgraph pipeline Y
    source_a_http ---|Push| processing_a_http
    processing_a_http ---|Push| sink_a_streaming
    sink_a_streaming ---|Pull| processing_a_persistent
    sink_a_streaming ---|Pull| processing_a_metadata
    processing_a_persistent ---|Push| sink_a_persistent
    processing_a_persistent ---|Pull| sink_a_metadata
    processing_a_metadata ---|Push| sink_a_metadata
    end

    processing_a_persistent ---|Pull| sink_b_metadata

    %% pipeline b
    source_b_http(HTTPS Source)
    sink_b_streaming(Streaming Data Storage)
    sink_b_metadata(Metadata Lookup)
    sink_b_persistent(Data Warehouse)
    processing_b_http[Transfer]
    processing_b_persistent[Transform]
    processing_b_metadata[Transform]

    %% flow
    subgraph pipeline X
    source_b_http ---|Push| processing_b_http
    processing_b_http ---|Push| sink_b_streaming
    sink_b_streaming ---|Pull| processing_b_persistent
    sink_b_streaming ---|Pull| processing_b_metadata
    processing_b_persistent ---|Push| sink_b_persistent
    processing_b_persistent ---|Pull| sink_b_metadata
    processing_b_metadata ---|Push| sink_b_metadata
    end

As a toolkit, Substation makes no assumptions about how data pipelines are configured and connected. We encourage experimentation and outside-the-box thinking when it comes to pipeline design!

Quickstart

Use the steps below to test Substation's functionality. We recommend running them in a Docker container (Visual Studio Code configurations for developing and testing Substation are included in .devcontainer/ and .vscode/).

Step 0: Set Environment Variable
export SUBSTATION_ROOT=/path/to/repository
Step 1: Compile the File Binary

Run the commands below to compile the Substation file app.

cd $SUBSTATION_ROOT/cmd/file/substation/ && \
go build . && \
./substation -h
Step 2: Compile the quickstart Configuration File

Run the command below to compile the quickstart Jsonnet configuration files into a Substation JSON config.

cd $SUBSTATION_ROOT && \
sh build/scripts/config/compile.sh
Step 3: Test Substation

Run the command below to test Substation.

cd $SUBSTATION_ROOT && \
./cmd/file/substation/substation -input examples/quickstart/data.json -config examples/quickstart/config.json

After this, we recommend reviewing the config documentation and running more tests with other event processors to learn how the app works.

Users can continue exploring the system by iterating on the quickstart config, building and running custom example applications, and deploying a data pipeline in AWS.

Additional Documentation

More documentation about Substation can be found across the project.

Licensing

Substation and its associated code are released under the terms of the MIT License.

Directories

Path Synopsis

cmd
    package cmd provides definitions and methods for building Substation applications.
config
    package config provides capabilities for managing configurations and handling data in Substation applications.
examples
    condition
        example from condition/README.md
    condition/data
        example of reading data from a file and applying an inspector
    condition/encapsulation
        example of reading data from a file and applying an inspector
    process
        example from process/README.md
    process/data
        example of reading data from a file and applying a single processor to data
    process/encapsulation
        example of reading data from a file and applying a single processor to encapsulated data
internal
    aws/appconfig
        package appconfig provides functions for interacting with AWS AppConfig.
    aws/s3manager
        package s3manager provides methods and functions for downloading and uploading objects in AWS S3.
    bufio
        package bufio wraps the standard library's bufio package.
    file
        package file provides functions that can be used to retrieve files from local and remote locations.
    log
        Package log wraps logrus and provides global logging. Only debug logging should be used in condition/, process/, and internal/ to reduce the likelihood of corrupting output for apps; debug and info logging can be used in cmd/.
    media
        package media provides capabilities for inspecting the content of data and identifying its media (Multipurpose Internet Mail Extensions, MIME) type.
    regexp
        Package regexp provides a global regexp cache via go-regexpcache.
proto
