dsbldr

package module
v0.0.0-...-41e0cfc Latest
Published: Oct 29, 2023 License: MIT Imports: 5 Imported by: 0

README

ml-data-builder

An efficient Go utility that simplifies generating Machine Learning datasets from publicly available Social Network APIs. It manages concurrent operations and saves the collected data, and it offers a small API that is straightforward to adopt.

Overview

Public Social Network APIs are frequently leveraged to build datasets used in training Machine Learning models. Models like Tweet2Vec that use this type of data to extract features or create embeddings are quite common. This tool is particularly useful for NLP-oriented models that can benefit from a large repository of structured (or even unstructured) text data.

Users often resort to hunting for publicly available general-purpose datasets, or spend considerable time setting up an elaborate feature extraction pipeline, which takes time away from actual feature engineering and model work. This tool aims to simplify such chores.

The author is a newcomer to the Machine Learning space, so feedback is appreciated and taken seriously. The goal is to provide a genuinely useful tool that streamlines the creation of ML datasets.

Roadmap

  • ✔ Top-level Feature-based API

  • ✔ Manage concurrency using Goroutines, channels, and other advanced techniques

  • Upcoming features:

    • Caching operations to prevent redundant requests

    • Save functionality for different data formats

    • Multi-format API data support

    • Authentication support

    • Command-line functionality

  • Additional feature suggestions are welcome!

Documentation

Index

Constants

const (
	SingleRetrieve = iota
	RepeatedRetrieve
)

Constants representing the RetrieveType. SingleRetrieve features only require one request to create the JSON dump that is passed to the RunFunc. RepeatedRetrieve features require one request per value-set of their parent features; the responses are concatenated into a JSON array and then passed to the feature's RunFunc. Almost as a given, all dependent features will be RepeatedRetrieve, with one request per value-set of their parent features.
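The difference between the two retrieve types can be sketched as follows. This is only an illustration under stated assumptions: requestCount and joinAsJSONArray are hypothetical helpers that do not appear in the package, and the constants are redeclared locally so the snippet compiles on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the documented constants, so the sketch is self-contained.
const (
	SingleRetrieve = iota
	RepeatedRetrieve
)

// requestCount is a hypothetical helper: a SingleRetrieve feature needs one
// request, a RepeatedRetrieve feature needs one request per parent value-set.
func requestCount(retrieveType, parentValueSets int) int {
	if retrieveType == RepeatedRetrieve {
		return parentValueSets
	}
	return 1
}

// joinAsJSONArray is a hypothetical helper that concatenates raw JSON
// responses into a single JSON array string.
func joinAsJSONArray(responses []string) string {
	return "[" + strings.Join(responses, ",") + "]"
}

func main() {
	fmt.Println(requestCount(SingleRetrieve, 5))
	fmt.Println(requestCount(RepeatedRetrieve, 5))
	fmt.Println(joinAsJSONArray([]string{`{"id":1}`, `{"id":2}`}))
}
```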

Variables

This section is empty.

Functions

func BasicOAuthHeader

func BasicOAuthHeader(consumerKey, nonce, signature, signatureMethod,
	timestamp, token string) string

BasicOAuthHeader returns a basic OAuth header string built from the access token and the other supplied parameters
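For orientation, an OAuth 1.0 Authorization header built from these six parameters typically looks like the output of the sketch below. This is not the package's implementation: oauthHeaderSketch and percentEncode are hypothetical names, and the exact layout the package produces is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// percentEncode applies RFC 3986 style percent-encoding. url.QueryEscape
// encodes spaces as "+", so those are rewritten to "%20".
func percentEncode(s string) string {
	return strings.ReplaceAll(url.QueryEscape(s), "+", "%20")
}

// oauthHeaderSketch is a hypothetical stand-in that formats the six
// parameters as an OAuth 1.0 Authorization header value.
func oauthHeaderSketch(consumerKey, nonce, signature, signatureMethod,
	timestamp, token string) string {
	return fmt.Sprintf(
		`OAuth oauth_consumer_key="%s", oauth_nonce="%s", oauth_signature="%s", oauth_signature_method="%s", oauth_timestamp="%s", oauth_token="%s", oauth_version="1.0"`,
		percentEncode(consumerKey), percentEncode(nonce), percentEncode(signature),
		percentEncode(signatureMethod), percentEncode(timestamp), percentEncode(token),
	)
}

func main() {
	// All values here are placeholders, not real credentials.
	fmt.Println(oauthHeaderSketch("key", "n1", "sig=", "HMAC-SHA1", "1700000000", "tok"))
}
```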

func WithBasicAuth

func WithBasicAuth(username, password string)

WithBasicAuth is a Builder option that adds a username and password for Basic API authentication
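NewBuilder accepts options of type func(*Builder), which is Go's functional options pattern. The sketch below shows how such an option is commonly written and applied. The lowercase types and functions are local stand-ins, not the package's own, and the unexported field names are assumptions.

```go
package main

import "fmt"

// builder is a local stand-in for the package's Builder type.
type builder struct {
	BaseURL  string
	username string // assumed field name
	password string // assumed field name
}

// option mirrors the func(*Builder) shape accepted by NewBuilder.
type option func(*builder)

// withBasicAuth returns an option that stores the credentials on the builder.
func withBasicAuth(username, password string) option {
	return func(b *builder) {
		b.username = username
		b.password = password
	}
}

// newBuilder applies every option to a fresh builder, in order.
func newBuilder(options ...option) *builder {
	b := &builder{}
	for _, opt := range options {
		opt(b)
	}
	return b
}

func main() {
	b := newBuilder(withBasicAuth("alice", "s3cret"))
	fmt.Println(b.username)
}
```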

Types

type Builder

type Builder struct {
	BaseURL string
	// contains filtered or unexported fields
}

Builder is main type for this tool.

func NewBuilder

func NewBuilder(featureCount, recordCount int, options ...func(*Builder)) *Builder

NewBuilder creates new Builder struct

func (*Builder) AddFeatures

func (b *Builder) AddFeatures(features ...*Feature)

AddFeatures adds one or more Feature structs to the "Features" field on Builder

func (*Builder) GetFeature

func (b *Builder) GetFeature(name string) *Feature

GetFeature returns a feature in the dataset based on its name

func (*Builder) Run

func (b *Builder) Run(client endpointClient) error

Run runs the Builder, aggregating all features and managing concurrent operations
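How Run schedules its requests is not documented here. As a rough illustration of the kind of concurrency involved, the sketch below fans out one goroutine per endpoint and collects the responses in order. fetchFunc and fetchAll are hypothetical names, and the real Run may be structured quite differently.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc is a hypothetical stand-in for a call to an API endpoint.
type fetchFunc func(endpoint string) (string, error)

// fetchAll requests every endpoint concurrently. Each goroutine writes only
// to its own slice index, so no mutex is needed.
func fetchAll(endpoints []string, fetch fetchFunc) ([]string, error) {
	results := make([]string, len(endpoints))
	errs := make([]error, len(endpoints))
	var wg sync.WaitGroup
	for i, ep := range endpoints {
		wg.Add(1)
		go func(i int, ep string) {
			defer wg.Done()
			results[i], errs[i] = fetch(ep)
		}(i, ep)
	}
	wg.Wait()
	// Report the first error encountered, if any.
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	fake := func(ep string) (string, error) { return "resp:" + ep, nil }
	out, err := fetchAll([]string{"/users", "/posts"}, fake)
	fmt.Println(out, err)
}
```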

func (*Builder) Save

func (b *Builder) Save(writer csv.Writer) error

Save commits the downloaded features to a file

func (*Builder) SaveIf

func (b *Builder) SaveIf(writer csv.Writer, saveCond func(r []string) bool) error

SaveIf saves records only if saveCond evaluates to true
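The conditional save can be pictured with the standard encoding/csv package, as in the sketch below. saveIfSketch and filterToCSV are hypothetical helpers written for this illustration. The sketch takes a *csv.Writer, as returned by csv.NewWriter, whereas the documented signature takes a csv.Writer value.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// saveIfSketch writes only the records for which saveCond returns true.
func saveIfSketch(w *csv.Writer, records [][]string, saveCond func(r []string) bool) error {
	for _, r := range records {
		if !saveCond(r) {
			continue
		}
		if err := w.Write(r); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

// filterToCSV runs saveIfSketch against an in-memory buffer and returns the
// resulting CSV text, which keeps the example easy to inspect.
func filterToCSV(records [][]string, saveCond func(r []string) bool) (string, error) {
	var buf bytes.Buffer
	err := saveIfSketch(csv.NewWriter(&buf), records, saveCond)
	return buf.String(), err
}

func main() {
	records := [][]string{{"alice", "10"}, {"bob", ""}, {"carol", "7"}}
	// Keep only records whose second column is non-empty.
	out, err := filterToCSV(records, func(r []string) bool { return r[1] != "" })
	fmt.Print(out, err)
}
```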

type Feature

type Feature struct {
	Name         string
	Endpoint     string  // API Endpoint
	RunFunc      RunFunc // function that performs ad-hoc computation
	RetrieveType int     // Determines if multiple or single requests are made to the api
	// contains filtered or unexported fields
}

Feature is a feature in the dataset, on which other features can be based

func NewFeature

func NewFeature() *Feature

NewFeature creates new Feature with defaults

type RunFunc

type RunFunc func(responses []string) []string // parents map[string]string

RunFunc holds the computation that turns API responses into feature values. It is sent an array of JSON strings as the responses. Whether it should also receive a map of data from the feature's parent features is left as an open question in the source. In practice, a RunFunc takes in serialized API data (JSON or XML), parses it on its own or with utility functions, performs whatever computation is needed, and returns an array of strings to be written to CSV or JSON.
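A minimal RunFunc might look like the sketch below. It assumes each response is a JSON object with a "text" field. That payload shape is hypothetical, and the RunFunc type is redeclared locally so the snippet stands alone.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RunFunc is redeclared here with the documented signature.
type RunFunc func(responses []string) []string

// extractText pulls the assumed "text" field out of each JSON response.
// A response that fails to parse yields an empty string, so the output
// keeps one entry per input.
var extractText RunFunc = func(responses []string) []string {
	out := make([]string, 0, len(responses))
	for _, resp := range responses {
		var payload struct {
			Text string `json:"text"`
		}
		if err := json.Unmarshal([]byte(resp), &payload); err != nil {
			out = append(out, "")
			continue
		}
		out = append(out, payload.Text)
	}
	return out
}

func main() {
	fmt.Println(extractText([]string{`{"text":"hello"}`, `{"text":"world"}`}))
}
```

Emitting an empty string for unparseable input is one design choice among several. Dropping the record or returning a sentinel value would also work, depending on how the saved rows are consumed.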
