gobert

module
v0.2.4
Published: Sep 1, 2021 License: MIT

README

gobert

Forked from github.com/buckhx/gobert

Go bindings for operationalizing BERT models. Train in Python, run in Go.

Simply put, gobert translates text sentences from any language into fixed-length vectors called "embeddings". These embeddings can be used as input to downstream learning tasks or compared directly with one another.

BERT

BERT is a state-of-the-art NLP model that can be leveraged for transfer learning into domain-specific use cases.

Under Active Development

This is a work in progress and should not be used until a version has been tagged and a go.mod is present. Test coverage will also be added when the API settles.

The following advice from Go TensorFlow applies:

TensorFlow provides a Go API— particularly useful for loading models created with Python and running them within a Go application.
...
TensorFlow provides APIs for use in Go programs. These APIs are particularly well-suited to loading models created in Python and executing them within a Go application.
...
Be a real gopher, keep it simple! Use Python to define & train models; you can always load trained models and using them with Go later!

This project attempts to minimize dependencies.

Installation

Prereqs

  1. Install TensorFlow for C (ideally to ./var/lib)
  2. Install Docker (Optional, but suggested)

Run Demo

The following demo will run a simple semantic search engine against the Go FAQ.

# Download & Export Pre-Trained Model
make model

# Run semantic search examples
make ex/search

Notes

  • SeqLen has a large impact on performance
  • Performance is currently poor; it is not yet clear whether the bottleneck is the exported Python model or the Go runtime
  • The model package is a work in progress

Examples

  • SemanticSearch: Simple search engine from CSV data using BERT sentence vectors
  • Classifier: exposes the model from run_classifier
  • Embedding: returns sentence embeddings
  • Raw: uses only the gobert tokenize package and the vanilla TensorFlow API
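The core of a semantic search like the first example can be sketched as ranking stored sentence vectors by similarity to a query vector. This is a minimal sketch and not the code in the examples directory: the `Doc`, `Hit`, and `Search` names are hypothetical, and the two-dimensional vectors are made up by hand where a real run would use BERT embeddings.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Doc pairs a text with its (precomputed) sentence vector.
type Doc struct {
	Text string
	Vec  []float32
}

// Hit is a search result with its similarity score.
type Hit struct {
	Text  string
	Score float64
}

// cosine returns the cosine similarity of a and b, or 0 if undefined.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Search returns up to k documents ordered by descending similarity to query.
func Search(query []float32, docs []Doc, k int) []Hit {
	hits := make([]Hit, 0, len(docs))
	for _, d := range docs {
		hits = append(hits, Hit{Text: d.Text, Score: cosine(query, d.Vec)})
	}
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].Score > hits[j].Score })
	if k < 0 {
		k = 0
	}
	if k > len(hits) {
		k = len(hits)
	}
	return hits[:k]
}

func main() {
	// Hand-made vectors for illustration only.
	docs := []Doc{
		{Text: "How do I write a unit test?", Vec: []float32{0.9, 0.1}},
		{Text: "Why does Go not have exceptions?", Vec: []float32{0.1, 0.9}},
		{Text: "What is a goroutine?", Vec: []float32{0.5, 0.5}},
	}
	for _, h := range Search([]float32{1, 0}, docs, 2) {
		fmt.Printf("%.3f %s\n", h.Score, h.Text)
	}
}
```

A linear scan like this is adequate for a small CSV of questions; larger collections would typically need an approximate nearest-neighbor index.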

Packages

Tokenize

The tokenize package includes methods to create BERT input features. It is fairly stable and can be used independently of the model package. It is planned to become its own module, since it does not require the TensorFlow bindings.

Vocab

The vocab package is a simple container for BERT vocabs. It could be rolled into tokenize.

Model

The model package is an experimental package for working with exported models. It requires TensorFlow.

Two external components are required to use the model package. Utilities for working with both are supplied within this repo.

  • TensorFlow C library
  • TF model exported with the SavedModel API

Export

The export dir includes utilities to export BERT models so that they can be loaded by the Go runtime. It is only loosely coupled to the model package: exported models are designed to interoperate with it, but the export step runs separately. The suggested way to run exports is through a container with a host-mounted volume.

Models exported using this package also interoperate with tensorflow/serving.

TODOs

  • Python Embedding
  • Python Classifier
  • Go Classifier
  • Semantic Search Example
  • Raw Model Example
  • Token Lookup
  • Model Download
  • Documentation
  • Cleanup makefile
  • Test Coverage
  • Benchmark
  • Binary CMD
  • Docker Exporter
  • Docker TF-GO Image
  • go mod init

TBD

  • Pool layers in python or post-process
  • gonum interop
  • first class wrapper API ([][][]float32 -> []Embedding)
  • proto interops
  • pooling strategies
  • batching
  • other tuned models (SentencePrediction, SQUAD)

The current line of thought is to use the core lib for raw types and supply a utility API on top.

Directories

Path                      Synopsis
examples/semantic-search  Package main is an example of a semantic search engine using BERT embeddings
model                     Package model provides functionality for working with exported BERT models
estimator                 Package estimator is a utility method for interacting with tf models.
tokenize                  Package tokenize supplies tokenization operations for BERT.
