imdb

package

Version: v0.9.1 (latest) | Published: Apr 20, 2024 | License: Apache-2.0 | Imports: 19 | Imported by: 0

Repository

github.com/gomlx/gomlx

README ¶

IMDB Dataset of 50k Movie Reviews

Kaggle's IMDB Dataset of 50k Movie Reviews
Original dataset: https://ai.stanford.edu/~amaas/data/sentiment/

Downloaded from: https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

Documentation ¶

Overview ¶

Package imdb contains code to download and prepare datasets from the IMDB Dataset of 50k Movie Reviews.

It can be used to train models, but the package itself contains no model. See a demo model training in the sub-package `demo`.

Index ¶

Constants
Variables
func Download(baseDir string) error
func InputToString(input tensor.Tensor, batchIdx int) string
func LoadIndividualFiles(baseDir string) (vocab *Vocab, examples []*Example, err error)
type Dataset
- func NewDataset(name string, set SetType, maxLen, batchSize int, labelDType shapes.DType, ...) *Dataset
- func NewUnsupervisedDataset(name string, maxLen, batchSize int, labelDType shapes.DType, infinite bool, ...) *Dataset
type Example
- func NewExample(contents []byte, vocab *Vocab) *Example
- func (e *Example) String(vocab *Vocab) string
type SetType
type Vocab
- func NewVocab() *Vocab
- func (v *Vocab) RegisterToken(token string) (idx int)
- func (v *Vocab) SortByFrequency() (oldIDtoNewID map[int]int)
type VocabEntry

Constants ¶

const (
	DownloadURL  = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
	LocalTarFile = "aclImdb_v1.tar.gz"
	TarHash      = "c40f74a18d3b61f90feba1e17730e0d38e8b97c05fde7008942e91923d1658fe"
	LocalDir     = "aclImdb"
	BinaryFile   = "aclImdb.bin"
)

Variables ¶

var (
	// IncludeSeparators indicates whether when parsing files it should create tokens out of the
	// separators (commas, dots, etc).
	IncludeSeparators = false

	// CaseSensitive indicates whether token collection should be case-sensitive.
	CaseSensitive = false

	// LoadedVocab is materialized after calling Download.
	LoadedVocab *Vocab

	// LoadedExamples is materialized after calling Download. It is based on LoadedVocab.
	LoadedExamples []*Example
)
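
To illustrate what the IncludeSeparators and CaseSensitive flags might control, here is a rough sketch of a tokenizer parameterized by two equivalent booleans. The function `tokenize` and its exact splitting rules are assumptions made for illustration. The package's real tokenizer may treat apostrophes, digits or markup differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize is a hypothetical tokenizer: it splits text into words and,
// optionally, emits each separator (comma, dot, etc.) as its own token.
func tokenize(text string, caseSensitive, includeSeparators bool) []string {
	if !caseSensitive {
		text = strings.ToLower(text)
	}
	var tokens []string
	var word strings.Builder
	// flush appends the word collected so far, if any.
	flush := func() {
		if word.Len() > 0 {
			tokens = append(tokens, word.String())
			word.Reset()
		}
	}
	for _, r := range text {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r) || r == '\'':
			word.WriteRune(r)
		case unicode.IsSpace(r):
			flush()
		default: // punctuation and other separators
			flush()
			if includeSeparators {
				tokens = append(tokens, string(r))
			}
		}
	}
	flush()
	return tokens
}

func main() {
	fmt.Println(tokenize("Great movie, loved it.", false, false))
	fmt.Println(tokenize("Great movie, loved it.", true, true))
}
```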

Functions ¶

func Download ¶

func Download(baseDir string) error

Download downloads the IMDB reviews dataset into baseDir, un-tars it, parses all individual files and saves a binary file version.

The loaded vocabulary and examples are stored in LoadedVocab and LoadedExamples.

If the dataset has already been downloaded, it simply loads the binary file version.

func InputToString ¶

func InputToString(input tensor.Tensor, batchIdx int) string

InputToString returns the content of one row (selected by batchIdx) of an input, rendered as a string. The input is assumed to be a batch created by a Dataset object.

func LoadIndividualFiles ¶

func LoadIndividualFiles(baseDir string) (vocab *Vocab, examples []*Example, err error)

Types ¶

type Dataset ¶

type Dataset struct {
	SetType          SetType
	LabelDType       shapes.DType
	MaxLen, MaxVocab int
	BatchSize        int
	Examples         []*Example

	Pos                       int
	Infinite, WithReplacement bool
	Shuffle                   *rand.Rand
	// contains filtered or unexported fields
}

Dataset implements train.Dataset. It allows for concurrent Yield calls, so one can feed it to ParallelizedDataset.

func NewDataset ¶

func NewDataset(name string, set SetType, maxLen, batchSize int, labelDType shapes.DType, infinite bool, shuffle *rand.Rand) *Dataset

NewDataset creates a labeled Dataset.

func NewUnsupervisedDataset ¶

func NewUnsupervisedDataset(name string, maxLen, batchSize int, labelDType shapes.DType, infinite bool, shuffle *rand.Rand) *Dataset

NewUnsupervisedDataset creates a Dataset for unsupervised training, with the SetType assumed to be Train.

func (*Dataset) Name ¶

func (ds *Dataset) Name() string

Name implements train.Dataset interface.

func (*Dataset) Reset ¶

func (ds *Dataset) Reset()

Reset restarts the dataset from the beginning. Can be called after io.EOF is reached, for instance when running another evaluation on a test dataset.
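
The pattern of draining a finite dataset until io.EOF and then calling Reset before another pass can be sketched as follows. The type `toyDataset` is a hypothetical stand-in with a simplified Yield signature. It is not the package's Dataset and does not use tensors.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// toyDataset is a hypothetical, simplified stand-in for a finite dataset.
type toyDataset struct {
	batches [][]int
	pos     int
}

// Yield returns the next batch, or io.EOF once all batches were consumed.
func (ds *toyDataset) Yield() ([]int, error) {
	if ds.pos >= len(ds.batches) {
		return nil, io.EOF
	}
	b := ds.batches[ds.pos]
	ds.pos++
	return b, nil
}

// Reset restarts the dataset from the beginning.
func (ds *toyDataset) Reset() { ds.pos = 0 }

// countBatches drains the dataset until io.EOF and returns how many batches it saw.
func countBatches(ds *toyDataset) (int, error) {
	n := 0
	for {
		_, err := ds.Yield()
		if errors.Is(err, io.EOF) {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		n++
	}
}

func main() {
	ds := &toyDataset{batches: [][]int{{1, 2}, {3, 4}, {5}}}
	first, _ := countBatches(ds)
	ds.Reset() // without this, the second pass would see zero batches
	second, _ := countBatches(ds)
	fmt.Println(first, second)
}
```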

func (*Dataset) Yield ¶

func (ds *Dataset) Yield() (spec any, inputs, labels []tensor.Tensor, err error)

Yield implements the train.Dataset interface. If the dataset is not infinite, it returns io.EOF at the end of the dataset.

It trims the examples to ds.MaxLen tokens, taken from the end.

It always returns `spec==nil`, since `inputs` and `labels` always have the same type of content.

It can be called concurrently.
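
Trimming an example to its last MaxLen tokens can be sketched as below. The helper `fitToLen` is hypothetical. Padding short examples at the front with a padding token ID is an assumption made for illustration, since the page only states that tokens are taken from the end.

```go
package main

import "fmt"

// fitToLen returns a slice of exactly maxLen token IDs. Longer inputs keep
// only their last maxLen tokens. Shorter inputs are left-padded with padID
// (the padding side is an assumption, not documented behavior).
func fitToLen(content []int, maxLen, padID int) []int {
	if maxLen < 0 {
		maxLen = 0
	}
	out := make([]int, maxLen)
	for i := range out {
		out[i] = padID
	}
	if len(content) > maxLen {
		content = content[len(content)-maxLen:] // keep the tail
	}
	copy(out[maxLen-len(content):], content)
	return out
}

func main() {
	fmt.Println(fitToLen([]int{1, 2, 3, 4, 5}, 3, 0)) // trimmed from the end
	fmt.Println(fitToLen([]int{7, 8}, 4, 0))          // padded at the front
}
```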

type Example ¶

type Example struct {
	Set           SetType
	Label, Rating int
	Length        int
	Content       []int
}

Example encapsulates all the information of one example in the IMDB 50k dataset. The fields are:

Set is 0 for "train" or 1 for "test" (see SetType).
Label is 0, 1 or 2 for negative, positive or unlabeled examples.
Rating is a value from 1 to 10 in IMDB. It is 0 for all unlabeled examples.
Length is the length (in number of tokens) of the content.
Content holds the tokens of the IMDB entry, as indices into the vocabulary associated with the dataset.
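
As an illustration of how these fields combine, the sketch below selects the labeled examples of one set. The types are local copies of the declarations shown on this page. The constant `labelUnlabeled` and the function `labeledExamples` are hypothetical names introduced here.

```go
package main

import "fmt"

// Local copies of the types declared on this page, so the sketch is self-contained.
type SetType int

const (
	Train SetType = iota
	Test
)

type Example struct {
	Set           SetType
	Label, Rating int
	Length        int
	Content       []int
}

// labelUnlabeled is a hypothetical name for the documented label value 2.
const labelUnlabeled = 2

// labeledExamples returns the examples of the given set that carry a
// negative (0) or positive (1) label.
func labeledExamples(examples []*Example, set SetType) []*Example {
	var out []*Example
	for _, e := range examples {
		if e.Set == set && e.Label != labelUnlabeled {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	examples := []*Example{
		{Set: Train, Label: 1, Rating: 9},
		{Set: Train, Label: labelUnlabeled},
		{Set: Test, Label: 0, Rating: 2},
	}
	fmt.Println(len(labeledExamples(examples, Train)), len(labeledExamples(examples, Test)))
}
```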

func NewExample ¶

func NewExample(contents []byte, vocab *Vocab) *Example

NewExample parses an IMDB content file, tokenizes it using the given Vocab and returns the parsed example.

It doesn't fill the Set, Label and Rating attributes.

func (*Example) String ¶

func (e *Example) String(vocab *Vocab) string

type SetType ¶

type SetType int

SetType refers to either a train or test example(s).

const (
	Train SetType = iota
	Test
)

type Vocab ¶

type Vocab struct {
	ListEntries []VocabEntry
	MapTokens   map[string]int
	TotalCount  int
}

Vocab stores vocabulary information for the whole corpus.

func NewVocab ¶

func NewVocab() *Vocab

NewVocab creates a new vocabulary, with the first token set to "<INVALID>", usually a placeholder for padding, and the second token set to "<START>" to indicate start of sentence.

func (*Vocab) RegisterToken ¶

func (v *Vocab) RegisterToken(token string) (idx int)

RegisterToken returns the index for the token, and increments the count for the token.
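
The behavior described for NewVocab and RegisterToken can be sketched with local copies of the Vocab and VocabEntry types. The lowercase functions below are hypothetical re-implementations for illustration only. In particular, incrementing TotalCount on every registration is an assumption.

```go
package main

import "fmt"

// Local copies of the types declared on this page.
type VocabEntry struct {
	Token string
	Count int
}

type Vocab struct {
	ListEntries []VocabEntry
	MapTokens   map[string]int
	TotalCount  int
}

// newVocab creates a vocabulary whose first two tokens are the special
// "<INVALID>" and "<START>" entries, at indices 0 and 1.
func newVocab() *Vocab {
	v := &Vocab{MapTokens: make(map[string]int)}
	for _, t := range []string{"<INVALID>", "<START>"} {
		v.MapTokens[t] = len(v.ListEntries)
		v.ListEntries = append(v.ListEntries, VocabEntry{Token: t})
	}
	return v
}

// registerToken returns the index for token, adding it if it is new,
// and increments its count.
func (v *Vocab) registerToken(token string) int {
	idx, found := v.MapTokens[token]
	if !found {
		idx = len(v.ListEntries)
		v.MapTokens[token] = idx
		v.ListEntries = append(v.ListEntries, VocabEntry{Token: token})
	}
	v.ListEntries[idx].Count++
	v.TotalCount++ // assumption: TotalCount counts every registered occurrence
	return idx
}

func main() {
	v := newVocab()
	for _, tok := range []string{"good", "movie", "good"} {
		fmt.Println(tok, v.registerToken(tok))
	}
	fmt.Println(v.ListEntries[2].Count, v.TotalCount)
}
```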

func (*Vocab) SortByFrequency ¶

func (v *Vocab) SortByFrequency() (oldIDtoNewID map[int]int)

SortByFrequency sorts the vocabulary entries by frequency, and returns a map that converts the token IDs from before the sorting to their new values.

Special tokens "<INVALID>" and "<START>" remain unchanged.
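
A frequency sort that keeps the leading special tokens fixed and returns an old-to-new ID map could look like the sketch below. The function `sortByFrequency` is a hypothetical stand-alone version. Sorting in descending order of count (most frequent first) is an assumption, as the page does not state the direction.

```go
package main

import (
	"fmt"
	"sort"
)

// Local copy of the VocabEntry type declared on this page.
type VocabEntry struct {
	Token string
	Count int
}

// sortByFrequency returns the entries sorted by descending count, leaving
// the first numSpecial entries in place, plus a map from old ID to new ID.
func sortByFrequency(entries []VocabEntry, numSpecial int) ([]VocabEntry, map[int]int) {
	if numSpecial > len(entries) {
		numSpecial = len(entries)
	}
	order := make([]int, len(entries)) // order[newID] = oldID
	for i := range order {
		order[i] = i
	}
	rest := order[numSpecial:]
	sort.SliceStable(rest, func(a, b int) bool {
		return entries[rest[a]].Count > entries[rest[b]].Count
	})
	sorted := make([]VocabEntry, len(entries))
	oldToNew := make(map[int]int, len(entries))
	for newID, oldID := range order {
		sorted[newID] = entries[oldID]
		oldToNew[oldID] = newID
	}
	return sorted, oldToNew
}

func main() {
	entries := []VocabEntry{
		{"<INVALID>", 0}, {"<START>", 0}, {"the", 2}, {"movie", 5}, {"a", 3},
	}
	sorted, oldToNew := sortByFrequency(entries, 2)
	fmt.Println(sorted)

	// Token IDs stored before the sort must be remapped afterwards.
	content := []int{1, 2, 3, 4}
	for i, id := range content {
		content[i] = oldToNew[id]
	}
	fmt.Println(content)
}
```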

type VocabEntry ¶

type VocabEntry struct {
	Token string
	Count int
}

VocabEntry includes the Token and its count.

Source Files ¶

imdb.go

Directories ¶

Path	Synopsis
demo	IMDB Movie Review library (imdb) demo: you can run this program in 4 different ways: