Documentation ¶
Index ¶
- func Filter(vs chan string, filters ...Predicate) chan string
- func IsNotStopWord(v string) bool
- func IsStopWord(v string) bool
- func IsWord(v string) bool
- func LoadStopWords(filename string) error
- func Map(vs chan string, f ...Mapper) chan string
- func ScanAlphaWords(data []byte, atEOF bool) (advance int, token []byte, err error)
- func WordCounts(r io.Reader) (map[string]int, error)
- type Classifier
- type Mapper
- type Predicate
- type StdOption
- type StdTokenizer
- type Tokenizer
- type WeightScheme
- type WeightSchemeStrategy
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Filter ¶
func Filter(vs chan string, filters ...Predicate) chan string
Filter removes elements from the input channel where the supplied predicate is satisfied. Filter is a Predicate aggregation
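A minimal usage sketch, assuming the package is imported as nlp (the import path below is a placeholder). Per the doc comment above, Filter drops every token for which a predicate returns true, so filtering on IsStopWord strips stop words from the stream:

package main

import (
	"fmt"

	"example.com/nlp" // placeholder import path for this package
)

func main() {
	in := make(chan string)
	go func() {
		defer close(in)
		for _, w := range []string{"the", "quick", "brown", "fox"} {
			in <- w
		}
	}()
	// Per the doc comment, tokens satisfying IsStopWord are removed.
	for w := range nlp.Filter(in, nlp.IsStopWord) {
		fmt.Println(w) // quick, brown, fox
	}
}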
func IsNotStopWord ¶
func IsNotStopWord(v string) bool
IsNotStopWord is the inverse of IsStopWord
func IsStopWord ¶
func IsStopWord(v string) bool
IsStopWord checks against a list of known English stop words and returns true if v is a stop word; false otherwise
func IsWord ¶
func IsWord(v string) bool
IsWord is a predicate that reports whether a string contains at least two characters and does not contain any numbers
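The predicates can also be used standalone. A short sketch, again with a placeholder import path; whether a particular word appears in the stop-word list is an assumption about the package's built-in list:

package main

import (
	"fmt"

	"example.com/nlp" // placeholder import path
)

func main() {
	fmt.Println(nlp.IsWord("fox"))        // true: two or more characters, no numbers
	fmt.Println(nlp.IsWord("f"))          // false: fewer than two characters
	fmt.Println(nlp.IsWord("4x4"))        // false: contains numbers
	fmt.Println(nlp.IsStopWord("the"))    // true, assuming "the" is in the built-in list
	fmt.Println(nlp.IsNotStopWord("the")) // false: the inverse of IsStopWord
}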
func LoadStopWords ¶
func LoadStopWords(filename string) error
func Map ¶
func Map(vs chan string, f ...Mapper) chan string
func ScanAlphaWords ¶
func ScanAlphaWords(data []byte, atEOF bool) (advance int, token []byte, err error)
ScanAlphaWords is a split function that breaks text on whitespace, punctuation, and symbols; it is derived from bufio.ScanWords
func WordCounts ¶
func WordCounts(r io.Reader) (map[string]int, error)
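ScanAlphaWords has the bufio.SplitFunc signature, so it plugs directly into a bufio.Scanner, while WordCounts tallies token frequencies from any io.Reader. A sketch with a placeholder import path; the exact counts assume no extra normalization inside WordCounts:

package main

import (
	"bufio"
	"fmt"
	"strings"

	"example.com/nlp" // placeholder import path
)

func main() {
	sc := bufio.NewScanner(strings.NewReader("Hello, gopher... hello!"))
	sc.Split(nlp.ScanAlphaWords) // split on whitespace, punctuation, and symbols
	for sc.Scan() {
		fmt.Println(sc.Text()) // each word, without the surrounding punctuation
	}

	counts, err := nlp.WordCounts(strings.NewReader("gopher gopher fox"))
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["gopher"]) // 2, assuming no additional filtering is applied
}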
Types ¶
type Classifier ¶
type Classifier interface {
	// Train allows clients to train the classifier
	Train(io.Reader, string) error
	// TrainString allows clients to train the classifier using a string
	TrainString(string, string) error
	// Classify performs a classification on the input corpus and assumes that
	// the underlying classifier has been trained.
	Classify(io.Reader) (string, error)
	// ClassifyString performs text classification using a string
	ClassifyString(string) (string, error)
}
Classifier provides a simple interface for different text classifiers
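Because Classifier is an interface, calling code can stay implementation-agnostic. A sketch of a generic helper, with a placeholder import path; the argument order for TrainString (document text first, label second) is an assumption that mirrors Train(io.Reader, string):

// trainAndClassify depends only on the Classifier interface, so any
// implementation from this package (or your own) can be plugged in.
func trainAndClassify(c nlp.Classifier, samples map[string]string, doc string) (string, error) {
	// samples maps label -> example text; the TrainString argument order
	// is assumed to mirror Train(io.Reader, string).
	for label, text := range samples {
		if err := c.TrainString(text, label); err != nil {
			return "", err
		}
	}
	return c.ClassifyString(doc)
}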
type StdOption ¶
type StdOption func(*StdTokenizer)
StdOption provides configuration settings for a StdTokenizer
func BufferSize ¶
BufferSize adjusts the size of the buffered channel
type StdTokenizer ¶
type StdTokenizer struct {
// contains filtered or unexported fields
}
StdTokenizer provides a common document tokenizer that splits a document by word boundaries
func NewTokenizer ¶
func NewTokenizer(opts ...StdOption) *StdTokenizer
NewTokenizer initializes a new standard Tokenizer instance
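A construction sketch using the functional options above. The int argument to BufferSize is an assumption, since this page does not show its signature, and it assumes *StdTokenizer satisfies the Tokenizer interface below:

package main

import (
	"fmt"
	"strings"

	"example.com/nlp" // placeholder import path
)

func main() {
	tok := nlp.NewTokenizer(nlp.BufferSize(128)) // assumed: BufferSize takes an int
	for token := range tok.Tokenize(strings.NewReader("The quick brown fox")) {
		fmt.Println(token)
	}
}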
type Tokenizer ¶
type Tokenizer interface {
	// Tokenize breaks the provided document into a channel of tokens
	Tokenize(io.Reader) chan string
}
Tokenizer provides a common interface to tokenize documents
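The single-method contract makes alternative tokenizers easy to supply. A sketch of a whitespace-only implementation built on bufio.ScanWords, with a placeholder import path for the compile-time interface check:

package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"

	"example.com/nlp" // placeholder import path
)

// wsTokenizer splits purely on whitespace, unlike StdTokenizer.
type wsTokenizer struct{}

func (wsTokenizer) Tokenize(r io.Reader) chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		sc := bufio.NewScanner(r)
		sc.Split(bufio.ScanWords)
		for sc.Scan() {
			out <- sc.Text()
		}
	}()
	return out
}

// Compile-time check that the custom type satisfies the interface.
var _ nlp.Tokenizer = wsTokenizer{}

func main() {
	for w := range (wsTokenizer{}).Tokenize(strings.NewReader("a b c")) {
		fmt.Println(w)
	}
}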
type WeightScheme ¶
WeightScheme provides a contract for term frequency weight schemes
func BagOfWords ¶
func BagOfWords(doc map[string]float64) WeightScheme
BagOfWords weight scheme: counts the number of occurrences
func Binary ¶
func Binary(doc map[string]float64) WeightScheme
Binary weight scheme: 1 if present; 0 otherwise
func LogNorm ¶
func LogNorm(doc map[string]float64) WeightScheme
LogNorm weight scheme: returns the natural log of the number of occurrences of a term
func TermFrequency ¶
func TermFrequency(doc map[string]float64) WeightScheme
TermFrequency weight scheme: the number of occurrences of a term divided by the number of terms within the document
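The four schemes differ only in the formula applied to a document's raw term counts. The sketch below reproduces that arithmetic directly, as described above; it is plain Go for illustration, not the package's own types:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Raw occurrence counts for one document.
	doc := map[string]float64{"fox": 3, "dog": 1}

	var total float64 // total number of terms in the document
	for _, n := range doc {
		total += n
	}

	for term, n := range doc {
		fmt.Printf("%s bag=%v binary=%v lognorm=%.3f tf=%.3f\n",
			term,
			n,              // BagOfWords: the occurrence count itself
			math.Min(n, 1), // Binary: 1 if present, 0 otherwise
			math.Log(n),    // LogNorm: natural log of the count, as documented
			n/total,        // TermFrequency: count divided by terms in the document
		)
	}
}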
type WeightSchemeStrategy ¶
type WeightSchemeStrategy func(doc map[string]float64) WeightScheme
WeightSchemeStrategy provides support for pluggable weight schemes
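Since every constructor above has the WeightSchemeStrategy shape, schemes can be selected at runtime. A sketch with a placeholder import path; this page does not show how the resulting WeightScheme is invoked, so the sketch stops at constructing it:

package main

import "example.com/nlp" // placeholder import path

func main() {
	// Pick a scheme by name at runtime; each value is a WeightSchemeStrategy.
	schemes := map[string]nlp.WeightSchemeStrategy{
		"bag":    nlp.BagOfWords,
		"binary": nlp.Binary,
		"log":    nlp.LogNorm,
		"tf":     nlp.TermFrequency,
	}

	doc := map[string]float64{"fox": 3, "dog": 1}

	// Bind the chosen scheme to this document's term counts.
	weigh := schemes["tf"](doc)
	_ = weigh // invoking a WeightScheme is not shown on this page
}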