Documentation ¶
Overview ¶
Package go-vsm implements an algebraic vector space model for comparing documents and queries.
The vector space model represents documents and queries as objects (points or vectors) in an N-dimensional space in which each term is a dimension. A document d with N terms can therefore be represented as a point or vector with coordinates d(w₁, w₂, ..., wN), where wᵢ is the weight of the i-th term.
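To make the "each term is a dimension" idea concrete, here is a minimal sketch that turns a sentence into a raw term-count vector. The helper names termCounts and vectorize are hypothetical and are not part of this package, and the tokenization (lowercasing, splitting on non-alphanumeric runes) is an assumption rather than the package's actual filter.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// termCounts lowercases the sentence, splits it on anything that is not a
// letter or digit, and counts how often each term occurs.
// Hypothetical helper: the package's own tokenizer may behave differently.
func termCounts(sentence string) map[string]int {
	isSep := func(r rune) bool { return !unicode.IsLetter(r) && !unicode.IsDigit(r) }
	counts := make(map[string]int)
	for _, term := range strings.FieldsFunc(strings.ToLower(sentence), isSep) {
		counts[term]++
	}
	return counts
}

// vectorize lays the counts out along a fixed vocabulary so that every
// document uses the same dimension order. Terms missing from the document
// get a coordinate of 0.
func vectorize(vocab []string, counts map[string]int) []int {
	vec := make([]int, len(vocab))
	for i, term := range vocab {
		vec[i] = counts[term]
	}
	return vec
}

func main() {
	counts := termCounts("Delivery of silver arrived in a silver truck.")

	// Build a sorted vocabulary so the dimension order is deterministic.
	vocab := make([]string, 0, len(counts))
	for term := range counts {
		vocab = append(vocab, term)
	}
	sort.Strings(vocab)

	fmt.Println(vocab)
	fmt.Println(vectorize(vocab, counts))
}
```

In a real corpus the vocabulary would span all documents, so most coordinates of any single document are zero; that is why sparse maps are often preferred over dense slices.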
The term weighting scheme used in this package is TF-IDF:
w = term.Count * log(|Docs| / |{ d ∈ Docs | term ∈ d}|)
where Docs is the set of known documents in the corpus.
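The weighting formula above can be sketched as follows. This is an illustration only: the function name tfidf and the pre-tokenized input are assumptions, and the natural logarithm is used because the formula does not state a base (the base only scales all weights by a constant).

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidf returns, for each document, a map from term to
// count * log(|Docs| / number of documents containing the term).
// Assumption: natural logarithm; the package may use another base.
func tfidf(docs [][]string) []map[string]float64 {
	// df counts in how many documents each term appears.
	df := make(map[string]int)
	counts := make([]map[string]int, len(docs))
	for i, terms := range docs {
		counts[i] = make(map[string]int)
		for _, term := range terms {
			if counts[i][term] == 0 {
				df[term]++ // first occurrence of the term in this document
			}
			counts[i][term]++
		}
	}

	n := float64(len(docs))
	weights := make([]map[string]float64, len(docs))
	for i, c := range counts {
		weights[i] = make(map[string]float64, len(c))
		for term, k := range c {
			weights[i][term] = float64(k) * math.Log(n/float64(df[term]))
		}
	}
	return weights
}

func main() {
	docs := [][]string{
		strings.Fields("shipment of gold damaged in a fire"),
		strings.Fields("delivery of silver arrived in a silver truck"),
		strings.Fields("shipment of gold arrived in a truck"),
	}
	for i, w := range tfidf(docs) {
		fmt.Printf("d%d: silver=%.4f gold=%.4f of=%.4f\n", i+1, w["silver"], w["gold"], w["of"])
	}
}
```

A term that occurs in every document, such as "of" here, gets weight 0 because log(1) = 0, so it does not influence the comparison.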
The vector space model measures the similarity between each document vector and the query vector by the angle between them, computed as the cosine of that angle. For each document vector dᵢ and a query vector q:
cosθ = (dᵢ • q) / (||dᵢ|| * ||q||)
where ||dᵢ|| is the magnitude of the document vector and ||q|| is the magnitude of the query vector.
See: http://www.minerazzi.com/tutorials/term-vector-1.pdf
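The cosine formula above can be sketched over sparse term vectors as shown below. The vector type and the functions dot, magnitude and cosine are hypothetical names, not the package's API, and the weights in main are made-up illustrative numbers rather than values produced by the package. Returning 0 for a zero-magnitude vector is also an assumption, chosen to avoid a division by zero.

```go
package main

import (
	"fmt"
	"math"
)

// vector is a sparse term vector: term -> weight.
type vector map[string]float64

// dot returns the dot product; terms missing from b contribute 0.
func dot(a, b vector) float64 {
	var sum float64
	for term, w := range a {
		sum += w * b[term]
	}
	return sum
}

// magnitude returns the Euclidean length of v.
func magnitude(v vector) float64 { return math.Sqrt(dot(v, v)) }

// cosine returns (d • q) / (||d|| * ||q||), or 0 when either vector
// has zero magnitude (assumption, to avoid dividing by zero).
func cosine(d, q vector) float64 {
	den := magnitude(d) * magnitude(q)
	if den == 0 {
		return 0
	}
	return dot(d, q) / den
}

func main() {
	// Illustrative weights only, not values produced by the package.
	q := vector{"gold": 1, "silver": 3, "truck": 1}
	docs := []struct {
		class string
		vec   vector
	}{
		{"d1", vector{"shipment": 1, "gold": 1, "damaged": 3, "fire": 3}},
		{"d2", vector{"delivery": 3, "silver": 6, "arrived": 1, "truck": 1}},
		{"d3", vector{"shipment": 1, "gold": 1, "arrived": 1, "truck": 1}},
	}

	// Keep the document with the highest cosine against the query.
	best, bestScore := "", 0.0
	for _, d := range docs {
		score := cosine(d.vec, q)
		fmt.Printf("%s: %.4f\n", d.class, score)
		if score > bestScore {
			best, bestScore = d.class, score
		}
	}
	fmt.Println("most similar:", best)
}
```

Because the cosine depends only on the angle, scaling every weight by the same constant (for example, by switching the logarithm base in TF-IDF) leaves the ranking unchanged.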
Example ¶
docs := []Document{
	{
		Sentence: "Shipment of gold damaged in a fire.",
		Class:    "d1",
	},
	{
		Sentence: "Delivery of silver arrived in a silver truck.",
		Class:    "d2",
	},
	{
		Sentence: "Shipment of gold arrived in a truck.",
		Class:    "d3",
	},
}

vsm := New(nil)

for _, doc := range docs {
	fmt.Println(vsm.StaticTraining(doc))
}

doc, err := vsm.Search("gold silver truck.")

fmt.Println(doc.Class, err)

Output:

<nil>
<nil>
<nil>
d2 <nil>
Types ¶
type Document ¶
Document holds a sentence, which is tokenized and filtered in order to be represented as a vector, and a class for keywording/tagging the sentence.
type TrainingResult ¶
TrainingResult holds the Document used for the training process and an error which will be nil if no error occurred.
type VSM ¶
type VSM struct {
// contains filtered or unexported fields
}
VSM holds the corpus for the algebraic vector space model calculation.
func New ¶
func New(t transform.Transformer) *VSM
New returns a VSM structure used for searching documents in the corpus. The Transformer is used to filter document sentences and queries.
func (*VSM) DynamicTraining ¶
func (v *VSM) DynamicTraining(ctx context.Context, docCh <-chan Document) <-chan TrainingResult
DynamicTraining receives a producer channel of Document values used to augment the corpus dynamically. It returns a TrainingResult channel that can be used to check whether an error occurred during the training process.
Example ¶
docs := []Document{
	{
		Sentence: "Shipment of gold damaged in a fire.",
		Class:    "d1",
	},
	{
		Sentence: "Shipment of gold arrived in a truck.",
		Class:    "d3",
	},
}

vsm := New(nil)

for _, doc := range docs {
	err := vsm.StaticTraining(doc)
	fmt.Println(err)
}

docCh := make(chan Document)

go func() {
	defer close(docCh)

	// Loads document from some source dynamically
	// and sends it to the training channel.
	docCh <- Document{
		Sentence: "Delivery of silver arrived in a silver truck.",
		Class:    "d2",
	}
}()

trainCh := vsm.DynamicTraining(context.Background(), docCh)

// Waits until all documents are consumed by the training process.
for {
	_, ok := <-trainCh
	if !ok {
		break
	}
}

doc, err := vsm.Search("gold silver truck.")

fmt.Println(doc.Class, err)

Output:

<nil>
<nil>
d2 <nil>
func (*VSM) Search ¶
Search returns the document in the corpus that is most similar to the query according to the vector space model, or an error. A nil Document means that no document in the corpus is similar to the query.
Example ¶
docs := []Document{
	{
		Sentence: "Shipment of gold damaged in a fire.",
		Class:    "d1",
	},
	{
		Sentence: "Delivery of silver arrived in a silver truck.",
		Class:    "d2",
	},
	{
		Sentence: "Shipment-of-gold-arrived in a truck.",
		Class:    "d3",
	},
}

// transformer will be applied to every sentence
// and query.
transformer := runes.Map(func(r rune) rune {
	// Replaces hyphens by space.
	if unicode.Is(unicode.Hyphen, r) {
		return ' '
	}
	return r
})

vsm := New(transformer)

for _, doc := range docs {
	fmt.Println(vsm.StaticTraining(doc))
}

doc, err := vsm.Search("shipment gold in a flying truck.")

fmt.Println(doc.Class, err)

Output:

<nil>
<nil>
<nil>
d3 <nil>
func (*VSM) StaticTraining ¶
StaticTraining receives a Document used to augment the corpus. It returns an error if the training was unsuccessful.