Documentation ¶
Index ¶
- Variables
- type Annotation
- type Document
- func (d *Document) Annotate(a *Annotation, what string) error
- func (d *Document) AssembleSentences()
- func (d *Document) Input(sec string) (string, error)
- func (d *Document) SectionAnnotationCount(sec string) (int, error)
- func (d *Document) SectionAnnotations(sec string) []*Annotation
- func (d *Document) SectionSentenceCount(sec string) (int, error)
- func (d *Document) SectionSentences(sec string) []*Sentence
- func (d *Document) SectionTokenCount(sec string) (int, error)
- func (d *Document) SectionTokens(sec string) []*TextToken
- func (d *Document) SectionWordCount(sec string) (int, error)
- func (d *Document) SectionWords(sec string) []*Word
- func (d *Document) SetInput(sec, input string) error
- func (d *Document) Tokenize()
- type Sentence
- type SentenceIterator
- type TextToken
- type TextTokenIterator
- type Token
- type TokenType
- type Word
Constants ¶
This section is empty.
Variables ¶
var MayBeTermAbbrevs = map[string]struct{}{
"etc": {},
}
MayBeTermAbbrevs lists the common abbreviations that could end with a full stop, possibly without ending the sentence. The abbrevs are in lowercase.
var MayBeTermGroupAbbrevs = map[string][]string{
"e": {"i"},
"g": {"e"},
}
MayBeTermGroupAbbrevs lists the common abbreviations that are compound, i.e. they involve more than one token. The table omits any intervening period. The abbrevs are in lowercase.
var NonTermAbbrevs = map[string]struct{}{
"viz": {},
"eg": {},
"ex": {},
"fig": {},
"mr": {},
"ms": {},
"mrs": {},
"dr": {},
"prof": {},
}
NonTermAbbrevs lists the common abbreviations that could end with a full stop, but without ending the sentence. The abbrevs are in lowercase.
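A minimal sketch of how such abbreviation tables are typically consulted when deciding whether a full stop terminates a sentence. The maps below mirror `NonTermAbbrevs` and `MayBeTermAbbrevs`; the `periodEndsSentence` helper and its capitalisation heuristic are illustrative, not the package's actual logic.

```go
package main

import (
	"fmt"
	"strings"
)

// nonTermAbbrevs mirrors NonTermAbbrevs: abbreviations whose trailing
// period never ends the sentence.
var nonTermAbbrevs = map[string]struct{}{
	"viz": {}, "eg": {}, "ex": {}, "fig": {},
	"mr": {}, "ms": {}, "mrs": {}, "dr": {}, "prof": {},
}

// mayBeTermAbbrevs mirrors MayBeTermAbbrevs: abbreviations whose trailing
// period may or may not end the sentence.
var mayBeTermAbbrevs = map[string]struct{}{
	"etc": {},
}

// periodEndsSentence decides whether a period after `word` terminates the
// sentence; for ambiguous abbreviations it falls back to inspecting the
// next token's capitalisation.
func periodEndsSentence(word, next string) bool {
	w := strings.ToLower(strings.TrimSuffix(word, "."))
	if _, ok := nonTermAbbrevs[w]; ok {
		return false
	}
	if _, ok := mayBeTermAbbrevs[w]; ok {
		// Ambiguous: a following capitalised token suggests a new sentence.
		return next != "" && strings.ToUpper(next[:1]) == next[:1]
	}
	return true
}

func main() {
	fmt.Println(periodEndsSentence("Dr.", "Watson")) // false: "dr" never terminates
	fmt.Println(periodEndsSentence("etc.", "The"))   // true: capitalised continuation
	fmt.Println(periodEndsSentence("etc.", "and"))   // false: lowercase continuation
	fmt.Println(periodEndsSentence("done.", "Next")) // true: ordinary word
}
```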
var TtDescriptions = map[TokenType]string{
	TokOther:        "TokOther",
	TokSpace:        "TokSpace",
	TokLetter:       "TokLetter",
	TokNumber:       "TokNumber",
	TokMayBeTerm:    "TokMayBeTerm",
	TokTerm:         "TokTerm",
	TokPause:        "TokPause",
	TokParenOpen:    "TokParenOpen",
	TokParenClose:   "TokParenClose",
	TokBracketOpen:  "TokBracketOpen",
	TokBracketClose: "TokBracketClose",
	TokBraceOpen:    "TokBraceOpen",
	TokBraceClose:   "TokBraceClose",
	TokSquote:       "TokSquote",
	TokDquote:       "TokDquote",
	TokIniQuote:     "TokIniQuote",
	TokFinQuote:     "TokFinQuote",
	TokPunct:        "TokPunct",
	TokSymbol:       "TokSymbol",
	TokMayBeWord:    "TokMayBeWord",
	TokWord:         "TokWord",
	TokSentence:     "TokSentence",
}
TtDescriptions helps in printing token types.
Functions ¶
This section is empty.
Types ¶
type Annotation ¶
type Annotation struct {
	DocumentID string
	Section    string
	Begin      int
	End        int
	Entity     string
	Property   string
}
Annotation represents a curated annotation of a logical word in a text.
Each annotated word belongs to exactly one input document, and exactly one identified section within that (title, abstract, etc.). The annotation also holds information about a particular property of the word. Annotations are used for training the tools.
func NewAnnotation ¶
func NewAnnotation(in string) (*Annotation, error)
NewAnnotation creates and initialises a new annotation for the given input word.
It expects its input to be in six tab-separated columns. The order of the fields is:
- document identifier,
- section,
- beginning index of the word in the input text,
- corresponding ending index,
- the word itself, and
- entity type.
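The six-column format above can be parsed as in the following sketch. The `annotation` struct and `parseAnnotation` helper are illustrative stand-ins for the package's `Annotation` and `NewAnnotation`, and the sample line is invented.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// annotation mirrors the documented fields of Annotation, plus the
// annotated word itself, which appears as the fifth input column.
type annotation struct {
	DocumentID string
	Section    string
	Begin, End int
	Word       string
	Entity     string
}

// parseAnnotation splits one tab-separated input line into its six fields.
func parseAnnotation(in string) (*annotation, error) {
	fields := strings.Split(strings.TrimSpace(in), "\t")
	if len(fields) != 6 {
		return nil, fmt.Errorf("expected 6 tab-separated fields, got %d", len(fields))
	}
	begin, err := strconv.Atoi(fields[2])
	if err != nil {
		return nil, fmt.Errorf("bad begin index: %w", err)
	}
	end, err := strconv.Atoi(fields[3])
	if err != nil {
		return nil, fmt.Errorf("bad end index: %w", err)
	}
	return &annotation{
		DocumentID: fields[0],
		Section:    fields[1],
		Begin:      begin,
		End:        end,
		Word:       fields[4],
		Entity:     fields[5],
	}, nil
}

func main() {
	a, err := parseAnnotation("doc42\tabstract\t10\t16\tkinase\tPROTEIN")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *a)
}
```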
type Document ¶
type Document struct {
// contains filtered or unexported fields
}
Document represents the entirety of input text of one logical document -- usually a file.
It holds information about its sections, tokens in them, and the words and sentences that were recognised by other processors. In case the document has associated training annotations, it holds them as well.
func NewDocument ¶
NewDocument creates and initialises a document with the given identifier.
It holds information about its sections, as well as about their constituent tokens, words, sentences and annotations.
func NewTechnicalDocument ¶
NewTechnicalDocument creates and initialises a document of technical nature, with the given identifier.
It holds information about its sections, as well as about their constituent tokens, words, sentences and annotations.
func (*Document) Annotate ¶
func (d *Document) Annotate(a *Annotation, what string) error
Annotate records the given annotation against the applicable sequence of tokens in the appropriate section of the document.
It creates or updates a `Word` corresponding to the text in the annotation. The annotation can be for one of: (a) part of speech ("POS"), (b) lemma ("LEM") or (c) class/category ("CLS").
func (*Document) AssembleSentences ¶ added in v0.1.1
func (d *Document) AssembleSentences()
AssembleSentences builds sentences from the text tokens obtained by tokenizing the sections of the document.
func (*Document) Input ¶
func (d *Document) Input(sec string) (string, error)
Input answers the registered input text of the given section, if one exists.
func (*Document) SectionAnnotationCount ¶
func (d *Document) SectionAnnotationCount(sec string) (int, error)
SectionAnnotationCount answers the number of registered annotations in the given section.
func (*Document) SectionAnnotations ¶ added in v0.1.1
func (d *Document) SectionAnnotations(sec string) []*Annotation
SectionAnnotations answers registered annotations for the given section.
func (*Document) SectionSentenceCount ¶ added in v0.1.1
func (d *Document) SectionSentenceCount(sec string) (int, error)
SectionSentenceCount answers the number of assembled sentences in the given section.
func (*Document) SectionSentences ¶ added in v0.1.1
func (d *Document) SectionSentences(sec string) []*Sentence
SectionSentences answers assembled sentences in the given section.
func (*Document) SectionTokenCount ¶
func (d *Document) SectionTokenCount(sec string) (int, error)
SectionTokenCount answers the number of recognised tokens in the given section.
func (*Document) SectionTokens ¶ added in v0.1.1
func (d *Document) SectionTokens(sec string) []*TextToken
SectionTokens answers recognised tokens in the given section.
func (*Document) SectionWordCount ¶
func (d *Document) SectionWordCount(sec string) (int, error)
SectionWordCount answers the number of recognised words in the given section.
func (*Document) SectionWords ¶ added in v0.1.1
func (d *Document) SectionWords(sec string) []*Word
SectionWords answers recognised words in the given section.
func (*Document) Tokenize ¶
func (d *Document) Tokenize()
Tokenize breaks the text in the various sections of the document into quasi-atomic tokens.
These tokens can be matched against any available annotations. They can also be combined into logical words for named entity recognition and part of speech recognition purposes.
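A tokenizer of this kind typically begins by classifying each rune. The sketch below assigns coarse classes in the spirit of the package's `TokenType` values; the mapping is illustrative and far simpler than the real tokenizer.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify maps a rune to a coarse token class. The class names echo the
// package's TokenType constants, but the rules here are only a sketch.
func classify(r rune) string {
	switch {
	case unicode.IsSpace(r):
		return "TokSpace"
	case unicode.IsLetter(r):
		return "TokLetter"
	case unicode.IsDigit(r):
		return "TokNumber"
	case r == '.':
		return "TokMayBeTerm" // a full stop may or may not end a sentence
	case r == ',' || r == ';' || r == ':':
		return "TokPause"
	case r == '(':
		return "TokParenOpen"
	case r == ')':
		return "TokParenClose"
	case unicode.IsPunct(r):
		return "TokPunct"
	default:
		return "TokSymbol"
	}
}

func main() {
	for _, r := range "Fig. 2(a)," {
		fmt.Printf("%q -> %s\n", r, classify(r))
	}
}
```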
type Sentence ¶
type Sentence struct {
// contains filtered or unexported fields
}
Sentence represents a logical sentence.
It holds information about its text, its offsets and its constituent text tokens.
func (*Sentence) BeginToken ¶
type SentenceIterator ¶
type SentenceIterator struct {
// contains filtered or unexported fields
}
SentenceIterator helps in assembling consecutive sentences from the underlying text tokens.
func NewSentenceIterator ¶
func NewSentenceIterator(toks []*TextToken) *SentenceIterator
NewSentenceIterator creates and initialises a sentence iterator over the given text tokens.
func NewTechnicalSentenceIterator ¶
func NewTechnicalSentenceIterator(toks []*TextToken) *SentenceIterator
NewTechnicalSentenceIterator creates and initialises a sentence iterator in technical mode, over the given text tokens.
func (*SentenceIterator) Item ¶
func (si *SentenceIterator) Item() *Sentence
Item answers the current sentence. This has no side effects, and can be invoked any number of times.
func (*SentenceIterator) MoveNext ¶
func (si *SentenceIterator) MoveNext() error
MoveNext assembles the next sentence from the given input tokens.
It begins at the current running token index (which could be at the beginning of the input slice of tokens), and continues until it can logically complete a sentence. Should it be unable to complete one, it treats all remaining input tokens as constituting a single sentence.
The return value is either `nil` (more sentences may be available) or `io.EOF` (no more sentences).
type TextToken ¶
type TextToken struct {
// contains filtered or unexported fields
}
TextToken represents a piece of text extracted from a larger input. It holds information regarding its beginning and ending offsets in the input text. A text token may span the entire input.
type TextTokenIterator ¶
type TextTokenIterator struct {
// contains filtered or unexported fields
}
TextTokenIterator helps in retrieving consecutive text tokens from an input text.
func NewTextTokenIterator ¶
func NewTextTokenIterator(input string) *TextTokenIterator
NewTextTokenIterator creates and initialises a token iterator over the given input text.
func NewTextTokenIteratorWithOffset ¶
func NewTextTokenIteratorWithOffset(input string, n int) *TextTokenIterator
NewTextTokenIteratorWithOffset creates and initialises a token iterator over the given input text.
It treats the given offset - rather than 0 - as the starting index from which to track all subsequent indices.
func (*TextTokenIterator) Item ¶
func (ti *TextTokenIterator) Item() *TextToken
Item answers the current token. This has no side effects, and can be invoked any number of times.
func (*TextTokenIterator) MoveNext ¶
func (ti *TextTokenIterator) MoveNext() error
MoveNext detects the next token in the input, should one be available.
It begins at the current running byte offset (which could be the beginning of the input string), and continues until it can logically break on a token terminator. Should it be unable to find one, it treats all remaining runes in the input string as constituting a single token.
The return value is either `nil` (more tokens may be available) or `io.EOF` (no more tokens).
type Token ¶
Token represents a piece of text extracted from a larger input. It holds information regarding its beginning and ending offsets in the input text. A token may span the entire input.
type TokenType ¶
type TokenType byte
TokenType represents types that a token can have. The granularity of a token is variable: character, smallest logical unit, word, sentence, etc. Accordingly, the corresponding tokens use appropriate token types.
const (
	TokOther TokenType = iota
	TokSpace
	TokLetter
	TokNumber
	TokMayBeTerm
	TokTerm
	TokPause
	TokParenOpen
	TokParenClose
	TokBracketOpen
	TokBracketClose
	TokBraceOpen
	TokBraceClose
	TokSquote
	TokDquote
	TokIniQuote
	TokFinQuote
	TokPunct
	TokSymbol
	TokMayBeWord
	TokWord
	TokSentence
)
List of defined token types.
type Word ¶
type Word struct {
// contains filtered or unexported fields
}
Word represents a token whose type is one of `TokMayBeWord` or `TokWord`, and qualifies it.
It holds information regarding the so-called IOB (Inside, Outside, Beginning) status of the token, its lemma form (in case of a word), its part of speech (in case of a word), etc.
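The IOB scheme mentioned above can be illustrated generically. Given the half-open token index range of a recognised entity, each token is labelled Beginning, Inside or Outside; this sketch is a standalone illustration of the scheme, not the package's (unexported) representation, and the sample tokens are invented.

```go
package main

import "fmt"

// iobTags assigns Inside/Outside/Beginning labels to tokens, given the
// half-open token index range [begin, end) of a recognised entity.
func iobTags(tokens []string, begin, end int) []string {
	tags := make([]string, len(tokens))
	for i := range tokens {
		switch {
		case i == begin:
			tags[i] = "B" // first token of the entity
		case i > begin && i < end:
			tags[i] = "I" // continuation of the entity
		default:
			tags[i] = "O" // not part of the entity
		}
	}
	return tags
}

func main() {
	tokens := []string{"the", "epidermal", "growth", "factor", "receptor", "binds"}
	fmt.Println(iobTags(tokens, 1, 5)) // [O B I I I O]
}
```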