Documentation ¶
Overview ¶
Copyright 2019 Tomas Machalek <tomas.machalek@gmail.com>
Copyright 2019 Charles University, Faculty of Arts, Institute of the Czech National Corpus
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Index ¶
- type ARFCalculator
- func (arfc *ARFCalculator) Finalize()
- func (arfc *ARFCalculator) ProcStruct(strc *vertigo.Structure, line int, err error) error
- func (arfc *ARFCalculator) ProcStructClose(strc *vertigo.StructureClose, line int, err error) error
- func (arfc *ARFCalculator) ProcToken(tk *vertigo.Token, line int, err error) error
- type NgramCounter
- func (c *NgramCounter) ARF() *WordARF
- func (c *NgramCounter) AddARF(tk *vertigo.Token)
- func (c *NgramCounter) AddToken(pos []int)
- func (c *NgramCounter) Count() int
- func (c *NgramCounter) CurrLength() int
- func (c *NgramCounter) ForEachAttr(wDict *WordDict, fn func(item string, i int))
- func (c *NgramCounter) HasARF() bool
- func (c *NgramCounter) IncCount()
- func (c *NgramCounter) Length() int
- func (c *NgramCounter) UniqueID(columns []int) string
- func (c *NgramCounter) Width() int
- type Position
- type WordARF
- type WordDict
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ARFCalculator ¶
type ARFCalculator struct {
// contains filtered or unexported fields
}
ARFCalculator calculates ARF (average reduced frequency) for all the [ngram_uniq_id] => NgramCounter pairs obtained in the 1st pass.
func NewARFCalculator ¶
func NewARFCalculator(counts map[string]*NgramCounter, ngramConf *cnf.NgramConf, numTokens int, columnModders []*modders.ModderChain, wordDict *WordDict, atomStruct string) *ARFCalculator
NewARFCalculator is the recommended factory to create an instance of the type
func (*ARFCalculator) Finalize ¶
func (arfc *ARFCalculator) Finalize()
Finalize performs the final calculations on the obtained (and continuously calculated) data. It must be called to obtain correct ARF results.
func (*ARFCalculator) ProcStruct ¶
func (arfc *ARFCalculator) ProcStruct(strc *vertigo.Structure, line int, err error) error
ProcStruct is used by the Vertigo parser, but we don't need it here.
func (*ARFCalculator) ProcStructClose ¶
func (arfc *ARFCalculator) ProcStructClose(strc *vertigo.StructureClose, line int, err error) error
ProcStructClose is used by the Vertigo parser, but we don't need it here.
type NgramCounter ¶
type NgramCounter struct {
// contains filtered or unexported fields
}
NgramCounter stores an n-gram with multiple attributes per position, along with absolute frequency information and, optionally, ARF information.
func NewNgramCounter ¶
func NewNgramCounter(size int) *NgramCounter
NewNgramCounter creates a new n-gram with count = 1
func (*NgramCounter) AddARF ¶
func (c *NgramCounter) AddARF(tk *vertigo.Token)
AddARF creates a new helper record to calculate ARF for the record.
func (*NgramCounter) AddToken ¶
func (c *NgramCounter) AddToken(pos []int)
AddToken adds additional tokens (besides the 0th) to the n-gram.
func (*NgramCounter) Count ¶
func (c *NgramCounter) Count() int
Count tells how many occurrences of the n-gram have been found.
func (*NgramCounter) CurrLength ¶
func (c *NgramCounter) CurrLength() int
CurrLength returns the actual n-gram length (e.g. if a trigram has only its first position filled in, the returned value is 1).
func (*NgramCounter) ForEachAttr ¶
func (c *NgramCounter) ForEachAttr(wDict *WordDict, fn func(item string, i int))
ForEachAttr calls the provided function on each stored column from the vertical file (e.g. fn([word]), then fn([lemma]), then fn([pos])).
func (*NgramCounter) HasARF ¶
func (c *NgramCounter) HasARF() bool
HasARF tests whether ARF calculation storage is present. If it is not, then either the job configuration does not request ARF to be calculated, or the storage is not set for the specific record yet.
func (*NgramCounter) IncCount ¶
func (c *NgramCounter) IncCount()
IncCount increases the number of occurrences of the n-gram.
func (*NgramCounter) Length ¶
func (c *NgramCounter) Length() int
Length returns n-gram length (1 = unigram, 2 = bigram,...)
func (*NgramCounter) UniqueID ¶
func (c *NgramCounter) UniqueID(columns []int) string
UniqueID creates a unique n-gram identifier.
func (*NgramCounter) Width ¶
func (c *NgramCounter) Width() int
Width says how many columns are used for unique records in the result (e.g. [word, lemma, pos] means width of 3)
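The difference between Count, Length, CurrLength and Width can be illustrated with a small stand-in type. This is only a sketch: the `counter` and `position` types and their fields are hypothetical and are not the package's unexported implementation. It also assumes that each column value is an integer index into a word dictionary and that the first token is added the same way as the following ones.

```go
package main

import "fmt"

// position mirrors the documented Position type: one int per column
// (assumed here to be indexes into a word dictionary).
type position struct {
	columns []int
}

// counter is a hypothetical stand-in for NgramCounter.
type counter struct {
	tokens []position // positions filled in so far
	size   int        // target n-gram size (1 = unigram, 2 = bigram, ...)
	count  int        // number of occurrences found
}

// newCounter creates an empty n-gram of the given size with count = 1.
func newCounter(size int) *counter {
	return &counter{tokens: make([]position, 0, size), size: size, count: 1}
}

// addToken fills in the next free position; extra tokens are ignored.
func (c *counter) addToken(cols []int) {
	if len(c.tokens) < c.size {
		c.tokens = append(c.tokens, position{columns: cols})
	}
}

// length is the target n-gram size.
func (c *counter) length() int { return c.size }

// currLength is the number of positions already filled in.
func (c *counter) currLength() int { return len(c.tokens) }

// width is the number of columns stored per position.
func (c *counter) width() int {
	if len(c.tokens) == 0 {
		return 0
	}
	return len(c.tokens[0].columns)
}

func main() {
	c := newCounter(3)            // a trigram
	c.addToken([]int{10, 20, 30}) // e.g. [word, lemma, pos] indexes
	fmt.Println(c.count, c.length(), c.currLength(), c.width())
}
```

In this sketch, a trigram with one token filled in reports a length of 3, a current length of 1 and a width of 3.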
type Position ¶
type Position struct {
Columns []int
}
Position specifies positional attributes (e.g. word, lemma, tag) at some n-gram position
type WordARF ¶
WordARF is used as an attribute of NgramCounter to calculate ARF. The attributes are designed for a two-pass calculation: in the 1st pass we obtain the average distance between word instances, and in the 2nd pass we actually calculate the result. This method is slower (we parse the vertical file twice), but it needs less memory than a single-pass method.
type WordDict ¶ added in v0.10.0
type WordDict struct {
// contains filtered or unexported fields
}
func NewWordDict ¶ added in v0.10.0
func NewWordDict() *WordDict