ptcount

package
v0.18.6
Published: Feb 8, 2021 License: Apache-2.0 Imports: 7 Imported by: 0

Documentation

Overview

Copyright 2019 Tomas Machalek <tomas.machalek@gmail.com>

Copyright 2019 Charles University, Faculty of Arts, Institute of the Czech National Corpus

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ARFCalculator

type ARFCalculator struct {
	// contains filtered or unexported fields
}

ARFCalculator calculates ARF (average reduced frequency) for all the [ngram_uniq_id] => NgramCounter pairs obtained in the 1st pass.

func NewARFCalculator

func NewARFCalculator(counts map[string]*NgramCounter, ngramConf *cnf.NgramConf, numTokens int,
	columnModders []*modders.ModderChain, wordDict *WordDict, atomStruct string) *ARFCalculator

NewARFCalculator is the recommended factory for creating an ARFCalculator instance.

func (*ARFCalculator) Finalize

func (arfc *ARFCalculator) Finalize()

Finalize performs the final calculations on the obtained (and continuously calculated) data. It must be called to obtain correct ARF results.

func (*ARFCalculator) ProcStruct

func (arfc *ARFCalculator) ProcStruct(strc *vertigo.Structure, line int, err error) error

ProcStruct is used by Vertigo parser but we don't need it here

func (*ARFCalculator) ProcStructClose

func (arfc *ARFCalculator) ProcStructClose(strc *vertigo.StructureClose, line int, err error) error

ProcStructClose is used by Vertigo parser but we don't need it here

func (*ARFCalculator) ProcToken

func (arfc *ARFCalculator) ProcToken(tk *vertigo.Token, line int, err error) error

ProcToken is called by the Vertigo parser when a token is encountered.

type NgramCounter

type NgramCounter struct {
	// contains filtered or unexported fields
}

NgramCounter stores an n-gram with multiple attributes per position, along with absolute frequency information and, optionally, ARF information.

func NewNgramCounter

func NewNgramCounter(size int) *NgramCounter

NewNgramCounter creates a new n-gram with count = 1

func (*NgramCounter) ARF

func (c *NgramCounter) ARF() *WordARF

ARF returns the ARF helper record.

func (*NgramCounter) AddARF

func (c *NgramCounter) AddARF(tk *vertigo.Token)

AddARF creates a new helper record to calculate ARF for the record.

func (*NgramCounter) AddToken

func (c *NgramCounter) AddToken(pos []int)

AddToken adds further tokens (besides the 0th) to the n-gram.

func (*NgramCounter) Count

func (c *NgramCounter) Count() int

Count tells how many occurrences of the n-gram have been found.

func (*NgramCounter) CurrLength

func (c *NgramCounter) CurrLength() int

CurrLength returns the actual n-gram length (i.e. if a trigram has only its first position filled in, the returned value is 1).

func (*NgramCounter) ForEachAttr

func (c *NgramCounter) ForEachAttr(wDict *WordDict, fn func(item string, i int))

ForEachAttr calls the provided function on each stored column from the vertical file (e.g. fn([word]), then fn([lemma]), then fn([pos])).

func (*NgramCounter) HasARF

func (c *NgramCounter) HasARF() bool

HasARF tests whether ARF calculation storage is present. If it is not, then either the job configuration does not request ARF calculation or the storage has not been set for the specific record yet.

func (*NgramCounter) IncCount

func (c *NgramCounter) IncCount()

IncCount increases the number of occurrences of the n-gram.

func (*NgramCounter) Length

func (c *NgramCounter) Length() int

Length returns n-gram length (1 = unigram, 2 = bigram,...)

func (*NgramCounter) UniqueID

func (c *NgramCounter) UniqueID(columns []int) string

UniqueID creates a unique n-gram identifier.

func (*NgramCounter) Width

func (c *NgramCounter) Width() int

Width says how many columns are used for unique records in the result (e.g. [word, lemma, pos] means width of 3)
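To make the relationship between Length, CurrLength, Width and UniqueID concrete, here is a minimal, self-contained sketch. It does not reproduce the package's unexported fields or key format. The ngram type, its lowercase methods, the tally helper and the "," / "|" separators are all assumptions chosen for illustration, as is the idea that each position stores one integer ID per column.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ngram is a hypothetical stand-in for NgramCounter. Each position holds
// one integer ID per column (e.g. word, lemma, pos).
type ngram struct {
	positions [][]int // positions filled in so far
	size      int     // target n-gram length
	count     int     // number of occurrences
}

// newNgram creates an n-gram of the given target length with count = 1.
func newNgram(size int) *ngram {
	return &ngram{positions: make([][]int, 0, size), size: size, count: 1}
}

func (n *ngram) addToken(cols []int) { n.positions = append(n.positions, cols) }

// length is the target length (1 = unigram, 2 = bigram, ...).
func (n *ngram) length() int { return n.size }

// currLength is the number of positions actually filled in.
func (n *ngram) currLength() int { return len(n.positions) }

// width is the number of columns stored per position.
func (n *ngram) width() int {
	if len(n.positions) == 0 {
		return 0
	}
	return len(n.positions[0])
}

// uniqueID joins the selected columns of every position into one key.
// The separators are arbitrary choices of this sketch.
func (n *ngram) uniqueID(columns []int) string {
	parts := make([]string, 0, len(n.positions))
	for _, pos := range n.positions {
		ids := make([]string, 0, len(columns))
		for _, c := range columns {
			ids = append(ids, strconv.Itoa(pos[c]))
		}
		parts = append(parts, strings.Join(ids, ","))
	}
	return strings.Join(parts, "|")
}

// tally stores n under its key, or increments the count of an
// already-stored n-gram with the same key.
func tally(counts map[string]*ngram, n *ngram, columns []int) {
	key := n.uniqueID(columns)
	if existing, ok := counts[key]; ok {
		existing.count++
		return
	}
	counts[key] = n
}

func main() {
	counts := make(map[string]*ngram)
	for i := 0; i < 2; i++ {
		n := newNgram(2)
		n.addToken([]int{4, 7})
		n.addToken([]int{5, 8})
		tally(counts, n, []int{0, 1})
	}
	for key, n := range counts {
		fmt.Println(key, n.count, n.length(), n.currLength(), n.width())
	}
}
```

Keying the map by a string built from the selected columns is what lets two occurrences that differ only in an unselected column collapse into one record.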

type Position

type Position struct {
	Columns []int
}

Position specifies positional attributes (e.g. word, lemma, tag) at some n-gram position

type WordARF

type WordARF struct {
	ARF        float64
	FirstIdx   int
	PrevTokIdx int
}

WordARF is used as an attribute of NgramCounter to calculate ARF. The attributes are designed for a two-pass calculation: in the 1st pass we obtain the average distance between word instances, and in the 2nd pass we calculate the actual result. This method is slower (the vertical file is parsed twice), but it needs less memory than a single-pass method.
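The two-pass scheme described above can be sketched as follows. The source does not spell out the formula, so this assumes the commonly cited definition of average reduced frequency: with N tokens in the corpus and f occurrences of an item, v = N/f is the average distance, and ARF = (1/v) * sum of min(d_i, v) over the distances d_i between consecutive occurrences, where the last distance wraps around from the final occurrence to the first. The arfState type and the observe, finalize and computeARF functions are hypothetical names. Only the three field names are taken from WordARF.

```go
package main

import (
	"fmt"
	"math"
)

// arfState is a hypothetical stand-in for WordARF, reusing its field names.
type arfState struct {
	ARF        float64 // running sum, then the final value after finalize
	FirstIdx   int     // token index of the first occurrence (-1 = none yet)
	PrevTokIdx int     // token index of the previous occurrence
}

// observe is the 2nd-pass step: v (the average distance N/f) is already
// known from the 1st pass, so each occurrence adds min(v, distance).
func (s *arfState) observe(idx int, v float64) {
	if s.FirstIdx < 0 {
		s.FirstIdx = idx
	} else {
		s.ARF += math.Min(v, float64(idx-s.PrevTokIdx))
	}
	s.PrevTokIdx = idx
}

// finalize adds the wrap-around distance from the last occurrence back
// to the first one and normalizes the sum by v.
func (s *arfState) finalize(numTokens int, v float64) {
	if s.FirstIdx < 0 {
		return // no occurrence was observed
	}
	wrap := float64(s.FirstIdx + numTokens - s.PrevTokIdx)
	s.ARF = (s.ARF + math.Min(v, wrap)) / v
}

// computeARF runs both steps over a sorted list of token positions.
func computeARF(positions []int, numTokens int) float64 {
	if len(positions) == 0 {
		return 0
	}
	v := float64(numTokens) / float64(len(positions))
	s := &arfState{FirstIdx: -1, PrevTokIdx: -1}
	for _, p := range positions {
		s.observe(p, v)
	}
	s.finalize(numTokens, v)
	return s.ARF
}

func main() {
	fmt.Println(computeARF([]int{0, 5}, 10)) // evenly spread occurrences
	fmt.Println(computeARF([]int{0, 1}, 10)) // clustered occurrences
}
```

In a 10-token corpus, two evenly spread occurrences keep their full weight (ARF 2), while two adjacent occurrences are reduced (ARF 1.2). Under this definition the wrap-around term can only be added once the last occurrence is known, which is one plausible reason a separate finalizing step is needed.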

func (WordARF) String

func (ws WordARF) String() string

type WordDict added in v0.10.0

type WordDict struct {
	// contains filtered or unexported fields
}

func NewWordDict added in v0.10.0

func NewWordDict() *WordDict

func (*WordDict) Add added in v0.10.0

func (w *WordDict) Add(word string) int

func (*WordDict) Get added in v0.10.0

func (w *WordDict) Get(idx int) string

func (*WordDict) Size added in v0.10.0

func (w *WordDict) Size() int
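The source gives no description for WordDict. Judging only from the signatures (Add returns an int, Get takes an int and returns a string), one plausible reading is a string-interning dictionary that maps each distinct word to a small integer ID. The sketch below illustrates that reading under that assumption. The wordDict type and its lowercase methods are hypothetical and may differ from the real implementation, for example in how repeated or unknown entries are handled.

```go
package main

import "fmt"

// wordDict is a hypothetical string-interning dictionary: each distinct
// word gets a stable integer ID equal to its insertion order.
type wordDict struct {
	index map[string]int // word -> ID
	words []string       // ID -> word
}

func newWordDict() *wordDict {
	return &wordDict{index: make(map[string]int)}
}

// add returns the ID of word, registering it first if it is new.
func (w *wordDict) add(word string) int {
	if idx, ok := w.index[word]; ok {
		return idx
	}
	idx := len(w.words)
	w.index[word] = idx
	w.words = append(w.words, word)
	return idx
}

// get returns the word stored under idx, or "" when idx is out of range
// (the out-of-range behavior is a choice of this sketch).
func (w *wordDict) get(idx int) string {
	if idx < 0 || idx >= len(w.words) {
		return ""
	}
	return w.words[idx]
}

// size returns the number of distinct words stored.
func (w *wordDict) size() int { return len(w.words) }

func main() {
	d := newWordDict()
	a := d.add("dog")
	b := d.add("cat")
	c := d.add("dog") // repeated word, same ID as the first call
	fmt.Println(a, b, c, d.size(), d.get(b))
}
```

Storing integer IDs instead of strings at every n-gram position keeps records compact, and a callback such as ForEachAttr can translate the IDs back to strings through the dictionary.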
