Documentation ¶
Index ¶
- func CheckAvailableLanguage(lang string) error
- type BadWordsPage
- type Exporter
- func (exporter Exporter) Delete() (err error)
- func (exporter Exporter) GlobalWords() (word2Occurencies *WikiWords, err error)
- func (exporter Exporter) PageBadwords(ctx context.Context, fail func(error) error) chan BadWordsPage
- func (exporter Exporter) Pages(ctx context.Context, fail func(error) error) chan PageTFIDF
- func (exporter Exporter) Topics(ctx context.Context, fail func(error) error) chan Topic
- type Limits
- type PageTFIDF
- type Topic
- type WikiWords
Functions ¶
func CheckAvailableLanguage ¶
func CheckAvailableLanguage(lang string) error
CheckAvailableLanguage checks whether a language is handled.
Types ¶
type BadWordsPage ¶
BadWordsPage represents a single page with its badwords data: PageID, TopicID, the absolute number of badwords in the page, the relative number of badwords in the page (tot/abs), and the list of badwords in the following format: "badWord": number_of_occurrences
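As a rough illustration of the record described above, the sketch below models such a page. The type `badWordsPage`, its field names, and its JSON tags are hypothetical, since the package's actual definition is not shown here. It also assumes that the relative value is the absolute count divided by the page's total word count, which this documentation does not confirm.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// badWordsPage is a hypothetical stand-in for BadWordsPage; the field
// names and JSON tags are assumptions, not the package's definition.
type badWordsPage struct {
	PageID   uint32         `json:"pageId"`
	TopicID  uint32         `json:"topicId"`
	Abs      int            `json:"abs"`      // absolute number of badwords
	Rel      float64        `json:"rel"`      // relative number of badwords
	BadWords map[string]int `json:"badwords"` // "badWord": number_of_occurrences
}

// newBadWordsPage derives Abs by summing the per-word counts and Rel by
// dividing Abs by totalWords (an assumed definition of "relative").
func newBadWordsPage(pageID, topicID uint32, totalWords int, counts map[string]int) badWordsPage {
	abs := 0
	for _, c := range counts {
		abs += c
	}
	rel := 0.0
	if totalWords > 0 {
		rel = float64(abs) / float64(totalWords)
	}
	return badWordsPage{PageID: pageID, TopicID: topicID, Abs: abs, Rel: rel, BadWords: counts}
}

func main() {
	p := newBadWordsPage(42, 7, 10, map[string]int{"foo": 2, "bar": 3})
	out, err := json.Marshal(p)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Storing the per-word counts in a map keeps the "badWord": count format when the record is serialized to JSON.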
type Exporter ¶
type Exporter struct {
	ResultDir, Lang string
}
Exporter represents the TFIDF data computed by New.
func From ¶
From returns an exporter built from existing data. It checks that the files to be exported exist; if one is missing, it returns an error naming the missing file.
func New ¶
func New(ctx context.Context, lang string, in <-chan wikibrief.EvolvingPage, resultDir string, limits Limits, testMode bool) (exporter Exporter, err error)
New ingests, processes and stores the desired Wikipedia dump from the channel.
func (Exporter) GlobalWords ¶
func (exporter Exporter) GlobalWords() (word2Occurencies *WikiWords, err error)
GlobalWords returns a dictionary with the top N words of GlobalWord in the following format: "word": occurrences
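The idea of a "top N words" dictionary can be sketched as follows. This is not the package's implementation: the helper `topN` is hypothetical, a plain `map[string]int` stands in for WikiWords (whose definition is not shown here), and breaking ties alphabetically is an assumption made only to keep the result deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// topN returns the n most frequent words of counts as a "word": occurrences
// map. Ties are broken alphabetically (an assumption for determinism).
func topN(counts map[string]int, n int) map[string]int {
	words := make([]string, 0, len(counts))
	for w := range counts {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool {
		if counts[words[i]] != counts[words[j]] {
			return counts[words[i]] > counts[words[j]]
		}
		return words[i] < words[j]
	})
	// Clamp n to the valid range before slicing.
	if n < 0 {
		n = 0
	}
	if n > len(words) {
		n = len(words)
	}
	top := make(map[string]int, n)
	for _, w := range words[:n] {
		top[w] = counts[w]
	}
	return top
}

func main() {
	counts := map[string]int{"wiki": 12, "page": 7, "edit": 7, "the": 40}
	fmt.Println(topN(counts, 2))
}
```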
func (Exporter) PageBadwords ¶
func (exporter Exporter) PageBadwords(ctx context.Context, fail func(error) error) chan BadWordsPage
PageBadwords returns a channel with the data of the BadWords report. Pages sent on the channel are in descending order.
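A minimal sketch of how a channel-returning method of this shape might be consumed is given below. Everything in it is a stand-in: `page`, `pagesDescending`, and `run` are hypothetical, the sort key (the absolute badword count) is an assumption, and the way `fail` is used (recording the first error and cancelling the context) is a guess at its intended role rather than documented behavior.

```go
package main

import (
	"context"
	"fmt"
	"sort"
)

// page is a hypothetical, reduced stand-in for BadWordsPage.
type page struct {
	PageID uint32
	Abs    int // absolute number of badwords (assumed sort key)
}

// pagesDescending mimics the shape of PageBadwords: it returns a channel,
// sends pages in descending order, and closes the channel when done.
// Reporting cancellation through fail is an assumption about its role.
func pagesDescending(ctx context.Context, fail func(error) error, in []page) chan page {
	out := make(chan page)
	go func() {
		defer close(out)
		sorted := append([]page(nil), in...) // copy so the caller's slice is untouched
		sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Abs > sorted[j].Abs })
		for _, p := range sorted {
			select {
			case out <- p:
			case <-ctx.Done():
				_ = fail(ctx.Err())
				return
			}
		}
	}()
	return out
}

// run drains the channel and returns the page IDs in arrival order,
// together with the first error passed to fail, if any.
func run(in []page) ([]uint32, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var firstErr error
	fail := func(err error) error {
		if firstErr == nil {
			firstErr = err
			cancel() // stop the producer on the first error
		}
		return firstErr
	}

	var ids []uint32
	for p := range pagesDescending(ctx, fail, in) {
		ids = append(ids, p.PageID)
	}
	return ids, firstErr
}

func main() {
	ids, err := run([]page{{PageID: 1, Abs: 2}, {PageID: 2, Abs: 9}, {PageID: 3, Abs: 5}})
	fmt.Println(ids, err)
}
```

Ranging over the channel until it is closed, and checking the recorded error afterwards, keeps the consumer loop free of per-item error handling.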
type PageTFIDF ¶
PageTFIDF represents a single page with its data: ID, TopicID, total number of words, and a dictionary with the top N words in the following format: "word": tfidf_value
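To make the "word": tfidf_value format concrete, the sketch below computes per-page scores with a textbook TF-IDF weighting. The function `tfidf` and its parameters are hypothetical, and the formula (term frequency times the natural logarithm of total pages over pages containing the word) is an assumption; the weighting the package actually applies is not documented here.

```go
package main

import (
	"fmt"
	"math"
)

// tfidf scores each word of one page as
//
//	(count / totalWords) * ln(totalPages / pagesContaining[word]).
//
// This is a textbook weighting used for illustration only; words with no
// document frequency are skipped.
func tfidf(counts map[string]int, totalWords int, pagesContaining map[string]int, totalPages int) map[string]float64 {
	scores := make(map[string]float64, len(counts))
	for w, c := range counts {
		df := pagesContaining[w]
		if df == 0 || totalWords == 0 {
			continue
		}
		tf := float64(c) / float64(totalWords)
		idf := math.Log(float64(totalPages) / float64(df))
		scores[w] = tf * idf
	}
	return scores
}

func main() {
	counts := map[string]int{"the": 5, "golang": 5}
	pagesContaining := map[string]int{"the": 4, "golang": 1}
	for word, score := range tfidf(counts, 10, pagesContaining, 4) {
		fmt.Printf("%q: %.4f\n", word, score)
	}
}
```

Under this weighting a word that appears on every page scores zero, which is why very common words drop out of a page's top N.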