wikitfidf

Published: Jul 27, 2022 License: AGPL-3.0 Imports: 21 Imported by: 2

README

Wikipedia TFIDF Analyzer

Negapedia TFIDF Analyzer analyzes Wikipedia's dumps and performs statistical analysis on revert text.
The output data can be used to clarify the theme of the conflicts inside a Wikipedia page.

Handled languages

english, arabic, danish, dutch, finnish, french, german, greek, hungarian, indonesian, italian, kazakh, nepali, portuguese, romanian, russian, spanish, swedish, turkish, armenian, azerbaijani, basque, bengali, bulgarian, catalan, chinese, croatian, czech, galician, hebrew, hindi, irish, japanese, korean, latvian, lithuanian, marathi, persian, polish, slovak, thai, ukrainian, urdu, simple-english
This data comes from Negapedia/nltk.

Badwords handled languages

english, arabic, danish, dutch, finnish, french, german, hungarian, italian, portuguese, spanish, swedish, chinese, czech, hindi, japanese, korean, persian, polish, thai, simple-english
This data comes from Negapedia/badwords.

Output files
  • GlobalPagesTFIDF.json: contains, for every page, the list of words associated with their absolute frequency and tf-idf value;
  • GlobalPagesTFIDF_topNwords.json: like GlobalPagesTFIDF.json, but only the N most important words (in terms of tf-idf value) are reported;
  • GlobalWords.json: contains all the analyzed wiki's words associated with their absolute frequency;
  • GlobalTopic.json: contains all the words in every topic (using Negapedia topics);
  • BadWordsReport.json: contains, for every page that has them, a list of badwords associated with their absolute frequency.
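A sketch of reading one of these files from Go, assuming GlobalWords.json is a flat word-to-frequency JSON object (the exact on-disk layout may differ; the sample data is invented):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseGlobalWords decodes a GlobalWords.json-style payload, assumed
// here to be a flat map from word to absolute frequency.
func parseGlobalWords(raw []byte) (map[string]uint32, error) {
	var words map[string]uint32
	if err := json.Unmarshal(raw, &words); err != nil {
		return nil, err
	}
	return words, nil
}

func main() {
	// Hypothetical excerpt of GlobalWords.json.
	raw := []byte(`{"war": 1542, "peace": 873, "treaty": 210}`)

	words, err := parseGlobalWords(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("distinct words:", len(words))
	fmt.Println("occurrences of \"war\":", words["war"])
}
```

For the real files, prefer the Exporter methods documented below over decoding the JSON by hand.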

The minimum requirements needed to run the project in reasonable time are:

  • At least 4 cores-8 threads CPU;
  • At least 16GB of RAM (required);
  • At least 300GB of disk space.

However, the recommended requirements are:

  • 32GB of RAM or more (highly recommended).

Usage

Building docker image

docker build -t <image_name> .
from the root of the repository directory.

Running docker image

docker run -d -v <path_on_fs_where_to_save_results>:<container_results_path> <image_name>
example:
docker run -d -v /path/2/out/dir:/data my_image

Execution flags
  • -lang: wiki language;
  • -d: container results directory;
  • -s: starting date of reverts to consider;
  • -e: ending date of reverts to consider;
  • -specialList: special page list to consider;
  • -rev: number of reverts to consider;
  • -topPages: number of top words per page to consider;
  • -topWords: number of top global words to consider;
  • -topTopic: number of top words per topic to consider;
  • -delete: if true, the results directory will be deleted after compression (default: true);
  • -test: if true, logs are shown and a single dump is processed.


example:
docker run -v /path/2/out/dir:/data wikitfidf dothething -lang it

Installation

The Go package can be installed with:
go get github.com/negapedia/wikitfidf
and the docker image can be downloaded with:
docker pull negapedia/wikitfidf

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CheckAvailableLanguage

func CheckAvailableLanguage(lang string) error

CheckAvailableLanguage checks whether a language is handled.
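A self-contained sketch of how such a check behaves; this is a stand-in, not the package's internal table, and the language codes shown are assumed (the -lang example below uses "it"):

```go
package main

import "fmt"

// handled mirrors a small subset of the handled languages listed
// above; the real package keeps its own internal table.
var handled = map[string]bool{"en": true, "it": true, "de": true}

// checkAvailableLanguage is a stand-in for
// wikitfidf.CheckAvailableLanguage: it returns a non-nil error for
// languages that are not handled.
func checkAvailableLanguage(lang string) error {
	if !handled[lang] {
		return fmt.Errorf("language %q is not handled", lang)
	}
	return nil
}

func main() {
	for _, lang := range []string{"it", "xx"} {
		if err := checkAvailableLanguage(lang); err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(lang, "is handled")
	}
}
```

Calling the real CheckAvailableLanguage before New or From lets a caller fail fast instead of discovering an unsupported language mid-ingestion.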

Types

type BadWordsPage

type BadWordsPage struct {
	PageID uint32
	Abs    uint32
	Rel    float64
	BadW   map[string]uint32
}

BadWordsPage represents a single page with badwords data: PageID, the absolute number of badwords in the page, the relative number of badwords in the page (tot/abs), and the list of badwords in the following format: "badWord": number_of_occurrences
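A small sketch of how the Abs field relates to BadW: summing the per-word counts yields the absolute badword count. The helper name and the sample data are invented for illustration.

```go
package main

import "fmt"

// BadWordsPage as defined above.
type BadWordsPage struct {
	PageID uint32
	Abs    uint32
	Rel    float64
	BadW   map[string]uint32
}

// absCount sums the per-word occurrences in a BadW map, which is what
// the Abs field of BadWordsPage holds.
func absCount(badw map[string]uint32) uint32 {
	var tot uint32
	for _, n := range badw {
		tot += n
	}
	return tot
}

func main() {
	p := BadWordsPage{
		PageID: 42,
		BadW:   map[string]uint32{"foo": 3, "bar": 1},
	}
	p.Abs = absCount(p.BadW)
	fmt.Printf("page %d: %d badwords\n", p.PageID, p.Abs)
}
```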

type Exporter

type Exporter struct {
	ResultDir, Lang string
}

Exporter represents the TFIDF data calculated from New.

func From

func From(lang, resultDir string) (exporter Exporter, err error)

From returns an exporter built from existing data; it checks whether the files that have to be exported exist. If not, it returns an error naming the missing file.

func New

func New(ctx context.Context, lang string, in <-chan wikibrief.EvolvingPage, resultDir string, limits Limits, testMode bool) (exporter Exporter, err error)

New ingests, processes and stores the desired Wikipedia dump from the channel.

func (Exporter) Delete

func (exporter Exporter) Delete() (err error)

Delete deletes files from the result directory.

func (Exporter) GlobalWords

func (exporter Exporter) GlobalWords() (word2Occurencies *WikiWords, err error)

GlobalWords returns a dictionary with the top N words of GlobalWords in the following format: "word": occurrences
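Once a WikiWords value is in hand, ranking its map by frequency is a common next step. A sketch, using the WikiWords type defined below in this document; the topN helper and sample counts are invented:

```go
package main

import (
	"fmt"
	"sort"
)

// WikiWords as defined below in this document.
type WikiWords struct {
	TotalWords  uint32
	Words2Occur map[string]uint32
}

// topN returns the n most frequent words, most frequent first.
func topN(w WikiWords, n int) []string {
	words := make([]string, 0, len(w.Words2Occur))
	for word := range w.Words2Occur {
		words = append(words, word)
	}
	sort.Slice(words, func(i, j int) bool {
		return w.Words2Occur[words[i]] > w.Words2Occur[words[j]]
	})
	if n < len(words) {
		words = words[:n]
	}
	return words
}

func main() {
	w := WikiWords{
		TotalWords:  6,
		Words2Occur: map[string]uint32{"war": 3, "edit": 2, "talk": 1},
	}
	fmt.Println(topN(w, 2))
}
```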

func (Exporter) PageBadwords

func (exporter Exporter) PageBadwords(ctx context.Context, fail func(error) error) chan BadWordsPage

PageBadwords returns a channel with the data of the BadWords Report; pages sent on the channel are in descending order.

func (Exporter) Pages

func (exporter Exporter) Pages(ctx context.Context, fail func(error) error) chan PageTFIDF

Pages returns a channel with the data of PageTFIDF (top N words per page); pages sent on the channel are in ascending order.

func (Exporter) Topics

func (exporter Exporter) Topics(ctx context.Context, fail func(error) error) chan Topic

Topics returns a channel with the data of GlobalTopic (top N words per topic).

type Limits

type Limits struct {
	WordsPages  int
	GlobalWords int
	TopicWords  int

	Reverts int
}

Limits represents the limits at which data is cut off.

func ReasonableLimits

func ReasonableLimits() Limits

ReasonableLimits returns reasonable limits.

type PageTFIDF

type PageTFIDF struct {
	ID         uint32
	TotWords   uint32
	Word2TFIDF map[string]float64
}

PageTFIDF represents a single page with its data: ID, total number of words, and a dictionary with the top N words in the following format: "word": tfidf_value

type Topic

type Topic struct {
	TopicID  uint32
	TotWords uint32
	Words    map[string]uint32
}

Topic represents a single topic with its TopicID and the list of top N words in it, in the following format: "word": number_of_occurrences

type WikiWords

type WikiWords struct {
	TotalWords  uint32
	Words2Occur map[string]uint32
}

WikiWords represents the top N words in Wikipedia together with the total number of words.

Directories

Path Synopsis
cmd
internal
