wikitfidf

Published: Jul 27, 2022 License: AGPL-3.0 Imports: 21 Imported by: 2

README

Wikipedia TFIDF Analyzer

Negapedia TFIDF Analyzer analyzes Wikipedia's dumps and performs statistical analysis on revert text.
The output data can be used to clarify the theme of the conflicts inside a Wikipedia page.

Handled languages

english, arabic, danish, dutch, finnish, french, german, greek, hungarian, indonesian, italian, kazakh, nepali, portuguese, romanian, russian, spanish, swedish, turkish, armenian, azerbaijani, basque, bengali, bulgarian, catalan, chinese, croatian, czech, galician, hebrew, hindi, irish, japanese, korean, latvian, lithuanian, marathi, persian, polish, slovak, thai, ukrainian, urdu, simple-english
This data comes from Negapedia/nltk.

Badwords handled languages

english, arabic, danish, dutch, finnish, french, german, hungarian, italian, portuguese, spanish, swedish, chinese, czech, hindi, japanese, korean, persian, polish, thai, simple-english
This data comes from Negapedia/badwords.

Output files
  • GlobalPagesTFIDF.json: contains, for every page, the list of words associated with their absolute frequency and tf-idf value;
  • GlobalPagesTFIDF_topNwords.json: like GlobalPagesTFIDF.json, but only the N most important words (in terms of tf-idf value) are reported;
  • GlobalWords.json: contains all the analyzed wiki's words associated with their absolute frequency;
  • GlobalTopic.json: contains all the words in every topic (using Negapedia topics);
  • BadWordsReport.json: contains, for every page that has them, a list of badwords associated with their absolute frequency.
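A sketch of reading one of these files from Go, assuming GlobalWords.json is a flat word-to-frequency JSON object (the exact on-disk layout may differ; the sample data is invented):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseGlobalWords decodes a GlobalWords.json-style payload, assumed
// here to be a flat map from word to absolute frequency.
func parseGlobalWords(raw []byte) (map[string]uint32, error) {
	var words map[string]uint32
	if err := json.Unmarshal(raw, &words); err != nil {
		return nil, err
	}
	return words, nil
}

func main() {
	// Hypothetical excerpt of GlobalWords.json.
	raw := []byte(`{"war": 1542, "peace": 873, "treaty": 210}`)

	words, err := parseGlobalWords(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("distinct words:", len(words))
	fmt.Println("occurrences of \"war\":", words["war"])
}
```

For the real files, prefer the Exporter methods documented below over decoding the JSON by hand.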

The minimum requirements needed to run the project in reasonable time are:

  • At least 4 cores-8 threads CPU;
  • At least 16GB of RAM (required);
  • At least 300GB of disk space.

However, the recommended requirements are:

  • 32GB of RAM or more (highly recommended).

Usage

Building docker image

docker build -t <image_name> .
from the root of the repository directory.

Running docker image

docker run -d -v <path_on_fs_where_to_save_results>:<container_results_path> <image_name>
example:
docker run -d -v /path/2/out/dir:/data my_image

Execution flags
  • -lang: wiki language;
  • -d: container results directory;
  • -s: starting date of reverts to consider;
  • -e: ending date of reverts to consider;
  • -specialList: special page list to consider;
  • -rev: number of reverts to consider;
  • -topPages: number of top words per page to consider;
  • -topWords: number of top global words to consider;
  • -topTopic: number of top words per topic to consider;
  • -delete: if true, the results directory will be deleted after compression (default: true);
  • -test: if true, logs are shown and a single dump is processed.


example:
docker run -v /path/2/out/dir:/data wikitfidf dothething -lang it

Installation

The Go package can be installed with:
go get github.com/negapedia/wikitfidf
and the docker image can be downloaded with:
docker pull negapedia/wikitfidf

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CheckAvailableLanguage

func CheckAvailableLanguage(lang string) error

CheckAvailableLanguage checks whether a language is handled.
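A self-contained sketch of how such a check behaves; this is a stand-in, not the package's internal table, and the language codes shown are assumed (the -lang example below uses "it"):

```go
package main

import "fmt"

// handled mirrors a small subset of the handled languages listed
// above; the real package keeps its own internal table.
var handled = map[string]bool{"en": true, "it": true, "de": true}

// checkAvailableLanguage is a stand-in for
// wikitfidf.CheckAvailableLanguage: it returns a non-nil error for
// languages that are not handled.
func checkAvailableLanguage(lang string) error {
	if !handled[lang] {
		return fmt.Errorf("language %q is not handled", lang)
	}
	return nil
}

func main() {
	for _, lang := range []string{"it", "xx"} {
		if err := checkAvailableLanguage(lang); err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(lang, "is handled")
	}
}
```

Calling the real CheckAvailableLanguage before New or From lets a caller fail fast instead of discovering an unsupported language mid-ingestion.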

Types

type BadWordsPage

type BadWordsPage struct {
	PageID uint32
	Abs    uint32
	Rel    float64
	BadW   map[string]uint32
}

BadWordsPage represents a single page with badwords data: PageID, the absolute number of badwords in the page, the relative number of badwords in the page (tot/abs), and the list of badwords in the following format: "badWord": number_of_occurrences
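A small sketch of how the Abs field relates to BadW: summing the per-word counts yields the absolute badword count. The helper name and the sample data are invented for illustration.

```go
package main

import "fmt"

// BadWordsPage as defined above.
type BadWordsPage struct {
	PageID uint32
	Abs    uint32
	Rel    float64
	BadW   map[string]uint32
}

// absCount sums the per-word occurrences in a BadW map, which is what
// the Abs field of BadWordsPage holds.
func absCount(badw map[string]uint32) uint32 {
	var tot uint32
	for _, n := range badw {
		tot += n
	}
	return tot
}

func main() {
	p := BadWordsPage{
		PageID: 42,
		BadW:   map[string]uint32{"foo": 3, "bar": 1},
	}
	p.Abs = absCount(p.BadW)
	fmt.Printf("page %d: %d badwords\n", p.PageID, p.Abs)
}
```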

type Exporter

type Exporter struct {
	ResultDir, Lang string
}

Exporter represents the TFIDF data calculated from New.

func From

func From(lang, resultDir string) (exporter Exporter, err error)

From returns an exporter built from existing data; it checks whether the files that have to be exported exist. If not, it returns an error naming the missing file.

func New

func New(ctx context.Context, lang string, in <-chan wikibrief.EvolvingPage, resultDir string, limits Limits, testMode bool) (exporter Exporter, err error)

New ingests, processes and stores the desired Wikipedia dump from the channel.

func (Exporter) Delete

func (exporter Exporter) Delete() (err error)

Delete deletes files from the result directory.

func (Exporter) GlobalWords

func (exporter Exporter) GlobalWords() (word2Occurencies *WikiWords, err error)

GlobalWords returns a dictionary with the top N words of GlobalWords in the following format: "word": occurrences
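Once a WikiWords value is in hand, ranking its map by frequency is a common next step. A sketch, using the WikiWords type defined below in this document; the topN helper and sample counts are invented:

```go
package main

import (
	"fmt"
	"sort"
)

// WikiWords as defined below in this document.
type WikiWords struct {
	TotalWords  uint32
	Words2Occur map[string]uint32
}

// topN returns the n most frequent words, most frequent first.
func topN(w WikiWords, n int) []string {
	words := make([]string, 0, len(w.Words2Occur))
	for word := range w.Words2Occur {
		words = append(words, word)
	}
	sort.Slice(words, func(i, j int) bool {
		return w.Words2Occur[words[i]] > w.Words2Occur[words[j]]
	})
	if n < len(words) {
		words = words[:n]
	}
	return words
}

func main() {
	w := WikiWords{
		TotalWords:  6,
		Words2Occur: map[string]uint32{"war": 3, "edit": 2, "talk": 1},
	}
	fmt.Println(topN(w, 2))
}
```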

func (Exporter) PageBadwords

func (exporter Exporter) PageBadwords(ctx context.Context, fail func(error) error) chan BadWordsPage

PageBadwords returns a channel with the data of the BadWords Report; pages sent on the channel are in descending order.

func (Exporter) Pages

func (exporter Exporter) Pages(ctx context.Context, fail func(error) error) chan PageTFIDF

Pages returns a channel with the data of PageTFIDF (top N words per page); pages sent on the channel are in ascending order.

func (Exporter) Topics

func (exporter Exporter) Topics(ctx context.Context, fail func(error) error) chan Topic

Topics returns a channel with the data of GlobalTopic (top N words per topic).

type Limits

type Limits struct {
	WordsPages  int
	GlobalWords int
	TopicWords  int

	Reverts int
}

Limits represents the limits at which data is cut off.

func ReasonableLimits

func ReasonableLimits() Limits

ReasonableLimits returns reasonable limits.

type PageTFIDF

type PageTFIDF struct {
	ID         uint32
	TotWords   uint32
	Word2TFIDF map[string]float64
}

PageTFIDF represents a single page with its data: ID, total number of words, and a dictionary with the top N words in the following format: "word": tfidf_value

type Topic

type Topic struct {
	TopicID  uint32
	TotWords uint32
	Words    map[string]uint32
}

Topic represents a single topic with its TopicID and the list of top N words in it, in the following format: "word": number_of_occurrences

type WikiWords

type WikiWords struct {
	TotalWords  uint32
	Words2Occur map[string]uint32
}

WikiWords represents the top N words in Wikipedia together with the total number of words.

Directories

Path Synopsis
cmd
internal
