wvcgen

command module
v0.1.0
Published: Feb 27, 2016 License: BSD-3-Clause Imports: 7 Imported by: 0

README


wvcgen

This is a Wikipedia vandalism dataset generator.

This repository does not provide the full Wikipedia vandalism dataset published by uni-weimar.de, but it provides the scripts to work with that dataset, for example diff-ing revisions, creating a new dataset from them, and computing the features.

The generator is written in Go.

How To Use

Installation
  • Install Go from a binary distribution or using your OS package manager.

  • Download this package,

    $ go get github.com/shuLhan/wvcgen
    
PAN-WVC-2010
  • Change the working directory to this package (located in $GOPATH/src/github.com/shuLhan/wvcgen)
  • Download the full dataset from the uni-weimar.de site
  • Extract the zip file
  • Rename the extracted directory from pan-wikipedia-vandalism-corpus-2010 to pan-wvc-2010
  • Move all files in pan-wvc-2010/article-revisions/partXX/ to pan-wvc-2010/revisions (see the commands after this list)
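
For the last two steps, a minimal sequence of shell commands might look like the following; the part* glob is an assumption about how the partXX directories are laid out inside the archive:

    $ mv pan-wikipedia-vandalism-corpus-2010 pan-wvc-2010
    $ mkdir -p pan-wvc-2010/revisions
    $ mv pan-wvc-2010/article-revisions/part*/* pan-wvc-2010/revisions/
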
Creating Unified Dataset
  • Change the working directory to cmd/merge-wvc2010

  • Run the main.go script to merge and create the new dataset

    $ go run main.go
    

    which will create the file merge_edits_golds.dat, combining pan-wvc-2010/edits.csv with pan-wvc-2010/gold-annotations.csv and adding two new fields. The attributes in the unified dataset are,

    • editid
    • class
    • oldrevisionid
    • newrevisionid
    • edittime
    • editor
    • articletitle
    • editcomment
    • deletions
    • additions

The new fields are deletions and additions, which contain the diff of the old revision against the new revision at the word level. The deletions attribute contains the text deleted from the old revision, and the additions attribute contains the text inserted into the new revision.
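
To illustrate the idea only (a sketch, not the package's actual diff implementation), a word-level diff can be derived from a longest common subsequence (LCS) of the two tokenized revisions: old words outside the common subsequence become deletions, and new words outside it become additions.

package main

import (
	"fmt"
	"strings"
)

// wordDiff returns the words deleted from oldText and inserted into
// newText, based on a longest common subsequence of the word slices.
func wordDiff(oldText, newText string) (deletions, additions []string) {
	a := strings.Fields(oldText)
	b := strings.Fields(newText)

	// lcs[i][j] holds the LCS length of a[i:] and b[j:].
	lcs := make([][]int, len(a)+1)
	for i := range lcs {
		lcs[i] = make([]int, len(b)+1)
	}
	for i := len(a) - 1; i >= 0; i-- {
		for j := len(b) - 1; j >= 0; j-- {
			switch {
			case a[i] == b[j]:
				lcs[i][j] = lcs[i+1][j+1] + 1
			case lcs[i+1][j] >= lcs[i][j+1]:
				lcs[i][j] = lcs[i+1][j]
			default:
				lcs[i][j] = lcs[i][j+1]
			}
		}
	}

	// Words that are not part of the common subsequence are diffs.
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			i++
			j++
		case lcs[i+1][j] >= lcs[i][j+1]:
			deletions = append(deletions, a[i])
			i++
		default:
			additions = append(additions, b[j])
			j++
		}
	}
	deletions = append(deletions, a[i:]...)
	additions = append(additions, b[j:]...)
	return deletions, additions
}

func main() {
	del, add := wordDiff("the quick brown fox", "the lazy brown dog")
	fmt.Println(del, add) // prints: [quick fox] [lazy dog]
}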

One can customize the dataset output by editing the merge_edits_gold.dsv configuration and running the merge script again.

Cleaning Wiki Revisions

Cleaning the wiki text revisions, which are located in the revisions directory, is required to speed up computing the features.

  • Change the working directory to cmd/wikiclean

  • Create the directory where the output of cleaning will be located,

    $ mkdir -p ../../pan-wvc-2010/revisions_clean
    
  • Run the main.go script to clean the revision files

    $ go run main.go ../../pan-wvc-2010/revisions ../../pan-wvc-2010/revisions_clean
    

    The first parameter is the input location containing the revision text to be cleaned up; the second parameter is the location where the cleaned revisions will be written.

Generating Features

After one of the PAN WVC datasets has been merged and cleaned up, one can compute the vandalism features by running the main.go script in the root of the repository.

$ go run main.go

Feature values will be written to the file features.dat.

One can customize the input and which features should be computed by editing the file features.dsv, which contains,

  • The Input option, which points to the input file (the unified dataset),
  • The InputMetadata option, which contains the fields in the input file,
  • The Output option, which points to the file where the result of the feature computation will be written,
  • The OutputMetadata option, which contains the list of features that will be computed. The feature names are described below.
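
Purely as an illustration of how those four options fit together (the features.dsv shipped in the repository is the authoritative reference, and the exact schema is defined by the dsv package; every value below is made up), such a configuration might look roughly like:

{
	"Input": "pan-wvc-2010/merge_edits_golds.dat",
	"InputMetadata": [
		{ "Name": "editid" },
		{ "Name": "class" }
	],
	"Output": "features.dat",
	"OutputMetadata": [
		{ "Name": "anonim" },
		{ "Name": "comment_length" }
	]
}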

List of Features

Feature implementations are located in the directory feature.

Metadata
  • "anonim": give a value '1' if an editor is anonymous or '0' otherwise.
  • "comment_length": length of character in the comment supplied with an edit.
  • "size_increment": compute the size different between inserted text minus deletion.
  • "size_ratio": length of new revision divided by length of old revision.
Text
  • "upper_lower_ratio": ratio of uppercase to lowercase in inserted text.
  • "upper_to_all_ratio": ratio of uppercase to all character in inserted text.
  • "digit_ratio": ratio of digit to all character in inserted text.
  • "non_alnum_ratio": ratio of non alpha-numeric to all character in inserted text.
  • "char_diversity": length of inserted text power of (1 / number of unique character).
  • "char_distribution_insert": the distribution of character using Kullback-Leibler divergence algorithm.
  • "compress_rate": compute the compression rate of inserted text using LZW.
  • "good_token": compute number of good or known Wikipedia token in inserted text.
  • "term_frequency": compute frequency of inserted word in new revision.
  • "longest_word": the length of longest word in inserted text.
  • "longest_char_seq": length of the longest sequence of the same character in inserted text.
Language
  • "words_vulgar_frequency" compute frequency of vulgar words in inserted text.
  • "words_vulgar_impact" compute increased of vulgar words in new revision.
  • "words_pronoun_frequency" compute frequency of colloquial and slang pronoun in inserted text.
  • "words_pronoun_impact" compute increased of pronoun words in new revision.
  • "words_bias_frequency" compute frequency of colloquial words with high bias in inserted text.
  • "words_bias_impact" compute increased of biased words in new revision.
  • "words_sex_frequency" compute frequency of sex-related, non-vulgar words in inserted text.
  • "words_sex_impact" compute increased of sex-related words in new revision.
  • "words_bad_frequency" compute frequency of bad words, colloquial words or bad writing words.
  • "words_bad_impact" compute increased of bad words in new revision.
  • "words_all_frequency" compute frequency of vulgar, pronoun, bias, sex, and bad words in inserted text.
  • "words_all_impact" compute the increased of vulgar, pronoun, bias, sex, and bad words in new revision.
Misc
  • "class": convert the classification from text to numeric. The "regular" class will become 0 and the "vandalism" will become 1.

Extending The Feature

The feature directory provides the implementation of the features above, where one can see how each feature works. One must be familiar with the Go language to work with it.

There is also a template (named template.go) that can be copied to create a new feature.

In this section we will see how to create a new feature that computes the length of the inserted text, with the feature name insert_length.

First, copy feature/template.go to a new name, for example feature/insert_length.go.

Create a new type using Feature as the base type,

type InsertLength Feature

and then register it in the global features list, including the feature value type and the feature name that will be used later,

func init() {
	Register(&InsertLength{}, tabula.TInteger, "insert_length")
}

Create a Compute function on our InsertLength type, where the first parameter is the input dataset,

func (lng *InsertLength) Compute(dataset tabula.Dataset) {
	// Computation algorithm: fill the feature column with the
	// length of the inserted text for each record.
}

Test your feature by adding it to main_test.go,

func TestInsertLength(t *testing.T) {
	main.Generate("insert_length", fInputDsv)
}

and run the test with,

$ go test -v -run TestInsertLength -timeout 40m

If it works as intended, add it to features.dsv.

Documentation

There is no documentation for this package.

Directories

Path Synopsis
clean Package clean contains functions to clean wiki text, including removing wiki syntax, links, and section numbering.
cmd
feature Package feature contains the list of feature implementations to compute vandalism in the Wikipedia dataset.
reader Package reader extends the dsv reader by adding an option to set the revision directory that will be used by package revision.
revision Package revision provides the module for working with dump files of Wikipedia revisions.
