Documentation ¶
Overview ¶
wstats parses Wikimedia dump files on the fly into word frequency lists.
It is NOT ready for production use, so use at your own risk.
The program prints running progress and basic statistics to standard error. A complete word frequency list is printed to standard output (limited by min freq, if set).
Usage:
$ go run wstats.go <flags> <wikipedia dump path (file or url, xml or xml.bz2)>
Cmd line flags:
-pl int
	page limit: limit the number of pages to read (optional, default = unset)
-mf int
	min freq: lower limit for word frequencies to be printed (optional, default = 2)
-h(elp)
	help: print help message
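For reference, the flags above could be declared with the standard library flag package roughly as follows. This is a sketch only; variable names are illustrative and not taken from wstats.

	package main

	import (
		"flag"
		"fmt"
		"os"
	)

	func main() {
		// -pl and -mf as described above; 0 stands in for "unset".
		pageLimit := flag.Int("pl", 0, "page limit: limit number of pages to read (0 = unset)")
		minFreq := flag.Int("mf", 2, "min freq: lower limit for word frequencies to be printed")
		flag.Parse()

		if flag.NArg() != 1 {
			fmt.Fprintln(os.Stderr, "usage: wstats <flags> <wikipedia dump path (file or url, xml or xml.bz2)>")
			os.Exit(1)
		}
		fmt.Fprintf(os.Stderr, "dump=%s pl=%d mf=%d\n", flag.Arg(0), *pageLimit, *minFreq)
	}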
Example usage:
$ go run wstats.go -pl 10000 https://dumps.wikimedia.org/svwiki/latest/svwiki-latest-pages-articles-multistream.xml.bz2
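To illustrate the kind of on-the-fly parse described in the overview, here is a minimal sketch that streams a local .xml.bz2 dump, decodes each <page> element, and counts word frequencies. Element and variable names are assumptions for illustration, not wstats's actual code.

	package main

	import (
		"compress/bzip2"
		"encoding/xml"
		"fmt"
		"os"
		"strings"
	)

	type page struct {
		Title string `xml:"title"`
		Text  string `xml:"revision>text"`
	}

	func main() {
		f, err := os.Open(os.Args[1]) // path to an .xml.bz2 dump
		if err != nil {
			panic(err)
		}
		defer f.Close()

		dec := xml.NewDecoder(bzip2.NewReader(f))
		freq := map[string]int{}
		pages := 0

		for {
			tok, err := dec.Token()
			if err != nil {
				break // io.EOF or a decode error ends the stream
			}
			se, ok := tok.(xml.StartElement)
			if !ok || se.Name.Local != "page" {
				continue
			}
			var p page
			if err := dec.DecodeElement(&p, &se); err != nil {
				continue
			}
			pages++
			for _, w := range strings.Fields(strings.ToLower(p.Text)) {
				freq[w]++
			}
		}

		// Progress/statistics to stderr, frequency list to stdout.
		fmt.Fprintf(os.Stderr, "parsed %d pages\n", pages)
		for w, n := range freq {
			if n >= 2 { // min freq
				fmt.Printf("%d\t%s\n", n, w)
			}
		}
	}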
Notes ¶
Bugs ¶
Words should be cleaned/checked for junk characters.
More tests should be added, not just for smaller functions but also for the overall parsing functionality.
Specifically, tests are needed to detect when the XML parsing does not find any pages (known to happen when the case of the field names is changed).
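A hedged sketch of the kind of test asked for above: parse a tiny XML fixture and fail if no pages are found. countPages is a hypothetical helper standing in for the package's real parsing entry point.

	package wstats

	import (
		"strings"
		"testing"
	)

	const fixture = `<mediawiki>
	  <page>
	    <title>Test</title>
	    <revision><text>hello world hello</text></revision>
	  </page>
	</mediawiki>`

	func TestParsingFindsPages(t *testing.T) {
		n, err := countPages(strings.NewReader(fixture)) // hypothetical parsing entry point
		if err != nil {
			t.Fatalf("parse failed: %v", err)
		}
		if n == 0 {
			t.Fatal("XML parsing found no pages; check the case of the field/element names")
		}
	}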