wikipedia

module

v0.3.0 Latest Latest Go to latest Published: Aug 1, 2024 License: Apache-2.0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/willbeason/wikipedia

Links

Open Source Insights

README ¶

willbeason/wikipedia

This repository is a collection of code I use for analyzing wikipedia. The main idea here is to document methods for various analysis I've done.

Generally, code starts in commands under cmd/ and gets migrated to pkg/ when I figure out how to make it more general.

I make no guarantees on the interoperability of the code here, or that output will remain consistent with time. For now these are fun, non-professional projects.

Many of these commands will change with time as I learn more about the Wikipedia corpus. This is a large, complex set of data and there aren't (yet) good resources on the types of things these libraries need to account for.

The Data

Commands in this repository mainly operate on either the regular pages-articles-multistream Wikipedia dumps, or Badger databases constructed from the commands in this repository. So (for now) you can't really look at individual articles unless you know their Wikipedia Page ID (You can get this by clicking "Page information" on any page on Wikipedia.org).

The data format is optimized for streaming the entirety of Wikipedia. I've offloaded concerns about indexing articles or how large to make files to Badger as this quickly became a headache I didn't want to deal with.

The Commands

A brief description of the commands so far, roughly in the order you'd use them.

extract-wikipedia operates on the pages-articles-multistream dump of Wikipedia, extracting each element into its own file. As of this writing, extraction results in about 213,000 files.

clean-wikipedia operates on the output of extract-wikipedia, removing noisy elements present in Wikipedia such as tables or various non-shown elements. This results in an intermediate set of data which is essentially "the text of Wikipedia as a user would see it". I keep this as I'm still updating normalize-wikipedia, and extract-wikipedia takes annoyingly long so it's worth saving the output of this intermediate step.

normalize-wikipedia operates on the output of clean-wikipedia, performing operations like making everything lowercase, identifying numbers and dates, removing tables, and the like. This is the data set I directly operate on.

Warnings

Some commands just fail - they'll panic in bader's DB.Orchestrate. I don't have a good explanation for this. Either there's a bug in Badger when streaming data in 32 threads, or I'm using something wrong. In any case, rerunning the command several times seems to work fine. Each time the process will make more progress, and eventually it succeeds. For now commands take on the order of a couple of minutes for me to rerun, so it isn't worth my time to debug.

Memory consumption is essentially unbounded since I haven't messed with any of the Badger memory settings. This is fine for my machine since I have 64 GB of memory and can hold most (or in some cases all) of Wikipedia in memory at once, but you may experience slowdowns if you don't modify the options set in pkg/db/process.go.

Directories ¶

Path	Synopsis
cmd
clean-wikipedia
extract
first-links
ngrams
normalize-wikipedia
pagerank
reference-network
reference-network-2
title-index
view
wikopticon
pkg
charts
classify
config
db
documents
documents/tagtree
environment
flags
graphs
graphs/centrality
indexes
jobs
nlp
ordinality
pages
protos

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL