wikiparse

Published: Aug 21, 2015 License: MIT

README

go-wikiparse

If you're like me, then you enjoy playing with lots of textual data and scouring the internet for sources of it.

MediaWiki's dumps are a pretty awesome chunk of it that's fun to work with.

Installation

go get github.com/dustin/go-wikiparse

Usage

The parser takes any io.Reader as a source, assumes it's a complete XML dump, and lets you pull wikiparse.Page objects out of it. The dumps typically arrive as bzip2 files, so my programs open the file and set up a bzip2 reader over it, but you don't need to do that if you want to read from stdin. Here's a complete example that emits page titles from a decompressed stream on stdin:

package main

import (
	"fmt"
	"os"

	"github.com/dustin/go-wikiparse"
)

func main() {
	p, err := wikiparse.NewParser(os.Stdin)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error setting up parser: %v\n", err)
		os.Exit(1)
	}

	for err == nil {
		var page *wikiparse.Page
		page, err = p.Next()
		if err == nil {
			fmt.Println(page.Title)
		}
	}
}

Example invocation:

bzcat enwiki-20120211-pages-articles.xml.bz2 | ./sample
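
If you'd rather open the bzip2 file directly instead of piping through bzcat, here's a minimal sketch using the standard library's compress/bzip2 (the dump filename is just an example):

package main

import (
	"compress/bzip2"
	"fmt"
	"os"

	"github.com/dustin/go-wikiparse"
)

func main() {
	// Example filename; substitute your own dump.
	f, err := os.Open("enwiki-20120211-pages-articles.xml.bz2")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error opening dump: %v\n", err)
		os.Exit(1)
	}
	defer f.Close()

	// Decompress on the fly and hand the stream to the parser.
	p, err := wikiparse.NewParser(bzip2.NewReader(f))
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error setting up parser: %v\n", err)
		os.Exit(1)
	}

	for {
		page, err := p.Next()
		if err != nil {
			break
		}
		fmt.Println(page.Title)
	}
}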

Geographical Information

Because it's interesting to me, I wrote a parser for the WikiProject Geographical coordinates found on many pages. Use it on a page's content to find out whether it's a place. Then go there.
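
For example, here's a sketch that combines the parser with ParseCoords (documented below) to print every page carrying coordinates, reading a decompressed dump from stdin as above:

package main

import (
	"fmt"
	"os"

	"github.com/dustin/go-wikiparse"
)

func main() {
	p, err := wikiparse.NewParser(os.Stdin)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error setting up parser: %v\n", err)
		os.Exit(1)
	}

	for {
		page, err := p.Next()
		if err != nil {
			break
		}
		if len(page.Revisions) == 0 {
			continue
		}
		// ParseCoords returns ErrNoCoordFound for pages without coordinates.
		coord, err := wikiparse.ParseCoords(page.Revisions[0].Text)
		if err != nil {
			continue
		}
		fmt.Printf("%s: lon=%v lat=%v\n", page.Title, coord.Lon, coord.Lat)
	}
}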

Documentation

Overview

Package wikiparse is a library for understanding the Wikipedia XML dump format.

The dumps are available from the wikimedia group here:

http://dumps.wikimedia.org/

In particular, I've worked mostly with the enwiki dumps from here:

http://dumps.wikimedia.org/enwiki/

See the example programs in subpackages for an idea of how I've made use of these things.

Index

Constants

This section is empty.

Variables

var ErrNoCoordFound = errors.New("no coord data found")

ErrNoCoordFound is returned from ParseCoords when there's no coordinate data found.

Functions

func FindFiles

func FindFiles(text string) []string

FindFiles finds all the File references from within an article body.

This includes files referenced from comments, since many of the references I found were commented out.

func FindLinks

func FindLinks(text string) []string

FindLinks finds all the links from within an article body.

func URLForFile

func URLForFile(name string) string

URLForFile gets the wikimedia URL for the given named file.
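
A small sketch exercising all three helpers. The wikitext is made up for illustration, and the exact strings returned by FindFiles (e.g. whether the File: prefix is stripped) are determined by the package itself:

package main

import (
	"fmt"

	"github.com/dustin/go-wikiparse"
)

func main() {
	// Made-up wikitext for illustration.
	text := "[[File:Example.jpg|thumb]] See also [[Go (programming language)]]."

	for _, name := range wikiparse.FindFiles(text) {
		fmt.Println(name, "->", wikiparse.URLForFile(name))
	}
	fmt.Println(wikiparse.FindLinks(text))
}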

Types

type Contributor

type Contributor struct {
	ID       uint64 `xml:"id"`
	Username string `xml:"username"`
}

A Contributor is a user who contributed a revision.

type Coord

type Coord struct {
	Lon, Lat float64
}

Coord is a longitude/latitude pair from a coordinate match.

func ParseCoords

func ParseCoords(text string) (Coord, error)

ParseCoords parses geographical coordinates as specified in http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Geographical_coordinates
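
A minimal sketch; the input here is a made-up inline {{coord}} template in the style described on that page, and the exact set of accepted forms is defined by the parser:

package main

import (
	"fmt"

	"github.com/dustin/go-wikiparse"
)

func main() {
	// A typical inline {{coord}} template (made-up example values).
	coord, err := wikiparse.ParseCoords("{{coord|61.1631|-149.9721|type:landmark}}")
	if err == wikiparse.ErrNoCoordFound {
		fmt.Println("no coordinates here")
		return
	}
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("lon=%v lat=%v\n", coord.Lon, coord.Lat)
}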

type IndexEntry

type IndexEntry struct {
	StreamOffset int64
	PageOffset   int
	ArticleName  string
}

An IndexEntry is an individual article from the index.

func (IndexEntry) String

func (i IndexEntry) String() string

type IndexReader

type IndexReader struct {
	// contains filtered or unexported fields
}

An IndexReader is a wikipedia multistream index reader.

func NewIndexReader

func NewIndexReader(r io.Reader) *IndexReader

NewIndexReader gets a wikipedia index reader.

func (*IndexReader) Next

func (ir *IndexReader) Next() (IndexEntry, error)

Next gets the next entry from the index stream.

This assumes the numbers were meant to be incremental.
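
A sketch that lists the entries of a multistream index. The filename is an example; the downloaded index is bzip2-compressed text, hence the decompression step:

package main

import (
	"compress/bzip2"
	"fmt"
	"os"

	"github.com/dustin/go-wikiparse"
)

func main() {
	// Example filename for a multistream index.
	f, err := os.Open("enwiki-20120211-pages-articles-multistream-index.txt.bz2")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error opening index: %v\n", err)
		os.Exit(1)
	}
	defer f.Close()

	ir := wikiparse.NewIndexReader(bzip2.NewReader(f))
	for {
		e, err := ir.Next()
		if err != nil {
			break // io.EOF at the end of the index
		}
		fmt.Println(e)
	}
}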

type IndexSummaryReader

type IndexSummaryReader struct {
	// contains filtered or unexported fields
}

IndexSummaryReader gets offsets and counts from an index.

If you don't want to know the individual articles, just how many and where, this is for you.

func NewIndexSummaryReader

func NewIndexSummaryReader(r io.Reader) (rv *IndexSummaryReader, err error)

NewIndexSummaryReader gets a new IndexSummaryReader from the given stream of index lines.

func (*IndexSummaryReader) Next

func (isr *IndexSummaryReader) Next() (offset int64, count int, err error)

Next gets the next offset and count from the index summary reader.

Note that the last entry is returned with io.EOF as the error, but with a valid offset and count.
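
A sketch that tallies streams and pages from the same example index file, honoring the io.EOF behavior noted above:

package main

import (
	"compress/bzip2"
	"fmt"
	"io"
	"os"

	"github.com/dustin/go-wikiparse"
)

func main() {
	// Example filename; same index file as for IndexReader.
	f, err := os.Open("enwiki-20120211-pages-articles-multistream-index.txt.bz2")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error opening index: %v\n", err)
		os.Exit(1)
	}
	defer f.Close()

	isr, err := wikiparse.NewIndexSummaryReader(bzip2.NewReader(f))
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error creating summary reader: %v\n", err)
		os.Exit(1)
	}

	streams, pages := 0, 0
	for {
		_, count, err := isr.Next()
		if err != nil && err != io.EOF {
			break
		}
		// The final entry arrives with io.EOF but still carries a
		// valid offset and count, so record it before stopping.
		streams++
		pages += count
		if err == io.EOF {
			break
		}
	}
	fmt.Printf("%d streams, %d pages\n", streams, pages)
}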

type IndexedParseSource

type IndexedParseSource interface {
	OpenIndex() (io.ReadCloser, error)
	OpenData() (ReadSeekCloser, error)
}

An IndexedParseSource provides access to a multistream xml dump and its index.

This is typically downloaded as two files, but a seekable interface, such as HTTP with range requests, can also serve as a source.
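
Here's a hypothetical file-backed implementation sketch. It assumes OpenIndex should yield the plain-text index lines (so the bzip2-compressed index file is decompressed on the way out), while OpenData returns the compressed multistream file itself, whose stream offsets the parser seeks to:

package main

import (
	"compress/bzip2"
	"fmt"
	"io"
	"os"

	"github.com/dustin/go-wikiparse"
)

// fileSource is a hypothetical IndexedParseSource backed by two local files.
type fileSource struct {
	indexPath, dataPath string
}

// OpenIndex yields the plain-text index lines; the downloaded index is
// bzip2-compressed, so decompress here. Closing the result closes the file.
func (s fileSource) OpenIndex() (io.ReadCloser, error) {
	f, err := os.Open(s.indexPath)
	if err != nil {
		return nil, err
	}
	return struct {
		io.Reader
		io.Closer
	}{bzip2.NewReader(f), f}, nil
}

// OpenData returns the multistream dump still compressed: the index offsets
// point at bzip2 stream starts within this file. *os.File satisfies
// wikiparse.ReadSeekCloser.
func (s fileSource) OpenData() (wikiparse.ReadSeekCloser, error) {
	return os.Open(s.dataPath)
}

func main() {
	src := fileSource{
		indexPath: "enwiki-20120211-pages-articles-multistream-index.txt.bz2",
		dataPath:  "enwiki-20120211-pages-articles-multistream.xml.bz2",
	}
	p, err := wikiparse.NewIndexedParserFromSrc(src, 4) // 4 worker goroutines
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error creating parser: %v\n", err)
		os.Exit(1)
	}
	for {
		page, err := p.Next()
		if err != nil {
			break
		}
		fmt.Println(page.Title)
	}
}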

type Page

type Page struct {
	Title     string     `xml:"title"`
	ID        uint64     `xml:"id"`
	Revisions []Revision `xml:"revision"`
	Ns        uint64     `xml:"ns"`
}

A Page in the wiki.

type Parser

type Parser interface {
	// Get the next page from the parser
	Next() (*Page, error)
	// Get the toplevel site info from the stream
	SiteInfo() SiteInfo
}

A Parser emits wiki pages.

func NewIndexedParser

func NewIndexedParser(indexfn, datafn string, numWorkers int) (Parser, error)

NewIndexedParser gets an indexed/parallel wikipedia dump parser from the given index and data files.
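
For the common two-file case, a usage sketch (the filenames are examples):

package main

import (
	"fmt"
	"os"

	"github.com/dustin/go-wikiparse"
)

func main() {
	// Example filenames for a multistream dump and its index.
	p, err := wikiparse.NewIndexedParser(
		"enwiki-20120211-pages-articles-multistream-index.txt.bz2",
		"enwiki-20120211-pages-articles-multistream.xml.bz2",
		8) // number of concurrent parse workers
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error creating parser: %v\n", err)
		os.Exit(1)
	}
	for {
		page, err := p.Next()
		if err != nil {
			break
		}
		fmt.Println(page.Title)
	}
}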

func NewIndexedParserFromSrc

func NewIndexedParserFromSrc(src IndexedParseSource, numWorkers int) (Parser, error)

NewIndexedParserFromSrc creates a Parser that can parse multiple pages concurrently from a single source.

func NewParser

func NewParser(r io.Reader) (Parser, error)

NewParser gets a wikipedia dump parser reading from the given reader.

type ReadSeekCloser

type ReadSeekCloser interface {
	io.ReadSeeker
	io.Closer
}

ReadSeekCloser is io.ReadSeeker + io.Closer.

type Revision

type Revision struct {
	ID          uint64      `xml:"id"`
	Timestamp   string      `xml:"timestamp"`
	Contributor Contributor `xml:"contributor"`
	Comment     string      `xml:"comment"`
	Text        string      `xml:"text"`
}

A Revision to a page.

type SiteInfo

type SiteInfo struct {
	SiteName   string `xml:"sitename"`
	Base       string `xml:"base"`
	Generator  string `xml:"generator"`
	Case       string `xml:"case"`
	Namespaces []struct {
		Key   string `xml:"key,attr"`
		Case  string `xml:"case,attr"`
		Value string `xml:",chardata"`
	} `xml:"namespaces>namespace"`
}

SiteInfo is the toplevel site info describing basic dump properties.
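
A sketch that prints the site info; this assumes the dump header, including siteinfo, has been consumed by the time NewParser returns (consistent with NewParser returning an error):

package main

import (
	"fmt"
	"os"

	"github.com/dustin/go-wikiparse"
)

func main() {
	p, err := wikiparse.NewParser(os.Stdin)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error setting up parser: %v\n", err)
		os.Exit(1)
	}
	si := p.SiteInfo()
	fmt.Printf("%s (%s), generated by %s\n", si.SiteName, si.Base, si.Generator)
	for _, ns := range si.Namespaces {
		fmt.Printf("  namespace %s (%s): %q\n", ns.Key, ns.Case, ns.Value)
	}
}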

Directories

Path         Synopsis
tools
  cbload     Load a wikipedia dump into CouchBase
  couchload  Load a wikipedia dump into CouchDB
  esload     Load a wikipedia dump into ElasticSearch
  traverse   Sample program that finds all the geo data in wikipedia pages.
