Documentation ¶
Overview ¶
Package wikiparse is a library to understand the wikipedia xml dump format.
The dumps are available from the wikimedia group here:
http://dumps.wikimedia.org/
In particular, I've worked mostly with the enwiki dumps from here:
http://dumps.wikimedia.org/enwiki/
See the example programs in subpackages for an idea of how I've made use of these things.
Index ¶
- Variables
- func FindFiles(text string) []string
- func FindLinks(text string) []string
- func URLForFile(name string) string
- type Contributor
- type Coord
- type IndexEntry
- type IndexReader
- type IndexSummaryReader
- type IndexedParseSource
- type Page
- type Parser
- type ReadSeekCloser
- type Redirect
- type Revision
- type SiteInfo
Variables ¶
var ErrNoCoordFound = errors.New("no coord data found")
ErrNoCoordFound is returned from ParseCoords when there's no coordinate data found.
Functions ¶
func FindFiles ¶
func FindFiles(text string) []string
FindFiles finds all the File references from within an article body.
This includes things in comments, as many I found were commented out.
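To make the comment-inclusive behaviour concrete, here is a minimal sketch of one way such a scan could work. It is not the package's implementation: the `findFileRefs` name, the regular expression, and the assumption that references look like `[[File:Name.jpg|...]]` are all illustrative.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fileRef is an assumed pattern for [[File:...]] references. The real
// package may recognise other forms as well.
var fileRef = regexp.MustCompile(`\[\[File:([^|\]]+)`)

// findFileRefs is a hypothetical stand-in for FindFiles. It scans the raw
// text without stripping anything first, so references that sit inside
// <!-- comments --> are matched too.
func findFileRefs(text string) []string {
	var names []string
	for _, m := range fileRef.FindAllStringSubmatch(text, -1) {
		names = append(names, strings.TrimSpace(m[1]))
	}
	return names
}

func main() {
	body := "[[File:Map.png|thumb|A map]] <!-- [[File:Old photo.jpg]] -->"
	for _, name := range findFileRefs(body) {
		fmt.Println(name)
	}
}
```

Because the scan never removes comments, the commented-out reference in the sample body is reported alongside the live one.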
func URLForFile ¶
func URLForFile(name string) string
URLForFile gets the wikimedia URL for the given named file.
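As a rough illustration of how a file name can map to a URL, the sketch below follows MediaWiki's hashed upload layout, where the path is derived from the MD5 of the underscore-separated name. The `fileURL` name and the choice of the Commons base URL are assumptions; the package's actual output may differ.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// commonsBase is an assumption: it presumes the file is hosted on
// Wikimedia Commons rather than on a project-local upload area.
const commonsBase = "https://upload.wikimedia.org/wikipedia/commons/"

// fileURL is a hypothetical stand-in for URLForFile. It replaces spaces
// with underscores, hashes the result with MD5, and uses the first one and
// first two hex digits as directory names.
func fileURL(name string) string {
	name = strings.ReplaceAll(name, " ", "_")
	sum := md5.Sum([]byte(name))
	h := hex.EncodeToString(sum[:])
	return commonsBase + h[:1] + "/" + h[:2] + "/" + name
}

func main() {
	fmt.Println(fileURL("Old photo.jpg"))
}
```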
Types ¶
type Contributor ¶
A Contributor is a user who contributed a revision.
type Coord ¶
type Coord struct {
Lon, Lat float64
}
Coord is a longitude/latitude pair from a coordinate match.
func ParseCoords ¶
ParseCoords parses geographical coordinates as specified in http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Geographical_coordinates
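One of the forms described on that page gives degrees, minutes, seconds and a hemisphere letter for each axis. The sketch below shows only the arithmetic for turning such values into the signed decimal degrees a Coord would hold. The `dmsToDecimal` helper is hypothetical, and the real ParseCoords handles template parsing that is not shown here.

```go
package main

import "fmt"

// dmsToDecimal converts degrees, minutes and seconds plus a hemisphere
// letter into signed decimal degrees. South and west are negative.
func dmsToDecimal(deg, m, s float64, hemi string) float64 {
	v := deg + m/60 + s/3600
	if hemi == "S" || hemi == "W" {
		v = -v
	}
	return v
}

func main() {
	// Values as they might appear in {{coord|57|18|22|N|4|27|32|W}}.
	lat := dmsToDecimal(57, 18, 22, "N")
	lon := dmsToDecimal(4, 27, 32, "W")
	fmt.Printf("lat=%.4f lon=%.4f\n", lat, lon)
}
```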
type IndexEntry ¶
An IndexEntry is an individual article from the index.
func (IndexEntry) String ¶
func (i IndexEntry) String() string
type IndexReader ¶
type IndexReader struct {
// contains filtered or unexported fields
}
An IndexReader is a wikipedia multistream index reader.
func NewIndexReader ¶
func NewIndexReader(r io.Reader) *IndexReader
NewIndexReader gets a wikipedia index reader.
func (*IndexReader) Next ¶
func (ir *IndexReader) Next() (IndexEntry, error)
Next gets the next entry from the index stream.
This assumes the numbers were meant to be incremental.
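For orientation, each line of a multistream index has the form `offset:pageid:title`, where the offset is the byte position of the compressed stream holding the page. The sketch below parses one such line. The `entry` type and `parseIndexLine` function are hypothetical and do not mirror the fields of IndexEntry, which this page does not list.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// entry is a hypothetical record for one index line.
type entry struct {
	Offset int64  // byte offset of the stream in the data file
	ID     uint64 // page ID
	Title  string // page title, which may itself contain colons
}

// parseIndexLine splits a line into at most three fields so that colons
// inside the title are preserved.
func parseIndexLine(line string) (entry, error) {
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return entry{}, fmt.Errorf("malformed index line: %q", line)
	}
	off, err := strconv.ParseInt(parts[0], 10, 64)
	if err != nil {
		return entry{}, err
	}
	id, err := strconv.ParseUint(parts[1], 10, 64)
	if err != nil {
		return entry{}, err
	}
	return entry{Offset: off, ID: id, Title: parts[2]}, nil
}

func main() {
	e, err := parseIndexLine("1234:99:Wikipedia:About")
	fmt.Println(e.Offset, e.ID, e.Title, err)
}
```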
type IndexSummaryReader ¶
type IndexSummaryReader struct {
// contains filtered or unexported fields
}
IndexSummaryReader gets offsets and counts from an index.
If you don't want to know the individual articles, just how many and where, this is for you.
func NewIndexSummaryReader ¶
func NewIndexSummaryReader(r io.Reader) (rv *IndexSummaryReader, err error)
NewIndexSummaryReader gets a new IndexSummaryReader from the given stream of index lines.
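The idea of "how many and where" can be sketched by grouping consecutive index lines that share an offset. The `streamSummary` type and the `summarize` and `summarizeText` functions below are hypothetical; they assume the `offset:pageid:title` line format and that lines for one stream are adjacent, and they are not the package's implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// streamSummary records where a stream starts and how many pages it holds.
type streamSummary struct {
	Offset int64
	Count  int
}

// summarize reads index lines and counts consecutive lines per offset.
func summarize(r io.Reader) ([]streamSummary, error) {
	var out []streamSummary
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		i := strings.IndexByte(line, ':')
		if i < 0 {
			return nil, fmt.Errorf("malformed index line: %q", line)
		}
		off, err := strconv.ParseInt(line[:i], 10, 64)
		if err != nil {
			return nil, err
		}
		if n := len(out); n > 0 && out[n-1].Offset == off {
			out[n-1].Count++
		} else {
			out = append(out, streamSummary{Offset: off, Count: 1})
		}
	}
	return out, sc.Err()
}

// summarizeText is a convenience wrapper for in-memory input.
func summarizeText(s string) ([]streamSummary, error) {
	return summarize(strings.NewReader(s))
}

func main() {
	got, err := summarizeText("616:10:A\n616:12:B\n627000:300:C\n")
	fmt.Println(got, err)
}
```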
type IndexedParseSource ¶
type IndexedParseSource interface {
	OpenIndex() (io.ReadCloser, error)
	OpenData() (ReadSeekCloser, error)
}
An IndexedParseSource provides access to a multistream xml dump and its index.
This is typically downloaded as two files, but a seekable interface such as HTTP with range requests can also serve.
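The two-file case might look like the sketch below. The interfaces are redeclared locally so the example compiles on its own, and `fileSource` is a hypothetical type, not something the package exports. Whether the index needs decompressing before it is handed back is not covered here.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// Local copies of the interfaces documented on this page.
type ReadSeekCloser interface {
	io.ReadSeeker
	io.Closer
}

type IndexedParseSource interface {
	OpenIndex() (io.ReadCloser, error)
	OpenData() (ReadSeekCloser, error)
}

// fileSource is a hypothetical source backed by two local files.
type fileSource struct {
	indexPath, dataPath string
}

func (s fileSource) OpenIndex() (io.ReadCloser, error) {
	f, err := os.Open(s.indexPath)
	if err != nil {
		return nil, err
	}
	return f, nil
}

func (s fileSource) OpenData() (ReadSeekCloser, error) {
	f, err := os.Open(s.dataPath)
	if err != nil {
		return nil, err
	}
	return f, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: prog INDEX DATA")
		return
	}
	var src IndexedParseSource = fileSource{os.Args[1], os.Args[2]}
	data, err := src.OpenData()
	if err != nil {
		fmt.Println("open data:", err)
		return
	}
	defer data.Close()
	// Seeking is what makes it possible to jump to a stream offset.
	size, err := data.Seek(0, io.SeekEnd)
	fmt.Println(size, err)
}
```

An HTTP-backed source would implement the same two methods, with Seek and Read translated into range requests.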
type Page ¶
type Page struct {
	Title     string     `xml:"title"`
	ID        uint64     `xml:"id"`
	Redir     Redirect   `xml:"redirect"`
	Revisions []Revision `xml:"revision"`
	Ns        uint64     `xml:"ns"`
}
A Page in the wiki.
type Parser ¶
type Parser interface {
	// Get the next page from the parser
	Next() (*Page, error)
	// Get the toplevel site info from the stream
	SiteInfo() SiteInfo
}
A Parser emits wiki pages.
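A consumer would typically call Next in a loop until the stream ends. The sketch below uses trimmed local copies of the types and a hypothetical in-memory `sliceParser`, and it assumes that the end of the stream is signalled with `io.EOF`, which this page does not state.

```go
package main

import (
	"fmt"
	"io"
)

// Trimmed local copies of the documented types, for illustration only.
type Page struct {
	Title string
	ID    uint64
}

type SiteInfo struct {
	SiteName string
}

type Parser interface {
	Next() (*Page, error)
	SiteInfo() SiteInfo
}

// sliceParser is a hypothetical Parser that serves pages from memory.
type sliceParser struct {
	pages []Page
	pos   int
}

func (p *sliceParser) Next() (*Page, error) {
	if p.pos >= len(p.pages) {
		return nil, io.EOF // assumed end-of-stream signal
	}
	pg := &p.pages[p.pos]
	p.pos++
	return pg, nil
}

func (p *sliceParser) SiteInfo() SiteInfo {
	return SiteInfo{SiteName: "Example"}
}

// collectTitles drains a Parser and returns the page titles in order.
func collectTitles(p Parser) ([]string, error) {
	var titles []string
	for {
		pg, err := p.Next()
		if err == io.EOF {
			return titles, nil
		}
		if err != nil {
			return titles, err
		}
		titles = append(titles, pg.Title)
	}
}

func main() {
	p := &sliceParser{pages: []Page{{Title: "A", ID: 1}, {Title: "B", ID: 2}}}
	titles, err := collectTitles(p)
	fmt.Println(p.SiteInfo().SiteName, titles, err)
}
```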
func NewIndexedParser ¶
NewIndexedParser gets an indexed/parallel wikipedia dump parser from the given index and data files.
func NewIndexedParserFromSrc ¶
func NewIndexedParserFromSrc(src IndexedParseSource, numWorkers int) (Parser, error)
NewIndexedParserFromSrc creates a Parser that can parse multiple pages concurrently from a single source.
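How the package schedules its workers is not described on this page. As a general illustration of what a `numWorkers` parameter usually controls, the sketch below fans stream offsets out to a fixed pool of goroutines. The `processStreams` function and its `work` callback are hypothetical placeholders for "seek to the offset and parse the pages in that stream".

```go
package main

import (
	"fmt"
	"sync"
)

// processStreams hands each offset to one of numWorkers goroutines and
// collects one result per offset. The order of processing is not fixed.
func processStreams(offsets []int64, numWorkers int, work func(int64) int) map[int64]int {
	if numWorkers < 1 {
		numWorkers = 1
	}
	jobs := make(chan int64)
	results := make(map[int64]int, len(offsets))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for off := range jobs {
				n := work(off)
				mu.Lock()
				results[off] = n
				mu.Unlock()
			}
		}()
	}
	for _, off := range offsets {
		jobs <- off
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// The callback stands in for real parsing and returns a fake page count.
	counts := processStreams([]int64{616, 627000, 1300000}, 2, func(off int64) int {
		return int(off % 100)
	})
	fmt.Println(len(counts))
}
```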
type ReadSeekCloser ¶
type ReadSeekCloser interface {
	io.ReadSeeker
	io.Closer
}
ReadSeekCloser is io.ReadSeeker + io.Closer.
type Redirect ¶
type Redirect struct {
Title string `xml:"title,attr"`
}
A Redirect to another Page.
type Revision ¶
type Revision struct {
	ID          uint64      `xml:"id"`
	Timestamp   string      `xml:"timestamp"`
	Contributor Contributor `xml:"contributor"`
	Comment     string      `xml:"comment"`
	Text        string      `xml:"text"`
}
A Revision to a page.
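The struct tags on Page, Redirect and Revision map directly onto the dump's `<page>` element. The sketch below redeclares those types locally and decodes an invented sample with `encoding/xml`. Contributor is left out because its fields are not listed on this page, and the `decodePage` helper and sample data are illustrative.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Local copies of the documented types, minus Contributor.
type Redirect struct {
	Title string `xml:"title,attr"`
}

type Revision struct {
	ID        uint64 `xml:"id"`
	Timestamp string `xml:"timestamp"`
	Comment   string `xml:"comment"`
	Text      string `xml:"text"`
}

type Page struct {
	Title     string     `xml:"title"`
	ID        uint64     `xml:"id"`
	Redir     Redirect   `xml:"redirect"`
	Revisions []Revision `xml:"revision"`
	Ns        uint64     `xml:"ns"`
}

// sample is invented data shaped like one page of a dump.
const sample = `<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>1</id>
    <timestamp>2001-01-21T02:12:21Z</timestamp>
    <comment>example</comment>
    <text>#REDIRECT [[Computer accessibility]]</text>
  </revision>
</page>`

// decodePage unmarshals a single <page> element.
func decodePage(doc string) (Page, error) {
	var p Page
	err := xml.Unmarshal([]byte(doc), &p)
	return p, err
}

func main() {
	p, err := decodePage(sample)
	fmt.Println(p.Title, p.ID, p.Redir.Title, len(p.Revisions), err)
}
```

Note that the page-level `id` tag only matches the direct child element, so the revision's own `<id>` does not overwrite it.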
type SiteInfo ¶
type SiteInfo struct {
	SiteName   string `xml:"sitename"`
	Base       string `xml:"base"`
	Generator  string `xml:"generator"`
	Case       string `xml:"case"`
	Namespaces []struct {
		Key   string `xml:"key,attr"`
		Case  string `xml:"case,attr"`
		Value string `xml:",chardata"`
	} `xml:"namespaces>namespace"`
}
SiteInfo is the toplevel site info describing basic dump properties.
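The `namespaces>namespace` tag tells `encoding/xml` to collect every `<namespace>` nested inside `<namespaces>`. The sketch below decodes an invented `<siteinfo>` fragment into a local copy of the struct. The sample values and the `decodeSiteInfo` helper are illustrative, not taken from a real dump.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Local copy of the documented SiteInfo type.
type SiteInfo struct {
	SiteName   string `xml:"sitename"`
	Base       string `xml:"base"`
	Generator  string `xml:"generator"`
	Case       string `xml:"case"`
	Namespaces []struct {
		Key   string `xml:"key,attr"`
		Case  string `xml:"case,attr"`
		Value string `xml:",chardata"`
	} `xml:"namespaces>namespace"`
}

// siteSample is invented data shaped like a dump header.
const siteSample = `<siteinfo>
  <sitename>Wikipedia</sitename>
  <base>https://en.wikipedia.org/wiki/Main_Page</base>
  <generator>MediaWiki 1.0</generator>
  <case>first-letter</case>
  <namespaces>
    <namespace key="0" case="first-letter" />
    <namespace key="1" case="first-letter">Talk</namespace>
  </namespaces>
</siteinfo>`

// decodeSiteInfo unmarshals a <siteinfo> element.
func decodeSiteInfo(doc string) (SiteInfo, error) {
	var si SiteInfo
	err := xml.Unmarshal([]byte(doc), &si)
	return si, err
}

func main() {
	si, err := decodeSiteInfo(siteSample)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	for _, ns := range si.Namespaces {
		fmt.Printf("key=%s name=%q\n", ns.Key, ns.Value)
	}
}
```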
Directories ¶
Path | Synopsis |
---|---|
tools | |
cbload | Load a wikipedia dump into CouchBase |
couchload | Load a wikipedia dump into CouchDB |
esload | Load a wikipedia dump into ElasticSearch |
traverse | Sample program that finds all the geo data in wikipedia pages. |