Documentation
Overview
An ad-hoc parser for Wikipedia's 45GB (and growing) XML database.
Index
- func CategorizedParse(reader io.Reader, out chan<- *Page, categories *Categories)
- func FilterRedirects(rawPages <-chan []byte, nonRedirectPages chan<- []byte)
- func GetCategories(pages <-chan *Page, categorizedPages chan<- *Page, categories *Categories)
- func GetChunks(reader io.Reader, chunks chan<- []byte)
- func GetLinks(pages <-chan *Page, linkedPages chan<- *Page)
- func GetPages(rawPages <-chan []byte, pages chan<- *Page)
- func GetRawPages(chunks <-chan []byte, pages chan<- []byte)
- func Parse(reader io.Reader, pages chan<- *Page)
- type Categories
- type Page
- type Revision
Constants
This section is empty.
Variables
This section is empty.
Functions
func CategorizedParse
func CategorizedParse(reader io.Reader, out chan<- *Page, categories *Categories)
CategorizedParse behaves like Parse, except that it also categorizes pages, recording the results in the given categories object.
func FilterRedirects
func FilterRedirects(rawPages <-chan []byte, nonRedirectPages chan<- []byte)
FilterRedirects discards all pages that redirect to another page.
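As an illustration of this kind of channel filter, here is a minimal standalone sketch. It does not use this package: `isRedirect` and `filterRedirects` are hypothetical names, the check for the `<redirect ... />` element found in MediaWiki dumps is an assumed detection rule, and closing the output channel when the input is drained is also an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// isRedirect reports whether a raw page looks like a redirect.
// Assumption: a redirect is recognized by the <redirect .../> element
// that MediaWiki dumps emit; the package may use a different test.
func isRedirect(rawPage []byte) bool {
	return bytes.Contains(rawPage, []byte("<redirect "))
}

// filterRedirects is a hypothetical stand-in for FilterRedirects.
// It forwards every non-redirect page and closes the output channel
// once the input channel is drained (an assumption of this sketch).
func filterRedirects(rawPages <-chan []byte, nonRedirectPages chan<- []byte) {
	defer close(nonRedirectPages)
	for p := range rawPages {
		if !isRedirect(p) {
			nonRedirectPages <- p
		}
	}
}

func main() {
	in := make(chan []byte)
	out := make(chan []byte)
	go func() {
		defer close(in)
		in <- []byte(`<page><title>A</title><redirect title="B" /></page>`)
		in <- []byte(`<page><title>B</title><text>body</text></page>`)
	}()
	go filterRedirects(in, out)
	for p := range out {
		fmt.Printf("%s\n", p) // only the page titled B is printed
	}
}
```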
func GetCategories
func GetCategories(pages <-chan *Page, categorizedPages chan<- *Page, categories *Categories)
GetCategories extracts the categories from each Wikipedia page and adds them to the given categories object. Only links of the form [[Category:target]] are extracted.
func GetChunks
func GetChunks(reader io.Reader, chunks chan<- []byte)
GetChunks reads an XML file line by line and sends each line to its output channel.
func GetLinks
func GetLinks(pages <-chan *Page, linkedPages chan<- *Page)
GetLinks extracts all Wikipedia links found in pages. Only links of the form [[target]] are extracted.
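To make the "[[target]] only" rule concrete, the following standalone sketch extracts plain wiki links with a regular expression. The pattern and the `extractLinks` helper are assumptions for illustration; the package's own matching logic is not shown in this documentation and may differ, for example in how it treats namespaced targets.

```go
package main

import (
	"fmt"
	"regexp"
)

// plainLink matches [[target]] where target contains neither ']' nor '|'.
// Piped links such as [[target|label]] are therefore skipped.
var plainLink = regexp.MustCompile(`\[\[([^\]|]+)\]\]`)

// extractLinks is a hypothetical helper that returns the targets of
// all plain [[target]] links in the given wiki text.
func extractLinks(text string) []string {
	var links []string
	for _, m := range plainLink.FindAllStringSubmatch(text, -1) {
		links = append(links, m[1])
	}
	return links
}

func main() {
	text := "See [[Go]] and [[Wikipedia]], not [[Go|the language]]."
	fmt.Println(extractLinks(text)) // prints [Go Wikipedia]
}
```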
func GetRawPages
func GetRawPages(chunks <-chan []byte, pages chan<- []byte)
GetRawPages combines individual lines into complete XML pages so that they can be processed by a standard in-memory XML parser.
Types
type Categories
type Categories struct {
// contains filtered or unexported fields
}
Categories is a simple structure for keeping track of the categories of Wikipedia articles. It is optimized for queries like "articles about science" rather than "which categories is this article in".
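The query trade-off described above is what an inverted index from category to articles gives you. The sketch below is a guess at the general shape only, since the real fields are unexported: `categoryIndex`, `addPage`, and `articlesIn` are hypothetical names, and pages are represented by their titles for simplicity.

```go
package main

import "fmt"

// categoryIndex is a hypothetical stand-in for Categories.
// It maps a category name to the titles of the articles in it.
type categoryIndex struct {
	pages map[string][]string
}

func newCategoryIndex() *categoryIndex {
	return &categoryIndex{pages: make(map[string][]string)}
}

// addPage records the article under each of its categories.
func (c *categoryIndex) addPage(title string, cats []string) {
	for _, cat := range cats {
		c.pages[cat] = append(c.pages[cat], title)
	}
}

// articlesIn answers "which articles are in this category" with a
// single map lookup.
func (c *categoryIndex) articlesIn(cat string) []string {
	return c.pages[cat]
}

func main() {
	idx := newCategoryIndex()
	idx.addPage("Physics", []string{"Science"})
	idx.addPage("Chemistry", []string{"Science", "Laboratories"})
	fmt.Println(idx.articlesIn("Science")) // prints [Physics Chemistry]
}
```

With this layout, finding every category of a given article would require scanning the whole map, which is the reverse query the structure is not optimized for.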
func NewCategories
func NewCategories() *Categories
NewCategories returns an empty categories object.
func (*Categories) AddPage
func (self *Categories) AddPage(page *Page, cats []string)
AddPage adds the given page to the given categories.
func (*Categories) String
func (self *Categories) String() string
String produces a string representation of the categories in the form: category -> (article, article, article)