sitemap

v0.0.0-...-c27588b Latest
Warning

This package is not in the latest version of its module.

Published: Mar 2, 2021 License: MIT Imports: 6 Imported by: 0

README

gopher-parse-sitemap


A highly efficient Go library for parsing large sitemaps while keeping memory usage low. The sitemap parser is written in Go without external dependencies. See https://www.sitemaps.org/ for more information about the sitemap format.

Why yet another sitemap parsing library?

From time to time you need to parse really huge sitemaps. If you simply unmarshal the whole file into a slice of structs, memory usage grows with the size of the file and the application can crash with an out-of-memory (OOM) error.

The solution is to handle sitemap entries on the fly: read one entry, consume it, and repeat while there are unhandled items in the sitemap.

err := sitemap.ParseFromFile("./testdata/sitemap.xml", func(e sitemap.Entry) error {
    // fmt.Println returns (n int, err error), so it cannot be returned directly.
    fmt.Println(e.GetLocation())
    return nil
})
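
To make the "one entry at a time" idea concrete, here is a minimal sketch of streaming a sitemap with the standard encoding/xml decoder. It is not this library's implementation; the names urlEntry, streamEntries, countEntries and the sample document are made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sampleSitemap is a tiny, made-up sitemap used only for this sketch.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`

// urlEntry is a hypothetical struct holding a single <url> element.
type urlEntry struct {
	Loc string `xml:"loc"`
}

// streamEntries decodes one <url> element at a time and hands it to consume,
// so only a single entry is held in memory at any moment.
func streamEntries(r io.Reader, consume func(urlEntry) error) error {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "url" {
			continue
		}
		var e urlEntry
		if err := dec.DecodeElement(&e, &start); err != nil {
			return err
		}
		if err := consume(e); err != nil {
			return err
		}
	}
}

// countEntries streams the document and counts its entries.
func countEntries(doc string) (int, error) {
	n := 0
	err := streamEntries(strings.NewReader(doc), func(urlEntry) error {
		n++
		return nil
	})
	return n, err
}

func main() {
	err := streamEntries(strings.NewReader(sampleSitemap), func(e urlEntry) error {
		fmt.Println(e.Loc)
		return nil
	})
	if err != nil {
		panic(err)
	}
}
```

The memory footprint of this approach depends on the size of one entry rather than on the size of the whole file, which is the property the library is built around.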
I only need to parse small and medium-sized sitemaps. Should I use this library?

Yes. You can simply load the whole sitemap into memory:

var result []string
err := sitemap.ParseIndexFromFile("./testdata/sitemap-index.xml", func(e sitemap.IndexEntry) error {
    result = append(result, e.GetLocation())
    return nil
})

But if you are sure that you will never need to handle large sitemaps, it may be better to choose a library with a simpler and more suitable API. In that case, you can try projects like https://github.com/yterajima/go-sitemap, https://github.com/snabb/sitemap, and https://github.com/decaseal/go-sitemap-parser.

Install

Installation is straightforward:

go get -u github.com/oxffaa/gopher-parse-sitemap

Then import it (the package name is sitemap):

import "github.com/oxffaa/gopher-parse-sitemap"

Well done, you can start to create something awesome.

Documentation

Please see the package documentation for details.

Documentation

Overview

Package sitemap provides primitives for highly efficient parsing of huge sitemap files.

Constants

This section is empty.

Variables

This section is empty.

Functions

func Parse

func Parse(reader io.Reader, consumer EntryConsumer) error

Parse parses the data provided by the reader and calls the consumer function for each sitemap entry.
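
Because Parse accepts any io.Reader, the input does not have to be a plain file. The sketch below shows the general pattern with a gzip-compressed sitemap, which the sitemap protocol allows. To stay self-contained it uses parseLocs, a hypothetical stand-in built on encoding/xml, where real code would presumably pass the gzip reader to Parse. The helper names gzipBytes, locsFromGzip and roundTrip are also made up.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/xml"
	"fmt"
	"io"
)

// parseLocs is a stand-in for a reader-based parser such as Parse.
// It is NOT the library's implementation.
func parseLocs(r io.Reader, consume func(loc string) error) error {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "loc" {
			var loc string
			if err := dec.DecodeElement(&loc, &se); err != nil {
				return err
			}
			if err := consume(loc); err != nil {
				return err
			}
		}
	}
}

// gzipBytes compresses data in memory to simulate a sitemap.xml.gz file.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// locsFromGzip decompresses on the fly and streams locations out of the result.
func locsFromGzip(compressed []byte) ([]string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var locs []string
	err = parseLocs(zr, func(loc string) error {
		locs = append(locs, loc)
		return nil
	})
	return locs, err
}

// roundTrip compresses doc and reads its locations back through the gzip reader.
func roundTrip(doc string) ([]string, error) {
	compressed, err := gzipBytes([]byte(doc))
	if err != nil {
		return nil, err
	}
	return locsFromGzip(compressed)
}

func main() {
	locs, err := roundTrip(`<urlset><url><loc>https://example.com/a</loc></url></urlset>`)
	if err != nil {
		panic(err)
	}
	fmt.Println(locs)
}
```

The same wrapping works for any other reader, for example an HTTP response body or a size-limited reader.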

func ParseFromFile

func ParseFromFile(sitemapPath string, consumer EntryConsumer) error

ParseFromFile reads a sitemap from a file, parses it, and calls the consumer function for each sitemap entry.

Example


err := ParseFromFile("./testdata/sitemap.xml", func(e Entry) error {
	fmt.Println(e.GetLocation())
	return nil
})
if err != nil {
	panic(err)
}
Output:

func ParseFromSite

func ParseFromSite(url string, consumer EntryConsumer) error

ParseFromSite downloads a sitemap from a site, parses it, and calls the consumer function for each sitemap entry.

func ParseIndex

func ParseIndex(reader io.Reader, consumer IndexEntryConsumer) error

ParseIndex parses the data provided by the reader and calls the consumer function for each sitemap index entry.

func ParseIndexFromFile

func ParseIndexFromFile(sitemapPath string, consumer IndexEntryConsumer) error

ParseIndexFromFile reads a sitemap index from a file, parses it, and calls the consumer function for each sitemap index entry.

Example
var result []string
err := ParseIndexFromFile("./testdata/sitemap-index.xml", func(e IndexEntry) error {
	result = append(result, e.GetLocation())
	return nil
})
if err != nil {
	panic(err)
}
Output:

func ParseIndexFromSite

func ParseIndexFromSite(sitemapURL string, consumer IndexEntryConsumer) error

ParseIndexFromSite downloads a sitemap index from a site, parses it, and calls the consumer function for each sitemap index entry.
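
A sitemap index lists sitemaps, and each sitemap lists pages, so a full crawl nests one consumer inside another. The sketch below shows that shape with hypothetical names only: locParser is an adapter type that wrappers around ParseIndexFromSite and ParseFromSite could satisfy, and fakeParse replaces the network with an in-memory map. It assumes the parser returns the consumer's error to the caller, which this documentation does not state.

```go
package main

import (
	"errors"
	"fmt"
)

// locParser is a hypothetical adapter type: it parses the document at url and
// calls consume with each location found in it.
type locParser func(url string, consume func(loc string) error) error

// walkIndex visits every sitemap listed in the index and passes every page
// URL to visit, without collecting intermediate results.
func walkIndex(indexURL string, parseIndex, parseSitemap locParser, visit func(pageURL string) error) error {
	return parseIndex(indexURL, func(sitemapURL string) error {
		return parseSitemap(sitemapURL, visit)
	})
}

// fakeDocs simulates a site: each URL maps to the locations its document lists.
var fakeDocs = map[string][]string{
	"https://example.com/sitemap-index.xml": {"https://example.com/s1.xml", "https://example.com/s2.xml"},
	"https://example.com/s1.xml":            {"https://example.com/a", "https://example.com/b"},
	"https://example.com/s2.xml":            {"https://example.com/c"},
}

// fakeParse is an in-memory stand-in for a real parser.
func fakeParse(url string, consume func(loc string) error) error {
	locs, ok := fakeDocs[url]
	if !ok {
		return errors.New("document not found: " + url)
	}
	for _, loc := range locs {
		if err := consume(loc); err != nil {
			return err
		}
	}
	return nil
}

// collectPages runs the walk against the fake site and returns all page URLs.
func collectPages(indexURL string) ([]string, error) {
	var pages []string
	err := walkIndex(indexURL, fakeParse, fakeParse, func(p string) error {
		pages = append(pages, p)
		return nil
	})
	return pages, err
}

func main() {
	pages, err := collectPages("https://example.com/sitemap-index.xml")
	if err != nil {
		panic(err)
	}
	fmt.Println(pages)
}
```

Collecting pages into a slice is done here only to make the result visible; for huge sites the visit function would normally process each URL immediately.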

Types

type Entry

type Entry interface {
	GetLocation() string
	GetLastModified() *time.Time
	GetPriority() float32
}

Entry is an interface that describes an element (a URL) in the sitemap file. Keep in mind that it is implemented by a fully immutable entity, so you should minimize the number of method calls, because each call can produce additional memory allocations.

GetLocation returns the URL of the page. It must return a non-empty string value.

GetLastModified parses and returns the date and time of the page's last modification. It can return nil or a valid time.Time instance. Be careful: each call returns a new time.Time instance.

GetChangeFrequency returns a string value that indicates how frequently the page changes. It always returns a string value. See the Frequency set of constants.

GetPriority returns the priority of the page. Valid values range from 0.0 to 1.0; the default value is 0.5.

You shouldn't implement this interface in your types.
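
Since each getter call may allocate, a consumer can call every getter once, keep the results in local variables, and reuse them. The sketch below illustrates this. It redeclares Entry locally so it compiles on its own, and fakeEntry, describe and describeCounted are hypothetical helpers that exist only to count calls; they are not part of the package.

```go
package main

import (
	"fmt"
	"time"
)

// Entry mirrors the interface documented above so the sketch compiles on its own.
type Entry interface {
	GetLocation() string
	GetLastModified() *time.Time
	GetPriority() float32
}

// fakeEntry is a hypothetical implementation that counts getter calls.
type fakeEntry struct {
	loc     string
	lastMod string
	calls   *int
}

func (f fakeEntry) GetLocation() string { *f.calls++; return f.loc }

func (f fakeEntry) GetLastModified() *time.Time {
	*f.calls++
	t, err := time.Parse(time.RFC3339, f.lastMod)
	if err != nil {
		return nil
	}
	return &t
}

func (f fakeEntry) GetPriority() float32 { *f.calls++; return 0.5 }

// describe calls each getter exactly once, stores the results in locals,
// and reuses them, including the nil check on the last-modified time.
func describe(e Entry) string {
	loc := e.GetLocation()
	lastMod := e.GetLastModified()
	priority := e.GetPriority()

	if lastMod == nil {
		return fmt.Sprintf("%s (priority %.1f, last modified unknown)", loc, priority)
	}
	return fmt.Sprintf("%s (priority %.1f, last modified %s)", loc, priority, lastMod.Format("2006-01-02"))
}

// describeCounted returns the description and how many getter calls it took.
func describeCounted(loc, lastMod string) (string, int) {
	calls := 0
	s := describe(fakeEntry{loc: loc, lastMod: lastMod, calls: &calls})
	return s, calls
}

func main() {
	s, calls := describeCounted("https://example.com/", "2021-03-02T00:00:00Z")
	fmt.Println(s, calls)
}
```

Checking the result of GetLastModified for nil before using it matters because a sitemap entry is not required to carry a modification date.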

type EntryConsumer

type EntryConsumer func(Entry) error

EntryConsumer is a function type that represents a consumer of parsed sitemap entries.
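
A consumer returns an error, which suggests a way to stop parsing early, for example after the first N entries. The sketch below shows that pattern with a sentinel error. It assumes the parser stops at the first consumer error and returns it unchanged, which this documentation does not confirm; feed is a hypothetical stand-in for the parser, and errStop and firstN are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
)

// errStop is a hypothetical sentinel a consumer returns to end parsing early.
var errStop = errors.New("stop parsing")

// feed is a stand-in for a parser: it passes each location to consume and
// returns the first error it gets back. Whether the real parser behaves this
// way is an assumption.
func feed(locs []string, consume func(loc string) error) error {
	for _, loc := range locs {
		if err := consume(loc); err != nil {
			return err
		}
	}
	return nil
}

// firstN collects at most n locations, then asks the parser to stop.
func firstN(locs []string, n int) ([]string, error) {
	var out []string
	err := feed(locs, func(loc string) error {
		if len(out) >= n {
			return errStop
		}
		out = append(out, loc)
		return nil
	})
	if errors.Is(err, errStop) {
		err = nil // stopping early is expected here, not a failure
	}
	return out, err
}

func main() {
	got, err := firstN([]string{"/a", "/b", "/c"}, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```

Using a dedicated sentinel keeps an intentional stop distinguishable from a genuine parsing or I/O error.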

type IndexEntry

type IndexEntry interface {
	GetLocation() string
	GetLastModified() *time.Time
}

IndexEntry is an interface that describes an element (a URL) in a sitemap index file. Keep in mind that it is implemented by a fully immutable entity, so you should minimize the number of method calls, because each call can produce additional memory allocations.

GetLocation returns the URL of a sitemap file. It must return a non-empty string value.

GetLastModified parses and returns the date and time of the sitemap's last modification. It can return nil or a valid time.Time instance. Be careful: each call returns a new time.Time instance.

You shouldn't implement this interface in your types.

type IndexEntryConsumer

type IndexEntryConsumer func(IndexEntry) error

IndexEntryConsumer is a function type that represents a consumer of parsed sitemap index entries.
