nuvi

v0.0.0-...-a06ed9b
Published: Aug 24, 2016 License: MIT Imports: 10 Imported by: 0

README

nuvi

A web scraper for zip files. This utility inspects an HTML page and downloads any file that is a zip archive. Note that it considers a file to be a zip archive if its anchor hyperlink contains the .zip extension.

Prerequisites

You should have the following dependencies installed and configured:

  • Golang 1.6
  • Redis

Installation

$ go get github.com/svett/nuvi/cmd/nuvi

Usage

The nuvi binary can be executed with the following arguments:

  • url: the address of the page that is inspected for zip files (required)
  • redis-addr: the address of the redis server (optional)
  • redis-password: the password of the redis server that the app connects to (optional)
  • max-parallel-download-conn: the number of files downloaded in parallel (optional)

$ nuvi -url=http_url_to_desired_page \
       -redis-addr=redis_server_host_and_port \
       -redis-password=redis_server_password \
       -max-parallel-download-conn=5
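
The arguments listed above map naturally onto Go's standard flag package. The following is a minimal sketch of how they could be parsed, not the actual code of cmd/nuvi: the options struct, the parseOptions function and all default values are assumptions made for illustration. Only the four flag names come from this README.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// options mirrors the four documented arguments. The struct and its
// field names are hypothetical; only the flag names come from the README.
type options struct {
	URL           string
	RedisAddr     string
	RedisPassword string
	MaxConn       int
}

// parseOptions parses args (without the program name) into options.
// The default values below are assumptions made for this sketch.
func parseOptions(args []string) (options, error) {
	var opts options
	fs := flag.NewFlagSet("nuvi", flag.ContinueOnError)
	fs.StringVar(&opts.URL, "url", "", "page address that is inspected for zip files (required)")
	fs.StringVar(&opts.RedisAddr, "redis-addr", "localhost:6379", "address of the redis server")
	fs.StringVar(&opts.RedisPassword, "redis-password", "", "password of the redis server")
	fs.IntVar(&opts.MaxConn, "max-parallel-download-conn", 1, "number of files downloaded in parallel")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}
	// url is the only required argument.
	if opts.URL == "" {
		return opts, errors.New("the -url argument is required")
	}
	return opts, nil
}

func main() {
	// A fixed argument list keeps the example independent of os.Args.
	opts, err := parseOptions([]string{
		"-url=http://example.com/",
		"-max-parallel-download-conn=5",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```

Using a dedicated flag.FlagSet rather than the global flag set keeps the parsing function easy to call from tests.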

Example

$ nuvi -url=http://feed.omgili.com/5Rh5AMTrc4Pv/mainstream/posts/

Contribution

Get the sources and all dependencies with the following git commands:

$ git clone https://github.com/svett/nuvi
$ git submodule update --init --recursive

To start contributing to the project, install the ginkgo and gomega packages, which are used in the unit and integration tests:

$ go get github.com/onsi/ginkgo/ginkgo
$ go get github.com/onsi/gomega

You can run all unit and integration tests by executing the following script:

$ ./scripts/run_tests.sh

Note that you need redis-server installed. Every integration test starts and stops the server. Therefore, you should not have it running as a daemon.

By default, redis-server runs on port 6379. If your instance is configured to run on a different port, set the environment variable REDIS_SERVER_PORT before you execute the tests.
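
The port lookup described here follows a common pattern: read the environment variable and fall back to the default when it is not set. A minimal sketch, assuming the tests treat an empty REDIS_SERVER_PORT the same as an unset one; the function name portFrom is hypothetical and is not taken from the test suite.

```go
package main

import (
	"fmt"
	"os"
)

// defaultRedisPort is the port mentioned in the README.
const defaultRedisPort = "6379"

// portFrom returns the value of REDIS_SERVER_PORT as reported by getenv,
// or the default port when the variable is unset or empty. Passing the
// lookup function in makes the helper testable without touching the
// real environment.
func portFrom(getenv func(string) string) string {
	if port := getenv("REDIS_SERVER_PORT"); port != "" {
		return port
	}
	return defaultRedisPort
}

func main() {
	fmt.Println("redis-server port:", portFrom(os.Getenv))
}
```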

You can also use the ginkgo binary directly to execute the tests:

# Running the integration tests
$ ginkgo integration/
# Running the unit tests
$ ginkgo .

Presently the test coverage is 91.7%.

License

MIT License

Documentation

Types

type ArchiveWalker

type ArchiveWalker interface {
	// Walk walks through the content of io.Reader
	Walk(reader io.Reader, walker ArchiveWalkerFunc)
}

ArchiveWalker unarchives zip archives

type ArchiveWalkerFunc

type ArchiveWalkerFunc func(io.Reader)

ArchiveWalkerFunc is the callback function passed to Walk

type Cacher

type Cacher interface {
	// Cache caches the content provided by the reader
	Cache(reader io.Reader)
}

Cacher caches any content

type Downloader

type Downloader interface {
	// Download downloads the content provided by url
	Download(url string) (io.ReadCloser, error)
}

Downloader downloads content from a URL

type Extractor

type Extractor interface {
	// Extract extracts links/anchors from an io.Reader
	Extract(reader io.Reader) ([]string, error)
}

Extractor extracts content from a page

type HTTPDownloader

type HTTPDownloader func(string) (*http.Response, error)

HTTPDownloader downloads content over HTTP. Its underlying function type has the same signature as http.Get.

func (HTTPDownloader) Download

func (downloader HTTPDownloader) Download(url string) (io.ReadCloser, error)
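
Because HTTPDownloader is a function type, any function with the signature of http.Get can be converted to it, and the Download method turns that function into a Downloader. The sketch below mirrors the documented type under the hypothetical name httpDownloader. The body of Download is not shown in this documentation, so the version here (including the status check) is an assumption, and fakeGet and fetch are illustrative helpers that let the example run without a network.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// httpDownloader mirrors the documented HTTPDownloader function type.
type httpDownloader func(string) (*http.Response, error)

// Download calls the wrapped function and hands back the response body.
// Rejecting non-200 responses is an assumption of this sketch.
func (d httpDownloader) Download(url string) (io.ReadCloser, error) {
	resp, err := d(url)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, fmt.Errorf("unexpected status %d for %s", resp.StatusCode, url)
	}
	return resp.Body, nil
}

// fakeGet stands in for http.Get so the example needs no network access.
func fakeGet(url string) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Body:       io.NopCloser(strings.NewReader("content of " + url)),
	}, nil
}

// fetch downloads url with d and returns the whole body as a string.
func fetch(d httpDownloader, url string) (string, error) {
	body, err := d.Download(url)
	if err != nil {
		return "", err
	}
	defer body.Close()
	data, err := io.ReadAll(body)
	return string(data), err
}

func main() {
	// With network access the adapter would wrap the real client:
	//   downloader := httpDownloader(http.Get)
	text, err := fetch(httpDownloader(fakeGet), "http://example.com/a.zip")
	fmt.Println(text, err)
}
```

The same trick is used by http.HandlerFunc in the standard library: a method on a function type lets a plain function satisfy an interface, which also makes the downloader easy to replace in tests.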

type LinkExtractor

type LinkExtractor struct {
	// FileExt is the file extension
	FileExt string
	// Logger logs information
	Logger Logger
}

LinkExtractor extracts <a href="*.zip"> links

func (*LinkExtractor) Extract

func (extractor *LinkExtractor) Extract(reader io.Reader) ([]string, error)

Extract extracts HTML anchor links
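
To illustrate what link extraction by file extension involves, here is a self-contained sketch. It is not the package's implementation, which is not shown in this documentation: the regular expression is a stand-in chosen to stay within the standard library and is not a robust HTML parser, and matching on the suffix of the href is an assumption about how the extension is compared. The names extractLinks and extractFromString are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"regexp"
	"strings"
)

// hrefPattern captures the href value of <a> tags (case-insensitive).
var hrefPattern = regexp.MustCompile(`(?i)<a\s[^>]*?href\s*=\s*["']([^"']*)["']`)

// extractLinks returns every anchor target read from reader whose
// lower-cased value ends with ext (for example ".zip").
func extractLinks(reader io.Reader, ext string) ([]string, error) {
	page, err := io.ReadAll(reader)
	if err != nil {
		return nil, err
	}
	var links []string
	for _, match := range hrefPattern.FindAllStringSubmatch(string(page), -1) {
		if strings.HasSuffix(strings.ToLower(match[1]), ext) {
			links = append(links, match[1])
		}
	}
	return links, nil
}

// extractFromString is a convenience wrapper for in-memory pages.
func extractFromString(page, ext string) ([]string, error) {
	return extractLinks(strings.NewReader(page), ext)
}

func main() {
	page := `<a href="one.zip">1</a> <a class="x" href='two.html'>2</a>`
	links, err := extractFromString(page, ".zip")
	fmt.Println(links, err)
}
```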

type Logger

type Logger interface {
	Println(v ...interface{})
	Printf(format string, v ...interface{})
}

Logger logs messages
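
The Logger interface contains two methods that the standard library's *log.Logger already provides, so a value created with log.New can be used wherever a Logger is expected. The short example below repeats the interface so that it compiles on its own; the capture helper and the "nuvi: " prefix are illustrative choices.

```go
package main

import (
	"bytes"
	"log"
	"os"
)

// Logger is copied from the documentation above.
type Logger interface {
	Println(v ...interface{})
	Printf(format string, v ...interface{})
}

// Compile-time check that *log.Logger implements Logger.
var _ Logger = (*log.Logger)(nil)

// capture hands a buffer-backed logger to write and returns what was logged.
func capture(write func(Logger)) string {
	var buf bytes.Buffer
	write(log.New(&buf, "nuvi: ", 0))
	return buf.String()
}

func main() {
	var logger Logger = log.New(os.Stderr, "nuvi: ", log.LstdFlags)
	logger.Printf("found %d archives", 3)
}
```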

type RedisCacher

type RedisCacher struct {
	Key    string
	Client RedisClient
	Logger Logger
}

RedisCacher caches content into redis

func (*RedisCacher) Cache

func (cacher *RedisCacher) Cache(reader io.Reader)

Cache caches the content of io.Reader

type RedisClient

type RedisClient interface {
	LPush(key string, values ...interface{}) *redis.IntCmd
	LIndex(key string, index int64) *redis.StringCmd
	LLen(key string) *redis.IntCmd
}

RedisClient connects to Redis

type Scraper

type Scraper struct {
	Downloader    Downloader
	Extractor     Extractor
	ArchiveWalker ArchiveWalker
	Cacher        Cacher
	MaxConn       int
	Logger        Logger
}

Scraper scrapes web content

func (*Scraper) Scrape

func (scraper *Scraper) Scrape(url string) error

Scrape scrapes a web page
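
The MaxConn field corresponds to the max-parallel-download-conn argument, which limits how many files are downloaded in parallel. How Scrape enforces this limit is not shown in this documentation. One common way to bound concurrency in Go is a buffered channel used as a counting semaphore, sketched below. The names forEachLimited and peakConcurrency are hypothetical, and the sleep stands in for a download.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// forEachLimited calls fn once per link, running at most maxConn calls at
// the same time, and returns the first error encountered (if any).
func forEachLimited(links []string, maxConn int, fn func(link string) error) error {
	if maxConn < 1 {
		maxConn = 1
	}
	slots := make(chan struct{}, maxConn) // counting semaphore
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, link := range links {
		wg.Add(1)
		slots <- struct{}{} // blocks while maxConn calls are in flight
		go func(link string) {
			defer wg.Done()
			defer func() { <-slots }() // free the slot when done
			if err := fn(link); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(link)
	}
	wg.Wait()
	return firstErr
}

// peakConcurrency runs n dummy downloads and reports how many ran and the
// highest number that were in flight at once.
func peakConcurrency(n, maxConn int) (ran, peak int) {
	var mu sync.Mutex
	active := 0
	forEachLimited(make([]string, n), maxConn, func(string) error {
		mu.Lock()
		active++
		if active > peak {
			peak = active
		}
		mu.Unlock()
		time.Sleep(time.Millisecond) // stand-in for a download
		mu.Lock()
		active--
		ran++
		mu.Unlock()
		return nil
	})
	return ran, peak
}

func main() {
	ran, peak := peakConcurrency(20, 5)
	fmt.Printf("ran %d downloads, at most %d at once\n", ran, peak)
}
```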

type ZIPWalker

type ZIPWalker struct {
	// FileExt specifies the file extension
	FileExt string
	// Logger logs information
	Logger Logger
}

ZIPWalker unzips *.zip files

func (*ZIPWalker) Walk

func (walker *ZIPWalker) Walk(reader io.Reader, walk ArchiveWalkerFunc)

Walk unzips *.zip files

The original implementation can be found on my blog: http://blog.ralch.com/tutorial/golang-working-with-zip/

Reading a ZIP archive requires random access, so unfortunately we need to read the whole file before we unzip it
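
The random-access requirement is visible in the standard library: zip.NewReader takes an io.ReaderAt and the total size, which a plain io.Reader cannot provide. The sketch below buffers the stream in memory first and then visits the matching entries. It illustrates the constraint and is not the package's own Walk. The function names, the callback signature with an entry name, and filtering by path.Ext are assumptions.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"path"
)

// walkZip buffers the whole archive, then calls visit for each entry whose
// extension equals ext (for example ".xml").
func walkZip(reader io.Reader, ext string, visit func(name string, content io.Reader)) error {
	data, err := io.ReadAll(reader) // the full archive must be in memory
	if err != nil {
		return err
	}
	archive, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return err
	}
	for _, file := range archive.File {
		if path.Ext(file.Name) != ext {
			continue
		}
		entry, err := file.Open()
		if err != nil {
			return err
		}
		visit(file.Name, entry)
		entry.Close()
	}
	return nil
}

// roundTrip zips the given name/content pairs in memory and walks the
// result, returning "name=content" for every entry that matches ext.
func roundTrip(ext string, pairs ...string) ([]string, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for i := 0; i+1 < len(pairs); i += 2 {
		f, err := w.Create(pairs[i])
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(f, pairs[i+1]); err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	var seen []string
	err := walkZip(&buf, ext, func(name string, content io.Reader) {
		body, _ := io.ReadAll(content)
		seen = append(seen, name+"="+string(body))
	})
	return seen, err
}

func main() {
	seen, err := roundTrip(".xml", "a.xml", "<a/>", "notes.txt", "skipped")
	fmt.Println(seen, err)
}
```

For very large archives, writing the download to a temporary file and opening it with zip.OpenReader would avoid holding everything in memory, at the cost of disk I/O.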

Directories

Path Synopsis
cmd