rattler

package module
v0.0.0-...-f9ca1e0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 4, 2019 License: MIT Imports: 14 Imported by: 0

README

Rattler

Rattler is a Go library for scraping web version of Twitter.

Sample Usage

An example that prints information about first 30 tweets in the feed from github's Twitter:

// Create a cursor and session for scraping unfiltered timeline from @github.
cursor := rattler.NewGenericFeedCursor("github", rattler.FeedTypeRegular)
session := rattler.NewTwitterSession(cursor)

counter := 0

// Print information about first 30 tweets.
for result := range session.FeedIter() {
	if result.Error != nil {
		fmt.Printf("Oops: %s", result.Error)
		return
	}

	fmt.Printf("Tweet from %s with ID %d: %s\n",
		result.Tweet.Timestamp.String(),
		result.Tweet.ID,
		result.Tweet.Text,
	)

	counter++
	if counter >= 30 {
		break
	}
}

Note that session.FeedIter() will keep fetching the feed until it hits Twitter's hard limit (which is about 3000 tweets per feed) or until the feed ends and that generates a lot of HTTP requests. It's roughly 1 request per 20 tweets. So please rate-limit your requests, if you need to scrape lots of data! In the above example, it can be achieved by inserting time.Sleep(...) into the loop.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type APICompatError

type APICompatError struct {
	// contains filtered or unexported fields
}

APICompatError occurs when the process of extracting scraped data was unsuccessful. This is most likely the result of Twitter changing its internal interfaces or bug in the parser.

func (*APICompatError) Error

func (e *APICompatError) Error() string

func (*APICompatError) TwitterID

func (e *APICompatError) TwitterID() *uint64

TwitterID returns a numeric twitter ID that's associated with the error.

type FeedCursor

type FeedCursor interface {
	RetrievePage() (FeedPageReader, error)
	Seek(string) bool
}

FeedCursor is an interface for navigating a paginated Twitter feed.

type FeedFilter

type FeedFilter int

FeedFilter enum represents a feed that is a target for scraping (regular or media feed).

const (
	// FeedTypeRegular is a regular Twitter feed (contains all tweets).
	FeedTypeRegular FeedFilter = 0
	// FeedTypeMedia is a media-only feed (contains only image/video/postcard
	// tweets).
	FeedTypeMedia FeedFilter = 1
)

type FeedIterResult

type FeedIterResult struct {
	Tweet *Tweet
	Error error
}

FeedIterResult is the result of calling FeedIterResult() to retrieve a single tweet from feed.

type FeedPage

type FeedPage struct {
	// contains filtered or unexported fields
}

FeedPage stores a single page from Twitter feed.

Tweets and additional page data can be retrieved through FeedPage interface, which is implemented by this type.

func NewFeedPage

func NewFeedPage(structuredJSON interface{}) *FeedPage

NewFeedPage creates a page parser.

func (*FeedPage) GetMinPosition

func (t *FeedPage) GetMinPosition() (string, error)

GetMinPosition returns a position of this page within feed.

func (*FeedPage) GetTweets

func (t *FeedPage) GetTweets() ([]*Tweet, error)

GetTweets returns a list of tweets in page.

type FeedPageReader

type FeedPageReader interface {
	GetTweets() ([]*Tweet, error)
	GetMinPosition() (string, error)
}

FeedPageReader interface defines means of accessing paginated feed's tweets and each page's position within feed. Twitter uses pagination to sub-divide feeds that contain large number of tweets.

type GalleryDownloadResult

type GalleryDownloadResult struct {
	FileExt string
	Body    io.ReadCloser
	Error   error
}

GalleryDownloadResult is a result of calling Download() on an embedded gallery object.

type GenericFeedCursor

type GenericFeedCursor struct {
	// contains filtered or unexported fields
}

GenericFeedCursor is used for traversing any paginated feed that is not a search feed.

This cursor has a limit to how many pages it can navigate. This limit is imposed by Twitter and if it's important to retrieve every possible tweet then SearchFeedCursor should be used instead.

func NewGenericFeedCursor

func NewGenericFeedCursor(
	username string,
	ttype FeedFilter, resumeAt ...string,
) *GenericFeedCursor

NewGenericFeedCursor creates a generic feed cursor for traversing single user's Twitter feed.

func (*GenericFeedCursor) RetrievePage

func (t *GenericFeedCursor) RetrievePage() (FeedPageReader, error)

RetrievePage downloads page at the current cursor position.

Does not advance the cursor.

func (*GenericFeedCursor) Seek

func (t *GenericFeedCursor) Seek(position string) bool

Seek positions cursor at given position within feed.

type MediaDownloadError

type MediaDownloadError struct {
	// contains filtered or unexported fields
}

MediaDownloadError is an error that happens when downloading embedded media in tweet.

func (*MediaDownloadError) Cause

func (t *MediaDownloadError) Cause() error

Cause returns cause of the error.

func (*MediaDownloadError) Error

func (t *MediaDownloadError) Error() string

func (*MediaDownloadError) URL

func (t *MediaDownloadError) URL() string

URL returns a URL associated with the error.

type SearchFeedCursor

type SearchFeedCursor struct {
	// contains filtered or unexported fields
}

SearchFeedCursor is used for traversing search feeds.

func NewSearchFeedCursor

func NewSearchFeedCursor(query string, resumeAt ...string) *SearchFeedCursor

NewSearchFeedCursor creates a cursor for traversing search results returned from given query.

func (*SearchFeedCursor) RetrievePage

func (t *SearchFeedCursor) RetrievePage() (FeedPageReader, error)

RetrievePage downloads page at the current cursor position.

Does not advance the cursor.

func (*SearchFeedCursor) Seek

func (t *SearchFeedCursor) Seek(position string) bool

Seek positions cursor at given position within feed.

type Tweet

type Tweet struct {
	ID        uint64      `json:"id,string"`
	Timestamp time.Time   `json:"timestamp"`
	Text      string      `json:"text"`
	Extra     interface{} `json:"embed"`
}

Tweet represents a single tweet.

type TweetEmbeddedCard

type TweetEmbeddedCard struct {
	CardURL string
}

TweetEmbeddedCard represents a postcard embedded within tweet.

func (*TweetEmbeddedCard) MarshalJSON

func (t *TweetEmbeddedCard) MarshalJSON() ([]byte, error)

MarshalJSON returns TweetEmbeddedCard encoded as a JSON bytestring.

type TweetEmbeddedGallery

type TweetEmbeddedGallery struct {
	ImageURLs []string
}

TweetEmbeddedGallery represents multiple images embedded within tweet.

func (*TweetEmbeddedGallery) Download

func (t *TweetEmbeddedGallery) Download() <-chan GalleryDownloadResult

Download initiates a sequental download of all images within a Tweet.

Returned channel can be used to read each image's entire body and file extension.

func (*TweetEmbeddedGallery) MarshalJSON

func (t *TweetEmbeddedGallery) MarshalJSON() ([]byte, error)

MarshalJSON returns TweetEmbeddedGallery encoded as a JSON bytestring.

type TweetEmbeddedQuote

type TweetEmbeddedQuote struct {
	QuoteURL string
}

TweetEmbeddedQuote represents a quote, that references another tweet, that is embedded within tweet.

func (*TweetEmbeddedQuote) MarshalJSON

func (t *TweetEmbeddedQuote) MarshalJSON() ([]byte, error)

MarshalJSON returns TweetEmbeddedQuote encoded as a JSON bytestring.

type TweetEmbeddedVideo

type TweetEmbeddedVideo struct {
	VideoURL string
}

TweetEmbeddedVideo represents a video embedded within tweet.

func (*TweetEmbeddedVideo) MarshalJSON

func (t *TweetEmbeddedVideo) MarshalJSON() ([]byte, error)

MarshalJSON returns TweetEmbeddedVideo encoded as a JSON bytestring.

type TwitterHTTP

type TwitterHTTP struct {
	// contains filtered or unexported fields
}

TwitterHTTP is a session parameters that can be shared across multiple TwitterSession`s.

func NewTwitterHTTP

func NewTwitterHTTP() *TwitterHTTP

NewTwitterHTTP creates new session parameters.

type TwitterSession

type TwitterSession struct {
	// contains filtered or unexported fields
}

TwitterSession represents a single scraping session.

func NewTwitterSession

func NewTwitterSession(cursor FeedCursor) *TwitterSession

NewTwitterSession creates new TwitterSession based on given cursor.

func (*TwitterSession) FeedIter

func (t *TwitterSession) FeedIter(singlePage ...bool) <-chan (FeedIterResult)

FeedIter returns a channel which can be used to read all available feed tweets.

Using FeedIter() is the recommended way for scraping tweet data.

Depending on cursor used, not all available tweets may be retrieved by the iterator. Twitter puts a hard limit on a maximum number tweets in a feed. So far, the only known way to completely retrieve the entire twitter feed is to iterate over the feed using a search query with a sliding time range until no tweets are getting returned.

type URLError

type URLError struct {
	// contains filtered or unexported fields
}

URLError is an error that can happen while fetching or parsing data from the remote server.

func (*URLError) Cause

func (e *URLError) Cause() error

Cause returns inner error object.

func (*URLError) Error

func (e *URLError) Error() string

func (*URLError) URL

func (e *URLError) URL() string

URL returns URL associated with the error.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL