Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type APIResult ¶
type APIResult struct {
	// Indicates if we actually found IP addresses to probe
	Attempted bool

	// The ID response object from the Kubo API
	ID *api.IDResponse

	// The Kubo routing table. Doesn't contain multi addresses. Don't use this to continue crawling.
	RoutingTable *api.RoutingTableResponse
}
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
Crawler encapsulates a libp2p host that crawls the network.
func NewCrawler ¶
NewCrawler initializes a new crawler based on the given configuration.
func (*Crawler) StartCrawling ¶
func (c *Crawler) StartCrawling(ctx context.Context, crawlQueue *queue.FIFO[peer.AddrInfo], resultsQueue *queue.FIFO[Result])
StartCrawling enters an endless loop that consumes crawl jobs from the crawl queue and publishes their results on the results queue until it is told to stop or the crawl queue is closed.
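A minimal usage sketch. The queue constructor and the NewCrawler signature are assumptions for illustration; neither is shown on this page:

// Sketch only: queue.NewFIFO and the NewCrawler signature are assumed,
// not confirmed by this documentation.
crawlQueue := queue.NewFIFO[peer.AddrInfo]()
resultsQueue := queue.NewFIFO[Result]()

c, err := NewCrawler(cfg) // cfg is a hypothetical configuration value
if err != nil {
	log.Fatal(err)
}

// StartCrawling blocks until ctx is cancelled or the crawl queue is
// closed, so it typically runs in its own goroutine.
go c.StartCrawling(ctx, crawlQueue, resultsQueue)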
type P2PResult ¶
type P2PResult struct {
	RoutingTable *RoutingTable

	// The agent version of the crawled peer
	Agent string

	// The protocols the peer supports
	Protocols []string

	// Any error that has occurred when connecting to the peer
	ConnectError error

	// The above error transferred to a known error
	ConnectErrorStr string

	// Any error that has occurred during fetching neighbor information
	CrawlError error

	// The above error transferred to a known error
	CrawlErrorStr string

	// When was the connection attempt made
	ConnectStartTime time.Time

	// As it can take some time to handle the result we track the timestamp explicitly
	ConnectEndTime time.Time
}
type Persister ¶
type Persister struct {
// contains filtered or unexported fields
}
Persister handles the insert/upsert/update operations for a particular crawl result.
func NewPersister ¶
NewPersister initializes a new persister based on the given configuration.
func (*Persister) StartPersisting ¶
func (p *Persister) StartPersisting(ctx context.Context, persistQueue *queue.FIFO[Result], resultsQueue *queue.FIFO[*db.InsertVisitResult])
StartPersisting enters an endless loop that consumes persist jobs from the persist queue, publishing the resulting *db.InsertVisitResult values on the results queue, until it is told to stop or the persist queue is closed.
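A usage sketch mirroring StartCrawling above; the queue constructor and the NewPersister signature are again assumptions:

persistQueue := queue.NewFIFO[Result]()           // hypothetical constructor
dbQueue := queue.NewFIFO[*db.InsertVisitResult]() // hypothetical constructor

p, err := NewPersister(cfg) // cfg is a hypothetical configuration value
if err != nil {
	log.Fatal(err)
}

// Blocks until ctx is cancelled or the persist queue is closed.
go p.StartPersisting(ctx, persistQueue, dbQueue)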
type Result ¶
type Result struct {
	// The crawler that generated this result
	CrawlerID string

	// The crawled peer
	Peer peer.AddrInfo

	// The neighbors of the crawled peer
	RoutingTable *RoutingTable

	// Indicates whether the above routing table information was queried through the API.
	// The API routing table does not include MultiAddresses, so we won't use them for further crawls.
	RoutingTableFromAPI bool

	// The agent version of the crawled peer
	Agent string

	// The protocols the peer supports
	Protocols []string

	// Any error that has occurred when connecting to the peer
	ConnectError error

	// The above error transferred to a known error
	ConnectErrorStr string

	// Any error that has occurred during fetching neighbor information
	CrawlError error

	// The above error transferred to a known error
	CrawlErrorStr string

	// When was the crawl started
	CrawlStartTime time.Time

	// When did this crawl end
	CrawlEndTime time.Time

	// When was the connection attempt made
	ConnectStartTime time.Time

	// As it can take some time to handle the result we track the timestamp explicitly
	ConnectEndTime time.Time

	// Whether Kubo's RPC API is exposed
	IsExposed null.Bool
}
Result captures data that is gathered from crawling a single peer.
func (*Result) ConnectDuration ¶
ConnectDuration returns the time it took to connect to the peer. This includes dialing and the identify protocol.
func (*Result) CrawlDuration ¶
CrawlDuration returns the time it took to crawl the peer (connecting + fetching neighbors).
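Both durations follow directly from the timestamp fields on Result; a plausible sketch (the package's actual implementation is not shown here):

func (r *Result) ConnectDuration() time.Duration {
	// Time from the start of the connection attempt until handling finished.
	return r.ConnectEndTime.Sub(r.ConnectStartTime)
}

func (r *Result) CrawlDuration() time.Duration {
	// Time from the start of the crawl until neighbors were fetched.
	return r.CrawlEndTime.Sub(r.CrawlStartTime)
}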
type RoutingTable ¶
type RoutingTable struct {
	// PeerID is the peer whose neighbors (routing table entries) are in the array below.
	PeerID peer.ID

	// The peers that are in the routing table of the above peer
	Neighbors []peer.AddrInfo

	// First error that has occurred during crawling that peer
	Error error

	// ErrorBits tracks, in little-endian bit order, at which common prefix
	// lengths (CPLs) errors occurred during neighbor fetches:
	//   0000 0000 0000 0000 - no error
	//   0000 0000 0000 0001 - an error occurred at CPL 0
	//   1000 0000 0000 0001 - errors occurred at CPL 0 and CPL 15
	ErrorBits uint16
}
RoutingTable captures the routing table information and crawl error of a particular peer.
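For illustration, setting and testing a bit in such a mask could look like this (a standalone sketch, not the package's internal code):

// setErrorBit marks that an error occurred while fetching neighbors at
// the given common prefix length (CPL). Bit i corresponds to CPL i.
func setErrorBit(bits uint16, cpl int) uint16 {
	return bits | 1<<uint(cpl)
}

// hasErrorAt reports whether an error occurred at the given CPL.
func hasErrorAt(bits uint16, cpl int) bool {
	return bits&(1<<uint(cpl)) != 0
}

For example, setErrorBit(setErrorBit(0, 0), 15) yields 1000 0000 0000 0001, matching the third example in the field comment.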
func (*RoutingTable) PeerIDs ¶
func (rt *RoutingTable) PeerIDs() []peer.ID
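PeerIDs presumably projects the Neighbors slice down to the contained peer IDs; a minimal sketch of that projection:

func (rt *RoutingTable) PeerIDs() []peer.ID {
	ids := make([]peer.ID, 0, len(rt.Neighbors))
	for _, neighbor := range rt.Neighbors {
		ids = append(ids, neighbor.ID)
	}
	return ids
}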
type Scheduler ¶
type Scheduler struct {
// contains filtered or unexported fields
}
The Scheduler handles the scheduling and managing of:

a) crawlers - They consume a queue of peer address information, visit the peers, and publish their results on a separate results queue. This results queue is consumed by the scheduler and processed further.

b) persisters - They consume a separate persist queue. Essentially all results published on the crawl results queue get passed on to the persisters. However, the scheduler inspects the crawl results and builds up aggregate information for the whole crawl; letting the persisters consume the results queue directly would not allow that.
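Putting the queue types from the signatures above together, the data flow looks roughly like this (a conceptual sketch of the topology, not the Scheduler's actual code; the variable names are hypothetical):

// crawlQueue   *queue.FIFO[peer.AddrInfo]         - fed by the scheduler, consumed by crawlers
// resultsQueue *queue.FIFO[Result]                - fed by crawlers, consumed by the scheduler
// persistQueue *queue.FIFO[Result]                - fed by the scheduler, consumed by persisters
// dbQueue      *queue.FIFO[*db.InsertVisitResult] - fed by persisters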
func NewScheduler ¶
NewScheduler initializes a new libp2p host and scheduler instance.
func (*Scheduler) CrawlNetwork ¶
CrawlNetwork starts the configured number of crawlers and fills the crawl queue with bootstrap nodes to start with. These bootstrap nodes will be enriched by nodes we have seen in the past from the database. It also starts the persisters.
func (*Scheduler) TotalErrors ¶
TotalErrors counts the total number of errors, which is equivalent to the number of undialable peers during this crawl.