crawl

package
v0.0.0-...-23d3fd2 (Latest)
Warning

This package is not in the latest version of its module.

Published: Jun 26, 2023 License: Apache-2.0 Imports: 34 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type APIResult

type APIResult struct {
	// Indicates if we actually found IP addresses to probe
	Attempted bool

	// The ID response object from the Kubo API
	ID *api.IDResponse

	// The Kubo routing table. It doesn't contain multiaddresses, so don't use it to continue crawling.
	RoutingTable *api.RoutingTableResponse
}

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler encapsulates a libp2p host that crawls the network.

func NewCrawler

func NewCrawler(h *basichost.BasicHost, conf *config.Crawl) (*Crawler, error)

NewCrawler initializes a new crawler based on the given configuration.

func (*Crawler) StartCrawling

func (c *Crawler) StartCrawling(ctx context.Context, crawlQueue *queue.FIFO[peer.AddrInfo], resultsQueue *queue.FIFO[Result])

StartCrawling enters an endless loop in which it consumes crawl jobs from the crawl queue and publishes the result of each job on the results queue. The loop ends when the crawler is told to stop or when the crawl queue is closed.
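The consume-and-publish loop described above can be sketched as follows. This is a minimal illustration, not the package's implementation: it uses plain Go channels in place of `queue.FIFO` (whose API is not shown on this page), and the `job`, `result`, `startCrawling`, and `crawlAll` names are hypothetical stand-ins.

```go
package main

import (
	"context"
	"fmt"
)

// job and result are hypothetical stand-ins for peer.AddrInfo and Result.
type job struct{ peerID string }

type result struct {
	crawlerID string
	peerID    string
}

// startCrawling consumes jobs until ctx is cancelled or the jobs channel
// is closed, and publishes one result per job.
func startCrawling(ctx context.Context, id string, jobs <-chan job, results chan<- result) {
	for {
		select {
		case <-ctx.Done():
			return // told to stop
		case j, ok := <-jobs:
			if !ok {
				return // crawl queue was closed
			}
			// A real crawler would connect to the peer here.
			res := result{crawlerID: id, peerID: j.peerID}
			select {
			case results <- res:
			case <-ctx.Done():
				return
			}
		}
	}
}

// crawlAll runs a single worker over the given peers and collects the results.
func crawlAll(peers []string) []result {
	jobs := make(chan job, len(peers))
	results := make(chan result, len(peers))
	for _, p := range peers {
		jobs <- job{peerID: p}
	}
	close(jobs)

	startCrawling(context.Background(), "crawler-01", jobs, results)
	close(results)

	var out []result
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	for _, r := range crawlAll([]string{"peerA", "peerB"}) {
		fmt.Println(r.crawlerID, r.peerID)
	}
}
```

Closing the jobs channel plays the role of closing the crawl queue, and cancelling the context plays the role of telling the crawler to stop.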

type P2PResult

type P2PResult struct {
	RoutingTable *RoutingTable

	// The agent version of the crawled peer
	Agent string

	// The protocols the peer supports
	Protocols []string

	// Any error that has occurred when connecting to the peer
	ConnectError error

	// The above error mapped to a known error string
	ConnectErrorStr string

	// Any error that has occurred during fetching neighbor information
	CrawlError error

	// The above error mapped to a known error string
	CrawlErrorStr string

	// When was the connection attempt made
	ConnectStartTime time.Time

	// As it can take some time to handle the result we track the timestamp explicitly
	ConnectEndTime time.Time

	// All addresses that the remote peer claims to listen on.
	// These can differ from the addresses that we received from another peer;
	// e.g., the latter could lack quic-v1 addresses if the reporting peer
	// doesn't know about that protocol.
	ListenAddrs []ma.Multiaddr
}

type Persister

type Persister struct {
	// contains filtered or unexported fields
}

Persister handles the insert/upsert/update operations for a particular crawl result.

func NewPersister

func NewPersister(dbc db.Client, conf *config.Crawl, crawl *models.Crawl) (*Persister, error)

NewPersister initializes a new persister based on the given configuration.

func (*Persister) StartPersisting

func (p *Persister) StartPersisting(ctx context.Context, persistQueue *queue.FIFO[Result], resultsQueue *queue.FIFO[*db.InsertVisitResult])

StartPersisting enters an endless loop in which it consumes persist jobs from the persist queue. The loop ends when the persister is told to stop or when the persist queue is closed.

type Result

type Result struct {
	// The crawler that generated this result
	CrawlerID string

	// The crawled peer
	Peer peer.AddrInfo

	// The neighbors of the crawled peer
	RoutingTable *RoutingTable

	// Indicates whether the above routing table information was queried through the API.
	// The API routing table does not include MultiAddresses, so we won't use them for further crawls.
	RoutingTableFromAPI bool

	// The agent version of the crawled peer
	Agent string

	// The protocols the peer supports
	Protocols []string

	// Any error that has occurred when connecting to the peer
	ConnectError error

	// The above error mapped to a known error string
	ConnectErrorStr string

	// Any error that has occurred during fetching neighbor information
	CrawlError error

	// The above error mapped to a known error string
	CrawlErrorStr string

	// When was the crawl started
	CrawlStartTime time.Time

	// When did this crawl end
	CrawlEndTime time.Time

	// When was the connection attempt made
	ConnectStartTime time.Time

	// As it can take some time to handle the result we track the timestamp explicitly
	ConnectEndTime time.Time

	// Whether Kubo's RPC API is exposed
	IsExposed null.Bool
}

Result captures data that is gathered from crawling a single peer.

func (*Result) ConnectDuration

func (r *Result) ConnectDuration() time.Duration

ConnectDuration returns the time it took to connect to the peer. This includes dialing and the identity protocol.

func (*Result) CrawlDuration

func (r *Result) CrawlDuration() time.Duration

CrawlDuration returns the time it took to crawl the peer (connecting + fetching neighbors).

func (*Result) Merge

func (r *Result) Merge(p2pRes P2PResult, apiRes APIResult)

type RoutingTable

type RoutingTable struct {
	// PeerID is the peer whose neighbors (routing table entries) are in the array below.
	PeerID peer.ID
	// The peers that are in the routing table of the above peer
	Neighbors []peer.AddrInfo
	// First error that has occurred during crawling that peer
	Error error
	// ErrorBits tracks the CPLs at which errors occurred while fetching neighbors.
	// Bit i, counted from the least significant bit, is set if an error occurred at CPL i.
	// 0000 0000 0000 0000 - No error
	// 0000 0000 0000 0001 - An error has occurred at CPL 0
	// 1000 0000 0000 0001 - An error has occurred at CPL 0 and 15
	ErrorBits uint16
}

RoutingTable captures the routing table information and crawl error of a particular peer.
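The ErrorBits encoding from the struct comments can be reproduced with ordinary bit operations. The helpers below (`markCPLError`, `hasCPLError`) are hypothetical and not part of this package; they only illustrate how one bit per CPL fits into a `uint16`.

```go
package main

import "fmt"

// markCPLError returns bits with the bit for the given CPL set.
// Valid CPLs for a uint16 are 0 through 15.
func markCPLError(bits uint16, cpl uint) uint16 {
	return bits | 1<<cpl
}

// hasCPLError reports whether the bit for the given CPL is set.
func hasCPLError(bits uint16, cpl uint) bool {
	return bits&(1<<cpl) != 0
}

func main() {
	var bits uint16
	bits = markCPLError(bits, 0)
	bits = markCPLError(bits, 15)

	fmt.Printf("%016b\n", bits)       // prints 1000000000000001
	fmt.Println(hasCPLError(bits, 0)) // prints true
	fmt.Println(hasCPLError(bits, 7)) // prints false
}
```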

func (*RoutingTable) PeerIDs

func (rt *RoutingTable) PeerIDs() []peer.ID

type Scheduler

type Scheduler struct {
	// contains filtered or unexported fields
}

The Scheduler handles the scheduling and managing of

a) crawlers - They consume a queue of peer address information, visit the peers, and publish their
              results on a separate results queue. The scheduler consumes this results queue and
              processes the results further.
b) persisters - They consume a separate persist queue. Essentially, all results that are published on
              the crawl results queue get passed on to the persisters. However, the scheduler first
              inspects the crawl results and builds up aggregate information for the whole crawl.
              Letting the persisters consume the results queue directly would not allow that.
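The reason for the intermediate step between the two queues can be illustrated with a small sketch. It uses channels instead of `queue.FIFO`, and the `result`, `schedule`, `run`, and `sample` names are hypothetical; the aggregate shown (a count of failed connections) is only one example of information the scheduler might collect.

```go
package main

import (
	"errors"
	"fmt"
)

// result is a hypothetical stand-in for the package's Result type.
type result struct {
	peerID     string
	connectErr error
}

// schedule drains the crawl results, updates crawl-wide aggregates, and
// forwards every result to the persist queue. It closes the persist queue
// once the results channel has been closed and drained.
func schedule(results <-chan result, persistQueue chan<- result) (visited, failed int) {
	defer close(persistQueue)
	for r := range results {
		visited++
		if r.connectErr != nil {
			failed++
		}
		persistQueue <- r
	}
	return visited, failed
}

// run pushes the given results through the scheduler and counts how many
// of them reach the persist queue.
func run(in []result) (visited, failed, persisted int) {
	results := make(chan result, len(in))
	persistQueue := make(chan result, len(in))
	for _, r := range in {
		results <- r
	}
	close(results)

	visited, failed = schedule(results, persistQueue)
	for range persistQueue {
		persisted++
	}
	return visited, failed, persisted
}

// sample returns made-up crawl results for illustration only.
func sample() []result {
	return []result{
		{peerID: "peerA"},
		{peerID: "peerB", connectErr: errors.New("dial timeout")},
		{peerID: "peerC"},
	}
}

func main() {
	visited, failed, persisted := run(sample())
	fmt.Println(visited, failed, persisted) // prints 3 1 3
}
```

Every result still reaches the persisters, but the scheduler sees each one first, which is what makes crawl-wide totals possible.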

func NewScheduler

func NewScheduler(conf *config.Crawl, dbc db.Client) (*Scheduler, error)

NewScheduler initializes a new libp2p host and scheduler instance.

func (*Scheduler) CrawlNetwork

func (s *Scheduler) CrawlNetwork(ctx context.Context, bootstrap []peer.AddrInfo) error

CrawlNetwork starts the configured number of crawlers and fills the crawl queue with the given bootstrap nodes. These bootstrap nodes are enriched with nodes from the database that we have seen in the past. It also starts the persisters.

func (*Scheduler) TotalErrors

func (s *Scheduler) TotalErrors() int

TotalErrors counts the total number of errors during this crawl, which is equivalent to the number of undialable peers.
