extractor

Version: v0.0.0-...-9b24770 (Latest)
Published: Jan 20, 2025 License: MIT Imports: 10 Imported by: 2

README

rabbitcrawler

A web scraping tool that extracts structured data from websites using configurable JSON rules and multiple extraction modes.

Features

  • Configurable JSON-based scraping rules
  • Multiple extraction modes:
    • Static: Fast HTML parsing without JavaScript execution
    • Browser: Full browser emulation with JavaScript support
  • Concurrent scraping with adjustable worker count
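
The "adjustable worker count" feature can be pictured as a fixed-size worker pool. The following is a minimal sketch of that general pattern, not the tool's actual implementation: runPool, extractFunc, and result are hypothetical names introduced here, and the extraction call is replaced by a placeholder function.

```go
package main

import (
	"fmt"
	"sync"
)

// extractFunc is a hypothetical stand-in for a per-URL extraction call.
type extractFunc func(url string) (string, error)

// result pairs a URL with the outcome of processing it.
type result struct {
	URL   string
	Value string
	Err   error
}

// runPool processes urls with a fixed number of worker goroutines and
// returns one result per URL, in completion order (not input order).
func runPool(urls []string, workers int, fn extractFunc) []result {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	out := make(chan result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				v, err := fn(u)
				out <- result{URL: u, Value: v, Err: err}
			}
		}()
	}

	// Feed the jobs, then close the results channel once all workers finish.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
		wg.Wait()
		close(out)
	}()

	var results []result
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}
	results := runPool(urls, 2, func(u string) (string, error) {
		return "fetched " + u, nil // placeholder work, no network access
	})
	fmt.Println(len(results), "results")
}
```

Raising the workers argument increases parallelism; results arrive in completion order, so callers that need input order would have to sort or index them.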

Installation

go install github.com/crawlerclub/extractor/cmd/rabbitextract@latest
go install github.com/crawlerclub/extractor/cmd/rabbitcrawler@latest

Using rabbitextract

rabbitextract is a command-line tool for extracting data from a single webpage using JSON configuration rules.

Command Line Options
  • -config: Path to the config JSON file (required)
  • -url: URL to extract data from (optional if provided in config)
  • -mode: Extraction mode (optional, defaults to "auto")
    • auto: Automatically choose between static and browser mode
    • static: Fast HTML parsing without JavaScript
    • browser: Full browser emulation with JavaScript support
  • -output: Output file path (optional, defaults to stdout)
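
As an illustration of the options above, here is a minimal sketch of how such flags could be declared with Go's standard flag package. It is not the source of rabbitextract: the options struct and parseOptions are hypothetical, and rejecting unknown -mode values is an assumption about reasonable behavior rather than documented behavior.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// options holds the four documented command-line settings (hypothetical type).
type options struct {
	Config string
	URL    string
	Mode   string
	Output string
}

// parseOptions parses args using the flag names and defaults listed above.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("rabbitextract", flag.ContinueOnError)
	fs.StringVar(&o.Config, "config", "", "path to the config JSON file (required)")
	fs.StringVar(&o.URL, "url", "", "URL to extract data from")
	fs.StringVar(&o.Mode, "mode", "auto", "extraction mode: auto, static, or browser")
	fs.StringVar(&o.Output, "output", "", "output file path (empty means stdout)")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.Config == "" {
		return o, errors.New("-config is required")
	}
	switch o.Mode {
	case "auto", "static", "browser":
		// One of the three documented modes.
	default:
		// Assumption: unknown modes are treated as an error in this sketch.
		return o, fmt.Errorf("unknown mode %q", o.Mode)
	}
	return o, nil
}

func main() {
	o, err := parseOptions([]string{"-config", "config.json", "-url", "https://example.com/page"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```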
Example Usage
  1. Create a configuration file config.json:
{
  "name": "example-scraper",
  "example_url": "https://example.com/page",
  "schemas": [
    {
      "name": "articles",
      "entity_type": "article",
      "selector": "//div[@class='article']",
      "fields": [
        {
          "name": "title",
          "type": "text",
          "selector": ".//h1"
        },
        {
          "name": "content",
          "type": "text",
          "selector": ".//div[@class='content']"
        }
      ]
    }
  ]
}
  2. Run the extractor:
rabbitextract -config config.json -url "https://example.com/page" -output result.json
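
The keys in config.json line up with the JSON tags of the ExtractorConfig, Schema, and Field types listed in the Documentation section. The sketch below decodes the example configuration using local copies of those structs so that it runs without the module; loadConfig and sampleConfig are hypothetical names, and whether rabbitextract loads its configuration exactly this way is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented structs, with the same JSON tags.
type Field struct {
	Name      string  `json:"name"`
	From      string  `json:"from"`
	Selector  string  `json:"selector"`
	Pattern   string  `json:"pattern"`
	Type      string  `json:"type"`
	Attribute string  `json:"attribute,omitempty"`
	Fields    []Field `json:"fields,omitempty"`
}

type Schema struct {
	Name       string  `json:"name"`
	EntityType string  `json:"entity_type"`
	Selector   string  `json:"selector"`
	Type       string  `json:"type"`
	Fields     []Field `json:"fields,omitempty"`
}

type ExtractorConfig struct {
	Name       string   `json:"name"`
	Pattern    string   `json:"pattern"`
	ExampleURL string   `json:"example_url"`
	Mode       string   `json:"mode"`
	Schemas    []Schema `json:"schemas"`
}

// sampleConfig is the example configuration from step 1.
const sampleConfig = `{
  "name": "example-scraper",
  "example_url": "https://example.com/page",
  "schemas": [
    {
      "name": "articles",
      "entity_type": "article",
      "selector": "//div[@class='article']",
      "fields": [
        {"name": "title", "type": "text", "selector": ".//h1"},
        {"name": "content", "type": "text", "selector": ".//div[@class='content']"}
      ]
    }
  ]
}`

// loadConfig decodes a JSON configuration into an ExtractorConfig.
func loadConfig(data []byte) (ExtractorConfig, error) {
	var cfg ExtractorConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	cfg, err := loadConfig([]byte(sampleConfig))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	for _, s := range cfg.Schemas {
		fmt.Printf("schema %s (%s): %d fields\n", s.Name, s.EntityType, len(s.Fields))
	}
	// prints: schema articles (article): 2 fields
}
```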
Supported Field Types
  • text: Extract text content from an element
  • attribute: Extract the value of a specific attribute from an element
  • nested: Extract nested object with multiple fields
  • list: Extract array of items
Special Fields
  • _id: Used to generate a unique external_id for each item
  • _time: Used to set the external_time of each item

Documentation

Constants

const (
	FromURL     string = "url"
	FromElement string = "element"
)

Types

type BrowserExtractor

type BrowserExtractor struct {
	Config  ExtractorConfig
	Browser *rod.Browser
}

func NewBrowserExtractor

func NewBrowserExtractor(config ExtractorConfig) *BrowserExtractor

func (*BrowserExtractor) Extract

func (e *BrowserExtractor) Extract(url string) (*ExtractionResult, error)

func (*BrowserExtractor) ExtractWithoutCache

func (e *BrowserExtractor) ExtractWithoutCache(url string) (*ExtractionResult, error)

type ExtractedItem

type ExtractedItem map[string]interface{}

type ExtractionError

type ExtractionError struct {
	Field   string
	Message string
	URL     string
}

type ExtractionResult

type ExtractionResult struct {
	SchemaResults map[string]SchemaResult
	Errors        []ExtractionError
	FinalURL      string
}
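
A caller would typically walk SchemaResults and then inspect Errors. The sketch below does so with local copies of the documented types (including SchemaInfo and SchemaResult, listed further down); summarize is a hypothetical helper, and keying SchemaResults by schema name is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// Local copies of the documented result types.
type ExtractedItem map[string]interface{}

type ExtractionError struct {
	Field   string
	Message string
	URL     string
}

type SchemaInfo struct {
	Name       string `json:"name"`
	EntityType string `json:"entity_type"`
}

type SchemaResult struct {
	Schema SchemaInfo
	Items  []ExtractedItem
}

type ExtractionResult struct {
	SchemaResults map[string]SchemaResult
	Errors        []ExtractionError
	FinalURL      string
}

// summarize returns one line per schema (sorted by map key, because Go map
// iteration order is unspecified) followed by one line per extraction error.
func summarize(res *ExtractionResult) []string {
	names := make([]string, 0, len(res.SchemaResults))
	for name := range res.SchemaResults {
		names = append(names, name)
	}
	sort.Strings(names)

	var lines []string
	for _, name := range names {
		sr := res.SchemaResults[name]
		lines = append(lines, fmt.Sprintf("%s (%s): %d items", name, sr.Schema.EntityType, len(sr.Items)))
	}
	for _, e := range res.Errors {
		lines = append(lines, fmt.Sprintf("error in field %q at %s: %s", e.Field, e.URL, e.Message))
	}
	return lines
}

func main() {
	// Hand-built sample data; a real result would come from an Extractor.
	res := &ExtractionResult{
		SchemaResults: map[string]SchemaResult{
			"articles": {
				Schema: SchemaInfo{Name: "articles", EntityType: "article"},
				Items:  []ExtractedItem{{"title": "Hello"}},
			},
		},
		Errors:   []ExtractionError{{Field: "content", Message: "no match", URL: "https://example.com/page"}},
		FinalURL: "https://example.com/page",
	}
	for _, line := range summarize(res) {
		fmt.Println(line)
	}
}
```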

type Extractor

type Extractor interface {
	Extract(url string) (*ExtractionResult, error)
	ExtractWithoutCache(url string) (*ExtractionResult, error)
}

func NewExtractor

func NewExtractor(config ExtractorConfig) Extractor

type ExtractorConfig

type ExtractorConfig struct {
	Name       string   `json:"name"`
	Pattern    string   `json:"pattern"`
	ExampleURL string   `json:"example_url"`
	Mode       string   `json:"mode"`
	Schemas    []Schema `json:"schemas"`
}

type Field

type Field struct {
	Name      string  `json:"name"`
	From      string  `json:"from"`
	Selector  string  `json:"selector"`
	Pattern   string  `json:"pattern"`
	Type      string  `json:"type"`
	Attribute string  `json:"attribute,omitempty"`
	Fields    []Field `json:"fields,omitempty"`
}

type Schema

type Schema struct {
	Name       string  `json:"name"`
	EntityType string  `json:"entity_type"`
	Selector   string  `json:"selector"`
	Type       string  `json:"type"`
	Fields     []Field `json:"fields,omitempty"`
}

type SchemaInfo

type SchemaInfo struct {
	Name       string `json:"name"`
	EntityType string `json:"entity_type"`
}

type SchemaResult

type SchemaResult struct {
	Schema SchemaInfo
	Items  []ExtractedItem
}

type StaticExtractor

type StaticExtractor struct {
	Config ExtractorConfig
}

func NewStaticExtractor

func NewStaticExtractor(config ExtractorConfig) *StaticExtractor

func (*StaticExtractor) Extract

func (e *StaticExtractor) Extract(url string) (*ExtractionResult, error)

func (*StaticExtractor) ExtractWithoutCache

func (e *StaticExtractor) ExtractWithoutCache(url string) (*ExtractionResult, error)

Directories

Path                          Synopsis
cmd/rabbitcrawler (module)
cmd/rabbitextract (module)
