web

package

v0.28.0-beta Latest Latest Go to latest Published: Sep 24, 2024 License: MIT Imports: 20 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

README ¶

---
title: "Web"
lang: "en-US"
draft: false
description: "Learn about how to set up a VDP Web component https://github.com/instill-ai/instill-core"
---

The Web component is an operator component that allows users to scrape websites.
It can carry out the following tasks:
- [Crawl Website](#crawl-website)
- [Scrape Sitemap](#scrape-sitemap)
- [Scrape Webpage](#scrape-webpage)

## Release Stage

`Alpha`

## Configuration

The component definition and tasks are defined in the [definition.json](https://github.com/instill-ai/component/blob/main/operator/web/v0/config/definition.json) and [tasks.json](https://github.com/instill-ai/component/blob/main/operator/web/v0/config/tasks.json) files respectively.



## Supported Tasks

### Crawl Website

Crawl the website contents and manipulate html with jquery command. The sequence of jquery commands will be executed in the order of only-main-content, remove-tags, and only-include-tags.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CRAWL_WEBSITE` |
| Query (required) | `target-url` | string | The root URL to scrape. All links on this page will be scraped, and all links on those pages, and so on. |
| Allowed Domains | `allowed-domains` | array[string] | A list of domains that are allowed to be scraped. If empty, all domains are allowed. |
| Max Number of Pages (required) | `max-k` | integer | The max number of pages to return. If the number is set to 0, all pages will be returned. If the number is set to a positive integer, at most max k pages will be returned. |
| Include Link Text | `include-link-text` | boolean | Indicate whether to scrape the link and include the text of the link associated with this page in the 'link-text' field |
| Include Link HTML | `include-link-html` | boolean | Indicate whether to scrape the link and include the raw HTML of the link associated with this page in the 'link-html' field |
| Only Main Content | `only-main-content` | boolean | Only return the main content of the page excluding header, nav, footer. |
| Remove Tags | `remove-tags` | array[string] | A list of tags, classes, and ids to remove from the output. If empty, no tags will be removed. Example: 'script, .ad, #footer' |
| Only Include Tags | `only-include-tags` | array[string] | A list of tags, classes, and ids to include in the output. If empty, all tags will be included. Example: 'script, .ad, #footer' |
| Timeout | `timeout` | integer | The time to wait for the page to load in milliseconds. Min 0, Max 60000. |





| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| [Pages](#crawl-website-pages) | `pages` | array[object] | The scraped webpages |

<details>
<summary> Output Objects in Crawl Website</summary>

<h4 id="crawl-website-pages">Pages</h4>

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Link | `link` | string | The full URL to which the webpage link is pointing, e.g., http://www.example.com/foo/bar. |
| Link HTML | `link-html` | string | The scraped raw html of the link associated with this webpage link |
| Link Text | `link-text` | string | The scraped text of the link associated with this webpage link, in plain text |
| Title | `title` | string | The title of a webpage link, in plain text |
</details>

### Scrape Sitemap

Scrape the sitemap information

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_SITEMAP` |
| Sitemap URL (required) | `url` | string | The URL of the sitemap to scrape |





| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| List | `list` | array | The list of information in a sitemap |

### Scrape Webpage

Scrape the webpage contents and manipulate html with jquery command. The sequence of jquery commands will be executed in the order of only-main-content, remove-tags, and only-include-tags.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_WEBPAGE` |
| URL (required) | `url` | string | The URL to be scrape the webpage contents |
| Include HTML | `include-html` | boolean | Indicate whether to include the raw HTML of the webpage in the output |
| Only Main Content | `only-main-content` | boolean | Only return the main content of the page excluding header, nav, footer. |
| Remove Tags | `remove-tags` | array[string] | A list of tags, classes, and ids to remove from the output. If empty, no tags will be removed. Example: 'script, .ad, #footer' |
| Only Include Tags | `only-include-tags` | array[string] | A list of tags, classes, and ids to include in the output. If empty, all tags will be included. Example: 'script, .ad, #footer' |
| Timeout | `timeout` | integer | The time to wait for the page to load in milliseconds. Min 0, Max 60000. Please set it as 0 if you only want to collect static content. |





| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Content | `content` | string | The scraped plain content without html tags of the webpage |
| Markdown | `markdown` | string | The scraped markdown of the webpage |
| HTML (optional) | `html` | string | The scraped html of the webpage |
| [Metadata](#scrape-webpage-metadata) (optional) | `metadata` | object | The metadata of the webpage |
| Links on Page (optional) | `links-on-page` | array[string] | The list of links on the webpage |

<details>
<summary> Output Objects in Scrape Webpage</summary>

<h4 id="scrape-webpage-metadata">Metadata</h4>

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Description | `description` | string | The description of the webpage |
| Source URL | `source-url` | string | The source URL of the webpage |
| Title | `title` | string | The title of the webpage |
</details>
## Example Recipes

Recipe for the [Web scraper](https://instill.tech/instill-ai/pipelines/web-scraper/playground) pipeline.

```yaml
version: v1beta
component:
  scraper:
    type: web
    task: TASK_SCRAPE_WEBPAGE
    input:
      include-html: false
      only-main-content: true
      url: ${variable.url}
variable:
  url:
    title: url
    description: The url of the web you want to scrape
    instill-format: string
output:
  markdown:
    title: markdown
    value: ${scraper.output.markdown}
```

Documentation ¶

Index ¶

func Init(bc base.Component) *component
type Metadata
type PageInfo
type ScrapeSitemapInput
type ScrapeSitemapOutput
type ScrapeWebpageInput
type ScrapeWebpageOutput
type ScrapeWebsiteInput
type ScrapeWebsiteOutput
type SiteInformation
type URL
type URLSet

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Init ¶

func Init(bc base.Component) *component

Types ¶

type Metadata ¶

type Metadata struct {
	Title       string `json:"title"`
	Description string `json:"description,omitempty"`
	SourceURL   string `json:"source-url"`
}

type PageInfo ¶

type PageInfo struct {
	Link     string `json:"link"`
	Title    string `json:"title"`
	LinkText string `json:"link-text"`
	LinkHTML string `json:"link-html"`
}

type ScrapeSitemapInput ¶

type ScrapeSitemapInput struct {
	URL string `json:"url"`
}

type ScrapeSitemapOutput ¶

type ScrapeSitemapOutput struct {
	List []SiteInformation `json:"list"`
}

type ScrapeWebpageInput ¶

type ScrapeWebpageInput struct {
	URL             string   `json:"url"`
	IncludeHTML     bool     `json:"include-html"`
	OnlyMainContent bool     `json:"only-main-content"`
	RemoveTags      []string `json:"remove-tags,omitempty"`
	OnlyIncludeTags []string `json:"only-include-tags,omitempty"`
	Timeout         int      `json:"timeout,omitempty"`
}

type ScrapeWebpageOutput ¶

type ScrapeWebpageOutput struct {
	Content     string   `json:"content"`
	Markdown    string   `json:"markdown"`
	HTML        string   `json:"html"`
	Metadata    Metadata `json:"metadata"`
	LinksOnPage []string `json:"links-on-page"`
}

type ScrapeWebsiteInput ¶

type ScrapeWebsiteInput struct {
	// TargetURL: The URL of the website to scrape.
	TargetURL string `json:"target-url"`
	// AllowedDomains: The list of allowed domains to scrape.
	AllowedDomains []string `json:"allowed-domains"`
	// MaxK: The maximum number of pages to scrape.
	MaxK int `json:"max-k"`
	// IncludeLinkText: Whether to include the scraped text of the scraped web page.
	IncludeLinkText *bool `json:"include-link-text"`
	// IncludeLinkHTML: Whether to include the scraped HTML of the scraped web page.
	IncludeLinkHTML *bool `json:"include-link-html"`
	// OnlyMainContent: Whether to scrape only the main content of the web page. If true, the scraped text wull exclude the header, nav, footer.
	OnlyMainContent bool `json:"only-main-content"`
	// RemoveTags: The list of tags to remove from the scraped text.
	RemoveTags []string `json:"remove-tags"`
	// OnlyIncludeTags: The list of tags to include in the scraped text.
	OnlyIncludeTags []string `json:"only-include-tags"`
	// Timeout: The number of milliseconds to wait before scraping the web page. Min 0, Max 60000.
	Timeout int `json:"timeout"`
}

ScrapeWebsiteInput defines the input of the scrape website task

type ScrapeWebsiteOutput ¶

type ScrapeWebsiteOutput struct {
	// Pages: The list of pages that were scraped.
	Pages []PageInfo `json:"pages"`
}

ScrapeWebsiteOutput defines the output of the scrape website task

type SiteInformation ¶

type SiteInformation struct {
	Loc string `json:"loc"`
	// Follow ISO 8601 format
	LastModifiedTime string  `json:"lastmod"`
	ChangeFrequency  string  `json:"changefreq,omitempty"`
	Priority         float64 `json:"priority,omitempty"`
}

type URL ¶

type URL struct {
	Loc        string `xml:"loc"`
	LastMod    string `xml:"lastmod"`
	ChangeFreq string `xml:"changefreq"`
	Priority   string `xml:"priority"`
}

type URLSet ¶

type URLSet struct {
	XMLName xml.Name `xml:"urlset"`
	Urls    []URL    `xml:"url"`
}

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL