ZFSE (Zone File Search Engine)
Personalized Search, Infinite Exploration
ZFSE is an independent, single-binary search engine that can be self-hosted on the cheapest, most budget-friendly
servers.
Overview
Have you ever dreamed of building your own search engine with its very own search index, but felt held back by the
complexities and expenses involved?
Fear not! ZFSE provides an effortless way to construct a self-hosted search engine through the power of a single binary,
putting the control of search indexing in your hands.
Building your own customized search engine is as easy as running zfse run
The main objectives of this project are:
- Developing a straightforward, single binary search engine complete with its own crawler, indexer, and rankers.
- Minimizing RAM & CPU requirements to ensure affordability and flexibility. ZFSE is capable of running on
budget-friendly $5/month servers.
- Offering a seamless customization experience through a single config.toml file, empowering users to quickly create
their personalized search indexes.
- Presenting an intuitive plugin system that allows users to effortlessly incorporate new filters, indexers, and rankers
into their own search engine. (🚧 Under Development)
Quick Start
To initialize zfse, simply use the following command:
zfse init
This command creates all the necessary files and subdirectories for ZFSE.
Your folder structure should now look like this:
.
├── zfse                      # Base ZFSE executable
├── config.toml               # ZFSE configuration
├── cache                     # Various cache and output files produced by ZFSE
│   └── ...
├── plugins                   # Filter, Indexer, Ranker plugins for ZFSE
└── zone-files                # ICANN zone files to bootstrap ZFSE
    └── example_zone_file.txt # Example zone file
Now, simply start ZFSE by running:
zfse run -query "MMORPG"
The command above will initiate a web crawl using the example zone file.
Once ZFSE has finished the crawl and the random ranking pass, the results will be recorded
under ./cache/ranking/output.txt.
You are now ready to customize ZFSE and build your own search index!
ZFSE makes use of ICANN zone files to bootstrap the search index. First, you need to access and download the zone file
you're interested in from ICANN, using their Centralized Zone Data Service here.
A good TLD to start with is .dev. Once you have access, extract the tar.gz archive from ICANN to the ./zone-files
folder.
You can add as many TLDs as you want to ./zone-files, and ZFSE will utilize them all automatically. However, keep in
mind that larger TLDs will require more disk space, particularly after crawling and indexing are complete.
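For reference, a zone file is a plain-text list of DNS resource records, one per line, which ZFSE scans to discover
the domains of a TLD. The lines below are purely illustrative (the domain and name servers are made up), in the
general format that CZDS exports use:

example.dev.  86400  in  ns  ns1.some-dns-provider.net.
example.dev.  86400  in  ns  ns2.some-dns-provider.net.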
Edit the config.toml file to customize your search index. Refer to the configuration section below.
NOTE: ZFSE will crawl the entire .dev TLD and record any <meta name="description" .../> tags it encounters. Therefore,
ensure that your host has enough available space (~30-40GB).
WARNING: It is strongly advised against running the complete crawl on your personal computer. Some Internet Service
Providers (ISPs) might create problems, since ZFSE initiates a large number of connections while crawling an entire
Top-Level Domain (TLD), so it is preferable to host ZFSE on a cloud provider.
Additionally, be aware of security concerns: ZFSE uses the default Go standard library parsers to analyze crawled
websites, and security vulnerabilities within the Go standard library could lead to the compromise of your host during
the crawl.
Architecture & Configuration
ZFSE is configured by a single config.toml file.
ZFSE operates by reading the configuration file and executing the specified Task Handlers sequentially. The process is
divided into the following main components:
- Pre-Crawl Filtering: ZFSE initiates all pre-crawl filters defined by the [[PreCrawlFilters]] tag. Each pre-crawl
filter processes the TLD zone file line by line, filtering the content and forwarding the output to the subsequent
pre-crawl filters. These filters can discard a domain or append new fields to it; added fields can be accessed and
utilized by subsequent Task Handlers.
- Crawling: At the moment, ZFSE concentrates on crawling only the index page of websites. The crawler first verifies
the validity of the DNS record and checks whether the website has a robots.txt file available. Next, it parses the
robots.txt file to see if the ZFSE agent is allowed to index the website. The crawler then captures the headers and
HTML content of the website's index page and forwards them to the post-crawl filters.
- Post-Crawl Filtering: Similar to pre-crawl filtering, post-crawl filters are designated by the [[PostCrawlFilters]]
tag. The filters are executed sequentially, passing their output to the next post-crawl filter in line.
- Indexing: Unlike filters, indexers create independent index databases without sending their output to the next
indexer in line. Indexers can be configured using the [Indexer] tag.
- Ranking: Rankers receive a user query and leverage the data collected by filters and indexers to determine the
search result ranking. Rankers can be customized using the [[Rankers]] tag.
After the indexing process is complete, ZFSE will rely solely on Indexers and Rankers to produce search results.
Therefore, indexing a Top-Level Domain (TLD) just once is sufficient, unless there is a need to modify ZFSE's pre-index
configuration.
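As a rough illustration of how these Task Handlers could be wired together, consider the config.toml sketch below.
Only the section tags ([[PreCrawlFilters]], [[PostCrawlFilters]], [Indexer], [[Rankers]]) come from the description
above; every handler name, field, and value is a hypothetical placeholder rather than ZFSE's actual schema:

# Hypothetical sketch -- the section tags come from the docs above, everything else is invented.

[[PreCrawlFilters]]
name = "domain-length-filter"        # e.g. drop unusually long domains from the zone file
max_length = 32

[[PreCrawlFilters]]
name = "keyword-tagger"              # e.g. append a "keywords" field for later Task Handlers

[[PostCrawlFilters]]
name = "meta-description-extractor"  # e.g. pull the meta description out of the crawled HTML

[Indexer]
name = "meta-description-index"      # e.g. build an index database over the extracted text

[[Rankers]]
name = "random-ranker"               # e.g. the simple random ranking used in the Quick Start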
A default configuration file can be found here.
The most important options are:
- concurrent_connections: Controls the number of concurrent connections to use during the crawl. Adjust this setting
according to the system's CPU, RAM, and the ulimits imposed by the operating system.
- content_read_limit_in_bytes: Specifies the amount of data the crawler should read and record. Adjust this setting to
manage disk usage. ZFSE is capable of parsing partially read HTML content.
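As an illustration, these two options could appear in config.toml roughly as follows; the values and their placement
in the file are guesses for the sake of example, not recommended defaults:

concurrent_connections = 64             # scale with CPU, RAM, and the OS ulimit on open files
content_read_limit_in_bytes = 65536     # stop reading a page after ~64 KB to limit disk usage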
Task Handlers
Development Guide
Roadmap
Please note that ZFSE is in its early stages of development (🚧). It is recommended to wait for the v0.7 release before
handling large TLDs like .com. The planned milestones are as follows:
- v0.2: Web UI
- v0.3: ICANN Zone File Downloader
- v0.4: docker-compose.yml
- v0.5: Additional Indexers & Rankers
- v0.6: Plugins
- v0.7: Ability to back up & restore unfinished worksets
- v0.8: Customizable configuration via Web UI
- v0.9: CPU Profilers (runtime/pprof) and final optimization pass
- v1.0: Increased unit test coverage & release
- v1.x: TBD