Full Text Search
A simple full-text search engine, written in Go, for Wikipedia abstract dumps. This project demonstrates how to load, index, and search Wikipedia abstract data using a custom-built indexing and search mechanism.
Features
- Load Wikipedia Abstract Dumps: Efficiently loads and parses Wikipedia abstract dump files in XML format.
- Indexing: Creates an index of documents for fast search operations.
- Full-Text Search: Supports searching for specific queries across all indexed documents.
- Performance Logging: Logs the time taken for loading, indexing, and searching operations.
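The indexing and search features above are commonly built on an inverted index, which maps each token to the documents that contain it. The following is a minimal sketch of that idea, not the project's actual code: the `index` type, the `add`, `search`, and `intersect` names, and the whitespace-only `tokenize` helper are all assumptions made for illustration. It also times each step in the spirit of the performance logging feature.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// index maps each token to the ascending list of document IDs containing it.
// (Hypothetical type; the project's own index may be shaped differently.)
type index map[string][]int

// tokenize is a simplified stand-in: lowercase, then split on whitespace.
func tokenize(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// add indexes docs, using each slice position as the document ID.
func (idx index) add(docs []string) {
	for id, text := range docs {
		for _, tok := range tokenize(text) {
			ids := idx[tok]
			// Skip tokens repeated within the same document.
			if len(ids) > 0 && ids[len(ids)-1] == id {
				continue
			}
			idx[tok] = append(ids, id)
		}
	}
}

// intersect returns the IDs present in both sorted slices.
func intersect(a, b []int) []int {
	out := []int{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// search returns the IDs of documents containing every token of the query.
func (idx index) search(query string) []int {
	var result []int
	for n, tok := range tokenize(query) {
		ids, ok := idx[tok]
		if !ok {
			return nil // a token missing from the index means no document matches
		}
		if n == 0 {
			result = ids
		} else {
			result = intersect(result, ids)
		}
	}
	return result
}

func main() {
	docs := []string{
		"a small wild cat native to africa",
		"a large domestic dog",
		"the wild cat is small",
	}
	idx := make(index)

	start := time.Now()
	idx.add(docs)
	fmt.Printf("indexed %d documents in %v\n", len(docs), time.Since(start))

	start = time.Now()
	ids := idx.search("Small wild cat")
	fmt.Printf("search found %v in %v\n", ids, time.Since(start))
}
```

Because document IDs are appended in increasing order, every posting list stays sorted, which is what lets `intersect` walk two lists in a single pass.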
Installation
- Clone the Repository:
  git clone https://github.com/MunishMummadi/full-text-search.git
  cd full-text-search
- Install Dependencies:
  Make sure you have Go installed (version 1.16 or higher recommended). Then, run:
  go mod tidy
Usage
Command-Line Interface
You can run the program using the go run command. The program accepts two flags:
- -p: Path to the Wikipedia abstract dump (default: enwiki-latest-abstract1.xml.gz).
- -q: Search query (default: Small wild cat).
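For readers curious how two such flags can be wired up, here is a minimal sketch using Go's standard flag package. The flag names and defaults come from the list above; the `parseFlags` helper and its variable names are assumptions for illustration and may not match what main.go does.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseFlags reads -p and -q from args, falling back to the documented defaults.
// (Hypothetical helper; the real program may register its flags differently.)
func parseFlags(args []string) (dumpPath, query string, err error) {
	fs := flag.NewFlagSet("full-text-search", flag.ContinueOnError)
	fs.StringVar(&dumpPath, "p", "enwiki-latest-abstract1.xml.gz", "path to the Wikipedia abstract dump")
	fs.StringVar(&query, "q", "Small wild cat", "search query")
	err = fs.Parse(args)
	return dumpPath, query, err
}

func main() {
	dumpPath, query, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2) // the flag package has already printed the error and usage
	}
	fmt.Printf("dump:  %s\nquery: %s\n", dumpPath, query)
}
```

Using a dedicated FlagSet rather than the package-level flags keeps the parsing logic easy to call from tests with an arbitrary argument slice.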
Example Usage
go run main.go -p /path/to/enwiki-latest-abstract1.xml.gz -q "Your search query"
Example
go run main.go -p enwiki-latest-abstract1.xml.gz -q "Small wild cat"
This command will load the Wikipedia abstract dump from the specified path, index the documents, and search for the query "Small wild cat". The results will be logged to the console.
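The loading step can be sketched with the standard compress/gzip and encoding/xml packages. This is an illustration under stated assumptions rather than the project's LoadDocuments function: it assumes the dump is a sequence of doc elements with title, url, and abstract children, and the `document` type, the helper names, and the `sampleFeed` data are all made up for the example.

```go
package main

import (
	"compress/gzip"
	"encoding/xml"
	"io"
	"log"
	"os"
	"strings"
	"time"
)

// document holds the fields read from one <doc> element (assumed layout).
type document struct {
	Title string `xml:"title"`
	URL   string `xml:"url"`
	Text  string `xml:"abstract"`
	ID    int    `xml:"-"` // assigned after decoding, not read from the XML
}

// sampleFeed is invented data that imitates the assumed dump layout.
const sampleFeed = `<feed>
<doc><title>Wikipedia: Wildcat</title><url>https://en.wikipedia.org/wiki/Wildcat</url><abstract>The wildcat is a small wild cat.</abstract></doc>
<doc><title>Wikipedia: Dog</title><url>https://en.wikipedia.org/wiki/Dog</url><abstract>The dog is a domesticated animal.</abstract></doc>
</feed>`

// decodeDocuments parses an uncompressed abstract feed from r.
func decodeDocuments(r io.Reader) ([]document, error) {
	var feed struct {
		Docs []document `xml:"doc"`
	}
	if err := xml.NewDecoder(r).Decode(&feed); err != nil {
		return nil, err
	}
	for i := range feed.Docs {
		feed.Docs[i].ID = i // use the position in the dump as the document ID
	}
	return feed.Docs, nil
}

// decodeString is a convenience wrapper for trying the decoder on in-memory XML.
func decodeString(s string) ([]document, error) {
	return decodeDocuments(strings.NewReader(s))
}

// loadDocuments opens a gzip-compressed dump file and decodes it.
func loadDocuments(path string) ([]document, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	return decodeDocuments(gz)
}

func main() {
	start := time.Now()
	var docs []document
	var err error
	if len(os.Args) > 1 {
		docs, err = loadDocuments(os.Args[1]) // path to a real .xml.gz dump
	} else {
		docs, err = decodeString(sampleFeed) // fall back to the inline sample
	}
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("loaded %d documents in %v", len(docs), time.Since(start))
}
```

Run without arguments, the sketch decodes the inline sample; given a path, it reads that file instead. Logging the elapsed time with time.Since mirrors the kind of timing output the program reports.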
Project Structure
full-text-search/
├── main.go                  # Entry point of the program
├── go.mod                   # Go module file
├── go.sum                   # Go dependency checksums
└── utils/
    ├── document.go          # Document struct and LoadDocuments function
    ├── filter.go            # Filtering functionality for documents
    ├── index.go             # Indexing functionality for documents
    ├── tokenizer.go         # Tokenizer for processing text
    ├── filter_test.go       # Tests for filter.go
    ├── index_test.go        # Tests for index.go
    └── tokenizer_test.go    # Tests for tokenizer.go
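To give a feel for what a tokenizer and a filter chain typically do in a search engine like this, here is a small sketch. It is not the code in tokenizer.go or filter.go: the function names and the tiny stopword list are assumptions, and a stemming filter is left out to keep the example within the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize splits text on any rune that is neither a letter nor a number.
func tokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

// lowercaseFilter normalizes tokens so that "Cat" and "cat" match.
func lowercaseFilter(tokens []string) []string {
	out := make([]string, len(tokens))
	for i, tok := range tokens {
		out[i] = strings.ToLower(tok)
	}
	return out
}

// stopwords is a deliberately tiny, illustrative list.
var stopwords = map[string]struct{}{
	"a": {}, "and": {}, "of": {}, "the": {}, "to": {},
}

// stopwordFilter drops very common words that add little to a search.
func stopwordFilter(tokens []string) []string {
	out := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if _, skip := stopwords[tok]; !skip {
			out = append(out, tok)
		}
	}
	return out
}

// analyze runs the tokenizer followed by each filter in order.
func analyze(text string) []string {
	return stopwordFilter(lowercaseFilter(tokenize(text)))
}

func main() {
	fmt.Println(analyze("A Small wild cat, and the Serval!"))
	// prints: [small wild cat serval]
}
```

Applying the same analysis to both documents and queries is what makes a query such as "Small wild cat" match text that differs in case or punctuation.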
License
This project is licensed under the MIT License. See the LICENSE file for more details.
Acknowledgments
- The project uses the Snowball stemming algorithm.
- Testify is used for unit testing.
- Inspired by Wikipedia and its open data.