sininen

package module
v0.0.0-...-1624685
Published: Apr 25, 2022 License: MIT Imports: 11 Imported by: 0

README

Sininen

Sininen's goal is to provide tools for running natural language queries on text data. For now it focuses on searching through subtitles extracted from YouTube channels.

Usage

Download subtitles from a channel

The download script needs the channel name. To find it, open a video from the channel, such as https://www.youtube.com/watch?v=aq4G-7v-_xI, and click the channel name (here Historia Civilis), which opens the page https://www.youtube.com/channel/UCv_vLHiWVBh_FR9vbeuiY-A. Then click the HOME tab, which changes the URL to https://www.youtube.com/c/HistoriaCivilis/featured. The channel name is the string after /c/, here HistoriaCivilis.
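The /c/ rule above can also be applied programmatically. The following is a minimal sketch using only the Go standard library; `channelName` is a hypothetical helper written for illustration and is not part of Sininen.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// channelName returns the path element that follows /c/ in a YouTube
// channel URL. Hypothetical helper, not part of the Sininen package.
func channelName(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	// "/c/HistoriaCivilis/featured" -> ["c", "HistoriaCivilis", "featured"]
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(parts) < 2 || parts[0] != "c" || parts[1] == "" {
		return "", fmt.Errorf("no /c/<name> element in %q", rawURL)
	}
	return parts[1], nil
}

func main() {
	name, err := channelName("https://www.youtube.com/c/HistoriaCivilis/featured")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name) // prints "HistoriaCivilis"
}
```

A /channel/ URL such as the one above has no /c/ element, so the helper returns an error for it instead of guessing.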

Download the subtitles for HistoriaCivilis with:

./download-channel-subtitles.sh HistoriaCivilis

Build YouTube CLI

go get
go build cli/search-yt.go

Search through channel subtitles

./search-yt HistoriaCivilis "Crossing the Rubicon"

Requirements

The usage instructions above should work on a recent Linux distribution provided the following packages are installed and reasonably up-to-date:

  • Go
  • youtube-dl

Some adjustments might be needed to make it work on another OS.

Documentation

Functions

func CreateSubtitleIndex

func CreateSubtitleIndex(folder, lang string) (bleve.Index, error)

CreateSubtitleIndex opens, parses and indexes the subtitles file in the given folder and the given language. The created index is saved inside the folder.

func OpenTranscriptionIndex

func OpenTranscriptionIndex(folder, lang string) (bleve.Index, error)

OpenTranscriptionIndex opens a stored subtitle index, such as the one created by CreateSubtitleIndex.

func TextQuery

func TextQuery(query string, index bleve.Index) (*bleve.SearchResult, error)

TextQuery runs a plain text search against a transcription index.

Types

type ScoredSegment

type ScoredSegment struct {
	SegmentHit
	Score float64 `json:"score"`
	ID    string  `json:"id"`
}

ScoredSegment is a SegmentHit with its score and its transcription ID.
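To illustrate what the struct tags imply, here is a small sketch of how a ScoredSegment would serialize with encoding/json. The types are local copies of the documented definitions so the example compiles on its own, and the field values and the `encode` helper are made up for illustration. Two standard-library behaviors matter here: the fields of the embedded SegmentHit are promoted into the top-level JSON object, and time.Duration values are written as integer nanoseconds.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Local copies of the documented types, so the sketch is self-contained.
type SegmentHit struct {
	StartTime   time.Duration `json:"start_time"`
	EndTime     time.Duration `json:"end_time"`
	SortedTerms []string      `json:"sorted_terms"`
}

type ScoredSegment struct {
	SegmentHit
	Score float64 `json:"score"`
	ID    string  `json:"id"`
}

// encode is a hypothetical helper that marshals a ScoredSegment to JSON.
func encode(s ScoredSegment) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	s := ScoredSegment{
		SegmentHit: SegmentHit{
			StartTime:   1500 * time.Millisecond,
			EndTime:     4 * time.Second,
			SortedTerms: []string{"crossing", "rubicon"},
		},
		Score: 0.5,          // illustrative value
		ID:    "example-id", // illustrative value
	}
	out, err := encode(s)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Durations appear as nanoseconds: 1.5s -> 1500000000.
	fmt.Println(out)
}
```

A consumer in another language therefore has to divide start_time and end_time by 1e9 to obtain seconds.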

type SearchResult

type SearchResult struct {
	ID       string
	Score    float64
	Segments []SegmentHit // Segments that matched with the search query.
}

SearchResult represents a transcription file that matched with a search query.

type SearchResultSequence

type SearchResultSequence []SearchResult

SearchResultSequence represents a sequence of transcription files that matched with a search query.

func AssembleSearchResults

func AssembleSearchResults(bleveResults *bleve.SearchResult) (SearchResultSequence, error)

AssembleSearchResults builds transcription search results with timestamp information using raw bleve search results.

func (SearchResultSequence) ScoredSegments

func (srs SearchResultSequence) ScoredSegments() []ScoredSegment

ScoredSegments flattens a search results hierarchy by returning the scored segments, sorted by score.
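The flattening described above could look roughly like the sketch below. It is not the package's implementation. It assumes that every segment inherits the score and ID of the result it belongs to, and that "sorted by score" means highest score first; neither detail is stated in the documentation. The types are simplified local copies and `flatten` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Simplified local copies of the documented types (JSON tags omitted).
type SegmentHit struct {
	StartTime   time.Duration
	EndTime     time.Duration
	SortedTerms []string
}

type SearchResult struct {
	ID       string
	Score    float64
	Segments []SegmentHit
}

type ScoredSegment struct {
	SegmentHit
	Score float64
	ID    string
}

// flatten turns a result hierarchy into one list of scored segments.
// Assumption: each segment takes its parent's score and ID.
// Assumption: the list is ordered by descending score.
func flatten(results []SearchResult) []ScoredSegment {
	var out []ScoredSegment
	for _, r := range results {
		for _, seg := range r.Segments {
			out = append(out, ScoredSegment{SegmentHit: seg, Score: r.Score, ID: r.ID})
		}
	}
	// A stable sort keeps segments of the same result in their original order.
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	results := []SearchResult{
		{ID: "a", Score: 0.2, Segments: []SegmentHit{{StartTime: 1 * time.Second}}},
		{ID: "b", Score: 0.9, Segments: []SegmentHit{
			{StartTime: 5 * time.Second},
			{StartTime: 9 * time.Second},
		}},
	}
	for _, s := range flatten(results) {
		fmt.Println(s.ID, s.Score, s.StartTime)
	}
}
```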

type SegmentHit

type SegmentHit struct {
	StartTime   time.Duration `json:"start_time"`
	EndTime     time.Duration `json:"end_time"`
	SortedTerms []string      `json:"sorted_terms"` // Terms in the segment that matched with the search query, sorted in increasing order.
}

SegmentHit represents a transcription segment that matched with a search query.

func (SegmentHit) NDistinctTerms

func (sh SegmentHit) NDistinctTerms() int

NDistinctTerms returns the number of distinct terms in the segment that matched with the search query.
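Because SortedTerms is documented as sorted, equal terms sit next to each other, so distinct terms can be counted in a single pass without a map. The sketch below shows that idea under this assumption; `countDistinctSorted` is a hypothetical function and may differ from what NDistinctTerms actually does.

```go
package main

import "fmt"

// countDistinctSorted counts distinct values in a sorted slice by counting
// the positions where a term differs from its predecessor.
// It relies on the input being sorted; on unsorted input it overcounts.
func countDistinctSorted(terms []string) int {
	n := 0
	for i, t := range terms {
		if i == 0 || t != terms[i-1] {
			n++
		}
	}
	return n
}

func main() {
	terms := []string{"crossing", "rubicon", "rubicon"}
	fmt.Println(countDistinctSorted(terms)) // prints 2
}
```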

type Transcription

type Transcription struct {
	Words    string
	Segments []float64
}

Transcription stores the whole transcription text, as well as all the segments, in a form usable by bleve. Segments is a slice of float64 rather than a slice of transcriptionSegment because bleve does not support time.Duration or int, only float64.
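One way to fit time spans into a []float64 is to store them as consecutive start and end values in seconds. The sketch below shows that round trip. The layout, the unit, the `segment` type and both helper functions are assumptions made for illustration; the documentation does not describe how Sininen actually arranges the Segments slice.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"time"
)

// segment is a hypothetical stand-in for the unexported transcriptionSegment.
type segment struct {
	Start, End time.Duration
}

// toFloats flattens segments into [start0, end0, start1, end1, ...] in seconds.
func toFloats(segs []segment) []float64 {
	out := make([]float64, 0, 2*len(segs))
	for _, s := range segs {
		out = append(out, s.Start.Seconds(), s.End.Seconds())
	}
	return out
}

// toDuration converts seconds to a Duration, rounding to the nearest nanosecond.
func toDuration(seconds float64) time.Duration {
	return time.Duration(math.Round(seconds * float64(time.Second)))
}

// fromFloats rebuilds segments from the flat layout produced by toFloats.
func fromFloats(flat []float64) ([]segment, error) {
	if len(flat)%2 != 0 {
		return nil, errors.New("odd number of values: expected start/end pairs")
	}
	segs := make([]segment, 0, len(flat)/2)
	for i := 0; i < len(flat); i += 2 {
		segs = append(segs, segment{Start: toDuration(flat[i]), End: toDuration(flat[i+1])})
	}
	return segs, nil
}

func main() {
	flat := toFloats([]segment{
		{Start: 1500 * time.Millisecond, End: 4 * time.Second},
		{Start: 4 * time.Second, End: 6250 * time.Millisecond},
	})
	fmt.Println(flat) // prints [1.5 4 4 6.25]

	segs, err := fromFloats(flat)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(segs[0].Start, segs[1].End) // prints 1.5s 6.25s
}
```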

func ParseSubtitleFile

func ParseSubtitleFile(filename string) (*Transcription, error)

ParseSubtitleFile transforms a subtitle file into a Transcription usable by bleve.
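The subtitle format is not stated here. Assuming the downloaded files are WebVTT, a parser has to read cue timing lines such as "00:00:01.500 --> 00:00:04.000". The sketch below handles only that one step; `parseTimestamp` and `parseCueTiming` are hypothetical helpers and do not reflect how ParseSubtitleFile is written.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
	"time"
)

// parseTimestamp parses "hh:mm:ss.mmm" or "mm:ss.mmm" to millisecond precision.
func parseTimestamp(s string) (time.Duration, error) {
	parts := strings.Split(s, ":")
	if len(parts) == 2 {
		parts = append([]string{"0"}, parts...) // hours are optional
	}
	if len(parts) != 3 {
		return 0, fmt.Errorf("malformed timestamp %q", s)
	}
	h, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, err
	}
	m, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, err
	}
	sec, err := strconv.ParseFloat(parts[2], 64)
	if err != nil {
		return 0, err
	}
	ms := math.Round((float64(h)*3600 + float64(m)*60 + sec) * 1000)
	return time.Duration(ms) * time.Millisecond, nil
}

// parseCueTiming extracts start and end from "<start> --> <end> [settings]".
// Any cue settings after the end timestamp are ignored.
func parseCueTiming(line string) (start, end time.Duration, err error) {
	fields := strings.Fields(line)
	if len(fields) < 3 || fields[1] != "-->" {
		return 0, 0, fmt.Errorf("not a cue timing line: %q", line)
	}
	if start, err = parseTimestamp(fields[0]); err != nil {
		return 0, 0, err
	}
	if end, err = parseTimestamp(fields[2]); err != nil {
		return 0, 0, err
	}
	return start, end, nil
}

func main() {
	start, end, err := parseCueTiming("00:00:01.500 --> 00:00:04.000 align:start position:0%")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(start, end) // prints 1.5s 4s
}
```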

func (Transcription) BleveType

func (Transcription) BleveType() string

BleveType tells bleve what type of document a Transcription is.
