vidsim

v1.11.0
Published: Feb 14, 2025. License: BSD-3-Clause.

vidsim - find similar/duplicate videos in your collection

vidsim is a tool that scans a set of video files and identifies videos that are "similar." Since frame-by-frame video comparison is prohibitively slow and doesn't scale to large collections, this tool takes a pragmatic approach: it extracts a single frame from each video and compares those frames. While this is less reliable, it works well on typical personal video collections and does not take months to run.

Installation

go install github.com/abelikoff/vidsim@latest

Note: vidsim uses the ffmpeg tool for frame generation, so make sure it is installed as well.

Operation

The vidsim command supports multiple actions described below.

Find similar videos

In its basic form, vidsim scans the specified directories and compares all video files, producing a JSON report of all files considered "similar":

vidsim process <dir1> <dir2> ...

While processing data, vidsim caches the computed artifacts and reuses the cached values in future re-runs. By default, state is kept in the .vidsim directory. One can customize the name with the -d option:

vidsim -d .my.cache.dir process <dir1> <dir2> ...

Controlling frame generation

By default, vidsim calls ffmpeg to extract a single image frame from each video file. This behavior can be customized by specifying another external program or script via the -G option. Such a program is passed 3 command line arguments:

  • A video file to extract the frame from.
  • Name of the output frame file.
  • A time offset (in format MM:SS) at which to extract.

Upon normal completion, the program should exit with code 0.
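As an illustration of the contract above, here is a minimal sketch of a custom frame generator written in Go. The program name (framegen), the ffmpegArgs helper and the particular ffmpeg flags are assumptions made for this example; they are not necessarily what vidsim itself runs by default.

```go
// framegen: a hypothetical custom frame generator for vidsim's -G option.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// ffmpegArgs builds an ffmpeg argument list that grabs a single frame at
// the given offset. The flag choice is an assumption for this sketch.
func ffmpegArgs(video, frame, offset string) []string {
	return []string{
		"-y",          // overwrite the output file if it already exists
		"-ss", offset, // seek to the requested offset (MM:SS)
		"-i", video,   // input video file
		"-frames:v", "1", // emit exactly one video frame
		frame, // output frame file
	}
}

func main() {
	// vidsim passes: <video file> <output frame file> <offset as MM:SS>.
	if len(os.Args) != 4 {
		fmt.Fprintln(os.Stderr, "usage: framegen <video> <output-frame> <MM:SS>")
		os.Exit(2)
	}
	cmd := exec.Command("ffmpeg", ffmpegArgs(os.Args[1], os.Args[2], os.Args[3])...)
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "framegen:", err)
		os.Exit(1)
	}
	// Returning from main exits with code 0, signalling success.
}
```

Once built, the resulting binary could be handed to vidsim through the -G option. A shell script that accepts the same three arguments would work equally well.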

Controlling the matching logic

Using external program for comparison

By default vidsim uses the images4 library to compare frames. One can instead use any external program of choice by passing it to vidsim via the --external_comparison_tool (or -C) option. The program is expected to adhere to the following protocol:

  • Take two image files as command line parameters.
  • Upon normal completion terminate with exit code 0.
  • Print the result, a comparison score, to stdout.
  • The score is a floating point number between 0 and 1, where 0 means that the two images don't match at all and 1 means that they match fully.

Similarity threshold

Both the internal logic and external comparison programs return a comparison score for a pair of images: a floating point value between 0 (the images don't match at all) and 1 (the images match fully). The similarity threshold is the lowest score at which two images are considered a match. It defaults to 0.7 and can be customized via the --similarity_threshold option.

Match grouping logic

Once vidsim has computed pairwise matches of files, it can group matching videos based on different heuristics. This is controlled by the --clustering_mode (or -m) parameter. Currently, vidsim offers 3 different heuristics:

  • Strict matching (-m strict) - files are grouped as long as they all match in a pairwise fashion (this is the default behavior).

  • Weak matching (-m union) - assuming the match property is transitive, files are grouped as long as there is a chain of matches connecting any two of them (in other words, as long as A ^ B, B ^ C, C ^ D and so on, where ^ denotes the match property).

  • No clustering (-m none) - No grouping happens (each matching pair is reported separately).
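The difference between the strict and union heuristics can be sketched in Go as follows. This is not vidsim's implementation: the pair type, the function names and the sample scores are all hypothetical. Union grouping is modelled as connected components over the matching pairs, and the strict condition as a check that every pair inside a group matches.

```go
package main

import (
	"fmt"
	"sort"
)

// pair is a scored comparison between two files (hypothetical type).
type pair struct {
	a, b  string
	score float64
}

// matches keeps only the pairs whose score reaches the threshold.
func matches(pairs []pair, threshold float64) []pair {
	var out []pair
	for _, p := range pairs {
		if p.score >= threshold {
			out = append(out, p)
		}
	}
	return out
}

// unionGroups links files through chains of matches (connected components).
func unionGroups(ms []pair) [][]string {
	parent := map[string]string{}
	var find func(string) string
	find = func(x string) string {
		p, ok := parent[x]
		if !ok || p == x {
			parent[x] = x
			return x
		}
		root := find(p)
		parent[x] = root // path compression
		return root
	}
	for _, m := range ms {
		ra, rb := find(m.a), find(m.b)
		parent[ra] = rb
	}
	files := make([]string, 0, len(parent))
	for f := range parent {
		files = append(files, f)
	}
	sort.Strings(files) // deterministic output order
	byRoot := map[string][]string{}
	for _, f := range files {
		r := find(f)
		byRoot[r] = append(byRoot[r], f)
	}
	groups := make([][]string, 0, len(byRoot))
	for _, g := range byRoot {
		groups = append(groups, g)
	}
	sort.Slice(groups, func(i, j int) bool { return groups[i][0] < groups[j][0] })
	return groups
}

// isStrict reports whether every two files in the group match each other.
func isStrict(group []string, ms []pair) bool {
	matched := map[[2]string]bool{}
	for _, m := range ms {
		matched[[2]string{m.a, m.b}] = true
		matched[[2]string{m.b, m.a}] = true
	}
	for i := range group {
		for j := i + 1; j < len(group); j++ {
			if !matched[[2]string{group[i], group[j]}] {
				return false
			}
		}
	}
	return true
}

func main() {
	scores := []pair{
		{"a.mp4", "b.mp4", 0.9},
		{"b.mp4", "c.mp4", 0.8},
		{"a.mp4", "c.mp4", 0.3}, // below the 0.7 threshold: not a match
	}
	ms := matches(scores, 0.7)
	for _, g := range unionGroups(ms) {
		fmt.Println(g, "pairwise match:", isStrict(g, ms))
	}
}
```

In this sample, union grouping puts all three files together because a chain of matches connects them. The group does not satisfy the strict condition, since a.mp4 and c.mp4 do not match each other directly.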

Handling false positives

Since the comparison logic is imprecise, there will inevitably be false positive matches: videos identified as similar that are not. Running the tool repeatedly and revisiting those false positives again and again is annoying and distracting. To address this, vidsim allows marking pairs of videos as false positive matches so that on the next run those pairs will not be reported as matches. Naturally, this is only supported with caching on.

To mark a set of video files as pairwise false positives, use the unmatch command:

vidsim -d .my.cache.dir unmatch <video_file1> <video_file2> ...

State management

Multiple runs with different image comparison parameters

One can do multiple runs using different parameters or tools governing image comparison. To save and use comparison scores for such runs side by side, one can specify a "prefix" via the --score_prefix (or -p) option; it is stored along with the comparison scores. For example:

# First run with default comparison:
vidsim -d .my.cache.dir process -p run1 <dir1> <dir2> ...

# Second run with more strict comparison logic:
vidsim -d .my.cache.dir process -p run2 --chr_tolerance 5.22 <dir1> <dir2> ...

Comparison scores for both runs will be stored in the state without intermixing with each other.

Compacting the state

State can be compacted, removing data for files that no longer exist, by passing the --compact option:

vidsim -d .my.cache.dir process --compact mydir

Exploring the state

The peek command allows extracting different data from the state:

vidsim -d .my.state.dir peek file <filename>                  - show file ID for the file
vidsim -d .my.state.dir peek score <filename> <filename>      - show match score for 2 files
vidsim -d .my.state.dir peek score <id> <id>                  - show match score for 2 files (represented by IDs)

Dumping the state

You can dump the accumulated state (frame information, comparison scores):

vidsim -d .my.cache.dir dump

Considerations about filenames

By default vidsim saves filenames as paths relative to the top directories specified. This has two implications:

  • When operating on files (e.g. marking false positives), paths have to be specified precisely using that convention.
  • When compacting the state, the command should be run in the same directory where the process command was run; otherwise vidsim will not find the files listed in the state and will think those files have been deleted (it does, however, have a sanity check against mass deletion).

Alternatively, one can specify the --abs_paths option to make vidsim store absolute paths. This takes more space but avoids the problems above.
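To see why the working directory matters, consider the following Go sketch. It assumes that a stored relative path is looked up against the directory the command is run from; the resolve helper and the sample paths are hypothetical, and vidsim's actual lookup may differ in detail.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolve returns the location that a stored path refers to when a command
// is run from the directory cwd (assumed lookup rule for this sketch).
func resolve(cwd, stored string) string {
	if filepath.IsAbs(stored) {
		return stored // absolute paths do not depend on cwd
	}
	return filepath.Join(cwd, stored)
}

func main() {
	// Same relative entry, two different working directories.
	fmt.Println(resolve("/home/me/videos", "trips/a.mp4"))
	fmt.Println(resolve("/tmp", "trips/a.mp4"))
	// An absolute entry resolves identically from anywhere.
	fmt.Println(resolve("/tmp", "/home/me/videos/trips/a.mp4"))
}
```

Under that assumption, a relative entry points at a different (and probably nonexistent) file when the command is run from another directory, while an absolute entry resolves to the same file from anywhere.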
