smash
![GitHub release](https://img.shields.io/github/release/thushan/smash)
A CLI tool to smash through and find duplicate files efficiently by slicing a file (or blob) into multiple segments and computing a hash using a fast non-cryptographic algorithm such as xxhash or murmur3.
Amongst the highlights of smash:
- Super fast analysis of large files thanks to slicing.
- Suited to finding duplicates on bandwidth-constrained networks and devices, or among very large files, but plenty capable on smaller ones!
- Supports a variety of non-cryptographic algorithms (see algorithms supported).
- Read-only view of the underlying filesystem when analysing.
- Reports on duplicate files & empty (0 byte) files.
- Outputs a report in JSON that you can process with tools like jq (see examples below or the vhs tapes).
- Used to dedupe multi-TB astrophysics datasets, images and video content, and run regularly to report duplicates.
smash does not support pruning of duplicates or empty files natively; you are encouraged to vet the output report before pruning via automated tools.
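One way to vet a group of reported duplicates before pruning is a byte-for-byte comparison. The sketch below is not part of smash; `confirmed_duplicates` is a hypothetical helper that takes a list of paths (however you extract them from the report) and uses Python's standard `filecmp` module to keep only the pairs whose contents are identical.

```python
import filecmp
from itertools import combinations


def confirmed_duplicates(paths):
    """Return the pairs from `paths` whose contents are byte-for-byte identical."""
    return [
        (a, b)
        for a, b in combinations(paths, 2)
        # shallow=False forces a content comparison instead of trusting os.stat().
        if filecmp.cmp(a, b, shallow=False)
    ]
```

Comparing every pair is quadratic in the size of the group, which is usually acceptable for a single family of duplicates but not for a whole report at once.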
![Made with VHS](https://vhs.charm.sh/vhs-6UTX5Yc6CIQ6Y3lzulLKYF.gif)
Find duplicates in the linux/drivers source tree with smash (see our 🍿 other demos). Made with vhs!
The name comes from a prototype tool called SmartHash (written many years ago in C/ASM, with source that is now lost and too hard to modernise). It operated on a similar concept of slicing and hashing (with CRC32, then later MD5).
Installation
![Operating Systems](https://img.shields.io/badge/platform-windows%20%7C%20macos%20%7C%20linux%20%7C%20freebsd-informational?style=for-the-badge)
You can download the latest binaries from GitHub Releases or use our simple installer script, which currently supports Linux, macOS, FreeBSD & Windows:
bash <(curl -s https://raw.githubusercontent.com/thushan/smash/main/install.sh)
It will download the latest version & extract it to its own folder for you.
Alternatively, you can install it via go:
go install github.com/thushan/smash@latest
smash has been developed on Linux (Pop!_OS & Fedora) and tested on macOS, FreeBSD & Windows.
Usage
[!IMPORTANT]
Starting from v0.9.0, smash will only look for duplicates in the current folder. To smash sub-folders, use the --recurse or -r switch.
Usage:
smash [flags] [locations-to-smash]
Flags:
--algorithm algorithm Algorithm to use to hash files. Supported: xxhash, murmur3, md5, sha512, sha256 (full list, see readme) (default xxhash)
--base strings Base directories to use for comparison Eg. --base=/c/dos,/c/dos/run/,/run/dos/run
--disable-autotext Disable detecting text-files to opt for a full hash for those
--disable-meta Disable storing of meta-data to improve hashing mismatches
--disable-slicing Disable slicing & hash the full file instead
--exclude-dir strings Directories to exclude separated by comma Eg. --exclude-dir=.git,.idea
--exclude-file strings Files to exclude separated by comma Eg. --exclude-file=.gitignore,*.csv
-h, --help help for smash
--ignore-empty Ignore empty/zero byte files (default true)
--ignore-hidden Ignore hidden files & folders Eg. files/folders starting with '.' (default true)
--ignore-system Ignore system files & folders Eg. '$MFT', '.Trash' (default true)
-L, --max-size int Maximum file size to consider for hashing (in bytes)
-p, --max-threads int Maximum threads to utilise (default 16)
-w, --max-workers int Maximum workers to utilise when smashing (default 16)
-G, --min-size int Minimum file size to consider for hashing (in bytes)
--nerd-stats Show nerd stats
--no-output Disable report output
--no-progress Disable progress updates
--no-top-list Hides top x duplicates list
-o, --output-file string Export analysis as JSON (generated automatically otherwise)
--profile Enable Go Profiler - see localhost:1984/debug/pprof
--progress-update int Update progress every x seconds (default 5)
-r, --recurse Recursively search directories for files
--show-duplicates Show full list of duplicates
--show-top int Show the top x duplicates (default 10)
-q, --silent Run in silent mode
--slice-size int Size of a Slice (in bytes) (default 8192)
--slice-threshold int Threshold to use for slicing (in bytes) - if file is smaller than this, it won't be sliced (default 102400)
--slices int Number of Slices to use (default 4)
--verbose Run in verbose mode
-v, --version Show version information
See the full list of algorithms supported.
Examples
Examples are given in Unix format, but apply to Windows as well.
[!TIP]
To recursively smash through directories, use the --recurse or -r switch.
By default, smash will only look in the current folder (from v0.9+).
Basic
To check for duplicates in a single path (Eg. ~/media/photos) & output the report to report.json:
$ ./smash ~/media/photos -r -o report.json
You can then look at report.json with jq to check duplicates:
$ jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l
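If jq is not available, the same paths can be rebuilt with a few lines of Python. This is a minimal sketch that assumes only what the jq filter above implies: entries under `analysis.dupes` (or `analysis.empty`) carry `location`, `path` and `filename` fields. `report_paths` is a hypothetical helper and the sample data is made up for illustration.

```python
import json


def report_paths(report, section="dupes"):
    """Join location/path/filename for every entry under analysis.<section>."""
    entries = report.get("analysis", {}).get(section) or []
    return [
        "/".join(str(entry[key]) for key in ("location", "path", "filename"))
        for entry in entries
    ]


# Made-up sample mirroring the fields used by the jq filter.
sample = json.loads("""
{"analysis": {"dupes": [
    {"location": "/home/me/media", "path": "photos/2020", "filename": "a.jpg"}
], "empty": []}}
""")
print(report_paths(sample))  # ['/home/me/media/photos/2020/a.jpg']
```

For a real run, load the file with `json.load(open("report.json"))` and pass the result in; use `section="empty"` for the empty-file list.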
Show Empty Files
By default, smash ignores empty files but can report on them with the --ignore-empty=false argument:
$ ./smash ~/media/photos -r --ignore-empty=false -o report.json
You can then look at report.json with jq to check empty files:
$ jq '.analysis.empty[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l
Show Top 50 Duplicates
By default, smash shows the top 10 duplicate files in the CLI and leaves the rest for the report. You can change that with the --show-top=50 argument to show the top 50 instead:
$ ./smash ~/media/photos -r --show-top=50
Multiple Directories
To check across multiple directories, which can be on different drives or mounts (Eg. ~/media/photos and /mnt/my-usb-drive/photos):
$ ./smash -r ~/media/photos /mnt/my-usb-drive/photos
Smash will find and report all duplicates within any number of directories passed in.
Exclude Files or Directories
You can exclude certain directories or files with the --exclude-dir and --exclude-file switches, including wildcard characters:
$ ./smash -r --exclude-dir=.git,.svn --exclude-file=.gitignore,*.csv ~/media/photos
For example, to ignore specific hidden folders on unix (those that start with . such as the .config or .gnome folders):
$ ./smash -r --exclude-dir=.config,.gnome ~/media/photos
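To preview which file names a comma-separated pattern list would catch, a glob matcher gives a rough idea. This sketch uses Python's `fnmatch` as a stand-in; whether smash's own wildcard matching behaves identically is an assumption, so treat the result as an approximation. `is_excluded` and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_excluded(name, patterns):
    """True if `name` matches any glob pattern (whole-name, case-sensitive)."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Same value as passed to --exclude-file in the example above.
patterns = ".gitignore,*.csv".split(",")
names = [".gitignore", "report.csv", "photo.jpg", "notes.csv.bak"]

kept = [name for name in names if not is_excluded(name, patterns)]
print(kept)  # ['photo.jpg', 'notes.csv.bak']
```

Note that `*.csv` has to match the whole name here, so `notes.csv.bak` is kept.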
Disabling Slicing & Getting Full Hash
By default, smash slices a file into multiple segments and hashes only those parts of the file, rather than the whole file.
If you prefer not to use slicing for a run, you can disable it with:
$ ./smash -r --disable-slicing ~/media/photos
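To make the trade-off concrete, here is a minimal sketch of the slicing idea, not smash's actual implementation. It borrows the documented defaults (4 slices of 8192 bytes, 102400-byte threshold), but the evenly spaced slice offsets are an assumption, and `zlib.crc32` stands in for xxhash because it ships with Python. `sliced_checksum` is a hypothetical name.

```python
import os
import zlib

SLICE_SIZE = 8192         # bytes per slice (documented default)
SLICES = 4                # number of slices (documented default)
SLICE_THRESHOLD = 102400  # smaller files are hashed in full (documented default)


def sliced_checksum(path, slices=SLICES, slice_size=SLICE_SIZE,
                    threshold=SLICE_THRESHOLD):
    """Checksum a file from a few evenly spaced slices (assumed layout)."""
    size = os.path.getsize(path)
    crc = 0
    with open(path, "rb") as fh:
        if size < threshold or slices < 2:
            # Small file: checksum everything, 1 MiB at a time.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                crc = zlib.crc32(chunk, crc)
        else:
            # Large file: read `slices` windows spread from the start to the end.
            last_offset = max(size - slice_size, 0)
            for i in range(slices):
                fh.seek(last_offset * i // (slices - 1))
                crc = zlib.crc32(fh.read(slice_size), crc)
    # Pair the checksum with the size so files of different lengths never match.
    return (size, crc)
```

In this sketch a large file costs at most 32 KiB of reads regardless of its size, but a change that falls entirely between slices goes unnoticed, which is the case a full hash covers.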
Changing Hashing Algorithms
By default, smash uses xxhash, an extremely fast non-cryptographic hash algorithm. However, you can choose from a variety of algorithms as documented.
To use another supported algorithm, use the --algorithm switch:
$ ./smash -r --algorithm=murmur3 ~/media/photos
Acknowledgements
This project was possible thanks to the following projects or folks.
Testers - MarkB, JarredT, BenW, DencilW, JayT, ASV, TimW, RyanW, WilliamH, SpencerB, EmadA, ChrisE, AngelaB, LisaA, YousefI, JeffG, MattP
License
Copyright (c) Thushan Fernando and licensed under Apache License 2.0