filecollate

v0.0.0-...-a60d657
Published: Jul 7, 2024 License: MIT Imports: 19 Imported by: 1

README

filecollate

A tiny Go package for aggregating and organizing file paths based on customizable criteria. Originally designed for deduplication, the package groups files into sets based on a key generated by a custom filecollate.KeyGeneratorFunc function (see: key-generator).

Installation

go get github.com/ricci2511/filecollate

Usage

The package exposes three search functions: GetResults, GetResultsSlice and StreamPairs. Each takes a filecollate.Cfg struct to configure the search.

  • GetResults returns a map from each generated key to the file paths that share it, once the search is complete.
  • GetResultsSlice returns the same groups as a 2D slice of paths, once the search is complete.
  • StreamPairs takes a channel, to which it sends each (key, path) pair as it is produced. Useful if you want to process the results as they come in instead of getting them all at once when the search is complete.

Check out deduplo for an example on how to use this package.

package main

import (
    "fmt"
    "log"

    "github.com/ricci2511/filecollate"
)

func main() {
    filters := filecollate.Filters{
        HiddenInclude: true,
        DirsExclude:   []string{"node_modules"},
        ExtInclude:    []string{".txt", ".json", ".go"}, // only search for .txt, .json and .go files
    }
    cfg := filecollate.Cfg{
        Paths:   []string{"~/Dev", "~/Documents"},
        Filters: filters,
    }

    fmt.Println("Searching...")

    // Blocks until the search is complete.
    // GetResults returns a map of generated key -> paths sharing that key.
    groups, err := filecollate.GetResults(cfg)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Search complete")

    // Print every path, group by group.
    for key, paths := range groups {
        fmt.Println("key:", key)
        for _, path := range paths {
            fmt.Println("  ", path)
        }
    }
}

The filecollate.Cfg struct has the following fields as of now:

type Cfg struct {
 Paths                         // paths to search in for duplicates
 Filters                       // various filters for the search (see filters.go)
 KeyGenerator KeyGeneratorFunc // key generator function to use
 Workers      int              // number of workers (defaults to GOMAXPROCS)
}

key-generator

The KeyGenerator field allows you to specify a custom function to generate a key for a given file path that maps to a slice of duplicate file paths.

Several functions are already provided. The default is filecollate.Crc32HashKeyGenerator, which hashes the first 16KB of the file contents with crc32. The functions prefixed with Full hash the entire file contents instead of just the first 16KB, which is much slower but more accurate in the rare cases where the first 16KB are not enough to tell files apart. Available KeyGenerator functions are:

  • filecollate.Crc32HashKeyGenerator
  • filecollate.FullCrc32HashKeyGenerator
  • filecollate.Sha256HashKeyGenerator
  • filecollate.FullSha256HashKeyGenerator

To use custom logic to generate keys, pass a function that satisfies the filecollate.KeyGeneratorFunc signature. An example can be found here.

Documentation

Variables

var ErrSkipFile = fmt.Errorf("skip file")

Used to skip a file during key generation.

Errors of this kind are ignored and not returned to the caller of the search functions (for example filecollate.GetResults()).

Functions

func Crc32HashKeyGenerator

func Crc32HashKeyGenerator(path string) (string, error)

Crc32HashKeyGenerator is the default if no KeyGenerator is specified.

Generates a crc32 hash of the first 16KB of the file contents as the key, which should be enough to achieve a good balance of uniqueness, collision resistance, and performance for most files.

func FullCrc32HashKeyGenerator

func FullCrc32HashKeyGenerator(path string) (string, error)

Generates a crc32 hash of the entire file contents as the key, which is a lot slower than Crc32HashKeyGenerator but should be more accurate.

func FullSha256HashKeyGenerator

func FullSha256HashKeyGenerator(path string) (string, error)

Generates a sha256 hash of the entire file contents as the key.

func GetResults

func GetResults(c Cfg) (map[string][]string, error)

Runs the search and returns a map from each generated key to the paths that share that key.

func GetResultsSlice

func GetResultsSlice(c Cfg) ([][]string, error)

Runs the search and returns a 2D slice of paths, where each inner slice represents one group.
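
For deduplication, only groups with more than one path are interesting. The sketch below post-processes a map shaped like the GetResults return value into a 2D slice of such groups. Whether the package already drops single-member groups is not stated here, so the filter is an assumption, and duplicateGroups is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// duplicateGroups keeps only the groups that contain more than one
// path and returns them as a 2D slice, ordered by key.
func duplicateGroups(groups map[string][]string) [][]string {
	keys := make([]string, 0, len(groups))
	for k, paths := range groups {
		if len(paths) > 1 {
			keys = append(keys, k)
		}
	}
	// Map iteration order is random, so sort the keys for stable output.
	sort.Strings(keys)

	out := make([][]string, 0, len(keys))
	for _, k := range keys {
		out = append(out, groups[k])
	}
	return out
}

func main() {
	// Hand-written sample data in the shape GetResults is documented to return.
	groups := map[string][]string{
		"k1": {"a.txt", "copy/a.txt"},
		"k2": {"b.txt"},
	}
	for _, g := range duplicateGroups(groups) {
		fmt.Println(g)
	}
}
```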

func Sha256HashKeyGenerator

func Sha256HashKeyGenerator(path string) (string, error)

Generates a sha256 hash of the first 16KB of the file contents as the key.

func StreamPairs

func StreamPairs(c Cfg, collectorChan chan *pair) error

Directly streams produced pairs (key, path) to the provided channel. This is useful if you want to process the key-path pairs yourself.

Types

type Cfg

type Cfg struct {
	KeyGenerator KeyGeneratorFunc // Function to generate a key based on the file path.
	Paths                         // List of paths to search in for files to collect/group.
	Filters                       // Filters to apply when searching for files to group.
	Workers      int              // Number of max workers to use for the search.
}

func (*Cfg) String

func (c *Cfg) String() string

Returns a pretty-printed string representation of the Cfg struct.

type Filters

type Filters struct {
	ExtInclude    FiltersList // List of file extensions to include.
	ExtExclude    FiltersList // List of file extensions to exclude.
	DirsExclude   FiltersList // List of directories or subdirectories to exclude.
	SkipSubdirs   bool        // Skip subdirectories.
	HiddenInclude bool        // Include hidden files and directories.
}

func (*Filters) String

func (f *Filters) String() string

Returns a pretty-printed string representation of the Filters struct.

type FiltersList

type FiltersList []string

Satisfies the flag.Value interface. String values can be provided as a comma-separated or space-separated list.

`flag.Var(&cfg.DirsExclude, "exclude-dirs", "exclude directories or subdirectories")`

func (*FiltersList) Set

func (fl *FiltersList) Set(val string) error

func (*FiltersList) String

func (fl *FiltersList) String() string
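
To show what a list-valued flag.Value can look like, here is a self-contained sketch. stringList is a local stand-in and not the package's FiltersList. Splitting on commas and whitespace and appending across repeated flags are assumptions about the behaviour, not something the source confirms.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stringList is a local stand-in for a list type that satisfies
// flag.Value (String and Set methods on the pointer receiver).
type stringList []string

// String joins the current values with commas.
func (l *stringList) String() string {
	if l == nil {
		return ""
	}
	return strings.Join(*l, ",")
}

// Set splits val on commas and whitespace and appends the parts.
func (l *stringList) Set(val string) error {
	parts := strings.FieldsFunc(val, func(r rune) bool {
		return r == ',' || r == ' ' || r == '\t'
	})
	*l = append(*l, parts...)
	return nil
}

func main() {
	var exclude stringList

	// A dedicated FlagSet keeps the example independent of os.Args.
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.Var(&exclude, "exclude-dirs", "exclude directories or subdirectories")

	args := []string{"-exclude-dirs", "node_modules,.git vendor"}
	if err := fs.Parse(args); err != nil {
		fmt.Println("parse:", err)
		return
	}
	fmt.Println(len(exclude), exclude.String())
}
```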

type KeyGeneratorFunc

type KeyGeneratorFunc func(path string) (string, error)

KeyGeneratorFunc generates a key for a given file path. The key is then mapped to the list of file paths that share it.

The provided KeyGeneratorFuncs hash the file contents to generate the key, but the logic can be anything as long as it's deterministic. For example, you could generate a key based on the file name, size, etc.

type Paths

type Paths []string

Satisfies the flag.Value interface. String values can be provided as a comma-separated or space-separated list.

`flag.Var(&cfg.Paths, "p", "list of paths to traverse")`

func (*Paths) Set

func (p *Paths) Set(val string) error

func (*Paths) String

func (p *Paths) String() string
