filewalker

v0.0.0-...-b5e8b71
Warning

This package is not in the latest version of its module.

Published: Feb 24, 2025 License: MIT Imports: 9 Imported by: 1

README

filewalker

A high-performance concurrent filesystem traversal library with filtering, progress monitoring, and CLI support.

🚀 Features

  • Concurrency: Parallel traversal with a configurable worker pool; the benchmarks below show it running several times faster than filepath.Walk.
  • Real-Time Progress: Live tracking of processed files, directories, and throughput.
  • Filtering: Control file selection based on size, modification time, patterns, and more.
  • Error Handling: Configurable behavior for skipping, continuing, or stopping on errors.
  • Symlink Handling: Options to follow, ignore, or report symbolic links.
  • Logging: Structured logging via zap with adjustable verbosity.
⚠️ Robust Error Handling

| Mode | Behavior |
|------|----------|
| Continue | Skip errors, process remaining files. |
| Stop | Halt immediately on first error. |
| Skip | Ignore problematic files & directories. |

Errors are collected using errors.Join(), allowing detailed reporting.

🔗 Symlink Handling
  • Cycle detection prevents infinite loops
  • Configurable: Follow, Ignore, or Report
  • Thread-safe caching of visited symlinks
📝 Configurable Logging
  • Multiple log levels: ERROR, WARN, INFO, DEBUG
  • Structured logging with zap
  • Custom logger support

📈 Performance

Filewalker significantly outperforms filepath.Walk by using concurrent workers.

| Configuration | Time (ns/op) | Throughput (MB/s) | Speedup |
|---------------|--------------|-------------------|---------|
| filepath.Walk | 3,192,416,229 | ~54 | baseline |
| 2 workers | 1,557,652,298 | ~110 | 2.05x |
| 4 workers | 768,225,614 | ~225 | 4.16x |
| 8 workers | 372,091,401 | ~465 | 8.58x |

Benchmarks run on Apple M2 Pro (10 cores)
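
For reference, speedup here is taken to be the baseline time divided by the time of each run. The short sketch below recomputes it from the ns/op figures reported above; the speedup function is a hypothetical helper written for this example.

```go
package main

import "fmt"

// speedup returns how many times faster a run is than the baseline,
// given both timings in ns/op.
func speedup(baselineNs, runNs float64) float64 {
	return baselineNs / runNs
}

func main() {
	const baseline = 3_192_416_229 // filepath.Walk, ns/op (from the benchmark figures)

	runs := []struct {
		workers int
		ns      float64
	}{
		{2, 1_557_652_298},
		{4, 768_225_614},
		{8, 372_091_401},
	}
	for _, r := range runs {
		fmt.Printf("%d workers: %.2fx\n", r.workers, speedup(baseline, r.ns))
	}
}
```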

🛠 Benchmark Setup
  • System: Apple M2 Pro
  • Test Data: Directory depth = 5, 20 files per directory
  • Measurement: Processing time per file, converted to MB/s

🏗 Architecture

Performance Design

Filewalker achieves high performance through several key architectural decisions.

📊 Architecture Diagram
graph TB
    subgraph Input
        Root[Root Directory]
    end

    subgraph Producer
        Walk[filepath.Walk]
        Cache[Directory Cache]
        Walk --> Cache
    end

    subgraph TaskQueue
        Channel[Buffered Channel<br>size=workers]
    end

    subgraph WorkerPool
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker 3]
        WN[Worker N]
    end

    subgraph Statistics
        Atomic[Atomic Counters]
        Progress[Progress Monitor]
        Speed[Speed Calculator]
    end

    subgraph ErrorHandling
        ErrContinue[Continue Mode]
        ErrStop[Stop Mode]
        ErrSkip[Skip Mode]
        ErrCollector[Error Collector]
    end

    Root --> Walk
    Cache --> Channel
    Channel --> W1
    Channel --> W2
    Channel --> W3
    Channel --> WN
    
    W1 --> Atomic
    W2 --> Atomic
    W3 --> Atomic
    WN --> Atomic
    
    Atomic --> Progress
    Progress --> Speed
    
    W1 --> ErrCollector
    W2 --> ErrCollector
    W3 --> ErrCollector
    WN --> ErrCollector
    
    ErrCollector --> ErrContinue
    ErrCollector --> ErrStop
    ErrCollector --> ErrSkip

    classDef default fill:#f9f,stroke:#333,stroke-width:2px;
    classDef producer fill:#bbf,stroke:#333,stroke-width:2px;
    classDef queue fill:#bfb,stroke:#333,stroke-width:2px;
    classDef workers fill:#fbf,stroke:#333,stroke-width:2px;
    classDef stats fill:#ffb,stroke:#333,stroke-width:2px;
    classDef errors fill:#fbb,stroke:#333,stroke-width:2px;

    class Root default;
    class Walk,Cache producer;
    class Channel queue;
    class W1,W2,W3,WN workers;
    class Atomic,Progress,Speed stats;
    class ErrContinue,ErrStop,ErrSkip,ErrCollector errors;
1. Worker Pool Model
[Directory Tree] → [Task Queue] → [Worker Pool (N workers)] → [Results]
      ↑                 ↑                 ↑
   Producer       Buffered Channel    Concurrent
    (Walk)        (Size = limit)      Processing
  • Producer: A single goroutine recursively walks the directory tree and pushes tasks into the queue.
  • Task Queue: A buffered channel bounds the number of pending tasks, so the producer blocks when the queue is full and memory usage stays capped.
  • Worker Pool: N concurrent workers fetch tasks from the queue for parallel processing.
  • Load Balancing: All workers pull from the same queue, so work is distributed dynamically; an idle worker takes the next task as soon as it is free.
2. Memory Optimizations
  • Atomic Operations: Lock-free statistics tracking for performance.
  • sync.Map Caching: Thread-safe directory exclusion cache reduces redundant checks.
  • Buffer Control: Configurable task queue size prevents excessive memory usage.
  • Minimized Allocations: Reuses walkArgs structs to reduce GC overhead.
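
To make the directory exclusion cache concrete, here is a minimal sketch that memoizes pattern checks in a sync.Map. The exclusionCache type is hypothetical, and matching on the directory's base name is an assumption; the library may match on full paths instead.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sync"
)

// exclusionCache memoizes whether a directory matches any exclude pattern.
// Hypothetical helper for illustration, not part of filewalker.
type exclusionCache struct {
	patterns []string
	cache    sync.Map // directory path -> bool
}

// excluded reports whether dir's base name matches one of the patterns,
// consulting the cache before re-running the glob checks.
func (c *exclusionCache) excluded(dir string) bool {
	if v, ok := c.cache.Load(dir); ok {
		return v.(bool)
	}
	hit := false
	for _, p := range c.patterns {
		if ok, err := filepath.Match(p, filepath.Base(dir)); err == nil && ok {
			hit = true
			break
		}
	}
	c.cache.Store(dir, hit)
	return hit
}

func main() {
	c := &exclusionCache{patterns: []string{".git", "node_*"}}
	fmt.Println(c.excluded("/repo/.git"), c.excluded("/repo/node_modules"), c.excluded("/repo/src"))
	// true true false
}
```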
3. Concurrency Control
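// walkArgs carries one filesystem entry from the producer to a worker.
type walkArgs struct {
    path string
    info os.FileInfo
    err  error
}

// Worker pool: start `limit` goroutines that all receive from one channel.
tasks := make(chan walkArgs, limit) // buffer size = limit, as in the diagram above
for i := 0; i < limit; i++ {
    go worker(tasks) // worker has the signature func(tasks <-chan walkArgs)
}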
  • Workers efficiently pull tasks from the queue and process files concurrently.
  • The number of workers is configurable through the limit argument (for example in WalkLimit); when no limit is given, DefaultConcurrentWalks (100) is used.
  • Graceful shutdown ensures clean termination when walking is canceled.
4. Error Management
  • Non-blocking: In Continue and Skip modes, an error in one worker does not stop the others.
  • Aggregation: Errors are combined using errors.Join().
  • Context: Cancellation through context.Context stops the walk gracefully.
5. Progress Tracking
[Workers] → [Atomic Counters] → [Stats Aggregator] → [Progress Callback]
    ↑            ↑                     ↑                    ↑
 Updates    Thread-safe         500ms Intervals      User Interface
  • Workers update atomic counters in real time.
  • A stats aggregator samples the counters every 500 ms.
  • Progress is reported via a customizable callback function.
  • Users can monitor:
    • Files Processed
    • Processing Speed (MB/s)
    • Elapsed Time
    • Error Count

License

MIT License. See the LICENSE file for details.

Author

Built with ❤️ by TFMV

Documentation

Overview

Package filewalker provides concurrent filesystem traversal with filtering and progress reporting. It builds upon the standard filepath.Walk functionality while adding concurrency, filtering, and monitoring capabilities.

Constants

const DefaultConcurrentWalks int = 100

DefaultConcurrentWalks defines the default number of concurrent workers when no specific limit is provided.

Variables

This section is empty.

Functions

func Walk

func Walk(root string, walkFn filepath.WalkFunc) error

Walk traverses a directory tree using the default concurrency limit. It's a convenience wrapper around WalkLimit.

func WalkLimit

func WalkLimit(ctx context.Context, root string, walkFn filepath.WalkFunc, limit int) error

WalkLimit traverses a directory tree with a specified concurrency limit. It distributes work across a pool of goroutines while respecting context cancellation. Directories are processed synchronously so that a SkipDir result prevents descending.

func WalkLimitWithFilter

func WalkLimitWithFilter(ctx context.Context, root string, walkFn filepath.WalkFunc, limit int, filter FilterOptions) error

WalkLimitWithFilter adds file filtering capabilities to the walk operation.

func WalkLimitWithOptions

func WalkLimitWithOptions(ctx context.Context, root string, walkFn filepath.WalkFunc, opts WalkOptions) error

WalkLimitWithOptions provides the most flexible configuration, combining error handling, filtering, progress reporting, and optional custom logger/symlink handling.

func WalkLimitWithProgress

func WalkLimitWithProgress(ctx context.Context, root string, walkFn filepath.WalkFunc, limit int, progressFn ProgressFn) error

WalkLimitWithProgress adds progress monitoring to the walk operation.

Types

type ErrorHandling

type ErrorHandling int

ErrorHandling defines how errors are handled during traversal.

const (
	ErrorHandlingContinue ErrorHandling = iota // Continue on errors
	ErrorHandlingStop                          // Stop on first error
	ErrorHandlingSkip                          // Skip problematic files/dirs
)

type FilterOptions

type FilterOptions struct {
	MinSize        int64     // Minimum file size in bytes
	MaxSize        int64     // Maximum file size in bytes
	Pattern        string    // Glob pattern for matching files
	ExcludeDir     []string  // Directory patterns to exclude
	IncludeTypes   []string  // File extensions to include (e.g. ".txt", ".go")
	ModifiedAfter  time.Time // Only include files modified after
	ModifiedBefore time.Time // Only include files modified before
}

FilterOptions defines criteria for including/excluding files and directories.

type LogLevel

type LogLevel int

LogLevel defines the verbosity of logging.

const (
	LogLevelError LogLevel = iota
	LogLevelWarn
	LogLevelInfo
	LogLevelDebug
)

type MemoryLimit

type MemoryLimit struct {
	SoftLimit int64 // Pause processing when reached
	HardLimit int64 // Stop processing when reached
}

MemoryLimit sets memory usage boundaries for the traversal.

type ProgressFn

type ProgressFn func(stats Stats)

ProgressFn is called periodically with traversal statistics. Implementations must be thread-safe as this may be called concurrently.

type Stats

type Stats struct {
	FilesProcessed int64         // Number of files processed
	DirsProcessed  int64         // Number of directories processed
	EmptyDirs      int64         // Number of empty directories
	BytesProcessed int64         // Total bytes processed
	ErrorCount     int64         // Number of errors encountered
	ElapsedTime    time.Duration // Total time elapsed
	AvgFileSize    int64         // Average file size in bytes
	SpeedMBPerSec  float64       // Processing speed in MB/s
}

Stats holds traversal statistics that are updated atomically during the walk.

type SymlinkHandling

type SymlinkHandling int

SymlinkHandling defines how symbolic links are processed.

const (
	SymlinkFollow SymlinkHandling = iota // Follow symbolic links
	SymlinkIgnore                        // Ignore symbolic links
	SymlinkReport                        // Report links but don't follow
)

type WalkOptions

type WalkOptions struct {
	ErrorHandling   ErrorHandling
	Filter          FilterOptions
	Progress        ProgressFn
	Logger          *zap.Logger
	LogLevel        LogLevel // New field for logging verbosity
	BufferSize      int
	SymlinkHandling SymlinkHandling
	MemoryLimit     MemoryLimit
}

WalkOptions provides comprehensive configuration for the walk operation.
