RProc - Reddit Data File Processor
RProc is a command-line tool for processing Reddit data dumps in zstd-compressed NDJSON format. It filters submissions and comments by field value (for example, by subreddit) and converts the data to CSV.
Features
- Filter Reddit submissions and comments by field values
- Convert Reddit data to CSV format
- Process large zstd-compressed files efficiently
- Support for parallel processing
- Progress tracking and detailed logging
- Filter using exact match, partial match, or regex patterns
Installation
Requires Go 1.21 or higher.
go install github.com/Caycedo/rproc@latest
Or clone and build from source:
git clone https://github.com/Caycedo/rproc.git
cd rproc
go build
Quick Start
Filter Submissions from a Subreddit
# Get all posts from r/wallstreetbets
rproc filter ./reddit_data ./output --field subreddit --value wallstreetbets
# Use regex matching
rproc filter ./reddit_data ./output --field subreddit --value "bitcoin.*" --regex
# Use partial matching
rproc filter ./reddit_data ./output --field subreddit --value "crypto" --partial
Convert to CSV
# Convert submissions to CSV
rproc csv ./reddit_data ./output.csv
Common Use Cases
Filter Submissions by Field
# Get all posts by a specific author
rproc filter ./input ./output --field author --value "spez"
# Get posts with specific words in title
rproc filter ./input ./output --field title --value "announcement" --partial
# Get posts from multiple subreddits (using a file)
printf "wallstreetbets\nbitcoin\n" > subreddits.txt
rproc filter ./input ./output --field subreddit --value-list subreddits.txt
Processing Large Datasets
# Use multiple threads for faster processing
rproc filter ./input ./output --field subreddit --value wallstreetbets --threads 4
# Only process specific date ranges
rproc filter ./input ./output --field subreddit --value wallstreetbets --file-filter "RS_2023-.*"
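To preview which files a pattern would select before starting a long run, a sketch along these lines may help. It assumes that RProc evaluates --file-filter with Go's regexp package and does not anchor the pattern; both are assumptions, and selectFiles is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// selectFiles is a hypothetical helper (not RProc's actual code). It returns
// the file names that match pattern, using an unanchored regex match.
func selectFiles(names []string, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var selected []string
	for _, name := range names {
		if re.MatchString(name) {
			selected = append(selected, name)
		}
	}
	return selected, nil
}

func main() {
	names := []string{"RS_2022-12.zst", "RS_2023-01.zst", "RS_2023-02.zst", "RC_2023-01.zst"}
	selected, err := selectFiles(names, "RS_2023-.*")
	if err != nil {
		fmt.Println("invalid pattern:", err)
		return
	}
	for _, name := range selected {
		fmt.Println(name)
	}
}
```

Under the same assumptions, a narrower pattern such as "RS_2023-0[1-3]" would restrict the run to the first three months of 2023.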
Available Fields
For filtering, you can use these fields:
- subreddit: Subreddit name
- author: Post author username
- title: Post title (submissions only)
- selftext: Post content (submissions only)
- body: Comment content (comments only)
- domain: Link domain (submissions only)
CSV Output Fields
Submissions (RS_*.zst files)
- author
- title
- score
- created
- link
- text
- url
Comments (RC_*.zst files)
- author
- score
- created
- link
- body
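For readers who want to picture the resulting file, the sketch below writes the submission columns listed above with Go's encoding/csv package. The header row, the sample values, and the timestamp format are all assumptions made for illustration; check a real output file for the exact layout RProc produces.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// submissionHeader mirrors the submission columns listed in this README.
// Whether RProc emits a header row at all is an assumption.
var submissionHeader = []string{"author", "title", "score", "created", "link", "text", "url"}

// writeSubmissions is a hypothetical helper that renders rows as CSV text.
func writeSubmissions(rows [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(submissionHeader); err != nil {
		return "", err
	}
	// WriteAll flushes the writer after the last row.
	if err := w.WriteAll(rows); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Made-up sample row; the values are illustrative only.
	rows := [][]string{
		{"example_user", "Hello, world", "42", "2023-01-01 00:00:00",
			"https://www.reddit.com/r/example/comments/abc123/", "", "https://example.com"},
	}
	out, err := writeSubmissions(rows)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Note that fields containing commas or quotes, such as the title in the sample row, are quoted by the CSV writer, so downstream tools should use a real CSV parser rather than splitting on commas.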
Command Reference
Global Flags
- --debug: Enable debug logging
- --threads: Number of processing threads (default: 1)
- --file-filter: Regex for matching input filenames (default: ".*")
Filter Command Flags
- --field: Field to filter on (required)
- --value: Value to match against (required unless using --value-list)
- --value-list: File containing newline-separated values to match
- --partial: Use partial string matching
- --regex: Use regex matching
- --error-rate: Acceptable percentage of errors (0-100)
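The difference between the default exact match, --partial, and --regex can be sketched as follows. This is an illustration only, not RProc's implementation: the matchMode type and matchValue function are hypothetical, and the sketch assumes case-sensitive comparison and unanchored regex matching with Go's regexp package.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchMode is a hypothetical enum for this sketch; it is not taken from
// RProc's source code.
type matchMode int

const (
	modeExact   matchMode = iota // default: whole value must be equal
	modePartial                  // like --partial: substring match
	modeRegex                    // like --regex: regular expression match
)

// matchValue reports whether fieldValue satisfies pattern under mode.
// All comparisons here are case-sensitive, which is an assumption.
func matchValue(fieldValue, pattern string, mode matchMode) (bool, error) {
	switch mode {
	case modePartial:
		return strings.Contains(fieldValue, pattern), nil
	case modeRegex:
		re, err := regexp.Compile(pattern)
		if err != nil {
			return false, err
		}
		// Unanchored: the pattern may match anywhere in the value.
		return re.MatchString(fieldValue), nil
	default:
		return fieldValue == pattern, nil
	}
}

func main() {
	for _, sub := range []string{"bitcoin", "bitcoinmarkets", "cryptomarkets"} {
		exact, _ := matchValue(sub, "bitcoin", modeExact)
		partial, _ := matchValue(sub, "crypto", modePartial)
		re, _ := matchValue(sub, "bitcoin.*", modeRegex)
		fmt.Printf("%-16s exact=%v partial=%v regex=%v\n", sub, exact, partial, re)
	}
}
```

If the regex is indeed unanchored, a pattern like "bitcoin.*" would also match a value such as "notbitcoin"; writing "^bitcoin" avoids that.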
Input File Format
RProc expects Reddit data files in the following format:
- Submissions: Files starting with RS_ (e.g., RS_2023-01.zst)
- Comments: Files starting with RC_ (e.g., RC_2023-01.zst)
- Compressed using zstd
- Each line contains a JSON object
Tips and Troubleshooting
- Use --debug for detailed logging if you encounter issues
- File patterns use regex: RS_2023-.* matches all 2023 submission files
- Memory usage scales with thread count, so start low and increase if needed
- Use partial matching (--partial) for more flexible text searches
- Check log output for progress and error information
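The note about memory and thread count can be understood through a bounded worker pool: each worker holds its own buffers while it processes a file, so more workers means more memory in use at once. The sketch below shows that general pattern in Go. It assumes work is split per file, which may not match how RProc actually divides work, and processFiles is a hypothetical function.

```go
package main

import (
	"fmt"
	"sync"
)

// processFiles is a hypothetical helper (not RProc's actual code). It runs
// work on every file using at most `threads` goroutines and collects one
// result per file.
func processFiles(files []string, threads int, work func(string) int) map[string]int {
	if threads < 1 {
		threads = 1
	}
	jobs := make(chan string)
	results := make(map[string]int, len(files))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range jobs {
				n := work(f)
				mu.Lock()
				results[f] = n
				mu.Unlock()
			}
		}()
	}
	for _, f := range files {
		jobs <- f
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	files := []string{"RS_2023-01.zst", "RS_2023-02.zst", "RS_2023-03.zst"}
	// Stand-in for real work: report the length of the file name.
	results := processFiles(files, 2, func(name string) int { return len(name) })
	for _, f := range files {
		fmt.Printf("%s -> %d\n", f, results[f])
	}
}
```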
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License