rmd : ReMove Duplicates, an rm implementation able to remove duplicate files
An improved rm implementation able to remove duplicate files
rmd is an rm reimplementation written in pure Rust. It removes files and directories as usual.
rmd is also able to:
Recursively remove duplicate files. Duplicates are found by comparing the SHA256 hashes of the files.
Recursively remove files by size.
Recursively remove files by last access time.
This tool can be easily installed from source:
cargo install rmd
It is also possible to clone the repository and compile rmd from there.
In this case it is recommended to run all tests before compiling rmd for production.
A convenient way to do that is using make:
make build
This will run all cargo tests (both unit and integration) and the CLI tests before compiling rmd for production.
rmd works in a way that is almost compatible with the standard rm. For the full help run:
rmd --help
The most common scenarios include:
rmd FILE_A FILE_B
rmd -rf DIR_A
rmd -v FILE_A
Simple verbose mode just outputs the names of removed files and directories. For more verbose output:
rmd -vv FILE_A
In this mode rmd also shows statistics about removed files, specifying whether each removed entry is a directory or a regular file; in the latter case rmd also shows the size of the removed file.
rmd -l FILE_A
outputs the same information as -v to syslog. For more logging:
rmd -ll FILE_A
outputs the same information as -vv to syslog.
rmd -d
or remove duplicate files in a specified directory:
rmd -d /PATH/TO/DIRECTORY
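The idea of finding duplicates by comparing content hashes can be sketched in a few lines of Rust. This is a minimal illustration, not rmd's actual code: the names `digest` and `duplicates` are hypothetical, files are modelled as in-memory `(name, content)` pairs, and the standard library's `DefaultHasher` stands in for SHA256 so that the example needs no external crate. Which copy rmd keeps is not stated in this document; the sketch assumes the first file seen is kept.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::Hasher;

/// Stand-in content digest. NOTE: this is NOT SHA256; it only keeps the
/// sketch free of external crates. A real tool would use a cryptographic hash.
fn digest(content: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(content);
    hasher.finish()
}

/// Returns the names of files whose digest matches an earlier file.
/// Assumption: the first file seen with a given digest is the one to keep.
fn duplicates<'a>(files: &[(&'a str, &[u8])]) -> Vec<&'a str> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut dups = Vec::new();
    for &(name, content) in files {
        // `insert` returns false when the digest was already present.
        if !seen.insert(digest(content)) {
            dups.push(name);
        }
    }
    dups
}
```

Called with three files where the first and third have identical content, `duplicates` reports only the third one as a removal candidate.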
This functionality removes files older or newer than a given time-specification.
Remove files older than time-spec
rmd --older <time-spec> [directory...]
Remove files newer than time-spec
rmd --newer <time-spec> [directory...]
rmd checks whether the last access time of each file is before (so the file is older) or after (so the file is newer) the moment described by the time-specification.
A time-specification describes a relative amount of time (in seconds) in the past, counted from the moment the program is run.
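The comparison described above can be sketched as follows. This is a hedged illustration rather than rmd's implementation: the function names `is_older` and `is_newer` are hypothetical, and the sketch assumes the time-specification has already been converted to a `Duration`. In a real program the last access time could come from `std::fs::Metadata::accessed()`.

```rust
use std::time::{Duration, SystemTime};

/// True when `last_access` is at or before the cutoff `now - spec`.
fn is_older(last_access: SystemTime, now: SystemTime, spec: Duration) -> bool {
    match now.checked_sub(spec) {
        Some(cutoff) => last_access <= cutoff,
        // The cutoff is not representable, so no file can be that old.
        None => false,
    }
}

/// True when `last_access` is at or after the cutoff `now - spec`.
fn is_newer(last_access: SystemTime, now: SystemTime, spec: Duration) -> bool {
    match now.checked_sub(spec) {
        Some(cutoff) => last_access >= cutoff,
        // The cutoff is not representable, so every file is newer than it.
        None => true,
    }
}
```

Passing `now` explicitly, instead of calling `SystemTime::now()` inside the functions, keeps the cutoff identical for every file visited during one run.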
time-specification format
[N+T]+
Where:
Time Descriptor Table
Short Format | Long Format | Meaning | Value |
---|---|---|---|
s | second | second | 1 second |
m | minute | minute | 60 seconds |
h | hour | hour | 60 minutes |
d | day | day | 24 hours |
w | week | week | 7 days |
M | month | month | 30 days |
y | year | year | 365 days |
rmd --older 2y4M5d
removes, in the current directory and recursively in all subdirectories, files whose last access time is at or before 2 years, 4 months and 5 days before the time the program is run.
rmd --newer '4h+30m'
removes, in the current directory and recursively in all subdirectories, files whose last access time is at or after 4 hours and 30 minutes before the time the program is run.
rmd --older '1M 15d' /home/user/temp-store
removes, in /home/user/temp-store and recursively in all subdirectories, files whose last access time is at or before 1 month and 15 days before the time the program is run.
rmd --newer 30s /home/user/wrong-downloads
removes, in /home/user/wrong-downloads and recursively in all subdirectories, files whose last access time is at or after 30 seconds before the time the program is run.
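To make the time-specification format concrete, here is a minimal Rust sketch of a parser that turns a string such as `2y4M5d` into a number of seconds using the Time Descriptor Table. It is an illustration under assumptions, not rmd's real parser: the names `unit_seconds` and `parse_time_spec` are hypothetical, long-format descriptors are matched exactly as written in the table, and any character that is not an ASCII letter or digit is treated as a separator.

```rust
/// Seconds in one unit of `descriptor`, following the Time Descriptor Table.
fn unit_seconds(descriptor: &str) -> Option<u64> {
    const MINUTE: u64 = 60;
    const HOUR: u64 = 60 * MINUTE;
    const DAY: u64 = 24 * HOUR;
    match descriptor {
        "s" | "second" => Some(1),
        "m" | "minute" => Some(MINUTE),
        "h" | "hour" => Some(HOUR),
        "d" | "day" => Some(DAY),
        "w" | "week" => Some(7 * DAY),
        "M" | "month" => Some(30 * DAY),
        "y" | "year" => Some(365 * DAY),
        _ => None,
    }
}

/// Parses one or more (number, descriptor) pairs and sums them in seconds.
/// Returns None on an empty spec, an unknown descriptor, a missing number
/// or descriptor, or arithmetic overflow.
fn parse_time_spec(spec: &str) -> Option<u64> {
    let chars: Vec<char> = spec.chars().collect();
    let is_sep = |c: char| !c.is_ascii_alphanumeric();
    let mut i = 0;
    let mut total: u64 = 0;
    let mut pairs = 0;
    loop {
        // Skip separators before the number.
        while i < chars.len() && is_sep(chars[i]) { i += 1; }
        if i == chars.len() { break; }
        // Read the number N.
        let start = i;
        while i < chars.len() && chars[i].is_ascii_digit() { i += 1; }
        if i == start { return None; }
        let n = chars[start..i].iter().collect::<String>().parse::<u64>().ok()?;
        // Skip separators between the number and the descriptor.
        while i < chars.len() && is_sep(chars[i]) { i += 1; }
        // Read the descriptor T.
        let start = i;
        while i < chars.len() && chars[i].is_ascii_alphabetic() { i += 1; }
        if i == start { return None; }
        let descriptor: String = chars[start..i].iter().collect();
        total = total.checked_add(n.checked_mul(unit_seconds(&descriptor)?)?)?;
        pairs += 1;
    }
    if pairs > 0 { Some(total) } else { None }
}
```

With this sketch, `parse_time_spec("4h+30m")` yields 16200 seconds, that is 4 × 3600 + 30 × 60.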
This functionality removes files smaller or larger than a given size-specification.
Remove files smaller than size-spec
rmd --smaller <size-spec> [directory...]
Remove files larger than size-spec
rmd --larger <size-spec> [directory...]
rmd checks the size of each file, in bytes. In larger mode, rmd checks, for each file in the specified directory and recursively in all subdirectories, whether its size is larger than or equal to the size described in size-spec, and if so removes the file. Likewise, in smaller mode rmd checks for files smaller than or equal to the size in size-spec.
size-specification format
[N+S]+
Where:
Decimal Size Descriptor Table
Short Format | Long Format | Meaning | Value |
---|---|---|---|
b | byte | byte | 1 byte |
kb | kilo | kilobyte | 1000 bytes |
mb | mega | megabyte | 1000 kilobytes |
gb | giga | gigabyte | 1000 megabytes |
tb | tera | terabyte | 1000 gigabytes |
pb | peta | petabyte | 1000 terabytes |
Binary Size Descriptor Table
Short Format | Long Format | Meaning | Value |
---|---|---|---|
b | byte | byte | 1 byte |
kib | kibi | kibibyte | 1024 bytes |
mib | mebi | mebibyte | 1024 kibibytes |
gib | gibi | gibibyte | 1024 mebibytes |
tib | tebi | tebibyte | 1024 gibibytes |
pib | pebi | pebibyte | 1024 tebibytes |
Decimal and binary size descriptors can be used together.
rmd --smaller '2kb,56mib'
removes, in the current directory and recursively in all subdirectories, files with a size smaller than or equal to 56 mebibytes and 2 kilobytes.
rmd --larger 4gb30mb
removes, in the current directory and recursively in all subdirectories, files with a size larger than or equal to 4 gigabytes and 30 megabytes.
rmd --larger '1 mebi 15 kibi' /home/user/temp-store
removes, in /home/user/temp-store and recursively in all subdirectories, files with a size larger than or equal to 1 mebibyte and 15 kibibytes.
rmd --smaller 30kb /home/user/useless-files
removes, in /home/user/useless-files and recursively in all subdirectories, files with a size smaller than or equal to 30 kilobytes.
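The arithmetic behind mixed decimal and binary size-specifications can be checked with a short Rust sketch. It is only an illustration: `unit_bytes` and `total_bytes` are hypothetical names, and the sketch assumes the specification has already been split into (number, descriptor) pairs, so it shows the unit tables and the summation rather than the string parsing.

```rust
/// Bytes in one unit of `descriptor`, following the two size descriptor tables.
fn unit_bytes(descriptor: &str) -> Option<u64> {
    let (base, exponent): (u64, u32) = match descriptor {
        "b" | "byte" => (1, 0),
        // Decimal descriptors: powers of 1000.
        "kb" | "kilo" => (1000, 1),
        "mb" | "mega" => (1000, 2),
        "gb" | "giga" => (1000, 3),
        "tb" | "tera" => (1000, 4),
        "pb" | "peta" => (1000, 5),
        // Binary descriptors: powers of 1024.
        "kib" | "kibi" => (1024, 1),
        "mib" | "mebi" => (1024, 2),
        "gib" | "gibi" => (1024, 3),
        "tib" | "tebi" => (1024, 4),
        "pib" | "pebi" => (1024, 5),
        _ => return None,
    };
    Some(base.pow(exponent))
}

/// Sums (number, descriptor) pairs into a size in bytes.
/// Returns None on an unknown descriptor or on overflow.
fn total_bytes(terms: &[(u64, &str)]) -> Option<u64> {
    let mut total: u64 = 0;
    for &(n, descriptor) in terms {
        total = total.checked_add(n.checked_mul(unit_bytes(descriptor)?)?)?;
    }
    Some(total)
}
```

For example, `total_bytes(&[(2, "kb"), (56, "mib")])` gives 2 × 1000 + 56 × 1024 × 1024 = 58,722,256 bytes, the threshold used by the first example in this section.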
Sometimes you may need to protect some files or directories from being removed; for example, you may want to preserve any .bak file or to completely ignore directories like .git. For these cases rmd provides two useful options:
--ignore-extensions specifies a list of extensions that will be ignored by rmd.
rmd --ignore-extensions bak --duplicates
removes any duplicate file in the current directory, and recursively in all subdirectories, ignoring any file with the .bak extension. So two identical files “file.rs.bak” and “copy-file.rs.bak” will both be preserved. The original “file.bak” (if it is unique) will also be preserved, because .bak files are completely ignored.
rmd --ignore-extensions bak pdf mp3 --larger 40kb project
removes all files larger than or equal to 40 kilobytes in the project directory, and recursively in all subdirectories, except files with the .bak, .pdf and .mp3 extensions. So, for example, project/docs.pdf, a 4 MB file, will not be removed.
--ignore-directories specifies a list of directory names (just the last component of the path string) that will be ignored by rmd.
rmd --clean --ignore-directories xmas_photos --older 1y documents
removes any file older than one year in the documents directory, and recursively in all subdirectories, ignoring any directory named xmas_photos. If xmas_photos is empty it will not be removed: rmd simply never opens any directory named xmas_photos in the directory tree rooted in documents.
rmd --clean --ignore-directories important_files .git --duplicates /home/user
removes any duplicate file in the user's home directory, and recursively in all subdirectories, ignoring any directory named .git or important_files.
--ignore-directories and --ignore-extensions can be used together.
It is also possible to simply ignore hidden files and directories.
--ignore-unix-hidden
automatically ignores any file or directory whose name starts with ‘.’ (Unix-style hidden files). With --ignore-unix-hidden set, rmd skips hidden files and does not open hidden directories, so any non-hidden file inside a hidden directory is left untouched.
An example:
rmd -d --ignore-unix-hidden important_project
deduplicates important_project, ignoring hidden files and directories (such as .git).
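The three ignore rules can be summarised in a small Rust sketch. It is a hedged illustration of the behaviour described in this section, not rmd's source: the `Ignore` struct and its methods are hypothetical names, directory names are compared against the last path component only, and extensions are compared without the leading dot.

```rust
use std::path::Path;

/// Hypothetical container for the three ignore options.
struct Ignore<'a> {
    extensions: &'a [&'a str],  // values of --ignore-extensions
    directories: &'a [&'a str], // values of --ignore-directories
    unix_hidden: bool,          // --ignore-unix-hidden
}

impl Ignore<'_> {
    fn is_hidden(&self, name: &str) -> bool {
        self.unix_hidden && name.starts_with('.')
    }

    /// True when the directory must not be opened at all.
    fn skips_dir(&self, dir: &Path) -> bool {
        match dir.file_name().and_then(|n| n.to_str()) {
            // Only the last path component is compared.
            Some(name) => self.is_hidden(name) || self.directories.iter().any(|d| *d == name),
            None => false,
        }
    }

    /// True when the file must be left untouched.
    fn skips_file(&self, file: &Path) -> bool {
        let hidden = file
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, |n| self.is_hidden(n));
        let extension = file.extension().and_then(|e| e.to_str());
        hidden || extension.map_or(false, |e| self.extensions.iter().any(|x| *x == e))
    }
}
```

Because `skips_dir` is checked before a directory is opened, everything below an ignored directory is left alone without being inspected.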
rmd, during an automatic removal, prompts for each file that needs to be deleted; during a standard removal it prompts just once.
The specification string, in both time and size removal, can contain any number of non-alphanumeric characters between a number and a descriptor or between a descriptor and a number; those characters are simply treated as separators. The important things are NOT to put separators inside numbers or inside descriptors, and to quote the specification string properly so that it is treated as a single argument.
The -c/--clean flag deletes directories left empty after an automatic file removal (i.e. rmd run with newer, older, duplicates, smaller or larger).
This operation is done from the bottom of the directory tree, so directories that contain only directories (recursively), without any file, are considered empty.
So, although not technically empty, those directories will be removed. Pay attention ;-). The clean flag has no effect when used in standard mode.
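The bottom-up cleaning described above can be sketched with the standard library alone. This is an assumption-laden illustration, not rmd's code: the function name `remove_empty_dirs` is hypothetical, the sketch also removes the starting directory when it ends up empty, and symbolic links are counted as content and never followed.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Removes `dir` if, after cleaning its subdirectories, nothing is left in it.
/// Returns Ok(true) when `dir` itself was removed.
fn remove_empty_dirs(dir: &Path) -> io::Result<bool> {
    let mut has_content = false;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // DirEntry::file_type does not follow symlinks, so a symlink to a
        // directory is treated as a plain entry and keeps its parent alive.
        if entry.file_type()?.is_dir() {
            // Visit children first: this is what makes the walk bottom-up.
            if !remove_empty_dirs(&entry.path())? {
                has_content = true;
            }
        } else {
            has_content = true;
        }
    }
    if has_content {
        Ok(false)
    } else {
        fs::remove_dir(dir)?;
        Ok(true)
    }
}
```

A tree such as `a/b/c` that contains no files is removed entirely, because `c` is deleted first, which leaves `b` empty, and so on up to `a`.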
Verbose flag can be set in standard or in automatic mode.
Log flag can be set in standard or in automatic mode.
Log and Verbose mode can work together.
The output generated by log and verbose is the same; only the destination changes. Log sends its output to syslog, verbose to stdout.
--ignore-extensions, --ignore-directories and --ignore-unix-hidden can be used only with an automatic remover.
When --ignore-unix-hidden and --clean are used together, empty hidden directories are safe from removal, just like non-empty hidden directories.
It is very likely that you will end up using --ignore-extensions and/or --ignore-directories with the same arguments over and over. In this scenario a good idea could be to add an alias to your shell configuration file, like
alias rmdd='rmd --ignore-extensions bak --ignore-directories .git .hg'
or a shell function like
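function rmd() {
    # `command` bypasses this function and runs the real rmd binary,
    # otherwise the function would call itself forever
    command rmd --ignore-extensions bak --ignore-directories .git .hg -- "$@"
}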