The lovely Carol has a mac mini, which currently serves[sic] as an NFS server for my desktop.
It also backs up our laptops. This machine, which is backed up off-site, has a huge hard drive
that I thought would keep us in disk space for a long long time.
You can guess what happened: photos and music—especially photos—tend to
expand to fill the space available. It doesn't help that we have multiple copies of stuff.
So I thought I'd run a Perl or Python script to help me find said copies.
Since I've become a Python partisan, I went that route. A web search turned up some helpful hints on Stack Overflow, and particularly this post on endlesslycurious.com. The Mac mini has Python 2.6, so I made a few
modifications; you can see the whole thing at http://cpwriter.net/dup2/.
I ran that on /Users on the lovely Carol's mac mini, putting the results into dups.out.
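For readers who don't want to follow the link, here is a minimal sketch of the general technique: group files by size, then hash only the files whose size collides. This is not the script linked above. The names find_duplicates and file_digest, the min_size parameter, and the choice of SHA-1 are all assumptions made for illustration, and it targets Python 3 rather than the 2.6 on the mini. It returns entries shaped like the [size, [paths]] lines shown in dups.out below.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root, min_size=1):
    """Return a sorted list of [size, [paths...]] for files with equal content.

    min_size is a hypothetical knob for skipping small files.
    """
    # Pass 1: bucket regular files by size (cheap, no file reads).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size >= min_size:
                by_size[size].append(path)

    # Pass 2: hash only files that share a size with at least one other file.
    results = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        for group in by_hash.values():
            if len(group) > 1:
                results.append([size, sorted(group)])
    return sorted(results)
```

Printing each entry of find_duplicates("/Users", min_size=1024 * 1024) on its own line would give a file in the same general shape as dups.out. Note that this sketch does no error handling for unreadable files, and files that are already hard links to each other are reported as duplicates too.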
mini1:~ collin$ wc -l dups.out
12489 dups.out
mini1:~ collin$
Yep, that's a lot of files. A couple of big offenders:
mini1:~ collin$ grep Best.*Wedding dups.out
[621850501, ['/Users/carol/from-macbook/Movies/Best of Wedding.mov', \
'/Users/collin/from-pbook/Desktop/Best of Wedding.mov']]
[989954048, ['/Users/carol/from-macbook/Desktop/Redeemer Marriage Series/Best of Wedding-DVD.img', \
'/Users/collin/from-pbook/Desktop/Best of Wedding-DVD.img']]
mini1:~ collin$
That's about 622 MB and 990 MB, so roughly 1.5 GiB (1.6 GB) freed up just like that.
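The arithmetic can be checked with a few lines of Python. This is a sketch under one assumption: that each line of dups.out is a single Python list literal of the form [size, [paths]], with the backslash line breaks shown above being only for display. The function name reclaimable_bytes is made up for this example.

```python
import ast


def reclaimable_bytes(lines):
    """Sum the bytes freed if only one copy of each duplicate set is kept."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each line is assumed to be a literal like [size, [path, path, ...]].
        size, paths = ast.literal_eval(line)
        total += size * (len(paths) - 1)
    return total


# The two entries from the grep output above, rejoined onto single lines.
sample = [
    "[621850501, ['/Users/carol/from-macbook/Movies/Best of Wedding.mov', "
    "'/Users/collin/from-pbook/Desktop/Best of Wedding.mov']]",
    "[989954048, ['/Users/carol/from-macbook/Desktop/Redeemer Marriage Series/"
    "Best of Wedding-DVD.img', "
    "'/Users/collin/from-pbook/Desktop/Best of Wedding-DVD.img']]",
]
freed = reclaimable_bytes(sample)
print(freed, "bytes =", round(freed / 1e9, 2), "GB =", round(freed / 2**30, 2), "GiB")
# prints: 1611804549 bytes = 1.61 GB = 1.5 GiB
```

Feeding it the whole file, for example reclaimable_bytes(open("dups.out")), would give an upper bound on what deleting every redundant copy could free.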
But I think we have a lot more. I discovered a lot of files under "archives" and "from-pbook"
that are the same, like this:
mini1:~ collin$ grep archives dups.out|grep -m5 from-pbook
[1049177, ['/Users/collin/archives/collin-laptop/Pictures/iPhoto Library/2009/01/01/IMG_3001.JPG', \
'/Users/collin/archives/data1/pix-dec08/img_3001.jpg', \
'/Users/collin/from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_3001.JPG', \
'/Users/collin/pix/2008/12/pix-dec08/img_3001.jpg']]
…
Wow, four paths to the same file. Hey, can I get rid of all those pix-dec08 paths? Yes, because:
- A "diff -r archives/data1/pix-dec08 pix/2008/12/pix-dec08" showed that these two directories are
identical;
- every full-size image file (not a thumbnail or slide) under pix/2008/12/pix-dec08/ appeared
in dups.out, except those smaller than 1024×1024 bytes (1 MiB):
mini1:~ collin$ for F in pix/2008/12/pix-dec08/*jpg; do if grep -qF $F dups.out; then : OK; else ls -l $F; fi; done
-rwxr-xr-x 1 collin _lpoperator 1002328 Dec 31 2008 pix/2008/12/pix-dec08/img_2961.jpg
-rwxr-xr-x 1 collin _lpoperator 858104 Jan 1 2009 pix/2008/12/pix-dec08/img_2988.jpg
-rwxr-xr-x 1 collin _lpoperator 863361 Jan 1 2009 pix/2008/12/pix-dec08/img_2994.jpg
-rwxr-xr-x 1 collin _lpoperator 865777 Jan 1 2009 pix/2008/12/pix-dec08/img_2995.jpg
-rwxr-xr-x 1 collin _lpoperator 994298 Jan 1 2009 pix/2008/12/pix-dec08/img_2996.jpg
-rwxr-xr-x 1 collin _lpoperator 811491 Jan 1 2009 pix/2008/12/pix-dec08/img_2997.jpg
mini1:~ collin$
I'm going to take the leap of faith that the remaining files are in fact there in the other paths... well, no I won't:
mini1:~ collin$ for F in pix/2008/12/pix-dec08/*jpg; do if grep -qF $F dups.out; then : OK; else \
Y=`basename $F|tr [:lower:] [:upper:]`; \
Z=`/bin/ls from-pbook/Pictures/iPhoto\ Library/200*/*/*/$Y`; \
echo $Z;cmp "$F" "$Z"; echo; fi; done
from-pbook/Pictures/iPhoto Library/2008/12/31/IMG_2961.JPG
from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2988.JPG
from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2994.JPG
from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2995.JPG
from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2996.JPG
from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2997.JPG
mini1:~ collin$
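cmp printed nothing for any of them, which means each file is byte-identical to its iPhoto Library copy. So we can kill off those two pix-dec08 paths. That might save another Gbyte or so.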
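The same check (find a same-named file in the iPhoto Library, ignoring case, and compare the bytes) can be sketched in Python. The function names here are hypothetical, and unlike the shell loop this version walks the whole library tree rather than relying on the year/month/day glob, so it is slower but makes no assumption about the directory layout.

```python
import filecmp
import os


def find_identical_copy(path, library_root):
    """Return a file under library_root with the same name (ignoring case)
    and the same bytes as `path`, or None if there is none."""
    wanted = os.path.basename(path).lower()
    for dirpath, _dirnames, filenames in os.walk(library_root):
        for name in filenames:
            if name.lower() != wanted:
                continue
            candidate = os.path.join(dirpath, name)
            # shallow=False forces a content comparison, like cmp(1).
            if filecmp.cmp(path, candidate, shallow=False):
                return candidate
    return None


def files_without_copy(directory, library_root, suffix=".jpg"):
    """List files in `directory` with no byte-identical copy in the library."""
    missing = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.lower().endswith(suffix) and os.path.isfile(path):
            if find_identical_copy(path, library_root) is None:
                missing.append(path)
    return missing
```

An empty list from files_without_copy("pix/2008/12/pix-dec08", "from-pbook/Pictures/iPhoto Library") would correspond to the silent cmp run above: every image has an identical twin, so the directory is safe to delete.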
Now, can we maybe hardlink the
/Users/collin/archives/collin-laptop/Pictures/ stuff to/from the /Users/collin/from-pbook/Pictures/
stuff? And how much space might that save?
mini1:~ collin$ du -sh archives/collin-laptop/Pictures/iPhoto\ Library/ from-pbook/Pictures/iPhoto\ Library/
12G archives/collin-laptop/Pictures/iPhoto Library/
11G from-pbook/Pictures/iPhoto Library/
mini1:~ collin$
Quite a bit. That plus the 1.5GB already saved earlier would
be a significant help here:
collin@p3:/mnt/home/collin> df -h .
Filesystem Size Used Avail Use% Mounted on
mini1:/Users 298G 257G 42G 87% /mnt/home
collin@p3:/mnt/home/collin> ssh mini1 df -h .
Filesystem Size Used Avail Capacity Mounted on
/dev/disk0s2 298Gi 256Gi 41Gi 87% /
collin@p3:/mnt/home/collin>
Not sure why the difference, but there it is.
Anyway, I wanted to hardlink one set of files to the other. (Why? Because the
from-pbook directory may get rsync'd. If I delete the from-pbook directory,
then it may come back later. And if I delete the other directory, and subsequently
decide to remove the files from the pbook, then we'll lose the photos. So
hardlink is the way to go.)
Consequently I wrote this silly script:
collin@p3:/mnt/home/collin> cat tmp/photos.sh
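#!/bin/bash
# Print (but do not run) "ln -f" commands that hardlink identical files
# found at the same relative path under D1 and D2.
# Uses [[ ]], -ef and -ot, so it needs bash rather than plain sh.
# Caveat: a path containing a single quote would break the generated command.
D2="archives/collin-laptop/Pictures/iPhoto Library"
D1="from-pbook/Pictures/iPhoto Library"
find "$D1" -type f | while IFS= read -r AFILE; do
    SUB=${AFILE#"$D1"/}
    #echo SUB=$SUB
    BFILE=$D2/$SUB
    # BFILE must exist and be non-empty, must not already be the same inode,
    # and must have identical contents.
    if [[ -s $BFILE ]] && [[ ! "$AFILE" -ef "$BFILE" ]] && cmp -s "$AFILE" "$BFILE"; then
        # Keep the older file and relink the newer name to it.
        if [[ $AFILE -ot $BFILE ]] ; then
            echo ln -f "'$AFILE'" "'$BFILE'"
        else
            echo ln -f "'$BFILE'" "'$AFILE'"
        fi
    fi
done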
collin@p3:/mnt/home/collin> time tmp/photos.sh > foo.out
real 41m18.878s
user 0m11.009s
sys 0m33.146s
collin@p3:/mnt/home/collin>
I then ran it, as you see above. A quick sanity
check of "foo.out" looked reasonable. Ah, I probably should have run it
on mini1 rather than on the NFS client. And the same goes for the next step:
collin@p3:/mnt/home/collin> df -h .; ./foo.out; df -h .
Filesystem Size Used Avail Use% Mounted on
mini1:/Users 298G 257G 42G 87% /mnt/home
-bash: ./foo.out: Permission denied # D'oh! I didn't say "chmod +x"; well, let me fix it the easy way...
Filesystem Size Used Avail Use% Mounted on
mini1:/Users 298G 257G 42G 87% /mnt/home
collin@p3:/mnt/home/collin> df -h .; sh ./foo.out; df -h .
Filesystem Size Used Avail Use% Mounted on
mini1:/Users 298G 257G 42G 87% /mnt/home
Filesystem Size Used Avail Use% Mounted on
mini1:/Users 298G 246G 53G 83% /mnt/home
collin@p3:/mnt/home/collin>
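Since I'm a Python partisan, here is roughly how the same hardlinking pass might look in Python 3. Treat it as a sketch, not a tested replacement for the shell script: the function name hardlink_duplicates, the dry_run flag, and the ".hardlink-tmp" suffix are all made up for this example. It follows the same rules as the script (same relative path, non-empty, not already the same inode, identical bytes, keep the older file), but links to a temporary name and renames it into place so the replaced name never briefly disappears.

```python
import filecmp
import os


def hardlink_duplicates(dir_a, dir_b, dry_run=True):
    """Replace identical files at matching relative paths with hard links.

    Returns a list of (kept, replaced) path pairs. With dry_run=True
    nothing on disk is changed.
    """
    actions = []
    for dirpath, _dirnames, filenames in os.walk(dir_a):
        for name in filenames:
            a = os.path.join(dirpath, name)
            b = os.path.join(dir_b, os.path.relpath(a, dir_a))
            if os.path.islink(a) or os.path.islink(b):
                continue
            if not os.path.isfile(b) or os.path.getsize(b) == 0:
                continue
            if os.path.samefile(a, b):
                continue  # already the same inode
            if not filecmp.cmp(a, b, shallow=False):
                continue  # contents differ
            # Keep the older file, as the shell script does.
            if os.path.getmtime(a) < os.path.getmtime(b):
                kept, replaced = a, b
            else:
                kept, replaced = b, a
            actions.append((kept, replaced))
            if not dry_run:
                tmp = replaced + ".hardlink-tmp"  # hypothetical temp name
                os.link(kept, tmp)         # second name for the kept inode
                os.replace(tmp, replaced)  # atomically swap it into place
    return actions
```

Calling it first with the default dry_run=True plays the role of inspecting foo.out: you get the list of planned links without touching anything. Hard links only work within one filesystem, so both directories have to live on the same volume.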
OK, that's enough for now.