Tuesday, July 02, 2013

Let's see if I can free up 11GB (or: finding and taking care of duplicate files on a mac mini)

The lovely Carol has a mac mini, which currently serves[sic] as an NFS server for my desktop. It also backs up our laptops. This machine, which is backed up off-site, has a huge hard drive that I thought would keep us in disk space for a long long time.

You can guess what happened: photos and music—especially photos—tend to expand to fill the space available. It doesn't help that we have multiple copies of stuff. So I thought to run some Perl or Python script to help me find said copies.

Since I've become a Python partisan, I went that route. A web search turned up some helpful hints on Stack Overflow and particularly this post on endlesslycurious.com. The mac mini has Python 2.6, so I made a few modifications; you can see the whole thing at http://cpwriter.net/dup2/.

I ran that on /Users on the lovely Carol's mac mini, putting the results into dups.out.
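
For anyone who wants the general idea without following the link, here is a minimal sketch of one common way to find duplicates: group files by size first, then hash only the files that share a size. This is not the script linked above; it is written for a current Python 3 rather than 2.6, and the names find_duplicates, file_digest and min_size are made up for illustration. The 1 MiB default for min_size is an assumption. It returns entries shaped like the [size, [paths]] lines in dups.out.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, min_size=1024 * 1024):
    """Return [size, [paths...]] entries for files with identical content.

    min_size is an assumed threshold: smaller files are ignored entirely.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size >= min_size:
                by_size[size].append(path)

    duplicates = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        for same in by_hash.values():
            if len(same) > 1:
                duplicates.append([size, sorted(same)])
    # Largest files first, since those are the big offenders.
    return sorted(duplicates, reverse=True)
```

Calling find_duplicates("/Users") and printing each entry on its own line would give a report in roughly the same shape as dups.out. Hashing only within same-size groups keeps the expensive reads to a minimum.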

mini1:~ collin$ wc -l dups.out
   12489 dups.out
mini1:~ collin$ 
Yep, that's a lot of files. A couple of big offenders:
mini1:~ collin$ grep Best.*Wedding dups.out
[621850501, ['/Users/carol/from-macbook/Movies/Best of Wedding.mov', \
   '/Users/collin/from-pbook/Desktop/Best of Wedding.mov']]
[989954048, ['/Users/carol/from-macbook/Desktop/Redeemer Marriage Series/Best of Wedding-DVD.img', \
   '/Users/collin/from-pbook/Desktop/Best of Wedding-DVD.img']]
mini1:~ collin$ 
That's 621Mbytes and 989Mbytes. So about 1.5GB freed up just like that. But I think we have a lot more. I discovered a lot of files under "archives" and "from-pbook" that are the same, like this:
mini1:~ collin$ grep archives dups.out|grep -m5 from-pbook
[1049177, ['/Users/collin/archives/collin-laptop/Pictures/iPhoto Library/2009/01/01/IMG_3001.JPG', \
   '/Users/collin/archives/data1/pix-dec08/img_3001.jpg', \
   '/Users/collin/from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_3001.JPG', \
   '/Users/collin/pix/2008/12/pix-dec08/img_3001.jpg']]
…
Wow, four paths to the same file. Hey, can I get rid of all those pix-dec08 paths? Yes, because:
  1. A "diff -r archives/data1/pix-dec08 pix/2008/12/pix-dec08" showed that these two directories are identical;
  2. every "large" (not a thumbnail or slide) image file under pix/2008/12/pix-dec08/ appeared in dups.out. The only exceptions are files smaller than 1024×1024 bytes (1 MiB):
    mini1:~ collin$ for F in pix/2008/12/pix-dec08/*jpg; do if grep -qF $F dups.out; then : OK; else ls -l $F; fi; done       
    -rwxr-xr-x  1 collin  _lpoperator  1002328 Dec 31  2008 pix/2008/12/pix-dec08/img_2961.jpg
    -rwxr-xr-x  1 collin  _lpoperator  858104 Jan  1  2009 pix/2008/12/pix-dec08/img_2988.jpg
    -rwxr-xr-x  1 collin  _lpoperator  863361 Jan  1  2009 pix/2008/12/pix-dec08/img_2994.jpg
    -rwxr-xr-x  1 collin  _lpoperator  865777 Jan  1  2009 pix/2008/12/pix-dec08/img_2995.jpg
    -rwxr-xr-x  1 collin  _lpoperator  994298 Jan  1  2009 pix/2008/12/pix-dec08/img_2996.jpg
    -rwxr-xr-x  1 collin  _lpoperator  811491 Jan  1  2009 pix/2008/12/pix-dec08/img_2997.jpg
    mini1:~ collin$
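
The check in item 2 can also be sketched in Python. This assumes that each line of dups.out is a Python list literal of the form [size, [path, ...]] holding absolute paths (the backslash line breaks shown above are only there for display); listed_paths and unlisted_files are hypothetical helper names.

```python
import ast
import os


def listed_paths(report_lines):
    """Collect every path named in lines of the form [size, [path, ...]]."""
    paths = set()
    for line in report_lines:
        line = line.strip()
        if not line:
            continue
        _size, group = ast.literal_eval(line)
        paths.update(group)
    return paths


def unlisted_files(directory, listed, suffix=".jpg"):
    """Return (size, path) for files in directory that the report never mentions.

    Only the directory itself is scanned, not its subdirectories, which
    mirrors the *jpg glob in the shell loop.
    """
    missing = []
    for name in sorted(os.listdir(directory)):
        path = os.path.abspath(os.path.join(directory, name))
        if name.endswith(suffix) and os.path.isfile(path) and path not in listed:
            missing.append((os.path.getsize(path), path))
    return missing
```

With the report opened as a file, unlisted_files("pix/2008/12/pix-dec08", listed_paths(report)) should name the same handful of files as the ls output above, provided the paths in the report are absolute paths as seen from the machine running the check.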
I'm going to take the leap of faith that the remaining files are in fact there in the other paths... well, no I won't:
mini1:~ collin$ for F in pix/2008/12/pix-dec08/*jpg; do if grep -qF $F dups.out; then : OK; else \
     Y=`basename $F|tr [:lower:] [:upper:]`; \
     Z=`/bin/ls from-pbook/Pictures/iPhoto\ Library/200*/*/*/$Y`; \
     echo $Z;cmp "$F" "$Z"; echo; fi; done                                                          
from-pbook/Pictures/iPhoto Library/2008/12/31/IMG_2961.JPG

from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2988.JPG

from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2994.JPG

from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2995.JPG

from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2996.JPG

from-pbook/Pictures/iPhoto Library/2009/01/01/IMG_2997.JPG

mini1:~ collin$ 
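cmp printed nothing for any of the pairs, which means each remaining file is byte-for-byte identical to its iPhoto Library copy. So we can kill off those two pix-dec08 paths. That should save another Gbyte or so.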
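
The same basename-and-cmp check can be sketched in Python. This is a rough equivalent of the loop above, not something I actually ran; compare_with_counterparts is a made-up name, and the three "*" levels stand in for the year/month/day directories of the iPhoto Library layout.

```python
import filecmp
import glob
import os


def compare_with_counterparts(directory, library_root, suffix=".jpg"):
    """Map each file in directory to its byte-identical counterparts.

    A counterpart is a file under library_root/<year>/<month>/<day>/ whose
    name is the upper-cased name of the original (img_1.jpg -> IMG_1.JPG).
    An empty list means no identical copy was found.
    """
    results = {}
    # Escape the root so characters such as "[" in it are taken literally.
    root = glob.escape(library_root)
    for name in sorted(os.listdir(directory)):
        if not name.endswith(suffix):
            continue
        path = os.path.join(directory, name)
        candidates = glob.glob(os.path.join(root, "*", "*", "*", name.upper()))
        # shallow=False forces a comparison of the contents, like cmp does.
        results[path] = [c for c in sorted(candidates)
                         if filecmp.cmp(path, c, shallow=False)]
    return results
```

Any file that maps to an empty list would need a closer look before its directory is deleted.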

Now, can we maybe hardlink the /Users/collin/archives/collin-laptop/Pictures/ stuff to/from the /Users/collin/from-pbook/Pictures/ stuff? And how much space might that save?

mini1:~ collin$ du -sh archives/collin-laptop/Pictures/iPhoto\ Library/ from-pbook/Pictures/iPhoto\ Library/                                    
 12G archives/collin-laptop/Pictures/iPhoto Library/
 11G from-pbook/Pictures/iPhoto Library/
mini1:~ collin$ 
Quite a bit. That plus the 1.5GB already saved earlier would be a significant help here:
collin@p3:/mnt/home/collin> df -h .
Filesystem            Size  Used Avail Use% Mounted on
mini1:/Users          298G  257G   42G  87% /mnt/home
collin@p3:/mnt/home/collin> ssh mini1 df -h .
Filesystem     Size   Used  Avail Capacity  Mounted on
/dev/disk0s2  298Gi  256Gi   41Gi    87%    /
collin@p3:/mnt/home/collin> 
Not sure why the difference, but there it is. Anyway, I wanted to hardlink one set of files to the other. (Why? Because the from-pbook directory may get rsync'd. If I delete the from-pbook directory, then it may come back later. And if I delete the other directory, and subsequently decide to remove the files from the pbook, then we'll lose the photos. So hardlink is the way to go.) Consequently I wrote this silly script:
collin@p3:/mnt/home/collin> cat tmp/photos.sh 
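#!/bin/bash
# Print "ln -f" commands that hardlink identical files found at the same
# relative path under D1 and D2. Nothing is changed here; the output is
# reviewed first and run afterwards. Uses [[ ]], so it needs bash, not plain sh.
# NOTE: the generated commands assume no path contains a single quote.
D2="archives/collin-laptop/Pictures/iPhoto Library"
D1="from-pbook/Pictures/iPhoto Library"

find "$D1" -type f | while IFS= read -r AFILE; do
    SUB=${AFILE#"$D1"/}
    #echo SUB=$SUB
    BFILE=$D2/$SUB
    # The counterpart must exist and be non-empty, must not already be the
    # same inode, and must have identical contents.
    if [[ -s $BFILE ]] && [[ ! "$AFILE" -ef "$BFILE" ]] && cmp -s "$AFILE" "$BFILE"; then
        # Keep the older file and link it over the newer one.
        if [[ $AFILE -ot $BFILE ]] ; then
            echo ln -f "'$AFILE'" "'$BFILE'"
        else
            echo ln -f "'$BFILE'" "'$AFILE'"
        fi
    fi
done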
collin@p3:/mnt/home/collin> time tmp/photos.sh > foo.out

real 41m18.878s
user 0m11.009s
sys 0m33.146s
collin@p3:/mnt/home/collin> 
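
The same plan-then-apply idea could be sketched in Python as well. This is only an illustration under a few assumptions, not what I ran: plan_hardlinks and apply_plan are hypothetical names, the ".hardlink-tmp" suffix is arbitrary, and both directories must live on the same filesystem for hard links to work.

```python
import filecmp
import os


def plan_hardlinks(dir_a, dir_b):
    """Return (keep, replace) pairs for identical files at the same relative path."""
    plan = []
    for dirpath, _dirnames, filenames in os.walk(dir_a):
        for name in filenames:
            a_file = os.path.join(dirpath, name)
            b_file = os.path.join(dir_b, os.path.relpath(a_file, dir_a))
            if os.path.islink(a_file) or os.path.islink(b_file):
                continue  # only regular files, like find -type f
            if not (os.path.isfile(b_file) and os.path.getsize(b_file) > 0):
                continue  # no counterpart, or it is empty
            if os.path.samefile(a_file, b_file):
                continue  # already the same inode
            if not filecmp.cmp(a_file, b_file, shallow=False):
                continue  # contents differ
            # Keep the older file, as the shell script does.
            if os.path.getmtime(a_file) < os.path.getmtime(b_file):
                plan.append((a_file, b_file))
            else:
                plan.append((b_file, a_file))
    return plan


def apply_plan(plan):
    """Replace each duplicate with a hard link to the kept file."""
    for keep, replace in plan:
        temp = replace + ".hardlink-tmp"  # arbitrary temporary name
        os.link(keep, temp)        # create the new link first
        os.replace(temp, replace)  # then swap it in over the duplicate
```

Keeping the planning separate from the linking leaves room for the same kind of sanity check on the plan before anything on disk changes.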
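Then I ran it, as you see above. A quick sanity check of "foo.out" looked reasonable. I probably should have run it on mini1 rather than on the NFS client: 41 minutes of wall-clock time against well under a minute of CPU time suggests that most of the run was spent waiting on NFS. The same goes for the next step, where I actually ran the generated commands: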
collin@p3:/mnt/home/collin> df -h .; ./foo.out; df -h .
Filesystem            Size  Used Avail Use% Mounted on
mini1:/Users          298G  257G   42G  87% /mnt/home
-bash: ./foo.out: Permission denied   # D'oh! I didn't say "chmod +x"; well, let me fix it the easy way...
Filesystem            Size  Used Avail Use% Mounted on
mini1:/Users          298G  257G   42G  87% /mnt/home
collin@p3:/mnt/home/collin> df -h .; sh ./foo.out; df -h .
Filesystem            Size  Used Avail Use% Mounted on
mini1:/Users          298G  257G   42G  87% /mnt/home
Filesystem            Size  Used Avail Use% Mounted on
mini1:/Users          298G  246G   53G  83% /mnt/home
collin@p3:/mnt/home/collin> 
OK, that's enough for now.
