Wednesday, October 07, 2009

Phone bill processing: MySQL, Python, bash

Here's an automated phone call I was actually glad to get: it was Verizon telling me we were close to exceeding our "shared minutes". Well, I wasn't glad about being close to the edge, but I was glad to know about it so we could do something about it.

I went to the website and grabbed the call information: one of our lines had used 297 minutes already. "Drilling down" to that line, I got a list that looked kind of like this (numbers, etc., have been changed):
Date Time Number Minutes Desc
12/11/2008     7:35 PM     2025558713     2 Paris
12/11/2008 7:35 PM 2025558713 1 INCOMING
11/11/2008 7:19 PM 2025558713 1 Paris
12/11/2008 6:32 PM 2025559062 1 INCOMING
12/11/2008 4:34 PM 2025554587 1 INCOMING
12/10/2008 7:58 PM 2025553291 4 ATLANTIC CY
12/10/2008 7:54 PM 2025555994 1 ATLANTIC CY
12/10/2008 1:11 PM 2025550001 1 DATA
12/10/2008 1:03 PM 2025553862 2 EL CENTRO
[[... about 180 lines elided...]]

Unfortunately, this doesn't quite tell me what I want to know, viz., which numbers were the "heaviest" in minutes?

Fortunately, I have a Linux box, in my case SuSE 9.3 at home, and OpenSuSE 10.2 at the office. I snarf'n'barfed it into a file named "297m.txt" and considered how to process it.

A few possibilities came to mind:
  1. shell script -- using 'dc' or 'expr' to add up the times
  2. Python -- use dictionaries (one entry per phone number)
  3. MySQL -- using "SELECT PhoneNum SUM(mins) GROUP BY PhoneNum"
One could write something in Perl, Java, or C, but why bother? I think it was the great Brooks (in the Mythical Man-Month?) who said that you should use the highest level language that could be appropriate.

The shell (bash)

This is perhaps the least elegant of the three solutions we'll consider, but it has a sort of retro appeal, particularly if you're running on a 20th century computer, or one that just doesn't have a database or Python.

So... how to keep track of data? The shell has arrays, but it's a lot easier to just create one file for each phone number and add the numbers up in each one.
INPUT=../297m.txt
rm -rf purgeme # a directory to put the files in
mkdir purgeme
grep '^[1-9]' $INPUT | while read date time ampm num min city; do
# The grep gives you lines that begin with a digit..
# The "read" takes "words" from each line; note
# that you need a shell variable for each word --
# one for "7:35" and one for "PM" for example

echo $min >> purgeme/$num
# The above says: to the file named PHONE_NUM, append
# a line containing the number of minutes. So for the above
# sample data, the file named '2025558713' would have 3 lines:
# 2
# 1
# 1

done

# OK, now we have a bunch of files in the "purgeme"
# directory. We're going to have "dc" calculate the
# totals for each file. To do this we'll put a '+' at the
# end of each line in the file. We give "dc" a "0" to start
# with, add each number in, and finally we'll say "p" so that
# he'll give us a total. The whole thing goes into `backquotes`
# so that we have the result of running the program.
# OK, here goes:

for f in purgeme/*; do
echo ${f#*/} `{ echo 0; sed 's/$/+/' $f; echo p; } | dc`
# We put the phone number first (that "#*/" stuff
# says to remove everything from the beginning of
# "$f" until the first '/') and then the total
# calculated by "dc" (explained above)

done | sort -k2nr | head
# OK, the "done" is the end of the "for" loop. We
# sort the for-loop's output -- "-k2nr" means the key
# (thus 'k') is in field #2, numeric (the 'n'),
# and reverse order (the 'r'). The "head" means
# we only want the first 10 lines (the Top Ten phone
# numbers, as defined by minutes, that is).
When run on the phone bill data, the result is:
$ ./doit.sh
2025553788 196
2025553291 141
2025555994 67
2025555521 57
2025556932 54
2025553799 45
2025555876 44
2025559980 38
2025557102 32
2025556930 23
As it turned out, the -3788 number is in our vendor's network, so that 196 there didn't push us toward our limit. The -3291 number, and the -5521 number, though, did. We were able to modify our calling patterns and avoid the onerous over-limit charge. Whew!

Python

I've become rather a Python fan, which doesn't mean that I'm any good at it. But fortunately, this problem is simple enough that a few lines of Python will handle it:
#!/usr/bin/python -tt
fonebill = open("../297m.txt", "r")
mapping = {}
for a_line in fonebill:
try:
(date, hrs, ampm, nr, mins, where) = a_line.split(None, 5)
except:
# because some lines are blank or something
# Actually, this catches the headings "Date Time..."

continue
# DEBUG CODE #print "nr", nr, "mins", mins
mapping[nr] = mapping.get(nr, 0) + int(mins)
fonebill.close()
# I don't know how to sort a dictionary, so make a list.
alist = []
for a_num in mapping:
alist.append((a_num, mapping[a_num]))
alist.sort(lambda x,y: cmp(y[1], x[1]))
for (x,y) in alist:
print x,y
We can run it like this and get results:
$ ./pscript2.py | head
2025553788 196
2025553291 141
2025555994 67
2025555521 57
2025556932 54
2025553799 45
2025555876 44
2025559980 38
2025557102 32
2025556930 23
That "| head" you see there is so that we get only the Top Ten phone numbers (and the total of minutes for each).

No surprises -- the numbers and their totals match what we got using the other method. It's nice when computer science actually works.

MySQL

This is in my view the most elegant and tersest solution. So here's the deal. You need access to MySQL server somewhere. If you're on a reasonably current GNU/Linux distribution, you can probably get one right on your desktop or laptop. The client program is called simply "mysql". You might have to install it (using "sudo rpm..." or "sudo apt-get...") and you might have to start the server (on my box it's "sudo /etc/init.d/mysql start")...

What makes this program so easy in MySQL is that we can create a temporary table, stuff our data in there, yank it out, and then pitch the temp table. Here is the unvarnished truth -- the commands I gave MySQL to produce the desired report:
  • use test
    Use (d'oh!) the database named "test" -- in other words, if you refer to table XYZ rather than <DATABASE_NAME>.XYZ, MySQL will interpret that to mean test.XYZ. The database "test" is world-writable I think in a default MySQL installation. By the way, "use" doesn't require a semicolon as a statement terminator.
  • create temporary table fb (nr INT(10), mins INT(10));
    Create a temporary table (really!) named "fb", for fone bill. It contains two up-to-10-digit INTeger fields, named "nr" (for NumbeR) and "mins" (minutes).
  • load data local infile 'xxx' into table fb fields terminated by ' ';
    This says to load the table "fb" with data from a file. In this case it's a local file (meaning here on the MySQL client). If not local, you have to supply the filename as it would appear to the server. I don't like to do that, because the server is likely running as another userid (like "mysql" maybe?) and it isn't necessarily allowed to read the file/directory that has the data. But it's my file, so I know where it is and certainly have permission to read it, hence I prefer "local".
  • select nr, sum(mins) as tot from fb group by nr order by tot desc limit 10;
    The "select" statement is the workhorse of SQL. Let me break this one down for you.
    • select nr, sum(mins)
      So we're going to select the phone number (nr), and we're going to add up (sum) the minutes (mins)
    • as tot
      This means we're going to refer to the column as "tot" rather than "sum(mins)"
    • from fb
      We're selecting from the table named "fb" -- i.e., the temporary table we created above.
    • group by nr
      Whenever you have a "sum" or "avg" or "min" (etc) in a table, you have to answer the question "of what group?" -- hence you need a "group by" clause in your SELECT. We're saying here to add up the "mins" among entries which have the same "nr" value.
    • order by tot desc
      If you don't say "order by" then MySQL will spit out the result in whatever order it feels like. Here we want to order by the value of "tot" (which is the sum of the minutes for each phone number) and we want the largest number first, hence I specified desc (for "DESCending").
    • limit 10;
      says we only want the first ten results.

You can probably think of several improvements, but here's the actual script that invokes MySQL with the above commands.

#!/bin/bash
awk '{print $4,$5;}' ../297m.txt | grep '^[0-9]' > xxx
# That "awk" thing prints the 4th and 5th "words"
# from each input line. The result is fed to
# "grep", which gives us only lines that begin with digits
# (because we don't want that "Date Time Number..." line).

# The following line runs mysql giving parameters "-t" (tabular
# output) and "--local-infile=1" (because we're going to
# tell it to take something from a local inputfile soon).
# The "<< ThatsAllFolks" means "Feed this script
# to mysql's stdin from here until you see 'ThatsAllFolks'"

mysql -t --local-infile=1 << ThatsAllFolks
use test
create temporary table fb (nr INT(10), mins INT(10));
load data local infile 'xxx' into table fb fields terminated by ' ';
select nr, sum(mins) as tot from fb group by nr order by tot desc limit 10;
ThatsAllFolks

Running it gives this result.
$ ./doit.sh 
+------------+------+
| nr | tot |
+------------+------+
| 2025553788 | 196 |
| 2025553291 | 141 |
| 2025555994 | 67 |
| 2025555521 | 57 |
| 2025556932 | 54 |
| 2025553799 | 45 |
| 2025555876 | 44 |
| 2025559980 | 38 |
| 2025557102 | 32 |
| 2025556930 | 23 |
+------------+------+
Now perhaps I should have said
select nr as 'Number Called', sum(mins) as Minutes
so that the table would have nicer headings, but hey, it's only ASCII "art" anyway.

I enjoyed writing that; I hope it was interesting, helpful, or entertaining.

No comments: