I have a love/hate dislike/hate relationship with web statistics report generation tools. I'm using AWStats now, and it works fine enough. But very often, my clients want to know, "how many times has X page been viewed?" and "where are people being referred to Y page from?" These questions can not always be readily answered by a full stats reporting system.
A previous practice I used for a while was to import all the log data into an SQL database. I figured from there I could run whatever SQL queries I needed to learn whatever anyone needed to know - I could even whiz-bang-wow a client by doing it right at a meeting! So I never even tried to do that, but I did use it to build some simple reports for some clients on specific, targeted pages or areas of their site. But it always seemed like too much effort, especially for some quick, one-off questions a client might have.
Linux Shell Commands to the Rescue
So I thought I'd try something to see how it works. For the benefit of AWStats, I already had a consolidated, ordered log file of the entire time period in question in plain text, W3C format. This is just perfect for using the *nix shell tools on, isn't it? After SCPing the file to my local Linux box, I ran this series of commands, piping each to a new file.
grep
First, I grep'ped for the specific page I was looking for. The site in question runs ASP.NET and I'm using URL Rewriting, but as far as the logs are concerned, the page being requested is the 'real' one, complete with whatever ugly URL applies. So my grep argument was this: 'GET /Pages.aspx id=70' The file that resulted contained only log lines that were requests for that specific page. I then ran the file through some grep -v to quickly filter out results I knew we did not want, such as hits from my IP, the client's IP, and hits with the back-end management interface as the referrer.
cut
Once I was sure I had a file containing only the 'hits' on the page I wanted, I used cut -d' ' -f8 and piped that to a new file. This cut column 8 (which happens to be the referrer in my case) from the file, using the space as the delimiter. By default, cut works on characters, not delimited fields. the -d option lets you specify a field delimiter. I had to specify the delimiter in single quotes since I was using a space. For comparison, if the delimiter had been a colon, the command would have looked like this: cut -d: -f8 Note that there is no space between the -d option and its argument.
sort and uniq
Once I had a file containing nothing but the HTTP referrers of the page in question, I wanted a list of the referrers and their frequency, sorted from most to least frequent. Just a little bit of piping gets me there:
cat INPUT_FILE | sort | uniq -c | sort -nr > OUTPUT_FILE
Easy enough, eh? So what I am doing is easy enough to follow, really; my list of referrers first needs to be sorted (I could have used the -f option to sort to ignore case, but I did not need to, and you may not want to, actually) then it is piped in to uniq -c, which produces a list of only each unique line in the file and, thanks to the -c option, precedes each line with a count of how many times it exists. Next I need to sort it again; this time I am sorting it in reverse order (-r) and numerically (-n) and sending it to the final file, which contains exactly what the client wants.
Conclusion
Depending on what information is needed, the above can be a quicker way to produce it than other methods. I work with SQL day-in, day-out, so it was natural for me to initially think of that as the first solution. However, one thing I like about this is that I can work with it more 'on the fly'; at each stage I have an output file narrowed down appropriately and I can think of the information I want in a more step-by-step way.