Using the Shell for Quick Web Log Answers

by Andrew Barber 18. May 2010 05:20

I have a dislike/hate relationship with web statistics report generation tools. I'm using AWStats now, and it works well enough. But very often, my clients want to know, "How many times has X page been viewed?" and "Where are people being referred to Y page from?" These questions cannot always be readily answered by a full stats reporting system.

A previous practice I used for a while was to import all the log data into an SQL database. I figured that from there I could run whatever SQL queries I needed to learn whatever anyone needed to know; I could even whiz-bang-wow a client by doing it right at a meeting! I never actually tried that, but I did use the database to build some simple reports for clients on specific, targeted pages or areas of their sites. Still, it always seemed like too much effort, especially for the quick, one-off questions a client might have.

Linux Shell Commands to the Rescue

So I thought I'd try something to see how it works. For the benefit of AWStats, I already had a consolidated, ordered log file of the entire time period in question in plain text, W3C format. This is just perfect for using the *nix shell tools on, isn't it? After SCPing the file to my local Linux box, I ran this series of commands, piping each to a new file.

grep

First, I grep'ped for the specific page I was looking for. The site in question runs ASP.NET and I'm using URL Rewriting, but as far as the logs are concerned, the page being requested is the 'real' one, complete with whatever ugly URL applies. So my grep argument was this: 'GET /Pages.aspx id=70'. The resulting file contained only log lines that were requests for that specific page. I then ran the file through a few rounds of grep -v to quickly filter out results I knew we did not want, such as hits from my IP, hits from the client's IP, and hits with the back-end management interface as the referrer.
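As a rough sketch of that filtering step: the file names, the field layout, the IP addresses and the /admin/ path below are all made up for illustration, so substitute whatever applies to your own logs.

```shell
# Build a tiny sample log so the commands can be tried as-is.
# Hypothetical layout: date time client-ip method uri-stem uri-query status referrer
cat > access.log <<'EOF'
2010-05-01 10:00:01 198.51.100.7 GET /Pages.aspx id=70 200 http://www.example.com/search?q=widgets
2010-05-01 10:00:05 203.0.113.10 GET /Pages.aspx id=70 200 http://www.example.org/links.html
2010-05-01 10:01:12 198.51.100.7 GET /Pages.aspx id=12 200 -
2010-05-01 10:02:30 198.51.100.9 GET /Pages.aspx id=70 200 http://www.example.net/admin/edit.aspx
2010-05-01 10:03:44 198.51.100.22 GET /Pages.aspx id=70 200 http://www.example.com/search?q=widgets
EOF

# Keep only requests for the page of interest.
grep 'GET /Pages.aspx id=70' access.log > page70.log

# Drop hits that should not be counted. -F treats the pattern as a literal
# string; the surrounding spaces keep one IP from matching inside another.
grep -vF ' 203.0.113.10 ' page70.log | grep -vF '/admin/' > page70_clean.log
```

One thing to watch: the pattern 'GET /Pages.aspx id=70' would also match id=700. If that matters for your site, adding a trailing space to the pattern is one way to tighten it.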

cut

Once I was sure I had a file containing only the 'hits' on the page I wanted, I used cut -d' ' -f8 and redirected the output to a new file. This cut column 8 (which happens to be the referrer in my case) from the file, using the space as the delimiter. With the -f option, cut splits each line into fields, and by default it uses the tab character as the delimiter; the -d option lets you specify a different one. I had to put the delimiter in single quotes since I was using a space. For comparison, if the delimiter had been a colon, the command would have looked like this: cut -d: -f8. Note that no space is needed between the -d option and its argument.
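Here is a small sketch of the cut step on its own. The sample lines and file names are invented, and the referrer is field 8 only because of this made-up layout; check the #Fields: header of your own log before picking a field number.

```shell
# Three sample hits in a hypothetical space-delimited layout where the
# referrer is the eighth field.
cat > hits.log <<'EOF'
2010-05-01 10:00:01 198.51.100.7 GET /Pages.aspx id=70 200 http://www.example.com/a
2010-05-01 10:00:05 198.51.100.8 GET /Pages.aspx id=70 200 http://www.example.org/b
2010-05-01 10:03:44 198.51.100.9 GET /Pages.aspx id=70 200 http://www.example.com/a
EOF

# -d' ' sets the delimiter to a single space; -f8 selects the eighth field.
cut -d' ' -f8 hits.log > referrers.txt

# The same idea with a colon delimiter: print the first field of a
# colon-separated line.
printf 'alice:x:1000\n' | cut -d: -f1
```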

sort and uniq

Once I had a file containing nothing but the HTTP referrers of the page in question, I wanted a list of the referrers and their frequency, sorted from most to least frequent. Just a little bit of piping gets me there:

cat INPUT_FILE | sort | uniq -c | sort -nr > OUTPUT_FILE

Easy enough, eh? The list of referrers first needs to be sorted, because uniq only collapses identical lines that are adjacent to each other. (I could have used sort's -f option to ignore case, but I did not need to, and you may not want to.) The sorted list is then piped into uniq -c, which outputs each unique line once and, thanks to the -c option, precedes it with a count of how many times it occurred. Finally, I sort again, this time numerically (-n) and in reverse order (-r), and send the result to the final file, which contains exactly what the client wants.
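To see the counting pipeline work on something small, here is a sketch with an invented list of referrers (the file names are placeholders too). It reads the input file directly with sort rather than going through cat, which is equivalent.

```shell
# A made-up list of referrers, one per line.
cat > referrers.txt <<'EOF'
http://www.example.com/a
http://www.example.org/b
http://www.example.com/a
-
http://www.example.com/a
http://www.example.org/b
EOF

# sort groups identical lines together, uniq -c collapses each group and
# prefixes its count, and sort -nr orders the counts from highest to lowest.
sort referrers.txt | uniq -c | sort -nr > referrer_counts.txt

cat referrer_counts.txt
```

The result has one line per distinct referrer, with the most frequent one (three hits in this sample) at the top. The exact padding in front of each count varies between implementations of uniq.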

Conclusion

Depending on what information is needed, the above can be a quicker way to produce it than other methods. I work with SQL day-in, day-out, so it was natural for me to initially think of that as the first solution. However, one thing I like about this is that I can work with it more 'on the fly'; at each stage I have an output file narrowed down appropriately and I can think of the information I want in a more step-by-step way.

Random Stuff Learned/Remembered Lately

by Andrew Barber 4. April 2010 20:47

I have a bad habit of wanting to post something of a certain size and weight before I post anything at all. The end result? I don't post anything! As the beginnings of an attempt to break this habit, I offer some almost entirely unrelated items I have come across lately. The top two items are from some recent adventures in Linux administration, while the last one is a really dumb one with an ASP.NET app.

Paranoid hosts.deny Can Trip You Up

One of my Debian-based firewall servers had the 'paranoid' option set in its /etc/hosts.deny file. The purpose of this option is to block access from any remote host whose host name does not match its address. I imagine it works by pulling the PTR record for the IP, and then querying the resulting host name to see whether it resolves back to the same IP.
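For reference, an entry of this kind typically looks like the fragment below. This is a sketch of the usual form using the PARANOID wildcard from TCP wrappers; the exact line on any given system may differ.

```
# /etc/hosts.deny
# PARANOID matches any host whose name does not match its address.
ALL: PARANOID
```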

That system has had that setting for quite some time now, and it never caused me any trouble. Every connection I have used in that time has apparently had its DNS records properly configured. But I recently used a connection at a partner's office, which was provided through a small, local cable company. Hair-pulling ensued. While I had no trouble accessing the services on the systems behind the firewall, I could not connect to the SSH or VPN services provided by the firewall itself. Packet sniffing showed there were packets going back and forth. As I recall (it's been a few weeks now), something in syslog prompted me to think of hosts.deny.

I don't know how common it is for ISPs to have improperly set up PTR records, but it's something to keep in mind.

I Was Creating a New Grub menu.lst Every Kernel Upgrade!

On another Debian-based firewall system I have, I kept a backup copy of Grub's menu.lst file to restore over the automatically generated one every time I did a kernel upgrade. This is a big-time case of "RTFM!" The reason I was doing it has to do with the peculiarities of the hard drive subsystem on the motherboard in question, which required me to use the IDs of the drives.

But if I had simply read the comments in menu.lst, or read the man page, I would have known I had two options: 1) add special entries before/after the 'AUTOMAGIC KERNELS LIST' section or, more apropos for my situation, 2) edit the default options within that section. The default menu.lst file comes with comments that describe the process quite well; the key lines are commented with a single hash (#), and that hash must remain there. When a new kernel is configured, the defaults entered there are used. I was actually considering replacing this server with another one due to this issue, when all it took was literally a 30-second fix.
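As an illustration, the default-options part of a Debian menu.lst looks roughly like the abridged fragment below. The UUID value is a placeholder and the middle comment line is mine; the options in your own file will differ.

```
### BEGIN AUTOMAGIC KERNELS LIST
## ## Start Default Options ##
## (abridged) the single-hash lines are the ones update-grub reads; keep the hash
# kopt=root=UUID=<uuid-of-root-filesystem> ro
# groot=(hd0,0)
## ## End Default Options ##
```

After editing those lines, running update-grub regenerates the kernel entries using the new defaults.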

Don't Forget That Global.asax File!

Okay, you say, what kind of silly advice is that? Who would forget to upload one of the files of a web site? And wouldn't it be obvious when it happened? Of course, ASP.NET folks know that Global.asax is a special file. It's not a page that you load (requests for it are specifically denied by IIS by default, in fact), but it is frequently a vital file. Application-level code runs - or does not run - based on it.

I almost always use code-behinds in ASP.NET, where the 'web site files' (including Global.asax) merely contain markup and the names of the compiled classes from which they derive; those classes are programmed in the 'code-behind' files, which are pure C# code in my case. Even though all the code is compiled into an assembly, the web files are still needed (barring some special circumstances). The layout/design and markup code is still in the ASPX files, and you will get the standard "File Not Found" error if such a file was not uploaded.

But the Global.asax is a special case. If it does not exist, the result is simply that no application-level code is run. This should seem obvious, but without the Global.asax being present in the web directory, the ASP.NET runtime has no way of knowing which HttpApplication-derived class in your code, if any, should run. You could derive a dozen such classes and compile them into your assembly. You may have compiled only one, but not want it used. My point is that the Global.asax file and the markup it contains are how the runtime knows what you intended.

In my case, this oversight was the reason that a global variable set in Application_Start was valid on my development system but NULL on the staging server.

HTTPS Web Sites: Just One Per IP?

by Andrew Barber 14. December 2009 07:06

I came across a post the other day where someone stated a misconception: that you can only host one HTTPS web site per IP address available on a server. I think most fairly experienced web server admins know that this is not actually the case, and also know why the misconception came to be. Most web server documentation I've seen tells one how to exceed that false limit, but of course it does not say so in so many words!

Like pretty much everyone else who was ever a teenager, when someone says, "you can't do this", I want to know why. And I want to know why for the same reason I wanted to know, as a teenager, why I could not stay out past X time: so I can find a way around it. The long and short of the story is this: the actual limit for HTTPS sites is one per TCP socket, not one per IP address. So, for every combination of IP address and TCP port, an HTTPS site can be hosted. Note that Host Headers have nothing to do with this. However, for a number of public uses of HTTPS sites, varying the standard TCP port is not a good option, meaning that "one HTTPS site per IP" is still an effective standard.

Add/Remove Programs Cleaner Rescues (Kills!) Orphans

by Andrew Barber 13. November 2009 00:34

Sometimes a software uninstall might not complete fully on a Windows system, and you'll be left with an entry in Add/Remove Programs, even though the program files are no longer present. Attempting to remove the program from that list again will sometimes generate an error, and the entry will not be removed, leaving you with an annoying orphan. IntelliAdmin has a freeware program called Add/Remove Programs Cleaner which removes entries from that list.

Important Note: This tool does not do anything toward actually uninstalling a program's files, shortcuts, or registry or profile data. It only removes the item in the Add/Remove Programs list, and it should only be used on a program which you know has been uninstalled, but which Windows won't remove from the list when you try via the normal means.

The Cleaner works on Windows NT, 2000, XP, 2003, 2008 and Vista, and may work on Windows 7; I believe it does not work on Windows 98 (seriously, you aren't still using that, are you?). It does not require an installation; it is simply a single executable file that you run.

SQL Server Won't Start Up Automatically

by Andrew Barber 11. November 2009 08:30

I've had a recurring issue with a client's web server and its locally installed instance of SQL Server. It was SQL Server 2005 Express in this case, but the issue applies to all versions of 2005 and 2008. The service would fail to start automatically when the system was rebooted, but once I connected via the RRAS VPN and then Terminal Services for remote management, the service would start up just fine. The Windows Event Log had the following SQL Server error messages, back-to-back in this order (SQL Server has the same messages in its own logs):

- Server failed to listen on x.x.x.x <ipv4> xxxxx. Error: 0x2741. To proceed, notify your system administrator.
- TDSSNIClient initialization failed with error 0x2741, status code 0xa.
- TDSSNIClient initialization failed with error 0x2741, status code 0x1.
- Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.
- SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

The server was configured to listen on only certain IP addresses, on port xxxxx. 127.0.0.1 was the primary address for the local web sites to use, and x.x.x.x was the private IP address assigned to the RRAS server; this was for remote management of the SQL Server via the VPN connection. Hopefully the light bulb is going on over your head more quickly than it did over mine!

Debian GNU/Linux 5 Release Keeps Debian on Top

by Andrew Barber 25. February 2009 13:01

On the 14th, it was announced that the next major version of Debian GNU/Linux - version 5, code-named Lenny - was officially released. Having had some time to consider this - particularly in relation to other distributions, like Ubuntu - I had some notes I wanted to share about the release. Well, more about Debian itself, really. I think I'm about to dive into something of a Holy War here, by daring to express some thoughts regarding my choice of Linux distros. Pray for me, will you?

I use Linux primarily (almost exclusively, actually) for network security functions; mostly for firewalls/routers, and to provide things such as DHCP, DNS and VPN functionality. This entry itself is being posted on a server protected by such a system, and typed on a computer also protected by such a system. Both systems are Debian based. In the past, I have made serious attempts to use Red Hat and Mandrake distributions, but once I gave Debian 3 a try, I was hooked, so to speak. I now primarily use Debian 4 (etch). The reasons I stick with Debian all come down to one simple issue: I want the servers I install to simply do their jobs.

Macs and Malware; Pirates and Trojans!

by Andrew Barber 27. January 2009 10:08

A recently discovered bit of malware for the Apple Mac OSX operating system presents an opportunity to make a few brief points. I'll try not to preach. Too much.

The short version: the Peer-2-Peer file sharing networks have been discovered to be spreading a trojan horse posing as a free or cracked version of Apple's iWork 2009 suite of productivity software. Apple does have a free trial version available for download for those who would legitimately like to try it out on their Mac.

For Heaven's Sake; Practice Safe Hex

Do not download from anonymous P2P networks. Forget the moral and ethical arguments entirely. These networks are simply a playground for people who would like to spread malware. All one has to do is create a trojan horse, and give it a name that suggests it is a crack for some expensive software, and off it goes. The prevalence of broadband connections means people will even download a 300 Megabyte piece of malware, which might actually be embedded within what appears to be the 'real' item claimed. The nature of most P2P networks makes it somewhat difficult to figure out where something came from, so there's little recourse when you get infected.

MacBook Pro Battery Won't Charge?

by Andrew Barber 4. January 2009 11:24

For those who do not already know, I use an Apple MacBook Pro (MBP) as my primary computer. I use Boot Camp to dual boot into Microsoft Windows Vista or Apple Mac OSX 10.5 as needed. I may make a separate post about some of the issues, solutions and tools I have found in that process. However, this post is about a small issue that happens to my MBP on occasion, and which I assume must happen to others also.

At times, the battery simply will not charge. Both Vista and OSX show the charger/power supply connected and in-use, and show the battery at a level other than 100%, but both also show that the battery is not being charged. Angela (my wife and business partner) and I have numerous chargers, and we have verified that the problem lies not with them. Since both OSX and Vista exhibit the same behavior (and since the behavior is also the same when the computer is off but plugged in), it lies not with the operating system, either.

The solution, then, is to reset the System Management Controller. This is a bit of firmware on the main logic board of the MacBook which controls many functions of the computer, including battery charging. The reset is accomplished like so:

  • Turn the computer completely off
  • Remove both the power supply and battery
  • Press and hold the power button for five seconds
  • Reconnect battery and power supply
  • Turn it all back on, and enjoy

One important note I want to emphasize is that this process should only be followed when the computer has been shut down properly. If you cannot get the computer to shut down properly, you have other issues which are more pressing than the battery not charging. Although Apple makes this caution only in relation to the MacBook Air, I think it would be wise to consider it for any MacBook, or at least to take the system to an authorized service center.

Finally, I want to note that this does not appear to help with another issue experienced by many MacBook Pro users. Many early MBP systems came with faulty batteries, which would not hold much of a charge at all. Apple had an exchange program for these batteries, which ended long ago. There was also a software update to OSX which updated the firmware on some batteries to resolve the fault. All modern, fully-patched OSX systems (10.4+) will already have this update, and the battery itself would have been automatically updated as well. If the battery still does not last long from a 100% charge, the best bet would be to purchase a new one.

Secure Web Sites Vulnerable?

by Andrew Barber 30. December 2008 09:48

Before anyone reading this sees the breathless headlines soon to come on the evening news, I thought I would post some quick analysis. In Berlin, Germany today, at the 25th Chaos Communication Congress (25C3), run by the Chaos Computer Club (CCC), a presentation was made which was entitled, "MD5 considered harmful today; Creating a rogue CA Certificate". I have no idea how the non-computer-literate (or even semi-) media will report this, but likely they will speak about SSL/Secure web sites being able to be spoofed, and that 'phishing' attacks will be more likely, and users won't have any way of knowing if they are victims.

Background

First, a quick attempt at giving a very simple explanation of what is actually very complex:

A 'Certificate Authority' (CA) is a company/entity which issues certificates that are used to verify the identity of something. Typically you will see this evident in the 'Padlock' or colored status bar of your web browser when browsing a secure web site. The 'certificate' is an electronic document, of sorts, which contains various information used to verify that the site is who it claims to be, and which points your web browser (transparently to you) to a CA that does the actual verification. As you might expect, the identities of these CAs are very important. Every web browser comes with a pre-set list of known, trusted CAs. A user can add or remove CAs, which is a critical operation that, in practice, is rarely done.

Why Eels?

No one can really be certain. But those slimy underwater critters obviously have something going for them!

Disclaimer
The opinions expressed herein are my own personal opinions and do not represent the views of employees, contractors or clients of Inkwell Creative Group, LLC in any way.

© Copyright 2008, 2009 Andrew Barber