It's Only Fitting

by Andrew Barber 16. December 2008 07:40

I got my start actually making a 'living' as a software developer massaging data; Getting something that was an old mess into a more modern, accessible format that could be used into the future.

At the time, I worked for the federal government, and I was charged with taking what amounted to a huge, manually-edited spreadsheet and normalizing the data so that it could be imported into a recently designed Oracle database. In actuality, that project was sort of meant to be busy work, and it was expected that I would spend a few months manually scanning tens-of-thousands of records to do things like assure the address fields were split into their parts (line 1, line 2, city, state, postal code, country). What I actually ended up doing was spending a few minutes starting to do that, and then realizing that there really were some patterns to the data which could be identified by a set of rules for at least 99% of the records.

So, I spent a couple days playing with some parsing code. The office I was in was all Microsoft on the desktop, so I created a simple Access 97 database the schema of which duplicated the target Oracle db, and then began to write some Visual Basic code within Office to parse every line in the spreadsheet that could be parsed, and slap it in. In the end, I was left with perhaps 1000 records which could not be readily parsed by code and needed to be manually converted. The whole job took about two weeks, largely because I had not previously used Visual Basic within Office at all before then, and so needed to learn a bit of the environment there.

What's That Got to Do With Today?

What's fitting, then, is that my first real post here will also be about moving some data from a manual format into a normalized database. I am working on a project for the Volunteer Action Center of the Mid-Ohio Valley (VACMOV), to take their online listing of programs and agencies and put it into a format with which they can more readily work. Right now, it is simply in individual, plain HTML files. Each agency/program has its own, and there are nearly 800 of them. These records are found on the current web site by key words, of which there are just over 400. Each key word itself has a manually-edited HTML page, which lists and links each record that applies. As one might expect of such a setup, editing this data is quite difficult to do. Adding a keyword to a record, adding a completely new keyword, adding a whole new agency/program record, and even just editing a program's information are all much more complex than they need to be. Of course, updating the appearance of the pages also is effectively impossible, since there are well over 1200 pages which would need to be edited.

You can see the VACMOV's Information and Referral pages to see the current system.

At this stage, I have been able to get all but 13 of the current records to be parsed into a database. The HTML files of those 13 records can be altered to parse except two, which will need to be manually entered into the new system. This was by far the hard part; Having been written a while ago, the HTML files do not follow XHTML standards, but rather the more lax, original HTML standards. For example, there are <p> and <li> tags throughout without their matching closing tags. Luckily, though, this has not been a big impediment. I've been able to use Regular Expression pattern matching to locate individual fields, often by their labels, which are surprisingly uniform.

So, next I will be writing a parser to get all of the key words in. Each key word HTML file will be scanned, and every matching record will have the appropriate key word added to it in the database. After that comes the generally much simpler task of presenting the information. We'll be hosting the VACMOV's web site soon, and providing the system to run the Information and Referral portion. I am not sure that the current method of clicking a letter of the alphabet and then scanning for a keyword in that list is the best way for people to find what they need, so I will instead be providing a search form which searches both the keywords and the text fields of the records themselves. I will leave something that looks a lot like what they have now with the alphabetical thing there only so that search engines will continue to have clickable links by which to find the records in their new locations, so they will be indexed there too. Since the new database also happens to store the old HTML file name of the records, I will also put in place a 404 error handler which detects if the page requested was one of those old records, and redirects them to the new one if so.

Volunteer Action Center

The Volunteer Action Center is a local non-profit group in our area which helps agencies find what they need to provide the services they do. We publish "Get Connected" from the VAC every month in Out & About, a Family Guide, and we are happy to donate our services for their web site here as well.

Why Eels?

No one can really be certain. But those slimey underwater critters obviously have something going for them!

Links/Profile

Andrew Barber's Profiles:
Disclaimer
The opinions expressed herein are my own personal opinions and do not represent the views of employees, contractors or clients of Inkwell Creative Group, LLC in any way.

© Copyright 2008, 2009 Andrew Barber