I'm in the process of working on a migration strategy to get the Igloo off of the Community Server platform and onto Subtext. My search of the internet turned up few resources beyond BlogML to help me with the data transfer. I played around with BlogML, but in the end wasn't able to get it to work quite like I wanted, so I started working on my own solution. After tinkering for a while, I was writing the script to migrate the comments from one database to the other when I hit upon something that astonished me.
One of the first jobs that I had out of school was working on a WinForms application that used a SQL 6.5 database for the back end. That database had been designed by a couple of people whose last data design efforts had been in a pure value-delimited data store. As a result, there were columns that had many different logical values stored with a delimiter of something like char(152). If we wanted to report or process based on any one of those values, we had to know its location in the delimited field. This was, above all things, the most painful experience I've ever had as a data consumer. Instead of easily being able to run a SUM and GROUP BY query against the database, we had to pull the values individually and then loop over them, doing the grouping and summarizing manually before pushing the results to something like Crystal Reports. Painful. Brittle. Frustrating. I couldn't believe it when I saw it again.
To convert the comments out of the Community Server database I had to retrieve the comment author and their URL if either had been provided in the original comment. Community Server (at least version 2.0 which I am using) stores those values in an undelimited column called "PropertyValues". What you get is something that looks like this:
Not so bad, I'm thinking, as I peruse the data that I have. I have a URL followed by the author name, some boolean value and a numeric. Then I realize that I have no earthly idea where to start parsing for the author name, because the URL value has a variable length.
How do I know where the author name starts on row 1 and still be able to find where it starts on row 2? It doesn't matter whether you're consuming this in a report, a SQL query or an application; there has to be a way to split that string into its respective logical values (URL, comment author, posted, spam value). In my dismay, while searching for patterns or any parsing information, I end up looking through a lot of the comment entries and I notice something even more frustrating. The data in the "PropertyValues" column is not necessarily in the same order. For instance:
Now I have two problems. How do I parse this damn string and how do I know which of the values to use once parsed? The answer lies in another column. The "PropertyNames" column is another piece of work. It is the key to translating the "PropertyValues" column into individually meaningful bits of data. What you get for the above two samples is this:
Yes, the "PropertyNames" column, which contains the key to the "PropertyValues" column, is colon delimited. To figure out what the comment author's name is, I need to go through the following steps:
1. Find the string "SubmittedUserName" in the "PropertyNames" column and determine the index of the colon that appears immediately after it.
2. Ignore the single-character value after that colon and note the index of the next colon that appears.
3. Retrieve the numeric value that appears after the colon noted in step 2 and note the index of the colon appearing immediately after it.
4. Retrieve the numeric value that appears after the colon noted in step 3.
In the end you have two numeric values. The one found in step number 3 is the starting index for the comment author value that is stored in the "PropertyValues" column. In the case of the samples above, the first comment author value starts on index 0 (zero) and the second one starts on index 33. The second value found (step number 4) is the length of the value representing the comment author. The first example above has a length of 14 and the second has a length of 9.
Once you have those two values you can simply use the T-SQL SUBSTRING function to get the data you need. The exact same principle can be used for getting the URL value. Simple huh?
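The four steps above can be sketched outside of T-SQL as well. Here is a rough Python equivalent, using hypothetical sample rows that I constructed to match the format described above (the real Community Server data will differ):

```python
def get_property(names: str, values: str, key: str) -> str:
    """Extract one logical value from the "PropertyValues" blob.

    "PropertyNames" is assumed to look like "key:T:start:length:..."
    where T is a one-character type code, start is the index into
    "PropertyValues" and length is the number of characters to take.
    """
    # Step 1: find the key and move to the colon immediately after it.
    i = names.index(key) + len(key)       # points at that colon
    # Step 2: skip the single-character type code; steps 3-4: read the
    # two numeric values that follow it.
    parts = names[i + 1:].split(":")      # e.g. ["S", "0", "14", ...]
    start, length = int(parts[1]), int(parts[2])
    return values[start:start + length]

# Hypothetical rows shaped like the two samples discussed in the post:
# author at index 0, length 14 in the first; index 33, length 9 in the
# second, with the values in a different order.
names_1 = "SubmittedUserName:S:0:14:SubmittedUrl:S:14:22:"
values_1 = "John Q. Publichttp://example.com/xyz"
names_2 = "SubmittedUrl:S:0:33:SubmittedUserName:S:33:9:"
values_2 = "http://example.com/blog/index.phpBob Jones"

print(get_property(names_1, values_1, "SubmittedUserName"))  # John Q. Public
print(get_property(names_2, values_2, "SubmittedUserName"))  # Bob Jones
```

The same function pulls the URL out by passing "SubmittedUrl" as the key, which is exactly the point: every consumer of this data has to reimplement this contortion before it can do anything useful.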
I had believed that the people at Telligent were pretty smart. Unless there is some good reason for making me a data contortionist, I'm starting to feel otherwise. Why couldn't Scott Watermasysk and his coworkers come up with a better solution than this in version 2.0 of the product? If they were worried that they might need to add more properties and didn't want to alter the database structure every time they did, why not create a child table with post id, property id and property value columns? Turning the data vertical instead of trying to store it horizontally would give them all the expansion room they want. Instead, we end up with this exclusionary data structure that prevents outsiders from getting any use out of it.
What if I wanted to create a component for my blog that showed the people who have commented the most? There's no way for me to create a simple component that can do this. I have to go through the act of getting all the comments back in a dataset, parsing each one to find the commenter name, and then manually counting and grouping the values. Painful. What I should be able to do is something like this:
SELECT CommenterName, COUNT(*) FROM Comments GROUP BY CommenterName
I see no reason that data access, in this case, can't be this easy. With the power of relational database engines and the low cost of storage space no database designer should be able to justify a value delimited column over a normalized data structure.
There's also no reason not to migrate away from it. Every Community Server upgrade that I've performed has included a significant set of database upgrade scripts. You could change the data structure to something more user-friendly and modify the application to work with it. The database upgrade scripts could then include a section that migrates all the data from the old "PropertyNames"/"PropertyValues" format into the new data structure. Voila. Value-delimited columns gone.
Telligent, if you want a script to do this, let me know. I have one I can sell to you cheap.
There's a lot of discussion going on between Oren and Jacob (here, here and here) about the relative benefits, costs and overall usefulness of Dependency Injection. Both have done a decent job arguing their beliefs, and they certainly triggered some thoughts in my head. In the not-so-distant past I wrote a post that showed the steps I would take when refactoring from junk code into a full IoC implementation with some Dependency Injection, SRP and interface-based design.
I noticed some interesting things when I sat back and analyzed the results. The first thing that jumped out at me was that the coupling of my code, according to NDepend, only made a significant leap when I began coding to interfaces. Prior to the experiment I had expected Dependency Injection to have the largest impact. Looking at the code, it makes sense that interface-based development has a larger impact on coupling. When you change from coding to concretions to coding to interfaces, you are telling your consuming code that it doesn't care what the implementation of that object is, only what its contract with you is. Immediately I can create another class that implements the same interface as is currently being used, change one line of code (for that one instance of use only) and, voila, I have new functionality. Although the class using that implementation has to know how to create it, it is still decoupled through the rest of its use.
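To make that concrete, here is a small sketch of coding to an interface, using Python abstract base classes as the contract mechanism; all the class names are hypothetical:

```python
from abc import ABC, abstractmethod

# The contract: consumers depend on this, never on a concretion.
class INotifier(ABC):
    @abstractmethod
    def send(self, message: str) -> str: ...

class EmailNotifier(INotifier):
    def send(self, message: str) -> str:
        return f"email: {message}"

class SmsNotifier(INotifier):
    def send(self, message: str) -> str:
        return f"sms: {message}"

# The consuming code only knows the interface, not the implementation.
def alert(notifier: INotifier) -> str:
    return notifier.send("build failed")

# Swapping behavior means changing one line at the call site:
print(alert(EmailNotifier()))  # email: build failed
print(alert(SmsNotifier()))    # sms: build failed
```

Everything downstream of `alert` is untouched by the swap, which is exactly the decoupling the coupling metrics were picking up.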
On the other hand, Dependency Injection only provides the class with what it needs to use. As far as coupling is concerned, all you have done is remove the "newing-up" of the object and passed it in instead. This is an important part of decreasing coupling, but the overall impact on the code is limited. Instead of having one "new Customer()" statement every time you need a Customer object, it will be created in one central location. Comparing the impact of centralizing object initialization with the impact of coding to an interface, I believe that coding to an interface reaches far further.
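Here is a minimal sketch of that "remove the newing-up and pass it in instead" move, again with hypothetical names:

```python
class UpperFormatter:
    def format(self, text: str) -> str:
        return text.upper()

# Before: the class news up its own dependency, coupling it to the
# concretion and scattering construction across the codebase.
class HardWiredReport:
    def __init__(self):
        self.formatter = UpperFormatter()

# After: the dependency is injected through the constructor, so
# construction happens in one central location.
class Report:
    def __init__(self, formatter):
        self.formatter = formatter

    def render(self, text: str) -> str:
        return self.formatter.format(text)

# The composition root (or an IoC container) is now the only place
# that knows which implementation gets used.
report = Report(UpperFormatter())
print(report.render("late backlog"))  # LATE BACKLOG
```

Note that the injection alone buys little unless `formatter` is typed to an interface; with a concrete `UpperFormatter` in the signature, the coupling has mostly just moved.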
Probably the biggest realization I've had since my experience with refactoring to IoC is that Dependency Injection, SRP and coding to interfaces are all inextricably linked. Without interface-based code, it becomes very difficult to see any decoupling benefit from Dependency Injection. Without adherence to SRP, it is very difficult to practice good Dependency Injection. That has led me to the thought that Dependency Injection actually acts as a pattern and practice aggregator.
If you want to implement Dependency Injection well, you are also going to have to begin practicing SRP and interface-based development. Dependency Injection, by itself, isn't a silver bullet for creating loosely coupled applications. Neither is coding to an interface or SRP. Dependency Injection will, however, force you to take a long look at why your code exists as it does. It will be a catalyst for you to begin implementing better OO practices as you work towards making Dependency Injection work more smoothly and powerfully for you.
The project I’m on is known less for its velocity than for the stunningly good looks of its team leads. As I blogged a couple of months back, we’re supposed to be in “Crunch Mode” right now so that we can meet the arbitrary hard deadline that management is imposing. To tell you the truth, I’ve never seen a “Crunch Mode” quite like this.
Instead of the development team having to bust its collective testicoli to get things done, we’re all just sitting around. Every day the hard deadline draws nearer, yet every day we actually have less and less to do. This isn’t because we are whittling away the backlog; it’s because no work has been handed over to us by the designers and business analysts.
It’s an interesting situation. There is lots of backlog remaining, but none of it has been prepared for development. Every week I have a status meeting where, for the last four weeks (this week will be the same again), I tell management that we’re falling further and further behind. I give examples of numerous backlog items that should have been delivered to the development team but have yet to appear. I figured that by displaying my concern and vocalizing the point-in-time state of the development effort I would stir up a little anxious urgency. Nothing. Instead of pressure being applied to deliver the backlog, or even just pieces of it, to development, the schedule gets revamped so that all the currently late items now appear as work sometime in the coming months.
I suppose that this is the easy way to ensure that you’re on schedule, but it has some serious ramifications. Unbelievably this happened two weeks ago. No longer was there the pressure for all the late backlog items to be delivered. Instead they were deliverable in September. Roll ahead two weeks and now we have the same problem all over again, except with different backlog items. I’m quite interested in seeing what will happen when that comes out in our status meeting.
What can we, the development team, do? At this point I think that we have to stay nimble and ready to react. I can’t fill the down time with refactoring tasks for fear of an avalanche of work arriving while we have the application in a partially refactored state. Hopefully the last 8 months have allowed us to get the codebase into a state where we can easily add and modify when we do get work. We need to work on getting and keeping our environments in a state where they will work flawlessly (here that means only going down once a day) when we need them. We also need to get the new hires up to speed as quickly as possible when they start. Yes, new hires. Apparently we need them because we aren’t capable of keeping up with the work load that we have.