Value delimited fields

2007-08-22

I’m in the process of working on a migration strategy to get the Igloo off of the Community Server platform and onto Subtext. My search of the internet provided few resources beyond BlogML to help me with the data transfer. I played around with BlogML, but in the end wasn’t able to get it to work quite like I wanted so I started working on my own solution. After screwing around for a while I was working on writing the script to migrate the comments from one database to the other when I hit upon something that astonished me.

One of the first jobs that I had out of school was working on a WinForms application that used a SQL 6.5 database for the back end. That database had been designed by a couple of people who’s last data design efforts had been in a pure value delimited data store. As a result there were columns that had many different logical values stored with a delimiter of something like char(152). As a result, if we wanted to report or process based on any one of those values, we had to know the location in the delimited field. Above all things, this was the most painful experience I ever had as a data consumer. Instead of easily being able to run a SUM and GROUP BY query against the database we would have to pull the values individually and then loop over them processing the grouping and summary manually prior to pushing the results to something like Crystal Reports. Painful. Brittle. Frustrating. I couldn’t believe it when I saw it again.

To convert the comments out of the Community Server database I had to retrieve the comment author and their URL if either had been provided in the original comment. Community Server (at least version 2.0 which I am using) stores those values in an undelimited column called “PropertyValues”. What you get is something that looks like this:

http://www.igloocoder.comDonald BelchamTrue0

Not so bad I’m thinking as I peruse the data that I have. I have a URL followed by the author name, some boolean value and a numeric. Then I realize that I have no earthly idea where to start parsing for the author name because the URL value has a variable length.

http://www.igloocoder.comDonald BelchamTrue0
http://codivation.graysmatter.comJustice~!True9999

How do I know where to get the start of the author name on row 1 and still be able to get the start of the author name on row 2? It doesn’t matter if you’re consuming this into a report, a sql query or an application, there has to be a way to split that string into its respective logical values (URL, comment author, posted, spam value). In my dismay and search for patterns, or parsing information I end up looking through a lot of the comment entries and I notice something even more frustrating. The data in the “PropertyValues” column is not necessarily in the same order. For instance:

Donald Belchamhttp://www.igloocoder.comTrue0
http://codivation.graysmatter.comJustice~!True9999

Now I have two problems. How do I parse this damn string and how do I know which of the values to use once parsed? The answer lies in another column. The “PropertyNames” column is another piece of work. It is the key to translating the “PropertyValues” column into individually meaningful bits of data. What you get for the above two samples is this:

SubmittedUserName:S:0:14:TitleUrl:S:14:25:EverPublished:S:39:4:SpamScore:S:43:1:
TitlUrl:S:0:33:SubmittedUserName:S:33:9:EverPublished:S:42:4:SpamScore:S:46:4:

Yes, the “PropertyNames” column, which is containing the key to the “PropertyValues” column, is colon delimited. To figure out what the comment author’s name is I need to go through the following steps:

Find the string “SubmittedUserName” in the “PropertyNames” column and determine the index of the colon that appears immediately after it.
Ignore the single character value after that colon and note the index of the next colon that appears.
Retrieve the numeric values that are after the colon located at the end of step number 2 and note the index of the colon appearing immediately after them.
Retrieve the numeric values that are after the colon located at the end of step number 3

In the end you have two numeric values. The one found in step number 3 is the starting index for the comment author value that is stored in the “PropertyValues” column. In the case of the samples above, the first comment author value starts on index 0 (zero) and the second one starts on index 33. The second value found (step number 4) is the length of the value representing the comment author. The first example above has a length of 14 and the second has a length of 9.

Once you have those two values you can simply use the T-SQL SUBSTRING function to get the data you need. The exact same principle can be used for getting the URL value. Simple huh?

I had a belief that the people at Telligent were pretty smart. Unless there is some good reason for making me be a data contortionist I’m starting to feel otherwise. Why couldn’t Scott Watermasysk and his coworkers come up with a better solution than this in version 2.0 of this product? If they were worried that they may need to add more properties and didn’t want to have to alter the database structure every time that they did, why not create a child table that has the post id, property id and property value columns in it? Turning the data vertical instead of trying to store it horizontally would give them all the expansion room that they want. Instead, we end up with this exclusionist data structure that prevents the outside people from getting any use out of it.

What if I wanted to create a component for my blog that showed the people who have commented the most? There’s no way for me to create a simple component that can do this. I have to go through the act of getting all the comments back in a dataset, parsing each one to find the commenter name and then manually count and group the values. Painful. What I should be able to do is something like this:

1	Select CommenterName, Count(*) From Comments Group By CommenterName

I see no reason that data access, in this case, can’t be this easy. With the power of relational database engines and the low cost of storage space no database designer should be able to justify a value delimited column over a normalized data structure.

There’s also no reason not to migrate away from it. Every Community Server upgrade that I’ve performed has included a significant set of database upgrade scripts. You could change the data structure to something more user friendly and modify the application to work with it. Part of the database upgrade scripts then could include a section that migrates all the data from the old “PropertyName”/“PropertyValue” format into the new data structure. Voila. Value delimited columns gone.

Telligent, if you want a script to do this, let me know. I have one I can sell to you cheap.

uncategorized