Select Page

Considering Nullable Date Columns As A Representation Of State In SQL

Published: December 27, 2022

In my post yesterday on clearing NULLable database values in ColdFusion, I was using the concept of “Task Management” as my exploratory context. And, in the task database table that I created for the demo, I included both an isComplete column and a completedAt column. In theory, I could have written the demo using a single column, completedAt, since a non-NULL value within the completedAt column would indicate that the Task in question had been completed. But, I ended up using two columns because I believe they actually answer two different semantic questions.

Here is the CREATE TABLE statement for the Task table in this thought-experiment:

	`id` int unsigned NOT NULL AUTO_INCREMENT,
	`description` varchar(255) NOT NULL,
	`isComplete` tinyint unsigned NOT NULL,
	`createdAt` datetime NOT NULL,
	`completedAt` datetime DEFAULT NULL,

NOTE: I don’t have any secondary indices defined on this table. But, ignore that fact for this post – this isn’t a discussion about query performance or optimization, it’s a discussion about column semantics.

As you can see, I am defining both an isComplete and a completedAt column. Understandably, one might therefore assume that the following SQL statements both select the same data:

	task t
	-- Using "isComplete" to filter query for COMPLETED tasks.
	t.isComplete = TRUE
	task t
	-- Using "completedAt" to filter query for COMPLETED tasks.
	t.completedAt IS NOT NULL

Which could mean that I can get DROP the isComplete column altogether and just use the NULL vs non-NULL value of the completedAt column to determine the “completed” state of a given task.

I spent a few minutes going back-and-forth on whether I wanted to do this in my table design. Ultimately, I decided to include both columns because I felt that using a single completedAt column would be overloading the semantics of the column. Meaning, the completedAt column would be using a single value to store two different pieces of information:

The problem that I couldn’t get past was that these two pieces of information may be incidentally linked in some cases but are not inherently linked in all cases. I kept asking myself questions about the history of the data and the application in which it is being consumed (remember, this was just a thought-experiment so I didn’t have all that information on hand):

  • Was the completedAt column added from the start? Or, was it added later on in the application development life-cycle? And, if it’s the latter, were there existing records wherein I knew that the task had been completed (via isComplete) but I didn’t actually have a record of when the task was completed?

  • If a completed task is marked as incomplete, do I necessarily need to NULLify the completedAt column? Or, can I use that column to indicate when the task was “last marked as complete” – something that I might want to show the user within the application interface?

  • In the future, might I want to move the completedAt column into another table altogether so that I can create an “audit trail” regarding the “Who” and “When” of task completion?

Obviously, there’s something to be said about solving the problem you have in front of you and not prematurely optimizing for scenarios that don’t (yet) exist. But, the fact that I had all of these questions floating around in my head felt a bit like a “code smell” – an indication that using a single column to store multiple pieces of semantically-separate information was likely not the right choice.

There’s also something to be said about the fundamental readability of the SQL itself. I would argue that most people – myself very much included – find a positive assertion easier to read and to understand than a negative assertion. Meaning:

isComplete = TRUE

… has quite a bit less cognitive load when compared to:

completedAt IS NOT NULL

And, since storage is cheap and we simple humans spend the majority of our time reading code, not writing code, I would argue that the approach with the least cognitive load is implicitly the better option.

Database schema design isn’t simple. And, it certainly isn’t static – it must evolve in lock-step with the needs of the application. It’s also not cut-and-dry – there are usually multiple ways to solve the same problem. And, in my case, I believe two columns with slightly different semantics is better than a single column with somewhat overloaded semantics. Your mileage may vary.

Want to use code from this post?
Check out the license.