Derived Data

A sane application starts with sane data structures.

This was what our data structures professor told me when I presented my project a few years ago, and I have followed this mantra ever since. A sane application starts with sensible data and works its way from there. The UI is merely decoration. Yet time and again, project after project, debugging hell happens. And it all boils down to data structures, specifically to failing to recognize which data is derived.


What is this "derived data"?

Derived data... erm... the simplest example I could come up with would be a person's full name, which is commonly derived from the person's first, middle and last names. Local time is another example: it is derived from a server timestamp and the user's timezone. Simply put, derived data is a byproduct of undecomposable data and a computation.
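To make those two examples concrete, here is a minimal sketch in Python. The function names (full_name, local_time) are made up for illustration, and I'm assuming the user's timezone is available as a plain UTC offset in hours, which a real app would probably replace with a proper timezone database.

```python
from datetime import datetime, timedelta, timezone


def full_name(first: str, middle: str, last: str) -> str:
    """Derive a display name from its parts, skipping empty parts."""
    return " ".join(part for part in (first, middle, last) if part)


def local_time(server_timestamp: float, utc_offset_hours: float) -> datetime:
    """Derive local time from a Unix timestamp and the user's UTC offset.

    Assumption: the timezone is given as a fixed offset in hours.
    """
    user_tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(server_timestamp, tz=user_tz)


print(full_name("Juan", "Santos", "Cruz"))  # Juan Santos Cruz
print(local_time(0, 8).isoformat())         # 1970-01-01T08:00:00+08:00
```

Neither function stores anything. Give them the same undecomposable inputs and they hand back the same derived value.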

To put it in the context of frameworks and programming, derived data is the result of what are commonly called "computed properties". In conceptual terms, it's a result of the "view logic". Derived data doesn't reside in the data model, nor does it have a place there. It isn't persisted either, just constructed on the fly.
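Most frameworks have their own syntax for computed properties. As a rough, framework-free sketch, Python's built-in property decorator behaves the same way. The Person class here is hypothetical, just enough to show the idea.

```python
from dataclasses import asdict, dataclass


@dataclass
class Person:
    # Only the undecomposable pieces are real fields.
    first: str
    middle: str
    last: str

    @property
    def full_name(self) -> str:
        """Computed on every access, never stored."""
        return " ".join(p for p in (self.first, self.middle, self.last) if p)


person = Person("Ada", "", "Lovelace")
print(person.full_name)  # Ada Lovelace

person.last = "King"
print(person.full_name)  # Ada King

# What would be persisted: the fields only, no full_name.
print(asdict(person))    # {'first': 'Ada', 'middle': '', 'last': 'King'}
```

Changing a source field changes the computed value on the next read, with no observer in between.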

So what makes derived data a problem?

The problem with derived data is failing to identify it as such. A common symptom is writing observers explicitly on one or more pieces of data just to update one or more other pieces of data.

Take for instance a form that asks for a birthday and determines whether the user is classified as elderly, an adult or a minor. A naive programmer would create a birthday property, then add a watcher to it that updates isElderly, isAdult, and isMinor. Changes are then rendered and persisted somewhere.

Sure, it works, but that's just where the problem starts. What if a new developer comes in, knows nothing about the 3 flag properties and finds no documentation for them, then implements a feature that updates the birthday? The new developer's code updates the birthday, but the flags stay the same. Boom! Erroneous state!
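Here is roughly how that failure plays out, sketched in Python with a plain dict standing in for the persisted record. Everything in it is an assumption made for illustration: the function names, the snake_case flag names, and the age thresholds of 18 and 60.

```python
from datetime import date

ADULT_AGE = 18    # assumed threshold
ELDERLY_AGE = 60  # assumed threshold


def age_on(birthday: date, today: date) -> int:
    """Whole years between birthday and today."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


def save_form(record: dict, birthday: date, today: date) -> None:
    """Original form handler: plays the role of the watcher."""
    age = age_on(birthday, today)
    record["birthday"] = birthday
    record["is_minor"] = age < ADULT_AGE
    record["is_adult"] = ADULT_AGE <= age < ELDERLY_AGE
    record["is_elderly"] = age >= ELDERLY_AGE


def correct_birthday(record: dict, birthday: date) -> None:
    """New feature, written without knowing the flags exist."""
    record["birthday"] = birthday


today = date(2024, 6, 1)
record = {}
save_form(record, date(1990, 5, 20), today)  # 34 years old, flags agree
correct_birthday(record, date(2015, 5, 20))  # now 9 years old

print(age_on(record["birthday"], today))       # 9
print(record["is_adult"], record["is_minor"])  # True False
```

The record now claims a 9-year-old is an adult, and nothing in correct_birthday looks wrong on its own.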

How do we fix this?

Identify what pieces of data are derived. The easiest check is to see whether a piece of data is just another piece of data in a different form. In the example above, the statuses are simply a byproduct of the birthday and a set of conditions. Full names are byproducts of first, middle and last names and a concatenation function. Local time is a computation on server time and user timezone.

It's these undecomposable properties that should be persisted in the system. Not storing derived data is perfectly fine. Getting it back is as simple as running the data through the same computation again. If that computation is implemented properly (referentially transparent), running it on the same data will always give the same result.
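As a sketch of what that looks like for the birthday example: only the birthday is stored, and the status is recomputed whenever it is needed. The classify function, the thresholds of 18 and 60, and the JSON storage are all assumptions for illustration. Note that the current date is passed in as an argument. If the function read the clock itself, the same input could give different results on different days, and it would no longer be referentially transparent.

```python
import json
from datetime import date

ADULT_AGE = 18    # assumed threshold
ELDERLY_AGE = 60  # assumed threshold


def classify(birthday: date, today: date) -> str:
    """Pure function: the result depends only on its two arguments."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    age = today.year - birthday.year - (0 if had_birthday else 1)
    if age < ADULT_AGE:
        return "minor"
    if age < ELDERLY_AGE:
        return "adult"
    return "elderly"


# Persist only the undecomposable piece.
stored = json.dumps({"birthday": "1990-05-20"})

# Later: load it and derive the status on the fly.
record = json.loads(stored)
birthday = date.fromisoformat(record["birthday"])

print(classify(birthday, date(2024, 6, 1)))  # adult
print(classify(birthday, date(2060, 1, 1)))  # elderly
```

There are no flags to fall out of sync, because there are no flags. Any code that updates the birthday gets the right status for free.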

Tradeoffs

The tradeoff I normally see is that the same computations are needed across different parts of the app, so they get written over and over. But this is no problem, as these computations can be moved into a separate utility module, which can then be used wherever it's needed.

Now I was asked this question: What if that same data is to be represented the same way on another platform in the system? Say, the same data and validity state rendered into a PDF or XML? I'd say recompute it! Write that computation again on that platform!
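A small sketch of the XML case, using Python's standard xml.etree.ElementTree. The exporter receives only the persisted record and derives the status itself while rendering. The element names, the status function and the thresholds are made up for illustration, and on a genuinely different platform the same few lines would be rewritten in that platform's language.

```python
import xml.etree.ElementTree as ET
from datetime import date


def status(birthday: date, today: date) -> str:
    """The exporter's own copy of the computation (assumed thresholds)."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    age = today.year - birthday.year - (0 if had_birthday else 1)
    return "minor" if age < 18 else "adult" if age < 60 else "elderly"


def to_xml(record: dict, today: date) -> str:
    """Render a persisted record, deriving the status at render time."""
    birthday = date.fromisoformat(record["birthday"])
    person = ET.Element("person")
    ET.SubElement(person, "birthday").text = record["birthday"]
    ET.SubElement(person, "status").text = status(birthday, today)
    return ET.tostring(person, encoding="unicode")


print(to_xml({"birthday": "1950-01-15"}, date(2024, 6, 1)))
# <person><birthday>1950-01-15</birthday><status>elderly</status></person>
```

The status shows up in the output document but never in the stored record.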

Think about it. When the application ends up with dependencies no simpler than a bowl of spaghetti, where parts often get put in, pulled out, and mangled, is it really worth losing time and hair?

Conclusion

Save hair. Know when data is derived. It will work wonders in development.