Essential and derived data

Feb 26, 2022

Essential data is the information inherent to the problem domain. For example, in a note-taking app, the text of the note is essential data. No matter the surrounding implementation, the software has to store the text that the user has entered.

Based on the essential data, the system derives data for various purposes. For example, based on the note text, the note-taking app can derive a full-text search index, display the text in the UI, or give the user statistics on average sentence length.

Derived data can either be persisted or computed on the fly.

On-the-fly derived data gets recomputed whenever the essential data changes upstream. It doesn’t live in any persistent location that you have to keep updated. For example, if you give input data to a chain of pure functions, you always get the correct derived data as output, because the functions do not store results anywhere. Alternatively, you might use a reactive library or framework that takes care of propagating changes to derived data. This is great for simplicity: when the essential data changes, you don’t have to track down every derived copy that needs updating.
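As a minimal sketch of the pure-function approach, here is one way the sentence-length statistic could be derived from the note text. The function names and the naive sentence-splitting rule are assumptions made for illustration, not part of any particular app.

```javascript
// Essential data: the note text exactly as the user typed it.
const noteText = "Buy milk. Call the plumber about the leak. Done!";

// Pure function: splits text into sentences using a naive punctuation rule
// (an assumption; real sentence detection is more involved).
function splitSentences(text) {
  return text
    .split(/[.!?]+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Pure function: counts whitespace-separated words in one sentence.
function countWords(sentence) {
  return sentence.split(/\s+/).filter((w) => w.length > 0).length;
}

// Derived data: recomputed from the essential data on every call,
// so there is no stored copy that could go stale.
function averageSentenceLength(text) {
  const sentences = splitSentences(text);
  if (sentences.length === 0) return 0;
  const totalWords = sentences.reduce((sum, s) => sum + countWords(s), 0);
  return totalWords / sentences.length;
}

console.log(averageSentenceLength(noteText)); // prints 3
```

Nothing here is written to a database or cached in a variable, so calling `averageSentenceLength` with new text always yields an up-to-date answer.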

Frameworks like React demonstrate this approach. With React, you change the essential state (e.g. the component props), your component transforms that into VDOM, and then React transforms the VDOM into real DOM mutations. So you can trust that when state changes, the DOM will accurately reflect it. To borrow a database metaphor, the DOM is a “materialized view” of the app state.

The potential downside of on-the-fly derived data is performance. If it’s expensive to derive on the fly, then you can consider making it persistent derived data, like a “materialized view.” That way you can quickly access it without re-deriving it. That, however, introduces another problem: data synchronization (or replication).

That problem happens whenever you have data in one place and need to keep a (derived) copy of it in another place up to date. Databases have solutions, e.g. primary-to-secondary replication or materialized views. The frontend ecosystem has implementations too – with reactive frameworks, we keep our app state in memory and keep a derived transformation of it in the DOM up to date.

On a small scale, every variable that persists derived data creates a little data sync problem of its own: you now have to update that variable every time the essential data changes.
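To make the small-scale sync problem concrete, here is a sketch outside of any framework. The variable names are hypothetical; the point is only that a stored derived value goes stale unless something updates it.

```javascript
// Essential data: the note text.
let noteText = "hello world";

// Persistent derived data: computed once and stored in a variable.
let storedWordCount = noteText.split(/\s+/).filter(Boolean).length;

// On-the-fly derived data: recomputed every time it is read.
const currentWordCount = () => noteText.split(/\s+/).filter(Boolean).length;

// The essential data changes, but nothing updates the stored copy.
noteText = "hello brave new world";

console.log(storedWordCount);    // 2 (stale)
console.log(currentWordCount()); // 4 (fresh)
```

Keeping `storedWordCount` correct would require remembering to recompute it at every place where `noteText` is assigned.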

Here’s a contrived example, just to illustrate the point. A React component could do this:

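import { useEffect, useState } from "react";

const UsernameInput = () => {
  const [firstName, setFirstName] = useState("");
  const [lastName, setLastName] = useState("");
  // Derived data stored as its own piece of state.
  const [fullName, setFullName] = useState("");

  // Manual sync: re-derive fullName whenever either name changes.
  useEffect(() => {
    setFullName(firstName + " " + lastName);
  }, [firstName, lastName]);

  return <form>
    ...form inputs...
    Your name is {fullName}
    </form>
}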

But here fullName is derived state. Unless there’s a reason it needs to persist, it’s simpler to recompute it on the fly:

import { useState } from "react";

const UsernameInput = () => {
  const [firstName, setFirstName] = useState("");
  const [lastName, setLastName] = useState("");
  // Derived on every render, so it cannot drift out of sync.
  const fullName = firstName + " " + lastName;

  return <form>
    ...form inputs...
    Your name is {fullName}
    </form>
}

What is the “real” essential data?

Software cannot “know” anything beyond what it “perceives” through input devices like mouse, keyboard, network connection, file system, etc. Those perceptions come in the form of discrete events. So I would say the closest software can get to the essence of things is storing the stream of events that it perceives. For example, let’s say a note-taking app stores the notes in a SQLite database. If the app instead stored an immutable log of all the user input events (mouse and keyboard), it could derive the contents of the database by scanning through that log from the beginning. Thus, I could say that mutable databases typically don’t contain purely essential data. It’s just that, for pragmatic reasons, we don’t typically design systems that store raw perceptions.
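A rough sketch of that idea: the notes are rebuilt by replaying an append-only event log from the beginning. The event types and shapes below are hypothetical and sit at a higher level than raw mouse and keyboard input, which keeps the example short.

```javascript
// Hypothetical event log, treated as append-only (nothing is edited in place).
const eventLog = [
  { type: "noteCreated", id: 1 },
  { type: "textTyped", id: 1, text: "Buy milk" },
  { type: "textTyped", id: 1, text: " and eggs" },
  { type: "noteCreated", id: 2 },
  { type: "textTyped", id: 2, text: "Call mom" },
  { type: "noteDeleted", id: 2 },
];

// Pure reducer: applies one event to the previous state and returns new state.
function applyEvent(notes, event) {
  switch (event.type) {
    case "noteCreated":
      return { ...notes, [event.id]: "" };
    case "textTyped":
      return { ...notes, [event.id]: (notes[event.id] ?? "") + event.text };
    case "noteDeleted": {
      // Copy every note except the deleted one.
      const { [event.id]: _removed, ...rest } = notes;
      return rest;
    }
    default:
      return notes; // unknown events leave the state unchanged
  }
}

// Derived data: the current notes, rebuilt by scanning the whole log.
function replay(log) {
  return log.reduce(applyEvent, {});
}

console.log(replay(eventLog)); // { '1': 'Buy milk and eggs' }
```

In this framing the `notes` object plays the role of the mutable database: it can be thrown away and rebuilt from the log at any time.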