Thursday, October 25, 2018

Life Cycle of Personal Data and the Personal Dataome

Caleb Scharf has written a pretty interesting essay entitled "The Selfish Dataome" in the October issue of Nautilus. In it he suggests that there is a systemic relationship between the data we produce and the lives we live. He observes that, as we live, we produce data. It's expensive. And there's a lot of it. How does it affect us? Is it worth it? How would we know? He asks the question: "Does the data we produce serve us, or vice versa?"

Interesting. Reminiscent of "The Selfish Gene". Perhaps our bodies, families and societies are only here to propagate our data, just as they are here to propagate our genes?

But this is silly. It's a system with feedback, not a linear process. The lives we live produce data; and the data we produce surely affect the lives we live. And of course, the system encompassing ourselves and our data is evolving. I'd be happier with a question that's not binary (does the data serve us or vice versa) but one that examines the lifecycle of the data and its relationship with people, and acknowledges the cost as well as the value of the process.

But his argument is flawed for many other reasons, not the least of which is that he uses Shakespeare's life as his example. Shakespeare is the exception to the rule. Perhaps he should have examined instead the data produced by Shakespeare's grammar school teachers, all of whom surely produced written documents, none of which have survived. Their dataomes have vanished, all by themselves. And by and large, it seems to me, these creative and destructive forces are in balance, limited absolutely by human cognitive bandwidth because there's only so much data we can afford to pay attention to.

And attention is the key: without continuous attention, we surely don't maintain the storage media. Consider his paper example. Apparently it takes five grams of high-quality coal to produce a single page of paper. (That is an awesome statistic, by the way.) But that's not all: unless the printed page is buried in a desert cave, it needs to be REPRINTED every hundred years or so because the medium won't last. In fact, every medium has its lifespan. And when it dies, well, it's gone. The chain of custody is broken. The data is lost.

I think the real question he's asking, though, is why do we REPRINT some documents and how do we know THAT is worth it? Once again, it seems to me that there is a natural balance. When the information serves us, we allocate the resources. When it doesn't, well, we don't. And it dies.

Low-cost digital storage for most of us is neither a problem nor a solution: it's irrelevant. Ironically, low-cost digital media are likely to last LESS time than paper. Five hundred years from now most of our data will be like the documents of Shakespeare's teachers. It will be gone forever, digital storage media notwithstanding. In fact, unless the data appears to have some value to someone, society decides NOT to spend the energy to maintain it.

But here Caleb has missed three HUGE additional pieces of the puzzle, other kinds of maintenance that also require resources. First, there's the additional burden of maintaining access and retrieval methods. Paper books are stored, accessed and retrieved in the context of a library. Only some documents are filed. And that is expensive. We decide as a society what is (and what is not) organized. Most data that is not curated is lost simply because the media are not maintained in a library. And this also is a factor in balancing the flows of information that are produced and are lost.

Furthermore, for digital documents, there is another level of maintenance to the 'library'. We also need to maintain the software that manages storage, retrieval and visualization. Stewart Brand's "Clock of the Long Now" considers these factors as well. In effect, we'll need to periodically load digital documents that were encoded by one generation of software and "translate" or re-encode them for the next generation. The vast majority of digital data will be lost when there are no longer processors and apps that can read it. This is another kind of 'natural death' that balances the creation of data with its demise.

But there's a third kind of natural death, one based on the social and cultural context needed to interpret information. Without institutional knowledge and context preserved in living human beings, we won't be able to understand or utilize data even when it has been preserved on some medium, organized, accessed, retrieved and visualized. We can still read Shakespeare today not only because his work is continuously reprinted on a massive scale, but also because we crank out dozens of PhDs every year and hundreds of thousands of high school students and college undergrads who read him. The energy to maintain all of THIS infrastructure is vast and absolutely necessary. It's a commitment of resources so huge that it will NEVER be allocated to maintain most of our personal data. Absent this level of institutional attention, most of the data we produce will be lost, as it was in prehistoric and pre-literate societies. Even if we had documents produced by Shakespeare's grammar school teachers, would we know enough about the grammar school of his day to interpret them? Unless we decide as a society to preserve that context, those documents are essentially lost to us as well.

Caleb's essay is interesting and his question is a good one, but in my view this is not a problem at all. We produce a huge amount of data that will not be carried forward forever. Instead it will die a 'natural' death when critical links in the chain of custody are lost: integrity of the medium, maintenance of access and visualization methods, or the preservation of cultural context. Personal data has a life cycle, that cycle includes a natural death, and for most of us it will be limited to a few generations at most.


  2. How long do you suppose this blog will be accessible? I might not even maintain it at some point...