In today’s interconnected, always-online, digital-first world, we tend to think that every piece of information in the world is instantly accessible at our fingertips. Yet the reality is that as we wrap our daily lives ever further in our digital blankets, our understanding of the world around us and the information accessible to us is ever more defined by what has been digitized or born digital. As we focus on information bubbles and how algorithms increasingly decide what we consume online, we all too often forget that these bubbles and algorithmic decisions are themselves constrained to the information that is available in the digital realm. What about our vast undigitized past? As we race towards our digital future, will we lose touch with and ultimately forget our history?

When we talk about preserving our history today, the conversation typically turns to web archives preserving our online world or museums and traditional archives preserving our past. Yet, in today’s digital world, that which has not been digitized does not exist. As academic libraries increasingly migrate their holdings from freely browsable stacks to inaccessible warehouses and as general society becomes accustomed to accessing the world’s literature through digital screens, works must be digitized to remain visible. Underrepresented topics, geographies and languages are rarely a focus of digitization efforts, meaning their content is especially at risk of being lost to the digital era. Research and preservation focus on the realtime here and now, with far less attention paid to the past. In many ways the digital world is reprogramming human society to be locked in the present, gazing towards the future while our past falls into the memory hole behind us.

In many ways Google Books inaugurated the heyday of the digitization boom, proving that mass-scale, access-oriented digitization could reach volumes not possible under traditional preservation digitization and could actually make it possible to digitize a large fraction of the world’s books. Yet, while mass digitization efforts continue on many fronts, there has been a noticeable plateau, especially when it comes to expanding beyond traditional English language content. Here, the rise of powerful smartphone cameras has the potential to enable vast crowdsourced digitization of underrepresented materials.

Converting all of this scanned imagery to searchable content has become vastly easier with the rise of neural network OCR algorithms that can convert a 100-page book in any of 56 languages to searchable text for just 35 cents, in a matter of seconds, and at an accuracy approaching that of a single human transcriber.

Yet it is not technology that forms the greatest challenge to accessing our past, but rather how copyright and fair use are interpreted in the digital era: how to protect the rights of content owners while enabling transformative new applications of their material that benefit society without taking away from those rights. Such issues are especially acute for orphaned works, for which copyright protections may still apply but where even extensive research cannot determine who legally has the right to grant permission to digitize or access their contents.

The result is “the missing 20th century,” in which there are as many books available on Amazon published from 2000-2010 as there are published from 1900-1910, but precious little from the 1920s to the 1990s. This is the period still covered by copyright protection in the United States and predating the born-digital era.

Each of the major book digitization efforts has dealt with this period in a different way, choosing either to digitize in-copyright works and restrict access to simple keyword searches or to limit itself to donated or public domain content (such as US Government works) over the period in question. The result is that data mining digitized book archives will often yield drastically different results in the post-1923 era.

Yet, nowhere is the impact of copyright on digitization and data mining more apparent than in the map below, which shows every geographic location mentioned in the text of all public domain English language books digitized by the Internet Archive from 1800 to 2013, by year. As the map marches forward year by year, the world rapidly expands as English language works broaden to focus on the entire world. Suddenly, as the timeline crosses 1922, the world becomes a lot smaller as the Archive is unable to digitize the majority of works from that point forward.

Kalev Leetaru

Map showing all geographic locations mentioned in the Internet Archive book collection published by year 1800-2013

Every one of those dots that disappears in 1923 represents knowledge that is excluded from our digital world. Data miners cannot access those data points at all, while scholars and ordinary citizens must expend great effort to specifically seek them out and leave the comfort of their digital world to traverse the barren antiquated physical realm. As we descend ever further into our digital world, such points will be gradually extinguished forever, the domain of the few who live outside the exclusively digital domain.

Putting this all together, as our access to the world around us is increasingly mediated by screens and our understanding of it defined by digital information, we are rapidly losing touch with our undigitized past, left adrift in an ever-changing ephemeral world of bits and bytes without our physical past to anchor us. As George Santayana so famously put it, “Progress, far from consisting in change, depends on retentiveness … when experience is not retained … infancy is perpetual. Those who cannot remember the past are condemned to repeat it.”


