The news came out last month.
With little fanfare, Google has made a mammoth database culled from nearly 5.2 million digitized books available to the public for free downloads and online searches, opening a new landscape of possibilities for research and education in the humanities.
The digital storehouse, which comprises words and short phrases as well as a year-by-year count of how often they appear, represents the first time a data set of this magnitude and searching tools are at the disposal of Ph.D.’s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words contained in books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian.
The intended audience is scholarly, but a simple online tool allows anyone with a computer to plug in a string of up to five words and see a graph that charts the phrase’s use over time — a diversion that can quickly become as addictive as the habit-forming game Angry Birds.
By all reports since, the prediction of addictive diversion has proved true, as has the expected exploration of the database’s capabilities by scholars. Using what is called the Google Books Ngram Viewer, researchers have already confirmed the first significant flaw among those that were hypothesized: the quality of the OCR (optical character recognition) used in Google’s massive scanning project. Though acceptable for other uses, such as ordinary reading, the OCR’s error rate may currently be too high to produce reliable results at the research level. One specifically identified problem is the failure to recognize the “medial S” (the long s, ſ) of pre-1800 typesetting, which looks like an F. Aside from the other, ah, disagreements that may ensue from confusions between “fuck” and “suck,” the inability of OCR to distinguish between the F and the S can lead to highly significant error rates in word recognition. Still, this is a problem that will, no doubt, be overcome. The future still awaits.
“The goal is to give an 8-year-old the ability to browse cultural trends throughout history, as recorded in books,” said Erez Lieberman Aiden, a junior fellow at the Society of Fellows at Harvard. Mr. Lieberman Aiden and Jean-Baptiste Michel, a postdoctoral fellow at Harvard, assembled the data set with Google and spearheaded a research project to demonstrate how vast digital databases can transform our understanding of language, culture and the flow of ideas.
Their study, to be published in the journal Science on Friday, offers a tantalizing taste of the rich buffet of research opportunities now open to literature, history and other liberal arts professors who may have previously avoided quantitative analysis. Science is taking the unusual step of making the paper available online to nonsubscribers.
“We wanted to show what becomes possible when you apply very high-turbo data analysis to questions in the humanities,” said Mr. Lieberman Aiden, whose expertise is in applied mathematics and genomics. He called the method “culturomics.”
What is of further interest, though, is the reaction to the project from some who are actually in the humanities.
Reactions from humanities scholars who quickly reviewed the article were more muted. “In general it’s a great thing to have,” Louis Menand, an English professor at Harvard, said, particularly for linguists. But he warned that in the realm of cultural history, “obviously some of the claims are a little exaggerated.” He was also troubled that, among the paper’s 13 named authors, there was not a single humanist involved.
“There’s not even a historian of the book connected to the project,” Mr. Menand noted.
That last point is not negligible; it might even have led to anticipation of the medial S flaw. But it is difficult to see any genuine implications following from it. What is meaningful is this humanist discomfort. From whence does it flow?
Truthfully speaking – and why not, let’s? – humanists (as in those whose fields are the humanities, not those who oppose “man” as the measure of all things to God as their source) have a paradoxical set of both inferiority and superiority complexes regarding the sciences. Scientists are more fortunate; they have only the superiority complex. The – oh, we wouldn’t want to use a word like “outcomes” – of study in the humanities are so… indeterminate. It is easy for those of the most limited imagination in the sciences, who by description would not think that much of a knock, to dismiss the subjective vagaries of study in the humanities as simply, in the end, not real. It is easy enough, in return, for humanists to dismiss the likes of them. Not much conversation there.
The greater threat comes from those in the sciences who recognize the significance of meaning, and who believe that science can discover it. It was Stephen Hawking, for instance, hardly alone in such a thought, who claimed at the end of A Brief History of Time that if we could discover the origins, in its nature, of the Big Bang, “then we should know the mind of God.”
That’s a little turf stepping there.
As it is already, the world we live in is so increasingly driven and directed by outcomes – the empirically verifiable – and by technology that it is easy to fear it is already the technopoly Neil Postman predicted. If the sciences are going to claim the capacity to lead us not only to ultimate empirical evidence and relations, but to their meaning as well, what will humane studies have come to in the end beyond pleasing diversion and a salve for the illusory self and soul?
Thus that insecurity complex of humanists. We know you can’t see it, humanists are forced to make their case to young people socially driven to and eyeing business and science and policy careers – there are no data sets, but alone at night and on Sundays, and summer days by lakes, when you read a book or find yourself somehow aesthetically moved, or the first time someone you love dies, what we teach you will rise up before you in ways you can’t anticipate now.
Hard case, that.
Johnny’s dad coded the Stuxnet worm. You read poetry?
Thus that inferiority complex.
Aware of concerns raised by humanists that the essence of their art is a search for meaning, Mr. Michel and Mr. Lieberman Aiden emphasized that culturomics simply provided information. Interpretation remains essential.
“I don’t want humanists to accept any specific claims — we’re just throwing a lot of interesting pieces on the table,” Mr. Lieberman Aiden said. “The question is: Are you willing to examine this data?”
Ah, the search for meaning. That old dog.
I’ve peppered this post with three key terms: implication, significance, and meaning. If you look them up you will find they have much overlap in their definitions and can, in contexts, be used synonymously. But they are separate words for reasons, by just that range of separation from each other in their primary definitions. Implication is the more logical, thus scientific word, of entailment, of necessary relation. Significance has scientific use too, as in what is meant by statistical significance, but it runs first into meaning, and meaning in its primary sense is “something that is conveyed or signified; sense or significance.” I have gone for a passive construction in offering the idea of “conveyance,” so we needn’t concern ourselves with a creator of meaning, an author, a God, who does the conveying. In any event, we receive – perhaps even decide – what is conveyed. And “sense” – was ever a more unscientific word uttered by historian or biologist?
It is here, in the labyrinth of meaning, that some humanists will feel the superiority of what they do. After all, no metaphysics ever proposed the greater reality of phenomena over noumena. Some have only ever argued that the latter do not exist.
Christopher Hitchens does not simply disbelieve in God; he rejects the nature of the God manifested in both Testaments. Albert Camus thought the universe absurd, yet he found meaning in his life. Let particle and astrophysics lead us to the “mind of God” or a chance event, it is still for us to determine what conveyance we find in those logical and material implications.
Let the Ngram viewer be used in great fascination to extract from the history of the written word signs of our preoccupations and changing beliefs. Let those so skilled create data sets and tease out implications. Let those with the insight and training search for meaning. And let humanists relax.