from April 12, 2017
spotted by Randy Thelen

How Google Book Search Got Lost

Google Books was the company’s first moonshot.
But 15 years later, the project is stuck in low-Earth orbit.

Books can do anything. As Franz Kafka once said, “A book must be the axe for the frozen sea inside us.”

It was Kafka, wasn’t it? Google confirms this. But where did he say it? Google offers links to some quotation websites, but they’re generally unreliable. (They misattribute everything, usually to Mark Twain.)

To answer such questions, you need Google Book Search, the tool that magically scours the texts of millions of digitized volumes. Just find the little “more” tab at the top of the Google results page: it’s right past Images, Videos, and News. Then click on it, find “Books,” and click on that. (That’s if you’re at your desk. On mobile, good luck locating it anywhere.)

It turns out that the “frozen sea” quote is from Kafka’s Letters to Friends, Family, and Editors, in a missive to Oskar Pollak, dated January 27, 1904.

Google Book Search is amazing that way. When it started almost 15 years ago, it also seemed impossibly ambitious: An upstart tech company that had just tamed and organized the vast informational jungle of the web would now extend the reach of its search box into the offline world. By scanning millions of printed books from the libraries with which it partnered, it would import the entire body of pre-internet writing into its database.

“You have thousands of years of human knowledge, and probably the highest-quality knowledge is captured in books,” Google cofounder Sergey Brin told The New Yorker at the time. “So not having that, it’s just too big an omission.”

Today, Google is known for its moonshot culture, its willingness to take on gigantic challenges at global scale. Books was, by general agreement of veteran Googlers, the company’s first lunar mission. Scan All The Books!

In its youth, Google Books inspired the world with a vision of a “library of utopia” that would extend online convenience to offline wisdom. At the time it seemed like a singularity for the written word: We’d upload all those pages into the ether, and they would somehow produce a phase-shift in human awareness. Instead, Google Books has settled into a quiet middle age of sourcing quotes and serving up snippets of text from the 25 million-plus tomes in its database.

Google employees maintain that’s all they ever intended to achieve. Maybe so. But they sure got everyone else’s hopes up.

Two things happened to Google Books on the way from moonshot vision to mundane reality. Soon after launch, it fell from the idealistic ether into a legal bog, as authors fought Google’s right to index copyrighted works and publishers maneuvered to protect their industry from being Napsterized. A decade-long legal battle followed, one that finally ended last year, when the US Supreme Court turned down an appeal by the Authors Guild and definitively lifted the legal cloud that had so long hovered over Google’s book-related ambitions.

But in that time, another change had come over Google Books, one that’s not all that unusual for institutions and people who get caught up in decade-long legal battles: It lost its drive and ambition.

When I started work on this story, I feared at first that Books no longer existed as a discrete part of the Google organization, that Google had actually shut the project down. As with many aspects of Google, there’s always been some secrecy around Google Books, but this time, when I started asking questions, it closed up like a startled turtle. For weeks there didn’t seem to be anyone around or available who could or would speak to the current state of the Books effort.

The Google Books “History” page trails off in 2007, and its blog stopped updating in 2012, after which it got folded into the main Google Search blog, where information about Books is nearly impossible to find. As a functioning and useful service, Google Books remained a going concern. But as a living project, with plans and announcements and institutional visibility, it seemed to have pulled a vanishing act. All of which felt weird, given the legal victory it had finally won.

When I talked to alumni of the project who’d left Google, several mentioned that they suspected the company had stopped scanning books. Eventually, I learned that there are, indeed, still some Googlers working on Book Search, and they’re still adding new books, though at a significantly slower pace than at the project’s peak around 2010–11.

“We’re not focused on shiny features and things that are very visible to users,” says Stephane Jaskiewicz, a Google engineer who has worked on Books for a decade and now leads its team. “It’s more like behind the scenes work and perfecting the technology: acquiring content, processing it properly so that we can view the entire book online, and adjusting the search algorithm.”

One focus of work has been a constant throughout Google Books’ life: improving the scanners that add new books to the “corpus,” as the database is known. At the birth of the project, in 2002, as Larry Page and Marissa Mayer set out to gauge how long it might take to Scan All The Books, they set up a digital camera on a stand and timed themselves with a metronome. Once the company got serious about ramping its scanning up to efficient scale, it started jealously guarding details of the operation.

Jaskiewicz does say that the scanning stations keep evolving, with new revisions rolling out every six months. LED lighting, not widely available at the project’s start, has helped. So has studying more efficient techniques for human operators to flip pages. “It’s almost like finger-picking on a guitar,” Jaskiewicz says. “So we find people who have great ways of turning pages: where is the thumb and that kind of stuff.”

Still, the bulk of the work at Google Books continues to be on “search quality,” making sure that you find the Kafka passage you need, fast. It’s an unglamorous game of inches: less moonshot and more, say, satellite maintenance.

To understand how Google Books arrived at this point, you need to know a few things about copyright law, which essentially divides books into three classes. Some books are in the public domain, which means you can do what you want with their texts. These are mostly books published before 1923, as well as more recent books whose authors chose to release them from standard copyright. Plenty of more recent books are still in print and under copyright; if you want to do anything with these texts, you have to come to terms with their authors and publishers.

Then there’s the third category: books that are out of print but still under copyright, known informally as “orphan works.” It turns out there are a whole lot of these: “between 17 percent and 25 percent of published works and as much as 70 percent of specialized collections,” a study by the US Copyright Office suggests.

How many books is that? No one knows for sure because no one can say with any certainty exactly how many total books there are. The statistic depends on how you define “book,” which isn’t as easy as it sounds. In 2010 a Google engineer named Leonid Taycher wrote a blog post that examined Google Books’ metadata and concluded that the number (then) was about 130 million. Others looked at this work and called it “bunk.” The actual number is probably somewhat lower than Taycher’s figure yet considerably higher than Google Books’ current 25 million-plus.

Some large chunk of that large number, then, are “orphan works.” And until recently, they weren’t much of an issue. You could borrow them from a library or find them in a used bookstore, and that was that. But once Google Books proposed to scan them all and make them available to the internet, everyone seemed to want a piece of them.

The legal battle that ensued was, essentially, a custody fight over these orphans, in which Google, publishers, and authors each sought to control the process of ushering them into a new home for the digital age. The three parties eventually agreed on a grand compromise known as the Google Books Settlement, under which Google would go ahead and make the orphan works available in their entirety and set aside money to compensate rights holders who stepped forward. But in 2011, a federal judge rejected the settlement, ruling in favor of advocates who feared it would forever ensconce a private for-profit company as the registrar and toll collector of the universe’s library.

Once the settlement collapsed, Google went back to its scanning, and publishers pursued the burgeoning business of selling e-books, which had leapfrogged Google’s lead in the future-of-books race due to the success of Amazon’s Kindle. But the Authors Guild continued to press its lawsuit, charging that Google’s arrogation of the right to scan and index books without the permission of copyright holders was illegal. Google is wealthy, but not so wealthy that it could ignore the threat of multi-billion dollar copyright infringement penalties (thousands of dollars per book for millions of books). This was the proceeding that dragged on until the Supreme Court put it out of its misery last year, establishing once and for all that Google had a fair-use right to catalogue books and provide brief excerpts (“snippets”) in search results, just as it did with web pages.

That ruling represents a foundational achievement for the future of online research, Google’s and everyone else’s. “It’s now established precedent: everyone benefits,” says Erin Simon, Google Books’ product counsel today. “This is going to be in textbooks. It’s supremely important for understanding what fair use means.” (Simon also notes with a chuckle that when the suit was originally filed, she hadn’t yet started law school.)

The Authors Guild may have lost in court, but it believes the fight was worth it. Google “did it wrong from the beginning,” says James Gleick, president of the Guild’s board. “They plowed ahead without involving the creative community on whose backs they were building this new thing. The big companies have a droit du seigneur attitude toward creative work. They think, ‘We are the masters of the universe now.’ They should have just licensed the books instead.”

You’d think a Supreme Court victory would have meant a renewal of energy for Google Books: Rev up the scanners, full speed ahead! By all the evidence, that has not been the case. Partly that’s because the database is so huge already. “We have a fixed budget that we’re spending,” says Jaskiewicz. “At the beginning, we were scanning everything on every shelf. At some point we started getting a lot of duplicates.” Today Google gives its partner libraries “pick lists” instead.

There are plenty of other explanations for the dampening of Google’s ardor: The bad taste left from the lawsuits. The rise of shiny and exciting new ventures with more immediate payoffs. And also: the dawning realization that Scanning All The Books, however useful, might not change the world in any fundamental way.

To many bibliophiles, Google’s self-appointment as universal librarian never made sense: That role properly belonged to some public institution. Once Google popularized the notion that Scanning All The Books was a feasible undertaking, others lined up to tackle it. Brewster Kahle’s Internet Archive, which stores historical snapshots of the whole web, already had its own scanning operation. The Digital Public Library of America grew out of meetings at Harvard’s Berkman Center beginning in 2010 and now serves as a clearinghouse and consortium for the digital collections of many libraries and institutions.

When Google partnered with university libraries to scan their collections, it had agreed to give them each a copy of the scanning data, and in 2011 the HathiTrust began organizing and sharing those files. (It had to fend off the Authors Guild in court, too.) HathiTrust has 125 member organizations and institutions who “believe that we can better steward research and cultural heritage by working together than alone or by leaving it to an organization like Google,” says Mike Furlough, the trust’s director. And of course there’s the Library of Congress itself, whose new leader, Carla Hayden, has committed to opening up public access to its collections through digitization.

In a sense each of these outfits is a competitor to Google Books. But in reality, Google is so far ahead that none of them is likely to catch up. The consensus among observers is that it cost Google several hundred million dollars to build Google Books, and nobody else is going to spend that kind of money to perform the feat a second time.

Still, the nonprofits have a strength Google lacks: They’re not subject to the changing priorities of a gigantic technology corporation. They have a focused commitment around books, unencumbered by distractions like running one of the largest advertising businesses in the world or managing a smartphone ecosystem. Unlike Google, they’re not going to lose interest in seeking new ways to connect readers with books that might, a la Kafka, melt a frozen mind.

In popular mythology, interminable lawsuits turn into hungry maelstroms that drown the participants. (The archetype is Dickens’ Jarndyce v. Jarndyce from Bleak House, the generations-spanning estate fight whose legal fees eat up all the assets at stake.) In the tech business, court battles like the celebrated antitrust suit that plagued IBM for years tend to pinion giant corporations and provide new competitors with an opening to lap an incumbent. Google itself rose to dominate search while Microsoft was busy defending itself from the Justice Department.

Yet the Books fight was never as central to Google’s corporate being as that kind of all-consuming conflict. And it wasn’t all a waste, either. It taught Google something valuable.

As the Authors Guild’s Gleick points out, Google started Books with a “better ask forgiveness than permission” attitude that’s common today in the world of startups. In a sense, the company behaved like the Uber of intellectual property (a kind of read-sharing service) while expecting to be seen the way it saw itself, as a beneficent pantheon of wizards serving the entire human species. It was naive, and the stubborn opposition it aroused came as a shock.

But Google took away a lesson that helped it immeasurably as it grew and gained power: Engineering is great, but it’s not the answer to all problems. Sometimes you have to play politics, too: consult stakeholders, line up allies, compromise with rivals. As a result, Google assembled a crew of lobbyists and lawyers and approached other similar challenges, like navigating YouTube’s rights maze, with greater care and better results. It grew up. It came to understand that it could shoot for the moon, but it wouldn’t always get there.

It’s possible that Google might someday take another run at solving the orphan works problem. But it looks like it’s going to wait for others to take the lead. “I don’t know that there’s anything that we could do without a different legal framework,” says Jaskiewicz.

As I worked on this piece, I kept thinking back to a book I’d read a few years ago called Mr. Penumbra’s 24-Hour Bookstore, a whimsical, nerdy novel by Robin Sloan. It’s about a secret society dedicated to solving a centuries-old Name of the Rose-style mystery that’s rooted in bookmaking and typography. Google plays a critical supporting role in Penumbra, as the protagonist attempts to unravel the riddle at the story’s heart. As it turns out, even the company’s unrivaled informational prowess isn’t enough to do the trick. That takes a chance encounter between the protagonist and a particular book that provides an illuminating insight. It takes, in the phrase with which Sloan closes his tale, “exactly the right book, at exactly the right time.”

Penumbra reminds us that Google’s engineering mindset isn’t omnipotent. Breaking a challenge into approachable pieces, turning it into data, and applying efficient routines is a powerful way to work. It can carry you a good distance toward a “library of utopia,” but it won’t get you there.

And even if you get there, it isn’t utopia, anyway. The hard labor is still ahead. That’s because when you turn a book into data, you make it easy to find quotes and search snippets, but you don’t make it fundamentally easier to do the work of reading the book, that irreplaceable experience of allowing one’s own mind to be temporarily inhabited by the voice of another person.

To date, the full experience of reading a book requires human beings at both ends. An index like Google Books helps us find and analyze texts but, so far, making use of them is still our job. Maybe the quest to digitize all books was bound to end in disappointment, with no grand epiphany.

Like many tech-friendly bibliophiles, Sloan says he uses Google Books a lot, but is sad that it isn’t continuing to evolve and amaze us. “I wish it was a big glittering beautiful useful thing that was growing and getting more interesting all the time,” he says. He also wonders: We know Google can’t legally make its millions of books available for anyone to read in full, but what if it made them available for machines to read?

Machine-learning tools that analyze texts in new ways are advancing quickly today, Sloan notes, and “the culture around it has a real Homebrew Computer Club or early web feel to it right now.” But to progress, researchers need big troves of data to feed their programs.

“If Google could find a way to take that corpus, sliced and diced by genre, topic, time period, all the ways you can divide it, and make that available to machine-learning researchers and hobbyists at universities and out in the wild, I’ll bet there’s some really interesting work that could come out of that. Nobody knows what,” Sloan says. He assumes Google is already doing this internally. Jaskiewicz and others at Google would not say.

Maybe, when some neural network of the future achieves self-awareness and finds itself paralyzed by Kafka-esque existential doubts, it will find solace, as so many of us do, in finding exactly the right book to shatter its psychic ice. Or maybe, unlike us, it will be able to read all the books we’ve scanned, really read them, in a way that makes sense of them. What would it do then?