Wednesday, November 5, 2008

Google Makes PDF Files Searchable

Google has rarely included scanned documents in its search results because it has had no way to determine the nature of their content, but that's about to change. The search engine giant says it will use optical character recognition (OCR) software to let Web surfers search scanned documents hosted on the Web in the PDF file format developed by Adobe Systems.

Google is using the technology to convert scanned documents into equivalent text files that can be searched, indexed and returned as responses to Google search queries, noted Evin Levey, a Google product manager.
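The convert-then-index flow Levey describes can be illustrated with a minimal Python sketch. It is not Google's implementation: it assumes the OCR step has already produced plain text for each scanned file, and the file names, sample text and the helper functions tokenize, build_index and search are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index mapping each token to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the set of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical OCR output: text assumed to have been extracted from scanned PDFs.
ocr_text = {
    "scan-001.pdf": "Annual report of the library committee",
    "scan-002.pdf": "Report on optical character recognition",
}

index = build_index(ocr_text)
print(sorted(search(index, "library report")))
```

Once the scanned pages exist as text, they can be matched against queries like any other document; the hard part, which this sketch leaves out, is the OCR step itself.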

"This is a small but important step forward in our mission of making all the world's information accessible and useful," Levey said.

A Boon for Books

The company's brute-force application of OCR technology to the Web is also expected to aid Google Book Search -- the ambitious and controversial book-scanning project that the search engine giant first unveiled at the 2004 Frankfurt Book Fair. Ever since, Google has been scanning the book collections at the world's major libraries at a rate of 3,000 book titles per day.

Though the project initially raised copyright concerns, Google has just concluded an agreement with the Authors Guild and the Association of American Publishers under which Google will be able to expand online access to millions of in-copyright books and other written materials in the United States. The agreement resolves lawsuits that had challenged Google's plan to digitize, search and show snippets of in-copyright books and to share digital copies with libraries without the explicit permission of the copyright owner.

Google's Chief Legal Officer David Drummond says the agreement is truly groundbreaking for three reasons. First, it will give readers online access to millions of in-copyright books for the very first time.

"Second, it will create a new market for authors and publishers to sell their works," Drummond explained. "And third, it will further the efforts of our library partners to preserve and maintain their collections while making books more accessible to students, readers and academic researchers."

Pursuing the Holy Grail

Given the continuing exponential growth of multimedia on the Web, however, the text-based nature of today's search-engine technology is clearly inadequate. That's because current-generation search engines can only locate multimedia material that has been tagged in text -- a cumbersome, time-consuming process that content producers often overlook.

This explains why a number of researchers are in hot pursuit of the Holy Grail of search -- a means by which search engine providers can directly scan multimedia content and match it to search queries and to the ad placement requests of their customers. Adobe Systems has already taken a step along the road to producing the next generation of search technology.

In July, the company revealed that it had optimized its Adobe Flash Player technology to enable search engines to index multimedia content produced in the Flash file format -- content that previously had been undiscoverable.

"We are initially working with Google and Yahoo to significantly improve search of this rich content on the Web," explained David Wadhwani, an Adobe vice president. "And we intend to broaden the availability of this capability to benefit all content publishers, developers and end users."
