Google and the Deep Web - How to Find Out Anything (2012)

How to Find Out Anything

2 Google and the Deep Web

Google

Google needs no introduction, but it certainly needs an explanation.

Google is a paradox. Without it, the Internet is nothing more than an incomprehensible jumble of websites, but, believe it or not, Google actually misses more information than it finds. This indispensable research tool cannot make sense of all the information stored in the recesses of the “deep web,” a universe of databases where more information lives than in the “surface web” that Google users can see. Google makes the keyword search box so trivially simple to use, most users overlook the far more powerful “Advanced Search,” which can produce much better search results. Users tolerate the truckloads of useless information that every broad Google search turns up when they could save themselves time and effort by learning how to filter their results. And technology aside, Google’s seeming omnipotence lulls researchers into thinking that they’ve exhausted the known universe of information when, in fact, they’ve merely generated a crowd-sourced list of suggested places to look. Google is the Schrödinger’s cat of search engines—it’s simultaneously the greatest boon to online research ever invented and the archnemesis of effective information gathering.

Google, of course, is not the only search engine in town, but it is by far the best known and most powerful; according to Search Engine Watch, an organization that tracks such things, more than two-thirds of all search engine queries each month are run on Google, so I’ll limit my discussion to it alone. For our purpose, which is learning how to conduct serious research, we need to understand three critical things about the Internet’s number-one search tool:

· It cannot yet comprehensively search the deep web (sometimes called the “invisible web”) and so misses more than it finds;

· Users don’t use “Advanced Search” and consequently suffer with bloated search results;

· Google is great for simple searches, but users too often accept Google’s results at face value without critically evaluating what they are seeing.

It’s this last item that should concern us most. We’ll get up to speed on the deep web shortly, and all of Chapter 3 is devoted to learning “Advanced Search.” Before we get to those, it’s important to understand that Google works best when you bring your own judgment to the process. The old teacher’s chestnut, “You get out of it what you put into it,” applies here too. And in fact it was a history teacher, Kevin M. Levin, who summed up the problem with Google perfectly when he described how his students approached online research. In the New York Times, he wrote,

These days, children turn first to their search engines to find information; they conduct a few keyword searches and click on the most popular results without questioning either the search engine’s ranking algorithm or the source of the content…. A search is only as good as the search strategy. The outcome of any search will be determined by a host of factors, including the choice of browser and keywords.

Levin is on to something. Popping a few words into Google and taking what comes back as gospel is a sketchy proposition. He’s not the only one who worries that online users are uncritical consumers of search engine results. In 2008, the British Library and the Joint Information Systems Committee (JISC) studied the so-called Google generation to find out how effectively people were using new technology tools to look up and read information. In the report titled “Information Behaviour of the Researcher of the Future,” the library concluded that more time is spent browsing the web than critically evaluating information. At the risk of sounding like some hectoring schoolmaster, what Google has done is to make it easier to skim a wide variety of information. It does not encourage a deep dive into the text. That’s a problem.

If a steady diet of junk food and a sedentary lifestyle lead to flabbiness, an unvaried diet of search engine results (accepting Google abstracts as writ without critical reading) leads just as surely to something that James Morris, former dean of the School of Computer Science at Carnegie Mellon, calls “infobesity.” Half-baked explanations, errors in facts or logic, and other unfiltered detritus that Google unearths can make for easy access to inaccurate information. Just because the text was produced by some fancy computer program and is displayed on a sleek tablet computer does not make the text any more or less reliable. A skeptical engagement with search results is as important as the analysis a smart reader would apply to the text of a traditional book.

So, with all due deference to its undeniably brilliant engineering, Google is fundamentally a web-indexing tool. We need web-indexing tools, of course, the same way books need tables of contents and indices. Strangely, while no one would seriously argue that looking over the table of contents and the index of a book is the same as reading the whole work, too many Google users think it’s fine to skim the results list and stop there, which is essentially the same thing. And when well-written and reasoned materials appear right next to things that could have been written by a trained ape, it can be easy to give them equal weight: Many Google users conclude that whatever sites make the cut and show up within the all-important top twenty results are of equal value, and that just isn’t necessarily so.

Online information should go through the same critical vetting process material does in print. Does this author know what the heck she is talking about? Is this information accurate? Is it free from bias? If the author is making a point, does the reasoning hold up or does the author make specious claims? Has care been taken to write clearly and grammatically? In short, reading through a list of Google results is first and foremost an exercise in making choices about which results promise to answer your questions and which should be dismissed as unworthy of your attention. It is a given that Google will produce far more results than you could ever possibly need—it is up to you to sift through this mass and find the materials that contain valuable information.

Of course, Google occupies a very important place in your arsenal of tools to help you find out anything. It exists to point you to the places where you need to look for information. It does an excellent job of that, but even then, bear in mind that Google suffers from a giant blind spot concerning the deep web. The term deep web refers to the universe of web-accessible information locked away in databases where Google’s spidering computers can’t see it. That limitation is a very serious knock against Google and the most important reason for never depending on it exclusively for online information.

The Deep Web

In the 1990s, information professionals began to notice a peculiarity about search engines. Most did a credible job of locating web pages when all of a site’s information was readily exposed. (Remember, this was back in the days when most websites presented their information on web pages built from flat HTML, not the fancy, feature-rich sites we see today.) Then web design matured. Sites were no longer created from hard-coded text. Instead, most were designed to produce information on the fly, pulled from a database, in response to user queries. Suddenly, search engines didn’t look as omnipotent as they once did. These days, the bulk of the interesting data is in the database, not on a web page where Google’s computers can find it.

Look at it this way. You certainly can Google the question “Do any trains run between Philadelphia and Boston on July 19?” Google obligingly finds more than 1,900,000 possible responses to the question. But, really, what you want to do is go to Amtrak’s site; specify the time, date, and class of service; and generate an answer from a database. By querying the Amtrak trip planner, you are tapping into the deep web. Or you may Google all you want for available tickets to the Bruce Springsteen concert, but eventually you’ll need to go to StubHub! or eBay or TicketMaster to run a search, because Google can’t peek into the database to see if the tickets are available. Deep web again. And this is the crux of the issue. Using Google for simple and routine questions is fine, but for academic research, in-depth business searches, tracking down people, or locating historic or hard-to-find data, Google searching doesn’t cut it. You have to know how to wring answers out of the deep web if you want to do thorough research. That will mean time spent searching databases, following the leads suggested by those databases, and digging around on searchable pages to find information.
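For readers who like to see the mechanics, here is a minimal sketch of why a link-following spider misses database content. Everything in it is hypothetical: the toy site, its page paths, and the schedule records are invented for illustration and have nothing to do with how Amtrak or Google actually work. The spider collects every page it can reach by following links, yet the departure times never turn up, because they exist only as answers to a query.

```python
# Toy model of a site with a few static pages plus a database behind a form.
# All pages, paths, and schedule rows are hypothetical illustration data.

STATIC_PAGES = {
    "/": {"text": "Welcome to Example Rail", "links": ["/about", "/search"]},
    "/about": {"text": "About Example Rail", "links": ["/"]},
    "/search": {"text": "Enter origin, destination, and date", "links": []},
}

SCHEDULE_DB = [
    {"origin": "Philadelphia", "destination": "Boston", "date": "07-19", "departs": "08:15"},
    {"origin": "Philadelphia", "destination": "Boston", "date": "07-19", "departs": "13:40"},
    {"origin": "Boston", "destination": "New York", "date": "07-19", "departs": "09:05"},
]


def crawl(start="/"):
    """Follow links from the start page and return the text of every page reached."""
    seen, queue = set(), [start]
    while queue:
        path = queue.pop()
        if path in seen or path not in STATIC_PAGES:
            continue
        seen.add(path)
        queue.extend(STATIC_PAGES[path]["links"])
    return {path: STATIC_PAGES[path]["text"] for path in seen}


def query_schedule(origin, destination, date):
    """Return departure times for one route and date, as a search form would."""
    return [
        row["departs"]
        for row in SCHEDULE_DB
        if (row["origin"], row["destination"], row["date"]) == (origin, destination, date)
    ]


indexed = crawl()                                            # what the spider sees
departures = query_schedule("Philadelphia", "Boston", "07-19")  # what a query returns

print(sorted(indexed))
print(departures)
```

The spider does reach the search page itself, but it has no way of knowing which origins, destinations, and dates to type into it, so the records behind the form stay out of its index.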

Imagine you are a graduate student in art history writing a paper about the American photographer Timothy O’Sullivan. When I ran a Google search, I found a Wikipedia page, a guide to his work from the J. Paul Getty Museum, something from a website called Masters of Photography, and a whopping three paragraphs from the Encyclopaedia Britannica along with more than 100,000 other hits, including pictures, YouTube videos, and plenty of links to other Timothy O’Sullivans, very few of whom photographed the Gettysburg battlefield. Do you stop with that Google search? Of course not. Google is a great tool to use to start research, but it’s a terrible place to end it. The smart researcher realizes that the really useful information is hidden away in databases that you’ll need to seek out yourself. Think of the questions that Google is not answering. Where in that results list is the bibliography of scholarly articles about his work? Did it name the experts in nineteenth-century Civil War photography? What do current photo collectors think of O’Sullivan’s work? How much are his pictures worth? And where would you go, as an ambitious grad student, to see for yourself what the original pictures looked like?

For those answers, you’ll need to leave the confines of Google and start plugging queries into art history databases, mining library catalogs with extensive holdings in photography, and maybe even paying a small subscription fee to one of the commercial services that track sales prices in the art market. This heavy lifting is for you to do, not Google. If what you most want to see is deep in a database, it won’t bubble up in a hit list.

As you work through your research questions, you will discover that your answer will lie not in the contents of web pages that Google can see but in the searchable databases that it can’t. These databases account for far more information than anything that is readily visible on the so-called surface web. And it is these databases and their contents that are the bread and butter of good research. Let Google or the deep web search engines find the databases for you, and then search the appropriate ones for the question you’re asking.

Although the numbers from the experts are always inexact, the deep web is estimated to contain at least four or five times the amount of information that sites publish directly to the web and that is available to Google and other search engines. This tip-of-the-iceberg picture means that any conscientious researcher ought to be focusing on the retrievable data from the databases of the deep web more than on the results of a simple Google search.

Although I am not a fan of Wikipedia for research, I will in this one instance recommend it for its ongoing description of the invisible web as a way to keep up with developments in the area. Search engine companies, Google especially, understand that the current inability to harvest the riches of the hidden web is a significant handicap and are working to solve the problem. Wikipedia, if it is working as intended, should stay on top of developments. If you’re interested in the topic, I also recommend Search Engine Watch, which keeps an eye on the latest technology news about search engines.

Mining the Deep Web

Just because Google can’t do a deep dive into the vast underground stores of information doesn’t mean the riches of the deep web are off-limits to you. Quite the contrary. The immense repositories of facts, data elements, and information are eminently searchable, assuming that the sites that store information don’t restrict access. The value of the deep web is important enough that specialized search engines have cropped up to compete with the almighty Google.

IncyWincy

IncyWincy, the self-described “Invisible Web Search Engine,” provides a unique window into the hidden world of online databases. IncyWincy focuses its search power on locating websites that are equipped with queryable databases. The IncyWincy “Forms” search provides you with an easy way to locate websites that contain one or more search forms, which usually indicates that a database cannot be far away. For example, searching for “coal” on the IncyWincy “Forms” tab will return a list of web pages containing a search form. In this example, you will be able to dig through the databases of the World Coal Institute, the Coal Utilization Research Council, the American Coalition for Clean Coal Electricity, and more, from a single web page.
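The idea that a search form signals a nearby database can be sketched in a few lines. What follows is only a rough heuristic of my own, not a description of how IncyWincy works internally: the `FormFinder` class, the `has_search_form` helper, and both sample pages are hypothetical. It uses Python's standard `html.parser` module to check whether a page contains a form with a text or search box.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Count <form> elements that contain a text or search input box."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_query_box = False
        self.search_forms = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
            self.has_query_box = False
        elif tag == "input" and self.in_form:
            # An <input> with no type attribute defaults to a text box.
            input_type = (dict(attrs).get("type") or "text").lower()
            if input_type in ("text", "search"):
                self.has_query_box = True

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            if self.has_query_box:
                self.search_forms += 1
            self.in_form = False


def has_search_form(html):
    """Return True if the page appears to contain at least one search form."""
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return finder.search_forms > 0


# Hypothetical sample pages for illustration.
page_with_form = (
    '<html><body><form action="/find">'
    '<input type="text" name="q"><input type="submit">'
    "</form></body></html>"
)
plain_page = "<html><body><p>Coal is a sedimentary rock.</p></body></html>"

print(has_search_form(page_with_form))
print(has_search_form(plain_page))
```

A check like this only flags pages worth a closer look. Deciding whether the form really leads to a useful database is still a job for the researcher.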

INFOMINE

INFOMINE is an excellent resource for scholarly materials, fashioned by academic librarians from the University of California. According to INFOMINE, the search engine digs through “useful Internet resources such as databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.”

DeepPeep

DeepPeep is figuring out how to use a single search page to search multiple databases. It has identified more than 45,000 web forms to allow instant retrieval from automobile, airfare, hotel, and job sites. These are all the type of web pages that require a user to know some search parameters, like dates, prices, and destinations, to produce meaningful results. The answers DeepPeep dredges up are interesting, but the project hasn’t quite cracked the deep web nut just yet.

If you’re interested in learning more about the deep web, try Bright Planet, a company that has pioneered identification of the features of the deep web and is devising a means to exploit it. Although Google cannot search the deep web, its “Advanced Search” can be an important tool for finding the databases that will lead you there, as you’ll see in the next chapter.

SITES AND SOURCES MENTIONED IN THIS CHAPTER

Amtrak

www.amtrak.com

Bright Planet

http://brightplanet.com

DeepPeep

www.deeppeep.org

eBay.com

www.ebay.com

Google

www.google.com

IncyWincy

www.incywincy.com

INFOMINE

http://infomine.ucr.edu

“Information Behaviour of the Researcher of the Future”

www.jisc.ac.uk/media/documents/programmes/reppres/ggworkpackageii.pdf

StubHub!

www.stubhub.com

“Teaching Civil War History”

http://opinionator.blogs.nytimes.com/2011/01/21/teaching-civil-war-history-2-0

TicketMaster

www.ticketmaster.com

U.S. Government Printing Office

www.gpoaccess.gov

Wikipedia

www.wikipedia.com