October 25, 2021

I, Science

The science magazine of Imperial College

Most of the Internet is invisible to a Google search. Rosamund Pearce casts light on the Deep Web

Underneath the surface hubbub of tweets, trolls and the profound musings of Justin Bieber, there lies a dark and mysterious world called the Deep Web. Invisible to most search engines, it is home to a diverse range of characters, from journalists to cybercriminals.

As we come to realise just how flimsy Internet privacy is, many are embracing the anonymity offered by Deep Web browsers. What was previously thought of as an accumulation of the Internet’s past may in fact represent a look to the future. Now, with new data mining techniques, science is being used to explore this vast internet underworld.

The Internet can be compared to an iceberg. While the cliffs above the water may seem pretty big, far greater is the invisible mass below the surface. Only 0.03% of the web can be seen by search engines. The remaining 99.97% is a vast repository of derelict or hidden content that general-purpose web crawlers cannot reach. It isn’t a particular destination – it exists in various interstices of the virtual world.

Some parts of the Deep Web cannot be accessed by traditional web browsers at all. These secretive file-sharing networks are known as ‘darknets’, and require specially downloaded software such as The Onion Router (Tor). Tor encrypts and anonymises traffic, masking the identity of its users.

This anonymity allows pockets of counterculture to proliferate and enables freedom of information in countries with censorious regimes, but it also attracts the murkier characters of the Deep Web. With the cryptocurrency Bitcoin, users can buy almost anything, from rhino horns to olive oil. And despite much fabrication and hysteria surrounding darknets, there is certainly some genuinely disturbing content, such as child pornography and adverts for contract killers.

Anonymity cuts both ways – it is impossible to tell the casual user from law enforcement. It was this that led to the downfall of the infamous black market Silk Road, which was shut down by the FBI in October 2013. Some darker activities are also curbed by members of the community itself, such as the hacking collective Anonymous.

But despite its mysterious name, much of the data in the Deep Web is unremarkable research data or defunct pages such as old flight bookings. All manner of everyday web companies, like Amazon, Twitter or eBay, have Deep Web content. Although mundane, such information is also potentially valuable.

Information is arguably the most coveted commodity of our new Information Age, and so the value of Deep Web data is immeasurable. New approaches are being pioneered to open up these vast data mines. Like deep-sea adventurers, companies like BrightPlanet are exploring the nature of information on the web. Science, or natural philosophy as it was, originally focused on the study of the natural world. Now, thanks to the incredible pace of advancement, we are exploring a second-level reality that we have ourselves created.

Traditional search engines create their search results by crawling surface web pages, following one hypertext link to another. Like ripples travelling across a pond, crawlers are able to obtain pages further and further from their starting point. To be discovered, a web page must be static and linked to other pages. Deep Web content, by contrast, is stored in searchable databases that only produce results in response to a direct search. Crawlers cannot ‘see’ this content because the pages do not exist until they are created as the result of a specific search.
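The ripple-like crawl described above can be sketched as a breadth-first walk over links. This is a minimal illustration only, not any search engine's actual code: the page names, the `LINKS` table and the `DYNAMIC_RESULTS` set are hypothetical stand-ins for a tiny web held in memory.

```python
from collections import deque

# Hypothetical miniature web: each static page maps to the pages it links to.
LINKS = {
    "home": ["news", "shop"],
    "news": ["archive"],
    "shop": ["search-form"],
    "archive": [],
    "search-form": [],  # the form itself is static, but its results are not
}

# Hypothetical pages that exist only as responses to a query; nothing links to them.
DYNAMIC_RESULTS = {"search-form?q=flights", "search-form?q=patents"}


def crawl(start):
    """Breadth-first crawl: visit pages in order of link distance from start."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


reached = crawl("home")
print("Crawled:", reached)
print("Never reached:", sorted(DYNAMIC_RESULTS - set(reached)))
```

The crawler finds the search form because a static page links to it, but it never submits a query, so the result pages behind the form stay out of its index.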

Scientists at the University of Utah have been developing ‘DeepPeep’ – a specialist search engine that trawls these dynamic databases. The challenge is to obtain the information automatically – it is obviously impractical to ask each website individually for its contents. To do this, DeepPeep uses a technique called ‘iterative probing’. First, it analyses the database’s form for clues. For instance, the words “assignee” or “invention” are likely to indicate a patent database. DeepPeep then uses these clues to fill in the forms, extracts new keywords from the results, and repeats the process.
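The probing loop can be sketched in a few lines. This is a toy under stated assumptions, not DeepPeep's implementation: `RECORDS`, `submit_form` and `iterative_probe` are hypothetical names, the hidden database is a short in-memory list, and the seed keyword stands in for a clue read from the form.

```python
# Hypothetical hidden database: records come back only in response to a keyword query.
RECORDS = [
    "patent assignee acme invention engine",
    "patent invention turbine blade",
    "turbine cooling rotor",
    "rotor bearing lubricant",
    "unrelated gardening tips",
]


def submit_form(keyword):
    """Stand-in for filling in a search form: return records containing the keyword."""
    return [record for record in RECORDS if keyword in record.split()]


def iterative_probe(seed_keywords, max_rounds=10):
    """Query with known keywords, harvest new keywords from the results, repeat."""
    tried = set()
    found = set()
    pending = list(seed_keywords)
    for _ in range(max_rounds):
        if not pending:
            break
        next_pending = []
        for keyword in pending:
            if keyword in tried:
                continue
            tried.add(keyword)
            for record in submit_form(keyword):
                found.add(record)
                # Every unseen word in a result becomes a candidate query.
                for word in record.split():
                    if word not in tried and word not in next_pending:
                        next_pending.append(word)
        pending = next_pending
    return found


found = iterative_probe(["invention"])
for record in sorted(found):
    print(record)
```

Starting from the single clue “invention”, each round of results supplies fresh keywords that unlock further records. The gardening record is never retrieved because no harvested keyword leads to it, which shows why probing of this kind gives only partial coverage of a database.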

Search providers such as Google and Kosmix are also getting involved, using new, ‘directed query’ search engines. Others are concentrating on more specific areas. Since most personal information can’t be found on the surface web, the query engine ‘Pipl’ has been designed to retrieve information about people from the Deep Web.

Indexing the whole web is not yet feasible, mostly because of the sheer scale of the data. The surface web alone grows at a rate of around 7.5 million documents per day, with growth now exceeding the crawling ability of search engines. It is also difficult because some sites block crawlers to protect commercial or criminal interests.

Despite often being thought of as a paragon of openness and transparency, the Internet has in fact been driven by a desire for secrecy from its genesis. The ‘ARPANET’, a 1969 precursor of the Internet, was part of a US defence project. Tor, too, is based on military encryption technology co-opted by cypherpunks.

Nowadays we are so used to its omnipresence that it is easy to forget just how much Internet technology has advanced. Remember when finding a website was like sending a letter? You needed to know the address, and in many ways browsing the Deep Web is comparable to browsing the Internet before the invention of search engines. Finding sites is a serendipitous activity, and pages have a ‘dial-up design’ aesthetic.

There is no doubt that the Internet is a wonder of the modern world – indeed, it is in fact largely responsible for the modern world. Its stratospheric expansion into our lives has revolutionised culture and commerce. It has opened up the global marketplace to everyone’s backroom and helped to orchestrate the Arab Spring revolutions. But people are also increasingly worried about the power of this super-science.

Recently featured in House of Cards, the Deep Web seems to be having something of a cultural moment. As we put more and more of our lives online, the flimsy nature of digital privacy becomes an ever-greater concern. Indeed, The Guardian’s National Security Agency revelations of 2013 showed just how all-encompassing state surveillance can be. More and more people are embracing the anonymity offered by parts of the Deep Web. It may sound inaccessible, but the reality is that it’s a few clicks away – all that’s needed is a download and a degree of technical know-how.

In 1962 computer scientist J.C.R. Licklider envisioned a ‘Man-Computer symbiosis’ where everyone on the globe would be connected in an ‘Intergalactic Network’. Fifty years later we are more connected than Licklider would ever have dreamed: even our household appliances are now being given Internet capability. But can science ever catch up with this vast data mine that it has created? How far will this super-science infiltrate our lives? Will more of us resort to encrypted browsers? The place of the Deep Web in the future of the Internet remains to be seen.

Images from Flickr under Creative Commons license (Top: Deep Blue, Olly Coffey; Bottom image: InternetSplatMap, Steve Jurvetson)