Needle in the haystack

Our topic for today, dear reader, is a little thought experiment: what if everyone had their own personalized (even local) search engine, instead of having to use central providers (DuckDuckGo, Google, Bing, and others)?

Obviously, from a privacy perspective, it would be a huge step forward if giant corporations couldn't collect who knows what kind of data about everyone in the world, and then do who knows what with it. But is it technologically feasible?

The search engine

Before we get into details, let's take a broad look at how a search engine might work.

Download

There are programs called crawlers that visit a page, download its HTML source, extract links from it, visit those pages as well, extract links from them, and so on. Well-behaved crawlers respect robots.txt, and site owners can make their job easier by providing a sitemap.xml.
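To make this a bit more concrete, here is a minimal sketch of such a crawler in Python. It's just a sketch: it assumes the third-party requests and beautifulsoup4 packages, the user-agent name is made up, and a real crawler would also need rate limiting, politeness delays, and much better error handling.

    # A minimal, polite crawler sketch (assumes the third-party
    # "requests" and "beautifulsoup4" packages are installed).
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    USER_AGENT = "my-personal-crawler/0.1"  # hypothetical name

    def allowed_by_robots(url):
        """Check the target site's robots.txt before fetching."""
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(urljoin(url, "/robots.txt"))
        try:
            parser.read()
        except Exception:
            return True  # no readable robots.txt: assume allowed
        return parser.can_fetch(USER_AGENT, url)

    def crawl(start_url, max_pages=100):
        """Visit pages breadth-first, yielding (url, html) pairs."""
        queue, seen, visited = [start_url], {start_url}, 0
        while queue and visited < max_pages:
            url = queue.pop(0)
            if not allowed_by_robots(url):
                continue
            try:
                response = requests.get(url, timeout=10,
                                        headers={"User-Agent": USER_AGENT})
            except requests.RequestException:
                continue
            visited += 1
            yield url, response.text
            # Extract links and put unseen ones on the queue.
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if urlparse(link).scheme in ("http", "https") and link not in seen:
                    seen.add(link)
                    queue.append(link)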

Processing

From this huge pile of HTML source files, we need to extract the data. Most likely we will also need some context about where the text was found (e.g. in the page header or footer), which can later be used to rank our search results. Additional metadata can be extracted if the website uses Open Graph, Microformats, or Schema.org markup.
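As a rough sketch of that processing step (again assuming beautifulsoup4), we could collect the visible text blocks together with a crude "where did this come from" label, and pick up Open Graph metadata when it is present:

    # Extraction sketch: pull visible text with a rough notion of where it
    # came from, plus Open Graph metadata if the site provides it.
    from bs4 import BeautifulSoup

    def extract(html):
        soup = BeautifulSoup(html, "html.parser")

        # Open Graph metadata (og:title, og:description, ...), if present.
        metadata = {
            tag["property"]: tag.get("content", "")
            for tag in soup.find_all("meta")
            if tag.get("property", "").startswith("og:")
        }

        # Text in <header>/<footer>/<nav> is usually less interesting for
        # ranking than text in the main body, so label it accordingly.
        blocks = []
        for element in soup.find_all(["h1", "h2", "h3", "p", "li"]):
            context = ("boilerplate"
                       if element.find_parent(["header", "footer", "nav"])
                       else "body")
            text = element.get_text(" ", strip=True)
            if text:
                blocks.append({"context": context, "text": text})

        return {"metadata": metadata, "blocks": blocks}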

Storage

We have a couple of options here, depending on how much data we are willing to store. We will definitely need an index that tells us which pages contain a given word or phrase. If we also want to display the context in which we found that phrase on the results page, we need to store the data extracted in the processing step as well. And if we want to display the web page from which the index was generated, we will need to store the full HTML page, too.
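The index itself can be surprisingly simple. A toy version, just to illustrate the idea: an in-memory mapping from each term to the set of pages that contain it, plus the extracted blocks so we can show context later. A real index would of course live on disk and store positions, weights, and so on.

    # A toy in-memory inverted index: term -> set of page ids, plus the
    # stored blocks needed to show context on a results page.
    import re
    from collections import defaultdict

    class Index:
        def __init__(self):
            self.postings = defaultdict(set)  # term -> {page_id, ...}
            self.snippets = {}                # page_id -> extracted blocks
            self.urls = {}                    # page_id -> url

        def add_page(self, page_id, url, blocks):
            self.urls[page_id] = url
            self.snippets[page_id] = blocks
            for block in blocks:
                for term in re.findall(r"\w+", block["text"].lower()):
                    self.postings[term].add(page_id)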

Search

A (web) application that converts a search term entered by a user into a database query and displays the results.
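Building on the toy Index sketch above, the search step is little more than splitting the query into terms and intersecting their posting lists:

    # Search sketch: split the query into terms, intersect their posting
    # lists, and return matching URLs with a snippet for context.
    import re

    def search(index, query):
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return []
        # Pages that contain *all* query terms.
        result_ids = set.intersection(
            *(index.postings.get(t, set()) for t in terms))
        results = []
        for page_id in result_ids:
            blocks = index.snippets[page_id]
            snippet = next((b["text"] for b in blocks
                            if terms[0] in b["text"].lower()), "")
            results.append((index.urls[page_id], snippet))
        return results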

Download the Internet

So our first problem is the crawler. According to the Internet, there are roughly 400 million active websites today. Even if each of them has only 10 pages (probably a huge underestimation), we are talking about 4 billion pages we need to visit. If we can download every page in 100ms and extract links from it (also a highly optimistic estimate), it would take a crawler more than 12 years to visit everything. A thousand parallel crawlers could finish in 4-5 days... but it sure would be exciting to see billions of people sending thousands of crawlers to the Internet to build their own index.
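A quick back-of-the-envelope check of those numbers:

    # Back-of-the-envelope check of the estimates above.
    pages = 400_000_000 * 10      # ~4 billion pages
    seconds_per_page = 0.1        # the optimistic 100 ms estimate

    one_crawler = pages * seconds_per_page / (3600 * 24 * 365)
    print(f"single crawler: {one_crawler:.1f} years")       # ~12.7 years

    thousand_crawlers = pages * seconds_per_page / 1000 / (3600 * 24)
    print(f"1000 crawlers:  {thousand_crawlers:.1f} days")  # ~4.6 days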

And that was a very optimistic estimate. How optimistic?

Just think about the fact that nowadays our f...antastic developers like to build websites that simply don't work without downloading and running (multiple megabytes of) JavaScript. So we might need a headless browser to render the final HTML, which certainly won't finish in 100ms. That alone would make things at least one, but more likely two, orders of magnitude slower.

Processing at this scale would also probably be too time-consuming and resource-intensive. And that's even if every page were hand-crafted, minimalistic, syntactically and semantically correct HTML... which is obviously far from reality. Then there are the SEO tricks, like text that is invisible to the user but visible to the crawler, and similar naughty things. We would have to filter those out as well.

Storage has similar problems. Google claims that its index is over 100,000,000 gigabytes in size. Even if that is mostly images and videos, it's way too much to store comfortably on a desktop computer. So it seems that three of the four parts (download, processing, and storage) have problems. We are off to a bad start.

Alternative solutions

The overload caused by the crawlers could be solved by allowing crawlers to talk to each other about who has been where and exchange information. Although I don't know how we could do this safely so that a rogue crawler can't poison others with false information. And this doesn't help with the amount of data either.

But do we really need the whole Internet? Chances are we are only interested in content in our own language and maybe one or two others, and even then we wouldn't need all of that data. If we could somehow pick out the one percent of the Internet we are actually interested in, then maybe we could make our own search engine work. We could feed our personal search engine the pages we consider important enough to crawl, then follow the external links on those sites, and so on. In the end, we would have a manageable amount of HTML files that could probably be stored on our own computer.
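A sketch of that selection policy: start from a hand-picked seed list and follow links only up to a small depth and page budget. The fetch function is passed in by the caller and is just a stand-in for the crawler sketch earlier; the budgets are arbitrary.

    # Selection-policy sketch: hand-picked seeds, a depth limit, and a
    # page budget keep the crawl (and the index) manageable.
    def crawl_selection(seeds, fetch, max_depth=2, max_pages=50_000):
        """fetch(url) is expected to return (html, links_on_page)."""
        frontier = [(url, 0) for url in seeds]
        seen = set(seeds)
        collected = {}
        while frontier and len(collected) < max_pages:
            url, depth = frontier.pop(0)
            html, links = fetch(url)
            collected[url] = html
            if depth < max_depth:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append((link, depth + 1))
        return collected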

Ultimately, however, it doesn't seem economical (or even possible) for everyone to run their own crawlers and produce their own index, but that doesn't necessarily mean that everyone can't have their own copy of an index. There could be, say, some open index format or database structure, and anyone could publish their own indices.
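What such an open index format would look like is anyone's guess; just to make the idea concrete, here is one entirely hypothetical option: a single SQLite file with a pages table and a postings table, which anyone could publish and anyone else could import.

    # One hypothetical "open index format": a single SQLite file that
    # index providers publish and users import into their search engine.
    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS pages (
        id      INTEGER PRIMARY KEY,
        url     TEXT UNIQUE NOT NULL,
        title   TEXT,
        fetched TEXT                      -- ISO-8601 timestamp
    );
    CREATE TABLE IF NOT EXISTS postings (
        term    TEXT NOT NULL,
        page_id INTEGER NOT NULL REFERENCES pages(id),
        context TEXT                      -- e.g. 'body' or 'boilerplate'
    );
    CREATE INDEX IF NOT EXISTS postings_term ON postings(term);
    """

    def create_index_file(path):
        conn = sqlite3.connect(path)
        conn.executescript(SCHEMA)
        conn.commit()
        return conn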

The possibilities are endless, but let's take a look at some ideas for inspiration:

  • thematic indices, like an index for programmers, with documentation, StackOverflow, and more
  • big sites could publish an index of their own content (no crawling needed, but in return you have to trust that the index matches the site's real content)
  • location-based indices, when you need to find all the ice cream shops in Prague
  • companies that produce paid indices
  • libraries and public organizations that would index content in their own language
  • indices of non-profit organizations, such as archive.org, which already has such data anyway
  • frequently updated news-like indices
  • infrequently updated encyclopedia-like indices
  • the index of your neighbor Joe, which is created from his favorite websites

Users could load the indices of their choice into their personal search engine, deleting parts that are not relevant to them to save space or get better results. During the search, they could choose which indices to search in.
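Choosing which indices to search in could then be as simple as pointing the query at the selected files. Sticking with the hypothetical SQLite layout sketched above:

    # Search across only the index files the user has chosen to load
    # (uses the hypothetical SQLite layout sketched earlier).
    import sqlite3

    def search_indices(index_paths, term):
        results = []
        for path in index_paths:          # only the indices the user picked
            conn = sqlite3.connect(path)
            rows = conn.execute(
                "SELECT pages.url FROM postings "
                "JOIN pages ON pages.id = postings.page_id "
                "WHERE postings.term = ?", (term.lower(),))
            results.extend((path, url) for (url,) in rows)
            conn.close()
        return results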

From here on, the choice of index providers would determine the quality of the results. I suppose that, over time, the good providers would rise to the top, and know-how about index customization would accumulate. Whenever the quality of an index deteriorated, or it wasn't fresh enough, one would have the option to look for a new provider. And the more tech-savvy would still have the option to start their own crawler and build their own index (which they could then sell to others).

Not much has been said about the search interface itself, but that part seems pretty straightforward. Since the index/database has an open format, anyone could build software on it. There would probably be some great open-source alternatives, either as a desktop application or as a web application that could be self-hosted on a server. And there would be plug-ins for these applications that could add calculators, currency converters, search history, and who knows what else to the basic functionality.
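A plug-in mechanism could be as simple as letting each plug-in try to answer the query before the index is consulted. A sketch with a tiny calculator plug-in (the interface is made up, of course):

    # Plug-in sketch: each plug-in gets a chance to answer the query
    # before the index is consulted; a small calculator as an example.
    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calculator_plugin(query):
        """Answer plain arithmetic queries like '2 * (3 + 4)', else None."""
        try:
            tree = ast.parse(query, mode="eval")
        except SyntaxError:
            return None

        def evaluate(node):
            if isinstance(node, ast.Expression):
                return evaluate(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
            raise ValueError("not a plain arithmetic expression")

        try:
            return evaluate(tree)
        except (ValueError, ZeroDivisionError):
            return None

    def answer(query, plugins, fallback_search):
        for plugin in plugins:
            result = plugin(query)
            if result is not None:
                return result
        return fallback_search(query)  # no plug-in matched: ask the index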

Summary

I have a few more little ideas here and there, but I didn't want to ramble too much. Let's get back to the original question. Is it just a dream to have your own search engine? If you want to search the whole Internet: yes. But you don't necessarily need the whole Internet to be happy (or to have a search engine that works well). With the right index providers and index sizes that are acceptable to the end user, I think it could work.

This post is also available in Hungarian: Tű a szénakazalban
