Paul Buchheit: April 2008

Long, long ago, before Google, search engines evaluated and ranked web pages by considering each page in isolation, examining the size of the fonts, the contents of the meta tags, etc. In some cases, it was even possible to "hijack" another site's listings by simply cloning their HTML. Perhaps a few search engines attempted to improve on this with simple tactics such as counting the number of links to a page, but that was generally useless since it's so easy to create "fake" links in order to boost your count.

With PageRank, Google took a very different approach. Instead of considering each page in isolation, they examined the link structure of the entire web and computed a global evaluation of that structure. In other words, they began looking at the entire forest instead of just the individual trees. Google did other things too -- PageRank is just one of many factors, but this general approach of evaluating information in a global context is fundamental to many of the algorithms. These algorithms made it easier for Google to spot which web sites were actually important, and which were just pretenders. Of course Google isn't perfect, and people can still manipulate rankings to some extent, but it was substantially better than the old way, and good enough to form the foundation of what is now a $174 billion company.
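To make the forest-versus-trees idea concrete, here is a minimal sketch that compares naive inbound-link counting with the simplified textbook form of PageRank (power iteration with a damping factor). It is an illustration only, not Google's production ranking: the tiny link graph, the page names, the damping value of 0.85, and the iteration count are all assumptions chosen for the example.

```python
def inbound_link_counts(links):
    """Count inbound links per page: the 'each page in isolation' approach."""
    counts = {}
    for page, targets in links.items():
        counts.setdefault(page, 0)
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical web: three well-connected hubs endorse "real", while four
# throwaway pages that nobody links to all point at "fake".
web = {
    "hub_a": ["hub_b", "hub_c", "real"],
    "hub_b": ["hub_a", "hub_c", "real"],
    "hub_c": ["hub_a", "hub_b", "real"],
    "real": ["hub_a", "hub_b", "hub_c"],
    "spam_1": ["fake"],
    "spam_2": ["fake"],
    "spam_3": ["fake"],
    "spam_4": ["fake"],
}

counts = inbound_link_counts(web)
ranks = pagerank(web)
print("inbound links:", counts["real"], "real vs", counts["fake"], "fake")
print("pagerank: real %.3f vs fake %.3f" % (ranks["real"], ranks["fake"]))
```

In this toy graph the fake page wins on raw link count, but the global computation ranks the real page higher, because links from pages that nothing else links to carry very little weight.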

Last week I wrote about Facebook gathering similar information about people. By collecting information about people and the links between them, they can start to get a global view of the human "forest". Unfortunately, based on many of the responses, that post wasn't very well written. A lot of people focused on how annoying Facebook applications are (true), how search results limited to your friends would be useless (also true), or other things completely unrelated to my point. A few people mentioned that Facebook hasn't done anything useful with this data, which is actually a good point, but I think that has more to do with Facebook and the newness of the data than it does with the value of the data. After all, the web was around for many years before Google came along and started profitably mining the link structure.

Will Facebook ever do anything useful with the human link data? I have no idea, and it's not particularly important to me. However, I'm confident that SOMEONE will begin mining this data, and that it could ultimately be more valuable than the link data from the web. Facebook is a convenient example because they happen to have a head start on collecting the data, but others might be the first to actually profit from it. Google, in particular, is much better at data mining and also has quite a bit of human link data (from Gmail and Orkut). Microsoft+Yahoo will also have a nice data set, though I doubt that they will know what to do with it. Of course none of this data is perfectly clean and noise-free, but real data never is -- the web certainly isn't.

It's very fashionable to declare that Facebook is an over-hyped fad and will never make any real money, certainly not enough to justify its insane $15 billion valuation. At first glance, it's easy to understand why some people might think it's a toy -- most of the activity there seems to involve biting, poking, and joining groups with funny names.

However, I think that assessment misses out on something very interesting: Facebook is capturing everyone's identity and relationships. Of course there's some noise caused by random friending, but by examining the larger graph as well as other details such as location, affiliations, interactions, and of course explicitly entered relationship details ("how do you know Paul?"), they can get a pretty good idea of which people are actual friends and acquaintances.
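One simple way to see how the larger graph can separate real friends from random friending is to measure how much two people's friend lists overlap. The sketch below uses Jaccard overlap of friend sets as a stand-in signal. Both the measure and the sample data are assumptions made for illustration, not a description of anything Facebook actually computes.

```python
def tie_strength(friends, a, b):
    """Jaccard overlap of two people's friend sets, ignoring a and b themselves."""
    friends_a = friends.get(a, set()) - {b}
    friends_b = friends.get(b, set()) - {a}
    union = friends_a | friends_b
    if not union:
        return 0.0
    return len(friends_a & friends_b) / len(union)


# Hypothetical friend graph: paul, ann, bob and cat all know each other,
# while "rando" friended paul but shares no acquaintances with him.
friends = {
    "paul": {"ann", "bob", "cat", "rando"},
    "ann": {"paul", "bob", "cat"},
    "bob": {"paul", "ann", "cat"},
    "cat": {"paul", "ann", "bob"},
    "rando": {"paul", "x1", "x2", "x3"},
    "x1": {"rando"},
    "x2": {"rando"},
    "x3": {"rando"},
}

strong = tie_strength(friends, "paul", "ann")
weak = tie_strength(friends, "paul", "rando")
print("paul-ann: %.2f, paul-rando: %.2f" % (strong, weak))
```

A real system would presumably combine a signal like this with the other details mentioned above, such as location, affiliations, and interactions.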

The lack of reliable identity information has always been an issue on the web. It's the reason why we don't have a useful directory of email addresses -- everyone in the directory would get bombarded by spam or other unwanted messages, and even if it did exist, how would you know which of the thousands of Adam Smiths is the one that you are looking for? Facebook has already solved this problem for a large fraction of people. It's easy to search for a name and then pick out the right person based on their picture, location, or friends. I get a lot of messages on Facebook, but unlike email, I have yet to receive any spam. That's pretty remarkable.

Perhaps a people directory doesn't seem terribly valuable, but if you can't imagine how to make money from knowing everyone's identity and trust networks, then you aren't being very imaginative. Spam and fraud are two of the biggest problems on the internet, and they are very difficult to stop because it's so easy to create new identities, and we have no good way of differentiating between real identities and fake ones. Even in "real" life, people are able to skip from town to town, defrauding people again and again because to the people in the new town, they have a new and unknown identity.

One of the best examples of this problem on the internet is eBay. If you try to buy or sell something on eBay (especially computers or electronics, apparently), there is a very good chance that someone will try to rip you off -- just search Google for ebay scammers and you will find pages such as "How scammers run rings round eBay" and "eBay Forums: Today's Scams In Progress". eBay has had a relatively solid lock on the auction market due to network effects, but with billions of dollars in profits, a $42 billion market cap, and 10 years of not innovating, I'm willing to bet that won't last. With reliable identity information, most of these fraud schemes would become impractical, which would obviously be a real advantage for an eBay competitor.

What else is highly profitable on the internet? Search. I doubt that anyone will ever beat Google at Google-style search, certainly not Microsoft or Yahoo, even if they do tie their horses together. The only way anyone will create something significantly better than today's Google is if they add a new and important ingredient to the mix. Many people have suggested that demographic information, or perhaps knowing what your friends have searched for, will help, but I doubt it. What could work is actual, direct, human involvement by the users. In fact, it's already helping in a very limited form -- Wikipedia pages are written and edited by random people on the internet and they frequently occupy the top spots on Google (and I always click on them). Of course the problem with letting random users edit or reorder the search results is that you will quickly be overwhelmed by spam and fraud. But what if you knew who the users were and which ones you could trust?
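As a rough sketch of that last question, suppose each user's vote on a search result is weighted by how much that user is trusted, and unknown identities get a weight of zero. Everything here is hypothetical: the function name, the trust scores, the URLs, and the scoring rule are assumptions for illustration, not a description of any real search engine.

```python
def rerank(results, votes, trust):
    """Reorder results by trust-weighted votes.

    results: URLs in their original order
    votes:   (user, url, vote) tuples, where vote is +1 or -1
    trust:   user -> weight between 0 and 1; unknown users count as 0
    """
    score = {url: 0.0 for url in results}
    for user, url, vote in votes:
        if url in score:
            score[url] += trust.get(user, 0.0) * vote
    # sorted() is stable, so results with equal scores keep their original order.
    return sorted(results, key=lambda url: -score[url])


results = ["spam.example", "wiki.example", "blog.example"]
trust = {"alice": 0.9, "bob": 0.8}  # known, trusted identities
votes = [("sock_%d" % i, "spam.example", +1) for i in range(5)]  # throwaway accounts
votes += [("alice", "wiki.example", +1), ("bob", "spam.example", -1)]

reordered = rerank(results, votes, trust)
print(reordered)
```

Here five upvotes from throwaway accounts count for nothing, while two votes from trusted users are enough to push the spam result to the bottom.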

Those are just the first few things that come to mind -- the uses of identity information are endless. Of course there's no guarantee that Facebook will actually realize any of this potential -- there were many search engines before Google, and they all fumbled the opportunity they had, but it's important to at least understand the potential for big things.

Update: This post was supposed to be about data more so than Facebook (Facebook just happens to have the data). See this post for a (hopefully) better explanation.

Wednesday, April 23, 2008

The power of links and the value of global knowledge

Thursday, April 17, 2008

Facebook knows who you are, and that's worth more than you think