In my last post, I talked about the DBpedia project and a plethora of other services that export RDF data. There is no doubt that an open web of data is the future. The challenge, however, lies in supporting massive-scale data and providing an infrastructure for semantic mining. Let me explain why I feel this is going to be a key challenge:
- Smart indexing: every single Wikipedia entry contains an enormous amount of information that can be linked to other sources on the web. For instance, for an entry on Berlin, a service that truly taps this potential would be able to instantly look up highly contextual information on ‘Berlin’ from across the web (with the understanding that the word refers to the city, and nothing else). This is possible today, but only in a very limited fashion. This is exactly where the ‘semantic search’ battle is unfolding at the moment.
- Lightweight data wrappers: spread across the web of data would be a web of data wrappers: algorithms that understand and intelligently mine the relevant data. Interestingly, in my preliminary experiments, Perl-based tools stand out: with the vast CPAN repository and very fast regular expressions, Perl provides exactly the framework one needs to rapidly prototype a text-mining agent.
- Sentence parsing: Why is it important? The power to automatically parse blobs of English text and generate semantic-web URIs is going to be sweet. But then, let’s put it in writing once and for all: natural language parsing is and has always been hard. The emergence of localized versions of the English language isn’t helping either. However, for sources of information that are well-formatted, the challenge is slightly easier. Take a look at the MontyLingua project, along with the other heavyweights in this field such as WordNet. To quote: MontyLingua is ‘a Free, Commonsense-Enriched Natural Language Understander for English’. It actually works very well in most circumstances. In fact, I am planning to write a full review of the technology right after this post.
- The DB and the efficiency question: SPARQL, one of the most popular query languages for the semantic web, looks very close to SQL in its structure. Yet the semantic web is all about triples: subject, predicate, object. Relational databases are built around tables of rows and columns rather than graphs of triples, so triples typically end up in one long three-column table, and every extra hop in a query becomes another join of that table against itself. This leads to inherent inefficiencies (read: joins, or other hacks using caches) which could have been avoided if the DB world had evolved slightly differently (interestingly, Google App Engine’s datastore is very powerful here; try it out if you don’t believe me).
- Revenues, revenues, revenues: of course, the big daddy of all… duh?! Why would someone pay to have a shitload of relevant information? Think! What if you could offer users products that are exactly what they have been looking for? What if you could understand what they have been searching for all along? From a business standpoint, the Semantic Web is not just technological gibberish; it has some tremendous revenue potential built into it. Forget Google AdWords and Google AdSense, they just spam you on the right side of the screen. You can do much, much better.
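To make the DB and efficiency point a bit more concrete, here is a minimal sketch of what storing triples in a relational table tends to look like. It uses Python's built-in sqlite3 module; the table name, the predicate names (capitalOf, currency, type) and the handful of facts are all made up for illustration and do not come from DBpedia or any real vocabulary. The question "which currency is used in the country Berlin is the capital of?" is two triple patterns sharing a variable in SPARQL, but in SQL it becomes a self-join on the same table.

```python
import sqlite3

# Hypothetical toy data: one three-column table holding every triple.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Berlin", "type", "City"),
        ("Berlin", "capitalOf", "Germany"),
        ("Germany", "type", "Country"),
        ("Germany", "currency", "Euro"),
    ],
)

# Roughly the SPARQL shape (prefixes are placeholders):
#   SELECT ?currency WHERE {
#     :Berlin :capitalOf ?country .
#     ?country :currency ?currency .
#   }
# In SQL, each triple pattern needs its own copy of the table,
# and the shared variable ?country turns into a join condition.
query = """
SELECT t2.object
FROM triples AS t1
JOIN triples AS t2 ON t2.subject = t1.object
WHERE t1.subject = 'Berlin'
  AND t1.predicate = 'capitalOf'
  AND t2.predicate = 'currency'
"""
currencies = [row[0] for row in conn.execute(query)]
print(currencies)  # ['Euro']
```

Two hops cost one self-join here; a query with five or six patterns would need that many copies of the same table, which is where the caching hacks and other workarounds start to creep in.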
I wish to list out a few more, and then write a very structured post on this. However, that would have to wait a bit.

