mysql, ror, future web, optimization, scalability, cloud computing, web 3.0

Ideas on how not to get lost in a forest of ideas!

One of the most common problems that startups face is the inability to nail down a core focus or a core offering that has the potential to attract users. Every startup founder has experienced this at some level:

You walk into a brainjam session with your founding team thinking that you have everything crystal clear in your head, and you walk out completely fuzzy and out of focus. Worse still, every brainjam session seems to bring in angles you didn’t think of… and now you are faced with the challenge of figuring out what exactly it is that you are doing.

What’s at play here? Is this desirable? How do you balance time spent defining the full scope of the product against coding the initial working prototype that gives the much-needed comfort?

At e2enetworks, we face this very challenge on a regular basis: whenever we have an idea or product that we need to architect and scope out, we have to strike a fine balance between product scope and delivery.

To tackle the problem, we have laid down a few ground rules:

1. Attack the problem at the core, and jot down the rest as points to be tackled at some point in the future. Here, we spend 80% of our brain time figuring out the core value proposition and the feature set that would best represent it. The rest we just write down as points in our ‘vision todo list’.

2. Create an initial infrastructure as soon as possible, or use an existing one. A seamless workflow goes a long way towards increasing productivity. While not being dogmatic about it, there are several quick things that we do:

– set up a shared space for vision, milestones, and idea jamming, for example Basecamp (from 37signals), Google Calendar/Docs, etc.

– set up version control and feature tracking software (Subversion, Git, Trac, Assembla)

– create repositories for ui designs, documentation, and codebase

– mailboxes, shared wiki, and ftp space

We have noticed an enormous amount of time ‘lost in translation’: without a workflow, information has to be revisited, which wastes time.

3. Build up the core objects and core db models quickly. Building an initial model of the core proposition gives a massive amount of insight into the future possibilities of the system, and into whether it’s sufficiently different from existing players in the market or other products. In this step, going from the vision all the way to a first version of the database model has been very helpful to us. Some tools in this space:

– High level diagrams that capture the workflow, the product overview / sitemap, the object relationships
– OmniGraffle, Mindmap, FreeMind, and anything else that can help visualize object relationships… even PowerPoint can help (since it’s easy to draw boxes and arrows).
– Azzurri Clay DB Modeling Plugin for Eclipse
– Model interrelationships in Rails, testing them using script/console

4. Create parallel processes for prototyping, feature jamming, and product design. While the team operates in an agile fashion, it helps when everyone has a key responsibility; it is then their personal job to make sure that it happens.

5. Create a staging environment asap — going from prototype dev to stage is key to discovering bugs / nuances in the system that would otherwise remain confined in the development (edge trunk) machines. The ideal workflow that has worked for us is — create a stage release every evening (nightly build), and let that be the system that gets tested next day while developers create the next version.

6. Play with at least one production deployment version. You discover a lot of holes in the system early on if production deployment is tested as soon as there is some sort of working prototype. While this was prohibitive in the past due to the costs involved, that’s no longer the case if one chooses cloud deployment, where you pay as you go.

– it helps in figuring out number of frontend, monitoring, database servers that might be needed
– it helps in understanding areas that might need optimization later (although we are heavily against premature optimization).
– it helps discover any quirks the deployment scenario might bring into the picture.

The hosts we find interesting: Amazon EC2/S3, Slicehost, Google App Engine, Rails Machine…

7. Testing. This is an obvious one, so I am not going to go deep into it — but the gist of it is, with RAD frameworks like Rails, testing should be rigorously built into the development process, so that at the end of first iteration of the product, there are test cases already prepared. 

Keeping the above ground rules in mind for any particular product, idea or vision has helped us minimize the time lost in random iterations that lead nowhere… and has kept us from getting lost in the forest of ideas that inevitably emerges from a core one.

However, most important of all, one should never lose sight of the whiteboard. No amount of milestones and to do lists can capture the synergy of quick discussions that are chalked out on the whiteboard, and then eventually make their way into system specs.

Pulling a Web 2.0 application into production: hosting thoughts

Faster, cheaper, better: choose any two. Hosting/datacenter planning is largely an optimization problem in which every decision involves trade-offs. Knowing your choices then becomes very important.

In planning for capacity you are limited by your slowest components. First make an informed guess as to whether the site is CPU, memory, disk IO or bandwidth bound, based on measurements in your load testing lab, which can give you hints about what your slower components might be.
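To make the "slowest component" idea concrete, here is a minimal Python sketch. The utilization figures, the function names and the assumption that load scales linearly with users are all mine, purely for illustration; real measurements would come from your own load testing lab.

```python
# Hypothetical peak utilization per resource from one load-test run,
# expressed as a fraction of that resource's capacity (made-up numbers).
measured = {"cpu": 0.35, "memory": 0.60, "disk_io": 0.85, "bandwidth": 0.20}


def find_bottleneck(utilization):
    """Return (resource, utilization) for the most heavily used resource."""
    if not utilization:
        raise ValueError("no measurements supplied")
    resource = max(utilization, key=utilization.get)
    return resource, utilization[resource]


def max_users_before_saturation(current_users, peak_utilization):
    """Rough user ceiling, assuming load grows linearly with users
    (a strong assumption that real sites often violate)."""
    if peak_utilization <= 0:
        raise ValueError("peak utilization must be positive")
    return int(current_users / peak_utilization)


resource, level = find_bottleneck(measured)
print(f"{resource} looks like the bottleneck at {level:.0%} of capacity")
print("rough ceiling:", max_users_before_saturation(500, level), "users")
```

The point is only that the most saturated resource sets the ceiling; which resource that is will differ from application to application.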

In my view, a site typically needs to have:
1. Raw bit-pushing capability: how fast can you render the content to the browser? That is what your users care about at the end of the day.
a) Small static content (Flash, JavaScript, CSS, images) should be hosted as close as possible to the end users, since the request-response time for such content is nearly equal to the latency between your site and the end user. (Hint: buy the services of a CDN which has servers in India.)
b) Larger blobs of content, like progressive video downloads, can and should be hosted wherever bandwidth is cheapest. Amazon S3 is a good starting point as there is no minimum charge.
c) Ajax requests are typically designed to hide latency from the user, so ideally it shouldn’t matter where in the world your application is hosted.
d) HTML rendering: are your pages cached? How many caching servers you need can be determined by estimating the amount of data your application caches in memory for each user.
2. Number crunching/backend processing capability: your actual web application, middleware and database. This is where the real difference lies between the hardware capacity requirements of different applications. You should run benchmarks of synthesized traffic, replaying a typical user session concurrently against your load testing servers (hint: Perl WWW::Mechanize or JMeter). However, it’s impossible to figure out in advance how your end users are actually going to use the site. They might stress the 5% of the code which is not optimized for performance, bringing down your site anyway. Load testing therefore gives hints rather than guarantees, simply because it’s nearly impossible to create real-world situations in a lab (that includes abuse and creative uses of your web application). Estimate how much data processing you are doing on the stats/data collected on your site, and how you are feeding the results of that processing to your frontend application. Which parts are synchronous/real-time, which parts are near-real-time (batch processing approaching real time, hidden behind Ajax/Flash animations and the like), and which parts are truly batch oriented?

3. Setting up a new site is then more about starting with a reasonably sized capacity and being able to react to calls for more capex by monitoring the usage of bandwidth, CPU, memory and disk IO for each separated-out component of the application, by its class (bit-pushing/caching or number crunching). If you have a reasonable budget for capacity, then create an initial 4-20 servers (real, or in the cloud on Amazon EC2 or other VPS-based cloud solutions) with 2-4 instances of each component of your web application (outsource the things you wouldn’t want to worry about, like e-mail, DNS, CDN, etc.), and get a good quality hardware load balancer (or buy shared access to one). And make sure you don’t constrain your flexibility to add machines and switching capacity without requiring major physical layout changes. (Hint: buy larger switches than you need.)

4. Long-term goals for the operations of a web application are:
a) Bandwidth costs per megabit should decline as you use more and more of it, tending towards a very low (nearly zero) per-megabit cost.
b) The cost (setup + rental or amortization) of adding physical machines (of the standard chosen configuration) and switching/load balancing should increase linearly with the number of machines.
c) Geographical scale-up, by being able to replicate your first datacenter node across the globe.
d) No single points of failure: at least two geographical sites, and redundancy at each datacenter node in access links for bandwidth, load balancing, network switching, storage (multi-pathing) and your application components.

5. Start small, choose wisely, and tend towards flexibility (aim for lower capex with no lock-in, even if it means a higher opex initially), for you’ll need to live with the limitations created by your initial decisions about the production hosting environment for a long time to come, or else face a painful and costly migration to another production environment.
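The "replay a typical user session concurrently" idea from point 2 can be sketched roughly as follows. This is a Python stand-in, not WWW::Mechanize or JMeter; the session paths and the `fake_fetch` stub are hypothetical, and you would swap in a real HTTP call pointed at your own load testing servers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def replay_session(fetch, paths):
    """Replay one user session; return the latency of each request in seconds."""
    latencies = []
    for path in paths:
        start = time.perf_counter()
        fetch(path)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(sorted_values, fraction):
    """Pick the value at the given fraction of an already sorted list."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * fraction))
    return sorted_values[index]


def run_load_test(fetch, paths, concurrent_users):
    """Replay the same session for several simulated users at once."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        sessions = list(pool.map(lambda _: replay_session(fetch, paths),
                                 range(concurrent_users)))
    flat = sorted(lat for session in sessions for lat in session)
    if not flat:
        raise ValueError("the session contains no requests")
    return {"requests": len(flat),
            "median": percentile(flat, 0.50),
            "p95": percentile(flat, 0.95)}


def fake_fetch(path):
    """Stand-in for a real HTTP request (assumption: replace with your client)."""
    time.sleep(0.001)


# A made-up session: home page, search, product page.
session_paths = ["/", "/search?q=shoes", "/product/42"]
stats = run_load_test(fake_fetch, session_paths, concurrent_users=5)
print(stats)
```

As said above, numbers from a lab run like this are hints only; real users will find the paths you did not think of replaying.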

Improving linux IO performance

1. Mount options: use noatime

Most Linux server machines can do without updating the last access time of every file and directory that is read. So I’ll just go ahead and re-quote, for the nth time, what Linux kernel developer Ingo Molnar has to say to emphasize the point.
<Quote>
i cannot over-emphasise how much of a deal it is in practice. Atime
updates are by far the biggest IO performance deficiency that Linux has
today. Getting rid of atime updates would give us more everyday Linux
performance than all the pagecache speedups of the past 10 years,
_combined_.
<Quote/>

You can apply the option without rebooting your machine by using the remount option.
As an example, to mount a filesystem with noatime:
/bin/mount -t ext3 -o noatime /dev/sda5 /
and to remount a filesystem that is already mounted:
/bin/mount -t ext3 -o remount,noatime /dev/sda5 /

And don’t forget to modify the corresponding lines in your /etc/fstab:
/dev/sda5 / ext3 noatime 1 1

2. Use tmpfs
Speed up heavy read-write IO for temporary data stores by using memory instead of disk.

3. On systems not constrained for memory, reduce the swappiness of the Linux machine
/bin/echo 10 > /proc/sys/vm/swappiness

4. Set blockdev readahead to a reasonable value to improve read performance
/sbin/blockdev --setra 131072 /dev/sda

The default readahead value is often too small for workloads dominated by large sequential reads. Note that --setra takes its value in 512-byte sectors.
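Before changing anything, it helps to see where a machine stands. Here is a small Python sketch that checks which on-disk filesystems are mounted without noatime and converts a readahead setting from sectors to bytes. The sample mount table and the function names are my own; on a live system you would pass in the contents of /proc/mounts instead.

```python
SECTOR_BYTES = 512  # blockdev --setra / --getra count in 512-byte sectors


def mounts_missing_noatime(mounts_text,
                           fs_types=("ext2", "ext3", "ext4", "xfs")):
    """Return mount points of on-disk filesystems mounted without noatime.

    mounts_text uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in fs_types and "noatime" not in options.split(","):
            missing.append(mount_point)
    return missing


def readahead_bytes(sectors):
    """Convert a readahead value given in sectors to bytes."""
    return sectors * SECTOR_BYTES


# Made-up sample; on a real machine use open("/proc/mounts").read().
sample = """\
/dev/sda5 / ext3 rw,noatime 0 0
/dev/sda6 /var ext3 rw 0 0
tmpfs /tmp tmpfs rw,size=1048576k 0 0
"""
print(mounts_missing_noatime(sample))           # ['/var']
print(readahead_bytes(131072) // (1024 * 1024), "MiB")  # 64 MiB
```

So the 131072 used above works out to 64 MiB of readahead, which is worth knowing before you copy the number onto a box with little memory.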

Linux: In memory filesystems tmpfs vs ramdisk

Although reading and writing files is fast in Linux thanks to aggressive readahead and caching, disk IO can still slow down applications that make extensive use of on-disk temporary files. One example is MySQL, which can create a lot of on-disk temporary tables when the temporary tables need a large varchar, text or binary column.
It makes sense to mount an in-memory filesystem on MySQL’s tmpdir (usually /tmp), so that these temporary tables are written to and read from memory, returning query output fast by avoiding expensive disk IO. Similarly, many web applications can benefit from writing temporary data to an in-memory filesystem instead of the disk.
The two choices: ramdisk and tmpfs
The Linux kernel typically sets up 16 ramdisks of 16 MB each at boot time. They don’t occupy any memory at initialization. Ramdisks allocate memory when they are put to use by formatting them as ext2 or some other non-journaling filesystem (no, not ext3: journaling is of no use for a filesystem that is transient). Once allocated, the memory can’t be returned from a ramdisk to the operating system. Ramdisks suffer from another limitation: their size can’t be dynamically increased.
tmpfs, on the other hand, doesn’t need to be formatted with another disk filesystem. It can be dynamically resized, and un-utilized memory can be used by the operating system. The only downside is that tmpfs can also use virtual memory, so its contents can be swapped out, causing the very disk IO that we seek to avoid by using tmpfs. However, swappiness should be minimized on a mission-critical server anyway, by tuning the /proc/sys/vm/swappiness value at boot time.

As an example of how to use tmpfs (mode=1777 is the conventional mode for /tmp, so that non-root processes such as mysqld can create files there):
/bin/mount -t tmpfs -o size=1G,nr_inodes=10k,mode=1777,noatime,nodiratime tmpfs /tmp

To dynamically increase its size:
/bin/mount -t tmpfs -o size=2G,nr_inodes=20k,mode=1777,noatime,nodiratime,remount tmpfs /tmp

Tuning the swappiness:
/bin/echo 1 > /proc/sys/vm/swappiness

These commands need to be added to /etc/rc.local to make the settings persistent across reboots.
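If you want a rough feel for the difference on your own machine, a small Python sketch like the one below can time temporary-file IO in two directories. The function name, file counts and sizes are arbitrary choices of mine, and I am assuming /dev/shm is a tmpfs mount, which is common on Linux but not guaranteed. The fsync call forces the disk-backed case to actually hit the disk, so this is a worst-case comparison rather than a model of what MySQL does.

```python
import os
import tempfile
import time


def time_temp_io(directory, n_files=50, size_bytes=64 * 1024, sync=True):
    """Write, optionally fsync, and read back n_files temporary files in
    directory; return the elapsed wall-clock time in seconds."""
    payload = os.urandom(size_bytes)
    start = time.perf_counter()
    for _ in range(n_files):
        # The file is deleted automatically when the with-block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(payload)
            handle.flush()
            if sync:
                os.fsync(handle.fileno())  # cheap on tmpfs, costly on disk
            handle.seek(0)
            handle.read()
    return time.perf_counter() - start


# Candidate directories: the default temp dir and /dev/shm (assumed tmpfs).
for candidate in (tempfile.gettempdir(), "/dev/shm"):
    if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
        elapsed = time_temp_io(candidate)
        print(f"{candidate}: {elapsed:.3f} s")
```

Treat the numbers as indicative only; the page cache, the disk and whatever else is running on the box will all move them around.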

Tiny persistent automatons!

For a while, I have been pondering on this: can I create a bunch of extremely stupid automatons (actually, a tweak on Finite State Machines) that co-exist independently and make self-contained, selfish decisions, and yet create an ecosystem that seems to be making smart decisions?

The idea is simple: I create a set of microprograms (I fondly call them ‘brats’!), each of which has its own selfish agenda, decision processes, survival rules, and I/O probes that constantly monitor the surroundings for the resources it needs (and perishes soon if it doesn’t find them)… and then I spawn each variety of agent multiple times and see an overall behavior emerge.
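To give a flavor of the idea, here is a toy Python sketch. It is not my actual agents (those are written in Ruby and probe real resources); the `Brat` class, the shared food pool and all the numbers are invented for illustration.

```python
import random


class Brat:
    """A deliberately stupid agent: it grabs a random amount of food
    each tick and perishes once its energy runs out."""

    def __init__(self, name, appetite, energy=5):
        self.name = name
        self.appetite = appetite
        self.energy = energy

    def step(self, pool, rng):
        """Live one tick; return True if the agent is still alive."""
        self.energy -= 1                        # cost of living
        wanted = rng.randint(0, self.appetite)  # a 'stupid' random decision
        taken = min(wanted, pool["food"])       # selfish: take what is there
        pool["food"] -= taken
        self.energy += taken
        return self.energy > 0


def simulate(n_agents, ticks, regrowth, seed=0):
    """Run the ecosystem; return the surviving population after each tick."""
    rng = random.Random(seed)
    pool = {"food": 20}
    agents = [Brat(f"brat-{i}", rng.randint(1, 3)) for i in range(n_agents)]
    history = []
    for _ in range(ticks):
        pool["food"] += regrowth                # the environment replenishes
        rng.shuffle(agents)                     # nobody gets a fixed turn
        agents = [a for a in agents if a.step(pool, rng)]
        history.append(len(agents))
    return history


history = simulate(n_agents=10, ticks=30, regrowth=4, seed=1)
print(history)
```

Even in this toy, the interesting knobs are the initial configuration (appetites, starting energy, regrowth rate), which is exactly the part one would rather have a learning system figure out.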

The first set of such agents are already sweating their necks, making highly stupid decisions… and yet surviving, but my goal is to now take it a step further.

I now want to go in two directions:

- train a GA or reinforcement learning based system to figure out the initial configurations.

- spread the agents across multiple computers on the web, thus creating a virtual cluster, an ecosystem that would be very interesting to study.

You may ask, how is this any different from Cellular Automata, the system that Stephen Wolfram brought back into fashion through his book ‘A New Kind of Science’? The fact is, though similarities may exist on the surface, a deeper inspection would reveal that my agents are far more inspired by ant colony / multi-agent emergent behavior systems than by CA. The only difference is that I am trying to be far more flexible about the toolkit, instead of sticking with just pheromones! Also, at the end of the day, I am just trying to learn my tools!

[Tools used: Ruby, RubyGems, MySQL, a lot of caffeine, and box dvd set of The Wire]

Google, watch out!

Throughout the last few years, Google has been attempting to play in the space where Facebook and MySpace have been creating waves. Yet, despite Orkut and their acquisition of Picasa, Google’s Social Web strategy never really came together. Somehow it doesn’t seamlessly fit into (what is now being called) a Social OS on which everything rides.

To complicate things further for them, let me make a wild prediction: Facebook will be the first real social web search, backed by a semantic engine, to reach wide-scale adoption.

I will not make any further comments about why I think so, but by the second quarter of next year, Google will have to seriously rethink how to protect its pie. Period.

Scalable services for the Semantic Web

In my last post, I talked about the dbpedia project, and a plethora of other services which are exporting RDF data. There is no doubt that open web-of-data is the future. The challenge, however, lies in the ability to support massive-scale data and provide an infrastructure for semantic mining. Let me explain why I feel this is going to be a key challenge:

  1. Smart indexing: every single Wikipedia entry contains enormous amounts of information that can be linked / referenced to other sources on the web. For instance, for an entry on Berlin, a service that truly taps the potential would be able to instantly look up highly contextual information from across the web on ‘Berlin’ (with the understanding that the word refers to a city, and nothing else). This is possible today, but only in a very limited fashion. This is exactly where the ‘semantic search’ battle is unfolding at the moment.
  2. Lightweight data wrappers: spread across the web of data would be a web of data wrappers: algorithms that understand and intelligently mine the relevant data. Interestingly, in my preliminary experiments, Perl-based technologies stand out: with the vast CPAN repository and super-fast regexes, Perl provides exactly the framework one needs to rapidly prototype a text-mining agent.
  3. Sentence parsing: why is it important? The power to automatically parse through blobs of English text and generate semantic-web URIs is going to be sweet. But then, let’s put it in writing once and for all: natural language parsing is, and has always been, hard. The emergence of localized versions of the English language isn’t helping either. However, for sources of information that are well-formatted, the challenge is slightly easier. Take a look at the MontyLingua project, along with the other heavyweights in this field such as WordNet. To quote: MontyLingua is ‘a Free, Commonsense-Enriched Natural Language Understander for English’. It actually works very well in most circumstances. In fact, I am planning to write a full review of the technology right after this post.
  4. The DB and the efficiency question: SPARQL, one of the most popular query languages for the semantic web, is very close to SQL in its structure. Yet the semantic web is all about triples: subject, predicate, object. Since a relational DB is built around tables of rows and columns rather than triples, storing triples in one leads to inherent inefficiencies (read: self-joins, or other hacks using caches) which could have been avoided if the db world had evolved slightly differently (interestingly, Google App Engine’s data store is very powerful here; try it out if you don’t believe me).
  5. Revenues, revenues, revenues: of course, the big daddy of all… duh?! Why would someone pay to have a shitload of relevant information? Think! What if you could offer the user products which are exactly what he might have been looking for? What if you could understand what he has been searching for all along? The Semantic Web is not just technological gibberish from the business standpoint; it actually has some tremendous revenue potential built into it. Forget Google AdWords and Google AdSense, they just spam you on the right side of the screen. You can do much, much better.
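To illustrate the join problem from point 4, here is a minimal Python sketch that stores triples in a single relational table (SQLite, in memory) and answers a two-pattern question. The table layout, the predicate names and the data are all made up for the example; real triple stores use much more elaborate schemas and indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("Berlin", "type", "City"),
    ("Berlin", "capitalOf", "Germany"),
    ("Paris", "type", "City"),
    ("Paris", "capitalOf", "France"),
    ("Germany", "type", "Country"),
])


def capital_cities(connection):
    """Roughly what a SPARQL query with two patterns would express:
         SELECT ?city ?country
         WHERE { ?city :type :City . ?city :capitalOf ?country }
    In SQL, every extra triple pattern becomes one more self-join."""
    rows = connection.execute("""
        SELECT t1.subject, t2.object
        FROM triples AS t1
        JOIN triples AS t2 ON t1.subject = t2.subject
        WHERE t1.predicate = 'type' AND t1.object = 'City'
          AND t2.predicate = 'capitalOf'
        ORDER BY t1.subject
    """)
    return rows.fetchall()


print(capital_cities(conn))  # [('Berlin', 'Germany'), ('Paris', 'France')]
```

Two patterns mean one self-join; a query with ten patterns means nine, all against the same big table, which is where the inefficiency creeps in.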

I wish to list out a few more, and then write a very structured post on this. However, that would have to wait a bit. :)

Is Semantic Web poised to be ‘Web 3.0’?

With Microsoft’s recent acquisition of Powerset, it’s clear that we are on the verge of a battle for the ‘search’ space. At the moment, there are a handful of startups which are targeting their technology offerings towards providing more relevance and ‘natural language semantics’ to search queries and results. Yet none of them have really managed to pick up steam, and they play on the fringes of experimentation at the moment. I can’t even remember the last time I relied on a search engine other than Google for my actual day-to-day needs (although it once used to be AltaVista and MetaCrawler).

Well, none apart from the massive Wikipedia, of course.

The strength of Wikipedia lies not only in its vast knowledge database, but also in its highly relevant structure. One of the most overlooked projects of recent times, dbpedia.org, gives a glimpse into the possibilities it offers.

DBpedia reminds me of my old days of tinkering with AI (context-based systems). But now that the ‘world knowledge’ is being fed in a structured fashion by the ‘mob’, what are the systems that would evolve?