disco{db,dex} @ erlang user conference 2009

We gave our talk this morning at EUC2009. Overall, it was a very nice conference. I forgot to mention that we are currently going through the Nokia internal process for open-sourcing discodb and discodex. Hopefully we can release them by the end of the year. We will put pointers on discoproject.org when they are available.

http://discoproject.org/erlanguserconference2009/

Filed under  //  discodb   discodex  

Why Not Hadoop?

We are flying back from Boston after an excellent week at the Architecture Technology Review. This was my first interaction with the Nokia architecture community at large, and I was really pleased (and, I have to admit, somewhat surprised) to see how awesome many of the developments coming down the pipe are. Ville gave a talk on what we have been doing with Disco, and we also gave a demo during one of the 'speed geeking' sessions. One of the most common questions we were asked was "why not Hadoop?", so I thought I'd give my opinion on the subject.

Prior to coming to the NRC, I used Hadoop for about a year and a half (doing bioinformatics), and I must say that it served me quite well. To be sure, there were problems along the way, but Hadoop enabled me to do analyses that I would not otherwise have done. They would not have been impossible without Hadoop, but mapreduce makes it so easy to parallelize a huge class of problems that the overhead of doing things with big data becomes amazingly small.

Even when using Hadoop, I always used Python (with Hadoop Streaming) to write map/reduce functions, because Python is such a pleasure to write, and because I am much more productive writing Python than Java (or pretty much any other language). Because of my love for Python, I often wondered why no one had yet written a Python implementation of mapreduce, and even considered writing my own. I think it is natural for anyone who thinks about the design of systems to question the validity of architecture decisions and to wonder how those designs might be improved. Of course, actually implementing a new design is a whole other story, and finding the impetus to do so is not always easy, especially when a reasonably good implementation (with lots of high-profile developers) already exists.
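For readers who have not seen the style, here is a minimal sketch of word counting written as a map function and a reduce function in Python. It runs in a single process and only imitates the map, sort, reduce flow. The names `mapper`, `reducer`, and `run_local` are made up for illustration and are not part of Hadoop or Disco.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Reduce step: sum the counts for each word.

    Assumes the pairs arrive sorted by key, which is what the sort
    between the map and reduce phases provides.
    """
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Hypothetical helper: simulate map -> sort -> reduce in one process."""
    return dict(reducer(sorted(mapper(lines))))


counts = run_local(["the quick brown fox", "the lazy dog", "The end"])
```

The framework's job is everything this sketch leaves out: splitting the input, running many mappers and reducers on different machines, and moving and sorting the intermediate pairs between them.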

When I discovered the Disco project, which is part Erlang, part Python, I was deeply intrigued. I questioned the choice of Erlang (not knowing much about it), but Ville's argument was extremely pragmatic: Erlang is really good at distributed stuff (that's what it was built to do), and Python is awesome for high-level programming (i.e., it's fun, easy to read/write, expressive, etc.). But I guess the question remains, why not Hadoop? Answering this question is hard because it is largely a matter of taste. The bottom line is that neither Hadoop nor Disco is really a mature project (Hadoop IS more highly developed than Disco, though), while it seems to me the choice of framework is a long-term question. For me, wanting to use Python to improve the framework itself is a no-brainer (additionally, Jython is currently too far behind CPython for me to consider it a replacement).

Why Disco? Because of its philosophy: "massive data - minimal code". Being lightweight is a design goal in Disco, and we really, truly care about programmer overhead. Framework development should be as agile as possible if we are trying to optimize programmer productivity. My vision of Disco is a framework that can be shaped to the needs of its users (including myself), by its users. For me, the reality of Hadoop was quite different.

Filed under  //  architecture   erlang   hadoop   nokia   python  

Erlang User Conference 2009

We will be speaking about discodex at the Erlang User Conference next week. discodex is our new index-building tool implemented using disco and a really awesome data structure called discodb. Check out our slides here:

http://discoproject.org/erlanguserconference2009


git init disco.posterous.com

Hello, world (and welcome)!

Ville and I are starting a blog, so we can share our thoughts as we continue to develop Disco (http://discoproject.org) and related tools at the Nokia Research Center (http://research.nokia.com). Disco is 100% open-source, but a lot of its development takes place behind-the-scenes at NRC. We're looking for a way to share our future plans, tips and tricks, design discussions, or whatever else we might be thinking about. Let us know if there's something in particular you'd like to hear more about!

Filed under  //  welcome  