Octopoda – MapReduce for Human Beings in Python

I have been wanting to learn MapReduce for a long time. I never got a requirement where I could use it. Last few weeks I have dabbling with huge datasets. It was time and as usual I started with wikipedia.

There are huge systems and frameworks built on the concept of MapReduce. They use distributed filesystem, have fault tolerance and can process petabytes of data. But I wanted something simple. I wanted something that's minimalistic and does everything that a MapReduce framework should do and is written in Python.

"Map" : The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes.

"Reduce" : The master node then collects the answers to all the sub-problems and combines them in some way to form the output.

I found MinceMeatPy and Octo.py. Both are single python file MapReduce frameworks. mincemeatpy is actively developed, where as last checkin to octo.py was probably in 2008.

I thought the best way to learn the concept is to write the framework that implements it. But then reinventing the wheel is waste of everybody's time. So I choose the middle ground and forked Octo.py and called it Octopoda.

I removed lot of code and in turn made it simple and inflexible. Added simple auth, added some examples, created a wiki and road map and how could I forget ASCII art :)

============================================================
        _____                                  _       
       / ___ \       _                        | |      
      | |   | | ____| |_  ___  ____   ___   _ | | ____ 
      | |   | |/ ___)  _)/ _ \|  _ \ / _ \ / || |/ _  |
      | |___| ( (___| |_| |_| | | | | |_| ( (_| ( ( | |
       \_____/ \____)\___)___/| ||_/ \___/ \____|\_||_|
                  MapReduce for HumanBeings
          Repo: http://code.thejeshgn.com/octopoda
============================================================

I am now working on channel encryption. I need help. The project is hosted on bitbucket. Go ahead and fork and send me pull request with your changes.

A standard MapReduce example is counting words.

#wordCount.py
source = {1:"Humpty Dumpty sat on a wall", 
2:"Humpty Dumpty had a great fall", 
3:"All the King's horses and all the King's men",
 4:"Couldn't put Humpty together again" }

def final(key, value):
    print key, value

# client
def mapfn(key, value):
    for w in value.split():
        yield w, 1

def reducefn(key, value):
    result = 0
    for v in value:
        result += v
    return result

On server:
$ python octopoda.py server ./examples/wordCount.py

On client or nodes:
$ python octopoda.py client localhost_or_server_ip

You can start as many clients as you want. Server will handle task distribution and aggregation. I know this is an overly simplistic example. With a little modification the same example can be made to calculate the word count from all the files in a directory. I will write about that in my next post. Until then have fun.