Octopoda – MapReduce for Human Beings in Python
I have wanted to learn MapReduce for a long time, but I never had a requirement where I could use it. For the last few weeks I have been dabbling with huge datasets, so it was time. As usual, I started with Wikipedia.
There are huge systems and frameworks built on the concept of MapReduce. They use distributed filesystems, have fault tolerance, and can process petabytes of data. But I wanted something simple: something minimalistic, written in Python, that does everything a MapReduce framework should do.
“Map” : The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes.
“Reduce” : The master node then collects the answers to all the sub-problems and combines them in some way to form the output.
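To make those two steps concrete, here is a minimal single-process sketch. It is not how Octopoda or any real framework is implemented: there is no network and there are no workers, and `run_mapreduce`, `batches` and the even/odd example are names I made up for illustration. It only shows the shape of the idea: map each input record, group the intermediate values by key, then reduce each group.

```python
from collections import defaultdict


def run_mapreduce(source, mapfn, reducefn):
    """Run map, group, and reduce in one process (no network, no workers)."""
    grouped = defaultdict(list)
    # Map: each (key, value) input record is an independent sub-problem.
    for key, value in source.items():
        for out_key, out_value in mapfn(key, value):
            # Group intermediate values by the key the mapper emitted.
            grouped[out_key].append(out_value)
    # Reduce: combine all values collected for each key into one answer.
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())


# Hypothetical input: two batches of numbers.
batches = {"a": [1, 2, 3], "b": [4, 5, 6]}


def mapfn(key, value):
    # Emit each number under the key "even" or "odd".
    for n in value:
        yield ("even" if n % 2 == 0 else "odd"), n


def reducefn(key, values):
    # Total all the numbers emitted under one key.
    return sum(values)


totals = run_mapreduce(batches, mapfn, reducefn)
print(sorted(totals.items()))  # [('even', 12), ('odd', 9)]
```

In a real framework the map calls would run on different machines and the grouping would happen as results come back to the master, but the functions you write keep this same shape.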
I found MinceMeatPy and Octo.py. Both are single-file Python MapReduce frameworks. MinceMeatPy is actively developed, whereas the last check-in to Octo.py was probably in 2008.
I thought the best way to learn the concept was to write a framework that implements it. But then, reinventing the wheel is a waste of everybody’s time. So I chose the middle ground: I forked Octo.py and called it Octopoda.
I removed a lot of code, which made it simpler and less flexible. I added simple auth and some examples, created a wiki and a road map, and, how could I forget, ASCII art :)
============================================================
_____ _
/ ___ \ _ | |
| | | | ____| |_ ___ ____ ___ _ | | ____
| | | |/ ___) _)/ _ \| _ \ / _ \ / || |/ _ |
| |___| ( (___| |_| |_| | | | | |_| ( (_| ( ( | |
\_____/ \____)\___)___/| ||_/ \___/ \____|\_||_|
MapReduce for HumanBeings
Repo: http://code.thejeshgn.com/octopoda
============================================================
I am now working on channel encryption, and I need help. The project is hosted on Bitbucket. Go ahead and fork it, and send me a pull request with your changes.
A standard MapReduce example is counting words.
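# wordCount.py

# Input data: one line of text per key.
source = {1: "Humpty Dumpty sat on a wall",
          2: "Humpty Dumpty had a great fall",
          3: "All the King's horses and all the King's men",
          4: "Couldn't put Humpty together again"}

def final(key, value):
    # Print each word with its total count.
    print("%s %s" % (key, value))

# client
def mapfn(key, value):
    # Emit (word, 1) for every word in the line.
    for w in value.split():
        yield w, 1

def reducefn(key, value):
    # `value` holds all the counts emitted for the word `key`.
    result = 0
    for v in value:
        result += v
    return result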
On the server:
$ python octopoda.py server ./examples/wordCount.py
On the clients (worker nodes):
$ python octopoda.py client localhost_or_server_ip
You can start as many clients as you want; the server will handle task distribution and aggregation. I know this is an overly simplistic example. With a little modification, the same example can count the words in all the files in a directory. I will write about that in my next post. Until then, have fun.
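As a rough preview of that directory variant, and only as a sketch: the helper below builds a `source` dict with one entry per file, so each file becomes one map task. `load_source` and the `./texts` path are hypothetical names of mine, not part of Octopoda. It assumes that any plain dict works as `source`, as in the example above, and that every file is small enough to read into memory in one go.

```python
import os


def load_source(directory):
    """Build a {filename: text} dict from the regular files in `directory`."""
    source = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip sub-directories; this sketch does not recurse.
        if os.path.isfile(path):
            with open(path) as handle:
                source[name] = handle.read()
    return source


# Hypothetical usage in a copy of wordCount.py ("./texts" is a placeholder):
# source = load_source("./texts")
```

The `mapfn` and `reducefn` from the word count example should not need to change, because each value is still a string of text to split into words.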


