Octopoda – MapReduce for Human Beings in Python
I have been wanting to learn MapReduce for a long time, but I never had a requirement where I could use it. For the last few weeks I have been dabbling with huge datasets. It was time, and as usual I started with Wikipedia.
There are huge systems and frameworks built on the concept of MapReduce. They use distributed filesystems, have fault tolerance and can process petabytes of data. But I wanted something simple: something minimalistic that does everything a MapReduce framework should do and is written in Python.
"Map" : The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes.
"Reduce" : The master node then collects the answers to all the sub-problems and combines them in some way to form the output.
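To make the two steps concrete, here is a minimal single-process sketch of that description. It is not how Octopoda or any real framework is implemented: the names partition, worker and master are made up for illustration, and the "workers" are plain function calls rather than separate machines.

```python
def partition(data, size):
    """Split a list into chunks of at most `size` items (the sub-problems)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(chunk):
    """Solve one sub-problem: here, sum the squares of the chunk."""
    return sum(x * x for x in chunk)


def master(data, chunk_size=3):
    # "Map": partition the input and hand each chunk to a worker.
    answers = [worker(chunk) for chunk in partition(data, chunk_size)]
    # "Reduce": combine the partial answers into the final output.
    return sum(answers)


# Sum of the squares of 0..9, computed chunk by chunk.
total = master(list(range(10)))
```

In a real setup the call to worker would be a network round trip to another node; the shape of the computation stays the same.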
I found MinceMeatPy and Octo.py. Both are single-file Python MapReduce frameworks. mincemeatpy is actively developed, whereas the last check-in to octo.py was probably in 2008.
I thought the best way to learn the concept was to write a framework that implements it. But then reinventing the wheel is a waste of everybody's time. So I chose the middle ground: I forked Octo.py and called it Octopoda.
I removed a lot of code, which made it simpler but also less flexible. I added simple auth and some examples, created a wiki and a road map and, how could I forget, ASCII art :)
============================================================
  _____                                  _
 / ___ \       _                        | |
| |   | | ____| |_  ___  ____   ___   _ | | ____
| |   | |/ ___)  _)/ _ \|  _ \ / _ \ / || |/ _  |
| |___| ( (___| |_| |_| | | | | |_| ( (_| ( ( | |
 \_____/ \____)\___)___/| ||_/ \___/ \____|\_||_|
            MapReduce for HumanBeings
        Repo: http://code.thejeshgn.com/octopoda
============================================================
I am now working on channel encryption, and I need help. The project is hosted on Bitbucket. Go ahead and fork it, and send me a pull request with your changes.
A standard MapReduce example is counting words.
#wordCount.py
source = {1: "Humpty Dumpty sat on a wall",
          2: "Humpty Dumpty had a great fall",
          3: "All the King's horses and all the King's men",
          4: "Couldn't put Humpty together again"
          }

def final(key, value):
    print key, value

# client
def mapfn(key, value):
    for w in value.split():
        yield w, 1

def reducefn(key, value):
    result = 0
    for v in value:
        result += v
    return result
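If you want to see what this job computes before starting a server, the sketch below runs the same mapfn and reducefn in a single process. The helper run_local is hypothetical and not part of Octopoda; it only imitates the grouping step that I assume the server performs between map and reduce, namely collecting every value emitted for a key into one list. It avoids print so that it runs unchanged on Python 2.7 and Python 3.

```python
from collections import defaultdict

source = {1: "Humpty Dumpty sat on a wall",
          2: "Humpty Dumpty had a great fall",
          3: "All the King's horses and all the King's men",
          4: "Couldn't put Humpty together again"}


def mapfn(key, value):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for w in value.split():
        yield w, 1


def reducefn(key, value):
    # `value` is the list of all counts emitted for the word `key`.
    return sum(value)


def run_local(source, mapfn, reducefn):
    """Hypothetical stand-in for the server: map, group by key, reduce."""
    grouped = defaultdict(list)
    for key, value in source.items():
        for out_key, out_value in mapfn(key, value):
            grouped[out_key].append(out_value)
    return {k: reducefn(k, vs) for k, vs in grouped.items()}


counts = run_local(source, mapfn, reducefn)
```

Because the split is case-sensitive, "All" and "all" end up as two separate keys.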
On server:
$ python octopoda.py server ./examples/wordCount.py
On client or nodes:
$ python octopoda.py client localhost_or_server_ip
You can start as many clients as you want. The server handles task distribution and aggregation. I know this is an overly simplistic example. With a little modification the same example can be made to calculate the word count across all the files in a directory. I will write about that in my next post. Until then, have fun.
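As a rough idea of what that modification might look like, the sketch below builds the source dictionary from the files in a directory instead of hard-coding it. This is only an assumption about one possible approach, not the version promised for the next post: load_source is a hypothetical helper, and "./texts" in the comment is a made-up path.

```python
import os


def load_source(directory):
    """Map each regular file name in `directory` to that file's text."""
    source = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip sub-directories; only plain files become map inputs.
        if os.path.isfile(path):
            with open(path) as handle:
                source[name] = handle.read()
    return source


# In a job file this would replace the hard-coded dictionary, e.g.:
# source = load_source("./texts")
```

With this shape, mapfn and reducefn from the word count example would not need to change, since value.split() works on a whole file's text just as it does on a single line. The trade-off is that every file is read into memory in full, so this only suits modest inputs.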