CS61A-Lab 14

MapReduce

Posted by Tianxiang Gao on June 20, 2015

Introduction: MapReduce

MapReduce is a programming paradigm developed by Google, which allows a programmer to process large ammouts of data in parallel on many computers.

A computation in MapReduce consists two components: mapper and reducer.

  • mapper takes an input file, and prints out a series of key-value pairs:
  • age 29
    name cecilia
    job gradstudent
    salary 42
    
    In this example, the key-value pairs are:
    {age: 29, name: cecilia, job: gradstudent, salary: 42}
    
  • reducer takes the (sorted) output from the mapper, and outputs a single value for each key. The mapper's output will be sorted according to the key.

This lab is split up into two parts: 1. Serial MapReduce: to introduce MapReduce framework, we will first use a non-parallelized version. Even with this limitation, we can still do quite a lot of data processing! 2. Parallelized MapReduce: the real power of MapReduce comes from parallelization. The same MapReduce jobs from Part 1 can be executed much faster by using multiple computers(i.e. a cluster) at the same time. For this lab, we will be using Hadoop, an open source implementation of the MapReduce paradignm.

Part 1: Serial MapReduce

Example: Line-Counting

Our first exercise will be counting the number of lines in one of Shakespear’s plays.

To formulate this as MapReduce problem, we need to define an appropriate mapper and reducer function.

What the mapper does: for each line in a text file, the mapper outputs a key-value pair. What should our key-value pairs be for our line counting example?

  • key: in our example, we don't care about the contents of each line -- there's no need to classify each line. Thus, our key can just be the string 'line'.
  • value: we want to count each line exactly once. Thus, our value can just be the number 1.

For example,

so much depends
upon
a red wheel
barrow
glazed with rain
water
beside the white
chickens.

(notice there are 8 lines); it then outputs a sequence of key-value pairs like this:

'line'  1
'line'  1
'line'  1
'line'  1
'line'  1
'line'  1
'line'  1
'line'  1

The reducer takes this sequence and simply adds up all the values tha are associated with the key ‘line’.

'line'  8

line_count.py is the mapper, which takes input from stdin(i.e. standard in) and outputs one key-value pair of each line to stdout (i.e. standard out, which is typically the terminal output).

Let’s run line_count.py by feeding it the_tempest.text. The question is, how do we give the_tempest.txt to line_count.py via stdin? We’ll use the Unix pipe | feature:

chmode +x line_count.py

This command lines change mode of file line_count.py to executable.

Once you’re made line_count.py executable, type in the following command:

cat the_tempest.tx | ./line_count.py

cat will display the contents of a given file to stdout.

If you’ve completed line_count.py correctly, your terminal output should be full of key-value pairs like:

'line' 1
'line' 1
'line' 1
...
'line' 1

Unix pipe-ing takes the output of one program (in this case, the cat program), and ‘pipes’ it as the input to another program (typically via stdin). This technique of piping programs together is called “modular programming” and is ubiquitous in Unix-sytle programming. Modular programming allows us to write small, simple programs and chain them together to accomplish complicated tasks.

The reducer: sum.py

In your current directory should be the file sum.py. The body of this file should be:

#!/usr/bin/env python3

import sys
from ucb import main
from mr import values_by_key, emit

@main
def run():
    for key, value_iterator in values_by_key(sys.stdin):
        emit(key, sum(value_iterator))

Let’s break down the process:

  1. values_by_key is a function that reads input from stdin, and groups all key-value pairs that have the same key. For exmaple
  2. 'line'  1
    'line'  1
    'line'  1
    will turn into the following pair:
    ('line', [1, 1, 1])
    Note: the second element should actually be an iterator, not a Python list; it is represented with square brackets for visual clarity.
  3. The variables key and value_iterator get bound to their respective values in the example above
  4. key: 'line'
    value_iterator: [1, 1, 1]
  5. For each of these key-iterator pairs, sum.py will add up the values in the iterator and output this new value with the same key:
  6. 'line'  3
    The emit function prints out a key and a value in the format shown above. emit also handles logistics for parallelization, which becomes important in Part 2 of the lab.

Putting it all together

Now that we have the mapper and reducer defined, let’s put it all together in the (simplified) MapReduce framework: Map -> Sort -> Reduce

Once you’ve doen this, issue the following Unix command:

cat the_tempest.txt | ./line_count.py | sort | ./sum.py

Notice that we’re using the Unix program sort, which is a built-in Unix program. As you’d expect, sort will, give a file, sort the lines of the file - by default, it will sort it alphabetically.

Exercises

Question 1:

import sys
from ucb import main
from mr import values_by_key, emit

def count_words(line):
	for word in ['the', 'he', 'she', 'it', 'thee']:
		count = line.count(word)
		if count > 0:
			emit(word, count)

@main
def run():
    for line in sys.stdin:
    	count_words(line)