In 2004, engineers at Google introduced a new paradigm for large-scale parallel data processing known as MapReduce (see the original paper, "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, OSDI 2004, and make sure to look at the citations at the end). One key aspect of MapReduce is that it makes programming such tasks on large-scale clusters easy for developers; instead of worrying about how to manage parallelism, handle machine crashes, and the many other complexities common within clusters of machines, the developer can instead just focus on writing small pieces of code (described below) and the infrastructure handles the rest.
In this project, you'll be building a simplified version of MapReduce for just a single machine. While it is somewhat easier to build MapReduce for a single machine than for a cluster, there are still numerous challenges, mostly in building correct concurrency support. Thus, you'll have to think a bit about how to design the MapReduce implementation, and then build it so that it works efficiently and correctly.
To understand how to make progress on any project that involves
concurrency, you should understand the basics of thread creation, mutual
exclusion (with locks), and signaling/waiting (with condition variables). These
are described in the following book chapters:
· Intro to Threads
· Threads API (C)
· Locks
· Using Locks
· Condition Variables (C)
· Java Lock and Condition (this is a must-read, but you can ignore the built-in synchronization primitives)
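To make those chapter topics concrete, here is a minimal Java sketch of the lock-plus-condition-variable pattern they cover: one thread waits for a condition while another sets it and signals. The class and method names are illustrative, not part of the project.

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch: one thread waits until another marks work as done.
public class DoneFlag {
    private final Lock lock = new ReentrantLock();
    private final Condition done = lock.newCondition();
    private boolean isDone = false;

    // Block the caller until markDone() has been called.
    public void waitForDone() throws InterruptedException {
        lock.lock();
        try {
            while (!isDone) {   // re-check in a loop: wakeups can be spurious
                done.await();   // atomically releases the lock while waiting
            }
        } finally {
            lock.unlock();
        }
    }

    // Set the flag and wake up any waiting threads.
    public void markDone() {
        lock.lock();
        try {
            isDone = true;
            done.signalAll();
        } finally {
            lock.unlock();
        }
    }
}

Note the while loop rather than an if: a waiting thread must re-check its condition after waking, a rule the condition-variables chapter explains in detail.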
Let's now get into the exact code you'll have to build. The
MapReduce infrastructure you will build supports the execution of user-defined Map() and Reduce()
functions.
As the original paper describes: "Map(), written by the user,
takes an input pair and produces a set of intermediate key/value pairs. The
MapReduce library groups together all intermediate values associated with the
same intermediate key K and passes them to the Reduce() function."
"The Reduce() function, also written by the user, accepts an
intermediate key K and a set of values for that key. It merges together these
values to form a possibly smaller set of values; typically just zero or one
output value is produced per Reduce() invocation. The intermediate values are
supplied to the user's reduce function via an iterator."
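To make that division of labor concrete, the sketch below shows one plausible Java shape for the two user-defined functions. The interface names and signatures here are assumptions for illustration; the exact API you must support may differ.

import java.util.Iterator;

// Hypothetical interfaces for the user-defined functions.
interface Mapper {
    // Called once per input; emits intermediate key/value pairs into
    // the MapReduce library (the emission mechanism is left out here).
    void map(String fileName, String contents);
}

interface Reducer {
    // Called once per intermediate key K, with an iterator over every
    // value the mappers emitted for K.
    void reduce(String key, Iterator<String> values);
}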
A classic example, written here in pseudocode, shows how to count
the number of occurrences of each word in a set of documents:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    print key, result;
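For reference, here is one sequential Java translation of that pseudocode, with a simple in-memory table standing in for the library's grouping step. Every name in it is illustrative rather than required.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    // Stands in for the MapReduce library: groups intermediate values by key.
    static final Map<String, List<String>> intermediate = new TreeMap<>();

    // synchronized so concurrent mappers (see the later sketch) can emit safely.
    static synchronized void emitIntermediate(String key, String value) {
        intermediate.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // map(): one call per document; emits (word, "1") for every word.
    static void map(String documentName, String contents) {
        for (String word : contents.split("\\s+")) {
            if (!word.isEmpty()) {
                emitIntermediate(word, "1");
            }
        }
    }

    // reduce(): one call per distinct word; sums that word's counts.
    static void reduce(String key, Iterator<String> values) {
        int result = 0;
        while (values.hasNext()) {
            result += Integer.parseInt(values.next());
        }
        System.out.println(key + " " + result);
    }

    public static void main(String[] args) {
        map("doc1", "the quick brown fox");
        map("doc2", "the lazy dog and the fox");
        for (Map.Entry<String, List<String>> entry : intermediate.entrySet()) {
            reduce(entry.getKey(), entry.getValue().iterator());
        }
    }
}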
What's fascinating about MapReduce is that so many different kinds of relevant computations can be mapped onto this framework. The original paper lists many examples, including word counting (as above), distributed grep, counting URL access frequency, building a reverse web-link graph, computing a term vector per host, and others.
What's also quite interesting is how easy it is to parallelize:
many mappers can be running at the same time, and later, many reducers can be
running at the same time. Users don't have to worry about how to parallelize
their applications; rather, they just write Map() and Reduce()
functions and the infrastructure does the rest.
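As a rough illustration of that two-phase structure, the sketch below (which assumes the illustrative WordCount class from earlier, whose synchronized emitIntermediate() makes concurrent mapping safe) runs all the mappers on a thread pool, waits for every one to finish, and only then runs the reducers.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRunner {
    public static void main(String[] args) throws InterruptedException {
        String[][] docs = { {"doc1", "the quick brown fox"},
                            {"doc2", "the lazy dog and the fox"} };

        // Phase 1: all mappers may run at the same time.
        ExecutorService mappers = Executors.newFixedThreadPool(4);
        for (String[] doc : docs) {
            mappers.submit(() -> WordCount.map(doc[0], doc[1]));
        }
        mappers.shutdown();
        // Reducers must not start until every mapper has finished.
        mappers.awaitTermination(1, TimeUnit.MINUTES);

        // Phase 2: reducers may run at the same time, one key per task.
        ExecutorService reducers = Executors.newFixedThreadPool(4);
        WordCount.intermediate.forEach((key, values) ->
                reducers.submit(() -> WordCount.reduce(key, values.iterator())));
        reducers.shutdown();
        reducers.awaitTermination(1, TimeUnit.MINUTES);
    }
}

The key ordering constraint, which your implementation must also enforce, is the barrier between the two phases: no reducer may see an intermediate key until all values for that key have been emitted.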