[NLPL Task Force (A)] Apache Spark and NLPL(2)?

Martin Matthiesen martin.matthiesen at csc.fi
Mon Nov 26 14:06:59 UTC 2018


Hello all!

I attended an Apache Spark course last week and now have a better idea of what MapReduce actually means. I was wondering whether such an approach would be interesting to NLPL in the future. In any case, here's my very brief summary for your consideration.

Cheers,
Martin


Sent: Thursday, 22 November, 2018 13:23:54
Subject: Course Report: Apache Spark in a nutshell

Hi,

I went to an Apache Spark[1] course on Monday and Tuesday. Here's an extremely short summary; ignore it if it is not relevant to you.

Spark is a highly distributed framework for big data analysis. It implements the MapReduce[2] method. It is similar to Hadoop, but faster and apparently easier to use.

MapReduce boils down to this:

You have a list with gazillions of entries and define a MAP function that is applied to every entry of the list.
At some point you gather the results by applying a REDUCE function to the list as a whole.

A simple example:

You have a list of words and MAP each one to its base form with a POS tag:

["bees", "flew"] -> [("bee","N PL"), ("fly", "V Past")]

You might then REDUCE the list by extracting only the base forms tagged "N PL":

["bee"]

The charm of the approach is that each MAP operation is performed on one item independently of all the others, so it can be massively parallelized. Syntax trees would need whole sentences as items, but the principle is always the same: run the same operation on all items in parallel.
REDUCE is more costly, since it needs to run on the whole list. The lists are read-only; map operations always create new lists.
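The same point in miniature, using a local thread pool as a stand-in for a cluster (the analyse function is a made-up placeholder for per-item work such as parsing one sentence):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def analyse(sentence):
    # Placeholder for per-item work, e.g. parsing one sentence.
    return len(sentence.split())

sentences = ["bees flew", "the bee flew home"]

# MAP: each item is processed independently, so the work parallelizes
# trivially; the input list is never modified.
with ThreadPoolExecutor(max_workers=2) as pool:
    lengths = list(pool.map(analyse, sentences))

# REDUCE: needs to see the whole mapped list at once.
total = reduce(lambda a, b: a + b, lengths)
print(total)  # 6
```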

CSC's Rahti has a template to set up a Spark cluster. This is all experimental and still in beta, but it already works on a small scale. Later, Rahti will make it very easy to fine-tune the resources needed for good performance.

Spark has a special columnar format in which data is stored for fast access (Parquet[3]). This could be interesting for the Language Bank: we could offer data in Parquet format (say, a few TB) and users would extract just the data they need (which could be a small fraction).
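As a toy illustration of why a columnar format allows that (plain Python, not Parquet itself): columns are stored separately, so a query that asks for one column never has to read the others.

```python
# Toy columnar table: each column stored as its own list, the way a
# columnar format like Parquet lays columns out separately on disk.
table = {
    "word":     ["bees", "flew", "bee"],
    "baseform": ["bee", "fly", "bee"],
    "pos":      ["N PL", "V Past", "N SG"],
}

# A user who only wants base forms touches one column and leaves the
# rest of the (possibly terabyte-sized) table unread.
baseforms = table["baseform"]
print(baseforms)  # ['bee', 'fly', 'bee']
```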

Martin

Links
[1] https://spark.apache.org/
[2] https://en.wikipedia.org/wiki/MapReduce
[3] http://parquet.apache.org/


-- 
Martin Matthiesen
CSC - Tieteen tietotekniikan keskus
CSC - IT Center for Science
PL 405, 02101 Espoo, Finland
+358 9 457 2376, martin.matthiesen at csc.fi
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
Fingerprint: AA25 6F56 5C9A 8B42 009F  BA70 74B1 2876 FD89 0704


