Sunday 16 August 2015

Learning Spark - Chapter 7 Notes

Spark has a pluggable cluster manager. It can run on YARN, on Mesos, or on Spark's own Standalone cluster manager.


In distributed mode, Spark uses a master/slave architecture with one central coordinator and many distributed workers. The central coordinator is called the driver. The driver communicates with a potentially large number of distributed workers called executors. The driver runs in its own Java process, and each executor is a separate Java process. A driver and its executors are together termed a Spark application.

The driver is the process where the main() method of your program runs; it is the process running the user code that creates the SparkContext and RDDs and performs transformations and actions.
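As a rough sketch of what that looks like (the paths and names here are placeholders, not from the book), a minimal driver program in Scala:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch of a driver program: main() creates the SparkContext,
    // builds RDDs from input, applies transformations, and runs an action.
    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCountDriver")
        val sc   = new SparkContext(conf)

        val lines  = sc.textFile("hdfs:///data/input.txt")    // placeholder input path
        val counts = lines
          .flatMap(_.split(" "))                              // transformation
          .map(word => (word, 1))                             // transformation
          .reduceByKey(_ + _)                                 // transformation
        counts.saveAsTextFile("hdfs:///data/wordcounts")      // action: triggers execution

        sc.stop()
      }
    }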




When the driver runs, it performs two duties:



1) It is responsible for converting a user program into units of physical execution called tasks.

At a high level, all Spark programs follow the same structure: they create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical directed acyclic graph (DAG) of operations. When the driver runs, it converts this logical graph into a physical execution plan.


Spark performs several optimizations, such as “pipelining” map transformations together to merge them, and converts the execution graph into a set of stages. Each stage, in turn, consists of multiple tasks. The tasks are bundled up and prepared to be sent to the cluster. Tasks are the smallest unit of work in Spark; a typical user program can launch hundreds or thousands of individual tasks.
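A quick way to see this in practice is RDD.toDebugString, which prints the lineage of an RDD with stage boundaries. A small sketch (assuming sc is an existing SparkContext; the input path is a placeholder):

    // Assumes `sc` is an existing SparkContext; "input.txt" is a placeholder path.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // toDebugString prints the RDD lineage; indentation marks stage boundaries.
    // textFile, flatMap, and map are pipelined into one stage; reduceByKey
    // introduces a shuffle and therefore a new stage.
    println(counts.toDebugString)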


2) Scheduling tasks on executors


Given a physical execution plan, a Spark driver must coordinate the scheduling of individual tasks on executors. When executors are started they register themselves with the driver, so it has a complete view of the application’s executors at all times. Each executor represents a process capable of running tasks and storing RDD data. The Spark driver will look at the current set of executors and try to schedule each task in an appropriate location, based on data placement. When tasks execute, they may have a side effect of storing cached data. The driver also tracks the location of cached data and uses it to schedule future tasks that access that data.
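For example (a sketch, assuming sc exists; parseUser and the active field are hypothetical):

    // Assumes `sc` is an existing SparkContext; parseUser and the `active`
    // field are hypothetical, for illustration only.
    case class User(name: String, active: Boolean)
    def parseUser(line: String): User = {
      val fields = line.split(",")
      User(fields(0), fields(1).toBoolean)
    }

    val users = sc.textFile("users.txt").map(parseUser).cache()

    users.count()                   // first action: computes and caches partitions on executors
    users.filter(_.active).count()  // later action: the driver prefers executors holding cached partitions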






The driver exposes information about the running Spark application through a web interface, which by default is available on port 4040. For instance, in local mode, this UI is available at http://localhost:4040.
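If that port is already in use (or you simply want a different one), the UI port can be changed through the spark.ui.port property. A small sketch:

    import org.apache.spark.SparkConf

    // Sketch: override the default application UI port (4040) via spark.ui.port.
    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.ui.port", "4050")   // serve the UI on 4050 instead of 4040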


Executors
Spark executors are worker processes responsible for running the individual tasks in
a given Spark job. Executors are launched once at the beginning of a Spark application
and typically run for the entire lifetime of an application, though Spark applications
can continue if executors fail. Executors have two roles. First, they run the tasks
that make up the application and return results to the driver. Second, they provide
in-memory storage for RDDs that are cached by user programs, through a service
called the Block Manager that lives within each executor. Because RDDs are cached
directly inside of executors, tasks can run alongside the cached data.
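A short sketch of explicitly choosing a storage level (assuming sc exists; the log path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    // Assumes `sc` is an existing SparkContext; "app.log" is a placeholder path.
    val errors = sc.textFile("app.log").filter(_.contains("ERROR"))

    // Persisted partitions are held by the Block Manager inside each executor,
    // so later tasks can run alongside the cached data.
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    errors.count()     // materializes and caches the partitions
    errors.take(10)    // reuses the cached partitions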


Values for master

spark://host:port -> Connect to a Spark Standalone cluster at the specified port. By default, Spark Standalone masters use port 7077.
mesos://host:port -> Connect to a Mesos cluster master at the specified port. By default, Mesos masters listen on port 5050.
yarn -> Connect to a YARN cluster. When running on YARN you’ll need to set the HADOOP_CONF_DIR environment variable to point to the location of your Hadoop configuration directory, which contains information about the cluster.

local -> Run in local mode with a single core.
local[N] -> Run in local mode with N cores.
local[*] -> Run in local mode and use as many cores as the machine has.
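The same master values can also be set programmatically on a SparkConf. A sketch with placeholder host names (the ports are the defaults mentioned above):

    import org.apache.spark.SparkConf

    val standalone = new SparkConf().setMaster("spark://master-host:7077")
    val mesos      = new SparkConf().setMaster("mesos://mesos-master:5050")
    val localOne   = new SparkConf().setMaster("local")      // single local core
    val localFour  = new SparkConf().setMaster("local[4]")   // four local cores
    val localAll   = new SparkConf().setMaster("local[*]")   // all available cores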

Deploy Mode Importance:

Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode. 

A few points about Mesos:
 
Unlike the other cluster managers, Mesos offers two modes to share resources between executors on the same cluster.

1) Fine Grained Mode: In “fine-grained” mode, which is the default, executors scale up and down the number of CPUs they claim from Mesos as they execute tasks, and so a machine running multiple executors can dynamically share CPU resources between them.

2) Coarse-Grained Mode: In “coarse-grained” mode, Spark allocates a fixed number of CPUs to each executor in advance and never releases them until the application ends, even if the executor is not currently running tasks.

You can enable coarse-grained mode by passing --conf spark.mesos.coarse=true to spark-submit.
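The same setting can also be applied programmatically on the SparkConf (the Mesos master URL below is a placeholder):

    import org.apache.spark.SparkConf

    // Sketch: enable coarse-grained Mesos mode from the driver configuration.
    val conf = new SparkConf()
      .setMaster("mesos://mesos-master:5050")
      .set("spark.mesos.coarse", "true")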

The fine-grained Mesos mode is attractive when multiple users share a cluster to run interactive workloads such as shells, because applications will scale down their number of cores when they’re not doing work and still allow other users’ programs to use the cluster. The downside, however, is that scheduling tasks through fine-grained mode adds more latency.

You can use a mix of scheduling modes in the same Mesos cluster (i.e., some of your Spark applications might have spark.mesos.coarse set to true and some might not).

Spark on Mesos supports running applications only in the “client” deploy mode—that is, with the driver running on the machine that submitted the application. If you would like to run your driver in the Mesos cluster as well, frameworks like Aurora and Chronos allow you to submit arbitrary scripts to run on Mesos and monitor them.

 

