parallel.cluster.Spark
Description
A parallel.cluster.Spark object represents and provides access to a Spark™ cluster. Use the parallel.cluster.Spark object as input to the mapreduce and mapreducer functions to specify the Spark cluster as the parallel execution environment for tall arrays and mapreduce.
Creation
Use parallel.cluster.Spark to create a Spark cluster object.
Description
sparkCluster = parallel.cluster.Spark creates a parallel.cluster.Spark object representing the Spark cluster.
sparkCluster = parallel.cluster.Spark(Name,Value) sets the optional ClusterMatlabRoot and SparkInstallFolder properties using one or more name-value arguments. For example, to change the Spark install folder, use 'SparkInstallFolder','/share/spark/spark-3.3.0'.
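As a sketch, creating the object with both optional properties set might look like the following. The SparkInstallFolder path is taken from the example above; the ClusterMatlabRoot path is a hypothetical placeholder and depends on where MATLAB is installed on your cluster.

```matlab
% Create a Spark cluster object, overriding the optional properties.
% '/usr/local/MATLAB/R2022b' is a hypothetical install location;
% use the paths that apply to your own cluster.
sparkCluster = parallel.cluster.Spark( ...
    'ClusterMatlabRoot','/usr/local/MATLAB/R2022b', ...
    'SparkInstallFolder','/share/spark/spark-3.3.0');
```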
Properties
Object Functions
mapreduce | Programming technique for analyzing data sets that do not fit in memory |
mapreducer | Define parallel execution environment for mapreduce and tall arrays |
Examples
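A minimal sketch of using the cluster object as the execution environment for a tall array, assuming your data is stored on HDFS. The HDFS location and the ArrDelay variable name are hypothetical placeholders; substitute the location and variables of your own dataset.

```matlab
% Create the Spark cluster object and make it the parallel
% execution environment for mapreduce and tall arrays.
sparkCluster = parallel.cluster.Spark( ...
    'SparkInstallFolder','/share/spark/spark-3.3.0');
mapreducer(sparkCluster);

% Create a tall array from a datastore. The HDFS path below is a
% hypothetical example.
ds = datastore('hdfs:///data/airlines/*.csv');
t = tall(ds);

% Deferred operations on t run on the Spark cluster; gather brings
% the reduced result back to the client.
m = gather(mean(t.ArrDelay,'omitnan'));
```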
Tips
Spark clusters limit the amount of memory available. You must keep the amount of data you gather within these limits to support your workflow.
The amount of data gathered to the client is limited by the Spark properties:
spark.driver.memory
spark.executor.memory
The amount of data gathered from a single Spark task must fit within these properties. A single Spark task processes one block of data from HDFS, which is 128 MB of data by default. If you gather a tall array containing most of the original data, you must ensure these properties are set large enough to fit the gathered data.
If these properties are set too small, you see an error like the following.
Error using tall/gather (line 50) Out of memory; unable to gather a partition of size 300m from Spark. Adjust the values of the Spark properties spark.driver.memory and spark.executor.memory to fit this partition.
The error message also specifies the property settings you need.
Adjust the properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, add name-value pairs to the SparkProperties property of the cluster. For example:

```matlab
cluster = parallel.cluster.Spark;
cluster.SparkProperties('spark.driver.memory') = '2048m';
cluster.SparkProperties('spark.executor.memory') = '2048m';
mapreducer(cluster);
```
Version History
Introduced in R2022b