matlab.compiler.mlspark.RDD Class
Namespace: matlab.compiler.mlspark
Interface class to represent a Spark Resilient Distributed Dataset (RDD)
Description
A Resilient Distributed Dataset, or RDD, is a programming abstraction in Spark™. It represents a collection of elements distributed across many nodes that can be operated on in parallel. All work in Spark is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. You can create RDDs in two ways:
By loading an external dataset
By parallelizing a collection of objects in the driver program
Once an RDD is created, you can perform two types of operations on it: transformations, which define a new RDD from an existing one, and actions, which compute a result and return it to the driver program.
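The distinction can be sketched as follows. Transformations are lazy: they describe a new RDD but do not compute anything until an action runs. In this sketch, the application name and the `'local[1]'` master URL are illustrative assumptions; adapt them to your deployment.

```matlab
% Minimal sketch: creating an RDD, then a transformation vs. an action.
% 'AppName' and 'Master' values here are illustrative assumptions.
sparkConf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'rddBasics', 'Master', 'local[1]');
sc = matlab.compiler.mlspark.SparkContext(sparkConf);

% Create an RDD by parallelizing a local collection in the driver program
numbers = sc.parallelize({1, 2, 3, 4, 5});

% Transformation (lazy): describes a new RDD; nothing is computed yet
doubled = numbers.map(@(x) 2 * x);

% Action (eager): triggers computation and returns a cell array of results
result = doubled.collect();
```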
Construction
An RDD object can only be created using the methods of the SparkContext class. A collection of SparkContext methods used to create RDDs is listed below for convenience. See the documentation of the SparkContext class for more information.
| SparkContext Method Name | Purpose |
|---|---|
| parallelize | Create an RDD from local MATLAB® values |
| datastoreToRDD | Convert a MATLAB datastore to a Spark RDD |
| textFile | Create an RDD from a text file |
Once an RDD has been created using a method from the SparkContext class, you can use any of the methods in the RDD class to manipulate your RDD.
Properties
The properties of this class are hidden.
Methods
Transformations
| Method | Purpose |
|---|---|
| aggregateByKey | Aggregate the values of each key, using given combine functions and a neutral “zero value” |
| cartesian | Create an RDD that is the Cartesian product of two RDDs |
| coalesce | Reduce the number of partitions in an RDD |
| cogroup | Group data from RDDs sharing the same key |
| combineByKey | Combine the elements for each key using a custom set of aggregation functions |
| distinct | Return a new RDD containing the distinct elements of an existing RDD |
| filter | Return a new RDD containing only the elements that satisfy a predicate function |
| flatMap | Return a new RDD by first applying a function to all elements of an existing RDD, and then flattening the results |
| flatMapValues | Pass each value in the key-value pair RDD through a flatMap method without changing the keys |
| foldByKey | Merge the values for each key using an associative function and a neutral “zero value” |
| fullOuterJoin | Perform a full outer join between two key-value pair RDDs |
| glom | Coalesce all elements within each partition of an RDD |
| groupBy | Return an RDD of grouped items |
| groupByKey | Group the values for each key in the RDD into a single sequence |
| intersection | Return the set intersection of one RDD with another |
| join | Return an RDD containing all pairs of elements with matching keys |
| keyBy | Create tuples of the elements in an RDD by applying a function |
| keys | Return an RDD with the keys of each tuple |
| leftOuterJoin | Perform a left outer join |
| map | Return a new RDD by applying a function to each element of an input RDD |
| mapValues | Pass each value in a key-value pair RDD through a map function without modifying the keys |
| reduceByKey | Merge the values for each key using an associative reduce function |
| repartition | Return a new RDD that has exactly numPartitions partitions |
| rightOuterJoin | Perform a right outer join |
| sortBy | Sort an RDD by a given function |
| sortByKey | Sort an RDD consisting of key-value pairs by key |
| subtract | Return the values resulting from the set difference between two RDDs |
| subtractByKey | Return key-value pairs resulting from the set difference of keys between two RDDs |
| union | Return the set union of one RDD with another |
| values | Return an RDD with the values of each tuple |
| zip | Zip one RDD with another |
| zipWithIndex | Zip an RDD with its element indices |
| zipWithUniqueId | Zip an RDD with generated unique Long IDs |
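A few of the transformations above can be chained as in this sketch. It assumes an existing SparkContext `sc`, that key-value pairs are represented as two-element cell arrays `{key, value}`, and that the number of partitions passed to reduceByKey is chosen by the caller; none of these details are stated in the tables above.

```matlab
% Hedged sketch of chaining transformations, assuming a SparkContext sc.
words = sc.parallelize({'spark', 'rdd', 'spark', 'matlab', 'rdd'});

% map each word to a {key, value} pair (assumed cell-array representation)
pairs = words.map(@(w) {w, 1});

% reduceByKey: merge values per key with an associative function
% (the final argument, the number of partitions, is an assumption here)
counts = pairs.reduceByKey(@(a, b) a + b, 1);

% filter: keep only the elements satisfying a predicate
evens = sc.parallelize({1, 2, 3, 4, 5, 6}).filter(@(x) mod(x, 2) == 0);
```

Transformations like these return new RDDs, so they compose naturally; no computation occurs until an action such as collect is called on the result.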
Actions
| Method | Purpose |
|---|---|
| aggregate | Aggregate the elements of each partition and subsequently the results for all partitions into a single value |
| collect | Return a MATLAB cell array that contains all of the elements in an RDD |
| collectAsMap | Return the key-value pairs in an RDD as a MATLAB containers.Map object |
| count | Count the number of elements in an RDD |
| fold | Aggregate the elements of each partition and the subsequent results for all partitions |
| reduce | Reduce the elements of an RDD using the specified commutative and associative function |
| reduceByKeyLocally | Merge the values for each key using an associative reduce function, but return the results immediately to the driver |
| saveAsKeyValueDatastore | Save a key-value RDD as a binary file that can be read back using the datastore function |
| saveAsTallDatastore | Save an RDD as a MATLAB tall array to a binary file that can be read back using the datastore function |
| saveAsTextFile | Save an RDD as a text file |
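Unlike transformations, the actions above trigger computation and bring results back to the driver program (or write them to storage). This sketch assumes an existing SparkContext `sc`; the output directory name is a hypothetical example.

```matlab
% Hedged sketch of actions, assuming a SparkContext sc.
rdd = sc.parallelize({10, 20, 30});

n = rdd.count();                    % number of elements in the RDD
elems = rdd.collect();              % cell array of all elements, in driver
total = rdd.reduce(@(x, y) x + y);  % commutative, associative reduce

% Persist the RDD contents to storage (directory name is illustrative)
rdd.saveAsTextFile('outputDir');
```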
Operations
| Method | Purpose |
|---|---|
| cache | Store an RDD in memory |
| checkpoint | Mark an RDD for checkpointing |
| getCheckpointFile | Get the name of the file to which an RDD is checkpointed |
| getDefaultReducePartitions | Get the number of default reduce partitions in an RDD |
| getNumPartitions | Return the number of partitions in an RDD |
| isEmpty | Determine if an RDD contains any elements |
| keyLimit | Return the threshold of unique keys that can be stored before spilling to disk |
| persist | Set the value of an RDD’s storage level to persist across operations after it is computed |
| toDebugString | Obtain a description of an RDD and its recursive dependencies for debugging |
| unpersist | Mark an RDD as nonpersistent, and remove all blocks for it from memory and disk |
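Operations such as those above inspect or control an RDD rather than producing a new one; caching is useful when the same RDD feeds several actions. A sketch, assuming an existing SparkContext `sc`:

```matlab
% Hedged sketch of RDD operations, assuming a SparkContext sc.
rdd = sc.parallelize({1, 2, 3, 4});

rdd.cache();                       % keep the RDD in memory across actions
p = rdd.getNumPartitions();        % inspect how the data is partitioned
tf = rdd.isEmpty();                % check whether the RDD has any elements
disp(rdd.toDebugString());         % describe the RDD and its dependencies
rdd.unpersist();                   % release the cached blocks
```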
References
See the latest Spark documentation for more information.
Version History
Introduced in R2016b