pyspark.RDD.takeSample

RDD.takeSample(withReplacement, num, seed=None)

Return a fixed-size sampled subset of this RDD.

New in version 1.3.0.

Parameters

withReplacement : bool
    whether sampling is done with replacement
num : int
    size of the returned sample
seed : int, optional
    random seed

Returns

list
    a fixed-size sampled subset of this RDD, returned as a Python list

See also

RDD.sample()

Notes

This method should only be used if the resulting list is expected to be small, because all of the sampled data is loaded into the driver’s memory.

Examples

>>> import sys
>>> rdd = sc.parallelize(range(0, 10))
>>> len(rdd.takeSample(True, 20, 1))
20
>>> len(rdd.takeSample(False, 5, 2))
5
>>> len(rdd.takeSample(False, 15, 3))
10
>>> sc.range(0, 10).takeSample(False, sys.maxsize)
Traceback (most recent call last):
    ...
ValueError: Sample size cannot be greater than ...
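
The sample sizes in the examples above (20, 5, and 10) follow from the `withReplacement` flag: with replacement, elements may repeat, so the sample can be larger than the RDD; without replacement, the sample cannot contain more elements than the RDD holds. The sketch below reproduces that behaviour on a plain Python list using only the standard library. It is a local illustration, not Spark's implementation: `take_sample_local` is a hypothetical helper, and its results for a given seed will not match those of `RDD.takeSample`.

```python
import random


def take_sample_local(data, with_replacement, num, seed=None):
    """Hypothetical local analogue of RDD.takeSample for a plain list.

    Illustration only: this is not how Spark samples, and the values
    drawn for a given seed differ from those Spark would return.
    """
    if num < 0:
        # Guard chosen for this sketch; a negative size makes no sense here.
        raise ValueError("num must be non-negative")
    items = list(data)
    if num == 0 or not items:
        return []
    rng = random.Random(seed)  # private generator, so the seed is reproducible
    if with_replacement:
        # Elements may repeat, so the sample can exceed the input size.
        return rng.choices(items, k=num)
    # Without replacement, the sample is capped at the input size.
    return rng.sample(items, min(num, len(items)))


data = list(range(10))
with_repl = take_sample_local(data, True, 20, seed=1)
without_repl = take_sample_local(data, False, 5, seed=2)
capped = take_sample_local(data, False, 15, seed=3)
print(len(with_repl), len(without_repl), len(capped))  # prints: 20 5 10
```

Passing the same seed twice returns the same sample in this sketch, which is the usual reason to set `seed` when a reproducible result is needed.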