pyspark.RDD.takeSample
- RDD.takeSample(withReplacement, num, seed=None)
Return a fixed-size sampled subset of this RDD.
New in version 1.3.0.
- Parameters
- withReplacement : bool
whether sampling is done with replacement
- num : int
size of the returned sample
- seed : int, optional
random seed
- Returns
- list
a fixed-size sampled subset of this RDD in an array
See also
Notes
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
Examples
>>> import sys
>>> rdd = sc.parallelize(range(0, 10))
>>> len(rdd.takeSample(True, 20, 1))
20
>>> len(rdd.takeSample(False, 5, 2))
5
>>> len(rdd.takeSample(False, 15, 3))
10
>>> sc.range(0, 10).takeSample(False, sys.maxsize)
Traceback (most recent call last):
...
ValueError: Sample size cannot be greater than ...
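The sample sizes in the examples above (20, 5, and 10) follow from the withReplacement flag: with replacement, elements may repeat, so the sample can be larger than the RDD; without replacement, the sample cannot exceed the number of elements. The sketch below mimics that size behaviour on a plain Python sequence using only the standard library. The function take_sample_local is a hypothetical helper written for illustration; it is not part of PySpark and does not reflect how Spark performs distributed sampling.

```python
import random


def take_sample_local(data, with_replacement, num, seed=None):
    """Hypothetical local analogue of the takeSample size semantics.

    This is an illustration only, not Spark's implementation.
    """
    items = list(data)
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    if num == 0 or not items:
        return []
    if with_replacement:
        # Elements may repeat, so the sample can exceed the input size.
        return rng.choices(items, k=num)
    # Without replacement, the sample is capped at the input size.
    return rng.sample(items, min(num, len(items)))


data = range(0, 10)
print(len(take_sample_local(data, True, 20, 1)))   # 20
print(len(take_sample_local(data, False, 5, 2)))   # 5
print(len(take_sample_local(data, False, 15, 3)))  # 10
```

In this local sketch, calling the helper twice with the same seed returns the same sample, which is the purpose of a seed argument in general.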