

Yields and caches the current DataFrame.

The pandas-on-Spark DataFrame is yielded as a protected resource and its data is cached. The data is uncached once execution leaves the context.
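
The cache-on-entry, uncache-on-exit behaviour described above can be pictured with a small pure-Python analogy. This is only a sketch: `CachedFrame` and its `is_cached` flag are hypothetical names invented for illustration and are not the library's implementation.

```python
class CachedFrame:
    """Hypothetical stand-in for a cached DataFrame (illustration only)."""

    def __init__(self, data):
        self.data = data
        self.is_cached = True      # analogous to caching the data on creation

    def unpersist(self):
        self.is_cached = False     # analogous to releasing the cached data

    def __enter__(self):
        return self                # the frame is handed to the with-block

    def __exit__(self, exc_type, exc, tb):
        self.unpersist()           # uncache when execution leaves the context
        return False               # do not swallow exceptions


with CachedFrame([1, 2, 3]) as frame:
    inside = frame.is_cached       # cached while inside the block
outside = frame.is_cached          # uncached once the block has exited

manual = CachedFrame([4, 5, 6])    # used without `with`, stays cached...
manual_before = manual.is_cached
manual.unpersist()                 # ...until unpersist is called explicitly
manual_after = manual.is_cached
```

In this sketch, the object behaves the same way whether it is used in a `with` block or assigned directly; only the point at which the cache is released differs.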

If you want to specify the StorageLevel manually, use DataFrame.spark.persist().

See also



>>> df = ps.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df
   dogs  cats
0   0.2   0.3
1   0.0   0.6
2   0.6   0.0
3   0.2   0.1
>>> with df.spark.cache() as cached_df:
...     print(cached_df.count())
dogs    4
cats    4
dtype: int64
>>> df = df.spark.cache()
>>> df.to_pandas().mean(axis=1)
0    0.25
1    0.30
2    0.30
3    0.15
dtype: float64

To uncache the DataFrame, use the unpersist function.

>>> df.spark.unpersist()