pyspark.pandas.DataFrame.spark.cache
- spark.cache()
Yields and caches the current DataFrame.
The pandas-on-Spark DataFrame is yielded as a protected resource and its data is cached. The cache is released (uncached) once execution leaves the context.
If you want to specify the StorageLevel manually, use DataFrame.spark.persist().
See also
DataFrame.spark.persist
Examples
>>> df = ps.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df
   dogs  cats
0   0.2   0.3
1   0.0   0.6
2   0.6   0.0
3   0.2   0.1
>>> with df.spark.cache() as cached_df:
...     print(cached_df.count())
...
dogs    4
cats    4
dtype: int64
>>> df = df.spark.cache()
>>> df.to_pandas().mean(axis=1)
0    0.25
1    0.30
2    0.30
3    0.15
dtype: float64
To uncache the DataFrame, use the unpersist function:
>>> df.spark.unpersist()
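The two usage patterns shown in these examples (scoped caching in a with block, and unscoped caching followed by an explicit unpersist call) can be pictured with a toy model in plain Python. This is only a sketch of the behaviour described on this page: ToyCachedFrame and its is_cached attribute are hypothetical names invented for illustration, not part of pyspark, and the actual implementation may differ.

```python
class ToyCachedFrame:
    """Hypothetical stand-in for a cached DataFrame; not a pyspark class."""

    def __init__(self, data):
        self.data = data
        self.is_cached = True  # assume caching happens on creation

    def unpersist(self):
        # Release the cache; harmless if called more than once.
        self.is_cached = False

    def __enter__(self):
        # The object itself is the protected resource handed to the block.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with block exits, even if the block raised.
        self.unpersist()
        return False  # do not swallow exceptions


# Scoped use: the cache lives only inside the with block.
with ToyCachedFrame([1, 2, 3]) as scoped:
    inside = scoped.is_cached
after = scoped.is_cached

# Unscoped use: the cache stays until unpersist() is called explicitly.
manual = ToyCachedFrame([4, 5, 6])
before_unpersist = manual.is_cached
manual.unpersist()
after_unpersist = manual.is_cached
```

In this model, the with form is the safer default because the cleanup cannot be forgotten, while the unscoped form suits a cache that must outlive a single block.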