pyspark.sql.DataFrame.mapInPandas
DataFrame.mapInPandas(func: PandasMapIterFunction, schema: Union[pyspark.sql.types.StructType, str], barrier: bool = False) → DataFrame

Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.

The function should take an iterator of pandas.DataFrames and return another iterator of pandas.DataFrames. All columns are passed together as an iterator of pandas.DataFrames to the function, and the returned iterator of pandas.DataFrames is combined into a DataFrame. The size of each pandas.DataFrame can be controlled by spark.sql.execution.arrow.maxRecordsPerBatch. The function's input and output can differ in size.

New in version 3.0.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- func : function
a Python native function that takes an iterator of pandas.DataFrames, and outputs an iterator of pandas.DataFrames.
- schema : pyspark.sql.types.DataType or str
the return type of the func in PySpark. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.
- barrier : bool, optional, default False
Use barrier mode execution.
See also
Notes
This API is experimental.
Examples
>>> from pyspark.sql.functions import pandas_udf
>>> df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
>>> def filter_func(iterator):
...     for pdf in iterator:
...         yield pdf[pdf.id == 1]
...
>>> df.mapInPandas(filter_func, df.schema).show()
+---+---+
| id|age|
+---+---+
|  1| 21|
+---+---+
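The description notes that the function's input and output can differ in size. The sketch below illustrates this with a hypothetical helper, `count_rows` (not part of PySpark), that yields one summary row per input batch. It is exercised on a plain iterator of pandas DataFrames so that it runs without a Spark session; the `mapInPandas` call is shown only as a comment and assumes the `df` defined above.

```python
from typing import Iterator

import pandas as pd


def count_rows(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Hypothetical helper: emit one single-row summary per input batch."""
    for pdf in batches:
        # The yielded frame has a different shape and different columns
        # than the input batch.
        yield pd.DataFrame({"rows": [len(pdf)]})


# Local check on hand-built batches (no Spark needed): sizes 2 and 1.
batches = [
    pd.DataFrame({"id": [1, 2], "age": [21, 30]}),
    pd.DataFrame({"id": [3], "age": [45]}),
]
summary = pd.concat(count_rows(iter(batches)), ignore_index=True)

# With a SparkSession, this would be applied as (assumed usage):
# df.mapInPandas(count_rows, "rows long")
```

Because the helper emits one row per batch, the number of output rows under Spark would depend on how the data is split into batches, which partitioning and spark.sql.execution.arrow.maxRecordsPerBatch both influence.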
Set barrier to True to force the mapInPandas stage to run in barrier mode, which ensures that all Python workers in the stage are launched concurrently.

>>> df.mapInPandas(filter_func, df.schema, barrier=True).show()
+---+---+
| id|age|
+---+---+
|  1| 21|
+---+---+
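The schema parameter also accepts a DDL-formatted type string, which can be convenient when the function returns columns that differ from the input. The following is a minimal sketch under that assumption: `add_age_next_year` is a hypothetical helper, the check runs on a plain pandas iterator rather than on Spark, and the DDL string in the trailing comment assumes the `id` and `age` columns are inferred as long.

```python
from typing import Iterator

import pandas as pd


def add_age_next_year(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Hypothetical helper: append an `age_next_year` column to each batch."""
    for pdf in batches:
        # assign() returns a new DataFrame and leaves the input batch untouched.
        yield pdf.assign(age_next_year=pdf["age"] + 1)


# Local check on a single hand-built batch (no Spark needed).
sample = pd.DataFrame({"id": [1, 2], "age": [21, 30]})
result = pd.concat(add_age_next_year(iter([sample])), ignore_index=True)

# The schema passed to mapInPandas should describe the yielded columns,
# for example as a DDL string (assumed usage):
# df.mapInPandas(add_age_next_year, "id long, age long, age_next_year long")
```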