Going Back to Python

Spark offers a few different ways to move your data to and from native Python. One popular way is by converting your Spark dataframe into a Pandas dataframe. Alternatively, you can convert your Spark dataframe directly to a Python list of objects (specifically, a list of "Row" objects, with an attribute for each column in your table).

(Example notebook can be found here)

Converting to Native Python

python

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a Spark DataFrame with some dummy data
df = spark.createDataFrame([(1, 2.0), (1, 4.0), (3, 6.0), (3, 8.0)], ("a", "b"))

# Convert to Python objects with .collect() to get a list of Row objects
data = df.collect()

# Alternatively, you can collect as a list of dictionaries
data = [row.asDict() for row in df.collect()]

Converting to Pandas

python

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a Spark DataFrame with some dummy data
df = spark.createDataFrame([(1, 2.0), (1, 4.0), (3, 6.0), (3, 8.0)], ("a", "b"))

# Convert to Pandas DF with .toPandas()
pdf = df.toPandas()

# Convert Pandas DF to Spark DF using the same .createDataFrame() method
df2 = spark.createDataFrame(pdf)

That's it!