Modifying Dataframes

Let's set up some test data and create some new calculated columns.

(Example notebook can be found here)

Note: We're now including from pyspark.sql import functions as F with our imports. This is a common convention in PySpark and makes it easier to quickly access any functions in the module that you might need for development.

PySpark

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Create dummy dataframe
df_src = spark.sql("""
select 'John' as name, 23 as age, 2000 as birth_year UNION ALL
select 'Jessie' as name, 24 as age, 1999 as birth_year
""")
df_src.show()

#  +------+---+----------+
#  |  name|age|birth_year|
#  +------+---+----------+
#  |  John| 23|      2000|
#  |Jessie| 24|      1999|
#  +------+---+----------+

Sequential updates

One method for making changes to a dataframe is to create or update a dataframe through sequential variable assignment using =.

Python

df = df_src
df = df.withColumn("double_age", F.col("age") * 2)  # Using column object
df = df.withColumn("quad_age", F.expr("age * 4"))   # Using expressions
df = df.withColumn("current_year", F.lit(2024))     # Assigning a literal value
df = df.withColumn("current_year2", F.expr("2024")) # You can also assign a literal from an expression
df.show()

#  +------+---+----------+----------+--------+------------+-------------+
#  |  name|age|birth_year|double_age|quad_age|current_year|current_year2|
#  +------+---+----------+----------+--------+------------+-------------+
#  |  John| 23|      2000|        46|      92|        2024|         2024|
#  |Jessie| 24|      1999|        48|      96|        2024|         2024|
#  +------+---+----------+----------+--------+------------+-------------+

Chained updates

You can also update your dataframe directly with method chaining. This is a bit more concise, but at the expense of being a bit more complex to debug, since you are applying multiple operations in a single step.

Python

df = (
    df_src
    .withColumn("double_age", F.col("age") * 2)  # Using column object
    .withColumn("quad_age", F.expr("age * 4"))   # Using expressions
    .withColumn("current_year", F.lit(2024))     # Assigning a literal value
    .withColumn("current_year2", F.expr("2024")) # You can also assign a literal from an expression
)
df.show()

#  +------+---+----------+----------+--------+------------+-------------+
#  |  name|age|birth_year|double_age|quad_age|current_year|current_year2|
#  +------+---+----------+----------+--------+------------+-------------+
#  |  John| 23|      2000|        46|      92|        2024|         2024|
#  |Jessie| 24|      1999|        48|      96|        2024|         2024|
#  +------+---+----------+----------+--------+------------+-------------+