Modifying Dataframes
Let's set up some test data and create some new calculated columns.
(Example notebook can be found here)
Note: We're now including from pyspark.sql import functions as F
with our imports.
This is a common convention in PySpark and makes it easier to quickly access any functions in the module
that you might need for development.
PySpark
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
# Create dummy dataframe
df_src = spark.sql("""
select 'John' as name, 23 as age, 2000 as birth_year UNION ALL
select 'Jessie' as name, 24 as age, 1999 as birth_year
""")
df_src.show()
# +------+---+----------+
# | name|age|birth_year|
# +------+---+----------+
# | John| 23| 2000|
# |Jessie| 24| 1999|
# +------+---+----------+
Sequential updates
One method for making changes to a dataframe is to create or update a dataframe through sequential variable assignment using =
.
Python
df = df_src
df = df.withColumn("double_age", F.col("age") * 2) # Using column object
df = df.withColumn("quad_age", F.expr("age * 4")) # Using expressions
df = df.withColumn("current_year", F.lit(2024)) # Assigning a literal value
df = df.withColumn("current_year2", F.expr("2024")) # You can also assign a literal from an expression
df.show()
# +------+---+----------+----------+--------+------------+-------------+
# | name|age|birth_year|double_age|quad_age|current_year|current_year2|
# +------+---+----------+----------+--------+------------+-------------+
# | John| 23| 2000| 46| 92| 2024| 2024|
# |Jessie| 24| 1999| 48| 96| 2024| 2024|
# +------+---+----------+----------+--------+------------+-------------+
Chained updates
You can also update your dataframe directly with method chaining. This is a bit more concise, but at the expense of being a bit more complex to debug, since you are applying multiple operations in a single step.
Python
df = (
df_src
.withColumn("double_age", F.col("age") * 2) # Using column object
.withColumn("quad_age", F.expr("age * 4")) # Using expressions
.withColumn("current_year", F.lit(2024)) # Assigning a literal value
.withColumn("current_year2", F.expr("2024")) # You can also assign a literal from an expression
)
df.show()
# +------+---+----------+----------+--------+------------+-------------+
# | name|age|birth_year|double_age|quad_age|current_year|current_year2|
# +------+---+----------+----------+--------+------------+-------------+
# | John| 23| 2000| 46| 92| 2024| 2024|
# |Jessie| 24| 1999| 48| 96| 2024| 2024|
# +------+---+----------+----------+--------+------------+-------------+