Column Selection

There are a few different ways to select columns from a dataframe.

(Example notebook can be found here)

Build a dummy dataframe

Let's create a simple row of data to work with using the spark.sql() function.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_src = spark.sql("select 'John' as name, 23 as age, 2000 as birth_year")
df_src.show()

#  +----+---+----------+
#  |name|age|birth_year|
#  +----+---+----------+
#  |John| 23|      2000|
#  +----+---+----------+

Using .select()

This is the most common way to select columns. You can pass in a list of column names or Column objects.

Python

from pyspark.sql import functions as F

df = df_src.select(
    "name",                                 # You can pass column names as a <string>
    "age",
    df_src["age"].alias("birth_age1"),      # You can use bracket notation df["column_name"]
    F.col("age").alias("birth_age2"),       # You can use a column function F.col("column_name")
    F.lit(2024).alias("current_year"),      # You can create literals on the fly with F.lit()
    F.expr("age * 2 as double_birth_age"),  # You can also use raw SQL logic with F.expr()
    F.expr("2024 as current_year"),         # You can also create literals with F.expr()
)
df.show()

#  +----+---+----------+----------+------------+----------------+------------+
#  |name|age|birth_age1|birth_age2|current_year|double_birth_age|current_year|
#  +----+---+----------+----------+------------+----------------+------------+
#  |John| 23|        23|        23|        2024|              46|        2024|
#  +----+---+----------+----------+------------+----------------+------------+

Using .selectExpr()

With the .selectExpr() method, all arguments are treated as SQL statements, as if you were using F.expr().

df = df_src.selectExpr(
    "name",
    "age",
    "age as birth_year",
    "2024 as current_year",
    "age * 2 as double_birth_age",
)
df.show()

#  +----+---+----------+------------+----------------+
#  |name|age|birth_year|current_year|double_birth_age|
#  +----+---+----------+------------+----------------+
#  |John| 23|        23|        2024|              46|
#  +----+---+----------+------------+----------------+