Setting up a Spark Environment

Spark is designed to scale with the amount of compute you have available. It generally it requires a bit of work to set up locally, so most people tend to work with Spark in a cloud environment (e.g. Databricks, Dataproc, Colab, EMR, Synapse).

If you have access to a Spark cloud environment, then feel free to follow along in a notebook there. If not, there are a few options to quickly get started. We'll focus on two different ways to get setup:

  1. With a cloud notebook via Google Colab
  2. With a local notebook running in Docker

One last thing, for these examples, we're going to focus on PySpark (the Python API for Spark).

Method #1: Cloud Notebook (with Colab)

If you want to skip setting up anything on your machine, you're in luck. With cloud notebook environments like Google Colab, everything you need to get started is already set up for you. Here's a link to try it out.

Method #2: Local Notebook (with Docker)

If you prefer to work locally, then we can also get started quickly using Docker. Docker provides a containerized environment that can be pre-configured to run whatever software you need. If you're new to Docker, you can learn more here.

1. Download Docker

First, be sure to download and install docker.

2. Create a file called docker-compose.yaml in your project's root directory

This file will setup a Spark-enabled Jupyter environment. It will also mount local ./data and ./notebooks folders to the container so you can access them from inside the Docker container. If the folders do not exist already they will be created automatically.

We won't go into the full details of Docker Compose here, but you can read more about it here.

docker-compose.yaml

version: '3.8'
services:
  jupyter:
    image: jupyter/pyspark-notebook:latest
    container_name: jupyter
    working_dir: /notebooks
    ports:
      - 8888:8888
    volumes:
      - ./data/:/data
      - ./notebooks/:/notebooks

When you're done, your folder structure should look something like the following:

myproject/
  ├─ data/        <- optional - will be created automatically if it doesn't exist
  ├─ notebooks/   <- optional - will be created automatically if it doesn't exist
  ╰─ docker-compose.yaml

3. Start Docker

To start your docker container run the following command in your terminal:

docker-compose up

One the container is up and running, you should see a link to access Jupyter in your browser. It will look something like this:

jupyter  | [I 2024-01-23 17:50:34.864 ServerApp] Jupyter Server 2.8.0 is running at:
jupyter  | [I 2024-01-23 17:50:34.864 ServerApp] http://127.0.0.1:8888/lab?token=xxxxxxxxx

4. Stop Docker

To stop your container run the following:

docker-compose down