Setting up a Spark Environment
Spark is designed to scale with the amount of compute you have available. It generally requires a bit of work to set up locally, so most people work with Spark in a cloud environment (e.g. Databricks, Dataproc, Colab, EMR, Synapse).
If you have access to a Spark cloud environment, feel free to follow along in a notebook there. If not, there are a few options to get started quickly. We'll focus on two ways to get set up:
- With a cloud notebook via Google Colab
- With a local notebook running in Docker
One last thing: for these examples, we're going to focus on PySpark (the Python API for Spark).
Method #1: Cloud Notebook (with Colab)
If you want to skip setting up anything on your machine, you're in luck. With cloud notebook environments like Google Colab, everything you need to get started is already set up for you. Here's a link to try it out.
Method #2: Local Notebook (with Docker)
If you prefer to work locally, then we can also get started quickly using Docker. Docker provides a containerized environment that can be pre-configured to run whatever software you need. If you're new to Docker, you can learn more here.
1. Download Docker
First, be sure to download and install Docker.
2. Create a file called docker-compose.yaml in your project's root directory
This file will set up a Spark-enabled Jupyter environment. It will also mount the local ./data and ./notebooks folders into the container so you can access them from inside it. If the folders do not already exist, they will be created automatically.
We won't go into the full details of Docker Compose here, but you can read more about it here.
docker-compose.yaml
version: '3.8'
services:
  jupyter:
    image: jupyter/pyspark-notebook:latest
    container_name: jupyter
    working_dir: /notebooks
    ports:
      - 8888:8888
    volumes:
      - ./data/:/data
      - ./notebooks/:/notebooks
When you're done, your folder structure should look something like the following:
myproject/
├─ data/ <- optional - will be created automatically if it doesn't exist
├─ notebooks/ <- optional - will be created automatically if it doesn't exist
╰─ docker-compose.yaml
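If you would rather create the two optional folders yourself than let Docker create them, the layout above can be sketched in a few lines of Python. This is only an illustration: the init_project helper and the FOLDERS constant are hypothetical names, not part of Docker or Jupyter, and the sketch assumes you run it from your project's root directory.

```python
from pathlib import Path

# Folder names taken from the volume mounts in docker-compose.yaml.
FOLDERS = ("data", "notebooks")


def init_project(root: Path) -> list[Path]:
    """Create the data/ and notebooks/ folders under root if they are missing.

    Returns the folders that were newly created (hypothetical helper).
    """
    created = []
    for name in FOLDERS:
        folder = root / name
        if not folder.exists():
            folder.mkdir(parents=True)
            created.append(folder)
    return created


if __name__ == "__main__":
    # Assumes the current working directory is the project root.
    for folder in init_project(Path.cwd()):
        print(f"created {folder}")
```

Running it a second time creates nothing, since existing folders are left untouched.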
3. Start Docker
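To start your Docker container, run the following command in your terminal from the folder that contains docker-compose.yaml (with newer Docker versions, the command is written docker compose up, without the hyphen):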
docker-compose up
Once the container is up and running, you should see a link to access Jupyter in your browser. It will look something like this:
jupyter | [I 2024-01-23 17:50:34.864 ServerApp] Jupyter Server 2.8.0 is running at:
jupyter | [I 2024-01-23 17:50:34.864 ServerApp] http://127.0.0.1:8888/lab?token=xxxxxxxxx
4. Stop Docker
To stop your container, run the following:
docker-compose down