There's more...
Now that we have everything in place, let's see what this can do.
First, start Jupyter (note that we do not use the pyspark command):
jupyter notebook
When you add a new notebook, you should now see the Sparkmagic kernels (such as PySpark) listed as options alongside the regular Python kernels.
If you click on PySpark, it will open a new notebook connected to the PySpark kernel.
There are a number of magics available for interacting with the notebook; type %%help in a cell to list them all. Among the most important are %%configure (change the configuration of the Livy session), %%sql (execute a Spark SQL query against the current session), %%info (display information about the current Livy sessions), %%local (run the cell contents locally instead of on the cluster), and %%cleanup (delete all sessions for the current Livy endpoint).
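For example, you can use the %%configure magic to change the resources Livy requests for the session before it starts. A minimal sketch follows; the property values are illustrative assumptions, and the -f flag forces the session to be recreated if one is already running:
%%configure -f
{"executorMemory": "2g", "executorCores": 2}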

Once you have configured your session, Livy sends back information about the active sessions that are currently running.
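You can ask for this information again at any point with the %%info magic, which runs on its own in a cell and lists the current Livy endpoint and its sessions:
%%info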

Let's try to create a simple DataFrame using the following code:
from pyspark.sql.types import *

# Generate our data
ListRDD = sc.parallelize([
    (123, 'Skye', 19, 'brown'),
    (223, 'Rachel', 22, 'green'),
    (333, 'Albert', 23, 'blue')
])

# The schema is encoded using StructType
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True)
])

# Apply the schema to the RDD and create a DataFrame
drivers = spark.createDataFrame(ListRDD, schema)

# Create a temporary view using the DataFrame
drivers.createOrReplaceTempView("drivers")
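As a quick sanity check (a minimal verification using standard DataFrame methods), you can print the schema and preview the rows in a new cell:
drivers.printSchema()
drivers.show()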
Note that the SparkSession is created only once you execute the preceding code in a cell inside the notebook; Sparkmagic then reports the progress of starting the Spark application in the cell output.

If you execute the %%sql magic, the query runs against the session on the cluster and the results are rendered in the notebook.
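For example, the following cell (a minimal query against the drivers view we just registered) returns the rows as a table:
%%sql
SELECT id, name, age, eyeColor FROM drivers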
