- PySpark Cookbook
- Denny Lee, Tomasz Drabas
How to do it...
To quickly create an RDD, run PySpark on your machine via the bash terminal, or run the same query in a Jupyter notebook. There are two ways to create an RDD in PySpark: you can either use the parallelize() method on a collection (a list or an array of elements), or reference a file (or files) located locally or in an external source, as noted in subsequent recipes.
The following code snippet creates your RDD (myRDD) using the sc.parallelize() method:
myRDD = sc.parallelize([('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)])
To view what is inside your RDD, you can run the following code snippet:
myRDD.take(5)
The output is as follows:
Out[10]: [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]