PySpark Cookbook
Denny Lee, Tomasz Drabas
Spark context parallelize method
Under the covers, quite a few actions happen when you create an RDD. Let's start with the RDD creation and break down this code snippet:
myRDD = sc.parallelize(
    [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]
)
Focusing first on the argument passed to sc.parallelize(), we created a Python list composed of tuples (that is, ('Mike', 19), ('June', 18), ..., ('Scott', 17)). The sc.parallelize() method is the SparkContext method that creates a parallelized collection from that list. This allows Spark to distribute the data across multiple nodes, instead of depending on a single node to process the data.
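To see that distribution at work, here is a minimal sketch (assuming a running SparkContext is already available as sc, for example in the pyspark shell). It passes the optional numSlices argument to request two partitions and then inspects how the elements were split up; the exact split you see may differ depending on your environment:

# Sketch: assumes sc is an existing SparkContext (e.g., from the pyspark shell)
myRDD = sc.parallelize(
    [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)],
    numSlices=2  # ask Spark to split the collection into two partitions
)
print(myRDD.getNumPartitions())  # 2
print(myRDD.glom().collect())    # the elements grouped by partition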

Now that we have created myRDD as a parallelized collection, Spark can operate against this data in parallel. For example, we can call myRDD.reduceByKey(add) to sum the values grouped by key; we cover recipes for RDD operations in subsequent sections of this chapter.
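As a minimal sketch of that reduceByKey(add) call (again assuming an existing SparkContext sc), note that with this sample data every name is a unique key, so summing by key simply returns each pair unchanged; duplicate keys would have their values added together:

from operator import add

myRDD = sc.parallelize(
    [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]
)
# Sum the values for each key; the order of the collected results may vary
print(myRDD.reduceByKey(add).collect())
# e.g., [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]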