- PySpark Cookbook
- Denny Lee Tomasz Drabas
.sample(...) transformation
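The sample(withReplacement, fraction, seed) transformation returns a random sample of the data, drawn with or without replacement (the withReplacement parameter). The fraction parameter sets the expected size of the sample relative to the RDD, so the number of elements returned is approximate rather than exact, and seed fixes the random number generator so that the sample can be reproduced.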
Look at the following code snippet:
# Sample roughly 0.1% (fraction = 0.001) of the
# flights RDD, keeping only the fourth column
# (origin city of flight), without replacement
# (False) and with a random seed of 123
(
    flights
    .map(lambda c: c[3])
    .sample(False, 0.001, 123)
    .take(5)
)
We can expect the following result:
# Output
[u'ABQ', u'AEX', u'AGS', u'ANC', u'ATL']
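To see why fraction gives an expected sample size rather than an exact one, the following is a minimal pure-Python sketch of sampling without replacement, where each element is kept independently with probability fraction. It illustrates the idea only and is not Spark's implementation; the bernoulli_sample helper and the codes list are hypothetical stand-ins for the RDD and its origin-city column.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`.

    Hypothetical helper: it mimics the idea behind sampling without
    replacement, not the actual Spark code path.
    """
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


# Hypothetical stand-in for the origin-city column of the flights RDD.
codes = ["A%04d" % i for i in range(10000)]

first = bernoulli_sample(codes, 0.001, 123)
second = bernoulli_sample(codes, 0.001, 123)

# The expected sample size is 10000 * 0.001 = 10, but the actual
# count varies from seed to seed.
print(len(first))

# The same seed over the same data reproduces the same sample.
print(first == second)
```

Changing the seed changes which elements are kept and usually the count as well, which is why a fixed seed is useful when a result needs to be repeatable.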