Partitioning in Apache Spark
Below is an example of how to use partitioning in Apache Spark:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f
import random
# spark.sql.adaptive.enabled is turned off so adaptive query execution
# does not adjust partition counts behind the scenes
spark = SparkSession.builder \
    .appName('app') \
    .config("spark.sql.adaptive.enabled", False) \
    .getOrCreate()

sc = spark.sparkContext

# Number of partitions Spark uses by default (typically the number of available cores)
sc.defaultParallelism
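When parallelize is called without an explicit partition count, it falls back to sc.defaultParallelism. A quick sketch to confirm this (the range data is just a placeholder):

# Without an explicit numSlices argument, parallelize uses sc.defaultParallelism
default_rdd = sc.parallelize(range(100))
default_rdd.getNumPartitions() == sc.defaultParallelism   # True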
Influencing the level of parallelism when creating an RDD:
rdd = sc.parallelize([("A", 2), ("A", 7), ("A", 12), ("B", 8), ("B", 5), ("B", 10)], 10)
rdd.getNumPartitions()   # 10
# glom() collects the elements of each partition into a list,
# showing how the six records are spread across the ten partitions
rdd.glom().collect()
[[],
[('A', 2)],
[],
[('A', 7)],
[('A', 12)],
[],
[('B', 8)],
[],
[('B', 5)],
[('B', 10)]]
Since there are only six elements for ten partitions, four partitions stay empty. Counting the elements per partition makes that easier to see:
rdd.glom().map(len).collect()
[0, 1, 0, 1, 1, 0, 1, 0, 1, 1]
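For a pair RDD like this one, the partition each key lands in can also be chosen explicitly. A minimal sketch, reusing the same rdd and passing an explicit partitioning function to partitionBy (the two-partition layout and the lambda are arbitrary illustrative choices):

# partitionBy shuffles a key/value RDD so that keys are placed by the given
# partitioning function: here all "A" keys go to partition 0, the rest to 1
by_key = rdd.partitionBy(2, lambda key: 0 if key == "A" else 1)
by_key.getNumPartitions()   # 2
by_key.glom().collect()
# [[('A', 2), ('A', 7), ('A', 12)], [('B', 8), ('B', 5), ('B', 10)]]
# (the order of elements within each partition may vary)

Without the explicit function, partitionBy falls back to hashing the keys, which also keeps identical keys in the same partition.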