Below is an example of how to use partitioning in Apache Spark.

First, create a `SparkSession` (adaptive query execution is disabled here so partition counts stay predictable) and check the default level of parallelism:

```python
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f
import random

spark = SparkSession.builder \
    .appName('app') \
    .config("spark.sql.adaptive.enabled", False) \
    .getOrCreate()
sc = spark.sparkContext

sc.defaultParallelism
```

Influencing the level of parallelism when creating an RDD: the second argument to `parallelize` sets the number of partitions (here 10, for only 6 elements, so some partitions end up empty). `glom()` turns each partition into a list, which makes the distribution visible:

```python
rdd = sc.parallelize([("A", 2), ("A", 7), ("A", 12), ("B", 8), ("B", 5), ("B", 10)], 10)

rdd.getNumPartitions()
# 10

rdd.glom().collect()
# [[], [('A', 2)], [], [('A', 7)], [('A', 12)], [], [('B', 8)], [], [('B', 5)], [('B', 10)]]

rdd.glom().map(len).collect()
# [0, 1, 0, 1, 1, 0, 1, 0, 1, 1]
```
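To see why the six pairs land in those particular partitions: when Spark parallelizes a local collection, it cuts the list into `numSlices` contiguous slices using integer arithmetic on the element positions. Here is a minimal plain-Python sketch of that slicing (an illustrative reimplementation, not Spark's actual code; `slice_positions` is a made-up name):

```python
def slice_positions(n_elements, n_slices):
    # For each slice i, compute the half-open index range
    # [i*n/k, (i+1)*n/k) -- the same contiguous-slice scheme
    # Spark uses when parallelizing a local collection.
    for i in range(n_slices):
        start = (i * n_elements) // n_slices
        end = ((i + 1) * n_elements) // n_slices
        yield start, end

data = [("A", 2), ("A", 7), ("A", 12), ("B", 8), ("B", 5), ("B", 10)]
partitions = [data[s:e] for s, e in slice_positions(len(data), 10)]

print([len(p) for p in partitions])
# [0, 1, 0, 1, 1, 0, 1, 0, 1, 1]
```

The sizes match the `glom()` output above: with 6 elements and 10 slices, several ranges are empty, which is why some partitions hold no data.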