Spark & Databricks cost optimization

Your Spark jobs are quietly wasting money. We show you exactly where.

We analyze your production Spark workloads and attach dollar amounts to every inefficiency — missed broadcasts, unfiltered partition scans, accidental Cartesian joins. You get a prioritized report with fix recommendations your engineers can ship the same week.

The problems hiding in your plans

Most Spark performance issues don't crash your jobs — they just make them quietly expensive. The Databricks bill goes up, but nobody can connect the dollars to specific decisions in the query plan.

Missed broadcast joins

A 12 MB lookup table gets shuffled alongside a 500 GB fact table because the broadcast threshold was lowered or the table's stats are stale. Two unnecessary shuffles on every run.

Unfiltered partition scans

Your query needs one month of data, but Spark reads all 365 daily partitions because the filter doesn't align with the partition column. Over 90% of the I/O is wasted.

Config drift

Someone lowered the broadcast threshold six months ago to fix an OOM. Now every join that could broadcast doesn't. Nobody remembers the change.

What we find

Below is the actual report format produced by our analysis engine. In your report, savings are calibrated to your DBU rate, cluster shape, and job frequency — not generic benchmarks.

⚡ Cluster Yield CI — Spark Plan Analysis
🔴 2 critical 🟡 1 warning
🔴 [daily-revenue-etl] Full scan on prod.analytics.orders — no filter on partition column order_date
💰 Estimated savings: ~$4,800/month
Table prod.analytics.orders is partitioned by order_date but the scan has no partition filters. Spark will read every partition of the table. The full table is 800 GB.

No filter on order_date was found anywhere in the plan.
What to do:
from pyspark.sql.functions import col

(spark.table("prod.analytics.orders")
    .filter(col("order_date") >= "2025-01-01")  # enables partition pruning
    .select(...))
Raw details
tableName: prod.analytics.orders
partitionColumns: order_date
planLabel: daily-revenue-etl

🔴 [daily-revenue-etl] Missed broadcast: prod.analytics.customers (14.2 MB) in Inner join on customer_id
💰 Estimated savings: ~$3,200/month
The Inner join on customer_id between prod.analytics.orders and prod.analytics.customers is using SortMergeJoin, which shuffles both sides. But the right input (prod.analytics.customers) is only 14.2 MB — small enough to broadcast. Broadcasting eliminates the shuffle on the large side, typically cutting this join's cost by 50% or more.
What to do:
from pyspark.sql.functions import broadcast

# Broadcast the small table (14.2 MB) to avoid a shuffle
orders.join(broadcast(customers), on="customer_id")
Raw details
planLabel: daily-revenue-etl
estimatedSizeBytes: 14890803
side: right

🟡 Config: broadcast join threshold set to -1 (disabled)
💰 Estimated savings: ~$1,900/month
spark.sql.autoBroadcastJoinThreshold is set to -1, which disables automatic broadcast joins entirely. Every join that could use a cheap broadcast will fall back to a shuffle-based strategy instead. This is affecting 4 joins across 3 plans.
What to do:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)  # restore default (10 MB)
Raw details
subtype: broadcast-disabled
configKey: spark.sql.autoBroadcastJoinThreshold
configValue: -1

💰 Savings Breakdown
Plan                    Est. Monthly Savings
daily-revenue-etl       $8,000.00
weekly-agg-reports      $1,900.00
Total                   $9,900.00

How it works

01

Capture

Add our lightweight open-source Python package to your notebook or job — two lines of code. It captures query plans and catalog metadata as JSON. Read-only. Works on serverless, Spark Connect, and classic clusters.

02

Analyze

Our engine parses every physical plan, cross-references catalog table sizes, and runs cost-aware detectors for partition pruning misses, broadcast opportunities, join strategy regressions, and more.

03

Report

You get findings ranked by dollar impact, calibrated to your actual cluster costs and job frequency. Each finding includes a fix recommendation with code your engineers can apply immediately.

Engagement model

Free

Triage

We add the capture package to 3–5 of your highest-cost jobs. After the next production run, snapshots are available for analysis. If there are meaningful savings, we proceed. If your workloads are already well-optimized, we part ways — no charge.

Turnaround depends on your job cadence. Most teams have results within a few days.

Open-source PySpark linter

A free static analysis tool that catches common PySpark anti-patterns before they reach production. No Spark runtime required — runs on source code in CI, pre-commit, or your terminal. 20 rules and growing.

Detects .collect() without filtering, .withColumn() in loops, implicit Cartesian products, missing .unpersist(), schema inference waste, and more. See the full analysis reference.


$ pip install cylint
$ cy lint src/

src/orders.py:14  CY001  .collect() on unfiltered DataFrame
src/orders.py:23  CY003  .withColumn() in loop (8 iterations)
src/etl.py:7      CY004  SELECT * flows into write without projection
src/etl.py:31     CY009  UDF in .filter() blocks pushdown

4 findings in 2 files
Read-only analysis — we never access your data
Databricks first — EMR & Dataproc on roadmap
Spark Connect & serverless compatible

Ready to find out what your Spark workloads are actually costing you?

The free triage takes five minutes to set up. We'll tell you if there's money to save.

hello@clusteryield.app