Spark & Databricks cost optimization

Your Spark jobs are quietly wasting money. We show you exactly where.

We analyze your production Spark workloads and attach dollar amounts to every inefficiency — missed broadcasts, unfiltered partition scans, accidental Cartesian joins. You get a prioritized report with fix recommendations your engineers can ship the same week.

The problems hiding in your plans

Most Spark performance issues don't crash your jobs — they just make them quietly expensive. The Databricks bill goes up, but nobody can connect the dollars to specific decisions in the query plan.

Missed broadcast joins

A 12 MB lookup table gets shuffled alongside a 500 GB fact table because the broadcast threshold was lowered or the table's stats are stale. Two unnecessary shuffles on every run.

Unfiltered partition scans

Your query needs one month of data, but Spark reads all 365 daily partitions because the filter doesn't align with the partition column. Over 90% of the I/O is wasted.

Config drift

Someone lowered the broadcast threshold six months ago to fix an OOM. Now every join that could broadcast doesn't. Nobody remembers the change.

What we find

Below is the actual report format produced by our analysis engine. In your report, savings are calibrated to your DBU rate, cluster shape, and job frequency — not generic benchmarks.

⚡ Cluster Yield CI — Spark Plan Analysis
🔴 2 critical 🟡 1 warning
🔴 [daily-revenue-etl] Full scan on prod.analytics.orders — no filter on partition column order_date
💰 Estimated savings: ~$4,800/month
Table prod.analytics.orders is partitioned by order_date but the scan has no partition filters. Spark will read every partition of the table. The full table is 800 GB.

No filter on order_date was found anywhere in the plan.
What to do:
from pyspark.sql.functions import col

(spark.table("prod.analytics.orders")
    .filter(col("order_date") >= "2025-01-01")  # enables partition pruning
    .select(...))
Raw details
tableName: prod.analytics.orders
partitionColumns: order_date
planLabel: daily-revenue-etl

🔴 [daily-revenue-etl] Missed broadcast: prod.analytics.customers (14.2 MB) in Inner join on customer_id
💰 Estimated savings: ~$3,200/month
The Inner join on customer_id between prod.analytics.orders and prod.analytics.customers is using SortMergeJoin, which shuffles both sides. But the right input (prod.analytics.customers) is only 14.2 MB — small enough to broadcast. Broadcasting eliminates the shuffle on the large side, typically cutting this join's cost by 50% or more.
What to do:
from pyspark.sql.functions import broadcast

# Broadcast the small table (14.2 MB) to avoid a shuffle
orders.join(broadcast(customers), on="customer_id")
Raw details
planLabel: daily-revenue-etl
estimatedSizeBytes: 14890803
side: right

🟡 Config: broadcast join threshold set to -1 (disabled)
💰 Estimated savings: ~$1,900/month
spark.sql.autoBroadcastJoinThreshold is set to -1, which disables automatic broadcast joins entirely. Every join that could use a cheap broadcast will fall back to a shuffle-based strategy instead. This is affecting 4 joins across 3 plans.
What to do:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)  # restore default (10 MB)
Raw details
subtype: broadcast-disabled
configKey: spark.sql.autoBroadcastJoinThreshold
configValue: -1

💰 Savings Breakdown
Plan                    Est. Monthly Savings
daily-revenue-etl       $8,000.00
weekly-agg-reports      $1,900.00
Total                   $9,900.00

How it works

01

Capture

Add our lightweight open-source Python package to your notebook or job — two lines of code. It captures query plans and catalog metadata as JSON. Read-only. Works on serverless, Spark Connect, and classic clusters.

02

Analyze

Our engine parses every physical plan, cross-references catalog table sizes, and runs cost-aware detectors for partition pruning misses, broadcast opportunities, join strategy regressions, and more.

03

Report

You get findings ranked by dollar impact, calibrated to your actual cluster costs and job frequency. Each finding includes a fix recommendation with code your engineers can apply immediately.

Engagement model

Free

Triage

We add the capture package to 3–5 of your highest-cost jobs. After the next production run, snapshots are available for analysis. If there are meaningful savings, we proceed. If your workloads are already well-optimized, we part ways — no charge.

Turnaround depends on your job cadence. Most teams have results within a few days.

Open-source PySpark linter

A free static analysis tool that catches common PySpark anti-patterns before they reach production. No Spark runtime required — runs on source code in CI, pre-commit, or your terminal. 20 rules and growing.

Detects .collect() without filtering, .withColumn() in loops, implicit Cartesian products, missing .unpersist(), schema inference waste, and more. See the full analysis reference.


$ pip install cylint
$ cy lint src/

src/orders.py:14  CY001  .collect() on unfiltered DataFrame
src/orders.py:23  CY003  .withColumn() in loop (8 iterations)
src/etl.py:7      CY004  SELECT * flows into write without projection
src/etl.py:31     CY009  UDF in .filter() blocks pushdown

4 findings in 2 files
Read-only analysis — we never access your data
Databricks first — EMR & Dataproc on roadmap
Spark Connect & serverless compatible

Ready to find out what your Spark workloads are actually costing you?

The free triage takes five minutes to set up. We'll tell you if there's money to save.

hello@clusteryield.app