We analyze your production Spark workloads and attach dollar amounts to every inefficiency — missed broadcasts, unfiltered partition scans, accidental cartesian joins. You get a prioritized report with fix recommendations your engineers can ship the same week.
Most Spark performance issues don't crash your jobs — they just make them quietly expensive. The Databricks bill goes up, but nobody can connect the dollars to specific decisions in the query plan.
A 12 MB lookup table gets shuffled alongside a 500 GB fact table because the broadcast threshold was lowered or the table's stats are stale. Two unnecessary shuffles on every run.
Your query needs one month of data, but Spark reads all 365 daily partitions because the filter doesn't align with the partition column. Over 90% of the I/O is wasted.
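The arithmetic behind that figure is simple; a quick sketch, assuming roughly uniform daily partition sizes:

```python
# Wasted I/O when a full-year scan serves a one-month query,
# assuming daily partitions of roughly equal size.
total_partitions = 365
needed_partitions = 31  # one month of daily partitions

wasted_fraction = (total_partitions - needed_partitions) / total_partitions
print(f"wasted I/O: {wasted_fraction:.1%}")  # wasted I/O: 91.5%
```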
Someone lowered the broadcast threshold six months ago to fix an OOM. Now every join that could broadcast doesn't. Nobody remembers the change.
Actual report format from our analysis engine. In your report, savings are calibrated to your DBU rate, cluster shape, and job frequency — not generic benchmarks.
prod.analytics.orders — no filter on partition column order_date

prod.analytics.orders is partitioned by order_date, but the scan has no partition filters, so Spark will read every partition of the table. The full table is 800 GB. No filter on order_date was found anywhere in the plan.
from pyspark.sql.functions import col

(
    spark.table("prod.analytics.orders")
    .filter(col("order_date") >= "2025-01-01")  # enables partition pruning
    .select(...)
)
| tableName | prod.analytics.orders |
| partitionColumns | order_date |
| planLabel | daily-revenue-etl |
prod.analytics.customers (14.2 MB) in Inner join on customer_id

The inner join on customer_id between prod.analytics.orders and prod.analytics.customers is using SortMergeJoin, which shuffles both sides. But the right input (prod.analytics.customers) is only 14.2 MB — small enough to broadcast. Broadcasting eliminates the shuffle on the large side, typically cutting this join's cost by 50% or more.
from pyspark.sql.functions import broadcast

# Broadcast the small table (14.2 MB) to avoid a shuffle
orders.join(broadcast(customers), on="customer_id")
| planLabel | daily-revenue-etl |
| estimatedSizeBytes | 14890803 |
| side | right |
spark.sql.autoBroadcastJoinThreshold is set to -1, which disables automatic broadcast joins entirely. Every join that could use a cheap broadcast will fall back to a shuffle-based strategy instead. This is affecting 4 joins across 3 plans.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)  # restore default (10 MB)
| subtype | broadcast-disabled |
| configKey | spark.sql.autoBroadcastJoinThreshold |
| configValue | -1 |
| Plan | Est. Monthly Savings |
|---|---|
| daily-revenue-etl | $8,000.00 |
| weekly-agg-reports | $1,900.00 |
| Total | $9,900.00 |
Add our lightweight open-source Python package to your notebook or job — two lines of code. It captures query plans and catalog metadata as JSON. Read-only. Works on serverless, Spark Connect, and classic clusters.
Our engine parses every physical plan, cross-references catalog table sizes, and runs cost-aware detectors for partition pruning misses, broadcast opportunities, join strategy regressions, and more.
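At its core, a detector like the broadcast-opportunity check is a size-versus-strategy comparison. A minimal sketch, assuming join metadata has already been extracted from the plan (the function name, input shape, and 32 MB ceiling are illustrative, not our engine's actual internals):

```python
# Toy broadcast-opportunity detector: flag a SortMergeJoin whose smaller
# input is comfortably under a broadcastable ceiling. Sizes would come from
# catalog metadata; the 32 MB ceiling is illustrative, not a Spark default.
BROADCAST_CEILING_BYTES = 32 * 1024 * 1024

def find_broadcast_opportunities(joins):
    """joins: list of dicts like
    {"strategy": "SortMergeJoin", "left_bytes": ..., "right_bytes": ...}"""
    findings = []
    for j in joins:
        if j["strategy"] != "SortMergeJoin":
            continue
        small_side = min(j["left_bytes"], j["right_bytes"])
        if small_side <= BROADCAST_CEILING_BYTES:
            findings.append({**j, "recommendation": "broadcast smaller side"})
    return findings

joins = [
    {"strategy": "SortMergeJoin", "left_bytes": 800 * 1024**3, "right_bytes": 14_890_803},
    {"strategy": "BroadcastHashJoin", "left_bytes": 800 * 1024**3, "right_bytes": 1_000_000},
]
print(len(find_broadcast_opportunities(joins)))  # 1: the 14.2 MB side qualifies
```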
You get findings ranked by dollar impact, calibrated to your actual cluster costs and job frequency. Each finding includes a fix recommendation with code your engineers can apply immediately.
We add the capture package to 3–5 of your highest-cost jobs. After the next production run, snapshots are available for analysis. If there's meaningful savings, we proceed. If your workloads are already well-optimized, we part ways — no charge.
Turnaround depends on your job cadence. Most teams have results within a few days.
Analysis of all production Spark jobs once snapshots are captured. Every dollar amount is traceable to your actual Databricks bill — calibrated to your DBU rate, instance type, and job runtimes.
Includes a prioritized findings report with fix recommendations and estimated monthly savings per finding.
✓ Savings guarantee: if we don't identify at least 2× our fee in monthly savings, you pay nothing.
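To make "calibrated to your DBU rate" concrete, here is a simplified sketch of how a per-finding estimate is assembled. Every rate below is a hypothetical placeholder, not a benchmark:

```python
# Simplified savings calibration (all numbers hypothetical, for illustration):
# savings/month ≈ hours saved per run × $/DBU × DBUs/hour × runs/month
hours_saved_per_run = 0.9   # e.g. a shuffle-heavy stage eliminated
dbu_rate_usd = 0.55         # $/DBU, depends on your plan and workload type
dbus_per_hour = 100.0       # depends on cluster shape
runs_per_month = 30         # daily job

monthly_savings = hours_saved_per_run * dbu_rate_usd * dbus_per_hour * runs_per_month
print(f"${monthly_savings:,.2f}")  # $1,485.00
```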
A free static analysis tool that catches common PySpark anti-patterns before they reach production. No Spark runtime required — runs on source code in CI, pre-commit, or your terminal. 20 rules and growing.
Detects .collect() without filtering, .withColumn() in loops, implicit cartesians, missing .unpersist(), schema inference waste, and more. See the full analysis reference.
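A rule like the loop check can run on plain source text with no Spark session. A minimal sketch of that idea using Python's ast module (an illustration of the approach, not the tool's actual rule engine):

```python
import ast

# Minimal static rule sketch: flag .withColumn() calls inside a loop.
def find_withcolumn_in_loops(source: str):
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Attribute)
                        and inner.func.attr == "withColumn"):
                    hits.append(inner.lineno)
    return hits

sample = """
for c in cols:
    df = df.withColumn(c, df[c].cast("double"))
"""
print(find_withcolumn_in_loops(sample))  # [3]
```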
The free triage takes five minutes to set up. We'll tell you if there's money to save.
hello@clusteryield.app