Analysis Reference
Cluster Yield catches cost, reliability, and clarity problems across two analysis tiers. The linter runs locally for free. The snapshot adds dollar amounts from your production environment.
Linter
Static analysis of PySpark source code. Catches 20 anti-patterns. No cluster, no account.

    pip install cylint
Plan Analysis
Analyzes query plans from your production cluster. Adds dollar amounts, detects plan-level problems the linter can't see.
How savings are calculated
When multiple findings target the same table or operator, the cost engine applies them in priority order and recomputes after each step. This ensures the total savings never exceeds the plan's actual cost — no double-counting, even when the linter and snapshot both flag the same scan.
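The priority-ordered recomputation can be sketched as follows. This is an illustrative model only, not Cluster Yield's actual cost engine; the function name and the (priority, savings-fraction) representation are assumptions made for the example.

```python
# Illustrative sketch: each finding's saving is a fraction of whatever cost
# REMAINS after higher-priority findings have been applied, so the total
# saved can never exceed the plan's cost.
def apply_findings(plan_cost, findings):
    """findings: list of (priority, savings_fraction) tuples."""
    remaining = plan_cost
    total_saved = 0.0
    for _, fraction in sorted(findings, key=lambda f: f[0]):
        saved = remaining * fraction  # recomputed against remaining cost
        total_saved += saved
        remaining -= saved
    return total_saved

# Two findings flagging the same $100 scan, estimated at 60% and 50%:
# naive addition would claim $110; sequential application claims $80.
print(apply_findings(100.0, [(1, 0.6), (2, 0.5)]))  # → 80.0
```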
Inline Suppression
    # Single-line suppression
    df.collect()  # cy:ignore CY001 — intentional, result set is small

    # Block suppression
    # cy:ignore-start CY001
    result = df.collect()
    process(result)
    # cy:ignore-end
The reason text after the dash is optional but recommended. It documents why the suppression is intentional and survives code review.
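To make the suppression syntax concrete, here is a minimal sketch of how a linter could honor these markers. The regexes and the function name are illustrative assumptions, not cylint's real internals.

```python
import re

# Illustrative markers matching the suppression syntax shown above.
INLINE = re.compile(r'#\s*cy:ignore\s+(CY\d+)')
BLOCK_START = re.compile(r'#\s*cy:ignore-start\s+(CY\d+)')
BLOCK_END = re.compile(r'#\s*cy:ignore-end')

def suppressed_lines(source, code):
    """Return the 1-based line numbers on which `code` is suppressed."""
    suppressed = set()
    in_block = False
    for lineno, line in enumerate(source.splitlines(), start=1):
        start = BLOCK_START.search(line)
        if start and start.group(1) == code:
            in_block = True
        elif BLOCK_END.search(line):
            in_block = False
        inline = INLINE.search(line)
        if in_block or (inline and inline.group(1) == code):
            suppressed.add(lineno)
    return suppressed

src = ("df.collect()  # cy:ignore CY001\n"
       "# cy:ignore-start CY001\n"
       "result = df.collect()\n"
       "# cy:ignore-end\n"
       "df.collect()")
print(suppressed_lines(src, "CY001"))  # → {1, 2, 3}; line 5 is still flagged
```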
CY016: Invalid Escape Sequences
Flags string literals that contain invalid escape sequences ('\w' instead of '\\w' or r'\w'). This is a general Python hygiene rule, not a Spark-specific anti-pattern, and is excluded from the Spark analysis reference. It appears in linter output but is not included in corpus prevalence analysis or blog reporting.

Snapshot-Only Analyzers
These problems require runtime context — table sizes, cluster config, query plan structure — that static analysis can't access. They only appear with a snapshot capture from your production cluster.
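A footnote on CY016 above: the escape-sequence distinction it flags is easy to demonstrate. This is standard Python behavior, nothing cylint-specific.

```python
import re

# '\w' in a plain string literal is an invalid escape sequence: CPython
# emits a SyntaxWarning (slated to become an error) but still treats it as
# the two characters backslash-w, so the bug is silent at runtime.
flagged = '\w+'   # the form a rule like CY016 would flag
raw = r'\w+'      # recommended fix: raw string
escaped = '\\w+'  # equivalent fix: doubled backslash

assert flagged == raw == escaped  # all three behave identically today
print(re.findall(raw, 'ad-hoc scan'))  # → ['ad', 'hoc', 'scan']
```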