Analysis Reference
Cluster Yield catches cost, reliability, and clarity problems across two analysis tiers. The linter runs locally for free. The snapshot adds dollar amounts from your production environment.
Linter
Static analysis of PySpark source code. Catches 20 anti-patterns. No cluster, no account.

    pip install cylint
Plan Analysis
Analyzes query plans from your production cluster. Adds dollar amounts, detects plan-level problems the linter can't see.
How savings are calculated
When multiple findings target the same table or operator, the cost engine applies them in priority order and recomputes after each step. This ensures the total savings never exceeds the plan's actual cost — no double-counting, even when the linter and snapshot both flag the same scan.
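The priority-ordered recomputation can be sketched as follows. This is an illustrative model only, not Cluster Yield's actual cost engine; the function name and the (priority, savings-fraction) representation are assumptions made for the example.

```python
# Illustrative sketch: each finding's saving is a fraction of whatever cost
# REMAINS after higher-priority findings have been applied, so the total
# saved can never exceed the plan's cost.
def apply_findings(plan_cost, findings):
    """findings: list of (priority, savings_fraction) tuples."""
    remaining = plan_cost
    total_saved = 0.0
    for _, fraction in sorted(findings, key=lambda f: f[0]):
        saved = remaining * fraction  # recomputed against remaining cost
        total_saved += saved
        remaining -= saved
    return total_saved

# Two findings flagging the same $100 scan, estimated at 60% and 50%:
# naive addition would claim $110; sequential application claims $80.
print(apply_findings(100.0, [(1, 0.6), (2, 0.5)]))  # → 80.0
```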
Inline Suppression
    # Single-line suppression
    df.collect()  # cy:ignore CY001 — intentional, result set is small

    # Block suppression
    # cy:ignore-start CY001
    result = df.collect()
    process(result)
    # cy:ignore-end
The reason text after the dash is optional but recommended. It documents why the suppression is intentional and survives code review.
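To make the suppression syntax concrete, here is a minimal sketch of how a linter could honor these markers. The regexes and the function name are illustrative assumptions, not cylint's real internals.

```python
import re

# Illustrative markers matching the suppression syntax shown above.
INLINE = re.compile(r'#\s*cy:ignore\s+(CY\d+)')
BLOCK_START = re.compile(r'#\s*cy:ignore-start\s+(CY\d+)')
BLOCK_END = re.compile(r'#\s*cy:ignore-end')

def suppressed_lines(source, code):
    """Return the 1-based line numbers on which `code` is suppressed."""
    suppressed = set()
    in_block = False
    for lineno, line in enumerate(source.splitlines(), start=1):
        start = BLOCK_START.search(line)
        if start and start.group(1) == code:
            in_block = True
        elif BLOCK_END.search(line):
            in_block = False
        inline = INLINE.search(line)
        if in_block or (inline and inline.group(1) == code):
            suppressed.add(lineno)
    return suppressed

src = ("df.collect()  # cy:ignore CY001\n"
       "# cy:ignore-start CY001\n"
       "result = df.collect()\n"
       "# cy:ignore-end\n"
       "df.collect()")
print(suppressed_lines(src, "CY001"))  # → {1, 2, 3}; line 5 is still flagged
```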
CY016: Invalid Escape Sequences
Flags string literals that contain invalid escape sequences ('\w' instead of '\\w' or r'\w'). This is a general Python hygiene rule, not a Spark-specific anti-pattern, and is excluded from the Spark analysis reference. It appears in linter output but is not included in corpus prevalence analysis or blog reporting.

Snapshot-Only Analyzers
These problems require runtime context — table sizes, cluster config, query plan structure — that static analysis can't access. They only appear with a snapshot capture from your production cluster.
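A footnote on CY016 above: the escape-sequence distinction it flags is easy to demonstrate. This is standard Python behavior, nothing cylint-specific.

```python
import re

# '\w' in a plain string literal is an invalid escape sequence: CPython
# emits a SyntaxWarning (slated to become an error) but still treats it as
# the two characters backslash-w, so the bug is silent at runtime.
flagged = '\w+'   # the form a rule like CY016 would flag
raw = r'\w+'      # recommended fix: raw string
escaped = '\\w+'  # equivalent fix: doubled backslash

assert flagged == raw == escaped  # all three behave identically today
print(re.findall(raw, 'ad-hoc scan'))  # → ['ad', 'hoc', 'scan']
```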