Cluster Yield — Corpus Analysis

5,025 PySpark repos · 88,091 files · 110,068 findings

Tier 1 · Learning: 1,882 repos, 24,563 files
Tier 2 · Personal: 2,574 repos, 42,198 files
Tier 3 · Serious: 473 repos, 19,457 files
Tier 4 · Production: 96 repos, 1,873 files

Chart 1 — Findings per file by tier

Measured by findings per file, overall code quality improves ~4x from hobby (Tier 1) to production (Tier 4) code. Engineering discipline works.
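As a rough illustration of the metric behind Chart 1, the sketch below computes findings per file from the corpus totals quoted at the top of this page. Per-tier finding counts are not listed here, so only the corpus-wide average is computed; the names `TIERS`, `TOTAL_FINDINGS`, and `findings_per_file` are hypothetical and chosen for this example only.

```python
# Repo and file counts per tier, as quoted in the tier summary.
TIERS = {
    "Tier 1 · Learning": (1_882, 24_563),
    "Tier 2 · Personal": (2_574, 42_198),
    "Tier 3 · Serious": (473, 19_457),
    "Tier 4 · Production": (96, 1_873),
}
TOTAL_FINDINGS = 110_068  # corpus-wide total from the header line


def findings_per_file(findings: int, files: int) -> float:
    """Average number of findings per file (the Chart 1 metric, as assumed here)."""
    return findings / files


total_repos = sum(repos for repos, _ in TIERS.values())
total_files = sum(files for _, files in TIERS.values())
overall = findings_per_file(TOTAL_FINDINGS, total_files)

print(f"{total_repos} repos, {total_files} files")
print(f"corpus-wide findings per file: {overall:.2f}")
for name, (repos, files) in TIERS.items():
    print(f"{name}: {repos / total_repos:.1%} of repos, {files / total_files:.1%} of files")
```

The same function applied to each tier's own finding count would give the per-tier values plotted in the chart.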

5 patterns rise with maturity
CY003 · .withColumn() loop: 17.1% → 28.1%
CY025 · Missing .unpersist(): 19.3% → 24.0%
CY007 · Cross join: 5.3% → 9.4%
CY008 · .repartition() → write: 3.6% → 6.2%
CY011 · .withColumnRenamed() loop: 2.9% → 4.2%

Chart 2 — Tier 1 → Tier 4 prevalence slope

Lines sloping upward (red) are patterns that become more prevalent in production code. Amber lines are patterns that persist across all tiers. Lines sloping downward (green) are patterns that fade with maturity. For clarity, only rules with more than 5% prevalence in Tier 1 or Tier 4 are shown.

Legend: Rising in production (red) · Persistent (amber) · Declining with maturity (green)

Chart 3 — Rule classification by cost type

Not all findings cost the same thing. Rules that cost money map to the snapshot and cost waterfall. Rules that cost reliability are valuable from the free linter alone. Rules that cost clarity are the hygiene baseline.

Full prevalence matrix

Complete prevalence data across all four tiers. Ratio = Tier 4 prevalence / Tier 1 prevalence. Sorted by ratio (rising patterns first).
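A minimal sketch of the ratio and sort order described above, using only the Tier 1 and Tier 4 prevalence figures quoted for the five rising patterns. The names `PREVALENCE`, `ratio`, and `sorted_by_ratio` are hypothetical; the full matrix would carry all four tiers and every rule.

```python
# (Tier 1 prevalence %, Tier 4 prevalence %) for the five rising rules.
PREVALENCE = {
    "CY003": (17.1, 28.1),  # .withColumn() loop
    "CY025": (19.3, 24.0),  # Missing .unpersist()
    "CY007": (5.3, 9.4),    # Cross join
    "CY008": (3.6, 6.2),    # .repartition() -> write
    "CY011": (2.9, 4.2),    # .withColumnRenamed() loop
}


def ratio(tier1: float, tier4: float) -> float:
    """Ratio = Tier 4 prevalence / Tier 1 prevalence; above 1 means rising."""
    return tier4 / tier1


def sorted_by_ratio(prevalence):
    """Return (rule, tier1, tier4, ratio) rows, highest ratio first."""
    rows = [(rule, t1, t4, ratio(t1, t4)) for rule, (t1, t4) in prevalence.items()]
    return sorted(rows, key=lambda row: row[3], reverse=True)


for rule, t1, t4, r in sorted_by_ratio(PREVALENCE):
    print(f"{rule}  {t1:5.1f}% -> {t4:5.1f}%  ratio {r:.2f}")
```

Sorting by ratio rather than by absolute prevalence puts a less common rule such as the cross join ahead of the more common `.withColumn()` loop, because it ranks rules by how much they grow relative to their Tier 1 baseline.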