Cluster Yield — Corpus Analysis

5,025 PySpark repos · 88,091 files · 110,068 findings

Tier 1 · Learning: 1,882 repos · 24,563 files

Tier 2 · Personal: 2,574 repos · 42,198 files

Tier 3 · Serious: 473 repos · 19,457 files

Tier 4 · Production: 96 repos · 1,873 files

Chart 1 — Findings per file by tier

Overall code quality, measured as findings per file, improves ~4x from hobby to production code. Engineering discipline works.
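The metric behind Chart 1 can be sketched as a simple ratio. The snippet below is a minimal illustration, not the report's own tooling: it uses only the corpus totals and per-tier file counts quoted on this page, and because per-tier finding counts are not listed here, it computes only the corpus-wide average. The helper name `findings_per_file` is hypothetical.

```python
# Corpus totals copied from the summary line at the top of this report.
TOTAL_FILES = 88_091
TOTAL_FINDINGS = 110_068

# Files per tier, copied from the tier summary.
FILES_BY_TIER = {
    "Tier 1 · Learning": 24_563,
    "Tier 2 · Personal": 42_198,
    "Tier 3 · Serious": 19_457,
    "Tier 4 · Production": 1_873,
}


def findings_per_file(findings: int, files: int) -> float:
    """Average number of findings per scanned file (hypothetical helper)."""
    if files <= 0:
        raise ValueError("files must be positive")
    return findings / files


# Sanity check: the per-tier file counts add up to the corpus total.
assert sum(FILES_BY_TIER.values()) == TOTAL_FILES

overall = findings_per_file(TOTAL_FINDINGS, TOTAL_FILES)
print(f"corpus-wide findings per file: {overall:.2f}")
```

Applying the same function to each tier's finding count would give the four bars in Chart 1; those per-tier counts would have to come from the underlying data.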

5 patterns rise with maturity

CY003 .withColumn() loop: 17.1% → 28.1%

CY025 Missing .unpersist(): 19.3% → 24.0%

CY007 Cross join: 5.3% → 9.4%

CY008 .repartition() → write: 3.6% → 6.2%

CY011 .withColumnRenamed() loop: 2.9% → 4.2%

Chart 2 — Tier 1 → Tier 4 prevalence slope

Lines sloping upward (red) are patterns that become more prevalent in production code. Amber lines persist across all tiers. Lines sloping downward (green) fade with maturity. For clarity, only rules with >5% prevalence in either tier are shown.

Legend: Rising in production (red) · Persistent (amber) · Declining with maturity (green)
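The chart's filter and three-way colouring can be sketched as follows. The >5% cutoff comes from the caption above; the band that separates "persistent" from rising or declining is not stated anywhere on this page, so `FLAT_BAND` is purely an assumption, as is reading each pair of percentages as Tier 1 → Tier 4. Only the five rules quoted in this report are used.

```python
# (Tier 1 %, Tier 4 %) for the five rules quoted in this report.
# ASSUMPTION: each "a% → b%" pair is read as Tier 1 → Tier 4 prevalence.
PREVALENCE = {
    "CY003": (17.1, 28.1),
    "CY025": (19.3, 24.0),
    "CY007": (5.3, 9.4),
    "CY008": (3.6, 6.2),
    "CY011": (2.9, 4.2),
}

MIN_PREVALENCE = 5.0  # from the caption: >5% prevalence in either tier
FLAT_BAND = 1.0       # ASSUMPTION: changes within ±1 point count as persistent


def shown_in_chart(t1: float, t4: float) -> bool:
    """A rule is plotted if it exceeds the cutoff in either tier."""
    return max(t1, t4) > MIN_PREVALENCE


def slope_class(t1: float, t4: float) -> str:
    """Classify the Tier 1 → Tier 4 change in percentage points."""
    delta = t4 - t1
    if delta > FLAT_BAND:
        return "rising"
    if delta < -FLAT_BAND:
        return "declining"
    return "persistent"


chart_rows = {
    rule: slope_class(t1, t4)
    for rule, (t1, t4) in PREVALENCE.items()
    if shown_in_chart(t1, t4)
}
print(chart_rows)
```

Under these assumptions CY011 drops out of the chart because it stays below 5% in both tiers, while the other four rules are classified as rising.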

Chart 3 — Rule classification by cost type

Not all findings cost the same thing. Rules that cost money map to the snapshot and cost waterfall. Rules that cost reliability are valuable from the free linter alone. Rules that cost clarity are the hygiene baseline.

Full prevalence matrix

Complete prevalence data across all four tiers. Ratio = Tier 4 prevalence / Tier 1 prevalence. Sorted by ratio (rising patterns first).
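As a worked example of the ratio column, the sketch below applies the stated definition (Tier 4 prevalence divided by Tier 1 prevalence) to the five rising rules quoted earlier in this report and sorts them with the highest ratio first. It assumes each quoted pair of percentages is Tier 1 → Tier 4; Tier 2 and Tier 3 values are not listed on this page and are left out. The names used here are illustrative.

```python
# (rule id, description, Tier 1 %, Tier 4 %) as quoted in this report.
# ASSUMPTION: each "a% → b%" pair is Tier 1 → Tier 4 prevalence.
RULES = [
    ("CY003", ".withColumn() loop", 17.1, 28.1),
    ("CY025", "Missing .unpersist()", 19.3, 24.0),
    ("CY007", "Cross join", 5.3, 9.4),
    ("CY008", ".repartition() -> write", 3.6, 6.2),
    ("CY011", ".withColumnRenamed() loop", 2.9, 4.2),
]


def ratio(t1: float, t4: float) -> float:
    """Tier 4 prevalence / Tier 1 prevalence; infinite if the rule is absent in Tier 1."""
    return float("inf") if t1 == 0 else t4 / t1


# Attach the ratio to each row, then sort with rising patterns first.
rows = sorted(
    ((rule, name, t1, t4, ratio(t1, t4)) for rule, name, t1, t4 in RULES),
    key=lambda row: row[4],
    reverse=True,
)

for rule, name, t1, t4, r in rows:
    print(f"{rule}  {name:<28} {t1:5.1f}%  {t4:5.1f}%  {r:.2f}x")
```

A ratio above 1 means the pattern is more prevalent in Tier 4 than in Tier 1. Note that sorting by ratio ranks CY007 above CY003, even though CY003 has the larger absolute increase in percentage points.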