5,025 PySpark repos · 88,091 files · 110,068 findings
Overall code quality improves ~4x from hobby to production code. Engineering discipline works.
Lines sloping upward (red) are patterns that become more prevalent in production code. Amber lines persist across all tiers. Lines sloping downward (green) fade with maturity. Only rules with >5% prevalence in either tier shown for clarity.
Not all findings cost the same thing. Rules that cost money map to the snapshot and cost waterfall. Rules that cost reliability are valuable from the free linter alone. Rules that cost clarity are the hygiene baseline.
Complete prevalence data across all four tiers. Ratio = Tier 4 prevalence / Tier 1 prevalence. Sorted by ratio (rising patterns first).