This is an iOS CI/CD Scaling Failure Model—not a capacity calculator. Before we took over, the team grew from 8 to 20 engineers over fourteen months while the Runner topology barely moved: release-week PRs went red, builds sat in Queued for hours, and TestFlight slipped by two days. Below we answer why CI slows when headcount rises, and which segment of the pipeline actually breaks.
If you need “how many Macs for 500 runs/day,” see the companion piece 3 Cloud Macs for 500 iOS Builds per Day—that’s the post-handoff rollout; this article is why it collapsed before we arrived.
At a glance: how “more people” slides CI into the Failure Zone
Figure 1 is the mechanism overview for the whole piece—better for a postmortem opener or team onboarding than pipeline minutiae. Collapse is rarely a single point; it’s the cascade below.
Figure 1 · CI collapse mechanism (Scaling Failure Cascade)
This customer hit Queue pain at 12 people; Runner and Archive lit up together at 16; at 20 they landed in the red zone at the bottom. Compare to the healthy baseline below.
iOS CI/CD architecture: bottleneck map from PR to TestFlight
Figure 2 is the execution path for a macOS job (L0/L1/L2), with the two places that saturated first in this case. Use it to tell compile slowness from jobs that never left the queue.
Figure 2 · iOS CI/CD pipeline: L0 / L1 / L2 layers and typical bottlenecks (this customer, 2024–2025)
L0: lint / light checks; L1: PR integration build and Simulator tests; L2: Archive, notarization, upload. This customer broke on Queue and mixed Runner pools—not L1 compile speed. Matrix blow-up is often misread as “Xcode got slower.”
Starting point: CI that “worked” at 8 people—and didn’t scale with the team
Mid-2024, the setup was textbook:
- One office Mac mini M2 doubling as a self-hosted Mac Runner (see GitHub’s self-hosted runner docs)
- Everything else on GitHub-hosted macOS runners
- Monorepo with two apps, one YAML workflow for all
- Roughly 80–120 macOS jobs per day; queue P95 under five minutes
The CTO’s line was: “CI isn’t the bottleneck—don’t spend time on it.” True at eight people; after headcount doubled, CI became a hidden tax because nobody did capacity planning—only feature teams adding workflows.
Healthy baseline: what “normal” looks like
Staring only at the 20-person crash week makes it easy to think “big team = buy lots of Macs.” We add a healthy baseline vs crash week table—numbers from this customer’s eight-person era plus ranges we’ve seen on mid-size iOS teams (not an industry standard, but enough for an internal review).
| Metric | Healthy baseline (8–12 people · sane topology) | This customer · 20-person crash week |
|---|---|---|
| L1 queue_wait P95 | < 8 min | 47 min |
| L1 run duration P95 | 12–18 min | 14 min (compile did not slow down) |
| Archive-class job share | < 15% | 38% |
| macOS jobs per push · P99 | < 6 | 19 |
| macOS jobs / day | 80–180 | 500+ (release week) |
| Runner topology | L1/L2 label-separated pools | hosted + office mini mixed |
The telling contrast: run duration barely moved in the crash week; queue time jumped nearly 10×—the problem was queuing and workflows, not Xcode compile throughput. That’s where the phrase “CI got slow” misleads teams.
Timeline: three warnings, three times dismissed as “flaky”
First (12 people, +4 iOS): PR merge rate went from ~6/day to ~14/day. Hosted Runner build queue hit 25–40 minute waits in the afternoon. Team reaction: “GitHub’s having a bad day.” Nobody pulled the queue_wait_seconds curve.
Second (16 people, +2 backend on the same repo): They added a second office Mac mini, both registered as self-hosted with no label-based queue split. Archive and PR builds shared one pool—light jobs that were ~8 minutes sometimes took 35; heavy notarization jobs failed at random. Reaction: “Try another disk.”
Third (20 people, dual-app release collision): Release week macOS jobs spiked from ~120/day to 280+ in a single day. Slack #ios-ci had someone on-call around the clock. Reaction: “Procurement is approving eight Mac minis.” That was the same eight-machine plan we dissect in the 500 runs/day case study, and nobody had broken out jobs by type yet.
Figure 3 · Team size vs macOS jobs/day vs risk tier (this customer, 2024–2025 measured)
| Team size | macOS jobs / day | GitHub Actions queue time (L1 P95) | Risk tier | Stage notes |
|---|---|---|---|---|
| 8 | ~100 | < 5 min | Baseline | Single office mini sufficient |
| 12 | ~180 | 25–40 min | Warning Zone | Queue first sustained over threshold |
| 16 | ~260 | ~35 min | Structural Debt | Mixed Archive runs; signing flakes |
| 20 | 500+ | 47 min | Failure Zone | Release-week peak; CI effectively down |
Source: GitHub Actions workflow run logs (14-day rolling window); 20-person point is dual-app release-week peak. Job volume outpaced headcount—driven by matrix expansion and PR Archive, not headcount alone.
Two details matter: daily jobs nearly doubled between 16 and 20 people (~260 to 500+), and Archive share rose from ~18% to 38%. Queue time is nonlinear in job volume: queue P95 had already multiplied before headcount doubled. That’s where expansion budgets go wrong.
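To make the nonlinearity concrete, here is a minimal Python sketch that divides each Figure 3 row by the eight-person baseline. The row values are read off the table with approximations taken at face value ("~100" as 100, "500+" as 500, "< 5 min" as 5, and the upper end of the 25 to 40 minute range). The function and field names are ours for illustration only.

```python
# Rows read off Figure 3. Approximate values ("~", "+", "<", ranges) are
# simplified to a single number; that simplification is an assumption.
FIGURE_3_ROWS = [
    # (team_size, macos_jobs_per_day, l1_queue_p95_min)
    (8, 100, 5),
    (12, 180, 40),
    (16, 260, 35),
    (20, 500, 47),
]


def growth_vs_baseline(rows):
    """Return each later row as multiples of the first (baseline) row."""
    base_team, base_jobs, base_queue = rows[0]
    growth = []
    for team, jobs, queue in rows[1:]:
        growth.append({
            "team": team,
            "headcount_x": round(team / base_team, 2),
            "jobs_x": round(jobs / base_jobs, 2),
            "queue_x": round(queue / base_queue, 2),
        })
    return growth


if __name__ == "__main__":
    for row in growth_vs_baseline(FIGURE_3_ROWS):
        print(row)
```

On these simplified numbers, the 20-person row comes out at 2.5× headcount, 5× daily jobs, and roughly 9× queue P95, while the 12-person row already shows about 8× queue P95 at only 1.5× headcount.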
Threshold model: 12 Warning / 16 Structural Debt / 20 Failure Zone
We condensed the case into a weekly-review framework—not a precise formula, but enough to decide whether Runner topology needs to move:
(PR frequency × matrix width × Archive share) ÷ effective Runner pool capacity
All three numerator terms come from GitHub Actions logs. “Effective Runner pool” is not registered machine count—it’s labeled concurrent Mac slots you can actually schedule. When an office mini is busy for local dev, the denominator shrinks instantly. When any two of the following hold, stop debating “one more Mac mini” and refactor the pipeline first:
- queue_wait_seconds P95 > 30 minutes and Archive-class job share > 30% → split L1/L2; stop full Archive on every PR
- macOS jobs per push P99 > 15 → matrix/path filter out of control; cut jobs before buying hardware
- L2 queue P95 > 40 minutes for 3 consecutive days with cache hit > 60% → structural scale-out; consider a dedicated release pool
This customer hit all three in the 20-person crash week. That is the Failure Zone, not a Warning-stage problem that “buy more minutes” can fix.
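The framework above can be sketched as a small weekly check. This is a hedged illustration, not the tooling used on the engagement: the class, field, and rule names are hypothetical, and we assume the first rule's queue P95 refers to the L1 queue. The thresholds are the ones listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WeeklySnapshot:
    """Hypothetical container for one week of CI metrics."""
    l1_queue_p95_min: float           # queue wait P95, minutes
    archive_share: float              # Archive-class share of macOS jobs, 0..1
    jobs_per_push_p99: float          # macOS jobs triggered per push, P99
    cache_hit_rate: float = 0.0       # 0..1
    l2_queue_p95_daily_min: List[float] = field(default_factory=list)


def pressure_ratio(pr_per_day, matrix_width, archive_share, effective_slots):
    """(PR frequency x matrix width x Archive share) / effective pool capacity."""
    if effective_slots <= 0:
        raise ValueError("effective_slots must be positive")
    return (pr_per_day * matrix_width * archive_share) / effective_slots


def triggered_rules(snapshot):
    """Return the names of the review rules that currently hold."""
    rules = []
    if snapshot.l1_queue_p95_min > 30 and snapshot.archive_share > 0.30:
        rules.append("split_l1_l2")
    if snapshot.jobs_per_push_p99 > 15:
        rules.append("cut_matrix")
    last_three = snapshot.l2_queue_p95_daily_min[-3:]
    if (len(last_three) == 3
            and all(v > 40 for v in last_three)
            and snapshot.cache_hit_rate > 0.60):
        rules.append("dedicated_release_pool")
    return rules


def should_refactor_first(snapshot):
    """Any two rules holding means: refactor the pipeline before buying Macs."""
    return len(triggered_rules(snapshot)) >= 2


if __name__ == "__main__":
    # Crash-week figures from the article; daily L2 values are not given
    # in the text, so that list is deliberately left empty here.
    crash_week = WeeklySnapshot(l1_queue_p95_min=47, archive_share=0.38,
                                jobs_per_push_p99=19)
    print(triggered_rules(crash_week), should_refactor_first(crash_week))
```

Even with the L2 series left out, the crash-week numbers trip the first two rules, which is already enough to meet the "any two" bar.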
GitHub Actions queue time: the first metric to break after 12 people
Many teams hear “CI slowed” and tune DerivedData. In this customer’s logs, queue time was what ran away—jobs sat in Queued. GitHub reports queue_wait_seconds separately from run duration; optimize only the latter and you “fix” compile while the org concurrency ceiling and pool mixing stay broken.
Org-level macOS concurrency is documented in GitHub Actions usage limits. After 12 people, afternoon PR storms meant Runners weren’t slow—jobs waited for a slot. The UI spins on Queued and gets blamed on network or Xcode. Split wait vs run using job execution time and queue duration.
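A minimal sketch of that split, assuming each exported job record carries `created_at`, `started_at`, and `completed_at` ISO 8601 timestamps. Treat those field names as an assumption and check them against the payload your exporter actually returns. The helper names and the nearest-rank percentile are our choices for illustration.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is normalised to +00:00."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def split_wait_and_run(job):
    """Return (queue_wait_seconds, run_seconds) for one job record."""
    created = parse_ts(job["created_at"])
    started = parse_ts(job["started_at"])
    completed = parse_ts(job["completed_at"])
    queue_wait = (started - created).total_seconds()
    run_time = (completed - started).total_seconds()
    return queue_wait, run_time


def nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative record: 47 minutes queued, 14 minutes running.
    job = {
        "created_at": "2025-01-01T10:00:00Z",
        "started_at": "2025-01-01T10:47:00Z",
        "completed_at": "2025-01-01T11:01:00Z",
    }
    wait_s, run_s = split_wait_and_run(job)
    print(wait_s / 60, run_s / 60)
```

Feeding every job's wait into `nearest_rank(waits, 95)` gives a queue P95 that can be charted next to run duration, so a Queued pile-up is not mistaken for slow compile.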
At 12 people, queue P95 was already 25–40 minutes. Label-separated fast/archive pools (instead of two minis in one pool) would have delayed the signing pile-up. For Runner topology choices, see elastic pool vs always-on nodes decision matrix.
Five scaling bottlenecks (Scaling Taxonomy)
Five labels we use internally for mobile CI postmortems—mapped to Figure 2’s pipeline.
① Capacity bottleneck · org macOS concurrency maxed
When hosted Runner concurrency fills, light jobs queue behind Archive—the fast lane gets blocked by slow work. Same curve as GitHub Actions queue time above. Minute packs help hosted queue only; moving L2 off hosted is more stable—org concurrency stays the first ceiling otherwise.
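The "fast lane blocked by slow work" effect is easy to reproduce in a toy model. The sketch below is a simplified first-come-first-served simulation, not a model of GitHub's scheduler. The arrival times and the 40-minute Archive duration are made up for illustration; the 8-minute light job matches the figure quoted earlier.

```python
import heapq


def fifo_waits(jobs, slots):
    """Simulate a FIFO pool.

    jobs: list of (arrival_minute, duration_minutes).
    Returns the wait in minutes for each job, in arrival order.
    """
    free_at = [0] * slots          # time at which each slot becomes free
    heapq.heapify(free_at)
    waits = []
    for arrival, duration in sorted(jobs):
        slot_free = heapq.heappop(free_at)   # earliest available slot
        start = max(arrival, slot_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + duration)
    return waits


# Hypothetical afternoon: two Archives land just before two light PR jobs.
ARCHIVE_JOBS = [(0, 40), (1, 40)]
LIGHT_JOBS = [(2, 8), (3, 8)]

if __name__ == "__main__":
    print("shared pool, 2 slots:", fifo_waits(ARCHIVE_JOBS + LIGHT_JOBS, slots=2))
    print("fast pool, 1 slot:   ", fifo_waits(LIGHT_JOBS, slots=1))
    print("archive pool, 1 slot:", fifo_waits(ARCHIVE_JOBS, slots=1))
```

With these made-up arrivals, both light jobs wait 38 minutes in the shared pool. Split into one fast slot and one archive slot, they wait 0 and 7 minutes, and the second Archive absorbs the delay instead.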
② Resource contention · office Mac mini as “spare dev machine”
Runners on engineer machines reachable over SSH eventually end up hosting local xcodebuild debug sessions too. The Runner and daily development then compete for CPU and disk, which shows up as timeouts or flaky compile errors. At 16 people, the two minis were “dead Friday, fine Monday”; the root cause was branch switches made over Remote Desktop.
③ Workflow design debt · full Archive on every PR
At eight people, “green PR = shippable” justified Archive + upload at the end of every pull request workflow. ~6 merges/day was tolerable; at 20, Archive-class jobs went from ~10% to 35%+ of volume, flooding an already crowded queue. Split L1 integration from L2 Archive—that landed post-handoff; see the companion piece’s job layering section.
④ Scaling explosion · monorepo matrix jobs grow superlinearly
At 20 people, a single push triggered up to 22 macOS jobs—2.5× people, ~4× jobs. Easiest collapse point to miss in product cadence.
⑤ Crypto / signing contention · release-day Keychain and codesign fights
Two self-hosted Macs running Archive could limp through on 16GB RAM. Add Match unlock, notarization, and TestFlight upload, and Keychain locks plus codesign contention start producing the same error code at random. Release week saw four consecutive “certificate not found” failures that passed on a local Mac rerun: a classic multi-job fight over the signing environment, not expired certs. See Apple Xcode Signing and Capabilities.
“Fixes” they tried—and why things got worse
More GitHub minute packs: Eased hosted queue only; Archive share unchanged—money spent, P95 still over 60 minutes.
Third office Mac mini: Still no fast/archive labels—three machines in one pool; signing failure rate went up.
No Friday merges: Artificially capped PR rate; release week concentrated spikes higher.
timeout-minutes: 180 on every job: Queued time doesn’t count toward timeout—more jobs hog slots longer.
Only partial win: move Release to a separate machine—but it was still an office mini; nobody unlocked Keychain overnight; L2 queue backed up Monday.
Three numbers dashboards should show—but didn’t
Before collapse, Grafana had “success rate” and “average duration.” Week one post-handoff we added three metrics and found 80% of the problem (GitHub’s monitoring docs describe how to split out queue_wait_seconds):
- queue_wait_seconds P95 (by L0/L1/L2 bucket)—separate “slow compile” from “waiting in line”
- Archive-class job share (weekly)—whether workflow design ran away
- macOS jobs triggered per push P99—whether the matrix exploded
Crash-week numbers: L1 queue_wait P95 47 minutes, Archive share 38%, jobs per push P99 19. Any two over threshold → change topology before opening a PO.
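The second and third metrics can be derived from a flat export of job records. A minimal sketch, assuming each record has a `tier` field ("L0", "L1", or "L2") and a `head_sha` identifying the push; both field names are hypothetical and depend on how your own export tags jobs.

```python
from collections import Counter


def nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def archive_share(jobs):
    """Fraction of macOS jobs that are Archive-class (tier L2)."""
    if not jobs:
        return 0.0
    archive = sum(1 for job in jobs if job["tier"] == "L2")
    return archive / len(jobs)


def jobs_per_push_p99(jobs):
    """P99 of the number of macOS jobs triggered by a single push."""
    per_push = Counter(job["head_sha"] for job in jobs)
    return nearest_rank(list(per_push.values()), 99)


if __name__ == "__main__":
    # Tiny illustrative sample: push "a" fans out to 3 jobs, push "b" to 1.
    sample = [
        {"head_sha": "a", "tier": "L1"},
        {"head_sha": "a", "tier": "L1"},
        {"head_sha": "a", "tier": "L2"},
        {"head_sha": "b", "tier": "L1"},
    ]
    print(archive_share(sample), jobs_per_push_p99(sample))
```

Run weekly over the full job export, these two numbers sit next to queue wait P95 on the dashboard and make a runaway matrix or creeping Archive share visible before release week.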
Triage order: fix topology before counting Macs
First two weeks after handoff, no new hardware—four changes only:
- Remove Archive from PR workflows; L2 only on nightly + release branches
- Tighten path filters—README/backend paths no longer fire the iOS matrix
- Retire office mini Runners to stop fighting local dev
- Hosted Runners for L1 spikes only; all L2 on dedicated nodes
Those four alone dropped release-week queue_wait P95 from 47 minutes to 22 minutes. That was still not enough, but it proved the collapse was topology and workflow, not a magic Mac count. Cloud Mac fast/release pools and capacity validation came next; for elastic vs always-on Runners, see GitHub Actions self-hosted macOS Runner: elastic pool vs always-on decision matrix.
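The path-filter change is the easiest of the four to reason about offline. The sketch below approximates the decision in Python with `fnmatch`, which is not the same glob dialect as GitHub's workflow path filters, and the patterns are hypothetical placeholders rather than this customer's repo layout. It is useful for replaying past pushes to estimate how many iOS matrix runs a tighter filter would have skipped.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: only changes under these paths should fire the
# iOS matrix. In fnmatch, "*" also matches "/" so "ios/*" covers subfolders.
IOS_PATTERNS = ("ios/*", "fastlane/*")


def should_run_ios_matrix(changed_files, patterns=IOS_PATTERNS):
    """True if at least one changed file matches an iOS-relevant pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


if __name__ == "__main__":
    print(should_run_ios_matrix(["README.md", "backend/api/server.go"]))
    print(should_run_ios_matrix(["README.md", "ios/App/Login.swift"]))
```

A README-only or backend-only push returns False here, which is the behaviour the tightened filters were meant to enforce.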
Three signals if you’re growing past 12 toward 20
If any of these show up, schedule a 30-minute CI review that week—don’t wait for release week:
- Developers ask “can we merge without CI?”
- Sticky notes on office Macs: “don’t reboot—Runner running”
- TestFlight builds succeed but “waiting for upload slot” is normal
Behind these signals you’ve usually hit at least two of the five failure modes above.
Postmortem Summary (quotable)
Root cause
Headcount grew without CI topology scaling: PR rate and matrix width rose while Runners stayed “hosted + office mini mixed,” and PR workflows kept running L2 Archive—queue and signing saturated together.
Contributing factors
- No queue_wait_seconds monitoring: Queued mistaken for slow compile
- Second Mac mini without fast/archive labels: Archive and PR shared one pool
- Monorepo path filter too wide—README edits triggered the iOS matrix
- Release-week job spikes not tied to procurement or topology changes
Fixes tried (ineffective or harmful)
- More GitHub minute packs—hosted queue only; Archive share unchanged
- Third office mini in mixed pool—signing failure rate increased
- No Friday merges—spikes moved to release week
- timeout-minutes: 180 on every job; Queued time is excluded from the timeout
What worked (first two weeks post-handoff)
- Archive off PRs; L2 on nightly + release branches only
- Tighter path filters; office mini Runners retired
- L2 on dedicated release nodes—queue P95 47min → 22min
FAQ
GitHub Actions stuck on Queued—not slow compile?
Check queue_wait_seconds vs run duration. High Queued share means queue or Runner pool, not compile. Use Figure 2’s Queue node and Figure 3’s healthy baseline.
At what team size does iOS CI usually break?
This case: 12 Warning / 16 Structural Debt / 20 Failure Zone. Metrics beat headcount: queue P95 > 30 minutes and Archive > 30% → refactor pipeline first.
Scale hardware or fix the pipeline first?
Job layering, stop PR Archive, tighten matrix first—otherwise new machines only delay the next release-week queue spike.
Is this the same article as “how many Macs for 500 runs/day”?
No. This piece is why it broke (Failure Model); the companion is post-crash sizing and machine count. Read this one and diagnose first, then read the sizing piece.
Failure Zone prescription: L2 isolation first, then validate the Runner pool
If all three below hold, you’re in the Failure Zone (matches Figure 1 bottom):
- queue_wait_seconds P95 > 30 minutes
- Archive-class job share > 30%
- macOS jobs per push P99 > 15
Engineering actions first (before procurement)
1. L2 isolation—remove Archive from PRs; release/nightly only on macos-archive labels.
2. Dedicated Mac pool—move L2 off office dev machines and mixed hosted pools; L1 can stay on hosted or elastic self-hosted for spikes.
3. Validate: run one release workflow on an isolated release node; compare queue P95 to the healthy-baseline table (< 8 min tier).
Machine count and fast/release pool math: 3 Cloud Macs for 500 iOS Builds per Day; elastic vs always-on Runners: GitHub Actions self-hosted macOS Runner decision matrix.
For a one-day PoC on a release-only node to validate L2 isolation, start from Mac cloud plans or the VPSSpark homepage to provision a Cloud Mac with an isolated Archive queue—that’s topology validation, not “buy the full cluster on day one.”