VPSSpark Blog

Where iOS CI/CD breaks first when your team reaches 20 people

Server Notes · 2026.06.04 · ~16 min read

Engineering team reviewing iOS CI build queue and scaling failure
Mid-size iOS team CI review: queue time vs compile duration

This is an iOS CI/CD Scaling Failure Model—not a capacity calculator. Before we took over, the team grew from 8 to 20 engineers over fourteen months while the Runner topology barely moved: release-week PRs went red, builds sat in Queued for hours, and TestFlight slipped by two days. Below we answer why CI slows when headcount rises, and which segment of the pipeline actually breaks.

If you need “how many Macs for 500 runs/day,” see the companion piece 3 Cloud Macs for 500 iOS Builds per Day—that’s the post-handoff rollout; this article is why it collapsed before we arrived.

Bottom line: When a 20-person team’s CI falls apart, “not enough Macs” is rarely the whole story. In this case five bottlenecks stacked in sequence: concurrency maxed out → Runners fought with dev machines → PRs ran Archive → matrix jobs exploded → signing contended. Warning signs showed up at 12–16 people; by 20, with no queue split, they were in the Failure Zone.

At a glance: how “more people” slides CI into the Failure Zone

Figure 1 is the mechanism overview for the whole piece—better for a postmortem opener or team onboarding than pipeline minutiae. Collapse is rarely a single point; it’s the cascade below.

Figure 1 · CI collapse mechanism (Scaling Failure Cascade)

PR merge rate climbs (team scaling · nonlinear start)
Matrix / job count explodes (monorepo · path filter broken)
Queue backlog (queue_wait_seconds ↑)
Runner pool saturated (hosted concurrency · minis occupied)
Archive overload (L2 flooding PR queue)
Signing / Keychain contention (codesign contention)
Failure Zone (queue P95 > 30 min · Archive > 30%)

This customer hit Queue pain at 12 people; Runner and Archive lit up together at 16; at 20 they landed in the red zone at the bottom. Compare to the healthy baseline below.

iOS CI/CD architecture: bottleneck map from PR to TestFlight

Figure 2 is the execution path for a macOS job (L0/L1/L2), with the two places that saturated first in this case. Use it to tell compile slowness from jobs that never left the queue.

Figure 2 · iOS CI/CD pipeline: L0 / L1 / L2 layers and typical bottlenecks (this customer, 2024–2025)

PR Push (path filter · matrix)
Queue (queue_wait_seconds)
Runner Pool (hosted + self-hosted)
L0/L1 Build (integration · unit tests)
L2 Archive (xcodebuild archive)
Signing (Keychain · Match)
Upload (TestFlight)
↑ Bottleneck · Queue (org macOS concurrency maxed)
↑ Contention · Runner Pool + Signing (office Macs and Keychain fighting)

L0: lint / light checks; L1: PR integration build and Simulator tests; L2: Archive, notarization, upload. This customer broke on Queue and mixed Runner pools—not L1 compile speed. Matrix blow-up is often misread as “Xcode got slower.”

Starting point: CI that “worked” at 8 people—and didn’t scale with the team

Mid-2024, the setup was textbook:

The CTO’s line was: “CI isn’t the bottleneck—don’t spend time on it.” True at eight people; after headcount doubled, CI became a hidden tax because nobody did capacity planning—only feature teams adding workflows.

Healthy baseline: what “normal” looks like

Staring only at the 20-person crash week makes it easy to think “big team = buy lots of Macs.” We add a healthy baseline vs crash week table—numbers from this customer’s eight-person era plus ranges we’ve seen on mid-size iOS teams (not an industry standard, but enough for an internal review).

Metric | Healthy baseline (8–12 people · sane topology) | This customer · 20-person crash week
L1 queue_wait P95 | < 8 min | 47 min
L1 run duration P95 | 12–18 min | 14 min (compile did not slow down)
Archive-class job share | < 15% | 38%
macOS jobs per push · P99 | < 6 | 19
macOS jobs / day | 80–180 | 500+ (release week)
Runner topology | L1/L2 label-separated pools | hosted + office mini mixed

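The telling contrast: run duration barely moved in the crash week, while queue time went from under 8 minutes to 47. That is roughly 6× the healthy ceiling, and nearly 10× the under-5-minute figure from the eight-person era. The problem was queuing and workflows, not Xcode compile throughput. That’s where the phrase “CI got slow” misleads teams.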

Timeline: three warnings, three times dismissed as “flaky”

First (12 people, +4 iOS): PR merge rate went from ~6/day to ~14/day. Hosted Runner build queue hit 25–40 minute waits in the afternoon. Team reaction: “GitHub’s having a bad day.” Nobody pulled the queue_wait_seconds curve.

Second (16 people, +2 backend on the same repo): They added a second office Mac mini, both registered as self-hosted with no label-based queue split. Archive and PR builds shared one pool—light jobs that were ~8 minutes sometimes took 35; heavy notarization jobs failed at random. Reaction: “Try another disk.”

Third (20 people, dual-app release collision): Release week macOS jobs spiked from ~120/day to 280+ in a single day. Slack #ios-ci had someone on-call around the clock. Reaction: “Procurement is approving eight Mac minis.” That is the same eight-machine plan we dissect in the 500 runs/day case study; nobody had broken out jobs by type yet.

Figure 3 · Team size vs macOS jobs/day vs risk tier (this customer, 2024–2025 measured)

Team size | macOS jobs / day | GitHub Actions queue time (L1 P95) | Risk tier | Stage notes
8 | ~100 | < 5 min | Baseline | Single office mini sufficient
12 | ~180 | 25–40 min | Warning Zone | Queue first sustained over threshold
16 | ~260 | ~35 min | Structural Debt | Mixed Archive runs; signing flakes
20 | 500+ | 47 min | Failure Zone | Release-week peak; CI effectively down

Chart: macOS jobs / day (y-axis, 0–600) by team size (x-axis: 8, 12, 16, 20), plotted at 100, 180, 260, 500+.

Source: GitHub Actions workflow run logs (14-day rolling window); 20-person point is dual-app release-week peak. Job volume outpaced headcount—driven by matrix expansion and PR Archive, not headcount alone.

Two details matter: jobs jumped sharply from 16→20 people, and Archive share rose from ~18% to 38%. Queue time vs job volume is nonlinear: queue time had multiplied long before headcount doubled. That’s where expansion budgets go wrong.

Threshold model: 12 Warning / 16 Structural Debt / 20 Failure Zone

We condensed the case into a weekly-review framework—not a precise formula, but enough to decide whether Runner topology needs to move:

CI collapse risk ∝ (PR frequency × matrix width × Archive share) ÷ effective Runner pool capacity

All three numerator terms come from GitHub Actions logs. “Effective Runner pool” is not registered machine count—it’s labeled concurrent Mac slots you can actually schedule. When an office mini is busy for local dev, the denominator shrinks instantly. When any two of the following hold, stop debating “one more Mac mini” and refactor the pipeline first:

  • queue_wait_seconds P95 > 30 minutes and Archive-class job share > 30% → split L1/L2; stop full Archive on every PR
  • macOS jobs per push P99 > 15 → matrix/path filter out of control; cut jobs before buying hardware
  • L2 queue P95 > 40 minutes for 3 consecutive days with cache hit > 60% → structural scale-out; consider a dedicated release pool

This customer hit all three in the 20-person crash week: that is the Failure Zone, not a Warning-stage problem that “buy more minutes” can fix.
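The three review rules above can be sketched as a small check. This is a minimal illustration, not the team's actual tooling: the `WeeklyCiMetrics` class, its field names, and `triggered_rules` are hypothetical. The crash-week inputs are the three figures quoted in this article; L2 daily queue figures and cache hit rate are not given here, so the third rule is left unevaluated in the example.

```python
from dataclasses import dataclass, field

@dataclass
class WeeklyCiMetrics:
    queue_wait_p95_min: float            # queue_wait_seconds P95, in minutes
    archive_share: float                 # Archive-class jobs / all macOS jobs (0..1)
    jobs_per_push_p99: float             # macOS jobs triggered per push, P99
    # Daily L2 queue P95 in minutes, oldest first (empty = not measured).
    l2_queue_p95_min_by_day: list[float] = field(default_factory=list)
    cache_hit_rate: float = 0.0          # 0..1 (0.0 = not measured)

def triggered_rules(m: WeeklyCiMetrics) -> list[str]:
    """Return the names of the review rules that currently hold."""
    rules = []
    if m.queue_wait_p95_min > 30 and m.archive_share > 0.30:
        rules.append("split L1/L2, stop full Archive on every PR")
    if m.jobs_per_push_p99 > 15:
        rules.append("cut matrix/path-filter jobs before buying hardware")
    last3 = m.l2_queue_p95_min_by_day[-3:]
    if len(last3) == 3 and all(d > 40 for d in last3) and m.cache_hit_rate > 0.60:
        rules.append("structural scale-out: consider a dedicated release pool")
    return rules

# Crash-week figures from this article; L2 data intentionally omitted.
crash_week = WeeklyCiMetrics(queue_wait_p95_min=47, archive_share=0.38, jobs_per_push_p99=19)
rules = triggered_rules(crash_week)
refactor_first = len(rules) >= 2   # "any two hold" means refactor before procurement
print(rules, refactor_first)
```

Even with the third rule unmeasured, two rules already hold for the crash week, which is enough to put pipeline refactoring ahead of hardware.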

GitHub Actions queue time: the first metric to break after 12 people

Many teams hear “CI slowed” and tune DerivedData. In this customer’s logs, queue time was what ran away—jobs sat in Queued. GitHub reports queue_wait_seconds separately from run duration; optimize only the latter and you “fix” compile while the org concurrency ceiling and pool mixing stay broken.

Org-level macOS concurrency is documented in GitHub Actions usage limits. After 12 people, afternoon PR storms meant Runners weren’t slow—jobs waited for a slot. The UI spins on Queued and gets blamed on network or Xcode. Split wait vs run using job execution time and queue duration.
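As a rough illustration of that split, the sketch below derives wait time and run time from three timestamps per job. It assumes each job record carries `created_at`, `started_at`, and `completed_at` as ISO 8601 strings, and it treats `started_at - created_at` as an approximation of queue wait. The `split_wait_and_run` helper and the sample record are hypothetical.

```python
from datetime import datetime

def parse_ts(value: str) -> datetime:
    # Map a trailing "Z" (UTC) to an explicit offset so older Python 3 versions parse it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def split_wait_and_run(job: dict) -> tuple[float, float]:
    """Return (queue_wait_seconds, run_seconds) for one job record."""
    created = parse_ts(job["created_at"])
    started = parse_ts(job["started_at"])
    completed = parse_ts(job["completed_at"])
    wait = (started - created).total_seconds()
    run = (completed - started).total_seconds()
    return wait, run

# Hypothetical job: queued for 47 minutes, then ran for 14 minutes.
sample_job = {
    "created_at": "2025-03-10T14:00:00Z",
    "started_at": "2025-03-10T14:47:00Z",
    "completed_at": "2025-03-10T15:01:00Z",
}
wait_s, run_s = split_wait_and_run(sample_job)
print(f"wait={wait_s / 60:.0f} min, run={run_s / 60:.0f} min")
```

A job like this one looks "slow" in the UI for an hour, yet only the last 14 minutes involve the compiler.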

At 12 people, queue P95 was already 25–40 minutes. Label-separated fast/archive pools (instead of two minis in one pool) would have delayed the signing pile-up. For Runner topology choices, see elastic pool vs always-on nodes decision matrix.

Five scaling bottlenecks (Scaling Taxonomy)

Five labels we use internally for mobile CI postmortems—mapped to Figure 2’s pipeline.

① Capacity bottleneck · org macOS concurrency maxed

When hosted Runner concurrency fills, light jobs queue behind Archive—the fast lane gets blocked by slow work. Same curve as GitHub Actions queue time above. Minute packs help hosted queue only; moving L2 off hosted is more stable—org concurrency stays the first ceiling otherwise.
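To make the blocked fast lane concrete, here is a toy FIFO simulation comparing one mixed two-slot pool with two label-separated single-slot pools. It is a sketch under simplifying assumptions: strict first-come scheduling, a hypothetical 40-minute Archive duration, and the ~8-minute light job mentioned in this article. The function names and job list are illustrative only.

```python
import heapq

def simulate_fifo(jobs, slots):
    """Simulate a FIFO pool. jobs: (arrival_min, duration_min, kind) tuples.

    Returns a list of (kind, wait_min) in arrival order.
    """
    free_at = [0] * slots            # min-heap: when each slot next becomes free
    heapq.heapify(free_at)
    waits = []
    for arrival, duration, kind in sorted(jobs):
        slot_free = heapq.heappop(free_at)   # earliest available slot
        start = max(arrival, slot_free)
        heapq.heappush(free_at, start + duration)
        waits.append((kind, start - arrival))
    return waits

def max_wait(waits, kind):
    return max(w for k, w in waits if k == kind)

# Two Archives land first; three light PR jobs arrive just behind them.
jobs = [(0, 40, "archive"), (0, 40, "archive"),
        (1, 8, "light"), (2, 8, "light"), (3, 8, "light")]

mixed = simulate_fifo(jobs, slots=2)
fast = simulate_fifo([j for j in jobs if j[2] == "light"], slots=1)
archive = simulate_fifo([j for j in jobs if j[2] == "archive"], slots=1)

print("mixed pool, worst light wait:", max_wait(mixed, "light"))     # 45
print("split pool, worst light wait:", max_wait(fast, "light"))      # 14
print("split pool, worst archive wait:", max_wait(archive, "archive"))  # 40
```

The split does not create capacity: the second Archive waits longer in its own pool. What it buys is a PR lane whose wait no longer depends on Archive traffic.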

② Resource contention · office Mac mini as “spare dev machine”

When a Runner lives on a machine engineers can SSH into, someone eventually runs a local xcodebuild debug session on it. Runner jobs and daily dev work then fight for CPU and disk, which shows up as timeouts or flaky compile errors. At 16 people, the two minis were “dead Friday, fine Monday”; the root cause was branch switches made over Remote Desktop.

③ Workflow design debt · full Archive on every PR

At eight people, “green PR = shippable” justified Archive + upload at the end of every pull request workflow. ~6 merges/day was tolerable; at 20, Archive-class jobs went from ~10% to 35%+ of volume, flooding an already crowded queue. Split L1 integration from L2 Archive—that landed post-handoff; see the companion piece’s job layering section.

④ Scaling explosion · monorepo matrix jobs grow superlinearly

At 20 people, a single push triggered up to 22 macOS jobs: headcount had grown 2.5× while jobs grew ~4×. This is the easiest collapse point to miss at normal product cadence.
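One way to see why job count outruns headcount: each triggered workflow contributes the product of its matrix axis sizes, so adding an app or an axis multiplies rather than adds. The sketch below counts jobs under that assumption. The workflow names, axes, and values are hypothetical and are not this customer's actual matrix.

```python
from math import prod

def jobs_per_push(workflows: dict[str, dict[str, list[str]]]) -> int:
    """Sum over triggered workflows of the product of their matrix axis sizes.

    A workflow with no matrix axes counts as one job (prod of nothing is 1).
    """
    return sum(prod(len(values) for values in axes.values())
               for axes in workflows.values())

# Hypothetical early setup: one app, one scheme, two destinations.
early = {
    "app_a": {"scheme": ["AppA"], "destination": ["iPhone", "iPad"]},
}

# Hypothetical later setup: two apps, each with 2 schemes x 2 destinations x 2 Xcode versions.
later = {
    "app_a": {"scheme": ["AppA", "AppA-Staging"],
              "destination": ["iPhone", "iPad"],
              "xcode": ["15", "16"]},
    "app_b": {"scheme": ["AppB", "AppB-Staging"],
              "destination": ["iPhone", "iPad"],
              "xcode": ["15", "16"]},
}

print(jobs_per_push(early))  # 2
print(jobs_per_push(later))  # 16
```

In this made-up example nobody added "eight times the CI": one app and two small axes did, which is why the per-push count is worth tracking on its own.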

⑤ Crypto / signing contention · release-day Keychain and codesign fights

Two self-hosted Macs with 16GB RAM could limp through Archive alone. Add Match unlock, notarization, and TestFlight upload, and Keychain locks plus codesign contention produce the same error code at random. In release week there were four consecutive “certificate not found” failures that passed when rerun on a local Mac: a classic multi-job fight over the signing environment, not expired certs. See Apple Xcode Signing and Capabilities.

“Fixes” they tried—and why things got worse

More GitHub minute packs: Eased hosted queue only; Archive share unchanged—money spent, P95 still over 60 minutes.

Third office Mac mini: Still no fast/archive labels—three machines in one pool; signing failure rate went up.

No Friday merges: Artificially capped PR rate; release week concentrated spikes higher.

timeout-minutes: 180 on every job: Queued time doesn’t count toward timeout—more jobs hog slots longer.

Only partial win: move Release to a separate machine—but it was still an office mini; nobody unlocked Keychain overnight; L2 queue backed up Monday.

Three numbers dashboards should show—but didn’t

Before collapse, Grafana had “success rate” and “average duration.” Week one post-handoff we added three metrics and found 80% of the problem (how to split queue_wait_seconds in GitHub’s monitoring docs):

  • queue_wait_seconds P95 (by L0/L1/L2 bucket)—separate “slow compile” from “waiting in line”
  • Archive-class job share (weekly)—whether workflow design ran away
  • macOS jobs triggered per push P99—whether the matrix exploded

Crash-week numbers: L1 queue_wait P95 47 minutes, Archive share 38%, jobs per push P99 19. Any two over threshold → change topology before opening a PO.
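A minimal sketch of how those three numbers could be computed from exported job records. It assumes each record has a `push_id`, a `layer` bucket (L0/L1/L2), and a precomputed `queue_wait_seconds`; it treats every L2 job as Archive-class and uses a nearest-rank percentile. The record shape, function names, and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def percentile(values, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]

def queue_wait_p95_by_layer(jobs):
    by_layer = defaultdict(list)
    for job in jobs:
        by_layer[job["layer"]].append(job["queue_wait_seconds"])
    return {layer: percentile(waits, 95) for layer, waits in by_layer.items()}

def archive_share(jobs):
    # Assumption: every L2 job is Archive-class.
    return sum(1 for job in jobs if job["layer"] == "L2") / len(jobs)

def jobs_per_push_p99(jobs):
    counts = Counter(job["push_id"] for job in jobs)
    return percentile(list(counts.values()), 99)

# Tiny hypothetical sample: push "a" fired three jobs, push "b" fired one.
jobs = [
    {"push_id": "a", "layer": "L1", "queue_wait_seconds": 120},
    {"push_id": "a", "layer": "L1", "queue_wait_seconds": 300},
    {"push_id": "a", "layer": "L2", "queue_wait_seconds": 2400},
    {"push_id": "b", "layer": "L1", "queue_wait_seconds": 60},
]

print(queue_wait_p95_by_layer(jobs))  # {'L1': 300, 'L2': 2400}
print(archive_share(jobs))            # 0.25
print(jobs_per_push_p99(jobs))        # 3
```

Bucketing by layer is the part that matters: a single blended P95 would hide whether the wait sits in the PR lane or the Archive lane.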

Triage order: fix topology before counting Macs

First two weeks after handoff, no new hardware—four changes only:

  • Remove Archive from PR workflows; L2 only on nightly + release branches
  • Tighten path filters—README/backend paths no longer fire the iOS matrix
  • Retire office mini Runners to stop fighting local dev
  • Hosted Runners for L1 spikes only; all L2 on dedicated nodes

Those four alone dropped release-week queue_wait P95 from 47 minutes to 22 minutes. That is still not enough, but it is proof the collapse was topology and workflow, not a magic Mac count. Cloud Mac fast/release pools and capacity validation came next; for elastic vs always-on Runners, see GitHub Actions self-hosted macOS Runner: elastic pool vs always-on decision matrix.
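The path-filter change is the easiest of the four to reason about in isolation. In a real setup this logic would normally live in the workflow's own trigger path filter; the Python below only models the decision so it can be checked against a list of changed files. The path prefixes and the Markdown exclusion are hypothetical examples, not this customer's repository layout.

```python
# Hypothetical monorepo layout: only these prefixes can affect the iOS build.
IOS_RELEVANT_PREFIXES = ("ios/", "shared/", "fastlane/")
# Hypothetical exclusion: documentation edits never need a build.
IGNORED_SUFFIXES = (".md",)

def should_run_ios_matrix(changed_files):
    """Return True if any changed file could affect the iOS build."""
    for path in changed_files:
        if path.endswith(IGNORED_SUFFIXES):
            continue                      # docs-only change inside any directory
        if path.startswith(IOS_RELEVANT_PREFIXES):
            return True
    return False

docs_and_backend = ["README.md", "backend/api/server.go"]
ios_change = ["backend/api/server.go", "ios/App/Login.swift"]
ios_docs_only = ["ios/README.md"]

print(should_run_ios_matrix(docs_and_backend))  # False
print(should_run_ios_matrix(ios_change))        # True
print(should_run_ios_matrix(ios_docs_only))     # False
```

Running a check like this over a few weeks of historical pushes gives a quick estimate of how many matrix runs a tighter filter would have skipped before the workflow itself is changed.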

Three signals if you’re growing past 12 toward 20

If any of these show up, schedule a 30-minute CI review that week—don’t wait for release week:

  • Developers ask “can we merge without CI?”
  • Sticky notes on office Macs: “don’t reboot—Runner running”
  • TestFlight builds succeed but “waiting for upload slot” is normal

Behind these signals you’ve usually hit at least two of the five failure modes above.

Postmortem Summary (quotable)

Root cause

Headcount grew without CI topology scaling: PR rate and matrix width rose while Runners stayed “hosted + office mini mixed,” and PR workflows kept running L2 Archive—queue and signing saturated together.

Contributing factors

  • No queue_wait_seconds monitoring—Queued mistaken for slow compile
  • Second Mac mini without fast/archive labels—Archive and PR shared one pool
  • Monorepo path filter too wide—README edits triggered the iOS matrix
  • Release-week job spikes not tied to procurement or topology changes

Fixes tried (ineffective or harmful)

  • More GitHub minute packs—hosted queue only; Archive share unchanged
  • Third office mini in mixed pool—signing failure rate increased
  • No Friday merges—spikes moved to release week
  • timeout-minutes: 180—Queued time excluded from timeout

What worked (first two weeks post-handoff)

  • Archive off PRs; L2 on nightly + release branches only
  • Tighter path filters; office mini Runners retired
  • L2 on dedicated release nodes—queue P95 47min → 22min

FAQ

GitHub Actions stuck on Queued—not slow compile?

Check queue_wait_seconds vs run duration. A high Queued share points to the queue or Runner pool, not compile. Compare against Figure 2’s Queue node and the healthy-baseline table.

At what team size does iOS CI usually break?

This case: 12 Warning / 16 Structural Debt / 20 Failure Zone. Metrics beat headcount: queue P95 > 30 minutes and Archive > 30% → refactor pipeline first.

Scale hardware or fix the pipeline first?

Job layering, stop PR Archive, tighten matrix first—otherwise new machines only delay the next release-week queue spike.

Is this the same article as “how many Macs for 500 runs/day”?

No. This piece is why it broke (Failure Model); the companion is post-crash sizing and machine count. Read this + diagnose, then read sizing.

Failure Zone prescription: L2 isolation first, then validate the Runner pool

If all three below hold, you’re in the Failure Zone (matches Figure 1 bottom):

  • queue_wait_seconds P95 > 30 minutes
  • Archive-class job share > 30%
  • macOS jobs per push P99 > 15

Engineering actions first (before procurement)

1. L2 isolation: remove Archive from PRs; release/nightly only on macos-archive labels.

2. Dedicated Mac pool: move L2 off office dev machines and mixed hosted pools; L1 can stay on hosted or elastic self-hosted for spikes.

3. Validate: run one release workflow on an isolated release node; compare queue P95 to the healthy-baseline table (< 8 min tier).

Machine count and fast/release pool math: 3 Cloud Macs for 500 iOS Builds per Day; elastic vs always-on Runners: GitHub Actions self-hosted macOS Runner decision matrix.

For a one-day PoC on a release-only node to validate L2 isolation, start from Mac cloud plans or the VPSSpark homepage to provision a Cloud Mac with an isolated Archive queue—that’s topology validation, not “buy the full cluster on day one.”
