Why iOS CI/CD Breaks at 20 Developers | GitHub Actions Queue + Build Scaling Case Study

This is an iOS CI/CD Scaling Failure Model—not a capacity calculator. Before we took over, the team grew from 8 to 20 engineers over fourteen months while the Runner topology barely moved: release-week PRs went red, builds sat in Queued for hours, and TestFlight slipped by two days. Below we answer why CI slows when headcount rises, and which segment of the pipeline actually breaks.

If you need “how many Macs for 500 runs/day,” see the companion piece 3 Cloud Macs for 500 iOS Builds per Day—that’s the post-handoff rollout; this article is why it collapsed before we arrived.

Bottom line: When a 20-person team’s CI falls apart, “not enough Macs” is rarely the whole story. In this case five bottlenecks stacked in sequence: concurrency maxed out → Runners fought with dev machines → PRs ran Archive → matrix jobs exploded → signing contended. Warning signs showed up at 12–16 people; by 20, with no queue split, they were in the Failure Zone.

At a glance: how “more people” slides CI into the Failure Zone

Figure 1 is the mechanism overview for the whole piece—better for a postmortem opener or team onboarding than pipeline minutiae. Collapse is rarely a single point; it’s the cascade below.

Figure 1 · CI collapse mechanism (Scaling Failure Cascade)

1. PR merge rate climbs (team scaling · nonlinear start)

2. Matrix / job count explodes (monorepo · path filter broken)

3. Queue backlog (queue_wait_seconds ↑)

4. Runner pool saturated (hosted concurrency · minis occupied)

5. Archive overload (L2 flooding PR queue)

6. Signing / Keychain contention (codesign contention)

7. Failure Zone (queue P95 > 30 min · Archive > 30%)

This customer hit Queue pain at 12 people; Runner and Archive lit up together at 16; at 20 they landed in the red zone at the bottom. Compare to the healthy baseline below.

iOS CI/CD architecture: bottleneck map from PR to TestFlight

Figure 2 is the execution path for a macOS job (L0/L1/L2), with the two places that saturated first in this case. Use it to tell compile slowness from jobs that never left the queue.

Figure 2 · iOS CI/CD pipeline: L0 / L1 / L2 layers and typical bottlenecks (this customer, 2024–2025)

1. PR Push (path filter · matrix)

2. Queue (queue_wait_seconds)

3. Runner Pool (hosted + self-hosted)

4. L0/L1 Build (integration · unit tests)

5. L2 Archive (xcodebuild archive)

6. Signing (Keychain · Match)

7. Upload (TestFlight)

Bottleneck: Queue (org macOS concurrency maxed). Contention: Runner Pool + Signing (office Macs and Keychain fighting).

L0: lint / light checks; L1: PR integration build and Simulator tests; L2: Archive, notarization, upload. This customer broke on Queue and mixed Runner pools—not L1 compile speed. Matrix blow-up is often misread as “Xcode got slower.”

Starting point: CI that “worked” at 8 people—and didn’t scale with the team

Mid-2024, the setup was textbook:

One office Mac mini M2 doubling as a self-hosted Mac Runner (see GitHub’s self-hosted runner docs)
Everything else on GitHub-hosted macOS runners
Monorepo with two apps, one YAML workflow for all
Roughly 80–120 macOS jobs per day; queue P95 under five minutes

The CTO’s line was: “CI isn’t the bottleneck—don’t spend time on it.” True at eight people; after headcount doubled, CI became a hidden tax because nobody did capacity planning—only feature teams adding workflows.

Healthy baseline: what “normal” looks like

Staring only at the 20-person crash week makes it easy to think “big team = buy lots of Macs.” We add a healthy baseline vs crash week table—numbers from this customer’s eight-person era plus ranges we’ve seen on mid-size iOS teams (not an industry standard, but enough for an internal review).

Metric	Healthy baseline (8–12 people · sane topology)	This customer · 20-person crash week
L1 queue_wait P95	< 8 min	47 min
L1 run duration P95	12–18 min	14 min (compile did not slow down)
Archive-class job share	< 15%	38%
macOS jobs per push · P99	< 6	19
macOS jobs / day	80–180	500+ (release week)
Runner topology	L1/L2 label-separated pools	hosted + office mini mixed

The telling contrast: run duration barely moved in the crash week; queue time jumped nearly 10×—the problem was queuing and workflows, not Xcode compile throughput. That’s where the phrase “CI got slow” misleads teams.

Timeline: three warnings, three times dismissed as “flaky”

First (12 people, +4 iOS): PR merge rate went from ~6/day to ~14/day. Hosted Runner build queue hit 25–40 minute waits in the afternoon. Team reaction: “GitHub’s having a bad day.” Nobody pulled the queue_wait_seconds curve.

Second (16 people, +2 backend on the same repo): They added a second office Mac mini, both registered as self-hosted with no label-based queue split. Archive and PR builds shared one pool—light jobs that were ~8 minutes sometimes took 35; heavy notarization jobs failed at random. Reaction: “Try another disk.”

Third (20 people, dual-app release collision): Release week macOS jobs spiked from ~120/day to 280+ in a single day. Slack #ios-ci had someone on-call around the clock. Reaction: “Procurement is approving eight Mac minis.”—the same eight-machine plan we dissect in the 500 runs/day case study; nobody had broken out jobs by type yet.

Figure 3 · Team size vs macOS jobs/day vs risk tier (this customer, 2024–2025 measured)

Team size	macOS jobs / day	GitHub Actions queue time (L1 P95)	Risk tier	Stage notes
8	~100	< 5 min	Baseline	Single office mini sufficient
12	~180	25–40 min	Warning Zone	Queue first sustained over threshold
16	~260	~35 min	Structural Debt	Mixed Archive runs; signing flakes
20	500+	47 min	Failure Zone	Release-week peak; CI effectively down

Source: GitHub Actions workflow run logs (14-day rolling window); 20-person point is dual-app release-week peak. Job volume outpaced headcount—driven by matrix expansion and PR Archive, not headcount alone.

Two details matter: daily jobs nearly doubled between 16 and 20 people (~260 to 500+), and Archive share rose from ~18% to 38%. Queue time is nonlinear in headcount: going from 8 to 12 people, well short of doubling, took L1 queue P95 from under 5 minutes to 25–40. That’s where expansion budgets go wrong.
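To make the nonlinearity concrete, divide each Figure 3 row's daily job count by team size. This is a minimal sketch that treats the approximate table values ("~100", "500+") as exact numbers, which is an assumption; if job volume scaled linearly with headcount, the ratio would stay flat.

```python
# Rows copied from Figure 3: (team size, approximate macOS jobs per day).
# Assumption: "~100" and "500+" are treated as 100 and 500 for the arithmetic.
FIGURE_3 = [(8, 100), (12, 180), (16, 260), (20, 500)]


def jobs_per_engineer(rows):
    """Return (team size, macOS jobs per engineer per day) for each row."""
    return [(team, jobs / team) for team, jobs in rows]


for team, ratio in jobs_per_engineer(FIGURE_3):
    print(f"{team:>2} people: {ratio:.2f} macOS jobs per engineer per day")
```

With these inputs the per-engineer load doubles from 12.5 to 25 jobs per day between 8 and 20 people, which is the matrix and PR Archive effect rather than headcount alone.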

Threshold model: 12 Warning / 16 Structural Debt / 20 Failure Zone

We condensed the case into a weekly-review framework—not a precise formula, but enough to decide whether Runner topology needs to move:

CI collapse risk ∝ (PR frequency × matrix width × Archive share) ÷ effective Runner pool capacity

All three numerator terms come from GitHub Actions logs. “Effective Runner pool” is not registered machine count—it’s labeled concurrent Mac slots you can actually schedule. When an office mini is busy for local dev, the denominator shrinks instantly. When any two of the following hold, stop debating “one more Mac mini” and refactor the pipeline first:

queue_wait_seconds P95 > 30 minutes and Archive-class job share > 30% → split L1/L2; stop full Archive on every PR
macOS jobs per push P99 > 15 → matrix/path filter out of control; cut jobs before buying hardware
L2 queue P95 > 40 minutes for 3 consecutive days with cache hit > 60% → structural scale-out; consider a dedicated release pool

This customer hit all three in the 20-person crash week. That is Failure Zone territory, not a Warning-stage problem that buying more minutes can fix.
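The three rules above can be written as a small weekly check. This is a sketch, not the team's tooling: the `WeeklyCiSnapshot` type and its field names are hypothetical, and the article does not report the crash week's L2 queue or cache-hit figures, so the example uses placeholders that leave rule 3 untripped.

```python
from dataclasses import dataclass


@dataclass
class WeeklyCiSnapshot:
    # Hypothetical container; field names are illustrative, not a GitHub API.
    queue_wait_p95_min: float       # queue_wait_seconds P95, in minutes
    archive_share: float            # Archive-class jobs / all macOS jobs, 0..1
    jobs_per_push_p99: float        # macOS jobs triggered per push, P99
    l2_over_40_min_days: int = 0    # consecutive days with L2 queue P95 > 40 min
    cache_hit_rate: float = 0.0     # 0..1


def tripped_rules(s):
    """Return the action for every threshold rule the snapshot trips."""
    actions = []
    if s.queue_wait_p95_min > 30 and s.archive_share > 0.30:
        actions.append("split L1/L2; stop full Archive on every PR")
    if s.jobs_per_push_p99 > 15:
        actions.append("cut matrix/path-filter jobs before buying hardware")
    if s.l2_over_40_min_days >= 3 and s.cache_hit_rate > 0.60:
        actions.append("structural scale-out; consider a dedicated release pool")
    return actions


def refactor_pipeline_first(s):
    """Any two tripped rules mean: refactor before adding Mac minis."""
    return len(tripped_rules(s)) >= 2


# Crash-week numbers from the article; L2 fields left at placeholder defaults.
crash_week = WeeklyCiSnapshot(queue_wait_p95_min=47, archive_share=0.38,
                              jobs_per_push_p99=19)
print(tripped_rules(crash_week))
print(refactor_pipeline_first(crash_week))
```

Even with the third rule left unevaluated, the first two are enough to trigger the "refactor first" decision for the crash week.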

GitHub Actions queue time: the first metric to break after 12 people

Many teams hear “CI slowed” and tune DerivedData. In this customer’s logs, queue time was what ran away—jobs sat in Queued. GitHub reports queue_wait_seconds separately from run duration; optimize only the latter and you “fix” compile while the org concurrency ceiling and pool mixing stay broken.

Org-level macOS concurrency is documented in GitHub Actions usage limits. After 12 people, afternoon PR storms meant Runners weren’t slow—jobs waited for a slot. The UI spins on Queued and gets blamed on network or Xcode. Split wait vs run using job execution time and queue duration.
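A minimal sketch of that split, assuming each job record carries `created_at`, `started_at`, and `completed_at` timestamps in the ISO 8601 format GitHub uses for workflow jobs. Here `queue_wait_seconds` is derived as started minus created; the sample record is hypothetical.

```python
from datetime import datetime


def parse_ts(value):
    """Parse a timestamp such as 2025-03-10T13:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def split_wait_and_run(job):
    """Separate time spent Queued from time spent executing."""
    created = parse_ts(job["created_at"])
    started = parse_ts(job["started_at"])
    completed = parse_ts(job["completed_at"])
    return {
        "queue_wait_seconds": (started - created).total_seconds(),
        "run_seconds": (completed - started).total_seconds(),
    }


# Hypothetical job: 47 minutes Queued, 14 minutes running.
sample_job = {
    "created_at": "2025-03-10T13:00:00Z",
    "started_at": "2025-03-10T13:47:00Z",
    "completed_at": "2025-03-10T14:01:00Z",
}
print(split_wait_and_run(sample_job))
```

A job like this looks like a one-hour build in the UI, yet only 14 minutes of it is work that compile tuning could shorten.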

At 12 people, queue P95 was already 25–40 minutes. Label-separated fast/archive pools (instead of two minis in one pool) would have delayed the signing pile-up. For Runner topology choices, see elastic pool vs always-on nodes decision matrix.

Five scaling bottlenecks (Scaling Taxonomy)

Five labels we use internally for mobile CI postmortems—mapped to Figure 2’s pipeline.

① Capacity bottleneck · org macOS concurrency maxed

When hosted Runner concurrency fills, light jobs queue behind Archive—the fast lane gets blocked by slow work. Same curve as GitHub Actions queue time above. Minute packs help hosted queue only; moving L2 off hosted is more stable—org concurrency stays the first ceiling otherwise.
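The head-of-line effect can be illustrated with a toy FIFO simulation. This is a sketch under stated assumptions, not GitHub's scheduler: arrival times and the 40-minute Archive duration are hypothetical, and the 8-minute light job matches the figure quoted in the timeline.

```python
import heapq


def simulate_fifo(jobs, slots):
    """Run jobs in arrival order on `slots` runners.

    jobs: list of (arrival_min, duration_min, kind).
    Returns a list of (kind, wait_min) in arrival order.
    """
    free_at = [0] * slots            # minute at which each slot becomes free
    heapq.heapify(free_at)
    waits = []
    for arrival, duration, kind in sorted(jobs):
        start = max(arrival, heapq.heappop(free_at))
        waits.append((kind, start - arrival))
        heapq.heappush(free_at, start + duration)
    return waits


archives = [(0, 40, "archive"), (1, 40, "archive")]
light = [(2, 8, "light"), (3, 8, "light"), (4, 8, "light")]

# One shared pool of two slots: light jobs line up behind both Archives.
shared = simulate_fifo(archives + light, slots=2)
shared_light_waits = [w for kind, w in shared if kind == "light"]

# Same two machines, label-split: one fast slot, one archive slot.
split_light_waits = [w for _, w in simulate_fifo(light, slots=1)]
split_archive_waits = [w for _, w in simulate_fifo(archives, slots=1)]

print("shared pool, light waits:", shared_light_waits)
print("split pools, light waits:", split_light_waits)
print("split pools, archive waits:", split_archive_waits)
```

With these inputs the light jobs wait 38, 38, and 44 minutes in the shared pool, versus 0, 7, and 14 minutes once the pool is split. The trade-off is visible too: the second Archive now waits 39 minutes, which is why L2 needs its own sizing.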

② Resource contention · office Mac mini as “spare dev machine”

Runners on SSH-able engineer machines eventually end up running local xcodebuild debug sessions too. Runner and daily dev then fight for CPU and disk, which shows up as timeouts or flaky compile errors. At 16 people, the two minis were “dead Friday, fine Monday”; the root cause was engineers switching branches over Remote Desktop on the same machines.

③ Workflow design debt · full Archive on every PR

At eight people, “green PR = shippable” justified Archive + upload at the end of every pull request workflow. ~6 merges/day was tolerable; at 20, Archive-class jobs went from ~10% to 35%+ of volume, flooding an already crowded queue. Split L1 integration from L2 Archive—that landed post-handoff; see the companion piece’s job layering section.

④ Scaling explosion · monorepo matrix jobs grow superlinearly

At 20 people, a single push triggered up to 22 macOS jobs—2.5× people, ~4× jobs. Easiest collapse point to miss in product cadence.
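Why the job count grows faster than the team: a matrix multiplies its axes, and a monorepo push can fire it once per app. The sketch below uses an entirely hypothetical matrix (the axis names and values are not from the customer's workflow) to show how one extra axis value adds a whole slice of jobs.

```python
from math import prod


def jobs_per_push(matrix, apps_triggered):
    """Jobs = product of matrix axis sizes, once per app whose workflow fires."""
    return prod(len(values) for values in matrix.values()) * apps_triggered


# Hypothetical matrix; axis names and values are illustrative only.
matrix = {
    "xcode": ["15.4", "16.0"],
    "destination": ["iPhone", "iPad"],
    "configuration": ["Debug", "Release"],
}
print(jobs_per_push(matrix, apps_triggered=2))

# One team adds a single extra destination to the shared workflow.
wider = dict(matrix, destination=["iPhone", "iPad", "iPhone SE"])
print(jobs_per_push(wider, apps_triggered=2))
```

In this example a one-line YAML change takes a push from 16 to 24 macOS jobs, which is how feature teams add load without anyone noticing in review.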

⑤ Crypto / signing contention · release-day Keychain and codesign fights

Two self-hosted Macs running Archive could limp through on 16GB RAM. Add Match unlock, notarization, and TestFlight upload, though, and Keychain locks plus codesign contention produce the same error code at random. In release week there were four consecutive “certificate not found” failures that passed on a local Mac rerun: a classic multi-job signing environment fight, not expired certs. See Apple Xcode Signing and Capabilities.

“Fixes” they tried—and why things got worse

More GitHub minute packs: Eased hosted queue only; Archive share unchanged—money spent, P95 still over 60 minutes.

Third office Mac mini: Still no fast/archive labels—three machines in one pool; signing failure rate went up.

No Friday merges: Artificially capped PR rate; release week concentrated spikes higher.

timeout-minutes: 180 on every job: Queued time doesn’t count toward the timeout, so the longer limit did nothing for waiting jobs and let stuck jobs hog slots longer.

Only partial win: move Release to a separate machine—but it was still an office mini; nobody unlocked Keychain overnight; L2 queue backed up Monday.

Three numbers dashboards should show—but didn’t

Before collapse, Grafana had “success rate” and “average duration.” In week one post-handoff we added three metrics and found 80% of the problem (GitHub’s monitoring docs cover how to split out queue_wait_seconds):

queue_wait_seconds P95 (by L0/L1/L2 bucket)—separate “slow compile” from “waiting in line”
Archive-class job share (weekly)—whether workflow design ran away
macOS jobs triggered per push P99—whether the matrix exploded

Crash-week numbers: L1 queue_wait P95 47 minutes, Archive share 38%, jobs per push P99 19. Any two over threshold → change topology before opening a PO.
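The three dashboard numbers can be computed from flat job records. This is a minimal sketch: the record fields (`push_id`, `tier`, `queue_wait_seconds`) are hypothetical names for data exported from workflow logs, "Archive-class" is assumed to mean tier L2, and the percentile uses the nearest-rank method.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # ceil(pct/100 * n) in integers
    return ordered[max(rank, 1) - 1]


def dashboard_metrics(jobs):
    """Compute queue P95 per tier, Archive share, and jobs-per-push P99."""
    waits_by_tier = defaultdict(list)
    jobs_per_push = defaultdict(int)
    archive_jobs = 0
    for job in jobs:
        waits_by_tier[job["tier"]].append(job["queue_wait_seconds"])
        jobs_per_push[job["push_id"]] += 1
        if job["tier"] == "L2":               # assumption: L2 == Archive-class
            archive_jobs += 1
    return {
        "queue_wait_p95_by_tier": {
            tier: percentile(waits, 95) for tier, waits in waits_by_tier.items()
        },
        "archive_share": archive_jobs / len(jobs),
        "jobs_per_push_p99": percentile(list(jobs_per_push.values()), 99),
    }


# Hypothetical sample: two pushes, four macOS jobs.
sample_jobs = [
    {"push_id": "a", "tier": "L1", "queue_wait_seconds": 120},
    {"push_id": "a", "tier": "L1", "queue_wait_seconds": 300},
    {"push_id": "a", "tier": "L2", "queue_wait_seconds": 2400},
    {"push_id": "b", "tier": "L1", "queue_wait_seconds": 60},
]
print(dashboard_metrics(sample_jobs))
```

Bucketing the queue wait by tier is the design choice that matters: a single blended P95 hides whether L1 is waiting behind L2.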

Triage order: fix topology before counting Macs

First two weeks after handoff, no new hardware—four changes only:

Remove Archive from PR workflows; L2 only on nightly + release branches
Tighten path filters—README/backend paths no longer fire the iOS matrix
Retire office mini Runners to stop fighting local dev
Hosted Runners for L1 spikes only; all L2 on dedicated nodes
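The path-filter change amounts to asking "did this push touch anything iOS cares about?" before the matrix fires. The sketch below models that decision in Python; the directory layout and patterns are hypothetical, and `fnmatchcase` globbing (where `*` also matches `/`) is only an approximation of GitHub's own path-filter syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical monorepo layout; the real paths are not given in the article.
IOS_PATTERNS = ["apps/ios/*", "shared/swift/*", "fastlane/*"]


def should_run_ios_matrix(changed_files, ios_patterns=IOS_PATTERNS):
    """True if any changed file matches an iOS-relevant pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in ios_patterns
    )


# README and backend edits no longer fire the iOS matrix.
print(should_run_ios_matrix(["README.md", "backend/api/server.go"]))

# A Swift source change still does.
print(should_run_ios_matrix(["README.md", "apps/ios/Sources/App.swift"]))
```

An allow-list of iOS paths is used here rather than an ignore-list, so a new top-level directory does not silently start triggering macOS jobs.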

Those four alone dropped release-week queue_wait P95 from 47 minutes to 22 minutes. That was still not enough, but it proved the collapse was topology and workflow, not a magic Mac count. Cloud Mac fast/release pools and capacity validation came next; for elastic vs always-on Runners, see GitHub Actions self-hosted macOS Runner: elastic pool vs always-on decision matrix.

Three signals if you’re growing past 12 toward 20

If any of these show up, schedule a 30-minute CI review that week—don’t wait for release week:

Developers ask “can we merge without CI?”
Sticky notes on office Macs: “don’t reboot—Runner running”
TestFlight builds succeed but “waiting for upload slot” is normal

Behind these signals you’ve usually hit at least two of the five failure modes above.

Postmortem Summary (quotable)

Root cause

Headcount grew without CI topology scaling: PR rate and matrix width rose while Runners stayed “hosted + office mini mixed,” and PR workflows kept running L2 Archive—queue and signing saturated together.

Contributing factors

No queue_wait_seconds monitoring—Queued mistaken for slow compile
Second Mac mini without fast/archive labels—Archive and PR shared one pool
Monorepo path filter too wide—README edits triggered the iOS matrix
Release-week job spikes not tied to procurement or topology changes

Fixes tried (ineffective or harmful)

More GitHub minute packs—hosted queue only; Archive share unchanged
Third office mini in mixed pool—signing failure rate increased
No Friday merges—spikes moved to release week
timeout-minutes: 180—Queued time excluded from timeout

What worked (first two weeks post-handoff)

Archive off PRs; L2 on nightly + release branches only
Tighter path filters; office mini Runners retired
L2 on dedicated release nodes—queue P95 47min → 22min

FAQ

GitHub Actions stuck on Queued—not slow compile?

Check queue_wait_seconds vs run duration. High Queued share means queue or Runner pool, not compile. Use Figure 2’s Queue node and the healthy-baseline table.

At what team size does iOS CI usually break?

This case: 12 Warning / 16 Structural Debt / 20 Failure Zone. Metrics beat headcount: queue P95 > 30 minutes and Archive > 30% → refactor pipeline first.

Scale hardware or fix the pipeline first?

Job layering, stop PR Archive, tighten matrix first—otherwise new machines only delay the next release-week queue spike.

Is this the same article as “how many Macs for 500 runs/day”?

No. This piece is why it broke (Failure Model); the companion is post-crash sizing and machine count. Read this one and diagnose first, then read the sizing piece.