VPSSpark CI Queue Diagnosis Standard (Cloud Mac CI #2). Read in order: Hook → formula → symptoms → Failure Model → Hard Rules → Runbook → FAQ. Related: #1 capacity · #5 scaling failure · #3 self-hosted TCO · #8 build speed.
1 · Hook: counter-intuitive takeaway
In most incidents where macOS CI jobs sit in Queued, the problem is not “we need more Macs”; it is L2 (Archive) running on pull requests.
2 · Core: the one formula
Everything below expands this single check. Paste it into your on-call wiki.
Master formula
CI queue problem ⇔ wait time >> run time
Engineering read: macOS runner queued means runner pool saturation, not slower Xcode builds.
Formula false → not this article; see #8 (Xcode / cache). Formula true → map to Failure 3 Layer Model and enforce CI Hard Rules.
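The master formula can be sketched as a single predicate. This is a minimal sketch, not a standard: the function name and the `ratio` threshold that stands in for ">>" are assumptions, and the sample durations are only illustrative.

```python
def is_queue_problem(wait_seconds: float, run_seconds: float, ratio: float = 3.0) -> bool:
    """Return True when wait time >> run time.

    `ratio` is an assumed threshold for ">>"; tune it to your own pipeline.
    """
    if run_seconds <= 0:
        # A job that waited but never ran is treated as a queue problem.
        return wait_seconds > 0
    return wait_seconds / run_seconds >= ratio


print(is_queue_problem(52 * 60, 11 * 60))  # long wait, short run -> True
print(is_queue_problem(2 * 60, 25 * 60))   # short wait, long run -> False
```

A True result sends you to the Failure 3 Layer Model; a False result sends you to #8.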
3 · Symptom: Queued ≠ Running
In GitHub Actions, Queued means the job entered the workflow but has no runner slot yet. In progress is when checkout, xcodebuild, and signing actually run. Teams often treat a one-hour red PR as “slow compile”, but the timeline tells a different story: 52 minutes waiting, 11 minutes running.
Wrong fixes: bump Xcode, raise timeout-minutes, buy more GitHub minutes. None of these adds a runner slot, and time spent in Queued generally does not count toward the timeout-minutes you set.
Measure both numbers per job: queue wait (queue_wait_seconds) against run duration.
| | wait time | run time |
|---|---|---|
| UI | Queued | In progress |
| Bottleneck | GitHub Actions macOS runner queue / self-hosted runner queue | Xcode · signing · upload |
| Wrong fix | more minutes · new Xcode | caching (→ #8) |
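One way to split a job into the two columns above is to subtract timestamps. This is a minimal sketch: the field names `created_at`, `started_at`, and `completed_at` are assumptions, so check which timestamps your CI provider actually exposes. The sample values are illustrative and mirror the 52/11 example.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Replace a trailing "Z" so this also works on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def split_wait_and_run(created_at: str, started_at: str, completed_at: str) -> tuple[float, float]:
    """Return (queue_wait_seconds, run_seconds) for one job."""
    created, started, completed = map(parse_ts, (created_at, started_at, completed_at))
    queue_wait_seconds = (started - created).total_seconds()
    run_seconds = (completed - started).total_seconds()
    return queue_wait_seconds, run_seconds


wait, run = split_wait_and_run(
    "2024-01-01T10:00:00Z",  # job created: enters Queued
    "2024-01-01T10:52:00Z",  # runner picks it up: In progress
    "2024-01-01T11:03:00Z",  # job finished
)
print(wait / 60, run / 60)  # 52.0 11.0
```

Keeping the two numbers separate per job is what makes the wait >> run check possible; a single "total duration" hides which column the time went to.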
4 · Explain: CI Queue Failure 3 Layer Model
When wait time >> run time, the cause is one of these SRE-style layers (or a stack of them).
| Layer | Meaning | Signals | Today |
|---|---|---|---|
| Capacity Limit | Platform · macos-latest | Only hosted jobs macOS runner queued; minutes ≠ concurrency | Cut fanout; move Archive off hosted → #3 |
| Pool Misdesign | Architecture · self-hosted | Archive on PR; fast/archive shared; one runner busy, all Queued | Hard Rules → #1 #6 |
| Trigger Explosion | Load · workflow fanout | matrix / paths / duplicate workflows; jobs > runners | Tighten triggers → #5 |
Capacity Limit: hits hosted macOS runners and the org-level macOS concurrency cap (limits). Pool Misdesign: the usual root cause, with job tier L2 on PR and mixed pools. Trigger Explosion: fix the YAML, not the hardware. Note that the failure layers are not the job tiers L0/L1/L2.
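The Signals column of the model can be turned into a rough triage helper. This is a sketch under assumptions: the `QueueSignals` fields and the function name are hypothetical, and each condition is one reading of the table, not an official detection rule. Because layers can stack, the function returns a list.

```python
from dataclasses import dataclass


@dataclass
class QueueSignals:
    only_hosted_jobs_queued: bool      # only macos-latest jobs are waiting
    archive_runs_on_pr: bool           # L2 work is triggered by pull requests
    fast_and_archive_share_pool: bool  # one runner label serves both tiers
    pending_jobs: int                  # jobs currently requested
    runners: int                       # runners available to serve them


def failure_layers(s: QueueSignals) -> list[str]:
    """Map observed signals to the failure layers they suggest."""
    layers = []
    if s.only_hosted_jobs_queued:
        layers.append("Capacity Limit")
    if s.archive_runs_on_pr or s.fast_and_archive_share_pool:
        layers.append("Pool Misdesign")
    if s.pending_jobs > s.runners:
        layers.append("Trigger Explosion")
    return layers


incident = QueueSignals(False, True, True, pending_jobs=12, runners=2)
print(failure_layers(incident))  # ['Pool Misdesign', 'Trigger Explosion']
```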
5 · Fix: CI Hard Rules (MUST NOT violate)
Normative language for runbooks—not suggestions.

    Rule 1: PR MUST NOT run L2
    Rule 2: L2 MUST run on isolated pool (macos-archive)
    Rule 3: fast pool MUST remain unblocked (fast ≠ archive)

    # Mapping
    PR   → L0 + L1 only → macos-fast
    main → L2 only      → macos-archive

Job tier reference (implements Hard Rules only):
| Tier | Work | Pool | Rules |
|---|---|---|---|
| L0 | Module build, lint, light tests | macos-fast | Rule 1 · OK on PR |
| L1 | PR integration, simulator tests | macos-fast | Rule 1 · MUST NOT Archive |
| L2 | Archive, IPA, TestFlight | macos-archive | Rules 2+3 · PR MUST NOT |
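The tier table can be checked mechanically before a workflow change merges. This is a minimal sketch: the `Job` shape and the function name are hypothetical, and the way each rule is encoded is one interpretation of the Hard Rules (Rule 2 is read as "L2 and only L2 uses macos-archive").

```python
from dataclasses import dataclass

FAST_POOL = "macos-fast"
ARCHIVE_POOL = "macos-archive"


@dataclass
class Job:
    name: str
    tier: str         # "L0", "L1" or "L2"
    pool: str         # runner label the job targets
    runs_on_pr: bool  # True if pull requests trigger this job


def hard_rule_violations(job: Job) -> list[str]:
    """Return the Hard Rules this job breaks (empty list means compliant)."""
    problems = []
    is_l2 = job.tier == "L2"
    if is_l2 and job.runs_on_pr:
        problems.append("Rule 1: PR MUST NOT run L2")
    if is_l2 != (job.pool == ARCHIVE_POOL):
        problems.append("Rule 2: L2 MUST run on isolated pool (macos-archive)")
    if is_l2 and job.pool == FAST_POOL:
        problems.append("Rule 3: fast pool MUST remain unblocked")
    return problems


print(hard_rule_violations(Job("pr-fast", "L1", FAST_POOL, True)))        # []
print(len(hard_rule_violations(Job("pr-archive", "L2", FAST_POOL, True))))  # 3
```

A job that archives on pull requests from the fast pool trips all three rules at once, which is the Pool Misdesign case described above.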
Fig. 1 · Pool Misdesign: L2 holds slot → all macOS jobs queued
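
    jobs:
      # L0 + L1: pull requests only, on the fast pool (Rules 1 and 3)
      pr-fast:
        if: github.event_name == 'pull_request'
        runs-on: [self-hosted, macos-fast]
      # L2: main branch or scheduled runs only, on the isolated archive pool (Rule 2)
      release-archive:
        if: github.ref == 'refs/heads/main' || github.event_name == 'schedule'
        runs-on: [self-hosted, macos-archive]
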
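The effect shown in Fig. 1 can be reproduced with a small FIFO simulation. This is a sketch under assumptions: the function name, arrival times, and durations are all illustrative, and real schedulers are more complex than a single FIFO queue. The second case assumes the Archive job has moved to its own macos-archive runner, so the fast runner serves only PR jobs.

```python
import heapq


def queue_waits(jobs: list[tuple[float, float]], runners: int) -> list[float]:
    """FIFO simulation.

    jobs: (arrival_minute, duration_minutes) pairs, sorted by arrival.
    Returns the wait in minutes for each job, in the same order.
    """
    free_at = [0.0] * runners  # min-heap: time at which each runner becomes free
    heapq.heapify(free_at)
    waits = []
    for arrival, duration in jobs:
        runner_free = heapq.heappop(free_at)  # earliest available runner
        start = max(arrival, runner_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + duration)
    return waits


archive = (0.0, 40.0)                              # one L2 Archive job
pr_jobs = [(1.0, 10.0), (2.0, 10.0), (3.0, 10.0)]  # three L0/L1 jobs

shared = queue_waits([archive] + pr_jobs, runners=1)  # L2 shares the pool
isolated = queue_waits(pr_jobs, runners=1)            # L2 moved to macos-archive
print(shared[1:])  # [39.0, 48.0, 57.0]
print(isolated)    # [0.0, 9.0, 18.0]
```

In this toy setup each PR job takes the same 10 minutes to run in both cases; only the wait changes, which is the pattern the wait >> run check is meant to catch.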
Runner topology: elastic vs always-on matrix · #1 sizing · after queue is healthy → #3.
6 · Runbook: one page (on-call)
0. CI queue problem ⇔ wait time >> run time ? ─No→ #8
1. Layer: Capacity | Pool Misdesign | Trigger Explosion
2. MUST: Rule1 PR no L2 · Rule2 L2 isolated · Rule3 fast unblocked
3. PR workflow has no archive/export/upload ?
4. wait P95 < 8min → then size runners (#3)
One-liner: Most macOS CI queued issues are L2 on the PR path—not missing hardware. macOS runner queued = pool saturation; apply wait >> run, then Failure Model, then Hard Rules.
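Step 4 of the runbook needs a P95 of wait time. A minimal sketch using the nearest-rank definition follows; the function names and the sample values are illustrative assumptions, and other percentile definitions will give slightly different numbers on small samples.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def queue_is_healthy(wait_minutes: list[float], limit_minutes: float = 8.0) -> bool:
    """Runbook step 4: wait P95 must be below the limit (8 min by default)."""
    return p95(wait_minutes) < limit_minutes


samples = [1, 2, 2, 3, 3, 4, 4, 5, 6, 52]  # illustrative waits in minutes
print(p95(samples), queue_is_healthy(samples))  # 52 False
```

With only ten samples a single 52-minute wait decides the P95, so collect enough jobs (for example a full working week) before using the result to size runners.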
7 · FAQ
Why do macOS runners queue more often?
Capacity Limit (concurrency cap) plus Pool Misdesign (L2 on PR). Start with wait >> run.
Queue problem or slow build?
CI queue problem ⇔ wait time >> run time. True → pool; false → #8.
Why does self-hosted still show queued?
Pool Misdesign: self-hosted runner queue follows your labels and pool layout. Breaking Hard Rules blocks the queue—Cloud Mac does not fix bad YAML.
GitHub Actions minutes vs concurrency?
minutes = billing; concurrency = simultaneous slots. More minutes do not fix macOS runner queued.
Is it a runner pool problem?
wait >> run + one self-hosted runner long-busy → Pool Misdesign; only macos-latest queued → Capacity Limit. See Runbook.
Queue OK but build still slow?
#8 (speed). Sizing: #1. Team growth: #5.
Validate pool split: need a fast pool PoC?
After Hard Rules, to move L0/L1 off the hosted GitHub Actions macOS runner queue, run one macos-fast runner and confirm wait time P95 < 8 min—then add macos-archive per #1.
For a daily isolated fast pool without office network changes, see Mac cloud plans or VPSSpark home—topology validation only; rules still required. TCO: series #3.