ref: 72581ff94e21863373e2c72a01eb75332cb22f32 sourcehut-root/skills/sourcehut-ci/references/debugging.md -rw-r--r-- 9.7 KiB
72581ff9 — Eugene Blikh docs(sourcehut-ci): fix factual errors found in full audit 2 days ago

#Debugging Failed Builds

The single most important debugging tool on sourcehut is SSH into the build VM. Use it; don't iterate blindly on the manifest.

#Reading sr.ht log output

The build log is plain text, tasks separated by headers like:

[#1273143] 2025/01/15 10:23:01 Running task "build"
+ cd myproject
+ make
gcc -c foo.c
...
[#1273143] 2025/01/15 10:23:42 Task "build" failed (exit status 1)

Lines starting with + are from set -x — they show the command being run, with environment variables expanded. The next lines are the command's stdout/stderr. The trailing line with "failed" gives the exit status.

When a task fails, everything after that task is skipped. The summary at the bottom of the log lists task statuses and exit codes.
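To pull just the failed task names out of a saved log, a small filter over the "Task ... failed" lines is enough. This is a minimal sketch: failed_tasks is a hypothetical helper name, and it assumes the log was saved to a file and follows the line format shown above.

```shell
# failed_tasks LOGFILE: print the name of each task the log reports as failed.
# Hypothetical helper; relies on the 'Task "<name>" failed' line format.
failed_tasks() {
    sed -n 's/.*Task "\([^"]*\)" failed.*/\1/p' "$1"
}
```

Usage: failed_tasks build.log, where build.log is whatever file you saved the log to.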

#SSH into the failed VM

On failure, the log prints:

[#1273143] 2025/01/15 10:23:42 Build failed.
[#1273143] 2025/01/15 10:23:42 The build environment will be kept alive for 10 minutes.
[#1273143] 2025/01/15 10:23:42 
[#1273143] 2025/01/15 10:23:42     ssh -t builds@fra02.builds.sr.ht connect 1273143
[#1273143] 2025/01/15 10:23:42 
[#1273143] 2025/01/15 10:23:42 After logging in, the deadline is increased to your remaining build time.

Run that SSH command. You'll be dropped into the VM as the build user, exactly as the build left it. If you don't log in, the VM lives for 10 minutes by default. Once you log in, the deadline extends to your remaining build time: the [builds.sr.ht::worker] timeout minus the time already elapsed. This is often capped and is instance-dependent; the log will say "Your VM will be terminated N hours from now".

What to do once inside:

  • cd ~ takes you to /home/build, your home, where sources are cloned.
  • Re-run the failing command manually to see actual errors interactively.
  • which <tool> — verify a package actually installed and is on PATH.
  • cat ~/.buildenv — see exactly what environment: set.
  • env — full environment, including $OAUTH2_TOKEN, $JOB_ID, etc.
  • sudo is passwordless — install missing packages, modify system config, whatever.
  • logout (or Ctrl-D) when done. The VM gets torn down.
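The "which <tool>" check above can be run over a whole list of tools in one go. A minimal sketch, assuming a POSIX shell inside the VM; check_tools is a hypothetical helper name, not something the image provides.

```shell
# check_tools TOOL...: report where each tool resolves on PATH.
# Returns non-zero if any tool is missing. Hypothetical helper.
check_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            echo "ok      $tool -> $path"
        else
            echo "MISSING $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: check_tools make gcc git. Any MISSING line points at a packages: entry that is absent or named differently on that image.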

For SSH into the VM, your sourcehut SSH key needs to be added at https://meta.sr.ht/keys. The same key used for git operations is fine.

#shell: true for always-on SSH

Add to the manifest:

shell: true

The VM stays alive after tasks complete, even on success. Use this when iterating; remove before committing for real.

You can also SSH in while the build is running to watch progress interactively, run top, inspect the filesystem mid-build, etc.

#complete-build for early exit

Magic in-VM command that ends the build successfully without running subsequent tasks:

tasks:
  - check-branch: |
      if [ "$GIT_REF" != "refs/heads/master" ]; then
          complete-build
      fi
  - deploy: |
      # only runs on master

It exits the task with status 0 and tells the runner to skip all subsequent tasks. The build is marked successful. Use for "this push doesn't need a full build" cases.

Not for security gating — anyone editing the manifest can remove the complete-build call.

#Common errors and what they mean

#"No such image: foo/bar"

The image: value isn't a valid sourcehut image. Check the spelling against https://man.sr.ht/builds.sr.ht/compatibility.md. Common typos: alpine/3.18 (real) vs alpine/3.18.0 (not real); debian/bookworm (real) vs debian/12 (not real).

#"Cannot find package: xyz"

Package isn't in the image's repos under that name. Cross-distro names differ:

  • Alpine: nodejs for Node, npm separate.
  • Debian: nodejs and npm are separate packages in the official repos (third-party builds such as NodeSource's nodejs bundle npm).
  • Arch: nodejs and npm both.

When unsure: image: alpine/edge + packages: [xyz], push, see the error, find the right name via https://pkgs.alpinelinux.org/packages.

#"Permission denied (publickey)"

Trying to SSH/git over SSH without a key, or with the wrong key.

  • Secret SSH key not configured: verify secrets: includes the right UUID and the secret type is "SSH key".
  • Public key not added on the receiving end: for GitHub mirror, add the build's public key (printed by ssh-keygen -y -f ~/.ssh/id_* inside the VM) as a deploy key on GitHub.
  • Wrong known_hosts: ssh-keyscan -H <host> >> ~/.ssh/known_hosts before the SSH call.

#"401 Unauthorized" from a hut command or curl with $OAUTH2_TOKEN

  • oauth: directive missing or insufficient. Check the scope: read operations need :RO, write operations need :RW.
  • The OAuth grant is for a different service than you're calling.
  • Build was submitted in a context that disables secrets/OAuth (e.g. mailing-list patch test, hut builds submit --no-secrets, web "disable secrets" checkbox). When secrets are off, neither ~/.config/hut/config nor $OAUTH2_TOKEN is provisioned.

#"missing access-token" from hut, even though $OAUTH2_TOKEN is set

hut does not read $OAUTH2_TOKEN. It reads ~/.config/hut/config. The worker pre-writes that file only when oauth: is in the manifest and secrets are enabled. If the env var is set but hut fails, something in your script removed/overwrote the config, or you're running hut as a user other than build. Inspect ~/.config/hut/config to confirm. See references/hut.md.
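A quick way to tell the three cases apart (file missing, file empty, file present) without printing the token itself. This is a sketch under the assumption that the config lives at the path named above; hut_config_status is a hypothetical helper name.

```shell
# hut_config_status: report the state of the hut config file without showing its contents.
# Assumes the path ~/.config/hut/config; hypothetical helper.
hut_config_status() {
    config="$HOME/.config/hut/config"
    if [ -s "$config" ]; then
        echo "present"
    elif [ -e "$config" ]; then
        echo "empty"
    else
        echo "missing"
    fi
}
```

Run it as the build user. "missing" or "empty" while $OAUTH2_TOKEN is set suggests an earlier task removed or truncated the file.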

#"Build failed with exit code 137"

Usually an OOM kill: the VM ran out of memory and the process received SIGKILL (137 = 128 + 9). The VM's memory size is an instance/operator setting (builds.sr.ht::worker config), not a per-manifest value, so there's no manifest key to bump it. Upstream builds.sr.ht runs a fixed amount per VM; self-hosted instances vary. Workarounds, in order of effort:

  1. Reduce parallelism inside the build (make -j2 instead of make -j$(nproc)).
  2. Tell the compiler to use less memory (go build -p 1, cargo build -j 1, cc -O1 instead of -O3, etc.).
  3. Split the work across multiple jobs in .builds/.
  4. Run a self-hosted runner on bigger hardware.
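For the first workaround, the job count can be derived from available memory instead of hard-coded. A minimal sketch: safe_jobs is a hypothetical helper, and the per-job memory figure is an assumption you have to estimate for your own compiler and codebase.

```shell
# safe_jobs CPUS MEM_MB PER_JOB_MB: print a -j value capped by both CPU count
# and memory. PER_JOB_MB is a guess at peak memory per compile job.
safe_jobs() {
    cpus=$1
    mem_mb=$2
    per_job_mb=$3
    jobs=$((mem_mb / per_job_mb))
    if [ "$jobs" -gt "$cpus" ]; then
        jobs=$cpus
    fi
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}

# Example (Linux only, reads /proc/meminfo), assuming roughly 1 GiB per job:
#   mem_mb=$(awk '/MemTotal/ {print int($2 / 1024)}' /proc/meminfo)
#   make -j"$(safe_jobs "$(nproc)" "$mem_mb" 1024)"
```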

#Tar/pages publish accepts but site is broken

pages.sr.ht silently discards invalid uploads. Verify the tarball:

tar -tzvf site.tar.gz | head -20

Every file entry should start with -rw-r--r-- (mode 644). There should be no directories with unusual modes and no entries starting with l (symlinks), and the top-level entries should be files (index.html, etc.), not a wrapper directory like public/.
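Two of those checks are easy to automate before uploading. A minimal sketch: check_site_tarball is a hypothetical helper, and it assumes the site's entry page is index.html at the archive root. It does not check file modes.

```shell
# check_site_tarball TARBALL: print problems and return non-zero if any are found.
# Hypothetical helper; checks for symlinks and for a top-level index.html only.
check_site_tarball() {
    tarball=$1
    status=0
    # Symlink entries start with "l" in the verbose listing.
    if tar -tzvf "$tarball" | grep -q '^l'; then
        echo "symlink entries found"
        status=1
    fi
    # index.html should sit at the archive root, optionally prefixed with "./".
    if ! tar -tzf "$tarball" | grep -qx -e 'index.html' -e './index.html'; then
        echo "no index.html at the top level"
        status=1
    fi
    return "$status"
}
```

Usage: check_site_tarball site.tar.gz && echo "looks sane". A tarball built with tar -C public -czf site.tar.gz . passes the top-level check; one built with tar -czf site.tar.gz public does not.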

#"skip-ci doesn't seem to work"

git push options need protocol v2 (default since git 2.26). If you're stuck on a very old git, pass -c protocol.version=2 explicitly:

git -c protocol.version=2 push -o skip-ci

Also: some middleboxes (mirroring services, certain proxies) strip push options entirely. If you push to a mirror that re-pushes to git.sr.ht, the options don't make it through — push directly to git.sr.ht.

If you want skip-ci to be the default for a repo (e.g. for an auto-changelog branch), set it in git config:

git config --add push.pushOption skip-ci

…and remember to override it (-o '' or unset the config) when you do want a build.

#"Variable from previous task is undefined"

Tasks are separate sessions. Variables exported in one task don't persist. Write to ~/.buildenv:

tasks:
  - compute: |
      VERSION=$(...)
      echo "VERSION=$VERSION" >> ~/.buildenv
  - use: |
      echo "Version is $VERSION"
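The mechanism can be tried outside sourcehut. This sketch simulates two tasks as two separate shells and uses a temporary file as a stand-in for ~/.buildenv; it assumes the runner sources that file at the start of every task, and the version string is a made-up placeholder.

```shell
# Stand-in for ~/.buildenv so the sketch does not touch a real home directory.
buildenv=$(mktemp)

# "Task 1": compute a value and append it to the shared file.
sh -c 'VERSION=1.2.3; echo "VERSION=$VERSION" >> "$1"' task1 "$buildenv"

# "Task 2": a fresh shell. VERSION is only visible because the file is sourced first.
result=$(sh -c '. "$1"; echo "Version is $VERSION"' task2 "$buildenv")
echo "$result"

rm -f "$buildenv"
```

Values containing spaces or shell metacharacters need quoting inside the file, since it is read as shell code.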

#"Source directory not found"

The sources: URL was wrong, the ref doesn't exist, or the clone succeeded but you tried to cd into the wrong directory. The clone directory is named after the last URL component: https://git.sr.ht/~user/myproject clones into myproject/. Custom names aren't supported via sources:; use a task git clone for that.
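The naming rule can be sketched in one line of shell. clone_dir is a hypothetical helper; stripping a trailing "#ref" suffix is an assumption about how a ref is attached to a sources: URL, and the sketch does not handle a ".git" suffix.

```shell
# clone_dir URL: print the directory name a sources: entry is expected to clone into.
# Drops anything after "#" (assumed ref suffix), then takes the last path component.
clone_dir() {
    basename "${1%%#*}"
}
```

Usage: cd "$(clone_dir 'https://git.sr.ht/~user/myproject')" inside a task, or just run it locally to confirm the name you should cd into.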

#Build hangs forever

A task that's waiting for user input hangs until the per-job timeout elapses (instance config; the upstream config example uses 45m, your self-hosted instance may differ). Check for: apt-get waiting for confirmation (use apt-get install -y), interactive make menuconfig, prompts from gpg --gen-key without --batch, npm asking before installing a dependency, etc.

When the job times out it ends with status timeout (treated as failure by triggers), prints the SSH connect line, and gives you the standard 10-minute grace window to log in and look around.
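One general shell technique (not a sourcehut feature) is to run suspect commands with stdin attached to /dev/null, so anything that tries to prompt on stdin sees end-of-file and fails right away instead of hanging until the timeout. A minimal sketch; run_noninteractive is a hypothetical helper name.

```shell
# run_noninteractive CMD [ARGS...]: run a command with no usable stdin.
# A command that tries to read an answer gets EOF immediately.
run_noninteractive() {
    "$@" </dev/null
}
```

Usage: run_noninteractive make install. Programs that prompt through /dev/tty rather than stdin are not covered by this; they still need their own batch or "yes" flag.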

#Iteration workflow

The slow way: edit .build.yml, commit, push, wait for build, read log, repeat. Each iteration takes minutes.

The fast way:

  1. Go to https://builds.sr.ht/submit.
  2. Paste your manifest.
  3. Click submit. The job runs. Watch the log streaming live.
  4. On failure, SSH into the VM, fix manually, write down what worked.
  5. Repeat in the web form with the corrected manifest.
  6. Once green, commit the working manifest as .build.yml.

This avoids polluting your git history with "fix CI try 7" commits.

For local iteration, hut builds submit --follow .build.yml does the same thing from the CLI, streaming the log to your terminal.

#Reproducing locally

The build image scripts are public. You can build the images and boot them locally with QEMU if you want to reproduce a build environment exactly:

# Image scripts are in the builds.sr.ht repo
git clone https://git.sr.ht/~sircmpwn/builds.sr.ht
cd builds.sr.ht/images/<image-name>
# Build the image with the genimg script (requires QEMU + the right tooling)

Most people don't go this far. For "is this an environment issue or a code issue", a local Docker run with docker run -it alpine sh followed by manually running the build steps catches 90% of issues.

#When to ask for help

The sourcehut admins are helpful but expect:

  • Push UUID if it's a push problem: git push -o debug prints it.
  • Job URL for build problems: https://builds.sr.ht/~user/job/N.
  • The manifest itself, in the message body.
  • What you've tried: SSH'd in? Read the log? Tried locally?

The sr.ht-discuss mailing list is the right venue for general questions. sr.ht-support is for account and billing issues.