Troubleshooting

Diagnostic procedures and known failure modes. Organized by symptom — find the closest match to what you're seeing.

If a procedure here doesn't resolve the issue, see Filing a support request at the bottom for what to include.


Pod won't start

Symptom: Pod stuck in CreateContainerError or CrashLoopBackOff

Check the pod's events:

oc describe pod -l app.kubernetes.io/name=<app> -n <namespace>

Look at the Events: section at the bottom.

Cause: SecurityContextConstraints denial

Error creating: pods "..." is forbidden: unable to validate against any
security context constraint: provider "restricted-v2": ...

The application image is built to run cleanly under restricted-v2. If you see this, either:

  • The image tag being deployed isn't the OpenShift-compatible build. Confirm image.tag matches what the vendor specified for OpenShift installs.
  • A securityContext override in values.yaml is requesting privileges the SCC doesn't allow. Remove custom securityContext blocks and re-deploy.
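
Two quick checks: the SCC annotation on any pod that did get admitted, and the securityContext the rendered manifests are actually requesting (release and pod names below are placeholders):

# SCC that admitted an existing pod (annotation set by the admission controller)
oc get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.annotations.openshift\.io/scc}'

# securityContext blocks the chart renders, including any values.yaml overrides
helm get manifest <release-name> -n <namespace> | grep -B2 -A6 securityContext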

Cause: Image manifest unknown (Bitnami subchart)

Failed to pull image "docker.io/bitnami/postgresql:<tag>":
manifest unknown

Bitnami changed their image hosting in mid-2025; the chart's pinned postgres / redis tags are no longer published at docker.io/bitnami/. Override the subchart image repositories in your values.yaml:

postgresql:
  image:
    repository: bitnamilegacy/postgresql
redis:
  image:
    repository: bitnamilegacy/redis

For air-gapped clusters: mirror the Bitnami images into your internal registry and override repository to point there instead. See deploy.md §0 "Air-gapped or restricted-egress clusters".
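
One way to do that mirroring from a machine with access to both registries (the destination paths below are examples; match your registry layout):

oc image mirror docker.io/bitnamilegacy/postgresql:<tag> <internal-registry>/bitnamilegacy/postgresql:<tag>
oc image mirror docker.io/bitnamilegacy/redis:<tag> <internal-registry>/bitnamilegacy/redis:<tag>

Then point the repository overrides above at <internal-registry>/bitnamilegacy/... instead.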

Cause: Image pull failure

Failed to pull image "...": rpc error: ... unauthorized
  • Verify the image-pull Secret exists in the right namespace and is referenced via image.pullSecrets in values.yaml.
  • Verify the Secret credentials are valid: oc get secret <pull-secret> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
  • For air-gapped clusters, verify the image was successfully mirrored to your internal registry: oc image info <internal-registry>/....
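
If the pull Secret doesn't exist yet, it can be created and referenced roughly like this (the exact shape of image.pullSecrets depends on the chart's values schema; a plain list of Secret names is assumed here):

# Create the pull Secret in the application namespace
oc create secret docker-registry <pull-secret> \
  --docker-server=<registry> \
  --docker-username=<username> \
  --docker-password=<password-or-token> \
  -n <namespace>

# values.yaml (assumed shape)
image:
  pullSecrets:
    - <pull-secret>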

Cause: PVC pending

Pod has unbound immediate PersistentVolumeClaims
  • Check the PVC: oc get pvc -n <namespace>. Status should be Bound. If Pending, the requested StorageClass either doesn't exist or has no capacity.
  • For multi-replica deployments, ReadWriteOnce PVCs cannot be mounted by more than one pod. Switch to ReadWriteMany storage or skip PVC and use S3-backed Active Storage.
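
To see why binding is stuck and which classes the cluster actually offers:

oc describe pvc <pvc-name> -n <namespace>   # Events explain why the claim isn't binding
oc get storageclass                         # confirm the requested class exists (and note any RWX-capable ones)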

Migration Job fails

Symptom: helm install / helm upgrade reports "post-install hook failed"

The migration Job runs as a Helm post-install / post-upgrade hook; on failure, Helm aborts the release and the running deployment is unchanged.

# Find the failed Job (the chart names it with a release-revision suffix)
oc get jobs -n <namespace> -l app.kubernetes.io/component=migration

# Logs from the failed Job's migrate container
oc logs -n <namespace> -l app.kubernetes.io/component=migration -c migrate

Cause: Database connection failure

PG::ConnectionBad: could not connect to server: ...
  • Bundled mode: confirm postgres pod is Running and Ready. The migration Job runs after postgres is up, but if postgres crashed during install, the Job will hit a connection error. Check oc logs sts/<release-name>-postgresql -n <namespace>.
  • Production mode: confirm externalDatabase.host in values.yaml resolves from inside the cluster. Test with: oc run -it --rm test-pg --image=postgres:15 --restart=Never -- psql -h <host> -U <user> -d <db>

Cause: Insufficient database privileges

PG::InsufficientPrivilege: ERROR: permission denied
  • The application database user needs CREATE, ALTER, DROP on its own schema for migrations.
  • The user also needs to be able to run CREATE EXTENSION if the application uses Postgres extensions (the chart's values.yaml field externalDatabase.allowExtensionCreate must be true, and the database user must hold the corresponding privilege).
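
A sketch of the grant involved, run via psql as an admin role (schema, role, and connection details are placeholders; adapt to your database layout):

# Allow the application user to create and use objects in its schema
psql -h <host> -U <admin-user> -d <db> \
  -c 'GRANT USAGE, CREATE ON SCHEMA public TO <app_user>;'

Note that CREATE EXTENSION usually also requires superuser-level rights, so extension creation may need to be arranged with your DBA rather than granted to the application user directly.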

Cause: Constraint violation on existing data

ERROR: column "..." contains null values
ERROR: duplicate key value violates unique constraint

The migration tried to add a constraint or unique index against data that doesn't satisfy it. Release notes for this version will call out pre-migration data cleanup steps. After cleanup, retry:

oc delete job/<release-name>-migrate -n <namespace>
helm upgrade ...   # same command as before

Application returns 503 / probe failures

Symptom: Route returns 503, oc get pods shows pods running but 0/1 ready

Liveness or readiness probes are failing. Check pod events and container logs:

oc describe pod -l app.kubernetes.io/name=<app> -n <namespace>
oc logs deploy/<release-name> -n <namespace>

Cause: Slow boot

Rails takes longer than initialDelaySeconds to start, especially on the first pod start with a cold image cache. Increase livenessProbe.initialDelaySeconds and readinessProbe.initialDelaySeconds in values.yaml, then run helm upgrade.
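
For example (values are illustrative; tune them to your observed boot time):

livenessProbe:
  initialDelaySeconds: 120
readinessProbe:
  initialDelaySeconds: 60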

Cause: Cannot reach database

Application boots but /health checks fail because the database isn't reachable. Apply the database connection diagnostics in the "Migration Job fails" section above.

Cause: Cannot reach redis

Same as database — /health checks redis reachability. Verify redis is running and the connection string is correct.
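
A quick in-cluster reachability test (assumes no TLS on the redis endpoint; add -a <password> if auth is required):

oc run -it --rm test-redis --image=redis:7 --restart=Never -- \
  redis-cli -h <host> -p 6379 ping
# Expected reply: PONG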


License upload fails

Symptom: /licensing/installations/new returns "Invalid license"

Cause: License file mismatch

The license file is signed against a public key baked into the image at build time. If the vendor regenerated your license against a different keypair, or if the image build doesn't match the license, validation fails.

  • Verify with the vendor: which image build was your license file generated against?
  • Compare to the image tag your deployment is running: oc get deploy/<release-name> -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].image}'

Cause: License file corruption during transfer

# Inspect what's actually mounted in the pod
oc exec deploy/<release-name> -n <namespace> -- cat /rails/config/license.json

Compare to the license file you originally received from the vendor. If different, re-create the Secret from the original.
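
A checksum comparison is less error-prone than eyeballing the JSON (assumes sha256sum is available on your workstation):

# Hash of what's mounted in the pod
oc exec deploy/<release-name> -n <namespace> -- cat /rails/config/license.json | sha256sum

# Hash of the original file from the vendor
sha256sum license.json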


Random-UID-related write failures

Symptom: Application logs show Errno::EACCES: Permission denied

The application is running as a random UID (correct for OpenShift) but trying to write to a directory the random UID can't access.

oc exec deploy/<release-name> -n <namespace> -- ls -la /rails/storage /rails/log /rails/tmp

Expected: directories owned by UID 1001 with group 0, mode drwxrwxr-x or drwxrwsr-x. If the group is not 0 or group write permission is missing, the image is the older fixed-UID build.

Resolution: this is a vendor image fix, not a customer-side configuration change. Confirm the image tag matches the OpenShift-compatible build; if it does and you're still seeing EACCES, contact the vendor.


Application logs show "bind: permission denied"

Symptom: Container restarts immediately, logs end with

{"level":"ERROR","msg":"Failed to start HTTP listener",
 "error":"listen tcp :80: bind: permission denied"}

The application's reverse proxy (Thruster) is trying to bind a privileged port (<1024). OpenShift's restricted-v2 SCC denies this; non-root containers can only bind ports ≥1024.

The chart's default ConfigMap sets HTTP_PORT=3000 and TARGET_PORT=8080 to keep both Thruster and Puma in the non-privileged range, and the Deployment's args pin Puma to -p 8080. If you're seeing this error, check:

  • Your values.yaml doesn't override config.* env vars in a way that drops HTTP_PORT / TARGET_PORT.
  • The chart version installed includes the port-rerouting fix. Compare your release's chart version (helm list -n <ns>) with the version the vendor specified for OpenShift installs.
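
If you do override config.*, keep the port variables in place. Roughly (the exact nesting depends on the chart's values schema):

config:
  HTTP_PORT: "3000"     # Thruster listener, non-privileged
  TARGET_PORT: "8080"   # Puma, matching the Deployment's -p 8080 args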

Single-replica vs multi-replica issues

Symptom: After scaling replicas above 1, second pod fails to start

Multi-Attach error for volume "...": Volume is already used by pod ...

You have a ReadWriteOnce PVC for application storage and tried to scale beyond one replica. Two ways out:

  1. Switch to a ReadWriteMany storage class. Requires cluster support — check oc get storageclass for an RWX-capable class (NFS, CephFS, Trident, EFS, etc.). Update persistence.storageClass and persistence.accessMode in values.yaml.
  2. Disable the PVC and use S3-compatible object storage for Active Storage. Set persistence.enabled: false and configure objectStorage.*. Recommended for multi-replica.
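
The corresponding values.yaml changes look roughly like this (the RWX class name is an example; use whatever oc get storageclass reports for your cluster, and see the chart's values reference for the objectStorage.* keys):

# Option 1: RWX-backed PVC
persistence:
  storageClass: <rwx-storage-class>
  accessMode: ReadWriteMany

# Option 2: no PVC, S3-compatible object storage
persistence:
  enabled: false
objectStorage:
  # endpoint, bucket, region, credentials per the chart's values reference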

Sessions invalidating on every pod rotation

Symptom: Users get logged out every time the application is upgraded or pods restart

SECRET_KEY_BASE is changing across pod rotations. This shouldn't happen if the secret is referenced by name (secrets.existingSecret in Production mode); it can happen in Bundled mode if you're running helm uninstall then helm install instead of helm upgrade.

  • Production mode: the Secret is customer-owned and should not change between deployments. Rotate explicitly only when needed.
  • Bundled mode: don't uninstall + install for upgrades; use helm upgrade. The chart's persisted Secret survives upgrades.
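
To confirm whether the key is actually changing, compare a hash of it across two pod rotations (assumes sha256sum exists in the container image; the key itself is never printed):

oc exec deploy/<release-name> -n <namespace> -- \
  sh -c 'printf "%s" "$SECRET_KEY_BASE" | sha256sum'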

Filing a support request

When you cannot resolve an issue from this guide, file a support request and include the following:

1. Diagnostic bundle

Capture the following into a single tarball (sanitize secrets first — see step 2):

# Cluster context
oc version
oc whoami --show-server

# Release state
helm list -n <namespace>
helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace>

# Workload state
oc get all -n <namespace>
oc describe pods -n <namespace>
oc get events -n <namespace> --sort-by lastTimestamp

# Application logs (last 1000 lines)
oc logs deploy/<release-name> -n <namespace> --tail=1000

# Migration job logs (if relevant)
oc logs job/<release-name>-migrate -n <namespace>

# Image identity
oc get deploy/<release-name> -n <namespace> -o yaml | grep image:
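
One way to package the above (file names are arbitrary):

mkdir -p diag
oc version > diag/cluster.txt
helm list -n <namespace> > diag/helm-list.txt
helm get values <release-name> -n <namespace> > diag/helm-values.yaml
oc describe pods -n <namespace> > diag/pods.txt
oc get events -n <namespace> --sort-by lastTimestamp > diag/events.txt
oc logs deploy/<release-name> -n <namespace> --tail=1000 > diag/app.log
tar czf diag-bundle.tar.gz diag/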

2. Sanitize before sending

Strip from the diagnostic bundle:

  • RAILS_MASTER_KEY — appears in environment variables and Secret references. Replace with [REDACTED].
  • SECRET_KEY_BASE — same.
  • Database passwords and connection strings — replace.
  • Customer email addresses in audit logs — replace if confidential.

The application image's BUILD_GIT_SHA, the chart version, and the license customer ID are non-sensitive and helpful for triage — keep those.
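
After redacting, a quick grep over the collected files (the diag/ directory from the packaging sketch above) lists every place a sensitive key name still appears, so you can confirm each value was replaced:

grep -rniE 'RAILS_MASTER_KEY|SECRET_KEY_BASE|password' diag/
# Review each hit: the value next to the key name should read [REDACTED], not a real secret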

3. What to describe in the ticket

  • The procedure you ran (helm install <args> / helm upgrade <args>).
  • The expected behavior.
  • The observed behavior (error messages, screenshots).
  • The timing — when did it start failing? After an upgrade? On a fresh install?
  • What you've already tried from this troubleshooting guide.

Vendor support teams can resolve issues much faster from a diagnostic bundle plus a clear chronology than from a log snippet alone.