Troubleshooting

Diagnostic procedures and known failure modes. Organized by symptom — find the closest match to what you're seeing.

If a procedure here doesn't resolve the issue, see Filing a support request at the bottom for what to include.


Pod won't start

Symptom: Pod stuck in CreateContainerError or CrashLoopBackOff

Check the pod's events:

oc describe pod -l app.kubernetes.io/name=<app> -n <namespace>

Look at the Events: section at the bottom.

Cause: SecurityContextConstraints denial

Error creating: pods "..." is forbidden: unable to validate against any
security context constraint: provider "restricted-v2": ...

The application image is built to run cleanly under restricted-v2. If you see this, either:

  • The image tag being deployed isn't the OpenShift-compatible build. Confirm image.tag matches what the vendor specified for OpenShift installs.
  • A securityContext override in values.yaml is requesting privileges the SCC doesn't allow. Remove custom securityContext blocks and re-deploy.
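
Two quick checks: the SCC annotation on any pod that did get admitted, and the securityContext the rendered manifests are actually requesting (release and pod names below are placeholders):

# SCC that admitted an existing pod (annotation set by the admission controller)
oc get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.annotations.openshift\.io/scc}'

# securityContext blocks the chart renders, including any values.yaml overrides
helm get manifest <release-name> -n <namespace> | grep -B2 -A6 securityContext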

Cause: Image manifest unknown (Bitnami subchart)

Failed to pull image "docker.io/bitnami/postgresql:<tag>":
manifest unknown

Bitnami changed their image hosting in mid-2025; the chart's pinned postgres / redis tags are no longer published at docker.io/bitnami/. Override the subchart image repositories in your values.yaml:

postgresql:
  image:
    repository: bitnamilegacy/postgresql
redis:
  image:
    repository: bitnamilegacy/redis

For air-gapped clusters: mirror the Bitnami images into your internal registry and override repository to point there instead. See deploy.md §0 "Air-gapped or restricted-egress clusters".
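
One way to do that mirroring from a machine with access to both registries (the destination paths below are examples; match your registry layout):

oc image mirror docker.io/bitnamilegacy/postgresql:<tag> <internal-registry>/bitnamilegacy/postgresql:<tag>
oc image mirror docker.io/bitnamilegacy/redis:<tag> <internal-registry>/bitnamilegacy/redis:<tag>

Then point the repository overrides above at <internal-registry>/bitnamilegacy/... instead.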

Cause: Image pull failure

Failed to pull image "...": rpc error: ... unauthorized
  • Verify the image-pull Secret exists in the right namespace and is referenced via image.pullSecrets in values.yaml.
  • Verify the Secret credentials are valid: oc get secret <pull-secret> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
  • For air-gapped clusters, verify the image was successfully mirrored to your internal registry: oc image info <internal-registry>/....
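
If the pull Secret doesn't exist yet, it can be created and referenced roughly like this (the exact shape of image.pullSecrets depends on the chart's values schema; a plain list of Secret names is assumed here):

# Create the pull Secret in the application namespace
oc create secret docker-registry <pull-secret> \
  --docker-server=<registry> \
  --docker-username=<username> \
  --docker-password=<password-or-token> \
  -n <namespace>

# values.yaml (assumed shape)
image:
  pullSecrets:
    - <pull-secret>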

Cause: PVC pending

Pod has unbound immediate PersistentVolumeClaims
  • Check the PVC: oc get pvc -n <namespace>. Status should be Bound. If Pending, the requested StorageClass either doesn't exist or has no capacity.
  • For multi-replica deployments, ReadWriteOnce PVCs cannot be mounted by more than one pod. Switch to ReadWriteMany storage or skip PVC and use S3-backed Active Storage.
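
To see why binding is stuck and which classes the cluster actually offers:

oc describe pvc <pvc-name> -n <namespace>   # Events explain why the claim isn't binding
oc get storageclass                         # confirm the requested class exists (and note any RWX-capable ones)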

Migration Job fails

Symptom: helm install / helm upgrade reports "post-install hook failed"

The migration Job runs as a Helm post-install / post-upgrade hook; on failure, Helm aborts the release and the running deployment is unchanged.

# Find the failed Job (the chart names it with a release-revision suffix)
oc get jobs -n <namespace> -l app.kubernetes.io/component=migration

# Logs from the failed Job's migrate container
oc logs -n <namespace> -l app.kubernetes.io/component=migration -c migrate

Cause: Database connection failure

PG::ConnectionBad: could not connect to server: ...
  • Bundled mode: confirm postgres pod is Running and Ready. The migration Job runs after postgres is up, but if postgres crashed during install, the Job will hit a connection error. Check oc logs sts/<release-name>-postgresql -n <namespace>.
  • Production mode: confirm externalDatabase.host in values.yaml resolves from inside the cluster. Test with: oc run -it --rm test-pg --image=postgres:15 --restart=Never -- psql -h <host> -U <user> -d <db>

Cause: Insufficient database privileges

PG::InsufficientPrivilege: ERROR: permission denied
  • The application database user needs CREATE, ALTER, DROP on its own schema for migrations.
  • The user also needs to be able to run CREATE EXTENSION if the application uses Postgres extensions (the chart's values.yaml field externalDatabase.allowExtensionCreate must be true, and the database user must hold the corresponding privilege).
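
A sketch of the grant involved, run via psql as an admin role (schema, role, and connection details are placeholders; adapt to your database layout):

# Allow the application user to create and use objects in its schema
psql -h <host> -U <admin-user> -d <db> \
  -c 'GRANT USAGE, CREATE ON SCHEMA public TO <app_user>;'

Note that CREATE EXTENSION usually also requires superuser-level rights, so extension creation may need to be arranged with your DBA rather than granted to the application user directly.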

Cause: Constraint violation on existing data

ERROR: column "..." contains null values
ERROR: duplicate key value violates unique constraint

The migration tried to add a constraint or unique index against data that doesn't satisfy it. Release notes for this version will call out pre-migration data cleanup steps. After cleanup, retry:

oc delete job/<release-name>-migrate -n <namespace>
helm upgrade ...   # same command as before

Application returns 503 / probe failures

Symptom: Route returns 503, oc get pods shows pods running but 0/1 ready

Liveness or readiness probes are failing. Check pod events and container logs:

oc describe pod -l app.kubernetes.io/name=<app> -n <namespace>
oc logs deploy/<release-name> -n <namespace>

Cause: Slow boot

Rails takes longer than initialDelaySeconds to start, especially on the first pod start with a cold image cache. Increase livenessProbe.initialDelaySeconds and readinessProbe.initialDelaySeconds in values.yaml, then run helm upgrade.
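
For example (values are illustrative; tune them to your observed boot time):

livenessProbe:
  initialDelaySeconds: 120
readinessProbe:
  initialDelaySeconds: 60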

Cause: Cannot reach database

Application boots but /health checks fail because the database isn't reachable. Apply the database connection diagnostics in the "Migration Job fails" section above.

Cause: Cannot reach redis

Same as database — /health checks redis reachability. Verify redis is running and the connection string is correct.
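
A quick in-cluster reachability test (assumes no TLS on the redis endpoint; add -a <password> if auth is required):

oc run -it --rm test-redis --image=redis:7 --restart=Never -- \
  redis-cli -h <host> -p 6379 ping
# Expected reply: PONG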


License upload fails

Symptom: /licensing/installations/new returns "Invalid license"

Cause: License file mismatch

The license file is signed against a public key baked into the image at build time. If the vendor regenerated your license against a different keypair, or if the image build doesn't match the license, validation fails.

  • Verify with the vendor: which image build was your license file generated against?
  • Compare to the image tag your deployment is running: oc get deploy/<release-name> -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].image}'

Cause: License file corruption during transfer

# Inspect what's actually mounted in the pod
oc exec deploy/<release-name> -n <namespace> -- cat /rails/config/license.json

Compare to the license file you originally received from the vendor. If different, re-create the Secret from the original.
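
A checksum comparison is less error-prone than eyeballing the JSON (assumes sha256sum is available on your workstation):

# Hash of what's mounted in the pod
oc exec deploy/<release-name> -n <namespace> -- cat /rails/config/license.json | sha256sum

# Hash of the original file from the vendor
sha256sum license.json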


Random-UID-related write failures

Symptom: Application logs show Errno::EACCES: Permission denied

The application is running as a random UID (correct for OpenShift) but trying to write to a directory the random UID can't access.

oc exec deploy/<release-name> -n <namespace> -- ls -la /rails/storage /rails/log /rails/tmp

Expected: directories owned by UID 1001 with group 0, mode drwxrwxr-x or drwxrwsr-x. If the group is not 0 or group write permission is missing, the image is the older fixed-UID build.

Resolution: this is a vendor image fix, not a customer-side configuration change. Confirm the image tag matches the OpenShift-compatible build; if it does and you're still seeing EACCES, contact the vendor.


Application logs show "bind: permission denied"

Symptom: Container restarts immediately, logs end with

{"level":"ERROR","msg":"Failed to start HTTP listener",
 "error":"listen tcp :80: bind: permission denied"}

The application's reverse proxy (Thruster) is trying to bind a privileged port (<1024). OpenShift's restricted-v2 SCC denies this; non-root containers can only bind ports ≥1024.

The chart's default ConfigMap sets HTTP_PORT=3000 and TARGET_PORT=8080 to keep both Thruster and Puma in the non-privileged range, and the Deployment's args pin Puma to -p 8080. If you're seeing this error, check:

  • Your values.yaml doesn't override config.* env vars in a way that drops HTTP_PORT / TARGET_PORT.
  • The chart version installed includes the port-rerouting fix. Compare your release's chart version (helm list -n <ns>) with the version the vendor specified for OpenShift installs.
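
If you do override config.*, keep the port variables in place. Roughly (the exact nesting depends on the chart's values schema):

config:
  HTTP_PORT: "3000"     # Thruster listener, non-privileged
  TARGET_PORT: "8080"   # Puma, matching the Deployment's -p 8080 args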

Single-replica vs multi-replica issues

Symptom: After scaling replicas above 1, second pod fails to start

Multi-Attach error for volume "...": Volume is already used by pod ...

You have a ReadWriteOnce PVC for application storage and tried to scale beyond one replica. Two ways out:

  1. Switch to a ReadWriteMany storage class. Requires cluster support — check oc get storageclass for an RWX-capable class (NFS, CephFS, Trident, EFS, etc.). Update persistence.storageClass and persistence.accessMode in values.yaml.
  2. Disable the PVC and use S3-compatible object storage for Active Storage. Set persistence.enabled: false and configure objectStorage.*. Recommended for multi-replica.
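
The corresponding values.yaml changes look roughly like this (the RWX class name is an example; use whatever oc get storageclass reports for your cluster, and see the chart's values reference for the objectStorage.* keys):

# Option 1: RWX-backed PVC
persistence:
  storageClass: <rwx-storage-class>
  accessMode: ReadWriteMany

# Option 2: no PVC, S3-compatible object storage
persistence:
  enabled: false
objectStorage:
  # endpoint, bucket, region, credentials per the chart's values reference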

Sessions invalidating on every pod rotation

Symptom: Users get logged out every time the application is upgraded or pods restart

SECRET_KEY_BASE is changing across pod rotations. This shouldn't happen if the secret is referenced by name (secrets.existingSecret in Production mode); it can happen in Bundled mode if you're running helm uninstall then helm install instead of helm upgrade.

  • Production mode: the Secret is customer-owned and should not change between deployments. Rotate explicitly only when needed.
  • Bundled mode: don't uninstall + install for upgrades; use helm upgrade. The chart's persisted Secret survives upgrades.
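
To confirm whether the key is actually changing, compare a hash of it across two pod rotations (assumes sha256sum exists in the container image; the key itself is never printed):

oc exec deploy/<release-name> -n <namespace> -- \
  sh -c 'printf "%s" "$SECRET_KEY_BASE" | sha256sum'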

Filing a support request

When you cannot resolve an issue from this guide, file a support request and include the following:

1. Diagnostic bundle

Capture the following into a single tarball (sanitize secrets first — see step 2):

# Cluster context
oc version
oc whoami --show-server

# Release state
helm list -n <namespace>
helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace>

# Workload state
oc get all -n <namespace>
oc describe pods -n <namespace>
oc get events -n <namespace> --sort-by lastTimestamp

# Application logs (last 1000 lines)
oc logs deploy/<release-name> -n <namespace> --tail=1000

# Migration job logs (if relevant)
oc logs job/<release-name>-migrate -n <namespace>

# Image identity
oc get deploy/<release-name> -n <namespace> -o yaml | grep image:
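
One way to package the above (file names are arbitrary):

mkdir -p diag
oc version > diag/cluster.txt
helm list -n <namespace> > diag/helm-list.txt
helm get values <release-name> -n <namespace> > diag/helm-values.yaml
oc describe pods -n <namespace> > diag/pods.txt
oc get events -n <namespace> --sort-by lastTimestamp > diag/events.txt
oc logs deploy/<release-name> -n <namespace> --tail=1000 > diag/app.log
tar czf diag-bundle.tar.gz diag/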

2. Sanitize before sending

Strip from the diagnostic bundle:

  • RAILS_MASTER_KEY — appears in environment variables and Secret references. Replace with [REDACTED].
  • SECRET_KEY_BASE — same.
  • Database passwords and connection strings — replace.
  • Customer email addresses in audit logs — replace if confidential.

The application image's BUILD_GIT_SHA, the chart version, and the license customer ID are non-sensitive and helpful for triage — keep those.
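
After redacting, a quick grep over the collected files (the diag/ directory from the packaging sketch above) lists every place a sensitive key name still appears, so you can confirm each value was replaced:

grep -rniE 'RAILS_MASTER_KEY|SECRET_KEY_BASE|password' diag/
# Review each hit: the value next to the key name should read [REDACTED], not a real secret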

3. What to describe in the ticket

  • The procedure you ran (helm install <args> / helm upgrade <args>).
  • The expected behavior.
  • The observed behavior (error messages, screenshots).
  • The timing — when did it start failing? After an upgrade? On a fresh install?
  • What you've already tried from this troubleshooting guide.

Vendor support teams can resolve issues much faster from a diagnostic bundle plus a clear chronology than from a log snippet alone.