Troubleshooting
Diagnostic procedures and known failure modes. Organized by symptom — find the closest match to what you're seeing.
If a procedure here doesn't resolve the issue, see Filing a support request at the bottom for what to include.
Pod won't start
Symptom: Pod stuck in CreateContainerError or CrashLoopBackOff
Check the pod's events:
oc describe pod -l app.kubernetes.io/name=<app> -n <namespace>
Look at the Events: section at the bottom.
Cause: SecurityContextConstraints denial
Error creating: pods "..." is forbidden: unable to validate against any
security context constraint: provider "restricted-v2": ...
The application image is built to run cleanly under restricted-v2.
If you see this, either:
- The image tag being deployed isn't the OpenShift-compatible build. Confirm `image.tag` matches what the vendor specified for OpenShift installs.
- A `securityContext` override in `values.yaml` is requesting privileges the SCC doesn't allow. Remove custom `securityContext` blocks and re-deploy (the check below shows what the release is actually rendering).
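To see which SCC admitted (or rejected) the pod and what `securityContext` settings the release is carrying, something along these lines works; the pod name is a placeholder and the grep patterns are just a convenience:

```sh
# Which SCC admitted the pod (the annotation is absent if admission was denied outright)
oc get pod <pod-name> -n <namespace> -o yaml | grep 'openshift.io/scc'

# What securityContext values the release is actually carrying
helm get values <release-name> -n <namespace> --all | grep -i -A5 securityContext
```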
Cause: Image manifest unknown (Bitnami subchart)
Failed to pull image "docker.io/bitnami/postgresql:<tag>":
manifest unknown
Bitnami changed their image hosting in mid-2025; the chart's pinned
postgres / redis tags are no longer published at docker.io/bitnami/.
Override the subchart image repositories in your values.yaml:
postgresql:
image:
repository: bitnamilegacy/postgresql
redis:
image:
repository: bitnamilegacy/redis
For air-gapped clusters: mirror the Bitnami images into your internal
registry and override repository to point there instead. See
deploy.md §0 "Air-gapped or restricted-egress clusters".
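For the air-gapped case, a minimal mirroring sketch looks like this; the registry path is a placeholder, and it assumes you have pull access to docker.io and push access to your internal registry:

```sh
# Mirror the legacy Bitnami images, then point the subchart repository overrides above at the internal copies
oc image mirror docker.io/bitnamilegacy/postgresql:<tag> \
  <internal-registry>/bitnamilegacy/postgresql:<tag>
oc image mirror docker.io/bitnamilegacy/redis:<tag> \
  <internal-registry>/bitnamilegacy/redis:<tag>
```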
Cause: Image pull failure
Failed to pull image "...": rpc error: ... unauthorized
- Verify the image-pull Secret exists in the right namespace and is referenced via `image.pullSecrets` in `values.yaml` (a re-creation sketch follows this list).
- Verify the Secret credentials are valid:

  oc get secret <pull-secret> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

- For air-gapped clusters, verify the image was successfully mirrored to your internal registry: `oc image info <internal-registry>/...`.
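If the Secret is missing or its credentials are stale, re-creating it looks roughly like this; the Secret name and registry host are placeholders:

```sh
oc create secret docker-registry <pull-secret> -n <namespace> \
  --docker-server=<registry-host> \
  --docker-username=<user> \
  --docker-password=<password-or-token>
```

Then make sure `image.pullSecrets` in `values.yaml` names that Secret.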
Cause: PVC pending
Pod has unbound immediate PersistentVolumeClaims
- Check the PVC: `oc get pvc -n <namespace>`. State should be `Bound`. If `Pending`, the requested StorageClass either doesn't exist or has no capacity (the PVC's events say which; see the check below).
- For multi-replica deployments, `ReadWriteOnce` PVCs cannot be mounted by more than one pod. Switch to `ReadWriteMany` storage or skip the PVC and use S3-backed Active Storage.
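To find out why a claim is stuck in `Pending`, the PVC's events usually name the missing StorageClass or capacity problem directly:

```sh
# The Events section names the provisioning failure; compare against the classes the cluster offers
oc describe pvc <pvc-name> -n <namespace>
oc get storageclass
```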
Migration Job fails
Symptom: helm install / helm upgrade reports "post-install hook failed"
The migration Job runs as a Helm post-install / post-upgrade hook; on failure, Helm aborts the release and the running deployment is unchanged.
# Find the failed Job (the chart names it with a release-revision suffix)
oc get jobs -n <namespace> -l app.kubernetes.io/component=migration
# Logs from the failed Job's migrate container
oc logs -n <namespace> -l app.kubernetes.io/component=migration -c migrate
Cause: Database connection failure
PG::ConnectionBad: could not connect to server: ...
- Bundled mode: confirm the postgres pod is `Running` and `Ready`. The migration Job runs after postgres is up, but if postgres crashed during install, the Job will hit a connection error. Check `oc logs sts/<release-name>-postgresql -n <namespace>`.
- Production mode: confirm `externalDatabase.host` in `values.yaml` resolves from inside the cluster. Test with:

  oc run -it --rm test-pg --image=postgres:15 --restart=Never -- \
    psql -h <host> -U <user> -d <db>
Cause: Insufficient database privileges
PG::InsufficientPrivilege: ERROR: permission denied
- The application database user needs `CREATE`, `ALTER`, and `DROP` on its own schema for migrations.
- Also needs the `CREATE EXTENSION` privilege if the application uses postgres extensions (the chart's `values.yaml` field `externalDatabase.allowExtensionCreate` must be true and the user must have the privilege). A hedged example covering both is sketched below.
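One way to satisfy both requirements from a database superuser session; schema ownership is one approach among several, and `pgcrypto` is only a stand-in for whichever extension the migration actually reports missing:

```sh
# Give the app user ownership of its schema so migrations can CREATE/ALTER/DROP freely
psql -h <host> -U postgres -d <db> -c 'ALTER SCHEMA public OWNER TO <app_user>;'

# Or pre-create the extension as superuser so the app user never needs CREATE EXTENSION itself
psql -h <host> -U postgres -d <db> -c 'CREATE EXTENSION IF NOT EXISTS pgcrypto;'
```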
Cause: Constraint violation on existing data
ERROR: column "..." contains null values
ERROR: duplicate key value violates unique constraint
The migration tried to add a constraint or unique index against data that doesn't satisfy it. Release notes for this version will call out pre-migration data cleanup steps. After cleanup, retry:
oc delete job/<release-name>-migrate -n <namespace>
helm upgrade ... # same command as before
Application returns 503 / probe failures
Symptom: Route returns 503, oc get pods shows pods running but 0/1 ready
Liveness or readiness probes are failing. Check pod events and container logs:
oc describe pod -l app.kubernetes.io/name=<app> -n <namespace>
oc logs deploy/<release-name> -n <namespace>
Cause: Slow boot
Rails takes longer than initialDelaySeconds to start, especially
on first pod start with a cold image cache. Increase
livenessProbe.initialDelaySeconds and readinessProbe.initialDelaySeconds
in values.yaml and helm upgrade.
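For example (the key paths follow the probe settings named above; the delays are illustrative starting points, not tuned values):

```yaml
livenessProbe:
  initialDelaySeconds: 120   # allow for a cold image cache on first boot
readinessProbe:
  initialDelaySeconds: 60
```

Then run the same `helm upgrade` command you normally use.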
Cause: Cannot reach database
Application boots but /health checks fail because the database
isn't reachable. Apply the database connection diagnostics in the
"Migration Job fails" section above.
Cause: Cannot reach redis
Same as database — /health checks redis reachability. Verify
redis is running and the connection string is correct.
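A quick in-cluster reachability check if you don't have a redis client handy; the image tag is arbitrary, and add `-a <password>` if redis auth is enabled:

```sh
oc run -it --rm test-redis --image=redis:7 --restart=Never -- \
  redis-cli -h <redis-host> -p 6379 ping
# Expected output: PONG
```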
License upload fails
Symptom: /licensing/installations/new returns "Invalid license"
Cause: License file mismatch
The license file is signed against a public key baked into the image at build time. If the vendor regenerated your license against a different keypair, or if the image build doesn't match the license, validation fails.
- Verify with the vendor: which image build was your license file generated against?
- Compare to the image tag your deployment is running:
  oc get deploy/<release-name> -n <namespace> \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
Cause: License file corruption during transfer
# Inspect what's actually mounted in the pod
oc exec deploy/<release-name> -n <namespace> -- cat /rails/config/license.json
Compare to the license file you originally received from the vendor. If different, re-create the Secret from the original.
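If it differs, re-create the Secret from the original file; the Secret name and key below are placeholders, so match whatever your `values.yaml` and the chart's mount actually reference:

```sh
oc delete secret <license-secret> -n <namespace>
oc create secret generic <license-secret> -n <namespace> \
  --from-file=license.json=/path/to/original/license.json

# Restart the pods so the new Secret is mounted
oc rollout restart deploy/<release-name> -n <namespace>
```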
Random-UID-related write failures
Symptom: Application logs show Errno::EACCES: Permission denied
The application is running as a random UID (correct for OpenShift) but trying to write to a directory the random UID can't access.
oc exec deploy/<release-name> -n <namespace> -- ls -la /rails/storage /rails/log /rails/tmp
Expected: directories owned by UID 1001, group 0, with mode `drwxrwxr-x` or `drwxrwsr-x`. If the group is not 0 or group-write is missing, the image is the older fixed-UID build.
Resolution: this requires a vendor image fix, not a customer-side configuration change. Confirm the image tag matches the OpenShift-compatible build, and contact the vendor if it does and you're still seeing EACCES.
Application logs show "bind: permission denied"
Symptom: Container restarts immediately, logs end with
{"level":"ERROR","msg":"Failed to start HTTP listener",
"error":"listen tcp :80: bind: permission denied"}
The application's reverse proxy (Thruster) is trying to bind a
privileged port (<1024). OpenShift's restricted-v2 SCC denies
this; non-root containers can only bind ports ≥1024.
The chart's default ConfigMap sets HTTP_PORT=3000 and
TARGET_PORT=8080 to keep both Thruster and Puma in the
non-privileged range, and the Deployment's args pin Puma to
-p 8080. If you're seeing this error, check:
- Your `values.yaml` doesn't override `config.*` env vars in a way that drops `HTTP_PORT` / `TARGET_PORT` (the check below shows what was actually rendered).
- The chart version installed includes the port-rerouting fix. Compare your release's chart version (`helm list -n <ns>`) with the version the vendor specified for OpenShift installs.
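To confirm what the release actually rendered for those variables (this works even while the pod is crash-looping):

```sh
# Both should appear with non-privileged values (3000 / 8080 by default)
helm get manifest <release-name> -n <namespace> | grep -E 'HTTP_PORT|TARGET_PORT'
```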
Single-replica vs multi-replica issues
Symptom: After scaling replicas above 1, second pod fails to start
Multi-Attach error for volume "...": Volume is already used by pod ...
You have a ReadWriteOnce PVC for application storage and tried
to scale beyond one replica. Two ways out:
- Switch to a `ReadWriteMany` storage class. Requires cluster support; check `oc get storageclass` for an RWX-capable class (NFS, CephFS, Trident, EFS, etc.). Update `persistence.storageClass` and `persistence.accessMode` in `values.yaml`, as in the sketch below.
- Disable the PVC and use S3-compatible object storage for Active Storage. Set `persistence.enabled: false` and configure `objectStorage.*`. Recommended for multi-replica.
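A minimal `values.yaml` sketch for the first option; the class name is a placeholder for whatever RWX-capable class `oc get storageclass` shows:

```yaml
persistence:
  storageClass: <rwx-storage-class>
  accessMode: ReadWriteMany
```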
Sessions invalidating on every pod rotation
Symptom: Users get logged out every time the application is upgraded or pods restart
SECRET_KEY_BASE is changing across pod rotations. This shouldn't
happen if the secret is referenced by name (secrets.existingSecret
in Production mode); it can happen in Bundled mode if you're
running helm uninstall then helm install instead of helm
upgrade.
- Production mode: the Secret is customer-owned and should not change between deployments. Rotate explicitly only when needed.
- Bundled mode: don't `uninstall` + `install` for upgrades; use `helm upgrade`. The chart's persisted Secret survives upgrades.
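For Production mode, the relevant setting is the `secrets.existingSecret` reference mentioned above; a minimal sketch, assuming the Secret already exists in the namespace with the key layout the chart expects:

```yaml
secrets:
  existingSecret: <customer-owned-secret-name>
```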
Filing a support request
When you cannot resolve an issue from this guide, file a support request and include the following:
1. Diagnostic bundle
Capture the following into a single tarball (sanitize secrets first — see step 2):
# Cluster context
oc version
oc whoami --show-server
# Release state
helm list -n <namespace>
helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace>
# Workload state
oc get all -n <namespace>
oc describe pods -n <namespace>
oc get events -n <namespace> --sort-by lastTimestamp
# Application logs (last 1000 lines)
oc logs deploy/<release-name> -n <namespace> --tail=1000
# Migration job logs (if relevant)
oc logs job/<release-name>-migrate -n <namespace>
# Image identity
oc get deploy/<release-name> -n <namespace> -o yaml | grep image:
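One way to collect all of the above into a single tarball (file names are arbitrary):

```sh
mkdir -p diag
oc get events -n <namespace> --sort-by lastTimestamp > diag/events.txt
oc logs deploy/<release-name> -n <namespace> --tail=1000 > diag/app.log
helm get values <release-name> -n <namespace> > diag/values.yaml
# ...redirect the remaining commands into diag/ the same way, then:
tar czf diagnostic-bundle.tar.gz diag/
```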
2. Sanitize before sending
Strip from the diagnostic bundle:
- `RAILS_MASTER_KEY` — appears in environment variables and Secret references. Replace with `[REDACTED]`.
- `SECRET_KEY_BASE` — same.
- Database passwords and connection strings — replace.
- Customer email addresses in audit logs — replace if confidential.
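A rough redaction helper for the captured files (GNU sed syntax; the pattern only covers `KEY=value` and `KEY: value` shapes, so still review the bundle by eye before sending):

```sh
grep -rlE 'SECRET_KEY_BASE|RAILS_MASTER_KEY' diag/ | xargs -r \
  sed -i -E 's/(SECRET_KEY_BASE|RAILS_MASTER_KEY)([=:][[:space:]]*)[^[:space:]]+/\1\2[REDACTED]/g'
```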
The application image's BUILD_GIT_SHA, the chart version, and the
license customer ID are non-sensitive and helpful for triage —
keep those.
3. What to describe in the ticket
- The procedure you ran (`helm install <args>` / `helm upgrade <args>`).
- The expected behavior.
- The observed behavior (error messages, screenshots).
- The timing — when did it start failing? After an upgrade? On a fresh install?
- What you've already tried from this troubleshooting guide.
Vendor support teams can resolve issues much faster from the diagnostic bundle plus a clear chronology than from a log snippet alone.