Red Hat Developer Sandbox — gotchas

The Sandbox is Red Hat's free, 30-day-renewable shared OpenShift cluster (https://sandbox.redhat.com). It's the cheapest place to run a Bundled-mode install for evaluation or for our own end-to-end validation runs against a real OpenShift, but it has a handful of behaviors that don't apply on a customer-managed OpenShift cluster and that have cost real recovery time during validation work.

This doc captures them so the next pass through the validation queue doesn't have to rediscover them. None of these issues affect customer deployments — they're Sandbox-tenant policy, not chart or application defects.


member-operator owns .spec.replicas

The Sandbox runs a controller called member-operator that auto-scales idle developer workloads. Via Kubernetes server-side apply, it claims field ownership of .spec.replicas on every Deployment it watches. After the first scale-down, any subsequent helm upgrade that tries to set replicas — even implicitly via replicaCount in the chart values — fails with:

UPGRADE FAILED: conflict occurred while applying object
caldredge-dev/<release>-deltamap-host: conflict with "member-operator"
using apps/v1: .spec.replicas
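
You can confirm the ownership before picking a workaround: the field manager is recorded in the object's managedFields. A sketch (deployment name as used elsewhere in this doc; assumes jq is installed):

# Print every field manager that claims .spec.replicas on the deployment
oc get deployment/<release>-deltamap-host -n <ns> -o json --show-managed-fields \
  | jq -r '.metadata.managedFields[]
           | select(.fieldsV1["f:spec"]["f:replicas"] != null)
           | .manager'
# → member-operator (after the first scale-down)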

Workarounds:

  1. Don't set replicaCount in helm upgrade --set flags. If you need to scale, use oc scale deployment/<name> --replicas=N — that command takes ownership of the field cleanly without triggering the server-side-apply conflict.
  2. Use helm upgrade --reuse-values with no --set replicaCount=.... The replica count stays at whatever member-operator (or your last oc scale) set it to.
  3. --force on helm discards the field-manager state, but it replaces resources wholesale and is destructive to other in-flight changes; not recommended.

This isn't a problem on a customer-managed OpenShift — there's no external controller fighting .spec.replicas.


Idle-eviction to zero

After ~24h of no traffic to the Route, member-operator scales the Deployment + StatefulSets to 0 replicas. PVCs are kept; helm release status stays deployed. When you come back, you'll see:

oc get pods -n <namespace>      # → No resources found
oc get deploy -n <namespace>    # → READY 0/0  AVAILABLE 0

Recovery:

oc scale deployment/<release>-deltamap-host --replicas=1 -n <ns>
oc scale statefulset/extdb-postgresql       --replicas=1 -n <ns>   # if Bundled mode
oc scale statefulset/extredis-master        --replicas=1 -n <ns>

The application pod won't go Ready until postgres and redis are back. The wait-for-postgres init container handles this gracefully, but the app's normal liveness probe will start failing after ~60 s on a slow Sandbox node, so don't panic if you see 0/1 Running for a minute or two while the data-store pods finish booting.
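
A convenience version of the same recovery, as a sketch (object names copied from above; adjust to your release):

# Scale everything back up, then wait for the app deployment to settle
for obj in deployment/<release>-deltamap-host \
           statefulset/extdb-postgresql \
           statefulset/extredis-master; do
  oc scale "$obj" --replicas=1 -n <ns>
done
oc rollout status deployment/<release>-deltamap-host -n <ns> --timeout=5m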


Token TTL is short

The oc login --token=... session is good for about 24 h (see the rotation note below), then the cluster starts returning:

the server has asked for the client to provide credentials

Refresh: open https://oauth-openshift.apps.<cluster>.openshiftapps.com/oauth/token/display in a browser; it shows a fresh oc login command line you can paste.


oc debug doesn't help when the initializer crashes

oc debug deploy/<name> -- <command> clones the deployment's pod spec but overrides the command. Useful for one-off bin/rails invocations. The limit: the override only changes the command, not the boot path. bin/rails db:migrate still boots the full Rails environment (the migrate task depends on the environment task, which runs every initializer). If an initializer crashes (e.g. it reads a column that doesn't exist yet), oc debug -- bin/rails db:migrate fails the same way the deployment's normal pod fails.

The runbook's oc debug recipe is right for: connecting to a working DB, rendering a one-off report, exec'ing an isolated maintenance task. It's wrong for: bypassing a boot-time crash to "just run migrations" — the boot is the crash.
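
The distinction in command form, as a sketch (the runner one-liner is illustrative):

# Fine when the app boots: an isolated task against a working DB
oc debug deploy/<release>-deltamap-host -n <ns> -- bin/rails runner 'puts ActiveRecord::Base.connection.active?'

# Fails exactly like the crashing pod: db:migrate boots every initializer first
oc debug deploy/<release>-deltamap-host -n <ns> -- bin/rails db:migrate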


Wedged-release recovery

When helm upgrade times out waiting for a crashlooping deployment to become Ready, the release lands in STATUS: pending-upgrade and no further upgrade can run until it's cleared.
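
To confirm the wedge and find a revision to roll back to, the standard helm commands are enough (nothing Sandbox-specific here):

helm status <release> -n <ns>     # shows STATUS: pending-upgrade
helm history <release> -n <ns>    # find the last healthy revision number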

Cleanest path on Sandbox:

helm rollback <release> <last-good-revision> -n <ns> --no-hooks

--no-hooks skips the post-rollback migration Job. Useful for clearing the wedge fast; you'd typically follow with a real helm upgrade once the underlying issue (image bug, missing migration, broken values) is resolved.

If rollback can't recover (very stale schema vs. very new code), full uninstall + reinstall is the Sandbox-only escape hatch:

helm uninstall <release> <db-release> <redis-release> -n <ns>
oc delete pvc --all -n <ns>  # only if you want a clean DB
helm install <db-release> ... -f values.yaml
helm install <redis-release> ... -f values.yaml
helm install <release> ./<chart-path> -f values.yaml

This is destructive — loses application data, license file, admin user. Acceptable on Sandbox; never the right call on a customer-managed cluster.


Pull policy + the :latest tag

The Sandbox install uses image.tag: latest and pullPolicy: Always. With those defaults, oc rollout restart always pulls the newest digest under :latest. On customer clusters the recommended pattern is to pin to a specific version tag (e.g. :v0.2.3) and bump it via helm upgrade --set image.tag=v0.2.4, which gives proper release boundaries and rollback semantics. The :latest flow is fine for our own validation but not what a customer should run.
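
What the customer-style bump looks like, sketched (the image.pullPolicy key path is an assumption about the chart's values layout; tags are illustrative):

# Pin the version and pull policy once, then bump the tag per release
helm upgrade <release> ./<chart-path> -n <ns> \
  --reuse-values \
  --set image.tag=v0.2.4 \
  --set image.pullPolicy=IfNotPresent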


Reauth + token rotation

The Sandbox's oc token rotates every ~24h. Scripts that exec oc commands need to re-login when their token expires. Symptom: memcache.go:265 ... unhandled Error ... the server has asked for the client to provide credentials on the first oc call after a gap. Re-run oc login per the "Token TTL" section above.
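
A minimal guard for such scripts, as a sketch:

# Bail out early when the cached token has expired
if ! oc whoami >/dev/null 2>&1; then
  echo "oc token expired; refresh via the token display URL (see Token TTL above)" >&2
  exit 1
fi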


What this doc is not

These are Sandbox-specific issues. None of them apply to a customer running OpenShift on their own infrastructure:

Sandbox issue                              Customer-managed OpenShift
-----------------------------------------  -----------------------------------------
member-operator .spec.replicas ownership   No such controller
Idle-eviction to zero                      No idle eviction
~24 h token TTL                            Customer-controlled (typically days/weeks via cluster auth integration)
:latest image tag                          Customers should pin to specific versions

So when validation runs surface a Sandbox-specific behavior, log it here, not in troubleshooting.md — the customer-facing docs stay focused on the customer environment.