The Mystery of the Restarting Containers: A Production Docker Troubleshooting Case Study

It was 10:15 PM when my phone started vibrating uncontrollably. Monitoring alerts flooded the screen: “Critical Alert: Web Service is Down”. As the one responsible for the infrastructure, my heart skipped a beat. This wasn’t just a minor glitch; it was our primary service, used by thousands of users at night.

This article walks through a real Docker container crash, step by step, so you can apply the same diagnosis with confidence.

I scrambled to open my laptop, brewed a quick instant coffee, and SSHed into the server. My mind was racing, trying to guess what went wrong. Was it a DDoS attack? Did the database explode? Or did I make a mistake during this afternoon’s deployment?

This incident taught me that in the world of Docker, a green status (Running) doesn’t always mean your application is healthy. Often, there’s a storm brewing beneath the surface of a simple docker ps command.

Let me walk you through the troubleshooting steps I took that night: a journey from total confusion to finding the root causes, which turned out to be a combination of several small, overlooked details.

The symptoms: a never-ending alert loop

The initial user reports were clear: “The web is extremely slow, and we’re seeing frequent 502 Bad Gateway errors.”

The first thing I did was check the status of the running containers. With slightly shaking hands, I typed the “holy” command:

docker ps

The output made me squint in confusion. My main web container showed a status of: Up 45 seconds (restarting)

Every time I refreshed the command, that “45 seconds” would flip back to “2 seconds,” then “5 seconds,” then die again. This is what we call a Restart Loop. The container was trying to start, failing, dying, and being forced back to life by Docker’s restart policy.
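
To confirm a restart loop without staring at docker ps, the restart counter that Docker keeps per container can be read directly. The sketch below is a minimal example, not what I ran that night: the function names and the default threshold of 3 are assumptions made up for illustration, and it needs access to the Docker daemon.

```shell
# Print how many times Docker has restarted the given container.
restart_count() {
  docker inspect --format '{{.RestartCount}}' "$1"
}

# Succeed (exit 0) when the restart count exceeds a threshold.
# The default threshold of 3 is an arbitrary, illustrative value.
is_restart_looping() {
  count=$(restart_count "$1") || return 2
  [ "$count" -gt "${2:-3}" ]
}

# Usage (requires a running Docker daemon):
#   is_restart_looping web_app_container 5 && echo "restart loop suspected"
```

A counter that keeps climbing between two checks is a stronger signal than a single snapshot of the status column.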

[Figure: Docker restart loop visualization]

Investigation phase 1: reading the death message

A common beginner mistake is to immediately restart the server or delete the container. Don’t do that! You need evidence. And that evidence is in the logs.

I tried to see what the dying container was trying to say:

docker logs web_app_container

The logs were painfully brief, usually just containing:

[INFO] Starting application...
[ERROR] Could not connect to database at db_host:3306. Retrying...
[ERROR] Connection timeout. Exiting.

Okay, one clue found: the web app was dying because it couldn’t talk to the database. But wait: I checked the database container, and its status was Up 3 hours. So, the DB was alive. Why was the app saying it couldn’t connect?

Investigation phase 2: finding the silent killer (OOMKilled)

I dived deeper using docker inspect. This is how we look into the “guts” of a container.

docker inspect web_app_container

I searched for the State section. And there, I found the smoking gun:

"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": true,
    "Dead": false,
    "ExitCode": 137,
    "Error": "",
    "StartedAt": "2024-06-05T15:20:10Z",
    "FinishedAt": "2024-06-05T15:20:55Z"
}

OOMKilled: true. This meant the kernel had forcibly killed my container’s process because it was using more RAM than its memory limit allowed. Exit code 137 is the classic fingerprint of an Out Of Memory (OOM) killer assassination: the process was terminated with SIGKILL.
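
Rather than scrolling through the full inspect output, the two relevant fields can be pulled out with a format template, and the exit code can be decoded by the usual shell convention that values above 128 mean "terminated by signal number (code minus 128)". This is a minimal sketch; the function names are hypothetical helpers, not part of Docker.

```shell
# Print "<OOMKilled> <ExitCode>" for a container, for example "true 137".
oom_state() {
  docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' "$1"
}

# Translate an exit code into a short explanation.
# Codes above 128 conventionally mean the process died from a signal.
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error"
  fi
}

# Usage:
#   oom_state web_app_container   # requires a running Docker daemon
#   explain_exit_code 137         # prints: killed by signal 9
```

Signal 9 is SIGKILL. Note that exit code 137 on its own only says the process was killed; it is the OOMKilled flag that ties the kill to the memory limit.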

It turned out that earlier that afternoon, I had restricted the container’s RAM to 512MB in the docker-compose.yml file, not realizing that during startup, the application spikes to about 600MB to scan libraries and initialize caches.
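
One way to catch a startup spike like this before setting a limit is to sample memory usage repeatedly while the container boots, instead of looking once when it is idle. The helper below is a rough sketch under that assumption; the function name, sample count, and interval are illustrative, and short peaks between samples can still be missed.

```shell
# Print the container's memory usage several times in a row.
# $1 = container name, $2 = number of samples, $3 = pause between samples (seconds).
sample_memory() {
  name=$1
  samples=${2:-10}
  interval=${3:-1}
  i=0
  while [ "$i" -lt "$samples" ]; do
    # --no-stream returns one reading and exits instead of refreshing forever.
    docker stats --no-stream --format '{{.MemUsage}}' "$name"
    i=$((i + 1))
    sleep "$interval"
  done
}

# Usage (start it right after the container starts):
#   sample_memory web_app_container 20 1
```

The highest reading during startup, plus some headroom, is a more honest basis for the limit than the idle figure.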

[Figure: Resource usage visualization]

Investigation phase 3: the problem wasn’t over (disk full)

After I bumped the RAM limit, the web container started staying up longer. However, the server as a whole still felt sluggish. Even a simple ls command took 3 seconds to respond.

I checked the disk storage:

df -h

The output was staggering: Disk usage 99%. This was bizarre. This server has 100GB of disk space, and my app data isn’t even 10GB. Where did the rest go?

I used Docker’s built-in disk usage report to check the “trash”:

docker system df

It turned out that Docker’s log files had ballooned to tens of gigabytes. That container that was constantly restarting earlier? It had been generating heartless error logs nonstop. Because I hadn’t configured log rotation, Docker just kept writing until every byte of remaining disk space was gone.
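
To see exactly how large one container's log has grown, its log file can be measured directly. As far as I know, docker system df summarizes images, containers, volumes, and build cache rather than itemizing log files, so this direct check is the more reliable one. The sketch assumes the default json-file logging driver; the function name is made up, and reading the file usually requires root.

```shell
# Print the on-disk size of a container's log file.
# Assumes the json-file driver, where LogPath points at a real file.
container_log_size() {
  log_path=$(docker inspect --format '{{.LogPath}}' "$1") || return 1
  du -h "$log_path"
}

# Usage (typically needs sudo):
#   container_log_size web_app_container
```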

When a disk is full, the database cannot write temporary files and other subsystems slow down, triggering a domino effect that brings down the entire service.
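
A small check like the following could flag a filling disk long before it reaches that point. It is only a sketch: the function names are hypothetical, the default threshold of 80 percent is an assumption you should tune, and in practice the result would feed whatever alerting you already have.

```shell
# Print the usage of the filesystem holding the given path as a bare number.
# df -P gives stable POSIX output; column 5 of line 2 is the capacity, e.g. "42%".
disk_usage_percent() {
  df -P "${1:-/}" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed (exit 0) when usage is at or above the threshold (default 80).
disk_over_threshold() {
  usage=$(disk_usage_percent "$1") || return 2
  [ "$usage" -ge "${2:-80}" ]
}

# Usage, for example from cron:
#   disk_over_threshold / 80 && echo "disk usage high on /"
```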

Final solution: fixing the architecture

That night, I didn’t just increase the RAM. I overhauled the entire configuration to ensure this embarrassing incident would never happen again.

Here is a comparison of my problematic docker-compose.yml (from the afternoon) and the hardened version (from midnight).

Before the fix (full of gaps)

services:
  web:
    image: company/web-app:latest
    deploy:
      resources:
        limits:
          memory: 512M # Too small for startup spikes
    restart: always

  db:
    image: postgres:latest
    # No readiness checks for the database

After the fix (hardened & observable)

services:
  web:
    image: company/web-app:latest
    deploy:
      resources:
        limits:
          memory: 1G # Increased for a safety margin
        reservations:
          memory: 512M
    logging:
      driver: "json-file"
      options:
        max-size: "10m" # Limit log size to prevent disk overflow
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"] # Requires curl inside the image
      interval: 30s
      timeout: 10s
      retries: 3
    depends_on:
      db:
        condition: service_healthy # Wait until DB is actually ready
    restart: unless-stopped

  db:
    image: postgres:latest
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
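
Once healthchecks are defined, the health state Docker records can be polled from a script, for example at the end of a deployment. This is a minimal sketch, assuming the container has a healthcheck configured; the function names, attempt count, and delay are illustrative.

```shell
# Print the health state Docker recorded: starting, healthy, or unhealthy.
health_status() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Poll until the container reports "healthy" or the attempts run out.
# $1 = container name, $2 = attempts, $3 = delay between attempts (seconds).
wait_until_healthy() {
  name=$1
  attempts=${2:-30}
  delay=${3:-2}
  while [ "$attempts" -gt 0 ]; do
    [ "$(health_status "$name")" = "healthy" ] && return 0
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Usage (requires a running Docker daemon):
#   wait_until_healthy web_app_container 30 2 || echo "web never became healthy"
```

Running docker compose config before deploying is also a cheap way to catch YAML mistakes in the file itself.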

[Figure: Improved Docker Compose architecture]

Root cause recapitulation

Once the sun came up and the system was stable, I sat back to reflect. This incident happened because three small oversights collided at the worst possible time:

Wrong Memory Estimation: I only looked at RAM usage while the app was idle, not during the resource-heavy startup phase.
Startup Sequencing Issues: The web container tried to connect to the DB while it was still initializing files. Without a healthcheck or a wait-for-it mechanism, the web app simply crashed after its first failed connection.
Uncontrolled Logging: Without log rotation, a small incident can escalate into a disk space apocalypse.
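
The wait-for-it idea from the second point can be sketched as a generic retry wrapper placed in front of the application's start command. This is an illustration rather than the fix I actually shipped (that was the healthcheck plus depends_on shown earlier); the function name, the host name db, and the ./start-app command are all hypothetical.

```shell
# Run a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = delay between attempts (seconds), rest = command.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $n/$attempts failed: $*" >&2
    # Only pause when another attempt is still coming.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}

# Usage in an entrypoint script (host name and start command are placeholders):
#   retry 10 3 pg_isready -h db -U postgres && exec ./start-app
```

A bounded retry keeps the failure visible: if the database never comes up, the container still exits, but after a deliberate wait instead of on the first refused connection.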

Lessons learned from the field

For those managing containers in production, here is some hard-earned advice from the front lines:

Don’t Trust the Green Light: A container status of Up doesn’t mean the application is functional. Use Healthchecks to verify the actual state of the application inside the container.
Observability is Everything: Without monitoring resource usage (RAM, disk, CPU), you’re working in the dark. At the very least, set up alerts that fire when server disk usage reaches 80%.
Give Your Apps Room to Breathe: Don’t be too stingy with resource limits. Apps need extra room for unpredictable spikes during runtime (like garbage collection).
Log Rotation is Mandatory: This is the first thing you should set up, either in the Docker daemon configuration or in every docker-compose file. Don’t let text files destroy your business.
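
For the daemon-level option, log rotation defaults live in the daemon configuration file, which on Linux is normally /etc/docker/daemon.json. The sketch below writes an example file with the same limits used in my compose file. The function name is hypothetical, and it writes to a local path by default so it cannot clobber an existing configuration; merge the keys by hand if you already have one.

```shell
# Write an example daemon.json with json-file log rotation defaults.
# $1 = target path (defaults to ./daemon.json, NOT the live config).
write_log_rotation_config() {
  target=${1:-daemon.json}
  cat > "$target" <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
}

# Usage:
#   write_log_rotation_config ./daemon.json
#   # review, merge into /etc/docker/daemon.json, then restart the Docker daemon
```

The daemon has to be restarted for the change to take effect, and the new defaults apply only to containers created afterwards, so existing containers need to be recreated.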

If you’re new to Docker and found some terms confusing, I recommend starting with my Docker beginner’s guide. For overall server security, also check out the Linux server hardening best practices.

Closing: the future of my deployments

Since that incident, I’ve overhauled our team’s deployment standards. Every configuration file now includes healthchecks and log limits by default. We also utilize external monitoring that periodically pings the /health endpoint.

If you ever get an alert in the middle of the night like I did, take a breath. Don’t panic. Docker gives you all the tools needed to diagnose the problem; you just need to know where to look.

I hope this troubleshooting story is helpful for your own journey and prevents your servers from suffering a similar fate. 🚀

Kamandanu Wijaya
