How Important is Monitoring and How to Choose the Right Tools

“Why is the website down?”


That question came from the client, directly in my team's WhatsApp group, at 3 PM on a Sunday.

I opened my laptop and checked the server. Everything looked normal. The website was accessible from the VPN network. But from outside? Unreachable.

It took 2 hours to find the problem. Turns out one of the load balancer nodes had a memory leak and stopped responding. But because there was no proper monitoring, there was no alert. No notification. We only found out after customers complained.

That experience changed how I view monitoring. From “nice to have” to “absolutely critical”.

Why monitoring is often neglected

Many IT teams, especially small ones, treat monitoring as something that can be postponed. The reasons vary.

“The server is still new, no need for monitoring yet.”

“We can check manually if there is a problem.”

“Setting up monitoring is complicated and time-consuming.”

I used to think like that too. Until the incident above woke me up.

Reality on the ground

Without monitoring, you are only reacting. Not being proactive.

Imagine you have a car without a dashboard. No speedometer, no fuel indicator, no engine warning light. You can still drive, but you will not know there is a problem until the car breaks down in the middle of the road.

A server without monitoring is exactly like that.

When the disk is almost full, you do not know. When memory usage spikes dramatically, you do not know. When there is an unusual traffic spike, you do not know.

Until everything crashes.

What should be monitored

Before talking about tools, let us first determine what needs to be monitored.

Infrastructure metrics

This is the foundation of all monitoring.

CPU usage: Is the server working too hard?
Memory usage: Is there a memory leak?
Disk usage: Is storage almost full?
Network I/O: Is there a bandwidth bottleneck?
Load average: Is the system overwhelmed?
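As a rough illustration of what these numbers look like on a single host, here is a minimal sketch that reads a few of them with Node's built-in os module. The function name hostSnapshot and the shape of the returned object are assumptions made for this example. Disk and network figures are left out because the os module does not expose them, and the memory percentage is only approximate since platforms count cached memory differently.

```javascript
const os = require('os');

// Collect a one-off snapshot of basic host metrics using only the standard library.
function hostSnapshot() {
  const cpuCount = Math.max(os.cpus().length, 1); // guard against an empty CPU list
  const totalMem = os.totalmem();
  const freeMem = os.freemem();
  const [load1, load5, load15] = os.loadavg(); // always [0, 0, 0] on Windows

  return {
    cpuCount,
    memoryUsedPercent: ((totalMem - freeMem) / totalMem) * 100,
    loadAverage: { load1, load5, load15 },
    // A 1-minute load per core above 1.0 suggests more runnable work than CPU capacity.
    loadPerCore: load1 / cpuCount,
  };
}

console.log(hostSnapshot());
```

A real agent such as node_exporter or Netdata collects far more than this, and does it continuously. The sketch only shows where the raw values come from.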

Application metrics

Besides infrastructure, applications also need to be monitored.

Response time: How long does it take to process a request?
Error rate: What percentage of requests fail?
Requests per second: How much traffic is coming in?
Database query time: Are there slow queries?
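To make these definitions concrete, the sketch below computes requests per second, error rate, and a 95th percentile response time from a list of request samples. The sample shape (durationMs, status), the function name summarizeRequests, and the choice to count only 5xx responses as errors are assumptions for this example, not a standard.

```javascript
// Each sample is one handled request: duration in milliseconds and HTTP status code.
function summarizeRequests(samples, windowSeconds) {
  if (samples.length === 0) {
    return { requestsPerSecond: 0, errorRatePercent: 0, p95Ms: 0 };
  }

  // Assumption: only server-side failures (5xx) count as errors here.
  const errors = samples.filter((s) => s.status >= 500).length;
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);

  // Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
  const rank = Math.ceil(0.95 * sorted.length);

  return {
    requestsPerSecond: samples.length / windowSeconds,
    errorRatePercent: (errors / samples.length) * 100,
    p95Ms: sorted[rank - 1],
  };
}

const samples = [
  { durationMs: 120, status: 200 },
  { durationMs: 80, status: 200 },
  { durationMs: 950, status: 500 },
  { durationMs: 110, status: 200 },
];

console.log(summarizeRequests(samples, 2));
// { requestsPerSecond: 2, errorRatePercent: 25, p95Ms: 950 }
```

A percentile is usually more telling than an average here, because one slow request in twenty can hide behind a healthy-looking mean.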

Business metrics

This is often forgotten, yet very important.

Active users: How many users are currently using the system?
Transaction volume: How many transactions per hour?
Conversion rate: Are there anomalies in the funnel?

Log aggregation

Metrics alone are not enough. You also need logs for debugging.

Application logs: Error messages, stack traces
Access logs: Who is accessing what
Security logs: Failed login attempts, suspicious activities

Popular monitoring tools comparison

There are many choices on the market. Here are some I frequently use.

Prometheus + Grafana

The most popular combination for open source monitoring.

Pros:

Free and open source
Flexible pull-based model
Powerful query language (PromQL)
Wide ecosystem of exporters
Beautiful Grafana dashboards

Cons:

Initial setup can be complex
Requires sufficient storage for time series data
Limited built-in alerting UI; notifications need Alertmanager

Good for: Teams that have the resources to set it up and maintain it themselves.

Datadog

A very complete commercial monitoring platform.

Pros:

Very easy setup
Intuitive UI
APM, logs, and metrics in one platform
Integration with hundreds of services
Responsive support

Cons:

Expensive, especially for large infrastructure
Vendor lock-in
Pricing can be confusing

Good for: Companies with a budget that need a quick solution.

Zabbix

A veteran in the monitoring world, around since 2001.

Pros:

Free and open source
Very complete features
Good auto-discovery
Can monitor almost any type of device

Cons:

UI feels outdated
Learning curve is quite steep
Documentation can be confusing

Good for: Organizations that need traditional infrastructure monitoring.

Uptime Kuma

A lightweight solution for uptime monitoring.

Pros:

Very easy setup (one Docker container)
Modern and clean UI
Notifications to many channels
Self-hosted, free forever

Cons:

Focuses only on uptime, not detailed metrics
No APM or log aggregation

Good for: Simple endpoint monitoring, suitable for startups or personal projects.

Netdata

Real time monitoring with attractive visualization.

Pros:

One-command installation
Detailed real time dashboard
Lightweight, not much overhead
Free for basic usage

Cons:

Alert configuration not as flexible as Prometheus
Cloud version is paid for full features

Good for: Quick setup to see what is happening on a server.

How to choose the right one

There is no one-size-fits-all solution. The choice depends on several factors.

Consider budget

If budget is limited, Prometheus + Grafana is a solid choice. The investment is in setup time and learning curve.

If budget is available and you need to be up and running quickly, Datadog or other SaaS solutions make more sense.

Consider team skills

Open source tools require expertise to set up and maintain. If the team does not have the bandwidth for that, a SaaS solution is more practical.

Consider scale

For small infrastructure (fewer than 10 servers), Uptime Kuma or Netdata is enough to start.

For medium to large infrastructure, Prometheus + Grafana or Zabbix is more appropriate.

Consider specific needs

Need APM? Datadog, or Jaeger if distributed tracing is what you are after.

Need log aggregation? ELK Stack or Loki.

Need network monitoring? Zabbix or PRTG.

Minimal setup I recommend

For startups or small projects, this is the minimal setup I suggest.

Level 1: uptime monitoring

Start with the most basic. You need to know when a website or API is down.

# docker-compose.yml for Uptime Kuma
version: '3.8'
services:
  uptime-kuma:
    image: louislam/uptime-kuma
    ports:
      - "3001:3001"   # web UI on port 3001 of the host
    volumes:
      - uptime-kuma-data:/app/data   # keeps monitors and history across restarts
    restart: unless-stopped

volumes:
  uptime-kuma-data:

Set up monitors for each important endpoint. Enable notifications to Telegram, Slack, or email.
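To show what an uptime monitor does on each check, here is a minimal sketch of a single HTTP probe. It assumes Node 18 or newer, where fetch and AbortSignal.timeout are available globally. The names probe and evaluateProbe, the three states, and the latency limit are all made up for this example; Uptime Kuma's own implementation is more thorough.

```javascript
// Request the URL once and record whether it answered, with what status, and how fast.
async function probe(url, timeoutMs) {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    // Only the status matters for this check, so discard the response body.
    if (res.body) await res.body.cancel();
    return { ok: res.ok, status: res.status, latencyMs: Date.now() - started };
  } catch (err) {
    // DNS failures, refused connections, and timeouts all end up here.
    return { ok: false, status: null, latencyMs: Date.now() - started, error: String(err) };
  }
}

// Turn a probe result into a state that a notification rule can act on.
function evaluateProbe(result, maxLatencyMs) {
  if (!result.ok) return 'down';
  return result.latencyMs > maxLatencyMs ? 'degraded' : 'up';
}

console.log(evaluateProbe({ ok: true, status: 200, latencyMs: 120 }, 1000)); // up
console.log(evaluateProbe({ ok: false, status: 503, latencyMs: 90 }, 1000)); // down
```

Calling probe('https://example.com/health', 5000) with your own endpoint in place of the placeholder URL returns a result that evaluateProbe can classify. The check is only meaningful when it runs from outside your own network: a probe from inside the VPN would have reported the site in the opening story as healthy.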

Level 2: server metrics

After uptime, monitor server health.

Netdata can be installed in one command.

bash <(curl -Ss https://my-netdata.io/kickstart.sh)

Or, if you want something more robust, set up Prometheus with node_exporter.

Level 3: application metrics

Instrument your application to expose metrics.

For Node.js, use prom-client.

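const client = require('prom-client');

// Histogram of request durations, labelled so you can slice by method, route, and status.
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status']
});

// In a request handler: start a timer, then stop it with the final label values.
// The route below is a hypothetical example.
// const end = httpRequestDuration.startTimer();
// end({ method: 'GET', route: '/orders', status: 200 });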
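If the histogram type is new to you, the following sketch mimics what one records, using plain JavaScript and no library. Prometheus histograms keep cumulative bucket counters plus a running sum and count. The bucket boundaries and the createHistogram helper are assumptions for illustration; this is not prom-client's internal code.

```javascript
// Hypothetical bucket upper bounds in seconds; real ones should match your latency profile.
const BUCKETS = [0.1, 0.5, 1, 5];

function createHistogram(bounds) {
  const counts = bounds.map(() => 0);
  let sum = 0;
  let count = 0;

  return {
    observe(value) {
      // Buckets are cumulative: a value increments every bucket whose upper bound is >= value.
      bounds.forEach((le, i) => {
        if (value <= le) counts[i] += 1;
      });
      sum += value;
      count += 1;
    },
    snapshot() {
      const buckets = bounds.map((le, i) => ({ le, count: counts[i] }));
      buckets.push({ le: '+Inf', count }); // the +Inf bucket always equals the total count
      return { buckets, sum, count };
    },
  };
}

const h = createHistogram(BUCKETS);
[0.05, 0.3, 0.7, 2].forEach((seconds) => h.observe(seconds));
console.log(h.snapshot().buckets);
// counts per bucket: 0.1 -> 1, 0.5 -> 2, 1 -> 3, 5 -> 4, +Inf -> 4
```

Because the buckets are cumulative, percentiles are later estimated from them at query time. This is why picking boundaries close to your real latencies matters.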

Level 4: log aggregation

Centralize all logs to one place.

Loki + Grafana is a lightweight combination that works well for this.
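Centralized logs are much easier to search when every line has the same structure. As one possible approach, the sketch below emits one JSON object per line. The field names (ts, level, msg) and the example fields are assumptions for illustration, not something Loki requires.

```javascript
// Build one JSON log line. Extra fields are spread first so they cannot
// overwrite the core keys ts, level, and msg.
function formatLogLine(level, message, fields = {}, now = new Date()) {
  return JSON.stringify({
    ...fields,
    ts: now.toISOString(),
    level,
    msg: message,
  });
}

// Hypothetical event: orderId and status are example fields.
console.log(formatLogLine('error', 'payment failed', { orderId: 'A-1001', status: 502 }));
```

Writing lines like this to stdout keeps the application unaware of where the logs end up. The collector that ships them can be swapped without touching the code.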

Anti-patterns I often encounter

Over the years, I have seen several common mistakes in monitoring implementation.

Alert fatigue

Too many alerts make people ignore them. If a notification fires every 5 minutes, people stop responding to any of them.

Solution: Alert only for actionable things. CPU usage at 80% might not need an alert. CPU usage at 95% for 5 minutes? Needs one.
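The "95% for 5 minutes" rule can be sketched as a check for a sustained breach rather than a single high reading. It is similar in spirit to the for clause in Prometheus alerting rules. The function name sustainedBreach and the sample shape (timeMs, value) are assumptions for this example, and samples are expected in time order.

```javascript
// Returns true only if the value has stayed at or above `threshold`
// continuously for at least `forMs`, up to the most recent sample.
function sustainedBreach(samples, threshold, forMs) {
  let breachStart = null;
  for (const sample of samples) {
    if (sample.value >= threshold) {
      if (breachStart === null) breachStart = sample.timeMs;
    } else {
      breachStart = null; // any dip below the threshold resets the clock
    }
  }
  if (breachStart === null) return false;
  const last = samples[samples.length - 1];
  return last.timeMs - breachStart >= forMs;
}

const MINUTE = 60 * 1000;

// One CPU reading per minute: high from minute 1 through minute 6.
const cpuSamples = [82, 96, 97, 98, 96, 99, 97].map((value, i) => ({ timeMs: i * MINUTE, value }));
console.log(sustainedBreach(cpuSamples, 95, 5 * MINUTE)); // true

// A short spike that recovers does not qualify.
const spike = [96, 70, 96].map((value, i) => ({ timeMs: i * MINUTE, value }));
console.log(sustainedBreach(spike, 95, 5 * MINUTE)); // false
```

The reset on every dip is what filters out brief spikes, which are the main source of noisy alerts.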

No baseline

How do you know something is abnormal if you do not know what is normal?

Solution: Collect data for a few weeks before setting thresholds. Understand the normal pattern.
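One simple way to turn collected data into a threshold is to take the mean and add a few standard deviations. This is a common heuristic and only one option among several. It assumes the metric is reasonably stable, so for spiky or seasonal metrics a percentile or a per-hour baseline may fit better. The function name and sample values are made up for this example.

```javascript
// Derive an alert threshold from historical values: mean + k standard deviations.
function baselineThreshold(values, k = 3) {
  if (values.length === 0) {
    throw new Error('need at least one historical value');
  }
  const mean = values.reduce((acc, v) => acc + v, 0) / values.length;
  const variance = values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / values.length;
  const stdDev = Math.sqrt(variance);
  return { mean, stdDev, threshold: mean + k * stdDev };
}

// Hypothetical daily average CPU percentages from a quiet server.
const cpuHistory = [40, 42, 38, 41, 39];
const baseline = baselineThreshold(cpuHistory);
console.log(baseline.mean); // 40
console.log(baseline.threshold.toFixed(2)); // 44.24
```

With only five values the result is not statistically meaningful. The few weeks of data recommended above give a far more trustworthy baseline.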

Monitoring without runbook

Alert goes off, then what? If there is no documentation about what to do, the alert is useless.

Solution: Every alert should have a runbook. Steps to investigate and resolve.

Single point of monitoring

Running the monitoring stack on the same server as the application. If that server dies, monitoring dies with it.

Solution: Monitoring should be separate from what is being monitored. Ideally on a different machine or even a different cloud provider.

To ensure the monitored servers are also secure, make sure you have applied Linux hardening best practices.

Monitoring evolution in my team

I want to share how monitoring in my team evolved over time.

Year 1: No monitoring. Manual checks via SSH.

Year 2: Basic uptime monitoring with Pingdom (free version).

Year 3: Self hosted Uptime Kuma + Netdata on every server.

Year 4: Prometheus + Grafana for metrics. Loki for logs.

Year 5: Full observability stack with tracing (Jaeger), APM, and custom dashboards.

You do not need to jump straight to the Year 5 setup. Start simple and iterate as needs grow.

Lessons I learned

First, monitoring is not an expense, it is an investment. Time spent on setting up monitoring will pay off many times over when incidents happen.

Second, start simple. Do not try to set up a complex monitoring stack on day one. Start with uptime, then expand.

Third, monitoring without a response plan is useless. Alerts must be followed by clear action.

Fourth, review and refine regularly. Monitoring needs change as infrastructure grows.

Closing thoughts

Monitoring is your eyes and ears in the infrastructure world. Without it, you are blind to what is happening in the systems you manage.

You do not need the most expensive or most sophisticated tools. What matters is visibility into system health.

Start with something simple. Uptime monitoring for important endpoints. Basic metrics for each server. Alerts to a channel that is definitely read.

Do not wait until the CEO asks “Why is the website down?” to think about monitoring.

Because by then, it is already too late.

I hope this guide on infrastructure monitoring helps you make better decisions in real-world situations.

Kamandanu Wijaya
