20 Scenarios · 4 Categories · 40+ CLI Commands · 8 Common Mistakes
💾

Filesystem

Disk full, permission errors, inode exhaustion, mount issues

⚙️

Processes

High CPU, zombies, OOM kills, open file limits

🌐

Networking

Ports unreachable, DNS failures, firewall blocks, interface down

🔧

Services & Cron

Systemd failures, restart loops, cron not firing

Bash One-liners

Quick triage commands reusable by L1 staff

⚠️

Common Mistakes

Misconfigs that cause 80% of first-response tickets

How to use this runbook

  • ① Use the sidebar or search bar to locate a scenario by symptom or keyword.
  • ② Click a scenario card to expand diagnosis steps, commands, and resolution.
  • ③ Run commands on the affected host in the order shown.
  • ④ If unresolved after L1 steps, escalate with the output collected.
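
Step ④ depends on having the command output at hand when escalating. One way to collect it is a small wrapper like the sketch below; the function name `log_cmd` and the log path in the usage note are illustrative assumptions, not part of this runbook's tooling.

```shell
# Hypothetical helper: run one diagnostic command and append the command
# line, its output (stdout and stderr), and its exit status to a log file.
# Usage: log_cmd <logfile> <command> [args...]
log_cmd() {
    logfile=$1
    shift
    {
        printf '### %s\n' "$*"
        "$@" 2>&1
        printf '### exit status: %s\n\n' "$?"
    } >> "$logfile"
}
```

For example, `log_cmd /tmp/triage.log df -hT` appends the disk usage table to `/tmp/triage.log`. Pipelines need to be wrapped, e.g. `log_cmd /tmp/triage.log sh -c 'ss -tln | head'`.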
FS-01
Disk Space Alert — Partition at 90%+ Usage
High df · du · find

Symptoms

Application writes fail with No space left on device. Monitoring alert fires for partition >90% full. Log rotation stops.

Diagnosis Steps

  1. Check overall disk usage per partition.
    df -hT
  2. Identify the 20 largest files and directories on the root filesystem.
    du -ahx / 2>/dev/null | sort -rh | head -20
  3. Find large files (over 500 MB) across the filesystem.
    find / -xdev -type f -size +500M -exec ls -lh {} \; 2>/dev/null
  4. Check for deleted-but-open files still holding space (common with log handles).
    lsof +L1 | grep -i deleted

Resolution

Archive or delete identified large files. If log files are open by processes, restart the process so the file handle is released. For recurring issues, implement logrotate policies.

# Clear old journal logs safely (keep last 7 days)
sudo journalctl --vacuum-time=7d

# Remove old compressed logs
sudo find /var/log -name "*.gz" -mtime +30 -delete

# Restart service to release deleted-file handles
sudo systemctl restart <service-name>
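
To pick out only the partitions that cross the alert threshold, the `df` output can be filtered. This is a minimal sketch, assuming POSIX-format output from `df -P`; the function name `full_partitions` is hypothetical.

```shell
# Print mount points whose usage is at or above a threshold (default 90).
# Reads `df -P` output on stdin, so it also works on saved output.
# Usage: df -P | full_partitions 90
full_partitions() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)          # "91%" -> "91"
            if (use + 0 >= limit + 0)
                print $6, use "%"
        }'
}
```

Mount points that contain spaces are not handled by this sketch.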
FS-02
Permission Denied — User Cannot Access File or Directory
Medium chmod · chown · stat

Symptoms

User gets Permission denied when reading, writing, or executing a file. Service fails to read its config or write to its data directory.

Diagnosis Steps

  1. Check the file's permissions and ownership.
    stat /path/to/file
    ls -la /path/to/file
  2. Check what user/group the process or shell is running as.
    whoami
    id
    ps aux | grep <process-name>
  3. Verify parent directory permissions (the execute bit is needed to traverse every directory in the path).
    ls -ld /path/to/
    namei -l /path/to/file
  4. Check SELinux or AppArmor denials if standard permissions look correct.
    sudo ausearch -m avc -ts recent   # SELinux
    sudo dmesg | grep -i apparmor     # AppArmor

Resolution

# Fix ownership
sudo chown user:group /path/to/file

# Fix permissions (files: 644, dirs: 755 as baseline)
sudo chmod 644 /path/to/file
sudo chmod 755 /path/to/dir

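# Recursively fix a directory tree (755 for directories, 644 for files)
sudo chown -R www-data:www-data /var/www/html
sudo find /var/www/html -type d -exec chmod 755 {} +
sudo find /var/www/html -type f -exec chmod 644 {} +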
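
Because a missing execute bit on any parent directory blocks access, it helps to see the permissions of the whole path at once. The sketch below does roughly what `namei -l` does, for hosts where that tool is missing; it assumes GNU `stat` (the `-c` option) and an absolute path, and the name `path_perms` is hypothetical.

```shell
# Print octal mode, owner:group and name for every component from / down
# to the target path. Requires an absolute path and GNU stat.
# Usage: path_perms /var/www/html/index.html
path_perms() {
    p=$1
    chain=
    # Walk upwards, prepending each component so the output runs top-down.
    while [ -n "$p" ] && [ "$p" != "/" ] && [ "$p" != "." ]; do
        chain="$p
$chain"
        p=$(dirname "$p")
    done
    printf '%s\n%s' "/" "$chain" | while IFS= read -r entry; do
        if [ -n "$entry" ]; then
            stat -c '%a %U:%G %n' "$entry"
        fi
    done
}
```

A directory in the output whose mode lacks the execute bit for the relevant user, group, or other class is the likely blocker.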
FS-03
Inode Exhaustion — No Space Left Despite Free Disk
High df -i · find

Symptoms

No space left on device errors even though df -h shows ample free space. File creation fails. Often caused by millions of small temp files or mail queue entries.

Diagnosis Steps

  1. Confirm inode exhaustion (look for 100% under IUse%).
    df -ih
  2. Find the directory with the most files.
    find / -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -20
  3. Count files in suspected directories.
    ls /tmp | wc -l
    ls /var/spool/postfix/maildrop | wc -l

Resolution

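# Remove regular temp files that have not been modified recently
# (-mtime +1 ignores fractional days, so it matches files at least 2 days old)
sudo find /tmp -xdev -type f -mtime +1 -delete

# Delete all deferred mail if the Postfix queue is the culprit (irreversible)
sudo postsuper -d ALL deferred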

# For PHP session file buildup
sudo find /var/lib/php/sessions -type f -mtime +1 -delete
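
The directory-ranking step from the diagnosis can be scoped to a single tree, which is faster than scanning `/`. A minimal sketch, assuming GNU `find` (for `-printf`); the function name `dirs_by_entry_count` is hypothetical.

```shell
# Rank directories under a starting point by the number of entries they hold.
# Usage: dirs_by_entry_count <start-dir> [how-many]
dirs_by_entry_count() {
    find "${1:-/}" -xdev -mindepth 1 -printf '%h\n' 2>/dev/null |
        sort | uniq -c | sort -rn | head -n "${2:-20}"
}
```

For example, `sudo sh -c '. ./helpers.sh; dirs_by_entry_count /var 10'` would list the ten busiest directories under `/var`, where `helpers.sh` stands for whatever file holds the function.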
FS-04
Read-only Filesystem — Writes Failing After Crash
High dmesg · fsck · mount

Symptoms

All write operations fail with Read-only file system. Typically happens after an unclean shutdown or I/O errors that caused the kernel to remount read-only for data safety.

Diagnosis Steps

  1. Confirm the filesystem is mounted read-only.
    mount | grep -E "[(,]ro[,)]"
    cat /proc/mounts
  2. Check kernel I/O error messages.
    sudo dmesg | grep -iE "error|I/O|remount|read-only" | tail -30
  3. Check the superblock for filesystem state, mount count, and last check time (ext2/3/4 only).
    sudo tune2fs -l /dev/sda1 | grep -iE "state|mount|check"

Resolution

# Attempt live remount read-write (only if no hardware errors)
sudo mount -o remount,rw /

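# If there are journal errors, the filesystem needs an offline fsck.
# To force a check of / at next boot on systemd-based distributions, add
# fsck.mode=force to the kernel command line for that boot.
# (touch /forcefsck cannot succeed while / is read-only, and current
# systemd releases may ignore that file.)
sudo reboot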

# Or force fsck on a specific device (unmounted):
sudo fsck -y /dev/sda1
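
A plain text search for `ro` can match unrelated options such as `errors=remount-ro`. The sketch below checks the option list field by field instead; it assumes `/proc/mounts`-style input, and the name `ro_mounts` is hypothetical.

```shell
# List mount points whose option list contains the exact option "ro".
# Input lines: device mountpoint fstype options dump pass
# Usage: ro_mounts < /proc/mounts
ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }'
}
```

Some mounts (for example certain pseudo-filesystems or ISO images) are read-only by design, so only unexpected entries matter.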
FS-05
Mount Point Failure — Filesystem Not Mounted at Boot
Medium fstab · mount · blkid

Symptoms

Expected directory is empty after reboot. Application cannot find its data. Service fails to start because its data volume is not mounted. /etc/fstab entry may be misconfigured.

Diagnosis Steps

  1. Check currently mounted filesystems.
    mount | column -t
    lsblk -f
  2. Verify the device UUID matches the fstab entry.
    sudo blkid
    cat /etc/fstab
  3. Check systemd mount unit failures.
    systemctl --failed
    sudo journalctl -u "*.mount" --since "1 hour ago"

Resolution

# Mount all filesystems in fstab (dry run first, then for real)
sudo mount -a --fake --verbose
sudo mount -a --verbose

# Fix a UUID mismatch in /etc/fstab
# 1. Get correct UUID:
sudo blkid /dev/sdb1
# 2. Edit fstab to update UUID:
sudo nano /etc/fstab
# UUID=<correct-uuid>  /data  ext4  defaults  0  2

# Test fstab syntax without rebooting:
sudo findmnt --verify
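
Comparing fstab UUIDs against `blkid` by eye is error-prone. As a sketch, the UUID entries can be extracted first; the function name `fstab_uuids` is hypothetical.

```shell
# Print "UUID mountpoint" for every active (non-comment) UUID= entry.
# Usage: fstab_uuids [fstab-file]     (defaults to /etc/fstab)
fstab_uuids() {
    awk '$1 !~ /^#/ && $1 ~ /^UUID=/ {
        sub(/^UUID=/, "", $1)
        gsub(/"/, "", $1)               # tolerate UUID="..." quoting
        print $1, $2
    }' "${1:-/etc/fstab}"
}
```

Each UUID can then be checked with `sudo blkid -U <uuid>`, which prints the matching device or fails if no device carries that UUID.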
PR-01
High CPU — System Load Spike, Identifying the Offender
High top · ps · strace

Symptoms

Server load average exceeds CPU count. Applications become sluggish. Monitoring alert fires on CPU >85% sustained. SSH response is slow.

Diagnosis Steps

  1. Get an instant top-process snapshot sorted by CPU.
    top -b -n1 | head -20
    # or
    ps aux --sort=-%cpu | head -15
  2. Check load average versus CPU count.
    uptime
    nproc
  3. Get a thread-level breakdown for the suspect PID.
    top -H -p <PID>
  4. Summarize the system calls the process makes over a 10-second sample.
    sudo timeout 10 strace -c -f -p <PID>

Resolution

# Renice a runaway process to lower its priority
sudo renice +10 -p <PID>

# Kill if confirmed runaway (SIGTERM first, SIGKILL only if it is still alive)
kill -15 <PID>
sleep 3
kill -0 <PID> 2>/dev/null && kill -9 <PID>

# Investigate if it's a recurring scheduled task
sudo crontab -l
sudo crontab -l -u www-data
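
The "load average exceeds CPU count" symptom can be turned into a yes/no check. A minimal sketch; the function name `load_exceeds_cpus` is hypothetical, and `awk` is used because the shell cannot compare decimal numbers directly.

```shell
# Succeed (exit 0) when the given load average is greater than the CPU count.
# Usage: load_exceeds_cpus <load-average> <cpu-count>
load_exceeds_cpus() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load + 0 > cpus + 0) }'
}
```

On Linux it could be fed with live values: `load_exceeds_cpus "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo "overloaded"`.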
PR-02
Zombie Processes — Defunct Entries Accumulating
Low ps · kill · pstree

Symptoms

ps aux shows processes with state Z (defunct/zombie). If zombie count is large it can exhaust the PID table. Usually indicates a bug in the parent process (not calling wait()).

Diagnosis Steps

  1. Count and list zombie processes.
    ps aux | awk '$8 == "Z" { print }'
    ps aux | grep -c "[d]efunct"
  2. Find the parent process (PPID) that is not reaping children.
    ps -eo pid,ppid,stat,cmd | awk '$3 ~ /Z/'
    pstree -p | grep defunct

Resolution

# You cannot kill a zombie directly — kill the parent process.
# 1. Find parent PID:
ps -o ppid= -p <zombie-PID>

# 2. Send SIGCHLD to parent (asks it to reap children):
kill -SIGCHLD <parent-PID>

# 3. If parent ignores SIGCHLD, restart the parent service:
sudo systemctl restart <service-name>

# 4. Last resort — kill the parent (ensure it is safe to restart):
kill -9 <parent-PID>
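
When many zombies accumulate, grouping them by parent shows which process to fix or restart first. A sketch under the assumption that the input comes from `ps -eo pid,ppid,stat,comm`; the name `zombies_by_parent` is hypothetical.

```shell
# Print "count parent-PID" for zombie processes, busiest parent first.
# Usage: ps -eo pid,ppid,stat,comm | zombies_by_parent
zombies_by_parent() {
    awk 'NR > 1 && $3 ~ /^Z/ { count[$2]++ }
         END { for (ppid in count) print count[ppid], ppid }' | sort -rn
}
```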
PR-03
OOM Kill — Process Terminated by Out-of-Memory Killer
High dmesg · journalctl · free

Symptoms

Service suddenly disappears. No graceful shutdown log. dmesg contains Out of memory: Kill process entries. Often happens at peak load or during memory leaks.

Diagnosis Steps

  1. Confirm OOM kill in kernel messages.
    sudo dmesg | grep -i "oom\|killed\|out of memory" | tail -20
  2. Check systemd journal for the time of death.
    sudo journalctl -k --since "1 hour ago" | grep -i oom
  3. Review current memory usage and which process is the largest.
    free -h
    ps aux --sort=-%mem | head -15

Resolution

# Short-term: restart the killed service
sudo systemctl restart <service-name>

# Add swap space if system has none (emergency measure):
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make permanent:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

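# Tune OOM score to protect a critical process (lost when the process restarts;
# for a systemd service, set OOMScoreAdjust=-1000 under [Service] instead):
echo -1000 | sudo tee /proc/<PID>/oom_score_adj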
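
For the escalation note it is useful to list exactly which processes were killed. The sketch below assumes the kernel's `Killed process <pid> (<name>)` wording, which can differ between kernel versions; the name `oom_victims` is hypothetical.

```shell
# Extract "PID name" for each OOM victim from kernel log lines on stdin.
# Usage: sudo dmesg | oom_victims
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```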
PR-04
Process Won't Start — Binary or Dependency Missing
Medium which · ldd · strace

Symptoms

Executing a binary returns command not found, No such file or directory, or error while loading shared libraries. Service fails to start with exit code 127 or 1.

Diagnosis Steps

  1. Check if the binary exists and is executable.
    which java
    ls -la /usr/bin/java
    file /usr/bin/java
  2. Check shared library dependencies.
    ldd /path/to/binary
  3. Trace the startup to see exactly where it fails.
    sudo strace -e trace=openat /path/to/binary 2>&1 | grep "ENOENT"
  4. Check if a required environment variable is missing.
    printenv | grep -i java
    printenv PATH

Resolution

# Install missing library
sudo apt-get install <package>

# Update library cache after manual install
sudo ldconfig

# Add missing PATH entry (session):
export PATH=$PATH:/opt/custom/bin

# Persistent PATH fix in service unit:
sudo systemctl edit <service-name>
# Add under [Service]:
# Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
PR-05
Too Many Open Files — ulimit Exhaustion
Medium lsof · ulimit · limits.conf

Symptoms

Application logs show Too many open files. New connections refused. Java applications throw java.io.IOException: Too many open files. Commonly affects high-traffic web servers or databases.

Diagnosis Steps

  1. Check the limit of the running process, then the limit of your own shell for comparison.
    grep "open files" /proc/<PID>/limits
    ulimit -n
  2. Count actual open file descriptors for the process.
    ls /proc/<PID>/fd | wc -l
    lsof -p <PID> | wc -l
  3. Find which process has the most open files system-wide.
    lsof | awk '{print $2}' | sort | uniq -c | sort -rn | head -10

Resolution

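# Raise the soft limit for the current shell and anything started from it
# (does not affect already-running processes; cannot exceed the hard limit):
ulimit -n 65535

# Raise the limit of an already-running process (util-linux prlimit):
sudo prlimit --pid <PID> --nofile=65535:65535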

# Permanent system-wide increase in /etc/security/limits.conf:
sudo tee -a /etc/security/limits.conf << EOF
*    soft nofile 65535
*    hard nofile 65535
EOF

# For systemd services, set in unit override:
sudo systemctl edit <service-name>
# Add under [Service]:
# LimitNOFILE=65535
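
To judge how close a process is to its limit, the two numbers from the diagnosis steps can be combined into a percentage. A minimal sketch; the function name `fd_percent` is hypothetical.

```shell
# Print how much of the soft open-files limit is in use, as a whole percentage.
# Usage: fd_percent <open-fd-count> <soft-limit>
fd_percent() {
    awk -v used="$1" -v limit="$2" 'BEGIN {
        if (limit + 0 <= 0) { print "n/a"; exit }   # e.g. "unlimited"
        printf "%d\n", used * 100 / limit
    }'
}
```

The inputs would come from the commands above, for example the count from `ls /proc/<PID>/fd | wc -l` and the soft limit from the fourth column of the `Max open files` line in `/proc/<PID>/limits`.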
NET-01
Service Unreachable — Port Not Listening
High ss · netstat · curl

Symptoms

Client gets Connection refused or times out. Health check fails. The service may have crashed or is bound to the wrong interface/port.

Diagnosis Steps

  1. Check whether anything is listening on the expected port.
    sudo ss -tlnp | grep :<PORT>
    # or
    sudo netstat -tlnp | grep <PORT>
  2. Test connectivity from the server itself.
    curl -v http://localhost:<PORT>
    telnet localhost <PORT>
  3. Check whether the service is running.
    sudo systemctl status <service-name>

Resolution

# Restart the service
sudo systemctl restart <service-name>

# If service binds to 127.0.0.1 instead of 0.0.0.0, edit config:
# e.g. for nginx: listen 0.0.0.0:80;
# e.g. for Spring Boot: server.address=0.0.0.0

# Verify after fix:
ss -tlnp | grep <PORT>
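
Whether a port is bound to loopback only or to a reachable address can be read from the `ss` output. The sketch below assumes the local address is the fourth column of `ss -tln` output; the function name `listen_scope` is hypothetical.

```shell
# Classify how a TCP port is bound, from `ss -tln` output on stdin.
# Usage: ss -tln | listen_scope <PORT>
listen_scope() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            addr = $4
            n = split(addr, parts, ":")
            if (parts[n] != port) next
            found = 1
            # Strip ":<port>" to get the bind address.
            host = substr(addr, 1, length(addr) - length(port) - 1)
            if (host ~ /^127\./ || host == "[::1]") loopback = 1
            else other = 1
        }
        END {
            if (!found)     print "not listening"
            else if (other) print "non-loopback"
            else            print "loopback only"
        }'
}
```

A result of `loopback only` points at the bind-address fix above rather than at the firewall.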
NET-02
DNS Resolution Failure — Cannot Resolve Hostnames
High dig · nslookup · resolv.conf

Symptoms

curl: (6) Could not resolve host. Application cannot connect to external APIs or databases by hostname. ping google.com fails but ping 8.8.8.8 succeeds.

Diagnosis Steps

  1. Test basic resolution.
    dig google.com
    nslookup google.com
    host google.com
  2. Check which DNS servers are configured.
    cat /etc/resolv.conf
    resolvectl status | grep DNS      # systemd-resolve --status on older releases
  3. Test resolution against a known-good resolver directly.
    dig @8.8.8.8 google.com

Resolution

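# If /etc/resolv.conf is empty or wrong, set a DNS server.
# On systemd-resolved hosts this file is usually a managed symlink, so treat
# a manual edit as a temporary fix:
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
echo "nameserver 1.1.1.1" | sudo tee -a /etc/resolv.conf

# Flush systemd-resolved cache (systemd-resolve --flush-caches on older releases):
sudo resolvectl flush-caches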

# Restart the resolver service:
sudo systemctl restart systemd-resolved

# For Docker containers — check the container's DNS:
docker exec -it <container> cat /etc/resolv.conf
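
To test each configured resolver individually, the nameserver entries can be pulled out of the file first. A minimal sketch; the function name `list_nameservers` is hypothetical.

```shell
# Print the nameserver addresses configured in a resolv.conf-style file.
# Usage: list_nameservers [file]     (defaults to /etc/resolv.conf)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Each address can then be queried directly, for example `dig +time=2 +tries=1 @<nameserver> google.com`, to see which resolver is failing.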
NET-03
Network Interface Down — No Connectivity After Reboot
High ip · ifconfig · nmcli

Symptoms

Server is unreachable after reboot. No IP address assigned. ip addr shows interface in DOWN state. Often caused by a misconfigured netplan or interface config.

Diagnosis Steps

  1. List all interfaces and their state.
    ip link show
    ip addr show
  2. Check routing table.
    ip route show
  3. Look for interface-related errors in kernel log.
    sudo dmesg | grep -iE "eth|ens|link|network" | tail -20

Resolution

# Bring interface up manually (temporary):
sudo ip link set eth0 up
sudo dhclient eth0

# For Ubuntu with Netplan — check and apply config:
cat /etc/netplan/*.yaml
sudo netplan try
sudo netplan apply

# For NetworkManager-based systems:
nmcli device status
nmcli device connect eth0
NET-04
Port Binding Conflict — Address Already in Use
Medium ss · fuser · lsof

Symptoms

Service fails to start with Address already in use (EADDRINUSE). Often happens when a previous instance did not shut down cleanly, or two services are configured on the same port.

Diagnosis Steps

  1. Find which process is using the port.
    sudo ss -tlnp | grep :<PORT>
    sudo fuser -n tcp <PORT>
    sudo lsof -i :<PORT>
  2. Identify the process by PID.
    ps -p <PID> -o pid,cmd

Resolution

# Option 1: stop the conflicting process if it is stale (SIGTERM first):
sudo kill <PID>
sudo kill -9 <PID>   # only if it ignores SIGTERM

# Option 2: change the port in the service config and restart:
# e.g. for Spring Boot: server.port=8081

# Option 3: stop a previous instance that systemd still tracks, then start the new one:
sudo systemctl stop <old-service-name>
sudo systemctl start <new-service-name>
NET-05
Firewall Blocking Connection — iptables / ufw Drop
Medium ufw · iptables · tcpdump

Symptoms

Connection times out (not refused — the packet is silently dropped). Service is listening on the port locally but remote clients cannot reach it. ufw status may show the rule is missing.

Diagnosis Steps

  1. Check ufw status and active rules.
    sudo ufw status verbose
  2. Check raw iptables chains.
    sudo iptables -L -n -v | grep -E "DROP|REJECT|<PORT>"
  3. Capture packets to see if they arrive at the interface.
    sudo tcpdump -i any port <PORT> -nn

Resolution

# Allow a port through ufw:
sudo ufw allow <PORT>/tcp
sudo ufw reload

# Allow with source restriction:
sudo ufw allow from 10.0.0.0/8 to any port <PORT>

# iptables: insert ACCEPT rule before a DROP:
sudo iptables -I INPUT -p tcp --dport <PORT> -j ACCEPT

# Verify:
curl -v http://<server-ip>:<PORT>
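
The difference between refused, timed out, and unresolved shows up in curl's exit status (6, 7, and 28 are documented curl exit codes). The mapping to scenarios below is this sketch's own interpretation, and the function name `explain_curl_exit` is hypothetical.

```shell
# Map a curl exit status to the most likely runbook scenario.
# Usage: explain_curl_exit <curl-exit-status>
explain_curl_exit() {
    case "$1" in
        0)  echo "connected: the port is reachable" ;;
        6)  echo "could not resolve host: see NET-02" ;;
        7)  echo "connection failed or refused: see NET-01" ;;
        28) echo "timed out: packets are probably dropped, see NET-05" ;;
        *)  echo "curl exit $1: not covered by this sketch" ;;
    esac
}
```

A typical call would be `curl -s -o /dev/null --connect-timeout 5 http://<server-ip>:<PORT>; explain_curl_exit $?`.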
SVC-01
Systemd Service Failed to Start — ExecStart Error
High systemctl · journalctl

Symptoms

Service shows failed status in systemctl. Application is not running. Dashboard or health endpoint returns 503. Logs may show exit code 1 or 2.

Diagnosis Steps

  1. Check service status and last exit code.
    sudo systemctl status <service-name>
  2. Read the last 100 log lines for the service from the current boot.
    sudo journalctl -u <service-name> -b -n 100 --no-pager
  3. Check the unit file for configuration issues.
    sudo systemctl cat <service-name>
  4. Validate the unit file syntax.
    sudo systemd-analyze verify /etc/systemd/system/<service-name>.service

Resolution

# Fix the unit file or config, then reload the daemon and restart:
sudo systemctl daemon-reload
sudo systemctl restart <service-name>

# Watch live logs during startup to catch the failure point:
sudo journalctl -u <service-name> -f &
sudo systemctl start <service-name>
SVC-02
Service Keeps Restarting — Restart Loop (CrashLoop)
High journalctl · systemctl

Symptoms

systemctl status shows "activating" or rapid start/stop cycles. After too many attempts the service hits systemd's start rate limit (start-limit-hit) and is no longer restarted. Log shows repeated startup followed by immediate failure.

Diagnosis Steps

  1. Check how many times it has been restarted.
    systemctl show <service-name> | grep -i "NRestarts\|ActiveState\|Result"
  2. Read the last 10 minutes of log entries to see the failure reason.
    sudo journalctl -u <service-name> --since "10 minutes ago"
  3. Check the Restart= policy in the unit file.
    systemctl cat <service-name> | grep -i restart

Resolution

# Stop the restart loop to investigate without systemd restarting it:
sudo systemctl stop <service-name>

# Run the binary manually to see the real error:
sudo -u <service-user> /path/to/binary --options

# If it's a config error, fix it, clear the start-limit counter, and start again:
sudo systemctl reset-failed <service-name>
sudo systemctl start <service-name>

# Increase StartLimitIntervalSec/StartLimitBurst (under [Unit]) to allow more restarts
# while debugging (edit unit file via: sudo systemctl edit <service>)
SVC-03
Cron Job Not Running — Silent Failure
Medium crontab · syslog · mail

Symptoms

Scheduled task does not execute. Backup, report, or cleanup job silently skipped. No output or error visible. Cron is running, but the specific job is not firing.

Diagnosis Steps

  1. Verify the cron entry syntax.
    crontab -l
    sudo crontab -l -u <user>
  2. Check cron execution logs.
    grep CRON /var/log/syslog | tail -30
    sudo journalctl -u cron --since "1 hour ago"
  3. Check if cron daemon is running.
    sudo systemctl status cron
  4. Test the exact command manually as the cron user.
    sudo -u <user> /path/to/script.sh

Resolution

# Common fix 1: PATH issue — add PATH at the top of crontab:
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Common fix 2: Redirect output to capture errors:
* * * * * /path/to/script.sh >> /tmp/cron_output.log 2>&1

# Common fix 3: Use absolute paths in the script:
# BAD:  mysql -u root ...
# GOOD: /usr/bin/mysql -u root ...

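# crontab rejects invalid syntax at install time, but "crontab -" REPLACES the
# entire crontab with stdin. Append the new entry to a copy of the current one:
crontab -l > /tmp/crontab.new
echo "30 2 * * * /path/to/script.sh" >> /tmp/crontab.new
crontab /tmp/crontab.new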
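
A quick offline check can catch the most common entry error, a missing schedule field, before the line is installed. This is a rough sketch, not a full cron parser: it does not cover month or day names or `@reboot`-style shortcuts, and the function name `cron_line_ok` is hypothetical.

```shell
# Succeed when a crontab line has five schedule fields made of digits,
# "*", "/", "," and "-", followed by at least one command word.
# Usage: cron_line_ok '30 2 * * * /path/to/script.sh'
cron_line_ok() {
    printf '%s\n' "$1" | awk '{
        if (NF < 6) exit 1
        for (i = 1; i <= 5; i++)
            if ($i !~ /^[0-9*\/,-]+$/) exit 1
        exit 0
    }'
}
```

Quote the argument so the shell does not expand the `*` characters.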
SVC-04
Failed Service Dependency — Unit Waiting on Another
Medium systemctl · After= · Requires=

Symptoms

Service starts and then stops immediately, or hangs in the activating state. Another service it depends on (database, network, cache) has not started. The result shows a dependency failure, or start-limit-hit after repeated retries.

Diagnosis Steps

  1. List all failed units.
    systemctl --failed
  2. Check the dependency graph for the service.
    systemctl list-dependencies <service-name>
  3. Check status of each dependency.
    sudo systemctl status mysql
    sudo systemctl status network-online.target

Resolution

# Start the dependency first, then the service:
sudo systemctl start mysql
sudo systemctl start <app-service>

# If dependency ordering is wrong in the unit file, fix it:
sudo systemctl edit <service-name>
# Add under [Unit]:
# After=mysql.service network-online.target
# Wants=mysql.service

sudo systemctl daemon-reload
sudo systemctl restart <service-name>
SVC-05
Log Flooding — journald Consuming All Disk Space
Medium journalctl · journald.conf

Symptoms

/var/log/journal or /run/log/journal is consuming gigabytes. A chatty service is emitting thousands of log entries per second. Disk fills up causing service crashes.

Diagnosis Steps

  1. Check total journal size.
    sudo journalctl --disk-usage
  2. Find the noisiest service.
    sudo journalctl --since "1 hour ago" | awk '{print $5}' | sort | uniq -c | sort -rn | head -10
  3. View the high-volume service log to understand why.
    sudo journalctl -u <noisy-service> -n 50

Resolution

# Immediate cleanup — keep only last 7 days:
sudo journalctl --vacuum-time=7d

# Or limit by size (keep last 500 MB):
sudo journalctl --vacuum-size=500M

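# Permanent cap in /etc/systemd/journald.conf
# (MaxRetentionSec sets how long entries are kept; MaxFileSec only sets the rotation interval):
sudo tee -a /etc/systemd/journald.conf << EOF
SystemMaxUse=500M
MaxRetentionSec=7day
RateLimitIntervalSec=30s
RateLimitBurst=1000
EOF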

sudo systemctl restart systemd-journald

Disk & Filesystem

Task                                    Command
Disk usage summary                      df -hT
Top 10 directories by size              du -hx / 2>/dev/null | sort -rh | head -10
Find files larger than 1 GB             find / -xdev -type f -size +1G -ls 2>/dev/null
Check inode usage                       df -ih
Deleted files held open by processes    sudo lsof +L1
Vacuum journal logs older than 7 days   sudo journalctl --vacuum-time=7d

Processes

Task                                    Command
Top 10 processes by CPU                 ps aux --sort=-%cpu | head -11
Top 10 processes by memory              ps aux --sort=-%mem | head -11
Count zombie processes                  ps aux | awk '$8=="Z"' | wc -l
Kill all processes matching a pattern   pkill -f <pattern>
Watch CPU/mem of one process            watch -n1 "ps -p <PID> -o pid,pcpu,pmem,cmd"
Check OOM kills in last hour            sudo journalctl -k --since "1 hour ago" | grep -i "killed process"
Open file descriptors for PID           ls /proc/<PID>/fd | wc -l

Networking

Task                                    Command
Check all listening ports               ss -tlnp
Check established connections           ss -tnp state established
Find process using a port               sudo fuser -n tcp <PORT>
Test port reachability                  curl -sv telnet://<host>:<port>
Trace route to host                     traceroute -n <host>
Capture packets on a port               sudo tcpdump -i any port <PORT> -nn -c 50
DNS lookup with timing                  dig <hostname> | grep -E "time|ANSWER"
Check open ports on remote host         nmap -sT -p 1-65535 <host>

Services & Logs

Task                                    Command
List all failed services                systemctl --failed
Live log stream for a service           sudo journalctl -u <service> -f
Logs since last boot                    sudo journalctl -b -u <service>
Last 100 lines of any log file          tail -n 100 /var/log/<app>.log
Grep log for ERROR lines                grep -i "error\|exception\|fail" /var/log/<app>.log | tail -30
Check cron execution history            grep CRON /var/log/syslog | tail -20
Find which service owns a port          sudo ss -tlnp | grep :<PORT>
Journal disk usage                      sudo journalctl --disk-usage
Wrong File Permissions — Too Restrictive or Too Permissive

Setting chmod 777 on everything "to fix it" is a security risk. Setting 600 on a directory removes the execute (traverse) bit, so even the owner cannot enter it or reach the files inside.

✓ Fix: Use 644 for files (owner rw, group/other r), 755 for directories and executables. Use 640 for config files with secrets. Always check parent directory execute bits — you need x on every directory in the path.
find /var/www -type f -exec chmod 644 {} \;
find /var/www -type d -exec chmod 755 {} \;
Missing or Wrong Environment Variables

Application works when run manually but fails under systemd because the service environment does not inherit the user's shell environment (JAVA_HOME, DATABASE_URL, etc.).

✓ Fix: Set environment variables explicitly in the systemd unit file.
# /etc/systemd/system/myapp.service [Service] section:
Environment="JAVA_HOME=/usr/lib/jvm/java-17"
Environment="DATABASE_URL=jdbc:mysql://localhost:3306/mydb"
# Or load from a file:
EnvironmentFile=/etc/myapp/env.conf
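
Before restarting the service, it can help to confirm that the environment file actually defines every variable the application needs. A minimal sketch, assuming plain `KEY=value` lines; the function name `missing_env_keys` is hypothetical.

```shell
# Print each required variable name that has no KEY= line in the env file.
# Usage: missing_env_keys <env-file> <KEY>...
missing_env_keys() {
    envfile=$1
    shift
    for key in "$@"; do
        grep -q "^${key}=" "$envfile" || echo "$key"
    done
}
```

For example, `missing_env_keys /etc/myapp/env.conf JAVA_HOME DATABASE_URL` prints nothing when both variables are defined.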
Port Binding Conflicts — Service Configured on Wrong or Taken Port

Deploying a new service on port 8080 without checking whether another process already owns that port. Also common: accidentally starting two instances of the same service.

✓ Fix: Always check before configuring a port.
ss -tlnp | grep :8080
# If occupied, pick a free port or stop the conflicting service.
Cron Jobs Without Absolute Paths

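Cron runs with a minimal PATH (typically /usr/bin:/bin). Commands that live elsewhere, such as tools under /usr/local/bin or a node binary installed by a version manager, work in an interactive shell but fail under cron, and the error goes unnoticed unless output is captured.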

✓ Fix: Use full paths in cron scripts and redirect output.
# In crontab, set PATH explicitly:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# In the script, use absolute paths:
/usr/bin/python3 /opt/scripts/backup.py >> /var/log/backup.log 2>&1
Forgetting to Run systemctl daemon-reload After Editing Unit Files

Editing /etc/systemd/system/myapp.service and then restarting the service — systemd still loads the old cached unit. Changes appear to have no effect.

✓ Fix: Always reload after editing.
sudo systemctl daemon-reload
sudo systemctl restart myapp
Wrong File Owner on Config or Data Directory

Creating directories as root that the service needs to write to as www-data or appuser. Service starts but fails on first write.

✓ Fix: Check the user the service runs as and ensure ownership matches.
# Who does the service run as?
systemctl show myapp -p User

# Fix ownership:
sudo chown -R appuser:appuser /var/lib/myapp
Firewall Rules Not Persisted After Reboot

Adding iptables rules directly without saving them. Rules disappear on reboot causing mystery outages.

✓ Fix: Use ufw which persists rules automatically, or save iptables rules.
# Preferred (ufw handles persistence):
sudo ufw allow 443/tcp

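# For raw iptables: save the rules (sudo must cover the file write, hence tee).
# Restoring them at boot needs a helper such as iptables-persistent on Debian/Ubuntu:
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null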
Service Bound to 127.0.0.1 — Not Accessible Remotely

Developers test locally with localhost bindings then deploy without changing the bind address. Port is open, firewall is clear, but remote connections never arrive at the application.

✓ Fix: Confirm the service binds to 0.0.0.0 for all interfaces or the specific interface IP.
# Check what address the service is bound to:
ss -tlnp | grep <PORT>

# Spring Boot: application.properties
server.address=0.0.0.0

# Node.js: app.listen(PORT, '0.0.0.0', ...)

# nginx: listen 0.0.0.0:80;