Linux Troubleshooting Runbook

20

Scenarios

4

Filesystem

Disk full, permission errors, inode exhaustion, mount issues

⚙️

Processes

High CPU, zombies, OOM kills, open file limits

🌐

Networking

Ports unreachable, DNS failures, firewall blocks, interface down

🔧

Services & Cron

Systemd failures, restart loops, cron not firing

⚡

Bash One-liners

Quick triage commands reusable by L1 staff

⚠️

Common Mistakes

Misconfigs that cause 80% of first-response tickets

How to use this runbook

① Use the sidebar or search bar to locate a scenario by symptom or keyword.
② Click a scenario card to expand diagnosis steps, commands, and resolution.
③ Run commands on the affected host in the order shown.
④ If unresolved after L1 steps, escalate with the output collected.

FS-01

Disk Space Alert — Partition at 90%+ Usage

High df · du · find

Symptoms

Application writes fail with No space left on device. Monitoring alert fires for partition >90% full. Log rotation stops.

Diagnosis Steps

Check overall disk usage per partition.
```
df -hT
```
Identify the 20 largest files and directories.
```
du -ahx / 2>/dev/null | sort -rh | head -20
```

Find large files (over 500 MB) across the filesystem.

find / -xdev -type f -size +500M -exec ls -lh {} \; 2>/dev/null

Check for deleted-but-open files still holding space (common with log handles).
```
lsof +L1 | grep -i deleted
```
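
For repeated triage, the per-partition check can be reduced to a filter. The sketch below is a hypothetical helper (the name `partitions_over` is an assumption, not part of any tool) that reads `df -P` output on standard input and prints mount points at or above a usage threshold. It assumes mount points contain no spaces.

```shell
# Hypothetical helper: list mount points whose usage is at or above a threshold.
# Reads POSIX-format `df -P` output on stdin, so it can be tried on saved output.
partitions_over() {
    threshold="${1:-90}"
    awk -v t="$threshold" '
        NR > 1 {                      # skip the header row
            use = $5
            sub(/%/, "", use)         # "95%" -> "95"
            if (use + 0 >= t) print $6, $5
        }'
}

# Usage on a live host:
#   df -P | partitions_over 90
```

An empty result means no partition has reached the threshold.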

Resolution

Archive or delete identified large files. If log files are open by processes, restart the process so the file handle is released. For recurring issues, implement logrotate policies.

# Clear old journal logs safely (keep last 7 days)
sudo journalctl --vacuum-time=7d

# Remove old compressed logs
sudo find /var/log -name "*.gz" -mtime +30 -delete

# Restart service to release deleted-file handles
sudo systemctl restart <service-name>

FS-02

Permission Denied — User Cannot Access File or Directory

Medium chmod · chown · stat

Symptoms

User gets Permission denied when reading, writing, or executing a file. Service fails to read its config or write to its data directory.

Diagnosis Steps

Check the file's permissions and ownership.
```
stat /path/to/file
ls -la /path/to/file
```
Check what user/group the process or shell is running as.
```
whoami
id
ps aux | grep <process-name>
```
Verify parent directory permissions (execute bit needed to traverse).
```
ls -la /path/to/
```

Check SELinux or AppArmor denials if standard permissions look correct.

sudo ausearch -m avc -ts recent   # SELinux
sudo dmesg | grep apparmor        # AppArmor
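
Because every directory in the path needs the execute bit, it can help to see the whole chain at once. The following is a minimal sketch, assuming GNU coreutils (`stat -c`, `realpath`); the function name `path_perms` is hypothetical.

```shell
# Hypothetical helper: print mode, owner, and name for a path and each parent directory.
# Assumes GNU coreutils (stat -c, realpath).
path_perms() {
    p=$(realpath -- "$1") || return 1   # resolve to an absolute path so the loop ends at /
    while :; do
        stat -c '%A %U:%G %n' "$p"
        [ "$p" = "/" ] && break
        p=$(dirname "$p")
    done
}

# Usage:
#   path_perms /path/to/file
```

Look for a directory line missing `x` for the user or group in question. `namei -l /path/to/file` from util-linux gives a similar view where it is installed.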

Resolution

# Fix ownership
sudo chown user:group /path/to/file

# Fix permissions (files: 644, dirs: 755 as baseline)
sudo chmod 644 /path/to/file
sudo chmod 755 /path/to/dir

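# Recursively fix a directory tree (directories 755, files 644)
sudo chown -R www-data:www-data /var/www/html
sudo find /var/www/html -type d -exec chmod 755 {} +
sudo find /var/www/html -type f -exec chmod 644 {} +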

FS-03

Inode Exhaustion — No Space Left Despite Free Disk

High df -i · find

Symptoms

No space left on device errors even though df -h shows ample free space. File creation fails. Often caused by millions of small temp files or mail queue entries.

Diagnosis Steps

Confirm inode exhaustion (look for 100% under IUse%).
```
df -ih
```

Find the directory with the most files.

find / -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -20

Count files in suspected directories.

ls /tmp | wc -l
ls /var/spool/postfix/maildrop | wc -l
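
On directories with millions of entries, `ls` can be slow because it sorts names, and it skips dotfiles by default. A possible alternative, assuming GNU find (`-printf`), is sketched below; `count_entries` is a hypothetical name.

```shell
# Hypothetical helper: count directory entries (including dotfiles) without sorting.
# Assumes GNU find. Emits one "." per entry and counts the bytes.
count_entries() {
    find "$1" -xdev -mindepth 1 -maxdepth 1 -printf '.' 2>/dev/null | wc -c
}

# Usage:
#   count_entries /tmp
#   count_entries /var/spool/postfix/maildrop
```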

Resolution

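# Remove regular temp files last modified at least 2 days ago (-mtime +1 counts whole days)
sudo find /tmp -xdev -type f -mtime +1 -delete

# Delete deferred mail if postfix is the culprit (use `postsuper -d ALL` to empty every queue)
sudo postsuper -d ALL deferred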

# For PHP session file buildup
sudo find /var/lib/php/sessions -type f -mtime +1 -delete

FS-04

Read-only Filesystem — Writes Failing After Crash

High dmesg · fsck · mount

Symptoms

All write operations fail with Read-only file system. Typically happens after an unclean shutdown or I/O errors that caused the kernel to remount read-only for data safety.

Diagnosis Steps

Confirm the filesystem is mounted read-only.
```
# Match "ro" as a whole mount option (a bare grep "ro," also hits errors=remount-ro)
awk '$4 ~ /(^|,)ro(,|$)/' /proc/mounts
```

Check kernel I/O error messages.

sudo dmesg | grep -iE "error|I/O|remount|read-only" | tail -30

Check the ext2/3/4 superblock for filesystem state, mount count, and last check time.

sudo tune2fs -l /dev/sda1 | grep -iE "state|mount|check"

Resolution

# Attempt live remount read-write (only if no hardware errors)
sudo mount -o remount,rw /

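# If there are filesystem errors, run fsck on the unmounted filesystem (for / this requires a reboot).
# Schedule fsck at next boot. /forcefsck is a sysvinit-era flag file; systemd hosts
# may ignore it and need fsck.mode=force on the kernel command line instead:
sudo touch /forcefsck
sudo reboot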

# Or force fsck on a specific device (unmounted):
sudo fsck -y /dev/sda1

FS-05

Mount Point Failure — Filesystem Not Mounted at Boot

Medium fstab · mount · blkid

Symptoms

Expected directory is empty after reboot. Application cannot find its data. Service fails to start because its data volume is not mounted. /etc/fstab entry may be misconfigured.

Diagnosis Steps

Check currently mounted filesystems.
```
mount | column -t
lsblk -f
```
Verify the device UUID matches the fstab entry.
```
sudo blkid
cat /etc/fstab
```

Check systemd mount unit failures.

systemctl --failed
sudo journalctl -u "*.mount" --since "1 hour ago"
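
Comparing UUIDs by eye is error-prone, so the check can be scripted. This is a sketch under stated assumptions: `missing_fstab_uuids` is a hypothetical helper, it only handles unquoted `UUID=` entries, and the list of known UUIDs is produced separately with `blkid -s UUID -o value`.

```shell
# Hypothetical helper: print UUIDs referenced in an fstab file that are absent
# from a file of known UUIDs (one per line).
missing_fstab_uuids() {
    fstab="$1"
    known="$2"
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$fstab" |
    while read -r uuid; do
        # -x: whole-line match, -F: fixed string, so UUIDs are compared exactly
        grep -qxF -- "$uuid" "$known" || echo "$uuid"
    done
}

# Usage on a live host:
#   sudo blkid -s UUID -o value > /tmp/known-uuids
#   missing_fstab_uuids /etc/fstab /tmp/known-uuids
```

Any UUID printed is referenced in fstab but not present on an attached block device.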

Resolution

# Dry run first: --fake does everything except the actual mount call
sudo mount -a --fake --verbose
# Then mount all filesystems listed in fstab
sudo mount -a --verbose

# Fix a UUID mismatch in /etc/fstab
# 1. Get correct UUID:
sudo blkid /dev/sdb1
# 2. Edit fstab to update UUID:
sudo nano /etc/fstab
# UUID=<correct-uuid>  /data  ext4  defaults  0  2

# Test fstab syntax without rebooting:
sudo findmnt --verify

PR-01

High CPU — System Load Spike, Identifying the Offender

High top · ps · strace

Symptoms

Server load average exceeds CPU count. Applications become sluggish. Monitoring alert fires on CPU >85% sustained. SSH response is slow.

Diagnosis Steps

Get an instant top-process snapshot sorted by CPU.

top -b -n1 | head -20
# or
ps aux --sort=-%cpu | head -15

Check load average versus CPU count.
```
uptime
nproc
```
Get a thread-level breakdown for the suspect PID.
```
top -H -p <PID>
```
Summarize the system calls the process makes over a 10-second sample.
```
sudo timeout 10 strace -c -f -p <PID>
```
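
The load-versus-CPU comparison is easier to read as a single ratio. A minimal sketch follows; `load_per_cpu` is a hypothetical helper that takes the load average and the CPU count as arguments.

```shell
# Hypothetical helper: normalise a load average by the number of CPUs.
# A result above 1.00 means more runnable (or I/O-blocked) tasks than CPUs on average.
load_per_cpu() {
    awk -v l="$1" -v c="$2" 'BEGIN { printf "%.2f\n", l / c }'
}

# Usage on a live host (1-minute load average from /proc/loadavg):
#   load_per_cpu "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

On Linux the load average also counts tasks in uninterruptible I/O wait, so a high ratio with idle CPUs points at storage rather than a runaway process.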

Resolution

# Renice a runaway process to lower its priority
sudo renice +10 -p <PID>

# Kill if confirmed runaway (SIGTERM first, SIGKILL only if it is still alive)
kill -15 <PID>
sleep 3
kill -0 <PID> 2>/dev/null && kill -9 <PID>

# Investigate if it's a recurring scheduled task
sudo crontab -l
sudo crontab -l -u www-data

PR-02

Zombie Processes — Defunct Entries Accumulating

Low ps · kill · pstree

Symptoms

ps aux shows processes with state Z (defunct/zombie). If zombie count is large it can exhaust the PID table. Usually indicates a bug in the parent process (not calling wait()).

Diagnosis Steps

Count and list zombie processes.

ps aux | awk '$8 ~ /^Z/ { print }'
ps -eo stat= | grep -c '^Z'

Find the parent process (PPID) that is not reaping children.

ps -eo pid,ppid,stat,cmd | awk '$3 ~ /Z/'
pstree -p | grep defunct
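
When many zombies share a few parents, grouping them by PPID shows which parent to investigate first. The sketch below assumes input in the column order `pid,ppid,stat,comm` with a header row; `zombie_parents` is a hypothetical name.

```shell
# Hypothetical helper: count zombies per parent PID, busiest parent first.
# Expects `ps -eo pid,ppid,stat,comm` output (with header) on stdin.
zombie_parents() {
    awk 'NR > 1 && $3 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' |
    sort -k2,2nr
}

# Usage on a live host:
#   ps -eo pid,ppid,stat,comm | zombie_parents
```

Each output line is `<parent-PID> <zombie-count>`.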

Resolution

# You cannot kill a zombie directly — kill the parent process.
# 1. Find parent PID:
ps -o ppid= -p <zombie-PID>

# 2. Send SIGCHLD to parent (asks it to reap children):
kill -SIGCHLD <parent-PID>

# 3. If parent ignores SIGCHLD, restart the parent service:
sudo systemctl restart <service-name>

# 4. Last resort — kill the parent (ensure it is safe to restart):
kill -9 <parent-PID>

PR-03

OOM Kill — Process Terminated by Out-of-Memory Killer

High dmesg · journalctl · free

Symptoms

Service suddenly disappears. No graceful shutdown log. dmesg contains Out of memory: Kill process entries. Often happens at peak load or during memory leaks.

Diagnosis Steps

Confirm OOM kill in kernel messages.

sudo dmesg | grep -i "oom\|killed\|out of memory" | tail -20

Check systemd journal for the time of death.

sudo journalctl -k --since "1 hour ago" | grep -i oom

Review current memory usage and which process is the largest.
```
free -h
ps aux --sort=-%mem | head -15
```
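
To attach a short summary to an escalation, the victim PID and name can be pulled out of the kernel messages. This sketch assumes the kernel logs the victim as `Killed process <pid> (<name>)`; the exact wording varies between kernel versions, and `oom_victims` is a hypothetical helper.

```shell
# Hypothetical helper: extract "<pid> <name>" from OOM-killer log lines on stdin.
# Assumes the "Killed process <pid> (<name>)" wording; other lines are ignored.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage on a live host:
#   sudo dmesg | oom_victims
```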

Resolution

# Short-term: restart the killed service
sudo systemctl restart <service-name>

# Add swap space if system has none (emergency measure):
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make permanent:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Tune OOM score to protect critical processes:
echo -1000 | sudo tee /proc/<PID>/oom_score_adj

PR-04

Process Won't Start — Binary or Dependency Missing

Medium which · ldd · strace

Symptoms

Executing a binary returns command not found, No such file or directory, or error while loading shared libraries. Service fails to start with exit code 127 or 1.

Diagnosis Steps

Check if the binary exists and is executable.

which java
ls -la /usr/bin/java
file /usr/bin/java

Check shared library dependencies.
```
ldd /path/to/binary
```

Trace the startup to see exactly where it fails.

sudo strace -e trace=openat /path/to/binary 2>&1 | grep "ENOENT"

Check if a required environment variable is missing.
```
printenv | grep -i java
printenv PATH
```
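
On binaries with long dependency lists, the unresolved libraries are easy to miss in raw `ldd` output. A small filter, sketched here with the hypothetical name `missing_libs`, prints only the library names that `ldd` reports as `not found`.

```shell
# Hypothetical helper: print shared libraries that ldd could not resolve.
# Reads `ldd` output on stdin.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage:
#   ldd /path/to/binary | missing_libs
```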

Resolution

# Install missing library
sudo apt-get install <package>

# Update library cache after manual install
sudo ldconfig

# Add missing PATH entry (session):
export PATH=$PATH:/opt/custom/bin

# Persistent PATH fix in service unit:
sudo systemctl edit <service-name>
# Add under [Service]:
# Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"

PR-05

Too Many Open Files — ulimit Exhaustion

Medium lsof · ulimit · limits.conf

Symptoms

Application logs show Too many open files. New connections refused. Java applications throw java.io.IOException: Too many open files. Commonly affects high-traffic web servers or databases.

Diagnosis Steps

Check current limits for the running process.

cat /proc/<PID>/limits | grep "open files"
ulimit -n

Count actual open file descriptors for the process.
```
ls /proc/<PID>/fd | wc -l
lsof -p <PID> | wc -l
```

Find which process has the most open files system-wide.

lsof | awk '{print $2}' | sort | uniq -c | sort -rn | head -10
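
The limit and the current descriptor count are most useful side by side. The following sketch reads both from `/proc`; `fd_usage` is a hypothetical helper, and inspecting another user's process generally requires root.

```shell
# Hypothetical helper: print "<open-fds> <soft-limit>" for a PID using /proc.
fd_usage() {
    pid="$1"
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # In /proc/<PID>/limits the 4th field of the "Max open files" row is the soft limit
    soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits" 2>/dev/null)
    echo "$((open)) $soft"
}

# Usage:
#   sudo sh -c '. ./fd_usage.sh; fd_usage <PID>'   # fd_usage.sh is a hypothetical file holding the function
```

A first number close to the second indicates the process is about to hit its limit.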

Resolution

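# Raise the limit for the current shell and anything started from it:
ulimit -n 65535

# Raise the limit of an already-running process without restarting it:
sudo prlimit --pid <PID> --nofile=65535:65535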

# Permanent system-wide increase in /etc/security/limits.conf:
sudo tee -a /etc/security/limits.conf << EOF
*    soft nofile 65535
*    hard nofile 65535
EOF

# For systemd services, set in unit override:
sudo systemctl edit <service-name>
# Add under [Service]:
# LimitNOFILE=65535

NET-01

Service Unreachable — Port Not Listening

High ss · netstat · curl

Symptoms

Client gets Connection refused or times out. Health check fails. The service may have crashed or is bound to the wrong interface/port.

Diagnosis Steps

Check whether anything is listening on the expected port.

ss -tlnp | grep <PORT>
# or
sudo netstat -tlnp | grep <PORT>

Test connectivity from the server itself.

curl -v http://localhost:<PORT>
telnet localhost <PORT>

Check whether the service is running.
```
sudo systemctl status <service-name>
```
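
A plain `grep <PORT>` also matches longer port numbers (searching for 80 hits 8080) and does not make the bind address obvious. One possible refinement, sketched with the hypothetical helper `bind_addresses`, matches the port exactly and prints the local address it is bound to.

```shell
# Hypothetical helper: print local addresses listening on exactly the given port.
# Reads `ss -tln` output (with header) on stdin; the 4th column is Local Address:Port.
bind_addresses() {
    awk -v p="$1" 'NR > 1 {
        n = split($4, a, ":")         # the port is the last colon-separated piece
        if (a[n] == p) print $4
    }'
}

# Usage on a live host:
#   ss -tln | bind_addresses 8080
```

A result of `127.0.0.1:<PORT>` means the service only accepts local connections; `0.0.0.0:<PORT>` or `[::]:<PORT>` means it listens on all interfaces.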

Resolution

# Restart the service
sudo systemctl restart <service-name>

# If service binds to 127.0.0.1 instead of 0.0.0.0, edit config:
# e.g. for nginx: listen 0.0.0.0:80;
# e.g. for Spring Boot: server.address=0.0.0.0

# Verify after fix:
ss -tlnp | grep <PORT>

NET-02

DNS Resolution Failure — Cannot Resolve Hostnames

High dig · nslookup · resolv.conf

Symptoms

curl: (6) Could not resolve host. Application cannot connect to external APIs or databases by hostname. ping google.com fails but ping 8.8.8.8 succeeds.

Diagnosis Steps

Test basic resolution.

dig google.com
nslookup google.com
host google.com

Check which DNS servers are configured.

cat /etc/resolv.conf
resolvectl status | grep -i "DNS Server"   # older systemd: systemd-resolve --status

Test resolution against a known-good resolver directly.
```
dig @8.8.8.8 google.com
```
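
To record which resolvers the host is configured to use, the `nameserver` lines can be extracted on their own. This is a minimal sketch; `list_nameservers` is a hypothetical helper that defaults to `/etc/resolv.conf`.

```shell
# Hypothetical helper: print the nameserver addresses from a resolv.conf-style file.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Usage:
#   list_nameservers
#   for ns in $(list_nameservers); do dig "@$ns" google.com +short; done
```

An address of `127.0.0.53` is the systemd-resolved stub listener, in which case the upstream servers are the ones reported by `resolvectl status`.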

Resolution

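# If /etc/resolv.conf is empty or wrong, set a DNS server.
# Temporary fix: on systemd-resolved or NetworkManager hosts this file is managed and may be overwritten.
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
echo "nameserver 1.1.1.1" | sudo tee -a /etc/resolv.conf

# Flush systemd-resolved cache (older systemd: systemd-resolve --flush-caches):
sudo resolvectl flush-caches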

# Restart the resolver service:
sudo systemctl restart systemd-resolved

# For Docker containers — check the container's DNS:
docker exec -it <container> cat /etc/resolv.conf

NET-03

Network Interface Down — No Connectivity After Reboot

High ip · ifconfig · nmcli

Symptoms

Server is unreachable after reboot. No IP address assigned. ip addr shows interface in DOWN state. Often caused by a misconfigured netplan or interface config.

Diagnosis Steps

List all interfaces and their state.
```
ip link show
ip addr show
```
Check routing table.
```
ip route show
```

Look for interface-related errors in kernel log.

sudo dmesg | grep -iE "eth|ens|link|network" | tail -20

Resolution

# Bring interface up manually (temporary):
sudo ip link set eth0 up
sudo dhclient eth0

# For Ubuntu with Netplan — check and apply config:
cat /etc/netplan/*.yaml
sudo netplan try
sudo netplan apply

# For NetworkManager-based systems:
nmcli device status
nmcli device connect eth0

NET-04

Port Binding Conflict — Address Already in Use

Medium ss · fuser · lsof

Symptoms

Service fails to start with Address already in use (EADDRINUSE). Often happens when a previous instance did not shut down cleanly, or two services are configured on the same port.

Diagnosis Steps

Find which process is using the port.

sudo ss -tlnp | grep :<PORT>
sudo fuser -n tcp <PORT>
sudo lsof -i :<PORT>

Identify the process by PID.
```
ps -p <PID> -o pid,cmd
```

Resolution

# Option 1 — Kill the conflicting process (if stale):
sudo kill <PID>        # SIGTERM first
sudo kill -9 <PID>     # only if it has not exited

# Option 2 — Change the port in the service config and restart:
# e.g. for Spring Boot: server.port=8081

# Option 3 — Check if a previous instance is still tracked by systemd:
sudo systemctl stop <old-service-name>
sudo systemctl start <new-service-name>

NET-05

Firewall Blocking Connection — iptables / ufw Drop

Medium ufw · iptables · tcpdump

Symptoms

Connection times out (not refused — the packet is silently dropped). Service is listening on the port locally but remote clients cannot reach it. ufw status may show the rule is missing.

Diagnosis Steps

Check ufw status and active rules.
```
sudo ufw status verbose
```

Check raw iptables chains.

sudo iptables -L -n -v | grep -E "DROP|REJECT|<PORT>"

Capture packets to see if they arrive at the interface.
```
sudo tcpdump -i any port <PORT> -nn
```
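
The refused-versus-timeout distinction in the symptoms can be read from curl's exit status. The sketch below maps the documented curl exit codes 6, 7, and 28 to short labels; `classify_curl_exit` and the label strings are hypothetical.

```shell
# Hypothetical helper: turn a curl exit status into a triage label.
classify_curl_exit() {
    case "$1" in
        0)  echo "connected" ;;
        6)  echo "dns-failure" ;;              # could not resolve host
        7)  echo "refused-or-unreachable" ;;   # failed to connect (e.g. nothing listening)
        28) echo "timeout" ;;                  # typical of a silent firewall drop
        *)  echo "other" ;;
    esac
}

# Usage from a remote client:
#   curl -s -o /dev/null --connect-timeout 5 "http://<server-ip>:<PORT>"
#   classify_curl_exit $?
```

A `timeout` result from a remote client combined with a successful local `curl` on the server is consistent with a firewall drop.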

Resolution

# Allow a port through ufw:
sudo ufw allow <PORT>/tcp
sudo ufw reload

# Allow with source restriction:
sudo ufw allow from 10.0.0.0/8 to any port <PORT>

# iptables: insert ACCEPT rule before a DROP:
sudo iptables -I INPUT -p tcp --dport <PORT> -j ACCEPT

# Verify:
curl -v http://<server-ip>:<PORT>

SVC-01

Systemd Service Failed to Start — ExecStart Error

High systemctl · journalctl

Symptoms

Service shows failed status in systemctl. Application is not running. Dashboard or health endpoint returns 503. Logs may show exit code 1 or 2.

Diagnosis Steps

Check service status and last exit code.
```
sudo systemctl status <service-name>
```

Read the last 100 log lines for the service (add -b to limit output to the current boot).

sudo journalctl -u <service-name> -n 100 --no-pager

Check the unit file for configuration issues.
```
sudo systemctl cat <service-name>
```

Validate the unit file syntax.

sudo systemd-analyze verify /etc/systemd/system/<service-name>.service

Resolution

# Fix the unit file or config, then reload the daemon and restart:
sudo systemctl daemon-reload
sudo systemctl restart <service-name>

# Watch live logs during startup to catch the failure point:
sudo journalctl -u <service-name> -f &
sudo systemctl start <service-name>

SVC-02

Service Keeps Restarting — Restart Loop (CrashLoop)

High journalctl · systemctl

Symptoms

systemctl status shows "activating" or rapid start/stop cycles. Service enters systemd's throttle state. Log shows repeated startup then immediate failure.

Diagnosis Steps

Check how many times it has been restarted.

systemctl show <service-name> | grep -i "NRestarts\|ActiveState\|Result"

Read the last 10 minutes of logs to see the failure reason.

sudo journalctl -u <service-name> --since "10 minutes ago"

Check the Restart= policy in the unit file.

systemctl cat <service-name> | grep -i restart

Resolution

# Stop the restart loop to investigate without systemd restarting it:
sudo systemctl stop <service-name>

# Run the binary manually to see the real error:
sudo -u <service-user> /path/to/binary --options

# If it's a config error, fix it and restart:
sudo systemctl start <service-name>

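# Increase StartLimitIntervalSec/StartLimitBurst (in the [Unit] section) to allow
# more restarts while debugging (edit via: sudo systemctl edit <service>).
# After a start-limit-hit, clear the counter before starting again:
sudo systemctl reset-failed <service-name>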

SVC-03

Cron Job Not Running — Silent Failure

Medium crontab · syslog · mail

Symptoms

Scheduled task does not execute. Backup, report, or cleanup job silently skipped. No output or error visible. Cron is running, but the specific job is not firing.

Diagnosis Steps

Verify the cron entry syntax.
```
crontab -l
sudo crontab -l -u <user>
```

Check cron execution logs.

grep CRON /var/log/syslog | tail -30          # RHEL-family: /var/log/cron
sudo journalctl -u cron --since "1 hour ago"  # RHEL-family: -u crond

Check if cron daemon is running.
```
sudo systemctl status cron
```
Test the exact command manually as the cron user.
```
sudo -u <user> /path/to/script.sh
```
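
Running the script from an interactive shell still inherits that shell's environment, which can hide PATH problems. A closer approximation of cron, sketched below, starts the command with an emptied environment and the minimal PATH `/usr/bin:/bin`. The helper name `run_like_cron` is hypothetical, and real cron sets a few more variables (such as LOGNAME) that this sketch omits.

```shell
# Hypothetical helper: run a command string with a cron-like minimal environment.
# env -i clears the inherited environment; only SHELL, PATH, and HOME are set.
run_like_cron() {
    env -i SHELL=/bin/sh PATH=/usr/bin:/bin HOME="${HOME:-/}" /bin/sh -c "$1"
}

# Usage:
#   sudo -u <user> sh -c '. ./run_like_cron.sh; run_like_cron /path/to/script.sh'
#   (run_like_cron.sh is a hypothetical file holding the function)
```

If the script works in a normal shell but fails here with `command not found`, the job depends on PATH entries or variables that cron does not provide.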

Resolution

# Common fix 1: PATH issue — add PATH at the top of crontab:
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Common fix 2: Redirect output to capture errors:
* * * * * /path/to/script.sh >> /tmp/cron_output.log 2>&1

# Common fix 3: Use absolute paths in the script:
# BAD:  mysql -u root ...
# GOOD: /usr/bin/mysql -u root ...

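# Validate cron syntax online, or install the entry (crontab rejects malformed lines).
# `crontab -` replaces the whole crontab, so append to the existing entries:
(crontab -l 2>/dev/null; echo "30 2 * * * /path/to/script.sh") | crontab -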

SVC-04

Failed Service Dependency — Unit Waiting on Another

Medium systemctl · After= · Requires=

Symptoms

Service appears to start but then stops or hangs because another service it depends on (database, network, cache) has not started. Exit reason shows start-limit-hit or dependency failure.

Diagnosis Steps

List all failed units.
```
systemctl --failed
```
Check the dependency graph for the service.
```
systemctl list-dependencies <service-name>
```

Check status of each dependency.

sudo systemctl status mysql
sudo systemctl status network-online.target

Resolution

# Start the dependency first, then the service:
sudo systemctl start mysql
sudo systemctl start <app-service>

# If dependency ordering is wrong in the unit file, fix it:
sudo systemctl edit <service-name>
# Add under [Unit]:
# After=mysql.service network-online.target
# Wants=mysql.service network-online.target

sudo systemctl daemon-reload
sudo systemctl restart <service-name>

SVC-05

Log Flooding — journald Consuming All Disk Space

Medium journalctl · journald.conf

Symptoms

/var/log/journal or /run/log/journal is consuming gigabytes. A chatty service is emitting thousands of log entries per second. Disk fills up causing service crashes.

Diagnosis Steps

Check total journal size.
```
sudo journalctl --disk-usage
```

Find the noisiest service.

sudo journalctl --since "1 hour ago" | awk '{print $5}' | sort | uniq -c | sort -rn | head -10

View the high-volume service log to understand why.
```
sudo journalctl -u <noisy-service> -n 50
```

Resolution

# Immediate cleanup — keep only last 7 days:
sudo journalctl --vacuum-time=7d

# Or limit by size (keep last 500 MB):
sudo journalctl --vacuum-size=500M

# Permanent cap in /etc/systemd/journald.conf:
sudo tee -a /etc/systemd/journald.conf << EOF
SystemMaxUse=500M
MaxRetentionSec=7day
RateLimitIntervalSec=30s
RateLimitBurst=1000
EOF

sudo systemctl restart systemd-journald

Disk & Filesystem

Task	Command
Disk usage summary	`df -hT`
Top 10 directories by size	`du -hx / 2>/dev/null \| sort -rh \| head -10`
Find files larger than 1 GB	`find / -xdev -type f -size +1G -ls 2>/dev/null`
Check inode usage	`df -ih`
Deleted files held open by processes	`sudo lsof +L1`
Vacuum journal logs older than 7 days	`sudo journalctl --vacuum-time=7d`

Processes

Task	Command
Top 10 processes by CPU	`ps aux --sort=-%cpu \| head -11`
Top 10 processes by memory	`ps aux --sort=-%mem \| head -11`
Count zombie processes	`ps -eo stat= \| grep -c '^Z'`
Kill all processes whose command line matches a pattern	`pkill -f <pattern>`
Watch CPU/mem of one process	`watch -n1 "ps -p <PID> -o pid,pcpu,pmem,cmd"`
Show the 10 most recent OOM kills	`sudo dmesg \| grep -i "killed process" \| tail -10`
Open file descriptors for PID	`ls /proc/<PID>/fd \| wc -l`

Networking

Task	Command
Check all listening ports	`ss -tlnp`
Check established connections	`ss -tnp state established`
Find process using a port	`sudo fuser -n tcp <PORT>`
Test port reachability	`curl -sv telnet://<host>:<port>`
Trace route to host	`traceroute -n <host>`
Capture packets on a port	`sudo tcpdump -i any port <PORT> -nn -c 50`
DNS lookup with timing	`dig <hostname> \| grep -E "time\|ANSWER"`
Check open ports on remote host	`nmap -sT -p 1-65535 <host>`

Services & Logs

Task	Command
List all failed services	`systemctl --failed`
Live log stream for a service	`sudo journalctl -u <service> -f`
Logs since last boot	`sudo journalctl -b -u <service>`
Last 100 lines of any log file	`tail -100 /var/log/<app>.log`
Grep log for ERROR lines	`grep -i "error\\|exception\\|fail" /var/log/<app>.log \| tail -30`
Check cron execution history	`grep CRON /var/log/syslog \| tail -20`
Find which service owns a port	`sudo ss -tlnp \| grep :<PORT>`
Journal disk usage	`sudo journalctl --disk-usage`

Wrong File Permissions — Too Restrictive or Too Permissive

Setting chmod 777 on everything "to fix it" is a security risk. Setting 600 on a directory makes it inaccessible.

✓ Fix: Use 644 for files (owner rw, group/other r), 755 for directories and executables. Use 640 for config files with secrets. Always check parent directory execute bits — you need x on every directory in the path.

find /var/www -type f -exec chmod 644 {} \;
find /var/www -type d -exec chmod 755 {} \;

Missing or Wrong Environment Variables

Application works when run manually but fails under systemd because the service environment does not inherit the user's shell environment (JAVA_HOME, DATABASE_URL, etc.).

✓ Fix: Set environment variables explicitly in the systemd unit file.

# /etc/systemd/system/myapp.service [Service] section:
Environment="JAVA_HOME=/usr/lib/jvm/java-17"
Environment="DATABASE_URL=jdbc:mysql://localhost:3306/mydb"
# Or load from a file:
EnvironmentFile=/etc/myapp/env.conf

Port Binding Conflicts — Service Configured on Wrong or Taken Port

Deploying a new service on port 8080 without checking whether another process already owns that port. Also common: accidentally starting two instances of the same service.

✓ Fix: Always check before configuring a port.

ss -tlnp | grep :8080
# If occupied, pick a free port or stop the conflicting service.

Cron Jobs Without Absolute Paths

Cron runs with a minimal PATH (/usr/bin:/bin). Commands like mysql, python3, or node that are installed outside those directories (for example under /usr/local/bin) work in an interactive shell but fail silently in cron.

✓ Fix: Use full paths in cron scripts and redirect output.

# In crontab, set PATH explicitly:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# In the script, use absolute paths:
/usr/bin/python3 /opt/scripts/backup.py >> /var/log/backup.log 2>&1

Forgetting to Run systemctl daemon-reload After Editing Unit Files

Editing /etc/systemd/system/myapp.service and then restarting the service — systemd still loads the old cached unit. Changes appear to have no effect.

✓ Fix: Always reload after editing.

sudo systemctl daemon-reload
sudo systemctl restart myapp

Wrong File Owner on Config or Data Directory

Creating directories as root that the service needs to write to as www-data or appuser. Service starts but fails on first write.

✓ Fix: Check the user the service runs as and ensure ownership matches.

# Who does the service run as?
systemctl show myapp -p User

# Fix ownership:
sudo chown -R appuser:appuser /var/lib/myapp

Firewall Rules Not Persisted After Reboot

Adding iptables rules directly without saving them. Rules disappear on reboot causing mystery outages.

✓ Fix: Use ufw which persists rules automatically, or save iptables rules.

# Preferred (ufw handles persistence):
sudo ufw allow 443/tcp

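# For raw iptables, save the rules (restored at boot by iptables-persistent on Debian/Ubuntu).
# The redirect must run as root, so pipe through sudo tee:
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null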

Service Bound to 127.0.0.1 — Not Accessible Remotely

Developers test locally with localhost bindings then deploy without changing the bind address. Port is open, firewall is clear, but remote connections never arrive at the application.

✓ Fix: Confirm the service binds to 0.0.0.0 for all interfaces or the specific interface IP.

# Check what address the service is bound to:
ss -tlnp | grep <PORT>

# Spring Boot: application.properties
server.address=0.0.0.0

# Node.js: app.listen(PORT, '0.0.0.0', ...)

# nginx: listen 0.0.0.0:80;