🐧 Linux Troubleshooting Runbook
A searchable L1/L2 support knowledge base covering real-world Linux incidents with diagnosis steps, commands, and resolutions.
Filesystem
Disk full, permission errors, inode exhaustion, mount issues
Processes
High CPU, zombies, OOM kills, open file limits
Networking
Ports unreachable, DNS failures, firewall blocks, interface down
Services & Cron
Systemd failures, restart loops, cron not firing
Bash One-liners
Quick triage commands reusable by L1 staff
Common Mistakes
Misconfigs that cause 80% of first-response tickets
How to use this runbook
- ① Use the sidebar or search bar to locate a scenario by symptom or keyword.
- ② Click a scenario card to expand diagnosis steps, commands, and resolution.
- ③ Run commands on the affected host in the order shown.
- ④ If unresolved after L1 steps, escalate with the output collected.
💾 Filesystem Scenarios
Disk space alerts, permission errors, inode exhaustion, large-file hunts, and mount-point failures.
Symptoms
Application writes fail with No space left on device. Monitoring alert fires for partition >90% full. Log rotation stops.
Diagnosis Steps
- Check overall disk usage per partition.
df -hT
- Identify the 20 largest files and directories.
du -ahx / 2>/dev/null | sort -rh | head -20
- Find large files (over 500 MB) across the filesystem.
find / -xdev -type f -size +500M -exec ls -lh {} \; 2>/dev/null
- Check for deleted-but-open files still holding space (common with log handles).
lsof +L1 | grep -i deleted
Resolution
Archive or delete identified large files. If log files are open by processes, restart the process so the file handle is released. For recurring issues, implement logrotate policies.
# Clear old journal logs safely (keep last 7 days)
sudo journalctl --vacuum-time=7d
# Remove old compressed logs
sudo find /var/log -name "*.gz" -mtime +30 -delete
# Restart service to release deleted-file handles
sudo systemctl restart <service-name>
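For recurring alerts, the partition check can be scripted. The following is a minimal sketch rather than part of the original runbook: `over_threshold` is a hypothetical helper name, and it assumes the POSIX `df -P` column layout and mount points without spaces.

```shell
# Hypothetical helper: print mount points at or above a usage threshold.
# Reads `df -P` output on stdin so it can also be run against captured output.
over_threshold() {
  local limit="${1:-90}"
  awk -v limit="$limit" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                 # strip the trailing percent sign
    if (use + 0 >= limit) print $6, use "%"
  }'
}

# Usage on a live host:
#   df -P | over_threshold 90
```

Reading from stdin keeps the filter independent of the host, so the same function works on `df` output pasted into a ticket.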
Symptoms
User gets Permission denied when reading, writing, or executing a file. Service fails to read its config or write to its data directory.
Diagnosis Steps
- Check the file's permissions and ownership.
stat /path/to/file
ls -la /path/to/file
- Check what user/group the process or shell is running as.
whoami
id
ps aux | grep <process-name>
- Verify parent directory permissions (execute bit needed to traverse).
ls -la /path/to/
- Check SELinux or AppArmor denials if standard permissions look correct.
sudo ausearch -m avc -ts recent   # SELinux
sudo dmesg | grep apparmor        # AppArmor
Resolution
# Fix ownership
sudo chown user:group /path/to/file
# Fix permissions (files: 644, dirs: 755 as baseline)
sudo chmod 644 /path/to/file
sudo chmod 755 /path/to/dir
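# Recursively fix a directory tree (directories 755, files 644;
# a blanket chmod -R 755 would mark every file executable)
sudo chown -R www-data:www-data /var/www/html
sudo find /var/www/html -type d -exec chmod 755 {} +
sudo find /var/www/html -type f -exec chmod 644 {} +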
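Because traversal needs the execute bit on every directory in the path, it can help to print the permissions of each component in one pass. This is a sketch under assumptions: `check_path` is a hypothetical name, and it relies on GNU `stat` and `readlink -f`. The util-linux `namei -l /path/to/file` command gives a similar view where it is installed.

```shell
# Hypothetical helper: print mode, owner:group and path for a file and
# every parent directory up to /, so a missing "x" bit is easy to spot.
check_path() {
  local p
  p=$(readlink -f -- "$1") || return 1
  while :; do
    stat -c '%A %U:%G %n' -- "$p"
    [ "$p" = "/" ] && break
    p=$(dirname -- "$p")
  done
}

# Usage:
#   check_path /var/www/html/index.html
```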
Symptoms
No space left on device errors even though df -h shows ample free space. File creation fails. Often caused by millions of small temp files or mail queue entries.
Diagnosis Steps
- Confirm inode exhaustion (look for 100% under IUse%).
df -ih
- Find the directory with the most files.
find / -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -20
- Count files in suspected directories.
ls /tmp | wc -l
ls /var/spool/postfix/maildrop | wc -l
Resolution
# Remove temp files not modified for more than a day (files only, so directories in use are left alone)
sudo find /tmp -type f -mtime +1 -delete
# Clear mail queue if postfix is the culprit
sudo postsuper -d ALL deferred
# For PHP session file buildup
sudo find /var/lib/php/sessions -type f -mtime +1 -delete
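The directory count from the diagnosis steps can be wrapped so it runs against any subtree. A minimal sketch, assuming GNU `find` (for `-printf`); `top_dirs_by_count` is a hypothetical name.

```shell
# Hypothetical helper: list the directories holding the most entries.
# $1 = root to scan (default /), $2 = number of rows to show (default 20).
top_dirs_by_count() {
  local root="${1:-/}" n="${2:-20}"
  find "$root" -xdev -printf '%h\n' 2>/dev/null \
    | sort | uniq -c | sort -rn | head -n "$n"
}

# Usage:
#   sudo sh -c '. ./helpers.sh; top_dirs_by_count /var 10'   # helpers.sh is a placeholder file name
```

Each entry found consumes one inode, so the directory at the top of this list is usually the one to clean up first.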
Symptoms
All write operations fail with Read-only file system. Typically happens after an unclean shutdown or I/O errors that caused the kernel to remount read-only for data safety.
Diagnosis Steps
- Confirm the filesystem is mounted read-only.
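# A plain grep for "ro," also matches rw mounts that carry errors=remount-ro
grep -E '[ ,]ro[ ,]' /proc/mounts
- Check kernel I/O error messages.
sudo dmesg | grep -iE "error|I/O|remount|read-only" | tail -30
- Check the filesystem state, mount count, and last check time for corruption clues.
sudo tune2fs -l /dev/sda1 | grep -i "state\|mount\|check"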
Resolution
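# Attempt live remount read-write (only if dmesg shows no hardware errors)
sudo mount -o remount,rw /
# If the kernel log shows filesystem errors, run fsck on the unmounted filesystem.
# For the root filesystem, schedule fsck at next boot. /forcefsck may not be honored
# on every systemd host; the kernel parameter fsck.mode=force is the systemd alternative.
sudo touch /forcefsck
sudo reboot
# Or run fsck on a specific device (must be unmounted):
sudo fsck -y /dev/sda1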
Symptoms
Expected directory is empty after reboot. Application cannot find its data. Service fails to start because its data volume is not mounted. /etc/fstab entry may be misconfigured.
Diagnosis Steps
- Check currently mounted filesystems.
mount | column -t
lsblk -f
- Verify the device UUID matches the fstab entry.
sudo blkid
cat /etc/fstab
- Check systemd mount unit failures.
systemctl --failed
sudo journalctl -u "*.mount" --since "1 hour ago"
Resolution
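# Dry run first: parse fstab and report what would be mounted, without mounting
sudo mount --fake --all --verbose
# Then mount all filesystems listed in fstab
sudo mount -a --verbose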
# Fix a UUID mismatch in /etc/fstab
# 1. Get correct UUID:
sudo blkid /dev/sdb1
# 2. Edit fstab to update UUID:
sudo nano /etc/fstab
# UUID=<correct-uuid> /data ext4 defaults 0 2
# Test fstab syntax without rebooting:
sudo findmnt --verify
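The UUID comparison can be automated by checking each `UUID=` entry against the symlinks the kernel exposes for attached devices. This is a sketch with assumptions: `missing_fstab_uuids` is a hypothetical name, it expects unquoted `UUID=` values in the first fstab field, and it assumes udev populates `/dev/disk/by-uuid`.

```shell
# Hypothetical helper: print fstab UUIDs that have no matching device.
# $1 = fstab path (default /etc/fstab), $2 = by-uuid directory.
missing_fstab_uuids() {
  local fstab="${1:-/etc/fstab}" dir="${2:-/dev/disk/by-uuid}" uuid
  awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$fstab" |
  while read -r uuid; do
    [ -e "$dir/$uuid" ] || echo "$uuid"
  done
}

# Usage:
#   missing_fstab_uuids        # no output means every UUID resolves to a device
```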
⚙️ Process Scenarios
High CPU processes, hangs, zombie cleanup, OOM kills, and open-file-descriptor limits.
Symptoms
Server load average exceeds CPU count. Applications become sluggish. Monitoring alert fires on CPU >85% sustained. SSH response is slow.
Diagnosis Steps
- Get an instant top-process snapshot sorted by CPU.
top -b -n1 | head -20
# or
ps aux --sort=-%cpu | head -15
- Check load average versus CPU count.
uptime
nproc
- Get a thread-level breakdown for the suspect PID.
top -H -p <PID>
- Summarize the system calls the process makes (attach for 10 seconds, then print the counts).
sudo timeout 10 strace -c -f -p <PID>
Resolution
# Renice a runaway process to lower its priority
sudo renice +10 -p <PID>
# Kill if confirmed runaway (SIGTERM first, then SIGKILL)
kill -15 <PID>
sleep 3
kill -9 <PID>
# Investigate if it's a recurring scheduled task
sudo crontab -l
sudo crontab -l -u www-data
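The symptom "load average exceeds CPU count" reduces to one division. The sketch below makes that comparison explicit; `load_per_core` is a hypothetical name, and a result above 1.00 means the 1-minute load exceeds the number of cores.

```shell
# Hypothetical helper: 1-minute load average divided by the core count.
# $1 = loadavg file (default /proc/loadavg), $2 = cores (default: nproc).
load_per_core() {
  local loadavg_file="${1:-/proc/loadavg}" cores="${2:-$(nproc)}"
  awk -v cores="$cores" '{ printf "%.2f\n", $1 / cores }' "$loadavg_file"
}

# Usage:
#   load_per_core
```

Load on Linux also counts tasks blocked on I/O, so a high ratio with idle CPUs points at disk or network waits rather than a runaway process.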
Symptoms
ps aux shows processes with state Z (defunct/zombie). If zombie count is large it can exhaust the PID table. Usually indicates a bug in the parent process (not calling wait()).
Diagnosis Steps
- Count and list zombie processes.
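ps aux | awk '$8 ~ /^Z/'          # STAT can be Z, Z+, Zs, so match the prefix
ps aux | grep -c '[d]efunct'      # bracket trick keeps grep from counting itself
- Find the parent process (PPID) that is not reaping children.
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'
pstree -p -s <zombie-PID>         # show the ancestry of one zombie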
Resolution
# You cannot kill a zombie directly — kill the parent process.
# 1. Find parent PID:
ps -o ppid= -p <zombie-PID>
# 2. Send SIGCHLD to parent (asks it to reap children):
kill -SIGCHLD <parent-PID>
# 3. If parent ignores SIGCHLD, restart the parent service:
sudo systemctl restart <service-name>
# 4. Last resort — kill the parent (ensure it is safe to restart):
kill -9 <parent-PID>
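When many zombies exist, grouping them by parent shows which process is failing to reap. A minimal sketch; `zombies_by_parent` is a hypothetical name and it expects two columns (PPID, STAT) on stdin.

```shell
# Hypothetical helper: count zombie processes per parent PID.
# stdin: output of `ps -eo ppid= -o stat=`
zombies_by_parent() {
  awk '$2 ~ /^Z/ { count[$1]++ }
       END { for (p in count) print count[p], p }' | sort -rn
}

# Usage:
#   ps -eo ppid= -o stat= | zombies_by_parent
```

Each output line is "count parent-PID", with the parent holding the most zombies listed first.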
Symptoms
Service suddenly disappears. No graceful shutdown log. dmesg contains Out of memory: Kill process entries. Often happens at peak load or during memory leaks.
Diagnosis Steps
- Confirm OOM kill in kernel messages.
sudo dmesg | grep -i "oom\|killed\|out of memory" | tail -20
- Check systemd journal for the time of death.
sudo journalctl -k --since "1 hour ago" | grep -i oom
- Review current memory usage and which process is the largest.
free -h
ps aux --sort=-%mem | head -15
Resolution
# Short-term: restart the killed service
sudo systemctl restart <service-name>
# Add swap space if system has none (emergency measure):
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make permanent:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
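# Tune OOM score to protect a critical process. The value is lost when the process
# restarts; for systemd services, set OOMScoreAdjust=-1000 under [Service] instead.
echo -1000 | sudo tee /proc/<PID>/oom_score_adj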
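For the escalation note it is useful to extract exactly which processes the kernel killed. The sketch below assumes the kernel's "Killed process PID (name)" wording and GNU `sed -E`; `oom_victims` is a hypothetical name.

```shell
# Hypothetical helper: print "PID name" for every OOM kill in a kernel log.
# stdin: kernel messages, e.g. from `dmesg` or `journalctl -k`
oom_victims() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}

# Usage:
#   sudo dmesg | oom_victims
```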
Symptoms
Executing a binary returns command not found, No such file or directory, or error while loading shared libraries. Service fails to start with exit code 127 or 1.
Diagnosis Steps
- Check if the binary exists and is executable.
which java
ls -la /usr/bin/java
file /usr/bin/java
- Check shared library dependencies.
ldd /path/to/binary
- Trace the startup to see exactly where it fails.
sudo strace -e trace=openat /path/to/binary 2>&1 | grep "ENOENT"
- Check if a required environment variable is missing.
printenv | grep -i java
printenv PATH
Resolution
# Install missing library
sudo apt-get install <package>
# Update library cache after manual install
sudo ldconfig
# Add missing PATH entry (session):
export PATH=$PATH:/opt/custom/bin
# Persistent PATH fix in service unit:
sudo systemctl edit <service-name>
# Add under [Service]:
# Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
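To turn the `ldd` output into a list of packages to look for, the unresolved libraries can be filtered out. A minimal sketch; `missing_libs` is a hypothetical name and it relies on `ldd` printing "=> not found" for unresolved entries.

```shell
# Hypothetical helper: print only the shared libraries ldd could not resolve.
# stdin: output of `ldd /path/to/binary`
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Usage:
#   ldd /path/to/binary | missing_libs
```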
Symptoms
Application logs show Too many open files. New connections refused. Java applications throw java.io.IOException: Too many open files. Commonly affects high-traffic web servers or databases.
Diagnosis Steps
- Check current limits for the running process.
cat /proc/<PID>/limits | grep "open files"
ulimit -n
- Count actual open file descriptors for the process.
ls /proc/<PID>/fd | wc -l
lsof -p <PID> | wc -l
- Find which process has the most open files system-wide.
lsof | awk '{print $2}' | sort | uniq -c | sort -rn | head -10
Resolution
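# Raise the limit for the current shell and anything started from it
# (this does not change a process that is already running):
ulimit -n 65535
# Raise the limit of an already-running process without restarting it:
sudo prlimit --pid <PID> --nofile=65535:65535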
# Permanent system-wide increase in /etc/security/limits.conf:
sudo tee -a /etc/security/limits.conf << EOF
* soft nofile 65535
* hard nofile 65535
EOF
# For systemd services, set in unit override:
sudo systemctl edit <service-name>
# Add under [Service]:
# LimitNOFILE=65535
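The two diagnosis numbers (descriptors in use and the soft limit) are easier to compare side by side. A sketch under assumptions: `fd_usage` is a hypothetical name, it reads Linux `/proc`, and inspecting another user's process needs root.

```shell
# Hypothetical helper: print "PID open-descriptors soft-limit" for a process.
fd_usage() {
  local pid="$1" open limit
  open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  # Column 4 of the "Max open files" row is the soft limit.
  limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits" 2>/dev/null) || return 1
  echo "$pid $open $limit"
}

# Usage:
#   fd_usage <PID>
```

A count close to the limit confirms descriptor exhaustion; a low count suggests the error comes from a different process or from the system-wide limit.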
🌐 Networking Scenarios
Port unreachable, DNS failures, network interface issues, firewall blocks, and port conflicts.
Symptoms
Client gets Connection refused or times out. Health check fails. The service may have crashed or is bound to the wrong interface/port.
Diagnosis Steps
- Check whether anything is listening on the expected port.
ss -tlnp | grep <PORT>
# or
sudo netstat -tlnp | grep <PORT>
- Test connectivity from the server itself.
curl -v http://localhost:<PORT>
telnet localhost <PORT>
- Check whether the service is running.
sudo systemctl status <service-name>
Resolution
# Restart the service
sudo systemctl restart <service-name>
# If service binds to 127.0.0.1 instead of 0.0.0.0, edit config:
# e.g. for nginx: listen 0.0.0.0:80;
# e.g. for Spring Boot: server.address=0.0.0.0
# Verify after fix:
ss -tlnp | grep <PORT>
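A service bound only to loopback looks healthy from the host but is unreachable remotely. The sketch below lists such listeners; `loopback_only` is a hypothetical name, and it assumes the `ss -tlnH` layout where column 4 is the local address and port.

```shell
# Hypothetical helper: print listeners bound only to 127.x.x.x or [::1].
# stdin: output of `ss -tlnH` (-H suppresses the header line)
loopback_only() {
  awk '$4 ~ /^(127\.|\[::1\])/ { print $4 }'
}

# Usage:
#   ss -tlnH | loopback_only
```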
Symptoms
curl: (6) Could not resolve host. Application cannot connect to external APIs or databases by hostname. ping google.com fails but ping 8.8.8.8 succeeds.
Diagnosis Steps
- Test basic resolution.
dig google.com
nslookup google.com
host google.com
- Check which DNS servers are configured.
cat /etc/resolv.conf
resolvectl status | grep DNS      # systemd-resolve --status on older systemd releases
- Test resolution against a known-good resolver directly.
dig @8.8.8.8 google.com
Resolution
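# If /etc/resolv.conf is empty or wrong, set a DNS server. On hosts running
# systemd-resolved or NetworkManager the file is usually generated, so treat
# a manual edit as a temporary fix that may be overwritten:
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
echo "nameserver 1.1.1.1" | sudo tee -a /etc/resolv.conf
# Flush systemd-resolved cache (systemd-resolve --flush-caches on older systemd releases):
sudo resolvectl flush-caches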
# Restart the resolver service:
sudo systemctl restart systemd-resolved
# For Docker containers — check the container's DNS:
docker exec -it <container> cat /etc/resolv.conf
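To find out which configured resolver is failing, each `nameserver` entry can be queried in turn. This is a sketch with assumptions: both function names are hypothetical, `dig` must be installed, and a zero exit status from `dig` is treated as "the server answered".

```shell
# Hypothetical helper: print the nameserver addresses from a resolv.conf file.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical helper: query every configured nameserver for one hostname.
# $1 = hostname to resolve, $2 = resolv.conf path (optional).
check_nameservers() {
  local name="${1:-example.com}" ns
  for ns in $(list_nameservers "${2:-}"); do
    if dig +short +time=2 +tries=1 "@$ns" "$name" >/dev/null 2>&1; then
      echo "$ns ok"
    else
      echo "$ns FAILED"
    fi
  done
}

# Usage:
#   check_nameservers google.com
```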
Symptoms
Server is unreachable after reboot. No IP address assigned. ip addr shows interface in DOWN state. Often caused by a misconfigured netplan or interface config.
Diagnosis Steps
- List all interfaces and their state.
ip link show
ip addr show
- Check routing table.
ip route show
- Look for interface-related errors in kernel log.
sudo dmesg | grep -iE "eth|ens|link|network" | tail -20
Resolution
# Bring interface up manually (temporary):
sudo ip link set eth0 up
sudo dhclient eth0
# For Ubuntu with Netplan — check and apply config:
cat /etc/netplan/*.yaml
sudo netplan try
sudo netplan apply
# For NetworkManager-based systems:
nmcli device status
nmcli device connect eth0
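On hosts with many interfaces, the DOWN ones can be picked out of the one-line `ip` output. A minimal sketch; `down_interfaces` is a hypothetical name and it assumes the `ip -o link show` format, where the operational state follows the word `state`.

```shell
# Hypothetical helper: print the names of interfaces whose state is DOWN.
# stdin: output of `ip -o link show`
down_interfaces() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "state" && $(i + 1) == "DOWN") {
        name = $2
        sub(/:$/, "", name)       # drop the trailing colon after the name
        print name
      }
  }'
}

# Usage:
#   ip -o link show | down_interfaces
```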
Symptoms
Service fails to start with Address already in use (EADDRINUSE). Often happens when a previous instance did not shut down cleanly, or two services are configured on the same port.
Diagnosis Steps
- Find which process is using the port.
sudo ss -tlnp | grep :<PORT>
sudo fuser -n tcp <PORT>
sudo lsof -i :<PORT>
- Identify the process by PID.
ps -p <PID> -o pid,cmd
Resolution
# Option 1 — Kill the conflicting process (if stale):
sudo kill -9 <PID>
# Option 2 — Change the port in the service config and restart:
# e.g. for Spring Boot: server.port=8081
# Option 3 — Check if a previous instance is still tracked by systemd:
sudo systemctl stop <old-service-name>
sudo systemctl start <new-service-name>
Symptoms
Connection times out (not refused — the packet is silently dropped). Service is listening on the port locally but remote clients cannot reach it. ufw status may show the rule is missing.
Diagnosis Steps
- Check ufw status and active rules.
sudo ufw status verbose
- Check raw iptables chains.
sudo iptables -L -n -v | grep -E "DROP|REJECT|<PORT>"
- Capture packets to see if they arrive at the interface.
sudo tcpdump -i any port <PORT> -nn
Resolution
# Allow a port through ufw:
sudo ufw allow <PORT>/tcp
sudo ufw reload
# Allow with source restriction:
sudo ufw allow from 10.0.0.0/8 to any port <PORT>
# iptables: insert ACCEPT rule before a DROP:
sudo iptables -I INPUT -p tcp --dport <PORT> -j ACCEPT
# Verify:
curl -v http://<server-ip>:<PORT>
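The refused-versus-timeout distinction can be checked from a client machine without extra tools. The sketch below is one possible approach, not the runbook's own method: `probe_port` is a hypothetical name, and it relies on bash's `/dev/tcp` redirection plus coreutils `timeout`, which exits with status 124 when the time limit is hit.

```shell
# Hypothetical helper: classify a TCP port as open, closed, or timeout.
# $1 = host, $2 = port, $3 = seconds to wait (default 3).
probe_port() {
  local host="$1" port="$2" wait="${3:-3}" rc
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
  rc=$?
  case "$rc" in
    0)   echo "open" ;;
    124) echo "timeout" ;;   # no reply at all: packets are likely being dropped
    *)   echo "closed" ;;    # connection refused or host unreachable
  esac
}

# Usage:
#   probe_port <server-ip> <PORT>
```

A `timeout` result points at a firewall or routing drop, while `closed` means the host answered and nothing is listening on that port.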
🔧 Services & Cron Scenarios
Systemd service failures, restart loops, cron job problems, dependency issues, and log flooding.
Symptoms
Service shows failed status in systemctl. Application is not running. Dashboard or health endpoint returns 503. Logs may show exit code 1 or 2.
Diagnosis Steps
- Check service status and last exit code.
sudo systemctl status <service-name>
- Read the last 100 lines of the service log.
sudo journalctl -u <service-name> -n 100 --no-pager
- Check the unit file for configuration issues.
sudo systemctl cat <service-name>
- Validate the unit file syntax.
sudo systemd-analyze verify /etc/systemd/system/<service-name>.service
Resolution
# Fix the unit file or config, then reload the daemon and restart:
sudo systemctl daemon-reload
sudo systemctl restart <service-name>
# Watch live logs during startup to catch the failure point:
sudo journalctl -u <service-name> -f &
sudo systemctl start <service-name>
Symptoms
systemctl status shows "activating" or rapid start/stop cycles. Service enters systemd's throttle state. Log shows repeated startup then immediate failure.
Diagnosis Steps
- Check how many times it has been restarted.
systemctl show <service-name> | grep -i "NRestarts\|ActiveState\|Result"
- Read the last 10 minutes of logs to see the failure reason.
sudo journalctl -u <service-name> --since "10 minutes ago"
- Check the Restart= policy in the unit file.
systemctl cat <service-name> | grep -i restart
Resolution
# Stop the restart loop to investigate without systemd restarting it:
sudo systemctl stop <service-name>
# Run the binary manually to see the real error:
sudo -u <service-user> /path/to/binary --options
# If it's a config error, fix it and restart:
sudo systemctl start <service-name>
# Increase StartLimitIntervalSec/StartLimitBurst to allow more restarts
# while debugging (edit unit file via: sudo systemctl edit <service>)
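The restart counter can be turned into a one-line verdict for the ticket. A sketch under assumptions: `restart_summary` is a hypothetical name, the threshold of 5 is arbitrary, and it expects the `KEY=value` lines that `systemctl show` prints.

```shell
# Hypothetical helper: summarize restart state from `systemctl show` output.
# stdin: KEY=value lines; $1 = restart count treated as a loop (default 5).
restart_summary() {
  local threshold="${1:-5}"
  awk -F= -v t="$threshold" '
    { kv[$1] = $2 }
    END {
      verdict = (kv["NRestarts"] + 0 >= t) ? "LOOPING" : "ok"
      printf "%s restarts=%s state=%s result=%s\n",
             verdict, kv["NRestarts"], kv["ActiveState"], kv["Result"]
    }'
}

# Usage:
#   systemctl show <service-name> -p NRestarts -p ActiveState -p Result | restart_summary 5
```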
Symptoms
Scheduled task does not execute. Backup, report, or cleanup job silently skipped. No output or error visible. Cron is running, but the specific job is not firing.
Diagnosis Steps
- Verify the cron entry syntax.
crontab -l
sudo crontab -l -u <user>
- Check cron execution logs.
grep CRON /var/log/syslog | tail -30
sudo journalctl -u cron --since "1 hour ago"
- Check if cron daemon is running.
sudo systemctl status cron
- Test the exact command manually as the cron user.
sudo -u <user> /path/to/script.sh
Resolution
# Common fix 1: PATH issue — add PATH at the top of crontab:
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
# Common fix 2: Redirect output to capture errors:
* * * * * /path/to/script.sh >> /tmp/cron_output.log 2>&1
# Common fix 3: Use absolute paths in the script:
# BAD: mysql -u root ...
# GOOD: /usr/bin/mysql -u root ...
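# Validate the schedule with an online cron expression checker before installing it.
# Piping a single line to "crontab -" REPLACES the user's entire crontab;
# to append one entry while keeping the existing ones:
(crontab -l 2>/dev/null; echo "30 2 * * * /path/to/script.sh") | crontab -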
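A quick local sanity check catches the most common syntax slip, a missing time field. The sketch below is deliberately simplistic: `cron_line_ok` is a hypothetical name, and it only accepts numeric fields with `*`, `/`, `,` and `-`, so month or weekday names and shortcuts such as `@daily` are rejected even though cron accepts them.

```shell
# Hypothetical helper: succeed if a crontab line has five schedule fields
# made of digits, "*", "/", "," or "-", followed by a command.
cron_line_ok() {
  local field
  set -f            # disable globbing so "*" is not expanded while splitting
  set -- $1
  set +f
  [ "$#" -ge 6 ] || return 1
  for field in "$1" "$2" "$3" "$4" "$5"; do
    case "$field" in
      *[!0-9*/,-]*) return 1 ;;   # contains a character outside the allowed set
    esac
  done
  return 0
}

# Usage:
#   cron_line_ok "30 2 * * * /path/to/script.sh" && echo "looks valid"
```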
Symptoms
Service appears to start but then stops or hangs waiting, because another service it depends on (database, network, cache) has not started. The failure reason shows start-limit-hit or a dependency failure.
Diagnosis Steps
- List all failed units.
systemctl --failed
- Check the dependency graph for the service.
systemctl list-dependencies <service-name>
- Check status of each dependency.
sudo systemctl status mysql
sudo systemctl status network-online.target
Resolution
# Start the dependency first, then the service:
sudo systemctl start mysql
sudo systemctl start <app-service>
# If dependency ordering is wrong in the unit file, fix it:
sudo systemctl edit <service-name>
# Add under [Unit]:
# After=mysql.service network-online.target
# Wants=mysql.service
sudo systemctl daemon-reload
sudo systemctl restart <service-name>
Symptoms
/var/log/journal or /run/log/journal is consuming gigabytes. A chatty service is emitting thousands of log entries per second. Disk fills up causing service crashes.
Diagnosis Steps
- Check total journal size.
sudo journalctl --disk-usage
- Find the noisiest service.
sudo journalctl --since "1 hour ago" | awk '{print $5}' | sort | uniq -c | sort -rn | head -10
- View the high-volume service log to understand why.
sudo journalctl -u <noisy-service> -n 50
Resolution
# Immediate cleanup — keep only last 7 days:
sudo journalctl --vacuum-time=7d
# Or limit by size (keep last 500 MB):
sudo journalctl --vacuum-size=500M
# Permanent cap in /etc/systemd/journald.conf:
sudo tee -a /etc/systemd/journald.conf << EOF
SystemMaxUse=500M
MaxFileSec=7day
RateLimitIntervalSec=30s
RateLimitBurst=1000
EOF
sudo systemctl restart systemd-journald
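Counting on the raw fifth field treats `myapp[123]:` and `myapp[124]:` as different sources. The sketch below strips the PID suffix before counting; `noisiest_sources` is a hypothetical name, and it assumes the default short journal format where field 5 is the source identifier.

```shell
# Hypothetical helper: count log lines per source, ignoring the [PID] suffix.
# stdin: journalctl output in the default short format; $1 = rows to show.
noisiest_sources() {
  awk '{
         src = $5
         sub(/\[[0-9]+\]:?$/, "", src)   # myapp[123]: -> myapp
         sub(/:$/, "", src)              # kernel:     -> kernel
         count[src]++
       }
       END { for (s in count) print count[s], s }' | sort -rn | head -n "${1:-10}"
}

# Usage:
#   sudo journalctl --since "1 hour ago" | noisiest_sources 10
```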
⚡ Bash One-liners
Reusable triage commands for L1 support staff — copy, paste, run.
Disk & Filesystem
| Task | Command |
|---|---|
| Disk usage summary | `df -hT` |
| Top 10 largest files and directories | `du -ahx / 2>/dev/null \| sort -rh \| head -10` |
| Find files larger than 1 GB | `find / -xdev -type f -size +1G -ls 2>/dev/null` |
| Check inode usage | `df -ih` |
| Deleted files held open by processes | `sudo lsof +L1` |
| Vacuum journal logs older than 7 days | `sudo journalctl --vacuum-time=7d` |
Processes
| Task | Command |
|---|---|
| Top 10 processes by CPU | `ps aux --sort=-%cpu \| head -11` |
| Top 10 processes by memory | `ps aux --sort=-%mem \| head -11` |
| Count zombie processes | `ps aux \| awk '$8 ~ /^Z/' \| wc -l` |
| Kill all processes matching a command-line pattern | `pkill -f <process-name>` |
| Watch CPU/mem of one process | `watch -n1 "ps -p <PID> -o pid,pcpu,pmem,cmd"` |
| Recent OOM kills in the kernel log | `sudo dmesg \| grep -i "killed process" \| tail -10` |
| Open file descriptors for PID | `ls /proc/<PID>/fd \| wc -l` |
Networking
| Task | Command |
|---|---|
| Check all listening ports | `ss -tlnp` |
| Check established connections | `ss -tnp state established` |
| Find process using a port | `sudo fuser -n tcp <PORT>` |
| Test port reachability | `curl -sv telnet://<host>:<port>` |
| Trace route to host | `traceroute -n <host>` |
| Capture packets on a port | `sudo tcpdump -i any port <PORT> -nn -c 50` |
| DNS lookup with timing | `dig <hostname> \| grep -E "time\|ANSWER"` |
| Check open ports on remote host | `nmap -sT -p 1-65535 <host>` |
Services & Logs
| Task | Command |
|---|---|
| List all failed services | `systemctl --failed` |
| Live log stream for a service | `sudo journalctl -u <service> -f` |
| Logs from the current boot | `sudo journalctl -b -u <service>` |
| Last 100 lines of any log file | `tail -100 /var/log/<app>.log` |
| Grep log for ERROR lines | `grep -iE "error\|exception\|fail" /var/log/<app>.log \| tail -30` |
| Check cron execution history | `grep CRON /var/log/syslog \| tail -20` |
| Find which service owns a port | `sudo ss -tlnp \| grep :<PORT>` |
| Journal disk usage | `sudo journalctl --disk-usage` |
⚠️ Common Mistakes Appendix
Misconfigurations responsible for the majority of first-response support tickets. Check these first before deep investigation.
Setting chmod 777 on everything "to fix it" is a security risk. Setting 600 on a directory makes it inaccessible.
644 for files (owner rw, group/other r), 755 for directories and executables. Use 640 for config files with secrets. Always check parent directory execute bits — you need x on every directory in the path.
find /var/www -type f -exec chmod 644 {} \;
find /var/www -type d -exec chmod 755 {} \;
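Before resetting permissions it can be worth listing what a past `chmod 777` left behind. A minimal sketch; `find_world_writable` is a hypothetical name, and sticky-bit directories such as /tmp are skipped because world-writable is intended there.

```shell
# Hypothetical helper: list world-writable files and directories under a
# path, excluding directories that carry the sticky bit.
find_world_writable() {
  find "${1:-.}" -xdev \( -type f -o -type d \) -perm -0002 ! -perm -1000 2>/dev/null
}

# Usage:
#   find_world_writable /var/www
```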
Application works when run manually but fails under systemd because the service environment does not inherit the user's shell environment (JAVA_HOME, DATABASE_URL, etc.).
# /etc/systemd/system/myapp.service [Service] section:
Environment="JAVA_HOME=/usr/lib/jvm/java-17"
Environment="DATABASE_URL=jdbc:mysql://localhost:3306/mydb"
# Or load from a file:
EnvironmentFile=/etc/myapp/env.conf
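To confirm what a running service actually received, its environment can be read from `/proc`. This is a sketch under assumptions: both function names are hypothetical, the approach is Linux-specific, and reading another user's process requires root.

```shell
# Hypothetical helper: print the environment of a running process, one
# VAR=value per line (the kernel stores it NUL-separated).
proc_env() {
  tr '\0' '\n' < "/proc/$1/environ"
}

# Hypothetical helper: print each named variable that the process lacks.
# $1 = PID, remaining arguments = variable names to check.
missing_env_vars() {
  local pid="$1" var
  shift
  for var in "$@"; do
    proc_env "$pid" | grep -q "^${var}=" || echo "$var"
  done
}

# Usage (myapp is the example unit name used above):
#   pid=$(systemctl show -p MainPID --value myapp)
#   sudo sh -c ". ./helpers.sh; missing_env_vars $pid JAVA_HOME DATABASE_URL"   # helpers.sh is a placeholder
```

The file reflects the environment at process start, which is exactly what systemd passed in, so it is a reliable check for this class of failure.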
Deploying a new service on port 8080 without checking whether another process already owns that port. Also common: accidentally starting two instances of the same service.
ss -tlnp | grep :8080
# If occupied, pick a free port or stop the conflicting service.
Cron runs with a minimal PATH (/usr/bin:/bin). Commands like mysql, python3, or node that work in a shell will silently fail in cron.
# In crontab, set PATH explicitly:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# In the script, use absolute paths:
/usr/bin/python3 /opt/scripts/backup.py >> /var/log/backup.log 2>&1
Editing /etc/systemd/system/myapp.service and then restarting the service — systemd still loads the old cached unit. Changes appear to have no effect.
sudo systemctl daemon-reload
sudo systemctl restart myapp
Creating directories as root that the service needs to write to as www-data or appuser. Service starts but fails on first write.
# Who does the service run as?
systemctl show myapp -p User
# Fix ownership:
sudo chown -R appuser:appuser /var/lib/myapp
Adding iptables rules directly without saving them. Rules disappear on reboot causing mystery outages.
Use ufw, which persists rules automatically, or save iptables rules explicitly.
# Preferred (ufw handles persistence):
sudo ufw allow 443/tcp
# For raw iptables, save the rules. The redirect must also run as root,
# which a plain "sudo iptables-save > file" does not do:
sudo sh -c 'iptables-save > /etc/iptables/rules.v4'
Developers test locally with localhost bindings then deploy without changing the bind address. Port is open, firewall is clear, but remote connections never arrive at the application.
Bind to 0.0.0.0 for all interfaces or to the specific interface IP.
# Check what address the service is bound to:
ss -tlnp | grep <PORT>
# Spring Boot: application.properties
server.address=0.0.0.0
# Node.js: app.listen(PORT, '0.0.0.0', ...)
# nginx: listen 0.0.0.0:80;