Lesson 1. When to kill, restart, or throttle a process: safe kill practices, systemctl restart, and using cgroups and nice/renice
Know when to kill, restart, or throttle a process safely. Learn signal types, safe kill patterns, and systemctl restart quirks, then apply cgroups plus nice/renice to limit damage.
- Choosing SIGTERM, SIGKILL, and others
- Using kill and pkill with safeguards
- Restarting services with systemctl
- Throttling CPU with nice and renice
- Limiting resources using cgroups
- Documenting and automating remedies

Lesson 2. Analysing swap usage and OOM events: dmesg, kernel OOM killer logs, and /var/log/kern.log
Check swap usage and out-of-memory (OOM) events with free, dmesg, kernel OOM killer logs, and /var/log/kern.log. Spot thrashing, tune swappiness, and decide whether to add RAM or adjust limits.
- Checking swap usage with free and /proc
- Recognizing swap thrashing symptoms
- Reading dmesg for OOM killer entries
- Parsing /var/log/kern.log details
- Tuning swappiness and vm overcommit
- Deciding when to add RAM or adjust limits

Lesson 3. Identifying hot processes: ps, ps aux --sort, pgrep, pidstat, and mapping PIDs to services
Quickly spot hot or misbehaving processes with ps, pgrep, pidstat, and sorting. Map PIDs to services, units, and containers to tie resource hogging to its culprit.
- Sorting ps output by CPU and memory
- Using pgrep and pkill name filters
- Monitoring per-process stats with pidstat
- Mapping PIDs to systemd units
- Relating PIDs to containers or cgroups
- Tracking short-lived bursty processes

Lesson 4. Identifying recurring resource spikes: inspecting cron, systemd timers, at jobs, and application schedulers
Detect recurring CPU, memory, and I/O spikes by matching metrics to scheduled tasks. Check cron, systemd timers, at jobs, and app schedulers to fix noisy or clashing jobs.
- Listing and reading user and system crontabs
- Inspecting systemd timers and calendar units
- Reviewing at jobs and one-off schedules
- Tracing app-level schedulers and workers
- Correlating spikes with job execution times
- Refining or staggering noisy recurring jobs

Lesson 5. Memory troubleshooting: free, /proc/meminfo, smem, pmap, and checking for memory leaks
Troubleshoot memory with free, /proc/meminfo, smem, and pmap. Tell cache from real pressure, check per-process usage, and spot leaks or fragmentation patterns.
- Interpreting free and available memory
- Reading /proc/meminfo key fields
- Using smem for per-process breakdowns
- Inspecting process maps with pmap
- Spotting memory leak growth patterns
- Differentiating cache from real pressure

Lesson 6. Integrating with monitoring data (Prometheus, Grafana) and using historical metrics to determine trends
Combine local troubleshooting with Prometheus and Grafana data. Use historical metrics, dashboards, and alerts to spot trends, regressions, and slow drifts, and to verify the impact of fixes.
- Reviewing key CPU and load dashboards
- Inspecting memory, cache, and swap panels
- Analyzing disk and network latency graphs
- Using PromQL to slice historical metrics
- Correlating deploys with metric changes
- Validating fixes with before and after views

Lesson 7. Load vs CPU saturation: uptime, load average interpretation, and relation to CPU cores
Interpret system load averages in relation to CPU core count and run queues. Distinguish healthy high load from saturation, and relate load to I/O wait, context switches, and latency.
- Reading uptime and load averages
- Relating load to CPU core counts
- Separating runnable and blocked tasks
- Identifying CPU-bound saturation cases
- Recognizing I/O wait driven load
- Using vmstat and mpstat to confirm

Lesson 8. Collecting live system metrics: top, htop, vmstat, mpstat, iostat, and how to interpret outputs
Collect and read live Linux metrics with top, htop, vmstat, mpstat, and iostat. Understand the CPU, memory, and I/O views, key fields, and refresh rates, and spot bottlenecks in real time.
- Reading CPU usage in top and htop
- Monitoring memory and swap in top
- Using vmstat for system-wide snapshots
- Analyzing CPU stats with mpstat
- Checking disk I/O patterns with iostat
- Choosing sampling intervals and filters

Lesson 9. Using perf, strace, and ltrace for deep process analysis and when to use each
Know when and how to use perf, strace, and ltrace for deep dives. Profile CPU hotspots, trace syscalls and library calls, and keep overhead low for reliable diagnostics.
- Profiling CPU hotspots with perf record
- Viewing perf reports and call graphs
- Tracing syscalls with strace safely
- Filtering noisy strace output
- Inspecting library calls using ltrace
- Choosing the right tool for each symptom

Lesson 10. Using lightweight profiling and tracing tools (py-spy, gdb, flamegraphs) for Python apps
Profile Python apps with lightweight tools: py-spy, gdb, and flamegraphs. Grab stack samples in production, find hot paths, and read flamegraphs without halting services.
- Sampling Python stacks with py-spy
- Generating and reading flamegraphs
- Attaching gdb safely to live Python
- Handling stripped or optimized builds
- Profiling async and multithreaded code
- Reducing profiler overhead in production