Lesson 1
Synthetic and availability checks: uptime, cronjob health, backup completion alerts
This lesson implements synthetic checks for uptime, scheduled-job verification, and backup success, using probes that mimic users and validate dependencies to provide early warning.
HTTP uptime and availability probes
Multi-step synthetic user journeys
Cronjob and scheduler health checks
Backup success and RPO validation
Placement and frequency of probes

Lesson 2
Defining thresholds and alert rules: static thresholds, rate-based alerts, anomaly detection, and suppression windows
This lesson covers setting alert thresholds and rules using static limits, rate-based conditions, anomaly detection, and suppression windows, balancing sensitivity against noise and adapting to changing load and seasonal patterns.
Static thresholds and baselines
Rate-of-change and derivative alerts
Anomaly and outlier detection
Maintenance and silence windows
Tuning rules to reduce noise

Lesson 3
Infrastructure monitoring for hypervisor hosts and cloud instance health and billing alerts
This lesson covers monitoring hypervisors, VMs, and cloud instances for resource, storage, network, and service health, along with billing alerts, to avoid outages and cost surprises.
Hypervisor host health checks
VM and container resource usage
Cloud provider health metrics
Billing, budget, and quota alerts
Monitoring managed cloud services

Lesson 4
Log aggregation strategy: central syslog, Windows Event Forwarding, log formats, parsing considerations
This lesson covers centralising logs via syslog, Windows Event Forwarding, and agents, including log formats, parsing, retention, indexing, and access, to support troubleshooting and audits.
Central syslog and relay design
Windows Event Forwarding basics
Structured log formats and fields
Parsing, grok, and JSON pipelines
Retention, indexing, and archiving
Access control and privacy concerns

Lesson 5
Alerting platforms and routing: Alertmanager, PagerDuty, OpsGenie, email and Slack integrations
This lesson covers alerting platforms such as Alertmanager, PagerDuty, and OpsGenie, which handle events, deduplicate them, and route them to email, chat, and paging for quick notification.
Alertmanager routing trees
PagerDuty and OpsGenie basics
Email and Slack notification design
Alert grouping and deduplication
Multi-channel delivery and fallbacks

Lesson 6
Key metrics to monitor: CPU, memory, disk, I/O, network, swap, load average, inode usage
This lesson covers vital host metrics such as CPU, memory, disk, I/O, network, swap, load, and inodes, along with sane collection intervals and baselines for early detection.
CPU utilization and saturation
Memory pressure and swapping
Disk capacity and I/O latency
Network throughput and errors
Load average and run queues
Inode exhaustion risks

Lesson 7
Escalation policies, runbooks, alert deduplication, and on-call scheduling best practices
This lesson covers designing escalation policies, runbooks, deduplication, and on-call rotations for efficient incident handling, less alert fatigue, and team wellbeing.
Defining escalation paths and tiers
Writing clear, actionable runbooks
Alert deduplication and noise control
On-call rotation and handoff rules
Post-incident reviews and learning

Lesson 8
Monitoring tools: Prometheus + node_exporter, Grafana, Zabbix, Nagios, Datadog – selection rationale and tradeoffs
This lesson compares Prometheus, Grafana, Zabbix, Nagios, and Datadog on exporters, agents, scale, cost, and ecosystem to find the best fit for an organisation.
Prometheus and node_exporter usage
Grafana dashboards and alerting
Zabbix and Nagios strengths and limits
Datadog features and pricing impact
Criteria for tool evaluation and choice

Lesson 9
Application-level monitoring: response times, error rates, HTTP status codes, custom application metrics
This lesson covers monitoring application behaviour such as latency, errors, HTTP status codes, and custom metrics, including instrumentation, SLIs, and correlation with infrastructure data.
Request latency and percentiles
Error rates and failure patterns
Tracking HTTP status code classes
Custom business and domain metrics
Instrumentation libraries and SDKs

Lesson 10
Service-level monitoring: process/service checks, HTTP(S) endpoints, database health, AD/Kerberos latency
This lesson covers monitoring services via process checks, HTTP probes, database health, and AD/Kerberos checks, linking the results to user-facing reliability and SLAs.
Process and service supervision
HTTP(S) endpoint probing
Database connectivity and latency
AD and Kerberos health checks
Mapping checks to SLAs and SLOs
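
As a rough illustration of the last topic, mapping checks to SLAs and SLOs, the sketch below turns a series of pass/fail check results into an availability figure and a remaining error budget. It is a minimal example only: the names ProbeResult, availability, and error_budget_remaining, and the 99.5% target in the usage note, are assumptions made for illustration and are not part of the course material.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    """One check outcome (hypothetical shape, for illustration only)."""
    ok: bool            # True if the check passed
    latency_ms: float   # observed latency of the check


def availability(results: Sequence[ProbeResult]) -> float:
    """Fraction of checks that passed, between 0.0 and 1.0."""
    if not results:
        raise ValueError("no probe results to evaluate")
    return sum(r.ok for r in results) / len(results)


def error_budget_remaining(results: Sequence[ProbeResult], slo_target: float) -> float:
    """Share of the error budget still unspent (negative if overspent).

    slo_target is the required success fraction, e.g. 0.995 for 99.5%.
    """
    if not results:
        raise ValueError("no probe results to evaluate")
    allowed_failures = (1.0 - slo_target) * len(results)
    failures = sum(not r.ok for r in results)
    if allowed_failures == 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 0.0 if failures else 1.0
    return 1.0 - failures / allowed_failures
```

For example, 1000 checks with 2 failures give an availability of 0.998; against an assumed 99.5% target that allows about 5 failures, roughly 60% of the error budget remains.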