Lesson 1. Data lake and object storage choices: S3, GCS, Azure Blob; partitioning strategies, file formats (Parquet/ORC/Avro), and compression
Explore data lake design on the major clouds, comparing S3, GCS, and Azure Blob. Learn partitioning strategies, file layout, and how Parquet, ORC, Avro, and compression choices affect performance, cost, and downstream processing.
- Comparing S3, GCS, and Azure Blob capabilities
- Designing buckets, folders, and naming conventions
- Partitioning by time, entity, and lifecycle stage
- Choosing Parquet, ORC, or Avro for workloads
- Compression codecs and performance tradeoffs
- Optimizing small files and compaction jobs

Lesson 2. Batch ingestion and interoperability: Sqoop/CDC tools, AWS Glue, Google Dataflow batch, Airbyte connectors, nightly export scheduling
Learn batch ingestion options for databases and SaaS systems using Sqoop, CDC tools, AWS Glue, Google Dataflow batch, and Airbyte. Design nightly and intraday loads, handle schemas, and keep different sources interoperable.
- Sqoop and JDBC-based bulk extraction
- Change Data Capture tools and patterns
- AWS Glue jobs for batch ingestion
- Google Dataflow batch pipeline design
- Airbyte connectors and configuration
- Designing nightly and intraday load schedules

Lesson 3. Stream processing frameworks: Apache Flink, Kafka Streams, Spark Structured Streaming; exactly-once semantics, state management, windowing, watermarking
Dive into stream processing with Apache Flink, Kafka Streams, and Spark Structured Streaming. Learn how to design stateful operators, implement exactly-once semantics, and configure windows and watermarks for robust real-time analytics.
- Flink architecture and deployment options
- Kafka Streams topology and state stores
- Spark Structured Streaming micro-batch model
- Exactly-once semantics and idempotent sinks
- State management, checkpoints, and recovery
- Windowing, watermarking, and late events

Lesson 4. Integration and API layers: GraphQL/REST APIs, materialized views for product feeds, data access patterns for consumers
Explore integration and API layers that expose analytical and operational data. Learn GraphQL and REST patterns, how to use materialized views for product feeds, and how to design secure, governed data access for diverse consumers.
- REST API design for data access
- GraphQL schemas and resolvers for analytics
- Using materialized views for product feeds
- Caching and pagination strategies for APIs
- Row-level security and authorization
- Versioning and backward-compatible contracts

Lesson 5. Streaming ingestion options and patterns: Kafka, Confluent Platform, AWS Kinesis, Google Pub/Sub; producers, partitioning, schema evolution considerations
Understand streaming ingestion platforms including Kafka, Confluent, Kinesis, and Pub/Sub. Learn producer design, partitioning strategies, schema evolution, and patterns for durable, scalable event collection across domains.
- Kafka topics, partitions, and replication
- Confluent Platform ecosystem components
- AWS Kinesis streams and Firehose usage
- Google Pub/Sub design and quotas
- Producer design, batching, and backpressure
- Schema evolution with Avro and schema registry

Lesson 6. Real-time serving stores: Redis, RocksDB-backed stores, Cassandra, Druid for OLAP streaming queries
Study real-time serving stores such as Redis, RocksDB-backed engines, Cassandra, and Druid. Learn access patterns, data modeling, and how to support low-latency lookups and OLAP-style queries on fresh streaming data.
- Redis as cache and primary data store
- RocksDB-backed stateful services
- Cassandra data modeling for time series
- Druid architecture for streaming OLAP
- Balancing consistency, latency, and cost
- Capacity planning and hotspot mitigation

Lesson 7. Data warehouse options for analytics: Snowflake, BigQuery, Redshift; CTAS, materialized views, cost/freshness trade-offs
Compare data warehouse options such as Snowflake, BigQuery, and Redshift. Learn CTAS patterns, materialized views, clustering, and how to balance cost, performance, and data freshness for analytics workloads.
- Snowflake virtual warehouses and scaling
- BigQuery storage and query optimization
- Redshift distribution and sort keys
- CTAS patterns for derived tables
- Materialized views and refresh policies
- Cost versus freshness tradeoffs and tuning

Lesson 8. Batch processing and orchestration: Apache Spark, Spark on EMR/Dataproc, dbt for transformations, Airflow/Cloud Composer/Managed Workflows for orchestration
Understand batch processing with Spark on EMR and Dataproc, and SQL-centric transformations with dbt. Learn orchestration patterns using Airflow, Cloud Composer, and Managed Workflows to build reliable, observable batch pipelines.
- Spark cluster modes and resource sizing
- Spark job design for ETL and ELT
- dbt models, tests, and documentation
- Airflow DAG design and dependency management
- Scheduling, retries, and SLAs for batch jobs
- Monitoring, logging, and alerting for pipelines

Lesson 9. Feature store and ML data platform: Feast, Tecton, or custom feature pipelines using Delta Lake/BigQuery; online vs offline feature serving
Examine feature stores and ML data platforms using Feast, Tecton, or custom pipelines on Delta Lake and BigQuery. Learn feature definitions, lineage, and how to manage online versus offline serving for consistent model behavior.
- Core concepts of feature stores and entities
- Feast architecture and deployment patterns
- Tecton capabilities and integration options
- Building custom feature pipelines on Delta Lake
- Offline feature computation in BigQuery
- Online versus offline feature serving design