Lesson 1: Data lake and object storage choices: S3, GCS, Azure Blob — partitioning strategies, file formats (Parquet/ORC/Avro), and compression
Explore data lake designs on the major clouds, comparing S3, GCS, and Azure Blob. Learn partitioning strategies, file layout, and how Parquet, ORC, Avro, and compression choices affect performance, cost, and downstream processing.
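As a concrete illustration of time- and entity-based partitioning, here is a minimal sketch of a Hive-style partition path builder. The bucket name `analytics-lake` and the `entity=` key are hypothetical; the pattern itself is what engines like Spark, Athena, or BigQuery external tables rely on for partition pruning.

```python
from datetime import datetime, timezone

def partition_path(prefix: str, entity: str, event_time: datetime) -> str:
    """Build a Hive-style partition path (entity/year/month/day) under an
    object-store prefix, so query engines can prune partitions when
    filtering on entity or date."""
    return (
        f"{prefix}/entity={entity}"
        f"/year={event_time.year:04d}"
        f"/month={event_time.month:02d}"
        f"/day={event_time.day:02d}/"
    )

# Hypothetical bucket and entity names, for illustration only.
path = partition_path(
    "s3://analytics-lake/events",
    "orders",
    datetime(2024, 3, 7, tzinfo=timezone.utc),
)
```

Zero-padding the month and day keeps lexicographic ordering of object keys aligned with chronological ordering, which matters for prefix scans.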
- Comparing S3, GCS, and Azure Blob capabilities
- Designing buckets, folders, and naming conventions
- Partitioning by time, entity, and lifecycle stage
- Choosing Parquet, ORC, or Avro for workloads
- Compression codecs and performance tradeoffs
- Optimizing small files and compaction jobs

Lesson 2: Batch ingestion and integration: Sqoop/CDC tools, AWS Glue, Google Dataflow batch, Airbyte connectors, nightly export scheduling
Learn batch ingestion options from databases and SaaS systems using Sqoop, CDC tools, AWS Glue, Google Dataflow batch, and Airbyte. Plan nightly and intraday loads, schema handling, and orchestration across mixed sources for seamless integration.
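The incremental-extraction pattern behind tools like Sqoop's `--incremental lastmodified` mode or Glue job bookmarks can be sketched with a watermark column. This is an illustrative toy using in-memory SQLite as a stand-in for a source database; the table and column names are assumptions.

```python
import sqlite3

def extract_since(conn, table, watermark):
    """Incremental batch extract: pull only rows whose updated_at is past
    the last saved watermark, then advance the watermark to the newest
    row seen, so the next run picks up where this one stopped."""
    cur = conn.execute(
        f"SELECT id, updated_at FROM {table} "
        f"WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    )
    rows = cur.fetchall()
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

# Toy source table standing in for an OLTP database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03"),
])

rows, wm = extract_since(conn, "orders", "2024-01-01")
```

The watermark must be persisted between runs (Glue stores bookmarks for you; a hand-rolled job needs its own state table), and the source's `updated_at` must be reliably monotonic for this to be correct.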
- Sqoop and JDBC-based bulk extraction
- Change Data Capture tools and patterns
- AWS Glue jobs for batch ingestion
- Google Dataflow batch pipeline design
- Airbyte connectors and configuration
- Designing nightly and intraday load schedules

Lesson 3: Stream processing frameworks: Apache Flink, Kafka Streams, Spark Structured Streaming — exactly-once semantics, state management, windowing, watermarking
Go deep into stream processing with Apache Flink, Kafka Streams, and Spark Structured Streaming. Learn to design stateful operators, implement exactly-once semantics, and configure windows and watermarks for robust analytics in real-time systems.
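To make event-time windowing and watermarks concrete, here is a toy tumbling-window counter. It is a simplified sketch of the semantics Flink and Spark Structured Streaming implement, not their API: a window whose end has fallen behind the watermark is closed, and events targeting it are counted as late.

```python
from collections import defaultdict

class TumblingWindow:
    """Toy event-time tumbling window with a watermark: events are
    bucketed by window start; once the watermark passes a window's end,
    events for it are treated as late and dropped."""
    def __init__(self, size, allowed_lateness=0):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.counts = defaultdict(int)   # window start -> event count
        self.watermark = 0
        self.late = 0

    def on_event(self, event_time):
        # Watermark trails the max event time by the allowed lateness.
        self.watermark = max(self.watermark,
                             event_time - self.allowed_lateness)
        start = (event_time // self.size) * self.size
        if start + self.size <= self.watermark:
            self.late += 1               # window already closed
        else:
            self.counts[start] += 1

w = TumblingWindow(size=10, allowed_lateness=5)
for t in [1, 3, 12, 4, 25, 2]:
    w.on_event(t)
# The event at t=2 arrives after t=25 advanced the watermark to 20,
# so its window [0, 10) is already closed and it is counted as late.
```

Real engines add triggers, state checkpointing, and allowed-lateness side outputs on top of this core idea.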
- Flink architecture and deployment options
- Kafka Streams topology and state stores
- Spark Structured Streaming micro-batch model
- Exactly-once semantics and idempotent sinks
- State management, checkpoints, and recovery
- Windowing, watermarking, and late events

Lesson 4: Integration and API layers: GraphQL/REST APIs, materialized views for product feeds, data access patterns for consumers
Explore integration and API layers that expose analytical and operational data. Learn GraphQL and REST approaches, using materialized views for product feeds, and designing secure, governed data access for a variety of consumers.
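One of the recurring design questions for data-serving APIs is pagination. A minimal cursor-based sketch, with a hypothetical `GET /v1/orders?cursor=...&limit=...` contract assumed for illustration:

```python
def paginate(items, cursor=0, limit=2):
    """Cursor-based pagination for a REST data endpoint: return one page
    of results plus an opaque next_cursor (None once exhausted), so
    clients can walk a large result set in bounded requests."""
    page = items[cursor:cursor + limit]
    next_cursor = cursor + limit if cursor + limit < len(items) else None
    return {"data": page, "next_cursor": next_cursor}

orders = [{"id": i} for i in range(5)]
first = paginate(orders)
second = paginate(orders, cursor=first["next_cursor"])
last = paginate(orders, cursor=4)
```

In production the cursor is usually an encoded key (e.g. last-seen ID) rather than an offset, so pages stay stable while the underlying data changes.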
- REST API design for data access
- GraphQL schemas and resolvers for analytics
- Using materialized views for product feeds
- Caching and pagination strategies for APIs
- Row-level security and authorization
- Versioning and backward-compatible contracts

Lesson 5: Streaming ingestion options and patterns: Kafka, Confluent Platform, AWS Kinesis, Google Pub/Sub — producers, partitioning, schema evolution considerations
Understand streaming ingestion platforms including Kafka, Confluent, Kinesis, and Pub/Sub. Learn producer design, partitioning strategies, schema evolution, and patterns for durable, scalable event collection across regions.
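The key-based partitioning idea behind Kafka's default partitioner can be sketched in a few lines: hashing the message key keeps all events for one entity on one partition, preserving per-key ordering. Kafka itself hashes with murmur2; MD5 is used here only to keep the sketch stdlib-only.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Deterministic key-based partitioning: the same key always maps to
    the same partition, so consumers see each entity's events in order.
    (Kafka's default partitioner uses murmur2, not md5.)"""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for(b"customer-42", num_partitions=6)
p2 = partition_for(b"customer-42", num_partitions=6)
# Same key -> same partition, every time.
```

The same reasoning applies to Kinesis partition keys and Pub/Sub ordering keys; the tradeoff in all three is that a hot key becomes a hot partition.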
- Kafka topics, partitions, and replication
- Confluent Platform ecosystem components
- AWS Kinesis Streams and Firehose usage
- Google Pub/Sub design and quotas
- Producer design, batching, and backpressure
- Schema evolution with Avro and Schema Registry

Lesson 6: Live serving stores: Redis, RocksDB-backed stores, Cassandra, Druid for streaming OLAP queries
Study live serving stores such as Redis, RocksDB-backed engines, Cassandra, and Druid. Learn access patterns, data modeling, and how to support low-latency lookups and OLAP-style queries on fresh streaming data.
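To make the low-latency lookup pattern concrete, here is a toy in-process cache with per-key TTL and LRU eviction — a simplified stand-in for how a Redis layer in front of a serving store behaves, not Redis's actual implementation.

```python
import time
from collections import OrderedDict

class TTLCache:
    """Minimal Redis-like serving cache: get/set with per-key TTL and
    least-recently-used eviction when full, the access pattern a
    low-latency product or feature lookup layer relies on."""
    def __init__(self, max_items=1000):
        self.max_items = max_items
        self.store = OrderedDict()   # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds=60):
        self.store[key] = (value, time.monotonic() + ttl_seconds)
        self.store.move_to_end(key)
        if len(self.store) > self.max_items:
            self.store.popitem(last=False)   # evict LRU entry

    def get(self, key):
        item = self.store.get(key)
        if item is None or item[1] < time.monotonic():
            self.store.pop(key, None)        # expired or absent
            return None
        self.store.move_to_end(key)          # refresh recency
        return item[0]

cache = TTLCache(max_items=2)
cache.set("user:1", {"plan": "pro"})
hit = cache.get("user:1")
miss = cache.get("user:2")
```

Redis provides the same primitives (`SET key value EX ttl`, `maxmemory-policy allkeys-lru`) as server-side configuration rather than application code.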
- Redis as cache and primary data store
- RocksDB-backed stateful services
- Cassandra data modeling for time series
- Druid architecture for streaming OLAP
- Balancing consistency, latency, and cost
- Capacity planning and hotspot mitigation

Lesson 7: Data warehouse choices for analytics: Snowflake, BigQuery, Redshift — CTAS, materialized views, cost/freshness tradeoffs
Compare data warehouse options such as Snowflake, BigQuery, and Redshift. Learn CTAS patterns, materialized views, clustering, and how to balance cost, performance, and data freshness for analytics workloads.
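The CTAS (CREATE TABLE AS SELECT) pattern for derived tables is the same in Snowflake, BigQuery, and Redshift; a runnable sketch using in-memory SQLite as a stand-in warehouse, with hypothetical table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# CTAS: materialize an aggregated derived table so dashboards query a
# small precomputed table instead of re-scanning raw events -- trading
# storage and refresh cost for query speed and freshness.
conn.execute("""
    CREATE TABLE user_spend AS
    SELECT user_id, SUM(amount) AS total
    FROM events
    GROUP BY user_id
""")

totals = dict(conn.execute(
    "SELECT user_id, total FROM user_spend ORDER BY user_id"))
```

A CTAS table goes stale as new events arrive, which is exactly the cost-versus-freshness tradeoff the lesson covers: refresh on a schedule (cheap, stale) or use a materialized view with automatic refresh (fresher, costlier).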
- Snowflake virtual warehouses and scaling
- BigQuery storage and query optimization
- Redshift distribution and sort keys
- CTAS patterns for derived tables
- Materialized views and refresh policies
- Cost versus freshness tradeoffs and tuning

Lesson 8: Batch processing and orchestration: Apache Spark, Spark on EMR/Dataproc, dbt for transformations, Airflow/Cloud Composer/Managed Workflows for scheduling
Understand batch processing with Spark on EMR and Dataproc, and SQL-centric transformations with dbt. Learn orchestration patterns using Airflow, Cloud Composer, and Managed Workflows to build reliable, observable batch pipelines.
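The core of what an orchestrator does — run tasks in dependency order, retrying transient failures — can be sketched with the standard library's `graphlib`. This is a toy model of an Airflow DAG with `retries=2`, not Airflow's API; task names are hypothetical.

```python
from graphlib import TopologicalSorter

def run_dag(tasks, deps, max_retries=2):
    """Run callables in topological (dependency) order, retrying each up
    to max_retries times before failing the run -- the skeleton of what
    a scheduler does for a batch pipeline."""
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                results[name] = tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise
    return order, results

calls = {"n": 0}
def flaky_extract():
    """Fails once, then succeeds -- simulating a transient source error."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "raw"

tasks = {"extract": flaky_extract,
         "transform": lambda: "clean",
         "load": lambda: "done"}
deps = {"transform": {"extract"}, "load": {"transform"}}
order, results = run_dag(tasks, deps)
```

Airflow adds the pieces this sketch omits: scheduling, SLAs, persistence of task state, and per-task retry delays.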
- Spark cluster modes and resource sizing
- Spark job design for ETL and ELT
- dbt models, tests, and documentation
- Airflow DAG design and dependency management
- Scheduling, retries, and SLAs for batch jobs
- Monitoring, logging, and alerting for pipelines

Lesson 9: Feature stores and ML data platforms: Feast, Tecton, or custom feature pipelines on Delta Lake/BigQuery; online vs offline feature serving
Examine feature stores and ML data platforms using Feast, Tecton, or custom pipelines on Delta Lake and BigQuery. Learn feature definitions, lineage, and how to handle online versus offline serving for consistent model behavior.
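The guarantee that makes offline feature serving correct is the point-in-time join: a training row at time t must see only feature values written at or before t, or labels leak into features. A toy sketch of that lookup (entity and feature names are hypothetical; Feast and Tecton implement this at warehouse scale):

```python
from bisect import bisect_right

class OfflineFeatureStore:
    """Toy offline store with point-in-time lookups: for a training row
    at time ts, return the latest feature value written at or before ts,
    never a later one -- avoiding label leakage."""
    def __init__(self):
        self.history = {}   # (entity, feature) -> sorted [(ts, value)]

    def write(self, entity, feature, ts, value):
        rows = self.history.setdefault((entity, feature), [])
        rows.append((ts, value))
        rows.sort()

    def get_as_of(self, entity, feature, ts):
        rows = self.history.get((entity, feature), [])
        # Index of the first row strictly after ts; the row before it is
        # the newest value known at time ts.
        i = bisect_right(rows, (ts, float("inf")))
        return rows[i - 1][1] if i else None

store = OfflineFeatureStore()
store.write("user:7", "orders_30d", ts=100, value=3)
store.write("user:7", "orders_30d", ts=200, value=5)
as_of_150 = store.get_as_of("user:7", "orders_30d", ts=150)  # sees 3
as_of_250 = store.get_as_of("user:7", "orders_30d", ts=250)  # sees 5
```

Online serving flips the tradeoff: it keeps only the latest value per key in a low-latency store, which is why keeping online and offline pipelines consistent is a central feature-store concern.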
- Core concepts of feature stores and entities
- Feast architecture and deployment patterns
- Tecton capabilities and integration options
- Building custom feature pipelines on Delta Lake
- Offline feature computation in BigQuery
- Online versus offline feature serving design