Lesson 1: Handling channel metadata (channel-specific token patterns, metadata encoding)

Understand how to process channel metadata such as chat, email, and phone logs. We cover channel-specific token patterns, encoding strategies, and how to combine metadata with text for richer modeling in support systems.
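As a small preview of the encoding strategies above, here is a minimal sketch of one-hot channel encoding combined with a text feature vector. The channel vocabulary and the `combine_features` helper are hypothetical illustrations, not part of any specific library.

```python
# Hypothetical channel vocabulary for a support-ticket dataset.
CHANNELS = ["chat", "email", "phone"]

def one_hot_channel(channel):
    """Return a one-hot vector over the known channels.

    Unknown or missing channels map to an all-zeros vector, one simple
    way to handle missing channel metadata.
    """
    return [1.0 if channel == c else 0.0 for c in CHANNELS]

def combine_features(text_vector, channel):
    """Concatenate text features with the channel one-hot encoding."""
    return list(text_vector) + one_hot_channel(channel)
```

In practice an embedding layer would replace the one-hot vector for high-cardinality metadata, but the concatenation pattern stays the same.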
- Cataloging support channels and fields
- Channel-specific token patterns
- One-hot and embedding encodings
- Combining text and metadata features
- Handling missing channel metadata

Lesson 2: Emoji, emoticon, and non-standard token handling and mapping to sentiment signals

Study how to normalize emojis, emoticons, and other non-standard tokens while preserving sentiment. We discuss mapping strategies, lexicons, and how to integrate these signals into downstream sentiment and intent models for diverse user inputs.
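The lexicon-mapping idea above can be sketched in a few lines. The mini-lexicon below is a made-up example; a real system would use a curated emoji sentiment resource.

```python
# Hypothetical mini-lexicon mapping emojis/emoticons to sentiment weights.
EMOJI_SENTIMENT = {"🙂": 1.0, "😀": 1.0, "😡": -1.0, ":)": 0.5, ":(": -0.5}

def emoji_sentiment_score(text):
    """Sum the sentiment contributions of known emojis and emoticons."""
    score = 0.0
    for token, value in EMOJI_SENTIMENT.items():
        score += text.count(token) * value
    return score
```

The resulting score can then be appended as a numeric feature alongside the normalized text.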
- Cataloging emoji and emoticon usage
- Unicode handling and normalization
- Mapping tokens to sentiment scores
- Building custom emoji lexicons
- Integrating signals into models

Lesson 3: Punctuation, contractions, and tokenization strategies for English support text

Examine punctuation, contractions, and tokenization strategies for English support text. We compare rule-based and library tokenizers, handle edge cases, and align tokenization with downstream model requirements.
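A minimal sketch of the rule-based side of this comparison: a tiny contraction table plus a regex tokenizer that keeps internal apostrophes together. The contraction list is a hypothetical fragment; production lists are much larger.

```python
import re

# Hypothetical contraction table (real lists cover hundreds of forms).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def expand_contractions(text):
    """Replace known contractions, ignoring case."""
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def tokenize(text):
    """Rule-based tokenizer: words (with internal apostrophes) or single
    punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
```

Library tokenizers (spaCy, NLTK, Hugging Face tokenizers) handle far more edge cases, but a rule-based baseline like this makes the tradeoffs concrete.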
- Role of punctuation in support tickets
- Expanding and normalizing contractions
- Rule-based vs statistical tokenizers
- Handling URLs and emojis in tokens
- Tokenization for transformer models

Lesson 4: Stemming vs lemmatization: algorithms, libraries, and when to apply each

Compare stemming and lemmatization approaches, including algorithms and libraries. You will learn when to apply each method in support ticket workflows and how they affect vocabulary size and model behavior.
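To make the contrast concrete, here is a toy suffix-stripping stemmer next to a tiny dictionary lemmatizer. Both are deliberately simplified sketches (real choices would be e.g. NLTK's Porter stemmer or spaCy's lemmatizer); the point is that stemmers can produce non-words while lemmatizers return dictionary forms.

```python
# Hypothetical lemma dictionary; real lemmatizers use full morphological data.
LEMMA_DICT = {"ran": "run", "running": "run", "tickets": "ticket"}

def toy_stem(word):
    """Strip a few common suffixes; may yield non-words (over-stemming)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Dictionary lookup; falls back to the surface form when unknown."""
    return LEMMA_DICT.get(word, word)
```

Note how the stemmer maps "running" to the non-word "runn" while the lemmatizer returns "run" — the kind of behavior difference that drives vocabulary size and interpretability tradeoffs.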
- Rule-based and algorithmic stemmers
- Dictionary-based lemmatizers
- Library choices and performance
- Impact on vocabulary and sparsity
- Task-driven method selection

Lesson 5: Handling spelling mistakes, abbreviations, and domain-specific shorthand (spell correction, lookup dictionaries)

Explore methods to correct spelling, expand abbreviations, and normalize domain shorthand in tickets. You will combine spell correction, lookup dictionaries, and custom rules while avoiding harmful changes to key entities and codes.
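A minimal sketch of edit-distance-style correction with entity protection, using the standard library's `difflib.get_close_matches`. The vocabulary and the `TCK-1042`-style code pattern are hypothetical examples.

```python
import difflib
import re

# Hypothetical domain vocabulary for correction.
VOCAB = ["password", "reset", "account", "billing"]
# Ticket codes like "TCK-1042" must never be "corrected".
CODE_PATTERN = re.compile(r"^[A-Z]{2,5}-\d+$")

def correct_token(token):
    """Correct a token against the vocabulary, protecting identifiers."""
    if CODE_PATTERN.match(token):
        return token  # leave entities and codes untouched
    matches = difflib.get_close_matches(token.lower(), VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else token
```

The high cutoff keeps the corrector conservative — an unmatched token passes through unchanged rather than being forced onto the nearest vocabulary word.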
- Common error types in support text
- Dictionary and edit-distance correction
- Custom domain abbreviation lexicons
- Context-aware correction strategies
- Protecting entities and codes

Lesson 6: Stopword removal tradeoffs and configurable stopword lists for support ticket domains

Examine the tradeoffs of stopword removal in support ticket domains. You will design configurable stopword lists, evaluate their impact on models, and handle domain-specific function words that may carry subtle intent.
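The layered-list idea can be sketched as a base list plus domain additions and a keep-list override. All three word lists below are hypothetical examples.

```python
# Layered stopword configuration (all lists are illustrative).
BASE_STOPWORDS = {"the", "a", "is", "to", "and"}
DOMAIN_EXTRA = {"ticket", "please"}  # frequent but uninformative here
DOMAIN_KEEP = {"not", "no"}          # negations often carry intent

def build_stopwords(base=BASE_STOPWORDS, extra=DOMAIN_EXTRA, keep=DOMAIN_KEEP):
    """Combine base and domain lists, then remove protected words."""
    return (set(base) | set(extra)) - set(keep)

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t.lower() not in stopwords]
```

Keeping the layers separate makes the configuration auditable: an ablation run can toggle `DOMAIN_EXTRA` or `DOMAIN_KEEP` independently.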
- Standard vs domain stopword lists
- Impact on bag-of-words features
- Effect on embeddings and transformers
- Configurable and layered stopword sets
- Evaluating removal with ablation

Lesson 7: Text normalization fundamentals: lowercasing, Unicode normalization, whitespace and linebreak handling

Cover core text normalization steps such as lowercasing, Unicode normalization, and whitespace cleanup. We discuss ordering of operations, language-specific caveats, and preserving important formatting cues.
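The core steps above compose into a short stdlib-only pipeline. The specific ordering here (NFKC, then lowercase, then whitespace cleanup) is one reasonable choice, not the only valid one.

```python
import re
import unicodedata

def normalize_text(text):
    """Apply a fixed ordering: Unicode NFKC, lowercase, whitespace cleanup."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs, incl. linebreaks
    return text
```

NFKC also folds fullwidth characters and non-breaking spaces into their ASCII/plain counterparts, which is usually desirable for support text but can erase formatting cues you meant to keep — hence the lesson's emphasis on operation ordering.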
- Lowercasing and case preservation rules
- Unicode normalization forms
- Handling accents and special symbols
- Whitespace and linebreak cleanup
- Ordering normalization operations

Lesson 8: Data splitting strategies: time-based splits, stratified sampling by topic/sentiment, and nested cross-validation considerations

Study data splitting strategies tailored to temporal and labeled ticket data. We compare time-based splits, stratified sampling by topic or sentiment, and nested cross-validation for robust model evaluation.
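A time-based split reduces to a single cutoff comparison once records carry timestamps. The `(created_at, payload)` tuple shape below is an assumption for illustration.

```python
from datetime import datetime

def temporal_split(records, cutoff):
    """Split records into train/test by timestamp to avoid temporal leakage.

    Each record is a (created_at, payload) tuple; everything at or after
    the cutoff goes to the test set.
    """
    records = sorted(records, key=lambda r: r[0])
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test
```

Unlike a random holdout, this guarantees the model never trains on tickets created after any ticket it is evaluated on.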
- Holdout, k-fold, and temporal splits
- Stratification by topic and sentiment
- Preventing temporal data leakage
- Nested cross-validation workflows
- Aligning splits with business goals

Lesson 9: Handling URLs, email addresses, code snippets, and identifiers in text (masking vs preserving)

Learn strategies for handling URLs, emails, code snippets, and identifiers in text. We compare masking, normalization, and preservation choices, focusing on privacy, deduplication, and model performance implications.
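The masking option can be sketched with two regexes and placeholder tokens. The patterns below are simplified illustrations, not exhaustive URL/email grammars.

```python
import re

# Simplified patterns — real-world URL/email detection needs more care.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def mask_pii(text):
    """Replace URLs and email addresses with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    return text
```

Placeholder tokens keep the fact that a URL or address was present (useful for intent models) while removing the identifying content itself.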
- Detecting URLs and email patterns
- Masking versus normalization rules
- Representing code snippets safely
- Handling ticket and user identifiers
- Privacy and leakage considerations

Lesson 10: Understanding CSV schema and data types (ticket_id, created_at, customer_id, text, channel, resolved, resolution_time_hours, manual_topic, manual_sentiment)

Learn to interpret CSV schemas for ticket datasets and assign correct data types. We cover parsing identifiers, timestamps, booleans, and text fields, plus validation checks that prevent subtle downstream errors.
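A minimal stdlib sketch of typed parsing for a few of the columns named above, assuming booleans are stored as "True"/"False" strings and missing durations as empty fields — both assumptions about the file, not guarantees.

```python
import csv
import io

def parse_ticket_row(row):
    """Coerce a raw CSV row (all strings) into typed values."""
    return {
        "ticket_id": row["ticket_id"],
        "text": row["text"],
        "channel": row["channel"],
        "resolved": row["resolved"].strip().lower() == "true",
        "resolution_time_hours": (
            float(row["resolution_time_hours"])
            if row["resolution_time_hours"] else None
        ),
    }

def load_tickets(csv_text):
    return [parse_ticket_row(r) for r in csv.DictReader(io.StringIO(csv_text))]
```

Doing the coercion explicitly (rather than trusting type inference) is what makes the validation checks in this lesson possible: a malformed duration raises immediately instead of silently becoming a string column.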
- Inspecting headers and sample rows
- Assigning robust column data types
- Validating timestamps and IDs
- Detecting malformed or mixed types
- Schema validation in pipelines

Lesson 11: Techniques to detect and quantify missing values and label noise (missingness patterns, label consistency checks, inter-annotator metrics)

Learn to detect missing values and noisy labels in support ticket datasets. We cover missingness patterns, label consistency checks, and inter-annotator agreement metrics to quantify label quality and guide cleaning decisions.
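Cohen's kappa is the classic two-annotator agreement metric mentioned above; a compact implementation fits in a few lines.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each annotator's label
    frequencies.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate strong agreement; values near 0 mean agreement no better than chance — a signal that the labeling guidelines (or the labels themselves) need attention.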
- Types of missingness in ticket datasets
- Visualizing missingness patterns
- Detecting inconsistent labels
- Inter-annotator agreement metrics
- Heuristics to flag label noise

Lesson 12: Creating reproducible pipelines and versioning cleaned datasets (data contracts, hashing)

Learn to build reproducible preprocessing pipelines and versioned cleaned datasets. We cover modular pipeline design, configuration management, hashing, and data contracts that keep models, code, and data aligned over time.
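The hashing idea reduces to computing a stable fingerprint over a canonical serialization of the data, sketched here for a list of row dicts.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Stable SHA-256 fingerprint of a list of row dicts.

    Keys are sorted during serialization so logically identical datasets
    hash identically regardless of dict insertion order.
    """
    canonical = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing this fingerprint next to a model checkpoint lets you verify later that the model was trained on exactly the dataset version you think it was.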
- Designing modular preprocessing steps
- Configuration and parameter tracking
- Hashing raw and processed datasets
- Data contracts and schema guarantees
- Logging and audit trails for changes

Lesson 13: Date/time parsing and timezone handling, deriving temporal features (daypart, weekday, recency)

Understand how to parse heterogeneous date and time fields, handle timezones, and derive temporal features. We focus on robust parsing, normalization to a canonical timezone, and engineered features such as recency and seasonality.
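A stdlib sketch of the normalize-then-derive pattern: timestamps are first forced to UTC, then daypart, weekday, and recency features are computed. Treating naive timestamps as UTC is an explicit assumption here; real pipelines should know the source timezone.

```python
from datetime import datetime, timezone

def to_utc(dt):
    """Normalize a datetime to UTC; naive timestamps are assumed UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

def temporal_features(created_at, now):
    """Derive daypart, weekday, and recency from a ticket timestamp."""
    created_at, now = to_utc(created_at), to_utc(now)
    # 0-5 night, 6-11 morning, 12-17 afternoon, 18-23 evening
    daypart = ("night", "morning", "afternoon", "evening")[created_at.hour // 6]
    return {
        "weekday": created_at.strftime("%A"),
        "daypart": daypart,
        "recency_hours": (now - created_at).total_seconds() / 3600.0,
    }
```

Normalizing to UTC before deriving features keeps recency arithmetic correct across tickets from different timezones; local-time features like daypart may instead want the customer's zone, a deliberate design choice.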
- Parsing heterogeneous date formats
- Timezone normalization strategies
- Handling missing or invalid timestamps
- Deriving recency and age features
- Daypart, weekday, and seasonality

Lesson 14: Imputation and treatment of non-text columns (resolved, resolution_time_hours, channel) for modeling

Explore imputation and preprocessing for non-text columns such as resolution status, resolution time, and channel. We discuss encoding strategies, leakage risks, and how to align these features with text for modeling.
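For a numeric duration column like `resolution_time_hours`, median imputation plus a missingness indicator is a common baseline; a minimal sketch:

```python
def impute_resolution_times(values):
    """Median-impute missing durations and return a missingness indicator.

    Keeping the indicator preserves the information that a value was
    missing, which can itself be predictive.
    """
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to impute from")
    mid = len(observed) // 2
    median = (observed[mid] if len(observed) % 2
              else (observed[mid - 1] + observed[mid]) / 2)
    imputed = [(v if v is not None else median) for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing
```

Note the leakage caveat from the lesson: `resolution_time_hours` is only known after a ticket closes, so it must not feed models that predict outcomes at ticket-creation time.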
- Profiling non-text ticket columns
- Imputation for numeric durations
- Encoding categorical status fields
- Avoiding target leakage in features
- Joint modeling with text signals