Lesson 1: Handling channel metadata (channel-specific token patterns, metadata encoding)

Understand how to process channel metadata such as chat, email, and phone logs. We cover channel-specific token patterns, encoding strategies, and how to combine metadata with text for richer modeling.
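As a rough illustration of combining text with one-hot channel metadata, here is a minimal pandas sketch; the sample tickets and column names are hypothetical:

```python
import pandas as pd

# Hypothetical ticket sample; column names follow this course's CSV schema.
tickets = pd.DataFrame({
    "text": ["app crashes on login", "refund not received", "call dropped twice"],
    "channel": ["chat", "email", "phone"],
})

# One-hot encode the channel column so it can sit alongside text features.
channel_onehot = pd.get_dummies(tickets["channel"], prefix="channel")

# Combine the metadata columns with the text for downstream feature extraction.
features = pd.concat([tickets[["text"]], channel_onehot], axis=1)
```

For high-cardinality channels, a learned embedding layer would replace the one-hot columns, but the join-with-text pattern stays the same.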
- Cataloging support channels and fields
- Channel-specific token patterns
- One-hot and embedding encodings
- Combining text and metadata features
- Handling missing channel metadata

Lesson 2: Emoji, emoticon, and non-standard token handling and mapping to sentiment signals

Study how to normalize emojis, emoticons, and other non-standard tokens while preserving sentiment. We discuss mapping strategies, lexicons, and how to integrate these signals into downstream sentiment and intent models.
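A minimal sketch of mapping emoji and emoticon tokens to sentiment scores; the lexicon here is a tiny hypothetical example, far smaller than any real one:

```python
# Hypothetical emoji/emoticon sentiment lexicon (illustrative scores only).
EMOJI_SENTIMENT = {
    "🙂": 1.0, "😡": -1.0, "🎉": 1.0, ":)": 0.5, ":(": -0.5,
}

def emoji_sentiment_score(tokens):
    """Average sentiment over tokens found in the lexicon; 0.0 if none match."""
    hits = [EMOJI_SENTIMENT[t] for t in tokens if t in EMOJI_SENTIMENT]
    return sum(hits) / len(hits) if hits else 0.0
```

The resulting score can be appended as a numeric feature next to the text, rather than deleting the emoji outright.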
- Cataloging emoji and emoticon usage
- Unicode handling and normalization
- Mapping tokens to sentiment scores
- Building custom emoji lexicons
- Integrating signals into models

Lesson 3: Punctuation, contractions, and tokenization strategies for English support text

Examine punctuation, contractions, and tokenization strategies for English support text. We compare rule-based and library tokenizers, handle edge cases, and align tokenization with downstream model requirements.
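One way to sketch a rule-based tokenizer that expands contractions before splitting; the contraction map is a small illustrative subset of what a production list would hold:

```python
import re

# Hypothetical contraction map; real lists cover many more forms.
CONTRACTIONS = {"can't": "can not", "won't": "will not", "it's": "it is", "i'm": "i am"}

def tokenize(text):
    text = text.lower()
    # Expand contractions first so "can't" does not split into "can", "t".
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Word characters stay together; each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)
```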
- Role of punctuation in support tickets
- Expanding and normalizing contractions
- Rule-based vs statistical tokenizers
- Handling URLs and emojis in tokens
- Tokenization for transformer models

Lesson 4: Stemming vs lemmatization: algorithms, libraries, and when to apply each

Compare stemming and lemmatization approaches, including algorithms and libraries. You will learn when to apply each method in support ticket workflows and how they affect vocabulary size and model behavior.
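To illustrate the difference, here is a toy suffix-stripping stemmer next to a toy dictionary lemmatizer. Real systems use algorithmic stemmers such as Porter or Snowball and dictionary-backed lemmatizers; treat these as teaching sketches only:

```python
# Toy suffix-stripping stemmer (not Porter; illustrative only).
def stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least a 3-character stem to avoid mangling short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy dictionary lemmatizer; real ones use morphology and part-of-speech tags.
LEMMA_DICT = {"ran": "run", "running": "run", "better": "good", "tickets": "ticket"}

def lemmatize(word):
    return LEMMA_DICT.get(word, word)
```

The contrast shows the tradeoff: the stemmer is fast but produces non-words ("runn"), while the lemmatizer returns dictionary forms but only for words it knows.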
- Rule-based and algorithmic stemmers
- Dictionary-based lemmatizers
- Library choices and performance
- Impact on vocabulary and sparsity
- Task-driven method selection

Lesson 5: Handling spelling mistakes, abbreviations, and domain-specific shorthand (spell correction, lookup dictionaries)

Explore methods to correct spelling, expand abbreviations, and normalize domain shorthand in tickets. You will combine spell correction, lookup dictionaries, and custom rules while avoiding harmful changes to key entities and codes.
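A sketch of edit-distance-style correction with an entity guard, using Python's standard difflib; the vocabulary and the digit-based guard rule are illustrative assumptions:

```python
import difflib
import re

# Hypothetical in-domain vocabulary; in practice, built from clean ticket text.
VOCAB = {"password", "refund", "invoice", "login", "reset"}

def correct(token, vocab=VOCAB, cutoff=0.8):
    # Guard rule: never touch tokens containing digits, since they are likely
    # ticket IDs or error codes that a "correction" would destroy.
    if re.search(r"\d", token) or token in vocab:
        return token
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```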
- Common error types in support text
- Dictionary and edit-distance correction
- Custom domain abbreviation lexicons
- Context-aware correction strategies
- Protecting entities and codes

Lesson 6: Stopword removal tradeoffs and configurable stopword lists for support ticket domains

Examine the tradeoffs of stopword removal in support ticket domains. You will design configurable stopword lists, evaluate their impact on models, and handle domain-specific function words that may carry subtle intent.
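The layered, configurable approach might look like this minimal sketch; the word lists are hypothetical:

```python
# Layered stopword configuration: a base list, domain additions, and a
# keep-list for function words that carry intent in support text.
BASE_STOPWORDS = {"the", "a", "an", "is", "to", "and", "not", "no"}
DOMAIN_STOPWORDS = {"ticket", "please", "thanks"}
KEEP_WORDS = {"not", "no"}  # negations often flip sentiment; keep them

def build_stopwords(base=BASE_STOPWORDS, domain=DOMAIN_STOPWORDS, keep=KEEP_WORDS):
    return (base | domain) - keep

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]
```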
- Standard vs domain stopword lists
- Impact on bag-of-words features
- Effect on embeddings and transformers
- Configurable and layered stopword sets
- Evaluating removal with ablation

Lesson 7: Text normalization fundamentals: lowercasing, Unicode normalization, whitespace and linebreak handling

Cover core text normalization steps such as lowercasing, Unicode normalization, and whitespace cleanup. We discuss ordering of operations, language-specific caveats, and preserving important formatting cues.
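A possible ordering of these steps, as a small sketch using Python's standard unicodedata module:

```python
import re
import unicodedata

def normalize(text):
    # Order matters: Unicode normalization first, then casing, then whitespace.
    text = unicodedata.normalize("NFKC", text)   # fold compatibility forms
    text = text.lower()
    text = re.sub(r"\r\n?", "\n", text)          # normalize linebreaks
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces/tabs
    return text.strip()
```

NFKC folds compatibility characters (ligatures, width variants, unusual spaces) into canonical forms; for tasks where those distinctions matter, the gentler NFC form may be the better choice.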
- Lowercasing and case preservation rules
- Unicode normalization forms
- Handling accents and special symbols
- Whitespace and linebreak cleanup
- Ordering normalization operations

Lesson 8: Data splitting strategies: time-based splits, stratified sampling by topic/sentiment, and nested cross-validation considerations

Study data splitting strategies tailored to temporal and labeled ticket data. We compare time-based splits, stratified sampling by topic or sentiment, and nested cross-validation for robust model evaluation.
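A time-based split can be sketched as follows; the sample frame and the 20% test fraction are illustrative:

```python
import pandas as pd

def temporal_split(df, time_col="created_at", test_frac=0.2):
    """Train on the oldest tickets, test on the newest, to avoid leakage."""
    df = df.sort_values(time_col)
    cut = int(len(df) * (1 - test_frac))
    return df.iloc[:cut], df.iloc[cut:]

# Hypothetical ticket frame with one ticket per day.
tickets = pd.DataFrame({
    "ticket_id": range(10),
    "created_at": pd.date_range("2024-01-01", periods=10, freq="D"),
})
train, test = temporal_split(tickets)
```

A random shuffle here would let the model peek at future tickets; the sort-then-cut split guarantees every training timestamp precedes every test timestamp.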
- Holdout, k-fold, and temporal splits
- Stratification by topic and sentiment
- Preventing temporal data leakage
- Nested cross-validation workflows
- Aligning splits with business goals

Lesson 9: Handling URLs, email addresses, code snippets, and identifiers in text (masking vs preserving)

Learn strategies for handling URLs, emails, code snippets, and identifiers in text. We compare masking, normalization, and preservation choices, focusing on privacy, deduplication, and model performance implications.
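Masking with placeholder tokens might be sketched like this; the regexes are simplified and would need hardening for production text:

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def mask(text):
    # Replace matches with placeholder tokens instead of deleting them, so the
    # model still sees that a URL or email was present.
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    return text
```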
- Detecting URLs and email patterns
- Masking versus normalization rules
- Representing code snippets safely
- Handling ticket and user identifiers
- Privacy and leakage considerations

Lesson 10: Understanding CSV schema and data types (ticket_id, created_at, customer_id, text, channel, resolved, resolution_time_hours, manual_topic, manual_sentiment)

Learn to interpret CSV schemas for ticket datasets and assign correct data types. We cover parsing identifiers, timestamps, booleans, and text fields, plus validation checks that prevent subtle downstream errors.
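Reading the schema with explicit dtypes might look like this pandas sketch; the inline sample covers only part of the column list above, and the values are made up:

```python
import io
import pandas as pd

# Tiny inline sample mirroring part of the ticket schema.
csv_data = """ticket_id,created_at,customer_id,text,channel,resolved,resolution_time_hours
T1,2024-03-01T10:00:00,C9,login fails,chat,True,4.5
T2,2024-03-02T11:30:00,C7,refund missing,email,False,
"""

df = pd.read_csv(
    io.StringIO(csv_data),
    # Explicit dtypes keep IDs as strings (no stripped leading zeros) and
    # make the channel a category rather than free text.
    dtype={"ticket_id": "string", "customer_id": "string", "channel": "category"},
    parse_dates=["created_at"],
)
```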
- Inspecting headers and sample rows
- Assigning robust column data types
- Validating timestamps and IDs
- Detecting malformed or mixed types
- Schema validation in pipelines

Lesson 11: Techniques to detect and quantify missing values and label noise (missingness patterns, label consistency checks, inter-annotator metrics)

Learn to detect missing values and noisy labels in support ticket datasets. We cover missingness patterns, label consistency checks, and inter-annotator agreement metrics to quantify label quality and guide cleaning decisions.
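One common inter-annotator metric, Cohen's kappa, corrects raw agreement for chance. This plain-Python sketch assumes two aligned label lists and chance agreement below 1:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: the product of each annotator's label frequencies.
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement; values near 0 mean agreement no better than chance, which is a strong signal of label noise.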
- Types of missingness in ticket datasets
- Visualizing missingness patterns
- Detecting inconsistent labels
- Inter-annotator agreement metrics
- Heuristics to flag label noise

Lesson 12: Creating reproducible pipelines and versioning cleaned datasets (data contracts, hashing)

Learn to build reproducible preprocessing pipelines and versioned cleaned datasets. We cover modular pipeline design, configuration management, hashing, and data contracts that keep models, code, and data aligned over time.
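Hashing a cleaned dataset for versioning can be as simple as this sketch; it assumes rows have already been serialized to strings in a canonical order:

```python
import hashlib

def dataset_fingerprint(rows):
    """Stable SHA-256 fingerprint of a dataset, for versioning cleaned outputs."""
    h = hashlib.sha256()
    for row in rows:  # rows: iterable of strings in a canonical order
        h.update(row.encode("utf-8"))
        h.update(b"\n")  # delimiter so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()
```

Storing this fingerprint next to the pipeline's configuration lets you detect when either the data or the preprocessing has silently changed.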
- Designing modular preprocessing steps
- Configuration and parameter tracking
- Hashing raw and processed datasets
- Data contracts and schema guarantees
- Logging and audit trails for changes

Lesson 13: Date/time parsing and timezone handling, deriving temporal features (daypart, weekday, recency)

Understand how to parse heterogeneous date and time fields, handle timezones, and derive temporal features. We focus on robust parsing, normalization to canonical time, and engineered features such as recency and seasonality.
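Deriving daypart, weekday, and recency from a timezone-aware timestamp might be sketched as follows; the daypart hour boundaries are an arbitrary illustrative choice:

```python
from datetime import datetime, timezone

def temporal_features(ts: datetime, now: datetime):
    """Derive daypart, weekday, and recency from a timezone-aware timestamp."""
    ts_utc = ts.astimezone(timezone.utc)  # normalize to a canonical timezone
    hour = ts_utc.hour
    if 6 <= hour < 12:
        daypart = "morning"
    elif 12 <= hour < 18:
        daypart = "afternoon"
    elif 18 <= hour < 22:
        daypart = "evening"
    else:
        daypart = "night"
    return {
        "daypart": daypart,
        "weekday": ts_utc.strftime("%A"),
        "recency_hours": (now - ts_utc).total_seconds() / 3600,
    }
```

Passing `now` in explicitly, rather than calling for the current time inside the function, keeps the feature computation reproducible.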
- Parsing heterogeneous date formats
- Timezone normalization strategies
- Handling missing or invalid timestamps
- Deriving recency and age features
- Daypart, weekday, and seasonality

Lesson 14: Imputation and treatment of non-text columns (resolved, resolution_time_hours, channel) for modeling

Explore imputation and preprocessing for non-text columns such as resolution status, resolution time, and channel. We discuss encoding strategies, leakage risks, and how to align these features with text for modeling.
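A minimal imputation sketch for these columns; the fill rules (median duration, an "unknown" channel, False for unresolved) are illustrative defaults, and the missingness flag is kept as a feature in its own right:

```python
import pandas as pd

def impute_non_text(df):
    df = df.copy()
    # Flag missingness before filling, so the model can learn from it.
    df["resolution_time_missing"] = df["resolution_time_hours"].isna()
    df["resolution_time_hours"] = df["resolution_time_hours"].fillna(
        df["resolution_time_hours"].median()
    )
    df["channel"] = df["channel"].fillna("unknown")
    df["resolved"] = df["resolved"].fillna(False)
    return df
```

Note the leakage risk flagged in this lesson: the median here should be computed on the training split only and then applied to validation and test data.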
- Profiling non-text ticket columns
- Imputation for numeric durations
- Encoding categorical status fields
- Avoiding target leakage in features
- Joint modeling with text signals