Lesson 1: Handling channel metadata (channel-specific token patterns, metadata encoding)
Learn how to handle metadata from different support channels such as chat, email, and phone calls. We cover channel-specific token patterns, ways to encode the metadata, and combining it with message text for better analysis in South Sudanese support systems.
Cataloging support channels and fields
Channel-specific token patterns
One-hot and embedding encodings
Combining text and metadata features
Handling missing channel metadata

Lesson 2: Emoji, emoticon and non-standard token handling and mapping to sentiment signals
We study how to normalize emojis, emoticons, and non-standard tokens while preserving the sentiment they carry. We discuss mapping them to sentiment signals, building lexicons, and adding these signals to sentiment and intent models in local languages.
Cataloging emoji and emoticon usage
Unicode handling and normalization
Mapping tokens to sentiment scores
Building custom emoji lexicons
Integrating signals into models

Lesson 3: Punctuation, contractions, and tokenization strategies for English support text
We examine punctuation, contractions, and tokenization strategies for English support messages. We compare rule-based and statistical tokenizers, handle edge cases such as URLs and emojis, and match tokenization to the needs of downstream models.
Role of punctuation in support tickets
Expanding and normalizing contractions
Rule-based vs statistical tokenizers
Handling URLs and emojis in tokens
Tokenization for transformer models

Lesson 4: Stemming vs lemmatization: algorithms, libraries, and when to apply each
We compare stemming and lemmatization, along with their algorithms and libraries. You will learn when to use each in ticket processing and how they affect vocabulary and model behavior in our context.
Rule-based and algorithmic stemmers
Dictionary-based lemmatizers
Library choices and performance
Impact on vocabulary and sparsity
Task-driven method selection

Lesson 5: Handling spelling mistakes, abbreviations, and domain-specific shorthand (spell correction, lookup dictionaries)
We look at correcting spelling errors, expanding abbreviations, and handling domain-specific shorthand in tickets. You will combine spell-correction tools, lookup dictionaries, and custom rules without damaging important names and codes used in South Sudan.
Common error types in support text
Dictionary and edit-distance correction
Custom domain abbreviation lexicons
Context-aware correction strategies
Protecting entities and codes

Lesson 6: Stopword removal tradeoffs and configurable stopword lists for support ticket domains
We weigh the benefits and costs of removing stopwords in support ticket domains. You will build configurable stopword lists, measure their effect on models, and handle domain-specific words that carry hidden meaning in local support.
Standard vs domain stopword lists
Impact on bag-of-words features
Effect on embeddings and transformers
Configurable and layered stopword sets
Evaluating removal with ablation

Lesson 7: Text normalization fundamentals: lowercasing, Unicode normalization, whitespace and linebreak handling
We cover the core steps for standardizing text: lowercasing, Unicode normalization, and whitespace cleanup. We discuss the order of operations, language-specific issues, and preserving useful formatting.
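As a rough illustration of how these normalization steps might be ordered, here is a minimal sketch using only the Python standard library. The function name normalize_text, the choice of NFKC, and the rule of keeping at most one blank line are assumptions made for this example, not a prescribed pipeline.

```python
import re
import unicodedata


def normalize_text(text: str, lowercase: bool = True) -> str:
    """Hypothetical helper: Unicode, linebreak, and whitespace cleanup."""
    # 1. Unicode normalization first, so compatibility characters
    #    (full-width letters, non-breaking spaces) become plain forms.
    text = unicodedata.normalize("NFKC", text)
    # 2. Standardize line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 3. Collapse runs of spaces and tabs inside each line, then trim it.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # 4. Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    # 5. Lowercase last and make it optional, since case can carry
    #    signal (for example, shouting in all caps).
    if lowercase:
        text = text.lower()
    return text


sample = "\uff28ello\u00a0 World\r\n\r\n\r\n\r\nBye  "
cleaned = normalize_text(sample)
```

Running Unicode normalization before the whitespace rules means characters such as the non-breaking space are already plain spaces by the time the regular expressions see them.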
Lowercasing and case preservation rules
Unicode normalization forms
Handling accents and special symbols
Whitespace and linebreak cleanup
Ordering normalization operations

Lesson 8: Data splitting strategies: time-based splits, stratified sampling by topic/sentiment, and nested cross-validation considerations
We study ways to split time-dependent, labeled ticket data. We compare time-based splits, stratified sampling by topic or sentiment, and nested cross-validation for robust model evaluation on our data.
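A time-based split can be sketched in a few lines. The example below assumes each ticket is a dictionary with a created_at datetime (the column name comes from the course CSV schema); the function name temporal_split and the 20% test fraction are illustrative choices.

```python
from datetime import datetime


def temporal_split(rows, test_fraction=0.2):
    """Hypothetical helper: oldest tickets train, newest tickets test."""
    # Sort by creation time so the test set lies strictly at the end
    # of the timeline, which avoids training on "future" tickets.
    ordered = sorted(rows, key=lambda r: r["created_at"])
    n_test = max(1, int(round(len(ordered) * test_fraction)))
    split = len(ordered) - n_test
    return ordered[:split], ordered[split:]


tickets = [
    {"ticket_id": i, "created_at": datetime(2024, 1, day)}
    for i, day in enumerate([5, 1, 9, 3, 7])
]
train, test = temporal_split(tickets, test_fraction=0.2)
```

Unlike a random split, this keeps every training ticket no later than every test ticket, which is closer to how a model would be used after deployment.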
Holdout, k-fold, and temporal splits
Stratification by topic and sentiment
Preventing temporal data leakage
Nested cross-validation workflows
Aligning splits with business goals

Lesson 9: Handling URLs, email addresses, code snippets, and identifiers in text (masking vs preserving)
Learn ways to handle URLs, email addresses, code snippets, and identifiers in messages. We compare masking, normalizing, and preserving them, with attention to privacy, deduplication, and model performance.
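The masking option can be sketched with two regular expressions. The patterns below are deliberately simple assumptions that will miss some valid URLs and addresses, and the placeholder tokens <URL> and <EMAIL> as well as the function name mask_entities are hypothetical choices for this example.

```python
import re

# Simplified patterns, assumed for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def _mask_url(match):
    # Keep sentence punctuation that the greedy pattern swallowed.
    url = match.group(0)
    core = url.rstrip(".,;:!?)")
    return "<URL>" + url[len(core):]


def mask_entities(text: str) -> str:
    """Replace URLs and email addresses with placeholder tokens."""
    text = URL_RE.sub(_mask_url, text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    return text


masked = mask_entities(
    "Contact me at john.doe@example.com or see https://example.com/help?id=5."
)
```

Masking removes personal data and makes near-duplicate tickets easier to detect, at the cost of losing information such as the domain, which a normalization rule could keep instead.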
Detecting URLs and email patterns
Masking versus normalization rules
Representing code snippets safely
Handling ticket and user identifiers
Privacy and leakage considerations

Lesson 10: Understanding CSV schema and data types (ticket_id, created_at, customer_id, text, channel, resolved, resolution_time_hours, manual_topic, manual_sentiment)
Learn to read CSV schemas for ticket data and assign the correct data types. We cover parsing IDs, timestamps, booleans, and text, plus validation checks that prevent subtle downstream errors in South Sudanese systems.
Inspecting headers and sample rows
Assigning robust column data types
Validating timestamps and IDs
Detecting malformed or mixed types
Schema validation in pipelines

Lesson 11: Techniques to detect and quantify missing values and label noise (missingness patterns, label consistency checks, inter-annotator metrics)
Learn to detect missing values and noisy labels in ticket data. We cover missingness patterns, label consistency checks, and inter-annotator agreement metrics to assess label quality and guide cleaning.
Types of missingness in ticket datasets
Visualizing missingness patterns
Detecting inconsistent labels
Inter-annotator agreement metrics
Heuristics to flag label noise

Lesson 12: Creating reproducible pipelines and versioning cleaned datasets (data contracts, hashing)
Learn to build reproducible cleaning pipelines and versioned clean datasets. We cover modular design, configuration control, hashing, and data contracts that keep models, code, and data aligned over time.
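One way to version a cleaned dataset is to hash its content together with the configuration that produced it. The sketch below assumes rows are JSON-serializable dictionaries; the function name dataset_fingerprint and the config keys are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows, config) -> str:
    """Hypothetical helper: SHA-256 over the config and every row."""
    digest = hashlib.sha256()
    # sort_keys makes the hash independent of dictionary key order.
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    digest.update(b"\n")
    for row in rows:
        line = json.dumps(row, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    # Row order still matters: reordering tickets changes the hash.
    return digest.hexdigest()


rows = [{"ticket_id": 1, "text": "hello"}, {"ticket_id": 2, "text": "bye"}]
config = {"lowercase": True, "unicode_form": "NFKC"}
fingerprint = dataset_fingerprint(rows, config)
```

Storing the fingerprint next to a trained model makes it possible to check later whether the data or the cleaning settings changed between runs.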
Designing modular preprocessing steps
Configuration and parameter tracking
Hashing raw and processed datasets
Data contracts and schema guarantees
Logging and audit trails for changes

Lesson 13: Date/time parsing and timezone handling, deriving temporal features (daypart, weekday, recency)
Understand how to parse heterogeneous date and time fields, handle timezones, and derive temporal features. We focus on robust parsing, normalized time, and derived features such as recency and seasonality.
Parsing heterogeneous date formats
Timezone normalization strategies
Handling missing or invalid timestamps
Deriving recency and age features
Daypart, weekday, and seasonality

Lesson 14: Imputation and treatment of non-text columns (resolved, resolution_time_hours, channel) for modeling
Explore imputing and preparing non-text columns such as resolution status, resolution time, and channel. We discuss encoding options, leakage risks, and aligning these columns with text for modeling in local support.
Profiling non-text ticket columns
Imputation for numeric durations
Encoding categorical status fields
Avoiding target leakage in features
Joint modeling with text signals
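The imputation ideas in Lesson 14 can be sketched with the standard library alone. The column names come from the course CSV schema; the function names, the median strategy, the "unknown" channel category, and treating a missing resolved value as unresolved are all assumptions made for this example.

```python
from statistics import median


def fit_duration_fill(train_rows):
    """Learn the fill value from training rows only, to avoid leakage."""
    known = [
        r["resolution_time_hours"]
        for r in train_rows
        if r.get("resolution_time_hours") is not None
    ]
    return median(known) if known else 0.0


def prepare_non_text(rows, fill_value):
    """Hypothetical helper: impute and encode the non-text columns."""
    prepared = []
    for r in rows:
        hours = r.get("resolution_time_hours")
        prepared.append({
            "resolution_time_hours": fill_value if hours is None else float(hours),
            # Missingness flag keeps the fact that the value was imputed.
            "resolution_time_missing": int(hours is None),
            # Explicit category instead of dropping rows without a channel.
            "channel": r.get("channel") or "unknown",
            # Assumption: a missing status counts as unresolved.
            "resolved": int(bool(r.get("resolved"))),
        })
    return prepared


train_rows = [
    {"resolution_time_hours": 2.0, "channel": "email", "resolved": True},
    {"resolution_time_hours": 4.0, "channel": "chat", "resolved": True},
    {"resolution_time_hours": None, "channel": None, "resolved": None},
]
fill = fit_duration_fill(train_rows)
features = prepare_non_text(train_rows, fill)
```

Fitting the fill value on the training split and reusing it for validation and test data is one guard against leakage. A second is to ask whether resolved and resolution_time_hours would be known at prediction time; if they would not, they should be left out of the features.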