Lesson 1: Handling channel metadata (channel-specific token patterns, metadata encoding)
Learn how to handle channel metadata from chat, email, and phone records. We cover token patterns specific to each channel, ways to encode channel metadata, and how to combine that metadata with text features to build better models of Eritrean customer interactions.
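One simple way to encode the channel field is a one-hot vector with an extra slot for missing or unrecognized values. The sketch below is illustrative only: the channel vocabulary and the helper name `one_hot_channel` are assumptions, not part of the course material.

```python
# Assumed channel vocabulary; a real dataset may contain other values.
CHANNELS = ["chat", "email", "phone"]


def one_hot_channel(value, channels=CHANNELS):
    """Return a one-hot list with a trailing slot for missing/unknown channels."""
    vec = [0] * (len(channels) + 1)
    # Treat None and empty strings as missing; compare case-insensitively.
    key = (value or "").strip().lower()
    idx = channels.index(key) if key in channels else len(channels)
    vec[idx] = 1
    return vec
```

Keeping a dedicated "unknown" slot lets a model see that the channel was absent instead of silently mapping it to one of the known channels.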
Cataloging support channels and fields
Channel-specific token patterns
One-hot and embedding encodings
Combining text and metadata features
Handling missing channel metadata

Lesson 2: Emoji, emoticon and non-standard token handling and mapping to sentiment signals
Learn how to normalize emojis, emoticons, and other non-standard tokens while preserving their emotional meaning. We cover mapping strategies, sentiment lexicons, and integrating these signals into sentiment and intent models for Eritrean texts.
Cataloging emoji and emoticon usage
Unicode handling and normalization
Mapping tokens to sentiment scores
Building custom emoji lexicons
Integrating signals into models

Lesson 3: Punctuation, contractions, and tokenization strategies for English support text
Examine punctuation, contractions, and tokenization in English support messages. We compare rule-based and statistical tokenizers, handle edge cases, and match tokenization strategies to model requirements for Eritrean English usage.
Role of punctuation in support tickets
Expanding and normalizing contractions
Rule-based vs statistical tokenizers
Handling URLs and emojis in tokens
Tokenization for transformer models

Lesson 4: Stemming vs lemmatization: algorithms, libraries, and when to apply each
Compare stemming and lemmatization, including their algorithms and libraries. You will learn when to apply each to support tickets and how each affects vocabulary size and model behavior in Eritrean contexts.
Rule-based and algorithmic stemmers
Dictionary-based lemmatizers
Library choices and performance
Impact on vocabulary and sparsity
Task-driven method selection

Lesson 5: Handling spelling mistakes, abbreviations, and domain-specific shorthand (spell correction, lookup dictionaries)
Learn ways to correct spelling errors, expand abbreviations, and standardize domain-specific shorthand in tickets. You will combine spell correctors, lookup dictionaries, and custom rules without corrupting the important names and codes used in Eritrean support.
Common error types in support text
Dictionary and edit-distance correction
Custom domain abbreviation lexicons
Context-aware correction strategies
Protecting entities and codes

Lesson 6: Stopword removal tradeoffs and configurable stopword lists for support ticket domains
Weigh the tradeoffs of stopword removal in support domains. You will build configurable stopword lists, evaluate their effect on models, and handle domain-specific words that carry hidden meaning in Eritrean tickets.
Standard vs domain stopword lists
Impact on bag-of-words features
Effect on embeddings and transformers
Configurable and layered stopword sets
Evaluating removal with ablation

Lesson 7: Text normalization fundamentals: lowercasing, Unicode normalization, whitespace and linebreak handling
Review the basic steps of text normalization: lowercasing, Unicode normalization, and whitespace and linebreak cleanup. We discuss the order of operations, language-specific pitfalls, and preserving useful formatting when handling Eritrean text.
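The normalization steps named in this lesson can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the function name `normalize_text`, the choice of NFKC, and the specific ordering are one reasonable option, not a prescribed pipeline.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode, linebreak, whitespace, and case normalization in that order."""
    # 1. Unicode normalization first, so equivalent characters compare equal
    #    (NFKC also folds compatibility forms such as full-width letters).
    text = unicodedata.normalize("NFKC", text)
    # 2. Unify line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 3. Collapse runs of spaces and tabs, and trim spaces around linebreaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    # 4. Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 5. Lowercase last, after the character forms are stable.
    return text.strip().lower()
```

Lowercasing is deliberately the final step here; whether to lowercase at all depends on the task, since case can carry signal (for example, shouting in all caps).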
Lowercasing and case preservation rules
Unicode normalization forms
Handling accents and special symbols
Whitespace and linebreak cleanup
Ordering normalization operations

Lesson 8: Data splitting strategies: time-based splits, stratified sampling by topic/sentiment, and nested cross-validation considerations
Study data splitting strategies suited to time-stamped, labeled tickets. We compare time-based splits, stratified sampling by topic or sentiment, and nested cross-validation for robust model evaluation in Eritrean data scenarios.
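A time-based split can be illustrated in a few lines. The sketch assumes each ticket is a dict whose `created_at` value is already a parsed `datetime`; the helper name `temporal_split` and the sample rows are hypothetical.

```python
from datetime import datetime


def temporal_split(tickets, cutoff):
    """Return (train, test): train holds tickets created strictly before cutoff."""
    train = [t for t in tickets if t["created_at"] < cutoff]
    test = [t for t in tickets if t["created_at"] >= cutoff]
    return train, test


# Hypothetical sample rows, for illustration only.
tickets = [
    {"ticket_id": 1, "created_at": datetime(2024, 1, 5)},
    {"ticket_id": 2, "created_at": datetime(2024, 2, 10)},
    {"ticket_id": 3, "created_at": datetime(2024, 3, 1)},
    {"ticket_id": 4, "created_at": datetime(2024, 3, 20)},
]
train, test = temporal_split(tickets, cutoff=datetime(2024, 3, 1))
```

Because every training ticket predates every test ticket, the model is never evaluated on data older than what it was trained on, which is the kind of temporal leakage a random split can introduce.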
Holdout, k-fold, and temporal splits
Stratification by topic and sentiment
Preventing temporal data leakage
Nested cross-validation workflows
Aligning splits with business goals

Lesson 9: Handling URLs, email addresses, code snippets, and identifiers in text (masking vs preserving)
Learn ways to handle URLs, email addresses, code snippets, and identifiers in text. We compare masking, normalizing, and preserving them, with attention to privacy, deduplication, and model effects in Eritrean systems.
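As a rough illustration of the masking option, the sketch below replaces URLs and email addresses with placeholder tokens. The regular expressions are deliberately simple approximations, and the placeholder strings `<URL>` and `<EMAIL>` and the helper name `mask_entities` are assumptions made for this example.

```python
import re

# Simplified patterns: they cover common cases, not every valid URL or address.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>]+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
TRAILING = ".,;:!?)"


def _mask_url(match):
    """Replace the URL but keep sentence punctuation that trails it."""
    url = match.group(0)
    core = url.rstrip(TRAILING)
    return "<URL>" + url[len(core):]


def mask_entities(text):
    """Mask URLs first, then email addresses, with fixed placeholder tokens."""
    text = URL_RE.sub(_mask_url, text)
    return EMAIL_RE.sub("<EMAIL>", text)
```

Masking removes the personal detail while still telling the model that a link or address was present; preserving or normalizing instead may be preferable when the domain of a URL is itself informative.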
Detecting URLs and email patterns
Masking versus normalization rules
Representing code snippets safely
Handling ticket and user identifiers
Privacy and leakage considerations

Lesson 10: Understanding CSV schema and data types (ticket_id, created_at, customer_id, text, channel, resolved, resolution_time_hours, manual_topic, manual_sentiment)
Learn to read the CSV schema of ticket data and assign correct data types. We cover parsing IDs, dates, booleans, and text, plus validation checks that prevent subtle errors later in Eritrean data processing.
Inspecting headers and sample rows
Assigning robust column data types
Validating timestamps and IDs
Detecting malformed or mixed types
Schema validation in pipelines

Lesson 11: Techniques to detect and quantify missing values and label noise (missingness patterns, label consistency checks, inter-annotator metrics)
Learn to find missing values and noisy labels in ticket data. We cover missingness patterns, label consistency checks, and inter-annotator agreement metrics to assess data quality and guide cleaning in Eritrean setups.
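One widely used inter-annotator agreement metric is Cohen's kappa, which compares observed agreement between two annotators with the agreement expected by chance. The lesson does not say which metrics it uses, so treat this as one illustrative choice; the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, with annotator A giving `["pos", "pos", "neg", "neg"]` and annotator B giving `["pos", "neg", "neg", "neg"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.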
Types of missingness in ticket datasets
Visualizing missingness patterns
Detecting inconsistent labels
Inter-annotator agreement metrics
Heuristics to flag label noise

Lesson 12: Creating reproducible pipelines and versioning cleaned datasets (data contracts, hashing)
Learn to build reproducible cleaning pipelines and versioned clean datasets. We cover modular pipeline components, configuration management, hashing (checksums), and data contracts that keep models, code, and data in sync over time in Eritrea.
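Hashing a dataset together with its preprocessing configuration gives a compact version identifier. The sketch below is one possible approach, assuming rows are JSON-serializable dicts; the function name `dataset_fingerprint` and the configuration keys in the usage note are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows, config):
    """Return a SHA-256 hex digest over the config and every row, in order."""
    h = hashlib.sha256()
    # sort_keys makes the serialization independent of dict key order.
    h.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    for row in rows:
        h.update(json.dumps(row, sort_keys=True, ensure_ascii=False).encode("utf-8"))
        h.update(b"\n")  # row separator so adjacent rows cannot merge
    return h.hexdigest()
```

Calling `dataset_fingerprint(rows, {"lowercase": True})` before and after a pipeline change shows whether the cleaned output actually changed. Row order affects the digest, so sort rows by a stable key first if order should not matter.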
Designing modular preprocessing steps
Configuration and parameter tracking
Hashing raw and processed datasets
Data contracts and schema guarantees
Logging and audit trails for changes

Lesson 13: Date/time parsing and timezone handling, deriving temporal features (daypart, weekday, recency)
Understand how to parse heterogeneous date and time fields, handle time zones, and derive temporal features. We focus on robust parsing, normalization to a standard time, and derived features such as recency and seasonality for Eritrean data.
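The parsing and feature steps described here might look like the following sketch. Several things are assumptions for illustration: the list of accepted formats, the daypart boundaries, the helper names, and the decision to treat timestamps without an offset as East Africa Time (UTC+3).

```python
from datetime import datetime, timedelta, timezone

# Assumption: naive timestamps are local East Africa Time (UTC+3, no DST).
EAT = timezone(timedelta(hours=3), "EAT")
# Assumed input formats; extend this list to match the real data.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y-%m-%dT%H:%M:%S%z"]


def parse_timestamp(raw, default_tz=EAT):
    """Try each known format; return an aware UTC datetime, or None if invalid."""
    if not raw:
        return None
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=default_tz)
        return dt.astimezone(timezone.utc)
    return None


def temporal_features(dt_utc, now_utc, local_tz=EAT):
    """Derive weekday (0 = Monday), daypart, and recency in hours."""
    local = dt_utc.astimezone(local_tz)
    if 6 <= local.hour < 12:
        daypart = "morning"
    elif 12 <= local.hour < 18:
        daypart = "afternoon"
    elif 18 <= local.hour < 24:
        daypart = "evening"
    else:
        daypart = "night"
    recency = (now_utc - dt_utc).total_seconds() / 3600
    return {"weekday": local.weekday(), "daypart": daypart, "recency_hours": recency}
```

Storing timestamps in UTC and converting to local time only when deriving daypart or weekday keeps comparisons consistent while still reflecting when customers actually wrote in.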
Parsing heterogeneous date formats
Timezone normalization strategies
Handling missing or invalid timestamps
Deriving recency and age features
Daypart, weekday, and seasonality

Lesson 14: Imputation and treatment of non-text columns (resolved, resolution_time_hours, channel) for modeling
Look into imputing and preparing non-text columns such as resolution status, resolution time, and channel. We discuss encoding methods, leakage risks, and combining these features with text for modeling in Eritrean support analysis.
Profiling non-text ticket columns
Imputation for numeric durations
Encoding categorical status fields
Avoiding target leakage in features
Joint modeling with text signals
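Imputation for a numeric duration column such as `resolution_time_hours` can be sketched as median filling plus a missingness indicator. This is a minimal illustration, not a recommended default: the helper names `fit_median` and `impute_duration` are hypothetical, and median filling is only one of several reasonable strategies.

```python
from statistics import median


def fit_median(train_rows, column="resolution_time_hours"):
    """Compute the fill value from training rows only, to avoid leakage."""
    observed = [r[column] for r in train_rows if r.get(column) is not None]
    return median(observed) if observed else 0.0


def impute_duration(rows, fill_value, column="resolution_time_hours"):
    """Return copies of rows with missing values filled and flagged."""
    out = []
    for r in rows:
        missing = r.get(column) is None
        new_row = dict(r)  # copy so the input rows are not mutated
        new_row[column] = fill_value if missing else r[column]
        new_row[column + "_missing"] = int(missing)
        out.append(new_row)
    return out
```

The fill value is fitted on the training split and then reused for validation and test rows. The indicator column preserves the fact that a value was missing, which can itself be informative (for example, tickets that are still unresolved).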