Lesson 1: Handling channel metadata (channel-specific token patterns, metadata encoding)
Learn how to handle channel metadata from chat, email, and phone records. We discuss token patterns unique to each channel, ways to encode channel metadata, and how to combine it with text features for better modelling.
- Listing support channels and their fields
- Channel-specific token patterns
- One-hot and embedding methods for encoding
- Combining text and metadata features
- Dealing with missing channel metadata

Lesson 2: Emoji, emoticon and non-standard token handling and mapping to sentiment signals
Look into normalising emojis, emoticons, and non-standard tokens while preserving their emotional meaning. We cover mapping strategies, lexicons, and how to add these signals to sentiment and intent models.
- Listing uses of emojis and emoticons
- Handling and normalising Unicode
- Mapping tokens to sentiment scores
- Building custom emoji lexicons
- Adding sentiment signals to models

Lesson 3: Punctuation, contractions, and tokenization strategies for English support text
Examine punctuation, contractions, and tokenisation strategies for English support messages. We compare rule-based and tool-based tokenisers, handle edge cases, and match tokenisation to the needs of downstream models.
- Role of punctuation in support tickets
- Expanding and normalising contractions
- Rule-based versus statistical tokenisers
- Handling URLs and emojis during tokenisation
- Tokenisation for transformer models

Lesson 4: Stemming vs lemmatization: algorithms, libraries, and when to apply each
Compare stemming and lemmatisation methods, including their algorithms and libraries. You will find out when to use each in support ticket pipelines and how they affect vocabulary size and model behaviour.
- Rule-based and algorithmic stemmers
- Dictionary-based lemmatizers
- Library choices and their performance
- Impact on vocabulary size and sparsity
- Selecting methods based on the task

Lesson 5: Handling spelling mistakes, abbreviations, and domain-specific shorthand (spell correction, lookup dictionaries)
Learn ways to fix spelling errors, expand abbreviations, and standardise domain-specific shorthand in tickets. You will combine spell correction, lookup dictionaries, and custom rules without corrupting important names and codes.
- Common error types in support text
- Dictionary and distance-based correction
- Custom lists for domain abbreviations
- Context-aware correction strategies
- Protecting names and codes

Lesson 6: Stopword removal tradeoffs and configurable stopword lists for support ticket domains
Look at the pros and cons of removing stopwords in support ticket domains. You will create configurable stopword lists, measure their effect on models, and handle domain-specific words that signal implicit intent.
- Standard versus domain-specific stopword lists
- Impact on bag-of-words features
- Effect on embeddings and transformers
- Configurable and layered stopword sets
- Evaluating removal with tests

Lesson 7: Text normalization fundamentals: lowercasing, Unicode normalization, whitespace and linebreak handling
Cover the basic text normalisation steps: lowercasing, Unicode normalisation, and whitespace cleanup. We discuss the order of steps, language-specific pitfalls, and how to preserve meaningful formatting cues.
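As a rough illustration of the steps Lesson 7 names, the sketch below applies Unicode normalisation, line-break unification, whitespace cleanup, and optional lowercasing in one fixed order. The function name `normalise_text`, the choice of NFKC, and the rule that collapses three or more line breaks into one blank line are assumptions made for this example, not choices prescribed by the course.

```python
import re
import unicodedata


def normalise_text(text: str, lowercase: bool = True) -> str:
    """Hypothetical helper: one possible ordering of basic normalisation steps."""
    # 1. Unicode normalisation first, so later steps see canonical characters.
    #    NFKC (an assumption here) also folds full-width letters and
    #    non-breaking spaces into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # 2. Unify line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 3. Collapse runs of spaces and tabs inside each line and trim the line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # 4. Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    # 5. Lowercase last and make it optional, because case can carry signal
    #    (for example, shouting in capitals).
    return text.lower() if lowercase else text
```

Passing `lowercase=False` keeps the original casing for models that benefit from it.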
- Lowercasing and rules for preserving case
- Unicode normalisation forms
- Handling accents and special symbols
- Cleaning whitespace and line breaks
- Ordering of normalisation steps

Lesson 8: Data splitting strategies: time-based splits, stratified sampling by topic/sentiment, and nested cross-validation considerations
Study data splitting strategies suited to time-stamped, labelled ticket data. We compare time-based splits, stratified sampling by topic or sentiment, and nested cross-validation for robust model evaluation.
- Holdout, k-fold, and time-based splits
- Stratified sampling by topic and sentiment
- Preventing temporal leakage
- Nested cross-validation workflows
- Matching splits to business goals

Lesson 9: Handling URLs, email addresses, code snippets, and identifiers in text (masking vs preserving)
Learn ways to handle URLs, email addresses, code snippets, and identifiers in text. We compare masking, normalising, and preserving them, focusing on privacy, deduplication, and effects on model performance.
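The masking option described for Lesson 9 could look roughly like the sketch below. The regular expressions are deliberately simple and will miss or over-match some real-world cases, and the placeholder tokens `<URL>` and `<EMAIL>` as well as the function name `mask_identifiers` are assumptions made for illustration.

```python
import re

# Deliberately simple patterns (assumptions, not a complete URL or email grammar).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_identifiers(text: str) -> str:
    """Hypothetical helper: replace emails and URLs with placeholder tokens."""
    # Mask emails first so the domain part is not touched by the URL pattern.
    text = EMAIL_RE.sub("<EMAIL>", text)
    # \S+ is greedy, so punctuation directly after a URL is swallowed too.
    text = URL_RE.sub("<URL>", text)
    return text
```

Masking removes personal data and reduces vocabulary size, while preserving the raw values keeps information such as the linked domain; which trade-off is right depends on the task.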
- Detecting URL and email patterns
- Masking versus normalisation rules
- Safely representing code snippets
- Handling ticket and user IDs
- Privacy and leakage concerns

Lesson 10: Understanding CSV schema and data types (ticket_id, created_at, customer_id, text, channel, resolved, resolution_time_hours, manual_topic, manual_sentiment)
Learn to read the CSV schema of ticket datasets and assign the correct data types. We cover parsing IDs, timestamps, booleans, and text fields, plus validation checks that prevent subtle errors downstream.
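A minimal sketch of typed CSV loading for the columns listed in the Lesson 10 title, using only the standard library. It assumes ISO 8601 timestamps in `created_at` and a small set of accepted boolean spellings; both are assumptions about the data, and `read_tickets` and `parse_bool` are hypothetical helper names.

```python
import csv
import io
from datetime import datetime

EXPECTED_COLUMNS = [
    "ticket_id", "created_at", "customer_id", "text", "channel",
    "resolved", "resolution_time_hours", "manual_topic", "manual_sentiment",
]


def parse_bool(value: str):
    """Map common boolean spellings (an assumed set) to True/False/None."""
    v = value.strip().lower()
    if v in {"true", "1", "yes"}:
        return True
    if v in {"false", "0", "no"}:
        return False
    if v == "":
        return None  # missing value
    raise ValueError(f"not a boolean: {value!r}")


def read_tickets(handle):
    """Read ticket rows from a file-like object and apply column types."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        hours = row["resolution_time_hours"].strip()
        rows.append({
            # IDs stay strings so leading zeros are not lost.
            "ticket_id": row["ticket_id"].strip(),
            "created_at": datetime.fromisoformat(row["created_at"].strip()),
            "customer_id": row["customer_id"].strip(),
            "text": row["text"],
            "channel": row["channel"].strip().lower(),
            "resolved": parse_bool(row["resolved"]),
            "resolution_time_hours": float(hours) if hours else None,
            "manual_topic": row["manual_topic"].strip() or None,
            "manual_sentiment": row["manual_sentiment"].strip() or None,
        })
    return rows
```

Failing fast on a missing column or an unparseable value surfaces schema problems at load time instead of letting them show up later as silent modelling errors.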
- Checking headers and sample rows
- Setting strict column data types
- Validating timestamps and IDs
- Detecting invalid or mixed types
- Schema checks in pipelines

Lesson 11: Techniques to detect and quantify missing values and label noise (missingness patterns, label consistency checks, inter-annotator metrics)
Learn to find missing values and noisy labels in support ticket data. We cover missingness patterns, label consistency checks, and inter-annotator agreement metrics to gauge label quality and guide cleaning.
- Types of missingness in ticket data
- Visualising missingness patterns
- Finding inconsistent labels
- Inter-annotator agreement metrics
- Heuristics for flagging label noise

Lesson 12: Creating reproducible pipelines and versioning cleaned datasets (data contracts, hashing)
Learn to build reproducible preprocessing pipelines and versioned cleaned datasets. We cover modular pipeline design, configuration management, hashing, and data contracts that keep models, code, and data in sync over time.
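One way to version a cleaned dataset by hashing, as mentioned for Lesson 12, is to fingerprint the rows together with the preprocessing configuration. The sketch below is one possible approach under that assumption. The function name `dataset_fingerprint`, the use of SHA-256, and the JSON serialisation with sorted keys are illustrative choices.

```python
import hashlib
import json


def dataset_fingerprint(rows, config) -> str:
    """Hypothetical helper: hash rows plus config into one hex digest.

    Rows are hashed in order, so reordering rows changes the fingerprint.
    """
    digest = hashlib.sha256()
    # Sorting keys makes the serialisation independent of dict key order.
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    digest.update(b"\n")
    for row in rows:
        line = json.dumps(row, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Storing the digest next to a trained model makes it possible to check later whether the model was built from exactly this data and configuration.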
- Designing modular preprocessing steps
- Tracking configuration and parameters
- Hashing raw and cleaned datasets
- Data contracts and schema guarantees
- Logging and audit trails for changes

Lesson 13: Date/time parsing and timezone handling, deriving temporal features (daypart, weekday, recency)
Understand how to parse mixed date and time fields, handle time zones, and derive temporal features. We focus on robust parsing, normalising to a single reference time zone, and derived features such as recency and seasonality.
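The temporal features named in the Lesson 13 title (daypart, weekday, recency) could be derived roughly as follows. This sketch assumes ISO 8601 input, treats timestamps without an offset as UTC, and uses arbitrary six-hour daypart boundaries; all three are assumptions, and `temporal_features` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def temporal_features(created_at: str, reference: datetime) -> dict:
    """Derive weekday, daypart, and recency from an ISO 8601 timestamp.

    `reference` must be timezone-aware (for example, the dataset cut-off time).
    """
    dt = datetime.fromisoformat(created_at)
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    dt = dt.astimezone(timezone.utc)  # normalise to one reference zone

    hour = dt.hour
    if hour < 6:
        daypart = "night"
    elif hour < 12:
        daypart = "morning"
    elif hour < 18:
        daypart = "afternoon"
    else:
        daypart = "evening"

    return {
        "weekday": dt.weekday(),  # Monday is 0, Sunday is 6
        "daypart": daypart,
        "recency_days": (reference - dt).total_seconds() / 86400,
    }
```

Computing dayparts in UTC keeps the pipeline simple but ignores the customer's local time; if local behaviour matters, the conversion target would need to be the customer's zone instead.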
- Parsing mixed date formats
- Strategies for time zone normalisation
- Handling missing or invalid timestamps
- Creating recency and age features
- Dayparts, weekdays, and seasonality

Lesson 14: Imputation and treatment of non-text columns (resolved, resolution_time_hours, channel) for modeling
Look into imputing and preprocessing non-text columns such as resolution status, resolution time, and channel. We discuss encoding methods, leakage risks, and how to combine these features with text for modelling.
- Profiling non-text ticket columns
- Imputing numeric resolution times
- Encoding categorical status fields
- Avoiding target leakage in features
- Joint modelling with text signals
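As a small illustration of imputing `resolution_time_hours` without leaking information, the sketch below computes the median on training rows only and reuses it for other splits, adding a missing-value indicator. Median imputation, the `_missing` suffix, and the helper names `fit_median` and `impute` are assumptions made for this example.

```python
from statistics import median


def fit_median(train_rows, column="resolution_time_hours"):
    """Compute the fill value from training rows only, ignoring missing values."""
    values = [row[column] for row in train_rows if row[column] is not None]
    return median(values)


def impute(rows, fill_value, column="resolution_time_hours"):
    """Return copies of rows with missing values filled and flagged."""
    imputed = []
    for row in rows:
        is_missing = row[column] is None
        new_row = dict(row)  # do not mutate the caller's data
        new_row[column] = fill_value if is_missing else row[column]
        # The indicator lets a model learn from the missingness itself.
        new_row[column + "_missing"] = int(is_missing)
        imputed.append(new_row)
    return imputed
```

Fitting the fill value on the training split and applying it unchanged to validation and test data avoids leakage through the imputation step. Whether `resolved` and `resolution_time_hours` may be used as features at all depends on whether they are known at prediction time.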