Lesson 1
Normalising Emojis, Emoticons, and Special Characters into Sentiment Tokens
Learn how to standardise emojis, emoticons, and special characters into consistent sentiment tokens. We compare mapping strategies, discuss ambiguity, and show how to preserve emotional nuance without inflating the vocabulary, especially with the local slang and symbols used in Ghana.
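As a rough illustration of the mapping idea, the sketch below replaces a handful of emojis and emoticons with shared sentiment tokens. The token names (EMO_POS, EMO_NEG) and the tiny mapping table are assumptions made for this example; a real project would curate a much larger, domain-specific list and decide separately how to treat ambiguous or sarcastic use.

```python
import re

# Hypothetical mapping for illustration only; curate a fuller list per domain.
SENTIMENT_TOKENS = {
    "😀": "EMO_POS",
    "😍": "EMO_POS",
    "😡": "EMO_NEG",
    "😢": "EMO_NEG",
    ":)": "EMO_POS",
    ":-)": "EMO_POS",
    ":(": "EMO_NEG",
    ":-(": "EMO_NEG",
}

# Try longer keys first so a longer emoticon wins over any shorter overlap.
_PATTERN = re.compile(
    "|".join(
        re.escape(key)
        for key in sorted(SENTIMENT_TOKENS, key=len, reverse=True)
    )
)


def normalise_emojis(text: str) -> str:
    """Replace known emojis and emoticons with space-separated sentiment tokens."""
    replaced = _PATTERN.sub(lambda m: f" {SENTIMENT_TOKENS[m.group(0)]} ", text)
    # Collapse the extra spaces introduced by the padding above.
    return " ".join(replaced.split())
```

Mapping several icons onto one token keeps the vocabulary small, at the cost of some nuance; a finer-grained token set is an equally valid choice when the model can use it.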
Emoji and emoticon inventories for your domain
Mapping icons to sentiment or intent tokens
Dealing with mixed scripts and decorative symbols
Handling ambiguous or sarcastic emoji use

Lesson 2
Detecting Language and Handling Multiple Languages: Filtering, Translating, or Separate Pipelines
Learn methods for identifying the language of a text and for building multilingual pipelines. You will decide when to filter, translate, or route by language, and how to manage code-switching and low-resource languages such as Twi or Ga in Ghanaian texts.
Language identification tools and heuristics
Filtering or routing by detected language
Translation-based versus native pipelines
Handling code-switching and mixed-language text

Lesson 3
Understanding Schema and Data Types: review_id, channel, product_category, review_text, rating, review_date, customer_segment
Learn to read dataset schemas and data types for text analytics projects. You will map fields such as review_id, channel, product_category, review_text, rating, review_date, and customer_segment to analytical questions and modelling choices, using examples from Ghanaian e-commerce.
Key identifiers and relational joins for reviews
Channel and product category field meanings
Rating, date, and customer segment details
Designing schemas for scalable text analytics

Lesson 4
Data Validation: Duplicates, Missing Values, Inconsistent Channels or Categories, Date Ranges
Learn to check text datasets for duplicates, missing values, inconsistent channels or categories, and out-of-range dates. You will create validation rules, build summary reports, and set up automatic quality checks for data from Ghanaian sources such as mobile money reviews.
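The checks named above can be sketched in a few lines of plain Python. The field names follow the schema from Lesson 3; the allowed channel values and the dictionary-based row format are assumptions for this example, and a production pipeline would more likely run equivalent rules inside a dataframe or validation library.

```python
from collections import Counter
from datetime import date

# Hypothetical set of canonical channel labels.
ALLOWED_CHANNELS = {"app", "web", "sms"}


def validate_reviews(rows, start, end):
    """Return a small report listing the review_ids that break each rule."""
    id_counts = Counter(row["review_id"] for row in rows)
    return {
        # review_ids that occur more than once
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        # rows whose text is missing or only whitespace
        "missing_text": [
            row["review_id"]
            for row in rows
            if not (row.get("review_text") or "").strip()
        ],
        # rows whose raw channel value is not an exact canonical label
        "bad_channel": [
            row["review_id"]
            for row in rows
            if row.get("channel") not in ALLOWED_CHANNELS
        ],
        # rows dated outside the expected collection window
        "out_of_range": [
            row["review_id"]
            for row in rows
            if not (start <= row["review_date"] <= end)
        ],
    }
```

Comparing the raw channel value, rather than a cleaned one, is deliberate here: variants such as "App " are reported so they can be fixed at the source instead of being silently absorbed.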
Profiling duplicates and missing text fields
Validating channels, categories, and labels
Checking time coverage and date anomalies
Automated validation reports and alerts

Lesson 5
Removing Duplicates and Consolidating Threads: Merging Repeated Reviews or Multi-Message Tickets
Understand how to spot duplicate reviews, near-duplicates, and broken threads across channels. You will learn similarity measures, ID-based rules, and ways to merge or link messages into clear units of analysis, as needed for Ghanaian customer service chats.
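One simple similarity measure is the character-level ratio from Python's standard difflib module. The sketch below flags pairs of texts that are identical after light normalisation or whose ratio reaches a threshold; the 0.9 threshold and the function names are assumptions for illustration, and the pairwise loop is only practical for small batches.

```python
from difflib import SequenceMatcher


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def find_near_duplicates(texts, threshold=0.9):
    """Return index pairs (i, j), i < j, whose texts are exact or near duplicates."""
    normed = [_normalise(t) for t in texts]
    pairs = []
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            same = normed[i] == normed[j]
            # ratio() is 2 * matches / total length, between 0 and 1.
            close = SequenceMatcher(None, normed[i], normed[j]).ratio() >= threshold
            if same or close:
                pairs.append((i, j))
    return pairs
```

For larger datasets, hashing exact matches first and comparing only candidates that share a key (for example the same customer or day) keeps the number of comparisons manageable.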
Exact and fuzzy duplicate detection methods
Thread rebuilding from multi-message tickets
Merging rules for repeated or updated reviews
Impact of de-duplication on model training

Lesson 6
Stopwords and Custom Stoplist: Standard Stopwords, Product-Brand Names, Frequent Non-Informative Terms
Learn how to use standard stopword lists and build custom ones for product and brand names and other frequent non-informative terms. You will measure their effect on models and avoid removing useful tokens, adapting the lists to Ghanaian English and its local phrases.
Reviewing standard stopword lists by language
Spotting domain-specific non-informative terms
Handling brand, product, and competitor names
Measuring impact of stoplists on features

Lesson 7
Tokenisation Choices: Word, Subword, Sentence; Handling Contractions and Domain-Specific Tokens
Compare tokenisation strategies at word, subword, and sentence levels, with a focus on contractions and domain-specific tokens. You will see how token choices affect vocabulary size, context windows, and downstream models in Ghanaian text contexts.
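A minimal word-level tokeniser that keeps contractions, hashtags, and product codes intact might look like the sketch below. The SKU pattern (two or more capital letters, a hyphen, then digits) is a made-up format for illustration; a real catalogue would need its own pattern.

```python
import re

TOKEN_RE = re.compile(
    r"""
    \#\w+              # hashtags such as #MoMo
  | [A-Z]{2,}-\d+      # hypothetical SKU format such as GH-10432
  | \w+(?:'\w+)*       # words, keeping contractions like don't in one piece
  | [^\w\s]            # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenise(text: str) -> list[str]:
    """Split text into tokens, protecting hashtags, SKUs, and contractions."""
    return TOKEN_RE.findall(text)
```

The order of the alternatives matters: the more specific patterns come first so that a code such as GH-10432 is not split at the hyphen by the generic word rule.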
Word and sentence tokenisation trade-offs
Subword tokenisation and byte-pair encoding
Handling contractions and clitics in text
Custom tokens for SKUs, hashtags, and codes

Lesson 8
Text Normalisation: Lowercasing, Unicode Normalisation, Whitespace and Punctuation Handling
Explore the main text normalisation steps: lowercasing, Unicode normalisation, and whitespace and punctuation handling. You will build pipelines that standardise text while keeping the information models need, suitable for mixed Ghanaian online content.
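These steps can be chained into a small function using only the standard library. The choice of NFKC, the optional lowercasing flag, and the rule that collapses repeated punctuation are assumptions for this sketch; whether to keep case or repeated "!!!" depends on whether the downstream model can use that signal.

```python
import re
import unicodedata


def normalise_text(text: str, lowercase: bool = True) -> str:
    """Apply Unicode, case, whitespace, and punctuation normalisation in order."""
    # NFKC folds compatibility forms (full-width letters, non-breaking spaces)
    # and composes base letters with their combining accents.
    text = unicodedata.normalize("NFKC", text)
    if lowercase:
        text = text.lower()
    # Collapse any run of whitespace to a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Reduce repeated punctuation such as "!!!" to a single mark.
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    return text
```

Characters used in Twi and Ga, such as ɛ and ɔ, pass through unchanged apart from lowercasing, which is one reason to prefer Unicode normalisation over stripping non-ASCII characters.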
Lowercasing strategies and case preservation
Unicode normalisation and encoding issues
Whitespace cleanup and token boundary control
Punctuation handling and symbol retention

Lesson 9
Spelling Correction and Normalisation: Dictionary-Based, Edit-Distance, and Context-Aware Methods
Study spelling correction and normalisation methods, from dictionary and edit-distance approaches to context-aware models. You will handle noisy user text, brand names, and slang while avoiding overcorrection, especially in text influenced by Ghanaian Pidgin.
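To make the edit-distance approach and the overcorrection risk concrete, the sketch below combines a plain Levenshtein distance with a protected list of slang terms. Both word lists and the distance limit of 1 are assumptions for illustration; context-aware methods are not covered by this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


# Hypothetical word lists for illustration only.
VOCAB = {"delivery", "refund", "network", "day"}
PROTECTED = {"chale", "momo", "dey"}  # slang that must never be "corrected"


def correct(word: str, max_distance: int = 1) -> str:
    """Return the lowercased word, corrected only if a close vocabulary match exists."""
    w = word.lower()
    if w in VOCAB or w in PROTECTED:
        return w
    best = min(sorted(VOCAB), key=lambda v: edit_distance(w, v))
    return best if edit_distance(w, best) <= max_distance else w
```

Without the protected list, the Pidgin word "dey" would sit one edit away from "day" and be rewritten, which is exactly the kind of overcorrection the lesson warns about.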
Dictionary and rule-based spell checking
Edit-distance and candidate ranking methods
Context-aware neural spelling correction
Protecting slang, names, and domain terms

Lesson 10
Handling HTML, Markup, URLs, Emails, and Removing Personally Identifiable Information (PII)
Explore ways to detect and remove HTML, markup, URLs, emails, and PII while keeping useful context. You will design regex patterns, use libraries, and balance privacy, compliance, and model needs in line with Ghanaian data protection standards.
Stripping HTML tags while keeping visible text
Detecting and masking URLs and email addresses
Identifying and anonymising PII fields
Balancing privacy, utility, and legal compliance
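The topics in Lesson 10 can be sketched with the standard library alone: an HTMLParser subclass keeps the visible text, and regular expressions mask URLs, email addresses, and phone numbers. The placeholder tokens ([URL], [EMAIL], [PHONE]) and the phone pattern (a leading 0 or +233 followed by nine digits) are assumptions for this example, and regex masking of this kind is a first pass, not a guarantee of compliance.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the visible text, dropping tags and decoding entities."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Assumed phone format: 0 or +233 followed by nine digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\+233|0)\d{9}\b")


def scrub(text: str) -> str:
    """Strip markup, then mask URLs, emails, and phone numbers."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    visible = "".join(parser.parts)
    visible = URL_RE.sub("[URL]", visible)
    visible = EMAIL_RE.sub("[EMAIL]", visible)
    visible = PHONE_RE.sub("[PHONE]", visible)
    return " ".join(visible.split())
```

Replacing each item with a typed placeholder, rather than deleting it, keeps the signal that a customer shared a link or contact detail while removing the identifying value itself.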