Lesson 1: Standardising Emojis, Emoticons, and Special Characters into Sentiment Tokens

Learn how to turn emojis, emoticons, and special characters into consistent sentiment tokens. We look at mapping approaches, discuss ambiguous meanings, and show how to keep emotional detail without bloating the vocabulary, especially in Zambian social media chats.
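
The mapping idea described for this lesson can be sketched in a few lines of Python. This is a minimal illustration only: the lexicon entries, the token names (`__EMO_POS__`, `__EMO_NEG__`), and the function name `replace_symbols` are assumptions made for the example, not part of the course material.

```python
import re

# Hypothetical lexicon: many surface symbols collapse into a few sentiment tokens,
# so the vocabulary grows by two entries instead of one per emoji.
SYMBOL_TO_TOKEN = {
    "😀": "__EMO_POS__",
    "😍": "__EMO_POS__",
    "😡": "__EMO_NEG__",
    "😭": "__EMO_NEG__",
    ":)": "__EMO_POS__",
    ":-)": "__EMO_POS__",
    ":(": "__EMO_NEG__",
    ":-(": "__EMO_NEG__",
}

# Try longer symbols first so a long emoticon is preferred over a shorter one
# that shares its prefix.
_PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(SYMBOL_TO_TOKEN, key=len, reverse=True))
)


def replace_symbols(text: str) -> str:
    """Replace known emojis and emoticons with sentiment tokens."""
    # Pad each token with spaces so it never fuses with a neighbouring word.
    spaced = _PATTERN.sub(lambda m: f" {SYMBOL_TO_TOKEN[m.group(0)]} ", text)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(spaced.split())
```

A fixed lookup like this cannot tell sincere use from joking use, so ambiguous symbols are probably better left out of the lexicon or given their own neutral token.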

- Emoji and emoticon lexicons for your domain
- Mapping symbols to sentiment or intent tokens
- Dealing with mixed scripts and decorative symbols
- Handling ambiguous or sarcastic emoji use

Lesson 2: Detecting Languages and Handling Multilingual Text: Filtering, Translating, or Separate Pipelines

Learn ways to detect the language of a text and set up pipelines for multilingual data. You'll decide when to filter, translate, or split by language, and how to manage code-switching and low-resource languages, both common across Zambia's many languages.

- Tools and heuristics for language detection
- Filtering or routing by detected language
- Translation pipelines vs native-language pipelines
- Managing code-switching and mixed-language text

Lesson 3: Understanding Schema and Data Types: review_id, channel, product_category, review_text, rating, review_date, customer_segment

Learn to read dataset schemas and data types for text analytics tasks. You'll map fields such as review_id, channel, product_category, review_text, rating, review_date, and customer_segment to analysis questions and model choices, using Zambian e-commerce examples.

- Key identifiers and relationships for reviews
- Channel and product category semantics
- Rating, date, and customer segment details
- Designing schemas for large-scale text analytics

Lesson 4: Checking Data: Duplicates, Missing Values, Inconsistent Channels or Groups, Date Ranges

Learn to check text datasets for duplicates, missing values, inconsistent channel or group labels, and gaps in date ranges. You'll write validation rules, build summary reports, and set up automated quality monitoring, vital for Zambian business records.

- Checking for duplicates and missing text fields
- Validating channels, groups, and tags
- Reviewing time coverage and date anomalies
- Automated validation reports and alerts

Lesson 5: Removing Duplicates and Joining Threads: Merging Repeated Reviews or Multi-Message Tickets

Understand how to detect duplicate reviews, near-duplicates, and fragmented threads across channels. You'll learn similarity measures, ID rules, and ways to merge or link messages into clear units of analysis, helpful for Zambian customer service logs.
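
One simple similarity measure for near-duplicate detection is Jaccard overlap between word sets. The sketch below is an assumption-laden illustration: the function names, the 0.8 threshold, and the sample data are hypothetical, and the course may use a different measure.

```python
import re
from itertools import combinations
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def near_duplicate_pairs(texts: List[str], threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    The threshold of 0.8 is an illustrative default, not a recommended value.
    """
    token_sets = [tokens(t) for t in texts]
    return [
        (i, j)
        for i, j in combinations(range(len(texts)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]
```

This compares every pair, so the cost grows quadratically with the number of reviews. For large logs, hashing or blocking by product or customer first would likely be needed.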

- Exact and fuzzy duplicate detection methods
- Rebuilding threads from multi-message tickets
- Merge rules for repeated or updated reviews
- Effect of deduplication on model training

Lesson 6: Stopwords and Custom Stoplists: Standard Stopwords, Product and Brand Names, Common Uninformative Terms

Understand how to use standard stopword lists and build custom ones for product names and common uninformative terms. You'll measure their effect on models and avoid discarding useful tokens, with attention to Zambian English slang and brands.

- Reviewing standard stopword lists by language
- Identifying domain-specific uninformative terms
- Handling brand, product, and competitor names
- Measuring the effect of a stoplist on features

Lesson 7: Tokenisation Choices: Word, Subword, Sentence; Handling Contractions and Domain-Specific Tokens

Compare tokenisation at the word, subword, and sentence levels, with attention to contractions and domain-specific tokens. You'll learn how tokenisation choices affect vocabulary size, context length, and downstream models, with Zambian text examples.

- Word and sentence tokenisation trade-offs
- Subword tokenisation and byte-pair encoding
- Handling contractions and clitics in text
- Custom tokens for SKUs, hashtags, and codes

Lesson 8: Text Normalisation: Lowercasing, Unicode Normalisation, Whitespace and Punctuation Handling

Explore core text normalisation steps such as lowercasing, Unicode normalisation, and whitespace and punctuation handling. You'll build pipelines that make text consistent while keeping the information models need, suited to Zambian multilingual texts.
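
The normalisation steps named here can be chained into a small pipeline. The following is a minimal sketch using only the Python standard library. The function name `normalise_text`, the choice of the NFKC form, and the decision to leave punctuation untouched are assumptions made for illustration.

```python
import re
import unicodedata


def normalise_text(text: str, lowercase: bool = True) -> str:
    """Apply Unicode normalisation, optional lowercasing, and whitespace cleanup."""
    # NFKC folds compatibility variants (full-width letters, non-breaking spaces)
    # into their plain equivalents and composes base letter + combining accent.
    text = unicodedata.normalize("NFKC", text)
    if lowercase:
        # Lowercasing is optional because case can carry signal (e.g. "GREAT!!!").
        text = text.lower()
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

NFKC is lossy: it also rewrites characters such as superscript digits, so a pipeline that needs those distinctions might prefer NFC instead.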

- Lowercasing strategies and case preservation
- Unicode normalisation and encoding problems
- Whitespace cleanup and token boundary control
- Punctuation handling and symbol preservation

Lesson 9: Spelling Correction and Normalisation: Dictionary-Based, Edit-Distance, and Context-Aware Methods

Study spelling correction and normalisation methods, from dictionary lookup and edit distance to context-aware models. You'll handle noisy user text, brand names, and slang while avoiding harmful over-correction, a common risk with Zambian informal writing.
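
As a rough sketch of the edit-distance approach with a guard against over-correction, the example below computes Levenshtein distance and only corrects words that are neither in the vocabulary nor on a protected list. The function names, the `max_distance` default, and the sample words are hypothetical.

```python
from typing import AbstractSet


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word: str,
            vocabulary: AbstractSet[str],
            protected: AbstractSet[str] = frozenset(),
            max_distance: int = 1) -> str:
    """Return the closest vocabulary word, or the input if no safe fix exists."""
    # Known words and protected terms (slang, names, brands) are never changed.
    if not vocabulary or word in vocabulary or word in protected:
        return word
    # Break distance ties alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), v))
    return best if edit_distance(word, best) <= max_distance else word
```

Keeping `max_distance` small and the protected list explicit is one way to limit over-correction. A context-aware model would rank candidates by the surrounding words instead of distance alone.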

- Dictionary and rule-based spell checking
- Edit-distance and candidate ranking methods
- Context-aware neural spelling correction
- Protecting slang, names, and domain terms

Lesson 10: Handling HTML, Markup, URLs, Emails, and Personally Identifiable Information (PII) Removal

Explore ways to detect and remove HTML, markup, URLs, emails, and personal information while keeping useful context. You'll write regex patterns, use existing tools, and balance privacy, regulation, and model needs, key for Zambian data protection.

- Stripping HTML tags while keeping visible text
- Detecting and masking URLs and email addresses
- Finding and masking personal information fields
- Balancing privacy, utility, and legal requirements
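
The first two topics above can be illustrated with a short standard-library sketch: strip tags with `html.parser`, then mask URLs and email addresses with regular expressions. The placeholder tokens (`__URL__`, `__EMAIL__`), the function names, and the deliberately simple patterns are assumptions for the example. Real PII handling needs broader patterns and review against the applicable rules.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content and drop the tags around it."""

    def __init__(self) -> None:
        super().__init__()  # character references such as &amp; are decoded by default
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(text: str) -> str:
    """Return the text between tags (script/style contents are not filtered here)."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


# Simple illustrative patterns; a URL followed directly by punctuation
# will have that punctuation swallowed into the match.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii(text: str) -> str:
    """Strip HTML, then replace URLs and email addresses with placeholder tokens."""
    text = strip_html(text)
    text = URL_RE.sub("__URL__", text)
    text = EMAIL_RE.sub("__EMAIL__", text)
    return text
```

Masking with a placeholder rather than deleting keeps the fact that a link or address was present, which can still be a useful feature for a model.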