Lesson 1. TF-IDF, Hashing, and Document Embeddings: When to Use Each and Parameter Choices

This part compares TF-IDF, hashing, and document embeddings for text representation. You will learn their strengths, weaknesses, and tuning strategies, and how to choose methods and settings for search, clustering, and classification tasks. A minimal code sketch follows the topic list.
- TF-IDF weighting schemes and normalization
- Hashing trick, collisions, and feature space size
- Choosing n-grams and vocabulary pruning rules
- When sparse vectors beat dense embeddings
- Embedding dimensionality and pooling choices
- Evaluating representations for downstream tasks
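As a concrete starting point, here is a minimal sketch contrasting an explicit TF-IDF vocabulary with the stateless hashing trick in scikit-learn; the documents and parameter values are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

docs = [
    "the app crashes on login",
    "login page loads slowly",
    "crash after the latest update",
]

# TF-IDF: explicit vocabulary, sublinear tf damping, L2 normalization.
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True, norm="l2")
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.shape, len(tfidf.vocabulary_))

# Hashing: no stored vocabulary; n_features trades memory against collision risk.
hashing = HashingVectorizer(n_features=2**18, ngram_range=(1, 2), norm="l2")
X_hash = hashing.transform(docs)
print(X_hash.shape)
```

Because the hashing vectorizer never stores a vocabulary, it suits streaming or very large corpora, at the cost of irreversible collisions and uninterpretable feature indices.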
Lesson 2. N-Gram Extraction and Selection: Unigrams, Bigrams, Trigrams; Frequency and PMI Filtering

This part details n-gram extraction and selection. You will generate unigrams, bigrams, and trigrams, apply frequency and PMI filters, and build robust term lists for modelling and exploration. A worked example follows the topic list.

- Generating n-grams with sliding windows
- Minimum frequency thresholds and cutoffs
- PMI and other association measures for n-grams
- Handling multiword expressions and phrases
- Domain-specific stoplists and collocation filters
- Evaluating n-gram sets on downstream tasks
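To make the filtering concrete, this self-contained sketch counts bigrams with a sliding window, applies a frequency floor, and ranks the survivors by PMI; the corpus and threshold are invented.

```python
import math
from collections import Counter

tokens = "the payment failed the payment page timed out payment failed again".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # sliding window of size 2
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    """Pointwise mutual information: log2 of p(w1, w2) / (p(w1) * p(w2))."""
    p_joint = bigrams[(w1, w2)] / n_bi
    return math.log2(p_joint / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

# Keep bigrams above a frequency floor, then rank by association strength.
min_freq = 2
ranked = sorted(
    (bg for bg, count in bigrams.items() if count >= min_freq),
    key=lambda bg: pmi(*bg),
    reverse=True,
)
print(ranked)
```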
Lesson 3. Keyphrase Extraction: RAKE, YAKE, TextRank and Scoring/Threshold Selection

This part covers keyphrase extraction with RAKE, YAKE, and TextRank. You will learn preprocessing, scoring, threshold selection, and evaluation, and how to adapt each method to domains such as support tickets or reviews. A scoring sketch follows the topic list.

- Text preprocessing and candidate phrase generation
- RAKE scoring, stoplists, and phrase length limits
- YAKE features, window sizes, and language settings
- TextRank graph construction and edge weighting
- Score normalization and threshold calibration
- Evaluating keyphrases with gold labels or experts
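RAKE's core idea fits in a short function: split candidates at stopwords, score each word by degree/frequency, and sum word scores per phrase. The stoplist and sentence below are invented, and production implementations (e.g. rake-nltk) add more machinery.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "is", "on", "and", "of", "to", "it", "when"}

def rake_scores(text, max_len=3):
    """Simplified RAKE: candidates are runs of non-stopwords; each word is
    scored degree/frequency, and a phrase scores the sum of its word scores."""
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:  # split the token stream at stopword boundaries
        if w in STOPWORDS:
            if current:
                phrases.append(tuple(current))
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(tuple(current))
    phrases = [p for p in phrases if len(p) <= max_len]  # phrase length limit

    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)  # within-phrase co-occurrence degree
    word_score = {w: degree[w] / freq[w] for w in freq}
    return sorted(
        ((p, sum(word_score[w] for w in p)) for p in set(phrases)),
        key=lambda item: item[1],
        reverse=True,
    )

print(rake_scores("the checkout page freezes when the payment form is loading"))
```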
Lesson 4. Dimensionality Reduction for Topics: LSA (SVD), UMAP, t-SNE for Visualisation

This part covers dimensionality reduction for topic exploration. You will use LSA with SVD, UMAP, and t-SNE to project document or topic vectors, tune their parameters, and produce clear, reliable visualisations. An LSA sketch follows the topic list.

- LSA with truncated SVD for topic structure
- Choosing k and interpreting singular vectors
- UMAP parameters for global versus local structure
- t-SNE perplexity, learning rate, and iterations
- Visual encoding choices for topic scatterplots
- Pitfalls and validation of visual clusters
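A minimal LSA sketch with scikit-learn, on an invented corpus; the explained-variance ratios and the top-weighted terms per component are the usual starting points for choosing and interpreting k.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "refund request for duplicate charge",
    "duplicate charge on my card",
    "app crashes when uploading photos",
    "photo upload fails with an error",
    "refund still not processed",
    "upload error after the update",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# LSA: truncated SVD of the TF-IDF matrix; n_components is the k to tune.
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(X)         # document coordinates in latent space
print(svd.explained_variance_ratio_)  # one guide for choosing k

# Top-weighted terms per singular vector help interpret each dimension.
terms = vec.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[-4:][::-1]
    print(f"dim {i}:", [terms[j] for j in top])
```

UMAP or t-SNE would then take either the raw TF-IDF rows or these LSA coordinates as input for a two-dimensional scatterplot.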
Lesson 5. Word and Sentence Embeddings: Word2Vec, GloVe, FastText, Transformer Embeddings (BERT Variants)

This part surveys word and sentence embeddings, from Word2Vec, GloVe, and FastText to transformer models. You will learn training, fine-tuning, pooling, and how to select embeddings for different analysis tasks. A training sketch follows the topic list.

- Word2Vec architectures and training settings
- GloVe co-occurrence matrices and hyperparameters
- FastText subword modeling and rare words
- Sentence pooling strategies for static embeddings
- Transformer embeddings and BERT variants
- Task-specific fine-tuning versus frozen encoders
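A small sketch of skip-gram training plus mean pooling, assuming gensim 4+ is installed; the toy corpus is far too small for stable vectors, but the knobs (vector_size, window, min_count, sg) are the ones you would tune on real data.

```python
import numpy as np
from gensim.models import Word2Vec  # assumes gensim >= 4 is installed

sentences = [
    ["battery", "drains", "too", "fast"],
    ["phone", "battery", "dies", "quickly"],
    ["screen", "flickers", "after", "update"],
]

# sg=1 selects the skip-gram architecture (sg=0 would be CBOW).
model = Word2Vec(
    sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50, seed=0
)

def mean_pool(tokens):
    """Static sentence embedding: average the available word vectors."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

print(model.wv.most_similar("battery", topn=3))
print(mean_pool(["battery", "dies", "quickly"]).shape)  # (50,)
```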
Lesson 6. Neural Topic Approaches and BERTopic: Clustering Embeddings, Topic Merging and Interpretability

This part introduces neural topic modelling, focusing on BERTopic. You will cluster embeddings, reduce dimensionality, refine topics, merge or split clusters, and improve interpretability with key terms and labels. A configuration sketch follows the topic list.

- Embedding selection and preprocessing for topics
- UMAP and HDBSCAN configuration in BERTopic
- Topic representation and c-TF-IDF weighting
- Merging, splitting, and pruning noisy topics
- Improving topic labels with domain knowledge
- Evaluating neural topics against LDA baselines
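A configuration sketch, assuming the bertopic package (with umap-learn and hdbscan) is installed; load_complaints is a hypothetical stand-in for your own corpus loader, and the parameter values are illustrative rather than recommended.

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

docs = load_complaints()  # hypothetical helper returning a list of strings

# Smaller n_neighbors favours local structure; min_cluster_size controls
# how small a topic may be before its documents are treated as noise.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=5)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # sizes and c-TF-IDF keywords
topic_model.merge_topics(docs, [1, 3])      # merge topics judged redundant on review
```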
Lesson 7. Frequent Pattern Mining and Association Rules for Co-Occurring Complaint Terms

This part introduces frequent pattern mining and association rules for text. You will turn documents into transactions, mine co-occurring complaint terms, tune support and confidence thresholds, and interpret rules for insights. A mining example follows the topic list.

- Building term transactions from documents
- Choosing support and confidence thresholds
- Apriori and FP-Growth algorithm basics
- Interpreting association rules and lift
- Filtering spurious or redundant patterns
- Using patterns to refine taxonomies and alerts
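A compact sketch with mlxtend (assumed installed; the calls follow its long-documented usage, which may shift across versions): invented term transactions are one-hot encoded, frequent itemsets mined with Apriori, and rules filtered by confidence.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Each "transaction" is the set of salient terms from one complaint.
transactions = [
    ["refund", "delay", "support"],
    ["refund", "delay"],
    ["crash", "update"],
    ["crash", "update", "battery"],
    ["refund", "support"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support and min_threshold are the knobs this lesson is about tuning.
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```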
Lesson 8. Unsupervised Topic Modelling: LDA Configuration, Coherence Measures, Number of Topics Tuning

This part introduces unsupervised topic modelling with LDA. You will configure priors, passes, and optimisation, use coherence and perplexity measures, and design experiments to choose topic counts that balance interpretability and stability. A tuning sketch follows the topic list.

- Bag-of-words preparation and stopword control
- Dirichlet priors: alpha, eta, and sparsity
- Passes, iterations, and convergence diagnostics
- Topic coherence metrics and their variants
- Tuning number of topics with grid searches
- Stability checks and qualitative topic review
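A minimal gensim sketch of the tuning loop (gensim assumed installed); the toy corpus is far too small for meaningful coherence scores, but the shape of the grid search carries over to real data.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [
    ["refund", "delay", "payment"],
    ["payment", "refund", "charge"],
    ["crash", "update", "battery"],
    ["battery", "crash", "screen"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Sweep the number of topics and score each model with c_v coherence;
# alpha="auto" and eta="auto" let gensim learn asymmetric priors.
for k in (2, 3):
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary,
                   alpha="auto", eta="auto", passes=10, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(k, round(cm.get_coherence(), 3))
```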
Lesson 9. Basic Lexical Features: Token Counts, Character Counts, Unique Token Ratio, Readability Scores

This part focuses on basic lexical features for text analytics. You will compute token and character counts, type-token ratios, and readability scores, and learn when simple features outperform complex ones. A feature sketch follows the topic list.

- Tokenization choices and token count features
- Character-level counts and length distributions
- Type–token ratio and vocabulary richness
- Stopword ratios and punctuation-based signals
- Readability indices and formula selection
- Combining lexical features with other signals
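These features need only the standard library; the syllable counter below is a crude vowel-group heuristic, so treat the Flesch score as approximate.

```python
import re

def lexical_features(text):
    """Token/character counts, type-token ratio, and Flesch reading ease."""
    tokens = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(tokens))
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    # Rough syllable heuristic: count groups of consecutive vowels per token.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", t.lower()))) for t in tokens)
    return {
        "n_tokens": len(tokens),
        "n_chars": len(text),
        "type_token_ratio": len({t.lower() for t in tokens}) / n,
        # Flesch: 206.835 - 1.015 * (words/sentence) - 84.6 * (syllables/word)
        "flesch": 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n),
    }

print(lexical_features("The app crashed. I lost all my notes and the backup failed."))
```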
Lesson 10. Annotation Schema Design for Manual Labels: Issue Types, Sentiment, Urgency, Topic Tags

This part explains how to design annotation schemas for manual labels. You will define issue types, sentiment, urgency, and topic tags, write clear guidelines, handle ambiguous cases, and measure inter-annotator agreement to refine the schema iteratively. An agreement check follows the topic list.

- Defining label taxonomies and granularity
- Operationalizing sentiment and emotion labels
- Modeling urgency, impact, and priority levels
- Designing multi-label topic tag structures
- Writing annotation guidelines with examples
- Inter-annotator agreement and schema revision
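Agreement measurement is the feedback loop for schema revision; here is a minimal check using scikit-learn's Cohen's kappa on invented labels from two annotators.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labelling the same eight tickets with a small issue taxonomy.
annotator_a = ["billing", "bug", "bug", "billing", "feature", "bug", "billing", "bug"]
annotator_b = ["billing", "bug", "feature", "billing", "feature", "bug", "billing", "billing"]

# Cohen's kappa corrects raw agreement for chance; values near zero suggest
# the guidelines or taxonomy need revision before scaling up annotation.
print(round(cohen_kappa_score(annotator_a, annotator_b), 3))
```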