Lesson 1
TF-IDF, Hashing, and Document Embeddings: When to Use Each and Parameter Choices
This part compares TF-IDF, hashing, and document embeddings for text representation. You'll learn their strengths, weaknesses, and tuning options, and how to pick methods and parameters for search, clustering, and classification tasks.
- TF-IDF weighting schemes and normalisation
- Hashing trick, collisions, and feature space size
- Choosing n-grams and vocabulary pruning rules
- When sparse vectors beat dense embeddings
- Embedding dimensionality and pooling choices
- Evaluating representations on downstream tasks
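To make the vocabulary-versus-hashing trade-off concrete, here is a minimal sketch using scikit-learn; the three-document corpus and the parameter values are illustrative starting points, not recommendations.

```python
# Minimal sketch: sparse TF-IDF versus the hashing trick with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

corpus = [
    "the checkout page times out",      # illustrative documents
    "payment fails at checkout",
    "app crashes when uploading photos",
]

# TF-IDF with sublinear tf and L2 normalisation; bigrams included.
tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, norm="l2")
X_tfidf = tfidf.fit_transform(corpus)

# Hashing trick: fixed feature space, no vocabulary to store, but collisions possible.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False)
X_hash = hasher.fit_transform(corpus)

print(X_tfidf.shape, X_hash.shape)  # (3, learned vocab size) vs (3, 262144)
```

With alternate_sign=False the hashed features stay non-negative, so collisions add counts together instead of partially cancelling.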
Lesson 2
N-Gram Extraction and Selection: Unigrams, Bigrams, Trigrams; Frequency and PMI Filtering
This part details n-gram extraction and selection. You'll create unigrams, bigrams, and trigrams, apply frequency and PMI filters, and build robust vocabularies for modelling and exploratory analysis.

- Generating n-grams with sliding windows
- Minimum frequency thresholds and cutoffs
- PMI and other association measures for n-grams
- Handling multiword expressions and phrases
- Domain-specific stoplists and collocation filters
- Evaluating n-gram sets on downstream tasks
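A small pure-Python sketch of the frequency-then-PMI pipeline; the pre-tokenised documents and the min_freq cutoff are illustrative assumptions.

```python
# Minimal sketch: bigram extraction with a sliding window, then frequency and PMI filtering.
import math
from collections import Counter

docs = [["battery", "drains", "fast"], ["battery", "drains", "overnight"],
        ["screen", "flickers", "fast"]]  # illustrative pre-tokenised documents

unigrams, bigrams = Counter(), Counter()
for tokens in docs:
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))  # sliding window of size 2

n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
min_freq = 2  # frequency cutoff applied before PMI, since PMI alone favours rare pairs

def pmi(w1, w2):
    p_xy = bigrams[(w1, w2)] / n_bi
    p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log2(p_xy / (p_x * p_y))

kept = {bg: pmi(*bg) for bg, count in bigrams.items() if count >= min_freq}
print(kept)  # only ('battery', 'drains') survives the frequency cutoff here
```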
Lesson 3
Keyphrase Extraction: RAKE, YAKE, TextRank and Scoring/Threshold Selection
This part covers keyphrase extraction with RAKE, YAKE, and TextRank. You'll learn preprocessing, scoring, threshold selection, and evaluation, and how to adapt each method to domains such as support tickets or reviews.

- Text preprocessing and candidate phrase generation
- RAKE scoring, stoplists, and phrase length limits
- YAKE features, window sizes, and language settings
- TextRank graph construction and edge weighting
- Score normalisation and threshold calibration
- Evaluating keyphrases with gold labels or experts
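As a rough illustration of how RAKE-style scoring works, here is a compact sketch rather than the reference implementation; the tiny stoplist, the max_len phrase limit, and the example sentence are made up for demonstration.

```python
# Minimal sketch of RAKE-style scoring: candidate phrases split on stopwords,
# word score = degree / frequency, phrase score = sum of member word scores.
import re
from collections import defaultdict

STOPWORDS = {"the", "is", "at", "on", "and", "a", "my", "it", "to"}  # tiny illustrative stoplist

def rake_scores(text, max_len=3):
    words = re.findall(r"[a-z']+", text.lower())
    # split the token stream into candidate phrases at stopword boundaries
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    phrases = [p for p in phrases if len(p) <= max_len]  # phrase length limit

    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p) - 1  # co-occurrences with other words in the phrase
    # (degree + freq) / freq counts the word itself, matching RAKE's deg(w)/freq(w)
    word_score = {w: (degree[w] + freq[w]) / freq[w] for w in freq}
    return sorted(((" ".join(p), sum(word_score[w] for w in p)) for p in phrases),
                  key=lambda x: -x[1])

print(rake_scores("the app crashes on startup and my saved settings reset to defaults"))
```

Longer phrases accumulate higher degrees, which is why RAKE tends to favour multiword candidates such as "saved settings reset" above single terms.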
Lesson 4
Dimensionality Reduction for Topics: LSA (SVD), UMAP, t-SNE for Visualisation
This part covers dimensionality reduction for topic exploration. You'll apply LSA with SVD, UMAP, and t-SNE to project document or topic vectors, adjust their parameters, and design clear, reliable visualisations.

- LSA with truncated SVD for topic structure
- Choosing k and interpreting singular vectors
- UMAP parameters for global versus local structure
- t-SNE perplexity, learning rate, and iterations
- Visual encoding choices for topic scatterplots
- Pitfalls and validation of visual clusters
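A minimal sketch of the LSA-then-t-SNE pipeline with scikit-learn, assuming an illustrative toy corpus; k, perplexity, and the other settings are starting points to tune, not recommendations.

```python
# Minimal sketch: LSA via truncated SVD on TF-IDF, then t-SNE for a 2-D view.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

docs = [
    "refund delayed for weeks", "refund still not processed",
    "app crashes on login", "login screen freezes",
    "great support experience", "support resolved my issue quickly",
] * 10  # illustrative corpus; replace with real documents

X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=10, random_state=0)  # k is often 100-300 on real corpora
X_lsa = lsa.fit_transform(X)
print(lsa.explained_variance_ratio_.sum())  # one guide for choosing k

# perplexity must stay below the number of samples; tune it with learning rate
coords = TSNE(n_components=2, perplexity=5, learning_rate="auto",
              init="pca", random_state=0).fit_transform(X_lsa)
print(coords.shape)  # (60, 2) points ready for a scatterplot
```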
Lesson 5
Word and Sentence Embeddings: Word2Vec, GloVe, FastText, Transformer Embeddings (BERT Variants)
This part explores word and sentence embeddings, from Word2Vec, GloVe, and FastText to transformer-based models. You'll learn training, fine-tuning, pooling, and how to select embeddings for different analytic tasks.

- Word2Vec architectures and training settings
- GloVe co-occurrence matrices and hyperparameters
- FastText subword modelling and rare words
- Sentence pooling strategies for static embeddings
- Transformer embeddings and BERT variants
- Task-specific fine-tuning versus frozen encoders
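A minimal sketch of training Word2Vec with gensim and mean-pooling word vectors into a sentence embedding; the toy corpus and hyperparameters are illustrative.

```python
# Minimal sketch: skip-gram Word2Vec plus mean pooling for sentence vectors.
import numpy as np
from gensim.models import Word2Vec

sentences = [["battery", "drains", "fast"], ["screen", "cracked", "easily"],
             ["battery", "life", "improved"]]  # pre-tokenised toy corpus

# sg=1 selects the skip-gram architecture; sg=0 would be CBOW
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1,
                 sg=1, epochs=20, seed=0)

def mean_pool(tokens):
    """Average the word vectors of in-vocabulary tokens into one sentence vector."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

print(mean_pool(["battery", "drains"]).shape)  # (50,)
```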
Lesson 6
Neural Topic Approaches and BERTopic: Clustering Embeddings, Topic Merging and Interpretability
This part presents neural topic approaches, focusing on BERTopic. You'll cluster embeddings, reduce dimensionality, refine topics, merge or split clusters, and improve interpretability with representative terms and labels.

- Embedding selection and preprocessing for topics
- UMAP and HDBSCAN configuration in BERTopic
- Topic representation and c-TF-IDF weighting
- Merging, splitting, and pruning noisy topics
- Improving topic labels with domain knowledge
- Evaluating neural topics against LDA baselines
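A sketch of configuring UMAP and HDBSCAN inside BERTopic, assuming the bertopic, umap-learn, and hdbscan packages are installed; load_documents is a hypothetical loader standing in for your own corpus of at least a few hundred texts, and the parameter values are common starting points rather than recommendations.

```python
# Minimal sketch: passing custom UMAP and HDBSCAN models into BERTopic.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

docs = load_documents()  # hypothetical loader; supply a list of strings

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # topic sizes and top c-TF-IDF terms

# noisy topics can be merged afterwards, e.g.:
# topic_model.reduce_topics(docs, nr_topics=20)
```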
Lesson 7
Frequent Pattern Mining and Association Rules for Co-Occurring Complaint Terms
This part introduces frequent pattern mining and association rules for text. You'll transform documents into transactions, mine co-occurring complaint terms, tune support and confidence thresholds, and interpret the resulting rules for insights.

- Building term transactions from documents
- Choosing support and confidence thresholds
- Apriori and FP-Growth algorithm basics
- Interpreting association rules and lift
- Filtering spurious or redundant patterns
- Using patterns to refine taxonomies and alerts
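A minimal sketch of mining term co-occurrence rules with mlxtend's Apriori implementation; the transactions and thresholds are illustrative.

```python
# Minimal sketch: Apriori and association rules over term transactions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# each document reduced to a transaction of distinct terms (illustrative)
transactions = [
    ["refund", "delay", "support"],
    ["refund", "delay"],
    ["crash", "login"],
    ["crash", "login", "update"],
    ["refund", "support"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

itemsets = apriori(onehot, min_support=0.4, use_colnames=True)  # >= 2 of 5 docs
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

Lift above 1 indicates the terms co-occur more often than independence would predict, which is what separates a meaningful rule from a frequent but uninformative one.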
Lesson 8
Unsupervised Topic Modelling: LDA Configuration, Coherence Measures, Number of Topics Tuning
This part introduces unsupervised topic modelling with LDA. You'll configure priors, passes, and optimisation, use coherence and perplexity, and design experiments to select a number of topics that balances interpretability and stability.

- Bag-of-words preparation and stopword control
- Dirichlet priors: alpha, eta, and sparsity
- Passes, iterations, and convergence diagnostics
- Topic coherence metrics and their variants
- Tuning the number of topics with grid searches
- Stability checks and qualitative topic review
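A minimal sketch of LDA plus a coherence check using gensim; the toy corpus, priors, and pass counts are illustrative starting points.

```python
# Minimal sketch: gensim LDA with learned priors and a c_v coherence score.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# toy pre-tokenised corpus, repeated to give the sampler something to work with
texts = [["refund", "delay", "payment"], ["crash", "login", "update"],
         ["refund", "payment"], ["crash", "update"]] * 5

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto",  # learn asymmetric Dirichlet priors
               passes=10, random_state=0)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())  # compare this across candidate numbers of topics
```

Running this loop over a grid of num_topics values and plotting coherence is the usual way to shortlist candidates before qualitative review.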
Lesson 9
Basic Lexical Features: Token Counts, Character Counts, Unique Token Ratio, Readability Scores
This part focuses on basic lexical features for text analytics. You'll compute token and character counts, type–token ratios, and readability scores, and learn when these simple features outperform complex ones.

- Tokenization choices and token count features
- Character-level counts and length distributions
- Type–token ratio and vocabulary richness
- Stopword ratios and punctuation-based signals
- Readability indices and formula selection
- Combining lexical features with other signals
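A small pure-Python sketch of these features; the syllable heuristic inside the readability score is a crude approximation, and the regexes are illustrative rather than a recommended tokenizer.

```python
# Minimal sketch: basic lexical features including type-token ratio and
# a Flesch reading-ease score with a rough vowel-group syllable count.
import re

def lexical_features(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", t))) for t in tokens)
    n = len(tokens)
    return {
        "token_count": n,
        "char_count": len(text),
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
        "avg_token_length": sum(map(len, tokens)) / n if n else 0.0,
        # Flesch reading ease: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
        "flesch": 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n) if n else 0.0,
    }

print(lexical_features("The app keeps crashing. I cannot upload any photos!"))
```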
Lesson 10
Annotation Schema Design for Manual Labels: Issue Types, Sentiment, Urgency, Topic Tags
This part explains how to design annotation schemas for manual labels. You'll define issue types, sentiment, urgency, and topic tags, write clear guidelines, handle ambiguity, and measure agreement to refine the schema step by step.

- Defining label taxonomies and granularity
- Operationalizing sentiment and emotion labels
- Modelling urgency, impact, and priority levels
- Designing multi-label topic tag structures
- Writing annotation guidelines with examples
- Inter-annotator agreement and schema revision
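To show how agreement measurement might look in practice, here is a minimal sketch using scikit-learn's Cohen's kappa; the two annotators' labels are invented for illustration.

```python
# Minimal sketch: inter-annotator agreement on an issue-type label with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["billing", "bug", "bug", "feature", "billing", "bug"]
annotator_b = ["billing", "bug", "feature", "feature", "billing", "billing"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # values above ~0.6 are often read as substantial agreement
```

Low kappa on a particular label is a signal to revisit the guideline text or split an ambiguous category, which is the schema-revision loop this lesson describes.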