Lesson 1. TF-IDF, Hashing, and Document Embeddings: When to Use Each and Parameter Choices
This section compares TF-IDF, hashing, and document embeddings for text representation. You will learn their strengths, weaknesses, and tuning strategies, and how to choose methods and parameters for search, clustering, and classification tasks.
- TF-IDF weighting schemes and normalisation
- Hashing trick, collisions, and feature space size
- Choosing n-grams and vocabulary pruning rules
- When sparse vectors beat dense embeddings
- Embedding dimensionality and pooling choices
- Evaluating representations for downstream tasks
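As a concrete starting point for this comparison, the sketch below builds a TF-IDF representation and a hashed representation of the same toy documents with scikit-learn; the documents, n-gram range, and feature-space size are illustrative assumptions rather than recommended settings.

```python
# Minimal comparison of TF-IDF and hashed features with scikit-learn (toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

docs = [
    "the printer keeps jamming after the firmware update",
    "firmware update broke wireless printing",
    "cannot connect the printer to the office wifi",
]

# TF-IDF: explicit vocabulary; sublinear tf and l2 normalisation are common knobs to tune.
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True, norm="l2")
X_tfidf = tfidf.fit_transform(docs)

# Hashing trick: fixed feature space, no stored vocabulary, but collisions become possible.
hasher = HashingVectorizer(n_features=2**18, ngram_range=(1, 2), alternate_sign=False)
X_hash = hasher.transform(docs)

print("tf-idf:", X_tfidf.shape, "hashed:", X_hash.shape)
```

Sparse vectors like these are often a strong baseline for search and linear classifiers; dense embeddings tend to pay off when paraphrase matching matters or labelled data is scarce.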
Lesson 2. N-gram Extraction and Selection: Unigrams, Bigrams, Trigrams; Frequency and PMI Filtering
This section details n-gram extraction and selection. You will generate unigrams, bigrams, and trigrams, apply frequency and PMI filters, and build robust vocabularies for models and exploratory analysis.
- Generating n-grams with sliding windows
- Minimum frequency thresholds and cutoffs
- PMI and other association measures for n-grams
- Handling multiword expressions and phrases
- Domain-specific stoplists and collocation filters
- Evaluating n-gram sets on downstream tasks
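The sliding-window, frequency-cutoff, and PMI steps fit in a few lines of plain Python; the toy token list and the min_freq value below are assumptions for illustration only.

```python
# Bigram extraction with a frequency cutoff and PMI ranking (toy example).
import math
from collections import Counter

tokens = "the battery drains fast the battery overheats battery replacement was denied".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))   # sliding window of size 2
total = len(tokens)

min_freq = 2                                       # frequency threshold; scale with corpus size

def pmi(bigram):
    w1, w2 = bigram
    p_xy = bigram_counts[bigram] / (total - 1)
    p_x = unigram_counts[w1] / total
    p_y = unigram_counts[w2] / total
    return math.log2(p_xy / (p_x * p_y))

candidates = [bg for bg, count in bigram_counts.items() if count >= min_freq]
print(sorted(candidates, key=pmi, reverse=True))
```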
Lesson 3. Keyphrase Extraction: RAKE, YAKE, TextRank and Scoring/Threshold Selection
This section covers keyphrase extraction with RAKE, YAKE, and TextRank. You will learn preprocessing, scoring, threshold selection, and evaluation, and how to adapt methods for domains like support tickets or reviews.
- Text preprocessing and candidate phrase generation
- RAKE scoring, stoplists, and phrase length limits
- YAKE features, window sizes, and language settings
- TextRank graph construction and edge weighting
- Score normalisation and threshold calibration
- Evaluating keyphrases with gold labels or experts
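As one illustration of scoring and threshold selection, the sketch below uses the yake package (assumed installed); the language, phrase length, window size, and cutoff are placeholder settings, and the (phrase, score) tuple order follows recent yake releases.

```python
# YAKE keyphrase extraction with a simple score cutoff (lower YAKE scores are better).
import yake

text = ("The checkout page times out when applying a discount code. "
        "Support escalation was slow and the refund took two weeks.")

extractor = yake.KeywordExtractor(lan="en", n=3, windowsSize=1, top=20)
candidates = extractor.extract_keywords(text)          # list of (phrase, score) pairs

threshold = 0.15                                       # keep phrases scoring below this; calibrate on labelled data
keyphrases = [(phrase, round(score, 3)) for phrase, score in candidates if score <= threshold]
print(keyphrases)
```

RAKE and TextRank slot into the same pattern: generate candidate phrases, score them, then calibrate the cutoff against gold labels or expert review.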
Lesson 4. Dimensionality Reduction for Topics: LSA (SVD), UMAP, t-SNE for Visualisation
This section covers dimensionality reduction for topic exploration. You will apply LSA with SVD, UMAP, and t-SNE to project document or topic vectors, tune parameters, and design clear, trustworthy visualisations.
- LSA with truncated SVD for topic structure
- Choosing k and interpreting singular vectors
- UMAP parameters for global versus local structure
- t-SNE perplexity, learning rate, and iterations
- Visual encoding choices for topic scatterplots
- Pitfalls and validation of visual clusters
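A compact sketch of the LSA-then-projection workflow with scikit-learn; the corpus, the choice of k, and the t-SNE settings are placeholders, and a UMAP projection would take the place of TSNE in the same spot.

```python
# LSA via truncated SVD, then a 2-D t-SNE projection for a topic scatterplot (toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

docs = ["late delivery and damaged box", "refund delayed for weeks",
        "app crashes on login", "login screen freezes",
        "delivery arrived broken"] * 10

X = TfidfVectorizer().fit_transform(docs)

lsa = TruncatedSVD(n_components=5, random_state=0)      # k; inspect explained_variance_ratio_ when choosing it
X_lsa = lsa.fit_transform(X)

tsne = TSNE(n_components=2, perplexity=10, learning_rate=200.0, init="pca", random_state=0)
coords = tsne.fit_transform(X_lsa)                      # 2-D coordinates for plotting
print(coords.shape)
```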
Lesson 5. Word and Sentence Embeddings: Word2Vec, GloVe, FastText, Transformer Embeddings (BERT Variants)
This section explores word and sentence embeddings, from Word2Vec, GloVe, and FastText to transformer-based models. You will learn training, fine-tuning, pooling, and how to select embeddings for different analytic tasks.
- Word2Vec architectures and training settings
- GloVe co-occurrence matrices and hyperparameters
- FastText subword modelling and rare words
- Sentence pooling strategies for static embeddings
- Transformer embeddings and BERT variants
- Task-specific fine-tuning versus frozen encoders
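A minimal sketch of training a small Word2Vec model with gensim and mean-pooling its word vectors into document embeddings; the corpus and hyperparameters are purely illustrative, and FastText or a transformer encoder could replace the model while keeping the same pooling step.

```python
# Word2Vec (skip-gram) training with gensim, plus mean pooling into a document vector.
import numpy as np
from gensim.models import Word2Vec

sentences = [["battery", "drains", "quickly"],
             ["screen", "cracked", "after", "one", "drop"],
             ["battery", "overheats", "while", "charging"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

def mean_pool(tokens):
    """Average the vectors of in-vocabulary tokens: a simple static-embedding pooling strategy."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size)

print(mean_pool(["battery", "charging", "issue"]).shape)
```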
Lesson 6. Neural Topic Approaches and BERTopic: Clustering Embeddings, Topic Merging and Interpretability
This section presents neural topic approaches, focusing on BERTopic. You will cluster embeddings, reduce dimensionality, refine topics, merge or split clusters, and improve interpretability with representative terms and labels.
- Embedding selection and preprocessing for topics
- UMAP and HDBSCAN configuration in BERTopic
- Topic representation and c-TF-IDF weighting
- Merging, splitting, and pruning noisy topics
- Improving topic labels with domain knowledge
- Evaluating neural topics against LDA baselines
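The sketch below wires explicit UMAP and HDBSCAN models into BERTopic, assuming the bertopic, umap-learn, and hdbscan packages are installed; the generated corpus and every parameter value are placeholders meant only to show where the tuning happens.

```python
# BERTopic with explicit UMAP and HDBSCAN configuration (synthetic complaint corpus).
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

issues = ["late delivery and damaged packaging", "refund not processed after return",
          "app crashes on the login screen", "charged twice for a single order",
          "support agent never replied to my ticket"]
docs = [f"Ticket {i}: {issues[i % len(issues)]}" for i in range(200)]

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", cluster_selection_method="eom")

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, min_topic_size=10)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())    # topic sizes and c-TF-IDF representative terms
```

Merging or pruning noisy topics and relabelling them with domain terms usually happens after inspecting this table.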
Lesson 7. Frequent Pattern Mining and Association Rules for Co-occurring Complaint Terms
This section introduces frequent pattern mining and association rules for text. You will transform documents into transactions, mine co-occurring complaint terms, tune support and confidence, and interpret rules for insights.
- Building term transactions from documents
- Choosing support and confidence thresholds
- Apriori and FP-Growth algorithm basics
- Interpreting association rules and lift
- Filtering spurious or redundant patterns
- Using patterns to refine taxonomies and alerts
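A minimal sketch of the transaction-building and rule-mining steps using the mlxtend package (assumed installed); the documents, the small stoplist, and the support and confidence thresholds are illustrative choices.

```python
# Term transactions per document, Apriori frequent itemsets, and association rules with lift.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

docs = ["refund delayed no response from support",
        "refund delayed and agent rude",
        "no response from support after refund request",
        "delivery late and package damaged"]

stop = {"and", "no", "from", "the", "after", "for"}                 # tiny demo stoplist
transactions = [sorted({w for w in d.split() if w not in stop}) for d in docs]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions), columns=encoder.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)      # support threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```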
Lesson 8. Unsupervised Topic Modelling: LDA Configuration, Coherence Measures, Number of Topics Tuning
This section introduces unsupervised topic modelling with LDA. You will configure priors, passes, and optimisation, use coherence and perplexity, and design experiments to select topic numbers that balance interpretability and stability.
- Bag-of-words preparation and stopword control
- Dirichlet priors: alpha, eta, and sparsity
- Passes, iterations, and convergence diagnostics
- Topic coherence metrics and their variants
- Tuning the number of topics with grid searches
- Stability checks and qualitative topic review
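A small sketch of a grid over the number of topics with gensim LDA and a c_v coherence check; the toy corpus, the priors, and the candidate values of k are assumptions for illustration.

```python
# Gensim LDA with learned Dirichlet priors and a coherence-based comparison of topic counts.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [["refund", "delay", "support"], ["refund", "charge", "duplicate"],
         ["login", "crash", "app"], ["app", "freeze", "login"],
         ["delivery", "late", "damaged"], ["delivery", "box", "damaged"]] * 5

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]

for k in (2, 3, 4):                                                 # small grid over the number of topics
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, alpha="auto", eta="auto", random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(k, round(coherence, 3))
```

Coherence alone rarely settles the choice; stability across random seeds and a qualitative read of the top terms should back it up.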
Lesson 9. Basic Lexical Features: Token Counts, Character Counts, Unique Token Ratio, Readability Scores
This section focuses on basic lexical features for text analytics. You will compute token and character counts, type–token ratios, and readability scores, and learn when these simple features outperform more complex representations.
- Tokenisation choices and token count features
- Character-level counts and length distributions
- Type–token ratio and vocabulary richness
- Stopword ratios and punctuation-based signals
- Readability indices and formula selection
- Combining lexical features with other signals
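These features are short functions over raw text; the sketch below uses whitespace tokenisation as a deliberate simplification and assumes the textstat package for the readability score.

```python
# Simple lexical features for a single text: counts, ratios, and a readability index.
import string
import textstat

def lexical_features(text: str) -> dict:
    tokens = text.split()                      # whitespace tokenisation; swap in a proper tokeniser as needed
    n_tokens = len(tokens)
    n_chars = len(text)
    return {
        "token_count": n_tokens,
        "char_count": n_chars,
        "type_token_ratio": len({t.lower() for t in tokens}) / n_tokens if n_tokens else 0.0,
        "punctuation_ratio": sum(ch in string.punctuation for ch in text) / n_chars if n_chars else 0.0,
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
    }

print(lexical_features("The device stopped charging. I contacted support twice and got no reply."))
```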
Lesson 10. Annotation Schema Design for Manual Labels: Issue Types, Sentiment, Urgency, Topic Tags
This section explains how to design annotation schemas for manual labels. You will define issue types, sentiment, urgency, and topic tags, write clear guidelines, handle ambiguity, and measure agreement to refine the schema iteratively.
- Defining label taxonomies and granularity
- Operationalising sentiment and emotion labels
- Modelling urgency, impact, and priority levels
- Designing multi-label topic tag structures
- Writing annotation guidelines with examples
- Inter-annotator agreement and schema revision
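Most of this lesson is design work rather than code, but agreement measurement reduces to a few lines; the sketch below computes Cohen's kappa with scikit-learn on hypothetical labels from two annotators.

```python
# Inter-annotator agreement on a single categorical label (issue type) via Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical issue-type labels from two annotators over the same ten tickets.
annotator_a = ["billing", "delivery", "billing", "bug", "bug",
               "delivery", "billing", "bug", "other", "delivery"]
annotator_b = ["billing", "delivery", "bug", "bug", "bug",
               "delivery", "billing", "other", "other", "delivery"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
# Low agreement is a signal to revise the guidelines or the taxonomy, not just to relabel.
```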