Lesson 1: TF-IDF, hashing, and document embeddings: when to use each and parameter choices

This part compares TF-IDF, hashing, and document embeddings for text representation. You'll learn their strengths and weaknesses, tuning options, and how to choose methods and settings for search, clustering, and classification tasks.
- TF-IDF weighting schemes and normalization
- Hashing trick, collisions, and feature space size
- Choosing n-grams and vocabulary pruning rules
- When sparse vectors beat dense embeddings
- Embedding dimensionality and pooling choices
- Evaluating representations for downstream tasks
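A minimal sketch with scikit-learn, comparing a TF-IDF vectorizer against the hashing trick on a toy corpus; the parameter values are illustrative, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

docs = ["the app crashes on login", "login page is slow", "great app overall"]

# TF-IDF with sublinear term frequency and L2 normalization; unigrams + bigrams.
tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, norm="l2", min_df=1)
X_tfidf = tfidf.fit_transform(docs)

# Hashing trick: fixed-size feature space, no vocabulary to store; collisions
# become more likely as n_features shrinks.
hasher = HashingVectorizer(n_features=2**18, ngram_range=(1, 2), norm="l2")
X_hash = hasher.transform(docs)

print(X_tfidf.shape, X_hash.shape)
```

The trade-off in miniature: TF-IDF keeps an invertible vocabulary at the cost of a fitting pass, while hashing bounds memory up front but loses feature names to collisions.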
Lesson 2: N-gram extraction and selection: unigrams, bigrams, trigrams; frequency and PMI filtering

This part details n-gram extraction and selection. You'll generate unigrams, bigrams, and trigrams, apply frequency and PMI filters, and build robust vocabularies for modeling and exploratory analysis.

- Generating n-grams with sliding windows
- Minimum frequency thresholds and cutoffs
- PMI and other association measures for n-grams
- Handling multiword expressions and phrases
- Domain-specific stoplists and collocation filters
- Evaluating n-gram sets on downstream tasks
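A small pure-Python illustration on a toy corpus (the frequency and PMI thresholds are hypothetical): slide a window to produce bigrams, then score them with pointwise mutual information.

```python
import math
from collections import Counter

tokens = "the battery drains fast and the battery dies overnight".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # sliding window of size 2
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(pair):
    # PMI = log2( p(x,y) / (p(x) * p(y)) )
    p_xy = bigrams[pair] / n_bi
    p_x = unigrams[pair[0]] / n_uni
    p_y = unigrams[pair[1]] / n_uni
    return math.log2(p_xy / (p_x * p_y))

# Keep bigrams that are both frequent enough and strongly associated.
kept = [b for b in bigrams if bigrams[b] >= 2 and pmi(b) > 1.0]
print(kept)  # e.g. [('the', 'battery')]
```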
Lesson 3: Keyphrase extraction: RAKE, YAKE, TextRank and scoring/threshold selection

This part covers keyphrase extraction with RAKE, YAKE, and TextRank. You'll learn preprocessing, scoring, threshold selection, and evaluation, and how to adapt each method to domains such as support tickets or product reviews.

- Text preprocessing and candidate phrase generation
- RAKE scoring, stoplists, and phrase length limits
- YAKE features, window sizes, and language settings
- TextRank graph construction and edge weighting
- Score normalization and threshold calibration
- Evaluating keyphrases with gold labels or experts
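A simplified RAKE-style scorer, not the full algorithm: split on a toy stoplist to get candidate phrases, then score each word by degree/frequency and each phrase by the sum of its word scores. Both the stoplist and the text are illustrative.

```python
import re
from collections import defaultdict

STOP = {"the", "is", "and", "on", "a", "to", "it"}
text = "the checkout page is slow and the payment form fails on mobile"

# Candidate phrases: maximal runs of non-stopwords.
words = re.findall(r"[a-z']+", text.lower())
phrases, current = [], []
for w in words:
    if w in STOP:
        if current:
            phrases.append(current)
        current = []
    else:
        current.append(w)
if current:
    phrases.append(current)

# Word score = degree / frequency; longer phrases boost their words' degree.
freq, degree = defaultdict(int), defaultdict(int)
for ph in phrases:
    for w in ph:
        freq[w] += 1
        degree[w] += len(ph)

scores = {" ".join(ph): sum(degree[w] / freq[w] for w in ph) for ph in phrases}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```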
Lesson 4: Dimensionality reduction for topics: LSA (SVD), UMAP, t-SNE for visualization

This part covers dimensionality reduction for topic exploration. You'll use LSA via truncated SVD, UMAP, and t-SNE to project document or topic vectors, tune their parameters, and produce clear, trustworthy visualizations.

- LSA with truncated SVD for topic structure
- Choosing k and interpreting singular vectors
- UMAP parameters for global versus local structure
- t-SNE perplexity, learning rate, and iterations
- Visual encoding choices for topic scatterplots
- Pitfalls and validation of visual clusters
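A minimal scikit-learn sketch on a toy corpus: TF-IDF, then truncated SVD for LSA, then a 2-D t-SNE projection for plotting. The component count and perplexity here are placeholders to tune.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

docs = ["refund not received", "refund delayed again",
        "app crashes on start", "app freezes at login"] * 10  # toy corpus

X = TfidfVectorizer().fit_transform(docs)

# LSA: truncated SVD on the TF-IDF matrix; k is the number of latent dimensions.
lsa = TruncatedSVD(n_components=5, random_state=0)
Z = lsa.fit_transform(X)
print("explained variance:", lsa.explained_variance_ratio_.sum())

# t-SNE for a 2-D view; perplexity must stay below the number of documents.
xy = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(Z)
print(xy.shape)  # (40, 2) coordinates for a scatterplot
```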
Lesson 5: Word and sentence embeddings: Word2Vec, GloVe, FastText, Transformer embeddings (BERT variants)

This part surveys word and sentence embeddings, from Word2Vec, GloVe, and FastText to transformer models. You'll learn training, fine-tuning, pooling strategies, and how to choose embeddings for different analysis tasks.

- Word2Vec architectures and training settings
- GloVe co-occurrence matrices and hyperparameters
- FastText subword modeling and rare words
- Sentence pooling strategies for static embeddings
- Transformer embeddings and BERT variants
- Task-specific fine-tuning versus frozen encoders
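A minimal sketch assuming the gensim 4.x API and a toy corpus: train a skip-gram Word2Vec model, then mean-pool word vectors into a crude sentence embedding.

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [["battery", "drains", "fast"],
             ["screen", "flickers", "randomly"],
             ["battery", "dies", "overnight"]]

# sg=1 selects skip-gram; illustrative settings on a tiny corpus.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
                 sg=1, epochs=50)

def embed(tokens):
    # Mean pooling of in-vocabulary word vectors.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

print(embed(["battery", "drains"]).shape)  # (50,)
```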
Lesson 6: Neural topic approaches and BERTopic: clustering embeddings, topic merging and interpretability

This part introduces neural topic modeling, focusing on BERTopic. You'll cluster embeddings, reduce dimensionality, refine topics, merge or split clusters, and improve interpretability with representative terms and labels.

- Embedding selection and preprocessing for topics
- UMAP and HDBSCAN configuration in BERTopic
- Topic representation and c-TF-IDF weighting
- Merging, splitting, and pruning noisy topics
- Improving topic labels with domain knowledge
- Evaluating neural topics against LDA baselines
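A hedged sketch of a BERTopic pipeline, assuming the bertopic, umap-learn, and hdbscan packages; the configuration mirrors commonly published defaults and is illustrative, not tuned.

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data

# Lower n_neighbors favors local structure; min_cluster_size controls how
# small a topic is allowed to be.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=20, metric="euclidean",
                        cluster_selection_method="eom")

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # size and top terms per topic
print(topic_model.get_topic(0))             # c-TF-IDF weighted terms
```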
Lesson 7: Frequent pattern mining and association rules for co-occurring complaint terms

This part introduces frequent pattern mining and association rules for text. You'll convert documents into transactions, mine co-occurring complaint terms, tune support and confidence thresholds, and interpret rules for insights.

- Building term transactions from documents
- Choosing support and confidence thresholds
- Apriori and FP-Growth algorithm basics
- Interpreting association rules and lift
- Filtering spurious or redundant patterns
- Using patterns to refine taxonomies and alerts
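A sketch assuming the mlxtend package, with toy "term transactions" and illustrative thresholds that would need tuning on real data.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each transaction: the set of salient terms from one complaint document.
transactions = [
    ["refund", "delay", "support"],
    ["refund", "delay"],
    ["crash", "login"],
    ["crash", "login", "update"],
    ["refund", "support"],
]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

Lift above 1 flags term pairs that co-occur more often than independence would predict, which is what separates a real pattern from two merely common terms.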
Lesson 8: Unsupervised topic modeling: LDA configuration, coherence measures, number of topics tuning

This part covers unsupervised topic modeling with LDA. You'll configure priors and passes, optimize hyperparameters, apply coherence and perplexity measures, and design experiments to choose a topic count that balances interpretability and stability.

- Bag-of-words preparation and stopword control
- Dirichlet priors: alpha, eta, and sparsity
- Passes, iterations, and convergence diagnostics
- Topic coherence metrics and their variants
- Tuning number of topics with grid searches
- Stability checks and qualitative topic review
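A gensim sketch of the tuning loop: fit LDA at several topic counts and compare c_v coherence. The texts are toy data, and the priors and pass count are starting points rather than recommendations.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["refund", "delay", "support"], ["refund", "payment", "delay"],
         ["crash", "login", "update"], ["crash", "freeze", "login"]] * 5

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (2, 3, 4):
    lda = LdaModel(corpus, id2word=dictionary, num_topics=k,
                   alpha="auto", eta="auto", passes=10, random_state=0)
    cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v").get_coherence()
    print(k, round(cv, 3))  # pick k by coherence, then review topics by hand
```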
Lesson 9: Basic lexical features: token counts, character counts, unique token ratio, readability scores

This part focuses on basic lexical features for text analytics. You'll compute token and character counts, type-token ratios, and readability scores, and learn when simple features beat complex ones.

- Tokenization choices and token count features
- Character-level counts and length distributions
- Type–token ratio and vocabulary richness
- Stopword ratios and punctuation-based signals
- Readability indices and formula selection
- Combining lexical features with other signals
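A pure-Python sketch of these features; the syllable count is a rough vowel-group heuristic, so the Flesch reading-ease value it feeds is only approximate.

```python
import re

def lexical_features(text):
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    # Crude syllable estimate: count vowel groups per token, at least one.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", t))) for t in tokens)
    n = len(tokens)
    return {
        "token_count": n,
        "char_count": len(text),
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
        "avg_token_len": sum(map(len, tokens)) / n if n else 0.0,
        # Flesch reading ease: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
        "flesch_reading_ease": (206.835 - 1.015 * n / sentences
                                - 84.6 * syllables / n) if n else 0.0,
    }

print(lexical_features("The app crashed. I lost my unsaved work twice!"))
```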
Lesson 10: Annotation schema design for manual labels: issue types, sentiment, urgency, topic tags

This part explains designing annotation schemas for manual labeling. You'll define issue types, sentiment, urgency, and topic tags, write clear guidelines, handle ambiguous cases, and measure inter-annotator agreement to refine the schema iteratively.

- Defining label taxonomies and granularity
- Operationalizing sentiment and emotion labels
- Modeling urgency, impact, and priority levels
- Designing multi-label topic tag structures
- Writing annotation guidelines with examples
- Inter-annotator agreement and schema revision
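A minimal sketch of the agreement check, using scikit-learn's Cohen's kappa on toy issue-type labels from two annotators; real schema revision would look at per-label confusion as well.

```python
from sklearn.metrics import cohen_kappa_score

annotator_a = ["billing", "bug", "bug", "urgent", "billing", "bug"]
annotator_b = ["billing", "bug", "urgent", "urgent", "billing", "billing"]

# Kappa corrects raw agreement for chance; as a common rule of thumb,
# values above ~0.6 are often read as substantial agreement.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Low kappa on a specific label pair is usually a signal to sharpen the guideline wording or merge categories, not to retrain the annotators.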