Lesson 1
TF-IDF, hashing, and document embeddings: when to use each, and parameter choices
This section compares TF-IDF, hashing, and document embeddings for text representation. You will learn their strengths, weaknesses, and tuning strategies, and how to choose methods and parameters for search, clustering, and classification tasks.
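To make the contrast between a vocabulary-based and a hashed representation concrete, here is a minimal standard-library sketch. The function names `tfidf` and `hashed_counts`, the smoothed IDF formula, the whitespace tokenizer, and the bucket count of 16 are all assumptions chosen for illustration, not settings prescribed by the course.

```python
import math
import zlib
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (tf * smoothed idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def hashed_counts(doc, n_buckets=16):
    """Hashing trick: map tokens to a fixed number of buckets, no vocabulary."""
    vec = [0] * n_buckets
    for token in doc.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % n_buckets] += 1
    return vec


docs = ["late delivery refund", "late delivery again", "app crash"]
vectors = tfidf(docs)
hashed = hashed_counts(docs[0])
```

With a small `n_buckets`, different tokens can land in the same bucket (a collision), which is the trade-off the hashed representation accepts in exchange for a fixed feature space size.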
TF-IDF weighting schemes and normalization
The hashing trick, collisions, and feature space size
Choosing n-grams and vocabulary pruning rules
When sparse vectors outperform dense embeddings
Embedding dimensionality and pooling choices
Evaluating representations for downstream tasks

Lesson 2
N-gram extraction and selection: unigrams, bigrams, trigrams; frequency and PMI filtering
This section covers n-gram extraction and selection. You will generate unigrams, bigrams, and trigrams, apply frequency and PMI filters, and build robust vocabularies for models and exploratory analysis.
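As a rough illustration of sliding-window n-gram generation combined with a frequency and PMI filter, consider the sketch below. The helper names `ngrams` and `bigram_pmi`, the `min_count` default, and the toy token lists are assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_pmi(token_lists, min_count=2):
    """Score bigrams by PMI (log base 2), dropping those seen < min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(ngrams(tokens, 2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # frequency filter
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


token_lists = [
    ["credit", "card", "declined"],
    ["credit", "card", "fee"],
    ["card", "fee", "refund"],
]
scores = bigram_pmi(token_lists)
```

PMI tends to favor rare pairs, which is one reason a minimum frequency threshold is usually applied before ranking.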
Generating n-grams with sliding windows
Minimum frequency thresholds and cutoffs
PMI and other association measures for n-grams
Handling multiword expressions and phrases
Domain-specific stoplists and collocation filters
Evaluating n-gram sets on downstream tasks

Lesson 3
Keyphrase extraction: RAKE, YAKE, TextRank, and scoring/threshold selection
This section covers keyphrase extraction with RAKE, YAKE, and TextRank. You will learn preprocessing, scoring, threshold selection, and evaluation, and how to adapt the methods to domains such as support tickets or reviews.
Text preprocessing and candidate phrase generation
RAKE scoring, stoplists, and phrase length limits
YAKE features, window sizes, and language settings
TextRank graph construction and edge weighting
Score normalization and threshold calibration
Evaluating keyphrases with gold labels or experts

Lesson 4
Dimensionality reduction for topics: LSA (SVD), UMAP, t-SNE for visualization
This section covers dimensionality reduction for topic exploration. You will apply LSA with SVD, UMAP, and t-SNE to project document or topic vectors, tune parameters, and design clear, trustworthy visualizations.
LSA with truncated SVD for topic structure
Choosing k and interpreting singular vectors
UMAP parameters for global versus local structure
t-SNE perplexity, learning rate, and iterations
Visual encoding choices for topic scatterplots
Pitfalls and validation of visual clusters

Lesson 5
Word and sentence embeddings: Word2Vec, GloVe, FastText, Transformer embeddings (BERT variants)
This section explores word and sentence embeddings, from Word2Vec, GloVe, and FastText to transformer-based models. You will learn training, fine-tuning, pooling, and how to choose embeddings for different analytic tasks.
Word2Vec architectures and training settings
GloVe co-occurrence matrices and hyperparameters
FastText subword modeling and rare words
Sentence pooling strategies for static embeddings
Transformer embeddings and BERT variants
Task-specific fine-tuning versus frozen encoders

Lesson 6
Neural topic approaches and BERTopic: clustering embeddings, topic merging, and interpretability
This section presents neural topic approaches, focusing on BERTopic. You will cluster embeddings, reduce dimensionality, refine topics, merge or split clusters, and improve interpretability with representative terms and labels.
Choosing embeddings and preprocessing for topics
UMAP and HDBSCAN configuration in BERTopic
Topic representation and c-TF-IDF weighting
Merging, splitting, and pruning noisy topics
Improving topic labels with domain knowledge
Evaluating neural topics against LDA baselines

Lesson 7
Frequent pattern mining and association rules for co-occurring complaint terms
This section introduces frequent pattern mining and association rules for text. You will transform documents into transactions, mine co-occurring complaint terms, tune support and confidence, and interpret rules for insights.
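The idea of treating each document as a transaction and scoring term pairs by support, confidence, and lift can be sketched as follows. This is a brute-force pair counter, not Apriori or FP-Growth; the function name `pair_rules`, the threshold defaults, and the toy complaint terms are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence, lift) rules over single-term pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # one vote per term per document
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each pair can yield a rule in both directions.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                lift = confidence / (item_counts[rhs] / n)
                rules.append((lhs, rhs, support, confidence, lift))
    return rules


transactions = [
    {"refund", "late"},
    {"refund", "late", "rude"},
    {"late"},
    {"crash"},
]
rules = pair_rules(transactions)
```

A lift above 1 suggests the two terms co-occur more often than they would if they were independent; on a corpus this small, such values are only illustrative.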
Building term transactions from documents
Choosing support and confidence thresholds
Apriori and FP-Growth algorithm basics
Interpreting association rules and lift
Filtering spurious or redundant patterns
Using patterns to refine taxonomies and alerts

Lesson 8
Unsupervised topic modeling: LDA configuration, coherence measures, tuning the number of topics
This section introduces unsupervised topic modeling with LDA. You will configure priors, passes, and optimization, use coherence and perplexity, and design experiments to choose a number of topics that balances interpretability and stability.
Bag-of-words preparation and stopword control
Dirichlet priors: alpha, eta, and sparsity
Passes, iterations, and convergence diagnostics
Topic coherence metrics and their variants
Tuning the number of topics with grid searches
Stability checks and qualitative topic review

Lesson 9
Basic lexical features: token counts, character counts, unique token ratio, readability scores
This section focuses on basic lexical features for text analytics. You will compute token and character counts, type–token ratios, and readability scores, and learn when these simple features outperform more complex representations.
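A few of these features can be computed with nothing more than the standard library, as in the sketch below. The function name `lexical_features`, the regex tokenizer, and the naive sentence splitter are assumptions for illustration; a readability index is left out because it needs a syllable counter, with average tokens per sentence included as a simple stand-in.

```python
import re


def lexical_features(text):
    """Compute simple length and vocabulary-richness features for one text."""
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = len(tokens)
    return {
        "char_count": len(text),
        "token_count": n_tokens,
        "unique_token_ratio": len(set(tokens)) / n_tokens if n_tokens else 0.0,
        "avg_tokens_per_sentence": n_tokens / len(sentences) if sentences else 0.0,
    }


text = "The app crashed. The app crashed again!"
features = lexical_features(text)
```

The unique token ratio depends strongly on text length, so comparing it across documents of very different sizes can be misleading.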
Tokenization choices and token count features
Character-level counts and length distributions
Type–token ratio and vocabulary richness
Stopword ratios and punctuation-based signals
Readability indices and formula selection
Combining lexical features with other signals

Lesson 10
Annotation schema design for manual labels: issue types, sentiment, urgency, topic tags
This section explains how to design annotation schemas for manual labels. You will define issue types, sentiment, urgency, and topic tags, write clear guidelines, handle ambiguity, and assess agreement to refine the schema iteratively.
Defining label taxonomies and granularity
Operationalizing sentiment and emotion labels
Modeling urgency, impact, and priority levels
Designing multi-label topic tag structures
Writing annotation guidelines with examples
Inter-annotator agreement and schema revision
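One common way to quantify inter-annotator agreement for two annotators assigning a single label per item is Cohen's kappa, sketched below. The function name `cohen_kappa` and the toy labels are assumptions for illustration; multi-label tags or more than two annotators would need a different statistic.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["billing", "billing", "bug", "bug"]
annotator_2 = ["billing", "bug", "bug", "bug"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Low kappa on a particular label pair is often a sign that the guideline for those labels needs a clearer definition or more examples.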