Lesson 1: TF-IDF, hashing, and document embeddings: when to use each and parameter choices

This section compares TF-IDF, hashing, and document embeddings for text representation. You will learn their strengths, weaknesses, and tuning strategies, and how to choose a method and its parameters for search, clustering, and classification tasks.
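To make the comparison concrete, here is a minimal pure-Python sketch of smoothed TF-IDF weighting (the helper name, the smoothing variant, and the toy documents are illustrative, not from the course materials):

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF sketch for a list of tokenized documents.

    Term frequency is a raw count; IDF uses the common smoothing
    log((1 + N) / (1 + df)) + 1, so unseen-ish terms never blow up.
    """
    n = len(docs)
    df = Counter()                     # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights

docs = [["bank", "loan", "rate"], ["bank", "fee"], ["loan", "rate", "rate"]]
w = tfidf(docs)
# A rare term like "fee" gets a higher weight than widely shared terms.
```

Libraries such as scikit-learn expose the same idea with extra options (sublinear TF, L2 normalization); the sketch shows only the core weighting.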
- TF-IDF weighting schemes and normalization
- Hashing trick, collisions, and feature space size
- Choosing n-grams and vocabulary pruning rules
- When sparse vectors beat dense embeddings
- Embedding dimensionality and pooling choices
- Evaluating representations for downstream tasks

Lesson 2: N-gram extraction and selection: unigrams, bigrams, trigrams; frequency filters and PMI

This section explains n-gram extraction and selection. You will generate unigrams, bigrams, and trigrams, apply frequency filters and PMI, and build a robust vocabulary for modeling and downstream analysis.
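The PMI scoring described here can be sketched in a few lines (the toy token stream and the minimum-count parameter are illustrative assumptions):

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=1):
    """Score adjacent bigrams by pointwise mutual information:
    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ),
    estimated from unigram and adjacent-bigram counts in one token stream."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:              # frequency cutoff filters rare noise
            continue
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

tokens = "new york is big and new york is busy".split()
scores = bigram_pmi(tokens)
# ("new", "york") co-occur every time they appear, so their PMI is positive.
```

Raising `min_count` is the "minimum frequency threshold" from the topic list: PMI is unreliable for bigrams seen only once or twice.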
- Generating n-grams with sliding windows
- Minimum frequency thresholds and cutoffs
- PMI and other association measures for n-grams
- Handling multiword expressions and phrases
- Domain-specific stoplists and collocation filters
- Evaluating n-gram sets on downstream tasks

Lesson 3: Keyphrase extraction: RAKE, YAKE, TextRank, and score/threshold selection

This section covers keyphrase extraction with RAKE, YAKE, and TextRank. You will learn preprocessing, scoring, threshold selection, and evaluation, and how to adapt each method to a domain such as support tickets or reviews.
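As a taste of the RAKE approach, here is a simplified sketch: candidate phrases are runs of non-stopwords, each word is scored by degree/frequency, and a phrase scores the sum of its word scores. The stopword list and example sentence are illustrative, and full RAKE also breaks phrases at punctuation, which is omitted here for brevity:

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "was", "very"}

def rake_scores(text):
    """Simplified RAKE-style keyphrase scoring."""
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:             # stopwords delimit candidate phrases
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)   # co-occurrence degree within the phrase

    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

scores = rake_scores("The keyboard backlight is broken and the keyboard keys stick.")
# Longer multi-word candidates outrank isolated common words.
```

A threshold on these scores (one of the tuning knobs this lesson discusses) then decides how many candidates to keep.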
- Text preprocessing and candidate phrase generation
- RAKE scoring, stoplists, and phrase length limits
- YAKE features, window sizes, and language settings
- TextRank graph construction and edge weighting
- Score normalization and threshold calibration
- Evaluating keyphrases with gold labels or experts

Lesson 4: Dimensionality reduction for topics: LSA (SVD), UMAP, and t-SNE for exploration

This section covers dimensionality reduction for topic exploration. You will use LSA (via SVD), UMAP, and t-SNE to project document or topic vectors, tune their parameters, and design clear, trustworthy visualizations.
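The LSA step can be sketched with NumPy's SVD on a toy term-document matrix (assuming NumPy is available; the matrix values are made up to show two clean "topics"):

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Sports terms dominate docs 0-1, finance terms dominate docs 2-3.
X = np.array([
    [3, 2, 0, 0],   # "game"
    [2, 3, 0, 0],   # "team"
    [0, 0, 3, 2],   # "loan"
    [0, 0, 2, 3],   # "rate"
], dtype=float)

# LSA via truncated SVD: keep the top-k singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] * s[:k] @ Vt[:k, :]            # best rank-k reconstruction
doc_coords = (s[:k][:, None] * Vt[:k, :]).T   # documents in latent space

# Documents 0 and 1 land together in latent space, far from documents 2 and 3,
# and the reconstruction error equals the discarded singular values' energy.
```

Choosing `k` is the "choosing k" item in the topic list: larger `k` lowers reconstruction error but adds harder-to-interpret components.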
- LSA with truncated SVD for topic structure
- Choosing k and interpreting singular vectors
- UMAP parameters for global versus local structure
- t-SNE perplexity, learning rate, and iterations
- Visual encoding choices for topic scatterplots
- Pitfalls and validation of visual clusters

Lesson 5: Word and sentence embeddings: Word2Vec, GloVe, FastText, and Transformer embeddings (BERT variants)

This section explores word and sentence embeddings, from Word2Vec, GloVe, and FastText to transformer models. You will learn about training, fine-tuning, and pooling, and how to choose embeddings for different analysis tasks.
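Pooling static word vectors into a sentence vector is one of the choices this lesson covers; a minimal sketch, where the three-dimensional "pretrained" vectors are invented for illustration:

```python
import numpy as np

# Toy static word vectors (stand-ins for pretrained embeddings).
vectors = {
    "refund": np.array([0.9, 0.1, 0.0]),
    "delayed": np.array([0.2, 0.8, 0.1]),
    "shipping": np.array([0.1, 0.7, 0.3]),
}

def pool(tokens, how="mean"):
    """Mean vs. max pooling of word vectors into one sentence vector."""
    mat = np.stack([vectors[t] for t in tokens if t in vectors])
    # mean: every token contributes; max: keeps the strongest per-dimension signal
    return mat.mean(axis=0) if how == "mean" else mat.max(axis=0)

sent = ["refund", "delayed"]
mean_vec = pool(sent, "mean")
max_vec = pool(sent, "max")
```

Transformer encoders add a third option, using the [CLS] token or attention-weighted pooling, but the mean/max trade-off above already shows why the choice matters.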
- Word2Vec architectures and training settings
- GloVe co-occurrence matrices and hyperparameters
- FastText subword modeling and rare words
- Sentence pooling strategies for static embeddings
- Transformer embeddings and BERT variants
- Task-specific fine-tuning versus frozen encoders

Lesson 6: Neural topic methods and BERTopic: combining embeddings, merging topics, and interpretation

This section presents neural topic methods, focusing on BERTopic. You will combine embeddings, reduce dimensionality, refine topics, merge or split clusters, and improve interpretability with representative terms and labels.
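BERTopic's representative terms come from class-based TF-IDF; a sketch of one common formulation (treat each cluster as one big document; the toy clusters and exact smoothing are illustrative assumptions, not BERTopic's internals verbatim):

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """Class-based TF-IDF sketch: weight term t in cluster c by
    tf(t, c) * log(1 + avg_cluster_size / total_freq(t))."""
    counts = [Counter(tokens) for tokens in clusters]
    total = Counter()
    for c in counts:
        total.update(c)
    avg_size = sum(sum(c.values()) for c in counts) / len(counts)
    return [
        {t: tf * math.log(1 + avg_size / total[t]) for t, tf in c.items()}
        for c in counts
    ]

clusters = [
    ["battery", "battery", "drain", "phone"],   # cluster 0: battery complaints
    ["screen", "crack", "phone", "screen"],     # cluster 1: screen complaints
]
weights = c_tf_idf(clusters)
# "battery" is distinctive for cluster 0; "phone" is shared, so it scores lower.
```

Terms with the highest weight per cluster become that topic's label candidates, which is the starting point for the merging and relabeling steps this lesson covers.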
- Embedding selection and preprocessing for topics
- UMAP and HDBSCAN configuration in BERTopic
- Topic representation and c-TF-IDF weighting
- Merging, splitting, and pruning noisy topics
- Improving topic labels with domain knowledge
- Evaluating neural topics against LDA baselines

Lesson 7: Frequent pattern mining and association rules for co-occurring complaint terms

This section introduces frequent pattern mining and association rules for text. You will convert documents into transactions, mine co-occurring complaint terms, tune support and confidence, and interpret the resulting rules into insights.
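The three rule metrics this lesson tunes can be computed directly once documents are reduced to term sets (the toy transactions below are invented complaint terms):

```python
transactions = [
    {"refund", "late", "delivery"},
    {"refund", "late"},
    {"late", "delivery"},
    {"refund", "charge"},
]

def rule_stats(lhs, rhs, transactions):
    """Support, confidence, and lift for the rule lhs -> rhs
    over set-valued transactions (a minimal sketch)."""
    n = len(transactions)
    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    lhs_n = sum(1 for t in transactions if lhs <= t)
    rhs_n = sum(1 for t in transactions if rhs <= t)
    support = both / n                  # how common the pattern is overall
    confidence = both / lhs_n           # P(rhs | lhs)
    lift = confidence / (rhs_n / n)     # vs. rhs's base rate; >1 means "associated"
    return support, confidence, lift

sup, conf, lift = rule_stats({"refund"}, {"late"}, transactions)
# support 0.5, confidence 2/3, lift 8/9: "refund" does not raise the
# chance of "late" above its base rate in this toy data.
```

Apriori and FP-Growth exist to avoid scanning all itemsets; the metrics themselves are this simple.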
- Building term transactions from documents
- Choosing support and confidence thresholds
- Apriori and FP-Growth algorithm basics
- Interpreting association rules and lift
- Filtering spurious or redundant patterns
- Using patterns to refine taxonomies and alerts

Lesson 8: Unsupervised topic modeling: LDA structure, coherence metrics, and tuning the number of topics

This section introduces unsupervised topic modeling with LDA. You will set priors, passes, and optimization options, use coherence and perplexity, and design experiments to choose a number of topics that balances interpretability and stability.
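Of the coherence metrics mentioned here, UMass coherence is simple enough to sketch from scratch (the toy corpus and word lists are illustrative; library implementations differ in smoothing and windowing details):

```python
import math

def umass_coherence(topic_words, docs):
    """UMass coherence sketch: for an ordered list of top topic words,
    sum log((D(w_i, w_j) + 1) / D(w_j)) over pairs with j < i,
    where D counts documents containing the given word(s)."""
    doc_sets = [set(d) for d in docs]

    def d(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((d(wi, wj) + 1) / d(wj))
    return score

docs = [["cat", "dog", "pet"], ["cat", "dog"], ["stock", "bond"], ["cat", "pet"]]
good = umass_coherence(["cat", "dog"], docs)   # top words that co-occur often
bad = umass_coherence(["cat", "bond"], docs)   # top words that never co-occur
# good > bad: coherent topics have top words that appear together.
```

Sweeping the number of topics and plotting coherence per run is the standard tuning experiment this lesson designs.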
- Bag-of-words preparation and stopword control
- Dirichlet priors: alpha, eta, and sparsity
- Passes, iterations, and convergence diagnostics
- Topic coherence metrics and their variants
- Tuning number of topics with grid searches
- Stability checks and qualitative topic review

Lesson 9: Basic lexical features: token counts, character counts, type-token ratio, and readability scores

This section focuses on basic lexical features for text analysis. You will compute token and character counts, the type-token ratio, and readability scores, and learn when these simple features outperform more complex representations.
- Tokenization choices and token count features
- Character-level counts and length distributions
- Type-token ratio and vocabulary richness
- Stopword ratios and punctuation-based signals
- Readability indices and formula selection
- Combining lexical features with other signals

Lesson 10: Designing a labeling schema for manual annotation: problem types, sentiment, urgency, and topic tags

This section explains how to design labeling schemas for manual annotation. You will define problem types, sentiment, urgency, and topic tags, write clear guidelines, handle ambiguity, and measure agreement to iteratively improve the schema.
- Defining label taxonomies and granularity
- Operationalizing sentiment and emotion labels
- Modeling urgency, impact, and priority levels
- Designing multi-label topic tag structures
- Writing annotation guidelines with examples
- Inter-annotator agreement and schema revision
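The agreement measurement mentioned above is often Cohen's kappa; a minimal sketch for two annotators, with invented labels for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators on the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # chance agreement from each annotator's marginal label distribution
    chance = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - chance) / (1 - chance)

a = ["bug", "bug", "billing", "bug", "billing", "bug"]
b = ["bug", "billing", "billing", "bug", "billing", "bug"]
kappa = cohens_kappa(a, b)
# kappa = 2/3 here: solid agreement beyond chance, but the one disagreement
# is exactly the kind of case a guideline revision should clarify.
```

Low kappa on a specific label pair is a concrete signal that two categories in the schema overlap and need sharper definitions or merged treatment.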