1 Guide to Combining Distinctiveness Metrics
1.1 Understanding the Three Metrics
- count_distinct: Based on raw word frequency (words per million)
  - Best for: Overall prominence in the discourse
  - Sensitive to: Repeated use in individual articles
- appear_distinct: Based on document dispersion (proportion of articles)
  - Best for: Breadth of usage across the journal
  - Sensitive to: Widely used terms vs. niche terms
- topic_distinct: Based on concentrated usage (articles with 10+ uses)
  - Best for: Identifying topical keywords and technical terms
  - Sensitive to: Terms that are central when they appear
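As a rough illustration of how one of these log-ratio scores might be formed, the sketch below compares a word's rate in a target corpus against a reference corpus. The helper name `log_ratio` and the example rates are assumptions made for illustration, not the pipeline's actual code. The base-2 logarithm is chosen so that a score of +1 means "twice as frequent".

```r
# Hypothetical helper: log2 ratio of a word's rate in a target corpus
# versus a reference corpus (not taken from the original analysis code).
log_ratio <- function(target_rate, reference_rate) {
  log2(target_rate / reference_rate)
}

# Made-up rates for a single word, for illustration only
count_distinct  <- log_ratio(120, 30)      # words per million
appear_distinct <- log_ratio(0.20, 0.10)   # proportion of articles
topic_distinct  <- log_ratio(0.04, 0.04)   # proportion of articles with 10+ uses

c(count = count_distinct, appear = appear_distinct, topic = topic_distinct)
```

A real pipeline would also need some handling for words with a zero rate in either corpus, which this sketch leaves out.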
1.2 Combination Methods
1.2.1 Simple Mean (combined_distinct_mean)
- Formula: Average of the three log-ratio scores
- Best for: When all three metrics are equally important
- Advantage: Simple, interpretable
- Limitation: Assumes metrics are on comparable scales
1.2.2 Standardized Mean (combined_distinct_standardized)
- Formula: Average of z-scores for each metric
- Best for: When metrics have different variances
- Advantage: Ensures each metric contributes equally
- Limitation: More abstract scale (in standard deviations)
- RECOMMENDED for most analyses
1.2.3 Minimum Score (combined_distinct_min)
- Formula: Minimum of the three scores
- Best for: Finding words distinctive on ALL dimensions
- Advantage: Conservative, high confidence
- Limitation: May miss words strong on just one dimension
1.2.4 Weighted Combination
- Formula: Custom weights (e.g., 0.5*count + 0.3*appear + 0.2*topic)
- Best for: When you have theoretical reasons to prioritize metrics
- Advantage: Flexible, domain-specific
- Limitation: Requires justification for weights
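The four combination methods can be sketched in base R as follows. The toy values are invented. The column name `combined_distinct_weighted` and the weights are assumptions for illustration. The other column names follow the ones used in this guide.

```r
# Toy data frame; the values are invented for illustration only
scores <- data.frame(
  word            = c("alpha", "beta", "gamma", "delta"),
  count_distinct  = c(2.0, 0.5, -1.0, 1.5),
  appear_distinct = c(1.0, 0.2, -0.5, 1.8),
  topic_distinct  = c(3.0, -0.4, 0.1, 0.9)
)
metrics <- c("count_distinct", "appear_distinct", "topic_distinct")

# Simple mean of the three log-ratio scores
scores$combined_distinct_mean <- rowMeans(scores[, metrics])

# Standardized mean: z-score each metric across words, then average
z <- scale(scores[, metrics])
scores$combined_distinct_standardized <- rowMeans(z)

# Minimum score: the weakest of the three metrics for each word
scores$combined_distinct_min <- pmin(
  scores$count_distinct, scores$appear_distinct, scores$topic_distinct
)

# Weighted combination (hypothetical column name and example weights)
w <- c(count_distinct = 0.5, appear_distinct = 0.3, topic_distinct = 0.2)
scores$combined_distinct_weighted <-
  as.vector(as.matrix(scores[, metrics]) %*% w)

scores
```

Note that `scale()` standardizes against whatever rows are in the data frame, so z-scores change if the word list is filtered first.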
1.3 Choosing the Right Method
1.3.1 Use standardized mean if:
- You want each metric to contribute equally
- You’re exploring the data without strong priors
- You want the most balanced results
- This is the default recommendation
1.3.2 Use simple mean if:
- The three metrics have similar distributions
- You’ve already checked that scales are comparable
- You prefer simpler, more interpretable scores
1.3.3 Use minimum score if:
- You want to be very conservative
- You’re looking for “slam dunk” distinctive words
- You want words that are distinctive in all ways
1.3.4 Use weighted combination if:
- You have theoretical reasons (e.g., “count” matters more for your question)
- You’re testing specific hypotheses
- You want to emphasize breadth (appear) over intensity (topic)
1.4 Practical Recommendations
1. Start with standardized mean - it’s the most robust default
2. Check correlations between the three metrics:
   - High correlation (>0.8): metrics mostly agree, simple mean is fine
   - Low correlation (<0.5): metrics capture different aspects, standardized is better
3. Filter by minimum frequency to avoid rare words with extreme ratios:
   filter(count >= 50, appear >= 10)
4. Compare results from different methods:
   - If top 20 words are mostly the same: methods agree, any is fine
   - If top 20 words differ substantially: methods capture different things, choose based on your research question
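The correlation check, frequency filter, and top-20 comparison described above might look like this in base R. All data here are simulated. The helper `top_words` is hypothetical, and `count` and `appear` are assumed to be raw occurrence and article counts.

```r
set.seed(1)  # reproducible toy data; every value below is simulated
n <- 200
base <- rnorm(n)
words <- data.frame(
  word            = paste0("w", seq_len(n)),
  count           = rpois(n, 80),   # raw occurrences (assumed meaning)
  appear          = rpois(n, 15),   # number of articles (assumed meaning)
  count_distinct  = base + rnorm(n, sd = 0.5),
  appear_distinct = base + rnorm(n, sd = 0.5),
  topic_distinct  = 2 * base + rnorm(n, sd = 1)
)
metrics <- c("count_distinct", "appear_distinct", "topic_distinct")

# Pairwise correlations between the three metrics
round(cor(words[, metrics]), 2)

# Drop rare words before ranking (base-R equivalent of the filter() call)
kept <- words[words$count >= 50 & words$appear >= 10, ]

# Score the remaining words with two combination methods
kept$combined_distinct_mean <- rowMeans(kept[, metrics])
kept$combined_distinct_standardized <- rowMeans(scale(kept[, metrics]))

# Hypothetical helper: the k highest-scoring words on a given column
top_words <- function(df, column, k = 20) {
  ranked <- df$word[order(df[[column]], decreasing = TRUE)]
  ranked[seq_len(min(k, nrow(df)))]
}
top_mean <- top_words(kept, "combined_distinct_mean")
top_std  <- top_words(kept, "combined_distinct_standardized")

# Share of the top 20 that both methods agree on (1 = identical sets)
overlap <- length(intersect(top_mean, top_std)) / length(top_mean)
overlap
```

An overlap near 1 suggests the choice of method matters little for the top of the list. A low overlap is a cue to pick the method that matches the research question.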
1.5 Interpreting the Scores
1.5.1 Log-ratio interpretation (for individual metrics):
- +2: Word is 4× more frequent in PS than T20 (or this decade vs others)
- +1: Word is 2× more frequent
- 0: Word has equal frequency
- -1: Word is half as frequent (2× less)
- -2: Word is one quarter as frequent (4× less)
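These values follow from treating the scores as base-2 logarithms, which is what the 2× and 4× figures imply. Under that assumption, any score can be turned back into a frequency ratio by raising 2 to its power:

```r
# Convert log2-ratio scores back to frequency ratios
log_ratio_scores <- c(2, 1, 0, -1, -2)
fold_change <- 2^log_ratio_scores

data.frame(score = log_ratio_scores, fold_change = fold_change)
# fold_change column: 4, 2, 1, 0.5, 0.25

# Fractional scores work the same way
2^1.5   # roughly 2.83 times as frequent
```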
1.5.2 Combined distinctiveness:
- Positive scores: word is overrepresented in PS and/or this decade
- Negative scores: word is underrepresented in PS and/or this decade
- Magnitude: strength of the effect (larger absolute value = more distinctive)
1.5.3 Z-scores (standardized):
- >2: Highly distinctive (top ~2.3% if scores are roughly normal)
- >1: Moderately distinctive (top ~16% if scores are roughly normal)
- >0: Above average distinctiveness
- <0: Below average distinctiveness
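The percentages attached to z-score cut-offs hold only when the scores are roughly normally distributed. A minimal sketch comparing the theoretical tail proportions with what a skewed distribution produces (the simulated data are illustrative, not from the corpus):

```r
# Upper-tail proportions under a standard normal distribution
tail_2 <- 1 - pnorm(2)   # about 0.023
tail_1 <- 1 - pnorm(1)   # about 0.159

# Empirical check on simulated, right-skewed raw scores
set.seed(42)
raw <- rexp(1000)
z <- as.vector(scale(raw))

# Observed share of scores above each cut-off; these can differ
# noticeably from the normal-theory values when the data are skewed
c(above_2 = mean(z > 2), above_1 = mean(z > 1))
```

Running the same `mean(z > 2)` check on the real standardized scores shows how many words a given cut-off actually selects.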
1.6 Example Research Questions → Method Choice
Q: What philosophical terms became trendy in specific decades? → Use standardized mean, focus on temporal_ratio component
Q: What makes PS different from other journals generally? → Use standardized mean, focus on spatial_ratio component
Q: What are the signature keywords of PS in each era? → Use standardized mean with frequency filter (count >= 50)
Q: Which technical terms are uniquely central to PS? → Use topic_distinct metric specifically, or minimum score
Q: Which concepts have broadest adoption in PS? → Use appear_distinct metric specifically
1.7 Diagnostic Checks
Run these to validate your approach:
- Check for NA values:
  sum(is.na(combined_distinctiveness$combined_distinct_standardized))
- Check distribution:
  summary(combined_distinctiveness$combined_distinct_standardized)
- Check correlations:
  cor(combined_distinctiveness[, c("count_distinct", "appear_distinct", "topic_distinct")])
- Check outliers: Look for abs(score) > 5 (very extreme)
- Spot-check top words: Do they make sense for your domain knowledge?
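The outlier and spot-check steps can be sketched as follows. The data frame here is a simulated stand-in for `combined_distinctiveness`, with one extreme score and one NA planted deliberately. The `word` column name is an assumption.

```r
# Simulated stand-in for the real data frame (values are not real results)
set.seed(7)
combined_distinctiveness <- data.frame(
  word = paste0("w", 1:100),
  combined_distinct_standardized = c(rnorm(98), 6.2, NA)
)
score <- combined_distinctiveness$combined_distinct_standardized

# Flag extreme scores; which() drops NA comparisons instead of returning NA rows
outliers <- combined_distinctiveness[which(abs(score) > 5), ]
outliers

# Top 10 words for a manual spot-check, with NA scores sorted last
ranked <- combined_distinctiveness[order(score, decreasing = TRUE, na.last = TRUE), ]
top10 <- head(ranked, 10)
top10
```

Extreme scores often trace back to very rare words, so it is worth checking flagged rows against the frequency filter before interpreting them.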