1 Guide to Combining Distinctiveness Metrics

1.1 Understanding the Three Metrics

count_distinct: Based on raw word frequency (words per million)
- Best for: Overall prominence in the discourse
- Sensitive to: Repeated use in individual articles
appear_distinct: Based on document dispersion (proportion of articles)
- Best for: Breadth of usage across the journal
- Sensitive to: Widely used terms vs. niche terms
topic_distinct: Based on concentrated usage (articles with 10+ uses)
- Best for: Identifying topical keywords and technical terms
- Sensitive to: Terms that are central when they appear
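
The raw quantities behind the three metrics can be illustrated on a toy corpus. This is a minimal sketch under stated assumptions, not the pipeline behind this guide: the `tokens` data frame and its column names are hypothetical, and the step that turns each quantity into a distinctiveness ratio against a comparison corpus is left out.

```r
# Hypothetical toy data: one row per (article, word) pair with a usage count.
tokens <- data.frame(
  article = c("a1", "a1", "a2", "a2", "a3"),
  word    = c("causation", "model", "causation", "model", "model"),
  n       = c(12, 3, 1, 2, 40)
)

n_articles  <- length(unique(tokens$article))
total_words <- sum(tokens$n)
topic_min   <- 10  # the "10+ uses" threshold mentioned in the text

per_word <- lapply(split(tokens, tokens$word), function(d) {
  data.frame(
    word        = d$word[1],
    per_million = sum(d$n) / total_words * 1e6,            # raw frequency
    prop_docs   = length(unique(d$article)) / n_articles,  # dispersion
    topic_docs  = sum(d$n >= topic_min)                    # concentrated usage
  )
})
base_quantities <- do.call(rbind, per_word)
print(base_quantities)
```

In this toy corpus "model" is frequent mainly because one article uses it 40 times, which is the kind of repeated use that inflates a frequency-based score but not a dispersion-based one.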

Start with the standardized mean; it is the most robust default.
Check correlations between the three metrics:
- High correlation (>0.8): the metrics mostly agree, so a simple mean is fine
- Low correlation (<0.5): the metrics capture different aspects, so the standardized mean is the better choice
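
The correlation check and the two kinds of mean can be sketched as follows. The sketch assumes that the standardized mean is the average of the three metrics after each has been converted to z-scores. That definition, the toy values, and the column names `simple_mean` and `standardized_mean` are illustrative assumptions rather than the guide's own implementation.

```r
# Toy scores; real values would come from your own distinctiveness table.
scores <- data.frame(
  word            = c("realism", "bayesian", "the", "mechanism", "kuhn"),
  count_distinct  = c(2.1, 3.4, 1.0, 2.8, 4.0),
  appear_distinct = c(1.8, 2.0, 1.0, 2.5, 1.5),
  topic_distinct  = c(3.0, 6.5, 0.9, 4.1, 9.0)
)
metrics <- c("count_distinct", "appear_distinct", "topic_distinct")

# Pairwise Pearson correlations between the three metrics.
cor_mat <- cor(scores[, metrics], use = "pairwise.complete.obs")
min_cor <- min(cor_mat[upper.tri(cor_mat)])

# Simple mean: dominated by whichever metric has the widest range.
scores$simple_mean <- rowMeans(scores[, metrics])

# Standardized mean (assumed definition): z-score each metric, then average.
z <- scale(scores[, metrics])
scores$standardized_mean <- rowMeans(z)

# 0.8 and 0.5 are the rules of thumb from the text; the text gives no rule
# for values in between, so the middle branch is an assumption.
advice <- if (min_cor > 0.8) {
  "simple mean is fine"
} else if (min_cor < 0.5) {
  "prefer standardized mean"
} else {
  "intermediate: compare both"
}
print(round(cor_mat, 2))
print(advice)
```

Using the lowest pairwise correlation is a conservative choice: a single weakly related pair is enough to suggest the metrics are not interchangeable.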
Filter by minimum frequency to avoid rare words with extreme ratios:
```
dplyr::filter(combined_distinctiveness, count >= 50, appear >= 10)
```
Compare results from different methods:
- If the top 20 words are mostly the same, the methods agree and any of them is fine
- If the top 20 words differ substantially, the methods capture different things, so choose based on your research question
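
A rough way to quantify how much two methods agree is the share of words their top-n lists have in common. The helper `top_n_overlap()` and the columns `method_1` and `method_2` below are hypothetical names used only for illustration.

```r
# Share of words common to the top-n lists produced by two scoring columns.
top_n_overlap <- function(df, col_a, col_b, n = 20) {
  n <- min(n, nrow(df))
  top_a <- df$word[order(df[[col_a]], decreasing = TRUE)][seq_len(n)]
  top_b <- df$word[order(df[[col_b]], decreasing = TRUE)][seq_len(n)]
  length(intersect(top_a, top_b)) / n
}

# Toy example with made-up scores from two combination methods.
toy <- data.frame(
  word     = c("a", "b", "c", "d", "e", "f"),
  method_1 = c(6, 5, 4, 3, 2, 1),
  method_2 = c(6, 4, 5, 1, 2, 3)
)
overlap_top3 <- top_n_overlap(toy, "method_1", "method_2", n = 3)
overlap_top2 <- top_n_overlap(toy, "method_1", "method_2", n = 2)
print(c(top3 = overlap_top3, top2 = overlap_top2))
```

What counts as "mostly the same" is a judgment call; the overlap share only makes the comparison explicit.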

Q: What philosophical terms became trendy in specific decades? → Use standardized mean, focus on temporal_ratio component

Q: What makes PS different from other journals generally? → Use standardized mean, focus on spatial_ratio component

Q: What are the signature keywords of PS in each era? → Use standardized mean with frequency filter (count >= 50)

Q: Which technical terms are uniquely central to PS? → Use topic_distinct metric specifically, or minimum score

Q: Which concepts have broadest adoption in PS? → Use appear_distinct metric specifically
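
For the last two questions, ranking by a single metric or by a minimum score might look like the sketch below. It assumes the minimum score is the lowest of the three metrics for each word; that definition, the helper `rank_by()`, and the toy values are assumptions made for illustration.

```r
scores <- data.frame(
  word            = c("realism", "bayesian", "mechanism", "kuhn"),
  count_distinct  = c(2.1, 3.4, 2.8, 4.0),
  appear_distinct = c(1.8, 2.0, 2.5, 1.5),
  topic_distinct  = c(3.0, 6.5, 4.1, 9.0)
)

# Minimum score (assumed definition): the lowest of the three metrics, so a
# word ranks highly only if every metric rates it as distinctive.
scores$min_score <- pmin(scores$count_distinct,
                         scores$appear_distinct,
                         scores$topic_distinct)

# Words ordered from highest to lowest on the chosen column.
rank_by <- function(df, col) df$word[order(df[[col]], decreasing = TRUE)]

by_topic  <- rank_by(scores, "topic_distinct")   # technical, central terms
by_appear <- rank_by(scores, "appear_distinct")  # broadest adoption
by_min    <- rank_by(scores, "min_score")        # distinctive on all three
print(list(topic = by_topic, appear = by_appear, minimum = by_min))
```

If the three metrics are on very different scales, taking the minimum of standardized values rather than raw values may be more meaningful.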

Run these to validate your approach:

Check for NA values: sum(is.na(combined_distinctiveness$combined_distinct_standardized))
Check distribution: summary(combined_distinctiveness$combined_distinct_standardized)
Check correlations: cor(combined_distinctiveness[, c("count_distinct", "appear_distinct", "topic_distinct")], use = "pairwise.complete.obs")
Check outliers: sum(abs(combined_distinctiveness$combined_distinct_standardized) > 5, na.rm = TRUE) (scores beyond 5 in absolute value are very extreme)
Spot-check top words: Do they make sense for your domain knowledge?
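
The automated checks above can be bundled into one helper. This is a sketch only: `validate_scores()` is a hypothetical function, and the toy data frame stands in for `combined_distinctiveness`.

```r
# Collect the NA count, distribution, correlations, and extreme rows.
validate_scores <- function(df, score_col, metric_cols, extreme = 5) {
  s <- df[[score_col]]
  list(
    n_na         = sum(is.na(s)),
    distribution = summary(s),
    correlations = cor(df[, metric_cols], use = "pairwise.complete.obs"),
    outliers     = df[!is.na(s) & abs(s) > extreme, , drop = FALSE]
  )
}

# Toy stand-in for the real table, with one NA and one extreme score.
toy <- data.frame(
  word            = c("a", "b", "c", "d"),
  count_distinct  = c(1, 2, 3, 4),
  appear_distinct = c(1, 2, 2, 5),
  topic_distinct  = c(0, 3, 1, 8),
  combined_distinct_standardized = c(-0.4, 0.1, NA, 6.2)
)

checks <- validate_scores(
  toy,
  score_col   = "combined_distinct_standardized",
  metric_cols = c("count_distinct", "appear_distinct", "topic_distinct")
)
print(checks$n_na)
print(checks$outliers$word)
```

The spot-check of top words still has to be done by hand, since it depends on domain knowledge.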