A seemingly obvious criterion for selecting terms for a terminology database is frequency of occurrence. A term extraction program, for example, should tell us how often a term appears in the text mined. Term extraction output or other text-mining solutions might also tell us how a term is distributed; in other words, we may be able to find out in how many documents or products a term occurs.
When sifting through term candidates in term-mining output, we very likely have to narrow the scope quite a bit, because we can’t spend weeks on making perfect term selections. As we know by now, frequency is not the only term selection criterion, but it can help us, particularly in large projects. Here are the options with their pros and cons:
Option | Pros | Cons | Recommendation
Ignore frequency and evaluate all term candidates | More precise selection because nothing is excluded | High time investment | Good for small lists; never completely ignore frequency, as it can still tell us something about the importance of a term |
Exclude all terms that occur fewer than x times | Number of term candidates is smaller | Potential to miss critical terms | Good for larger lists and when a critical percentage of terms has already been extracted manually
Exclude all terms that occur more than y times | Number of term candidates is smaller | Potential to miss critical terms | Good for large lists from which existing database terms or other non-critical terms or words were not excluded
Only go through terms that occur more than x and fewer than y times | Number of terms can be reduced significantly | High potential to miss critical terms | Good when critical terms have already been extracted and no stop-word list was used
If a term occurs often in a project, it is probably either very important or so generic that it shouldn’t be included. If you run a term extraction process, such generic words should not be part of the resulting list; they should be part of a stop-word list.
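To make the options above more concrete, here is a minimal Python sketch of a frequency filter. It is only an illustration, not how any particular term extraction tool works: the function name `filter_candidates`, the parameters `min_count` and `max_count` (standing in for x and y), and the tiny stop-word list are all assumptions made for this example. The thresholds are treated as inclusive here, which is one possible reading of the table.

```python
from collections import Counter

# Hypothetical stop-word list; a real project would use a much longer one.
STOP_WORDS = {"the", "a", "of", "and", "to"}


def filter_candidates(candidates, min_count=None, max_count=None,
                      stop_words=STOP_WORDS):
    """Count term candidates and keep those inside the frequency band.

    min_count plays the role of x (drop terms that occur fewer times),
    max_count plays the role of y (drop terms that occur more often).
    Passing None for a threshold disables that side of the filter.
    """
    counts = Counter(term.lower() for term in candidates)
    kept = {}
    for term, count in counts.items():
        if term in stop_words:
            continue  # generic words belong on the stop-word list
        if min_count is not None and count < min_count:
            continue  # too rare for this pass
        if max_count is not None and count > max_count:
            continue  # so frequent that it is probably generic
        kept[term] = count
    return kept


# Made-up candidate list, one entry per occurrence in the mined text.
candidates = ["the"] * 10 + ["user"] * 8 + ["ribbon"] * 3 + ["glyph cache"]
print(filter_candidates(candidates, min_count=2, max_count=5))
# prints {'ribbon': 3}
```

Note that the rare candidate "glyph cache" is dropped along with the generic "user". That is exactly the risk of missing critical terms listed in the table, so the thresholds deserve a second look before anything is discarded for good.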
Certain term-mining solutions or lookup tools also indicate in which project or in which version and product a particular term is used. In other words, they give us information about the distribution of a term. But wide distribution, just like high frequency, may be characteristic of terms that are very well known and do not need to be documented. For example, at Microsoft it would seem useless to include terms such as computer or user just because they occur frequently and are widely distributed. There are other reasons to include them, though. By the same token, a widely distributed and highly frequent term that is somewhat mysterious should be included in the terminology database, as many users might need to look it up and the return on investment is there.
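If a tool does not report distribution, a rough figure can be computed from the source texts themselves. The following Python sketch assumes that the documents are available as plain strings keyed by a name; the function `term_distribution` and the sample texts are invented for illustration. It uses naive substring matching, so "user" would also match "users", and a real solution would need proper tokenization.

```python
def term_distribution(documents, terms):
    """Map each term to (total occurrences, number of documents containing it).

    documents: dict of document name -> text (hypothetical input format).
    Matching is a case-insensitive substring count, which is an assumption
    made to keep the sketch short.
    """
    stats = {}
    for term in terms:
        needle = term.lower()
        per_doc = [text.lower().count(needle) for text in documents.values()]
        frequency = sum(per_doc)
        distribution = sum(1 for n in per_doc if n > 0)
        stats[term] = (frequency, distribution)
    return stats


# Invented sample documents.
documents = {
    "doc1": "The user opens the ribbon.",
    "doc2": "Each user has a profile.",
    "doc3": "The ribbon is hidden.",
}
print(term_distribution(documents, ["user", "ribbon", "profile"]))
# prints {'user': (2, 2), 'ribbon': (2, 2), 'profile': (1, 1)}
```

The two numbers only rank candidates. Whether a frequent, widely distributed term is well known like "user" or somewhat mysterious like "ribbon" remains a judgment the terminologist has to make.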
To summarize, frequency and distribution are important term selection criteria. They must be looked at in combination with other criteria, though, to make sense. One criterion to consider could be novelty, which we will examine in the following entry.