BIK Terminology—

Solving the terminology puzzle, one posting at a time

  • Author

    Barbara Inge Karsch - Terminology Consulting and Training

  • Images

    Bear cub by Reiner Karsch


How do I identify a term—frequency and distribution

Posted by Barbara Inge Karsch on June 27, 2010

A seemingly obvious criterion for selecting terms for a terminology database is frequency of occurrence. A term extraction program, for example, should tell us how often a term appears in the mined text. Term extraction output or other text-mining solutions might also tell us the distribution of a term; in other words, we may be able to find out in how many documents or products a term occurs.
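To make the two measures concrete, here is a minimal Python sketch that counts frequency (total occurrences) and distribution (number of documents containing the term) for a list of candidates. The function name, the sample documents, and the whole-word, case-insensitive matching are assumptions made for illustration; a real term extraction tool will tokenize and normalize text in its own way.

```python
import re
from collections import Counter

def frequency_and_distribution(documents, candidates):
    """Return {term: (frequency, distribution)} for each candidate term.

    frequency    = total occurrences across all documents
    distribution = number of documents in which the term occurs at least once
    Matching is whole-word and case-insensitive (an assumption of this sketch).
    """
    frequency = Counter()
    distribution = Counter()
    for text in documents.values():
        for term in candidates:
            pattern = r"\b" + re.escape(term) + r"\b"
            hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if hits:
                frequency[term] += hits
                distribution[term] += 1
    return {term: (frequency[term], distribution[term]) for term in candidates}

# Hypothetical sample data, not taken from any real project.
documents = {
    "setup_guide": "The user opens the control panel. The user then selects a device driver.",
    "release_notes": "This device driver update fixes a crash in the control panel.",
    "faq": "How does a user reset a password?",
}
candidates = ["user", "control panel", "device driver", "password"]

for term, (freq, dist) in frequency_and_distribution(documents, candidates).items():
    print(f"{term}: frequency={freq}, distribution={dist}")
```

In this toy data, "user" is the most frequent candidate, while "password" occurs once in a single document. Neither number alone says whether the term belongs in the database.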

When sifting through term candidates in term-mining output, we very likely have to narrow the scope quite a bit, because we can’t spend weeks on making perfect term selections. As we know by now, frequency is not the only term selection criterion, but it can help us, particularly in large projects. Here are the options with their pros and cons:

Ignore frequency and evaluate all term candidates

  • Pros: More precise selection because nothing is excluded
  • Cons: High time investment
  • Recommendation: Good for small lists; never completely ignore frequency, as it can still tell us something about the importance of a term

Exclude all terms that occur fewer than x times

  • Pros: Number of term candidates is smaller
  • Cons: Potential to miss critical terms
  • Recommendation: Good for larger lists and when a critical percentage of terms has already been extracted manually

Exclude all terms that occur more than y times

  • Pros: Number of term candidates is smaller
  • Cons: Potential to miss critical terms
  • Recommendation: Good for large lists from which existing database terms or other non-critical terms or words were not excluded

Only go through terms that occur more than x and fewer than y times

  • Pros: Number of terms can be reduced significantly
  • Cons: High potential to miss critical terms
  • Recommendation: Good when both conditions hold: critical terms have already been extracted and no stop-word list was used
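The threshold options above can be sketched as a single filter with an optional lower bound x and upper bound y. This is only an illustration: the function name, the sample counts, and the choice to treat both bounds as inclusive are assumptions, and sensible values for x and y depend entirely on the project.

```python
def filter_by_frequency(counts, min_count=None, max_count=None):
    """Split candidates into (kept, dropped) by frequency.

    min_count plays the role of x, max_count the role of y.
    Both bounds are inclusive here (an assumption); None disables a bound.
    """
    kept, dropped = {}, {}
    for term, n in counts.items():
        too_rare = min_count is not None and n < min_count
        too_common = max_count is not None and n > max_count
        if too_rare or too_common:
            dropped[term] = n
        else:
            kept[term] = n
    return kept, dropped

# Hypothetical frequencies from a term-mining run.
counts = {"user": 420, "control panel": 37, "device driver": 12, "hot patching": 2}

kept, dropped = filter_by_frequency(counts, min_count=5, max_count=100)
print("review:", sorted(kept))
print("excluded:", sorted(dropped))
```

Returning the excluded candidates alongside the kept ones is a deliberate choice in this sketch: since every threshold risks missing critical terms, a quick scan of the excluded list is a cheap safeguard.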

If a term occurs often in a project, it is probably either very important or so generic that it shouldn’t be included. If you run a term extraction process, such generic words should not be part of the resulting list; they should be part of a stop-word list.
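As a rough illustration of how a stop-word list keeps generic words out of the candidate list, consider the following sketch. The function name and the tiny stop-word list are assumptions for the example; real extraction tools typically ship their own, much longer lists and may apply them during extraction rather than afterwards.

```python
def apply_stop_list(candidates, stop_words):
    """Return the candidates that are not on the stop-word list.

    Comparison is case-insensitive and on the whole candidate,
    so a multi-word term that merely contains a stop word is kept.
    """
    stop = {word.lower() for word in stop_words}
    return [c for c in candidates if c.lower() not in stop]

# Hypothetical extraction output and stop-word list.
candidates = ["The", "user", "control panel", "and", "end of life"]
stop_words = ["the", "and", "of", "a"]

print(apply_stop_list(candidates, stop_words))
```

Only the standalone generic words are removed; "end of life" survives even though it contains "of", because the comparison is made against the full candidate.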

Certain term-mining solutions or lookup tools also indicate in which project or in which version and product a particular term is used. In other words, they give us information about the distribution of a term. But wide distribution, just like high frequency, may be characteristic of terms that are very well known and do not need to be documented. For example, at Microsoft it would seem useless to include terms such as computer or user just because they occur frequently and are widely distributed. There are other reasons to include them, though. By the same token, a widely distributed and highly frequent term that is somewhat mysterious should be included in the terminology database, as many users might need to look it up and the return on investment is there.

To summarize, frequency and distribution are important term selection criteria. They must be looked at in combination with other criteria, though, to make sense. One criterion to consider could be novelty, which we will examine in the following entry.


One Response to “How do I identify a term—frequency and distribution”

  1. I look forward to more discussion. I have written indexing systems that cope with many of these same issues.

    Nice work!
