BIK Terminology

Solving the terminology puzzle, one posting at a time


How do I identify a term—frequency and distribution

June 27, 2010 by Barbara Inge Karsch

A seemingly obvious criterion for selecting terms for a terminology database is frequency of occurrence. A term extraction program, for example, should tell us how often a term appears in the text mined. Term extraction output or other text-mining solutions might also tell us what the distribution of a term is; in other words, we may be able to find out in how many documents or products a term occurs.
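To make the two measures concrete, here is a minimal sketch of how frequency and distribution could be counted. It is not how any particular extraction tool works; the function name `frequency_and_distribution`, the sample documents, and the simple whitespace tokenization (single-word terms only) are assumptions made for illustration.

```python
from collections import Counter


def frequency_and_distribution(documents, candidates):
    """Return {term: (frequency, distribution)} for each candidate term.

    frequency    = total number of occurrences across all documents
    distribution = number of documents in which the term occurs at least once
    Assumption: terms are single lowercase words and texts are split on whitespace.
    """
    frequency = Counter()
    distribution = Counter()
    for text in documents.values():
        counts = Counter(text.lower().split())
        for term in candidates:
            occurrences = counts[term]
            if occurrences:
                frequency[term] += occurrences
                distribution[term] += 1
    return {term: (frequency[term], distribution[term]) for term in candidates}


# Hypothetical mini-corpus: document name -> text
docs = {
    "setup_guide": "click the ribbon then click save",
    "release_notes": "the ribbon is new",
    "faq": "save your file",
}

stats = frequency_and_distribution(docs, ["ribbon", "save", "click"])
for term, (freq, dist) in stats.items():
    print(f"{term}: frequency={freq}, distribution={dist}")
```

In this toy corpus all three terms have the same frequency, but "click" occurs in only one document while "ribbon" and "save" occur in two each, which is exactly the kind of difference that distribution information reveals.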

When sifting through term candidates in term-mining output, we very likely have to narrow the scope quite a bit, because we can’t spend weeks on making perfect term selections. As we know by now, frequency is not the only term selection criterion, but it can help us, particularly in large projects. Here are the options with their pros and cons:

| Option | Pros | Cons | Recommendation |
|---|---|---|---|
| Ignore frequency and evaluate all term candidates | More precise selection because nothing is excluded | High time investment | Good for small lists; never completely ignore frequency, as it can still tell us something about the importance of a term |
| Exclude all terms that occur fewer than x times | Number of term candidates is smaller | Potential to miss critical terms | Good for larger lists and when a critical percentage of terms was already extracted manually |
| Exclude all terms that occur more than y times | Number of term candidates is smaller | Potential to miss critical terms | Good for large lists from which existing database terms or other non-critical terms or words were not excluded |
| Only go through terms that occur more than x and fewer than y times | Number of terms can be reduced significantly | High potential to miss critical terms | Good when critical terms have already been extracted and no stop-word list was used |

If a term occurs often in a project, it is probably either very important or so generic that it shouldn’t be included. If you run a term extraction process, such generic words should not be part of the resulting list; they should be part of a stop-word list.
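The frequency options and the stop-word list could be combined along the following lines. This is only a sketch under stated assumptions: the function `select_candidates`, its parameters `min_count` (the x threshold) and `max_count` (the y threshold), and the sample counts are hypothetical and not taken from any specific tool. The bounds are treated as inclusive here; whether they should be is a judgment call.

```python
def select_candidates(frequencies, min_count=None, max_count=None, stop_words=()):
    """Filter term candidates by frequency band and stop-word list.

    frequencies: dict mapping term -> number of occurrences
    min_count:   drop terms occurring fewer than this many times (x); None = no lower bound
    max_count:   drop terms occurring more than this many times (y); None = no upper bound
    stop_words:  generic words that should never appear in the result
    Returns the remaining terms, most frequent first.
    """
    stop = {word.lower() for word in stop_words}
    selected = []
    for term, count in frequencies.items():
        if term.lower() in stop:
            continue
        if min_count is not None and count < min_count:
            continue
        if max_count is not None and count > max_count:
            continue
        selected.append(term)
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(selected, key=lambda t: (-frequencies[t], t))


# Hypothetical extraction output: term -> frequency
frequencies = {
    "the": 950,
    "user": 400,
    "ribbon": 42,
    "backstage view": 7,
    "slicer": 1,
}

# Only go through terms between x=2 and y=100, with a stop-word list applied.
shortlist = select_candidates(frequencies, min_count=2, max_count=100, stop_words=["the"])
print(shortlist)
```

Note that the band filter drops "slicer" (too rare) and "user" (too frequent) along with the stop word. Either of them could be a critical term, which is the risk the Cons column warns about, so the thresholds are best treated as a way to prioritize review rather than as a final decision.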

Certain term-mining solutions or lookup tools also indicate in which project, or in which product and version, a particular term is used. In other words, they give us information about the distribution of a term. But wide distribution, just like high frequency, may be characteristic of terms that are very well known and do not need to be documented. For example, at Microsoft it would seem useless to include terms such as computer or user just because they occur frequently and are widely distributed. There are other reasons to include them, though. By the same token, a widely distributed and highly frequent term that is somewhat mysterious should be included in the terminology database, as many users might need to look it up and the return on investment is there.

To summarize, frequency and distribution are important term selection criteria. They must be looked at in combination with other criteria, though, to make sense. One criterion to consider could be novelty, which we will examine in the following entry.


