BIK Terminology

Solving the terminology puzzle, one posting at a time

  • Author

    Barbara Inge Karsch - Terminology Consulting and Training

  • Images

    Bear cub by Reiner Karsch


Archive for the ‘Selecting terms’ Category

Terminology extraction with memoQ 5.0 RC

Posted by Barbara Inge Karsch on August 15, 2011

In the framework of a TermNet study, I have been researching and gathering data about terminology management systems (TMS). We will not focus on term extraction tools (TE), but since one of our tools candidates recently released a new term extraction module, I wanted to check it out. Here is what I learned from giving the TE functionality of memoQ 5.0 release candidate a good run.

Let me start by saying that this test made me realize again how much I enjoy working with terminological data; I love analyzing terms and concepts, researching meaning and compiling data in entries; to me it is a very creative process. Note furthermore that I am not an expert in term extraction tools: I was a serious power-user of several proprietary term extraction tools at JDE and Microsoft; I haven’t worked with the Trados solution since 2003; and I have only played with a few other methods (e.g. Word/Excel and SynchroTerm). So, my view of the market at the moment is by no means a comprehensive one. It is, however, that of a user who has done some serious term mining work. One of the biggest projects I ever did was the Axapta 4.0 specs. It took us several days just to load all the documents onto a server directory; it took the engine at least a night to “spit out” 14,000 term candidates; and it took me an exhausting week to nail down 500 designators worth working with.

As a mere user, as opposed to a computational linguist, I am not primarily interested in the performance of the extraction engine (I actually think the topic is a bit overrated); I like that in memoQ I can set the minimum/maximum word lengths, the minimum frequency, and the inclusion/exclusion of words with numbers (the home-grown solutions had predefined settings for all of this). But beyond the rough selection, I can deal with either too many or too few suggestions, if the tool allows me to quickly add or delete what I deem the appropriate form. There will always be noise and lots of it. I would rather have the developer focus on the usability of the interface than “waste” time on tweaking algorithms a tiny bit more.
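The rough selection settings mentioned above (minimum/maximum word length, minimum frequency, inclusion/exclusion of words with numbers) amount to a simple filter over the candidate list. Here is a minimal sketch in Python with made-up candidate data; this is an illustration of the idea, not memoQ’s actual implementation:

```python
import re
from collections import Counter

def filter_candidates(candidates, min_words=1, max_words=4,
                      min_frequency=2, exclude_numbers=True):
    """Apply rough selection settings to a mapping of candidate -> frequency."""
    kept = {}
    for term, freq in candidates.items():
        word_count = len(term.split())
        if not (min_words <= word_count <= max_words):
            continue  # outside the word-length window
        if freq < min_frequency:
            continue  # too rare
        if exclude_numbers and re.search(r"\d", term):
            continue  # contains a digit
        kept[term] = freq
    return kept

candidates = Counter({
    "terminology database": 12,
    "term": 48,
    "memoQ 5.0": 3,        # dropped: contains digits
    "stop word list": 1,   # dropped: below minimum frequency
})
print(filter_candidates(candidates))
# {'terminology database': 12, 'term': 48}
```

Everything beyond this rough cut is the human go/no-go work described above; no threshold tuning replaces it.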

So, along the lines of the previous posting on UX design, my requirements for a TE tool are that it allows me to

  • Process term candidates (go/no-go decision) extremely fast and
  • Move data into the TMS smoothly and flawlessly.

memoQ by Kilgray Translation Technologies* meets the first requirement very nicely. My (monolingual) test project was the PowerPoint presentations of the ECQA Certified Terminology Manager, which I had gone through in detail the previous week and which contained 28,979 English words. Because the subject matter is utterly familiar to me, there was no question as to what should make the cut and what shouldn’t. I loved that I could “race” through the list and go yay or nay; that I could merge obvious synonyms; and that I could modify term candidates to reflect their canonical form. Because the contexts for each candidate are all visible, I could have even checked the meaning in context quickly if I had needed to.

I also appreciated that there is already a stop word list in place. It was very easy to add to it, although here comes one suggestion: It would be great to have the term candidate automatically inserted in the stop-word dialog. Right now, I still have to type it in. It would save time if it were prefilled. Since the stop word list is not very extensive (e.g. even words like “doesn’t” are missing in the English list), it will take everyone considerable time to build up a list, which at its core will not vary substantially from user to user. But that may be too much to ask of a first release.

As for my second requirement, memoQ term extraction doesn’t meet that (yet) (note that I only tested the transfer of data to memoQ, but not to qTerm). I know it is asking for a lot to have a workflow from cleaned-up term candidate list to terminological entry in a TMS. Here are two suggestions that would make a difference to users:

  • Provide a way to move context from the source document, incl. context source, into the new terminological entry.
  • Merging terms into one entry because they are synonyms is great. But they need to show up as synonyms when imported into the term base; none of my short forms (e.g. POS, TMS) showed up in the entry for the long forms (e.g. part of speech, terminology management systems) when I moved them into the memoQ term base.

My main overall wish is that we integrate TE with authoring and translation in a way that allows companies and LSPs, writers and translators to have an efficient workflow. It is imperative in technical communication/translation to document terms and concepts. When this task is put on the translators, it is already quite late, but it is better than if it doesn’t happen. Only fast and flawless processing will allow one-person or multi-person enterprises, for that matter, to carry out terminology work as part of the content supply chain. When the “fast and flawless” prerequisite is met, even those of my translator-friends who detest the term “content supply chain” will have enough time to enjoy themselves with the more creative aspects of their profession. Then, economic requirements essential on the macro level are met, and the need of the individual to get satisfaction out of the task is fulfilled on the micro level. The TE functionality of memoQ 5.0 RC excels in design and, in my opinion, is ready for translators’ use. If you have any comments, if you agree or disagree with me, I’d love to hear it.

*Kilgray is a client of BIK Terminology.


Posted in Designing a terminology database, memoQ, Producing quantity, Selecting terms, Term extraction tool, Usability | Tagged: | 3 Comments »

How many terms do we need to document?

Posted by Barbara Inge Karsch on December 17, 2010

Each time a new project is kicked off, this question is on the table. Content publishers ask how much they are expected to document. Localizers ask how many new terms will be used.

Who can know these things, when each project is different, deadlines and scopes change, everyone understands “new term” to mean something else, etc.? And yet, all that is needed is agreement on a ballpark volume and schedule. With a bit of experience and a look at some key criteria, expectations can be set for the project team.

In a Canadian study, shared by Kara Warburton at TKE in Dublin, the authors found that texts contain 2-4% terms. For a project of 500,000 words, that would be roughly 15,000 terms. In contrast, product glossaries prepared for end-customers in print or online contain 20 to 100 terms. So, the discrepancy between what could be defined and what is generally defined for end-customers is large.
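The arithmetic behind that estimate is easy to make explicit; here is a small sketch (the 2-4% range is the study’s figure, the function name and midpoint reading are mine):

```python
def estimated_term_range(word_count, low=0.02, high=0.04):
    """Rough term-count range, assuming terms make up 2-4% of running words."""
    return int(word_count * low), int(word_count * high)

low, high = estimated_term_range(500_000)
print(low, high)  # 10000 20000; roughly 15,000 at the 3% midpoint
```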

A product glossary is a great start. Sometimes, even that is not available. And, yet, I hear from at least one customer that he goes to the glossary first and then navigates the documentation. Ok, that customer is my father. But juxtapose that to the remark by a translator at a panel discussion at the ATA about a recent translation project (“aha, the quality of writing tells me that this falls in the category of ‘nobody will read it anyway’”), and I am glad that someone is getting value out of documentation.

In my experience, content publishing teams are staffed and ready to define about 20% of what localizers need. Ideally, 80% of new terms are documented systematically in the centralized terminology database upfront and the other 20% of terms submitted later, on an as-needed basis. Incidentally, I define “new terms” as terms that have not been documented in the terminology database. Anything that is part of a source text of a previous version or that is part of translation memories cannot be considered managed terminology.
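Under this definition, finding the “new terms” of a project is a simple set difference between the project’s term candidates and what is already documented in the terminology database. A hypothetical sketch with invented data:

```python
def new_terms(candidates, term_base):
    """'New terms' = candidates not yet documented in the terminology database."""
    return sorted(set(candidates) - set(term_base))

# Invented example data, for illustration only.
term_base = {"purchase order", "terminology database", "stop word"}
candidates = {"purchase order", "ribbon", "term extraction"}
print(new_terms(candidates, term_base))
# ['ribbon', 'term extraction']
```

Note that, per the definition above, matches found only in source texts of previous versions or in translation memories would still count as new, since they were never managed.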

Here are a few key criteria that help determine the number of terms to document in a terminology database:

  • Size of the project: small, medium, large, extra-large…?
  • Timeline: Are there five days or five months to the deadline?
  • Version: Is this version 1 or 2 of the product? Or is it 6 or 7?
  • Number of terms existing in the database already: Is this the first time that terminology has been documented?
  • Headcount: How many people will be documenting terms and how much time can they devote?
  • Level of complexity: Are there more new features? Is the SME content higher than normal?

These criteria can serve as guidelines, so that a project team knows whether they are aiming at documenting 50 or 500 terms upfront. If memory serves me right, we added about 2,700 terms to the database for Windows Vista; 75% were documented upfront. It might be worthwhile to keep track of such historic data, as it enables planning for the next project. Of course, upfront documentation of terms takes planning. But answering questions later is much more time-consuming, expensive and resource-intensive. Hats off to companies, such as SAP, where the localization department has the power to stop a project when not enough terms were defined upfront!

Posted in Content publisher, Selecting terms, Translator | Tagged: , | Leave a Comment »

How do I identify a term—standardization

Posted by Barbara Inge Karsch on July 1, 2010

And the final criterion in this blog series on how to identify terms is, in my mind, one of the most important ones—standardization. Standardized usage and spelling makes the life of the product user much easier, and it is fairly clear which key concepts need to be documented in a terminology database for that reason. But are they the same for target terms? And if not, how would we know what must be standardized for, say, Japanese? We don’t—that’s when we rely on process and tools.

Example 1. Before we got to standardizing terminology at J.D. Edwards (JDE), purchase orders could be pushed, committed or sent. And it all meant the same thing. That had several obvious consequences:

  • Loss of productivity by customers: They had to research documentation to find out what would happen if they clicked Push on one form, Send on another or Commit on the third.
  • Loss of productivity by translators: They walked across the hall, which fortunately was possible, to enquire about the difference.
  • Inconsistency in target languages: If some translators did not think that these three terms could stand for the same thing (why would they?), they replicated the inconsistency in their language.
  • Translation memory: Push purchase order, Commit purchase order and Send purchase order needed to be translated three times into each of 21 languages before the translation memory kicked in.

All this results in direct and indirect cost.

Example 2. The VP of content publishing and translation at JDE used the following example to point out that terms and concepts should not be used at will: reporting code, system code, application, product, module, and product code. While everyone in Accounting had some sort of meaning in their head, the concepts behind them were initially not clearly defined. For example, does a product consist of modules? Or does an application consist of systems? Is a reporting code part of a module or a subunit of a product code? And when a customer buys an application is it the same as a product? So, what happens if Accounting isn’t clear what exactly the customer is buying…

Example 3. Standardization to achieve consistency in the source language is self-evident. But what about the target side? Of course, we would want a team of ten localizers working on different parts of the same product to use the same terminology. One of the most difficult languages to standardize is Japanese. My former colleague and Japanese terminologist at JDE, Demi, explained it as follows:

For Japanese, “[…] we have three writing systems:

  • Chinese characters […]
  • Hiragana […]
  • Katakana […].

We often mix Roman alphabet in our writing system too. […]how to mix the three characters, Chinese, Katakana, Hiragana, plus Roman alphabet, is up to each [person’s] discretion! For translation, it causes a problem of course. We need to come up with a certain agreements and rules.”

The standards and rules that Demi referred to should be reflected in standardized entries in a terminology database and be available at the localizers’ fingertips. Now, the tricky part is that, for Japanese, terms representing different concepts than those selected during upfront term selection may need to be standardized. In this case, it is vital that the terminology management system allow requests for entries from downstream contributors, such as the Japanese terminologist or the Japanese localizers. The requests may not make sense to a source terminologist at first glance, so a justification comment speeds up processing of the request.

To sum up this series on how to identify terms for inclusion in a terminology database: We discussed nine criteria: terminologization, specialization, confusability, frequency, distribution, novelty, visibility, system and standardization. Each one of them weighs differently for each term candidate, and most of the time several criteria apply. A terminologist, content publisher or translator has to weigh these criteria and make a decision quickly. No two people will come up with the same list upfront. But tools and processes should support downstream requests.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101, Translator | Tagged: , , , , | Leave a Comment »

How do I identify a term—system

Posted by Barbara Inge Karsch on June 30, 2010

Here is one that is forgotten often in fast-paced, high-production environments: system. This at first glance cryptic criterion refers to terms that may not be part of our text or our list of term candidates, but that are part of the conceptual system that makes up the subject matter we are working in. And sometimes, if not to say almost always, it pays off to be systematic.

A very quick excursion into the theory of terminology management: We distinguish between ad-hoc and systematic terminology work.

  • When we work ad-hoc, we don’t care about the surrounding concepts or terms; we focus on solving the terminological problem at hand; for example: I need to know what forecasting is and what it is called in Finnish.
  • When we take a systematic approach, we go deeper into understanding a particular subject. We may start out researching one term (e.g. forecasting) and understand the concept behind it, but then we continue to study its parent, sibling and child concepts; we work in a subject area and examine and document the relationships of the concepts.

In the following example, the terminologist decided to not only set up an entry for forecasting, but to also list different types of forecasting—child or subordinate concepts—and the parent or superordinate concept. The J.D. Edwards terminology tool, TDB, had an add-on that turned the data into visuals, such as the one below. It goes without saying that displays of this nature help, for instance, the Finnish terminologist to find equivalents more easily when s/he knows that besides qualitative forecasting there is also quantitative forecasting, etc.

JDE types of forecasting
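The parent, sibling and child relationships of a concept system can be represented minimally as a mapping from each concept to its subordinate concepts. The sketch below uses the forecasting example; apart from qualitative and quantitative forecasting, the concept names are illustrative, not actual TDB content:

```python
# Each concept maps to its subordinate (child) concepts.
concept_system = {
    "planning": ["forecasting"],  # hypothetical superordinate concept
    "forecasting": ["qualitative forecasting", "quantitative forecasting"],
    "qualitative forecasting": [],
    "quantitative forecasting": [],
}

def subordinate_concepts(concept, system):
    """Collect every concept below the given one, depth first."""
    found = []
    for child in system.get(concept, []):
        found.append(child)
        found.extend(subordinate_concepts(child, system))
    return found

print(subordinate_concepts("forecasting", concept_system))
# ['qualitative forecasting', 'quantitative forecasting']
```

A visualization add-on like the one TDB had would essentially walk such a structure and draw it.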

In his Manuel pratique de terminologie, Dubuc suggests that ad-hoc terminology work is a good way to get started with terminology management. Furthermore, he is right in that documenting concepts and their systems takes time and money, both of which are in short supply in many business environments. On the other hand, a more systematic approach will, in my experience, lead to entries that stand the test of time longer, create fewer downstream problems or questions, and need less maintenance. So, investing more time in the initial research and documenting the surrounding concepts while you have the information at hand anyway may very well pay off later. Seasoned terminologists know when to include terms to flesh out a system and when to simply answer an ad-hoc question.

Posted in Advanced terminology topics, Content publisher, J.D. Edwards TDB, Selecting terms, Terminologist, Terminology 101 | Tagged: , , , | Leave a Comment »

How do I identify a term—visibility

Posted by Barbara Inge Karsch on June 29, 2010

Yesterday’s example was the term ribbon. While the concept was an innovation at the time and is quite prevalent in software today, the term is not necessarily highly visible. Today’s focus will be on the term-selection criterion “visibility”—in other words, on terms that are conspicuous and prevalent.

Look at the following screen prints from products within the Microsoft Office 2010 suite:

Microsoft Office ribbon tabs

Did you find some highly visible terms there? All of them stand for ribbon tabs that are highly standardized to maximize user retention: One term representing the same concept in each of the different products makes it much easier for the user to remember where to find what. Do you think that this was a coordinated effort? I don’t know for sure, as my involvement with Office was limited to Office 2007, but it looks like it. That, too, is terminology management.

Highly visible terms must be correct in both the source and all target languages. Inconsistencies, spelling errors or variations are not only embarrassing; they lead to less trust by users, especially in markets with high quality expectations. Terminology management working methods can spare you the embarrassment and lead to a trusting relationship with the users.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , , | Leave a Comment »

How do I identify a term—novelty

Posted by Barbara Inge Karsch on June 28, 2010

In the posting on frequency and distribution, the focus was on automated term extraction output. Today’s criterion for term selection will pertain more often to manual term extraction. For consistency’s sake, we call it novelty, to go along with all the other nouns (terminologization, specialization, confusability, frequency and distribution). But it simply refers to terms that are new and should be added to a terminology database for that reason.

In the manual term extraction process a writer or editor documents terms while authoring material. They can do this either in a separate list or directly in the terminology database, depending on their working style, the need for immediate availability of the terms, their rights in the terminology tool, etc. Many of the terms documented this way will meet the criterion “novelty.” In a less strict sense of the word, novelties or “new terms” can also be the focus of a term extraction program. These programs can be set up to only extract terms that have not come up or been documented so far. The difference is that the human can evaluate right away which term really stands for an innovative concept, while the machine will only exclude what is already documented elsewhere.

Most of us remember that with Office 2007 the ribbon was introduced. While the name of this new tabbed command bar does not show up in text all that often, it was new and would have been hard to name in other languages had it not been documented in a terminology entry.

If the answer to the question “is this a new term representing a new concept?” is yes, do make an entry in the terminology database. Especially in environments where terminology management has been common practice and there is no need to document legacy terminology, most terms added to the database meet this criterion. Stay tuned for the posting on term selection and visibility.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , , | Leave a Comment »

How do I identify a term—frequency and distribution

Posted by Barbara Inge Karsch on June 27, 2010

A seemingly obvious criterion for selecting terms for a terminology database is frequency of occurrence. A term extraction program, for example, should tell us how often a term appears in the mined text. Term extraction output or other text-mining solutions might also tell you what the distribution of a term is; in other words, you may be able to find out in how many documents or products a term occurs.

When sifting through term candidates in term-mining output, we very likely have to limit the scope quite a bit, because we can’t spend weeks on making perfect term selections. As we know by now, frequency is not the only term selection criterion, but it can help us, particularly in large projects. Here are some options with their pros and cons:




  • Ignore frequency and evaluate all term candidates. Pro: more precise selection, because nothing is excluded. Con: high time investment. Good for small lists; never completely ignore frequency, as it can still tell us something about the importance of a term.
  • Exclude all terms that occur less than x number of times. Pro: the number of term candidates is smaller. Con: potential to miss critical terms. Good for larger lists and when a critical percentage of terms was already extracted manually.
  • Exclude all terms that occur more than y number of times. Pro: the number of term candidates is smaller. Con: potential to miss critical terms. Good for large lists from which existing database or other non-critical terms or words were not excluded.
  • Only go through terms that occur more than x and less than y number of times. Pro: the number of terms can be reduced significantly. Con: high potential to miss critical terms. Good when both critical terms are already extracted and no stop word list was used.

If a term occurs often in a project, it is probably either very important or so generic that it shouldn’t be included. If you run a term extraction process, generic words should not be part of the resulting list in the first place; they should be caught by a stop-word list.
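All four options above boil down to choosing a lower and/or an upper frequency bound for the candidate list. A minimal sketch with invented counts:

```python
def frequency_band(candidates, min_count=None, max_count=None):
    """Keep candidates whose frequency falls inside the chosen band.
    A bound of None means that end is ignored (option 1 uses two Nones)."""
    return {
        term: count
        for term, count in candidates.items()
        if (min_count is None or count >= min_count)
        and (max_count is None or count <= max_count)
    }

# Invented counts for illustration.
counts = {"port": 3, "user": 250, "terminologization": 7, "the": 4100}
print(frequency_band(counts, min_count=2, max_count=100))
# {'port': 3, 'terminologization': 7}
```

With a stop word list applied beforehand, an entry like "the" would never reach this stage at all.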

Certain term mining solutions or lookup tools also indicate in which project or in which version and product a particular term is used. In other words, they give us information about the distribution of a term. But high distribution, just like high frequency, may be a property of terms that are very well known and do not need to be documented. For example, at Microsoft it would seem useless to include terms such as computer or user just because they occur frequently and are widely distributed. There are other reasons to include them, though. By the same token, a widely distributed and highly frequent term that is somewhat mysterious should be included in the terminology database, as many users might need to look it up, and the return on investment is there.

To summarize, frequency and distribution are important term selection criteria. They must be looked at in combination with other criteria, though, to make sense. One criterion to consider could be novelty, which we will examine in the following entry.

Posted in Content publisher, Selecting terms, Term extraction tool, Terminologist, Terminology 101 | Tagged: , , , | 1 Comment »

How do I identify a term—specialization

Posted by Barbara Inge Karsch on June 26, 2010

You may have noticed that no two people involved in term selection will make the exact same choices; each person’s list would look slightly different. And depending on the users of the database, different terms need to be selected. After terminologization and confusability, the next selection criterion is a term’s degree of “specialization.” And here is where the person selecting the terms and the person consuming the terminology product influence the choices.

What is a highly specialized term to one person may be old hat to someone else. For example, a content publisher who has worked on, say, ERP content most of their professional life, may not want to document the term “bill of material.” But for an English-to-Slovak translator who might work on a birth certificate and a medical report one day and the ERP project the next, it is really helpful to have a terminological entry for “bill of material” to resort to.

Similarly, if the goal is to prepare a terminology glossary for medical interpreters who have worked in their specialized field for a long time, we may not add the most common anatomical body parts, such as sternum, as they would likely be familiar with them. But if the same terminology database is used to produce a glossary for patient information, it may very well be worthwhile to select and document the terms sternum and breastbone.

In my experience—especially in large-scale environments with multilingual databases with dozens of target languages, hundreds of products and thousands of consumers—if you find that a term is not that specialized, because you are familiar with it, do include it anyway. Since you know it, you can set up a correct and complete entry quite fast; while it’ll take someone else a long time to research and find the information that you already have.

After terminologization, confusability, and specialization, tomorrow we’ll look at the simple topic of frequency.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , , | Leave a Comment »

How do I identify a term—confusability

Posted by Barbara Inge Karsch on June 25, 2010

Let’s continue our series on which designators (remember, these are terms, appellations and symbols) to include in a terminology database. Today, we will focus on the question: Can this designator be confused with another? More specifically, is there a homograph that stands for a different concept?

Homographs—words that have the same spelling, but differ from one another in meaning, origin, and sometimes pronunciation—are probably the most frequent source of confusion. While we try not to use one term for multiple things, it cannot always be avoided; language is alive, meaning evolves, and even with the best prescriptive terminology management system, you might encounter homographs. A good example is the term port. Port has many meanings as a word in general language and as a term in special languages. In the IT world, it can refer to at least a physical piece of hardware and a logical piece of software.

Theoretically, when there is the risk of “confusability,” the technical writer should be very specific, for instance, by using physical port or hardware port or even more specifically keyboard port. But even if the writer is precise in the first occurrence of the concept in the text, s/he may use the more generic or abbreviated form port in subsequent parts of the text or on the user interface. Because we never know what shows up in the translation environment first, though, it is good to alert a localizer to the fact that there are multiple meanings behind the term and include it in the terminology database.

So, if the answer to the question “is there a risk of confusability?” is yes, add the term and its homograph to the terminology database. While users of the database still need to identify the meaning in their context, at least they are alerted to the fact that there are two or more possible meanings.
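In a term base where each entry pairs a term with a concept, potential homographs can even be flagged mechanically: any term documented under more than one concept deserves a closer look. A hypothetical sketch (real term bases key entries by concept IDs; plain labels keep the example readable):

```python
from collections import defaultdict

# Invented entries: (term, concept label).
entries = [
    ("port", "physical hardware connector"),
    ("port", "logical software endpoint"),
    ("ribbon", "tabbed command bar"),
]

def homographs(entries):
    """Flag terms documented under more than one concept."""
    by_term = defaultdict(set)
    for term, concept in entries:
        by_term[term].add(concept)
    return {term: sorted(c) for term, c in by_term.items() if len(c) > 1}

print(homographs(entries))
# {'port': ['logical software endpoint', 'physical hardware connector']}
```

The flag only alerts users to multiple meanings; deciding which meaning applies in context remains their job, as described above.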

Tomorrow, we will discuss selecting terms based on their degree of specialization.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , , , | Leave a Comment »

How do I identify a term—terminologization

Posted by Barbara Inge Karsch on June 24, 2010

In What is a term? our focus was on how to define the scope of a terminology database and guide a team on what should and what shouldn’t be entered into a terminology database. It is good to have rough guidelines, but there is obviously more to the story of what a term is and what should be included in a terminology database.

If we are asked to go through a list of term candidates extracted by a term extraction tool or if we are selecting terms manually, we may not always be sure whether a certain term candidate should be included. Especially if you are not a subject matter expert or if you only speak one language, this is a difficult job. It is a little easier for translators, as they are used to analyzing texts very thoroughly. As an aside, this quality makes the translator a content publisher’s best friend, for translators find the mistakes, the inconsistencies or just the minor hitches of a text. And yet in the term selection process, we have to make decisions in split seconds. How do we make them? This and the next eight postings—one short post over the next eight days—will provide more in-depth guidance on why a term should be included in a terminology database.

Let’s start with terms that have gone through what is called “terminologization”—the process by which a general-language word or expression is transformed into a term designating a concept in a language for special purposes (LSP) (ISO 704). This Microsoft Language Portal Blog posting gives a variety of examples of animal names, e.g. mouse or worm, that became technical terms in the IT industry. We are often able to recognize terms that have undergone terminologization by distinguishing them from other terms in their conceptual vicinity (see Juan Sager’s A Practical Course in Terminology), e.g. dedicated line vs. public line.

So, if we ask ourselves “Is this a word that became a term and is now used with a very specific meaning in technical language?” and the answer is yes, let’s include it in the terminology database. Then there is no confusion about what we mean by it, because it is clearly defined, and its usage can be standardized across languages.

More on term selection and the criterion “confusability” next time.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , | 3 Comments »
