BIK Terminology

Solving the terminology puzzle, one posting at a time

  • Author

    Barbara Inge Karsch - Terminology Consulting and Training

  • Images

    Bear cub by Reiner Karsch


Archive for the ‘Selecting terms’ Category

Terminology extraction with memoQ 5.0 RC

Posted by Barbara Inge Karsch on August 15, 2011

In the framework of a TermNet study, I have been researching and gathering data about terminology management systems (TMS). We will not focus on term extraction tools (TE), but since one of our tools candidates recently released a new term extraction module, I wanted to check it out. Here is what I learned from giving the TE functionality of memoQ 5.0 release candidate a good run.

Let me start by saying that this test made me realize again how much I enjoy working with terminological data; I love analyzing terms and concepts, researching meaning and compiling data in entries; to me it is a very creative process. Note furthermore that I am not an expert in term extraction tools: I was a serious power-user of several proprietary term extraction tools at JDE and Microsoft; I haven’t worked with the Trados solution since 2003; and I have only played with a few other methods (e.g. Word/Excel and SynchroTerm). So, my view of the market at the moment is by no means a comprehensive one. It is, however, that of a user who has done some serious term mining work. One of the biggest projects I ever did was the Axapta 4.0 specs. It took us several days just to load all the documents onto a server directory; it took the engine at least a night to “spit out” 14,000 term candidates; and it took me an exhausting week to nail down 500 designators worth working with.

As a mere user, as opposed to a computational linguist, I am not primarily interested in the performance of the extraction engine (I actually think the topic is a bit overrated); I like that in memoQ I can set the minimum/maximum word lengths, the minimum frequency, and the inclusion/exclusion of words with numbers (the home-grown solutions had predefined settings for all of this). But beyond the rough selection, I can deal with either too many or too few suggestions, if the tool allows me to quickly add or delete what I deem the appropriate form. There will always be noise and lots of it. I would rather have the developer focus on the usability of the interface than “waste” time on tweaking algorithms a tiny bit more.
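The rough selection settings mentioned above (minimum/maximum word length, minimum frequency, inclusion/exclusion of words with numbers) amount to a simple filter over the candidate list. Here is a minimal sketch in Python with made-up candidate data; this is an illustration of the idea, not memoQ’s actual implementation:

```python
import re
from collections import Counter

def filter_candidates(candidates, min_words=1, max_words=4,
                      min_frequency=2, exclude_numbers=True):
    """Apply rough selection settings to a mapping of candidate -> frequency."""
    kept = {}
    for term, freq in candidates.items():
        word_count = len(term.split())
        if not (min_words <= word_count <= max_words):
            continue  # outside the word-length window
        if freq < min_frequency:
            continue  # too rare
        if exclude_numbers and re.search(r"\d", term):
            continue  # contains a digit
        kept[term] = freq
    return kept

candidates = Counter({
    "terminology database": 12,
    "term": 48,
    "memoQ 5.0": 3,        # dropped: contains digits
    "stop word list": 1,   # dropped: below minimum frequency
})
print(filter_candidates(candidates))
# {'terminology database': 12, 'term': 48}
```

Everything beyond this rough cut is the human go/no-go work described above; no threshold tuning replaces it.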

So, along the lines of the previous posting on UX design, my requirements for a TE tool are that it allows me to

  • Process term candidates (go/no-go decision) extremely fast and
  • Move data into the TMS smoothly and flawlessly.

memoQ by Kilgray Translation Technologies* meets the first requirement very nicely. My (monolingual) test project was the PowerPoint presentations of the ECQA Certified Terminology Manager, which I had gone through in detail the previous week and which contained 28,979 English words. Because the subject matter is utterly familiar to me, there was no question as to what should make the cut and what shouldn’t. I loved that I could “race” through the list and go yay or nay; that I could merge obvious synonyms; and that I could modify term candidates to reflect their canonical form. Because the contexts for each candidate are all visible, I could have even checked the meaning in context quickly if I had needed to.

I also appreciated that there is already a stop word list in place. It was very easy to add to it, although here comes one suggestion: It would be great to have the term candidate automatically inserted in the stop-word dialog. Right now, I still have to type it in. It would save time if it were prefilled. Since the stop word list is not very extensive (e.g. even words like “doesn’t” are missing in the English list), it will take everyone considerable time to build up a list, which at its core will not vary substantially from user to user. But that may be too much to ask of a first release.

As for my second requirement, memoQ term extraction doesn’t meet that (yet) (note that I only tested the transfer of data to memoQ, but not to qTerm). I know it is asking for a lot to have a workflow from cleaned-up term candidate list to terminological entry in a TMS. Here are two suggestions that would make a difference to users:

  • Provide a way to move context from the source document, incl. context source, into the new terminological entry.
  • Merging terms into one entry because they are synonyms is great. But they need to show up as synonyms when imported into the term base; none of my short forms (e.g. POS, TMS) showed up in the entry for the long forms (e.g. part of speech, terminology management systems) when I moved them into the memoQ term base.

My main overall wish is that we integrate TE with authoring and translation in a way that allows companies and LSPs, writers and translators to have an efficient workflow. It is imperative in technical communication/translation to document terms and concepts. When this task is put on the translators, it is already quite late, but it is better than if it doesn’t happen. Only fast and flawless processing will allow one-person or multi-person enterprises, for that matter, to carry out terminology work as part of the content supply chain. When the “fast and flawless” prerequisite is met, even those of my translator-friends who detest the term “content supply chain” will have enough time to enjoy themselves with the more creative aspects of their profession. Then, economic requirements essential on the macro level are met, and the need of the individual to get satisfaction out of the task is fulfilled on the micro level. The TE functionality of memoQ 5.0 RC excels in design and, in my opinion, is ready for translators’ use. If you have any comments, if you agree or disagree with me, I’d love to hear it.

*Kilgray is a client of BIK Terminology.


Posted in Designing a terminology database, memoQ, Producing quantity, Selecting terms, Term extraction tool, Usability | Tagged: | 3 Comments »

How many terms do we need to document?

Posted by Barbara Inge Karsch on December 17, 2010

Each time a new project is kicked off, this question is on the table. Content publishers ask how much they are expected to document. Localizers ask how many new terms will be used.

Who can know these things, when each project is different, deadlines and scopes change, everyone understands “new term” to mean something else, etc.? And yet, all that is needed is agreement on a ballpark volume and schedule. With a bit of experience and a look at some key criteria, expectations can be set for the project team.

In a Canadian study, shared by Kara Warburton at TKE in Dublin, the authors found that texts contain 2-4% terms. For a project of 500,000 words, that would be roughly 15,000 terms. In contrast, product glossaries prepared for end-customers in print or online contain 20 to 100 terms. So, the discrepancy between what could be defined and what is generally defined for end-customers is large.
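The arithmetic behind that estimate is easy to make explicit; here is a small sketch (the 2-4% range is the study’s figure, the function name and midpoint reading are mine):

```python
def estimated_term_range(word_count, low=0.02, high=0.04):
    """Rough term-count range, assuming terms make up 2-4% of running words."""
    return int(word_count * low), int(word_count * high)

low, high = estimated_term_range(500_000)
print(low, high)  # 10000 20000; roughly 15,000 at the 3% midpoint
```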

A product glossary is a great start. Sometimes, even that is not available. And, yet, I hear from at least one customer that he goes to the glossary first and then navigates the documentation. Ok, that customer is my father. But juxtapose that to the remark by a translator at a panel discussion at the ATA about a recent translation project (“aha, the quality of writing tells me that this falls in the category of ‘nobody will read it anyway’”), and I am glad that someone is getting value out of documentation.

In my experience, content publishing teams are staffed and ready to define about 20% of what localizers need. Ideally, 80% of new terms are documented systematically in the centralized terminology database upfront and the other 20% of terms submitted later, on an as-needed basis. Incidentally, I define “new terms” as terms that have not been documented in the terminology database. Anything that is part of a source text of a previous version or that is part of translation memories cannot be considered managed terminology.
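Under this definition, finding the “new terms” of a project is a simple set difference between the project’s term candidates and what is already documented in the terminology database. A hypothetical sketch with invented data:

```python
def new_terms(candidates, term_base):
    """'New terms' = candidates not yet documented in the terminology database."""
    return sorted(set(candidates) - set(term_base))

# Invented example data, for illustration only.
term_base = {"purchase order", "terminology database", "stop word"}
candidates = {"purchase order", "ribbon", "term extraction"}
print(new_terms(candidates, term_base))
# ['ribbon', 'term extraction']
```

Note that, per the definition above, matches found only in source texts of previous versions or in translation memories would still count as new, since they were never managed.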

Here are a few key criteria that help determine the number of terms to document in a terminology database:

  • Size of the project: small, medium, large, extra-large…?
  • Timeline: Are there five days or five months to the deadline?
  • Version: Is this version 1 or 2 of the product? Or is it 6 or 7?
  • Number of terms existing in the database already: Is this the first time that terminology has been documented?
  • Headcount: How many people will be documenting terms and how much time can they devote?
  • Level of complexity: Are there more new features? Is the SME content higher than normal?

These criteria can serve as guidelines, so that a project team knows whether they are aiming at documenting 50 or 500 terms upfront. If memory serves me right, we added about 2,700 terms to the database for Windows Vista; 75% were documented upfront. It might be worthwhile to keep track of such historic data, as it enables planning for the next project. Of course, upfront documentation of terms takes planning. But answering questions later is much more time-consuming, expensive and resource-intensive. Hats off to companies, such as SAP, where the localization department has the power to stop a project when not enough terms were defined upfront!

Posted in Content publisher, Selecting terms, Translator | Tagged: , | Leave a Comment »

How do I identify a term—standardization

Posted by Barbara Inge Karsch on July 1, 2010

And the final criterion in this blog series on how to identify terms is, in my mind, one of the most important ones—standardization. Standardized usage and spelling makes the life of the product user much easier, and it is fairly clear which key concepts need to be documented in a terminology database for that reason. But are they the same for target terms? And if not, how would we know what must be standardized for, say, Japanese? We don’t—that’s when we rely on process and tools.

Example 1. Before we got to standardizing terminology at J.D. Edwards (JDE), purchase orders could be pushed, committed or sent. And it all meant the same thing. That had several obvious consequences:

  • Loss of productivity by customers: They had to research documentation to find out what would happen if they clicked Push on one form, Send on another or Commit on the third.
  • Loss of productivity by translators: They walked across the hall, which fortunately was possible, to enquire about the difference.
  • Inconsistency in target languages: If some translators did not think that these three terms could stand for the same thing (why would they?), they replicated the inconsistency in their language.
  • Translation memory: Push purchase order, Commit purchase order and Send purchase order needed to be translated three times into each of 21 languages before the translation memory kicked in.

All this results in direct and indirect cost.

Example 2. The VP of content publishing and translation at JDE used the following example to point out that terms and concepts should not be used at will: reporting code, system code, application, product, module, and product code. While everyone in Accounting had some sort of meaning in their head, the concepts behind them were initially not clearly defined. For example, does a product consist of modules? Or does an application consist of systems? Is a reporting code part of a module or a subunit of a product code? And when a customer buys an application is it the same as a product? So, what happens if Accounting isn’t clear what exactly the customer is buying…

Example 3. Standardization to achieve consistency in the source language is self-evident. But what about the target side? Of course, we would want a team of ten localizers working on different parts of the same product to use the same terminology. One of the most difficult languages to standardize is Japanese. My former colleague and Japanese terminologist at JDE, Demi, explained it as follows:

For Japanese, “[…] we have three writing systems:

  • Chinese characters […]
  • Hiragana […]
  • Katakana […].

We often mix Roman alphabet in our writing system too. […]how to mix the three characters, Chinese, Katakana, Hiragana, plus Roman alphabet, is up to each [person’s] discretion! For translation, it causes a problem of course. We need to come up with a certain agreements and rules.”

The standards and rules that Demi referred to should be reflected in standardized entries in a terminology database and be available at the localizers’ fingertips. Now, the tricky part is that, for Japanese, terms representing different concepts than those selected during upfront term selection may need to be standardized. In this case, it is vital that the terminology management system allow requests for entries from downstream contributors, such as the Japanese terminologist or the Japanese localizers. The requests may not make sense to a source terminologist at first glance, so a justification comment speeds up processing of the request.

To sum up this series on how to identify terms for inclusion in a terminology database: We discussed nine criteria: terminologization, specialization, confusability, frequency, distribution, novelty, visibility, system and standardization. Each one of them weighs differently for each term candidate, and most of the time several criteria apply. A terminologist, content publisher or translator has to weigh these criteria and make a decision quickly. No two people will come up with the same list upfront. But tools and processes should support downstream requests.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101, Translator | Tagged: , , , , | Leave a Comment »

How do I identify a term—system

Posted by Barbara Inge Karsch on June 30, 2010

Here is one that is forgotten often in fast-paced, high-production environments: system. This at first glance cryptic criterion refers to terms that may not be part of our text or our list of term candidates, but that are part of the conceptual system that makes up the subject matter we are working in. And sometimes, if not to say almost always, it pays off to be systematic.

A very quick excursion into the theory of terminology management: We distinguish between ad-hoc and systematic terminology work.

  • When we work ad-hoc, we don’t care about the surrounding concepts or terms; we focus on solving the terminological problem at hand; for example: I need to know what forecasting is and what it is called in Finnish.
  • When we take a systematic approach, we go deeper into understanding a particular subject. We may start out researching one term (e.g. forecasting) and understand the concept behind it, but then we continue to study its parent, sibling and child concepts; we work in a subject area and examine and document the relationships of the concepts.

In the following example, the terminologist decided to not only set up an entry for forecasting, but to also list different types of forecasting—child or subordinate concepts—and the parent or superordinate concept. The J.D. Edwards terminology tool, TDB, had an add-on that turned the data into visuals, such as the one below. It goes without saying that displays of this nature help, for instance, the Finnish terminologist to find equivalents more easily when s/he knows that besides qualitative forecasting there is also quantitative forecasting, etc.

JDE types of forecasting
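The parent, sibling and child relationships of a concept system can be represented minimally as a mapping from each concept to its subordinate concepts. The sketch below uses the forecasting example; apart from qualitative and quantitative forecasting, the concept names are illustrative, not actual TDB content:

```python
# Each concept maps to its subordinate (child) concepts.
concept_system = {
    "planning": ["forecasting"],  # hypothetical superordinate concept
    "forecasting": ["qualitative forecasting", "quantitative forecasting"],
    "qualitative forecasting": [],
    "quantitative forecasting": [],
}

def subordinate_concepts(concept, system):
    """Collect every concept below the given one, depth first."""
    found = []
    for child in system.get(concept, []):
        found.append(child)
        found.extend(subordinate_concepts(child, system))
    return found

print(subordinate_concepts("forecasting", concept_system))
# ['qualitative forecasting', 'quantitative forecasting']
```

A visualization add-on like the one TDB had would essentially walk such a structure and draw it.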

In his Manuel pratique de terminologie, Dubuc suggests that ad-hoc terminology work is a good way to get started with terminology management. Furthermore, he is right in that documenting concepts and their systems takes time and money, both of which are in short supply in many business environments. On the other hand, a more systematic approach will, in my experience, lead to entries that stand the test of time longer, create fewer downstream problems or questions, and need less maintenance. So, investing more time in the initial research and documenting the surrounding concepts while you have the information at hand anyway may very well pay off later. Seasoned terminologists know when to include terms to flesh out a system and when to simply answer an ad-hoc question.

Posted in Advanced terminology topics, Content publisher, J.D. Edwards TDB, Selecting terms, Terminologist, Terminology 101 | Tagged: , , , | Leave a Comment »

How do I identify a term—visibility

Posted by Barbara Inge Karsch on June 29, 2010

Yesterday’s example was the term ribbon. While the concept was an innovation at the time and is quite prevalent in software today, the term is not necessarily highly visible. Today’s focus will be on the term-selection criterion “visibility”—in other words, on terms that are conspicuous and prevalent.

Look at the following screen prints from products within the Microsoft Office 2010 suite:

Microsoft Office ribbon tabs

Did you find some highly visible terms there? All of them stand for ribbon tabs that are highly standardized to maximize user retention: One term representing the same concept in each of the different products makes it much easier for the user to remember where to find what. Do you think that this was a coordinated effort? I don’t know for sure, as my involvement with Office was limited to Office 2007, but it looks like it. That, too, is terminology management.

Highly visible terms must be correct in both the source and all target languages. Inconsistencies, spelling errors or variations are not only embarrassing; they lead to less trust by users, especially in markets with high quality expectations. Terminology management working methods can spare you the embarrassment and lead to a trusting relationship with the users.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , , | Leave a Comment »

How do I identify a term—novelty

Posted by Barbara Inge Karsch on June 28, 2010

In the posting on frequency and distribution, the focus was on automated term extraction output. Today’s criterion for term selection will pertain more often to manual term extraction. For consistency’s sake, we call it novelty, to go along with all the other nouns (terminologization, specialization, confusability, frequency and distribution). But it simply refers to terms that are new and should be added to a terminology database for that reason.

In the manual term extraction process a writer or editor documents terms while authoring material. They can do this either in a separate list or directly in the terminology database, depending on their working style, the need for immediate availability of the terms, their rights in the terminology tool, etc. Many of the terms documented this way will meet the criterion “novelty.” In a less strict sense of the word, novelties or “new terms” can also be the focus of a term extraction program. These programs can be set up to only extract terms that have not come up or been documented so far. The difference is that the human can evaluate right away which term really stands for an innovative concept, while the machine will only exclude what is already documented elsewhere.

Most of us remember that with Office 2007 the ribbon was introduced. While the name of this new tabbed command bar does not show up in text all that often, it was new and would have been hard to name in other languages had it not been documented in a terminology entry.

If the answer to the question “is this a new term representing a new concept?” is yes, do make an entry in the terminology database. Especially in environments where terminology management has been common practice and there is no need to document legacy terminology, most terms added to the database meet this criterion. Stay tuned for the posting on term selection and visibility.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , , | Leave a Comment »

How do I identify a term—frequency and distribution

Posted by Barbara Inge Karsch on June 27, 2010

A seemingly obvious criterion for selecting terms for a terminology database is frequency of occurrence. A term extraction program, for example, should tell us how often a term appears in the mined text. Term extraction output or other text-mining solutions might also tell you what the distribution of a term is; in other words, you may be able to find out in how many documents or products a term occurs.

When sifting through term candidates in term-mining output, we very likely have to limit the scope quite a bit, because we can’t spend weeks on making perfect term selections. As we know by now, frequency is not the only term selection criterion, but it can help us, particularly in large projects. Here are some options with their pros and cons:




  • Ignore frequency and evaluate all term candidates. Pro: more precise selection, because nothing is excluded. Con: high time investment. Good for small lists; never completely ignore frequency, as it can still tell us something about the importance of a term.
  • Exclude all terms that occur less than x number of times. Pro: the number of term candidates is smaller. Con: potential to miss critical terms. Good for larger lists and when a critical percentage of terms was already extracted manually.
  • Exclude all terms that occur more than y number of times. Pro: the number of term candidates is smaller. Con: potential to miss critical terms. Good for large lists from which existing database or other non-critical terms or words were not excluded.
  • Only go through terms that occur more than x and less than y number of times. Pro: the number of terms can be reduced significantly. Con: high potential to miss critical terms. Good when both critical terms are already extracted and no stop word list was used.

If a term occurs often in a project, it is probably either very important or so generic that it shouldn’t be included. If you run a term extraction process, generic words should not be part of the resulting list in the first place; they should be caught by a stop-word list.
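All four options above boil down to choosing a lower and/or an upper frequency bound for the candidate list. A minimal sketch with invented counts:

```python
def frequency_band(candidates, min_count=None, max_count=None):
    """Keep candidates whose frequency falls inside the chosen band.
    A bound of None means that end is ignored (option 1 uses two Nones)."""
    return {
        term: count
        for term, count in candidates.items()
        if (min_count is None or count >= min_count)
        and (max_count is None or count <= max_count)
    }

# Invented counts for illustration.
counts = {"port": 3, "user": 250, "terminologization": 7, "the": 4100}
print(frequency_band(counts, min_count=2, max_count=100))
# {'port': 3, 'terminologization': 7}
```

With a stop word list applied beforehand, an entry like "the" would never reach this stage at all.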

Certain term mining solutions or lookup tools also indicate in which project or in which version and product a particular term is used. In other words, they give us information about the distribution of a term. But high distribution, just like high frequency, may be a property of terms that are very well known and do not need to be documented. For example, at Microsoft it would seem useless to include terms such as computer or user just because they occur frequently and are widely distributed. There are other reasons to include them, though. By the same token, a widely distributed and highly frequent term that is somewhat mysterious should be included in the terminology database, as many users might need to look it up, and the return on investment is there.

To summarize, frequency and distribution are important term selection criteria. They must be looked at in combination with other criteria, though, to make sense. One criterion to consider could be novelty, which we will examine in the following entry.

Posted in Content publisher, Selecting terms, Term extraction tool, Terminologist, Terminology 101 | Tagged: , , , | 1 Comment »

How do I identify a term—specialization

Posted by Barbara Inge Karsch on June 26, 2010

You may have noticed that no two people involved in term selection will make the exact same choices; each person’s list would look slightly different. And depending on the users of the database, different terms need to be selected. After terminologization and confusability, the next selection criterion is a term’s degree of “specialization.” And here is where the person selecting the terms and the person consuming the terminology product influence the choices.

What is a highly specialized term to one person may be old hat to someone else. For example, a content publisher who has worked on, say, ERP content most of their professional life, may not want to document the term “bill of material.” But for an English-to-Slovak translator who might work on a birth certificate and a medical report one day and the ERP project the next, it is really helpful to have a terminological entry for “bill of material” to resort to.

Similarly, if the goal is to prepare a terminology glossary for medical interpreters who have worked in their specialized field for a long time, we may not add the most common anatomical body parts, such as sternum, as they would likely be familiar with them. But if the same terminology database is used to produce a glossary for patient information, it may very well be worthwhile to select and document the terms sternum and breastbone.

In my experience—especially in large-scale environments with multilingual databases with dozens of target languages, hundreds of products and thousands of consumers—if you find that a term is not that specialized, because you are familiar with it, do include it anyway. Since you know it, you can set up a correct and complete entry quite fast; while it’ll take someone else a long time to research and find the information that you already have.

After terminologization, confusability, and specialization, tomorrow we’ll look at the simple topic of frequency.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , , | Leave a Comment »

How do I identify a term—confusability

Posted by Barbara Inge Karsch on June 25, 2010

Let’s continue our series on which designators (remember, these are terms, appellations and symbols) to include in a terminology database. Today, we will focus on the question: Can this designator be confused with another? More specifically, is there a homograph that stands for a different concept?

Homographs—words that have the same spelling, but differ from one another in meaning, origin, and sometimes pronunciation—are probably the most frequent source of confusion. While we try not to use one term for multiple things, it cannot always be avoided; language is alive, meaning evolves, and even with the best prescriptive terminology management system, you might encounter homographs. A good example is the term port. Port has many meanings as a word in general language and as a term in special languages. In the IT world, it can refer to at least a physical piece of hardware and a logical piece of software.

Theoretically, when there is the risk of “confusability,” the technical writer should be very specific, for instance, by using physical port or hardware port or even more specifically keyboard port. But even if the writer is precise in the first occurrence of the concept in the text, s/he may use the more generic or abbreviated form port in subsequent parts of the text or on the user interface. Because we never know what shows up in the translation environment first, though, it is good to alert a localizer to the fact that there are multiple meanings behind the term and include it in the terminology database.

So, if the answer to the question “is there a risk of confusability?” is yes, add the term and its homograph to the terminology database. While users of the database still need to identify the meaning in their context, at least they are alerted to the fact that there are two or more possible meanings.
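In a term base where each entry pairs a term with a concept, potential homographs can even be flagged mechanically: any term documented under more than one concept deserves a closer look. A hypothetical sketch (real term bases key entries by concept IDs; plain labels keep the example readable):

```python
from collections import defaultdict

# Invented entries: (term, concept label).
entries = [
    ("port", "physical hardware connector"),
    ("port", "logical software endpoint"),
    ("ribbon", "tabbed command bar"),
]

def homographs(entries):
    """Flag terms documented under more than one concept."""
    by_term = defaultdict(set)
    for term, concept in entries:
        by_term[term].add(concept)
    return {term: sorted(c) for term, c in by_term.items() if len(c) > 1}

print(homographs(entries))
# {'port': ['logical software endpoint', 'physical hardware connector']}
```

The flag only alerts users to multiple meanings; deciding which meaning applies in context remains their job, as described above.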

Tomorrow, we will discuss selecting terms based on their degree of specialization.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , , , | Leave a Comment »

How do I identify a term—terminologization

Posted by Barbara Inge Karsch on June 24, 2010

In What is a term? our focus was on how to define the scope of a terminology database and guide a team on what should and what shouldn’t be entered into a terminology database. It is good to have rough guidelines, but there is obviously more to the story of what a term is and what should be included in a terminology database.

If we are asked to go through a list of term candidates extracted by a term extraction tool or if we are selecting terms manually, we may not always be sure whether a certain term candidate should be included. Especially if you are not a subject matter expert or if you only speak one language, this is a difficult job. It is a little easier for translators, as they are used to analyzing texts very thoroughly. As an aside, this quality makes the translator a content publisher’s best friend, for translators find the mistakes, the inconsistencies or just the minor hitches of a text. And yet in the term selection process, we have to make decisions in split seconds. How do we make them? This and the next eight postings—one short post over the next eight days—will provide more in-depth guidance on why a term should be included in a terminology database.

Let’s start with terms that have gone through what is called “terminologization”—the process by which a general-language word or expression is transformed into a term designating a concept in a language for special purposes (LSP) (ISO 704). This Microsoft Language Portal Blog posting gives a variety of examples of animal names, e.g. mouse or worm, that became technical terms in the IT industry. We are often able to recognize terms that have undergone terminologization by distinguishing them from other terms in their conceptual vicinity (see Juan Sager’s A Practical Course in Terminology), e.g. dedicated line vs. public line.

So, if we ask ourselves “Is this a word that became a term and is now used with a very specific meaning in technical language?” and the answer is yes, let’s include it in the terminology database. Then there is no confusion about what we mean by it, because it is clearly defined, and its usage can be standardized across languages.

More on term selection and the criterion “confusability” next time.

Posted in Content publisher, Selecting terms, Terminologist, Terminology 101 | Tagged: , | 3 Comments »
