Thank you very much for your positive feedback while I was busy with things like Windows 8 terminology, teaching at NYU, attending TKE and the ISO meetings in Madrid, and doing webinars. During one of the webinars, we didn’t get around to all questions. I will be addressing some of these here now.
Question: As you add terminology into your database, you might not remember that you have already entered some word that is a synonym. So, might you not end up with a different ID for 2 synonyms?
Answer: Yes, that is a very common scenario, and everyone who sets up terminology entries faces it. We do our best to enter terms and names in canonical form in order to find them again and to avoid creating duplicates. So, we document, say, operating system and not Operating Systems, or we enter purge, and not to purge or purged in the database. Even if we are careful about the form of our terms, we might not remember the meaning of all entries created and thus willy-nilly create doublettes in our database. Oftentimes, we create them because we are not aware that one entry is a view onto a concept from one angle and a second entry might present the same concept from another angle, similar to these two pictures of the same flower.
Here are a few thoughts on what might help you avoid duplicate entries:
- Start out by specifying the subject field in your database. It will help you narrow down the concept for which you are about to create an entry. You might do a search on the subject field and see what concepts you defined at an earlier time. Sometimes that helps trigger your memory.
- As you are narrowing down the subject field and take a quick glance through some of the existing definitions, you might identify and recognize an existing concept as the one you are about to work on.
If you set up a doublette anyway—and it is bound to happen—you might find it later in one of the following ways and eradicate it:
- Export your database into a spreadsheet program and do a quick QA on your entries. In a spreadsheet, such as Excel, you can sort each column. If there are true doublettes, you might have started their definitions with the same superordinate, so sorting the definition column lines those entries up next to each other.
- If you don’t have time for QA, I would simply wait until you notice a doublette while you are using your database and take care of it then. The damage in databases with lots of languages attached to a source language entry is bigger, but there are usually also more people working in the system, so errors are identified quickly. For the freelance translator, a doublette here and there is not as costly and it is also eliminated quickly once identified.
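For those who would rather script the spreadsheet check than sort by hand, here is a minimal sketch in Python. It assumes each exported row has been read into a dictionary with the keys "id", "term" and "definition"; those names, the sample entries and the function name are made up for illustration and will differ in your own export. It groups entries whose definitions begin with the same few words, which is roughly what sorting the definition column does for you. It only surfaces candidates for a manual look, and it will miss doublettes whose definitions start with different superordinates.

```python
from collections import defaultdict


def candidate_doublettes(entries, n_words=2):
    """Group entries whose definitions start with the same words.

    `entries` is a list of dicts with the hypothetical keys
    "id", "term" and "definition", e.g. rows read with csv.DictReader.
    """
    groups = defaultdict(list)
    for entry in entries:
        words = entry["definition"].lower().split()
        # Skip a leading article so "A program that" and "Program that" match.
        if words and words[0] in {"a", "an", "the"}:
            words = words[1:]
        key = " ".join(words[:n_words])
        if key:
            groups[key].append(entry["id"])
    # Only keys shared by two or more entries are worth a manual look.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Invented sample data, standing in for an exported termbase.
entries = [
    {"id": "101", "term": "operating system",
     "definition": "Software that manages hardware resources."},
    {"id": "102", "term": "purge",
     "definition": "To remove data permanently."},
    {"id": "103", "term": "OS",
     "definition": "Software that controls a computer and runs programs."},
]
print(candidate_doublettes(entries))
```

With the sample data, entries 101 and 103 end up in one group because both definitions start with "Software that". Whether they really describe the same concept is still a call for the terminologist.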
Developers of terminology management systems might eventually get to a point where maintenance functionality becomes part of the out-of-the-box program. At Microsoft, a colleague worked on an algorithm that helped us identify duplicates. The project was not completed when I left the corporate world, but a first test showed that the noise the program identified was not overwhelming. So, there is hope that with increasing demand for clean terminological and conceptual data such functionality becomes standard in off-the-shelf TMSs. In the meantime, stick with best practices when documenting your terms and names and use the database.
Fabio Said says
Great article. This is a really annoying problem. The solution with the Excel file is actually what I do – when I feel inspired to do housekeeping with my terminology databases. 🙂
By the way, are you at the BDÜ conference in Berlin right now? I’d like to meet you in person and talk about terminology management, my latest obsession. If not, I will definitely approach you during the next ATA Annual Conference in San Diego.
Barbara Inge Karsch says
Nice to see when the little tricks of the trade are confirmed by others. Thanks, Fabio.
I am not in Berlin, but I will be in San Diego. Yes, let’s get together. I’ll send you a separate mail.
Barbara
Lucy Brooks says
Thank you Barbara for your continued wisdom on terminology documentation matters. Best wishes for San Diego.
Barbara Inge Karsch says
Thanks, Lucy.
simonevi says
The example using the flower is just perfect!!
Sue Kocher says
Great article, Barbara, about a problem that plagues me continuously. The t19t processes at my company differ from those of most organizations, in that we have more than just terminologists contributing to the termbank. To enable product glossaries to be built relatively quickly, we also give write access to the termbank to 50 or so writers and editors. They get some training in using the termbank software, and in the concept of “concepts”–but it doesn’t always stick, especially when it might be 6 months or a year before they actually need to start documenting their terminology. Sometimes they forget to search the termbank before creating a new entry, but more often, as you noted, they are “not aware that one entry is a view onto a concept from one angle and a second entry might present the same concept from another angle.” And I can’t always tell either, without spending some time researching and asking questions of various subject matter experts.
Furthermore, these doublettes almost never start with the same superordinate concept–because the “angles” are so different, or because not everyone knows how to choose an appropriate superordinate. That’s a job for continued training, and I’m working on that!
Related to this, we run into problems when folks try to “define” what is essentially a field label in a software product. It might be something like, “Trading Partner” which, for our purposes, needs to be adequately documented in the Help and possibly explained to the translators, but does not need to be “defined” as a concept. Doublette entries then often follow, because subsequent writers see a very product-specific definition that does not quite fit their own product. And so it goes. To help head off such issues, we set permissions on the termbank so that writers and editors can contribute term data, but only the terminologist can finalize it. That helps.
Barbara Inge Karsch says
The knowledge in a terminologist’s head is another factor that really helps keep duplicates down. Unfortunately, the institutional knowledge that someone brings to a job is so often ignored in our highly dynamic worlds. Outsourcing to changing suppliers does not help either. SAS should be glad that you have been there for a while and that you recognize these problems, Sue.
Michael Beijer says
Hi Barbara,
My suggestion would be that they really need to build in a way to hunt down and get rid of duplicates. Especially now that they have introduced a term extraction module, which I expect is going to create quite a few. I hope they are not just using this lack of functionality as a way of making us cough up the extra money for qTerm (whose price isn’t even listed).
As to the second problem (reimporting term bases), it wouldn’t exist if memoQ could clean duplicates. I think any decent tool ought to be able to handle its own data internally. This also applies to the TM editor in memoQ, which is basically … the ancient Olifant for many users.
In short, although I understand they are trying to target different types of users with their Pro and Server tools, data management really needs some work before they add any more new features. If you have access to a listening ear at Kilgray, this would basically be my number one feature request.
Best wishes,
Michael