Posted by Barbara Inge Karsch on June 10, 2011
Sooner rather than later, terminologists need to think about database maintenance. Initially, with few entries in the database, data integrity is easy to ensure: the terminologist might remember just about every entry they ever compiled; my Italian colleague, Licia, remembered just about any entry she ever opened in the database. But even the best human brains will eventually ‘run out of memory’ and blunders will happen. One of these blunders is the so-called doublette.
According to ISO TR 26162, a doublette is a “terminological entry that describes the same concept as another entry.” Sometimes these entries are also referred to as duplicates or duplicate entries, but the technical term in standards is doublette. It is important to note that homonyms do not equal doublettes. In other words, two terms that are spelt the same way and that are in two separate entries may refer to the same concept and may therefore be doublettes. But they may also justifiably be listed in separate entries, because they denote slightly or completely different concepts.
As an example, I deliberately set up doublettes in i-Term, a terminology management system developed by DANTERM: The terms automated teller machine and electronic cash machine can be considered synonyms and should be listed in one terminological entry. Below you can see that automated teller machine and its abbreviated form ATM have one definition and definition source, while electronic cash machine and its abbreviated form, cash machine, are listed in a separate entry with another, yet similar definition and its definition source. During database maintenance, these entries should be consolidated into one terminological entry with all its synonyms.
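The consolidation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` class, its fields, and the `consolidate` function are hypothetical and do not reflect the i-Term data model, and the definition texts and sources are placeholders rather than the real ones.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical concept-oriented entry: one concept, several terms."""
    entry_id: int
    terms: list[str]
    # Each definition is a (text, source) pair; the values below are placeholders.
    definitions: list[tuple[str, str]] = field(default_factory=list)


def consolidate(keep: Entry, drop: Entry) -> Entry:
    """Merge `drop` into `keep`, listing every term and definition exactly once."""
    # dict.fromkeys removes duplicates while preserving the original order.
    terms = list(dict.fromkeys(keep.terms + drop.terms))
    definitions = list(dict.fromkeys(keep.definitions + drop.definitions))
    return Entry(keep.entry_id, terms, definitions)


atm = Entry(1, ["automated teller machine", "ATM"],
            [("definition A", "source A")])
cash = Entry(2, ["electronic cash machine", "cash machine"],
             [("definition B", "source B")])

merged = consolidate(atm, cash)
print(merged.terms)
```

The merged entry still carries both definitions. Choosing the one to keep remains a decision for the terminologist, not for the script.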
It is much easier to detect homographs that turn out to be doublettes. Rather, it should be easier to avoid them in the first place: after all, every new entry in a database starts with a search for the term denoting the concept, and if the term already exists with the same spelling, the search returns a hit. Here are ‘homograph doublettes’ from the Microsoft Language Portal. While we can’t see the ID, the definitions show pretty clearly that the two entries describe the same concept.
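A search for identically spelled terms is easy to automate. The sketch below is a minimal illustration, assuming the database can be exported as a mapping from entry ID to a list of terms; the function names, the IDs, and the normalization rule (case folding plus collapsing whitespace) are assumptions made for this example, not features of any particular tool.

```python
from collections import defaultdict


def normalize(term: str) -> str:
    """Fold case and collapse whitespace so spelling variants compare equal."""
    return " ".join(term.casefold().split())


def homograph_candidates(entries: dict[int, list[str]]) -> dict[str, list[int]]:
    """Return each normalized term that occurs in more than one entry."""
    by_term: defaultdict[str, set[int]] = defaultdict(set)
    for entry_id, terms in entries.items():
        for term in terms:
            by_term[normalize(term)].add(entry_id)
    return {term: sorted(ids) for term, ids in by_term.items() if len(ids) > 1}


# Hypothetical export: entry ID -> terms listed in that entry.
entries = {
    101: ["cash machine", "ATM"],
    102: ["Cash  Machine"],
    103: ["bank"],
}

candidates = homograph_candidates(entries)
print(candidates)
```

The result is only a list of candidates. Since homographs may justifiably sit in separate entries when they denote different concepts, each hit still needs a terminologist to compare the definitions.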
Doublettes happen, particularly in settings where more than one terminologist adds and approves entries in a database. But even if one terminologist approves all new concepts, s/he cannot guarantee that a database remains free of doublettes. The right combination of skills, processes and tool support can help limit the number, though.
Posted in iTerm, Maintaining a database, Microsoft Language Portal, Process, Setting up entries | Tagged: doublette, ISO 26162
Posted by Barbara Inge Karsch on July 16, 2010
The Localization Industry Standards Association (LISA) reminded us in their recent Globalization Insider that they had declared 2010 the ‘Year of Standards.’ It resonates with me because socializing standards was one of the objectives that I set for this blog. Standards and standardization are the essence of terminology management, and yet practitioners either don’t know of standards, don’t have time to read them, or think they can do without them. In the following weeks, as the ISO Technical Committee 37 ("Terminology and other language and content resources") is gearing up for the annual meeting in Dublin, I’d like to focus on standards. Let’s start with ISO 12620.
ISO 12620:1999 (Computer applications in terminology—Data categories—Part 2: Data category registry) provides standardized data categories (DCs) for terminology databases; a data category is, as it were, the name of a database field together with its definition and its ID. Did everyone notice that terminology can now be downloaded from the Microsoft Language Portal? One of the reasons why you can download the terminology today and use it in your own terminology database is ISO 12620. The availability of such a tremendous asset is a major argument in favor of standards.
I remember when my manager at J.D. Edwards slapped 12620 on the table and we started the selection process for TDB. It can be quite overwhelming. But I turned into a big fan of 12620 very quickly: It allowed us to design a database that met our needs at J.D. Edwards.
When I joined Microsoft in 2004, my colleagues had already selected data categories for a MultiTerm database. Since I was familiar with 12620, it did not take much time to be at home in the new database. We reviewed and simplified the DCs over the years, because certain data categories chosen initially were not used often enough to warrant their existence. One example is ‘animacy,’ which is defined in 12620 as “[t]he characteristic of a word indicating that in a given discourse community, its referent is considered to be alive or to possess a quality of volition or consciousness”… most of the things documented in Term Studio are dead and have no will or consciousness. But we could simply remove ‘animacy’, while it would have been difficult or costly to integrate a new data category late in the game. If you are designing a terminology database, err on the side of being more comprehensive. Because we relied on 12620, preparing earlier in 2010 to make the data exportable in TBX format (ISO 30042) was easy. The alignment was already there, and communication with the vendor, an expert in TBX, was easy.
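To make the idea of alignment concrete, here is a small sketch of the kind of check one might run before an export: it lists the local database fields that have no standardized counterpart. Everything in it is hypothetical. The field names, the data category names they map to, and the `unmapped_fields` helper are illustrative assumptions and are not taken from Term Studio, ISO 12620, or TBX.

```python
# Hypothetical mapping from local field names to standardized data category
# names. Both sides are made up for illustration.
FIELD_TO_DATA_CATEGORY = {
    "Definition": "definition",
    "POS": "partOfSpeech",
    "Def. source": "source",
}


def unmapped_fields(local_fields: list[str], mapping: dict[str, str]) -> list[str]:
    """Return local fields with no standardized counterpart, in input order."""
    return [name for name in local_fields if name not in mapping]


local_fields = ["Definition", "POS", "Def. source", "Internal note"]
missing = unmapped_fields(local_fields, FIELD_TO_DATA_CATEGORY)
print(missing)
```

A database designed on the basis of standardized data categories should leave this list short, which is what keeps an export into an interchange format manageable.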
ISO 12620:1999 has since been retired and was succeeded by ISO 12620:2009, which “provides guidelines […] for creating, selecting and maintaining data categories, as well as an interchange format for representing them.” The data categories themselves were moved into the ISOcat “Data Category Registry”, which is open for use by anyone.
ISO 12620, or now the Data Category Registry, allows terminology database designers to apply tried and true standards rather than reinventing the wheel. Like all standards, they enable quick adoption by those familiar with them and they enable data sharing (e.g. in large term banks, such as the EuroTermBank). If you are not familiar with standards, read A Standards Primer written by Christine Bucher for LISA. It is a fantastic overview that helps navigate the standardization maze.
Posted in Advanced terminology topics, Designing a terminology database, EuroTermBank, J.D. Edwards TDB, Microsoft Language Portal, Microsoft Terminology Studio, Terminologist | Tagged: ISO 12620, ISOcat, TBX, TC37