Sunday 28 March 2010

Mapping intricacies: UDC to DDC

Last week, I received an email from Yulia Skora (Ukraine), who was interested in the availability of a mapping between the UDC Summary and the summary of the Russian universal classification LBC (Library Bibliographic Classification; in Russian BBK, Библиотечно-библиографическая классификация). It reminded me of yet another challenging area of work. When responding to Yulia I realised that mapping, for instance, the UDC Summary to the Dewey Summaries is often made more difficult because we are dealing with summaries of both systems, and in many situations we cannot use a known exactMatch.

In 2008, following advice received from colleagues in the HILT project, two of our colleagues quickly mapped the 1000 classes of the Dewey Summaries to the UDC Master Reference File as a whole. This appeared to be relatively simple. The mapping in this case is simply an answer to the question "and how would you say, e.g., Art metal work in UDC?"

But when in 2009 we realised that we were going to release 2000 classes of UDC Summary as linked data, we decided to wait until we had our UDC Summary set defined and completed to be able to publish it mapped to the Dewey Summaries.

As we arrived at this stage, little did we realise how much more complex the reversed mapping of UDC Summary to Dewey Summaries would turn out to be.

Mapping the Dewey Summaries to UDC highlighted situations in which the logic and structure of the two systems do not agree, especially because Dewey tends to enumerate combinations of subjects and attributes that do not always logically belong together. Take, for instance, 850 Literatures of Italian, Sardinian, Dalmatian, Romanian, Rhaeto-Romanic languages; Italian literature. This class mixes languages from three different subgroups of Romance languages. Italian and Sardinian belong to the Italo-Romance sub-family; Romanian and Dalmatian are Balkan Romance languages; and Rhaeto-Romance is the third subgroup, which includes Friulian, Ladin and Romansh. As UDC literature is based on a strict classification of language families, Dewey class 850 has to be mapped either to 3 narrower UDC classes (821.131 Literature of Italo-Romance languages, 821.132 Literature of Rhaeto-Romance languages and 821.135 Literature of Balkan-Romance languages) or to a broader class, 821.13 Literature of Romance languages. Hence we have to be sure that all these classes are listed in the UDC Summary to be able to express UDC-DDC many-to-one, specific-to-broader relationships.
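The 850 case can be written down as a small data structure. This is only an illustrative sketch: the relation label borrows the name of the SKOS mapping property broadMatch, while the `Mapping` tuple and the `sources_for` helper are hypothetical names and not part of any UDC or Dewey tooling.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    source: str    # notation in the source scheme
    target: str    # notation in the target scheme
    relation: str  # SKOS-style relation, read from source to target


# UDC Summary -> Dewey Summaries: three specific UDC classes share one
# broader Dewey target (many-to-one, specific-to-broader).
UDC_TO_DDC = [
    Mapping("821.131", "850", "broadMatch"),  # Italo-Romance literature
    Mapping("821.132", "850", "broadMatch"),  # Rhaeto-Romance literature
    Mapping("821.135", "850", "broadMatch"),  # Balkan-Romance literature
]

# Dewey Summaries -> UDC: the alternative single mapping to a broader class.
DDC_TO_UDC = [
    Mapping("850", "821.13", "broadMatch"),  # Literature of Romance languages
]


def sources_for(target, mappings):
    """Return every source notation mapped to the given target, sorted."""
    return sorted(m.source for m in mappings if m.target == target)


print(sources_for("850", UDC_TO_DDC))
```

Listing the three UDC classes separately is what makes the many-to-one relationship visible; if any of them were missing from the UDC Summary, the lookup for 850 would silently return fewer sources.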

Another challenge appears when mapping, e.g., Dewey class 890 Literatures of other specific languages and language families. Such a class does not make sense in UDC, in which all languages and literatures have equal status. Standard UDC schedules do not have a selection of preferred literatures and other literatures. In principle, UDC does not allow classes entitled 'others', which do not have defined semantic content. If entities are subdivided and there is no provision for an item outside the listed subclasses, then this item is subsumed under a top class or a broader class where all unspecified or general members of that class may be expected. If specification is needed, this can be achieved by adding an alphabetical extension to the broader class. Here we have to find and list in the UDC Summary all literatures that are 'unpreferred', i.e. lumped together in the 890 classes, and map them again as a many-to-one, specific-to-broader match.

The example below illustrates another interesting case. Dewey class 061 and UDC class 06 cover roughly the same semantic field, but in their subdivisions the Dewey Summaries list combinations of subject and place. As an enumerative classification, Dewey provides ready-made numbers for the combinations with place that are most common in an average (American?) library. This is a frequent approach in schemes created with physical book arrangement, i.e. library shelves, in mind. UDC, designed as an indexing language for information retrieval, keeps subject and place in separate tables and allows any concept of place, e.g. (7) North America, to be used in combination with any subject, as these may coincide in documents. Thus combinations such as Newspapers in North America or Organizations in North America would not be offered as ready-made combinations. There is no selection of 'preferred' or 'most needed' countries, languages or cultures in the standard UDC edition:

If we map the Dewey Summaries to UDC in general and do not have to worry about a reverse relationship, the situation is very simple, as shown above.

Mapping of UDC Summary to Dewey Summaries requires more thought.

Firstly, UDC class (7) North America (a common auxiliary of place), which simply represents the place, has to be mapped to all occurrences in which this place is 'built in' to the Dewey subjects:

063 Organization of North America
073 Journalism of North America
917 Geography of North America
970 History of North America
277 Christianity in North America
317 General Statistics in North America
557 Earth Sciences of North America

The mapping from the general UDC concept of place (7) North America to a specific subject is clearly a broader-to-narrower match. Mapping, for instance, UDC class 07 Newspapers. The press (includes journalism) to DDC class 073 Journalism of North America is again a broader-to-narrower match.
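As a rough sketch of this fan-out, the place auxiliary can be linked to every Dewey class in which the place is built in. The Dewey numbers and captions are copied from the list in this post; the `fan_out` helper and the use of the SKOS property name narrowMatch as a plain string label are assumptions made for illustration.

```python
# Dewey Summaries classes listed in this post as having North America built in.
DDC_NORTH_AMERICA = {
    "063": "Organization of North America",
    "073": "Journalism of North America",
    "917": "Geography of North America",
    "970": "History of North America",
    "277": "Christianity in North America",
    "317": "General Statistics in North America",
    "557": "Earth Sciences of North America",
}


def fan_out(udc_notation, ddc_classes, relation="narrowMatch"):
    """Link one UDC notation to each Dewey notation with the same relation.

    The result is a list of (UDC notation, relation, DDC notation) triples,
    ordered by Dewey notation.
    """
    return [(udc_notation, relation, ddc) for ddc in sorted(ddc_classes)]


# One general UDC place concept, many narrower Dewey targets.
place_links = fan_out("(7)", DDC_NORTH_AMERICA)

# A UDC subject class mapped to a Dewey class that adds a place.
subject_link = fan_out("07", {"073": DDC_NORTH_AMERICA["073"]})

for triple in place_links + subject_link:
    print(triple)
```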

Precombined subjects, such as those shown above from Dewey, may be expressed in the UDC Summary as examples of combination within various records. To express an exact match, UDC class 07 has to contain the example of combination 07(7) Journals. The Press - North America. In some cases we have, therefore, added examples to the UDC Summary that represent an exact match to the Dewey Summaries. It is unfortunate that DDC has so many top-level classes that deal with a selection of countries or languages given a preferred status in the scheme. Repeating these preferences in UDC examples of combination reproduces an unwelcome cultural bias, which we have to balance out somehow.

This brings us to another challenge... UDC 913(7) Regional Geography - North America [contains 2 concepts, each of which has its own URI] is an exact match to Dewey 917 [represented as one concept with 1 URI]. It seems that, because they represent an exact match to Dewey numbers, these UDC examples of combinations may also need separate URIs so that they can be published as SKOS data.
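To make the "two concepts versus one URI" point concrete, here is a minimal sketch that splits a simple combination such as 913(7) into its main number and place auxiliary. The regular expression only handles this one pattern (main number followed by a single place auxiliary in parentheses), and the `http://example.org/udc/` base is a placeholder, not the real UDC linked data namespace.

```python
import re

# Main number, then one place auxiliary in parentheses starting with 1-9.
# This covers only simple cases like 913(7) or 07(7), not full UDC syntax.
COMBINATION = re.compile(r"^(?P<main>\d[\d.]*)\((?P<place>[1-9][\d.]*)\)$")


def split_combination(notation):
    """Split e.g. '913(7)' into its main number and place auxiliary."""
    match = COMBINATION.match(notation)
    if match is None:
        raise ValueError(f"not a simple subject-place combination: {notation!r}")
    return match.group("main"), "(" + match.group("place") + ")"


def describe(notation, base="http://example.org/udc/"):
    """Build a record with the component URIs and a URI for the combination.

    The base URI is a placeholder chosen for this sketch.
    """
    main, place = split_combination(notation)
    return {
        "notation": notation,
        "components": [base + main, base + place],  # two existing concepts
        "own_uri": base + notation,                 # the combination itself
    }


print(describe("913(7)"))
```

Giving the combination its own identifier, alongside the identifiers of its components, is what would allow a single exactMatch statement to point at Dewey 917.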

Albeit challenging, mapping proves to be a very useful exercise and I am looking forward to future work here especially in relation to our plans to map UDC Summary to Colon Classification. We are discussing this project with colleagues from DRTC in Bangalore (India).

UDC Summary - translation in progress for 21 languages

The UDC Summary translation team had a busy week. We have uploaded the top classes for Estonian and Armenian languages, just a day after we uploaded the top classes for Hindi and over 800 classes of Norwegian that we managed to extract from TEKORD data (courtesy of Rurik Greenal).

We now have 21 languages online and over 30 volunteers working on translations.

Our online translation tool is being enhanced as we speak. A browsing list with a colour scheme indicating record completion and enabling easy selection of records for translation was also added last week.

The online editor now allows the editing of a subject index and mapping. Access to this is now available for contributors working in this area.

The translation progress statistics can now be viewed for all 21 languages.

The progress statistics page harvests up-to-the-minute completion statistics for each language from the UDCS database and displays them as graphs using jQuery and jqPlot. The percentage completion figures for each language (compared to English) are shown in tables like the ones displayed on the right.
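The "percentage compared to English" figure is a simple ratio. The sketch below shows the arithmetic only; it is written in Python rather than the JavaScript used on the page, the function name is hypothetical, and the counts are made-up illustrative values, not real figures from the UDCS database.

```python
def completion_percent(translated, english_total):
    """Percentage of English records that have a translation, to 1 decimal."""
    if english_total <= 0:
        raise ValueError("english_total must be positive")
    return round(100.0 * translated / english_total, 1)


# Illustrative counts only: language code -> number of translated records.
example_counts = {"xx": 800, "yy": 50}
ENGLISH_TOTAL = 2000  # assumed size of the English set for this example

progress = {
    lang: completion_percent(count, ENGLISH_TOTAL)
    for lang, count in example_counts.items()
}
print(progress)
```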