Contents
  1. The concept of Taxonomy
  2. Construction of taxonomy
    2.1. Processes for the construction of taxonomies
    2.2. Automation of the processes of construction of taxonomies
  3. Resource categorisation
  4. Application of taxonomy in the development of information search systems
  5. Bibliography
  6. Notes

1. The concept of Taxonomy

By the time this article is published, an event will have taken place that should mark a turning point in the evolution of taxonomies as content organisation systems: the appearance of the final draft of the revision of the ANSI/NISO Z39.19-1993 standard, Guidelines for the construction, format, and management of monolingual thesauri [1]. The revision was carried out between 2002 and 2004 by the Thesaurus Advisory Group (hereinafter, TAG), created within the National Information Standards Organization, with three aims: to introduce more user-friendly language into the standard, to update it for the current environment of digital information, and to extend its scope to the wide range of organisations that produce and organise content.

We do not have the draft of the revised standard, but we do have a summary of its contents and the notes from the TAG meetings. These documents show that one of the global modifications proposed is a change of title, from Guidelines for the construction, format, and management of monolingual thesauri to Construction, format and management of monolingual controlled vocabularies. Controlled vocabularies comprise four main types: lists, synonym rings, taxonomies and thesauri. The revision of ANSI/NISO Z39.19 proposes a standardised definition of each of the four types and establishes the essential elements for the construction and management of all of them. Specifically, the "TAG Conference Call, June 30, 2003" (2003) included the following provisional definitions:

According to this definition, a taxonomy does not require its components to be connected by a specific type of relationship; it simply requires them to be organised. Its defining characteristics are its purpose (prioritising browsing) and, consequently, its application environment (the digital environment).

Nevertheless, in some documents relating to the revision of the ANSI/NISO Z39.19 standard, the four types of controlled vocabulary are distinguished by their lesser or greater structural complexity. At one end, lists and synonym rings include only the equivalence relationship; at the other, thesauri include equivalence, hierarchical and associative relationships. In between, taxonomies include equivalence and hierarchical relationships.
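The gradation of structural complexity described above can be written down as a small lookup table. The following is only an illustrative sketch: the dictionary and function names are assumptions made for this example, and the relationship sets simply restate the paragraph above rather than the text of the standard.

```python
# Relationship types carried by each kind of controlled vocabulary,
# as characterised in the paragraph above (all names are illustrative).
RELATIONSHIPS = {
    "list": {"equivalence"},
    "synonym ring": {"equivalence"},
    "taxonomy": {"equivalence", "hierarchy"},
    "thesaurus": {"equivalence", "hierarchy", "association"},
}


def vocabularies_supporting(relationship):
    """Return the vocabulary types that include the given relationship."""
    return sorted(
        name for name, rels in RELATIONSHIPS.items() if relationship in rels
    )


def is_at_least_as_complex(a, b):
    """True if vocabulary type `a` includes every relationship of type `b`."""
    return RELATIONSHIPS[a] >= RELATIONSHIPS[b]
```

Under this encoding, a thesaurus can express everything a taxonomy can, while a list or synonym ring cannot express hierarchy.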

While we wait for the TAG's work to provide a normative definition of taxonomy, it should be stressed that there is currently no universally accepted definition of the term.

Etymologically, taxonomy comes from the Greek terms "taxis", ordering, and "nomos", rule. Aristotle was one of the first to use this term, in the fourth century BC, to name hierarchical schemes oriented to the classification of scientific objects. The botanist Carl Linnaeus (1707-1778) used the term taxonomy for the classification of living beings in hierarchical groups, ordered from the most generic to the most specific (kingdom, type, order, genus and species). From this classical concept, taxonomy developed as a subfield of biology dedicated to the classification of organisms according to their differences and similarities. According to Grove (2003, p. 2774), the principles that strictly guided the construction of taxonomies were a logical basis, empirical observation, a hierarchical structure based on the inheritance of features, evolutionary history and pragmatic usefulness. General-language dictionaries still record the meaning specific to the experimental sciences, as shown by the entry in the latest print edition of the Diccionario de la lengua española (2001) (Dictionary of the Spanish Language):

"1. f. Science dealing with the principles, methods and purposes of classification. It is applied in particular, within biology, to the hierarchical and systematic ordering, with their names, of the groups of animals and plants.

2. f. classification (action and effect of classifying)."

In its basic sense, linked to the experimental sciences, taxonomy applies a mono-hierarchical criterion when establishing classification systems: each of the constituent groups or types can occupy one place, and only one, in the hierarchical structure.
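The mono-hierarchical criterion can be checked mechanically. The sketch below assumes, purely for illustration, that a classification is stored as a list of (child, parent) pairs; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_polyhierarchy(links):
    """Return the terms that have more than one parent.

    `links` is an iterable of (child, parent) pairs. An empty result means
    the structure is mono-hierarchical in the sense described above.
    """
    parents = defaultdict(set)
    for child, parent in links:
        parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


# Hypothetical sample data: a biological fragment and a web site fragment.
linnaean = [("Canis", "Canidae"), ("Canidae", "Carnivora"), ("Felis", "Felidae")]
web_site = [
    ("Exhibitions", "Cultural events"),
    ("Exhibitions", "Cultural heritage"),
]
```

The biological fragment passes the check, whereas the web site fragment places "Exhibitions" under two parents and would therefore not qualify as mono-hierarchical.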

At the beginning of the 1990s, the concept of taxonomy was adopted in other fields of knowledge, such as psychology, the social sciences and information technology, to designate almost any information access system that attempts to match the user's terminology with that of the system. The first specialists to develop web content organisation systems came from knowledge management consultancy, from fields close to information technology and engineering (content management and information architecture); unaware of the tradition of documentary languages in the information sciences, they used the term taxonomy for the systems they developed. The term is now used for content organisation systems in the Internet context, even though the theory and practice of documentary languages have been applied intensively in that context.

Before proposing a definition of taxonomy that fits its current areas of development, we identified and compared the semantic features used to define it. To this end, we carried out an extensive search for definitions across all the areas in which taxonomy is studied, developed and/or applied. Initially we placed no restriction on the origin of the definitions; we discarded only those based on the classical sense of the term. The result was 36 definitions published between 2000 and 2005 in various types of sources [2].

The analysis shows that the definitions give weight to four variables: the place taxonomy occupies among knowledge organisation systems (hereinafter, KOS); the information context in which taxonomy is applied; the purposes it pursues; and the structural model by which its elements are interrelated.

On the basis of the documentation drafted by the NISO TAG, and in view of the properties most widely accepted in the definitions formulated in the areas of study, development and/or application, the following definition is proposed:

Taxonomy is the type of controlled vocabulary in which all the terms are connected by means of some structural model (hierarchical, tree, faceted, ...) and which is specially oriented to the browsing, organisation and content search systems of web sites.

Three points in this definition need to be clarified:

With the definition of taxonomy established, we now briefly review the processes of taxonomy construction and its application to the categorisation of resources and to the development of information search systems for web sites. Both processes should be preceded by strategic planning that determines the characteristics the taxonomy should have, based on an analysis of the context (which identifies the corporation's priorities in organising and presenting information on the web site), of the audience (which identifies the needs, search behaviour and use of information of the various user segments) and of the content (which identifies content patterns).

2. Construction of taxonomy

2.1. Processes for the construction of taxonomies

The construction of corporate taxonomies involves four processes:

1. Delimitation of the reality (entity, knowledge area, industrial sector, etc.) that the taxonomy will represent.

2. Extraction of the set of terms or categories that represent that reality.

To carry out this process, it is first necessary to establish the priority sources and the most suitable extraction mechanisms for each of them. There are three types of source: personal sources, consisting of the web site's users and specialists in its domain; documentary sources, consisting of documents representative of the content types identified at the strategic planning stage; and existing taxonomies or knowledge representation instruments (from the nomenclature of an entity's units and resources to administrative classification schemes).

The extraction mechanisms must then be identified for each source; for personal sources, interviews with web site users and the analysis of search transaction logs are especially useful.

The result of this process is a register of representative terms or categories.

3. Terminological control of the terms or categories.

This process involves two tasks. First, the terms that express the same concept are identified; where there are two or more, one must be designated as the preferred term and the others as non-preferred. Second, all the taxonomy elements, preferred or not, must be given a correct and consistent form.

The result of this process is the establishment of the equivalence relationship between all the taxonomy terms.

4. Establishment of the organisation scheme and structure of the terms or categories.

The organisation scheme comprises the criteria used to divide and group the categories. In principle the criteria are unlimited, and their suitability depends on the object the taxonomy is to represent. The most widely used criteria include: subjects, topics and/or disciplines; people; audiences; processes, tasks and/or functions; document types; etc.

The structural model defines the type of relationship established between the category groups derived from the organisation scheme. The general tendency has been to apply the hierarchical model (based on the "type of" relationship) and the tree model (based on the "part of" relationship); indeed, the international and national standards for thesaurus design that have been applied to corporate taxonomies favour these two structural models. A third model, the faceted model, is a good alternative for the hypertext environment, where it is essential to break down the various perspectives from which a single concept or item can be viewed. This model is in fact being used more and more frequently for certain types of web site. Nevertheless, the documentation we have on the revision of ANSI/NISO Z39.19 does not appear to include this alternative.

Traditionally, two techniques for developing the structure of a taxonomy have been distinguished: top-down and bottom-up.
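The terminological control step (process 3 above) lends itself to a short illustration. The sketch below is only one possible arrangement: the class name, the normalisation rule (trimming, single spacing, lower case) and the sample terms are all assumptions made for this example.

```python
def normalise(term):
    """Give a term a consistent form: trimmed, single-spaced, lower case."""
    return " ".join(term.split()).lower()


class TermRegister:
    """Maps every variant of a concept to its preferred term."""

    def __init__(self):
        self._preferred = {}  # normalised variant -> preferred term

    def add_concept(self, preferred, variants=()):
        """Record a preferred term together with its non-preferred variants."""
        for form in (preferred, *variants):
            self._preferred[normalise(form)] = preferred

    def resolve(self, term):
        """Return the preferred term, or None if the term is not controlled."""
        return self._preferred.get(normalise(term))


# Hypothetical concept with two non-preferred variants.
register = TermRegister()
register.add_concept("Exhibitions", variants=["Expositions", "Shows"])
```

Here `register.resolve("  expositions ")` leads to the preferred term "Exhibitions", which is the equivalence relationship that this process is meant to establish.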

2.2. Automation of the processes of construction of taxonomies

A critical factor in the construction of a taxonomy is the degree of automation applied to the processes described above. The degree of automation can be seen as a continuum: manual (or intellectual) systems lie at one end and automatic systems at the other, with semi-automatic systems in between.

It should be stressed that fully manual systems are now rarely used to create taxonomies.

At the minimum level of automation there are two types of solution: taxonomy templates, specialised in a particular industrial sector, which must be adapted to the specific conditions of a given organisation [3], and taxonomy editing tools. The latter offer taxonomy administrators a repository for term management, a friendly environment for establishing relationships between terms, and various ways of presenting and viewing results. Many of these applications already existed as thesaurus managers and have introduced few innovations for their new role in the context of taxonomies. Examples are Multites 2005 ( http://www.multites.com ) and Term Tree ( http://www.termtree.com.au ).

At the maximum level of automation, we find programs that analyse the corpus of digital resources of a web site and extract categories (in fact, clusters of resources) by applying statistical analysis and/or linguistic processing. Generally, taxonomy construction and resource categorisation are one and the same process; in some cases the result can even be used directly as a browsing system. An extreme form of this kind of automation gives rise to so-called dynamic taxonomies: groupings of the resources returned by a search engine query, usually produced by statistical frequency analysis rather than by linguistic processing. In automatic systems, the possibilities of establishing equivalence and hierarchical relationships between categories are very limited; the result is usually a flat taxonomy, closer to a clustering of resources than to a classification proper. An example of these solutions is the Automatic Taxonomy Generation module of IDOL Server ( http://www.autonomy.com/content/Products/IDOL ).

Fully automatic solutions have not so far produced satisfactory results in taxonomy construction. Consequently, semi-automatic alternatives are being developed which, like Ultraseek Topic Advisor ( http://www.verity.com/products/ultraseek/index.html ), assist in the creation and maintenance of the taxonomy while providing an interface for the review and approval of categories. These systems include a statistical algorithm that analyses a corpus of resources and suggests terms, and relationships between terms, for the system administrator to accept or reject, all within a friendly working environment.
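The suggest-and-review cycle of a semi-automatic system can be sketched in a few lines. This is not how any of the products named above work internally; it is a simplified illustration that assumes raw word frequency as the statistical signal, and the stop-word list, function names and sample corpus are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in", "to", "for", "on"}  # illustrative only


def suggest_terms(corpus, known_terms, top_n=5):
    """Rank words not yet in the taxonomy by raw frequency in the corpus."""
    counts = Counter()
    for text in corpus:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS and word not in known_terms:
                counts[word] += 1
    return counts.most_common(top_n)


def review(suggestions, accept):
    """Keep only the suggestions that the administrator approves."""
    return [term for term, _count in suggestions if accept(term)]


# Hypothetical corpus; the lambda stands in for the administrator's decision.
corpus = [
    "Exhibitions of the museum archive",
    "Archive management and museum exhibitions",
    "Museum archive opening hours",
]
suggestions = suggest_terms(corpus, known_terms={"exhibitions"})
approved = review(suggestions, accept=lambda term: term in {"museum", "archive"})
```

The design point is the separation of roles: the algorithm only proposes candidates, and nothing enters the taxonomy until a person has accepted it.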

3. Resource categorisation

Categorisation can be defined as the process of representing the content, context and/or structure of information resources, either by assigning terms from a documentary language (categorisation by assignment) or by extracting terms from the resources themselves (categorisation by extraction).

The most efficient categorisation model currently available is the one based on metadata. Following Méndez and Senso (2004), metadata can be defined as:

"all descriptive information about the context, quality, condition or characteristics of a resource, data item or object, with the aim of facilitating its retrieval, authentication, evaluation, preservation and/or interoperability".

There are various metadata models. The elements that differentiate these models are basically two:

For example, Dublin Core, one of the most widely used models for describing all types of information resources, includes fifteen elements in its simplest form (simple level) [4]. The syntax of each element usually includes three components:

In a web page encoded in HTML, the syntax of the Subject (keywords) element would look as follows:

<META NAME="DC.Subject" SCHEME="TAGS" CONTENT="Cultural heritage; Cultural events; Exhibitions; Administration documentation management; Internet; Files; Information Management ">

In a metadata-based categorisation model, the taxonomy is a type of controlled vocabulary that is very useful as a source of the values (terms) assigned to the elements describing information resources. As indicated above, the application of taxonomies should not be limited to the elements that express the content of resources or, more precisely, their topic, subject or discipline. The elements relating to the context and structure of a resource can also be expressed by means of categories drawn from the taxonomy.
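As a rough illustration of categorisation by assignment, the following sketch builds a DC.Subject element in the same style as the example above, accepting only values that belong to the taxonomy. The taxonomy contents, the function name and the decision to reject unknown terms outright are assumptions made for this example.

```python
from html import escape

# Hypothetical taxonomy categories; a real system would load these
# from the taxonomy itself.
TAXONOMY = {"Cultural heritage", "Cultural events", "Exhibitions", "Internet"}


def dc_subject_meta(terms, scheme="TAGS"):
    """Build a DC.Subject META element, rejecting values outside the taxonomy."""
    unknown = [t for t in terms if t not in TAXONOMY]
    if unknown:
        raise ValueError(f"not in taxonomy: {unknown}")
    content = "; ".join(terms)
    # escape() protects the attribute values against quotes and angle brackets.
    return (
        f'<META NAME="DC.Subject" SCHEME="{escape(scheme)}" '
        f'CONTENT="{escape(content)}">'
    )


tag = dc_subject_meta(["Cultural heritage", "Exhibitions"])
```

The same pattern could be applied to elements describing context or structure, provided the taxonomy supplies categories for them.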

The use of taxonomies in the categorisation of information resources offers the general strengths of controlled languages, namely: treatment of the semantic and syntactic aspects of language; representation of implicit concepts; creation of a global view of the domains being represented; exhaustiveness in indexing; and a solution to the problems posed by multilingual contexts. From the point of view of web site management, the use of taxonomies in resource categorisation offers two further important benefits:

The categorisation model applied by an organisation should answer four essential questions: which information resources will be categorised? For what purpose? Who will categorise them? How will it be done?

The last two questions are closely related to the degree of automation applied in assigning values to the metadata. From this point of view, categorisation systems can be seen as a continuum, with manual (or intellectual) systems at one end and automatic systems at the other.

In the first case, an expert analyses the content, context and/or structure of a resource and assigns it the appropriate categories, either from a controlled language (categorisation by assignment) or from the text of the resource itself (categorisation by extraction). The strengths of intellectual categorisation are a high level of accuracy in the description of resources and the ability to capture contextual meaning in the description. It also makes it possible to categorise non-textual documents (images, applications, etc.). Its weaknesses are limited scalability, a high cost in human resources, and a lack of consistency and exhaustiveness.

Automatic categorisation is based on algorithms that statistically analyse the word sequences of documents, identify patterns of word behaviour from variables such as collocation, order, proximity and frequency, and group together the documents that behave similarly. The results are clusters of resources with similar behaviour patterns, labelled with the word sequence, extracted from the resources themselves, that best represents the similarity.

A clustering system should be able to carry out the following tasks: statistically analyse the word sequences of resources; calculate a value that numerically represents the content of a document; and compare the values of two (sub)documents to determine their degree of similarity.
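The three tasks listed above can be illustrated with one common choice of technique, which is an assumption of this sketch rather than something prescribed by the text: each document is represented as a vector of word frequencies, and two documents are compared by the cosine of the angle between their vectors. The function names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document numerically as a bag of word frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents.
doc1 = term_vector("museum exhibitions and cultural events")
doc2 = term_vector("cultural events at the museum")
doc3 = term_vector("server maintenance schedule")
```

Documents that share vocabulary, such as `doc1` and `doc2`, score higher than documents with no words in common, and a clustering step would group them accordingly. A production system would normally add stop-word removal and term weighting, which are omitted here for brevity.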

Currently, the algorithms designed for frequency analysis use one of the following methods, or a combination of several: probabilistic methods (Bayesian method, Rocchio method, ...); vector methods (K-Nearest Neighbor, Support Vector Machines, ...); and decision trees and decision lists.
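To give a concrete idea of the Bayesian approach, the following is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written from scratch for illustration. It is not the algorithm of any product mentioned in this article, and the class name, category labels and training sentences are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a text into lower-case words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, labelled_docs):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of documents
        for text, category in labelled_docs:
            self.doc_counts[category] += 1
            self.word_counts[category].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior of the category
            for word in tokenize(text):
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            scores[category] = score
        return max(scores, key=scores.get)


# Hypothetical training set: documents already categorised by an expert.
training = [
    ("exhibition opening at the museum", "Cultural events"),
    ("concert and exhibition programme", "Cultural events"),
    ("records management policy", "Documentation management"),
    ("archive records retention schedule", "Documentation management"),
]
classifier = NaiveBayes().fit(training)
```

With this toy training set, a text such as "new exhibition at the museum" is assigned to "Cultural events". The sketch also shows why such methods depend on a reasonable body of already categorised documents.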

Examples of automatic categorisation are the Automatic Categorization module of IDOL Server ( http://www.autonomy.com/content/Products/IDOL ), based on the Bayesian probabilistic method, and Lotus Discovery Server ( http://www.lotus.com ), based on the vector method [5].

The strengths of automatic categorisation are its efficiency and processing speed, high scalability and high consistency; its main weakness is the low level of accuracy it usually provides, which is why these systems are very often used as a basis for decisions taken by human categorisation experts.

Semi-automatic or hybrid categorisation systems combine human intelligence, which can identify the various levels of meaning present in documents, with the efficiency of automation. Four families of semi-automatic categorisation systems can be identified.

The strengths of semi-automatic categorisation systems are a good balance between efficiency and accuracy, the fact that the process is guided by human reasoning, and the capacity for cumulative self-learning. Among the weaknesses, we should highlight the knowledge, skills and effort required for management and maintenance.

In a survey carried out by Delphi Research [8], the managers of 300 large companies around the world (60% of them North American) answered the question on the type of taxonomy implementation as follows: 36%, hybrid; 26%, automatic; 23%, manual; the remainder chose other options or gave no answer.

4. Application of taxonomy in the development of information search systems

As indicated above, separating the processes of taxonomy creation, resource categorisation by means of taxonomy categories, and taxonomy application offers multiple benefits. The aim of constructing a taxonomy is to represent a reality (an area of knowledge, the scope of an organisation's activity, etc.) in the way best suited to the purposes and interests of the entity that will exploit that representation. It should also express the image and corporate interests of the entity itself.

The applications of a taxonomy in the web site context can be diverse; in the field of information architecture, a single taxonomy can become a basic or auxiliary tool for the various browsing, organisation, content search, labelling and personalisation systems. Reusing a single taxonomy across several information architecture tools offers various kinds of benefit:

There are various taxonomy presentation options.

The choice of an option depends on several factors: the functionality for which it is used, the users at whom it is aimed, etc. Generally, combining several presentations for the same functionality gives good results.

One of the web site functionalities in which taxonomy plays an important role is information search. The systems that allow content to be searched in the web environment can be classified into three main groups: browsing, searching and filtering.

Browsing systems offer users an organised structure of categories in which the information resources are placed, together with a mechanism for navigating through those categories to find resources relevant to the information need. These systems are especially suitable when users are unable to specify their information need precisely (exploratory search). The browsing system can be:

Search systems offer users the possibility of building a search query from a word or combination of words. These systems are especially suitable for situations in which users can specify their information need in sufficient detail (known-item search). The taxonomy is incorporated into the search system to help the user identify relevant terms for building the query, and also to improve the presentation of results and the reformulation of searches. Both browsing and search systems involve real-time interaction between the user and the search mechanism.
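One way a taxonomy can help the user build a query is by expanding a chosen term with its equivalent and narrower terms. The sketch below assumes a hypothetical taxonomy fragment stored in two dictionaries and a simple Boolean OR syntax; neither is taken from a particular search product.

```python
# Hypothetical taxonomy fragment: equivalent terms and narrower categories.
EQUIVALENTS = {"exhibitions": ["expositions", "shows"]}
NARROWER = {"cultural events": ["exhibitions", "concerts"]}


def expand_query(term):
    """Return the term plus its equivalents and narrower terms, recursively."""
    expanded, pending = [], [term.lower()]
    while pending:
        current = pending.pop(0)
        if current in expanded:
            continue  # guards against cycles in the data
        expanded.append(current)
        pending.extend(EQUIVALENTS.get(current, []))
        pending.extend(NARROWER.get(current, []))
    return expanded


def to_search_equation(terms):
    """Join the expanded terms into a Boolean OR expression."""
    return " OR ".join(f'"{t}"' for t in terms)
```

A user who selects "Cultural events" would thus also retrieve resources indexed under "exhibitions", "concerts" and their variants. Whether expansion is applied automatically or offered as a suggestion is a design decision for each site.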

The third type, filtering systems, lets the user define and register an information need (user profile) and receive an automatic response at set intervals or whenever the system identifies resources relevant to that need. In this case, the taxonomy allows the user to select relevant terms for specifying the profile.
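A filtering system of this kind can be reduced, for illustration, to matching the taxonomy terms of a newly categorised resource against stored profiles. The class, function, user names and sample data below are hypothetical, and the matching rule (at least one shared term) is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """A user's standing information need, expressed as taxonomy terms."""
    user: str
    terms: set = field(default_factory=set)


def notify(profiles, resource_title, resource_terms):
    """Return (user, title) pairs for profiles sharing a term with the resource."""
    return [
        (p.user, resource_title)
        for p in profiles
        if p.terms & set(resource_terms)
    ]


# Hypothetical profiles and a newly categorised resource.
profiles = [
    Profile("ana", {"Exhibitions", "Cultural heritage"}),
    Profile("joan", {"Internet"}),
]
alerts = notify(
    profiles, "Spring exhibition programme", ["Exhibitions", "Cultural events"]
)
```

Because both the profiles and the resources are described with the same controlled terms, the match does not depend on the wording of the resource itself.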

5. Bibliography

Bennett, Paul. (2002). Introduction to text categorization. Consulted: 1-03-2005, http://www.softlab.ece.ntua.gr/facilities/public/AD/Text%20Categorization/Introduction%20to%20Text%20Categorization.ppt

Diccionario de la lengua española (2001). Consulted: 22-03-2005, http://buscon.rae.es/diccionario/drae.htm

Fast, Karl; Leise, Fred; Steckel, Mike (2003). "Controlled vocabularies: a glosso-thesaurus". In: Boxes & arrows, October 27, 2003. http://www.boxesandarrows.com/archives/controlled_vocabularies_a_glossothesaurus.php

Gilchrist, Alan; Kibby, Peter; Mahon, Barry. (2000). Taxonomies for business: access and connectivity in a wired world . London: TFPL. ISBN: 1-870-889-83-5

Grove, Andrew. (2003). "Taxonomy". In: Encyclopedia of library and information science. 2nd ed., rev. and enl. New York [etc.]: Marcel Dekker, p. 2770-2777

IDOL Server. (2005). Consulted: 13-03-2005, http://www.autonomy.com/content/Products/IDOL

Information intelligence: content classification and the enterprise taxonomy practice (2004). Consulted: 25-01-2005, http://www.delphigroup.com/research/whitepapers/20040601-taxonomy-WP.pdf

K2 Enterprise. (2005). Consulted: 13-03-2005, http://www.verity.com/products/k2_enterprise/index.html

Lotus Discovery Server. (2004). Consulted: 1-09-2004, http://www.lotus.com

Mathes, Adam. (2004). Folksonomies: cooperative classification and communication through shared metadata. Consulted: 26-01-2005, http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html

Méndez, Eva; Senso, José A. (2004). Introducción a los metadatos. Consulted: 14-01-2004, http://www.sedic.es/autoformacion/metadatos/introduccion.htm

Metainformación: Dublin Core. (2003). Consulted: 13-03-2005, http://www.rediris.es/metadata

Mohomine Classifier. (2005). Consulted: 13-03-2005, http://www.kofax.com/products/mohomine/classifier.asp

Multites 2005. (2005). Consulted: 13-03-2005, http://www.multites.com

National Information Standards Organization. (2005). ANSI/NISO Z39.19-2003: guidelines for the construction, format, and management of monolingual thesauri. Consulted: 9-03-2005, http://www.niso.org/standards/standard_gather.cfm?pdflink=http://www.niso.org/standards/resources/Z39-19.pdf&std_id=518

Ruiz, Miguel E.; Srinivasan, Padmini. (1999). "Combining machine learning and hierarchical indexing structures for text categorization". In: ASIS/SIGCR Workshop on Classification Research (10th: Washington: 1999). Advances in classification research: proceedings of the ASIS SIG/CR Classification Research Workshop, v. 10 (1999), p. 107-124

Smart Discovery. (2005). Consulted: 13-03-2005, http://www.inxight.com/products/smartdiscovery

"TAG Conference Call, May 19, 2003" (2003). In: National Information Standards Organization. (2004). Developing the next generation of standards for controlled vocabularies and thesauri. Consulted: 23-04-2004. http://www.niso.org/committees/MTinfo.html

"TAG Conference Call, June 30, 2003" (2003). In: National Information Standards Organization. (2004). Developing the next generation of standards for controlled vocabularies and thesauri. Consulted: 23-04-2004. http://www.niso.org/committees/MTinfo.html

"TAG Notes November 1, 2004" (2004). In: National Information Standards Organization. (2004). Developing the next generation of standards for controlled vocabularies and thesauri. Consulted: 23-04-2004. http://www.niso.org/committees/MTinfo.html

Taxonomy strategies. Consulted: 25-01-2005, http://www.taxonomystrategies.com/index.htm

Taxonomy warehouse. Consulted: 22-02-2005, http://www.taxonomywarehouse.com

Term Tree. (2005). Consulted: 13-03-2005, http://www.termtree.com.au

Ultraseek Advanced Classifier. (2005). Consulted: 22-02-2005, http://www.verity.com/products/ultraseek/index.html

Ultraseek Content Classification Engine (CCE). (2005). Consulted: 13-03-2005, http://www.verity.com/products/ultraseek/cce.html

Ultraseek Topic Advisor. (2005). Consulted: 22-02-2005, http://www.verity.com/products/ultraseek/index.html

Webopedia. Consulted: 28-01-2005, http://www.pcwebopedia.com/TERM/t/taxonomy.html

6. Notes

[1] According to "TAG Notes November 1, 2004" (2004), the final draft should be ready by January 2005.

[2] A copy of the references can be obtained by sending an e-mail message to this article's author ( miguel.centelles@ub.edu ). The reason for the request should be stated.

[3] An example of this option is Semio Taxonomy from Entrieva. More information at: http://www.entrieva.com/entrieva/products/scts.asp?Hdr=scts [Consulted: 13-03-2005]

[4] Information taken from the web site Metainformación: Dublin Core (2003), maintained by RedIRIS.

[5] According to the report Information intelligence: content classification and the enterprise taxonomy practice (2004, p. 38), Autonomy has a market share of 14% and Lotus Discovery Server of 7%.

[6] According to the report Information intelligence: content classification and the enterprise taxonomy practice (2004, p. 38), K2 has a market share of 15%.

[7] According to the report Information intelligence: content classification and the enterprise taxonomy practice (2004, p. 38), Smart Discovery has a market share of 4%.

[8] Information intelligence: content classification and the enterprise taxonomy practice (2004, p. 26).


Creative Commons License