• BS ISO 24614-1:2010

    Status:  Current (the latest, up-to-date edition)

    Language resource management. Word segmentation of written texts. Basic concepts and general principles

    Available format(s):  Hardcopy, PDF

    Language(s):  English

    Published date:  30-11-2010

    Publisher:  British Standards Institution

    Table of Contents

    Foreword
    Introduction
    1 Scope
    2 Terms and definitions
    3 Basic framework for word segmentation
    4 General principles of word segmentation
    Annex A (informative) - Representing word segmentation in XML
    Bibliography

    Abstract

    Provides the basic concepts and general principles of word segmentation, and provides language-independent guidelines to enable written texts to be segmented, in a reliable and reproducible manner, into word segmentation units (WSU).

    Scope

    This part of ISO 24614 presents the basic concepts and general principles of word segmentation, and provides language-independent guidelines to enable written texts to be segmented, in a reliable and reproducible manner, into word segmentation units (WSU).

    NOTE 1 In language-related research and industry, the word is a fundamental and necessary concept. It is thus critical to have a universal definition of what comprises a word for the purposes of segmenting a text into words. One cannot simply use rules based only on spaces and punctuation to delimit words. Such rules do not account for situations such as hyphenated compounds, abbreviations, idioms or word-like expressions that contain symbols or numbers. Word segmentation is even more problematic for languages that do not use spaces to separate words, such as Chinese and Japanese, and for agglutinative languages such as Korean, where some functional word classes are realized as affixes.
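
    To illustrate the point made in the note above, the following is a minimal sketch of a segmenter that relies only on spaces and punctuation. The function name `naive_segment` and the sample sentences are assumptions made for this illustration; nothing here is taken from ISO 24614-1 itself.

```python
def naive_segment(text):
    """Split on whitespace, then strip common ASCII punctuation from each chunk.

    Illustrative only: this is the kind of space-and-punctuation rule the
    note warns against, not a method defined by the standard.
    """
    tokens = []
    for chunk in text.split():
        token = chunk.strip(".,;:!?\"'()")
        if token:
            tokens.append(token)
    return tokens


english = naive_segment("The state-of-the-art U.S. model cost $3.50.")
# The abbreviation "U.S." loses its final period, and the rule cannot say
# whether "state-of-the-art" is one unit or four.
print(english)

chinese = naive_segment("我喜欢自然语言处理。")
# No spaces in the input, so the whole sentence comes back as a single token.
print(len(chinese))
```

    The sketch shows three of the situations listed in the note: an abbreviation is damaged, a hyphenated compound is left undecided, and a text written without spaces is not segmented at all.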

    The many applications and fields that need to segment texts into words, and to which this part of ISO 24614 can therefore be applied, include the following.

    Translation

    Word count is the principal method for calculating the cost of a translation. Word segmentation is a standard function in translation memory systems and computer-assisted translation (CAT) tools. It is also performed by term extraction tools, which are sometimes provided in terminology management systems and CAT tools.

    Content management

    Most content management systems and databases allow for searching by individual words. The content being searched has to be segmented to permit matching with a search word. Furthermore, search functions require knowledge of the boundaries of words.

    Speech technologies

    Text-to-speech systems generate speech based on words and therefore require word segmentation for lexicon lookup, stress assignment, prosodic pattern assignment, etc.

    Computational linguistics

    Various natural language processing (NLP) systems must segment text into words in order to carry out their functions. NLP systems include

    • morphosyntactic processors,

    • syntactic parsers,

    • spellcheckers,

    • text classification systems, and

    • corpus linguistics annotators.

    Lexicography

    Lexical resources are often evaluated by size, usually by referring to the number of words.

    NOTE 2 The size of language resources is an essential benchmark for their management. Quantifying the size of language resources is typically achieved by counting the words. However, because NLP applications use different segmentation methods, each calculates the number of words differently and arrives at a different sum for the same text. A reliable, reproducible, standard measure would allow comparable results. This is not to say that applications may not use their own, application-specific segmentation methods. For example, a speech synthesis application might segment a text into smaller or larger units compared to another application.
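
    The effect described in the note above can be sketched with two hypothetical counting rules applied to the same sentence. Both functions and the sample text are assumptions chosen for illustration; neither corresponds to a method specified in the standard.

```python
import re


def count_whitespace(text):
    """Count chunks separated by whitespace."""
    return len(text.split())


def count_alnum_runs(text):
    """Count maximal runs of ASCII letters and digits."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


sample = "A well-known e-mail client doesn't cost $3.50."

# The same text yields two different "word counts".
print(count_whitespace(sample))   # 7
print(count_alnum_runs(sample))   # 11
```

    The second rule splits hyphenated compounds, the contraction and the price into separate units, so the two counts for one sentence differ by more than half. This is the kind of discrepancy a shared, reproducible segmentation unit is meant to avoid.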

    General Product Information

    Committee:  TS/1
    Development Note:  Supersedes 09/30196484 DC. (11/2010)
    Document Type:  Standard
    Publisher:  British Standards Institution
    Status:  Current

    Standards Referencing This Book

    ISO 1087-1:2000 Terminology work - Vocabulary - Part 1: Theory and application
    ISO 639-1:2002 Codes for the representation of names of languages - Part 1: Alpha-2 code
    ISO 12620:2009 Terminology and other language and content resources - Specification of data categories and management of a Data Category Registry for language resources
    ISO 30042:2008 Systems to manage terminology, knowledge and content - TermBase eXchange (TBX)
    ISO 24613:2008 Language resource management - Lexical markup framework (LMF)
    ISO 860:2007 Terminology work - Harmonization of concepts and terms
    ISO 639-3:2007 Codes for the representation of names of languages - Part 3: Alpha-3 code for comprehensive coverage of languages
    ISO 639-2:1998 Codes for the representation of names of languages - Part 2: Alpha-3 code
    ISO 16642:2003 Computer applications in terminology - Terminological markup framework
    ISO 704:2009 Terminology work - Principles and methods
    ISO 1087-2:2000 Terminology work - Vocabulary - Part 2: Computer applications
    ISO 639-5:2008 Codes for the representation of names of languages - Part 5: Alpha-3 code for language families and groups