Language resource management -- Linguistic annotation framework (LAF)

ISO 24612:2012 specifies a linguistic annotation framework (LAF) for representing linguistic annotations of language data such as corpora, speech signal and video. The framework includes an abstract data model and an XML serialization of that model for representing annotations of primary data. The serialization serves as a pivot format to allow annotations expressed in one representation format to be mapped onto another.

Gestion des ressources langagières -- Cadre d'annotation linguistique (LAF)

Upravljanje z jezikovnimi viri - Ogrodje za jezikoslovno označevanje (LAF)

This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic annotations of language data such as corpora, speech signals and video recordings. The framework includes an abstract data model and an XML serialization of that model for representing annotations of primary data. The serialization is a pivot format that allows annotations in one representation to be mapped onto another. NOTE Standardization of the linguistic data categories that provide annotation content is specified by ISO 12620 and other related International Standards.

General Information

Status: Published
Publication Date: 06-Jun-2013
Current Stage: 6060 - National Implementation/Publication (Adopted Project)
Start Date: 30-May-2013
Due Date: 04-Aug-2013
Completion Date: 07-Jun-2013

Buy Standard

Standard
ISO 24612:2013
English language
24 pages
Standard
ISO 24612:2012 - Language resource management -- Linguistic annotation framework (LAF)
English language
19 pages
Standard
ISO 24612:2012
Russian language
26 pages

Standards Content (Sample)


SLOVENIAN STANDARD
01-July-2013
Upravljanje z jezikovnimi viri - Ogrodje za jezikoslovno označevanje (LAF)
Language resource management -- Linguistic annotation framework (LAF)
Gestion des ressources langagières -- Cadre d'annotation linguistique (LAF)
This Slovenian standard is identical to: ISO 24612:2012
ICS:
01.020 Terminology (principles and coordination)
01.140.20 Information sciences
35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted.

INTERNATIONAL STANDARD ISO 24612
First edition
2012-06-15
Language resource management -- Linguistic annotation framework (LAF)
Gestion des ressources langagières -- Cadre d'annotation linguistique (LAF)
© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56  CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents
Foreword
Introduction
1  Scope
2  Terms and definitions
3  LAF specification
3.1  Overview
3.2  LAF data model
3.3  LAF architecture
3.4  XML pivot format
3.5  XML elements for the resource header
3.6  Elements in the primary data document header
Bibliography

Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee has
been established has the right to be represented on that committee. International organizations, governmental
and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.

Introduction
Effective creation, encoding, processing and management of language resources is facilitated by a single
high-level data model that supports analysis and design of both annotation schemes and representation
formats. This International Standard is designed to support the development and use of computer applications
relying on linguistically annotated resources and the exchange of these resources among different
applications.
INTERNATIONAL STANDARD ISO 24612:2012(E)

1 Scope
This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic
annotations of language data such as corpora, speech signal and video. The framework includes an abstract
data model and an XML serialization of that model for representing annotations of primary data. The
serialization serves as a pivot format to allow annotations expressed in one representation format to be
mapped onto another.
NOTE Standardization of linguistic data categories that provide annotation content is provided by ISO 12620 and
other related International Standards.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
electronic representation of language data
EXAMPLE Text, image, speech signal.
Note to entry: Typically, primary data objects are addressed by “locations” in an electronic file, for example, the span of
characters comprising a sentence or word, or a point at which a given temporal event begins or ends (as in speech
annotation). More complex data objects may consist of a list or set of contiguous or non-contiguous locations in primary
data.
2.2
annotate, verb
process of adding linguistic information to primary data (2.1)
2.3
annotation, noun
linguistic information added to primary data (2.1), independent of its representation
2.4
representation
format in which the annotation (2.3) is rendered, independent of its content
EXAMPLE XML, list or bracketed format, tab-delimited text.
2.5
segmentation annotation
annotation (2.3) that delimits linguistic elements that appear in the primary data (2.1)
Note to entry: These elements include (1) continuous segments (appearing contiguously in the primary data), (2) super-
and sub-segments, where groups of segments comprise the parts of a larger segment (e.g. contiguous word segments
typically comprise a sentence segment), (3) discontinuous segments (linking continuous segments), and (4) landmarks
(e.g. timestamps) that note a point in the primary data. In current practice, segmental information may or may not appear in
the document containing the primary data itself.
2.6
linguistic annotation
annotation (2.3) that provides linguistic information about the segments in the primary data (2.1)
EXAMPLE Morphosyntactic annotation in which a part of speech and lemma are associated with each segment in
the data.
Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation.
In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that
segment are often combined (e.g. syntactic bracketing, or delimiting each word in the document with an XML element that
identifies the segment as a word or sentence).
2.7
stand-off annotation
annotation (2.3) layered over primary data (2.1) and serialized in a document separate from that containing
the primary data
Note to entry: Stand-off annotations refer to specific locations in the primary data, by addressing character offsets,
elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can
refer to the same primary document (e.g. two different part of speech annotations for a given text).
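The idea of several stand-off annotations sharing one primary document can be illustrated with a minimal Python sketch. The sample text, the tag values and the `resolve` helper are assumptions made for illustration; they are not part of the standard, and plain character offsets are only one of the addressing methods the note mentions.

```python
# Primary data: never modified once annotation begins.
PRIMARY = "Time flies fast"

# Two independent stand-off part-of-speech annotation sets (hypothetical tag
# values). Each entry points at a character span (start, end) in PRIMARY.
POS_A = [((0, 4), "NOUN"), ((5, 10), "VERB"), ((11, 15), "ADV")]
POS_B = [((0, 4), "ADJ"), ((5, 10), "NOUN"), ((11, 15), "VERB")]


def resolve(primary, annotations):
    """Pair each annotated span with the text it addresses."""
    return [(primary[start:end], label) for (start, end), label in annotations]
```

Both annotation sets can be resolved against the same text, and neither changes it, which is the property that lets competing analyses coexist.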
2.8
annotation document
XML document containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) being annotated (2.2)
Note to entry: The medium determines how an anchor is described. For example, text anchors may be character offsets,
audio anchors may be time offsets, video anchors may be time offsets or frame indices, image anchors may be
coordinates.
2.10
region
area in the primary data (2.1) defined by a non-empty, ordered list of anchors (2.9)
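Definitions 2.9 and 2.10 can be read as a small data structure. The sketch below is one possible reading, not the standard's serialization: the `Region` class, the `Anchor` alias and the `text_of` helper are hypothetical names, and treating a two-anchor text region as a half-open span is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Anchor = Union[int, float, Tuple[int, int]]  # offset, time, or (x, y) coordinate


@dataclass(frozen=True)
class Region:
    """An area of primary data given by a non-empty, ordered list of anchors."""
    anchors: Tuple[Anchor, ...]

    def __post_init__(self):
        if not self.anchors:
            raise ValueError("a region needs at least one anchor")


def text_of(region, primary):
    """Read a two-anchor text region as a character span [start, end)."""
    start, end = region.anchors
    return primary[start:end]
```

Under these assumptions, `Region((5, 10))` could describe a word in a text, `Region((2.5,))` a single point in time in an audio file, and a tuple of four coordinate pairs a rectangle in an image. The frozen dataclass mirrors the requirement that anchors are fixed and immutable.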
2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) is derived
2.12
graph
set of nodes (vertices) V(G) and a set of edges E(G)
2.13
node
vertex
terminal point in a graph G, or the intersection of edges in G
Note to entry: The terms node and vertex are used interchangeably in this document.
2.14
edge
ordered pair of nodes [u,v] from V(G)
Note to entry: The order of the nodes determines the direction of the edge.
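Definitions 2.12 to 2.14 describe an ordinary directed graph. A minimal sketch in Python, with a hypothetical `Graph` class and method names chosen only for illustration, might look like this:

```python
class Graph:
    """A directed graph: a node set V(G) and a set of ordered pairs E(G)."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def add_edge(self, u, v):
        """Add the edge [u, v], directed from u to v."""
        self.nodes.update((u, v))
        self.edges.add((u, v))

    def successors(self, u):
        """Nodes reached by edges leaving u, in sorted order."""
        return sorted(v for (a, v) in self.edges if a == u)
```

Because each edge is an ordered pair, adding `("np", "det")` does not imply `("det", "np")`; the direction follows from the order of the two nodes.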
3 LAF specification
3.1 Overview
LAF consists of the following.
- A data model for linguistic annotations and the data to which they apply.
- An architecture for representing language data and its annotations.
- An XML serialization of the data model, which describes the referential structure of annotations
associated with language data, consisting of a directed graph or graphs. Nodes in the graph may be
linked to regions of primary data. Nodes and edges may be associated with feature structures describing
linguistic properties of regions of primary data linked to reachable nodes.
3.2 LAF data model
The LAF data model consists of
a) a structure for describing media, consisting of anchors that reference locations in primary data and
regions defined in terms of these anchors,
b) a graph structure, consisting of nodes, edges and links to regions, and
c) an annotation structure for representing annotation content with feature structures.
The data model for annotations thus comprises a directed graph referencing n-dimensional regions of primary
data as well as other annotations, in which nodes are associated with feature structures providing the
annotation content. LAF conformance requires that an annotation scheme shall be (or be rendered via the
mapping) isomorphic to the LAF data model.
NOTE LAF does not include specifications for annotation content categories (i.e. the contents of the associated
linguistic phenomena).
Figure 1 — LAF data model
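The conformance requirement (a scheme is, or can be mapped to something, isomorphic to the data model) can be illustrated with a round trip. The sketch below assumes a hypothetical tab-delimited scheme of `token<TAB>tag` lines and uses plain dictionaries for regions, nodes and feature structures; it is not the XML pivot format defined in 3.4, and a flat tagging like this one needs no edges.

```python
def from_tab_delimited(primary, table):
    """Map 'token<TAB>tag' lines onto regions, nodes and feature structures."""
    regions, nodes, cursor = {}, {}, 0
    for i, line in enumerate(table.splitlines()):
        token, tag = line.split("\t")
        start = primary.index(token, cursor)   # locate the token in the text
        cursor = start + len(token)
        regions[f"r{i}"] = (start, cursor)      # region = ordered anchors
        nodes[f"n{i}"] = {
            "link": f"r{i}",                    # node linked to a region
            "fs": {"pos": tag},                 # feature structure (content)
        }
    return regions, nodes


def to_tab_delimited(primary, regions, nodes):
    """Render the same model back into the tab-delimited representation."""
    rows = []
    for node in nodes.values():
        start, end = regions[node["link"]]
        rows.append(f"{primary[start:end]}\t{node['fs']['pos']}")
    return "\n".join(rows)
```

If mapping into the model and back reproduces the input, no information was lost in the intermediate form, which is what makes such a model usable as a pivot between representations.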
3.3 LAF architecture
3.3.1 Overview
Language resources conforming to the LAF architecture consist of the following, described in more detail in
3.3.2 to 3.3.5.
- One or more primary data documents (see 3.3.2).
- Any number of annotation documents containing nodes, edges and feature structures associated with
some or all of the nodes and/or edges in a directed graph. All nodes reference either a base
segmentation document (in which case the node has no outgoing edges) or other nodes in the same or
other annotation documents via edges (see 3.3.3).
- One or more documents defining regions that reference each primary data document, which serve as the
base segmentation for annotations (see 3.3.4).
- A set of headers, including a resource header describing a collection of primary data documents and
annotations, as well as headers for each primary data document and each annotation document in the
collection (see 3.3.5).
It is recommended that whenever possible, each primary data document also be associated with an original
artefact containing the source from which the primary data was adapted or extracted for annotation (e.g. the
original text in the file format of a particular word processor or file viewer).
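The rule that every node references either the base segmentation or other nodes can be checked mechanically. The following sketch reflects one reading of that rule, in which a node linked to a region has no outgoing edges and any other node has at least one; the `check_nodes` function and its input shapes are assumptions, not something the standard prescribes.

```python
def check_nodes(nodes, edges):
    """Return ids of nodes that violate the either/or referencing rule.

    nodes: dict mapping node id -> region id in the base segmentation, or None
    edges: iterable of (from_id, to_id) pairs
    """
    has_out = {u for u, _ in edges}
    bad = []
    for node_id, region in nodes.items():
        links_region = region is not None
        if links_region == (node_id in has_out):   # both or neither
            bad.append(node_id)
    return bad
```

An empty result means every node is anchored, directly or through its edges, in the way the architecture describes.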
3.3.2 Primary data
Primary data consists of electronic data in any format, including character (text), image, audio and video.
Primary data in a LAF-compliant resource is frozen as “read-only” to preserve the integrity of references to
locations within the document or documents. Corrections and modifications to the primary data are treated as
annotations and stored in a separate annotation document. Primary data documents containing textual data
are encoded in UTF-8 (default) or UTF-16.
In the general case, primary data does not contain markup of any kind. If markup does exist in primary data
(e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is
made between markup and other characters in the data when referring to locations in the document.
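Two consequences of this subclause can be shown in a short sketch: offsets are counted across any markup in the data, and a correction is stored as an annotation rather than applied to the text. The sample string, the dictionary layout and the `corrected_view` helper are hypothetical, and the helper assumes corrections do not overlap.

```python
PRIMARY = "<p>Teh cat</p>"          # markup counts as ordinary characters

# Offsets are counted over the whole data stream, tags included.
start = PRIMARY.index("Teh")
region = (start, start + len("Teh"))

# A correction is recorded as an annotation on the region, not as an edit.
correction = {"region": region, "fs": {"type": "correction", "value": "The"}}


def corrected_view(primary, corrections):
    """Build a corrected reading without modifying the primary data."""
    out, cursor = [], 0
    for c in sorted(corrections, key=lambda c: c["region"]):
        s, e = c["region"]
        out.append(primary[cursor:s])
        out.append(c["fs"]["value"])
        cursor = e
    out.append(primary[cursor:])
    return "".join(out)
```

Since the primary string is never changed, every other annotation that refers to it by offset stays valid after the correction is added.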
3.3.3 Annotation documents
Annotation documents contain linguistic information describing primary data. Annotations are always
associated with a node in a graph that references regions defined over primary data, either directly or
via a path through reachable nodes. In the latter case, the annotations are said to be layered over the primary
data. LAF recommends representing each of the linguistic layers defined in language resource management
in a separate annotation document for the purposes of exchange.
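How a layered annotation reaches primary data can be sketched as a graph traversal. The function below is an illustrative assumption, with hypothetical names and input shapes: it follows outgoing edges from a node and collects the regions linked to any node it reaches.

```python
def regions_for(node_id, links, edges):
    """Collect regions a node references directly or via reachable nodes.

    links: dict node id -> region id (only for nodes linked to a region)
    edges: dict node id -> list of target node ids
    """
    found, stack, seen = [], [node_id], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in links:
            found.append(links[current])
        # push targets in reverse so they are visited in listed order
        stack.extend(reversed(edges.get(current, [])))
    return found
```

A token-level node resolves to its own region, while a sentence-level node layered over phrases and tokens resolves to all the regions beneath it.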
The granularity of the annotation — i.e. the smallest info
...



INTERNATIONAL ISO
STANDARD 24612
First edition
2012-06-15
Language resource management —
Linguistic annotation framework (LAF)
Gestion des ressources langagières — Cadre d'annotation linguistique
(LAF)
Reference number
©
ISO 2012
©  ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56  CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO 2012 – All rights reserved

Contents Page
Foreword . iv
Introduction . v
1  Scope . 1
2  Terms and definitions . 1
3  LAF specification . 3
3.1  Overview . 3
3.2  LAF data model . 3
3.3  LAF architecture . 4
3.4  XML pivot format . 6
3.5  XML elements for the resource header . 11
3.6  Elements in the primary data document header . 16
Bibliography . 19

Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through
ISO technical committees. Each member body interested in a subject for which a technical committee has
been established has the right to be represented on that committee. International organizations, governmental
and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
iv © ISO 2012 – All rights reserved

Introduction
Effective creation, encoding, processing and management of language resources is facilitated by a single
high-level data model that supports analysis and design of both annotation schemes and representation
formats. This International Standard is designed to support the development and use of computer applications
relying on linguistically annotated resources and the exchange of these resources among different
applications.
INTERNATIONAL STANDARD ISO 24612:2012(E)

Language resource management — Linguistic annotation
framework (LAF)
1 Scope
This International Standard specifies a linguistic annotation framework (LAF) for representing linguistic
annotations of language data such as corpora, speech signal and video. The framework includes an abstract
data model and an XML serialization of that model for representing annotations of primary data. The
serialization serves as a pivot format to allow annotations expressed in one representation format to be
mapped onto another.
NOTE Standardization of linguistic data categories that provide annotation content is provided by ISO 12620 and
other related International Standards.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
electronic representation of language data
EXAMPLE Text, image, speech signal.
Note to entry: Typically, primary data objects are addressed by “locations” in an electronic file, for example, the span of
characters comprising a sentence or word, or a point at which a given temporal event begins or ends (as in speech
annotation). More complex data objects may consist of a list or set of contiguous or non-contiguous locations in primary
data.
2.2
annotate, verb
process of adding linguistic information to primary data (2.1)
2.3
annotation, noun
linguistic information added to primary data (2.1), independent of its representation
2.4
representation
format in which the annotation (2.3) is rendered, independent of its content
EXAMPLE XML, list or bracketed format, tab-delimited text.
2.5
segmentation annotation
annotation (2.3) that delimits linguistic elements that appear in the primary data (2.1)
Note to entry: These elements include (1) continuous segments (appearing contiguously in the primary data), (2) super-
and sub-segments, where groups of segments will comprise the parts of a larger segment (e.g. contiguous word segment
typically comprise a sentence segment), (3) discontinuous segments (linking continuous segments), and (4) landmarks
(e.g. timestamp) that note a point in the primary data. In current practice, segmental information may or may not appear in
the document containing the primary data itself.
2.6
linguistic annotation
annotation (2.3) that provides linguistic information about the segments in the primary data (2.1)
EXAMPLE Morphosyntactic annotation in which a part of speech and lemma are associated with each segment in
the data.
Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation.
In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that
segment are often combined (e.g. syntactic bracketing, or delimiting each word in the document with an XML element that
identifies the segment as a word or sentence).
2.7
stand-off annotation
annotation (2.3) layered over primary data (2.1) and serialized in a document separate from that containing
the primary data
Note to entry: Stand-off annotations refer to specific locations in the primary data, by addressing character offsets,
elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can
refer to the same primary document (e.g. two different part of speech annotations for a given text).
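The stand-off principle can be illustrated with a minimal Python sketch: the primary text is never modified, and two independent part-of-speech annotation sets address it by character offsets. The sample text, the tag labels and all names (`PRIMARY_TEXT`, `resolve`, etc.) are assumptions made for this illustration and are not defined by LAF.

```python
# Primary data: never modified; annotations only point into it.
PRIMARY_TEXT = "Time flies"

# Two separate stand-off annotation sets over the same primary data.
# Each entry is (start_offset, end_offset, part_of_speech); the tags are
# illustrative and do not come from the standard.
POS_ANNOTATIONS_A = [(0, 4, "NOUN"), (5, 10, "VERB")]
POS_ANNOTATIONS_B = [(0, 4, "VERB"), (5, 10, "NOUN")]


def resolve(text, annotations):
    """Return (covered_text, label) pairs for offset-based annotations."""
    return [(text[start:end], label) for start, end, label in annotations]


if __name__ == "__main__":
    print(resolve(PRIMARY_TEXT, POS_ANNOTATIONS_A))
    print(resolve(PRIMARY_TEXT, POS_ANNOTATIONS_B))
```

Because both annotation sets only store offsets, either can be replaced or extended without touching the primary text.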
2.8
annotation document
XML document containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) being annotated (2.2)
Note to entry: The medium determines how an anchor is described. For example, text anchors may be character offsets,
audio anchors may be time offsets, video anchors may be time offsets or frame indices, image anchors may be
coordinates.
2.10
region
area in the primary data (2.1) defined by a non-empty, ordered list of anchors (2.9)
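A hedged sketch of how anchors and regions might be modelled for text follows. It assumes that anchors are integer character offsets and that a text region is the span between its first and last anchor; the class name `Region` and the helper `text_of` are hypothetical and are not part of the standard.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Region:
    """A region defined by a non-empty, ordered list of anchors.

    Anchors are modelled as plain integers (character offsets); other
    media would use time offsets, frame indices or coordinates.
    """
    anchors: Tuple[int, ...]

    def __post_init__(self):
        if not self.anchors:
            raise ValueError("a region needs at least one anchor")


def text_of(region: Region, primary: str) -> str:
    """Return the text between the first and last anchor of a region."""
    return primary[region.anchors[0]:region.anchors[-1]]
```

Under these assumptions, a region with a single anchor covers no characters and behaves like a landmark, i.e. a point in the primary data.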
2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) is derived
2.12
graph
set of nodes (vertices) V(G) and a set of edges E(G)
2.13
node
vertex
terminal point in a graph G, or the intersection of edges in G
Note to entry: The terms node and vertex are used interchangeably in this document.
2.14
edge
ordered pair of nodes [u,v] from V(G)
Note to entry: The order of the nodes determines the direction of the edge.
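The graph definitions in 2.12 to 2.14 can be written down directly as sets. The sketch below is illustrative only; the node identifiers and the helper functions `successors` and `is_well_formed` are assumptions, not names taken from the standard.

```python
# V(G): the set of nodes; E(G): the set of edges as ordered pairs (u, v).
NODES = {"n1", "n2", "n3"}
EDGES = {("n1", "n2"), ("n1", "n3")}


def successors(node, edges):
    """Return the nodes reached by edges that start at `node`."""
    return sorted(v for u, v in edges if u == node)


def is_well_formed(nodes, edges):
    """Check that every edge connects two nodes of the graph."""
    return all(u in nodes and v in nodes for u, v in edges)
```

Since each edge is an ordered pair, `("n1", "n2")` being in `EDGES` says nothing about `("n2", "n1")`; the order encodes the direction.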
3 LAF specification
3.1 Overview
LAF consists of the following.
- A data model for linguistic annotations and the data to which they apply.
- An architecture for representing language data and its annotations.
- An XML serialization of the data model, which describes the referential structure of annotations
associated with language data, consisting of a directed graph or graphs. Nodes in the graph may be
linked to regions of primary data. Nodes and edges may be associated with feature structures describing
linguistic properties of regions of primary data linked to reachable nodes.
3.2 LAF data model
The LAF data model consists of
a) a structure for describing media, consisting of anchors that reference locations in primary data and
regions defined in terms of these anchors,
b) a graph structure, consisting of nodes, edges and links to regions, and
c) an annotation structure for representing annotation content with feature structures.
The data model for annotations thus comprises a directed graph referencing n-dimensional regions of primary
data as well as other annotations, in which nodes are associated with feature structures providing the
annotation content. LAF conformance requires that an annotation scheme shall be (or be rendered via the
mapping) isomorphic to the LAF data model.
NOTE LAF does not include specifications for annotation content categories (i.e. the contents of the associated
linguistic phenomena).
Figure 1 — LAF data model
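The three parts of the data model, a) media structure, b) graph structure and c) annotation structure, can be sketched as plain Python data classes. This is only one possible in-memory reading of the model: the class names, field names, sample text and feature names below are assumptions for illustration and do not reproduce the XML serialization defined by LAF.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    node_id: str
    # Links to regions of primary data, each region a tuple of anchors.
    regions: List[Tuple[int, ...]] = field(default_factory=list)


@dataclass
class Edge:
    source: str  # id of the node the edge leaves
    target: str  # id of the node the edge enters


@dataclass
class Annotation:
    target_id: str  # id of the annotated node (or edge)
    label: str
    features: Dict[str, str] = field(default_factory=dict)  # feature structure


# (a) media structure: anchors 0 and 3 bound the region covering "The".
primary = "The cat"
token1 = Node("n1", regions=[(0, 3)])
token2 = Node("n2", regions=[(4, 7)])
# (b) graph structure: a phrase node linked to the token nodes by edges.
phrase = Node("n3")
edges = [Edge("n3", "n1"), Edge("n3", "n2")]
# (c) annotation structure: feature structures attached to nodes.
annotations = [
    Annotation("n1", "tok", {"pos": "DET"}),
    Annotation("n2", "tok", {"pos": "NOUN"}),
    Annotation("n3", "phrase", {"cat": "NP"}),
]
```

In this sketch the phrase node has no region of its own; it relates to the primary data only through its edges to the token nodes.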
3.3 LAF architecture
3.3.1 Overview
Language resources conforming to the LAF architecture consist of the following, described in more detail in
3.3.2 to 3.3.5.
- One or more primary data documents (see 3.3.2).
- Any number of annotation documents containing nodes, edges and feature structures associated with
some or all of the nodes and/or edges in a directed graph. All nodes reference either a base
segmentation document (in which case the node has no outgoing edges) or other nodes in the same or
other annotation documents via edges (see 3.3.3).
- One or more documents defining regions that reference each primary data document, which serve as the
base segmentation for annotations (see 3.3.4).
- A set of headers, including a resource header describing a collection of primary data documents and
annotations, as well as headers for each primary data document and each annotation document in the
collection (see 3.3.5).
It is recommended that whenever possible, each primary data document also be associated with an original
artefact containing the source from which the primary data was adapted or extracted for annotation (e.g. the
original text in the file format of a particular word processor or file viewer).
3.3.2 Primary data
Primary data consists of electronic data in any format, including character (text), image, audio and video.
Primary data in a LAF-compliant resource are frozen as “read-only” to preserve the integrity of references to
locations within the document or documents. Corrections and modifications to the primary data are treated as
annotations and stored in a separate annotation document. Primary data documents containing textual data
are encoded in UTF-8 (default) or UTF-16.
In the general case, primary data does not contain markup of any kind. If markup does exist in primary data
(e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is
made between markup and other characters in the data when referring to locations in the document.
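The treatment of markup as part of the data stream can be seen in a small Python sketch. It assumes anchors are character offsets into the stored string; the sample string and variable names are illustrative assumptions.

```python
# Primary data that happens to contain markup; it is stored as-is.
PRIMARY = "<p>Hello</p>"

# Offsets count every character, including the characters of the tags.
word_start = PRIMARY.index("Hello")   # 3, because "<p>" occupies offsets 0-2
word_end = word_start + len("Hello")  # 8

if __name__ == "__main__":
    print(word_start, word_end, PRIMARY[word_start:word_end])
```

An annotation of the word therefore refers to offsets 3 and 8, not 0 and 5 as it would if the tags were stripped before counting.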
3.3.3 Annotation documents
Annotation documents contain linguistic information describing primary data. Annotations are always
associated with a node in a graph that references regions defined over primary data, either directly or
via a path through reachable nodes. In the latter case, the annotations are said to be layered over the primary
data. LAF recommends representing each of the linguistic layers defined in language resource management
in a separate annotation document for the purposes of exchange.
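How a layered annotation reaches the primary data through a path of nodes can be sketched as a simple graph traversal. The node identifiers, the dictionary layout and the function `reachable_regions` are assumptions made for this example; LAF itself does not prescribe an algorithm.

```python
# Links from base-level nodes to regions (pairs of character anchors).
NODE_REGIONS = {"tok1": [(0, 3)], "tok2": [(4, 7)]}
# Directed edges: each node maps to the nodes its outgoing edges reach.
OUT_EDGES = {"np": ["tok1", "tok2"], "s": ["np"]}


def reachable_regions(node, node_regions, out_edges):
    """Collect the regions linked to `node` or to any node reachable from it."""
    found, stack, seen = [], [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        found.extend(node_regions.get(current, []))
        stack.extend(out_edges.get(current, []))
    return sorted(found)
```

Here an annotation on the hypothetical node `"s"` is layered over the primary data: it has no region of its own, but the traversal resolves it to the regions of `"tok1"` and `"tok2"`.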
The granularity of the annotation — i.e. the smallest information unit to which the annotation applies — is
dependent on the application. For example, a single annotation over text may cover a phoneme, word,
sentence, paragraph, document, or an entire corpus; for audio it may cover any temporal interval, including a
temporal “instant” (timeslot, timestamp, etc.).
3.3.4 References to primary data
Direct reference to locations in primary data is accomplished using anchors. In most cases, these anchors are
located between the base units of the primary data representation.
Anchors are medium-dependent. Regions of a resource may be defined by specifying the anchors that bound
the region. Regions in artefacts such as an image map or video may be defined in terms of anchors specifying
one or more coordinates, frame indexes, etc. Regions in audio data may be referenced in terms of anchors
that refer to one or
...


INTERNATIONAL STANDARD ISO 24612
First edition
2012-06-15
Language resource management. Linguistic annotation framework (LAF)
Responsibility for preparing the Russian version lies with GOST R
(Russian Federation) in accordance with Article 18.1 of the ISO Statutes

COPYRIGHT PROTECTED DOCUMENT

© ISO 2012
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in
any form or by any means, electronic or mechanical, including photocopying and microfilm, without
prior written permission from ISO at the address below or from the ISO member body in the country of
the requester.
ISO copyright office:
Case postale 56 CH-1211 Geneva 20
Tel.: + 41 22 749 01 11
Fax: + 41 22 749 09 47
E-mail: copyright@iso.org
Website: www.iso.org
Published in Switzerland

Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 LAF specification
3.1 Overview
3.2 LAF data model
3.3 LAF architecture
3.4 XML pivot format
3.5 Resource header XML elements
3.6 Primary data document header elements
Bibliography


Foreword
ISO (the International Organization for Standardization) is a worldwide federation of
national standards bodies (ISO member bodies). The work of preparing
International Standards is normally carried out through ISO technical committees. Each
member body interested in a subject for which a technical committee has been established
has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO
collaborates closely with the International Electrotechnical Commission (IEC) on all
matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the
ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards.
Draft International Standards adopted by the technical committees are circulated to
the member bodies for voting. Publication as an International Standard
requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may
be the subject of patent rights. ISO shall not be held responsible for identifying
any or all such patent rights.
ISO 24612 was prepared by Technical Committee ISO/TC 37, Terminology and other language
and content resources, Subcommittee SC 4, Language resource management.

Introduction
Efficient procedures for creating, encoding, processing and managing language resources
are greatly simplified by a single high-level data model that
supports the analysis and design of both different annotation schemes and
diverse formats for representing annotations. This International Standard
is intended to provide technical support for the development and use of computer applications
based on linguistically annotated language resources and on procedures for exchanging
such resources between different application systems.

INTERNATIONAL STANDARD ISO 24612:2012(R)

Language resource management. Linguistic annotation framework (LAF)
1 Scope
This International Standard specifies a linguistic annotation
framework (LAF) intended for representing linguistic annotations of various
language data such as text corpora, speech signals and video. The
framework consists of an abstract data model and an XML serialization of that model for
representing annotations of primary data. The serialization serves as a pivot format
that allows annotations expressed in different formats to be mapped onto
one another.
NOTE Standardization of the linguistic data categories that make up the content of
annotations is addressed in ISO 12620 and other related International Standards.
2 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
2.1
primary data
language information represented in electronic form
EXAMPLES Text, image, speech signal.
Note to entry: Typically, objects in primary data are accessed by the address of
their "location" in the electronic file: for example, by the address of the area containing
the characters that make up a sentence or word, or by the address of the point at which
information about a particular event begins or ends (as in the annotation of a speech message). More
complex information objects may be a list or groups of contiguous
or scattered elements of primary data.
2.2
annotate
to add linguistic information to primary data (2.1)
2.3
annotation, noun
linguistic information added to primary data (2.1), independent of the form of
its representation
2.4
representation
format in which the annotation (2.3) is rendered, independent of its content
EXAMPLE XML format, list or bracketed format, tab-delimited text.
2.5
segmentation annotation
annotation (2.3) that delimits linguistic elements appearing in the primary
data (2.1)
Note to entry: These elements include (1) continuous segments (appearing contiguously in the
primary data); (2) higher-level or lower-level segments that are
constituent parts of a larger segment (e.g. a segment of adjacent words that typically forms part
of a sentence segment); (3) discontinuous segments (linking continuous segments); and (4)
landmarks (e.g. timestamps) that mark specific positions in the primary data. In
current annotation practice, segmentation information may or may not be
present in the document containing the primary data itself.
2.6
linguistic annotation
annotation (2.3) that provides linguistic information about the segments of the primary
data (2.1)
EXAMPLE Morphosyntactic annotation in which a part of speech and a lemma are associated
with each segment of the data.
Note to entry: The identification of a segment as a word, sentence, noun phrase, etc. also
constitutes linguistic annotation. In current annotation practice, wherever possible,
segmentation is often combined with identification of the linguistic role or properties of the segment
(e.g. syntactic bracketing, or delimiting the words of a document with an XML
element that identifies the segment as a word or as a sentence).
2.7
stand-off annotation
annotation (2.3) layered over primary data (2.1) and serialized in a
document separate from the document that contains the primary data
Note to entry: Stand-off annotations are linked to specific locations in the primary data
by addressing the corresponding character offsets, elements, etc. Multiple stand-off
annotation documents can be linked to the same primary document
(e.g. two different part-of-speech annotations of the annotated text).
2.8
annotation document
document in XML format containing annotations (2.3)
2.9
anchor
fixed, immutable position in the primary data (2.1) to be annotated (2.2)
Note to entry: The way an anchor is described is determined by the medium. For example,
text anchors may be character offsets, audio anchors may be time offsets,
video anchors may be time offsets or frame indices, and image anchors may be coordinates.
2.10
region
area of the primary data (2.1) defined by a non-empty, ordered list of anchors (2.9)

2.11
original artefact
artefact or annotation (2.3) from which the primary data (2.1) is extracted
2.12
graph
set of nodes (vertices) V(G) and a set of edges E(G) connecting them
2.13
node
vertex
terminal point in a graph G, or the point of intersection of its edges
Note to entry: The terms node and vertex are used interchangeably in this document.
2.14
edge
ordered pair [u,v] of nodes belonging to V(G)
Note to entry: The order of the nodes determines the direction of the edge.
3 LAF specification
3.1 Overview
LAF consists of the following components:
- a data model for linguistic annotations and the data to which these
annotations apply;
- an architecture for representing language data and its annotations;
- an XML serialization of the data model, which describes the
referential structure, represented by one or more directed graphs, of the
annotations associated with language data. Nodes of the graph may be linked to
specific regions of primary data and, together with edges, may be
associated with feature structures that describe the
linguistic properties of the regions of primary data linked to reachable nodes.
3.2 LAF data model
The LAF data model comprises the following:
a) a structure for describing media, consisting of anchors that identify
regions of primary data and their locations,
b) a graph structure formed by nodes, edges and links to specific regions, and
c) an annotation structure for representing annotation content using
feature structures.
The data model for annotations thus consists of a directed graph
covering n-dimensional regions of primary data, together with other annotations, in
which the nodes of the graph are associated with feature structures providing
the annotation content. An annotation is considered LAF-conformant if its scheme is isomorphic to the
LAF data model or can be mapped onto it.
NOTE The linguistic annotation framework does not include specifications for annotation
content categories (i.e. the contents of the associated linguistic phenomena).

Figure 1 - LAF data model
3.3 LAF architecture
3.3.1 Overview
Language resources conforming to the LAF architecture consist of the components listed below,
which are described in more detail in 3.3.2 to 3.3.5:
- one or more documents containing primary data (see 3.3.2);
- any number of annotation documents containing nodes, edges
and the feature structures associated with some or all of them
in a directed graph (digraph); all nodes reference
either a base segmentation document (in which case the node has no outgoing edges)
or other nodes in the same or other documents via paths in
the graph (see 3.3.3);
- one or more documents defining regions that reference each
primary data document, which serve as the base segmentation for annotations (see 3.3.4);
- a set of headers, including a resource header describing a collection of
primary data documents and annotations, as well as headers for each
primary data document and each annotation document in the collection (see 3.3.5).
It is recommended that, whenever possible, each primary data document be associated with an
original artefact from which the primary data was extracted or adapted for
annotation (e.g. the original text file of a particular word processor or
file viewer).

3.3.2 Primary data
Primary data is information represented in electronic form in any format,
including text characters, images, audio and video.
Primary data in LAF-compliant resources are frozen as
"read-only" to preserve the integrity of references to locations in the data within
the documents used.
...
