Information technology — Coding of audio-visual objects — Part 1: Systems

ISO/IEC 14496-1:2010 specifies system level functionalities for the communication of interactive audio-visual scenes, i.e. the coded representation of information related to the management of data streams (synchronization, identification, description and association of stream content).

Technologies de l'information — Codage des objets audiovisuels — Partie 1: Systèmes

General Information

Status
Not Published
Current Stage
6000 - International Standard under publication
Start Date
24-Mar-2026
Completion Date
28-Mar-2026

Relations

Effective Date
10-Sep-2022

Overview

ISO/IEC 14496-1:2024, also known as "Information technology - Coding of audio-visual objects - Part 1: Systems," is an international standard developed by ISO and IEC. It specifies system-level functionalities for efficient communication, synchronization, and management of interactive audio-visual scenes. This part of the ISO/IEC 14496 standard, often referred to as MPEG-4 Systems, provides the framework for the coded representation and delivery of multimedia content, including audio, video, text, and associated metadata.

The standard focuses on the management of data streams, ensuring that synchronization, identification, description, and association of multimedia stream content are robust and interoperable across various platforms and applications.

Key Topics

  • System Architecture: Defines how audio-visual terminals (receivers and senders) process, synchronize, and present coded multimedia content.
  • Elementary Streams: Each type of media (audio, video, text, font, or interaction data) is carried in its own elementary stream, with decoders assigned to each.
  • Terminal Model: Abstracts the behavior of receiving terminals, focusing on buffer and time management to ensure smooth playback and synchronization.
  • Synchronization (Sync Layer): Mechanisms for timed composition and decoding of streams, utilizing clock references and time stamps to maintain media alignment.
  • Multiplexing and Delivery Layer: Interfaces with abstracted transport or storage mechanisms. Supports the multiplexing of different elementary streams for efficient transmission or storage.
  • Object Description Framework: Allows descriptors for media objects, supporting content identification and association within scenes.
  • Intellectual Property Management and Protection (IPMP): Provides interfaces for licensing and protection of digital assets.
  • Quality of Service (QoS): Defines management models ensuring performance and reliability during media delivery.
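The relationship between elementary streams, access units and composition time stamps described above can be illustrated with a small, non-normative Python sketch. All class names, field widths and time values here are invented for illustration and are not part of the standard:

```python
from dataclasses import dataclass, field

@dataclass
class AccessUnit:
    cts: int          # composition time stamp, in time-base ticks
    payload: bytes

@dataclass
class ElementaryStream:
    es_id: int
    media_type: str   # each stream carries exactly one media type
    units: list = field(default_factory=list)

def composition_order(streams):
    """Merge access units from all streams in composition-time order."""
    all_units = [(au.cts, es.es_id, au) for es in streams for au in es.units]
    return [au for _, _, au in sorted(all_units, key=lambda t: (t[0], t[1]))]

audio = ElementaryStream(1, "audio", [AccessUnit(0, b"a0"), AccessUnit(90, b"a1")])
video = ElementaryStream(2, "video", [AccessUnit(0, b"v0"), AccessUnit(45, b"v1")])
print([au.payload for au in composition_order([audio, video])])  # CTS-ordered
```

The sketch only conveys the idea that composition is driven by time stamps across independently decoded streams; the normative timing and buffer behaviour is defined by the systems decoder model in the standard itself.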

Applications

ISO/IEC 14496-1 forms the backbone for a wide range of multimedia applications, especially wherever interactive audio-visual scenes must be communicated or rendered. Its practical value includes:

  • Streaming Platforms: Ensures synchronized delivery and playback of audio and video streams across the internet or local networks.
  • Multimedia Messaging: Enables complex scenes with interactivity for rich messaging services.
  • Interactive TV and Digital Broadcasting: Supports advanced features such as multiple camera angles, interactive advertising, and language selection.
  • Video Conferencing: Coordinates the real-time synchronization of audio, video, and user interaction data.
  • Media File Formats: Underpins popular formats like MP4, ensuring standardized packaging and interchange of audio-visual data.
  • Content Protection: Integrates mechanisms for digital rights management and secure transmission of media content.
  • Gaming and VR/AR: Provides systems features for interactive scenes, synchronized audio-visual rendering, and efficient content handling in immersive applications.

Related Standards

ISO/IEC 14496-1 is one part of the broader MPEG-4 family of standards, each targeting a specific aspect of coded audio-visual objects:

  • ISO/IEC 14496-2: Visual (Video coding)
  • ISO/IEC 14496-3: Audio coding
  • ISO/IEC 14496-4: Conformance testing
  • ISO/IEC 14496-6: Delivery Multimedia Integration Framework (DMIF)
  • ISO/IEC 14496-10: Advanced Video Coding (AVC, commonly known as H.264)
  • ISO/IEC 14496-11: Scene description and application engine
  • ISO/IEC 14496-12, 14, and 15: Media file formats (ISO Base Media File Format, MP4, AVC File Format)
  • ISO/IEC 14496-17, 18, 20: Text, font, and animation extensions
  • ISO/IEC 14496-22: Open Font Format

The ISO/IEC 14496-1 standard is crucial for developers, system integrators, and solution architects in the multimedia technology sector, supporting the delivery of synchronized, manageable, and interactive multimedia content worldwide.

Buy Documents

Draft

ISO/IEC PRF 14496-1 - Information technology — Coding of audio-visual objects — Part 1: Systems

Release Date:05-Dec-2023
English language (128 pages)
Draft

ISO/IEC PRF 14496-1 - Information technology — Coding of audio-visual objects — Part 1: Systems

Release Date:23-Feb-2026
English language (118 pages)
Draft

REDLINE ISO/IEC PRF 14496-1 - Information technology — Coding of audio-visual objects — Part 1: Systems

Release Date:23-Feb-2026
English language (118 pages)


Frequently Asked Questions

ISO/IEC 14496-1 is a standard published jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Its full title is "Information technology — Coding of audio-visual objects — Part 1: Systems". It specifies system-level functionalities for the communication of interactive audio-visual scenes, i.e. the coded representation of information related to the management of data streams (synchronization, identification, description and association of stream content).

ISO/IEC 14496-1 is classified under the following ICS (International Classification for Standards) categories: 35.040.40 - Coding of audio, video, multimedia and hypermedia information. The ICS classification helps identify the subject area and facilitates finding related standards.

ISO/IEC 14496-1 has the following relationships with other standards: it is linked to ISO/IEC 14496-1:2010, ISO/IEC 14496-1:2010/Amd 1:2010 and ISO/IEC 14496-1:2010/Amd 2:2014. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.

ISO/IEC 14496-1 is available in PDF format for immediate download after purchase. The document can be added to your cart and obtained through the secure checkout process. Digital delivery ensures instant access to the complete standard document.

Standards Content (Sample)


DRAFT INTERNATIONAL STANDARD
ISO/IEC DIS 14496-1
ISO/IEC JTC 1/SC 29; Secretariat: JISC
Voting begins on: 2024-01-30
Voting terminates on: 2024-04-23

Information technology — Coding of audio-visual objects —
Part 1: Systems

Technologies de l'information — Codage des objets audiovisuels —
Partie 1: Systèmes

ICS: 35.040.40
Reference number: ISO/IEC DIS 14496-1:2024(E)

THIS DOCUMENT IS A DRAFT CIRCULATED FOR COMMENT AND APPROVAL. IT IS THEREFORE SUBJECT TO CHANGE AND MAY NOT BE REFERRED TO AS AN INTERNATIONAL STANDARD UNTIL PUBLISHED AS SUCH.

IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.

RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.

This document is circulated as received from the committee secretariat.

© ISO/IEC 2024

© ISO/IEC 2024
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.

ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org

Published in Switzerland

© ISO/IEC 2024 – All rights reserved

Contents

Foreword
0 Introduction
0.1 Overview
0.2 Architecture
0.3 Terminal Model: Systems Decoder Model
0.4 Multiplexing of Streams: The Delivery Layer
0.5 Synchronization of Streams: The Sync Layer
0.6 The Compression Layer
0.7 Application Engine
0.8 Extensible MPEG-4 Textual Format (XMT)
1 Scope
2 Normative references
3 Additional reference
4 Terms and definitions
4.1 Access Unit (AU)
4.2 Alpha Map
4.3 Audio-visual Object
4.4 Audio-visual Scene (AV Scene)
4.5 AVC Parameter Set
4.6 AVC Access Unit
4.7 AVC Parameter Set Access Unit
4.8 AVC Parameter Set Elementary Stream
4.9 AVC Video Elementary Stream
4.10 Binary Format for Scene (BIFS)
4.11 Buffer Model
4.12 Byte Aligned
4.13 Clock Reference
4.14 Composition
4.15 Composition Memory (CM)
4.16 Composition Time Stamp (CTS)
4.17 Composition Unit (CU)
4.18 Compression Layer
4.19 Control Point
4.20 Decoder
4.21 Decoding buffer (DB)
4.22 Decoder configuration
4.23 Decoding Time Stamp (DTS)
4.24 Delivery Layer
4.25 Descriptor
4.26 DMIF Application Interface (DAI)
4.27 Elementary Stream (ES)
4.28 Elementary Stream Descriptor
4.29 Elementary Stream Interface (ESI)
4.30 M4Mux Channel (FMC)
4.31 M4Mux Packet
4.32 M4Mux Stream
4.33 M4Mux tool
4.34 Graphics Profile
4.35 Inter
4.36 Interaction Stream
4.37 Intra
4.38 Initial Object Descriptor
4.39 Intellectual Property Identification (IPI)
4.40 Intellectual Property Management and Protection (IPMP) System
4.41 IPMP Information
4.42 IPMP System
4.43 IPMP Tool
4.44 IPMP Tool Identifier
4.45 IPMP Tool List
4.46 Media Node
4.47 Media stream
4.48 Media time line
4.49 MP4 File
4.50 Object Clock Reference (OCR)
4.51 Object Content Information (OCI)
4.52 Object Descriptor (OD)
4.53 Object Descriptor Command
4.54 Object Descriptor Profile
4.55 Object Descriptor Stream
4.56 Object Time Base (OTB)
4.57 Parametric Audio Decoder
4.58 Parametric Description
4.59 Quality of Service (QoS)
4.60 Random Access
4.61 Reference Point
4.62 Rendering
4.63 Rendering Area
4.64 Scene Description
4.65 Scene Description Stream
4.66 Scene Graph Elements
4.67 Scene Graph Profile
4.68 Seekable
4.69 SL-Packetized Stream (SPS)
4.70 Stream object
4.71 Structured Audio
4.72 Sync Layer (SL)
4.73 Sync Layer Configuration
4.74 Sync Layer Packet (SL-Packet)
4.75 Systems Decoder Model (SDM)
4.76 System Time Base (STB)
4.77 Terminal
4.78 Time Base
4.79 Timing Model
4.80 Time Stamp
4.81 Track
5 Abbreviations and Symbols
6 Conventions
7 Streaming Framework
7.1 Systems Decoder Model
7.2 Object Description Framework
7.3 Synchronization of Elementary Streams
7.4 Multiplexing of Elementary Streams
8 Profiles
Annex A (informative) Time Base Reconstruction
A.1 Time Base Reconstruction
A.2 Temporal aliasing and audio resampling
A.3 Reconstruction of a Synchronised Audio-visual Scene: A Walkthrough
Annex B (informative) The QoS Management Model for ISO/IEC 14496 Content
Annex C (informative) Conversion Between Time and Date Conventions
C.1 Conversion Between Time and Date Conventions
Annex D (informative) Graphical Representation of Object Descriptor and Sync Layer Syntax
D.1 Length encoding of descriptors and commands
D.2 Object Descriptor Stream and OD commands
D.3 OCI stream
D.4 Object descriptor and its components
D.5 OCI Descriptors
D.6 Sync layer configuration and syntax
Annex E (informative) Elementary Stream Interface
Annex F (informative) Upstream Walkthrough
F.1 Introduction
F.2 Configuration
F.3 Content access procedure with DAI
F.4 Example
Annex G (informative) Scene and Object Description Carrousel
Annex H (normative) Usage of ITU-T Recommendation H.264 | ISO/IEC 14496-10 AVC
H.1 SL packet encapsulation of AVC Access Unit
H.2 Handling of Parameter Sets
H.3 Usage of ISO/IEC 14496-14 AVC File Format in MPEG-4 Systems
Annex I (informative) Patent statements
I.1 General
I.2 Patent Statements for Version 1
I.3 Patent Statements for Version 2
Annex J (informative) Registration Authority for MPEG-4 Systems
J.1 Code points to be registered
J.2 Procedure for the request of an MPEG-4 registered identifier value
J.3 Responsibilities of the Registration Authority
J.4 Contact information for the Registration Authority
J.5 Responsibilities of Parties Requesting a RID
J.6 Appeal Procedure for Denied Applications
J.7 Registration Application Form
Bibliography

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees established
by the respective organization to deal with particular fields of technical activity. ISO and IEC technical
committees collaborate in fields of mutual interest. Other international organizations, governmental and non-
governmental, in liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO
and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as
an International Standard requires approval by at least 75 % of the national bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This fourth edition cancels and replaces the third edition (ISO/IEC 14496-1:2004), which has been technically revised. It also incorporates the Amendments ISO/IEC 14496-1:2004/Amd.1:2005 and ISO/IEC 14496-1:2004/Amd.2:2007 and the Technical Corrigendum ISO/IEC 14496-1:2004/Cor.2:2007.
ISO/IEC 14496 consists of the following parts, under the general title Information technology — Coding of audio-
visual objects:
— Part 1: Systems
— Part 2: Visual
— Part 3: Audio
— Part 4: Conformance testing
— Part 5: Reference software
— Part 6: Delivery Multimedia Integration Framework (DMIF)
— Part 7: Optimized reference software for coding of audio-visual objects
— Part 8: Carriage of ISO/IEC 14496 contents over IP networks
— Part 9: Reference hardware description
— Part 10: Advanced Video Coding
— Part 11: Scene description and application engine
— Part 12: ISO base media file format
— Part 13: Intellectual Property Management and Protection (IPMP) extensions
— Part 14: MP4 file format
— Part 15: Advanced Video Coding (AVC) file format
— Part 16: Animation Framework eXtension (AFX)
— Part 17: Streaming text format
— Part 18: Font compression and streaming
— Part 19: Synthesized texture stream
— Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)
— Part 21: MPEG-J Graphics Framework eXtensions (GFX)
— Part 22: Open Font Format
— Part 23: Symbolic Music Representation
— Part 24: Audio and systems interaction
— Part 25: 3D Graphics Compression Model
— Part 26: Audio conformance
— Part 27: 3D Graphics conformance
— Part 28: Composite font representation
— Part 29: Web video coding
— Part 30: Timed text and other visual overlays in ISO base media file format
— Part 31: Video coding for browsers
— Part 32: File format reference software and conformance
— Part 33: Internet video coding
— Part 34: Syntactic description language
0 Introduction
0.1 Overview
ISO/IEC 14496 specifies a system for the communication of interactive audio-visual scenes. This specification includes the
following elements:
1. the coded representation of natural or synthetic, two-dimensional (2D) or three-dimensional (3D) objects that can be manifested audibly and/or visually (audio-visual objects) (specified in parts 2, 3, 10, 11, 16, 19, 20, 23 and 25 of ISO/IEC 14496);
2. the coded representation of the spatio-temporal positioning of audio-visual objects as well as their behavior in response to interaction (scene description, specified in parts 11 and 20 of ISO/IEC 14496);
3. the coded representation of information related to the management of data streams (synchronization, identification, description and association of stream content, specified in this part and in part 24 of ISO/IEC 14496);
4. a generic interface to the data stream delivery layer functionality (specified in part 6 of ISO/IEC 14496);
5. an application engine for programmatic control of the player: format, delivery of downloadable Java byte code as well as its execution lifecycle and behavior through APIs (specified in parts 11 and 21 of ISO/IEC 14496);
6. a file format to contain the media information of an ISO/IEC 14496 presentation in a flexible, extensible format to facilitate interchange, management, editing, and presentation of the media, specified in part 12 (ISO base media file format), part 14 (MP4 file format) and part 15 (AVC file format) of ISO/IEC 14496; and
7. the coded representation of font data and of information related to the management of text streams and font data streams (specified in parts 17, 18 and 22 of ISO/IEC 14496).
The overall operation of a system communicating audio-visual scenes can be paraphrased as follows:
At the sending terminal, the audio-visual scene information is compressed, supplemented with synchronization information
and passed to a delivery layer that multiplexes it into one or more coded binary streams that are transmitted or stored. At
the receiving terminal, these streams are demultiplexed and decompressed. The audio-visual objects are composed
according to the scene description and synchronization information and presented to the end user. The end user may have
the option to interact with this presentation. Interaction information can be processed locally or transmitted back to the
sending terminal. ISO/IEC 14496 defines the syntax and semantics of the bitstreams that convey such scene information,
as well as the details of their decoding processes.
This part of ISO/IEC 14496 specifies the following tools:
• a terminal model for time and buffer management;
• a coded representation of metadata for the identification, description and logical dependencies of the
elementary streams (object descriptors and other descriptors);
• a coded representation of descriptive audio-visual content information (object content information – OCI);
• an interface to intellectual property management and protection (IPMP) systems;
• a coded representation of synchronization information (sync layer – SL); and
• a multiplexed representation of individual elementary streams in a single stream (M4Mux).
These various elements are described functionally in this Subclause and specified in the normative Clauses that follow.
0.2 Architecture
The information representation specified in ISO/IEC 14496 describes the means to create an interactive audio-visual scene
in terms of coded audio-visual information and associated scene description information. The entity that composes and
sends, or receives and presents such a coded representation of an interactive audio-visual scene is generically referred to
as an "audio-visual terminal" or just "terminal". This terminal may correspond to a standalone application or be part of an
application system.
Figure 1 — The ISO/IEC 14496 Terminal Architecture
The basic operations performed by such a receiver terminal are as follows. Information that allows access to content
complying with ISO/IEC 14496 is provided as initial session set up information to the terminal. Part 6 of ISO/IEC 14496
defines the procedures for establishing such session contexts as well as the interface to the delivery layer that generically
abstracts the storage or transport medium. The initial set-up information allows the terminal, in a recursive manner, to locate one or more elementary streams that are part of the coded content representation. Some of these elementary streams may be grouped together using the multiplexing tool described in ISO/IEC 14496-1.
Elementary streams contain the coded representation of either audio or visual data, scene description information, user interaction data, or text or font data. Elementary streams may themselves also convey information to identify streams, to describe logical dependencies between streams, or to describe information related to the content of the streams. Each elementary stream contains only one type of data.
Elementary streams are decoded using their respective stream-specific decoders. The audio-visual objects are composed
according to the scene description information and presented by the terminal’s presentation device(s). All these processes
are synchronized according to the systems decoder model (SDM) using the synchronization information provided at the
synchronization layer.
These basic operations are depicted in Figure 1, and are described in more detail below.
0.3 Terminal Model: Systems Decoder Model
The systems decoder model provides an abstract view of the behavior of a terminal complying with ISO/IEC 14496-1. Its
purpose is to enable a sending terminal to predict how the receiving terminal will behave in terms of buffer management and
synchronization when reconstructing the audio-visual information that comprises the presentation. The systems decoder
model includes a systems timing model and a systems buffer model which are described briefly in the following Subclauses.
0.3.1 Timing Model
The timing model defines the mechanisms through which a receiving terminal establishes a notion of time that enables it to
process time-dependent events. This model also allows the receiving terminal to establish mechanisms to maintain
synchronization both across and within particular audio-visual objects as well as with user interaction events. In order to
facilitate these functions at the receiving terminal, the timing model requires that the transmitted data streams contain implicit
or explicit timing information. Two sets of timing information are defined in ISO/IEC 14496-1: clock references and time
stamps. The former convey the sending terminal’s time base to the receiving terminal, while the latter convey a notion of
relative time for specific events such as the desired decoding or composition time for portions of the encoded audio-visual
information.
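As a non-normative illustration of this split between clock references and time stamps, the sketch below recovers a sender time base from a clock reference and then tests an access unit's decoding time stamp against it. The class name, tick values and the drift-free latching estimator are all simplifications invented here; a real terminal would smooth the offset (e.g. with a PLL) to absorb delivery jitter:

```python
class RecoveredTimeBase:
    """Toy receiver-side reconstruction of a sender's time base."""

    def __init__(self):
        self.offset = None  # sender_time - local_time, in ticks

    def on_clock_reference(self, ocr_ticks, local_ticks):
        # Simplest possible estimator: latch the offset at each OCR.
        self.offset = ocr_ticks - local_ticks

    def sender_time(self, local_ticks):
        # Map the local clock into the sender's time base.
        return local_ticks + self.offset

tb = RecoveredTimeBase()
tb.on_clock_reference(ocr_ticks=10_000, local_ticks=400)  # OCR arrives
now = 700                                                 # local clock, later
dts = 10_350                                              # decoding time stamp
ready = tb.sender_time(now) >= dts  # decode once the time base reaches DTS
print(tb.sender_time(now), ready)
```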
0.3.2 Buffer Model
The buffer model enables the sending terminal to monitor and control the buffer resources that are needed to decode each
elementary stream in a presentation. The required buffer resources are conveyed to the receiving terminal by means of
descriptors at the beginning of the presentation. The terminal can then decide whether or not it is capable of handling this
particular presentation. The buffer model allows the sending terminal to specify when information may be removed from
these buffers and enables it to schedule data transmission so that the appropriate buffers at the receiving terminal do not
overflow or underflow.
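The spirit of this buffer model can be shown with a toy occupancy check: the sender schedules arrivals and removals so that the decoding buffer size it advertised is never exceeded. The event list, sizes and function name below are invented for illustration and are not the normative model:

```python
def max_occupancy(events, buffer_size):
    """events: (time, delta) pairs; +bytes on arrival, -bytes removed at DTS.
    Returns the peak buffer level, or raises if the buffer would overflow."""
    level, peak = 0, 0
    for _, delta in sorted(events):
        level += delta
        if level > buffer_size:
            raise OverflowError(f"buffer overflow: {level} > {buffer_size}")
        peak = max(peak, level)
    return peak

# Two access units arrive before the first is removed; peak stays below
# the advertised decoding-buffer size, so the schedule is admissible.
events = [(0, +800), (10, +600), (12, -800), (20, +700), (25, -600)]
print(max_occupancy(events, buffer_size=1500))
```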
0.4 Multiplexing of Streams: The Delivery Layer
The term delivery layer is used as a generic abstraction of any existing transport protocol stack that may be used to transmit
and/or store content complying with ISO/IEC 14496. The functionality of this layer is not within the scope of ISO/IEC 14496-1,
and only the interface to this layer is considered. This interface is the DMIF Application Interface (DAI) specified in ISO/IEC
14496-6. The DAI defines not only an interface for the delivery of streaming data, but also for signaling information required
for session and channel set up as well as tear down. A wide variety of delivery mechanisms exist below this interface, with
some of them indicated in Figure 1. These mechanisms serve for transmission as well as storage of streaming data, i.e., a
file is considered to be a particular instance of a delivery layer. For applications where the desired transport facility does not
fully address the needs of a service according to the specifications in ISO/IEC 14496, a simple multiplexing tool (M4Mux)
with low delay and low overhead is defined in ISO/IEC 14496-1.
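The flavour of such a low-overhead multiplex can be sketched as follows: tag each packet with its channel number and a length, then concatenate. This toy format (one-byte channel index, one-byte length) is invented here and is NOT the normative M4Mux syntax; it only shows why the per-packet overhead can stay small:

```python
import struct

def mux(packets):
    """packets: list of (channel, payload) -> one interleaved byte stream.
    Toy header: 1-byte channel index + 1-byte length (payloads < 256 bytes)."""
    out = bytearray()
    for channel, payload in packets:
        out += struct.pack("BB", channel, len(payload)) + payload
    return bytes(out)

def demux(stream):
    """Recover the (channel, payload) sequence from the multiplexed stream."""
    packets, i = [], 0
    while i < len(stream):
        channel, length = struct.unpack_from("BB", stream, i)
        i += 2
        packets.append((channel, stream[i:i + length]))
        i += length
    return packets

pkts = [(3, b"audio-AU"), (4, b"video-AU"), (3, b"audio-AU2")]
assert demux(mux(pkts)) == pkts  # round-trip
print(len(mux(pkts)))            # total bytes including 2-byte headers
```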
0.5 Synchronization of Streams: The Sync Layer
Elementary streams are the basic abstraction for any streaming data source. Elementary streams are conveyed as sync
layer-packetized (SL-packetized) streams at the DMIF Application Interface. This packetized representation additionally
provides timing and synchronization information, as well as fragmentation and random access information. The sync layer
(SL) extracts this timing information to enable synchronized decoding and, subsequently, composition of the elementary
stream data.
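A non-normative sketch of this packetization idea follows: an SL-style packet header carrying a random-access flag and an optional composition time stamp. The field widths here (one flag byte, 32-bit CTS) are invented for illustration; the real sync layer syntax is configurable per elementary stream and is specified normatively in the standard:

```python
import struct

def sl_packetize(payload, cts=None, random_access=False):
    """Prepend a toy SL-style header: flag byte, then optional 32-bit CTS."""
    flags = (1 if random_access else 0) | (2 if cts is not None else 0)
    header = struct.pack("B", flags)
    if cts is not None:
        header += struct.pack(">I", cts)
    return header + payload

def sl_parse(packet):
    """Recover the flags, optional CTS and payload from a toy SL packet."""
    flags = packet[0]
    offset, cts = 1, None
    if flags & 2:
        (cts,) = struct.unpack_from(">I", packet, 1)
        offset = 5
    return {"random_access": bool(flags & 1), "cts": cts,
            "payload": packet[offset:]}

pkt = sl_packetize(b"frame", cts=90_000, random_access=True)
print(sl_parse(pkt))
```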
0.6 The Compression Layer
The compression layer receives data in its encoded format and performs the necessary operations to decode this data. The
decoded information is then used by the terminal’s composition, rendering and presentation subsystems.

0.6.1 Object Description Framework
The purpose of the object description framework is to identify and describe elementary streams and to associate them
appropriately to an audio-visual scene description. Object descriptors serve to gain access to ISO/IEC 14496 content. Object
content information and the interface to intellectual property management and protection systems are also part of this
framework.
An object descriptor is a collection of one or more elementary stream descriptors that provide the configuration and other
information for the streams that relate to either an audio-visual object, or text or font data, or a scene description. Object
descriptors are themselves conveyed in elementary streams. Each object descriptor is assigned an identifier (object
descriptor ID), which is unique within a defined name scope. This identifier is used to associate audio-visual objects in the
scene description with a particular object descriptor, and thus the elementary streams related to that particular object.
Elementary stream descriptors include information about the source of the stream data, in the form of a unique numeric identifier
(the elementary stream ID) or a URL pointing to a remote source for the stream. Elementary stream descriptors also include
information about the encoding format, configuration information for the decoding process and the sync layer packetization,
as well as quality of service requirements for the transmission of the stream and intellectual property identification.
Dependencies between streams can also be signaled within the elementary stream descriptors. This functionality may be
used, for example, in scalable audio or visual object representations to indicate the logical dependency of a stream
containing enhancement information on a stream containing the base information. It can also be used to describe alternative
representations for the same content (e.g. the same speech content in various languages).
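These relationships (object descriptors grouping elementary stream descriptors, stream IDs or URLs as sources, and inter-stream dependencies) can be modelled with a small data-structure sketch. The class and field names are illustrative only; the normative descriptor syntax is defined in clause 7.2:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ESDescriptor:
    es_id: int
    url: Optional[str] = None               # alternative remote source
    depends_on_es_id: Optional[int] = None  # e.g. enhancement -> base layer

@dataclass
class ObjectDescriptor:
    od_id: int                              # unique within a name scope
    es_descriptors: list = field(default_factory=list)

# A scalable representation: the enhancement stream declares its logical
# dependency on the base stream via the elementary stream descriptor.
base = ESDescriptor(es_id=10)
enhancement = ESDescriptor(es_id=11, depends_on_es_id=10)
od = ObjectDescriptor(od_id=1, es_descriptors=[base, enhancement])

# Resolve which stream must be opened before es_id 11 can be decoded:
deps = [d.depends_on_es_id for d in od.es_descriptors if d.es_id == 11]
print(deps)
```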
0.6.1.1 Intellectual Property Management and Protection
The intellectual property management and protection (IPMP) framework for ISO/IEC 14496 content consists of a normative
interface that permits an ISO/IEC 14496 terminal to host one or more IPMP Systems in the form of monolithic IPMP Systems
or modular IPMP Tools. The IPMP interface consists of IPMP elementary streams and IPMP descriptors. IPMP descriptors
are carried as part of an object descriptor stream. IPMP elementary streams carry time variant IPMP information that can
be associated to multiple object descriptors.
The IPMP System or IPMP Tools themselves are non-normative components that provide intellectual property management and protection functions for the terminal. The IPMP Systems or Tools use the information carried by the IPMP elementary streams and descriptors to make protected ISO/IEC 14496 content available to the terminal.
The intellectual property management and protection (IPMP) framework for ISO/IEC 14496 content consists of a set of tools
that permits an ISO/IEC 14496 terminal to support IPMP functionality. This functionality is provided by two different
complementary technologies, supporting different levels of interoperability:
• The IPMP framework as defined in 7.2.3, consists of a normative interface that permits an ISO/IEC 14496
terminal to host one or more IPMP Systems. The IPMP interface consists of IPMP elementary streams and
IPMP descriptors. IPMP descriptors are carried as part of an object descriptor stream. IPMP elementary
streams carry time variant IPMP information that can be associated to multiple object descriptors. The IPMP
System itself is a non-normative component that provides intellectual property management and protection
functions for the terminal. The IPMP System uses the information carried by the IPMP elementary streams
and descriptors to make protected ISO/IEC 14496 content available to the terminal.
• The IPMP framework extension, as specified in ISO/IEC 14496-13 allows, in addition to the functionality
specified in ISO/IEC 14496-1, a finer granularity of governance. ISO/IEC 14496-13 provides normative
support for individual IPMP components, referred to as IPMP Tools, to be normatively placed at identified
points of control within the terminal systems model. Additionally ISO/IEC 14496-13 provides normative
support for secure communications to be performed between IPMP Tools. ISO/IEC 14496-1 also specifies
specific normative extensions at the Systems level to support the IPMP functionality described in ISO/IEC
14496-13.
An application may choose not to use an IPMP System, thereby offering no management and protection features.
0.6.1.2 Object Content Information
Object content information (OCI) descriptors convey descriptive information about audio-visual objects. The main content
descriptors are: content classification descriptors, keyword descriptors, rating descriptors, language descriptors, textual
descriptors, and descriptors about the creation of the content. OCI descriptors can be included directly in the related object
descriptor or elementary stream descriptor or, if it is time variant, it may be carried in an elementary stream by itself. An OCI
stream is organized in a sequence of small, synchronized entities called events that contain a set of OCI descriptors. OCI
streams can be associated to multiple object descriptors.
© ISO/IEC 2024 – All rights reserved xi

ISO/IEC DIS 14496-1:2024(E)ISO/IEC DIS 14496-1:2024(E)
0.6.2 Scene Description Streams
Scene description addresses the organization of audio-visual objects in a scene, in terms of both spatial and temporal
attributes. This information allows the composition and rendering of individual audio-visual objects after the respective
decoders have reconstructed the streaming data for them. For visual data, ISO/IEC 14496-11 does not mandate particular
composition algorithms. Hence, visual composition is implementation dependent. For audio data, the composition process
is defined in a normative manner in ISO/IEC 14496-11 and ISO/IEC 14496-3.
The scene description is represented using a parametric approach (BIFS - Binary Format for Scenes). The description
consists of an encoded hierarchy (tree) of nodes with attributes and other information (including event sources and targets).
Leaf nodes in this tree correspond to elementary audio-visual data, whereas intermediate nodes group t
...


International
Standard
Fifth edition
Information technology — Coding of
audio-visual objects —
Part 1:
Systems
Technologies de l'information — Codage des objets
audiovisuels —
Partie 1: Systèmes
PROOF/ÉPREUVE
© ISO/IEC 2026
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may
be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on
the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below
or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents Page
Foreword iv
Introduction v
1 Scope 1
2 Normative references 1
3 Terms and definitions 2
4 Abbreviated terms 9
5 Conventions 10
6 Streaming framework 10
6.1 Systems decoder model 10
6.1.1 General 10
6.1.2 Concepts of the systems decoder model 11
6.1.3 Timing model specification 13
6.1.4 Buffer model specification 15
6.2 Object description framework 16
6.2.1 General 16
6.2.2 Common data structures 18
6.2.3 Intellectual property management and protection framework (IPMP) 21
6.2.4 Object content information (OCI) 23
6.2.5 Object descriptor stream 25
6.2.6 Object descriptor components 28
6.2.7 Rules for usage of the object description framework 60
6.2.8 Usage of the IPMP system interface 69
6.3 Synchronization of elementary streams 72
6.3.1 General 72
6.3.2 Sync layer 72
6.3.3 DMIF application interface 83
6.4 Multiplexing of elementary streams 83
6.4.1 General 83
6.4.2 M4Mux tool 83
6.4.3 M4Mux descriptors 89
7 Profiles 91
Annex A (informative) Time base reconstruction 93
Annex B (informative) The QoS management model for ISO/IEC 14496 content 96
Annex C (informative) Conversion between time and date conventions 97
Annex D (informative) Graphical representation of object descriptor and sync layer syntax 99
Annex E (informative) Elementary stream interface 106
Annex F (informative) Upstream walkthrough 108
Annex G (informative) Scene and object description carrousel 113
Annex H (normative) Usage of ITU-T Recommendation H.264 | ISO/IEC 14496-10 AVC 115
Bibliography 118
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical activity.
ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations,
governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/
IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the
use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any
claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not
received notice of (a) patent(s) which may be required to implement this document. However, implementers
are cautioned that this may not represent the latest information, which may be obtained from the patent
database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held
responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This fifth edition cancels and replaces the fourth edition (ISO/IEC 14496-1:2010), which has been technically
revised. It also incorporates the Amendments ISO/IEC 14496-1:2010/Amd.1:2010 and ISO/IEC 14496-1:2010/
Amd.2:2014.
The main changes are as follows:
— added support for LASeR;
— added support for raw audio and video bitstreams;
— referencing of the new Syntactic Description Language specification.
A list of all parts in the ISO/IEC 14496 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards
body. A complete listing of these bodies can be found at www.iso.org/members.html and
www.iec.ch/national-committees.
Introduction
0.1  Overview
The ISO/IEC 14496 series specifies a system for the communication of interactive audio-visual scenes. This
document includes the following elements:
a) the coded representation of natural or synthetic, two-dimensional (2D) or three-dimensional
(3D) objects that can be manifested audibly and/or visually (audio-visual objects) (specified
in ISO/IEC 14496-2, ISO/IEC 14496-3, ISO/IEC 14496-10, ISO/IEC 14496-11 and ISO/IEC 14496-16);
b) the coded representation of the spatio-temporal positioning of audio-visual objects as well as their
behavior in response to interaction (scene description, specified in ISO/IEC 14496-11 and ISO/IEC 14496-20);
c) the coded representation of information related to the management of data streams (synchronization,
identification, description and association of stream content, specified in this document and in
ISO/IEC 14496-24);
d) a generic interface to the data stream delivery layer functionality (specified in ISO/IEC 14496-6);
e) an application engine for programmatic control of the player: format, delivery of downloadable Java
byte code as well as its execution lifecycle and behavior through APIs (specified in ISO/IEC 14496-11
and ISO/IEC 14496-21);
f) a file format to contain the media information of an ISO/IEC 14496 presentation in a flexible, extensible
format to facilitate interchange, management, editing, and presentation of the media (specified in the
file format parts of the ISO/IEC 14496 series); and
g) the coded representation of font data and of information related to the management of text streams and
font data streams (specified in ISO/IEC 14496-17, ISO/IEC 14496-18 and ISO/IEC 14496-22).
The overall operation of a system communicating audio-visual scenes can be paraphrased as follows:
At the sending terminal, the audio-visual scene information is compressed, supplemented with
synchronization information and passed to a delivery layer that multiplexes it into one or more coded
binary streams that are transmitted or stored. At the receiving terminal, these streams are demultiplexed
and decompressed. The audio-visual objects are composed according to the scene description and
synchronization information and presented to the end user. The end user may have the option to interact
with this presentation. Interaction information can be processed locally or transmitted back to the sending
terminal. ISO/IEC 14496 defines the syntax and semantics of the bitstreams that convey such scene
information, as well as the details of their decoding processes.
This document specifies the following tools:
— a terminal model for time and buffer management;
— a coded representation of metadata for the identification, description and logical dependencies of the
elementary streams (object descriptors and other descriptors);
— a coded representation of descriptive audio-visual content information (object content information –
OCI);
— an interface to intellectual property management and protection (IPMP) systems;
— a coded representation of synchronization information (sync layer – SL) and
— a multiplexed representation of individual elementary streams in a single stream (M4Mux).
These various elements are described functionally in this introduction and specified in the normative
clauses that follow.
0.2  Architecture
The information representation specified in the ISO/IEC 14496 series describes the means to create an
interactive audio-visual scene in terms of coded audio-visual information and associated scene description
information. The entity that composes and sends or receives and presents such a coded representation of an
interactive audio-visual scene is generically referred to as an "audio-visual terminal" or just "terminal". This
terminal may correspond to a standalone application or be part of an application system.
Figure 1 — The ISO/IEC 14496 series terminal architecture
The basic operations performed by such a receiver terminal are as follows. Information that allows access
to content complying with the ISO/IEC 14496 series is provided as initial session set up information to the
terminal. ISO/IEC 14496-6 defines the procedures for establishing such session contexts as well as the
interface to the delivery layer that generically abstracts the storage or transport medium. The initial set
up information allows the terminal, in a recursive manner, to locate one or more elementary streams that are part of
the coded content representation. Some of these elementary streams may be grouped together using the
multiplexing tool described in this document.
Elementary streams contain the coded representation of either audio or visual data or scene description
information or user interaction data or text or font data. Elementary streams may as well themselves
convey information to identify streams, to describe logical dependencies between streams, or to describe
information related to the content of the streams. Each elementary stream contains only one type of data.
Elementary streams are decoded using their respective stream-specific decoders. The audio-visual objects
are composed according to the scene description information and presented by the terminal’s presentation
device(s). All these processes are synchronized according to the systems decoder model (SDM) using the
synchronization information provided at the synchronization layer.
These basic operations are depicted in Figure 1 and are described in more detail below.
0.3  Terminal model: systems decoder model
The systems decoder model provides an abstract view of the behavior of a terminal complying with this
document. Its purpose is to enable a sending terminal to predict how the receiving terminal will behave in
terms of buffer management and synchronization when reconstructing the audio-visual information that
comprises the presentation. The systems decoder model includes a systems timing model and a systems
buffer model which are described briefly in the following Subclauses.
0.3.1  Timing model
The timing model defines the mechanisms through which a receiving terminal establishes a notion of time
that enables it to process time-dependent events. This model also allows the receiving terminal to establish
mechanisms to maintain synchronization both across and within particular audio-visual objects as well
as with user interaction events. In order to facilitate these functions at the receiving terminal, the timing
model requires that the transmitted data streams contain implicit or explicit timing information. Two sets
of timing information are defined in this document: clock references and time stamps. The former convey
the sending terminal’s time base to the receiving terminal, while the latter convey a notion of relative time
for specific events such as the desired decoding or composition time for portions of the encoded audio-visual
information.
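As an illustration only (not part of the normative specification), the relationship between clock references and time stamps can be sketched as follows; the class names, the 90 kHz tick rate, and all numeric values are assumptions of this example:

```python
from dataclasses import dataclass

@dataclass
class ObjectTimeBase:
    """Sketch: reconstructs the sender's time base from received clock references."""
    resolution_hz: int = 90_000   # assumed tick rate of the sender's time base
    last_ocr_ticks: int = 0       # most recently received object clock reference
    last_arrival_s: float = 0.0   # local wall-clock time of its arrival

    def update(self, ocr_ticks: int, arrival_s: float) -> None:
        self.last_ocr_ticks = ocr_ticks
        self.last_arrival_s = arrival_s

    def now_ticks(self, local_s: float) -> float:
        """Estimate the sender's current time base value at local time local_s."""
        return self.last_ocr_ticks + (local_s - self.last_arrival_s) * self.resolution_hz

otb = ObjectTimeBase()
otb.update(ocr_ticks=900_000, arrival_s=10.0)   # clock reference received at t = 10 s
# 0.5 s later, the reconstructed time base has advanced by 45 000 ticks:
assert otb.now_ticks(10.5) == 945_000.0

# A composition time stamp tells the terminal when to present a unit of data:
cts_ticks = 945_000
present_at_local_s = 10.0 + (cts_ticks - 900_000) / otb.resolution_hz
assert present_at_local_s == 10.5
```

The sketch shows the division of labour described above: clock references recover the sending terminal's time base, while time stamps are interpreted relative to that recovered time base.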
0.3.2  Buffer model
The buffer model enables the sending terminal to monitor and control the buffer resources that are needed
to decode each elementary stream in a presentation. The required buffer resources are conveyed to the
receiving terminal by means of descriptors at the beginning of the presentation. The terminal can then
decide whether or not it is capable of handling this particular presentation. The buffer model allows the
sending terminal to specify when information may be removed from these buffers and enables it to schedule
data transmission so that the appropriate buffers at the receiving terminal do not overflow or underflow.
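A minimal sketch of this bookkeeping, under the assumption of a single decoding buffer whose size would be conveyed by a decoder configuration descriptor (the class name, buffer size, and access unit sizes are illustrative, not normative):

```python
class DecodingBuffer:
    """Sketch of systems-decoder-model buffer accounting for one elementary stream."""
    def __init__(self, size_bytes: int):
        self.size = size_bytes     # advertised at the beginning of the presentation
        self.occupancy = 0

    def arrive(self, au_bytes: int) -> None:
        """Data arriving from the delivery layer must not overflow the buffer."""
        if self.occupancy + au_bytes > self.size:
            raise OverflowError("decoding buffer overflow")
        self.occupancy += au_bytes

    def decode(self, au_bytes: int) -> None:
        """At its decoding time, an access unit is removed from the buffer."""
        self.occupancy -= au_bytes

buf = DecodingBuffer(size_bytes=1000)
buf.arrive(600)
buf.arrive(300)
buf.decode(600)    # first access unit decoded, space reclaimed
buf.arrive(500)
assert buf.occupancy == 800
```

Scheduling transmission against this model is what lets the sender guarantee that the receiver's buffers neither overflow nor underflow.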
0.4  Multiplexing of streams: the delivery layer
The term delivery layer is used as a generic abstraction of any existing transport protocol stack that may
be used to transmit and/or store content complying with ISO/IEC 14496. The functionality of this layer is
not within the scope of this document, and only the interface to this layer is considered. This interface is
the DMIF Application Interface (DAI) specified in ISO/IEC 14496-6. The DAI defines not only an interface
for the delivery of streaming data, but also for signaling information required for session and channel set up
as well as tear down. A wide variety of delivery mechanisms exist below this interface, with some of them
indicated in Figure 1. These mechanisms serve for transmission as well as storage of streaming data, i.e. a
file is considered to be a particular instance of a delivery layer. For applications where the desired transport
facility does not fully address the needs of a service according to the specifications in ISO/IEC 14496, a
simple multiplexing tool (M4Mux) with low delay and low overhead is defined in this document.
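The low-overhead framing idea behind such a multiplexing tool can be sketched as follows; this is an illustration in the spirit of M4Mux, not its normative syntax, and the one-byte index and length fields are assumptions of this example:

```python
# Sketch of a minimal low-delay multiplex: each packet carries a one-byte
# channel index, a one-byte payload length, then the payload itself.
def mux(packets):
    out = bytearray()
    for channel, payload in packets:
        assert len(payload) < 256, "illustrative format: payload limited to 255 bytes"
        out += bytes([channel, len(payload)]) + payload
    return bytes(out)

def demux(data):
    packets, i = [], 0
    while i < len(data):
        channel, length = data[i], data[i + 1]
        packets.append((channel, data[i + 2:i + 2 + length]))
        i += 2 + length
    return packets

# Several elementary streams interleaved into a single stream and recovered:
stream = mux([(3, b"audio"), (7, b"video"), (3, b"more")])
assert demux(stream) == [(3, b"audio"), (7, b"video"), (3, b"more")]
```

Two bytes of overhead per packet illustrates why such a tool is attractive when the underlying transport offers no suitable multiplexing of its own.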
0.5  Synchronization of streams: the sync layer
Elementary streams are the basic abstraction for any streaming data source. Elementary streams are
conveyed as sync layer-packetized (SL-packetized) streams at the DMIF Application Interface. This
packetized representation additionally provides timing and synchronization information, as well as
fragmentation and random access information. The sync layer (SL) extracts this timing information to
enable synchronized decoding and, subsequently, composition of the elementary stream data.
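Purely as an illustration: the real SL packet header is fully configurable through its configuration descriptor, but a fixed hypothetical layout shows the kind of timing and random access information it conveys (the field layout below is an assumption of this example, not the normative format):

```python
import struct

# Hypothetical fixed header: 1 flag byte, 32-bit decoding time stamp,
# 32-bit composition time stamp. The normative header is configurable
# per stream and is NOT this fixed format.
def pack_sl_packet(dts: int, cts: int, is_random_access: bool, payload: bytes) -> bytes:
    flags = 0x01 if is_random_access else 0x00
    return struct.pack(">BII", flags, dts, cts) + payload

def parse_sl_packet(data: bytes):
    flags, dts, cts = struct.unpack(">BII", data[:9])
    return bool(flags & 0x01), dts, cts, data[9:]

pkt = pack_sl_packet(dts=1000, cts=1040, is_random_access=True, payload=b"\xab\xcd")
rap, dts, cts, payload = parse_sl_packet(pkt)
assert (rap, dts, cts, payload) == (True, 1000, 1040, b"\xab\xcd")
```

The separate decoding and composition time stamps reflect the timing model above: a unit may need to be decoded before the instant at which it is composed and presented.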
0.6  The compression layer
The compression layer receives data in its encoded format and performs the necessary operations to decode
this data. The decoded information is then used by the terminal’s composition, rendering and presentation
subsystems.
0.6.1  Object description framework
The purpose of the object description framework is to identify and describe elementary streams and to
associate them appropriately to an audio-visual scene description. Object descriptors serve to gain access to
ISO/IEC 14496 content. Object content information and the interface to intellectual property management
and protection systems are also part of this framework.
An object descriptor is a collection of one or more elementary stream descriptors that provide the
configuration and other information for the streams that relate to either an audio-visual object, or text or
font data, or a scene description. Object descriptors are themselves conveyed in elementary streams. Each
object descriptor is assigned an identifier (object descriptor ID), which is unique within a defined name
scope. This identifier is used to associate audio-visual objects in the scene description with a particular
object descriptor, and thus the elementary streams related to that particular object.
Elementary stream descriptors include information about the source of the stream data, in the form of a unique
numeric identifier (the elementary stream ID) or a URL pointing to a remote source for the stream. Elementary
stream descriptors also include information about the encoding format, configuration information for
the decoding process and the sync layer packetization, as well as quality of service requirements for the
transmission of the stream and intellectual property identification. Dependencies between streams can
also be signaled within the elementary stream descriptors. This functionality may be used, for example,
in scalable audio or visual object representations to indicate the logical dependency of a stream containing
enhancement information on a stream containing the base information. It can also be used to describe
alternative representations for the same content (e.g. the same speech content in various languages).
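The descriptor hierarchy described above can be sketched as a simple data model; the class and field names below are chosen for readability and are not the normative syntax element names:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ESDescriptor:
    """Sketch of an elementary stream descriptor."""
    es_id: int                               # unique numeric stream identifier
    stream_type: str                         # e.g. "audio", "visual", "scene"
    depends_on_es_id: Optional[int] = None   # logical dependency (e.g. base layer)
    language: Optional[str] = None           # marks alternative representations

@dataclass
class ObjectDescriptor:
    """Sketch of an object descriptor: a collection of ES descriptors."""
    od_id: int                               # unique within a defined name scope
    es_descriptors: list = field(default_factory=list)

od = ObjectDescriptor(od_id=5, es_descriptors=[
    ESDescriptor(es_id=10, stream_type="audio", language="eng"),
    ESDescriptor(es_id=11, stream_type="audio", language="fra"),   # alternative
    ESDescriptor(es_id=20, stream_type="visual"),                  # base layer
    ESDescriptor(es_id=21, stream_type="visual", depends_on_es_id=20),  # enhancement
])

# The dependency signalling allows a terminal to find enhancement streams:
enhancements = [e.es_id for e in od.es_descriptors if e.depends_on_es_id == 20]
assert enhancements == [21]
```

A scene description node would refer to this object via its object descriptor ID (5 here), and thereby to all elementary streams the object comprises.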
0.6.1.1  Intellectual property management and protection
The intellectual property management and protection (IPMP) framework for ISO/IEC 14496 content consists
of a normative interface that permits an ISO/IEC 14496 terminal to host one or more IPMP systems in the
form of monolithic IPMP systems or modular IPMP tools. The IPMP interface consists of IPMP elementary
streams and IPMP descriptors. IPMP descriptors are carried as part of an object descriptor stream.
IPMP elementary streams carry time variant IPMP information that can be associated to multiple object
descriptors.
The IPMP system, or the IPMP tools themselves, are non-normative components that provide intellectual
property management and protection functions for the terminal. The IPMP systems or tools use the
information carried by the IPMP elementary streams and descriptors to make protected ISO/IEC 14496
content available to the terminal.
The intellectual property management and protection (IPMP) framework for ISO/IEC 14496 content consists
of a set of tools that permits an ISO/IEC 14496 terminal to support IPMP functionality. This functionality is
provided by two different complementary technologies, supporting different levels of interoperability:
— The IPMP framework, as defined in 6.2.3, consists of a normative interface that permits an ISO/IEC 14496
terminal to host one or more IPMP systems. The IPMP interface consists of IPMP elementary streams and
IPMP descriptors. IPMP descriptors are carried as part of an object descriptor stream. IPMP elementary
streams carry time variant IPMP information that can be associated to multiple object descriptors.
The IPMP system itself is a non-normative component that provides intellectual property management
and protection functions for the terminal. The IPMP system uses the information carried by the IPMP
elementary streams and descriptors to make protected ISO/IEC 14496 content available to the terminal.
— The IPMP framework extension, as specified in ISO/IEC 14496-13, allows, in addition to the functionality
specified in ISO/IEC 14496-1, a finer granularity of governance. ISO/IEC 14496-13 provides normative
support for individual IPMP components, referred to as IPMP tools, to be normatively placed at identified
points of control within the terminal systems model. Additionally, ISO/IEC 14496-13 provides normative
support for secure communications to be performed between IPMP tools. ISO/IEC 14496-1 also specifies
specific normative extensions at the systems level to support the IPMP functionality described in ISO/IEC 14496-13.
An application may choose not to use an IPMP system, thereby offering no management and protection
features.
0.6.1.2  Object content information
Object content information (OCI) descriptors convey descriptive information about audio-visual objects. The
main content descriptors are: content classification descriptors, keyword descriptors, rating descriptors,
language descriptors, textual descriptors, and descriptors about the creation of the content. OCI descriptors
can be included directly in the related object descriptor or elementary stream descriptor or, if they are time
variant, they may be carried in an elementary stream of their own. An OCI stream is organized in a sequence of
small, synchronized entities called events that contain a set of OCI descriptors. OCI streams can be associated
to multiple object descriptors.
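A sketch of such an event-based OCI stream, with illustrative descriptor names and time values (none of which are normative):

```python
# Sketch: an OCI stream as a sequence of timed events, each carrying a set
# of content descriptors that take effect at the event's start time.
oci_stream = [
    {"start_time": 0,   "descriptors": {"language": "eng", "rating": "PG"}},
    {"start_time": 600, "descriptors": {"keywords": ["news", "weather"]}},
]

def descriptors_delivered_by(stream, t):
    """All descriptor sets whose events have started by time t."""
    return [ev["descriptors"] for ev in stream if ev["start_time"] <= t]

# Before the second event starts, only the first event's descriptors apply:
assert descriptors_delivered_by(oci_stream, 300) == [
    {"language": "eng", "rating": "PG"}
]
```

Because the events are synchronized entities, a terminal can align descriptive information with the media timeline in the same way it aligns media access units.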
0.6.2  Scene description streams
Scene description addresses the organization of audio-visual objects in a scene, in terms of both spatial
and temporal attributes. This information allows the composition and rendering of individual audio-
visual objects after the respective decoders have reconstructed the streaming data for them. For visual
data, ISO/IEC 14496-11 does not mandate particular composition algorithms. Hence, visual composition is
implementation dependent. For audio data, the composition process is defined in a normative manner in
ISO/IEC 14496-11 and ISO/IEC 14496-3.
The scene description is represented using a parametric approach (BIFS - Binary Format for Scenes). The
description consists of an encoded hierarchy (tree) of nodes with attributes and other information (including
event sources and targets). Leaf nodes in this tree correspond to elementary audio-visual data, whereas
intermediate nodes group this material to form audio-visual objects, and perform grouping, transformation,
and other such operations on audio-visual objects (scene description nodes). The scene description can
evolve over time by using scene description updates.
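The tree structure described above can be sketched as follows; the node names are illustrative and are not the normative BIFS node set:

```python
# Sketch of a scene tree in the spirit of BIFS: intermediate nodes group and
# transform their children, while leaf nodes reference elementary audio-visual
# data through object descriptor IDs.
class Node:
    def __init__(self, kind, children=(), od_id=None):
        self.kind = kind
        self.children = list(children)
        self.od_id = od_id   # set only on leaf nodes that reference media

scene = Node("Group", children=[
    Node("Transform", children=[Node("AudioSource", od_id=5)]),
    Node("Transform", children=[Node("MovieTexture", od_id=6)]),
])

def media_references(node):
    """Collect the object descriptor IDs referenced anywhere in the subtree."""
    refs = [node.od_id] if node.od_id is not None else []
    for child in node.children:
        refs += media_references(child)
    return refs

assert media_references(scene) == [5, 6]
```

Walking the tree this way is how a terminal discovers which object descriptors, and hence which elementary streams, the scene requires; scene description updates would then add, remove, or modify nodes in this hierarchy over time.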
In order to facilitate active user involvement with the presented audio-visual information, ISO/IEC 14496-11
provides support for user and object interactions. Interactivity mechanisms are integrated with the scene
description information, in the form of linked event sources and targets (routes) as well as sensors (special
nodes that can trigger events based on specific conditions). These event sources and targets are part of
scene description nodes, and thus allow close coupling of dynamic and interactive behavior with the specific
scene at hand. ISO/IEC 14496-11, however, does not specify a particular user interface or a mechanism that
maps user actions (e.g. keyboard key presses or mouse movements) to such events.
Such an interactive environment may not need an upstream channel, but ISO/IEC 14496 also provides means
for client-server interactive sessions with the ability to set up upstream elementary streams and associate
them to specific downstream elementary streams.
0.6.3  Audio-visual streams
The coded representation of audio information is described in ISO/IEC 14496-3 (Audio), and that of visual
information in ISO/IEC 14496-2 (Visual) and ISO/IEC 14496-10 (Advanced Video Coding). The reconstructed
audio-visual data are made available to the composition process for potential use during the scene rendering.
0.6.4  Upchannel streams
Downchannel elementary streams may require upchannel information to be transmitted from the receiving
terminal to the sending terminal (e.g. to allow for client-server interactivity). Figure 1 indicates the flow path
for an elementary stream from the receiving terminal to the sending terminal. The content of upchannel
streams is specified in the same part of the specification that defines the content of the downstream
data. For example, upchannel control streams for video downchannel elementary streams are defined in
ISO/IEC 14496-2.
0.6.5  Interaction streams
The coded representation of user interaction information is not in the scope of ISO/IEC 14496. However, this
information is translated into scene modifications, and the modifications are made available to the
composition process for potential use during the scene rendering.
0.6.6  Text and font data streams
Scene description often contains information presented in textual format. The audio-visual data encoded in
the scene may also be accompanied by supplemental text information such as subtitles. In order to enable
time-based updates of text data and to ensure the text appearance and layout, both elementary streams
carrying timed text information and font data are used. The coded representation of the timed text stream
is described in ISO/IEC 14496-17. The font data format and encoded representation of font data stream are
described in ISO/IEC 14496-18 (font data stream) and ISO/IEC 14496-22 (font data format).
0.7  Application engine
MPEG-J is a programmatic system (as opposed to a conventional parametric system) which specifies
API(s) for interoperation of MPEG-4 media players with Java code. By combining MPEG-4 media and safe
executable code, content creators may embed complex control and data processing mechanisms with their
media data to intelligently manage the operation of the audio-visual session. The parametric MPEG-4 system
forms the presentation engine while the MPEG-J subsystem controlling the presentation engine forms the
application engine.
The Java application is delivered as a separate elementary stream to the MPEG-4 terminal. There it will
be directed to the MPEG-J run time environment, from where the MPEG-J program will have access to the
various components and required data of the MPEG-4 player to control it.
In addition to the basic packages of the language (java.lang, java.io, java.util), a few categories of APIs have
been defined for different scopes. For the scene graph API the objective is to provide access to the scene
graph specified in ISO/IEC 14496-11: to inspect the graph, to alter nodes and their fields, and to add and
remove nodes within the graph. The resource API is used for regulation of performance: it provides a
centralized facility for managing resources. This is used when the program execution is contingent upon the
terminal configuration and its capabilities, both static (that do not change during execution) and dynamic.
The decoder API allows the control of the decoders that are present in the terminal. The net API provides a way to
interact with the network, compliant with the MPEG-4 DMIF application interface. Complex applications
and enhanced interactivity are possible with these basic packages. The architecture of MPEG-J is presented
in more detail in ISO/IEC 14496-11.
0.8  Extensible MPEG-4 textual (XMT) format
The extensible MPEG-4 textual (XMT) format is a textual representation of the multimedia content described
in ISO/IEC 14496 using the extensible markup language (XML). XMT is designed to facilitate the creation
and maintenance of MPEG-4 multimedia content, whether by human authors or by automated machine
programs. XMT is specified in ISO/IEC 14496-11.
The textual representation of MPEG-4 content has high-level abstractions, XMT-O, that allow authors to
exchange their content easily with other authors or authoring tools, while at the same time preserving
semantic intent. XMT also has low-level textual representations, XMT-A, covering the full scope and function
of MPEG-4. The high-level XMT-O is designed to facilitate interoperability with the Synchronized Multimedia
Integration Language (SMIL) 2.0, a recommendation from the W3C consortium, and also with the Extensible 3D
specification, X3D, developed by the Web3D consortium as the next generation of virtual reality modeling
language (VRML).
The XMT language has grammars that are specified using the W3C XML schema language. The grammars
contain rules for element placement and attribute values, etc. These rules for XMT, defined using the schema
language, follow the binary coding rules defined in ISO/IEC 14496-11 and help ensure that the textual
representation can be coded into correct binary according to ISO/IEC 14496-11 coding rules.
All constructs in the ISO/IEC 14496 specification have their parallel in the XMT textual format. For the
visual and audio parts, XMT provides a means to reference external media streams of either pre-encoded or
raw audiovisual binary content. While XMT does not contain a textual format for audiovisual media, it does
contain hints in a textual format that allow an XMT tool to encode and embed the audiovisual media into a
complete MPEG-4 presentation.
International Standard ISO/IEC 14496-1:2026(en)
Information technology — Coding of audio-visual objects —
Part 1:
Systems
1 Scope
This document specifies system level functionalities for the communication of interactive audio-visual scenes,
i.e. the coded representation of information related to the management of data streams (synchronization,
identification, description and association of stream content).
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
ISO 639:2023, Codes for the representation of names of languages — Part 2: Alpha-3 code
ISO/IEC 10646:2020, Information technology — Universal coded character set (UCS)
ISO/IEC 10918-1:1994 | ITU-T Rec. T.81, Information technology — Digital compression and coding of
continuous-tone still images — Part 1: Requirements and guidelines
ISO/IEC 11172-2:1993, Information technology — Coding of moving pictures and associated audio for digital
storage media at up to about 1,5 Mbit/s — Part 2: Video
ISO/IEC 11172-3:1993, Information technology — Coding of moving pictures and associated audio for digital
storage media at up to about 1,5 Mbit/s — Part 3: Audio
ISO/IEC 13818-2:2000 | ITU-T Rec. H.262, Information technology — Generic coding of moving pictures and
associated audio information — Part 2: Video
ISO/IEC 13818-3:1998, Information technology — Generic coding of moving pictures and associated audio
information — Part 3: Audio
ISO/IEC 13818-7:2006, Information technology — Generic coding of moving pictures and associated audio
information — Part 7: Advanced Audio Coding (AAC)
ISO/IEC 14496-2:2004, Information technology — Coding of audio-visual objects — Part 2: Visual
ISO/IEC 14496-3:2019, Information technology — Coding of audio-visual objects — Part 3: Audio
ISO/IEC 14496-10, Information technology — Coding of audio-visual objects — Part 10: Advanced Video
Coding
ISO/IEC 14496-15, Information technology — Coding of audio-visual objects — Part 15: Advanced Video
Coding (AVC) file format
ISO/IEC 14496-16, Information technology — Coding of audio-visual objects — Part 16: Animation
Framework eXtension (AFX)
ISO/IEC 14496-18, Information technology — Coding of audio-visual objects — Part 18: Font compression
and streaming
ISO/IEC 14496-34:2025, Information technology — Coding of audio-visual objects — Part 34: Syntactic
description language
DAVIC 1.4.1 specification, Part 9: Information Representation
ANSI/SMPTE 291M-1996, Television — Ancillary Data Packet and Space Formatting
SMPTE 315M -1999, Television — Camera Positioning Information Conveyed by Ancillary Data Packets
W3C Recommendation: August 2001 — Synchronized Multimedia Integration Language (SMIL 2.0),
https://www.w3.org/TR/smil20/
W3C Recommendation: May 2001 — XML Schema, https://www.w3.org/TR/xmlschema-0/
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
access unit
AU
smallest individually accessible portion of data within an elementary stream (3.25) to which unique timing
information can be attributed
3.2
audio-visual object
representation of a natural or synthetic object that has an audio and/or visual manifestation
Note 1 to entry: The representation corresponds to a node or a group of nodes in the BIFS scene description. Each
audio-visual object is associated with zero or more elementary streams using one or more object descriptors.
3.3
audio-visual scene
AV Scene
set of audio-visual objects together with scene description information that defines their spatial and
temporal attributes including behaviors resulting from object and user interactions
3.4
AVC parameter set
sequence parameter set or a picture parameter set
3.5
AVC access unit
access unit made up of NAL Units as defined in ITU-T H.264 | ISO/IEC 14496-10 with the structure defined in
5.2.3 of ISO/IEC 14496-15:2024
3.6
AVC parameter set access unit
access unit made up only of sequence parameter set NAL units or picture parameter set NAL units to which
the same time stamps apply
3.7
AVC parameter set elementary stream
elementary stream made up only of AVC parameter set access units
3.8
AVC video elementary stream
elementary stream containing access units made up of NAL units for coded picture data
3.9
binary format for scene
BIFS
coded representation of a parametric scene description format as specified in ISO/IEC 14496-11
3.10
buffer model
model that defines how a terminal complying with ISO/IEC 14496 manages the buffer resources that are
needed to decode a presentation
3.11
clock reference
special time stamp that conveys a reading of a time base
3.12
composition
process of applying scene description information in order to identify the spatio-temporal attributes and
hierarchies of audio-visual objects
3.13
composition memory
CM
random access memory that contains composition units
3.14
composition time stamp
CTS
indication of the nominal composition time of a composition unit
3.15
composition unit
CU
individually accessible portion of the output that a decoder produces from access units
3.16
compression layer
layer of a system according to the specifications in ISO/IEC 14496 that translates between the coded
representation of an elementary stream and its decoded representation, incorporating the decoders
3.17
control point
point on a given elementary stream in a terminal where IPMP processing on stream data shall be carried out
3.18
decoder
entity that translates between the coded representation of an elementary stream and its decoded
representation
3.19
decoding buffer
DB
buffer at the input of a decoder that contains access units
3.20
decoder configuration
configuration of a decoder for processing its elementary stream data by using information contained in its
elementary stream descriptor
3.21
decoding time stamp
DTS
indication of the nominal decoding time of an access unit
3.22
delivery layer
generic abstraction for delivery mechanisms (computer networks, etc.) able to store or transmit a number of
multiplexed elementary streams or M4Mux streams
3.23
descriptor
data structure that is used to describe particular aspects of an elementary stream or a coded audio-visual
object
3.24
DMIF application interface
DAI
interface specified in ISO/IEC 14496-6 used to model the exchange of SL-packetized stream data and
associated control information between the sync layer and the delivery layer
3.25
elementary stream
ES
consecutive flow of mono-media data from a single source entity to a single destination entity on the
compression layer
3.26
elementary stream descriptor
structure contained in object descriptors that describes the encoding format, initialization information, sync
layer configuration, and other descriptive information about the content carried in an elementary stream
3.27
elementary stream interface
ESI
conceptual interface modeling the exchange of elementary stream data and associated control information
between the compression layer and the sync layer
3.28
M4Mux channel
FMC
label to differentiate between data belonging to different constituent streams within one M4Mux stream
Note 1 to entry: A sequence of data in one M4Mux channel within an M4Mux stream corresponds to one single SL-
packetized stream.
3.29
M4Mux packet
smallest data entity managed by the M4Mux tool consisting of a header and a payload
3.30
M4Mux stream
sequence of M4Mux packets with data from one or more SL-packetized streams that are each identified by
their own M4Mux channel
3.31
M4Mux tool
tool that allows the interleaving of data from multiple data streams
3.32
graphics profile
profile that specifies the permissible set of graphical elements of the BIFS tool that may be used in a scene
description stream
Note 1 to entry: BIFS comprises both graphical and scene description elements.
3.33
inter
mode for coding parameters that uses previously coded parameters to construc
...


ISO/IEC PRF 14496-1
ISO/IEC JTC 1/SC 29
Secretariat: JISC
Date: 2026-02-23
Information technology — Coding of audio-visual objects —
Part 1:
Systems
Technologies de l'information — Codage des objets audiovisuels —
Partie 1: Systèmes
FDIS stage
I
ISO/IEC PRF 14496-1:2026(en)
© ISO/IEC 2026
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of
this publication may be reproduced or utilized otherwise in any form or by any means, electronic or
mechanical, including photocopying, or posting on the internet or an intranet, without prior written
permission. Permission can be requested from either ISO at the address below or ISO's member body in the
country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents Page
Foreword . iv
Introduction . vi
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 2
4 Abbreviated terms . 13
5 Conventions . 14
6 Streaming framework . 14
6.1 Systems decoder model . 14
6.2 Object description framework . 22
6.3 Synchronization of elementary streams . 100
6.4 Multiplexing of elementary streams . 117
7 Profiles . 129
Annex A (informative) Time base reconstruction . 132
Annex B (informative) The QoS management model for ISO/IEC 14496 content . 135
Annex C (informative) Conversion between time and date conventions . 136
Annex D (informative) Graphical representation of object descriptor and sync layer syntax . 140
Annex E (informative) Elementary stream interface . 155
Annex F (informative) Upstream walkthrough . 158
Annex G (informative) Scene and object description carrousel . 168
Annex H (normative) Usage of ITU-T Recommendation H.264 | ISO/IEC 14496-10 AVC . 170
Bibliography . 174

Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members
of ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of
document should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC
Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the use of
(a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any claimed
patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not received
notice of (a) patent(s) which may be required to implement this document. However, implementers are
cautioned that this may not represent the latest information, which may be obtained from the patent database
available at www.iso.org/patents and https://patents.iec.ch.
ISO and IEC shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.
This fifth edition cancels and replaces the fourth edition (ISO/IEC 14496-1:2010), which has been technically
revised. It also incorporates the Amendments ISO/IEC 14496-1:2010/Amd.1:2010 and
ISO/IEC 14496-1:2010/Amd.2:2014.
The main changes are as follows:
— added support for LASeR;
— added support for raw audio and video bitstreams;
— referencing of the new Syntactic Description Language specification.
A list of all parts in the ISO/IEC 14496 series can be found on the ISO and IEC websites.

Any feedback or questions on this document should be directed to the user's national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html and
www.iec.ch/national-committees.

Introduction
0.1 Overview
The ISO/IEC 14496 series specifies a system for the communication of interactive audio-visual scenes. This
document includes the following elements:
a) the coded representation of natural or synthetic, two-dimensional (2D) or three-dimensional (3D) objects
that can be manifested audibly and/or visually (audio-visual objects), specified in ISO/IEC 14496-2,
ISO/IEC 14496-3, ISO/IEC 14496-10, ISO/IEC 14496-11, ISO/IEC 14496-16, ISO/IEC 14496-19, ISO/IEC
14496-20, ISO/IEC 14496-23 and ISO/IEC 14496-25;
b) the coded representation of the spatio-temporal positioning of audio-visual objects as well as their
behavior in response to interaction (scene description, specified in ISO/IEC 14496-11 and
ISO/IEC 14496-20);
c) the coded representation of information related to the management of data streams (synchronization,
identification, description and association of stream content, specified in this document and in
ISO/IEC 14496-24);
d) a generic interface to the data stream delivery layer functionality (specified in ISO/IEC 14496-6);
e) an application engine for programmatic control of the player: format, delivery of downloadable Java byte
code as well as its execution lifecycle and behavior through APIs (specified in ISO/IEC 14496-11 and
ISO/IEC 14496-21);
f) a file format to contain the media information of an ISO/IEC 14496 presentation in a flexible, extensible
format to facilitate interchange, management, editing, and presentation of the media (specified in
ISO/IEC 14496-15, the AVC file format); and
g) the coded representation of font data and of information related to the management of text streams and
font data streams (specified in ISO/IEC 14496-17, ISO/IEC 14496-18 and ISO/IEC 14496-22).
The overall operation of a system communicating audio-visual scenes can be paraphrased as follows:
At the sending terminal, the audio-visual scene information is compressed, supplemented with
synchronization information and passed to a delivery layer that multiplexes it into one or more coded binary
streams that are transmitted or stored. At the receiving terminal, these streams are demultiplexed and
decompressed. The audio-visual objects are composed according to the scene description and synchronization
information and presented to the end user. The end user may have the option to interact with this
presentation. Interaction information can be processed locally or transmitted back to the sending terminal.
ISO/IEC 14496 defines the syntax and semantics of the bitstreams that convey such scene information, as well
as the details of their decoding processes.
This document specifies the following tools:
— a terminal model for time and buffer management;

— a coded representation of metadata for the identification, description and logical dependencies of the
elementary streams (object descriptors and other descriptors);
— a coded representation of descriptive audio-visual content information (object content information – OCI);
— an interface to intellectual property management and protection (IPMP) systems;
— a coded representation of synchronization information (sync layer – SL); and
— a multiplexed representation of individual elementary streams in a single stream (M4Mux).
These various elements are described functionally in this subclause and specified in the normative clauses
that follow.
0.2 Architecture
The information representation specified in the ISO/IEC 14496 series describes the means to create an
interactive audio-visual scene in terms of coded audio-visual information and associated scene description
information. The entity that composes and sends, or receives and presents such a coded representation of an
interactive audio-visual scene is generically referred to as an "audio-visual terminal" or just "terminal". This
terminal may correspond to a standalone application or be part of an application system.

Figure 1 — The ISO/IEC 14496 series terminal architecture
The basic operations performed by such a receiver terminal are as follows. Information that allows access to
content complying with the ISO/IEC 14496 series is provided as initial session set up information to the
terminal. ISO/IEC 14496-6 defines the procedures for establishing such session contexts as well as
the interface to the delivery layer that generically abstracts the storage or transport medium. The initial set
up information allows the terminal, in a recursive manner, to locate one or more elementary streams that are
part of the coded content representation. Some of these elementary streams may be grouped together using
the multiplexing tool described in this document.

Elementary streams contain the coded representation of audio or visual data, scene description information,
user interaction data, or text or font data. Elementary streams may also themselves convey information to
identify streams, to describe logical dependencies between streams, or to describe information related to the
content of the streams. Each elementary stream contains only one type of data.
Elementary streams are decoded using their respective stream-specific decoders. The audio-visual objects are
composed according to the scene description information and presented by the terminal’s presentation
device(s). All these processes are synchronized according to the systems decoder model (SDM) using the
synchronization information provided at the synchronization layer.
These basic operations are depicted in Figure 1 and are described in more detail below.
0.3 Terminal model: systems decoder model
The systems decoder model provides an abstract view of how a terminal complying with this document
behaves in terms of buffer management and synchronization when reconstructing the audio-visual
information that comprises the presentation. The systems decoder model includes a systems timing model
and a systems buffer model, which are described briefly in the following subclauses.
0.3.1 Timing model
The timing model defines the mechanisms through which a receiving terminal establishes a notion of time that
enables it to process time-dependent events. This model also allows the receiving terminal to establish
mechanisms to maintain synchronization both across and within particular audio-visual objects as well as
with user interaction events. In order to facilitate these functions at the receiving terminal, the timing model
requires that the transmitted data streams contain implicit or explicit timing information. Two sets of timing
information are defined in this document: clock references and time stamps. The former
convey the sending terminal’s time base to the receiving terminal, while the latter convey a notion of relative
time for specific events such as the desired decoding or composition time for portions of the encoded audio-
visual information.
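As an informative illustration of the timing model described above, the following sketch shows how a receiving terminal could recover a sender's time base from a clock reference and map a decoding time stamp onto its local clock. All names, the 90 kHz tick rate and the numeric values are assumptions made for this example only; the normative mechanisms are those specified in this document.

```python
# Informative sketch only: recovering a sender's time base from clock
# references and converting time stamps to local time. The 90 kHz
# resolution and all field names are illustrative assumptions.

RESOLUTION_HZ = 90_000  # assumed tick rate of the sender's time base

class TimeBase:
    """Tracks the offset between sender ticks and local seconds."""

    def __init__(self):
        self.offset = None  # local_seconds - (clock_reference / RESOLUTION_HZ)

    def on_clock_reference(self, clock_reference: int, local_seconds: float):
        # A real terminal would smooth out delivery jitter (e.g. with a
        # phase-locked loop); here we simply latch the latest observation.
        self.offset = local_seconds - clock_reference / RESOLUTION_HZ

    def local_time_for(self, time_stamp: int) -> float:
        # Convert a DTS or CTS expressed in sender ticks to local seconds.
        assert self.offset is not None, "no clock reference received yet"
        return time_stamp / RESOLUTION_HZ + self.offset

tb = TimeBase()
tb.on_clock_reference(clock_reference=900_000, local_seconds=5.0)  # sender t = 10 s
dts_local = tb.local_time_for(990_000)                             # sender t = 11 s
print(dts_local)  # 6.0 — one sender second after the reference arrived
```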
0.3.2 Buffer model
The buffer model enables the sending terminal to monitor and control the buffer resources that are needed to
decode each elementary stream in a presentation. The required buffer resources are conveyed to the receiving
terminal by means of descriptors at the beginning of the presentation. The terminal can then decide whether
or not it is capable of handling this particular presentation. The buffer model allows the sending terminal to
specify when information may be removed from these buffers and enables it to schedule data transmission so
that the appropriate buffers at the receiving terminal do not overflow or underflow.
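The buffer model described above can be illustrated by a simple, informative occupancy check: the sender verifies that its transmission schedule never exceeds the decoding buffer size it has advertised. The event representation, sizes and times below are invented for the example; the normative model is the systems decoder model specified in this document.

```python
# Informative sketch only: checking that a schedule of access units never
# overflows an advertised decoding buffer (DB) size. All values invented.

def max_db_occupancy(events):
    """events: list of (time, size_delta) pairs — positive size at arrival,
    negative size when the access unit is removed at its DTS.
    Returns the peak buffer occupancy in bytes."""
    occupancy, peak = 0, 0
    for _, delta in sorted(events, key=lambda e: e[0]):
        occupancy += delta
        peak = max(peak, occupancy)
    return peak

# Two access units of 1500 and 2000 bytes; each is removed from the
# decoding buffer at its DTS, as the systems decoder model assumes.
events = [(0.00, +1500), (0.02, +2000), (0.04, -1500), (0.06, -2000)]
peak = max_db_occupancy(events)
print(peak)            # 3500
assert peak <= 4096    # fits the (assumed) advertised DB size
```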
0.4 Multiplexing of streams: the delivery layer
The term delivery layer is used as a generic abstraction of any existing transport protocol stack that may be
used to transmit and/or store content complying with ISO/IEC 14496. The functionality of this layer is not
within the scope of this document, and only the interface to this layer is considered. This
interface is the DMIF Application Interface (DAI) specified in ISO/IEC 14496-6. The DAI defines not only an
interface for the delivery of streaming data, but also for signaling information required for session and channel
set up as well as tear down. A wide variety of delivery mechanisms exist below this interface, with some of
them indicated in Figure 1. These mechanisms serve for transmission as well as storage of streaming
data, i.e. a file is considered to be a particular instance of a delivery layer. For applications where the desired
transport facility does not fully address the needs of a service according to the specifications in ISO/IEC 14496,
transport facility does not fully address the needs of a service according to the specifications in ISO/IEC 14496,

a simple multiplexing tool (M4Mux) with low delay and low overhead is defined in this document.
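The multiplexing idea can be sketched informatively as follows: packets from several SL-packetized streams are interleaved into one stream, each labelled with an M4Mux channel number so that the receiver can demultiplex them. The toy packet layout used here (one channel byte, one length byte, then the payload) is an invented simplification and not the normative M4Mux syntax.

```python
# Informative sketch only: interleaving several streams into one M4Mux-like
# stream using channel labels. The byte layout is an invented toy format.

def mux(packets):
    """packets: iterable of (channel, payload bytes) -> one byte stream."""
    out = bytearray()
    for channel, payload in packets:
        assert len(payload) < 256, "toy format: one-byte length field"
        out += bytes([channel, len(payload)]) + payload
    return bytes(out)

def demux(stream):
    """Inverse of mux: byte stream -> {channel: [payloads in order]}."""
    channels, i = {}, 0
    while i < len(stream):
        channel, length = stream[i], stream[i + 1]
        channels.setdefault(channel, []).append(stream[i + 2:i + 2 + length])
        i += 2 + length
    return channels

interleaved = mux([(1, b"scene"), (2, b"audio"), (1, b"update")])
assert demux(interleaved) == {1: [b"scene", b"update"], 2: [b"audio"]}
```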
0.5 Synchronization of streams: the sync layer
Elementary streams are the basic abstraction for any streaming data source. Elementary streams are conveyed
as sync layer-packetized (SL-packetized) streams at the DMIF Application Interface. This packetized
representation additionally provides timing and synchronization information, as well as fragmentation and
random access information. The sync layer (SL) extracts this timing information to enable synchronized
decoding and, subsequently, composition of the elementary stream data.
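The role of the sync layer described above can be sketched informatively: access units are wrapped into SL packets carrying timing and random-access information, the receiver decodes in DTS order and composes in CTS order. The concrete fields and values below are assumptions for the example; the normative SL packet syntax is specified in this document.

```python
# Informative sketch only: SL packets carrying DTS/CTS and random-access
# information. The dataclass layout and tick values are illustrative.

from dataclasses import dataclass

@dataclass
class SLPacket:
    payload: bytes
    dts: int             # decoding time stamp, in time-base ticks
    cts: int             # composition time stamp, in time-base ticks
    random_access: bool  # True if decoding can start at this packet

def packetize(access_units):
    """Turn (payload, dts, cts) triples into SL packets; in this toy
    example only the first access unit is a random access point."""
    return [SLPacket(p, dts, cts, random_access=(i == 0))
            for i, (p, dts, cts) in enumerate(access_units)]

stream = packetize([(b"I", 0, 0), (b"P", 3000, 9000), (b"B", 6000, 6000)])
# The receiver decodes in DTS order but composes in CTS order:
decode_order = [pkt.payload for pkt in sorted(stream, key=lambda s: s.dts)]
compose_order = [pkt.payload for pkt in sorted(stream, key=lambda s: s.cts)]
print(decode_order, compose_order)
```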
0.6 The compression layer
The compression layer receives data in its encoded format and performs the necessary operations to decode
this data. The decoded information is then used by the terminal’s composition, rendering and presentation
subsystems.
0.6.1 Object description framework
The purpose of the object description framework is to identify and describe elementary streams and to
associate them appropriately to an audio-visual scene description. Object descriptors serve to gain access to
ISO/IEC 14496 content. Object content information and the interface to intellectual property management and
protection systems are also part of this framework.
An object descriptor is a collection of one or more elementary stream descriptors that provide the
configuration and other information for the streams that relate to either an audio-visual object, or text or font
data, or a scene description. Object descriptors are themselves conveyed in elementary streams. Each object
descriptor is assigned an identifier (object descriptor ID), which is unique within a defined name scope. This
identifier is used to associate audio-visual objects in the scene description with a particular object descriptor,
and thus the elementary streams related to that particular object.
Elementary stream descriptors include information about the source of the stream data, in the form of a unique
numeric identifier (the elementary stream ID) or a URL pointing to a remote source for the stream. Elementary
stream descriptors also include information about the encoding format, configuration information for the
decoding process and the sync layer packetization, as well as quality of service requirements for the
transmission of the stream and intellectual property identification. Dependencies between streams can also
be signaled within the elementary stream descriptors. This functionality may be used, for example, in scalable
audio or visual object representations to indicate the logical dependency of a stream containing enhancement
information on a stream containing the base information. It can also be used to describe alternative
representations for the same content (e.g. the same speech content in various languages).
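As an informative illustration of the relationships described above, the following sketch models object descriptors and elementary stream descriptors with invented Python classes and resolves the dependency chain of a scalable stream. The class layout, codec labels and identifiers are assumptions for the example; the normative syntax is the binary one specified in this document.

```python
# Informative sketch only: object descriptors grouping elementary stream
# descriptors, including a dependency between a base and an enhancement
# stream. All names and identifiers are illustrative.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ESDescriptor:
    es_id: int
    encoding_format: str                     # assumed codec label
    depends_on_es_id: Optional[int] = None   # base stream for scalable coding
    url: Optional[str] = None                # remote source instead of local data

@dataclass
class ObjectDescriptor:
    od_id: int                               # unique within a name scope
    es_descriptors: list = field(default_factory=list)

# A scalable audio object: the enhancement stream depends on a base stream.
od = ObjectDescriptor(od_id=7, es_descriptors=[
    ESDescriptor(es_id=101, encoding_format="audio-base"),
    ESDescriptor(es_id=102, encoding_format="audio-enh", depends_on_es_id=101),
])

# Resolve the chain of streams needed to decode ES 102:
by_id = {esd.es_id: esd for esd in od.es_descriptors}
chain, cur = [], by_id[102]
while cur is not None:
    chain.append(cur.es_id)
    cur = by_id.get(cur.depends_on_es_id)
print(chain)  # [102, 101]
```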
0.6.1.1 Intellectual property management and protection
The intellectual property management and protection (IPMP) framework for ISO/IEC 14496 content consists
of a set of tools that permits an ISO/IEC 14496 terminal to support IPMP functionality. This functionality is
provided by two different complementary technologies, supporting different levels of interoperability:
— The IPMP framework as defined in 6.2.3 consists of a normative interface that permits an ISO/IEC
14496 terminal to host one or more IPMP systems. The IPMP interface consists of IPMP elementary
streams and IPMP descriptors. IPMP descriptors are carried as part of an object descriptor stream. IPMP
elementary streams carry time variant IPMP information that can be associated to multiple object
descriptors. The IPMP system itself is a non-normative component that provides intellectual property
management and protection functions for the terminal. The IPMP system uses the information carried by
the IPMP elementary streams and descriptors to make protected ISO/IEC 14496 content available to the
terminal.
— The IPMP framework extension, as specified in ISO/IEC 14496-13 allows, in addition to the functionality
specified in ISO/IEC 14496-1, a finer granularity of governance. ISO/IEC 14496-13 provides normative
support for individual IPMP components, referred to as IPMP tools, to be normatively placed at identified
points of control within the terminal systems model. Additionally, ISO/IEC 14496-13 provides normative
support for secure communications to be performed between IPMP tools. ISO/IEC 14496-1 also specifies
specific normative extensions at the systems level to support the IPMP functionality described in ISO/IEC
14496-13.
An application may choose not to use an IPMP system, thereby offering no management and protection
features.
0.6.1.2 Object content information
Object content information (OCI) descriptors convey descriptive information about audio-visual objects. The
main content descriptors are: content classification descriptors, keyword descriptors, rating descriptors,
language descriptors, textual descriptors, and descriptors about the creation of the content. OCI descriptors
can be included directly in the related object descriptor or elementary stream descriptor or, if the information
is time variant, it can be carried in an elementary stream of its own. An OCI stream is organized as a sequence
of small, synchronized entities, called events, that contain a set of OCI descriptors. OCI streams can be
associated with multiple object descriptors.
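The event-based organization of an OCI stream described above can be sketched informatively as a time-ordered sequence of events, each carrying a set of descriptors. The descriptor names, values and the lookup helper are invented for this example.

```python
# Informative sketch only: an OCI stream as a time-ordered sequence of
# events, each carrying a set of descriptors. All names are illustrative.

events = [
    {"start": 0.0,  "descriptors": {"language": "eng", "rating": "PG"}},
    {"start": 60.0, "descriptors": {"language": "fra", "keywords": ["news"]}},
]

def descriptors_at(t, events):
    """Return the descriptors of the most recent event at time t."""
    active = [e for e in events if e["start"] <= t]
    return active[-1]["descriptors"] if active else {}

assert descriptors_at(30.0, events)["language"] == "eng"
assert descriptors_at(90.0, events)["language"] == "fra"
```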
0.6.2 Scene description streams
Scene description addresses the organization of audio-visual objects in a scene, in terms of both spatial and
temporal attributes. This information allows the composition and rendering of individual audio-visual objects
after the respective decoders have reconstructed the streaming data for them. For visual data, ISO/IEC
14496-11 does not mandate particular composition algorithms. Hence, visual composition is implementation
dependent. For audio data, the composition process is defined in a normative manner in ISO/IEC 14496-11
and ISO/IEC 14496-3.
The scene description is represented using a parametric approach (BIFS, binary format for scenes). The
description consists of an encoded hierarchy (tree) of nodes with attributes and other information (including
event sources and targets). Leaf nodes in this tree correspond to elementary audio-visual data, whereas
intermediate nodes group this material to form audio-visual objects, and perform grouping, transformation,
and other such operations on audio-visual objects (scene description nodes). The scene description can evolve
over time by using scene description updates.
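As an informative illustration of the tree structure and scene description updates described above, the following sketch models a small scene as grouping nodes with attributes, a media leaf node bound to an object descriptor, and an update that replaces one field. The classes and node names are invented; the normative coding is specified in ISO/IEC 14496-11.

```python
# Informative sketch only: a parametric scene tree in the spirit of BIFS,
# with a scene description update altering one node field. All names and
# structures are illustrative assumptions.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    fields: dict = field(default_factory=dict)    # e.g. spatial attributes
    children: list = field(default_factory=list)  # empty for leaf nodes
    object_descriptor_id: Optional[int] = None    # set on media leaf nodes

scene = Node("Group", children=[
    Node("Transform", fields={"translation": (0, 0)}, children=[
        Node("AudioSource", object_descriptor_id=7),  # links to an OD
    ]),
])

def apply_update(root, node_name, field_name, value):
    """A scene description update: replace one field of one named node."""
    stack = [root]
    while stack:
        n = stack.pop()
        if n.name == node_name:
            n.fields[field_name] = value
            return True
        stack.extend(n.children)
    return False

assert apply_update(scene, "Transform", "translation", (10, 0))
assert scene.children[0].fields["translation"] == (10, 0)
```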

In order to facilitate active user involvement with the presented audio-visual information, ISO/IEC 14496-11
provides support for user and object interactions. Interactivity mechanisms are integrated with the scene
description information, in the form of linked event sources and targets (routes) as well as sensors (special
nodes that can trigger events based on specific conditions). These event sources and targets are part of scene
description nodes, and thus allow close coupling of dynamic and interactive behavior with the specific scene
at hand. ISO/IEC 14496-11, however, does not specify a particular user interface or a mechanism that maps
user actions (e.g. keyboard key presses or mouse movements) to such events.
Such an interactive environment may not need an upstream channel, but ISO/IEC 14496 also provides means
for client-server interactive sessions with the ability to set up upstream elementary streams and associate
them to specific downstream elementary streams.
0.6.3 Audio-visual streams
The coded representations of audio and visual information are described in ISO/IEC 14496-3 (Audio), and in
ISO/IEC 14496-2 (Visual) and ISO/IEC 14496-10 (Advanced Video Coding), respectively. The reconstructed
audio-visual data are made available to the composition process for potential use during the scene rendering.
0.6.4 Upchannel streams
Downchannel elementary streams may require upchannel information to be transmitted from the receiving
terminal to the sending terminal (e.g. to allow for client-server interactivity). Figure 1 indicates the
flowpath for an elementary stream from the receiving terminal to the sending terminal. The content of
upchannel streams is specified in the same part of the specification that defines the content of the downstream
data. For example, upchannel control streams for video downchannel elementary streams are defined in
ISO/IEC 14496-2.
0.6.5 Interaction streams
The coded representation of user interaction information is not in the scope of ISO/IEC 14496, but this
information is translated into scene modifications that are made available to the composition process for
potential use during scene rendering.
0.6.6 Text and font data streams
Scene description often contains information presented in textual format. The audio-visual data encoded in the scene may also be accompanied by supplemental text information such as subtitles. To enable time-based updates of text data and to ensure the intended text appearance and layout, elementary streams carrying both timed text information and font data are used. The coded representation of the timed text stream is described in ISO/IEC 14496-17. The font data format and the coded representation of the font data stream are described in ISO/IEC 14496-18 (font data stream) and ISO/IEC 14496-22 (font data format).
0.7 Application engine
MPEG-J is a programmatic system (as opposed to a conventional parametric system) which specifies
API(s) for interoperation of MPEG-4 media players with Java code. By combining MPEG-4 media and safe
executable code, content creators may embed complex control and data processing mechanisms with their
media data to intelligently manage the operation of the audio-visual session. The parametric MPEG-4 system
forms the presentation engine while the MPEG-J subsystem controlling the presentation engine forms the
application engine.
© ISO/IEC 2026 – All rights reserved
xiii
ISO/IEC PRF 14496-1:2026(en)
The Java application is delivered as a separate elementary stream to the MPEG-4 terminal. There it will be
directed to the MPEG-J run time environment, from where the MPEG-J program will have access to the various
components and required data of the MPEG-4 player to control it.
In addition to the basic packages of the language (java.lang, java.io, java.util), a few categories of APIs have been defined for different scopes. For the scene graph API, the objective is to provide access to the scene graph specified in ISO/IEC 14496-11: to inspect the graph, to alter nodes and their fields, and to add and remove nodes within the graph. The resource API is used for regulation of performance: it provides a centralized facility for managing resources. This is used when the program execution is contingent upon the terminal configuration and its capabilities, both static (those that do not change during execution) and dynamic. The decoder API allows the control of the decoders that are present in the terminal. The net API provides a way to interact with the network, compliant with the MPEG-4 DMIF application interface. Complex applications and enhanced interactivity are possible with these basic packages. The architecture of MPEG-J is presented in more detail in
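The scene graph API operations listed above (inspect the graph, alter nodes and their fields, add and remove nodes) can be sketched as follows. The real MPEG-J APIs are Java interfaces; the Python below is only an illustrative analogy, and every name in it is hypothetical rather than part of this document:

```python
# Illustrative sketch of the four scene graph operations exposed by the
# MPEG-J scene graph API: inspect, alter fields, add nodes, remove nodes.
# All class and method names here are hypothetical, not the normative API.

class Node:
    """A minimal scene description node with named fields and children."""
    def __init__(self, node_id, **fields):
        self.node_id = node_id
        self.fields = dict(fields)
        self.children = []

    def find(self, node_id):
        """Inspect: depth-first search for a node by identifier."""
        if self.node_id == node_id:
            return self
        for child in self.children:
            found = child.find(node_id)
            if found is not None:
                return found
        return None

# Build a toy scene graph and exercise the operations.
root = Node("root")
shape = Node("shape1", translation=(0.0, 0.0))
root.children.append(shape)                              # add a node
root.find("shape1").fields["translation"] = (10.0, 5.0)  # alter a field
root.children.remove(shape)                              # remove a node
```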
0.8 Extensible MPEG-4 textual (XMT) format
The extensible MPEG-4 textual (XMT) format is a textual representation of the multimedia content described
in ISO/IEC 14496 using the extensible markup language (XML). XMT is designed to facilitate the creation and
maintenance of MPEG-4 multimedia content, whether by human authors or by automated machine programs.
XMT is specified in ISO/IEC 14496-11.
The textual representation of MPEG-4 content has high-level abstractions, XMT-O, that allow authors to
exchange their content easily with other authors or authoring tools, while at the same time preserving
semantic intent. XMT also has low-level textual representations, XMT-A, covering the full scope and function
of MPEG-4. The high-level XMT-O is designed to facilitate interoperability with the Synchronized Multimedia Integration Language (SMIL) 2.0, a Recommendation from the W3C, and also with the Extensible 3D (X3D) specification, developed by the Web3D Consortium as the next generation of the virtual reality modeling language (VRML).
The XMT language has grammars that are specified using the W3C XML schema language. The grammars
contain rules for element placement and attribute values, etc. These rules for XMT, defined using the schema
language, follow the binary coding rules defined in ISO/IEC 14496-11 and help ensure that the textual
representation can be coded into correct binary according to ISO/IEC 14496-11 coding rules.
All constructs in the ISO/IEC 14496 specification have their parallel in the XMT textual format. For the visual
and audio parts, XMT provides a means to reference external media streams of either pre-encoded or raw
audiovisual binary content. While XMT does not contain a textual format for audiovisual media, it does contain
hints in a textual format that allow an XMT tool to encode and embed the audiovisual media into a complete
MPEG-4 presentation.
Information technology — Coding of audio-visual objects —
Part 1:
Systems
1 Scope
This document specifies system level functionalities for the communication of interactive audio-visual scenes, i.e. the coded representation of information related to the management of data streams (synchronization, identification, description and association of stream content).
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 639:2023, Code for individual languages and language groups
ISO/IEC 10646:2020, Information technology — Universal coded character set (UCS)
ISO/IEC 10918-1:1994 | ITU-T Rec. T.81, Information technology — Digital compression and coding of
continuous-tone still images — Part 1: Requirements and guidelines
ISO/IEC 11172-2:1993, Information technology — Coding of moving pictures and associated audio for
digital storage media at up to about 1,5 Mbit/s — Part 2: Video
ISO/IEC 11172-3:1993, Information technology — Coding of moving pictures and associated audio for
digital storage media at up to about 1,5 Mbit/s — Part 3: Audio
ISO/IEC 13818-2:2000 | ITU-T Rec. H.262, Information technology — Generic coding of moving pictures
and associated audio information — Part 2: Video
ISO/IEC 13818-3:1998, Information technology — Generic coding of moving pictures and associated audio
information — Part 3: Audio
ISO/IEC 13818-7:2006, Information technology — Generic coding of moving pictures and associated audio
information — Part 7: Advanced Audio Coding (AAC)
ISO/IEC 14496-2:2004, Information technology — Coding of audio-visual objects — Part 2: Visual
ISO/IEC 14496-3:2019, Information technology — Coding of audio-visual objects — Part 3: Audio

ISO/IEC 14496-10, Information technology — Coding of audio-visual objects — Part 10: Advanced video coding

ISO/IEC 14496-15:2024, Information technology — Coding of audio-visual objects — Part 15: Carriage of network abstraction layer (NAL) unit structured video in the ISO base media file format
ISO/IEC 14496-16:2011, Information technology — Coding of audio-visual objects — Part 16: Animation
Framework eXtension (AFX)
ISO/IEC 14496-18:2004, Information technology — Coding of audio-visual objects — Part 18: Font
compression and streaming
ISO/IEC 14496-34:2025, Information technology — Coding of audio-visual objects — Part 34: Syntactic
description language
DAVIC 1.4.1 specification, Part 9: Information Representation
ANSI/SMPTE 291M-1996, Television — Ancillary Data Packet and Space Formatting
SMPTE 315M -1999, Television — Camera Positioning Information Conveyed by Ancillary Data Packets
W3C Recommendation: August 2001 — Synchronized Multimedia Integration Language (SMIL 2.0), https://www.w3.org/TR/smil20/

W3C Recommendation: May 2001 — XML Schema, https://www.w3.org/TR/xmlschema-0/
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
access unit
AU
smallest individually accessible portion of data within an elementary stream (3.27) to which unique timing information can be attributed

3.2
audio-visual object
representation of a natural or synthetic object that has an audio and/or visual manifestation
Note 1 to entry: The representation corresponds to a node or a group of nodes in the BIFS scene
description. Each audio-visual object is associated with zero or more elementary streams using one or more object
descriptors.
3.3
audio-visual scene
AV scene
set of audio-visual objects together with scene description information that defines their spatial and
temporal attributes including behaviors resulting from object and user interactions
3.4
AVC parameter set
sequence parameter set or a picture parameter set
3.5
AVC access unit
access unit made up of NAL units as defined in ITU-T H.264 | ISO/IEC 14496-10 with the structure defined in 5.2.3.1 of ISO/IEC 14496-15:2024
3.6
AVC parameter set access unit
access unit made up only of sequence parameter set NAL units or picture parameter set NAL units to which the same timestamps apply
3.7
AVC parameter set elementary stream
elementary stream made up only of AVC parameter set access units
3.8
AVC video elementary stream
elementary stream containing access units made up of NAL units for coded picture data
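The distinction between parameter set access units (3.6) and coded picture data (3.8) rests on the 5-bit nal_unit_type field of the one-byte NAL unit header defined in ITU-T H.264 | ISO/IEC 14496-10. A minimal sketch (the type constants follow H.264; the helper names are illustrative):

```python
# Sketch: classifying AVC NAL units by their one-byte header, as used to
# distinguish parameter set streams (3.7) from video elementary streams (3.8).
# Per ITU-T H.264 | ISO/IEC 14496-10, the header carries:
#   forbidden_zero_bit (1 bit), nal_ref_idc (2 bits), nal_unit_type (5 bits).

SPS, PPS = 7, 8                    # sequence / picture parameter set types
SLICE_NON_IDR, SLICE_IDR = 1, 5    # coded picture (slice) types

def nal_unit_type(header_byte: int) -> int:
    """Extract the 5-bit nal_unit_type from the NAL unit header."""
    return header_byte & 0x1F

def is_parameter_set(header_byte: int) -> bool:
    """True for NAL units that belong in an AVC parameter set stream."""
    return nal_unit_type(header_byte) in (SPS, PPS)

# 0x67 = 0b0110_0111: nal_ref_idc=3, nal_unit_type=7 (SPS)
# 0x65 = 0b0110_0101: nal_ref_idc=3, nal_unit_type=5 (IDR slice)
assert is_parameter_set(0x67)
assert not is_parameter_set(0x65)
```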
3.9
binary format for scene
BIFS
coded representation of a parametric scene description format as specified in ISO/IEC 14496-11
3.10
buffer model
model that defines how a terminal complying with ISO/IEC 14496 manages the buffer resources that are
needed to decode a presentation
3.11
clock reference
special time stamp that conveys a reading of a time base
3.12
composition
process of applying scene description information in order to identify the spatio-temporal attributes and
hierarchies of audio-visual objects
3.13
composition memory
CM
random access memory that contains composition units
3.14
composition time stamp
CTS
indication of the nominal composition time of a composition unit
3.15
composition unit
CU
individually accessible portion of the output that a decoder produces from access units
3.16
compression layer
layer of a system according to the specifications in ISO/IEC 14496 that translates between the coded representation of an elementary stream and its decoded representation
Note 1 to entry: The compression layer incorporates the decoders.
3.17
control point
point on a given elementary stream in a terminal where IPMP processing on stream data shall be carried
out
3.18
decoder
entity that translates between the coded representation of an elementary stream and its decoded
representation
3.19
decoding buffer
DB
buffer at the input of a decoder that contains access units
3.20
decoder configuration
configuration of a decoder for processing its elementary stream data by using information contained in
its elementary stream descriptor
3.21
decoding time stamp
DTS
indication of the nominal decoding time of an access unit
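The interplay of the terms above can be sketched with illustrative values: access units enter the decoding buffer (3.19) in decoding time stamp order, while composition units leave the composition memory in composition time stamp order, and with bidirectional prediction the two orders differ:

```python
# Sketch of the relation between decoding time stamps (DTS, 3.21) and
# composition time stamps (CTS, 3.14): the decoder consumes access units
# in decoding order, while composition units are presented in composition
# order. The picture types and time values below are illustrative only.

# (picture, DTS, CTS) — the B picture is decoded after the P picture it
# references, but composed (presented) before it.
access_units = [("I", 0, 1), ("P", 1, 3), ("B", 2, 2)]

decoding_order = [p for p, dts, cts in sorted(access_units, key=lambda a: a[1])]
composition_order = [p for p, dts, cts in sorted(access_units, key=lambda a: a[2])]
# decoding_order    == ["I", "P", "B"]
# composition_order == ["I", "B", "P"]
```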
3.22
delivery layer
generic abstraction for delivery mechanisms (computer networks, etc.) able to store or transmit a
number of multiplexed elementary streams or M4Mux streams
3.23
descriptor
data structure that is used to describe particular aspects of an elementary stream or a coded audio-visual
object
3.24
DMIF application interface
DAI
interface specified in ISO/IEC 14496-6 used to model the exchange of SL-packetized stream data and
associated control information between the sync layer and the delivery layer
3.25
elementary stream
ES
consecutive flow of mono-media data from a single source entity to a single destination entity on the
compression layer
3.26
elementary stream descriptor
structure contained in object descriptors that describes the encoding format, initialization information,
sync layer configuration, and other descriptive information about the content carried in an elementary
stream
3.27
elementary stream interface
ESI
conceptual interface modeling the exchange of elementary stream data and associated control
information between the compression layer and the sync layer
3.28
M4Mux channel
FMC
label to differentiate between data belonging to different constituent streams within one M4Mux stream
Note 1 to entry: A sequence of data in one M4Mux channel within a M4Mux stream corresponds
to one single SL-packetized stream.
3.29
M4Mux packet
smallest data entity managed by the M4Mux tool consisting of a header and a payload
3.30
M4Mux stream
sequence of M4Mux packets with data from one or more SL-packetized streams that are each identified
by their own M4Mux channel
3.31
M4Mux tool
tool that allows the interleaving of data from multiple data streams
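The M4Mux definitions above (channel, packet, stream, tool) can be sketched as follows; the framing here is illustrative only, not the normative M4Mux packet syntax:

```python
# Sketch of the M4Mux tool (3.31): SL-packets from several SL-packetized
# streams are interleaved into one M4Mux stream (3.30), each M4Mux packet
# (3.29) carrying a channel label (3.28) in its header so that the
# receiver can reassemble each constituent stream.

def mux(streams):
    """Interleave {channel: [payload, ...]} into (channel, payload) packets."""
    packets = []
    for i in range(max(len(p) for p in streams.values())):
        for channel, payloads in streams.items():
            if i < len(payloads):
                packets.append((channel, payloads[i]))  # header = channel label
    return packets

def demux(packets):
    """Recover one SL-packetized stream per M4Mux channel."""
    streams = {}
    for channel, payload in packets:
        streams.setdefault(channel, []).append(payload)
    return streams

streams = {1: [b"audio0", b"audio1"], 2: [b"video0"]}
assert demux(mux(streams)) == streams  # round trip preserves each stream
```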

3.32
graphics profile
profile that specifies the permissible set of graphical elements of the BIFS tool that may be used in a scene
description stream
Note 1 to entry: BIFS comprises both graphical and scene description elements.
3.33
inter
mode for coding parameters that uses previously coded parameters to construct a prediction
3.34
interaction stream
elementary stream that conveys user interaction information
3.35
intra
mode for coding parameters that does not make reference to previously coded parameters to perform
the encoding
3.36
initial object descriptor
special object descriptor that allows the receiving terminal to gain initial access to portions of content
encoded according to ISO/IEC 14496 and that conveys profile and level information to describe the
complexity of the content
3.37
intellectual property identification

IPI
unique identification of one or more elementary streams corresponding to parts of one or more audio-
visual objects
3.38
intellectual property management and protection system
generic term for mechanisms and tools to manage and protect intellectual property
Note 1 to entry: This document defines the interface to such systems as well as:
— The provision for the identification of IPMP tools through the use of a functional description of the IPMP tools’
capabilities in a parametric fashion.
— Controlling the time of instantiation of IPMP tools either by the inclusion of references to the required IPMP
tools or at the request of already instantiated IPMP tools.
— Providing secure messaging between IPMP tools and the terminal and between IPMP tools and the user.
— Notification of the instantiation of IPMP tools to IPMP tools requesting such notification.
— Interaction between IPMP tools, and/or the terminal and the user.
— The carriage of IPMP tools within the bitstream.
3.39
IPMP information
information directed to a given IPMP tool to enable, assist or facilitate its operation
3.40
IPMP system
monolithic IPMP protection scheme which requ
...
