ISO/IEC TR 23888-3
Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content
General Information
- Status
- Not Published
- Current Stage
- 6000 - International Standard under publication
- Start Date
- 29-Apr-2026
- Completion Date
- 02-May-2026
Overview
ISO/IEC DTR 23888-3.2, titled Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content, is a technical report from ISO/IEC JTC 1/SC 29/WG 5. This document focuses on practices and recent developments for optimizing video encoding and receiving systems to enhance machine analysis — such as object detection, segmentation, and tracking — on coded video data.
The standard addresses emerging needs in fields such as surveillance, intelligent transportation, and industrial automation, where video is increasingly analyzed by AI systems rather than directly viewed by humans. Optimizing both encoders and receiving systems for these scenarios can significantly improve coding efficiency, reduce network bandwidth requirements, and maintain or boost the accuracy of machine vision tasks.
Key Topics
Pre-processing Technologies: Techniques to prepare video before encoding, improving efficiency or machine analysis outcomes. This includes:
- Region of interest (RoI) based processing
- Foreground and background separation
- Temporal (frame rate) and spatial (resolution) subsampling
- Noise filtering to enhance relevant signal
Encoding Technologies: Enhancements at the encoder stage to allocate video quality and bitrate based on machine analysis needs:
- Adaptive quantization parameter (QP) adjustments, especially for RoI
- Temporal layer and chroma QP offset tuning
Post-processing Technologies: Methods applied to video after decoding, enabling further optimization for machine consumption such as:
- Temporal and spatial resampling
- Enhancement filtering
Metadata for Machine Analysis: Use of supplemental enhancement information (SEI) messages to carry data useful for machine analysis, such as:
- Neural-network post-filter messages
- Annotated regions
- Object masks
- Encoder optimization information
Evaluation Methodologies: Objective metrics and frameworks for assessing optimization, including:
- Bitrate calculation
- Peak Signal-to-Noise Ratio (PSNR)
- Mean Average Precision (mAP) and Multiple Object Tracking Accuracy (MOTA)
- Bjøntegaard delta rate (BD-rate) analysis
Applications
Optimizing video encoders and receivers for machine analysis is crucial for a range of practical applications:
- Surveillance Systems: Reducing required bandwidth while maintaining the quality of video needed for automated object detection and tracking.
- Intelligent Transportation: Supporting interoperability and efficient data exchange between vehicles and infrastructure, enabling real-time analysis at scale.
- Industrial Automation: Assisting in visual content inspection and automated quality control, improving efficiency through AI-driven video analysis.
- Edge AI Devices: Empowering front-end devices (e.g., cameras with integrated AI) to preprocess or partially analyze video content before transmission, distributing computational load and optimizing network usage.
Related Standards
To ensure interoperability and performance, this technical report references several important international standards:
- ISO/IEC 23090-3 / ITU-T H.266: Versatile Video Coding (VVC)
- ISO/IEC 23008-2 / ITU-T H.265: High Efficiency Video Coding (HEVC)
- ISO/IEC 14496-10 / ITU-T H.264: Advanced Video Coding (AVC)
- ISO/IEC 23002-7 / ITU-T H.274: Versatile Supplemental Enhancement Information Messages
- ISO/IEC TR 23002-8: Distortion metrics for image and video coding
- ISO/IEC TR 23888-1: Further use cases and foundational information
Summary
ISO/IEC DTR 23888-3.2 provides a comprehensive overview of state-of-the-art practices for optimizing encoders and receiving systems to enhance machine analysis of coded video content. By leveraging pre-processing, encoding, and post-processing techniques, along with informative metadata, organizations can make machine vision applications more efficient and reliable across a wide array of smart video domains. Implementing these optimizations can result in bandwidth reductions, improved scalability, and higher accuracy in AI-powered multimedia systems.
Buy Documents
ISO/IEC DTR 23888-3 - Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content Released:26. 08. 2025
REDLINE ISO/IEC DTR 23888-3 - Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content Released:26. 08. 2025
ISO/IEC DTR 23888-3.2 - Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content
REDLINE ISO/IEC DTR 23888-3.2 - Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content
Frequently Asked Questions
ISO/IEC TR 23888-3 is a draft published by the International Organization for Standardization (ISO). Its full title is "Information technology — Artificial intelligence for multimedia — Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content".
ISO/IEC TR 23888-3 is classified under the following ICS (International Classification for Standards) categories: 35.040.40 (Coding of audio, video, multimedia and hypermedia information) and 35.240.01 (Application of information technology in general). The ICS classification helps identify the subject area and facilitates finding related standards.
ISO/IEC TR 23888-3 is available in PDF format for immediate download after purchase.
Standards Content (Sample)
FINAL DRAFT
Technical Report ISO/IEC DTR 23888-3
ISO/IEC JTC 1/SC 29
Secretariat: JISC
Voting begins on: 2025-09-09
Voting terminates on: 2025-11-04
Information technology — Artificial intelligence for multimedia —
Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content
RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
Reference number: ISO/IEC DTR 23888-3:2025(en) © ISO/IEC 2025
© ISO/IEC 2025
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents

Foreword  iv
1 Scope  1
2 Normative references  1
3 Terms and definitions  1
4 Abbreviated terms  2
5 Overview  2
5.1 General overview  2
5.2 Use cases and applications  3
6 Evaluation methodology  3
6.1 General  3
6.2 Bit rate  4
6.3 PSNR  4
6.4 mAP  4
6.5 MOTA  5
6.6 BD-rate  5
7 Pre-processing technologies  6
7.1 Region of interest-based methods  6
7.2 Foreground and background processing  7
7.3 Temporal subsampling  7
7.4 Spatial subsampling  7
7.5 Noise filtering  8
8 Encoding technologies  8
8.1 RoI-based quantization parameter adaption  8
8.2 Quantization step adjustment for temporal layers  8
8.3 Chroma QP offset setting  9
9 Post-processing technologies  9
9.1 Temporal resampling  9
9.2 Spatial resampling  9
9.3 Enhancement post-filtering  9
10 Metadata  10
10.1 Neural-network post-filter SEI message  10
10.2 Annotated regions SEI message  10
10.3 Object mask information SEI message  11
10.4 Encoder optimization information SEI message  11
10.5 Packed regions information SEI message  11
Annex A (informative) Software implementation examples  12
Annex B (informative) Combined software implementation examples  19
Bibliography  20
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical activity.
ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations,
governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/
IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the
use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any
claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not
received notice of (a) patent(s) which may be required to implement this document. However, implementers
are cautioned that this may not represent the latest information, which may be obtained from the patent
database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held
responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information, in collaboration with
ITU-T SG21, Technologies for multimedia, content delivery and cable television. The corresponding ITU-T SG21
provisional work item name is H.Sup.MACVC.
A list of all parts in the ISO/IEC 23888 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards
body. A complete listing of these bodies can be found at www.iso.org/members.html and
www.iec.ch/national-committees.
FINAL DRAFT Technical Report ISO/IEC DTR 23888-3:2025(en)
Information technology — Artificial intelligence for
multimedia —
Part 3:
Optimization of encoders and receiving systems for machine
analysis of coded video content
1 Scope
This document specifies a summary of optimizations for encoders and receiving systems for conducting
machine analysis tasks on coded video content. It provides a concept-level overview of recent practices
and provides comments on technical aspects and cautions to be taken when interpreting the results. This
document describes technologies that have recently been studied and demonstrated benefits to coding
efficiency for some machine analysis tasks.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitute
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
Rec. ITU-T H.266 | ISO/IEC 23090-3, Versatile video coding
Rec. ITU-T H.265 | ISO/IEC 23008-2, High efficiency video coding
Rec. ITU-T H.264 | ISO/IEC 14496-10, Advanced video coding
Rec. ITU-T H.274 | ISO/IEC 23002-7, Versatile supplemental enhancement information messages for coded video
bitstreams
3 Terms and definitions
For the purposes of this document, the terms and definitions given in Rec. ITU-T H.266 | ISO/IEC 23090-3,
Rec. ITU-T H.265 | ISO/IEC 23008-2, Rec. ITU-T H.264 | ISO/IEC 14496-10, Rec. ITU-T H.274 | ISO/IEC 23002-7
and the following apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
machine consumption
applying a machine analysis task such as object detection, segmentation or object tracking
4 Abbreviated terms
AVC Advanced Video Coding (Rec. ITU-T H.264 | ISO/IEC 14496-10)
BD-rate Bjøntegaard delta bit rate
CTU coding tree unit
HEVC High Efficiency Video Coding (Rec. ITU-T H.265 | ISO/IEC 23008-2)
mAP mean average precision
MOTA multiple object tracking accuracy
NNPF neural-network post-filter
NNPFA neural-network post-filter activation
NNPFC neural-network post-filter characteristics
OMI object mask information
PSNR peak signal-to-noise ratio
QP quantization parameter
RoI region of interest
RPR reference picture resampling
SEI supplemental enhancement information
TID temporal identifier
URI uniform resource identifier
VSEI Versatile Supplemental Enhancement Information Messages for Coded Video Bitstreams (Rec.
ITU-T H.274 | ISO/IEC 23002-7)
VVC Versatile Video Coding (Rec. ITU-T H.266 | ISO/IEC 23090-3)
Y′CBCR colour space representation commonly used for video/image distribution, also written as YUV
YUV colour space representation commonly used for video/image distribution, also written as Y′CBCR
5 Overview
5.1 General overview
Most video processing systems consist of four main processing steps, as shown in Figure 1. This document
describes technologies for optimization of encoders and receiving systems, such as pre-processing, encoding
and post-processing for machine consumption. The decoding process, on the other hand, is fully specified
in the respective Rec. ITU-T H.266 | ISO/IEC 23090-3 Versatile Video Coding (VVC), Rec. ITU-T H.265 |
ISO/IEC 23008-2 High Efficiency Video Coding (HEVC) and Rec. ITU-T H.264 | ISO/IEC 14496-10 Advanced
Video Coding (AVC) video coding standards, amongst others. Hence, the samples of the decoded video are
fully specified by the given input bitstream.
Figure 1 — General video coding and processing pipeline
An overview of the commonly used practices for evaluating encoder optimization technologies for machine
consumption can be found in Clause 6. Descriptions of pre-processing technologies can be found in Clause 7.
Encoder optimization technologies are described in Clause 8 and post-processing technologies are described
in Clause 9. Metadata that is useful for machine consumption is described in Clause 10.
It is noted that depending on specific use cases, the technologies outlined in this document can be
implemented individually or in combination to optimize the machine consumption performance within the
constraints of the system capabilities. When employing multiple technologies simultaneously, it is important
to consider that certain combinations can be impractical or infeasible due to inherent methodological
constraints. Tested combinations of two or more technologies are listed in Annex B.
5.2 Use cases and applications
There are various use cases and applications using encoded video that benefit from optimizing both encoders
and receiving systems for machine consumption. Some of them are highlighted below:
— Surveillance: A considerable amount of bandwidth is needed to transmit a high volume of data generated
by a large number of sensors. The number of sensors also has an impact on the computational load on
the server side, as having to analyse the input from many sensors can become a huge burden. This can be
eased by distributing the computation to the front-end devices.
— Intelligent transportation: A key aspect for vehicular applications is interoperability between not only
vehicles from different vendors, but also the infrastructures of various locations. Connected vehicles
are expected to play a significant role in future transport systems and the tremendous number of
vehicles emphasizes the need to reduce the amount of data being transmitted between them to avoid
overloading the network.
— Intelligent industry: One example in this area is visual content analysis, checking and screening. Machine
automation is desirable for increasing efficiency.
A more detailed description of use cases can be found in ISO/IEC TR 23888-1.[1]
6 Evaluation methodology
6.1 General
A set of assessment metrics are used for the evaluation of encoder and receiving systems optimization
technologies for machine consumption. An overview evaluation framework is shown in Figure 2. Here the
input video is encoded to generate a bitstream. This bitstream is then decoded, and the decoded video is
used for machine consumption. In this diagram, the “encoder” includes both pre-processing and encoding
steps, and the “decoder” includes both decoding and post-processing steps, as shown in Figure 1.
Figure 2 — Evaluation framework and points of measurement
6.2 Bit rate
The bit rate is determined based on the encoded bitstream and parameters of the input video such as the frame rate and the total number of frames. The following formula is applied to calculate the bit rate (in kbit/s):

bitrate = (8 × fileSizeInBytes × fps) / (numFrames × 1000)
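As a minimal sketch of this calculation (the function and parameter names are illustrative, not taken from this document), the formula can be written as:

```python
def bitrate_kbps(file_size_bytes: int, fps: float, num_frames: int) -> float:
    """Bit rate in kbit/s: 8 * fileSizeInBytes * fps / (numFrames * 1000)."""
    return 8 * file_size_bytes * fps / (num_frames * 1000)
```

For example, a 1 000 000-byte bitstream carrying 300 frames at 30 fps works out to 800 kbit/s.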
6.3 PSNR
Encoding for video distribution is ordinarily performed in the Y′CBCR domain (nicknamed YUV herein for brevity and ease of typing). For standard-dynamic-range video, the distortion metric primarily used in the video coding standardization community has been the peak signal-to-noise ratio (PSNR). The following two formulae are used to calculate PSNR:

MSE = (1 / (m × n)) × Σ_{i=0}^{n−1} Σ_{j=0}^{m−1} (x(i,j) − y(i,j))²

PSNR = 10 × log10( (255 × 2^(bitdepth−8))² / MSE )

where x(i,j) is the decoded sample value of a certain colour component, y(i,j) is the corresponding original sample value, and bitdepth is the bit depth of the input video. It is a common practice to calculate PSNR values for each of the colour components Y, U and V.
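A pure-Python sketch of the two formulae above for a single colour component (the function name is illustrative):

```python
import math

def psnr(decoded, original, bitdepth=8):
    """PSNR of one colour component; `decoded` and `original` are
    equally sized 2-D lists of sample values."""
    n = len(decoded)      # picture height
    m = len(decoded[0])   # picture width
    mse = sum((x - y) ** 2
              for row_x, row_y in zip(decoded, original)
              for x, y in zip(row_x, row_y)) / (m * n)
    peak = 255 * 2 ** (bitdepth - 8)  # maximum sample value at this bit depth
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)
```

In practice the function would be applied separately to the Y, U and V components.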
6.4 mAP
The performance of object detection and segmentation tasks is measured by mean average precision (mAP). This metric indicates what percentage of objects are correctly identified by having sufficient overlap between the detected object and the ground truth as well as being assigned to the correct object class. Then the share of correctly identified objects for each class is determined, and finally the score for each class is averaged. The calculation of mAP is as follows:

mAP = (1 / numOverlaps) × Σ_{i=1}^{numOverlaps} (1 / numClasses) × Σ_{j=1}^{numClasses} (correctObjects_{i,j} / totalObjects_{i,j})
Some commonly used variants of this metric are:
— mAP@0.5: An object is counted as correctly identified if the Intersection over Union (IoU) between the
detected bounding box and the ground truth bounding box is at least 0.5. Sometimes this variant of the
mAP metric is also referred to as mAP50.
— mAP@[0.5:0.05:0.95]: In this variant a total of ten mAP scores with increasing IoU thresholds are calculated. The IoU threshold starts at 0.5 and increases by 0.05 after each iteration, until it reaches the upper bound of 0.95. Once all ten scores are determined, the average of these scores is calculated to produce the final mAP.
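The overlap test behind both variants is the Intersection over Union; a small sketch (the `(x1, y1, x2, y2)` box layout is an assumption for illustration, not specified by this document):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The ten thresholds evaluated by the mAP@[0.5:0.05:0.95] variant
IOU_THRESHOLDS = [0.5 + 0.05 * k for k in range(10)]
```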
6.5 MOTA
Object tracking performance is measured by multiple object tracking accuracy (MOTA). This metric accounts for all object configuration errors made by the tracker: false positives, misses (false negatives) and mismatches, over all frames. The calculation of MOTA is as follows:

MOTA = 1 − ( Σ_t (FN_t + FP_t + mme_t) ) / ( Σ_t g_t )

where FN_t, FP_t, mme_t and g_t are the number of false negatives, the number of false positives, the number of mismatch errors (identity switching between two successive frames), and the number of objects in the ground truth, respectively, at time t.
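A direct transcription of the formula (argument names are illustrative), taking per-frame error counts as sequences:

```python
def mota(fn, fp, mme, gt):
    """MOTA = 1 - sum_t(FN_t + FP_t + mme_t) / sum_t(g_t).
    fn, fp, mme and gt are per-frame sequences of false negatives,
    false positives, mismatch errors and ground-truth object counts."""
    errors = sum(f + p + m for f, p, m in zip(fn, fp, mme))
    return 1 - errors / sum(gt)
```

Note that MOTA can become negative when the tracker makes more errors than there are ground-truth objects.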
6.6 BD-rate
To compare the performance of a technology against the reference, the well-known Bjøntegaard delta rate (BD-rate) metric[2] is used. Instead of using PSNR as the distortion metric, as is typical for human vision performance evaluation, machine consumption distortion metrics, e.g. mAP and MOTA, are used in machine BD-rate calculation.
The distortion measurement of machine consumption (e.g. mAP and MOTA) can sometimes be non-monotonic with respect to the bit rate due to the characteristics of the machine analysis task and possible limitations of machine networks. Polynomial curve fitting is applied to ensure rate-distortion monotonicity and thus valid BD-rate calculation.
f(x) = b0·x³ + b1·x² + b2·x + b3

For a given polynomial function in the above formula, b0, b1, b2 and b3 are coefficients of the function, x is the input (bit rate) and f(x) is the output (quality). The following two constraints are invoked to ensure its monotonicity and convexity:

— the first-order derivative of the polynomial, shown below, is positive in the given x range:

f′(x) = 3·b0·x² + 2·b1·x + b2

— the second-order derivative of the polynomial, shown below, is negative in the given x range:

f″(x) = 6·b0·x + 2·b1

Parameters (b0, b1, b2, b3) in the polynomial function are solved by sequential least squares programming (SLSQP) and applied to curve fitting.
NOTE It is a common practice to have the minimal quality value of the fitted curve no smaller than the minimal
quality value of the original curve and the maximum quality value of the fitted curve no greater than the maximum
quality value of the original curve.
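As a small sketch of the two derivative constraints the SLSQP fit has to satisfy (the function name and the sampled-point check are assumptions for illustration; an actual fit would pass these as inequality constraints to the solver):

```python
def constraints_ok(b, xs):
    """Check f'(x) > 0 and f''(x) < 0 at sample points xs for
    f(x) = b0*x**3 + b1*x**2 + b2*x + b3."""
    b0, b1, b2, _ = b
    return all(3 * b0 * x**2 + 2 * b1 * x + b2 > 0   # monotonically increasing
               and 6 * b0 * x + 2 * b1 < 0           # concave quality curve
               for x in xs)
```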
7 Pre-processing technologies
7.1 Region of interest-based methods
One often-used optimization method is region of interest (RoI)-based coding. Here the input video is
analysed in some way and then the encoder can optimize the encoding towards machine consumption based
on the analysis results. The analysis can be done using various methods, e.g., neural networks. An example
of a pipeline that can be used for RoI-based approaches is shown in Figure 3.
Figure 3 — Pipeline for RoI-based systems
In one implementation example, an object detection network is used to analyse the input data. This network
produces a list of objects that can be found in the current picture. The information used to describe each
object includes the index of the picture in which the object can be found and the position of the object in
the picture. Some networks can provide more information than this and the encoder can choose to select a
subset of all objects by filtering based on, for example, the class of an object or the estimated likelihood of an
object of the described class being at the described position. In a similar approach, a segmentation network
can be used where the object is not described by a bounding box but by a segmentation mask indicating
exactly which samples the segmentation network estimates as belonging to the object. The list produced
during the analysis can then be used by the encoder, for example, to separate foreground and background
with the purpose of encoding the foreground at a better quality and the background at a lower quality. One
such encoding method is described in 8.1. In this example, the analysis does not change the input video, but
directly forwards it to the encoder.
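The object-filtering step described above might look like the following (the detection record schema with `cls` and `conf` fields is an assumption for illustration):

```python
def select_roi_objects(detections, allowed_classes, min_confidence=0.5):
    """Keep only detections relevant to the machine task: the object class
    must be of interest and the estimated likelihood must clear a threshold."""
    return [d for d in detections
            if d["cls"] in allowed_classes and d["conf"] >= min_confidence]
```

The surviving detections can then drive the foreground/background separation at the encoder.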
In other RoI-based methods, the pre-processing changes the input video, for example, by applying different
pre-processing methods on the foreground and background, or specific parts of the video, such as
subsampling the background area of the input video.
In one implementation example, an object segmentation network is first used to analyse the input data.
The network produces a list of objects segmented with the object shapes in the current picture. The object
shapes and positions could be represented, for example, by segmentation masks. More information such as
the object class or the estimated likelihood of the object segment could also be provided by the network to
identify the objects. Based on the object information, it is possible to derive spatial complexity and temporal
complexity for the different segments, and then RoI-based pre-processing of the input video can be adapted
based on the spatial and temporal complexity. The spatial complexity here indicates the averaged object
size which can be calculated by dividing the percentage of the area covered by the objects by the total
number of the objects. Temporal complexity indicates the content changes between two pictures which can
be calculated by various methods, for example, by taking the mean absolute difference of the collocated
samples in two pictures.
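The two complexity measures defined above can be sketched as follows (names are illustrative; frames are taken as flattened sample lists):

```python
def spatial_complexity(covered_area_fraction, num_objects):
    """Averaged object size: fraction of the picture area covered by
    objects, divided by the total number of objects."""
    return covered_area_fraction / num_objects

def temporal_complexity(prev_frame, curr_frame):
    """Mean absolute difference of collocated samples in two pictures."""
    return (sum(abs(a - b) for a, b in zip(prev_frame, curr_frame))
            / len(curr_frame))
```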
7.2 Foreground and background processing
After pre-analysis that determines the foreground and background areas, one straightforward way to handle
the background that is less critical to machine consumption is to “eliminate” it by setting the corresponding
sample values to a constant value. However, some portions of the background samples, for example those
immediately surrounding the foreground area, could still be useful for machine consumption. Therefore, the
background regions relevant to machine consumption can be preserved to a certain extent with low-pass
filtering, such as a Gaussian filter with a sliding window, where the window size can be set based on the
input video resolution.
Moreover, extracted features can reveal importance information of the input video. In other words,
compared with binary classification of foreground and background, these extracted features can provide
importance information at a finer granularity. Therefore, such extracted features can be used to determine
how to process foreground and background differently. In one implementation example, a feature map is
extracted by a feature extraction network, and based on the feature map, the parameters of a Gaussian
smoothing filter are adapted and then the adaptive filtering is applied to the picture. As the background area
and foreground area have different features and even within the background or foreground area, different
regions can have different features, the Gaussian smooth filter can be controlled at a finer granularity, which
finally results in a more efficient pre-processing.
An implementation example with more detailed description can be found in A.2.
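As a simplified one-dimensional sketch of the idea (a real implementation filters 2-D pictures; the mask-driven pass-through of foreground samples is the essential point, and all names are illustrative):

```python
import math

def gaussian_kernel(size, sigma):
    """Normalized 1-D Gaussian kernel of odd length `size`."""
    half = size // 2
    k = [math.exp(-((i - half) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    total = sum(k)
    return [v / total for v in k]

def smooth_background(row, fg_mask, kernel):
    """Low-pass filter only background samples (fg_mask[i] == 0);
    foreground samples are passed through unchanged."""
    half = len(kernel) // 2
    out = list(row)
    for i in range(len(row)):
        if fg_mask[i]:
            continue  # preserve the foreground untouched
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - half, 0), len(row) - 1)  # clamp at borders
            acc += w * row[idx]
        out[i] = acc
    return out
```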
7.3 Temporal subsampling
In some use cases, for example when the frame rate is high, a way to reduce the bit rate without a strong
negative impact on the machine consumption performance can be to skip certain frames and encode the
video at a lower frame rate. One example is to remove every other frame from the input video and encode
the video at half frame rate. This can be done in a dynamic manner, for example by evaluating the motion
between two or more frames and if there is only little motion, a frame can be removed. In some cases, if
the receiving system requires a specific frame rate, a corresponding post-processing technology that up-samples the video to the full frame rate can be applied.
An implementation example with more detailed description can be found in A.4.
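Both the fixed and the dynamic variants can be sketched as follows (the `motion` callable is a placeholder for any frame-difference measure; names are illustrative):

```python
def halve_frame_rate(frames):
    """Remove every other frame, i.e. encode at half the input frame rate."""
    return frames[::2]

def adaptive_subsample(frames, motion, threshold):
    """Keep a frame only when its motion relative to the last kept frame
    reaches `threshold`; otherwise drop it."""
    kept = [frames[0]]
    for f in frames[1:]:
        if motion(kept[-1], f) >= threshold:
            kept.append(f)
    return kept
```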
7.4 Spatial subsampling
If the analysis of a video shows that it contains primarily large objects, one way to improve the BD-rate
performance is to perform spatial subsampling on the input video. This will result in fewer samples in the
subsampled frames to be encoded, and thus likely lead to faster encoding and bit rate savings. The optimum
downscaling factor is content dependent. It is possible that the machine consumption performance drops
when subsampling is too aggressive. Therefore, it is advisable to apply spatial subsampling adaptively, for
example, based on the characteristics of the video content (such as the averaged object spatial area and the
number of objects) and the target bit rate. Moreover, the spatial subsampling can also depend on the picture type. For example, different spatial subsampling methods can be applied depending on whether the input video is captured by a regular camera as natural scenes or by an infrared sensor as thermal images.
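A minimal 2x block-averaging downscaler illustrates the subsampling step itself (a real system would also choose the factor adaptively, as described above):

```python
def downscale_2x(frame):
    """2x spatial subsampling by averaging non-overlapping 2x2 blocks;
    `frame` is a 2-D list of samples with even width and height."""
    return [[(frame[2 * r][2 * c] + frame[2 * r][2 * c + 1]
              + frame[2 * r + 1][2 * c] + frame[2 * r + 1][2 * c + 1]) / 4
             for c in range(len(frame[0]) // 2)]
            for r in range(len(frame) // 2)]
```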
One tool specified in the Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC) standard that can be used for the purpose
of spatial resampling is called reference picture resampling (RPR). This tool allows the encoder to choose
to encode some pictures of the input video at different resolutions. For example, based on the analysis of keyframes, the resolution can be changed without encoding an intra picture at the changed resolution, because inter prediction can be made from all allowed reference pictures regardless of their resolution.
In one implementation example, the RPR tool can be used at the frame level, where the input video can be
analysed as described in 7.1. In this case, the unmodified input video is forwarded to the encoder with a scale
factor list generated by the analyser. Specifically, an object detection network is used to analyse the input
video in both full resolution and at least one spatially subsampled resolution. This network produces a list of
objects with object inform
...
ISO/IEC DTR 23888-3:(en)
ISO/IEC JTC 1/SC 29/WG 5
Secretariat: JISC
Date: 2025-06-08
Information technology — Artificial intelligence for multimedia —
Part 3:
Optimization of encoders and receiving systems for machine analysis
of coded video content
DTR stage
Warning for WDs and CDs
This document is not an ISO/IEC International Standard. It is distributed for review and comment. It is subject to change
without notice and may not be referred to as an International Standard.
Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of which
they are aware and to provide supporting documentation.
© ISO/IEC 2025
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication
may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying,
or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO
at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: + 41 22 749 01 11
E-mail: copyright@iso.org
Website: www.iso.org
Published in Switzerland
© ISO/IEC 2025 – All rights reserved
Contents
Foreword . v
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 Abbreviated terms . 1
5 Overview . 2
5.1 General overview . 2
5.2 Use cases and applications . 2
6 Evaluation methodology . 3
6.1 General . 3
6.2 Bit rate . 3
6.3 PSNR . 3
6.4 mAP . 4
6.5 MOTA . 4
6.6 BD-rate . 4
7 Pre-processing technologies . 5
7.1 Region of interest-based methods . 5
7.2 Foreground and background processing . 6
7.3 Temporal subsampling . 6
7.4 Spatial subsampling . 6
7.5 Noise filtering . 7
8 Encoding technologies . 7
8.1 RoI-based quantization parameter adaption . 7
8.2 Quantization step adjustment for temporal layers . 8
8.3 Chroma QP offset setting . 8
9 Post-processing technologies . 9
9.1 Temporal resampling . 9
9.2 Spatial resampling . 9
9.3 Enhancement post-filtering . 9
10 Metadata . 9
10.1 Neural-network post-filter SEI message . 9
10.2 Annotated regions SEI message . 10
10.3 Object mask information SEI message . 10
10.4 Encoder optimization information SEI message . 10
10.5 Packed regions information SEI message . 10
Annex A (informative) Software implementation examples . 12
A.1 Region of interest-based adaptive QP . 12
A.1.1 General . 12
A.1.2 Analyser . 12
A.1.3 Encoder . 12
A.2 Pre-processing of foreground and background . 12
A.2.1 General . 12
A.2.2 Pre-analysis . 13
A.2.3 Pre-processing . 13
A.3 Enhancement post-filtering . 13
A.3.1 General . 13
A.3.2 Network structure . 14
A.4 Temporal resampling . 15
A.4.1 General . 15
A.4.2 Pre-analysis . 15
A.4.2.1 Frame-level MIOU M_f . 15
A.4.2.2 Sequence-level MIOU M_s . 16
A.4.2.3 Adaptive temporal resampling ratio decision . 17
A.4.3 Down-sampling . 17
A.4.4 Up-sampling . 17
Annex B (informative) Combined software implementation examples . 18
Bibliography . 19
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.

The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).

ISO and IEC draw attention to the possibility that the implementation of this document may involve the use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not received notice of (a) patent(s) which may be required to implement this document. However, implementers are cautioned that this may not represent the latest information, which may be obtained from the patent database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held responsible for identifying any or all such patent rights.

Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.

For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html. In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information, in collaboration with
ITU-T SG21, Technologies for multimedia, content delivery and cable television. The corresponding ITU-T SG21
provisional work item name is H.Sup.MACVC.
A list of all parts in the ISO/IEC 23888 series can be found on the ISO and IEC websites.

Any feedback or questions on this document should be directed to the user's national standards body. A complete listing of these bodies can be found at www.iso.org/members.html and www.iec.ch/national-committees.
Information technology — Artificial intelligence for multimedia —
Part 3:
Optimization of encoders and receiving systems for machine analysis
of coded video content
1 Scope
This document provides a summary of optimizations of encoders and receiving systems for conducting machine analysis tasks on coded video content. It gives a concept-level overview of recent practices and comments on technical aspects and cautions to be taken when interpreting the results. This document describes technologies that have recently been studied and have demonstrated benefits in coding efficiency for some machine analysis tasks.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitutes
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
Rec. ITU-T H.266 | ISO/IEC 23090-3, Versatile video coding
Rec. ITU-T H.265 | ISO/IEC 23008-2, High efficiency video coding
Rec. ITU-T H.264 | ISO/IEC 14496-10, Advanced video coding
Rec. ITU-T H.274 | ISO/IEC 23002-7, Versatile supplemental enhancement information messages for coded video
bitstreams
3 Terms and definitions
For the purposes of this document, the terms and definitions given in Rec. ITU-T H.266 | ISO/IEC 23090-3,
Rec. ITU-T H.265 | ISO/IEC 23008-2, Rec. ITU-T H.264 | ISO/IEC 14496-10, Rec. ITU-T H.274 | ISO/IEC
23002-7 and the following apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
machine consumption
applying a machine analysis task such as object detection, segmentation or object tracking
4 Abbreviated terms
AVC Advanced Video Coding (Rec. ITU-T H.264 | ISO/IEC 14496-10)
BD-rate Bjøntegaard delta bit rate
CTU coding tree unit
HEVC High Efficiency Video Coding (Rec. ITU-T H.265 | ISO/IEC 23008-2)
mAP mean average precision
MOTA multiple object tracking accuracy
NNPF neural-network post-filter
NNPFA neural-network post-filter activation
NNPFC neural-network post-filter characteristics
OMI object mask information
PSNR peak signal-to-noise ratio
QP quantization parameter
RoI region of interest
RPR reference picture resampling
SEI supplemental enhancement information
TID temporal identifier
URI uniform resource identifier
VSEI Versatile Supplemental Enhancement Information Messages for Coded Video Bitstreams
(Rec. ITU-T H.274 | ISO/IEC 23002-7)
VVC Versatile Video Coding (Rec. ITU-T H.266 | ISO/IEC 23090-3)
Y′CBCR colour space representation commonly used for video/image distribution, also written as YUV
YUV colour space representation commonly used for video/image distribution, also written as Y′CBCR
5 Overview
5.1 General overview
Most video processing systems consist of four main processing steps, as shown in Figure 1. This
document describes technologies for optimization of encoders and receiving systems, such as pre-processing,
encoding and post-processing for machine consumption. The decoding process, on the other hand, is fully
specified in the respective Rec. ITU-T H.266 | ISO/IEC 23090-3 Versatile Video Coding (VVC), Rec. ITU-T
H.265 | ISO/IEC 23008-2 High Efficiency Video Coding (HEVC) and Rec. ITU-T H.264 | ISO/IEC 14496-10
Advanced Video Coding (AVC) video coding standards, amongst others. Hence, the samples of the decoded
video are fully specified by the given input bitstream.
[input video] → [pre-processing] → [encoding] → [decoding] → [post-processing] → [machine consumption]

Figure 1 — General video coding and processing pipeline
An overview of the commonly used practices for evaluating encoder optimization technologies for machine consumption can be found in clause 6. Descriptions of pre-processing technologies can be found in clause 7. Encoder optimization technologies are described in clause 8 and post-processing technologies are described in clause 9. Metadata that is useful for machine consumption is described in clause 10.
It is noted that depending on specific use cases, the technologies outlined in this document can be
implemented individually or in combination to optimize the machine consumption performance within the
constraints of the system capabilities. When employing multiple technologies simultaneously, it is important
to consider that certain combinations can be impractical or infeasible due to inherent methodological
constraints. Tested combinations of two or more technologies are listed in Annex B.
5.2 Use cases and applications
There are various use cases and applications using encoded video that benefit from optimizing both encoders
and receiving systems for machine consumption. Some of them are highlighted below:
— Surveillance: A considerable amount of bandwidth is needed to transmit a high volume of data generated
by a large number of sensors. The number of sensors also has an impact on the computational load on the
server side, as having to analyse the input from many sensors can become a huge burden. This can be
eased by distributing the computation to the front-end devices.
— Intelligent transportation: A key aspect for vehicular applications is interoperability between not only
vehicles from different vendors, but also the infrastructures of various locations. Connected vehicles are
expected to play a significant role in future transport systems and the tremendous number of vehicles
emphasizes the need of reducing the amount of data being transmitted between them to avoid overloading
the network.
— Intelligent industry: One example in this area is visual content analysis, checking and screening. Machine
automation is desirable for increasing efficiency.
A more detailed description of use cases can be found in ISO/IEC TR 23888-1 [1].
6 Evaluation methodology
6.1 General
A set of assessment metrics is used for the evaluation of encoder and receiving system optimization technologies for machine consumption. An overview evaluation framework is shown in Figure 2.
Here the input video is encoded to generate a bitstream. This bitstream is then decoded, and the decoded video
is used for machine consumption. In this diagram, the “encoder” includes both pre-processing and encoding
steps, and the “decoder” includes both decoding and post-processing steps, as shown in Figure 1.
[input video] → [encoder] → [decoder] → [output video], where the YUV PSNR calculation compares the input and output videos, the bit rate is measured on the bitstream between encoder and decoder, and the task performance calculation compares the result of machine consumption on the output video against the ground truth

Figure 2 — Evaluation framework and points of measurement
6.2 Bit rate
The bit rate is determined based on the encoded bitstream and parameters of the input video such as frame
rate and the total number of frames. The following formula is applied to calculate the bit rate:

bitrate = (8 ∗ fileSizeInBytes ∗ fps) / (numFrames ∗ 1000)
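As an illustration, the bit rate formula above can be written as a small helper; the function and variable names are illustrative, not taken from this document. The 1000 divisor yields a result in kbit/s.

```python
def bitrate_kbps(file_size_in_bytes: int, fps: float, num_frames: int) -> float:
    """Bit rate of a bitstream in kbit/s, following the formula above:
    8 * fileSizeInBytes * fps / (numFrames * 1000)."""
    return 8 * file_size_in_bytes * fps / (num_frames * 1000)

# e.g. a 1 500 000-byte bitstream holding 300 frames captured at 60 fps
rate = bitrate_kbps(1_500_000, 60, 300)  # 2400.0 kbit/s
```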
6.3 PSNR

Encoding for video distribution is ordinarily performed in the Y′CBCR domain (nicknamed YUV herein for brevity and ease of typing). For standard-dynamic range video, the distortion metric primarily used in the video coding standardization community has been the peak signal-to-noise ratio (PSNR). The following two formulae are used to calculate PSNR:

MSE = (1 / (m ∗ n)) ∗ Σ(i = 0..n−1) Σ(j = 0..m−1) (x(i, j) − y(i, j))²

PSNR = 10 ∗ log10( (255 ∗ 2^(bitdepth−8))² / MSE )
where x(i, j) is the decoded sample value of a certain colour component, y(i, j) is the corresponding original sample value, and bitdepth is the bit depth of the input video. It is a common practice to calculate PSNR values for each of the colour components Y, U and V.
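The two formulae above can be sketched as a minimal per-component implementation; for a full video the computation is typically repeated per frame and per colour component.

```python
import numpy as np

def psnr(decoded: np.ndarray, original: np.ndarray, bitdepth: int = 8) -> float:
    """PSNR of one colour component, per the MSE and PSNR formulae above."""
    mse = np.mean((decoded.astype(np.float64) - original.astype(np.float64)) ** 2)
    peak = 255 * 2 ** (bitdepth - 8)  # maximum sample value for the bit depth
    return 10 * np.log10(peak ** 2 / mse)  # undefined (infinite) for mse == 0
```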
6.4 mAP

The performance of object detection and segmentation tasks is measured by mean average precision (mAP). This metric indicates what percentage of objects are correctly identified by having sufficient overlap between the detected object and the ground truth as well as being assigned to the correct object class. Then the share of correctly identified objects for each class is determined, and finally the score for each class is averaged. The calculation of mAP is as follows:

mAP = (1 / numOverlaps) ∗ Σ(i = 1..numOverlaps) [ (1 / numClasses) ∗ Σ(j = 1..numClasses) ( correctObjects_j / totalObjects_j ) ]
Some commonly used variants of this metric are:
— mAP@0.5: An object is counted as correctly identified if the Intersection over Union (IoU) between the
detected bounding box and the ground truth bounding box is at least 0.5. Sometimes this variant of the
mAP metric is also referred to as mAP50.
— mAP@[0.5:0.05:0.95]: In this variant a total of ten mAP scores with increasing IoU thresholds are calculated. The IoU threshold starts at 0.5 and increases by 0.05 after each iteration, until it reaches the upper bound of 0.95. Once all ten scores are determined, the average of these scores is calculated to produce the final mAP.
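A simplified sketch of the mAP@[0.5:0.05:0.95] sweep is given below. It is illustrative only: actual evaluators integrate a precision-recall curve per class, whereas this sketch merely counts, at each IoU threshold, the share of ground-truth boxes matched by a same-class detection.

```python
def iou(a, b):
    """Intersection over Union of two (x0, y0, x1, y1) bounding boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def map_sweep(detections, ground_truth):
    """Average over the ten IoU thresholds 0.5, 0.55, ..., 0.95 of the
    share of ground-truth objects matched by a same-class detection.
    Both inputs are lists of (class_name, box) pairs."""
    thresholds = [(50 + 5 * k) / 100 for k in range(10)]
    scores = []
    for t in thresholds:
        hits = sum(
            any(d_cls == cls and iou(box, d_box) >= t
                for d_cls, d_box in detections)
            for cls, box in ground_truth
        )
        scores.append(hits / len(ground_truth))
    return sum(scores) / len(scores)
```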
6.5 MOTA

Object tracking performance is measured by multiple object tracking accuracy (MOTA). This metric accounts for all object configuration errors made by the tracker, i.e. false positives, misses (false negatives) and mismatches, over all frames. The calculation of MOTA is as follows:

MOTA = 1 − ( Σ_t (FN_t + FP_t + mme_t) / Σ_t g_t )

where FN_t, FP_t, mme_t and g_t are the number of false negatives, the number of false positives, the number of mismatch errors (identity switches between two successive frames) and the number of objects in the ground truth, respectively, at time t.
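The MOTA formula translates directly into code; the input layout (one error tuple and one ground-truth count per frame) is an assumption made for illustration.

```python
def mota(per_frame_errors, per_frame_gt_counts):
    """MOTA = 1 - sum_t(FN_t + FP_t + mme_t) / sum_t g_t, where each entry
    of per_frame_errors is the (FN_t, FP_t, mme_t) tuple for frame t and
    per_frame_gt_counts holds the ground-truth object counts g_t."""
    errors = sum(fn + fp + mme for fn, fp, mme in per_frame_errors)
    return 1 - errors / sum(per_frame_gt_counts)

# two frames with 5 ground-truth objects each: one miss, one false
# positive and one identity switch in total
score = mota([(1, 0, 0), (0, 1, 1)], [5, 5])  # ~0.7
```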
6.6 BD-rate

To compare the performance of a technology against the reference, the well-known Bjøntegaard delta rate (BD-rate) metric [2] is used. Instead of using PSNR as the distortion metric as is typical for human vision performance evaluation, machine consumption distortion metrics, e.g., mAP and MOTA, are used in machine BD-rate calculation.
The distortion measurement of machine consumption (e.g., mAP and MOTA) can sometimes be non-
monotonic to the bit rate due to the characteristics of the machine analysis task and possible limitations of
machine networks. Polynomial curve fitting is applied to ensure rate-distortion monotonicity and thus valid
BD-rate calculation.
f(x) = b0 ∗ x³ + b1 ∗ x² + b2 ∗ x + b3

For the polynomial function given in the above formula, b0, b1, b2 and b3 are coefficients of the function, x is the input (bit rate) and f(x) is the output (quality). The following two constraints are invoked to ensure its monotonicity and concavity:

— the first order derivative of the polynomial, shown below, is positive in the given x range:

f′(x) = 3 ∗ b0 ∗ x² + 2 ∗ b1 ∗ x + b2

— the second order derivative of the polynomial, shown below, is negative in the given x range:

f″(x) = 6 ∗ b0 ∗ x + 2 ∗ b1

Parameters (b0, b1, b2, b3) in the polynomial function are solved by sequential least squares programming (SLSQP) and applied to curve fitting.
NOTE It is a common practice to have the minimum quality value of the fitted curve no smaller than the minimum quality value of the original curve and the maximum quality value of the fitted curve no greater than the maximum quality value of the original curve.
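The fitting step can be sketched as follows. The rate-quality points are hypothetical, and the sketch uses an exact cubic interpolation through four points followed by a check of the two derivative constraints; with more measurement points, the constrained least-squares problem is solved with SLSQP as described above.

```python
import numpy as np

# hypothetical rate (Mbit/s) vs. task-quality (e.g. mAP in %) measurements
rates = np.array([1.0, 2.0, 3.0, 4.0])
quality = np.array([50.0, 60.0, 65.0, 67.0])

# cubic f(x) = b0*x^3 + b1*x^2 + b2*x + b3 (exact through four points)
f = np.poly1d(np.polyfit(rates, quality, 3))

# verify the monotonicity and concavity constraints on the fitted curve
xs = np.linspace(rates[0], rates[-1], 200)
monotone = bool(np.all(np.polyder(f, 1)(xs) > 0))  # f'(x) > 0
concave = bool(np.all(np.polyder(f, 2)(xs) < 0))   # f''(x) < 0
```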
7 Pre-processing technologies
7.1 Region of interest-based methods
One often-used optimization method is region of interest (RoI)-based coding. Here the input video is analysed
in some way and then the encoder can optimize the encoding towards machine consumption based on the
analysis results. The analysis can be done using various methods, e.g., neural networks. An example of a
pipeline that can be used for RoI-based approaches is shown in Figure 3.
[input video] → [analysis] → [encoder] → [decoder] → [machine consumption]

Figure 3 — Pipeline for RoI-based systems
In one implementation example, an object detection network is used to analyse the input data. This network
produces a list of objects that can be found in the current picture. The information used to describe each object
includes the index of the picture in which the object can be found and the position of the object in the picture.
Some networks can provide more information than this and the encoder can choose to select a subset of all
objects by filtering based on, for example, the class of an object or the estimated likelihood of an object of the
described class being at the described position. In a similar approach, a segmentation network can be used
where the object is not described by a bounding box but by a segmentation mask indicating exactly which
samples the segmentation network estimates belonging to the object. The list produced during the analysis
can then be used by the encoder, for example, to separate foreground and background with the purpose of
encoding the foreground at a better quality and the background at a lower quality. One such encoding method
is described in 8.1. In this example, the analysis does not change the input video, but directly forwards it
to the encoder.
In other RoI-based methods, the pre-processing changes the input video, for example, by applying different
pre-processing methods on the foreground and background, or specific parts of the video, such as subsampling
the background area of the input video.
In one implementation example, an object segmentation network is first used to analyse the input data. The
network produces a list of objects segmented with the object shapes in the current picture. The object shapes
and positions could be represented, for example, by segmentation masks. More information such as the object
class or the estimated likelihood of the object segment could also be provided by the network to identify the
objects. Based on the object information, it is possible to derive spatial complexity and temporal complexity
for the different segments, and then RoI-based pre-processing of the input video can be adapted based on the
spatial and temporal complexity. The spatial complexity here indicates the averaged object size which can be
calculated by dividing the percentage of the area covered by the objects by the total number of the objects.
Temporal complexity indicates the content changes between two pictures which can be calculated by various
methods, for example, by taking the mean absolute difference of the collocated samples in two pictures.
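The two complexity measures described above can be sketched as follows, assuming binary segmentation masks as input (all names are illustrative):

```python
import numpy as np

def spatial_complexity(masks):
    """Averaged object size: the percentage of the picture area covered by
    the objects, divided by the total number of objects."""
    covered = np.zeros_like(masks[0], dtype=bool)
    for m in masks:  # one boolean mask per segmented object
        covered |= m
    return 100.0 * covered.mean() / len(masks)

def temporal_complexity(prev_pic, cur_pic):
    """Content change between two pictures: the mean absolute difference of
    collocated samples."""
    return float(np.mean(np.abs(cur_pic.astype(np.float64)
                                - prev_pic.astype(np.float64))))
```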
7.2 Foreground and background processing
After pre-analysis that determines the foreground and background areas, one straightforward way to handle
the background that is less critical to machine consumption is to “eliminate” it by setting the corresponding
sample values to a constant value. However, some portions of the background samples, for example those
immediately surrounding the foreground area, could still be useful for machine consumption. Therefore, the
background regions relevant to machine consumption can be preserved to a certain extent with low-pass
filtering, such as a Gaussian filter with a sliding window, where the window size can be set based on the input
video resolution.
Moreover, extracted features can reveal information about the importance of different regions of the input video. In other words, compared with a binary classification into foreground and background, these extracted features can provide importance information at a finer granularity. Therefore, such extracted features can be used to determine how to process foreground and background differently. In one implementation example, a feature map is extracted by a feature extraction network and, based on the feature map, the parameters of a Gaussian smoothing filter are adapted; the adaptive filtering is then applied to the picture. As the foreground and background areas have different features, and even within each area different regions can have different features, the Gaussian smoothing filter can be controlled at a finer granularity, which finally results in more efficient pre-processing.
An implementation example with more detailed description can be found in A.2.
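A minimal sketch of mask-based foreground/background pre-processing in this spirit is shown below: foreground samples are kept untouched while the background is smoothed with a Gaussian filter. The fixed sigma is an assumption for illustration; as described above, the filter parameters can instead be adapted from an extracted feature map.

```python
import numpy as np

def gaussian_kernel(sigma: float) -> np.ndarray:
    """Normalized 1D Gaussian kernel with radius 3*sigma."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur with edge padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def smooth_background(frame: np.ndarray, fg_mask: np.ndarray, sigma: float = 2.0):
    """Keep foreground samples, replace background samples by blurred ones."""
    return np.where(fg_mask, frame, gaussian_blur(frame.astype(np.float64), sigma))
```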
7.3 Temporal subsampling
In some use cases, for example when the frame rate is high, a way to reduce the bit rate without a strong
negative impact on the machine consumption performance can be to skip certain frames and encode the video
at a lower frame rate. One example is to remove every other frame from the input video and encode the video
at half frame rate. This can be done in a dynamic manner, for example by evaluating the motion between two
or more frames and if there is only little motion, a frame can be removed. In some cases, if the receiving system
requires a specific frame rate, a corresponding post-processing technology that up-samples the video to the
full frame rate can be applied.
An implementation example with more detailed description can be found in A.4.
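The dynamic frame-dropping idea above can be sketched as follows; the mean-absolute-difference motion measure and the threshold value are illustrative assumptions:

```python
import numpy as np

def drop_low_motion_frames(frames, threshold=2.0):
    """Keep the first frame; drop each subsequent frame whose mean absolute
    difference to the last kept frame falls below the motion threshold."""
    kept = [frames[0]]
    for f in frames[1:]:
        motion = np.mean(np.abs(f.astype(np.float64)
                                - kept[-1].astype(np.float64)))
        if motion >= threshold:
            kept.append(f)
    return kept
```

When a fixed half frame rate is required instead, simply keeping every other frame (`frames[::2]`) corresponds to the static example in the text.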
7.4 Spatial subsampling
If the analysis of a video shows that it contains primarily large objects, one way to improve the BD-rate
performance is to perform spatial subsampling on the input video. This will result in fewer samples in the
subsampled frames to be encoded, and thus likely lead to faster encoding and bit rate savings. The optimum
downscaling factor is content dependent. It is possible that the machine consumption performance drops
when subsampling is too aggressive. Therefore, it is advisable to apply spatial subsampling adaptively, for
example, based on the characteristics of the video content (such as the averaged object spatial area and the
number of objects) and the target bit rate. Moreover, the spatial subsampling can also be dependent on the
picture types. For example, depending on whether the input video is captured by a regular camera as natural
scenes or by an infrared sensor as thermal images, different spatial subsampling methods can be
applied.
One tool specified in the Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC) standard that can be used for the purpose
of spatial resampling is called reference picture resampling (RPR). This tool allows the encoder to choose to
encode some pictures of the input video at different resolutions. For example, the resolution can be changed
based on the analysis of keyframes without encoding an intra picture at the changed resolution, because inter
predictions can be made from all allowed reference pictures regardless of their resolution.
In one implementation example, the RPR tool can be used at the frame level, where the input video can be
analysed as described in 7.1. In this case, the unmodified input video is forwarded to the encoder with a
scale factor list generated by the analyser. Specifically, an object detection network is used to analyse the input
video in both full resolution and at least one spatially subsampled resolution. This network produces a list of
objects with object information that can be found in the current picture and the spatially resampled picture.
The object information describes the position and size of the detected objects for both the current picture and
the spatially resampled picture. Based on the object information, an object occupancy distribution (i.e., the
distribution of the ratio of object size to the corresponding picture resolution) can be generated for both the
current picture and the spatially resampled picture. The scale factor can be derived for the current picture by
comparing the correlation of object occupancy distributions, and then the list of all scale factors is passed on
to the encoder for utilization of the RPR tool.
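The scale-factor decision can be sketched as below. The similarity measure is only a stand-in for the correlation comparison described above, and all names, thresholds and the single 2x candidate are illustrative assumptions:

```python
def occupancy(object_sizes, pic_area):
    """Sorted ratios of object size to picture area for one analysed picture."""
    return sorted(w * h / pic_area for (w, h) in object_sizes)

def rpr_scale_factor(full_objs, sub_objs, full_area, subsample=2.0,
                     similarity_threshold=0.9):
    """Choose the subsampled resolution for a picture when the object
    occupancy distributions at full and subsampled resolution stay similar,
    i.e. the objects survive subsampling; otherwise keep full resolution.
    Object sizes are (width, height) pairs with non-zero area."""
    a = occupancy(full_objs, full_area)
    b = occupancy(sub_objs, full_area / subsample ** 2)
    if not a or len(a) != len(b):
        return 1.0  # detections changed: stay at full resolution
    sim = sum(min(x, y) / max(x, y) for x, y in zip(a, b)) / len(a)
    return subsample if sim >= similarity_threshold else 1.0
```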
7.5 Noise filtering
Under some circumstances, various types of noise can be present in the video content. Denoising filters can be
applied on such content to reduce unnecessary bit rate increases and avoid machine consumption
performance degradation by filtering out the undesirable noise from the video content while preserving
information important to machine consumption. Various types of denoising filters can be applied according to
the characteristics of the noise. The strength of the filter can be adjusted based on the noise, and the filter can
adaptively be enabled and disabled for either an entire picture or sequence, or only a part thereof.
It is noted that some denoising filters, such as bilateral or anisotropic diffusion filters, can preserve local details during denoising. For applications that benefit from such local details being preserved, filtering the entire picture with the same strength can be detrimental.
8 Encoding technologies
8.1 RoI-based quantization parameter adaption
One method that is available in many video coding standards is adaptive quantization parameter (QP). Here
the encoder can change the QP value at a sub-picture level, for example the coding tree unit (CTU) level, to
optimize the encoding for the application. Due to this versatility, adaptive QP can be used in many different
use cases to improve performance.
The decision on where to change the QP value and by how much can be made by the encoder based on an
analysis of the input video. Another option is to utilize the output of an external analyser such as described in
7.1. In this case, the encoder receives information about the positions and sizes of objects in each frame to
make a differentiation between foreground and background.
One option is to use a base QP value of the picture for areas that contain objects, i.e., foreground areas, and an
increased QP value for background areas, resulting in fewer bits being used to encode the background. As the
background is usually not critical to machine consumption, this is a straightforward way to reduce the bit
rate without affecting the machine consumption performance. As an extension, it can also be beneficial to
encode large objects with slightly higher QP values. As it is generally easier to detect larger objects, reducing
the bit rate for large objects usually does not reduce the performance of machine consumption.
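The QP assignment described in the two paragraphs above can be sketched as a per-CTU QP map; the base QP, the offsets and the large-object threshold are illustrative choices, not values from this document:

```python
def ctu_qp_map(objects, pic_w, pic_h, ctu=128, base_qp=32,
               bg_offset=8, large_obj_offset=2, large_area=128 * 128):
    """Per-CTU QP map: base QP for CTUs overlapped by an object (with a
    small extra offset for large objects), increased QP elsewhere."""
    ceil_div = lambda a, b: -(-a // b)
    cols, rows = ceil_div(pic_w, ctu), ceil_div(pic_h, ctu)
    qp = [[base_qp + bg_offset] * cols for _ in range(rows)]
    for (x0, y0, x1, y1) in objects:  # bounding boxes in sample units
        off = large_obj_offset if (x1 - x0) * (y1 - y0) >= large_area else 0
        for r in range(y0 // ctu, ceil_div(y1, ctu)):
            for c in range(x0 // ctu, ceil_div(x1, ctu)):
                qp[r][c] = min(qp[r][c], base_qp + off)
    return qp
```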
However, it is noted that when utilizing the analysis, complexity can be traded against bit rate. For example, if
a light-weight neural network is used to perform the analysis, it is possible that not all relevant objects have
been found and thus it can be detrimental to reduce the quality for the background too much as there
are possibly objects that the initial analysis missed. These objects are possibly still important for machine
consumption and if the background is encoded in sufficient quality, the machine consumption network has a
chance of detecting objects in the background even if they are coded in lower quality than the foreground area.
On the other hand, if the encoding system has a lot of resources, it can employ a neural network of higher
complexity for the analysis. With a better and more certain analysis, the bit rate for the background can be
reduced more as there are likely fewer objects that have been missed in the initial analysis.
A more detailed description with a link to an implementation can be found in A.1.
8.2 Quantization step adjustment for temporal layers
It is a common practice that the encoder places different pictures on different temporal layers, i.e., assigning
different temporal identifiers (TID). This has the purpose of creating hierarchical structures that indicate from
which previously coded pictures the encoder can create predictions for the current picture. One aspect of these
hierarchical structures is that pictures cannot be referenced by other pictures on a higher temporal layer. This
way, it is not necessary to store every decoded picture in the decoded picture buffer. Another aspect of the
hierarchical structure is that pictures on higher temporal layers can be encoded with higher QP values, i.e.,
lower quality, as they will not be used, or less often used, as references by other pictures. An example of a
hierarchical structure is shown in Figure 4. The display order is left-to-right, and the numbers specify
the order of the coded pictures in the bitstream.
© ISO/IEC 2025 – All rights reserved
ISO/IEC DTR 23888-3:(en)
Figure 4 — An example hierarchical referencing structure in a Rec. ITU-T H.266 |
ISO/IEC 23090-3 (VVC) random access configuration bitstream
This hierarchy of pictures can be exploited by encoding pictures on higher temporal layers using higher QP
values, i.e., reducing the number of bits spent on these pictures in total. This is also done in the
common test conditions for standard dynamic range for Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC). In the
use case of coding video content for machine consumption, this characteristic can be exploited further by
increasing the QP value for pictures on higher temporal layers even more. Taking advantage of motion
compensation, many bits can be saved when compressing pictures in high temporal layers while these pictures
are still able to be reconstructed with high quality. Compressing the highest temporal layer at a high QP
can be seen as being similar in spirit to reducing the frame rate as discussed in 7.3. As an example, lowering
the bit rate substantially on every odd-numbered frame can be seen as a step towards completely removing
them.
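As an illustration, the layer-dependent QP assignment described above can be sketched as follows. The base QP, the per-layer offsets and the extra offset for the highest layer are hypothetical values chosen only for this sketch; neither this document nor the VVC common test conditions prescribe them:

```python
# Hypothetical base QP and per-temporal-layer QP offsets, for illustration only.
BASE_QP = 32
TID_QP_OFFSET = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}   # larger offset on higher layers
EXTRA_MACHINE_OFFSET = 2   # further increase on the highest layer for machine use

def picture_qp(tid, max_tid=4):
    """QP for a picture on temporal layer `tid` (TID as described above)."""
    qp = BASE_QP + TID_QP_OFFSET[tid]
    if tid == max_tid:
        # Exploit the hierarchy further: the highest layer is never referenced,
        # so its quality can be reduced the most.
        qp += EXTRA_MACHINE_OFFSET
    return qp
```

In this sketch, pictures on TID 0 are coded at the base QP, while pictures on the highest layer receive the largest increase, mirroring the idea that they are referenced least (or not at all).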
8.3 Chroma QP offset setting
Many machine analysis methods are performed using 4:4:4 colour format input data. Therefore, encoding in
4:2:0 colour format, which has lower chroma resolution than 4:4:4 colour format, can sometimes have a
negative impact on the machine analysis performance. This can sometimes be compensated for by using a
negative chroma QP offset, which increases the quality of the low-resolution chroma components.
9 Post-processing technologies
9.1 Temporal resampling
When the receiving system requires a specific frame rate that is different from the frame rate of the decoded
sequences, temporal resampling can be applied on the decoded video by utilizing conventional temporal filters
(e.g., motion compensated interpolation filters) or neural network-based filters, or just frame repetition. An
implementation example with more detailed description can be found in A.4.
...
FINAL DRAFT
Technical Report
ISO/IEC DTR 23888-3.2
ISO/IEC JTC 1/SC 29
Secretariat: JISC
Voting begins on: 2026-03-31
Voting terminates on: 2026-04-28
Information technology — Artificial intelligence for multimedia —
Part 3: Optimization of encoders and receiving systems for machine analysis of coded video content
Technologies de l'information — Intelligence artificielle pour le multimédia —
Partie 3: Optimisation des codeurs et des systèmes de réception pour l'analyse automatique de contenus vidéo codés
RECIPIENTS OF THIS DRAFT ARE INVITED TO SUBMIT, WITH THEIR COMMENTS, NOTIFICATION OF ANY RELEVANT PATENT RIGHTS OF WHICH THEY ARE AWARE AND TO PROVIDE SUPPORTING DOCUMENTATION.
IN ADDITION TO THEIR EVALUATION AS BEING ACCEPTABLE FOR INDUSTRIAL, TECHNOLOGICAL, COMMERCIAL AND USER PURPOSES, DRAFT INTERNATIONAL STANDARDS MAY ON OCCASION HAVE TO BE CONSIDERED IN THE LIGHT OF THEIR POTENTIAL TO BECOME STANDARDS TO WHICH REFERENCE MAY BE MADE IN NATIONAL REGULATIONS.
Reference number
ISO/IEC DTR 23888-3.2:2026(en) © ISO/IEC 2026
© ISO/IEC 2026
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviated terms
5 Overview
5.1 General overview
5.2 Use cases and applications
6 Evaluation methodology
6.1 General
6.2 Bit rate
6.3 PSNR
6.4 mAP
6.5 MOTA
6.6 BD-rate
7 Pre-processing technologies
7.1 Region of interest-based methods
7.2 Foreground and background processing
7.3 Temporal subsampling
7.4 Spatial subsampling
7.5 Noise filtering
8 Encoding technologies
8.1 RoI-based quantization parameter adaption
8.2 Quantization step adjustment for temporal layers
8.3 Chroma QP offset setting
9 Post-processing technologies
9.1 Temporal resampling
9.2 Spatial resampling
9.3 Enhancement post-filtering
10 Metadata
10.1 General
10.2 Neural-network post-filter SEI message
10.3 Annotated regions SEI message
10.4 Object mask information SEI message
10.5 Encoder optimization information SEI message
10.6 Packed regions information SEI message
Annex A (informative) Software implementation examples
Annex B (informative) Combined software implementation examples
Bibliography
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are
members of ISO or IEC participate in the development of International Standards through technical
committees established by the respective organization to deal with particular fields of technical activity.
ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations,
governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types
of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/
IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the
use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any
claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not
received notice of (a) patent(s) which may be required to implement this document. However, implementers
are cautioned that this may not represent the latest information, which may be obtained from the patent
database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held
responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information, in collaboration with
ITU-T (as ITU-T H.Sup.MACVC).
A list of all parts in the ISO/IEC 23888 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards
body. A complete listing of these bodies can be found at www.iso.org/members.html and
www.iec.ch/national-committees.
Information technology — Artificial intelligence for
multimedia —
Part 3:
Optimization of encoders and receiving systems for machine
analysis of coded video content
1 Scope
This document provides information about optimizations for encoders and receiving systems for conducting
machine analysis tasks on coded video content. It provides a concept-level overview of recent practices
and provides comments on technical aspects and cautions to be taken when interpreting the results. This
document describes technologies that have recently been studied and have demonstrated benefits to coding
efficiency for some machine analysis tasks.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitute
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
Rec. ITU-T H.266 | ISO/IEC 23090-3, Versatile video coding
Rec. ITU-T H.265 | ISO/IEC 23008-2, High efficiency video coding
Rec. ITU-T H.264 | ISO/IEC 14496-10, Advanced video coding
Rec. ITU-T H.274 | ISO/IEC 23002-7, Versatile supplemental enhancement information messages for coded video
bitstreams
3 Terms and definitions
For the purposes of this document, the terms and definitions given in Rec. ITU-T H.266 | ISO/IEC 23090-3,
Rec. ITU-T H.265 | ISO/IEC 23008-2, Rec. ITU-T H.264 | ISO/IEC 14496-10, Rec. ITU-T H.274 | ISO/IEC 23002-7
and the following apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https:// www .iso .org/ obp
— IEC Electropedia: available at https:// www .electropedia .org/
3.1
machine consumption
operation of a machine analysis task such as object detection, segmentation or object tracking
4 Abbreviated terms
AVC Advanced Video Coding (Rec. ITU-T H.264 | ISO/IEC 14496-10)
BD-rate Bjøntegaard delta bit rate
CTU coding tree unit
HEVC High Efficiency Video Coding (Rec. ITU-T H.265 | ISO/IEC 23008-2)
IoU intersection over union
mAP mean average precision
MOTA multiple object tracking accuracy
NNPF neural-network post-filter
NNPFA neural-network post-filter activation
NNPFC neural-network post-filter characteristics
OMI object mask information
PRI packed regions information
PSNR peak signal-to-noise ratio
QP quantization parameter
RoI region of interest
RPR reference picture resampling
SEI supplemental enhancement information
TID temporal identifier
URI uniform resource identifier
VSEI Versatile Supplemental Enhancement Information Messages for Coded Video Bitstreams (Rec.
ITU-T H.274 | ISO/IEC 23002-7)
VTM Reference software for versatile video coding (Rec. ITU-T H.266.2 | ISO/IEC 23090-16:2025)
VVC Versatile Video Coding (Rec. ITU-T H.266 | ISO/IEC 23090-3)
Y′CBCR colour space representation commonly used for video/image distribution, also written as YUV
YUV colour space representation commonly used for video/image distribution, also written as Y′CBCR
5 Overview
5.1 General overview
Most video processing systems consist of four main processing steps, as shown in Figure 1. This document
describes technologies for optimization of encoders and receiving systems, such as pre-processing, encoding
and post-processing for machine consumption. The decoding process, on the other hand, is fully specified
in the respective Rec. ITU-T H.266 | ISO/IEC 23090-3 Versatile Video Coding (VVC), Rec. ITU-T H.265 |
ISO/IEC 23008-2 High Efficiency Video Coding (HEVC) and Rec. ITU-T H.264 | ISO/IEC 14496-10 Advanced
Video Coding (AVC) video coding standards, amongst others. Hence, the samples of the decoded video are
fully specified by the given input bitstream.
Figure 1 — General video coding and processing pipeline
An overview of the commonly used practices for evaluating encoder optimization technologies for machine
consumption can be found in 6. Descriptions of pre-processing technologies can be found in 7. Encoder
optimization technologies are described in 8 and post-processing technologies are described in 9. Metadata
that is useful for machine consumption is described in 10.
It is noted that depending on specific use cases, the technologies outlined in this document can be
implemented individually or in combination to optimize the machine consumption performance within the
constraints of the system capabilities. When employing multiple technologies simultaneously, it is important
to consider that certain combinations can be impractical or infeasible due to inherent methodological
constraints. Tested technologies and their combinations are listed in Annex A and Annex B, respectively.
5.2 Use cases and applications
There are various use cases and applications using encoded video that benefit from optimizing both encoders
and receiving systems for machine consumption. Some of them are highlighted below:
— Surveillance: A considerable amount of bandwidth is needed to transmit a high volume of data generated
by a large number of sensors. The number of sensors also has an impact on the computational load on
the server side, as having to analyse the input from many sensors can become a huge burden. This can be
eased by distributing the computation to the front-end devices.
— Intelligent transportation: A key aspect for vehicular applications is interoperability between not only
vehicles from different vendors, but also the infrastructures of various locations. Connected vehicles
are expected to play a significant role in future transport systems and the tremendous number of
vehicles emphasizes the need of reducing the amount of data being transmitted between them to avoid
overloading the network.
— Intelligent industry: One example in this area is visual content analysis, checking and screening. Machine
automation is desirable for increasing efficiency.
A more detailed description of use cases can be found in ISO/IEC TR 23888-1 [1].
6 Evaluation methodology
6.1 General
A set of assessment metrics is used for the evaluation of encoder and receiving system optimization
technologies for machine consumption. An overview of the evaluation framework is shown in Figure 2. Here the
input video is encoded to generate a bitstream. This bitstream is then decoded, and the decoded video is
used for machine consumption. In this diagram, the “encoder” includes both pre-processing and encoding
steps, and the “decoder” includes both decoding and post-processing steps, as shown in Figure 1.
© ISO/IEC 2026 – All rights reserved
ISO/IEC DTR 23888-3.2:2026(en)
Figure 2 — Evaluation framework and points of measurement
6.2 Bit rate
The bit rate is determined based on the encoded bitstream and parameters of the input video such as frame
rate and the number of total frames. The following formula is applied to calculate the bit rate:
bitRate = (8 * fileSizeInBytes * fps) / (numFrames * 1000)
where fileSizeInBytes is the size of a file measured in number of bytes, fps is the number of frames per second
and numFrames is the total number of frames in the file.
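The formula above translates directly into code. This minimal sketch returns the bit rate in kbit/s:

```python
def bit_rate_kbps(file_size_in_bytes, fps, num_frames):
    """Bit rate in kbit/s: 8 * fileSizeInBytes * fps / (numFrames * 1000)."""
    return 8 * file_size_in_bytes * fps / (num_frames * 1000)

# e.g. a 1 500 000-byte bitstream holding 300 frames at 30 fps
print(bit_rate_kbps(1_500_000, 30, 300))  # → 1200.0
```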
6.3 PSNR
Encoding for video distribution is ordinarily performed in the Y′CBCR domain (nicknamed YUV herein for
brevity and ease of typing). For standard-dynamic-range video, the distortion metric primarily used in the
video coding standardization community has been peak signal to noise ratio (PSNR). The following two
formulae are used to calculate PSNR:
MSE = (1 / (m*n)) * Σ_{i=0}^{n−1} Σ_{j=0}^{m−1} ( x(i,j) − y(i,j) )²

PSNR = 10 * log10( (255 * 2^(bitDepth−8))² / MSE )
where x(i,j) is the decoded sample value of a certain colour component, y(i,j) is the corresponding original
sample value, and bitDepth is the bit depth of the input video. It is a common practice to calculate PSNR
values for each of the colour components Y, U and V. Information on how to interpret or combine PSNR values
of colour components can be found in ISO/IEC TR 23002-8 [2].
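As a sketch, the two formulae above can be computed for a single colour component as follows; the 2-D-list sample representation is an assumption for illustration, and the function assumes the decoded samples differ from the originals (an MSE of zero would make the PSNR infinite):

```python
import math

def psnr(decoded, original, bit_depth=8):
    """PSNR of one colour component per the formulae above.
    decoded and original are 2-D lists of equal size; assumes MSE > 0."""
    n = len(decoded)      # number of rows
    m = len(decoded[0])   # number of columns
    mse = sum((decoded[i][j] - original[i][j]) ** 2
              for i in range(n) for j in range(m)) / (m * n)
    peak = 255 * 2 ** (bit_depth - 8)  # maximum nominal sample value
    return 10 * math.log10(peak ** 2 / mse)
```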
6.4 mAP
The performance of object detection and segmentation tasks is measured by mean average precision (mAP),
described as follows.
For a given category of object, true positive TP(T_IoU), false positive FP(T_IoU), true negative TN(T_IoU) and
false negative FN(T_IoU) are defined with an IoU threshold T_IoU for that category, where true positive is a
case that an object is detected by the model and it is also a part of the ground truth; false positive is a case
that an object is detected by the model but it is not a part of the ground truth; true negative is a case that
an object is not detected by the model and it is also not a part of the ground truth; false negative is a case
that an object is not detected by the model but it is a part of the ground truth.
Then, recall at the given IoU threshold is defined as the proportion of all true positive cases in all true
positive and false negative cases corresponding to that IoU threshold:

recall(T_IoU) = TP(T_IoU) / ( TP(T_IoU) + FN(T_IoU) )
The precision at the given IoU threshold is the proportion of all true positive cases in all positive cases:

precision(T_IoU) = TP(T_IoU) / ( TP(T_IoU) + FP(T_IoU) )
It is possible for a neural network for detection or segmentation to obtain several pairs of recall and precision
values at different confidence levels. For each recall value r in the pairs, let p(r) take the maximum
precision value among all precision values for which the corresponding recall values are at or above the given
recall value r:

p(r) = max_{r′: r′ ≥ r} precision(r′)

Average precision (AP) of a given category of object is defined as the average value of p(r) over all recall
values provided by the neural network, which characterizes the area under the entire precision-recall curve.
The mAP is defined as the average of AP scores of all categories within a range of IoU thresholds.
Some commonly used variants of this metric are:
— mAP@0.5: An object is counted as correctly identified if the IoU between the detected bounding box and
the ground truth bounding box is at least 0.5. Sometimes this variant of the mAP metric is also referred
to as mAP50.
— mAP@[0.5:0.05:0.95]: In this variant a total of ten mAP scores with increasing IoU thresholds are
calculated. The IoU threshold starts at 0.5 and increases by 0.05 after each iteration, until it reaches
the upper bound value of 0.95. Once all ten scores are determined, the average of these scores is calculated
to produce the final mAP.
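The AP computation for one category can be sketched as below. The input format, a list of (recall, precision) pairs for one category at one IoU threshold, is an assumption for illustration; mAP is then the average of such AP values over categories and thresholds:

```python
def average_precision(recall_precision_pairs):
    """AP for one category: the mean over recall values r of p(r), where p(r)
    is the maximum precision among pairs whose recall is at or above r
    (the envelope of the precision-recall curve)."""
    pairs = list(recall_precision_pairs)
    p_of_r = [max(p2 for r2, p2 in pairs if r2 >= r) for r, _ in pairs]
    return sum(p_of_r) / len(p_of_r)
```

For example, the pairs (0.5, 0.8), (0.75, 0.6) and (1.0, 0.4) yield p(r) values 0.8, 0.6 and 0.4, and hence an AP of 0.6.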
6.5 MOTA
Object tracking performance is measured by multiple object tracking accuracy (MOTA). This metric accounts
for all object configuration errors made by the tracker: false positives, misses (false negatives) and
mismatches, over all frames. The calculation of MOTA is as follows:
MOTA = 1 − ( Σ_t ( FN_t + FP_t + mme_t ) ) / ( Σ_t g_t )

where FN_t, FP_t, mme_t and g_t are the number of false negatives, the number of false positives, the number
of mismatch errors (identity switching between two successive frames), and the number of objects in the
ground truth, respectively, at time t.
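The MOTA formula above can be computed from per-frame counts as in this minimal sketch, where each argument is a sequence indexed by time t:

```python
def mota(fn, fp, mme, g):
    """MOTA = 1 - sum_t(FN_t + FP_t + mme_t) / sum_t(g_t).
    fn, fp, mme and g are equal-length per-frame count sequences."""
    errors = sum(a + b + c for a, b, c in zip(fn, fp, mme))
    return 1 - errors / sum(g)
```

With one miss in the first frame and one false positive in the second, over two frames of five ground-truth objects each, MOTA is 1 − 2/10 = 0.8.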
6.6 BD-rate
To compare the performance of a technology against the reference, the well-known Bjøntegaard delta rate
(BD-rate) metric [2] is used. Instead of using PSNR as the distortion metric as is typical for human vision
performance evaluation, machine consumption distortion metrics, e.g., mAP and MOTA, are used in machine
BD-rate calculation.
The distortion measurement of machine consumption (e.g., mAP and MOTA) can sometimes be non-monotonic
with respect to the bit rate due to the characteristics of the machine analysis task and possible limitations of
machine networks. Polynomial curve fitting is applied to ensure rate-distortion monotonicity and thus valid
BD-rate calculation.
f(x) = b0*x^3 + b1*x^2 + b2*x + b3

For the given polynomial function in the above formula, b0, b1, b2 and b3 are coefficients of the function, x is
the input (bit rate) and f(x) is the output (quality). The following two constraints are invoked to ensure its
monotonicity and convexity:

— the first order derivative of the polynomial shown below is positive in the given x range

f′(x) = 3*b0*x^2 + 2*b1*x + b2

— the second order derivative of the polynomial shown below is negative in the given x range

f″(x) = 6*b0*x + 2*b1

Parameters b0, b1, b2, b3 in the polynomial function are solved by sequential least squares programming
(SLSQP) and applied to curve fitting.
NOTE It is a common practice to have the minimal quality value of the fitted curve no smaller than the minimal
quality value of the original curve and the maximum quality value of the fitted curve no greater than the maximum
quality value of the original curve.
7 Pre-processing technologies
7.1 Region of interest-based methods
One often-used optimization method is region of interest (RoI)-based coding. Here the input video is
analysed in some way and then the encoder can optimize the encoding towards machine consumption based
on the analysis results. The analysis can be done using various methods, e.g., neural networks. An example
of a pipeline that can be used for RoI-based approaches is shown in Figure 3.
Figure 3 — Pipeline for RoI-based systems
In one implementation example, an object detection network is used to analyse the input data. This network
produces a list of objects that can be found in the current picture. The information used to describe each
object includes the index of the picture in which the object can be found and the position of the object in
the picture. Some networks can provide more information than this and the encoder can choose to select a
subset of all objects by filtering based on, for example, the class of an object or the estimated likelihood of an
object of the described class being at the described position. In a similar approach, a segmentation network
can be used where the object is not described by a bounding box but by a segmentation mask indicating
exactly which samples the segmentation network estimates as belonging to the object. The list produced
during the analysis can then be used by the encoder, for example, to separate foreground and background
with the purpose of encoding the foreground at a better quality and the background at a lower quality. One
such encoding method is described in 8.1. In this example, the analysis does not change the input video, but
directly forwards it to the encoder.
In other RoI-based methods, the pre-processing changes the input video, for example, by applying different
pre-processing methods on the foreground and background, or specific parts of the video, such as
subsampling the background area of the input video.
In one implementation example, an object segmentation network is first used to analyse the input data.
The network produces a list of objects segmented with the object shapes in the current picture. The object
shapes and positions could be represented, for example, by segmentation masks. More information such as
the object class or the estimated likelihood of the object segment could also be provided by the network to
identify the objects. Based on the object information, it is possible to derive spatial complexity and temporal
complexity for the different segments, and then RoI-based pre-processing of the input video can be adapted
based on the spatial and temporal complexity. The spatial complexity here indicates the averaged object
size which can be calculated by dividing the percentage of the area covered by the objects by the total
number of the objects. Temporal complexity indicates the content changes between two pictures which can
be calculated by various methods, for example, by taking the mean absolute difference of the collocated
samples in two pictures.
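The two complexity measures described above can be sketched as follows. The flat-list sample representation for frames is an assumption for illustration; spatial complexity follows the area-percentage-per-object definition, and temporal complexity uses the mean absolute difference of collocated samples:

```python
def spatial_complexity(object_areas, picture_area):
    """Averaged object size: the percentage of the picture area covered by the
    objects, divided by the total number of objects."""
    coverage_pct = 100.0 * sum(object_areas) / picture_area
    return coverage_pct / len(object_areas)

def temporal_complexity(frame_a, frame_b):
    """Mean absolute difference of collocated samples in two equally sized
    frames (given here as flat lists of luma samples)."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)
```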
7.2 Foreground and background processing
After pre-analysis that determines the foreground and background areas, one straightforward way to handle
the background that is less critical to machine consumption is to “eliminate” it by setting the corresponding
sample values to a constant value. However, some portions of the background samples, for example those
immediately surrounding the foreground area, could still be useful for machine consumption. Therefore, the
background regions relevant to machine consumption can be preserved to a certain extent with low-pass
filtering, such as a Gaussian filter with a sliding window, where the window size can be set based on the
input video resolution.
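The handling described above can be sketched as follows, under simplifying assumptions: samples and the foreground mask are 2-D lists, a 3x3 box filter stands in for the Gaussian for brevity, and the margin and fill values are hypothetical:

```python
def suppress_background(img, fg_mask, margin=1, fill=128):
    """Keep foreground samples, low-pass filter a `margin`-sample ring around
    the foreground, and set the remaining background to a constant value."""
    h, w = len(img), len(img[0])

    def near_fg(i, j):
        for di in range(-margin, margin + 1):
            for dj in range(-margin, margin + 1):
                if 0 <= i + di < h and 0 <= j + dj < w and fg_mask[i + di][j + dj]:
                    return True
        return False

    def blur(i, j):
        vals = [img[i + di][j + dj]
                for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if 0 <= i + di < h and 0 <= j + dj < w]
        return sum(vals) // len(vals)   # 3x3 box filter (Gaussian stand-in)

    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if fg_mask[i][j]:
                out[i][j] = img[i][j]        # foreground: untouched
            elif near_fg(i, j):
                out[i][j] = blur(i, j)       # ring around foreground: low-pass
            else:
                out[i][j] = fill             # far background: constant value
    return out
```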
Moreover, extracted features can reveal importance information of the input video. In other words,
compared with binary classification of foreground and background, these extracted features can provide
importance information at a finer granularity. Therefore, such extracted features can be used to determine
how to process foreground and background differently. In one implementation example, a feature map is
extracted by a feature extraction network, and based on the feature map, the parameters of a Gaussian
smoothing filter are adapted and then the adaptive filtering is applied to the picture. As the background area
and foreground area have different features and even within the background or foreground area, different
regions can have different features, the Gaussian smoothing filter can be controlled at a finer granularity,
which finally results in a more efficient pre-processing.
An implementation example with more detailed description can be found in A.2.
7.3 Temporal subsampling
In some use cases, for example when the frame rate is high, a way to reduce the bit rate without a strong
negative impact on the machine consumption performance can be to skip certain frames and encode the
video at a lower frame rate. One example is to remove every other frame from the input video and encode
the video at half frame rate. This can be done in a dynamic manner, for example by evaluating the motion
between two or more frames and if there is only little motion, a frame can be removed. In some cases, if
the receiving system requires a specific frame rate, a corresponding post-processing technology that up-
samples the video to the full frame rate can be applied.
An implementation example with more detailed description can be found in A.4.
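The dynamic variant described above can be sketched as follows. Frames are flat lists of luma samples, the motion measure is a mean absolute difference against the last kept frame, and the motion threshold is a hypothetical value:

```python
def select_frames(frames, motion_threshold=2.0):
    """Dynamic temporal subsampling sketch: keep the first frame, then drop a
    frame when its mean absolute difference to the last kept frame falls below
    the threshold (i.e., there is only little motion). Returns kept indices."""
    kept = [0]
    for idx in range(1, len(frames)):
        ref = frames[kept[-1]]
        cur = frames[idx]
        mad = sum(abs(a - b) for a, b in zip(cur, ref)) / len(cur)
        if mad >= motion_threshold:
            kept.append(idx)
    return kept
```

Dropping every other frame, as in the static half-frame-rate example, corresponds to returning every second index regardless of motion.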
7.4 Spatial subsampling
If the analysis of a video shows that it contains primarily large objects, one way to improve the BD-rate
performance is to perform spatial subsampling on the input video. This will result in fewer samples in the
subsampled frames to be encoded, and thus likely lead to faster encoding and bit rate savings. The optimum
downscaling factor is content dependent. It is possible that the machine consumption performance drops
when subsampling is too aggressive. Therefore, it is advisable to apply spatial subsampling adaptively, for
example, based on the characteristics of the video content (such as the averaged object spatial area and the
number of objects) and the target bit rate. Moreover, the spatial subsampling can also be dependent on the
picture types. For example, depending on whether the input video is captured by a regular camera as natural
scenes, or by an infrared sensor as thermal images, different spatial subsampling methods can be applied.
One tool specified in the Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC) standard that can be used for the purpose
of spatial resampling is called reference picture resampling (RPR). This tool allows the encoder to choose
to encode some pictures of the input video at different resolutions. For example, based on the analysis of
keyframes, without encoding an intra picture at the specified changed resolution, inter predictions can be
made from all allowed pictures regardless of their resolution.
In one implementation example, the RPR tool can be used at the frame level, where the input video can be
analysed as described in 7.1. In this case, the unmodified input video is forwarded to the encoder with a scale
factor list generated by the analyser. Specifically, an object detection network is used to analyse the input
video in both full resolution and at least one spatially subsampled resolution. This network produces a list of
objects with object information that can be found in the current picture and the spatially resampled picture.
The object information describes the position and size of the detected objects for both the current picture
and the spatially resampled picture. Based on the object information, an object occupancy distribution (i.e.,
the distribution of the ratio of object size to the corresponding picture resolution) can be generated for
both the current picture and the spatially resampled picture. The scale factor can be derived for the current
picture by comparing the correlation of object occupancy distributions, and then the list of all scale factors
is passed on to the encoder for utilization of the RPR tool.
7.5 Noise filtering
Under some circumstances, various types of noise can be present in the video content. Denoising filters
can be applied on such content to reduce unnecessary bit rate increases and avoid machine consumption
performance degradation by filtering out the undesirable noise from the video content while preserving
information important to machine consumption. Various types of denoising filters can be applied according
to the characteristics of the noise. The strength of the filter can be adjusted based on the noise, and the filter
can adaptively be enabled and disabled for either an entire picture or sequence, or only a part thereof.
It is noted that some existing denoising filters such as bilateral or anisotropic diffusion filters can preserve
local details during denoising. For applications that benefit from such local details being preserved, filtering
the entire picture using the same strength can be detrimental.
8 Encoding technologies
8.1 RoI-based quantization parameter adaption
One method that is available in many video coding standards is adaptive quantization parameter (QP). Here
the encoder can change the QP value at a sub-picture level, for example the coding tree unit (CTU) level, to
optimize the encoding for the application. Due to this versatility, adaptive QP can be used in many different
use cases to improve performance.
The decision on where to change the QP value and by how much can be made by the encoder based on an
analysis of the input video. Another option is to utilize the output of an external analyser as described
in 7.1. In this case, the encoder receives information about the positions and sizes of objects in each frame to
make a differentiation between foreground and background.
© ISO/IEC 2026 – All rights reserved
ISO/IEC DTR 23888-3.2:2026(en)
One option is to use a base QP value of the picture for areas that contain objects, i.e., foreground areas, and
an increased QP value for background areas, resulting in fewer bits being used to encode the background. As
the background is usually not critical to machine consumption, this is a straightforward way to reduce the
bit rate without affecting the machine consumption performance. As an extension, it can also be beneficial to
encode large objects with slightly higher QP values. As it is generally easier to detect larger objects, reducing
the bit rate for large objects usually does not reduce the performance of machine consumption.
However, it is noted that when utilizing the analysis, complexity can be traded against bit rate. For example,
if a light-weight neural network is used to perform the analysis, it is possible that not all relevant objects
have been found and thus it can be detrimental to reduce the quality for the background too much as there
are possibly objects that the initial analysis missed. These objects are possibly still important for machine
consumption and if the background is encoded in sufficient quality, the machine consumption network has
a chance of detecting objects in the background even if they are coded in lower quality than the foreground
area. On the other hand, if the encoding system has a lot of resources, it can employ a neural network of
higher complexity for the analysis. With a better and more certain analysis, the bit rate for the background
can be reduced more as there are likely fewer objects that have been missed in the initial analysis.
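As an illustration of the foreground/background QP assignment discussed above, the following sketch derives a CTU-level QP map from a list of detected bounding boxes. The offset values, the CTU size and the large-object criterion are assumptions made for the sketch, not recommendations of this document.

```python
def ctu_qp_map(width, height, ctu=128, base_qp=32, bg_offset=8,
               large_offset=2, large_area_ratio=0.10, objects=()):
    """Per-CTU QP grid: base QP where a detected object overlaps the CTU,
    base+large_offset for CTUs covered only by large objects, and
    base+bg_offset for background CTUs.  `objects` holds (x, y, w, h) boxes."""
    cols = (width + ctu - 1) // ctu
    rows = (height + ctu - 1) // ctu
    pic_area = width * height
    qp = [[base_qp + bg_offset] * cols for _ in range(rows)]
    for (x, y, w, h) in objects:
        # Larger objects are easier to detect, so they tolerate a higher QP.
        is_large = (w * h) / pic_area >= large_area_ratio
        obj_qp = base_qp + (large_offset if is_large else 0)
        for r in range(y // ctu, min(rows, (y + h - 1) // ctu + 1)):
            for c in range(x // ctu, min(cols, (x + w - 1) // ctu + 1)):
                qp[r][c] = min(qp[r][c], obj_qp)  # keep the best (lowest) QP
    return qp
```

Such a map could then be passed to an encoder that supports CTU-level delta QP signalling.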
A more detailed description with a link to an implementation can be found in A.1.
8.2 Quantization step adjustment for temporal layers
It is a common practice that the encoder places different pictures on different temporal layers, i.e., assigning
different temporal identifiers (TID). This has the purpose of creating hierarchical structures that indicate
from which previously coded pictures the encoder can create predictions for the current picture. One aspect
of these hierarchical structures is that pictures cannot be referenced by other pictures on a higher temporal
layer. This way, it is not necessary to store every decoded picture in the decoded picture buffer. Another
aspect of the hierarchical structure is that pictures on higher temporal layers can be encoded with higher
QP values, i.e., lower quality, as they will not be used, or less often used, as references by other pictures. An
example of a hierarchical structure is shown in Figure 4. The display order is left-to-right, and the numbers
specify the order of the coded pictures in the bitstream.
Figure 4 — An example hierarchical referencing structure in a Rec. ITU-T H.266 | ISO/IEC 23090-3
(VVC) random access configuration bitstream
This hierarchy of pictures can be exploited by encoding pictures on higher temporal layers using higher
QP values, i.e., reducing the number of bits spent on these pictures in total. This is also done in the case of
the common test conditions for standard dynamic range for Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC).
In the use case of coding video content for machine consumption, this characteristic can be exploited
further by increasing the QP value for pictures on higher temporal layers more. Taking advantage of motion
compensation, many bits can be saved when compressing pictures in high temporal layers while these
pictures are still able to be reconstructed with high quality. Compressing the highest temporal layer at a
high QP can be seen as being similar in spirit to reducing the frame rate as discussed in 7.3. As an example,
lowering the bit rate substantially on every odd-numbered frame can be seen as a step towards completely
removing them.
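The relationship between picture position, temporal layer and QP offset can be sketched as follows for a dyadic GOP of size 8. The offset values are illustrative assumptions: the first set mimics typical random-access offsets, while the second raises the QP of the highest layers further, in the spirit of the machine-consumption optimization described above.

```python
def temporal_id(poc, gop=8):
    """Temporal layer of a picture in a dyadic hierarchical-B GOP."""
    if poc % gop == 0:
        return 0
    tid = 1
    step = gop // 2
    while poc % step != 0:
        step //= 2
        tid += 1
    return tid

# Offset tables indexed by temporal ID (assumed values, not the report's).
HUMAN_OFFSETS = [0, 1, 2, 3]
MACHINE_OFFSETS = [0, 2, 5, 9]

def picture_qp(poc, base_qp=32, offsets=MACHINE_OFFSETS):
    """QP of a picture given its position in the GOP and an offset table."""
    return base_qp + offsets[temporal_id(poc)]
```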
8.3 Chroma QP offset setting
Many machine analysis methods are performed using 4:4:4 colour format input data. Therefore, encoding
in 4:2:0 colour format, which has lower chroma resolution than 4:4:4 colour format, can sometimes have a
negative impact on the machine analysis performance. This can sometimes be compensated for by using a
negative chroma QP offset, which increases the quality of the low-resolution chroma components.
9 Post-processing technologies
9.1 Temporal resampling
When the receiving system requires a specific frame rate that is different from the frame rate of the decoded
sequences, temporal resampling can be applied on the decoded video by utilizing conventional temporal
filters (e.g., motion compensated interpolation filters) or neural network-based filters, or just frame
repetition. An implementation example with a more detailed description can be found in A.4.
9.2 Spatial resampling
Spatial resampling can for example be applied on the decoded video by utilizing conventional spatial filters
(e.g., the Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC) motion compensation interpolation filter or Rec. ITU-T
H.266 | ISO/IEC 23090-3 (VVC) reference picture resampling filter) or neural network-based filters.
9.3 Enhancement post-filtering
Video coding introduces quality degradation to the original video, which can consequently reduce the
performance of machine consumption. To improve the machine consumption performance, the receiver can
introduce a machine-oriented post-filtering network to enhance the decoded video before feeding it to the
machine consumption networks. Such a machine-oriented post-filtering network can be fixed at the receiver
side, i.e., without requiring any signalling in the coded bitstream or metadata. An illustration of this process
is shown in Figure 5.
Figure 5 — The processing order for an enhancement post-filter without additional signalling
Alternatively, the information of the post-filtering network can be signalled as metadata, for example using
the SEI messages described in 10.2. In this case, the encoder can train the post-filtering network specifically
for the machine consumption conducted in the decoder side if the machine consumption network is known
on the encoder side. An illustration of this process is shown in Figure 6. An example of an enhancement post-
filter is described in A.3.
Figure 6 — The processing order for an enhancement post-filter with metadata signalling
10 Metadata
10.1 General
The following subclauses describe SEI messages that can be beneficial for interpreting coded video content
that has been optimized for machine analysis. SEI messages do not impact normative decoder behaviour,
and are optional for decoders to implement.
10.2 Neural-network post-filter SEI message
The neural-network post-filter (NNPF) characteristics (NNPFC) SEI message and neural-network post-
filter activation (NNPFA) SEI message are specified in Rec. ITU-T H.274 | ISO/IEC 23002-7 (VSEI). These SEI
messages can be used for the purpose of machine consumption.
The NNPF
...
Draft ISO/IEC DTR 23888-3.2:202#
ISO/IEC JTC 1/SC 29/WG 5
Secretariat: JISC
Date: 2026-03-16
Information technology — Artificial intelligence for multimedia —
Part 3:
Optimization of encoders and receiving systems for machine analysis
of coded video content
Technologies de l'information — Intelligence artificielle pour le multimédia —
Partie 3: Optimisation des codeurs et des systèmes de réception pour l'analyse automatique de contenus vidéo
codés
DTR stage
© ISO/IEC 2026
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication
may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying,
or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO
at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: + 41 22 749 01 11
E-mail: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
Foreword . v
1 Scope . 1
2 Normative references . 1
3 Terms and definitions . 1
4 Abbreviated terms . 1
5 Overview . 2
5.1 General overview . 2
5.2 Use cases and applications . 3
6 Evaluation methodology . 3
6.1 General . 3
6.2 Bit rate . 4
6.3 PSNR . 4
6.4 mAP . 4
6.5 MOTA . 5
6.6 BD-rate . 5
7 Pre-processing technologies . 6
7.1 Region of interest-based methods . 6
7.2 Foreground and background processing . 7
7.3 Temporal subsampling . 7
7.4 Spatial subsampling . 7
7.5 Noise filtering . 8
8 Encoding technologies . 8
8.1 RoI-based quantization parameter adaptation . 8
8.2 Quantization step adjustment for temporal layers . 9
8.3 Chroma QP offset setting . 9
9 Post-processing technologies . 10
9.1 Temporal resampling . 10
9.2 Spatial resampling . 10
9.3 Enhancement post-filtering . 10
10 Metadata . 10
10.1 General . 10
10.2 Neural-network post-filter SEI message . 11
10.3 Annotated regions SEI message . 11
10.4 Object mask information SEI message . 11
10.5 Encoder optimization information SEI message . 11
10.6 Packed regions information SEI message . 12
Annex A (informative) Software implementation examples . 13
Annex B (informative) Combined software implementation examples . 20
Bibliography . 21
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members
of ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work.
The procedures used to develop this document and those intended for its further maintenance are described
in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of
document should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC
Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).
ISO and IEC draw attention to the possibility that the implementation of this document may involve the use of
(a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any claimed
patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not received
notice of (a) patent(s) which may be required to implement this document. However, implementers are
cautioned that this may not represent the latest information, which may be obtained from the patent database
available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held responsible for
identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions
related to conformity assessment, as well as information about ISO's adherence to the World Trade
Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html.
In the IEC, see www.iec.ch/understanding-standards.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information, in collaboration with
ITU-T SG21, Technologies for multimedia, content delivery and cable television. The corresponding ITU-T SG21
provisional work item name is H.Sup.MACVC.
A list of all parts in the ISO/IEC 23888 series can be found on the ISO and IEC websites.
Any feedback or questions on this document should be directed to the user’s national standards body. A
complete listing of these bodies can be found at www.iso.org/members.html and www.iec.ch/national-
committees.
Information technology — Artificial intelligence for multimedia —
Part 3:
Optimization of encoders and receiving systems for machine analysis
of coded video content
1 Scope
This document provides information about optimizations for encoders and receiving systems for conducting
machine analysis tasks on coded video content. It provides a concept-level overview of recent practices and
provides comments on technical aspects and cautions to be taken when interpreting the results. This
document describes technologies that have recently been studied and have demonstrated benefits to coding
efficiency for some machine analysis tasks.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content constitute
requirements of this document. For dated references, only the edition cited applies. For undated references,
the latest edition of the referenced document (including any amendments) applies.
Rec. ITU-T H.266 | ISO/IEC 23090-3, Versatile video coding
Rec. ITU-T H.265 | ISO/IEC 23008-2, High efficiency video coding
Rec. ITU-T H.264 | ISO/IEC 14496-10, Advanced video coding
Rec. ITU-T H.274 | ISO/IEC 23002-7, Versatile supplemental enhancement information messages for coded video
bitstreams
3 Terms and definitions
For the purposes of this document, the terms and definitions given in Rec. ITU-T H.266 | ISO/IEC 23090-3,
Rec. ITU-T H.265 | ISO/IEC 23008-2, Rec. ITU-T H.264 | ISO/IEC 14496-10, Rec. ITU-T H.274 | ISO/IEC 23002-
7 and the following apply.
ISO and IEC maintain terminology databases for use in standardization at the following addresses:
— ISO Online browsing platform: available at https://www.iso.org/obp
— IEC Electropedia: available at https://www.electropedia.org/
3.1
machine consumption
operation of a machine analysis task such as object detection, segmentation or object tracking
4 Abbreviated terms
AVC Advanced Video Coding (Rec. ITU-T H.264 | ISO/IEC 14496-10)
BD-rate Bjøntegaard delta bit rate
CTU coding tree unit
HEVC High Efficiency Video Coding (Rec. ITU-T H.265 | ISO/IEC 23008-2)
IoU intersection over union
mAP mean average precision
MOTA multiple object tracking accuracy
NNPF neural-network post-filter
NNPFA neural-network post-filter activation
NNPFC neural-network post-filter characteristics
OMI object mask information
PRI packed regions information
PSNR peak signal-to-noise ratio
QP quantization parameter
RoI region of interest
RPR reference picture resampling
SEI supplemental enhancement information
TID temporal identifier
URI uniform resource identifier
VSEI Versatile Supplemental Enhancement Information Messages for Coded Video Bitstreams (Rec.
ITU-T H.274 | ISO/IEC 23002-7)
VTM Reference software for versatile video coding (Rec. ITU-T H.266.2 | ISO/IEC 23090-16:2025)
VVC Versatile Video Coding (Rec. ITU-T H.266 | ISO/IEC 23090-3)
Y′CBCR colour space representation commonly used for video/image distribution, also written as YUV
YUV colour space representation commonly used for video/image distribution, also written as Y′CBCR
5 Overview
5.1 General overview
Most video processing systems consist of four main processing steps, as shown in Figure 1. This
document describes technologies for optimization of encoders and receiving systems, such as pre-processing,
encoding and post-processing for machine consumption. The decoding process, on the other hand, is fully
specified in the respective Rec. ITU-T H.266 | ISO/IEC 23090-3 Versatile Video Coding (VVC), Rec. ITU-T H.265
| ISO/IEC 23008-2 High Efficiency Video Coding (HEVC) and Rec. ITU-T H.264 | ISO/IEC 14496-10 Advanced
Video Coding (AVC) video coding standards, amongst others. Hence, the samples of the decoded video are fully
specified by the given input bitstream.
Figure 1 — General video coding and processing pipeline
An overview of the commonly used practices for evaluating encoder optimization technologies for machine
consumption can be found in Clause 6. Descriptions of pre-processing technologies can be found in Clause 7.
Encoder optimization technologies are described in Clause 8 and post-processing technologies are described
in Clause 9. Metadata that is useful for machine consumption is described in Clause 10.
It is noted that depending on specific use cases, the technologies outlined in this document can be
implemented individually or in combination to optimize the machine consumption performance within the
constraints of the system capabilities. When employing multiple technologies simultaneously, it is important
to consider that certain combinations can be impractical or infeasible due to inherent methodological
constraints. Tested technologies and their combinations are listed in Annex A and Annex B, respectively.
5.2 Use cases and applications
There are various use cases and applications using encoded video that benefit from optimizing both encoders
and receiving systems for machine consumption. Some of them are highlighted below:
— Surveillance: A considerable amount of bandwidth is needed to transmit a high volume of data generated
by a large number of sensors. The number of sensors also has an impact on the computational load on the
server side, as having to analyse the input from many sensors can become a huge burden. This can be
eased by distributing the computation to the front-end devices.
— Intelligent transportation: A key aspect for vehicular applications is interoperability between not only
vehicles from different vendors, but also the infrastructures of various locations. Connected vehicles are
expected to play a significant role in future transport systems and the tremendous number of vehicles
emphasizes the need to reduce the amount of data transmitted between them to avoid overloading
the network.
— Intelligent industry: One example in this area is visual content analysis, checking and screening. Machine
automation is desirable for increasing efficiency.
A more detailed description of use cases can be found in ISO/IEC TR 23888-1.[1]
6 Evaluation methodology
6.1 General
A set of assessment metrics is used for the evaluation of encoder and receiving system optimization
technologies for machine consumption. An overview of the evaluation framework is shown in Figure 2.
Here the input video is encoded to generate a bitstream. This bitstream is then decoded, and the decoded video
is used for machine consumption. In this diagram, the “encoder” includes both pre-processing and encoding
steps, and the “decoder” includes both decoding and post-processing steps, as shown in Figure 1.
Figure 2 — Evaluation framework and points of measurement
6.2 Bit rate
The bit rate is determined based on the encoded bitstream and parameters of the input video such as frame
rate and the number of total frames. The following formula is applied to calculate the bit rate:
bitrate = (8 ∗ fileSizeInBytes ∗ fps) / (numFrames ∗ 1000)
where fileSizeInBytes is the size of a file measured in number of bytes, fps is the number of frames per second
and numFrames is the total number of frames in the file.
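The formula above translates directly into code, here returning the bit rate in kbit/s:

```python
def bitrate_kbps(file_size_bytes, fps, num_frames):
    """Bit rate in kbit/s from the bitstream size and input-video parameters."""
    return 8 * file_size_bytes * fps / (num_frames * 1000)
```

For example, a 1 250 000-byte bitstream for 300 frames at 30 fps (10 s of video) corresponds to 1000 kbit/s.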
6.3 PSNR
Encoding for video distribution is ordinarily performed in the Y′CBCR domain (nicknamed YUV herein for
brevity and ease of typing). For standard-dynamic-range video, the distortion metric primarily used in the
video coding standardization community has been peak signal to noise ratio (PSNR). The following two
formulae are used to calculate PSNR:
MSE = (1 / (n ∗ m)) ∗ Σ_{i=0}^{n−1} Σ_{j=0}^{m−1} (x(i,j) − y(i,j))²

PSNR = 10 ∗ log10((255 ∗ 2^(bitDepth−8))² / MSE)
where x(i,j) is the decoded sample value of a certain colour component, y(i,j) is the corresponding original
sample value, and bitDepth is the bit depth of the input video. It is a common practice to calculate PSNR values
for each of the colour components Y, U and V. Information on how to interpret or combine PSNR values of
colour components can be found in ISO/IEC TR 23002-8.[2]
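The two PSNR formulae above can be written directly in code for a single colour component:

```python
import math

def psnr(decoded, original, bit_depth=8):
    """PSNR of one colour component; `decoded` and `original` are equally
    sized 2D sample arrays (lists of rows)."""
    n, m = len(decoded), len(decoded[0])
    mse = sum((decoded[i][j] - original[i][j]) ** 2
              for i in range(n) for j in range(m)) / (n * m)
    if mse == 0:
        return float("inf")  # identical reconstruction
    peak = 255 * 2 ** (bit_depth - 8)
    return 10 * math.log10(peak ** 2 / mse)
```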
6.4 mAP
The performance of object detection and segmentation tasks is measured by mean average precision (mAP),
described as follows.
For a given category of object, true positives TP(T_IoU), false positives FP(T_IoU), true negatives TN(T_IoU)
and false negatives FN(T_IoU) are defined with an IoU threshold T_IoU for that category, where a true positive
is a case where an object is detected by the model and is also part of the ground truth; a false positive is a case
where an object is detected by the model but is not part of the ground truth; a true negative is a case where an
object is not detected by the model and is also not part of the ground truth; and a false negative is a case where
an object is not detected by the model but is part of the ground truth.

Then, recall at the given IoU threshold is defined as the proportion of all true positive cases in all true positive
and false negative cases corresponding to that IoU threshold:

recall(T_IoU) = TP(T_IoU) / (TP(T_IoU) + FN(T_IoU))

The precision at the given IoU threshold is the proportion of all true positive cases in all positive cases:

precision(T_IoU) = TP(T_IoU) / (TP(T_IoU) + FP(T_IoU))
It is possible for a detection or segmentation neural network to produce several pairs of recall and precision
values at different confidence levels. For each recall value r in the pairs, let p(r) take the maximum precision
value among all precision values for which the corresponding recall values are at or above the given recall
value r:

p(r) = max_{r̃ : r̃ ≥ r} precision(r̃)

Average precision (AP) of a given category of object is defined as the average value of p(r) over all recall
values provided by the neural network, which characterizes the area under the precision-recall curve. The
mAP is defined as the average of the AP scores of all categories within a range of IoU thresholds.
Some commonly used variants of this metric are:
— mAP@0.5: An object is counted as correctly identified if the IoU between the detected bounding box and
the ground truth bounding box is at least 0.5. Sometimes this variant of the mAP metric is also referred to
as mAP50.
— mAP@[0.5:0.05:0.95]: In this variant a total of ten mAP scores with increasing IoU thresholds are
calculated. The IoU threshold starts at 0.5 and increases by 0.05 after each iteration, until it reaches the
upper bound value 0.95. Once all ten scores are determined, the average of these scores is calculated to
produce the final mAP.
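The AP computation can be sketched as follows; the input is assumed to be the list of (recall, precision) pairs collected at different confidence levels, as described above.

```python
def average_precision(pairs):
    """AP from (recall, precision) pairs gathered at different confidence
    levels: the average, over the observed recall values, of the precision
    envelope p(r) = max precision among pairs with recall >= r."""
    recalls = [r for r, _ in pairs]

    def p(r):
        return max(prec for rec, prec in pairs if rec >= r)

    return sum(p(r) for r in recalls) / len(recalls)
```

Averaging this AP over all object categories, and over the chosen IoU thresholds, yields the mAP variants listed above.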
6.5 MOTA
Object tracking performance is measured by multiple object tracking accuracy (MOTA). This metric accounts
for all object configuration errors made by the tracker: false positives, misses (false negatives) and
mismatches, over all frames. The calculation of MOTA is as follows:
MOTA = 1 − (Σ_t (FN_t + FP_t + mme_t)) / (Σ_t g_t)

where FN_t, FP_t, mme_t and g_t are the number of false negatives, the number of false positives, the number of
mismatch errors (ID switching between two successive frames), and the number of objects in the ground truth,
respectively, at time t.
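The MOTA formula translates directly into code, given per-frame error counts:

```python
def mota(per_frame):
    """MOTA from per-frame (fn, fp, mme, gt) tuples: 1 minus the total error
    count divided by the total number of ground-truth objects."""
    errors = sum(fn + fp + mme for fn, fp, mme, _ in per_frame)
    gt = sum(g for _, _, _, g in per_frame)
    return 1 - errors / gt
```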
6.6 BD-rate
To compare the performance of a technology against the reference, the well-known Bjøntegaard delta rate
(BD-rate) metric[2] is used. Instead of using PSNR as the distortion metric, as is typical for human vision
performance evaluation, machine consumption distortion metrics, e.g., mAP and MOTA, are used in machine
BD-rate calculation.
The distortion measurement of machine consumption (e.g., mAP and MOTA) can sometimes be non-monotonic
with respect to the bit rate due to the characteristics of the machine analysis task and possible limitations of
machine networks. Polynomial curve fitting is applied to ensure rate-distortion monotonicity and thus a valid
BD-rate calculation.
f(x) = a₀ ∗ x³ + a₁ ∗ x² + a₂ ∗ x + a₃

For the given polynomial function in the above formula, a₀, a₁, a₂ and a₃ are coefficients of the function, x is
the input (bit rate) and f(x) is the output (quality). The following two constraints are imposed to ensure its
monotonicity and convexity:

— the first order derivative of the polynomial, shown below, is positive in the given x range:

f′(x) = 3 ∗ a₀ ∗ x² + 2 ∗ a₁ ∗ x + a₂

— the second order derivative of the polynomial, shown below, is negative in the given x range:

f″(x) = 6 ∗ a₀ ∗ x + 2 ∗ a₁

Parameters (a₀, a₁, a₂, a₃) in the polynomial function are solved by sequential least squares programming
(SLSQP) and applied to curve fitting.
NOTE It is a common practice to have the minimal quality value of the fitted curve no smaller than the minimal
quality value of the original curve and the maximum quality value of the fitted curve no greater than the maximum quality
value of the original curve.
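For illustration, the following sketch computes a BD-rate from four (bit rate, quality) points per curve. Note the simplification: instead of the SLSQP-constrained fit described above, it uses the classical Bjøntegaard approach of interpolating log10(bit rate) as a cubic in the quality value and averaging the difference of the two curves over the overlapping quality interval. It is therefore only meaningful when the measured points are already monotonic.

```python
import math

def _cubic_coeffs(xs, ys):
    """Coefficients [c0..c3] of the interpolating cubic through four points,
    obtained by Gauss-Jordan elimination of the Vandermonde system."""
    a = [[x ** k for k in range(4)] + [y] for x, y in zip(xs, ys)]
    for col in range(4):
        piv = max(range(col, 4), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(4):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [a[i][4] / a[i][i] for i in range(4)]

def _integral(c, lo, hi):
    anti = lambda x: sum(c[k] * x ** (k + 1) / (k + 1) for k in range(4))
    return anti(hi) - anti(lo)

def bd_rate(anchor, test):
    """BD-rate in percent from two lists of four (bitrate, quality) points,
    where quality is e.g. mAP or MOTA."""
    ca = _cubic_coeffs([q for _, q in anchor], [math.log10(r) for r, _ in anchor])
    ct = _cubic_coeffs([q for _, q in test], [math.log10(r) for r, _ in test])
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    avg_diff = (_integral(ct, lo, hi) - _integral(ca, lo, hi)) / (hi - lo)
    return (10 ** avg_diff - 1) * 100
```

As a sanity check, a test curve whose rates are uniformly 10 % below the anchor at identical quality values yields a BD-rate of −10 %.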
7 Pre-processing technologies
7.1 Region of interest-based methods
One often-used optimization method is region of interest (RoI)-based coding. Here the input video is analysed
in some way and then the encoder can optimize the encoding towards machine consumption based on the
analysis results. The analysis can be done using various methods, e.g., neural networks. An example of a
pipeline that can be used for RoI-based approaches is shown in Figure 3.
Figure 3 — Pipeline for RoI-based systems
In one implementation example, an object detection network is used to analyse the input data. This network
produces a list of objects that can be found in the current picture. The information used to describe each object
includes the index of the picture in which the object can be found and the position of the object in the picture.
Some networks can provide more information than this and the encoder can choose to select a subset of all
objects by filtering based on, for example, the class of an object or the estimated likelihood of an object of the
described class being at the described position. In a similar approach, a segmentation network can be used
where the object is not described by a bounding box but by a segmentation mask indicating exactly which
samples the segmentation network estimates belonging to the object. The list produced during the analysis
can then be used by the encoder, for example, to separate foreground and background with the purpose of
encoding the foreground at a better quality and the background at a lower quality. One such encoding method
is described in 8.1. In this example, the analysis does not change the input video, which is forwarded directly
to the encoder.
In other RoI-based methods, the pre-processing changes the input video, for example, by applying different
pre-processing methods on the foreground and background, or specific parts of the video, such as subsampling
the background area of the input video.
In one implementation example, an object segmentation network is first used to analyse the input data. The
network produces a list of objects segmented with the object shapes in the current picture. The object shapes
and positions could be represented, for example, by segmentation masks. More information such as the object
class or the estimated likelihood of the object segment could also be provided by the network to identify the
objects. Based on the object information, it is possible to derive spatial complexity and temporal complexity
for the different segments, and then RoI-based pre-processing of the input video can be adapted based on the
spatial and temporal complexity. The spatial complexity here indicates the averaged object size which can be
calculated by dividing the percentage of the area covered by the objects by the total number of the objects.
Temporal complexity indicates the content changes between two pictures which can be calculated by various
methods, for example, by taking the mean absolute difference of the collocated samples in two pictures.
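The two complexity measures described above can be sketched as follows. Overlap between object areas is ignored in this simplification, and the function names are assumptions made for the example.

```python
def spatial_complexity(object_areas, picture_area):
    """Averaged object size: the area fraction covered by the objects divided
    by the total number of objects (object overlap ignored here)."""
    if not object_areas:
        return 0.0
    coverage = sum(object_areas) / picture_area
    return coverage / len(object_areas)

def temporal_complexity(prev, curr):
    """Mean absolute difference of collocated samples in two pictures."""
    n = len(prev) * len(prev[0])
    return sum(abs(a - b) for ra, rb in zip(prev, curr)
               for a, b in zip(ra, rb)) / n
```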
7.2 Foreground and background processing
After pre-analysis that determines the foreground and background areas, one straightforward way to handle
the background that is less critical to machine consumption is to “eliminate” it by setting the corresponding
sample values to a constant value. However, some portions of the background samples, for example those
immediately surrounding the foreground area, could still be useful for machine consumption. Therefore, the
background regions relevant to machine consumption can be preserved to a certain extent with low-pass
filtering, such as a Gaussian filter with a sliding window, where the window size can be set based on the input
video resolution.
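A minimal sketch of this idea follows: background samples within a margin of any foreground sample are preserved, and the remaining background is set to a constant value. For brevity the preserved ring is passed through unchanged rather than low-pass filtered; the function name, the margin and the fill value are assumptions made for the sketch.

```python
def preprocess_background(frame, fg_mask, margin=1, fill=128):
    """Keep foreground samples (mask 1), retain background samples within
    `margin` samples of any foreground sample, and set the remaining
    background to the constant `fill` value."""
    h, w = len(frame), len(frame[0])

    def near_fg(y, x):
        return any(fg_mask[j][i]
                   for j in range(max(0, y - margin), min(h, y + margin + 1))
                   for i in range(max(0, x - margin), min(w, x + margin + 1)))

    return [[frame[y][x] if fg_mask[y][x] or near_fg(y, x) else fill
             for x in range(w)] for y in range(h)]
```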
Moreover, extracted features can reveal importance information of the input video. In other words, compared
with binary classification of foreground and background, these extracted features can provide importance
information at a finer granularity. Therefore, such extracted features can be used to determine how to process
foreground and background differently. In one implementation example, a feature map is extracted by a
feature extraction network, and based on the feature map, the parameters of a Gaussian smoothing filter are
adapted and then the adaptive filtering is applied to the picture. As the background area and foreground area
have different features and even within the background or foreground area, different regions can have
different features, the Gaussian smoothing filter can be controlled at a finer granularity, which finally results
in a more efficient pre-processing.
An implementation example with a more detailed description can be found in A.2.
7.3 Temporal subsampling
In some use cases, for example when the frame rate is high, a way to reduce the bit rate without a strong
negative impact on the machine consumption performance can be to skip certain frames and encode the video
at a lower frame rate. One example is to remove every other frame from the input video and encode the video
at half frame rate. This can be done dynamically, for example by evaluating the motion between two or more frames and removing a frame if there is little motion. In some cases, if the receiving system
requires a specific frame rate, a corresponding post-processing technology that up-samples the video to the
full frame rate can be applied.
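The dynamic variant can be sketched as follows. This is an illustrative Python sketch: the mean absolute difference is used as a simple motion measure (a real system might use motion vectors), and the threshold value is an assumption.

```python
import numpy as np

def select_frames(frames, motion_threshold=2.0):
    """Keep a frame only if the mean absolute difference (MAD) to the
    last kept frame exceeds a threshold; otherwise skip it.
    The first frame is always kept. Returns the kept frame indices."""
    kept = [0]
    for i in range(1, len(frames)):
        mad = np.mean(np.abs(frames[i].astype(np.float32) -
                             frames[kept[-1]].astype(np.float32)))
        if mad > motion_threshold:
            kept.append(i)
    return kept
```

Frames that are skipped here would simply not be passed to the encoder, lowering the effective frame rate where the content is static.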
An implementation example with more detailed description can be found in A.4.
7.4 Spatial subsampling
If the analysis of a video shows that it contains primarily large objects, one way to improve the BD-rate
performance is to perform spatial subsampling on the input video. This will result in fewer samples in the
subsampled frames to be encoded, and thus likely lead to faster encoding and bit rate savings. The optimum
downscaling factor is content dependent. It is possible that the machine consumption performance drops
when subsampling is too aggressive. Therefore, it is advisable to apply spatial subsampling adaptively, for
example, based on the characteristics of the video content (such as the average object spatial area and the
number of objects) and the target bit rate. Moreover, the spatial subsampling can also be dependent on the
picture types. For example, depending on whether the input video is captured by a regular camera as natural scenes, or by an infrared sensor as thermal images, different spatial subsampling methods can be
applied.
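A content-adaptive choice of downscaling factor can be sketched as follows. This is an illustrative Python sketch: the occupancy thresholds and the candidate factors are assumptions for illustration, not values from this document.

```python
def choose_scale_factor(object_areas, picture_area,
                        thresholds=((0.10, 0.5), (0.02, 0.75))):
    """Pick a downscaling factor from the average object occupancy
    (object area / picture area). Large objects tolerate aggressive
    subsampling; small or absent objects keep full resolution.
    thresholds: (min_occupancy, factor) pairs, checked in order."""
    if not object_areas:
        return 1.0
    occupancy = sum(object_areas) / (len(object_areas) * picture_area)
    for min_occ, factor in thresholds:
        if occupancy >= min_occ:
            return factor
    return 1.0
```

The factor returned here would be applied to the input picture before encoding; the target bit rate could be folded in by shifting the thresholds.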
One tool specified in the Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC) standard that can be used for the purpose
of spatial resampling is called reference picture resampling (RPR). This tool allows the encoder to choose to
© ISO/IEC 2026 – All rights reserved
ISO/IEC DTR 23888-3.2:(en)
encode some pictures of the input video at different resolutions. For example, based on the analysis of keyframes, the resolution can be changed without encoding an intra picture at the new resolution, because inter prediction is allowed from reference pictures regardless of their resolution.
In one implementation example, the RPR tool can be used at the frame level, where the input video can be
analysed as described in 7.1. In this case, the unmodified input video is forwarded to the encoder with a
scale factor list generated by the analyser. Specifically, an object detection network is used to analyse the input
video at both full resolution and at least one spatially subsampled resolution. For both the current picture and the spatially resampled picture, this network produces a list of detected objects together with object information.
The object information describes the position and size of the detected objects for both the current picture and
the spatially resampled picture. Based on the object information, an object occupancy distribution (i.e., the
distribution of the ratio of object size to the corresponding picture resolution) can be generated for both the
current picture and the spatially resampled picture. The scale factor can be derived for the current picture by
comparing the correlation of object occupancy distributions, and then the list of all scale factors is passed on
to the encoder for utilization of the RPR tool.
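The occupancy-distribution comparison above can be sketched as follows. This is an illustrative Python sketch: the histogram binning, the Pearson correlation as similarity measure, and the threshold are assumptions, not specified by this document.

```python
import numpy as np

def occupancy_distribution(object_areas, picture_area, bins=10):
    """Histogram of object size / picture size ratios, normalized to a
    probability distribution."""
    ratios = np.asarray(object_areas, dtype=np.float64) / picture_area
    hist, _ = np.histogram(ratios, bins=bins, range=(0.0, 1.0))
    total = hist.sum()
    return hist / total if total else hist.astype(np.float64)

def pick_scale(full_areas, sub_areas, full_area, sub_area, scale,
               corr_threshold=0.9):
    """If the object occupancy distributions at full and subsampled
    resolution are strongly correlated, the detector finds essentially
    the same objects after downscaling, so the lower resolution (scale)
    can be used; otherwise keep full resolution (1.0)."""
    d_full = occupancy_distribution(full_areas, full_area)
    d_sub = occupancy_distribution(sub_areas, sub_area)
    corr = np.corrcoef(d_full, d_sub)[0, 1]
    return scale if corr >= corr_threshold else 1.0
```

Running this per picture yields the scale factor list that is passed to the encoder for the RPR tool.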
7.5 Noise filtering
Under some circumstances, various types of noise can be present in the video content. Denoising filters can be
applied on such content to reduce unnecessary bit rate increases and avoid machine consumption
performance degradation by filtering out the undesirable noise from the video content while preserving
information important to machine consumption. Various types of denoising filters can be applied according to
the characteristics of the noise. The strength of the filter can be adjusted based on the noise, and the filter can
adaptively be enabled and disabled for either an entire picture or sequence, or only a part thereof.
It is noted that some existing denoising filters, such as bilateral or anisotropic diffusion filters, can preserve local details during denoising. For applications that benefit from preserving such local details, filtering the entire picture with the same strength can be detrimental.
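The noise-dependent enabling and strength adjustment can be sketched as follows. This is an illustrative Python sketch: the MAD-based noise estimator, the threshold and the 3x3 mean filter (a crude stand-in for a tuned denoiser) are assumptions for illustration.

```python
import numpy as np

def estimate_noise(img):
    """Rough noise standard-deviation estimate from horizontal first
    differences (robust MAD-based scale; illustrative only)."""
    d = np.diff(img.astype(np.float64), axis=1)
    return np.median(np.abs(d - np.median(d))) / 0.6745 / np.sqrt(2)

def denoise_if_needed(img, threshold=2.0):
    """Enable denoising only when the estimated noise level exceeds a
    threshold; otherwise return the picture untouched."""
    if estimate_noise(img) < threshold:
        return img
    # 3x3 mean filter via shifted sums over an edge-padded copy
    pad = np.pad(img.astype(np.float64), 1, mode="edge")
    h, w = img.shape
    return sum(pad[i:i + h, j:j + w]
               for i in range(3) for j in range(3)) / 9.0
```

The same enable/disable decision could equally be made per region rather than per picture, matching the adaptivity described above.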
8 Encoding technologies
8.1 RoI-based quantization parameter adaption
One method that is available in many video coding standards is adaptive quantization parameter (QP). Here
the encoder can change the QP value at a sub-picture level, for example the coding tree unit (CTU) level, to
optimize the encoding for the application. Due to this versatility, adaptive QP can be used in many different
use cases to improve performance.
The decision on where to change the QP value and by how much can be made by the encoder based on an
analysis of the input video. Another option is to utilize the output of an external analyser such as described in
7.1. In this case, the encoder receives information about the positions and sizes of objects in each frame to
make a differentiation between foreground and background.
One option is to use a base QP value of the picture for areas that contain objects, i.e., foreground areas, and an
increased QP value for background areas, resulting in fewer bits being used to encode the background. As the
background is usually not critical to machine consumption, this is a straightforward way to reduce the bit
rate without affecting the machine consumption performance. As an extension, it can also be beneficial to
encode large objects with slightly higher QP values. As it is generally easier to detect larger objects, reducing
the bit rate for large objects usually does not reduce the performance of machine consumption.
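The construction of such a per-CTU QP map can be sketched as follows. This is an illustrative Python sketch: the offset values, the large-object threshold and the rectangle-to-CTU mapping are assumptions for illustration.

```python
def ctu_qp_map(width, height, ctu_size, base_qp, boxes,
               bg_offset=6, large_obj_offset=2, large_area=128 * 128):
    """Build a per-CTU QP map: background CTUs get base_qp + bg_offset,
    CTUs covered by a detected object get base_qp, and CTUs covered only
    by large objects get a slightly increased QP.
    boxes: list of (x, y, w, h) object rectangles in sample units."""
    cols = (width + ctu_size - 1) // ctu_size
    rows = (height + ctu_size - 1) // ctu_size
    qp = [[base_qp + bg_offset] * cols for _ in range(rows)]
    for x, y, w, h in boxes:
        obj_qp = base_qp + (large_obj_offset if w * h >= large_area else 0)
        for r in range(y // ctu_size, min(rows, (y + h - 1) // ctu_size + 1)):
            for c in range(x // ctu_size, min(cols, (x + w - 1) // ctu_size + 1)):
                # keep the lowest QP where objects overlap
                qp[r][c] = min(qp[r][c], obj_qp)
    return qp
```

The resulting map would be handed to the encoder's rate control as CTU-level delta QPs.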
However, it is noted that when utilizing the analysis, complexity can be traded against bit rate. For example, if
a light-weight neural network is used to perform the analysis, it is possible that not all relevant objects have
been found and thus it can be detrimental to reduce the quality for the background too much as there are
possibly objects that the initial analysis missed. These objects are possibly still important for machine
consumption and if the background is encoded in sufficient quality, the machine consumption network has a
chance of detecting objects in the background even if they are coded in lower quality than the foreground area.
On the other hand, if the encoding system has a lot of resources, it can employ a neural network of higher
complexity for the analysis. With a better and more certain analysis, the bit rate for the background can be
reduced more as there are likely fewer objects that have been missed in the initial analysis.
A more detailed description with a link to an implementation can be found in A.1.
8.2 Quantization step adjustment for temporal layers
It is a common practice that the encoder places different pictures on different temporal layers, i.e., assigning
different temporal identifiers (TID). This has the purpose of creating hierarchical structures that indicate from
which previously coded pictures the encoder can create predictions for the current picture. One aspect of these
hierarchical structures is that pictures cannot be referenced by other pictures on a higher temporal layer. This
way, it is not necessary to store every decoded picture in the decoded picture buffer. Another aspect of the
hierarchical structure is that pictures on higher temporal layers can be encoded with higher QP values, i.e.,
lower quality, as they will not be used, or less often used, as references by other pictures. An example of a
hierarchical structure is shown in Figure 4. The display order is left-to-right, and the numbers specify
the order of the coded pictures in the bitstream.
Figure 4 — An example hierarchical referencing structure in a Rec. ITU-T H.266 | ISO/IEC 23090-3
(VVC) random access configuration bitstream
This hierarchy of pictures can be exploited by encoding pictures on higher temporal layers using higher QP
values, i.e., reducing the number of bits spent on these pictures in total. This is also done in the case of the
common test conditions for standard dynamic range for Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC). In the
use case of coding video content for machine consumption, this characteristic can be exploited further by applying an even larger QP increase to pictures on higher temporal layers. Taking advantage of motion
compensation, many bits can be saved when compressing pictures in high temporal layers while these pictures
are still able to be reconstructed with high quality. Compressing the highest temporal layer at a high QP can be seen as similar in spirit to reducing the frame rate as discussed in 7.3. As an example, lowering
the bit rate substantially on every odd-numbered frame can be seen as a step towards completely removing
them.
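The layer-dependent QP assignment can be sketched for a dyadic GOP as follows. This is an illustrative Python sketch: a dyadic hierarchy of GOP size 8 is assumed, and the per-layer offset and the extra offset for the highest layer are arbitrary illustration values.

```python
import math

def temporal_qp_offsets(gop_size, base_qp, per_layer_offset=2, machine_boost=4):
    """Assign a QP per picture of one dyadic GOP from its temporal layer
    (GOP 8 gives TIDs 0,3,2,3,1,3,2,3 in display order).
    per_layer_offset mimics the usual hierarchical QP cascade;
    machine_boost is an extra offset applied only to the highest layer,
    in the spirit of the machine-consumption use case above."""
    top = int(math.log2(gop_size))
    qps = []
    for poc in range(gop_size):
        if poc == 0:
            tid = 0
        else:
            # position of the lowest set bit determines the layer
            tid = top - (poc & -poc).bit_length() + 1
        qp = base_qp + per_layer_offset * tid
        if tid == top:
            qp += machine_boost
        qps.append(qp)
    return qps
```

Setting machine_boost very high approaches the temporal subsampling of 7.3, since the highest-layer pictures then carry almost no bits.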
8.3 Chroma QP offset setting
Many machine analysis methods are performed using 4:4:4 colour format input data. Therefore, encoding in
4:2:0 colour format, which has lower chroma resolution than 4:4:4 colour format, can sometimes have a
negative impact on the machine analysis performance. This can sometimes be compensated for by using a
negative chroma QP offset, which increases the quality of the low-resolution chroma components.
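The effect of such an offset can be sketched as follows. This is a deliberately simplified Python sketch: the actual VVC derivation additionally applies a chroma QP mapping table, which is omitted here, and the clipping range assumes VVC's QP range of 0 to 63.

```python
def effective_chroma_qp(luma_qp, chroma_qp_offset, qp_max=63):
    """Simplified effective chroma QP: base luma QP plus a (typically
    negative) chroma QP offset, clipped to the valid range. A negative
    offset lowers the chroma QP and thus raises chroma quality."""
    return max(0, min(qp_max, luma_qp + chroma_qp_offset))
```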
9 Post-processing technologies
9.1 Temporal resampling
When the receiving system requires a specific frame rate that is different from the frame rate of the decoded
sequences, temporal resampling can be applied on the decoded video by utilizing conventional temporal filters
(e.g., motion compensated interpolation filters) or neural network-based filters, or just frame repetition. An
implementation example with more detailed description can be found in A.4.
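The frame-repetition and simple-interpolation options can be sketched as follows. This is an illustrative Python sketch: the pairwise average is a crude stand-in for motion-compensated interpolation, and doubling the frame rate is assumed.

```python
import numpy as np

def upsample_temporal(frames, method="repeat"):
    """Double the frame rate of a decoded sequence, either by repeating
    each frame or by inserting the average of each neighbouring pair
    (the last frame is paired with itself)."""
    out = []
    for i, f in enumerate(frames):
        out.append(f)
        nxt = frames[i + 1] if i + 1 < len(frames) else f
        if method == "repeat":
            out.append(f)
        else:  # simple blend, a crude stand-in for MC interpolation
            out.append((f.astype(np.float64) + nxt) / 2.0)
    return out
```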
9.2 Spatial resampling
Spatial resampling can for example be applied on the decoded video by utilizing conventional spatial filters
(e.g., the Rec. ITU-T H.266 | ISO/IEC 23090-3 (VVC) motion compensation interpolation filter or Rec. ITU-T
H.266 | ISO/IEC 23090-3 (VVC) reference picture resampling filter) or neural network-based filters.
9.3 Enhancement post-filtering
Video coding introduces quality degradation to the original video, which can consequently reduce the
performance of machine consumption. To improve the machine consumption performance, the receiver can
introduce a machine-oriented post-filtering network to enhance the decoded video before feeding it to the
machine consumption networks. Such a machine-oriented post-filtering network can be fixed at the receiver
side, i.e., without requiring any signalling in the coded bitstream or metadata. An illustration of this process is
shown in Figure 5.
Figure 5 — The processing order for an enhancement post-filter without additional signalling
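The fixed-filter case can be sketched as follows. This is a placeholder-level Python sketch: a real machine-oriented post-filter would be a trained neural network, whereas here a fixed 3x3 convolution (mild sharpening) stands in for it purely to show where the filter sits in the processing chain.

```python
import numpy as np

def postfilter(decoded, kernel=None):
    """Apply a fixed 3x3 convolution to a decoded single-channel picture.
    The default kernel sums to 1.0 (mild sharpening), so flat regions
    pass through unchanged."""
    if kernel is None:
        kernel = np.array([[0.0, -0.1, 0.0],
                           [-0.1, 1.4, -0.1],
                           [0.0, -0.1, 0.0]])
    pad = np.pad(decoded.astype(np.float64), 1, mode="edge")
    h, w = decoded.shape
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * pad[i:i + h, j:j + w]
    return out
```

The filtered picture, not the raw decoded one, would then be fed to the machine consumption network.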
Alternatively, the information of the post-filtering network can be signalled as metadata, for example using
the SEI messages described in 10.2. In this case, the encoder can train the post-filtering network
specifically for the machine consumption conducted in the decoder side if the machine consumption network
is known on the encoder side. An illustration of this process is shown in Figure 6. An example of an enhancement post-filter is described in A.3.
Figure 6 — The processing order for an enhancement post-filter with metadata signalling
10 Metadata
10.1 General
The following subclauses describe SEI messages that can be beneficial for interpreting coded video content
that has been optimized for machine analysis. Note that SEI messages do not impact normative decoder
behaviour, and are optional for decoders to implement.
10.2 Neural-network post-filter SEI message
The neural-network post-filter (NNPF) characteristics (NNPFC) SEI message and neural-network post-filter
activation (NNPFA) SEI message are specified in Rec. ITU-T H.274 | ISO/IEC 23002-7 (VSEI). These SEI
messages can be used for the purpose of machine consumption.
The NNPFC SEI message describes the characteristics of a neural network that can be used to filter pictures
after decoding and such filtering can be beneficial for machine consumption. This SEI message can contain an
indication of the purpose of
...















