All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/7] Live Migration With IAA
@ 2024-03-19 16:45 Yuan Liu
  2024-03-19 16:45 ` [PATCH v5 1/7] docs/migration: add qpl compression feature Yuan Liu
                   ` (7 more replies)
  0 siblings, 8 replies; 45+ messages in thread
From: Yuan Liu @ 2024-03-19 16:45 UTC (permalink / raw)
  To: peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, yuan1.liu, nanhai.zou

I am writing to submit a code change aimed at enhancing live migration
acceleration by leveraging the compression capability of the Intel
In-Memory Analytics Accelerator (IAA).

The implementation of the IAA (de)compression code is based on Intel Query
Processing Library (QPL), an open-source software project designed for
IAA high-level software programming. https://github.com/intel/qpl

I would like to summarize the progress so far
1. QPL will be used as an independent compression method like ZLIB and ZSTD,
   QPL will force the use of the IAA accelerator and will not support software
   compression. For a summary of issues compatible with Zlib, please refer to
   docs/devel/migration/qpl-compression.rst

2. Compression accelerator related patches are removed from this patch set and
   will be added to the QAT patch set, we will submit separate patches to use
   QAT to accelerate ZLIB and ZSTD.

3. Advantages of using IAA accelerator include:
   a. Compared with the non-compression method, it can improve downtime
      performance without adding additional host resources (both CPU and
      network).
   b. Compared with using software compression methods (ZSTD/ZLIB), it can
      provide high data compression ratio and save a lot of CPU resources
      used for compression.

Test condition:
  1. Host CPUs are based on Sapphire Rapids
  2. VM type, 16 vCPU and 64G memory
  3. The source and destination respectively use 4 IAA devices.
  4. The workload in the VM
    a. all vCPUs are idle state
    b. 90% of the virtual machine's memory is used, use silesia to fill
       the memory.
       The introduction of silesia:
       https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
  5. Set "--mem-prealloc" boot parameter on the destination, this parameter
     can make IAA performance better and related introduction is added here.
     docs/devel/migration/qpl-compression.rst
  6. Source migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter downtime-limit 300
     f. migrate_set_parameter max-bandwidth 100G/1G
     d. migrate_set_parameter multifd-compression none/qpl/zstd
  7. Destination migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter multifd-compression none/qpl/zstd

Early migration result, each result is the average of three tests

 +--------+-------------+--------+--------+---------+----------+------|
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+-----------------+---------+----------+------+
 |Network |            2|    8571|      69|    58391|   1896525|  256%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|    7180|      92|    69736|   1865640|  300%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|    7090|     121|    70562|   2174060|  307%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------|
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+-----------------+---------+----------+------+
 |Network |            2|    8413|      34|    30067|   1732411|  230%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|    6559|      32|    38804|   1689954|  450%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|    6623|      37|    38745|   1566507|  790%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------|
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+-----------------+---------+----------+------+
 |Network |            2|   95846|      24|     1800|    521829|  203%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|   49004|      24|     3529|    890532|  403%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   25574|      32|     6782|   1762222|  800%|
 +--------+-------------+--------+--------+---------+----------+------+

When network bandwidth resource is sufficient, QPL can improve downtime
by 2x compared to no compression. In this scenario, with 4 channels, the
IAA hardware resources are fully used, so adding more channels will not
gain more benefits.

 
 +--------+-------------+--------+--------+---------+----------+------|
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+-----------------+---------+----------+------+
 |Network |            2|   57758|      66|     8643|    264617|   34%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   57216|      58|     8726|    266773|   34%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   56708|      53|     8804|    270223|   33%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------|
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+-----------------+---------+----------+------+
 |Network |            2|   30129|      34|     8345|   2224761|   54%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   30317|      39|     8300|   2025220|   73%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   29615|      35|     8514|   2250122|  131%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------|
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+-----------------+---------+----------+------+
 |Network |            2|   95750|      24|     1802|    477236|  202%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   48907|      24|     3536|   1002142|  404%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   25568|      32|     6783|   1696437|  800%|
 +--------+-------------+--------+--------+---------+----------+------+

When network bandwidth resource is limited, the "page perf second" metric
decreases for none compression, the success rate of migration will reduce.
Comparison of QPL and ZSTD compression methods, QPL can save a lot of CPU
resources used for compression.

v2:
  - add support for multifd compression accelerator
  - add support for the QPL accelerator in the multifd
    compression accelerator
  - fixed the issue that QPL was compiled into the migration
    module by default

v3:
  - use Meson instead of pkg-config to resolve QPL build
    dependency issue
  - fix coding style
  - fix a CI issue for get_multifd_ops function in multifd.c file

v4:
  - patch based on commit: da96ad4a6a Merge tag 'hw-misc-20240215' of
    https://github.com/philmd/qemu into staging
  - remove the compression accelerator implementation patches, the patches
    will be placed in the QAT accelerator implementation.
  - introduce QPL as a new compression method
  - add QPL compression documentation
  - add QPL compression migration test
  - fix zlib/zstd compression level issue

v5:
  - patch based on v9.0.0-rc0 (c62d54d0a8)
  - use pkgconfig to check libaccel-config, libaccel-config is already
    in many distributions.
  - initialize the IOV of the sender by the specific compression method
  - refine the coding style
  - remove the zlib/zstd compression level not working patch, the issue
    has been solved

Yuan Liu (7):
  docs/migration: add qpl compression feature
  migration/multifd: put IOV initialization into compression method
  configure: add --enable-qpl build option
  migration/multifd: add qpl compression method
  migration/multifd: implement initialization of qpl compression
  migration/multifd: implement qpl compression and decompression
  tests/migration-test: add qpl compression test

 docs/devel/migration/features.rst        |   1 +
 docs/devel/migration/qpl-compression.rst | 231 +++++++++++
 hw/core/qdev-properties-system.c         |   2 +-
 meson.build                              |  16 +
 meson_options.txt                        |   2 +
 migration/meson.build                    |   1 +
 migration/multifd-qpl.c                  | 482 +++++++++++++++++++++++
 migration/multifd-zlib.c                 |   4 +
 migration/multifd-zstd.c                 |   6 +-
 migration/multifd.c                      |   8 +-
 migration/multifd.h                      |   1 +
 qapi/migration.json                      |   7 +-
 scripts/meson-buildoptions.sh            |   3 +
 tests/qtest/migration-test.c             |  24 ++
 14 files changed, 782 insertions(+), 6 deletions(-)
 create mode 100644 docs/devel/migration/qpl-compression.rst
 create mode 100644 migration/multifd-qpl.c

-- 
2.39.3



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 1/7] docs/migration: add qpl compression feature
  2024-03-19 16:45 [PATCH v5 0/7] Live Migration With IAA Yuan Liu
@ 2024-03-19 16:45 ` Yuan Liu
  2024-03-26 17:58   ` Peter Xu
  2024-03-19 16:45 ` [PATCH v5 2/7] migration/multifd: put IOV initialization into compression method Yuan Liu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Yuan Liu @ 2024-03-19 16:45 UTC (permalink / raw)
  To: peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, yuan1.liu, nanhai.zou

add Intel Query Processing Library (QPL) compression method
introduction

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 docs/devel/migration/features.rst        |   1 +
 docs/devel/migration/qpl-compression.rst | 231 +++++++++++++++++++++++
 2 files changed, 232 insertions(+)
 create mode 100644 docs/devel/migration/qpl-compression.rst

diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
index d5ca7b86d5..bc98b65075 100644
--- a/docs/devel/migration/features.rst
+++ b/docs/devel/migration/features.rst
@@ -12,3 +12,4 @@ Migration has plenty of features to support different use cases.
    virtio
    mapped-ram
    CPR
+   qpl-compression
diff --git a/docs/devel/migration/qpl-compression.rst b/docs/devel/migration/qpl-compression.rst
new file mode 100644
index 0000000000..42c7969d30
--- /dev/null
+++ b/docs/devel/migration/qpl-compression.rst
@@ -0,0 +1,231 @@
+===============
+QPL Compression
+===============
+The Intel Query Processing Library (Intel ``QPL``) is an open-source library to
+provide compression and decompression features and it is based on deflate
+compression algorithm (RFC 1951).
+
+The ``QPL`` compression relies on Intel In-Memory Analytics Accelerator(``IAA``)
+and Shared Virtual Memory(``SVM``) technology, they are new features supported
+from Intel 4th Gen Intel Xeon Scalable processors, codenamed Sapphire Rapids
+processor(``SPR``).
+
+For more ``QPL`` introduction, please refer to:
+
+https://intel.github.io/qpl/documentation/introduction_docs/introduction.html
+
+QPL Compression Framework
+=========================
+
+::
+
+  +----------------+       +------------------+
+  | MultiFD Service|       |accel-config tool |
+  +-------+--------+       +--------+---------+
+          |                         |
+          |                         |
+  +-------+--------+                | Setup IAA
+  |  QPL library   |                | Resources
+  +-------+---+----+                |
+          |   |                     |
+          |   +-------------+-------+
+          |   Open IAA      |
+          |   Devices +-----+-----+
+          |           |idxd driver|
+          |           +-----+-----+
+          |                 |
+          |                 |
+          |           +-----+-----+
+          +-----------+IAA Devices|
+      Submit jobs     +-----------+
+      via enqcmd
+
+
+Intel In-Memory Analytics Accelerator (Intel IAA) Introduction
+================================================================
+
+Intel ``IAA`` is an accelerator that has been designed to help benefit
+in-memory databases and analytic workloads. There are three main areas
+that Intel ``IAA`` can assist with analytics primitives (scan, filter, etc.),
+sparse data compression and memory tiering.
+
+``IAA`` Manual Documentation:
+
+https://www.intel.com/content/www/us/en/content-details/721858/intel-in-memory-analytics-accelerator-architecture-specification
+
+IAA Device Enabling
+-------------------
+
+- Enabling ``IAA`` devices for platform configuration, please refer to:
+
+https://www.intel.com/content/www/us/en/content-details/780887/intel-in-memory-analytics-accelerator-intel-iaa.html
+
+- ``IAA`` device driver is ``Intel Data Accelerator Driver (idxd)``, it is
+  recommended that the minimum version of Linux kernel is 5.18.
+
+- Add ``"intel_iommu=on,sm_on"`` parameter to kernel command line
+  for ``SVM`` feature enabling.
+
+Here is an easy way to verify ``IAA`` device driver and ``SVM``, refer to:
+
+https://github.com/intel/idxd-config/tree/stable/test
+
+IAA Device Management
+---------------------
+
+The number of ``IAA`` devices will vary depending on the Xeon product model.
+On a ``SPR`` server, there can be a maximum of 8 ``IAA`` devices, with up to
+4 devices per socket.
+
+By default, all ``IAA`` devices are disabled and need to be configured and
+enabled by users manually.
+
+Check the number of devices through the following command
+
+.. code-block:: shell
+
+  # lspci -d 8086:0cfe
+  # 6a:02.0 System peripheral: Intel Corporation Device 0cfe
+  # 6f:02.0 System peripheral: Intel Corporation Device 0cfe
+  # 74:02.0 System peripheral: Intel Corporation Device 0cfe
+  # 79:02.0 System peripheral: Intel Corporation Device 0cfe
+  # e7:02.0 System peripheral: Intel Corporation Device 0cfe
+  # ec:02.0 System peripheral: Intel Corporation Device 0cfe
+  # f1:02.0 System peripheral: Intel Corporation Device 0cfe
+  # f6:02.0 System peripheral: Intel Corporation Device 0cfe
+
+IAA Device Configuration
+------------------------
+
+The ``accel-config`` tool is used to enable ``IAA`` devices and configure
+``IAA`` hardware resources(work queues and engines). One ``IAA`` device
+has 8 work queues and 8 processing engines, multiple engines can be assigned
+to a work queue via ``group`` attribute.
+
+One example of configuring and enabling an ``IAA`` device.
+
+.. code-block:: shell
+
+  # accel-config config-engine iax1/engine1.0 -g 0
+  # accel-config config-engine iax1/engine1.1 -g 0
+  # accel-config config-engine iax1/engine1.2 -g 0
+  # accel-config config-engine iax1/engine1.3 -g 0
+  # accel-config config-engine iax1/engine1.4 -g 0
+  # accel-config config-engine iax1/engine1.5 -g 0
+  # accel-config config-engine iax1/engine1.6 -g 0
+  # accel-config config-engine iax1/engine1.7 -g 0
+  # accel-config config-wq iax1/wq1.0 -g 0 -s 128 -p 10 -b 1 -t 128 -m shared -y user -n app1 -d user
+  # accel-config enable-device iax1
+  # accel-config enable-wq iax1/wq1.0
+
+.. note::
+   IAX is an early name for IAA
+
+- The ``IAA`` device index is 1, use ``ls -lh /sys/bus/dsa/devices/iax*``
+  command to query the ``IAA`` device index.
+
+- 8 engines and 1 work queue are configured in group 0, so all compression jobs
+  submitted to this work queue can be processed by all engines at the same time.
+
+- Set work queue attributes including the work mode, work queue size and so on.
+
+- Enable the ``IAA1`` device and work queue 1.0
+
+.. note::
+  Set work queue mode to shared mode, since ``QPL`` library only supports
+  shared mode
+
+For more detailed configuration, please refer to:
+
+https://github.com/intel/idxd-config/tree/stable/Documentation/accfg
+
+IAA Resources Allocation For Migration
+--------------------------------------
+
+There is no ``IAA`` resource configuration parameters for migration and
+``accel-config`` tool configuration cannot directly specify the ``IAA``
+resources used for migration.
+
+``QPL`` will use all work queues that are enabled and set to shared mode,
+and use all engines assigned to the work queues with shared mode.
+
+By default, ``QPL`` will only use the local ``IAA`` device for compression
+job processing. The local ``IAA`` device means that the CPU of the job
+submission and the ``IAA`` device are on the same socket, so one CPU
+can submit the jobs to up to 4 ``IAA`` devices.
+
+Shared Virtual Memory(SVM) Introduction
+=======================================
+
+An ability for an accelerator I/O device to operate in the same virtual
+memory space of applications on host processors. It also implies the
+ability to operate from pageable memory, avoiding functional requirements
+to pin memory for DMA operations.
+
+When using ``SVM`` technology, users do not need to reserve memory for the
+``IAA`` device and perform pin memory operation. The ``IAA`` device can
+directly access data using the virtual address of the process.
+
+For more ``SVM`` technology, please refer to:
+
+https://docs.kernel.org/next/x86/sva.html
+
+
+How To Use QPL Compression In Migration
+=======================================
+
+1 - Installation of ``accel-config`` tool and ``QPL`` library
+
+  - Install ``accel-config`` tool from https://github.com/intel/idxd-config
+  - Install ``QPL`` library from https://github.com/intel/qpl
+
+2 - Configure and enable ``IAA`` devices and work queues via ``accel-config``
+
+3 - Build ``Qemu`` with ``--enable-qpl`` parameter
+
+  E.g. configure --target-list=x86_64-softmmu --enable-kvm ``--enable-qpl``
+
+4 - Start VMs with ``sudo`` command or ``root`` permission
+
+  Use the ``sudo`` command or ``root`` privilege to start the source and
+  destination virtual machines, since migration service needs permission
+  to access ``IAA`` hardware resources.
+
+5 - Enable ``QPL`` compression during migration
+
+  Set ``migrate_set_parameter multifd-compression qpl`` when migrating, the
+  ``QPL`` compression does not support configuring the compression level, it
+  only supports one compression level.
+
+The Difference Between QPL And ZLIB
+===================================
+
+Although both ``QPL`` and ``ZLIB`` are based on the deflate compression
+algorithm, and ``QPL`` can support the header and tail of ``ZLIB``, ``QPL``
+is still not fully compatible with the ``ZLIB`` compression in the migration.
+
+``QPL`` only supports 4K history buffer, and ``ZLIB`` is 32K by default. The
+``ZLIB`` compressed data that ``QPL`` may not decompress correctly and
+vice versa.
+
+``QPL`` does not support the ``Z_SYNC_FLUSH`` operation in ``ZLIB`` streaming
+compression, current ``ZLIB`` implementation uses ``Z_SYNC_FLUSH``, so each
+``multifd`` thread has a ``ZLIB`` streaming context, and all page compression
+and decompression are based on this stream. ``QPL`` cannot decompress such data
+and vice versa.
+
+The introduction for ``Z_SYNC_FLUSH``, please refer to:
+
+https://www.zlib.net/manual.html
+
+The Best Practices
+==================
+
+When the virtual machine's pages are not populated and the ``IAA`` device is
+used, I/O page faults occur, which can impact performance due to a large number
+of flush ``IOTLB`` operations.
+
+Since the normal pages on the source side are all populated, ``IOTLB`` caused
+by I/O page fault will not occur. On the destination side, a large number
+of normal pages need to be loaded, so it is recommended to add ``-mem-prealloc``
+parameter on the destination side.
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 2/7] migration/multifd: put IOV initialization into compression method
  2024-03-19 16:45 [PATCH v5 0/7] Live Migration With IAA Yuan Liu
  2024-03-19 16:45 ` [PATCH v5 1/7] docs/migration: add qpl compression feature Yuan Liu
@ 2024-03-19 16:45 ` Yuan Liu
  2024-03-20 15:18   ` Fabiano Rosas
  2024-03-19 16:45 ` [PATCH v5 3/7] configure: add --enable-qpl build option Yuan Liu
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Yuan Liu @ 2024-03-19 16:45 UTC (permalink / raw)
  To: peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, yuan1.liu, nanhai.zou

Different compression methods may require different numbers of IOVs.
Based on streaming compression of zlib and zstd, all pages will be
compressed to a data block, so two IOVs are needed for packet header
and compressed data block.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 migration/multifd-zlib.c | 4 ++++
 migration/multifd-zstd.c | 6 +++++-
 migration/multifd.c      | 8 +++++---
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 99821cd4d5..8095ef8e28 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -70,6 +70,10 @@ static int zlib_send_setup(MultiFDSendParams *p, Error **errp)
         goto err_free_zbuff;
     }
     p->compress_data = z;
+
+    assert(p->iov == NULL);
+    /* For packet header and zlib streaming compression block */
+    p->iov = g_new0(struct iovec, 2);
     return 0;
 
 err_free_zbuff:
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 02112255ad..9c9217794e 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -52,7 +52,6 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
     struct zstd_data *z = g_new0(struct zstd_data, 1);
     int res;
 
-    p->compress_data = z;
     z->zcs = ZSTD_createCStream();
     if (!z->zcs) {
         g_free(z);
@@ -77,6 +76,11 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
         error_setg(errp, "multifd %u: out of memory for zbuff", p->id);
         return -1;
     }
+    p->compress_data = z;
+
+    assert(p->iov == NULL);
+    /* For packet header and zstd streaming compression block */
+    p->iov = g_new0(struct iovec, 2);
     return 0;
 }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 0179422f6d..5155e02ae3 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1181,9 +1181,11 @@ bool multifd_send_setup(void)
             p->packet = g_malloc0(p->packet_len);
             p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
             p->packet->version = cpu_to_be32(MULTIFD_VERSION);
-
-            /* We need one extra place for the packet header */
-            p->iov = g_new0(struct iovec, page_count + 1);
+            /* IOVs are initialized in send_setup of compression method */
+            if (!migrate_multifd_compression()) {
+                /* We need one extra place for the packet header */
+                p->iov = g_new0(struct iovec, page_count + 1);
+            }
         } else {
             p->iov = g_new0(struct iovec, page_count);
         }
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 3/7] configure: add --enable-qpl build option
  2024-03-19 16:45 [PATCH v5 0/7] Live Migration With IAA Yuan Liu
  2024-03-19 16:45 ` [PATCH v5 1/7] docs/migration: add qpl compression feature Yuan Liu
  2024-03-19 16:45 ` [PATCH v5 2/7] migration/multifd: put IOV initialization into compression method Yuan Liu
@ 2024-03-19 16:45 ` Yuan Liu
  2024-03-20  8:55   ` Thomas Huth
  2024-03-20 10:31   ` Daniel P. Berrangé
  2024-03-19 16:45 ` [PATCH v5 4/7] migration/multifd: add qpl compression method Yuan Liu
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 45+ messages in thread
From: Yuan Liu @ 2024-03-19 16:45 UTC (permalink / raw)
  To: peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, yuan1.liu, nanhai.zou

add --enable-qpl and --disable-qpl options to enable and disable
the QPL compression method for multifd migration.

the Query Processing Library (QPL) is an open-source library
that supports data compression and decompression features.

The QPL compression is based on the deflate compression algorithm
and use Intel In-Memory Analytics Accelerator(IAA) hardware for
compression and decompression acceleration.

Please refer to the following for more information about QPL
https://intel.github.io/qpl/documentation/introduction_docs/introduction.html

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 meson.build                   | 16 ++++++++++++++++
 meson_options.txt             |  2 ++
 scripts/meson-buildoptions.sh |  3 +++
 3 files changed, 21 insertions(+)

diff --git a/meson.build b/meson.build
index b375248a76..bee7dcd53b 100644
--- a/meson.build
+++ b/meson.build
@@ -1200,6 +1200,20 @@ if not get_option('zstd').auto() or have_block
                     required: get_option('zstd'),
                     method: 'pkg-config')
 endif
+qpl = not_found
+if not get_option('qpl').auto()
+  libqpl = cc.find_library('qpl', required: false)
+  if not libqpl.found()
+    error('libqpl not found, please install it from ' +
+    'https://intel.github.io/qpl/documentation/get_started_docs/installation.html')
+  endif
+  libaccel = dependency('libaccel-config', version: '>=4.0.0',
+                        required: true,
+                        method: 'pkg-config')
+  qpl = declare_dependency(dependencies: [libqpl, libaccel,
+        cc.find_library('dl', required: get_option('qpl'))],
+        link_args: ['-lstdc++'])
+endif
 virgl = not_found
 
 have_vhost_user_gpu = have_tools and host_os == 'linux' and pixman.found()
@@ -2305,6 +2319,7 @@ config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim)
 config_host_data.set('CONFIG_STATX', has_statx)
 config_host_data.set('CONFIG_STATX_MNT_ID', has_statx_mnt_id)
 config_host_data.set('CONFIG_ZSTD', zstd.found())
+config_host_data.set('CONFIG_QPL', qpl.found())
 config_host_data.set('CONFIG_FUSE', fuse.found())
 config_host_data.set('CONFIG_FUSE_LSEEK', fuse_lseek.found())
 config_host_data.set('CONFIG_SPICE_PROTOCOL', spice_protocol.found())
@@ -4462,6 +4477,7 @@ summary_info += {'snappy support':    snappy}
 summary_info += {'bzip2 support':     libbzip2}
 summary_info += {'lzfse support':     liblzfse}
 summary_info += {'zstd support':      zstd}
+summary_info += {'Query Processing Library support': qpl}
 summary_info += {'NUMA host support': numa}
 summary_info += {'capstone':          capstone}
 summary_info += {'libpmem support':   libpmem}
diff --git a/meson_options.txt b/meson_options.txt
index 0a99a059ec..06cd675572 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -259,6 +259,8 @@ option('xkbcommon', type : 'feature', value : 'auto',
        description: 'xkbcommon support')
 option('zstd', type : 'feature', value : 'auto',
        description: 'zstd compression support')
+option('qpl', type : 'feature', value : 'auto',
+       description: 'Query Processing Library support')
 option('fuse', type: 'feature', value: 'auto',
        description: 'FUSE block device export')
 option('fuse_lseek', type : 'feature', value : 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 680fa3f581..784f74fde9 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -222,6 +222,7 @@ meson_options_help() {
   printf "%s\n" '                  Xen PCI passthrough support'
   printf "%s\n" '  xkbcommon       xkbcommon support'
   printf "%s\n" '  zstd            zstd compression support'
+  printf "%s\n" '  qpl             Query Processing Library support'
 }
 _meson_option_parse() {
   case $1 in
@@ -562,6 +563,8 @@ _meson_option_parse() {
     --disable-xkbcommon) printf "%s" -Dxkbcommon=disabled ;;
     --enable-zstd) printf "%s" -Dzstd=enabled ;;
     --disable-zstd) printf "%s" -Dzstd=disabled ;;
+    --enable-qpl) printf "%s" -Dqpl=enabled ;;
+    --disable-qpl) printf "%s" -Dqpl=disabled ;;
     *) return 1 ;;
   esac
 }
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 4/7] migration/multifd: add qpl compression method
  2024-03-19 16:45 [PATCH v5 0/7] Live Migration With IAA Yuan Liu
                   ` (2 preceding siblings ...)
  2024-03-19 16:45 ` [PATCH v5 3/7] configure: add --enable-qpl build option Yuan Liu
@ 2024-03-19 16:45 ` Yuan Liu
  2024-03-27 19:49   ` Peter Xu
  2024-03-19 16:45 ` [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression Yuan Liu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Yuan Liu @ 2024-03-19 16:45 UTC (permalink / raw)
  To: peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, yuan1.liu, nanhai.zou

add the Query Processing Library (QPL) compression method

Although both qpl and zlib support deflate compression, qpl will
only use the In-Memory Analytics Accelerator(IAA) for compression
and decompression, and IAA is not compatible with the Zlib in
migration, so qpl is used as a new compression method for migration.

How to enable qpl compression during migration:
migrate_set_parameter multifd-compression qpl

The qpl only supports one compression level, there is no qpl
compression level parameter added, users do not need to specify
the qpl compression level.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 hw/core/qdev-properties-system.c |  2 +-
 migration/meson.build            |  1 +
 migration/multifd-qpl.c          | 20 ++++++++++++++++++++
 migration/multifd.h              |  1 +
 qapi/migration.json              |  7 ++++++-
 5 files changed, 29 insertions(+), 2 deletions(-)
 create mode 100644 migration/multifd-qpl.c

diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index d79d6f4b53..6ccd7224f6 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -659,7 +659,7 @@ const PropertyInfo qdev_prop_fdc_drive_type = {
 const PropertyInfo qdev_prop_multifd_compression = {
     .name = "MultiFDCompression",
     .description = "multifd_compression values, "
-                   "none/zlib/zstd",
+                   "none/zlib/zstd/qpl",
     .enum_table = &MultiFDCompression_lookup,
     .get = qdev_propinfo_get_enum,
     .set = qdev_propinfo_set_enum,
diff --git a/migration/meson.build b/migration/meson.build
index 1eeb915ff6..cb177de1d2 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -41,6 +41,7 @@ if get_option('live_block_migration').allowed()
   system_ss.add(files('block.c'))
 endif
 system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
+system_ss.add(when: qpl, if_true: files('multifd-qpl.c'))
 
 specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
                 if_true: files('ram.c',
diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
new file mode 100644
index 0000000000..056a68a060
--- /dev/null
+++ b/migration/multifd-qpl.c
@@ -0,0 +1,20 @@
+/*
+ * Multifd qpl compression accelerator implementation
+ *
+ * Copyright (c) 2023 Intel Corporation
+ *
+ * Authors:
+ *  Yuan Liu<yuan1.liu@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qemu/module.h"
+
+static void multifd_qpl_register(void)
+{
+    /* noop */
+}
+
+migration_init(multifd_qpl_register);
diff --git a/migration/multifd.h b/migration/multifd.h
index c9d9b09239..5b7d9b15f8 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -40,6 +40,7 @@ MultiFDRecvData *multifd_get_recv_data(void);
 #define MULTIFD_FLAG_NOCOMP (0 << 1)
 #define MULTIFD_FLAG_ZLIB (1 << 1)
 #define MULTIFD_FLAG_ZSTD (2 << 1)
+#define MULTIFD_FLAG_QPL (4 << 1)
 
 /* This value needs to be a multiple of qemu_target_page_size() */
 #define MULTIFD_PACKET_SIZE (512 * 1024)
diff --git a/qapi/migration.json b/qapi/migration.json
index aa1b39bce1..dceb35db5b 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -629,11 +629,16 @@
 #
 # @zstd: use zstd compression method.
 #
+# @qpl: use qpl compression method. Query Processing Library(qpl) is based on
+#       the deflate compression algorithm and use the Intel In-Memory Analytics
+#       Accelerator(IAA) accelerated compression and decompression. (Since 9.0)
+#
 # Since: 5.0
 ##
 { 'enum': 'MultiFDCompression',
   'data': [ 'none', 'zlib',
-            { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
+            { 'name': 'zstd', 'if': 'CONFIG_ZSTD' },
+            { 'name': 'qpl', 'if': 'CONFIG_QPL' } ] }
 
 ##
 # @MigMode:
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-19 16:45 [PATCH v5 0/7] Live Migration With IAA Yuan Liu
                   ` (3 preceding siblings ...)
  2024-03-19 16:45 ` [PATCH v5 4/7] migration/multifd: add qpl compression method Yuan Liu
@ 2024-03-19 16:45 ` Yuan Liu
  2024-03-20 10:42   ` Daniel P. Berrangé
  2024-03-19 16:45 ` [PATCH v5 6/7] migration/multifd: implement qpl compression and decompression Yuan Liu
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Yuan Liu @ 2024-03-19 16:45 UTC (permalink / raw)
  To: peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, yuan1.liu, nanhai.zou

the qpl initialization includes memory allocation for compressed
data and the qpl job initialization.

the qpl initialization will check whether the In-Memory Analytics
Accelerator(IAA) hardware is available, if the platform does not
have IAA hardware or the IAA hardware is not available, the QPL
compression initialization will fail.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 migration/multifd-qpl.c | 243 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 242 insertions(+), 1 deletion(-)

diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
index 056a68a060..6de65e9da7 100644
--- a/migration/multifd-qpl.c
+++ b/migration/multifd-qpl.c
@@ -9,12 +9,253 @@
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
  */
+
 #include "qemu/osdep.h"
 #include "qemu/module.h"
+#include "qapi/error.h"
+#include "migration.h"
+#include "multifd.h"
+#include "qpl/qpl.h"
+
+typedef struct {
+    qpl_job **job_array;
+    /* the number of allocated jobs */
+    uint32_t job_num;
+    /* the size of data processed by a qpl job */
+    uint32_t data_size;
+    /* compressed data buffer */
+    uint8_t *zbuf;
+    /* the length of compressed data */
+    uint32_t *zbuf_hdr;
+} QplData;
+
+static void free_zbuf(QplData *qpl)
+{
+    if (qpl->zbuf != NULL) {
+        munmap(qpl->zbuf, qpl->job_num * qpl->data_size);
+        qpl->zbuf = NULL;
+    }
+    if (qpl->zbuf_hdr != NULL) {
+        g_free(qpl->zbuf_hdr);
+        qpl->zbuf_hdr = NULL;
+    }
+}
+
+static int alloc_zbuf(QplData *qpl, uint8_t chan_id, Error **errp)
+{
+    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
+    uint32_t size = qpl->job_num * qpl->data_size;
+    uint8_t *buf;
+
+    buf = (uint8_t *) mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
+    if (buf == MAP_FAILED) {
+        error_setg(errp, "multifd: %u: alloc_zbuf failed, job num %u, size %u",
+                   chan_id, qpl->job_num, qpl->data_size);
+        return -1;
+    }
+    qpl->zbuf = buf;
+    qpl->zbuf_hdr = g_new0(uint32_t, qpl->job_num);
+    return 0;
+}
+
+static void free_jobs(QplData *qpl)
+{
+    for (int i = 0; i < qpl->job_num; i++) {
+        qpl_fini_job(qpl->job_array[i]);
+        g_free(qpl->job_array[i]);
+        qpl->job_array[i] = NULL;
+    }
+    g_free(qpl->job_array);
+    qpl->job_array = NULL;
+}
+
+static int alloc_jobs(QplData *qpl, uint8_t chan_id, Error **errp)
+{
+    qpl_status status;
+    uint32_t job_size = 0;
+    qpl_job *job = NULL;
+    /* always use IAA hardware accelerator */
+    qpl_path_t path = qpl_path_hardware;
+
+    status = qpl_get_job_size(path, &job_size);
+    if (status != QPL_STS_OK) {
+        error_setg(errp, "multifd: %u: qpl_get_job_size failed with error %d",
+                   chan_id, status);
+        return -1;
+    }
+    qpl->job_array = g_new0(qpl_job *, qpl->job_num);
+    for (int i = 0; i < qpl->job_num; i++) {
+        job = g_malloc0(job_size);
+        status = qpl_init_job(path, job);
+        if (status != QPL_STS_OK) {
+            error_setg(errp, "multifd: %u: qpl_init_job failed with error %d",
+                       chan_id, status);
+            free_jobs(qpl);
+            return -1;
+        }
+        qpl->job_array[i] = job;
+    }
+    return 0;
+}
+
+static int init_qpl(QplData *qpl, uint32_t job_num, uint32_t data_size,
+                    uint8_t chan_id, Error **errp)
+{
+    qpl->job_num = job_num;
+    qpl->data_size = data_size;
+    if (alloc_zbuf(qpl, chan_id, errp) != 0) {
+        return -1;
+    }
+    if (alloc_jobs(qpl, chan_id, errp) != 0) {
+        free_zbuf(qpl);
+        return -1;
+    }
+    return 0;
+}
+
+static void deinit_qpl(QplData *qpl)
+{
+    if (qpl != NULL) {
+        free_jobs(qpl);
+        free_zbuf(qpl);
+        qpl->job_num = 0;
+        qpl->data_size = 0;
+    }
+}
+
+/**
+ * qpl_send_setup: setup send side
+ *
+ * Setup each channel with QPL compression.
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int qpl_send_setup(MultiFDSendParams *p, Error **errp)
+{
+    QplData *qpl;
+
+    qpl = g_new0(QplData, 1);
+    if (init_qpl(qpl, p->page_count, p->page_size, p->id, errp) != 0) {
+        g_free(qpl);
+        return -1;
+    }
+    p->compress_data = qpl;
+
+    assert(p->iov == NULL);
+    /*
+     * Each page will be compressed independently and sent using an IOV. The
+     * additional two IOVs are used to store packet header and compressed data
+     * length
+     */
+    p->iov = g_new0(struct iovec, p->page_count + 2);
+    return 0;
+}
+
+/**
+ * qpl_send_cleanup: cleanup send side
+ *
+ * Close the channel and return memory.
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static void qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
+{
+    QplData *qpl = p->compress_data;
+
+    deinit_qpl(qpl);
+    g_free(p->compress_data);
+    p->compress_data = NULL;
+}
+
+/**
+ * qpl_send_prepare: prepare data to be able to send
+ *
+ * Create a compressed buffer with all the pages that we are going to
+ * send.
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int qpl_send_prepare(MultiFDSendParams *p, Error **errp)
+{
+    /* Implement in next patch */
+    return -1;
+}
+
+/**
+ * qpl_recv_setup: setup receive side
+ *
+ * Create the compressed channel and buffer.
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int qpl_recv_setup(MultiFDRecvParams *p, Error **errp)
+{
+    QplData *qpl;
+
+    qpl = g_new0(QplData, 1);
+    if (init_qpl(qpl, p->page_count, p->page_size, p->id, errp) != 0) {
+        g_free(qpl);
+        return -1;
+    }
+    p->compress_data = qpl;
+    return 0;
+}
+
+/**
+ * qpl_recv_cleanup: setup receive side
+ *
+ * Close the channel and return memory.
+ *
+ * @p: Params for the channel that we are using
+ */
+static void qpl_recv_cleanup(MultiFDRecvParams *p)
+{
+    QplData *qpl = p->compress_data;
+
+    deinit_qpl(qpl);
+    g_free(p->compress_data);
+    p->compress_data = NULL;
+}
+
+/**
+ * qpl_recv: read the data from the channel into actual pages
+ *
+ * Read the compressed buffer, and uncompress it into the actual
+ * pages.
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int qpl_recv(MultiFDRecvParams *p, Error **errp)
+{
+    /* Implement in next patch */
+    return -1;
+}
+
+static MultiFDMethods multifd_qpl_ops = {
+    .send_setup = qpl_send_setup,
+    .send_cleanup = qpl_send_cleanup,
+    .send_prepare = qpl_send_prepare,
+    .recv_setup = qpl_recv_setup,
+    .recv_cleanup = qpl_recv_cleanup,
+    .recv = qpl_recv,
+};
 
 static void multifd_qpl_register(void)
 {
-    /* noop */
+    multifd_register_ops(MULTIFD_COMPRESSION_QPL, &multifd_qpl_ops);
 }
 
 migration_init(multifd_qpl_register);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 6/7] migration/multifd: implement qpl compression and decompression
  2024-03-19 16:45 [PATCH v5 0/7] Live Migration With IAA Yuan Liu
                   ` (4 preceding siblings ...)
  2024-03-19 16:45 ` [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression Yuan Liu
@ 2024-03-19 16:45 ` Yuan Liu
  2024-03-19 16:45 ` [PATCH v5 7/7] tests/migration-test: add qpl compression test Yuan Liu
  2024-03-26 20:30 ` [PATCH v5 0/7] Live Migration With IAA Peter Xu
  7 siblings, 0 replies; 45+ messages in thread
From: Yuan Liu @ 2024-03-19 16:45 UTC (permalink / raw)
  To: peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, yuan1.liu, nanhai.zou

each qpl job is used to (de)compress a normal page and it can
be processed independently by the IAA hardware. All qpl jobs
are submitted to the hardware at once, and wait for all jobs
completion.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 migration/multifd-qpl.c | 229 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 225 insertions(+), 4 deletions(-)

diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
index 6de65e9da7..479b051b24 100644
--- a/migration/multifd-qpl.c
+++ b/migration/multifd-qpl.c
@@ -13,6 +13,7 @@
 #include "qemu/osdep.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
+#include "exec/ramblock.h"
 #include "migration.h"
 #include "multifd.h"
 #include "qpl/qpl.h"
@@ -171,6 +172,112 @@ static void qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
     p->compress_data = NULL;
 }
 
+static inline void prepare_job(qpl_job *job, uint8_t *input, uint32_t input_len,
+                               uint8_t *output, uint32_t output_len,
+                               bool is_compression)
+{
+    job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
+    job->next_in_ptr = input;
+    job->next_out_ptr = output;
+    job->available_in = input_len;
+    job->available_out = output_len;
+    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
+    /* only supports one compression level */
+    job->level = 1;
+}
+
+/**
+ * set_raw_data_hdr: set the length of raw data
+ *
+ * If the length of the compressed output data is greater than or equal to
+ * the page size, then set the compressed data length to the data size and
+ * send raw data directly.
+ *
+ * @qpl: pointer to the QplData structure
+ * @index: the index of the compression job header
+ */
+static inline void set_raw_data_hdr(QplData *qpl, uint32_t index)
+{
+    assert(index < qpl->job_num);
+    qpl->zbuf_hdr[index] = cpu_to_be32(qpl->data_size);
+}
+
+/**
+ * is_raw_data: check if the data is raw data
+ *
+ * The raw data length is always equal to data size, which is the
+ * size of one page.
+ *
+ * Returns true if the data is raw data, otherwise false
+ *
+ * @qpl: pointer to the QplData structure
+ * @index: the index of the decompressed job header
+ */
+static inline bool is_raw_data(QplData *qpl, uint32_t index)
+{
+    assert(index < qpl->job_num);
+    return qpl->zbuf_hdr[index] == qpl->data_size;
+}
+
+static int run_comp_jobs(MultiFDSendParams *p, Error **errp)
+{
+    qpl_status status;
+    QplData *qpl = p->compress_data;
+    MultiFDPages_t *pages = p->pages;
+    uint32_t job_num = pages->normal_num;
+    qpl_job *job = NULL;
+    uint32_t off = 0;
+
+    assert(job_num <= qpl->job_num);
+    /* submit all compression jobs */
+    for (int i = 0; i < job_num; i++) {
+        job = qpl->job_array[i];
+        /* the compressed data size should be less than one page */
+        prepare_job(job, pages->block->host + pages->offset[i], qpl->data_size,
+                    qpl->zbuf + off, qpl->data_size - 1, true);
+retry:
+        status = qpl_submit_job(job);
+        if (status == QPL_STS_OK) {
+            off += qpl->data_size;
+        } else if (status == QPL_STS_QUEUES_ARE_BUSY_ERR) {
+            goto retry;
+        } else {
+            error_setg(errp, "multifd %u: qpl_submit_job failed with error %d",
+                       p->id, status);
+            return -1;
+        }
+    }
+
+    /* wait all jobs to complete */
+    for (int i = 0; i < job_num; i++) {
+        job = qpl->job_array[i];
+        status = qpl_wait_job(job);
+        if (status == QPL_STS_OK) {
+            qpl->zbuf_hdr[i] = cpu_to_be32(job->total_out);
+            p->iov[p->iovs_num].iov_len = job->total_out;
+            p->iov[p->iovs_num].iov_base = qpl->zbuf + (qpl->data_size * i);
+            p->next_packet_size += job->total_out;
+        } else if (status == QPL_STS_MORE_OUTPUT_NEEDED) {
+            /*
+             * the compression job does not fail, the output data
+             * size is larger than the provided memory size. In this
+             * case, raw data is sent directly to the destination.
+             */
+            set_raw_data_hdr(qpl, i);
+            p->iov[p->iovs_num].iov_len = qpl->data_size;
+            p->iov[p->iovs_num].iov_base = pages->block->host +
+                                           pages->offset[i];
+            p->next_packet_size += qpl->data_size;
+        } else {
+            error_setg(errp, "multifd %u: qpl_wait_job failed with error %d",
+                       p->id, status);
+            return -1;
+        }
+        p->iovs_num++;
+    }
+    return 0;
+}
+
 /**
  * qpl_send_prepare: prepare data to be able to send
  *
@@ -184,8 +291,28 @@ static void qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
  */
 static int qpl_send_prepare(MultiFDSendParams *p, Error **errp)
 {
-    /* Implement in next patch */
-    return -1;
+    QplData *qpl = p->compress_data;
+    uint32_t hdr_size;
+
+    if (!multifd_send_prepare_common(p)) {
+        goto out;
+    }
+
+    assert(p->pages->normal_num <= qpl->job_num);
+    hdr_size = p->pages->normal_num * sizeof(uint32_t);
+    /* prepare the header that stores the lengths of all compressed data */
+    p->iov[1].iov_base = (uint8_t *) qpl->zbuf_hdr;
+    p->iov[1].iov_len = hdr_size;
+    p->iovs_num++;
+    p->next_packet_size += hdr_size;
+    if (run_comp_jobs(p, errp) != 0) {
+        return -1;
+    }
+
+out:
+    p->flags |= MULTIFD_FLAG_QPL;
+    multifd_send_fill_packet(p);
+    return 0;
 }
 
 /**
@@ -227,6 +354,60 @@ static void qpl_recv_cleanup(MultiFDRecvParams *p)
     p->compress_data = NULL;
 }
 
+static int run_decomp_jobs(MultiFDRecvParams *p, Error **errp)
+{
+    qpl_status status;
+    qpl_job *job;
+    QplData *qpl = p->compress_data;
+    uint32_t off = 0;
+    uint32_t job_num = p->normal_num;
+
+    assert(job_num <= qpl->job_num);
+    /* submit all decompression jobs */
+    for (int i = 0; i < job_num; i++) {
+        /* for the raw data, load it directly */
+        if (is_raw_data(qpl, i)) {
+            memcpy(p->host + p->normal[i], qpl->zbuf + off, qpl->data_size);
+            off += qpl->data_size;
+            continue;
+        }
+        job = qpl->job_array[i];
+        prepare_job(job, qpl->zbuf + off, qpl->zbuf_hdr[i],
+                    p->host + p->normal[i], qpl->data_size, false);
+retry:
+        status = qpl_submit_job(job);
+        if (status == QPL_STS_OK) {
+            off += qpl->zbuf_hdr[i];
+        } else if (status == QPL_STS_QUEUES_ARE_BUSY_ERR) {
+            goto retry;
+        } else {
+            error_setg(errp, "multifd %u: qpl_submit_job failed with error %d",
+                       p->id, status);
+            return -1;
+        }
+    }
+
+    /* wait all jobs to complete */
+    for (int i = 0; i < job_num; i++) {
+        if (is_raw_data(qpl, i)) {
+            continue;
+        }
+        job = qpl->job_array[i];
+        status = qpl_wait_job(job);
+        if (status != QPL_STS_OK) {
+            error_setg(errp, "multifd %u: qpl_wait_job failed with error %d",
+                       p->id, status);
+            return -1;
+        }
+        if (job->total_out != qpl->data_size) {
+            error_setg(errp, "multifd %u: decompressed len %u, expected len %u",
+                       p->id, job->total_out, qpl->data_size);
+            return -1;
+        }
+    }
+    return 0;
+}
+
 /**
  * qpl_recv: read the data from the channel into actual pages
  *
@@ -240,8 +421,48 @@ static void qpl_recv_cleanup(MultiFDRecvParams *p)
  */
 static int qpl_recv(MultiFDRecvParams *p, Error **errp)
 {
-    /* Implement in next patch */
-    return -1;
+    QplData *qpl = p->compress_data;
+    uint32_t in_size = p->next_packet_size;
+    uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
+    uint32_t hdr_len = p->normal_num * sizeof(uint32_t);
+    uint32_t data_len = 0;
+    int ret;
+
+    if (flags != MULTIFD_FLAG_QPL) {
+        error_setg(errp, "multifd %u: flags received %x flags expected %x",
+                   p->id, flags, MULTIFD_FLAG_QPL);
+        return -1;
+    }
+    multifd_recv_zero_page_process(p);
+    if (!p->normal_num) {
+        assert(in_size == 0);
+        return 0;
+    }
+
+    /* read compressed data lengths */
+    assert(hdr_len < in_size);
+    ret = qio_channel_read_all(p->c, (void *) qpl->zbuf_hdr, hdr_len, errp);
+    if (ret != 0) {
+        return ret;
+    }
+    assert(p->normal_num <= qpl->job_num);
+    for (int i = 0; i < p->normal_num; i++) {
+        qpl->zbuf_hdr[i] = be32_to_cpu(qpl->zbuf_hdr[i]);
+        data_len += qpl->zbuf_hdr[i];
+        assert(qpl->zbuf_hdr[i] <= qpl->data_size);
+    }
+
+    /* read compressed data */
+    assert(in_size == hdr_len + data_len);
+    ret = qio_channel_read_all(p->c, (void *) qpl->zbuf, data_len, errp);
+    if (ret != 0) {
+        return ret;
+    }
+
+    if (run_decomp_jobs(p, errp) != 0) {
+        return -1;
+    }
+    return 0;
 }
 
 static MultiFDMethods multifd_qpl_ops = {
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 7/7] tests/migration-test: add qpl compression test
  2024-03-19 16:45 [PATCH v5 0/7] Live Migration With IAA Yuan Liu
                   ` (5 preceding siblings ...)
  2024-03-19 16:45 ` [PATCH v5 6/7] migration/multifd: implement qpl compression and decompression Yuan Liu
@ 2024-03-19 16:45 ` Yuan Liu
  2024-03-20 10:45   ` Daniel P. Berrangé
  2024-03-26 20:30 ` [PATCH v5 0/7] Live Migration With IAA Peter Xu
  7 siblings, 1 reply; 45+ messages in thread
From: Yuan Liu @ 2024-03-19 16:45 UTC (permalink / raw)
  To: peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, yuan1.liu, nanhai.zou

add qpl to compression method test for multifd migration

the migration with qpl compression needs to access IAA hardware
resource, please run "check-qtest" with sudo or root permission,
otherwise migration test will fail

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 tests/qtest/migration-test.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 71895abb7f..052d0d60fd 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2815,6 +2815,15 @@ test_migrate_precopy_tcp_multifd_zstd_start(QTestState *from,
 }
 #endif /* CONFIG_ZSTD */
 
+#ifdef CONFIG_QPL
+static void *
+test_migrate_precopy_tcp_multifd_qpl_start(QTestState *from,
+                                            QTestState *to)
+{
+    return test_migrate_precopy_tcp_multifd_start_common(from, to, "qpl");
+}
+#endif /* CONFIG_QPL */
+
 static void test_multifd_tcp_none(void)
 {
     MigrateCommon args = {
@@ -2880,6 +2889,17 @@ static void test_multifd_tcp_zstd(void)
 }
 #endif
 
+#ifdef CONFIG_QPL
+static void test_multifd_tcp_qpl(void)
+{
+    MigrateCommon args = {
+        .listen_uri = "defer",
+        .start_hook = test_migrate_precopy_tcp_multifd_qpl_start,
+    };
+    test_precopy_common(&args);
+}
+#endif
+
 #ifdef CONFIG_GNUTLS
 static void *
 test_migrate_multifd_tcp_tls_psk_start_match(QTestState *from,
@@ -3789,6 +3809,10 @@ int main(int argc, char **argv)
     migration_test_add("/migration/multifd/tcp/plain/zstd",
                        test_multifd_tcp_zstd);
 #endif
+#ifdef CONFIG_QPL
+    migration_test_add("/migration/multifd/tcp/plain/qpl",
+                       test_multifd_tcp_qpl);
+#endif
 #ifdef CONFIG_GNUTLS
     migration_test_add("/migration/multifd/tcp/tls/psk/match",
                        test_multifd_tcp_tls_psk_match);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 3/7] configure: add --enable-qpl build option
  2024-03-19 16:45 ` [PATCH v5 3/7] configure: add --enable-qpl build option Yuan Liu
@ 2024-03-20  8:55   ` Thomas Huth
  2024-03-20  8:56     ` Thomas Huth
  2024-03-20 10:31   ` Daniel P. Berrangé
  1 sibling, 1 reply; 45+ messages in thread
From: Thomas Huth @ 2024-03-20  8:55 UTC (permalink / raw)
  To: Yuan Liu, peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, nanhai.zou

On 19/03/2024 17.45, Yuan Liu wrote:
> add --enable-qpl and --disable-qpl options to enable and disable
> the QPL compression method for multifd migration.
> 
> the Query Processing Library (QPL) is an open-source library
> that supports data compression and decompression features.
> 
> The QPL compression is based on the deflate compression algorithm
> and use Intel In-Memory Analytics Accelerator(IAA) hardware for
> compression and decompression acceleration.
> 
> Please refer to the following for more information about QPL
> https://intel.github.io/qpl/documentation/introduction_docs/introduction.html
> 
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> ---
>   meson.build                   | 16 ++++++++++++++++
>   meson_options.txt             |  2 ++
>   scripts/meson-buildoptions.sh |  3 +++
>   3 files changed, 21 insertions(+)
> 
> diff --git a/meson.build b/meson.build
> index b375248a76..bee7dcd53b 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1200,6 +1200,20 @@ if not get_option('zstd').auto() or have_block
>                       required: get_option('zstd'),
>                       method: 'pkg-config')
>   endif
> +qpl = not_found
> +if not get_option('qpl').auto()

Do you really only want to enable this if the user explicitly specified 
"--enable-qpl" ? Otherwise, I think this should be:

  if not get_option('qpl').auto() or have_system

?

  Thomas




> +  libqpl = cc.find_library('qpl', required: false)
> +  if not libqpl.found()
> +    error('libqpl not found, please install it from ' +
> +    'https://intel.github.io/qpl/documentation/get_started_docs/installation.html')
> +  endif
> +  libaccel = dependency('libaccel-config', version: '>=4.0.0',
> +                        required: true,
> +                        method: 'pkg-config')
> +  qpl = declare_dependency(dependencies: [libqpl, libaccel,
> +        cc.find_library('dl', required: get_option('qpl'))],
> +        link_args: ['-lstdc++'])
> +endif
>   virgl = not_found



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 3/7] configure: add --enable-qpl build option
  2024-03-20  8:55   ` Thomas Huth
@ 2024-03-20  8:56     ` Thomas Huth
  2024-03-20 14:34       ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Huth @ 2024-03-20  8:56 UTC (permalink / raw)
  To: Yuan Liu, peterx, farosas; +Cc: qemu-devel, hao.xiang, bryan.zhang, nanhai.zou

On 20/03/2024 09.55, Thomas Huth wrote:
> On 19/03/2024 17.45, Yuan Liu wrote:
>> add --enable-qpl and --disable-qpl options to enable and disable
>> the QPL compression method for multifd migration.
>>
>> the Query Processing Library (QPL) is an open-source library
>> that supports data compression and decompression features.
>>
>> The QPL compression is based on the deflate compression algorithm
>> and use Intel In-Memory Analytics Accelerator(IAA) hardware for
>> compression and decompression acceleration.
>>
>> Please refer to the following for more information about QPL
>> https://intel.github.io/qpl/documentation/introduction_docs/introduction.html
>>
>> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
>> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
>> ---
>>   meson.build                   | 16 ++++++++++++++++
>>   meson_options.txt             |  2 ++
>>   scripts/meson-buildoptions.sh |  3 +++
>>   3 files changed, 21 insertions(+)
>>
>> diff --git a/meson.build b/meson.build
>> index b375248a76..bee7dcd53b 100644
>> --- a/meson.build
>> +++ b/meson.build
>> @@ -1200,6 +1200,20 @@ if not get_option('zstd').auto() or have_block
>>                       required: get_option('zstd'),
>>                       method: 'pkg-config')
>>   endif
>> +qpl = not_found
>> +if not get_option('qpl').auto()
> 
> Do you really only want to enable this if the user explicitly specified 
> "--enable-qpl" ? Otherwise, I think this should be:
> 
>   if not get_option('qpl').auto() or have_system
> 
> ?
> 
>   Thomas
> 
> 
> 
> 
>> +  libqpl = cc.find_library('qpl', required: false)

... and it should use "required: get_option('qpl')" in that case.

  Thomas


>> +  if not libqpl.found()
>> +    error('libqpl not found, please install it from ' +
>> +    
>> 'https://intel.github.io/qpl/documentation/get_started_docs/installation.html')
>> +  endif
>> +  libaccel = dependency('libaccel-config', version: '>=4.0.0',
>> +                        required: true,
>> +                        method: 'pkg-config')
>> +  qpl = declare_dependency(dependencies: [libqpl, libaccel,
>> +        cc.find_library('dl', required: get_option('qpl'))],
>> +        link_args: ['-lstdc++'])
>> +endif
>>   virgl = not_found
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 3/7] configure: add --enable-qpl build option
  2024-03-19 16:45 ` [PATCH v5 3/7] configure: add --enable-qpl build option Yuan Liu
  2024-03-20  8:55   ` Thomas Huth
@ 2024-03-20 10:31   ` Daniel P. Berrangé
  2024-03-20 14:42     ` Liu, Yuan1
  1 sibling, 1 reply; 45+ messages in thread
From: Daniel P. Berrangé @ 2024-03-20 10:31 UTC (permalink / raw)
  To: Yuan Liu; +Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, nanhai.zou

On Wed, Mar 20, 2024 at 12:45:23AM +0800, Yuan Liu wrote:
> add --enable-qpl and --disable-qpl options to enable and disable
> the QPL compression method for multifd migration.
> 
> the Query Processing Library (QPL) is an open-source library
> that supports data compression and decompression features.
> 
> The QPL compression is based on the deflate compression algorithm
> and use Intel In-Memory Analytics Accelerator(IAA) hardware for
> compression and decompression acceleration.
> 
> Please refer to the following for more information about QPL
> https://intel.github.io/qpl/documentation/introduction_docs/introduction.html
> 
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> ---
>  meson.build                   | 16 ++++++++++++++++
>  meson_options.txt             |  2 ++
>  scripts/meson-buildoptions.sh |  3 +++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/meson.build b/meson.build
> index b375248a76..bee7dcd53b 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1200,6 +1200,20 @@ if not get_option('zstd').auto() or have_block
>                      required: get_option('zstd'),
>                      method: 'pkg-config')
>  endif
> +qpl = not_found
> +if not get_option('qpl').auto()
> +  libqpl = cc.find_library('qpl', required: false)
> +  if not libqpl.found()
> +    error('libqpl not found, please install it from ' +
> +    'https://intel.github.io/qpl/documentation/get_started_docs/installation.html')
> +  endif
> +  libaccel = dependency('libaccel-config', version: '>=4.0.0',
> +                        required: true,
> +                        method: 'pkg-config')
> +  qpl = declare_dependency(dependencies: [libqpl, libaccel,
> +        cc.find_library('dl', required: get_option('qpl'))],
> +        link_args: ['-lstdc++'])
> +endif

Are either of these libraries present in any mainstream Linux
distro ? If not, then this feature will not get any CI build
coverage from QEMU.

Needing to manually add '-lstdc++' & '-ldl' is presumably a
requirement from 'qpl'. As a future enhancement it would be
much better if 'qpl' provided a pkg-config file, so this
list of dependencies didn't have to be hardcoded by apps
using qpl


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-19 16:45 ` [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression Yuan Liu
@ 2024-03-20 10:42   ` Daniel P. Berrangé
  2024-03-20 15:02     ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Daniel P. Berrangé @ 2024-03-20 10:42 UTC (permalink / raw)
  To: Yuan Liu; +Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, nanhai.zou

On Wed, Mar 20, 2024 at 12:45:25AM +0800, Yuan Liu wrote:
> the qpl initialization includes memory allocation for compressed
> data and the qpl job initialization.
> 
> the qpl initialization will check whether the In-Memory Analytics
> Accelerator(IAA) hardware is available, if the platform does not
> have IAA hardware or the IAA hardware is not available, the QPL
> compression initialization will fail.
> 
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> ---
>  migration/multifd-qpl.c | 243 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 242 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> index 056a68a060..6de65e9da7 100644
> --- a/migration/multifd-qpl.c
> +++ b/migration/multifd-qpl.c
> @@ -9,12 +9,253 @@
>   * This work is licensed under the terms of the GNU GPL, version 2 or later.
>   * See the COPYING file in the top-level directory.
>   */
> +
>  #include "qemu/osdep.h"
>  #include "qemu/module.h"
> +#include "qapi/error.h"
> +#include "migration.h"
> +#include "multifd.h"
> +#include "qpl/qpl.h"
> +
> +typedef struct {
> +    qpl_job **job_array;
> +    /* the number of allocated jobs */
> +    uint32_t job_num;
> +    /* the size of data processed by a qpl job */
> +    uint32_t data_size;
> +    /* compressed data buffer */
> +    uint8_t *zbuf;
> +    /* the length of compressed data */
> +    uint32_t *zbuf_hdr;
> +} QplData;
> +
> +static void free_zbuf(QplData *qpl)
> +{
> +    if (qpl->zbuf != NULL) {
> +        munmap(qpl->zbuf, qpl->job_num * qpl->data_size);
> +        qpl->zbuf = NULL;
> +    }
> +    if (qpl->zbuf_hdr != NULL) {
> +        g_free(qpl->zbuf_hdr);
> +        qpl->zbuf_hdr = NULL;
> +    }
> +}
> +
> +static int alloc_zbuf(QplData *qpl, uint8_t chan_id, Error **errp)
> +{
> +    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
> +    uint32_t size = qpl->job_num * qpl->data_size;
> +    uint8_t *buf;
> +
> +    buf = (uint8_t *) mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
> +    if (buf == MAP_FAILED) {
> +        error_setg(errp, "multifd: %u: alloc_zbuf failed, job num %u, size %u",
> +                   chan_id, qpl->job_num, qpl->data_size);
> +        return -1;
> +    }

What's the reason for using mmap here, rather than a normal
malloc ?

> +    qpl->zbuf = buf;
> +    qpl->zbuf_hdr = g_new0(uint32_t, qpl->job_num);
> +    return 0;
> +}
> +
> +static void free_jobs(QplData *qpl)
> +{
> +    for (int i = 0; i < qpl->job_num; i++) {
> +        qpl_fini_job(qpl->job_array[i]);
> +        g_free(qpl->job_array[i]);
> +        qpl->job_array[i] = NULL;
> +    }
> +    g_free(qpl->job_array);
> +    qpl->job_array = NULL;
> +}
> +
> +static int alloc_jobs(QplData *qpl, uint8_t chan_id, Error **errp)
> +{
> +    qpl_status status;
> +    uint32_t job_size = 0;
> +    qpl_job *job = NULL;
> +    /* always use IAA hardware accelerator */
> +    qpl_path_t path = qpl_path_hardware;
> +
> +    status = qpl_get_job_size(path, &job_size);
> +    if (status != QPL_STS_OK) {
> +        error_setg(errp, "multifd: %u: qpl_get_job_size failed with error %d",
> +                   chan_id, status);
> +        return -1;
> +    }
> +    qpl->job_array = g_new0(qpl_job *, qpl->job_num);
> +    for (int i = 0; i < qpl->job_num; i++) {
> +        job = g_malloc0(job_size);
> +        status = qpl_init_job(path, job);
> +        if (status != QPL_STS_OK) {
> +            error_setg(errp, "multifd: %u: qpl_init_job failed with error %d",
> +                       chan_id, status);
> +            free_jobs(qpl);
> +            return -1;
> +        }
> +        qpl->job_array[i] = job;
> +    }
> +    return 0;
> +}
> +
> +static int init_qpl(QplData *qpl, uint32_t job_num, uint32_t data_size,
> +                    uint8_t chan_id, Error **errp)
> +{

IMHO this method should be a normal constructor, it it should
be responsible for allocating 'qpl' struct too, and returning
it, not have the caller allocate it.

> +    qpl->job_num = job_num;
> +    qpl->data_size = data_size;
> +    if (alloc_zbuf(qpl, chan_id, errp) != 0) {
> +        return -1;
> +    }
> +    if (alloc_jobs(qpl, chan_id, errp) != 0) {
> +        free_zbuf(qpl);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +static void deinit_qpl(QplData *qpl)
> +{
> +    if (qpl != NULL) {
> +        free_jobs(qpl);
> +        free_zbuf(qpl);
> +        qpl->job_num = 0;
> +        qpl->data_size = 0;
> +    }
> +}

This should also free 'qpl' instead of leaving it upto the
caller.

> +
> +/**
> + * qpl_send_setup: setup send side
> + *
> + * Setup each channel with QPL compression.
> + *
> + * Returns 0 for success or -1 for error
> + *
> + * @p: Params for the channel that we are using
> + * @errp: pointer to an error
> + */
> +static int qpl_send_setup(MultiFDSendParams *p, Error **errp)
> +{
> +    QplData *qpl;
> +
> +    qpl = g_new0(QplData, 1);
> +    if (init_qpl(qpl, p->page_count, p->page_size, p->id, errp) != 0) {
> +        g_free(qpl);
> +        return -1;
> +    }
> +    p->compress_data = qpl;
> +
> +    assert(p->iov == NULL);
> +    /*
> +     * Each page will be compressed independently and sent using an IOV. The
> +     * additional two IOVs are used to store packet header and compressed data
> +     * length
> +     */
> +    p->iov = g_new0(struct iovec, p->page_count + 2);
> +    return 0;
> +}
> +
> +/**
> + * qpl_send_cleanup: cleanup send side
> + *
> + * Close the channel and return memory.
> + *
> + * @p: Params for the channel that we are using
> + * @errp: pointer to an error
> + */
> +static void qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
> +{
> +    QplData *qpl = p->compress_data;
> +
> +    deinit_qpl(qpl);
> +    g_free(p->compress_data);
> +    p->compress_data = NULL;
> +}
> +
> +/**
> + * qpl_send_prepare: prepare data to be able to send
> + *
> + * Create a compressed buffer with all the pages that we are going to
> + * send.
> + *
> + * Returns 0 for success or -1 for error
> + *
> + * @p: Params for the channel that we are using
> + * @errp: pointer to an error
> + */
> +static int qpl_send_prepare(MultiFDSendParams *p, Error **errp)
> +{
> +    /* Implement in next patch */
> +    return -1;
> +}
> +
> +/**
> + * qpl_recv_setup: setup receive side
> + *
> + * Create the compressed channel and buffer.
> + *
> + * Returns 0 for success or -1 for error
> + *
> + * @p: Params for the channel that we are using
> + * @errp: pointer to an error
> + */
> +static int qpl_recv_setup(MultiFDRecvParams *p, Error **errp)
> +{
> +    QplData *qpl;
> +
> +    qpl = g_new0(QplData, 1);
> +    if (init_qpl(qpl, p->page_count, p->page_size, p->id, errp) != 0) {
> +        g_free(qpl);
> +        return -1;
> +    }
> +    p->compress_data = qpl;
> +    return 0;
> +}
> +
> +/**
> + * qpl_recv_cleanup: setup receive side
> + *
> + * Close the channel and return memory.
> + *
> + * @p: Params for the channel that we are using
> + */
> +static void qpl_recv_cleanup(MultiFDRecvParams *p)
> +{
> +    QplData *qpl = p->compress_data;
> +
> +    deinit_qpl(qpl);
> +    g_free(p->compress_data);
> +    p->compress_data = NULL;
> +}
> +
> +/**
> + * qpl_recv: read the data from the channel into actual pages
> + *
> + * Read the compressed buffer, and uncompress it into the actual
> + * pages.
> + *
> + * Returns 0 for success or -1 for error
> + *
> + * @p: Params for the channel that we are using
> + * @errp: pointer to an error
> + */
> +static int qpl_recv(MultiFDRecvParams *p, Error **errp)
> +{
> +    /* Implement in next patch */
> +    return -1;
> +}

The qpl library uses 'qpl_' as its name prefix, so using the
same prefix in QEMU is fragile if future APIs are added to
the library.

Please consistently use 'multifd_qpl_' as the prefix for
*every* method in this file.

> +
> +static MultiFDMethods multifd_qpl_ops = {
> +    .send_setup = qpl_send_setup,
> +    .send_cleanup = qpl_send_cleanup,
> +    .send_prepare = qpl_send_prepare,
> +    .recv_setup = qpl_recv_setup,
> +    .recv_cleanup = qpl_recv_cleanup,
> +    .recv = qpl_recv,
> +};
>  
>  static void multifd_qpl_register(void)
>  {
> -    /* noop */
> +    multifd_register_ops(MULTIFD_COMPRESSION_QPL, &multifd_qpl_ops);
>  }
>  
>  migration_init(multifd_qpl_register);
> -- 
> 2.39.3
> 
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 7/7] tests/migration-test: add qpl compression test
  2024-03-19 16:45 ` [PATCH v5 7/7] tests/migration-test: add qpl compression test Yuan Liu
@ 2024-03-20 10:45   ` Daniel P. Berrangé
  2024-03-20 15:30     ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Daniel P. Berrangé @ 2024-03-20 10:45 UTC (permalink / raw)
  To: Yuan Liu; +Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, nanhai.zou

On Wed, Mar 20, 2024 at 12:45:27AM +0800, Yuan Liu wrote:
> add qpl to compression method test for multifd migration
> 
> the migration with qpl compression needs to access IAA hardware
> resource, please run "check-qtest" with sudo or root permission,
> otherwise migration test will fail

That's not an acceptable requirement.

If someone builds QEMU with QPL, the migration test *must*
pass 100% reliably when either running on a host without
the QPL required hardware, or when lacking permissions.

The test case needs to detect these scenarios and automatically
skip the test if it is incapable of running successfully.

This raises another question though. If QPL migration requires
running as root, then it is effectively unusable for QEMU, as
no sane deployment ever runs QEMU as root.

Is there a way to make QPL work for non-root users ?

> 
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> ---
>  tests/qtest/migration-test.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 71895abb7f..052d0d60fd 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -2815,6 +2815,15 @@ test_migrate_precopy_tcp_multifd_zstd_start(QTestState *from,
>  }
>  #endif /* CONFIG_ZSTD */
>  
> +#ifdef CONFIG_QPL
> +static void *
> +test_migrate_precopy_tcp_multifd_qpl_start(QTestState *from,
> +                                            QTestState *to)
> +{
> +    return test_migrate_precopy_tcp_multifd_start_common(from, to, "qpl");
> +}
> +#endif /* CONFIG_QPL */
> +
>  static void test_multifd_tcp_none(void)
>  {
>      MigrateCommon args = {
> @@ -2880,6 +2889,17 @@ static void test_multifd_tcp_zstd(void)
>  }
>  #endif
>  
> +#ifdef CONFIG_QPL
> +static void test_multifd_tcp_qpl(void)
> +{
> +    MigrateCommon args = {
> +        .listen_uri = "defer",
> +        .start_hook = test_migrate_precopy_tcp_multifd_qpl_start,
> +    };
> +    test_precopy_common(&args);
> +}
> +#endif
> +
>  #ifdef CONFIG_GNUTLS
>  static void *
>  test_migrate_multifd_tcp_tls_psk_start_match(QTestState *from,
> @@ -3789,6 +3809,10 @@ int main(int argc, char **argv)
>      migration_test_add("/migration/multifd/tcp/plain/zstd",
>                         test_multifd_tcp_zstd);
>  #endif
> +#ifdef CONFIG_QPL
> +    migration_test_add("/migration/multifd/tcp/plain/qpl",
> +                       test_multifd_tcp_qpl);
> +#endif
>  #ifdef CONFIG_GNUTLS
>      migration_test_add("/migration/multifd/tcp/tls/psk/match",
>                         test_multifd_tcp_tls_psk_match);
> -- 
> 2.39.3
> 
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 3/7] configure: add --enable-qpl build option
  2024-03-20  8:56     ` Thomas Huth
@ 2024-03-20 14:34       ` Liu, Yuan1
  0 siblings, 0 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-20 14:34 UTC (permalink / raw)
  To: Thomas Huth, peterx, farosas
  Cc: qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Thomas Huth <thuth@redhat.com>
> Sent: Wednesday, March 20, 2024 4:57 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>; peterx@redhat.com; farosas@suse.de
> Cc: qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 3/7] configure: add --enable-qpl build option
> 
> On 20/03/2024 09.55, Thomas Huth wrote:
> > On 19/03/2024 17.45, Yuan Liu wrote:
> >> add --enable-qpl and --disable-qpl options to enable and disable
> >> the QPL compression method for multifd migration.
> >>
> >> the Query Processing Library (QPL) is an open-source library
> >> that supports data compression and decompression features.
> >>
> >> The QPL compression is based on the deflate compression algorithm
> >> and use Intel In-Memory Analytics Accelerator(IAA) hardware for
> >> compression and decompression acceleration.
> >>
> >> Please refer to the following for more information about QPL
> >>
> https://intel.github.io/qpl/documentation/introduction_docs/introduction.h
> tml
> >>
> >> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> >> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> >> ---
> >>   meson.build                   | 16 ++++++++++++++++
> >>   meson_options.txt             |  2 ++
> >>   scripts/meson-buildoptions.sh |  3 +++
> >>   3 files changed, 21 insertions(+)
> >>
> >> diff --git a/meson.build b/meson.build
> >> index b375248a76..bee7dcd53b 100644
> >> --- a/meson.build
> >> +++ b/meson.build
> >> @@ -1200,6 +1200,20 @@ if not get_option('zstd').auto() or have_block
> >>                       required: get_option('zstd'),
> >>                       method: 'pkg-config')
> >>   endif
> >> +qpl = not_found
> >> +if not get_option('qpl').auto()
> >
> > Do you really only want to enable this if the user explicitly specified
> > "--enable-qpl" ? Otherwise, I think this should be:
> >
> >   if not get_option('qpl').auto() or have_system
> >
> > ?
> >
> >   Thomas
> >
> >
> >
> >
> >> +  libqpl = cc.find_library('qpl', required: false)
> 
> ... and it should use "required: get_option('qpl')" in that case.
> 
>   Thomas

Hi Thomas

Thanks for your comments, you are right, I need to add have_system
and check get_option('qpl') here, I will fix this next version.

> >> +  if not libqpl.found()
> >> +    error('libqpl not found, please install it from ' +
> >> +
> >>
> 'https://intel.github.io/qpl/documentation/get_started_docs/installation.h
> tml')
> >> +  endif
> >> +  libaccel = dependency('libaccel-config', version: '>=4.0.0',
> >> +                        required: true,
> >> +                        method: 'pkg-config')
> >> +  qpl = declare_dependency(dependencies: [libqpl, libaccel,
> >> +        cc.find_library('dl', required: get_option('qpl'))],
> >> +        link_args: ['-lstdc++'])
> >> +endif
> >>   virgl = not_found
> >


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 3/7] configure: add --enable-qpl build option
  2024-03-20 10:31   ` Daniel P. Berrangé
@ 2024-03-20 14:42     ` Liu, Yuan1
  0 siblings, 0 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-20 14:42 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Wednesday, March 20, 2024 6:31 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: peterx@redhat.com; farosas@suse.de; qemu-devel@nongnu.org;
> hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou, Nanhai
> <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 3/7] configure: add --enable-qpl build option
> 
> On Wed, Mar 20, 2024 at 12:45:23AM +0800, Yuan Liu wrote:
> > add --enable-qpl and --disable-qpl options to enable and disable
> > the QPL compression method for multifd migration.
> >
> > the Query Processing Library (QPL) is an open-source library
> > that supports data compression and decompression features.
> >
> > The QPL compression is based on the deflate compression algorithm
> > and use Intel In-Memory Analytics Accelerator(IAA) hardware for
> > compression and decompression acceleration.
> >
> > Please refer to the following for more information about QPL
> >
> https://intel.github.io/qpl/documentation/introduction_docs/introduction.h
> tml
> >
> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > ---
> >  meson.build                   | 16 ++++++++++++++++
> >  meson_options.txt             |  2 ++
> >  scripts/meson-buildoptions.sh |  3 +++
> >  3 files changed, 21 insertions(+)
> >
> > diff --git a/meson.build b/meson.build
> > index b375248a76..bee7dcd53b 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -1200,6 +1200,20 @@ if not get_option('zstd').auto() or have_block
> >                      required: get_option('zstd'),
> >                      method: 'pkg-config')
> >  endif
> > +qpl = not_found
> > +if not get_option('qpl').auto()
> > +  libqpl = cc.find_library('qpl', required: false)
> > +  if not libqpl.found()
> > +    error('libqpl not found, please install it from ' +
> > +
> 'https://intel.github.io/qpl/documentation/get_started_docs/installation.h
> tml')
> > +  endif
> > +  libaccel = dependency('libaccel-config', version: '>=4.0.0',
> > +                        required: true,
> > +                        method: 'pkg-config')
> > +  qpl = declare_dependency(dependencies: [libqpl, libaccel,
> > +        cc.find_library('dl', required: get_option('qpl'))],
> > +        link_args: ['-lstdc++'])
> > +endif
> 
> Are either of these libraries present in any mainstream Linux
> distro ? If not, then this feature will not get any CI build
> coverage from QEMU.
> 
> Needing to manually add '-lstdc++' & '-ldl' is presumably a
> requirement from 'qpl'. As a future enhancement it would be
> much better if 'qpl' provided a pkg-config file, so this
> list of dependencies didn't have to be hardcoded by apps
> using qpl
> 
> 
> With regards,
> Daniel

Hi Daniel

Thanks for your comments, the QPL has not been integrated into 
mainstream Linux distro yet, I am actively promoting QPL to 
support pkg-config file and integrate it into the distributions.

The QPL will support these soon, I will use pkg-config in the next 
version to solve the QPL build dependency issue.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-20 10:42   ` Daniel P. Berrangé
@ 2024-03-20 15:02     ` Liu, Yuan1
  2024-03-20 15:20       ` Daniel P. Berrangé
  2024-03-20 15:34       ` Peter Xu
  0 siblings, 2 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-20 15:02 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Wednesday, March 20, 2024 6:42 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: peterx@redhat.com; farosas@suse.de; qemu-devel@nongnu.org;
> hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou, Nanhai
> <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
> 
> On Wed, Mar 20, 2024 at 12:45:25AM +0800, Yuan Liu wrote:
> > the qpl initialization includes memory allocation for compressed
> > data and the qpl job initialization.
> >
> > the qpl initialization will check whether the In-Memory Analytics
> > Accelerator(IAA) hardware is available, if the platform does not
> > have IAA hardware or the IAA hardware is not available, the QPL
> > compression initialization will fail.
> >
> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > ---
> >  migration/multifd-qpl.c | 243 +++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 242 insertions(+), 1 deletion(-)
> >
> > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> > index 056a68a060..6de65e9da7 100644
> > --- a/migration/multifd-qpl.c
> > +++ b/migration/multifd-qpl.c
> > @@ -9,12 +9,253 @@
> >   * This work is licensed under the terms of the GNU GPL, version 2 or
> later.
> >   * See the COPYING file in the top-level directory.
> >   */
> > +
> >  #include "qemu/osdep.h"
> >  #include "qemu/module.h"
> > +#include "qapi/error.h"
> > +#include "migration.h"
> > +#include "multifd.h"
> > +#include "qpl/qpl.h"
> > +
> > +typedef struct {
> > +    qpl_job **job_array;
> > +    /* the number of allocated jobs */
> > +    uint32_t job_num;
> > +    /* the size of data processed by a qpl job */
> > +    uint32_t data_size;
> > +    /* compressed data buffer */
> > +    uint8_t *zbuf;
> > +    /* the length of compressed data */
> > +    uint32_t *zbuf_hdr;
> > +} QplData;
> > +
> > +static void free_zbuf(QplData *qpl)
> > +{
> > +    if (qpl->zbuf != NULL) {
> > +        munmap(qpl->zbuf, qpl->job_num * qpl->data_size);
> > +        qpl->zbuf = NULL;
> > +    }
> > +    if (qpl->zbuf_hdr != NULL) {
> > +        g_free(qpl->zbuf_hdr);
> > +        qpl->zbuf_hdr = NULL;
> > +    }
> > +}
> > +
> > +static int alloc_zbuf(QplData *qpl, uint8_t chan_id, Error **errp)
> > +{
> > +    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
> > +    uint32_t size = qpl->job_num * qpl->data_size;
> > +    uint8_t *buf;
> > +
> > +    buf = (uint8_t *) mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -
> 1, 0);
> > +    if (buf == MAP_FAILED) {
> > +        error_setg(errp, "multifd: %u: alloc_zbuf failed, job num %u,
> size %u",
> > +                   chan_id, qpl->job_num, qpl->data_size);
> > +        return -1;
> > +    }
> 
> What's the reason for using mmap here, rather than a normal
> malloc ?

I want to populate the memory accessed by the IAA device in the initialization
phase, and then avoid initiating I/O page faults through the IAA device during
migration, a large number of I/O page faults are not good for performance. 

This problem also occurs at the destination, therefore, I recommend that
customers need to add -mem-prealloc for destination boot parameters.

> > +    qpl->zbuf = buf;
> > +    qpl->zbuf_hdr = g_new0(uint32_t, qpl->job_num);
> > +    return 0;
> > +}
> > +
> > +static void free_jobs(QplData *qpl)
> > +{
> > +    for (int i = 0; i < qpl->job_num; i++) {
> > +        qpl_fini_job(qpl->job_array[i]);
> > +        g_free(qpl->job_array[i]);
> > +        qpl->job_array[i] = NULL;
> > +    }
> > +    g_free(qpl->job_array);
> > +    qpl->job_array = NULL;
> > +}
> > +
> > +static int alloc_jobs(QplData *qpl, uint8_t chan_id, Error **errp)
> > +{
> > +    qpl_status status;
> > +    uint32_t job_size = 0;
> > +    qpl_job *job = NULL;
> > +    /* always use IAA hardware accelerator */
> > +    qpl_path_t path = qpl_path_hardware;
> > +
> > +    status = qpl_get_job_size(path, &job_size);
> > +    if (status != QPL_STS_OK) {
> > +        error_setg(errp, "multifd: %u: qpl_get_job_size failed with
> error %d",
> > +                   chan_id, status);
> > +        return -1;
> > +    }
> > +    qpl->job_array = g_new0(qpl_job *, qpl->job_num);
> > +    for (int i = 0; i < qpl->job_num; i++) {
> > +        job = g_malloc0(job_size);
> > +        status = qpl_init_job(path, job);
> > +        if (status != QPL_STS_OK) {
> > +            error_setg(errp, "multifd: %u: qpl_init_job failed with
> error %d",
> > +                       chan_id, status);
> > +            free_jobs(qpl);
> > +            return -1;
> > +        }
> > +        qpl->job_array[i] = job;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int init_qpl(QplData *qpl, uint32_t job_num, uint32_t data_size,
> > +                    uint8_t chan_id, Error **errp)
> > +{
> 
> IMHO this method should be a normal constructor, it it should
> be responsible for allocating 'qpl' struct too, and returning
> it, not have the caller allocate it.

Thanks for your comments, I will refine this.

> > +    qpl->job_num = job_num;
> > +    qpl->data_size = data_size;
> > +    if (alloc_zbuf(qpl, chan_id, errp) != 0) {
> > +        return -1;
> > +    }
> > +    if (alloc_jobs(qpl, chan_id, errp) != 0) {
> > +        free_zbuf(qpl);
> > +        return -1;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static void deinit_qpl(QplData *qpl)
> > +{
> > +    if (qpl != NULL) {
> > +        free_jobs(qpl);
> > +        free_zbuf(qpl);
> > +        qpl->job_num = 0;
> > +        qpl->data_size = 0;
> > +    }
> > +}
> 
> This should also free 'qpl' instead of leaving it upto the
> caller.

Sure, I will refine this in the next version.

> > +/**
> > + * qpl_send_setup: setup send side
> > + *
> > + * Setup each channel with QPL compression.
> > + *
> > + * Returns 0 for success or -1 for error
> > + *
> > + * @p: Params for the channel that we are using
> > + * @errp: pointer to an error
> > + */
> > +static int qpl_send_setup(MultiFDSendParams *p, Error **errp)
> > +{
> > +    QplData *qpl;
> > +
> > +    qpl = g_new0(QplData, 1);
> > +    if (init_qpl(qpl, p->page_count, p->page_size, p->id, errp) != 0) {
> > +        g_free(qpl);
> > +        return -1;
> > +    }
> > +    p->compress_data = qpl;
> > +
> > +    assert(p->iov == NULL);
> > +    /*
> > +     * Each page will be compressed independently and sent using an
> IOV. The
> > +     * additional two IOVs are used to store packet header and
> compressed data
> > +     * length
> > +     */
> > +    p->iov = g_new0(struct iovec, p->page_count + 2);
> > +    return 0;
> > +}
> > +
> > +/**
> > + * qpl_send_cleanup: cleanup send side
> > + *
> > + * Close the channel and return memory.
> > + *
> > + * @p: Params for the channel that we are using
> > + * @errp: pointer to an error
> > + */
> > +static void qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
> > +{
> > +    QplData *qpl = p->compress_data;
> > +
> > +    deinit_qpl(qpl);
> > +    g_free(p->compress_data);
> > +    p->compress_data = NULL;
> > +}
> > +
> > +/**
> > + * qpl_send_prepare: prepare data to be able to send
> > + *
> > + * Create a compressed buffer with all the pages that we are going to
> > + * send.
> > + *
> > + * Returns 0 for success or -1 for error
> > + *
> > + * @p: Params for the channel that we are using
> > + * @errp: pointer to an error
> > + */
> > +static int qpl_send_prepare(MultiFDSendParams *p, Error **errp)
> > +{
> > +    /* Implement in next patch */
> > +    return -1;
> > +}
> > +
> > +/**
> > + * qpl_recv_setup: setup receive side
> > + *
> > + * Create the compressed channel and buffer.
> > + *
> > + * Returns 0 for success or -1 for error
> > + *
> > + * @p: Params for the channel that we are using
> > + * @errp: pointer to an error
> > + */
> > +static int qpl_recv_setup(MultiFDRecvParams *p, Error **errp)
> > +{
> > +    QplData *qpl;
> > +
> > +    qpl = g_new0(QplData, 1);
> > +    if (init_qpl(qpl, p->page_count, p->page_size, p->id, errp) != 0) {
> > +        g_free(qpl);
> > +        return -1;
> > +    }
> > +    p->compress_data = qpl;
> > +    return 0;
> > +}
> > +
> > +/**
> > + * qpl_recv_cleanup: setup receive side
> > + *
> > + * Close the channel and return memory.
> > + *
> > + * @p: Params for the channel that we are using
> > + */
> > +static void qpl_recv_cleanup(MultiFDRecvParams *p)
> > +{
> > +    QplData *qpl = p->compress_data;
> > +
> > +    deinit_qpl(qpl);
> > +    g_free(p->compress_data);
> > +    p->compress_data = NULL;
> > +}
> > +
> > +/**
> > + * qpl_recv: read the data from the channel into actual pages
> > + *
> > + * Read the compressed buffer, and uncompress it into the actual
> > + * pages.
> > + *
> > + * Returns 0 for success or -1 for error
> > + *
> > + * @p: Params for the channel that we are using
> > + * @errp: pointer to an error
> > + */
> > +static int qpl_recv(MultiFDRecvParams *p, Error **errp)
> > +{
> > +    /* Implement in next patch */
> > +    return -1;
> > +}
> 
> The qpl library uses 'qpl_' as its name prefix, so using the
> same prefix in QEMU is fragile if future APIs are added to
> the library.
> 
> Please consistently use 'multifd_qpl_' as the prefix for
> *every* method in this file.

Get it, thanks for the guidance, I will fix this.

> > +
> > +static MultiFDMethods multifd_qpl_ops = {
> > +    .send_setup = qpl_send_setup,
> > +    .send_cleanup = qpl_send_cleanup,
> > +    .send_prepare = qpl_send_prepare,
> > +    .recv_setup = qpl_recv_setup,
> > +    .recv_cleanup = qpl_recv_cleanup,
> > +    .recv = qpl_recv,
> > +};
> >
> >  static void multifd_qpl_register(void)
> >  {
> > -    /* noop */
> > +    multifd_register_ops(MULTIFD_COMPRESSION_QPL, &multifd_qpl_ops);
> >  }
> >
> >  migration_init(multifd_qpl_register);
> > --
> > 2.39.3
> >
> >
> 
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-
> https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 2/7] migration/multifd: put IOV initialization into compression method
  2024-03-19 16:45 ` [PATCH v5 2/7] migration/multifd: put IOV initialization into compression method Yuan Liu
@ 2024-03-20 15:18   ` Fabiano Rosas
  2024-03-20 15:32     ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-03-20 15:18 UTC (permalink / raw)
  To: Yuan Liu, peterx
  Cc: qemu-devel, hao.xiang, bryan.zhang, yuan1.liu, nanhai.zou

Yuan Liu <yuan1.liu@intel.com> writes:

> Different compression methods may require different numbers of IOVs.
> Based on streaming compression of zlib and zstd, all pages will be
> compressed to a data block, so two IOVs are needed for packet header
> and compressed data block.
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> ---
>  migration/multifd-zlib.c | 4 ++++
>  migration/multifd-zstd.c | 6 +++++-
>  migration/multifd.c      | 8 +++++---
>  3 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
> index 99821cd4d5..8095ef8e28 100644
> --- a/migration/multifd-zlib.c
> +++ b/migration/multifd-zlib.c
> @@ -70,6 +70,10 @@ static int zlib_send_setup(MultiFDSendParams *p, Error **errp)
>          goto err_free_zbuff;
>      }
>      p->compress_data = z;
> +
> +    assert(p->iov == NULL);
> +    /* For packet header and zlib streaming compression block */
> +    p->iov = g_new0(struct iovec, 2);
>      return 0;
>  
>  err_free_zbuff:
> diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
> index 02112255ad..9c9217794e 100644
> --- a/migration/multifd-zstd.c
> +++ b/migration/multifd-zstd.c
> @@ -52,7 +52,6 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
>      struct zstd_data *z = g_new0(struct zstd_data, 1);
>      int res;
>  
> -    p->compress_data = z;
>      z->zcs = ZSTD_createCStream();
>      if (!z->zcs) {
>          g_free(z);
> @@ -77,6 +76,11 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
>          error_setg(errp, "multifd %u: out of memory for zbuff", p->id);
>          return -1;
>      }
> +    p->compress_data = z;
> +
> +    assert(p->iov == NULL);
> +    /* For packet header and zstd streaming compression block */
> +    p->iov = g_new0(struct iovec, 2);
>      return 0;
>  }
>  
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 0179422f6d..5155e02ae3 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -1181,9 +1181,11 @@ bool multifd_send_setup(void)
>              p->packet = g_malloc0(p->packet_len);
>              p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
>              p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> -
> -            /* We need one extra place for the packet header */
> -            p->iov = g_new0(struct iovec, page_count + 1);
> +            /* IOVs are initialized in send_setup of compression method */
> +            if (!migrate_multifd_compression()) {
> +                /* We need one extra place for the packet header */
> +                p->iov = g_new0(struct iovec, page_count + 1);
> +            }

This^ should go into nocomp_send_setup:

static int nocomp_send_setup(MultiFDSendParams *p, Error **errp)
{
    if (migrate_zero_copy_send()) {
        p->write_flags |= QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
    }

    if (multifd_use_packets()) {
        /* We need one extra place for the packet header */
        p->iov = g_new0(struct iovec, p->page_count + 1);
    } else {
        p->iov = g_new0(struct iovec, p->page_count);
    }

    return 0;
}

>          } else {
>              p->iov = g_new0(struct iovec, page_count);
>          }


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-20 15:02     ` Liu, Yuan1
@ 2024-03-20 15:20       ` Daniel P. Berrangé
  2024-03-20 16:04         ` Liu, Yuan1
  2024-03-20 15:34       ` Peter Xu
  1 sibling, 1 reply; 45+ messages in thread
From: Daniel P. Berrangé @ 2024-03-20 15:20 UTC (permalink / raw)
  To: Liu, Yuan1
  Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Wed, Mar 20, 2024 at 03:02:59PM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Daniel P. Berrangé <berrange@redhat.com>
> > Sent: Wednesday, March 20, 2024 6:42 PM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: peterx@redhat.com; farosas@suse.de; qemu-devel@nongnu.org;
> > hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou, Nanhai
> > <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> > qpl compression
> > 
> > On Wed, Mar 20, 2024 at 12:45:25AM +0800, Yuan Liu wrote:
> > > the qpl initialization includes memory allocation for compressed
> > > data and the qpl job initialization.
> > >
> > > the qpl initialization will check whether the In-Memory Analytics
> > > Accelerator(IAA) hardware is available, if the platform does not
> > > have IAA hardware or the IAA hardware is not available, the QPL
> > > compression initialization will fail.
> > >
> > > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > > ---
> > >  migration/multifd-qpl.c | 243 +++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 242 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> > > index 056a68a060..6de65e9da7 100644
> > > --- a/migration/multifd-qpl.c
> > > +++ b/migration/multifd-qpl.c
> > > @@ -9,12 +9,253 @@
> > >   * This work is licensed under the terms of the GNU GPL, version 2 or
> > later.
> > >   * See the COPYING file in the top-level directory.
> > >   */
> > > +
> > >  #include "qemu/osdep.h"
> > >  #include "qemu/module.h"
> > > +#include "qapi/error.h"
> > > +#include "migration.h"
> > > +#include "multifd.h"
> > > +#include "qpl/qpl.h"
> > > +
> > > +typedef struct {
> > > +    qpl_job **job_array;
> > > +    /* the number of allocated jobs */
> > > +    uint32_t job_num;
> > > +    /* the size of data processed by a qpl job */
> > > +    uint32_t data_size;
> > > +    /* compressed data buffer */
> > > +    uint8_t *zbuf;
> > > +    /* the length of compressed data */
> > > +    uint32_t *zbuf_hdr;
> > > +} QplData;
> > > +
> > > +static void free_zbuf(QplData *qpl)
> > > +{
> > > +    if (qpl->zbuf != NULL) {
> > > +        munmap(qpl->zbuf, qpl->job_num * qpl->data_size);
> > > +        qpl->zbuf = NULL;
> > > +    }
> > > +    if (qpl->zbuf_hdr != NULL) {
> > > +        g_free(qpl->zbuf_hdr);
> > > +        qpl->zbuf_hdr = NULL;
> > > +    }
> > > +}
> > > +
> > > +static int alloc_zbuf(QplData *qpl, uint8_t chan_id, Error **errp)
> > > +{
> > > +    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
> > > +    uint32_t size = qpl->job_num * qpl->data_size;
> > > +    uint8_t *buf;
> > > +
> > > +    buf = (uint8_t *) mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -
> > 1, 0);
> > > +    if (buf == MAP_FAILED) {
> > > +        error_setg(errp, "multifd: %u: alloc_zbuf failed, job num %u,
> > size %u",
> > > +                   chan_id, qpl->job_num, qpl->data_size);
> > > +        return -1;
> > > +    }
> > 
> > What's the reason for using mmap here, rather than a normal
> > malloc ?
> 
> I want to populate the memory accessed by the IAA device in the initialization
> phase, and then avoid initiating I/O page faults through the IAA device during
> migration, a large number of I/O page faults are not good for performance.

Does this mmap actually make a measurable difference ?

If I've followed the code paths correctly, I think this
alloc_zbuf method only gets called during initial setup
of each migration thread.

So this use of MAP_POPULATE seems to only make a difference
between faulting in before starting sending data, and faulting
in on first bit of data that's sent. I'm surprised if that's
noticable as a difference.


> This problem also occurs at the destination, therefore, I recommend that
> customers need to add -mem-prealloc for destination boot parameters.

I can understand mem-prelloc making a difference as that guarantees
all of guest RAM is faulted in.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 7/7] tests/migration-test: add qpl compression test
  2024-03-20 10:45   ` Daniel P. Berrangé
@ 2024-03-20 15:30     ` Liu, Yuan1
  2024-03-20 15:39       ` Daniel P. Berrangé
  0 siblings, 1 reply; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-20 15:30 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Wednesday, March 20, 2024 6:46 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: peterx@redhat.com; farosas@suse.de; qemu-devel@nongnu.org;
> hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou, Nanhai
> <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 7/7] tests/migration-test: add qpl compression test
> 
> On Wed, Mar 20, 2024 at 12:45:27AM +0800, Yuan Liu wrote:
> > add qpl to compression method test for multifd migration
> >
> > the migration with qpl compression needs to access IAA hardware
> > resource, please run "check-qtest" with sudo or root permission,
> > otherwise migration test will fail
> 
> That's not an acceptable requirement.
> 
> If someone builds QEMU with QPL, the migration test *must*
> pass 100% reliably when either running on a host without
> the QPL required hardware, or when lacking permissions.
> 
> The test case needs to detect these scenarios and automatically
> skip the test if it is incapable of running successfully.
> This raises another question though. If QPL migration requires
> running as root, then it is effectively unusable for QEMU, as
> no sane deployment ever runs QEMU as root.
> 
> Is there a way to make QPL work for non-root users ?

There are two issues here
1. I need to add an IAA resource detection before the QPL test begins
   In this way, when QPL resources are unavailable, the live migration 
   test will not be affected.

2. I need to add some additional information about IAA configuration in 
   the devel/qpl-compression.rst documentation. In addition to configuring 
   IAA resources, the system administrator also needs to assign IAA resources
   to user groups.
   For example, the system administrator runs "chown -R user /dev/iax", then
   all IAA resources can be accessed by "user", this method does not require 
   sudo or root permissions

> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > ---
> >  tests/qtest/migration-test.c | 24 ++++++++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> >
> > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > index 71895abb7f..052d0d60fd 100644
> > --- a/tests/qtest/migration-test.c
> > +++ b/tests/qtest/migration-test.c
> > @@ -2815,6 +2815,15 @@
> test_migrate_precopy_tcp_multifd_zstd_start(QTestState *from,
> >  }
> >  #endif /* CONFIG_ZSTD */
> >
> > +#ifdef CONFIG_QPL
> > +static void *
> > +test_migrate_precopy_tcp_multifd_qpl_start(QTestState *from,
> > +                                            QTestState *to)
> > +{
> > +    return test_migrate_precopy_tcp_multifd_start_common(from, to,
> "qpl");
> > +}
> > +#endif /* CONFIG_QPL */
> > +
> >  static void test_multifd_tcp_none(void)
> >  {
> >      MigrateCommon args = {
> > @@ -2880,6 +2889,17 @@ static void test_multifd_tcp_zstd(void)
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_QPL
> > +static void test_multifd_tcp_qpl(void)
> > +{
> > +    MigrateCommon args = {
> > +        .listen_uri = "defer",
> > +        .start_hook = test_migrate_precopy_tcp_multifd_qpl_start,
> > +    };
> > +    test_precopy_common(&args);
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_GNUTLS
> >  static void *
> >  test_migrate_multifd_tcp_tls_psk_start_match(QTestState *from,
> > @@ -3789,6 +3809,10 @@ int main(int argc, char **argv)
> >      migration_test_add("/migration/multifd/tcp/plain/zstd",
> >                         test_multifd_tcp_zstd);
> >  #endif
> > +#ifdef CONFIG_QPL
> > +    migration_test_add("/migration/multifd/tcp/plain/qpl",
> > +                       test_multifd_tcp_qpl);
> > +#endif
> >  #ifdef CONFIG_GNUTLS
> >      migration_test_add("/migration/multifd/tcp/tls/psk/match",
> >                         test_multifd_tcp_tls_psk_match);
> > --
> > 2.39.3
> >
> >
> 
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-
> https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 2/7] migration/multifd: put IOV initialization into compression method
  2024-03-20 15:18   ` Fabiano Rosas
@ 2024-03-20 15:32     ` Liu, Yuan1
  0 siblings, 0 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-20 15:32 UTC (permalink / raw)
  To: Fabiano Rosas, peterx; +Cc: qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Fabiano Rosas <farosas@suse.de>
> Sent: Wednesday, March 20, 2024 11:19 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>; peterx@redhat.com
> Cc: qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Liu, Yuan1 <yuan1.liu@intel.com>; Zou, Nanhai
> <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 2/7] migration/multifd: put IOV initialization into
> compression method
> 
> Yuan Liu <yuan1.liu@intel.com> writes:
> 
> > Different compression methods may require different numbers of IOVs.
> > Based on streaming compression of zlib and zstd, all pages will be
> > compressed to a data block, so two IOVs are needed for packet header
> > and compressed data block.
> >
> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > ---
> >  migration/multifd-zlib.c | 4 ++++
> >  migration/multifd-zstd.c | 6 +++++-
> >  migration/multifd.c      | 8 +++++---
> >  3 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
> > index 99821cd4d5..8095ef8e28 100644
> > --- a/migration/multifd-zlib.c
> > +++ b/migration/multifd-zlib.c
> > @@ -70,6 +70,10 @@ static int zlib_send_setup(MultiFDSendParams *p,
> Error **errp)
> >          goto err_free_zbuff;
> >      }
> >      p->compress_data = z;
> > +
> > +    assert(p->iov == NULL);
> > +    /* For packet header and zlib streaming compression block */
> > +    p->iov = g_new0(struct iovec, 2);
> >      return 0;
> >
> >  err_free_zbuff:
> > diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
> > index 02112255ad..9c9217794e 100644
> > --- a/migration/multifd-zstd.c
> > +++ b/migration/multifd-zstd.c
> > @@ -52,7 +52,6 @@ static int zstd_send_setup(MultiFDSendParams *p, Error
> **errp)
> >      struct zstd_data *z = g_new0(struct zstd_data, 1);
> >      int res;
> >
> > -    p->compress_data = z;
> >      z->zcs = ZSTD_createCStream();
> >      if (!z->zcs) {
> >          g_free(z);
> > @@ -77,6 +76,11 @@ static int zstd_send_setup(MultiFDSendParams *p,
> Error **errp)
> >          error_setg(errp, "multifd %u: out of memory for zbuff", p->id);
> >          return -1;
> >      }
> > +    p->compress_data = z;
> > +
> > +    assert(p->iov == NULL);
> > +    /* For packet header and zstd streaming compression block */
> > +    p->iov = g_new0(struct iovec, 2);
> >      return 0;
> >  }
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 0179422f6d..5155e02ae3 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -1181,9 +1181,11 @@ bool multifd_send_setup(void)
> >              p->packet = g_malloc0(p->packet_len);
> >              p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> >              p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> > -
> > -            /* We need one extra place for the packet header */
> > -            p->iov = g_new0(struct iovec, page_count + 1);
> > +            /* IOVs are initialized in send_setup of compression method
> */
> > +            if (!migrate_multifd_compression()) {
> > +                /* We need one extra place for the packet header */
> > +                p->iov = g_new0(struct iovec, page_count + 1);
> > +            }
> 
> This^ should go into nocomp_send_setup:
> 
> static int nocomp_send_setup(MultiFDSendParams *p, Error **errp)
> {
>     if (migrate_zero_copy_send()) {
>         p->write_flags |= QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
>     }
> 
>     if (multifd_use_packets()) {
>         /* We need one extra place for the packet header */
>         p->iov = g_new0(struct iovec, p->page_count + 1);
>     } else {
>         p->iov = g_new0(struct iovec, p->page_count);
>     }
> 
>     return 0;
> }

Yes, this is better, I will fix this in the next version,
thanks for your comments.

> >          } else {
> >              p->iov = g_new0(struct iovec, page_count);
> >          }


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-20 15:02     ` Liu, Yuan1
  2024-03-20 15:20       ` Daniel P. Berrangé
@ 2024-03-20 15:34       ` Peter Xu
  2024-03-20 16:23         ` Liu, Yuan1
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-20 15:34 UTC (permalink / raw)
  To: Liu, Yuan1
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Wed, Mar 20, 2024 at 03:02:59PM +0000, Liu, Yuan1 wrote:
> > > +static int alloc_zbuf(QplData *qpl, uint8_t chan_id, Error **errp)
> > > +{
> > > +    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
> > > +    uint32_t size = qpl->job_num * qpl->data_size;
> > > +    uint8_t *buf;
> > > +
> > > +    buf = (uint8_t *) mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -
> > 1, 0);
> > > +    if (buf == MAP_FAILED) {
> > > +        error_setg(errp, "multifd: %u: alloc_zbuf failed, job num %u,
> > size %u",
> > > +                   chan_id, qpl->job_num, qpl->data_size);
> > > +        return -1;
> > > +    }
> > 
> > What's the reason for using mmap here, rather than a normal
> > malloc ?
> 
> I want to populate the memory accessed by the IAA device in the initialization
> phase, and then avoid initiating I/O page faults through the IAA device during
> migration, a large number of I/O page faults are not good for performance. 

mmap() doesn't populate pages, unless with MAP_POPULATE.  And even with
that it shouldn't be guaranteed, as the populate phase should ignore all
errors.

       MAP_POPULATE (since Linux 2.5.46)
              Populate (prefault) page tables for a mapping.  For a file  map‐
              ping, this causes read-ahead on the file.  This will help to re‐
              duce  blocking  on  page  faults later.  The mmap() call doesn't
              fail if the mapping cannot be populated  (for  example,  due  to
              limitations  on  the  number  of  mapped  huge  pages when using
              MAP_HUGETLB).  Support for MAP_POPULATE in conjunction with pri‐
              vate mappings was added in Linux 2.6.23.

OTOH, I think g_malloc0() should guarantee to prefault everything in as
long as the call returned (even though they can be swapped out later, but
that applies to all cases anyway).

> 
> This problem also occurs at the destination, therefore, I recommend that
> customers need to add -mem-prealloc for destination boot parameters.

I'm not sure what issue you hit when testing it, but -mem-prealloc flag
should only control the guest memory backends not the buffers that QEMU
internally use, afaiu.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 7/7] tests/migration-test: add qpl compression test
  2024-03-20 15:30     ` Liu, Yuan1
@ 2024-03-20 15:39       ` Daniel P. Berrangé
  2024-03-20 16:26         ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Daniel P. Berrangé @ 2024-03-20 15:39 UTC (permalink / raw)
  To: Liu, Yuan1
  Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Wed, Mar 20, 2024 at 03:30:40PM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Daniel P. Berrangé <berrange@redhat.com>
> > Sent: Wednesday, March 20, 2024 6:46 PM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: peterx@redhat.com; farosas@suse.de; qemu-devel@nongnu.org;
> > hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou, Nanhai
> > <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v5 7/7] tests/migration-test: add qpl compression test
> > 
> > On Wed, Mar 20, 2024 at 12:45:27AM +0800, Yuan Liu wrote:
> > > add qpl to compression method test for multifd migration
> > >
> > > the migration with qpl compression needs to access IAA hardware
> > > resource, please run "check-qtest" with sudo or root permission,
> > > otherwise migration test will fail
> > 
> > That's not an acceptable requirement.
> > 
> > If someone builds QEMU with QPL, the migration test *must*
> > pass 100% reliably when either running on a host without
> > the QPL required hardware, or when lacking permissions.
> > 
> > The test case needs to detect these scenarios and automatically
> > skip the test if it is incapable of running successfully.
> > This raises another question though. If QPL migration requires
> > running as root, then it is effectively unusable for QEMU, as
> > no sane deployment ever runs QEMU as root.
> > 
> > Is there a way to make QPL work for non-root users ?
> 
> There are two issues here
> 1. I need to add an IAA resource detection before the QPL test begins
>    In this way, when QPL resources are unavailable, the live migration 
>    test will not be affected.
> 
> 2. I need to add some additional information about IAA configuration in 
>    the devel/qpl-compression.rst documentation. In addition to configuring 
>    IAA resources, the system administrator also needs to assign IAA resources
>    to user groups.
>    For example, the system administrator runs "chown -R user /dev/iax", then
>    all IAA resources can be accessed by "user", this method does not require 
>    sudo or root permissions

Ok, so in the test suite you likely should do something
approximately like

#ifdef CONFIG_QPL
  if (access("/dev/iax", R_OK|W_OK) == 0) {
    migration_test_add("/migration/multifd/tcp/plain/qpl",
                       test_multifd_tcp_qpl);
  }
#endif

possibly more if you need to actually query supported features
of /dev/iax before trying to use it

> 
> > > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > > ---
> > >  tests/qtest/migration-test.c | 24 ++++++++++++++++++++++++
> > >  1 file changed, 24 insertions(+)
> > >
> > > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > > index 71895abb7f..052d0d60fd 100644
> > > --- a/tests/qtest/migration-test.c
> > > +++ b/tests/qtest/migration-test.c
> > > @@ -2815,6 +2815,15 @@
> > test_migrate_precopy_tcp_multifd_zstd_start(QTestState *from,
> > >  }
> > >  #endif /* CONFIG_ZSTD */
> > >
> > > +#ifdef CONFIG_QPL
> > > +static void *
> > > +test_migrate_precopy_tcp_multifd_qpl_start(QTestState *from,
> > > +                                            QTestState *to)
> > > +{
> > > +    return test_migrate_precopy_tcp_multifd_start_common(from, to,
> > "qpl");
> > > +}
> > > +#endif /* CONFIG_QPL */
> > > +
> > >  static void test_multifd_tcp_none(void)
> > >  {
> > >      MigrateCommon args = {
> > > @@ -2880,6 +2889,17 @@ static void test_multifd_tcp_zstd(void)
> > >  }
> > >  #endif
> > >
> > > +#ifdef CONFIG_QPL
> > > +static void test_multifd_tcp_qpl(void)
> > > +{
> > > +    MigrateCommon args = {
> > > +        .listen_uri = "defer",
> > > +        .start_hook = test_migrate_precopy_tcp_multifd_qpl_start,
> > > +    };
> > > +    test_precopy_common(&args);
> > > +}
> > > +#endif
> > > +
> > >  #ifdef CONFIG_GNUTLS
> > >  static void *
> > >  test_migrate_multifd_tcp_tls_psk_start_match(QTestState *from,
> > > @@ -3789,6 +3809,10 @@ int main(int argc, char **argv)
> > >      migration_test_add("/migration/multifd/tcp/plain/zstd",
> > >                         test_multifd_tcp_zstd);
> > >  #endif
> > > +#ifdef CONFIG_QPL
> > > +    migration_test_add("/migration/multifd/tcp/plain/qpl",
> > > +                       test_multifd_tcp_qpl);
> > > +#endif
> > >  #ifdef CONFIG_GNUTLS
> > >      migration_test_add("/migration/multifd/tcp/tls/psk/match",
> > >                         test_multifd_tcp_tls_psk_match);
> > > --
> > > 2.39.3
> > >
> > >
> > 
> > With regards,
> > Daniel
> > --
> > |: https://berrange.com      -o-
> > https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-
> > https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-
> > https://www.instagram.com/dberrange :|
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-20 15:20       ` Daniel P. Berrangé
@ 2024-03-20 16:04         ` Liu, Yuan1
  0 siblings, 0 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-20 16:04 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Wednesday, March 20, 2024 11:21 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: peterx@redhat.com; farosas@suse.de; qemu-devel@nongnu.org;
> hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou, Nanhai
> <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
> 
> On Wed, Mar 20, 2024 at 03:02:59PM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Daniel P. Berrangé <berrange@redhat.com>
> > > Sent: Wednesday, March 20, 2024 6:42 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: peterx@redhat.com; farosas@suse.de; qemu-devel@nongnu.org;
> > > hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou, Nanhai
> > > <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement
> initialization of
> > > qpl compression
> > >
> > > On Wed, Mar 20, 2024 at 12:45:25AM +0800, Yuan Liu wrote:
> > > > the qpl initialization includes memory allocation for compressed
> > > > data and the qpl job initialization.
> > > >
> > > > the qpl initialization will check whether the In-Memory Analytics
> > > > Accelerator(IAA) hardware is available, if the platform does not
> > > > have IAA hardware or the IAA hardware is not available, the QPL
> > > > compression initialization will fail.
> > > >
> > > > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > > > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > > > ---
> > > >  migration/multifd-qpl.c | 243
> +++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 242 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> > > > index 056a68a060..6de65e9da7 100644
> > > > --- a/migration/multifd-qpl.c
> > > > +++ b/migration/multifd-qpl.c
> > > > @@ -9,12 +9,253 @@
> > > >   * This work is licensed under the terms of the GNU GPL, version 2
> or
> > > later.
> > > >   * See the COPYING file in the top-level directory.
> > > >   */
> > > > +
> > > >  #include "qemu/osdep.h"
> > > >  #include "qemu/module.h"
> > > > +#include "qapi/error.h"
> > > > +#include "migration.h"
> > > > +#include "multifd.h"
> > > > +#include "qpl/qpl.h"
> > > > +
> > > > +typedef struct {
> > > > +    qpl_job **job_array;
> > > > +    /* the number of allocated jobs */
> > > > +    uint32_t job_num;
> > > > +    /* the size of data processed by a qpl job */
> > > > +    uint32_t data_size;
> > > > +    /* compressed data buffer */
> > > > +    uint8_t *zbuf;
> > > > +    /* the length of compressed data */
> > > > +    uint32_t *zbuf_hdr;
> > > > +} QplData;
> > > > +
> > > > +static void free_zbuf(QplData *qpl)
> > > > +{
> > > > +    if (qpl->zbuf != NULL) {
> > > > +        munmap(qpl->zbuf, qpl->job_num * qpl->data_size);
> > > > +        qpl->zbuf = NULL;
> > > > +    }
> > > > +    if (qpl->zbuf_hdr != NULL) {
> > > > +        g_free(qpl->zbuf_hdr);
> > > > +        qpl->zbuf_hdr = NULL;
> > > > +    }
> > > > +}
> > > > +
> > > > +static int alloc_zbuf(QplData *qpl, uint8_t chan_id, Error **errp)
> > > > +{
> > > > +    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
> > > > +    uint32_t size = qpl->job_num * qpl->data_size;
> > > > +    uint8_t *buf;
> > > > +
> > > > +    buf = (uint8_t *) mmap(NULL, size, PROT_READ | PROT_WRITE,
> flags, -
> > > 1, 0);
> > > > +    if (buf == MAP_FAILED) {
> > > > +        error_setg(errp, "multifd: %u: alloc_zbuf failed, job
> num %u,
> > > size %u",
> > > > +                   chan_id, qpl->job_num, qpl->data_size);
> > > > +        return -1;
> > > > +    }
> > >
> > > What's the reason for using mmap here, rather than a normal
> > > malloc ?
> >
> > I want to populate the memory accessed by the IAA device in the
> initialization
> > phase, and then avoid initiating I/O page faults through the IAA device
> during
> > migration, a large number of I/O page faults are not good for
> performance.
> 
> Does this mmap actually make a measurable difference ?
> 
> If I've followed the code paths correctly, I think this
> alloc_zbuf method only gets called during initial setup
> of each migration thread.
> 
> So this use of MAP_POPULATE seems to only make a difference
> between faulting in before starting sending data, and faulting
> in on first bit of data that's sent. I'm surprised if that's
> noticable as a difference.

You are right, the performance impact is only on the first page fault 
processing, and has little impact on the overall live migration performance.

I just did a simple test. The total time of live migration using g_malloc is
2321ms, and it takes 2098ms using mmap. I need more tests to check if 
g_malloc/g_malloc0 is feasible.

> > This problem also occurs at the destination, therefore, I recommend that
> > customers need to add -mem-prealloc for destination boot parameters.
> 
> I can understand mem-prelloc making a difference as that guarantees
> all of guest RAM is faulted in.
> 
> 
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-
> https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-20 15:34       ` Peter Xu
@ 2024-03-20 16:23         ` Liu, Yuan1
  2024-03-20 20:31           ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-20 16:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, March 20, 2024 11:35 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
> 
> On Wed, Mar 20, 2024 at 03:02:59PM +0000, Liu, Yuan1 wrote:
> > > > +static int alloc_zbuf(QplData *qpl, uint8_t chan_id, Error **errp)
> > > > +{
> > > > +    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
> > > > +    uint32_t size = qpl->job_num * qpl->data_size;
> > > > +    uint8_t *buf;
> > > > +
> > > > +    buf = (uint8_t *) mmap(NULL, size, PROT_READ | PROT_WRITE,
> flags, -
> > > 1, 0);
> > > > +    if (buf == MAP_FAILED) {
> > > > +        error_setg(errp, "multifd: %u: alloc_zbuf failed, job
> num %u,
> > > size %u",
> > > > +                   chan_id, qpl->job_num, qpl->data_size);
> > > > +        return -1;
> > > > +    }
> > >
> > > What's the reason for using mmap here, rather than a normal
> > > malloc ?
> >
> > I want to populate the memory accessed by the IAA device in the
> initialization
> > phase, and then avoid initiating I/O page faults through the IAA device
> during
> > migration, a large number of I/O page faults are not good for
> performance.
> 
> mmap() doesn't populate pages, unless with MAP_POPULATE.  And even with
> that it shouldn't be guaranteed, as the populate phase should ignore all
> errors.
> 
>        MAP_POPULATE (since Linux 2.5.46)
>               Populate (prefault) page tables for a mapping.  For a file
> map‐
>               ping, this causes read-ahead on the file.  This will help to
> re‐
>               duce  blocking  on  page  faults later.  The mmap() call
> doesn't
>               fail if the mapping cannot be populated  (for  example,  due
> to
>               limitations  on  the  number  of  mapped  huge  pages when
> using
>               MAP_HUGETLB).  Support for MAP_POPULATE in conjunction with
> pri‐
>               vate mappings was added in Linux 2.6.23.
> 
> OTOH, I think g_malloc0() should guarantee to prefault everything in as
> long as the call returned (even though they can be swapped out later, but
> that applies to all cases anyway).

Thanks, Peter. I will try the g_malloc0 method here

> > This problem also occurs at the destination, therefore, I recommend that
> > customers need to add -mem-prealloc for destination boot parameters.
> 
> I'm not sure what issue you hit when testing it, but -mem-prealloc flag
> should only control the guest memory backends not the buffers that QEMU
> internally use, afaiu.
> 
> Thanks,
> 
> --
> Peter Xu

let me explain here, during the decompression operation of IAA, the decompressed data
can be directly output to the virtual address of the guest memory by IAA hardware. 
It can avoid copying the decompressed data to guest memory by CPU.

Without -mem-prealloc, all the guest memory is not populated, and IAA hardware needs to trigger
I/O page fault first and then output the decompressed data to the guest memory region. 
Besides that, CPU page faults will also trigger IOTLB flush operation when IAA devices use SVM. 

Due to the inability to quickly resolve a large number of IO page faults and IOTLB flushes, the
decompression throughput of the IAA device will decrease significantly.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 7/7] tests/migration-test: add qpl compression test
  2024-03-20 15:39       ` Daniel P. Berrangé
@ 2024-03-20 16:26         ` Liu, Yuan1
  0 siblings, 0 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-20 16:26 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: peterx, farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Wednesday, March 20, 2024 11:40 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: peterx@redhat.com; farosas@suse.de; qemu-devel@nongnu.org;
> hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou, Nanhai
> <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 7/7] tests/migration-test: add qpl compression test
> 
> On Wed, Mar 20, 2024 at 03:30:40PM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Daniel P. Berrangé <berrange@redhat.com>
> > > Sent: Wednesday, March 20, 2024 6:46 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: peterx@redhat.com; farosas@suse.de; qemu-devel@nongnu.org;
> > > hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou, Nanhai
> > > <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 7/7] tests/migration-test: add qpl compression
> test
> > >
> > > On Wed, Mar 20, 2024 at 12:45:27AM +0800, Yuan Liu wrote:
> > > > add qpl to compression method test for multifd migration
> > > >
> > > > the migration with qpl compression needs to access IAA hardware
> > > > resource, please run "check-qtest" with sudo or root permission,
> > > > otherwise migration test will fail
> > >
> > > That's not an acceptable requirement.
> > >
> > > If someone builds QEMU with QPL, the migration test *must*
> > > pass 100% reliably when either running on a host without
> > > the QPL required hardware, or when lacking permissions.
> > >
> > > The test case needs to detect these scenarios and automatically
> > > skip the test if it is incapable of running successfully.
> > > This raises another question though. If QPL migration requires
> > > running as root, then it is effectively unusable for QEMU, as
> > > no sane deployment ever runs QEMU as root.
> > >
> > > Is there a way to make QPL work for non-root users ?
> >
> > There are two issues here
> > 1. I need to add an IAA resource detection before the QPL test begins
> >    In this way, when QPL resources are unavailable, the live migration
> >    test will not be affected.
> >
> > 2. I need to add some additional information about IAA configuration in
> >    the devel/qpl-compression.rst documentation. In addition to
> configuring
> >    IAA resources, the system administrator also needs to assign IAA
> resources
> >    to user groups.
> >    For example, the system administrator runs "chown -R user /dev/iax",
> then
> >    all IAA resources can be accessed by "user", this method does not
> require
> >    sudo or root permissions
> 
> Ok, so in the test suite you likely should do something
> approximately like
> 
> #ifdef CONFIG_QPL
>   if (access("/dev/iax", R_OK|W_OK) == 0) {
>     migration_test_add("/migration/multifd/tcp/plain/qpl",
>                        test_multifd_tcp_qpl);
>   }
> #endif
> 
> possibly more if you need to actually query supported features
> of /dev/iax before trying to use it

Yes, very thanks for your suggestion, I will fix this in the next version.

> > > > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > > > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > > > ---
> > > >  tests/qtest/migration-test.c | 24 ++++++++++++++++++++++++
> > > >  1 file changed, 24 insertions(+)
> > > >
> > > > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-
> test.c
> > > > index 71895abb7f..052d0d60fd 100644
> > > > --- a/tests/qtest/migration-test.c
> > > > +++ b/tests/qtest/migration-test.c
> > > > @@ -2815,6 +2815,15 @@
> > > test_migrate_precopy_tcp_multifd_zstd_start(QTestState *from,
> > > >  }
> > > >  #endif /* CONFIG_ZSTD */
> > > >
> > > > +#ifdef CONFIG_QPL
> > > > +static void *
> > > > +test_migrate_precopy_tcp_multifd_qpl_start(QTestState *from,
> > > > +                                            QTestState *to)
> > > > +{
> > > > +    return test_migrate_precopy_tcp_multifd_start_common(from, to,
> > > "qpl");
> > > > +}
> > > > +#endif /* CONFIG_QPL */
> > > > +
> > > >  static void test_multifd_tcp_none(void)
> > > >  {
> > > >      MigrateCommon args = {
> > > > @@ -2880,6 +2889,17 @@ static void test_multifd_tcp_zstd(void)
> > > >  }
> > > >  #endif
> > > >
> > > > +#ifdef CONFIG_QPL
> > > > +static void test_multifd_tcp_qpl(void)
> > > > +{
> > > > +    MigrateCommon args = {
> > > > +        .listen_uri = "defer",
> > > > +        .start_hook = test_migrate_precopy_tcp_multifd_qpl_start,
> > > > +    };
> > > > +    test_precopy_common(&args);
> > > > +}
> > > > +#endif
> > > > +
> > > >  #ifdef CONFIG_GNUTLS
> > > >  static void *
> > > >  test_migrate_multifd_tcp_tls_psk_start_match(QTestState *from,
> > > > @@ -3789,6 +3809,10 @@ int main(int argc, char **argv)
> > > >      migration_test_add("/migration/multifd/tcp/plain/zstd",
> > > >                         test_multifd_tcp_zstd);
> > > >  #endif
> > > > +#ifdef CONFIG_QPL
> > > > +    migration_test_add("/migration/multifd/tcp/plain/qpl",
> > > > +                       test_multifd_tcp_qpl);
> > > > +#endif
> > > >  #ifdef CONFIG_GNUTLS
> > > >      migration_test_add("/migration/multifd/tcp/tls/psk/match",
> > > >                         test_multifd_tcp_tls_psk_match);
> > > > --
> > > > 2.39.3
> > > >
> > > >
> > >
> > > With regards,
> > > Daniel
> > > --
> > > |: https://berrange.com      -o-
> > > https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org         -o-
> > > https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org    -o-
> > > https://www.instagram.com/dberrange :|
> >
> 
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-
> https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-20 16:23         ` Liu, Yuan1
@ 2024-03-20 20:31           ` Peter Xu
  2024-03-21  1:37             ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-20 20:31 UTC (permalink / raw)
  To: Liu, Yuan1
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> let me explain here, during the decompression operation of IAA, the
> decompressed data can be directly output to the virtual address of the
> guest memory by IAA hardware.  It can avoid copying the decompressed data
> to guest memory by CPU.

I see.

> Without -mem-prealloc, all the guest memory is not populated, and IAA
> hardware needs to trigger I/O page fault first and then output the
> decompressed data to the guest memory region.  Besides that, CPU page
> faults will also trigger IOTLB flush operation when IAA devices use SVM.

Oh so the IAA hardware already can use CPU pgtables?  Nice..

Why IOTLB flush is needed?  AFAIU we're only installing new pages, the
request can either come from a CPU access or a DMA.  In all cases there
should have no tearing down of an old page.  Isn't an iotlb flush only
needed if a tear down happens?

>
> Due to the inability to quickly resolve a large number of IO page faults
> and IOTLB flushes, the decompression throughput of the IAA device will
> decrease significantly.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-20 20:31           ` Peter Xu
@ 2024-03-21  1:37             ` Liu, Yuan1
  2024-03-21 15:28               ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-21  1:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 21, 2024 4:32 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
> 
> On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> > let me explain here, during the decompression operation of IAA, the
> > decompressed data can be directly output to the virtual address of the
> > guest memory by IAA hardware.  It can avoid copying the decompressed
> data
> > to guest memory by CPU.
> 
> I see.
> 
> > Without -mem-prealloc, all the guest memory is not populated, and IAA
> > hardware needs to trigger I/O page fault first and then output the
> > decompressed data to the guest memory region.  Besides that, CPU page
> > faults will also trigger IOTLB flush operation when IAA devices use SVM.
> 
> Oh so the IAA hardware already can use CPU pgtables?  Nice..
> 
> Why IOTLB flush is needed?  AFAIU we're only installing new pages, the
> request can either come from a CPU access or a DMA.  In all cases there
> should have no tearing down of an old page.  Isn't an iotlb flush only
> needed if a tear down happens?

As far as I know, IAA hardware uses SVM technology to use the CPU's page table 
for address translation (IOMMU scalable mode directly accesses the CPU page table).
Therefore, when the CPU page table changes, the device's Invalidation operation needs
to be triggered to update the IOMMU and the device's cache. 

My current kernel version is mainline 6.2. The issue I see is as follows:
--Handle_mm_fault
 |
  -- wp_page_copy
    |
    -- mmu_notifier_invalidate_range
      |
      -- intel_invalidate_rage
        |
        -- qi_flush_piotlb
        -- qi_flush_dev_iotlb_pasid
	 

> > Due to the inability to quickly resolve a large number of IO page faults
> > and IOTLB flushes, the decompression throughput of the IAA device will
> > decrease significantly.
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-21  1:37             ` Liu, Yuan1
@ 2024-03-21 15:28               ` Peter Xu
  2024-03-22  2:06                 ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-21 15:28 UTC (permalink / raw)
  To: Liu, Yuan1
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Thu, Mar 21, 2024 at 01:37:36AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, March 21, 2024 4:32 AM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> > Nanhai <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> > qpl compression
> > 
> > On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> > > let me explain here, during the decompression operation of IAA, the
> > > decompressed data can be directly output to the virtual address of the
> > > guest memory by IAA hardware.  It can avoid copying the decompressed
> > data
> > > to guest memory by CPU.
> > 
> > I see.
> > 
> > > Without -mem-prealloc, all the guest memory is not populated, and IAA
> > > hardware needs to trigger I/O page fault first and then output the
> > > decompressed data to the guest memory region.  Besides that, CPU page
> > > faults will also trigger IOTLB flush operation when IAA devices use SVM.
> > 
> > Oh so the IAA hardware already can use CPU pgtables?  Nice..
> > 
> > Why IOTLB flush is needed?  AFAIU we're only installing new pages, the
> > request can either come from a CPU access or a DMA.  In all cases there
> > should have no tearing down of an old page.  Isn't an iotlb flush only
> > needed if a tear down happens?
> 
> As far as I know, IAA hardware uses SVM technology to use the CPU's page table 
> for address translation (IOMMU scalable mode directly accesses the CPU page table).
> Therefore, when the CPU page table changes, the device's Invalidation operation needs
> to be triggered to update the IOMMU and the device's cache. 
> 
> My current kernel version is mainline 6.2. The issue I see is as follows:
> --Handle_mm_fault
>  |
>   -- wp_page_copy

This is the CoW path.  Not usual at all..

I assume this issue should only present on destination.  Then the guest
pages should be the destination of such DMAs to happen, which means these
should be write faults, and as we see here it is, otherwise it won't
trigger a CoW.

However it's not clear to me why a pre-installed zero page existed.  It
means someone read the guest pages first.

It might be interesting to know _why_ someone reads the guest pages, even
if we know they're all zeros.  If we can avoid such reads then it'll be a
hole rather than a prefaulted read on zero page, then invalidations are not
needed, and I expect that should fix the iotlb storm issue.

It'll still be good we can fix this first to not make qpl special from this
regard, so that the hope is migration submodule shouldn't rely on any
pre-config (-mem-prealloc) on guest memory behaviors to work properly.

>     |
>     -- mmu_notifier_invalidate_range
>       |
>       -- intel_invalidate_rage
>         |
>         -- qi_flush_piotlb
>         -- qi_flush_dev_iotlb_pasid

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-21 15:28               ` Peter Xu
@ 2024-03-22  2:06                 ` Liu, Yuan1
  2024-03-22 14:47                   ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-22  2:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 21, 2024 11:28 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
> 
> On Thu, Mar 21, 2024 at 01:37:36AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, March 21, 2024 4:32 AM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com;
> Zou,
> > > Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement
> initialization of
> > > qpl compression
> > >
> > > On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> > > > let me explain here, during the decompression operation of IAA, the
> > > > decompressed data can be directly output to the virtual address of
> the
> > > > guest memory by IAA hardware.  It can avoid copying the decompressed
> > > data
> > > > to guest memory by CPU.
> > >
> > > I see.
> > >
> > > > Without -mem-prealloc, all the guest memory is not populated, and
> IAA
> > > > hardware needs to trigger I/O page fault first and then output the
> > > > decompressed data to the guest memory region.  Besides that, CPU
> page
> > > > faults will also trigger IOTLB flush operation when IAA devices use
> SVM.
> > >
> > > Oh so the IAA hardware already can use CPU pgtables?  Nice..
> > >
> > > Why IOTLB flush is needed?  AFAIU we're only installing new pages, the
> > > request can either come from a CPU access or a DMA.  In all cases
> there
> > > should have no tearing down of an old page.  Isn't an iotlb flush only
> > > needed if a tear down happens?
> >
> > As far as I know, IAA hardware uses SVM technology to use the CPU's page
> table
> > for address translation (IOMMU scalable mode directly accesses the CPU
> page table).
> > Therefore, when the CPU page table changes, the device's Invalidation
> operation needs
> > to be triggered to update the IOMMU and the device's cache.
> >
> > My current kernel version is mainline 6.2. The issue I see is as
> follows:
> > --Handle_mm_fault
> >  |
> >   -- wp_page_copy
> 
> This is the CoW path.  Not usual at all..
> 
> I assume this issue should only present on destination.  Then the guest
> pages should be the destination of such DMAs to happen, which means these
> should be write faults, and as we see here it is, otherwise it won't
> trigger a CoW.
> 
> However it's not clear to me why a pre-installed zero page existed.  It
> means someone read the guest pages first.
> 
> It might be interesting to know _why_ someone reads the guest pages, even
> if we know they're all zeros.  If we can avoid such reads then it'll be a
> hole rather than a prefaulted read on zero page, then invalidations are
> not
> needed, and I expect that should fix the iotlb storm issue.

The received pages will be read for zero pages check first. Although
these pages are zero pages, and IAA hardware will not access them, the
COW happens and causes following IOTLB flush operation. As far as I know, 
IOMMU quickly detects whether the address range has been used by the device,
and does not invalidate the address that is not used by the device, this has 
not yet been resolved in Linux kernel 6.2. I will check the latest status for
this.
void multifd_recv_zero_page_process(MultiFDRecvParams *p)
{
    for (int i = 0; i < p->zero_num; i++) {
        void *page = p->host + p->zero[i];
        if (!buffer_is_zero(page, p->page_size)) {
            memset(page, 0, p->page_size);
        }
    }
}


> It'll still be good we can fix this first to not make qpl special from
> this
> regard, so that the hope is migration submodule shouldn't rely on any
> pre-config (-mem-prealloc) on guest memory behaviors to work properly.

Even if the IOTLB problem can be avoided, the I/O page fault problem (normal
pages are loaded by the IAA device and solving normal page faults through IOMMU,
the performance is not good)

It can let the decompressed data of the IAA device be output to a pre-populated
memory instead of directly outputting to the guest address, but then each multifd
thread needs two memory copies, one copy from the network to the IAA input 
memory(pre-populated), and another copy from the IAA output memory(pre-populated)
to the guest address, which may become a performance bottleneck at the destination
during the live migration process.

So I think it is still necessary to use the -mem-prealloc option

> >     -- mmu_notifier_invalidate_range
> >       |
> >       -- intel_invalidate_rage
> >         |
> >         -- qi_flush_piotlb
> >         -- qi_flush_dev_iotlb_pasid
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-22  2:06                 ` Liu, Yuan1
@ 2024-03-22 14:47                   ` Liu, Yuan1
  2024-03-22 16:40                     ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-22 14:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Liu, Yuan1
> Sent: Friday, March 22, 2024 10:07 AM
> To: Peter Xu <peterx@redhat.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: RE: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
> 
> > -----Original Message-----
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, March 21, 2024 11:28 PM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com;
> Zou,
> > Nanhai <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization
> of
> > qpl compression
> >
> > On Thu, Mar 21, 2024 at 01:37:36AM +0000, Liu, Yuan1 wrote:
> > > > -----Original Message-----
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Sent: Thursday, March 21, 2024 4:32 AM
> > > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > > > devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com;
> > Zou,
> > > > Nanhai <nanhai.zou@intel.com>
> > > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement
> > initialization of
> > > > qpl compression
> > > >
> > > > On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> > > > > let me explain here, during the decompression operation of IAA,
> the
> > > > > decompressed data can be directly output to the virtual address of
> > the
> > > > > guest memory by IAA hardware.  It can avoid copying the
> decompressed
> > > > data
> > > > > to guest memory by CPU.
> > > >
> > > > I see.
> > > >
> > > > > Without -mem-prealloc, all the guest memory is not populated, and
> > IAA
> > > > > hardware needs to trigger I/O page fault first and then output the
> > > > > decompressed data to the guest memory region.  Besides that, CPU
> > page
> > > > > faults will also trigger IOTLB flush operation when IAA devices
> use
> > SVM.
> > > >
> > > > Oh so the IAA hardware already can use CPU pgtables?  Nice..
> > > >
> > > > Why IOTLB flush is needed?  AFAIU we're only installing new pages,
> the
> > > > request can either come from a CPU access or a DMA.  In all cases
> > there
> > > > should have no tearing down of an old page.  Isn't an iotlb flush
> only
> > > > needed if a tear down happens?
> > >
> > > As far as I know, IAA hardware uses SVM technology to use the CPU's
> page
> > table
> > > for address translation (IOMMU scalable mode directly accesses the CPU
> > page table).
> > > Therefore, when the CPU page table changes, the device's Invalidation
> > operation needs
> > > to be triggered to update the IOMMU and the device's cache.
> > >
> > > My current kernel version is mainline 6.2. The issue I see is as
> > follows:
> > > --Handle_mm_fault
> > >  |
> > >   -- wp_page_copy
> >
> > This is the CoW path.  Not usual at all..
> >
> > I assume this issue should only present on destination.  Then the guest
> > pages should be the destination of such DMAs to happen, which means
> these
> > should be write faults, and as we see here it is, otherwise it won't
> > trigger a CoW.
> >
> > However it's not clear to me why a pre-installed zero page existed.  It
> > means someone read the guest pages first.
> >
> > It might be interesting to know _why_ someone reads the guest pages,
> even
> > if we know they're all zeros.  If we can avoid such reads then it'll be
> a
> > hole rather than a prefaulted read on zero page, then invalidations are
> > not
> > needed, and I expect that should fix the iotlb storm issue.
> 
> The received pages will be read for zero pages check first. Although
> these pages are zero pages, and IAA hardware will not access them, the
> COW happens and causes following IOTLB flush operation. As far as I know,
> IOMMU quickly detects whether the address range has been used by the
> device,
> and does not invalidate the address that is not used by the device, this
> has
> not yet been resolved in Linux kernel 6.2. I will check the latest status
> for
> this.

I checked the Linux mainline 6.8 code, there are no big changes for this.
In version 6.8, if the process needs to flush MMU TLB, then I/O TLB flush
will be also triggered when the process has SVM devices. I haven't found
the code to check if pages have been set EA (Extended-Accessed) bit before
submitting invalidation operations, this is same with version 6.2.

VT-d 3.6.2
If the Extended-Accessed-Flag-Enable (EAFE) is 1 in a scalable-mode PASID-table
entry that references a first-stage paging-structure entry used by the remapping
hardware, it atomically sets the EA field in that entry. Whenever EA field is 
atomically set, the A field is also set in the same atomic operation. For software
usages where the first-stage paging structures are shared across heterogeneous agents
(e.g., CPUs and accelerator devices such as GPUs), the EA flag may be used by software
to identify pages accessed by non-CPU agent(s) (as opposed to the A flag which indicates
access by any agent sharing the paging structures).

> void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> {
>     for (int i = 0; i < p->zero_num; i++) {
>         void *page = p->host + p->zero[i];
>         if (!buffer_is_zero(page, p->page_size)) {
>             memset(page, 0, p->page_size);
>         }
>     }
> }
> 
> 
> > It'll still be good we can fix this first to not make qpl special from
> > this
> > regard, so that the hope is migration submodule shouldn't rely on any
> > pre-config (-mem-prealloc) on guest memory behaviors to work properly.
> 
> Even if the IOTLB problem can be avoided, the I/O page fault problem
> (normal
> pages are loaded by the IAA device and solving normal page faults through
> IOMMU,
> the performance is not good)
> 
> It can let the decompressed data of the IAA device be output to a pre-
> populated
> memory instead of directly outputting to the guest address, but then each
> multifd
> thread needs two memory copies, one copy from the network to the IAA input
> memory(pre-populated), and another copy from the IAA output memory(pre-
> populated)
> to the guest address, which may become a performance bottleneck at the
> destination
> during the live migration process.
> 
> So I think it is still necessary to use the -mem-prealloc option
> 
> > >     -- mmu_notifier_invalidate_range
> > >       |
> > >       -- intel_invalidate_rage
> > >         |
> > >         -- qi_flush_piotlb
> > >         -- qi_flush_dev_iotlb_pasid
> >
> > --
> > Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-22 14:47                   ` Liu, Yuan1
@ 2024-03-22 16:40                     ` Peter Xu
  2024-03-27 19:25                       ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-22 16:40 UTC (permalink / raw)
  To: Liu, Yuan1
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Fri, Mar 22, 2024 at 02:47:02PM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Liu, Yuan1
> > Sent: Friday, March 22, 2024 10:07 AM
> > To: Peter Xu <peterx@redhat.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> > Nanhai <nanhai.zou@intel.com>
> > Subject: RE: [PATCH v5 5/7] migration/multifd: implement initialization of
> > qpl compression
> > 
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, March 21, 2024 11:28 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com;
> > Zou,
> > > Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization
> > of
> > > qpl compression
> > >
> > > On Thu, Mar 21, 2024 at 01:37:36AM +0000, Liu, Yuan1 wrote:
> > > > > -----Original Message-----
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > Sent: Thursday, March 21, 2024 4:32 AM
> > > > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > > > > devel@nongnu.org; hao.xiang@bytedance.com;
> > bryan.zhang@bytedance.com;
> > > Zou,
> > > > > Nanhai <nanhai.zou@intel.com>
> > > > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement
> > > initialization of
> > > > > qpl compression
> > > > >
> > > > > On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> > > > > > let me explain here, during the decompression operation of IAA,
> > the
> > > > > > decompressed data can be directly output to the virtual address of
> > > the
> > > > > > guest memory by IAA hardware.  It can avoid copying the
> > decompressed
> > > > > data
> > > > > > to guest memory by CPU.
> > > > >
> > > > > I see.
> > > > >
> > > > > > Without -mem-prealloc, all the guest memory is not populated, and
> > > IAA
> > > > > > hardware needs to trigger I/O page fault first and then output the
> > > > > > decompressed data to the guest memory region.  Besides that, CPU
> > > page
> > > > > > faults will also trigger IOTLB flush operation when IAA devices
> > use
> > > SVM.
> > > > >
> > > > > Oh so the IAA hardware already can use CPU pgtables?  Nice..
> > > > >
> > > > > Why IOTLB flush is needed?  AFAIU we're only installing new pages,
> > the
> > > > > request can either come from a CPU access or a DMA.  In all cases
> > > there
> > > > > should have no tearing down of an old page.  Isn't an iotlb flush
> > only
> > > > > needed if a tear down happens?
> > > >
> > > > As far as I know, IAA hardware uses SVM technology to use the CPU's
> > page
> > > table
> > > > for address translation (IOMMU scalable mode directly accesses the CPU
> > > page table).
> > > > Therefore, when the CPU page table changes, the device's Invalidation
> > > operation needs
> > > > to be triggered to update the IOMMU and the device's cache.
> > > >
> > > > My current kernel version is mainline 6.2. The issue I see is as
> > > follows:
> > > > --Handle_mm_fault
> > > >  |
> > > >   -- wp_page_copy
> > >
> > > This is the CoW path.  Not usual at all..
> > >
> > > I assume this issue should only present on destination.  Then the guest
> > > pages should be the destination of such DMAs to happen, which means
> > these
> > > should be write faults, and as we see here it is, otherwise it won't
> > > trigger a CoW.
> > >
> > > However it's not clear to me why a pre-installed zero page existed.  It
> > > means someone read the guest pages first.
> > >
> > > It might be interesting to know _why_ someone reads the guest pages,
> > even
> > > if we know they're all zeros.  If we can avoid such reads then it'll be
> > a
> > > hole rather than a prefaulted read on zero page, then invalidations are
> > > not
> > > needed, and I expect that should fix the iotlb storm issue.
> > 
> > The received pages will be read for zero pages check first. Although
> > these pages are zero pages, and IAA hardware will not access them, the
> > COW happens and causes following IOTLB flush operation. As far as I know,
> > IOMMU quickly detects whether the address range has been used by the
> > device,
> > and does not invalidate the address that is not used by the device, this
> > has
> > not yet been resolved in Linux kernel 6.2. I will check the latest status
> > for
> > this.
> 
> I checked the Linux mainline 6.8 code, there are no big changes for this.
> In version 6.8, if the process needs to flush MMU TLB, then I/O TLB flush
> will be also triggered when the process has SVM devices. I haven't found
> the code to check if pages have been set EA (Extended-Accessed) bit before
> submitting invalidation operations, this is same with version 6.2.
> 
> VT-d 3.6.2
> If the Extended-Accessed-Flag-Enable (EAFE) is 1 in a scalable-mode PASID-table
> entry that references a first-stage paging-structure entry used by the remapping
> hardware, it atomically sets the EA field in that entry. Whenever EA field is 
> atomically set, the A field is also set in the same atomic operation. For software
> usages where the first-stage paging structures are shared across heterogeneous agents
> (e.g., CPUs and accelerator devices such as GPUs), the EA flag may be used by software
> to identify pages accessed by non-CPU agent(s) (as opposed to the A flag which indicates
> access by any agent sharing the paging structures).

This seems pretty new hardware features.  I didn't check in depths but what
you said makes sense.

> 
> > void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> > {
> >     for (int i = 0; i < p->zero_num; i++) {
> >         void *page = p->host + p->zero[i];
> >         if (!buffer_is_zero(page, p->page_size)) {
> >             memset(page, 0, p->page_size);
> >         }
> >     }
> > }

It may not matter much (where I also see your below comments), but just to
mention another solution to avoid this read is that we can maintain
RAMBlock->receivedmap for precopy (especially, multifd, afaiu multifd
doesn't yet update this bitmap.. even if normal precopy does), then here
instead of scanning every time, maybe we can do:

  /*
   * If it's the 1st time receiving it, no need to clear it as it must be
   * all zeros now.
   */
  if (bitmap_test(rb->receivedmap, page_offset)) {
      memset(page, 0, ...);
  } else {
      bitmap_set(rb->receivedmap, page_offset);
  }

And we also always set the bit when !zero too.
    
My rational is that it's unlikely a zero page if it's sent once or more,
while OTOH for the 1st time we receive it, it must be a zero page, so no
need to scan for the 1st round.

> > 
> > 
> > > It'll still be good we can fix this first to not make qpl special from
> > > this
> > > regard, so that the hope is migration submodule shouldn't rely on any
> > > pre-config (-mem-prealloc) on guest memory behaviors to work properly.
> > 
> > Even if the IOTLB problem can be avoided, the I/O page fault problem
> > (normal
> > pages are loaded by the IAA device and solving normal page faults through
> > IOMMU,
> > the performance is not good)

Do you have a rough estimate on how slow that could be? It'll be good to
mention some details too in the doc file in that case.

> > 
> > It can let the decompressed data of the IAA device be output to a pre-
> > populated
> > memory instead of directly outputting to the guest address, but then each
> > multifd
> > thread needs two memory copies, one copy from the network to the IAA input
> > memory(pre-populated), and another copy from the IAA output memory(pre-
> > populated)
> > to the guest address, which may become a performance bottleneck at the
> > destination
> > during the live migration process.
> >  
> > So I think it is still necessary to use the -mem-prealloc option

Right, that complexity may not be necessary, in that case, maybe such
suggestion is fine.

Thanks,

> > 
> > > >     -- mmu_notifier_invalidate_range
> > > >       |
> > > >       -- intel_invalidate_rage
> > > >         |
> > > >         -- qi_flush_piotlb
> > > >         -- qi_flush_dev_iotlb_pasid
> > >
> > > --
> > > Peter Xu
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 1/7] docs/migration: add qpl compression feature
  2024-03-19 16:45 ` [PATCH v5 1/7] docs/migration: add qpl compression feature Yuan Liu
@ 2024-03-26 17:58   ` Peter Xu
  2024-03-27  2:14     ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-26 17:58 UTC (permalink / raw)
  To: Yuan Liu; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, nanhai.zou

On Wed, Mar 20, 2024 at 12:45:21AM +0800, Yuan Liu wrote:
> add Intel Query Processing Library (QPL) compression method
> introduction
> 
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> ---
>  docs/devel/migration/features.rst        |   1 +
>  docs/devel/migration/qpl-compression.rst | 231 +++++++++++++++++++++++
>  2 files changed, 232 insertions(+)
>  create mode 100644 docs/devel/migration/qpl-compression.rst
> 
> diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
> index d5ca7b86d5..bc98b65075 100644
> --- a/docs/devel/migration/features.rst
> +++ b/docs/devel/migration/features.rst
> @@ -12,3 +12,4 @@ Migration has plenty of features to support different use cases.
>     virtio
>     mapped-ram
>     CPR
> +   qpl-compression
> diff --git a/docs/devel/migration/qpl-compression.rst b/docs/devel/migration/qpl-compression.rst
> new file mode 100644
> index 0000000000..42c7969d30
> --- /dev/null
> +++ b/docs/devel/migration/qpl-compression.rst
> @@ -0,0 +1,231 @@
> +===============
> +QPL Compression
> +===============
> +The Intel Query Processing Library (Intel ``QPL``) is an open-source library to
> +provide compression and decompression features and it is based on deflate
> +compression algorithm (RFC 1951).
> +
> +The ``QPL`` compression relies on Intel In-Memory Analytics Accelerator(``IAA``)
> +and Shared Virtual Memory(``SVM``) technology, they are new features supported
> +from Intel 4th Gen Intel Xeon Scalable processors, codenamed Sapphire Rapids
> +processor(``SPR``).
> +
> +For more ``QPL`` introduction, please refer to:
> +
> +https://intel.github.io/qpl/documentation/introduction_docs/introduction.html

There're a bunch of links in this page, please consider switching all of
them to use the link formats of .rST:

  Please refer to `QPL introduction page <https://...>`_.

> +
> +QPL Compression Framework
> +=========================
> +
> +::
> +
> +  +----------------+       +------------------+
> +  | MultiFD Service|       |accel-config tool |
> +  +-------+--------+       +--------+---------+
> +          |                         |
> +          |                         |
> +  +-------+--------+                | Setup IAA
> +  |  QPL library   |                | Resources
> +  +-------+---+----+                |
> +          |   |                     |
> +          |   +-------------+-------+
> +          |   Open IAA      |
> +          |   Devices +-----+-----+
> +          |           |idxd driver|
> +          |           +-----+-----+
> +          |                 |
> +          |                 |
> +          |           +-----+-----+
> +          +-----------+IAA Devices|
> +      Submit jobs     +-----------+
> +      via enqcmd
> +
> +
> +Intel In-Memory Analytics Accelerator (Intel IAA) Introduction
> +================================================================
> +
> +Intel ``IAA`` is an accelerator that has been designed to help benefit
> +in-memory databases and analytic workloads. There are three main areas
> +that Intel ``IAA`` can assist with analytics primitives (scan, filter, etc.),
> +sparse data compression and memory tiering.
> +
> +``IAA`` Manual Documentation:
> +
> +https://www.intel.com/content/www/us/en/content-details/721858/intel-in-memory-analytics-accelerator-architecture-specification
> +
> +IAA Device Enabling
> +-------------------
> +
> +- Enabling ``IAA`` devices for platform configuration, please refer to:
> +
> +https://www.intel.com/content/www/us/en/content-details/780887/intel-in-memory-analytics-accelerator-intel-iaa.html
> +
> +- ``IAA`` device driver is ``Intel Data Accelerator Driver (idxd)``, it is
> +  recommended that the minimum version of Linux kernel is 5.18.
> +
> +- Add ``"intel_iommu=on,sm_on"`` parameter to kernel command line
> +  for ``SVM`` feature enabling.
> +
> +Here is an easy way to verify ``IAA`` device driver and ``SVM``, refer to:
> +
> +https://github.com/intel/idxd-config/tree/stable/test
> +
> +IAA Device Management
> +---------------------
> +
> +The number of ``IAA`` devices will vary depending on the Xeon product model.
> +On a ``SPR`` server, there can be a maximum of 8 ``IAA`` devices, with up to
> +4 devices per socket.
> +
> +By default, all ``IAA`` devices are disabled and need to be configured and
> +enabled by users manually.
> +
> +Check the number of devices through the following command
> +
> +.. code-block:: shell
> +
> +  # lspci -d 8086:0cfe
> +  # 6a:02.0 System peripheral: Intel Corporation Device 0cfe
> +  # 6f:02.0 System peripheral: Intel Corporation Device 0cfe
> +  # 74:02.0 System peripheral: Intel Corporation Device 0cfe
> +  # 79:02.0 System peripheral: Intel Corporation Device 0cfe
> +  # e7:02.0 System peripheral: Intel Corporation Device 0cfe
> +  # ec:02.0 System peripheral: Intel Corporation Device 0cfe
> +  # f1:02.0 System peripheral: Intel Corporation Device 0cfe
> +  # f6:02.0 System peripheral: Intel Corporation Device 0cfe
> +
> +IAA Device Configuration
> +------------------------
> +
> +The ``accel-config`` tool is used to enable ``IAA`` devices and configure
> +``IAA`` hardware resources(work queues and engines). One ``IAA`` device
> +has 8 work queues and 8 processing engines, multiple engines can be assigned
> +to a work queue via ``group`` attribute.
> +
> +One example of configuring and enabling an ``IAA`` device.
> +
> +.. code-block:: shell
> +
> +  # accel-config config-engine iax1/engine1.0 -g 0
> +  # accel-config config-engine iax1/engine1.1 -g 0
> +  # accel-config config-engine iax1/engine1.2 -g 0
> +  # accel-config config-engine iax1/engine1.3 -g 0
> +  # accel-config config-engine iax1/engine1.4 -g 0
> +  # accel-config config-engine iax1/engine1.5 -g 0
> +  # accel-config config-engine iax1/engine1.6 -g 0
> +  # accel-config config-engine iax1/engine1.7 -g 0
> +  # accel-config config-wq iax1/wq1.0 -g 0 -s 128 -p 10 -b 1 -t 128 -m shared -y user -n app1 -d user
> +  # accel-config enable-device iax1
> +  # accel-config enable-wq iax1/wq1.0
> +
> +.. note::
> +   IAX is an early name for IAA
> +
> +- The ``IAA`` device index is 1, use ``ls -lh /sys/bus/dsa/devices/iax*``
> +  command to query the ``IAA`` device index.
> +
> +- 8 engines and 1 work queue are configured in group 0, so all compression jobs
> +  submitted to this work queue can be processed by all engines at the same time.
> +
> +- Set work queue attributes including the work mode, work queue size and so on.
> +
> +- Enable the ``IAA1`` device and work queue 1.0
> +
> +.. note::
> +  Set work queue mode to shared mode, since ``QPL`` library only supports
> +  shared mode
> +
> +For more detailed configuration, please refer to:
> +
> +https://github.com/intel/idxd-config/tree/stable/Documentation/accfg
> +
> +IAA Resources Allocation For Migration
> +--------------------------------------
> +
> +There is no ``IAA`` resource configuration parameters for migration and
> +``accel-config`` tool configuration cannot directly specify the ``IAA``
> +resources used for migration.
> +
> +``QPL`` will use all work queues that are enabled and set to shared mode,
> +and use all engines assigned to the work queues with shared mode.
> +
> +By default, ``QPL`` will only use the local ``IAA`` device for compression
> +job processing. The local ``IAA`` device means that the CPU of the job
> +submission and the ``IAA`` device are on the same socket, so one CPU
> +can submit the jobs to up to 4 ``IAA`` devices.
> +
> +Shared Virtual Memory(SVM) Introduction
> +=======================================
> +
> +An ability for an accelerator I/O device to operate in the same virtual
> +memory space of applications on host processors. It also implies the
> +ability to operate from pageable memory, avoiding functional requirements
> +to pin memory for DMA operations.
> +
> +When using ``SVM`` technology, users do not need to reserve memory for the
> +``IAA`` device and perform pin memory operation. The ``IAA`` device can
> +directly access data using the virtual address of the process.
> +
> +For more ``SVM`` technology, please refer to:
> +
> +https://docs.kernel.org/next/x86/sva.html
> +
> +
> +How To Use QPL Compression In Migration
> +=======================================
> +
> +1 - Installation of ``accel-config`` tool and ``QPL`` library

We can drop "1 " and stick with:

  - item1
    - item1.1
    ...
  - item2

> +
> +  - Install ``accel-config`` tool from https://github.com/intel/idxd-config
> +  - Install ``QPL`` library from https://github.com/intel/qpl
> +
> +2 - Configure and enable ``IAA`` devices and work queues via ``accel-config``
> +
> +3 - Build ``Qemu`` with ``--enable-qpl`` parameter
> +
> +  E.g. configure --target-list=x86_64-softmmu --enable-kvm ``--enable-qpl``
> +
> +4 - Start VMs with ``sudo`` command or ``root`` permission
> +
> +  Use the ``sudo`` command or ``root`` privilege to start the source and
> +  destination virtual machines, since migration service needs permission
> +  to access ``IAA`` hardware resources.
> +
> +5 - Enable ``QPL`` compression during migration
> +
> +  Set ``migrate_set_parameter multifd-compression qpl`` when migrating, the
> +  ``QPL`` compression does not support configuring the compression level, it
> +  only supports one compression level.
> +
> +The Difference Between QPL And ZLIB
> +===================================
> +
> +Although both ``QPL`` and ``ZLIB`` are based on the deflate compression
> +algorithm, and ``QPL`` can support the header and tail of ``ZLIB``, ``QPL``
> +is still not fully compatible with the ``ZLIB`` compression in the migration.
> +
> +``QPL`` only supports 4K history buffer, and ``ZLIB`` is 32K by default. The
> +``ZLIB`` compressed data that ``QPL`` may not decompress correctly and
> +vice versa.
> +
> +``QPL`` does not support the ``Z_SYNC_FLUSH`` operation in ``ZLIB`` streaming
> +compression, current ``ZLIB`` implementation uses ``Z_SYNC_FLUSH``, so each
> +``multifd`` thread has a ``ZLIB`` streaming context, and all page compression
> +and decompression are based on this stream. ``QPL`` cannot decompress such data
> +and vice versa.
> +
> +The introduction for ``Z_SYNC_FLUSH``, please refer to:
> +
> +https://www.zlib.net/manual.html
> +
> +The Best Practices
> +==================
> +
> +When the virtual machine's pages are not populated and the ``IAA`` device is
> +used, I/O page faults occur, which can impact performance due to a large number
> +of flush ``IOTLB`` operations.

AFAIU the IOTLB issue is not expected and can be fixed, while per our
discussion in the other thread, the DMA fault latency cannot.

I think we can mention the possibility of IOTLB flush issue but that
shouldn't be the major cause of such suggestion.  Again, I think it'll be
great to mention how slow a DMA page fault can be, it can be compared in
migration performance difference, or just describe an average DMA page
fault latency, v.s. a processor fault.  That might explain why we suggest
prefault the pages (comparing to a generic setup where the pages are always
faulted by processors).

> +
> +Since the normal pages on the source side are all populated, ``IOTLB`` caused
> +by I/O page fault will not occur. On the destination side, a large number
> +of normal pages need to be loaded, so it is recommended to add ``-mem-prealloc``
> +parameter on the destination side.
> -- 
> 2.39.3
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 0/7] Live Migration With IAA
  2024-03-19 16:45 [PATCH v5 0/7] Live Migration With IAA Yuan Liu
                   ` (6 preceding siblings ...)
  2024-03-19 16:45 ` [PATCH v5 7/7] tests/migration-test: add qpl compression test Yuan Liu
@ 2024-03-26 20:30 ` Peter Xu
  2024-03-27  3:20   ` Liu, Yuan1
  7 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-26 20:30 UTC (permalink / raw)
  To: Yuan Liu; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, nanhai.zou

Hi, Yuan,

On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> 1. QPL will be used as an independent compression method like ZLIB and ZSTD,
>    QPL will force the use of the IAA accelerator and will not support software
>    compression. For a summary of issues compatible with Zlib, please refer to
>    docs/devel/migration/qpl-compression.rst

IIRC our previous discussion is we should provide a software fallback for
the new QEMU paths, right?  Why the decision changed?  Again, such fallback
can help us to make sure qpl won't get broken easily by other changes.

> 
> 2. Compression accelerator related patches are removed from this patch set and
>    will be added to the QAT patch set, we will submit separate patches to use
>    QAT to accelerate ZLIB and ZSTD.
> 
> 3. Advantages of using IAA accelerator include:
>    a. Compared with the non-compression method, it can improve downtime
>       performance without adding additional host resources (both CPU and
>       network).
>    b. Compared with using software compression methods (ZSTD/ZLIB), it can
>       provide high data compression ratio and save a lot of CPU resources
>       used for compression.
> 
> Test condition:
>   1. Host CPUs are based on Sapphire Rapids
>   2. VM type, 16 vCPU and 64G memory
>   3. The source and destination respectively use 4 IAA devices.
>   4. The workload in the VM
>     a. all vCPUs are idle state
>     b. 90% of the virtual machine's memory is used, use silesia to fill
>        the memory.
>        The introduction of silesia:
>        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
>   5. Set "--mem-prealloc" boot parameter on the destination, this parameter
>      can make IAA performance better and related introduction is added here.
>      docs/devel/migration/qpl-compression.rst
>   6. Source migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter downtime-limit 300
>      f. migrate_set_parameter max-bandwidth 100G/1G
>      d. migrate_set_parameter multifd-compression none/qpl/zstd
>   7. Destination migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter multifd-compression none/qpl/zstd
> 
> Early migration result, each result is the average of three tests
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|    8571|      69|    58391|   1896525|  256%|

Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps only
with 2 channels.  My previous experience is around ~1XGbps per channel, so
no more than 30Gbps for two channels.  Is it because of a faster processor?
Indeed from the 4/8 results it doesn't look like increasing the num of
channels helped a lot, and even it got worse on the downtime.

What is the rational behind "downtime improvement" when with the QPL
compressors?  IIUC in this 100Gbps case the bandwidth is never a
limitation, then I don't understand why adding the compression phase can
make the switchover faster.  I can expect much more pages sent in a
NIC-limted env like you described below with 1Gbps, but not when NIC has
unlimited resources like here.

>  |BW:100G +-------------+--------+--------+---------+----------+------+
>  |        |            4|    7180|      92|    69736|   1865640|  300%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|    7090|     121|    70562|   2174060|  307%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|    8413|      34|    30067|   1732411|  230%|
>  |BW:100G +-------------+--------+--------+---------+----------+------+
>  |        |            4|    6559|      32|    38804|   1689954|  450%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|    6623|      37|    38745|   1566507|  790%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|   95846|      24|     1800|    521829|  203%|
>  |BW:100G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   49004|      24|     3529|    890532|  403%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   25574|      32|     6782|   1762222|  800%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
> When network bandwidth resource is sufficient, QPL can improve downtime
> by 2x compared to no compression. In this scenario, with 4 channels, the
> IAA hardware resources are fully used, so adding more channels will not
> gain more benefits.
> 
>  
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|   57758|      66|     8643|    264617|   34%|
>  |BW:  1G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   57216|      58|     8726|    266773|   34%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   56708|      53|     8804|    270223|   33%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|   30129|      34|     8345|   2224761|   54%|
>  |BW:  1G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   30317|      39|     8300|   2025220|   73%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   29615|      35|     8514|   2250122|  131%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|   95750|      24|     1802|    477236|  202%|
>  |BW:  1G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   48907|      24|     3536|   1002142|  404%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   25568|      32|     6783|   1696437|  800%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
> When network bandwidth resource is limited, the "page perf second" metric
> decreases for none compression, the success rate of migration will reduce.
> Comparison of QPL and ZSTD compression methods, QPL can save a lot of CPU
> resources used for compression.
> 
> v2:
>   - add support for multifd compression accelerator
>   - add support for the QPL accelerator in the multifd
>     compression accelerator
>   - fixed the issue that QPL was compiled into the migration
>     module by default
> 
> v3:
>   - use Meson instead of pkg-config to resolve QPL build
>     dependency issue
>   - fix coding style
>   - fix a CI issue for get_multifd_ops function in multifd.c file
> 
> v4:
>   - patch based on commit: da96ad4a6a Merge tag 'hw-misc-20240215' of
>     https://github.com/philmd/qemu into staging
>   - remove the compression accelerator implementation patches, the patches
>     will be placed in the QAT accelerator implementation.
>   - introduce QPL as a new compression method
>   - add QPL compression documentation
>   - add QPL compression migration test
>   - fix zlib/zstd compression level issue
> 
> v5:
>   - patch based on v9.0.0-rc0 (c62d54d0a8)
>   - use pkgconfig to check libaccel-config, libaccel-config is already
>     in many distributions.
>   - initialize the IOV of the sender by the specific compression method
>   - refine the coding style
>   - remove the zlib/zstd compression level not working patch, the issue
>     has been solved
> 
> Yuan Liu (7):
>   docs/migration: add qpl compression feature
>   migration/multifd: put IOV initialization into compression method
>   configure: add --enable-qpl build option
>   migration/multifd: add qpl compression method
>   migration/multifd: implement initialization of qpl compression
>   migration/multifd: implement qpl compression and decompression
>   tests/migration-test: add qpl compression test
> 
>  docs/devel/migration/features.rst        |   1 +
>  docs/devel/migration/qpl-compression.rst | 231 +++++++++++
>  hw/core/qdev-properties-system.c         |   2 +-
>  meson.build                              |  16 +
>  meson_options.txt                        |   2 +
>  migration/meson.build                    |   1 +
>  migration/multifd-qpl.c                  | 482 +++++++++++++++++++++++
>  migration/multifd-zlib.c                 |   4 +
>  migration/multifd-zstd.c                 |   6 +-
>  migration/multifd.c                      |   8 +-
>  migration/multifd.h                      |   1 +
>  qapi/migration.json                      |   7 +-
>  scripts/meson-buildoptions.sh            |   3 +
>  tests/qtest/migration-test.c             |  24 ++
>  14 files changed, 782 insertions(+), 6 deletions(-)
>  create mode 100644 docs/devel/migration/qpl-compression.rst
>  create mode 100644 migration/multifd-qpl.c
> 
> -- 
> 2.39.3
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 1/7] docs/migration: add qpl compression feature
  2024-03-26 17:58   ` Peter Xu
@ 2024-03-27  2:14     ` Liu, Yuan1
  0 siblings, 0 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-27  2:14 UTC (permalink / raw)
  To: Peter Xu; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai


> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, March 27, 2024 1:58 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 1/7] docs/migration: add qpl compression feature
> 
> On Wed, Mar 20, 2024 at 12:45:21AM +0800, Yuan Liu wrote:
> > add Intel Query Processing Library (QPL) compression method
> > introduction
> >
> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > ---
> >  docs/devel/migration/features.rst        |   1 +
> >  docs/devel/migration/qpl-compression.rst | 231 +++++++++++++++++++++++
> >  2 files changed, 232 insertions(+)
> >  create mode 100644 docs/devel/migration/qpl-compression.rst
> >
> > diff --git a/docs/devel/migration/features.rst
> b/docs/devel/migration/features.rst
> > index d5ca7b86d5..bc98b65075 100644
> > --- a/docs/devel/migration/features.rst
> > +++ b/docs/devel/migration/features.rst
> > @@ -12,3 +12,4 @@ Migration has plenty of features to support different
> use cases.
> >     virtio
> >     mapped-ram
> >     CPR
> > +   qpl-compression
> > diff --git a/docs/devel/migration/qpl-compression.rst
> b/docs/devel/migration/qpl-compression.rst
> > new file mode 100644
> > index 0000000000..42c7969d30
> > --- /dev/null
> > +++ b/docs/devel/migration/qpl-compression.rst
> > @@ -0,0 +1,231 @@
> > +===============
> > +QPL Compression
> > +===============
> > +The Intel Query Processing Library (Intel ``QPL``) is an open-source
> library to
> > +provide compression and decompression features and it is based on
> deflate
> > +compression algorithm (RFC 1951).
> > +
> > +The ``QPL`` compression relies on Intel In-Memory Analytics
> Accelerator(``IAA``)
> > +and Shared Virtual Memory(``SVM``) technology, they are new features
> supported
> > +from Intel 4th Gen Intel Xeon Scalable processors, codenamed Sapphire
> Rapids
> > +processor(``SPR``).
> > +
> > +For more ``QPL`` introduction, please refer to:
> > +
> >
> +https://intel.github.io/qpl/documentation/introduction_docs/introduction.
> html
> 
> There're a bunch of links in this page, please consider switching all of
> them to use the link formats of .rST:
> 
>   Please refer to `QPL introduction page <https://...>`_.

Sure, thanks for the suggestion

> > +
> > +QPL Compression Framework
> > +=========================
> > +
> > +::
> > +
> > +  +----------------+       +------------------+
> > +  | MultiFD Service|       |accel-config tool |
> > +  +-------+--------+       +--------+---------+
> > +          |                         |
> > +          |                         |
> > +  +-------+--------+                | Setup IAA
> > +  |  QPL library   |                | Resources
> > +  +-------+---+----+                |
> > +          |   |                     |
> > +          |   +-------------+-------+
> > +          |   Open IAA      |
> > +          |   Devices +-----+-----+
> > +          |           |idxd driver|
> > +          |           +-----+-----+
> > +          |                 |
> > +          |                 |
> > +          |           +-----+-----+
> > +          +-----------+IAA Devices|
> > +      Submit jobs     +-----------+
> > +      via enqcmd
> > +
> > +
> > +Intel In-Memory Analytics Accelerator (Intel IAA) Introduction
> > +================================================================
> > +
> > +Intel ``IAA`` is an accelerator that has been designed to help benefit
> > +in-memory databases and analytic workloads. There are three main areas
> > +that Intel ``IAA`` can assist with analytics primitives (scan, filter,
> etc.),
> > +sparse data compression and memory tiering.
> > +
> > +``IAA`` Manual Documentation:
> > +
> > +https://www.intel.com/content/www/us/en/content-details/721858/intel-
> in-memory-analytics-accelerator-architecture-specification
> > +
> > +IAA Device Enabling
> > +-------------------
> > +
> > +- Enabling ``IAA`` devices for platform configuration, please refer to:
> > +
> > +https://www.intel.com/content/www/us/en/content-details/780887/intel-
> in-memory-analytics-accelerator-intel-iaa.html
> > +
> > +- ``IAA`` device driver is ``Intel Data Accelerator Driver (idxd)``, it
> is
> > +  recommended that the minimum version of Linux kernel is 5.18.
> > +
> > +- Add ``"intel_iommu=on,sm_on"`` parameter to kernel command line
> > +  for ``SVM`` feature enabling.
> > +
> > +Here is an easy way to verify ``IAA`` device driver and ``SVM``, refer
> to:
> > +
> > +https://github.com/intel/idxd-config/tree/stable/test
> > +
> > +IAA Device Management
> > +---------------------
> > +
> > +The number of ``IAA`` devices will vary depending on the Xeon product
> model.
> > +On a ``SPR`` server, there can be a maximum of 8 ``IAA`` devices, with
> up to
> > +4 devices per socket.
> > +
> > +By default, all ``IAA`` devices are disabled and need to be configured
> and
> > +enabled by users manually.
> > +
> > +Check the number of devices through the following command
> > +
> > +.. code-block:: shell
> > +
> > +  # lspci -d 8086:0cfe
> > +  # 6a:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  # 6f:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  # 74:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  # 79:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  # e7:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  # ec:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  # f1:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  # f6:02.0 System peripheral: Intel Corporation Device 0cfe
> > +
> > +IAA Device Configuration
> > +------------------------
> > +
> > +The ``accel-config`` tool is used to enable ``IAA`` devices and
> configure
> > +``IAA`` hardware resources(work queues and engines). One ``IAA`` device
> > +has 8 work queues and 8 processing engines, multiple engines can be
> assigned
> > +to a work queue via ``group`` attribute.
> > +
> > +One example of configuring and enabling an ``IAA`` device.
> > +
> > +.. code-block:: shell
> > +
> > +  # accel-config config-engine iax1/engine1.0 -g 0
> > +  # accel-config config-engine iax1/engine1.1 -g 0
> > +  # accel-config config-engine iax1/engine1.2 -g 0
> > +  # accel-config config-engine iax1/engine1.3 -g 0
> > +  # accel-config config-engine iax1/engine1.4 -g 0
> > +  # accel-config config-engine iax1/engine1.5 -g 0
> > +  # accel-config config-engine iax1/engine1.6 -g 0
> > +  # accel-config config-engine iax1/engine1.7 -g 0
> > +  # accel-config config-wq iax1/wq1.0 -g 0 -s 128 -p 10 -b 1 -t 128 -m
> shared -y user -n app1 -d user
> > +  # accel-config enable-device iax1
> > +  # accel-config enable-wq iax1/wq1.0
> > +
> > +.. note::
> > +   IAX is an early name for IAA
> > +
> > +- The ``IAA`` device index is 1, use ``ls -lh
> /sys/bus/dsa/devices/iax*``
> > +  command to query the ``IAA`` device index.
> > +
> > +- 8 engines and 1 work queue are configured in group 0, so all
> compression jobs
> > +  submitted to this work queue can be processed by all engines at the
> same time.
> > +
> > +- Set work queue attributes including the work mode, work queue size
> and so on.
> > +
> > +- Enable the ``IAA1`` device and work queue 1.0
> > +
> > +.. note::
> > +  Set work queue mode to shared mode, since ``QPL`` library only
> supports
> > +  shared mode
> > +
> > +For more detailed configuration, please refer to:
> > +
> > +https://github.com/intel/idxd-config/tree/stable/Documentation/accfg
> > +
> > +IAA Resources Allocation For Migration
> > +--------------------------------------
> > +
> > +There is no ``IAA`` resource configuration parameters for migration and
> > +``accel-config`` tool configuration cannot directly specify the ``IAA``
> > +resources used for migration.
> > +
> > +``QPL`` will use all work queues that are enabled and set to shared
> mode,
> > +and use all engines assigned to the work queues with shared mode.
> > +
> > +By default, ``QPL`` will only use the local ``IAA`` device for
> compression
> > +job processing. The local ``IAA`` device means that the CPU of the job
> > +submission and the ``IAA`` device are on the same socket, so one CPU
> > +can submit the jobs to up to 4 ``IAA`` devices.
> > +
> > +Shared Virtual Memory(SVM) Introduction
> > +=======================================
> > +
> > +An ability for an accelerator I/O device to operate in the same virtual
> > +memory space of applications on host processors. It also implies the
> > +ability to operate from pageable memory, avoiding functional
> requirements
> > +to pin memory for DMA operations.
> > +
> > +When using ``SVM`` technology, users do not need to reserve memory for
> the
> > +``IAA`` device and perform pin memory operation. The ``IAA`` device can
> > +directly access data using the virtual address of the process.
> > +
> > +For more ``SVM`` technology, please refer to:
> > +
> > +https://docs.kernel.org/next/x86/sva.html
> > +
> > +
> > +How To Use QPL Compression In Migration
> > +=======================================
> > +
> > +1 - Installation of ``accel-config`` tool and ``QPL`` library
> 
> We can drop "1 " and stick with:
> 
>   - item1
>     - item1.1
>     ...
>   - item2
> 

Yes, it is better.

> > +
> > +  - Install ``accel-config`` tool from https://github.com/intel/idxd-
> config
> > +  - Install ``QPL`` library from https://github.com/intel/qpl
> > +
> > +2 - Configure and enable ``IAA`` devices and work queues via ``accel-
> config``
> > +
> > +3 - Build ``Qemu`` with ``--enable-qpl`` parameter
> > +
> > +  E.g. configure --target-list=x86_64-softmmu --enable-kvm ``--enable-
> qpl``
> > +
> > +4 - Start VMs with ``sudo`` command or ``root`` permission
> > +
> > +  Use the ``sudo`` command or ``root`` privilege to start the source
> and
> > +  destination virtual machines, since migration service needs
> permission
> > +  to access ``IAA`` hardware resources.
> > +
> > +5 - Enable ``QPL`` compression during migration
> > +
> > +  Set ``migrate_set_parameter multifd-compression qpl`` when migrating,
> the
> > +  ``QPL`` compression does not support configuring the compression
> level, it
> > +  only supports one compression level.
> > +
> > +The Difference Between QPL And ZLIB
> > +===================================
> > +
> > +Although both ``QPL`` and ``ZLIB`` are based on the deflate compression
> > +algorithm, and ``QPL`` can support the header and tail of ``ZLIB``,
> ``QPL``
> > +is still not fully compatible with the ``ZLIB`` compression in the
> migration.
> > +
> > +``QPL`` only supports 4K history buffer, and ``ZLIB`` is 32K by
> default. The
> > +``ZLIB`` compressed data that ``QPL`` may not decompress correctly and
> > +vice versa.
> > +
> > +``QPL`` does not support the ``Z_SYNC_FLUSH`` operation in ``ZLIB``
> streaming
> > +compression, current ``ZLIB`` implementation uses ``Z_SYNC_FLUSH``, so
> each
> > +``multifd`` thread has a ``ZLIB`` streaming context, and all page
> compression
> > +and decompression are based on this stream. ``QPL`` cannot decompress
> such data
> > +and vice versa.
> > +
> > +The introduction for ``Z_SYNC_FLUSH``, please refer to:
> > +
> > +https://www.zlib.net/manual.html
> > +
> > +The Best Practices
> > +==================
> > +
> > +When the virtual machine's pages are not populated and the ``IAA``
> device is
> > +used, I/O page faults occur, which can impact performance due to a
> large number
> > +of flush ``IOTLB`` operations.
> 
> AFAIU the IOTLB issue is not expected and can be fixed, while per our
> discussion in the other thread, the DMA fault latency cannot.
> 
> I think we can mention the possibility of IOTLB flush issue but that
> shouldn't be the major cause of such suggestion.  Again, I think it'll be
> great to mention how slow a DMA page fault can be, it can be compared in
> migration performance difference, or just describe an average DMA page
> fault latency, v.s. a processor fault.  That might explain why we suggest
> prefault the pages (comparing to a generic setup where the pages are
> always
> faulted by processors).

Get it, I will explain this issue in depth and add some comparative data, 
including I/O page fault Vs. process fault performance comparison and the
performance impact of I/O page fault on the entire live migration. Help 
developers and users better understand why -mem- prealloc needs to be 
added currently.

> > +
> > +Since the normal pages on the source side are all populated, ``IOTLB``
> caused
> > +by I/O page fault will not occur. On the destination side, a large
> number
> > +of normal pages need to be loaded, so it is recommended to add ``-mem-
> prealloc``
> > +parameter on the destination side.
> > --
> > 2.39.3
> >
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 0/7] Live Migration With IAA
  2024-03-26 20:30 ` [PATCH v5 0/7] Live Migration With IAA Peter Xu
@ 2024-03-27  3:20   ` Liu, Yuan1
  2024-03-27 19:46     ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-27  3:20 UTC (permalink / raw)
  To: Peter Xu; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, March 27, 2024 4:30 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> 
> Hi, Yuan,
> 
> On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> > 1. QPL will be used as an independent compression method like ZLIB and
> ZSTD,
> >    QPL will force the use of the IAA accelerator and will not support
> software
> >    compression. For a summary of issues compatible with Zlib, please
> refer to
> >    docs/devel/migration/qpl-compression.rst
> 
> IIRC our previous discussion is we should provide a software fallback for
> the new QEMU paths, right?  Why the decision changed?  Again, such
> fallback
> can help us to make sure qpl won't get broken easily by other changes.

Hi Peter

Previous your suggestion below

https://patchew.org/QEMU/PH7PR11MB5941019462E0ADDE231C7295A37C2@PH7PR11MB5941.namprd11.prod.outlook.com/
Compression methods: none, zlib, zstd, qpl (describes all the algorithms
that might be used; again, qpl enforces HW support).
Compression accelerators: auto, none, qat (only applies when zlib/zstd
chosen above)

Maybe I misunderstood here, what you mean is that if the IAA hardware is unavailable, 
it will fall back to the software path. This does not need to be specified through live
migration parameters, and it will automatically determine whether to use the software or
hardware path during QPL initialization, is that right?

> > 2. Compression accelerator related patches are removed from this patch
> set and
> >    will be added to the QAT patch set, we will submit separate patches
> to use
> >    QAT to accelerate ZLIB and ZSTD.
> >
> > 3. Advantages of using IAA accelerator include:
> >    a. Compared with the non-compression method, it can improve downtime
> >       performance without adding additional host resources (both CPU and
> >       network).
> >    b. Compared with using software compression methods (ZSTD/ZLIB), it
> can
> >       provide high data compression ratio and save a lot of CPU
> resources
> >       used for compression.
> >
> > Test condition:
> >   1. Host CPUs are based on Sapphire Rapids
> >   2. VM type, 16 vCPU and 64G memory
> >   3. The source and destination respectively use 4 IAA devices.
> >   4. The workload in the VM
> >     a. all vCPUs are idle state
> >     b. 90% of the virtual machine's memory is used, use silesia to fill
> >        the memory.
> >        The introduction of silesia:
> >        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> >   5. Set "--mem-prealloc" boot parameter on the destination, this
> parameter
> >      can make IAA performance better and related introduction is added
> here.
> >      docs/devel/migration/qpl-compression.rst
> >   6. Source migration configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter downtime-limit 300
> >      f. migrate_set_parameter max-bandwidth 100G/1G
> >      d. migrate_set_parameter multifd-compression none/qpl/zstd
> >   7. Destination migration configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter multifd-compression none/qpl/zstd
> >
> > Early migration result, each result is the average of three tests
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|    8571|      69|    58391|   1896525|  256%|
> 
> Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps
> only
> with 2 channels.  My previous experience is around ~1XGbps per channel, so
> no more than 30Gbps for two channels.  Is it because of a faster
> processor?
> Indeed from the 4/8 results it doesn't look like increasing the num of
> channels helped a lot, and even it got worse on the downtime.

Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
[  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes

And in the live migration test, a multifd thread's CPU utilization is almost 100%

> What is the rational behind "downtime improvement" when with the QPL
> compressors?  IIUC in this 100Gbps case the bandwidth is never a
> limitation, then I don't understand why adding the compression phase can
> make the switchover faster.  I can expect much more pages sent in a
> NIC-limted env like you described below with 1Gbps, but not when NIC has
> unlimited resources like here.

The compression can improve the network stack overhead(not improve the RDMA 
solution), the less data, the smaller the overhead in the 
network protocol stack. If compression has no overhead, and network bandwidth
is not limited, the last memory copy is faster with compression

The migration hotspot focuses on the _sys_sendmsg
_sys_sendmsg
  |- tcp_sendmsg
    |- copy_user_enhanced_fast_string
    |- tcp_push_one


> >  |BW:100G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|    7180|      92|    69736|   1865640|  300%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|    7090|     121|    70562|   2174060|  307%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|    8413|      34|    30067|   1732411|  230%|
> >  |BW:100G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|    6559|      32|    38804|   1689954|  450%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|    6623|      37|    38745|   1566507|  790%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|   95846|      24|     1800|    521829|  203%|
> >  |BW:100G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|   49004|      24|     3529|    890532|  403%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|   25574|      32|     6782|   1762222|  800%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> > When network bandwidth resource is sufficient, QPL can improve downtime
> > by 2x compared to no compression. In this scenario, with 4 channels, the
> > IAA hardware resources are fully used, so adding more channels will not
> > gain more benefits.
> >
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|   57758|      66|     8643|    264617|   34%|
> >  |BW:  1G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|   57216|      58|     8726|    266773|   34%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|   56708|      53|     8804|    270223|   33%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|   30129|      34|     8345|   2224761|   54%|
> >  |BW:  1G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|   30317|      39|     8300|   2025220|   73%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|   29615|      35|     8514|   2250122|  131%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|   95750|      24|     1802|    477236|  202%|
> >  |BW:  1G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|   48907|      24|     3536|   1002142|  404%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|   25568|      32|     6783|   1696437|  800%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> > When network bandwidth resource is limited, the "page perf second"
> metric
> > decreases for none compression, the success rate of migration will
> reduce.
> > Comparison of QPL and ZSTD compression methods, QPL can save a lot of
> CPU
> > resources used for compression.
> >
> > v2:
> >   - add support for multifd compression accelerator
> >   - add support for the QPL accelerator in the multifd
> >     compression accelerator
> >   - fixed the issue that QPL was compiled into the migration
> >     module by default
> >
> > v3:
> >   - use Meson instead of pkg-config to resolve QPL build
> >     dependency issue
> >   - fix coding style
> >   - fix a CI issue for get_multifd_ops function in multifd.c file
> >
> > v4:
> >   - patch based on commit: da96ad4a6a Merge tag 'hw-misc-20240215' of
> >     https://github.com/philmd/qemu into staging
> >   - remove the compression accelerator implementation patches, the
> patches
> >     will be placed in the QAT accelerator implementation.
> >   - introduce QPL as a new compression method
> >   - add QPL compression documentation
> >   - add QPL compression migration test
> >   - fix zlib/zstd compression level issue
> >
> > v5:
> >   - patch based on v9.0.0-rc0 (c62d54d0a8)
> >   - use pkgconfig to check libaccel-config, libaccel-config is already
> >     in many distributions.
> >   - initialize the IOV of the sender by the specific compression method
> >   - refine the coding style
> >   - remove the zlib/zstd compression level not working patch, the issue
> >     has been solved
> >
> > Yuan Liu (7):
> >   docs/migration: add qpl compression feature
> >   migration/multifd: put IOV initialization into compression method
> >   configure: add --enable-qpl build option
> >   migration/multifd: add qpl compression method
> >   migration/multifd: implement initialization of qpl compression
> >   migration/multifd: implement qpl compression and decompression
> >   tests/migration-test: add qpl compression test
> >
> >  docs/devel/migration/features.rst        |   1 +
> >  docs/devel/migration/qpl-compression.rst | 231 +++++++++++
> >  hw/core/qdev-properties-system.c         |   2 +-
> >  meson.build                              |  16 +
> >  meson_options.txt                        |   2 +
> >  migration/meson.build                    |   1 +
> >  migration/multifd-qpl.c                  | 482 +++++++++++++++++++++++
> >  migration/multifd-zlib.c                 |   4 +
> >  migration/multifd-zstd.c                 |   6 +-
> >  migration/multifd.c                      |   8 +-
> >  migration/multifd.h                      |   1 +
> >  qapi/migration.json                      |   7 +-
> >  scripts/meson-buildoptions.sh            |   3 +
> >  tests/qtest/migration-test.c             |  24 ++
> >  14 files changed, 782 insertions(+), 6 deletions(-)
> >  create mode 100644 docs/devel/migration/qpl-compression.rst
> >  create mode 100644 migration/multifd-qpl.c
> >
> > --
> > 2.39.3
> >
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-22 16:40                     ` Peter Xu
@ 2024-03-27 19:25                       ` Peter Xu
  2024-03-28  2:32                         ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-27 19:25 UTC (permalink / raw)
  To: Liu, Yuan1
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Fri, Mar 22, 2024 at 12:40:32PM -0400, Peter Xu wrote:
> > > void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> > > {
> > >     for (int i = 0; i < p->zero_num; i++) {
> > >         void *page = p->host + p->zero[i];
> > >         if (!buffer_is_zero(page, p->page_size)) {
> > >             memset(page, 0, p->page_size);
> > >         }
> > >     }
> > > }
> 
> It may not matter much (where I also see your below comments), but just to
> mention another solution to avoid this read is that we can maintain
> RAMBlock->receivedmap for precopy (especially, multifd, afaiu multifd
> doesn't yet update this bitmap.. even if normal precopy does), then here
> instead of scanning every time, maybe we can do:
> 
>   /*
>    * If it's the 1st time receiving it, no need to clear it as it must be
>    * all zeros now.
>    */
>   if (bitmap_test(rb->receivedmap, page_offset)) {
>       memset(page, 0, ...);
>   } else {
>       bitmap_set(rb->receivedmap, page_offset);
>   }
> 
> And we also always set the bit when !zero too.
>     
> My rational is that it's unlikely a zero page if it's sent once or more,
> while OTOH for the 1st time we receive it, it must be a zero page, so no
> need to scan for the 1st round.

Thinking about this, I'm wondering whether we should have this regardless.
IIUC now multifd will always require two page faults on destination for
anonymous guest memories (I suppose shmem/hugetlb is fine as no zero page
in those worlds).  Even though it should be faster than DMA faults, it
still is unwanted.

I'll take a note myself as todo to do some measurements in the future
first.  However if anyone thinks that makes sense and want to have a look,
please say so.  It'll be more than welcomed.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 0/7] Live Migration With IAA
  2024-03-27  3:20   ` Liu, Yuan1
@ 2024-03-27 19:46     ` Peter Xu
  2024-03-28  3:02       ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-27 19:46 UTC (permalink / raw)
  To: Liu, Yuan1; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Wed, Mar 27, 2024 at 03:20:19AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Wednesday, March 27, 2024 4:30 AM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> > 
> > Hi, Yuan,
> > 
> > On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> > > 1. QPL will be used as an independent compression method like ZLIB and
> > ZSTD,
> > >    QPL will force the use of the IAA accelerator and will not support
> > software
> > >    compression. For a summary of issues compatible with Zlib, please
> > refer to
> > >    docs/devel/migration/qpl-compression.rst
> > 
> > IIRC our previous discussion is we should provide a software fallback for
> > the new QEMU paths, right?  Why the decision changed?  Again, such
> > fallback
> > can help us to make sure qpl won't get broken easily by other changes.
> 
> Hi Peter
> 
> Previous your suggestion below
> 
> https://patchew.org/QEMU/PH7PR11MB5941019462E0ADDE231C7295A37C2@PH7PR11MB5941.namprd11.prod.outlook.com/
> Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> that might be used; again, qpl enforces HW support).
> Compression accelerators: auto, none, qat (only applies when zlib/zstd
> chosen above)
> 
> Maybe I misunderstood here, what you mean is that if the IAA hardware is unavailable, 
> it will fall back to the software path. This does not need to be specified through live
> migration parameters, and it will automatically determine whether to use the software or
> hardware path during QPL initialization, is that right?

I think there are two questions.

Firstly, we definitely want the qpl compressor to be able to run without
any hardware support.  As I mentioned above, I think that's the only way
that qpl code can always get covered by the CI as CI hosts should normally
don't have those modern hardwares.

I think it also means in the last test patch, instead of detecting /dev/iax
we should unconditionally run the qpl test as long as compiled in, because
it should just fallback to the software path then when HW not valid?

The second question is whether we'll want a new "compression accelerator",
fundamentally the only use case of that is to enforce software fallback
even if hardware existed.  I don't remember whether others have any opinion
before, but to me I think it's good to have, however no strong opinion.
It's less important comparing to the other question on CI coverage.

> 
> > > 2. Compression accelerator related patches are removed from this patch
> > set and
> > >    will be added to the QAT patch set, we will submit separate patches
> > to use
> > >    QAT to accelerate ZLIB and ZSTD.
> > >
> > > 3. Advantages of using IAA accelerator include:
> > >    a. Compared with the non-compression method, it can improve downtime
> > >       performance without adding additional host resources (both CPU and
> > >       network).
> > >    b. Compared with using software compression methods (ZSTD/ZLIB), it
> > can
> > >       provide high data compression ratio and save a lot of CPU
> > resources
> > >       used for compression.
> > >
> > > Test condition:
> > >   1. Host CPUs are based on Sapphire Rapids
> > >   2. VM type, 16 vCPU and 64G memory
> > >   3. The source and destination respectively use 4 IAA devices.
> > >   4. The workload in the VM
> > >     a. all vCPUs are idle state
> > >     b. 90% of the virtual machine's memory is used, use silesia to fill
> > >        the memory.
> > >        The introduction of silesia:
> > >        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > >   5. Set "--mem-prealloc" boot parameter on the destination, this
> > parameter
> > >      can make IAA performance better and related introduction is added
> > here.
> > >      docs/devel/migration/qpl-compression.rst
> > >   6. Source migration configuration commands
> > >      a. migrate_set_capability multifd on
> > >      b. migrate_set_parameter multifd-channels 2/4/8
> > >      c. migrate_set_parameter downtime-limit 300
> > >      f. migrate_set_parameter max-bandwidth 100G/1G
> > >      d. migrate_set_parameter multifd-compression none/qpl/zstd
> > >   7. Destination migration configuration commands
> > >      a. migrate_set_capability multifd on
> > >      b. migrate_set_parameter multifd-channels 2/4/8
> > >      c. migrate_set_parameter multifd-compression none/qpl/zstd
> > >
> > > Early migration result, each result is the average of three tests
> > >
> > >  +--------+-------------+--------+--------+---------+----------+------|
> > >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> > >  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> > >  | Comp   |             |        |        |(mbps)   |          |      |
> > >  |        +-------------+-----------------+---------+----------+------+
> > >  |Network |            2|    8571|      69|    58391|   1896525|  256%|
> > 
> > Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps
> > only
> > with 2 channels.  My previous experience is around ~1XGbps per channel, so
> > no more than 30Gbps for two channels.  Is it because of a faster
> > processor?
> > Indeed from the 4/8 results it doesn't look like increasing the num of
> > channels helped a lot, and even it got worse on the downtime.
> 
> Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> 
> And in the live migration test, a multifd thread's CPU utilization is almost 100%

This 60Gpbs per-channel is definitely impressive..

Have you tried migration without multifd on your system? Would that also
perform similarly v.s. 2 channels multifd?

The whole point of multifd is to scale on bandwidth.  If single thread can
already achieve 60Gbps (where in my previous memory of tests, multifd can
only reach ~70Gbps before..), then either multifd will be less useful with
the new hardwares (especially when with a most generic socket nocomp
setup), or we need to start working on bottlenecks of multifd to make it
scale better.  Otherwise multifd will become a pool for compressor loads
only.

> 
> > What is the rational behind "downtime improvement" when with the QPL
> > compressors?  IIUC in this 100Gbps case the bandwidth is never a
> > limitation, then I don't understand why adding the compression phase can
> > make the switchover faster.  I can expect much more pages sent in a
> > NIC-limted env like you described below with 1Gbps, but not when NIC has
> > unlimited resources like here.
> 
> The compression can improve the network stack overhead(not improve the RDMA 
> solution), the less data, the smaller the overhead in the 
> network protocol stack. If compression has no overhead, and network bandwidth
> is not limited, the last memory copy is faster with compression
> 
> The migration hotspot focuses on the _sys_sendmsg
> _sys_sendmsg
>   |- tcp_sendmsg
>     |- copy_user_enhanced_fast_string
>     |- tcp_push_one

Makes sense.  I assume that's logical indeed when the compression ratio is
high enough, meanwhile if the compression work is fast enough to be much
lower than sending extra data when without it.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 4/7] migration/multifd: add qpl compression method
  2024-03-19 16:45 ` [PATCH v5 4/7] migration/multifd: add qpl compression method Yuan Liu
@ 2024-03-27 19:49   ` Peter Xu
  2024-03-28  3:03     ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-27 19:49 UTC (permalink / raw)
  To: Yuan Liu; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, nanhai.zou

On Wed, Mar 20, 2024 at 12:45:24AM +0800, Yuan Liu wrote:
> add the Query Processing Library (QPL) compression method
> 
> Although both qpl and zlib support deflate compression, qpl will
> only use the In-Memory Analytics Accelerator(IAA) for compression
> and decompression, and IAA is not compatible with the Zlib in
> migration, so qpl is used as a new compression method for migration.
> 
> How to enable qpl compression during migration:
> migrate_set_parameter multifd-compression qpl
> 
> The qpl only supports one compression level, there is no qpl
> compression level parameter added, users do not need to specify
> the qpl compression level.
> 
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> ---
>  hw/core/qdev-properties-system.c |  2 +-
>  migration/meson.build            |  1 +
>  migration/multifd-qpl.c          | 20 ++++++++++++++++++++
>  migration/multifd.h              |  1 +
>  qapi/migration.json              |  7 ++++++-
>  5 files changed, 29 insertions(+), 2 deletions(-)
>  create mode 100644 migration/multifd-qpl.c
> 
> diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
> index d79d6f4b53..6ccd7224f6 100644
> --- a/hw/core/qdev-properties-system.c
> +++ b/hw/core/qdev-properties-system.c
> @@ -659,7 +659,7 @@ const PropertyInfo qdev_prop_fdc_drive_type = {
>  const PropertyInfo qdev_prop_multifd_compression = {
>      .name = "MultiFDCompression",
>      .description = "multifd_compression values, "
> -                   "none/zlib/zstd",
> +                   "none/zlib/zstd/qpl",
>      .enum_table = &MultiFDCompression_lookup,
>      .get = qdev_propinfo_get_enum,
>      .set = qdev_propinfo_set_enum,
> diff --git a/migration/meson.build b/migration/meson.build
> index 1eeb915ff6..cb177de1d2 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -41,6 +41,7 @@ if get_option('live_block_migration').allowed()
>    system_ss.add(files('block.c'))
>  endif
>  system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
> +system_ss.add(when: qpl, if_true: files('multifd-qpl.c'))
>  
>  specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
>                  if_true: files('ram.c',
> diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> new file mode 100644
> index 0000000000..056a68a060
> --- /dev/null
> +++ b/migration/multifd-qpl.c
> @@ -0,0 +1,20 @@
> +/*
> + * Multifd qpl compression accelerator implementation
> + *
> + * Copyright (c) 2023 Intel Corporation
> + *
> + * Authors:
> + *  Yuan Liu<yuan1.liu@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include "qemu/module.h"
> +
> +static void multifd_qpl_register(void)
> +{
> +    /* noop */
> +}
> +
> +migration_init(multifd_qpl_register);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index c9d9b09239..5b7d9b15f8 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -40,6 +40,7 @@ MultiFDRecvData *multifd_get_recv_data(void);
>  #define MULTIFD_FLAG_NOCOMP (0 << 1)
>  #define MULTIFD_FLAG_ZLIB (1 << 1)
>  #define MULTIFD_FLAG_ZSTD (2 << 1)
> +#define MULTIFD_FLAG_QPL (4 << 1)
>  
>  /* This value needs to be a multiple of qemu_target_page_size() */
>  #define MULTIFD_PACKET_SIZE (512 * 1024)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index aa1b39bce1..dceb35db5b 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -629,11 +629,16 @@
>  #
>  # @zstd: use zstd compression method.
>  #
> +# @qpl: use qpl compression method. Query Processing Library(qpl) is based on
> +#       the deflate compression algorithm and use the Intel In-Memory Analytics
> +#       Accelerator(IAA) accelerated compression and decompression. (Since 9.0)

s/9.0/9.1/

> +#
>  # Since: 5.0
>  ##
>  { 'enum': 'MultiFDCompression',
>    'data': [ 'none', 'zlib',
> -            { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
> +            { 'name': 'zstd', 'if': 'CONFIG_ZSTD' },
> +            { 'name': 'qpl', 'if': 'CONFIG_QPL' } ] }
>  
>  ##
>  # @MigMode:
> -- 
> 2.39.3
> 

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-27 19:25                       ` Peter Xu
@ 2024-03-28  2:32                         ` Liu, Yuan1
  2024-03-28 15:16                           ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-28  2:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 3:26 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
> 
> On Fri, Mar 22, 2024 at 12:40:32PM -0400, Peter Xu wrote:
> > > > void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> > > > {
> > > >     for (int i = 0; i < p->zero_num; i++) {
> > > >         void *page = p->host + p->zero[i];
> > > >         if (!buffer_is_zero(page, p->page_size)) {
> > > >             memset(page, 0, p->page_size);
> > > >         }
> > > >     }
> > > > }
> >
> > It may not matter much (where I also see your below comments), but just
> to
> > mention another solution to avoid this read is that we can maintain
> > RAMBlock->receivedmap for precopy (especially, multifd, afaiu multifd
> > doesn't yet update this bitmap.. even if normal precopy does), then here
> > instead of scanning every time, maybe we can do:
> >
> >   /*
> >    * If it's the 1st time receiving it, no need to clear it as it must
> be
> >    * all zeros now.
> >    */
> >   if (bitmap_test(rb->receivedmap, page_offset)) {
> >       memset(page, 0, ...);
> >   } else {
> >       bitmap_set(rb->receivedmap, page_offset);
> >   }
> >
> > And we also always set the bit when !zero too.
> >
> > My rational is that it's unlikely a zero page if it's sent once or more,
> > while OTOH for the 1st time we receive it, it must be a zero page, so no
> > need to scan for the 1st round.
> 
> Thinking about this, I'm wondering whether we should have this regardless.
> IIUC now multifd will always require two page faults on destination for
> anonymous guest memories (I suppose shmem/hugetlb is fine as no zero page
> in those worlds).  Even though it should be faster than DMA faults, it
> still is unwanted.
> 
> I'll take a note myself as todo to do some measurements in the future
> first.  However if anyone thinks that makes sense and want to have a look,
> please say so.  It'll be more than welcomed.

Yes, I think this is a better improvement to avoid two page faults. I can test
the performance impact of this change on SVM-capable devices and give some data
later. As we saw before, the IOTLB flush occurs via COW, with the change, the 
impact of the COW should be gone.

If you need more testing and analysis on this, please let me know


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 0/7] Live Migration With IAA
  2024-03-27 19:46     ` Peter Xu
@ 2024-03-28  3:02       ` Liu, Yuan1
  2024-03-28 15:22         ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-28  3:02 UTC (permalink / raw)
  To: Peter Xu; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 3:46 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> 
> On Wed, Mar 27, 2024 at 03:20:19AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, March 27, 2024 4:30 AM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> > >
> > > Hi, Yuan,
> > >
> > > On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> > > > 1. QPL will be used as an independent compression method like ZLIB
> and
> > > ZSTD,
> > > >    QPL will force the use of the IAA accelerator and will not
> support
> > > software
> > > >    compression. For a summary of issues compatible with Zlib, please
> > > refer to
> > > >    docs/devel/migration/qpl-compression.rst
> > >
> > > IIRC our previous discussion is we should provide a software fallback
> for
> > > the new QEMU paths, right?  Why the decision changed?  Again, such
> > > fallback
> > > can help us to make sure qpl won't get broken easily by other changes.
> >
> > Hi Peter
> >
> > Previous your suggestion below
> >
> >
> https://patchew.org/QEMU/PH7PR11MB5941019462E0ADDE231C7295A37C2@PH7PR11MB5
> 941.namprd11.prod.outlook.com/
> > Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> > that might be used; again, qpl enforces HW support).
> > Compression accelerators: auto, none, qat (only applies when zlib/zstd
> > chosen above)
> >
> > Maybe I misunderstood here, what you mean is that if the IAA hardware is
> unavailable,
> > it will fall back to the software path. This does not need to be
> specified through live
> > migration parameters, and it will automatically determine whether to use
> the software or
> > hardware path during QPL initialization, is that right?
> 
> I think there are two questions.
> 
> Firstly, we definitely want the qpl compressor to be able to run without
> any hardware support.  As I mentioned above, I think that's the only way
> that qpl code can always get covered by the CI as CI hosts should normally
> don't have those modern hardwares.
> 
> I think it also means in the last test patch, instead of detecting
> /dev/iax
> we should unconditionally run the qpl test as long as compiled in, because
> it should just fallback to the software path then when HW not valid?
> 
> The second question is whether we'll want a new "compression accelerator",
> fundamentally the only use case of that is to enforce software fallback
> even if hardware existed.  I don't remember whether others have any
> opinion
> before, but to me I think it's good to have, however no strong opinion.
> It's less important comparing to the other question on CI coverage.

Yes, I will support software fallback to ensure CI testing and users can 
still use qpl compression without IAA hardware.

Although the qpl software solution will have better performance than zlib, 
I still don't think it has a greater advantage than zstd. I don't think there
is a need to add a migration option to configure the qpl software or hardware path.
So I will still only use QPL as an independent compression in the next version, and
no other migration options are needed.

I will also add a guide to qpl-compression.rst about IAA permission issues and how to
determine whether the hardware path is available.

> > > > 2. Compression accelerator related patches are removed from this
> patch
> > > set and
> > > >    will be added to the QAT patch set, we will submit separate
> patches
> > > to use
> > > >    QAT to accelerate ZLIB and ZSTD.
> > > >
> > > > 3. Advantages of using IAA accelerator include:
> > > >    a. Compared with the non-compression method, it can improve
> downtime
> > > >       performance without adding additional host resources (both CPU
> and
> > > >       network).
> > > >    b. Compared with using software compression methods (ZSTD/ZLIB),
> it
> > > can
> > > >       provide high data compression ratio and save a lot of CPU
> > > resources
> > > >       used for compression.
> > > >
> > > > Test condition:
> > > >   1. Host CPUs are based on Sapphire Rapids
> > > >   2. VM type, 16 vCPU and 64G memory
> > > >   3. The source and destination respectively use 4 IAA devices.
> > > >   4. The workload in the VM
> > > >     a. all vCPUs are idle state
> > > >     b. 90% of the virtual machine's memory is used, use silesia to
> fill
> > > >        the memory.
> > > >        The introduction of silesia:
> > > >        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > > >   5. Set "--mem-prealloc" boot parameter on the destination, this
> > > parameter
> > > >      can make IAA performance better and related introduction is
> added
> > > here.
> > > >      docs/devel/migration/qpl-compression.rst
> > > >   6. Source migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter downtime-limit 300
> > > >      f. migrate_set_parameter max-bandwidth 100G/1G
> > > >      d. migrate_set_parameter multifd-compression none/qpl/zstd
> > > >   7. Destination migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter multifd-compression none/qpl/zstd
> > > >
> > > > Early migration result, each result is the average of three tests
> > > >
> > > >  +--------+-------------+--------+--------+---------+----------+----
> --|
> > > >  |        | The number  |total   |downtime|network  |pages per | CPU
> |
> > > >  | None   | of channels |time(ms)|(ms)    |bandwidth|second    |
> Util |
> > > >  | Comp   |             |        |        |(mbps)   |          |
> |
> > > >  |        +-------------+-----------------+---------+----------+----
> --+
> > > >  |Network |            2|    8571|      69|    58391|   1896525|
> 256%|
> > >
> > > Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps
> > > only
> > > with 2 channels.  My previous experience is around ~1XGbps per
> channel, so
> > > no more than 30Gbps for two channels.  Is it because of a faster
> > > processor?
> > > Indeed from the 4/8 results it doesn't look like increasing the num of
> > > channels helped a lot, and even it got worse on the downtime.
> >
> > Yes, I use iperf3 to check the bandwidth for one core, the bandwith is
> 60Gbps.
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> >
> > And in the live migration test, a multifd thread's CPU utilization is
> almost 100%
> 
> This 60Gpbs per-channel is definitely impressive..
> 
> Have you tried migration without multifd on your system? Would that also
> perform similarly v.s. 2 channels multifd?

Simple Test result below:
VM Type: 16vCPU, 64G memory
Workload in VM: fill 56G memory with Silesia data and vCPUs are idle
Migration Configurations:
1. migrate_set_parameter max-bandwidth 100G
2. migrate_set_parameter downtime-limit 300
3. migrate_set_capability multifd on (multiFD test case)
4. migrate_set_parameter multifd-channels 2 (multiFD test case)

                  Totaltime (ms) Downtime (ms) Throughput (mbps) Pages-per-second
without Multifd	23580	            307	         21221	       689588
Multifd 2	       7657	            198	         65410	      2221176

> 
> The whole point of multifd is to scale on bandwidth.  If single thread can
> already achieve 60Gbps (where in my previous memory of tests, multifd can
> only reach ~70Gbps before..), then either multifd will be less useful with
> the new hardwares (especially when with a most generic socket nocomp
> setup), or we need to start working on bottlenecks of multifd to make it
> scale better.  Otherwise multifd will become a pool for compressor loads
> only.
> 
> >
> > > What is the rational behind "downtime improvement" when with the QPL
> > > compressors?  IIUC in this 100Gbps case the bandwidth is never a
> > > limitation, then I don't understand why adding the compression phase
> can
> > > make the switchover faster.  I can expect much more pages sent in a
> > > NIC-limted env like you described below with 1Gbps, but not when NIC
> has
> > > unlimited resources like here.
> >
> > The compression can improve the network stack overhead(not improve the
> RDMA
> > solution), the less data, the smaller the overhead in the
> > network protocol stack. If compression has no overhead, and network
> bandwidth
> > is not limited, the last memory copy is faster with compression
> >
> > The migration hotspot focuses on the _sys_sendmsg
> > _sys_sendmsg
> >   |- tcp_sendmsg
> >     |- copy_user_enhanced_fast_string
> >     |- tcp_push_one
> 
> Makes sense.  I assume that's logical indeed when the compression ratio is
> high enough, meanwhile if the compression work is fast enough to be much
> lower than sending extra data when without it.
> 
> Thanks,
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 4/7] migration/multifd: add qpl compression method
  2024-03-27 19:49   ` Peter Xu
@ 2024-03-28  3:03     ` Liu, Yuan1
  0 siblings, 0 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-28  3:03 UTC (permalink / raw)
  To: Peter Xu; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 3:49 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 4/7] migration/multifd: add qpl compression method
> 
> On Wed, Mar 20, 2024 at 12:45:24AM +0800, Yuan Liu wrote:
> > add the Query Processing Library (QPL) compression method
> >
> > Although both qpl and zlib support deflate compression, qpl will
> > only use the In-Memory Analytics Accelerator(IAA) for compression
> > and decompression, and IAA is not compatible with the Zlib in
> > migration, so qpl is used as a new compression method for migration.
> >
> > How to enable qpl compression during migration:
> > migrate_set_parameter multifd-compression qpl
> >
> > The qpl only supports one compression level, there is no qpl
> > compression level parameter added, users do not need to specify
> > the qpl compression level.
> >
> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > ---
> >  hw/core/qdev-properties-system.c |  2 +-
> >  migration/meson.build            |  1 +
> >  migration/multifd-qpl.c          | 20 ++++++++++++++++++++
> >  migration/multifd.h              |  1 +
> >  qapi/migration.json              |  7 ++++++-
> >  5 files changed, 29 insertions(+), 2 deletions(-)
> >  create mode 100644 migration/multifd-qpl.c
> >
> > diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-
> system.c
> > index d79d6f4b53..6ccd7224f6 100644
> > --- a/hw/core/qdev-properties-system.c
> > +++ b/hw/core/qdev-properties-system.c
> > @@ -659,7 +659,7 @@ const PropertyInfo qdev_prop_fdc_drive_type = {
> >  const PropertyInfo qdev_prop_multifd_compression = {
> >      .name = "MultiFDCompression",
> >      .description = "multifd_compression values, "
> > -                   "none/zlib/zstd",
> > +                   "none/zlib/zstd/qpl",
> >      .enum_table = &MultiFDCompression_lookup,
> >      .get = qdev_propinfo_get_enum,
> >      .set = qdev_propinfo_set_enum,
> > diff --git a/migration/meson.build b/migration/meson.build
> > index 1eeb915ff6..cb177de1d2 100644
> > --- a/migration/meson.build
> > +++ b/migration/meson.build
> > @@ -41,6 +41,7 @@ if get_option('live_block_migration').allowed()
> >    system_ss.add(files('block.c'))
> >  endif
> >  system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
> > +system_ss.add(when: qpl, if_true: files('multifd-qpl.c'))
> >
> >  specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
> >                  if_true: files('ram.c',
> > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> > new file mode 100644
> > index 0000000000..056a68a060
> > --- /dev/null
> > +++ b/migration/multifd-qpl.c
> > @@ -0,0 +1,20 @@
> > +/*
> > + * Multifd qpl compression accelerator implementation
> > + *
> > + * Copyright (c) 2023 Intel Corporation
> > + *
> > + * Authors:
> > + *  Yuan Liu<yuan1.liu@intel.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or
> later.
> > + * See the COPYING file in the top-level directory.
> > + */
> > +#include "qemu/osdep.h"
> > +#include "qemu/module.h"
> > +
> > +static void multifd_qpl_register(void)
> > +{
> > +    /* noop */
> > +}
> > +
> > +migration_init(multifd_qpl_register);
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index c9d9b09239..5b7d9b15f8 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -40,6 +40,7 @@ MultiFDRecvData *multifd_get_recv_data(void);
> >  #define MULTIFD_FLAG_NOCOMP (0 << 1)
> >  #define MULTIFD_FLAG_ZLIB (1 << 1)
> >  #define MULTIFD_FLAG_ZSTD (2 << 1)
> > +#define MULTIFD_FLAG_QPL (4 << 1)
> >
> >  /* This value needs to be a multiple of qemu_target_page_size() */
> >  #define MULTIFD_PACKET_SIZE (512 * 1024)
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index aa1b39bce1..dceb35db5b 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -629,11 +629,16 @@
> >  #
> >  # @zstd: use zstd compression method.
> >  #
> > +# @qpl: use qpl compression method. Query Processing Library(qpl) is
> based on
> > +#       the deflate compression algorithm and use the Intel In-Memory
> Analytics
> > +#       Accelerator(IAA) accelerated compression and decompression.
> (Since 9.0)
> 
> s/9.0/9.1/

Ok, I will fix it in the next version.

> > +#
> >  # Since: 5.0
> >  ##
> >  { 'enum': 'MultiFDCompression',
> >    'data': [ 'none', 'zlib',
> > -            { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
> > +            { 'name': 'zstd', 'if': 'CONFIG_ZSTD' },
> > +            { 'name': 'qpl', 'if': 'CONFIG_QPL' } ] }
> >
> >  ##
> >  # @MigMode:
> > --
> > 2.39.3
> >
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-28  2:32                         ` Liu, Yuan1
@ 2024-03-28 15:16                           ` Peter Xu
  2024-03-29  2:04                             ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-28 15:16 UTC (permalink / raw)
  To: Liu, Yuan1
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Thu, Mar 28, 2024 at 02:32:37AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, March 28, 2024 3:26 AM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> > Nanhai <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> > qpl compression
> > 
> > On Fri, Mar 22, 2024 at 12:40:32PM -0400, Peter Xu wrote:
> > > > > void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> > > > > {
> > > > >     for (int i = 0; i < p->zero_num; i++) {
> > > > >         void *page = p->host + p->zero[i];
> > > > >         if (!buffer_is_zero(page, p->page_size)) {
> > > > >             memset(page, 0, p->page_size);
> > > > >         }
> > > > >     }
> > > > > }
> > >
> > > It may not matter much (where I also see your below comments), but just
> > to
> > > mention another solution to avoid this read is that we can maintain
> > > RAMBlock->receivedmap for precopy (especially, multifd, afaiu multifd
> > > doesn't yet update this bitmap.. even if normal precopy does), then here
> > > instead of scanning every time, maybe we can do:
> > >
> > >   /*
> > >    * If it's the 1st time receiving it, no need to clear it as it must
> > be
> > >    * all zeros now.
> > >    */
> > >   if (bitmap_test(rb->receivedmap, page_offset)) {
> > >       memset(page, 0, ...);
> > >   } else {
> > >       bitmap_set(rb->receivedmap, page_offset);
> > >   }
> > >
> > > And we also always set the bit when !zero too.
> > >
> > > My rational is that it's unlikely a zero page if it's sent once or more,
> > > while OTOH for the 1st time we receive it, it must be a zero page, so no
> > > need to scan for the 1st round.
> > 
> > Thinking about this, I'm wondering whether we should have this regardless.
> > IIUC now multifd will always require two page faults on destination for
> > anonymous guest memories (I suppose shmem/hugetlb is fine as no zero page
> > in those worlds).  Even though it should be faster than DMA faults, it
> > still is unwanted.
> > 
> > I'll take a note myself as todo to do some measurements in the future
> > first.  However if anyone thinks that makes sense and want to have a look,
> > please say so.  It'll be more than welcomed.
> 
> Yes, I think this is a better improvement to avoid two page faults. I can test
> the performance impact of this change on SVM-capable devices and give some data
> later. As we saw before, the IOTLB flush occurs via COW, with the change, the 
> impact of the COW should be gone.
> 
> If you need more testing and analysis on this, please let me know

Nothing more than that.  Just a heads up that Xiang used to mention a test
case where Richard used to suggest dropping the zero check:

https://lore.kernel.org/r/CAAYibXib+TWnJpV22E=adncdBmwXJRqgRjJXK7X71J=bDfaxDg@mail.gmail.com

AFAIU this should be resolved if we have the bitmap maintained, but we can
double check.  IIUC that's exactly the case for an idle guest, in that case
it should be even faster to skip the memcmp when bit clear.

If you're going to post the patches, feel free to post that as a standalone
small series first, then that can be considered merge even earlier.

Thanks a lot for doing this.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 0/7] Live Migration With IAA
  2024-03-28  3:02       ` Liu, Yuan1
@ 2024-03-28 15:22         ` Peter Xu
  2024-03-29  3:33           ` Liu, Yuan1
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-03-28 15:22 UTC (permalink / raw)
  To: Liu, Yuan1; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> Yes, I will support software fallback to ensure CI testing and users can 
> still use qpl compression without IAA hardware.
> 
> Although the qpl software solution will have better performance than zlib, 
> I still don't think it has a greater advantage than zstd. I don't think there
> is a need to add a migration option to configure the qpl software or hardware path.
> So I will still only use QPL as an independent compression in the next version, and
> no other migration options are needed.

That should be fine.

> 
> I will also add a guide to qpl-compression.rst about IAA permission issues and how to
> determine whether the hardware path is available.

OK.

[...]

> > > Yes, I use iperf3 to check the bandwidth for one core, the bandwith is
> > 60Gbps.
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> > >
> > > And in the live migration test, a multifd thread's CPU utilization is
> > almost 100%
> > 
> > This 60Gpbs per-channel is definitely impressive..
> > 
> > Have you tried migration without multifd on your system? Would that also
> > perform similarly v.s. 2 channels multifd?
> 
> Simple Test result below:
> VM Type: 16vCPU, 64G memory
> Workload in VM: fill 56G memory with Silesia data and vCPUs are idle
> Migration Configurations:
> 1. migrate_set_parameter max-bandwidth 100G
> 2. migrate_set_parameter downtime-limit 300
> 3. migrate_set_capability multifd on (multiFD test case)
> 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
> 
>                   Totaltime (ms) Downtime (ms) Throughput (mbps) Pages-per-second
> without Multifd	23580	            307	         21221	       689588
> Multifd 2	       7657	            198	         65410	      2221176

Thanks for the test results.

So I am guessing the migration overheads besides pushing the socket is high
enough to make it drop drastically, even if in this case zero detection
shouldn't play a major role considering most of guest mem is pre-filled.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
  2024-03-28 15:16                           ` Peter Xu
@ 2024-03-29  2:04                             ` Liu, Yuan1
  0 siblings, 0 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-29  2:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 11:16 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
> 
> On Thu, Mar 28, 2024 at 02:32:37AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, March 28, 2024 3:26 AM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com;
> Zou,
> > > Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement
> initialization of
> > > qpl compression
> > >
> > > On Fri, Mar 22, 2024 at 12:40:32PM -0400, Peter Xu wrote:
> > > > > > void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> > > > > > {
> > > > > >     for (int i = 0; i < p->zero_num; i++) {
> > > > > >         void *page = p->host + p->zero[i];
> > > > > >         if (!buffer_is_zero(page, p->page_size)) {
> > > > > >             memset(page, 0, p->page_size);
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > >
> > > > It may not matter much (where I also see your below comments), but
> just
> > > to
> > > > mention another solution to avoid this read is that we can maintain
> > > > RAMBlock->receivedmap for precopy (especially, multifd, afaiu
> multifd
> > > > doesn't yet update this bitmap.. even if normal precopy does), then
> here
> > > > instead of scanning every time, maybe we can do:
> > > >
> > > >   /*
> > > >    * If it's the 1st time receiving it, no need to clear it as it
> must
> > > be
> > > >    * all zeros now.
> > > >    */
> > > >   if (bitmap_test(rb->receivedmap, page_offset)) {
> > > >       memset(page, 0, ...);
> > > >   } else {
> > > >       bitmap_set(rb->receivedmap, page_offset);
> > > >   }
> > > >
> > > > And we also always set the bit when !zero too.
> > > >
> > > > My rational is that it's unlikely a zero page if it's sent once or
> more,
> > > > while OTOH for the 1st time we receive it, it must be a zero page,
> so no
> > > > need to scan for the 1st round.
> > >
> > > Thinking about this, I'm wondering whether we should have this
> regardless.
> > > IIUC now multifd will always require two page faults on destination
> for
> > > anonymous guest memories (I suppose shmem/hugetlb is fine as no zero
> page
> > > in those worlds).  Even though it should be faster than DMA faults, it
> > > still is unwanted.
> > >
> > > I'll take a note myself as todo to do some measurements in the future
> > > first.  However if anyone thinks that makes sense and want to have a
> look,
> > > please say so.  It'll be more than welcomed.
> >
> > Yes, I think this is a better improvement to avoid two page faults. I
> can test
> > the performance impact of this change on SVM-capable devices and give
> some data
> > later. As we saw before, the IOTLB flush occurs via COW, with the
> change, the
> > impact of the COW should be gone.
> >
> > If you need more testing and analysis on this, please let me know
> 
> Nothing more than that.  Just a heads up that Xiang used to mention a test
> case where Richard used to suggest dropping the zero check:
> 
> https://lore.kernel.org/r/CAAYibXib+TWnJpV22E=adncdBmwXJRqgRjJXK7X71J=bDfa
> xDg@mail.gmail.com
> 
> AFAIU this should be resolved if we have the bitmap maintained, but we can
> double check.  IIUC that's exactly the case for an idle guest, in that
> case
> it should be even faster to skip the memcmp when bit clear.
> 
> If you're going to post the patches, feel free to post that as a
> standalone
> small series first, then that can be considered merge even earlier.
> 
> Thanks a lot for doing this.

Sure, I will prepare a separate patch for this, and we can have a better discussion
on concrete implementation and test results. 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 0/7] Live Migration With IAA
  2024-03-28 15:22         ` Peter Xu
@ 2024-03-29  3:33           ` Liu, Yuan1
  0 siblings, 0 replies; 45+ messages in thread
From: Liu, Yuan1 @ 2024-03-29  3:33 UTC (permalink / raw)
  To: Peter Xu; +Cc: farosas, qemu-devel, hao.xiang, bryan.zhang, Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 11:22 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> 
> On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> > Yes, I will support software fallback to ensure CI testing and users can
> > still use qpl compression without IAA hardware.
> >
> > Although the qpl software solution will have better performance than
> zlib,
> > I still don't think it has a greater advantage than zstd. I don't think
> there
> > is a need to add a migration option to configure the qpl software or
> hardware path.
> > So I will still only use QPL as an independent compression in the next
> version, and
> > no other migration options are needed.
> 
> That should be fine.
> 
> >
> > I will also add a guide to qpl-compression.rst about IAA permission
> issues and how to
> > determine whether the hardware path is available.
> 
> OK.
> 
> [...]
> 
> > > > Yes, I use iperf3 to check the bandwidth for one core, the bandwith
> is
> > > 60Gbps.
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87
> MBytes
> > > > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87
> Mbytes
> > > >
> > > > And in the live migration test, a multifd thread's CPU utilization
> is
> > > almost 100%
> > >
> > > This 60Gpbs per-channel is definitely impressive..
> > >
> > > Have you tried migration without multifd on your system? Would that
> also
> > > perform similarly v.s. 2 channels multifd?
> >
> > Simple Test result below:
> > VM Type: 16vCPU, 64G memory
> > Workload in VM: fill 56G memory with Silesia data and vCPUs are idle
> > Migration Configurations:
> > 1. migrate_set_parameter max-bandwidth 100G
> > 2. migrate_set_parameter downtime-limit 300
> > 3. migrate_set_capability multifd on (multiFD test case)
> > 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
> >
> >                   Totaltime (ms) Downtime (ms) Throughput (mbps) Pages-
> per-second
> > without Multifd	23580	            307	         21221	       689588
> > Multifd 2	       7657	            198	         65410	      2221176
> 
> Thanks for the test results.
> 
> So I am guessing the migration overheads besides pushing the socket is
> high
> enough to make it drop drastically, even if in this case zero detection
> shouldn't play a major role considering most of guest mem is pre-filled.

Yes, for no multifd migration, besides the network stack overhead, the zero
page detection overhead (both of source and destination) is indeed very high.
Placing the zero page detection in multi-threads can reduce the performance 
degradation caused by the overhead of zero page detection.

I also think migration doesn't need to detect zero page by memcmp in all cases.
The benefit of zero page detection may be that the VM's memory determines that
there are a large number of 0 pages. 

My experience in this area may be insufficient, I am trying with Hao and Bryan to
see if it is possible to use DSA hardware to accelerate this part (including page 0
detection and writing page 0). 

DSA is an accelerator for detecting memory, writing memory, and comparing memory
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2024-03-29  3:33 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-19 16:45 [PATCH v5 0/7] Live Migration With IAA Yuan Liu
2024-03-19 16:45 ` [PATCH v5 1/7] docs/migration: add qpl compression feature Yuan Liu
2024-03-26 17:58   ` Peter Xu
2024-03-27  2:14     ` Liu, Yuan1
2024-03-19 16:45 ` [PATCH v5 2/7] migration/multifd: put IOV initialization into compression method Yuan Liu
2024-03-20 15:18   ` Fabiano Rosas
2024-03-20 15:32     ` Liu, Yuan1
2024-03-19 16:45 ` [PATCH v5 3/7] configure: add --enable-qpl build option Yuan Liu
2024-03-20  8:55   ` Thomas Huth
2024-03-20  8:56     ` Thomas Huth
2024-03-20 14:34       ` Liu, Yuan1
2024-03-20 10:31   ` Daniel P. Berrangé
2024-03-20 14:42     ` Liu, Yuan1
2024-03-19 16:45 ` [PATCH v5 4/7] migration/multifd: add qpl compression method Yuan Liu
2024-03-27 19:49   ` Peter Xu
2024-03-28  3:03     ` Liu, Yuan1
2024-03-19 16:45 ` [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression Yuan Liu
2024-03-20 10:42   ` Daniel P. Berrangé
2024-03-20 15:02     ` Liu, Yuan1
2024-03-20 15:20       ` Daniel P. Berrangé
2024-03-20 16:04         ` Liu, Yuan1
2024-03-20 15:34       ` Peter Xu
2024-03-20 16:23         ` Liu, Yuan1
2024-03-20 20:31           ` Peter Xu
2024-03-21  1:37             ` Liu, Yuan1
2024-03-21 15:28               ` Peter Xu
2024-03-22  2:06                 ` Liu, Yuan1
2024-03-22 14:47                   ` Liu, Yuan1
2024-03-22 16:40                     ` Peter Xu
2024-03-27 19:25                       ` Peter Xu
2024-03-28  2:32                         ` Liu, Yuan1
2024-03-28 15:16                           ` Peter Xu
2024-03-29  2:04                             ` Liu, Yuan1
2024-03-19 16:45 ` [PATCH v5 6/7] migration/multifd: implement qpl compression and decompression Yuan Liu
2024-03-19 16:45 ` [PATCH v5 7/7] tests/migration-test: add qpl compression test Yuan Liu
2024-03-20 10:45   ` Daniel P. Berrangé
2024-03-20 15:30     ` Liu, Yuan1
2024-03-20 15:39       ` Daniel P. Berrangé
2024-03-20 16:26         ` Liu, Yuan1
2024-03-26 20:30 ` [PATCH v5 0/7] Live Migration With IAA Peter Xu
2024-03-27  3:20   ` Liu, Yuan1
2024-03-27 19:46     ` Peter Xu
2024-03-28  3:02       ` Liu, Yuan1
2024-03-28 15:22         ` Peter Xu
2024-03-29  3:33           ` Liu, Yuan1

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.