From: Peter Xu <peterx@redhat.com>
To: Yuan Liu <yuan1.liu@intel.com>
Cc: farosas@suse.de, qemu-devel@nongnu.org, hao.xiang@bytedance.com,
	bryan.zhang@bytedance.com, nanhai.zou@intel.com
Subject: Re: [PATCH v5 0/7] Live Migration With IAA
Date: Tue, 26 Mar 2024 16:30:00 -0400
Message-ID: <ZgMwSO_eRIgXZ24L@x1n>
In-Reply-To: <20240319164527.1873891-1-yuan1.liu@intel.com>

Hi, Yuan,

On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> 1. QPL will be used as an independent compression method like ZLIB and ZSTD;
>    QPL will force the use of the IAA accelerator and will not support software
>    compression. For a summary of compatibility issues with Zlib, please refer
>    to docs/devel/migration/qpl-compression.rst

IIRC our previous discussion was that we should provide a software fallback
for the new QEMU paths, right?  Why did that decision change?  Again, such a
fallback can help us make sure QPL won't get broken easily by other changes.
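
To illustrate what I had in mind (an untested sketch only, not code from this
series, and the helper names below are made up): libqpl already exposes a
CPU-only execution path, so the per-channel job setup could try IAA first and
fall back to qpl_path_software when the hardware isn't usable:

/* Untested sketch; multifd_qpl_init_job*() are hypothetical names. */
#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "qapi/error.h"
#include "qpl/qpl.h"

static qpl_job *multifd_qpl_init_job(qpl_path_t path)
{
    uint32_t size = 0;
    qpl_job *job;

    /* The required job size depends on the execution path */
    if (qpl_get_job_size(path, &size) != QPL_STS_OK) {
        return NULL;
    }
    job = g_malloc0(size);
    if (qpl_init_job(path, job) != QPL_STS_OK) {
        g_free(job);
        return NULL;
    }
    return job;
}

static qpl_job *multifd_qpl_init_job_with_fallback(Error **errp)
{
    /* Prefer the IAA hardware path... */
    qpl_job *job = multifd_qpl_init_job(qpl_path_hardware);

    if (!job) {
        /* ...but keep the code path usable on hosts without IAA */
        warn_report("qpl: IAA unavailable, falling back to software path");
        job = multifd_qpl_init_job(qpl_path_software);
    }
    if (!job) {
        error_setg(errp, "qpl: failed to initialize a qpl_job");
    }
    return job;
}

The software path would also let CI machines without IAA exercise the qpl
code in QEMU, which is the "won't get broken easily" part above.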

> 
> 2. Compression accelerator related patches are removed from this patch set and
>    will be added to the QAT patch set; we will submit separate patches that use
>    QAT to accelerate ZLIB and ZSTD.
> 
> 3. Advantages of using IAA accelerator include:
>    a. Compared with no compression, it can improve downtime performance
>       without adding additional host resources (either CPU or network).
>    b. Compared with software compression methods (ZSTD/ZLIB), it can
>       provide a high data compression ratio and save a lot of the CPU
>       resources used for compression.
> 
> Test conditions:
>   1. Host CPUs are based on Sapphire Rapids
>   2. VM type: 16 vCPUs and 64GB memory
>   3. The source and destination each use 4 IAA devices.
>   4. The workload in the VM
>     a. all vCPUs are in the idle state
>     b. 90% of the virtual machine's memory is in use, filled with the
>        Silesia corpus.
>        An introduction to Silesia:
>        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
>   5. Set the "--mem-prealloc" boot parameter on the destination; this
>      parameter can make IAA perform better, and a related introduction is
>      included in docs/devel/migration/qpl-compression.rst
>   6. Source migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter downtime-limit 300
>      d. migrate_set_parameter max-bandwidth 100G/1G
>      e. migrate_set_parameter multifd-compression none/qpl/zstd
>   7. Destination migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter multifd-compression none/qpl/zstd
> 
> Early migration results; each result is the average of three tests
> 
>  +--------+-------------+--------+--------+---------+----------+------+
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+--------+--------+---------+----------+------+
>  |Network |            2|    8571|      69|    58391|   1896525|  256%|

Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps with
only 2 channels.  My previous experience is around ~1XGbps per channel, so no
more than ~30Gbps for two channels.  Is it because of a faster processor?
Indeed, from the 4/8-channel results it doesn't look like increasing the
number of channels helped much, and downtime even got worse.

What is the rationale behind the "downtime improvement" with the QPL
compressor?  IIUC, in this 100Gbps case the bandwidth is never a limitation,
so I don't understand why adding a compression phase can make the switchover
faster.  I can expect many more pages to be sent in a NIC-limited env like
the 1Gbps one you described below, but not when the NIC has unlimited
resources like here.

>  |BW:100G +-------------+--------+--------+---------+----------+------+
>  |        |            4|    7180|      92|    69736|   1865640|  300%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|    7090|     121|    70562|   2174060|  307%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------+
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+--------+--------+---------+----------+------+
>  |Network |            2|    8413|      34|    30067|   1732411|  230%|
>  |BW:100G +-------------+--------+--------+---------+----------+------+
>  |        |            4|    6559|      32|    38804|   1689954|  450%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|    6623|      37|    38745|   1566507|  790%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------+
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+--------+--------+---------+----------+------+
>  |Network |            2|   95846|      24|     1800|    521829|  203%|
>  |BW:100G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   49004|      24|     3529|    890532|  403%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   25574|      32|     6782|   1762222|  800%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
> When network bandwidth is sufficient, QPL can improve downtime by 2x
> compared to no compression. In this scenario the IAA hardware resources are
> fully used with 4 channels, so adding more channels will not bring further
> benefits.
> 
>  
>  +--------+-------------+--------+--------+---------+----------+------+
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+--------+--------+---------+----------+------+
>  |Network |            2|   57758|      66|     8643|    264617|   34%|
>  |BW:  1G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   57216|      58|     8726|    266773|   34%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   56708|      53|     8804|    270223|   33%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------+
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+--------+--------+---------+----------+------+
>  |Network |            2|   30129|      34|     8345|   2224761|   54%|
>  |BW:  1G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   30317|      39|     8300|   2025220|   73%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   29615|      35|     8514|   2250122|  131%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------+
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+--------+--------+---------+----------+------+
>  |Network |            2|   95750|      24|     1802|    477236|  202%|
>  |BW:  1G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   48907|      24|     3536|   1002142|  404%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   25568|      32|     6783|   1696437|  800%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
> When network bandwidth is limited, the "pages per second" metric decreases
> with no compression, so the success rate of migration will drop. Comparing
> the QPL and ZSTD compression methods, QPL can save a lot of the CPU
> resources used for compression.
> 
> v2:
>   - add support for multifd compression accelerator
>   - add support for the QPL accelerator in the multifd
>     compression accelerator
>   - fixed the issue that QPL was compiled into the migration
>     module by default
> 
> v3:
>   - use Meson instead of pkg-config to resolve QPL build
>     dependency issue
>   - fix coding style
>   - fix a CI issue for get_multifd_ops function in multifd.c file
> 
> v4:
>   - patch based on commit: da96ad4a6a Merge tag 'hw-misc-20240215' of
>     https://github.com/philmd/qemu into staging
>   - remove the compression accelerator implementation patches; those patches
>     will be placed in the QAT accelerator implementation.
>   - introduce QPL as a new compression method
>   - add QPL compression documentation
>   - add QPL compression migration test
>   - fix zlib/zstd compression level issue
> 
> v5:
>   - patch based on v9.0.0-rc0 (c62d54d0a8)
>   - use pkg-config to check for libaccel-config, which is already
>     available in many distributions.
>   - initialize the IOV of the sender by the specific compression method
>   - refine the coding style
>   - remove the patch for the zlib/zstd compression level not working; the
>     issue has been solved
> 
> Yuan Liu (7):
>   docs/migration: add qpl compression feature
>   migration/multifd: put IOV initialization into compression method
>   configure: add --enable-qpl build option
>   migration/multifd: add qpl compression method
>   migration/multifd: implement initialization of qpl compression
>   migration/multifd: implement qpl compression and decompression
>   tests/migration-test: add qpl compression test
> 
>  docs/devel/migration/features.rst        |   1 +
>  docs/devel/migration/qpl-compression.rst | 231 +++++++++++
>  hw/core/qdev-properties-system.c         |   2 +-
>  meson.build                              |  16 +
>  meson_options.txt                        |   2 +
>  migration/meson.build                    |   1 +
>  migration/multifd-qpl.c                  | 482 +++++++++++++++++++++++
>  migration/multifd-zlib.c                 |   4 +
>  migration/multifd-zstd.c                 |   6 +-
>  migration/multifd.c                      |   8 +-
>  migration/multifd.h                      |   1 +
>  qapi/migration.json                      |   7 +-
>  scripts/meson-buildoptions.sh            |   3 +
>  tests/qtest/migration-test.c             |  24 ++
>  14 files changed, 782 insertions(+), 6 deletions(-)
>  create mode 100644 docs/devel/migration/qpl-compression.rst
>  create mode 100644 migration/multifd-qpl.c
> 
> -- 
> 2.39.3
> 

-- 
Peter Xu


