RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression

From: "Liu, Yuan1" <yuan1.liu@intel.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>,
	"farosas@suse.de" <farosas@suse.de>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"hao.xiang@bytedance.com" <hao.xiang@bytedance.com>,
	"bryan.zhang@bytedance.com" <bryan.zhang@bytedance.com>,
	"Zou, Nanhai" <nanhai.zou@intel.com>
Subject: RE: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
Date: Wed, 20 Mar 2024 16:23:01 +0000
Message-ID: <PH7PR11MB5941F8AE52DBD0F197798103A3332@PH7PR11MB5941.namprd11.prod.outlook.com>
In-Reply-To: <ZfsCDhnYYmjxLTRW@x1n>

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, March 20, 2024 11:35 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of
> qpl compression
> 
> On Wed, Mar 20, 2024 at 03:02:59PM +0000, Liu, Yuan1 wrote:
> > > > +static int alloc_zbuf(QplData *qpl, uint8_t chan_id, Error **errp)
> > > > +{
> > > > +    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
> > > > +    uint32_t size = qpl->job_num * qpl->data_size;
> > > > +    uint8_t *buf;
> > > > +
> > > > +    buf = (uint8_t *) mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
> > > > +    if (buf == MAP_FAILED) {
> > > > +        error_setg(errp, "multifd: %u: alloc_zbuf failed, job num %u, size %u",
> > > > +                   chan_id, qpl->job_num, qpl->data_size);
> > > > +        return -1;
> > > > +    }
> > >
> > > What's the reason for using mmap here, rather than a normal
> > > malloc ?
> >
> > I want to populate the memory accessed by the IAA device in the
> > initialization phase, and then avoid initiating I/O page faults through
> > the IAA device during migration, a large number of I/O page faults are
> > not good for performance.
> 
> mmap() doesn't populate pages, unless with MAP_POPULATE.  And even with
> that it shouldn't be guaranteed, as the populate phase should ignore all
> errors.
> 
>        MAP_POPULATE (since Linux 2.5.46)
>               Populate (prefault) page tables for a mapping.  For a file
>               mapping, this causes read-ahead on the file.  This will help
>               to reduce blocking on page faults later.  The mmap() call
>               doesn't fail if the mapping cannot be populated (for example,
>               due to limitations on the number of mapped huge pages when
>               using MAP_HUGETLB).  Support for MAP_POPULATE in conjunction
>               with private mappings was added in Linux 2.6.23.
> 
> OTOH, I think g_malloc0() should guarantee to prefault everything in as
> long as the call returned (even though they can be swapped out later, but
> that applies to all cases anyway).

Thanks, Peter. I will try the g_malloc0() approach here.
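
For illustration only, a minimal sketch of a prefaulted buffer allocation.
It is not the patch code: alloc_prefaulted() is a hypothetical helper, and
it uses plain malloc() instead of g_malloc0() so that it builds without
GLib. Because an allocator may satisfy a large request with fresh pages
that have not been touched yet, the sketch writes one byte to every page
so that the CPU takes the faults up front. It does not zero the rest of
the buffer.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): allocate job_num * data_size
 * bytes and touch every page so the CPU faults them in before any device
 * accesses the buffer. Returns NULL on failure; free with free().
 */
static uint8_t *alloc_prefaulted(uint32_t job_num, uint32_t data_size)
{
    size_t size = (size_t)job_num * data_size;  /* widen before multiplying */
    long page = sysconf(_SC_PAGESIZE);
    volatile uint8_t *p;                        /* volatile: keep the stores */
    uint8_t *buf;

    if (size == 0 || page <= 0) {
        return NULL;
    }
    buf = malloc(size);
    if (buf == NULL) {
        return NULL;
    }
    p = buf;
    /* One store per page-sized stride, starting at the first byte. */
    for (size_t off = 0; off < size; off += (size_t)page) {
        p[off] = 0;
    }
    /* malloc() memory need not be page-aligned, so also touch the last byte. */
    p[size - 1] = 0;
    return buf;
}
```

The explicit stores make the prefault independent of how the allocator
obtains its memory; the pages can still be swapped out later.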

> > This problem also occurs at the destination, therefore, I recommend that
> > customers need to add -mem-prealloc for destination boot parameters.
> 
> I'm not sure what issue you hit when testing it, but -mem-prealloc flag
> should only control the guest memory backends not the buffers that QEMU
> internally use, afaiu.
> 
> Thanks,
> 
> --
> Peter Xu

Let me explain. During IAA decompression, the hardware can write the
decompressed data directly to the virtual address of the guest memory.
This avoids having the CPU copy the decompressed data into guest memory.

Without -mem-prealloc, the guest memory is not populated, so the IAA hardware
must first trigger an I/O page fault before it can write the decompressed data
to the guest memory region. In addition, CPU page faults trigger IOTLB flushes
when the IAA device uses SVM.

Because a large number of I/O page faults and IOTLB flushes cannot be resolved
quickly, the decompression throughput of the IAA device drops significantly.
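
To make the "populate before the device writes" idea concrete, here is a
rough sketch, not QEMU code. populate_region() is a hypothetical helper
and the region is assumed to be page-aligned, e.g. returned by mmap(). It
tries madvise(MADV_POPULATE_WRITE), which is available since Linux 5.14
and, unlike MAP_POPULATE, reports failure. Where that is not available it
falls back to reading each page's first byte and writing it back
unchanged, so existing contents are preserved.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): make every page in
 * [addr, addr + len) present and writable before a device writes to it.
 * addr is assumed to be page-aligned. Returns 0 on success, -1 on failure.
 */
static int populate_region(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);

    if (addr == NULL || len == 0 || page <= 0) {
        return -1;
    }
#ifdef MADV_POPULATE_WRITE
    /* Linux 5.14+: prefault as writable; errors are reported to the caller. */
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0) {
        return 0;
    }
    if (errno != EINVAL) {
        return -1;
    }
    /* EINVAL: most likely an older kernel; use the manual fallback below. */
#endif
    /* Fallback: write each page's first byte back unchanged to fault it in. */
    for (size_t off = 0; off < len; off += (size_t)page) {
        volatile unsigned char *p = (volatile unsigned char *)addr + off;
        *p = *p;
    }
    return 0;
}
```

With the pages already present, the device should not need to raise an
I/O page fault on its first write to each page.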