Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression

From: Peter Xu <peterx@redhat.com>
To: "Liu, Yuan1" <yuan1.liu@intel.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>,
	"farosas@suse.de" <farosas@suse.de>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"hao.xiang@bytedance.com" <hao.xiang@bytedance.com>,
	"bryan.zhang@bytedance.com" <bryan.zhang@bytedance.com>,
	"Zou, Nanhai" <nanhai.zou@intel.com>
Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
Date: Fri, 22 Mar 2024 12:40:32 -0400	[thread overview]
Message-ID: <Zf20gJbSavpp93_b@x1n> (raw)
In-Reply-To: <PH7PR11MB5941B5EB0C21FBFB6C8FFA16A3312@PH7PR11MB5941.namprd11.prod.outlook.com>

On Fri, Mar 22, 2024 at 02:47:02PM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Liu, Yuan1
> > Sent: Friday, March 22, 2024 10:07 AM
> > To: Peter Xu <peterx@redhat.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com; Zou,
> > Nanhai <nanhai.zou@intel.com>
> > Subject: RE: [PATCH v5 5/7] migration/multifd: implement initialization of
> > qpl compression
> > 
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, March 21, 2024 11:28 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > > devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com;
> > Zou,
> > > Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization
> > of
> > > qpl compression
> > >
> > > On Thu, Mar 21, 2024 at 01:37:36AM +0000, Liu, Yuan1 wrote:
> > > > > -----Original Message-----
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > Sent: Thursday, March 21, 2024 4:32 AM
> > > > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de; qemu-
> > > > > devel@nongnu.org; hao.xiang@bytedance.com;
> > bryan.zhang@bytedance.com;
> > > Zou,
> > > > > Nanhai <nanhai.zou@intel.com>
> > > > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement
> > > initialization of
> > > > > qpl compression
> > > > >
> > > > > On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> > > > > > let me explain here, during the decompression operation of IAA,
> > the
> > > > > > decompressed data can be directly output to the virtual address of
> > > the
> > > > > > guest memory by IAA hardware.  It can avoid copying the
> > decompressed
> > > > > data
> > > > > > to guest memory by CPU.
> > > > >
> > > > > I see.
> > > > >
> > > > > > Without -mem-prealloc, all the guest memory is not populated, and
> > > IAA
> > > > > > hardware needs to trigger I/O page fault first and then output the
> > > > > > decompressed data to the guest memory region.  Besides that, CPU
> > > page
> > > > > > faults will also trigger IOTLB flush operation when IAA devices
> > use
> > > SVM.
> > > > >
> > > > > Oh so the IAA hardware already can use CPU pgtables?  Nice..
> > > > >
> > > > > Why IOTLB flush is needed?  AFAIU we're only installing new pages,
> > the
> > > > > request can either come from a CPU access or a DMA.  In all cases
> > > there
> > > > > should have no tearing down of an old page.  Isn't an iotlb flush
> > only
> > > > > needed if a tear down happens?
> > > >
> > > > As far as I know, IAA hardware uses SVM technology to use the CPU's
> > page
> > > table
> > > > for address translation (IOMMU scalable mode directly accesses the CPU
> > > page table).
> > > > Therefore, when the CPU page table changes, the device's Invalidation
> > > operation needs
> > > > to be triggered to update the IOMMU and the device's cache.
> > > >
> > > > My current kernel version is mainline 6.2. The issue I see is as
> > > follows:
> > > > --Handle_mm_fault
> > > >  |
> > > >   -- wp_page_copy
> > >
> > > This is the CoW path.  Not usual at all..
> > >
> > > I assume this issue should only present on destination.  Then the guest
> > > pages should be the destination of such DMAs to happen, which means
> > these
> > > should be write faults, and as we see here it is, otherwise it won't
> > > trigger a CoW.
> > >
> > > However it's not clear to me why a pre-installed zero page existed.  It
> > > means someone read the guest pages first.
> > >
> > > It might be interesting to know _why_ someone reads the guest pages,
> > even
> > > if we know they're all zeros.  If we can avoid such reads then it'll be
> > a
> > > hole rather than a prefaulted read on zero page, then invalidations are
> > > not
> > > needed, and I expect that should fix the iotlb storm issue.
> > 
> > The received pages will be read for zero pages check first. Although
> > these pages are zero pages, and IAA hardware will not access them, the
> > COW happens and causes following IOTLB flush operation. As far as I know,
> > IOMMU quickly detects whether the address range has been used by the
> > device,
> > and does not invalidate the address that is not used by the device, this
> > has
> > not yet been resolved in Linux kernel 6.2. I will check the latest status
> > for
> > this.
> 
> I checked the Linux mainline 6.8 code, there are no big changes for this.
> In version 6.8, if the process needs to flush MMU TLB, then I/O TLB flush
> will be also triggered when the process has SVM devices. I haven't found
> the code to check if pages have been set EA (Extended-Accessed) bit before
> submitting invalidation operations, this is same with version 6.2.
> 
> VT-d 3.6.2
> If the Extended-Accessed-Flag-Enable (EAFE) is 1 in a scalable-mode PASID-table
> entry that references a first-stage paging-structure entry used by the remapping
> hardware, it atomically sets the EA field in that entry. Whenever EA field is 
> atomically set, the A field is also set in the same atomic operation. For software
> usages where the first-stage paging structures are shared across heterogeneous agents
> (e.g., CPUs and accelerator devices such as GPUs), the EA flag may be used by software
> to identify pages accessed by non-CPU agent(s) (as opposed to the A flag which indicates
> access by any agent sharing the paging structures).

This seems pretty new hardware features.  I didn't check in depths but what
you said makes sense.

> 
> > void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> > {
> >     for (int i = 0; i < p->zero_num; i++) {
> >         void *page = p->host + p->zero[i];
> >         if (!buffer_is_zero(page, p->page_size)) {
> >             memset(page, 0, p->page_size);
> >         }
> >     }
> > }

It may not matter much (where I also see your below comments), but just to
mention another solution to avoid this read is that we can maintain
RAMBlock->receivedmap for precopy (especially, multifd, afaiu multifd
doesn't yet update this bitmap.. even if normal precopy does), then here
instead of scanning every time, maybe we can do:

  /*
   * If it's the 1st time receiving it, no need to clear it as it must be
   * all zeros now.
   */
  if (bitmap_test(rb->receivedmap, page_offset)) {
      memset(page, 0, ...);
  } else {
      bitmap_set(rb->receivedmap, page_offset);
  }

And we also always set the bit when !zero too.

My rational is that it's unlikely a zero page if it's sent once or more,
while OTOH for the 1st time we receive it, it must be a zero page, so no
need to scan for the 1st round.

> > 
> > 
> > > It'll still be good we can fix this first to not make qpl special from
> > > this
> > > regard, so that the hope is migration submodule shouldn't rely on any
> > > pre-config (-mem-prealloc) on guest memory behaviors to work properly.
> > 
> > Even if the IOTLB problem can be avoided, the I/O page fault problem
> > (normal
> > pages are loaded by the IAA device and solving normal page faults through
> > IOMMU,
> > the performance is not good)

Do you have a rough estimate on how slow that could be? It'll be good to
mention some details too in the doc file in that case.

> > 
> > It can let the decompressed data of the IAA device be output to a pre-
> > populated
> > memory instead of directly outputting to the guest address, but then each
> > multifd
> > thread needs two memory copies, one copy from the network to the IAA input
> > memory(pre-populated), and another copy from the IAA output memory(pre-
> > populated)
> > to the guest address, which may become a performance bottleneck at the
> > destination
> > during the live migration process.
> >  
> > So I think it is still necessary to use the -mem-prealloc option

Right, that complexity may not be necessary, in that case, maybe such
suggestion is fine.

Thanks,

> > 
> > > >     -- mmu_notifier_invalidate_range
> > > >       |
> > > >       -- intel_invalidate_rage
> > > >         |
> > > >         -- qi_flush_piotlb
> > > >         -- qi_flush_dev_iotlb_pasid
> > >
> > > --
> > > Peter Xu
> 

-- 
Peter Xu