From: Yan Zhao <yan.y.zhao@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>,
cjia@nvidia.com, kevin.tian@intel.com, ziye.yang@intel.com,
changpeng.liu@intel.com, yi.l.liu@intel.com, mlevitsk@redhat.com,
eskultet@redhat.com, cohuck@redhat.com, dgilbert@redhat.com,
jonathan.davies@nutanix.com, eauger@redhat.com, aik@ozlabs.ru,
pasic@linux.ibm.com, felipe@nutanix.com,
Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst@alibaba-inc.com,
Ken.Xue@amd.com, zhi.a.wang@intel.com, qemu-devel@nongnu.org,
kvm@vger.kernel.org
Subject: Re: [PATCH Kernel v22 0/8] Add UAPIs to support migration for VFIO devices
Date: Wed, 27 May 2020 02:23:58 -0400
Message-ID: <20200527062358.GD19560@joy-OptiPlex-7040>
In-Reply-To: <20200526141939.2632f100@x1.home>
On Tue, May 26, 2020 at 02:19:39PM -0600, Alex Williamson wrote:
> On Mon, 25 May 2020 18:50:54 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>
> > On 5/25/2020 12:29 PM, Yan Zhao wrote:
> > > On Tue, May 19, 2020 at 10:58:04AM -0600, Alex Williamson wrote:
> > >> Hi folks,
> > >>
> > >> My impression is that we're getting pretty close to a workable
> > >> implementation here with v22 plus respins of patches 5, 6, and 8. We
> > >> also have a matching QEMU series and a proposal for a new i40e
> > >> consumer, as well as I assume GVT-g updates happening internally at
> > >> Intel. I expect all of the latter needs further review and discussion,
> > >> but we should be at the point where we can validate these proposed
> > >> kernel interfaces. Therefore I'd like to make a call for reviews so
> > >> that we can get this wrapped up for the v5.8 merge window. I know
> > >> Connie has some outstanding documentation comments and I'd like to make
> > >> sure everyone has an opportunity to check that their comments have been
> > >> addressed and we don't discover any new blocking issues. Please send
> > >> your Acked-by/Reviewed-by/Tested-by tags if you're satisfied with this
> > >> interface and implementation. Thanks!
> > >>
> > > hi Alex
> > > after porting gvt/i40e vf migration code to kernel/qemu v23, we spotted
> > > two bugs.
> > > 1. "Failed to get dirty bitmap for iova: 0xfe011000 size: 0x3fb0 err: 22"
> > > This is a QEMU bug: the dirty bitmap query range is not the same
> > > as the DMA map range. It can be fixed in QEMU, though I have a slight
> > > concern about the kernel imposing this restriction.
> > >
> >
> > I never saw this unaligned size in my testing. In this case, if you can
> > provide vfio_* event traces, that would be helpful.
>
> Yeah, I'm curious why we're hitting such a call path, I think we were
> designing this under the assumption we wouldn't see these. I also
That's because the algorithm that computes the dirty bitmap query range still
does not exactly match the one that computes the DMA map range in
vfio_dma_map().
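As a minimal sketch of why the kernel rejects the mismatched query (hypothetical simplified structures, not the actual vfio_iommu_type1 code): the dirty-bitmap lookup requires the requested (iova, size) pair to exactly match a previously mapped range, so the 0x3fb0-byte query above fails with EINVAL ("err: 22"):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified model of a tracked DMA mapping. */
struct dma_mapping {
    uint64_t iova;
    uint64_t size;
};

/* Sketch of the restriction: a dirty-bitmap query is accepted only
 * when it exactly matches a previously mapped (iova, size) range. */
static int get_dirty_bitmap(const struct dma_mapping *maps, size_t n,
                            uint64_t iova, uint64_t size)
{
    for (size_t i = 0; i < n; i++) {
        if (maps[i].iova == iova && maps[i].size == size)
            return 0;       /* exact match: query allowed */
    }
    return -EINVAL;         /* mismatched range: "err: 22" */
}
```

With a mapping at iova 0xfe011000 of, say, size 0x4000, a query of size 0x3fb0 returns -EINVAL, matching the error QEMU logged.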
> wonder if we really need to enforce the dma mapping range for getting
> the dirty bitmap with the current implementation (unmap+dirty obviously
> still has the restriction). We do shift the bitmap in place for
> alignment, but I'm not sure why we couldn't shift it back and only
> clear the range that was reported. Kirti, do you see other issues? I
> think a patch to lift that restriction is something we could plan to
> include after the initial series is included and before we've committed
> to the uapi at the v5.8 release.
>
> > > 2. migration abortion, reporting
> > > "qemu-system-x86_64-lm: vfio_load_state: Error allocating buffer
> > > qemu-system-x86_64-lm: error while loading state section id 49(vfio)
> > > qemu-system-x86_64-lm: load of migration failed: Cannot allocate memory"
> > >
> > > It's still a QEMU bug, and we can fix it with:
> > > "
> > > if (migration->pending_bytes == 0) {
> > > +     qemu_put_be64(f, 0);
> > > +     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > "
> >
> > In which function in QEMU do you have to add this?
>
> I think this is relative to QEMU path 09/ where Yan had the questions
> below on v16 and again tried to get answers to them on v22:
>
> https://lore.kernel.org/qemu-devel/20200520031323.GB10369@joy-OptiPlex-7040/
>
> Kirti, please address these questions.
>
> > > and actually there are some extra concerns about this part, as reported in
> > > [1][2].
> > >
> > > [1] data_size should be read ahead of data_offset
> > > https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg02795.html.
> > > [2] should not repeatedly update pending_bytes in vfio_save_iterate()
> > > https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg02796.html.
> > >
> > > but as those errors are all in qemu, and we have finished basic tests in
> > > both gvt & i40e, we're fine with the kernel part interface in general now.
> > > (except for my concern [1], which needs to update kernel patch 1)
> > >
> >
> > >> what if pending_bytes is not 0, but vendor driver just does not want to
> > >> send data in this iteration? isn't it right to get data_size first
> > >> before getting data_offset?
> >
> > If the vendor driver doesn't want to send data but still has data in the
> > staging buffer, it can still report pending_bytes as 0 for this
> > iteration, since pending_bytes is a trap field.
> >
> > I would defer this to Alex.
>
> This is my understanding of the protocol as well, when the device is
> running, pending_bytes might drop to zero if no internal state has
> changed and may be non-zero on the next iteration due to device
> activity. When the device is not running, pending_bytes reporting zero
> indicates the device is done, there is no further state to transmit.
> Does that meet your need/expectation?
>
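Alex's description of the protocol can be sketched as a toy model (hypothetical structure and field names, not the actual vendor-driver code): while the device is running, a pending_bytes read may be zero one iteration and non-zero the next; once the device is stopped, a zero read means there is no further state to transmit:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the protocol described above: while running,
 * pending_bytes reflects current internal state changes; once
 * stopped, a zero read means the save is complete. */
struct toy_device {
    bool running;
    uint64_t internal_dirty;   /* changed internal state, bytes */
    uint64_t final_state;      /* device state sent after stop  */
};

static uint64_t read_pending_bytes(const struct toy_device *d)
{
    if (d->running)
        return d->internal_dirty;          /* may be 0 this round */
    return d->internal_dirty + d->final_state;
}

static bool save_complete(const struct toy_device *d)
{
    return !d->running && read_pending_bytes(d) == 0;
}
```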
(1) On one side, as in vfio_save_pending():

vfio_save_pending()
{
    ...
    ret = vfio_update_pending(vbasedev);
    ...
    *res_precopy_only += migration->pending_bytes;
    ...
}

pending_bytes tells the migration thread how much data is still held on the
device side. The device data includes
device internal data + running device dirty data + device state,
so pending_bytes should include device state as well, right?
If so, pending_bytes should never reach 0 if there's any device state left
to be sent after the device is stopped.
(2) On the other side, after we update pending_bytes in vfio_save_pending()
and enter vfio_save_iterate(), repeatedly updating pending_bytes in
vfio_save_iterate() leads to a scenario like:

initially pending_bytes=500M.
vfio_save_iterate() -->
round 1: transmitted 500M.
round 2: update pending bytes, pending_bytes=50M (50M dirty data).
round 3: update pending bytes, pending_bytes=50M.
...
round N: update pending bytes, pending_bytes=50M.

If there are two vfio devices, vfio_save_iterate() for the second device may
never get a chance to be called, because there are always pending_bytes
produced by the first device, even if the amount is small.
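The starvation concern in (2) can be illustrated with a toy simulation (hypothetical numbers taken from the example above, not QEMU's actual scheduling code): as long as device 1 re-reports ~50M of newly dirtied data every round, an iterate loop that always services the first device with non-zero pending_bytes never reaches device 2:

```c
#include <stdint.h>

/* Toy model: the iterate loop always services device 1 while it
 * still reports pending data; device 2 runs only when device 1
 * reads back pending_bytes == 0. Returns the round in which
 * device 2 first runs, or -1 if it is starved for max_rounds. */
static int rounds_until_second_device_runs(uint64_t dev1_pending,
                                           uint64_t dev1_dirty_rate,
                                           uint64_t budget_per_round,
                                           int max_rounds)
{
    for (int round = 1; round <= max_rounds; round++) {
        if (dev1_pending > 0) {
            uint64_t sent = dev1_pending < budget_per_round ?
                            dev1_pending : budget_per_round;
            dev1_pending -= sent;
            /* Re-updating pending_bytes each round adds the newly
             * dirtied data back, as in rounds 2..N above. */
            dev1_pending += dev1_dirty_rate;
            continue;
        }
        return round;   /* device 2 finally gets a turn */
    }
    return -1;          /* device 2 was starved */
}
```

With pending=500 (MB), dirty rate 50 per round, and a 500 budget, device 2 is never reached; with a dirty rate of 0, it runs in round 2.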
> > > so I wonder which way in your mind is better, to give our reviewed-by to
> > > the kernel part now, or hold until next qemu fixes?
> > > and as performance data from gvt is requested from your previous mail, is
> > > that still required before the code is accepted?
>
> The QEMU series does not need to be perfect, I kind of expect we might
> see a few iterations of that beyond the kernel portion being accepted.
> We should have the QEMU series to the point that we've resolved any
> uapi issues though, which it seems like we're pretty close to having.
> Ideally I'd like to get the kernel series into my next branch before
> the merge window opens, where it seems like upstream is on schedule to
> have that happen this Sunday. If you feel we're to the point where we
> can iron a couple of details out during the v5.8 development cycle, then
> please provide your reviewed-by. We haven't fully committed to a uapi
> until we've committed to it for a non-rc release.
>
got it.
> I think the performance request was largely due to some conversations
> with Dave Gilbert wondering if all this actually works AND is practical
> for a LIVE migration. I think we're all curious about things like how
> much data does a GPU have to transfer in each phase of migration, and
> particularly if the final phase is going to be a barrier to claiming
> the VM is actually sufficiently live. I'm not sure we have many
> options if a device simply has a very large working set, but even
> anecdotal evidence that the stop-and-copy phase transfers abMB from the
> device while idle or xyzMB while active would give us some idea what to
For Intel vGPU, the data is:

single-round dirty query:
  data to be transferred at stop-and-copy phase: 90MB+ ~ 900MB+, including
  - device state: 9MB
  - system dirty memory: 80MB+ ~ 900MB+ (depending on workload type)

multi-round dirty query:
  - each iteration data: 60MB ~ 400MB
  - data to be transferred at stop-and-copy phase: 70MB ~ 400MB

BTW, for viommu, the downtime data is as below, under the same network
conditions and guest memory size, with no running dirty data/memory produced
by the device:

(1) viommu off
    single-round dirty query: downtime ~100ms
(2) viommu on
    single-round dirty query: downtime 58s
Thanks
Yan
> expect. Kirti, have you done any of those sorts of tests for NVIDIA's
> driver?
>
> > > BTW, we have also conducted some basic tests with viommu on, and found
> > > errors like
> > > "qemu-system-x86_64-dt: vtd_iova_to_slpte: detected slpte permission error (iova=0x0, level=0x3, slpte=0x0, write=1)
> > > qemu-system-x86_64-dt: vtd_iommu_translate: detected translation failure (dev=00:03:00, iova=0x0)
> > > qemu-system-x86_64-dt: New fault is not recorded due to compression of faults".
> > >
> >
> > I saw these errors, I'm looking into it.
>
> Let's try to at least determine if this is a uapi issue or just a QEMU
> implementation bug for progressing the kernel series. Thanks,
>
> Alex
>