xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
From: Chao Gao <chao.gao@intel.com>
To: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Anthony PERARD <anthony.perard@citrix.com>,
	xen-devel@lists.xenproject.org, Wei Liu <wl@xen.org>
Subject: Re: [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached
Date: Mon, 14 Oct 2019 23:03:43 +0800	[thread overview]
Message-ID: <20191014150341.GA12156@gao-cwp> (raw)
In-Reply-To: <d12ee001-7f8e-4482-2a78-9cb1fd2d7530@eikelenboom.it>

On Thu, Oct 10, 2019 at 06:13:43PM +0200, Sander Eikelenboom wrote:
>On 01/10/2019 12:35, Anthony PERARD wrote:
>> Rewrite of the commit message:
>> 
>> Before the problematic commit, libxl used to ignore error when
>> destroying (force == true) a passthrough device, especially error that
>> happens when dealing with the DM.
>> 
>> Since fae4880c45fe, if the DM failed to detach the pci device within
>> the allowed time, the timed out error raised skip part of
>> pci_remove_*, but also raise the error up to the caller of
>> libxl__device_pci_destroy_all, libxl__destroy_domid, and thus the
>> destruction of the domain fails.
>> 
>> In this patch, if the DM didn't confirmed that the device is removed,
>> we will print a warning and keep going if force=true.  The patch
>> reorder the functions so that pci_remove_timeout() calls
>> pci_remove_detatched() like it's done when DM calls are successful.
>> 
>> We also clean the QMP states and associated timeouts earlier, as soon
>> as they are not needed anymore.
>> 
>> Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
>> Fixes: fae4880c45fe015e567afa223f78bf17a6d98e1b
>> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
>> 
>
>Hi Anthony / Chao,
>
>I have to come back to this, a bit because perhaps there is an underlying issue.
>While it earlier occurred to me that the VM to which I passed through most pci-devices 
>(8 to be exact) became very slow to shutdown, but I  didn't investigate it further.
>
>But after you commit messages from this patch it kept nagging, so today I did some testing
>and bisecting.
>
>The difference in tear-down time at least from what the IOMMU code logs is quite large:
>
>xen-4.12.0
>	Setup: 	    7.452 s
>	Tear-down:  7.626 s
>
>xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d
>	Setup:      7.468 s
>	Tear-down: 50.239 s
>
>Bisection turned up:
>	commit c4b1ef0f89aa6a74faa4618ce3efed1de246ec40
>	Author: Chao Gao <chao.gao@intel.com>
>	Date:   Fri Jul 19 10:24:08 2019 +0100
>	libxl_qmp: wait for completion of device removal
>
>Which makes me wonder if there is something going wrong in Qemu ?

Hi Sander,

Thanks for your testing and the bisection.

I tried on my machine, the destruction time of a guest with 8 pass-thru
devices increased from 4s to 12s after applied the commit above. In my
understanding, I guess you might get the error message "timed out
waiting for DM to remove...". There might be some issues on your assigned
devices' drivers. You can first unbind the devices with their drivers in
VM and then tear down the VM, and check whether the VM teardown gets
much faster.

Anthony & Wei,

The commit above basically serializes and synchronizes detaching
assigned devices and thus increases VM teardown time significantly if
there are multiple assigned devices. The commit aimed to avoid qemu's
access to PCI configuration space coinciding with the device reset
initiated by xl (which is not desired and is exactly the case which
triggers the assertion in Xen [1]). I personally insist that xl should
wait for DM's completion of device detaching. Otherwise, besides Xen
panic (which can be fixed in another way), in theory, such sudden
unawared device reset might cause a disaster (e.g. data loss for a
storage device).

[1]: https://lists.xenproject.org/archives/html/xen-devel/2019-09/msg03287.html

But considering fast creation and teardown is an important benefit of
virtualization, I am not sure how to deal with the situation. Anyway,
you can make the decision. To fix the regression on VM teardown, we can
revert the commit by removing the timeout logic.

What's your opinion?

Thanks
Chao

>
>--
>Sander
>
>
>
>xen-4.12.0 setup:
>	(XEN) [2019-10-10 09:54:14.846] AMD-Vi: Disable: device id = 0x900, domain = 0, paging mode = 3
>	(XEN) [2019-10-10 09:54:14.846] AMD-Vi: Setup I/O page table: device id = 0x900, type = 0x1, root table = 0x4aa847000, domain = 1, paging mode = 3
>	(XEN) [2019-10-10 09:54:14.846] AMD-Vi: Re-assign 0000:09:00.0 from dom0 to dom1
>	...
>	(XEN) [2019-10-10 09:54:22.298] AMD-Vi: Disable: device id = 0x907, domain = 0, paging mode = 3
>	(XEN) [2019-10-10 09:54:22.298] AMD-Vi: Setup I/O page table: device id = 0x907, type = 0x1, root table = 0x4aa847000, domain = 1, paging mode = 3
>	(XEN) [2019-10-10 09:54:22.298] AMD-Vi: Re-assign 0000:09:00.7 from dom0 to dom1
>
>
>xen-4.12.0 tear-down:
>	(XEN) [2019-10-10 10:01:11.971] AMD-Vi: Disable: device id = 0x900, domain = 1, paging mode = 3
>	(XEN) [2019-10-10 10:01:11.971] AMD-Vi: Setup I/O page table: device id = 0x900, type = 0x1, root table = 0x53572c000, domain = 0, paging mode = 3
>	(XEN) [2019-10-10 10:01:11.971] AMD-Vi: Re-assign 0000:09:00.0 from dom1 to dom0
>	...
>	(XEN) [2019-10-10 10:01:19.597] AMD-Vi: Disable: device id = 0x907, domain = 1, paging mode = 3
>	(XEN) [2019-10-10 10:01:19.597] AMD-Vi: Setup I/O page table: device id = 0x907, type = 0x1, root table = 0x53572c000, domain = 0, paging mode = 3
>	(XEN) [2019-10-10 10:01:19.597] AMD-Vi: Re-assign 0000:09:00.7 from dom1 to dom0
>
>xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d setup:
>	(XEN) [2019-10-10 10:21:38.549] d1: bind: m_gsi=47 g_gsi=36 dev=00.00.5 intx=0
>	(XEN) [2019-10-10 10:21:38.621] AMD-Vi: Disable: device id = 0x900, domain = 0, paging mode = 3
>	(XEN) [2019-10-10 10:21:38.621] AMD-Vi: Setup I/O page table: device id = 0x900, type = 0x1, root table = 0x4aa83b000, domain = 1, paging mode = 3
>	(XEN) [2019-10-10 10:21:38.621] AMD-Vi: Re-assign 0000:09:00.0 from dom0 to dom1
>	...
>	(XEN) [2019-10-10 10:21:46.069] d1: bind: m_gsi=46 g_gsi=36 dev=00.01.4 intx=3
>	(XEN) [2019-10-10 10:21:46.089] AMD-Vi: Disable: device id = 0x907, domain = 0, paging mode = 3
>	(XEN) [2019-10-10 10:21:46.089] AMD-Vi: Setup I/O page table: device id = 0x907, type = 0x1, root table = 0x4aa83b000, domain = 1, paging mode = 3
>	(XEN) [2019-10-10 10:21:46.089] AMD-Vi: Re-assign 0000:09:00.7 from dom0 to dom1
>
>
>xen-unstable-ee7170822f1fc209f33feb47b268bab35541351d tear-down:
>	(XEN) [2019-10-10 10:23:53.167] AMD-Vi: Disable: device id = 0x900, domain = 1, paging mode = 3
>	(XEN) [2019-10-10 10:23:53.167] AMD-Vi: Setup I/O page table: device id = 0x900, type = 0x1, root table = 0x5240f8000, domain = 0, paging mode = 3
>	(XEN) [2019-10-10 10:23:53.167] AMD-Vi: Re-assign 0000:09:00.0 from dom1 to dom0
>	...
>	(XEN) [2019-10-10 10:24:43.406] AMD-Vi: Disable: device id = 0x907, domain = 1, paging mode = 3
>	(XEN) [2019-10-10 10:24:43.406] AMD-Vi: Setup I/O page table: device id = 0x907, type = 0x1, root table = 0x5240f8000, domain = 0, paging mode = 3
>	(XEN) [2019-10-10 10:24:43.406] AMD-Vi: Re-assign 0000:09:00.7 from dom1 to dom0

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

  reply	other threads:[~2019-10-14 15:00 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-30 17:23 [Xen-devel] [PATCH 0/2] libxl fixes with pci passthrough Anthony PERARD
2019-09-30 17:23 ` [Xen-devel] [PATCH 1/2] libxl_pci: Don't ignore PCI PT error at guest creation Anthony PERARD
2019-09-30 17:23 ` [Xen-devel] [PATCH 2/2] libxl_pci: Fix guest shutdown with PCI PT attached Anthony PERARD
2019-09-30 18:10   ` Sander Eikelenboom
2019-10-01 10:35   ` Anthony PERARD
2019-10-10 16:13     ` Sander Eikelenboom
2019-10-14 15:03       ` Chao Gao [this message]
2019-10-15 16:59         ` Sander Eikelenboom
2019-10-15 18:46           ` Sander Eikelenboom
2019-10-16  4:55           ` Chao Gao
2019-10-18 16:11         ` Anthony PERARD
2019-10-18 16:43           ` Sander Eikelenboom
2019-10-02 15:45 ` [Xen-devel] [PATCH 0/2] libxl fixes with pci passthrough Ian Jackson
2019-10-02 15:58   ` Jürgen Groß
2019-10-04 15:55     ` Ian Jackson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191014150341.GA12156@gao-cwp \
    --to=chao.gao@intel.com \
    --cc=anthony.perard@citrix.com \
    --cc=linux@eikelenboom.it \
    --cc=wl@xen.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).