From: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@epam.com>
To: "Jan Beulich" <jbeulich@suse.com>,
"Roger Pau Monné" <roger.pau@citrix.com>
Cc: "julien@xen.org" <julien@xen.org>,
"sstabellini@kernel.org" <sstabellini@kernel.org>,
Oleksandr Tyshchenko <Oleksandr_Tyshchenko@epam.com>,
Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>,
Artem Mygaiev <Artem_Mygaiev@epam.com>,
"andrew.cooper3@citrix.com" <andrew.cooper3@citrix.com>,
"george.dunlap@citrix.com" <george.dunlap@citrix.com>,
"paul@xen.org" <paul@xen.org>,
Bertrand Marquis <bertrand.marquis@arm.com>,
Rahul Singh <rahul.singh@arm.com>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
Oleksandr Andrushchenko <Oleksandr_Andrushchenko@epam.com>
Subject: Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
Date: Fri, 19 Nov 2021 12:34:48 +0000 [thread overview]
Message-ID: <fec02ac4-1ecf-18e1-7ed6-3b1094d60890@epam.com> (raw)
In-Reply-To: <7dd919b8-0e3d-30dc-302a-9964f9d94ad2@suse.com>
Hi, Roger, Jan!
On 18.11.21 17:53, Jan Beulich wrote:
> On 18.11.2021 16:46, Oleksandr Andrushchenko wrote:
>> On 18.11.21 17:41, Jan Beulich wrote:
>>> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 17:16, Jan Beulich wrote:
>>>>> For the moment I can't help thinking that draining would
>>>>> be preferable over canceling.
>>>> Given that cancellation is going to happen on error path or
>>>> on device de-assign/remove I think this can be acceptable.
>>>> Any reason why not?
>>> It would seem to me that the correctness of a draining approach is
>>> going to be easier to prove than that of a canceling one, where I
>>> expect races to be a bigger risk. Especially something that gets
>>> executed infrequently, if ever (error paths in particular), knowing
>>> things are well from testing isn't typically possible.
>> Could you please then give me a hint how to do that:
>> 1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
>> 2. We have de-assign/remove on vCPU1
>>
>> How do we drain that? Do you mean some atomic variable to be
>> used in vpci_process_pending to flag it is running and de-assign/remove
>> needs to wait and spinning checking that?
> First of all let's please keep remove and de-assign separate. I think we
> have largely reached agreement that remove may need handling differently,
> for being a Dom0-only operation.
>
> As to draining during de-assign: I did suggest before that removing the
> register handling hooks first would guarantee no new requests to appear.
> Then it should be merely a matter of using hypercall continuations until
> the respective domain has no pending requests anymore for the device in
> question. Some locking (or lock barrier) may of course be needed to
> make sure another CPU isn't just about to pend a new request.
>
> Jan
>
>
Too long, but please read.
The below is some simplified analysis of what is happening with
respect to deferred mapping. First we start by looking at which
hypercalls may run in parallel with vpci_process_pending and which
locks they hold:
1. do_physdev_op(PHYSDEVOP_pci_device_add): failure during PHYSDEVOP_pci_device_add
===============================================================================
pci_physdev_op: <- no hypercall_create_continuation
pci_add_device <- pcidevs_lock()
vpci_add_handlers
init_bars
cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
modify_bars <- if cmd & PCI_COMMAND_MEMORY
struct rangeset *mem = rangeset_new(NULL, NULL, 0);
if ( system_state < SYS_STATE_active ) <- Dom0 is being created
return apply_map(pdev->domain, pdev, mem, cmd);
defer_map(dev->domain, dev, mem, cmd, rom_only); < Dom0 is running
curr->vpci.pdev = pdev;
curr->vpci.mem = mem;
ret = iommu_add_device(pdev); <- FAIL
if ( ret )
vpci_remove_device
remove vPCI register handlers
xfree(pdev->vpci);
pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
scheduled but has yet to run
2. do_physdev_op(PHYSDEVOP_pci_device_remove)
===============================================================================
pci_physdev_op: <- no hypercall_create_continuation
pci_remove_device <- pcidevs_lock()
vpci_remove_device
pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
scheduled but has yet to run
3. iommu_do_pci_domctl(XEN_DOMCTL_assign_device)
===============================================================================
case XEN_DOMCTL_assign_device <- pcidevs_lock();
ret = assign_device(d, seg, bus, devfn, flags);
if ( ret == -ERESTART )
ret = hypercall_create_continuation(__HYPERVISOR_domctl, "h", u_domctl);
4. iommu_do_pci_domctl(XEN_DOMCTL_deassign_device) <- pcidevs_lock();
===============================================================================
case XEN_DOMCTL_deassign_device: <- no hypercall_create_continuation
ret = deassign_device(d, seg, bus, devfn);
5. vPCI MMIO trap for PCI_COMMAND
===============================================================================
vpci_mmio_{read|write}
vpci_ecam_{read|write}
vpci_{read|write} <- NO locking yet
pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
spin_lock(&pdev->vpci->lock);
cmd_write
modify_bars
defer_map
6. SoftIRQ processing
===============================================================================
hvm_do_resume
vcpu_ioreq_handle_completion
vpci_process_pending
if ( v->vpci.mem )
rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
if ( rc == -ERESTART )
return true; <- re-scheduling
=========================================================================
spin_lock(&v->vpci.pdev->vpci->lock); <- v->vpci.pdev->vpci can be NULL
=========================================================================
spin_unlock(&v->vpci.pdev->vpci->lock);
v->vpci.mem = NULL;
if ( rc ) <- rc is from rangeset_consume_ranges
vpci_remove_device <- this is a BUG, as vpci_process_pending may be
running on another vCPU at this point
So, from the above it is clearly seen that a mapping triggered by a
PCI_COMMAND write may be happening on another vCPU in parallel with
a hypercall.
Some analysis of the hypercalls with respect to which domains are eligible targets:
1. Dom0 (hardware domain) only: PHYSDEVOP_pci_device_add, PHYSDEVOP_pci_device_remove
2. Any domain: XEN_DOMCTL_assign_device, XEN_DOMCTL_deassign_device
Possible crash paths
===============================================================================
1. Failure in PHYSDEVOP_pci_device_add after defer_map may crash
vpci_process_pending because of pdev->vpci == NULL
2. vpci_process_pending on another vCPU may crash when running in
parallel with itself, because vpci_remove_device may be called
3. Crash in vpci_mmio_{read|write} after PHYSDEVOP_pci_device_remove:
vpci_remove_device makes pdev->vpci == NULL
4. Both XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device seem to be
unaffected.
Synchronization is needed between:
- vpci_remove_device
- vpci_process_pending
- vpci_mmio_{read|write}
Possible locking and other work needed:
=======================================
1. pcidevs_{lock|unlock} is too heavy and is per-host
2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
3. We may want a dedicated per-domain rw lock to be implemented:
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 28146ee404e6..ebf071893b21 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,8 @@ struct domain
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+    rwlock_t vpci_rwlock;
+    bool vpci_terminating; /* atomic? */
 #endif
then vpci_remove_device is a writer (cold path) and vpci_process_pending and
vpci_mmio_{read|write} are readers (hot path).
do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
to be implemented, so removal can be re-started if need be:
vpci_remove_device()
{
    d->vpci_terminating = true;
    remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
    if ( !write_trylock(&d->vpci_rwlock) )
        return -ERESTART;
    xfree(pdev->vpci);
    pdev->vpci = NULL;
    write_unlock(&d->vpci_rwlock);
    return 0;
}
Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
other operations which may require it, e.g. virtual bus topology can
use it when assigning vSBDF etc.
4. vpci_remove_device needs to be removed from vpci_process_pending;
instead, do nothing for Dom0 and crash DomUs:
if ( rc )
{
/*
* FIXME: in case of failure remove the device from the domain.
* Note that there might still be leftover mappings. While this is
* safe for Dom0, for DomUs the domain needs to be killed in order
* to avoid leaking stale p2m mappings on failure.
*/
if ( !is_hardware_domain(v->domain) )
    domain_crash(v->domain);
}
I do hope we can finally come up with some decision which I can implement...
Thank you,
Oleksandr