RE: [PATCH] PCI: hv: Move completion variable from stack to heap in hv_compose_msi_msg()

From: Long Li <longli@microsoft.com>
To: Michael Kelley <mikelley@microsoft.com>,
	"longli@linuxonhyperv.com" <longli@linuxonhyperv.com>,
	KY Srinivasan <kys@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Stephen Hemminger <sthemmin@microsoft.com>,
	Wei Liu <wei.liu@kernel.org>,
	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>,
	Rob Herring <robh@kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Andrea Parri <Andrea.Parri@microsoft.com>
Subject: RE: [PATCH] PCI: hv: Move completion variable from stack to heap in hv_compose_msi_msg()
Date: Tue, 1 Jun 2021 19:27:18 +0000	[thread overview]
Message-ID: <BY5PR21MB150673A34B431F9311E6FDC5CE3E9@BY5PR21MB1506.namprd21.prod.outlook.com> (raw)
In-Reply-To: <MWHPR21MB15931F1698FD128C76219F7DD7249@MWHPR21MB1593.namprd21.prod.outlook.com>

> Subject: RE: [PATCH] PCI: hv: Move completion variable from stack to heap in
> hv_compose_msi_msg()
> 
> From: longli@linuxonhyperv.com <longli@linuxonhyperv.com> Sent:
> Wednesday, May 12, 2021 1:07 AM
> >
> > hv_compose_msi_msg() may be called with interrupt disabled. It calls
> > wait_for_completion() in a loop and may exit the loop earlier if the
> > device is being ejected or it's hitting other errors. However the VSP
> > may send completion packet after the loop exit and the completion
> > variable is no longer valid on the stack. This results in a kernel oops.
> >
> > Fix this by relocating completion variable from stack to heap, and use
> > hbus to maintain a list of leftover completions for future cleanup if
> necessary.
> 
> Interesting problem.  I haven't reviewed the details of your implementation
> because I'd like to propose an alternate approach to solving the problem.
> 
> You have fixed the problem for hv_compose_msi_msg(), but it seems like the
> same problem could occur in other places in pci-hyperv.c where a VMbus
> request is sent, and waiting for the response could be aborted by the device
> being rescinded.

The problem in hv_compose_msi_msg() is different to other places, it's a bug in the PCI driver that it doesn't handle the case where the device is ejected (PCI_EJECT). After device is ejected, it's valid that VSP may still send back completion on a prior pending request.

On the other hand, if a device is rescinded, it's not possible to get a completion on this device afterwards. If we are still getting a completion, it's a bug in the VSP or it's from a malicious host.

I agree if the intent is to deal with a untrusted host, I can follow the same principle to add this support to all requests to VSP. But this is a different problem to what this patch intends to address. I can see they may share the same design principle and common code. My question on a untrusted host is: If a host is untrusted and is misbehaving on purpose, what's the point of keep the VM running and not crashing the PCI driver?

> 
> The current code (and with your patch) passes the guest memory address of
> the completion packet to Hyper-V as the requestID.  Hyper-V responds and
> passes back the requestID, whereupon hv_pci_onchannelcallback() treats it as
> the guest memory address of the completion packet.  This all assumes that
> Hyper-V is trusted and that it doesn't pass back a bogus value that will be
> treated as a guest memory address.  But Andrea Parri has been updating
> other VMbus drivers (like netvsc and storvsc) to *not* pass guest memory
> addresses as the requestID. The pci-hyperv.c driver has not been fixed in this
> regard, but I think this patch could take big step in that direction.
> 
> My alternate approach is as follows:
> 1.  For reach PCI VMbus channel, keep a 64-bit counter.  When a VMbus
> message is to be sent, increment the counter atomically, and send the next
> value as the
> requestID.   The counter will not wrap-around in any practical time period, so
> the requestIDs are essentially unique.  Or just read a clock value to get a
> unique requestID.
> 2.  Also keep a per-channel list of mappings from requestID to the guest
> memory address of the completion packet.  For PCI channels, there will be
> very few requests outstanding concurrently, so this can be a simple linked list,
> protected by a spin lock.
> 3. Before sending a new VMbus message that is expecting a response, add the
> mapping to the list.  The guest memory address can be for a stack local, like
> the current code.
> 4. When the sending function completes, either because the response was
> received, or because wait_for_response() aborted, remove the mapping from
> the linked list.
> 5. hv_pci_onchannelcallback() gets the requestID from Hyper-V and looks it
> up in the linked list.  If there's no match in the linked list, the completion
> response from Hyper-V is ignored.  It's either a late response or a completely
> bogus response from Hyper-V.  If there is a match, then the address of the
> completion packet is available and valid.  The completion function will need to
> run while the spin lock is held on the linked list, so that the completion packet
> address is ensured to remain valid while the completion function executes.
> 
> I don't think my proposed approach is any more complicated that what your
> patch does, and it is a step in the direction of fully hardening the pci-hyperv.c
> driver.
> 
> This approach is a bit different from netvsc and storvsc because those drivers
> must handle lots of in-flight requests, and searching a linked list in the
> onchannelcallback function would be too slow.  The overall idea is the same,
> but a different approach is used to generate requestIDs and to map between
> requestIDs and guest memory addresses.
> 
> Thoughts?
> 
> Michael
> 
> >
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> >  drivers/pci/controller/pci-hyperv.c | 97
> > +++++++++++++++++++----------
> >  1 file changed, 65 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > b/drivers/pci/controller/pci-hyperv.c
> > index 9499ae3275fe..29fe26e2193c 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -473,6 +473,9 @@ struct hv_pcibus_device {
> >  	struct msi_controller msi_chip;
> >  	struct irq_domain *irq_domain;
> >
> > +	struct list_head compose_msi_msg_ctxt_list;
> > +	spinlock_t compose_msi_msg_ctxt_list_lock;
> > +
> >  	spinlock_t retarget_msi_interrupt_lock;
> >
> >  	struct workqueue_struct *wq;
> > @@ -552,6 +555,17 @@ struct hv_pci_compl {
> >  	s32 completion_status;
> >  };
> >
> > +struct compose_comp_ctxt {
> > +	struct hv_pci_compl comp_pkt;
> > +	struct tran_int_desc int_desc;
> > +};
> > +
> > +struct compose_msi_msg_ctxt {
> > +	struct list_head list;
> > +	struct pci_packet pci_pkt;
> > +	struct compose_comp_ctxt comp;
> > +};
> > +
> >  static void hv_pci_onchannelcallback(void *context);
> >
> >  /**
> > @@ -1293,11 +1307,6 @@ static void hv_irq_unmask(struct irq_data *data)
> >  	pci_msi_unmask_irq(data);
> >  }
> >
> > -struct compose_comp_ctxt {
> > -	struct hv_pci_compl comp_pkt;
> > -	struct tran_int_desc int_desc;
> > -};
> > -
> >  static void hv_pci_compose_compl(void *context, struct pci_response
> *resp,
> >  				 int resp_packet_size)
> >  {
> > @@ -1373,16 +1382,12 @@ static void hv_compose_msi_msg(struct
> irq_data
> > *data, struct msi_msg *msg)
> >  	struct pci_bus *pbus;
> >  	struct pci_dev *pdev;
> >  	struct cpumask *dest;
> > -	struct compose_comp_ctxt comp;
> >  	struct tran_int_desc *int_desc;
> > -	struct {
> > -		struct pci_packet pci_pkt;
> > -		union {
> > -			struct pci_create_interrupt v1;
> > -			struct pci_create_interrupt2 v2;
> > -		} int_pkts;
> > -	} __packed ctxt;
> > -
> > +	struct compose_msi_msg_ctxt *ctxt;
> > +	union {
> > +		struct pci_create_interrupt v1;
> > +		struct pci_create_interrupt2 v2;
> > +	} int_pkts;
> >  	u32 size;
> >  	int ret;
> >
> > @@ -1402,18 +1407,24 @@ static void hv_compose_msi_msg(struct
> irq_data
> > *data, struct msi_msg *msg)
> >  		hv_int_desc_free(hpdev, int_desc);
> >  	}
> >
> > +	ctxt = kzalloc(sizeof(*ctxt), GFP_ATOMIC);
> > +	if (!ctxt)
> > +		goto drop_reference;
> > +
> >  	int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
> > -	if (!int_desc)
> > +	if (!int_desc) {
> > +		kfree(ctxt);
> >  		goto drop_reference;
> > +	}
> >
> > -	memset(&ctxt, 0, sizeof(ctxt));
> > -	init_completion(&comp.comp_pkt.host_event);
> > -	ctxt.pci_pkt.completion_func = hv_pci_compose_compl;
> > -	ctxt.pci_pkt.compl_ctxt = &comp;
> > +	memset(ctxt, 0, sizeof(*ctxt));
> > +	init_completion(&ctxt->comp.comp_pkt.host_event);
> > +	ctxt->pci_pkt.completion_func = hv_pci_compose_compl;
> > +	ctxt->pci_pkt.compl_ctxt = &ctxt->comp;
> >
> >  	switch (hbus->protocol_version) {
> >  	case PCI_PROTOCOL_VERSION_1_1:
> > -		size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
> > +		size = hv_compose_msi_req_v1(&int_pkts.v1,
> >  					dest,
> >  					hpdev->desc.win_slot.slot,
> >  					cfg->vector);
> > @@ -1421,7 +1432,7 @@ static void hv_compose_msi_msg(struct irq_data
> > *data, struct msi_msg *msg)
> >
> >  	case PCI_PROTOCOL_VERSION_1_2:
> >  	case PCI_PROTOCOL_VERSION_1_3:
> > -		size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
> > +		size = hv_compose_msi_req_v2(&int_pkts.v2,
> >  					dest,
> >  					hpdev->desc.win_slot.slot,
> >  					cfg->vector);
> > @@ -1434,17 +1445,18 @@ static void hv_compose_msi_msg(struct
> irq_data
> > *data, struct msi_msg *msg)
> >  		 */
> >  		dev_err(&hbus->hdev->device,
> >  			"Unexpected vPCI protocol, update driver.");
> > +		kfree(ctxt);
> >  		goto free_int_desc;
> >  	}
> >
> > -	ret = vmbus_sendpacket(hpdev->hbus->hdev->channel,
> &ctxt.int_pkts,
> > -			       size, (unsigned long)&ctxt.pci_pkt,
> > +	ret = vmbus_sendpacket(hpdev->hbus->hdev->channel, &int_pkts,
> > +			       size, (unsigned long)&ctxt->pci_pkt,
> >  			       VM_PKT_DATA_INBAND,
> >
> VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
> >  	if (ret) {
> >  		dev_err(&hbus->hdev->device,
> > -			"Sending request for interrupt failed: 0x%x",
> > -			comp.comp_pkt.completion_status);
> > +			"Sending request for interrupt failed: 0x%x", ret);
> > +		kfree(ctxt);
> >  		goto free_int_desc;
> >  	}
> >
> > @@ -1458,7 +1470,7 @@ static void hv_compose_msi_msg(struct irq_data
> > *data, struct msi_msg *msg)
> >  	 * Since this function is called with IRQ locks held, can't
> >  	 * do normal wait for completion; instead poll.
> >  	 */
> > -	while (!try_wait_for_completion(&comp.comp_pkt.host_event)) {
> > +	while (!try_wait_for_completion(&ctxt->comp.comp_pkt.host_event))
> {
> >  		unsigned long flags;
> >
> >  		/* 0xFFFF means an invalid PCI VENDOR ID. */ @@ -1494,10
> +1506,11
> > @@ static void hv_compose_msi_msg(struct irq_data *data, struct
> > msi_msg *msg)
> >
> >  	tasklet_enable(&channel->callback_event);
> >
> > -	if (comp.comp_pkt.completion_status < 0) {
> > +	if (ctxt->comp.comp_pkt.completion_status < 0) {
> >  		dev_err(&hbus->hdev->device,
> >  			"Request for interrupt failed: 0x%x",
> > -			comp.comp_pkt.completion_status);
> > +			ctxt->comp.comp_pkt.completion_status);
> > +		kfree(ctxt);
> >  		goto free_int_desc;
> >  	}
> >
> > @@ -1506,23 +1519,36 @@ static void hv_compose_msi_msg(struct
> irq_data
> > *data, struct msi_msg *msg)
> >  	 * irq_set_chip_data() here would be appropriate, but the lock it
> takes
> >  	 * is already held.
> >  	 */
> > -	*int_desc = comp.int_desc;
> > +	*int_desc = ctxt->comp.int_desc;
> >  	data->chip_data = int_desc;
> >
> >  	/* Pass up the result. */
> > -	msg->address_hi = comp.int_desc.address >> 32;
> > -	msg->address_lo = comp.int_desc.address & 0xffffffff;
> > -	msg->data = comp.int_desc.data;
> > +	msg->address_hi = ctxt->comp.int_desc.address >> 32;
> > +	msg->address_lo = ctxt->comp.int_desc.address & 0xffffffff;
> > +	msg->data = ctxt->comp.int_desc.data;
> >
> >  	put_pcichild(hpdev);
> > +	kfree(ctxt);
> >  	return;
> >
> >  enable_tasklet:
> >  	tasklet_enable(&channel->callback_event);
> > +
> > +	/*
> > +	 * Move uncompleted context to the leftover list.
> > +	 * The host may send completion at a later time, and we ignore this
> > +	 * completion but keep the memory reference valid.
> > +	 */
> > +	spin_lock(&hbus->compose_msi_msg_ctxt_list_lock);
> > +	list_add_tail(&ctxt->list, &hbus->compose_msi_msg_ctxt_list);
> > +	spin_unlock(&hbus->compose_msi_msg_ctxt_list_lock);
> > +
> >  free_int_desc:
> >  	kfree(int_desc);
> > +
> >  drop_reference:
> >  	put_pcichild(hpdev);
> > +
> >  return_null_message:
> >  	msg->address_hi = 0;
> >  	msg->address_lo = 0;
> > @@ -3076,9 +3102,11 @@ static int hv_pci_probe(struct hv_device *hdev,
> >  	INIT_LIST_HEAD(&hbus->children);
> >  	INIT_LIST_HEAD(&hbus->dr_list);
> >  	INIT_LIST_HEAD(&hbus->resources_for_children);
> > +	INIT_LIST_HEAD(&hbus->compose_msi_msg_ctxt_list);
> >  	spin_lock_init(&hbus->config_lock);
> >  	spin_lock_init(&hbus->device_list_lock);
> >  	spin_lock_init(&hbus->retarget_msi_interrupt_lock);
> > +	spin_lock_init(&hbus->compose_msi_msg_ctxt_list_lock);
> >  	hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
> >  					   hbus->sysdata.domain);
> >  	if (!hbus->wq) {
> > @@ -3282,6 +3310,7 @@ static int hv_pci_bus_exit(struct hv_device
> > *hdev, bool
> > keep_devs)
> >  static int hv_pci_remove(struct hv_device *hdev)  {
> >  	struct hv_pcibus_device *hbus;
> > +	struct compose_msi_msg_ctxt *ctxt, *tmp;
> >  	int ret;
> >
> >  	hbus = hv_get_drvdata(hdev);
> > @@ -3318,6 +3347,10 @@ static int hv_pci_remove(struct hv_device
> > *hdev)
> >
> >  	hv_put_dom_num(hbus->sysdata.domain);
> >
> > +	list_for_each_entry_safe(ctxt, tmp, &hbus-
> >compose_msi_msg_ctxt_list, list) {
> > +		list_del(&ctxt->list);
> > +		kfree(ctxt);
> > +	}
> >  	kfree(hbus);
> >  	return ret;
> >  }
> > --
> > 2.27.0