From: Long Li <longli@microsoft.com>
To: Michael Kelley <mikelley@microsoft.com>,
"longli@linuxonhyperv.com" <longli@linuxonhyperv.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>,
KY Srinivasan <kys@microsoft.com>,
Haiyang Zhang <haiyangz@microsoft.com>,
Stephen Hemminger <sthemmin@microsoft.com>,
Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Bjorn Andersson <bjorn.andersson@linaro.org>,
Hans de Goede <hdegoede@redhat.com>,
"Williams, Dan J" <dan.j.williams@intel.com>,
Maximilian Luz <luzmaximilian@gmail.com>,
Mike Rapoport <rppt@kernel.org>,
Ben Widawsky <ben.widawsky@intel.com>,
Jiri Slaby <jirislaby@kernel.org>,
Andra Paraschiv <andraprs@amazon.com>,
Siddharth Gupta <sidgup@codeaurora.org>,
Hannes Reinecke <hare@suse.de>,
"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
Subject: RE: [Patch v2 2/3] Drivers: hv: add Azure Blob driver
Date: Fri, 9 Jul 2021 19:44:08 +0000 [thread overview]
Message-ID: <BY5PR21MB15060921F762175FB4E85F26CE189@BY5PR21MB1506.namprd21.prod.outlook.com> (raw)
In-Reply-To: <MWHPR21MB1593F080DF28E10FF6E3A529D7189@MWHPR21MB1593.namprd21.prod.outlook.com>
> Subject: RE: [Patch v2 2/3] Drivers: hv: add Azure Blob driver
>
> From: Long Li <longli@microsoft.com> Sent: Friday, July 2, 2021 4:59 PM
> > > PM
> > >
> > > [snip]
> > >
> > > > > > +static void az_blob_remove_device(struct az_blob_device *dev) {
> > > > > > + wait_event(dev->file_wait, list_empty(&dev->file_list));
> > > > > > + misc_deregister(&az_blob_misc_device);
> > > > > > +#ifdef CONFIG_DEBUG_FS
> > > > > > + debugfs_remove_recursive(az_blob_debugfs_root);
> > > > > > +#endif
> > > > > > + /* At this point, we won't get any requests from user-mode
> > > > > > +*/ }
> > > > > > +
> > > > > > +static int az_blob_create_device(struct az_blob_device *dev) {
> > > > > > + int rc;
> > > > > > + struct dentry *d;
> > > > > > +
> > > > > > + rc = misc_register(&az_blob_misc_device);
> > > > > > + if (rc) {
> > > > > > + az_blob_err("misc_register failed rc %d\n", rc);
> > > > > > + return rc;
> > > > > > + }
> > > > > > +
> > > > > > +#ifdef CONFIG_DEBUG_FS
> > > > > > + az_blob_debugfs_root = debugfs_create_dir("az_blob",
> NULL);
> > > > > > + if (!IS_ERR_OR_NULL(az_blob_debugfs_root)) {
> > > > > > + d = debugfs_create_file("pending_requests", 0400,
> > > > > > + az_blob_debugfs_root, NULL,
> > > > > > + &az_blob_debugfs_fops);
> > > > > > + if (IS_ERR_OR_NULL(d)) {
> > > > > > + az_blob_warn("failed to create debugfs
> file\n");
> > > > > > +
> debugfs_remove_recursive(az_blob_debugfs_root);
> > > > > > + az_blob_debugfs_root = NULL;
> > > > > > + }
> > > > > > + } else
> > > > > > + az_blob_warn("failed to create debugfs root\n");
> #endif
> > > > > > +
> > > > > > + return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int az_blob_connect_to_vsp(struct hv_device *device,
> > > > > > +u32
> > > > > > +ring_size) {
> > > > > > + int ret;
> > > > > > +
> > > > > > + spin_lock_init(&az_blob_dev.file_lock);
> > > > >
> > > > > I'd argue that the spin lock should not be re-initialized here.
> > > > > Here's the sequence where things go wrong:
> > > > >
> > > > > 1) The driver is unbound, so az_blob_remove() is called.
> > > > > 2) az_blob_remove() sets the "removing" flag to true, and calls
> > > > > az_blob_remove_device().
> > > > > 3) az_blob_remove_device() waits for the file_list to become empty.
> > > > > 4) After the file_list becomes empty, but before
> > > > > misc_deregister() is called, a separate thread opens the device again.
> > > > > 5) In the separate thread, az_blob_fop_open() obtains the
> > > > > file_lock spin
> > > lock.
> > > > > 6) Before az_blob_fop_open() releases the spin lock,
> > > > > az_blob_remove_device() completes, along with az_blob_remove().
> > > > > 7) Then the device gets rebound, and az_blob_connect_to_vsp()
> > > > > gets called, all while az_blob_fop_open() still holds the spin
> > > > > lock. So the spin lock get re- initialized while it is held.
> > > > >
> > > > > This is admittedly a far-fetched scenario, but stranger things
> > > > > have happened. :-) The issue is that you are counting on the
> > > > > az_blob_dev structure to persist and have a valid file_lock,
> > > > > even while the device is unbound. So any initialization should
> > > > > only happen in
> > > az_blob_drv_init().
> > > >
> > > > I'm not sure if az_blob_probe() and az_blob_remove() can be called
> > > > at the same time, as az_blob_remove_vmbus() is called the last in
> > > az_blob_remove().
> > > > Is it possible for vmbus asking the driver to probe a new channel
> > > > before the old channel is closed? I expect the vmbus provide
> > > > guarantee that those calls are made in sequence.
> > >
> > > In my scenario above, az_blob_remove_vmbus() and az_blob_remove()
> > > run to completion in Step #6, all while some other thread is still
> > > in the middle of an
> > > open() call and holding the file_lock spin lock. Then in Step #7
> > > az_blob_probe() runs. So az_blob_remove() and az_blob_probe()
> > > execute sequentially, not at the same time.
> > >
> > > Michael
> >
> > I think it's a valid scenario. The return value of
> > devtmpfs_delete_node() is not checked in misc_deregister(). It decreases
> the refcount on inodes but it's not guaranteed that someone else is still using
> it (in the middle of opening a file).
> >
> > However, this works fine for "rmmod" that causes device to be removed.
> > Before file is opened the refcount on the module is increased so it can't be
> removed when file is being opened. The scenario you described can't
> happen.
> >
> > But during VMBUS rescind, it can happen. It's possible that the driver
> > is using the spinlock that has been re-initialized, when the next VMBUS
> offer on the same channel comes before all the attempting open file calls
> exit.
>
> I was thinking about the rescind scenario. vmbus_onoffer_rescind() will run
> on the global workqueue. If it eventually calls az_blob_remove() and then
> az_blob_remove_device(), it will wait until the file_list is empty, which
> essentially means waiting until user space processes decide to close the
> instances they have open. This seems like a problem that could block the
> global workqueue for a long time and
> thereby hang the kernel. Is my reasoning valid? If so, I haven't
> thought about what the solution might be. It seems like we do need to wait
> until any in-progress requests to Hyper-V are complete because Hyper-V has
> references to guest physical memory. But waiting for all open instances to
> be closed seems to be problematic.
My tests showed that misc_deregister() caused all opened files to be released if there are no pending I/O waiting in the driver.
If there are pending I/O, we must wait as the VSP owns the memory of the I/O. The correct VSP behavior is to return all the pending I/O along with rescind. This is the same to what storvsc does for rescind.
It looks to me waiting for opened files after the call to misc_deregister(), but before removing the vmbus channel is a safe approach.
If the VSP is behaving correctly, the rescind process should not block for too long. If we want to deal with a buggy VSP that takes forever to release a resource, we want to create a work queue for rescind handling.
>
> Michael
>
> >
> > This is a very rare. I agree things happen that we should make sure the
> driver can handle this. I'll update the driver.
> >
> > Long
> >
> > >
> > > >
> > > > >
> > > > > > + INIT_LIST_HEAD(&az_blob_dev.file_list);
> > > > > > + init_waitqueue_head(&az_blob_dev.file_wait);
> > > > > > +
> > > > > > + az_blob_dev.removing = false;
> > > > > > +
> > > > > > + az_blob_dev.device = device;
> > > > > > + device->channel->rqstor_size = device_queue_depth;
> > > > > > +
> > > > > > + ret = vmbus_open(device->channel, ring_size, ring_size,
> NULL, 0,
> > > > > > + az_blob_on_channel_callback, device-
> >channel);
> > > > > > +
> > > > > > + if (ret) {
> > > > > > + az_blob_err("failed to connect to VSP ret %d\n", ret);
> > > > > > + return ret;
> > > > > > + }
> > > > > > +
> > > > > > + hv_set_drvdata(device, &az_blob_dev);
> > > > > > +
> > > > > > + return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static void az_blob_remove_vmbus(struct hv_device *device) {
> > > > > > + /* At this point, no VSC/VSP traffic is possible over vmbus */
> > > > > > + hv_set_drvdata(device, NULL);
> > > > > > + vmbus_close(device->channel); }
> > > > > > +
> > > > > > +static int az_blob_probe(struct hv_device *device,
> > > > > > + const struct hv_vmbus_device_id *dev_id) {
> > > > > > + int rc;
> > > > > > +
> > > > > > + az_blob_dbg("probing device\n");
> > > > > > +
> > > > > > + rc = az_blob_connect_to_vsp(device,
> az_blob_ringbuffer_size);
> > > > > > + if (rc) {
> > > > > > + az_blob_err("error connecting to VSP rc %d\n", rc);
> > > > > > + return rc;
> > > > > > + }
> > > > > > +
> > > > > > + // create user-mode client library facing device
> > > > > > + rc = az_blob_create_device(&az_blob_dev);
> > > > > > + if (rc) {
> > > > > > + az_blob_remove_vmbus(device);
> > > > > > + return rc;
> > > > > > + }
> > > > > > +
> > > > > > + az_blob_dbg("successfully probed device\n");
> > > > > > + return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int az_blob_remove(struct hv_device *dev) {
> > > > > > + struct az_blob_device *device = hv_get_drvdata(dev);
> > > > > > + unsigned long flags;
> > > > > > +
> > > > > > + spin_lock_irqsave(&device->file_lock, flags);
> > > > > > + device->removing = true;
> > > > > > + spin_unlock_irqrestore(&device->file_lock, flags);
> > > > > > +
> > > > > > + az_blob_remove_device(device);
> > > > > > + az_blob_remove_vmbus(dev);
> > > > > > + return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static struct hv_driver az_blob_drv = {
> > > > > > + .name = KBUILD_MODNAME,
> > > > > > + .id_table = id_table,
> > > > > > + .probe = az_blob_probe,
> > > > > > + .remove = az_blob_remove,
> > > > > > + .driver = {
> > > > > > + .probe_type = PROBE_PREFER_ASYNCHRONOUS,
> > > > > > + },
> > > > > > +};
> > > > > > +
> > > > > > +static int __init az_blob_drv_init(void) {
> > > > > > + int ret;
> > > > > > +
> > > > > > + ret = vmbus_driver_register(&az_blob_drv);
> > > > > > + return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static void __exit az_blob_drv_exit(void) {
> > > > > > + vmbus_driver_unregister(&az_blob_drv);
> > > > > > +}
next prev parent reply other threads:[~2021-07-09 19:44 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-26 6:30 [Patch v2 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob longli
2021-06-26 6:30 ` [Patch v2 1/3] Drivers: hv: vmbus: add support to ignore certain PCIE devices longli
2021-06-28 20:47 ` Michael Kelley
2021-06-26 6:30 ` [Patch v2 2/3] Drivers: hv: add Azure Blob driver longli
2021-06-26 7:18 ` kernel test robot
2021-06-28 17:12 ` Enrico Weigelt, metux IT consult
2021-06-28 17:39 ` Wei Liu
2021-06-28 20:50 ` Michael Kelley
2021-07-01 6:51 ` Long Li
2021-07-01 16:02 ` Michael Kelley
2021-07-02 23:59 ` Long Li
2021-07-03 14:37 ` Michael Kelley
2021-07-09 2:09 ` Michael Kelley
2021-07-09 19:44 ` Long Li [this message]
2021-07-12 3:58 ` Michael Kelley
2021-06-29 6:24 ` Greg Kroah-Hartman
2021-06-29 10:41 ` Enrico Weigelt, metux IT consult
2021-06-29 12:18 ` Greg Kroah-Hartman
2021-06-30 15:29 ` Enrico Weigelt, metux IT consult
2021-07-01 6:58 ` Long Li
2021-07-01 9:36 ` Greg Kroah-Hartman
2021-07-01 17:24 ` Long Li
2021-06-29 6:25 ` Greg Kroah-Hartman
2021-07-01 6:59 ` Long Li
2021-06-29 8:43 ` Jiri Slaby
2021-07-01 7:09 ` Long Li
2021-07-01 15:16 ` Joe Perches
2021-07-01 22:41 ` Long Li
2021-06-26 6:30 ` [Patch v2 3/3] Drivers: hv: Add to maintainer for " longli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=BY5PR21MB15060921F762175FB4E85F26CE189@BY5PR21MB1506.namprd21.prod.outlook.com \
--to=longli@microsoft.com \
--cc=andraprs@amazon.com \
--cc=ben.widawsky@intel.com \
--cc=bjorn.andersson@linaro.org \
--cc=corbet@lwn.net \
--cc=dan.j.williams@intel.com \
--cc=decui@microsoft.com \
--cc=gregkh@linuxfoundation.org \
--cc=haiyangz@microsoft.com \
--cc=hare@suse.de \
--cc=hdegoede@redhat.com \
--cc=jirislaby@kernel.org \
--cc=kys@microsoft.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-hyperv@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=longli@linuxonhyperv.com \
--cc=luzmaximilian@gmail.com \
--cc=mikelley@microsoft.com \
--cc=rppt@kernel.org \
--cc=sidgup@codeaurora.org \
--cc=sthemmin@microsoft.com \
--cc=wei.liu@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).