linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv1 0/7] vfio/mdev: Improve vfio/mdev core module
@ 2019-03-27  3:45 Parav Pandit
  2019-03-27  3:45 ` [PATCHv1 1/7] vfio/mdev: Avoid release parent reference during error path Parav Pandit
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: Parav Pandit @ 2019-03-27  3:45 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: cjia, parav

As we would like to use mdev subsystem for wider use case as
discussed in [1], [2] apart from an offline discussion.
This use case is also discussed with wider forum in [4] in track
'Lightweight NIC HW functions for container offload use cases'.

This series is prep-work and improves vfio/mdev module in following ways.

Patch-1 Fixes releasing parent dev reference during error unwinding
mdev parent registration.
Patch-2 Simplifies mdev device for unused kref.
Patch-3 Drops redundant extern prefix of exported symbols.
Patch-4 Returns right error code from vendor driver.
Patch-5 Fixes to use right sysfs remove sequence.
Patch-6 Fixes removing all child devices if one of them fails.
Patch-7 Fixes conditions with mdev device life cycle APIs

This series is tested using
(a) mtty with VM using vfio_mdev driver for positive tests.
(b) mtty with vfio_mdev with error race condition cases of create,
remove and unloading of mtty driver.
(c) mlx5 core driver using RFC patches [3] and internal patches.
Internal patches are large and cannot be combined with this
prep-work patches. It will posted once prep-work completes.

[1] https://www.spinics.net/lists/netdev/msg556978.html
[2] https://lkml.org/lkml/2019/3/7/696
[3] https://lkml.org/lkml/2019/3/8/819
[4] https://netdevconf.org/0x13/session.html?workshop-hardware-offload

---
Changelog:
---

v0->v1:
 - Dropped device placement on bus sequence patch for this series
 - Addressed below comments from Alex, Kirti, Maxim.
 - Added Review-by tag for already reviewed patches.
 - Dropped incorrect patch of put_device().
 - Corrected Fixes commit tag for sysfs remove sequence fix
 - Split last 8th patch to smaller refactor and fixes patch
 - Following coding style commenting format
 - Fixed accidental delete of mutex_lock in mdev_unregister_device
 - Renamed remove helped to mdev_device_remove_common().
 - Rebased for uuid/guid change


Parav Pandit (7):
  vfio/mdev: Avoid release parent reference during error path
  vfio/mdev: Removed unused kref
  vfio/mdev: Drop redundant extern for exported symbols
  vfio/mdev: Avoid masking error code to EBUSY
  vfio/mdev: Follow correct remove sequence
  vfio/mdev: Fix aborting mdev child device removal if one fails
  vfio/mdev: Fix race conditions with mdev device life cycle APIs

 drivers/vfio/mdev/mdev_core.c    | 102 ++++++++++++++++++++++++++++-----------
 drivers/vfio/mdev/mdev_private.h |   7 ++-
 drivers/vfio/mdev/mdev_sysfs.c   |   2 +-
 include/linux/mdev.h             |  21 ++++----
 4 files changed, 91 insertions(+), 41 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCHv1 1/7] vfio/mdev: Avoid release parent reference during error path
  2019-03-27  3:45 [PATCHv1 0/7] vfio/mdev: Improve vfio/mdev core module Parav Pandit
@ 2019-03-27  3:45 ` Parav Pandit
  2019-04-01 16:58   ` Cornelia Huck
  2019-03-27  3:45 ` [PATCHv1 2/7] vfio/mdev: Removed unused kref Parav Pandit
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-03-27  3:45 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: cjia, parav

During mdev parent registration in mdev_register_device(),
if parent device is duplicate, it releases the reference of existing
parent device.
This is incorrect. Existing parent device should not be touched.

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Reviewed By: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index b96fedc..1299d2e 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -181,6 +181,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	/* Check for duplicate */
 	parent = __find_parent_device(dev);
 	if (parent) {
+		parent = NULL;
 		ret = -EEXIST;
 		goto add_dev_err;
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv1 2/7] vfio/mdev: Removed unused kref
  2019-03-27  3:45 [PATCHv1 0/7] vfio/mdev: Improve vfio/mdev core module Parav Pandit
  2019-03-27  3:45 ` [PATCHv1 1/7] vfio/mdev: Avoid release parent reference during error path Parav Pandit
@ 2019-03-27  3:45 ` Parav Pandit
  2019-04-01 17:01   ` Cornelia Huck
  2019-03-27  3:45 ` [PATCHv1 3/7] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-03-27  3:45 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: cjia, parav

Remove unused kref from the mdev_device structure.

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Reviewed By: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c    | 1 -
 drivers/vfio/mdev/mdev_private.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 1299d2e..00ca613 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -311,7 +311,6 @@ int mdev_device_create(struct kobject *kobj,
 	mutex_unlock(&mdev_list_lock);
 
 	mdev->parent = parent;
-	kref_init(&mdev->ref);
 
 	mdev->dev.parent  = dev;
 	mdev->dev.bus     = &mdev_bus_type;
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 379758c..ddcf9c7 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -30,7 +30,6 @@ struct mdev_device {
 	struct mdev_parent *parent;
 	guid_t uuid;
 	void *driver_data;
-	struct kref ref;
 	struct list_head next;
 	struct kobject *type_kobj;
 	bool active;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv1 3/7] vfio/mdev: Drop redundant extern for exported symbols
  2019-03-27  3:45 [PATCHv1 0/7] vfio/mdev: Improve vfio/mdev core module Parav Pandit
  2019-03-27  3:45 ` [PATCHv1 1/7] vfio/mdev: Avoid release parent reference during error path Parav Pandit
  2019-03-27  3:45 ` [PATCHv1 2/7] vfio/mdev: Removed unused kref Parav Pandit
@ 2019-03-27  3:45 ` Parav Pandit
  2019-04-01 17:02   ` Cornelia Huck
  2019-03-27  3:45 ` [PATCHv1 4/7] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-03-27  3:45 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: cjia, parav

There is no need use 'extern' for exported functions.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 include/linux/mdev.h | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index d7aee90..4924d80 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -118,21 +118,20 @@ struct mdev_driver {
 
 #define to_mdev_driver(drv)	container_of(drv, struct mdev_driver, driver)
 
-extern void *mdev_get_drvdata(struct mdev_device *mdev);
-extern void mdev_set_drvdata(struct mdev_device *mdev, void *data);
-extern const guid_t *mdev_uuid(struct mdev_device *mdev);
+void *mdev_get_drvdata(struct mdev_device *mdev);
+void mdev_set_drvdata(struct mdev_device *mdev, void *data);
+const guid_t *mdev_uuid(struct mdev_device *mdev);
 
 extern struct bus_type mdev_bus_type;
 
-extern int  mdev_register_device(struct device *dev,
-				 const struct mdev_parent_ops *ops);
-extern void mdev_unregister_device(struct device *dev);
+int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops);
+void mdev_unregister_device(struct device *dev);
 
-extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
-extern void mdev_unregister_driver(struct mdev_driver *drv);
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+void mdev_unregister_driver(struct mdev_driver *drv);
 
-extern struct device *mdev_parent_dev(struct mdev_device *mdev);
-extern struct device *mdev_dev(struct mdev_device *mdev);
-extern struct mdev_device *mdev_from_dev(struct device *dev);
+struct device *mdev_parent_dev(struct mdev_device *mdev);
+struct device *mdev_dev(struct mdev_device *mdev);
+struct mdev_device *mdev_from_dev(struct device *dev);
 
 #endif /* MDEV_H */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv1 4/7] vfio/mdev: Avoid masking error code to EBUSY
  2019-03-27  3:45 [PATCHv1 0/7] vfio/mdev: Improve vfio/mdev core module Parav Pandit
                   ` (2 preceding siblings ...)
  2019-03-27  3:45 ` [PATCHv1 3/7] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
@ 2019-03-27  3:45 ` Parav Pandit
  2019-04-01 17:21   ` Cornelia Huck
  2019-03-27  3:45 ` [PATCHv1 5/7] vfio/mdev: Follow correct remove sequence Parav Pandit
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-03-27  3:45 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: cjia, parav

Instead of masking return error to -EBUSY, return actual error
returned by the driver.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 00ca613..836d319 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -141,7 +141,7 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
 	 */
 	ret = parent->ops->remove(mdev);
 	if (ret && !force_remove)
-		return -EBUSY;
+		return ret;
 
 	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
 	return 0;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv1 5/7] vfio/mdev: Follow correct remove sequence
  2019-03-27  3:45 [PATCHv1 0/7] vfio/mdev: Improve vfio/mdev core module Parav Pandit
                   ` (3 preceding siblings ...)
  2019-03-27  3:45 ` [PATCHv1 4/7] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
@ 2019-03-27  3:45 ` Parav Pandit
  2019-04-01 17:24   ` Cornelia Huck
  2019-03-27  3:45 ` [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
  2019-03-27  3:45 ` [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs Parav Pandit
  6 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-03-27  3:45 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: cjia, parav

mdev_remove_sysfs_files() should follow exact mirror sequence of a
create, similar to what is followed in error unwinding path of
mdev_create_sysfs_files().

Fixes: 6a62c1dfb5c7 ("vfio/mdev: Re-order sysfs attribute creation")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 5193a0e..cbf94b8 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -280,7 +280,7 @@ int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
 
 void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
 {
+	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
 	sysfs_remove_link(&dev->kobj, "mdev_type");
 	sysfs_remove_link(type->devices_kobj, dev_name(dev));
-	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-03-27  3:45 [PATCHv1 0/7] vfio/mdev: Improve vfio/mdev core module Parav Pandit
                   ` (4 preceding siblings ...)
  2019-03-27  3:45 ` [PATCHv1 5/7] vfio/mdev: Follow correct remove sequence Parav Pandit
@ 2019-03-27  3:45 ` Parav Pandit
  2019-04-01 17:39   ` Cornelia Huck
  2019-03-27  3:45 ` [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs Parav Pandit
  6 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-03-27  3:45 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: cjia, parav

device_for_each_child() stops executing callback function for remaining
child devices, if callback hits an error.
Each child mdev device is independent of each other.
While unregistering parent device, mdev core must remove all child mdev
devices.
Therefore, mdev_device_remove_cb() always returns success so that
device_for_each_child doesn't abort if one child removal hits error.

While at it, improve remove and unregister functions for below simplicity.

There isn't need to pass forced flag pointer during mdev parent
removal which invokes mdev_device_remove(). So simplify the flow.

mdev_device_remove() is called from two paths.
1. mdev_unregister_driver()
     mdev_device_remove_cb()
       mdev_device_remove()
2. remove_store()
     mdev_device_remove()

When device is removed by user using remote_store(), device under
removal is mdev device.
When device is removed during parent device removal using generic child
iterator, mdev check is already done using dev_is_mdev().

Hence, remove the unnecessary loop in mdev_device_remove().

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 836d319..aefcf34 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -149,10 +149,10 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
 
 static int mdev_device_remove_cb(struct device *dev, void *data)
 {
-	if (!dev_is_mdev(dev))
-		return 0;
+	if (dev_is_mdev(dev))
+		mdev_device_remove(dev, true);
 
-	return mdev_device_remove(dev, data ? *(bool *)data : true);
+	return 0;
 }
 
 /*
@@ -240,7 +240,6 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 void mdev_unregister_device(struct device *dev)
 {
 	struct mdev_parent *parent;
-	bool force_remove = true;
 
 	mutex_lock(&parent_list_lock);
 	parent = __find_parent_device(dev);
@@ -254,8 +253,7 @@ void mdev_unregister_device(struct device *dev)
 	list_del(&parent->next);
 	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
 
-	device_for_each_child(dev, (void *)&force_remove,
-			      mdev_device_remove_cb);
+	device_for_each_child(dev, NULL, mdev_device_remove_cb);
 
 	parent_remove_sysfs_files(parent);
 
@@ -348,24 +346,13 @@ int mdev_device_create(struct kobject *kobj,
 
 int mdev_device_remove(struct device *dev, bool force_remove)
 {
-	struct mdev_device *mdev, *tmp;
+	struct mdev_device *mdev;
 	struct mdev_parent *parent;
 	struct mdev_type *type;
 	int ret;
 
 	mdev = to_mdev_device(dev);
-
 	mutex_lock(&mdev_list_lock);
-	list_for_each_entry(tmp, &mdev_list, next) {
-		if (tmp == mdev)
-			break;
-	}
-
-	if (tmp != mdev) {
-		mutex_unlock(&mdev_list_lock);
-		return -ENODEV;
-	}
-
 	if (!mdev->active) {
 		mutex_unlock(&mdev_list_lock);
 		return -EAGAIN;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-03-27  3:45 [PATCHv1 0/7] vfio/mdev: Improve vfio/mdev core module Parav Pandit
                   ` (5 preceding siblings ...)
  2019-03-27  3:45 ` [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
@ 2019-03-27  3:45 ` Parav Pandit
  2019-04-03 21:27   ` Alex Williamson
  6 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-03-27  3:45 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: cjia, parav

Below race condition and call trace exist with current device life cycle
sequence.

1. In following sequence, child devices created while removing mdev parent
device can be left out, or it may lead to race of removing half
initialized child mdev devices.

issue-1:
--------
       cpu-0                         cpu-1
       -----                         -----
                                  mdev_unregister_device()
                                     device_for_each_child()
                                        mdev_device_remove_cb()
                                            mdev_device_remove()
create_store()
  mdev_device_create()                   [...]
       device_register()
                                  parent_remove_sysfs_files()
                                  /* BUG: device added by cpu-0
                                   * whose parent is getting removed.
                                   */

issue-2:
--------
       cpu-0                         cpu-1
       -----                         -----
create_store()
  mdev_device_create()                   [...]
       device_register()

       [...]                      mdev_unregister_device()
                                     device_for_each_child()
                                        mdev_device_remove_cb()
                                            mdev_device_remove()

       mdev_create_sysfs_files()
       /* BUG: create is adding
        * sysfs files for a device
        * which is undergoing removal.
        */
                                 parent_remove_sysfs_files()

2. Below crash is observed when user initiated remove is in progress
and mdev_unregister_driver() completes parent unregistration.

       cpu-0                         cpu-1
       -----                         -----
remove_store()
   mdev_device_remove()
   active = false;
                                  mdev_unregister_device()
                                    remove type
   [...]
   mdev_remove_ops() crashes.

This is similar race like create() racing with mdev_unregister_device().

mtty mtty: MDEV: Registered
iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
mtty mtty: MDEV: Unregistering
mtty_dev: Unloaded!
BUG: unable to handle kernel paging request at ffffffffc027d668
PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
Oops: 0000 [#1] SMP PTI
CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
Call Trace:
 mdev_device_remove+0xef/0x130 [mdev]
 remove_store+0x77/0xa0 [mdev]
 kernfs_fop_write+0x113/0x1a0
 __vfs_write+0x33/0x1b0
 ? rcu_read_lock_sched_held+0x64/0x70
 ? rcu_sync_lockdep_assert+0x2a/0x50
 ? __sb_start_write+0x121/0x1b0
 ? vfs_write+0x17c/0x1b0
 vfs_write+0xad/0x1b0
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 ksys_write+0x55/0xc0
 do_syscall_64+0x5a/0x210

Therefore, mdev core is improved to overcome above issues.

Wait for any ongoing mdev create() and remove() to finish before
unregistering parent device using srcu. This continues to allow multiple
create and remove to progress in parallel. At the same time guard parent
removal while parent is being access by create() and remove callbacks.

mdev_device_remove() is refactored to not block on srcu when device is
removed as part of parent removal.

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c    | 83 ++++++++++++++++++++++++++++++++++------
 drivers/vfio/mdev/mdev_private.h |  6 +++
 2 files changed, 77 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index aefcf34..fa233c8 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
 						  ref);
 	struct device *dev = parent->dev;
 
+	cleanup_srcu_struct(&parent->unreg_srcu);
 	kfree(parent);
 	put_device(dev);
 }
@@ -147,10 +148,30 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
 	return 0;
 }
 
+static int mdev_device_remove_common(struct mdev_device *mdev,
+				     bool force_remove)
+{
+	struct mdev_type *type;
+	int ret;
+
+	type = to_mdev_type(mdev->type_kobj);
+
+	ret = mdev_device_remove_ops(mdev, force_remove);
+	if (ret && !force_remove) {
+		mutex_lock(&mdev_list_lock);
+		mdev->active = true;
+		mutex_unlock(&mdev_list_lock);
+		return ret;
+	}
+	mdev_remove_sysfs_files(&mdev->dev, type);
+	device_unregister(&mdev->dev);
+	return ret;
+}
+
 static int mdev_device_remove_cb(struct device *dev, void *data)
 {
 	if (dev_is_mdev(dev))
-		mdev_device_remove(dev, true);
+		mdev_device_remove_common(to_mdev_device(dev), true);
 
 	return 0;
 }
@@ -193,6 +214,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	}
 
 	kref_init(&parent->ref);
+	init_srcu_struct(&parent->unreg_srcu);
 
 	parent->dev = dev;
 	parent->ops = ops;
@@ -213,6 +235,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	if (ret)
 		dev_warn(dev, "Failed to create compatibility class link\n");
 
+	rcu_assign_pointer(parent->self, parent);
 	list_add(&parent->next, &parent_list);
 	mutex_unlock(&parent_list_lock);
 
@@ -251,13 +274,31 @@ void mdev_unregister_device(struct device *dev)
 	dev_info(dev, "MDEV: Unregistering\n");
 
 	list_del(&parent->next);
+	mutex_unlock(&parent_list_lock);
+
+	/*
+	 * Publish that this mdev parent is unregistering. So any new
+	 * create/remove cannot start on this parent anymore by user.
+	 */
+	rcu_assign_pointer(parent->self, NULL);
+
+	/*
+	 * Wait for any active create() or remove() mdev ops on the parent
+	 * to complete.
+	 */
+	synchronize_srcu(&parent->unreg_srcu);
+
+	/*
+	 * At this point it is confirmed that any pending user initiated
+	 * create or remove callbacks accessing the parent are completed.
+	 * It is safe to remove the parent now.
+	 */
 	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
 
 	device_for_each_child(dev, NULL, mdev_device_remove_cb);
 
 	parent_remove_sysfs_files(parent);
 
-	mutex_unlock(&parent_list_lock);
 	mdev_put_parent(parent);
 }
 EXPORT_SYMBOL(mdev_unregister_device);
@@ -278,14 +319,24 @@ int mdev_device_create(struct kobject *kobj,
 		       struct device *dev, const guid_t *uuid)
 {
 	int ret;
+	struct mdev_parent *valid_parent;
 	struct mdev_device *mdev, *tmp;
 	struct mdev_parent *parent;
 	struct mdev_type *type = to_mdev_type(kobj);
+	int srcu_idx;
 
 	parent = mdev_get_parent(type->parent);
 	if (!parent)
 		return -EINVAL;
 
+	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
+	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
+	if (!valid_parent) {
+		/* parent is undergoing unregistration */
+		ret = -ENODEV;
+		goto mdev_fail;
+	}
+
 	mutex_lock(&mdev_list_lock);
 
 	/* Check for duplicate */
@@ -334,44 +385,52 @@ int mdev_device_create(struct kobject *kobj,
 	mdev->type_kobj = kobj;
 	mdev->active = true;
 	dev_dbg(&mdev->dev, "MDEV: created\n");
+	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
 
 	return 0;
 
 create_fail:
 	device_unregister(&mdev->dev);
 mdev_fail:
+	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
 	mdev_put_parent(parent);
 	return ret;
 }
 
 int mdev_device_remove(struct device *dev, bool force_remove)
 {
+	struct mdev_parent *valid_parent;
 	struct mdev_device *mdev;
 	struct mdev_parent *parent;
-	struct mdev_type *type;
+	int srcu_idx;
 	int ret;
 
 	mdev = to_mdev_device(dev);
+	parent = mdev->parent;
+
+	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
+	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
+	if (!valid_parent) {
+		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
+		/* parent is undergoing unregistration */
+		return -ENODEV;
+	}
+
 	mutex_lock(&mdev_list_lock);
 	if (!mdev->active) {
 		mutex_unlock(&mdev_list_lock);
+		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
 		return -EAGAIN;
 	}
 
 	mdev->active = false;
 	mutex_unlock(&mdev_list_lock);
 
-	type = to_mdev_type(mdev->type_kobj);
-	parent = mdev->parent;
-
-	ret = mdev_device_remove_ops(mdev, force_remove);
-	if (ret) {
-		mdev->active = true;
+	ret = mdev_device_remove_common(mdev, force_remove);
+	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
+	if (ret)
 		return ret;
-	}
 
-	mdev_remove_sysfs_files(dev, type);
-	device_unregister(dev);
 	mdev_put_parent(parent);
 
 	return 0;
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index ddcf9c7..b799978 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -23,6 +23,12 @@ struct mdev_parent {
 	struct list_head next;
 	struct kset *mdev_types_kset;
 	struct list_head type_list;
+	/*
+	 * Protects unregistration to wait until create/remove
+	 * are completed.
+	 */
+	struct srcu_struct unreg_srcu;
+	struct mdev_parent __rcu *self;
 };
 
 struct mdev_device {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 1/7] vfio/mdev: Avoid release parent reference during error path
  2019-03-27  3:45 ` [PATCHv1 1/7] vfio/mdev: Avoid release parent reference during error path Parav Pandit
@ 2019-04-01 16:58   ` Cornelia Huck
  0 siblings, 0 replies; 28+ messages in thread
From: Cornelia Huck @ 2019-04-01 16:58 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, alex.williamson, cjia

On Tue, 26 Mar 2019 22:45:39 -0500
Parav Pandit <parav@mellanox.com> wrote:

> During mdev parent registration in mdev_register_device(),
> if parent device is duplicate, it releases the reference of existing
> parent device.
> This is incorrect. Existing parent device should not be touched.
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Reviewed By: Kirti Wankhede <kwankhede@nvidia.com>

Should probably be changed to the canonical Reviewed-by:.

> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b96fedc..1299d2e 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -181,6 +181,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	/* Check for duplicate */
>  	parent = __find_parent_device(dev);
>  	if (parent) {
> +		parent = NULL;
>  		ret = -EEXIST;
>  		goto add_dev_err;
>  	}

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 2/7] vfio/mdev: Removed unused kref
  2019-03-27  3:45 ` [PATCHv1 2/7] vfio/mdev: Removed unused kref Parav Pandit
@ 2019-04-01 17:01   ` Cornelia Huck
  0 siblings, 0 replies; 28+ messages in thread
From: Cornelia Huck @ 2019-04-01 17:01 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, alex.williamson, cjia

On Tue, 26 Mar 2019 22:45:40 -0500
Parav Pandit <parav@mellanox.com> wrote:

> Remove unused kref from the mdev_device structure.
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Reviewed By: Kirti Wankhede <kwankhede@nvidia.com>

Same here (Reviewed-by:).

> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 1 -
>  drivers/vfio/mdev/mdev_private.h | 1 -
>  2 files changed, 2 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 3/7] vfio/mdev: Drop redundant extern for exported symbols
  2019-03-27  3:45 ` [PATCHv1 3/7] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
@ 2019-04-01 17:02   ` Cornelia Huck
  0 siblings, 0 replies; 28+ messages in thread
From: Cornelia Huck @ 2019-04-01 17:02 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, alex.williamson, cjia

On Tue, 26 Mar 2019 22:45:41 -0500
Parav Pandit <parav@mellanox.com> wrote:

> There is no need use 'extern' for exported functions.
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  include/linux/mdev.h | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)

Acked-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 4/7] vfio/mdev: Avoid masking error code to EBUSY
  2019-03-27  3:45 ` [PATCHv1 4/7] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
@ 2019-04-01 17:21   ` Cornelia Huck
  0 siblings, 0 replies; 28+ messages in thread
From: Cornelia Huck @ 2019-04-01 17:21 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, alex.williamson, cjia

On Tue, 26 Mar 2019 22:45:42 -0500
Parav Pandit <parav@mellanox.com> wrote:

> Instead of masking return error to -EBUSY, return actual error
> returned by the driver.
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 00ca613..836d319 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -141,7 +141,7 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
>  	 */
>  	ret = parent->ops->remove(mdev);
>  	if (ret && !force_remove)
> -		return -EBUSY;
> +		return ret;
>  
>  	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
>  	return 0;

Makes sense, even if no current ->remove callback returns anything
other than 0 or -EBUSY.

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 5/7] vfio/mdev: Follow correct remove sequence
  2019-03-27  3:45 ` [PATCHv1 5/7] vfio/mdev: Follow correct remove sequence Parav Pandit
@ 2019-04-01 17:24   ` Cornelia Huck
  0 siblings, 0 replies; 28+ messages in thread
From: Cornelia Huck @ 2019-04-01 17:24 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, alex.williamson, cjia

On Tue, 26 Mar 2019 22:45:43 -0500
Parav Pandit <parav@mellanox.com> wrote:

> mdev_remove_sysfs_files() should follow exact mirror sequence of a
> create, similar to what is followed in error unwinding path of
> mdev_create_sysfs_files().
> 
> Fixes: 6a62c1dfb5c7 ("vfio/mdev: Re-order sysfs attribute creation")
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_sysfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 5193a0e..cbf94b8 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -280,7 +280,7 @@ int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
>  
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
>  {
> +	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
>  	sysfs_remove_link(&dev->kobj, "mdev_type");
>  	sysfs_remove_link(type->devices_kobj, dev_name(dev));
> -	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
>  }

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-03-27  3:45 ` [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
@ 2019-04-01 17:39   ` Cornelia Huck
  2019-04-02 19:59     ` Parav Pandit
  0 siblings, 1 reply; 28+ messages in thread
From: Cornelia Huck @ 2019-04-01 17:39 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, alex.williamson, cjia

On Tue, 26 Mar 2019 22:45:44 -0500
Parav Pandit <parav@mellanox.com> wrote:

> device_for_each_child() stops executing callback function for remaining
> child devices, if callback hits an error.
> Each child mdev device is independent of each other.
> While unregistering parent device, mdev core must remove all child mdev
> devices.
> Therefore, mdev_device_remove_cb() always returns success so that

s/always returns/must always return/ ?

> device_for_each_child doesn't abort if one child removal hits error.
> 
> While at it, improve remove and unregister functions for below simplicity.
> 
> There isn't need to pass forced flag pointer during mdev parent
> removal which invokes mdev_device_remove(). So simplify the flow.
> 
> mdev_device_remove() is called from two paths.
> 1. mdev_unregister_driver()
>      mdev_device_remove_cb()
>        mdev_device_remove()
> 2. remove_store()
>      mdev_device_remove()
> 
> When device is removed by user using remote_store(), device under
> removal is mdev device.
> When device is removed during parent device removal using generic child
> iterator, mdev check is already done using dev_is_mdev().

Isn't there still a possible race condition (which you seem to address
with the following patch)? IOW, you cannot remove that loop-under-mutex
yet?

> 
> Hence, remove the unnecessary loop in mdev_device_remove().
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 836d319..aefcf34 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -149,10 +149,10 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
>  

Maybe add

/* only called during parent device unregistration */

to avoid headscratching in the future?

>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
> -	if (!dev_is_mdev(dev))
> -		return 0;
> +	if (dev_is_mdev(dev))
> +		mdev_device_remove(dev, true);
>  
> -	return mdev_device_remove(dev, data ? *(bool *)data : true);
> +	return 0;
>  }
>  
>  /*
> @@ -240,7 +240,6 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  void mdev_unregister_device(struct device *dev)
>  {
>  	struct mdev_parent *parent;
> -	bool force_remove = true;
>  
>  	mutex_lock(&parent_list_lock);
>  	parent = __find_parent_device(dev);
> @@ -254,8 +253,7 @@ void mdev_unregister_device(struct device *dev)
>  	list_del(&parent->next);
>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
> -	device_for_each_child(dev, (void *)&force_remove,
> -			      mdev_device_remove_cb);
> +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  

Up to this chunk, the patch looks good to me.

> @@ -348,24 +346,13 @@ int mdev_device_create(struct kobject *kobj,
>  
>  int mdev_device_remove(struct device *dev, bool force_remove)
>  {
> -	struct mdev_device *mdev, *tmp;
> +	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> -
>  	mutex_lock(&mdev_list_lock);
> -	list_for_each_entry(tmp, &mdev_list, next) {
> -		if (tmp == mdev)
> -			break;
> -	}
> -
> -	if (tmp != mdev) {
> -		mutex_unlock(&mdev_list_lock);
> -		return -ENODEV;
> -	}
> -
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
>  		return -EAGAIN;


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-04-01 17:39   ` Cornelia Huck
@ 2019-04-02 19:59     ` Parav Pandit
  2019-04-02 22:33       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-04-02 19:59 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: kvm, linux-kernel, kwankhede, alex.williamson, cjia



> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Monday, April 1, 2019 12:39 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com; alex.williamson@redhat.com; cjia@nvidia.com
> Subject: Re: [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device
> removal if one fails
> 
> On Tue, 26 Mar 2019 22:45:44 -0500
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > device_for_each_child() stops executing callback function for
> > remaining child devices, if callback hits an error.
> > Each child mdev device is independent of each other.
> > While unregistering parent device, mdev core must remove all child
> > mdev devices.
> > Therefore, mdev_device_remove_cb() always returns success so that
> 
> s/always returns/must always return/ ?
> 
Must always return.
:-)

> > device_for_each_child doesn't abort if one child removal hits error.
> >
> > While at it, improve remove and unregister functions for below simplicity.
> >
> > There isn't need to pass forced flag pointer during mdev parent
> > removal which invokes mdev_device_remove(). So simplify the flow.
> >
> > mdev_device_remove() is called from two paths.
> > 1. mdev_unregister_driver()
> >      mdev_device_remove_cb()
> >        mdev_device_remove()
> > 2. remove_store()
> >      mdev_device_remove()
> >
> > When device is removed by user using remote_store(), device under
> > removal is mdev device.
> > When device is removed during parent device removal using generic
> > child iterator, mdev check is already done using dev_is_mdev().
> 
> Isn't there still a possible race condition (which you seem to address with
> the following patch)? IOW, you cannot remove that loop-under-mutex yet?

The loop checks if the remove() is called on the mdev or not.
This is already checked from both the paths from remove is invoked.
I didn't remove the 'active' check. So it should be fine.

> >
> > Hence, remove the unnecessary loop in mdev_device_remove().
> >
> > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c | 23 +++++------------------
> >  1 file changed, 5 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/vfio/mdev/mdev_core.c
> > b/drivers/vfio/mdev/mdev_core.c index 836d319..aefcf34 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -149,10 +149,10 @@ static int mdev_device_remove_ops(struct
> > mdev_device *mdev, bool force_remove)
> >
> 
> Maybe add
> 
> /* only called during parent device unregistration */
> 
> to avoid headscratching in the future?
> 
> >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > -	if (!dev_is_mdev(dev))
> > -		return 0;
> > +	if (dev_is_mdev(dev))
> > +		mdev_device_remove(dev, true);
> >
> > -	return mdev_device_remove(dev, data ? *(bool *)data : true);
> > +	return 0;
> >  }
> >
> >  /*
> > @@ -240,7 +240,6 @@ int mdev_register_device(struct device *dev, const
> > struct mdev_parent_ops *ops)  void mdev_unregister_device(struct
> > device *dev)  {
> >  	struct mdev_parent *parent;
> > -	bool force_remove = true;
> >
> >  	mutex_lock(&parent_list_lock);
> >  	parent = __find_parent_device(dev);
> > @@ -254,8 +253,7 @@ void mdev_unregister_device(struct device *dev)
> >  	list_del(&parent->next);
> >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> >
> > -	device_for_each_child(dev, (void *)&force_remove,
> > -			      mdev_device_remove_cb);
> > +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> >
> >  	parent_remove_sysfs_files(parent);
> >
> 
> Up to this chunk, the patch looks good to me.
> 
> > @@ -348,24 +346,13 @@ int mdev_device_create(struct kobject *kobj,
> >
> >  int mdev_device_remove(struct device *dev, bool force_remove)  {
> > -	struct mdev_device *mdev, *tmp;
> > +	struct mdev_device *mdev;
> >  	struct mdev_parent *parent;
> >  	struct mdev_type *type;
> >  	int ret;
> >
> >  	mdev = to_mdev_device(dev);
> > -
> >  	mutex_lock(&mdev_list_lock);
> > -	list_for_each_entry(tmp, &mdev_list, next) {
> > -		if (tmp == mdev)
> > -			break;
> > -	}
> > -
> > -	if (tmp != mdev) {
> > -		mutex_unlock(&mdev_list_lock);
> > -		return -ENODEV;
> > -	}
> > -
> >  	if (!mdev->active) {
> >  		mutex_unlock(&mdev_list_lock);
> >  		return -EAGAIN;


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-04-02 19:59     ` Parav Pandit
@ 2019-04-02 22:33       ` Alex Williamson
  2019-04-03  6:34         ` Parav Pandit
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2019-04-02 22:33 UTC (permalink / raw)
  To: Parav Pandit; +Cc: Cornelia Huck, kvm, linux-kernel, kwankhede, cjia

On Tue, 2 Apr 2019 19:59:58 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Cornelia Huck <cohuck@redhat.com>
> > Sent: Monday, April 1, 2019 12:39 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com; alex.williamson@redhat.com; cjia@nvidia.com
> > Subject: Re: [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device
> > removal if one fails
> > 
> > On Tue, 26 Mar 2019 22:45:44 -0500
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > device_for_each_child() stops executing callback function for
> > > remaining child devices, if callback hits an error.
> > > Each child mdev device is independent of each other.
> > > While unregistering parent device, mdev core must remove all child
> > > mdev devices.
> > > Therefore, mdev_device_remove_cb() always returns success so that  
> > 
> > s/always returns/must always return/ ?
> >   
> Must always return.
> :-)
> 
> > > device_for_each_child doesn't abort if one child removal hits error.
> > >
> > > While at it, improve remove and unregister functions for below simplicity.
> > >
> > > There isn't need to pass forced flag pointer during mdev parent
> > > removal which invokes mdev_device_remove(). So simplify the flow.
> > >
> > > mdev_device_remove() is called from two paths.
> > > 1. mdev_unregister_driver()
> > >      mdev_device_remove_cb()
> > >        mdev_device_remove()
> > > 2. remove_store()
> > >      mdev_device_remove()
> > >
> > > When device is removed by user using remote_store(), device under
> > > removal is mdev device.
> > > When device is removed during parent device removal using generic
> > > child iterator, mdev check is already done using dev_is_mdev().  
> > 
> > Isn't there still a possible race condition (which you seem to address with
> > the following patch)? IOW, you cannot remove that loop-under-mutex yet?  
> 
> The loop checks if the remove() is called on the mdev or not.
> This is already checked from both the paths from remove is invoked.
> I didn't remove the 'active' check. So it should be fine.

I believe the loop was actually trying to sanitize the mdev pointer,
for example if it's not in our list of devices we should not even
de-reference 'active'.  I think maybe this was more fallout from
allowing remove to fail.  For instance, it seems like manipulating
active within the list lock critical section should provide us with
mutual exclusion, the mdev object should be valid until the sysfs
remove attribute is removed, but remove_store() itself removes that
attribute allowing mdev_remove_sysfs_files() to skip over it, but
mdev_remove_device() can fail on the remove_store() path causing it to
recreate the remove attribute.  Now we're in trouble because I'm not
sure if recreating the sysfs attribute ever takes a reference to the
device.  If it does, it's at least racy.  Is it time to put the nail in
the coffin of these remove failure paths?  It seems too fundamental to
our code base that drivers cannot do this.  Thanks,

Alex

> > >
> > > Hence, remove the unnecessary loop in mdev_device_remove().
> > >
> > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > ---
> > >  drivers/vfio/mdev/mdev_core.c | 23 +++++------------------
> > >  1 file changed, 5 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > b/drivers/vfio/mdev/mdev_core.c index 836d319..aefcf34 100644
> > > --- a/drivers/vfio/mdev/mdev_core.c
> > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > @@ -149,10 +149,10 @@ static int mdev_device_remove_ops(struct
> > > mdev_device *mdev, bool force_remove)
> > >  
> > 
> > Maybe add
> > 
> > /* only called during parent device unregistration */
> > 
> > to avoid headscratching in the future?
> >   
> > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > > -	if (!dev_is_mdev(dev))
> > > -		return 0;
> > > +	if (dev_is_mdev(dev))
> > > +		mdev_device_remove(dev, true);
> > >
> > > -	return mdev_device_remove(dev, data ? *(bool *)data : true);
> > > +	return 0;
> > >  }
> > >
> > >  /*
> > > @@ -240,7 +240,6 @@ int mdev_register_device(struct device *dev, const
> > > struct mdev_parent_ops *ops)  void mdev_unregister_device(struct
> > > device *dev)  {
> > >  	struct mdev_parent *parent;
> > > -	bool force_remove = true;
> > >
> > >  	mutex_lock(&parent_list_lock);
> > >  	parent = __find_parent_device(dev);
> > > @@ -254,8 +253,7 @@ void mdev_unregister_device(struct device *dev)
> > >  	list_del(&parent->next);
> > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > >
> > > -	device_for_each_child(dev, (void *)&force_remove,
> > > -			      mdev_device_remove_cb);
> > > +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > >
> > >  	parent_remove_sysfs_files(parent);
> > >  
> > 
> > Up to this chunk, the patch looks good to me.
> >   
> > > @@ -348,24 +346,13 @@ int mdev_device_create(struct kobject *kobj,
> > >
> > >  int mdev_device_remove(struct device *dev, bool force_remove)  {
> > > -	struct mdev_device *mdev, *tmp;
> > > +	struct mdev_device *mdev;
> > >  	struct mdev_parent *parent;
> > >  	struct mdev_type *type;
> > >  	int ret;
> > >
> > >  	mdev = to_mdev_device(dev);
> > > -
> > >  	mutex_lock(&mdev_list_lock);
> > > -	list_for_each_entry(tmp, &mdev_list, next) {
> > > -		if (tmp == mdev)
> > > -			break;
> > > -	}
> > > -
> > > -	if (tmp != mdev) {
> > > -		mutex_unlock(&mdev_list_lock);
> > > -		return -ENODEV;
> > > -	}
> > > -
> > >  	if (!mdev->active) {
> > >  		mutex_unlock(&mdev_list_lock);
> > >  		return -EAGAIN;  
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-04-02 22:33       ` Alex Williamson
@ 2019-04-03  6:34         ` Parav Pandit
  0 siblings, 0 replies; 28+ messages in thread
From: Parav Pandit @ 2019-04-03  6:34 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Cornelia Huck, kvm, linux-kernel, kwankhede, cjia



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, April 2, 2019 5:33 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; kwankhede@nvidia.com; cjia@nvidia.com
> Subject: Re: [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device
> removal if one fails
> 
> On Tue, 2 Apr 2019 19:59:58 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Cornelia Huck <cohuck@redhat.com>
> > > Sent: Monday, April 1, 2019 12:39 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com; alex.williamson@redhat.com; cjia@nvidia.com
> > > Subject: Re: [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device
> > > removal if one fails
> > >
> > > On Tue, 26 Mar 2019 22:45:44 -0500
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > device_for_each_child() stops executing callback function for
> > > > remaining child devices, if callback hits an error.
> > > > Each child mdev device is independent of each other.
> > > > While unregistering parent device, mdev core must remove all child
> > > > mdev devices.
> > > > Therefore, mdev_device_remove_cb() always returns success so that
> > >
> > > s/always returns/must always return/ ?
> > >
> > Must always return.
> > :-)
> >
> > > > device_for_each_child doesn't abort if one child removal hits error.
> > > >
> > > > While at it, improve remove and unregister functions for below
> simplicity.
> > > >
> > > > There isn't need to pass forced flag pointer during mdev parent
> > > > removal which invokes mdev_device_remove(). So simplify the flow.
> > > >
> > > > mdev_device_remove() is called from two paths.
> > > > 1. mdev_unregister_driver()
> > > >      mdev_device_remove_cb()
> > > >        mdev_device_remove()
> > > > 2. remove_store()
> > > >      mdev_device_remove()
> > > >
> > > > When device is removed by user using remote_store(), device under
> > > > removal is mdev device.
> > > > When device is removed during parent device removal using generic
> > > > child iterator, mdev check is already done using dev_is_mdev().
> > >
> > > Isn't there still a possible race condition (which you seem to
> > > address with the following patch)? IOW, you cannot remove that loop-
> under-mutex yet?
> >
> > The loop checks if the remove() is called on the mdev or not.
> > This is already checked from both the paths from remove is invoked.
> > I didn't remove the 'active' check. So it should be fine.
> 
> I believe the loop was actually trying to sanitize the mdev pointer, for
> example if it's not in our list of devices we should not even de-reference
> 'active'.  I think maybe this was more fallout from allowing remove to fail.
> For instance, it seems like manipulating active within the list lock critical
> section should provide us with mutual exclusion, the mdev object should be
> valid until the sysfs remove attribute is removed, but remove_store() itself
> removes that attribute allowing mdev_remove_sysfs_files() to skip over it,
> but
> mdev_remove_device() can fail on the remove_store() path causing it to
> recreate the remove attribute.  Now we're in trouble because I'm not sure if
> recreating the sysfs attribute ever takes a reference to the device.  If it does,
> it's at least racy.  Is it time to put the nail in the coffin of these remove
> failure paths?  It seems too fundamental to our code base that drivers
> cannot do this.  Thanks,
>

Yes, I agree.
We should follow the right remove/create sequence.
+ we need this for power management too anyway.
There is no point in re-inventing the device model differently.

If this series looks fine/merged, I can send v1 of the patch that fixes the callback order.
Or you want to update this series?

I haven't had chance to go through other email thread yet.

 
> Alex
> 
> > > >
> > > > Hence, remove the unnecessary loop in mdev_device_remove().
> > > >
> > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > ---
> > > >  drivers/vfio/mdev/mdev_core.c | 23 +++++------------------
> > > >  1 file changed, 5 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > b/drivers/vfio/mdev/mdev_core.c index 836d319..aefcf34 100644
> > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > @@ -149,10 +149,10 @@ static int mdev_device_remove_ops(struct
> > > > mdev_device *mdev, bool force_remove)
> > > >
> > >
> > > Maybe add
> > >
> > > /* only called during parent device unregistration */
> > >
> > > to avoid headscratching in the future?
> > >
> > > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > > > -	if (!dev_is_mdev(dev))
> > > > -		return 0;
> > > > +	if (dev_is_mdev(dev))
> > > > +		mdev_device_remove(dev, true);
> > > >
> > > > -	return mdev_device_remove(dev, data ? *(bool *)data : true);
> > > > +	return 0;
> > > >  }
> > > >
> > > >  /*
> > > > @@ -240,7 +240,6 @@ int mdev_register_device(struct device *dev,
> > > > const struct mdev_parent_ops *ops)  void
> > > > mdev_unregister_device(struct device *dev)  {
> > > >  	struct mdev_parent *parent;
> > > > -	bool force_remove = true;
> > > >
> > > >  	mutex_lock(&parent_list_lock);
> > > >  	parent = __find_parent_device(dev); @@ -254,8 +253,7 @@ void
> > > > mdev_unregister_device(struct device *dev)
> > > >  	list_del(&parent->next);
> > > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > > >
> > > > -	device_for_each_child(dev, (void *)&force_remove,
> > > > -			      mdev_device_remove_cb);
> > > > +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > > >
> > > >  	parent_remove_sysfs_files(parent);
> > > >
> > >
> > > Up to this chunk, the patch looks good to me.
> > >
> > > > @@ -348,24 +346,13 @@ int mdev_device_create(struct kobject *kobj,
> > > >
> > > >  int mdev_device_remove(struct device *dev, bool force_remove)  {
> > > > -	struct mdev_device *mdev, *tmp;
> > > > +	struct mdev_device *mdev;
> > > >  	struct mdev_parent *parent;
> > > >  	struct mdev_type *type;
> > > >  	int ret;
> > > >
> > > >  	mdev = to_mdev_device(dev);
> > > > -
> > > >  	mutex_lock(&mdev_list_lock);
> > > > -	list_for_each_entry(tmp, &mdev_list, next) {
> > > > -		if (tmp == mdev)
> > > > -			break;
> > > > -	}
> > > > -
> > > > -	if (tmp != mdev) {
> > > > -		mutex_unlock(&mdev_list_lock);
> > > > -		return -ENODEV;
> > > > -	}
> > > > -
> > > >  	if (!mdev->active) {
> > > >  		mutex_unlock(&mdev_list_lock);
> > > >  		return -EAGAIN;
> >


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-03-27  3:45 ` [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs Parav Pandit
@ 2019-04-03 21:27   ` Alex Williamson
  2019-04-04  0:02     ` Parav Pandit
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2019-04-03 21:27 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, cjia

On Tue, 26 Mar 2019 22:45:45 -0500
Parav Pandit <parav@mellanox.com> wrote:

> Below race condition and call trace exist with current device life cycle
> sequence.
> 
> 1. In following sequence, child devices created while removing mdev parent
> device can be left out, or it may lead to race of removing half
> initialized child mdev devices.
> 
> issue-1:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
>                                   mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
>                                   parent_remove_sysfs_files()
>                                   /* BUG: device added by cpu-0
>                                    * whose parent is getting removed.
>                                    */
> 
> issue-2:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
> 
>        [...]                      mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> 
>        mdev_create_sysfs_files()
>        /* BUG: create is adding
>         * sysfs files for a device
>         * which is undergoing removal.
>         */
>                                  parent_remove_sysfs_files()
> 
> 2. Below crash is observed when user initiated remove is in progress
> and mdev_unregister_driver() completes parent unregistration.
> 
>        cpu-0                         cpu-1
>        -----                         -----
> remove_store()
>    mdev_device_remove()
>    active = false;
>                                   mdev_unregister_device()
>                                     remove type
>    [...]
>    mdev_remove_ops() crashes.
> 
> This is similar race like create() racing with mdev_unregister_device().
> 
> mtty mtty: MDEV: Registered
> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> mtty mtty: MDEV: Unregistering
> mtty_dev: Unloaded!
> BUG: unable to handle kernel paging request at ffffffffc027d668
> PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> Oops: 0000 [#1] SMP PTI
> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> Call Trace:
>  mdev_device_remove+0xef/0x130 [mdev]
>  remove_store+0x77/0xa0 [mdev]
>  kernfs_fop_write+0x113/0x1a0
>  __vfs_write+0x33/0x1b0
>  ? rcu_read_lock_sched_held+0x64/0x70
>  ? rcu_sync_lockdep_assert+0x2a/0x50
>  ? __sb_start_write+0x121/0x1b0
>  ? vfs_write+0x17c/0x1b0
>  vfs_write+0xad/0x1b0
>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>  ksys_write+0x55/0xc0
>  do_syscall_64+0x5a/0x210
> 
> Therefore, mdev core is improved to overcome above issues.
> 
> Wait for any ongoing mdev create() and remove() to finish before
> unregistering parent device using srcu. This continues to allow multiple
> create and remove to progress in parallel. At the same time guard parent
> removal while parent is being access by create() and remove callbacks.
> 
> mdev_device_remove() is refactored to not block on srcu when device is
> removed as part of parent removal.
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 83 ++++++++++++++++++++++++++++++++++------
>  drivers/vfio/mdev/mdev_private.h |  6 +++
>  2 files changed, 77 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index aefcf34..fa233c8 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
>  						  ref);
>  	struct device *dev = parent->dev;
>  
> +	cleanup_srcu_struct(&parent->unreg_srcu);
>  	kfree(parent);
>  	put_device(dev);
>  }
> @@ -147,10 +148,30 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
>  	return 0;
>  }
>  
> +static int mdev_device_remove_common(struct mdev_device *mdev,
> +				     bool force_remove)
> +{
> +	struct mdev_type *type;
> +	int ret;
> +
> +	type = to_mdev_type(mdev->type_kobj);

I know you're just moving this into the common function, but I think
we're just caching this for aesthetics, the mdev object is still valid
after the remove ops and I don't see anything touching this field.  If
so, maybe we should remove 'type' or at least set it right before it's
used so it doesn't appear that we're preserving it before the remove op.

> +
> +	ret = mdev_device_remove_ops(mdev, force_remove);
> +	if (ret && !force_remove) {
> +		mutex_lock(&mdev_list_lock);
> +		mdev->active = true;
> +		mutex_unlock(&mdev_list_lock);

The mutex around this is a change from the previous code and I'm not
sure it adds anything.  If there's a thread testing for active racing
with this thread setting active to true, there's no meaningful
difference in the result by acquiring the mutex.  'active' may change
from false->true during the critical section of the other thread, but I
don't think there are any strange out of order things that give the
wrong result, the other thread either sees true or false and continues
or exits, regardless of this mutex.

> +		return ret;
> +	}
> +	mdev_remove_sysfs_files(&mdev->dev, type);
> +	device_unregister(&mdev->dev);
> +	return ret;
> +}
> +
>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
>  	if (dev_is_mdev(dev))
> -		mdev_device_remove(dev, true);
> +		mdev_device_remove_common(to_mdev_device(dev), true);
>  
>  	return 0;
>  }
> @@ -193,6 +214,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	}
>  
>  	kref_init(&parent->ref);
> +	init_srcu_struct(&parent->unreg_srcu);
>  
>  	parent->dev = dev;
>  	parent->ops = ops;
> @@ -213,6 +235,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	if (ret)
>  		dev_warn(dev, "Failed to create compatibility class link\n");
>  
> +	rcu_assign_pointer(parent->self, parent);
>  	list_add(&parent->next, &parent_list);
>  	mutex_unlock(&parent_list_lock);
>  
> @@ -251,13 +274,31 @@ void mdev_unregister_device(struct device *dev)
>  	dev_info(dev, "MDEV: Unregistering\n");
>  
>  	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +
> +	/*
> +	 * Publish that this mdev parent is unregistering. So any new
> +	 * create/remove cannot start on this parent anymore by user.
> +	 */
> +	rcu_assign_pointer(parent->self, NULL);
> +
> +	/*
> +	 * Wait for any active create() or remove() mdev ops on the parent
> +	 * to complete.
> +	 */
> +	synchronize_srcu(&parent->unreg_srcu);
> +
> +	/*
> +	 * At this point it is confirmed that any pending user initiated
> +	 * create or remove callbacks accessing the parent are completed.
> +	 * It is safe to remove the parent now.
> +	 */

Thanks for the good documentation here.

Alex

>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
>  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  
> -	mutex_unlock(&parent_list_lock);
>  	mdev_put_parent(parent);
>  }
>  EXPORT_SYMBOL(mdev_unregister_device);
> @@ -278,14 +319,24 @@ int mdev_device_create(struct kobject *kobj,
>  		       struct device *dev, const guid_t *uuid)
>  {
>  	int ret;
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	int srcu_idx;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		/* parent is undergoing unregistration */
> +		ret = -ENODEV;
> +		goto mdev_fail;
> +	}
> +
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -334,44 +385,52 @@ int mdev_device_create(struct kobject *kobj,
>  	mdev->type_kobj = kobj;
>  	mdev->active = true;
>  	dev_dbg(&mdev->dev, "MDEV: created\n");
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
>  	return 0;
>  
>  create_fail:
>  	device_unregister(&mdev->dev);
>  mdev_fail:
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  	mdev_put_parent(parent);
>  	return ret;
>  }
>  
>  int mdev_device_remove(struct device *dev, bool force_remove)
>  {
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
> -	struct mdev_type *type;
> +	int srcu_idx;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		/* parent is undergoing unregistration */
> +		return -ENODEV;
> +	}
> +
>  	mutex_lock(&mdev_list_lock);
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  		return -EAGAIN;
>  	}
>  
>  	mdev->active = false;
>  	mutex_unlock(&mdev_list_lock);
>  
> -	type = to_mdev_type(mdev->type_kobj);
> -	parent = mdev->parent;
> -
> -	ret = mdev_device_remove_ops(mdev, force_remove);
> -	if (ret) {
> -		mdev->active = true;
> +	ret = mdev_device_remove_common(mdev, force_remove);
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +	if (ret)
>  		return ret;
> -	}
>  
> -	mdev_remove_sysfs_files(dev, type);
> -	device_unregister(dev);
>  	mdev_put_parent(parent);
>  
>  	return 0;
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index ddcf9c7..b799978 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -23,6 +23,12 @@ struct mdev_parent {
>  	struct list_head next;
>  	struct kset *mdev_types_kset;
>  	struct list_head type_list;
> +	/*
> +	 * Protects unregistration to wait until create/remove
> +	 * are completed.
> +	 */
> +	struct srcu_struct unreg_srcu;
> +	struct mdev_parent __rcu *self;
>  };
>  
>  struct mdev_device {


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-03 21:27   ` Alex Williamson
@ 2019-04-04  0:02     ` Parav Pandit
  2019-04-04 20:44       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-04-04  0:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, kwankhede, cjia



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 3, 2019 4:27 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com; cjia@nvidia.com
> Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> life cycle APIs
> 
> On Tue, 26 Mar 2019 22:45:45 -0500
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > Below race condition and call trace exist with current device life
> > cycle sequence.
> >
> > 1. In following sequence, child devices created while removing mdev
> > parent device can be left out, or it may lead to race of removing half
> > initialized child mdev devices.
> >
> > issue-1:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> >                                   mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >                                   parent_remove_sysfs_files()
> >                                   /* BUG: device added by cpu-0
> >                                    * whose parent is getting removed.
> >                                    */
> >
> > issue-2:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >
> >        [...]                      mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> >
> >        mdev_create_sysfs_files()
> >        /* BUG: create is adding
> >         * sysfs files for a device
> >         * which is undergoing removal.
> >         */
> >                                  parent_remove_sysfs_files()
> >
> > 2. Below crash is observed when user initiated remove is in progress
> > and mdev_unregister_driver() completes parent unregistration.
> >
> >        cpu-0                         cpu-1
> >        -----                         -----
> > remove_store()
> >    mdev_device_remove()
> >    active = false;
> >                                   mdev_unregister_device()
> >                                     remove type
> >    [...]
> >    mdev_remove_ops() crashes.
> >
> > This is similar race like create() racing with mdev_unregister_device().
> >
> > mtty mtty: MDEV: Registered
> > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > mtty mtty: MDEV: Unregistering
> > mtty_dev: Unloaded!
> > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> >  mdev_device_remove+0xef/0x130 [mdev]
> >  remove_store+0x77/0xa0 [mdev]
> >  kernfs_fop_write+0x113/0x1a0
> >  __vfs_write+0x33/0x1b0
> >  ? rcu_read_lock_sched_held+0x64/0x70
> >  ? rcu_sync_lockdep_assert+0x2a/0x50
> >  ? __sb_start_write+0x121/0x1b0
> >  ? vfs_write+0x17c/0x1b0
> >  vfs_write+0xad/0x1b0
> >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >  ksys_write+0x55/0xc0
> >  do_syscall_64+0x5a/0x210
> >
> > Therefore, mdev core is improved to overcome above issues.
> >
> > Wait for any ongoing mdev create() and remove() to finish before
> > unregistering parent device using srcu. This continues to allow
> > multiple create and remove to progress in parallel. At the same time
> > guard parent removal while parent is being access by create() and remove
> callbacks.
> >
> > mdev_device_remove() is refactored to not block on srcu when device is
> > removed as part of parent removal.
> >
> > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c    | 83
> ++++++++++++++++++++++++++++++++++------
> >  drivers/vfio/mdev/mdev_private.h |  6 +++
> >  2 files changed, 77 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/vfio/mdev/mdev_core.c
> > b/drivers/vfio/mdev/mdev_core.c index aefcf34..fa233c8 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> >  						  ref);
> >  	struct device *dev = parent->dev;
> >
> > +	cleanup_srcu_struct(&parent->unreg_srcu);
> >  	kfree(parent);
> >  	put_device(dev);
> >  }
> > @@ -147,10 +148,30 @@ static int mdev_device_remove_ops(struct
> mdev_device *mdev, bool force_remove)
> >  	return 0;
> >  }
> >
> > +static int mdev_device_remove_common(struct mdev_device *mdev,
> > +				     bool force_remove)
> > +{
> > +	struct mdev_type *type;
> > +	int ret;
> > +
> > +	type = to_mdev_type(mdev->type_kobj);
> 
> I know you're just moving this into the common function, but I think we're
> just caching this for aesthetics, the mdev object is still valid after the remove
> ops and I don't see anything touching this field.  If so, maybe we should
> remove 'type' or at least set it right before it's used so it doesn't appear that
> we're preserving it before the remove op.
> 
Sure, yes.
Type assignment should be done just before calling mdev_remove_sysfs_files().
Will send v2.

> > +
> > +	ret = mdev_device_remove_ops(mdev, force_remove);
> > +	if (ret && !force_remove) {
> > +		mutex_lock(&mdev_list_lock);
> > +		mdev->active = true;
> > +		mutex_unlock(&mdev_list_lock);
> 
> The mutex around this is a change from the previous code and I'm not sure
> it adds anything.  If there's a thread testing for active racing with this thread
> setting active to true, there's no meaningful difference in the result by
> acquiring the mutex.  'active' may change from false->true during the critical
> section of the other thread, but I don't think there are any strange out of
> order things that give the wrong result, the other thread either sees true or
> false and continues or exits, regardless of this mutex.
> 
Yes, I can drop the mutex.
In future remove sequence fix, this will anyway vanish.

Shall we finish this series with these 7 patches?
Once you ack it will send v2 for these 7 patches and follow on to that we cleanup the sequencing?

> > +		return ret;
> > +	}
> > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > +	device_unregister(&mdev->dev);
> > +	return ret;
> > +}
> > +
> >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> >  	if (dev_is_mdev(dev))
> > -		mdev_device_remove(dev, true);
> > +		mdev_device_remove_common(to_mdev_device(dev), true);
> >
> >  	return 0;
> >  }
> > @@ -193,6 +214,7 @@ int mdev_register_device(struct device *dev, const
> struct mdev_parent_ops *ops)
> >  	}
> >
> >  	kref_init(&parent->ref);
> > +	init_srcu_struct(&parent->unreg_srcu);
> >
> >  	parent->dev = dev;
> >  	parent->ops = ops;
> > @@ -213,6 +235,7 @@ int mdev_register_device(struct device *dev, const
> struct mdev_parent_ops *ops)
> >  	if (ret)
> >  		dev_warn(dev, "Failed to create compatibility class link\n");
> >
> > +	rcu_assign_pointer(parent->self, parent);
> >  	list_add(&parent->next, &parent_list);
> >  	mutex_unlock(&parent_list_lock);
> >
> > @@ -251,13 +274,31 @@ void mdev_unregister_device(struct device *dev)
> >  	dev_info(dev, "MDEV: Unregistering\n");
> >
> >  	list_del(&parent->next);
> > +	mutex_unlock(&parent_list_lock);
> > +
> > +	/*
> > +	 * Publish that this mdev parent is unregistering. So any new
> > +	 * create/remove cannot start on this parent anymore by user.
> > +	 */
> > +	rcu_assign_pointer(parent->self, NULL);
> > +
> > +	/*
> > +	 * Wait for any active create() or remove() mdev ops on the parent
> > +	 * to complete.
> > +	 */
> > +	synchronize_srcu(&parent->unreg_srcu);
> > +
> > +	/*
> > +	 * At this point it is confirmed that any pending user initiated
> > +	 * create or remove callbacks accessing the parent are completed.
> > +	 * It is safe to remove the parent now.
> > +	 */
> 
> Thanks for the good documentation here.
> 
> Alex
> 
> >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> >
> >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> >
> >  	parent_remove_sysfs_files(parent);
> >
> > -	mutex_unlock(&parent_list_lock);
> >  	mdev_put_parent(parent);
> >  }
> >  EXPORT_SYMBOL(mdev_unregister_device);
> > @@ -278,14 +319,24 @@ int mdev_device_create(struct kobject *kobj,
> >  		       struct device *dev, const guid_t *uuid)  {
> >  	int ret;
> > +	struct mdev_parent *valid_parent;
> >  	struct mdev_device *mdev, *tmp;
> >  	struct mdev_parent *parent;
> >  	struct mdev_type *type = to_mdev_type(kobj);
> > +	int srcu_idx;
> >
> >  	parent = mdev_get_parent(type->parent);
> >  	if (!parent)
> >  		return -EINVAL;
> >
> > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > +	if (!valid_parent) {
> > +		/* parent is undergoing unregistration */
> > +		ret = -ENODEV;
> > +		goto mdev_fail;
> > +	}
> > +
> >  	mutex_lock(&mdev_list_lock);
> >
> >  	/* Check for duplicate */
> > @@ -334,44 +385,52 @@ int mdev_device_create(struct kobject *kobj,
> >  	mdev->type_kobj = kobj;
> >  	mdev->active = true;
> >  	dev_dbg(&mdev->dev, "MDEV: created\n");
> > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> >
> >  	return 0;
> >
> >  create_fail:
> >  	device_unregister(&mdev->dev);
> >  mdev_fail:
> > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> >  	mdev_put_parent(parent);
> >  	return ret;
> >  }
> >
> >  int mdev_device_remove(struct device *dev, bool force_remove)  {
> > +	struct mdev_parent *valid_parent;
> >  	struct mdev_device *mdev;
> >  	struct mdev_parent *parent;
> > -	struct mdev_type *type;
> > +	int srcu_idx;
> >  	int ret;
> >
> >  	mdev = to_mdev_device(dev);
> > +	parent = mdev->parent;
> > +
> > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > +	if (!valid_parent) {
> > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > +		/* parent is undergoing unregistration */
> > +		return -ENODEV;
> > +	}
> > +
> >  	mutex_lock(&mdev_list_lock);
> >  	if (!mdev->active) {
> >  		mutex_unlock(&mdev_list_lock);
> > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> >  		return -EAGAIN;
> >  	}
> >
> >  	mdev->active = false;
> >  	mutex_unlock(&mdev_list_lock);
> >
> > -	type = to_mdev_type(mdev->type_kobj);
> > -	parent = mdev->parent;
> > -
> > -	ret = mdev_device_remove_ops(mdev, force_remove);
> > -	if (ret) {
> > -		mdev->active = true;
> > +	ret = mdev_device_remove_common(mdev, force_remove);
> > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > +	if (ret)
> >  		return ret;
> > -	}
> >
> > -	mdev_remove_sysfs_files(dev, type);
> > -	device_unregister(dev);
> >  	mdev_put_parent(parent);
> >
> >  	return 0;
> > diff --git a/drivers/vfio/mdev/mdev_private.h
> > b/drivers/vfio/mdev/mdev_private.h
> > index ddcf9c7..b799978 100644
> > --- a/drivers/vfio/mdev/mdev_private.h
> > +++ b/drivers/vfio/mdev/mdev_private.h
> > @@ -23,6 +23,12 @@ struct mdev_parent {
> >  	struct list_head next;
> >  	struct kset *mdev_types_kset;
> >  	struct list_head type_list;
> > +	/*
> > +	 * Protects unregistration to wait until create/remove
> > +	 * are completed.
> > +	 */
> > +	struct srcu_struct unreg_srcu;
> > +	struct mdev_parent __rcu *self;
> >  };
> >
> >  struct mdev_device {


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-04  0:02     ` Parav Pandit
@ 2019-04-04 20:44       ` Alex Williamson
  2019-04-04 23:05         ` Parav Pandit
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2019-04-04 20:44 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, cjia

On Thu, 4 Apr 2019 00:02:22 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, April 3, 2019 4:27 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com; cjia@nvidia.com
> > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> > life cycle APIs
> > 
> > On Tue, 26 Mar 2019 22:45:45 -0500
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > Below race condition and call trace exist with current device life
> > > cycle sequence.
> > >
> > > 1. In following sequence, child devices created while removing mdev
> > > parent device can be left out, or it may lead to race of removing half
> > > initialized child mdev devices.
> > >
> > > issue-1:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > >                                   mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >                                   parent_remove_sysfs_files()
> > >                                   /* BUG: device added by cpu-0
> > >                                    * whose parent is getting removed.
> > >                                    */
> > >
> > > issue-2:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >
> > >        [...]                      mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > >
> > >        mdev_create_sysfs_files()
> > >        /* BUG: create is adding
> > >         * sysfs files for a device
> > >         * which is undergoing removal.
> > >         */
> > >                                  parent_remove_sysfs_files()
> > >
> > > 2. Below crash is observed when user initiated remove is in progress
> > > and mdev_unregister_driver() completes parent unregistration.
> > >
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > remove_store()
> > >    mdev_device_remove()
> > >    active = false;
> > >                                   mdev_unregister_device()
> > >                                     remove type
> > >    [...]
> > >    mdev_remove_ops() crashes.
> > >
> > > This is similar race like create() racing with mdev_unregister_device().
> > >
> > > mtty mtty: MDEV: Registered
> > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > > mtty mtty: MDEV: Unregistering
> > > mtty_dev: Unloaded!
> > > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > Oops: 0000 [#1] SMP PTI
> > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > >  mdev_device_remove+0xef/0x130 [mdev]
> > >  remove_store+0x77/0xa0 [mdev]
> > >  kernfs_fop_write+0x113/0x1a0
> > >  __vfs_write+0x33/0x1b0
> > >  ? rcu_read_lock_sched_held+0x64/0x70
> > >  ? rcu_sync_lockdep_assert+0x2a/0x50
> > >  ? __sb_start_write+0x121/0x1b0
> > >  ? vfs_write+0x17c/0x1b0
> > >  vfs_write+0xad/0x1b0
> > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > >  ksys_write+0x55/0xc0
> > >  do_syscall_64+0x5a/0x210
> > >
> > > Therefore, mdev core is improved to overcome above issues.
> > >
> > > Wait for any ongoing mdev create() and remove() to finish before
> > > unregistering parent device using srcu. This continues to allow
> > > multiple create and remove to progress in parallel. At the same time
> > > guard parent removal while parent is being access by create() and remove  
> > callbacks.  
> > >
> > > mdev_device_remove() is refactored to not block on srcu when device is
> > > removed as part of parent removal.
> > >
> > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > ---
> > >  drivers/vfio/mdev/mdev_core.c    | 83  
> > ++++++++++++++++++++++++++++++++++------  
> > >  drivers/vfio/mdev/mdev_private.h |  6 +++
> > >  2 files changed, 77 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > b/drivers/vfio/mdev/mdev_core.c index aefcf34..fa233c8 100644
> > > --- a/drivers/vfio/mdev/mdev_core.c
> > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> > >  						  ref);
> > >  	struct device *dev = parent->dev;
> > >
> > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > >  	kfree(parent);
> > >  	put_device(dev);
> > >  }
> > > @@ -147,10 +148,30 @@ static int mdev_device_remove_ops(struct  
> > mdev_device *mdev, bool force_remove)  
> > >  	return 0;
> > >  }
> > >
> > > +static int mdev_device_remove_common(struct mdev_device *mdev,
> > > +				     bool force_remove)
> > > +{
> > > +	struct mdev_type *type;
> > > +	int ret;
> > > +
> > > +	type = to_mdev_type(mdev->type_kobj);  
> > 
> > I know you're just moving this into the common function, but I think we're
> > just caching this for aesthetics, the mdev object is still valid after the remove
> > ops and I don't see anything touching this field.  If so, maybe we should
> > remove 'type' or at least set it right before it's used so it doesn't appear that
> > we're preserving it before the remove op.
> >   
> Sure, yes.
> Type assignment should be done just before calling mdev_remove_sysfs_files().
> Will send v2.
> 
> > > +
> > > +	ret = mdev_device_remove_ops(mdev, force_remove);
> > > +	if (ret && !force_remove) {
> > > +		mutex_lock(&mdev_list_lock);
> > > +		mdev->active = true;
> > > +		mutex_unlock(&mdev_list_lock);  
> > 
> > The mutex around this is a change from the previous code and I'm not sure
> > it adds anything.  If there's a thread testing for active racing with this thread
> > setting active to true, there's no meaningful difference in the result by
> > acquiring the mutex.  'active' may change from false->true during the critical
> > section of the other thread, but I don't think there are any strange out of
> > order things that give the wrong result, the other thread either sees true or
> > false and continues or exits, regardless of this mutex.
> >   
> Yes, I can drop the mutex.
> In future remove sequence fix, this will anyway vanish.
> 
> Shall we finish this series with these 7 patches?
> Once you ack it will send v2 for these 7 patches and follow on to that we cleanup the sequencing?

Do you intend to move the removal of the mdev sanitization loop from
6/7 to this patch?  I don't think we can really claim that what it's
trying to do is unnecessary until after we have the new code here that
prevents the sysfs remove path from running concurrent to the parent
remove path.  It's not really related to the changes in 6/7 anyway.  In
fact, rather than moving that chunk here, it could be added as a
follow-on patch with explanation of why it is now unnecessary.  Thanks,

Alex

> > > +		return ret;
> > > +	}
> > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > +	device_unregister(&mdev->dev);
> > > +	return ret;
> > > +}
> > > +
> > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > >  	if (dev_is_mdev(dev))
> > > -		mdev_device_remove(dev, true);
> > > +		mdev_device_remove_common(to_mdev_device(dev), true);
> > >
> > >  	return 0;
> > >  }
> > > @@ -193,6 +214,7 @@ int mdev_register_device(struct device *dev, const  
> > struct mdev_parent_ops *ops)  
> > >  	}
> > >
> > >  	kref_init(&parent->ref);
> > > +	init_srcu_struct(&parent->unreg_srcu);
> > >
> > >  	parent->dev = dev;
> > >  	parent->ops = ops;
> > > @@ -213,6 +235,7 @@ int mdev_register_device(struct device *dev, const  
> > struct mdev_parent_ops *ops)  
> > >  	if (ret)
> > >  		dev_warn(dev, "Failed to create compatibility class link\n");
> > >
> > > +	rcu_assign_pointer(parent->self, parent);
> > >  	list_add(&parent->next, &parent_list);
> > >  	mutex_unlock(&parent_list_lock);
> > >
> > > @@ -251,13 +274,31 @@ void mdev_unregister_device(struct device *dev)
> > >  	dev_info(dev, "MDEV: Unregistering\n");
> > >
> > >  	list_del(&parent->next);
> > > +	mutex_unlock(&parent_list_lock);
> > > +
> > > +	/*
> > > +	 * Publish that this mdev parent is unregistering. So any new
> > > +	 * create/remove cannot start on this parent anymore by user.
> > > +	 */
> > > +	rcu_assign_pointer(parent->self, NULL);
> > > +
> > > +	/*
> > > +	 * Wait for any active create() or remove() mdev ops on the parent
> > > +	 * to complete.
> > > +	 */
> > > +	synchronize_srcu(&parent->unreg_srcu);
> > > +
> > > +	/*
> > > +	 * At this point it is confirmed that any pending user initiated
> > > +	 * create or remove callbacks accessing the parent are completed.
> > > +	 * It is safe to remove the parent now.
> > > +	 */  
> > 
> > Thanks for the good documentation here.
> > 
> > Alex
> >   
> > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > >
> > >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > >
> > >  	parent_remove_sysfs_files(parent);
> > >
> > > -	mutex_unlock(&parent_list_lock);
> > >  	mdev_put_parent(parent);
> > >  }
> > >  EXPORT_SYMBOL(mdev_unregister_device);
> > > @@ -278,14 +319,24 @@ int mdev_device_create(struct kobject *kobj,
> > >  		       struct device *dev, const guid_t *uuid)  {
> > >  	int ret;
> > > +	struct mdev_parent *valid_parent;
> > >  	struct mdev_device *mdev, *tmp;
> > >  	struct mdev_parent *parent;
> > >  	struct mdev_type *type = to_mdev_type(kobj);
> > > +	int srcu_idx;
> > >
> > >  	parent = mdev_get_parent(type->parent);
> > >  	if (!parent)
> > >  		return -EINVAL;
> > >
> > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > > +	if (!valid_parent) {
> > > +		/* parent is undergoing unregistration */
> > > +		ret = -ENODEV;
> > > +		goto mdev_fail;
> > > +	}
> > > +
> > >  	mutex_lock(&mdev_list_lock);
> > >
> > >  	/* Check for duplicate */
> > > @@ -334,44 +385,52 @@ int mdev_device_create(struct kobject *kobj,
> > >  	mdev->type_kobj = kobj;
> > >  	mdev->active = true;
> > >  	dev_dbg(&mdev->dev, "MDEV: created\n");
> > > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > >
> > >  	return 0;
> > >
> > >  create_fail:
> > >  	device_unregister(&mdev->dev);
> > >  mdev_fail:
> > > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > >  	mdev_put_parent(parent);
> > >  	return ret;
> > >  }
> > >
> > >  int mdev_device_remove(struct device *dev, bool force_remove)  {
> > > +	struct mdev_parent *valid_parent;
> > >  	struct mdev_device *mdev;
> > >  	struct mdev_parent *parent;
> > > -	struct mdev_type *type;
> > > +	int srcu_idx;
> > >  	int ret;
> > >
> > >  	mdev = to_mdev_device(dev);
> > > +	parent = mdev->parent;
> > > +
> > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > > +	if (!valid_parent) {
> > > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > > +		/* parent is undergoing unregistration */
> > > +		return -ENODEV;
> > > +	}
> > > +
> > >  	mutex_lock(&mdev_list_lock);
> > >  	if (!mdev->active) {
> > >  		mutex_unlock(&mdev_list_lock);
> > > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > >  		return -EAGAIN;
> > >  	}
> > >
> > >  	mdev->active = false;
> > >  	mutex_unlock(&mdev_list_lock);
> > >
> > > -	type = to_mdev_type(mdev->type_kobj);
> > > -	parent = mdev->parent;
> > > -
> > > -	ret = mdev_device_remove_ops(mdev, force_remove);
> > > -	if (ret) {
> > > -		mdev->active = true;
> > > +	ret = mdev_device_remove_common(mdev, force_remove);
> > > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > > +	if (ret)
> > >  		return ret;
> > > -	}
> > >
> > > -	mdev_remove_sysfs_files(dev, type);
> > > -	device_unregister(dev);
> > >  	mdev_put_parent(parent);
> > >
> > >  	return 0;
> > > diff --git a/drivers/vfio/mdev/mdev_private.h
> > > b/drivers/vfio/mdev/mdev_private.h
> > > index ddcf9c7..b799978 100644
> > > --- a/drivers/vfio/mdev/mdev_private.h
> > > +++ b/drivers/vfio/mdev/mdev_private.h
> > > @@ -23,6 +23,12 @@ struct mdev_parent {
> > >  	struct list_head next;
> > >  	struct kset *mdev_types_kset;
> > >  	struct list_head type_list;
> > > +	/*
> > > +	 * Protects unregistration to wait until create/remove
> > > +	 * are completed.
> > > +	 */
> > > +	struct srcu_struct unreg_srcu;
> > > +	struct mdev_parent __rcu *self;
> > >  };
> > >
> > >  struct mdev_device {  
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-04 20:44       ` Alex Williamson
@ 2019-04-04 23:05         ` Parav Pandit
  2019-04-23 19:21           ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-04-04 23:05 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, kwankhede, cjia



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 4, 2019 3:44 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com; cjia@nvidia.com
> Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> life cycle APIs
> 
> On Thu, 4 Apr 2019 00:02:22 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, April 3, 2019 4:27 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com; cjia@nvidia.com
> > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev
> > > device life cycle APIs
> > >
> > > On Tue, 26 Mar 2019 22:45:45 -0500
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > Below race condition and call trace exist with current device life
> > > > cycle sequence.
> > > >
> > > > 1. In following sequence, child devices created while removing
> > > > mdev parent device can be left out, or it may lead to race of
> > > > removing half initialized child mdev devices.
> > > >
> > > > issue-1:
> > > > --------
> > > >        cpu-0                         cpu-1
> > > >        -----                         -----
> > > >                                   mdev_unregister_device()
> > > >                                      device_for_each_child()
> > > >                                         mdev_device_remove_cb()
> > > >                                             mdev_device_remove()
> > > > create_store()
> > > >   mdev_device_create()                   [...]
> > > >        device_register()
> > > >                                   parent_remove_sysfs_files()
> > > >                                   /* BUG: device added by cpu-0
> > > >                                    * whose parent is getting removed.
> > > >                                    */
> > > >
> > > > issue-2:
> > > > --------
> > > >        cpu-0                         cpu-1
> > > >        -----                         -----
> > > > create_store()
> > > >   mdev_device_create()                   [...]
> > > >        device_register()
> > > >
> > > >        [...]                      mdev_unregister_device()
> > > >                                      device_for_each_child()
> > > >                                         mdev_device_remove_cb()
> > > >                                             mdev_device_remove()
> > > >
> > > >        mdev_create_sysfs_files()
> > > >        /* BUG: create is adding
> > > >         * sysfs files for a device
> > > >         * which is undergoing removal.
> > > >         */
> > > >                                  parent_remove_sysfs_files()
> > > >
> > > > 2. Below crash is observed when user initiated remove is in
> > > > progress and mdev_unregister_driver() completes parent
> unregistration.
> > > >
> > > >        cpu-0                         cpu-1
> > > >        -----                         -----
> > > > remove_store()
> > > >    mdev_device_remove()
> > > >    active = false;
> > > >                                   mdev_unregister_device()
> > > >                                     remove type
> > > >    [...]
> > > >    mdev_remove_ops() crashes.
> > > >
> > > > This is similar race like create() racing with mdev_unregister_device().
> > > >
> > > > mtty mtty: MDEV: Registered
> > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group
> > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id
> > > > = 57 mtty mtty: MDEV: Unregistering
> > > > mtty_dev: Unloaded!
> > > > BUG: unable to handle kernel paging request at ffffffffc027d668
> > > > PGD
> > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > Oops: 0000 [#1] SMP PTI
> > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > >  remove_store+0x77/0xa0 [mdev]
> > > >  kernfs_fop_write+0x113/0x1a0
> > > >  __vfs_write+0x33/0x1b0
> > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > >  vfs_write+0xad/0x1b0
> > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > >  ksys_write+0x55/0xc0
> > > >  do_syscall_64+0x5a/0x210
> > > >
> > > > Therefore, mdev core is improved to overcome above issues.
> > > >
> > > > Wait for any ongoing mdev create() and remove() to finish before
> > > > unregistering parent device using srcu. This continues to allow
> > > > multiple create and remove to progress in parallel. At the same
> > > > time guard parent removal while parent is being access by create()
> > > > and remove
> > > callbacks.
> > > >
> > > > mdev_device_remove() is refactored to not block on srcu when
> > > > device is removed as part of parent removal.
> > > >
> > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > ---
> > > >  drivers/vfio/mdev/mdev_core.c    | 83
> > > ++++++++++++++++++++++++++++++++++------
> > > >  drivers/vfio/mdev/mdev_private.h |  6 +++
> > > >  2 files changed, 77 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > b/drivers/vfio/mdev/mdev_core.c index aefcf34..fa233c8 100644
> > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> > > >  						  ref);
> > > >  	struct device *dev = parent->dev;
> > > >
> > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > >  	kfree(parent);
> > > >  	put_device(dev);
> > > >  }
> > > > @@ -147,10 +148,30 @@ static int mdev_device_remove_ops(struct
> > > mdev_device *mdev, bool force_remove)
> > > >  	return 0;
> > > >  }
> > > >
> > > > +static int mdev_device_remove_common(struct mdev_device *mdev,
> > > > +				     bool force_remove)
> > > > +{
> > > > +	struct mdev_type *type;
> > > > +	int ret;
> > > > +
> > > > +	type = to_mdev_type(mdev->type_kobj);
> > >
> > > I know you're just moving this into the common function, but I think
> > > we're just caching this for aesthetics, the mdev object is still
> > > valid after the remove ops and I don't see anything touching this
> > > field.  If so, maybe we should remove 'type' or at least set it
> > > right before it's used so it doesn't appear that we're preserving it before
> the remove op.
> > >
> > Sure, yes.
> > Type assignment should be done just before calling
> mdev_remove_sysfs_files().
> > Will send v2.
> >
> > > > +
> > > > +	ret = mdev_device_remove_ops(mdev, force_remove);
> > > > +	if (ret && !force_remove) {
> > > > +		mutex_lock(&mdev_list_lock);
> > > > +		mdev->active = true;
> > > > +		mutex_unlock(&mdev_list_lock);
> > >
> > > The mutex around this is a change from the previous code and I'm not
> > > sure it adds anything.  If there's a thread testing for active
> > > racing with this thread setting active to true, there's no
> > > meaningful difference in the result by acquiring the mutex.
> > > 'active' may change from false->true during the critical section of
> > > the other thread, but I don't think there are any strange out of
> > > order things that give the wrong result, the other thread either sees true
> or false and continues or exits, regardless of this mutex.
> > >
> > Yes, I can drop the mutex.
> > In future remove sequence fix, this will anyway vanish.
> >
> > Shall we finish this series with these 7 patches?
> > Once you ack it will send v2 for these 7 patches and follow on to that we
> cleanup the sequencing?
> 
> Do you intend to move the removal of the mdev sanitization loop from
> 6/7 to this patch?  I don't think we can really claim that what it's trying to do
> is unnecessary until after we have the new code here that prevents the sysfs
> remove path from running concurrent to the parent remove path.  It's not
> really related to the changes in 6/7 anyway.  In fact, rather than moving that
> chunk here, it could be added as a follow-on patch with explanation of why
> it is now unnecessary.  Thanks,
> 
Device type cannot change from mdev to something else when it was invoked by the remove() sysfs cb.
Neither it can be something else in parent removal is passes bus comparison check.

If vendor's remove fails, in current code we are not calling device_unregister(). So bus cannot become null.

I fail to understand how device type can change, 
and let say if the sanitization loop exist, what prevents it to not change after that check is passed.

Do you know how can device type can change?

I was thinking to patch 6 as is and address comments of 7.


> Alex
> 
> > > > +		return ret;
> > > > +	}
> > > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > > +	device_unregister(&mdev->dev);
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > > >  	if (dev_is_mdev(dev))
> > > > -		mdev_device_remove(dev, true);
> > > > +		mdev_device_remove_common(to_mdev_device(dev), true);
> > > >
> > > >  	return 0;
> > > >  }
> > > > @@ -193,6 +214,7 @@ int mdev_register_device(struct device *dev,
> > > > const
> > > struct mdev_parent_ops *ops)
> > > >  	}
> > > >
> > > >  	kref_init(&parent->ref);
> > > > +	init_srcu_struct(&parent->unreg_srcu);
> > > >
> > > >  	parent->dev = dev;
> > > >  	parent->ops = ops;
> > > > @@ -213,6 +235,7 @@ int mdev_register_device(struct device *dev,
> > > > const
> > > struct mdev_parent_ops *ops)
> > > >  	if (ret)
> > > >  		dev_warn(dev, "Failed to create compatibility class link\n");
> > > >
> > > > +	rcu_assign_pointer(parent->self, parent);
> > > >  	list_add(&parent->next, &parent_list);
> > > >  	mutex_unlock(&parent_list_lock);
> > > >
> > > > @@ -251,13 +274,31 @@ void mdev_unregister_device(struct device
> *dev)
> > > >  	dev_info(dev, "MDEV: Unregistering\n");
> > > >
> > > >  	list_del(&parent->next);
> > > > +	mutex_unlock(&parent_list_lock);
> > > > +
> > > > +	/*
> > > > +	 * Publish that this mdev parent is unregistering. So any new
> > > > +	 * create/remove cannot start on this parent anymore by user.
> > > > +	 */
> > > > +	rcu_assign_pointer(parent->self, NULL);
> > > > +
> > > > +	/*
> > > > +	 * Wait for any active create() or remove() mdev ops on the parent
> > > > +	 * to complete.
> > > > +	 */
> > > > +	synchronize_srcu(&parent->unreg_srcu);
> > > > +
> > > > +	/*
> > > > +	 * At this point it is confirmed that any pending user initiated
> > > > +	 * create or remove callbacks accessing the parent are completed.
> > > > +	 * It is safe to remove the parent now.
> > > > +	 */
> > >
> > > Thanks for the good documentation here.
> > >
> > > Alex
> > >
> > > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > > >
> > > >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > > >
> > > >  	parent_remove_sysfs_files(parent);
> > > >
> > > > -	mutex_unlock(&parent_list_lock);
> > > >  	mdev_put_parent(parent);
> > > >  }
> > > >  EXPORT_SYMBOL(mdev_unregister_device);
> > > > @@ -278,14 +319,24 @@ int mdev_device_create(struct kobject *kobj,
> > > >  		       struct device *dev, const guid_t *uuid)  {
> > > >  	int ret;
> > > > +	struct mdev_parent *valid_parent;
> > > >  	struct mdev_device *mdev, *tmp;
> > > >  	struct mdev_parent *parent;
> > > >  	struct mdev_type *type = to_mdev_type(kobj);
> > > > +	int srcu_idx;
> > > >
> > > >  	parent = mdev_get_parent(type->parent);
> > > >  	if (!parent)
> > > >  		return -EINVAL;
> > > >
> > > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > > > +	if (!valid_parent) {
> > > > +		/* parent is undergoing unregistration */
> > > > +		ret = -ENODEV;
> > > > +		goto mdev_fail;
> > > > +	}
> > > > +
> > > >  	mutex_lock(&mdev_list_lock);
> > > >
> > > >  	/* Check for duplicate */
> > > > @@ -334,44 +385,52 @@ int mdev_device_create(struct kobject *kobj,
> > > >  	mdev->type_kobj = kobj;
> > > >  	mdev->active = true;
> > > >  	dev_dbg(&mdev->dev, "MDEV: created\n");
> > > > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > > >
> > > >  	return 0;
> > > >
> > > >  create_fail:
> > > >  	device_unregister(&mdev->dev);
> > > >  mdev_fail:
> > > > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > > >  	mdev_put_parent(parent);
> > > >  	return ret;
> > > >  }
> > > >
> > > >  int mdev_device_remove(struct device *dev, bool force_remove)  {
> > > > +	struct mdev_parent *valid_parent;
> > > >  	struct mdev_device *mdev;
> > > >  	struct mdev_parent *parent;
> > > > -	struct mdev_type *type;
> > > > +	int srcu_idx;
> > > >  	int ret;
> > > >
> > > >  	mdev = to_mdev_device(dev);
> > > > +	parent = mdev->parent;
> > > > +
> > > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > > > +	if (!valid_parent) {
> > > > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > > > +		/* parent is undergoing unregistration */
> > > > +		return -ENODEV;
> > > > +	}
> > > > +
> > > >  	mutex_lock(&mdev_list_lock);
> > > >  	if (!mdev->active) {
> > > >  		mutex_unlock(&mdev_list_lock);
> > > > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > > >  		return -EAGAIN;
> > > >  	}
> > > >
> > > >  	mdev->active = false;
> > > >  	mutex_unlock(&mdev_list_lock);
> > > >
> > > > -	type = to_mdev_type(mdev->type_kobj);
> > > > -	parent = mdev->parent;
> > > > -
> > > > -	ret = mdev_device_remove_ops(mdev, force_remove);
> > > > -	if (ret) {
> > > > -		mdev->active = true;
> > > > +	ret = mdev_device_remove_common(mdev, force_remove);
> > > > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > > > +	if (ret)
> > > >  		return ret;
> > > > -	}
> > > >
> > > > -	mdev_remove_sysfs_files(dev, type);
> > > > -	device_unregister(dev);
> > > >  	mdev_put_parent(parent);
> > > >
> > > >  	return 0;
> > > > diff --git a/drivers/vfio/mdev/mdev_private.h
> > > > b/drivers/vfio/mdev/mdev_private.h
> > > > index ddcf9c7..b799978 100644
> > > > --- a/drivers/vfio/mdev/mdev_private.h
> > > > +++ b/drivers/vfio/mdev/mdev_private.h
> > > > @@ -23,6 +23,12 @@ struct mdev_parent {
> > > >  	struct list_head next;
> > > >  	struct kset *mdev_types_kset;
> > > >  	struct list_head type_list;
> > > > +	/*
> > > > +	 * Protects unregistration to wait until create/remove
> > > > +	 * are completed.
> > > > +	 */
> > > > +	struct srcu_struct unreg_srcu;
> > > > +	struct mdev_parent __rcu *self;
> > > >  };
> > > >
> > > >  struct mdev_device {
> >


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-04 23:05         ` Parav Pandit
@ 2019-04-23 19:21           ` Alex Williamson
  2019-04-25 23:29             ` Parav Pandit
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2019-04-23 19:21 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, cjia

On Thu, 4 Apr 2019 23:05:43 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, April 4, 2019 3:44 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com; cjia@nvidia.com
> > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> > life cycle APIs
> > 
> > On Thu, 4 Apr 2019 00:02:22 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > > -----Original Message-----
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, April 3, 2019 4:27 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > kwankhede@nvidia.com; cjia@nvidia.com
> > > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev
> > > > device life cycle APIs
> > > >
> > > > On Tue, 26 Mar 2019 22:45:45 -0500
> > > > Parav Pandit <parav@mellanox.com> wrote:
> > > >  
> > > > > Below race condition and call trace exist with current device life
> > > > > cycle sequence.
> > > > >
> > > > > 1. In following sequence, child devices created while removing
> > > > > mdev parent device can be left out, or it may lead to race of
> > > > > removing half initialized child mdev devices.
> > > > >
> > > > > issue-1:
> > > > > --------
> > > > >        cpu-0                         cpu-1
> > > > >        -----                         -----
> > > > >                                   mdev_unregister_device()
> > > > >                                      device_for_each_child()
> > > > >                                         mdev_device_remove_cb()
> > > > >                                             mdev_device_remove()
> > > > > create_store()
> > > > >   mdev_device_create()                   [...]
> > > > >        device_register()
> > > > >                                   parent_remove_sysfs_files()
> > > > >                                   /* BUG: device added by cpu-0
> > > > >                                    * whose parent is getting removed.
> > > > >                                    */
> > > > >
> > > > > issue-2:
> > > > > --------
> > > > >        cpu-0                         cpu-1
> > > > >        -----                         -----
> > > > > create_store()
> > > > >   mdev_device_create()                   [...]
> > > > >        device_register()
> > > > >
> > > > >        [...]                      mdev_unregister_device()
> > > > >                                      device_for_each_child()
> > > > >                                         mdev_device_remove_cb()
> > > > >                                             mdev_device_remove()
> > > > >
> > > > >        mdev_create_sysfs_files()
> > > > >        /* BUG: create is adding
> > > > >         * sysfs files for a device
> > > > >         * which is undergoing removal.
> > > > >         */
> > > > >                                  parent_remove_sysfs_files()
> > > > >
> > > > > 2. Below crash is observed when user initiated remove is in
> > > > > progress and mdev_unregister_driver() completes parent  
> > unregistration.  
> > > > >
> > > > >        cpu-0                         cpu-1
> > > > >        -----                         -----
> > > > > remove_store()
> > > > >    mdev_device_remove()
> > > > >    active = false;
> > > > >                                   mdev_unregister_device()
> > > > >                                     remove type
> > > > >    [...]
> > > > >    mdev_remove_ops() crashes.
> > > > >
> > > > > This is similar race like create() racing with mdev_unregister_device().
> > > > >
> > > > > mtty mtty: MDEV: Registered
> > > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group
> > > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id
> > > > > = 57 mtty mtty: MDEV: Unregistering
> > > > > mtty_dev: Unloaded!
> > > > > BUG: unable to handle kernel paging request at ffffffffc027d668
> > > > > PGD
> > > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > > Oops: 0000 [#1] SMP PTI
> > > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > > >  remove_store+0x77/0xa0 [mdev]
> > > > >  kernfs_fop_write+0x113/0x1a0
> > > > >  __vfs_write+0x33/0x1b0
> > > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > > >  vfs_write+0xad/0x1b0
> > > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > > >  ksys_write+0x55/0xc0
> > > > >  do_syscall_64+0x5a/0x210
> > > > >
> > > > > Therefore, mdev core is improved to overcome above issues.
> > > > >
> > > > > Wait for any ongoing mdev create() and remove() to finish before
> > > > > unregistering parent device using srcu. This continues to allow
> > > > > multiple create and remove to progress in parallel. At the same
> > > > > time guard parent removal while parent is being access by create()
> > > > > and remove  
> > > > callbacks.  
> > > > >
> > > > > mdev_device_remove() is refactored to not block on srcu when
> > > > > device is removed as part of parent removal.
> > > > >
> > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > ---
> > > > >  drivers/vfio/mdev/mdev_core.c    | 83  
> > > > ++++++++++++++++++++++++++++++++++------  
> > > > >  drivers/vfio/mdev/mdev_private.h |  6 +++
> > > > >  2 files changed, 77 insertions(+), 12 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > b/drivers/vfio/mdev/mdev_core.c index aefcf34..fa233c8 100644
> > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> > > > >  						  ref);
> > > > >  	struct device *dev = parent->dev;
> > > > >
> > > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > > >  	kfree(parent);
> > > > >  	put_device(dev);
> > > > >  }
> > > > > @@ -147,10 +148,30 @@ static int mdev_device_remove_ops(struct  
> > > > mdev_device *mdev, bool force_remove)  
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > > +static int mdev_device_remove_common(struct mdev_device *mdev,
> > > > > +				     bool force_remove)
> > > > > +{
> > > > > +	struct mdev_type *type;
> > > > > +	int ret;
> > > > > +
> > > > > +	type = to_mdev_type(mdev->type_kobj);  
> > > >
> > > > I know you're just moving this into the common function, but I think
> > > > we're just caching this for aesthetics, the mdev object is still
> > > > valid after the remove ops and I don't see anything touching this
> > > > field.  If so, maybe we should remove 'type' or at least set it
> > > > right before it's used so it doesn't appear that we're preserving it before  
> > the remove op.  
> > > >  
> > > Sure, yes.
> > > Type assignment should be done just before calling  
> > mdev_remove_sysfs_files().  
> > > Will send v2.
> > >  
> > > > > +
> > > > > +	ret = mdev_device_remove_ops(mdev, force_remove);
> > > > > +	if (ret && !force_remove) {
> > > > > +		mutex_lock(&mdev_list_lock);
> > > > > +		mdev->active = true;
> > > > > +		mutex_unlock(&mdev_list_lock);  
> > > >
> > > > The mutex around this is a change from the previous code and I'm not
> > > > sure it adds anything.  If there's a thread testing for active
> > > > racing with this thread setting active to true, there's no
> > > > meaningful difference in the result by acquiring the mutex.
> > > > 'active' may change from false->true during the critical section of
> > > > the other thread, but I don't think there are any strange out of
> > > > order things that give the wrong result, the other thread either sees true  
> > or false and continues or exits, regardless of this mutex.  
> > > >  
> > > Yes, I can drop the mutex.
> > > In future remove sequence fix, this will anyway vanish.
> > >
> > > Shall we finish this series with these 7 patches?
> > > Once you ack it will send v2 for these 7 patches and follow on to that we  
> > cleanup the sequencing?
> > 
> > Do you intend to move the removal of the mdev sanitization loop from
> > 6/7 to this patch?  I don't think we can really claim that what it's trying to do
> > is unnecessary until after we have the new code here that prevents the sysfs
> > remove path from running concurrent to the parent remove path.  It's not
> > really related to the changes in 6/7 anyway.  In fact, rather than moving that
> > chunk here, it could be added as a follow-on patch with explanation of why
> > it is now unnecessary.  Thanks,
> >   
> Device type cannot change from mdev to something else when it was invoked by the remove() sysfs cb.
> Neither it can be something else in parent removal is passes bus comparison check.

Hi Parav,

I tried to explain the concern in
Message-ID: <20190402163309.414c45ad@x1.home> It hinges on the fact that
remove_store() calls device_remove_file_self() before calling
mdev_device_remove().  Therefore imagine this scenarios:

Thread A			Thread B

mdev_device_remove()
 mdev_remove_sysfs_files()
				remove_store()
				 device_remove_file_self()
  sysfs_remove_files...
				 mdev_device_remove()
				  return -EAGAIN
                                 device_create_file()
 device_unregister()
 mdev_put_parent()

So Thread B recreated a stale sysfs attribute.  If it prevents the mdev
from being released via mdev_device_release() then it will forever
be !active and calling remove store will continue to error and recreate
it.  If the mdev does get freed, then remove_store() is working with a
stale device, which the sanitizing loop removed in 6/7 is meant to
catch.  Therefore, it makes sense to me to relocate that loop removal
until after we clean up the mess around removal.

BTW, I exchanged email with Kirti offline and I think we're in
agreement around your plans to fix this.  The confusion was around
whether the vendor driver remove callback can be called while the
device is still in use, but I believe vfio-core will prevent that with
the correct bus removal logic in place.

So where do we stand on this series?  I think patches 1-5 look good.
Should I incorporate them for v5.2?  Patch 6 looks ok, except I'd
rather see the sanitizing loop stay until we can otherwise fix the race
above.  Patch 7 needed more work, iirc.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-23 19:21           ` Alex Williamson
@ 2019-04-25 23:29             ` Parav Pandit
  2019-04-26 15:33               ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-04-25 23:29 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, kwankhede, cjia

Hi Alex,

First, sorry for my late reply.

> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, April 23, 2019 2:22 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com; cjia@nvidia.com
> Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> life cycle APIs
> 
> On Thu, 4 Apr 2019 23:05:43 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, April 4, 2019 3:44 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com; cjia@nvidia.com
> > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev
> > > device life cycle APIs
> > >
> > > On Thu, 4 Apr 2019 00:02:22 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Wednesday, April 3, 2019 4:27 PM
> > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > kwankhede@nvidia.com; cjia@nvidia.com
> > > > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with
> > > > > mdev device life cycle APIs
> > > > >
> > > > > On Tue, 26 Mar 2019 22:45:45 -0500 Parav Pandit
> > > > > <parav@mellanox.com> wrote:
> > > > >
> > > > > > Below race condition and call trace exist with current device
> > > > > > life cycle sequence.
> > > > > >
> > > > > > 1. In following sequence, child devices created while removing
> > > > > > mdev parent device can be left out, or it may lead to race of
> > > > > > removing half initialized child mdev devices.
> > > > > >
> > > > > > issue-1:
> > > > > > --------
> > > > > >        cpu-0                         cpu-1
> > > > > >        -----                         -----
> > > > > >                                   mdev_unregister_device()
> > > > > >                                      device_for_each_child()
> > > > > >                                         mdev_device_remove_cb()
> > > > > >
> > > > > > mdev_device_remove()
> > > > > > create_store()
> > > > > >   mdev_device_create()                   [...]
> > > > > >        device_register()
> > > > > >                                   parent_remove_sysfs_files()
> > > > > >                                   /* BUG: device added by cpu-0
> > > > > >                                    * whose parent is getting removed.
> > > > > >                                    */
> > > > > >
> > > > > > issue-2:
> > > > > > --------
> > > > > >        cpu-0                         cpu-1
> > > > > >        -----                         -----
> > > > > > create_store()
> > > > > >   mdev_device_create()                   [...]
> > > > > >        device_register()
> > > > > >
> > > > > >        [...]                      mdev_unregister_device()
> > > > > >                                      device_for_each_child()
> > > > > >                                         mdev_device_remove_cb()
> > > > > >
> > > > > > mdev_device_remove()
> > > > > >
> > > > > >        mdev_create_sysfs_files()
> > > > > >        /* BUG: create is adding
> > > > > >         * sysfs files for a device
> > > > > >         * which is undergoing removal.
> > > > > >         */
> > > > > >                                  parent_remove_sysfs_files()
> > > > > >
> > > > > > 2. Below crash is observed when user initiated remove is in
> > > > > > progress and mdev_unregister_driver() completes parent
> > > unregistration.
> > > > > >
> > > > > >        cpu-0                         cpu-1
> > > > > >        -----                         -----
> > > > > > remove_store()
> > > > > >    mdev_device_remove()
> > > > > >    active = false;
> > > > > >                                   mdev_unregister_device()
> > > > > >                                     remove type
> > > > > >    [...]
> > > > > >    mdev_remove_ops() crashes.
> > > > > >
> > > > > > This is similar race like create() racing with
> mdev_unregister_device().
> > > > > >
> > > > > > mtty mtty: MDEV: Registered
> > > > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to
> > > > > > group
> > > > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV:
> > > > > > group_id = 57 mtty mtty: MDEV: Unregistering
> > > > > > mtty_dev: Unloaded!
> > > > > > BUG: unable to handle kernel paging request at
> > > > > > ffffffffc027d668 PGD
> > > > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > > > Oops: 0000 [#1] SMP PTI
> > > > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > > > >  remove_store+0x77/0xa0 [mdev]
> > > > > >  kernfs_fop_write+0x113/0x1a0
> > > > > >  __vfs_write+0x33/0x1b0
> > > > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > > > >  vfs_write+0xad/0x1b0
> > > > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > > > >  ksys_write+0x55/0xc0
> > > > > >  do_syscall_64+0x5a/0x210
> > > > > >
> > > > > > Therefore, mdev core is improved to overcome above issues.
> > > > > >
> > > > > > Wait for any ongoing mdev create() and remove() to finish
> > > > > > before unregistering parent device using srcu. This continues
> > > > > > to allow multiple create and remove to progress in parallel.
> > > > > > At the same time guard parent removal while parent is being
> > > > > > access by create() and remove
> > > > > callbacks.
> > > > > >
> > > > > > mdev_device_remove() is refactored to not block on srcu when
> > > > > > device is removed as part of parent removal.
> > > > > >
> > > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > > ---
> > > > > >  drivers/vfio/mdev/mdev_core.c    | 83
> > > > > ++++++++++++++++++++++++++++++++++------
> > > > > >  drivers/vfio/mdev/mdev_private.h |  6 +++
> > > > > >  2 files changed, 77 insertions(+), 12 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > > b/drivers/vfio/mdev/mdev_core.c index aefcf34..fa233c8 100644
> > > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref
> *kref)
> > > > > >  						  ref);
> > > > > >  	struct device *dev = parent->dev;
> > > > > >
> > > > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > > > >  	kfree(parent);
> > > > > >  	put_device(dev);
> > > > > >  }
> > > > > > @@ -147,10 +148,30 @@ static int
> mdev_device_remove_ops(struct
> > > > > mdev_device *mdev, bool force_remove)
> > > > > >  	return 0;
> > > > > >  }
> > > > > >
> > > > > > +static int mdev_device_remove_common(struct mdev_device
> *mdev,
> > > > > > +				     bool force_remove)
> > > > > > +{
> > > > > > +	struct mdev_type *type;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	type = to_mdev_type(mdev->type_kobj);
> > > > >
> > > > > I know you're just moving this into the common function, but I
> > > > > think we're just caching this for aesthetics, the mdev object is
> > > > > still valid after the remove ops and I don't see anything
> > > > > touching this field.  If so, maybe we should remove 'type' or at
> > > > > least set it right before it's used so it doesn't appear that
> > > > > we're preserving it before
> > > the remove op.
> > > > >
> > > > Sure, yes.
> > > > Type assignment should be done just before calling
> > > mdev_remove_sysfs_files().
> > > > Will send v2.
> > > >
> > > > > > +
> > > > > > +	ret = mdev_device_remove_ops(mdev, force_remove);
> > > > > > +	if (ret && !force_remove) {
> > > > > > +		mutex_lock(&mdev_list_lock);
> > > > > > +		mdev->active = true;
> > > > > > +		mutex_unlock(&mdev_list_lock);
> > > > >
> > > > > The mutex around this is a change from the previous code and I'm
> > > > > not sure it adds anything.  If there's a thread testing for
> > > > > active racing with this thread setting active to true, there's
> > > > > no meaningful difference in the result by acquiring the mutex.
> > > > > 'active' may change from false->true during the critical section
> > > > > of the other thread, but I don't think there are any strange out
> > > > > of order things that give the wrong result, the other thread
> > > > > either sees true
> > > or false and continues or exits, regardless of this mutex.
> > > > >
> > > > Yes, I can drop the mutex.
> > > > In future remove sequence fix, this will anyway vanish.
> > > >
> > > > Shall we finish this series with these 7 patches?
> > > > Once you ack it will send v2 for these 7 patches and follow on to
> > > > that we
> > > cleanup the sequencing?
> > >
> > > Do you intend to move the removal of the mdev sanitization loop from
> > > 6/7 to this patch?  I don't think we can really claim that what it's
> > > trying to do is unnecessary until after we have the new code here
> > > that prevents the sysfs remove path from running concurrent to the
> > > parent remove path.  It's not really related to the changes in 6/7
> > > anyway.  In fact, rather than moving that chunk here, it could be
> > > added as a follow-on patch with explanation of why it is now
> > > unnecessary.  Thanks,
> > >
> > Device type cannot change from mdev to something else when it was
> invoked by the remove() sysfs cb.
> > Neither it can be something else in parent removal is passes bus
> comparison check.
> 
> Hi Parav,
> 
> I tried to explain the concern in
> Message-ID: <20190402163309.414c45ad@x1.home> It hinges on the fact
> that
> remove_store() calls device_remove_file_self() before calling
> mdev_device_remove().  Therefore imagine this scenarios:
> 
> Thread A			Thread B
> 
> mdev_device_remove()
>  mdev_remove_sysfs_files()
> 				remove_store()
> 				 device_remove_file_self()
>   sysfs_remove_files...
> 				 mdev_device_remove()
> 				  return -EAGAIN
>                                  device_create_file()
>  device_unregister()
>  mdev_put_parent()
> 
> So Thread B recreated a stale sysfs attribute.  If it prevents the mdev from
> being released via mdev_device_release() then it will forever be !active and
> calling remove store will continue to error and recreate it.  If the mdev does
> get freed, then remove_store() is working with a stale device, which the
> sanitizing loop removed in 6/7 is meant to catch.  Therefore, it makes sense
> to me to relocate that loop removal until after we clean up the mess around
> removal.
> 
If device gets freed in mdev_device_release(), than remove_store() shouldn't find a valid entry.
Isn't it?

> BTW, I exchanged email with Kirti offline and I think we're in agreement
> around your plans to fix this.  The confusion was around whether the vendor
> driver remove callback can be called while the device is still in use, but I
> believe vfio-core will prevent that with the correct bus removal logic in place.
> 
Yes. vfio-core waits there. I think I shared the trace of it.
If this mdev is used by non vfio driver, such as mlx5_core, than remove() of the mlx5_core driver will get called, and there it will follow standard PCI bus style forced removal anyway.
So we are good there.

> So where do we stand on this series?  I think patches 1-5 look good.
Yes. There were more review-by tags that I guess you need to include.

> Should I incorporate them for v5.2?  
Yes. that will be good. So next series can be shorter. :-)

> Patch 6 looks ok, except I'd rather see
> the sanitizing loop stay until we can otherwise fix the race above.  
I can put back the sanitizing look, once it looks valid. Will wait for your response.

> Patch 7 needed more work, iirc.  Thanks,
For a moment if we assume sanitizing loop exists, than patch-7 looks good?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-25 23:29             ` Parav Pandit
@ 2019-04-26 15:33               ` Alex Williamson
  2019-04-26 15:55                 ` Parav Pandit
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2019-04-26 15:33 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, cjia

On Thu, 25 Apr 2019 23:29:26 +0000
Parav Pandit <parav@mellanox.com> wrote:

> Hi Alex,
> 
> First, sorry for my late reply.
> 
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, April 23, 2019 2:22 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com; cjia@nvidia.com
> > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> > life cycle APIs
> > 
> > On Thu, 4 Apr 2019 23:05:43 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > > -----Original Message-----
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Thursday, April 4, 2019 3:44 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > kwankhede@nvidia.com; cjia@nvidia.com
> > > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev
> > > > device life cycle APIs
> > > >
> > > > On Thu, 4 Apr 2019 00:02:22 +0000
> > > > Parav Pandit <parav@mellanox.com> wrote:
> > > >  
> > > > > > -----Original Message-----
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Wednesday, April 3, 2019 4:27 PM
> > > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > > kwankhede@nvidia.com; cjia@nvidia.com
> > > > > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with
> > > > > > mdev device life cycle APIs
> > > > > >
> > > > > > On Tue, 26 Mar 2019 22:45:45 -0500 Parav Pandit
> > > > > > <parav@mellanox.com> wrote:
> > > > > >  
> > > > > > > Below race condition and call trace exist with current device
> > > > > > > life cycle sequence.
> > > > > > >
> > > > > > > 1. In following sequence, child devices created while removing
> > > > > > > mdev parent device can be left out, or it may lead to race of
> > > > > > > removing half initialized child mdev devices.
> > > > > > >
> > > > > > > issue-1:
> > > > > > > --------
> > > > > > >        cpu-0                         cpu-1
> > > > > > >        -----                         -----
> > > > > > >                                   mdev_unregister_device()
> > > > > > >                                      device_for_each_child()
> > > > > > >                                         mdev_device_remove_cb()
> > > > > > >
> > > > > > > mdev_device_remove()
> > > > > > > create_store()
> > > > > > >   mdev_device_create()                   [...]
> > > > > > >        device_register()
> > > > > > >                                   parent_remove_sysfs_files()
> > > > > > >                                   /* BUG: device added by cpu-0
> > > > > > >                                    * whose parent is getting removed.
> > > > > > >                                    */
> > > > > > >
> > > > > > > issue-2:
> > > > > > > --------
> > > > > > >        cpu-0                         cpu-1
> > > > > > >        -----                         -----
> > > > > > > create_store()
> > > > > > >   mdev_device_create()                   [...]
> > > > > > >        device_register()
> > > > > > >
> > > > > > >        [...]                      mdev_unregister_device()
> > > > > > >                                      device_for_each_child()
> > > > > > >                                         mdev_device_remove_cb()
> > > > > > >
> > > > > > > mdev_device_remove()
> > > > > > >
> > > > > > >        mdev_create_sysfs_files()
> > > > > > >        /* BUG: create is adding
> > > > > > >         * sysfs files for a device
> > > > > > >         * which is undergoing removal.
> > > > > > >         */
> > > > > > >                                  parent_remove_sysfs_files()
> > > > > > >
> > > > > > > 2. Below crash is observed when user initiated remove is in
> > > > > > > progress and mdev_unregister_driver() completes parent  
> > > > unregistration.  
> > > > > > >
> > > > > > >        cpu-0                         cpu-1
> > > > > > >        -----                         -----
> > > > > > > remove_store()
> > > > > > >    mdev_device_remove()
> > > > > > >    active = false;
> > > > > > >                                   mdev_unregister_device()
> > > > > > >                                     remove type
> > > > > > >    [...]
> > > > > > >    mdev_remove_ops() crashes.
> > > > > > >
> > > > > > > This is similar race like create() racing with  
> > mdev_unregister_device().  
> > > > > > >
> > > > > > > mtty mtty: MDEV: Registered
> > > > > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to
> > > > > > > group
> > > > > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV:
> > > > > > > group_id = 57 mtty mtty: MDEV: Unregistering
> > > > > > > mtty_dev: Unloaded!
> > > > > > > BUG: unable to handle kernel paging request at
> > > > > > > ffffffffc027d668 PGD
> > > > > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > > > > Oops: 0000 [#1] SMP PTI
> > > > > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > > > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > > > > >  remove_store+0x77/0xa0 [mdev]
> > > > > > >  kernfs_fop_write+0x113/0x1a0
> > > > > > >  __vfs_write+0x33/0x1b0
> > > > > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > > > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > > > > >  vfs_write+0xad/0x1b0
> > > > > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > > > > >  ksys_write+0x55/0xc0
> > > > > > >  do_syscall_64+0x5a/0x210
> > > > > > >
> > > > > > > Therefore, mdev core is improved to overcome above issues.
> > > > > > >
> > > > > > > Wait for any ongoing mdev create() and remove() to finish
> > > > > > > before unregistering parent device using srcu. This continues
> > > > > > > to allow multiple create and remove to progress in parallel.
> > > > > > > At the same time guard parent removal while parent is being
> > > > > > > access by create() and remove  
> > > > > > callbacks.  
> > > > > > >
> > > > > > > mdev_device_remove() is refactored to not block on srcu when
> > > > > > > device is removed as part of parent removal.
> > > > > > >
> > > > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > > > ---
> > > > > > >  drivers/vfio/mdev/mdev_core.c    | 83  
> > > > > > ++++++++++++++++++++++++++++++++++------  
> > > > > > >  drivers/vfio/mdev/mdev_private.h |  6 +++
> > > > > > >  2 files changed, 77 insertions(+), 12 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > > > b/drivers/vfio/mdev/mdev_core.c index aefcf34..fa233c8 100644
> > > > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref  
> > *kref)  
> > > > > > >  						  ref);
> > > > > > >  	struct device *dev = parent->dev;
> > > > > > >
> > > > > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > > > > >  	kfree(parent);
> > > > > > >  	put_device(dev);
> > > > > > >  }
> > > > > > > @@ -147,10 +148,30 @@ static int  
> > mdev_device_remove_ops(struct  
> > > > > > mdev_device *mdev, bool force_remove)  
> > > > > > >  	return 0;
> > > > > > >  }
> > > > > > >
> > > > > > > +static int mdev_device_remove_common(struct mdev_device  
> > *mdev,  
> > > > > > > +				     bool force_remove)
> > > > > > > +{
> > > > > > > +	struct mdev_type *type;
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	type = to_mdev_type(mdev->type_kobj);  
> > > > > >
> > > > > > I know you're just moving this into the common function, but I
> > > > > > think we're just caching this for aesthetics, the mdev object is
> > > > > > still valid after the remove ops and I don't see anything
> > > > > > touching this field.  If so, maybe we should remove 'type' or at
> > > > > > least set it right before it's used so it doesn't appear that
> > > > > > we're preserving it before  
> > > > the remove op.  
> > > > > >  
> > > > > Sure, yes.
> > > > > Type assignment should be done just before calling  
> > > > mdev_remove_sysfs_files().  
> > > > > Will send v2.
> > > > >  
> > > > > > > +
> > > > > > > +	ret = mdev_device_remove_ops(mdev, force_remove);
> > > > > > > +	if (ret && !force_remove) {
> > > > > > > +		mutex_lock(&mdev_list_lock);
> > > > > > > +		mdev->active = true;
> > > > > > > +		mutex_unlock(&mdev_list_lock);  
> > > > > >
> > > > > > The mutex around this is a change from the previous code and I'm
> > > > > > not sure it adds anything.  If there's a thread testing for
> > > > > > active racing with this thread setting active to true, there's
> > > > > > no meaningful difference in the result by acquiring the mutex.
> > > > > > 'active' may change from false->true during the critical section
> > > > > > of the other thread, but I don't think there are any strange out
> > > > > > of order things that give the wrong result, the other thread
> > > > > > either sees true  
> > > > or false and continues or exits, regardless of this mutex.  
> > > > > >  
> > > > > Yes, I can drop the mutex.
> > > > > In future remove sequence fix, this will anyway vanish.
> > > > >
> > > > > Shall we finish this series with these 7 patches?
> > > > > Once you ack it will send v2 for these 7 patches and follow on to
> > > > > that we  
> > > > cleanup the sequencing?
> > > >
> > > > Do you intend to move the removal of the mdev sanitization loop from
> > > > 6/7 to this patch?  I don't think we can really claim that what it's
> > > > trying to do is unnecessary until after we have the new code here
> > > > that prevents the sysfs remove path from running concurrent to the
> > > > parent remove path.  It's not really related to the changes in 6/7
> > > > anyway.  In fact, rather than moving that chunk here, it could be
> > > > added as a follow-on patch with explanation of why it is now
> > > > unnecessary.  Thanks,
> > > >  
> > > Device type cannot change from mdev to something else when it was  
> > invoked by the remove() sysfs cb.  
> > > Neither it can be something else in parent removal is passes bus  
> > comparison check.
> > 
> > Hi Parav,
> > 
> > I tried to explain the concern in
> > Message-ID: <20190402163309.414c45ad@x1.home> It hinges on the fact
> > that
> > remove_store() calls device_remove_file_self() before calling
> > mdev_device_remove().  Therefore imagine this scenarios:
> > 
> > Thread A			Thread B
> > 
> > mdev_device_remove()
> >  mdev_remove_sysfs_files()
> > 				remove_store()
> > 				 device_remove_file_self()
> >   sysfs_remove_files...
> > 				 mdev_device_remove()
> > 				  return -EAGAIN
> >                                  device_create_file()
> >  device_unregister()
> >  mdev_put_parent()
> > 
> > So Thread B recreated a stale sysfs attribute.  If it prevents the mdev from
> > being released via mdev_device_release() then it will forever be !active and
> > calling remove store will continue to error and recreate it.  If the mdev does
> > get freed, then remove_store() is working with a stale device, which the
> > sanitizing loop removed in 6/7 is meant to catch.  Therefore, it makes sense
> > to me to relocate that loop removal until after we clean up the mess around
> > removal.
> >   
> If device gets freed in mdev_device_release(), than remove_store() shouldn't find a valid entry.
> Isn't it?

That's the question that I haven't tested or investigated further, if
remove_store() re-adds the sysfs file after mdev_device_remove() would
remove it, does the sysfs attribute still exist and can remove_store()
still be called with a bogus pointer.
 
> > BTW, I exchanged email with Kirti offline and I think we're in
> > agreement around your plans to fix this.  The confusion was around
> > whether the vendor driver remove callback can be called while the
> > device is still in use, but I believe vfio-core will prevent that
> > with the correct bus removal logic in place. 
> Yes. vfio-core waits there. I think I shared the trace of it.
> If this mdev is used by non vfio driver, such as mlx5_core, than
> remove() of the mlx5_core driver will get called, and there it will
> follow standard PCI bus style forced removal anyway. So we are good
> there.
> 
> > So where do we stand on this series?  I think patches 1-5 look
> > good.  
> Yes. There were more review-by tags that I guess you need to include.

Yep, got 'em.

> > Should I incorporate them for v5.2?    
> Yes. that will be good. So next series can be shorter. :-)
> 
> > Patch 6 looks ok, except I'd rather see
> > the sanitizing loop stay until we can otherwise fix the race
> > above.    
> I can put back the sanitizing look, once it looks valid. Will wait
> for your response.

Yep, I think patch 6 is good w/o the removal of the sanitizing loop.
Will you repost it?
 
> > Patch 7 needed more work, iirc.  Thanks,  
> For a moment if we assume sanitizing loop exists, than patch-7 looks
> good?

Patch 7 is a bit less trivial, so I think as we're close to the merge
window for v5.2, I'd rather push it out to be included with the later
re-works.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-26 15:33               ` Alex Williamson
@ 2019-04-26 15:55                 ` Parav Pandit
  2019-04-26 16:09                   ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-04-26 15:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, kwankhede, cjia



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 26, 2019 10:34 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com; cjia@nvidia.com
> Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> life cycle APIs
> 
> On Thu, 25 Apr 2019 23:29:26 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > Hi Alex,
> >
> > First, sorry for my late reply.
> >
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Tuesday, April 23, 2019 2:22 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com; cjia@nvidia.com
> > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev
> > > device life cycle APIs
> > >
> > > On Thu, 4 Apr 2019 23:05:43 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Thursday, April 4, 2019 3:44 PM
> > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > kwankhede@nvidia.com; cjia@nvidia.com
> > > > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with
> > > > > mdev device life cycle APIs
> > > > >
> > > > > On Thu, 4 Apr 2019 00:02:22 +0000 Parav Pandit
> > > > > <parav@mellanox.com> wrote:
> > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > Sent: Wednesday, April 3, 2019 4:27 PM
> > > > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > > > kwankhede@nvidia.com; cjia@nvidia.com
> > > > > > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions
> > > > > > > with mdev device life cycle APIs
> > > > > > >
> > > > > > > On Tue, 26 Mar 2019 22:45:45 -0500 Parav Pandit
> > > > > > > <parav@mellanox.com> wrote:
> > > > > > >
> > > > > > > > Below race condition and call trace exist with current
> > > > > > > > device life cycle sequence.
> > > > > > > >
> > > > > > > > 1. In following sequence, child devices created while
> > > > > > > > removing mdev parent device can be left out, or it may
> > > > > > > > lead to race of removing half initialized child mdev devices.
> > > > > > > >
> > > > > > > > issue-1:
> > > > > > > > --------
> > > > > > > >        cpu-0                         cpu-1
> > > > > > > >        -----                         -----
> > > > > > > >                                   mdev_unregister_device()
> > > > > > > >                                      device_for_each_child()
> > > > > > > >
> > > > > > > > mdev_device_remove_cb()
> > > > > > > >
> > > > > > > > mdev_device_remove()
> > > > > > > > create_store()
> > > > > > > >   mdev_device_create()                   [...]
> > > > > > > >        device_register()
> > > > > > > >                                   parent_remove_sysfs_files()
> > > > > > > >                                   /* BUG: device added by cpu-0
> > > > > > > >                                    * whose parent is getting removed.
> > > > > > > >                                    */
> > > > > > > >
> > > > > > > > issue-2:
> > > > > > > > --------
> > > > > > > >        cpu-0                         cpu-1
> > > > > > > >        -----                         -----
> > > > > > > > create_store()
> > > > > > > >   mdev_device_create()                   [...]
> > > > > > > >        device_register()
> > > > > > > >
> > > > > > > >        [...]                      mdev_unregister_device()
> > > > > > > >                                      device_for_each_child()
> > > > > > > >
> > > > > > > > mdev_device_remove_cb()
> > > > > > > >
> > > > > > > > mdev_device_remove()
> > > > > > > >
> > > > > > > >        mdev_create_sysfs_files()
> > > > > > > >        /* BUG: create is adding
> > > > > > > >         * sysfs files for a device
> > > > > > > >         * which is undergoing removal.
> > > > > > > >         */
> > > > > > > >
> > > > > > > > parent_remove_sysfs_files()
> > > > > > > >
> > > > > > > > 2. Below crash is observed when user initiated remove is
> > > > > > > > in progress and mdev_unregister_driver() completes parent
> > > > > unregistration.
> > > > > > > >
> > > > > > > >        cpu-0                         cpu-1
> > > > > > > >        -----                         -----
> > > > > > > > remove_store()
> > > > > > > >    mdev_device_remove()
> > > > > > > >    active = false;
> > > > > > > >                                   mdev_unregister_device()
> > > > > > > >                                     remove type
> > > > > > > >    [...]
> > > > > > > >    mdev_remove_ops() crashes.
> > > > > > > >
> > > > > > > > This is similar race like create() racing with
> > > mdev_unregister_device().
> > > > > > > >
> > > > > > > > mtty mtty: MDEV: Registered
> > > > > > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
> > > > > > > > to group
> > > > > > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV:
> > > > > > > > group_id = 57 mtty mtty: MDEV: Unregistering
> > > > > > > > mtty_dev: Unloaded!
> > > > > > > > BUG: unable to handle kernel paging request at
> > > > > > > > ffffffffc027d668 PGD
> > > > > > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > > > > > Oops: 0000 [#1] SMP PTI
> > > > > > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > > > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > > > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > > > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call
> Trace:
> > > > > > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > > > > > >  remove_store+0x77/0xa0 [mdev]
> > > > > > > >  kernfs_fop_write+0x113/0x1a0
> > > > > > > >  __vfs_write+0x33/0x1b0
> > > > > > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > > > > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > > > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > > > > > >  vfs_write+0xad/0x1b0
> > > > > > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > > > > > >  ksys_write+0x55/0xc0
> > > > > > > >  do_syscall_64+0x5a/0x210
> > > > > > > >
> > > > > > > > Therefore, mdev core is improved to overcome above issues.
> > > > > > > >
> > > > > > > > Wait for any ongoing mdev create() and remove() to finish
> > > > > > > > before unregistering parent device using srcu. This
> > > > > > > > continues to allow multiple create and remove to progress in
> parallel.
> > > > > > > > At the same time guard parent removal while parent is
> > > > > > > > being access by create() and remove
> > > > > > > callbacks.
> > > > > > > >
> > > > > > > > mdev_device_remove() is refactored to not block on srcu
> > > > > > > > when device is removed as part of parent removal.
> > > > > > > >
> > > > > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > > > > ---
> > > > > > > >  drivers/vfio/mdev/mdev_core.c    | 83
> > > > > > > ++++++++++++++++++++++++++++++++++------
> > > > > > > >  drivers/vfio/mdev/mdev_private.h |  6 +++
> > > > > > > >  2 files changed, 77 insertions(+), 12 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > > > > b/drivers/vfio/mdev/mdev_core.c index aefcf34..fa233c8
> > > > > > > > 100644
> > > > > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct
> > > > > > > > kref
> > > *kref)
> > > > > > > >  						  ref);
> > > > > > > >  	struct device *dev = parent->dev;
> > > > > > > >
> > > > > > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > > > > > >  	kfree(parent);
> > > > > > > >  	put_device(dev);
> > > > > > > >  }
> > > > > > > > @@ -147,10 +148,30 @@ static int
> > > mdev_device_remove_ops(struct
> > > > > > > mdev_device *mdev, bool force_remove)
> > > > > > > >  	return 0;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static int mdev_device_remove_common(struct mdev_device
> > > *mdev,
> > > > > > > > +				     bool force_remove) {
> > > > > > > > +	struct mdev_type *type;
> > > > > > > > +	int ret;
> > > > > > > > +
> > > > > > > > +	type = to_mdev_type(mdev->type_kobj);
> > > > > > >
> > > > > > > I know you're just moving this into the common function, but
> > > > > > > I think we're just caching this for aesthetics, the mdev
> > > > > > > object is still valid after the remove ops and I don't see
> > > > > > > anything touching this field.  If so, maybe we should remove
> > > > > > > 'type' or at least set it right before it's used so it
> > > > > > > doesn't appear that we're preserving it before
> > > > > the remove op.
> > > > > > >
> > > > > > Sure, yes.
> > > > > > Type assignment should be done just before calling
> > > > > mdev_remove_sysfs_files().
> > > > > > Will send v2.
> > > > > >
> > > > > > > > +
> > > > > > > > +	ret = mdev_device_remove_ops(mdev, force_remove);
> > > > > > > > +	if (ret && !force_remove) {
> > > > > > > > +		mutex_lock(&mdev_list_lock);
> > > > > > > > +		mdev->active = true;
> > > > > > > > +		mutex_unlock(&mdev_list_lock);
> > > > > > >
> > > > > > > The mutex around this is a change from the previous code and
> > > > > > > I'm not sure it adds anything.  If there's a thread testing
> > > > > > > for active racing with this thread setting active to true,
> > > > > > > there's no meaningful difference in the result by acquiring the
> mutex.
> > > > > > > 'active' may change from false->true during the critical
> > > > > > > section of the other thread, but I don't think there are any
> > > > > > > strange out of order things that give the wrong result, the
> > > > > > > other thread either sees true
> > > > > or false and continues or exits, regardless of this mutex.
> > > > > > >
> > > > > > Yes, I can drop the mutex.
> > > > > > In future remove sequence fix, this will anyway vanish.
> > > > > >
> > > > > > Shall we finish this series with these 7 patches?
> > > > > > Once you ack it will send v2 for these 7 patches and follow on
> > > > > > to that we
> > > > > cleanup the sequencing?
> > > > >
> > > > > Do you intend to move the removal of the mdev sanitization loop
> > > > > from
> > > > > 6/7 to this patch?  I don't think we can really claim that what
> > > > > it's trying to do is unnecessary until after we have the new
> > > > > code here that prevents the sysfs remove path from running
> > > > > concurrent to the parent remove path.  It's not really related
> > > > > to the changes in 6/7 anyway.  In fact, rather than moving that
> > > > > chunk here, it could be added as a follow-on patch with
> > > > > explanation of why it is now unnecessary.  Thanks,
> > > > >
> > > > Device type cannot change from mdev to something else when it was
> > > invoked by the remove() sysfs cb.
> > > > Neither it can be something else in parent removal is passes bus
> > > comparison check.
> > >
> > > Hi Parav,
> > >
> > > I tried to explain the concern in
> > > Message-ID: <20190402163309.414c45ad@x1.home> It hinges on the fact
> > > that
> > > remove_store() calls device_remove_file_self() before calling
> > > mdev_device_remove().  Therefore imagine this scenarios:
> > >
> > > Thread A			Thread B
> > >
> > > mdev_device_remove()
> > >  mdev_remove_sysfs_files()
> > > 				remove_store()
> > > 				 device_remove_file_self()
> > >   sysfs_remove_files...
> > > 				 mdev_device_remove()
> > > 				  return -EAGAIN
> > >                                  device_create_file()
> > >  device_unregister()
> > >  mdev_put_parent()
> > >
> > > So Thread B recreated a stale sysfs attribute.  If it prevents the
> > > mdev from being released via mdev_device_release() then it will
> > > forever be !active and calling remove store will continue to error
> > > and recreate it.  If the mdev does get freed, then remove_store() is
> > > working with a stale device, which the sanitizing loop removed in
> > > 6/7 is meant to catch.  Therefore, it makes sense to me to relocate
> > > that loop removal until after we clean up the mess around removal.
> > >
> > If device gets freed in mdev_device_release(), than remove_store()
> shouldn't find a valid entry.
> > Isn't it?
> 
> That's the question that I haven't tested or investigated further, if
> remove_store() re-adds the sysfs file after mdev_device_remove() would
> remove it, does the sysfs attribute still exist and can remove_store() still be
> called with a bogus pointer.
> 
> > > BTW, I exchanged email with Kirti offline and I think we're in
> > > agreement around your plans to fix this.  The confusion was around
> > > whether the vendor driver remove callback can be called while the
> > > device is still in use, but I believe vfio-core will prevent that
> > > with the correct bus removal logic in place.
> > Yes. vfio-core waits there. I think I shared the trace of it.
> > If this mdev is used by non vfio driver, such as mlx5_core, than
> > remove() of the mlx5_core driver will get called, and there it will
> > follow standard PCI bus style forced removal anyway. So we are good
> > there.
> >
> > > So where do we stand on this series?  I think patches 1-5 look good.
> > Yes. There were more review-by tags that I guess you need to include.
> 
> Yep, got 'em.
> 
> > > Should I incorporate them for v5.2?
> > Yes. that will be good. So next series can be shorter. :-)
> >
> > > Patch 6 looks ok, except I'd rather see the sanitizing loop stay
> > > until we can otherwise fix the race
> > > above.
> > I can put back the sanitizing look, once it looks valid. Will wait for
> > your response.
> 
> Yep, I think patch 6 is good w/o the removal of the sanitizing loop.
> Will you repost it?
> 
Just the patch-6 or 1 to 6?

> > > Patch 7 needed more work, iirc.  Thanks,
> > For a moment if we assume sanitizing loop exists, than patch-7 looks
> > good?
> 
> Patch 7 is a bit less trivial, so I think as we're close to the merge window for
> v5.2, I'd rather push it out to be included with the later re-works.  Thanks,
> 
I agree it little less trivial, I tried to place as much comment as possible, but it is an important fix.
Let me repost 6-7 and decide if it can be included or not?

> Alex

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-26 15:55                 ` Parav Pandit
@ 2019-04-26 16:09                   ` Alex Williamson
  2019-04-26 19:02                     ` Parav Pandit
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2019-04-26 16:09 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, cjia

On Fri, 26 Apr 2019 15:55:32 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 26, 2019 10:34 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com; cjia@nvidia.com
> > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> > life cycle APIs
> > 
> > On Thu, 25 Apr 2019 23:29:26 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > Hi Alex,
> > >
> > > First, sorry for my late reply.
> > >  
> > > > -----Original Message-----
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Tuesday, April 23, 2019 2:22 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > kwankhede@nvidia.com; cjia@nvidia.com
> > > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev
> > > > device life cycle APIs
> > > >
> > > > On Thu, 4 Apr 2019 23:05:43 +0000
> > > > Parav Pandit <parav@mellanox.com> wrote:
> > > >  
> > > > > > -----Original Message-----
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Thursday, April 4, 2019 3:44 PM
> > > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > > kwankhede@nvidia.com; cjia@nvidia.com
> > > > > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with
> > > > > > mdev device life cycle APIs
> > > > > >
> > > > > > On Thu, 4 Apr 2019 00:02:22 +0000 Parav Pandit
> > > > > > <parav@mellanox.com> wrote:
> > > > > >  
> > > > > > > > -----Original Message-----
> > > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > Sent: Wednesday, April 3, 2019 4:27 PM
> > > > > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > > > > kwankhede@nvidia.com; cjia@nvidia.com
> > > > > > > > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions
> > > > > > > > with mdev device life cycle APIs
> > > > > > > >
> > > > > > > > On Tue, 26 Mar 2019 22:45:45 -0500 Parav Pandit
> > > > > > > > <parav@mellanox.com> wrote:
> > > > > > > >  
> > > > > > > > > Below race condition and call trace exist with current
> > > > > > > > > device life cycle sequence.
> > > > > > > > >
> > > > > > > > > 1. In following sequence, child devices created while
> > > > > > > > > removing mdev parent device can be left out, or it may
> > > > > > > > > lead to race of removing half initialized child mdev devices.
> > > > > > > > >
> > > > > > > > > issue-1:
> > > > > > > > > --------
> > > > > > > > >        cpu-0                         cpu-1
> > > > > > > > >        -----                         -----
> > > > > > > > >                                   mdev_unregister_device()
> > > > > > > > >                                      device_for_each_child()
> > > > > > > > >
> > > > > > > > > mdev_device_remove_cb()
> > > > > > > > >
> > > > > > > > > mdev_device_remove()
> > > > > > > > > create_store()
> > > > > > > > >   mdev_device_create()                   [...]
> > > > > > > > >        device_register()
> > > > > > > > >                                   parent_remove_sysfs_files()
> > > > > > > > >                                   /* BUG: device added by cpu-0
> > > > > > > > >                                    * whose parent is getting removed.
> > > > > > > > >                                    */
> > > > > > > > >
> > > > > > > > > issue-2:
> > > > > > > > > --------
> > > > > > > > >        cpu-0                         cpu-1
> > > > > > > > >        -----                         -----
> > > > > > > > > create_store()
> > > > > > > > >   mdev_device_create()                   [...]
> > > > > > > > >        device_register()
> > > > > > > > >
> > > > > > > > >        [...]                      mdev_unregister_device()
> > > > > > > > >                                      device_for_each_child()
> > > > > > > > >
> > > > > > > > > mdev_device_remove_cb()
> > > > > > > > >
> > > > > > > > > mdev_device_remove()
> > > > > > > > >
> > > > > > > > >        mdev_create_sysfs_files()
> > > > > > > > >        /* BUG: create is adding
> > > > > > > > >         * sysfs files for a device
> > > > > > > > >         * which is undergoing removal.
> > > > > > > > >         */
> > > > > > > > >
> > > > > > > > > parent_remove_sysfs_files()
> > > > > > > > >
> > > > > > > > > 2. Below crash is observed when user initiated remove is
> > > > > > > > > in progress and mdev_unregister_driver() completes parent  
> > > > > > unregistration.  
> > > > > > > > >
> > > > > > > > >        cpu-0                         cpu-1
> > > > > > > > >        -----                         -----
> > > > > > > > > remove_store()
> > > > > > > > >    mdev_device_remove()
> > > > > > > > >    active = false;
> > > > > > > > >                                   mdev_unregister_device()
> > > > > > > > >                                     remove type
> > > > > > > > >    [...]
> > > > > > > > >    mdev_remove_ops() crashes.
> > > > > > > > >
> > > > > > > > > This is similar race like create() racing with  
> > > > mdev_unregister_device().  
> > > > > > > > >
> > > > > > > > > mtty mtty: MDEV: Registered
> > > > > > > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
> > > > > > > > > to group
> > > > > > > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV:
> > > > > > > > > group_id = 57 mtty mtty: MDEV: Unregistering
> > > > > > > > > mtty_dev: Unloaded!
> > > > > > > > > BUG: unable to handle kernel paging request at
> > > > > > > > > ffffffffc027d668 PGD
> > > > > > > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > > > > > > Oops: 0000 [#1] SMP PTI
> > > > > > > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > > > > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > > > > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > > > > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call  
> > Trace:  
> > > > > > > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > > > > > > >  remove_store+0x77/0xa0 [mdev]
> > > > > > > > >  kernfs_fop_write+0x113/0x1a0
> > > > > > > > >  __vfs_write+0x33/0x1b0
> > > > > > > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > > > > > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > > > > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > > > > > > >  vfs_write+0xad/0x1b0
> > > > > > > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > > > > > > >  ksys_write+0x55/0xc0
> > > > > > > > >  do_syscall_64+0x5a/0x210
> > > > > > > > >
> > > > > > > > > Therefore, mdev core is improved to overcome above issues.
> > > > > > > > >
> > > > > > > > > Wait for any ongoing mdev create() and remove() to finish
> > > > > > > > > before unregistering parent device using srcu. This
> > > > > > > > > continues to allow multiple create and remove to progress in  
> > parallel.  
> > > > > > > > > At the same time guard parent removal while parent is
> > > > > > > > > being access by create() and remove  
> > > > > > > > callbacks.  
> > > > > > > > >
> > > > > > > > > mdev_device_remove() is refactored to not block on srcu
> > > > > > > > > when device is removed as part of parent removal.
> > > > > > > > >
> > > > > > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > > > > > ---
> > > > > > > > >  drivers/vfio/mdev/mdev_core.c    | 83  
> > > > > > > > ++++++++++++++++++++++++++++++++++------  
> > > > > > > > >  drivers/vfio/mdev/mdev_private.h |  6 +++
> > > > > > > > >  2 files changed, 77 insertions(+), 12 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > > > > > b/drivers/vfio/mdev/mdev_core.c index aefcf34..fa233c8
> > > > > > > > > 100644
> > > > > > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > > > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct
> > > > > > > > > kref  
> > > > *kref)  
> > > > > > > > >  						  ref);
> > > > > > > > >  	struct device *dev = parent->dev;
> > > > > > > > >
> > > > > > > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > > > > > > >  	kfree(parent);
> > > > > > > > >  	put_device(dev);
> > > > > > > > >  }
> > > > > > > > > @@ -147,10 +148,30 @@ static int  
> > > > mdev_device_remove_ops(struct  
> > > > > > > > mdev_device *mdev, bool force_remove)  
> > > > > > > > >  	return 0;
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static int mdev_device_remove_common(struct mdev_device  
> > > > *mdev,  
> > > > > > > > > +				     bool force_remove) {
> > > > > > > > > +	struct mdev_type *type;
> > > > > > > > > +	int ret;
> > > > > > > > > +
> > > > > > > > > +	type = to_mdev_type(mdev->type_kobj);  
> > > > > > > >
> > > > > > > > I know you're just moving this into the common function, but
> > > > > > > > I think we're just caching this for aesthetics, the mdev
> > > > > > > > object is still valid after the remove ops and I don't see
> > > > > > > > anything touching this field.  If so, maybe we should remove
> > > > > > > > 'type' or at least set it right before it's used so it
> > > > > > > > doesn't appear that we're preserving it before  
> > > > > > the remove op.  
> > > > > > > >  
> > > > > > > Sure, yes.
> > > > > > > Type assignment should be done just before calling  
> > > > > > mdev_remove_sysfs_files().  
> > > > > > > Will send v2.
> > > > > > >  
> > > > > > > > > +
> > > > > > > > > +	ret = mdev_device_remove_ops(mdev, force_remove);
> > > > > > > > > +	if (ret && !force_remove) {
> > > > > > > > > +		mutex_lock(&mdev_list_lock);
> > > > > > > > > +		mdev->active = true;
> > > > > > > > > +		mutex_unlock(&mdev_list_lock);  
> > > > > > > >
> > > > > > > > The mutex around this is a change from the previous code and
> > > > > > > > I'm not sure it adds anything.  If there's a thread testing
> > > > > > > > for active racing with this thread setting active to true,
> > > > > > > > there's no meaningful difference in the result by acquiring the  
> > mutex.  
> > > > > > > > 'active' may change from false->true during the critical
> > > > > > > > section of the other thread, but I don't think there are any
> > > > > > > > strange out of order things that give the wrong result, the
> > > > > > > > other thread either sees true  
> > > > > > or false and continues or exits, regardless of this mutex.  
> > > > > > > >  
> > > > > > > Yes, I can drop the mutex.
> > > > > > > In future remove sequence fix, this will anyway vanish.
> > > > > > >
> > > > > > > Shall we finish this series with these 7 patches?
> > > > > > > Once you ack it will send v2 for these 7 patches and follow on
> > > > > > > to that we  
> > > > > > cleanup the sequencing?
> > > > > >
> > > > > > Do you intend to move the removal of the mdev sanitization loop
> > > > > > from
> > > > > > 6/7 to this patch?  I don't think we can really claim that what
> > > > > > it's trying to do is unnecessary until after we have the new
> > > > > > code here that prevents the sysfs remove path from running
> > > > > > concurrent to the parent remove path.  It's not really related
> > > > > > to the changes in 6/7 anyway.  In fact, rather than moving that
> > > > > > chunk here, it could be added as a follow-on patch with
> > > > > > explanation of why it is now unnecessary.  Thanks,
> > > > > >  
> > > > > Device type cannot change from mdev to something else when it was  
> > > > invoked by the remove() sysfs cb.  
> > > > > Neither it can be something else in parent removal is passes bus  
> > > > comparison check.
> > > >
> > > > Hi Parav,
> > > >
> > > > I tried to explain the concern in
> > > > Message-ID: <20190402163309.414c45ad@x1.home> It hinges on the fact
> > > > that
> > > > remove_store() calls device_remove_file_self() before calling
> > > > mdev_device_remove().  Therefore imagine this scenarios:
> > > >
> > > > Thread A			Thread B
> > > >
> > > > mdev_device_remove()
> > > >  mdev_remove_sysfs_files()
> > > > 				remove_store()
> > > > 				 device_remove_file_self()
> > > >   sysfs_remove_files...
> > > > 				 mdev_device_remove()
> > > > 				  return -EAGAIN
> > > >                                  device_create_file()
> > > >  device_unregister()
> > > >  mdev_put_parent()
> > > >
> > > > So Thread B recreated a stale sysfs attribute.  If it prevents the
> > > > mdev from being released via mdev_device_release() then it will
> > > > forever be !active and calling remove store will continue to error
> > > > and recreate it.  If the mdev does get freed, then remove_store() is
> > > > working with a stale device, which the sanitizing loop removed in
> > > > 6/7 is meant to catch.  Therefore, it makes sense to me to relocate
> > > > that loop removal until after we clean up the mess around removal.
> > > >  
> > > If device gets freed in mdev_device_release(), than remove_store()  
> > shouldn't find a valid entry.  
> > > Isn't it?  
> > 
> > That's the question that I haven't tested or investigated further, if
> > remove_store() re-adds the sysfs file after mdev_device_remove() would
> > remove it, does the sysfs attribute still exist and can remove_store() still be
> > called with a bogus pointer.
> >   
> > > > BTW, I exchanged email with Kirti offline and I think we're in
> > > > agreement around your plans to fix this.  The confusion was around
> > > > whether the vendor driver remove callback can be called while the
> > > > device is still in use, but I believe vfio-core will prevent that
> > > > with the correct bus removal logic in place.  
> > > Yes. vfio-core waits there. I think I shared the trace of it.
> > > If this mdev is used by non vfio driver, such as mlx5_core, than
> > > remove() of the mlx5_core driver will get called, and there it will
> > > follow standard PCI bus style forced removal anyway. So we are good
> > > there.
> > >  
> > > > So where do we stand on this series?  I think patches 1-5 look good.  
> > > Yes. There were more review-by tags that I guess you need to include.  
> > 
> > Yep, got 'em.
> >   
> > > > Should I incorporate them for v5.2?  
> > > Yes. that will be good. So next series can be shorter. :-)
> > >  
> > > > Patch 6 looks ok, except I'd rather see the sanitizing loop stay
> > > > until we can otherwise fix the race
> > > > above.  
> > > I can put back the sanitizing look, once it looks valid. Will wait for
> > > your response.  
> > 
> > Yep, I think patch 6 is good w/o the removal of the sanitizing loop.
> > Will you repost it?
> >   
> Just the patch-6 or 1 to 6?

Your choice, please roll in reviews/acks if you repost the rest.

> > > > Patch 7 needed more work, iirc.  Thanks,  
> > > For a moment if we assume sanitizing loop exists, than patch-7 looks
> > > good?  
> > 
> > Patch 7 is a bit less trivial, so I think as we're close to the merge window for
> > v5.2, I'd rather push it out to be included with the later re-works.  Thanks,
> >   
> I agree it little less trivial, I tried to place as much comment as possible, but it is an important fix.
> Let me repost 6-7 and decide if it can be included or not?

Sounds good.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-26 16:09                   ` Alex Williamson
@ 2019-04-26 19:02                     ` Parav Pandit
  2019-04-26 21:58                       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Parav Pandit @ 2019-04-26 19:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, kwankhede, cjia

Hi Alex,


> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 26, 2019 11:09 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com; cjia@nvidia.com
> Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> life cycle APIs

[..]

> > > >
> > > > > Patch 6 looks ok, except I'd rather see the sanitizing loop stay
> > > > > until we can otherwise fix the race above.
> > > > I can put back the sanitizing look, once it looks valid. Will wait
> > > > for your response.
> > >
> > > Yep, I think patch 6 is good w/o the removal of the sanitizing loop.
> > > Will you repost it?
> > >
> > Just the patch-6 or 1 to 6?
> 
> Your choice, please roll in reviews/acks if you repost the rest.
> 
> > > > > Patch 7 needed more work, iirc.  Thanks,
> > > > For a moment if we assume sanitizing loop exists, than patch-7
> > > > looks good?
> > >
> > > Patch 7 is a bit less trivial, so I think as we're close to the
> > > merge window for v5.2, I'd rather push it out to be included with
> > > the later re-works.  Thanks,
> > >
> > I agree it little less trivial, I tried to place as much comment as possible,
> but it is an important fix.
> > Let me repost 6-7 and decide if it can be included or not?
> 
> Sounds good.  Thanks,
> 
I am dropping patch-7 for today and reworking the patch-6 for now.

Even after keeping that that crazy loop, I am easily able to create this below [1] call trace on adding file when mdev_remove() fails with the thread sequence we discussed above.

I think this is high time, we fix the sequence to match the linux bus sequence.
I have some cycles this week.
Post these 6 patches,
I like to get total 3 patches done.
1. fix the bus sequence
2. race with parent device removal
3. do not try to add sysfs file on remove() failure

Is there any possibility above 3 patches can make to 5.2, given that merge window closes in June?
If yes, I will get them in 2-3 days. I will test with sample drivers and mlx5 driver.
Can we get some tests also done from Kirti also done on their hw?

kernel: WARNING: CPU: 2 PID: 9348 at fs/sysfs/file.c:327 sysfs_create_file_ns+0x7f/0x90
kernel: CPU: 2 PID: 9348 Comm: bash Kdump: loaded Not tainted 5.1.0-rc6-vdevbus+ #6
kernel: Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
kernel: RIP: 0010:sysfs_create_file_ns+0x7f/0x90
kernel: Call Trace:
kernel: remove_store+0xdc/0x100 [mdev]
kernel: kernfs_fop_write+0x113/0x1a0
kernel: vfs_write+0xad/0x1b0
kernel: ksys_write+0x5a/0xe0
kernel: do_syscall_64+0x5a/0x210
kernel: entry_SYSCALL_64_after_hwframe+0x49/0xbe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs
  2019-04-26 19:02                     ` Parav Pandit
@ 2019-04-26 21:58                       ` Alex Williamson
  0 siblings, 0 replies; 28+ messages in thread
From: Alex Williamson @ 2019-04-26 21:58 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede, cjia

On Fri, 26 Apr 2019 19:02:40 +0000
Parav Pandit <parav@mellanox.com> wrote:

> Hi Alex,
> 
> 
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 26, 2019 11:09 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com; cjia@nvidia.com
> > Subject: Re: [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device
> > life cycle APIs  
> 
> [..]
> 
> > > > >  
> > > > > > Patch 6 looks ok, except I'd rather see the sanitizing loop stay
> > > > > > until we can otherwise fix the race above.  
> > > > > I can put back the sanitizing look, once it looks valid. Will wait
> > > > > for your response.  
> > > >
> > > > Yep, I think patch 6 is good w/o the removal of the sanitizing loop.
> > > > Will you repost it?
> > > >  
> > > Just the patch-6 or 1 to 6?  
> > 
> > Your choice, please roll in reviews/acks if you repost the rest.
> >   
> > > > > > Patch 7 needed more work, iirc.  Thanks,  
> > > > > For a moment if we assume sanitizing loop exists, than patch-7
> > > > > looks good?  
> > > >
> > > > Patch 7 is a bit less trivial, so I think as we're close to the
> > > > merge window for v5.2, I'd rather push it out to be included with
> > > > the later re-works.  Thanks,
> > > >  
> > > I agree it little less trivial, I tried to place as much comment as possible,  
> > but it is an important fix.  
> > > Let me repost 6-7 and decide if it can be included or not?  
> > 
> > Sounds good.  Thanks,
> >   
> I am dropping patch-7 for today and reworking the patch-6 for now.
> 
> Even after keeping that that crazy loop, I am easily able to create this below [1] call trace on adding file when mdev_remove() fails with the thread sequence we discussed above.
> 
> I think this is high time, we fix the sequence to match the linux bus sequence.
> I have some cycles this week.
> Post these 6 patches,
> I like to get total 3 patches done.
> 1. fix the bus sequence
> 2. race with parent device removal
> 3. do not try to add sysfs file on remove() failure
> 
> Is there any possibility above 3 patches can make to 5.2, given that merge window closes in June?
> If yes, I will get them in 2-3 days. I will test with sample drivers and mlx5 driver.
> Can we get some tests also done from Kirti also done on their hw?

It depends how soon they stabilize and how invasive they are.  These
are bug fixes, so we can consider them for after the merge window
(which will close before the end of May), but the longer they take to
stabilize and the more significant the change, the more likely I'd be
to wait for 5.3.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-04-26 21:58 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-27  3:45 [PATCHv1 0/7] vfio/mdev: Improve vfio/mdev core module Parav Pandit
2019-03-27  3:45 ` [PATCHv1 1/7] vfio/mdev: Avoid release parent reference during error path Parav Pandit
2019-04-01 16:58   ` Cornelia Huck
2019-03-27  3:45 ` [PATCHv1 2/7] vfio/mdev: Removed unused kref Parav Pandit
2019-04-01 17:01   ` Cornelia Huck
2019-03-27  3:45 ` [PATCHv1 3/7] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
2019-04-01 17:02   ` Cornelia Huck
2019-03-27  3:45 ` [PATCHv1 4/7] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
2019-04-01 17:21   ` Cornelia Huck
2019-03-27  3:45 ` [PATCHv1 5/7] vfio/mdev: Follow correct remove sequence Parav Pandit
2019-04-01 17:24   ` Cornelia Huck
2019-03-27  3:45 ` [PATCHv1 6/7] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
2019-04-01 17:39   ` Cornelia Huck
2019-04-02 19:59     ` Parav Pandit
2019-04-02 22:33       ` Alex Williamson
2019-04-03  6:34         ` Parav Pandit
2019-03-27  3:45 ` [PATCHv1 7/7] vfio/mdev: Fix race conditions with mdev device life cycle APIs Parav Pandit
2019-04-03 21:27   ` Alex Williamson
2019-04-04  0:02     ` Parav Pandit
2019-04-04 20:44       ` Alex Williamson
2019-04-04 23:05         ` Parav Pandit
2019-04-23 19:21           ` Alex Williamson
2019-04-25 23:29             ` Parav Pandit
2019-04-26 15:33               ` Alex Williamson
2019-04-26 15:55                 ` Parav Pandit
2019-04-26 16:09                   ` Alex Williamson
2019-04-26 19:02                     ` Parav Pandit
2019-04-26 21:58                       ` Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).