linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module
@ 2019-03-22 23:20 Parav Pandit
  2019-03-22 23:20 ` [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure Parav Pandit
                   ` (7 more replies)
  0 siblings, 8 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-22 23:20 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: parav

We would like to use the mdev subsystem for a wider use case, as
discussed in [1] and [2] and in an offline discussion.
This use case was also discussed with a wider forum in [4], in the track
'Lightweight NIC HW functions for container offload use cases'.

This series is prep work and improves the vfio/mdev module in the
following ways.

Patch 1 and 2 fix reference handling during error unwinding
of mdev create and mdev parent registration.
Patch 3 removes the unused kref from the mdev device.
Patch 4 drops the redundant extern prefix of exported symbols.
Patch 5 returns the actual error code from the vendor driver.
Patch 6 fixes the sysfs remove sequence.
Patch 7 fixes removal of all child devices when removal of one of them fails.
Patch 8 improves mdev in the following ways.

1. Fix race conditions among the mdev parent's create(), remove() and
mdev parent unregistration routines that lead to call traces.

2. Set up the vendor mdev device before placing the device on the mdev bus.
This ensures that vfio_mdev, or any other module that accesses the mdev,
sees a fully set-up device in any of the callbacks registered through
mdev_register_driver().
This now follows the Linux driver model.
Similarly, follow the exact reverse sequence on remove, i.e. take the
device off the bus first, before removing the underlying hardware mdev.

This series is tested using
(a) mtty with a VM using the vfio_mdev driver for positive tests.
(b) mtty with vfio_mdev for error and race condition cases of create,
remove and mtty driver.
(c) mlx5 core driver using RFC patches [3] and internal patches.
The internal patches are large and cannot be combined with these
prep-work patches. They will be posted once the prep work completes.

[1] https://www.spinics.net/lists/netdev/msg556978.html
[2] https://lkml.org/lkml/2019/3/7/696
[3] https://lkml.org/lkml/2019/3/8/819
[4] https://netdevconf.org/0x13/session.html?workshop-hardware-offload


Parav Pandit (8):
  vfio/mdev: Fix to not do put_device on device_register failure
  vfio/mdev: Avoid release parent reference during error path
  vfio/mdev: Removed unused kref
  vfio/mdev: Drop redundant extern for exported symbols
  vfio/mdev: Avoid masking error code to EBUSY
  vfio/mdev: Follow correct remove sequence
  vfio/mdev: Fix aborting mdev child device removal if one fails
  vfio/mdev: Improve the create/remove sequence

 drivers/vfio/mdev/mdev_core.c    | 164 +++++++++++++++++++--------------------
 drivers/vfio/mdev/mdev_private.h |   8 +-
 drivers/vfio/mdev/mdev_sysfs.c   |   8 +-
 include/linux/mdev.h             |  21 +++--
 4 files changed, 98 insertions(+), 103 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure
  2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
@ 2019-03-22 23:20 ` Parav Pandit
  2019-03-25 11:48   ` Maxim Levitsky
  2019-03-25 18:17   ` Kirti Wankhede
  2019-03-22 23:20 ` [PATCH 2/8] vfio/mdev: Avoid release parent reference during error path Parav Pandit
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-22 23:20 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: parav

device_register() performs put_device() if device_add() fails.
This balances with device_initialize().

The mdev core performing put_device() when device_register() fails
is an error, as it puts an already released device again.
Therefore, don't put the device on error.

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 0212f0e..3e5880a 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -318,10 +318,8 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 	dev_set_name(&mdev->dev, "%pUl", uuid.b);
 
 	ret = device_register(&mdev->dev);
-	if (ret) {
-		put_device(&mdev->dev);
+	if (ret)
 		goto mdev_fail;
-	}
 
 	ret = mdev_device_create_ops(kobj, mdev);
 	if (ret)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 2/8] vfio/mdev: Avoid release parent reference during error path
  2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
  2019-03-22 23:20 ` [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure Parav Pandit
@ 2019-03-22 23:20 ` Parav Pandit
  2019-03-25 11:49   ` Maxim Levitsky
  2019-03-25 18:27   ` Kirti Wankhede
  2019-03-22 23:20 ` [PATCH 3/8] vfio/mdev: Removed unused kref Parav Pandit
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-22 23:20 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: parav

During mdev parent registration in mdev_register_device(),
if the parent device is a duplicate, the error path releases a reference
on the existing parent device.
This is incorrect. The existing parent device should not be touched.

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 3e5880a..4f213e4d 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -182,6 +182,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	/* Check for duplicate */
 	parent = __find_parent_device(dev);
 	if (parent) {
+		parent = NULL;
 		ret = -EEXIST;
 		goto add_dev_err;
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread
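
The effect of the one-line change in patch 2/8 above can be sketched in
userspace C. This is only a model: struct parent, put_parent() and
register_parent() are hypothetical stand-ins, and it assumes the shared
error label drops a reference on any non-NULL parent pointer, which is
what the commit message implies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mdev_parent; only the refcount matters. */
struct parent {
	int refs;
	int registered;
};

static void put_parent(struct parent *p)
{
	assert(p->refs > 0);
	p->refs--;
}

/*
 * Model of the duplicate check. 'existing' plays the role of the entry a
 * lookup would return; no reference is taken on it here.
 * With clear_on_dup == 0 the shared error label drops a reference that this
 * call never acquired. With clear_on_dup == 1 (the patched behaviour) the
 * existing entry is left untouched.
 */
static int register_parent(struct parent *existing, int clear_on_dup)
{
	struct parent *parent = existing->registered ? existing : NULL;
	int ret = 0;

	if (parent) {
		if (clear_on_dup)
			parent = NULL;
		ret = -EEXIST;
		goto add_dev_err;
	}
	existing->registered = 1;
	existing->refs = 1;
	return 0;

add_dev_err:
	if (parent)
		put_parent(parent);
	return ret;
}
```

In this model a duplicate registration with clear_on_dup set leaves the
existing reference count unchanged, while the unpatched path decrements it.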

* [PATCH 3/8] vfio/mdev: Removed unused kref
  2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
  2019-03-22 23:20 ` [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure Parav Pandit
  2019-03-22 23:20 ` [PATCH 2/8] vfio/mdev: Avoid release parent reference during error path Parav Pandit
@ 2019-03-22 23:20 ` Parav Pandit
  2019-03-25 11:50   ` Maxim Levitsky
  2019-03-25 18:41   ` Kirti Wankhede
  2019-03-22 23:20 ` [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-22 23:20 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: parav

Remove unused kref from the mdev_device structure.

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c    | 1 -
 drivers/vfio/mdev/mdev_private.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 4f213e4d..3d91f62 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -311,7 +311,6 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 	mutex_unlock(&mdev_list_lock);
 
 	mdev->parent = parent;
-	kref_init(&mdev->ref);
 
 	mdev->dev.parent  = dev;
 	mdev->dev.bus     = &mdev_bus_type;
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index b5819b7..84b2b6c 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -30,7 +30,6 @@ struct mdev_device {
 	struct mdev_parent *parent;
 	uuid_le uuid;
 	void *driver_data;
-	struct kref ref;
 	struct list_head next;
 	struct kobject *type_kobj;
 	bool active;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols
  2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
                   ` (2 preceding siblings ...)
  2019-03-22 23:20 ` [PATCH 3/8] vfio/mdev: Removed unused kref Parav Pandit
@ 2019-03-22 23:20 ` Parav Pandit
  2019-03-25 11:56   ` Maxim Levitsky
  2019-03-25 19:07   ` Kirti Wankhede
  2019-03-22 23:20 ` [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-22 23:20 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: parav

There is no need to use 'extern' for exported functions.

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 include/linux/mdev.h | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index b6e048e..0924c48 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -118,21 +118,20 @@ struct mdev_driver {
 
 #define to_mdev_driver(drv)	container_of(drv, struct mdev_driver, driver)
 
-extern void *mdev_get_drvdata(struct mdev_device *mdev);
-extern void mdev_set_drvdata(struct mdev_device *mdev, void *data);
-extern uuid_le mdev_uuid(struct mdev_device *mdev);
+void *mdev_get_drvdata(struct mdev_device *mdev);
+void mdev_set_drvdata(struct mdev_device *mdev, void *data);
+uuid_le mdev_uuid(struct mdev_device *mdev);
 
 extern struct bus_type mdev_bus_type;
 
-extern int  mdev_register_device(struct device *dev,
-				 const struct mdev_parent_ops *ops);
-extern void mdev_unregister_device(struct device *dev);
+int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops);
+void mdev_unregister_device(struct device *dev);
 
-extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
-extern void mdev_unregister_driver(struct mdev_driver *drv);
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+void mdev_unregister_driver(struct mdev_driver *drv);
 
-extern struct device *mdev_parent_dev(struct mdev_device *mdev);
-extern struct device *mdev_dev(struct mdev_device *mdev);
-extern struct mdev_device *mdev_from_dev(struct device *dev);
+struct device *mdev_parent_dev(struct mdev_device *mdev);
+struct device *mdev_dev(struct mdev_device *mdev);
+struct mdev_device *mdev_from_dev(struct device *dev);
 
 #endif /* MDEV_H */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY
  2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
                   ` (3 preceding siblings ...)
  2019-03-22 23:20 ` [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
@ 2019-03-22 23:20 ` Parav Pandit
  2019-03-25 11:57   ` Maxim Levitsky
  2019-03-25 19:18   ` Kirti Wankhede
  2019-03-22 23:20 ` [PATCH 6/8] vfio/mdev: Follow correct remove sequence Parav Pandit
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-22 23:20 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: parav

Instead of masking the error as -EBUSY, return the actual error
returned by the driver.

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 3d91f62..ab05464 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -142,7 +142,7 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
 	 */
 	ret = parent->ops->remove(mdev);
 	if (ret && !force_remove)
-		return -EBUSY;
+		return ret;
 
 	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
 	return 0;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 6/8] vfio/mdev: Follow correct remove sequence
  2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
                   ` (4 preceding siblings ...)
  2019-03-22 23:20 ` [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
@ 2019-03-22 23:20 ` Parav Pandit
  2019-03-25 11:58   ` Maxim Levitsky
  2019-03-25 20:20   ` Alex Williamson
  2019-03-22 23:20 ` [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
  2019-03-22 23:20 ` [PATCH 8/8] vfio/mdev: Improve the create/remove sequence Parav Pandit
  7 siblings, 2 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-22 23:20 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: parav

mdev_remove_sysfs_files() should follow the exact mirror sequence of
create, similar to what is followed in the error unwinding path of
mdev_create_sysfs_files().

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index ce5dd21..c782fa9 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -280,7 +280,7 @@ int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
 
 void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
 {
+	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
 	sysfs_remove_link(&dev->kobj, "mdev_type");
 	sysfs_remove_link(type->devices_kobj, dev_name(dev));
-	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
                   ` (5 preceding siblings ...)
  2019-03-22 23:20 ` [PATCH 6/8] vfio/mdev: Follow correct remove sequence Parav Pandit
@ 2019-03-22 23:20 ` Parav Pandit
  2019-03-25 11:58   ` Maxim Levitsky
  2019-03-25 19:35   ` Kirti Wankhede
  2019-03-22 23:20 ` [PATCH 8/8] vfio/mdev: Improve the create/remove sequence Parav Pandit
  7 siblings, 2 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-22 23:20 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: parav

device_for_each_child() stops executing the callback function for the
remaining child devices if the callback hits an error.
Child mdev devices are independent of each other.
While unregistering the parent device, the mdev core must remove all
child mdev devices.
Therefore, mdev_device_remove_cb() always returns success so that
device_for_each_child() doesn't abort if one child removal hits an error.

While at it, simplify the remove and unregister functions as described
below.

There is no need to pass the forced flag pointer during mdev parent
removal, which invokes mdev_device_remove(). So simplify the flow.

mdev_device_remove() is called from two paths.
1. mdev_unregister_device()
     mdev_device_remove_cb()
       mdev_device_remove()
2. remove_store()
     mdev_device_remove()

When a device is removed by the user using remove_store(), the device
under removal is an mdev device.
When a device is removed during parent device removal using the generic
child iterator, the mdev check is already done using dev_is_mdev().

Hence, remove the unnecessary loop in mdev_device_remove().

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 24 +++++-------------------
 1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index ab05464..944a058 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -150,10 +150,10 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
 
 static int mdev_device_remove_cb(struct device *dev, void *data)
 {
-	if (!dev_is_mdev(dev))
-		return 0;
+	if (dev_is_mdev(dev))
+		mdev_device_remove(dev, true);
 
-	return mdev_device_remove(dev, data ? *(bool *)data : true);
+	return 0;
 }
 
 /*
@@ -241,7 +241,6 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 void mdev_unregister_device(struct device *dev)
 {
 	struct mdev_parent *parent;
-	bool force_remove = true;
 
 	mutex_lock(&parent_list_lock);
 	parent = __find_parent_device(dev);
@@ -255,8 +254,7 @@ void mdev_unregister_device(struct device *dev)
 	list_del(&parent->next);
 	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
 
-	device_for_each_child(dev, (void *)&force_remove,
-			      mdev_device_remove_cb);
+	device_for_each_child(dev, NULL, mdev_device_remove_cb);
 
 	parent_remove_sysfs_files(parent);
 
@@ -346,24 +344,12 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 
 int mdev_device_remove(struct device *dev, bool force_remove)
 {
-	struct mdev_device *mdev, *tmp;
+	struct mdev_device *mdev;
 	struct mdev_parent *parent;
 	struct mdev_type *type;
 	int ret;
 
 	mdev = to_mdev_device(dev);
-
-	mutex_lock(&mdev_list_lock);
-	list_for_each_entry(tmp, &mdev_list, next) {
-		if (tmp == mdev)
-			break;
-	}
-
-	if (tmp != mdev) {
-		mutex_unlock(&mdev_list_lock);
-		return -ENODEV;
-	}
-
 	if (!mdev->active) {
 		mutex_unlock(&mdev_list_lock);
 		return -EAGAIN;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread
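
The iteration behaviour that patch 7/8 above relies on can be illustrated
with a small userspace model. for_each_child() below only mimics the
stop-on-first-error contract of device_for_each_child(); struct child and
all function names are hypothetical and not part of the mdev code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical child record; 'busy' makes its removal fail. */
struct child {
	int busy;
	int removed;
};

typedef int (*child_fn)(struct child *c);

/* Same contract as device_for_each_child(): stop at the first non-zero return. */
static int for_each_child(struct child *c, size_t n, child_fn fn)
{
	int ret = 0;
	size_t i;

	assert(c != NULL);
	for (i = 0; i < n && !ret; i++)
		ret = fn(&c[i]);
	return ret;
}

/* Removal fails for a busy child and leaves it in place. */
static int remove_child(struct child *c)
{
	if (c->busy)
		return -EAGAIN;
	c->removed = 1;
	return 0;
}

/* Pre-patch style: propagate the error, which aborts the iteration. */
static int remove_cb_propagate(struct child *c)
{
	return remove_child(c);
}

/* Post-patch style: always report success so every child is visited. */
static int remove_cb_ignore(struct child *c)
{
	remove_child(c);
	return 0;
}

static size_t count_removed(const struct child *c, size_t n)
{
	size_t i, removed = 0;

	for (i = 0; i < n; i++)
		removed += (size_t)c[i].removed;
	return removed;
}
```

With three children where only the middle one is busy, the propagating
callback stops after the first child, whereas the ignoring callback still
reaches the third.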

* [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
                   ` (6 preceding siblings ...)
  2019-03-22 23:20 ` [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
@ 2019-03-22 23:20 ` Parav Pandit
  2019-03-25 13:24   ` Maxim Levitsky
                     ` (2 more replies)
  7 siblings, 3 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-22 23:20 UTC (permalink / raw)
  To: kvm, linux-kernel, kwankhede, alex.williamson; +Cc: parav

There are five problems with the current code structure.
1. The mdev device is placed on the mdev bus before it is created in the
vendor driver. Once a device is placed on the mdev bus without creating
its underlying vendor device, an open() can be triggered by
userspace on a partially initialized device.
The ladder diagram below highlights it.

      cpu-0                                       cpu-1
      -----                                       -----
   create_store()
     mdev_device_create()
       device_register()
          ...
         vfio_mdev_probe()
         ...creates char device
                                        vfio_mdev_open()
                                          parent->ops->open(mdev)
                                            vfio_ap_mdev_open()
                                              matrix_mdev = NULL
        [...]
        parent->ops->create()
          vfio_ap_mdev_create()
            mdev_set_drvdata(mdev, matrix_mdev);
            /* Valid pointer set above */

2. The current creation sequence is,
   parent->ops->create()
   groups_register()

The remove sequence is,
   parent->ops->remove()
   groups_unregister()
However, the remove sequence should be the exact mirror of the creation
sequence.
Once this is achieved, all users of the mdev will be terminated first,
before removing the underlying vendor device
(following the standard Linux driver model).
At that point the vendor's remove() op shouldn't fail, because the device
has been taken off the bus, which should terminate the users.

3. Additionally, any new mdev driver registered using
mdev_register_driver() that wants to work on the mdev device during its
probe() routine needs to get a stable mdev structure.

4. In the following sequences, child devices created while removing the
mdev parent device can be left behind, or the race may lead to removing
half-initialized child mdev devices.

issue-1:
--------
       cpu-0                         cpu-1
       -----                         -----
                                  mdev_unregister_device()
                                     device_for_each_child()
                                        mdev_device_remove_cb()
                                            mdev_device_remove()
create_store()
  mdev_device_create()                   [...]
       device_register()
                                  parent_remove_sysfs_files()
                                  /* BUG: device added by cpu-0
                                   * whose parent is getting removed.
                                   */

issue-2:
--------
       cpu-0                         cpu-1
       -----                         -----
create_store()
  mdev_device_create()                   [...]
       device_register()

       [...]                      mdev_unregister_device()
                                     device_for_each_child()
                                        mdev_device_remove_cb()
                                            mdev_device_remove()

       mdev_create_sysfs_files()
       /* BUG: create is adding
        * sysfs files for a device
        * which is undergoing removal.
        */
                                 parent_remove_sysfs_files()

5. The crash below is observed when a user-initiated remove is in progress
and mdev_unregister_device() completes parent unregistration.

       cpu-0                         cpu-1
       -----                         -----
remove_store()
   mdev_device_remove()
   active = false;
                                  mdev_unregister_device()
                                    remove type
   [...]
   mdev_device_remove_ops() crashes.

This race is similar to create() racing with mdev_unregister_device().

mtty mtty: MDEV: Registered
iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
mdev_device_remove sleep started
mtty mtty: MDEV: Unregistering
mtty_dev: Unloaded!
BUG: unable to handle kernel paging request at ffffffffc027d668
PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
Oops: 0000 [#1] SMP PTI
CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
Call Trace:
 mdev_device_remove+0xef/0x130 [mdev]
 remove_store+0x77/0xa0 [mdev]
 kernfs_fop_write+0x113/0x1a0
 __vfs_write+0x33/0x1b0
 ? rcu_read_lock_sched_held+0x64/0x70
 ? rcu_sync_lockdep_assert+0x2a/0x50
 ? __sb_start_write+0x121/0x1b0
 ? vfs_write+0x17c/0x1b0
 vfs_write+0xad/0x1b0
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 ksys_write+0x55/0xc0
 do_syscall_64+0x5a/0x210

Therefore, the mdev core is improved in the following ways to overcome the
above issues.

1. Before placing mdev devices on the bus, perform the vendor driver's
create(), which supports the mdev creation.
This ensures that all necessary mdev-specific fields are initialized
before a given mdev can be accessed by the bus driver.

2. During the remove flow, first remove the device from the bus. This
ensures that any bus-specific devices and data are cleared.
Once the device is taken off the mdev bus, perform remove() of the mdev
in the vendor driver.

3. The Linux core device model provides a way to register and automatically
unregister the device sysfs attribute groups via dev->groups.
Make use of dev->groups to let the core create the groups, and simplify
the code by avoiding explicit group creation and removal.

4. Wait for any ongoing mdev create() and remove() to finish before
unregistering the parent device, using SRCU. This continues to allow
multiple create and remove operations to progress in parallel. At the same
time, it guards parent removal while the parent is being accessed by
create() and remove() callbacks.

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++------------------
 drivers/vfio/mdev/mdev_private.h |   7 +-
 drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
 3 files changed, 84 insertions(+), 71 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 944a058..8fe0ed1 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
 						  ref);
 	struct device *dev = parent->dev;
 
+	cleanup_srcu_struct(&parent->unreg_srcu);
 	kfree(parent);
 	put_device(dev);
 }
@@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct mdev_parent *parent)
 		kref_put(&parent->ref, mdev_release_parent);
 }
 
-static int mdev_device_create_ops(struct kobject *kobj,
-				  struct mdev_device *mdev)
+static int mdev_device_must_remove(struct mdev_device *mdev)
 {
-	struct mdev_parent *parent = mdev->parent;
+	struct mdev_parent *parent;
+	struct mdev_type *type;
 	int ret;
 
-	ret = parent->ops->create(kobj, mdev);
-	if (ret)
-		return ret;
+	type = to_mdev_type(mdev->type_kobj);
 
-	ret = sysfs_create_groups(&mdev->dev.kobj,
-				  parent->ops->mdev_attr_groups);
+	mdev_remove_sysfs_files(&mdev->dev, type);
+	device_del(&mdev->dev);
+	parent = mdev->parent;
+	ret = parent->ops->remove(mdev);
 	if (ret)
-		parent->ops->remove(mdev);
+		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
 
+	/* Balances with device_initialize() */
+	put_device(&mdev->dev);
 	return ret;
 }
 
-/*
- * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
- * device is being unregistered from mdev device framework.
- * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
- *   indicates that if the mdev device is active, used by VMM or userspace
- *   application, vendor driver could return error then don't remove the device.
- * - 'force_remove' is set to 'true' when called from mdev_unregister_device()
- *   which indicate that parent device is being removed from mdev device
- *   framework so remove mdev device forcefully.
- */
-static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
-{
-	struct mdev_parent *parent = mdev->parent;
-	int ret;
-
-	/*
-	 * Vendor driver can return error if VMM or userspace application is
-	 * using this mdev device.
-	 */
-	ret = parent->ops->remove(mdev);
-	if (ret && !force_remove)
-		return ret;
-
-	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
-	return 0;
-}
-
 static int mdev_device_remove_cb(struct device *dev, void *data)
 {
 	if (dev_is_mdev(dev))
-		mdev_device_remove(dev, true);
-
+		mdev_device_must_remove(to_mdev_device(dev));
 	return 0;
 }
 
@@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	}
 
 	kref_init(&parent->ref);
+	init_srcu_struct(&parent->unreg_srcu);
 
 	parent->dev = dev;
 	parent->ops = ops;
@@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	if (ret)
 		dev_warn(dev, "Failed to create compatibility class link\n");
 
+	rcu_assign_pointer(parent->self, parent);
 	list_add(&parent->next, &parent_list);
 	mutex_unlock(&parent_list_lock);
 
@@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
 
 	mutex_lock(&parent_list_lock);
 	parent = __find_parent_device(dev);
-
 	if (!parent) {
 		mutex_unlock(&parent_list_lock);
 		return;
 	}
+	list_del(&parent->next);
+	mutex_unlock(&parent_list_lock);
+
 	dev_info(dev, "MDEV: Unregistering\n");
 
-	list_del(&parent->next);
+	/* Publish that this mdev parent is unregistering. So any new
+	 * create/remove cannot start on this parent anymore by user.
+	 */
+	rcu_assign_pointer(parent->self, NULL);
+
+	/*
+	 * Wait for any active create() or remove() mdev ops on the parent
+	 * to complete.
+	 */
+	synchronize_srcu(&parent->unreg_srcu);
+
+	/* At this point it is confirmed that any pending user initiated
+	 * create or remove callbacks accessing the parent are completed.
+	 * It is safe to remove the parent now.
+	 */
 	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
 
 	device_for_each_child(dev, NULL, mdev_device_remove_cb);
 
 	parent_remove_sysfs_files(parent);
 
-	mutex_unlock(&parent_list_lock);
 	mdev_put_parent(parent);
 }
 EXPORT_SYMBOL(mdev_unregister_device);
@@ -278,14 +270,24 @@ static void mdev_device_release(struct device *dev)
 int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 {
 	int ret;
+	struct mdev_parent *valid_parent;
 	struct mdev_device *mdev, *tmp;
 	struct mdev_parent *parent;
 	struct mdev_type *type = to_mdev_type(kobj);
+	int srcu_idx;
 
 	parent = mdev_get_parent(type->parent);
 	if (!parent)
 		return -EINVAL;
 
+	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
+	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
+	if (!valid_parent) {
+		/* parent is undergoing unregistration */
+		ret = -ENODEV;
+		goto mdev_fail;
+	}
+
 	mutex_lock(&mdev_list_lock);
 
 	/* Check for duplicate */
@@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 
 	mdev->parent = parent;
 
+	device_initialize(&mdev->dev);
 	mdev->dev.parent  = dev;
 	mdev->dev.bus     = &mdev_bus_type;
 	mdev->dev.release = mdev_device_release;
+	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
 	dev_set_name(&mdev->dev, "%pUl", uuid.b);
 
-	ret = device_register(&mdev->dev);
+	ret = type->parent->ops->create(kobj, mdev);
 	if (ret)
-		goto mdev_fail;
+		goto create_fail;
 
-	ret = mdev_device_create_ops(kobj, mdev);
+	ret = device_add(&mdev->dev);
 	if (ret)
-		goto create_fail;
+		goto dev_fail;
 
 	ret = mdev_create_sysfs_files(&mdev->dev, type);
-	if (ret) {
-		mdev_device_remove_ops(mdev, true);
-		goto create_fail;
-	}
+	if (ret)
+		goto sysfs_fail;
 
 	mdev->type_kobj = kobj;
 	mdev->active = true;
 	dev_dbg(&mdev->dev, "MDEV: created\n");
+	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
 
 	return 0;
 
+sysfs_fail:
+	device_del(&mdev->dev);
+dev_fail:
+	type->parent->ops->remove(mdev);
 create_fail:
-	device_unregister(&mdev->dev);
+	put_device(&mdev->dev);
 mdev_fail:
+	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
 	mdev_put_parent(parent);
 	return ret;
 }
 
-int mdev_device_remove(struct device *dev, bool force_remove)
+int mdev_device_remove(struct device *dev)
 {
+	struct mdev_parent *valid_parent;
 	struct mdev_device *mdev;
 	struct mdev_parent *parent;
-	struct mdev_type *type;
+	int srcu_idx;
 	int ret;
 
 	mdev = to_mdev_device(dev);
+	parent = mdev->parent;
+	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
+	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
+	if (!valid_parent) {
+		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
+		/* parent is undergoing unregistration */
+		return -ENODEV;
+	}
+
+	mutex_lock(&mdev_list_lock);
 	if (!mdev->active) {
 		mutex_unlock(&mdev_list_lock);
-		return -EAGAIN;
+		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
+		return -ENODEV;
 	}
-
 	mdev->active = false;
 	mutex_unlock(&mdev_list_lock);
 
-	type = to_mdev_type(mdev->type_kobj);
-	parent = mdev->parent;
-
-	ret = mdev_device_remove_ops(mdev, force_remove);
-	if (ret) {
-		mdev->active = true;
-		return ret;
-	}
+	ret = mdev_device_must_remove(mdev);
+	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
 
-	mdev_remove_sysfs_files(dev, type);
-	device_unregister(dev);
 	mdev_put_parent(parent);
-
-	return 0;
+	return ret;
 }
 
 static int __init mdev_init(void)
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 84b2b6c..3d17db9 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -23,6 +23,11 @@ struct mdev_parent {
 	struct list_head next;
 	struct kset *mdev_types_kset;
 	struct list_head type_list;
+	/* Protects unregistration to wait until create/remove
+	 * are completed.
+	 */
+	struct srcu_struct unreg_srcu;
+	struct mdev_parent __rcu *self;
 };
 
 struct mdev_device {
@@ -58,6 +63,6 @@ struct mdev_type {
 void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
 
 int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
-int  mdev_device_remove(struct device *dev, bool force_remove);
+int  mdev_device_remove(struct device *dev);
 
 #endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index c782fa9..68a8191 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -236,11 +236,9 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
 	if (val && device_remove_file_self(dev, attr)) {
 		int ret;
 
-		ret = mdev_device_remove(dev, false);
-		if (ret) {
-			device_create_file(dev, attr);
+		ret = mdev_device_remove(dev);
+		if (ret)
 			return ret;
-		}
 	}
 
 	return count;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure
  2019-03-22 23:20 ` [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure Parav Pandit
@ 2019-03-25 11:48   ` Maxim Levitsky
  2019-03-25 18:17   ` Kirti Wankhede
  1 sibling, 0 replies; 49+ messages in thread
From: Maxim Levitsky @ 2019-03-25 11:48 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, kwankhede, alex.williamson

On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> device_register() performs put_device() if device_add() fails.
> This balances with device_initialize().
> 
> mdev core performing put_device() when device_register() fails,
> is an error that puts already released device again.
> Therefore, don't put the device on error.
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 0212f0e..3e5880a 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -318,10 +318,8 @@ int mdev_device_create(struct kobject *kobj, struct
> device *dev, uuid_le uuid)
>  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
>  	ret = device_register(&mdev->dev);
> -	if (ret) {
> -		put_device(&mdev->dev);
> +	if (ret)
>  		goto mdev_fail;
> -	}
>  
>  	ret = mdev_device_create_ops(kobj, mdev);
>  	if (ret)

Very good catch! Thanks!

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/8] vfio/mdev: Avoid release parent reference during error path
  2019-03-22 23:20 ` [PATCH 2/8] vfio/mdev: Avoid release parent reference during error path Parav Pandit
@ 2019-03-25 11:49   ` Maxim Levitsky
  2019-03-25 18:27   ` Kirti Wankhede
  1 sibling, 0 replies; 49+ messages in thread
From: Maxim Levitsky @ 2019-03-25 11:49 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, kwankhede, alex.williamson

On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> During mdev parent registration in mdev_register_device(),
> if parent device is duplicate, it releases the reference of existing
> parent device.
> This is incorrect. Existing parent device should not be touched.
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 3e5880a..4f213e4d 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -182,6 +182,7 @@ int mdev_register_device(struct device *dev, const struct
> mdev_parent_ops *ops)
>  	/* Check for duplicate */
>  	parent = __find_parent_device(dev);
>  	if (parent) {
> +		parent = NULL;
>  		ret = -EEXIST;
>  		goto add_dev_err;
>  	}

This is also clearly an issue.
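
To make the failure mode concrete, here is a minimal userspace sketch, not the mdev code itself. It assumes a shared error label that drops a reference on whatever the local pointer holds, which is the pattern the quoted hunk implies; `struct parent`, `find_parent()`, `put_parent()` and `register_parent()` are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mdev_parent: only what the sketch needs. */
struct parent {
	const void *dev;	/* key used for duplicate detection */
	int refcount;
	int in_use;
};

#define MAX_PARENTS 4
static struct parent parents[MAX_PARENTS];

static struct parent *find_parent(const void *dev)
{
	for (size_t i = 0; i < MAX_PARENTS; i++)
		if (parents[i].in_use && parents[i].dev == dev)
			return &parents[i];
	return NULL;
}

static void put_parent(struct parent *p)
{
	if (p && --p->refcount == 0)
		p->in_use = 0;
}

static int register_parent(const void *dev)
{
	struct parent *p;
	int ret;

	p = find_parent(dev);
	if (p) {
		/*
		 * The entry belongs to the earlier registration. Forget it
		 * so the shared error path does not drop its reference.
		 */
		p = NULL;
		ret = -EEXIST;
		goto err;
	}

	for (size_t i = 0; i < MAX_PARENTS; i++) {
		if (!parents[i].in_use) {
			p = &parents[i];
			break;
		}
	}
	if (!p) {
		ret = -ENOMEM;
		goto err;
	}

	p->dev = dev;
	p->refcount = 1;
	p->in_use = 1;
	return 0;

err:
	put_parent(p);	/* no-op when p is NULL */
	return ret;
}
```

Without the `p = NULL` line, the second registration of the same device would drop the only reference of the first one and silently unregister it.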

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/8] vfio/mdev: Removed unused kref
  2019-03-22 23:20 ` [PATCH 3/8] vfio/mdev: Removed unused kref Parav Pandit
@ 2019-03-25 11:50   ` Maxim Levitsky
  2019-03-25 18:41   ` Kirti Wankhede
  1 sibling, 0 replies; 49+ messages in thread
From: Maxim Levitsky @ 2019-03-25 11:50 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, kwankhede, alex.williamson

On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> Remove unused kref from the mdev_device structure.
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 1 -
>  drivers/vfio/mdev/mdev_private.h | 1 -
>  2 files changed, 2 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 4f213e4d..3d91f62 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -311,7 +311,6 @@ int mdev_device_create(struct kobject *kobj, struct device
> *dev, uuid_le uuid)
>  	mutex_unlock(&mdev_list_lock);
>  
>  	mdev->parent = parent;
> -	kref_init(&mdev->ref);
>  
>  	mdev->dev.parent  = dev;
>  	mdev->dev.bus     = &mdev_bus_type;
> diff --git a/drivers/vfio/mdev/mdev_private.h
> b/drivers/vfio/mdev/mdev_private.h
> index b5819b7..84b2b6c 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -30,7 +30,6 @@ struct mdev_device {
>  	struct mdev_parent *parent;
>  	uuid_le uuid;
>  	void *driver_data;
> -	struct kref ref;
>  	struct list_head next;
>  	struct kobject *type_kobj;
>  	bool active;

When developing my nvme-mdev driver, I noticed that unused kref too.
Dead code has to go.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols
  2019-03-22 23:20 ` [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
@ 2019-03-25 11:56   ` Maxim Levitsky
  2019-03-25 19:07   ` Kirti Wankhede
  1 sibling, 0 replies; 49+ messages in thread
From: Maxim Levitsky @ 2019-03-25 11:56 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, kwankhede, alex.williamson

On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> There is no need to use 'extern' for exported functions.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  include/linux/mdev.h | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index b6e048e..0924c48 100644
> --- a/include/linux/mdev.h
> +++ b/include/linux/mdev.h
> @@ -118,21 +118,20 @@ struct mdev_driver {
>  
>  #define to_mdev_driver(drv)	container_of(drv, struct mdev_driver, driver)
>  
> -extern void *mdev_get_drvdata(struct mdev_device *mdev);
> -extern void mdev_set_drvdata(struct mdev_device *mdev, void *data);
> -extern uuid_le mdev_uuid(struct mdev_device *mdev);
> +void *mdev_get_drvdata(struct mdev_device *mdev);
> +void mdev_set_drvdata(struct mdev_device *mdev, void *data);
> +uuid_le mdev_uuid(struct mdev_device *mdev);
>  
>  extern struct bus_type mdev_bus_type;
>  
> -extern int  mdev_register_device(struct device *dev,
> -				 const struct mdev_parent_ops *ops);
> -extern void mdev_unregister_device(struct device *dev);
> +int mdev_register_device(struct device *dev, const struct mdev_parent_ops
> *ops);
> +void mdev_unregister_device(struct device *dev);
>  
> -extern int  mdev_register_driver(struct mdev_driver *drv, struct module
> *owner);
> -extern void mdev_unregister_driver(struct mdev_driver *drv);
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +void mdev_unregister_driver(struct mdev_driver *drv);
>  
> -extern struct device *mdev_parent_dev(struct mdev_device *mdev);
> -extern struct device *mdev_dev(struct mdev_device *mdev);
> -extern struct mdev_device *mdev_from_dev(struct device *dev);
> +struct device *mdev_parent_dev(struct mdev_device *mdev);
> +struct device *mdev_dev(struct mdev_device *mdev);
> +struct mdev_device *mdev_from_dev(struct device *dev);
>  
>  #endif /* MDEV_H */

I honestly didn't know about (or pay attention to) that nice bit of C.
Indeed, 'extern' is effectively the default for function declarations.
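
A small standalone illustration of that point, with made-up names (`sketch_add`, `sketch_counter`); it is not taken from mdev.h. It also suggests why the patch leaves `extern struct bus_type mdev_bus_type;` alone: for objects the keyword still matters.

```c
#include <assert.h>

/* Both declarations have external linkage and name the same function. */
extern int sketch_add(int a, int b);
int sketch_add(int a, int b);

/*
 * For objects the keyword is not redundant: with 'extern' this line only
 * declares the variable, without it the line would be a tentative definition.
 */
extern int sketch_counter;
int sketch_counter = 40;

int sketch_add(int a, int b)
{
	return a + b;
}
```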

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY
  2019-03-22 23:20 ` [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
@ 2019-03-25 11:57   ` Maxim Levitsky
  2019-03-25 19:18   ` Kirti Wankhede
  1 sibling, 0 replies; 49+ messages in thread
From: Maxim Levitsky @ 2019-03-25 11:57 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, kwankhede, alex.williamson

On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> Instead of masking return error to -EBUSY, return actual error
> returned by the driver.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 3d91f62..ab05464 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -142,7 +142,7 @@ static int mdev_device_remove_ops(struct mdev_device
> *mdev, bool force_remove)
>  	 */
>  	ret = parent->ops->remove(mdev);
>  	if (ret && !force_remove)
> -		return -EBUSY;
> +		return ret;
>  
>  	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
>  	return 0;

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 6/8] vfio/mdev: Follow correct remove sequence
  2019-03-22 23:20 ` [PATCH 6/8] vfio/mdev: Follow correct remove sequence Parav Pandit
@ 2019-03-25 11:58   ` Maxim Levitsky
  2019-03-25 20:20   ` Alex Williamson
  1 sibling, 0 replies; 49+ messages in thread
From: Maxim Levitsky @ 2019-03-25 11:58 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, kwankhede, alex.williamson

On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> mdev_remove_sysfs_files() should follow exact mirror sequence of a
> create, similar to what is followed in error unwinding path of
> mdev_create_sysfs_files().
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_sysfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index ce5dd21..c782fa9 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -280,7 +280,7 @@ int  mdev_create_sysfs_files(struct device *dev, struct
> mdev_type *type)
>  
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
>  {
> +	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
>  	sysfs_remove_link(&dev->kobj, "mdev_type");
>  	sysfs_remove_link(type->devices_kobj, dev_name(dev));
> -	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
>  }

I agree with that.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-03-22 23:20 ` [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
@ 2019-03-25 11:58   ` Maxim Levitsky
  2019-03-25 19:35   ` Kirti Wankhede
  1 sibling, 0 replies; 49+ messages in thread
From: Maxim Levitsky @ 2019-03-25 11:58 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, kwankhede, alex.williamson

On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> device_for_each_child() stops executing callback function for remaining
> child devices, if callback hits an error.
> Each child mdev device is independent of each other.
> While unregistering parent device, mdev core must remove all child mdev
> devices.
> Therefore, mdev_device_remove_cb() always returns success so that
> device_for_each_child doesn't abort if one child removal hits error.
> 
> While at it, improve remove and unregister functions for below simplicity.
> 
> There isn't need to pass forced flag pointer during mdev parent
> removal which invokes mdev_device_remove(). So simplify the flow.
> 
> mdev_device_remove() is called from two paths.
> 1. mdev_unregister_device()
>      mdev_device_remove_cb()
>        mdev_device_remove()
> 2. remove_store()
>      mdev_device_remove()
> 
> When device is removed by user using remove_store(), device under
> removal is mdev device.
> When device is removed during parent device removal using generic child
> iterator, mdev check is already done using dev_is_mdev().
> 
> Hence, remove the unnecessary loop in mdev_device_remove().
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 24 +++++-------------------
>  1 file changed, 5 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index ab05464..944a058 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -150,10 +150,10 @@ static int mdev_device_remove_ops(struct mdev_device
> *mdev, bool force_remove)
>  
>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
> -	if (!dev_is_mdev(dev))
> -		return 0;
> +	if (dev_is_mdev(dev))
> +		mdev_device_remove(dev, true);
>  
> -	return mdev_device_remove(dev, data ? *(bool *)data : true);
> +	return 0;
>  }
>  
>  /*
> @@ -241,7 +241,6 @@ int mdev_register_device(struct device *dev, const struct
> mdev_parent_ops *ops)
>  void mdev_unregister_device(struct device *dev)
>  {
>  	struct mdev_parent *parent;
> -	bool force_remove = true;
>  
>  	mutex_lock(&parent_list_lock);
>  	parent = __find_parent_device(dev);
> @@ -255,8 +254,7 @@ void mdev_unregister_device(struct device *dev)
>  	list_del(&parent->next);
>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
> -	device_for_each_child(dev, (void *)&force_remove,
> -			      mdev_device_remove_cb);
> +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  
> @@ -346,24 +344,12 @@ int mdev_device_create(struct kobject *kobj, struct
> device *dev, uuid_le uuid)
>  
>  int mdev_device_remove(struct device *dev, bool force_remove)
>  {
> -	struct mdev_device *mdev, *tmp;
> +	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> -
> -	mutex_lock(&mdev_list_lock);
> -	list_for_each_entry(tmp, &mdev_list, next) {
> -		if (tmp == mdev)
> -			break;
> -	}
> -
> -	if (tmp != mdev) {
> -		mutex_unlock(&mdev_list_lock);
> -		return -ENODEV;
> -	}
> -
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
>  		return -EAGAIN;

Very nice catch and good refactoring.
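
The iteration behaviour the commit message relies on can be sketched in userspace as follows. `for_each_child()` here is a hypothetical iterator that only models the described contract (stop at the first non-zero return); it is not the driver core implementation, and the integer "children" are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator: stops as soon as the callback returns non-zero. */
static int for_each_child(int *children, size_t n,
			  int (*fn)(int *child, void *data), void *data)
{
	int ret = 0;

	for (size_t i = 0; i < n && !ret; i++)
		ret = fn(&children[i], data);
	return ret;
}

/* Child state: > 0 removable, < 0 removal fails, 0 already removed. */
static int remove_child(int *child)
{
	if (*child < 0)
		return -1;
	*child = 0;
	return 0;
}

/* Old behaviour: the error is propagated and aborts the walk. */
static int remove_cb_propagate(int *child, void *data)
{
	(void)data;
	return remove_child(child);
}

/* New behaviour: always report success so every child is visited. */
static int remove_cb_continue(int *child, void *data)
{
	(void)data;
	remove_child(child);
	return 0;
}
```

With children {1, -1, 1}, the propagating callback leaves the third child untouched, while the always-zero callback still reaches it.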

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-22 23:20 ` [PATCH 8/8] vfio/mdev: Improve the create/remove sequence Parav Pandit
@ 2019-03-25 13:24   ` Maxim Levitsky
  2019-03-25 21:42     ` Parav Pandit
  2019-03-25 23:18   ` Alex Williamson
  2019-03-26  7:06   ` Kirti Wankhede
  2 siblings, 1 reply; 49+ messages in thread
From: Maxim Levitsky @ 2019-03-25 13:24 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, kwankhede, alex.williamson

On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> There are five problems with current code structure.
> 1. mdev device is placed on the mdev bus before it is created in the
> vendor driver. Once a device is placed on the mdev bus without creating
> its supporting underlying vendor device, an open() can get triggered by
> userspace on partially initialized device.
> Below ladder diagram highlight it.
> 
>       cpu-0                                       cpu-1
>       -----                                       -----
>    create_store()
>      mdev_create_device()
>        device_register()
>           ...
>          vfio_mdev_probe()
>          ...creates char device
>                                         vfio_mdev_open()
>                                           parent->ops->open(mdev)
>                                             vfio_ap_mdev_open()
>                                               matrix_mdev = NULL
>         [...]
>         parent->ops->create()
>           vfio_ap_mdev_create()
>             mdev_set_drvdata(mdev, matrix_mdev);
>             /* Valid pointer set above */

Agree.
You probably mean mdev_device_create here.

> 
> 2. Current creation sequence is,
>    parent->ops->create()
>    groups_register()
> 
> Remove sequence is,
>    parent->ops->remove()
>    groups_unregister()
> However, remove sequence should be exact mirror of creation sequence.
> Once this is achieved, all users of the mdev will be terminated first
> before removing underlying vendor device.
> (Follow standard linux driver model).
> At that point vendor's remove() ops shouldn't fail because device is
> taken off the bus that should terminate the users.
Agree here too.
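
As a rough userspace sketch of the "remove mirrors create" rule: the step names (`vendor_create`, `bus_add`, `sysfs_create` and their inverses) are illustrative labels for the sequence in the patch, and `fail_at` is a hypothetical fault-injection knob. The functions only append to a trace string so the ordering can be checked.

```c
#include <assert.h>
#include <string.h>

static char trace[128];
static int fail_at;	/* 0: no failure; 1..3: index of the step that fails */

/* Record a forward step, or fail without recording it. */
static int step(const char *name, int idx)
{
	if (fail_at == idx)
		return -1;
	strcat(trace, name);
	strcat(trace, " ");
	return 0;
}

/* Record a teardown step. */
static void undo(const char *name)
{
	strcat(trace, name);
	strcat(trace, " ");
}

static int sketch_create(void)
{
	int ret;

	ret = step("vendor_create", 1);	/* vendor state first */
	if (ret)
		goto out;
	ret = step("bus_add", 2);	/* only then visible on the bus */
	if (ret)
		goto undo_vendor;
	ret = step("sysfs_create", 3);
	if (ret)
		goto undo_bus;
	return 0;

undo_bus:
	undo("bus_del");
undo_vendor:
	undo("vendor_remove");
out:
	return ret;
}

/* Exact reverse of the successful create path. */
static void sketch_remove(void)
{
	undo("sysfs_remove");
	undo("bus_del");
	undo("vendor_remove");
}
```

The error labels in `sketch_create()` unwind in the same order that `sketch_remove()` uses, so a partially created device is torn down by the same rule as a fully created one.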



> 
> 3. Additionally any new mdev driver that wants to work on mdev device
> during probe() routine registered using mdev_register_driver() needs to
> get stable mdev structure.
> 
> 4. In following sequence, child devices created while removing mdev parent
> device can be left out, or it may lead to race of removing half
> initialized child mdev devices.
> 
> issue-1:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
>                                   mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
>                                   parent_remove_sysfs_files()
>                                   /* BUG: device added by cpu-0
>                                    * whose parent is getting removed.
>                                    */
> 
> issue-2:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
> 
>        [...]                      mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> 
>        mdev_create_sysfs_files()
>        /* BUG: create is adding
>         * sysfs files for a device
>         * which is undergoing removal.
>         */
>                                  parent_remove_sysfs_files()
Looks like an issue to me too.

> 
> 5. Below crash is observed when user initiated remove is in progress
> and mdev_unregister_device() completes parent unregistration.
> 
>        cpu-0                         cpu-1
>        -----                         -----
> remove_store()
>    mdev_device_remove()
>    active = false;
>                                   mdev_unregister_device()
>                                     remove type
>    [...]
>    mdev_remove_ops() crashes.
> 
> This is similar race like create() racing with mdev_unregister_device().
> 
> mtty mtty: MDEV: Registered
> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> mdev_device_remove sleep started
> mtty mtty: MDEV: Unregistering
> mtty_dev: Unloaded!
> BUG: unable to handle kernel paging request at ffffffffc027d668
> PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> Oops: 0000 [#1] SMP PTI
> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> Call Trace:
>  mdev_device_remove+0xef/0x130 [mdev]
>  remove_store+0x77/0xa0 [mdev]
>  kernfs_fop_write+0x113/0x1a0
>  __vfs_write+0x33/0x1b0
>  ? rcu_read_lock_sched_held+0x64/0x70
>  ? rcu_sync_lockdep_assert+0x2a/0x50
>  ? __sb_start_write+0x121/0x1b0
>  ? vfs_write+0x17c/0x1b0
>  vfs_write+0xad/0x1b0
>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>  ksys_write+0x55/0xc0
>  do_syscall_64+0x5a/0x210
> 
> Therefore, mdev core is improved in following ways to overcome above
> issues.
> 
> 1. Before placing mdev devices on the bus, perform vendor drivers
> creation which supports the mdev creation.
> This ensures that mdev specific all necessary fields are initialized
> before a given mdev can be accessed by bus driver.
> 
> 2. During remove flow, first remove the device from the bus. This
> ensures that any bus specific devices and data is cleared.
> Once device is taken off the mdev bus, perform remove() of mdev from the
> vendor driver.
> 
> 
> 3. Linux core device model provides a way to register and auto unregister
> the device sysfs attribute groups at dev->groups. Use it
> to avoid explicit groups creation and removal.
> 
> 4. Wait for any ongoing mdev create() and remove() to finish before
> unregistering parent device using srcu. This continues to allow multiple
> create and remove to progress in parallel. At the same time guard parent
> removal while parent is being accessed by create() and remove() callbacks.
All these fixes seem reasonable and correct to me.
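
For readers unfamiliar with the pattern in point 4, here is a deliberately simplified userspace analogue. It is NOT SRCU: it uses C11 atomics and a busy-wait where SRCU sleeps, and `struct unreg_guard` with its helpers are hypothetical names. It only shows the shape of "publish that the parent is going away, then wait for in-flight create/remove calls".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct unreg_guard {
	atomic_bool alive;	/* plays the role of parent->self != NULL */
	atomic_int readers;	/* create()/remove() calls in flight */
};

static void guard_init(struct unreg_guard *g)
{
	atomic_init(&g->alive, true);
	atomic_init(&g->readers, 0);
}

/* Called at the start of create()/remove(); false means -ENODEV. */
static bool guard_enter(struct unreg_guard *g)
{
	atomic_fetch_add(&g->readers, 1);
	if (!atomic_load(&g->alive)) {
		/* Unregistration already published: back out. */
		atomic_fetch_sub(&g->readers, 1);
		return false;
	}
	return true;
}

static void guard_exit(struct unreg_guard *g)
{
	atomic_fetch_sub(&g->readers, 1);
}

/* Called by unregister: block new entrants, then wait for old ones. */
static void guard_unregister(struct unreg_guard *g)
{
	atomic_store(&g->alive, false);
	while (atomic_load(&g->readers) > 0)
		;	/* busy-wait for brevity; SRCU would sleep here */
}
```

Because a caller increments `readers` before checking `alive`, and unregister clears `alive` before reading `readers`, either the caller sees the parent as gone or unregister sees the caller and waits for it. Multiple callers can still proceed in parallel.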


> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++-----------------
> -
>  drivers/vfio/mdev/mdev_private.h |   7 +-
>  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
>  3 files changed, 84 insertions(+), 71 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 944a058..8fe0ed1 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
>  						  ref);
>  	struct device *dev = parent->dev;
>  
> +	cleanup_srcu_struct(&parent->unreg_srcu);
>  	kfree(parent);
>  	put_device(dev);
>  }
> @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct mdev_parent
> *parent)
>  		kref_put(&parent->ref, mdev_release_parent);
>  }
>  
> -static int mdev_device_create_ops(struct kobject *kobj,
> -				  struct mdev_device *mdev)
> +static int mdev_device_must_remove(struct mdev_device *mdev)

Tiny nitpick: maybe a better name? Or a comment for this function stating that
it removes the device even if it is in use.

>  {
> -	struct mdev_parent *parent = mdev->parent;
> +	struct mdev_parent *parent;
> +	struct mdev_type *type;
>  	int ret;
>  
> -	ret = parent->ops->create(kobj, mdev);
> -	if (ret)
> -		return ret;
> +	type = to_mdev_type(mdev->type_kobj);
>  
> -	ret = sysfs_create_groups(&mdev->dev.kobj,
> -				  parent->ops->mdev_attr_groups);
> +	mdev_remove_sysfs_files(&mdev->dev, type);
> +	device_del(&mdev->dev);
> +	parent = mdev->parent;
> +	ret = parent->ops->remove(mdev);
>  	if (ret)
> -		parent->ops->remove(mdev);
> +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
>  
> +	/* Balances with device_initialize() */
> +	put_device(&mdev->dev);
>  	return ret;
>  }
>  
> -/*
> - * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
> - * device is being unregistered from mdev device framework.
> - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> - *   indicates that if the mdev device is active, used by VMM or userspace
> - *   application, vendor driver could return error then don't remove the
> device.
> - * - 'force_remove' is set to 'true' when called from
> mdev_unregister_device()
> - *   which indicate that parent device is being removed from mdev device
> - *   framework so remove mdev device forcefully.
> - */
> -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> force_remove)
> -{
> -	struct mdev_parent *parent = mdev->parent;
> -	int ret;
> -
> -	/*
> -	 * Vendor driver can return error if VMM or userspace application is
> -	 * using this mdev device.
> -	 */
> -	ret = parent->ops->remove(mdev);
> -	if (ret && !force_remove)
> -		return ret;
> -
> -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> -	return 0;
> -}
> -
>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
>  	if (dev_is_mdev(dev))
> -		mdev_device_remove(dev, true);
> -
> +		mdev_device_must_remove(to_mdev_device(dev));
>  	return 0;
>  }
>  
> @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const struct
> mdev_parent_ops *ops)
>  	}
>  
>  	kref_init(&parent->ref);
> +	init_srcu_struct(&parent->unreg_srcu);
>  
>  	parent->dev = dev;
>  	parent->ops = ops;
> @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const struct
> mdev_parent_ops *ops)
>  	if (ret)
>  		dev_warn(dev, "Failed to create compatibility class link\n");
>  
> +	rcu_assign_pointer(parent->self, parent);
>  	list_add(&parent->next, &parent_list);
>  	mutex_unlock(&parent_list_lock);
>  
> @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
>  
>  	mutex_lock(&parent_list_lock);
>  	parent = __find_parent_device(dev);
> -
>  	if (!parent) {
>  		mutex_unlock(&parent_list_lock);
>  		return;
>  	}
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +
>  	dev_info(dev, "MDEV: Unregistering\n");
>  
> -	list_del(&parent->next);
> +	/* Publish that this mdev parent is unregistering. So any new
> +	 * create/remove cannot start on this parent anymore by user.
> +	 */
> +	rcu_assign_pointer(parent->self, NULL);
> +
> +	/*
> +	 * Wait for any active create() or remove() mdev ops on the parent
> +	 * to complete.
> +	 */
> +	synchronize_srcu(&parent->unreg_srcu);
> +
> +	/* At this point it is confirmed that any pending user initiated
> +	 * create or remove callbacks accessing the parent are completed.
> +	 * It is safe to remove the parent now.
> +	 */
>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
>  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  
> -	mutex_unlock(&parent_list_lock);
>  	mdev_put_parent(parent);
>  }
>  EXPORT_SYMBOL(mdev_unregister_device);
> @@ -278,14 +270,24 @@ static void mdev_device_release(struct device *dev)
>  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le
> uuid)
>  {
>  	int ret;
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	int srcu_idx;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		/* parent is undergoing unregistration */
> +		ret = -ENODEV;
> +		goto mdev_fail;
> +	}
> +
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj, struct
> device *dev, uuid_le uuid)
>  
>  	mdev->parent = parent;
>  
> +	device_initialize(&mdev->dev);
>  	mdev->dev.parent  = dev;
>  	mdev->dev.bus     = &mdev_bus_type;
>  	mdev->dev.release = mdev_device_release;
> +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
>  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
> -	ret = device_register(&mdev->dev);
> +	ret = type->parent->ops->create(kobj, mdev);
>  	if (ret)
> -		goto mdev_fail;
> +		goto create_fail;
>  
> -	ret = mdev_device_create_ops(kobj, mdev);
> +	ret = device_add(&mdev->dev);
>  	if (ret)
> -		goto create_fail;
> +		goto dev_fail;
>  
>  	ret = mdev_create_sysfs_files(&mdev->dev, type);
> -	if (ret) {
> -		mdev_device_remove_ops(mdev, true);
> -		goto create_fail;
> -	}
> +	if (ret)
> +		goto sysfs_fail;
>  
>  	mdev->type_kobj = kobj;
>  	mdev->active = true;
>  	dev_dbg(&mdev->dev, "MDEV: created\n");
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
>  	return 0;
>  
> +sysfs_fail:
> +	device_del(&mdev->dev);
> +dev_fail:
> +	type->parent->ops->remove(mdev);
>  create_fail:
> -	device_unregister(&mdev->dev);
> +	put_device(&mdev->dev);
>  mdev_fail:
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  	mdev_put_parent(parent);
>  	return ret;
>  }
>  
> -int mdev_device_remove(struct device *dev, bool force_remove)
> +int mdev_device_remove(struct device *dev)
>  {
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
> -	struct mdev_type *type;
> +	int srcu_idx;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		/* parent is undergoing unregistration */
> +		return -ENODEV;
> +	}
> +
> +	mutex_lock(&mdev_list_lock);
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
> -		return -EAGAIN;
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		return -ENODEV;
>  	}
> -
>  	mdev->active = false;
>  	mutex_unlock(&mdev_list_lock);
>  
> -	type = to_mdev_type(mdev->type_kobj);
> -	parent = mdev->parent;
> -
> -	ret = mdev_device_remove_ops(mdev, force_remove);
> -	if (ret) {
> -		mdev->active = true;
> -		return ret;
> -	}
> +	ret = mdev_device_must_remove(mdev);
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
> -	mdev_remove_sysfs_files(dev, type);
> -	device_unregister(dev);
>  	mdev_put_parent(parent);
> -
> -	return 0;
> +	return ret;
>  }
>  
>  static int __init mdev_init(void)
> diff --git a/drivers/vfio/mdev/mdev_private.h
> b/drivers/vfio/mdev/mdev_private.h
> index 84b2b6c..3d17db9 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -23,6 +23,11 @@ struct mdev_parent {
>  	struct list_head next;
>  	struct kset *mdev_types_kset;
>  	struct list_head type_list;
> +	/* Protects unregistration to wait until create/remove
> +	 * are completed.
> +	 */
> +	struct srcu_struct unreg_srcu;
> +	struct mdev_parent __rcu *self;
>  };
>  
>  struct mdev_device {
> @@ -58,6 +63,6 @@ struct mdev_type {
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
>  
>  int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le
> uuid);
> -int  mdev_device_remove(struct device *dev, bool force_remove);
> +int  mdev_device_remove(struct device *dev);
>  
>  #endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index c782fa9..68a8191 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -236,11 +236,9 @@ static ssize_t remove_store(struct device *dev, struct
> device_attribute *attr,
>  	if (val && device_remove_file_self(dev, attr)) {
>  		int ret;
>  
> -		ret = mdev_device_remove(dev, false);
> -		if (ret) {
> -			device_create_file(dev, attr);
> +		ret = mdev_device_remove(dev);
> +		if (ret)
>  			return ret;
> -		}
>  	}
>  
>  	return count;

The patch looks OK to me, especially looking at the code after the changes were
applied. I might have missed something though, due to the amount of changes.

I lightly tested the whole patch series with my mdev driver, and it seems to
survive, but my testing doesn't cover much of the error paths, so there's that.

I'll keep this applied so if I notice any errors I'll let you know.

If you could split this into a few patches, that would be even better, but anyway
thanks a lot for this work!

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure
  2019-03-22 23:20 ` [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure Parav Pandit
  2019-03-25 11:48   ` Maxim Levitsky
@ 2019-03-25 18:17   ` Kirti Wankhede
  2019-03-25 19:21     ` Alex Williamson
  1 sibling, 1 reply; 49+ messages in thread
From: Kirti Wankhede @ 2019-03-25 18:17 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, alex.williamson



On 3/23/2019 4:50 AM, Parav Pandit wrote:
> device_register() performs put_device() if device_add() fails.
> This balances with device_initialize().
> 
> mdev core performing put_device() when device_register() fails,
> is an error that puts already released device again.
> Therefore, don't put the device on error.
> 

device_add() doesn't call put_device(dev) on any of its error paths. It
releases the reference to its parent, put_device(parent), but not the
reference to the device itself, put_device(dev).

Thanks,
Kirti


> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 0212f0e..3e5880a 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -318,10 +318,8 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
>  	ret = device_register(&mdev->dev);
> -	if (ret) {
> -		put_device(&mdev->dev);
> +	if (ret)
>  		goto mdev_fail;
> -	}
>  
>  	ret = mdev_device_create_ops(kobj, mdev);
>  	if (ret)
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/8] vfio/mdev: Avoid release parent reference during error path
  2019-03-22 23:20 ` [PATCH 2/8] vfio/mdev: Avoid release parent reference during error path Parav Pandit
  2019-03-25 11:49   ` Maxim Levitsky
@ 2019-03-25 18:27   ` Kirti Wankhede
  1 sibling, 0 replies; 49+ messages in thread
From: Kirti Wankhede @ 2019-03-25 18:27 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, alex.williamson



On 3/23/2019 4:50 AM, Parav Pandit wrote:
> During mdev parent registration in mdev_register_device(),
> if parent device is duplicate, it releases the reference of existing
> parent device.
> This is incorrect. Existing parent device should not be touched.
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 3e5880a..4f213e4d 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -182,6 +182,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	/* Check for duplicate */
>  	parent = __find_parent_device(dev);
>  	if (parent) {
> +		parent = NULL;
>  		ret = -EEXIST;
>  		goto add_dev_err;
>  	}
> 

Agreed. Thanks for fixing this.

Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/8] vfio/mdev: Removed unused kref
  2019-03-22 23:20 ` [PATCH 3/8] vfio/mdev: Removed unused kref Parav Pandit
  2019-03-25 11:50   ` Maxim Levitsky
@ 2019-03-25 18:41   ` Kirti Wankhede
  1 sibling, 0 replies; 49+ messages in thread
From: Kirti Wankhede @ 2019-03-25 18:41 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, alex.williamson



On 3/23/2019 4:50 AM, Parav Pandit wrote:
> Remove unused kref from the mdev_device structure.
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 1 -
>  drivers/vfio/mdev/mdev_private.h | 1 -
>  2 files changed, 2 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 4f213e4d..3d91f62 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -311,7 +311,6 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  	mutex_unlock(&mdev_list_lock);
>  
>  	mdev->parent = parent;
> -	kref_init(&mdev->ref);
>  
>  	mdev->dev.parent  = dev;
>  	mdev->dev.bus     = &mdev_bus_type;
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index b5819b7..84b2b6c 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -30,7 +30,6 @@ struct mdev_device {
>  	struct mdev_parent *parent;
>  	uuid_le uuid;
>  	void *driver_data;
> -	struct kref ref;
>  	struct list_head next;
>  	struct kobject *type_kobj;
>  	bool active;
> 

Yes, this should be removed.

Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols
  2019-03-22 23:20 ` [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
  2019-03-25 11:56   ` Maxim Levitsky
@ 2019-03-25 19:07   ` Kirti Wankhede
  2019-03-25 19:49     ` Alex Williamson
  1 sibling, 1 reply; 49+ messages in thread
From: Kirti Wankhede @ 2019-03-25 19:07 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, alex.williamson



On 3/23/2019 4:50 AM, Parav Pandit wrote:
> There is no need use 'extern' for exported functions.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  include/linux/mdev.h | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index b6e048e..0924c48 100644
> --- a/include/linux/mdev.h
> +++ b/include/linux/mdev.h
> @@ -118,21 +118,20 @@ struct mdev_driver {
>  
>  #define to_mdev_driver(drv)	container_of(drv, struct mdev_driver, driver)
>  
> -extern void *mdev_get_drvdata(struct mdev_device *mdev);
> -extern void mdev_set_drvdata(struct mdev_device *mdev, void *data);
> -extern uuid_le mdev_uuid(struct mdev_device *mdev);
> +void *mdev_get_drvdata(struct mdev_device *mdev);
> +void mdev_set_drvdata(struct mdev_device *mdev, void *data);
> +uuid_le mdev_uuid(struct mdev_device *mdev);
>  
>  extern struct bus_type mdev_bus_type;
>  
> -extern int  mdev_register_device(struct device *dev,
> -				 const struct mdev_parent_ops *ops);
> -extern void mdev_unregister_device(struct device *dev);
> +int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops);
> +void mdev_unregister_device(struct device *dev);
>  
> -extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> -extern void mdev_unregister_driver(struct mdev_driver *drv);
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +void mdev_unregister_driver(struct mdev_driver *drv);
>  
> -extern struct device *mdev_parent_dev(struct mdev_device *mdev);
> -extern struct device *mdev_dev(struct mdev_device *mdev);
> -extern struct mdev_device *mdev_from_dev(struct device *dev);
> +struct device *mdev_parent_dev(struct mdev_device *mdev);
> +struct device *mdev_dev(struct mdev_device *mdev);
> +struct mdev_device *mdev_from_dev(struct device *dev);
>  
>  #endif /* MDEV_H */
> 

Adding 'extern' to exported symbols is in line with other exported
functions from the device core module, like device_register(),
device_unregister(), get_device(), put_device().

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY
  2019-03-22 23:20 ` [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
  2019-03-25 11:57   ` Maxim Levitsky
@ 2019-03-25 19:18   ` Kirti Wankhede
  2019-03-25 21:29     ` Parav Pandit
  1 sibling, 1 reply; 49+ messages in thread
From: Kirti Wankhede @ 2019-03-25 19:18 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, alex.williamson



On 3/23/2019 4:50 AM, Parav Pandit wrote:
> Instead of masking return error to -EBUSY, return actual error
> returned by the driver.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 3d91f62..ab05464 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -142,7 +142,7 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
>  	 */
>  	ret = parent->ops->remove(mdev);
>  	if (ret && !force_remove)
> -		return -EBUSY;
> +		return ret;
>  
>  	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
>  	return 0;
> 

-EBUSY is returned here intentionally. If a VMM or userspace application is
using this mdev device, the vendor driver can return an error. In that case
the sysfs interface should see -EBUSY, indicating that the device is still active.

Thanks,
Kirti
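
The difference between the two behaviours under discussion can be shown with
a minimal userspace sketch. This is a model only, not the mdev core code:
remove_ops_masked(), remove_ops_passthrough() and the model_vendor_remove_*()
callbacks are hypothetical names, and -EIO is an arbitrary example error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vendor driver's remove() callback. */
typedef int (*vendor_remove_fn)(void);

int model_vendor_remove_fails(void) { return -EIO; }
int model_vendor_remove_ok(void) { return 0; }

/* Existing behaviour: any vendor failure is reported as -EBUSY unless forced. */
int remove_ops_masked(vendor_remove_fn remove, bool force_remove)
{
	int ret;

	assert(remove != NULL);
	ret = remove();
	if (ret && !force_remove)
		return -EBUSY;
	return 0;
}

/* Behaviour proposed by the patch: the vendor's own error code is returned. */
int remove_ops_passthrough(vendor_remove_fn remove, bool force_remove)
{
	int ret;

	assert(remove != NULL);
	ret = remove();
	if (ret && !force_remove)
		return ret;
	return 0;
}
```

In both variants a forced removal reports success regardless of what the
vendor callback returned; they differ only in what a non-forced caller, such
as the sysfs 'remove' path, gets to see.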

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure
  2019-03-25 18:17   ` Kirti Wankhede
@ 2019-03-25 19:21     ` Alex Williamson
  2019-03-25 21:11       ` Parav Pandit
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2019-03-25 19:21 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: Parav Pandit, kvm, linux-kernel

On Mon, 25 Mar 2019 23:47:30 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > device_register() performs put_device() if device_add() fails.
> > This balances with device_initialize().
> > 
> > mdev core performing put_device() when device_register() fails,
> > is an error that puts already released device again.
> > Therefore, don't put the device on error.
> >   
> 
> device_add() on all errors doesn't call put_device(dev). It releases
> reference to its parent, put_device(parent), but not the device itself,
> put_device(dev).

Sort of, device_initialize() initializes the reference count to 1,
device_add() increments the reference count to 2 via the get_device()
and then drops it back to 1 on all exit paths.  The oddity is the
failure path of get_device() itself, but that can only happen if passed
a NULL device, where put_device() is a no-op and not relevant here.  So
in all cases device_register() returns with a reference count of 1 and
we need to call put_device() to free the allocated object.  The below
change would leak the mdev on error.  Thanks,

Alex
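
The reference counting described above can be modelled in a few lines of
userspace C. This is only a sketch of the counting argument, not the driver
core implementation: struct model_device and every model_*() function are
hypothetical, and the 'fail' parameter merely simulates an error inside the
add step.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace model only; none of this is the kernel's implementation. */
struct model_device {
	int refcount;
	bool released;	/* set when the last reference is dropped */
};

void model_device_initialize(struct model_device *dev)
{
	dev->refcount = 1;
	dev->released = false;
}

void model_get_device(struct model_device *dev)
{
	dev->refcount++;
}

void model_put_device(struct model_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->released = true;	/* a release() callback would free here */
}

/* Takes a temporary reference and drops it again on every exit path. */
int model_device_add(struct model_device *dev, bool fail)
{
	int ret = fail ? -ENOMEM : 0;

	model_get_device(dev);		/* 1 -> 2 */
	model_put_device(dev);		/* 2 -> 1, success or failure */
	return ret;
}

int model_device_register(struct model_device *dev, bool fail)
{
	model_device_initialize(dev);
	return model_device_add(dev, fail);
}
```

In this model a failed model_device_register() leaves the count at 1, so the
object is only released once the caller drops that last reference itself.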

> > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c | 4 +---
> >  1 file changed, 1 insertion(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > index 0212f0e..3e5880a 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -318,10 +318,8 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> >  
> >  	ret = device_register(&mdev->dev);
> > -	if (ret) {
> > -		put_device(&mdev->dev);
> > +	if (ret)
> >  		goto mdev_fail;
> > -	}
> >  
> >  	ret = mdev_device_create_ops(kobj, mdev);
> >  	if (ret)
> >   


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-03-22 23:20 ` [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
  2019-03-25 11:58   ` Maxim Levitsky
@ 2019-03-25 19:35   ` Kirti Wankhede
  2019-03-25 20:49     ` Alex Williamson
  1 sibling, 1 reply; 49+ messages in thread
From: Kirti Wankhede @ 2019-03-25 19:35 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, alex.williamson



On 3/23/2019 4:50 AM, Parav Pandit wrote:
> device_for_each_child() stops executing callback function for remaining
> child devices, if callback hits an error.
> Each child mdev device is independent of each other.
> While unregistering parent device, mdev core must remove all child mdev
> devices.
> Therefore, mdev_device_remove_cb() always returns success so that
> device_for_each_child doesn't abort if one child removal hits error.
> 

When unregistering the parent device, force_remove is set to true and
mdev_device_remove_ops() always returns success.

> While at it, improve remove and unregister functions for below simplicity.
> 
> There isn't need to pass forced flag pointer during mdev parent
> removal which invokes mdev_device_remove().

There is a need to pass the flag, pasting here the comment above
mdev_device_remove_ops() which explains why the flag is needed:

/*
 * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
 * device is being unregistered from mdev device framework.
 * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
 *   indicates that if the mdev device is active, used by VMM or userspace
 *   application, vendor driver could return error then don't remove the device.
 * - 'force_remove' is set to 'true' when called from mdev_unregister_device()
 *   which indicate that parent device is being removed from mdev device
 *   framework so remove mdev device forcefully.
 */

Thanks,
Kirti

> So simplify the flow.
> 
> mdev_device_remove() is called from two paths.
> 1. mdev_unregister_driver()
>      mdev_device_remove_cb()
>        mdev_device_remove()
> 2. remove_store()
>      mdev_device_remove()
> 
> When device is removed by user using remote_store(), device under
> removal is mdev device.
> When device is removed during parent device removal using generic child
> iterator, mdev check is already done using dev_is_mdev().
> 
> Hence, remove the unnecessary loop in mdev_device_remove().
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c | 24 +++++-------------------
>  1 file changed, 5 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index ab05464..944a058 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -150,10 +150,10 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
>  
>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
> -	if (!dev_is_mdev(dev))
> -		return 0;
> +	if (dev_is_mdev(dev))
> +		mdev_device_remove(dev, true);
>  
> -	return mdev_device_remove(dev, data ? *(bool *)data : true);
> +	return 0;
>  }
>  
>  /*
> @@ -241,7 +241,6 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  void mdev_unregister_device(struct device *dev)
>  {
>  	struct mdev_parent *parent;
> -	bool force_remove = true;
>  
>  	mutex_lock(&parent_list_lock);
>  	parent = __find_parent_device(dev);
> @@ -255,8 +254,7 @@ void mdev_unregister_device(struct device *dev)
>  	list_del(&parent->next);
>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
> -	device_for_each_child(dev, (void *)&force_remove,
> -			      mdev_device_remove_cb);
> +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  
> @@ -346,24 +344,12 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  
>  int mdev_device_remove(struct device *dev, bool force_remove)
>  {
> -	struct mdev_device *mdev, *tmp;
> +	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> -
> -	mutex_lock(&mdev_list_lock);
> -	list_for_each_entry(tmp, &mdev_list, next) {
> -		if (tmp == mdev)
> -			break;
> -	}
> -
> -	if (tmp != mdev) {
> -		mutex_unlock(&mdev_list_lock);
> -		return -ENODEV;
> -	}
> -
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
>  		return -EAGAIN;
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols
  2019-03-25 19:07   ` Kirti Wankhede
@ 2019-03-25 19:49     ` Alex Williamson
  2019-03-25 21:27       ` Parav Pandit
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2019-03-25 19:49 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: Parav Pandit, kvm, linux-kernel

On Tue, 26 Mar 2019 00:37:04 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > There is no need use 'extern' for exported functions.
> > 
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  include/linux/mdev.h | 21 ++++++++++-----------
> >  1 file changed, 10 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> > index b6e048e..0924c48 100644
> > --- a/include/linux/mdev.h
> > +++ b/include/linux/mdev.h
> > @@ -118,21 +118,20 @@ struct mdev_driver {
> >  
> >  #define to_mdev_driver(drv)	container_of(drv, struct mdev_driver, driver)
> >  
> > -extern void *mdev_get_drvdata(struct mdev_device *mdev);
> > -extern void mdev_set_drvdata(struct mdev_device *mdev, void *data);
> > -extern uuid_le mdev_uuid(struct mdev_device *mdev);
> > +void *mdev_get_drvdata(struct mdev_device *mdev);
> > +void mdev_set_drvdata(struct mdev_device *mdev, void *data);
> > +uuid_le mdev_uuid(struct mdev_device *mdev);
> >  
> >  extern struct bus_type mdev_bus_type;
> >  
> > -extern int  mdev_register_device(struct device *dev,
> > -				 const struct mdev_parent_ops *ops);
> > -extern void mdev_unregister_device(struct device *dev);
> > +int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops);
> > +void mdev_unregister_device(struct device *dev);
> >  
> > -extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> > -extern void mdev_unregister_driver(struct mdev_driver *drv);
> > +int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> > +void mdev_unregister_driver(struct mdev_driver *drv);
> >  
> > -extern struct device *mdev_parent_dev(struct mdev_device *mdev);
> > -extern struct device *mdev_dev(struct mdev_device *mdev);
> > -extern struct mdev_device *mdev_from_dev(struct device *dev);
> > +struct device *mdev_parent_dev(struct mdev_device *mdev);
> > +struct device *mdev_dev(struct mdev_device *mdev);
> > +struct mdev_device *mdev_from_dev(struct device *dev);
> >  
> >  #endif /* MDEV_H */
> >   
> 
> Adding 'extern' to exported symbols is inline to other exported
> functions from device's core module like device_register(),
> device_unregister(), get_device(), put_device()

Right, I'd be inclined to leave this as a style choice, but...

commit 3fe5dbfef47e992b810cbe82af1df02d8255fb8c
Author: Alexey Dobriyan <adobriyan@gmail.com>
Date:   Thu Jan 3 15:26:16 2019 -0800

    Documentation/process/coding-style.rst: don't use "extern" with function prototypes
    
    `extern' with function prototypes makes lines longer and creates more
    characters on the screen.
    
    Do not bug people with checkpatch.pl warnings for now as fallout can be
    devastating.

So it's a new decision and rather weakly imposed new standard.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 6/8] vfio/mdev: Follow correct remove sequence
  2019-03-22 23:20 ` [PATCH 6/8] vfio/mdev: Follow correct remove sequence Parav Pandit
  2019-03-25 11:58   ` Maxim Levitsky
@ 2019-03-25 20:20   ` Alex Williamson
  2019-03-25 21:31     ` Parav Pandit
  1 sibling, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2019-03-25 20:20 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede

On Fri, 22 Mar 2019 18:20:33 -0500
Parav Pandit <parav@mellanox.com> wrote:

> mdev_remove_sysfs_files() should follow exact mirror sequence of a
> create, similar to what is followed in error unwinding path of
> mdev_create_sysfs_files().
> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_sysfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index ce5dd21..c782fa9 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -280,7 +280,7 @@ int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
>  
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
>  {
> +	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
>  	sysfs_remove_link(&dev->kobj, "mdev_type");
>  	sysfs_remove_link(type->devices_kobj, dev_name(dev));
> -	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
>  }

Ok, I agree this is good practice, but what qualifies a "Fixes:" tag
here?  The fixes reference is incorrect in any case, 6a62c1dfb5c7
changed the creation ordering and didn't update the remove ordering to
match, but I still don't see an actual problem with the remove ordering
that necessitates the tag.  Please clarify.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-03-25 19:35   ` Kirti Wankhede
@ 2019-03-25 20:49     ` Alex Williamson
  2019-03-25 21:36       ` Parav Pandit
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2019-03-25 20:49 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: Parav Pandit, kvm, linux-kernel

On Tue, 26 Mar 2019 01:05:34 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > device_for_each_child() stops executing callback function for remaining
> > child devices, if callback hits an error.
> > Each child mdev device is independent of each other.
> > While unregistering parent device, mdev core must remove all child mdev
> > devices.
> > Therefore, mdev_device_remove_cb() always returns success so that
> > device_for_each_child doesn't abort if one child removal hits error.
> >   
> 
> When unregistering parent device, force_remove is set to true amd
> mdev_device_remove_ops() always returns success.

Can we know that?  mdev_device_remove() doesn't guarantee to return
zero.

> > While at it, improve remove and unregister functions for below simplicity.
> > 
> > There isn't need to pass forced flag pointer during mdev parent
> > removal which invokes mdev_device_remove().  
> 
> There is a need to pass the flag, pasting here the comment above
> mdev_device_remove_ops() which explains why the flag is needed:
> 
> /*
>  * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
>  * device is being unregistered from mdev device framework.
>  * - 'force_remove' is set to 'false' when called from sysfs's 'remove'
> which
>  *   indicates that if the mdev device is active, used by VMM or userspace
>  *   application, vendor driver could return error then don't remove the
> device.
>  * - 'force_remove' is set to 'true' when called from
> mdev_unregister_device()
>  *   which indicate that parent device is being removed from mdev device
>  *   framework so remove mdev device forcefully.
>  */

I don't see that this changes the force behavior, it's simply noting
that in order to continue the device_for_each_child() iterator, we need
to return zero, regardless of what mdev_device_remove() returns, and
the parent remove path is the only caller of mdev_device_remove_cb(),
so we can assume force = true when calling mdev_device_remove().  Aside
from maybe a WARN_ON if mdev_device_remove() returns non-zero, that
much looks reasonable to me.
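
The iterator behaviour this relies on can be sketched in userspace C. The
code is a model under stated assumptions, not the driver core: children are
plain ints, model_for_each_child() stops at the first non-zero callback
result, and model_remove() is a hypothetical removal that fails for negative
ids and counts how many children it was asked to remove.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*child_fn)(int child, void *data);

/* Model of an iterator that stops at the first non-zero callback result. */
int model_for_each_child(const int *children, size_t n, void *data,
			 child_fn fn)
{
	int ret = 0;
	size_t i;

	assert(fn != NULL);
	for (i = 0; i < n && !ret; i++)
		ret = fn(children[i], data);
	return ret;
}

/* Hypothetical removal: fails for negative ids, counts every attempt. */
int model_remove(int child, int *attempts)
{
	(*attempts)++;
	return child < 0 ? -EBUSY : 0;
}

/* Propagating the error aborts the walk at the failing child. */
int cb_propagate(int child, void *data)
{
	return model_remove(child, data);
}

/* Always returning 0 lets the walk reach every remaining child. */
int cb_swallow(int child, void *data)
{
	(void)model_remove(child, data);
	return 0;
}
```

With children {1, -2, 3}, the propagating callback stops after the second
child, while the swallowing callback attempts all three.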

>  So simplify the flow.
> > 
> > mdev_device_remove() is called from two paths.
> > 1. mdev_unregister_driver()
> >      mdev_device_remove_cb()
> >        mdev_device_remove()
> > 2. remove_store()
> >      mdev_device_remove()
> > 
> > When device is removed by user using remote_store(), device under
> > removal is mdev device.
> > When device is removed during parent device removal using generic child
> > iterator, mdev check is already done using dev_is_mdev().
> > 
> > Hence, remove the unnecessary loop in mdev_device_remove().

I don't think knowing the device type is the only reason for this loop
though.  Both paths you mention above can race with each other, so we
need to serialize them and pick a winner.  The mdev_list_lock allows us
to do that.  Additionally...

> > 
> > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c | 24 +++++-------------------
> >  1 file changed, 5 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > index ab05464..944a058 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -150,10 +150,10 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
> >  
> >  static int mdev_device_remove_cb(struct device *dev, void *data)
> >  {
> > -	if (!dev_is_mdev(dev))
> > -		return 0;
> > +	if (dev_is_mdev(dev))
> > +		mdev_device_remove(dev, true);
> >  
> > -	return mdev_device_remove(dev, data ? *(bool *)data : true);
> > +	return 0;
> >  }
> >  
> >  /*
> > @@ -241,7 +241,6 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
> >  void mdev_unregister_device(struct device *dev)
> >  {
> >  	struct mdev_parent *parent;
> > -	bool force_remove = true;
> >  
> >  	mutex_lock(&parent_list_lock);
> >  	parent = __find_parent_device(dev);
> > @@ -255,8 +254,7 @@ void mdev_unregister_device(struct device *dev)
> >  	list_del(&parent->next);
> >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> >  
> > -	device_for_each_child(dev, (void *)&force_remove,
> > -			      mdev_device_remove_cb);
> > +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> >  
> >  	parent_remove_sysfs_files(parent);
> >  
> > @@ -346,24 +344,12 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> >  
> >  int mdev_device_remove(struct device *dev, bool force_remove)
> >  {
> > -	struct mdev_device *mdev, *tmp;
> > +	struct mdev_device *mdev;
> >  	struct mdev_parent *parent;
> >  	struct mdev_type *type;
> >  	int ret;
> >  
> >  	mdev = to_mdev_device(dev);
> > -
> > -	mutex_lock(&mdev_list_lock);

Acquiring the lock is removed, but...

> > -	list_for_each_entry(tmp, &mdev_list, next) {
> > -		if (tmp == mdev)
> > -			break;
> > -	}
> > -
> > -	if (tmp != mdev) {
> > -		mutex_unlock(&mdev_list_lock);
> > -		return -ENODEV;
> > -	}
> > -
> >  	if (!mdev->active) {
> >  		mutex_unlock(&mdev_list_lock);
> >  		return -EAGAIN;
> >   

We still release it in this path and the code below here.  If we don't
find the device on the list under lock, then we're working with a stale
device and playing with the 'active' flag of that device outside of any
sort of mutual exclusion is racy.  Thanks,

Alex
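
As a minimal sketch of why the 'active' flag has to be tested and cleared
under the lock, here is a userspace model using a pthread mutex. It is not
the mdev code: struct model_mdev, model_list_lock and
model_claim_for_removal() are hypothetical, and the list lookup that guards
against a stale device is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of a device with an 'active' flag. */
struct model_mdev {
	bool active;
};

pthread_mutex_t model_list_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Returns 0 for the single caller that wins the removal and -EAGAIN for
 * every later caller. The test and the clear happen under one lock hold,
 * so two racing removers cannot both observe active == true.
 */
int model_claim_for_removal(struct model_mdev *mdev)
{
	int ret = 0;

	assert(mdev != NULL);
	pthread_mutex_lock(&model_list_lock);
	if (!mdev->active)
		ret = -EAGAIN;
	else
		mdev->active = false;
	pthread_mutex_unlock(&model_list_lock);
	return ret;
}
```

Without the lock, both the sysfs remove path and the parent unregister path
could read the flag as true before either clears it, and both would go on to
tear down the same device.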

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure
  2019-03-25 19:21     ` Alex Williamson
@ 2019-03-25 21:11       ` Parav Pandit
  0 siblings, 0 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-25 21:11 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede; +Cc: kvm, linux-kernel



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 2:21 PM
> To: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH 1/8] vfio/mdev: Fix to not do put_device on
> device_register failure
> 
> On Mon, 25 Mar 2019 23:47:30 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > > device_register() performs put_device() if device_add() fails.
> > > This balances with device_initialize().
> > >
> > > mdev core performing put_device() when device_register() fails, is
> > > an error that puts already released device again.
> > > Therefore, don't put the device on error.
> > >
> >
> > device_add() on all errors doesn't call put_device(dev). It releases
> > reference to its parent, put_device(parent), but not the device
> > itself, put_device(dev).
> 
> Sort of, device_initialize() initializes the reference count to 1,
> device_add() increments the reference count to 2 via the get_device() and
> then drops it back to 1 on all exit paths.  The oddity is the failure path of
> get_device() itself, but that can only happen if passed a NULL device, where
> put_device() is a no-op and not relevant here.  So in all cases
> device_register() returns with a reference count of 1 and we need to call
> put_device() to free the allocated object.  The below change would leak the
> mdev on error.  Thanks,
> 
Yes.
I missed the NOTE at the start of the device_add() comment block.
I will drop this patch from the series.

> Alex
> 
> > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > ---
> > >  drivers/vfio/mdev/mdev_core.c | 4 +---
> > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > > index 0212f0e..3e5880a 100644
> > > --- a/drivers/vfio/mdev/mdev_core.c
> > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > @@ -318,10 +318,8 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> > >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> > >
> > >  	ret = device_register(&mdev->dev);
> > > -	if (ret) {
> > > -		put_device(&mdev->dev);
> > > +	if (ret)
> > >  		goto mdev_fail;
> > > -	}
> > >
> > >  	ret = mdev_device_create_ops(kobj, mdev);
> > >  	if (ret)
> > >


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols
  2019-03-25 19:49     ` Alex Williamson
@ 2019-03-25 21:27       ` Parav Pandit
  0 siblings, 0 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-25 21:27 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede; +Cc: kvm, linux-kernel



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 2:50 PM
> To: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH 4/8] vfio/mdev: Drop redundant extern for exported
> symbols
> 
> On Tue, 26 Mar 2019 00:37:04 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > > There is no need use 'extern' for exported functions.
> > >
> > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > ---
> > >  include/linux/mdev.h | 21 ++++++++++-----------
> > >  1 file changed, 10 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> > > index b6e048e..0924c48 100644
> > > --- a/include/linux/mdev.h
> > > +++ b/include/linux/mdev.h
> > > @@ -118,21 +118,20 @@ struct mdev_driver {
> > >
> > >  #define to_mdev_driver(drv)	container_of(drv, struct mdev_driver, driver)
> > >
> > > -extern void *mdev_get_drvdata(struct mdev_device *mdev);
> > > -extern void mdev_set_drvdata(struct mdev_device *mdev, void *data);
> > > -extern uuid_le mdev_uuid(struct mdev_device *mdev);
> > > +void *mdev_get_drvdata(struct mdev_device *mdev);
> > > +void mdev_set_drvdata(struct mdev_device *mdev, void *data);
> > > +uuid_le mdev_uuid(struct mdev_device *mdev);
> > >
> > >  extern struct bus_type mdev_bus_type;
> > >
> > > -extern int  mdev_register_device(struct device *dev,
> > > -				 const struct mdev_parent_ops *ops);
> > > -extern void mdev_unregister_device(struct device *dev);
> > > +int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops);
> > > +void mdev_unregister_device(struct device *dev);
> > >
> > > -extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> > > -extern void mdev_unregister_driver(struct mdev_driver *drv);
> > > +int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> > > +void mdev_unregister_driver(struct mdev_driver *drv);
> > >
> > > -extern struct device *mdev_parent_dev(struct mdev_device *mdev);
> > > -extern struct device *mdev_dev(struct mdev_device *mdev);
> > > -extern struct mdev_device *mdev_from_dev(struct device *dev);
> > > +struct device *mdev_parent_dev(struct mdev_device *mdev);
> > > +struct device *mdev_dev(struct mdev_device *mdev);
> > > +struct mdev_device *mdev_from_dev(struct device *dev);
> > >
> > >  #endif /* MDEV_H */
> > >
> >
> > Adding 'extern' to exported symbols is in line with other exported
> > functions from device's core module like device_register(),
> > device_unregister(), get_device(), put_device()
> 
> Right, I'd be inclined to leave this as a style choice, but...
> 
> commit 3fe5dbfef47e992b810cbe82af1df02d8255fb8c
> Author: Alexey Dobriyan <adobriyan@gmail.com>
> Date:   Thu Jan 3 15:26:16 2019 -0800
> 
>     Documentation/process/coding-style.rst: don't use "extern" with function
> prototypes
> 
>     `extern' with function prototypes makes lines longer and creates more
>     characters on the screen.
> 
>     Do not bug people with checkpatch.pl warnings for now as fallout can be
>     devastating.
> 
> So it's a new decision and rather weakly imposed new standard.  Thanks,
> 
We always improve the kernel, sometimes in pieces, sometimes at the subsystem level, and sometimes tree-wide.
This change is done at the mdev level.
The device core is not a good example to justify keeping 'extern' here.
That code was written more than 10 years ago.
So we should be open to improvements, small or large.
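
For readers following the style argument, here is a minimal userspace sketch of why the keyword is redundant on function prototypes. In C, a function declared at file scope has external linkage unless it is marked static, so the two prototypes of demo_get() below declare the same function. The names demo_get() and demo_set() are hypothetical and are not part of mdev.

```c
#include <assert.h>

/*
 * Both prototypes are equivalent: a file-scope function declaration has
 * external linkage by default, with or without the keyword.
 */
extern int demo_get(void);	/* style with the keyword */
int demo_get(void);		/* style without it, as in the patch */
void demo_set(int v);

/* Hypothetical backing storage, private to this translation unit. */
static int demo_value;

int demo_get(void)
{
	return demo_value;
}

void demo_set(int v)
{
	demo_value = v;
}
```

The same does not hold for objects: the patch keeps 'extern struct bus_type mdev_bus_type;' because dropping the keyword from a variable declaration in a header would turn it into a tentative definition.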

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY
  2019-03-25 19:18   ` Kirti Wankhede
@ 2019-03-25 21:29     ` Parav Pandit
  0 siblings, 0 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-25 21:29 UTC (permalink / raw)
  To: Kirti Wankhede, kvm, linux-kernel, alex.williamson



> -----Original Message-----
> From: Kirti Wankhede <kwankhede@nvidia.com>
> Sent: Monday, March 25, 2019 2:18 PM
> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; alex.williamson@redhat.com
> Subject: Re: [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY
> 
> 
> 
> On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > Instead of masking return error to -EBUSY, return actual error
> > returned by the driver.
> >
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/mdev/mdev_core.c
> > b/drivers/vfio/mdev/mdev_core.c index 3d91f62..ab05464 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -142,7 +142,7 @@ static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
> >  	 */
> >  	ret = parent->ops->remove(mdev);
> >  	if (ret && !force_remove)
> > -		return -EBUSY;
> > +		return ret;
> >
> >  	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> >  	return 0;
> >
> 
> Intentionally returned -EBUSY here. If VMM or userspace application is using
> this mdev device, vendor driver can return error.
If the vendor driver detects that it is busy, it must return -EBUSY itself, not any other status.
The mdev core is not supposed to mask some other error as -EBUSY.
Hence the fix.

> In that case sysfs interface
> should see -EBUSY error indicating device is still active.
> 
> Thanks,
> Kirti
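
The difference between the two behaviours discussed above can be sketched in userspace C. This is only an illustration under stated assumptions: demo_remove_fn, vendor_busy() and vendor_io_error() are hypothetical stand-ins for a vendor driver's remove callback, and the two wrappers mirror the before and after shape of the 'if (ret && !force_remove)' check from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical vendor remove callback: returns 0 or a negative errno. */
typedef int (*demo_remove_fn)(void);

/* Before the patch: every vendor failure is reported as -EBUSY. */
static int remove_masked(demo_remove_fn vendor_remove, bool force_remove)
{
	int ret = vendor_remove();

	if (ret && !force_remove)
		return -EBUSY;
	return 0;
}

/* After the patch: the vendor's own error code reaches the caller. */
static int remove_propagated(demo_remove_fn vendor_remove, bool force_remove)
{
	int ret = vendor_remove();

	if (ret && !force_remove)
		return ret;
	return 0;
}

/* Two hypothetical vendor behaviours used to compare the wrappers. */
static int vendor_busy(void)
{
	return -EBUSY;
}

static int vendor_io_error(void)
{
	return -EIO;
}
```

With the propagating variant a busy device still surfaces as -EBUSY, as long as the vendor driver returns that code itself, while an unrelated failure such as -EIO is no longer reported as "busy".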

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 6/8] vfio/mdev: Follow correct remove sequence
  2019-03-25 20:20   ` Alex Williamson
@ 2019-03-25 21:31     ` Parav Pandit
  0 siblings, 0 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-25 21:31 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, kwankhede



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 3:21 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com
> Subject: Re: [PATCH 6/8] vfio/mdev: Follow correct remove sequence
> 
> On Fri, 22 Mar 2019 18:20:33 -0500
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > mdev_remove_sysfs_files() should follow exact mirror sequence of a
> > create, similar to what is followed in error unwinding path of
> > mdev_create_sysfs_files().
> >
> > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_sysfs.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/mdev/mdev_sysfs.c
> > b/drivers/vfio/mdev/mdev_sysfs.c index ce5dd21..c782fa9 100644
> > --- a/drivers/vfio/mdev/mdev_sysfs.c
> > +++ b/drivers/vfio/mdev/mdev_sysfs.c
> > @@ -280,7 +280,7 @@ int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
> >
> >  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
> >  {
> > +	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
> >  	sysfs_remove_link(&dev->kobj, "mdev_type");
> >  	sysfs_remove_link(type->devices_kobj, dev_name(dev));
> > -	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
> >  }
> 
> Ok, I agree this is good practice, but what qualifies a "Fixes:" tag here?  The
> fixes reference is incorrect in any case, 6a62c1dfb5c7 changed the creation
> ordering and didn't update the remove ordering to match, but I still don't
> see an actual problem with the remove ordering that necessitates the tag.
> Please clarify.  Thanks,
> 
In the netdev and rdma subsystems we always add a Fixes tag whenever there is a fix, small or big.
So following that good practice is better.
I will correct the commit ID in the tag in v1.

> Alex
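
The "exact mirror" rule argued for in this patch can be sketched generically in userspace C. The steps A, B and C, the log buffer and the fail_at test hook are all hypothetical; they only illustrate that the error unwinding path of a create function and the matching remove function both undo steps in reverse order of creation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical step log so the ordering can be inspected afterwards. */
static char demo_log[64];

static void demo_step(const char *name)
{
	strncat(demo_log, name, sizeof(demo_log) - strlen(demo_log) - 1);
}

/*
 * Create three resources in order A, B, C.
 * fail_at: 0 = no failure, 1..3 = fail while creating that step (test hook).
 */
static int demo_create(int fail_at)
{
	if (fail_at == 1)
		return -1;
	demo_step("+A");

	if (fail_at == 2)
		goto undo_a;
	demo_step("+B");

	if (fail_at == 3)
		goto undo_b;
	demo_step("+C");
	return 0;

undo_b:
	demo_step("-B");
undo_a:
	demo_step("-A");
	return -1;
}

/* Removal mirrors creation: last created, first removed. */
static void demo_remove(void)
{
	demo_step("-C");
	demo_step("-B");
	demo_step("-A");
}
```

A full create followed by a remove logs "+A+B+C-C-B-A", and a failure at the third step logs "+A+B-B-A", so both teardown paths walk the same order.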

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-03-25 20:49     ` Alex Williamson
@ 2019-03-25 21:36       ` Parav Pandit
  2019-03-25 21:52         ` Alex Williamson
  0 siblings, 1 reply; 49+ messages in thread
From: Parav Pandit @ 2019-03-25 21:36 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede; +Cc: kvm, linux-kernel



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 3:50 PM
> To: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if
> one fails
> 
> On Tue, 26 Mar 2019 01:05:34 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > > device_for_each_child() stops executing callback function for
> > > remaining child devices, if callback hits an error.
> > > Each child mdev device is independent of each other.
> > > While unregistering parent device, mdev core must remove all child
> > > mdev devices.
> > > Therefore, mdev_device_remove_cb() always returns success so that
> > > device_for_each_child doesn't abort if one child removal hits error.
> > >
> >
> > > When unregistering parent device, force_remove is set to true and
> > mdev_device_remove_ops() always returns success.
> 
> Can we know that?  mdev_device_remove() doesn't guarantee to return
> zero.
> 
> > > While at it, improve remove and unregister functions for below
> simplicity.
> > >
> > > There isn't need to pass forced flag pointer during mdev parent
> > > removal which invokes mdev_device_remove().
> >
> > There is a need to pass the flag, pasting here the comment above
> > mdev_device_remove_ops() which explains why the flag is needed:
> >
> > /*
> >  * mdev_device_remove_ops gets called from sysfs's 'remove' and when
> > parent
> >  * device is being unregistered from mdev device framework.
> >  * - 'force_remove' is set to 'false' when called from sysfs's 'remove'
> > which
> >  *   indicates that if the mdev device is active, used by VMM or userspace
> >  *   application, vendor driver could return error then don't remove the
> > device.
> >  * - 'force_remove' is set to 'true' when called from
> > mdev_unregister_device()
> >  *   which indicate that parent device is being removed from mdev device
> >  *   framework so remove mdev device forcefully.
> >  */
> 
> I don't see that this changes the force behavior, it's simply noting that in
> order to continue the device_for_each_child() iterator, we need to return
> zero, regardless of what mdev_device_remove() returns, and the parent
> remove path is the only caller of mdev_device_remove_cb(), so we can
> assume force = true when calling mdev_device_remove().  Aside from maybe
> a WARN_ON if mdev_device_remove() returns non-zero, that much looks
> reasonable to me.
> 
> >  So simplify the flow.
> > >
> > > mdev_device_remove() is called from two paths.
> > > > 1. mdev_unregister_device()
> > >      mdev_device_remove_cb()
> > >        mdev_device_remove()
> > > 2. remove_store()
> > >      mdev_device_remove()
> > >
> > > > When device is removed by user using remove_store(), device under
> > > removal is mdev device.
> > > When device is removed during parent device removal using generic
> > > child iterator, mdev check is already done using dev_is_mdev().
> > >
> > > Hence, remove the unnecessary loop in mdev_device_remove().
> 
> I don't think knowing the device type is the only reason for this loop though.
> Both paths you mention above can race with each other, so we need to
> serialize them and pick a winner.  The mdev_list_lock allows us to do that.
> Additionally...
> 
> > >
> > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > ---
> > >  drivers/vfio/mdev/mdev_core.c | 24 +++++-------------------
> > >  1 file changed, 5 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > b/drivers/vfio/mdev/mdev_core.c index ab05464..944a058 100644
> > > --- a/drivers/vfio/mdev/mdev_core.c
> > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > @@ -150,10 +150,10 @@ static int mdev_device_remove_ops(struct
> > > mdev_device *mdev, bool force_remove)
> > >
> > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > > -	if (!dev_is_mdev(dev))
> > > -		return 0;
> > > +	if (dev_is_mdev(dev))
> > > +		mdev_device_remove(dev, true);
> > >
> > > -	return mdev_device_remove(dev, data ? *(bool *)data : true);
> > > +	return 0;
> > >  }
> > >
> > >  /*
> > > @@ -241,7 +241,6 @@ int mdev_register_device(struct device *dev,
> > > const struct mdev_parent_ops *ops)  void
> > > mdev_unregister_device(struct device *dev)  {
> > >  	struct mdev_parent *parent;
> > > -	bool force_remove = true;
> > >
> > >  	mutex_lock(&parent_list_lock);
> > >  	parent = __find_parent_device(dev); @@ -255,8 +254,7 @@ void
> > > mdev_unregister_device(struct device *dev)
> > >  	list_del(&parent->next);
> > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > >
> > > -	device_for_each_child(dev, (void *)&force_remove,
> > > -			      mdev_device_remove_cb);
> > > +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > >
> > >  	parent_remove_sysfs_files(parent);
> > >
> > > @@ -346,24 +344,12 @@ int mdev_device_create(struct kobject *kobj,
> > > struct device *dev, uuid_le uuid)
> > >
> > >  int mdev_device_remove(struct device *dev, bool force_remove)  {
> > > -	struct mdev_device *mdev, *tmp;
> > > +	struct mdev_device *mdev;
> > >  	struct mdev_parent *parent;
> > >  	struct mdev_type *type;
> > >  	int ret;
> > >
> > >  	mdev = to_mdev_device(dev);
> > > -
> > > -	mutex_lock(&mdev_list_lock);
> 
> Acquiring the lock is removed, but...
> 
Crap. Missed the lower part.

> > > -	list_for_each_entry(tmp, &mdev_list, next) {
> > > -		if (tmp == mdev)
> > > -			break;
> > > -	}
> > > -
> > > -	if (tmp != mdev) {
> > > -		mutex_unlock(&mdev_list_lock);
> > > -		return -ENODEV;
> > > -	}
> > > -
> > >  	if (!mdev->active) {
> > >  		mutex_unlock(&mdev_list_lock);
> > >  		return -EAGAIN;
> > >
> 
> We still release it in this path and the code below here.  If we don't find the
> device on the list under lock, then we're working with a stale device and
> playing with the 'active' flag of that device outside of any sort of mutual
> exclusion is racy.  Thanks,
Subsequent patch makes the order sane.
I think I should merge this change with patch-8 in the series.
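
The iterator behaviour that motivates this patch can be sketched in userspace C. The helper demo_for_each_child() below is a hypothetical stand-in that only mimics the contract described in the commit message, where the walk stops as soon as the callback returns non-zero; struct demo_child and both callbacks are likewise made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a child device. */
struct demo_child {
	int remove_err;	/* what removing this child would return */
	int removed;	/* set once removal was attempted */
};

/*
 * Mimics the contract discussed in the thread: stop the walk and return
 * the callback's value as soon as that value is non-zero.
 */
static int demo_for_each_child(struct demo_child *c, size_t n,
			       int (*fn)(struct demo_child *))
{
	int ret = 0;
	size_t i;

	for (i = 0; i < n && !ret; i++)
		ret = fn(&c[i]);
	return ret;
}

static int demo_child_remove(struct demo_child *c)
{
	c->removed = 1;
	return c->remove_err;
}

/* Propagating the error aborts the walk at the failing child. */
static int cb_propagate(struct demo_child *c)
{
	return demo_child_remove(c);
}

/* Swallowing the error lets the walk reach every child. */
static int cb_always_zero(struct demo_child *c)
{
	(void)demo_child_remove(c);
	return 0;
}
```

With cb_propagate a failure in the second of three children leaves the third untouched; with cb_always_zero all three are visited. A real callback would probably still want to report the swallowed error, for example with the WARN_ON suggested above.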

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-25 13:24   ` Maxim Levitsky
@ 2019-03-25 21:42     ` Parav Pandit
  0 siblings, 0 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-25 21:42 UTC (permalink / raw)
  To: Maxim Levitsky, kvm, linux-kernel, kwankhede, alex.williamson



> -----Original Message-----
> From: Maxim Levitsky <mlevitsk@redhat.com>
> Sent: Monday, March 25, 2019 8:24 AM
> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; kwankhede@nvidia.com;
> alex.williamson@redhat.com
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> > There are five problems with current code structure.
> > 1. mdev device is placed on the mdev bus before it is created in the
> > vendor driver. Once a device is placed on the mdev bus without
> > creating its supporting underlying vendor device, an open() can get
> > triggered by userspace on partially initialized device.
> > The ladder diagram below highlights it.
> >
> >       cpu-0                                       cpu-1
> >       -----                                       -----
> >    create_store()
> >      mdev_create_device()
> >        device_register()
> >           ...
> >          vfio_mdev_probe()
> >          ...creates char device
> >                                         vfio_mdev_open()
> >                                           parent->ops->open(mdev)
> >                                             vfio_ap_mdev_open()
> >                                               matrix_mdev = NULL
> >         [...]
> >         parent->ops->create()
> >           vfio_ap_mdev_create()
> >             mdev_set_drvdata(mdev, matrix_mdev);
> >             /* Valid pointer set above */
> 
> Agree.
> You probably mean mdev_device_create here.
> 
> >
> > 2. Current creation sequence is,
> >    parent->ops->create()
> >    groups_register()
> >
> > Remove sequence is,
> >    parent->ops->remove()
> >    groups_unregister()
> > However, remove sequence should be exact mirror of creation sequence.
> > Once this is achieved, all users of the mdev will be terminated first
> > before removing underlying vendor device.
> > (Follow standard linux driver model).
> > At that point vendor's remove() ops shouldn't fail because device is
> > taken off the bus that should terminate the users.
> Agree here too.
> 
> 
> 
> >
> > 3. Additionally any new mdev driver that wants to work on mdev device
> > during probe() routine registered using mdev_register_driver() needs
> > to get stable mdev structure.
> >
> > 4. In following sequence, child devices created while removing mdev
> > parent device can be left out, or it may lead to race of removing half
> > initialized child mdev devices.
> >
> > issue-1:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> >                                   mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >                                   parent_remove_sysfs_files()
> >                                   /* BUG: device added by cpu-0
> >                                    * whose parent is getting removed.
> >                                    */
> >
> > issue-2:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >
> >        [...]                      mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> >
> >        mdev_create_sysfs_files()
> >        /* BUG: create is adding
> >         * sysfs files for a device
> >         * which is undergoing removal.
> >         */
> >                                  parent_remove_sysfs_files()
> Looks like an issue to me too.
> 
> >
> > 5. Below crash is observed when user initiated remove is in progress
> > and mdev_unregister_device() completes parent unregistration.
> >
> >        cpu-0                         cpu-1
> >        -----                         -----
> > remove_store()
> >    mdev_device_remove()
> >    active = false;
> >                                   mdev_unregister_device()
> >                                     remove type
> >    [...]
> >    mdev_remove_ops() crashes.
> >
> > This race is similar to create() racing with mdev_unregister_device().
> >
> > mtty mtty: MDEV: Registered
> > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > mdev_device_remove sleep started
> > mtty mtty: MDEV: Unregistering
> > mtty_dev: Unloaded!
> > BUG: unable to handle kernel paging request at ffffffffc027d668
> > PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> > Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> > Call Trace:
> >  mdev_device_remove+0xef/0x130 [mdev]
> >  remove_store+0x77/0xa0 [mdev]
> >  kernfs_fop_write+0x113/0x1a0
> >  __vfs_write+0x33/0x1b0
> >  ? rcu_read_lock_sched_held+0x64/0x70
> >  ? rcu_sync_lockdep_assert+0x2a/0x50
> >  ? __sb_start_write+0x121/0x1b0
> >  ? vfs_write+0x17c/0x1b0
> >  vfs_write+0xad/0x1b0
> >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >  ksys_write+0x55/0xc0
> >  do_syscall_64+0x5a/0x210
> >
> > Therefore, mdev core is improved in following ways to overcome above
> > issues.
> >
> > 1. Before placing mdev devices on the bus, perform vendor drivers
> > creation which supports the mdev creation.
> > This ensures that mdev specific all necessary fields are initialized
> > before a given mdev can be accessed by bus driver.
> >
> > 2. During remove flow, first remove the device from the bus. This
> > ensures that any bus specific devices and data is cleared.
> > Once device is taken off the mdev bus, perform remove() of mdev from
> > the vendor driver.
> >
> >
> > 3. Linux core device model provides way to register and auto
> > unregister the device sysfs attribute groups at dev->groups.
> > Use this to avoid explicit groups creation and removal.
> >
> > 4. Wait for any ongoing mdev create() and remove() to finish before
> > unregistering parent device using srcu. This continues to allow
> > multiple create and remove to progress in parallel. At the same time
> > guard parent removal while parent is being accessed by create() and
> > remove() callbacks.
> All these fixes seem reasonable and correct to me
> 
> 
> >
> > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++--------------
> ---
> > -
> >  drivers/vfio/mdev/mdev_private.h |   7 +-
> >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> >  3 files changed, 84 insertions(+), 71 deletions(-)
> >
> > diff --git a/drivers/vfio/mdev/mdev_core.c
> > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> >  						  ref);
> >  	struct device *dev = parent->dev;
> >
> > +	cleanup_srcu_struct(&parent->unreg_srcu);
> >  	kfree(parent);
> >  	put_device(dev);
> >  }
> > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct mdev_parent *parent)
> >  		kref_put(&parent->ref, mdev_release_parent);
> >  }
> >
> > -static int mdev_device_create_ops(struct kobject *kobj,
> > -				  struct mdev_device *mdev)
> > +static int mdev_device_must_remove(struct mdev_device *mdev)
> 
> Tiny nitpick: maybe a better name? Or a comment for this function stating
> that it tries to remove the device even if it is in use.
> 
> >  {
> > -	struct mdev_parent *parent = mdev->parent;
> > +	struct mdev_parent *parent;
> > +	struct mdev_type *type;
> >  	int ret;
> >
> > -	ret = parent->ops->create(kobj, mdev);
> > -	if (ret)
> > -		return ret;
> > +	type = to_mdev_type(mdev->type_kobj);
> >
> > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > -				  parent->ops->mdev_attr_groups);
> > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > +	device_del(&mdev->dev);
> > +	parent = mdev->parent;
> > +	ret = parent->ops->remove(mdev);
> >  	if (ret)
> > -		parent->ops->remove(mdev);
> > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
> >
> > +	/* Balances with device_initialize() */
> > +	put_device(&mdev->dev);
> >  	return ret;
> >  }
> >
> > -/*
> > - * mdev_device_remove_ops gets called from sysfs's 'remove' and when
> > parent
> > - * device is being unregistered from mdev device framework.
> > - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> > - *   indicates that if the mdev device is active, used by VMM or userspace
> > - *   application, vendor driver could return error then don't remove the
> > device.
> > - * - 'force_remove' is set to 'true' when called from
> > mdev_unregister_device()
> > - *   which indicate that parent device is being removed from mdev device
> > - *   framework so remove mdev device forcefully.
> > - */
> > -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> > force_remove)
> > -{
> > -	struct mdev_parent *parent = mdev->parent;
> > -	int ret;
> > -
> > -	/*
> > -	 * Vendor driver can return error if VMM or userspace application is
> > -	 * using this mdev device.
> > -	 */
> > -	ret = parent->ops->remove(mdev);
> > -	if (ret && !force_remove)
> > -		return ret;
> > -
> > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> > -	return 0;
> > -}
> > -
> >  static int mdev_device_remove_cb(struct device *dev, void *data)
> >  {
> >  	if (dev_is_mdev(dev))
> > -		mdev_device_remove(dev, true);
> > -
> > +		mdev_device_must_remove(to_mdev_device(dev));
> >  	return 0;
> >  }
> >
> > @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const
> > struct mdev_parent_ops *ops)
> >  	}
> >
> >  	kref_init(&parent->ref);
> > +	init_srcu_struct(&parent->unreg_srcu);
> >
> >  	parent->dev = dev;
> >  	parent->ops = ops;
> > @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const
> > struct mdev_parent_ops *ops)
> >  	if (ret)
> >  		dev_warn(dev, "Failed to create compatibility class link\n");
> >
> > +	rcu_assign_pointer(parent->self, parent);
> >  	list_add(&parent->next, &parent_list);
> >  	mutex_unlock(&parent_list_lock);
> >
> > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
> >
> >  	mutex_lock(&parent_list_lock);
> >  	parent = __find_parent_device(dev);
> > -
> >  	if (!parent) {
> >  		mutex_unlock(&parent_list_lock);
> >  		return;
> >  	}
> > +	list_del(&parent->next);
> > +	mutex_unlock(&parent_list_lock);
> > +
> >  	dev_info(dev, "MDEV: Unregistering\n");
> >
> > -	list_del(&parent->next);
> > +	/* Publish that this mdev parent is unregistering. So any new
> > +	 * create/remove cannot start on this parent anymore by user.
> > +	 */
> > +	rcu_assign_pointer(parent->self, NULL);
> > +
> > +	/*
> > +	 * Wait for any active create() or remove() mdev ops on the parent
> > +	 * to complete.
> > +	 */
> > +	synchronize_srcu(&parent->unreg_srcu);
> > +
> > +	/* At this point it is confirmed that any pending user initiated
> > +	 * create or remove callbacks accessing the parent are completed.
> > +	 * It is safe to remove the parent now.
> > +	 */
> >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> >
> >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> >
> >  	parent_remove_sysfs_files(parent);
> >
> > -	mutex_unlock(&parent_list_lock);
> >  	mdev_put_parent(parent);
> >  }
> >  EXPORT_SYMBOL(mdev_unregister_device);
> > @@ -278,14 +270,24 @@ static void mdev_device_release(struct device *dev)
> >  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> >  {
> >  	int ret;
> > +	struct mdev_parent *valid_parent;
> >  	struct mdev_device *mdev, *tmp;
> >  	struct mdev_parent *parent;
> >  	struct mdev_type *type = to_mdev_type(kobj);
> > +	int srcu_idx;
> >
> >  	parent = mdev_get_parent(type->parent);
> >  	if (!parent)
> >  		return -EINVAL;
> >
> > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > +	if (!valid_parent) {
> > +		/* parent is undergoing unregistration */
> > +		ret = -ENODEV;
> > +		goto mdev_fail;
> > +	}
> > +
> >  	mutex_lock(&mdev_list_lock);
> >
> >  	/* Check for duplicate */
> > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj,
> > struct device *dev, uuid_le uuid)
> >
> >  	mdev->parent = parent;
> >
> > +	device_initialize(&mdev->dev);
> >  	mdev->dev.parent  = dev;
> >  	mdev->dev.bus     = &mdev_bus_type;
> >  	mdev->dev.release = mdev_device_release;
> > +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
> >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> >
> > -	ret = device_register(&mdev->dev);
> > +	ret = type->parent->ops->create(kobj, mdev);
> >  	if (ret)
> > -		goto mdev_fail;
> > +		goto create_fail;
> >
> > -	ret = mdev_device_create_ops(kobj, mdev);
> > +	ret = device_add(&mdev->dev);
> >  	if (ret)
> > -		goto create_fail;
> > +		goto dev_fail;
> >
> >  	ret = mdev_create_sysfs_files(&mdev->dev, type);
> > -	if (ret) {
> > -		mdev_device_remove_ops(mdev, true);
> > -		goto create_fail;
> > -	}
> > +	if (ret)
> > +		goto sysfs_fail;
> >
> >  	mdev->type_kobj = kobj;
> >  	mdev->active = true;
> >  	dev_dbg(&mdev->dev, "MDEV: created\n");
> > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> >
> >  	return 0;
> >
> > +sysfs_fail:
> > +	device_del(&mdev->dev);
> > +dev_fail:
> > +	type->parent->ops->remove(mdev);
> >  create_fail:
> > -	device_unregister(&mdev->dev);
> > +	put_device(&mdev->dev);
> >  mdev_fail:
> > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> >  	mdev_put_parent(parent);
> >  	return ret;
> >  }
> >
> > -int mdev_device_remove(struct device *dev, bool force_remove)
> > +int mdev_device_remove(struct device *dev)
> >  {
> > +	struct mdev_parent *valid_parent;
> >  	struct mdev_device *mdev;
> >  	struct mdev_parent *parent;
> > -	struct mdev_type *type;
> > +	int srcu_idx;
> >  	int ret;
> >
> >  	mdev = to_mdev_device(dev);
> > +	parent = mdev->parent;
> > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > +	if (!valid_parent) {
> > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > +		/* parent is undergoing unregistration */
> > +		return -ENODEV;
> > +	}
> > +
> > +	mutex_lock(&mdev_list_lock);
> >  	if (!mdev->active) {
> >  		mutex_unlock(&mdev_list_lock);
> > -		return -EAGAIN;
> > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > +		return -ENODEV;
> >  	}
> > -
> >  	mdev->active = false;
> >  	mutex_unlock(&mdev_list_lock);
> >
> > -	type = to_mdev_type(mdev->type_kobj);
> > -	parent = mdev->parent;
> > -
> > -	ret = mdev_device_remove_ops(mdev, force_remove);
> > -	if (ret) {
> > -		mdev->active = true;
> > -		return ret;
> > -	}
> > +	ret = mdev_device_must_remove(mdev);
> > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> >
> > -	mdev_remove_sysfs_files(dev, type);
> > -	device_unregister(dev);
> >  	mdev_put_parent(parent);
> > -
> > -	return 0;
> > +	return ret;
> >  }
> >
> >  static int __init mdev_init(void)
> > diff --git a/drivers/vfio/mdev/mdev_private.h
> > b/drivers/vfio/mdev/mdev_private.h
> > index 84b2b6c..3d17db9 100644
> > --- a/drivers/vfio/mdev/mdev_private.h
> > +++ b/drivers/vfio/mdev/mdev_private.h
> > @@ -23,6 +23,11 @@ struct mdev_parent {
> >  	struct list_head next;
> >  	struct kset *mdev_types_kset;
> >  	struct list_head type_list;
> > +	/* Protects unregistration to wait until create/remove
> > +	 * are completed.
> > +	 */
> > +	struct srcu_struct unreg_srcu;
> > +	struct mdev_parent __rcu *self;
> >  };
> >
> >  struct mdev_device {
> > @@ -58,6 +63,6 @@ struct mdev_type {
> >  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
> >
> >  int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
> > -int  mdev_device_remove(struct device *dev, bool force_remove);
> > +int  mdev_device_remove(struct device *dev);
> >
> >  #endif /* MDEV_PRIVATE_H */
> > diff --git a/drivers/vfio/mdev/mdev_sysfs.c
> > b/drivers/vfio/mdev/mdev_sysfs.c index c782fa9..68a8191 100644
> > --- a/drivers/vfio/mdev/mdev_sysfs.c
> > +++ b/drivers/vfio/mdev/mdev_sysfs.c
> > @@ -236,11 +236,9 @@ static ssize_t remove_store(struct device *dev,
> > struct device_attribute *attr,
> >  	if (val && device_remove_file_self(dev, attr)) {
> >  		int ret;
> >
> > -		ret = mdev_device_remove(dev, false);
> > -		if (ret) {
> > -			device_create_file(dev, attr);
> > +		ret = mdev_device_remove(dev);
> > +		if (ret)
> >  			return ret;
> > -		}
> >  	}
> >
> >  	return count;
> 
> The patch looks OK to me, especially looking at the code after the changes
> were applied. I might have missed something though, due to the amount of
> changes done.
> 
> I lightly tested the whole patch series with my mdev driver, and it seems to
> survive, but my testing doesn't exercise much of the error paths, so there's that.
> 
> I'll keep this applied so if I notice any errors I'll let you know.
> 
> If you could split this into few patches, this would be even better, but
> anyway thanks a lot for this work!
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
Thanks a lot Maxim for the review.
This particular patch cannot be split into more patches because it touches the serialization between multiple functions.
So all of those have to be changed together.
I tried to split it into two, where the previous patch does the mdev cleanup, but I guess for correctness I have to merge that part here.
Only the commit message is probably big, but I had to explain all 4-5 cases for this refactor.

I will send v1 with below changes.
1. drop patch-1 in the series.
2. simplify patch 7 to drop the bool part...
3. move the loop_removal code in mdev_remove() from patch-7 to 8 for correctness.


* Re: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-03-25 21:36       ` Parav Pandit
@ 2019-03-25 21:52         ` Alex Williamson
  2019-03-25 22:07           ` Parav Pandit
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2019-03-25 21:52 UTC (permalink / raw)
  To: Parav Pandit; +Cc: Kirti Wankhede, kvm, linux-kernel

On Mon, 25 Mar 2019 21:36:42 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, March 25, 2019 3:50 PM
> > To: Kirti Wankhede <kwankhede@nvidia.com>
> > Cc: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org
> > Subject: Re: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if
> > one fails
> > 
> > On Tue, 26 Mar 2019 01:05:34 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > On 3/23/2019 4:50 AM, Parav Pandit wrote:  
> > > > device_for_each_child() stops executing callback function for
> > > > remaining child devices, if callback hits an error.
> > > > Each child mdev device is independent of each other.
> > > > While unregistering parent device, mdev core must remove all child
> > > > mdev devices.
> > > > Therefore, mdev_device_remove_cb() always returns success so that
> > > > device_for_each_child doesn't abort if one child removal hits error.
> > > >  
> > >
> > > > When unregistering parent device, force_remove is set to true and
> > > mdev_device_remove_ops() always returns success.  
> > 
> > Can we know that?  mdev_device_remove() doesn't guarantee to return
> > zero.
> >   
> > > > While at it, improve remove and unregister functions for below  
> > simplicity.  
> > > >
> > > > There isn't need to pass forced flag pointer during mdev parent
> > > > removal which invokes mdev_device_remove().  
> > >
> > > There is a need to pass the flag, pasting here the comment above
> > > mdev_device_remove_ops() which explains why the flag is needed:
> > >
> > > /*
> > >  * mdev_device_remove_ops gets called from sysfs's 'remove' and when
> > > parent
> > >  * device is being unregistered from mdev device framework.
> > >  * - 'force_remove' is set to 'false' when called from sysfs's 'remove'
> > > which
> > >  *   indicates that if the mdev device is active, used by VMM or userspace
> > >  *   application, vendor driver could return error then don't remove the
> > > device.
> > >  * - 'force_remove' is set to 'true' when called from
> > > mdev_unregister_device()
> > >  *   which indicate that parent device is being removed from mdev device
> > >  *   framework so remove mdev device forcefully.
> > >  */  
> > 
> > I don't see that this changes the force behavior, it's simply noting that in
> > order to continue the device_for_each_child() iterator, we need to return
> > zero, regardless of what mdev_device_remove() returns, and the parent
> > remove path is the only caller of mdev_device_remove_cb(), so we can
> > assume force = true when calling mdev_device_remove().  Aside from maybe
> > a WARN_ON if mdev_device_remove() returns non-zero, that much looks
> > reasonable to me.
> >   
> > >  So simplify the flow.  
> > > >
> > > > mdev_device_remove() is called from two paths.
> > > > > 1. mdev_unregister_device()
> > > >      mdev_device_remove_cb()
> > > >        mdev_device_remove()
> > > > 2. remove_store()
> > > >      mdev_device_remove()
> > > >
> > > > > When device is removed by user using remove_store(), device under
> > > > removal is mdev device.
> > > > When device is removed during parent device removal using generic
> > > > child iterator, mdev check is already done using dev_is_mdev().
> > > >
> > > > Hence, remove the unnecessary loop in mdev_device_remove().  
> > 
> > I don't think knowing the device type is the only reason for this loop though.
> > Both paths you mention above can race with each other, so we need to
> > serialize them and pick a winner.  The mdev_list_lock allows us to do that.
> > Additionally...
> >   
> > > >
> > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > ---
> > > >  drivers/vfio/mdev/mdev_core.c | 24 +++++-------------------
> > > >  1 file changed, 5 insertions(+), 19 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > b/drivers/vfio/mdev/mdev_core.c index ab05464..944a058 100644
> > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > @@ -150,10 +150,10 @@ static int mdev_device_remove_ops(struct
> > > > mdev_device *mdev, bool force_remove)
> > > >
> > > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > > > -	if (!dev_is_mdev(dev))
> > > > -		return 0;
> > > > +	if (dev_is_mdev(dev))
> > > > +		mdev_device_remove(dev, true);
> > > >
> > > > -	return mdev_device_remove(dev, data ? *(bool *)data : true);
> > > > +	return 0;
> > > >  }
> > > >
> > > >  /*
> > > > @@ -241,7 +241,6 @@ int mdev_register_device(struct device *dev,
> > > > const struct mdev_parent_ops *ops)  void
> > > > mdev_unregister_device(struct device *dev)  {
> > > >  	struct mdev_parent *parent;
> > > > -	bool force_remove = true;
> > > >
> > > >  	mutex_lock(&parent_list_lock);
> > > >  	parent = __find_parent_device(dev); @@ -255,8 +254,7 @@ void
> > > > mdev_unregister_device(struct device *dev)
> > > >  	list_del(&parent->next);
> > > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > > >
> > > > -	device_for_each_child(dev, (void *)&force_remove,
> > > > -			      mdev_device_remove_cb);
> > > > +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > > >
> > > >  	parent_remove_sysfs_files(parent);
> > > >
> > > > @@ -346,24 +344,12 @@ int mdev_device_create(struct kobject *kobj,
> > > > struct device *dev, uuid_le uuid)
> > > >
> > > >  int mdev_device_remove(struct device *dev, bool force_remove)  {
> > > > -	struct mdev_device *mdev, *tmp;
> > > > +	struct mdev_device *mdev;
> > > >  	struct mdev_parent *parent;
> > > >  	struct mdev_type *type;
> > > >  	int ret;
> > > >
> > > >  	mdev = to_mdev_device(dev);
> > > > -
> > > > -	mutex_lock(&mdev_list_lock);  
> > 
> > Acquiring the lock is removed, but...
> >   
> Crap. Missed the lower part.
> 
> > > > -	list_for_each_entry(tmp, &mdev_list, next) {
> > > > -		if (tmp == mdev)
> > > > -			break;
> > > > -	}
> > > > -
> > > > -	if (tmp != mdev) {
> > > > -		mutex_unlock(&mdev_list_lock);
> > > > -		return -ENODEV;
> > > > -	}
> > > > -
> > > >  	if (!mdev->active) {
> > > >  		mutex_unlock(&mdev_list_lock);
> > > >  		return -EAGAIN;
> > > >  
> > 
> > We still release it in this path and the code below here.  If we don't find the
> > device on the list under lock, then we're working with a stale device and
> > playing with the 'active' flag of that device outside of any sort of mutual
> > exclusion is racy.  Thanks,  
> Subsequent patch makes the order sane.
> I think I should merge this change with patch-8 in the series.

Please don't incorporate more fixes into patch 8, it has too many
already.  I'd really prefer to see patch 8 split into issues you've
identified as much as possible.  Thanks,

Alex
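
The iterator contract discussed in the quoted text above can be sketched in userspace C. This is only an illustration, not the kernel code: `for_each_child()` is a hypothetical stand-in that, like `device_for_each_child()`, stops at the first non-zero callback return, and `struct child`, `remove_child()` and both callbacks are invented names. It contrasts a callback that propagates the error with one that warns and returns 0 so every child is still visited.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a child device; not a kernel type. */
struct child {
	int is_mdev;     /* models the dev_is_mdev() check */
	int remove_err;  /* error that removal would return, 0 on success */
	int removed;     /* set once removal was attempted */
};

/* Models the iterator contract: stop at the first non-zero return. */
static int for_each_child(struct child *c, size_t n, void *data,
			  int (*fn)(struct child *, void *))
{
	int ret = 0;
	size_t i;

	assert(fn != NULL);
	for (i = 0; i < n && !ret; i++)
		ret = fn(&c[i], data);
	return ret;
}

static int remove_child(struct child *c)
{
	c->removed = 1;
	return c->remove_err;
}

/* Propagates the error, so the iteration aborts on the first failure. */
static int remove_cb_propagate(struct child *c, void *data)
{
	(void)data;
	return c->is_mdev ? remove_child(c) : 0;
}

/* Warns but always returns 0, so the remaining children are visited. */
static int remove_cb_continue(struct child *c, void *data)
{
	int ret;

	(void)data;
	if (!c->is_mdev)
		return 0;
	ret = remove_child(c);
	if (ret)
		fprintf(stderr, "child removal failed: %d\n", ret);
	return 0;
}
```

With three children where the second one fails, the propagating callback never reaches the third child, while the continuing callback attempts removal on all of them.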


* RE: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails
  2019-03-25 21:52         ` Alex Williamson
@ 2019-03-25 22:07           ` Parav Pandit
  0 siblings, 0 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-25 22:07 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Kirti Wankhede, kvm, linux-kernel

Hi Alex,

> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 4:52 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Kirti Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if
> one fails
> 
> On Mon, 25 Mar 2019 21:36:42 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Monday, March 25, 2019 3:50 PM
> > > To: Kirti Wankhede <kwankhede@nvidia.com>
> > > Cc: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> > > kernel@vger.kernel.org
> > > Subject: Re: [PATCH 7/8] vfio/mdev: Fix aborting mdev child device
> > > removal if one fails
> > >
> > > On Tue, 26 Mar 2019 01:05:34 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >
> > > > On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > > > > device_for_each_child() stops executing callback function for
> > > > > remaining child devices, if callback hits an error.
> > > > > Each child mdev device is independent of each other.
> > > > > While unregistering parent device, mdev core must remove all
> > > > > child mdev devices.
> > > > > Therefore, mdev_device_remove_cb() always returns success so
> > > > > that device_for_each_child doesn't abort if one child removal hits
> error.
> > > > >
> > > >
> > > > > When unregistering parent device, force_remove is set to true and
> > > > mdev_device_remove_ops() always returns success.
> > >
> > > Can we know that?  mdev_device_remove() doesn't guarantee to return
> > > zero.
> > >
> > > > > While at it, improve remove and unregister functions for below
> > > simplicity.
> > > > >
> > > > > There isn't need to pass forced flag pointer during mdev parent
> > > > > removal which invokes mdev_device_remove().
> > > >
> > > > There is a need to pass the flag, pasting here the comment above
> > > > mdev_device_remove_ops() which explains why the flag is needed:
> > > >
> > > > /*
> > > >  * mdev_device_remove_ops gets called from sysfs's 'remove' and
> > > > when parent
> > > >  * device is being unregistered from mdev device framework.
> > > >  * - 'force_remove' is set to 'false' when called from sysfs's 'remove'
> > > > which
> > > >  *   indicates that if the mdev device is active, used by VMM or
> userspace
> > > >  *   application, vendor driver could return error then don't remove the
> > > > device.
> > > >  * - 'force_remove' is set to 'true' when called from
> > > > mdev_unregister_device()
> > > >  *   which indicate that parent device is being removed from mdev
> device
> > > >  *   framework so remove mdev device forcefully.
> > > >  */
> > >
> > > I don't see that this changes the force behavior, it's simply noting
> > > that in order to continue the device_for_each_child() iterator, we
> > > need to return zero, regardless of what mdev_device_remove()
> > > returns, and the parent remove path is the only caller of
> > > mdev_device_remove_cb(), so we can assume force = true when calling
> > > mdev_device_remove().  Aside from maybe a WARN_ON if
> > > mdev_device_remove() returns non-zero, that much looks reasonable to
> me.
> > >
> > > >  So simplify the flow.
> > > > >
> > > > > mdev_device_remove() is called from two paths.
> > > > > > 1. mdev_unregister_device()
> > > > >      mdev_device_remove_cb()
> > > > >        mdev_device_remove()
> > > > > 2. remove_store()
> > > > >      mdev_device_remove()
> > > > >
> > > > > > When device is removed by user using remove_store(), device
> > > > > under removal is mdev device.
> > > > > When device is removed during parent device removal using
> > > > > generic child iterator, mdev check is already done using
> dev_is_mdev().
> > > > >
> > > > > Hence, remove the unnecessary loop in mdev_device_remove().
> > >
> > > I don't think knowing the device type is the only reason for this loop
> though.
> > > Both paths you mention above can race with each other, so we need to
> > > serialize them and pick a winner.  The mdev_list_lock allows us to do
> that.
> > > Additionally...
> > >
> > > > >
> > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > ---
> > > > >  drivers/vfio/mdev/mdev_core.c | 24 +++++-------------------
> > > > >  1 file changed, 5 insertions(+), 19 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > b/drivers/vfio/mdev/mdev_core.c index ab05464..944a058 100644
> > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > @@ -150,10 +150,10 @@ static int mdev_device_remove_ops(struct
> > > > > mdev_device *mdev, bool force_remove)
> > > > >
> > > > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > > > > -	if (!dev_is_mdev(dev))
> > > > > -		return 0;
> > > > > +	if (dev_is_mdev(dev))
> > > > > +		mdev_device_remove(dev, true);
> > > > >
> > > > > -	return mdev_device_remove(dev, data ? *(bool *)data :
> true);
> > > > > +	return 0;
> > > > >  }
> > > > >
> > > > >  /*
> > > > > @@ -241,7 +241,6 @@ int mdev_register_device(struct device *dev,
> > > > > const struct mdev_parent_ops *ops)  void
> > > > > mdev_unregister_device(struct device *dev)  {
> > > > >  	struct mdev_parent *parent;
> > > > > -	bool force_remove = true;
> > > > >
> > > > >  	mutex_lock(&parent_list_lock);
> > > > >  	parent = __find_parent_device(dev); @@ -255,8 +254,7 @@
> void
> > > > > mdev_unregister_device(struct device *dev)
> > > > >  	list_del(&parent->next);
> > > > >  	class_compat_remove_link(mdev_bus_compat_class, dev,
> NULL);
> > > > >
> > > > > -	device_for_each_child(dev, (void *)&force_remove,
> > > > > -			      mdev_device_remove_cb);
> > > > > +	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > > > >
> > > > >  	parent_remove_sysfs_files(parent);
> > > > >
> > > > > @@ -346,24 +344,12 @@ int mdev_device_create(struct kobject
> > > > > *kobj, struct device *dev, uuid_le uuid)
> > > > >
> > > > >  int mdev_device_remove(struct device *dev, bool force_remove)  {
> > > > > -	struct mdev_device *mdev, *tmp;
> > > > > +	struct mdev_device *mdev;
> > > > >  	struct mdev_parent *parent;
> > > > >  	struct mdev_type *type;
> > > > >  	int ret;
> > > > >
> > > > >  	mdev = to_mdev_device(dev);
> > > > > -
> > > > > -	mutex_lock(&mdev_list_lock);
> > >
> > > Acquiring the lock is removed, but...
> > >
> > Crap. Missed the lower part.
> >
> > > > > -	list_for_each_entry(tmp, &mdev_list, next) {
> > > > > -		if (tmp == mdev)
> > > > > -			break;
> > > > > -	}
> > > > > -
> > > > > -	if (tmp != mdev) {
> > > > > -		mutex_unlock(&mdev_list_lock);
> > > > > -		return -ENODEV;
> > > > > -	}
> > > > > -
> > > > >  	if (!mdev->active) {
> > > > >  		mutex_unlock(&mdev_list_lock);
> > > > >  		return -EAGAIN;
> > > > >
> > >
> > > We still release it in this path and the code below here.  If we
> > > don't find the device on the list under lock, then we're working
> > > with a stale device and playing with the 'active' flag of that
> > > device outside of any sort of mutual exclusion is racy.  Thanks,
> > Subsequent patch makes the order sane.
> > I think I should merge this change with patch-8 in the series.
> 
> Please don't incorporate more fixes into patch 8, it has too many already.  I'd
> really prefer to see patch 8 split into issues you've identified as much as
> possible.  Thanks,
> 
I tried to split it into two patches:
one for user-initiated race conditions, and a second for driver-side race conditions.
But it's generating more code churn because the synchronization is interrelated, so I dropped it.

This patch is just fine; the only thing I messed up is the accidental mutex lock removal.
Below is the fixup patch for patch-7 that I want to roll into v2.
The rest stays the same in patch-7 and patch-8.

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 5bd8d22..e09b94f 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -349,6 +349,7 @@ int mdev_device_remove(struct device *dev, bool force_remove)
        struct mdev_type *type;
        int ret;

+       mutex_lock(&mdev_list_lock);
        mdev = to_mdev_device(dev);
        if (!mdev->active) {
                mutex_unlock(&mdev_list_lock);
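
As a rough userspace model of why the lock has to be taken before `active` is examined, the sketch below tests and clears the flag under a mutex so that only one of two racing removers proceeds. It assumes POSIX threads; `struct fake_mdev`, `list_lock` and `claim_for_removal()` are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; names are not from the kernel. */
struct fake_mdev {
	bool active;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Test and clear 'active' under the lock. Exactly one caller observes
 * active == true and wins; any later caller gets -EAGAIN.
 */
static int claim_for_removal(struct fake_mdev *m)
{
	int ret = 0;

	assert(m != NULL);
	pthread_mutex_lock(&list_lock);
	if (!m->active)
		ret = -EAGAIN;
	else
		m->active = false;
	pthread_mutex_unlock(&list_lock);
	return ret;
}
```

If the flag were read before the lock was taken, two callers could both see `active == true` and both continue into the removal path.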


> Alex


* Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-22 23:20 ` [PATCH 8/8] vfio/mdev: Improve the create/remove sequence Parav Pandit
  2019-03-25 13:24   ` Maxim Levitsky
@ 2019-03-25 23:18   ` Alex Williamson
  2019-03-25 23:34     ` Parav Pandit
  2019-03-26  7:06   ` Kirti Wankhede
  2 siblings, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2019-03-25 23:18 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede

On Fri, 22 Mar 2019 18:20:35 -0500
Parav Pandit <parav@mellanox.com> wrote:

> There are five problems with current code structure.
> 1. mdev device is placed on the mdev bus before it is created in the
> vendor driver. Once a device is placed on the mdev bus without creating
> its supporting underlying vendor device, an open() can get triggered by
> userspace on partially initialized device.
> Below ladder diagram highlight it.
> 
>       cpu-0                                       cpu-1
>       -----                                       -----
>    create_store()
>      mdev_device_create()
>        device_register()
>           ...
>          vfio_mdev_probe()
>          ...creates char device
>                                         vfio_mdev_open()
>                                           parent->ops->open(mdev)
>                                             vfio_ap_mdev_open()
>                                               matrix_mdev = NULL
>         [...]
>         parent->ops->create()
>           vfio_ap_mdev_create()
>             mdev_set_drvdata(mdev, matrix_mdev);
>             /* Valid pointer set above */
> 
> 2. Current creation sequence is,
>    parent->ops->create()
>    groups_register()
> 
> Remove sequence is,
>    parent->ops->remove()
>    groups_unregister()
> However, remove sequence should be exact mirror of creation sequence.
> Once this is achieved, all users of the mdev will be terminated first
> before removing underlying vendor device.
> (Follow standard linux driver model).
> At that point vendor's remove() ops shouldn't fail because the device is
> taken off the bus, which should terminate the users.
> 
> 3. Additionally any new mdev driver that wants to work on mdev device
> during probe() routine registered using mdev_register_driver() needs to
> get stable mdev structure.
> 
> 4. In following sequence, child devices created while removing mdev parent
> device can be left out, or it may lead to race of removing half
> initialized child mdev devices.
> 
> issue-1:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
>                                   mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
>                                   parent_remove_sysfs_files()
>                                   /* BUG: device added by cpu-0
>                                    * whose parent is getting removed.
>                                    */
> 
> issue-2:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
> 
>        [...]                      mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> 
>        mdev_create_sysfs_files()
>        /* BUG: create is adding
>         * sysfs files for a device
>         * which is undergoing removal.
>         */
>                                  parent_remove_sysfs_files()

In both cases above, it looks like the device will hold a reference to
the parent, so while there is a race, the parent object isn't released.
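
The point about the reference can be illustrated with a minimal, single-threaded sketch of kref-style counting. It is an assumption-laden model rather than the mdev code: `struct fake_parent`, `parent_new()`, `parent_get()` and `parent_put()` are hypothetical, and the `released` flag exists only so the release can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, single-threaded model of a kref-style counted parent. */
struct fake_parent {
	int refs;
	int *released;  /* set to 1 when the last reference is dropped */
};

static struct fake_parent *parent_new(int *released)
{
	struct fake_parent *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->refs = 1;  /* reference held by the registration itself */
	p->released = released;
	*released = 0;
	return p;
}

static struct fake_parent *parent_get(struct fake_parent *p)
{
	p->refs++;
	return p;
}

static void parent_put(struct fake_parent *p)
{
	assert(p->refs > 0);
	if (--p->refs == 0) {
		*p->released = 1;
		free(p);
	}
}
```

While a child still holds its reference, dropping the registration reference does not free the object; the release only happens on the final put.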

> 
> 5. Below crash is observed when user initiated remove is in progress
> and mdev_unregister_device() completes parent unregistration.
> 
>        cpu-0                         cpu-1
>        -----                         -----
> remove_store()
>    mdev_device_remove()
>    active = false;
>                                   mdev_unregister_device()
>                                     remove type
>    [...]
>    mdev_device_remove_ops() crashes.
> 
> This is similar race like create() racing with mdev_unregister_device().

Not sure I catch this, the device should have a reference to the
parent, and we don't specifically clear parent->ops, so what's getting
removed that causes this oops?  Is .remove pointing at bad text
regardless?

> mtty mtty: MDEV: Registered
> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> mdev_device_remove sleep started
> mtty mtty: MDEV: Unregistering
> mtty_dev: Unloaded!
> BUG: unable to handle kernel paging request at ffffffffc027d668
> PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> Oops: 0000 [#1] SMP PTI
> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> Call Trace:
>  mdev_device_remove+0xef/0x130 [mdev]
>  remove_store+0x77/0xa0 [mdev]
>  kernfs_fop_write+0x113/0x1a0
>  __vfs_write+0x33/0x1b0
>  ? rcu_read_lock_sched_held+0x64/0x70
>  ? rcu_sync_lockdep_assert+0x2a/0x50
>  ? __sb_start_write+0x121/0x1b0
>  ? vfs_write+0x17c/0x1b0
>  vfs_write+0xad/0x1b0
>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>  ksys_write+0x55/0xc0
>  do_syscall_64+0x5a/0x210
> 
> Therefore, mdev core is improved in following ways to overcome above
> issues.
> 
> 1. Before placing mdev devices on the bus, perform vendor drivers
> creation which supports the mdev creation.
> This ensures that mdev specific all necessary fields are initialized
> before a given mdev can be accessed by bus driver.
> 
> 2. During remove flow, first remove the device from the bus. This
> ensures that any bus specific devices and data are cleared.
> Once the device is taken off the mdev bus, perform remove() of mdev from the
> vendor driver.
> 
> 3. Linux core device model provides way to register and auto unregister
> the device sysfs attribute groups at dev->groups.
> Make use of this groups to let core create the groups and simplify code
> to avoid explicit groups creation and removal.
> 
> 4. Wait for any ongoing mdev create() and remove() to finish before
> unregistering parent device using srcu. This continues to allow multiple
> create and remove to progress in parallel. At the same time guard parent
> removal while parent is being accessed by create() and remove() callbacks.

So there should be 4-5 separate patches here?  Wishful thinking?
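
The semantics described in item 4 of the quoted text (new create/remove calls are refused once unregistration starts, and unregistration waits for the ones already in flight) can be modelled in userspace. The sketch below is not SRCU and not the patch's implementation: it uses a mutex, a counter and a condition variable, whereas real SRCU readers take no shared lock and `srcu_read_lock()` itself never fails. `struct gate` and the `gate_*()` functions are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is not SRCU, only similar semantics. */
struct gate {
	pthread_mutex_t lock;
	pthread_cond_t idle;
	int users;           /* create/remove operations in flight */
	bool unregistering;  /* models parent->self being set to NULL */
};

#define GATE_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false }

/* Called at the start of a create/remove; refused once unregistering. */
static int gate_enter(struct gate *g)
{
	int ret = 0;

	pthread_mutex_lock(&g->lock);
	if (g->unregistering)
		ret = -ENODEV;
	else
		g->users++;
	pthread_mutex_unlock(&g->lock);
	return ret;
}

/* Called at the end of a create/remove that entered successfully. */
static void gate_exit(struct gate *g)
{
	pthread_mutex_lock(&g->lock);
	assert(g->users > 0);
	if (--g->users == 0)
		pthread_cond_broadcast(&g->idle);
	pthread_mutex_unlock(&g->lock);
}

/* Refuse new users, then block until the in-flight ones have left. */
static void gate_unregister(struct gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->unregistering = true;
	while (g->users > 0)
		pthread_cond_wait(&g->idle, &g->lock);
	pthread_mutex_unlock(&g->lock);
}
```

The lock is held only around the bookkeeping, so several create/remove operations can still run in parallel between `gate_enter()` and `gate_exit()`.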

> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++------------------
>  drivers/vfio/mdev/mdev_private.h |   7 +-
>  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
>  3 files changed, 84 insertions(+), 71 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 944a058..8fe0ed1 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
>  						  ref);
>  	struct device *dev = parent->dev;
>  
> +	cleanup_srcu_struct(&parent->unreg_srcu);
>  	kfree(parent);
>  	put_device(dev);
>  }
> @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct mdev_parent *parent)
>  		kref_put(&parent->ref, mdev_release_parent);
>  }
>  
> -static int mdev_device_create_ops(struct kobject *kobj,
> -				  struct mdev_device *mdev)
> +static int mdev_device_must_remove(struct mdev_device *mdev)

Naming is off here, mdev_device_remove_common()?

>  {
> -	struct mdev_parent *parent = mdev->parent;
> +	struct mdev_parent *parent;
> +	struct mdev_type *type;
>  	int ret;
>  
> -	ret = parent->ops->create(kobj, mdev);
> -	if (ret)
> -		return ret;
> +	type = to_mdev_type(mdev->type_kobj);
>  
> -	ret = sysfs_create_groups(&mdev->dev.kobj,
> -				  parent->ops->mdev_attr_groups);
> +	mdev_remove_sysfs_files(&mdev->dev, type);
> +	device_del(&mdev->dev);
> +	parent = mdev->parent;
> +	ret = parent->ops->remove(mdev);
>  	if (ret)
> -		parent->ops->remove(mdev);
> +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);

Let the caller decide whether to be verbose with the error, parent
removal might want to warn, sysfs remove might just return an error.
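
One way to read this suggestion, sketched in plain C with hypothetical names (`remove_common()`, `remove_on_unregister()`, `remove_from_sysfs()`): the shared helper only returns the error, and each caller chooses how loud to be. The `vendor_err` parameter merely simulates what a vendor remove callback would return.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical shared helper: does the work and only reports the result. */
static int remove_common(int vendor_err)
{
	return vendor_err;  /* deliberately no logging here */
}

/* Parent-unregistration path: warn about the failure, then carry on. */
static int remove_on_unregister(int vendor_err)
{
	int ret = remove_common(vendor_err);

	if (ret)
		fprintf(stderr, "warning: remove failed: err=%d\n", ret);
	return 0;
}

/* sysfs path: quietly hand the error back to the writer. */
static int remove_from_sysfs(int vendor_err)
{
	return remove_common(vendor_err);
}
```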

>  
> +	/* Balances with device_initialize() */
> +	put_device(&mdev->dev);
>  	return ret;
>  }
>  
> -/*
> - * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
> - * device is being unregistered from mdev device framework.
> - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> - *   indicates that if the mdev device is active, used by VMM or userspace
> - *   application, vendor driver could return error then don't remove the device.
> - * - 'force_remove' is set to 'true' when called from mdev_unregister_device()
> - *   which indicate that parent device is being removed from mdev device
> - *   framework so remove mdev device forcefully.
> - */
> -static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
> -{
> -	struct mdev_parent *parent = mdev->parent;
> -	int ret;
> -
> -	/*
> -	 * Vendor driver can return error if VMM or userspace application is
> -	 * using this mdev device.
> -	 */
> -	ret = parent->ops->remove(mdev);
> -	if (ret && !force_remove)
> -		return ret;
> -
> -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> -	return 0;
> -}

Seems like there's easily a separate patch in pushing the create/remove
ops into the calling function and separating for the iterator callback,
that would make this easier to review.

> -
>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
>  	if (dev_is_mdev(dev))
> -		mdev_device_remove(dev, true);
> -
> +		mdev_device_must_remove(to_mdev_device(dev));
>  	return 0;
>  }
>  
> @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	}
>  
>  	kref_init(&parent->ref);
> +	init_srcu_struct(&parent->unreg_srcu);
>  
>  	parent->dev = dev;
>  	parent->ops = ops;
> @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	if (ret)
>  		dev_warn(dev, "Failed to create compatibility class link\n");
>  
> +	rcu_assign_pointer(parent->self, parent);
>  	list_add(&parent->next, &parent_list);
>  	mutex_unlock(&parent_list_lock);
>  
> @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
>  
>  	mutex_lock(&parent_list_lock);
>  	parent = __find_parent_device(dev);
> -
>  	if (!parent) {
>  		mutex_unlock(&parent_list_lock);
>  		return;
>  	}
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +
>  	dev_info(dev, "MDEV: Unregistering\n");
>  
> -	list_del(&parent->next);
> +	/* Publish that this mdev parent is unregistering. So any new
> +	 * create/remove cannot start on this parent anymore by user.
> +	 */

Comment style, we're not in netdev.

> +	rcu_assign_pointer(parent->self, NULL);
> +
> +	/*
> +	 * Wait for any active create() or remove() mdev ops on the parent
> +	 * to complete.
> +	 */
> +	synchronize_srcu(&parent->unreg_srcu);
> +
> +	/* At this point it is confirmed that any pending user initiated
> +	 * create or remove callbacks accessing the parent are completed.
> +	 * It is safe to remove the parent now.
> +	 */
>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
>  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  
> -	mutex_unlock(&parent_list_lock);
>  	mdev_put_parent(parent);
>  }
>  EXPORT_SYMBOL(mdev_unregister_device);
> @@ -278,14 +270,24 @@ static void mdev_device_release(struct device *dev)
>  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  {
>  	int ret;
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	int srcu_idx;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		/* parent is undergoing unregistration */
> +		ret = -ENODEV;
> +		goto mdev_fail;
> +	}
> +
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  
>  	mdev->parent = parent;
>  
> +	device_initialize(&mdev->dev);
>  	mdev->dev.parent  = dev;
>  	mdev->dev.bus     = &mdev_bus_type;
>  	mdev->dev.release = mdev_device_release;
> +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
>  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
> -	ret = device_register(&mdev->dev);
> +	ret = type->parent->ops->create(kobj, mdev);
>  	if (ret)
> -		goto mdev_fail;
> +		goto create_fail;
>  
> -	ret = mdev_device_create_ops(kobj, mdev);
> +	ret = device_add(&mdev->dev);

Separating device_initialize() and device_add() also looks like a
separate patch, then the srcu could be added at the end.  Thanks,

Alex
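
The ordering that a split initialize/add sequence implies, with error unwinding as the exact mirror of setup, can be sketched as follows. This is an illustrative model only: `create_sketch()`, `note()` and `enum step` are hypothetical, and each step just appends a letter to a caller-provided log (which must start empty and hold at least 8 bytes) so the order can be inspected. The label names follow the ones in the quoted patch.

```c
#include <assert.h>
#include <string.h>

enum step { STEP_NONE, STEP_VENDOR_CREATE, STEP_ADD, STEP_SYSFS };

/* Appends one character per action so the order can be inspected. */
static void note(char *log, char c)
{
	size_t n = strlen(log);

	log[n] = c;
	log[n + 1] = '\0';
}

/* Runs the setup steps in order; 'fail_at' simulates a failing step. */
static int create_sketch(enum step fail_at, char *log)
{
	assert(log != NULL);

	note(log, 'i');  /* initialize the object, takes the first reference */
	if (fail_at == STEP_VENDOR_CREATE)
		goto create_fail;
	note(log, 'c');  /* vendor create */
	if (fail_at == STEP_ADD)
		goto dev_fail;
	note(log, 'a');  /* add to the bus */
	if (fail_at == STEP_SYSFS)
		goto sysfs_fail;
	note(log, 's');  /* sysfs files */
	return 0;

sysfs_fail:
	note(log, 'A');  /* undo the add */
dev_fail:
	note(log, 'C');  /* undo the vendor create */
create_fail:
	note(log, 'P');  /* drop the initial reference */
	return -1;
}
```

A failure at the sysfs step undoes the add, then the vendor create, then drops the reference, which is the reverse of the order in which they were set up.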

>  	if (ret)
> -		goto create_fail;
> +		goto dev_fail;
>  
>  	ret = mdev_create_sysfs_files(&mdev->dev, type);
> -	if (ret) {
> -		mdev_device_remove_ops(mdev, true);
> -		goto create_fail;
> -	}
> +	if (ret)
> +		goto sysfs_fail;
>  
>  	mdev->type_kobj = kobj;
>  	mdev->active = true;
>  	dev_dbg(&mdev->dev, "MDEV: created\n");
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
>  	return 0;
>  
> +sysfs_fail:
> +	device_del(&mdev->dev);
> +dev_fail:
> +	type->parent->ops->remove(mdev);
>  create_fail:
> -	device_unregister(&mdev->dev);
> +	put_device(&mdev->dev);
>  mdev_fail:
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  	mdev_put_parent(parent);
>  	return ret;
>  }
>  
> -int mdev_device_remove(struct device *dev, bool force_remove)
> +int mdev_device_remove(struct device *dev)
>  {
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
> -	struct mdev_type *type;
> +	int srcu_idx;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		/* parent is undergoing unregistration */
> +		return -ENODEV;
> +	}
> +
> +	mutex_lock(&mdev_list_lock);
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
> -		return -EAGAIN;
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		return -ENODEV;
>  	}
> -
>  	mdev->active = false;
>  	mutex_unlock(&mdev_list_lock);
>  
> -	type = to_mdev_type(mdev->type_kobj);
> -	parent = mdev->parent;
> -
> -	ret = mdev_device_remove_ops(mdev, force_remove);
> -	if (ret) {
> -		mdev->active = true;
> -		return ret;
> -	}
> +	ret = mdev_device_must_remove(mdev);
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
> -	mdev_remove_sysfs_files(dev, type);
> -	device_unregister(dev);
>  	mdev_put_parent(parent);
> -
> -	return 0;
> +	return ret;
>  }
>  
>  static int __init mdev_init(void)
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 84b2b6c..3d17db9 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -23,6 +23,11 @@ struct mdev_parent {
>  	struct list_head next;
>  	struct kset *mdev_types_kset;
>  	struct list_head type_list;
> +	/* Protects unregistration to wait until create/remove
> +	 * are completed.
> +	 */
> +	struct srcu_struct unreg_srcu;
> +	struct mdev_parent __rcu *self;
>  };
>  
>  struct mdev_device {
> @@ -58,6 +63,6 @@ struct mdev_type {
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
>  
>  int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
> -int  mdev_device_remove(struct device *dev, bool force_remove);
> +int  mdev_device_remove(struct device *dev);
>  
>  #endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index c782fa9..68a8191 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -236,11 +236,9 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
>  	if (val && device_remove_file_self(dev, attr)) {
>  		int ret;
>  
> -		ret = mdev_device_remove(dev, false);
> -		if (ret) {
> -			device_create_file(dev, attr);
> +		ret = mdev_device_remove(dev);
> +		if (ret)
>  			return ret;
> -		}
>  	}
>  
>  	return count;



* RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-25 23:18   ` Alex Williamson
@ 2019-03-25 23:34     ` Parav Pandit
  2019-03-26  0:05       ` Alex Williamson
  0 siblings, 1 reply; 49+ messages in thread
From: Parav Pandit @ 2019-03-25 23:34 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, kwankhede



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 6:19 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Fri, 22 Mar 2019 18:20:35 -0500
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > There are five problems with current code structure.
> > 1. mdev device is placed on the mdev bus before it is created in the
> > vendor driver. Once a device is placed on the mdev bus without
> > creating its supporting underlying vendor device, an open() can get
> > triggered by userspace on partially initialized device.
> > Below ladder diagram highlight it.
> >
> >       cpu-0                                       cpu-1
> >       -----                                       -----
> >    create_store()
> >      mdev_create_device()
> >        device_register()
> >           ...
> >          vfio_mdev_probe()
> >          ...creates char device
> >                                         vfio_mdev_open()
> >                                           parent->ops->open(mdev)
> >                                             vfio_ap_mdev_open()
> >                                               matrix_mdev = NULL
> >         [...]
> >         parent->ops->create()
> >           vfio_ap_mdev_create()
> >             mdev_set_drvdata(mdev, matrix_mdev);
> >             /* Valid pointer set above */
> >
> > 2. Current creation sequence is,
> >    parent->ops_create()
> >    groups_register()
> >
> > Remove sequence is,
> >    parent->ops->remove()
> >    groups_unregister()
> > However, remove sequence should be exact mirror of creation sequence.
> > Once this is achieved, all users of the mdev will be terminated first
> > before removing underlying vendor device.
> > (Follow standard linux driver model).
> > At that point vendor's remove() ops shouldn't failed because device is
> > taken off the bus that should terminate the users.
> >
> > 3. Additionally any new mdev driver that wants to work on mdev device
> > during probe() routine registered using mdev_register_driver() needs
> > to get stable mdev structure.
> >
> > 4. In following sequence, child devices created while removing mdev
> > parent device can be left out, or it may lead to race of removing half
> > initialized child mdev devices.
> >
> > issue-1:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> >                                   mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >                                   parent_remove_sysfs_files()
> >                                   /* BUG: device added by cpu-0
> >                                    * whose parent is getting removed.
> >                                    */
> >
> > issue-2:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >
> >        [...]                      mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> >
> >        mdev_create_sysfs_files()
> >        /* BUG: create is adding
> >         * sysfs files for a device
> >         * which is undergoing removal.
> >         */
> >                                  parent_remove_sysfs_files()
> 
> In both cases above, it looks like the device will hold a reference to the
> parent, so while there is a race, the parent object isn't released.
Yes, the parent object is not released, but the parent's fields are not stable.

> 
> >
> > 5. Below crash is observed when user initiated remove is in progress
> > and mdev_unregister_driver() completes parent unregistration.
> >
> >        cpu-0                         cpu-1
> >        -----                         -----
> > remove_store()
> >    mdev_device_remove()
> >    active = false;
> >                                   mdev_unregister_device()
> >                                     remove type
> >    [...]
> >    mdev_remove_ops() crashes.
> >
> > This is similar race like create() racing with mdev_unregister_device().
> 
> Not sure I catch this, the device should have a reference to the parent, and
> we don't specifically clear parent->ops, so what's getting removed that
> causes this oops?  Is .remove pointing at bad text regardless?
> 
I guess it is mdev_attr_groups being stale at that point.

> > mtty mtty: MDEV: Registered
> > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> > mtty_dev: Unloaded!
> > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> >  mdev_device_remove+0xef/0x130 [mdev]
> >  remove_store+0x77/0xa0 [mdev]
> >  kernfs_fop_write+0x113/0x1a0
> >  __vfs_write+0x33/0x1b0
> >  ? rcu_read_lock_sched_held+0x64/0x70
> >  ? rcu_sync_lockdep_assert+0x2a/0x50
> >  ? __sb_start_write+0x121/0x1b0
> >  ? vfs_write+0x17c/0x1b0
> >  vfs_write+0xad/0x1b0
> >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >  ksys_write+0x55/0xc0
> >  do_syscall_64+0x5a/0x210
> >
> > Therefore, mdev core is improved in following ways to overcome above
> > issues.
> >
> > 1. Before placing mdev devices on the bus, perform vendor drivers
> > creation which supports the mdev creation.
> > This ensures that mdev specific all necessary fields are initialized
> > before a given mdev can be accessed by bus driver.
> >
> > 2. During remove flow, first remove the device from the bus. This
> > ensures that any bus specific devices and data is cleared.
> > Once device is taken of the mdev bus, perform remove() of mdev from
> > the vendor driver.
> >
> > 3. Linux core device model provides way to register and auto
> > unregister the device sysfs attribute groups at dev->groups.
> > Make use of this groups to let core create the groups and simplify
> > code to avoid explicit groups creation and removal.
> >
> > 4. Wait for any ongoing mdev create() and remove() to finish before
> > unregistering parent device using srcu. This continues to allow
> > multiple create and remove to progress in parallel. At the same time
> > guard parent removal while parent is being access by create() and remove
> callbacks.
> 
> So there should be 4-5 separate patches here?  Wishful thinking?
> 
create() and remove() racing with unregistration is handled using srcu.
Change-3 cannot be done without fixing the sequence, so it belongs in the patch that fixes the sequence.
Changes 1, 2 and 3 as described are really one change; they are listed separately in the patch description only for clarity.
Change-4 can possibly be split into a separate patch.
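
To make the gating in change-4 concrete, here is a rough userspace
sketch; it is not the patch. It swaps SRCU for a pair of C11 atomics (a
"registered" flag standing in for parent->self and a count of in-flight
sections), so it only models the ordering and busy-waits where
synchronize_srcu() would sleep. All names (parent_model,
parent_model_enter() and so on) are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a parent; not the kernel struct. */
struct parent_model {
	atomic_bool registered;	/* plays the role of parent->self != NULL */
	atomic_int readers;	/* create()/remove() sections in flight */
};

static void parent_model_init(struct parent_model *p)
{
	atomic_init(&p->registered, true);
	atomic_init(&p->readers, 0);
}

/* Enter a create/remove section; fails once unregistration has begun. */
static int parent_model_enter(struct parent_model *p)
{
	/* Announce ourselves first, then check the flag. */
	atomic_fetch_add(&p->readers, 1);
	if (!atomic_load(&p->registered)) {
		atomic_fetch_sub(&p->readers, 1);
		return -ENODEV;
	}
	return 0;
}

/* Leave a section that was entered successfully. */
static void parent_model_exit(struct parent_model *p)
{
	int prev = atomic_fetch_sub(&p->readers, 1);

	assert(prev > 0);	/* exit must pair with a successful enter */
	(void)prev;
}

/* Refuse new sections, then wait for the ones in flight to drain. */
static void parent_model_unregister(struct parent_model *p)
{
	atomic_store(&p->registered, false);
	while (atomic_load(&p->readers) > 0)
		;	/* busy-wait; the real code would block in SRCU */
}
```

A create or remove path would call parent_model_enter(), bail out with
-ENODEV on failure, and call parent_model_exit() when done. Once
parent_model_unregister() returns, no section is running and no new one
can start, which is the property the srcu code is meant to provide.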

> > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++--------------
> ----
> >  drivers/vfio/mdev/mdev_private.h |   7 +-
> >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> >  3 files changed, 84 insertions(+), 71 deletions(-)
> >
> > diff --git a/drivers/vfio/mdev/mdev_core.c
> > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> >  						  ref);
> >  	struct device *dev = parent->dev;
> >
> > +	cleanup_srcu_struct(&parent->unreg_srcu);
> >  	kfree(parent);
> >  	put_device(dev);
> >  }
> > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct
> mdev_parent *parent)
> >  		kref_put(&parent->ref, mdev_release_parent);  }
> >
> > -static int mdev_device_create_ops(struct kobject *kobj,
> > -				  struct mdev_device *mdev)
> > +static int mdev_device_must_remove(struct mdev_device *mdev)
> 
> Naming is off here, mdev_device_remove_common()?
> 
Yes, sounds better.

> >  {
> > -	struct mdev_parent *parent = mdev->parent;
> > +	struct mdev_parent *parent;
> > +	struct mdev_type *type;
> >  	int ret;
> >
> > -	ret = parent->ops->create(kobj, mdev);
> > -	if (ret)
> > -		return ret;
> > +	type = to_mdev_type(mdev->type_kobj);
> >
> > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > -				  parent->ops->mdev_attr_groups);
> > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > +	device_del(&mdev->dev);
> > +	parent = mdev->parent;
> > +	ret = parent->ops->remove(mdev);
> >  	if (ret)
> > -		parent->ops->remove(mdev);
> > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
> 
> Let the caller decide whether to be verbose with the error, parent removal
> might want to warn, sysfs remove might just return an error.
> 
I didn't follow. Does "caller" mean mdev_device_remove_common() or the vendor driver?

> >
> > +	/* Balances with device_initialize() */
> > +	put_device(&mdev->dev);
> >  	return ret;
> >  }
> >
> > -/*
> > - * mdev_device_remove_ops gets called from sysfs's 'remove' and when
> > parent
> > - * device is being unregistered from mdev device framework.
> > - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> > - *   indicates that if the mdev device is active, used by VMM or userspace
> > - *   application, vendor driver could return error then don't remove the
> device.
> > - * - 'force_remove' is set to 'true' when called from
> mdev_unregister_device()
> > - *   which indicate that parent device is being removed from mdev device
> > - *   framework so remove mdev device forcefully.
> > - */
> > -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> > force_remove) -{
> > -	struct mdev_parent *parent = mdev->parent;
> > -	int ret;
> > -
> > -	/*
> > -	 * Vendor driver can return error if VMM or userspace application is
> > -	 * using this mdev device.
> > -	 */
> > -	ret = parent->ops->remove(mdev);
> > -	if (ret && !force_remove)
> > -		return ret;
> > -
> > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops-
> >mdev_attr_groups);
> > -	return 0;
> > -}
> 
> Seems like there's easily a separate patch in pushing the create/remove ops
> into the calling function and separating for the iterator callback, that would
> make this easier to review.
> 
> > -
> >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> >  	if (dev_is_mdev(dev))
> > -		mdev_device_remove(dev, true);
> > -
> > +		mdev_device_must_remove(to_mdev_device(dev));
> >  	return 0;
> >  }
> >
> > @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const
> struct mdev_parent_ops *ops)
> >  	}
> >
> >  	kref_init(&parent->ref);
> > +	init_srcu_struct(&parent->unreg_srcu);
> >
> >  	parent->dev = dev;
> >  	parent->ops = ops;
> > @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const
> struct mdev_parent_ops *ops)
> >  	if (ret)
> >  		dev_warn(dev, "Failed to create compatibility class link\n");
> >
> > +	rcu_assign_pointer(parent->self, parent);
> >  	list_add(&parent->next, &parent_list);
> >  	mutex_unlock(&parent_list_lock);
> >
> > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
> >
> >  	mutex_lock(&parent_list_lock);
> >  	parent = __find_parent_device(dev);
> > -
> >  	if (!parent) {
> >  		mutex_unlock(&parent_list_lock);
> >  		return;
> >  	}
> > +	list_del(&parent->next);
> > +	mutex_unlock(&parent_list_lock);
> > +
> >  	dev_info(dev, "MDEV: Unregistering\n");
> >
> > -	list_del(&parent->next);
> > +	/* Publish that this mdev parent is unregistering. So any new
> > +	 * create/remove cannot start on this parent anymore by user.
> > +	 */
> 
> Comment style, we're not in netdev.
Yep. Will fix it.
> 
> > +	rcu_assign_pointer(parent->self, NULL);
> > +
> > +	/*
> > +	 * Wait for any active create() or remove() mdev ops on the parent
> > +	 * to complete.
> > +	 */
> > +	synchronize_srcu(&parent->unreg_srcu);
> > +
> > +	/* At this point it is confirmed that any pending user initiated
> > +	 * create or remove callbacks accessing the parent are completed.
> > +	 * It is safe to remove the parent now.
> > +	 */
> >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> >
> >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> >
> >  	parent_remove_sysfs_files(parent);
> >
> > -	mutex_unlock(&parent_list_lock);
> >  	mdev_put_parent(parent);
> >  }
> >  EXPORT_SYMBOL(mdev_unregister_device);
> > @@ -278,14 +270,24 @@ static void mdev_device_release(struct device
> > *dev)  int mdev_device_create(struct kobject *kobj, struct device
> > *dev, uuid_le uuid)  {
> >  	int ret;
> > +	struct mdev_parent *valid_parent;
> >  	struct mdev_device *mdev, *tmp;
> >  	struct mdev_parent *parent;
> >  	struct mdev_type *type = to_mdev_type(kobj);
> > +	int srcu_idx;
> >
> >  	parent = mdev_get_parent(type->parent);
> >  	if (!parent)
> >  		return -EINVAL;
> >
> > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > +	if (!valid_parent) {
> > +		/* parent is undergoing unregistration */
> > +		ret = -ENODEV;
> > +		goto mdev_fail;
> > +	}
> > +
> >  	mutex_lock(&mdev_list_lock);
> >
> >  	/* Check for duplicate */
> > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj,
> > struct device *dev, uuid_le uuid)
> >
> >  	mdev->parent = parent;
> >
> > +	device_initialize(&mdev->dev);
> >  	mdev->dev.parent  = dev;
> >  	mdev->dev.bus     = &mdev_bus_type;
> >  	mdev->dev.release = mdev_device_release;
> > +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
> >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> >
> > -	ret = device_register(&mdev->dev);
> > +	ret = type->parent->ops->create(kobj, mdev);
> >  	if (ret)
> > -		goto mdev_fail;
> > +		goto create_fail;
> >
> > -	ret = mdev_device_create_ops(kobj, mdev);
> > +	ret = device_add(&mdev->dev);
> 
> Separating device_initialize() and device_add() also looks like a separate
> patch, then the srcu could be added at the end.  Thanks,
> 
> Alex

I saw a little more code generated that way, but I think it's fine.
Basically, the create/remove callback sequencing that does device_initialize()/device_add() etc. goes in one patch, and
the user-side race handling using srcu goes in another patch.
Sounds good?


* Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-25 23:34     ` Parav Pandit
@ 2019-03-26  0:05       ` Alex Williamson
  2019-03-26  1:43         ` Parav Pandit
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2019-03-26  0:05 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede

On Mon, 25 Mar 2019 23:34:28 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, March 25, 2019 6:19 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com
> > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> > 
> > On Fri, 22 Mar 2019 18:20:35 -0500
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > There are five problems with current code structure.
> > > 1. mdev device is placed on the mdev bus before it is created in the
> > > vendor driver. Once a device is placed on the mdev bus without
> > > creating its supporting underlying vendor device, an open() can get
> > > triggered by userspace on partially initialized device.
> > > Below ladder diagram highlight it.
> > >
> > >       cpu-0                                       cpu-1
> > >       -----                                       -----
> > >    create_store()
> > >      mdev_create_device()
> > >        device_register()
> > >           ...
> > >          vfio_mdev_probe()
> > >          ...creates char device
> > >                                         vfio_mdev_open()
> > >                                           parent->ops->open(mdev)
> > >                                             vfio_ap_mdev_open()
> > >                                               matrix_mdev = NULL
> > >         [...]
> > >         parent->ops->create()
> > >           vfio_ap_mdev_create()
> > >             mdev_set_drvdata(mdev, matrix_mdev);
> > >             /* Valid pointer set above */
> > >
> > > 2. Current creation sequence is,
> > >    parent->ops_create()
> > >    groups_register()
> > >
> > > Remove sequence is,
> > >    parent->ops->remove()
> > >    groups_unregister()
> > > However, remove sequence should be exact mirror of creation sequence.
> > > Once this is achieved, all users of the mdev will be terminated first
> > > before removing underlying vendor device.
> > > (Follow standard linux driver model).
> > > At that point vendor's remove() ops shouldn't failed because device is
> > > taken off the bus that should terminate the users.
> > >
> > > 3. Additionally any new mdev driver that wants to work on mdev device
> > > during probe() routine registered using mdev_register_driver() needs
> > > to get stable mdev structure.
> > >
> > > 4. In following sequence, child devices created while removing mdev
> > > parent device can be left out, or it may lead to race of removing half
> > > initialized child mdev devices.
> > >
> > > issue-1:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > >                                   mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >                                   parent_remove_sysfs_files()
> > >                                   /* BUG: device added by cpu-0
> > >                                    * whose parent is getting removed.
> > >                                    */
> > >
> > > issue-2:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >
> > >        [...]                      mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > >
> > >        mdev_create_sysfs_files()
> > >        /* BUG: create is adding
> > >         * sysfs files for a device
> > >         * which is undergoing removal.
> > >         */
> > >                                  parent_remove_sysfs_files()  
> > 
> > In both cases above, it looks like the device will hold a reference to the
> > parent, so while there is a race, the parent object isn't released.  
> Yes, the parent object is not released, but the parent's fields are not stable.
> 
> >   
> > >
> > > 5. Below crash is observed when user initiated remove is in progress
> > > and mdev_unregister_driver() completes parent unregistration.
> > >
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > remove_store()
> > >    mdev_device_remove()
> > >    active = false;
> > >                                   mdev_unregister_device()
> > >                                     remove type
> > >    [...]
> > >    mdev_remove_ops() crashes.
> > >
> > > This is similar race like create() racing with mdev_unregister_device().  
> > 
> > Not sure I catch this, the device should have a reference to the parent, and
> > we don't specifically clear parent->ops, so what's getting removed that
> > causes this oops?  Is .remove pointing at bad text regardless?
> >   
> I guess it is mdev_attr_groups being stale at that point.
> 
> > > mtty mtty: MDEV: Registered
> > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > > mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> > > mtty_dev: Unloaded!
> > > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > Oops: 0000 [#1] SMP PTI
> > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > >  mdev_device_remove+0xef/0x130 [mdev]
> > >  remove_store+0x77/0xa0 [mdev]
> > >  kernfs_fop_write+0x113/0x1a0
> > >  __vfs_write+0x33/0x1b0
> > >  ? rcu_read_lock_sched_held+0x64/0x70
> > >  ? rcu_sync_lockdep_assert+0x2a/0x50
> > >  ? __sb_start_write+0x121/0x1b0
> > >  ? vfs_write+0x17c/0x1b0
> > >  vfs_write+0xad/0x1b0
> > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > >  ksys_write+0x55/0xc0
> > >  do_syscall_64+0x5a/0x210
> > >
> > > Therefore, mdev core is improved in following ways to overcome above
> > > issues.
> > >
> > > 1. Before placing mdev devices on the bus, perform vendor drivers
> > > creation which supports the mdev creation.
> > > This ensures that mdev specific all necessary fields are initialized
> > > before a given mdev can be accessed by bus driver.
> > >
> > > 2. During remove flow, first remove the device from the bus. This
> > > ensures that any bus specific devices and data is cleared.
> > > Once device is taken of the mdev bus, perform remove() of mdev from
> > > the vendor driver.
> > >
> > > 3. Linux core device model provides way to register and auto
> > > unregister the device sysfs attribute groups at dev->groups.
> > > Make use of this groups to let core create the groups and simplify
> > > code to avoid explicit groups creation and removal.
> > >
> > > 4. Wait for any ongoing mdev create() and remove() to finish before
> > > unregistering parent device using srcu. This continues to allow
> > > multiple create and remove to progress in parallel. At the same time
> > > guard parent removal while parent is being access by create() and remove  
> > callbacks.
> > 
> > So there should be 4-5 separate patches here?  Wishful thinking?
> >   
> create() and remove() racing with unregistration is handled using srcu.
> Change-3 cannot be done without fixing the sequence, so it belongs in the patch that fixes the sequence.
> Changes 1, 2 and 3 as described are really one change; they are listed separately in the patch description only for clarity.
> Change-4 can possibly be split into a separate patch.
> 
> > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > ---
> > >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++--------------  
> > ----  
> > >  drivers/vfio/mdev/mdev_private.h |   7 +-
> > >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> > >  3 files changed, 84 insertions(+), 71 deletions(-)
> > >
> > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > > --- a/drivers/vfio/mdev/mdev_core.c
> > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> > >  						  ref);
> > >  	struct device *dev = parent->dev;
> > >
> > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > >  	kfree(parent);
> > >  	put_device(dev);
> > >  }
> > > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct  
> > mdev_parent *parent)  
> > >  		kref_put(&parent->ref, mdev_release_parent);  }
> > >
> > > -static int mdev_device_create_ops(struct kobject *kobj,
> > > -				  struct mdev_device *mdev)
> > > +static int mdev_device_must_remove(struct mdev_device *mdev)  
> > 
> > Naming is off here, mdev_device_remove_common()?
> >   
> Yes, sounds better.
> 
> > >  {
> > > -	struct mdev_parent *parent = mdev->parent;
> > > +	struct mdev_parent *parent;
> > > +	struct mdev_type *type;
> > >  	int ret;
> > >
> > > -	ret = parent->ops->create(kobj, mdev);
> > > -	if (ret)
> > > -		return ret;
> > > +	type = to_mdev_type(mdev->type_kobj);
> > >
> > > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > > -				  parent->ops->mdev_attr_groups);
> > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > +	device_del(&mdev->dev);
> > > +	parent = mdev->parent;
> > > +	ret = parent->ops->remove(mdev);
> > >  	if (ret)
> > > -		parent->ops->remove(mdev);
> > > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);  
> > 
> > Let the caller decide whether to be verbose with the error, parent removal
> > might want to warn, sysfs remove might just return an error.
> >   
> I didn't follow. Does "caller" mean mdev_device_remove_common() or the vendor driver?

I mean the callback iterator on the parent remove can do a WARN_ON if
this returns an error, while the device remove path can silently return
-EBUSY.  The common function doesn't need to decide whether the parent
ops remove function deserves a dev_err.
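
Roughly, and only as a userspace sketch with made-up names (mdev_model,
remove_common(), remove_cb(), remove_from_sysfs()), the split I have in
mind looks like this.  The common path reports failure through its
return value only; each caller picks its own verbosity.  fprintf()
stands in for WARN_ON here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an mdev; not the kernel structure. */
struct mdev_model {
	bool busy;	/* the vendor remove() would refuse */
	bool removed;
};

/* Common path: no logging, just hand the result back. */
static int remove_common(struct mdev_model *m)
{
	if (m->busy)
		return -EBUSY;
	m->removed = true;
	return 0;
}

/* Parent-unregistration iterator: nobody to return to, so warn loudly. */
static int remove_cb(struct mdev_model *m, int *warnings)
{
	int ret = remove_common(m);

	if (ret) {
		fprintf(stderr, "WARN: remove failed during unregister: %d\n",
			ret);
		(*warnings)++;
	}
	return 0;	/* keep iterating over the remaining children */
}

/* sysfs-triggered remove: quietly pass the error to the writer. */
static int remove_from_sysfs(struct mdev_model *m)
{
	return remove_common(m);
}
```

With a busy device, remove_from_sysfs() just returns -EBUSY and prints
nothing, while remove_cb() prints one warning and still returns 0 so
the iteration continues.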

> > >
> > > +	/* Balances with device_initialize() */
> > > +	put_device(&mdev->dev);
> > >  	return ret;
> > >  }
> > >
> > > -/*
> > > - * mdev_device_remove_ops gets called from sysfs's 'remove' and when
> > > parent
> > > - * device is being unregistered from mdev device framework.
> > > - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> > > - *   indicates that if the mdev device is active, used by VMM or userspace
> > > - *   application, vendor driver could return error then don't remove the  
> > device.  
> > > - * - 'force_remove' is set to 'true' when called from  
> > mdev_unregister_device()  
> > > - *   which indicate that parent device is being removed from mdev device
> > > - *   framework so remove mdev device forcefully.
> > > - */
> > > -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> > > force_remove) -{
> > > -	struct mdev_parent *parent = mdev->parent;
> > > -	int ret;
> > > -
> > > -	/*
> > > -	 * Vendor driver can return error if VMM or userspace application is
> > > -	 * using this mdev device.
> > > -	 */
> > > -	ret = parent->ops->remove(mdev);
> > > -	if (ret && !force_remove)
> > > -		return ret;
> > > -
> > > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops-
> > >mdev_attr_groups);
> > > -	return 0;
> > > -}  
> > 
> > Seems like there's easily a separate patch in pushing the create/remove ops
> > into the calling function and separating for the iterator callback, that would
> > make this easier to review.
> >   
> > > -
> > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > >  	if (dev_is_mdev(dev))
> > > -		mdev_device_remove(dev, true);
> > > -
> > > +		mdev_device_must_remove(to_mdev_device(dev));
> > >  	return 0;
> > >  }
> > >
> > > @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const  
> > struct mdev_parent_ops *ops)  
> > >  	}
> > >
> > >  	kref_init(&parent->ref);
> > > +	init_srcu_struct(&parent->unreg_srcu);
> > >
> > >  	parent->dev = dev;
> > >  	parent->ops = ops;
> > > @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const  
> > struct mdev_parent_ops *ops)  
> > >  	if (ret)
> > >  		dev_warn(dev, "Failed to create compatibility class link\n");
> > >
> > > +	rcu_assign_pointer(parent->self, parent);
> > >  	list_add(&parent->next, &parent_list);
> > >  	mutex_unlock(&parent_list_lock);
> > >
> > > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
> > >
> > >  	mutex_lock(&parent_list_lock);
> > >  	parent = __find_parent_device(dev);
> > > -
> > >  	if (!parent) {
> > >  		mutex_unlock(&parent_list_lock);
> > >  		return;
> > >  	}
> > > +	list_del(&parent->next);
> > > +	mutex_unlock(&parent_list_lock);
> > > +
> > >  	dev_info(dev, "MDEV: Unregistering\n");
> > >
> > > -	list_del(&parent->next);
> > > +	/* Publish that this mdev parent is unregistering. So any new
> > > +	 * create/remove cannot start on this parent anymore by user.
> > > +	 */  
> > 
> > Comment style, we're not in netdev.  
> Yep. Will fix it.
> >   
> > > +	rcu_assign_pointer(parent->self, NULL);
> > > +
> > > +	/*
> > > +	 * Wait for any active create() or remove() mdev ops on the parent
> > > +	 * to complete.
> > > +	 */
> > > +	synchronize_srcu(&parent->unreg_srcu);
> > > +
> > > +	/* At this point it is confirmed that any pending user initiated
> > > +	 * create or remove callbacks accessing the parent are completed.
> > > +	 * It is safe to remove the parent now.
> > > +	 */
> > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > >
> > >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > >
> > >  	parent_remove_sysfs_files(parent);
> > >
> > > -	mutex_unlock(&parent_list_lock);
> > >  	mdev_put_parent(parent);
> > >  }
> > >  EXPORT_SYMBOL(mdev_unregister_device);
> > > @@ -278,14 +270,24 @@ static void mdev_device_release(struct device
> > > *dev)  int mdev_device_create(struct kobject *kobj, struct device
> > > *dev, uuid_le uuid)  {
> > >  	int ret;
> > > +	struct mdev_parent *valid_parent;
> > >  	struct mdev_device *mdev, *tmp;
> > >  	struct mdev_parent *parent;
> > >  	struct mdev_type *type = to_mdev_type(kobj);
> > > +	int srcu_idx;
> > >
> > >  	parent = mdev_get_parent(type->parent);
> > >  	if (!parent)
> > >  		return -EINVAL;
> > >
> > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > > +	if (!valid_parent) {
> > > +		/* parent is undergoing unregistration */
> > > +		ret = -ENODEV;
> > > +		goto mdev_fail;
> > > +	}
> > > +
> > >  	mutex_lock(&mdev_list_lock);
> > >
> > >  	/* Check for duplicate */
> > > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj,
> > > struct device *dev, uuid_le uuid)
> > >
> > >  	mdev->parent = parent;
> > >
> > > +	device_initialize(&mdev->dev);
> > >  	mdev->dev.parent  = dev;
> > >  	mdev->dev.bus     = &mdev_bus_type;
> > >  	mdev->dev.release = mdev_device_release;
> > > +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
> > >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> > >
> > > -	ret = device_register(&mdev->dev);
> > > +	ret = type->parent->ops->create(kobj, mdev);
> > >  	if (ret)
> > > -		goto mdev_fail;
> > > +		goto create_fail;
> > >
> > > -	ret = mdev_device_create_ops(kobj, mdev);
> > > +	ret = device_add(&mdev->dev);  
> > 
> > Separating device_initialize() and device_add() also looks like a separate
> > patch, then the srcu could be added at the end.  Thanks,
> > 
> > Alex  
> 
> I saw little more core generated that way, but I think its fine.
> Basically, create/remove callback sequencing that does the device_inititailze/add etc in one patch and 
> User side race handling using srcu in another patch.
> Sounds good?

Splitting device_register() into device_initialize()/device_add() solves
the first issue on its own, so that can be one patch.  Creating the
common remove function seems like a logical next patch.  The third patch
could use the driver-core group attributes via those paths.  Another
patch could then incorporate the srcu code to gate create/remove around
parent removal.  This basically matches your steps to address these
issues, so it seems very split-able.  Thanks,

Alex
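
For readers following the thread, the ordering proposed above (initialize the device, run the vendor create op, then add the device to the bus) can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct mock_mdev, mock_bus_probe() and the other mock_* names are hypothetical stand-ins, not kernel APIs, and the flags merely model what device_initialize(), parent->ops->create() and device_add() would do.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct mdev_device; not a kernel type. */
struct mock_mdev {
	bool initialized;   /* models device_initialize() having run */
	bool vendor_ready;  /* models parent->ops->create() having succeeded */
	bool on_bus;        /* models device_add() having run */
};

typedef int (*mock_create_op)(struct mock_mdev *mdev);

/* Hypothetical vendor create op that succeeds and sets up its state. */
static int mock_vendor_create_ok(struct mock_mdev *mdev)
{
	mdev->vendor_ready = true;
	return 0;
}

/* Hypothetical vendor create op that fails before setting anything up. */
static int mock_vendor_create_fail(struct mock_mdev *mdev)
{
	(void)mdev;
	return -ENOMEM;
}

/* Models a bus driver probe, which may run as soon as the device is on the bus. */
static int mock_bus_probe(const struct mock_mdev *mdev)
{
	return mdev->vendor_ready ? 0 : -ENODEV;
}

static int mock_mdev_create(struct mock_mdev *mdev, mock_create_op create)
{
	int ret;

	assert(mdev && create);

	mdev->initialized = true;          /* step 1: initialize, not yet visible */

	ret = create(mdev);                /* step 2: vendor driver sets up its state */
	if (ret) {
		mdev->initialized = false; /* models put_device() on the error path */
		return ret;
	}

	mdev->on_bus = true;               /* step 3: only now publish on the bus */
	return mock_bus_probe(mdev);
}
```

In this model a failing create op never leaves the device on the bus, and the probe only ever runs after the vendor state exists, which is the property the first split-out patch is meant to provide.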

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-26  0:05       ` Alex Williamson
@ 2019-03-26  1:43         ` Parav Pandit
  2019-03-26  2:16           ` Alex Williamson
  0 siblings, 1 reply; 49+ messages in thread
From: Parav Pandit @ 2019-03-26  1:43 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, kwankhede



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 7:06 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Mon, 25 Mar 2019 23:34:28 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Monday, March 25, 2019 6:19 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com
> > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove
> > > sequence
> > >
> > > On Fri, 22 Mar 2019 18:20:35 -0500
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > There are five problems with current code structure.
> > > > 1. mdev device is placed on the mdev bus before it is created in
> > > > the vendor driver. Once a device is placed on the mdev bus without
> > > > creating its supporting underlying vendor device, an open() can
> > > > get triggered by userspace on partially initialized device.
> > > > Below ladder diagram highlight it.
> > > >
> > > >       cpu-0                                       cpu-1
> > > >       -----                                       -----
> > > >    create_store()
> > > >      mdev_create_device()
> > > >        device_register()
> > > >           ...
> > > >          vfio_mdev_probe()
> > > >          ...creates char device
> > > >                                         vfio_mdev_open()
> > > >                                           parent->ops->open(mdev)
> > > >                                             vfio_ap_mdev_open()
> > > >                                               matrix_mdev = NULL
> > > >         [...]
> > > >         parent->ops->create()
> > > >           vfio_ap_mdev_create()
> > > >             mdev_set_drvdata(mdev, matrix_mdev);
> > > >             /* Valid pointer set above */
> > > >
> > > > 2. Current creation sequence is,
> > > >    parent->ops_create()
> > > >    groups_register()
> > > >
> > > > Remove sequence is,
> > > >    parent->ops->remove()
> > > >    groups_unregister()
> > > > However, remove sequence should be exact mirror of creation
> sequence.
> > > > Once this is achieved, all users of the mdev will be terminated
> > > > first before removing underlying vendor device.
> > > > (Follow standard linux driver model).
> > > > At that point vendor's remove() ops shouldn't failed because
> > > > device is taken off the bus that should terminate the users.
> > > >
> > > > 3. Additionally any new mdev driver that wants to work on mdev
> > > > device during probe() routine registered using
> > > > mdev_register_driver() needs to get stable mdev structure.
> > > >
> > > > 4. In following sequence, child devices created while removing
> > > > mdev parent device can be left out, or it may lead to race of
> > > > removing half initialized child mdev devices.
> > > >
> > > > issue-1:
> > > > --------
> > > >        cpu-0                         cpu-1
> > > >        -----                         -----
> > > >                                   mdev_unregister_device()
> > > >                                      device_for_each_child()
> > > >                                         mdev_device_remove_cb()
> > > >                                             mdev_device_remove()
> > > > create_store()
> > > >   mdev_device_create()                   [...]
> > > >        device_register()
> > > >                                   parent_remove_sysfs_files()
> > > >                                   /* BUG: device added by cpu-0
> > > >                                    * whose parent is getting removed.
> > > >                                    */
> > > >
> > > > issue-2:
> > > > --------
> > > >        cpu-0                         cpu-1
> > > >        -----                         -----
> > > > create_store()
> > > >   mdev_device_create()                   [...]
> > > >        device_register()
> > > >
> > > >        [...]                      mdev_unregister_device()
> > > >                                      device_for_each_child()
> > > >                                         mdev_device_remove_cb()
> > > >                                             mdev_device_remove()
> > > >
> > > >        mdev_create_sysfs_files()
> > > >        /* BUG: create is adding
> > > >         * sysfs files for a device
> > > >         * which is undergoing removal.
> > > >         */
> > > >                                  parent_remove_sysfs_files()
> > >
> > > In both cases above, it looks like the device will hold a reference
> > > to the parent, so while there is a race, the parent object isn't released.
> > Yes, parent object is not released but parent fields are not stable.
> >
> > >
> > > >
> > > > 5. Below crash is observed when user initiated remove is in
> > > > progress and mdev_unregister_driver() completes parent
> unregistration.
> > > >
> > > >        cpu-0                         cpu-1
> > > >        -----                         -----
> > > > remove_store()
> > > >    mdev_device_remove()
> > > >    active = false;
> > > >                                   mdev_unregister_device()
> > > >                                     remove type
> > > >    [...]
> > > >    mdev_remove_ops() crashes.
> > > >
> > > > This is similar race like create() racing with mdev_unregister_device().
> > >
> > > Not sure I catch this, the device should have a reference to the
> > > parent, and we don't specifically clear parent->ops, so what's
> > > getting removed that causes this oops?  Is .remove pointing at bad text
> regardless?
> > >
> > I guess the mdev_attr_groups being stale now.
> >
> > > > mtty mtty: MDEV: Registered
> > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group
> > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id
> > > > = 57 mdev_device_remove sleep started mtty mtty: MDEV:
> > > > Unregistering
> > > > mtty_dev: Unloaded!
> > > > BUG: unable to handle kernel paging request at ffffffffc027d668
> > > > PGD
> > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > Oops: 0000 [#1] SMP PTI
> > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > >  remove_store+0x77/0xa0 [mdev]
> > > >  kernfs_fop_write+0x113/0x1a0
> > > >  __vfs_write+0x33/0x1b0
> > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > >  vfs_write+0xad/0x1b0
> > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > >  ksys_write+0x55/0xc0
> > > >  do_syscall_64+0x5a/0x210
> > > >
> > > > Therefore, mdev core is improved in following ways to overcome
> > > > above issues.
> > > >
> > > > 1. Before placing mdev devices on the bus, perform vendor drivers
> > > > creation which supports the mdev creation.
> > > > This ensures that mdev specific all necessary fields are
> > > > initialized before a given mdev can be accessed by bus driver.
> > > >
> > > > 2. During remove flow, first remove the device from the bus. This
> > > > ensures that any bus specific devices and data is cleared.
> > > > Once device is taken of the mdev bus, perform remove() of mdev
> > > > from the vendor driver.
> > > >
> > > > 3. Linux core device model provides way to register and auto
> > > > unregister the device sysfs attribute groups at dev->groups.
> > > > Make use of this groups to let core create the groups and simplify
> > > > code to avoid explicit groups creation and removal.
> > > >
> > > > 4. Wait for any ongoing mdev create() and remove() to finish
> > > > before unregistering parent device using srcu. This continues to
> > > > allow multiple create and remove to progress in parallel. At the
> > > > same time guard parent removal while parent is being access by
> > > > create() and remove
> > > callbacks.
> > >
> > > So there should be 4-5 separate patches here?  Wishful thinking?
> > >
> > create, remove racing with unregister is handled using srcu.
> > Change-3 cannot be done without fixing the sequence so it should be in
> patch that fixes it.
> > Change described changes 1-2-3 are just one change. It is just the patch
> description to bring clarity.
> > Change-4 can be possibly done as split to different patch.
> >
> > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > ---
> > > >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++---------
> -----
> > > ----
> > > >  drivers/vfio/mdev/mdev_private.h |   7 +-
> > > >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> > > >  3 files changed, 84 insertions(+), 71 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> > > >  						  ref);
> > > >  	struct device *dev = parent->dev;
> > > >
> > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > >  	kfree(parent);
> > > >  	put_device(dev);
> > > >  }
> > > > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct
> > > mdev_parent *parent)
> > > >  		kref_put(&parent->ref, mdev_release_parent);  }
> > > >
> > > > -static int mdev_device_create_ops(struct kobject *kobj,
> > > > -				  struct mdev_device *mdev)
> > > > +static int mdev_device_must_remove(struct mdev_device *mdev)
> > >
> > > Naming is off here, mdev_device_remove_common()?
> > >
> > Yes, sounds better.
> >
> > > >  {
> > > > -	struct mdev_parent *parent = mdev->parent;
> > > > +	struct mdev_parent *parent;
> > > > +	struct mdev_type *type;
> > > >  	int ret;
> > > >
> > > > -	ret = parent->ops->create(kobj, mdev);
> > > > -	if (ret)
> > > > -		return ret;
> > > > +	type = to_mdev_type(mdev->type_kobj);
> > > >
> > > > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > > > -				  parent->ops->mdev_attr_groups);
> > > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > > +	device_del(&mdev->dev);
> > > > +	parent = mdev->parent;
> > > > +	ret = parent->ops->remove(mdev);
> > > >  	if (ret)
> > > > -		parent->ops->remove(mdev);
> > > > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
> > >
> > > Let the caller decide whether to be verbose with the error, parent
> > > removal might want to warn, sysfs remove might just return an error.
> > >
> > I didn't follow. Caller meaning mdev_device_remove_common() or vendor
> driver?
> 
> I mean the callback iterator on the parent remove can do a WARN_ON if this
> returns an error while the device remove path can silently return -EBUSY, the
> common function doesn't need to decide whether the parent ops remove
> function deserves a dev_err.
> 
OK, understood.
But a device remove that silently returns -EBUSY looks like an error that should be logged, because it is not expected.
By then it is probably too late for the sysfs layer to report an error that prints the device name, because put_device() is already done.
So if remove() returns an error, I think it is a legitimate failure that warrants a WARN_ON or dev_err().
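
As a point of reference for this exchange, the "let the caller decide" approach from the quoted reply can be sketched in userspace C. This is a hedged illustration only: every mock_* name is hypothetical, the vendor result is passed in as a plain integer, and a counter plus fprintf() stands in for WARN_ON()/dev_err().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

static int mock_warn_count; /* counts warnings emitted by callers */

/* Common helper: reports failure through its return value only. */
static int mock_remove_common(int vendor_ret)
{
	assert(vendor_ret <= 0); /* 0 or a negative errno, as in the kernel */
	return vendor_ret;
}

/* Models the sysfs 'remove' path: hands the errno back to the writer. */
static int mock_sysfs_remove(int vendor_ret)
{
	return mock_remove_common(vendor_ret);
}

/* Models the parent unregistration iterator: it cannot propagate, so it warns. */
static void mock_parent_unregister_cb(int vendor_ret)
{
	int ret = mock_remove_common(vendor_ret);

	if (ret) {
		mock_warn_count++;
		fprintf(stderr, "remove failed during unregister: err=%d\n", ret);
	}
}
```

The common helper stays silent in both paths; whether a failure is worth a log line is decided where the context is known.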

> > > >
> > > > +	/* Balances with device_initialize() */
> > > > +	put_device(&mdev->dev);
> > > >  	return ret;
> > > >  }
> > > >
> > > > -/*
> > > > - * mdev_device_remove_ops gets called from sysfs's 'remove' and
> > > > when parent
> > > > - * device is being unregistered from mdev device framework.
> > > > - * - 'force_remove' is set to 'false' when called from sysfs's 'remove'
> which
> > > > - *   indicates that if the mdev device is active, used by VMM or
> userspace
> > > > - *   application, vendor driver could return error then don't remove
> the
> > > device.
> > > > - * - 'force_remove' is set to 'true' when called from
> > > mdev_unregister_device()
> > > > - *   which indicate that parent device is being removed from mdev
> device
> > > > - *   framework so remove mdev device forcefully.
> > > > - */
> > > > -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> > > > force_remove) -{
> > > > -	struct mdev_parent *parent = mdev->parent;
> > > > -	int ret;
> > > > -
> > > > -	/*
> > > > -	 * Vendor driver can return error if VMM or userspace application is
> > > > -	 * using this mdev device.
> > > > -	 */
> > > > -	ret = parent->ops->remove(mdev);
> > > > -	if (ret && !force_remove)
> > > > -		return ret;
> > > > -
> > > > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops-
> > > >mdev_attr_groups);
> > > > -	return 0;
> > > > -}
> > >
> > > Seems like there's easily a separate patch in pushing the
> > > create/remove ops into the calling function and separating for the
> > > iterator callback, that would make this easier to review.
> > >
> > > > -
> > > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > > >  	if (dev_is_mdev(dev))
> > > > -		mdev_device_remove(dev, true);
> > > > -
> > > > +		mdev_device_must_remove(to_mdev_device(dev));
> > > >  	return 0;
> > > >  }
> > > >
> > > > @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev,
> > > > const
> > > struct mdev_parent_ops *ops)
> > > >  	}
> > > >
> > > >  	kref_init(&parent->ref);
> > > > +	init_srcu_struct(&parent->unreg_srcu);
> > > >
> > > >  	parent->dev = dev;
> > > >  	parent->ops = ops;
> > > > @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev,
> > > > const
> > > struct mdev_parent_ops *ops)
> > > >  	if (ret)
> > > >  		dev_warn(dev, "Failed to create compatibility class link\n");
> > > >
> > > > +	rcu_assign_pointer(parent->self, parent);
> > > >  	list_add(&parent->next, &parent_list);
> > > >  	mutex_unlock(&parent_list_lock);
> > > >
> > > > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device
> > > > *dev)
> > > >
> > > >  	mutex_lock(&parent_list_lock);
> > > >  	parent = __find_parent_device(dev);
> > > > -
> > > >  	if (!parent) {
> > > >  		mutex_unlock(&parent_list_lock);
> > > >  		return;
> > > >  	}
> > > > +	list_del(&parent->next);
> > > > +	mutex_unlock(&parent_list_lock);
> > > > +
> > > >  	dev_info(dev, "MDEV: Unregistering\n");
> > > >
> > > > -	list_del(&parent->next);
> > > > +	/* Publish that this mdev parent is unregistering. So any new
> > > > +	 * create/remove cannot start on this parent anymore by user.
> > > > +	 */
> > >
> > > Comment style, we're not in netdev.
> > Yep. Will fix it.
> > >
> > > > +	rcu_assign_pointer(parent->self, NULL);
> > > > +
> > > > +	/*
> > > > +	 * Wait for any active create() or remove() mdev ops on the parent
> > > > +	 * to complete.
> > > > +	 */
> > > > +	synchronize_srcu(&parent->unreg_srcu);
> > > > +
> > > > +	/* At this point it is confirmed that any pending user initiated
> > > > +	 * create or remove callbacks accessing the parent are completed.
> > > > +	 * It is safe to remove the parent now.
> > > > +	 */
> > > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > > >
> > > >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > > >
> > > >  	parent_remove_sysfs_files(parent);
> > > >
> > > > -	mutex_unlock(&parent_list_lock);
> > > >  	mdev_put_parent(parent);
> > > >  }
> > > >  EXPORT_SYMBOL(mdev_unregister_device);
> > > > @@ -278,14 +270,24 @@ static void mdev_device_release(struct
> > > > device
> > > > *dev)  int mdev_device_create(struct kobject *kobj, struct device
> > > > *dev, uuid_le uuid)  {
> > > >  	int ret;
> > > > +	struct mdev_parent *valid_parent;
> > > >  	struct mdev_device *mdev, *tmp;
> > > >  	struct mdev_parent *parent;
> > > >  	struct mdev_type *type = to_mdev_type(kobj);
> > > > +	int srcu_idx;
> > > >
> > > >  	parent = mdev_get_parent(type->parent);
> > > >  	if (!parent)
> > > >  		return -EINVAL;
> > > >
> > > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > > > +	if (!valid_parent) {
> > > > +		/* parent is undergoing unregistration */
> > > > +		ret = -ENODEV;
> > > > +		goto mdev_fail;
> > > > +	}
> > > > +
> > > >  	mutex_lock(&mdev_list_lock);
> > > >
> > > >  	/* Check for duplicate */
> > > > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj,
> > > > struct device *dev, uuid_le uuid)
> > > >
> > > >  	mdev->parent = parent;
> > > >
> > > > +	device_initialize(&mdev->dev);
> > > >  	mdev->dev.parent  = dev;
> > > >  	mdev->dev.bus     = &mdev_bus_type;
> > > >  	mdev->dev.release = mdev_device_release;
> > > > +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
> > > >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> > > >
> > > > -	ret = device_register(&mdev->dev);
> > > > +	ret = type->parent->ops->create(kobj, mdev);
> > > >  	if (ret)
> > > > -		goto mdev_fail;
> > > > +		goto create_fail;
> > > >
> > > > -	ret = mdev_device_create_ops(kobj, mdev);
> > > > +	ret = device_add(&mdev->dev);
> > >
> > > Separating device_initialize() and device_add() also looks like a
> > > separate patch, then the srcu could be added at the end.  Thanks,
> > >
> > > Alex
> >
> > I saw little more core generated that way, but I think its fine.
> > Basically, create/remove callback sequencing that does the
> > device_inititailze/add etc in one patch and User side race handling using
> srcu in another patch.
> > Sounds good?
> 
> Splitting device_register into device_intialize/device_add solves the first
> issue alone, that can be one patch.  
Yes, once this is done, mdev_device_create_ops() is just a one-line wrapper around groups creation.
Hence I was considering doing it in the same patch, but it's fine as a separate cleanup patch.
More split details below.

> Creating the common remove function
> seems like a logical next patch.  The third patch could be using the driver-
> core group attribute via those paths.  Another patch could then incorporate
> the srcu code to gate the create/remove around parent removal.  This
> basically matches your steps to address these issues, it seems very split-able.
> Thanks,
> 
So I reworked this one patch, splitting it into the following smaller refactors and fixes.
1. use device_initialize/add/remove helpers without fixing the sequence, as a prep patch
2. fix the create/remove sequence
3. factor out groups creation
4. remove helper function
5. srcu fix

> Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-26  1:43         ` Parav Pandit
@ 2019-03-26  2:16           ` Alex Williamson
  2019-03-26  3:19             ` Parav Pandit
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2019-03-26  2:16 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede

On Tue, 26 Mar 2019 01:43:44 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, March 25, 2019 7:06 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com
> > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> > 
> > On Mon, 25 Mar 2019 23:34:28 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > > -----Original Message-----
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Monday, March 25, 2019 6:19 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > kwankhede@nvidia.com
> > > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove
> > > > sequence
> > > >
> > > > On Fri, 22 Mar 2019 18:20:35 -0500
> > > > Parav Pandit <parav@mellanox.com> wrote:
> > > >  
> > > > > There are five problems with current code structure.
> > > > > 1. mdev device is placed on the mdev bus before it is created in
> > > > > the vendor driver. Once a device is placed on the mdev bus without
> > > > > creating its supporting underlying vendor device, an open() can
> > > > > get triggered by userspace on partially initialized device.
> > > > > Below ladder diagram highlight it.
> > > > >
> > > > >       cpu-0                                       cpu-1
> > > > >       -----                                       -----
> > > > >    create_store()
> > > > >      mdev_create_device()
> > > > >        device_register()
> > > > >           ...
> > > > >          vfio_mdev_probe()
> > > > >          ...creates char device
> > > > >                                         vfio_mdev_open()
> > > > >                                           parent->ops->open(mdev)
> > > > >                                             vfio_ap_mdev_open()
> > > > >                                               matrix_mdev = NULL
> > > > >         [...]
> > > > >         parent->ops->create()
> > > > >           vfio_ap_mdev_create()
> > > > >             mdev_set_drvdata(mdev, matrix_mdev);
> > > > >             /* Valid pointer set above */
> > > > >
> > > > > 2. Current creation sequence is,
> > > > >    parent->ops_create()
> > > > >    groups_register()
> > > > >
> > > > > Remove sequence is,
> > > > >    parent->ops->remove()
> > > > >    groups_unregister()
> > > > > However, remove sequence should be exact mirror of creation  
> > sequence.  
> > > > > Once this is achieved, all users of the mdev will be terminated
> > > > > first before removing underlying vendor device.
> > > > > (Follow standard linux driver model).
> > > > > At that point vendor's remove() ops shouldn't failed because
> > > > > device is taken off the bus that should terminate the users.
> > > > >
> > > > > 3. Additionally any new mdev driver that wants to work on mdev
> > > > > device during probe() routine registered using
> > > > > mdev_register_driver() needs to get stable mdev structure.
> > > > >
> > > > > 4. In following sequence, child devices created while removing
> > > > > mdev parent device can be left out, or it may lead to race of
> > > > > removing half initialized child mdev devices.
> > > > >
> > > > > issue-1:
> > > > > --------
> > > > >        cpu-0                         cpu-1
> > > > >        -----                         -----
> > > > >                                   mdev_unregister_device()
> > > > >                                      device_for_each_child()
> > > > >                                         mdev_device_remove_cb()
> > > > >                                             mdev_device_remove()
> > > > > create_store()
> > > > >   mdev_device_create()                   [...]
> > > > >        device_register()
> > > > >                                   parent_remove_sysfs_files()
> > > > >                                   /* BUG: device added by cpu-0
> > > > >                                    * whose parent is getting removed.
> > > > >                                    */
> > > > >
> > > > > issue-2:
> > > > > --------
> > > > >        cpu-0                         cpu-1
> > > > >        -----                         -----
> > > > > create_store()
> > > > >   mdev_device_create()                   [...]
> > > > >        device_register()
> > > > >
> > > > >        [...]                      mdev_unregister_device()
> > > > >                                      device_for_each_child()
> > > > >                                         mdev_device_remove_cb()
> > > > >                                             mdev_device_remove()
> > > > >
> > > > >        mdev_create_sysfs_files()
> > > > >        /* BUG: create is adding
> > > > >         * sysfs files for a device
> > > > >         * which is undergoing removal.
> > > > >         */
> > > > >                                  parent_remove_sysfs_files()  
> > > >
> > > > In both cases above, it looks like the device will hold a reference
> > > > to the parent, so while there is a race, the parent object isn't released.  
> > > Yes, parent object is not released but parent fields are not stable.
> > >  
> > > >  
> > > > >
> > > > > 5. Below crash is observed when user initiated remove is in
> > > > > progress and mdev_unregister_driver() completes parent  
> > unregistration.  
> > > > >
> > > > >        cpu-0                         cpu-1
> > > > >        -----                         -----
> > > > > remove_store()
> > > > >    mdev_device_remove()
> > > > >    active = false;
> > > > >                                   mdev_unregister_device()
> > > > >                                     remove type
> > > > >    [...]
> > > > >    mdev_remove_ops() crashes.
> > > > >
> > > > > This is similar race like create() racing with mdev_unregister_device().  
> > > >
> > > > Not sure I catch this, the device should have a reference to the
> > > > parent, and we don't specifically clear parent->ops, so what's
> > > > getting removed that causes this oops?  Is .remove pointing at bad text  
> > regardless?  
> > > >  
> > > I guess the mdev_attr_groups being stale now.
> > >  
> > > > > mtty mtty: MDEV: Registered
> > > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group
> > > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id
> > > > > = 57 mdev_device_remove sleep started mtty mtty: MDEV:
> > > > > Unregistering
> > > > > mtty_dev: Unloaded!
> > > > > BUG: unable to handle kernel paging request at ffffffffc027d668
> > > > > PGD
> > > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > > Oops: 0000 [#1] SMP PTI
> > > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > > >  remove_store+0x77/0xa0 [mdev]
> > > > >  kernfs_fop_write+0x113/0x1a0
> > > > >  __vfs_write+0x33/0x1b0
> > > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > > >  vfs_write+0xad/0x1b0
> > > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > > >  ksys_write+0x55/0xc0
> > > > >  do_syscall_64+0x5a/0x210
> > > > >
> > > > > Therefore, mdev core is improved in following ways to overcome
> > > > > above issues.
> > > > >
> > > > > 1. Before placing mdev devices on the bus, perform vendor drivers
> > > > > creation which supports the mdev creation.
> > > > > This ensures that mdev specific all necessary fields are
> > > > > initialized before a given mdev can be accessed by bus driver.
> > > > >
> > > > > 2. During remove flow, first remove the device from the bus. This
> > > > > ensures that any bus specific devices and data is cleared.
> > > > > Once device is taken of the mdev bus, perform remove() of mdev
> > > > > from the vendor driver.
> > > > >
> > > > > 3. Linux core device model provides way to register and auto
> > > > > unregister the device sysfs attribute groups at dev->groups.
> > > > > Make use of this groups to let core create the groups and simplify
> > > > > code to avoid explicit groups creation and removal.
> > > > >
> > > > > 4. Wait for any ongoing mdev create() and remove() to finish
> > > > > before unregistering parent device using srcu. This continues to
> > > > > allow multiple create and remove to progress in parallel. At the
> > > > > same time guard parent removal while parent is being access by
> > > > > create() and remove  
> > > > callbacks.
> > > >
> > > > So there should be 4-5 separate patches here?  Wishful thinking?
> > > >  
> > > create, remove racing with unregister is handled using srcu.
> > > Change-3 cannot be done without fixing the sequence so it should be in  
> > patch that fixes it.  
> > > Change described changes 1-2-3 are just one change. It is just the patch  
> > description to bring clarity.  
> > > Change-4 can be possibly done as split to different patch.
> > >  
> > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > ---
> > > > >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++---------  
> > -----  
> > > > ----  
> > > > >  drivers/vfio/mdev/mdev_private.h |   7 +-
> > > > >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> > > > >  3 files changed, 84 insertions(+), 71 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> > > > >  						  ref);
> > > > >  	struct device *dev = parent->dev;
> > > > >
> > > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > > >  	kfree(parent);
> > > > >  	put_device(dev);
> > > > >  }
> > > > > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct  
> > > > mdev_parent *parent)  
> > > > >  		kref_put(&parent->ref, mdev_release_parent);  }
> > > > >
> > > > > -static int mdev_device_create_ops(struct kobject *kobj,
> > > > > -				  struct mdev_device *mdev)
> > > > > +static int mdev_device_must_remove(struct mdev_device *mdev)  
> > > >
> > > > Naming is off here, mdev_device_remove_common()?
> > > >  
> > > Yes, sounds better.
> > >  
> > > > >  {
> > > > > -	struct mdev_parent *parent = mdev->parent;
> > > > > +	struct mdev_parent *parent;
> > > > > +	struct mdev_type *type;
> > > > >  	int ret;
> > > > >
> > > > > -	ret = parent->ops->create(kobj, mdev);
> > > > > -	if (ret)
> > > > > -		return ret;
> > > > > +	type = to_mdev_type(mdev->type_kobj);
> > > > >
> > > > > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > > > > -				  parent->ops->mdev_attr_groups);
> > > > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > > > +	device_del(&mdev->dev);
> > > > > +	parent = mdev->parent;
> > > > > +	ret = parent->ops->remove(mdev);
> > > > >  	if (ret)
> > > > > -		parent->ops->remove(mdev);
> > > > > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);  
> > > >
> > > > Let the caller decide whether to be verbose with the error, parent
> > > > removal might want to warn, sysfs remove might just return an error.
> > > >  
> > > I didn't follow. Caller meaning mdev_device_remove_common() or vendor
> > > driver?
> > 
> > I mean the callback iterator on the parent remove can do a WARN_ON if this
> > returns an error while the device remove path can silently return -EBUSY, the
> > common function doesn't need to decide whether the parent ops remove
> > function deserves a dev_err.
> >   
> Ok. I understood. 
> But device remove returning silent -EBUSY looks an error that should
> get logged in, because this is something not expected. Its probably
> late for sysfs layer to return report an error by that time it prints
> device name, because put_device() is done. So if remove() returns an
> error, I think its legitimate failure to do WARN_ON or dev_err().

Calling put_device() when the parent's remove op fails looks like a bug
introduced by this series. The current code tolerates that failure: it
leaves the device in a coherent state and returns the errno to the sysfs
store function.
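
The behaviour described above (a failed parent remove leaves the device
intact and hands the errno back to the caller) can be modelled in user
space. This is only a sketch under stated assumptions: struct model_dev
and the model_* functions are hypothetical stand-ins, not kernel APIs,
and the refcount is a plain integer rather than a kref.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for an mdev device; not a kernel type. */
struct model_dev {
	int refcount;    /* models the reference taken when the device was created */
	bool in_use;     /* models a VMM or application holding the device open */
	bool registered; /* models "still visible in sysfs" */
};

/* Models parent->ops->remove(): refuses while the device is in use. */
int model_vendor_remove(struct model_dev *dev)
{
	return dev->in_use ? -EBUSY : 0;
}

/*
 * Models the sysfs-initiated remove path. On failure nothing is torn
 * down and the reference is kept, so the device stays coherent and the
 * caller only sees the errno.
 */
int model_device_remove(struct model_dev *dev)
{
	int ret = model_vendor_remove(dev);

	if (ret)
		return ret; /* device left untouched */

	assert(dev->refcount > 0);
	dev->registered = false;
	dev->refcount--; /* drop the reference only on success */
	return 0;
}
```

With this shape a second remove attempt can succeed once the user has
closed the device, because the first failure did not consume the
reference.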

> > > > >
> > > > > +	/* Balances with device_initialize() */
> > > > > +	put_device(&mdev->dev);
> > > > >  	return ret;
> > > > >  }
> > > > >
> > > > > -/*
> > > > > - * mdev_device_remove_ops gets called from sysfs's 'remove'
> > > > > and when parent
> > > > > - * device is being unregistered from mdev device framework.
> > > > > - * - 'force_remove' is set to 'false' when called from
> > > > > sysfs's 'remove'  
> > which  
> > > > > - *   indicates that if the mdev device is active, used by
> > > > > VMM or  
> > userspace  
> > > > > - *   application, vendor driver could return error then
> > > > > don't remove  
> > the  
> > > > device.  
> > > > > - * - 'force_remove' is set to 'true' when called from  
> > > > mdev_unregister_device()  
> > > > > - *   which indicate that parent device is being removed from
> > > > > mdev  
> > device  
> > > > > - *   framework so remove mdev device forcefully.
> > > > > - */
> > > > > -static int mdev_device_remove_ops(struct mdev_device *mdev,
> > > > > bool force_remove) -{
> > > > > -	struct mdev_parent *parent = mdev->parent;
> > > > > -	int ret;
> > > > > -
> > > > > -	/*
> > > > > -	 * Vendor driver can return error if VMM or
> > > > > userspace application is
> > > > > -	 * using this mdev device.
> > > > > -	 */
> > > > > -	ret = parent->ops->remove(mdev);
> > > > > -	if (ret && !force_remove)
> > > > > -		return ret;
> > > > > -
> > > > > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops-
> > > > >mdev_attr_groups);
> > > > > -	return 0;
> > > > > -}  
> > > >
> > > > Seems like there's easily a separate patch in pushing the
> > > > create/remove ops into the calling function and separating for
> > > > the iterator callback, that would make this easier to review.
> > > >  
> > > > > -
> > > > > >  static int mdev_device_remove_cb(struct device *dev, void *data)
> > > > > >  {
> > > > > >  	if (dev_is_mdev(dev))
> > > > > -		mdev_device_remove(dev, true);
> > > > > -
> > > > > +		mdev_device_must_remove(to_mdev_device(dev));
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > > @@ -194,6 +169,7 @@ int mdev_register_device(struct device
> > > > > *dev, const  
> > > > struct mdev_parent_ops *ops)  
> > > > >  	}
> > > > >
> > > > >  	kref_init(&parent->ref);
> > > > > +	init_srcu_struct(&parent->unreg_srcu);
> > > > >
> > > > >  	parent->dev = dev;
> > > > >  	parent->ops = ops;
> > > > > @@ -214,6 +190,7 @@ int mdev_register_device(struct device
> > > > > *dev, const  
> > > > struct mdev_parent_ops *ops)  
> > > > >  	if (ret)
> > > > >  		dev_warn(dev, "Failed to create
> > > > > compatibility class link\n");
> > > > >
> > > > > +	rcu_assign_pointer(parent->self, parent);
> > > > >  	list_add(&parent->next, &parent_list);
> > > > >  	mutex_unlock(&parent_list_lock);
> > > > >
> > > > > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct
> > > > > device *dev)
> > > > >
> > > > >  	mutex_lock(&parent_list_lock);
> > > > >  	parent = __find_parent_device(dev);
> > > > > -
> > > > >  	if (!parent) {
> > > > >  		mutex_unlock(&parent_list_lock);
> > > > >  		return;
> > > > >  	}
> > > > > +	list_del(&parent->next);
> > > > > +	mutex_unlock(&parent_list_lock);
> > > > > +
> > > > >  	dev_info(dev, "MDEV: Unregistering\n");
> > > > >
> > > > > -	list_del(&parent->next);
> > > > > > +	/* Publish that this mdev parent is unregistering. So any new
> > > > > > +	 * create/remove cannot start on this parent anymore by user.
> > > > > > +	 */
> > > >
> > > > Comment style, we're not in netdev.  
> > > Yep. Will fix it.  
> > > >  
> > > > > +	rcu_assign_pointer(parent->self, NULL);
> > > > > +
> > > > > +	/*
> > > > > +	 * Wait for any active create() or remove() mdev ops
> > > > > on the parent
> > > > > +	 * to complete.
> > > > > +	 */
> > > > > +	synchronize_srcu(&parent->unreg_srcu);
> > > > > +
> > > > > +	/* At this point it is confirmed that any pending
> > > > > user initiated
> > > > > +	 * create or remove callbacks accessing the parent
> > > > > are completed.
> > > > > +	 * It is safe to remove the parent now.
> > > > > +	 */
> > > > >  	class_compat_remove_link(mdev_bus_compat_class, dev,
> > > > > NULL);
> > > > >
> > > > >  	device_for_each_child(dev, NULL,
> > > > > mdev_device_remove_cb);
> > > > >
> > > > >  	parent_remove_sysfs_files(parent);
> > > > >
> > > > > -	mutex_unlock(&parent_list_lock);
> > > > >  	mdev_put_parent(parent);
> > > > >  }
> > > > >  EXPORT_SYMBOL(mdev_unregister_device);
> > > > > @@ -278,14 +270,24 @@ static void mdev_device_release(struct
> > > > > device
> > > > > *dev)  int mdev_device_create(struct kobject *kobj, struct
> > > > > device *dev, uuid_le uuid)  {
> > > > >  	int ret;
> > > > > +	struct mdev_parent *valid_parent;
> > > > >  	struct mdev_device *mdev, *tmp;
> > > > >  	struct mdev_parent *parent;
> > > > >  	struct mdev_type *type = to_mdev_type(kobj);
> > > > > +	int srcu_idx;
> > > > >
> > > > >  	parent = mdev_get_parent(type->parent);
> > > > >  	if (!parent)
> > > > >  		return -EINVAL;
> > > > >
> > > > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > > > > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > > > > +	if (!valid_parent) {
> > > > > +		/* parent is undergoing unregistration */
> > > > > +		ret = -ENODEV;
> > > > > +		goto mdev_fail;
> > > > > +	}
> > > > > +
> > > > >  	mutex_lock(&mdev_list_lock);
> > > > >
> > > > >  	/* Check for duplicate */
> > > > > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject
> > > > > *kobj, struct device *dev, uuid_le uuid)
> > > > >
> > > > >  	mdev->parent = parent;
> > > > >
> > > > > +	device_initialize(&mdev->dev);
> > > > >  	mdev->dev.parent  = dev;
> > > > >  	mdev->dev.bus     = &mdev_bus_type;
> > > > >  	mdev->dev.release = mdev_device_release;
> > > > > > +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
> > > > > >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> > > > >
> > > > > -	ret = device_register(&mdev->dev);
> > > > > +	ret = type->parent->ops->create(kobj, mdev);
> > > > >  	if (ret)
> > > > > -		goto mdev_fail;
> > > > > +		goto create_fail;
> > > > >
> > > > > -	ret = mdev_device_create_ops(kobj, mdev);
> > > > > +	ret = device_add(&mdev->dev);  
> > > >
> > > > Separating device_initialize() and device_add() also looks like
> > > > a separate patch, then the srcu could be added at the end.
> > > > Thanks,
> > > >
> > > > Alex  
> > >
> > > I saw little more core generated that way, but I think its fine.
> > > Basically, create/remove callback sequencing that does the
> > > device_initialize/add etc in one patch and User side race
> > > handling using srcu in another patch.
> > > Sounds good?  
> > 
> > Splitting device_register into device_intialize/device_add solves
> > the first issue alone, that can be one patch.    
> Yes, once this is done, mdev_device_create_ops() is just a one line
> wrapper to groups creation. Hence I was considering to do in same
> patch, but its fine as a separate clean up patch. More split details
> below.
> 
> > Creating the common remove function
> > seems like a logical next patch.  The third patch could be using
> > the driver- core group attribute via those paths.  Another patch
> > could then incorporate the srcu code to gate the create/remove
> > around parent removal.  This basically matches your steps to
> > address these issues, it seems very split-able. Thanks,
> >   
> So I reworked to split this one patch to following smaller refactor
> and fixes. 1. use of device_inititalize/add/remove helpers without
> fixing the sequence as prep patch 2. fix the create/remove sequence
> 3. factor out groups creation
> 4. remove helper function
> 5. srcu fix

Looks good, I think it will be much easier to review that way.  Thanks,

Alex


* RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-26  2:16           ` Alex Williamson
@ 2019-03-26  3:19             ` Parav Pandit
  2019-03-26  5:53               ` Parav Pandit
  0 siblings, 1 reply; 49+ messages in thread
From: Parav Pandit @ 2019-03-26  3:19 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, kwankhede



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 9:17 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Tue, 26 Mar 2019 01:43:44 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Monday, March 25, 2019 7:06 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com
> > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove
> > > sequence
> > >
> > > On Mon, 25 Mar 2019 23:34:28 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Monday, March 25, 2019 6:19 PM
> > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > kwankhede@nvidia.com
> > > > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove
> > > > > sequence
> > > > >
> > > > > On Fri, 22 Mar 2019 18:20:35 -0500 Parav Pandit
> > > > > <parav@mellanox.com> wrote:
> > > > >
> > > > > > There are five problems with current code structure.
> > > > > > 1. mdev device is placed on the mdev bus before it is created
> > > > > > in the vendor driver. Once a device is placed on the mdev bus
> > > > > > without creating its supporting underlying vendor device, an
> > > > > > open() can get triggered by userspace on partially initialized device.
> > > > > > Below ladder diagram highlight it.
> > > > > >
> > > > > >       cpu-0                                       cpu-1
> > > > > >       -----                                       -----
> > > > > >    create_store()
> > > > > >      mdev_create_device()
> > > > > >        device_register()
> > > > > >           ...
> > > > > >          vfio_mdev_probe()
> > > > > >          ...creates char device
> > > > > >                                         vfio_mdev_open()
> > > > > >                                           parent->ops->open(mdev)
> > > > > >                                             vfio_ap_mdev_open()
> > > > > >                                               matrix_mdev = NULL
> > > > > >         [...]
> > > > > >         parent->ops->create()
> > > > > >           vfio_ap_mdev_create()
> > > > > >             mdev_set_drvdata(mdev, matrix_mdev);
> > > > > >             /* Valid pointer set above */
> > > > > >
> > > > > > 2. Current creation sequence is,
> > > > > >    parent->ops_create()
> > > > > >    groups_register()
> > > > > >
> > > > > > Remove sequence is,
> > > > > >    parent->ops->remove()
> > > > > >    groups_unregister()
> > > > > > However, remove sequence should be exact mirror of creation
> > > sequence.
> > > > > > Once this is achieved, all users of the mdev will be
> > > > > > terminated first before removing underlying vendor device.
> > > > > > (Follow standard linux driver model).
> > > > > > At that point vendor's remove() ops shouldn't failed because
> > > > > > device is taken off the bus that should terminate the users.
> > > > > >
> > > > > > 3. Additionally any new mdev driver that wants to work on mdev
> > > > > > device during probe() routine registered using
> > > > > > mdev_register_driver() needs to get stable mdev structure.
> > > > > >
> > > > > > 4. In following sequence, child devices created while removing
> > > > > > mdev parent device can be left out, or it may lead to race of
> > > > > > removing half initialized child mdev devices.
> > > > > >
> > > > > > issue-1:
> > > > > > --------
> > > > > >        cpu-0                         cpu-1
> > > > > >        -----                         -----
> > > > > >                                   mdev_unregister_device()
> > > > > >                                      device_for_each_child()
> > > > > >                                         mdev_device_remove_cb()
> > > > > >
> > > > > > mdev_device_remove()
> > > > > > create_store()
> > > > > >   mdev_device_create()                   [...]
> > > > > >        device_register()
> > > > > >                                   parent_remove_sysfs_files()
> > > > > >                                   /* BUG: device added by cpu-0
> > > > > >                                    * whose parent is getting removed.
> > > > > >                                    */
> > > > > >
> > > > > > issue-2:
> > > > > > --------
> > > > > >        cpu-0                         cpu-1
> > > > > >        -----                         -----
> > > > > > create_store()
> > > > > >   mdev_device_create()                   [...]
> > > > > >        device_register()
> > > > > >
> > > > > >        [...]                      mdev_unregister_device()
> > > > > >                                      device_for_each_child()
> > > > > >                                         mdev_device_remove_cb()
> > > > > >
> > > > > > mdev_device_remove()
> > > > > >
> > > > > >        mdev_create_sysfs_files()
> > > > > >        /* BUG: create is adding
> > > > > >         * sysfs files for a device
> > > > > >         * which is undergoing removal.
> > > > > >         */
> > > > > >                                  parent_remove_sysfs_files()
> > > > >
> > > > > In both cases above, it looks like the device will hold a
> > > > > reference to the parent, so while there is a race, the parent object
> isn't released.
> > > > Yes, parent object is not released but parent fields are not stable.
> > > >
> > > > >
> > > > > >
> > > > > > 5. Below crash is observed when user initiated remove is in
> > > > > > progress and mdev_unregister_driver() completes parent
> > > unregistration.
> > > > > >
> > > > > >        cpu-0                         cpu-1
> > > > > >        -----                         -----
> > > > > > remove_store()
> > > > > >    mdev_device_remove()
> > > > > >    active = false;
> > > > > >                                   mdev_unregister_device()
> > > > > >                                     remove type
> > > > > >    [...]
> > > > > >    mdev_remove_ops() crashes.
> > > > > >
> > > > > > This is similar race like create() racing with
> mdev_unregister_device().
> > > > >
> > > > > Not sure I catch this, the device should have a reference to the
> > > > > parent, and we don't specifically clear parent->ops, so what's
> > > > > getting removed that causes this oops?  Is .remove pointing at
> > > > > bad text
> > > regardless?
> > > > >
> > > > I guess the mdev_attr_groups being stale now.
> > > >
> > > > > > > mtty mtty: MDEV: Registered
> > > > > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > > > > > > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > > > > > > mdev_device_remove sleep started
> > > > > > > mtty mtty: MDEV: Unregistering
> > > > > > > mtty_dev: Unloaded!
> > > > > > > BUG: unable to handle kernel paging request at ffffffffc027d668
> > > > > > > PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > > > > Oops: 0000 [#1] SMP PTI
> > > > > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> > > > > > > Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> > > > > > > Call Trace:
> > > > > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > > > > >  remove_store+0x77/0xa0 [mdev]
> > > > > > >  kernfs_fop_write+0x113/0x1a0
> > > > > > >  __vfs_write+0x33/0x1b0
> > > > > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > > > > >  ? rcu_sync_lockdep_assert+0x2a/0x50
> > > > > > >  ? __sb_start_write+0x121/0x1b0
> > > > > > >  ? vfs_write+0x17c/0x1b0
> > > > > > >  vfs_write+0xad/0x1b0
> > > > > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > > > > >  ksys_write+0x55/0xc0
> > > > > > >  do_syscall_64+0x5a/0x210
> > > > > >
> > > > > > Therefore, mdev core is improved in following ways to overcome
> > > > > > above issues.
> > > > > >
> > > > > > 1. Before placing mdev devices on the bus, perform vendor
> > > > > > drivers creation which supports the mdev creation.
> > > > > > This ensures that mdev specific all necessary fields are
> > > > > > initialized before a given mdev can be accessed by bus driver.
> > > > > >
> > > > > > 2. During remove flow, first remove the device from the bus.
> > > > > > This ensures that any bus specific devices and data is cleared.
> > > > > > Once device is taken of the mdev bus, perform remove() of mdev
> > > > > > from the vendor driver.
> > > > > >
> > > > > > 3. Linux core device model provides way to register and auto
> > > > > > unregister the device sysfs attribute groups at dev->groups.
> > > > > > Make use of this groups to let core create the groups and
> > > > > > simplify code to avoid explicit groups creation and removal.
> > > > > >
> > > > > > 4. Wait for any ongoing mdev create() and remove() to finish
> > > > > > before unregistering parent device using srcu. This continues
> > > > > > to allow multiple create and remove to progress in parallel.
> > > > > > > At the same time guard parent removal while parent is being
> > > > > > > access by create() and remove callbacks.
> > > > >
> > > > > So there should be 4-5 separate patches here?  Wishful thinking?
> > > > >
> > > > create, remove racing with unregister is handled using srcu.
> > > > > Change-3 cannot be done without fixing the sequence so it should
> > > > > be in patch that fixes it.
> > > > > Change described changes 1-2-3 are just one change. It is just the
> > > > > patch description to bring clarity.
> > > > Change-4 can be possibly done as split to different patch.
> > > >
> > > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > > ---
> > > > > >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++----
> -----
> > > -----
> > > > > ----
> > > > > >  drivers/vfio/mdev/mdev_private.h |   7 +-
> > > > > >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> > > > > >  3 files changed, 84 insertions(+), 71 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref
> *kref)
> > > > > >  						  ref);
> > > > > >  	struct device *dev = parent->dev;
> > > > > >
> > > > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > > > >  	kfree(parent);
> > > > > >  	put_device(dev);
> > > > > >  }
> > > > > > @@ -103,56 +104,30 @@ static inline void
> > > > > > mdev_put_parent(struct
> > > > > mdev_parent *parent)
> > > > > >  		kref_put(&parent->ref, mdev_release_parent);  }
> > > > > >
> > > > > > -static int mdev_device_create_ops(struct kobject *kobj,
> > > > > > -				  struct mdev_device *mdev)
> > > > > > +static int mdev_device_must_remove(struct mdev_device *mdev)
> > > > >
> > > > > Naming is off here, mdev_device_remove_common()?
> > > > >
> > > > Yes, sounds better.
> > > >
> > > > > >  {
> > > > > > -	struct mdev_parent *parent = mdev->parent;
> > > > > > +	struct mdev_parent *parent;
> > > > > > +	struct mdev_type *type;
> > > > > >  	int ret;
> > > > > >
> > > > > > -	ret = parent->ops->create(kobj, mdev);
> > > > > > -	if (ret)
> > > > > > -		return ret;
> > > > > > +	type = to_mdev_type(mdev->type_kobj);
> > > > > >
> > > > > > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > > > > > -				  parent->ops->mdev_attr_groups);
> > > > > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > > > > +	device_del(&mdev->dev);
> > > > > > +	parent = mdev->parent;
> > > > > > +	ret = parent->ops->remove(mdev);
> > > > > >  	if (ret)
> > > > > > -		parent->ops->remove(mdev);
> > > > > > > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
> > > > >
> > > > > Let the caller decide whether to be verbose with the error,
> > > > > parent removal might want to warn, sysfs remove might just return
> an error.
> > > > >
> > > > I didn't follow. Caller meaning mdev_device_remove_common() or
> > > > vendor
> > > driver?
> > >
> > > I mean the callback iterator on the parent remove can do a WARN_ON
> > > if this returns an error while the device remove path can silently
> > > return -EBUSY, the common function doesn't need to decide whether
> > > the parent ops remove function deserves a dev_err.
> > >
> > Ok. I understood.
> > But device remove returning silent -EBUSY looks an error that should
> > get logged in, because this is something not expected. Its probably
> > late for sysfs layer to return report an error by that time it prints
> > device name, because put_device() is done. So if remove() returns an
> > error, I think its legitimate failure to do WARN_ON or dev_err().
> 
> Calling put_device() if the parent remove op fails looks like a bug introduced
> by this series, the current code allows that failure leaving the device in a
> coherent state and returning errno to the sysfs store function.
> 
Why should it fail?
We take the device off the bus first, as described in the commit log.
This ensures that everything is closed before remove() is called.
We cannot avoid put_device() and put_parent; it is all a buggy path...
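
The ordering argued for here can be illustrated with a small user-space
model. It is a sketch, not the patch itself: struct seq_mdev and the
seq_* helpers are hypothetical, and the model bakes in the assumption
this message makes, namely that taking the device off the bus closes
every user before the vendor remove runs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an mdev device; not a kernel type. */
struct seq_mdev {
	bool vendor_ready; /* vendor create() has completed */
	bool on_bus;       /* device is visible to bus drivers and users */
	int users;         /* number of open handles */
};

/* Create: set up the vendor device first, then expose it on the bus. */
int seq_create(struct seq_mdev *m)
{
	m->vendor_ready = true;
	m->on_bus = true;
	return 0;
}

/* Open is only possible while on the bus, so vendor state must exist. */
int seq_open(struct seq_mdev *m)
{
	if (!m->on_bus)
		return -ENODEV;
	assert(m->vendor_ready);
	m->users++;
	return 0;
}

/* Models the vendor remove: refuses while anyone still has it open. */
int seq_vendor_remove(struct seq_mdev *m)
{
	if (m->users)
		return -EBUSY;
	m->vendor_ready = false;
	return 0;
}

/* Remove: exact mirror of create, bus first and vendor device last. */
int seq_remove(struct seq_mdev *m)
{
	m->on_bus = false;
	m->users = 0; /* ASSUMPTION: unbinding from the bus closes all users */
	return seq_vendor_remove(m);
}
```

Under that assumption the vendor remove never sees an open user, which
is the reason given for not expecting a failure. If the assumption does
not hold, seq_vendor_remove() would return -EBUSY after the device has
already left the bus.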


* RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-26  3:19             ` Parav Pandit
@ 2019-03-26  5:53               ` Parav Pandit
  2019-03-26 15:21                 ` Alex Williamson
  0 siblings, 1 reply; 49+ messages in thread
From: Parav Pandit @ 2019-03-26  5:53 UTC (permalink / raw)
  To: Parav Pandit, Alex Williamson; +Cc: kvm, linux-kernel, kwankhede



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> owner@vger.kernel.org> On Behalf Of Parav Pandit
> Sent: Monday, March 25, 2019 10:19 PM
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com
> Subject: RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> 
> 
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, March 25, 2019 9:17 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com
> > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> >
> > On Tue, 26 Mar 2019 01:43:44 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Alex Williamson <alex.williamson@redhat.com>

> > > > I mean the callback iterator on the parent remove can do a WARN_ON
> > > > if this returns an error while the device remove path can silently
> > > > return -EBUSY, the common function doesn't need to decide whether
> > > > the parent ops remove function deserves a dev_err.
> > > >
> > > Ok. I understood.
> > > But device remove returning silent -EBUSY looks an error that should
> > > get logged in, because this is something not expected. Its probably
> > > late for sysfs layer to return report an error by that time it
> > > prints device name, because put_device() is done. So if remove()
> > > returns an error, I think its legitimate failure to do WARN_ON or
> dev_err().
> >
> > Calling put_device() if the parent remove op fails looks like a bug
> > introduced by this series, the current code allows that failure
> > leaving the device in a coherent state and returning errno to the sysfs
> store function.
> >
> Why should it fail?
> We are taking off the device bus first as describe in commit log.
> This ensures that everything is closed before calling the remove().
> We cannot avoid put_device() and put_parent, it all buggy path...

I audited the remove() callbacks of kvmgt.c, vfio_ccw_ops.c,
vfio_ap_ops.c, mbochs.c, mdpy.c and mtty.c; they allow the remove once
the device release has executed.
This should complete once the device is taken off the bus.
This was not the case before this sequence, where remove() was called
while the device was open; hence the check was needed in the past.
The dev_err() is there to help catch any errors/bugs in this area.

I doubt we need to retry remove() in mdev_core, as vfio_del_group_dev()
does, if release() is not yet complete.


* Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-22 23:20 ` [PATCH 8/8] vfio/mdev: Improve the create/remove sequence Parav Pandit
  2019-03-25 13:24   ` Maxim Levitsky
  2019-03-25 23:18   ` Alex Williamson
@ 2019-03-26  7:06   ` Kirti Wankhede
  2019-03-26 15:26     ` Alex Williamson
  2019-03-26 15:30     ` Parav Pandit
  2 siblings, 2 replies; 49+ messages in thread
From: Kirti Wankhede @ 2019-03-26  7:06 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, alex.williamson; +Cc: Neo Jia



On 3/23/2019 4:50 AM, Parav Pandit wrote:
> There are five problems with current code structure.
> 1. mdev device is placed on the mdev bus before it is created in the
> vendor driver. Once a device is placed on the mdev bus without creating
> its supporting underlying vendor device, an open() can get triggered by
> userspace on partially initialized device.
> Below ladder diagram highlight it.
> 
>       cpu-0                                       cpu-1
>       -----                                       -----
>    create_store()
>      mdev_create_device()
>        device_register()
>           ...
>          vfio_mdev_probe()
>          ...creates char device
>                                         vfio_mdev_open()
>                                           parent->ops->open(mdev)
>                                             vfio_ap_mdev_open()
>                                               matrix_mdev = NULL
>         [...]
>         parent->ops->create()
>           vfio_ap_mdev_create()
>             mdev_set_drvdata(mdev, matrix_mdev);
>             /* Valid pointer set above */
> 

The VFIO interface uses the sysfs path of the device, or the PCI device's
BDF, and checks that the sysfs file for that device exists.
In the case of a VFIO mdev device, the above situation will never happen,
as open will only get called if the sysfs entry for that device exists.

If you don't use the VFIO interface then this situation can arise. In that
case probe() can be used for very basic initialization, and the actual
char device can then be created from create().


> 2. Current creation sequence is,
>    parent->ops_create()
>    groups_register()
> 
> Remove sequence is,
>    parent->ops->remove()
>    groups_unregister()
> However, remove sequence should be exact mirror of creation sequence.
> Once this is achieved, all users of the mdev will be terminated first
> before removing underlying vendor device.
> (Follow standard linux driver model).
> At that point vendor's remove() ops shouldn't failed because device is
> taken off the bus that should terminate the users.
> 

If a VMM or user space application is using the mdev device,
parent->ops->remove() can return failure. In that case the sysfs files
shouldn't be removed. Hence the above sequence is followed for remove.

The standard Linux driver model doesn't allow remove() to fail, but in
the case of the mdev framework, the interface is defined to handle such
an error case.
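
The semantics described above correspond to the force_remove handling
in the existing mdev_device_remove_ops(). A minimal user-space sketch
of that rule follows; struct frc_mdev and the frc_* names are
hypothetical stand-ins rather than kernel interfaces, and the sysfs
groups are reduced to a single flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an mdev device; not a kernel type. */
struct frc_mdev {
	bool in_use;        /* a VMM or application has the device open */
	bool vendor_alive;  /* vendor-side state still exists */
	bool sysfs_present; /* attribute groups are still visible */
};

/* Models parent->ops->remove(): fails while the device is in use. */
int frc_vendor_remove(struct frc_mdev *m)
{
	if (m->in_use)
		return -EBUSY;
	m->vendor_alive = false;
	return 0;
}

/*
 * Models the remove rule: a user-initiated remove (force_remove false)
 * stops at the vendor error and keeps the sysfs files, while parent
 * unregistration (force_remove true) ignores the error and continues.
 */
int frc_remove_ops(struct frc_mdev *m, bool force_remove)
{
	int ret;

	assert(m->sysfs_present); /* removing twice would be a caller bug */

	ret = frc_vendor_remove(m);
	if (ret && !force_remove)
		return ret;

	m->sysfs_present = false;
	return 0;
}
```

Note that in the forced case the sketch removes the sysfs files even
though the vendor remove reported an error, which mirrors the behaviour
of the code being replaced.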


> 3. Additionally any new mdev driver that wants to work on mdev device
> during probe() routine registered using mdev_register_driver() needs to
> get stable mdev structure.
> 

Couldn't the things that you are trying to handle with the mdev
structure from probe() be moved to create()?


> 4. In following sequence, child devices created while removing mdev parent
> device can be left out, or it may lead to race of removing half
> initialized child mdev devices.
> 
> issue-1:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
>                                   mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
>                                   parent_remove_sysfs_files()
>                                   /* BUG: device added by cpu-0
>                                    * whose parent is getting removed.
>                                    */
> 
> issue-2:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
> 
>        [...]                      mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> 
>        mdev_create_sysfs_files()
>        /* BUG: create is adding
>         * sysfs files for a device
>         * which is undergoing removal.
>         */
>                                  parent_remove_sysfs_files()
> 
> 5. Below crash is observed when user initiated remove is in progress
> and mdev_unregister_driver() completes parent unregistration.
> 
>        cpu-0                         cpu-1
>        -----                         -----
> remove_store()
>    mdev_device_remove()
>    active = false;
>                                   mdev_unregister_device()
>                                     remove type
>    [...]
>    mdev_remove_ops() crashes.
> 
> This is similar race like create() racing with mdev_unregister_device().
> 
> mtty mtty: MDEV: Registered
> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> mdev_device_remove sleep started
> mtty mtty: MDEV: Unregistering
> mtty_dev: Unloaded!
> BUG: unable to handle kernel paging request at ffffffffc027d668
> PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> Oops: 0000 [#1] SMP PTI
> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> Call Trace:
>  mdev_device_remove+0xef/0x130 [mdev]
>  remove_store+0x77/0xa0 [mdev]
>  kernfs_fop_write+0x113/0x1a0
>  __vfs_write+0x33/0x1b0
>  ? rcu_read_lock_sched_held+0x64/0x70
>  ? rcu_sync_lockdep_assert+0x2a/0x50
>  ? __sb_start_write+0x121/0x1b0
>  ? vfs_write+0x17c/0x1b0
>  vfs_write+0xad/0x1b0
>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>  ksys_write+0x55/0xc0
>  do_syscall_64+0x5a/0x210
> 
> Therefore, mdev core is improved in following ways to overcome above
> issues.
> 
> 1. Before placing mdev devices on the bus, perform vendor drivers
> creation which supports the mdev creation.
> This ensures that mdev specific all necessary fields are initialized
> before a given mdev can be accessed by bus driver.
> 
> 2. During remove flow, first remove the device from the bus. This
> ensures that any bus specific devices and data is cleared.
> Once device is taken off the mdev bus, perform remove() of mdev from the
> vendor driver.
>

If a user space application is using the device and someone underneath
removes the device from the bus, how would the user space application
know that the device is being removed?
If DMA is set up, the user space application is accessing that memory,
and the device is removed from the bus, how will you prevent that device
from being removed? If remove() is not restricted, the host might crash.
I know Linux kernel device core model doesn't allow remove() to fail,
but we had tackled that problem for mdev devices in this framework. I
prefer not to change this behavior. This will regress existing working
drivers.


> 3. Linux core device model provides way to register and auto unregister
> the device sysfs attribute groups at dev->groups.
> Make use of this groups to let core create the groups and simplify code
> to avoid explicit groups creation and removal.
> 
> 4. Wait for any ongoing mdev create() and remove() to finish before
> unregistering parent device using srcu. This continues to allow multiple
> create and remove to progress in parallel. At the same time guard parent
> removal while parent is being accessed by create() and remove() callbacks.
> 

Agreed with this.
As Alex already mentioned, it would be better to have a separate patch
for this fix.
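
The guard described in point 4 can be sketched as a small userspace model.
This is only an illustration under stated assumptions: struct parent_model
and its helpers are hypothetical names, a plain counter stands in for the
SRCU read-side critical section, and the model is single-threaded, so it
shows the ordering of the checks rather than real concurrency.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct mdev_parent. */
struct parent_model {
	bool registered;	/* plays the role of parent->self != NULL */
	int inflight;		/* create()/remove() calls currently running */
};

/* Reader side: models srcu_read_lock() plus the parent->self check. */
static int parent_op_begin(struct parent_model *p)
{
	if (!p->registered)
		return -ENODEV;	/* parent is undergoing unregistration */
	p->inflight++;
	return 0;
}

/* Reader side: models srcu_read_unlock(). */
static void parent_op_end(struct parent_model *p)
{
	assert(p->inflight > 0);
	p->inflight--;
}

/* Writer side, step 1: models rcu_assign_pointer(parent->self, NULL). */
static void parent_unpublish(struct parent_model *p)
{
	p->registered = false;
}

/* Writer side, step 2: synchronize_srcu() would sleep until this holds. */
static bool parent_quiesced(const struct parent_model *p)
{
	return !p->registered && p->inflight == 0;
}
```

In this model, an operation that began before parent_unpublish() keeps the
parent busy until parent_op_end(), while any operation that starts
afterwards gets -ENODEV. Multiple operations can be in flight at once,
which matches the stated goal of letting create and remove progress in
parallel.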

Thanks,
Kirti

> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++------------------
>  drivers/vfio/mdev/mdev_private.h |   7 +-
>  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
>  3 files changed, 84 insertions(+), 71 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 944a058..8fe0ed1 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
>  						  ref);
>  	struct device *dev = parent->dev;
>  
> +	cleanup_srcu_struct(&parent->unreg_srcu);
>  	kfree(parent);
>  	put_device(dev);
>  }
> @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct mdev_parent *parent)
>  		kref_put(&parent->ref, mdev_release_parent);
>  }
>  
> -static int mdev_device_create_ops(struct kobject *kobj,
> -				  struct mdev_device *mdev)
> +static int mdev_device_must_remove(struct mdev_device *mdev)
>  {
> -	struct mdev_parent *parent = mdev->parent;
> +	struct mdev_parent *parent;
> +	struct mdev_type *type;
>  	int ret;
>  
> -	ret = parent->ops->create(kobj, mdev);
> -	if (ret)
> -		return ret;
> +	type = to_mdev_type(mdev->type_kobj);
>  
> -	ret = sysfs_create_groups(&mdev->dev.kobj,
> -				  parent->ops->mdev_attr_groups);
> +	mdev_remove_sysfs_files(&mdev->dev, type);
> +	device_del(&mdev->dev);
> +	parent = mdev->parent;
> +	ret = parent->ops->remove(mdev);
>  	if (ret)
> -		parent->ops->remove(mdev);
> +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
>  
> +	/* Balances with device_initialize() */
> +	put_device(&mdev->dev);
>  	return ret;
>  }
>  
> -/*
> - * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
> - * device is being unregistered from mdev device framework.
> - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> - *   indicates that if the mdev device is active, used by VMM or userspace
> - *   application, vendor driver could return error then don't remove the device.
> - * - 'force_remove' is set to 'true' when called from mdev_unregister_device()
> - *   which indicate that parent device is being removed from mdev device
> - *   framework so remove mdev device forcefully.
> - */
> -static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
> -{
> -	struct mdev_parent *parent = mdev->parent;
> -	int ret;
> -
> -	/*
> -	 * Vendor driver can return error if VMM or userspace application is
> -	 * using this mdev device.
> -	 */
> -	ret = parent->ops->remove(mdev);
> -	if (ret && !force_remove)
> -		return ret;
> -
> -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> -	return 0;
> -}
> -
>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
>  	if (dev_is_mdev(dev))
> -		mdev_device_remove(dev, true);
> -
> +		mdev_device_must_remove(to_mdev_device(dev));
>  	return 0;
>  }
>  
> @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	}
>  
>  	kref_init(&parent->ref);
> +	init_srcu_struct(&parent->unreg_srcu);
>  
>  	parent->dev = dev;
>  	parent->ops = ops;
> @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	if (ret)
>  		dev_warn(dev, "Failed to create compatibility class link\n");
>  
> +	rcu_assign_pointer(parent->self, parent);
>  	list_add(&parent->next, &parent_list);
>  	mutex_unlock(&parent_list_lock);
>  
> @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
>  
>  	mutex_lock(&parent_list_lock);
>  	parent = __find_parent_device(dev);
> -
>  	if (!parent) {
>  		mutex_unlock(&parent_list_lock);
>  		return;
>  	}
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +
>  	dev_info(dev, "MDEV: Unregistering\n");
>  
> -	list_del(&parent->next);
> +	/* Publish that this mdev parent is unregistering. So any new
> +	 * create/remove cannot start on this parent anymore by user.
> +	 */
> +	rcu_assign_pointer(parent->self, NULL);
> +
> +	/*
> +	 * Wait for any active create() or remove() mdev ops on the parent
> +	 * to complete.
> +	 */
> +	synchronize_srcu(&parent->unreg_srcu);
> +
> +	/* At this point it is confirmed that any pending user initiated
> +	 * create or remove callbacks accessing the parent are completed.
> +	 * It is safe to remove the parent now.
> +	 */
>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
>  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  
> -	mutex_unlock(&parent_list_lock);
>  	mdev_put_parent(parent);
>  }
>  EXPORT_SYMBOL(mdev_unregister_device);
> @@ -278,14 +270,24 @@ static void mdev_device_release(struct device *dev)
>  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  {
>  	int ret;
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	int srcu_idx;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		/* parent is undergoing unregistration */
> +		ret = -ENODEV;
> +		goto mdev_fail;
> +	}
> +
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  
>  	mdev->parent = parent;
>  
> +	device_initialize(&mdev->dev);
>  	mdev->dev.parent  = dev;
>  	mdev->dev.bus     = &mdev_bus_type;
>  	mdev->dev.release = mdev_device_release;
> +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
>  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
> -	ret = device_register(&mdev->dev);
> +	ret = type->parent->ops->create(kobj, mdev);
>  	if (ret)
> -		goto mdev_fail;
> +		goto create_fail;
>  
> -	ret = mdev_device_create_ops(kobj, mdev);
> +	ret = device_add(&mdev->dev);
>  	if (ret)
> -		goto create_fail;
> +		goto dev_fail;
>  
>  	ret = mdev_create_sysfs_files(&mdev->dev, type);
> -	if (ret) {
> -		mdev_device_remove_ops(mdev, true);
> -		goto create_fail;
> -	}
> +	if (ret)
> +		goto sysfs_fail;
>  
>  	mdev->type_kobj = kobj;
>  	mdev->active = true;
>  	dev_dbg(&mdev->dev, "MDEV: created\n");
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
>  	return 0;
>  
> +sysfs_fail:
> +	device_del(&mdev->dev);
> +dev_fail:
> +	type->parent->ops->remove(mdev);
>  create_fail:
> -	device_unregister(&mdev->dev);
> +	put_device(&mdev->dev);
>  mdev_fail:
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  	mdev_put_parent(parent);
>  	return ret;
>  }
>  
> -int mdev_device_remove(struct device *dev, bool force_remove)
> +int mdev_device_remove(struct device *dev)
>  {
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
> -	struct mdev_type *type;
> +	int srcu_idx;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		/* parent is undergoing unregistration */
> +		return -ENODEV;
> +	}
> +
> +	mutex_lock(&mdev_list_lock);
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
> -		return -EAGAIN;
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		return -ENODEV;
>  	}
> -
>  	mdev->active = false;
>  	mutex_unlock(&mdev_list_lock);
>  
> -	type = to_mdev_type(mdev->type_kobj);
> -	parent = mdev->parent;
> -
> -	ret = mdev_device_remove_ops(mdev, force_remove);
> -	if (ret) {
> -		mdev->active = true;
> -		return ret;
> -	}
> +	ret = mdev_device_must_remove(mdev);
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
> -	mdev_remove_sysfs_files(dev, type);
> -	device_unregister(dev);
>  	mdev_put_parent(parent);
> -
> -	return 0;
> +	return ret;
>  }
>  
>  static int __init mdev_init(void)
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 84b2b6c..3d17db9 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -23,6 +23,11 @@ struct mdev_parent {
>  	struct list_head next;
>  	struct kset *mdev_types_kset;
>  	struct list_head type_list;
> +	/* Protects unregistration to wait until create/remove
> +	 * are completed.
> +	 */
> +	struct srcu_struct unreg_srcu;
> +	struct mdev_parent __rcu *self;
>  };
>  
>  struct mdev_device {
> @@ -58,6 +63,6 @@ struct mdev_type {
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
>  
>  int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
> -int  mdev_device_remove(struct device *dev, bool force_remove);
> +int  mdev_device_remove(struct device *dev);
>  
>  #endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index c782fa9..68a8191 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -236,11 +236,9 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
>  	if (val && device_remove_file_self(dev, attr)) {
>  		int ret;
>  
> -		ret = mdev_device_remove(dev, false);
> -		if (ret) {
> -			device_create_file(dev, attr);
> +		ret = mdev_device_remove(dev);
> +		if (ret)
>  			return ret;
> -		}
>  	}
>  
>  	return count;
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-26  5:53               ` Parav Pandit
@ 2019-03-26 15:21                 ` Alex Williamson
  0 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2019-03-26 15:21 UTC (permalink / raw)
  To: Parav Pandit; +Cc: kvm, linux-kernel, kwankhede

On Tue, 26 Mar 2019 05:53:22 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org <linux-kernel-  
> > owner@vger.kernel.org> On Behalf Of Parav Pandit  
> > Sent: Monday, March 25, 2019 10:19 PM
> > To: Alex Williamson <alex.williamson@redhat.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com
> > Subject: RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> > 
> > 
> >   
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Monday, March 25, 2019 9:17 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com
> > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> > >
> > > On Tue, 26 Mar 2019 01:43:44 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >  
> > > > > -----Original Message-----
> > > > > From: Alex Williamson <alex.williamson@redhat.com>  
> 
> > > > > I mean the callback iterator on the parent remove can do a WARN_ON
> > > > > if this returns an error while the device remove path can silently
> > > > > return -EBUSY, the common function doesn't need to decide whether
> > > > > the parent ops remove function deserves a dev_err.
> > > > >  
> > > > Ok. I understood.
> > > > But device remove returning silent -EBUSY looks an error that should
> > > > get logged in, because this is something not expected. Its probably
> > > > late for sysfs layer to return report an error by that time it
> > > > prints device name, because put_device() is done. So if remove()
> > > > returns an error, I think its legitimate failure to do WARN_ON or  
> > dev_err().  
> > >
> > > Calling put_device() if the parent remove op fails looks like a bug
> > > introduced by this series, the current code allows that failure
> > > leaving the device in a coherent state and returning errno to the sysfs  
> > store function.  
> > >  
> > Why should it fail?
> > We are taking off the device bus first as describe in commit log.
> > This ensures that everything is closed before calling the remove().
> > We cannot avoid put_device() and put_parent, it all buggy path...  
> 
> I audited remove() callbacks of kvmgt.c, vfio_ccw_ops.c,
> vfio_ap_ops.c, mbochs.c, mdpy.c, mtty.c, who makes the remove
> possible once the device release is executed. This should complete
> once the device is taken off the bus. This was not the case before
> this sequence where remove() is done while device is open...hence the
> check was needed in past. dev_err() is to help catch any errors/bugs
> in this area.
> 
> I doubt we need to retry remove() like vfio_del_group_dev(), in
> mdev_core if release() is not yet complete.

I'm ok with this, I've always thought the 'force' semantics and
allowing remove to fail were not terribly in line with other drivers,
even if ultimately I wish drivers could nak a remove request to avoid
the ugliness of blocking.  But ultimately you'll need to come to an
agreement with Kirti, the drivers we have in-tree are not the complete
set of mdev drivers, but it also doesn't necessarily make sense to cater
to the lone out-of-tree driver either.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-26  7:06   ` Kirti Wankhede
@ 2019-03-26 15:26     ` Alex Williamson
  2019-03-27  3:19       ` Parav Pandit
  2019-03-26 15:30     ` Parav Pandit
  1 sibling, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2019-03-26 15:26 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: Parav Pandit, kvm, linux-kernel, Neo Jia

On Tue, 26 Mar 2019 12:36:22 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > There are five problems with current code structure.
> > 1. mdev device is placed on the mdev bus before it is created in the
> > vendor driver. Once a device is placed on the mdev bus without creating
> > its supporting underlying vendor device, an open() can get triggered by
> > userspace on partially initialized device.
> > Below ladder diagram highlight it.
> > 
> >       cpu-0                                       cpu-1
> >       -----                                       -----
> >    create_store()
> >      mdev_create_device()
> >        device_register()
> >           ...
> >          vfio_mdev_probe()
> >          ...creates char device
> >                                         vfio_mdev_open()
> >                                           parent->ops->open(mdev)
> >                                             vfio_ap_mdev_open()
> >                                               matrix_mdev = NULL
> >         [...]
> >         parent->ops->create()
> >           vfio_ap_mdev_create()
> >             mdev_set_drvdata(mdev, matrix_mdev);
> >             /* Valid pointer set above */
> >   
> 
> VFIO interface uses sysfs path of device or PCI device's BDF where it
> checks sysfs file for that device exist.
> In case of VFIO mdev device, above situation will never happen as open
> will only get called if sysfs entry for that device exist.
> 
> If you don't use VFIO interface then this situation can arise. In that
> case probe() can be used for very basic initialization then create
> actual char device from create().
> 
> 
> > 2. Current creation sequence is,
> >    parent->ops_create()
> >    groups_register()
> > 
> > Remove sequence is,
> >    parent->ops->remove()
> >    groups_unregister()
> > However, remove sequence should be exact mirror of creation sequence.
> > Once this is achieved, all users of the mdev will be terminated first
> > before removing underlying vendor device.
> > (Follow standard linux driver model).
> > At that point vendor's remove() ops shouldn't fail because device is
> > taken off the bus that should terminate the users.
> >   
> 
> If VMM or user space application is using mdev device,
> parent->ops->remove() can return failure. In that case sysfs files
> shouldn't be removed. Hence above sequence is followed for remove.
> 
> Standard linux driver model doesn't allow remove() to fail, but in
> case of mdev framework, interface is defined to handle such error case.
> 
> 
> > 3. Additionally any new mdev driver that wants to work on mdev device
> > during probe() routine registered using mdev_register_driver() needs to
> > get stable mdev structure.
> >   
> 
> Things that you are trying to handle with mdev structure from probe(),
> couldn't that be moved to create()?
> 
> 
> > 4. In following sequence, child devices created while removing mdev parent
> > device can be left out, or it may lead to race of removing half
> > initialized child mdev devices.
> > 
> > issue-1:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> >                                   mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >                                   parent_remove_sysfs_files()
> >                                   /* BUG: device added by cpu-0
> >                                    * whose parent is getting removed.
> >                                    */
> > 
> > issue-2:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> > 
> >        [...]                      mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > 
> >        mdev_create_sysfs_files()
> >        /* BUG: create is adding
> >         * sysfs files for a device
> >         * which is undergoing removal.
> >         */
> >                                  parent_remove_sysfs_files()
> > 
> > 5. Below crash is observed when user initiated remove is in progress
> > and mdev_unregister_driver() completes parent unregistration.
> > 
> >        cpu-0                         cpu-1
> >        -----                         -----
> > remove_store()
> >    mdev_device_remove()
> >    active = false;
> >                                   mdev_unregister_device()
> >                                     remove type
> >    [...]
> >    mdev_remove_ops() crashes.
> > 
> > This is similar race like create() racing with mdev_unregister_device().
> > 
> > mtty mtty: MDEV: Registered
> > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > mdev_device_remove sleep started
> > mtty mtty: MDEV: Unregistering
> > mtty_dev: Unloaded!
> > BUG: unable to handle kernel paging request at ffffffffc027d668
> > PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> > Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> > Call Trace:
> >  mdev_device_remove+0xef/0x130 [mdev]
> >  remove_store+0x77/0xa0 [mdev]
> >  kernfs_fop_write+0x113/0x1a0
> >  __vfs_write+0x33/0x1b0
> >  ? rcu_read_lock_sched_held+0x64/0x70
> >  ? rcu_sync_lockdep_assert+0x2a/0x50
> >  ? __sb_start_write+0x121/0x1b0
> >  ? vfs_write+0x17c/0x1b0
> >  vfs_write+0xad/0x1b0
> >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >  ksys_write+0x55/0xc0
> >  do_syscall_64+0x5a/0x210
> > 
> > Therefore, mdev core is improved in following ways to overcome above
> > issues.
> > 
> > 1. Before placing mdev devices on the bus, perform vendor drivers
> > creation which supports the mdev creation.
> > This ensures that mdev specific all necessary fields are initialized
> > before a given mdev can be accessed by bus driver.
> > 
> > 2. During remove flow, first remove the device from the bus. This
> > ensures that any bus specific devices and data is cleared.
> > Once device is taken off the mdev bus, perform remove() of mdev from the
> > vendor driver.
> >  
> 
> If user space application is using the device and someone underneath
> remove the device from bus, how would use space application know that
> device is being removed?
> If DMA is setup, user space application is accessing that memory and
> device is removed from bus - how will you restrict to not to remove that
> device? If remove() is not restricted then host might crash.
> I know Linux kernel device core model doesn't allow remove() to fail,
> but we had tackled that problem for mdev devices in this framework. I
> prefer not to change this behavior. This will regress existing working
> drivers.


We have exactly this issue with vfio-pci, or really any vfio driver,
where the solution is that a remove request is blocked until the device
becomes unused by the user.  In fact there's a notification that
userspace can connect to so that we don't need to silently wait for
userspace to be done.  We could also potentially kill the userspace
application using the device, or if we ever implemented revoke support
for mmaps, we could unmap the device and the user could handle the
SIGBUS.  With Parav's suggestion to fix the ordering such that the
device is first removed from the bus, where the blocking opportunity
comes into play, it might be time to let go of this one-off
force/not-force behavior.  Thanks,

Alex
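
The blocking behaviour described above can be sketched as a userspace
model. This is a rough illustration, not the vfio implementation: struct
dev_model, remove_wait_unused() and polite_user() are hypothetical names,
the on_request callback stands in for the notification that userspace can
connect to, and max_rounds exists only so that the model terminates where
real code would keep sleeping and retrying.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device that userspace may be holding open. */
struct dev_model {
	int open_count;			/* userspace holders of the device */
	unsigned int requests_sent;	/* release requests issued so far */
	void (*on_request)(struct dev_model *dev, unsigned int count);
};

/*
 * Ask the user to release the device until it becomes unused.
 * Returns true once open_count reaches zero and removal can continue.
 */
static bool remove_wait_unused(struct dev_model *dev, unsigned int max_rounds)
{
	unsigned int i = 0;

	assert(dev->open_count >= 0);
	while (dev->open_count > 0) {
		if (i == max_rounds)
			return false;	/* real code would keep waiting */
		if (dev->on_request)
			dev->on_request(dev, i);
		dev->requests_sent++;
		i++;
		/* kernel code would sleep here until woken or timed out */
	}
	return true;
}

/* Hypothetical well-behaved user: closes the device on the second request. */
static void polite_user(struct dev_model *dev, unsigned int count)
{
	if (count >= 1 && dev->open_count > 0)
		dev->open_count--;
}
```

The point of the sketch is that the remove path never fails back to the
caller on its own; it only makes progress once the user lets go of the
device, which is why the force/not-force distinction becomes unnecessary.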

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-26  7:06   ` Kirti Wankhede
  2019-03-26 15:26     ` Alex Williamson
@ 2019-03-26 15:30     ` Parav Pandit
  2019-03-28 17:20       ` Kirti Wankhede
  1 sibling, 1 reply; 49+ messages in thread
From: Parav Pandit @ 2019-03-26 15:30 UTC (permalink / raw)
  To: Kirti Wankhede, kvm, linux-kernel, alex.williamson; +Cc: Neo Jia



> -----Original Message-----
> From: Kirti Wankhede <kwankhede@nvidia.com>
> Sent: Tuesday, March 26, 2019 2:06 AM
> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; alex.williamson@redhat.com
> Cc: Neo Jia <cjia@nvidia.com>
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> 
> 
> On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > There are five problems with current code structure.
> > 1. mdev device is placed on the mdev bus before it is created in the
> > vendor driver. Once a device is placed on the mdev bus without
> > creating its supporting underlying vendor device, an open() can get
> > triggered by userspace on partially initialized device.
> > Below ladder diagram highlight it.
> >
> >       cpu-0                                       cpu-1
> >       -----                                       -----
> >    create_store()
> >      mdev_create_device()
> >        device_register()
> >           ...
> >          vfio_mdev_probe()
> >          ...creates char device
> >                                         vfio_mdev_open()
> >                                           parent->ops->open(mdev)
> >                                             vfio_ap_mdev_open()
> >                                               matrix_mdev = NULL
> >         [...]
> >         parent->ops->create()
> >           vfio_ap_mdev_create()
> >             mdev_set_drvdata(mdev, matrix_mdev);
> >             /* Valid pointer set above */
> >
> 
> VFIO interface uses sysfs path of device or PCI device's BDF where it checks
> sysfs file for that device exist.
> In case of VFIO mdev device, above situation will never happen as open will
> only get called if sysfs entry for that device exist.
> 
> If you don't use VFIO interface then this situation can arise. In that case
> probe() can be used for very basic initialization then create actual char
> device from create().
> 
As I explained, create() cannot do the heavy lifting of creating the
netdev and rdma devices, because at that stage the driver doesn't know
whether it is being used for a VM or for the host.
create() needs to create the device that probe() can then work on in a
stable manner.
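
The ordering argued for here can be sketched as a userspace model that
only records which step runs when. It is an illustration under stated
assumptions: the enum values, struct trace and model_*() helpers are
hypothetical, and each step merely stands in for the kernel call named in
its comment.

```c
#include <assert.h>

enum step {
	VENDOR_CREATE, BUS_ADD, SYSFS_ADD,
	SYSFS_DEL, BUS_DEL, VENDOR_REMOVE
};

#define MAX_STEPS 16

struct trace {
	enum step steps[MAX_STEPS];	/* every step attempted, in order */
	int n;
	int fail_at;			/* step that should fail, or -1 */
};

/* Record the step and report failure if it is the injected one. */
static int do_step(struct trace *t, enum step s)
{
	assert(t->n < MAX_STEPS);
	t->steps[t->n++] = s;
	return (t->fail_at == (int)s) ? -1 : 0;
}

static int model_create(struct trace *t)
{
	int ret;

	ret = do_step(t, VENDOR_CREATE);	/* parent->ops->create() */
	if (ret)
		return ret;
	ret = do_step(t, BUS_ADD);		/* device_add(): probe() may run */
	if (ret)
		goto undo_vendor;
	ret = do_step(t, SYSFS_ADD);		/* mdev_create_sysfs_files() */
	if (ret)
		goto undo_bus;
	return 0;

undo_bus:
	do_step(t, BUS_DEL);			/* device_del() */
undo_vendor:
	do_step(t, VENDOR_REMOVE);		/* parent->ops->remove() */
	return ret;
}

/* Remove is the exact mirror of a successful create. */
static void model_remove(struct trace *t)
{
	do_step(t, SYSFS_DEL);			/* mdev_remove_sysfs_files() */
	do_step(t, BUS_DEL);			/* device_del(): users terminated */
	do_step(t, VENDOR_REMOVE);		/* parent->ops->remove() */
}
```

Because VENDOR_CREATE always precedes BUS_ADD, a driver's probe() in this
model can never observe a device whose vendor state is missing, and the
error unwinding follows the same reverse order as a normal remove.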

> 
> > 2. Current creation sequence is,
> >    parent->ops_create()
> >    groups_register()
> >
> > Remove sequence is,
> >    parent->ops->remove()
> >    groups_unregister()
> > However, remove sequence should be exact mirror of creation sequence.
> > Once this is achieved, all users of the mdev will be terminated first
> > before removing underlying vendor device.
> > (Follow standard linux driver model).
> > At that point vendor's remove() ops shouldn't fail because device is
> > taken off the bus that should terminate the users.
> >
> 
> If VMM or user space application is using mdev device,
> parent->ops->remove() can return failure. In that case sysfs files
> shouldn't be removed. Hence above sequence is followed for remove.
> 
> Standard linux driver model doesn't allow remove() to fail, but in case of
> mdev framework, interface is defined to handle such error case.
> 
But the sequence is incorrect for the wider use case.
> 
> > 3. Additionally any new mdev driver that wants to work on mdev device
> > during probe() routine registered using mdev_register_driver() needs
> > to get stable mdev structure.
> >
> 
> Things that you are trying to handle with mdev structure from probe(),
> couldn't that be moved to create()?
> 
No, as explained before and above.
That approach just doesn't look right.
 
> 
> > 4. In following sequence, child devices created while removing mdev
> > parent device can be left out, or it may lead to race of removing half
> > initialized child mdev devices.
> >
> > issue-1:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> >                                   mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >                                   parent_remove_sysfs_files()
> >                                   /* BUG: device added by cpu-0
> >                                    * whose parent is getting removed.
> >                                    */
> >
> > issue-2:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >
> >        [...]                      mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> >
> >        mdev_create_sysfs_files()
> >        /* BUG: create is adding
> >         * sysfs files for a device
> >         * which is undergoing removal.
> >         */
> >                                  parent_remove_sysfs_files()
> >
> > 5. Below crash is observed when user initiated remove is in progress
> > and mdev_unregister_driver() completes parent unregistration.
> >
> >        cpu-0                         cpu-1
> >        -----                         -----
> > remove_store()
> >    mdev_device_remove()
> >    active = false;
> >                                   mdev_unregister_device()
> >                                     remove type
> >    [...]
> >    mdev_remove_ops() crashes.
> >
> > This is similar race like create() racing with mdev_unregister_device().
> >
> > mtty mtty: MDEV: Registered
> > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > mdev_device_remove sleep started
> > mtty mtty: MDEV: Unregistering
> > mtty_dev: Unloaded!
> > BUG: unable to handle kernel paging request at ffffffffc027d668
> > PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> > Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> > Call Trace:
> >  mdev_device_remove+0xef/0x130 [mdev]
> >  remove_store+0x77/0xa0 [mdev]
> >  kernfs_fop_write+0x113/0x1a0
> >  __vfs_write+0x33/0x1b0
> >  ? rcu_read_lock_sched_held+0x64/0x70
> >  ? rcu_sync_lockdep_assert+0x2a/0x50
> >  ? __sb_start_write+0x121/0x1b0
> >  ? vfs_write+0x17c/0x1b0
> >  vfs_write+0xad/0x1b0
> >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >  ksys_write+0x55/0xc0
> >  do_syscall_64+0x5a/0x210
> >
> > Therefore, mdev core is improved in following ways to overcome above
> > issues.
> >
> > 1. Before placing mdev devices on the bus, perform vendor drivers
> > creation which supports the mdev creation.
> > This ensures that mdev specific all necessary fields are initialized
> > before a given mdev can be accessed by bus driver.
> >
> > 2. During remove flow, first remove the device from the bus. This
> > ensures that any bus specific devices and data is cleared.
> > Once device is taken of the mdev bus, perform remove() of mdev from
> > the vendor driver.
> >
> 
> If user space application is using the device and someone underneath
> remove the device from bus, how would use space application know that
> device is being removed?
vfio_mdev guards against this and waits for the device to be closed.

One sample trace is below.
[<0>] vfio_del_group_dev+0x34a/0x3c0 [vfio]
[<0>] mdev_remove+0x21/0x40 [mdev]
[<0>] device_release_driver_internal+0xe8/0x1b0
[<0>] bus_remove_device+0xf9/0x170
[<0>] device_del+0x168/0x350
[<0>] mdev_device_remove_common+0x1e/0x60 [mdev]
[<0>] mdev_device_remove_cb+0x1a/0x30 [mdev]
[<0>] device_for_each_child+0x47/0x90
[<0>] mdev_unregister_device+0xdb/0x100 [mdev]
[<0>] mtty_dev_exit+0x17/0x843 [mtty]
[<0>] __x64_sys_delete_module+0x16b/0x240
[<0>] do_syscall_64+0x5a/0x210
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<0>] 0xffffffffffffffff

> If DMA is setup, user space application is accessing that memory and device
> is removed from bus - how will you restrict to not to remove that device? If
> remove() is not restricted then host might crash.
> I know Linux kernel device core model doesn't allow remove() to fail, but we
> had tackled that problem for mdev devices in this framework. I prefer not to
> change this behavior. This will regress existing working drivers.
> 
As the trace above shows, the vfio layer ensures that an open device cannot be removed.

Other drivers will follow a similar method. For example, the mlx5 driver that binds
to an mdev follows the standard driver model to terminate use of that mdev device,
in the same way as for a PCI device.
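
A minimal sketch of the ordering described above, written as a single-threaded userspace model rather than kernel code. All names (model_dev, model_request_remove() and so on) are hypothetical and are not part of vfio or mdev. It assumes that a remove request on an open device is deferred until the last close, where real code would sleep instead, and that no new opens are accepted once removal has started.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device that userspace may hold open. */
struct model_dev {
	int  open_count;      /* number of userspace opens */
	bool on_bus;          /* still visible on the bus */
	bool remove_pending;  /* a remove request is waiting for close */
	bool vendor_removed;  /* vendor remove() has run */
};

static void model_finish_remove(struct model_dev *d)
{
	d->on_bus = false;         /* step 1: take the device off the bus */
	d->vendor_removed = true;  /* step 2: vendor remove() runs last */
	d->remove_pending = false;
}

/* Returns true if removal completed, false if it must wait for close. */
static bool model_request_remove(struct model_dev *d)
{
	if (d->open_count > 0) {
		d->remove_pending = true;  /* real code would sleep here */
		return false;
	}
	model_finish_remove(d);
	return true;
}

static bool model_open(struct model_dev *d)
{
	/* Assumption of this model: no new users once removal started. */
	if (!d->on_bus || d->remove_pending)
		return false;
	d->open_count++;
	return true;
}

static void model_close(struct model_dev *d)
{
	assert(d->open_count > 0);
	d->open_count--;
	if (d->open_count == 0 && d->remove_pending)
		model_finish_remove(d);  /* the waiting remove proceeds */
}
```

In this model the vendor remove step can never run while the device is open, which is the property the trace is meant to illustrate.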

> 
> > 3. Linux core device model provides way to register and auto
> > unregister the device sysfs attribute groups at dev->groups.
> > Make use of this groups to let core create the groups and simplify
> > code to avoid explicit groups creation and removal.
> >
> > 4. Wait for any ongoing mdev create() and remove() to finish before
> > unregistering parent device using srcu. This continues to allow
> > multiple create and remove to progress in parallel. At the same time
> > guard parent removal while parent is being access by create() and remove
> > callbacks.
> >
> 
> Agreed with this.
> Alex already mentioned, it would be better to have separate patch for this
> fix.
> 
Patches are ready; I am waiting for the above discussion to close before posting v1.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-26 15:26     ` Alex Williamson
@ 2019-03-27  3:19       ` Parav Pandit
  0 siblings, 0 replies; 49+ messages in thread
From: Parav Pandit @ 2019-03-27  3:19 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede; +Cc: kvm, linux-kernel, Neo Jia

Hi Alex,

> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, March 26, 2019 10:27 AM
> To: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; Neo Jia <cjia@nvidia.com>
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Tue, 26 Mar 2019 12:36:22 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > > There are five problems with current code structure.
> > > 1. mdev device is placed on the mdev bus before it is created in the
> > > vendor driver. Once a device is placed on the mdev bus without
> > > creating its supporting underlying vendor device, an open() can get
> > > triggered by userspace on partially initialized device.
> > > Below ladder diagram highlight it.
> > >
> > >       cpu-0                                       cpu-1
> > >       -----                                       -----
> > >    create_store()
> > >      mdev_create_device()
> > >        device_register()
> > >           ...
> > >          vfio_mdev_probe()
> > >          ...creates char device
> > >                                         vfio_mdev_open()
> > >                                           parent->ops->open(mdev)
> > >                                             vfio_ap_mdev_open()
> > >                                               matrix_mdev = NULL
> > >         [...]
> > >         parent->ops->create()
> > >           vfio_ap_mdev_create()
> > >             mdev_set_drvdata(mdev, matrix_mdev);
> > >             /* Valid pointer set above */
> > >
> >
> > VFIO interface uses sysfs path of device or PCI device's BDF where it
> > checks sysfs file for that device exist.
> > In case of VFIO mdev device, above situation will never happen as open
> > will only get called if sysfs entry for that device exist.
> >
> > If you don't use VFIO interface then this situation can arise. In that
> > case probe() can be used for very basic initialization then create
> > actual char device from create().
> >
> >
> > > 2. Current creation sequence is,
> > >    parent->ops_create()
> > >    groups_register()
> > >
> > > Remove sequence is,
> > >    parent->ops->remove()
> > >    groups_unregister()
> > > However, remove sequence should be exact mirror of creation sequence.
> > > Once this is achieved, all users of the mdev will be terminated
> > > first before removing underlying vendor device.
> > > (Follow standard linux driver model).
> > > At that point vendor's remove() ops shouldn't failed because device
> > > is taken off the bus that should terminate the users.
> > >
> >
> > If VMM or user space application is using mdev device,
> > parent->ops->remove() can return failure. In that case sysfs files
> > shouldn't be removed. Hence above sequence is followed for remove.
> >
> > Standard linux driver model doesn't allow remove() to fail, but in of
> > mdev framework, interface is defined to handle such error case.
> >
> >
> > > 3. Additionally any new mdev driver that wants to work on mdev
> > > device during probe() routine registered using
> > > mdev_register_driver() needs to get stable mdev structure.
> > >
> >
> > Things that you are trying to handle with mdev structure from probe(),
> > couldn't that be moved to create()?
> >
> >
> > > 4. In following sequence, child devices created while removing mdev
> > > parent device can be left out, or it may lead to race of removing
> > > half initialized child mdev devices.
> > >
> > > issue-1:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > >                                   mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >                                   parent_remove_sysfs_files()
> > >                                   /* BUG: device added by cpu-0
> > >                                    * whose parent is getting removed.
> > >                                    */
> > >
> > > issue-2:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >
> > >        [...]                      mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > >
> > >        mdev_create_sysfs_files()
> > >        /* BUG: create is adding
> > >         * sysfs files for a device
> > >         * which is undergoing removal.
> > >         */
> > >                                  parent_remove_sysfs_files()
> > >
> > > 5. Below crash is observed when user initiated remove is in progress
> > > and mdev_unregister_driver() completes parent unregistration.
> > >
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > remove_store()
> > >    mdev_device_remove()
> > >    active = false;
> > >                                   mdev_unregister_device()
> > >                                     remove type
> > >    [...]
> > >    mdev_remove_ops() crashes.
> > >
> > > This is similar race like create() racing with mdev_unregister_device().
> > >
> > > mtty mtty: MDEV: Registered
> > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > > > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > > > mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> > > mtty_dev: Unloaded!
> > > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > Oops: 0000 [#1] SMP PTI
> > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > >  mdev_device_remove+0xef/0x130 [mdev]
> > >  remove_store+0x77/0xa0 [mdev]
> > >  kernfs_fop_write+0x113/0x1a0
> > >  __vfs_write+0x33/0x1b0
> > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > >  ? rcu_sync_lockdep_assert+0x2a/0x50
> > > >  ? __sb_start_write+0x121/0x1b0
> > > >  ? vfs_write+0x17c/0x1b0
> > >  vfs_write+0xad/0x1b0
> > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > >  ksys_write+0x55/0xc0
> > >  do_syscall_64+0x5a/0x210
> > >
> > > Therefore, mdev core is improved in following ways to overcome above
> > > issues.
> > >
> > > 1. Before placing mdev devices on the bus, perform vendor drivers
> > > creation which supports the mdev creation.
> > > This ensures that mdev specific all necessary fields are initialized
> > > before a given mdev can be accessed by bus driver.
> > >
> > > 2. During remove flow, first remove the device from the bus. This
> > > ensures that any bus specific devices and data is cleared.
> > > Once device is taken of the mdev bus, perform remove() of mdev from
> > > the vendor driver.
> > >
> >
> > If user space application is using the device and someone underneath
> > remove the device from bus, how would use space application know that
> > device is being removed?
> > If DMA is setup, user space application is accessing that memory and
> > device is removed from bus - how will you restrict to not to remove
> > that device? If remove() is not restricted then host might crash.
> > I know Linux kernel device core model doesn't allow remove() to fail,
> > but we had tackled that problem for mdev devices in this framework. I
> > prefer not to change this behavior. This will regress existing working
> > drivers.
> 
> 
> We have exactly this issue with vfio-pci, or really any vfio driver, where the
> solution is that a remove request is blocked until the device becomes
> unused by the user.  In fact there's a notification that userspace can connect
> to so that we don't need to silently wait for userspace to be done.  We could
> also potentially kill the userspace application using the device, or if we ever
> implemented revoke support for mmaps, we could unmap the device and
> the use could handle the SIGBUS.  With Parav's suggestion to fix the ordering
> such that the device is first removed from the bus, where the blocking
> opportunity comes into play, it might be time to let go of this one-off
> force/not-force behavior.  Thanks,
>
 
Yes, I think we should do it.
For now (for the next few days), I am dropping this particular order-fixing patch from the series.
From my 8th patch, I am keeping only the fix for the create/remove race with parent removal, along with the other fixes and cleanups.
I will post v1 shortly to make progress on the already reviewed parts and on that part of the 8th patch.

I cannot split the remove_common() helper function into a separate patch, because without srcu, remove_cb() would bypass the mdev->active check.
So as an individual patch, it would not behave correctly.
Hence, that small refactor is part of the srcu fix.
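
A rough userspace sketch of the guard idea behind the srcu fix: create() and remove() enter a read-side section on the parent, and parent unregistration first marks the parent inactive and then waits for in-flight callers to drain. This is not SRCU itself; it is a plain atomic counter standing in for srcu_read_lock(), srcu_read_unlock() and synchronize_srcu(), and every name in it (model_parent, parent_enter() and so on) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an mdev parent guarded against unregistration. */
struct model_parent {
	atomic_bool active;     /* cleared when unregistration starts */
	atomic_int  in_flight;  /* create()/remove() calls in progress */
};

static void parent_init(struct model_parent *p)
{
	atomic_init(&p->active, true);
	atomic_init(&p->in_flight, 0);
}

/* Read side: called at the start of create() or remove(). */
static bool parent_enter(struct model_parent *p)
{
	/* Count first, then check, so unregistration cannot miss us. */
	atomic_fetch_add(&p->in_flight, 1);
	if (!atomic_load(&p->active)) {
		atomic_fetch_sub(&p->in_flight, 1);
		return false;  /* parent is going away, refuse */
	}
	return true;
}

/* Read side: called at the end of create() or remove(). */
static void parent_exit(struct model_parent *p)
{
	atomic_fetch_sub(&p->in_flight, 1);
}

/* Write side, step 1: stop new callers from entering. */
static void parent_deactivate(struct model_parent *p)
{
	atomic_store(&p->active, false);
}

/* Write side, step 2: a real waiter would sleep until this is true. */
static bool parent_drained(struct model_parent *p)
{
	return atomic_load(&p->in_flight) == 0;
}
```

Several callers can be inside the read-side section at once, so create and remove still run in parallel with each other; only unregistration has to wait.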

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-26 15:30     ` Parav Pandit
@ 2019-03-28 17:20       ` Kirti Wankhede
  2019-03-29 14:49         ` Alex Williamson
  0 siblings, 1 reply; 49+ messages in thread
From: Kirti Wankhede @ 2019-03-28 17:20 UTC (permalink / raw)
  To: Parav Pandit, kvm, linux-kernel, alex.williamson; +Cc: Neo Jia



On 3/26/2019 9:00 PM, Parav Pandit wrote:
> 
> 
>> -----Original Message-----
>> From: Kirti Wankhede <kwankhede@nvidia.com>
>> Sent: Tuesday, March 26, 2019 2:06 AM
>> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
>> kernel@vger.kernel.org; alex.williamson@redhat.com
>> Cc: Neo Jia <cjia@nvidia.com>
>> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
>>
>>
>>
>> On 3/23/2019 4:50 AM, Parav Pandit wrote:
>>> There are five problems with current code structure.
>>> 1. mdev device is placed on the mdev bus before it is created in the
>>> vendor driver. Once a device is placed on the mdev bus without
>>> creating its supporting underlying vendor device, an open() can get
>>> triggered by userspace on partially initialized device.
>>> Below ladder diagram highlight it.
>>>
>>>       cpu-0                                       cpu-1
>>>       -----                                       -----
>>>    create_store()
>>>      mdev_create_device()
>>>        device_register()
>>>           ...
>>>          vfio_mdev_probe()
>>>          ...creates char device
>>>                                         vfio_mdev_open()
>>>                                           parent->ops->open(mdev)
>>>                                             vfio_ap_mdev_open()
>>>                                               matrix_mdev = NULL
>>>         [...]
>>>         parent->ops->create()
>>>           vfio_ap_mdev_create()
>>>             mdev_set_drvdata(mdev, matrix_mdev);
>>>             /* Valid pointer set above */
>>>
>>
>> VFIO interface uses sysfs path of device or PCI device's BDF where it checks
>> sysfs file for that device exist.
>> In case of VFIO mdev device, above situation will never happen as open will
>> only get called if sysfs entry for that device exist.
>>
>> If you don't use VFIO interface then this situation can arise. In that case
>> probe() can be used for very basic initialization then create actual char
>> device from create().
>>
> I explained you that create() cannot do the heavy lifting work of creating netdev and rdma dev because at that stage driver doesn't know whether its getting used for VM or host.
> create() needs to create the device that probe() can work on in stable manner.
> 

You can identify whether it's being used by a VM or the host from create(). Since
probe() happens first, from create() you can check
mdev_dev(mdev)->driver->name; if it's 'vfio_mdev' then it's being used
by a VM, otherwise by the host.

>>
>>> 2. Current creation sequence is,
>>>    parent->ops_create()
>>>    groups_register()
>>>
>>> Remove sequence is,
>>>    parent->ops->remove()
>>>    groups_unregister()
>>> However, remove sequence should be exact mirror of creation sequence.
>>> Once this is achieved, all users of the mdev will be terminated first
>>> before removing underlying vendor device.
>>> (Follow standard linux driver model).
>>> At that point vendor's remove() ops shouldn't failed because device is
>>> taken off the bus that should terminate the users.
>>>
>>
>> If VMM or user space application is using mdev device,
>> parent->ops->remove() can return failure. In that case sysfs files
>> shouldn't be removed. Hence above sequence is followed for remove.
>>
>> Standard linux driver model doesn't allow remove() to fail, but in of mdev
>> framework, interface is defined to handle such error case.
>>
> But the sequence is incorrect for wider use case.
>>
>>> 3. Additionally any new mdev driver that wants to work on mdev device
>>> during probe() routine registered using mdev_register_driver() needs
>>> to get stable mdev structure.
>>>
>>
>> Things that you are trying to handle with mdev structure from probe(),
>> couldn't that be moved to create()?
>>
> No, as explained before and above.
> That approach just doesn't look right.
>

As I mentioned above, you can do that.


>>
>>> 4. In following sequence, child devices created while removing mdev
>>> parent device can be left out, or it may lead to race of removing half
>>> initialized child mdev devices.
>>>
>>> issue-1:
>>> --------
>>>        cpu-0                         cpu-1
>>>        -----                         -----
>>>                                   mdev_unregister_device()
>>>                                      device_for_each_child()
>>>                                         mdev_device_remove_cb()
>>>                                             mdev_device_remove()
>>> create_store()
>>>   mdev_device_create()                   [...]
>>>        device_register()
>>>                                   parent_remove_sysfs_files()
>>>                                   /* BUG: device added by cpu-0
>>>                                    * whose parent is getting removed.
>>>                                    */
>>>
>>> issue-2:
>>> --------
>>>        cpu-0                         cpu-1
>>>        -----                         -----
>>> create_store()
>>>   mdev_device_create()                   [...]
>>>        device_register()
>>>
>>>        [...]                      mdev_unregister_device()
>>>                                      device_for_each_child()
>>>                                         mdev_device_remove_cb()
>>>                                             mdev_device_remove()
>>>
>>>        mdev_create_sysfs_files()
>>>        /* BUG: create is adding
>>>         * sysfs files for a device
>>>         * which is undergoing removal.
>>>         */
>>>                                  parent_remove_sysfs_files()
>>>
>>> 5. Below crash is observed when user initiated remove is in progress
>>> and mdev_unregister_driver() completes parent unregistration.
>>>
>>>        cpu-0                         cpu-1
>>>        -----                         -----
>>> remove_store()
>>>    mdev_device_remove()
>>>    active = false;
>>>                                   mdev_unregister_device()
>>>                                     remove type
>>>    [...]
>>>    mdev_remove_ops() crashes.
>>>
>>> This is similar race like create() racing with mdev_unregister_device().
>>>
>>> mtty mtty: MDEV: Registered
>>> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
>>> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
>>> mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
>>> mtty_dev: Unloaded!
>>> BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
>>> af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
>>> Oops: 0000 [#1] SMP PTI
>>> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
>>> 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
>>> SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
>>> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
>>>  mdev_device_remove+0xef/0x130 [mdev]
>>>  remove_store+0x77/0xa0 [mdev]
>>>  kernfs_fop_write+0x113/0x1a0
>>>  __vfs_write+0x33/0x1b0
>>>  ? rcu_read_lock_sched_held+0x64/0x70
>>>  ? rcu_sync_lockdep_assert+0x2a/0x50
>>>  ? __sb_start_write+0x121/0x1b0
>>>  ? vfs_write+0x17c/0x1b0
>>>  vfs_write+0xad/0x1b0
>>>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>>>  ksys_write+0x55/0xc0
>>>  do_syscall_64+0x5a/0x210
>>>
>>> Therefore, mdev core is improved in following ways to overcome above
>>> issues.
>>>
>>> 1. Before placing mdev devices on the bus, perform vendor drivers
>>> creation which supports the mdev creation.
>>> This ensures that mdev specific all necessary fields are initialized
>>> before a given mdev can be accessed by bus driver.
>>>
>>> 2. During remove flow, first remove the device from the bus. This
>>> ensures that any bus specific devices and data is cleared.
>>> Once device is taken of the mdev bus, perform remove() of mdev from
>>> the vendor driver.
>>>
>>
>> If user space application is using the device and someone underneath
>> remove the device from bus, how would use space application know that
>> device is being removed?
> vfio_mdev guards and wait for device to get closed.
> 
> One sample trace is below.
> [<0>] vfio_del_group_dev+0x34a/0x3c0 [vfio]
> [<0>] mdev_remove+0x21/0x40 [mdev]
> [<0>] device_release_driver_internal+0xe8/0x1b0
> [<0>] bus_remove_device+0xf9/0x170
> [<0>] device_del+0x168/0x350
> [<0>] mdev_device_remove_common+0x1e/0x60 [mdev]
> [<0>] mdev_device_remove_cb+0x1a/0x30 [mdev]
> [<0>] device_for_each_child+0x47/0x90
> [<0>] mdev_unregister_device+0xdb/0x100 [mdev]
> [<0>] mtty_dev_exit+0x17/0x843 [mtty]
> [<0>] __x64_sys_delete_module+0x16b/0x240
> [<0>] do_syscall_64+0x5a/0x210
> [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [<0>] 0xffffffffffffffff
> 
>> If DMA is setup, user space application is accessing that memory and device
>> is removed from bus - how will you restrict to not to remove that device? If
>> remove() is not restricted then host might crash.
>> I know Linux kernel device core model doesn't allow remove() to fail, but we
>> had tackled that problem for mdev devices in this framework. I prefer not to
>> change this behavior. This will regress existing working drivers.
>>
> vfio layer ensures that open device cannot be removed from above trace.
> 
> Other drivers will follow similar method. In case of mlx5 driver which binds
> to mdev follows standard driver model to terminate for this mdev device,
> similar way for pci device.
> 

But then remove() or a write to the 'remove' sysfs file would block, possibly
indefinitely. For example, in the case of a VM, it will block until the VM is
shut down.
With the current approach, a write to the 'remove' sysfs file doesn't block.
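
For contrast, a small sketch of the non-blocking behaviour described here, again as a userspace model with hypothetical names. It assumes the vendor remove callback refuses with -EBUSY while the device is in use unless removal is forced, and that the sysfs files are left in place when it refuses; the specific error code is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one mdev and the state that removal tears down. */
struct model_mdev {
	bool in_use;        /* a VMM or application holds the device */
	bool vendor_state;  /* vendor driver state still exists */
	bool sysfs_files;   /* sysfs files still exist */
};

/* Vendor callback: may fail when the device is busy and not forced. */
static int vendor_remove(struct model_mdev *m, bool force)
{
	if (m->in_use && !force)
		return -EBUSY;  /* assumed error code */
	m->vendor_state = false;
	return 0;
}

/* Core path: vendor remove first; sysfs is removed only on success. */
static int model_remove_store(struct model_mdev *m, bool force)
{
	int ret = vendor_remove(m, force);

	if (ret)
		return ret;  /* returns at once, sysfs files untouched */
	m->sysfs_files = false;
	return 0;
}
```

The write returns immediately in both cases; the cost is that removal can fail, which the standard driver model does not allow for remove().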

Thanks,
Kirti

>>
>>> 3. Linux core device model provides way to register and auto
>>> unregister the device sysfs attribute groups at dev->groups.
>>> Make use of this groups to let core create the groups and simplify
>>> code to avoid explicit groups creation and removal.
>>>
>>> 4. Wait for any ongoing mdev create() and remove() to finish before
>>> unregistering parent device using srcu. This continues to allow
>>> multiple create and remove to progress in parallel. At the same time
>>> guard parent removal while parent is being access by create() and remove
>>> callbacks.
>>>
>>
>> Agreed with this.
>> Alex already mentioned, it would be better to have separate patch for this
>> fix.
>>
> Patches are ready, I am waiting for above discussion to close before posting v1.
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
  2019-03-28 17:20       ` Kirti Wankhede
@ 2019-03-29 14:49         ` Alex Williamson
  0 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2019-03-29 14:49 UTC (permalink / raw)
  To: Kirti Wankhede; +Cc: Parav Pandit, kvm, linux-kernel, Neo Jia

On Thu, 28 Mar 2019 22:50:38 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/26/2019 9:00 PM, Parav Pandit wrote:
> > 
> >   
> >> -----Original Message-----
> >> From: Kirti Wankhede <kwankhede@nvidia.com>
> >> Sent: Tuesday, March 26, 2019 2:06 AM
> >> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> >> kernel@vger.kernel.org; alex.williamson@redhat.com
> >> Cc: Neo Jia <cjia@nvidia.com>
> >> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> >>
> >>
> >>
> >> On 3/23/2019 4:50 AM, Parav Pandit wrote:  
> >>> There are five problems with current code structure.
> >>> 1. mdev device is placed on the mdev bus before it is created in the
> >>> vendor driver. Once a device is placed on the mdev bus without
> >>> creating its supporting underlying vendor device, an open() can get
> >>> triggered by userspace on partially initialized device.
> >>> Below ladder diagram highlight it.
> >>>
> >>>       cpu-0                                       cpu-1
> >>>       -----                                       -----
> >>>    create_store()
> >>>      mdev_create_device()
> >>>        device_register()
> >>>           ...
> >>>          vfio_mdev_probe()
> >>>          ...creates char device
> >>>                                         vfio_mdev_open()
> >>>                                           parent->ops->open(mdev)
> >>>                                             vfio_ap_mdev_open()
> >>>                                               matrix_mdev = NULL
> >>>         [...]
> >>>         parent->ops->create()
> >>>           vfio_ap_mdev_create()
> >>>             mdev_set_drvdata(mdev, matrix_mdev);
> >>>             /* Valid pointer set above */
> >>>  
> >>
> >> VFIO interface uses sysfs path of device or PCI device's BDF where it checks
> >> sysfs file for that device exist.
> >> In case of VFIO mdev device, above situation will never happen as open will
> >> only get called if sysfs entry for that device exist.
> >>
> >> If you don't use VFIO interface then this situation can arise. In that case
> >> probe() can be used for very basic initialization then create actual char
> >> device from create().
> >>  
> > I explained you that create() cannot do the heavy lifting work of creating netdev and rdma dev because at that stage driver doesn't know whether its getting used for VM or host.
> > create() needs to create the device that probe() can work on in stable manner.
> >   
> 
> You can identify if its getting used by VM or host from create(). Since
> probe() happens first, from create() you can check
> mdev_dev(mdev)->driver->name, if its 'vfio_mdev' then its getting used
> by VM, otherwise used by host.

If this is suggesting that we should have different create paths based
on driver name, please no.  Mdev devices should not be special, they're
attached to a bus which can host multiple drivers and devices on that
bus should have the ability to switch between drivers.  Not to mention
that a strcmp of a driver name to infer the purpose of a device is just
ugly as can be.
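
To illustrate the point, here is a hedged userspace sketch of a bus model where device creation is driver-agnostic and each driver does its purpose-specific setup in its own probe(). The types and names (model_device, model_driver, model_bind(), "model_vm", "model_host") are invented for this example and do not correspond to real mdev or driver-core APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_device;

/* Hypothetical bus driver: setup and teardown live in probe/remove. */
struct model_driver {
	const char *name;
	void (*probe)(struct model_device *dev);
	void (*remove)(struct model_device *dev);
};

struct model_device {
	int ready;                          /* set once create() finished */
	const struct model_driver *driver;  /* currently bound driver */
	const char *role;                   /* what the bound driver set up */
};

/* create(): fully initialise the device, with no knowledge of drivers. */
static void model_create(struct model_device *dev)
{
	dev->ready = 1;
	dev->driver = NULL;
	dev->role = "unbound";
}

static void vm_probe(struct model_device *dev)      { dev->role = "vm"; }
static void host_probe(struct model_device *dev)    { dev->role = "host"; }
static void common_remove(struct model_device *dev) { dev->role = "unbound"; }

static const struct model_driver vm_driver =
	{ "model_vm", vm_probe, common_remove };
static const struct model_driver host_driver =
	{ "model_host", host_probe, common_remove };

/* Bind or rebind: the old driver lets go before the new one probes. */
static int model_bind(struct model_device *dev,
		      const struct model_driver *drv)
{
	if (!dev->ready)
		return -1;  /* never probe a half-created device */
	if (dev->driver)
		dev->driver->remove(dev);
	dev->driver = drv;
	drv->probe(dev);
	return 0;
}
```

Because model_create() never looks at a driver name, the same device can move from one driver to the other without being recreated.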

> >>> 2. Current creation sequence is,
> >>>    parent->ops_create()
> >>>    groups_register()
> >>>
> >>> Remove sequence is,
> >>>    parent->ops->remove()
> >>>    groups_unregister()
> >>> However, remove sequence should be exact mirror of creation sequence.
> >>> Once this is achieved, all users of the mdev will be terminated first
> >>> before removing underlying vendor device.
> >>> (Follow standard linux driver model).
> >>> At that point vendor's remove() ops shouldn't failed because device is
> >>> taken off the bus that should terminate the users.
> >>>  
> >>
> >> If VMM or user space application is using mdev device,
> >> parent->ops->remove() can return failure. In that case sysfs files
> >> shouldn't be removed. Hence above sequence is followed for remove.
> >>
> >> Standard linux driver model doesn't allow remove() to fail, but in of mdev
> >> framework, interface is defined to handle such error case.
> >>  
> > But the sequence is incorrect for wider use case.  
> >>  
> >>> 3. Additionally any new mdev driver that wants to work on mdev device
> >>> during probe() routine registered using mdev_register_driver() needs
> >>> to get stable mdev structure.
> >>>  
> >>
> >> Things that you are trying to handle with mdev structure from probe(),
> >> couldn't that be moved to create()?
> >>  
> > No, as explained before and above.
> > That approach just doesn't look right.
> >  
> 
> > As I mentioned above, you can do that.

But it would be wrong to do so.

> >>  
> >>> 4. In following sequence, child devices created while removing mdev
> >>> parent device can be left out, or it may lead to race of removing half
> >>> initialized child mdev devices.
> >>>
> >>> issue-1:
> >>> --------
> >>>        cpu-0                         cpu-1
> >>>        -----                         -----
> >>>                                   mdev_unregister_device()
> >>>                                      device_for_each_child()
> >>>                                         mdev_device_remove_cb()
> >>>                                             mdev_device_remove()
> >>> create_store()
> >>>   mdev_device_create()                   [...]
> >>>        device_register()
> >>>                                   parent_remove_sysfs_files()
> >>>                                   /* BUG: device added by cpu-0
> >>>                                    * whose parent is getting removed.
> >>>                                    */
> >>>
> >>> issue-2:
> >>> --------
> >>>        cpu-0                         cpu-1
> >>>        -----                         -----
> >>> create_store()
> >>>   mdev_device_create()                   [...]
> >>>        device_register()
> >>>
> >>>        [...]                      mdev_unregister_device()
> >>>                                      device_for_each_child()
> >>>                                         mdev_device_remove_cb()
> >>>                                             mdev_device_remove()
> >>>
> >>>        mdev_create_sysfs_files()
> >>>        /* BUG: create is adding
> >>>         * sysfs files for a device
> >>>         * which is undergoing removal.
> >>>         */
> >>>                                  parent_remove_sysfs_files()
> >>>
> >>> 5. Below crash is observed when user initiated remove is in progress
> >>> and mdev_unregister_driver() completes parent unregistration.
> >>>
> >>>        cpu-0                         cpu-1
> >>>        -----                         -----
> >>> remove_store()
> >>>    mdev_device_remove()
> >>>    active = false;
> >>>                                   mdev_unregister_device()
> >>>                                     remove type
> >>>    [...]
> >>>    mdev_remove_ops() crashes.
> >>>
> >>> This is similar race like create() racing with mdev_unregister_device().
> >>>
> >>> mtty mtty: MDEV: Registered
> >>> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> >>> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> >>> mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> >>> mtty_dev: Unloaded!
> >>> BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> >>> af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> >>> Oops: 0000 [#1] SMP PTI
> >>> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> >>> 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> >>> SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> >>> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> >>>  mdev_device_remove+0xef/0x130 [mdev]
> >>>  remove_store+0x77/0xa0 [mdev]
> >>>  kernfs_fop_write+0x113/0x1a0
> >>>  __vfs_write+0x33/0x1b0
> >>>  ? rcu_read_lock_sched_held+0x64/0x70
> >>>  ? rcu_sync_lockdep_assert+0x2a/0x50
> >>>  ? __sb_start_write+0x121/0x1b0
> >>>  ? vfs_write+0x17c/0x1b0
> >>>  vfs_write+0xad/0x1b0
> >>>  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >>>  ksys_write+0x55/0xc0
> >>>  do_syscall_64+0x5a/0x210
> >>>
> >>> Therefore, the mdev core is improved in the following ways to overcome
> >>> the above issues.
> >>>
> >>> 1. Before placing an mdev device on the bus, perform the vendor
> >>> driver's create(), which backs the mdev creation.
> >>> This ensures that all necessary mdev-specific fields are initialized
> >>> before a given mdev can be accessed by a bus driver.
> >>>
> >>> 2. During the remove flow, first remove the device from the bus. This
> >>> ensures that any bus-specific devices and data are cleared.
> >>> Once the device is taken off the mdev bus, perform remove() of the
> >>> mdev in the vendor driver.
> >>>  
> >>
> >> If a user space application is using the device and someone underneath
> >> removes the device from the bus, how would the user space application
> >> know that the device is being removed?
> > vfio_mdev guards against this and waits for the device to be closed.
> > 
> > One sample trace is below.
> > [<0>] vfio_del_group_dev+0x34a/0x3c0 [vfio]
> > [<0>] mdev_remove+0x21/0x40 [mdev]
> > [<0>] device_release_driver_internal+0xe8/0x1b0
> > [<0>] bus_remove_device+0xf9/0x170
> > [<0>] device_del+0x168/0x350
> > [<0>] mdev_device_remove_common+0x1e/0x60 [mdev]
> > [<0>] mdev_device_remove_cb+0x1a/0x30 [mdev]
> > [<0>] device_for_each_child+0x47/0x90
> > [<0>] mdev_unregister_device+0xdb/0x100 [mdev]
> > [<0>] mtty_dev_exit+0x17/0x843 [mtty]
> > [<0>] __x64_sys_delete_module+0x16b/0x240
> > [<0>] do_syscall_64+0x5a/0x210
> > [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > [<0>] 0xffffffffffffffff
> >   
> >> If DMA is set up, the user space application is accessing that memory,
> >> and the device is removed from the bus, how will you prevent that device
> >> from being removed? If remove() is not restricted, the host might crash.
> >> I know the Linux kernel device core model doesn't allow remove() to fail,
> >> but we had tackled that problem for mdev devices in this framework. I
> >> prefer not to change this behavior. This will regress existing working
> >> drivers.
> >>  
> > As the trace above shows, the vfio layer ensures that an open device
> > cannot be removed.
> > 
> > Other drivers will follow a similar method. The mlx5 driver, which binds
> > to mdev, follows the standard driver model to terminate this mdev device,
> > the same way it does for a PCI device.
> >   
> 
> But then remove(), or a write to the 'remove' sysfs file, would block, and
> that could be indefinite. For example, in the case of a VM, it will block
> until the VM is shut down.
> With the current approach, a write to the 'remove' sysfs file doesn't block.

OTOH, why should mdev be different than any other driver?  Blocking is
the current solution for all directly assigned vfio devices.  This is a
compromise between the device model not allowing an error return and
lack of support to be able to revoke mmaps to the device.  We already
have an interface in vfio to request a device from a cooperative user
(Maxim proposed adding this to the mdev interface), lacking a revoke
interface, that can be further escalated to killing the process.  What
we've heard previously when pursuing an error path from removing a
device is that all responsibility lies with the admin in using these
interfaces.  If a remove is requested, it should be honored.  If that
results in killing a task, the fault is on the admin.  Mdev is not its
own island to decide a different model. Thanks,

Alex
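
For readers following the ordering argument quoted above, here is a minimal
user-space sketch of the create/remove sequence the patch description
proposes. It is only a model, not the mdev core code: struct model_mdev,
model_create(), model_remove() and model_bus_callback() are hypothetical
names invented for illustration, and the two booleans stand in for "vendor
driver state is initialized" and "device is visible on the mdev bus".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_mdev {
	bool vendor_ready;	/* vendor create() has run, remove() has not */
	bool on_bus;		/* device is visible to bus drivers */
	int bad_callbacks;	/* bus callbacks that ran without vendor state */
};

/* Stand-in for a bus driver's probe()/remove() touching the device. */
static void model_bus_callback(struct model_mdev *m)
{
	if (!m->vendor_ready)
		m->bad_callbacks++;
}

/* Create: vendor setup first, then publish the device on the bus. */
static int model_create(struct model_mdev *m, bool vendor_fails)
{
	memset(m, 0, sizeof(*m));
	if (vendor_fails)
		return -1;	/* never published, so no bus unwind needed */
	m->vendor_ready = true;
	m->on_bus = true;
	model_bus_callback(m);	/* bus driver binds to a fully set up device */
	return 0;
}

/* Remove: exact reverse, leave the bus first, then vendor teardown. */
static void model_remove(struct model_mdev *m)
{
	assert(m->on_bus);	/* only a published device can be removed */
	model_bus_callback(m);	/* bus driver unbinds while vendor state is valid */
	m->on_bus = false;
	m->vendor_ready = false;
}
```

In this model a bus callback can never observe a device whose vendor state
is missing, which is the property the two numbered points in the patch
description aim for. The blocking behaviour discussed in this reply (remove
waiting until the user closes the device) is deliberately not modelled here.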


end of thread, other threads:[~2019-03-29 14:49 UTC | newest]

Thread overview: 49+ messages
2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
2019-03-22 23:20 ` [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure Parav Pandit
2019-03-25 11:48   ` Maxim Levitsky
2019-03-25 18:17   ` Kirti Wankhede
2019-03-25 19:21     ` Alex Williamson
2019-03-25 21:11       ` Parav Pandit
2019-03-22 23:20 ` [PATCH 2/8] vfio/mdev: Avoid release parent reference during error path Parav Pandit
2019-03-25 11:49   ` Maxim Levitsky
2019-03-25 18:27   ` Kirti Wankhede
2019-03-22 23:20 ` [PATCH 3/8] vfio/mdev: Removed unused kref Parav Pandit
2019-03-25 11:50   ` Maxim Levitsky
2019-03-25 18:41   ` Kirti Wankhede
2019-03-22 23:20 ` [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
2019-03-25 11:56   ` Maxim Levitsky
2019-03-25 19:07   ` Kirti Wankhede
2019-03-25 19:49     ` Alex Williamson
2019-03-25 21:27       ` Parav Pandit
2019-03-22 23:20 ` [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
2019-03-25 11:57   ` Maxim Levitsky
2019-03-25 19:18   ` Kirti Wankhede
2019-03-25 21:29     ` Parav Pandit
2019-03-22 23:20 ` [PATCH 6/8] vfio/mdev: Follow correct remove sequence Parav Pandit
2019-03-25 11:58   ` Maxim Levitsky
2019-03-25 20:20   ` Alex Williamson
2019-03-25 21:31     ` Parav Pandit
2019-03-22 23:20 ` [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
2019-03-25 11:58   ` Maxim Levitsky
2019-03-25 19:35   ` Kirti Wankhede
2019-03-25 20:49     ` Alex Williamson
2019-03-25 21:36       ` Parav Pandit
2019-03-25 21:52         ` Alex Williamson
2019-03-25 22:07           ` Parav Pandit
2019-03-22 23:20 ` [PATCH 8/8] vfio/mdev: Improve the create/remove sequence Parav Pandit
2019-03-25 13:24   ` Maxim Levitsky
2019-03-25 21:42     ` Parav Pandit
2019-03-25 23:18   ` Alex Williamson
2019-03-25 23:34     ` Parav Pandit
2019-03-26  0:05       ` Alex Williamson
2019-03-26  1:43         ` Parav Pandit
2019-03-26  2:16           ` Alex Williamson
2019-03-26  3:19             ` Parav Pandit
2019-03-26  5:53               ` Parav Pandit
2019-03-26 15:21                 ` Alex Williamson
2019-03-26  7:06   ` Kirti Wankhede
2019-03-26 15:26     ` Alex Williamson
2019-03-27  3:19       ` Parav Pandit
2019-03-26 15:30     ` Parav Pandit
2019-03-28 17:20       ` Kirti Wankhede
2019-03-29 14:49         ` Alex Williamson
