All of lore.kernel.org
 help / color / mirror / Atom feed
* [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
@ 2015-05-13 11:10 Yishai Hadas
       [not found] ` <1431515438-24042-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Yishai Hadas @ 2015-05-13 11:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w

Currently, if there is any user space application using an IB device,
it is impossible to unload the HW device driver for this device.

Similarly, if the device is hot-unplugged or reset, the device driver
hardware removal flow blocks until all user contexts are destroyed.

This patchset removes the above limitations from both uverbs and ucma.

The IB-core and uverbs layers are still required to remain loaded as
long as there are user applications using the verbs API. However, the
hardware device drivers are not blocked any more by the user space
activity.

To support this, the hardware device needs to expose a new kernel API
named 'disassociate_ucontext'. The device driver is given a ucontext
to detach from, and it should block this user context from any future
hardware access. In the IB-core level, we use this interface to
deactivate all ucontext that address a specific device when handling a
remove_one callback for it.

In the RDMA-CM layer, a similar change is needed in the ucma module,
to prevent blocking of the remove_one operation. This allows for
hot-removal of devices with RDMA-CM users in the user space.

The first patch introduces the new API between the HW device driver and
the IB core. For devices which implement the functionality, IB core
will use it in remove_one, disassociating any active ucontext from the
hardware device. Other drivers that didn't implement it will behave as
today, remove_one will block until all ucontexts referring the device
are destroyed before returning.

The second patch provides implementation of this API for the mlx4
driver.

The third patch extends ucma to avoid blocking remove_one operation in
the cma module. When such device removal event is received, ucma is
turning all user contexts to zombie contexts. This is done by
releasing all underlying resources and preventing any further user
operations on the context.

Changes from V2:
patch #1: Rebase over ODP patches.

Changes from V1:
patch #1: Use uverbs flags instead of disassociate support, drop fatal_event_raised flag.
patch #3: Add support in ucma for handling device removal.

Changes from V0:
patch #1: ib_uverbs_close, reduced mutex scope to enable tasks run in parallel.

Yishai Hadas (3):
  IB/uverbs: Enable device removal when there are active user space
    applications
  IB/mlx4_ib: Disassociate support
  IB/ucma: HW Device hot-removal support

 drivers/infiniband/core/ucma.c        |  130 ++++++++++++-
 drivers/infiniband/core/uverbs.h      |   12 ++
 drivers/infiniband/core/uverbs_cmd.c  |    8 +
 drivers/infiniband/core/uverbs_main.c |  325 +++++++++++++++++++++++++++------
 drivers/infiniband/hw/mlx4/main.c     |  125 +++++++++++++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h  |   12 ++
 include/rdma/ib_verbs.h               |    1 +
 7 files changed, 545 insertions(+), 68 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found] ` <1431515438-24042-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-13 11:10   ` Yishai Hadas
       [not found]     ` <1431515438-24042-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-13 11:10   ` [PATCH V3 for-next 2/3] IB/mlx4_ib: Disassociate support Yishai Hadas
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Yishai Hadas @ 2015-05-13 11:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w, Jack Morgenstein

Enables the uverbs_remove_one to succeed despite the fact that there are
running IB applications working with the given ib device.  This functionality
enables a HW device to be unbind/reset despite the fact that there are running
user space applications using it.

It exposes a new IB kernel API named 'disassociate_ucontext' which lets a
driver detaching its HW resources from a given user context without
crashing/terminating the application. In case a driver implemented the above
API and registered with ib_uverb there will be no dependency between its device
to its uverbs_device. Upon calling remove_one of ib_uverbs the call should
return after disassociating the open HW resources without waiting to clients
disconnecting. In case driver didn't implement this API there will be no change
to current behaviour and uverbs_remove_one will return only when last client
has disconnected and reference count on uverbs device became 0.

In case the lower driver device was removed any application will continue
working over some zombie HCA, further calls will ended with an immediate error.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Jack Morgenstein <jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
---
 drivers/infiniband/core/uverbs.h      |   12 ++
 drivers/infiniband/core/uverbs_cmd.c  |    8 +
 drivers/infiniband/core/uverbs_main.c |  325 +++++++++++++++++++++++++++------
 include/rdma/ib_verbs.h               |    1 +
 4 files changed, 290 insertions(+), 56 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index b716b08..4ef6b1c 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -63,6 +63,10 @@
 		(udata)->outlen = (olen);					\
 	} while (0)
 
+enum uverbs_flags {
+	UVERBS_FLAG_DISASSOCIATE = 1
+};
+
 /*
  * Our lifetime rules for these structs are the following:
  *
@@ -94,6 +98,12 @@ struct ib_uverbs_device {
 	struct cdev			        cdev;
 	struct rb_root				xrcd_tree;
 	struct mutex				xrcd_tree_mutex;
+	struct mutex				disassociate_mutex; /* protect lists of files. */
+	int					disassociated;
+	u32					flags;
+	struct srcu_struct			disassociate_srcu;
+	struct list_head			uverbs_file_list;
+	struct list_head			uverbs_events_file_list;
 };
 
 struct ib_uverbs_event_file {
@@ -105,6 +115,7 @@ struct ib_uverbs_event_file {
 	wait_queue_head_t			poll_wait;
 	struct fasync_struct		       *async_queue;
 	struct list_head			event_list;
+	struct list_head			list;
 };
 
 struct ib_uverbs_file {
@@ -114,6 +125,7 @@ struct ib_uverbs_file {
 	struct ib_ucontext		       *ucontext;
 	struct ib_event_handler			event_handler;
 	struct ib_uverbs_event_file	       *async_file;
+	struct list_head			list;
 };
 
 struct ib_uverbs_event {
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index a9f0489..4b6a9e6 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -39,6 +39,7 @@
 #include <linux/sched.h>
 
 #include <asm/uaccess.h>
+#include <linux/sched.h>
 
 #include "uverbs.h"
 #include "core_priv.h"
@@ -1326,6 +1327,13 @@ ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file,
 		return -EFAULT;
 	}
 
+	/* Taking ref count on uverbs_file to make sure that file won't be
+	 * freed till that event file is closed. It will enable accessing the
+	 * uverbs_device fields as part of closing the events file and making
+	 * sure that uverbs device is available by that time as well.
+	 * Note: similar is already done for the async event file.
+	*/
+	kref_get(&file->ref);
 	fd_install(resp.fd, filp);
 	return in_len;
 }
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 88cce9b..d96d7c7 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -134,7 +134,12 @@ static void ib_uverbs_release_dev(struct kref *ref)
 	struct ib_uverbs_device *dev =
 		container_of(ref, struct ib_uverbs_device, ref);
 
-	complete(&dev->comp);
+	if (dev->disassociated) {
+		cleanup_srcu_struct(&dev->disassociate_srcu);
+		kfree(dev);
+	} else {
+		complete(&dev->comp);
+	}
 }
 
 static void ib_uverbs_release_event_file(struct kref *ref)
@@ -307,7 +312,9 @@ static void ib_uverbs_release_file(struct kref *ref)
 	struct ib_uverbs_file *file =
 		container_of(ref, struct ib_uverbs_file, ref);
 
-	module_put(file->device->ib_dev->owner);
+	if (!(file->device->flags & UVERBS_FLAG_DISASSOCIATE))
+		module_put(file->device->ib_dev->owner);
+
 	kref_put(&file->device->ref, ib_uverbs_release_dev);
 
 	kfree(file);
@@ -330,9 +337,14 @@ static ssize_t ib_uverbs_event_read(struct file *filp, char __user *buf,
 			return -EAGAIN;
 
 		if (wait_event_interruptible(file->poll_wait,
-					     !list_empty(&file->event_list)))
+					     (!list_empty(&file->event_list) ||
+					     file->uverbs_file->device->disassociated)))
 			return -ERESTARTSYS;
 
+		/* We reach here when list is not empty or when device was disassociated */
+		if (list_empty(&file->event_list) && file->uverbs_file->device->disassociated)
+			return -EIO;
+
 		spin_lock_irq(&file->lock);
 	}
 
@@ -405,12 +417,17 @@ static int ib_uverbs_event_close(struct inode *inode, struct file *filp)
 	}
 	spin_unlock_irq(&file->lock);
 
-	if (file->is_async) {
-		ib_unregister_event_handler(&file->uverbs_file->event_handler);
-		kref_put(&file->uverbs_file->ref, ib_uverbs_release_file);
+	mutex_lock(&file->uverbs_file->device->disassociate_mutex);
+	if (!file->uverbs_file->device->disassociated) {
+		list_del(&file->list);
+		if (file->is_async)
+			ib_unregister_event_handler(&file->uverbs_file->event_handler);
 	}
-	kref_put(&file->ref, ib_uverbs_release_event_file);
 
+	mutex_unlock(&file->uverbs_file->device->disassociate_mutex);
+
+	kref_put(&file->uverbs_file->ref, ib_uverbs_release_file);
+	kref_put(&file->ref, ib_uverbs_release_event_file);
 	return 0;
 }
 
@@ -546,7 +563,7 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
 	struct ib_uverbs_event_file *ev_file;
 	struct file *filp;
 
-	ev_file = kmalloc(sizeof *ev_file, GFP_KERNEL);
+	ev_file = kzalloc(sizeof *ev_file, GFP_KERNEL);
 	if (!ev_file)
 		return ERR_PTR(-ENOMEM);
 
@@ -561,10 +578,25 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
 
 	filp = anon_inode_getfile("[infinibandevent]", &uverbs_event_fops,
 				  ev_file, O_RDONLY);
-	if (IS_ERR(filp))
+	if (IS_ERR(filp)) {
 		kfree(ev_file);
+		return filp;
+	}
+
+	mutex_lock(&uverbs_file->device->disassociate_mutex);
+	if (!uverbs_file->device->disassociated) {
+		list_add_tail(&ev_file->list,
+			      &uverbs_file->device->uverbs_events_file_list);
+		mutex_unlock(&uverbs_file->device->disassociate_mutex);
 
-	return filp;
+		return filp;
+	}
+
+	mutex_unlock(&uverbs_file->device->disassociate_mutex);
+
+	fput(filp);
+	kfree(ev_file);
+	return ERR_PTR(-EIO);
 }
 
 /*
@@ -602,6 +634,8 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 	struct ib_uverbs_file *file = filp->private_data;
 	struct ib_uverbs_cmd_hdr hdr;
 	__u32 flags;
+	int srcu_key;
+	ssize_t ret;
 
 	if (count < sizeof hdr)
 		return -EINVAL;
@@ -609,6 +643,12 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 	if (copy_from_user(&hdr, buf, sizeof hdr))
 		return -EFAULT;
 
+	srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
+	if (file->device->disassociated) {
+		ret = -EIO;
+		goto out;
+	}
+
 	flags = (hdr.command &
 		 IB_USER_VERBS_CMD_FLAGS_MASK) >> IB_USER_VERBS_CMD_FLAGS_SHIFT;
 
@@ -616,26 +656,36 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 		__u32 command;
 
 		if (hdr.command & ~(__u32)(IB_USER_VERBS_CMD_FLAGS_MASK |
-					   IB_USER_VERBS_CMD_COMMAND_MASK))
-			return -EINVAL;
+					   IB_USER_VERBS_CMD_COMMAND_MASK)) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
 
 		if (command >= ARRAY_SIZE(uverbs_cmd_table) ||
-		    !uverbs_cmd_table[command])
-			return -EINVAL;
+		    !uverbs_cmd_table[command]) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		if (!file->ucontext &&
-		    command != IB_USER_VERBS_CMD_GET_CONTEXT)
-			return -EINVAL;
+		    command != IB_USER_VERBS_CMD_GET_CONTEXT) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (!(file->device->ib_dev->uverbs_cmd_mask & (1ull << command)))
-			return -ENOSYS;
+		if (!(file->device->ib_dev->uverbs_cmd_mask & (1ull << command))) {
+			ret = -ENOSYS;
+			goto out;
+		}
 
-		if (hdr.in_words * 4 != count)
-			return -EINVAL;
+		if (hdr.in_words * 4 != count) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		return uverbs_cmd_table[command](file,
+		ret = uverbs_cmd_table[command](file,
 						 buf + sizeof(hdr),
 						 hdr.in_words * 4,
 						 hdr.out_words * 4);
@@ -650,47 +700,69 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 		size_t written_count = count;
 
 		if (hdr.command & ~(__u32)(IB_USER_VERBS_CMD_FLAGS_MASK |
-					   IB_USER_VERBS_CMD_COMMAND_MASK))
-			return -EINVAL;
+					   IB_USER_VERBS_CMD_COMMAND_MASK)) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
 
 		if (command >= ARRAY_SIZE(uverbs_ex_cmd_table) ||
-		    !uverbs_ex_cmd_table[command])
-			return -ENOSYS;
+		    !uverbs_ex_cmd_table[command]) {
+			ret = -ENOSYS;
+			goto out;
+		}
 
-		if (!file->ucontext)
-			return -EINVAL;
+		if (!file->ucontext) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (!(file->device->ib_dev->uverbs_ex_cmd_mask & (1ull << command)))
-			return -ENOSYS;
+		if (!(file->device->ib_dev->uverbs_ex_cmd_mask & (1ull << command))) {
+			ret = -ENOSYS;
+			goto out;
+		}
 
-		if (count < (sizeof(hdr) + sizeof(ex_hdr)))
-			return -EINVAL;
+		if (count < (sizeof(hdr) + sizeof(ex_hdr))) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (copy_from_user(&ex_hdr, buf + sizeof(hdr), sizeof(ex_hdr)))
-			return -EFAULT;
+		if (copy_from_user(&ex_hdr, buf + sizeof(hdr), sizeof(ex_hdr))) {
+			ret = -EFAULT;
+			goto out;
+		}
 
 		count -= sizeof(hdr) + sizeof(ex_hdr);
 		buf += sizeof(hdr) + sizeof(ex_hdr);
 
-		if ((hdr.in_words + ex_hdr.provider_in_words) * 8 != count)
-			return -EINVAL;
+		if ((hdr.in_words + ex_hdr.provider_in_words) * 8 != count) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-		if (ex_hdr.cmd_hdr_reserved)
-			return -EINVAL;
+		if (ex_hdr.cmd_hdr_reserved) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		if (ex_hdr.response) {
-			if (!hdr.out_words && !ex_hdr.provider_out_words)
-				return -EINVAL;
+			if (!hdr.out_words && !ex_hdr.provider_out_words) {
+				ret = -EINVAL;
+				goto out;
+			}
 
 			if (!access_ok(VERIFY_WRITE,
 				       (void __user *) (unsigned long) ex_hdr.response,
-				       (hdr.out_words + ex_hdr.provider_out_words) * 8))
-				return -EFAULT;
+				       (hdr.out_words + ex_hdr.provider_out_words) * 8)) {
+				ret = -EFAULT;
+				goto out;
+			}
 		} else {
-			if (hdr.out_words || ex_hdr.provider_out_words)
-				return -EINVAL;
+			if (hdr.out_words || ex_hdr.provider_out_words) {
+				ret = -EINVAL;
+				goto out;
+			}
 		}
 
 		INIT_UDATA_BUF_OR_NULL(&ucore, buf, (unsigned long) ex_hdr.response,
@@ -707,22 +779,37 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 						   &uhw);
 
 		if (err)
-			return err;
-
-		return written_count;
+			ret = err;
+		else
+			ret = written_count;
+	} else {
+		ret = -ENOSYS;
 	}
 
-	return -ENOSYS;
+out:
+	srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);
+	return ret;
 }
 
 static int ib_uverbs_mmap(struct file *filp, struct vm_area_struct *vma)
 {
 	struct ib_uverbs_file *file = filp->private_data;
+	int ret = 0;
+	int srcu_key;
+
+	srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
+	if (file->device->disassociated) {
+		ret = -EIO;
+		goto out;
+	}
 
 	if (!file->ucontext)
-		return -ENODEV;
+		ret = -ENODEV;
 	else
-		return file->device->ib_dev->mmap(file->ucontext, vma);
+		ret = file->device->ib_dev->mmap(file->ucontext, vma);
+out:
+	srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);
+	return ret;
 }
 
 /*
@@ -740,6 +827,7 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	struct ib_uverbs_device *dev;
 	struct ib_uverbs_file *file;
 	int ret;
+	int module_dependent;
 
 	dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev);
 	if (dev)
@@ -747,15 +835,31 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	else
 		return -ENXIO;
 
-	if (!try_module_get(dev->ib_dev->owner)) {
-		ret = -ENODEV;
+	mutex_lock(&dev->disassociate_mutex);
+	if (dev->disassociated) {
+		ret = -EIO;
 		goto err;
 	}
 
-	file = kmalloc(sizeof *file, GFP_KERNEL);
+	/* In case IB device supports disassociate ucontext, there is no hard
+	 * dependency between uverbs device and its low level device.
+	 */
+	module_dependent = !(dev->flags & UVERBS_FLAG_DISASSOCIATE);
+
+	if (module_dependent) {
+		if (!try_module_get(dev->ib_dev->owner)) {
+			ret = -ENODEV;
+			goto err;
+		}
+	}
+
+	file = kzalloc(sizeof *file, GFP_KERNEL);
 	if (!file) {
 		ret = -ENOMEM;
-		goto err_module;
+		if (module_dependent)
+			goto err_module;
+
+		goto err;
 	}
 
 	file->device	 = dev;
@@ -765,6 +869,8 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	mutex_init(&file->mutex);
 
 	filp->private_data = file;
+	list_add_tail(&file->list, &dev->uverbs_file_list);
+	mutex_unlock(&dev->disassociate_mutex);
 
 	return nonseekable_open(inode, filp);
 
@@ -772,6 +878,7 @@ err_module:
 	module_put(dev->ib_dev->owner);
 
 err:
+	mutex_unlock(&dev->disassociate_mutex);
 	kref_put(&dev->ref, ib_uverbs_release_dev);
 	return ret;
 }
@@ -779,9 +886,26 @@ err:
 static int ib_uverbs_close(struct inode *inode, struct file *filp)
 {
 	struct ib_uverbs_file *file = filp->private_data;
+	struct ib_ucontext *ucontext = NULL;
+	int srcu_key;
+
+	srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
+	mutex_lock(&file->device->disassociate_mutex);
+	if (!file->device->disassociated) {
+		/* No need to remove from the list once alreday disassociated.
+		 * Try doing that might race with ib_uverbs_free_hw_resources
+		 * as mutex is not held by that time.
+		 */
+		list_del(&file->list);
+		ucontext = file->ucontext;
+	}
+
+	mutex_unlock(&file->device->disassociate_mutex);
 
-	ib_uverbs_cleanup_ucontext(file, file->ucontext);
+	if (ucontext)
+		ib_uverbs_cleanup_ucontext(file, ucontext);
 
+	srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);
 	if (file->async_file)
 		kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
 
@@ -873,6 +997,7 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	int devnum;
 	dev_t base;
 	struct ib_uverbs_device *uverbs_dev;
+	int ret;
 
 	if (!device->alloc_ucontext)
 		return;
@@ -885,6 +1010,13 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	init_completion(&uverbs_dev->comp);
 	uverbs_dev->xrcd_tree = RB_ROOT;
 	mutex_init(&uverbs_dev->xrcd_tree_mutex);
+	mutex_init(&uverbs_dev->disassociate_mutex);
+	ret = init_srcu_struct(&uverbs_dev->disassociate_srcu);
+	if (ret)
+		goto err_init;
+
+	INIT_LIST_HEAD(&uverbs_dev->uverbs_file_list);
+	INIT_LIST_HEAD(&uverbs_dev->uverbs_events_file_list);
 
 	spin_lock(&map_lock);
 	devnum = find_first_zero_bit(dev_map, IB_UVERBS_MAX_DEVICES);
@@ -926,6 +1058,9 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	if (device_create_file(uverbs_dev->dev, &dev_attr_abi_version))
 		goto err_class;
 
+	if (device->disassociate_ucontext)
+		uverbs_dev->flags |= UVERBS_FLAG_DISASSOCIATE;
+
 	ib_set_client_data(device, &uverbs_client, uverbs_dev);
 
 	return;
@@ -941,15 +1076,71 @@ err_cdev:
 		clear_bit(devnum, overflow_map);
 
 err:
+	cleanup_srcu_struct(&uverbs_dev->disassociate_srcu);
+
+err_init:
 	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
 	wait_for_completion(&uverbs_dev->comp);
 	kfree(uverbs_dev);
 	return;
 }
 
+static void ib_uverbs_free_hw_resources(struct ib_uverbs_device *uverbs_dev)
+{
+	struct ib_uverbs_file *file, *tmp_file;
+	struct ib_uverbs_event_file *event_file, *tmp_event_file;
+	struct ib_event event;
+
+	mutex_lock(&uverbs_dev->disassociate_mutex);
+	uverbs_dev->disassociated = 1;
+	/* We must release the mutex before going ahead and calling
+	 * disassociate_ucontext as a nested call to uverbs_close might
+	 * be called as a result of freeing the resources (e.g mmput).
+	 * In addition, we should take an extra ref count on files to prevent
+	 * them being freed as part of parallel file closing, from other task
+	 * or from event occurs internally from that one.
+	 */
+	list_for_each_entry(file, &uverbs_dev->uverbs_file_list, list)
+		kref_get(&file->ref);
+	list_for_each_entry(event_file, &uverbs_dev->uverbs_events_file_list, list)
+		kref_get(&event_file->ref);
+	mutex_unlock(&uverbs_dev->disassociate_mutex);
+
+	/* Pending running commands to terminate */
+	synchronize_srcu(&uverbs_dev->disassociate_srcu);
+	event.event = IB_EVENT_DEVICE_FATAL;
+	event.element.port_num = 0;
+	event.device = uverbs_dev->ib_dev;
+
+	list_for_each_entry(file, &uverbs_dev->uverbs_file_list, list) {
+		ib_uverbs_event_handler(&file->event_handler, &event);
+		uverbs_dev->ib_dev->disassociate_ucontext(file->ucontext);
+		ib_uverbs_cleanup_ucontext(file, file->ucontext);
+	}
+
+	list_for_each_entry(event_file, &uverbs_dev->uverbs_events_file_list, list) {
+		if (event_file->is_async) {
+			/* ib_device is freed once that function/remove_one is
+			 * finished, must unregister the event handler before.
+			 */
+			ib_unregister_event_handler(&event_file->uverbs_file->event_handler);
+		}
+
+		wake_up_interruptible(&event_file->poll_wait);
+		kill_fasync(&event_file->async_queue, SIGIO, POLL_IN);
+	}
+
+	/* A safe iterator is needed as file might be freed as part of loop */
+	list_for_each_entry_safe(file, tmp_file, &uverbs_dev->uverbs_file_list, list)
+		kref_put(&file->ref, ib_uverbs_release_file);
+
+	list_for_each_entry_safe(event_file, tmp_event_file, &uverbs_dev->uverbs_events_file_list, list)
+		kref_put(&event_file->ref, ib_uverbs_release_event_file);
+}
 static void ib_uverbs_remove_one(struct ib_device *device)
 {
 	struct ib_uverbs_device *uverbs_dev = ib_get_client_data(device, &uverbs_client);
+	int wait_clients = 1;
 
 	if (!uverbs_dev)
 		return;
@@ -963,9 +1154,31 @@ static void ib_uverbs_remove_one(struct ib_device *device)
 	else
 		clear_bit(uverbs_dev->devnum - IB_UVERBS_MAX_DEVICES, overflow_map);
 
+	if (uverbs_dev->flags & UVERBS_FLAG_DISASSOCIATE) {
+		/* We disassociate HW resources and immediately returning, not
+		 * pending to active userspace clients. Upon returning ib_device
+		 * may be freed internally and is not valid any more.
+		 * uverbs_device is still available, when all clients close
+		 * their files, the uverbs device ref count will be zero and its
+		 * resources will be freed.
+		 * Note: At that step no more files can be opened on that cdev
+		 * as it was deleted, however active clients can still issue
+		 * commands and close their open files.
+		 */
+		ib_uverbs_free_hw_resources(uverbs_dev);
+		wait_clients = 0;
+		/* ib device can no longer be accessed, it is freed when this
+		 * function returns.
+		 */
+		uverbs_dev->ib_dev = NULL;
+	}
+	/* ref count taken as part of add one is put back in both modes. */
 	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
-	wait_for_completion(&uverbs_dev->comp);
-	kfree(uverbs_dev);
+	if (wait_clients) {
+		wait_for_completion(&uverbs_dev->comp);
+		cleanup_srcu_struct(&uverbs_dev->disassociate_srcu);
+		kfree(uverbs_dev);
+	}
 }
 
 static char *uverbs_devnode(struct device *dev, umode_t *mode)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..d9501e7 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1652,6 +1652,7 @@ struct ib_device {
 	int			   (*destroy_flow)(struct ib_flow *flow_id);
 	int			   (*check_mr_status)(struct ib_mr *mr, u32 check_mask,
 						      struct ib_mr_status *mr_status);
+	void			   (*disassociate_ucontext)(struct ib_ucontext *ibcontext);
 
 	struct ib_dma_mapping_ops   *dma_ops;
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V3 for-next 2/3] IB/mlx4_ib: Disassociate support
       [not found] ` <1431515438-24042-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-13 11:10   ` [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications Yishai Hadas
@ 2015-05-13 11:10   ` Yishai Hadas
  2015-05-13 11:10   ` [PATCH V3 for-next 3/3] IB/ucma: HW Device hot-removal support Yishai Hadas
  2015-05-13 20:55   ` [RESEND PATCH V3 for-next 0/3] " Hefty, Sean
  3 siblings, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2015-05-13 11:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w, Jack Morgenstein

Implements the IB core disassociate_ucontext API. The driver detaches the HW
resources for a given user context to prevent a dependency between application
termination and device disconnecting. This is done by managing the VMAs that
were mapped to the HW bars such as door bell and blueflame. When need to detach
remap them to an arbitrary kernel page returned by the zap API.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Jack Morgenstein <jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c    |  125 +++++++++++++++++++++++++++++++++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   12 +++
 2 files changed, 136 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 57070c5..7fa7cb6 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -653,7 +653,7 @@ static struct ib_ucontext *mlx4_ib_alloc_ucontext(struct ib_device *ibdev,
 		resp.cqe_size	      = dev->dev->caps.cqe_size;
 	}
 
-	context = kmalloc(sizeof *context, GFP_KERNEL);
+	context = kzalloc(sizeof *context, GFP_KERNEL);
 	if (!context)
 		return ERR_PTR(-ENOMEM);
 
@@ -690,21 +690,140 @@ static int mlx4_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
 	return 0;
 }
 
+static void  mlx4_ib_vma_open(struct vm_area_struct *area)
+{
+	/* vma_open is called when a new VMA is created on top of our VMA.
+	 * This is done through either mremap flow or split_vma (usually due
+	 * to mlock, madvise, munmap, etc.). We do not support a clone of the
+	 * vma, as this VMA is strongly hardware related. Therefore we set the
+	 * vm_ops of the newly created/cloned VMA to NULL, to prevent it from
+	 * calling us again and trying to do incorrect actions. We assume that
+	 * the original vma size is exactly a single page that there will be no
+	 * "splitting" operations on.
+	 */
+	area->vm_ops = NULL;
+}
+
+static void  mlx4_ib_vma_close(struct vm_area_struct *area)
+{
+	struct mlx4_ib_vma_private_data *mlx4_ib_vma_priv_data;
+
+	/* It's guaranteed that all VMAs opened on a FD are closed before the
+	 * file itself is closed, therefore no sync is needed with the regular
+	 * closing flow. (e.g. mlx4_ib_dealloc_ucontext) However need a sync
+	 * with accessing the vma as part of mlx4_ib_disassociate_ucontext.
+	 * The close operation is usually called under mm->mmap_sem except when
+	 * process is exiting.  The exiting case is handled explicitly as part
+	 * of mlx4_ib_disassociate_ucontext.
+	 */
+	mlx4_ib_vma_priv_data = (struct mlx4_ib_vma_private_data *)area->vm_private_data;
+
+	/* set the vma context pointer to null in the mlx4_ib driver's private
+	 * data to protect against a race condition in mlx4_ib_dissassociate_ucontext().
+	 */
+	mlx4_ib_vma_priv_data->vma = NULL;
+}
+
+static const struct vm_operations_struct mlx4_ib_vm_ops = {
+	.open = mlx4_ib_vma_open,
+	.close = mlx4_ib_vma_close
+};
+
+static void mlx4_ib_disassociate_ucontext(struct ib_ucontext *ibcontext)
+{
+	int i;
+	int ret = 0;
+	struct vm_area_struct *vma;
+	struct mlx4_ib_ucontext *context = to_mucontext(ibcontext);
+	struct task_struct *owning_process  = NULL;
+	struct mm_struct   *owning_mm       = NULL;
+
+	owning_process = get_pid_task(ibcontext->tgid, PIDTYPE_PID);
+	if (!owning_process)
+		return;
+
+	owning_mm = get_task_mm(owning_process);
+	if (!owning_mm) {
+		pr_info("no mm, disassociate ucontext is pending task termination\n");
+		while (1) {
+			/* make sure that task is dead before returning, it may
+			 * prevent a rare case of module down in parallel to a
+			 * call to mlx4_ib_vma_close.
+			 */
+			put_task_struct(owning_process);
+			msleep(1);
+			owning_process = get_pid_task(ibcontext->tgid, PIDTYPE_PID);
+			if (!owning_process || owning_process->state == TASK_DEAD) {
+				pr_info("disassociate ucontext done, task was terminated\n");
+				/* in case task was dead need to release the task struct */
+				if (owning_process)
+					put_task_struct(owning_process);
+				return;
+			}
+		}
+	}
+
+	/* need to protect from a race on closing the vma as part of
+	 * mlx4_ib_vma_close().
+	 */
+	down_read(&owning_mm->mmap_sem);
+	for (i = 0; i < HW_BAR_COUNT; i++) {
+		vma = context->hw_bar_info[i].vma;
+		if (!vma)
+			continue;
+
+		ret = zap_vma_ptes(context->hw_bar_info[i].vma,
+				   context->hw_bar_info[i].vma->vm_start,
+				   PAGE_SIZE);
+		if (ret) {
+			pr_err("Error: zap_vma_ptes failed for index=%d, ret=%d\n", i, ret);
+			BUG_ON(1);
+		}
+
+		/* context going to be destroyed, should not access ops any more */
+		context->hw_bar_info[i].vma->vm_ops = NULL;
+	}
+
+	up_read(&owning_mm->mmap_sem);
+	mmput(owning_mm);
+	put_task_struct(owning_process);
+}
+
+static void mlx4_ib_set_vma_data(struct vm_area_struct *vma,
+				 struct mlx4_ib_vma_private_data *vma_private_data)
+{
+	vma_private_data->vma = vma;
+	vma->vm_private_data = vma_private_data;
+	vma->vm_ops =  &mlx4_ib_vm_ops;
+}
+
 static int mlx4_ib_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 {
 	struct mlx4_ib_dev *dev = to_mdev(context->device);
+	struct mlx4_ib_ucontext *mucontext = to_mucontext(context);
 
 	if (vma->vm_end - vma->vm_start != PAGE_SIZE)
 		return -EINVAL;
 
 	if (vma->vm_pgoff == 0) {
+		/* We prevent double mmaping on same context */
+		if (mucontext->hw_bar_info[HW_BAR_DB].vma != NULL)
+			return -EINVAL;
+
 		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 
 		if (io_remap_pfn_range(vma, vma->vm_start,
 				       to_mucontext(context)->uar.pfn,
 				       PAGE_SIZE, vma->vm_page_prot))
 			return -EAGAIN;
+
+		mlx4_ib_set_vma_data(vma, &mucontext->hw_bar_info[HW_BAR_DB]);
+
 	} else if (vma->vm_pgoff == 1 && dev->dev->caps.bf_reg_size != 0) {
+		/* We prevent double mmaping on same context */
+		if (mucontext->hw_bar_info[HW_BAR_BF].vma != NULL)
+			return -EINVAL;
+
 		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
 
 		if (io_remap_pfn_range(vma, vma->vm_start,
@@ -712,6 +831,9 @@ static int mlx4_ib_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 				       dev->dev->caps.num_uars,
 				       PAGE_SIZE, vma->vm_page_prot))
 			return -EAGAIN;
+
+		mlx4_ib_set_vma_data(vma, &mucontext->hw_bar_info[HW_BAR_BF]);
+
 	} else
 		return -EINVAL;
 
@@ -2242,6 +2364,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 	ibdev->ib_dev.attach_mcast	= mlx4_ib_mcg_attach;
 	ibdev->ib_dev.detach_mcast	= mlx4_ib_mcg_detach;
 	ibdev->ib_dev.process_mad	= mlx4_ib_process_mad;
+	ibdev->ib_dev.disassociate_ucontext = mlx4_ib_disassociate_ucontext;
 
 	if (!mlx4_is_slave(ibdev->dev)) {
 		ibdev->ib_dev.alloc_fmr		= mlx4_ib_fmr_alloc;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index fce3934..e2090f9 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -70,11 +70,23 @@ extern int mlx4_ib_sm_guid_assign;
 
 #define MLX4_IB_UC_STEER_QPN_ALIGN 1
 #define MLX4_IB_UC_MAX_NUM_QPS     256
+
+enum hw_bar_type {
+	HW_BAR_BF,
+	HW_BAR_DB,
+	HW_BAR_COUNT
+};
+
+struct mlx4_ib_vma_private_data {
+	struct vm_area_struct *vma;
+};
+
 struct mlx4_ib_ucontext {
 	struct ib_ucontext	ibucontext;
 	struct mlx4_uar		uar;
 	struct list_head	db_page_list;
 	struct mutex		db_page_mutex;
+	struct mlx4_ib_vma_private_data hw_bar_info[HW_BAR_COUNT];
 };
 
 struct mlx4_ib_pd {
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V3 for-next 3/3] IB/ucma: HW Device hot-removal support
       [not found] ` <1431515438-24042-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-13 11:10   ` [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications Yishai Hadas
  2015-05-13 11:10   ` [PATCH V3 for-next 2/3] IB/mlx4_ib: Disassociate support Yishai Hadas
@ 2015-05-13 11:10   ` Yishai Hadas
  2015-05-13 20:55   ` [RESEND PATCH V3 for-next 0/3] " Hefty, Sean
  3 siblings, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2015-05-13 11:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, jackm-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w

Currently, IB/cma remove_one flow blocks until all user descriptor managed by
IB/ucma are released. This prevents hot-removal of IB devices. This patch
allows IB/cma to remove devices regardless of user space activity. Upon getting
the RDMA_CM_EVENT_DEVICE_REMOVAL event we close all the underlying HW resources
for the given ucontext. The ucontext itself is still alive till its explicit
destroying by its creator.

Running applications at that time will have some zombie device, further
operations may fail.

Signed-off-by: Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/ucma.c |  130 ++++++++++++++++++++++++++++++++++++----
 1 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 45d67e9..538cf8f 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -74,6 +74,7 @@ struct ucma_file {
 	struct list_head	ctx_list;
 	struct list_head	event_list;
 	wait_queue_head_t	poll_wait;
+	struct workqueue_struct	*close_wq;
 };
 
 struct ucma_context {
@@ -89,6 +90,9 @@ struct ucma_context {
 
 	struct list_head	list;
 	struct list_head	mc_list;
+	int			closing;
+	int			destroying;
+	struct work_struct	close_work;
 };
 
 struct ucma_multicast {
@@ -107,6 +111,7 @@ struct ucma_event {
 	struct list_head	list;
 	struct rdma_cm_id	*cm_id;
 	struct rdma_ucm_event_resp resp;
+	struct work_struct	close_work;
 };
 
 static DEFINE_MUTEX(mut);
@@ -132,8 +137,12 @@ static struct ucma_context *ucma_get_ctx(struct ucma_file *file, int id)
 
 	mutex_lock(&mut);
 	ctx = _ucma_find_context(id, file);
-	if (!IS_ERR(ctx))
-		atomic_inc(&ctx->ref);
+	if (!IS_ERR(ctx)) {
+		if (ctx->closing)
+			ctx = ERR_PTR(-EIO);
+		else
+			atomic_inc(&ctx->ref);
+	}
 	mutex_unlock(&mut);
 	return ctx;
 }
@@ -144,6 +153,34 @@ static void ucma_put_ctx(struct ucma_context *ctx)
 		complete(&ctx->comp);
 }
 
+static void ucma_close_event_id(struct work_struct *work)
+{
+	struct ucma_event *uevent_close =  container_of(work, struct ucma_event, close_work);
+
+	rdma_destroy_id(uevent_close->cm_id);
+	kfree(uevent_close);
+}
+
+static void ucma_close_id(struct work_struct *work)
+{
+	struct ucma_context *ctx =  container_of(work, struct ucma_context, close_work);
+
+	/* Fence to ensure that ctx->closing was seen by all
+	 * ucma_get_ctx running
+	 */
+	mutex_lock(&mut);
+	mutex_unlock(&mut);
+
+	/* once all inflight tasks are finished, we close all underlying
+	 * resources. The context is still alive till its explicit destryoing
+	 * by its creator.
+	 */
+	ucma_put_ctx(ctx);
+	wait_for_completion(&ctx->comp);
+	/* No new events will be generated after destroying the id. */
+	rdma_destroy_id(ctx->cm_id);
+}
+
 static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file)
 {
 	struct ucma_context *ctx;
@@ -152,6 +189,7 @@ static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file)
 	if (!ctx)
 		return NULL;
 
+	INIT_WORK(&ctx->close_work, ucma_close_id);
 	atomic_set(&ctx->ref, 1);
 	init_completion(&ctx->comp);
 	INIT_LIST_HEAD(&ctx->mc_list);
@@ -242,6 +280,42 @@ static void ucma_set_event_context(struct ucma_context *ctx,
 	}
 }
 
+/* Called with file->mut locked for the relevant context. */
+static void ucma_removal_event_handler(struct rdma_cm_id *cm_id)
+{
+	struct ucma_context *ctx = cm_id->context;
+	struct ucma_event *con_req_eve;
+	int event_found = 0;
+
+	if (ctx->destroying)
+		return;
+
+	/* only if context is pointing to cm_id that it owns it and can be
+	 * queued to be closed, otherwise that cm_id is an inflight one that
+	 * is part of that context event list pending to be detached and
+	 * reattached to its new context as part of ucma_get_event,
+	 * handled separately below.
+	 */
+	if (ctx->cm_id == cm_id) {
+		ctx->closing = 1;
+		queue_work(ctx->file->close_wq, &ctx->close_work);
+		return;
+	}
+
+	list_for_each_entry(con_req_eve, &ctx->file->event_list, list) {
+		if (con_req_eve->cm_id == cm_id &&
+		    con_req_eve->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) {
+			list_del(&con_req_eve->list);
+			INIT_WORK(&con_req_eve->close_work, ucma_close_event_id);
+			queue_work(ctx->file->close_wq, &con_req_eve->close_work);
+			event_found = 1;
+			break;
+		}
+	}
+	if (!event_found)
+		printk(KERN_ERR "ucma_removal_event_handler: warning: connect request event wasn't found\n");
+}
+
 static int ucma_event_handler(struct rdma_cm_id *cm_id,
 			      struct rdma_cm_event *event)
 {
@@ -276,14 +350,21 @@ static int ucma_event_handler(struct rdma_cm_id *cm_id,
 		 * We ignore events for new connections until userspace has set
 		 * their context.  This can only happen if an error occurs on a
 		 * new connection before the user accepts it.  This is okay,
-		 * since the accept will just fail later.
+		 * since the accept will just fail later. However, we do need
+		 * to release the underlying HW resources in case of a device
+		 * removal event.
 		 */
+		if (event->event == RDMA_CM_EVENT_DEVICE_REMOVAL)
+			ucma_removal_event_handler(cm_id);
+
 		kfree(uevent);
 		goto out;
 	}
 
 	list_add_tail(&uevent->list, &ctx->file->event_list);
 	wake_up_interruptible(&ctx->file->poll_wait);
+	if (event->event == RDMA_CM_EVENT_DEVICE_REMOVAL)
+		ucma_removal_event_handler(cm_id);
 out:
 	mutex_unlock(&ctx->file->mut);
 	return ret;
@@ -442,9 +523,15 @@ static void ucma_cleanup_mc_events(struct ucma_multicast *mc)
 }
 
 /*
- * We cannot hold file->mut when calling rdma_destroy_id() or we can
- * deadlock.  We also acquire file->mut in ucma_event_handler(), and
- * rdma_destroy_id() will wait until all callbacks have completed.
+ * ucma_free_ctx is called after the underlying rdma CM-ID is destroyed. At
+ * this point, no new events will be reported from the hardware. However, we
+ * still need to cleanup the UCMA context for this ID. Specifically, there
+ * might be events that have not yet been consumed by the user space software.
+ * These might include pending connect requests which we have not completed
+ * processing.  We cannot call rdma_destroy_id while holding the lock of the
+ * context (file->mut), as it might cause a deadlock. We therefore extract all
+ * relevant events from the context pending events list while holding the
+ * mutex. After that we release them as needed.
  */
 static int ucma_free_ctx(struct ucma_context *ctx)
 {
@@ -452,8 +539,6 @@ static int ucma_free_ctx(struct ucma_context *ctx)
 	struct ucma_event *uevent, *tmp;
 	LIST_HEAD(list);
 
-	/* No new events will be generated after destroying the id. */
-	rdma_destroy_id(ctx->cm_id);
 
 	ucma_cleanup_multicast(ctx);
 
@@ -501,10 +586,20 @@ static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf,
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
-	ucma_put_ctx(ctx);
-	wait_for_completion(&ctx->comp);
-	resp.events_reported = ucma_free_ctx(ctx);
+	mutex_lock(&ctx->file->mut);
+	ctx->destroying = 1;
+	mutex_unlock(&ctx->file->mut);
+
+	flush_workqueue(ctx->file->close_wq);
+	/* At this point it's guaranteed that there is no inflight
+	 * closing task */
+	if (!ctx->closing) {
+		ucma_put_ctx(ctx);
+		wait_for_completion(&ctx->comp);
+		rdma_destroy_id(ctx->cm_id);
+	}
 
+	resp.events_reported = ucma_free_ctx(ctx);
 	if (copy_to_user((void __user *)(unsigned long)cmd.response,
 			 &resp, sizeof(resp)))
 		ret = -EFAULT;
@@ -1542,6 +1637,7 @@ static int ucma_open(struct inode *inode, struct file *filp)
 	INIT_LIST_HEAD(&file->ctx_list);
 	init_waitqueue_head(&file->poll_wait);
 	mutex_init(&file->mut);
+	file->close_wq = create_singlethread_workqueue("ucma_close_id");
 
 	filp->private_data = file;
 	file->filp = filp;
@@ -1556,16 +1652,28 @@ static int ucma_close(struct inode *inode, struct file *filp)
 
 	mutex_lock(&file->mut);
 	list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
+		ctx->destroying = 1;
 		mutex_unlock(&file->mut);
 
 		mutex_lock(&mut);
 		idr_remove(&ctx_idr, ctx->id);
 		mutex_unlock(&mut);
 
+		flush_workqueue(file->close_wq);
+		/* At that step once ctx was marked as destroying and workqueue
+		 * was flushed we are safe from any inflights handlers that
+		 * might put other closing task.
+		 */
+		if (!ctx->closing)
+			/* rdma_destroy_id ensures that no event handlers are
+			 * inflight for that id before releasing it.
+			 */
+			rdma_destroy_id(ctx->cm_id);
 		ucma_free_ctx(ctx);
 		mutex_lock(&file->mut);
 	}
 	mutex_unlock(&file->mut);
+	destroy_workqueue(file->close_wq);
 	kfree(file);
 	return 0;
 }
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found] ` <1431515438-24042-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-05-13 11:10   ` [PATCH V3 for-next 3/3] IB/ucma: HW Device hot-removal support Yishai Hadas
@ 2015-05-13 20:55   ` Hefty, Sean
       [not found]     ` <1828884A29C6694DAF28B7E6B8A82373A8FDA36D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  3 siblings, 1 reply; 23+ messages in thread
From: Hefty, Sean @ 2015-05-13 20:55 UTC (permalink / raw)
  To: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, jackm-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w

> Currently, if there is any user space application using an IB device,
> it is impossible to unload the HW device driver for this device.
> 
> Similarly, if the device is hot-unplugged or reset, the device driver
> hardware removal flow blocks until all user contexts are destroyed.
> 
> This patchset removes the above limitations from both uverbs and ucma.

Please look at ib_user_mad and how it handles device removal with active clients.  We should be able to follow the same model that doesn't require driving changes down into every driver.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found]     ` <1828884A29C6694DAF28B7E6B8A82373A8FDA36D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-05-14 11:28       ` Yishai Hadas
       [not found]         ` <555486CA.8080409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Yishai Hadas @ 2015-05-14 11:28 UTC (permalink / raw)
  To: Hefty, Sean, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	liranl-VPRAkNaXOzVWk0Htik3J/w

On 5/13/2015 11:55 PM, Hefty, Sean wrote:
> Please look at ib_user_mad and how it handles device removal with active clients.  We should be able to follow the same model that doesn't require driving changes down into every driver.
>

The ib_uverbs code does follow the same model in the sense that the 
context (uverbs_device) associated with the fd remains open until the 
app closes it, while all references to the HW device are removed 
immediately. However, in contrast to the umad API, which does *not* 
expose direct HW access and thus can be managed in a device-independent 
manner, ib_uverbs does provide user-space direct HW access.
Therefore, detaching a ib_uverbs context from a HW device does require 
device-specific logic.
The amount of code depends on the specific driver. The proposed 
framework provides any driver a chance to do proper cleanup when devices 
are removed.



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found]         ` <555486CA.8080409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-05-14 16:50           ` Hefty, Sean
       [not found]             ` <1828884A29C6694DAF28B7E6B8A82373A8FDA931-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Hefty, Sean @ 2015-05-14 16:50 UTC (permalink / raw)
  To: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	liranl-VPRAkNaXOzVWk0Htik3J/w

> The ib_uverbs code does follow the same model in the sense that the
> context (uverbs_device) associated with the fd remains open until the
> app closes it, while all references to the HW device are removed
> immediately. However, in contrast to the umad API, which does *not*
> expose direct HW access and thus can be managed in a device-independent
> manner, ib_uverbs does provide user-space direct HW access.
> Therefore, detaching a ib_uverbs context from a HW device does require
> device-specific logic.
> The amount of code depends on the specific driver. The proposed
> framework provides any driver a chance to do proper cleanup when devices
> are removed.

Why can't uverbs simply cleanup all resources associated with the open device, similar to what it does when an app closes unexpectedly (just limited to the device in question)?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found]             ` <1828884A29C6694DAF28B7E6B8A82373A8FDA931-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-05-17 10:33               ` Yishai Hadas
       [not found]                 ` <55586E5D.1030801-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Yishai Hadas @ 2015-05-17 10:33 UTC (permalink / raw)
  To: Hefty, Sean, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	liranl-VPRAkNaXOzVWk0Htik3J/w

On 5/14/2015 7:50 PM, Hefty, Sean wrote:
> Why can't uverbs simply cleanup all resources associated with the open device, similar to what it does when an app closes unexpectedly (just limited to the device in question)?

When an application closes un-expectedly, the application does not 
attempt to make use of open resources, such as FDs and memory mappings.
In the hot-removal case, the application must *continue* to operate 
without crashing.
For example, the application may want to use other RDMA devices.

In this case, all FDs and memory mappings remain open. The semantics of 
these remaining resources may be device-specific.
The proposed framework first of all allows a provider to indicate 
whether hot-removal is supported (i.e., the presence of the 
'disassociate_ucontext' callback), and if so, allow the provider to 
perform the proper cleanup so that the corresponding user-space driver 
will continue to function.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found]                 ` <55586E5D.1030801-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-05-18 18:21                   ` Hefty, Sean
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373A8FDC9A4-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Hefty, Sean @ 2015-05-18 18:21 UTC (permalink / raw)
  To: Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	liranl-VPRAkNaXOzVWk0Htik3J/w

> In this case, all FDs and memory mappings remain open.

*Why* is this needed?  The fd remains open -- that seems to be needed -- but all other kernel resources should be safe to clean up.  Any future calls into the kernel will just fail.  (Or mark the objects as defunct, so that close works.)

> these remaining resources may be device-specific.
> The proposed framework first of all allows a provider to indicate
> whether hot-removal is supported (i.e., the presence of the
> 'disassociate_ucontext' callback), and if so, allow the provider to
> perform the proper cleanup so that the corresponding user-space driver
> will continue to function.

The approach seems strange.  The driver knows that it is being removed.  It was informed up front which open contexts were associated with user space processes.  But the driver calls up to indicate that it is being removed, so that the upper layer can call back down to tell the driver to process the removal.

I wasn't asking what this series did.  I was asking why the uverbs driver just can't delete the underlying resources.  That's the natural thing to expect.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373A8FDC9A4-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-05-19 16:17                       ` Liran Liss
       [not found]                         ` <HE1PR05MB1418D0CE0E031DA35E3DDEBAB1C30-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Liran Liss @ 2015-05-19 16:17 UTC (permalink / raw)
  To: Hefty, Sean, Yishai Hadas, dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Jack Morgenstein, Shachar Raindel

> From: Hefty, Sean [mailto:sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]

> > these remaining resources may be device-specific.
> > The proposed framework first of all allows a provider to indicate
> > whether hot-removal is supported (i.e., the presence of the
> > 'disassociate_ucontext' callback), and if so, allow the provider to
> > perform the proper cleanup so that the corresponding user-space driver
> > will continue to function.
> 
> The approach seems strange.  The driver knows that it is being removed.  It
> was informed up front which open contexts were associated with user space
> processes.  But the driver calls up to indicate that it is being removed, so that
> the upper layer can call back down to tell the driver to process the removal.
> 
> I wasn't asking what this series did.  I was asking why the uverbs driver just
> can't delete the underlying resources.  That's the natural thing to expect.

I suppose that the main issue would be handling existing user memory mappings,
which cannot be just invalidated -- the user-space driver may not be aware of the
device removal and may access this memory concurrently, and we don't want it
to crash.

The meaning of these mappings is device specific: some devices only write them,
while others read them, expecting some specific format. That's why we need
device-specific code for this.

While it is true that the device initiates the removal process, the current flow is
to let every ib_client first detach itself before device-specific cleanups. In this regard,
ib_uverbs is just another client.
In addition, the "per-open" (fd) state is held in ib_ucontext, which is maintained by
ib_uverbs. The device driver doesn't hold a duplicate list of open HCA handles, so
it relies on ib_uverbs to iterate the relevant contexts and trigger the device-specific
stuff.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found]                         ` <HE1PR05MB1418D0CE0E031DA35E3DDEBAB1C30-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2015-05-25 14:01                           ` Yishai Hadas
  2015-05-27 16:12                           ` Doug Ledford
  1 sibling, 0 replies; 23+ messages in thread
From: Yishai Hadas @ 2015-05-25 14:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Liran Liss, Hefty, Sean, Yishai Hadas,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jack Morgenstein,
	Shachar Raindel

On 5/19/2015 7:17 PM, Liran Liss wrote:
>> From: Hefty, Sean [mailto:sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]
>
>>> these remaining resources may be device-specific.
>>> The proposed framework first of all allows a provider to indicate
>>> whether hot-removal is supported (i.e., the presence of the
>>> 'disassociate_ucontext' callback), and if so, allow the provider to
>>> perform the proper cleanup so that the corresponding user-space driver
>>> will continue to function.
>>
>> The approach seems strange.  The driver knows that it is being removed.  It
>> was informed up front which open contexts were associated with user space
>> processes.  But the driver calls up to indicate that it is being removed, so that
>> the upper layer can call back down to tell the driver to process the removal.
>>
>> I wasn't asking what this series did.  I was asking why the uverbs driver just
>> can't delete the underlying resources.  That's the natural thing to expect.
>
> I suppose that the main issue would be handling existing user memory mappings,
> which cannot be just invalidated -- the user-space driver may not be aware of the
> device removal and may access this memory concurrently, and we don't want it
> to crash.
>
> The meaning of these mappings is device specific: some devices only write them,
> while others read them, expecting some specific format. That's why we need
> device-specific code for this.
>
> While it is true that the device initiates the removal process, the current flow is
> to let every ib_client first detach itself before device-specific cleanups. In this regard,
> ib_uverbs is just another client.
> In addition, the "per-open" (fd) state is held in ib_ucontext, which is maintained by
> ib_uverbs. The device driver doesn't hold a duplicate list of open HCA handles, so
> it relies on ib_uverbs to iterate the relevant contexts and trigger the device-specific
> stuff.
>

Hi Doug,
Please see Liran's answer above, it clarified the need for that 
approach. Can you please take into "for-next" ?

Yishai







--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found]     ` <1431515438-24042-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-25 16:54       ` Jason Gunthorpe
       [not found]         ` <20150525165449.GA4379-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2015-05-25 16:54 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, jackm-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w, Jack Morgenstein

On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:

> +	struct srcu_struct			disassociate_srcu;

There is no need for rcu for this, use a rw sem.

> @@ -1326,6 +1327,13 @@ ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file,
>  		return -EFAULT;
>  	}
>  
> +	/* Taking ref count on uverbs_file to make sure that file won't be
> +	 * freed till that event file is closed. It will enable accessing the
> +	 * uverbs_device fields as part of closing the events file and making
> +	 * sure that uverbs device is available by that time as well.
> +	 * Note: similar is already done for the async event file.
> +	*/
> +	kref_get(&file->ref);

Is this a bug today? It doesn't look like it, but this stuff does look wrong.

Woulnd't this would make more sense for ib_uverbs_alloc_event_file to
unconditionally grab the kref and unconditionally release it on
release? 

The existing code for this looks broken, in ib_uverbs_get_context all
the error paths between ib_uverbs_alloc_event_file and the
kref_get(file->ref) are wrong - the will result in fput() which will
call ib_uverbs_event_close, which will try to do kref_put and
ib_unregister_event_handler - which are no longer paired.

[I recommend moving the kref_get and ib_register_event_handler into
 ib_uverbs_alloc_event_file, so the 'create' and 'destroy' code paths
 are clearly paired instead of being partially open coded in call
 sites]

Fix all this in a seperate patch to add the needed change in kref
semantics please.

> -	if (!try_module_get(dev->ib_dev->owner)) {
> -		ret = -ENODEV;
> +	mutex_lock(&dev->disassociate_mutex);
> +	if (dev->disassociated) {
> +		ret = -EIO;
>  		goto err;
>  	}
>  
> -	file = kmalloc(sizeof *file, GFP_KERNEL);
> +	/* In case IB device supports disassociate ucontext, there is no hard
> +	 * dependency between uverbs device and its low level device.
> +	 */
> +	module_dependent = !(dev->flags & UVERBS_FLAG_DISASSOCIATE);
> +
> +	if (module_dependent) {
> +		if (!try_module_get(dev->ib_dev->owner)) {
> +			ret = -ENODEV;
> +			goto err;

Again? Why I do I keep pointing this same basic thing to Mellanox
people:

 If you hold a X then you hold the ref to X as well.

So, if the core code is holding function pointers to module code, then
the core code holds a module ref. When the core code null's those
function pointers, then it can release the module ref.

This might work today like this (I'm not entirely sure), but it makes
no sense at all.

I'll look more closely in a few weeks once the rwsem change is done.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found]         ` <20150525165449.GA4379-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-27 16:04           ` Doug Ledford
       [not found]             ` <1432742669.28905.228.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Doug Ledford @ 2015-05-27 16:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	Jack Morgenstein

[-- Attachment #1: Type: text/plain, Size: 3233 bytes --]

On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote:
> On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:
> 
> > +	struct srcu_struct			disassociate_srcu;
> 
> There is no need for rcu for this, use a rw sem.

The rcu was used becuase it's on the hot path I assume.  Do we have
numbers on whether a rwsem vs. an rcu matters performance wise?  If the
rcu actually helps performance, then I'm inclined to leave it, but if it
doesn't, then I'd agree that a rwsem is simpler and easier to deal with.

> > @@ -1326,6 +1327,13 @@ ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file,
> >  		return -EFAULT;
> >  	}
> >  
> > +	/* Taking ref count on uverbs_file to make sure that file won't be
> > +	 * freed till that event file is closed. It will enable accessing the
> > +	 * uverbs_device fields as part of closing the events file and making
> > +	 * sure that uverbs device is available by that time as well.
> > +	 * Note: similar is already done for the async event file.
> > +	*/
> > +	kref_get(&file->ref);
> 
> Is this a bug today? It doesn't look like it, but this stuff does look wrong.
> 
> Woulnd't this would make more sense for ib_uverbs_alloc_event_file to
> unconditionally grab the kref and unconditionally release it on
> release? 
> 
> The existing code for this looks broken, in ib_uverbs_get_context all
> the error paths between ib_uverbs_alloc_event_file and the
> kref_get(file->ref) are wrong - the will result in fput() which will
> call ib_uverbs_event_close, which will try to do kref_put and
> ib_unregister_event_handler - which are no longer paired.
> 
> [I recommend moving the kref_get and ib_register_event_handler into
>  ib_uverbs_alloc_event_file, so the 'create' and 'destroy' code paths
>  are clearly paired instead of being partially open coded in call
>  sites]
> 
> Fix all this in a seperate patch to add the needed change in kref
> semantics please.

Seconded.

> > -	if (!try_module_get(dev->ib_dev->owner)) {
> > -		ret = -ENODEV;
> > +	mutex_lock(&dev->disassociate_mutex);
> > +	if (dev->disassociated) {
> > +		ret = -EIO;
> >  		goto err;
> >  	}
> >  
> > -	file = kmalloc(sizeof *file, GFP_KERNEL);
> > +	/* In case IB device supports disassociate ucontext, there is no hard
> > +	 * dependency between uverbs device and its low level device.
> > +	 */
> > +	module_dependent = !(dev->flags & UVERBS_FLAG_DISASSOCIATE);
> > +
> > +	if (module_dependent) {
> > +		if (!try_module_get(dev->ib_dev->owner)) {
> > +			ret = -ENODEV;
> > +			goto err;
> 
> Again? Why I do I keep pointing this same basic thing to Mellanox
> people:
> 
>  If you hold a X then you hold the ref to X as well.
> 
> So, if the core code is holding function pointers to module code, then
> the core code holds a module ref. When the core code null's those
> function pointers, then it can release the module ref.

Seconded.

> This might work today like this (I'm not entirely sure), but it makes
> no sense at all.
> 
> I'll look more closely in a few weeks once the rwsem change is done.


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found]                         ` <HE1PR05MB1418D0CE0E031DA35E3DDEBAB1C30-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2015-05-25 14:01                           ` Yishai Hadas
@ 2015-05-27 16:12                           ` Doug Ledford
       [not found]                             ` <1432743174.28905.231.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Doug Ledford @ 2015-05-27 16:12 UTC (permalink / raw)
  To: Liran Liss
  Cc: Hefty, Sean, Yishai Hadas, Yishai Hadas,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jack Morgenstein,
	Shachar Raindel

[-- Attachment #1: Type: text/plain, Size: 2531 bytes --]

On Tue, 2015-05-19 at 16:17 +0000, Liran Liss wrote:
> > From: Hefty, Sean [mailto:sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]
> 
> > > these remaining resources may be device-specific.
> > > The proposed framework first of all allows a provider to indicate
> > > whether hot-removal is supported (i.e., the presence of the
> > > 'disassociate_ucontext' callback), and if so, allow the provider to
> > > perform the proper cleanup so that the corresponding user-space driver
> > > will continue to function.
> > 
> > The approach seems strange.  The driver knows that it is being removed.  It
> > was informed up front which open contexts were associated with user space
> > processes.  But the driver calls up to indicate that it is being removed, so that
> > the upper layer can call back down to tell the driver to process the removal.
> > 
> > I wasn't asking what this series did.  I was asking why the uverbs driver just
> > can't delete the underlying resources.  That's the natural thing to expect.
> 
> I suppose that the main issue would be handling existing user memory mappings,
> which cannot be just invalidated -- the user-space driver may not be aware of the
> device removal and may access this memory concurrently, and we don't want it
> to crash.

In this case, you are mapping it out of the device BAR space and into a
random kernel page, yes?  So, if the driver doesn't catch the
DRIVER_FATAL event and process that to mean "don't bother touching this
RDMA device any more", it's going to write to a mailbox that no longer
responds and have infinite timeouts, yes?  Essentially meaning all
mailbox type operations just go into lala land from here on out, right?

> The meaning of these mappings is device specific: some devices only write them,
> while others read them, expecting some specific format. That's why we need
> device-specific code for this.
> 
> While it is true that the device initiates the removal process, the current flow is
> to let every ib_client first detach itself before device-specific cleanups. In this regard,
> ib_uverbs is just another client.
> In addition, the "per-open" (fd) state is held in ib_ucontext, which is maintained by
> ib_uverbs. The device driver doesn't hold a duplicate list of open HCA handles, so
> it relies on ib_uverbs to iterate the relevant contexts and trigger the device-specific
> stuff.
> 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found]             ` <1432742669.28905.228.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-27 17:43               ` Jason Gunthorpe
       [not found]                 ` <20150527174332.GC9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-05-27 20:47               ` Yishai Hadas
  1 sibling, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2015-05-27 17:43 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	Jack Morgenstein

On Wed, May 27, 2015 at 12:04:29PM -0400, Doug Ledford wrote:
> On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote:
> > On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:
> > 
> > > +	struct srcu_struct			disassociate_srcu;
> > 
> > There is no need for rcu for this, use a rw sem.
> 
> The rcu was used becuase it's on the hot path I assume.

Perhaps, I looked at that a bit, it was used on syscall paths, but
that wasn't even the big reason I made the comment..

The use of RCU is *strange*. There is no pointer, there is no call to
rcu_derference. Stuff protected by the write lock is being touched
without any read side locking (ie ib_uverbs_event_read sure looks
funky). There is no Copy or Update phase as far as I can see. It fails
two of the tests in Documentation/RCU/checklist.txt.

I'm pretty sure what is intended here is that disassociated is
actually an unlocked atomic boolean and the RCU is being used
(abused?) to create both exclusion and a rendezvous..

I gave up at that point, if RCU is not being used properly, asking for
rwsem is a way to force the author to deal with their locking
sanely.

Jasno
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found]             ` <1432742669.28905.228.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-05-27 17:43               ` Jason Gunthorpe
@ 2015-05-27 20:47               ` Yishai Hadas
       [not found]                 ` <55662D63.3030806-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Yishai Hadas @ 2015-05-27 20:47 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	Jack Morgenstein

On 5/27/2015 7:04 PM, Doug Ledford wrote:
> On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote:
>> On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:
>>
>>> +	struct srcu_struct			disassociate_srcu;
>>
>> There is no need for rcu for this, use a rw sem.
>
> The rcu was used becuase it's on the hot path I assume.  Do we have
> numbers on whether a rwsem vs. an rcu matters performance wise?  If the
> rcu actually helps performance, then I'm inclined to leave it, but if it
> doesn't, then I'd agree that a rwsem is simpler and easier to deal with.
>

That's correct, it was chosen from performance reasons to enable 
parallel commands as part of ib_uverbs_write with minimum 
synchronization overhead comparing the rwsem.

Most of its usage is for read, upon low level device removal there is 
only one point when synchronize_srcu is used.

>>> @@ -1326,6 +1327,13 @@ ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file,
>>>   		return -EFAULT;
>>>   	}
>>>
>>> +	/* Taking ref count on uverbs_file to make sure that file won't be
>>> +	 * freed till that event file is closed. It will enable accessing the
>>> +	 * uverbs_device fields as part of closing the events file and making
>>> +	 * sure that uverbs device is available by that time as well.
>>> +	 * Note: similar is already done for the async event file.
>>> +	*/
>>> +	kref_get(&file->ref);
>>
>> Is this a bug today? It doesn't look like it, but this stuff does look wrong.
>>
>> Woulnd't this would make more sense for ib_uverbs_alloc_event_file to
>> unconditionally grab the kref and unconditionally release it on
>> release?
>>
>> The existing code for this looks broken, in ib_uverbs_get_context all
>> the error paths between ib_uverbs_alloc_event_file and the
>> kref_get(file->ref) are wrong - the will result in fput() which will
>> call ib_uverbs_event_close, which will try to do kref_put and
>> ib_unregister_event_handler - which are no longer paired.
>>
>> [I recommend moving the kref_get and ib_register_event_handler into
>>   ib_uverbs_alloc_event_file, so the 'create' and 'destroy' code paths
>>   are clearly paired instead of being partially open coded in call
>>   sites]
>>
>> Fix all this in a seperate patch to add the needed change in kref
>> semantics please.
>
> Seconded.

OK, will add a separate patch to handle that.
>
>>> -	if (!try_module_get(dev->ib_dev->owner)) {
>>> -		ret = -ENODEV;
>>> +	mutex_lock(&dev->disassociate_mutex);
>>> +	if (dev->disassociated) {
>>> +		ret = -EIO;
>>>   		goto err;
>>>   	}
>>>
>>> -	file = kmalloc(sizeof *file, GFP_KERNEL);
>>> +	/* In case IB device supports disassociate ucontext, there is no hard
>>> +	 * dependency between uverbs device and its low level device.
>>> +	 */
>>> +	module_dependent = !(dev->flags & UVERBS_FLAG_DISASSOCIATE);
>>> +
>>> +	if (module_dependent) {
>>> +		if (!try_module_get(dev->ib_dev->owner)) {
>>> +			ret = -ENODEV;
>>> +			goto err;
>>
>> Again? Why I do I keep pointing this same basic thing to Mellanox
>> people:
>>
>>   If you hold a X then you hold the ref to X as well.
>>
>> So, if the core code is holding function pointers to module code, then
>> the core code holds a module ref. When the core code null's those
>> function pointers, then it can release the module ref.
>
> Seconded.

The module get/put mechanism was previously used here and wasn't 
introduced by that patch.

In case the low level driver can be unloaded (e.g. rmmod mlx4_ib) 
unconditionally to active uverbs clients we should prevent the module 
dependency by ignoring the get/put mechanism. Hope that it clarifies the 
usage here.

>
>> This might work today like this (I'm not entirely sure), but it makes
>> no sense at all.
>>
>> I'll look more closely in a few weeks once the rwsem change is done.
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found]                 ` <20150527174332.GC9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-27 21:36                   ` Or Gerlitz
       [not found]                     ` <CAJ3xEMiYbrr2mkXuywNcyYu1v0dqeM+6Lks-CL0U6fSnx=DQ+Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Or Gerlitz @ 2015-05-27 21:36 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, Shachar Raindel, Jack Morgenstein

On Wed, May 27, 2015 at 8:43 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Wed, May 27, 2015 at 12:04:29PM -0400, Doug Ledford wrote:
>> On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote:
>> > On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:
>> >
>> > > + struct srcu_struct                      disassociate_srcu;
>> >
>> > There is no need for rcu for this, use a rw sem.
>>
>> The rcu was used becuase it's on the hot path I assume.
>
> Perhaps, I looked at that a bit, it was used on syscall paths, but
> that wasn't even the big reason I made the comment..

Doug, what hot path we have in uverbs?! IB's stack hot path goes from
user-space to the HW, right?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found]                     ` <CAJ3xEMiYbrr2mkXuywNcyYu1v0dqeM+6Lks-CL0U6fSnx=DQ+Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-28  0:51                       ` Doug Ledford
  0 siblings, 0 replies; 23+ messages in thread
From: Doug Ledford @ 2015-05-28  0:51 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jason Gunthorpe, Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, Shachar Raindel, Jack Morgenstein

[-- Attachment #1: Type: text/plain, Size: 1504 bytes --]

On Thu, 2015-05-28 at 00:36 +0300, Or Gerlitz wrote:
> On Wed, May 27, 2015 at 8:43 PM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > On Wed, May 27, 2015 at 12:04:29PM -0400, Doug Ledford wrote:
> >> On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote:
> >> > On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:
> >> >
> >> > > + struct srcu_struct                      disassociate_srcu;
> >> >
> >> > There is no need for rcu for this, use a rw sem.
> >>
> >> The rcu was used becuase it's on the hot path I assume.
> >
> > Perhaps, I looked at that a bit, it was used on syscall paths, but
> > that wasn't even the big reason I made the comment..
> 
> Doug, what hot path we have in uverbs?! IB's stack hot path goes from
> user-space to the HW, right?
> 
> Or.

For lots of stuff, yes.  However, the function that they picked this for
is ib_uverbs_write() and for certain things, that is a hot path.
Examples would be things like the cmtime test program in librdmacm, and
let's not forget that the whole reason that program was written was
because we were chasing an issue that caused real world applications to
fall over when they couldn't handle roughly 900 reverse route
resolutions per second, so I consider the cmtime utility, and the
stack's responsiveness to it, a valid item to optimize for.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found]                 ` <55662D63.3030806-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-05-28  6:19                   ` Jason Gunthorpe
       [not found]                     ` <20150528061923.GA4054-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2015-05-28  6:19 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Doug Ledford, Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	jackm-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	Jack Morgenstein

On Wed, May 27, 2015 at 11:47:31PM +0300, Yishai Hadas wrote:
> That's correct, it was chosen from performance reasons to enable parallel
> commands as part of ib_uverbs_write with minimum synchronization overhead
> comparing the rwsem.

The locking scheme makes no sense, see my other email, design it for a
rwsem and then consider rcu if necessary.

Hint: Get rid of dev->disassociated and use ib_dev == null to
accomplish the same thing. Now it is clear that ib_dev is what must be
protected by a rwlock and the locking scheme will flow sanely from
that point. If you must use RCU for performance then it is now clear
that rcu_dereference must be applied to the ib_dev.

Then the other locking can stop being so subtle:

-----
ib_uverbs_close:

down(lists_mutex)
tmp = file->ucontext;
if (tmp) {
   list_del(&file->list);
   file->ucontext = NULL;
}
up(lists_mutex);

if (tmp)
  ib_uverbs_cleanup_ucontext(file, tmp);

-----
ib_uverbs_free_hw_resources:

down(lists_mutex)
while (!lists_empty(&uverbs_dev->uverbs_file_list) {
  file = list_first_entry(&uverbs_dev->uverbs_file_list, list)
  tmp = file->ucontext;
  list_del(&file->list);
  file->ucontext = NULL;
  kref_get(&file->ref);
  up(list_mutex);
      
  ib_uverbs_cleanup_ucontext(file, tmp);

  down(list_mutex);
  kref_put(&file->ref,ib_uverbs_release_file);
}
up(list_mutex);

Now the krefs are unquestionably paired, we don't leave a dangling
ucontext pointer, we don't leave a dangling list entry, locking of the
list is explicit and doesn't rely on an unlocked access with an
implicit exclusion. Basically, it is actually obvious what each lock
is protecting.

> In case the low level driver can be unloaded (e.g. rmmod mlx4_ib)
> unconditionally to active uverbs clients we should prevent the module
> dependency by ignoring the get/put mechanism. Hope that it clarifies the
> usage here.

Okay, that does make sense. But I do suspect try_module_get should still
be run, just immediately released. Otherwise the check for in-progress
module removal is skipped.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found]                             ` <1432743174.28905.231.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-28 15:58                               ` Liran Liss
  2015-05-28 16:08                               ` Liran Liss
  1 sibling, 0 replies; 23+ messages in thread
From: Liran Liss @ 2015-05-28 15:58 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Hefty, Sean, Yishai Hadas, Yishai Hadas,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jack Morgenstein,
	Shachar Raindel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1351 bytes --]

> From: Doug Ledford [mailto:dledford@redhat.com]

> > I suppose that the main issue would be handling existing user memory
> mappings,
> > which cannot be just invalidated -- the user-space driver may not be aware
> of the
> > device removal and may access this memory concurrently, and we don't
> want it
> > to crash.
> 
> In this case, you are mapping it out of the device BAR space and into a
> random kernel page, yes?  So, if the driver doesn't catch the
> DRIVER_FATAL event and process that to mean "don't bother touching this
> RDMA device any more", it's going to write to a mailbox that no longer
> responds and have infinite timeouts, yes?  Essentially meaning all
> mailbox type operations just go into lala land from here on out, right?

The kernel activity is asynchronous to user-space.
The device may be un-plugged before the user-space driver has a chance to learn that a DEVICE_FATAL event has occurred. In fact, in the current user-space stack design, device drivers don't have a context of their own to read() from file descriptors and rely on the application for that.
But even so, you probably don't want a driver to invoke a system call during the fast path just to 



N‹§²æìr¸›yúèšØb²X¬¶Ç§vØ^–)Þº{.nÇ+‰·¥Š{±­ÙšŠ{ayº\x1dʇڙë,j\a­¢f£¢·hš‹»öì\x17/oSc¾™Ú³9˜uÀ¦æå‰È&jw¨®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿïêäz¹Þ–Šàþf£¢·hšˆ§~ˆmš

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
       [not found]                             ` <1432743174.28905.231.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-05-28 15:58                               ` Liran Liss
@ 2015-05-28 16:08                               ` Liran Liss
  1 sibling, 0 replies; 23+ messages in thread
From: Liran Liss @ 2015-05-28 16:08 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Hefty, Sean, Yishai Hadas, Yishai Hadas,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jack Morgenstein,
	Shachar Raindel

> From: Doug Ledford [mailto:dledford@redhat.com]
> > I suppose that the main issue would be handling existing user memory
> mappings,
> > which cannot be just invalidated -- the user-space driver may not be aware
> of the
> > device removal and may access this memory concurrently, and we don't
> want it
> > to crash.
> 
> In this case, you are mapping it out of the device BAR space and into a
> random kernel page, yes?  So, if the driver doesn't catch the
> DRIVER_FATAL event and process that to mean "don't bother touching this
> RDMA device any more", it's going to write to a mailbox that no longer
> responds and have infinite timeouts, yes?  Essentially meaning all
> mailbox type operations just go into lala land from here on out, right?
> 

Pressed 'send' too early...

The kernel activity is asynchronous to user-space.
The device may be un-plugged before the user-space driver has a chance to learn that a DEVICE_FATAL event has occurred. In fact, in the current user-space stack design, device drivers don't have a context of their own to read() from file descriptors and rely on the application for that.
But even so, you probably don't want a driver to invoke a system call during the fast path just to check this condition.

For devices that just write the BAR space, an arbitrary kernel page would do.
Other devices might wish to first populate this page so that the user-space driver can detect this situation efficiently.

--Liran


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found]                     ` <20150528061923.GA4054-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-29 10:30                       ` Shachar Raindel
       [not found]                         ` <AM3PR05MB0935AB183EA8764FD83C36A0DCC90-LOZWmgKjnYgQouBfZGh8ttqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Shachar Raindel @ 2015-05-29 10:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Yishai Hadas
  Cc: Doug Ledford, Yishai Hadas, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Jack Morgenstein, Jack Morgenstein

Hi,

> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Thursday, May 28, 2015 9:19 AM
> Subject: Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal
> when there are active user space applications
> 
> On Wed, May 27, 2015 at 11:47:31PM +0300, Yishai Hadas wrote:
> > That's correct, it was chosen from performance reasons to enable
> parallel
> > commands as part of ib_uverbs_write with minimum synchronization
> overhead
> > comparing the rwsem.
> 
> The locking scheme makes no sense, see my other email, design it for a
> rwsem and then consider rcu if necessary.
> 

Jason, the RCU based scheme is important, as it allows the performance
critical path (verbs such as ibv_reg_mr) to avoid cross core dependencies.

> Hint: Get rid of dev->disassociated and use ib_dev == null to

Initially, I liked this idea, as it makes it easier to prove the RCU
correctness. However, it makes life really complicated:

- We will need to latch ib_dev in uverbs_cmd, and handle possible NULL
  value. There are 17 uses of ib_dev in uverbs_cmd. This makes the patch
  even bigger.

- To make life even more complicated, a large number of objects (i.e. PD,
  MR, QP, XRCD, etc.) are keeping a pointer to the same ib_dev struct.
  Making ib_dev NULL will not magically make these pointer NULL as well.
  As a result, attempting to RCU annotating all affected users will be
  impossible.

- The release flows for the relevant uverbs object might be accessing the
  ib_dev pointer from time to time. Setting ib_dev to NULL before calling
  the relevant ib_verbs_cleanup_context seems like the recipe for a kernel
  OOPS for NULL pointer deref.

> accomplish the same thing. Now it is clear that ib_dev is what must be
> protected by a rwlock and the locking scheme will flow sanely from
> that point. If you must use RCU for performance then it is now clear
> that rcu_dereference must be applied to the ib_dev.

See above, it will make the patch ~4x bigger, without performing any actual
annotation value.

> 
> Then the other locking can stop being so subtle:
> 

Locks are subtle creatures by nature. I will explain why your scheme will
break below.

> -----
> ib_uverbs_close:
> 

This is give or take the same as what is in the current code.

> down(lists_mutex)

This name is better than the current "disassociate_mutex" name we have for
this mutex, will fix in next version.

> tmp = file->ucontext;
> if (tmp) {
>    list_del(&file->list);
>    file->ucontext = NULL;

We avoid setting the file->ucontext to NULL early, to avoid a spurious
NULL pointer dereferences. 
However, if you think that this is what will fix the readability, we can 
set it to NULL and avoid touching the pointer after this point.

> }
> up(lists_mutex);
> 
> if (tmp)
>   ib_uverbs_cleanup_ucontext(file, tmp);
> 
> -----
> ib_uverbs_free_hw_resources:
> 

Here is where subtle locking is getting out of the bag.
Your skeleton flow only handles the main uverbs file.
However, this will get the user applications stuck. If
the application is waiting for an event, we should raise
a fatal "device disassociated" event to release the application.

Doing so after you started NULLifying pointers and removing
elements from lists seems the wrong way to do it, a way which
can lead to large amount of oopsing and deadlocking later on.

> down(lists_mutex)
> while (!lists_empty(&uverbs_dev->uverbs_file_list) {
>   file = list_first_entry(&uverbs_dev->uverbs_file_list, list)
>   tmp = file->ucontext;
>   list_del(&file->list);

This is where a bug will hit you - if ib_uverbs_close also
unconditionally does list_del from file->list, you have
a nice big race here.

>   file->ucontext = NULL;
>   kref_get(&file->ref);

The current code takes reference to all files with a single
locking, and after that releases them all without holding a
lock. Your suggestion is busy bouncing the lock back and forth.
To make it more interesting, if we try to use SRCU with your
suggestion, we will need to sync the SRCU every time we drop
the lock. Syncing the SRCU is really long operation (seconds). 
Doing it once during the device removal is OK. Doing it for every
FD is not practical.

>   up(list_mutex);
> 
>   ib_uverbs_cleanup_ucontext(file, tmp);
> 
>   down(list_mutex);
>   kref_put(&file->ref,ib_uverbs_release_file);
> }
> up(list_mutex);
> 
> Now the krefs are unquestionably paired, we don't leave a dangling
> ucontext pointer, we don't leave a dangling list entry, locking of the
> list is explicit and doesn't rely on an unlocked access with an
> implicit exclusion.

We don't leave a dangling list entry in our code either.
The list locking in our scheme, while slightly less bold, is much more
efficient, especially for the hot path of the IB verb operations.

> Basically, it is actually obvious what each lock
> is protecting.
> 
> > In case the low level driver can be unloaded (e.g. rmmod mlx4_ib)
> > unconditionally to active uverbs clients we should prevent the module
> > dependency by ignoring the get/put mechanism. Hope that it clarifies
> the
> > usage here.
> 
> Okay, that does make sense. But I do suspect try_module_get should still
> be run, just immediately released. Otherwise the check for in-progress
> module removal is skipped.
> 

If the module is in the process of being removed, it will call the 
ib_uverbs_remove_one. This calls our disassociate code, and is synchronize
using the list_mutex. This will prevent such issues, without dirtying
another cache line in the good case.


Thanks,
--Shachar
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
       [not found]                         ` <AM3PR05MB0935AB183EA8764FD83C36A0DCC90-LOZWmgKjnYgQouBfZGh8ttqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2015-05-29 17:06                           ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2015-05-29 17:06 UTC (permalink / raw)
  To: Shachar Raindel
  Cc: Yishai Hadas, Doug Ledford, Yishai Hadas,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jack Morgenstein,
	Jack Morgenstein

On Fri, May 29, 2015 at 10:30:52AM +0000, Shachar Raindel wrote:
> > The locking scheme makes no sense, see my other email, design it for a
> > rwsem and then consider rcu if necessary.
> 
> Jason, the RCU based scheme is important, as it allows the performance
> critical path (verbs such as ibv_reg_mr) to avoid cross core dependencies.

The difference between RCU and rwsem is at worst a cache line bounce
if simultaneously stressed by different readers on different CPUs (at
best it might be faster). Can reg_mr even run concurrently on multiple
cpus?

Right now I want to see a locking scheme that makes sense, isn't so
damn subtle and follows the RCU usage guidelines.

> > Hint: Get rid of dev->disassociated and use ib_dev == null to
> 
> Initially, I liked this idea, as it makes it easier to prove the RCU
> correctness. However, it makes life really complicated:

It doesn't make it 'easier to prove RCU correctness' it is 'the
correct way to use RCU'.

> - We will need to latch ib_dev in uverbs_cmd, and handle possible NULL
>   value. There are 17 uses of ib_dev in uverbs_cmd. This makes the patch
>   even bigger.

rwsem avoids that, of course. But even so, it is a pretty small change
to have ib_uverbs_write grab the ib_dev and ucontext under rcu and
then pass them as function arguments to the callback.

> - To make life even more complicated, a large number of objects (i.e. PD,
>   MR, QP, XRCD, etc.) are keeping a pointer to the same ib_dev
>   struct.
>   Making ib_dev NULL will not magically make these pointer NULL as well.
>   As a result, attempting to RCU annotating all affected users will be
>   impossible.

This point makes no sense. I don't think you understand RCU. Only the
member in the ib_uverbs_device device needs to be read while holding
rcu. Once ib_dev is copied and kref incr'd into the PD it is outside
of uverbs concern, and no longer covered by the RCU.

> - The release flows for the relevant uverbs object might be accessing the
>   ib_dev pointer from time to time. Setting ib_dev to NULL before calling
>   the relevant ib_verbs_cleanup_context seems like the recipe for a kernel
>   OOPS for NULL pointer deref.

Are you just guessing at this? Did you check them? I briefly did, and
again just now. Looks OK.

> > ib_uverbs_close:
> 
> This is give or take the same as what is in the current code.

It looks similar but works very differently.

> > down(lists_mutex)
> 
> This name is better than the current "disassociate_mutex" name we have for
> this mutex, will fix in next version.
> 
> > tmp = file->ucontext;
> > if (tmp) {
> >    list_del(&file->list);
> >    file->ucontext = NULL;
> 
> We avoid setting the file->ucontext to NULL early, to avoid a spurious
> NULL pointer dereferences.

Again, this comment makes no sense. You are concerned about a NULL
dereference when this code is freeing the structure? No concern about
use-after-free?

Are you just guessing there might be a problem? Did you study this
scenario?

> However, if you think that this is what will fix the readability, we can 
> set it to NULL and avoid touching the pointer after this point.

This has nothing to do with readability, it is mandatory that the null
set be possible.

> > }
> > up(lists_mutex);
> > 
> > if (tmp)
> >   ib_uverbs_cleanup_ucontext(file, tmp);
> > 
> > ib_uverbs_free_hw_resources:
> > 
> 
> Here is where subtle locking is getting out of the bag.
> Your skeleton flow only handles the main uverbs file.
> However, this will get the user applications stuck. If
> the application is waiting for an event, we should raise
> a fatal "device disassociated" event to release the application.

I'm not going to code this for you, I just showed the basic idea. You
need to extrapolate that to the other cases.

> Doing so after you started NULLifying pointers and removing
> elements from lists seems the wrong way to do it, a way which
> can lead to large amount of oopsing and deadlocking later on.

You should actually look at how event delivery works.

> > down(lists_mutex)
> > while (!lists_empty(&uverbs_dev->uverbs_file_list) {
> >   file = list_first_entry(&uverbs_dev->uverbs_file_list, list)
> >   tmp = file->ucontext;
> >   list_del(&file->list);
> 
> This is where a bug will hit you - if ib_uverbs_close also
> unconditionally does list_del from file->list, you have
> a nice big race here.

And #3 'this comment makes no sense' - seriously?

 down(lists_mutex)
 tmp = file->ucontext;
 if (tmp) {
    list_del(&file->list);
    file->ucontext = NULL;
 }
 up(lists_mutex);

Does that look unconditional/unlocked to you?

Look, this is wasting my time. If you are going to answer a detailed
email then you had better study the subject matter deeply.

> The current code takes reference to all files with a single
> locking, and after that releases them all without holding a
> lock. Your suggestion is busy bouncing the lock back and forth.

This is removal, so any cost we pay is completely irrelevant.
Be Clear. Be Clean. Be Correct.

> To make it more interesting, if we try to use SRCU with your
> suggestion, we will need to sync the SRCU every time we drop
> the lock. Syncing the SRCU is really long operation (seconds). 
> Doing it once during the device removal is OK. Doing it for every
> FD is not practical.

You really don't understand RCU if you think that loop needs to
synchronize_rcu.

> > Now the krefs are unquestionably paired, we don't leave a dangling
> > ucontext pointer, we don't leave a dangling list entry, locking of the
> > list is explicit and doesn't rely on an unlocked access with an
> > implicit exclusion.
> 
> We don't leave a dangling list entry in our code either.

I see two missing list_del's at least.

> If the module is in the process of being removed, it will call the 
> ib_uverbs_remove_one. This calls our disassociate code, and is synchronize
> using the list_mutex. This will prevent such issues, without dirtying
> another cache line in the good case.

You need to stop micro optimizing things that don't need to be micro
optimized.

It doesn't hurt things to follow the normal pattern and release the
lock early.

I feel like looking at this patch has been a waste of my time.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-05-29 17:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-13 11:10 [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support Yishai Hadas
     [not found] ` <1431515438-24042-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-13 11:10   ` [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications Yishai Hadas
     [not found]     ` <1431515438-24042-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-25 16:54       ` Jason Gunthorpe
     [not found]         ` <20150525165449.GA4379-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-27 16:04           ` Doug Ledford
     [not found]             ` <1432742669.28905.228.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-27 17:43               ` Jason Gunthorpe
     [not found]                 ` <20150527174332.GC9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-27 21:36                   ` Or Gerlitz
     [not found]                     ` <CAJ3xEMiYbrr2mkXuywNcyYu1v0dqeM+6Lks-CL0U6fSnx=DQ+Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-28  0:51                       ` Doug Ledford
2015-05-27 20:47               ` Yishai Hadas
     [not found]                 ` <55662D63.3030806-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-05-28  6:19                   ` Jason Gunthorpe
     [not found]                     ` <20150528061923.GA4054-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-29 10:30                       ` Shachar Raindel
     [not found]                         ` <AM3PR05MB0935AB183EA8764FD83C36A0DCC90-LOZWmgKjnYgQouBfZGh8ttqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2015-05-29 17:06                           ` Jason Gunthorpe
2015-05-13 11:10   ` [PATCH V3 for-next 2/3] IB/mlx4_ib: Disassociate support Yishai Hadas
2015-05-13 11:10   ` [PATCH V3 for-next 3/3] IB/ucma: HW Device hot-removal support Yishai Hadas
2015-05-13 20:55   ` [RESEND PATCH V3 for-next 0/3] " Hefty, Sean
     [not found]     ` <1828884A29C6694DAF28B7E6B8A82373A8FDA36D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-05-14 11:28       ` Yishai Hadas
     [not found]         ` <555486CA.8080409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-05-14 16:50           ` Hefty, Sean
     [not found]             ` <1828884A29C6694DAF28B7E6B8A82373A8FDA931-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-05-17 10:33               ` Yishai Hadas
     [not found]                 ` <55586E5D.1030801-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-05-18 18:21                   ` Hefty, Sean
     [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373A8FDC9A4-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-05-19 16:17                       ` Liran Liss
     [not found]                         ` <HE1PR05MB1418D0CE0E031DA35E3DDEBAB1C30-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2015-05-25 14:01                           ` Yishai Hadas
2015-05-27 16:12                           ` Doug Ledford
     [not found]                             ` <1432743174.28905.231.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-28 15:58                               ` Liran Liss
2015-05-28 16:08                               ` Liran Liss

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.