All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] rdma/uverbs: Sketch for an ioctl framework
@ 2016-05-24  6:25 Hefty, Sean
       [not found] ` <1828884A29C6694DAF28B7E6B8A82373AB04FB7F-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Hefty, Sean @ 2016-05-24  6:25 UTC (permalink / raw)
  To: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

The following is a sketch, or outline, for an ioctl framework.
The purpose of the patch is to drive discussion and feedback on
various design decisions rather than low-level comments on the
code, so that the changes can be incorporated earlier in the
development cycle.  The code is entirely untested and some functionality
is not coded.

The general architecture behind the implementation is described below.

Conceptually, ioctl's are split into two main groups:
structured and unstructured.

Unstructured ioctls:
Unstructured ioctls are dispatched directly to a driver with
minimal processing by the framework.

Structured ioctls:
Structured ioctls are designed to support generic framework
processing.  These ioctls are dispatched to per-device routines and
are based on the object ioctl model which was previously posted to
the linux-rdma mail list.  It is anticipated that most ioctls
will be structured, and the following descriptions assume structured ioctls.

Ioctl commands:
There are 6 generic ioctl commands: QUERY, OPEN, CLOSE, MODIFY, READ,
and WRITE.  Commands target a specific functional domain, which may map
to specific objects.  The implementation has a 1:1 mapping, but this
is not required.  Possible domains include: DRIVER, DEVICE, PORT, CQ,
PD, MR, QP, etc.  The full set of domains, and how they map to objects,
needs to be determined.  Ioctls are routed to dispatch routines based
on the <domain, command> pair.  Additionally, each command includes
pre and post operation hooks.  The hooks allow command specific
(e.g. OPEN or CLOSE) functionality to be applied.

Ioctl descriptors:
Descriptors are kernel structures that describe how each ioctl should
be processed by the framework, and include the dispatch routine and
control flags.  Each device maintains an array of ioctl descriptors.
Discussion is needed on the format of the descriptors, though these
are easily changeable.

Ioctl format:
All ioctls start with the same basic header.  The header includes a version,
the length of any input data, and the domain.  Structured ioctls also include
an array of objects that the ioctl will require access to for processing.
This patch does not define the format of the ioctl data beyond the object
array.  That is ultimately determined based on the <domain, command>.
However, as seen in the patch, the format of the object array will matter
and needs discussion.

Kernel objects:
The framework maps object identifiers between user space and the kernel.
Based on descriptor control flags, it will provide exclusive access to objects
where needed.  The framework also handles clean-up in case of application exits
and supports device removal.  Event handling will also be supported by the
framework, though that work still needs to be done.

Architecturally, the framework supports drivers plugging into it independent
of any kernel interface that may be used.  It is also relatively independent
of any user space library or implementation which may require access to
kernel resources.

Signed-off-by: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
I restructured the email to show the headers first.
The code is also available in the dev branch on github.com/shefty/linux.

 drivers/infiniband/core/Makefile |    2 +-
 drivers/infiniband/core/urdma.c  |  373 ++++++++++++++++++++++++++++++++++++++
 include/rdma/rdma_uapi.h         |  140 ++++++++++++++
 include/uapi/rdma/rdma_ioctl.h   |  134 ++++++++++++++
 4 files changed, 648 insertions(+), 1 deletions(-)
 create mode 100644 drivers/infiniband/core/urdma.c
 create mode 100644 include/rdma/rdma_uapi.h
 create mode 100644 include/uapi/rdma/rdma_ioctl.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index f818538..d70ba32 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -34,4 +34,4 @@ ib_umad-y :=			user_mad.o
 
 ib_ucm-y :=			ucm.o
 
-ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_marshall.o
+ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_marshall.o urdma.o
diff --git a/include/uapi/rdma/rdma_ioctl.h b/include/uapi/rdma/rdma_ioctl.h
new file mode 100644
index 0000000..3e34791
--- /dev/null
+++ b/include/uapi/rdma/rdma_ioctl.h
@@ -0,0 +1,134 @@
+/*
+ * Copyright (c) 2016 Intel Corporation, Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef RDMA_IOCTL_H
+#define RDMA_IOCTL_H
+
+#include <linux/types.h>
+#include <rdma/ib_user_verbs.h>
+#include <linux/ioctl.h>
+
+
+/* ioctls are grouped into 1 of 8 domains/namespaces
+#define URDMA_DOMAIN(nr)	(_IOC_NR(nr) >> 5)
+enum {
+	URDMA_DOMAIN_OBJECT,
+	URDMA_DOMAIN_DRIVER,
+	URDMA_MAX_DOMAIN
+};
+*/
+
+#define URDMA_OP_MASK			0x7F
+#define URDMA_OP(nr)			(_IOC_NR(nr) & URDMA_OP_MASK)
+
+/* operations */
+enum {
+	URDMA_QUERY,
+	URDMA_OPEN,
+	URDMA_CLOSE,
+	URDMA_MODIFY,
+	URDMA_READ,
+	URDMA_WRITE,
+	URDMA_MAX_OP
+};
+
+/* driver specific object operations set the high-order op bit */
+#define URDMA_DRIVER_OP			(0x80)
+
+/* operation domains, doubles as object types */
+enum {
+	URDMA_DRIVER,	/* is this usable? */
+	URDMA_DEVICE,
+	URDMA_PORT,
+	URDMA_CQ,
+	URDMA_PD,
+	URDMA_AH,
+	URDMA_MR,
+	URDMA_SHARED_RX,
+	URDMA_SHARED_TX,
+	URDMA_QP,
+	URDMA_CMD_CTX,
+	/* others... */
+	URDMA_MAX_DOMAIN
+};
+
+/* driver specific domains set the high-order domain bit */
+#define URDMA_DRIVER_DOMAIN		(1 << 16)
+
+struct urdma_obj_id {
+	u32	instance_id;
+	u16	obj_type;
+	u16	resv;
+};
+
+/* ensure that data beyond header starts at 64-byte alignment */
+struct urdma_ioctl {
+	u8	version;
+	u8	count;
+	u16	domain;
+	u16	length;
+	u16	resv;
+	u64	flags;
+	union {
+		struct urdma_obj_id	obj_id[0];
+		u64			data[0];
+#ifdef __KERNEL__
+		void			*obj[0];
+#endif
+	};
+};
+
+
+#define URDMA_TYPE			0xda
+#define URDMA_IO(op)			_IO(URDMA_TYPE, op)
+#define URDMA_IOR(op, type)		_IOR(URDMA_TYPE, op, type)
+#define URDMA_IOW(op, type)		_IOW(URDMA_TYPE, op, type)
+#define URDMA_IOWR(op, type)		_IOWR(URDMA_TYPE, op, type)
+
+#define URDMA_DRIVER_CMD(op)		(op | URDMA_DRIVER_OP)
+#define URDMA_DRIVER_IO(op)		URDMA_IO(URDMA_DRIVER_CMD(op))
+#define URDMA_DRIVER_IOR(op, type)	URDMA_IOR(URDMA_DRIVER_CMD(op), type)
+#define URDMA_DRIVER_IOW(op, type)	URDMA_IOW(URDMA_DRIVER_CMD(op), type)
+#define URDMA_DRIVER_IOWR(op, type)	URDMA_IOWR(URDMA_DRIVER_CMD(op), type)
+
+#define URDMA_IOCTL(op)			URDMA_IOWR(URDMA_##op, struct urdma_ioctl)
+
+#define URDMA_IOCTL_QUERY		URDMA_IOCTL(QUERY)
+#define URDMA_IOCTL_OPEN		URDMA_IOCTL(OPEN)
+#define URDMA_IOCTL_CLOSE		URDMA_IOCTL(CLOSE)
+#define URDMA_IOCTL_MODIFY		URDMA_IOCTL(MODIFY)
+#define URDMA_IOCTL_READ		URDMA_IOCTL(READ)
+#define URDMA_IOCTL_WRITE		URDMA_IOCTL(WRITE)
+
+
+#endif /* RDMA_IOCTL_H */
+
diff --git a/include/rdma/rdma_uapi.h b/include/rdma/rdma_uapi.h
new file mode 100644
index 0000000..8b5105e
--- /dev/null
+++ b/include/rdma/rdma_uapi.h
@@ -0,0 +1,140 @@
+/*
+ * Copyright (c) 2016 Intel Corporation, Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef RDMA_UAPI_H
+#define RDMA_UAPI_H
+
+#include <linux/types.h>
+#include <rdma/ib_user_verbs.h>
+#include <uapi/rdma/rdma_ioctl.h>
+
+
+#define URDMA_OFFSET(dom, op)		(dom * URDMA_MAX_OP + op)
+#define URDMA_MAX_BASE			(URDMA_MAX_DOMAIN * URDMA_MAX_OP - 1)
+#define URDMA_DRIVER_OFFSET(op)		(op)
+
+/* Object and control flags */
+/* Operation on object requires exclusive access - e.g. MODIFY */
+#define URDMA_EXCL		(1 << 0)
+/* Events may be generated for the given object - e.g. CQ, QP */
+#define URDMA_EVENT		(1 << 1)
+/* Device resources have been freed */
+#define URDMA_CLOSED		(1 << 2)
+
+struct urdma_device;
+
+typedef long (*urdma_ioctl_handler_t)(struct urdma_device *dev,
+				      void *data, void *file_data);
+
+/* The purpose of this structure is to guide the behavior of the
+ * common ioctl processing code.
+ */
+struct urdma_ioctl_desc {
+	u64			flags;
+	unsigned int		cmd;
+	u16			length; /* max size needed */
+	urdma_ioctl_handler_t	func;
+	const char		*name;
+};
+
+typedef long (*urdma_ioctl_hook_t)(struct urdma_device *dev,
+				   struct urdma_ioctl *ioctl,
+				   struct urdma_ioctl_desc *desc,
+				   void *file_data);
+
+#define URDMA_DESC(_dom, _op, _func, _flags)			\
+	[URDMA_OFFSET(URDMA_##_dom, URDMA_##_op)] = {		\
+		.flags = _flags,				\
+		.cmd = URDMA_IOCTL_##_op,			\
+		.func = _func,					\
+		.name = #_dom "_" #_op				\
+	}
+
+#define URDMA_DRIVER_DESC(_dom, _op, _func, _flags)		\
+	[URDMA_DRIVER_OFFSET(URDMA_##_dom, URDMA_##_op)] = {	\
+		.flags = _flags,				\
+		.cmd = URDMA_IOCTL_##_op,			\
+		.func = _func,					\
+		.name = #_dom "_" #_op				\
+	}
+
+extern const struct urdma_ioctl_desc verbs_ioctl[URDMA_MAX_BASE];
+
+struct urdma_driver {
+	int			num_ioctls;
+	struct urdma_ioctl_desc	*ioctl;
+};
+
+/* will merge with ib_device, can be separated later to support
+ * non-verbs devices that do not plug into the kernel APIs
+ */
+struct urdma_device {
+	struct urdma_driver	*drv;
+	struct rw_semaphore	rw_lock;
+	int			flags;
+	int			num_ioctls;
+	struct urdma_ioctl_desc	*ioctl;
+
+	/* Order to cleanup obj_list.  Objects are destroyed from
+	 * obj_list[close_map[0]]..obj_list[close_map[n]]
+	 */
+	int			num_objs;
+	int			*close_map;
+	struct list_head	*obj_lists;
+};
+
+/* use ib_uobject? */
+/* urdma will protect against destroying an object that is in use,
+ * but all locking is pushed down to the drivers.
+ * Keep this structure as small as possible
+ */
+struct urdma_obj {
+	u64			ucontext;
+	void			*kcontext;
+	u32			instance_id;	/* idr index */
+	u16			obj_type;
+	u16			flags;
+	struct urdma_device	*dev;
+	struct list_head	entry;
+	atomic_t		use_cnt;
+	//struct kref		ref;
+	//struct rw_semaphore	mutex;
+};
+
+struct urdma_map {
+	struct idr		idr;
+	struct mutex		lock;
+};
+
+
+#endif /* RDMA_UAPI_H */
+diff --git a/drivers/infiniband/core/urdma.c b/drivers/infiniband/core/urdma.c
new file mode 100644
index 0000000..bd84eb9
--- /dev/null
+++ b/drivers/infiniband/core/urdma.c
@@ -0,0 +1,373 @@
+/*
+ * Copyright (c) 2016 Intel Corporation, Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/types.h>
+#include <linux/bitops.h>
+
+#include <rdma/ib_user_verbs.h>
+#include <rdma/ib_verbs.h>
+#include <uapi/rdma/rdma_ioctl.h>
+#include <rdma/rdma_uapi.h>
+
+#include "uverbs.h"
+
+
+static long urdma_query_device(struct urdma_device *dev, void *data,
+			       void *file_data)
+{
+	return -ENOSYS;
+}
+
+/* shared ioctl function dispatch table, usable by all verbs devices */
+const struct urdma_ioctl_desc verbs_ioctl[URDMA_MAX_BASE] = {
+	URDMA_DESC(DEVICE, QUERY, urdma_query_device, 0),
+//	URDMA_DESC(DEVICE, OPEN, urdma_open_device, URDMA_EVENT),
+	/* we could also assume exclusive access for modify/close operations */
+//	URDMA_DESC(DEVICE, CLOSE, urdma_close_device, URDMA_EXCL),
+//	URDMA_DESC(DEVICE, MODIFY, urdma_modify_device, URDMA_EXCL),
+//	URDMA_DESC(CQ, QUERY, urdma_query_cq, 0),
+//	URDMA_DESC(CQ, OPEN, urdma_open_cq, URDMA_EVENT),
+//	URDMA_DESC(CQ, CLOSE, urdma_close_cq, URDMA_EXCL),
+//	URDMA_DESC(CQ, MODIFY, urdma_modify_cq, URDMA_EXCL),
+	/* ... yadda yadda yadda */
+};
+EXPORT_SYMBOL(verbs_ioctl);
+
+
+/* Map instance id's to object structures.
+ * We can define per object/device/driver maps if needed for better
+ * parallelism, but use one for now.
+ */
+struct urdma_map map = { IDR_INIT(map.idr),
+			 __MUTEX_INITIALIZER(map.lock) };
+
+
+static struct urdma_obj * urdma_get_obj(struct idr *idr, struct urdma_device *dev,
+					struct urdma_obj_id *id, bool excl)
+{
+	struct urdma_obj *obj;
+
+	if (id->resv)
+		return ERR_PTR(-EINVAL);
+
+	obj = idr_find(idr, id->instance_id);
+	if (!obj || obj->dev != dev || obj->obj_type != id->obj_type)
+		return ERR_PTR(-ENOENT);
+	else if (obj->flags & URDMA_EXCL || (excl && atomic_read(&obj->use_cnt)))
+		return ERR_PTR(-EBUSY);
+
+	if (excl)
+		obj->flags |= URDMA_EXCL;
+	atomic_inc(&obj->use_cnt);
+	return obj;
+}
+
+static void urdma_put_obj(struct urdma_obj *obj)
+{
+	if (obj->flags & URDMA_EXCL)
+		obj->flags &= ~URDMA_EXCL;
+	atomic_dec(&obj->use_cnt);
+}
+
+static void urdma_unmap_obj(struct urdma_ioctl *ioctl, int index)
+{
+	struct urdma_obj *obj;
+
+	obj = ioctl->obj[index];
+	ioctl->obj_id[index].instance_id = obj->instance_id;
+	ioctl->obj_id[index].obj_type = obj->obj_type;
+	ioctl->obj_id[index].resv = 0;
+	urdma_put_obj(obj);
+}
+
+static void urdma_unmap_objs(struct urdma_device *dev, struct urdma_ioctl *ioctl)
+{
+	int i;
+
+	for (i = 0; i < ioctl->count; i++)
+		urdma_unmap_obj(ioctl, i);
+}
+
+static long urdma_map_objs(struct urdma_device *dev,
+			   struct urdma_ioctl *ioctl, bool excl)
+{
+	struct urdma_obj *obj;
+	int i;
+
+	mutex_lock(&map.lock);
+	for (i = 0; i < ioctl->count; i++) {
+		obj = urdma_get_obj(&map.idr, dev, &ioctl->obj_id[i],
+				    excl && i == 0);
+		if (IS_ERR(obj))
+			goto err;
+
+		ioctl->obj[i] = obj;
+	}
+	mutex_unlock(&map.lock);
+	return 0;
+err:
+	while (i--)
+		urdma_unmap_obj(ioctl, i);
+	return PTR_ERR(obj);
+}
+
+/* process driver specific ioctl
+ * driver ioctl's follow more conventional ioctl format
+ */
+long urdma_driver_ioctl(struct ib_uverbs_file *file_data, unsigned int cmd,
+			unsigned long arg)
+{
+	struct urdma_device *dev /*= file_data->dev*/;
+	struct urdma_driver *drv = dev->drv;
+	struct urdma_ioctl_desc *desc;
+	char stack_data[128], *data;
+	u16 size;
+	int offset;
+	long ret;
+
+	offset = URDMA_OP(cmd);
+	if (offset >= drv->num_ioctls || !drv->ioctl[offset].func)
+		return -EINVAL;
+
+	desc = &drv->ioctl[offset];
+	size = _IOC_SIZE(desc->cmd);
+	if (size > sizeof(stack_data)) {
+		data = kmalloc(size, GFP_KERNEL);
+		if (!data)
+			return -ENOMEM;
+	} else {
+		data = stack_data;
+	}
+
+	if (desc->cmd & IOC_IN) {
+		if (copy_from_user(data, (void __user *) arg, size)) {
+			ret = -EFAULT;
+			goto out;
+		}
+	} else if (desc->cmd & IOC_OUT) {
+		memset(data, 0, size);
+	}
+
+	/* data is in/out parameter */
+	ret = desc->func(dev, data, file_data);
+
+	if (desc->cmd & IOC_OUT) {
+		if (copy_to_user((void __user *) arg, data, size))
+			ret = -EFAULT;
+	}
+out:
+	if (data != stack_data)
+		kfree(data);
+	return ret;
+}
+
+static long urdma_pre_common(struct urdma_device *dev, struct urdma_ioctl *ioctl,
+			     struct urdma_ioctl_desc *desc, void *file_data)
+{
+	down_read(&dev->rw_lock);
+	if (dev->flags & URDMA_CLOSED) {
+		up_read(&dev->rw_lock);
+		return -ENODEV;
+	}
+
+	return urdma_map_objs(dev, ioctl, desc->flags & URDMA_EXCL);
+}
+
+static long urdma_post_common(struct urdma_device *dev, struct urdma_ioctl *ioctl,
+			      struct urdma_ioctl_desc *desc, void *file_data)
+{
+	urdma_unmap_objs(dev, ioctl);
+	up_read(&dev->rw_lock);
+	return 0;
+}
+
+static long urdma_pre_open(struct urdma_device *dev, struct urdma_ioctl *ioctl,
+			   struct urdma_ioctl_desc *desc, void *file_data)
+{
+	struct urdma_obj *obj;
+
+	obj = kzalloc(sizeof *obj, GFP_KERNEL);
+	if (!obj)
+		return -ENOMEM;
+
+	obj->flags = URDMA_EXCL;
+	obj->obj_type = ioctl->domain;
+	atomic_set(&obj->use_cnt, 1);
+
+	mutex_lock(&map.lock);
+	obj->instance_id = idr_alloc(&map.idr, obj, 0, 0, GFP_KERNEL);
+	/* TODO: handle driver objects */
+	if (obj->instance_id >= 0)
+		list_add_tail(&obj->entry, &dev->obj_lists[obj->obj_type]);
+	mutex_unlock(&map.lock);
+
+	if (obj->instance_id < 0) {
+		kfree(obj);
+		return -ENOMEM;
+	}
+
+	/* new object added after input object array */
+	ioctl->obj[ioctl->count++] = obj;
+	return 0;
+}
+
+static long urdma_pre_close(struct urdma_device *dev, struct urdma_ioctl *ioctl,
+			    struct urdma_ioctl_desc *desc, void *file_data)
+{
+	if (ioctl->count != 1)
+		return -EINVAL;
+	return urdma_map_objs(dev, ioctl, desc->flags & URDMA_EXCL);
+}
+
+static long urdma_post_close(struct urdma_device *dev, struct urdma_ioctl *ioctl,
+			     struct urdma_ioctl_desc *desc, void *file_data)
+{
+	struct urdma_obj *obj;
+
+	obj = ioctl->obj[0];
+	ioctl->obj[0] = NULL;
+
+	mutex_lock(&map.lock);
+	idr_remove(&map.idr, obj->instance_id);
+	list_del(&obj->entry);
+	mutex_unlock(&map.lock);
+	kfree(obj);
+	return 0;
+}
+
+const static urdma_ioctl_hook_t urdma_pre_op[URDMA_MAX_OP] = {
+	[URDMA_QUERY] = urdma_pre_common,
+	[URDMA_OPEN] = urdma_pre_open,
+	[URDMA_CLOSE] = urdma_pre_close,
+	[URDMA_MODIFY] = urdma_pre_common,
+	[URDMA_READ] = urdma_pre_common,
+	[URDMA_WRITE]= urdma_pre_common,
+};
+
+const static urdma_ioctl_hook_t urdma_post_op[URDMA_MAX_OP] = {
+	[URDMA_QUERY] = urdma_post_common,
+	[URDMA_OPEN] = urdma_post_common,
+	[URDMA_CLOSE] = urdma_post_close,
+	[URDMA_MODIFY] = urdma_post_common,
+	[URDMA_READ] = urdma_post_common,
+	[URDMA_WRITE]= urdma_post_common,
+};
+
+long urdma_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	struct urdma_device *dev;
+	struct ib_uverbs_file *file_data;
+	struct urdma_ioctl_desc *desc;
+	struct urdma_ioctl hdr, *data;
+	char stack_data[128];
+	u8 op;
+	int offset;
+	long ret;
+
+	file_data = filp->private_data;
+	/* dev = file_data->dev; */
+
+	if (_IOC_NR(cmd) & URDMA_DRIVER_OP)
+		return urdma_driver_ioctl(file_data, cmd, arg);
+
+	op = URDMA_OP(cmd);
+	if (op > URDMA_MAX_OP || _IOC_SIZE(cmd) < sizeof(hdr))
+		return -EINVAL;
+
+	if (copy_from_user(&hdr, (void __user *) arg, sizeof(hdr)))
+		return -EFAULT;
+
+	offset = URDMA_OFFSET(hdr.domain, op);
+	if (offset >= dev->num_ioctls || !dev->ioctl[offset].func)
+		return -EINVAL;
+
+	desc = &dev->ioctl[offset];
+	if ((sizeof(hdr) + hdr.count * sizeof(hdr.obj_id) > hdr.length) ||
+	    (hdr.length > desc->length))
+		return -EINVAL;
+
+	if (desc->length > sizeof(stack_data)) {
+		data = kmalloc(desc->length, GFP_KERNEL);
+		if (!data)
+			return -ENOMEM;
+	} else {
+		data = (struct urdma_ioctl *) stack_data;
+	}
+
+	if (copy_from_user(data, (void __user *) arg, hdr.length)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (urdma_pre_op[op]) {
+		ret = urdma_pre_op[op](dev, data, desc, file_data);
+		if (ret)
+			goto out;
+	}
+
+	ret = desc->func(dev, data, file_data);
+
+	if (urdma_post_op[op]) {
+		ret = urdma_post_op[op](dev, data, desc, file_data);
+		if (ret)
+			goto out;
+	}
+
+	if (copy_to_user((void __user *) arg, data, data->length))
+		ret = -EFAULT;
+out:
+	if (data != (struct urdma_ioctl *) stack_data)
+		kfree(data);
+	return ret;
+}
+
+static void urdma_close_obj(struct urdma_device *dev, struct urdma_obj *obj)
+{
+	/* kernel initiated close, releaes device resources */
+}
+
+static void urdma_close_dev(struct urdma_device *dev)
+{
+	struct urdma_obj *obj;
+	int i;
+
+	down_write(&dev->rw_lock);
+	dev->flags |= URDMA_CLOSED;
+
+	for (i = 0; i < dev->num_objs; i++) {
+		list_for_each_entry(obj, &dev->obj_lists[dev->close_map[i]], entry) {
+			urdma_close_obj(dev, obj);
+		}
+	}
+	up_write(&dev->rw_lock);
+}
+
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found] ` <1828884A29C6694DAF28B7E6B8A82373AB04FB7F-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2016-05-24 16:02   ` Liran Liss
       [not found]     ` <HE1PR05MB141819B27F9AAA360DCB420FB14F0-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Liran Liss @ 2016-05-24 16:02 UTC (permalink / raw)
  To: Hefty, Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Hefty, Sean
...
> 
> The following is a sketch, or outline, for an ioctl framework.
> The purpose of the patch is to drive discussion and feedback on various design
> decisions rather than low-level comments on the code, so that the changes can
> be incorporated earlier in the development cycle.  The code is entirely untested
> and some functionality is not coded.
> 

This approach focuses on facilities for arbitrary device-specific objects (which could basically fit GPUs, NVMe, whatever...), while our primary goals are to replace write() with ioctl() and strive for a common *rdma* object model. 

In contrast, the patchset [1], focuses on the task at hand. The emphasis is on a common framework for parsing and parameter validation. It follows the agreed-upon concepts such as a single ioctl request code for common objects, natural mapping of the existing object mode (and hence easier backporting), typed attributes for extending the objects to accommodate multiple vendor features, and a unique ioctl request code for vendor-specific commands.
--Liran

[1] http://permalink.gmane.org/gmane.linux.drivers.rdma/36984
 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]     ` <HE1PR05MB141819B27F9AAA360DCB420FB14F0-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2016-05-24 17:57       ` Doug Ledford
       [not found]         ` <11b6d9c1-0b20-f929-c896-ca084fe18192-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Doug Ledford @ 2016-05-24 17:57 UTC (permalink / raw)
  To: Liran Liss, Hefty Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

[-- Attachment #1: Type: text/plain, Size: 4196 bytes --]

On 05/24/2016 12:02 PM, Liran Liss wrote:
>> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
>> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Hefty, Sean
> ...
>>
>> The following is a sketch, or outline, for an ioctl framework.
>> The purpose of the patch is to drive discussion and feedback on various design
>> decisions rather than low-level comments on the code, so that the changes can
>> be incorporated earlier in the development cycle.  The code is entirely untested
>> and some functionality is not coded.
>>
> 
> This approach focuses on facilities for arbitrary device-specific objects (which could basically fit GPUs, NVMe, whatever...), while our primary goals are to replace write() with ioctl() and strive for a common *rdma* object model.

Please be precise when you use the word "our".  In particular, I do not
count myself among your "our".

> In contrast, the patchset [1], focuses on the task at hand. The emphasis is on a common framework for parsing and parameter validation. It follows the agreed-upon

Again, precision is needed here.  Agreed upon by whom?  Not I, and
evidently not Sean either.

> concepts such as a single ioctl request code for common objects,

I'm not sure I agree with this as a priority.

> natural mapping of the existing object mode (and hence easier backporting),

Nor this.

> typed attributes for extending the objects to accommodate multiple vendor features, and a unique ioctl request code for vendor-specific commands.
> --Liran
> 
> [1] http://permalink.gmane.org/gmane.linux.drivers.rdma/36984

While concalls to discuss these topics are handy, it is easy to miss
objections or alternative priorities.

So let me sum up how I see things.

First, we had a security issue that needed addressed.  We did so with a
security check on the current ABI.  During the discussions about that, I
specifically told Linus we wanted to address it that way for now so that
we could work on getting the verbs 2.0 API that addresses some of the
shortcomings of the current API out the door, but do it right and not
rush things.  That verbs 2.0 API would fix the security issue up
completely where the security checks are no longer needed.

What we have today is a proposal to basically take the verbs 1.0 API and
convert it from write() to ioctl() with as little change as possible to
the API.  This is what the patchset you referenced is all about, and
what you and Jason have been talking about.  Allow me to be clear: this
is not verbs 2.0 API.  This is verbs 1.0 over ioctl.  Nothing more.  And
it does nothing to fix up the shortcomings of the verbs 1.0 API that are
playing out right now in things like the timestamp pacthset.

Sean's proposal does away with the rigid nature of the current verbs 1.0
API and makes it much more flexible on a per-driver basis.  This doesn't
address the end user API issues, but it at least cleans up the user
driver <-> kernel driver API so that one vendor's driver is not forced
to carry around all the baggage needed for every other vendor's drivers.

Under that model, each vendor only carries what they need.  It would
then be libibverbs responsibility to take that driver specific data and
present it to the user in a generic way.  The mechanism of that is not
yet known, but thinking about it is what prompted my email in the
"Further thoughts on uAPI" discussion, the one where I made the possible
wc processing example.

This is why I don't think it's possible to think about the verbs 2.0 API
and not also the user visible libibverbs API at the same time.  If all
we want to talk about is verbs 1.0 over ioctl, then yes, we can do that.
 But not if we truly want to discuss a verbs 2.0 API.  And I have yet to
gather from the discussions I hear from people that we are in fact
decided on pursuing a verbs 1.0 over ioctl API instead of considering a
verbs 2.0 API.  So please don't present things as though they are
settled.  As far as I can tell, they very much are not.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]         ` <11b6d9c1-0b20-f929-c896-ca084fe18192-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-05-24 21:41           ` Jason Gunthorpe
       [not found]             ` <20160524214137.GA6760-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-05-24 21:41 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Liran Liss, Hefty Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Tue, May 24, 2016 at 01:57:54PM -0400, Doug Ledford wrote:

> Sean's proposal does away with the rigid nature of the current verbs 1.0
> API and makes it much more flexible on a per-driver basis.  This doesn't
> address the end user API issues, but it at least cleans up the user
> driver <-> kernel driver API so that one vendor's driver is not forced
> to carry around all the baggage needed for every other vendor's drivers.

I'm not sure what you are reading, but to me both proposals look very
similar. They are both based on the generic object/action/method sort
of model I talked about in an earlier thread.

The main differences seem to boil down to the data marshalling and the
dispatching style for the kernel side..

Sean hasn't explored how to encode the actual method arguments, while
Mellanox's has a fairly well developed scheme with the netlink
encoding and sgl result list thingy.

> Under that model, each vendor only carries what they need.  It would
> then be libibverbs responsibility to take that driver specific data
> and

Either patchset could go in this direction. This is a basic question
we need to decide on.

I'm starting to think the basic thrust of the Mellanox series (provide
an easy compatability with our userspace) is a sane approach. The
other options look like far too much work to use as a starting point.

That doesn't mean we can't decide to move in a direct-only direction -
the uAPI needs to have enough extension points to allow for that. Such
work should happen incrementally, and mainly target new uAPIs.

> and not also the user visible libibverbs API at the same time.  If all
> we want to talk about is verbs 1.0 over ioctl, then yes, we can do that.
>  But not if we truly want to discuss a verbs 2.0 API.  And I have yet to
> gather from the discussions I hear from people that we are in fact
> decided on pursuing a verbs 1.0 over ioctl API instead of considering a
> verbs 2.0 API.

You are the only person I've heard who wants to restructure the
libibverbs interface at the same time..

When I kicked off this process at the OFA meeting the discussions
revolved around some very clear goals:
 1) Comprehensively close the security issue
 2) Provide 100% support for existing libibverbs users on the current
    libibverbs API
 3) Restructure obvious known weaknesses (eg IB specific areas)
 4) Add additional extension points

I feel like both patches are still going down the above path..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]             ` <20160524214137.GA6760-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-24 22:38               ` Hefty, Sean
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A82373AB0502ED-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2016-05-25 14:44               ` Liran Liss
  2016-05-25 18:06               ` Doug Ledford
  2 siblings, 1 reply; 23+ messages in thread
From: Hefty, Sean @ 2016-05-24 22:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: Liran Liss,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> > Under that model, each vendor only carries what they need.  It would
> > then be libibverbs responsibility to take that driver specific data
> > and
> 
> Either patchset could go in this direction. This is a basic question
> we need to decide on.
> 
> I'm starting to think the basic thrust of the Mellanox series (provide
> an easy compatability with our userspace) is a sane approach. The
> other options look like far too much work to use as a starting point.

What matters is whether the work needs to be done anyway, and how can we get to the end design.  I don’t see a time crunch to force switching to ioctls.

> You are the only person I've heard who wants to restructure the
> libibverbs interface at the same time..
> 
> When I kicked off this process at the OFA meeting the discussions
> revolved around some very clear goals:
>  1) Comprehensively close the security issue
>  2) Provide 100% support for existing libibverbs users on the current
>     libibverbs API
>  3) Restructure obvious known weaknesses (eg IB specific areas)
>  4) Add additional extension points

I still don't think we even have agreement on what this interface should be.  There is not a single user space library that's being targeted here.  If the only goal is to support libibverbs, then we need another option for devices that are not verbs based, but expose hardware specific structures to user space.  This is why you see objections to putting this under uverbs.  Non-verbs use cases keep getting completely overlooked in the conversation.




--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A82373AB0502ED-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2016-05-24 23:13                   ` Jason Gunthorpe
       [not found]                     ` <20160524231359.GA10664-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-05-24 23:13 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Doug Ledford, Liran Liss,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Tue, May 24, 2016 at 10:38:58PM +0000, Hefty, Sean wrote:

> > I'm starting to think the basic thrust of the Mellanox series (provide
> > an easy compatability with our userspace) is a sane approach. The
> > other options look like far too much work to use as a starting point.
> 
> What matters is whether the work needs to be done anyway, and how
> can we get to the end design.  I don?t see a time crunch to force
> switching to ioctls.

Well, it doesn't need to be done, the verbs calls we've already coded
do not *need* to be recoded - except due to the security problem.

> > You are the only person I've heard who wants to restructure the
> > libibverbs interface at the same time..
> > 
> > When I kicked off this process at the OFA meeting the discussions
> > revolved around some very clear goals:
> >  1) Comprehensively close the security issue
> >  2) Provide 100% support for existing libibverbs users on the current
> >     libibverbs API
> >  3) Restructure obvious known weaknesses (eg IB specific areas)
> >  4) Add additional extension points
> 
> I still don't think we even have agreement on what this interface
> should be.  There is not a single user space library that's being
> targeted here.

As I outlined, the genesis to do this was to migrate libibverbs to an
ioctl interface and in the process set a better stage for all the
other things people want to do. That includes better supporting
non-libibverbs libraries (dpdk, libfabric, whatever)

The purpose of the object/method style framework was to better set the
stage for the core code to provide that kind of wider interface for
the drivers.

A major family of objects/methods is going to be the libibverbs 1.0
support layer which is mandatory to port libibverbs to this interface.

The driver specific and any future 'non-verbs' common objects/methods
have little concrete to bring to the table right now. The cdev stuff
in hfi1/qib is very small and easilly handled by the bypass-to-driver
path that both patches include.

> hardware specific structures to user space.  This is why you see
> objections to putting this under uverbs.  Non-verbs use cases keep
> getting completely overlooked in the conversation.

I disagree, non-verbs is not a central topic of conversation because
nothing beyond needing a bypass-to-driver channel has been brought up
by that community. That need has clearly been met in both patch series.

Further, the basic things presented by both series so far are still
general enough to allow for new common uAPIs that are unrelated to the
verbs family.

I'm not sure what else can be done until the non-verbs universe makes
a consensus on what common kernel support it needs???

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]             ` <20160524214137.GA6760-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2016-05-24 22:38               ` Hefty, Sean
@ 2016-05-25 14:44               ` Liran Liss
  2016-05-25 18:06               ` Doug Ledford
  2 siblings, 0 replies; 23+ messages in thread
From: Liran Liss @ 2016-05-25 14:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: Hefty Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Wednesday, May 25, 2016 12:42 AM

> > Sean's proposal does away with the rigid nature of the current verbs
> > 1.0 API and makes it much more flexible on a per-driver basis.  This
> > doesn't address the end user API issues, but it at least cleans up the
> > user driver <-> kernel driver API so that one vendor's driver is not
> > forced to carry around all the baggage needed for every other vendor's
> drivers.
> 
> I'm not sure what you are reading, but to me both proposals look very similar.
> They are both based on the generic object/action/method sort of model I talked
> about in an earlier thread.
> 
> The main differences seem to boil down to the data marshalling and the
> dispatching style for the kernel side..
> 
> Sean hasn't explored how to encode the actual method arguments, while
> Mellanox's has a fairly well developed scheme with the netlink encoding and sgl
> result list thingy.
> 

BTW, we can easily incorporate Sean's facility for mapping user-->kernel objects as helper function for method handlers in the proposed patches.

> > Under that model, each vendor only carries what they need.  It would
> > then be libibverbs responsibility to take that driver specific data
> > and
> 
> Either patchset could go in this direction. This is a basic question we need to
> decide on.
> 
> I'm starting to think the basic thrust of the Mellanox series (provide an easy
> compatability with our userspace) is a sane approach. The other options look
> like far too much work to use as a starting point.
> 
> That doesn't mean we can't decide to move in a direct-only direction - the uAPI
> needs to have enough extension points to allow for that. Such work should
> happen incrementally, and mainly target new uAPIs.
> 

Right.

The proposed patches go a long way in enabling extensibility.
Typed parameters allow you to extend/replace/omit object properties as the interface evolves.
New object types allow you to introduce new concepts.
And the vendor channel can be used for arbitrary communication between a provider's user/kernel drivers.

We are not rushing anything.
--Liran

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                     ` <20160524231359.GA10664-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-25 14:59                       ` Hefty, Sean
       [not found]                         ` <1828884A29C6694DAF28B7E6B8A82373AB050592-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Hefty, Sean @ 2016-05-25 14:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, Liran Liss,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> As I outlined, the genesis to do this was to migrate libibverbs to an
> ioctl interface and in the process set a better stage for all the
> other things people want to do. That includes better supporting
> non-libibverbs libraries (dpdk, libfabric, whatever)

Then create ioctl 1 - direct mapping of the write interface
Ioctl 2-255 - extensions

This requires the least amount of work, supports what's needed, and is extensible.

> The purpose of the object/method style framework was to better set the
> stage for the core code to provide that kind of wider interface for
> the drivers.

I tried applying this to the rdma cm.  It's a very poor fit.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                         ` <1828884A29C6694DAF28B7E6B8A82373AB050592-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2016-05-25 17:06                           ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2016-05-25 17:06 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Doug Ledford, Liran Liss,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Wed, May 25, 2016 at 02:59:41PM +0000, Hefty, Sean wrote:
> > As I outlined, the genesis to do this was to migrate libibverbs to an
> > ioctl interface and in the process set a better stage for all the
> > other things people want to do. That includes better supporting
> > non-libibverbs libraries (dpdk, libfabric, whatever)
> 
> Then create ioctl 1 - direct mapping of the write interface
> Ioctl 2-255 - extensions

That is an option..

> > The purpose of the object/method style framework was to better set the
> > stage for the core code to provide that kind of wider interface for
> > the drivers.
> 
> I tried applying this to the rdma cm.  It's a very poor fit.

I'm very surprised by this, what do we have to do for rdma cm then?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]             ` <20160524214137.GA6760-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2016-05-24 22:38               ` Hefty, Sean
  2016-05-25 14:44               ` Liran Liss
@ 2016-05-25 18:06               ` Doug Ledford
       [not found]                 ` <5745E9AE.6020700-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2 siblings, 1 reply; 23+ messages in thread
From: Doug Ledford @ 2016-05-25 18:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liran Liss, Hefty Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)


[-- Attachment #1.1: Type: text/plain, Size: 5150 bytes --]

On 5/24/2016 5:41 PM, Jason Gunthorpe wrote:
> On Tue, May 24, 2016 at 01:57:54PM -0400, Doug Ledford wrote:
> 
>> Sean's proposal does away with the rigid nature of the current verbs 1.0
>> API and makes it much more flexible on a per-driver basis.  This doesn't
>> address the end user API issues, but it at least cleans up the user
>> driver <-> kernel driver API so that one vendor's driver is not forced
>> to carry around all the baggage needed for every other vendor's drivers.
> 
> I'm not sure what you are reading, but to me both proposals look very
> similar. They are both based on the generic object/action/method sort
> of model I talked about in an earlier thread.

They are similar in initial expression, but not in intent.  Sean's is
not concerned about preserving struct ib_qp (just as an example) as it
stands, while Mellanox's patchset is all about passing around the same
objects via a different interface.  Even though they encode objects in
netlink attributes, they are still expected to be the same basic objects
we use today (and this is how they minimize the driver impact).

> The main differences seem to boil down to the data marshalling and the
> dispatching style for the kernel side..

Data marshalling in Sean's case also entails data content changes with a
modest reorganization of what it entails for an item to be a core item
(Sean can correct me if I'm wrong here).

> Sean hasn't explored how to encode the actual method arguments, while
> Mellanox's has a fairly well developed scheme with the netlink
> encoding and sgl result list thingy.

You are correct that Sean's patch has very little in the way of argument
validation.  However, I'm not entirely sure that Sean intended the core
to do that sort of validation, he may have intended the drivers to do
their own on the passed through data.  The Mellanox patches do so, but
at the expense of netlink which many people on this list find painful to
read.

>> Under that model, each vendor only carries what they need.  It would
>> then be libibverbs responsibility to take that driver specific data
>> and
> 
> Either patchset could go in this direction. This is a basic question
> we need to decide on.

And this is my central point, that I tried to make in my previous email:

There are multiple trains of thought on where this will end up, and
simply switching from write to ioctl is only part of the overall big
picture.  There should only be one API break in this entire process, so
we need to make sure that any other possible API breakers are included
in the initial change to the ioctl interface.

> I'm starting to think the basic thrust of the Mellanox series (provide
> an easy compatability with our userspace) is a sane approach. The
> other options look like far too much work to use as a starting point.

I could not care less about this argument.  When you have to break an
API, you do what you have to do to do the job right.  Doing things the
right way may turn out to be the easy way, but the argument would be
because it's the right thing to do, not the easy thing to do.

> That doesn't mean we can't decide to move in a direct-only direction -
> the uAPI needs to have enough extension points to allow for that. Such
> work should happen incrementally, and mainly target new uAPIs.

This is arguable.  If we know we want to go basically direct only in the
future, then preserving the existing layer in the ioctl API eventually
becomes a burden.  It would be better to go direct only from the
beginning.  This needs to be settled.

>> and not also the user visible libibverbs API at the same time.  If all
>> we want to talk about is verbs 1.0 over ioctl, then yes, we can do that.
>>  But not if we truly want to discuss a verbs 2.0 API.  And I have yet to
>> gather from the discussions I hear from people that we are in fact
>> decided on pursuing a verbs 1.0 over ioctl API instead of considering a
>> verbs 2.0 API.
> 
> You are the only person I've heard who wants to restructure the
> libibverbs interface at the same time..

That's not entirely true.  My vision in my head for how we might start
altering the libibverbs interface is already being done (although with a
slightly different implementation than I had in mind) in the timestamp
patches.  What I want to see us do in libibverbs is to make extensions
start following a new pattern instead of the one they have traditionally
followed.

But the reason I bring this up is because we need to be thinking of the
end to end data transfers when we are thinking about the API break and
any changes we might make.  I'm thinking about possible changes in
libibverbs, Sean is thinking about libfabrics/psm2/hfi1.

If we end up just doing a behind the scenes switch from write to ioctl
with no changing of data structures or command flow or anything else,
then we can ignore the end to end picture because it won't change
significantly.  But if we do other things too, then I want other people
to keep things like this in mind, it's fundamental to good design in a
case like this.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                 ` <5745E9AE.6020700-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-05-25 19:00                   ` Jason Gunthorpe
       [not found]                     ` <20160525190039.GA5525-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-05-25 19:00 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Liran Liss, Hefty Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Wed, May 25, 2016 at 02:06:38PM -0400, Doug Ledford wrote:

> >> Sean's proposal does away with the rigid nature of the current verbs 1.0
> >> API and makes it much more flexible on a per-driver basis.  This doesn't
> >> address the end user API issues, but it at least cleans up the user
> >> driver <-> kernel driver API so that one vendor's driver is not forced
> >> to carry around all the baggage needed for every other vendor's drivers.
> > 
> > I'm not sure what you are reading, but to me both proposals look very
> > similar. They are both based on the generic object/action/method sort
> > of model I talked about in an earlier thread.
 
> They are similar in initial expression, but not in intent.  Sean's is
> not concerned about preserving struct ib_qp (just as an example) as it
> stands, while Mellanox's patchset is all about passing around the same
> objects via a different interface.  Even though they encode objects in
> netlink attributes, they are still expected to be the same basic objects
> we use today (and this is how they minimize the driver impact).

Again, I think you are reading way too much into both patches. Until
we go down the object by object patches it is too early to say how
exactly things will be translated in either case.

Eg Sean's comments about dis-aggregating the event_fd from the context
in the Mellanox series is a good example of the kind of discussion and
improvements I'm expecting to see on an object-by-object basis.

But today I view both series as primarily exploring the dispatch and
marshalling of the function calls - which is one of the first
questions to settle.

> their own on the passed through data.  The Mellanox patches do so, but
> at the expense of netlink which many people on this list find painful to
> read.

Well, I'm not sold on the netlink idea either, and I posted an
alternative already. If you see another alternative you should
describe it.

> There are multiple trains of thought on where this will end up, and
> simply switching from write to ioctl is only part of the overall big
> picture.  There should only be one API break in this entire process, so
> we need to make sure that any other possible API breakers are included
> in the initial change to the ioctl interface.

To be very clear, an *API* break is not tolerable. The libibverbs 1.0
API must still be 100% functional on the new interface. We are talking
about changing the *ABI* that transports that API. There is a lot
of latitude there, but at the end of the day all the same objects must
still exist in some form.

If people have ideas for different future APIs then the best approach
is to make room for them in the ABI. Sean's discussion on the event
stuff is a prime example of that process. eg we can make room in the
ABI for a different async event model by dis-aggregating things at the
ABI level. The old API still remains preserved by that kind of
dis-aggregation.

If someone wants to suggest a different starting point for this than
libibverbs 1.0 then they had better do the work and concretely
describe that soon. Otherwise I see no better option than to go
object-by-object through the libibverbs 1.0 API and collect comments.

Obvious existing avenues to source comments from are better support
for OPA, iWarp, rocee, dpdk and libfabric.

> > I'm starting to think the basic thrust of the Mellanox series (provide
> > an easy compatability with our userspace) is a sane approach. The
> > other options look like far too much work to use as a starting point.
> 
> I could not care less about this argument.  When you have to break an
> API, you do what you have to do to do the job right.  Doing things the
> right way may turn out to be the easy way, but the argument would be
> because it's the right thing to do, not the easy thing to do.

Usually giant monster changes result in failure, it is a bad software
engineering practice. Substantially recoding every driver and every
provider seems like it is too big a task to complete to me.

Just preserving the existing write interface structure-by-structure
also seems like a bad idea because we already know it has an
inadaquate ABI for current drivers.

> This is arguable.  If we know we want to go basically direct only in the
> future, then preserving the existing layer in the ioctl API eventually
> becomes a burden.  It would be better to go direct only from the
> beginning.  This needs to be settled.

It would be like any other migration, if we eventually reach a point
where the kernel fully supports something like direct-only for all
kernel drivers then we can talk about an obsolescence path for the
other interface.

But, it also isn't really clear if direct-only is even a good idea, I
would be really interested to see some patches exploring what that
would look like.

I'm still thinking of a hybrid common & direct approach as the leading
option ...

> If we end up just doing a behind the scenes switch from write to ioctl
> with no changing of data structures or command flow or anything else,
> then we can ignore the end to end picture because it won't change
> significantly.

You should describe what specific things you want to see.

It is time for the people who are hand waving about non-verbs, non-qp,
'new data transfers' to sit down and specify what they want to see out
of the kernel uABI.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                     ` <20160525190039.GA5525-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-25 19:31                       ` Doug Ledford
  2016-05-25 19:59                       ` Hefty, Sean
  1 sibling, 0 replies; 23+ messages in thread
From: Doug Ledford @ 2016-05-25 19:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liran Liss, Hefty Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

[-- Attachment #1: Type: text/plain, Size: 1507 bytes --]

On 05/25/2016 03:00 PM, Jason Gunthorpe wrote:
> On Wed, May 25, 2016 at 02:06:38PM -0400, Doug Ledford wrote:
> 

> To be very clear, an *API* break is not tolerable. The libibverbs 1.0
> API must still be 100% functional on the new interface.

You're drawing a distinction that doesn't exist.  Libibverbs presents
both an API and an ABI to user space applications, and the kernel
presents both an API and ABI to libibverbs (or libfabric or whatever
chooses to use it).  Changing from write to ioctl changes the kernel to
verbs API, and creates a new ABI, and possibly removes an old ABI if we
dropped the write interface.  The bare term API is not deterministic.

>> If we end up just doing a behind the scenes switch from write to ioctl
>> with no changing of data structures or command flow or anything else,
>> then we can ignore the end to end picture because it won't change
>> significantly.
> 
> You should describe what specific things you want to see.
> 
> It is time for the people who are hand waving about non-verbs, non-qp,
> 'new data transfers' to sit down and specify what they want to see out
> of the kernel uABI.

This is a fair point, but one I don't have time to respond to at the
moment.  I've already spent more time on this topic this week than I
should have given the other tasks I have at hand with the merge window
closing in a few days.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                     ` <20160525190039.GA5525-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2016-05-25 19:31                       ` Doug Ledford
@ 2016-05-25 19:59                       ` Hefty, Sean
       [not found]                         ` <1828884A29C6694DAF28B7E6B8A82373AB050907-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Hefty, Sean @ 2016-05-25 19:59 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: Liran Liss,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> It is time for the people who are hand waving about non-verbs, non-qp,
> 'new data transfers' to sit down and specify what they want to see out
> of the kernel uABI.

I want the core ioctl framework to do this:

1. Map user objects to/from the kernel to validate pointers.
2. Provide reference counting on such objects, in order to protect against the user attempting to destroy objects that are in use.
3. Provide protection that the state of an object is not driven in two different directions.
4. Safely process the removal of the underlying driver or device.
5. Support driver specific operations to the kernel.
6. Report events from multiple sources to a single file descriptor.
7. Minimize the kernel footprint for allocated kernel objects.
8. Minimize the code path needed to process any ioctl, without greatly affecting maintainability.
9. Copy ioctl input/output data to/from the kernel.

If designed correctly, I believe that the ioctl core will be suitably generic to support uverbs, ucm (if kept), urdma cm, umad (if there's a reason to add new ioctls), and any drivers.  Honestly, there's very little that would stand in the way of future subsystems (e.g. accelerators like Xeon Phi, maybe nvm) adopting it, if they so choose, though I'm not suggesting that be a goal.

Even limiting the discussion to short term needs, the rdma cm requires the same update.  The same ioctl core should work there as well.  Otherwise, it's a very good indication that we will not meet the needs of whatever future drivers are thrown to linux-rdma, except to say use a driver specific channel and re-implement the above functionality.  I would go as far as mixing rdma cm, verbs, and driver ioctls on the same fd (driven by shared event reporting), and that's a different file structure than what we have today.

IMO, the format of the ioctl data is a much lower concern and not something the framework should be concerned about.  From the above requirements, the framework needs to understand specific types of operations: creation, close, and ones that require exclusive access.  Event handling requires more thought.

I started re-working my patch to address the rdma cm.  The original proposal wasn't generic enough and provided a poor fit, because the rdma cm has a single object and 25 versions of 'modify'.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                         ` <1828884A29C6694DAF28B7E6B8A82373AB050907-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2016-05-25 20:51                           ` Jason Gunthorpe
       [not found]                             ` <20160525205156.GB5525-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-05-25 20:51 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Doug Ledford, Liran Liss,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Wed, May 25, 2016 at 07:59:29PM +0000, Hefty, Sean wrote:
> > It is time for the people who are hand waving about non-verbs, non-qp,
> > 'new data transfers' to sit down and specify what they want to see out
> > of the kernel uABI.
> 
> I want the core ioctl framework to do this:
> 
> 1. Map user objects to/from the kernel to validate pointers.
> 2. Provide reference counting on such objects, in order to protect against the user attempting to destroy objects that are in use.
> 3. Provide protection that the state of an object is not driven in two different directions.
> 4. Safely process the removal of the underlying driver or device.
> 5. Support driver specific operations to the kernel.
> 6. Report events from multiple sources to a single file descriptor.
> 7. Minimize the kernel footprint for allocated kernel objects.
> 8. Minimize the code path needed to process any ioctl, without greatly affecting maintainability.
> 9. Copy ioctl input/output data to/from the kernel.

I agree with all of this as a suitable goal for the common
dispatch/marshal framework.

I like the idea of the generic uobject stuff in your patches.

#6 seems 'not like the others' - seems like a reasonable concrete
thing to consider as we build the objects on the
framework. [understanding that compat operation is still required]

> I started re-working my patch to address the rdma cm.  The original
> proposal wasn't generic enough and provided a poor fit, because the
> rdma cm has a single object and 25 versions of 'modify'.

I was imagining that all objects would be able to have a set of object
specific functions beyond the common set. IIRC some verbs objects will
need this as well, and I expect drivers to want it.

This is why I wasn't thinking of using the ioctl # to dispatch the
method. It is best if the driver specific calls have globally unique
identifiers of some sort to protect against calling a method on the
wrong driver.

I would encourage you to go as far as Mellanox's series does and
demonstrate the verbs context object in this framework ...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                             ` <20160525205156.GB5525-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-25 21:46                               ` Hefty, Sean
       [not found]                                 ` <1828884A29C6694DAF28B7E6B8A82373AB050A07-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2016-05-26 18:07                               ` Liran Liss
  1 sibling, 1 reply; 23+ messages in thread
From: Hefty, Sean @ 2016-05-25 21:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, Liran Liss,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> > I want the core ioctl framework to do this:
> >
> > 1. Map user objects to/from the kernel to validate pointers.
> > 2. Provide reference counting on such objects, in order to protect against the user attempting to
> destroy objects that are in use.
> > 3. Provide protection that the state of an object is not driven in two different directions.
> > 4. Safely process the removal of the underlying driver or device.
> > 5. Support driver specific operations to the kernel.
> > 6. Report events from multiple sources to a single file descriptor.
> > 7. Minimize the kernel footprint for allocated kernel objects.
> > 8. Minimize the code path needed to process any ioctl, without greatly affecting maintainability.
> > 9. Copy ioctl input/output data to/from the kernel.
> 
> I agree with all of this as a suitable goal for the common
> dispatch/marshal framework.
> 
> I like the idea of the generic uobject stuff in your patches.
> 
> #6 seems 'not like the others' - seems like a reasonable concrete
> thing to consider as we build the objects on the
> framework. [understanding that compat operation is still required]

I agree.  Handling event reporting in general is slightly different from the other objections.  But it ties in with object close processing, so I think the framework needs something here.

#6 is based on examining the application APIs and driving requirements down.  I.e. poll a single fd to get all events: CQ notifications, async events, CM disconnect events, etc.  And while I'm thinking about it, expand events to support user specified.  This would allow many apps to rid themselves of socketpairs which currently fill that role.


> This is why I wasn't thinking of using the ioctl # to dispatch the
> method. It is best if the driver specific calls have globally unique
> identifiers of some sort to protect against calling a method on the
> wrong driver.

I thought of this as well.  I'll see if I can fit it into my next revision.


> I would encourage you to go as far as Mellanox's series does and
> demonstrate the verbs context object in this framework ...

I'm planning on going as far as issuing rdma cm commands, though through a 'quick' method that leaves the existing kernel code mostly intact.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                                 ` <1828884A29C6694DAF28B7E6B8A82373AB050A07-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2016-05-25 22:13                                   ` Jason Gunthorpe
       [not found]                                     ` <20160525221340.GB6207-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2016-05-25 22:47                                   ` Steve Wise
  1 sibling, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-05-25 22:13 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Doug Ledford, Liran Liss,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Wed, May 25, 2016 at 09:46:16PM +0000, Hefty, Sean wrote:
> > #6 seems 'not like the others' - seems like a reasonable concrete
> > thing to consider as we build the objects on the
> > framework. [understanding that compat operation is still required]
> 
> I agree.  Handling event reporting in general is slightly different
> from the other objections.  But it ties in with object close
> processing, so I think the framework needs something here.

Yes, some kind of core framework seem essential here, and funneling
all events seems reasonable if we can figure out how.

> #6 is based on examining the application APIs and driving
> requirements down.  I.e. poll a single fd to get all events: CQ
> notifications, async events, CM disconnect events, etc.  And while
> I'm thinking about it, expand events to support user specified.
> This would allow many apps to rid themselves of socketpairs which
> currently fill that role.

socketpairs? I always use eventfds for that kind of stuff..

Are you hearing people object to number of fds? I've never thought of
it as such a big deal, epoll makes it very painless to deal with.. Do
people have scaling issues?

What would be nice is having strong synchronization in the event
stream, eg between cm (dis)connection and related cq events so those
unexpected races go away...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                                     ` <20160525221340.GB6207-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-25 22:24                                       ` Hefty, Sean
  0 siblings, 0 replies; 23+ messages in thread
From: Hefty, Sean @ 2016-05-25 22:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, Liran Liss,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> Are you hearing people object to number of fds? I've never thought of
> it as such a big deal, epoll makes it very painless to deal with.. Do
> people have scaling issues?

This was an item brought up in the ofiwg requirement gathering phase.  Epoll is nice, and it's what libfabric uses internally to hide multiple fd's, but there are still more resources involved, plus additional overhead processing the events.  Exascale is one big scaling issue, and large core count systems exacerbate the problems.

- Sean 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                                 ` <1828884A29C6694DAF28B7E6B8A82373AB050A07-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2016-05-25 22:13                                   ` Jason Gunthorpe
@ 2016-05-25 22:47                                   ` Steve Wise
  1 sibling, 0 replies; 23+ messages in thread
From: Steve Wise @ 2016-05-25 22:47 UTC (permalink / raw)
  To: 'Hefty, Sean', 'Jason Gunthorpe'
  Cc: 'Doug Ledford', 'Liran Liss', 'linux-rdma'

> 
> > > I want the core ioctl framework to do this:
> > >
> > > 1. Map user objects to/from the kernel to validate pointers.
> > > 2. Provide reference counting on such objects, in order to protect against
the
> user attempting to
> > destroy objects that are in use.
> > > 3. Provide protection that the state of an object is not driven in two
different
> directions.
> > > 4. Safely process the removal of the underlying driver or device.
> > > 5. Support driver specific operations to the kernel.
> > > 6. Report events from multiple sources to a single file descriptor.
> > > 7. Minimize the kernel footprint for allocated kernel objects.
> > > 8. Minimize the code path needed to process any ioctl, without greatly
affecting
> maintainability.
> > > 9. Copy ioctl input/output data to/from the kernel.
> >
> > I agree with all of this as a suitable goal for the common
> > dispatch/marshal framework.
> >
> > I like the idea of the generic uobject stuff in your patches.
> >
> > #6 seems 'not like the others' - seems like a reasonable concrete
> > thing to consider as we build the objects on the
> > framework. [understanding that compat operation is still required]
> 
> I agree.  Handling event reporting in general is slightly different from the
other
> objections.  But it ties in with object close processing, so I think the
framework
> needs something here.
> 
> #6 is based on examining the application APIs and driving requirements down.
I.e.
> poll a single fd to get all events: CQ notifications, async events, CM
disconnect
> events, etc.  And while I'm thinking about it, expand events to support user
> specified.  This would allow many apps to rid themselves of socketpairs which
> currently fill that role.
>

Applications I have written that block typically put the CQ, async, rdmacm fds
all into an poll/epoll group and block that way.  But a common mechanism for
events might end up cleaner and simpler...



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                             ` <20160525205156.GB5525-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2016-05-25 21:46                               ` Hefty, Sean
@ 2016-05-26 18:07                               ` Liran Liss
       [not found]                                 ` <HE1PR05MB1418B4396F696F763D67A893B1410-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Liran Liss @ 2016-05-26 18:07 UTC (permalink / raw)
  To: Jason Gunthorpe, Hefty, Sean
  Cc: Doug Ledford,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Wednesday, May 25, 2016 11:52 PM

> >
> > I want the core ioctl framework to do this:
> >
> > 1. Map user objects to/from the kernel to validate pointers.
> > 2. Provide reference counting on such objects, in order to protect against the
> user attempting to destroy objects that are in use.
> > 3. Provide protection that the state of an object is not driven in two different
> directions.
> > 4. Safely process the removal of the underlying driver or device.
> > 5. Support driver specific operations to the kernel.
> > 6. Report events from multiple sources to a single file descriptor.
> > 7. Minimize the kernel footprint for allocated kernel objects.
> > 8. Minimize the code path needed to process any ioctl, without greatly
> affecting maintainability.
> > 9. Copy ioctl input/output data to/from the kernel.
> 
> I agree with all of this as a suitable goal for the common dispatch/marshal
> framework.
> 

The goals are OK.
However, I think that processing object locking before validation is not secure.
A malicious user could easily cause a deadlock due to ordering.

Rather, IDR processing should be offered as helper functions to method handlers *after* parameter validation.
This also removes the need to pass arrays of objects separately from the command attributes, which looks cumbersome.

I hope that we can show this in our next revision.
--Liran

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                                 ` <HE1PR05MB1418B4396F696F763D67A893B1410-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2016-05-26 18:43                                   ` Jason Gunthorpe
       [not found]                                     ` <20160526184348.GA22174-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-05-26 18:43 UTC (permalink / raw)
  To: Liran Liss
  Cc: Hefty, Sean, Doug Ledford,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Thu, May 26, 2016 at 06:07:02PM +0000, Liran Liss wrote:

> However, I think that processing object locking before validation is not secure.
> A malicious user could easily cause a deadlock due to ordering.

Hum, interesting point. I wonder if we already have bugs there today?

I'd say that is actually more likely to happen if it is open
coded. Nested object locking like that requires consistent ordering to
avoid deadlock..

It would be easier for the core code to enforce consistent nested lock
acquire ordering than to try and do that scattered across all the
code. You are right, the core code cannot just iterate over the
provided object list and grab the locks.

One very simple option is for the core to enforce a sort order on the
object list as it grabs the locks and fail&unwind for unordered lists.

Parameter validation doesn't really factor into this problem....

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                                     ` <20160526184348.GA22174-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-27  0:22                                       ` Hefty, Sean
       [not found]                                         ` <1828884A29C6694DAF28B7E6B8A82373AB05BBED-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Hefty, Sean @ 2016-05-27  0:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Liran Liss
  Cc: Doug Ledford,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> I'd say that is actually more likely to happen if it is open
> coded. Nested object locking like that requires consistent ordering to
> avoid deadlock..
> 
> It would be easier for the core code to enforce consistent nested lock
> acquire ordering than to try and do that scattered across all the
> code. You are right, the core code cannot just iterate over the
> provided object list and grab the locks.
> 
> One very simple option is for the core to enforce a sort order on the
> object list as it grabs the locks and fail&unwind for unordered lists.

I did not have object locking in my proposal.  I tried to maintain very simple locking, hoping it would make it possible for a mere human to validate that the locking was correct.  Plus it kept the size of the object structure smaller.

A lock associated with the idr is held while objects are acquired.  A simple reference count is bumped for each object when used.  If exclusive access is required, a flag is set.  If the exclusive flag is set, then any attempt to acquire the object is rejected and the ioctl fails.

Coordination with the removal of the kernel client (i.e. device) is done using a rw-lock.  Ioctls acquire a read lock, while the removal processing acquires the write lock.

Maybe there's a reason to create a more complex scheme than this, similar to what uverbs has, but AFAICT, this level of synchronization seems sufficient.  Any lower-level locking is the responsibility of the kernel client.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                                         ` <1828884A29C6694DAF28B7E6B8A82373AB05BBED-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2016-05-27 16:50                                           ` Jason Gunthorpe
       [not found]                                             ` <20160527165023.GA2449-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2016-05-27 16:50 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Liran Liss, Doug Ledford,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Fri, May 27, 2016 at 12:22:56AM +0000, Hefty, Sean wrote:

> A lock associated with the idr is held while objects are acquired.
> A simple reference count is bumped for each object when used.  If
> exclusive access is required, a flag is set.  If the exclusive flag
> is set, then any attempt to acquire the object is rejected and the
> ioctl fails.

How is that different from your basic idr+kref locking scheme common
in the kernel?

What is a non-blocking exclusive useful for? Removal?

> Coordination with the removal of the kernel client (i.e. device) is
> done using a rw-lock.  Ioctls acquire a read lock, while the removal
> processing acquires the write lock.

uverbs ended up using a more complex srcu scheme for this citing
performance reasons.. I argued for the simpler rwlock at the time..

> synchronization seems sufficient.  Any lower-level locking is the
> responsibility of the kernel client.

So there are still object locks, but the core doesn't help with them
at all.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [RFC] rdma/uverbs: Sketch for an ioctl framework
       [not found]                                             ` <20160527165023.GA2449-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-05-27 17:24                                               ` Hefty, Sean
  0 siblings, 0 replies; 23+ messages in thread
From: Hefty, Sean @ 2016-05-27 17:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liran Liss, Doug Ledford,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> > A lock associated with the idr is held while objects are acquired.
> > A simple reference count is bumped for each object when used.  If
> > exclusive access is required, a flag is set.  If the exclusive flag
> > is set, then any attempt to acquire the object is rejected and the
> > ioctl fails.
> 
> How is that different from your basic idr+kref locking scheme common
> in the kernel?

I don't view it as significantly different.  A kref is just an atomic anyway.

> What is a non-blocking exclusive useful for? Removal?

Removal and possibly modify calls.  Note that I only apply the exclusive flag to the first object in the list, so some ordering of what the objects are is required.  (I didn't expect clients to accept arbitrary ordering anyway.)

We can debate how the ioctl framework should serialize ioctl processing, versus pushing that functionality either into user space or the ioctl handlers.  My proposal serialized against device removal only, otherwise it failed the call if exclusive access was required.  This supports pushing the serialization either up or down, but it's outside the framework.

> > Coordination with the removal of the kernel client (i.e. device) is
> > done using a rw-lock.  Ioctls acquire a read lock, while the removal
> > processing acquires the write lock.
> 
> uverbs ended up using a more complex srcu scheme for this citing
> performance reasons.. I argued for the simpler rwlock at the time..

I guess I'm re-arguing for a simpler rwlock then.  :)  In any case, the actual locking mechanism can change.

> > synchronization seems sufficient.  Any lower-level locking is the
> > responsibility of the kernel client.
> 
> So there are still object locks, but the core doesn't help with them
> at all.

Moving this into a framework which doesn't understand what the objects actually are would be difficult.  I can see this being done by a uverbs core, which is one of the users of the ioctl framework.  The ucma is another user of this code.

- Sean 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-05-27 17:24 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-24  6:25 [RFC] rdma/uverbs: Sketch for an ioctl framework Hefty, Sean
     [not found] ` <1828884A29C6694DAF28B7E6B8A82373AB04FB7F-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2016-05-24 16:02   ` Liran Liss
     [not found]     ` <HE1PR05MB141819B27F9AAA360DCB420FB14F0-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2016-05-24 17:57       ` Doug Ledford
     [not found]         ` <11b6d9c1-0b20-f929-c896-ca084fe18192-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-05-24 21:41           ` Jason Gunthorpe
     [not found]             ` <20160524214137.GA6760-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-24 22:38               ` Hefty, Sean
     [not found]                 ` <1828884A29C6694DAF28B7E6B8A82373AB0502ED-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2016-05-24 23:13                   ` Jason Gunthorpe
     [not found]                     ` <20160524231359.GA10664-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-25 14:59                       ` Hefty, Sean
     [not found]                         ` <1828884A29C6694DAF28B7E6B8A82373AB050592-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2016-05-25 17:06                           ` Jason Gunthorpe
2016-05-25 14:44               ` Liran Liss
2016-05-25 18:06               ` Doug Ledford
     [not found]                 ` <5745E9AE.6020700-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-05-25 19:00                   ` Jason Gunthorpe
     [not found]                     ` <20160525190039.GA5525-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-25 19:31                       ` Doug Ledford
2016-05-25 19:59                       ` Hefty, Sean
     [not found]                         ` <1828884A29C6694DAF28B7E6B8A82373AB050907-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2016-05-25 20:51                           ` Jason Gunthorpe
     [not found]                             ` <20160525205156.GB5525-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-25 21:46                               ` Hefty, Sean
     [not found]                                 ` <1828884A29C6694DAF28B7E6B8A82373AB050A07-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2016-05-25 22:13                                   ` Jason Gunthorpe
     [not found]                                     ` <20160525221340.GB6207-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-25 22:24                                       ` Hefty, Sean
2016-05-25 22:47                                   ` Steve Wise
2016-05-26 18:07                               ` Liran Liss
     [not found]                                 ` <HE1PR05MB1418B4396F696F763D67A893B1410-eBadYZ65MZ87O8BmmlM1zNqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2016-05-26 18:43                                   ` Jason Gunthorpe
     [not found]                                     ` <20160526184348.GA22174-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-27  0:22                                       ` Hefty, Sean
     [not found]                                         ` <1828884A29C6694DAF28B7E6B8A82373AB05BBED-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2016-05-27 16:50                                           ` Jason Gunthorpe
     [not found]                                             ` <20160527165023.GA2449-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-05-27 17:24                                               ` Hefty, Sean

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.