All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH rdma-next 0/3] Core fixes and cleanup
@ 2017-03-19  8:55 Leon Romanovsky
  2017-03-19  8:55 ` [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to initialize device Leon Romanovsky
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Leon Romanovsky @ 2017-03-19  8:55 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Doug,

Please find below three patches for IB/core and this series
are based on v4.11-rc2.

Thanks

Jack Morgenstein (1):
  IB/core: Fix sysfs registration error flow

Leon Romanovsky (1):
  IB/umem: Drop hugetlb variable and VMA allocation

Parav Pandit (1):
  IB/core: Fix kernel crash during fail to initialize device

 drivers/infiniband/core/device.c          | 33 ++++++++++++++++++++-----------
 drivers/infiniband/core/sysfs.c           |  2 +-
 drivers/infiniband/core/umem.c            | 23 ++-------------------
 drivers/infiniband/core/umem_odp.c        |  1 -
 drivers/infiniband/hw/bnxt_re/ib_verbs.c  |  6 +-----
 drivers/infiniband/hw/i40iw/i40iw_verbs.c |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c  |  1 -
 include/rdma/ib_umem.h                    |  1 -
 8 files changed, 27 insertions(+), 42 deletions(-)

--
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to initialize device
  2017-03-19  8:55 [PATCH rdma-next 0/3] Core fixes and cleanup Leon Romanovsky
@ 2017-03-19  8:55 ` Leon Romanovsky
  2017-04-24 18:03   ` Parav Pandit
  2017-04-24 18:04   ` Parav Pandit
  2017-03-19  8:55 ` [PATCH rdma-next 3/3] IB/core: Fix sysfs registration error flow Leon Romanovsky
       [not found] ` <20170319085557.6023-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2 siblings, 2 replies; 11+ messages in thread
From: Leon Romanovsky @ 2017-03-19  8:55 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma, Parav Pandit, # v4 . 2+

From: Parav Pandit <parav@mellanox.com>

This patch fixes the kernel crash that occurs during ib_dealloc_device()
called due to provider driver fails with an error after
ib_alloc_device() and before it can register using ib_register_device().

This crashed seen in tha lab as below which can occur with any IB device
which fails to perform its device initialization before invoking
ib_register_device().

This patch avoids touching cache and port immutable structures if device
is not yet initialized.
It also releases related memory when cache and port immutable data
structure initialization fails during register_device() state.

[81416.561946] BUG: unable to handle kernel NULL pointer dereference at (null)
[81416.570340] IP: ib_cache_release_one+0x29/0x80 [ib_core]
[81416.576222] PGD 78da66067
[81416.576223] PUD 7f2d7c067
[81416.579484] PMD 0
[81416.582720]
[81416.587242] Oops: 0000 [#1] SMP
[81416.722395] task: ffff8807887515c0 task.stack: ffffc900062c0000
[81416.729148] RIP: 0010:ib_cache_release_one+0x29/0x80 [ib_core]
[81416.735793] RSP: 0018:ffffc900062c3a90 EFLAGS: 00010202
[81416.741823] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[81416.749785] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffff880859fec000
[81416.757757] RBP: ffffc900062c3aa0 R08: ffff8808536e5ac0 R09: ffff880859fec5b0
[81416.765708] R10: 00000000536e5c01 R11: ffff8808536e5ac0 R12: ffff880859fec000
[81416.773672] R13: 0000000000000000 R14: ffff8808536e5ac0 R15: ffff88084ebc0060
[81416.781621] FS:  00007fd879fab740(0000) GS:ffff88085fac0000(0000) knlGS:0000000000000000
[81416.790522] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[81416.797094] CR2: 0000000000000000 CR3: 00000007eb215000 CR4: 00000000003406e0
[81416.805051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[81416.812997] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[81416.820950] Call Trace:
[81416.824226]  ib_device_release+0x1e/0x40 [ib_core]
[81416.829858]  device_release+0x32/0xa0
[81416.834370]  kobject_cleanup+0x63/0x170
[81416.839058]  kobject_put+0x25/0x50
[81416.843319]  ib_dealloc_device+0x25/0x40 [ib_core]
[81416.848986]  mlx5_ib_add+0x163/0x1990 [mlx5_ib]
[81416.854414]  mlx5_add_device+0x5a/0x160 [mlx5_core]
[81416.860191]  mlx5_register_interface+0x8d/0xc0 [mlx5_core]
[81416.866587]  ? 0xffffffffa09e9000
[81416.870816]  mlx5_ib_init+0x15/0x17 [mlx5_ib]
[81416.876094]  do_one_initcall+0x51/0x1b0
[81416.880861]  ? __vunmap+0x85/0xd0
[81416.885113]  ? kmem_cache_alloc_trace+0x14b/0x1b0
[81416.890768]  ? vfree+0x2e/0x70
[81416.894762]  do_init_module+0x60/0x1fa
[81416.899441]  load_module+0x15f6/0x1af0
[81416.904114]  ? __symbol_put+0x60/0x60
[81416.908709]  ? ima_post_read_file+0x3d/0x80
[81416.913828]  ? security_kernel_post_read_file+0x6b/0x80
[81416.920006]  SYSC_finit_module+0xa6/0xf0
[81416.924888]  SyS_finit_module+0xe/0x10
[81416.929568]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[81416.935089] RIP: 0033:0x7fd879494949
[81416.939543] RSP: 002b:00007ffdbc1b4e58 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[81416.947982] RAX: ffffffffffffffda RBX: 0000000001b66f00 RCX: 00007fd879494949
[81416.955965] RDX: 0000000000000000 RSI: 000000000041a13c RDI: 0000000000000003
[81416.963926] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000001b652a0
[81416.971861] R10: 0000000000000003 R11: 0000000000000202 R12: 00007ffdbc1b3e70
[81416.979763] R13: 00007ffdbc1b3e50 R14: 0000000000000005 R15: 0000000000000000
[81417.008005] RIP: ib_cache_release_one+0x29/0x80 [ib_core] RSP: ffffc900062c3a90
[81417.016045] CR2: 0000000000000000

Fixes: 55aeed0654 ("IB/core: Make ib_alloc_device init the kobject")
Fixes: 7738613e7c ("IB/core: Add per port immutable struct to ib_device")
Cc: <stable@vger.kernel.org> # v4.2+
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
---
 drivers/infiniband/core/device.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 593d2ce6ec7c..64a2ae4d8eaa 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -172,8 +172,16 @@ static void ib_device_release(struct device *device)
 {
 	struct ib_device *dev = container_of(device, struct ib_device, dev);
 
-	ib_cache_release_one(dev);
-	kfree(dev->port_immutable);
+	WARN_ON(dev->reg_state == IB_DEV_REGISTERED);
+	if (dev->reg_state == IB_DEV_UNREGISTERED) {
+		/*
+		 * In IB_DEV_UNINITIALIZED state, cache or port table
+		 * is not even created. Free cache and port table only when
+		 * device reaches UNREGISTERED state.
+		 */
+		ib_cache_release_one(dev);
+		kfree(dev->port_immutable);
+	}
 	kfree(dev);
 }
 
@@ -366,32 +374,27 @@ int ib_register_device(struct ib_device *device,
 	ret = ib_cache_setup_one(device);
 	if (ret) {
 		pr_warn("Couldn't set up InfiniBand P_Key/GID cache\n");
-		goto out;
+		goto port_cleanup;
 	}
 
 	ret = ib_device_register_rdmacg(device);
 	if (ret) {
 		pr_warn("Couldn't register device with rdma cgroup\n");
-		ib_cache_cleanup_one(device);
-		goto out;
+		goto cache_cleanup;
 	}
 
 	memset(&device->attrs, 0, sizeof(device->attrs));
 	ret = device->query_device(device, &device->attrs, &uhw);
 	if (ret) {
 		pr_warn("Couldn't query the device attributes\n");
-		ib_device_unregister_rdmacg(device);
-		ib_cache_cleanup_one(device);
-		goto out;
+		goto cache_cleanup;
 	}
 
 	ret = ib_device_register_sysfs(device, port_callback);
 	if (ret) {
 		pr_warn("Couldn't register device %s with driver model\n",
 			device->name);
-		ib_device_unregister_rdmacg(device);
-		ib_cache_cleanup_one(device);
-		goto out;
+		goto cache_cleanup;
 	}
 
 	device->reg_state = IB_DEV_REGISTERED;
@@ -403,6 +406,14 @@ int ib_register_device(struct ib_device *device,
 	down_write(&lists_rwsem);
 	list_add_tail(&device->core_list, &device_list);
 	up_write(&lists_rwsem);
+	mutex_unlock(&device_mutex);
+	return 0;
+
+cache_cleanup:
+	ib_cache_cleanup_one(device);
+	ib_cache_release_one(device);
+port_cleanup:
+	kfree(device->port_immutable);
 out:
 	mutex_unlock(&device_mutex);
 	return ret;
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH rdma-next 2/3] IB/umem: Drop hugetlb variable and VMA allocation
       [not found] ` <20170319085557.6023-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2017-03-19  8:55   ` Leon Romanovsky
       [not found]     ` <20170319085557.6023-3-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2017-04-24 16:09   ` [PATCH rdma-next 0/3] Core fixes and cleanup Doug Ledford
  1 sibling, 1 reply; 11+ messages in thread
From: Leon Romanovsky @ 2017-03-19  8:55 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Leon Romanovsky,
	Shiraz Saleem, Christian Benvenuti, Selvin Xavier

From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

IB umem structure has field hugetlb which was assigned, but except
i40iw driver, it was actually never used.

This patch drops that variable out of the struct ib_umem, removes
all writes to it, get rids of VMA allocation which wasn't used and
removes check hugetlb from i40iw driver, because it is checked anyway in
i40iw_set_hugetlb_values() routine.

CC: Shiraz Saleem <shiraz.saleem-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
CC: Christian Benvenuti <benve-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
CC: Selvin Xavier <selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 drivers/infiniband/core/umem.c            | 23 ++---------------------
 drivers/infiniband/core/umem_odp.c        |  1 -
 drivers/infiniband/hw/bnxt_re/ib_verbs.c  |  6 +-----
 drivers/infiniband/hw/i40iw/i40iw_verbs.c |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c  |  1 -
 include/rdma/ib_umem.h                    |  1 -
 6 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 27f155d2df8d..6455256533df 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -37,7 +37,6 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
 #include <linux/export.h>
-#include <linux/hugetlb.h>
 #include <linux/slab.h>
 #include <rdma/ib_umem_odp.h>
 
@@ -85,7 +84,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 {
 	struct ib_umem *umem;
 	struct page **page_list;
-	struct vm_area_struct **vma_list;
 	unsigned long locked;
 	unsigned long lock_limit;
 	unsigned long cur_base;
@@ -143,9 +141,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 
 	umem->odp_data = NULL;
 
-	/* We assume the memory is from hugetlb until proved otherwise */
-	umem->hugetlb   = 1;
-
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
 	if (!page_list) {
 		put_pid(umem->pid);
@@ -153,14 +148,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		return ERR_PTR(-ENOMEM);
 	}
 
-	/*
-	 * if we can't alloc the vma_list, it's not so bad;
-	 * just assume the memory is not hugetlb memory
-	 */
-	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
-	if (!vma_list)
-		umem->hugetlb = 0;
-
 	npages = ib_umem_num_pages(umem);
 
 	down_write(&current->mm->mmap_sem);
@@ -194,7 +181,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		ret = get_user_pages(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     gup_flags, page_list, vma_list);
+				     gup_flags, page_list, NULL);
 
 		if (ret < 0)
 			goto out;
@@ -203,12 +190,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
 
-		for_each_sg(sg_list_start, sg, ret, i) {
-			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
-				umem->hugetlb = 0;
-
+		for_each_sg(sg_list_start, sg, ret, i)
 			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
-		}
 
 		/* preparing for next loop */
 		sg_list_start = sg;
@@ -237,8 +220,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		current->mm->pinned_vm = locked;
 
 	up_write(&current->mm->mmap_sem);
-	if (vma_list)
-		free_page((unsigned long) vma_list);
 	free_page((unsigned long) page_list);
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index cb2742b548bb..2d7b90b63ba0 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -325,7 +325,6 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
 		goto out_mm;
 	}
 
-	umem->hugetlb = 0;
 	umem->odp_data = kzalloc(sizeof(*umem->odp_data), GFP_KERNEL);
 	if (!umem->odp_data) {
 		ret_val = -ENOMEM;
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 33af2e3de399..60c86fe5a5a4 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -3063,16 +3063,12 @@ struct ib_mr *bnxt_re_reg_user_mr(struct ib_pd *ib_pd, u64 start, u64 length,
 	pbl_tbl_orig = pbl_tbl;
 
 	page_shift = ilog2(umem->page_size);
-	if (umem->hugetlb) {
-		dev_err(rdev_to_dev(rdev), "umem hugetlb not supported!");
-		rc = -EFAULT;
-		goto fail;
-	}
 	if (umem->page_size != PAGE_SIZE) {
 		dev_err(rdev_to_dev(rdev), "umem page size unsupported!");
 		rc = -EFAULT;
 		goto fail;
 	}
+
 	/* Map umem buf ptrs to the PBL */
 	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
 		pages = sg_dma_len(sg) >> page_shift;
diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
index 9b2849979756..59c8d74e3704 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
@@ -1850,7 +1850,7 @@ static struct ib_mr *i40iw_reg_user_mr(struct ib_pd *pd,
 	iwmr->page_size = region->page_size;
 	iwmr->page_msk = PAGE_MASK;
 
-	if (region->hugetlb && (req.reg_type == IW_MEMREG_TYPE_MEM))
+	if (req.reg_type == IW_MEMREG_TYPE_MEM)
 		i40iw_set_hugetlb_values(start, iwmr);
 
 	region_length = region->length + (start & (iwmr->page_size - 1));
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index c49db7c33979..f836f35f87ab 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -36,7 +36,6 @@
 #include <linux/dma-mapping.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
-#include <linux/hugetlb.h>
 #include <linux/iommu.h>
 #include <linux/workqueue.h>
 #include <linux/list.h>
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 2d83cfd7e6ce..3d06d25a33ac 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -46,7 +46,6 @@ struct ib_umem {
 	unsigned long		address;
 	int			page_size;
 	int                     writable;
-	int                     hugetlb;
 	struct work_struct	work;
 	struct pid             *pid;
 	struct mm_struct       *mm;
-- 
2.12.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH rdma-next 3/3] IB/core: Fix sysfs registration error flow
  2017-03-19  8:55 [PATCH rdma-next 0/3] Core fixes and cleanup Leon Romanovsky
  2017-03-19  8:55 ` [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to initialize device Leon Romanovsky
@ 2017-03-19  8:55 ` Leon Romanovsky
       [not found] ` <20170319085557.6023-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2 siblings, 0 replies; 11+ messages in thread
From: Leon Romanovsky @ 2017-03-19  8:55 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma, Jack Morgenstein, # v4 . 2+

From: Jack Morgenstein <jackm@dev.mellanox.co.il>

The kernel commit cited below restructured ib device management
so that the device kobject is initialized in ib_alloc_device.

As part of the restructuring, the kobject is now initialized in
procedure ib_alloc_device, and is later added to the device hierarchy
in the ib_register_device call stack, in procedure
ib_device_register_sysfs (which calls device_add).

However, in the ib_device_register_sysfs error flow, if an error
occurs following the call to device_add, the cleanup procedure
device_unregister is called. This call results in the device object
being deleted -- which results in various use-after-free crashes.

The correct cleanup call is device_del -- which undoes device_add
without deleting the device object.

The device object will then (correctly) be deleted in the
ib_register_device caller's error cleanup flow, when the caller invokes
ib_dealloc_device.

Fixes: 55aeed06544f6 ("IB/core: Make ib_alloc_device init the kobject")
Cc: <stable@vger.kernel.org> # v4.2+
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
---
 drivers/infiniband/core/sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index daadf3130c9f..48bb75503255 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -1301,7 +1301,7 @@ int ib_device_register_sysfs(struct ib_device *device,
 	free_port_list_attributes(device);
 
 err_unregister:
-	device_unregister(class_dev);
+	device_del(class_dev);
 
 err:
 	return ret;
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next 2/3] IB/umem: Drop hugetlb variable and VMA allocation
       [not found]     ` <20170319085557.6023-3-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2017-03-21  2:19       ` Shiraz Saleem
       [not found]         ` <20170321021936.GA18356-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Shiraz Saleem @ 2017-03-21  2:19 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Leon Romanovsky,
	Christian Benvenuti, Selvin Xavier,
	fenkes-tA70FqPdS9bQT0dZR+AlfA

On Sun, Mar 19, 2017 at 10:55:56AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> IB umem structure has field hugetlb which was assigned, but except
> i40iw driver, it was actually never used.
> 
> This patch drops that variable out of the struct ib_umem, removes
> all writes to it, get rids of VMA allocation which wasn't used and
> removes check hugetlb from i40iw driver, because it is checked anyway in
> i40iw_set_hugetlb_values() routine.
> 
> @@ -143,9 +141,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  
>  	umem->odp_data = NULL;
>  
> -	/* We assume the memory is from hugetlb until proved otherwise */
> -	umem->hugetlb   = 1;
> -
>  	page_list = (struct page **) __get_free_page(GFP_KERNEL);
>  	if (!page_list) {
>  		put_pid(umem->pid);
> @@ -153,14 +148,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		return ERR_PTR(-ENOMEM);
>  	}
>  
> -	/*
> -	 * if we can't alloc the vma_list, it's not so bad;
> -	 * just assume the memory is not hugetlb memory
> -	 */
> -	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
> -	if (!vma_list)
> -		umem->hugetlb = 0;
> -
>  	npages = ib_umem_num_pages(umem);
>  
>  	down_write(&current->mm->mmap_sem);
> @@ -194,7 +181,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		ret = get_user_pages(cur_base,
>  				     min_t(unsigned long, npages,
>  					   PAGE_SIZE / sizeof (struct page *)),
> -				     gup_flags, page_list, vma_list);
> +				     gup_flags, page_list, NULL);
>  
>  		if (ret < 0)
>  			goto out;
> @@ -203,12 +190,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		cur_base += ret * PAGE_SIZE;
>  		npages   -= ret;
>  
> -		for_each_sg(sg_list_start, sg, ret, i) {
> -			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
> -				umem->hugetlb = 0;
> -
> +		for_each_sg(sg_list_start, sg, ret, i)
>  			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
> -		}
>  
>  		/* preparing for next loop */
>  		sg_list_start = sg;

> diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> index 9b2849979756..59c8d74e3704 100644
> --- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> +++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> @@ -1850,7 +1850,7 @@ static struct ib_mr *i40iw_reg_user_mr(struct ib_pd *pd,
>  	iwmr->page_size = region->page_size;
>  	iwmr->page_msk = PAGE_MASK;
>  
> -	if (region->hugetlb && (req.reg_type == IW_MEMREG_TYPE_MEM))
> +	if (req.reg_type == IW_MEMREG_TYPE_MEM)
>  		i40iw_set_hugetlb_values(start, iwmr);

The code in ib_umem_get iterates through __all__ VMA’s corresponding to the 
user-space buffer and hugetlb field of the umem struct is set only if they 
are all using huge_pages. So in the essence, umem->hugetlb provides the 
driver info on whether __all__ pages of the memory region uses huge pages
or not.

i40iw_set_hugetlb_values only finds the VMA w.r.t the start virtual address of 
the user-space buffer and checks if __this__ VMA uses hugetlb pages or not. 
It is a sanity check done on the VMA found.
 
So the question is -- Do we know that only a single VMA represents the user-space 
buffer backed by huge pages?
If not, it doesnt appear that the check in i40iw_set_hugetlb_values will suffice to 
identify if the entire memory region is composed of huge pages since it doesnt
check all the VMAs. umem->hugetlb would be required for it.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next 2/3] IB/umem: Drop hugetlb variable and VMA allocation
       [not found]         ` <20170321021936.GA18356-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2017-03-21  8:14           ` Leon Romanovsky
       [not found]             ` <20170321081408.GZ2079-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Leon Romanovsky @ 2017-03-21  8:14 UTC (permalink / raw)
  To: Shiraz Saleem
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Christian Benvenuti, Selvin Xavier,
	fenkes-tA70FqPdS9bQT0dZR+AlfA

[-- Attachment #1: Type: text/plain, Size: 4511 bytes --]

On Mon, Mar 20, 2017 at 09:19:36PM -0500, Shiraz Saleem wrote:
> On Sun, Mar 19, 2017 at 10:55:56AM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >
> > IB umem structure has field hugetlb which was assigned, but except
> > i40iw driver, it was actually never used.
> >
> > This patch drops that variable out of the struct ib_umem, removes
> > all writes to it, get rids of VMA allocation which wasn't used and
> > removes check hugetlb from i40iw driver, because it is checked anyway in
> > i40iw_set_hugetlb_values() routine.
> >
> > @@ -143,9 +141,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >
> >  	umem->odp_data = NULL;
> >
> > -	/* We assume the memory is from hugetlb until proved otherwise */
> > -	umem->hugetlb   = 1;
> > -
> >  	page_list = (struct page **) __get_free_page(GFP_KERNEL);
> >  	if (!page_list) {
> >  		put_pid(umem->pid);
> > @@ -153,14 +148,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >  		return ERR_PTR(-ENOMEM);
> >  	}
> >
> > -	/*
> > -	 * if we can't alloc the vma_list, it's not so bad;
> > -	 * just assume the memory is not hugetlb memory
> > -	 */
> > -	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
> > -	if (!vma_list)
> > -		umem->hugetlb = 0;
> > -
> >  	npages = ib_umem_num_pages(umem);
> >
> >  	down_write(&current->mm->mmap_sem);
> > @@ -194,7 +181,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >  		ret = get_user_pages(cur_base,
> >  				     min_t(unsigned long, npages,
> >  					   PAGE_SIZE / sizeof (struct page *)),
> > -				     gup_flags, page_list, vma_list);
> > +				     gup_flags, page_list, NULL);
> >
> >  		if (ret < 0)
> >  			goto out;
> > @@ -203,12 +190,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >  		cur_base += ret * PAGE_SIZE;
> >  		npages   -= ret;
> >
> > -		for_each_sg(sg_list_start, sg, ret, i) {
> > -			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
> > -				umem->hugetlb = 0;
> > -
> > +		for_each_sg(sg_list_start, sg, ret, i)
> >  			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
> > -		}
> >
> >  		/* preparing for next loop */
> >  		sg_list_start = sg;
>
> > diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> > index 9b2849979756..59c8d74e3704 100644
> > --- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> > +++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> > @@ -1850,7 +1850,7 @@ static struct ib_mr *i40iw_reg_user_mr(struct ib_pd *pd,
> >  	iwmr->page_size = region->page_size;
> >  	iwmr->page_msk = PAGE_MASK;
> >
> > -	if (region->hugetlb && (req.reg_type == IW_MEMREG_TYPE_MEM))
> > +	if (req.reg_type == IW_MEMREG_TYPE_MEM)
> >  		i40iw_set_hugetlb_values(start, iwmr);
>
> The code in ib_umem_get iterates through __all__ VMA’s corresponding to the
> user-space buffer and hugetlb field of the umem struct is set only if they
> are all using huge_pages. So in the essence, umem->hugetlb provides the
> driver info on whether __all__ pages of the memory region uses huge pages
> or not.
>
> i40iw_set_hugetlb_values only finds the VMA w.r.t the start virtual address of
> the user-space buffer and checks if __this__ VMA uses hugetlb pages or not.
> It is a sanity check done on the VMA found.

>
> So the question is -- Do we know that only a single VMA represents the user-space
> buffer backed by huge pages?
> If not, it doesnt appear that the check in i40iw_set_hugetlb_values will suffice to
> identify if the entire memory region is composed of huge pages since it doesnt
> check all the VMAs. umem->hugetlb would be required for it.

You are right and thank you for pointing that. User memory buffer indeed can
span more than one VMA. For example one reg_mr with READ property for
two adjacent VAMs, one with READ properties and another with READ-WRITE.

However, relying on hugetlb which was set during creation is not best
thing too. It can be changed during the application run. The mlx5 relies on
the mlx5_ib_cont_pages() function to understand the actual layout.

Thanks

>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next 2/3] IB/umem: Drop hugetlb variable and VMA allocation
       [not found]             ` <20170321081408.GZ2079-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-03-21 14:06               ` Shiraz Saleem
       [not found]                 ` <20170321140644.GA26628-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Shiraz Saleem @ 2017-03-21 14:06 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Christian Benvenuti, Selvin Xavier,
	fenkes-tA70FqPdS9bQT0dZR+AlfA

On Tue, Mar 21, 2017 at 10:14:08AM +0200, Leon Romanovsky wrote:
> On Mon, Mar 20, 2017 at 09:19:36PM -0500, Shiraz Saleem wrote:
> > On Sun, Mar 19, 2017 at 10:55:56AM +0200, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > >
> > > IB umem structure has field hugetlb which was assigned, but except
> > > i40iw driver, it was actually never used.
> > >
> > > This patch drops that variable out of the struct ib_umem, removes
> > > all writes to it, get rids of VMA allocation which wasn't used and
> > > removes check hugetlb from i40iw driver, because it is checked anyway in
> > > i40iw_set_hugetlb_values() routine.
> > >
> > > @@ -143,9 +141,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> > >
> > >  	umem->odp_data = NULL;
> > >
> > > -	/* We assume the memory is from hugetlb until proved otherwise */
> > > -	umem->hugetlb   = 1;
> > > -
> > >  	page_list = (struct page **) __get_free_page(GFP_KERNEL);
> > >  	if (!page_list) {
> > >  		put_pid(umem->pid);
> > > @@ -153,14 +148,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> > >  		return ERR_PTR(-ENOMEM);
> > >  	}
> > >
> > > -	/*
> > > -	 * if we can't alloc the vma_list, it's not so bad;
> > > -	 * just assume the memory is not hugetlb memory
> > > -	 */
> > > -	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
> > > -	if (!vma_list)
> > > -		umem->hugetlb = 0;
> > > -
> > >  	npages = ib_umem_num_pages(umem);
> > >
> > >  	down_write(&current->mm->mmap_sem);
> > > @@ -194,7 +181,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> > >  		ret = get_user_pages(cur_base,
> > >  				     min_t(unsigned long, npages,
> > >  					   PAGE_SIZE / sizeof (struct page *)),
> > > -				     gup_flags, page_list, vma_list);
> > > +				     gup_flags, page_list, NULL);
> > >
> > >  		if (ret < 0)
> > >  			goto out;
> > > @@ -203,12 +190,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> > >  		cur_base += ret * PAGE_SIZE;
> > >  		npages   -= ret;
> > >
> > > -		for_each_sg(sg_list_start, sg, ret, i) {
> > > -			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
> > > -				umem->hugetlb = 0;
> > > -
> > > +		for_each_sg(sg_list_start, sg, ret, i)
> > >  			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
> > > -		}
> > >
> > >  		/* preparing for next loop */
> > >  		sg_list_start = sg;
> >
> > > diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> > > index 9b2849979756..59c8d74e3704 100644
> > > --- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> > > +++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> > > @@ -1850,7 +1850,7 @@ static struct ib_mr *i40iw_reg_user_mr(struct ib_pd *pd,
> > >  	iwmr->page_size = region->page_size;
> > >  	iwmr->page_msk = PAGE_MASK;
> > >
> > > -	if (region->hugetlb && (req.reg_type == IW_MEMREG_TYPE_MEM))
> > > +	if (req.reg_type == IW_MEMREG_TYPE_MEM)
> > >  		i40iw_set_hugetlb_values(start, iwmr);
> >
> > The code in ib_umem_get iterates through __all__ VMA’s corresponding to the
> > user-space buffer and hugetlb field of the umem struct is set only if they
> > are all using huge_pages. So in the essence, umem->hugetlb provides the
> > driver info on whether __all__ pages of the memory region uses huge pages
> > or not.
> >
> > i40iw_set_hugetlb_values only finds the VMA w.r.t the start virtual address of
> > the user-space buffer and checks if __this__ VMA uses hugetlb pages or not.
> > It is a sanity check done on the VMA found.
> 
> >
> > So the question is -- Do we know that only a single VMA represents the user-space
> > buffer backed by huge pages?
> > If not, it doesnt appear that the check in i40iw_set_hugetlb_values will suffice to
> > identify if the entire memory region is composed of huge pages since it doesnt
> > check all the VMAs. umem->hugetlb would be required for it.
> 
> You are right and thank you for pointing that. User memory buffer indeed can
> span more than one VMA. For example one reg_mr with READ property for
> two adjacent VAMs, one with READ properties and another with READ-WRITE.
> 
> However, relying on hugetlb which was set during creation is not best
> thing too. It can be changed during the application run. The mlx5 relies on
> the mlx5_ib_cont_pages() function to understand the actual layout.
> 

The patch as is, will break us. We believe ib_umem_get is the correct 
place to scan the VMAs and report if the MR is composed of hugetlb pages via the
hugetlb field of umem. This way it can be done in one place and not in each 
individual driver.

Not sure I understand what you mean by MR changing during application run. And how
mlx5_ib_cont_pages is helping if this happens.         
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next 2/3] IB/umem: Drop hugetlb variable and VMA allocation
       [not found]                 ` <20170321140644.GA26628-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
@ 2017-03-21 16:36                   ` Leon Romanovsky
  0 siblings, 0 replies; 11+ messages in thread
From: Leon Romanovsky @ 2017-03-21 16:36 UTC (permalink / raw)
  To: Shiraz Saleem
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Christian Benvenuti, Selvin Xavier,
	fenkes-tA70FqPdS9bQT0dZR+AlfA

[-- Attachment #1: Type: text/plain, Size: 5381 bytes --]

On Tue, Mar 21, 2017 at 09:06:44AM -0500, Shiraz Saleem wrote:
> On Tue, Mar 21, 2017 at 10:14:08AM +0200, Leon Romanovsky wrote:
> > On Mon, Mar 20, 2017 at 09:19:36PM -0500, Shiraz Saleem wrote:
> > > On Sun, Mar 19, 2017 at 10:55:56AM +0200, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > >
> > > > IB umem structure has field hugetlb which was assigned, but except
> > > > i40iw driver, it was actually never used.
> > > >
> > > > This patch drops that variable out of the struct ib_umem, removes
> > > > all writes to it, get rids of VMA allocation which wasn't used and
> > > > removes check hugetlb from i40iw driver, because it is checked anyway in
> > > > i40iw_set_hugetlb_values() routine.
> > > >
> > > > @@ -143,9 +141,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> > > >
> > > >  	umem->odp_data = NULL;
> > > >
> > > > -	/* We assume the memory is from hugetlb until proved otherwise */
> > > > -	umem->hugetlb   = 1;
> > > > -
> > > >  	page_list = (struct page **) __get_free_page(GFP_KERNEL);
> > > >  	if (!page_list) {
> > > >  		put_pid(umem->pid);
> > > > @@ -153,14 +148,6 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> > > >  		return ERR_PTR(-ENOMEM);
> > > >  	}
> > > >
> > > > -	/*
> > > > -	 * if we can't alloc the vma_list, it's not so bad;
> > > > -	 * just assume the memory is not hugetlb memory
> > > > -	 */
> > > > -	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
> > > > -	if (!vma_list)
> > > > -		umem->hugetlb = 0;
> > > > -
> > > >  	npages = ib_umem_num_pages(umem);
> > > >
> > > >  	down_write(&current->mm->mmap_sem);
> > > > @@ -194,7 +181,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> > > >  		ret = get_user_pages(cur_base,
> > > >  				     min_t(unsigned long, npages,
> > > >  					   PAGE_SIZE / sizeof (struct page *)),
> > > > -				     gup_flags, page_list, vma_list);
> > > > +				     gup_flags, page_list, NULL);
> > > >
> > > >  		if (ret < 0)
> > > >  			goto out;
> > > > @@ -203,12 +190,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> > > >  		cur_base += ret * PAGE_SIZE;
> > > >  		npages   -= ret;
> > > >
> > > > -		for_each_sg(sg_list_start, sg, ret, i) {
> > > > -			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
> > > > -				umem->hugetlb = 0;
> > > > -
> > > > +		for_each_sg(sg_list_start, sg, ret, i)
> > > >  			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
> > > > -		}
> > > >
> > > >  		/* preparing for next loop */
> > > >  		sg_list_start = sg;
> > >
> > > > diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> > > > index 9b2849979756..59c8d74e3704 100644
> > > > --- a/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> > > > +++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
> > > > @@ -1850,7 +1850,7 @@ static struct ib_mr *i40iw_reg_user_mr(struct ib_pd *pd,
> > > >  	iwmr->page_size = region->page_size;
> > > >  	iwmr->page_msk = PAGE_MASK;
> > > >
> > > > -	if (region->hugetlb && (req.reg_type == IW_MEMREG_TYPE_MEM))
> > > > +	if (req.reg_type == IW_MEMREG_TYPE_MEM)
> > > >  		i40iw_set_hugetlb_values(start, iwmr);
> > >
> > > The code in ib_umem_get iterates through __all__ VMA’s corresponding to the
> > > user-space buffer and hugetlb field of the umem struct is set only if they
> > > are all using huge_pages. So in the essence, umem->hugetlb provides the
> > > driver info on whether __all__ pages of the memory region uses huge pages
> > > or not.
> > >
> > > i40iw_set_hugetlb_values only finds the VMA w.r.t the start virtual address of
> > > the user-space buffer and checks if __this__ VMA uses hugetlb pages or not.
> > > It is a sanity check done on the VMA found.
> >
> > >
> > > So the question is -- Do we know that only a single VMA represents the user-space
> > > buffer backed by huge pages?
> > > If not, it doesnt appear that the check in i40iw_set_hugetlb_values will suffice to
> > > identify if the entire memory region is composed of huge pages since it doesnt
> > > check all the VMAs. umem->hugetlb would be required for it.
> >
> > You are right and thank you for pointing that. User memory buffer indeed can
> > span more than one VMA. For example one reg_mr with READ property for
> > two adjacent VAMs, one with READ properties and another with READ-WRITE.
> >
> > However, relying on hugetlb which was set during creation is not best
> > thing too. It can be changed during the application run. The mlx5 relies on
> > the mlx5_ib_cont_pages() function to understand the actual layout.
> >
>
> The patch as is, will break us.

Doug,
Please be sure do NOT take this patch.
Do you want me to resubmit the series without this patch?

> We believe ib_umem_get is the correct
> place to scan the VMAs and report if the MR is composed of hugetlb pages via the
> hugetlb field of umem. This way it can be done in one place and not in each
> individual driver.

The problem with this approach that it is done at the creation of
umem region. The call to madvise(MADV_NOHUGEPAGE, ...) will clear
the hugetlb property from VMA without change of umem->hugetlb variable.

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rdma-next 0/3] Core fixes and cleanup
       [not found] ` <20170319085557.6023-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2017-03-19  8:55   ` [PATCH rdma-next 2/3] IB/umem: Drop hugetlb variable and VMA allocation Leon Romanovsky
@ 2017-04-24 16:09   ` Doug Ledford
  1 sibling, 0 replies; 11+ messages in thread
From: Doug Ledford @ 2017-04-24 16:09 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sun, 2017-03-19 at 10:55 +0200, Leon Romanovsky wrote:
> Hi Doug,
> 
> Please find below three patches for IB/core and this series
> are based on v4.11-rc2.
> 
> Thanks
> 
> Jack Morgenstein (1):
>   IB/core: Fix sysfs registration error flow
> 
> Leon Romanovsky (1):
>   IB/umem: Drop hugetlb variable and VMA allocation
> 
> Parav Pandit (1):
>   IB/core: Fix kernel crash during fail to initialize device
> 
>  drivers/infiniband/core/device.c          | 33 ++++++++++++++++++++-
> ----------
>  drivers/infiniband/core/sysfs.c           |  2 +-
>  drivers/infiniband/core/umem.c            | 23 ++-------------------
>  drivers/infiniband/core/umem_odp.c        |  1 -
>  drivers/infiniband/hw/bnxt_re/ib_verbs.c  |  6 +-----
>  drivers/infiniband/hw/i40iw/i40iw_verbs.c |  2 +-
>  drivers/infiniband/hw/usnic/usnic_uiom.c  |  1 -
>  include/rdma/ib_umem.h                    |  1 -
>  8 files changed, 27 insertions(+), 42 deletions(-)

Series, minus the patch that would break Intel's driver, applied.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
   
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to initialize device
  2017-03-19  8:55 ` [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to initialize device Leon Romanovsky
@ 2017-04-24 18:03   ` Parav Pandit
  2017-04-24 18:04   ` Parav Pandit
  1 sibling, 0 replies; 11+ messages in thread
From: Parav Pandit @ 2017-04-24 18:03 UTC (permalink / raw)
  To: Leon Romanovsky, Doug Ledford; +Cc: linux-rdma, # v4 . 2+

Hi Doug,

Did you get a chance to merge this fix that avoids kernel crash in error scenario?

Parav

> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: Sunday, March 19, 2017 3:56 AM
> To: Doug Ledford <dledford@redhat.com>
> Cc: linux-rdma@vger.kernel.org; Parav Pandit <parav@mellanox.com>; # v4 . 2+
> <stable@vger.kernel.org>
> Subject: [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to initialize
> device
> 
> From: Parav Pandit <parav@mellanox.com>
> 
> This patch fixes the kernel crash that occurs during ib_dealloc_device() called
> due to provider driver fails with an error after
> ib_alloc_device() and before it can register using ib_register_device().
> 
> This crashed seen in tha lab as below which can occur with any IB device which
> fails to perform its device initialization before invoking ib_register_device().
> 
> This patch avoids touching cache and port immutable structures if device is not
> yet initialized.
> It also releases related memory when cache and port immutable data structure
> initialization fails during register_device() state.
> 
> [81416.561946] BUG: unable to handle kernel NULL pointer dereference at (null)
> [81416.570340] IP: ib_cache_release_one+0x29/0x80 [ib_core] [81416.576222]
> PGD 78da66067 [81416.576223] PUD 7f2d7c067 [81416.579484] PMD 0
> [81416.582720] [81416.587242] Oops: 0000 [#1] SMP [81416.722395] task:
> ffff8807887515c0 task.stack: ffffc900062c0000 [81416.729148] RIP:
> 0010:ib_cache_release_one+0x29/0x80 [ib_core] [81416.735793] RSP:
> 0018:ffffc900062c3a90 EFLAGS: 00010202 [81416.741823] RAX:
> 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> [81416.749785] RDX: 0000000000000000 RSI: 0000000000000282 RDI:
> ffff880859fec000 [81416.757757] RBP: ffffc900062c3aa0 R08:
> ffff8808536e5ac0 R09: ffff880859fec5b0 [81416.765708] R10:
> 00000000536e5c01 R11: ffff8808536e5ac0 R12: ffff880859fec000
> [81416.773672] R13: 0000000000000000 R14: ffff8808536e5ac0 R15:
> ffff88084ebc0060 [81416.781621] FS:  00007fd879fab740(0000)
> GS:ffff88085fac0000(0000) knlGS:0000000000000000 [81416.790522] CS:  0010
> DS: 0000 ES: 0000 CR0: 0000000080050033 [81416.797094] CR2:
> 0000000000000000 CR3: 00000007eb215000 CR4: 00000000003406e0
> [81416.805051] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000 [81416.812997] DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400 [81416.820950] Call Trace:
> [81416.824226]  ib_device_release+0x1e/0x40 [ib_core] [81416.829858]
> device_release+0x32/0xa0 [81416.834370]  kobject_cleanup+0x63/0x170
> [81416.839058]  kobject_put+0x25/0x50 [81416.843319]
> ib_dealloc_device+0x25/0x40 [ib_core] [81416.848986]
> mlx5_ib_add+0x163/0x1990 [mlx5_ib] [81416.854414]
> mlx5_add_device+0x5a/0x160 [mlx5_core] [81416.860191]
> mlx5_register_interface+0x8d/0xc0 [mlx5_core] [81416.866587]  ?
> 0xffffffffa09e9000 [81416.870816]  mlx5_ib_init+0x15/0x17 [mlx5_ib]
> [81416.876094]  do_one_initcall+0x51/0x1b0 [81416.880861]  ?
> __vunmap+0x85/0xd0 [81416.885113]  ?
> kmem_cache_alloc_trace+0x14b/0x1b0
> [81416.890768]  ? vfree+0x2e/0x70
> [81416.894762]  do_init_module+0x60/0x1fa [81416.899441]
> load_module+0x15f6/0x1af0 [81416.904114]  ? __symbol_put+0x60/0x60
> [81416.908709]  ? ima_post_read_file+0x3d/0x80 [81416.913828]  ?
> security_kernel_post_read_file+0x6b/0x80
> [81416.920006]  SYSC_finit_module+0xa6/0xf0 [81416.924888]
> SyS_finit_module+0xe/0x10 [81416.929568]
> entry_SYSCALL_64_fastpath+0x1a/0xa9
> [81416.935089] RIP: 0033:0x7fd879494949
> [81416.939543] RSP: 002b:00007ffdbc1b4e58 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000139 [81416.947982] RAX: ffffffffffffffda RBX:
> 0000000001b66f00 RCX: 00007fd879494949 [81416.955965] RDX:
> 0000000000000000 RSI: 000000000041a13c RDI: 0000000000000003
> [81416.963926] RBP: 0000000000000003 R08: 0000000000000000 R09:
> 0000000001b652a0 [81416.971861] R10: 0000000000000003 R11:
> 0000000000000202 R12: 00007ffdbc1b3e70 [81416.979763] R13:
> 00007ffdbc1b3e50 R14: 0000000000000005 R15: 0000000000000000
> [81417.008005] RIP: ib_cache_release_one+0x29/0x80 [ib_core] RSP:
> ffffc900062c3a90 [81417.016045] CR2: 0000000000000000
> 
> Fixes: 55aeed0654 ("IB/core: Make ib_alloc_device init the kobject")
> Fixes: 7738613e7c ("IB/core: Add per port immutable struct to ib_device")
> Cc: <stable@vger.kernel.org> # v4.2+
> Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> Signed-off-by: Leon Romanovsky <leon@kernel.org>
> ---
>  drivers/infiniband/core/device.c | 33 ++++++++++++++++++++++-----------
>  1 file changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index 593d2ce6ec7c..64a2ae4d8eaa 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -172,8 +172,16 @@ static void ib_device_release(struct device *device)  {
>  	struct ib_device *dev = container_of(device, struct ib_device, dev);
> 
> -	ib_cache_release_one(dev);
> -	kfree(dev->port_immutable);
> +	WARN_ON(dev->reg_state == IB_DEV_REGISTERED);
> +	if (dev->reg_state == IB_DEV_UNREGISTERED) {
> +		/*
> +		 * In IB_DEV_UNINITIALIZED state, cache or port table
> +		 * is not even created. Free cache and port table only when
> +		 * device reaches UNREGISTERED state.
> +		 */
> +		ib_cache_release_one(dev);
> +		kfree(dev->port_immutable);
> +	}
>  	kfree(dev);
>  }
> 
> @@ -366,32 +374,27 @@ int ib_register_device(struct ib_device *device,
>  	ret = ib_cache_setup_one(device);
>  	if (ret) {
>  		pr_warn("Couldn't set up InfiniBand P_Key/GID cache\n");
> -		goto out;
> +		goto port_cleanup;
>  	}
> 
>  	ret = ib_device_register_rdmacg(device);
>  	if (ret) {
>  		pr_warn("Couldn't register device with rdma cgroup\n");
> -		ib_cache_cleanup_one(device);
> -		goto out;
> +		goto cache_cleanup;
>  	}
> 
>  	memset(&device->attrs, 0, sizeof(device->attrs));
>  	ret = device->query_device(device, &device->attrs, &uhw);
>  	if (ret) {
>  		pr_warn("Couldn't query the device attributes\n");
> -		ib_device_unregister_rdmacg(device);
> -		ib_cache_cleanup_one(device);
> -		goto out;
> +		goto cache_cleanup;
>  	}
> 
>  	ret = ib_device_register_sysfs(device, port_callback);
>  	if (ret) {
>  		pr_warn("Couldn't register device %s with driver model\n",
>  			device->name);
> -		ib_device_unregister_rdmacg(device);
> -		ib_cache_cleanup_one(device);
> -		goto out;
> +		goto cache_cleanup;
>  	}
> 
>  	device->reg_state = IB_DEV_REGISTERED; @@ -403,6 +406,14 @@ int
> ib_register_device(struct ib_device *device,
>  	down_write(&lists_rwsem);
>  	list_add_tail(&device->core_list, &device_list);
>  	up_write(&lists_rwsem);
> +	mutex_unlock(&device_mutex);
> +	return 0;
> +
> +cache_cleanup:
> +	ib_cache_cleanup_one(device);
> +	ib_cache_release_one(device);
> +port_cleanup:
> +	kfree(device->port_immutable);
>  out:
>  	mutex_unlock(&device_mutex);
>  	return ret;
> --
> 2.12.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to initialize device
  2017-03-19  8:55 ` [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to initialize device Leon Romanovsky
  2017-04-24 18:03   ` Parav Pandit
@ 2017-04-24 18:04   ` Parav Pandit
  1 sibling, 0 replies; 11+ messages in thread
From: Parav Pandit @ 2017-04-24 18:04 UTC (permalink / raw)
  To: Leon Romanovsky, Doug Ledford; +Cc: linux-rdma, # v4 . 2+

I lately saw the ack from you that its merged.

Thanks,
Parav

> -----Original Message-----
> From: Parav Pandit
> Sent: Monday, April 24, 2017 1:04 PM
> To: 'Leon Romanovsky' <leon@kernel.org>; Doug Ledford
> <dledford@redhat.com>
> Cc: linux-rdma@vger.kernel.org; # v4 . 2+ <stable@vger.kernel.org>
> Subject: RE: [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to
> initialize device
> 
> Hi Doug,
> 
> Did you get a chance to merge this fix that avoids kernel crash in error scenario?
> 
> Parav
> 
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon@kernel.org]
> > Sent: Sunday, March 19, 2017 3:56 AM
> > To: Doug Ledford <dledford@redhat.com>
> > Cc: linux-rdma@vger.kernel.org; Parav Pandit <parav@mellanox.com>; #
> > v4 . 2+ <stable@vger.kernel.org>
> > Subject: [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail
> > to initialize device
> >
> > From: Parav Pandit <parav@mellanox.com>
> >
> > This patch fixes the kernel crash that occurs during
> > ib_dealloc_device() called due to provider driver fails with an error
> > after
> > ib_alloc_device() and before it can register using ib_register_device().
> >
> > This crashed seen in tha lab as below which can occur with any IB
> > device which fails to perform its device initialization before invoking
> ib_register_device().
> >
> > This patch avoids touching cache and port immutable structures if
> > device is not yet initialized.
> > It also releases related memory when cache and port immutable data
> > structure initialization fails during register_device() state.
> >
> > [81416.561946] BUG: unable to handle kernel NULL pointer dereference
> > at (null) [81416.570340] IP: ib_cache_release_one+0x29/0x80 [ib_core]
> > [81416.576222] PGD 78da66067 [81416.576223] PUD 7f2d7c067
> > [81416.579484] PMD 0 [81416.582720] [81416.587242] Oops: 0000 [#1] SMP
> [81416.722395] task:
> > ffff8807887515c0 task.stack: ffffc900062c0000 [81416.729148] RIP:
> > 0010:ib_cache_release_one+0x29/0x80 [ib_core] [81416.735793] RSP:
> > 0018:ffffc900062c3a90 EFLAGS: 00010202 [81416.741823] RAX:
> > 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> > [81416.749785] RDX: 0000000000000000 RSI: 0000000000000282 RDI:
> > ffff880859fec000 [81416.757757] RBP: ffffc900062c3aa0 R08:
> > ffff8808536e5ac0 R09: ffff880859fec5b0 [81416.765708] R10:
> > 00000000536e5c01 R11: ffff8808536e5ac0 R12: ffff880859fec000
> > [81416.773672] R13: 0000000000000000 R14: ffff8808536e5ac0 R15:
> > ffff88084ebc0060 [81416.781621] FS:  00007fd879fab740(0000)
> > GS:ffff88085fac0000(0000) knlGS:0000000000000000 [81416.790522] CS:
> > 0010
> > DS: 0000 ES: 0000 CR0: 0000000080050033 [81416.797094] CR2:
> > 0000000000000000 CR3: 00000007eb215000 CR4: 00000000003406e0
> > [81416.805051] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000 [81416.812997] DR3: 0000000000000000 DR6:
> > 00000000fffe0ff0 DR7: 0000000000000400 [81416.820950] Call Trace:
> > [81416.824226]  ib_device_release+0x1e/0x40 [ib_core] [81416.829858]
> > device_release+0x32/0xa0 [81416.834370]  kobject_cleanup+0x63/0x170
> > [81416.839058]  kobject_put+0x25/0x50 [81416.843319]
> > ib_dealloc_device+0x25/0x40 [ib_core] [81416.848986]
> > mlx5_ib_add+0x163/0x1990 [mlx5_ib] [81416.854414]
> > mlx5_add_device+0x5a/0x160 [mlx5_core] [81416.860191]
> > mlx5_register_interface+0x8d/0xc0 [mlx5_core] [81416.866587]  ?
> > 0xffffffffa09e9000 [81416.870816]  mlx5_ib_init+0x15/0x17 [mlx5_ib]
> > [81416.876094]  do_one_initcall+0x51/0x1b0 [81416.880861]  ?
> > __vunmap+0x85/0xd0 [81416.885113]  ?
> > kmem_cache_alloc_trace+0x14b/0x1b0
> > [81416.890768]  ? vfree+0x2e/0x70
> > [81416.894762]  do_init_module+0x60/0x1fa [81416.899441]
> > load_module+0x15f6/0x1af0 [81416.904114]  ? __symbol_put+0x60/0x60
> > [81416.908709]  ? ima_post_read_file+0x3d/0x80 [81416.913828]  ?
> > security_kernel_post_read_file+0x6b/0x80
> > [81416.920006]  SYSC_finit_module+0xa6/0xf0 [81416.924888]
> > SyS_finit_module+0xe/0x10 [81416.929568]
> > entry_SYSCALL_64_fastpath+0x1a/0xa9
> > [81416.935089] RIP: 0033:0x7fd879494949 [81416.939543] RSP:
> > 002b:00007ffdbc1b4e58 EFLAGS: 00000202 ORIG_RAX:
> > 0000000000000139 [81416.947982] RAX: ffffffffffffffda RBX:
> > 0000000001b66f00 RCX: 00007fd879494949 [81416.955965] RDX:
> > 0000000000000000 RSI: 000000000041a13c RDI: 0000000000000003
> > [81416.963926] RBP: 0000000000000003 R08: 0000000000000000 R09:
> > 0000000001b652a0 [81416.971861] R10: 0000000000000003 R11:
> > 0000000000000202 R12: 00007ffdbc1b3e70 [81416.979763] R13:
> > 00007ffdbc1b3e50 R14: 0000000000000005 R15: 0000000000000000
> > [81417.008005] RIP: ib_cache_release_one+0x29/0x80 [ib_core] RSP:
> > ffffc900062c3a90 [81417.016045] CR2: 0000000000000000
> >
> > Fixes: 55aeed0654 ("IB/core: Make ib_alloc_device init the kobject")
> > Fixes: 7738613e7c ("IB/core: Add per port immutable struct to
> > ib_device")
> > Cc: <stable@vger.kernel.org> # v4.2+
> > Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > Signed-off-by: Leon Romanovsky <leon@kernel.org>
> > ---
> >  drivers/infiniband/core/device.c | 33
> > ++++++++++++++++++++++-----------
> >  1 file changed, 22 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/device.c
> > b/drivers/infiniband/core/device.c
> > index 593d2ce6ec7c..64a2ae4d8eaa 100644
> > --- a/drivers/infiniband/core/device.c
> > +++ b/drivers/infiniband/core/device.c
> > @@ -172,8 +172,16 @@ static void ib_device_release(struct device *device)  {
> >  	struct ib_device *dev = container_of(device, struct ib_device, dev);
> >
> > -	ib_cache_release_one(dev);
> > -	kfree(dev->port_immutable);
> > +	WARN_ON(dev->reg_state == IB_DEV_REGISTERED);
> > +	if (dev->reg_state == IB_DEV_UNREGISTERED) {
> > +		/*
> > +		 * In IB_DEV_UNINITIALIZED state, cache or port table
> > +		 * is not even created. Free cache and port table only when
> > +		 * device reaches UNREGISTERED state.
> > +		 */
> > +		ib_cache_release_one(dev);
> > +		kfree(dev->port_immutable);
> > +	}
> >  	kfree(dev);
> >  }
> >
> > @@ -366,32 +374,27 @@ int ib_register_device(struct ib_device *device,
> >  	ret = ib_cache_setup_one(device);
> >  	if (ret) {
> >  		pr_warn("Couldn't set up InfiniBand P_Key/GID cache\n");
> > -		goto out;
> > +		goto port_cleanup;
> >  	}
> >
> >  	ret = ib_device_register_rdmacg(device);
> >  	if (ret) {
> >  		pr_warn("Couldn't register device with rdma cgroup\n");
> > -		ib_cache_cleanup_one(device);
> > -		goto out;
> > +		goto cache_cleanup;
> >  	}
> >
> >  	memset(&device->attrs, 0, sizeof(device->attrs));
> >  	ret = device->query_device(device, &device->attrs, &uhw);
> >  	if (ret) {
> >  		pr_warn("Couldn't query the device attributes\n");
> > -		ib_device_unregister_rdmacg(device);
> > -		ib_cache_cleanup_one(device);
> > -		goto out;
> > +		goto cache_cleanup;
> >  	}
> >
> >  	ret = ib_device_register_sysfs(device, port_callback);
> >  	if (ret) {
> >  		pr_warn("Couldn't register device %s with driver model\n",
> >  			device->name);
> > -		ib_device_unregister_rdmacg(device);
> > -		ib_cache_cleanup_one(device);
> > -		goto out;
> > +		goto cache_cleanup;
> >  	}
> >
> >  	device->reg_state = IB_DEV_REGISTERED; @@ -403,6 +406,14 @@ int
> > ib_register_device(struct ib_device *device,
> >  	down_write(&lists_rwsem);
> >  	list_add_tail(&device->core_list, &device_list);
> >  	up_write(&lists_rwsem);
> > +	mutex_unlock(&device_mutex);
> > +	return 0;
> > +
> > +cache_cleanup:
> > +	ib_cache_cleanup_one(device);
> > +	ib_cache_release_one(device);
> > +port_cleanup:
> > +	kfree(device->port_immutable);
> >  out:
> >  	mutex_unlock(&device_mutex);
> >  	return ret;
> > --
> > 2.12.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-04-24 18:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-19  8:55 [PATCH rdma-next 0/3] Core fixes and cleanup Leon Romanovsky
2017-03-19  8:55 ` [PATCH rdma-next 1/3] IB/core: Fix kernel crash during fail to initialize device Leon Romanovsky
2017-04-24 18:03   ` Parav Pandit
2017-04-24 18:04   ` Parav Pandit
2017-03-19  8:55 ` [PATCH rdma-next 3/3] IB/core: Fix sysfs registration error flow Leon Romanovsky
     [not found] ` <20170319085557.6023-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2017-03-19  8:55   ` [PATCH rdma-next 2/3] IB/umem: Drop hugetlb variable and VMA allocation Leon Romanovsky
     [not found]     ` <20170319085557.6023-3-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2017-03-21  2:19       ` Shiraz Saleem
     [not found]         ` <20170321021936.GA18356-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
2017-03-21  8:14           ` Leon Romanovsky
     [not found]             ` <20170321081408.GZ2079-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-03-21 14:06               ` Shiraz Saleem
     [not found]                 ` <20170321140644.GA26628-GOXS9JX10wfOxmVO0tvppfooFf0ArEBIu+b9c/7xato@public.gmane.org>
2017-03-21 16:36                   ` Leon Romanovsky
2017-04-24 16:09   ` [PATCH rdma-next 0/3] Core fixes and cleanup Doug Ledford

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.