From: Tariq Toukan <tariqt@mellanox.com>
To: Eric Dumazet <edumazet@google.com>,
	"David S . Miller" <davem@davemloft.net>
Cc: netdev <netdev@vger.kernel.org>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	John Sperbeck <jsperbeck@google.com>,
	Tarick Bedeir <tarick@google.com>,
	Qing Huang <qing.huang@oracle.com>,
	Daniel Jurgens <danielj@mellanox.com>,
	Zhu Yanjun <yanjun.zhu@oracle.com>,
	Tariq Toukan <tariqt@mellanox.com>
Subject: Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation
Date: Wed, 30 May 2018 16:49:02 +0300	[thread overview]
Message-ID: <3392f810-9776-f31e-1b9f-8a6543d89251@mellanox.com> (raw)
In-Reply-To: <20180530041152.113393-1-edumazet@google.com>



On 30/05/2018 7:11 AM, Eric Dumazet wrote:
> Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
> brought a regression caught in our regression suite, thanks to KASAN.
> 
> Note that mlx4_alloc_icm() is already able to try high order allocations
> and fallback to low-order allocations under high memory pressure.
> 
> We only have to tweak gfp_mask a bit, to help falling back faster,
> without risking OOM killings.
> 
> BUG: KASAN: slab-out-of-bounds in to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
> Read of size 4 at addr ffff8817df584f68 by task qp_listing_test/92585
> 
> CPU: 38 PID: 92585 Comm: qp_listing_test Tainted: G           O
> Call Trace:
>   [<ffffffffba80d7bb>] dump_stack+0x4d/0x72
>   [<ffffffffb951dc5f>] print_address_description+0x6f/0x260
>   [<ffffffffb951e1c7>] kasan_report+0x257/0x370
>   [<ffffffffb951e339>] __asan_report_load4_noabort+0x19/0x20
>   [<ffffffffc0256d28>] to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
>   [<ffffffffc02785b3>] mlx4_ib_query_qp+0x1213/0x1660 [mlx4_ib]
>   [<ffffffffc02dbfdb>] qpstat_print_qp+0x13b/0x500 [ib_uverbs]
>   [<ffffffffc02dc3ea>] qpstat_seq_show+0x4a/0xb0 [ib_uverbs]
>   [<ffffffffb95f125c>] seq_read+0xa9c/0x1230
>   [<ffffffffb96e0821>] proc_reg_read+0xc1/0x180
>   [<ffffffffb9577918>] __vfs_read+0xe8/0x730
>   [<ffffffffb9578057>] vfs_read+0xf7/0x300
>   [<ffffffffb95794d2>] SyS_read+0xd2/0x1b0
>   [<ffffffffb8e06b16>] do_syscall_64+0x186/0x420
>   [<ffffffffbaa00071>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x7f851a7bb30d
> RSP: 002b:00007ffd09a758c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 00007f84ff959440 RCX: 00007f851a7bb30d
> RDX: 000000000003fc00 RSI: 00007f84ff60a000 RDI: 000000000000000b
> RBP: 00007ffd09a75900 R08: 00000000ffffffff R09: 0000000000000000
> R10: 0000000000000022 R11: 0000000000000293 R12: 0000000000000000
> R13: 000000000003ffff R14: 000000000003ffff R15: 00007f84ff60a000
> 
> Allocated by task 4488:
>   save_stack+0x46/0xd0
>   kasan_kmalloc+0xad/0xe0
>   __kmalloc+0x101/0x5e0
>   ib_register_device+0xc03/0x1250 [ib_core]
>   mlx4_ib_add+0x27d6/0x4dd0 [mlx4_ib]
>   mlx4_add_device+0xa9/0x340 [mlx4_core]
>   mlx4_register_interface+0x16e/0x390 [mlx4_core]
>   xhci_pci_remove+0x7a/0x180 [xhci_pci]
>   do_one_initcall+0xa0/0x230
>   do_init_module+0x1b9/0x5a4
>   load_module+0x63e6/0x94c0
>   SYSC_init_module+0x1a4/0x1c0
>   SyS_init_module+0xe/0x10
>   do_syscall_64+0x186/0x420
>   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> Freed by task 0:
> (stack is not available)
> 
> The buggy address belongs to the object at ffff8817df584f40
>   which belongs to the cache kmalloc-32 of size 32
> The buggy address is located 8 bytes to the right of
>   32-byte region [ffff8817df584f40, ffff8817df584f60)
> The buggy address belongs to the page:
> page:ffffea005f7d6100 count:1 mapcount:0 mapping:ffff8817df584000 index:0xffff8817df584fc1
> flags: 0x880000000000100(slab)
> raw: 0880000000000100 ffff8817df584000 ffff8817df584fc1 000000010000003f
> raw: ffffea005f3ac0a0 ffffea005c476760 ffff8817fec00900 ffff883ff78d26c0
> page dumped because: kasan: bad access detected
> page->mem_cgroup:ffff883ff78d26c0
> 
> Memory state around the buggy address:
>   ffff8817df584e00: 00 03 fc fc fc fc fc fc 00 03 fc fc fc fc fc fc
>   ffff8817df584e80: 00 00 00 04 fc fc fc fc 00 00 00 fc fc fc fc fc
>> ffff8817df584f00: fb fb fb fb fc fc fc fc 00 00 00 00 fc fc fc fc
>                                                            ^
>   ffff8817df584f80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>   ffff8817df585000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 
> Fixes: 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: John Sperbeck <jsperbeck@google.com>
> Cc: Tarick Bedeir <tarick@google.com>
> Cc: Qing Huang <qing.huang@oracle.com>
> Cc: Daniel Jurgens <danielj@mellanox.com>
> Cc: Zhu Yanjun <yanjun.zhu@oracle.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/icm.c | 17 +++++++++++------
>   1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index 685337d58276fc91baeeb64387c52985e1bc6dda..cae33d5c7dbd9ba7929adcf2127b104f6796fa5a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,13 @@
>   #include "fw.h"
>   
>   /*
> - * We allocate in page size (default 4KB on many archs) chunks to avoid high
> - * order memory allocations in fragmented/high usage memory situation.
> + * We allocate in as big chunks as we can, up to a maximum of 256 KB
> + * per chunk. Note that the chunks are not necessarily in contiguous
> + * physical memory.
>    */
>   enum {
> -	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
> -	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
> +	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
> +	MLX4_TABLE_CHUNK_SIZE	= 1 << 18,
>   };
>   
>   static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
> @@ -135,6 +136,7 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
>   	struct mlx4_icm *icm;
>   	struct mlx4_icm_chunk *chunk = NULL;
>   	int cur_order;
> +	gfp_t mask;
>   	int ret;
>   
>   	/* We use sg_set_buf for coherent allocs, which assumes low memory */
> @@ -178,13 +180,16 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
>   		while (1 << cur_order > npages)
>   			--cur_order;
>   
> +		mask = gfp_mask;
> +		if (cur_order)
> +			mask = (mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
>   		if (coherent)
>   			ret = mlx4_alloc_icm_coherent(&dev->persist->pdev->dev,
>   						      &chunk->mem[chunk->npages],
> -						      cur_order, gfp_mask);
> +						      cur_order, mask);
>   		else
>   			ret = mlx4_alloc_icm_pages(&chunk->mem[chunk->npages],
> -						   cur_order, gfp_mask,
> +						   cur_order, mask,
>   						   dev->numa_node);
>   
>   		if (ret) {
> 

Thanks Eric.

I think it preserves the original intention of commit 1383cb8103bb 
("mlx4_core: allocate ICM memory in page size chunks").

Looks good to me.

Acked-by: Tariq Toukan <tariqt@mellanox.com>

Thread overview: 7+ messages
2018-05-30  4:11 [PATCH net] mlx4_core: restore optimal ICM memory allocation Eric Dumazet
2018-05-30 13:49 ` Tariq Toukan [this message]
2018-05-30 20:30 ` Qing Huang
2018-05-30 20:50   ` Eric Dumazet
2018-05-30 21:08     ` Qing Huang
2018-05-30 21:30       ` Eric Dumazet
2018-05-30 23:03         ` Qing Huang
