* [PATCH net-next 0/2] bnx2x: page allocation failure
@ 2016-09-16 21:30 Jason Baron
  2016-09-16 21:30 ` [PATCH net-next 1/2] bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments Jason Baron
  2016-09-16 21:30 ` [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list " Jason Baron
  0 siblings, 2 replies; 11+ messages in thread
From: Jason Baron @ 2016-09-16 21:30 UTC (permalink / raw)
  To: davem, Ariel.Elior; +Cc: Yuval.Mintz, netdev

Hi,

While configuring ~500 multicast addresses, we ran into high-order
page allocation failures. These allocations don't need to be
high-order, so I'm proposing to split them into allocations of at
most PAGE_SIZE each.
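
Both patches use the same pattern: replace the single large allocation
with a chain of page-sized "groups", each holding as many list
elements as fit in one page. A condensed sketch with generic names
(the real ones are bnx2x_mcast_list_elem_group / MCAST_ELEMS_PER_PG
in the diffs below):

	#include <linux/list.h>
	#include <linux/types.h>

	struct elem {				/* stand-in for the real list element */
		struct list_head link;
		u8 *mac;
	};

	struct elem_group {
		struct list_head group_link;	/* chains the page-sized groups */
		struct elem elems[];		/* flexible array fills out the page */
	};

	#define ELEMS_PER_PG \
		((PAGE_SIZE - sizeof(struct elem_group)) / sizeof(struct elem))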

Below is a sample failure.

Thanks,

-Jason

[1201902.617882] bnx2x: [bnx2x_set_mc_list:12374(eth0)]Failed to create multicast MACs list: -12
[1207325.695021] kworker/1:0: page allocation failure: order:2, mode:0xc020
[1207325.702059] CPU: 1 PID: 15805 Comm: kworker/1:0 Tainted: G        W
[1207325.712940] Hardware name: SYNNEX CORPORATION 1x8-X4i SSD 10GE/S5512LE, BIOS V8.810 05/16/2013
[1207325.722284] Workqueue: events bnx2x_sp_rtnl_task [bnx2x]
[1207325.728206]  0000000000000000 ffff88012d873a78 ffffffff8267f7c7 000000000000c020
[1207325.736754]  0000000000000000 ffff88012d873b08 ffffffff8212f8e0 fffffffc00000003
[1207325.745301]  ffff88041ffecd80 ffff880400000030 0000000000000002 0000c0206800da13
[1207325.753846] Call Trace:
[1207325.756789]  [<ffffffff8267f7c7>] dump_stack+0x4d/0x63
[1207325.762426]  [<ffffffff8212f8e0>] warn_alloc_failed+0xe0/0x130
[1207325.768756]  [<ffffffff8213c898>] ? wakeup_kswapd+0x48/0x140
[1207325.774914]  [<ffffffff82132afc>] __alloc_pages_nodemask+0x2bc/0x970
[1207325.781761]  [<ffffffff82173691>] alloc_pages_current+0x91/0x100
[1207325.788260]  [<ffffffff8212fa1e>] alloc_kmem_pages+0xe/0x10
[1207325.794329]  [<ffffffff8214c9c8>] kmalloc_order+0x18/0x50
[1207325.800227]  [<ffffffff8214ca26>] kmalloc_order_trace+0x26/0xb0
[1207325.806642]  [<ffffffff82451c68>] ? _xfer_secondary_pool+0xa8/0x1a0
[1207325.813404]  [<ffffffff8217cfda>] __kmalloc+0x19a/0x1b0
[1207325.819142]  [<ffffffffa02fe975>] bnx2x_set_rx_mode_inner+0x3d5/0x590 [bnx2x]
[1207325.827000]  [<ffffffffa02ff52d>] bnx2x_sp_rtnl_task+0x28d/0x760 [bnx2x]
[1207325.834197]  [<ffffffff820695d4>] process_one_work+0x134/0x3c0
[1207325.840522]  [<ffffffff82069981>] worker_thread+0x121/0x460
[1207325.846585]  [<ffffffff82069860>] ? process_one_work+0x3c0/0x3c0
[1207325.853089]  [<ffffffff8206f039>] kthread+0xc9/0xe0
[1207325.858459]  [<ffffffff82070000>] ? notify_die+0x10/0x40
[1207325.864263]  [<ffffffff8206ef70>] ? kthread_create_on_node+0x180/0x180
[1207325.871288]  [<ffffffff826852d2>] ret_from_fork+0x42/0x70
[1207325.877183]  [<ffffffff8206ef70>] ? kthread_create_on_node+0x180/0x180


Jason Baron (2):
  bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
  bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  85 ++++++++++-----
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c   | 131 ++++++++++++++++-------
 2 files changed, 151 insertions(+), 65 deletions(-)

-- 
2.6.1


* [PATCH net-next 1/2] bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
  2016-09-16 21:30 [PATCH net-next 0/2] bnx2x: page allocation failure Jason Baron
@ 2016-09-16 21:30 ` Jason Baron
  2016-09-16 21:30 ` [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list " Jason Baron
  1 sibling, 0 replies; 11+ messages in thread
From: Jason Baron @ 2016-09-16 21:30 UTC (permalink / raw)
  To: davem, Ariel.Elior; +Cc: Yuval.Mintz, netdev

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for 'mcast_list' to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 85 ++++++++++++++++--------
 1 file changed, 57 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index dab61a81a3ba..9f5b2d94e4df 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12563,43 +12563,70 @@ static int bnx2x_close(struct net_device *dev)
 	return 0;
 }
 
-static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
-				      struct bnx2x_mcast_ramrod_params *p)
+struct bnx2x_mcast_list_elem_group
 {
-	int mc_count = netdev_mc_count(bp->dev);
-	struct bnx2x_mcast_list_elem *mc_mac =
-		kcalloc(mc_count, sizeof(*mc_mac), GFP_ATOMIC);
-	struct netdev_hw_addr *ha;
+	struct list_head mcast_group_link;
+	struct bnx2x_mcast_list_elem mcast_elems[];
+};
 
-	if (!mc_mac) {
-		BNX2X_ERR("Failed to allocate mc MAC list\n");
-		return -ENOMEM;
+#define MCAST_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_list_elem_group)) / \
+	sizeof(struct bnx2x_mcast_list_elem))
+
+static void bnx2x_free_mcast_macs_list(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_list_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
+				      struct bnx2x_mcast_list_elem_group,
+				      mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		kfree(current_mcast_group);
 	}
+}
 
-	INIT_LIST_HEAD(&p->mcast_list);
+static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
+				      struct bnx2x_mcast_ramrod_params *p,
+				      struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_list_elem *mc_mac;
+	struct netdev_hw_addr *ha;
+	struct bnx2x_mcast_list_elem_group *current_mcast_group = NULL;
+	int mc_count = netdev_mc_count(bp->dev);
+	int i = 0, offset = 0, alloc_size;
 
+	INIT_LIST_HEAD(&p->mcast_list);
 	netdev_for_each_mc_addr(ha, bp->dev) {
+		if (!offset) {
+			if ((mc_count - i) < MCAST_ELEMS_PER_PG)
+				alloc_size = sizeof(struct list_head) +
+				(sizeof(struct bnx2x_mcast_list_elem) *
+				(mc_count - i));
+			else
+				alloc_size = PAGE_SIZE;
+			current_mcast_group = kmalloc(alloc_size, GFP_ATOMIC);
+			if (!current_mcast_group) {
+				bnx2x_free_mcast_macs_list(mcast_group_list);
+				BNX2X_ERR("Failed to allocate mc MAC list\n");
+				return -ENOMEM;
+			}
+			list_add(&current_mcast_group->mcast_group_link,
+				 mcast_group_list);
+		}
+		mc_mac = &current_mcast_group->mcast_elems[offset];
 		mc_mac->mac = bnx2x_mc_addr(ha);
 		list_add_tail(&mc_mac->link, &p->mcast_list);
-		mc_mac++;
+		offset++;
+		if (offset == MCAST_ELEMS_PER_PG) {
+			i += offset;
+			offset = 0;
+		}
 	}
-
 	p->mcast_list_len = mc_count;
-
 	return 0;
 }
 
-static void bnx2x_free_mcast_macs_list(
-	struct bnx2x_mcast_ramrod_params *p)
-{
-	struct bnx2x_mcast_list_elem *mc_mac =
-		list_first_entry(&p->mcast_list, struct bnx2x_mcast_list_elem,
-				 link);
-
-	WARN_ON(!mc_mac);
-	kfree(mc_mac);
-}
-
 /**
  * bnx2x_set_uc_list - configure a new unicast MACs list.
  *
@@ -12647,6 +12674,7 @@ static int bnx2x_set_uc_list(struct bnx2x *bp)
 
 static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 {
+	LIST_HEAD(mcast_group_list);
 	struct net_device *dev = bp->dev;
 	struct bnx2x_mcast_ramrod_params rparam = {NULL};
 	int rc = 0;
@@ -12662,7 +12690,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 
 	/* then, configure a new MACs list */
 	if (netdev_mc_count(dev)) {
-		rc = bnx2x_init_mcast_macs_list(bp, &rparam);
+		rc = bnx2x_init_mcast_macs_list(bp, &rparam, &mcast_group_list);
 		if (rc)
 			return rc;
 
@@ -12673,7 +12701,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 			BNX2X_ERR("Failed to set a new multicast configuration: %d\n",
 				  rc);
 
-		bnx2x_free_mcast_macs_list(&rparam);
+		bnx2x_free_mcast_macs_list(&mcast_group_list);
 	}
 
 	return rc;
@@ -12681,6 +12709,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 
 static int bnx2x_set_mc_list(struct bnx2x *bp)
 {
+	LIST_HEAD(mcast_group_list);
 	struct bnx2x_mcast_ramrod_params rparam = {NULL};
 	struct net_device *dev = bp->dev;
 	int rc = 0;
@@ -12692,7 +12721,7 @@ static int bnx2x_set_mc_list(struct bnx2x *bp)
 	rparam.mcast_obj = &bp->mcast_obj;
 
 	if (netdev_mc_count(dev)) {
-		rc = bnx2x_init_mcast_macs_list(bp, &rparam);
+		rc = bnx2x_init_mcast_macs_list(bp, &rparam, &mcast_group_list);
 		if (rc)
 			return rc;
 
@@ -12703,7 +12732,7 @@ static int bnx2x_set_mc_list(struct bnx2x *bp)
 			BNX2X_ERR("Failed to set a new multicast configuration: %d\n",
 				  rc);
 
-		bnx2x_free_mcast_macs_list(&rparam);
+		bnx2x_free_mcast_macs_list(&mcast_group_list);
 	} else {
 		/* If no mc addresses are required, flush the configuration */
 		rc = bnx2x_config_mcast(bp, &rparam, BNX2X_MCAST_CMD_DEL);
-- 
2.6.1


* [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
  2016-09-16 21:30 [PATCH net-next 0/2] bnx2x: page allocation failure Jason Baron
  2016-09-16 21:30 ` [PATCH net-next 1/2] bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments Jason Baron
@ 2016-09-16 21:30 ` Jason Baron
  2016-09-18 10:25   ` Mintz, Yuval
  1 sibling, 1 reply; 11+ messages in thread
From: Jason Baron @ 2016-09-16 21:30 UTC (permalink / raw)
  To: davem, Ariel.Elior; +Cc: Yuval.Mintz, netdev

Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 131 ++++++++++++++++++-------
 1 file changed, 94 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index d468380c2a23..6db8dd252d7c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2606,8 +2606,23 @@ struct bnx2x_mcast_bin_elem {
 	int type; /* BNX2X_MCAST_CMD_SET_{ADD, DEL} */
 };
 
+union bnx2x_mcast_elem {
+	struct bnx2x_mcast_bin_elem bin_elem;
+	struct bnx2x_mcast_mac_elem mac_elem;
+};
+
+struct bnx2x_mcast_elem_group {
+	struct list_head mcast_group_link;
+	union bnx2x_mcast_elem mcast_elems[];
+};
+
+#define MCAST_MAC_ELEMS_PER_PG \
+	((PAGE_SIZE - sizeof(struct bnx2x_mcast_elem_group)) / \
+	sizeof(union bnx2x_mcast_elem))
+
 struct bnx2x_pending_mcast_cmd {
 	struct list_head link;
+	struct list_head group_head;
 	int type; /* BNX2X_MCAST_CMD_X */
 	union {
 		struct list_head macs_head;
@@ -2638,16 +2653,30 @@ static int bnx2x_mcast_wait(struct bnx2x *bp,
 	return 0;
 }
 
+static void bnx2x_free_groups(struct list_head *mcast_group_list)
+{
+	struct bnx2x_mcast_elem_group *current_mcast_group;
+
+	while (!list_empty(mcast_group_list)) {
+		current_mcast_group = list_first_entry(mcast_group_list,
+				      struct bnx2x_mcast_elem_group,
+				      mcast_group_link);
+		list_del(&current_mcast_group->mcast_group_link);
+		kfree(current_mcast_group);
+	}
+}
+
 static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 				   struct bnx2x_mcast_obj *o,
 				   struct bnx2x_mcast_ramrod_params *p,
 				   enum bnx2x_mcast_cmd cmd)
 {
-	int total_sz;
 	struct bnx2x_pending_mcast_cmd *new_cmd;
-	struct bnx2x_mcast_mac_elem *cur_mac = NULL;
 	struct bnx2x_mcast_list_elem *pos;
-	int macs_list_len = 0, macs_list_len_size;
+	struct bnx2x_mcast_elem_group *elem_group;
+	struct bnx2x_mcast_mac_elem *mac_elem;
+	int i = 0, offset = 0, macs_list_len = 0;
+	int total_elems, alloc_size;
 
 	/* When adding MACs we'll need to store their values */
 	if (cmd == BNX2X_MCAST_CMD_ADD || cmd == BNX2X_MCAST_CMD_SET)
@@ -2657,50 +2686,68 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
 	if (!p->mcast_list_len)
 		return 0;
 
-	/* For a set command, we need to allocate sufficient memory for all
-	 * the bins, since we can't analyze at this point how much memory would
-	 * be required.
-	 */
-	macs_list_len_size = macs_list_len *
-			     sizeof(struct bnx2x_mcast_mac_elem);
-	if (cmd == BNX2X_MCAST_CMD_SET) {
-		int bin_size = BNX2X_MCAST_BINS_NUM *
-			       sizeof(struct bnx2x_mcast_bin_elem);
-
-		if (bin_size > macs_list_len_size)
-			macs_list_len_size = bin_size;
-	}
-	total_sz = sizeof(*new_cmd) + macs_list_len_size;
-
 	/* Add mcast is called under spin_lock, thus calling with GFP_ATOMIC */
-	new_cmd = kzalloc(total_sz, GFP_ATOMIC);
-
+	new_cmd = kzalloc(sizeof(*new_cmd), GFP_ATOMIC);
 	if (!new_cmd)
 		return -ENOMEM;
 
-	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
-	   cmd, macs_list_len);
-
 	INIT_LIST_HEAD(&new_cmd->data.macs_head);
-
+	INIT_LIST_HEAD(&new_cmd->group_head);
 	new_cmd->type = cmd;
 	new_cmd->done = false;
 
+	DP(BNX2X_MSG_SP, "About to enqueue a new %d command. macs_list_len=%d\n",
+	   cmd, macs_list_len);
+
 	switch (cmd) {
 	case BNX2X_MCAST_CMD_ADD:
 	case BNX2X_MCAST_CMD_SET:
-		cur_mac = (struct bnx2x_mcast_mac_elem *)
-			  ((u8 *)new_cmd + sizeof(*new_cmd));
-
-		/* Push the MACs of the current command into the pending command
-		 * MACs list: FIFO
+		/* For a set command, we need to allocate sufficient memory for
+		 * all the bins, since we can't analyze at this point how much
+		 * memory would be required.
 		 */
+		total_elems = macs_list_len;
+		if (cmd == BNX2X_MCAST_CMD_SET) {
+			if (total_elems < BNX2X_MCAST_BINS_NUM)
+				total_elems = BNX2X_MCAST_BINS_NUM;
+		}
+		i = total_elems;
+		while (i) {
+			if (i < MCAST_MAC_ELEMS_PER_PG) {
+				alloc_size = sizeof(struct list_head) +
+					     i * sizeof(union bnx2x_mcast_elem);
+				i = 0;
+			} else {
+				alloc_size = PAGE_SIZE;
+				i -= MCAST_MAC_ELEMS_PER_PG;
+			}
+			elem_group = kzalloc(alloc_size, GFP_ATOMIC);
+			if (!elem_group) {
+				bnx2x_free_groups(&new_cmd->group_head);
+				kfree(new_cmd);
+				return -ENOMEM;
+			}
+			list_add_tail(&elem_group->mcast_group_link,
+				      &new_cmd->group_head);
+		}
+		elem_group = list_first_entry(&new_cmd->group_head,
+					      struct bnx2x_mcast_elem_group,
+					      mcast_group_link);
 		list_for_each_entry(pos, &p->mcast_list, link) {
-			memcpy(cur_mac->mac, pos->mac, ETH_ALEN);
-			list_add_tail(&cur_mac->link, &new_cmd->data.macs_head);
-			cur_mac++;
+			mac_elem = &elem_group->mcast_elems[offset].mac_elem;
+			memcpy(mac_elem->mac, pos->mac, ETH_ALEN);
+			/* Push the MACs of the current command into the pending
+			 * command MACs list: FIFO
+			 */
+			list_add_tail(&mac_elem->link,
+				      &new_cmd->data.macs_head);
+			offset++;
+			if (offset == MCAST_MAC_ELEMS_PER_PG) {
+				offset = 0;
+				elem_group = (struct bnx2x_mcast_elem_group *)
+					     elem_group->mcast_group_link.next;
+			}
 		}
-
 		break;
 
 	case BNX2X_MCAST_CMD_DEL:
@@ -2978,7 +3025,8 @@ bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 	u64 cur[BNX2X_MCAST_VEC_SZ], req[BNX2X_MCAST_VEC_SZ];
 	struct bnx2x_mcast_mac_elem *pmac_pos, *pmac_pos_n;
 	struct bnx2x_mcast_bin_elem *p_item;
-	int i, cnt = 0, mac_cnt = 0;
+	struct bnx2x_mcast_elem_group *elem_group;
+	int cnt = 0, mac_cnt = 0, offset = 0, i;
 
 	memset(req, 0, sizeof(u64) * BNX2X_MCAST_VEC_SZ);
 	memcpy(cur, o->registry.aprox_match.vec,
@@ -3001,9 +3049,10 @@ bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 	 * a list that will be used to configure bins.
 	 */
 	cmd_pos->set_convert = true;
-	p_item = (struct bnx2x_mcast_bin_elem *)(cmd_pos + 1);
 	INIT_LIST_HEAD(&cmd_pos->data.macs_head);
-
+	elem_group = list_first_entry(&cmd_pos->group_head,
+				      struct bnx2x_mcast_elem_group,
+				      mcast_group_link);
 	for (i = 0; i < BNX2X_MCAST_BINS_NUM; i++) {
 		bool b_current = !!BIT_VEC64_TEST_BIT(cur, i);
 		bool b_required = !!BIT_VEC64_TEST_BIT(req, i);
@@ -3011,12 +3060,18 @@ bnx2x_mcast_hdl_pending_set_e2_convert(struct bnx2x *bp,
 		if (b_current == b_required)
 			continue;
 
+		p_item = &elem_group->mcast_elems[offset].bin_elem;
 		p_item->bin = i;
 		p_item->type = b_required ? BNX2X_MCAST_CMD_SET_ADD
 					  : BNX2X_MCAST_CMD_SET_DEL;
 		list_add_tail(&p_item->link , &cmd_pos->data.macs_head);
-		p_item++;
 		cnt++;
+		offset++;
+		if (offset == MCAST_MAC_ELEMS_PER_PG) {
+			offset = 0;
+			elem_group = (struct bnx2x_mcast_elem_group *)
+				     elem_group->mcast_group_link.next;
+		}
 	}
 
 	/* We now definitely know how many commands are hiding here.
@@ -3103,6 +3158,7 @@ static inline int bnx2x_mcast_handle_pending_cmds_e2(struct bnx2x *bp,
 		 */
 		if (cmd_pos->done) {
 			list_del(&cmd_pos->link);
+			bnx2x_free_groups(&cmd_pos->group_head);
 			kfree(cmd_pos);
 		}
 
@@ -3741,6 +3797,7 @@ static inline int bnx2x_mcast_handle_pending_cmds_e1(
 	}
 
 	list_del(&cmd_pos->link);
+	bnx2x_free_groups(&cmd_pos->group_head);
 	kfree(cmd_pos);
 
 	return cnt;
-- 
2.6.1


* RE: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
  2016-09-16 21:30 ` [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list " Jason Baron
@ 2016-09-18 10:25   ` Mintz, Yuval
  2016-09-19 18:33     ` Jason Baron
  0 siblings, 1 reply; 11+ messages in thread
From: Mintz, Yuval @ 2016-09-18 10:25 UTC (permalink / raw)
  To: Jason Baron, davem; +Cc: netdev, Ariel.Elior

> Currently, we can have high order page allocations that specify
> GFP_ATOMIC when configuring multicast MAC address filters.
> 
> For example, we have seen order 2 page allocation failures with
> ~500 multicast addresses configured.
> 
> Convert the allocation for the pending list to be done in PAGE_SIZE
> increments.
> 
> Signed-off-by: Jason Baron <jbaron@akamai.com>

While I appreciate the effort, I wonder whether it's worth it:

- The hardware [even in its newer generation] provides an
approximate, hash-based classification with 256 bins.
When configuring 500 multicast addresses, one can argue the
difference between multicast-promisc mode and actual configuration
is insignificant.
Perhaps the easier-to-maintain alternative would simply be to
determine the maximal number of multicast addresses that can be
configured using a single PAGE, and if in need of more than that
simply move into multicast-promisc (sketched below).

- While GFP_ATOMIC is required in this flow because it's being
called from atomic (non-sleeping) context, I do believe this is
mostly a remnant - it's possible that by slightly changing the
locking scheme we can have the configuration done from sleepable
context and simply switch to GFP_KERNEL instead.
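
Something along these lines (rough sketch only - the macro name here
is made up, though BNX2X_RX_MODE_ALLMULTI is the driver's existing
all-multicast mode):

	/* Cap exact filtering at what fits in one page; past that,
	 * just accept all multicast traffic instead.
	 */
	#define BNX2X_MCAST_SINGLE_PG_MAX \
		(PAGE_SIZE / sizeof(struct bnx2x_mcast_list_elem))

	if (netdev_mc_count(dev) > BNX2X_MCAST_SINGLE_PG_MAX)
		rx_mode = BNX2X_RX_MODE_ALLMULTI;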

Regarding the patch itself, the only comment I have:
> +			elem_group = (struct bnx2x_mcast_elem_group *)
> +				     elem_group->mcast_group_link.next;
Let's use list_next_entry() instead.
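
i.e., the same pointer walk via the standard helper:

	elem_group = list_next_entry(elem_group, mcast_group_link);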


* Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
  2016-09-18 10:25   ` Mintz, Yuval
@ 2016-09-19 18:33     ` Jason Baron
  2016-09-20  7:41       ` Mintz, Yuval
  2016-09-20 11:30       ` David Laight
  0 siblings, 2 replies; 11+ messages in thread
From: Jason Baron @ 2016-09-19 18:33 UTC (permalink / raw)
  To: Mintz, Yuval, davem; +Cc: netdev, Ariel.Elior

On 09/18/2016 06:25 AM, Mintz, Yuval wrote:
>> Currently, we can have high order page allocations that specify
>> GFP_ATOMIC when configuring multicast MAC address filters.
>>
>> For example, we have seen order 2 page allocation failures with
>> ~500 multicast addresses configured.
>>
>> Convert the allocation for the pending list to be done in PAGE_SIZE
>> increments.
>>
>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>
> While I appreciate the effort, I wonder whether it's worth it:
>
> - The hardware [even in its newer generation] provides an
> approximate, hash-based classification with 256 bins.
> When configuring 500 multicast addresses, one can argue the
> difference between multicast-promisc mode and actual configuration
> is insignificant.

With 256 bins, I think it takes close to 256*lg(256) = 2,048
multicast addresses before we'd expect every bin to have at least one
hash, assuming a uniform distribution of the hashes.
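
(Quick sanity check of that figure: the exact coupon-collector
expectation for 256 bins is 256 * H_256 ~= 1,568, so 2,048 leaves
some headroom. A userspace check, assuming uniform hashing:)

	#include <stdio.h>

	int main(void)
	{
		double expected = 0;
		int n = 256, i;

		/* coupon collector: on average n * H_n uniform draws
		 * before every one of the n bins is hit at least once
		 */
		for (i = 1; i <= n; i++)
			expected += (double)n / i;
		printf("%.0f\n", expected);	/* prints 1568 */
		return 0;
	}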

> Perhaps the easier-to-maintain alternative would simply be to
> determine the maximal number of multicast addresses that can be
> configured using a single PAGE, and if in need of more than that
> simply move into multicast-promisc.
>

sizeof(struct bnx2x_mcast_list_elem) is 24 bytes, so 170 fit per
page on x86. So if we want to fit 2,048 elements, we need about 12
pages.

>   - While GFP_ATOMIC is required in this flow because it's being
> called from atomic (non-sleeping) context, I do believe this is
> mostly a remnant - it's possible that by slightly changing the
> locking scheme we can have the configuration done from sleepable
> context and simply switch to GFP_KERNEL instead.
>

Ok, if it's GFP_KERNEL, I think it's still undesirable to do large
page-order allocations (unless of course it's converted to a single
page, but I'm not sure that makes sense, as mentioned).

> Regarding the patch itself, the only comment I have:
>> +			elem_group = (struct bnx2x_mcast_elem_group *)
>> +				     elem_group->mcast_group_link.next;
> Let's use list_next_entry() instead.
>
>

Yes, agreed.

I think it would be easy to add a check to bnx2x_set_rx_mode_inner() to
enforce some maximum number of elements (perhaps 2,048 based on the
above math) for the !CHIP_IS_E1() case on top of what I already posted.

Thanks,

-Jason


* RE: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
  2016-09-19 18:33     ` Jason Baron
@ 2016-09-20  7:41       ` Mintz, Yuval
  2016-09-20 14:52         ` Jason Baron
  2016-09-20 11:30       ` David Laight
  1 sibling, 1 reply; 11+ messages in thread
From: Mintz, Yuval @ 2016-09-20  7:41 UTC (permalink / raw)
  To: Jason Baron, davem; +Cc: netdev, Ariel.Elior

> >> Currently, we can have high order page allocations that specify
> >> GFP_ATOMIC when configuring multicast MAC address filters.
> >>
> >> For example, we have seen order 2 page allocation failures with
> >> ~500 multicast addresses configured.
> >>
> >> Convert the allocation for the pending list to be done in PAGE_SIZE
> >> increments.
> >>
> >> Signed-off-by: Jason Baron <jbaron@akamai.com>
> >
> > While I appreciate the effort, I wonder whether it's worth it:
> >
> > - The hardware [even in its newer generation] provides an
> > approximate, hash-based classification with 256 bins.
> > When configuring 500 multicast addresses, one can argue the difference
> > between multicast-promisc mode and actual configuration is
> > insignificant.
> 
> With 256 bins, I think it takes close to 256*lg(256) = 2,048 multicast addresses
> before we'd expect every bin to have at least one hash, assuming a uniform
> distribution of the hashes.
> 
> > Perhaps the easier-to-maintain alternative would simply be to
> > determine the maximal number of multicast addresses that can be
> > configured using a single PAGE, and if in need of more than that
> > simply move into multicast-promisc.
> >
> 
> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per page on x86. So
> if we want to fit 2,048 elements, we need 12 pages.

That's not exactly what I mean - let's assume you'd have problems
allocating more than a PAGE. According to your calculation, that
means you're already using more than 170 multicast addresses.
I didn't bother trying to solve the combinatorics question of how
many bins you'd use on average for 170 filters given there are only
256 bins, but that would be a significant portion.
The question I raised was whether it actually makes a difference
under such circumstances whether the device would actually filter
those multicast addresses or be completely multicast promiscuous.
E.g., whether it's significant to be filtering out multicast ingress
traffic when you're already allowing 1/2 of all random multicast
packets to be classified for the interface.

But again, given that you've actually taken the trouble of solving
this, I guess this question is mostly theoretical. We HAVE a better
solution now [a.k.a., yours ;-) ]

> I think it would be easy to add a check to bnx2x_set_rx_mode_inner() to enforce
> some maximum number of elements (perhaps 2,048 based on the above math)
> for the !CHIP_IS_E1() case on top of what I already posted.

The benefit would have been that we could have dropped your
solution by limiting the driver to use at most the number of filters
that would fit in a single page.
I don't think it would serve any purpose to take your change and
in addition choose some combinatorics-based upper limit.


* RE: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
  2016-09-19 18:33     ` Jason Baron
  2016-09-20  7:41       ` Mintz, Yuval
@ 2016-09-20 11:30       ` David Laight
  2016-09-20 18:46         ` Jason Baron
  1 sibling, 1 reply; 11+ messages in thread
From: David Laight @ 2016-09-20 11:30 UTC (permalink / raw)
  To: 'Jason Baron', Mintz, Yuval, davem; +Cc: netdev, Ariel.Elior

From: Jason Baron
> Sent: 19 September 2016 19:34
...
> 
> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per
> page on x86. So if we want to fit 2,048 elements, we need 12 pages.

If you only need to save the mcast addresses you could use a 'heap'
that requires no overhead per entry and gives you O(log n) lookup.
At 6 bytes per entry that's 682 entries in a 4k page.

	David


* Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
  2016-09-20  7:41       ` Mintz, Yuval
@ 2016-09-20 14:52         ` Jason Baron
  2016-09-20 15:00           ` Mintz, Yuval
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Baron @ 2016-09-20 14:52 UTC (permalink / raw)
  To: Mintz, Yuval, davem; +Cc: netdev, Ariel.Elior



On 09/20/2016 03:41 AM, Mintz, Yuval wrote:
>>>> Currently, we can have high order page allocations that specify
>>>> GFP_ATOMIC when configuring multicast MAC address filters.
>>>>
>>>> For example, we have seen order 2 page allocation failures with
>>>> ~500 multicast addresses configured.
>>>>
>>>> Convert the allocation for the pending list to be done in PAGE_SIZE
>>>> increments.
>>>>
>>>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>>>
>>> While I appreciate the effort, I wonder whether it's worth it:
>>>
>>> - The hardware [even in its newer generation] provides an
>>> approximate, hash-based classification with 256 bins.
>>> When configuring 500 multicast addresses, one can argue the difference
>>> between multicast-promisc mode and actual configuration is
>>> insignificant.
>>
>> With 256 bins, I think it takes close to 256*lg(256) = 2,048 multicast addresses
>> before we'd expect every bin to have at least one hash, assuming a uniform
>> distribution of the hashes.
>>
>>> Perhaps the easier-to-maintain alternative would simply be to
>>> determine the maximal number of multicast addresses that can be
>>> configured using a single PAGE, and if in need of more than that
>>> simply move into multicast-promisc.
>>>
>>
>> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per page on x86. So
>> if we want to fit 2,048 elements, we need 12 pages.
>
> That's not exactly what I mean - let's assume you'd have problems
> allocating more than a PAGE. According to your calculation, that
> means you're already using more than 170 multicast addresses.
> I didn't bother trying to solve the combinatorics question of how
> many bins you'd use on average for 170 filters given there are only
> 256 bins, but that would be a significant portion.

For 170 filters, I get an average of 124 bins in use out
of 256 possible bins.
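
(That's the standard occupancy expectation for m uniform hashes into
n bins, n * (1 - (1 - 1/n)^m) - quick userspace check:)

	#include <math.h>
	#include <stdio.h>

	int main(void)			/* build: cc bins.c -lm */
	{
		double n = 256, m = 170;

		printf("%.1f\n", n * (1.0 - pow(1.0 - 1.0 / n, m)));
		return 0;		/* prints 124.4 */
	}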

> The question I raised was whether it actually makes a difference
> under such circumstances whether the device would actually filter
> those multicast addresses or be completely multicast promiscuous.
> E.g., whether it's significant to be filtering out multicast ingress
> traffic when you're already allowing 1/2 of all random multicast
> packets to be classified for the interface.
>

Agreed, I think this is the more interesting question here. I thought
that we would want to make sure we are using most of the bins before
falling back to multicast-promiscuous mode. The reason being that even
if it's more expensive for the NIC to do the filtering than the
promiscuous mode, it would be more than made up for by not having to
drop the traffic higher up the stack. So I think if we can determine
the percentage of the bins that we want to use, we can then back into
the average number of filters required to get there. As I said, I
thought we would want to make sure we filled basically all the bins
(with a high probability, that is) before falling back to
multicast-promiscuous, and so I threw out 2,048.

Thanks,

-Jason


* RE: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
  2016-09-20 14:52         ` Jason Baron
@ 2016-09-20 15:00           ` Mintz, Yuval
  2016-09-20 15:19             ` Jason Baron
  0 siblings, 1 reply; 11+ messages in thread
From: Mintz, Yuval @ 2016-09-20 15:00 UTC (permalink / raw)
  To: Jason Baron, davem; +Cc: netdev, Ariel.Elior

> > The question I raised was whether it actually makes a difference under
> > such circumstances whether the device would actually filter those
> > multicast addresses or be completely multicast promiscuous.
> > E.g., whether it's significant to be filtering out multicast ingress
> > traffic when you're already allowing 1/2 of all random multicast
> > packets to be classified for the interface.
> >
> 
> Agreed, I think this is the more interesting question here. I thought that we
> would want to make sure we are using most of the bins before falling back to
> multicast-promiscuous mode. The reason being that even if it's more expensive for
> the NIC to do the filtering than the promiscuous mode, it would be more than made
> up for by not having to drop the traffic higher up the stack. So I think if we can
> determine the percentage of the bins that we want to use, we can then back into
> the average number of filters required to get there. As I said, I thought we would
> want to make sure we filled basically all the bins (with a high probability, that
> is) before falling back to multicast-promiscuous, and so I threw out 2,048.

AFAIK configuring multiple filters doesn't incur any performance penalty
from the adapter side.
And I agree that from an 'offloading' perspective it's probably better to
filter in HW even if the gain is negligible.
So for the upper limit - there's not much of a reason for it; the only gain
would be to prevent the driver from allocating lots and lots of memory
temporarily for an unnecessary configuration.


* Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
  2016-09-20 15:00           ` Mintz, Yuval
@ 2016-09-20 15:19             ` Jason Baron
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Baron @ 2016-09-20 15:19 UTC (permalink / raw)
  To: Mintz, Yuval, davem; +Cc: netdev, Ariel.Elior



On 09/20/2016 11:00 AM, Mintz, Yuval wrote:
>>> The question I raised was whether it actually makes a difference under
>>> such circumstances whether the device would actually filter those
>>> multicast addresses or be completely multicast promiscuous.
>>> E.g., whether it's significant to be filtering out multicast ingress
>>> traffic when you're already allowing 1/2 of all random multicast
>>> packets to be classified for the interface.
>>>
>>
>> Agreed, I think this is the more interesting question here. I thought that we
>> would want to make sure we are using most of the bins before falling back to
>> multicast-promiscuous mode. The reason being that even if it's more expensive for
>> the NIC to do the filtering than the promiscuous mode, it would be more than made
>> up for by not having to drop the traffic higher up the stack. So I think if we can
>> determine the percentage of the bins that we want to use, we can then back into
>> the average number of filters required to get there. As I said, I thought we would
>> want to make sure we filled basically all the bins (with a high probability, that
>> is) before falling back to multicast-promiscuous, and so I threw out 2,048.
>
> AFAIK configuring multiple filters doesn't incur any performance penalty
> from the adapter side.
> And I agree that from an 'offloading' perspective it's probably better to
> filter in HW even if the gain is negligible.
> So for the upper limit - there's not much of a reason for it; the only gain
> would be to prevent the driver from allocating lots and lots of memory
> temporarily for an unnecessary configuration.
>

Ok. We already have an upper limit to an extent with
/proc/sys/net/ipv4/igmp_max_memberships. And as posted I didn't include
one because of the higher-level limits already in place.

Thanks,

-Jason


* Re: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments
  2016-09-20 11:30       ` David Laight
@ 2016-09-20 18:46         ` Jason Baron
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Baron @ 2016-09-20 18:46 UTC (permalink / raw)
  To: David Laight, Mintz, Yuval, davem; +Cc: netdev, Ariel.Elior



On 09/20/2016 07:30 AM, David Laight wrote:
> From: Jason Baron
>> Sent: 19 September 2016 19:34
> ...
>>
>> sizeof(struct bnx2x_mcast_list_elem) = 24. So there are 170 per
>> page on x86. So if we want to fit 2,048 elements, we need 12 pages.
>
> If you only need to save the mcast addresses you could use a 'heap'
> that requires no overhead per entry and gives you O(log n) lookup.
> At 6 bytes per entry that's 682 entries in a 4k page.
>
> 	David
>

Indeed, that would save space here.

Based on Yuval's comments it sounds as though he agrees that it makes
sense to go beyond a page (even if we get 682 per page as you suggest)
when configuring these mac filters. So we would then have to allocate
and manage the page pointers either way. Currently, there is a
list_head per entry to manage the macs as a linked list. The patch I
proposed continues to use that same data structure, so it will not add
to the memory footprint; it only breaks that footprint up into
PAGE_SIZE chunks.
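
(For reference, a sketch of the packed layout you describe - plain
userspace C, just to illustrate the idea; not what the posted patches
do:)

	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>

	#define MACS_PER_PG (4096 / 6)	/* 682 six-byte entries per 4k page */

	static int mac_cmp(const void *a, const void *b)
	{
		return memcmp(a, b, 6);	/* compare two 6-byte addresses */
	}

	/* 'macs' is a sorted array of n six-byte addresses */
	static int mac_present(const uint8_t (*macs)[6], size_t n,
			       const uint8_t *mac)
	{
		return bsearch(mac, macs, n, 6, mac_cmp) != NULL;
	}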

So I think the change you suggest can be viewed as an additional
enhancement here. Also note that the memory allocations involved are
short-lived: they only exist until the NIC is re-configured.

Thanks,

-Jason

