netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
@ 2014-09-18 21:53 Mahesh Bandewar
  2014-09-19 10:00 ` Nikolay Aleksandrov
  0 siblings, 1 reply; 11+ messages in thread
From: Mahesh Bandewar @ 2014-09-18 21:53 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller
  Cc: netdev, Mahesh Bandewar, Eric Dumazet, Maciej Zenczykowski

Earlier change to use usable slave array for TLB mode had an additional
performance advantage. So extending the same logic to all other modes
that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
Also consolidating this with the earlier TLB change.

The main idea is to build the usable slaves array in the control path
and use that array for slave selection during xmit operation.

Measured performance in a setup with a bond of 4x1G NICs with 200
instances of netperf for the modes involved (3ad, xor, tlb)
cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5

Mode        TPS-Before   TPS-After

802.3ad   : 468,694      493,101
TLB (lb=0): 392,583      392,965
XOR       : 475,696      484,517

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
v1:
  (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
      the slave that need to be removed.
  (b) Freeing of array will assign NULL (to handle bond->down to bond->up
      transition gracefully.
  (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
      failure.
  (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
      will populate the array even if these parameters are not used.
  (e) 3AD: Should handle the ad_agg_selection_logic correctly.
v2:
  (a) Removed rcu_read_{un}lock() calls from array manipulation code.
  (b) Slave link-events now refresh array for all these modes.
  (c) Moved free-array call from bond_close() to bond_uninit().
v3:
  (a) Fixed null pointer dereference.
  (b) Removed bond->lock lockdep dependency.
v4:
  (a) Made to changes to comply with Nikolay's locking changes
  (b) Added a work-queue to refresh slave-array when RTNL is not held
  (c) Array refresh happens ONLY with RTNL now.
  (d) alloc changed from GFP_ATOMIC to GFP_KERNEL

 drivers/net/bonding/bond_3ad.c  |  88 +++++--------------
 drivers/net/bonding/bond_alb.c  |  51 ++---------
 drivers/net/bonding/bond_alb.h  |   8 --
 drivers/net/bonding/bond_main.c | 189 ++++++++++++++++++++++++++++++++++++++--
 drivers/net/bonding/bonding.h   |  10 +++
 5 files changed, 218 insertions(+), 128 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index 7e9e522fd476..4bf3756dcc11 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -1550,6 +1550,11 @@ static void ad_agg_selection_logic(struct aggregator *agg)
 				__disable_port(port);
 			}
 		}
+		/* RTNL may or may not be held but bond->mode_lock
+		 * is held. It's not safe to update slave-arr here.
+		 * Defer it to delayed-work.
+		 */
+		bond_slave_arr_work_rearm(bond);
 	}
 
 	/* if the selected aggregator is of join individuals
@@ -1688,6 +1693,11 @@ static void ad_enable_collecting_distributing(struct port *port)
 			 port->actor_port_number,
 			 port->aggregator->aggregator_identifier);
 		__enable_port(port);
+		/* RTNL is not be held and bond->mode_lock is held.
+		 * It's not safe to update slave-arr here!
+		 * Defer it to delayed-work.
+		 */
+		bond_slave_arr_work_rearm(port->slave->bond);
 	}
 }
 
@@ -1704,6 +1714,11 @@ static void ad_disable_collecting_distributing(struct port *port)
 			 port->actor_port_number,
 			 port->aggregator->aggregator_identifier);
 		__disable_port(port);
+		/* RTNL is not be held and bond->mode_lock is held.
+		 * It's not safe to update slave-arr here!
+		 * Defer it to delayed-work.
+		 */
+		bond_slave_arr_work_rearm(port->slave->bond);
 	}
 }
 
@@ -2283,6 +2298,12 @@ void bond_3ad_handle_link_change(struct slave *slave, char link)
 	port->sm_vars |= AD_PORT_BEGIN;
 
 	spin_unlock_bh(&slave->bond->mode_lock);
+
+	/* RTNL is held and mode_lock is released so it's safe
+	 * to update slave_array here.
+	 */
+	if (bond_update_slave_arr(slave->bond, NULL))
+		pr_err("Failed to build slave-array for 3ad mode.\n");
 }
 
 /**
@@ -2377,73 +2398,6 @@ int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info)
 	return ret;
 }
 
-int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
-{
-	struct bonding *bond = netdev_priv(dev);
-	struct slave *slave, *first_ok_slave;
-	struct aggregator *agg;
-	struct ad_info ad_info;
-	struct list_head *iter;
-	int slaves_in_agg;
-	int slave_agg_no;
-	int agg_id;
-
-	if (__bond_3ad_get_active_agg_info(bond, &ad_info)) {
-		netdev_dbg(dev, "__bond_3ad_get_active_agg_info failed\n");
-		goto err_free;
-	}
-
-	slaves_in_agg = ad_info.ports;
-	agg_id = ad_info.aggregator_id;
-
-	if (slaves_in_agg == 0) {
-		netdev_dbg(dev, "active aggregator is empty\n");
-		goto err_free;
-	}
-
-	slave_agg_no = bond_xmit_hash(bond, skb) % slaves_in_agg;
-	first_ok_slave = NULL;
-
-	bond_for_each_slave_rcu(bond, slave, iter) {
-		agg = SLAVE_AD_INFO(slave)->port.aggregator;
-		if (!agg || agg->aggregator_identifier != agg_id)
-			continue;
-
-		if (slave_agg_no >= 0) {
-			if (!first_ok_slave && bond_slave_can_tx(slave))
-				first_ok_slave = slave;
-			slave_agg_no--;
-			continue;
-		}
-
-		if (bond_slave_can_tx(slave)) {
-			bond_dev_queue_xmit(bond, skb, slave->dev);
-			goto out;
-		}
-	}
-
-	if (slave_agg_no >= 0) {
-		netdev_err(dev, "Couldn't find a slave to tx on for aggregator ID %d\n",
-			   agg_id);
-		goto err_free;
-	}
-
-	/* we couldn't find any suitable slave after the agg_no, so use the
-	 * first suitable found, if found.
-	 */
-	if (first_ok_slave)
-		bond_dev_queue_xmit(bond, skb, first_ok_slave->dev);
-	else
-		goto err_free;
-
-out:
-	return NETDEV_TX_OK;
-err_free:
-	/* no suitable interface, frame not sent */
-	dev_kfree_skb_any(skb);
-	goto out;
-}
-
 int bond_3ad_lacpdu_recv(const struct sk_buff *skb, struct bonding *bond,
 			 struct slave *slave)
 {
diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 615f3bebd019..d2eadab787c5 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -177,7 +177,6 @@ static int tlb_initialize(struct bonding *bond)
 static void tlb_deinitialize(struct bonding *bond)
 {
 	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct tlb_up_slave *arr;
 
 	spin_lock_bh(&bond->mode_lock);
 
@@ -185,10 +184,6 @@ static void tlb_deinitialize(struct bonding *bond)
 	bond_info->tx_hashtbl = NULL;
 
 	spin_unlock_bh(&bond->mode_lock);
-
-	arr = rtnl_dereference(bond_info->slave_arr);
-	if (arr)
-		kfree_rcu(arr, rcu);
 }
 
 static long long compute_gap(struct slave *slave)
@@ -1336,39 +1331,9 @@ out:
 	return NETDEV_TX_OK;
 }
 
-static int bond_tlb_update_slave_arr(struct bonding *bond,
-				     struct slave *skipslave)
-{
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct slave *tx_slave;
-	struct list_head *iter;
-	struct tlb_up_slave *new_arr, *old_arr;
-
-	new_arr = kzalloc(offsetof(struct tlb_up_slave, arr[bond->slave_cnt]),
-			  GFP_ATOMIC);
-	if (!new_arr)
-		return -ENOMEM;
-
-	bond_for_each_slave(bond, tx_slave, iter) {
-		if (!bond_slave_can_tx(tx_slave))
-			continue;
-		if (skipslave == tx_slave)
-			continue;
-		new_arr->arr[new_arr->count++] = tx_slave;
-	}
-
-	old_arr = rtnl_dereference(bond_info->slave_arr);
-	rcu_assign_pointer(bond_info->slave_arr, new_arr);
-	if (old_arr)
-		kfree_rcu(old_arr, rcu);
-
-	return 0;
-}
-
 int bond_tlb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
 	struct ethhdr *eth_data;
 	struct slave *tx_slave = NULL;
 	u32 hash_index;
@@ -1389,12 +1354,14 @@ int bond_tlb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 							      hash_index & 0xFF,
 							      skb->len);
 			} else {
-				struct tlb_up_slave *slaves;
+				struct bond_up_slave *slaves;
+				unsigned int count;
 
-				slaves = rcu_dereference(bond_info->slave_arr);
-				if (slaves && slaves->count)
+				slaves = rcu_dereference(bond->slave_arr);
+				count = slaves ? ACCESS_ONCE(slaves->count) : 0;
+				if (likely(count))
 					tx_slave = slaves->arr[hash_index %
-							       slaves->count];
+							       count];
 			}
 			break;
 		}
@@ -1641,10 +1608,6 @@ void bond_alb_deinit_slave(struct bonding *bond, struct slave *slave)
 		rlb_clear_slave(bond, slave);
 	}
 
-	if (bond_is_nondyn_tlb(bond))
-		if (bond_tlb_update_slave_arr(bond, slave))
-			pr_err("Failed to build slave-array for TLB mode.\n");
-
 }
 
 void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char link)
@@ -1669,7 +1632,7 @@ void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char
 	}
 
 	if (bond_is_nondyn_tlb(bond)) {
-		if (bond_tlb_update_slave_arr(bond, NULL))
+		if (bond_update_slave_arr(bond, NULL))
 			pr_err("Failed to build slave-array for TLB mode.\n");
 	}
 }
diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
index 3c6a7ff974d7..1ad473b4ade5 100644
--- a/drivers/net/bonding/bond_alb.h
+++ b/drivers/net/bonding/bond_alb.h
@@ -139,19 +139,11 @@ struct tlb_slave_info {
 			 */
 };
 
-struct tlb_up_slave {
-	unsigned int	count;
-	struct rcu_head rcu;
-	struct slave	*arr[0];
-};
-
 struct alb_bond_info {
 	struct tlb_client_info	*tx_hashtbl; /* Dynamically allocated */
 	u32			unbalanced_load;
 	int			tx_rebalance_counter;
 	int			lp_counter;
-	/* -------- non-dynamic tlb mode only ---------*/
-	struct tlb_up_slave __rcu *slave_arr;	  /* Up slaves */
 	/* -------- rlb parameters -------- */
 	int rlb_enabled;
 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5e7987bba583..e87b802d8813 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -208,6 +208,7 @@ static int lacp_fast;
 
 static int bond_init(struct net_device *bond_dev);
 static void bond_uninit(struct net_device *bond_dev);
+static void bond_slave_arr_handler(struct work_struct *work);
 
 /*---------------------------- General routines -----------------------------*/
 
@@ -1547,6 +1548,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		unblock_netpoll_tx();
 	}
 
+	if (bond_mode_uses_xmit_hash(bond) &&
+	    bond_update_slave_arr(bond, NULL))
+		pr_err("Failed to build slave-array.\n");
+
 	netdev_info(bond_dev, "Enslaving %s as %s interface with %s link\n",
 		    slave_dev->name,
 		    bond_is_active_slave(new_slave) ? "an active" : "a backup",
@@ -1661,6 +1666,10 @@ static int __bond_release_one(struct net_device *bond_dev,
 	if (BOND_MODE(bond) == BOND_MODE_8023AD)
 		bond_3ad_unbind_slave(slave);
 
+	if (bond_mode_uses_xmit_hash(bond) &&
+	    bond_update_slave_arr(bond, slave))
+		pr_err("Failed to build slave-array.\n");
+
 	netdev_info(bond_dev, "Releasing %s interface %s\n",
 		    bond_is_active_slave(slave) ? "active" : "backup",
 		    slave_dev->name);
@@ -1963,6 +1972,10 @@ static void bond_miimon_commit(struct bonding *bond)
 				bond_alb_handle_link_change(bond, slave,
 							    BOND_LINK_UP);
 
+			if (BOND_MODE(bond) == BOND_MODE_XOR &&
+			    bond_update_slave_arr(bond, NULL))
+				pr_err("Failed to build slave-array for XOR mode.\n");
+
 			if (!bond->curr_active_slave || slave == primary)
 				goto do_failover;
 
@@ -1990,6 +2003,10 @@ static void bond_miimon_commit(struct bonding *bond)
 				bond_alb_handle_link_change(bond, slave,
 							    BOND_LINK_DOWN);
 
+			if (BOND_MODE(bond) == BOND_MODE_XOR &&
+			    bond_update_slave_arr(bond, NULL))
+				pr_err("Failed to build slave-array for XOR mode.\n");
+
 			if (slave == rcu_access_pointer(bond->curr_active_slave))
 				goto do_failover;
 
@@ -2446,6 +2463,9 @@ static void bond_loadbalance_arp_mon(struct work_struct *work)
 
 		if (slave_state_changed) {
 			bond_slave_state_change(bond);
+			if (BOND_MODE(bond) == BOND_MODE_XOR &&
+			    bond_update_slave_arr(bond, NULL))
+				pr_err("Failed to build slave-array for XOR mode.\n");
 		} else if (do_failover) {
 			block_netpoll_tx();
 			bond_select_active_slave(bond);
@@ -2822,8 +2842,23 @@ static int bond_slave_netdev_event(unsigned long event,
 			if (old_duplex != slave->duplex)
 				bond_3ad_adapter_duplex_changed(slave);
 		}
+		/* Refresh slave-array if applicable!
+		 * If the setuo does not use miimon or arpmon (mode-specific!),
+		 * then these event will not cause the slave-array to be
+		 * refreshed. This will cause xmit to use a slave that is not
+		 * usable. Avoid such situation by refeshing the array at these
+		 * events. If these (miimon/arpmon) parameters are configured
+		 * then array gets refreshed twice and that should be fine!
+		 */
+		if (bond_mode_uses_xmit_hash(bond) &&
+		    bond_update_slave_arr(bond, NULL))
+			pr_err("Failed to build slave-array.\n");
 		break;
 	case NETDEV_DOWN:
+		/* Refresh slave-array if applicable! */
+		if (bond_mode_uses_xmit_hash(bond) &&
+		    bond_update_slave_arr(bond, NULL))
+			pr_err("Failed to build slave-array.\n");
 		break;
 	case NETDEV_CHANGEMTU:
 		/* TODO: Should slaves be allowed to
@@ -3003,6 +3038,7 @@ static void bond_work_init_all(struct bonding *bond)
 	else
 		INIT_DELAYED_WORK(&bond->arp_work, bond_loadbalance_arp_mon);
 	INIT_DELAYED_WORK(&bond->ad_work, bond_3ad_state_machine_handler);
+	INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
 }
 
 static void bond_work_cancel_all(struct bonding *bond)
@@ -3012,6 +3048,7 @@ static void bond_work_cancel_all(struct bonding *bond)
 	cancel_delayed_work_sync(&bond->alb_work);
 	cancel_delayed_work_sync(&bond->ad_work);
 	cancel_delayed_work_sync(&bond->mcast_work);
+	cancel_delayed_work_sync(&bond->slave_arr_work);
 }
 
 static int bond_open(struct net_device *bond_dev)
@@ -3061,6 +3098,10 @@ static int bond_open(struct net_device *bond_dev)
 		bond_3ad_initiate_agg_selection(bond, 1);
 	}
 
+	if (bond_mode_uses_xmit_hash(bond) &&
+	    bond_update_slave_arr(bond, NULL))
+		pr_err("Failed to build slave-array.\n");
+
 	return 0;
 }
 
@@ -3555,15 +3596,139 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
 	return NETDEV_TX_OK;
 }
 
-/* In bond_xmit_xor() , we determine the output device by using a pre-
- * determined xmit_hash_policy(), If the selected device is not enabled,
- * find the next active slave.
+/* The caller is holding bond->mode_lock and may or may not be
+ * holding RTNL.
  */
-static int bond_xmit_xor(struct sk_buff *skb, struct net_device *bond_dev)
+void bond_slave_arr_work_rearm(struct bonding *bond)
 {
-	struct bonding *bond = netdev_priv(bond_dev);
+	queue_delayed_work(bond->wq, &bond->slave_arr_work, 1);
+}
 
-	bond_xmit_slave_id(bond, skb, bond_xmit_hash(bond, skb) % bond->slave_cnt);
+/* Slave array work handler. Holds only RTNL */
+static void bond_slave_arr_handler(struct work_struct *work)
+{
+	struct bonding *bond = container_of(work, struct bonding,
+					    slave_arr_work.work);
+	int ret;
+
+	if (!rtnl_trylock())
+		goto err;
+
+	ret = bond_update_slave_arr(bond, NULL);
+	rtnl_unlock();
+	if (ret) {
+		pr_warn_ratelimited("Failed to update slave array from WT\n");
+		goto err;
+	}
+	return;
+
+err:
+	bond_slave_arr_work_rearm(bond);
+}
+
+/* Build the usable slaves array in control path for modes that use xmit-hash
+ * to determine the slave interface -
+ * (a) BOND_MODE_8023AD
+ * (b) BOND_MODE_XOR
+ * (c) BOND_MODE_TLB && tlb_dynamic_lb == 0
+ *
+ * The caller is expected to hold RTNL only and NO other lock!
+ */
+int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave)
+{
+	struct slave *slave;
+	struct list_head *iter;
+	struct bond_up_slave *new_arr, *old_arr;
+	int slaves_in_agg;
+	int agg_id = 0;
+	int ret = 0;
+
+#ifdef CONFIG_LOCKDEP
+	WARN_ON(lockdep_is_held(&bond->mode_lock));
+#endif
+
+	new_arr = kzalloc(offsetof(struct bond_up_slave, arr[bond->slave_cnt]),
+			  GFP_KERNEL);
+	if (!new_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
+		struct ad_info ad_info;
+
+		if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
+			pr_debug("bond_3ad_get_active_agg_info failed\n");
+			kfree_rcu(new_arr, rcu);
+			ret = -EINVAL;
+			goto out;
+		}
+		slaves_in_agg = ad_info.ports;
+		agg_id = ad_info.aggregator_id;
+	}
+	bond_for_each_slave(bond, slave, iter) {
+		if (BOND_MODE(bond) == BOND_MODE_8023AD) {
+			struct aggregator *agg;
+
+			agg = SLAVE_AD_INFO(slave)->port.aggregator;
+			if (!agg || agg->aggregator_identifier != agg_id)
+				continue;
+		}
+		if (!bond_slave_can_tx(slave))
+			continue;
+		if (skipslave == slave)
+			continue;
+		new_arr->arr[new_arr->count++] = slave;
+	}
+
+	old_arr = rtnl_dereference(bond->slave_arr);
+	rcu_assign_pointer(bond->slave_arr, new_arr);
+	if (old_arr)
+		kfree_rcu(old_arr, rcu);
+out:
+	if (ret != 0 && skipslave) {
+		int idx;
+
+		/* Rare situation where caller has asked to skip a specific
+		 * slave but allocation failed (most likely!). BTW this is
+		 * only possible when the call is initiated from
+		 * __bond_release_one(). In this sitation; overwrite the
+		 * skipslave entry in the array with the last entry from the
+		 * array to avoid a situation where the xmit path may choose
+		 * this to-be-skipped slave to send a packet out.
+		 */
+		old_arr = rtnl_dereference(bond->slave_arr);
+		for (idx = 0; idx < old_arr->count; idx++) {
+			if (skipslave == old_arr->arr[idx]) {
+				old_arr->arr[idx] =
+				    old_arr->arr[old_arr->count-1];
+				old_arr->count--;
+				break;
+			}
+		}
+	}
+	return ret;
+}
+
+/* Use this Xmit function for 3AD as well as XOR modes. The current
+ * usable slave array is formed in the control path. The xmit function
+ * just calculates hash and sends the packet out.
+ */
+int bond_3ad_xor_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct bonding *bond = netdev_priv(dev);
+	struct slave *slave;
+	struct bond_up_slave *slaves;
+	unsigned int count;
+
+	slaves = rcu_dereference(bond->slave_arr);
+	count = slaves ? ACCESS_ONCE(slaves->count) : 0;
+	if (likely(count)) {
+		slave = slaves->arr[bond_xmit_hash(bond, skb) % count];
+		bond_dev_queue_xmit(bond, skb, slave->dev);
+	} else {
+		dev_kfree_skb_any(skb);
+		atomic_long_inc(&dev->tx_dropped);
+	}
 
 	return NETDEV_TX_OK;
 }
@@ -3660,12 +3825,11 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
 		return bond_xmit_roundrobin(skb, dev);
 	case BOND_MODE_ACTIVEBACKUP:
 		return bond_xmit_activebackup(skb, dev);
+	case BOND_MODE_8023AD:
 	case BOND_MODE_XOR:
-		return bond_xmit_xor(skb, dev);
+		return bond_3ad_xor_xmit(skb, dev);
 	case BOND_MODE_BROADCAST:
 		return bond_xmit_broadcast(skb, dev);
-	case BOND_MODE_8023AD:
-		return bond_3ad_xmit_xor(skb, dev);
 	case BOND_MODE_ALB:
 		return bond_alb_xmit(skb, dev);
 	case BOND_MODE_TLB:
@@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
 	struct bonding *bond = netdev_priv(bond_dev);
 	struct list_head *iter;
 	struct slave *slave;
+	struct bond_up_slave *arr;
 
 	bond_netpoll_cleanup(bond_dev);
 
@@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
 		__bond_release_one(bond_dev, slave->dev, true);
 	netdev_info(bond_dev, "Released all slaves\n");
 
+	arr = rtnl_dereference(bond->slave_arr);
+	if (arr) {
+		kfree_rcu(arr, rcu);
+		RCU_INIT_POINTER(bond->slave_arr, NULL);
+	}
+
 	list_del(&bond->bond_list);
 
 	bond_debug_unregister(bond);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 98dc0d7ad731..4635b175256a 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -177,6 +177,12 @@ struct slave {
 	struct kobject kobj;
 };
 
+struct bond_up_slave {
+	unsigned int	count;
+	struct rcu_head rcu;
+	struct slave	*arr[0];
+};
+
 /*
  * Link pseudo-state only used internally by monitors
  */
@@ -191,6 +197,7 @@ struct bonding {
 	struct   slave __rcu *curr_active_slave;
 	struct   slave __rcu *current_arp_slave;
 	struct   slave __rcu *primary_slave;
+	struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
 	bool     force_primary;
 	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
 	int     (*recv_probe)(const struct sk_buff *, struct bonding *,
@@ -220,6 +227,7 @@ struct bonding {
 	struct   delayed_work alb_work;
 	struct   delayed_work ad_work;
 	struct   delayed_work mcast_work;
+	struct   delayed_work slave_arr_work;
 #ifdef CONFIG_DEBUG_FS
 	/* debugging support via debugfs */
 	struct	 dentry *debug_dir;
@@ -531,6 +539,8 @@ const char *bond_slave_link_status(s8 link);
 struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
 					      struct net_device *end_dev,
 					      int level);
+int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
+void bond_slave_arr_work_rearm(struct bonding *bond);
 
 #ifdef CONFIG_PROC_FS
 void bond_create_proc_entry(struct bonding *bond);
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-18 21:53 [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash Mahesh Bandewar
@ 2014-09-19 10:00 ` Nikolay Aleksandrov
  2014-09-19 10:08   ` Nikolay Aleksandrov
  2014-09-19 11:06   ` Nikolay Aleksandrov
  0 siblings, 2 replies; 11+ messages in thread
From: Nikolay Aleksandrov @ 2014-09-19 10:00 UTC (permalink / raw)
  To: Mahesh Bandewar, Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David Miller
  Cc: netdev, Eric Dumazet, Maciej Zenczykowski

On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
> Earlier change to use usable slave array for TLB mode had an additional
> performance advantage. So extending the same logic to all other modes
> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
> Also consolidating this with the earlier TLB change.
> 
> The main idea is to build the usable slaves array in the control path
> and use that array for slave selection during xmit operation.
> 
> Measured performance in a setup with a bond of 4x1G NICs with 200
> instances of netperf for the modes involved (3ad, xor, tlb)
> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
> 
> Mode        TPS-Before   TPS-After
> 
> 802.3ad   : 468,694      493,101
> TLB (lb=0): 392,583      392,965
> XOR       : 475,696      484,517
> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> ---
> v1:
>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>       the slave that need to be removed.
>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>       transition gracefully.
>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>       failure.
>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>       will populate the array even if these parameters are not used.
>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
> v2:
>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>   (b) Slave link-events now refresh array for all these modes.
>   (c) Moved free-array call from bond_close() to bond_uninit().
> v3:
>   (a) Fixed null pointer dereference.
>   (b) Removed bond->lock lockdep dependency.
> v4:
>   (a) Made to changes to comply with Nikolay's locking changes
>   (b) Added a work-queue to refresh slave-array when RTNL is not held
>   (c) Array refresh happens ONLY with RTNL now.
>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
> 
Hello Mahesh,
This looks much better, I think we've ironed out most of the issues. A few
suggestions and one issue below.
First, I think you can fold the pr_err()s from the failing slave array
update into bond_update_slave_arr(), there's no error handling or rollback
so you can save a few lines and some complexity there and just check for
the mode and do the slave array update, if there's an error it can output
it itself. The rest is inlined below.


>  drivers/net/bonding/bond_3ad.c  |  88 +++++--------------
>  drivers/net/bonding/bond_alb.c  |  51 ++---------
>  drivers/net/bonding/bond_alb.h  |   8 --
>  drivers/net/bonding/bond_main.c | 189 ++++++++++++++++++++++++++++++++++++++--
>  drivers/net/bonding/bonding.h   |  10 +++
>  5 files changed, 218 insertions(+), 128 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index 7e9e522fd476..4bf3756dcc11 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -1550,6 +1550,11 @@ static void ad_agg_selection_logic(struct aggregator *agg)
>  				__disable_port(port);
>  			}
>  		}
> +		/* RTNL may or may not be held but bond->mode_lock
> +		 * is held. It's not safe to update slave-arr here.
> +		 * Defer it to delayed-work.
> +		 */
I don't think the information about RTNL matters, the important point is
that mode_lock is held thus we can't update so we defer it. IMO You can
drop the RTNL part here and shorten it a bit :-)

> +		bond_slave_arr_work_rearm(bond);
>  	}
>  
>  	/* if the selected aggregator is of join individuals
> @@ -1688,6 +1693,11 @@ static void ad_enable_collecting_distributing(struct port *port)
>  			 port->actor_port_number,
>  			 port->aggregator->aggregator_identifier);
>  		__enable_port(port);
> +		/* RTNL is not be held and bond->mode_lock is held.
s/is not be/is not/

> +		 * It's not safe to update slave-arr here!
> +		 * Defer it to delayed-work.
> +		 */
> +		bond_slave_arr_work_rearm(port->slave->bond);
>  	}
>  }
>  
> @@ -1704,6 +1714,11 @@ static void ad_disable_collecting_distributing(struct port *port)
>  			 port->actor_port_number,
>  			 port->aggregator->aggregator_identifier);
>  		__disable_port(port);
> +		/* RTNL is not be held and bond->mode_lock is held.
Same here.

> +		 * It's not safe to update slave-arr here!
> +		 * Defer it to delayed-work.
> +		 */
> +		bond_slave_arr_work_rearm(port->slave->bond);
>  	}
>  }
>  
> @@ -2283,6 +2298,12 @@ void bond_3ad_handle_link_change(struct slave *slave, char link)
>  	port->sm_vars |= AD_PORT_BEGIN;
>  
>  	spin_unlock_bh(&slave->bond->mode_lock);
> +
> +	/* RTNL is held and mode_lock is released so it's safe
> +	 * to update slave_array here.
> +	 */
> +	if (bond_update_slave_arr(slave->bond, NULL))
> +		pr_err("Failed to build slave-array for 3ad mode.\n");
>  }
>  
>  /**
> @@ -2377,73 +2398,6 @@ int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info)
>  	return ret;
>  }
>  
> -int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
> -{
> -	struct bonding *bond = netdev_priv(dev);
> -	struct slave *slave, *first_ok_slave;
> -	struct aggregator *agg;
> -	struct ad_info ad_info;
> -	struct list_head *iter;
> -	int slaves_in_agg;
> -	int slave_agg_no;
> -	int agg_id;
> -
> -	if (__bond_3ad_get_active_agg_info(bond, &ad_info)) {
> -		netdev_dbg(dev, "__bond_3ad_get_active_agg_info failed\n");
> -		goto err_free;
> -	}
> -
> -	slaves_in_agg = ad_info.ports;
> -	agg_id = ad_info.aggregator_id;
> -
> -	if (slaves_in_agg == 0) {
> -		netdev_dbg(dev, "active aggregator is empty\n");
> -		goto err_free;
> -	}
> -
> -	slave_agg_no = bond_xmit_hash(bond, skb) % slaves_in_agg;
> -	first_ok_slave = NULL;
> -
> -	bond_for_each_slave_rcu(bond, slave, iter) {
> -		agg = SLAVE_AD_INFO(slave)->port.aggregator;
> -		if (!agg || agg->aggregator_identifier != agg_id)
> -			continue;
> -
> -		if (slave_agg_no >= 0) {
> -			if (!first_ok_slave && bond_slave_can_tx(slave))
> -				first_ok_slave = slave;
> -			slave_agg_no--;
> -			continue;
> -		}
> -
> -		if (bond_slave_can_tx(slave)) {
> -			bond_dev_queue_xmit(bond, skb, slave->dev);
> -			goto out;
> -		}
> -	}
> -
> -	if (slave_agg_no >= 0) {
> -		netdev_err(dev, "Couldn't find a slave to tx on for aggregator ID %d\n",
> -			   agg_id);
> -		goto err_free;
> -	}
> -
> -	/* we couldn't find any suitable slave after the agg_no, so use the
> -	 * first suitable found, if found.
> -	 */
> -	if (first_ok_slave)
> -		bond_dev_queue_xmit(bond, skb, first_ok_slave->dev);
> -	else
> -		goto err_free;
> -
> -out:
> -	return NETDEV_TX_OK;
> -err_free:
> -	/* no suitable interface, frame not sent */
> -	dev_kfree_skb_any(skb);
> -	goto out;
> -}
> -
>  int bond_3ad_lacpdu_recv(const struct sk_buff *skb, struct bonding *bond,
>  			 struct slave *slave)
>  {
> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
> index 615f3bebd019..d2eadab787c5 100644
> --- a/drivers/net/bonding/bond_alb.c
> +++ b/drivers/net/bonding/bond_alb.c
> @@ -177,7 +177,6 @@ static int tlb_initialize(struct bonding *bond)
>  static void tlb_deinitialize(struct bonding *bond)
>  {
>  	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
> -	struct tlb_up_slave *arr;
>  
>  	spin_lock_bh(&bond->mode_lock);
>  
> @@ -185,10 +184,6 @@ static void tlb_deinitialize(struct bonding *bond)
>  	bond_info->tx_hashtbl = NULL;
>  
>  	spin_unlock_bh(&bond->mode_lock);
> -
> -	arr = rtnl_dereference(bond_info->slave_arr);
> -	if (arr)
> -		kfree_rcu(arr, rcu);
>  }
>  
>  static long long compute_gap(struct slave *slave)
> @@ -1336,39 +1331,9 @@ out:
>  	return NETDEV_TX_OK;
>  }
>  
> -static int bond_tlb_update_slave_arr(struct bonding *bond,
> -				     struct slave *skipslave)
> -{
> -	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
> -	struct slave *tx_slave;
> -	struct list_head *iter;
> -	struct tlb_up_slave *new_arr, *old_arr;
> -
> -	new_arr = kzalloc(offsetof(struct tlb_up_slave, arr[bond->slave_cnt]),
> -			  GFP_ATOMIC);
> -	if (!new_arr)
> -		return -ENOMEM;
> -
> -	bond_for_each_slave(bond, tx_slave, iter) {
> -		if (!bond_slave_can_tx(tx_slave))
> -			continue;
> -		if (skipslave == tx_slave)
> -			continue;
> -		new_arr->arr[new_arr->count++] = tx_slave;
> -	}
> -
> -	old_arr = rtnl_dereference(bond_info->slave_arr);
> -	rcu_assign_pointer(bond_info->slave_arr, new_arr);
> -	if (old_arr)
> -		kfree_rcu(old_arr, rcu);
> -
> -	return 0;
> -}
> -
>  int bond_tlb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
>  {
>  	struct bonding *bond = netdev_priv(bond_dev);
> -	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>  	struct ethhdr *eth_data;
>  	struct slave *tx_slave = NULL;
>  	u32 hash_index;
> @@ -1389,12 +1354,14 @@ int bond_tlb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
>  							      hash_index & 0xFF,
>  							      skb->len);
>  			} else {
> -				struct tlb_up_slave *slaves;
> +				struct bond_up_slave *slaves;
> +				unsigned int count;
>  
> -				slaves = rcu_dereference(bond_info->slave_arr);
> -				if (slaves && slaves->count)
> +				slaves = rcu_dereference(bond->slave_arr);
> +				count = slaves ? ACCESS_ONCE(slaves->count) : 0;
> +				if (likely(count))
>  					tx_slave = slaves->arr[hash_index %
> -							       slaves->count];
> +							       count];
>  			}
>  			break;
>  		}
> @@ -1641,10 +1608,6 @@ void bond_alb_deinit_slave(struct bonding *bond, struct slave *slave)
>  		rlb_clear_slave(bond, slave);
>  	}
>  
> -	if (bond_is_nondyn_tlb(bond))
> -		if (bond_tlb_update_slave_arr(bond, slave))
> -			pr_err("Failed to build slave-array for TLB mode.\n");
> -
>  }
>  
>  void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char link)
> @@ -1669,7 +1632,7 @@ void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char
>  	}
>  
>  	if (bond_is_nondyn_tlb(bond)) {
> -		if (bond_tlb_update_slave_arr(bond, NULL))
> +		if (bond_update_slave_arr(bond, NULL))
>  			pr_err("Failed to build slave-array for TLB mode.\n");
>  	}
>  }
> diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
> index 3c6a7ff974d7..1ad473b4ade5 100644
> --- a/drivers/net/bonding/bond_alb.h
> +++ b/drivers/net/bonding/bond_alb.h
> @@ -139,19 +139,11 @@ struct tlb_slave_info {
>  			 */
>  };
>  
> -struct tlb_up_slave {
> -	unsigned int	count;
> -	struct rcu_head rcu;
> -	struct slave	*arr[0];
> -};
> -
>  struct alb_bond_info {
>  	struct tlb_client_info	*tx_hashtbl; /* Dynamically allocated */
>  	u32			unbalanced_load;
>  	int			tx_rebalance_counter;
>  	int			lp_counter;
> -	/* -------- non-dynamic tlb mode only ---------*/
> -	struct tlb_up_slave __rcu *slave_arr;	  /* Up slaves */
>  	/* -------- rlb parameters -------- */
>  	int rlb_enabled;
>  	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 5e7987bba583..e87b802d8813 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -208,6 +208,7 @@ static int lacp_fast;
>  
>  static int bond_init(struct net_device *bond_dev);
>  static void bond_uninit(struct net_device *bond_dev);
> +static void bond_slave_arr_handler(struct work_struct *work);
>  
>  /*---------------------------- General routines -----------------------------*/
>  
> @@ -1547,6 +1548,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>  		unblock_netpoll_tx();
>  	}
>  
> +	if (bond_mode_uses_xmit_hash(bond) &&
> +	    bond_update_slave_arr(bond, NULL))
> +		pr_err("Failed to build slave-array.\n");
> +
>  	netdev_info(bond_dev, "Enslaving %s as %s interface with %s link\n",
>  		    slave_dev->name,
>  		    bond_is_active_slave(new_slave) ? "an active" : "a backup",
> @@ -1661,6 +1666,10 @@ static int __bond_release_one(struct net_device *bond_dev,
>  	if (BOND_MODE(bond) == BOND_MODE_8023AD)
>  		bond_3ad_unbind_slave(slave);
>  
> +	if (bond_mode_uses_xmit_hash(bond) &&
> +	    bond_update_slave_arr(bond, slave))
> +		pr_err("Failed to build slave-array.\n");
> +
>  	netdev_info(bond_dev, "Releasing %s interface %s\n",
>  		    bond_is_active_slave(slave) ? "active" : "backup",
>  		    slave_dev->name);
> @@ -1963,6 +1972,10 @@ static void bond_miimon_commit(struct bonding *bond)
>  				bond_alb_handle_link_change(bond, slave,
>  							    BOND_LINK_UP);
>  
> +			if (BOND_MODE(bond) == BOND_MODE_XOR &&
> +			    bond_update_slave_arr(bond, NULL))
> +				pr_err("Failed to build slave-array for XOR mode.\n");
> +
miimon is also supported in the other hash using modes, it's used to look
for link failure and speed/duplex changes. There's even a warning about it
for 802.3ad/TLB/ALB modes:
pr_warn("Warning: miimon must be specified, otherwise bonding will not
detect link failure, speed and duplex which are essential for 802.3ad
operation\n");
pr_warn("Forcing miimon to 100msec\n");

bond_main.c: line 4026

>  			if (!bond->curr_active_slave || slave == primary)
>  				goto do_failover;
>  
> @@ -1990,6 +2003,10 @@ static void bond_miimon_commit(struct bonding *bond)
>  				bond_alb_handle_link_change(bond, slave,
>  							    BOND_LINK_DOWN);
>  
> +			if (BOND_MODE(bond) == BOND_MODE_XOR &&
> +			    bond_update_slave_arr(bond, NULL))
> +				pr_err("Failed to build slave-array for XOR mode.\n");
> +
Same here.

>  			if (slave == rcu_access_pointer(bond->curr_active_slave))
>  				goto do_failover;
>  
> @@ -2446,6 +2463,9 @@ static void bond_loadbalance_arp_mon(struct work_struct *work)
>  
>  		if (slave_state_changed) {
>  			bond_slave_state_change(bond);
> +			if (BOND_MODE(bond) == BOND_MODE_XOR &&
> +			    bond_update_slave_arr(bond, NULL))
> +				pr_err("Failed to build slave-array for XOR mode.\n");
>  		} else if (do_failover) {
>  			block_netpoll_tx();
>  			bond_select_active_slave(bond);
> @@ -2822,8 +2842,23 @@ static int bond_slave_netdev_event(unsigned long event,
>  			if (old_duplex != slave->duplex)
>  				bond_3ad_adapter_duplex_changed(slave);
>  		}
> +		/* Refresh slave-array if applicable!
> +		 * If the setuo does not use miimon or arpmon (mode-specific!),
s/setuo/setup/

> +		 * then these event will not cause the slave-array to be
s/event/events/ ?

> +		 * refreshed. This will cause xmit to use a slave that is not
> +		 * usable. Avoid such situation by refeshing the array at these
> +		 * events. If these (miimon/arpmon) parameters are configured
> +		 * then array gets refreshed twice and that should be fine!
> +		 */
> +		if (bond_mode_uses_xmit_hash(bond) &&
> +		    bond_update_slave_arr(bond, NULL))
> +			pr_err("Failed to build slave-array.\n");
>  		break;
>  	case NETDEV_DOWN:
> +		/* Refresh slave-array if applicable! */
Please drop this comment, it doesn't bring any new information from the if().

> +		if (bond_mode_uses_xmit_hash(bond) &&
> +		    bond_update_slave_arr(bond, NULL))
> +			pr_err("Failed to build slave-array.\n");
>  		break;
>  	case NETDEV_CHANGEMTU:
>  		/* TODO: Should slaves be allowed to
> @@ -3003,6 +3038,7 @@ static void bond_work_init_all(struct bonding *bond)
>  	else
>  		INIT_DELAYED_WORK(&bond->arp_work, bond_loadbalance_arp_mon);
>  	INIT_DELAYED_WORK(&bond->ad_work, bond_3ad_state_machine_handler);
> +	INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
>  }
>  
>  static void bond_work_cancel_all(struct bonding *bond)
> @@ -3012,6 +3048,7 @@ static void bond_work_cancel_all(struct bonding *bond)
>  	cancel_delayed_work_sync(&bond->alb_work);
>  	cancel_delayed_work_sync(&bond->ad_work);
>  	cancel_delayed_work_sync(&bond->mcast_work);
> +	cancel_delayed_work_sync(&bond->slave_arr_work);
>  }
>  
>  static int bond_open(struct net_device *bond_dev)
> @@ -3061,6 +3098,10 @@ static int bond_open(struct net_device *bond_dev)
>  		bond_3ad_initiate_agg_selection(bond, 1);
>  	}
>  
> +	if (bond_mode_uses_xmit_hash(bond) &&
> +	    bond_update_slave_arr(bond, NULL))
> +		pr_err("Failed to build slave-array.\n");
> +
>  	return 0;
>  }
>  
> @@ -3555,15 +3596,139 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
>  	return NETDEV_TX_OK;
>  }
>  
> -/* In bond_xmit_xor() , we determine the output device by using a pre-
> - * determined xmit_hash_policy(), If the selected device is not enabled,
> - * find the next active slave.
> +/* The caller is holding bond->mode_lock and may or may not be
> + * holding RTNL.
>   */
I'd say change this comment to note that this should be used when it's not
appropriate to update the slave array right away F.e. when sleeping is not
an option or when RTNL isn't held. Currently it's only the mode_lock case,
but that may change and a more general comment would be more helpful.

> -static int bond_xmit_xor(struct sk_buff *skb, struct net_device *bond_dev)
> +void bond_slave_arr_work_rearm(struct bonding *bond)
>  {
> -	struct bonding *bond = netdev_priv(bond_dev);
> +	queue_delayed_work(bond->wq, &bond->slave_arr_work, 1);
> +}
>  
> -	bond_xmit_slave_id(bond, skb, bond_xmit_hash(bond, skb) % bond->slave_cnt);
> +/* Slave array work handler. Holds only RTNL */
> +static void bond_slave_arr_handler(struct work_struct *work)
> +{
> +	struct bonding *bond = container_of(work, struct bonding,
> +					    slave_arr_work.work);
> +	int ret;
> +
> +	if (!rtnl_trylock())
> +		goto err;
> +
> +	ret = bond_update_slave_arr(bond, NULL);
> +	rtnl_unlock();
> +	if (ret) {
> +		pr_warn_ratelimited("Failed to update slave array from WT\n");
> +		goto err;
> +	}
> +	return;
> +
> +err:
> +	bond_slave_arr_work_rearm(bond);
> +}
> +
> +/* Build the usable slaves array in control path for modes that use xmit-hash
> + * to determine the slave interface -
> + * (a) BOND_MODE_8023AD
> + * (b) BOND_MODE_XOR
> + * (c) BOND_MODE_TLB && tlb_dynamic_lb == 0
> + *
> + * The caller is expected to hold RTNL only and NO other lock!
> + */
> +int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave)
> +{
> +	struct slave *slave;
> +	struct list_head *iter;
> +	struct bond_up_slave *new_arr, *old_arr;
> +	int slaves_in_agg;
> +	int agg_id = 0;
> +	int ret = 0;
> +
> +#ifdef CONFIG_LOCKDEP
> +	WARN_ON(lockdep_is_held(&bond->mode_lock));
> +#endif
> +
> +	new_arr = kzalloc(offsetof(struct bond_up_slave, arr[bond->slave_cnt]),
> +			  GFP_KERNEL);
> +	if (!new_arr) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
> +		struct ad_info ad_info;
> +
> +		if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
> +			pr_debug("bond_3ad_get_active_agg_info failed\n");
> +			kfree_rcu(new_arr, rcu);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		slaves_in_agg = ad_info.ports;
> +		agg_id = ad_info.aggregator_id;
> +	}
> +	bond_for_each_slave(bond, slave, iter) {
> +		if (BOND_MODE(bond) == BOND_MODE_8023AD) {
> +			struct aggregator *agg;
> +
> +			agg = SLAVE_AD_INFO(slave)->port.aggregator;
> +			if (!agg || agg->aggregator_identifier != agg_id)
> +				continue;
> +		}
> +		if (!bond_slave_can_tx(slave))
> +			continue;
> +		if (skipslave == slave)
> +			continue;
> +		new_arr->arr[new_arr->count++] = slave;
> +	}
> +
> +	old_arr = rtnl_dereference(bond->slave_arr);
> +	rcu_assign_pointer(bond->slave_arr, new_arr);
> +	if (old_arr)
> +		kfree_rcu(old_arr, rcu);
> +out:
> +	if (ret != 0 && skipslave) {
> +		int idx;
> +
> +		/* Rare situation where caller has asked to skip a specific
> +		 * slave but allocation failed (most likely!). BTW this is
> +		 * only possible when the call is initiated from
> +		 * __bond_release_one(). In this sitation; overwrite the
s/sitation/situation/

> +		 * skipslave entry in the array with the last entry from the
> +		 * array to avoid a situation where the xmit path may choose
> +		 * this to-be-skipped slave to send a packet out.
> +		 */
> +		old_arr = rtnl_dereference(bond->slave_arr);
> +		for (idx = 0; idx < old_arr->count; idx++) {
> +			if (skipslave == old_arr->arr[idx]) {
> +				old_arr->arr[idx] =
> +				    old_arr->arr[old_arr->count-1];
> +				old_arr->count--;
> +				break;
> +			}
> +		}
> +	}
> +	return ret;
> +}
> +
> +/* Use this Xmit function for 3AD as well as XOR modes. The current
> + * usable slave array is formed in the control path. The xmit function
> + * just calculates hash and sends the packet out.
> + */
> +int bond_3ad_xor_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct bonding *bond = netdev_priv(dev);
> +	struct slave *slave;
> +	struct bond_up_slave *slaves;
> +	unsigned int count;
> +
> +	slaves = rcu_dereference(bond->slave_arr);
> +	count = slaves ? ACCESS_ONCE(slaves->count) : 0;
> +	if (likely(count)) {
> +		slave = slaves->arr[bond_xmit_hash(bond, skb) % count];
> +		bond_dev_queue_xmit(bond, skb, slave->dev);
> +	} else {
> +		dev_kfree_skb_any(skb);
> +		atomic_long_inc(&dev->tx_dropped);
> +	}
>  
>  	return NETDEV_TX_OK;
>  }
> @@ -3660,12 +3825,11 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
>  		return bond_xmit_roundrobin(skb, dev);
>  	case BOND_MODE_ACTIVEBACKUP:
>  		return bond_xmit_activebackup(skb, dev);
> +	case BOND_MODE_8023AD:
>  	case BOND_MODE_XOR:
> -		return bond_xmit_xor(skb, dev);
> +		return bond_3ad_xor_xmit(skb, dev);
>  	case BOND_MODE_BROADCAST:
>  		return bond_xmit_broadcast(skb, dev);
> -	case BOND_MODE_8023AD:
> -		return bond_3ad_xmit_xor(skb, dev);
>  	case BOND_MODE_ALB:
>  		return bond_alb_xmit(skb, dev);
>  	case BOND_MODE_TLB:
> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
>  	struct bonding *bond = netdev_priv(bond_dev);
>  	struct list_head *iter;
>  	struct slave *slave;
> +	struct bond_up_slave *arr;
>  
>  	bond_netpoll_cleanup(bond_dev);
>  
> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
>  		__bond_release_one(bond_dev, slave->dev, true);
>  	netdev_info(bond_dev, "Released all slaves\n");
>  
> +	arr = rtnl_dereference(bond->slave_arr);
> +	if (arr) {
> +		kfree_rcu(arr, rcu);
> +		RCU_INIT_POINTER(bond->slave_arr, NULL);
> +	}
> +
>  	list_del(&bond->bond_list);
>  
>  	bond_debug_unregister(bond);
> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> index 98dc0d7ad731..4635b175256a 100644
> --- a/drivers/net/bonding/bonding.h
> +++ b/drivers/net/bonding/bonding.h
> @@ -177,6 +177,12 @@ struct slave {
>  	struct kobject kobj;
>  };
>  
> +struct bond_up_slave {
> +	unsigned int	count;
> +	struct rcu_head rcu;
> +	struct slave	*arr[0];
> +};
> +
>  /*
>   * Link pseudo-state only used internally by monitors
>   */
> @@ -191,6 +197,7 @@ struct bonding {
>  	struct   slave __rcu *curr_active_slave;
>  	struct   slave __rcu *current_arp_slave;
>  	struct   slave __rcu *primary_slave;
> +	struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
>  	bool     force_primary;
>  	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
>  	int     (*recv_probe)(const struct sk_buff *, struct bonding *,
> @@ -220,6 +227,7 @@ struct bonding {
>  	struct   delayed_work alb_work;
>  	struct   delayed_work ad_work;
>  	struct   delayed_work mcast_work;
> +	struct   delayed_work slave_arr_work;
>  #ifdef CONFIG_DEBUG_FS
>  	/* debugging support via debugfs */
>  	struct	 dentry *debug_dir;
> @@ -531,6 +539,8 @@ const char *bond_slave_link_status(s8 link);
>  struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
>  					      struct net_device *end_dev,
>  					      int level);
> +int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
> +void bond_slave_arr_work_rearm(struct bonding *bond);
>  
>  #ifdef CONFIG_PROC_FS
>  void bond_create_proc_entry(struct bonding *bond);
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-19 10:00 ` Nikolay Aleksandrov
@ 2014-09-19 10:08   ` Nikolay Aleksandrov
  2014-09-19 11:06   ` Nikolay Aleksandrov
  1 sibling, 0 replies; 11+ messages in thread
From: Nikolay Aleksandrov @ 2014-09-19 10:08 UTC (permalink / raw)
  To: Mahesh Bandewar, Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David Miller
  Cc: netdev, Eric Dumazet, Maciej Zenczykowski

On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
> On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
>> Earlier change to use usable slave array for TLB mode had an additional
>> performance advantage. So extending the same logic to all other modes
>> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
>> Also consolidating this with the earlier TLB change.
>>
>> The main idea is to build the usable slaves array in the control path
>> and use that array for slave selection during xmit operation.
>>
>> Measured performance in a setup with a bond of 4x1G NICs with 200
>> instances of netperf for the modes involved (3ad, xor, tlb)
>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>
>> Mode        TPS-Before   TPS-After
>>
>> 802.3ad   : 468,694      493,101
>> TLB (lb=0): 392,583      392,965
>> XOR       : 475,696      484,517
>>
>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>> ---
>> v1:
>>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>>       the slave that need to be removed.
>>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>>       transition gracefully.
>>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>>       failure.
>>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>>       will populate the array even if these parameters are not used.
>>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>> v2:
>>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>>   (b) Slave link-events now refresh array for all these modes.
>>   (c) Moved free-array call from bond_close() to bond_uninit().
>> v3:
>>   (a) Fixed null pointer dereference.
>>   (b) Removed bond->lock lockdep dependency.
>> v4:
>>   (a) Made to changes to comply with Nikolay's locking changes
>>   (b) Added a work-queue to refresh slave-array when RTNL is not held
>>   (c) Array refresh happens ONLY with RTNL now.
>>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>>
<<<<snip>>>>>
>> @@ -1963,6 +1972,10 @@ static void bond_miimon_commit(struct bonding *bond)
>>  				bond_alb_handle_link_change(bond, slave,
>>  							    BOND_LINK_UP);
>>  
>> +			if (BOND_MODE(bond) == BOND_MODE_XOR &&
>> +			    bond_update_slave_arr(bond, NULL))
>> +				pr_err("Failed to build slave-array for XOR mode.\n");
>> +
> miimon is also supported in the other hash using modes, it's used to look
> for link failure and speed/duplex changes. There's even a warning about it
> for 802.3ad/TLB/ALB modes:
> pr_warn("Warning: miimon must be specified, otherwise bonding will not
> detect link failure, speed and duplex which are essential for 802.3ad
> operation\n");
> pr_warn("Forcing miimon to 100msec\n");
> 
> bond_main.c: line 4026
> 
Actually nevermind this comment, their arrays will get rebuilt in their
respective link handling functions. I just thought we could somehow fold
these rebuilds but it seems impossible currently.

Nik

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-19 10:00 ` Nikolay Aleksandrov
  2014-09-19 10:08   ` Nikolay Aleksandrov
@ 2014-09-19 11:06   ` Nikolay Aleksandrov
  2014-09-20  0:09     ` Mahesh Bandewar
  1 sibling, 1 reply; 11+ messages in thread
From: Nikolay Aleksandrov @ 2014-09-19 11:06 UTC (permalink / raw)
  To: Mahesh Bandewar, Jay Vosburgh, Veaceslav Falico, Andy Gospodarek,
	David Miller
  Cc: netdev, Eric Dumazet, Maciej Zenczykowski

On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
> On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
>> Earlier change to use usable slave array for TLB mode had an additional
>> performance advantage. So extending the same logic to all other modes
>> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
>> Also consolidating this with the earlier TLB change.
>>
>> The main idea is to build the usable slaves array in the control path
>> and use that array for slave selection during xmit operation.
>>
>> Measured performance in a setup with a bond of 4x1G NICs with 200
>> instances of netperf for the modes involved (3ad, xor, tlb)
>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>
>> Mode        TPS-Before   TPS-After
>>
>> 802.3ad   : 468,694      493,101
>> TLB (lb=0): 392,583      392,965
>> XOR       : 475,696      484,517
>>
>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>> ---
>> v1:
>>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>>       the slave that need to be removed.
>>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>>       transition gracefully.
>>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>>       failure.
>>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>>       will populate the array even if these parameters are not used.
>>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>> v2:
>>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>>   (b) Slave link-events now refresh array for all these modes.
>>   (c) Moved free-array call from bond_close() to bond_uninit().
>> v3:
>>   (a) Fixed null pointer dereference.
>>   (b) Removed bond->lock lockdep dependency.
>> v4:
>>   (a) Made to changes to comply with Nikolay's locking changes
>>   (b) Added a work-queue to refresh slave-array when RTNL is not held
>>   (c) Array refresh happens ONLY with RTNL now.
>>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>>
<<<snip>>>
>> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
>>  	struct bonding *bond = netdev_priv(bond_dev);
>>  	struct list_head *iter;
>>  	struct slave *slave;
>> +	struct bond_up_slave *arr;
>>  
>>  	bond_netpoll_cleanup(bond_dev);
>>  
>> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
>>  		__bond_release_one(bond_dev, slave->dev, true);
>>  	netdev_info(bond_dev, "Released all slaves\n");
>>  
Sorry but I just spotted a major problem, bond_3ad_unbind_slave() (called
from __bond_release_one) calls ad_agg_selection_logic() which can re-arm
the slave_arr work after it's supposed to be stopped here (i.e. the bond
device has been closed so all works should've been stopped) so we might
leak memory and access freed memory after all since it'll keep
re-scheduling itself until it can acquire rtnl which is after the bond
device has been destroyed.

>> +	arr = rtnl_dereference(bond->slave_arr);
>> +	if (arr) {
>> +		kfree_rcu(arr, rcu);
>> +		RCU_INIT_POINTER(bond->slave_arr, NULL);
>> +	}
>> +
>>  	list_del(&bond->bond_list);
>>  
>>  	bond_debug_unregister(bond);
>> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>> index 98dc0d7ad731..4635b175256a 100644
>> --- a/drivers/net/bonding/bonding.h
>> +++ b/drivers/net/bonding/bonding.h
>> @@ -177,6 +177,12 @@ struct slave {
>>  	struct kobject kobj;
>>  };
>>  
>> +struct bond_up_slave {
>> +	unsigned int	count;
>> +	struct rcu_head rcu;
>> +	struct slave	*arr[0];
>> +};
>> +
>>  /*
>>   * Link pseudo-state only used internally by monitors
>>   */
>> @@ -191,6 +197,7 @@ struct bonding {
>>  	struct   slave __rcu *curr_active_slave;
>>  	struct   slave __rcu *current_arp_slave;
>>  	struct   slave __rcu *primary_slave;
>> +	struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
>>  	bool     force_primary;
>>  	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
>>  	int     (*recv_probe)(const struct sk_buff *, struct bonding *,
>> @@ -220,6 +227,7 @@ struct bonding {
>>  	struct   delayed_work alb_work;
>>  	struct   delayed_work ad_work;
>>  	struct   delayed_work mcast_work;
>> +	struct   delayed_work slave_arr_work;
>>  #ifdef CONFIG_DEBUG_FS
>>  	/* debugging support via debugfs */
>>  	struct	 dentry *debug_dir;
>> @@ -531,6 +539,8 @@ const char *bond_slave_link_status(s8 link);
>>  struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
>>  					      struct net_device *end_dev,
>>  					      int level);
>> +int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
>> +void bond_slave_arr_work_rearm(struct bonding *bond);
>>  
>>  #ifdef CONFIG_PROC_FS
>>  void bond_create_proc_entry(struct bonding *bond);
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-19 11:06   ` Nikolay Aleksandrov
@ 2014-09-20  0:09     ` Mahesh Bandewar
  2014-09-20 10:19       ` Nikolay Aleksandrov
  0 siblings, 1 reply; 11+ messages in thread
From: Mahesh Bandewar @ 2014-09-20  0:09 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller,
	netdev, Eric Dumazet, Maciej Zenczykowski

On Fri, Sep 19, 2014 at 4:06 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>
> On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
> > On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
> >> Earlier change to use usable slave array for TLB mode had an additional
> >> performance advantage. So extending the same logic to all other modes
> >> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
> >> Also consolidating this with the earlier TLB change.
> >>
> >> The main idea is to build the usable slaves array in the control path
> >> and use that array for slave selection during xmit operation.
> >>
> >> Measured performance in a setup with a bond of 4x1G NICs with 200
> >> instances of netperf for the modes involved (3ad, xor, tlb)
> >> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
> >>
> >> Mode        TPS-Before   TPS-After
> >>
> >> 802.3ad   : 468,694      493,101
> >> TLB (lb=0): 392,583      392,965
> >> XOR       : 475,696      484,517
> >>
> >> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> >> ---
> >> v1:
> >>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
> >>       the slave that need to be removed.
> >>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
> >>       transition gracefully.
> >>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
> >>       failure.
> >>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
> >>       will populate the array even if these parameters are not used.
> >>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
> >> v2:
> >>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
> >>   (b) Slave link-events now refresh array for all these modes.
> >>   (c) Moved free-array call from bond_close() to bond_uninit().
> >> v3:
> >>   (a) Fixed null pointer dereference.
> >>   (b) Removed bond->lock lockdep dependency.
> >> v4:
> >>   (a) Made to changes to comply with Nikolay's locking changes
> >>   (b) Added a work-queue to refresh slave-array when RTNL is not held
> >>   (c) Array refresh happens ONLY with RTNL now.
> >>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
> >>
> <<<snip>>>
> >> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
> >>      struct bonding *bond = netdev_priv(bond_dev);
> >>      struct list_head *iter;
> >>      struct slave *slave;
> >> +    struct bond_up_slave *arr;
> >>
> >>      bond_netpoll_cleanup(bond_dev);
> >>
> >> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
> >>              __bond_release_one(bond_dev, slave->dev, true);
> >>      netdev_info(bond_dev, "Released all slaves\n");
> >>
> Sorry but I just spotted a major problem, bond_3ad_unbind_slave() (called
> from __bond_release_one) calls ad_agg_selection_logic() which can re-arm
> the slave_arr work after it's supposed to be stopped here (i.e. the bond
> device has been closed so all works should've been stopped) so we might
> leak memory and access freed memory after all since it'll keep
> re-scheduling itself until it can acquire rtnl which is after the bond
> device has been destroyed.
>
This should not be a problem. ndo_close (bond_close()) is called
before ndo_uninit(bond_uninit()), so the work-queues get cancelled
there so if rearm tries to schedule some work after queue gets
cancelled, it can't do much and wont harm anything.
Hence there wont be any arrays built once it's free-ed completely and
therefore no memory leak. I addded some instrumentation and tried
following sequence -

# modprobe bonding mode=4
# ip link set bond0 up
# [Add ip]
# [Add default route]
# ifenslave bond0 eth0 eth1 eth2 eth3
....
[Run some backgound traffic. I used netperf.]

# ip link bond0 down

I did not see anything "bad" happening. Did your trial produced
something unpleasant?

> >> +    arr = rtnl_dereference(bond->slave_arr);
> >> +    if (arr) {
> >> +            kfree_rcu(arr, rcu);
> >> +            RCU_INIT_POINTER(bond->slave_arr, NULL);
> >> +    }
> >> +
> >>      list_del(&bond->bond_list);
> >>
> >>      bond_debug_unregister(bond);
> >> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> >> index 98dc0d7ad731..4635b175256a 100644
> >> --- a/drivers/net/bonding/bonding.h
> >> +++ b/drivers/net/bonding/bonding.h
> >> @@ -177,6 +177,12 @@ struct slave {
> >>      struct kobject kobj;
> >>  };
> >>
> >> +struct bond_up_slave {
> >> +    unsigned int    count;
> >> +    struct rcu_head rcu;
> >> +    struct slave    *arr[0];
> >> +};
> >> +
> >>  /*
> >>   * Link pseudo-state only used internally by monitors
> >>   */
> >> @@ -191,6 +197,7 @@ struct bonding {
> >>      struct   slave __rcu *curr_active_slave;
> >>      struct   slave __rcu *current_arp_slave;
> >>      struct   slave __rcu *primary_slave;
> >> +    struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
> >>      bool     force_primary;
> >>      s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
> >>      int     (*recv_probe)(const struct sk_buff *, struct bonding *,
> >> @@ -220,6 +227,7 @@ struct bonding {
> >>      struct   delayed_work alb_work;
> >>      struct   delayed_work ad_work;
> >>      struct   delayed_work mcast_work;
> >> +    struct   delayed_work slave_arr_work;
> >>  #ifdef CONFIG_DEBUG_FS
> >>      /* debugging support via debugfs */
> >>      struct   dentry *debug_dir;
> >> @@ -531,6 +539,8 @@ const char *bond_slave_link_status(s8 link);
> >>  struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
> >>                                            struct net_device *end_dev,
> >>                                            int level);
> >> +int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
> >> +void bond_slave_arr_work_rearm(struct bonding *bond);
> >>
> >>  #ifdef CONFIG_PROC_FS
> >>  void bond_create_proc_entry(struct bonding *bond);
> >>
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-20  0:09     ` Mahesh Bandewar
@ 2014-09-20 10:19       ` Nikolay Aleksandrov
  2014-09-20 20:04         ` Mahesh Bandewar
  0 siblings, 1 reply; 11+ messages in thread
From: Nikolay Aleksandrov @ 2014-09-20 10:19 UTC (permalink / raw)
  To: Mahesh Bandewar
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller,
	netdev, Eric Dumazet, Maciej Zenczykowski

On 09/20/2014 02:09 AM, Mahesh Bandewar wrote:
> On Fri, Sep 19, 2014 at 4:06 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>>
>> On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
>>> On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
>>>> Earlier change to use usable slave array for TLB mode had an additional
>>>> performance advantage. So extending the same logic to all other modes
>>>> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
>>>> Also consolidating this with the earlier TLB change.
>>>>
>>>> The main idea is to build the usable slaves array in the control path
>>>> and use that array for slave selection during xmit operation.
>>>>
>>>> Measured performance in a setup with a bond of 4x1G NICs with 200
>>>> instances of netperf for the modes involved (3ad, xor, tlb)
>>>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>>>
>>>> Mode        TPS-Before   TPS-After
>>>>
>>>> 802.3ad   : 468,694      493,101
>>>> TLB (lb=0): 392,583      392,965
>>>> XOR       : 475,696      484,517
>>>>
>>>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>>>> ---
>>>> v1:
>>>>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>>>>       the slave that need to be removed.
>>>>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>>>>       transition gracefully.
>>>>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>>>>       failure.
>>>>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>>>>       will populate the array even if these parameters are not used.
>>>>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>>>> v2:
>>>>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>>>>   (b) Slave link-events now refresh array for all these modes.
>>>>   (c) Moved free-array call from bond_close() to bond_uninit().
>>>> v3:
>>>>   (a) Fixed null pointer dereference.
>>>>   (b) Removed bond->lock lockdep dependency.
>>>> v4:
>>>>   (a) Made to changes to comply with Nikolay's locking changes
>>>>   (b) Added a work-queue to refresh slave-array when RTNL is not held
>>>>   (c) Array refresh happens ONLY with RTNL now.
>>>>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>>>>
>> <<<snip>>>
>>>> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>      struct bonding *bond = netdev_priv(bond_dev);
>>>>      struct list_head *iter;
>>>>      struct slave *slave;
>>>> +    struct bond_up_slave *arr;
>>>>
>>>>      bond_netpoll_cleanup(bond_dev);
>>>>
>>>> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>              __bond_release_one(bond_dev, slave->dev, true);
>>>>      netdev_info(bond_dev, "Released all slaves\n");
>>>>
>> Sorry but I just spotted a major problem, bond_3ad_unbind_slave() (called
>> from __bond_release_one) calls ad_agg_selection_logic() which can re-arm
>> the slave_arr work after it's supposed to be stopped here (i.e. the bond
>> device has been closed so all works should've been stopped) so we might
>> leak memory and access freed memory after all since it'll keep
>> re-scheduling itself until it can acquire rtnl which is after the bond
>> device has been destroyed.
>>
> This should not be a problem. ndo_close (bond_close()) is called
> before ndo_uninit(bond_uninit()), so the work-queues get cancelled
> there so if rearm tries to schedule some work after queue gets
> cancelled, it can't do much and wont harm anything.
> Hence there wont be any arrays built once it's free-ed completely and
> therefore no memory leak. I addded some instrumentation and tried
> following sequence -
> 
> # modprobe bonding mode=4
> # ip link set bond0 up
> # [Add ip]
> # [Add default route]
> # ifenslave bond0 eth0 eth1 eth2 eth3
> ....
> [Run some backgound traffic. I used netperf.]
> 
> # ip link bond0 down
> 
> I did not see anything "bad" happening. Did your trial produced
> something unpleasant?
> 
The test you've done is irrelevant to the situation that I described
because ndo_uninit() is called when the device is being destroyed. Moreover
the case I told you about would require to have an active aggregator and an
inactive one (i.e. so agg selection logic will get called), here is the result:
[  428.916586] bond1 (unregistering): Removing an active aggregator
[  428.916589] Failed to build slave-array.
[  428.916849] bond1 (unregistering): Releasing active interface eth1
[  428.920342] bond1 (unregistering): Released all slaves
[  428.923043] Failed to update slave array from WT
[  428.924098] Failed to update slave array from WT
[  428.925125] Failed to update slave array from WT
[  428.926120] Failed to update slave array from WT
[  428.927096] Failed to update slave array from WT
[  428.928101] Failed to update slave array from WT
[  428.929120] Failed to update slave array from WT
[  428.930086] BUG: unable to handle kernel NULL pointer dereference at
       (null)
[  428.930644] IP: [<ffffffff810aa37b>] __queue_work+0x7b/0x350
[  428.930946] PGD 0
[  428.931053] Oops: 0000 [#1] SMP
[  428.931053] Modules linked in: sfc ptp pps_core mdio i2c_algo_bit mtd
bonding(O) snd_hda_codec_generic joydev crct10dif_pclmul crc32_pclmul
i2c_piix4 ppdev crc32c_intel ghash_clmulni_intel parport_pc snd_hda_intel
snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer 9pnet_virtio
snd 9pnet pcspkr parport i2ccore serio_raw virtio_console virtio_balloon
pvpanic soundcore virtio_blk virtio_net ata_generic floppy pata_acpi
virtio_pci virtio_ring virtio
[  428.935022] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O
3.17.0-rc4+ #30
[  428.935022] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  428.935022] task: ffffffff81c1b460 ti: ffffffff81c00000 task.ti:
ffffffff81c00000
[  428.935022] RIP: 0010:[<ffffffff810aa37b>]  [<ffffffff810aa37b>]
__queue_work+0x7b/0x350
[  428.935022] RSP: 0018:ffff88005f003e28  EFLAGS: 00010086
[  428.935022] RAX: ffff88005c05c800 RBX: 0000000000000000 RCX:
0000000000000000
[  428.935022] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
ffff88005a4fbd58
[  428.935022] RBP: ffff88005f003e60 R08: 0000000000000046 R09:
ffffffff8225abc2
[  428.935022] R10: 0000000000000004 R11: 0000000000000005 R12:
ffff88005a4fbd58
[  428.935022] R13: 0000000000000008 R14: ffff88004b211800 R15:
00000000000102f0
[  428.935022] FS:  0000000000000000(0000) GS:ffff88005f000000(0000)
knlGS:0000000000000000
[  428.935022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  428.935022] CR2: 0000000000000000 CR3: 000000004abde000 CR4:
00000000000406f0
[  428.935022] Stack:
[  428.935022]  0a19522f72b12222 0000000081c1b460 ffffffff8225abc0
ffff88005a4fbd78
[  428.935022]  0000000000000101 ffffffff810aa650 ffff88005a4fbd58
ffff88005f003e70
[  428.935022]  ffffffff810aa668 ffff88005f003ea8 ffffffff810f3536
ffffffff8225abc0
[  428.935022] Call Trace:
[  428.935022]  <IRQ>
[  428.935022]
[  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
[  428.935022]  [<ffffffff810aa668>] delayed_work_timer_fn+0x18/0x20
[  428.935022]  [<ffffffff810f3536>] call_timer_fn+0x36/0x120
[  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
[  428.935022]  [<ffffffff810f38f5>] run_timer_softirq+0x1a5/0x320
[  428.935022]  [<ffffffff81096dc5>] __do_softirq+0xf5/0x2b0
[  428.935022]  [<ffffffff810971fd>] irq_exit+0xbd/0xd0
[  428.935022]  [<ffffffff8173b715>] smp_apic_timer_interrupt+0x45/0x60
[  428.935022]  [<ffffffff8173981d>] apic_timer_interrupt+0x6d/0x80
[  428.935022]  <EOI>
[  428.935022]
[  428.935022]  [<ffffffff810581c6>] ? native_safe_halt+0x6/0x10
[  428.935022]  [<ffffffff8101f36f>] default_idle+0x1f/0xe0
[  428.935022]  [<ffffffff8101fd8f>] arch_cpu_idle+0xf/0x20
[  428.935022]  [<ffffffff810d25dd>] cpu_startup_entry+0x38d/0x3c0
[  428.935022]  [<ffffffff81722927>] rest_init+0x87/0x90
[  428.935022]  [<ffffffff81d3510e>] start_kernel+0x482/0x4a3
[  428.935022]  [<ffffffff81d34a85>] ? set_init_arg+0x53/0x53
[  428.935022]  [<ffffffff81d34120>] ? early_idt_handlers+0x120/0x120
[  428.935022]  [<ffffffff81d345ee>] x86_64_start_reservations+0x2a/0x2c
[  428.935022]  [<ffffffff81d3473d>] x86_64_start_kernel+0x14d/0x170
[  428.935022] Code: 84 bb 01 00 00 a8 02 0f 85 eb 00 00 00 48 63 45 d4 49
8b 9e 08 01 00 00 48 03 1c c5 60 fa d0 81 4c 89 e7 e8 18 f5 ff ff 48 85 c0
<48> 8b 3b 0f 84 7c 01 00 00 48 39 c7 0f 84 73 01 00 00 48 89 c7
[  428.935022] RIP  [<ffffffff810aa37b>] __queue_work+0x7b/0x350
[  428.935022]  RSP <ffff88005f003e28>
[  428.935022] CR2: 0000000000000000

This is because it keeps trying to re-schedule even though the interface's
memory has been freed.

While testing this I spotted another issue as well - Failed to build
slave_arr message has been printed too many times because you print it in
3ad mode when there's no active aggregator (bond_3ad_get_active_agg_info
check in bond_update_slave_arr) which leads to re-scheduling which also
lead to a deadlock.

>>>> +    arr = rtnl_dereference(bond->slave_arr);
>>>> +    if (arr) {
>>>> +            kfree_rcu(arr, rcu);
>>>> +            RCU_INIT_POINTER(bond->slave_arr, NULL);
>>>> +    }
>>>> +
>>>>      list_del(&bond->bond_list);
>>>>
>>>>      bond_debug_unregister(bond);
>>>> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>>>> index 98dc0d7ad731..4635b175256a 100644
>>>> --- a/drivers/net/bonding/bonding.h
>>>> +++ b/drivers/net/bonding/bonding.h
>>>> @@ -177,6 +177,12 @@ struct slave {
>>>>      struct kobject kobj;
>>>>  };
>>>>
>>>> +struct bond_up_slave {
>>>> +    unsigned int    count;
>>>> +    struct rcu_head rcu;
>>>> +    struct slave    *arr[0];
>>>> +};
>>>> +
>>>>  /*
>>>>   * Link pseudo-state only used internally by monitors
>>>>   */
>>>> @@ -191,6 +197,7 @@ struct bonding {
>>>>      struct   slave __rcu *curr_active_slave;
>>>>      struct   slave __rcu *current_arp_slave;
>>>>      struct   slave __rcu *primary_slave;
>>>> +    struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
>>>>      bool     force_primary;
>>>>      s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
>>>>      int     (*recv_probe)(const struct sk_buff *, struct bonding *,
>>>> @@ -220,6 +227,7 @@ struct bonding {
>>>>      struct   delayed_work alb_work;
>>>>      struct   delayed_work ad_work;
>>>>      struct   delayed_work mcast_work;
>>>> +    struct   delayed_work slave_arr_work;
>>>>  #ifdef CONFIG_DEBUG_FS
>>>>      /* debugging support via debugfs */
>>>>      struct   dentry *debug_dir;
>>>> @@ -531,6 +539,8 @@ const char *bond_slave_link_status(s8 link);
>>>>  struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
>>>>                                            struct net_device *end_dev,
>>>>                                            int level);
>>>> +int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
>>>> +void bond_slave_arr_work_rearm(struct bonding *bond);
>>>>
>>>>  #ifdef CONFIG_PROC_FS
>>>>  void bond_create_proc_entry(struct bonding *bond);
>>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-20 10:19       ` Nikolay Aleksandrov
@ 2014-09-20 20:04         ` Mahesh Bandewar
  2014-09-21 11:07           ` Nikolay Aleksandrov
  0 siblings, 1 reply; 11+ messages in thread
From: Mahesh Bandewar @ 2014-09-20 20:04 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller,
	netdev, Eric Dumazet, Maciej Zenczykowski

On Sat, Sep 20, 2014 at 3:19 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> On 09/20/2014 02:09 AM, Mahesh Bandewar wrote:
>> On Fri, Sep 19, 2014 at 4:06 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>>>
>>> On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
>>>> On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
>>>>> Earlier change to use usable slave array for TLB mode had an additional
>>>>> performance advantage. So extending the same logic to all other modes
>>>>> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
>>>>> Also consolidating this with the earlier TLB change.
>>>>>
>>>>> The main idea is to build the usable slaves array in the control path
>>>>> and use that array for slave selection during xmit operation.
>>>>>
>>>>> Measured performance in a setup with a bond of 4x1G NICs with 200
>>>>> instances of netperf for the modes involved (3ad, xor, tlb)
>>>>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>>>>
>>>>> Mode        TPS-Before   TPS-After
>>>>>
>>>>> 802.3ad   : 468,694      493,101
>>>>> TLB (lb=0): 392,583      392,965
>>>>> XOR       : 475,696      484,517
>>>>>
>>>>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>>>>> ---
>>>>> v1:
>>>>>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>>>>>       the slave that need to be removed.
>>>>>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>>>>>       transition gracefully.
>>>>>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>>>>>       failure.
>>>>>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>>>>>       will populate the array even if these parameters are not used.
>>>>>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>>>>> v2:
>>>>>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>>>>>   (b) Slave link-events now refresh array for all these modes.
>>>>>   (c) Moved free-array call from bond_close() to bond_uninit().
>>>>> v3:
>>>>>   (a) Fixed null pointer dereference.
>>>>>   (b) Removed bond->lock lockdep dependency.
>>>>> v4:
>>>>>   (a) Made to changes to comply with Nikolay's locking changes
>>>>>   (b) Added a work-queue to refresh slave-array when RTNL is not held
>>>>>   (c) Array refresh happens ONLY with RTNL now.
>>>>>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>>>>>
>>> <<<snip>>>
>>>>> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>>      struct bonding *bond = netdev_priv(bond_dev);
>>>>>      struct list_head *iter;
>>>>>      struct slave *slave;
>>>>> +    struct bond_up_slave *arr;
>>>>>
>>>>>      bond_netpoll_cleanup(bond_dev);
>>>>>
>>>>> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>>              __bond_release_one(bond_dev, slave->dev, true);
>>>>>      netdev_info(bond_dev, "Released all slaves\n");
>>>>>
>>> Sorry but I just spotted a major problem, bond_3ad_unbind_slave() (called
>>> from __bond_release_one) calls ad_agg_selection_logic() which can re-arm
>>> the slave_arr work after it's supposed to be stopped here (i.e. the bond
>>> device has been closed so all works should've been stopped) so we might
>>> leak memory and access freed memory after all since it'll keep
>>> re-scheduling itself until it can acquire rtnl which is after the bond
>>> device has been destroyed.
>>>
>> This should not be a problem. ndo_close (bond_close()) is called
>> before ndo_uninit(bond_uninit()), so the work-queues get cancelled
>> there so if rearm tries to schedule some work after queue gets
>> cancelled, it can't do much and wont harm anything.
>> Hence there wont be any arrays built once it's free-ed completely and
>> therefore no memory leak. I addded some instrumentation and tried
>> following sequence -
>>
>> # modprobe bonding mode=4
>> # ip link set bond0 up
>> # [Add ip]
>> # [Add default route]
>> # ifenslave bond0 eth0 eth1 eth2 eth3
>> ....
>> [Run some backgound traffic. I used netperf.]
>>
>> # ip link bond0 down
>>
>> I did not see anything "bad" happening. Did your trial produced
>> something unpleasant?
>>
> The test you've done is irrelevant to the situation that I described
> because ndo_uninit() is called when the device is being destroyed. Moreover
> the case I told you about would require to have an active aggregator and an
> inactive one (i.e. so agg selection logic will get called), here is the result:
> [  428.916586] bond1 (unregistering): Removing an active aggregator
> [  428.916589] Failed to build slave-array.
> [  428.916849] bond1 (unregistering): Releasing active interface eth1
> [  428.920342] bond1 (unregistering): Released all slaves
> [  428.923043] Failed to update slave array from WT
> [  428.924098] Failed to update slave array from WT
> [  428.925125] Failed to update slave array from WT
> [  428.926120] Failed to update slave array from WT
> [  428.927096] Failed to update slave array from WT
> [  428.928101] Failed to update slave array from WT
> [  428.929120] Failed to update slave array from WT
> [  428.930086] BUG: unable to handle kernel NULL pointer dereference at
>        (null)
> [  428.930644] IP: [<ffffffff810aa37b>] __queue_work+0x7b/0x350
> [  428.930946] PGD 0
> [  428.931053] Oops: 0000 [#1] SMP
> [  428.931053] Modules linked in: sfc ptp pps_core mdio i2c_algo_bit mtd
> bonding(O) snd_hda_codec_generic joydev crct10dif_pclmul crc32_pclmul
> i2c_piix4 ppdev crc32c_intel ghash_clmulni_intel parport_pc snd_hda_intel
> snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer 9pnet_virtio
> snd 9pnet pcspkr parport i2ccore serio_raw virtio_console virtio_balloon
> pvpanic soundcore virtio_blk virtio_net ata_generic floppy pata_acpi
> virtio_pci virtio_ring virtio
> [  428.935022] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O
> 3.17.0-rc4+ #30
> [  428.935022] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  428.935022] task: ffffffff81c1b460 ti: ffffffff81c00000 task.ti:
> ffffffff81c00000
> [  428.935022] RIP: 0010:[<ffffffff810aa37b>]  [<ffffffff810aa37b>]
> __queue_work+0x7b/0x350
> [  428.935022] RSP: 0018:ffff88005f003e28  EFLAGS: 00010086
> [  428.935022] RAX: ffff88005c05c800 RBX: 0000000000000000 RCX:
> 0000000000000000
> [  428.935022] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
> ffff88005a4fbd58
> [  428.935022] RBP: ffff88005f003e60 R08: 0000000000000046 R09:
> ffffffff8225abc2
> [  428.935022] R10: 0000000000000004 R11: 0000000000000005 R12:
> ffff88005a4fbd58
> [  428.935022] R13: 0000000000000008 R14: ffff88004b211800 R15:
> 00000000000102f0
> [  428.935022] FS:  0000000000000000(0000) GS:ffff88005f000000(0000)
> knlGS:0000000000000000
> [  428.935022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  428.935022] CR2: 0000000000000000 CR3: 000000004abde000 CR4:
> 00000000000406f0
> [  428.935022] Stack:
> [  428.935022]  0a19522f72b12222 0000000081c1b460 ffffffff8225abc0
> ffff88005a4fbd78
> [  428.935022]  0000000000000101 ffffffff810aa650 ffff88005a4fbd58
> ffff88005f003e70
> [  428.935022]  ffffffff810aa668 ffff88005f003ea8 ffffffff810f3536
> ffffffff8225abc0
> [  428.935022] Call Trace:
> [  428.935022]  <IRQ>
> [  428.935022]
> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
> [  428.935022]  [<ffffffff810aa668>] delayed_work_timer_fn+0x18/0x20
> [  428.935022]  [<ffffffff810f3536>] call_timer_fn+0x36/0x120
> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
> [  428.935022]  [<ffffffff810f38f5>] run_timer_softirq+0x1a5/0x320
> [  428.935022]  [<ffffffff81096dc5>] __do_softirq+0xf5/0x2b0
> [  428.935022]  [<ffffffff810971fd>] irq_exit+0xbd/0xd0
> [  428.935022]  [<ffffffff8173b715>] smp_apic_timer_interrupt+0x45/0x60
> [  428.935022]  [<ffffffff8173981d>] apic_timer_interrupt+0x6d/0x80
> [  428.935022]  <EOI>
> [  428.935022]
> [  428.935022]  [<ffffffff810581c6>] ? native_safe_halt+0x6/0x10
> [  428.935022]  [<ffffffff8101f36f>] default_idle+0x1f/0xe0
> [  428.935022]  [<ffffffff8101fd8f>] arch_cpu_idle+0xf/0x20
> [  428.935022]  [<ffffffff810d25dd>] cpu_startup_entry+0x38d/0x3c0
> [  428.935022]  [<ffffffff81722927>] rest_init+0x87/0x90
> [  428.935022]  [<ffffffff81d3510e>] start_kernel+0x482/0x4a3
> [  428.935022]  [<ffffffff81d34a85>] ? set_init_arg+0x53/0x53
> [  428.935022]  [<ffffffff81d34120>] ? early_idt_handlers+0x120/0x120
> [  428.935022]  [<ffffffff81d345ee>] x86_64_start_reservations+0x2a/0x2c
> [  428.935022]  [<ffffffff81d3473d>] x86_64_start_kernel+0x14d/0x170
> [  428.935022] Code: 84 bb 01 00 00 a8 02 0f 85 eb 00 00 00 48 63 45 d4 49
> 8b 9e 08 01 00 00 48 03 1c c5 60 fa d0 81 4c 89 e7 e8 18 f5 ff ff 48 85 c0
> <48> 8b 3b 0f 84 7c 01 00 00 48 39 c7 0f 84 73 01 00 00 48 89 c7
> [  428.935022] RIP  [<ffffffff810aa37b>] __queue_work+0x7b/0x350
> [  428.935022]  RSP <ffff88005f003e28>
> [  428.935022] CR2: 0000000000000000
>
> This is because it keeps trying to re-schedule even though the interface's
> memory has been freed.
>
Hmm, how do we handle this?

> While testing this I spotted another issue as well - Failed to build
> slave_arr message has been printed too many times because you print it in
> 3ad mode when there's no active aggregator (bond_3ad_get_active_agg_info
> check in bond_update_slave_arr) which leads to re-scheduling which also
> lead to a deadlock.
>
I think this can be corrected with pr_ratelimited() call.

>>>>> +    arr = rtnl_dereference(bond->slave_arr);
>>>>> +    if (arr) {
>>>>> +            kfree_rcu(arr, rcu);
>>>>> +            RCU_INIT_POINTER(bond->slave_arr, NULL);
>>>>> +    }
>>>>> +
>>>>>      list_del(&bond->bond_list);
>>>>>
>>>>>      bond_debug_unregister(bond);
>>>>> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>>>>> index 98dc0d7ad731..4635b175256a 100644
>>>>> --- a/drivers/net/bonding/bonding.h
>>>>> +++ b/drivers/net/bonding/bonding.h
>>>>> @@ -177,6 +177,12 @@ struct slave {
>>>>>      struct kobject kobj;
>>>>>  };
>>>>>
>>>>> +struct bond_up_slave {
>>>>> +    unsigned int    count;
>>>>> +    struct rcu_head rcu;
>>>>> +    struct slave    *arr[0];
>>>>> +};
>>>>> +
>>>>>  /*
>>>>>   * Link pseudo-state only used internally by monitors
>>>>>   */
>>>>> @@ -191,6 +197,7 @@ struct bonding {
>>>>>      struct   slave __rcu *curr_active_slave;
>>>>>      struct   slave __rcu *current_arp_slave;
>>>>>      struct   slave __rcu *primary_slave;
>>>>> +    struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
>>>>>      bool     force_primary;
>>>>>      s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
>>>>>      int     (*recv_probe)(const struct sk_buff *, struct bonding *,
>>>>> @@ -220,6 +227,7 @@ struct bonding {
>>>>>      struct   delayed_work alb_work;
>>>>>      struct   delayed_work ad_work;
>>>>>      struct   delayed_work mcast_work;
>>>>> +    struct   delayed_work slave_arr_work;
>>>>>  #ifdef CONFIG_DEBUG_FS
>>>>>      /* debugging support via debugfs */
>>>>>      struct   dentry *debug_dir;
>>>>> @@ -531,6 +539,8 @@ const char *bond_slave_link_status(s8 link);
>>>>>  struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
>>>>>                                            struct net_device *end_dev,
>>>>>                                            int level);
>>>>> +int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
>>>>> +void bond_slave_arr_work_rearm(struct bonding *bond);
>>>>>
>>>>>  #ifdef CONFIG_PROC_FS
>>>>>  void bond_create_proc_entry(struct bonding *bond);
>>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-20 20:04         ` Mahesh Bandewar
@ 2014-09-21 11:07           ` Nikolay Aleksandrov
  2014-09-23  5:13             ` Mahesh Bandewar
  0 siblings, 1 reply; 11+ messages in thread
From: Nikolay Aleksandrov @ 2014-09-21 11:07 UTC (permalink / raw)
  To: Mahesh Bandewar
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller,
	netdev, Eric Dumazet, Maciej Zenczykowski

On 09/20/2014 10:04 PM, Mahesh Bandewar wrote:
> On Sat, Sep 20, 2014 at 3:19 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>> On 09/20/2014 02:09 AM, Mahesh Bandewar wrote:
>>> On Fri, Sep 19, 2014 at 4:06 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>>>>
>>>> On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
>>>>> On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
>>>>>> Earlier change to use usable slave array for TLB mode had an additional
>>>>>> performance advantage. So extending the same logic to all other modes
>>>>>> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
>>>>>> Also consolidating this with the earlier TLB change.
>>>>>>
>>>>>> The main idea is to build the usable slaves array in the control path
>>>>>> and use that array for slave selection during xmit operation.
>>>>>>
>>>>>> Measured performance in a setup with a bond of 4x1G NICs with 200
>>>>>> instances of netperf for the modes involved (3ad, xor, tlb)
>>>>>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>>>>>
>>>>>> Mode        TPS-Before   TPS-After
>>>>>>
>>>>>> 802.3ad   : 468,694      493,101
>>>>>> TLB (lb=0): 392,583      392,965
>>>>>> XOR       : 475,696      484,517
>>>>>>
>>>>>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>>>>>> ---
>>>>>> v1:
>>>>>>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>>>>>>       the slave that need to be removed.
>>>>>>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>>>>>>       transition gracefully.
>>>>>>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>>>>>>       failure.
>>>>>>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>>>>>>       will populate the array even if these parameters are not used.
>>>>>>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>>>>>> v2:
>>>>>>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>>>>>>   (b) Slave link-events now refresh array for all these modes.
>>>>>>   (c) Moved free-array call from bond_close() to bond_uninit().
>>>>>> v3:
>>>>>>   (a) Fixed null pointer dereference.
>>>>>>   (b) Removed bond->lock lockdep dependency.
>>>>>> v4:
>>>>>>   (a) Made to changes to comply with Nikolay's locking changes
>>>>>>   (b) Added a work-queue to refresh slave-array when RTNL is not held
>>>>>>   (c) Array refresh happens ONLY with RTNL now.
>>>>>>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>>>>>>
>>>> <<<snip>>>
>>>>>> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>>>      struct bonding *bond = netdev_priv(bond_dev);
>>>>>>      struct list_head *iter;
>>>>>>      struct slave *slave;
>>>>>> +    struct bond_up_slave *arr;
>>>>>>
>>>>>>      bond_netpoll_cleanup(bond_dev);
>>>>>>
>>>>>> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>>>              __bond_release_one(bond_dev, slave->dev, true);
>>>>>>      netdev_info(bond_dev, "Released all slaves\n");
>>>>>>
>>>> Sorry but I just spotted a major problem, bond_3ad_unbind_slave() (called
>>>> from __bond_release_one) calls ad_agg_selection_logic() which can re-arm
>>>> the slave_arr work after it's supposed to be stopped here (i.e. the bond
>>>> device has been closed so all works should've been stopped) so we might
>>>> leak memory and access freed memory after all since it'll keep
>>>> re-scheduling itself until it can acquire rtnl which is after the bond
>>>> device has been destroyed.
>>>>
>>> This should not be a problem. ndo_close (bond_close()) is called
>>> before ndo_uninit(bond_uninit()), so the work-queues get cancelled
>>> there so if rearm tries to schedule some work after queue gets
>>> cancelled, it can't do much and wont harm anything.
>>> Hence there wont be any arrays built once it's free-ed completely and
>>> therefore no memory leak. I addded some instrumentation and tried
>>> following sequence -
>>>
>>> # modprobe bonding mode=4
>>> # ip link set bond0 up
>>> # [Add ip]
>>> # [Add default route]
>>> # ifenslave bond0 eth0 eth1 eth2 eth3
>>> ....
>>> [Run some backgound traffic. I used netperf.]
>>>
>>> # ip link bond0 down
>>>
>>> I did not see anything "bad" happening. Did your trial produced
>>> something unpleasant?
>>>
>> The test you've done is irrelevant to the situation that I described
>> because ndo_uninit() is called when the device is being destroyed. Moreover
>> the case I told you about would require to have an active aggregator and an
>> inactive one (i.e. so agg selection logic will get called), here is the result:
>> [  428.916586] bond1 (unregistering): Removing an active aggregator
>> [  428.916589] Failed to build slave-array.
>> [  428.916849] bond1 (unregistering): Releasing active interface eth1
>> [  428.920342] bond1 (unregistering): Released all slaves
>> [  428.923043] Failed to update slave array from WT
>> [  428.924098] Failed to update slave array from WT
>> [  428.925125] Failed to update slave array from WT
>> [  428.926120] Failed to update slave array from WT
>> [  428.927096] Failed to update slave array from WT
>> [  428.928101] Failed to update slave array from WT
>> [  428.929120] Failed to update slave array from WT
>> [  428.930086] BUG: unable to handle kernel NULL pointer dereference at
>>        (null)
>> [  428.930644] IP: [<ffffffff810aa37b>] __queue_work+0x7b/0x350
>> [  428.930946] PGD 0
>> [  428.931053] Oops: 0000 [#1] SMP
>> [  428.931053] Modules linked in: sfc ptp pps_core mdio i2c_algo_bit mtd
>> bonding(O) snd_hda_codec_generic joydev crct10dif_pclmul crc32_pclmul
>> i2c_piix4 ppdev crc32c_intel ghash_clmulni_intel parport_pc snd_hda_intel
>> snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer 9pnet_virtio
>> snd 9pnet pcspkr parport i2ccore serio_raw virtio_console virtio_balloon
>> pvpanic soundcore virtio_blk virtio_net ata_generic floppy pata_acpi
>> virtio_pci virtio_ring virtio
>> [  428.935022] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O
>> 3.17.0-rc4+ #30
>> [  428.935022] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> [  428.935022] task: ffffffff81c1b460 ti: ffffffff81c00000 task.ti:
>> ffffffff81c00000
>> [  428.935022] RIP: 0010:[<ffffffff810aa37b>]  [<ffffffff810aa37b>]
>> __queue_work+0x7b/0x350
>> [  428.935022] RSP: 0018:ffff88005f003e28  EFLAGS: 00010086
>> [  428.935022] RAX: ffff88005c05c800 RBX: 0000000000000000 RCX:
>> 0000000000000000
>> [  428.935022] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
>> ffff88005a4fbd58
>> [  428.935022] RBP: ffff88005f003e60 R08: 0000000000000046 R09:
>> ffffffff8225abc2
>> [  428.935022] R10: 0000000000000004 R11: 0000000000000005 R12:
>> ffff88005a4fbd58
>> [  428.935022] R13: 0000000000000008 R14: ffff88004b211800 R15:
>> 00000000000102f0
>> [  428.935022] FS:  0000000000000000(0000) GS:ffff88005f000000(0000)
>> knlGS:0000000000000000
>> [  428.935022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  428.935022] CR2: 0000000000000000 CR3: 000000004abde000 CR4:
>> 00000000000406f0
>> [  428.935022] Stack:
>> [  428.935022]  0a19522f72b12222 0000000081c1b460 ffffffff8225abc0
>> ffff88005a4fbd78
>> [  428.935022]  0000000000000101 ffffffff810aa650 ffff88005a4fbd58
>> ffff88005f003e70
>> [  428.935022]  ffffffff810aa668 ffff88005f003ea8 ffffffff810f3536
>> ffffffff8225abc0
>> [  428.935022] Call Trace:
>> [  428.935022]  <IRQ>
>> [  428.935022]
>> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
>> [  428.935022]  [<ffffffff810aa668>] delayed_work_timer_fn+0x18/0x20
>> [  428.935022]  [<ffffffff810f3536>] call_timer_fn+0x36/0x120
>> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
>> [  428.935022]  [<ffffffff810f38f5>] run_timer_softirq+0x1a5/0x320
>> [  428.935022]  [<ffffffff81096dc5>] __do_softirq+0xf5/0x2b0
>> [  428.935022]  [<ffffffff810971fd>] irq_exit+0xbd/0xd0
>> [  428.935022]  [<ffffffff8173b715>] smp_apic_timer_interrupt+0x45/0x60
>> [  428.935022]  [<ffffffff8173981d>] apic_timer_interrupt+0x6d/0x80
>> [  428.935022]  <EOI>
>> [  428.935022]
>> [  428.935022]  [<ffffffff810581c6>] ? native_safe_halt+0x6/0x10
>> [  428.935022]  [<ffffffff8101f36f>] default_idle+0x1f/0xe0
>> [  428.935022]  [<ffffffff8101fd8f>] arch_cpu_idle+0xf/0x20
>> [  428.935022]  [<ffffffff810d25dd>] cpu_startup_entry+0x38d/0x3c0
>> [  428.935022]  [<ffffffff81722927>] rest_init+0x87/0x90
>> [  428.935022]  [<ffffffff81d3510e>] start_kernel+0x482/0x4a3
>> [  428.935022]  [<ffffffff81d34a85>] ? set_init_arg+0x53/0x53
>> [  428.935022]  [<ffffffff81d34120>] ? early_idt_handlers+0x120/0x120
>> [  428.935022]  [<ffffffff81d345ee>] x86_64_start_reservations+0x2a/0x2c
>> [  428.935022]  [<ffffffff81d3473d>] x86_64_start_kernel+0x14d/0x170
>> [  428.935022] Code: 84 bb 01 00 00 a8 02 0f 85 eb 00 00 00 48 63 45 d4 49
>> 8b 9e 08 01 00 00 48 03 1c c5 60 fa d0 81 4c 89 e7 e8 18 f5 ff ff 48 85 c0
>> <48> 8b 3b 0f 84 7c 01 00 00 48 39 c7 0f 84 73 01 00 00 48 89 c7
>> [  428.935022] RIP  [<ffffffff810aa37b>] __queue_work+0x7b/0x350
>> [  428.935022]  RSP <ffff88005f003e28>
>> [  428.935022] CR2: 0000000000000000
>>
>> This is because it keeps trying to re-schedule even though the interface's
>> memory has been freed.
>>
> Hmm, how do we handle this?
> 
This is tricky and what concerns me more is that people might make this
mistake again in the future. It's easy to unknowingly make use of a
function that re-schedules this from the wrong place.
What I just noticed is that for all 3ad cases you could pull the scheduling
in the bond_3ad_state_machine_handler() function.
The call sites of ad_agg_selection_logic() are:
- 3ad unbind slave (no need to schedule here as __bond_release_one would
rebuild the array anyhow)
- bond_3ad_state_machine_handler() <- here's where the schedule should
happen as this gets stopped first when the bond is closed and can't get
restarted unless it's opened again.
- ad_port_selection_logic() <- this is called from
bond_3ad_state_machine_handler() only, so this case will be handled as well.

The other 2 functions that you convert - ad_enable/disable_collecting are
used only from ad_mux_machine() which is only called in
bond_3ad_state_machine_handler().

So basically you can pull all rebuild schedules in their common caller -
bond_3ad_state_machine_handler(), just make a flag to note that a rebuild
is needed probably something similar to should_notify_rtnl.
This way you can remove the scheduling from the various 3ad functions that
may get used and will have it only in 1 place which is more easily controlled.

Of course, the alternative would be once again - convert
bond_3ad_state_machine_handler() to RTNL, but that has its own set of problems.

>> While testing this I spotted another issue as well - Failed to build
>> slave_arr message has been printed too many times because you print it in
>> 3ad mode when there's no active aggregator (bond_3ad_get_active_agg_info
>> check in bond_update_slave_arr) which leads to re-scheduling which also
>> lead to a deadlock.
>>
> I think this can be corrected with pr_ratelimited() call.
> 
IMO it shouldn't print anything if it couldn't rebuild the array due to
missing active aggregator as that's not an error condition. It should
though probably clean out the slave array because transmission shouldn't be
possible without an active aggregator in 3ad.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-21 11:07           ` Nikolay Aleksandrov
@ 2014-09-23  5:13             ` Mahesh Bandewar
  2014-09-23  8:29               ` Nikolay Aleksandrov
  0 siblings, 1 reply; 11+ messages in thread
From: Mahesh Bandewar @ 2014-09-23  5:13 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller,
	netdev, Eric Dumazet, Maciej Zenczykowski

On Sun, Sep 21, 2014 at 4:07 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> On 09/20/2014 10:04 PM, Mahesh Bandewar wrote:
>> On Sat, Sep 20, 2014 at 3:19 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>>> On 09/20/2014 02:09 AM, Mahesh Bandewar wrote:
>>>> On Fri, Sep 19, 2014 at 4:06 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>>>>>
>>>>> On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
>>>>>> On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
>>>>>>> Earlier change to use usable slave array for TLB mode had an additional
>>>>>>> performance advantage. So extending the same logic to all other modes
>>>>>>> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
>>>>>>> Also consolidating this with the earlier TLB change.
>>>>>>>
>>>>>>> The main idea is to build the usable slaves array in the control path
>>>>>>> and use that array for slave selection during xmit operation.
>>>>>>>
>>>>>>> Measured performance in a setup with a bond of 4x1G NICs with 200
>>>>>>> instances of netperf for the modes involved (3ad, xor, tlb)
>>>>>>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>>>>>>
>>>>>>> Mode        TPS-Before   TPS-After
>>>>>>>
>>>>>>> 802.3ad   : 468,694      493,101
>>>>>>> TLB (lb=0): 392,583      392,965
>>>>>>> XOR       : 475,696      484,517
>>>>>>>
>>>>>>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>>>>>>> ---
>>>>>>> v1:
>>>>>>>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>>>>>>>       the slave that need to be removed.
>>>>>>>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>>>>>>>       transition gracefully.
>>>>>>>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>>>>>>>       failure.
>>>>>>>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>>>>>>>       will populate the array even if these parameters are not used.
>>>>>>>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>>>>>>> v2:
>>>>>>>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>>>>>>>   (b) Slave link-events now refresh array for all these modes.
>>>>>>>   (c) Moved free-array call from bond_close() to bond_uninit().
>>>>>>> v3:
>>>>>>>   (a) Fixed null pointer dereference.
>>>>>>>   (b) Removed bond->lock lockdep dependency.
>>>>>>> v4:
>>>>>>>   (a) Made to changes to comply with Nikolay's locking changes
>>>>>>>   (b) Added a work-queue to refresh slave-array when RTNL is not held
>>>>>>>   (c) Array refresh happens ONLY with RTNL now.
>>>>>>>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>>>>>>>
>>>>> <<<snip>>>
>>>>>>> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>>>>      struct bonding *bond = netdev_priv(bond_dev);
>>>>>>>      struct list_head *iter;
>>>>>>>      struct slave *slave;
>>>>>>> +    struct bond_up_slave *arr;
>>>>>>>
>>>>>>>      bond_netpoll_cleanup(bond_dev);
>>>>>>>
>>>>>>> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>>>>              __bond_release_one(bond_dev, slave->dev, true);
>>>>>>>      netdev_info(bond_dev, "Released all slaves\n");
>>>>>>>
>>>>> Sorry but I just spotted a major problem, bond_3ad_unbind_slave() (called
>>>>> from __bond_release_one) calls ad_agg_selection_logic() which can re-arm
>>>>> the slave_arr work after it's supposed to be stopped here (i.e. the bond
>>>>> device has been closed so all works should've been stopped) so we might
>>>>> leak memory and access freed memory after all since it'll keep
>>>>> re-scheduling itself until it can acquire rtnl which is after the bond
>>>>> device has been destroyed.
>>>>>
>>>> This should not be a problem. ndo_close (bond_close()) is called
>>>> before ndo_uninit(bond_uninit()), so the work-queues get cancelled
>>>> there so if rearm tries to schedule some work after queue gets
>>>> cancelled, it can't do much and wont harm anything.
>>>> Hence there wont be any arrays built once it's free-ed completely and
>>>> therefore no memory leak. I addded some instrumentation and tried
>>>> following sequence -
>>>>
>>>> # modprobe bonding mode=4
>>>> # ip link set bond0 up
>>>> # [Add ip]
>>>> # [Add default route]
>>>> # ifenslave bond0 eth0 eth1 eth2 eth3
>>>> ....
>>>> [Run some backgound traffic. I used netperf.]
>>>>
>>>> # ip link bond0 down
>>>>
>>>> I did not see anything "bad" happening. Did your trial produced
>>>> something unpleasant?
>>>>
>>> The test you've done is irrelevant to the situation that I described
>>> because ndo_uninit() is called when the device is being destroyed. Moreover
>>> the case I told you about would require to have an active aggregator and an
>>> inactive one (i.e. so agg selection logic will get called), here is the result:
>>> [  428.916586] bond1 (unregistering): Removing an active aggregator
>>> [  428.916589] Failed to build slave-array.
>>> [  428.916849] bond1 (unregistering): Releasing active interface eth1
>>> [  428.920342] bond1 (unregistering): Released all slaves
>>> [  428.923043] Failed to update slave array from WT
>>> [  428.924098] Failed to update slave array from WT
>>> [  428.925125] Failed to update slave array from WT
>>> [  428.926120] Failed to update slave array from WT
>>> [  428.927096] Failed to update slave array from WT
>>> [  428.928101] Failed to update slave array from WT
>>> [  428.929120] Failed to update slave array from WT
>>> [  428.930086] BUG: unable to handle kernel NULL pointer dereference at
>>>        (null)
>>> [  428.930644] IP: [<ffffffff810aa37b>] __queue_work+0x7b/0x350
>>> [  428.930946] PGD 0
>>> [  428.931053] Oops: 0000 [#1] SMP
>>> [  428.931053] Modules linked in: sfc ptp pps_core mdio i2c_algo_bit mtd
>>> bonding(O) snd_hda_codec_generic joydev crct10dif_pclmul crc32_pclmul
>>> i2c_piix4 ppdev crc32c_intel ghash_clmulni_intel parport_pc snd_hda_intel
>>> snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer 9pnet_virtio
>>> snd 9pnet pcspkr parport i2ccore serio_raw virtio_console virtio_balloon
>>> pvpanic soundcore virtio_blk virtio_net ata_generic floppy pata_acpi
>>> virtio_pci virtio_ring virtio
>>> [  428.935022] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O
>>> 3.17.0-rc4+ #30
>>> [  428.935022] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>> [  428.935022] task: ffffffff81c1b460 ti: ffffffff81c00000 task.ti:
>>> ffffffff81c00000
>>> [  428.935022] RIP: 0010:[<ffffffff810aa37b>]  [<ffffffff810aa37b>]
>>> __queue_work+0x7b/0x350
>>> [  428.935022] RSP: 0018:ffff88005f003e28  EFLAGS: 00010086
>>> [  428.935022] RAX: ffff88005c05c800 RBX: 0000000000000000 RCX:
>>> 0000000000000000
>>> [  428.935022] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
>>> ffff88005a4fbd58
>>> [  428.935022] RBP: ffff88005f003e60 R08: 0000000000000046 R09:
>>> ffffffff8225abc2
>>> [  428.935022] R10: 0000000000000004 R11: 0000000000000005 R12:
>>> ffff88005a4fbd58
>>> [  428.935022] R13: 0000000000000008 R14: ffff88004b211800 R15:
>>> 00000000000102f0
>>> [  428.935022] FS:  0000000000000000(0000) GS:ffff88005f000000(0000)
>>> knlGS:0000000000000000
>>> [  428.935022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  428.935022] CR2: 0000000000000000 CR3: 000000004abde000 CR4:
>>> 00000000000406f0
>>> [  428.935022] Stack:
>>> [  428.935022]  0a19522f72b12222 0000000081c1b460 ffffffff8225abc0
>>> ffff88005a4fbd78
>>> [  428.935022]  0000000000000101 ffffffff810aa650 ffff88005a4fbd58
>>> ffff88005f003e70
>>> [  428.935022]  ffffffff810aa668 ffff88005f003ea8 ffffffff810f3536
>>> ffffffff8225abc0
>>> [  428.935022] Call Trace:
>>> [  428.935022]  <IRQ>
>>> [  428.935022]
>>> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
>>> [  428.935022]  [<ffffffff810aa668>] delayed_work_timer_fn+0x18/0x20
>>> [  428.935022]  [<ffffffff810f3536>] call_timer_fn+0x36/0x120
>>> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
>>> [  428.935022]  [<ffffffff810f38f5>] run_timer_softirq+0x1a5/0x320
>>> [  428.935022]  [<ffffffff81096dc5>] __do_softirq+0xf5/0x2b0
>>> [  428.935022]  [<ffffffff810971fd>] irq_exit+0xbd/0xd0
>>> [  428.935022]  [<ffffffff8173b715>] smp_apic_timer_interrupt+0x45/0x60
>>> [  428.935022]  [<ffffffff8173981d>] apic_timer_interrupt+0x6d/0x80
>>> [  428.935022]  <EOI>
>>> [  428.935022]
>>> [  428.935022]  [<ffffffff810581c6>] ? native_safe_halt+0x6/0x10
>>> [  428.935022]  [<ffffffff8101f36f>] default_idle+0x1f/0xe0
>>> [  428.935022]  [<ffffffff8101fd8f>] arch_cpu_idle+0xf/0x20
>>> [  428.935022]  [<ffffffff810d25dd>] cpu_startup_entry+0x38d/0x3c0
>>> [  428.935022]  [<ffffffff81722927>] rest_init+0x87/0x90
>>> [  428.935022]  [<ffffffff81d3510e>] start_kernel+0x482/0x4a3
>>> [  428.935022]  [<ffffffff81d34a85>] ? set_init_arg+0x53/0x53
>>> [  428.935022]  [<ffffffff81d34120>] ? early_idt_handlers+0x120/0x120
>>> [  428.935022]  [<ffffffff81d345ee>] x86_64_start_reservations+0x2a/0x2c
>>> [  428.935022]  [<ffffffff81d3473d>] x86_64_start_kernel+0x14d/0x170
>>> [  428.935022] Code: 84 bb 01 00 00 a8 02 0f 85 eb 00 00 00 48 63 45 d4 49
>>> 8b 9e 08 01 00 00 48 03 1c c5 60 fa d0 81 4c 89 e7 e8 18 f5 ff ff 48 85 c0
>>> <48> 8b 3b 0f 84 7c 01 00 00 48 39 c7 0f 84 73 01 00 00 48 89 c7
>>> [  428.935022] RIP  [<ffffffff810aa37b>] __queue_work+0x7b/0x350
>>> [  428.935022]  RSP <ffff88005f003e28>
>>> [  428.935022] CR2: 0000000000000000
>>>
>>> This is because it keeps trying to re-schedule even though the interface's
>>> memory has been freed.
>>>
>> Hmm, how do we handle this?
>>
> This is tricky and what concerns me more is that people might make this
> mistake again in the future. It's easy to unknowingly make use of a
> function that re-schedules this from the wrong place.
> What I just noticed is that for all 3ad cases you could pull the scheduling
> in the bond_3ad_state_machine_handler() function.
> The call sites of ad_agg_selection_logic() are:
> - 3ad unbind slave (no need to schedule here as __bond_release_one would
> rebuild the array anyhow)
> - bond_3ad_state_machine_handler() <- here's where the schedule should
> happen as this gets stopped first when the bond is closed and can't get
> restarted unless it's opened again.
> - ad_port_selection_logic() <- this is called from
> bond_3ad_state_machine_handler() only, so this case will be handled as well.
>
> The other 2 functions that you convert - ad_enable/disable_collecting are
> used only from ad_mux_machine() which is only called in
> bond_3ad_state_machine_handler().
>
> So basically you can pull all rebuild schedules in their common caller -
> bond_3ad_state_machine_handler(), just make a flag to note that a rebuild
> is needed probably something similar to should_notify_rtnl.
> This way you can remove the scheduling from the various 3ad functions that
> may get used and will have it only in 1 place which is more easily controlled.
>
Well, I was just trying to avoid using flags to pass state from one to
another function so that we can update the array at one place. This
might introduce some bug so I was keeping it simple and build it only
when the condition requires it to build it. However I do not see how
this will fix the issue that you have seen, or would it? If so how?

> Of course, the alternative would be once again - convert
> bond_3ad_state_machine_handler() to RTNL, but that has its own set of problems.
>
It's convoluted, let's keep it simple for now :)

>>> While testing this I spotted another issue as well - Failed to build
>>> slave_arr message has been printed too many times because you print it in
>>> 3ad mode when there's no active aggregator (bond_3ad_get_active_agg_info
>>> check in bond_update_slave_arr) which leads to re-scheduling which also
>>> lead to a deadlock.
>>>
>> I think this can be corrected with pr_ratelimited() call.
>>
> IMO it shouldn't print anything if it couldn't rebuild the array due to
> missing active aggregator as that's not an error condition. It should
> though probably clean out the slave array because transmission shouldn't be
> possible without an active aggregator in 3ad.
>
Sure missing active aggregator is not an error but free-ing the slave
array silently would be bad either. At least we would see something in
the messages about "something" went wrong.
>
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-23  5:13             ` Mahesh Bandewar
@ 2014-09-23  8:29               ` Nikolay Aleksandrov
  2014-09-24  0:14                 ` Mahesh Bandewar
  0 siblings, 1 reply; 11+ messages in thread
From: Nikolay Aleksandrov @ 2014-09-23  8:29 UTC (permalink / raw)
  To: Mahesh Bandewar
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller,
	netdev, Eric Dumazet, Maciej Zenczykowski

On 23/09/14 07:13, Mahesh Bandewar wrote:
> On Sun, Sep 21, 2014 at 4:07 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>> On 09/20/2014 10:04 PM, Mahesh Bandewar wrote:
>>> On Sat, Sep 20, 2014 at 3:19 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>>>> On 09/20/2014 02:09 AM, Mahesh Bandewar wrote:
>>>>> On Fri, Sep 19, 2014 at 4:06 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
>>>>>>
>>>>>> On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
>>>>>>> On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
>>>>>>>> Earlier change to use usable slave array for TLB mode had an additional
>>>>>>>> performance advantage. So extending the same logic to all other modes
>>>>>>>> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
>>>>>>>> Also consolidating this with the earlier TLB change.
>>>>>>>>
>>>>>>>> The main idea is to build the usable slaves array in the control path
>>>>>>>> and use that array for slave selection during xmit operation.
>>>>>>>>
>>>>>>>> Measured performance in a setup with a bond of 4x1G NICs with 200
>>>>>>>> instances of netperf for the modes involved (3ad, xor, tlb)
>>>>>>>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>>>>>>>
>>>>>>>> Mode        TPS-Before   TPS-After
>>>>>>>>
>>>>>>>> 802.3ad   : 468,694      493,101
>>>>>>>> TLB (lb=0): 392,583      392,965
>>>>>>>> XOR       : 475,696      484,517
>>>>>>>>
>>>>>>>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>>>>>>>> ---
>>>>>>>> v1:
>>>>>>>>    (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>>>>>>>>        the slave that need to be removed.
>>>>>>>>    (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>>>>>>>>        transition gracefully.
>>>>>>>>    (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>>>>>>>>        failure.
>>>>>>>>    (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>>>>>>>>        will populate the array even if these parameters are not used.
>>>>>>>>    (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>>>>>>>> v2:
>>>>>>>>    (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>>>>>>>>    (b) Slave link-events now refresh array for all these modes.
>>>>>>>>    (c) Moved free-array call from bond_close() to bond_uninit().
>>>>>>>> v3:
>>>>>>>>    (a) Fixed null pointer dereference.
>>>>>>>>    (b) Removed bond->lock lockdep dependency.
>>>>>>>> v4:
>>>>>>>>    (a) Made to changes to comply with Nikolay's locking changes
>>>>>>>>    (b) Added a work-queue to refresh slave-array when RTNL is not held
>>>>>>>>    (c) Array refresh happens ONLY with RTNL now.
>>>>>>>>    (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>>>>>>>>
>>>>>> <<<snip>>>
>>>>>>>> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>>>>>       struct bonding *bond = netdev_priv(bond_dev);
>>>>>>>>       struct list_head *iter;
>>>>>>>>       struct slave *slave;
>>>>>>>> +    struct bond_up_slave *arr;
>>>>>>>>
>>>>>>>>       bond_netpoll_cleanup(bond_dev);
>>>>>>>>
>>>>>>>> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device *bond_dev)
>>>>>>>>               __bond_release_one(bond_dev, slave->dev, true);
>>>>>>>>       netdev_info(bond_dev, "Released all slaves\n");
>>>>>>>>
>>>>>> Sorry but I just spotted a major problem, bond_3ad_unbind_slave() (called
>>>>>> from __bond_release_one) calls ad_agg_selection_logic() which can re-arm
>>>>>> the slave_arr work after it's supposed to be stopped here (i.e. the bond
>>>>>> device has been closed so all works should've been stopped) so we might
>>>>>> leak memory and access freed memory after all since it'll keep
>>>>>> re-scheduling itself until it can acquire rtnl which is after the bond
>>>>>> device has been destroyed.
>>>>>>
>>>>> This should not be a problem. ndo_close (bond_close()) is called
>>>>> before ndo_uninit(bond_uninit()), so the work-queues get cancelled
>>>>> there so if rearm tries to schedule some work after queue gets
>>>>> cancelled, it can't do much and wont harm anything.
>>>>> Hence there wont be any arrays built once it's free-ed completely and
>>>>> therefore no memory leak. I addded some instrumentation and tried
>>>>> following sequence -
>>>>>
>>>>> # modprobe bonding mode=4
>>>>> # ip link set bond0 up
>>>>> # [Add ip]
>>>>> # [Add default route]
>>>>> # ifenslave bond0 eth0 eth1 eth2 eth3
>>>>> ....
>>>>> [Run some backgound traffic. I used netperf.]
>>>>>
>>>>> # ip link bond0 down
>>>>>
>>>>> I did not see anything "bad" happening. Did your trial produced
>>>>> something unpleasant?
>>>>>
>>>> The test you've done is irrelevant to the situation that I described
>>>> because ndo_uninit() is called when the device is being destroyed. Moreover
>>>> the case I told you about would require to have an active aggregator and an
>>>> inactive one (i.e. so agg selection logic will get called), here is the result:
>>>> [  428.916586] bond1 (unregistering): Removing an active aggregator
>>>> [  428.916589] Failed to build slave-array.
>>>> [  428.916849] bond1 (unregistering): Releasing active interface eth1
>>>> [  428.920342] bond1 (unregistering): Released all slaves
>>>> [  428.923043] Failed to update slave array from WT
>>>> [  428.924098] Failed to update slave array from WT
>>>> [  428.925125] Failed to update slave array from WT
>>>> [  428.926120] Failed to update slave array from WT
>>>> [  428.927096] Failed to update slave array from WT
>>>> [  428.928101] Failed to update slave array from WT
>>>> [  428.929120] Failed to update slave array from WT
>>>> [  428.930086] BUG: unable to handle kernel NULL pointer dereference at
>>>>         (null)
>>>> [  428.930644] IP: [<ffffffff810aa37b>] __queue_work+0x7b/0x350
>>>> [  428.930946] PGD 0
>>>> [  428.931053] Oops: 0000 [#1] SMP
>>>> [  428.931053] Modules linked in: sfc ptp pps_core mdio i2c_algo_bit mtd
>>>> bonding(O) snd_hda_codec_generic joydev crct10dif_pclmul crc32_pclmul
>>>> i2c_piix4 ppdev crc32c_intel ghash_clmulni_intel parport_pc snd_hda_intel
>>>> snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer 9pnet_virtio
>>>> snd 9pnet pcspkr parport i2ccore serio_raw virtio_console virtio_balloon
>>>> pvpanic soundcore virtio_blk virtio_net ata_generic floppy pata_acpi
>>>> virtio_pci virtio_ring virtio
>>>> [  428.935022] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O
>>>> 3.17.0-rc4+ #30
>>>> [  428.935022] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>>> [  428.935022] task: ffffffff81c1b460 ti: ffffffff81c00000 task.ti:
>>>> ffffffff81c00000
>>>> [  428.935022] RIP: 0010:[<ffffffff810aa37b>]  [<ffffffff810aa37b>]
>>>> __queue_work+0x7b/0x350
>>>> [  428.935022] RSP: 0018:ffff88005f003e28  EFLAGS: 00010086
>>>> [  428.935022] RAX: ffff88005c05c800 RBX: 0000000000000000 RCX:
>>>> 0000000000000000
>>>> [  428.935022] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
>>>> ffff88005a4fbd58
>>>> [  428.935022] RBP: ffff88005f003e60 R08: 0000000000000046 R09:
>>>> ffffffff8225abc2
>>>> [  428.935022] R10: 0000000000000004 R11: 0000000000000005 R12:
>>>> ffff88005a4fbd58
>>>> [  428.935022] R13: 0000000000000008 R14: ffff88004b211800 R15:
>>>> 00000000000102f0
>>>> [  428.935022] FS:  0000000000000000(0000) GS:ffff88005f000000(0000)
>>>> knlGS:0000000000000000
>>>> [  428.935022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  428.935022] CR2: 0000000000000000 CR3: 000000004abde000 CR4:
>>>> 00000000000406f0
>>>> [  428.935022] Stack:
>>>> [  428.935022]  0a19522f72b12222 0000000081c1b460 ffffffff8225abc0
>>>> ffff88005a4fbd78
>>>> [  428.935022]  0000000000000101 ffffffff810aa650 ffff88005a4fbd58
>>>> ffff88005f003e70
>>>> [  428.935022]  ffffffff810aa668 ffff88005f003ea8 ffffffff810f3536
>>>> ffffffff8225abc0
>>>> [  428.935022] Call Trace:
>>>> [  428.935022]  <IRQ>
>>>> [  428.935022]
>>>> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
>>>> [  428.935022]  [<ffffffff810aa668>] delayed_work_timer_fn+0x18/0x20
>>>> [  428.935022]  [<ffffffff810f3536>] call_timer_fn+0x36/0x120
>>>> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
>>>> [  428.935022]  [<ffffffff810f38f5>] run_timer_softirq+0x1a5/0x320
>>>> [  428.935022]  [<ffffffff81096dc5>] __do_softirq+0xf5/0x2b0
>>>> [  428.935022]  [<ffffffff810971fd>] irq_exit+0xbd/0xd0
>>>> [  428.935022]  [<ffffffff8173b715>] smp_apic_timer_interrupt+0x45/0x60
>>>> [  428.935022]  [<ffffffff8173981d>] apic_timer_interrupt+0x6d/0x80
>>>> [  428.935022]  <EOI>
>>>> [  428.935022]
>>>> [  428.935022]  [<ffffffff810581c6>] ? native_safe_halt+0x6/0x10
>>>> [  428.935022]  [<ffffffff8101f36f>] default_idle+0x1f/0xe0
>>>> [  428.935022]  [<ffffffff8101fd8f>] arch_cpu_idle+0xf/0x20
>>>> [  428.935022]  [<ffffffff810d25dd>] cpu_startup_entry+0x38d/0x3c0
>>>> [  428.935022]  [<ffffffff81722927>] rest_init+0x87/0x90
>>>> [  428.935022]  [<ffffffff81d3510e>] start_kernel+0x482/0x4a3
>>>> [  428.935022]  [<ffffffff81d34a85>] ? set_init_arg+0x53/0x53
>>>> [  428.935022]  [<ffffffff81d34120>] ? early_idt_handlers+0x120/0x120
>>>> [  428.935022]  [<ffffffff81d345ee>] x86_64_start_reservations+0x2a/0x2c
>>>> [  428.935022]  [<ffffffff81d3473d>] x86_64_start_kernel+0x14d/0x170
>>>> [  428.935022] Code: 84 bb 01 00 00 a8 02 0f 85 eb 00 00 00 48 63 45 d4 49
>>>> 8b 9e 08 01 00 00 48 03 1c c5 60 fa d0 81 4c 89 e7 e8 18 f5 ff ff 48 85 c0
>>>> <48> 8b 3b 0f 84 7c 01 00 00 48 39 c7 0f 84 73 01 00 00 48 89 c7
>>>> [  428.935022] RIP  [<ffffffff810aa37b>] __queue_work+0x7b/0x350
>>>> [  428.935022]  RSP <ffff88005f003e28>
>>>> [  428.935022] CR2: 0000000000000000
>>>>
>>>> This is because it keeps trying to re-schedule even though the interface's
>>>> memory has been freed.
>>>>
>>> Hmm, how do we handle this?
>>>
>> This is tricky and what concerns me more is that people might make this
>> mistake again in the future. It's easy to unknowingly make use of a
>> function that re-schedules this from the wrong place.
>> What I just noticed is that for all 3ad cases you could pull the scheduling
>> in the bond_3ad_state_machine_handler() function.
>> The call sites of ad_agg_selection_logic() are:
>> - 3ad unbind slave (no need to schedule here as __bond_release_one would
>> rebuild the array anyhow)
>> - bond_3ad_state_machine_handler() <- here's where the schedule should
>> happen as this gets stopped first when the bond is closed and can't get
>> restarted unless it's opened again.
>> - ad_port_selection_logic() <- this is called from
>> bond_3ad_state_machine_handler() only, so this case will be handled as well.
>>
>> The other 2 functions that you convert - ad_enable/disable_collecting are
>> used only from ad_mux_machine() which is only called in
>> bond_3ad_state_machine_handler().
>>
>> So basically you can pull all rebuild schedules in their common caller -
>> bond_3ad_state_machine_handler(), just make a flag to note that a rebuild
>> is needed probably something similar to should_notify_rtnl.
>> This way you can remove the scheduling from the various 3ad functions that
>> may get used and will have it only in 1 place which is more easily controlled.
>>
> Well, I was just trying to avoid using flags to pass state from one to
> another function so that we can update the array at one place. This
> might introduce some bug so I was keeping it simple and build it only
> when the condition requires it to build it. However I do not see how
> this will fix the issue that you have seen, or would it? If so how?
>
You don't have to pass state between functions, you just have to collect the 
return values from them in the single caller and see if scheduling an update is 
required in the end. Obviously I haven't tested this fix, but the reasoning 
behind it goes like this:
The usual device destruction goes like: 1. ndo_close() 2. ndo_uninit()... So 
when bond_close() is executed the 3ad workqueue will get stopped first, and then 
the slave_update workqueue will get stopped (note - the order is important, 
since the only place where the slave_update workqueue gets scheduled to run is 
from the 3ad workqueue function). So when we reach bond_uninit() there's no way 
for the 3ad workqueue function to run and we're 100% sure that the slave_update 
workqueue has been canceled as well.
The reason for this is because the 3ad workqueue function is started in 
bond_open() which obviously can't run without rtnl held, which is why the other 
workqueue functions also are stopped in the same manner.
So basically, the idea is that you have only 1 place from which you can schedule 
the slave_update array and we can guarantee that it cannot get called once the 
bond device has been closed (bond_close()). You must not do a slave array update 
schedule in bond_3ad_unbind_slave, but that is okay because the slave array will 
get updated by __bond_release_one() anyhow.

>> Of course, the alternative would be once again - convert
>> bond_3ad_state_machine_handler() to RTNL, but that has its own set of problems.
>>
> It's convoluted, let's keep it simple for now :)
>
>>>> While testing this I spotted another issue as well - Failed to build
>>>> slave_arr message has been printed too many times because you print it in
>>>> 3ad mode when there's no active aggregator (bond_3ad_get_active_agg_info
>>>> check in bond_update_slave_arr) which leads to re-scheduling which also
>>>> lead to a deadlock.
>>>>
>>> I think this can be corrected with pr_ratelimited() call.
>>>
>> IMO it shouldn't print anything if it couldn't rebuild the array due to
>> missing active aggregator as that's not an error condition. It should
>> though probably clean out the slave array because transmission shouldn't be
>> possible without an active aggregator in 3ad.
>>
> Sure missing active aggregator is not an error but free-ing the slave
> array silently would be bad either. At least we would see something in
> the messages about "something" went wrong.
Nothing has went wrong, not having an active aggregator is a normal state that 
can happen and in fact could be the state while the bond device is configured. 
It is not advisable to spit out errors in such case as there has been no error 
condition to begin with. Dealing with failed active aggregator and notifying the 
user of it and so on is the job of the 3ad code, not of the slave_update mechanism.
One more thing you really should make sure that we don't xmit when there's no 
active aggregator, it doesn't make sense otherwise and it is actually the 
current behaviour (check bond_3ad_xmit_xor(), first thing it does is try to 
obtain active aggregator and if it fails - it drops the packet, the error 
condition there has been marked as netdev_dbg() so it can be enabled only per 
request of the user and isn't printed normally).
Moreover it's really a bad idea to reschedule the slave array rebuilding if 
there's no active aggregator because it may be the case we don't have it for a 
long time and it will cause constant rtnl acquire/release cycles.

>>
>>
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
  2014-09-23  8:29               ` Nikolay Aleksandrov
@ 2014-09-24  0:14                 ` Mahesh Bandewar
  0 siblings, 0 replies; 11+ messages in thread
From: Mahesh Bandewar @ 2014-09-24  0:14 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller,
	netdev, Eric Dumazet, Maciej Zenczykowski

On Tue, Sep 23, 2014 at 1:29 AM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> On 23/09/14 07:13, Mahesh Bandewar wrote:
>>
>> On Sun, Sep 21, 2014 at 4:07 AM, Nikolay Aleksandrov <nikolay@redhat.com>
>> wrote:
>>>
>>> On 09/20/2014 10:04 PM, Mahesh Bandewar wrote:
>>>>
>>>> On Sat, Sep 20, 2014 at 3:19 AM, Nikolay Aleksandrov
>>>> <nikolay@redhat.com> wrote:
>>>>>
>>>>> On 09/20/2014 02:09 AM, Mahesh Bandewar wrote:
>>>>>>
>>>>>> On Fri, Sep 19, 2014 at 4:06 AM, Nikolay Aleksandrov
>>>>>> <nikolay@redhat.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 09/19/2014 12:00 PM, Nikolay Aleksandrov wrote:
>>>>>>>>
>>>>>>>> On 09/18/2014 11:53 PM, Mahesh Bandewar wrote:
>>>>>>>>>
>>>>>>>>> Earlier change to use usable slave array for TLB mode had an
>>>>>>>>> additional
>>>>>>>>> performance advantage. So extending the same logic to all other
>>>>>>>>> modes
>>>>>>>>> that use xmit-hash for slave selection (viz 802.3AD, and XOR
>>>>>>>>> modes).
>>>>>>>>> Also consolidating this with the earlier TLB change.
>>>>>>>>>
>>>>>>>>> The main idea is to build the usable slaves array in the control
>>>>>>>>> path
>>>>>>>>> and use that array for slave selection during xmit operation.
>>>>>>>>>
>>>>>>>>> Measured performance in a setup with a bond of 4x1G NICs with 200
>>>>>>>>> instances of netperf for the modes involved (3ad, xor, tlb)
>>>>>>>>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>>>>>>>>
>>>>>>>>> Mode        TPS-Before   TPS-After
>>>>>>>>>
>>>>>>>>> 802.3ad   : 468,694      493,101
>>>>>>>>> TLB (lb=0): 392,583      392,965
>>>>>>>>> XOR       : 475,696      484,517
>>>>>>>>>
>>>>>>>>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>>>>>>>>> ---
>>>>>>>>> v1:
>>>>>>>>>    (a) If bond_update_slave_arr() fails to allocate memory, it will
>>>>>>>>> overwrite
>>>>>>>>>        the slave that need to be removed.
>>>>>>>>>    (b) Freeing of array will assign NULL (to handle bond->down to
>>>>>>>>> bond->up
>>>>>>>>>        transition gracefully.
>>>>>>>>>    (c) Change from pr_debug() to pr_err() if
>>>>>>>>> bond_update_slave_arr() returns
>>>>>>>>>        failure.
>>>>>>>>>    (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon
>>>>>>>>> cases and
>>>>>>>>>        will populate the array even if these parameters are not
>>>>>>>>> used.
>>>>>>>>>    (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>>>>>>>>> v2:
>>>>>>>>>    (a) Removed rcu_read_{un}lock() calls from array manipulation
>>>>>>>>> code.
>>>>>>>>>    (b) Slave link-events now refresh array for all these modes.
>>>>>>>>>    (c) Moved free-array call from bond_close() to bond_uninit().
>>>>>>>>> v3:
>>>>>>>>>    (a) Fixed null pointer dereference.
>>>>>>>>>    (b) Removed bond->lock lockdep dependency.
>>>>>>>>> v4:
>>>>>>>>>    (a) Made to changes to comply with Nikolay's locking changes
>>>>>>>>>    (b) Added a work-queue to refresh slave-array when RTNL is not
>>>>>>>>> held
>>>>>>>>>    (c) Array refresh happens ONLY with RTNL now.
>>>>>>>>>    (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>>>>>>>>>
>>>>>>> <<<snip>>>
>>>>>>>>>
>>>>>>>>> @@ -3839,6 +4003,7 @@ static void bond_uninit(struct net_device
>>>>>>>>> *bond_dev)
>>>>>>>>>       struct bonding *bond = netdev_priv(bond_dev);
>>>>>>>>>       struct list_head *iter;
>>>>>>>>>       struct slave *slave;
>>>>>>>>> +    struct bond_up_slave *arr;
>>>>>>>>>
>>>>>>>>>       bond_netpoll_cleanup(bond_dev);
>>>>>>>>>
>>>>>>>>> @@ -3847,6 +4012,12 @@ static void bond_uninit(struct net_device
>>>>>>>>> *bond_dev)
>>>>>>>>>               __bond_release_one(bond_dev, slave->dev, true);
>>>>>>>>>       netdev_info(bond_dev, "Released all slaves\n");
>>>>>>>>>
>>>>>>> Sorry but I just spotted a major problem, bond_3ad_unbind_slave()
>>>>>>> (called
>>>>>>> from __bond_release_one) calls ad_agg_selection_logic() which can
>>>>>>> re-arm
>>>>>>> the slave_arr work after it's supposed to be stopped here (i.e. the
>>>>>>> bond
>>>>>>> device has been closed so all works should've been stopped) so we
>>>>>>> might
>>>>>>> leak memory and access freed memory after all since it'll keep
>>>>>>> re-scheduling itself until it can acquire rtnl which is after the
>>>>>>> bond
>>>>>>> device has been destroyed.
>>>>>>>
>>>>>> This should not be a problem. ndo_close (bond_close()) is called
>>>>>> before ndo_uninit(bond_uninit()), so the work-queues get cancelled
>>>>>> there so if rearm tries to schedule some work after queue gets
>>>>>> cancelled, it can't do much and wont harm anything.
>>>>>> Hence there wont be any arrays built once it's free-ed completely and
>>>>>> therefore no memory leak. I addded some instrumentation and tried
>>>>>> following sequence -
>>>>>>
>>>>>> # modprobe bonding mode=4
>>>>>> # ip link set bond0 up
>>>>>> # [Add ip]
>>>>>> # [Add default route]
>>>>>> # ifenslave bond0 eth0 eth1 eth2 eth3
>>>>>> ....
>>>>>> [Run some backgound traffic. I used netperf.]
>>>>>>
>>>>>> # ip link bond0 down
>>>>>>
>>>>>> I did not see anything "bad" happening. Did your trial produced
>>>>>> something unpleasant?
>>>>>>
>>>>> The test you've done is irrelevant to the situation that I described
>>>>> because ndo_uninit() is called when the device is being destroyed.
>>>>> Moreover
>>>>> the case I told you about would require to have an active aggregator
>>>>> and an
>>>>> inactive one (i.e. so agg selection logic will get called), here is the
>>>>> result:
>>>>> [  428.916586] bond1 (unregistering): Removing an active aggregator
>>>>> [  428.916589] Failed to build slave-array.
>>>>> [  428.916849] bond1 (unregistering): Releasing active interface eth1
>>>>> [  428.920342] bond1 (unregistering): Released all slaves
>>>>> [  428.923043] Failed to update slave array from WT
>>>>> [  428.924098] Failed to update slave array from WT
>>>>> [  428.925125] Failed to update slave array from WT
>>>>> [  428.926120] Failed to update slave array from WT
>>>>> [  428.927096] Failed to update slave array from WT
>>>>> [  428.928101] Failed to update slave array from WT
>>>>> [  428.929120] Failed to update slave array from WT
>>>>> [  428.930086] BUG: unable to handle kernel NULL pointer dereference at
>>>>>         (null)
>>>>> [  428.930644] IP: [<ffffffff810aa37b>] __queue_work+0x7b/0x350
>>>>> [  428.930946] PGD 0
>>>>> [  428.931053] Oops: 0000 [#1] SMP
>>>>> [  428.931053] Modules linked in: sfc ptp pps_core mdio i2c_algo_bit
>>>>> mtd
>>>>> bonding(O) snd_hda_codec_generic joydev crct10dif_pclmul crc32_pclmul
>>>>> i2c_piix4 ppdev crc32c_intel ghash_clmulni_intel parport_pc
>>>>> snd_hda_intel
>>>>> snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer
>>>>> 9pnet_virtio
>>>>> snd 9pnet pcspkr parport i2ccore serio_raw virtio_console
>>>>> virtio_balloon
>>>>> pvpanic soundcore virtio_blk virtio_net ata_generic floppy pata_acpi
>>>>> virtio_pci virtio_ring virtio
>>>>> [  428.935022] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O
>>>>> 3.17.0-rc4+ #30
>>>>> [  428.935022] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>>>> [  428.935022] task: ffffffff81c1b460 ti: ffffffff81c00000 task.ti:
>>>>> ffffffff81c00000
>>>>> [  428.935022] RIP: 0010:[<ffffffff810aa37b>]  [<ffffffff810aa37b>]
>>>>> __queue_work+0x7b/0x350
>>>>> [  428.935022] RSP: 0018:ffff88005f003e28  EFLAGS: 00010086
>>>>> [  428.935022] RAX: ffff88005c05c800 RBX: 0000000000000000 RCX:
>>>>> 0000000000000000
>>>>> [  428.935022] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
>>>>> ffff88005a4fbd58
>>>>> [  428.935022] RBP: ffff88005f003e60 R08: 0000000000000046 R09:
>>>>> ffffffff8225abc2
>>>>> [  428.935022] R10: 0000000000000004 R11: 0000000000000005 R12:
>>>>> ffff88005a4fbd58
>>>>> [  428.935022] R13: 0000000000000008 R14: ffff88004b211800 R15:
>>>>> 00000000000102f0
>>>>> [  428.935022] FS:  0000000000000000(0000) GS:ffff88005f000000(0000)
>>>>> knlGS:0000000000000000
>>>>> [  428.935022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  428.935022] CR2: 0000000000000000 CR3: 000000004abde000 CR4:
>>>>> 00000000000406f0
>>>>> [  428.935022] Stack:
>>>>> [  428.935022]  0a19522f72b12222 0000000081c1b460 ffffffff8225abc0
>>>>> ffff88005a4fbd78
>>>>> [  428.935022]  0000000000000101 ffffffff810aa650 ffff88005a4fbd58
>>>>> ffff88005f003e70
>>>>> [  428.935022]  ffffffff810aa668 ffff88005f003ea8 ffffffff810f3536
>>>>> ffffffff8225abc0
>>>>> [  428.935022] Call Trace:
>>>>> [  428.935022]  <IRQ>
>>>>> [  428.935022]
>>>>> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
>>>>> [  428.935022]  [<ffffffff810aa668>] delayed_work_timer_fn+0x18/0x20
>>>>> [  428.935022]  [<ffffffff810f3536>] call_timer_fn+0x36/0x120
>>>>> [  428.935022]  [<ffffffff810aa650>] ? __queue_work+0x350/0x350
>>>>> [  428.935022]  [<ffffffff810f38f5>] run_timer_softirq+0x1a5/0x320
>>>>> [  428.935022]  [<ffffffff81096dc5>] __do_softirq+0xf5/0x2b0
>>>>> [  428.935022]  [<ffffffff810971fd>] irq_exit+0xbd/0xd0
>>>>> [  428.935022]  [<ffffffff8173b715>] smp_apic_timer_interrupt+0x45/0x60
>>>>> [  428.935022]  [<ffffffff8173981d>] apic_timer_interrupt+0x6d/0x80
>>>>> [  428.935022]  <EOI>
>>>>> [  428.935022]
>>>>> [  428.935022]  [<ffffffff810581c6>] ? native_safe_halt+0x6/0x10
>>>>> [  428.935022]  [<ffffffff8101f36f>] default_idle+0x1f/0xe0
>>>>> [  428.935022]  [<ffffffff8101fd8f>] arch_cpu_idle+0xf/0x20
>>>>> [  428.935022]  [<ffffffff810d25dd>] cpu_startup_entry+0x38d/0x3c0
>>>>> [  428.935022]  [<ffffffff81722927>] rest_init+0x87/0x90
>>>>> [  428.935022]  [<ffffffff81d3510e>] start_kernel+0x482/0x4a3
>>>>> [  428.935022]  [<ffffffff81d34a85>] ? set_init_arg+0x53/0x53
>>>>> [  428.935022]  [<ffffffff81d34120>] ? early_idt_handlers+0x120/0x120
>>>>> [  428.935022]  [<ffffffff81d345ee>]
>>>>> x86_64_start_reservations+0x2a/0x2c
>>>>> [  428.935022]  [<ffffffff81d3473d>] x86_64_start_kernel+0x14d/0x170
>>>>> [  428.935022] Code: 84 bb 01 00 00 a8 02 0f 85 eb 00 00 00 48 63 45 d4
>>>>> 49
>>>>> 8b 9e 08 01 00 00 48 03 1c c5 60 fa d0 81 4c 89 e7 e8 18 f5 ff ff 48 85
>>>>> c0
>>>>> <48> 8b 3b 0f 84 7c 01 00 00 48 39 c7 0f 84 73 01 00 00 48 89 c7
>>>>> [  428.935022] RIP  [<ffffffff810aa37b>] __queue_work+0x7b/0x350
>>>>> [  428.935022]  RSP <ffff88005f003e28>
>>>>> [  428.935022] CR2: 0000000000000000
>>>>>
>>>>> This is because it keeps trying to re-schedule even though the
>>>>> interface's
>>>>> memory has been freed.
>>>>>
>>>> Hmm, how do we handle this?
>>>>
>>> This is tricky and what concerns me more is that people might make this
>>> mistake again in the future. It's easy to unknowingly make use of a
>>> function that re-schedules this from the wrong place.
>>> What I just noticed is that for all 3ad cases you could pull the
>>> scheduling
>>> in the bond_3ad_state_machine_handler() function.
>>> The call sites of ad_agg_selection_logic() are:
>>> - 3ad unbind slave (no need to schedule here as __bond_release_one would
>>> rebuild the array anyhow)
>>> - bond_3ad_state_machine_handler() <- here's where the schedule should
>>> happen as this gets stopped first when the bond is closed and can't get
>>> restarted unless it's opened again.
>>> - ad_port_selection_logic() <- this is called from
>>> bond_3ad_state_machine_handler() only, so this case will be handled as
>>> well.
>>>
>>> The other 2 functions that you convert - ad_enable/disable_collecting are
>>> used only from ad_mux_machine() which is only called in
>>> bond_3ad_state_machine_handler().
>>>
>>> So basically you can pull all rebuild schedules in their common caller -
>>> bond_3ad_state_machine_handler(), just make a flag to note that a rebuild
>>> is needed probably something similar to should_notify_rtnl.
>>> This way you can remove the scheduling from the various 3ad functions
>>> that
>>> may get used and will have it only in 1 place which is more easily
>>> controlled.
>>>
>> Well, I was just trying to avoid using flags to pass state from one to
>> another function so that we can update the array at one place. This
>> might introduce some bug so I was keeping it simple and build it only
>> when the condition requires it to build it. However I do not see how
>> this will fix the issue that you have seen, or would it? If so how?
>>
> You don't have to pass state between functions, you just have to collect the
> return values from them in the single caller and see if scheduling an update
> is required in the end. Obviously I haven't tested this fix, but the
> reasoning behind it goes like this:
> The usual device destruction goes like: 1. ndo_close() 2. ndo_uninit()... So
> when bond_close() is executed the 3ad workqueue will get stopped first, and
> then the slave_update workqueue will get stopped (note - the order is
> important, since the only place where the slave_update workqueue gets
> scheduled to run is from the 3ad workqueue function). So when we reach
> bond_uninit() there's no way for the 3ad workqueue function to run and we're
> 100% sure that the slave_update workqueue has been canceled as well.
> The reason for this is because the 3ad workqueue function is started in
> bond_open() which obviously can't run without rtnl held, which is why the
> other workqueue functions also are stopped in the same manner.
> So basically, the idea is that you have only 1 place from which you can
> schedule the slave_update array and we can guarantee that it cannot get
> called once the bond device has been closed (bond_close()). You must not do
> a slave array update schedule in bond_3ad_unbind_slave, but that is okay
> because the slave array will get updated by __bond_release_one() anyhow.
>
I try doing this in 3ad_state_machine().

>>> Of course, the alternative would be once again - convert
>>> bond_3ad_state_machine_handler() to RTNL, but that has its own set of
>>> problems.
>>>
>> It's convoluted, let's keep it simple for now :)
>>
>>>>> While testing this I spotted another issue as well - Failed to build
>>>>> slave_arr message has been printed too many times because you print it
>>>>> in
>>>>> 3ad mode when there's no active aggregator
>>>>> (bond_3ad_get_active_agg_info
>>>>> check in bond_update_slave_arr) which leads to re-scheduling which also
>>>>> lead to a deadlock.
>>>>>
>>>> I think this can be corrected with pr_ratelimited() call.
>>>>
>>> IMO it shouldn't print anything if it couldn't rebuild the array due to
>>> missing active aggregator as that's not an error condition. It should
>>> though probably clean out the slave array because transmission shouldn't
>>> be
>>> possible without an active aggregator in 3ad.
>>>
>> Sure missing active aggregator is not an error but free-ing the slave
>> array silently would be bad either. At least we would see something in
>> the messages about "something" went wrong.
>
> Nothing has went wrong, not having an active aggregator is a normal state
> that can happen and in fact could be the state while the bond device is
> configured. It is not advisable to spit out errors in such case as there has
> been no error condition to begin with. Dealing with failed active aggregator
> and notifying the user of it and so on is the job of the 3ad code, not of
> the slave_update mechanism.
> One more thing you really should make sure that we don't xmit when there's
> no active aggregator, it doesn't make sense otherwise and it is actually the
> current behaviour (check bond_3ad_xmit_xor(), first thing it does is try to
> obtain active aggregator and if it fails - it drops the packet, the error
> condition there has been marked as netdev_dbg() so it can be enabled only
> per request of the user and isn't printed normally).
> Moreover it's really a bad idea to reschedule the slave array rebuilding if
> there's no active aggregator because it may be the case we don't have it for
> a long time and it will cause constant rtnl acquire/release cycles.
>
alright. I'll update the code to make sure that active agg not being
present does not trigger error. However malloc failure is still an
error.

>>>
>>>
>>>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-09-24  0:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-18 21:53 [PATCH net-next v4 2/2] bonding: Simplify the xmit function for modes that use xmit_hash Mahesh Bandewar
2014-09-19 10:00 ` Nikolay Aleksandrov
2014-09-19 10:08   ` Nikolay Aleksandrov
2014-09-19 11:06   ` Nikolay Aleksandrov
2014-09-20  0:09     ` Mahesh Bandewar
2014-09-20 10:19       ` Nikolay Aleksandrov
2014-09-20 20:04         ` Mahesh Bandewar
2014-09-21 11:07           ` Nikolay Aleksandrov
2014-09-23  5:13             ` Mahesh Bandewar
2014-09-23  8:29               ` Nikolay Aleksandrov
2014-09-24  0:14                 ` Mahesh Bandewar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).