netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: moyufeng <moyufeng@huawei.com>
To: Jiri Pirko <jiri@resnulli.us>, Jay Vosburgh <j.vosburgh@gmail.com>
Cc: "lipeng (Y)" <lipeng321@huawei.com>,
	<linux-kernel@vger.kernel.org>,
	Veaceslav Falico <vfalico@gmail.com>,
	Andy Gospodarek <andy@greyhouse.net>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Thomas Davis <tadavis@lbl.gov>,
	<netdev@vger.kernel.org>, <linuxarm@openeuler.org>,
	Salil Mehta <salil.mehta@huawei.com>
Subject: Re: question about bonding mode 4
Date: Sat, 23 Jan 2021 14:10:21 +0800	[thread overview]
Message-ID: <52630cba-cc60-a024-8dd0-8319e5245044@huawei.com> (raw)
In-Reply-To: <ef692814-fdea-ea59-6b52-93630b5b5570@huawei.com>

Ping...
Any comments? Thanks!

On 2021/1/15 10:02, moyufeng wrote:
> Hi Team,
> 
> I have a question about bonding. During testing bonding mode 4
> scenarios, I find that there is a very low probability that
> the pointer is null. The following information is displayed:
> 
> [99359.795934] bond0: (slave eth13.2001): Port 2 did not find a suitable aggregator
> [99359.796960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
> [99359.798127] Mem abort info:
> [99359.798526]   ESR = 0x96000004
> [99359.798938]   EC = 0x25: DABT (current EL), IL = 32 bits
> [99359.799673]   SET = 0, FnV = 0
> [99359.800106]   EA = 0, S1PTW = 0
> [99359.800554] Data abort info:
> [99359.800952]   ISV = 0, ISS = 0x00000004
> [99359.801522]   CM = 0, WnR = 0
> [99359.801970] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000c64e6000
> [99359.802876] [0000000000000020] pgd=0000000000000000
> [99359.803555] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> [99359.804369] Modules linked in: bonding hns3(-) hclgevf hnae3 [last unloaded: bonding]
> [99359.805494] CPU: 1 PID: 951 Comm: kworker/u10:2 Not tainted 5.7.0-rc4+ #1
> [99359.806455] Hardware name: linux,dummy-virt (DT)
> [99359.807107] Workqueue: bond0 bond_3ad_state_machine_handler [bonding]
> [99359.808056] pstate: 60c00005 (nZCv daif +PAN +UAO)
> [99359.808722] pc : bond_3ad_state_machine_handler+0x7fc/0xdb8 [bonding]
> [99359.809652] lr : bond_3ad_state_machine_handler+0x7f4/0xdb8 [bonding]
> [99359.810535] sp : ffff80001882bd20
> [99359.811012] x29: ffff80001882bd20 x28: ffff000085939a38
> [99359.811791] x27: ffff00008649bb68 x26: 00000000aaaaaaab
> [99359.812871] x25: ffff800009401000 x24: ffff800009408de4
> [99359.814049] x23: ffff80001882bd98 x22: ffff00008649b880
> [99359.815210] x21: 0000000000000000 x20: ffff000085939a00
> [99359.816401] x19: ffff00008649b880 x18: ffff800012572988
> [99359.817637] x17: 0000000000000000 x16: 0000000000000000
> [99359.818870] x15: ffff80009882b987 x14: 726f746167657267
> [99359.820090] x13: 676120656c626174 x12: 697573206120646e
> [99359.821374] x11: 696620746f6e2064 x10: 696420322074726f
> [99359.822659] x9 : 50203a2931303032 x8 : 0000000000081391
> [99359.823891] x7 : ffff8000108e3ad0 x6 : ffff8000128858bb
> [99359.825109] x5 : 0000000000000000 x4 : 0000000000000000
> [99359.826262] x3 : 00000000ffffffff x2 : 906b329bb5362a00
> [99359.827394] x1 : 906b329bb5362a00 x0 : 0000000000000000
> [99359.828540] Call trace:
> [99359.829071]  bond_3ad_state_machine_handler+0x7fc/0xdb8 [bonding]
> [99359.830367]  process_one_work+0x15c/0x4a0
> [99359.831216]  worker_thread+0x50/0x478
> [99359.832022]  kthread+0x130/0x160
> [99359.832716]  ret_from_fork+0x10/0x18
> [99359.833487] Code: 910c0021 95f704bb f9403f80 b5ffe300 (f9401000)
> [99359.834742] ---[ end trace c7a8e02914afc4e0 ]---
> [99359.835817] Kernel panic - not syncing: Fatal exception in interrupt
> [99359.837334] SMP: stopping secondary CPUs
> [99359.838277] Kernel Offset: disabled
> [99359.839086] CPU features: 0x080002,22208218
> [99359.840053] Memory Limit: none
> [99359.840783] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> 
> The test procedure is as follows:
> 1. Configure bonding and set it to mode 4.
>     echo "4" > /sys/class/net/bond0/bonding/mode
>     ifconfig bond0 up
> 
> 2. Configure two VLANs and add them to the bonding in step 1.
>     vconfig add eth0 2001
>     vconfig add eth1 2001
>     ifenslave bond0 eth0.2001 eth1.2001
> 
> 3. Unload the network device driver and bonding driver.
>     rmmod hns3
>     rmmod hclge
>     rmmod hnae3
>     rmmod bonding.ko
> 
> 4. Repeat the preceding steps for a long time.
> 
> By checking the logic in ad_port_selection_logic(), I find that
> if enter the branch "Port %d did not find a suitable aggregator",
> the value of port->aggregator will be NULL, causing the problem.
> 
> So I'd like to ask what circumstances will be involved in this
> branch, and what should be done in this case?
> 
> 
> The detailed code analysis is as follows:
> 
> static void ad_port_selection_logic(struct port *port, bool *update_slave_arr)
> {
> [...]
> 	/* if the port is connected to other aggregator, detach it */
> 	if (port->aggregator) {
> 		/* detach the port from its former aggregator */
> 		temp_aggregator = port->aggregator;
> 		for (curr_port = temp_aggregator->lag_ports; curr_port;
> 		     last_port = curr_port,
> 		     curr_port = curr_port->next_port_in_aggregator) {
> 			if (curr_port == port) {
> 				temp_aggregator->num_of_ports--;
> 				/* if it is the first port attached to the
> 				 * aggregator
> 				 */
> 				if (!last_port) {
> 					temp_aggregator->lag_ports =
> 						port->next_port_in_aggregator;
> 				} else {
> 					/* not the first port attached to the
> 					 * aggregator
> 					 */
> 					last_port->next_port_in_aggregator =
> 						port->next_port_in_aggregator;
> 				}
> 
> 				/* clear the port's relations to this
> 				 * aggregator
> 				 */
> 				port->aggregator = NULL;
> 
> ----analysis: set port->aggregator NULL at the beginning of ad_port_selection_logic().
> 
> 				port->next_port_in_aggregator = NULL;
> 				port->actor_port_aggregator_identifier = 0;
> 
> 				slave_dbg(bond->dev, port->slave->dev, "Port %d left LAG %d\n",
> 					  port->actor_port_number,
> 					  temp_aggregator->aggregator_identifier);
> 				/* if the aggregator is empty, clear its
> 				 * parameters, and set it ready to be attached
> 				 */
> 				if (!temp_aggregator->lag_ports)
> 					ad_clear_agg(temp_aggregator);
> 				break;
> 			}
> 		}
> 		if (!curr_port) {
> 			/* meaning: the port was related to an aggregator
> 			 * but was not on the aggregator port list
> 			 */
> 			net_warn_ratelimited("%s: (slave %s): Warning: Port %d was related to aggregator %d but was not on its port list\n",
> 					     port->slave->bond->dev->name,
> 					     port->slave->dev->name,
> 					     port->actor_port_number,
> 					     port->aggregator->aggregator_identifier);
> 		}
> 	}
> 	/* search on all aggregators for a suitable aggregator for this port */
> 	bond_for_each_slave(bond, slave, iter) {
> 		aggregator = &(SLAVE_AD_INFO(slave)->aggregator);
> 		/* keep a free aggregator for later use(if needed) */
> 		if (!aggregator->lag_ports) {
> 			if (!free_aggregator)
> 				free_aggregator = aggregator;
> 
> ----analysis: Save free_aggregator if found a free aggregator. But in this case, no free aggregator is available.
> 
> 			continue;
> 		}
> 		/* check if current aggregator suits us */
> 		if (((aggregator->actor_oper_aggregator_key == port->actor_oper_port_key) && /* if all parameters match AND */
> 		     MAC_ADDRESS_EQUAL(&(aggregator->partner_system), &(port->partner_oper.system)) &&
> 		     (aggregator->partner_system_priority == port->partner_oper.system_priority) &&
> 		     (aggregator->partner_oper_aggregator_key == port->partner_oper.key)
> 		    ) &&
> 		    ((!MAC_ADDRESS_EQUAL(&(port->partner_oper.system), &(null_mac_addr)) && /* partner answers */
> 		      !aggregator->is_individual)  /* but is not individual OR */
> 		    )
> 		   ) {
> 			/* attach to the founded aggregator */
> 			port->aggregator = aggregator;
> 
> ----analysis: If a proper aggregator is found, port->aggregator is assigned here.
> 
> 			port->actor_port_aggregator_identifier =
> 				port->aggregator->aggregator_identifier;
> 			port->next_port_in_aggregator = aggregator->lag_ports;
> 			port->aggregator->num_of_ports++;
> 			aggregator->lag_ports = port;
> 			slave_dbg(bond->dev, slave->dev, "Port %d joined LAG %d (existing LAG)\n",
> 				  port->actor_port_number,
> 				  port->aggregator->aggregator_identifier);
> 
> 			/* mark this port as selected */
> 			port->sm_vars |= AD_PORT_SELECTED;
> 			found = 1;
> 
> ----analysis: Set found to 1 if port->aggregator is assigned.
> 
> 			break;
> 		}
> 	}
> 	/* the port couldn't find an aggregator - attach it to a new
> 	 * aggregator
> 	 */
> 	if (!found) {
> 		if (free_aggregator) {
> 			/* assign port a new aggregator */
> 			port->aggregator = free_aggregator;
> 
> ----analysis: No proper free_aggregator is found. Therefore, port->aggregator cannot be assigned here.
> 
> 			port->actor_port_aggregator_identifier =
> 				port->aggregator->aggregator_identifier;
> 
> 			/* update the new aggregator's parameters
> 			 * if port was responsed from the end-user
> 			 */
> 			if (port->actor_oper_port_key & AD_DUPLEX_KEY_MASKS)
> 				/* if port is full duplex */
> 				port->aggregator->is_individual = false;
> 			else
> 				port->aggregator->is_individual = true;
> 
> 			port->aggregator->actor_admin_aggregator_key =
> 				port->actor_admin_port_key;
> 			port->aggregator->actor_oper_aggregator_key =
> 				port->actor_oper_port_key;
> 			port->aggregator->partner_system =
> 				port->partner_oper.system;
> 			port->aggregator->partner_system_priority =
> 				port->partner_oper.system_priority;
> 			port->aggregator->partner_oper_aggregator_key = port->partner_oper.key;
> 			port->aggregator->receive_state = 1;
> 			port->aggregator->transmit_state = 1;
> 			port->aggregator->lag_ports = port;
> 			port->aggregator->num_of_ports++;
> 
> 			/* mark this port as selected */
> 			port->sm_vars |= AD_PORT_SELECTED;
> 
> 			slave_dbg(bond->dev, port->slave->dev, "Port %d joined LAG %d (new LAG)\n",
> 				  port->actor_port_number,
> 				  port->aggregator->aggregator_identifier);
> 		} else {
> 			slave_err(bond->dev, port->slave->dev,
> 				  "Port %d did not find a suitable aggregator\n",
> 				  port->actor_port_number);
> 
> ----analysis: The fault scenario goes to this branch, and port->aggregator remains NULL.
> 
> 		}
> 	}
> 	/* if all aggregator's ports are READY_N == TRUE, set ready=TRUE
> 	 * in all aggregator's ports, else set ready=FALSE in all
> 	 * aggregator's ports
> 	 */
> 	__set_agg_ports_ready(port->aggregator,
> 			      __agg_ports_are_ready(port->aggregator));
> 
> ----analysis: port->aggregator is still NULL, which causes problem.
> 
> 
> 	aggregator = __get_first_agg(port);
> 	ad_agg_selection_logic(aggregator, update_slave_arr);
> 
> 	if (!port->aggregator->is_active)
> 		port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;
> }
> 
> 
> Thanks.
> .
> 

  reply	other threads:[~2021-01-23  6:12 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-18 19:30 [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option Jarod Wilson
2020-12-19  0:18 ` Jay Vosburgh
2021-01-08  0:03   ` Jarod Wilson
2021-01-12 21:12     ` Jarod Wilson
2021-01-12 21:39       ` Jay Vosburgh
2021-01-12 22:32         ` Jarod Wilson
2020-12-28 10:11 ` Jiri Pirko
2021-01-07 23:58   ` Jarod Wilson
2021-01-08 13:12     ` Jiri Pirko
2021-01-08 15:21       ` Jarod Wilson
2021-01-15  2:02       ` question about bonding mode 4 moyufeng
2021-01-23  6:10         ` moyufeng [this message]
2021-01-29 19:11           ` Jay Vosburgh
2021-01-30  9:41             ` moyufeng
2021-01-13 22:35 ` [PATCH net-next v2] bonding: add a vlan+mac tx hashing option Jarod Wilson
2021-01-13 23:41   ` [PATCH iproute2] bond: support xmit_hash_policy=vlan+mac Jarod Wilson
2021-01-15 15:12     ` Jarod Wilson
2021-01-15 19:21     ` [PATCH iproute2 v2] bond: support xmit_hash_policy=vlan+srcmac Jarod Wilson
2021-01-23 18:35       ` David Ahern
2021-01-14  1:58   ` [PATCH net-next v2] bonding: add a vlan+mac tx hashing option Jakub Kicinski
2021-01-14 21:11     ` Jarod Wilson
2021-01-14 21:23       ` Jakub Kicinski
2021-01-14 21:42         ` Jarod Wilson
2021-01-14 21:54       ` Jay Vosburgh
2021-01-15 15:08         ` Jarod Wilson
2021-01-15 19:21   ` [PATCH net-next v3] bonding: add a vlan+srcmac " Jarod Wilson
2021-01-18 23:10     ` David Ahern
2021-01-19  1:04       ` Jarod Wilson
2021-01-19  1:09     ` [PATCH net-next v4] " Jarod Wilson
2021-01-20  6:10       ` patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52630cba-cc60-a024-8dd0-8319e5245044@huawei.com \
    --to=moyufeng@huawei.com \
    --cc=andy@greyhouse.net \
    --cc=davem@davemloft.net \
    --cc=j.vosburgh@gmail.com \
    --cc=jiri@resnulli.us \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@openeuler.org \
    --cc=lipeng321@huawei.com \
    --cc=netdev@vger.kernel.org \
    --cc=salil.mehta@huawei.com \
    --cc=tadavis@lbl.gov \
    --cc=vfalico@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).