netdev.vger.kernel.org archive mirror
From: Jay Vosburgh <jay.vosburgh@canonical.com>
To: moyufeng <moyufeng@huawei.com>
Cc: Jiri Pirko <jiri@resnulli.us>,
	"lipeng (Y)" <lipeng321@huawei.com>,
	linux-kernel@vger.kernel.org,
	Veaceslav Falico <vfalico@gmail.com>,
	Andy Gospodarek <andy@greyhouse.net>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Thomas Davis <tadavis@lbl.gov>,
	netdev@vger.kernel.org, linuxarm@openeuler.org,
	Salil Mehta <salil.mehta@huawei.com>
Subject: Re: question about bonding mode 4
Date: Fri, 29 Jan 2021 11:11:13 -0800
Message-ID: <10374.1611947473@famine>
In-Reply-To: <52630cba-cc60-a024-8dd0-8319e5245044@huawei.com>

moyufeng <moyufeng@huawei.com> wrote:

>Ping...
>Any comments? Thanks!
>
>On 2021/1/15 10:02, moyufeng wrote:
>> Hi Team,
>> 
>> I have a question about bonding. While testing bonding mode 4
>> scenarios, I found that, with very low probability, a NULL pointer
>> is dereferenced. The following information is displayed:
>> 
>> [99359.795934] bond0: (slave eth13.2001): Port 2 did not find a suitable aggregator
>> [99359.796960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
>> [99359.798127] Mem abort info:
>> [99359.798526]   ESR = 0x96000004
>> [99359.798938]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [99359.799673]   SET = 0, FnV = 0
>> [99359.800106]   EA = 0, S1PTW = 0
>> [99359.800554] Data abort info:
>> [99359.800952]   ISV = 0, ISS = 0x00000004
>> [99359.801522]   CM = 0, WnR = 0
>> [99359.801970] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000c64e6000
>> [99359.802876] [0000000000000020] pgd=0000000000000000
>> [99359.803555] Internal error: Oops: 96000004 [#1] PREEMPT SMP
>> [99359.804369] Modules linked in: bonding hns3(-) hclgevf hnae3 [last unloaded: bonding]
>> [99359.805494] CPU: 1 PID: 951 Comm: kworker/u10:2 Not tainted 5.7.0-rc4+ #1
>> [99359.806455] Hardware name: linux,dummy-virt (DT)
>> [99359.807107] Workqueue: bond0 bond_3ad_state_machine_handler [bonding]
>> [99359.808056] pstate: 60c00005 (nZCv daif +PAN +UAO)
>> [99359.808722] pc : bond_3ad_state_machine_handler+0x7fc/0xdb8 [bonding]
>> [99359.809652] lr : bond_3ad_state_machine_handler+0x7f4/0xdb8 [bonding]
>> [99359.810535] sp : ffff80001882bd20
>> [99359.811012] x29: ffff80001882bd20 x28: ffff000085939a38
>> [99359.811791] x27: ffff00008649bb68 x26: 00000000aaaaaaab
>> [99359.812871] x25: ffff800009401000 x24: ffff800009408de4
>> [99359.814049] x23: ffff80001882bd98 x22: ffff00008649b880
>> [99359.815210] x21: 0000000000000000 x20: ffff000085939a00
>> [99359.816401] x19: ffff00008649b880 x18: ffff800012572988
>> [99359.817637] x17: 0000000000000000 x16: 0000000000000000
>> [99359.818870] x15: ffff80009882b987 x14: 726f746167657267
>> [99359.820090] x13: 676120656c626174 x12: 697573206120646e
>> [99359.821374] x11: 696620746f6e2064 x10: 696420322074726f
>> [99359.822659] x9 : 50203a2931303032 x8 : 0000000000081391
>> [99359.823891] x7 : ffff8000108e3ad0 x6 : ffff8000128858bb
>> [99359.825109] x5 : 0000000000000000 x4 : 0000000000000000
>> [99359.826262] x3 : 00000000ffffffff x2 : 906b329bb5362a00
>> [99359.827394] x1 : 906b329bb5362a00 x0 : 0000000000000000
>> [99359.828540] Call trace:
>> [99359.829071]  bond_3ad_state_machine_handler+0x7fc/0xdb8 [bonding]
>> [99359.830367]  process_one_work+0x15c/0x4a0
>> [99359.831216]  worker_thread+0x50/0x478
>> [99359.832022]  kthread+0x130/0x160
>> [99359.832716]  ret_from_fork+0x10/0x18
>> [99359.833487] Code: 910c0021 95f704bb f9403f80 b5ffe300 (f9401000)
>> [99359.834742] ---[ end trace c7a8e02914afc4e0 ]---
>> [99359.835817] Kernel panic - not syncing: Fatal exception in interrupt
>> [99359.837334] SMP: stopping secondary CPUs
>> [99359.838277] Kernel Offset: disabled
>> [99359.839086] CPU features: 0x080002,22208218
>> [99359.840053] Memory Limit: none
>> [99359.840783] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>> 
>> The test procedure is as follows:
>> 1. Configure bonding and set it to mode 4.
>>     echo "4" > /sys/class/net/bond0/bonding/mode
>>     ifconfig bond0 up
>> 
>> 2. Configure two VLANs and add them to the bonding in step 1.
>>     vconfig add eth0 2001
>>     vconfig add eth1 2001
>>     ifenslave bond0 eth0.2001 eth1.2001
>> 
>> 3. Unload the network device driver and bonding driver.
>>     rmmod hns3
>>     rmmod hclge
>>     rmmod hnae3
>>     rmmod bonding.ko

	Are you running the above in a script, and can you share the
entire thing?

	Does the issue occur with the current net-next?

>> 4. Repeat the preceding steps for a long time.

	When you run this test, what are the network interfaces eth0 and
eth1 connected to, and are those ports configured for VLAN 2001 and
LACP?

>> By checking the logic in ad_port_selection_logic(), I found that
>> if the branch "Port %d did not find a suitable aggregator" is taken,
>> the value of port->aggregator will be NULL, causing the problem.
>> 
>> So I'd like to ask under what circumstances this branch is taken,
>> and what should be done in this case?

	Well, in principle, this shouldn't ever happen.  Every port
structure contains an aggregator structure, so there should always be
one available somewhere.  I'm going to speculate that there's a race
condition somewhere in the teardown processing vs the LACP state machine
that invalidates this presumption.

>> The detailed code analysis is as follows:

[...]

>> 	/* if all aggregator's ports are READY_N == TRUE, set ready=TRUE
>> 	 * in all aggregator's ports, else set ready=FALSE in all
>> 	 * aggregator's ports
>> 	 */
>> 	__set_agg_ports_ready(port->aggregator,
>> 			      __agg_ports_are_ready(port->aggregator));
>> 
>> ----analysis: port->aggregator is still NULL, which causes the problem.
>> 
>> 	aggregator = __get_first_agg(port);
>> 	ad_agg_selection_logic(aggregator, update_slave_arr);
>> 
>> 	if (!port->aggregator->is_active)
>> 		port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;

	Correct, if the "did not find a suitable aggregator" path is
taken, port->aggregator is NULL and bad things happen in the above
block.
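
	To illustrate the failure mode only, here is a minimal userspace
sketch of the kind of guard the tail of that block lacks.  It is not the
bonding driver's code and not a proposed fix: the two structs are
cut-down stand-ins for the kernel's, port_post_selection() is a
hypothetical helper name, and clearing the synchronization bit when no
aggregator is attached is an assumption about what a sane fallback might
be.  The 0x8 value is bit 3 (Synchronization) of the LACP actor state
octet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit 3 (Synchronization) of the LACP actor state octet. */
#define LACP_STATE_SYNCHRONIZATION 0x8u

/* Cut-down stand-ins for illustration; NOT the kernel's definitions. */
struct aggregator {
	bool is_active;
};

struct port {
	struct aggregator *aggregator;	/* NULL if selection found nothing */
	unsigned int actor_oper_port_state;
};

/*
 * Hypothetical helper mirroring the last check in the quoted block,
 * with an early return when no aggregator was attached to the port.
 *
 * Returns true if port->aggregator was non-NULL and could be examined.
 */
static bool port_post_selection(struct port *port)
{
	assert(port != NULL);

	if (!port->aggregator) {
		/*
		 * Assumption: a port without an aggregator cannot be in
		 * sync, so clear the bit instead of dereferencing NULL.
		 */
		port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;
		return false;
	}

	/* Same test as the original code, now known to be safe. */
	if (!port->aggregator->is_active)
		port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;

	return true;
}
```

	A guard like this would only hide the symptom.  It says nothing
about why no aggregator was found in the first place.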

	This is something that needs to be fixed, but I'm also concerned
that there are other issues lurking, so I'd like to be able to reproduce
this.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
