From: "Jacob S. Moroni" <mail@jakemoroni.com>
Subject: Re: DPAA TX Issues
Date: Sun, 08 Apr 2018 23:20:55 -0400
Message-ID: <1523244055.3920989.1331005248.4F27E86C@webmail.messagingengine.com>
In-Reply-To: <1523231216.3843066.1330879848.1F0E863A@webmail.messagingengine.com>
To: madalin.bucur@nxp.com
Cc: netdev@vger.kernel.org

On Sun, Apr 8, 2018, at 7:46 PM, Jacob S. Moroni wrote:
> Hello Madalin,
>
> I've been experiencing some issues with the DPAA Ethernet driver,
> specifically related to frame transmission. Hopefully you can point
> me in the right direction.
>
> TLDR: Attempting to transmit faster than a few frames per second causes
> the TX FQ CGR to enter the congested state and remain there forever,
> even after transmission stops.
>
> The hardware is a T2080RDB, running from the tip of net-next, using
> the standard t2080rdb device tree and corenet64_smp_defconfig kernel
> config. No changes were made to any of the files. The issue occurs
> with 4.16.1 stable as well. In fact, the only time I've been able
> to achieve reliable frame transmission was with the SDK 4.1 kernel.
>
> For my tests, I'm running iperf3 both with and without the -R
> option (send/receive). When using a USB Ethernet adapter, there
> are no issues.
>
> The issue is that it seems like the TX frame queues are getting
> "stuck" when attempting to transmit at rates greater than a few frames
> per second. Ping works fine, but it seems like anything that could
> potentially cause multiple TX frames to be enqueued causes issues.
>
> If I run iperf3 in reverse mode (with the T2080RDB receiving), then
> I can achieve ~940 Mbps, but this is also somewhat unreliable.
>
> If I run it with the T2080RDB transmitting, the test will never
> complete. Sometimes it starts transmitting for a few seconds then stops,
> and other times it never even starts. This also seems to force the
> interface into a bad state.
>
> The ethtool stats show that the interface has entered
> congestion a few times, and that it's currently congested. The fact
> that it's currently congested even after stopping transmission
> indicates that the FQ somehow stopped being drained. I've also
> noticed that whenever this issue occurs, the TX confirmation
> counters are always less than the TX packet counters.
>
> When it gets into this state, I can see that the memory usage is
> climbing, up until about the point where the CGR threshold
> is (about 100 MB).
>
> Any idea what could prevent the TX FQ from being drained? My first
> guess was flow control, but it's completely disabled.
>
> I tried messing with the egress congestion threshold, workqueue
> assignments, etc., but nothing seemed to have any effect.
>
> If you need any more information or want me to run any tests,
> please let me know.
>
> Thanks,
> --
> Jacob S. Moroni
> mail@jakemoroni.com

It turns out that irqbalance was causing all of the issues. After disabling it and rebooting, the interfaces worked perfectly. Perhaps there's an issue with how the qman/bman portals are defined as per-cpu variables.
During the portal's probe, the CPUs are assigned one by one, and the corresponding per-cpu portal is then passed to request_irq() as its argument. However, it seems that if the IRQ affinity changes, the ISR could end up being handed a reference to a per-cpu variable belonging to another CPU. At least I know where to look now (a rough sketch of the pattern I mean is below).

- Jake
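
In case it helps anyone hitting the same thing, here is a rough, purely illustrative sketch of the pattern I'm describing. The names (fake_portal, fake_portal_isr, fake_portal_init) are made up and this is not the actual qman/bman portal code; the IRQF_NO_BALANCING flag and affinity hint at the end are just one common way a driver can keep such an interrupt pinned to its CPU, not necessarily what the dpaa driver does or should do.

/*
 * Illustrative sketch only -- not the actual dpaa/qman portal code.
 * It shows the pattern described above: one portal structure per CPU,
 * with the per-cpu pointer handed to request_irq() as dev_id. If
 * irqbalance later migrates the interrupt to a different CPU, the ISR
 * still dereferences the portal of the CPU chosen at probe time, so
 * the two no longer match. All names here are made up.
 */
#include <linux/interrupt.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/cpumask.h>
#include <linux/printk.h>

struct fake_portal {
	int cpu;	/* CPU this portal was bound to at probe time */
	/* ... per-portal state (rings, shadow pointers, etc.) ... */
};

static DEFINE_PER_CPU(struct fake_portal, fake_portals);

static irqreturn_t fake_portal_isr(int irq, void *dev_id)
{
	struct fake_portal *p = dev_id;

	/*
	 * dev_id is the portal of the CPU picked during probe. If the
	 * IRQ affinity has since been changed (e.g. by irqbalance), we
	 * are now running on some other CPU but still servicing the
	 * portal that belongs to p->cpu.
	 */
	if (p->cpu != smp_processor_id())
		pr_warn_ratelimited("portal for CPU %d serviced on CPU %d\n",
				    p->cpu, smp_processor_id());

	/* ... dequeue/confirmation processing would go here ... */
	return IRQ_HANDLED;
}

/* Probe-time setup for one portal, bound to @cpu and wired to @irq. */
static int fake_portal_init(unsigned int cpu, int irq)
{
	struct fake_portal *p = per_cpu_ptr(&fake_portals, cpu);
	int err;

	p->cpu = cpu;

	/*
	 * IRQF_NO_BALANCING keeps the line out of IRQ balancing, and the
	 * affinity hint requests that it stay on the portal's CPU, so the
	 * per-cpu assumption in the ISR holds.
	 */
	err = request_irq(irq, fake_portal_isr, IRQF_NO_BALANCING,
			  "fake-portal", p);
	if (err)
		return err;

	irq_set_affinity_hint(irq, cpumask_of(cpu));
	return 0;
}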