* [Intel-wired-lan] intermittent ixgbe transmit queue timeouts in v5.18 kernels
From: Jeff Layton @ 2022-06-02 21:37 UTC
  To: intel-wired-lan, anthony.l.nguyen, jesse.brandeburg
  Cc: Ilya Dryomov, Xiubo Li, Venky Shankar

The Ceph project test lab has a fairly large cluster of machines with
ixgbe adapters:

    03:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

Recently, we've started getting intermittent tx queue timeouts with
these machines. One of them is reported here:

    https://tracker.ceph.com/issues/55823

Usually this happens when we're trying to do a sync, and there is a
flurry of transmission activity. Afterward we see a lot of fallout in
ceph culminating in softlockups.
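
For anyone trying to see which interfaces and queues are affected across the
lab, a minimal sketch follows. It assumes the watchdog warning shows up in the
kernel log as "NETDEV WATCHDOG: <iface> (<driver>): transmit queue <N> timed
out", which is the form I believe kernels of this era use. The function name
and the sample log lines are hypothetical, not taken from the tracker issue.

```python
import re
from collections import Counter

# Assumed message format (v5.18-era kernels):
#   NETDEV WATCHDOG: <iface> (<driver>): transmit queue <N> timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def count_tx_timeouts(log_text, driver="ixgbe"):
    """Count tx queue timeout warnings per (interface, queue) for one driver."""
    counts = Counter()
    for line in log_text.splitlines():
        m = WATCHDOG_RE.search(line)
        if m and m.group("driver") == driver:
            counts[(m.group("iface"), int(m.group("queue")))] += 1
    return counts


# Hypothetical log lines gathered from several machines, for illustration only.
SAMPLE_LOG = """\
hostA kernel: NETDEV WATCHDOG: enp3s0f0 (ixgbe): transmit queue 7 timed out
hostB kernel: NETDEV WATCHDOG: enp3s0f0 (ixgbe): transmit queue 7 timed out
hostC kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
"""

if __name__ == "__main__":
    for (iface, queue), n in sorted(count_tx_timeouts(SAMPLE_LOG).items()):
        print(f"{iface} queue {queue}: {n} timeout(s)")
```

As far as I know the kernel emits this warning only once per boot, so the
counts are most meaningful over logs collected from several machines or boots.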

The kernels we're testing have some patches that are not yet in
mainline, but mostly they are confined to net/ceph and fs/ceph, and
shouldn't really affect hw drivers.

The problem manifested pretty regularly during v5.18 and then I didn't
see it for a while. I had figured it was something that had been fixed,
but I think it was just "luck".

I attempted a bisect a while back, and ruled out recent ceph changes as
the issue. Unfortunately, I wasn't able to get to a conclusive patch
that broke it, but I think it likely crept in during the initial merge
window for v5.18 (pre-rc1).

One other oddity: the test lab often installs bleeding-edge kernels on
old distros (RHEL8 and Ubuntu from a similar era). Is it possible that
the firmware that ships with these older distros is not suitable for
the more recent driver in v5.18?
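
One way to compare firmware across machines would be to collect the
"firmware-version" field that "ethtool -i <iface>" reports. A minimal sketch,
assuming the usual "key: value" layout of that output; the helper name and the
sample values are hypothetical placeholders, not readings from the lab.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i` style 'key: value' lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI addresses
        # (which contain colons themselves) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Hypothetical sample output, for illustration only.
SAMPLE = """\
driver: ixgbe
version: 5.18.0
firmware-version: 0x00000000
bus-info: 0000:03:00.0
"""

if __name__ == "__main__":
    info = parse_ethtool_info(SAMPLE)
    print(info["driver"], info["firmware-version"], info["bus-info"])
```

On a real machine the text could come from running "ethtool -i <iface>" and
capturing its standard output.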

Any thoughts or suggestions on things we can do to fix this?

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

* Re: [Intel-wired-lan] intermittent ixgbe transmit queue timeouts in v5.18 kernels
From: Switzer, David @ 2022-06-07 21:22 UTC
  To: Jeff Layton, intel-wired-lan, Nguyen, Anthony L, Brandeburg, Jesse
  Cc: Ilya Dryomov, Xiubo Li, Venky Shankar

>-----Original Message-----
>From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of
>Jeff Layton
>Sent: Thursday, June 2, 2022 2:38 PM
>To: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L
><anthony.l.nguyen@intel.com>; Brandeburg, Jesse
><jesse.brandeburg@intel.com>
>Cc: Ilya Dryomov <idryomov@gmail.com>; Xiubo Li <xiubli@redhat.com>;
>Venky Shankar <vshankar@redhat.com>
>Subject: [Intel-wired-lan] intermittent ixgbe transmit queue timeouts in v5.18
>kernels
>
>The Ceph project test lab has a fairly large cluster of machines with ixgbe
>adapters:
>
>    03:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
>Network Connection (rev 01)
>
We are attempting to reproduce your issue, and the output from
lspci -s 03:00.0 -vv would help us make sure we're looking at the exact
adapter that the issue is being seen on.

>Recently, we've started getting intermittent tx queue timeouts with these
>machines. One of them is reported here:
>
>    https://tracker.ceph.com/issues/55823
>
>Usually this happens when we're trying to do a sync, and there is a flurry of
>transmission activity. Afterward we see a lot of fallout in ceph culminating in
>softlockups.
>
>The kernels we're testing have some patches that are not yet in mainline, but
>mostly they are confined to net/ceph and fs/ceph, and shouldn't really affect
>hw drivers.
>
>The problem manifested pretty regularly during v5.18 and then I didn't see it
>for a while. I had figured it was something that had been fixed, but I think it
>was just "luck".
>
>I attempted a bisect a while back, and ruled out recent ceph changes as the
>issue. Unfortunately, I wasn't able to get to a conclusive patch that broke it,
>but I think it likely crept in during the initial merge window for v5.18 (pre-rc1).
>
>One other oddity: the test lab often installs bleeding-edge kernels on old
>distros (RHEL8 and Ubuntu from similar era). Is it possible that the firmware
>that ships with these older distros is not suitable for the more recent driver in
>v5.18 ?
>
Thank you for this information, we'll look into it if we're having trouble
reproducing the issue!


>Any thoughts or suggestions on things we can do to fix this?
>
Nothing yet, but we'll be sure to let you know when we find it.

Have a great day!
Dave Switzer <david.switzer@intel.com>

>Thanks,
>--
>Jeff Layton <jlayton@kernel.org>

* Re: [Intel-wired-lan] intermittent ixgbe transmit queue timeouts in v5.18 kernels
From: Jeff Layton @ 2022-06-08 12:44 UTC
  To: Switzer, David, intel-wired-lan, Nguyen, Anthony L, Brandeburg, Jesse
  Cc: Ilya Dryomov, Xiubo Li, Venky Shankar

On Tue, 2022-06-07 at 21:22 +0000, Switzer, David wrote:
> > -----Original Message-----
> > From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> > Of
> > Jeff Layton
> > Sent: Thursday, June 2, 2022 2:38 PM
> > To: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L
> > <anthony.l.nguyen@intel.com>; Brandeburg, Jesse
> > <jesse.brandeburg@intel.com>
> > Cc: Ilya Dryomov <idryomov@gmail.com>; Xiubo Li <xiubli@redhat.com>;
> > Venky Shankar <vshankar@redhat.com>
> > Subject: [Intel-wired-lan] intermittent ixgbe transmit queue
> > timeouts in v5.18
> > kernels
> > 
> > The Ceph project test lab has a fairly large cluster of machines
> > with ixgbe
> > adapters:
> > 
> >    03:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> > SFI/SFP+
> > Network Connection (rev 01)
> > 
> We are attempting to reproduce your issue, and the output from
> lspci -s 03:00.0 -vv would help us make sure we're looking at the
> exact adapter that the issue is being seen on.
> 
> > Recently, we've started getting intermittent tx queue timeouts with
> > these
> > machines. One of them is reported here:
> > 
> >    https://tracker.ceph.com/issues/55823
> > 
> > Usually this happens when we're trying to do a sync, and there is a
> > flurry of
> > transmission activity. Afterward we see a lot of fallout in ceph
> > culminating in
> > softlockups.
> > 
> > The kernels we're testing have some patches that are not yet in
> > mainline, but
> > mostly they are confined to net/ceph and fs/ceph, and shouldn't
> > really affect
> > hw drivers.
> > 
> > The problem manifested pretty regularly during v5.18 and then I
> > didn't see it
> > for a while. I had figured it was something that had been fixed, but
> > I think it
> > was just "luck".
> > 
> > I attempted a bisect a while back, and ruled out recent ceph changes
> > as the
> > issue. Unfortunately, I wasn't able to get to a conclusive patch
> > that broke it,
> > but I think it likely crept in during the initial merge window for
> > v5.18 (pre-rc1).
> > 
> > One other oddity: the test lab often installs bleeding-edge kernels
> > on old
> > distros (RHEL8 and Ubuntu from similar era). Is it possible that the
> > firmware
> > that ships with these older distros is not suitable for the more
> > recent driver in
> > v5.18 ?
> > 
> Thank you for this information, we'll look into it if we're having
> trouble
> reproducing the issue!
> 
> 
> > Any thoughts or suggestions on things we can do to fix this?
> > 
> Nothing yet, but we'll be sure to let you know when we find it.
> 

Thanks for getting back to us.

Since I emailed you, I've found a bug in ceph that could make the cephfs
client spin in an (essentially) infinite loop if there were delays
getting MDS replies in some situations. We've fixed that and I haven't
seen any tx queue timeouts since, though I've only had the fix in place
for a day or so.

For now, I think we can just consider this to be fallout from the ceph
bug. If the problems return though, I'll let you know!

Thanks again!
-- 
Jeff Layton <jlayton@kernel.org>