linux-kernel.vger.kernel.org archive mirror
* Process for severe early stable bugs?
@ 2018-12-08  0:33 Laura Abbott
  2018-12-08  7:32 ` Willy Tarreau
  2018-12-08 11:56 ` Greg KH
  0 siblings, 2 replies; 6+ messages in thread
From: Laura Abbott @ 2018-12-08  0:33 UTC (permalink / raw)
  To: stable, Linux Kernel Mailing List

The latest file system corruption issue (Nominally fixed by
ffe81d45322c ("blk-mq: fix corruption with direct issue") later
fixed by c616cbee97ae ("blk-mq: punt failed direct issue to dispatch
list")) brought a lot of rightfully concerned users asking about
release schedules. 4.18 went EOL on Nov 21 and Fedora rebased to
4.19.3 on Nov 23. When the issue started getting visibility,
users were left with the option of running known EOL 4.18.x
kernels or running a 4.19 series that could corrupt their
data. Admittedly, the risk of running the EOL kernel was pretty
low given how recent it was, but it's still not a great look
to tell people to run something marked EOL.

I'm wondering if there's anything we can do to make things easier
on kernel consumers. Bugs will certainly happen but it really
makes it hard to push the "always run the latest stable" narrative
if there isn't a good fallback when things go seriously wrong. I
don't actually have a great proposal for a solution here other than
retroactively bringing back 4.18 (which I don't think Greg would
like) but I figured I should at least bring it up.

Thanks,
Laura


* Re: Process for severe early stable bugs?
  2018-12-08  0:33 Process for severe early stable bugs? Laura Abbott
@ 2018-12-08  7:32 ` Willy Tarreau
  2018-12-08 11:56 ` Greg KH
  1 sibling, 0 replies; 6+ messages in thread
From: Willy Tarreau @ 2018-12-08  7:32 UTC (permalink / raw)
  To: Laura Abbott; +Cc: stable, Linux Kernel Mailing List

Hi Laura,

On Fri, Dec 07, 2018 at 04:33:10PM -0800, Laura Abbott wrote:
> The latest file system corruption issue (Nominally fixed by
> ffe81d45322c ("blk-mq: fix corruption with direct issue") later
> fixed by c616cbee97ae ("blk-mq: punt failed direct issue to dispatch
> list")) brought a lot of rightfully concerned users asking about
> release schedules. 4.18 went EOL on Nov 21 and Fedora rebased to
> 4.19.3 on Nov 23. When the issue started getting visibility,
> users were left with the option of running known EOL 4.18.x
> kernels or running a 4.19 series that could corrupt their
> data. Admittedly, the risk of running the EOL kernel was pretty
> low given how recent it was, but it's still not a great look
> to tell people to run something marked EOL.
> 
> I'm wondering if there's anything we can do to make things easier
> on kernel consumers. Bugs will certainly happen but it really
> makes it hard to push the "always run the latest stable" narrative
> if there isn't a good fallback when things go seriously wrong. I
> don't actually have a great proposal for a solution here other than
> retroactively bringing back 4.18 (which I don't think Greg would
> like) but I figured I should at least bring it up.

This type of problem may happen once in a while but fortunately is
extremely rare, so I guess it can be addressed with unusual methods.

For my use cases, I always make sure that the last two LTS branches
work fine. Since there's some great maintenance overlap between LTS
branches, I can quickly switch to 4.14.x (or even 4.9.x) if this
happens. In our products we make sure that our toolchain is built
with support for the previous kernel as well "just in case". We've
never switched back and probably never will, but it has helped us a
lot when comparing strange behaviours between two kernels.

I think that if your distro is functionally and technically compatible
with the previous LTS branch, it could be an acceptable escape for
users who are concerned about their data and their security at the
same time. After all, previous LTS branches are there for those who
can't upgrade. In my opinion this situation perfectly qualifies.

But it requires some preparation, as I mentioned. Some components in
the distro might rely on features from the very latest kernels. At
the very least it deserves a bit of inspection to determine whether
such dependencies exist, and what would be lost in such a fallback,
so that users can be warned.
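
As a purely hypothetical sketch of what that inspection could look
like (the package names and version data below are entirely made up;
a distro would generate them from its own package metadata):

    # Hypothetical sketch: before pointing users at the previous LTS,
    # check whether any userspace component declares a minimum kernel
    # version newer than that LTS.
    previous_lts = (4, 14)
    # Made-up example data, standing in for real package metadata.
    min_kernel_required = {
        "systemd": (4, 10),
        "some-container-runtime": (4, 18),
    }
    blockers = {pkg: ver for pkg, ver in min_kernel_required.items()
                if ver > previous_lts}
    if blockers:
        print("falling back to 4.14 would affect:", blockers)
    else:
        print("no known feature dependencies block a 4.14 fallback")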

Just my two cents,
Willy


* Re: Process for severe early stable bugs?
  2018-12-08  0:33 Process for severe early stable bugs? Laura Abbott
  2018-12-08  7:32 ` Willy Tarreau
@ 2018-12-08 11:56 ` Greg KH
  2018-12-08 17:18   ` Theodore Y. Ts'o
  1 sibling, 1 reply; 6+ messages in thread
From: Greg KH @ 2018-12-08 11:56 UTC (permalink / raw)
  To: Laura Abbott; +Cc: stable, Linux Kernel Mailing List

On Fri, Dec 07, 2018 at 04:33:10PM -0800, Laura Abbott wrote:
> The latest file system corruption issue (Nominally fixed by
> ffe81d45322c ("blk-mq: fix corruption with direct issue") later
> fixed by c616cbee97ae ("blk-mq: punt failed direct issue to dispatch
> list")) brought a lot of rightfully concerned users asking about
> release schedules. 4.18 went EOL on Nov 21 and Fedora rebased to
> 4.19.3 on Nov 23. When the issue started getting visibility,
> users were left with the option of running known EOL 4.18.x
> kernels or running a 4.19 series that could corrupt their
> data. Admittedly, the risk of running the EOL kernel was pretty
> low given how recent it was, but it's still not a great look
> to tell people to run something marked EOL.
> 
> I'm wondering if there's anything we can do to make things easier
> on kernel consumers. Bugs will certainly happen but it really
> makes it hard to push the "always run the latest stable" narrative
> if there isn't a good fallback when things go seriously wrong. I
> don't actually have a great proposal for a solution here other than
> retroactively bringing back 4.18 (which I don't think Greg would
> like) but I figured I should at least bring it up.

A nice step forward would have been if someone could have at least
_told_ the stable maintainer (i.e. me) that there was such a serious bug
out there.  That didn't happen here and I only found out about it
accidentally by happening to talk to a developer who was on the bugzilla
thread at a totally random meeting last Wednesday.

There was also not an email thread that I could find once I found out
about the issue.  By that time the bug was fixed and all I could do was
wait for it to hit Linus's tree (and even then, I had to wait for the
fix to the fix...)  If I had known about it earlier, I would have
reverted the change that caused this.

I would start by looking at how we at least notify people of major
issues like this.  Yes, it was complex and originally blamed on both
btrfs and ext4 changes, and it was dependent on using a brand-new
.config file which no kernel developers use (and it seems no distro uses
either, which protected Fedora and others at the least!)

There will always be bugs and exceptions, and personally I think this
one was rare enough that adding the requirement for me to maintain
more than one set of stable trees for longer isn't going to happen
(yeah, I know you said you didn't expect that, but I know others
mentioned it to me...)

So I don't know what to say here other than please tell me about major
issues like this and don't rely on me getting lucky and hearing about it
on my own.

thanks,

greg k-h


* Re: Process for severe early stable bugs?
  2018-12-08 11:56 ` Greg KH
@ 2018-12-08 17:18   ` Theodore Y. Ts'o
  2018-12-09 11:30     ` Greg KH
  0 siblings, 1 reply; 6+ messages in thread
From: Theodore Y. Ts'o @ 2018-12-08 17:18 UTC (permalink / raw)
  To: Greg KH; +Cc: Laura Abbott, stable, Linux Kernel Mailing List

On Sat, Dec 08, 2018 at 12:56:29PM +0100, Greg KH wrote:
> A nice step forward would have been if someone could have at least
> _told_ the stable maintainer (i.e. me) that there was such a serious bug
> out there.  That didn't happen here and I only found out about it
> accidentally by happening to talk to a developer who was on the bugzilla
> thread at a totally random meeting last Wednesday.
> 
> There was also not an email thread that I could find once I found out
> about the issue.  By that time the bug was fixed and all I could do was
> wait for it to hit Linus's tree (and even then, I had to wait for the
> fix to the fix...)  If I had known about it earlier, I would have
> reverted the change that caused this.

So to be fair, the window between when we *knew* which change needed
to be reverted and when the fix actually became available was very
narrow.  For most of the 3-4 weeks we spent trying to track it down
--- and the bug had been present in Linus's tree since 4.19-rc1(!) ---
we had no idea exactly how big the problem was.

If you want to know about these sorts of things early --- at the
moment I and others at $WORK have been trying to track down a problem
on a 4.14.x kernel which has symptoms that look ***eerily*** similar
to Bugzilla #201685.  There was another bug causing mysterious file
system corruptions, possibly related, that was noticed on an Ubuntu
4.13.x kernel and forced another team to fall back to a 4.4 kernel.
Both of these have caused file system corruptions that resulted in
customer-visible disruptions.  Ming Lei has said that there is a
theoretical bug which he now believes might be present in blk-mq
starting in 4.11.

To make life even more annoying, starting in 4.14.63, disabling blk-mq
is no longer even an *option* for virtio-scsi thanks to commit
b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq
vector affinity"), which was backported to 4.14 as of 70b522f163bbb32.
We might try reverting that commit and then disabling blk-mq to see if
it makes the problem go away.  But the problem happens very rarely ---
maybe once a week across a population of 2500 or so VMs, so it would
take a long time before we could be certain that any change would fix
it in the absence of a detailed root cause analysis or a clean repro
that can be run in a test environment.
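
To put a rough number on "a long time" (a back-of-the-envelope sketch
only, assuming the corruptions arrive as a Poisson process at the
fleet-wide rate above; the figures are illustrative, not a real
analysis):

    # Sketch: how long must the fleet run failure-free after a candidate
    # fix before "no failures" is meaningful evidence?  Assumes the
    # corruptions arrive as a Poisson process at the baseline rate.
    import math

    baseline_rate = 1.0   # ~1 corruption per week across ~2500 VMs
    confidence = 0.95

    # If the fix changed nothing, P(zero failures in t weeks) = exp(-rate * t).
    # Require that probability to fall below (1 - confidence):
    weeks_needed = -math.log(1 - confidence) / baseline_rate
    print(f"~{weeks_needed:.0f} failure-free weeks for {confidence:.0%} confidence")
    # Roughly 3 weeks for 95%, 5 for 99%, 7 for 99.9% --- per candidate
    # change, which is why a root cause or clean repro beats fleet statistics.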

So now you know --- but it's not clear it's going to be helpful.
Commit b5b6e8c8d3b4 was fixing another bug, so reverting it isn't
necessarily the right thing, especially since we can't yet prove it's
the cause of the problem.  It was "interesting" that we forced
virtio-scsi to use blk-mq in the middle of an LTS kernel series,
though.

> I would start by looking at how we at least notify people of major
> issues like this.  Yes, it was complex and originally blamed on both
> btrfs and ext4 changes, and it was dependent on using a brand-new
> .config file which no kernel developers use (and it seems no distro uses
> either, which protected Fedora and others at the least!)

Ubuntu's bleeding edge kernel uses the config, so that's where we got
a lot of reports of bug #201685 initially.  At first it wasn't even
obvious whether it was a kernel<->userspace versioning issue (a la the
dm userspace gotcha a month or two ago).  And I never even heard that
btrfs was being blamed.  That was probably on a different thread that
I didn't see?  I wish I had, since for the first 2-3 weeks all of the
reports I saw were from ext4 users, and because it was so easy to get
false negative and false positive reports, one user bisected it to a
change in the middle of the RCU pull in 4.19-rc1, and another claimed
that after reverting all ext4 changes between 4.18 and 4.19, the
problem went away.  Both conclusions, ultimately, were false of
course.
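
Just to illustrate why bisecting this kind of bug is so treacherous, a
toy model (all numbers made up, and it ignores false positives from
unrelated corruption, which make things even worse):

    # Sketch: if the bug reproduces with probability p per test run, a
    # genuinely bad commit gets marked "good" whenever every run at that
    # bisect step happens to pass.
    p_repro = 0.5        # chance a single test run hits the corruption
    runs_per_step = 3    # test runs per bisect step
    bad_steps = 7        # assume ~half of ~14 bisect steps test a buggy commit

    p_step_correct = 1 - (1 - p_repro) ** runs_per_step
    p_bisect_correct = p_step_correct ** bad_steps
    print(f"P(one step correct)      = {p_step_correct:.3f}")   # 0.875
    print(f"P(whole bisection right) = {p_bisect_correct:.3f}")  # ~0.39
    # Even with a generous 50% per-run reproduction rate, only ~40% of
    # such bisections land on the real culprit, which is how an RCU commit
    # can get fingered and an ext4 revert can appear to "fix" things.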

So before we had a root cause, and a clean reproduction that
*developers* could actually use, if you had seen the early reports,
would you have wanted to revert the RCU pull for the 4.19 merge
window?  Or the ext4 pull?  Unfortunately, there are no easy solutions
here.

> There will always be bugs and exceptions, and personally I think this
> one was rare enough that adding the requirement for me to maintain
> more than one set of stable trees for longer isn't going to happen
> (yeah, I know you said you didn't expect that, but I know others
> mentioned it to me...)
> 
> So I don't know what to say here other than please tell me about major
> issues like this and don't rely on me getting lucky and hearing about it
> on my own.

Well, now you know about one of the issues that I'm trying to debug.
It's not at all clear how actionable that information happens to be,
though.  I didn't bug you about it for that reason.

						- Ted

P.S.  The fact that Jens is planning on ripping out the legacy block
I/O path in 4.21, forcing everyone to use blk-mq, is not filling me
with a lot of joy and gladness.  I understand why he's doing it;
maintaining two code paths is not easy.  But apparently there was
another discard bug recently that would have been found if blktests
were being run more frequently by developers, so I'm not feeling very
trusting of the block layer at the moment, especially since people
invariably blame the file system code first.

P.P.S.  Sorry if it sounds like I'm grumpy; it's probably because I am.

P.P.P.S.  If I were king, I'd be asking for a huge number of kunit
tests for block-mq to be developed, and then running them under a
Thread Sanitizer.


* Re: Process for severe early stable bugs?
  2018-12-08 17:18   ` Theodore Y. Ts'o
@ 2018-12-09 11:30     ` Greg KH
       [not found]       ` <20181209164419.GI20708@thunk.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Greg KH @ 2018-12-09 11:30 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Laura Abbott, stable, Linux Kernel Mailing List

On Sat, Dec 08, 2018 at 12:18:53PM -0500, Theodore Y. Ts'o wrote:
> On Sat, Dec 08, 2018 at 12:56:29PM +0100, Greg KH wrote:
> > A nice step forward would have been if someone could have at least
> > _told_ the stable maintainer (i.e. me) that there was such a serious bug
> > out there.  That didn't happen here and I only found out about it
> > accidentally by happening to talk to a developer who was on the bugzilla
> > thread at a totally random meeting last Wednesday.
> > 
> > There was also not an email thread that I could find once I found out
> > about the issue.  By that time the bug was fixed and all I could do was
> > wait for it to hit Linus's tree (and even then, I had to wait for the
> > fix to the fix...)  If I had known about it earlier, I would have
> > reverted the change that caused this.
> 
> So to be fair, the window between when we *knew* which change needed
> to be reverted and when the fix actually became available was very
> narrow.  For most of the 3-4 weeks we spent trying to track it down
> --- and the bug had been present in Linus's tree since 4.19-rc1(!) ---
> we had no idea exactly how big the problem was.
> 
> If you want to know about these sorts of things early --- at the
> moment I and others at $WORK have been trying to track down a problem
> on a 4.14.x kernel which has symptoms that look ***eerily*** similar
> to Bugzilla #201685.  There was another bug causing mysterious file
> system corruptions, possibly related, that was noticed on an Ubuntu
> 4.13.x kernel and forced another team to fall back to a 4.4 kernel.
> Both of these have caused file system corruptions that resulted in
> customer-visible disruptions.  Ming Lei has said that there is a
> theoretical bug which he now believes might be present in blk-mq
> starting in 4.11.
> 
> To make life even more annoying, starting in 4.14.63, disabling blk-mq
> is no longer even an *option* for virtio-scsi thanks to commit
> b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq
> vector affinity"), which was backported to 4.14 as of 70b522f163bbb32.
> We might try reverting that commit and then disabling blk-mq to see if
> it makes the problem go away.  But the problem happens very rarely ---
> maybe once a week across a population of 2500 or so VMs, so it would
> take a long time before we could be certain that any change would fix
> it in the absence of a detailed root cause analysis or a clean repro
> that can be run in a test environment.
> 
> So now you know --- but it's not clear it's going to be helpful.
> Commit b5b6e8c8d3b4 was fixing another bug, so reverting it isn't
> necessarily the right thing, especially since we can't yet prove it's
> the cause of the problem.  It was "interesting" that we forced
> virtio-scsi to use blk-mq in the middle of an LTS kernel series,
> though.

Yes, this all was very helpful; thank you for the information, I
appreciate it.

And I will watch out for these issues now.  It's a bit sad that these
are showing up in 4.14, but it seems that distros are only now starting
to really use that kernel version (or at least are only now starting to
report things from it), as it is a year old.  Oh well, we can't do much
about that; I am more worried about 4.19 issues like the one Laura was
talking about, as that is the "canary" we need to watch more closely.

> P.P.P.S.  If I were king, I'd be asking for a huge number of kunit
> tests for block-mq to be developed, and then running them under a
> Thread Sanitizer.

Isn't that what xfstests and fio are?  Aren't we running those all the
time and reporting the issues they find?  How did this bug not show up
in those tests?  Is it just because they didn't run long enough?

Because of those test suites, I was thinking that the block and
filesystem paths were one of the more well-tested things we had at the
moment; is this not true?

thanks,

greg k-h


* Re: Process for severe early stable bugs?
       [not found]       ` <20181209164419.GI20708@thunk.org>
@ 2018-12-10  9:51         ` Greg KH
  0 siblings, 0 replies; 6+ messages in thread
From: Greg KH @ 2018-12-10  9:51 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Laura Abbott, stable, Linux Kernel Mailing List

On Sun, Dec 09, 2018 at 11:44:19AM -0500, Theodore Y. Ts'o wrote:
> On Sun, Dec 09, 2018 at 12:30:39PM +0100, Greg KH wrote:
> > > P.P.P.S.  If I were king, I'd be asking for a huge number of kunit
> > > tests for block-mq to be developed, and then running them under a
> > > Thread Sanitizer.
> > 
> > Isn't that what xfstests and fio are?  Aren't we running those all the
> > time and reporting the issues they find?  How did this bug not show up
> > in those tests?  Is it just because they didn't run long enough?
> > 
> > Because of those test suites, I was thinking that the block and
> > filesystem paths were one of the more well-tested things we had at the
> > moment; is this not true?
> 
> I'm pretty confident about the file system paths, and the "happy
> paths" for the block layer.
> 
> But with Kernel Bugzilla #201685, despite huge amounts of testing both
> before and after 4.19-rc1, nothing picked it up.  It turned out to be
> very configuration specific, *and* it only happened when you were under
> heavy memory pressure and/or I/O pressure.
> 
> I'm starting to try to use blktests, but it's not as mature as
> xfstests.  It has portability issues, as it assumes a much newer
> userspace.  So I can't even run it under some environments at all.
> The test coverage just isn't as broad.  Compare:
> 
> ext4/4k: 441 tests, 1 failures, 42 skipped, 4387 seconds
>   Failures: generic/388
> 
> Versus:
> 
> Run: block/001 block/002 block/003 block/004 block/005 block/006
>     block/009 block/010 block/012 block/013 block/014 block/015
>     block/016 block/017 block/018 block/020 block/021 block/023
>     block/024 loop/001 loop/002 loop/003 loop/004 loop/005 loop/006
>     nvme/002 nvme/003 nvme/004 nvme/006 nvme/007 nvme/008 nvme/009
>     nvme/010 nvme/011 nvme/012 nvme/013 nvme/014 nvme/015 nvme/016
>     nvme/017 nvme/019 nvme/020 nvme/021 nvme/022 nvme/023 nvme/024
>     nvme/025 nvme/026 nvme/027 nvme/028 scsi/001 scsi/002 scsi/003
>     scsi/004 scsi/005 scsi/006 srp/001 srp/002 srp/003 srp/004
>     srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013
> Failures: block/017 block/024 nvme/002 nvme/003 nvme/008 nvme/009
>     nvme/010 nvme/011 nvme/012 nvme/013 nvme/014 nvme/015 nvme/016
>     nvme/019 nvme/020 nvme/021 nvme/022 nvme/023 nvme/024 nvme/025
>     nvme/026 nvme/027 nvme/028 scsi/006 srp/001 srp/002 srp/003 srp/004
>     srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013
> Failed 37 of 69 tests
> 
> (Most of the failures are test portability issues that I still need to
> work through, not real failures.  But just look at the number of
> tests....)

So you are saying quantity rules over quality?  :)

It's really hard to judge this, given that xfstests are testing a whole
range of other things (POSIX compliance and stressing the vfs api),
while blktests are there to stress the block i/o api/interface.

So it would be best to run both, as we know xfstests also hits the
block layer...

thanks,

greg k-h

