All of lore.kernel.org
 help / color / mirror / Atom feed
From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Jens Axboe <axboe@kernel.dk>,
	Mikael Pettersson <mikpelinux@gmail.com>,
	Xuewei Zhang <xueweiz@google.com>
Cc: Linux SPARC Kernel Mailing List <sparclinux@vger.kernel.org>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-scsi <linux-scsi@vger.kernel.org>
Subject: Re: [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500
Date: Tue, 12 Feb 2019 07:24:09 -0800	[thread overview]
Message-ID: <1549985049.3173.3.camel@HansenPartnership.com> (raw)
In-Reply-To: <fc547d40-9a42-a823-b62f-460c3244fe7d@kernel.dk>

On Mon, 2019-02-11 at 19:50 -0700, Jens Axboe wrote:
> On 2/11/19 7:13 PM, James Bottomley wrote:
> > On Mon, 2019-02-11 at 09:31 -0700, Jens Axboe wrote:
> > > On 2/11/19 9:28 AM, James Bottomley wrote:
> > > > On Mon, 2019-02-11 at 08:46 -0700, Jens Axboe wrote:
> > > > > On 2/11/19 8:42 AM, James Bottomley wrote:
> > > > > > On Mon, 2019-02-11 at 08:28 -0700, Jens Axboe wrote:
> > > > > > > On 2/11/19 8:25 AM, James Bottomley wrote:
> > > > > > > > On Sun, 2019-02-10 at 09:35 -0700, Jens Axboe wrote:
> > > > > > > > > On 2/10/19 9:25 AM, James Bottomley wrote:
> > > > 
> > > > [...]
> > > > > > > > > > That check wasn't changed by the code removal.
> > > > > > > > > 
> > > > > > > > > As I said above, for sd. This isn't true for non-
> > > > > > > > > disks.
> > > > > > > > 
> > > > > > > > Yes, but the behaviour above doesn't change across a
> > > > > > > > switch
> > > > > > > > to MQ, so I don't quite understand how it bisects back
> > > > > > > > to
> > > > > > > > that change.  If we're not gathering entropy for the
> > > > > > > > device
> > > > > > > > now, we wouldn't have been before the switch, so the
> > > > > > > > entropy characteristics shouldn't have changed.
> > > > > > > 
> > > > > > > But it does, as I also wrote in that first email. The
> > > > > > > legacy
> > > > > > > queue flags had QUEUE_FLAG_ADD_RANDOM set by default, the
> > > > > > > MQ
> > > > > > > ones do not. Hence any non-sd device would previously
> > > > > > > ALWAYS
> > > > > > > have ADD_RANDOM set, now none of them do. Also see the
> > > > > > > patch
> > > > > > > I sent.
> > > > > > 
> > > > > > So your theory is that the disk in question never gets to
> > > > > > the
> > > > > > rotational check?  because the check will clear the flag if
> > > > > > it's non-rotational and set it if it's not, so the default
> > > > > > state of the flag shouldn't matter.
> > > > > 
> > > > > No, my point is about non-disks, devices that aren't driven
> > > > > by
> > > > > sd. The behavior for sd hasn't changed, as it sets/clears it
> > > > > unconditionally. 
> > > > 
> > > > I agree, but I don't think any of them were significant entropy
> > > > contributors before: things like nvme have always been outside
> > > > of
> > > > this and sr and st don't really contribute much to the seek
> > > > load
> > > > during boot because they're probed but not used by the boot
> > > > sequence, so I can't see how they would cause this
> > > > behaviour.  I
> > > > suppose it could be target probing, but even that seems
> > > > unlikely
> > > > because it should be dwarfed by the number of root disk reads
> > > > during boot.
> > > > 
> > > > For the rng to take an additional 5 minutes to initialize, we
> > > > must
> > > > have lost a significant entropy source somewhere.
> > > 
> > > I agree it's not a significant amount of entropy, but even just
> > > one
> > > bit could mean a long stall if that put us over the edge of just
> > > not
> > > having enough for whatever is blocking on /dev/random. Mikael's
> > > boot
> > > did have a CDROM, it's not impossible that the handful of
> > > commands we
> > > end up doing to that device would have contributed enough entropy
> > > to
> > > get the boot done without stalling for minutes.
> > > 
> > > One way to know for sure, and that's if Mikael tests the patch.
> > 
> > I think I've got the root cause.  I have one system in my test bed
> > exhibiting this behaviour.  It turns out the disk in it has no
> > characteristics VPD page.  The 0xB1 VPD was a SBC-3 addition, so
> > that's
> > not surprising.  However, the characteristics check bails before
> > setting the flags, so it takes the default flag which has flipped.
> > 
> > We can either fix this by setting the QUEUE_FLAG_ADD_RANDOM if
> > there's
> > no 0xB1 page or by setting the default as Jens proposed.
> 
> I'd recommend just doing my patch, since that'll be the same behavior
> that SCSI had before.

I've got the history now, it's this patch

Author: Xuewei Zhang <xueweiz@google.com>
Date:   Thu Sep 6 13:37:19 2018 -0700

    scsi: sd: Contribute to randomness when running rotational device

It added the else branch to the if (rot == 1).  It's the position of
that else branch which is wrong because not all disks have a SBC-3
characteristics VPD page, so they're the ones under MQ which stop
contributing entropy.  Whichever patch we go with will need a fixes:
for this.

James


WARNING: multiple messages have this Message-ID (diff)
From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Jens Axboe <axboe@kernel.dk>,
	Mikael Pettersson <mikpelinux@gmail.com>,
	Xuewei Zhang <xueweiz@google.com>
Cc: Linux SPARC Kernel Mailing List <sparclinux@vger.kernel.org>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-scsi <linux-scsi@vger.kernel.org>
Subject: Re: [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su
Date: Tue, 12 Feb 2019 15:24:09 +0000	[thread overview]
Message-ID: <1549985049.3173.3.camel@HansenPartnership.com> (raw)
In-Reply-To: <fc547d40-9a42-a823-b62f-460c3244fe7d@kernel.dk>

On Mon, 2019-02-11 at 19:50 -0700, Jens Axboe wrote:
> On 2/11/19 7:13 PM, James Bottomley wrote:
> > On Mon, 2019-02-11 at 09:31 -0700, Jens Axboe wrote:
> > > On 2/11/19 9:28 AM, James Bottomley wrote:
> > > > On Mon, 2019-02-11 at 08:46 -0700, Jens Axboe wrote:
> > > > > On 2/11/19 8:42 AM, James Bottomley wrote:
> > > > > > On Mon, 2019-02-11 at 08:28 -0700, Jens Axboe wrote:
> > > > > > > On 2/11/19 8:25 AM, James Bottomley wrote:
> > > > > > > > On Sun, 2019-02-10 at 09:35 -0700, Jens Axboe wrote:
> > > > > > > > > On 2/10/19 9:25 AM, James Bottomley wrote:
> > > > 
> > > > [...]
> > > > > > > > > > That check wasn't changed by the code removal.
> > > > > > > > > 
> > > > > > > > > As I said above, for sd. This isn't true for non-
> > > > > > > > > disks.
> > > > > > > > 
> > > > > > > > Yes, but the behaviour above doesn't change across a
> > > > > > > > switch
> > > > > > > > to MQ, so I don't quite understand how it bisects back
> > > > > > > > to
> > > > > > > > that change.  If we're not gathering entropy for the
> > > > > > > > device
> > > > > > > > now, we wouldn't have been before the switch, so the
> > > > > > > > entropy characteristics shouldn't have changed.
> > > > > > > 
> > > > > > > But it does, as I also wrote in that first email. The
> > > > > > > legacy
> > > > > > > queue flags had QUEUE_FLAG_ADD_RANDOM set by default, the
> > > > > > > MQ
> > > > > > > ones do not. Hence any non-sd device would previously
> > > > > > > ALWAYS
> > > > > > > have ADD_RANDOM set, now none of them do. Also see the
> > > > > > > patch
> > > > > > > I sent.
> > > > > > 
> > > > > > So your theory is that the disk in question never gets to
> > > > > > the
> > > > > > rotational check?  because the check will clear the flag if
> > > > > > it's non-rotational and set it if it's not, so the default
> > > > > > state of the flag shouldn't matter.
> > > > > 
> > > > > No, my point is about non-disks, devices that aren't driven
> > > > > by
> > > > > sd. The behavior for sd hasn't changed, as it sets/clears it
> > > > > unconditionally. 
> > > > 
> > > > I agree, but I don't think any of them were significant entropy
> > > > contributors before: things like nvme have always been outside
> > > > of
> > > > this and sr and st don't really contribute much to the seek
> > > > load
> > > > during boot because they're probed but not used by the boot
> > > > sequence, so I can't see how they would cause this
> > > > behaviour.  I
> > > > suppose it could be target probing, but even that seems
> > > > unlikely
> > > > because it should be dwarfed by the number of root disk reads
> > > > during boot.
> > > > 
> > > > For the rng to take an additional 5 minutes to initialize, we
> > > > must
> > > > have lost a significant entropy source somewhere.
> > > 
> > > I agree it's not a significant amount of entropy, but even just
> > > one
> > > bit could mean a long stall if that put us over the edge of just
> > > not
> > > having enough for whatever is blocking on /dev/random. Mikael's
> > > boot
> > > did have a CDROM, it's not impossible that the handful of
> > > commands we
> > > end up doing to that device would have contributed enough entropy
> > > to
> > > get the boot done without stalling for minutes.
> > > 
> > > One way to know for sure, and that's if Mikael tests the patch.
> > 
> > I think I've got the root cause.  I have one system in my test bed
> > exhibiting this behaviour.  It turns out the disk in it has no
> > characteristics VPD page.  The 0xB1 VPD was a SBC-3 addition, so
> > that's
> > not surprising.  However, the characteristics check bails before
> > setting the flags, so it takes the default flag which has flipped.
> > 
> > We can either fix this by setting the QUEUE_FLAG_ADD_RANDOM if
> > there's
> > no 0xB1 page or by setting the default as Jens proposed.
> 
> I'd recommend just doing my patch, since that'll be the same behavior
> that SCSI had before.

I've got the history now, it's this patch

Author: Xuewei Zhang <xueweiz@google.com>
Date:   Thu Sep 6 13:37:19 2018 -0700

    scsi: sd: Contribute to randomness when running rotational device

It added the else branch to the if (rot = 1).  It's the position of
that else branch which is wrong because not all disks have a SBC-3
characteristics VPD page, so they're the ones under MQ which stop
contributing entropy.  Whichever patch we go with will need a fixes:
for this.

James

  parent reply	other threads:[~2019-02-12 15:24 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-09 17:04 [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Mikael Pettersson
2019-02-09 17:04 ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Bl Mikael Pettersson
2019-02-09 18:19 ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 James Bottomley
2019-02-09 18:19   ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su James Bottomley
2019-02-10  9:17   ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Mikael Pettersson
2019-02-10  9:17     ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Mikael Pettersson
2019-02-10 15:44     ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 James Bottomley
2019-02-10 15:44       ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su James Bottomley
2019-02-10 16:05       ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Jens Axboe
2019-02-10 16:05         ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Jens Axboe
2019-02-10 16:25         ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 James Bottomley
2019-02-10 16:25           ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su James Bottomley
2019-02-10 16:35           ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Jens Axboe
2019-02-10 16:35             ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Jens Axboe
2019-02-11 15:25             ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 James Bottomley
2019-02-11 15:25               ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su James Bottomley
2019-02-11 15:28               ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Jens Axboe
2019-02-11 15:28                 ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Jens Axboe
2019-02-11 15:42                 ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 James Bottomley
2019-02-11 15:42                   ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su James Bottomley
2019-02-11 15:46                   ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Jens Axboe
2019-02-11 15:46                     ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Jens Axboe
2019-02-11 16:28                     ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 James Bottomley
2019-02-11 16:28                       ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su James Bottomley
2019-02-11 16:31                       ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Jens Axboe
2019-02-11 16:31                         ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Jens Axboe
2019-02-12  2:13                         ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 James Bottomley
2019-02-12  2:13                           ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su James Bottomley
2019-02-12  2:50                           ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Jens Axboe
2019-02-12  2:50                             ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Jens Axboe
2019-02-12  3:37                             ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Elliott, Robert (Persistent Memory)
2019-02-12  3:37                               ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Elliott, Robert (Persistent Memory)
2019-02-12  4:15                               ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 James Bottomley
2019-02-12  4:15                                 ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su James Bottomley
2019-02-12 15:24                             ` James Bottomley [this message]
2019-02-12 15:24                               ` James Bottomley
2019-02-12 15:27                               ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Jens Axboe
2019-02-12 15:27                                 ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Jens Axboe
2019-02-14 18:35         ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Sun Blade 2500 Mikael Pettersson
2019-02-14 18:35           ` [5.0-rc5 regression] "scsi: kill off the legacy IO path" causes 5 minute delay during boot on Su Mikael Pettersson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1549985049.3173.3.camel@HansenPartnership.com \
    --to=james.bottomley@hansenpartnership.com \
    --cc=axboe@kernel.dk \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=mikpelinux@gmail.com \
    --cc=sparclinux@vger.kernel.org \
    --cc=xueweiz@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.