* aborted SCSI commands while discarding/unmapping via mkfs.xfs
@ 2012-08-14 20:42 Stefan Priebe
  2012-08-14 21:35 ` Dave Chinner
  2012-08-15  9:13 ` Michael Monnerie
  0 siblings, 2 replies; 8+ messages in thread
From: Stefan Priebe @ 2012-08-14 20:42 UTC (permalink / raw)
  To: xfs; +Cc: Christoph Hellwig, Ronnie Sahlberg, dchinner

Hello list,

I'm testing KVM with qemu, libiscsi, virtio-scsi-pci and scsi-generic on
top of a Nexenta storage solution. While doing mkfs.xfs on an already
used LUN / block device, I discovered that the unmapping / discard
commands mkfs.xfs sends take a long time, which results in a lot of
aborted SCSI commands.

Would it make sense to let mkfs.xfs send these unmapping commands in
small portions (e.g. 100MB), or is there another problem in the path
to the block device? Any suggestions or ideas?

Thanks!

Greets,
Stefan


* Re: aborted SCSI commands while discarding/unmapping via mkfs.xfs
  2012-08-14 20:42 aborted SCSI commands while discarding/unmapping via mkfs.xfs Stefan Priebe
@ 2012-08-14 21:35 ` Dave Chinner
  2012-08-14 21:51   ` ronnie sahlberg
  2012-08-15  6:31   ` Stefan Priebe - Profihost AG
  2012-08-15  9:13 ` Michael Monnerie
  1 sibling, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2012-08-14 21:35 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: dchinner, Christoph Hellwig, Ronnie Sahlberg, xfs

On Tue, Aug 14, 2012 at 10:42:21PM +0200, Stefan Priebe wrote:
> Hello list,
> 
> I'm testing KVM with qemu, libiscsi, virtio-scsi-pci and
> scsi-generic on top of a Nexenta storage solution. While doing
> mkfs.xfs on an already used LUN / block device, I discovered that the
> unmapping / discard commands mkfs.xfs sends take a long time, which
> results in a lot of aborted SCSI commands.

Sounds like a problem with your storage being really slow at
discards.

> Would it make sense to let mkfs.xfs send these unmapping commands in
> small portions (e.g. 100MB)

No, because the underlying implementation (blkdev_issue_discard())
already breaks the discard request up into the granularity that is
supported by the underlying storage.....

> or is there another problem in the
> path to the block device? Any suggestions or ideas?

.... which, of course, had bugs in it so is a much more likely cause
of your problems.

That said, the discard granularity is derived from information the
storage supplies the kernel in its SCSI mode page, so if the
discard granularity is too large, that's a storage problem, not a
Linux problem at all, let alone a mkfs.xfs problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
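
To make the mechanism described above concrete: on Linux, a userspace tool
discards a block device by issuing the BLKDISCARD ioctl, and the kernel's
blkdev_issue_discard() is what splits that single request according to the
limits the device advertises (visible on reasonably recent kernels in
/sys/block/<dev>/queue/discard_granularity and discard_max_bytes). The
following is only a minimal sketch of that pattern, not the actual mkfs.xfs
code, and the device name is a placeholder:

/* Minimal sketch: one discard request covering the whole device, similar
 * in spirit to what mkfs.xfs does before writing a new filesystem.  The
 * kernel splits this request up before anything reaches the storage. */
#include <fcntl.h>
#include <linux/fs.h>      /* BLKDISCARD, BLKGETSIZE64 */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/sdX";              /* placeholder device */
    int fd = open(dev, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    uint64_t size;                             /* device size in bytes */
    if (ioctl(fd, BLKGETSIZE64, &size) < 0) {
        perror("BLKGETSIZE64");
        close(fd);
        return 1;
    }

    uint64_t range[2] = { 0, size };           /* { offset, length } */
    if (ioctl(fd, BLKDISCARD, &range) < 0)     /* discard the whole LUN */
        perror("BLKDISCARD");

    close(fd);
    return 0;
}

If a whole-LUN discard like this stalls for minutes, the delay is on the
target side; the kernel has already split the request according to what the
device reported.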


* Re: aborted SCSI commands while discarding/unmapping via mkfs.xfs
  2012-08-14 21:35 ` Dave Chinner
@ 2012-08-14 21:51   ` ronnie sahlberg
  2012-08-15  7:32     ` Dave Chinner
  2012-08-15  6:31   ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 8+ messages in thread
From: ronnie sahlberg @ 2012-08-14 21:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, Christoph Hellwig, dchinner, Stefan Priebe

On Wed, Aug 15, 2012 at 7:35 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Aug 14, 2012 at 10:42:21PM +0200, Stefan Priebe wrote:
>> Hello list,
>>
>> I'm testing KVM with qemu, libiscsi, virtio-scsi-pci and
>> scsi-generic on top of a Nexenta storage solution. While doing
>> mkfs.xfs on an already used LUN / block device, I discovered that the
>> unmapping / discard commands mkfs.xfs sends take a long time, which
>> results in a lot of aborted SCSI commands.
>
> Sounds like a problem with your storage being really slow at
> discards.
>
>> Would it make sense to let mkfs.xfs send these unmapping commands in
>> small portions (e.g. 100MB)
>
> No, because the underlying implementation (blkdev_issue_discard())
> already breaks the discard request up into the granularity that is
> supported by the underlying storage.....
>
>> or is there another problem in the
>> path to the block device? Any suggestions or ideas?
>
> .... which, of course, had bugs in it so is a much more likely cause
> of your problems.
>
> That said, the discard granularity is derived from information the
> storage supplies the kernel in its SCSI mode page, so if the
> discard granularity is too large, that's a storage problem, not a
> Linux problem at all, let alone a mkfs.xfs problem.

Hi Dave,
That is true.

But the network traces for this particular issue show that on this
particular storage array, when a huge train of discards is sent, to
basically discard the entire LUN, the array may take many minutes to
perform these discards, during which time it is unresponsive to any
other I/O, on the same LUN or on other LUNs.

This is definitely an issue with the I/O scheduler in the storage
array and not strictly in the Linux kernel or mkfs.xfs.

And this basically means that for these kinds of arrays with this
discard behaviour, running a command that performs a huge number of
discards to discard the entire device will act as a full
denial-of-service attack, since every LUN and every host attached to
the array will experience a full outage for minutes.

This is definitely an issue with the array, BUT the Linux kernel
and/or userspace utilities can implement, and very often do
implement, tweaks to be more friendly towards hardware and to avoid
triggering unfortunate hardware behaviour.

For example, the Linux kernel contains a "fix" for the Pentium FDIV
bug, even though there was never any issue in Linux itself that
needed fixing.

The only other realistic alternative is to provide warnings such as:
"Some storage arrays may have major performance problems if you run
mkfs.xfs, to the point of a full outage lasting many minutes for
every single LUN on that array.  Unless you KNOW for a fact that your
storage array does not have such an issue, you should never run
mkfs.xfs on a production system outside of a scheduled outage window.
The full set of storage arrays where this is a potential issue is not
known."

I think it would be more storage-array friendly to implement tweaks to
avoid triggering such issues on this, and possibly other, arrays.



regards
ronnie sahlberg
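
To illustrate the kind of tweak being argued for here (and the 100MB
chunking Stefan suggested in his original mail), a userspace tool could
issue the discard in fixed-size pieces instead of one request covering the
whole LUN, so a slow array is never handed minutes of work in one go. This
is only a sketch of the idea, not an existing mkfs.xfs or kernel feature;
the chunk size and the device name are arbitrary:

/* Sketch of a chunked discard: walk the device in 100 MiB steps and issue
 * one BLKDISCARD per step instead of one request for the whole LUN. */
#include <fcntl.h>
#include <linux/fs.h>      /* BLKDISCARD, BLKGETSIZE64 */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define CHUNK_BYTES (100ULL * 1024 * 1024)     /* 100 MiB per request */

int main(void)
{
    int fd = open("/dev/sdX", O_RDWR);         /* placeholder device */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    uint64_t size;
    if (ioctl(fd, BLKGETSIZE64, &size) < 0) {
        perror("BLKGETSIZE64");
        close(fd);
        return 1;
    }

    uint64_t off;
    for (off = 0; off < size; off += CHUNK_BYTES) {
        uint64_t len = (size - off < CHUNK_BYTES) ? size - off : CHUNK_BYTES;
        uint64_t range[2] = { off, len };
        if (ioctl(fd, BLKDISCARD, &range) < 0) {
            perror("BLKDISCARD");
            break;
        }
        /* Optionally pause here (e.g. nanosleep()) to give the array
         * breathing room for other I/O between chunks. */
    }

    close(fd);
    return 0;
}

Whether throttling like this belongs in the tools or in the array firmware
is exactly the disagreement in the rest of this thread.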


* Re: aborted SCSI commands while discarding/unmapping via mkfs.xfs
  2012-08-14 21:35 ` Dave Chinner
  2012-08-14 21:51   ` ronnie sahlberg
@ 2012-08-15  6:31   ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 8+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-08-15  6:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: dchinner, Christoph Hellwig, Ronnie Sahlberg, xfs

On 14.08.2012 23:35, Dave Chinner wrote:
> On Tue, Aug 14, 2012 at 10:42:21PM +0200, Stefan Priebe wrote:
>> Hello list,
>>
>> I'm testing KVM with qemu, libiscsi, virtio-scsi-pci and
>> scsi-generic on top of a Nexenta storage solution. While doing
>> mkfs.xfs on an already used LUN / block device, I discovered that the
>> unmapping / discard commands mkfs.xfs sends take a long time, which
>> results in a lot of aborted SCSI commands.
>
> Sounds like a problem with your storage being really slow at
> discards.
>
>> Would it make sense to let mkfs.xfs send these unmapping commands in
>> small portions (e.g. 100MB)
>
> No, because the underlying implementation (blkdev_issue_discard())
> already breaks the discard request up into the granularity that is
> supported by the underlying storage.....
>
>> or is there another problem in the
>> path to the block device? Any suggestions or ideas?
>
> .... which, of course, had bugs in it so is a much more likely cause
> of your problems.
>
> That said, the discard granularity is derived from information the
> storage supplies the kernel in its SCSI mode page, so if the
> discard granularity is too large, that's a storage problem, not a
> Linux problem at all, let alone a mkfs.xfs problem.

Thanks for this excellent explanation.

Stefan


* Re: aborted SCSI commands while discarding/unmapping via mkfs.xfs
  2012-08-14 21:51   ` ronnie sahlberg
@ 2012-08-15  7:32     ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2012-08-15  7:32 UTC (permalink / raw)
  To: ronnie sahlberg; +Cc: xfs, Christoph Hellwig, dchinner, Stefan Priebe

On Wed, Aug 15, 2012 at 07:51:26AM +1000, ronnie sahlberg wrote:
> On Wed, Aug 15, 2012 at 7:35 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Aug 14, 2012 at 10:42:21PM +0200, Stefan Priebe wrote:
> >> Hello list,
> >>
> >> I'm testing KVM with qemu, libiscsi, virtio-scsi-pci and
> >> scsi-generic on top of a Nexenta storage solution. While doing
> >> mkfs.xfs on an already used LUN / block device, I discovered that the
> >> unmapping / discard commands mkfs.xfs sends take a long time, which
> >> results in a lot of aborted SCSI commands.
> >
> > Sounds like a problem with your storage being really slow at
> > discards.
> >
> >> Would it make sense to let mkfs.xfs send these unmapping commands in
> >> small portions (e.g. 100MB)
> >
> > No, because the underlying implementation (blkdev_issue_discard())
> > already breaks the discard request up into the granularity that is
> > supported by the underlying storage.....
> >
> >> or is there another problem in the
> >> path to the block device? Any suggestions or ideas?
> >
> > .... which, of course, had bugs in it so is a much more likely cause
> > of your problems.
> >
> > That said, the discard granularity is derived from information the
> > storage supplies the kernel in its SCSI mode page, so if the
> > discard granularity is too large, that's a storage problem, not a
> > Linux problem at all, let alone a mkfs.xfs problem.
> 
> Hi Dave,
> That is true.
> 
> But the network traces for this particular issue show that on this
> particular storage array, when a huge train of discards is sent, to
> basically discard the entire LUN, the array may take many minutes to
> perform these discards, during which time it is unresponsive to any
> other I/O, on the same LUN or on other LUNs.

To be blunt, that's not my problem and I don't really care.

> And this basically means that for these kinds of arrays with this
> discard behaviour, running a command that performs a huge number of
> discards to discard the entire device will act as a full
> denial-of-service attack, since every LUN and every host attached to
> the array will experience a full outage for minutes.

So report the bug to the array vendor as a remote DoS attack. Or,
seeing as Nexenta is OSS, fix it yourself.

> This is definitely an issue with the array, BUT the Linux kernel
> and/or userspace utilities can implement, and very often do
> implement, tweaks to be more friendly towards hardware and to avoid
> triggering unfortunate hardware behaviour.

Read the mkfs.xfs man page - you might find the -K option....

> For example, the Linux kernel contains a "fix" for the Pentium FDIV
> bug, even though there was never any issue in Linux itself that
> needed fixing.

Apples and oranges.

> The only other realistic alternative is to provide warnings such as:
> "Some storage arrays may have major performance problems if you run
> mkfs.xfs, to the point of a full outage lasting many minutes for
> every single LUN on that array.  Unless you KNOW for a fact that your
> storage array does not have such an issue, you should never run
> mkfs.xfs on a production system outside of a scheduled outage window.
> The full set of storage arrays where this is a potential issue is not
> known."

Sorry, this is not a nanny state - a certain level of competency is
expected of storage administrators. I don't care about the children
or whether mkfs.xfs kills Bambi, either....

It's a storage array problem, and if you haven't tested your array
in a test environment before putting it in production, then you have
only yourself to blame because you haven't followed best practices.

In fact, the OP found this in a test environment trying something
shiny, new and still steaming, so these are exactly the sort of
problems we'd expect an early adopter of new technologies to find.
And, following best practices, I'd expect them to be reported,
too...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
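
For reference, the option Dave points at is documented in mkfs.xfs(8): -K
tells mkfs.xfs not to attempt to discard blocks at mkfs time. On an array
with the behaviour described in this thread that sidesteps the long unmap
storm entirely, at the cost of not telling a thin-provisioned target that
the old contents can be unmapped. The device name below is a placeholder:

  mkfs.xfs -K /dev/sdX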


* Re: aborted SCSI commands while discarding/unmapping via mkfs.xfs
  2012-08-14 20:42 aborted SCSI commands while discarding/unmapping via mkfs.xfs Stefan Priebe
  2012-08-14 21:35 ` Dave Chinner
@ 2012-08-15  9:13 ` Michael Monnerie
  2012-08-15  9:14   ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 8+ messages in thread
From: Michael Monnerie @ 2012-08-15  9:13 UTC (permalink / raw)
  To: xfs; +Cc: Stefan Priebe



On Tuesday, 14 August 2012, 22:42:21, Stefan Priebe wrote:
> top of a Nexenta storage solution

I'd be interested to know exactly which storage that is.

-- 
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531


* Re: aborted SCSI commands while discarding/unmapping via mkfs.xfs
  2012-08-15  9:13 ` Michael Monnerie
@ 2012-08-15  9:14   ` Stefan Priebe - Profihost AG
  2012-08-15  9:22     ` Stefan Ring
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-08-15  9:14 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: xfs

On 15.08.2012 11:13, Michael Monnerie wrote:
> On Tuesday, 14 August 2012, 22:42:21, Stefan Priebe wrote:
>> top of a Nexenta storage solution
>
> I'd be interested to know exactly which storage that is.

Right now it's a self-built test system running http://www.nexenta.com/corp/.

Greets
Stefan



* Re: aborted SCSI commands while discarding/unmapping via mkfs.xfs
  2012-08-15  9:14   ` Stefan Priebe - Profihost AG
@ 2012-08-15  9:22     ` Stefan Ring
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Ring @ 2012-08-15  9:22 UTC (permalink / raw)
  To: Linux fs XFS

> right now a self build test system running http://www.nexenta.com/corp/.

With SAS disks or SATA? This seems to make a big difference regarding
how much a long-running operation can lock out concurrent accesses.


