From: NeilBrown <neilb@suse.com>
To: Avi Kivity <avi@scylladb.com>, linux-raid@vger.kernel.org
Subject: Re: raid0 vs. mkfs
Date: Mon, 28 Nov 2016 19:40:25 +1100
Message-ID: <87inr880au.fsf@notabene.neil.brown.name>
In-Reply-To: <286a5fc1-eda3-0421-a88e-b03c09403259@scylladb.com>

On Mon, Nov 28 2016, Avi Kivity wrote:

> On 11/28/2016 07:09 AM, NeilBrown wrote:
>> On Mon, Nov 28 2016, Avi Kivity wrote:
>>
>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>>> inappropriate) for the entire partition. As far as I can tell, md
>>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>>> requests, one per chunk-size worth of disk, and issues them to the RAID
>>> components individually.
>>>
>>>
>>> It seems to me that md can convert the large TRIM/DISCARD (erase
>>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>>> whichever is inappropriate) per RAID component, converting an O(disk
>>> size / chunk size) operation into an O(number of RAID components)
>>> operation, which is much faster.
>>>
>>>
>>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>>> the operation taking about a quarter of an hour, continuously pushing
>>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>>> to the disk. Linux 4.1.12.
>> Surely it is the task of the underlying driver, or the queuing
>> infrastructure, to merge small requests into large requests.
>
> Here's a blkparse of that run.  As can be seen, there is no concurrency, 
> so nobody down the stack has any chance of merging anything.

That isn't a valid conclusion to draw.
raid0 effectively calls the make_request_fn function that is registered
by the underlying driver.
If that function handles DISCARD synchronously, then you won't see any
concurrency, because the driver chose to handle each request directly
rather than queue it.
I don't know whether it actually does this, though.  I don't know the
insides of the nvme driver .... there seems to be a lightnvm thing and a
scsi thing and a pci thing and it all confuses me.

>
>
> No merging was happening.  This is an NVMe drive, so running with the 
> noop scheduler (which should still merge).   Does the queuing layer 
> merge trims?

I wish I knew.  I once thought I understood about half of the block
queuing code, but now with multi-queue, I'll need to learn it all
again. :-(

>
> I don't think it's the queuing layer's job, though.  At the I/O 
> scheduler you can merge to clean up sloppy patterns from the upper 
> layer, but each layer should try to generate the best pattern it can.  

Why?  How does it know what is best for the layer below?

> Large merges mean increased latency for the first request in the chain, 
> forcing the I/O scheduler to make a decision which can harm the 
> workload.  By generating merged requests in the first place, the upper 
> layer removes the need to make that tradeoff (splitting the requests 
> removes information: "we are interested only in when all of the range is 
> trimmed, not any particular request").

If it is easy for the upper layer to break a very large request into a
few very large requests, then I wouldn't necessarily object.
But unless it is very hard for the lower layer to merge requests, it
should be doing that too.
When drivers/lightnvm/rrpc.c is providing rrpc_make_rq as the
make_request_fn, it performs REQ_OP_DISCARD synchronously.  I would
suggest that is a very poor design.  I don't know if that is affecting
you (though a printk would find out).
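
To make the "few very large requests" idea concrete, here is a minimal
userspace sketch of how one large discard on a striped array could be
turned into one contiguous extent per member.  It is not md's code.  It
assumes a single-zone RAID0 with equal-sized members, and the names
member_sectors_before() and member_discard() are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many of the array sectors [0, s) land on
 * member `disk`, for a single-zone RAID0 with `ndisks` equal members
 * and a chunk of `chunk` sectors.  Because a member's sectors appear
 * in array order, this is also the member offset of the first sector
 * on `disk` at or after array sector s. */
static uint64_t member_sectors_before(uint64_t s, uint64_t chunk,
                                      unsigned ndisks, unsigned disk)
{
    assert(chunk > 0 && ndisks > 0 && disk < ndisks);

    uint64_t stripe = chunk * ndisks;        /* sectors per full stripe */
    uint64_t full = (s / stripe) * chunk;    /* from complete stripes   */
    uint64_t rem = s % stripe;               /* into the last stripe    */
    uint64_t lo = (uint64_t)disk * chunk;    /* where this disk's chunk
                                                starts in the stripe    */
    uint64_t part = 0;

    if (rem > lo)
        part = (rem - lo < chunk) ? rem - lo : chunk;
    return full + part;
}

struct extent {
    uint64_t start;   /* first member sector to discard */
    uint64_t len;     /* number of sectors, 0 if none   */
};

/* Hypothetical helper: the single contiguous member extent that
 * covers the array range [start, start + len) on member `disk`. */
static struct extent member_discard(uint64_t start, uint64_t len,
                                    uint64_t chunk, unsigned ndisks,
                                    unsigned disk)
{
    uint64_t a = member_sectors_before(start, chunk, ndisks, disk);
    uint64_t b = member_sectors_before(start + len, chunk, ndisks, disk);

    return (struct extent){ .start = a, .len = b - a };
}
```

With 4 members and 1024-sector (512 KiB) chunks, a discard of array
sectors 1536..5631 becomes four extents: member 0 gets 1024+1024,
member 1 gets 512+1024, and members 2 and 3 get 0+1024 each.  For
scale, taking four 3 TB members as 12 * 10^12 bytes, splitting at
512 KiB boundaries is roughly 22.9 million requests, against 4 here.
A real implementation would still have to respect each member's
maximum discard size, so "one per member" is the best case.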

NeilBrown
