From: Avi Kivity <avi@scylladb.com>
To: Brian Foster <bfoster@redhat.com>
Cc: Glauber Costa <glauber@scylladb.com>, xfs@oss.sgi.com
Subject: Re: sleeps and waits during io_submit
Date: Tue, 1 Dec 2015 18:08:51 +0200
Message-ID: <565DC613.4090608@scylladb.com>
In-Reply-To: <20151201160133.GE26129@bfoster.bfoster>



On 12/01/2015 06:01 PM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
>>
>> On 12/01/2015 04:56 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>>>> ...
>>>>>>> The agsize/agcount mkfs-time heuristics change depending on the type of
>>>>>>> storage. A single AG can be up to 1TB and if the fs is not considered
>>>>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
>>>>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is
>>>>>>> adjusted depending on the size of the overall volume (see
>>>>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
>>>>>> We'll experiment with this.  Surely it depends on more than the amount of
>>>>>> storage?  If you have a high op rate you'll be more likely to excite
>>>>>> contention, no?
>>>>>>
>>>>> Sure. The absolute optimal configuration for your workload probably
>>>>> depends on more than storage size, but mkfs doesn't have that
>>>>> information. In general, it tries to use the most reasonable
>>>>> configuration based on the storage and expected workload. If you want to
>>>>> tweak it beyond that, indeed, the best bet is to experiment with what
>>>>> works.
>>>> We will do that.
>>>>
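As a reference point while we experiment, my reading of that heuristic, 
sketched in C (the constants and the multidisk default are my assumptions; 
the real logic is calc_default_ag_geometry() in xfsprogs):

  /* Sketch of the default AG-count heuristic described above; not the
   * actual xfsprogs code (see mkfs/xfs_mkfs.c). */
  static unsigned long long default_agcount(unsigned long long fs_bytes,
                                            int multidisk)
  {
          const unsigned long long TB = 1ULL << 40;

          if (multidisk)                   /* stripe unit/width set */
                  return 32;               /* assumed multidisk default */
          if (fs_bytes <= 4 * TB)
                  return 4;                /* 4 AGs up to 4TB */
          return (fs_bytes + TB - 1) / TB; /* cap each AG at 1TB */
  }
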
>>>>>>>> Are those locks held around I/O, or just CPU operations, or a mix?
>>>>>>> I believe it's a mix of modifications and I/O, though it looks like some
>>>>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
>>>>>>> pushing case will trylock and defer to the next list iteration if the
>>>>>>> buffer is busy.
>>>>>>>
>>>>>> Ok.  For us, sleeping in io_submit() is death because we have no other thread
>>>>>> on that core to take its place.
>>>>>>
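For what it's worth, that trylock-and-defer pattern is exactly the kind of 
behavior we can live with.  Schematically (all names hypothetical, not the 
actual AIL code):

  /* Schematic trylock-and-defer: skip a contended buffer now and
   * revisit it on the next pass instead of sleeping on its lock. */
  struct buf {
          struct buf *next;
          int         busy;       /* stand-in for a real lock */
  };

  static int buf_trylock(struct buf *bp) { return !bp->busy; }

  static void push_list(struct buf *list)
  {
          for (struct buf *bp = list; bp; bp = bp->next) {
                  if (!buf_trylock(bp))
                          continue;  /* busy: next iteration's problem */
                  /* ... start async writeback; do not wait ... */
          }
  }
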
>>>>> The above is with regard to metadata I/O, whereas io_submit() is
>>>>> obviously for user I/O.
>>>> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
>>>> async tasks?  I don't mind them blocking each other as long as they let my
>>>> io_submit alone.
>>>>
>>> Yeah, it can trigger metadata reads, force the log (the stale buffer
>>> example) or push the AIL (wait on log space). Metadata changes made
>>> directly via your I/O request are logged/committed via transactions,
>>> which are generally processed asynchronously from that point on.
>>>
>>>>>   io_submit() can probably block in a variety of
>>>>> places afaict... it might have to read in the inode extent map, allocate
>>>>> blocks, take inode/ag locks, reserve log space for transactions, etc.
>>>> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
>>>> if somebody else has to do it.
>>>>
>>> I'm not following... if the fs needs to read in the inode extent map to
>>> prepare for an allocation, what else can the thread do but wait? Are you
>>> suggesting the request kick off whatever the blocking action happens to
>>> be asynchronously and return with an error such that the request can be
>>> retried later?
>> Not quite, it should be invisible to the caller.
>>
>> That is, the code called by io_submit() (file_operations::write_iter, as it
>> seems to be called today) can kick off this operation and have it continue
>> from where it left off.
>>
> Isn't that generally what happens today?

You tell me.  According to $subject, apparently not enough.  Maybe we're 
triggering the blocking cases more often, or maybe we suffer more when they 
do trigger (the latter seems more likely).

>   We submit an I/O which is
> asynchronous in nature and wait on a completion, which causes the cpu to
> schedule and execute another task until the completion is set by I/O
> completion (via an async callback). At that point, the issuing thread
> continues where it left off. I suspect I'm missing something... can you
> elaborate on what you'd do differently here (and how it helps)?

Just apply the same technique everywhere: convert each lock acquisition 
into a trylock, and on failure schedule a continuation instead of sleeping.
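Schematically, with pthreads standing in for the kernel primitives and the 
deferral helper hypothetical:

  #include <pthread.h>

  struct inode_ctx { pthread_mutex_t lock; /* ... */ };

  /* Hypothetical helpers: the real work, and whatever deferred-execution
   * mechanism the kernel would use. */
  void do_allocation(struct inode_ctx *ip);
  void schedule_continuation(void (*fn)(struct inode_ctx *),
                             struct inode_ctx *ip);

  int submit_path(struct inode_ctx *ip)
  {
          if (pthread_mutex_trylock(&ip->lock) != 0) {
                  /* Contended: don't sleep.  Queue a continuation to run
                   * when the lock is released and return to the caller;
                   * completion is reported asynchronously. */
                  schedule_continuation(do_allocation, ip);
                  return 0;
          }
          do_allocation(ip);
          pthread_mutex_unlock(&ip->lock);
          return 0;
  }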

>
>> Seastar (the async user framework which we use to drive xfs) makes writing
>> code like this easy, using continuations; but of course from ordinary
>> threaded code it can be quite hard.
>>
>> btw, there was an attempt to make ext[34] async using this method, but I
>> think it was ripped out.  Yes, the mortal remains can still be seen with
>> 'git grep EIOCBQUEUED'.
>>
>>>>> It sounds to me that first and foremost you want to make sure you don't
>>>>> have however many parallel operations you typically have running
>>>>> contending on the same inodes or AGs. Hint: creating files under
>>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>>> number).
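Aside, for anyone testing this: the AG an inode landed in can be read 
straight off the inode number.  A sketch using the superblock field names; 
the real macro is XFS_INO_TO_AGNO():

  /* The low (sb_agblklog + sb_inopblog) bits of an XFS inode number
   * locate the inode within its AG; the bits above are the AG number. */
  static unsigned int ino_to_agno(unsigned long long ino,
                                  unsigned int sb_agblklog, /* log2 blocks/AG */
                                  unsigned int sb_inopblog) /* log2 inodes/block */
  {
          return (unsigned int)(ino >> (sb_agblklog + sb_inopblog));
  }
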
>>>> Unfortunately our directory layout cannot be changed.  And doesn't this
>>>> require having agcount == O(number of active files)?  That is easily in the
>>>> thousands.
>>>>
>>> I think Glauber's O(nr_cpus) comment is probably the more likely
>>> ballpark, but really it's something you'll probably just need to test to
>>> see how far you need to go to avoid AG contention.
>>>
>>> I'm primarily throwing the subdir thing out there for testing purposes.
>>> It's just an easy way to create inodes in a bunch of separate AGs so you
>>> can determine whether/how much it really helps with modified AG counts.
>>> I don't know enough about your application design to really comment on
>>> that...
>> We have O(cpus) shards that operate independently.  Each shard writes 32MB
>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>> without blocking); the files are then flushed and closed, and later removed.
>> In parallel there are sequential writes and reads of large files (using
>> 128kB buffers), as well as random reads.  Files are immutable (append-only), and
>> if a file is being written, it is not concurrently read.  In general files
>> are not shared across shards.  All I/O is async and O_DIRECT.  open(),
>> truncate(), fdatasync(), and friends are called from a helper thread.
>>
>> As far as I can tell it should be a very friendly load for XFS and SSDs.
>>
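For concreteness, each commitlog append boils down to something like this 
(a trimmed sketch, error handling mostly elided, with 4096 standing in for 
the device's alignment requirement):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <libaio.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Trimmed sketch of the commitlog write path described above.  The
   * open()/ftruncate() actually run on our helper thread; io_submit()
   * is the call that must not sleep. */
  int commitlog_append(io_context_t ctx, const char *path, long long off)
  {
          int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
          if (fd < 0 || ftruncate(fd, 32 << 20))   /* pre-truncate to 32MB */
                  return -1;

          void *buf;
          if (posix_memalign(&buf, 4096, 128 << 10)) /* aligned 128kB buffer */
                  return -1;

          struct iocb cb;
          struct iocb *cbs[1] = { &cb };
          io_prep_pwrite(&cb, fd, buf, 128 << 10, off);
          return io_submit(ctx, 1, cbs);           /* the hot path */
  }
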
>>>>>   Reducing the frequency of block allocation/frees might also be
>>>>> another help (e.g., preallocate and reuse files,
>>>> Isn't that discouraged for SSDs?
>>>>
>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>> and thus never discarded..? Are you running fstrim?
>> mount -o discard.  And yes, overwrites are supposedly more expensive than
>> trimming old data and allocating new data, but compared with the work XFS
>> has to do, perhaps the tradeoff is bad.
>>
>>
> Ok, my understanding is that '-o discard' is discouraged in favor of
> periodic fstrim for performance reasons, but that may or may not still
> be the case.

I understand that most SSDs have queued trim these days, but maybe I'm 
being optimistic.
