From: Avi Kivity <avi@scylladb.com>
To: Glauber Costa <glauber@scylladb.com>, xfs@oss.sgi.com
Subject: Re: sleeps and waits during io_submit
Date: Tue, 1 Dec 2015 21:07:14 +0200
Message-ID: <565DEFE2.2000308@scylladb.com>
In-Reply-To: <20151201180321.GA4762@redhat.com>

On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
> Hi Avi,
>
>>> else is going to execute in our place until this thread can make
>>> progress.
>> For us, nothing else can execute in our place; we usually have exactly one
>> thread per logical core.  So we are heavily dependent on io_submit not
>> sleeping.
>>
>> The case of a contended lock is, to me, less worrying.  It can be reduced by
>> using more allocation groups, which is apparently the shared resource under
>> contention.
>>
> I apologize if I misread your previous comments, but, IIRC you said you can't
> change the directory structure your application is using, and IIRC your
> application does not spread files across several directories.

I miswrote somewhat: the application writes data files and commitlog 
files.  The data file directory structure is fixed due to compatibility 
concerns (it is not a single directory, but some workloads will see most 
access on files in a single directory).  The commitlog directory 
structure is more relaxed, and we can split it into a directory per 
shard (=cpu) or something else.

If worst comes to worst, we'll hack around this and distribute the data 
files into more directories, and provide some hack for compatibility.

> XFS spreads files across the allocation groups, based on the directory these
> files are created in,

Idea: create the files in some subdirectory, and immediately move them 
to their required location.
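
Roughly like this (an untested sketch; the paths and the helper name
are invented, and the point is just create-then-rename: the inode's AG
is chosen at create time, and rename() only moves the name):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create in a per-shard "spread" subdirectory so XFS picks that
 * directory's AG, then rename() into the fixed location the
 * application actually expects. */
int create_spread(const char *spread_dir, const char *final_path)
{
    char tmp[4096];

    snprintf(tmp, sizeof(tmp), "%s/tmp.%d", spread_dir, getpid());

    int fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    if (rename(tmp, final_path) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    return fd;    /* caller writes through the already-open fd */
}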

>   trying to keep files as close as possible to their
> metadata.

This is pointless for an SSD. Perhaps XFS should randomize the AG on 
nonrotational media instead.


> Directories are spread across the AGs in a 'round-robin' way: each
> new directory will be created in the next allocation group, and xfs will try
> to allocate the files in the same AG as its parent directory. (Take a look at
> the 'rotorstep' sysctl option for xfs).
>
> So, unless you have the files distributed across enough directories, increasing
> the number of allocation groups may not change the lock contention you're
> facing in this case.
>
> I really don't remember if it has been mentioned already, but if not, it might
> be worth taking this point into consideration.

Thanks.  I think you should really consider randomizing the AG for SSDs, 
and meanwhile, we can just use the creation-directory hack to get the 
same effect, at the cost of an extra system call.  So at least for this 
problem, there is a solution.

> anyway, just my 0.02
>
>> The case of waiting for I/O is much more worrying, because I/O latencies are
>> much higher.  But it seems like most of the DIO path does not trigger
>> locking around I/O (and we are careful to avoid the ones that do, like
>> writing beyond eof).
>>
>> (sorry for repeating myself, I have the feeling we are talking past each
>> other and want to be on the same page)
>>
>>>>>   We submit an I/O which is
>>>>> asynchronous in nature and wait on a completion, which causes the cpu to
>>>>> schedule and execute another task until the completion is set by I/O
>>>>> completion (via an async callback). At that point, the issuing thread
>>>>> continues where it left off. I suspect I'm missing something... can you
>>>>> elaborate on what you'd do differently here (and how it helps)?
>>>> Just apply the same technique everywhere: convert locks to trylock +
>>>> schedule a continuation on failure.
>>>>
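
To illustrate, the pattern in userspace terms (a rough sketch, not a
patch; defer() stands in for whatever continuation mechanism is
available and is invented here):

#include <pthread.h>

struct op {
    pthread_mutex_t *lock;
    void (*work)(struct op *);    /* runs with the lock held */
};

/* Hypothetical: re-queue the operation to be retried later instead of
 * putting the submitting thread to sleep on the lock. */
extern void defer(void (*fn)(struct op *), struct op *op);

static void submit_op(struct op *op)
{
    if (pthread_mutex_trylock(op->lock) != 0) {
        /* Contended: schedule a continuation that retries the whole
         * operation, and return to the submitter immediately. */
        defer(submit_op, op);
        return;
    }
    op->work(op);
    pthread_mutex_unlock(op->lock);
}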
>>> I'm certainly not an expert on the kernel scheduling, locking and
>>> serialization mechanisms, but my understanding is that most things
>>> outside of spin locks are reschedule points. For example, the
>>> wait_for_completion() calls XFS uses to wait on I/O boil down to
>>> schedule_timeout() calls. Buffer locks are implemented as semaphores and
>>> down() can end up in the same place.
>> But, for the most part, XFS seems to be able to avoid sleeping.  The call to
>> __blockdev_direct_IO only launches the I/O, so any locking is only around
>> cpu operations and, unless there is contention, won't cause us to sleep in
>> io_submit().
>>
>> Trying to follow the code, it looks like xfs_get_blocks_direct (and
>> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
>> we're just lucky to have everything in cache.  If it isn't, we block right
>> there.  I really hope I'm misreading this and some other magic is happening
>> elsewhere instead of this.
>>
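
For reference, the userspace side of this, i.e. the call we need to
never sleep, is just the stock libaio pattern (a minimal sketch, error
handling elided):

#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <unistd.h>

int write_one(const char *path)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;

    /* O_DIRECT wants sector-aligned buffer, length, and offset */
    posix_memalign(&buf, 4096, 4096);

    int fd = open(path, O_WRONLY | O_DIRECT);
    io_setup(128, &ctx);

    io_prep_pwrite(&cb, fd, buf, 4096, 0);
    io_submit(ctx, 1, cbs);    /* must not block the submitting thread */

    /* completions are normally reaped from the event loop; done
     * inline here for brevity */
    io_getevents(ctx, 1, 1, &ev, NULL);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}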
>>> Brian
>>>
>>>>>> Seastar (the async userspace framework which we use to drive xfs) makes writing
>>>>>> code like this easy, using continuations; but of course from ordinary
>>>>>> threaded code it can be quite hard.
>>>>>>
>>>>>> btw, there was an attempt to make ext[34] async using this method, but I
>>>>>> think it was ripped out.  Yes, the mortal remains can still be seen with
>>>>>> 'git grep EIOCBQUEUED'.
>>>>>>
>>>>>>>>> It sounds to me that first and foremost you want to make sure you don't
>>>>>>>>> have however many parallel operations you typically have running
>>>>>>>>> contending on the same inodes or AGs. Hint: creating files under
>>>>>>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>>>>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>>>>>>> number).
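
(For anyone decoding that by hand, my reading of the encoding, as a
sketch: the shift comes from the filesystem geometry, sb_agblklog +
sb_inopblog in the superblock, so the 23 below is only an example.)

#include <stdint.h>

/* agno lives in the high bits: ino = (agno << agino_bits) | agino */
static uint32_t ino_to_agno(uint64_t ino, int agino_bits)
{
    return (uint32_t)(ino >> agino_bits);
}

/* e.g. ino_to_agno(st.st_ino, 23) after an fstat() on the file */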
>>>>>>>> Unfortunately our directory layout cannot be changed.  And doesn't this
>>>>>>>> require having agcount == O(number of active files)?  That is easily in the
>>>>>>>> thousands.
>>>>>>>>
>>>>>>> I think Glauber's O(nr_cpus) comment is probably the more likely
>>>>>>> ballpark, but really it's something you'll probably just need to test to
>>>>>>> see how far you need to go to avoid AG contention.
>>>>>>>
>>>>>>> I'm primarily throwing the subdir thing out there for testing purposes.
>>>>>>> It's just an easy way to create inodes in a bunch of separate AGs so you
>>>>>>> can determine whether/how much it really helps with modified AG counts.
>>>>>>> I don't know enough about your application design to really comment on
>>>>>>> that...
>>>>>> We have O(cpus) shards that operate independently.  Each shard writes 32MB
>>>>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>>>>>> without blocking); the files are then flushed and closed, and later removed.
>>>>>> In parallel there are sequential writes and reads of large files using 128kB
>>>>>> buffers, as well as random reads.  Files are immutable (append-only), and
>>>>>> if a file is being written, it is not concurrently read.  In general files
>>>>>> are not shared across shards.  All I/O is async and O_DIRECT.  open(),
>>>>>> truncate(), fdatasync(), and friends are called from a helper thread.
>>>>>>
>>>>>> As far as I can tell it should be a very friendly load for XFS and SSDs.
>>>>>>
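
For concreteness, the per-file setup each shard does looks roughly
like this (a sketch; the helper name is invented):

#include <fcntl.h>
#include <unistd.h>

#define COMMITLOG_SIZE (32 << 20)    /* 32MB, as above */

/* Pre-size the commitlog up front so later O_DIRECT appends land
 * inside i_size and avoid the blocking extend-past-EOF paths. */
int open_commitlog(const char *path)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, COMMITLOG_SIZE) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}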
>>>>>>>>>   Reducing the frequency of block allocation/frees might also
>>>>>>>>> help (e.g., preallocate and reuse files,
>>>>>>>> Isn't that discouraged for SSDs?
>>>>>>>>
>>>>>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>>>>>> and thus never discarded..? Are you running fstrim?
>>>>>> mount -o discard.  And yes, overwrites are supposedly more expensive than
>>>>>> trim old data + allocate new data, but if you compare that with the work
>>>>>> XFS has to do, perhaps the tradeoff is bad.
>>>>>>
>>>>> Ok, my understanding is that '-o discard' is discouraged in favor of
>>>>> periodic fstrim for performance reasons, but that may or may not still
>>>>> be the case.
>>>> I understand that most SSDs have queued trim these days, but maybe I'm
>>>> optimistic.
>>>>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
