From: Avi Kivity <avi@scylladb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: xfs_buf_lock vs aio
Date: Wed, 14 Feb 2018 14:07:42 +0200
Message-ID: <f8b3b7a4-eba4-d5ab-1748-a8382a2a8fd6@scylladb.com>
In-Reply-To: <20180213051850.GE6778@dastard>

On 02/13/2018 07:18 AM, Dave Chinner wrote:
> On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
>> On 02/10/2018 01:10 AM, Dave Chinner wrote:
>>> On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
>>>> On 02/09/2018 12:11 AM, Dave Chinner wrote:
>>>>> On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
>>>>>> On 02/08/2018 01:33 AM, Dave Chinner wrote:
>>>>>>> On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
>>>>>>>> As usual, I'm having my lovely io_submit()s sleeping. This time some
>>>>>>>> detailed traces. 4.14.15.
>>>>> [....]
>>>>>
>>>>>>>> Forcing the log, so sleeping with ILOCK taken.
>>>>>>> Because it's trying to reallocate an extent that is pinned in the
>>>>>>> log and is marked stale. i.e. we are reallocating a recently freed
>>>>>>> metadata extent that hasn't been committed to disk yet. IOWs, it's
>>>>>>> the metadata form of the "log force to clear a busy extent so we can
>>>>>>> re-use it" condition....
>>>>>>>
>>>>>>> There's nothing you can do to reliably avoid this - it's a sign that
>>>>>>> you're running low on free space in an AG because it's recycling
>>>>>>> recently freed space faster than the CIL is being committed to disk.
>>>>>>>
>>>>>>> You could speed up background journal syncs to try to reduce the
>>>>>>> async checkpoint latency that allows busy extents to build up
>>>>>>> (/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
>>>>>>> journal overhead and IO latency, etc.
>>>>>> Perhaps xfs should auto-tune this variable.
>>>>> That's not a fix. That's a nasty hack that attempts to hide the
>>>>> underlying problem of selecting AGs and/or free space that requires
>>>>> a log force to be used instead of finding other, un-encumbered
>>>>> freespace present in the filesystem.
>>>> Isn't the underlying problem that you have a foreground process
>>>> depending on the progress of a background process?
>>> At a very, very basic level.
>>>
>>>> i.e., no matter
>>>> how AG and free space selection improves, you can always find a
>>>> workload that consumes extents faster than they can be laundered?
>>> Sure, but that doesn't mean we have to fall back to a synchronous
>>> algorithm to handle collisions. It's that synchronous behaviour that
>>> is the root cause of the long lock stalls you are seeing.
>> Well, having that algorithm be asynchronous will be wonderful. But I
>> imagine it will be a monstrous effort.
> It's not clear yet whether we have to do any of this stuff to solve
> your problem.

I was going by "is the root cause" above. But if we don't have to touch 
it, great.
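
As an aside on the xfssyncd_centisecs knob mentioned above: it is an
ordinary procfs tunable, so it is easy to experiment with. A minimal
sketch, assuming the usual /proc path and root privileges; the value
100 below is an arbitrary example, not a recommendation:

  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch: shorten the XFS background journal sync interval.
   * The kernel default is 3000 centiseconds (30s); a lower value
   * means more frequent CIL checkpoints and fewer busy extents
   * outstanding, at the cost of extra journal IO. */
  int main(void)
  {
          const char *path = "/proc/sys/fs/xfs/xfssyncd_centisecs";
          FILE *f = fopen(path, "w");

          if (!f) {
                  perror(path);
                  return EXIT_FAILURE;
          }
          fprintf(f, "%d\n", 100);    /* 1 second, for illustration */
          fclose(f);
          return EXIT_SUCCESS;
  }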

>
>>>> I'm not saying that free extent selection can't or shouldn't be
>>>> improved, just that it can never completely fix the problem on its
>>>> own.
>>> Righto, if you say so.
>>>
>>> After all, what do I know about the subject at hand? I'm just the
>>> poor dumb guy
>>
>> Just because you're an XFS expert, and even wrote the code at hand,
>> doesn't mean I have nothing to contribute. If I'm wrong, it's enough
>> to tell me that and why.
> It takes time and effort to have to explain why someone's suggestion
> for fixing a bug will not work. It's tiring, unproductive work and I
> get no thanks for it at all.

Isn't that part of being a maintainer? When everything works, the users
stay off the mailing list.

> I'm just seen as the nasty guy who says
> "no" to everything because I eventually run out of patience trying
> to explain everything in simple enough terms for non-XFS people to
> understand that they don't really understand XFS or what I'm talking
> about.
>
> IOWs, sometimes the best way to contribute is to know when you're in
> way over your head and to step back and simply help the master
> crafters get on with weaving their magic.....

Are you suggesting that I should go away? Or something else?

>
>>>   who wrote the current busy extent list handling
>>> algorithm years ago.  Perhaps you'd like to read the commit message
>>> (below), because it explains these synchronous slow paths and why
>>> they exist. I'll quote the part relevant to the discussion here,
>>> though:
>>>
>>> 	    Ideally we should not reallocate busy extents. That is a
>>> 	    much more complex fix to the problem as it involves
>>> 	    direct intervention in the allocation btree searches in
>>> 	    many places. This is left to a future set of
>>> 	    modifications.
>> Thanks, that commit was interesting.
>>
>> So, this future set of modifications is to have the extent allocator
>> consult this rbtree and continue searching if locked?
> See, this is exactly what I mean.
>
> You're now trying to guess how we'd solve the busy extent blocking
> problem. i.e. you now appear to be assuming we have a plan to fix
> this problem and are going to do it immediately.  Nothing could be
> further from the truth - I said:
>
>>> this is now important, and so we now need to revisit the issues we
>>> laid out some 8 years ago and work from there.
> That does not mean "we're going to fix this now" - it means we need
> to look at the problem again and determine if it's the best solution
> to the problem being presented to us. There are other avenues we
> still need to explore.
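
Just to make the idea under discussion concrete for anyone following
along, here is a toy sketch (the names are invented, and the real XFS
allocator and busy-extent rbtree are far more involved) of "keep
searching on a busy collision", with the synchronous log force demoted
to a last resort:

  /* Toy sketch only: prefer unencumbered free space over forcing the
   * log. busy_tree_overlaps() and next_free_candidate() are
   * hypothetical stand-ins for the real btree/rbtree searches. */
  struct extent {
          unsigned long start;
          unsigned long len;
  };

  extern int busy_tree_overlaps(unsigned long start, unsigned long len);
  extern int next_free_candidate(unsigned long cursor, struct extent *out);

  static int alloc_avoiding_busy(unsigned long hint, struct extent *out)
  {
          unsigned long cursor = hint;

          while (next_free_candidate(cursor, out) == 0) {
                  if (!busy_tree_overlaps(out->start, out->len))
                          return 0;   /* free and not busy: use it */
                  /* Busy collision: keep searching instead of
                   * blocking on a synchronous log force. */
                  cursor = out->start + out->len;
          }
          return -1;  /* no unencumbered space; only now force the log */
  }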
>
> Indeed, does your application and/or users even care about
> [acm]times on your files being absolutely accurate and crash
> resilient? i.e. do you use fsync() or fdatasync() to guarantee the
> data is on stable storage?

We use fdatasync and don't care about mtime much. So lazytime would work 
for us.
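
For context, our IO pattern is essentially the one below (a minimal
self-contained sketch; the file name, sizes and alignment are arbitrary
examples, and error handling is pared down; build with
"gcc sketch.c -laio"):

  /* AIO write followed by fdatasync(): durability comes from the
   * explicit sync, not from [acm]time updates, which is why lazytime
   * would suit this workload. */
  #define _GNU_SOURCE             /* for O_DIRECT */
  #include <fcntl.h>
  #include <libaio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          io_context_t ctx = 0;
          struct iocb cb, *cbs[1] = { &cb };
          struct io_event ev;
          void *buf;
          int fd;

          /* O_DIRECT requires aligned buffers and IO sizes. */
          if (posix_memalign(&buf, 4096, 4096))
                  return EXIT_FAILURE;
          memset(buf, 'x', 4096);

          fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
          if (fd < 0 || io_setup(128, &ctx) < 0)
                  return EXIT_FAILURE;

          /* This is the io_submit() that can go to sleep in
           * xfs_buf_lock() when allocation has to force the log. */
          io_prep_pwrite(&cb, fd, buf, 4096, 0);
          if (io_submit(ctx, 1, cbs) != 1)
                  return EXIT_FAILURE;
          if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                  return EXIT_FAILURE;

          if (fdatasync(fd) < 0)
                  return EXIT_FAILURE;

          io_destroy(ctx);
          close(fd);
          free(buf);
          return EXIT_SUCCESS;
  }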

>
> [....]
>
>> I still think reducing the number of outstanding busy extents is
>> important.  Modern disks write multiple GB/s, and big-data
>> applications like to do large sequential writes and deletes,
> Hah! "modern disks"
>
> You need to recalibrate what "big data" and "high performance IO"
> means. This was what we were doing with XFS on linux back in 2006:
>
> https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
>
> i.e. 10 years ago we were already well into the *tens of GB/s* on
> XFS filesystems for big-data applications with large sequential
> reads and writes. These "modern disks" are so slow! :)

Today, that's one or a few disks, not 90, and you can rent such a setup
for a few dollars an hour, doing millions of IOPS.

> Cheers,
>
> Dave.


