From: Avi Kivity <avi@scylladb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: xfs_buf_lock vs aio
Date: Wed, 14 Feb 2018 14:07:42 +0200
Message-ID: <f8b3b7a4-eba4-d5ab-1748-a8382a2a8fd6@scylladb.com>
In-Reply-To: <20180213051850.GE6778@dastard>

On 02/13/2018 07:18 AM, Dave Chinner wrote:
> On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
>> On 02/10/2018 01:10 AM, Dave Chinner wrote:
>>> On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
>>>> On 02/09/2018 12:11 AM, Dave Chinner wrote:
>>>>> On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
>>>>>> On 02/08/2018 01:33 AM, Dave Chinner wrote:
>>>>>>> On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
>>>>>>>> As usual, I'm having my lovely io_submit()s sleeping. This time some
>>>>>>>> detailed traces. 4.14.15.
>>>>> [....]
>>>>>
>>>>>>>> Forcing the log, so sleeping with ILOCK taken.
>>>>>>> Because it's trying to reallocate an extent that is pinned in the
>>>>>>> log and is marked stale. i.e. we are reallocating a recently freed
>>>>>>> metadata extent that hasn't been committed to disk yet. IOWs, it's
>>>>>>> the metadata form of the "log force to clear a busy extent so we can
>>>>>>> re-use it" condition....
>>>>>>>
>>>>>>> There's nothing you can do to reliably avoid this - it's a sign that
>>>>>>> you're running low on free space in an AG because it's recycling
>>>>>>> recently freed space faster than the CIL is being committed to disk.
>>>>>>>
>>>>>>> You could speed up background journal syncs to try to reduce the
>>>>>>> async checkpoint latency that allows busy extents to build up
>>>>>>> (/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
>>>>>>> journal overhead and IO latency, etc.
>>>>>> Perhaps xfs should auto-tune this variable.
>>>>> That's not a fix. That's a nasty hack that attempts to hide the
>>>>> underlying problem of selecting AGs and/or free space that requires
>>>>> a log force to be used instead of finding other, un-encumbered
>>>>> freespace present in the filesystem.
>>>> Isn't the underlying problem that you have a foreground process
>>>> depending on the progress of a background process?
>>> At a very, very basic level.
>>>
>>>> i.e., no matter
>>>> how AG and free space selection improves, you can always find a
>>>> workload that consumes extents faster than they can be laundered?
>>> Sure, but that doesn't mean we have to fall back to a synchronous
>>> algorithm to handle collisions. It's that synchronous behaviour that
>>> is the root cause of the long lock stalls you are seeing.
>> Well, having that algorithm be asynchronous will be wonderful. But I
>> imagine it will be a monstrous effort.
> It's not clear yet whether we have to do any of this stuff to solve
> your problem.

I was going by "is the root cause" above. But if we don't have to touch 
it, great.
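
As an aside on the xfssyncd_centisecs knob mentioned above: it is an
ordinary procfs tunable, so it is easy to experiment with. A minimal
sketch, assuming the usual /proc path and root privileges; the value
100 below is an arbitrary example, not a recommendation:

  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch: shorten the XFS background journal sync interval.
   * The kernel default is 3000 centiseconds (30s); a lower value
   * means more frequent CIL checkpoints and fewer busy extents
   * outstanding, at the cost of extra journal IO. */
  int main(void)
  {
          const char *path = "/proc/sys/fs/xfs/xfssyncd_centisecs";
          FILE *f = fopen(path, "w");

          if (!f) {
                  perror(path);
                  return EXIT_FAILURE;
          }
          fprintf(f, "%d\n", 100);    /* 1 second, for illustration */
          fclose(f);
          return EXIT_SUCCESS;
  }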

>
>>>> I'm not saying that free extent selection can't or shouldn't be
>>>> improved, just that it can never completely fix the problem on its
>>>> own.
>>> Righto, if you say so.
>>>
>>> After all, what do I know about the subject at hand? I'm just the
>>> poor dumb guy
>>
>> Just because you're an XFS expert, and even wrote the code at hand,
>> doesn't mean I have nothing to contribute. If I'm wrong, it's enough
>> to tell me that and why.
> It takes time and effort to have to explain why someone's suggestion
> for fixing a bug will not work. It's tiring, unproductive work and I
> get no thanks for it at all.

Isn't that part of being a maintainer? When everything works, the users
stay off the mailing list.

> I'm just seen as the nasty guy who says
> "no" to everything because I eventually run out of patience trying
> to explain everything in simple enough terms for non-XFS people to
> understand that they don't really understand XFS or what I'm talking
> about.
>
> IOWs, sometimes the best way to contribute is to know when you're in
> way over your head and to step back and simply help the master
> crafters get on with weaving their magic.....

Are you suggesting that I should go away? Or something else?

>
>>>   who wrote the current busy extent list handling
>>> algorithm years ago.  Perhaps you'd like to read the commit message
>>> (below), because it explains these synchronous slow paths and why
>>> they exist. I'll quote the part relevant to the discussion here,
>>> though:
>>>
>>> 	    Ideally we should not reallocate busy extents. That is a
>>> 	    much more complex fix to the problem as it involves
>>> 	    direct intervention in the allocation btree searches in
>>> 	    many places. This is left to a future set of
>>> 	    modifications.
>> Thanks, that commit was interesting.
>>
>> So, this future set of modifications is to have the extent allocator
>> consult this rbtree and continue searching if locked?
> See, this is exactly what I mean.
>
> You're now trying to guess how we'd solve the busy extent blocking
> problem. i.e. you now appear to be assuming we have a plan to fix
> this problem and are going to do it immediately.  Nothing could be
> further from the truth - I said:
>
>>> this is now important, and so we now need to revisit the issues we
>>> laid out some 8 years ago and work from there.
> That does not mean "we're going to fix this now" - it means we need
> to look at the problem again and determine if it's the best solution
> to the problem being presented to us. There are other avenues we
> still need to explore.
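
Just to make the idea under discussion concrete for anyone following
along, here is a toy sketch (the names are invented, and the real XFS
allocator and busy-extent rbtree are far more involved) of "keep
searching on a busy collision", with the synchronous log force demoted
to a last resort:

  /* Toy sketch only: prefer unencumbered free space over forcing the
   * log. busy_tree_overlaps() and next_free_candidate() are
   * hypothetical stand-ins for the real btree/rbtree searches. */
  struct extent {
          unsigned long start;
          unsigned long len;
  };

  extern int busy_tree_overlaps(unsigned long start, unsigned long len);
  extern int next_free_candidate(unsigned long cursor, struct extent *out);

  static int alloc_avoiding_busy(unsigned long hint, struct extent *out)
  {
          unsigned long cursor = hint;

          while (next_free_candidate(cursor, out) == 0) {
                  if (!busy_tree_overlaps(out->start, out->len))
                          return 0;   /* free and not busy: use it */
                  /* Busy collision: keep searching instead of
                   * blocking on a synchronous log force. */
                  cursor = out->start + out->len;
          }
          return -1;  /* no unencumbered space; only now force the log */
  }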
>
> Indeed, does your application and/or users even care about
> [acm]times on your files being absolutely accurate and crash
> resilient? i.e. do you use fsync() or fdatasync() to guarantee the
> data is on stable storage?

We use fdatasync and don't care about mtime much. So lazytime would work 
for us.
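
For context, our IO pattern is essentially the one below (a minimal
self-contained sketch; the file name, sizes and alignment are arbitrary
examples, and error handling is pared down; build with
"gcc sketch.c -laio"):

  /* AIO write followed by fdatasync(): durability comes from the
   * explicit sync, not from [acm]time updates, which is why lazytime
   * would suit this workload. */
  #define _GNU_SOURCE             /* for O_DIRECT */
  #include <fcntl.h>
  #include <libaio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          io_context_t ctx = 0;
          struct iocb cb, *cbs[1] = { &cb };
          struct io_event ev;
          void *buf;
          int fd;

          /* O_DIRECT requires aligned buffers and IO sizes. */
          if (posix_memalign(&buf, 4096, 4096))
                  return EXIT_FAILURE;
          memset(buf, 'x', 4096);

          fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
          if (fd < 0 || io_setup(128, &ctx) < 0)
                  return EXIT_FAILURE;

          /* This is the io_submit() that can go to sleep in
           * xfs_buf_lock() when allocation has to force the log. */
          io_prep_pwrite(&cb, fd, buf, 4096, 0);
          if (io_submit(ctx, 1, cbs) != 1)
                  return EXIT_FAILURE;
          if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                  return EXIT_FAILURE;

          if (fdatasync(fd) < 0)
                  return EXIT_FAILURE;

          io_destroy(ctx);
          close(fd);
          free(buf);
          return EXIT_SUCCESS;
  }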

>
> [....]
>
>> I still think reducing the number of outstanding busy extents is
>> important.  Modern disks write multiple GB/s, and big-data
>> applications like to do large sequential writes and deletes,
> Hah! "modern disks"
>
> You need to recalibrate what "big data" and "high performance IO"
> means. This was what we were doing with XFS on linux back in 2006:
>
> https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
>
> i.e. 10 years ago we were already well into the *tens of GB/s* on
> XFS filesystems for big-data applications with large sequential
> reads and writes. These "modern disks" are so slow! :)

Today, that's one or a few disks, not 90, and you can rent such a setup
for a few dollars an hour, doing millions of IOPS.

> Cheers,
>
> Dave.


