* xfs_buf_lock vs aio
@ 2018-02-07 17:20 Avi Kivity
  2018-02-07 23:33 ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2018-02-07 17:20 UTC (permalink / raw)
  To: linux-xfs

As usual, I'm having my lovely io_submit()s sleeping. This time some 
detailed traces. 4.14.15.


kthreadd  8146 [013]  3580.418079:                xfs:xfs_ilock: dev 9:0 ino 0x50000a9f flags ILOCK_EXCL caller kretprobe_trampoline
             7fffa0858572 xfs_ilock ([kernel.kallsyms])
             7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
             7fffa0849416 xfs_dio_write_end_io ([kernel.kallsyms])
             7fff812bc935 iomap_dio_complete ([kernel.kallsyms])
             7fff812bca17 iomap_dio_complete_work ([kernel.kallsyms])
             7fff8109dd19 process_one_work ([kernel.kallsyms])
             7fff8109e79d worker_thread ([kernel.kallsyms])
             7fff810a43b9 kthread ([kernel.kallsyms])
             7fff81802335 ret_from_fork ([kernel.kallsyms])


Here we have an I/O completion on some inode, taking ILOCK. So far so good.


kthreadd  8146 [013]  3580.418081:            xfs:xfs_ilock_ret: (ffffffffa08564e0 <- ffffffffa0853357)
             7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
             7fffa0849416 xfs_dio_write_end_io ([kernel.kallsyms])
             7fff812bc935 iomap_dio_complete ([kernel.kallsyms])
             7fff812bca17 iomap_dio_complete_work ([kernel.kallsyms])
             7fff8109dd19 process_one_work ([kernel.kallsyms])
             7fff8109e79d worker_thread ([kernel.kallsyms])
             7fff810a43b9 kthread ([kernel.kallsyms])
             7fff81802335 ret_from_fork ([kernel.kallsyms])


ILOCK taken without sleeping.


kthreadd  8146 [013]  3580.418103:           sched:sched_switch: kworker/13:3:8146 [120] D ==> kworker/13:0:8144 [120]
             7fff817dd0a5 __schedule ([kernel.kallsyms])
             7fff817dd5c6 schedule ([kernel.kallsyms])
             7fff817e1656 schedule_timeout ([kernel.kallsyms])
             7fff817de181 wait_for_completion ([kernel.kallsyms])
             7fff8109d51d flush_work ([kernel.kallsyms])
             7fffa086e29d xlog_cil_force_lsn ([kernel.kallsyms])
             7fffa086bfd6 _xfs_log_force ([kernel.kallsyms])
             7fffa086c1fc xfs_log_force ([kernel.kallsyms])
             7fffa08444d6 xfs_buf_lock ([kernel.kallsyms])
             7fffa084479b _xfs_buf_find ([kernel.kallsyms])
             7fffa0844a1a xfs_buf_get_map ([kernel.kallsyms])
             7fffa087b766 xfs_trans_get_buf_map ([kernel.kallsyms])
             7fffa0811c14 xfs_btree_get_bufl ([kernel.kallsyms])
             7fffa08024d2 xfs_bmap_extents_to_btree ([kernel.kallsyms])
             7fffa0805d68 xfs_bmap_add_extent_unwritten_real ([kernel.kallsyms])
             7fffa0806a88 xfs_bmapi_convert_unwritten ([kernel.kallsyms])
             7fffa080c3e0 xfs_bmapi_write ([kernel.kallsyms])
             7fffa08553c9 xfs_iomap_write_unwritten ([kernel.kallsyms])
             7fffa0849416 xfs_dio_write_end_io ([kernel.kallsyms])
             7fff812bc935 iomap_dio_complete ([kernel.kallsyms])
             7fff812bca17 iomap_dio_complete_work ([kernel.kallsyms])
             7fff8109dd19 process_one_work ([kernel.kallsyms])
             7fff8109e79d worker_thread ([kernel.kallsyms])
             7fff810a43b9 kthread ([kernel.kallsyms])
             7fff81802335 ret_from_fork ([kernel.kallsyms])



Forcing the log, so sleeping with ILOCK taken.


reactor-29  3336 [029]  3580.420137: xfs:xfs_ilock: dev 9:0 ino 0x50000a9f flags ILOCK_EXCL caller kretprobe_trampoline
             7fffa0858572 xfs_ilock ([kernel.kallsyms])
             7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
             7fff8127314c file_update_time ([kernel.kallsyms])
             7fffa0849ab9 xfs_file_aio_write_checks ([kernel.kallsyms])
             7fffa0849bce xfs_file_dio_aio_write ([kernel.kallsyms])
             7fffa084a13f xfs_file_write_iter ([kernel.kallsyms])
             7fff812ab24f aio_write ([kernel.kallsyms])
             7fff812ab90e do_io_submit ([kernel.kallsyms])
             7fff812ac4c0 sys_io_submit ([kernel.kallsyms])
             7fff81005917 do_syscall_64 ([kernel.kallsyms])
             7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms])
                      697 io_submit (/usr/lib64/libaio.so.1.0.1)
                       48 [unknown] ([unknown])
                   140e50 reactor_backend_epoll::~reactor_backend_epoll (/usr/bin/scylla)
         e90974ffff83087f [unknown] ([unknown])


io_submit(), same inode.


reactor-29  3336 [029]  3580.420141:           sched:sched_switch: reactor-29:3336 [120] D ==> swapper/29:0 [120]
             7fff817dd0a5 __schedule ([kernel.kallsyms])
             7fff817dd5c6 schedule ([kernel.kallsyms])
             7fff817e09bd rwsem_down_write_failed ([kernel.kallsyms])
             7fff817d29c7 call_rwsem_down_write_failed ([kernel.kallsyms])
             7fff817e00bd down_write ([kernel.kallsyms])
             7fffa08585b0 xfs_ilock ([kernel.kallsyms])
             7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
             7fff8127314c file_update_time ([kernel.kallsyms])
             7fffa0849ab9 xfs_file_aio_write_checks ([kernel.kallsyms])
             7fffa0849bce xfs_file_dio_aio_write ([kernel.kallsyms])
             7fffa084a13f xfs_file_write_iter ([kernel.kallsyms])
             7fff812ab24f aio_write ([kernel.kallsyms])
             7fff812ab90e do_io_submit ([kernel.kallsyms])
             7fff812ac4c0 sys_io_submit ([kernel.kallsyms])
             7fff81005917 do_syscall_64 ([kernel.kallsyms])
             7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms])
                      697 io_submit (/usr/lib64/libaio.so.1.0.1)
                       48 [unknown] ([unknown])
                   140e50 reactor_backend_epoll::~reactor_backend_epoll (/usr/bin/scylla)
         e90974ffff83087f [unknown] ([unknown])


io_submit() stalls. Later, task 8146 relinquishes the lock, which (after 
passing through another worker thread) is finally acquired by 
io_submit(), which then continues.


I see two ways to fix this. One is to make sure that we don't need to 
force the log during completion. Is there something I can do to make 
this happen? More log space, less log space, a magic mount option?


The other is to check that ILOCK can be taken exclusively in 
xfs_file_dio_aio_write, if IOCB_NOWAIT. If this is the way to go, I 
might venture a patch.
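
Roughly something like this -- an untested sketch, not a real patch, and 
the exact placement is a guess on my part (xfs_ilock_nowait() and 
xfs_iunlock() are the existing helpers):

    /* in xfs_file_aio_write_checks(), just before file_update_time() */
    if (iocb->ki_flags & IOCB_NOWAIT) {
        /* probe only: fail with EAGAIN instead of sleeping on the ILOCK */
        if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
            return -EAGAIN;
        xfs_iunlock(ip, XFS_ILOCK_EXCL);    /* drop it again right away */
    }
    error = file_update_time(file);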





* Re: xfs_buf_lock vs aio
  2018-02-07 17:20 xfs_buf_lock vs aio Avi Kivity
@ 2018-02-07 23:33 ` Dave Chinner
  2018-02-08  8:24   ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-02-07 23:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
> As usual, I'm having my lovely io_submit()s sleeping. This time some
> detailed traces. 4.14.15.
> 
> kthreadd  8146 [013]  3580.418103:           sched:sched_switch:
> kworker/13:3:8146 [120] D ==> kworker/13:0:8144 [120]
>             7fff817dd0a5 __schedule ([kernel.kallsyms])
>             7fff817dd5c6 schedule ([kernel.kallsyms])
>             7fff817e1656 schedule_timeout ([kernel.kallsyms])
>             7fff817de181 wait_for_completion ([kernel.kallsyms])
>             7fff8109d51d flush_work ([kernel.kallsyms])
>             7fffa086e29d xlog_cil_force_lsn ([kernel.kallsyms])
>             7fffa086bfd6 _xfs_log_force ([kernel.kallsyms])
>             7fffa086c1fc xfs_log_force ([kernel.kallsyms])
>             7fffa08444d6 xfs_buf_lock ([kernel.kallsyms])
>             7fffa084479b _xfs_buf_find ([kernel.kallsyms])
>             7fffa0844a1a xfs_buf_get_map ([kernel.kallsyms])
>             7fffa087b766 xfs_trans_get_buf_map ([kernel.kallsyms])
>             7fffa0811c14 xfs_btree_get_bufl ([kernel.kallsyms])
>             7fffa08024d2 xfs_bmap_extents_to_btree ([kernel.kallsyms])
>             7fffa0805d68 xfs_bmap_add_extent_unwritten_real
> ([kernel.kallsyms])
>             7fffa0806a88 xfs_bmapi_convert_unwritten ([kernel.kallsyms])
>             7fffa080c3e0 xfs_bmapi_write ([kernel.kallsyms])
>             7fffa08553c9 xfs_iomap_write_unwritten ([kernel.kallsyms])
>             7fffa0849416 xfs_dio_write_end_io ([kernel.kallsyms])
>             7fff812bc935 iomap_dio_complete ([kernel.kallsyms])
>             7fff812bca17 iomap_dio_complete_work ([kernel.kallsyms])
>             7fff8109dd19 process_one_work ([kernel.kallsyms])
>             7fff8109e79d worker_thread ([kernel.kallsyms])
>             7fff810a43b9 kthread ([kernel.kallsyms])
>             7fff81802335 ret_from_fork ([kernel.kallsyms])
> 
> 
> 
> Forcing the log, so sleeping with ILOCK taken.

Because it's trying to reallocate an extent that is pinned in the
log and is marked stale. i.e. we are reallocating a recently freed
metadata extent that hasn't been committed to disk yet. IOWs, it's
the metadata form of the "log force to clear a busy extent so we can
re-use it" condition....

There's nothing you can do to reliably avoid this - it's a sign that
you're running low on free space in an AG because it's recycling
recently freed space faster than the CIL is being committed to disk.

You could speed up background journal syncs to try to reduce the
async checkpoint latency that allows busy extents to build up
(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
journal overhead and IO latency, etc.
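
For example, dropping the interval from the default 3000 centiseconds 
(30s) down to 1s -- the value here is arbitrary, purely for 
illustration, and the trade-off above still applies:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* same effect as: echo 100 > /proc/sys/fs/xfs/xfssyncd_centisecs */
    int main(void)
    {
        int fd = open("/proc/sys/fs/xfs/xfssyncd_centisecs", O_WRONLY);

        if (fd < 0 || write(fd, "100", 3) != 3) {
            perror("xfssyncd_centisecs");
            return 1;
        }
        close(fd);
        return 0;
    }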

> reactor-29  3336 [029]  3580.420137: xfs:xfs_ilock: dev 9:0 ino
> 0x50000a9f flags ILOCK_EXCL caller kretprobe_trampoline
>             7fffa0858572 xfs_ilock ([kernel.kallsyms])
>             7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
>             7fff8127314c file_update_time ([kernel.kallsyms])
>             7fffa0849ab9 xfs_file_aio_write_checks ([kernel.kallsyms])
>             7fffa0849bce xfs_file_dio_aio_write ([kernel.kallsyms])
>             7fffa084a13f xfs_file_write_iter ([kernel.kallsyms])
>             7fff812ab24f aio_write ([kernel.kallsyms])
>             7fff812ab90e do_io_submit ([kernel.kallsyms])
>             7fff812ac4c0 sys_io_submit ([kernel.kallsyms])
>             7fff81005917 do_syscall_64 ([kernel.kallsyms])
>             7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms])

> io_submit() stalls. Later, task 8146 relinquishes the lock, which
> (after passing through another worker thread), is finally acquired
> by io_submit() which then continues.

Yup, so it's the timestamp updates at IO submission that are
blocking waiting for the *ILOCK* lock.

[.....]

> The other is to check that ILOCK can be taken exclusively in
> xfs_file_dio_aio_write, if IOCB_NOWAIT. If this is the way to go, I
> might venture a patch.

You're getting your locks mixed up - xfs_file_dio_aio_write()
doesn't ever take the ILOCK directly.  It takes the _IOLOCK_
directly, and that's what IOCB_NOWAIT acts on.

These operations are serialising on metadata updates which require
the _ILOCK_ to be taken exclusively, and there's no way of knowing
if that is going to be needed at the level of
xfs_file_dio_aio_write().

There are hooks in xfs_file_aio_write_checks() to avoid timestamp
updates on write (FMODE_NOCMTIME) but there is no way to set
this from userspace. It's used exclusively by the XFS open-by-handle
code so that utilities like online defrag can write data into files
without changing their user-visible timestamps.
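
For reference, the handle interface looks roughly like this from 
userspace, via xfsprogs' libhandle (a sketch from memory -- needs root, 
error handling trimmed):

    #include <xfs/handle.h>    /* path_to_handle(), open_by_handle() */
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        void   *hanp;
        size_t  hlen;

        if (path_to_handle("/mnt/scratch/somefile", &hanp, &hlen) < 0) {
            perror("path_to_handle");
            return 1;
        }
        /* fds opened this way get FMODE_NOCMTIME, so writes through
         * them skip the timestamp updates discussed above */
        int fd = open_by_handle(hanp, hlen, O_WRONLY);
        free_handle(hanp, hlen);
        return fd < 0 ? 1 : 0;
    }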

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_buf_lock vs aio
  2018-02-07 23:33 ` Dave Chinner
@ 2018-02-08  8:24   ` Avi Kivity
  2018-02-08 22:11     ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2018-02-08  8:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



On 02/08/2018 01:33 AM, Dave Chinner wrote:
> On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
>> As usual, I'm having my lovely io_submit()s sleeping. This time some
>> detailed traces. 4.14.15.
>>
>> kthreadd  8146 [013]  3580.418103:           sched:sched_switch:
>> kworker/13:3:8146 [120] D ==> kworker/13:0:8144 [120]
>>              7fff817dd0a5 __schedule ([kernel.kallsyms])
>>              7fff817dd5c6 schedule ([kernel.kallsyms])
>>              7fff817e1656 schedule_timeout ([kernel.kallsyms])
>>              7fff817de181 wait_for_completion ([kernel.kallsyms])
>>              7fff8109d51d flush_work ([kernel.kallsyms])
>>              7fffa086e29d xlog_cil_force_lsn ([kernel.kallsyms])
>>              7fffa086bfd6 _xfs_log_force ([kernel.kallsyms])
>>              7fffa086c1fc xfs_log_force ([kernel.kallsyms])
>>              7fffa08444d6 xfs_buf_lock ([kernel.kallsyms])
>>              7fffa084479b _xfs_buf_find ([kernel.kallsyms])
>>              7fffa0844a1a xfs_buf_get_map ([kernel.kallsyms])
>>              7fffa087b766 xfs_trans_get_buf_map ([kernel.kallsyms])
>>              7fffa0811c14 xfs_btree_get_bufl ([kernel.kallsyms])
>>              7fffa08024d2 xfs_bmap_extents_to_btree ([kernel.kallsyms])
>>              7fffa0805d68 xfs_bmap_add_extent_unwritten_real
>> ([kernel.kallsyms])
>>              7fffa0806a88 xfs_bmapi_convert_unwritten ([kernel.kallsyms])
>>              7fffa080c3e0 xfs_bmapi_write ([kernel.kallsyms])
>>              7fffa08553c9 xfs_iomap_write_unwritten ([kernel.kallsyms])
>>              7fffa0849416 xfs_dio_write_end_io ([kernel.kallsyms])
>>              7fff812bc935 iomap_dio_complete ([kernel.kallsyms])
>>              7fff812bca17 iomap_dio_complete_work ([kernel.kallsyms])
>>              7fff8109dd19 process_one_work ([kernel.kallsyms])
>>              7fff8109e79d worker_thread ([kernel.kallsyms])
>>              7fff810a43b9 kthread ([kernel.kallsyms])
>>              7fff81802335 ret_from_fork ([kernel.kallsyms])
>>
>>
>>
>> Forcing the log, so sleeping with ILOCK taken.
> Because it's trying to reallocate an extent that is pinned in the
> log and is marked stale. i.e. we are reallocating a recently freed
> metadata extent that hasn't been committed to disk yet. IOWs, it's
> the metadata form of the "log force to clear a busy extent so we can
> re-use it" condition....
>
> There's nothing you can do to reliably avoid this - it's a sign that
> you're running low on free space in an AG because it's recycling
> recently freed space faster than the CIL is being committed to disk.
>
> You could speed up background journal syncs to try to reduce the
> async checkpoint latency that allows busy extents to build up
> (/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
> journal overhead and IO latency, etc.

Perhaps xfs should auto-tune this variable. If it detects that it is 
forcing the log synchronously, it can increase the background flush 
frequency significantly (like by a factor of 2), and if some time passes 
without contention, it can reduce the frequency a little. Over time it 
will find a flush frequency that matches what the workload is doing.

(Our database does this to a great extent - a log structured merge tree 
generates huge amounts of background work, not just for metadata, and we 
make sure that background work is clearing backlog at the same rate it 
is generated to avoid spiky service).
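
The kind of feedback loop I have in mind, as a toy sketch (the constants 
are made up; real damping would need more care):

    static unsigned int interval_cs = 3000; /* current xfssyncd interval */

    /* a foreground path had to force the log synchronously */
    static void on_sync_log_force(void)
    {
        if (interval_cs > 100)
            interval_cs /= 2;               /* flush much more aggressively */
    }

    /* a whole flush interval passed without contention */
    static void on_quiet_period(void)
    {
        interval_cs += interval_cs / 10;    /* back off slowly */
        if (interval_cs > 3000)
            interval_cs = 3000;             /* cap at the default */
    }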

>
>> reactor-29  3336 [029]  3580.420137: xfs:xfs_ilock: dev 9:0 ino
>> 0x50000a9f flags ILOCK_EXCL caller kretprobe_trampoline
>>              7fffa0858572 xfs_ilock ([kernel.kallsyms])
>>              7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
>>              7fff8127314c file_update_time ([kernel.kallsyms])
>>              7fffa0849ab9 xfs_file_aio_write_checks ([kernel.kallsyms])
>>              7fffa0849bce xfs_file_dio_aio_write ([kernel.kallsyms])
>>              7fffa084a13f xfs_file_write_iter ([kernel.kallsyms])
>>              7fff812ab24f aio_write ([kernel.kallsyms])
>>              7fff812ab90e do_io_submit ([kernel.kallsyms])
>>              7fff812ac4c0 sys_io_submit ([kernel.kallsyms])
>>              7fff81005917 do_syscall_64 ([kernel.kallsyms])
>>              7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms])
>> io_submit() stalls. Later, task 8146 relinquishes the lock, which
>> (after passing through another worker thread), is finally acquired
> >>by io_submit() which then continues.
> Yup, so it's the timestamp updates at IO submission that are
> blocking waiting for the *ILOCK* lock.
>
> [.....]
>
>> The other is to check that ILOCK can be taken exclusively in
>> xfs_file_dio_aio_write, if IOCB_NOWAIT. If this is the way to go, I
>> might venture a patch.
> You're getting your locks mixed up - xfs_file_dio_aio_write()
> doesn't ever take the ILOCK directly.  It takes the _IOLOCK_
> directly, and that's what IOCB_NOWAIT acts on.

It also takes ILOCK by calling file_update_time(), as seen in the trace 
above. I've seen that it does a _nowait acquisition of the IOLOCK, but 
that's not enough.

If we could pass the iocb through file_update_time() to 
xfs_vn_update_time(), then it could perform a _nowait acquisition of 
ILOCK. Otherwise, we can just check if ILOCK is acquirable in 
xfs_file_aio_write_checks(). That's racy, but better than nothing.

>
> These operations are serialising on metadata updates which require
> the _ILOCK_ to be taken exclusively, and there's no way of knowing
> if that is going to be needed at the level of
> xfs_file_dio_aio_write().
>
> There are hooks in xfs_file_aio_write_checks() to avoid timestamp
> updates on write (FMODE_NOCMTIME) but there is no way to set
> this from userspace. It's used exclusively by the XFS open-by-handle
> code so that utilities like online defrag can write data into files
> without changing their user-visible timestamps.
>

xfs_file_aio_write_checks() knows it's going to call file_update_time(), 
and it can guess that file_update_time() will call xfs_vn_update_time(). 
So just before that call, check that ILOCK is free and return EAGAIN if 
not. Maybe that patch won't win any beauty contests, but it will reduce 
some of my pain.

> Cheers,
>
> Dave.



* Re: xfs_buf_lock vs aio
  2018-02-08  8:24   ` Avi Kivity
@ 2018-02-08 22:11     ` Dave Chinner
  2018-02-09 12:11       ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-02-08 22:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
> On 02/08/2018 01:33 AM, Dave Chinner wrote:
> >On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
> >>As usual, I'm having my lovely io_submit()s sleeping. This time some
> >>detailed traces. 4.14.15.

[....]

> >>Forcing the log, so sleeping with ILOCK taken.
> >Because it's trying to reallocate an extent that is pinned in the
> >log and is marked stale. i.e. we are reallocating a recently freed
> >metadata extent that hasn't been committed to disk yet. IOWs, it's
> >the metadata form of the "log force to clear a busy extent so we can
> >re-use it" condition....
> >
> >There's nothing you can do to reliably avoid this - it's a sign that
> >you're running low on free space in an AG because it's recycling
> >recently freed space faster than the CIL is being committed to disk.
> >
> >You could speed up background journal syncs to try to reduce the
> >async checkpoint latency that allows busy extents to build up
> >(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
> >journal overhead and IO latency, etc.
> 
> Perhaps xfs should auto-tune this variable.

That's not a fix. That's a nasty hack that attempts to hide the
underlying problem of selecting AGs and/or free space that requires
a log force to be used instead of finding other, un-encumbered
freespace present in the filesystem.

i.e. We know what the underlying problem is here, and there isn't a
quick fix for it. You're just going to have to live with that until
we work out a reliable, robust way of avoiding having busy extents
block allocations.

> >>reactor-29  3336 [029]  3580.420137: xfs:xfs_ilock: dev 9:0 ino
> >>0x50000a9f flags ILOCK_EXCL caller kretprobe_trampoline
> >>             7fffa0858572 xfs_ilock ([kernel.kallsyms])
> >>             7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
> >>             7fff8127314c file_update_time ([kernel.kallsyms])
> >>             7fffa0849ab9 xfs_file_aio_write_checks ([kernel.kallsyms])
> >>             7fffa0849bce xfs_file_dio_aio_write ([kernel.kallsyms])
> >>             7fffa084a13f xfs_file_write_iter ([kernel.kallsyms])
> >>             7fff812ab24f aio_write ([kernel.kallsyms])
> >>             7fff812ab90e do_io_submit ([kernel.kallsyms])
> >>             7fff812ac4c0 sys_io_submit ([kernel.kallsyms])
> >>             7fff81005917 do_syscall_64 ([kernel.kallsyms])
> >>             7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms])
> >>io_submit() stalls. Later, task 8146 relinquishes the lock, which
> >>(after passing through another worker thread), is finally acquired
> >>by io_submit() which then continues.
> >Yup, so it's the timestamp updates at IO submission that are
> >blocking waiting for the *ILOCK* lock.
> >
> >[.....]
> >
> >>The other is to check that ILOCK can be taken exclusively in
> >>xfs_file_dio_aio_write, if IOCB_NOWAIT. If this is the way to go, I
> >>might venture a patch.
> >You're getting your locks mixed up - xfs_file_dio_aio_write()
> >doesn't ever take the ILOCK directly.  It takes the _IOLOCK_
> >directly, and that's what IOCB_NOWAIT acts on.
> 
> It also takes ILOCK by calling file_update_time(), as seen in the
> trace above. I've seen that it does a _nowait acquisition of the
> IOLOCK, but that's not enough.
> 
> >>If we could pass the iocb through file_update_time() to
> >>xfs_vn_update_time(), then it could perform a _nowait acquisition of
> ILOCK. Otherwise, we can just check if ILOCK is acquirable in
> xfs_file_aio_write_checks(). That's racy, but better than nothing.

Sure, but that's not the issue with such suggestions. I'm not about
to propose a patch to hack an iocb through generic infrastructure
that doesn't need an iocb. Doing stuff like that will only get me
shouted at because it's a massive layering violation and I should
(and do!) know better....

> >require the _ILOCK_ to be taken exclusively, and there's no way
> >of knowing if that is going to be needed at the level of
> >xfs_file_dio_aio_write().
> >
> >There are hooks in xfs_file_aio_write_checks() to avoid timestamp
> >updates on write (FMODE_NOCMTIME) but there is no way to set this
> >from userspace. It's used exclusively by the XFS open-by-handle
> >code so that utilities like online defrag can write data into
> >files without changing their user-visible timestamps.
> >
> 
> >>xfs_file_aio_write_checks() knows it's going to call
> file_update_time(), and it can guess that file_update_time() will
> call xfs_vn_update_time(). So just before that call, check that
> ILOCK is free and EAGAIN if not.  Maybe that patch won't win any
> beauty contests but it will reduce some of my pain.

Again, gross, unmaintainable layering violations. An expeditious solution
to cater for your immediate need - you're welcome to do this with
your own kernels, but it's not a solution we can really carry
upstream.

And that brings me to the real issue here: If your requirement
really is "never block io_submit(), ever", then the correct change
to make is to the aio code so that it will /always queue IO/ and
never attempt to submit it directly. If you get that done, then we
can stop playing this tiresome whack-a-mole game in XFS.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_buf_lock vs aio
  2018-02-08 22:11     ` Dave Chinner
@ 2018-02-09 12:11       ` Avi Kivity
  2018-02-09 23:10         ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2018-02-09 12:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 02/09/2018 12:11 AM, Dave Chinner wrote:
> On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
>> On 02/08/2018 01:33 AM, Dave Chinner wrote:
>>> On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
>>>> As usual, I'm having my lovely io_submit()s sleeping. This time some
>>>> detailed traces. 4.14.15.
> [....]
>
>>>> Forcing the log, so sleeping with ILOCK taken.
>>> Because it's trying to reallocate an extent that is pinned in the
>>> log and is marked stale. i.e. we are reallocating a recently freed
>>> metadata extent that hasn't been committed to disk yet. IOWs, it's
>>> the metadata form of the "log force to clear a busy extent so we can
>>> re-use it" condition....
>>>
>>> There's nothing you can do to reliably avoid this - it's a sign that
>>> you're running low on free space in an AG because it's recycling
>>> recently freed space faster than the CIL is being committed to disk.
>>>
>>> You could speed up background journal syncs to try to reduce the
>>> async checkpoint latency that allows busy extents to build up
>>> (/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
>>> journal overhead and IO latency, etc.
>> Perhaps xfs should auto-tune this variable.
> That's not a fix. That's a nasty hack that attempts to hide the
> underlying problem of selecting AGs and/or free space that requires
> a log force to be used instead of finding other, un-encumbered
> freespace present in the filesystem.

Isn't the underlying problem that you have a foreground process 
depending on the progress of a background process? i.e., no matter how 
AG and free space selection improves, you can always find a workload 
that consumes extents faster than they can be laundered?

I'm not saying that free extent selection can't or shouldn't be 
improved, just that it can never completely fix the problem on its own.

> i.e. We know what the underlying problem is here, and there isn't a
> quick fix for it. You're just going to have to live with that until
> we work out a reliable, robust way of avoiding having busy extents
> block allocations.
>
>>>> reactor-29  3336 [029]  3580.420137: xfs:xfs_ilock: dev 9:0 ino
>>>> 0x50000a9f flags ILOCK_EXCL caller kretprobe_trampoline
>>>>              7fffa0858572 xfs_ilock ([kernel.kallsyms])
>>>>              7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
>>>>              7fff8127314c file_update_time ([kernel.kallsyms])
>>>>              7fffa0849ab9 xfs_file_aio_write_checks ([kernel.kallsyms])
>>>>              7fffa0849bce xfs_file_dio_aio_write ([kernel.kallsyms])
>>>>              7fffa084a13f xfs_file_write_iter ([kernel.kallsyms])
>>>>              7fff812ab24f aio_write ([kernel.kallsyms])
>>>>              7fff812ab90e do_io_submit ([kernel.kallsyms])
>>>>              7fff812ac4c0 sys_io_submit ([kernel.kallsyms])
>>>>              7fff81005917 do_syscall_64 ([kernel.kallsyms])
>>>>              7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms])
>>>> io_submit() stalls. Later, task 8146 relinquishes the lock, which
>>>> (after passing through another worker thread), is finally acquired
>>>> by io_submit() which then continues.
>>> Yup, so it's the timestamp updates at IO submission that are
>>> blocking waiting for the *ILOCK* lock.
>>>
>>> [.....]
>>>
>>>> The other is to check that ILOCK can be taken exclusively in
>>>> xfs_file_dio_aio_write, if IOCB_NOWAIT. If this is the way to go, I
>>>> might venture a patch.
>>> You're getting your locks mixed up - xfs_file_dio_aio_write()
>>> doesn't ever take the ILOCK directly.  It takes the _IOLOCK_
>>> directly, and that's what IOCB_NOWAIT acts on.
>> It also takes ILOCK by calling file_update_time(), as seen in the
>> trace above. I've seen that it does a _nowait acquisition of the
>> IOLOCK, but that's not enough.
>>
>> If we could pass the iocb through file_update_time() to
>> xfs_vn_update_time(), then it could perform a _nowait acquisition of
>> ILOCK. Otherwise, we can just check if ILOCK is acquirable in
>> xfs_file_aio_write_checks(). That's racy, but better than nothing.
> Sure, but that's not the issue with such suggestions. I'm not about
> to propose a patch to hack an iocb through generic infrastructure
> that doesn't need an iocb. Doing stuff like that will only get me
> shouted at because it's a massive layering violation and I should
> (and do!) know better....

If updating the time is part of an asynchronous operation, then it 
should fit with the rest of the async framework, no?

Perhaps a variant file_update_time_async() (which would be called by 
file_update_time).


>
>>> require the _ILOCK_ to be taken exclusively, and there's no way
>>> of knowing if that is going to be needed at the level of
>>> xfs_file_dio_aio_write().
>>>
>>> There are hooks in xfs_file_aio_write_checks() to avoid timestamp
>>> updates on write (FMODE_NOCMTIME) but there is no way to set this
>>> from userspace. It's used exclusively by the XFS open-by-handle
>>> code so that utilities like online defrag can write data into
>>> files without changing their user-visible timestamps.
>>>
>> xfs_file_aio_write_checks() knows it's going to call
>> file_update_time(), and it can guess that file_update_time() will
>> call xfs_vn_update_time(). So just before that call, check that
>> ILOCK is free and EAGAIN if not.  Maybe that patch won't win any
>> beauty contests but it will reduce some of my pain.
> Again, gross, unmaintainable layering violations. An expeditious solution
> to cater for your immediate need - you're welcome to do this with
> your own kernels, but it's not a solution we can really carry
> upstream.

Well, I don't like it very much either. But I do need some fix.

>
> And that brings me to the real issue here: If your requirement
> really is "never block io_submit(), ever", then the correct change
> to make is to the aio code so that it will /always queue IO/ and
> never attempt to submit it directly. If you get that done, then we
> can stop playing this tiresome whack-a-mole game in XFS.....
>
>

Queuing is easy, but who looks at that queue? If it's another thread, I 
can already do that from userspace from a worker thread (less 
efficiently due to more user/kernel mode changes). But now we have to 
context switch for each I/O. The whole point was to improve efficiency 
and reduce latency by doing everything directly.
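
The closest userspace can get today is something like this sketch -- 
queue_to_worker() is a hypothetical stand-in for a worker-thread pool, 
and aio_rw_flags needs a libaio new enough to expose it:

    #include <errno.h>
    #include <libaio.h>
    #include <linux/fs.h>    /* RWF_NOWAIT */

    int queue_to_worker(int fd, void *buf, size_t len, long long off);

    /* Try a non-blocking submission; punt the write to a thread that is
     * allowed to sleep if the kernel says it would block.  NB: depending
     * on the kernel, the EAGAIN may come back from io_submit() itself or
     * as res == -EAGAIN in the completion event, so a real reactor has
     * to handle both paths. */
    int submit_or_punt(io_context_t ctx, int fd, void *buf,
                       size_t len, long long off)
    {
        struct iocb cb, *cbs[1] = { &cb };

        io_prep_pwrite(&cb, fd, buf, len, off);
        cb.aio_rw_flags = RWF_NOWAIT;

        int ret = io_submit(ctx, 1, cbs);
        if (ret == 1)
            return 0;                    /* queued without sleeping */
        if (ret == -EAGAIN)
            return queue_to_worker(fd, buf, len, off);
        return ret;                      /* some other error */
    }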



* Re: xfs_buf_lock vs aio
  2018-02-09 12:11       ` Avi Kivity
@ 2018-02-09 23:10         ` Dave Chinner
  2018-02-12  9:33           ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-02-09 23:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
> On 02/09/2018 12:11 AM, Dave Chinner wrote:
> >On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
> >>On 02/08/2018 01:33 AM, Dave Chinner wrote:
> >>>On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
> >>>>As usual, I'm having my lovely io_submit()s sleeping. This time some
> >>>>detailed traces. 4.14.15.
> >[....]
> >
> >>>>Forcing the log, so sleeping with ILOCK taken.
> >>>Because it's trying to reallocate an extent that is pinned in the
> >>>log and is marked stale. i.e. we are reallocating a recently freed
> >>>metadata extent that hasn't been committed to disk yet. IOWs, it's
> >>>the metadata form of the "log force to clear a busy extent so we can
> >>>re-use it" condition....
> >>>
> >>>There's nothing you can do to reliably avoid this - it's a sign that
> >>>you're running low on free space in an AG because it's recycling
> >>>recently freed space faster than the CIL is being committed to disk.
> >>>
> >>>You could speed up background journal syncs to try to reduce the
> >>>async checkpoint latency that allows busy extents to build up
> >>>(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
> >>>journal overhead and IO latency, etc.
> >>Perhaps xfs should auto-tune this variable.
> >That's not a fix. That's a nasty hack that attempts to hide the
> >underlying problem of selecting AGs and/or free space that requires
> >a log force to be used instead of finding other, un-encumbered
> >freespace present in the filesystem.
> 
> Isn't the underlying problem that you have a foreground process
> depending on the progress of a background process?

At a very, very basic level.

> i.e., no matter
> how AG and free space selection improves, you can always find a
> workload that consumes extents faster than they can be laundered?

Sure, but that doesn't mean we have to fall back to a synchronous
algorithm to handle collisions. It's that synchronous behaviour that
is the root cause of the long lock stalls you are seeing.

> I'm not saying that free extent selection can't or shouldn't be
> improved, just that it can never completely fix the problem on its
> own.

Righto, if you say so.

After all, what do I know about the subject at hand? I'm just the
poor dumb guy who wrote the current busy extent list handling
algorithm years ago.  Perhaps you'd like to read the commit message
(below), because it explains these synchronous slow paths and why
they exist. I'll quote the part relevant to the discussion here,
though:

	    Ideally we should not reallocate busy extents. That is a
	    much more complex fix to the problem as it involves
	    direct intervention in the allocation btree searches in
	    many places. This is left to a future set of
	    modifications.

Christoph has already mentioned in these busy extent blocking
threads what we need to do to avoid the synchronous blocking
problems you are seeing. We knew about this blocking problem more
than *15 years ago*, but we didn't need to solve it 8 years ago
because it was not necessary to solve the "array overrun causes
log forces" problem people were reporting when they had heaps of
free space available in each AG.

Since then we haven't had a user report a workload that needed
anything more than what we currently do. You have a workload where
this is now important, and so we now need to revisit the issues we
laid out some 8 years ago and work from there.

> >>If we could pass the iocb through file_update_time() to
> >>xfs_vn_update_time(), then it could perform a _nowait acquisition of
> >>ILOCK. Otherwise, we can just check if ILOCK is acquirable in
> >>xfs_file_aio_write_checks(). That's racy, but better than nothing.
> >Sure, but that's not the issue with such suggestions. I'm not about
> >to propose a patch to hack an iocb through generic infrastructure
> >that doesn't need an iocb. Doing stuff like that will only get me
> >shouted at because it's a massive layering violation and I should
> >(and do!) know better....
> 
> If updating the time is part of an asynchronous operation, then it
> should fit with the rest of the async framework, no?
> 
> Perhaps a variant file_update_time_async() (which would be called by
> file_update_time).

This doesn't fix the problem you've reported.

We've got to run a transaction to update timestamps. That means
memory allocation (which can block on IO) and transaction
reservations need to be made (which can block on IO) *before* we
even attempt to take the ilock.  That means there's a huge TOCTOU
race window in any "optimistic" check we'd do here.

IOWs,  a non-blocking timestamp update API is still racy. It /might/
make blocking in timestamp updates less common, but it will not
improve the worst case long-tail latencies that are causing you
problems. And that means it doesn't fix your problem.

> Well, I don't like it very much either. But I do need some fix.

We are well aware of that. And we are also well aware of just how
complex and difficult it is to solve properly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

commit ed3b4d6cdc81e8feefdbfa3c584614be301b6d39
Author: Dave Chinner <david@fromorbit.com>
Date:   Fri May 21 12:07:08 2010 +1000

    xfs: Improve scalability of busy extent tracking
    
    When we free a metadata extent, we record it in the per-AG busy
    extent array so that it is not re-used before the freeing
    transaction hits the disk. This array is fixed size, so when it
    overflows we make further allocation transactions synchronous
    because we cannot track more freed extents until those transactions
    hit the disk and are completed. Under heavy mixed allocation and
    freeing workloads with large log buffers, we can overflow this array
    quite easily.
    
    Further, the array is sparsely populated, which means that inserts
    need to search for a free slot, and array searches often have to
    search many more slots that are actually used to check all the
    busy extents. Quite inefficient, really.
    
    To enable this aspect of extent freeing to scale better, we need
    a structure that can grow dynamically. While in other areas of
    XFS we have used radix trees, the extents being freed are at random
    locations on disk so are better suited to being indexed by an rbtree.
    
    So, use a per-AG rbtree indexed by block number to track busy
    extents.  This incurs a memory allocation when marking an extent
    busy, but should not occur too often in low memory situations. This
    should scale to an arbitrary number of extents so should not be a
    limitation for features such as in-memory aggregation of
    transactions.
    
    However, there are still situations where we can't avoid allocating
    busy extents (such as allocation from the AGFL). To minimise the
    overhead of such occurrences, we need to avoid doing a synchronous
    log force while holding the AGF locked to ensure that the previous
    transactions are safely on disk before we use the extent. We can do
    this by marking the transaction doing the allocation as synchronous
    rather than issuing a log force.
    
    Because of the locking involved and the ordering of transactions,
    the synchronous transaction provides the same guarantees as a
    synchronous log force because it ensures that all the prior
    transactions are already on disk when the synchronous transaction
    hits the disk. i.e. it preserves the free->allocate order of the
    extent correctly in recovery.
    
    By doing this, we avoid holding the AGF locked while log writes are
    in progress, hence reducing the length of time the lock is held and
    therefore we increase the rate at which we can allocate and free
    from the allocation group, thereby increasing overall throughput.
    
    The only problem with this approach is that when a metadata buffer is
    marked stale (e.g. a directory block is removed), then the buffer remains
    pinned and locked until the log goes to disk. The issue here is that
    if that stale buffer is reallocated in a subsequent transaction, the
    attempt to lock that buffer in the transaction will hang waiting
    for the log to go to disk to unlock and unpin the buffer. Hence if
    someone tries to lock a pinned, stale, locked buffer we need to
    push on the log to get it unlocked ASAP. Effectively we are trading
    off a guaranteed log force for a much less common trigger for log
    force to occur.
    
    Ideally we should not reallocate busy extents. That is a much more
    complex fix to the problem as it involves direct intervention in the
    allocation btree searches in many places. This is left to a future
    set of modifications.
    
    Finally, now that we track busy extents in allocated memory, we
    don't need the descriptors in the transaction structure to point to
    them. We can replace the complex busy chunk infrastructure with a
    simple linked list of busy extents. This allows us to remove a large
    chunk of code, making the overall change a net reduction in code
    size.
    
    Signed-off-by: Dave Chinner <david@fromorbit.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
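
(For the curious: the allocation-time question the commit raises boils
down to an overlap lookup against that per-AG structure. A toy
userspace illustration of the check, with a sorted array standing in
for the rbtree -- not the kernel code:)

    #include <stdbool.h>
    #include <stdio.h>

    struct busy_extent { unsigned long bno, len; };

    /* entries sorted by bno and non-overlapping, as the tree keeps them */
    static bool extent_is_busy(const struct busy_extent *v, int n,
                               unsigned long bno, unsigned long len)
    {
        int lo = 0, hi = n - 1;

        while (lo <= hi) {
            int mid = (lo + hi) / 2;

            if (v[mid].bno + v[mid].len <= bno)
                lo = mid + 1;       /* candidate entirely below range */
            else if (v[mid].bno >= bno + len)
                hi = mid - 1;       /* candidate entirely above range */
            else
                return true;        /* overlap: extent is still busy */
        }
        return false;
    }

    int main(void)
    {
        struct busy_extent busy[] = { { 100, 16 }, { 400, 64 } };

        printf("[104,+8): %s\n", extent_is_busy(busy, 2, 104, 8) ? "busy" : "free");
        printf("[200,+8): %s\n", extent_is_busy(busy, 2, 200, 8) ? "busy" : "free");
        return 0;
    }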



* Re: xfs_buf_lock vs aio
  2018-02-09 23:10         ` Dave Chinner
@ 2018-02-12  9:33           ` Avi Kivity
  2018-02-13  5:18             ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2018-02-12  9:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



On 02/10/2018 01:10 AM, Dave Chinner wrote:
> On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
>> On 02/09/2018 12:11 AM, Dave Chinner wrote:
>>> On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
>>>> On 02/08/2018 01:33 AM, Dave Chinner wrote:
>>>>> On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
>>>>>> As usual, I'm having my lovely io_submit()s sleeping. This time some
>>>>>> detailed traces. 4.14.15.
>>> [....]
>>>
>>>>>> Forcing the log, so sleeping with ILOCK taken.
>>>>> Because it's trying to reallocate an extent that is pinned in the
>>>>> log and is marked stale. i.e. we are reallocating a recently freed
>>>>> metadata extent that hasn't been committed to disk yet. IOWs, it's
>>>>> the metadata form of the "log force to clear a busy extent so we can
>>>>> re-use it" condition....
>>>>>
>>>>> There's nothing you can do to reliably avoid this - it's a sign that
>>>>> you're running low on free space in an AG because it's recycling
>>>>> recently freed space faster than the CIL is being committed to disk.
>>>>>
>>>>> You could speed up background journal syncs to try to reduce the
>>>>> async checkpoint latency that allows busy extents to build up
>>>>> (/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
>>>>> journal overhead and IO latency, etc.
>>>> Perhaps xfs should auto-tune this variable.
>>> That's not a fix. That's a nasty hack that attempts to hide the
>>> underlying problem of selecting AGs and/or free space that requires
>>> a log force to be used instead of finding other, un-encumbered
>>> freespace present in the filesystem.
>> Isn't the underlying problem that you have a foreground process
>> depending on the progress of a background process?
> At a very, very basic level.
>
>> i.e., no matter
>> how AG and free space selection improves, you can always find a
>> workload that consumes extents faster than they can be laundered?
> Sure, but that doesn't mean we have to fall back to a synchronous
> algorithm to handle collisions. It's that synchronous behaviour that
> is the root cause of the long lock stalls you are seeing.

Well, having that algorithm be asynchronous will be wonderful. But I 
imagine it will be a monstrous effort.

>
>> I'm not saying that free extent selection can't or shouldn't be
>> improved, just that it can never completely fix the problem on its
>> own.
> Righto, if you say so.
>
> After all, what do I know about the subject at hand? I'm just the
> poor dumb guy


Just because you're an XFS expert, and even wrote the code at hand, 
doesn't mean I have nothing to contribute. If I'm wrong, it's enough to 
tell me that and why.



>   who wrote the current busy extent list handling
> algorithm years ago.  Perhaps you'd like to read the commit message
> (below), because it explains these synchronous slow paths and why
> they exist. I'll quote the part relevant to the discussion here,
> though:
>
> 	    Ideally we should not reallocate busy extents. That is a
> 	    much more complex fix to the problem as it involves
> 	    direct intervention in the allocation btree searches in
> 	    many places. This is left to a future set of
> 	    modifications.

Thanks, that commit was interesting.

So, this future set of modifications is to have the extent allocator 
consult this rbtree and continue searching if locked?

> Christoph has already mentioned in these busy extent blocking
> threads what we need to do to avoid the synchronous blocking
> problems you are seeing. We knew about this blocking problem more
> than *15 years ago*, but we didn't need to solve it 8 years ago
> because it was not necessary to solve the "array overrun causes
> log forces" problem people were reporting when they had heaps of
> free space available in each AG.
>
> Since then we haven't had a user report a workload that needed
> anything more than what we currently do. You have a workload where
> this is now important, and so we now need to revisit the issues we
> laid out some 8 years ago and work from there.

I still think reducing the amount of outstanding busy extents is 
important. Modern disks write multiple GB/s, and big-data applications 
like to do large sequential writes and deletes, hence a high production 
rate of busy extents. Users also like high space utilization of their SSDs. 
It's easy to end up where most of your free extents are busy, leading to 
a hard time for your find-not-busy-free-extent algorithm.


>>>> If we could pass the iocb through file_update_time() to
>>>> xfs_vn_update_time(), then it could perform a _nowait acquisition of
>>>> ILOCK. Otherwise, we can just check if ILOCK is acquirable in
>>>> xfs_file_aio_write_checks(). That's racy, but better than nothing.
>>> Sure, but that's not the issue with such suggestions. I'm not about
>>> to propose a patch to hack an iocb through generic infrastructure
>>> that doesn't need an iocb. Doing stuff like that will only get me
>>> shouted at because it's a massive layering violation and I should
>>> (and do!) know better....
>> If updating the time is part of an asynchronous operation, then it
>> should fit with the rest of the async framework, no?
>>
>> Perhaps a variant file_update_time_async() (which would be called by
>> file_update_time).
> This doesn't fix the problem you've reported.
>
> We've got to run a transaction to update timestamps. That means
> memory allocation (which can block on IO) and transaction
> reservations need to be made (which can block on IO) *before* we
> even attempt to take the ilock.  That means there's a huge TOCTOU
> race window in any "optimistic" check we'd do here.
>
> IOWs,  a non-blocking timestamp update API is still racy. It /might/
> make blocking in timestamp updates less common, but it will not
> improve the worst case long-tail latencies that are causing you
> problems. And that means it doesn't fix your problem.

I'm aware it's racy. For me, even making it less common is an important 
improvement. I know I'll never get fully asynchronous behavior from the 
kernel, so I'm looking for the lowest hanging fruit rather than perfection.

I'm slightly helped by the fact that kernel memory allocations never 
block for me, as the kernel is never under memory pressure (my 
application allocates a fixed amount of anonymous memory and doesn't 
touch the page cache), but that's not the general case.

If we do have an async variant of file_update_time(), then the second 
check (for transaction reservation) can consult the NOWAIT flag and fail 
the write in that case with EAGAIN.
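
On the caller's side it would look something like this (hypothetical 
API, sketch only):

    /* in the write-checks path */
    if (iocb->ki_flags & IOCB_NOWAIT) {
        error = file_update_time_async(file);  /* hypothetical variant */
        if (error == -EAGAIN)
            return error;   /* fail the write; the submitter can retry
                             * from a context that is allowed to block */
    } else {
        error = file_update_time(file);
    }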

>
>> Well, I don't like it very much either. But I do need some fix.
> We are well aware of that. And we are also well aware of just how
> complex and difficult it is to solve properly.
>

Yeah.


* Re: xfs_buf_lock vs aio
  2018-02-12  9:33           ` Avi Kivity
@ 2018-02-13  5:18             ` Dave Chinner
  2018-02-13 23:14               ` Darrick J. Wong
  2018-02-14 12:07               ` Avi Kivity
  0 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2018-02-13  5:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
> On 02/10/2018 01:10 AM, Dave Chinner wrote:
> >On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
> >>On 02/09/2018 12:11 AM, Dave Chinner wrote:
> >>>On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
> >>>>On 02/08/2018 01:33 AM, Dave Chinner wrote:
> >>>>>On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
> >>>>>>As usual, I'm having my lovely io_submit()s sleeping. This time some
> >>>>>>detailed traces. 4.14.15.
> >>>[....]
> >>>
> >>>>>>Forcing the log, so sleeping with ILOCK taken.
> >>>>>Because it's trying to reallocate an extent that is pinned in the
> >>>>>log and is marked stale. i.e. we are reallocating a recently freed
> >>>>>metadata extent that hasn't been committed to disk yet. IOWs, it's
> >>>>>the metadata form of the "log force to clear a busy extent so we can
> >>>>>re-use it" condition....
> >>>>>
> >>>>>There's nothing you can do to reliably avoid this - it's a sign that
> >>>>>you're running low on free space in an AG because it's recycling
> >>>>>recently freed space faster than the CIL is being committed to disk.
> >>>>>
> >>>>>You could speed up background journal syncs to try to reduce the
> >>>>>async checkpoint latency that allows busy extents to build up
> >>>>>(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
> >>>>>journal overhead and IO latency, etc.
> >>>>Perhaps xfs should auto-tune this variable.
> >>>That's not a fix. That's a nasty hack that attempts to hide the
> >>>underlying problem of selecting AGs and/or free space that requires
> >>>a log force to be used instead of finding other, un-encumbered
> >>>freespace present in the filesystem.
> >>Isn't the underlying problem that you have a foreground process
> >>depending on the progress of a background process?
> >At a very, very basic level.
> >
> >>i.e., no matter
> >>how AG and free space selection improves, you can always find a
> >>workload that consumes extents faster than they can be laundered?
> >Sure, but that doesn't mean we have to fall back to a synchronous
> >algorithm to handle collisions. It's that synchronous behaviour that
> >is the root cause of the long lock stalls you are seeing.
> 
> Well, having that algorithm be asynchronous will be wonderful. But I
> imagine it will be a monstrous effort.

It's not clear yet whether we have to do any of this stuff to solve
your problem.

> >>I'm not saying that free extent selection can't or shouldn't be
> >>improved, just that it can never completely fix the problem on its
> >>own.
> >Righto, if you say so.
> >
> >After all, what do I know about the subject at hand? I'm just the
> >poor dumb guy
> 
> 
> Just because you're an XFS expert, and even wrote the code at hand,
> doesn't mean I have nothing to contribute. If I'm wrong, it's enough
> to tell me that and why.

It takes time and effort to have to explain why someone's suggestion
for fixing a bug will not work. It's tiring, unproductive work and I
get no thanks for it at all. I'm just seen as the nasty guy who says
"no" to everything because I eventually run out of patience trying
to explain everything in simple enough terms for non-XFS people to
understand that they don't really understand XFS or what I'm talking
about.

IOWs, sometimes the best way to contribute is to know when you're in
way over your head and to step back and simply help the master
crafters get on with weaving their magic.....

> >  who wrote the current busy extent list handling
> >algorithm years ago.  Perhaps you'd like to read the commit message
> >(below), because it explains these synchronous slow paths and why
> >they exist. I'll quote the part relevant to the discussion here,
> >though:
> >
> >	    Ideally we should not reallocate busy extents. That is a
> >	    much more complex fix to the problem as it involves
> >	    direct intervention in the allocation btree searches in
> >	    many places. This is left to a future set of
> >	    modifications.
> 
> Thanks, that commit was interesting.
> 
> So, this future set of modifications is to have the extent allocator
> consult this rbtree and continue searching if locked?

See, this is exactly what I mean.

You're now trying to guess how we'd solve the busy extent blocking
problem. i.e. you now appear to be assuming we have a plan to fix
this problem and are going to do it immediately.  Nothing could be
further from the truth - I said:

> >this is now important, and so we now need to revisit the issues we
> >laid out some 8 years ago and work from there.

That does not mean "we're going to fix this now" - it means we need
to look at the problem again and determine if it's the best solution
to the problem being presented to us. There are other avenues we
still need to explore.

Indeed, does your application and/or users even care about
[acm]times on your files being absolutely accurate and crash
resilient? i.e. do you use fsync() or fdatasync() to guarantee the
data is on stable storage?

[....]

> I still think reducing the amount of outstanding busy extents is
> important.  Modern disks write multiple GB/s, and big-data
> applications like to do large sequential writes and deletes,

Hah! "modern disks"

You need to recalibrate what "big data" and "high performance IO"
mean. This was what we were doing with XFS on Linux back in 2006:

https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

i.e. 10 years ago we were already well into the *tens of GB/s* on
XFS filesystems for big-data applications with large sequential
reads and writes. These "modern disks" are so slow! :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_buf_lock vs aio
  2018-02-13  5:18             ` Dave Chinner
@ 2018-02-13 23:14               ` Darrick J. Wong
  2018-02-14  2:16                 ` Dave Chinner
  2018-02-14 12:07               ` Avi Kivity
  1 sibling, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2018-02-13 23:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Avi Kivity, linux-xfs

On Tue, Feb 13, 2018 at 04:18:51PM +1100, Dave Chinner wrote:
> On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
> > On 02/10/2018 01:10 AM, Dave Chinner wrote:
> > >On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
> > >>On 02/09/2018 12:11 AM, Dave Chinner wrote:
> > >>>On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
> > >>>>On 02/08/2018 01:33 AM, Dave Chinner wrote:
> > >>>>>On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
> > >>>>>>As usual, I'm having my lovely io_submit()s sleeping. This time some
> > >>>>>>detailed traces. 4.14.15.
> > >>>[....]
> > >>>
> > >>>>>>Forcing the log, so sleeping with ILOCK taken.
> > >>>>>Because it's trying to reallocate an extent that is pinned in the
> > >>>>>log and is marked stale. i.e. we are reallocating a recently freed
> > >>>>>metadata extent that hasn't been committed to disk yet. IOWs, it's
> > >>>>>the metadata form of the "log force to clear a busy extent so we can
> > >>>>>re-use it" condition....
> > >>>>>
> > >>>>>There's nothing you can do to reliably avoid this - it's a sign that
> > >>>>>you're running low on free space in an AG because it's recycling
> > >>>>>recently freed space faster than the CIL is being committed to disk.
> > >>>>>
> > >>>>>You could speed up background journal syncs to try to reduce the
> > >>>>>async checkpoint latency that allows busy extents to build up
> > >>>>>(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
> > >>>>>journal overhead and IO latency, etc.
> > >>>>Perhaps xfs should auto-tune this variable.
> > >>>That's not a fix. That's a nasty hack that attempts to hide the
> > >>>underlying problem of selecting AGs and/or free space that requires
> > >>>a log force to be used instead of finding other, un-encumbered
> > >>>freespace present in the filesystem.
> > >>Isn't the underlying problem that you have a foreground process
> > >>depending on the progress of a background process?
> > >At a very, very basic level.
> > >
> > >>i.e., no matter
> > >>how AG and free space selection improves, you can always find a
> > >>workload that consumes extents faster than they can be laundered?
> > >Sure, but that doesn't mean we have to fall back to a synchronous
> > >algorithm to handle collisions. It's that synchronous behaviour that
> > >is the root cause of the long lock stalls you are seeing.
> > 
> > Well, having that algorithm be asynchronous will be wonderful. But I
> > imagine it will be a monstrous effort.
> 
> It's not clear yet whether we have to do any of this stuff to solve
> your problem.

The maintainer (me) would like to avoid fiddling with trylocks on the
metadata, especially when there's a significant amount of (unlocked)
activity between the trylock and the point where the actual lock is
needed.

If you need the [cma]time to be accurate and newly allocated buffers end
up stuck on busy extent cleanup when there's plenty of free space
available, then I think fixing the collision between the allocator and
the busy extents processing sounds like a reasonable solution.

> > >>I'm not saying that free extent selection can't or shouldn't be
> > >>improved, just that it can never completely fix the problem on its
> > >>own.
> > >Righto, if you say so.
> > >
> > >After all, what do I know about the subject at hand? I'm just the
> > >poor dumb guy
> > 
> > 
> > Just because you're an XFS expert, and even wrote the code at hand,
> > doesn't mean I have nothing to contribute. If I'm wrong, it's enough
> > to tell me that and why.
> 
> It takes time and effort to have to explain why someone's suggestion
> for fixing a bug will not work. It's tiring, unproductive work and I
> get no thanks for it at all. I'm just seen as the nasty guy who says

Thank you for helping me to say no to things. :)

> "no" to everything because I eventually run out of patience trying
> to explain everything in simple enough terms for non-XFS people to
> understand that they don't really understand XFS or what I'm talking
> about.
> 
> IOWs, sometimes the best way to contribute is to know when you're in
> way over your head and to step back and simply help the master
> crafters get on with weaving their magic.....

Not always easy, even if you /have/ been scribbling XFS magic for a while.

Anyway... is it getting a little hot in here?

(Yes, why is it 75F in February?)

> > >  who wrote the current busy extent list handling
> > >algorithm years ago.  Perhaps you'd like to read the commit message
> > >(below), because it explains these synchronous slow paths and why
> > >they exist. I'll quote the part relevant to the discussion here,
> > >though:
> > >
> > >	    Ideally we should not reallocate busy extents. That is a
> > >	    much more complex fix to the problem as it involves
> > >	    direct intervention in the allocation btree searches in
> > >	    many places. This is left to a future set of
> > >	    modifications.
> > 
> > Thanks, that commit was interesting.
> > 
> > So, this future set of modifications is to have the extent allocator
> > consult this rbtree and continue searching if locked?
> 
> See, this is exactly what I mean.
> 
> You're now trying to guess how we'd solve the busy extent blocking
> problem. i.e. you now appear to be assuming we have a plan to fix
> this problem and are going to do it immediately.  Nothing could be
> further from the truth - I said:
> 
> > >this is now important, and so we now need to revisit the issues we
> > >laid out some 8 years ago and work from there.
> 
> That does not mean "we're going to fix this now" - it means we need
> to look at the problem again and determine if it's the best solution
> to the problem being presented to us. There are other avenues we
> still need to explore.

Upstream XFS is a general-purpose solution, which means that the code we
add to it must be suitable for everyone -- that means it has to work for
most everyone and be understandable by everyone who reads the code.
I'd like to try to avoid adding weird hacks.

Solving whatever is the problem here requires someone to pin down the
problem, understand the code already built, design some test cases, and
then wire up all the new functionality.  After that, someone /else/ has
to reach the same knowledge levels to review the code.  It's way too
early to be jumping from conjectural solutions to software design.

In other words, more discussion needed:

> Indeed, does your application and/or users even care about
> [acm]times on your files being absolutely accurate and crash
> resilient? i.e. do you use fsync() or fdatasync() to guarantee the
> data is on stable storage?
> 
> [....]
> 
> > I still think reducing the amount of outstanding busy extents is
> > important.  Modern disks write multiple GB/s, and big-data
> > applications like to do large sequential writes and deletes,
> 
> Hah! "modern disks"
> 
> You need to recalibrate what "big data" and "high performance IO"
> means. This was what we were doing with XFS on linux back in 2006:
> 
> https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
> 
> i.e. 10 years ago we were already well into the *tens of GB/s* on
> XFS filesystems for big-data applications with large sequential
> reads and writes. These "modern disks" are so slow! :)

Yeah, and XFS performance is crap on my wall of 3.5" floppy mdraid. :P

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-13 23:14               ` Darrick J. Wong
@ 2018-02-14  2:16                 ` Dave Chinner
  2018-02-14 12:01                   ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-02-14  2:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Avi Kivity, linux-xfs

On Tue, Feb 13, 2018 at 03:14:58PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2018 at 04:18:51PM +1100, Dave Chinner wrote:
> > On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
> > You're now trying to guess how we'd solve the busy extent blocking
> > problem. i.e. you now appear to be assuming we have a plan to fix
> > this problem and are going to do it immediately.  Nothing could be
> > further from the truth - I said:
> > 
> > > >this is now important, and so we now need to revisit the issues we
> > > >laid out some 8 years ago and work from there.
> > 
> > That does not mean "we're going to fix this now" - it means we need
> > to look at the problem again and determine if it's the best solution
> > to the problem being presented to us. There are other avenues we
> > still need to explore.
> 
> Upstream XFS is a general-purpose solution, which means that the code we
> add to it must be suitable for everyone -- that means it has to work for
> most everyone and be understandable by everyone who reads the code.
> I'd like to try to avoid adding weird hacks.

[....]

> In other words, more discussion needed:
> 
> > Indeed, does your application and/or users even care about
> > [acm]times on your files being absolutely accurate and crash
> > resilient? i.e. do you use fsync() or fdatasync() to guarantee the
> > data is on stable storage?

Because what I'm getting at here is this:

$ man 8 mount
.....
      lazytime

	      Only update times (atime, mtime, ctime) on the
	      in-memory version of the file inode.

	      This mount option significantly reduces writes to the
	      inode table for workloads that perform frequent random
	      writes to preallocated files.

              The on-disk timestamps are updated only when:

	      - the inode needs to be updated for some change
		unrelated to file timestamps

	      - the application employs fsync(2), syncfs(2), or
		sync(2)

	      - an undeleted inode is evicted from memory

	      - more than 24 hours have passed since the i-node was
		written to disk.

This has been implemented for ext4 and f2fs, but not for XFS.
Implementing this for XFS and then using it for your application
will basically get rid of the blocking that is occurring in
timestamp updates by removing the transaction/locking we are doing
around timestamp updates.
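
For anyone who wants to experiment once XFS grows support, lazytime is
purely a mount-level switch, so testing is just a remount. A minimal
sketch via mount(2) -- the device and mountpoint are placeholders:

#include <stdio.h>
#include <sys/mount.h>

#ifndef MS_LAZYTIME
#define MS_LAZYTIME	(1 << 25)	/* older glibc headers lack the flag */
#endif

int main(void)
{
	/* equivalent to: mount -o remount,lazytime /data */
	if (mount("/dev/sdb1", "/data", "xfs",
		  MS_REMOUNT | MS_LAZYTIME, NULL) < 0) {
		perror("mount");
		return 1;
	}
	return 0;
}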

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-14  2:16                 ` Dave Chinner
@ 2018-02-14 12:01                   ` Avi Kivity
  0 siblings, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2018-02-14 12:01 UTC (permalink / raw)
  To: Dave Chinner, Darrick J. Wong; +Cc: linux-xfs

On 02/14/2018 04:16 AM, Dave Chinner wrote:
> On Tue, Feb 13, 2018 at 03:14:58PM -0800, Darrick J. Wong wrote:
>> On Tue, Feb 13, 2018 at 04:18:51PM +1100, Dave Chinner wrote:
>>> On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
>>> You're now trying to guess how we'd solve the busy extent blocking
>>> problem. i.e. you now appear to be assuming we have a plan to fix
>>> this problem and are going to do it immediately.  Nothing could be
>>> further from the truth - I said:
>>>
>>>>> this is now important, and so we now need to revisit the issues we
>>>>> laid out some 8 years ago and work from there.
>>> That does not mean "we're going to fix this now" - it means we need
>>> to look at the problem again and determine if it's the best solution
>>> to the problem being presented to us. There are other avenues we
>>> still need to explore.
>> Upstream XFS is a general-purpose solution, which means that the code we
>> add to it must be suitable for everyone -- that means it has to work for
>> most everyone and be understandable by everyone who reads the code.
>> I'd like to try to avoid adding weird hacks.
> [....]
>
>> In other words, more discussion needed:
>>
>>> Indeed, does your application and/or users even care about
>>> [acm]times on your files being absolutely accurate and crash
>>> resilient? i.e. do you use fsync() or fdatasync() to guarantee the
>>> data is on stable storage?
> Because what I'm getting at here is this:
>
> $ man 8 mount
> .....
>        lazytime
>
> 	      Only update times (atime, mtime, ctime) on the
> 	      in-memory version of the file inode.
>
> 	      This mount option significantly reduces writes to the
> 	      inode table for workloads that perform frequent random
> 	      writes to preallocated files.
>
>                The on-disk timestamps are updated only when:
>
> 	      - the inode needs to be updated for some change
> 		unrelated to file timestamps
>
> 	      - the application employs fsync(2), syncfs(2), or
> 		sync(2)
>
> 	      - an undeleted inode is evicted from memory
>
> 	      - more than 24 hours have passed since the i-node was
> 		written to disk.
>
> This has been implemented for ext4 and f2fs, but not for XFS.
> Implementing this for XFS and then using it for your application
> will basically get rid of the blocking that is occurring in
> timestamp updates by removing the transaction/locking we are doing
> around timestamp updates.
>

Lazytime would work for us. Our backups are implemented by creating 
hardlinks to fully written and fsynced files, so out-of-date mtime 
wouldn't harm us too much.
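
In sketch form (the helper name and calling convention are invented for
illustration): the file is already immutable and durable before the
extra link is taken, so a stale mtime on it is harmless:

#include <unistd.h>

/* hardlink a sealed, fdatasync'd file into a backup directory */
int snapshot_file(int fd, const char *path, const char *backup_path,
		  int backup_dirfd)
{
	if (fdatasync(fd) < 0)			/* data is durable first */
		return -1;
	if (link(path, backup_path) < 0)	/* no data copy, just a new name */
		return -1;
	return fsync(backup_dirfd);		/* make the new name durable */
}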

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-13  5:18             ` Dave Chinner
  2018-02-13 23:14               ` Darrick J. Wong
@ 2018-02-14 12:07               ` Avi Kivity
  2018-02-14 12:18                 ` Avi Kivity
  2018-02-14 23:56                 ` Dave Chinner
  1 sibling, 2 replies; 20+ messages in thread
From: Avi Kivity @ 2018-02-14 12:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 02/13/2018 07:18 AM, Dave Chinner wrote:
> On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
>> On 02/10/2018 01:10 AM, Dave Chinner wrote:
>>> On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
>>>> On 02/09/2018 12:11 AM, Dave Chinner wrote:
>>>>> On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote:
>>>>>> On 02/08/2018 01:33 AM, Dave Chinner wrote:
>>>>>>> On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote:
>>>>>>>> As usual, I'm having my lovely io_submit()s sleeping. This time some
>>>>>>>> detailed traces. 4.14.15.
>>>>> [....]
>>>>>
>>>>>>>> Forcing the log, so sleeping with ILOCK taken.
>>>>>>> Because it's trying to reallocate an extent that is pinned in the
>>>>>>> log and is marked stale. i.e. we are reallocating a recently freed
>>>>>>> metadata extent that hasn't been committed to disk yet. IOWs, it's
>>>>>>> the metadata form of the "log force to clear a busy extent so we can
>>>>>>> re-use it" condition....
>>>>>>>
>>>>>>> There's nothing you can do to reliably avoid this - it's a sign that
>>>>>>> you're running low on free space in an AG because it's recycling
>>>>>>> recently freed space faster than the CIL is being committed to disk.
>>>>>>>
>>>>>>> You could speed up background journal syncs to try to reduce the
>>>>>>> async checkpoint latency that allows busy extents to build up
>>>>>>> (/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on
>>>>>>> journal overhead and IO latency, etc.
>>>>>> Perhaps xfs should auto-tune this variable.
>>>>> That's not a fix. That's a nasty hack that attempts to hide the
>>>>> underlying problem of selecting AGs and/or free space that requires
>>>>> a log force to be used instead of finding other, un-encumbered
>>>>> freespace present in the filesystem.
>>>> Isn't the underlying problem that you have a foreground process
>>>> depending on the progress of a background process?
>>> At a very, very basic level.
>>>
>>>> i.e., no matter
>>>> how AG and free space selection improves, you can always find a
>>>> workload that consumes extents faster than they can be laundered?
>>> Sure, but that doesn't mean we have to fall back to a synchronous
>>> algorithm to handle collisions. It's that synchronous behaviour that
>>> is the root cause of the long lock stalls you are seeing.
>> Well, having that algorithm be asynchronous will be wonderful. But I
>> imagine it will be a monstrous effort.
> It's not clear yet whether we have to do any of this stuff to solve
> your problem.

I was going by "is the root cause" above. But if we don't have to touch 
it, great.

>
>>>> I'm not saying that free extent selection can't or shouldn't be
>>>> improved, just that it can never completely fix the problem on its
>>>> own.
>>> Righto, if you say so.
>>>
>>> After all, what do I know about the subject at hand? I'm just the
>>> poor dumb guy
>>
>> Just because you're an XFS expert, and even wrote the code at hand,
>> doesn't mean I have nothing to contribute. If I'm wrong, it's enough
>> to tell me that and why.
> It takes time and effort to have to explain why someone's suggestion
> for fixing a bug will not work. It's tiring, unproductive work and I
> get no thanks for it at all.

Isn't that part of being a maintainer? When everything works, the users 
are off the mailing list.

> I'm just seen as the nasty guy who says
> "no" to everything because I eventually run out of patience trying
> to explain everything in simple enough terms for non-XFS people to
> understand that they don't really understand XFS or what I'm talking
> about.
>
> IOWs, sometimes the best way to contribute is to know when you're in
> way over your head and to step back and simply help the master
> crafters get on with weaving their magic.....

Are you suggesting that I should go away? Or something else?

>
>>>   who wrote the current busy extent list handling
>>> algorithm years ago.  Perhaps you'd like to read the commit message
>>> (below), because it explains these synchronous slow paths and why
>>> they exist. I'll quote the part relevant to the discussion here,
>>> though:
>>>
>>> 	    Ideally we should not reallocate busy extents. That is a
>>> 	    much more complex fix to the problem as it involves
>>> 	    direct intervention in the allocation btree searches in
>>> 	    many places. This is left to a future set of
>>> 	    modifications.
>> Thanks, that commit was interesting.
>>
>> So, this future set of modifications is to have the extent allocator
>> consult this rbtree and continue searching if locked?
> See, this is exactly what I mean.
>
> You're now trying to guess how we'd solve the busy extent blocking
> problem. i.e. you now appear to be assuming we have a plan to fix
> this problem and are going to do it immediately.  Nothing could be
> further from the truth - I said:
>
>>> this is now important, and so we now need to revisit the issues we
>>> laid out some 8 years ago and work from there.
> That does not mean "we're going to fix this now" - it means we need
> to look at the problem again and determine if it's the best solution
> to the problem being presented to us. There are other avenues we
> still need to explore.
>
> Indeed, does your application and/or users even care about
> [acm]times on your files being absolutely accurate and crash
> resilient? i.e. do you use fsync() or fdatasync() to guarantee the
> data is on stable storage?

We use fdatasync and don't care about mtime much. So lazytime would work 
for us.

>
> [....]
>
>> I still think reducing the amount of outstanding busy extents is
>> important.  Modern disks write multiple GB/s, and big-data
>> applications like to do large sequential writes and deletes,
> Hah! "modern disks"
>
> You need to recalibrate what "big data" and "high performance IO"
> means. This was what we were doing with XFS on linux back in 2006:
>
> https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
>
> i.e. 10 years ago we were already well into the *tens of GB/s* on
> XFS filesystems for big-data applications with large sequential
> reads and writes. These "modern disks" are so slow! :)

Today, that's one or a few disks, not 90, and you can get such a setup for a 
few dollars an hour, doing millions of IOPS.

> Cheers,
>
> Dave.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-14 12:07               ` Avi Kivity
@ 2018-02-14 12:18                 ` Avi Kivity
  2018-02-14 23:56                 ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2018-02-14 12:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 02/14/2018 02:07 PM, Avi Kivity wrote:
>
>> Indeed, does your application and/or users even care about
>> [acm]times on your files being absolutely accurate and crash
>> resilient? i.e. do you use fsync() or fdatasync() to guarantee the
>> data is on stable storage?
>
> We use fdatasync and don't care about mtime much. So lazytime would 
> work for us. 

We're actually doing something much worse than lazytime now: we truncate 
the file to a much larger size (and keep increasing that size, 
exponentially) to avoid triggering XFS's 
write-will-create-a-hole-so-let's-stall thing. So compared to that, 
lazytime is invisible in terms of data integrity.
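
A rough sketch of the idea (not our actual code; the sizes and the
helper name are illustrative). The point is to keep the file's logical
size ahead of the append point, doubling it as needed, so AIO appends
never have to extend i_size:

#include <unistd.h>

/* grow the file's logical size exponentially ahead of the writer */
int ensure_logical_size(int fd, off_t *logical_size, off_t write_end)
{
	off_t	newsize = *logical_size ? *logical_size : 1 << 20;

	if (write_end <= *logical_size)
		return 0;	/* still inside the pre-truncated range */
	while (newsize < write_end)
		newsize *= 2;	/* double to amortise truncate calls */
	if (ftruncate(fd, newsize) < 0)
		return -1;
	*logical_size = newsize;
	return 0;
}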

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-14 12:07               ` Avi Kivity
  2018-02-14 12:18                 ` Avi Kivity
@ 2018-02-14 23:56                 ` Dave Chinner
  2018-02-15  9:36                   ` Avi Kivity
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-02-14 23:56 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Wed, Feb 14, 2018 at 02:07:42PM +0200, Avi Kivity wrote:
> On 02/13/2018 07:18 AM, Dave Chinner wrote:
> >On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
> >>On 02/10/2018 01:10 AM, Dave Chinner wrote:
> >>>On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
> >>>>i.e., no matter
> >>>>how AG and free space selection improves, you can always find a
> >>>>workload that consumes extents faster than they can be laundered?
> >>>Sure, but that doesn't mean we have to fall back to a synchronous
> >>>algorithm to handle collisions. It's that synchronous behaviour that
> >>>is the root cause of the long lock stalls you are seeing.
> >>Well, having that algorithm be asynchronous will be wonderful. But I
> >>imagine it will be a monstrous effort.
> >It's not clear yet whether we have to do any of this stuff to solve
> >your problem.
> 
> I was going by "is the root cause" above. But if we don't have to
> touch it, great.

Remember that triage - which is all about finding the root cause of
an issue - is a separate process to finding an appropriate fix for
the issue that has been triaged.

> >>>>I'm not saying that free extent selection can't or shouldn't be
> >>>>improved, just that it can never completely fix the problem on its
> >>>>own.
> >>>Righto, if you say so.
> >>>
> >>>After all, what do I know about the subject at hand? I'm just the
> >>>poor dumb guy
> >>
> >>Just because you're an XFS expert, and even wrote the code at hand,
> >>doesn't mean I have nothing to contribute. If I'm wrong, it's enough
> >>to tell me that and why.
> >It takes time and effort to have to explain why someone's suggestion
> >for fixing a bug will not work. It's tiring, unproductive work and I
> >get no thanks for it at all.
> 
> Isn't that part of being a maintainer?

I'm not the maintainer.  That burnt me out, and this was one of the
aspects of the job that contributes significantly to burn-out.

I don't want the current maintainer to suffer from the same fate.
I can handle some stress, so I'm happy to play the bad guy because
it shares the stress around.

However, I'm not going to make the same mistake I did the first time
around - internalising these issues doesn't make them go away. Hence
I'm going to speak out about it in the hope that users realise that
their demands can have a serious impact on the people that are
supporting them. Sure, I could have put it better, but this is still
an unfamiliar, learning-as-I-go process for me and so next time I
won't make the same mistakes....

> When everything works, the
> users are off the mailing list.

That often makes things worse :/ Users are always asking questions
about configs, optimisations, etc. And then there's all the other
developers who want their projects merged and supported. The need to
say no doesn't go away just because "everything works"....

> >I'm just seen as the nasty guy who says
> >"no" to everything because I eventually run out of patience trying
> >to explain everything in simple enough terms for non-XFS people to
> >understand that they don't really understand XFS or what I'm talking
> >about.
> >
> >IOWs, sometimes the best way to contribute is to know when you're in
> >way over your head and to step back and simply help the master
> >crafters get on with weaving their magic.....
> 
> Are you suggesting that I should go away? Or something else?

Something else.

Avi, your help and insight is most definitely welcome (and needed!)
because we can't find a solution that would suit your needs without
it.  All I'm asking for is a little bit of patience as we go
through the process of gathering all the info we need to determine
the best approach to solving the problem.

Be aware that when you are asked triage questions that seem
illogical or irrelevant, the best thing to do is to answer the
question as best you can and save your own questions for later.
Those questions are usually asked to rule out complex, convoluted
cases that take a long, long time to explain; responding with
questions rather than answers derails the process of expedient
triage and analysis.

IOWs, lets talk about the merits and mechanisms of solutions when
they are proposed, not while questions are still being asked about
the application, requirements, environment, etc needed to determine
what the best potential solution may be.

> >Indeed, does your application and/or users even care about
> >[acm]times on your files being absolutely accurate and crash
> >resilient? i.e. do you use fsync() or fdatasync() to guarantee the
> >data is on stable storage?
> 
> We use fdatasync and don't care about mtime much. So lazytime would
> work for us.

OK, so let me explore that in a bit more detail and see whether it's
something we can cleanly implement....

> >>I still think reducing the amount of outstanding busy extents is
> >>important.  Modern disks write multiple GB/s, and big-data
> >>applications like to do large sequential writes and deletes,
> >Hah! "modern disks"
> >
> >You need to recalibrate what "big data" and "high performance IO"
> >means. This was what we were doing with XFS on linux back in 2006:
> >
> >https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
> >
> >i.e. 10 years ago we were already well into the *tens of GB/s* on
> >XFS filesystems for big-data applications with large sequential
> >reads and writes. These "modern disks" are so slow! :)
> 
>> Today, that's one or a few disks, not 90, and you can get such a setup
> for a few dollars an hour, doing millions of IOPS.

Sure, but that's not "big-data" anymore - it's pretty common
> nowadays in enterprise server environments. Big data applications
these days are measured in TB/s and hundreds of PBs.... :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-14 23:56                 ` Dave Chinner
@ 2018-02-15  9:36                   ` Avi Kivity
  2018-02-15 21:30                     ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2018-02-15  9:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 02/15/2018 01:56 AM, Dave Chinner wrote:
> On Wed, Feb 14, 2018 at 02:07:42PM +0200, Avi Kivity wrote:
>> On 02/13/2018 07:18 AM, Dave Chinner wrote:
>>> On Mon, Feb 12, 2018 at 11:33:44AM +0200, Avi Kivity wrote:
>>>> On 02/10/2018 01:10 AM, Dave Chinner wrote:
>>>>> On Fri, Feb 09, 2018 at 02:11:58PM +0200, Avi Kivity wrote:
>>>>>> i.e., no matter
>>>>>> how AG and free space selection improves, you can always find a
>>>>>> workload that consumes extents faster than they can be laundered?
>>>>> Sure, but that doesn't mean we have to fall back to a synchronous
>>>>> algorithm to handle collisions. It's that synchronous behaviour that
>>>>> is the root cause of the long lock stalls you are seeing.
>>>> Well, having that algorithm be asynchronous will be wonderful. But I
>>>> imagine it will be a monstrous effort.
>>> It's not clear yet whether we have to do any of this stuff to solve
>>> your problem.
>> I was going by "is the root cause" above. But if we don't have to
>> touch it, great.
> Remember that triage - which is all about finding the root cause of
> an issue - is a separate process to finding an appropriate fix for
> the issue that has been triaged.

Sure.

>>>>>> I'm not saying that free extent selection can't or shouldn't be
>>>>>> improved, just that it can never completely fix the problem on its
>>>>>> own.
>>>>> Righto, if you say so.
>>>>>
>>>>> After all, what do I know about the subject at hand? I'm just the
>>>>> poor dumb guy
>>>> Just because you're an XFS expert, and even wrote the code at hand,
>>>> doesn't mean I have nothing to contribute. If I'm wrong, it's enough
>>>> to tell me that and why.
>>> It takes time and effort to have to explain why someone's suggestion
>>> for fixing a bug will not work. It's tiring, unproductive work and I
>>> get no thanks for it at all.
>> Isn't that part of being a maintainer?
> I'm not the maintainer.  That burnt me out, and this was one of the
> aspects of the job that contributes significantly to burn-out.

I'm sorry to hear that. As an ex kernel maintainer (and current 
non-kernel maintainer), I can certainly sympathize, though it was never 
so bad for me.

> I don't want the current maintainer to suffer from the same fate.
> I can handle some stress, so I'm happy to play the bad guy because
> it shares the stress around.
>
> However, I'm not going to make the same mistake I did the first time
> around - internalising these issues doesn't make them go away. Hence
> I'm going to speak out about it in the hope that users realise that
> their demands can have a serious impact on the people that are
> supporting them. Sure, I could have put it better, but this is still
> an unfamiliar, learning-as-I-go process for me and so next time I
> won't make the same mistakes....

Well, I'm happy to adjust in order to work better with you, just tell me 
what will work.

>
>> When everything works, the
>> users are off the mailing list.
> That often makes things worse :/ Users are always asking questions
> about configs, optimisations, etc. And then there's all the other
> developers who want their projects merged and supported. The need to
> say no doesn't go away just because "everything works"....
>
>>> I'm just seen as the nasty guy who says
>>> "no" to everything because I eventually run out of patience trying
>>> to explain everything in simple enough terms for non-XFS people to
>>> understand that they don't really understand XFS or what I'm talking
>>> about.
>>>
>>> IOWs, sometimes the best way to contribute is to know when you're in
>>> way over your head and to step back and simply help the master
>>> crafters get on with weaving their magic.....
>> Are you suggesting that I should go away? Or something else?
> Something else.
>
> Avi, your help and insight is most definitely welcome (and needed!)
> because we can't find a solution that would suit your needs without
> it.  All I'm asking for is a little bit of patience as we go
> through the process of gathering all the info we need to determine
> the best approach to solving the problem.

Thanks. I'm under pressure to find a solution quickly, so maybe I'm 
pushing too hard.

I'm certainly all for the right long-term fix rather than creating 
mountains of workarounds that later create more problems.

>
> Be aware that when you are asked triage questions that seem
> illogical or irrelevant, the best thing to do is to answer the
> question as best you can and save your own questions for later.
> Those questions are usually asked to rule out complex, convoluted
> cases that take a long, long time to explain; responding with
> questions rather than answers derails the process of expedient
> triage and analysis.
>
> IOWs, lets talk about the merits and mechanisms of solutions when
> they are proposed, not while questions are still being asked about
> the application, requirements, environment, etc needed to determine
> what the best potential solution may be.

Ok. I also ask these questions as a way to increase my understanding of 
the topic, not just in the hope of getting a quick fix in.

>
>>> Indeed, does your application and/or users even care about
>>> [acm]times on your files being absolutely accurate and crash
>>> resilient? i.e. do you use fsync() or fdatasync() to guarantee the
>>> data is on stable storage?
>> We use fdatasync and don't care about mtime much. So lazytime would
>> work for us.
> OK, so let me explore that in a bit more detail and see whether it's
> something we can cleanly implement....
>
>>>> I still think reducing the amount of outstanding busy extents is
>>>> important.  Modern disks write multiple GB/s, and big-data
>>>> applications like to do large sequential writes and deletes,
>>> Hah! "modern disks"
>>>
>>> You need to recalibrate what "big data" and "high performance IO"
>>> means. This was what we were doing with XFS on linux back in 2006:
>>>
>>> https://web.archive.org/web/20171010112452/http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
>>>
>>> i.e. 10 years ago we were already well into the *tens of GB/s* on
>>> XFS filesystems for big-data applications with large sequential
>>> reads and writes. These "modern disks" are so slow! :)
>> Today, that's one or a few disks, not 90, and you can get such a setup
>> for a few dollars an hour, doing millions of IOPS.
> Sure, but that's not "big-data" anymore - it's pretty common
> nowadays in enterprise server environments. Big data applications
> these days are measured in TB/s and hundreds of PBs.... :)

Across a cluster, with each node having tens of cores and tens/hundreds 
of TB, not more. The nodes I described are fairly typical.

Meanwhile, we've tried inode32 on a newly built filesystem (to avoid any 
inherited imbalance). The old filesystem had a large AGF imbalance, the 
new one did not, as expected. However, the stalls remain.


A little bird whispered in my ear to try XFS_IOC_OPEN_BY_HANDLE to avoid 
the time update lock, so we'll be trying that next, to emulate lazytime.
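
For reference, the open-by-handle route looks roughly like this from
userspace -- a minimal sketch using the libhandle wrappers from
xfsprogs (link with -lhandle; needs root, and the helper name here is
made up). The reason it sidesteps timestamp updates is that XFS marks
handle-opened files for "invisible" I/O, i.e. no c/mtime updates on
read or write:

#include <fcntl.h>
#include <stdio.h>
#include <xfs/xfs.h>
#include <xfs/handle.h>

/* open an existing file such that I/O does not update timestamps */
int open_nocmtime(const char *path, int oflags)
{
	void	*hanp;
	size_t	hlen;
	int	fd;

	/* translate the path into an opaque XFS file handle */
	if (path_to_handle((char *)path, &hanp, &hlen) < 0) {
		perror("path_to_handle");
		return -1;
	}
	/* XFS_IOC_OPEN_BY_HANDLE under the covers */
	fd = open_by_handle(hanp, hlen, oflags);
	free_handle(hanp, hlen);
	if (fd < 0)
		perror("open_by_handle");
	return fd;
}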


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-15  9:36                   ` Avi Kivity
@ 2018-02-15 21:30                     ` Dave Chinner
  2018-02-16  8:07                       ` Avi Kivity
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-02-15 21:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Thu, Feb 15, 2018 at 11:36:54AM +0200, Avi Kivity wrote:
> On 02/15/2018 01:56 AM, Dave Chinner wrote:
> A little bird whispered in my ear to try XFS_IOC_OPEN_BY_HANDLE to
> avoid the time update lock, so we'll be trying that next, to
> emulate lazytime.

Biggest problem with that is it requires root permissions. It's not
a solution that can be deployed in practice, so I haven't bothered
suggesting it as something to try.

If you want to try lazytime, an easier test might be to rebuild the
kernel with this change below to support the lazytime mount option
and not log the timestamp updates. This is essentially the mechanism
that I'll use for this, but it will need to grow more stuff to have
the correct lazytime semantics...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


---
 fs/xfs/xfs_iops.c  | 27 +++++++++++++++++++++++----
 fs/xfs/xfs_mount.h |  1 +
 fs/xfs/xfs_super.c | 11 ++++++++++-
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..29082707687b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1044,6 +1044,19 @@ xfs_vn_setattr(
 	return error;
 }
 
+/*
+ * When lazytime is enabled, we never do transactional timestamp updates.
+ * That means it will only get updated on disk when some other change unrelated
+ * to timestamps is made, and so pure timestamp changes can be lost.
+ *
+ * XXX: ideally we want to set an inode state bit here so that we can track
+ * inodes that are purely timestamp dirty. And inode radix tree bit would be
+ * ideal so we can do background sweeps to log the inodes and get them written
+ * to disk at some (long) interval.
+ *
+ * XXX: lazytime currently turns off iversion support, so is incompatible with
+ * applications that require it (such as the NFS server).
+ */
 STATIC int
 xfs_vn_update_time(
 	struct inode		*inode,
@@ -1053,15 +1066,18 @@ xfs_vn_update_time(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
+	bool			lazy = (mp->m_flags & XFS_MOUNT_LAZYTIME);
 	int			error;
 
 	trace_xfs_update_time(ip);
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
-	if (error)
-		return error;
+	if (!lazy) {
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
+		if (error)
+			return error;
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+	}
 
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (flags & S_CTIME)
 		inode->i_ctime = *now;
 	if (flags & S_MTIME)
@@ -1069,6 +1085,9 @@ xfs_vn_update_time(
 	if (flags & S_ATIME)
 		inode->i_atime = *now;
 
+	if (lazy)
+		return 0;
+
 	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
 	return xfs_trans_commit(tp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 75da058c9456..e085e1e92d18 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -217,6 +217,7 @@ typedef struct xfs_mount {
 						   must be synchronous except
 						   for space allocations */
 #define XFS_MOUNT_UNMOUNTING	(1ULL << 1)	/* filesystem is unmounting */
+#define XFS_MOUNT_LAZYTIME	(1ULL << 2)	/* lazily update [acm]times */
 #define XFS_MOUNT_WAS_CLEAN	(1ULL << 3)
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
 						   operations, typically for
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 5a248fd1a96b..ae76d558a934 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -83,7 +83,8 @@ enum {
 	Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota, Opt_prjquota,
 	Opt_uquota, Opt_gquota, Opt_pquota,
 	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
-	Opt_discard, Opt_nodiscard, Opt_dax, Opt_err,
+	Opt_discard, Opt_nodiscard, Opt_dax,
+	Opt_err,
 };
 
 static const match_table_t tokens = {
@@ -216,6 +217,8 @@ xfs_parseargs(
 		mp->m_flags |= XFS_MOUNT_DIRSYNC;
 	if (sb->s_flags & SB_SYNCHRONOUS)
 		mp->m_flags |= XFS_MOUNT_WSYNC;
+	if (sb->s_flags & SB_LAZYTIME)
+		mp->m_flags |= XFS_MOUNT_LAZYTIME;
 
 	/*
 	 * Set some default flags that could be cleared by the mount option
@@ -492,6 +495,7 @@ xfs_showargs(
 		{ XFS_MOUNT_DISCARD,		",discard" },
 		{ XFS_MOUNT_SMALL_INUMS,	",inode32" },
 		{ XFS_MOUNT_DAX,		",dax" },
+		{ XFS_MOUNT_LAZYTIME,		",lazytime" },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
@@ -1319,6 +1323,11 @@ xfs_fs_remount(
 		}
 	}
 
+	if (*flags & SB_LAZYTIME)
+		mp->m_flags |= XFS_MOUNT_LAZYTIME;
+	else
+		mp->m_flags &= ~XFS_MOUNT_LAZYTIME;
+
 	/* ro -> rw */
 	if ((mp->m_flags & XFS_MOUNT_RDONLY) && !(*flags & SB_RDONLY)) {
 		if (mp->m_flags & XFS_MOUNT_NORECOVERY) {

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-15 21:30                     ` Dave Chinner
@ 2018-02-16  8:07                       ` Avi Kivity
  2018-02-19  2:40                         ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Avi Kivity @ 2018-02-16  8:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 02/15/2018 11:30 PM, Dave Chinner wrote:
> On Thu, Feb 15, 2018 at 11:36:54AM +0200, Avi Kivity wrote:
>> On 02/15/2018 01:56 AM, Dave Chinner wrote:
>> A little bird whispered in my ear to try XFS_IOC_OPEN_BY_HANDLE to
>> avoid the time update lock, so we'll be trying that next, to
>> emulate lazytime.
> Biggest problem with that is it requires root permissions. It's not
> a solution that can be deployed in practice, so I haven't bothered
> suggesting it as something to try.
>
> If you want to try lazytime, an easier test might be to rebuild the
> kernel with this change below to support the lazytime mount option
> and not log the timestamp updates. This is essentially the mechanism
> that I'll use for this, but it will need to grow more stuff to have
> the correct lazytime semantics...
>

We tried open by handle to see if lazytime would provide relief, but it 
looks like it just pushes the lock acquisition to another place:

reactor-8 16033 [009] 1482121.241125:                xfs:xfs_ilock:
dev 9:2 ino 0xc0004bac flags ILOCK_EXCL caller kretprobe_trampoline
             7fffa01ca572 xfs_ilock ([kernel.kallsyms])
             7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms])
             7fff812bd2ea iomap_apply ([kernel.kallsyms])
             7fff812bdb81 iomap_dio_rw ([kernel.kallsyms])
             7fffa01bbc26 xfs_file_dio_aio_write ([kernel.kallsyms])
             7fffa01bc13f xfs_file_write_iter ([kernel.kallsyms])
             7fff812ab24f aio_write ([kernel.kallsyms])
             7fff812ab90e do_io_submit ([kernel.kallsyms])
             7fff812ac4c0 sys_io_submit ([kernel.kallsyms])
             7fff81005917 do_syscall_64 ([kernel.kallsyms])
             7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms])
                    f27f9 syscall (/usr/lib64/libc-2.17.so)
                       1c [unknown] ([unknown])
                   142e60 reactor_backend_epoll::~reactor_backend_epoll
(/usr/bin/scylla)

Probably, iomap_apply calls xfs_file_iomap_begin, which calls xfs_ilock().

It consults need_excl_ilock() to see whether it wants to lock or not:

static inline bool need_excl_ilock(struct xfs_inode *ip, unsigned flags)
{
         /*
          * COW writes will allocate delalloc space, so we need to make sure
          * to take the lock exclusively here.
          */
         if (xfs_is_reflink_inode(ip) && (flags & (IOMAP_WRITE | 
IOMAP_ZERO)))
                 return true;
         if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE))
                 return true;
         return false;
}

However, that function can EAGAIN (it does for IOLOCK) so maybe we can 
change xfs_ilock to xfs_ilock_nowait and teach it about not waiting for 
ILOCK too.
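
For context, the "nowait" path in question is the one io_submit()
callers opt into with RWF_NOWAIT (Linux 4.13 and later); a rough
raw-syscall sketch, error handling elided:

#include <linux/aio_abi.h>
#include <linux/fs.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* queue one non-blocking write on an already set up aio context */
static long submit_nowait_write(aio_context_t ctx, int fd,
				void *buf, size_t len, off_t off)
{
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_lio_opcode = IOCB_CMD_PWRITE;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;
	/* complete with -EAGAIN instead of blocking in the kernel */
	cb.aio_rw_flags = RWF_NOWAIT;

	return syscall(__NR_io_submit, ctx, 1, cbs);
}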


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-16  8:07                       ` Avi Kivity
@ 2018-02-19  2:40                         ` Dave Chinner
  2018-02-19  4:48                           ` Dave Chinner
  2018-02-25 17:47                           ` Avi Kivity
  0 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2018-02-19  2:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Fri, Feb 16, 2018 at 10:07:55AM +0200, Avi Kivity wrote:
> On 02/15/2018 11:30 PM, Dave Chinner wrote:
> >On Thu, Feb 15, 2018 at 11:36:54AM +0200, Avi Kivity wrote:
> >>On 02/15/2018 01:56 AM, Dave Chinner wrote:
> >>A little bird whispered in my ear to try XFS_IOC_OPEN_BY_HANDLE to
> >>avoid the time update lock, so we'll be trying that next, to
> >>emulate lazytime.
> >Biggest problem with that is it requires root permissions. It's not
> >a solution that can be deployed in practice, so I haven't bothered
> >suggesting it as something to try.
> >
> >If you want to try lazytime, an easier test might be to rebuild the
> >kernel with this change below to support the lazytime mount option
> >and not log the timestamp updates. This is essentially the mechanism
> >that I'll use for this, but it will need to grow more stuff to have
> >the correct lazytime semantics...
> >
> 
> We tried open by handle to see if lazytime would provide relief, but
> it looks like it just pushes the lock acquisition to another place:

Whack-a-mole.

This is the whole problem with driving the "nowait" semantics into
the filesystem implementations - every time we fix one blocking
point, we find a deeper one, and we have to drive the "nowait"
semantics deeper into code that should not have to care about IO
level blocking semantics. And by doing it in a "slap-a-bandaid on
it" process, we end up with spagetti code that is fragile and
unmaintainable...

> However, that function can EAGAIN (it does for IOLOCK) so maybe we
> can change xfs_ilock to xfs_ilock_nowait and teach it about not
> waiting for ILOCK too.

If only it were that simple. Why, exactly, does the direct IO write
code require the ILOCK exclusive? Indeed, if it goes to allocate
blocks, we do this:

                /*
                 * xfs_iomap_write_direct() expects the shared lock. It
                 * is unlocked on return.
                 */
                if (lockmode == XFS_ILOCK_EXCL)
                        xfs_ilock_demote(ip, lockmode);

We demote the lock to shared before we call into the allocation
code. And for pure direct IO writes, all we care about is ensuring
the extent map does not change while we do the lookup and check.
That only requires a shared lock.

So now I've got to go work out why need_excl_ilock() says we need
an exclusive ilock for direct IO writes when it looks pretty clear
to me that we don't. 

But that's only half the problem. The other problem is that even if
we take it shared, we're still going to block on IO completion
taking the ILOCK exclusive to do things like unwritten extent
completion. So we still need to hack about with "trylock" operations
into various functions (xfs_ilock_data_map_shared()
for one).

What a mess....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-19  2:40                         ` Dave Chinner
@ 2018-02-19  4:48                           ` Dave Chinner
  2018-02-25 17:47                           ` Avi Kivity
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2018-02-19  4:48 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs

On Mon, Feb 19, 2018 at 01:40:55PM +1100, Dave Chinner wrote:
> On Fri, Feb 16, 2018 at 10:07:55AM +0200, Avi Kivity wrote:
> > On 02/15/2018 11:30 PM, Dave Chinner wrote:
> > >On Thu, Feb 15, 2018 at 11:36:54AM +0200, Avi Kivity wrote:
> > >>On 02/15/2018 01:56 AM, Dave Chinner wrote:
> > >>A little bird whispered in my ear to try XFS_IOC_OPEN_BY_HANDLE to
> > >>avoid the time update lock, so we'll be trying that next, to
> > >>emulate lazytime.
> > >Biggest problem with that is it requires root permissions. It's not
> > >a solution that can be deployed in practice, so I haven't bothered
> > >suggesting it as something to try.
> > >
> > >If you want to try lazytime, an easier test might be to rebuild the
> > >kernel with this change below to support the lazytime mount option
> > >and not log the timestamp updates. This is essentially the mechanism
> > >that I'll use for this, but it will need to grow more stuff to have
> > >the correct lazytime semantics...
> > >
> > 
> > We tried open by handle to see if lazytime would provide relief, but
> > it looks like it just pushes the lock acquisition to another place:
> 
> Whack-a-mole.
> 
> This is the whole problem with driving the "nowait" semantics into
> the filesystem implementations - every time we fix one blocking
> point, we find a deeper one, and we have to drive the "nowait"
> semantics deeper into code that should not have to care about IO
> level blocking semantics. And by doing it in a "slap-a-bandaid on
> it" process, we end up with spagetti code that is fragile and
> unmaintainable...
> 
> > However, that function can EAGAIN (it does for IOLOCK) so maybe we
> > can change xfs_ilock to xfs_ilock_nowait and teach it about not
> > waiting for ILOCK too.
> 
> If only it were that simple. Why, exactly, does the direct IO write
> code require the ILOCK exclusive? Indeed, if it goes to allocate
> blocks, we do this:
> 
>                 /*
>                  * xfs_iomap_write_direct() expects the shared lock. It
>                  * is unlocked on return.
>                  */
>                 if (lockmode == XFS_ILOCK_EXCL)
>                         xfs_ilock_demote(ip, lockmode);
> 
> We demote the lock to shared before we call into the allocation
> code. And for pure direct IO writes, all we care about is ensuring
> the extent map does not change while we do the lookup and check.
> That only requires a shared lock.
> 
> So now I've got to go work out why need_excl_ilock() says we need
> an exclusive ilock for direct IO writes when it looks pretty clear
> to me that we don't. 
> 
> But that's only half the problem. The other problem is that even if
> we take it shared, we're still going to block on IO completion
> taking the ILOCK exclusive to do things like unwritten extent
> completion. So we still need to hack about with "trylock" operations
> into various functions (xfs_ilock_data_map_shared()
> for one).
> 
> What a mess....

Try the patch below on top of the lazytime stuff. Let's see where
the next layer of the onion is found.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: clean up xfs_file_iomap_begin, make it non-blocking

From: Dave Chinner <dchinner@redhat.com>

Gah, what a friggin' mess.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c | 166 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 101 insertions(+), 65 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 66e1edbfb2b2..9c7398c8f7e7 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -955,19 +955,52 @@ static inline bool imap_needs_alloc(struct inode *inode,
 		(IS_DAX(inode) && imap->br_state == XFS_EXT_UNWRITTEN);
 }
 
-static inline bool need_excl_ilock(struct xfs_inode *ip, unsigned flags)
+static int
+xfs_ilock_for_iomap(
+	struct xfs_inode	*ip,
+	unsigned		flags,
+	unsigned		*lockmode)
 {
+	unsigned		mode = XFS_ILOCK_SHARED;
+
+	/* Modifications to reflink files require exclusive access */
+	if (xfs_is_reflink_inode(ip) && (flags & (IOMAP_WRITE | IOMAP_ZERO))) {
+		/*
+		 * A reflinked inode will result in CoW alloc.
+		 * FIXME: It could still overwrite on unshared extents
+		 * and not need allocation in the direct IO path.
+		 */
+		if ((flags & IOMAP_NOWAIT) && (flags & IOMAP_DIRECT))
+			return -EAGAIN;
+		mode = XFS_ILOCK_EXCL;
+	}
+
+	/* Non-direct IO modifications require exclusive access */
+	if (!(flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE))
+		mode = XFS_ILOCK_EXCL;
+
 	/*
-	 * COW writes will allocate delalloc space, so we need to make sure
-	 * to take the lock exclusively here.
+	 * Extents not yet cached requires exclusive access, don't block.
+	 * This is an opencoded xfs_ilock_data_map_shared() call but with
+	 * non-blocking behaviour.
 	 */
-	if (xfs_is_reflink_inode(ip) && (flags & (IOMAP_WRITE | IOMAP_ZERO)))
-		return true;
-	if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE))
-		return true;
-	return false;
+	if (!(ip->i_df.if_flags & XFS_IFEXTENTS)) {
+		if (flags & IOMAP_NOWAIT)
+			return -EAGAIN;
+		mode = XFS_ILOCK_EXCL;
+	}
+
+	if (!xfs_ilock_nowait(ip, mode)) {
+		if (flags & IOMAP_NOWAIT)
+			return -EAGAIN;
+		xfs_ilock(ip, mode);
+	}
+
+	*lockmode = mode;
+	return 0;
 }
 
+
 static int
 xfs_file_iomap_begin(
 	struct inode		*inode,
@@ -993,17 +1026,15 @@ xfs_file_iomap_begin(
 		return xfs_file_iomap_begin_delay(inode, offset, length, iomap);
 	}
 
-	if (need_excl_ilock(ip, flags)) {
-		lockmode = XFS_ILOCK_EXCL;
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-	} else {
-		lockmode = xfs_ilock_data_map_shared(ip);
-	}
-
-	if ((flags & IOMAP_NOWAIT) && !(ip->i_df.if_flags & XFS_IFEXTENTS)) {
-		error = -EAGAIN;
-		goto out_unlock;
-	}
+	/*
+	 * Lock the inode in the manner required for the specified operation and
+	 * check for as many conditions that would result in blocking as
+	 * possible. This removes most of the non-blocking checks from the
+	 * mapping code below.
+	 */
+	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
+	if (error)
+		return error;
 
 	ASSERT(offset <= mp->m_super->s_maxbytes);
 	if (offset > mp->m_super->s_maxbytes - length)
@@ -1024,17 +1055,16 @@ xfs_file_iomap_begin(
 			goto out_unlock;
 	}
 
-	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
+	/* Non-modifying mapping requested, so we are done */
+	if (!(flags & (IOMAP_WRITE | IOMAP_ZERO)))
+		goto iomap_found;
+
+	/*
+	 * Break shared extents if necessary. Checks for blocking in the direct
+	 * IO case have been done up front, so we don't need to do them here.
+	 */
+	if (xfs_is_reflink_inode(ip)) {
 		if (flags & IOMAP_DIRECT) {
-			/*
-			 * A reflinked inode will result in CoW alloc.
-			 * FIXME: It could still overwrite on unshared extents
-			 * and not need allocation.
-			 */
-			if (flags & IOMAP_NOWAIT) {
-				error = -EAGAIN;
-				goto out_unlock;
-			}
 			/* may drop and re-acquire the ilock */
 			error = xfs_reflink_allocate_cow(ip, &imap, &shared,
 					&lockmode);
@@ -1050,46 +1080,45 @@ xfs_file_iomap_begin(
 		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
 	}
 
-	if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) {
-		/*
-		 * If nowait is set bail since we are going to make
-		 * allocations.
-		 */
-		if (flags & IOMAP_NOWAIT) {
-			error = -EAGAIN;
-			goto out_unlock;
-		}
-		/*
-		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
-		 * pages to keep the chunks of work done where somewhat symmetric
-		 * with the work writeback does. This is a completely arbitrary
-		 * number pulled out of thin air as a best guess for initial
-		 * testing.
-		 *
-		 * Note that the values needs to be less than 32-bits wide until
-		 * the lower level functions are updated.
-		 */
-		length = min_t(loff_t, length, 1024 * PAGE_SIZE);
-		/*
-		 * xfs_iomap_write_direct() expects the shared lock. It
-		 * is unlocked on return.
-		 */
-		if (lockmode == XFS_ILOCK_EXCL)
-			xfs_ilock_demote(ip, lockmode);
-		error = xfs_iomap_write_direct(ip, offset, length, &imap,
-				nimaps);
-		if (error)
-			return error;
+	/* Don't need to allocate over holes when doing zeroing operations. */
+	if (flags & IOMAP_ZERO)
+		goto iomap_found;
+	ASSERT(flags & IOMAP_WRITE);
 
-		iomap->flags = IOMAP_F_NEW;
-		trace_xfs_iomap_alloc(ip, offset, length, 0, &imap);
-	} else {
-		ASSERT(nimaps);
+	if (!imap_needs_alloc(inode, &imap, nimaps))
+		goto iomap_found;
 
-		xfs_iunlock(ip, lockmode);
-		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
+	/* Don't block on allocation if we are doing non-blocking IO */
+	if (flags & IOMAP_NOWAIT) {
+		error = -EAGAIN;
+		goto out_unlock;
 	}
 
+	/*
+	 * We cap the maximum length we map here to a sane size keep the chunks
+	 * of work done where somewhat symmetric with the work writeback does.
+	 * This is a completely arbitrary number pulled out of thin air as a
+	 * best guess for initial testing.
+	 *
+	 * Note that the values needs to be less than 32-bits wide until the
+	 * lower level functions are updated.
+	 */
+	length = min_t(loff_t, length, 1024 * PAGE_SIZE);
+
+	/*
+	 * xfs_iomap_write_direct() expects the shared lock. It is unlocked on
+	 * return.
+	 */
+	if (lockmode == XFS_ILOCK_EXCL)
+		xfs_ilock_demote(ip, lockmode);
+	error = xfs_iomap_write_direct(ip, offset, length, &imap, nimaps);
+	if (error)
+		return error;
+
+	iomap->flags = IOMAP_F_NEW;
+	trace_xfs_iomap_alloc(ip, offset, length, 0, &imap);
+
+iomap_finish:
 	if (xfs_ipincount(ip) && (ip->i_itemp->ili_fsync_fields
 				& ~XFS_ILOG_TIMESTAMP))
 		iomap->flags |= IOMAP_F_DIRTY;
@@ -1099,6 +1128,13 @@ xfs_file_iomap_begin(
 	if (shared)
 		iomap->flags |= IOMAP_F_SHARED;
 	return 0;
+
+iomap_found:
+	ASSERT(nimaps);
+	xfs_iunlock(ip, lockmode);
+	trace_xfs_iomap_found(ip, offset, length, 0, &imap);
+	goto iomap_finish;
+
 out_unlock:
 	xfs_iunlock(ip, lockmode);
 	return error;

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: xfs_buf_lock vs aio
  2018-02-19  2:40                         ` Dave Chinner
  2018-02-19  4:48                           ` Dave Chinner
@ 2018-02-25 17:47                           ` Avi Kivity
  1 sibling, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2018-02-25 17:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Sorry for the late reply, I was on vacation.


On 02/19/2018 04:40 AM, Dave Chinner wrote:
> On Fri, Feb 16, 2018 at 10:07:55AM +0200, Avi Kivity wrote:
>> On 02/15/2018 11:30 PM, Dave Chinner wrote:
>>> On Thu, Feb 15, 2018 at 11:36:54AM +0200, Avi Kivity wrote:
>>>> On 02/15/2018 01:56 AM, Dave Chinner wrote:
>>>> A little bird whispered in my ear to try XFS_IOC_OPEN_BY_HANDLE to
>>>> avoid the time update lock, so we'll be trying that next, to
>>>> emulate lazytime.
>>> Biggest problem with that is it requires root permissions. It's not
>>> a solution that can be deployed in practice, so I haven't bothered
>>> suggesting it as something to try.
>>>
>>> If you want to try lazytime, an easier test might be to rebuild the
>>> kernel with this change below to support the lazytime mount option
>>> and not log the timestamp updates. This is essentially the mechanism
>>> that I'll use for this, but it will need to grow more stuff to have
>>> the correct lazytime semantics...
>>>
>> We tried open by handle to see if lazytime would provide relief, but
>> it looks like it just pushes the lock acquisition to another place:
> Whack-a-mole.
>
> This is the whole problem with driving the "nowait" semantics into
> the filesystem implementations - every time we fix one blocking
> point, we find a deeper one, and we have to drive the "nowait"
> semantics deeper into code that should not have to care about IO
> level blocking semantics. And by doing it in a "slap-a-bandaid on
> it" process, we end up with spagetti code that is fragile and
> unmaintainable...

I agree it's not pretty. It would be worse if the kernel did not have 
strict error handling hygiene that allows you to add those EAGAIN 
returns without too much fear.

>
>> However, that function can EAGAIN (it does for IOLOCK) so maybe we
>> can change xfs_ilock to xfs_ilock_nowait and teach it about not
>> waiting for ILOCK too.
> If only it were that simple. Why, exactly, does the direct IO write
> code require the ILOCK exclusive? Indeed, if it goes to allocate
> blocks, we do this:
>
>                  /*
>                   * xfs_iomap_write_direct() expects the shared lock. It
>                   * is unlocked on return.
>                   */
>                  if (lockmode == XFS_ILOCK_EXCL)
>                          xfs_ilock_demote(ip, lockmode);
>
> We demote the lock to shared before we call into the allocation
> code. And for pure direct IO writes, all we care about is ensuring
> the extent map does not change while we do the lookup and check.
> That only requires a shared lock.
>
> So now I've got to go work out why need_excl_ilock() says we need
> an exclusive ilock for direct IO writes when it looks pretty clear
> to me that we don't.
>
> But that's only half the problem. The other problem is that even if
> we take it shared, we're still going to block on IO completion
> taking the ILOCK exclusive to do things like unwritten extent
> completion. So we still need to hack about with "trylock" operations
> into various functions (xfs_ilock_data_map_shared()
> for one).

I'll try your patch and report, thanks.

> What a mess....
>
> Cheers,
>
> Dave.


^ permalink raw reply	[flat|nested] 20+ messages in thread
