* io_submit() blocks for writes for substantial amount of time
@ 2017-09-19  8:50 Tomasz Grabiec
  2017-09-19 12:27 ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Tomasz Grabiec @ 2017-09-19  8:50 UTC
  To: linux-xfs

Hi,

On some systems, one of our tests triggers io_submit() calls that block
for on the order of 100ms when submitting writes [1]. This is
problematic, because we rely heavily on io_submit() being async.

Workload: open, (ftruncate, append*)*, close.

Kernel version: 4.12.9-300.fc26.x86_64
mount: /dev/nvme0n1p3 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

The blocking happens in the following places:

(1)

            7fff9287472f __schedule ([kernel.kallsyms])
            7fff92874d16 schedule ([kernel.kallsyms])
            7fff92878d42 schedule_timeout ([kernel.kallsyms])
            7fff92876478 wait_for_completion ([kernel.kallsyms])
            7fffc05bf231 xfs_buf_submit_wait ([kernel.kallsyms])
            7fffc05bf3d3 _xfs_buf_read ([kernel.kallsyms])
            7fffc05bf4e4 xfs_buf_read_map ([kernel.kallsyms])
            7fffc05f53ca xfs_trans_read_buf_map ([kernel.kallsyms])
            7fffc058b432 xfs_btree_read_buf_block.constprop.34 ([kernel.kallsyms])
            7fffc058b504 xfs_btree_lookup_get_block ([kernel.kallsyms])
            7fffc058f6ad xfs_btree_lookup ([kernel.kallsyms])
            7fffc0570919 xfs_alloc_lookup_eq ([kernel.kallsyms])
            7fffc0570c59 xfs_alloc_fixup_trees ([kernel.kallsyms])
            7fffc0573a2d xfs_alloc_ag_vextent_near ([kernel.kallsyms])
            7fffc0573db1 xfs_alloc_ag_vextent ([kernel.kallsyms])
            7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
            7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
            7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
            7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
            7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
            7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
            7fff922d46ca iomap_apply ([kernel.kallsyms])
            7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
            7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
            7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
            7fff922bc5d3 aio_write ([kernel.kallsyms])
            7fff922bcec1 do_io_submit ([kernel.kallsyms])
            7fff922bdd40 sys_io_submit ([kernel.kallsyms])
            7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
                     687 io_submit (/usr/lib64/libaio.so.1.0.1)
                  112373 seastar::reactor::flush_pending_aio (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)

(2)

            7fff9287472f __schedule ([kernel.kallsyms])
            7fff92874d16 schedule ([kernel.kallsyms])
            7fffc05e6265 _xfs_log_force ([kernel.kallsyms])
            7fffc05c2518 xfs_extent_busy_flush ([kernel.kallsyms])
            7fffc0572ccd xfs_alloc_ag_vextent_size ([kernel.kallsyms])
            7fffc0573d91 xfs_alloc_ag_vextent ([kernel.kallsyms])
            7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
            7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
            7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
            7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
            7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
            7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
            7fff922d46ca iomap_apply ([kernel.kallsyms])
            7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
            7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
            7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
            7fff922bc5d3 aio_write ([kernel.kallsyms])
            7fff922bcec1 do_io_submit ([kernel.kallsyms])
            7fff922bdd40 sys_io_submit ([kernel.kallsyms])
            7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
                     687 io_submit (/usr/lib64/libaio.so.1.0.1)
                  112373 seastar::reactor::flush_pending_aio (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)

Please advise: is this a known bug? When can it happen? Is there a way
to work around it to avoid blocking?

[1] https://github.com/scylladb/seastar/issues/340


Regards,
Tomasz Grabiec


* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19  8:50 io_submit() blocks for writes for substantial amount of time Tomasz Grabiec
@ 2017-09-19 12:27 ` Brian Foster
  2017-09-19 14:58   ` Christoph Hellwig
  2017-09-19 16:29   ` Avi Kivity
  0 siblings, 2 replies; 17+ messages in thread
From: Brian Foster @ 2017-09-19 12:27 UTC
  To: Tomasz Grabiec; +Cc: linux-xfs

On Tue, Sep 19, 2017 at 10:50:51AM +0200, Tomasz Grabiec wrote:
> Hi,
> 
> On some systems we are seeing one of our tests to trigger io_submit()
> calls to block when submitting writes for an order of 100ms [1]. This
> is problematic, because we heavily rely on io_submit() being async.
> 
> Workload: open, (ftruncate, append*)*, close.
> 
> Kernel version: 4.12.9-300.fc26.x86_64
> mount: /dev/nvme0n1p3 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
> 
> The blocking happens in the following places:
> 
> (1)
> 
>             7fff9287472f __schedule ([kernel.kallsyms])
>             7fff92874d16 schedule ([kernel.kallsyms])
>             7fff92878d42 schedule_timeout ([kernel.kallsyms])
>             7fff92876478 wait_for_completion ([kernel.kallsyms])
>             7fffc05bf231 xfs_buf_submit_wait ([kernel.kallsyms])
>             7fffc05bf3d3 _xfs_buf_read ([kernel.kallsyms])
>             7fffc05bf4e4 xfs_buf_read_map ([kernel.kallsyms])
>             7fffc05f53ca xfs_trans_read_buf_map ([kernel.kallsyms])
>             7fffc058b432 xfs_btree_read_buf_block.constprop.34
> ([kernel.kallsyms])
>             7fffc058b504 xfs_btree_lookup_get_block ([kernel.kallsyms])
>             7fffc058f6ad xfs_btree_lookup ([kernel.kallsyms])
>             7fffc0570919 xfs_alloc_lookup_eq ([kernel.kallsyms])
>             7fffc0570c59 xfs_alloc_fixup_trees ([kernel.kallsyms])
>             7fffc0573a2d xfs_alloc_ag_vextent_near ([kernel.kallsyms])
>             7fffc0573db1 xfs_alloc_ag_vextent ([kernel.kallsyms])
>             7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
>             7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
>             7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
>             7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
>             7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
>             7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
>             7fff922d46ca iomap_apply ([kernel.kallsyms])
>             7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
>             7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
>             7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
>             7fff922bc5d3 aio_write ([kernel.kallsyms])
>             7fff922bcec1 do_io_submit ([kernel.kallsyms])
>             7fff922bdd40 sys_io_submit ([kernel.kallsyms])
>             7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
>                      687 io_submit (/usr/lib64/libaio.so.1.0.1)
>                   112373 seastar::reactor::flush_pending_aio
> (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)

So you have a direct I/O write that requires block allocation. Block
allocation requires reading free space btree blocks to identify and fix
up remaining free extent records based on the allocation.

> 
> (2)
> 
>   7fff9287472f __schedule ([kernel.kallsyms])
>             7fff92874d16 schedule ([kernel.kallsyms])
>             7fffc05e6265 _xfs_log_force ([kernel.kallsyms])
>             7fffc05c2518 xfs_extent_busy_flush ([kernel.kallsyms])
>             7fffc0572ccd xfs_alloc_ag_vextent_size ([kernel.kallsyms])
>             7fffc0573d91 xfs_alloc_ag_vextent ([kernel.kallsyms])
>             7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
>             7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
>             7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
>             7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
>             7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
>             7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
>             7fff922d46ca iomap_apply ([kernel.kallsyms])
>             7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
>             7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
>             7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
>             7fff922bc5d3 aio_write ([kernel.kallsyms])
>             7fff922bcec1 do_io_submit ([kernel.kallsyms])
>             7fff922bdd40 sys_io_submit ([kernel.kallsyms])
>             7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
>                      687 io_submit (/usr/lib64/libaio.so.1.0.1)
>                   112373 seastar::reactor::flush_pending_aio
> (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)
> 

Another dio write that requires allocation. The allocation finds a busy
extent, which means the extent was recently freed but the associated
freeing transaction has not yet made it to the on-disk log. As such it
cannot be safely reused, so the allocator flushes the log to try to clear
the busy state and then retries the search for an extent.

> Please advise, is this a known bug? When can it happen? Is there a way
> to work it around to avoid blocking?
> 

I'm not sure how either could be considered a bug based on the stack
trace information alone. Allocations may require reading metadata and
reads are synchronous. This all seems like pretty basic filesystem
behavior.

I suppose performance may be a separate question. For the latter issue,
I'd be curious whether leaving more free space available in the
filesystem would help avoid running into busy extents. Perhaps having
more memory and thus a larger buffer cache for btree blocks could help
mitigate the former issue..? The deterministic workaround for both is to
preallocate the associated file. If the file would be too large, another
option may be to set an extent size hint to allocate the file in larger
chunks and amortize the cost of the allocations over multiple writes.
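
To make the two workarounds above concrete, here is a minimal user-space
sketch in C. The helper names (preallocate_file, set_extsize_hint) are
made up for this example; the underlying calls are fallocate(2) and the
FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls from <linux/fs.h>. I am
assuming the hint should be a multiple of the filesystem block size and
is set before the file has any extents allocated; please verify both
against your kernel before relying on this.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <linux/fs.h>

/*
 * Preallocate the first `len` bytes of the file so that later direct
 * I/O writes inside that range do not need block allocation.
 * Returns 0 on success, -1 with errno set on failure.
 */
int preallocate_file(int fd, off_t len)
{
    return fallocate(fd, 0, 0, len);
}

/*
 * Set an extent size hint of `bytes` so the filesystem allocates in
 * larger chunks and the allocation cost is spread over several writes.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_extsize_hint(int fd, unsigned int bytes)
{
    struct fsxattr fsx;

    assert(bytes > 0);

    /* Read the current attributes first so other flags are preserved. */
    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;

    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = bytes;

    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```

Both helpers only wrap a single call each, so the caller still has to
decide what to do on failure (for example, fall back to plain writes if
the filesystem does not support the hint).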

Brian

> [1] https://github.com/scylladb/seastar/issues/340
> 
> 
> Regards,
> Tomasz Grabiec


* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 12:27 ` Brian Foster
@ 2017-09-19 14:58   ` Christoph Hellwig
  2017-09-19 16:31     ` Avi Kivity
  2017-09-19 16:29   ` Avi Kivity
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2017-09-19 14:58 UTC
  To: Brian Foster; +Cc: Tomasz Grabiec, linux-xfs

On Tue, Sep 19, 2017 at 08:27:05AM -0400, Brian Foster wrote:
> > Please advise, is this a known bug? When can it happen? Is there a way
> > to work it around to avoid blocking?
> > 
> 
> I'm not sure how either could be considered a bug based on the stack
> trace information alone. Allocations may require reading metadata and
> reads are synchronous. This all seems like pretty basic filesystem
> behavior.
> 
> I suppose performance may be a separate question. For the latter issue,
> I'd be curious whether leaving more free space available in the
> filesystem would help avoid running into busy extents. Perhaps having
> more memory and thus a larger buffer cache for btree blocks could help
> mitigate the former issue..? The deterministic workaround for both is to
> preallocate the associated file. If the file would be too large, another
> option may be to set an extent size hint to allocate the file in larger
> chunks and amortize the cost of the allocations over multiple writes.

Note that Linux 4.13 and later support the RWF_NOWAIT flag, which will
return -EAGAIN from io_submit for these conditions so they can be
handled by a thread pool.
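
A rough sketch of what that could look like on the submitter side,
using the raw kernel ABI from <linux/aio_abi.h>. The helper names
(prep_nowait_pwrite, must_retry_blocking) are hypothetical. As far as I
can tell from io_submit(2), the -EAGAIN shows up in the res field of
the corresponding io_event rather than as the return value of the
io_submit() call itself, so the sketch checks it there; treat that as
an assumption to verify. The aio_rw_flags field needs 4.13 or newer
kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/aio_abi.h>
#include <linux/fs.h>

/* Fallback for older userspace headers; value taken from the kernel uapi. */
#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008
#endif

/*
 * Fill in an iocb for a write that must not block in io_submit().
 * The caller still submits it with io_submit() as usual.
 */
void prep_nowait_pwrite(struct iocb *cb, int fd, const void *buf,
                        size_t len, long long off)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_lio_opcode = IOCB_CMD_PWRITE;
    cb->aio_buf = (uint64_t)(uintptr_t)buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
    cb->aio_rw_flags = RWF_NOWAIT;   /* ask for -EAGAIN instead of blocking */
}

/*
 * Returns nonzero if the completed request was refused because it
 * would have blocked, i.e. it should be reissued without RWF_NOWAIT
 * from a context that is allowed to sleep (such as a thread pool).
 */
int must_retry_blocking(const struct io_event *ev)
{
    return ev->res == -EAGAIN;
}
```

The retry path would clear aio_rw_flags and resubmit from a helper
thread, so only the rare blocking cases pay the thread-pool cost.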

Note that until a few years ago we performed all allocations from
a workqueue; this was changed by:

commit cf11da9c5d374962913ca5ba0ce0886b58286224
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jul 15 07:08:24 2014 +1000

    xfs: refine the allocation stack switch

to only defer btree splits to a workqueue.  With that previous scheme
there might have been an option to defer AIO allocations to a workqueue,
but the main issue with that is that the worker thread which is then
going to do the actual data transfer would have to "borrow" the
mm_struct from the submitter.  That's the primary reason why something
like that was never implemented in mainline Linux.


* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 12:27 ` Brian Foster
  2017-09-19 14:58   ` Christoph Hellwig
@ 2017-09-19 16:29   ` Avi Kivity
  2017-09-19 17:38     ` Brian Foster
  1 sibling, 1 reply; 17+ messages in thread
From: Avi Kivity @ 2017-09-19 16:29 UTC
  To: Brian Foster, Tomasz Grabiec; +Cc: linux-xfs



On 09/19/2017 03:27 PM, Brian Foster wrote:
> On Tue, Sep 19, 2017 at 10:50:51AM +0200, Tomasz Grabiec wrote:
>> Hi,
>>
>> On some systems we are seeing one of our tests to trigger io_submit()
>> calls to block when submitting writes for an order of 100ms [1]. This
>> is problematic, because we heavily rely on io_submit() being async.
>>
>> Workload: open, (ftruncate, append*)*, close.
>>
>> Kernel version: 4.12.9-300.fc26.x86_64
>> mount: /dev/nvme0n1p3 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
>>
>> The blocking happens in the following places:
>>
>> (1)
>>
>>              7fff9287472f __schedule ([kernel.kallsyms])
>>              7fff92874d16 schedule ([kernel.kallsyms])
>>              7fff92878d42 schedule_timeout ([kernel.kallsyms])
>>              7fff92876478 wait_for_completion ([kernel.kallsyms])
>>              7fffc05bf231 xfs_buf_submit_wait ([kernel.kallsyms])
>>              7fffc05bf3d3 _xfs_buf_read ([kernel.kallsyms])
>>              7fffc05bf4e4 xfs_buf_read_map ([kernel.kallsyms])
>>              7fffc05f53ca xfs_trans_read_buf_map ([kernel.kallsyms])
>>              7fffc058b432 xfs_btree_read_buf_block.constprop.34
>> ([kernel.kallsyms])
>>              7fffc058b504 xfs_btree_lookup_get_block ([kernel.kallsyms])
>>              7fffc058f6ad xfs_btree_lookup ([kernel.kallsyms])
>>              7fffc0570919 xfs_alloc_lookup_eq ([kernel.kallsyms])
>>              7fffc0570c59 xfs_alloc_fixup_trees ([kernel.kallsyms])
>>              7fffc0573a2d xfs_alloc_ag_vextent_near ([kernel.kallsyms])
>>              7fffc0573db1 xfs_alloc_ag_vextent ([kernel.kallsyms])
>>              7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
>>              7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
>>              7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
>>              7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
>>              7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
>>              7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
>>              7fff922d46ca iomap_apply ([kernel.kallsyms])
>>              7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
>>              7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
>>              7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
>>              7fff922bc5d3 aio_write ([kernel.kallsyms])
>>              7fff922bcec1 do_io_submit ([kernel.kallsyms])
>>              7fff922bdd40 sys_io_submit ([kernel.kallsyms])
>>              7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
>>                       687 io_submit (/usr/lib64/libaio.so.1.0.1)
>>                    112373 seastar::reactor::flush_pending_aio
>> (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)
> So you have a direct I/O write that requires block allocation. Block
> allocation requires reading free space btree blocks to identify and fix
> up remaining free extent records based on the allocation.

Will an fallocate() call before the write in another thread help?

Will a write to a previously fallocate()d extent get blocked while 
fallocate()ing a new extent?

>
>> (2)
>>
>>    7fff9287472f __schedule ([kernel.kallsyms])
>>              7fff92874d16 schedule ([kernel.kallsyms])
>>              7fffc05e6265 _xfs_log_force ([kernel.kallsyms])
>>              7fffc05c2518 xfs_extent_busy_flush ([kernel.kallsyms])
>>              7fffc0572ccd xfs_alloc_ag_vextent_size ([kernel.kallsyms])
>>              7fffc0573d91 xfs_alloc_ag_vextent ([kernel.kallsyms])
>>              7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
>>              7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
>>              7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
>>              7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
>>              7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
>>              7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
>>              7fff922d46ca iomap_apply ([kernel.kallsyms])
>>              7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
>>              7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
>>              7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
>>              7fff922bc5d3 aio_write ([kernel.kallsyms])
>>              7fff922bcec1 do_io_submit ([kernel.kallsyms])
>>              7fff922bdd40 sys_io_submit ([kernel.kallsyms])
>>              7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
>>                       687 io_submit (/usr/lib64/libaio.so.1.0.1)
>>                    112373 seastar::reactor::flush_pending_aio
>> (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)
>>
> Another dio write that requires allocation. The allocation finds a busy
> extent, which means the extent was recently freed but the associated
> freeing transaction has not yet made it to the on-disk log. As such it
> cannot be safely reused, so the allocator flushes the log and retries to
> try and clear the busy state and find an extent.

Is that because the disk is nearly full and there are no known flushed 
extents, or because the allocator doesn't prioritize known-flushed 
extents? From your comments below I gather you may not know for sure.

>
>> Please advise, is this a known bug? When can it happen? Is there a way
>> to work it around to avoid blocking?
>>
> I'm not sure how either could be considered a bug based on the stack
> trace information alone. Allocations may require reading metadata and
> reads are synchronous. This all seems like pretty basic filesystem
> behavior.

Synchronous behavior in an asynchronous system call can be considered a 
bug, although of course this has been the case in Linux since forever. 
If there are ways we can get the filesystem to behave more 
asynchronously (like nowait aio) we'll use them.

>
> I suppose performance may be a separate question. For the latter issue,
> I'd be curious whether leaving more free space available in the
> filesystem would help avoid running into busy extents. Perhaps having
> more memory and thus a larger buffer cache for btree blocks could help
> mitigate the former issue..? The deterministic workaround for both is to
> preallocate the associated file. If the file would be too large, another
> option may be to set an extent size hint to allocate the file in larger
> chunks and amortize the cost of the allocations over multiple writes.

We do set the allocation size hint. We don't really know the file size 
in advance though. If fallocate() and io_submit() can run in parallel 
without fallocate() blocking io_submit(), we can have another thread run 
ahead of the writer and issue fallocate()s. I guess we can double the 
fallocate() size each time to amortize the effort.

Is ftruncate() sufficient to release extents past-the-end, or do we need 
an extra FALLOC_FL_PUNCH_HOLE?



* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 14:58   ` Christoph Hellwig
@ 2017-09-19 16:31     ` Avi Kivity
  2017-09-19 17:39       ` Brian Foster
  2017-09-19 20:34       ` Christoph Hellwig
  0 siblings, 2 replies; 17+ messages in thread
From: Avi Kivity @ 2017-09-19 16:31 UTC
  To: Christoph Hellwig, Brian Foster; +Cc: Tomasz Grabiec, linux-xfs



On 09/19/2017 05:58 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 08:27:05AM -0400, Brian Foster wrote:
>>> Please advise, is this a known bug? When can it happen? Is there a way
>>> to work it around to avoid blocking?
>>>
>> I'm not sure how either could be considered a bug based on the stack
>> trace information alone. Allocations may require reading metadata and
>> reads are synchronous. This all seems like pretty basic filesystem
>> behavior.
>>
>> I suppose performance may be a separate question. For the latter issue,
>> I'd be curious whether leaving more free space available in the
>> filesystem would help avoid running into busy extents. Perhaps having
>> more memory and thus a larger buffer cache for btree blocks could help
>> mitigate the former issue..? The deterministic workaround for both is to
>> preallocate the associated file. If the file would be too large, another
>> option may be to set an extent size hint to allocate the file in larger
>> chunks and amortize the cost of the allocations over multiple writes.
> Note that Linux 4.13 and later support a RWF_NOWAIT flag, that will
> return -EAGAIN from io_submit for these conditions so they can be
> handled by a thread pool.
>
> Note that until a few years ago we performed all allocations from
> a workqueue, this was changed by:
>
> commit cf11da9c5d374962913ca5ba0ce0886b58286224
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Tue Jul 15 07:08:24 2014 +1000
>
>      xfs: refine the allocation stack switch
>
> to only defer btree splits to a workqueue.  With that previous scheme
> there might have been an option to defer AIO allocations to a workqueue,
> but the main issue with that is that the worker thread which is then
> going to do the actual data transfer would have to "borrow" the
> mm_struct from the submitter.  That's the primary reason why something
> like that was never implemented in mainline Linux.

For DIO, does it really need the mm_struct? It can just pin the pages 
and pass them to the workqueue function.



* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 16:29   ` Avi Kivity
@ 2017-09-19 17:38     ` Brian Foster
  2017-09-19 17:53       ` Tomasz Grabiec
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2017-09-19 17:38 UTC
  To: Avi Kivity; +Cc: Tomasz Grabiec, linux-xfs

On Tue, Sep 19, 2017 at 07:29:18PM +0300, Avi Kivity wrote:
> 
> 
> On 09/19/2017 03:27 PM, Brian Foster wrote:
> > On Tue, Sep 19, 2017 at 10:50:51AM +0200, Tomasz Grabiec wrote:
> > > Hi,
> > > 
> > > On some systems we are seeing one of our tests to trigger io_submit()
> > > calls to block when submitting writes for an order of 100ms [1]. This
> > > is problematic, because we heavily rely on io_submit() being async.
> > > 
> > > Workload: open, (ftruncate, append*)*, close.
> > > 
> > > Kernel version: 4.12.9-300.fc26.x86_64
> > > mount: /dev/nvme0n1p3 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
> > > 
> > > The blocking happens in the following places:
> > > 
> > > (1)
> > > 
> > >              7fff9287472f __schedule ([kernel.kallsyms])
> > >              7fff92874d16 schedule ([kernel.kallsyms])
> > >              7fff92878d42 schedule_timeout ([kernel.kallsyms])
> > >              7fff92876478 wait_for_completion ([kernel.kallsyms])
> > >              7fffc05bf231 xfs_buf_submit_wait ([kernel.kallsyms])
> > >              7fffc05bf3d3 _xfs_buf_read ([kernel.kallsyms])
> > >              7fffc05bf4e4 xfs_buf_read_map ([kernel.kallsyms])
> > >              7fffc05f53ca xfs_trans_read_buf_map ([kernel.kallsyms])
> > >              7fffc058b432 xfs_btree_read_buf_block.constprop.34
> > > ([kernel.kallsyms])
> > >              7fffc058b504 xfs_btree_lookup_get_block ([kernel.kallsyms])
> > >              7fffc058f6ad xfs_btree_lookup ([kernel.kallsyms])
> > >              7fffc0570919 xfs_alloc_lookup_eq ([kernel.kallsyms])
> > >              7fffc0570c59 xfs_alloc_fixup_trees ([kernel.kallsyms])
> > >              7fffc0573a2d xfs_alloc_ag_vextent_near ([kernel.kallsyms])
> > >              7fffc0573db1 xfs_alloc_ag_vextent ([kernel.kallsyms])
> > >              7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
> > >              7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
> > >              7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
> > >              7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
> > >              7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
> > >              7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
> > >              7fff922d46ca iomap_apply ([kernel.kallsyms])
> > >              7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
> > >              7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
> > >              7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
> > >              7fff922bc5d3 aio_write ([kernel.kallsyms])
> > >              7fff922bcec1 do_io_submit ([kernel.kallsyms])
> > >              7fff922bdd40 sys_io_submit ([kernel.kallsyms])
> > >              7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
> > >                       687 io_submit (/usr/lib64/libaio.so.1.0.1)
> > >                    112373 seastar::reactor::flush_pending_aio
> > > (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)
> > So you have a direct I/O write that requires block allocation. Block
> > allocation requires reading free space btree blocks to identify and fix
> > up remaining free extent records based on the allocation.
> 
> Will an fallocate() call before the write in another thread help?
> 

Preallocating the file (or largish ranges) should help. I'm not sure
preallocating the range of each and every write will have the behavior
you want.

> Will a write to a previously fallocate()d extent get blocked while
> fallocate()ing a new extent?
> 

Any dio can most likely block behind an fallocate call due to locking
(just like any write that requires allocation can block behind another
such write).

> > 
> > > (2)
> > > 
> > >    7fff9287472f __schedule ([kernel.kallsyms])
> > >              7fff92874d16 schedule ([kernel.kallsyms])
> > >              7fffc05e6265 _xfs_log_force ([kernel.kallsyms])
> > >              7fffc05c2518 xfs_extent_busy_flush ([kernel.kallsyms])
> > >              7fffc0572ccd xfs_alloc_ag_vextent_size ([kernel.kallsyms])
> > >              7fffc0573d91 xfs_alloc_ag_vextent ([kernel.kallsyms])
> > >              7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
> > >              7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
> > >              7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
> > >              7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
> > >              7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
> > >              7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
> > >              7fff922d46ca iomap_apply ([kernel.kallsyms])
> > >              7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
> > >              7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
> > >              7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
> > >              7fff922bc5d3 aio_write ([kernel.kallsyms])
> > >              7fff922bcec1 do_io_submit ([kernel.kallsyms])
> > >              7fff922bdd40 sys_io_submit ([kernel.kallsyms])
> > >              7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
> > >                       687 io_submit (/usr/lib64/libaio.so.1.0.1)
> > >                    112373 seastar::reactor::flush_pending_aio
> > > (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)
> > > 
> > Another dio write that requires allocation. The allocation finds a busy
> > extent, which means the extent was recently freed but the associated
> > freeing transaction has not yet made it to the on-disk log. As such it
> > cannot be safely reused, so the allocator flushes the log and retries to
> > try and clear the busy state and find an extent.
> 
> Is that because the disk is nearly full and there are no known flushed
> extents, or because the allocator doesn't prioritize known-flushed extents?
> From your comments below I gather you may not know for sure.
> 

I'm not sure without digging further into it. Hence the question around
free space availability.

> > 
> > > Please advise, is this a known bug? When can it happen? Is there a way
> > > to work it around to avoid blocking?
> > > 
> > I'm not sure how either could be considered a bug based on the stack
> > trace information alone. Allocations may require reading metadata and
> > reads are synchronous. This all seems like pretty basic filesystem
> > behavior.
> 
> Synchronous behavior in an asynchronous system call can be considered a bug,
> although of course this has been the case in Linux since forever. If there
> are ways we can get the filesystem to behave more asynchronously (like
> nowait aio) we'll use them.
> 

I think the RWF_NOWAIT thing that hch pointed out is intended to cover
this (i.e., if you must be absolutely sure that nothing will block the
current thread). It looks like it will skip calls that require
allocations, fail to acquire locks, etc. so they can be deferred.

> > 
> > I suppose performance may be a separate question. For the latter issue,
> > I'd be curious whether leaving more free space available in the
> > filesystem would help avoid running into busy extents. Perhaps having
> > more memory and thus a larger buffer cache for btree blocks could help
> > mitigate the former issue..? The deterministic workaround for both is to
> > preallocate the associated file. If the file would be too large, another
> > option may be to set an extent size hint to allocate the file in larger
> > chunks and amortize the cost of the allocations over multiple writes.
> 
> We do set the allocation size hint. We don't really know the file size in
> advance though. If fallocate() and io_submit() can run in parallel without
> fallocate() blocking io_submit(), we can have another thread run ahead of
> the writer and issue fallocate()s. I guess we can double the fallocate()
> size each time to amortize the effort.
> 

What size hint? I'm not familiar with your workload/requirements, but it
sounds like RWF_NOWAIT might be what you want. You can defer any write
that requires allocation outright and any subsequent write that ends up
blocked due to locking would also return -EAGAIN.

If you did end up deferring those calls in favor of an fallocate, you
could certainly amortize the cost by doing aggressive post-eof
allocations. XFS does something similar internally to preserve
contiguity of delayed allocations. 
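
As an illustration of the amortization idea, here is a small sketch of
a preallocation policy that doubles the chunk size on every step up to
a cap. The struct and function names are invented for this example, and
the starting chunk and cap are arbitrary. It uses fallocate(2) with
FALLOC_FL_KEEP_SIZE so the preallocated blocks stay beyond EOF and the
visible file size does not change.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical bookkeeping for a run-ahead preallocator. */
struct prealloc_state {
    off_t end;        /* first byte that is not yet preallocated */
    off_t chunk;      /* size of the next preallocation */
    off_t max_chunk;  /* upper bound for the chunk size */
};

/*
 * Pure policy step: report the next range to preallocate and advance
 * the state, doubling the chunk size until it reaches max_chunk.
 */
void prealloc_plan(struct prealloc_state *st, off_t *off, off_t *len)
{
    *off = st->end;
    *len = st->chunk;
    st->end += st->chunk;
    if (st->chunk < st->max_chunk) {
        st->chunk *= 2;
        if (st->chunk > st->max_chunk)
            st->chunk = st->max_chunk;
    }
}

/*
 * Preallocate the next range beyond EOF without changing the file
 * size. The state only advances if fallocate() succeeds.
 * Returns 0 on success, -1 with errno set on failure.
 */
int prealloc_ahead(int fd, struct prealloc_state *st)
{
    struct prealloc_state next = *st;
    off_t off, len;

    prealloc_plan(&next, &off, &len);
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, off, len) < 0)
        return -1;
    *st = next;
    return 0;
}
```

With a 1 MiB starting chunk and an 8 MiB cap, successive calls cover
1, 2, 4, 8, 8, ... MiB, so the number of allocation calls grows only
logarithmically until the cap is reached.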

> Is ftruncate() sufficient to release extents past-the-end, or do we need an
> extra FALLOC_FL_PUNCH_HOLE?

Yes, a truncate trims post-eof blocks.

Brian



* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 16:31     ` Avi Kivity
@ 2017-09-19 17:39       ` Brian Foster
  2017-09-19 20:34         ` Christoph Hellwig
  2017-09-20  6:17         ` Avi Kivity
  2017-09-19 20:34       ` Christoph Hellwig
  1 sibling, 2 replies; 17+ messages in thread
From: Brian Foster @ 2017-09-19 17:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, Tomasz Grabiec, linux-xfs

On Tue, Sep 19, 2017 at 07:31:04PM +0300, Avi Kivity wrote:
> 
> 
> On 09/19/2017 05:58 PM, Christoph Hellwig wrote:
> > On Tue, Sep 19, 2017 at 08:27:05AM -0400, Brian Foster wrote:
> > > > Please advise, is this a known bug? When can it happen? Is there a way
> > > > to work it around to avoid blocking?
> > > > 
> > > I'm not sure how either could be considered a bug based on the stack
> > > trace information alone. Allocations may require reading metadata and
> > > reads are synchronous. This all seems like pretty basic filesystem
> > > behavior.
> > > 
> > > I suppose performance may be a separate question. For the latter issue,
> > > I'd be curious whether leaving more free space available in the
> > > filesystem would help avoid running into busy extents. Perhaps having
> > > more memory and thus a larger buffer cache for btree blocks could help
> > > mitigate the former issue..? The deterministic workaround for both is to
> > > preallocate the associated file. If the file would be too large, another
> > > option may be to set an extent size hint to allocate the file in larger
> > > chunks and amortize the cost of the allocations over multiple writes.
> > Note that Linux 4.13 and later support a RWF_NOWAIT flag, that will
> > return -EAGAIN from io_submit for these conditions so they can be
> > handled by a thread pool.
> > 
> > Note that until a few years ago we performed all allocations from
> > a workqueue, this was changed by:
> > 
> > commit cf11da9c5d374962913ca5ba0ce0886b58286224
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date:   Tue Jul 15 07:08:24 2014 +1000
> > 
> >      xfs: refine the allocation stack switch
> > 
> > to only defer btree splits to a workqueue.  With that previous scheme
> > there might have been an option to defer AIO allocations to a workqueue,
> > but the main issue with that is that the worker thread which is then
> > going to do the actual data transfer would have to "borrow" the
> > mm_struct from the submitter.  That's the primary reason why something
> > like that was never implemented in mainline Linux.
> 
> For DIO, does it really need the mm_struct? It can just pin the pages and
> pass them to the workqueue function.
> 

I'm not sure what difference it makes regardless. We still have to wait
for an allocation to complete before we can issue an I/O. IIRC, the old
defer allocs to a wq thing was more about saving stack space than
providing async behavior.

Brian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 17:38     ` Brian Foster
@ 2017-09-19 17:53       ` Tomasz Grabiec
  2017-09-19 23:38         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Tomasz Grabiec @ 2017-09-19 17:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: Avi Kivity, linux-xfs

On Tue, Sep 19, 2017 at 7:38 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Tue, Sep 19, 2017 at 07:29:18PM +0300, Avi Kivity wrote:
>>
>>
>> On 09/19/2017 03:27 PM, Brian Foster wrote:
>> > On Tue, Sep 19, 2017 at 10:50:51AM +0200, Tomasz Grabiec wrote:
>> > > Hi,
>> > >
>> > > On some systems we are seeing one of our tests to trigger io_submit()
>> > > calls to block when submitting writes for an order of 100ms [1]. This
>> > > is problematic, because we heavily rely on io_submit() being async.
>> > >
>> > > Workload: open, (ftruncate, append*)*, close.
>> > >
>> > > Kernel version: 4.12.9-300.fc26.x86_64
>> > > mount: /dev/nvme0n1p3 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
>> > >
>> > > The blocking happens in the following places:
>> > >
>> > > (1)
>> > >
>> > >              7fff9287472f __schedule ([kernel.kallsyms])
>> > >              7fff92874d16 schedule ([kernel.kallsyms])
>> > >              7fff92878d42 schedule_timeout ([kernel.kallsyms])
>> > >              7fff92876478 wait_for_completion ([kernel.kallsyms])
>> > >              7fffc05bf231 xfs_buf_submit_wait ([kernel.kallsyms])
>> > >              7fffc05bf3d3 _xfs_buf_read ([kernel.kallsyms])
>> > >              7fffc05bf4e4 xfs_buf_read_map ([kernel.kallsyms])
>> > >              7fffc05f53ca xfs_trans_read_buf_map ([kernel.kallsyms])
>> > >              7fffc058b432 xfs_btree_read_buf_block.constprop.34 ([kernel.kallsyms])
>> > >              7fffc058b504 xfs_btree_lookup_get_block ([kernel.kallsyms])
>> > >              7fffc058f6ad xfs_btree_lookup ([kernel.kallsyms])
>> > >              7fffc0570919 xfs_alloc_lookup_eq ([kernel.kallsyms])
>> > >              7fffc0570c59 xfs_alloc_fixup_trees ([kernel.kallsyms])
>> > >              7fffc0573a2d xfs_alloc_ag_vextent_near ([kernel.kallsyms])
>> > >              7fffc0573db1 xfs_alloc_ag_vextent ([kernel.kallsyms])
>> > >              7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
>> > >              7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
>> > >              7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
>> > >              7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
>> > >              7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
>> > >              7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
>> > >              7fff922d46ca iomap_apply ([kernel.kallsyms])
>> > >              7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
>> > >              7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
>> > >              7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
>> > >              7fff922bc5d3 aio_write ([kernel.kallsyms])
>> > >              7fff922bcec1 do_io_submit ([kernel.kallsyms])
>> > >              7fff922bdd40 sys_io_submit ([kernel.kallsyms])
>> > >              7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
>> > >                       687 io_submit (/usr/lib64/libaio.so.1.0.1)
>> > >                    112373 seastar::reactor::flush_pending_aio (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)
>> > So you have a direct I/O write that requires block allocation. Block
>> > allocation requires reading free space btree blocks to identify and fix
>> > up remaining free extent records based on the allocation.
>>
>> Will an fallocate() call before the write in another thread help?
>>
>
> Preallocating the file (or largish ranges) should help. I'm not sure
> preallocating the range of each and every write will have the behavior
> you want.
>
>> Will a write to a previously fallocate()d extent get blocked while
>> fallocate()ing a new extent?
>>
>
> Any dio can most likely block behind an fallocate call due to locking
> (just like any write that requires allocation can block behind another
> such write).
>
>> >
>> > > (2)
>> > >
>> > >              7fff9287472f __schedule ([kernel.kallsyms])
>> > >              7fff92874d16 schedule ([kernel.kallsyms])
>> > >              7fffc05e6265 _xfs_log_force ([kernel.kallsyms])
>> > >              7fffc05c2518 xfs_extent_busy_flush ([kernel.kallsyms])
>> > >              7fffc0572ccd xfs_alloc_ag_vextent_size ([kernel.kallsyms])
>> > >              7fffc0573d91 xfs_alloc_ag_vextent ([kernel.kallsyms])
>> > >              7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
>> > >              7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
>> > >              7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
>> > >              7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
>> > >              7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
>> > >              7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
>> > >              7fff922d46ca iomap_apply ([kernel.kallsyms])
>> > >              7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
>> > >              7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
>> > >              7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
>> > >              7fff922bc5d3 aio_write ([kernel.kallsyms])
>> > >              7fff922bcec1 do_io_submit ([kernel.kallsyms])
>> > >              7fff922bdd40 sys_io_submit ([kernel.kallsyms])
>> > >              7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
>> > >                       687 io_submit (/usr/lib64/libaio.so.1.0.1)
>> > >                    112373 seastar::reactor::flush_pending_aio (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)
>> > >
>> > Another dio write that requires allocation. The allocation finds a busy
>> > extent, which means the extent was recently freed but the associated
>> > freeing transaction has not yet made it to the on-disk log. As such it
>> > cannot be safely reused, so the allocator flushes the log and retries to
>> > try and clear the busy state and find an extent.
>>
>> Is that because the disk is nearly full and there are no known flushed
>> extents, or because the allocator doesn't prioritize known-flushed extents?
>> From your comments below I gather you may not know for sure.
>>
>
> I'm not sure without digging further into it. Hence the question around
> free space availability.

The file system was utilized between 90% and 95% out of 165GB during the test.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 16:31     ` Avi Kivity
  2017-09-19 17:39       ` Brian Foster
@ 2017-09-19 20:34       ` Christoph Hellwig
  2017-09-20  6:14         ` Avi Kivity
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2017-09-19 20:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, Brian Foster, Tomasz Grabiec, linux-xfs

On Tue, Sep 19, 2017 at 07:31:04PM +0300, Avi Kivity wrote:
> For DIO, does it really need the mm_struct? It can just pin the pages and
> pass them to the workqueue function.

We can't pin all the pages for a huge I/O at the same time.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 17:39       ` Brian Foster
@ 2017-09-19 20:34         ` Christoph Hellwig
  2017-09-20  6:17         ` Avi Kivity
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2017-09-19 20:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: Avi Kivity, Christoph Hellwig, Tomasz Grabiec, linux-xfs

On Tue, Sep 19, 2017 at 01:39:55PM -0400, Brian Foster wrote:
> I'm not sure what difference it makes regardless. We still have to wait
> for an allocation to complete before we can issue an I/O. IIRC, the old
> defer allocs to a wq thing was more about saving stack space than
> providing async behavior.

At least in theory we could do the allocation from one workqueue
and submit the I/O from the next one.  Except for the lack of modern
workqueues, that is what the historic AIO code in RHEL2.1 (and maybe 3.0,
but I'm not sure) did.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 17:53       ` Tomasz Grabiec
@ 2017-09-19 23:38         ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2017-09-19 23:38 UTC (permalink / raw)
  To: Tomasz Grabiec; +Cc: Brian Foster, Avi Kivity, linux-xfs

On Tue, Sep 19, 2017 at 07:53:52PM +0200, Tomasz Grabiec wrote:
> On Tue, Sep 19, 2017 at 7:38 PM, Brian Foster <bfoster@redhat.com> wrote:
> >> Is that because the disk is nearly full and there are no known flushed
> >> extents, or because the allocator doesn't prioritize known-flushed extents?
> >> From your comments below I gather you may not know for sure.
> >>
> >
> > I'm not sure without digging further into it. Hence the question around
> > free space availability.
> 
> The file system was utilized between 90% and 95% out of 165GB during the test.

Then I'm surprised that you only hit these relatively minor delays.
Once you get beyond 85-90% full the filesystem is typically not
running through allocation fast paths, as large contiguous free
spaces are getting to be non-existent (especially for small
filesystems like this).

Continued operation at >85-90% full will run you into premature
aging situations like free space fragmentation, after which your
allocations and then your data IO patterns will begin to suffer.  Once
you get above 95% full, various algorithms will even stop attempting
optimal allocations and instead start optimising for minimum size at
the expense of increased file fragmentation.

IOWs, expect unpredictable delays in the filesystem once you start
approaching ENOSPC, and the closer to ENOSPC you get the more
unpredictable the filesystem behaviour will get. And if you spend
long periods of time operating near ENOSPC, performance and
behaviour may not improve unless you free a large amount of space
(e.g. 50% of filesystem space) so that large contiguous free space
regions can reform....

So, yes, you can operate at near ENOSPC conditions. However, it's not
advisable if you require deterministic/predictable behaviour and/or
have long term filesystem performance requirements...
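
As a rough illustration of watching for this from the application side, the
sketch below computes filesystem utilization with statvfs(2). The function
names and the 0.85 default are assumptions for this example, the latter
taken from the lower end of the 85-90% range above; it is not a tuned
value.

```python
import os


def utilization(total_blocks, free_blocks):
    """Fraction of filesystem blocks in use, in the range [0.0, 1.0]."""
    if total_blocks <= 0:
        return 0.0
    return (total_blocks - free_blocks) / total_blocks


def fs_utilization(path):
    """Utilization of the filesystem containing path, via statvfs(2)."""
    st = os.statvfs(path)
    return utilization(st.f_blocks, st.f_bfree)


def too_full(path, threshold=0.85):
    """True when the filesystem is above the assumed comfort threshold."""
    return fs_utilization(path) > threshold
```

A test harness could call too_full() on the data directory before a run and
refuse to report latency numbers from a nearly full filesystem.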

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 20:34       ` Christoph Hellwig
@ 2017-09-20  6:14         ` Avi Kivity
  0 siblings, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2017-09-20  6:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, Tomasz Grabiec, linux-xfs

On 09/19/2017 11:34 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 07:31:04PM +0300, Avi Kivity wrote:
>> For DIO, does it really need the mm_struct? It can just pin the pages and
>> pass them to the workqueue function.
> We can't pin all the pages for a huge I/O at the same time.

Is it not legal to perform a short write?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-19 17:39       ` Brian Foster
  2017-09-19 20:34         ` Christoph Hellwig
@ 2017-09-20  6:17         ` Avi Kivity
  2017-09-20 10:50           ` Brian Foster
  1 sibling, 1 reply; 17+ messages in thread
From: Avi Kivity @ 2017-09-20  6:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, Tomasz Grabiec, linux-xfs

On 09/19/2017 08:39 PM, Brian Foster wrote:
> On Tue, Sep 19, 2017 at 07:31:04PM +0300, Avi Kivity wrote:
>>
>> On 09/19/2017 05:58 PM, Christoph Hellwig wrote:
>>> On Tue, Sep 19, 2017 at 08:27:05AM -0400, Brian Foster wrote:
>>>>> Please advise, is this a known bug? When can it happen? Is there a way
>>>>> to work it around to avoid blocking?
>>>>>
>>>> I'm not sure how either could be considered a bug based on the stack
>>>> trace information alone. Allocations may require reading metadata and
>>>> reads are synchronous. This all seems like pretty basic filesystem
>>>> behavior.
>>>>
>>>> I suppose performance may be a separate question. For the latter issue,
>>>> I'd be curious whether leaving more free space available in the
>>>> filesystem would help avoid running into busy extents. Perhaps having
>>>> more memory and thus a larger buffer cache for btree blocks could help
>>>> mitigate the former issue..? The deterministic workaround for both is to
>>>> preallocate the associated file. If the file would be too large, another
>>>> option may be to set an extent size hint to allocate the file in larger
>>>> chunks and amortize the cost of the allocations over multiple writes.
>>> Note that Linux 4.13 and later support a RWF_NOWAIT flag, that will
>>> return -EAGAIN from io_submit for these conditions so they can be
>>> handled by a thread pool.
>>>
>>> Note that until a few years ago we performed all allocations from
>>> a workqueue, this was changed by:
>>>
>>> commit cf11da9c5d374962913ca5ba0ce0886b58286224
>>> Author: Dave Chinner <dchinner@redhat.com>
>>> Date:   Tue Jul 15 07:08:24 2014 +1000
>>>
>>>       xfs: refine the allocation stack switch
>>>
>>> to only defer btree splits to a workqueue.  With that previous scheme
>>> there might have been an option to defer AIO allocations to a workqueue,
>>> but the main issue with that is that the worker thread which is then
>>> going to do the actual data transfer would have to "borrow" the
>>> mm_struct from the submitter.  That's the primary reason why something
>>> like that was never implemented in mainline Linux.
>> For DIO, does it really need the mm_struct? It can just pin the pages and
>> pass them to the workqueue function.
>>
> I'm not sure what difference it makes regardless. We still have to wait
> for an allocation to complete before we can issue an I/O.

If io_submit() returns immediately rather than blocking, it makes a huge 
difference. Waiting in the workqueue can be done in parallel to other 
I/O and in parallel to cpu work in the caller thread. Blocking means no 
further I/O is issued and no cpu work is done.

>   IIRC, the old
> defer allocs to a wq thing was more about saving stack space than
> providing async behavior.

Perhaps, but IMO the async behavior is a major feature of the aio system 
calls. It is very hard to use them if they block.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-20  6:17         ` Avi Kivity
@ 2017-09-20 10:50           ` Brian Foster
  2017-09-20 11:11             ` Avi Kivity
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2017-09-20 10:50 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, Tomasz Grabiec, linux-xfs

On Wed, Sep 20, 2017 at 09:17:25AM +0300, Avi Kivity wrote:
> On 09/19/2017 08:39 PM, Brian Foster wrote:
> > On Tue, Sep 19, 2017 at 07:31:04PM +0300, Avi Kivity wrote:
> > > 
> > > On 09/19/2017 05:58 PM, Christoph Hellwig wrote:
> > > > On Tue, Sep 19, 2017 at 08:27:05AM -0400, Brian Foster wrote:
> > > > > > Please advise, is this a known bug? When can it happen? Is there a way
> > > > > > to work it around to avoid blocking?
> > > > > > 
> > > > > I'm not sure how either could be considered a bug based on the stack
> > > > > trace information alone. Allocations may require reading metadata and
> > > > > reads are synchronous. This all seems like pretty basic filesystem
> > > > > behavior.
> > > > > 
> > > > > I suppose performance may be a separate question. For the latter issue,
> > > > > I'd be curious whether leaving more free space available in the
> > > > > filesystem would help avoid running into busy extents. Perhaps having
> > > > > more memory and thus a larger buffer cache for btree blocks could help
> > > > > mitigate the former issue..? The deterministic workaround for both is to
> > > > > preallocate the associated file. If the file would be too large, another
> > > > > option may be to set an extent size hint to allocate the file in larger
> > > > > chunks and amortize the cost of the allocations over multiple writes.
> > > > Note that Linux 4.13 and later support a RWF_NOWAIT flag, that will
> > > > return -EAGAIN from io_submit for these conditions so they can be
> > > > handled by a thread pool.
> > > > 
> > > > Note that until a few years ago we performed all allocations from
> > > > a workqueue, this was changed by:
> > > > 
> > > > commit cf11da9c5d374962913ca5ba0ce0886b58286224
> > > > Author: Dave Chinner <dchinner@redhat.com>
> > > > Date:   Tue Jul 15 07:08:24 2014 +1000
> > > > 
> > > >       xfs: refine the allocation stack switch
> > > > 
> > > > to only defer btree splits to a workqueue.  With that previous scheme
> > > > there might have been an option to defer AIO allocations to a workqueue,
> > > > but the main issue with that is that the worker thread which is then
> > > > going to do the actual data transfer would have to "borrow" the
> > > > mm_struct from the submitter.  That's the primary reason why something
> > > > like that was never implemented in mainline Linux.
> > > For DIO, does it really need the mm_struct? It can just pin the pages and
> > > pass them to the workqueue function.
> > > 
> > I'm not sure what difference it makes regardless. We still have to wait
> > for an allocation to complete before we can issue an I/O.
> 
> If io_submit() returns immediately rather than blocking, it makes a huge
> difference. Waiting in the workqueue can be done in parallel to other I/O
> and in parallel to cpu work in the caller thread. Blocking means no further
> I/O is issued and no cpu work is done.
> 

Sure. I'm just saying that seems orthogonal to how/why we deferred block
allocations to a wq. Even if we went back to that behavior, io_submit()
will still potentially block as it does today. It sounds like what you
want is something higher level that defers the entire aio submission to
a wq (which still may have to use another wq for btree splits, for
different reasons). Apparently we had something like that in the past as
Christoph referred to in his last mail, but I'm not really familiar with
that.

FWIW, this is not exactly the same, but I think Dave prototyped
something in the past to wire up aio_fsync() to a basic wq
implementation and managed to show really good scalability improvements.
Given that, I suppose it wouldn't be that surprising to get similar
results for I/O submission if there is some way around the page issue.

Brian

> >   IIRC, the old
> > defer allocs to a wq thing was more about saving stack space than
> > providing async behavior.
> 
> Perhaps, but IMO the async behavior is a major feature of the aio system
> calls. It is very hard to use them if they block.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-20 10:50           ` Brian Foster
@ 2017-09-20 11:11             ` Avi Kivity
  2017-09-20 14:49               ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Avi Kivity @ 2017-09-20 11:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, Tomasz Grabiec, linux-xfs



On 09/20/2017 01:50 PM, Brian Foster wrote:
> On Wed, Sep 20, 2017 at 09:17:25AM +0300, Avi Kivity wrote:
>> On 09/19/2017 08:39 PM, Brian Foster wrote:
>>> On Tue, Sep 19, 2017 at 07:31:04PM +0300, Avi Kivity wrote:
>>>> On 09/19/2017 05:58 PM, Christoph Hellwig wrote:
>>>>> On Tue, Sep 19, 2017 at 08:27:05AM -0400, Brian Foster wrote:
>>>>>>> Please advise, is this a known bug? When can it happen? Is there a way
>>>>>>> to work it around to avoid blocking?
>>>>>>>
>>>>>> I'm not sure how either could be considered a bug based on the stack
>>>>>> trace information alone. Allocations may require reading metadata and
>>>>>> reads are synchronous. This all seems like pretty basic filesystem
>>>>>> behavior.
>>>>>>
>>>>>> I suppose performance may be a separate question. For the latter issue,
>>>>>> I'd be curious whether leaving more free space available in the
>>>>>> filesystem would help avoid running into busy extents. Perhaps having
>>>>>> more memory and thus a larger buffer cache for btree blocks could help
>>>>>> mitigate the former issue..? The deterministic workaround for both is to
>>>>>> preallocate the associated file. If the file would be too large, another
>>>>>> option may be to set an extent size hint to allocate the file in larger
>>>>>> chunks and amortize the cost of the allocations over multiple writes.
>>>>> Note that Linux 4.13 and later support a RWF_NOWAIT flag, that will
>>>>> return -EAGAIN from io_submit for these conditions so they can be
>>>>> handled by a thread pool.
>>>>>
>>>>> Note that until a few years ago we performed all allocations from
>>>>> a workqueue, this was changed by:
>>>>>
>>>>> commit cf11da9c5d374962913ca5ba0ce0886b58286224
>>>>> Author: Dave Chinner <dchinner@redhat.com>
>>>>> Date:   Tue Jul 15 07:08:24 2014 +1000
>>>>>
>>>>>        xfs: refine the allocation stack switch
>>>>>
>>>>> to only defer btree splits to a workqueue.  With that previous scheme
>>>>> there might have been an option to defer AIO allocations to a workqueue,
>>>>> but the main issue with that is that the worker thread which is then
>>>>> going to do the actual data transfer would have to "borrow" the
>>>>> mm_struct from the submitter.  That's the primary reason why something
>>>>> like that was never implemented in mainline Linux.
>>>> For DIO, does it really need the mm_struct? It can just pin the pages and
>>>> pass them to the workqueue function.
>>>>
>>> I'm not sure what difference it makes regardless. We still have to wait
>>> for an allocation to complete before we can issue an I/O.
>> If io_submit() returns immediately rather than blocking, it makes a huge
>> difference. Waiting in the workqueue can be done in parallel to other I/O
>> and in parallel to cpu work in the caller thread. Blocking means no further
>> I/O is issued and no cpu work is done.
>>
> Sure. I'm just saying that seems orthogonal to how/why we deferred block
> allocations to a wq.

Oh, sorry for misunderstanding. TBH this is beyond my (very weak) 
understanding of the low-level implementation.

>   Even if we went back to that behavior, io_submit()
> will still potentially block as it does today. It sounds like what you
> want is something higher level that defers the entire aio submission to
> a wq (which still may have to use another wq for btree splits, for
> different reasons).

I think it's still preferable to avoid a workqueue and its
non-deterministic latencies and context switches if we can prove that a
particular iocb will not require a synchronous operation. If that can be
done then 4.13 nowait aio also works - the user provides the workqueue
equivalent. The only problem is if we can't tell in advance whether an
iocb will require blocking.

>   Apparently we had something like that in the past as
> Christoph referred to in his last mail, but I'm not really familiar with
> that.
>
> FWIW, this is not exactly the same, but I think Dave prototyped
> something in the past to wire up aio_fsync() to a basic wq
> implementation and managed to show really good scalability improvements.
> Given that, I suppose it wouldn't be that surprising to get similar
> results for I/O submission if there is some way around the page issue.

I can think of a couple of options:

  1. Short writes - just ignore the tail of a too-large iovec. May cause 
buggy applications to fail, so probably not a good idea.
  2. Global limit - if the number of pinned pages in all currently 
running iocbs is below some limit, allow it, otherwise fail a nowait aio 
(and synchronously execute a non-nowait aio). Few applications will 
overflow the global limit if it is generous enough, since very large 
I/Os induce bad latency and don't gain you much in throughput.
  3. Borrow the mm, and pin from the wq - I gather it was considered and 
rejected, but maybe it can be reconsidered.
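
To make option 2 concrete, here is a toy user-space model of the policy. It
is not kernel code and nothing like it exists in the kernel as far as this
thread says; PinBudget, submit() and the returned strings are all invented
for the illustration.

```python
import errno
import threading


class PinBudget:
    """Toy model of a global cap on pages pinned by in-flight iocbs."""

    def __init__(self, limit_pages):
        self.limit = limit_pages
        self.pinned = 0
        self._lock = threading.Lock()

    def try_pin(self, npages):
        """Reserve npages if the cap allows it; return True on success."""
        with self._lock:
            if self.pinned + npages > self.limit:
                return False
            self.pinned += npages
            return True

    def unpin(self, npages):
        """Release a reservation when the I/O completes."""
        with self._lock:
            self.pinned -= npages


def submit(budget, npages, nowait):
    """Return 'async' or 'sync', or raise EAGAIN, following option 2 above."""
    if budget.try_pin(npages):
        return "async"     # pages pinned; the rest could run from a workqueue
    if nowait:
        raise OSError(errno.EAGAIN, "pinned-page budget exhausted")
    return "sync"          # over budget: fall back to the blocking path
```

With a generous limit, the 'sync' and EAGAIN branches would only be reached
by very large I/Os.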




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-20 11:11             ` Avi Kivity
@ 2017-09-20 14:49               ` Christoph Hellwig
  2017-09-23 18:23                 ` Avi Kivity
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2017-09-20 14:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Brian Foster, Christoph Hellwig, Tomasz Grabiec, linux-xfs,
	Goldwyn Rodrigues, linux-aio

On Wed, Sep 20, 2017 at 02:11:49PM +0300, Avi Kivity wrote:
> I think it's still preferable to avoid a workqueue and its non-deterministic
> latencies and context switches if we can prove that a particular iocb will
> not require a synchronous operation. If that can be done then 4.13 nowait
> aio also works - the user provides the workqueue equivalent. The only
> problem is if we can't tell in advance whether an iocb will require blocking.

The code is generally pessimistic and bails out rather too often.
The only issue not solved is memory allocation; at the moment we could
still block on it, so this will need some more work.  For XFS direct
I/O the only memory allocations in that path should be the bios.

>  1. Short writes - just ignore the tail of a too-large iovec. May cause
> buggy applications to fail, so probably not a good idea.

We could still do it the same way we did RWF_NOWAIT - require an
explicit opt-in for what should be the default behavior because we
change the historic behavior.
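
For reference, an application that copes with short writes usually loops
and resubmits the tail. A minimal sketch, with pwrite_all() being a made-up
name; for direct I/O the resubmitted tail would additionally have to respect
the device's alignment rules, which this example ignores:

```python
import os


def pwrite_all(fd, data, offset, pwrite=os.pwrite):
    """Write all of data at offset, resubmitting the tail after short writes."""
    view = memoryview(data)
    done = 0
    while done < len(view):
        n = pwrite(fd, view[done:], offset + done)
        if n == 0:
            raise OSError("write made no progress")
        done += n
    return done
```

The pwrite parameter is only there so a short-writing stand-in can be
substituted when testing the loop.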

>  3. Borrow the mm, and pin from the wq - I gather it was considered and
> rejected, but maybe it can be reconsidered.

It was done before in vendor kernels, and I think we also had code
for it in a driver implementing aio.  I'd need to look up the whole
history as I don't remember it.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: io_submit() blocks for writes for substantial amount of time
  2017-09-20 14:49               ` Christoph Hellwig
@ 2017-09-23 18:23                 ` Avi Kivity
  0 siblings, 0 replies; 17+ messages in thread
From: Avi Kivity @ 2017-09-23 18:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Brian Foster, Tomasz Grabiec, linux-xfs, Goldwyn Rodrigues, linux-aio



On 09/20/2017 05:49 PM, Christoph Hellwig wrote:
> On Wed, Sep 20, 2017 at 02:11:49PM +0300, Avi Kivity wrote:
>> I think it's still preferable to avoid a workqueue and its non-deterministic
>> latencies and context switches if we can prove that a particular iocb will
>> not require a synchronous operation. If that can be done then 4.13 nowait
>> aio also works - the user provides the workqueue equivalent. The only
>> problem is if we can't tell in advance whether an iocb will require blocking.
> The code is generally pessimistic and bails out rather too often.
> The only issue not solved is memory allocation, at the moment we could
> still block on them so this will need some more work.  For XFS direct
> I/O the only memory allocations in that path should be the bios.

I think we can ignore blocking on memory allocation. It affects all 
system calls and even just regular user memory access - if you're 
starved for memory you're likely to have text and data pages swapped 
out. An application that wants aio should be prepared to avoid starving 
the kernel for memory.

>
>>   1. Short writes - just ignore the tail of a too-large iovec. May cause
>> buggy applications to fail, so probably not a good idea.
> We could still do it the same way we did RWF_NOWAIT - require an
> explicit opt-in for what should be the default behavior because we
> change the historic behavior.

Yes.


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-09-23 18:23 UTC | newest]

Thread overview: 17+ messages
2017-09-19  8:50 io_submit() blocks for writes for substantial amount of time Tomasz Grabiec
2017-09-19 12:27 ` Brian Foster
2017-09-19 14:58   ` Christoph Hellwig
2017-09-19 16:31     ` Avi Kivity
2017-09-19 17:39       ` Brian Foster
2017-09-19 20:34         ` Christoph Hellwig
2017-09-20  6:17         ` Avi Kivity
2017-09-20 10:50           ` Brian Foster
2017-09-20 11:11             ` Avi Kivity
2017-09-20 14:49               ` Christoph Hellwig
2017-09-23 18:23                 ` Avi Kivity
2017-09-19 20:34       ` Christoph Hellwig
2017-09-20  6:14         ` Avi Kivity
2017-09-19 16:29   ` Avi Kivity
2017-09-19 17:38     ` Brian Foster
2017-09-19 17:53       ` Tomasz Grabiec
2017-09-19 23:38         ` Dave Chinner
