* Question about non asynchronous aio calls.
@ 2015-10-07 14:18 Gleb Natapov
  2015-10-07 14:24 ` Eric Sandeen
  0 siblings, 1 reply; 13+ messages in thread
From: Gleb Natapov @ 2015-10-07 14:18 UTC (permalink / raw)
  To: xfs; +Cc: avi, glauber

Hello XFS developers,

We are working on the scylladb[1] database, which is written using
seastar[2], a highly asynchronous C++ framework. The code uses aio
heavily: no synchronous operation is allowed at all by the framework,
otherwise performance drops drastically. We noticed that the only
mainstream FS in Linux that takes aio seriously is XFS. So let me start
by thanking you guys for the great work! But unfortunately we also
noticed that sometimes io_submit() executes synchronously even on XFS.

Looking at the code I see two cases where this happens: unaligned IO
and writes past EOF. It looks like we hit both. For the first one we
make a special effort to never issue unaligned IO, and we use
XFS_IOC_DIOINFO to figure out what the alignment should be, but it does
not help. Looking at the code, though, xfs_file_dio_aio_write() checks
alignment against m_blockmask, which is set to sbp->sb_blocksize - 1,
so aio expects the buffer to be aligned to the filesystem block size,
not to the values that DIOINFO returns. Is this intentional? How should
our code know what to align buffers to?
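
For reference, we obtain the alignment values roughly like this (an
illustrative, trimmed sketch assuming the xfsprogs headers; error
handling omitted):

#include <stdio.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>            /* XFS_IOC_DIOINFO, struct dioattr */

static void print_dio_alignment(int fd)
{
        struct dioattr da;

        /* DIO capability limits as reported by XFS for this file. */
        if (ioctl(fd, XFS_IOC_DIOINFO, &da) == 0)
                printf("d_mem=%u d_miniosz=%u d_maxiosz=%u\n",
                       da.d_mem, da.d_miniosz, da.d_maxiosz);
}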

The second one is harder. We do need to write past the end of a file;
in fact most of our writes are like that, so it would have been great
for XFS to handle this case asynchronously. Currently we are trying to
work around this by issuing truncate() (or fallocate()) on another
thread and doing aio on the main thread only after truncate() is
complete. It seems to be working, but is it guaranteed that a thread
issuing aio will never sleep in this case (maybe the new file size
value needs to hit the disk, and it is not guaranteed that this will
happen after truncate() returns but before the aio call)?
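
To make this concrete, our submission path boils down to something like
the sketch below (heavily trimmed and illustrative: the file name, the
4096-byte buffer alignment/size and the lack of error handling are all
artifacts of the example; link with -laio):

#define _GNU_SOURCE             /* O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <libaio.h>

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd;

        io_setup(128, &ctx);
        fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        posix_memalign(&buf, 4096, 4096);
        memset(buf, 0, 4096);

        /* Appending write: offset == current EOF (0 for a new file),
         * so this extends the file size and is exactly the case where
         * io_submit() can go synchronous on us. */
        io_prep_pwrite(&cb, fd, buf, 4096, 0);
        io_submit(ctx, 1, cbs);
        io_getevents(ctx, 1, 1, &ev, NULL);

        io_destroy(ctx);
        return 0;
}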

[1] http://www.scylladb.com/
[2] http://www.seastar-project.org/

Thanks,

--
			Gleb.


* Re: Question about non asynchronous aio calls.
  2015-10-07 14:18 Question about non asynchronous aio calls Gleb Natapov
@ 2015-10-07 14:24 ` Eric Sandeen
  2015-10-07 15:08   ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2015-10-07 14:24 UTC (permalink / raw)
  To: xfs



On 10/7/15 9:18 AM, Gleb Natapov wrote:
> Hello XFS developers,
> 
> We are working on scylladb[1] database which is written using seastar[2]
> - highly asynchronous C++ framework. The code uses aio heavily: no
> synchronous operation is allowed at all by the framework otherwise
> performance drops drastically. We noticed that the only mainstream FS
> in Linux that takes aio seriously is XFS. So let me start by thanking
> you guys for the great work! But unfortunately we also noticed that
> sometimes io_submit() is executed synchronously even on XFS.
> 
> Looking at the code I see two cases when this is happening: unaligned
> IO and write past EOF. It looks like we hit both. For the first one we
> make special afford to never issue unaligned IO and we use XFS_IOC_DIOINFO
> to figure out what alignment should be, but it does not help. Looking at the
> code though xfs_file_dio_aio_write() checks alignment against m_blockmask which
> is set to be sbp->sb_blocksize - 1, so aio expects buffer to be aligned to
> filesystem block size not values that DIOINFO returns. Is it intentional? How
> should our code know what it should align buffers to?

        /* "unaligned" here means not aligned to a filesystem block */
        if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
                unaligned_io = 1;

It should be aligned to the filesystem block size.

> Second one is harder. We do need to write past the end of a file, actually
> most of our writes are like that, so it would have been great for XFS to
> handle this case asynchronously.

You didn't say what kernel you're on, but these:

9862f62 xfs: allow appending aio writes
7b7a866 direct-io: Implement generic deferred AIO completions

hit kernel v3.15.

However, we had a bug report about this, and Brian has sent a fix
which has not yet been merged, see:

[PATCH 1/2] xfs: always drain dio before extending aio write submission

on this list last week.

With those 3 patches, things should just work for you I think.

-Eric

> Currently we are working to work around
> this by issuing truncate() (or fallocate()) on another thread and doing
> aio on a main thread only after truncate() is complete. It seams to be
> working, but is it guarantied that a thread issuing aio will never sleep
> in this case (may be new file size value needs to hit the disk and it is
> not guarantied that it will happen after truncate() returns, but before
> aio call)?
> 
> [2] http://www.scylladb.com/
> [1] http://www.seastar-project.org/
> 
> Thanks,
> 
> --
> 			Gleb.
> 

* Re: Question about non asynchronous aio calls.
  2015-10-07 14:24 ` Eric Sandeen
@ 2015-10-07 15:08   ` Brian Foster
  2015-10-07 15:13     ` Eric Sandeen
  2015-10-08  8:34     ` Gleb Natapov
  0 siblings, 2 replies; 13+ messages in thread
From: Brian Foster @ 2015-10-07 15:08 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Wed, Oct 07, 2015 at 09:24:15AM -0500, Eric Sandeen wrote:
> 
> 
> On 10/7/15 9:18 AM, Gleb Natapov wrote:
> > Hello XFS developers,
> > 
> > We are working on scylladb[1] database which is written using seastar[2]
> > - highly asynchronous C++ framework. The code uses aio heavily: no
> > synchronous operation is allowed at all by the framework otherwise
> > performance drops drastically. We noticed that the only mainstream FS
> > in Linux that takes aio seriously is XFS. So let me start by thanking
> > you guys for the great work! But unfortunately we also noticed that
> > sometimes io_submit() is executed synchronously even on XFS.
> > 
> > Looking at the code I see two cases when this is happening: unaligned
> > IO and write past EOF. It looks like we hit both. For the first one we
> > make special afford to never issue unaligned IO and we use XFS_IOC_DIOINFO
> > to figure out what alignment should be, but it does not help. Looking at the
> > code though xfs_file_dio_aio_write() checks alignment against m_blockmask which
> > is set to be sbp->sb_blocksize - 1, so aio expects buffer to be aligned to
> > filesystem block size not values that DIOINFO returns. Is it intentional? How
> > should our code know what it should align buffers to?
> 
>         /* "unaligned" here means not aligned to a filesystem block */
>         if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
>                 unaligned_io = 1;
> 
> It should be aligned to the filesystem block size.
> 

I'm not sure exactly what kinds of races are opened if the above locking
were absent, but I'd guess it's related to the buffer/block state
management, block zeroing and whatnot that is buried in the depths of
the generic dio code.

I suspect the dioinfo information describes the capabilities of the
filesystem (e.g., what kinds of DIO are allowable) as opposed to any
kind of optimal I/O-related values. Something like statfs() can be used
to determine the filesystem block size. I suppose you could also
intentionally format the filesystem with a smaller block size if
concurrent, smaller DIOs are a requirement.
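
For example (untested sketch; the helper names are just illustrative):

#include <sys/vfs.h>

/* Filesystem block size as reported by fstatfs(). */
static long fs_block_size(int fd)
{
        struct statfs sfs;

        if (fstatfs(fd, &sfs) < 0)
                return -1;
        return (long)sfs.f_bsize;
}

/* Round an I/O offset or length up to a filesystem block boundary. */
static long roundup_fsb(long value, long fsb)
{
        return (value + fsb - 1) / fsb * fsb;
}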

> > Second one is harder. We do need to write past the end of a file, actually
> > most of our writes are like that, so it would have been great for XFS to
> > handle this case asynchronously.
> 
> You didn't say what kernel you're on, but these:
> 
> 9862f62 xfs: allow appending aio writes
> 7b7a866 direct-io: Implement generic deferred AIO completions
> 
> hit kernel v3.15.
> 
> However, we had a bug report about this, and Brian has sent a fix
> which has not yet been merged, see:
> 
> [PATCH 1/2] xfs: always drain dio before extending aio write submission
> 
> on this list last week.
> 
> With those 3 patches, things should just work for you I think.
> 

These fix some problems in that code, but the "beyond EOF" submission is
still synchronous in nature by virtue of cycling the IOLOCK and draining
pending dio. This is required to check for EOF zeroing, and we can't do
that safely without a stable i_size.

Note that according to the commit Eric referenced above, ordering your
I/O to always append (rather than start at some point beyond the current
EOF) might be another option to avoid the synchronization here. Whether
that is an option is specific to your application, of course.

> -Eric
> 
> > Currently we are working to work around
> > this by issuing truncate() (or fallocate()) on another thread and doing
> > aio on a main thread only after truncate() is complete. It seams to be
> > working, but is it guarantied that a thread issuing aio will never sleep
> > in this case (may be new file size value needs to hit the disk and it is
> > not guarantied that it will happen after truncate() returns, but before
> > aio call)?
> > 

There are no such pitfalls as far as I'm aware. The entire AIO
submission synchronization sequence triggers off an in-memory i_size
check in xfs_file_aio_write_checks(). The in-memory i_size is updated in
the truncate path (xfs_setattr_size()) via truncate_setsize(), so at
that point the new size should be visible to subsequent AIO writers.

Note that the truncate itself does appear to wait here on pending DIO.
Also note that the existence of pagecache pages is another avenue to
synchronous DIO submission due to the need to possibly flush and
invalidate the cache, so you probably want to avoid any kind of mixed
buffered/direct I/O to a single file as well.

Brian

> > [2] http://www.scylladb.com/
> > [1] http://www.seastar-project.org/
> > 
> > Thanks,
> > 
> > --
> > 			Gleb.
> > 

* Re: Question about non asynchronous aio calls.
  2015-10-07 15:08   ` Brian Foster
@ 2015-10-07 15:13     ` Eric Sandeen
  2015-10-07 18:13       ` Avi Kivity
  2015-10-08  8:34     ` Gleb Natapov
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2015-10-07 15:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs



On 10/7/15 10:08 AM, Brian Foster wrote:
> On Wed, Oct 07, 2015 at 09:24:15AM -0500, Eric Sandeen wrote:
>>
>>
>> On 10/7/15 9:18 AM, Gleb Natapov wrote:
>>> Hello XFS developers,
>>>
>>> We are working on scylladb[1] database which is written using seastar[2]
>>> - highly asynchronous C++ framework. The code uses aio heavily: no
>>> synchronous operation is allowed at all by the framework otherwise
>>> performance drops drastically. We noticed that the only mainstream FS
>>> in Linux that takes aio seriously is XFS. So let me start by thanking
>>> you guys for the great work! But unfortunately we also noticed that
>>> sometimes io_submit() is executed synchronously even on XFS.
>>>
>>> Looking at the code I see two cases when this is happening: unaligned
>>> IO and write past EOF. It looks like we hit both. For the first one we
>>> make special afford to never issue unaligned IO and we use XFS_IOC_DIOINFO
>>> to figure out what alignment should be, but it does not help. Looking at the
>>> code though xfs_file_dio_aio_write() checks alignment against m_blockmask which
>>> is set to be sbp->sb_blocksize - 1, so aio expects buffer to be aligned to
>>> filesystem block size not values that DIOINFO returns. Is it intentional? How
>>> should our code know what it should align buffers to?
>>
>>         /* "unaligned" here means not aligned to a filesystem block */
>>         if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
>>                 unaligned_io = 1;
>>
>> It should be aligned to the filesystem block size.
>>
> 
> I'm not sure exactly what kinds of races are opened if the above locking
> were absent, but I'd guess it's related to the buffer/block state
> management, block zeroing and whatnot that is buried in the depths of
> the generic dio code.

Yep:

commit eda77982729b7170bdc9e8855f0682edf322d277
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jan 11 10:22:40 2011 +1100

    xfs: serialise unaligned direct IOs
    
    When two concurrent unaligned, non-overlapping direct IOs are issued
    to the same block, the direct Io layer will race to zero the block.
    The result is that one of the concurrent IOs will overwrite data
    written by the other IO with zeros. This is demonstrated by the
    xfsqa test 240.
    
    To avoid this problem, serialise all unaligned direct IOs to an
    inode with a big hammer. We need a big hammer approach as we need to
    serialise AIO as well, so we can't just block writes on locks.
    Hence, the big hammer is calling xfs_ioend_wait() while holding out
    other unaligned direct IOs from starting.
    
    We don't bother trying to serialised aligned vs unaligned IOs as
    they are overlapping IO and the result of concurrent overlapping IOs
    is undefined - the result of either IO is a valid result so we let
    them race. Hence we only penalise unaligned IO, which already has a
    major overhead compared to aligned IO so this isn't a major problem.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Alex Elder <aelder@sgi.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

I fixed something similar in ext4 at the time, FWIW.

> I suspect the dioinfo information describes the capabilities of the
> filesystem (e.g., what kinds of DIO are allowable) as opposed to any
> kind of optimal I/O related values. Something like statfs() can be used
> to determine the filesystem block size. I suppose you could also
> intentionally format the filesystem with a smaller block size if
> concurrent, smaller dio's are a requirement.
> 
>>> Second one is harder. We do need to write past the end of a file, actually
>>> most of our writes are like that, so it would have been great for XFS to
>>> handle this case asynchronously.
>>
>> You didn't say what kernel you're on, but these:
>>
>> 9862f62 xfs: allow appending aio writes
>> 7b7a866 direct-io: Implement generic deferred AIO completions
>>
>> hit kernel v3.15.
>>
>> However, we had a bug report about this, and Brian has sent a fix
>> which has not yet been merged, see:
>>
>> [PATCH 1/2] xfs: always drain dio before extending aio write submission
>>
>> on this list last week.
>>
>> With those 3 patches, things should just work for you I think.
>>
> 
> These fix some problems in that code, but the "beyond EOF" submission is
> still synchronous in nature by virtue of cycling the IOLOCK and draining
> pending dio. This is required to check for EOF zeroing, and we can't do
> that safely without a stable i_size.
> 
> Note that according to the commit Eric referenced above, ordering your
> I/O to always append (rather than start at some point beyond the current
> EOF) might be another option to avoid the synchronization here. Whether
> that is an option is specific to your application, of course.

Thanks for keeping me honest Brian.  :)

-Eric


* Re: Question about non asynchronous aio calls.
  2015-10-07 15:13     ` Eric Sandeen
@ 2015-10-07 18:13       ` Avi Kivity
  2015-10-08  4:28         ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Avi Kivity @ 2015-10-07 18:13 UTC (permalink / raw)
  To: Eric Sandeen, Brian Foster; +Cc: xfs

On 07/10/15 18:13, Eric Sandeen wrote:
>
> On 10/7/15 10:08 AM, Brian Foster wrote:
>> On Wed, Oct 07, 2015 at 09:24:15AM -0500, Eric Sandeen wrote:
>>>
>>> On 10/7/15 9:18 AM, Gleb Natapov wrote:
>>>> Hello XFS developers,
>>>>
>>>> We are working on scylladb[1] database which is written using seastar[2]
>>>> - highly asynchronous C++ framework. The code uses aio heavily: no
>>>> synchronous operation is allowed at all by the framework otherwise
>>>> performance drops drastically. We noticed that the only mainstream FS
>>>> in Linux that takes aio seriously is XFS. So let me start by thanking
>>>> you guys for the great work! But unfortunately we also noticed that
>>>> sometimes io_submit() is executed synchronously even on XFS.
>>>>
>>>> Looking at the code I see two cases when this is happening: unaligned
>>>> IO and write past EOF. It looks like we hit both. For the first one we
>>>> make special afford to never issue unaligned IO and we use XFS_IOC_DIOINFO
>>>> to figure out what alignment should be, but it does not help. Looking at the
>>>> code though xfs_file_dio_aio_write() checks alignment against m_blockmask which
>>>> is set to be sbp->sb_blocksize - 1, so aio expects buffer to be aligned to
>>>> filesystem block size not values that DIOINFO returns. Is it intentional? How
>>>> should our code know what it should align buffers to?
>>>          /* "unaligned" here means not aligned to a filesystem block */
>>>          if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
>>>                  unaligned_io = 1;
>>>
>>> It should be aligned to the filesystem block size.
>>>
>> I'm not sure exactly what kinds of races are opened if the above locking
>> were absent, but I'd guess it's related to the buffer/block state
>> management, block zeroing and whatnot that is buried in the depths of
>> the generic dio code.
> Yep:
>
> commit eda77982729b7170bdc9e8855f0682edf322d277
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Tue Jan 11 10:22:40 2011 +1100
>
>      xfs: serialise unaligned direct IOs
>      
>      When two concurrent unaligned, non-overlapping direct IOs are issued
>      to the same block, the direct Io layer will race to zero the block.
>      The result is that one of the concurrent IOs will overwrite data
>      written by the other IO with zeros. This is demonstrated by the
>      xfsqa test 240.
>      
>      To avoid this problem, serialise all unaligned direct IOs to an
>      inode with a big hammer. We need a big hammer approach as we need to
>      serialise AIO as well, so we can't just block writes on locks.
>      Hence, the big hammer is calling xfs_ioend_wait() while holding out
>      other unaligned direct IOs from starting.
>      
>      We don't bother trying to serialised aligned vs unaligned IOs as
>      they are overlapping IO and the result of concurrent overlapping IOs
>      is undefined - the result of either IO is a valid result so we let
>      them race. Hence we only penalise unaligned IO, which already has a
>      major overhead compared to aligned IO so this isn't a major problem.
>      
>      Signed-off-by: Dave Chinner <dchinner@redhat.com>
>      Reviewed-by: Alex Elder <aelder@sgi.com>
>      Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> I fixed something similar in ext4 at the time, FWIW.

Makes sense.

Is there a way to relax this for reads?  It's pretty easy to saturate 
the disk read bandwidth with 4K reads, and there shouldn't be a race 
there, at least for reads targeting already-written blocks.  For us at 
least small reads would be sufficient.


* Re: Question about non asynchronous aio calls.
  2015-10-07 18:13       ` Avi Kivity
@ 2015-10-08  4:28         ` Dave Chinner
  2015-10-08  5:21           ` Avi Kivity
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2015-10-08  4:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Brian Foster, Eric Sandeen, xfs

On Wed, Oct 07, 2015 at 09:13:06PM +0300, Avi Kivity wrote:
> On 07/10/15 18:13, Eric Sandeen wrote:
> >
> >On 10/7/15 10:08 AM, Brian Foster wrote:
> >>On Wed, Oct 07, 2015 at 09:24:15AM -0500, Eric Sandeen wrote:
> >>>
> >>>On 10/7/15 9:18 AM, Gleb Natapov wrote:
> >>>>Hello XFS developers,
> >>>>
> >>>>We are working on scylladb[1] database which is written using seastar[2]
> >>>>- highly asynchronous C++ framework. The code uses aio heavily: no
> >>>>synchronous operation is allowed at all by the framework otherwise
> >>>>performance drops drastically. We noticed that the only mainstream FS
> >>>>in Linux that takes aio seriously is XFS. So let me start by thanking
> >>>>you guys for the great work! But unfortunately we also noticed that
> >>>>sometimes io_submit() is executed synchronously even on XFS.
> >>>>
> >>>>Looking at the code I see two cases when this is happening: unaligned
> >>>>IO and write past EOF. It looks like we hit both. For the first one we
> >>>>make special afford to never issue unaligned IO and we use XFS_IOC_DIOINFO
> >>>>to figure out what alignment should be, but it does not help. Looking at the
> >>>>code though xfs_file_dio_aio_write() checks alignment against m_blockmask which
> >>>>is set to be sbp->sb_blocksize - 1, so aio expects buffer to be aligned to
> >>>>filesystem block size not values that DIOINFO returns. Is it intentional? How
> >>>>should our code know what it should align buffers to?
> >>>         /* "unaligned" here means not aligned to a filesystem block */
> >>>         if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
> >>>                 unaligned_io = 1;
> >>>
> >>>It should be aligned to the filesystem block size.
> >>>
> >>I'm not sure exactly what kinds of races are opened if the above locking
> >>were absent, but I'd guess it's related to the buffer/block state
> >>management, block zeroing and whatnot that is buried in the depths of
> >>the generic dio code.
> >Yep:
> >
> >commit eda77982729b7170bdc9e8855f0682edf322d277
> >Author: Dave Chinner <dchinner@redhat.com>
> >Date:   Tue Jan 11 10:22:40 2011 +1100
> >
> >     xfs: serialise unaligned direct IOs

[...]

> >I fixed something similar in ext4 at the time, FWIW.
> 
> Makes sense.
> 
> Is there a way to relax this for reads?

The above mostly only applies to writes. Reads don't modify data, so
racing unaligned reads against other reads won't give unexpected
results and so aren't serialised.

i.e. serialisation will only occur when:
	- unaligned write IO will serialise until sub-block zeroing
	  is complete.
	- write IO extending EOF will serialise until post-EOF
	  zeroing is complete.
	- cached pages are found on the inode (i.e. mixing
	  buffered/mmap access with direct IO).
	- a truncate/extent manipulation syscall is run.

All other DIO will be issued and run concurrently, reads and writes.

Realistically, if you care about performance (which obviously you do)
then you do not do unaligned IO, and you try hard to minimise
operations that extend the file...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question about non asynchronous aio calls.
  2015-10-08  4:28         ` Dave Chinner
@ 2015-10-08  5:21           ` Avi Kivity
  2015-10-08  8:23             ` Gleb Natapov
  0 siblings, 1 reply; 13+ messages in thread
From: Avi Kivity @ 2015-10-08  5:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Eric Sandeen, xfs

On 08/10/15 07:28, Dave Chinner wrote:
> On Wed, Oct 07, 2015 at 09:13:06PM +0300, Avi Kivity wrote:
>> On 07/10/15 18:13, Eric Sandeen wrote:
>>> On 10/7/15 10:08 AM, Brian Foster wrote:
>>>> On Wed, Oct 07, 2015 at 09:24:15AM -0500, Eric Sandeen wrote:
>>>>> On 10/7/15 9:18 AM, Gleb Natapov wrote:
>>>>>> Hello XFS developers,
>>>>>>
>>>>>> We are working on scylladb[1] database which is written using seastar[2]
>>>>>> - highly asynchronous C++ framework. The code uses aio heavily: no
>>>>>> synchronous operation is allowed at all by the framework otherwise
>>>>>> performance drops drastically. We noticed that the only mainstream FS
>>>>>> in Linux that takes aio seriously is XFS. So let me start by thanking
>>>>>> you guys for the great work! But unfortunately we also noticed that
>>>>>> sometimes io_submit() is executed synchronously even on XFS.
>>>>>>
>>>>>> Looking at the code I see two cases when this is happening: unaligned
>>>>>> IO and write past EOF. It looks like we hit both. For the first one we
>>>>>> make special afford to never issue unaligned IO and we use XFS_IOC_DIOINFO
>>>>>> to figure out what alignment should be, but it does not help. Looking at the
>>>>>> code though xfs_file_dio_aio_write() checks alignment against m_blockmask which
>>>>>> is set to be sbp->sb_blocksize - 1, so aio expects buffer to be aligned to
>>>>>> filesystem block size not values that DIOINFO returns. Is it intentional? How
>>>>>> should our code know what it should align buffers to?
>>>>>          /* "unaligned" here means not aligned to a filesystem block */
>>>>>          if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
>>>>>                  unaligned_io = 1;
>>>>>
>>>>> It should be aligned to the filesystem block size.
>>>>>
>>>> I'm not sure exactly what kinds of races are opened if the above locking
>>>> were absent, but I'd guess it's related to the buffer/block state
>>>> management, block zeroing and whatnot that is buried in the depths of
>>>> the generic dio code.
>>> Yep:
>>>
>>> commit eda77982729b7170bdc9e8855f0682edf322d277
>>> Author: Dave Chinner <dchinner@redhat.com>
>>> Date:   Tue Jan 11 10:22:40 2011 +1100
>>>
>>>      xfs: serialise unaligned direct IOs
> [...]
>
>>> I fixed something similar in ext4 at the time, FWIW.
>> Makes sense.
>>
>> Is there a way to relax this for reads?
> The above mostly only applies to writes. Reads don't modify data so
> racing unaligned reads against other reads won't given unexpected
> results and so aren't serialised.
>
> i.e. serialisation will only occur when:
> 	- unaligned write IO will serialise until sub-block zeroing
> 	  is complete.
> 	- write IO extending EOF will serialis until post-EOF
> 	  zeroing is complete


By "complete" here, do you mean that a call to truncate() returned, or 
that its results reached the disk an unknown time later?

I could, immediately after truncating the file, extend it to a very 
large size, and truncate it back just before the final fsync/close 
sequence.  This has downsides from the viewpoint of user support (why is 
the file so large after a crash, what happens with backups) but is 
better than nothing.

> 	- cached pages are found on the inode (i.e. mixing
> 	  buffered/mmap access with direct IO).

We don't do that.

> 	- truncate/extent manipulation syscall is run

Actually, we do call fallocate() ahead of io_submit() (in a worker 
thread, in non-overlapping ranges) to optimize file layout and also in 
the belief that it would reduce the amount of blocking io_submit() does.

Should we serialize the fallocate() calls vs. io_submit() (on the same 
file)?  Were those fallocates a good idea in the first place?

> All other DIO will be issued and run concurrently, reads and writes.
>
> Realistically, if you are care about performance (which obviously
> you are) then you do not do unaligned IO, and you try hard to
> minimise operations that extend the file...

On SSDs, if you care about performance you avoid random writes, which 
cause write amplification.  So you do have to extend the file, unless 
you know its size in advance, which we don't.

Also, does "extend the file" here mean just the size, or extent 
allocation as well?

A final point is discoverability.  There is no way to discover safe 
alignment for reads and writes, and which operations block io_submit(), 
except by asking here, which cannot be done at runtime.  Interfaces that 
provide a way to query these attributes are very important to us.


* Re: Question about non asynchronous aio calls.
  2015-10-08  5:21           ` Avi Kivity
@ 2015-10-08  8:23             ` Gleb Natapov
  2015-10-08 11:46               ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Gleb Natapov @ 2015-10-08  8:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Eric Sandeen, Brian Foster, xfs

On Thu, Oct 08, 2015 at 08:21:58AM +0300, Avi Kivity wrote:
> >>>I fixed something similar in ext4 at the time, FWIW.
> >>Makes sense.
> >>
> >>Is there a way to relax this for reads?
> >The above mostly only applies to writes. Reads don't modify data so
> >racing unaligned reads against other reads won't given unexpected
> >results and so aren't serialised.
> >
> >i.e. serialisation will only occur when:
> >	- unaligned write IO will serialise until sub-block zeroing
> >	  is complete.
> >	- write IO extending EOF will serialis until post-EOF
> >	  zeroing is complete
> 
> 
> By "complete" here, do you mean that a call to truncate() returned, or that
> its results reached the disk an unknown time later?
> 
I think Brian already answered that one with:

  There are no such pitfalls as far as I'm aware. The entire AIO
  submission synchronization sequence triggers off an in-memory i_size
  check in xfs_file_aio_write_checks(). The in-memory i_size is updated in
  the truncate path (xfs_setattr_size()) via truncate_setsize(), so at
  that point the new size should be visible to subsequent AIO writers.

> i could, immediately after truncating the file, extend it to a very large
> size, and truncate it back just before the final fsync/close sequence.  This
> has downsides from the viewpoint of user support (why is the file so large
> after a crash, what happens with backups) but is better than nothing.
> 
> >	- cached pages are found on the inode (i.e. mixing
> >	  buffered/mmap access with direct IO).
> 
> We don't do that.
> 
> >	- truncate/extent manipulation syscall is run
> 
> Actually, we do call fallocate() ahead of io_submit() (in a worker thread,
> in non-overlapping ranges) to optimize file layout and also in the belief
> that it would reduce the amount of blocking io_submit() does.
> 
> Should we serialize the fallocate() calls vs. io_submit() (on the same
> file)?  Were those fallocates a good idea in the first place?
> 
> >All other DIO will be issued and run concurrently, reads and writes.
> >
> >Realistically, if you are care about performance (which obviously
> >you are) then you do not do unaligned IO, and you try hard to
> >minimise operations that extend the file...
> 
> On SSDs, if you care about performance you avoid random writes, which cause
> write amplification.  So you do have to extend the file, unless you know its
> size in advance, which we don't.
> 
> Also, does "extend the file" here mean just the size, or extent allocation
> as well?
> 
> A final point is discoverability.  There is no way to discover safe
> alignment for reads and writes, and which operations block io_submit(),
> except by asking here, which cannot be done at runtime.  Interfaces that
> provide a way to query these attributes are very important to us.
As Brian pointed out, statfs() can be used to get f_bsize, which is
defined as the "optimal transfer block size".

--
			Gleb.


* Re: Question about non asynchronous aio calls.
  2015-10-07 15:08   ` Brian Foster
  2015-10-07 15:13     ` Eric Sandeen
@ 2015-10-08  8:34     ` Gleb Natapov
  1 sibling, 0 replies; 13+ messages in thread
From: Gleb Natapov @ 2015-10-08  8:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: Eric Sandeen, xfs

On Wed, Oct 07, 2015 at 11:08:34AM -0400, Brian Foster wrote:
> > > Second one is harder. We do need to write past the end of a file, actually
> > > most of our writes are like that, so it would have been great for XFS to
> > > handle this case asynchronously.
> > 
> > You didn't say what kernel you're on, but these:
> > 
> > 9862f62 xfs: allow appending aio writes
> > 7b7a866 direct-io: Implement generic deferred AIO completions
> > 
> > hit kernel v3.15.
> > 
> > However, we had a bug report about this, and Brian has sent a fix
> > which has not yet been merged, see:
> > 
> > [PATCH 1/2] xfs: always drain dio before extending aio write submission
> > 
> > on this list last week.
> > 
> > With those 3 patches, things should just work for you I think.
> > 
> 
> These fix some problems in that code, but the "beyond EOF" submission is
> still synchronous in nature by virtue of cycling the IOLOCK and draining
> pending dio. This is required to check for EOF zeroing, and we can't do
> that safely without a stable i_size.
> 
> Note that according to the commit Eric referenced above, ordering your
> I/O to always append (rather than start at some point beyond the current
> EOF) might be another option to avoid the synchronization here. Whether
> that is an option is specific to your application, of course.
> 
Our IO should always be append, IIRC; the above explains why most aio we
do is truly async, but maybe somewhere there is a reordering and then
we see synchronous behaviour. We will have to check that.

--
			Gleb.


* Re: Question about non asynchronous aio calls.
  2015-10-08  8:23             ` Gleb Natapov
@ 2015-10-08 11:46               ` Dave Chinner
  2015-10-12 12:37                 ` Avi Kivity
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2015-10-08 11:46 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, Brian Foster, Eric Sandeen, xfs

On Thu, Oct 08, 2015 at 11:23:07AM +0300, Gleb Natapov wrote:
> On Thu, Oct 08, 2015 at 08:21:58AM +0300, Avi Kivity wrote:
> > >>>I fixed something similar in ext4 at the time, FWIW.
> > >>Makes sense.
> > >>
> > >>Is there a way to relax this for reads?
> > >The above mostly only applies to writes. Reads don't modify data so
> > >racing unaligned reads against other reads won't given unexpected
> > >results and so aren't serialised.
> > >
> > >i.e. serialisation will only occur when:
> > >	- unaligned write IO will serialise until sub-block zeroing
> > >	  is complete.
> > >	- write IO extending EOF will serialis until post-EOF
> > >	  zeroing is complete
> > 
> > 
> > By "complete" here, do you mean that a call to truncate() returned, or that
> > its results reached the disk an unknown time later?
> > 

No, I'm talking purely about DIO here. If you do a write that
starts beyond the existing EOF, there is a region between the
current EOF and the offset the write starts at. i.e.

   0             EOF            offset     new EOF
   +dddddddddddddd+..............+nnnnnnnnnnn+

It is the region between EOF and offset that we must ensure is made
up of either holes, unwritten extents or fully zeroed blocks before
allowing the write to proceed. If we have to zero allocated blocks,
then we have to ensure that completes before the write can start.
This means that when we update the EOF on completion of the write,
we don't expose stale data in blocks that were between EOF and
offset...

> I think Brian already answered that one with:
> 
>   There are no such pitfalls as far as I'm aware. The entire AIO
>   submission synchronization sequence triggers off an in-memory i_size
>   check in xfs_file_aio_write_checks(). The in-memory i_size is updated in
>   the truncate path (xfs_setattr_size()) via truncate_setsize(), so at
>   that point the new size should be visible to subsequent AIO writers.

Different situation as truncate serialises all IO. Extending the file
via truncate also runs the same "EOF zeroing" that the DIO code runs
above, for the same reasons.

> 
> > >	- truncate/extent manipulation syscall is run
> > 
> > Actually, we do call fallocate() ahead of io_submit() (in a worker thread,
> > in non-overlapping ranges) to optimize file layout and also in the belief
> > that it would reduce the amount of blocking io_submit() does.

fallocate serialises all IO submission - including reads. Unlike
truncate, however, it doesn't drain the queue of IO for
preallocation so the impact on AIO is somewhat limited.

Ideally you want to limit fallocate calls to large chunks at a time.
If you have a 1:1 mapping of fallocate calls to write calls, then
you're likely making things worse for the AIO submission path
because you'll block reads as well as writes. Doing the allocation
in the write submission path will not block reads, and only writes
that are attempting to do concurrent allocations to the same file
will serialise...
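
Something along these lines (rough, untested sketch; the 64MB chunk
size and helper are purely illustrative) is what I mean by large
chunks:

#define _GNU_SOURCE             /* fallocate() */
#include <fcntl.h>

#define PREALLOC_CHUNK  (64LL * 1024 * 1024)

/* Preallocate in large chunks rather than once per write.  Leaving out
 * FALLOC_FL_KEEP_SIZE also extends i_size, so appending writes below
 * the preallocated size no longer hit the "beyond EOF" serialisation. */
static int ensure_space(int fd, off_t next_write_end)
{
        off_t want = (next_write_end + PREALLOC_CHUNK - 1) /
                     PREALLOC_CHUNK * PREALLOC_CHUNK;

        return fallocate(fd, 0, 0, want);
}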

If you want to limit fragmentation without adding overhead on
XFS for non-sparse files (which it sounds like is your case), then the
best thing to use in XFS is the per-inode extent size hints. You set
it on the file when first creating it (or on the parent directory so
all children inherit it at create), and then the allocator will
round out allocations to the size hint alignment and size, including
beyond EOF so appending writes can take advantage of it....
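
Rough, untested sketch of setting the hint from userspace (xfs_io's
"extsize" command does the equivalent; the 1MB value is only an
example):

#include <sys/ioctl.h>
#include <xfs/xfs.h>            /* XFS_IOC_FSGETXATTR, struct fsxattr */

/* Set the extent size hint; do it before any data is written. */
static int set_extsize_hint(int fd, unsigned int bytes)
{
        struct fsxattr fsx;

        if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;
        fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;
        fsx.fsx_extsize = bytes;        /* e.g. 1024 * 1024 */
        return ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
}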

> > A final point is discoverability.  There is no way to discover safe
> > alignment for reads and writes, and which operations block io_submit(),
> > except by asking here, which cannot be done at runtime.  Interfaces that
> > provide a way to query these attributes are very important to us.
> As Brian pointed statfs() can be use to get f_bsize which is defined as
> "optimal transfer block size".

Well, that's what posix calls it. It's not really the optimal IO
size, though, it's just the IO size that avoids page cache RMW
cycles. For direct IO, larger tends to be better, and IO aligned to
the underlying geometry of the storage is even better. See, for
example, the "largeio" mount option, which will make XFS report the
stripe width in f_bsize rather than the PAGE_SIZE of the machine....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question about non asynchronous aio calls.
  2015-10-08 11:46               ` Dave Chinner
@ 2015-10-12 12:37                 ` Avi Kivity
  2015-10-12 22:23                   ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Avi Kivity @ 2015-10-12 12:37 UTC (permalink / raw)
  To: Dave Chinner, Gleb Natapov; +Cc: Brian Foster, Eric Sandeen, xfs

On 10/08/2015 02:46 PM, Dave Chinner wrote:
> On Thu, Oct 08, 2015 at 11:23:07AM +0300, Gleb Natapov wrote:
>> On Thu, Oct 08, 2015 at 08:21:58AM +0300, Avi Kivity wrote:
>>>>>> I fixed something similar in ext4 at the time, FWIW.
>>>>> Makes sense.
>>>>>
>>>>> Is there a way to relax this for reads?
>>>> The above mostly only applies to writes. Reads don't modify data so
>>>> racing unaligned reads against other reads won't given unexpected
>>>> results and so aren't serialised.
>>>>
>>>> i.e. serialisation will only occur when:
>>>> 	- unaligned write IO will serialise until sub-block zeroing
>>>> 	  is complete.
>>>> 	- write IO extending EOF will serialis until post-EOF
>>>> 	  zeroing is complete
>>>
>>> By "complete" here, do you mean that a call to truncate() returned, or that
>>> its results reached the disk an unknown time later?
>>>
> No, I'm talking purely about DIO here. If you do write that
> starts beyond the existing EOF, there is a region between the
> current EOF and the offset the write starts at. i.e.
>
>     0             EOF            offset     new EOF
>     +dddddddddddddd+..............+nnnnnnnnnnn+
>
> It is the region between EOF and offset that we must ensure is made
> up of either holes, unwritten extents or fully zeroed blocks before
> allowing the write to proceed. If we have to zero allocated blocks,
> then we have to ensure that completes before the write can start.
> This means that when we update the EOF on completion of the write,
> we don't expose stale data in blocks that were between EOF and
> offset...

Thanks.  We found, experimentally, that io_submit(write_at_eof) followed 
by (without waiting) io_submit(write_at_what_would_be_the_new_eof) 
occasionally blocks.

So I guess we have to employ a train algorithm here and keep at most one 
aio in flight for append loads (which are very common for us).

>
>> I think Brian already answered that one with:
>>
>>    There are no such pitfalls as far as I'm aware. The entire AIO
>>    submission synchronization sequence triggers off an in-memory i_size
>>    check in xfs_file_aio_write_checks(). The in-memory i_size is updated in
>>    the truncate path (xfs_setattr_size()) via truncate_setsize(), so at
>>    that point the new size should be visible to subsequent AIO writers.
> Different situation as truncate serialises all IO. Extending the file
> via truncate also runs the same "EOF zeroing" that the DIO code runs
> above, for the same reasons.

Does that mean that truncate() will wait for inflight aios, or that new 
aios will wait for the truncate() to complete, or both?

>
>>>> 	- truncate/extent manipulation syscall is run
>>> Actually, we do call fallocate() ahead of io_submit() (in a worker thread,
>>> in non-overlapping ranges) to optimize file layout and also in the belief
>>> that it would reduce the amount of blocking io_submit() does.
> fallocate serialises all IO submission - including reads. Unlike
> truncate, however, it doesn't drain the queue of IO for
> preallocation so the impact on AIO is somewhat limited.
>
> Ideally you want to limit fallocate calls to large chunks at a time.
> If you have a 1:1 mapping of fallocate calls to write calls, then
> you're likely making things worse for the AIO submission path
> because you'll block reads as well as writes. Doing the allocation
> in the write submission path will not block reads, and only writes
> that are attempting to do concurrent allocations to the same file
> will serialise...

We have a 1:8 ratio (128K:1M), but those are just random numbers we guessed.

Again, not only for reduced XFS metadata, but also to reduce the amount 
of write amplification done by the FTL. We have a concurrent append 
workload on many files, and files are reclaimed out of order, so larger 
extents mean less fragmentation for the FTL later on.

>
> If you want to limit fragmentation without adding and overhead on
> XFS for non-sparse files (which it sounds like your case), then the
> best thing to use in XFS is the per-inode extent size hints. You set
> it on the file when first creating it (or the parent directory so
> all children inherit it at create), and then the allocator will
> round out allocations to the size hint alignment and size, including
> beyond EOF so appending writes can take advantage of it....

We'll try that out.  That's fsxattr::fsx_extsize?

What about small files that are eventually closed, do I need to do 
anything to reclaim the preallocated space?

>
>>> A final point is discoverability.  There is no way to discover safe
>>> alignment for reads and writes, and which operations block io_submit(),
>>> except by asking here, which cannot be done at runtime.  Interfaces that
>>> provide a way to query these attributes are very important to us.
>> As Brian pointed statfs() can be use to get f_bsize which is defined as
>> "optimal transfer block size".
> Well, that's what posix calls it. It's not really the optimal IO
> size, though, it's just the IO size that avoids page cache RMW
> cycles. For direct IO, larger tends to be better, and IO aligned to
> the underlying geometry of the storage is even better. See, for
> example, the "largeio" mount option, which will make XFS report the
> stripe width in f_bsize rather than the PAGE_SIZE of the machine....
>

Well, random reads will still be faster with 512 byte alignment, yes? 
And for random writes, you can't just make those I/Os larger; you'll 
overwrite something.

So I read "optimal" here to mean "smallest I/O size that doesn't incur a 
penalty; but if you really need more data, making it larger will help".


* Re: Question about non asynchronous aio calls.
  2015-10-12 12:37                 ` Avi Kivity
@ 2015-10-12 22:23                   ` Dave Chinner
  2015-10-13  9:11                     ` Avi Kivity
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2015-10-12 22:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Brian Foster, Eric Sandeen, Gleb Natapov, xfs

On Mon, Oct 12, 2015 at 03:37:04PM +0300, Avi Kivity wrote:
> On 10/08/2015 02:46 PM, Dave Chinner wrote:
> >On Thu, Oct 08, 2015 at 11:23:07AM +0300, Gleb Natapov wrote:
> >>On Thu, Oct 08, 2015 at 08:21:58AM +0300, Avi Kivity wrote:
> >>>>>>I fixed something similar in ext4 at the time, FWIW.
> >>>>>Makes sense.
> >>>>>
> >>>>>Is there a way to relax this for reads?
> >>>>The above mostly only applies to writes. Reads don't modify data so
> >>>>racing unaligned reads against other reads won't given unexpected
> >>>>results and so aren't serialised.
> >>>>
> >>>>i.e. serialisation will only occur when:
> >>>>	- unaligned write IO will serialise until sub-block zeroing
> >>>>	  is complete.
> >>>>	- write IO extending EOF will serialis until post-EOF
> >>>>	  zeroing is complete
> >>>
> >>>By "complete" here, do you mean that a call to truncate() returned, or that
> >>>its results reached the disk an unknown time later?
> >>>
> >No, I'm talking purely about DIO here. If you do write that
> >starts beyond the existing EOF, there is a region between the
> >current EOF and the offset the write starts at. i.e.
> >
> >    0             EOF            offset     new EOF
> >    +dddddddddddddd+..............+nnnnnnnnnnn+
> >
> >It is the region between EOF and offset that we must ensure is made
> >up of either holes, unwritten extents or fully zeroed blocks before
> >allowing the write to proceed. If we have to zero allocated blocks,
> >then we have to ensure that completes before the write can start.
> >This means that when we update the EOF on completion of the write,
> >we don't expose stale data in blocks that were between EOF and
> >offset...
> 
> Thanks.  We found, experimentally, that io_submit(write_at_eof)
> followed by (without waiting)
> io_submit(write_at_what_would_be_the_new_eof) occasionally blocks.

Yes, that matches up with needing to wait for IO completion to
update the inode size before submitting the next IO.

> So I guess we have to employ a train algorithm here and keep at most
> one aio in flight for append loads (which are very common for us).

Or use prealloc that extends the file, and on startup use an
algorithm that detects the end of data by looking for a zeroed area
that hasn't been written.  SEEK_DATA/SEEK_HOLE can be used to do
this efficiently...
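
A minimal sketch, assuming data is written contiguously from offset 0
and the preallocated-but-unwritten tail shows up as a hole:

#define _GNU_SOURCE             /* SEEK_HOLE */
#include <unistd.h>
#include <fcntl.h>

/* Offset of the first hole, i.e. the end of the written data. */
static off_t end_of_data(int fd)
{
        return lseek(fd, 0, SEEK_HOLE); /* -1 with errno set on error */
}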

> >>I think Brian already answered that one with:
> >>
> >>   There are no such pitfalls as far as I'm aware. The entire AIO
> >>   submission synchronization sequence triggers off an in-memory i_size
> >>   check in xfs_file_aio_write_checks(). The in-memory i_size is updated in
> >>   the truncate path (xfs_setattr_size()) via truncate_setsize(), so at
> >>   that point the new size should be visible to subsequent AIO writers.
> >Different situation as truncate serialises all IO. Extending the file
> >via truncate also runs the same "EOF zeroing" that the DIO code runs
> >above, for the same reasons.
> 
> Does that mean that truncate() will wait for inflight aios, or that
> new aios will wait for the truncate() to complete, or both?

Both.

> >If you want to limit fragmentation without adding and overhead on
> >XFS for non-sparse files (which it sounds like your case), then the
> >best thing to use in XFS is the per-inode extent size hints. You set
> >it on the file when first creating it (or the parent directory so
> >all children inherit it at create), and then the allocator will
> >round out allocations to the size hint alignment and size, including
> >beyond EOF so appending writes can take advantage of it....
> 
> We'll try that out.  That's fsxattr::fsx_extsize?

*nod*

> What about small files that are eventually closed, do I need to do
> anything to reclaim the preallocated space?

Truncate to the current size (i.e. new size = old size) will remove
the extents beyond EOF, as will punching a hole from EOF for a
distance larger than the extent size hint.
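
i.e. something like this (rough sketch) once you're done appending to
the file:

#include <sys/stat.h>
#include <unistd.h>

/* Drop preallocated blocks beyond EOF by truncating to the current
 * size (new size == old size). */
static int trim_preallocation(int fd)
{
        struct stat st;

        if (fstat(fd, &st) < 0)
                return -1;
        return ftruncate(fd, st.st_size);
}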

> >>>A final point is discoverability.  There is no way to discover safe
> >>>alignment for reads and writes, and which operations block io_submit(),
> >>>except by asking here, which cannot be done at runtime.  Interfaces that
> >>>provide a way to query these attributes are very important to us.
> >>As Brian pointed statfs() can be use to get f_bsize which is defined as
> >>"optimal transfer block size".
> >Well, that's what posix calls it. It's not really the optimal IO
> >size, though, it's just the IO size that avoids page cache RMW
> >cycles. For direct IO, larger tends to be better, and IO aligned to
> >the underlying geometry of the storage is even better. See, for
> >example, the "largeio" mount option, which will make XFS report the
> >stripe width in f_bsize rather than the PAGE_SIZE of the machine....
> >
> 
> Well, random reads will still be faster with 512 byte alignment,
> yes?

Define "faster". :)

If you are talking about minimal latency, then an
individual IO will be marginally faster. If you are worried about
bulk throughput, then your storage will be IOPS bound (hence
destroying latency determinism) and it won't be faster by any metric
you care to measure, because you'll end up with blocking in the
request queues during submission...

> and for random writes, you can't just make those I/Os larger,
> you'll overwrite something.
> 
> So I read "optimal" here to mean "smallest I/O size that doesn't
> incur a penalty; but if you really need more data, making it larger
> will help".

You hit the nail on the head. For an asynchronous IO engine like
you seem to be building, I'd be aiming for an IO size that maximises
the bulk throughput to/from the storage devices, rather than one
that aims for minimum latency on any one individual IO. i.e. aim
for the minimum IO size that achieves >80% of the usable bandwidth
the storage device has...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question about non asynchronous aio calls.
  2015-10-12 22:23                   ` Dave Chinner
@ 2015-10-13  9:11                     ` Avi Kivity
  0 siblings, 0 replies; 13+ messages in thread
From: Avi Kivity @ 2015-10-13  9:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Eric Sandeen, Gleb Natapov, xfs

On 10/13/2015 01:23 AM, Dave Chinner wrote:
> On Mon, Oct 12, 2015 at 03:37:04PM +0300, Avi Kivity wrote:
>> On 10/08/2015 02:46 PM, Dave Chinner wrote:
>>> On Thu, Oct 08, 2015 at 11:23:07AM +0300, Gleb Natapov wrote:
>>>> On Thu, Oct 08, 2015 at 08:21:58AM +0300, Avi Kivity wrote:
>>>>>>>> I fixed something similar in ext4 at the time, FWIW.
>>>>>>> Makes sense.
>>>>>>>
>>>>>>> Is there a way to relax this for reads?
>>>>>> The above mostly only applies to writes. Reads don't modify data so
>>>>>> racing unaligned reads against other reads won't given unexpected
>>>>>> results and so aren't serialised.
>>>>>>
>>>>>> i.e. serialisation will only occur when:
>>>>>> 	- unaligned write IO will serialise until sub-block zeroing
>>>>>> 	  is complete.
>>>>>> 	- write IO extending EOF will serialis until post-EOF
>>>>>> 	  zeroing is complete
>>>>> By "complete" here, do you mean that a call to truncate() returned, or that
>>>>> its results reached the disk an unknown time later?
>>>>>
>>> No, I'm talking purely about DIO here. If you do write that
>>> starts beyond the existing EOF, there is a region between the
>>> current EOF and the offset the write starts at. i.e.
>>>
>>>     0             EOF            offset     new EOF
>>>     +dddddddddddddd+..............+nnnnnnnnnnn+
>>>
>>> It is the region between EOF and offset that we must ensure is made
>>> up of either holes, unwritten extents or fully zeroed blocks before
>>> allowing the write to proceed. If we have to zero allocated blocks,
>>> then we have to ensure that completes before the write can start.
>>> This means that when we update the EOF on completion of the write,
>>> we don't expose stale data in blocks that were between EOF and
>>> offset...
>> Thanks.  We found, experimentally, that io_submit(write_at_eof)
>> followed by (without waiting)
>> io_submit(write_at_what_would_be_the_new_eof) occasionally blocks.
> Yes, that matches up with needing to wait for IO completion to
> update the inode size before submitting the next IO.
>
>> So I guess we have to employ a train algorithm here and keep at most
>> one aio in flight for append loads (which are very common for us).
> Or use prealloc that extends the file and on staartup use and
> algorithm that detects the end of data by looking for zeroed area
> that hasn't been written.  SEEK_DATA/SEEK_HOLE can be used to do
> this efficiently...

Given that prealloc interferes with aio, we'll just give up the extra 
concurrency here.

>
>>>> I think Brian already answered that one with:
>>>>
>>>>    There are no such pitfalls as far as I'm aware. The entire AIO
>>>>    submission synchronization sequence triggers off an in-memory i_size
>>>>    check in xfs_file_aio_write_checks(). The in-memory i_size is updated in
>>>>    the truncate path (xfs_setattr_size()) via truncate_setsize(), so at
>>>>    that point the new size should be visible to subsequent AIO writers.
>>> Different situation as truncate serialises all IO. Extending the file
>>> via truncate also runs the same "EOF zeroing" that the DIO code runs
>>> above, for the same reasons.
>> Does that mean that truncate() will wait for inflight aios, or that
>> new aios will wait for the truncate() to complete, or both?
> Both.
>
>>> If you want to limit fragmentation without adding and overhead on
>>> XFS for non-sparse files (which it sounds like your case), then the
>>> best thing to use in XFS is the per-inode extent size hints. You set
>>> it on the file when first creating it (or the parent directory so
>>> all children inherit it at create), and then the allocator will
>>> round out allocations to the size hint alignment and size, including
>>> beyond EOF so appending writes can take advantage of it....
>> We'll try that out.  That's fsxattr::fsx_extsize?
> *nod*
>
>> What about small files that are eventually closed, do I need to do
>> anything to reclaim the preallocated space?
> Truncate to the current size (i.e. new size = old size) will remove
> the extents beyond EOF, so will punching a hole from EOF for a
> distance larger than the extent size hint.

OK.  We already have to truncate if the file size turns out not to be 
aligned on a block boundary, so we can just make it unconditional.

>
>>>>> A final point is discoverability.  There is no way to discover safe
>>>>> alignment for reads and writes, and which operations block io_submit(),
>>>>> except by asking here, which cannot be done at runtime.  Interfaces that
>>>>> provide a way to query these attributes are very important to us.
>>>> As Brian pointed statfs() can be use to get f_bsize which is defined as
>>>> "optimal transfer block size".
>>> Well, that's what posix calls it. It's not really the optimal IO
>>> size, though, it's just the IO size that avoids page cache RMW
>>> cycles. For direct IO, larger tends to be better, and IO aligned to
>>> the underlying geometry of the storage is even better. See, for
>>> example, the "largeio" mount option, which will make XFS report the
>>> stripe width in f_bsize rather than the PAGE_SIZE of the machine....
>>>
>> Well, random reads will still be faster with 512 byte alignment,
>> yes?
> Define "faster". :)
>
> If you are talking about minimal latency, then an
> individual IO will be marginally faster. If you are worried about
> bulk throughput, then you storage will be IOPS bound (hence
> destroying latency determinism) and it won't be faster by any metric
> you care to measure because you'll end up with blocking in the
> request queues during submission...

There is also PCIe link saturation.  Smaller I/Os mean we'll reach 
saturation later, and so the device can push more data.

Our workload reads variable-sized pieces of data in random locations on 
the disk.  Increasing the alignment will increase bandwidth, yes, but it 
won't increase the bandwidth of useful data.

>
>> and for random writes, you can't just make those I/Os larger,
>> you'll overwrite something.
>>
>> So I read "optimal" here to mean "smallest I/O size that doesn't
>> incur a penalty; but if you really need more data, making it larger
>> will help".
> You hit the nail on the head. For an asynchornous IO engine like
> you seem to be building,

[http://seastar-project.org.  Everything is async; we push 
open/truncate/fsync to a worker thread, but otherwise everything is one 
thread per core, and the only syscalls are io_submit and io_getevents.

Btw, something that may help (I did not measure it) is aio fsync. I've 
read the thread about it, and using a workqueue in the kernel rather 
than a worker thread in userspace probably won't give an advantage, but 
for the special case of aio+dio, do you need the workqueue?  It may be 
possible to special-case it, and then you could coalesce several aio 
fsyncs into a single device flush.]

>   I'd be aiming for an IO size that maximises
> the bulk throughput to/from the storage devices, rather than one
> that aims for minimum latency on any one individiual IO.

That I/O size is infinite: the larger your I/Os, the better your 
efficiency.  But from the application point of view, you aren't 
increasing the amount of useful data.  The application (for random 
workloads) wants to transfer the minimum amount of data possible, as 
long as it doesn't cause the kernel or device to drop into a slow path.  
So far that magic value seems to be the device block size.

>   i.e. aim
> for the minimum IO size that acheives >80% of the usable bandwidth
> the storage device has...


