* Async direct IO write vs buffered read race
@ 2017-06-22 15:57 Lukas Czerner
  2017-06-22 16:55 ` Jeff Moyer
  0 siblings, 1 reply; 8+ messages in thread
From: Lukas Czerner @ 2017-06-22 15:57 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, esandeen

Hello,

I am dealing with a problem where, if a buffered read happens to land
between direct IO submission and completion, the page cache ends up
containing stale data while the new data is already on disk.

We try to avoid such problems by calling
invalidate_inode_pages2_range() before and after direct_IO() in
generic_file_direct_write(); however, that does not seem to be enough,
because nothing prevents buffered reads from coming in afterwards and
repopulating the page cache.
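
For reference, this is the rough shape of that bracketing in
generic_file_direct_write() (abridged from memory, not a verbatim copy of
mm/filemap.c):

	/* first invalidation, before the direct IO is issued */
	if (mapping->nrpages)
		invalidate_inode_pages2_range(mapping,
					pos >> PAGE_SHIFT, end);

	written = mapping->a_ops->direct_IO(iocb, from);

	/*
	 * second invalidation; for AIO the call above has only submitted
	 * the IO, so the write may still be in flight at this point and
	 * a buffered read can repopulate the cache right after it
	 */
	if (mapping->nrpages)
		invalidate_inode_pages2_range(mapping,
					pos >> PAGE_SHIFT, end);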

Aside from the fact that mixing direct and buffered IO is not such a
good idea, we end up with the page cache showing different content than
what is on disk even after the AIO DIO completes, which seems very
strange to me.

I can reproduce this on ext4 as well as xfs, on kernel versions going
back at least to v3.10, which leads me to believe that this might
actually be known behaviour?
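
Roughly, the race can be illustrated with a libaio program along these
lines (illustration only, not my exact test case; error handling omitted,
and the buffered read simply has to land in the submission-completion
window):

/* build with: gcc -o dio-race dio-race.c -laio */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ 4096

int main(int argc, char **argv)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	char *wbuf, rbuf[SZ];
	int dfd, bfd;

	if (argc < 2)
		return 1;

	dfd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); /* direct writer */
	bfd = open(argv[1], O_RDONLY);                           /* buffered reader */
	posix_memalign((void **)&wbuf, SZ, SZ);

	memset(wbuf, 'O', SZ);
	pwrite(dfd, wbuf, SZ, 0);      /* "old" data, bypasses the page cache */

	io_setup(1, &ctx);
	memset(wbuf, 'N', SZ);
	io_prep_pwrite(&cb, dfd, wbuf, SZ, 0);
	io_submit(ctx, 1, cbs);        /* AIO DIO write of "new" data submitted */

	pread(bfd, rbuf, SZ, 0);       /* buffered read in the race window may
					  populate the cache with the old data */
	io_getevents(ctx, 1, 1, &ev, NULL);  /* wait for the DIO to complete */

	pread(bfd, rbuf, SZ, 0);       /* served from the page cache */
	printf("second buffered read sees '%c', disk now has 'N'\n", rbuf[0]);
	return 0;
}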

I was trying to avoid that by moving invalidate_inode_pages2_range() to
after the AIO DIO completion, into dio_complete() (or the file system's
->end_io callback), but that has its own problems - sometimes this
appears to be called from atomic context and I do not really see why...

Do you have any comments on this? Is it actually expected behaviour?

Thanks!
-Lukas


* Re: Async direct IO write vs buffered read race
  2017-06-22 15:57 Async direct IO write vs buffered read race Lukas Czerner
@ 2017-06-22 16:55 ` Jeff Moyer
  2017-06-23  7:59   ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Moyer @ 2017-06-22 16:55 UTC (permalink / raw)
  To: Lukas Czerner, linux-fsdevel; +Cc: viro, esandeen, Christoph Hellwig, Jan Kara

Lukas Czerner <lczerner@redhat.com> writes:

> Hello,
>
> I am dealing with a problem where, if a buffered read happens to land
> between direct IO submission and completion, the page cache ends up
> containing stale data while the new data is already on disk.
>
> We try to avoid such problems by calling
> invalidate_inode_pages2_range() before and after direct_IO() in
> generic_file_direct_write(); however, that does not seem to be enough,
> because nothing prevents buffered reads from coming in afterwards and
> repopulating the page cache.

Ugh, right.  With aio, we're doing the invalidate after the submission,
not the completion.

> Aside from the fact that mixing direct and buffered IO is not such a
> good idea, we end up with the page cache showing different content than
> what is on disk even after the AIO DIO completes, which seems very
> strange to me.
>
> I can reproduce this on ext4 as well as xfs, on kernel versions going
> back at least to v3.10, which leads me to believe that this might
> actually be known behaviour?

At least I didn't know about it.  ;-)

> I was trying to avoid that by moving invalidate_inode_pages2_range() to
> after the AIO DIO completion, into dio_complete() (or the file system's
> ->end_io callback), but that has its own problems - sometimes this
> appears to be called from atomic context and I do not really see why...

Well, I/O completion processing of course happens in atomic context.  We
do defer some things (like O_(D)SYNC processing) to process context.  I
guess we could add another qualifier inside of dio_bio_end_aio:

	bool defer_completion = false;
	if (dio->result)
		defer_completion = dio->defer_completion || 
			(dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages);

	if (remaining == 0) {
		if (defer_completion) {
			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
			queue_work(dio->inode->i_sb->s_dio_done_wq,
				   &dio->complete_work);
...

(I'm not sure whether we also have to take into account exceptional entries.)

And then call invalidate_inode_pages2_range from dio_aio_complete_work.
That at least wouldn't defer /all/ completion processing to a workqueue.
However, it will slow things down when there is mixed buffered and
direct I/O.
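
For illustration, a rough sketch (not a tested patch) of what
dio_aio_complete_work() could then look like - I'm invalidating the whole
mapping here to keep it short; the real thing would want
invalidate_inode_pages2_range() over just the written range, and there's
probably an ordering question with respect to the filesystem's ->end_io
conversion:

static void dio_aio_complete_work(struct work_struct *work)
{
	struct dio *dio = container_of(work, struct dio, complete_work);

	/* the bios have completed, so the new data is on disk; kick out
	 * anything a racing buffered read may have pulled into the cache */
	if (dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)
		invalidate_inode_pages2(dio->inode->i_mapping);

	dio_complete(dio, 0, true);
}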

Christoph or Jan, any thoughts on this?

> Do you have any comments on this? Is it actually expected behaviour?

It's not expected behaviour.

-Jeff


* Re: Async direct IO write vs buffered read race
  2017-06-22 16:55 ` Jeff Moyer
@ 2017-06-23  7:59   ` Jan Kara
  2017-06-23 10:16     ` Lukas Czerner
  2017-06-23 18:04     ` Eric Sandeen
  0 siblings, 2 replies; 8+ messages in thread
From: Jan Kara @ 2017-06-23  7:59 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Lukas Czerner, linux-fsdevel, viro, esandeen, Christoph Hellwig,
	Jan Kara

On Thu 22-06-17 12:55:50, Jeff Moyer wrote:
> Lukas Czerner <lczerner@redhat.com> writes:
> 
> > Hello,
> >
> > I am dealing with a problem where, if a buffered read happens to land
> > between direct IO submission and completion, the page cache ends up
> > containing stale data while the new data is already on disk.
> >
> > We try to avoid such problems by calling
> > invalidate_inode_pages2_range() before and after direct_IO() in
> > generic_file_direct_write(); however, that does not seem to be enough,
> > because nothing prevents buffered reads from coming in afterwards and
> > repopulating the page cache.
> 
> Ugh, right.  With aio, we're doing the invalidate after the submission,
> not the completion.
> 
> > Aside from the fact that mixing direct and buffered IO is not such a
> > good idea, we end up with the page cache showing different content than
> > what is on disk even after the AIO DIO completes, which seems very
> > strange to me.
> >
> > I can reproduce this on ext4 as well as xfs, on kernel versions going
> > back at least to v3.10, which leads me to believe that this might
> > actually be known behaviour?
> 
> At least I didn't know about it.  ;-)

I'm actually aware of it :)

> > I was trying to avoid that by moving invalidate_inode_pages2_range() to
> > after the AIO DIO completion, into dio_complete() (or the file system's
> > ->end_io callback), but that has its own problems - sometimes this
> > appears to be called from atomic context and I do not really see why...
> 
> Well, I/O completion processing of course happens in atomic context.  We
> do defer some things (like O_(D)SYNC processing) to process context.  I
> guess we could add another qualifier inside of dio_bio_end_aio:
> 
> 	bool defer_completion = false;
> 	if (dio->result)
> 		defer_completion = dio->defer_completion || 
> 			(dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages);
> 
> 	if (remaining == 0) {
> 		if (defer_completion) {
> 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
> 			queue_work(dio->inode->i_sb->s_dio_done_wq,
> 				   &dio->complete_work);
> ...
> 
> (I'm not sure whether we also have to take into account exceptional entries.)
> 
> And then call invalidate_inode_pages2_range from dio_aio_complete_work.
> That at least wouldn't defer /all/ completion processing to a workqueue.
> However, it will slow things down when there is mixed buffered and
> direct I/O.
> 
> Christoph or Jan, any thoughts on this?

So our stance has been: Do not ever mix buffered and direct IO! Definitely
not on the same file range, most definitely not at the same time.

The thing we do is best effort; it more or less guarantees that if
you do, say, buffered IO and then direct IO after that, it will work
reasonably. However, if direct and buffered IO can race, bad luck for your
data. I don't think we want to sacrifice any performance of AIO DIO (and
offloading direct IO completion to a workqueue so that we can do the
invalidation costs a noticeable amount of performance) for supporting such
a use case.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Async direct IO write vs buffered read race
  2017-06-23  7:59   ` Jan Kara
@ 2017-06-23 10:16     ` Lukas Czerner
  2017-06-26 15:11       ` Jeff Moyer
  2017-06-23 18:04     ` Eric Sandeen
  1 sibling, 1 reply; 8+ messages in thread
From: Lukas Czerner @ 2017-06-23 10:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jeff Moyer, linux-fsdevel, viro, esandeen, Christoph Hellwig

On Fri, Jun 23, 2017 at 09:59:42AM +0200, Jan Kara wrote:
> On Thu 22-06-17 12:55:50, Jeff Moyer wrote:
> > Lukas Czerner <lczerner@redhat.com> writes:
> > 
> > > Hello,
> > >
> > > I am dealing with a problem where, if a buffered read happens to land
> > > between direct IO submission and completion, the page cache ends up
> > > containing stale data while the new data is already on disk.
> > >
> > > We try to avoid such problems by calling
> > > invalidate_inode_pages2_range() before and after direct_IO() in
> > > generic_file_direct_write(); however, that does not seem to be enough,
> > > because nothing prevents buffered reads from coming in afterwards and
> > > repopulating the page cache.
> > 
> > Ugh, right.  With aio, we're doing the invalidate after the submission,
> > not the completion.
> > 
> > > Aside from the fact that mixing direct and buffered IO is not such a
> > > good idea, we end up with the page cache showing different content than
> > > what is on disk even after the AIO DIO completes, which seems very
> > > strange to me.
> > >
> > > I can reproduce this on ext4 as well as xfs, on kernel versions going
> > > back at least to v3.10, which leads me to believe that this might
> > > actually be known behaviour?
> > 
> > At least I didn't know about it.  ;-)
> 
> I'm actually aware of it :)
> 
> > > I was trying to avoid that by moving invalidate_inode_pages2_range() to
> > > after the AIO DIO completion, into dio_complete() (or the file system's
> > > ->end_io callback), but that has its own problems - sometimes this
> > > appears to be called from atomic context and I do not really see why...
> > 
> > Well, I/O completion processing of course happens in atomic context.  We
> > do defer some things (like O_(D)SYNC processing) to process context.  I
> > guess we could add another qualifier inside of dio_bio_end_aio:
> > 
> > 	bool defer_completion = false;
> > 	if (dio->result)
> > 		defer_completion = dio->defer_completion || 
> > 			(dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages);
> > 
> > 	if (remaining == 0) {
> > 		if (defer_completion) {
> > 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
> > 			queue_work(dio->inode->i_sb->s_dio_done_wq,
> > 				   &dio->complete_work);
> > ...
> > 
> > (I'm not sure whether we also have to take into account exceptional entries.)
> > 
> > And then call invalidate_inode_pages2_range from dio_aio_complete_work.
> > That at least wouldn't defer /all/ completion processing to a workqueue.
> > However, it will slow things down when there is mixed buffered and
> > direct I/O.
> > 
> > Christoph or Jan, any thoughts on this?
> 
> So our stance has been: Do not ever mix buffered and direct IO! Definitely
> not on the same file range, most definitely not at the same time.

True, that's a good recommendation.

> 
> The thing we do is best effort; it more or less guarantees that if
> you do, say, buffered IO and then direct IO after that, it will work
> reasonably. However, if direct and buffered IO can race, bad luck for your
> data. I don't think we want to sacrifice any performance of AIO DIO (and
> offloading direct IO completion to a workqueue so that we can do the
> invalidation costs a noticeable amount of performance) for supporting such
> a use case.

What Jeff proposed would sacrifice performance only in the case where an
AIO DIO write actually races with buffered IO - the situation we agree is
not ideal and should be avoided anyway. For the rest of AIO DIO this should
have no effect, right? If true, I'd say this is a good effort to make sure
we do not end up with a disparity between the page cache and the disk.

-Lukas

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: Async direct IO write vs buffered read race
  2017-06-23  7:59   ` Jan Kara
  2017-06-23 10:16     ` Lukas Czerner
@ 2017-06-23 18:04     ` Eric Sandeen
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Sandeen @ 2017-06-23 18:04 UTC (permalink / raw)
  To: Jan Kara, Jeff Moyer
  Cc: Lukas Czerner, linux-fsdevel, viro, Christoph Hellwig

On 6/23/17 2:59 AM, Jan Kara wrote:
> On Thu 22-06-17 12:55:50, Jeff Moyer wrote:


>> Christoph or Jan, any thoughts on this?
> 
> So our stance has been: Do not ever mix buffered and direct IO! Definitely
> not on the same file range, most definitely not at the same time.

FWIW, I'd always known that concurrent buffered & direct IO wasn't
particularly deterministic, i.e. a racing buffered read may get either
old or new data at that point.

I was surprised to learn that the stale file data would linger
indefinitely in the page cache though.  Maybe I was just blissfully
unaware.  :)

-Eric

> The thing we do is best effort; it more or less guarantees that if
> you do, say, buffered IO and then direct IO after that, it will work
> reasonably. However, if direct and buffered IO can race, bad luck for your
> data. I don't think we want to sacrifice any performance of AIO DIO (and
> offloading direct IO completion to a workqueue so that we can do the
> invalidation costs a noticeable amount of performance) for supporting such
> a use case.
> 
> 								Honza
> 


* Re: Async direct IO write vs buffered read race
  2017-06-23 10:16     ` Lukas Czerner
@ 2017-06-26 15:11       ` Jeff Moyer
  2017-06-28 16:57         ` Rik van Riel
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Moyer @ 2017-06-26 15:11 UTC (permalink / raw)
  To: Jan Kara, Lukas Czerner; +Cc: linux-fsdevel, viro, esandeen, Christoph Hellwig

Lukas Czerner <lczerner@redhat.com> writes:

>> The thing we do is best effort; it more or less guarantees that if
>> you do, say, buffered IO and then direct IO after that, it will work
>> reasonably. However, if direct and buffered IO can race, bad luck for your
>> data. I don't think we want to sacrifice any performance of AIO DIO (and
>> offloading direct IO completion to a workqueue so that we can do the
>> invalidation costs a noticeable amount of performance) for supporting such
>> a use case.
>
> What Jeff proposed would sacrifice performance only in the case where an
> AIO DIO write actually races with buffered IO - the situation we agree is
> not ideal and should be avoided anyway. For the rest of AIO DIO this should
> have no effect, right? If true, I'd say this is a good effort to make sure
> we do not end up with a disparity between the page cache and the disk.

Exactly.  Jan, are you concerned about impacting performance for mixed
buffered I/O and direct writes?  If so, we could look into restricting the
deferral to process context further, to just the cases where buffered and
direct I/O actually overlap (assuming there are no locking issues).

Alternatively, since we already know this is racy, we don't actually
have to defer I/O completion to process context.  We could just complete
the I/O as we normally would, but also queue up an
invalidate_inode_pages2_range work item.  It will be asynchronous, but
this is best effort, anyway.
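
Something along these lines, purely as a sketch - the wrapper struct and
function names below are made up for illustration, and the range
bookkeeping is hand-waved:

struct dio_inval_work {
	struct work_struct work;
	struct address_space *mapping;
	pgoff_t start, end;
};

static void dio_inval_fn(struct work_struct *work)
{
	struct dio_inval_work *iw =
		container_of(work, struct dio_inval_work, work);

	/* best effort: drop whatever a racing buffered read cached */
	invalidate_inode_pages2_range(iw->mapping, iw->start, iw->end);
	kfree(iw);
}

The completion itself would stay where it is; dio_bio_end_aio() would just
allocate one of these (GFP_ATOMIC) and queue it on the existing
s_dio_done_wq.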

As Eric mentioned, the thing that bothers me is that we have invalid
data lingering in the page cache indefinitely.

Cheers,
Jeff


* Re: Async direct IO write vs buffered read race
  2017-06-26 15:11       ` Jeff Moyer
@ 2017-06-28 16:57         ` Rik van Riel
  2017-06-30 11:16           ` Lukas Czerner
  0 siblings, 1 reply; 8+ messages in thread
From: Rik van Riel @ 2017-06-28 16:57 UTC (permalink / raw)
  To: Jeff Moyer, Jan Kara, Lukas Czerner
  Cc: linux-fsdevel, viro, esandeen, Christoph Hellwig


On Mon, 2017-06-26 at 11:11 -0400, Jeff Moyer wrote:
> Lukas Czerner <lczerner@redhat.com> writes:
> 
> > > The thing we do is best effort; it more or less guarantees that if
> > > you do, say, buffered IO and then direct IO after that, it will work
> > > reasonably. However, if direct and buffered IO can race, bad luck for
> > > your data. I don't think we want to sacrifice any performance of AIO
> > > DIO (and offloading direct IO completion to a workqueue so that we can
> > > do the invalidation costs a noticeable amount of performance) for
> > > supporting such a use case.
> > 
> > What Jeff proposed would sacrifice performance only in the case where an
> > AIO DIO write actually races with buffered IO - the situation we agree
> > is not ideal and should be avoided anyway. For the rest of AIO DIO this
> > should have no effect, right? If true, I'd say this is a good effort to
> > make sure we do not end up with a disparity between the page cache and
> > the disk.
> 
> Exactly.  Jan, are you concerned about impacting performance for mixed
> buffered I/O and direct writes?  If so, we could look into restricting the
> deferral to process context further, to just the cases where buffered and
> direct I/O actually overlap (assuming there are no locking issues).
> 
> Alternatively, since we already know this is racy, we don't actually
> have to defer I/O completion to process context.  We could just complete
> the I/O as we normally would, but also queue up an
> invalidate_inode_pages2_range work item.  It will be asynchronous, but
> this is best effort, anyway.
> 
> As Eric mentioned, the thing that bothers me is that we have invalid
> data lingering in the page cache indefinitely.

Given that the requirement is that the page cache
gets invalidated after IO completion, would it be
possible to defer only the page cache invalidation
to task context, and handle the rest of the IO
completion in interrupt context?

-- 
All rights reversed



* Re: Async direct IO write vs buffered read race
  2017-06-28 16:57         ` Rik van Riel
@ 2017-06-30 11:16           ` Lukas Czerner
  0 siblings, 0 replies; 8+ messages in thread
From: Lukas Czerner @ 2017-06-30 11:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jeff Moyer, Jan Kara, linux-fsdevel, viro, esandeen, Christoph Hellwig

On Wed, Jun 28, 2017 at 12:57:59PM -0400, Rik van Riel wrote:
> On Mon, 2017-06-26 at 11:11 -0400, Jeff Moyer wrote:
> > Lukas Czerner <lczerner@redhat.com> writes:
> > 
> > > > The thing we do is best effort; it more or less guarantees that if
> > > > you do, say, buffered IO and then direct IO after that, it will work
> > > > reasonably. However, if direct and buffered IO can race, bad luck for
> > > > your data. I don't think we want to sacrifice any performance of AIO
> > > > DIO (and offloading direct IO completion to a workqueue so that we
> > > > can do the invalidation costs a noticeable amount of performance) for
> > > > supporting such a use case.
> > > 
> > > What Jeff proposed would sacrifice performance only in the case where
> > > an AIO DIO write actually races with buffered IO - the situation we
> > > agree is not ideal and should be avoided anyway. For the rest of AIO
> > > DIO this should have no effect, right? If true, I'd say this is a good
> > > effort to make sure we do not end up with a disparity between the page
> > > cache and the disk.
> > 
> > Exactly.  Jan, are you concerned about impacting performance for mixed
> > buffered I/O and direct writes?  If so, we could look into restricting
> > the deferral to process context further, to just the cases where buffered
> > and direct I/O actually overlap (assuming there are no locking issues).
> > 
> > Alternatively, since we already know this is racy, we don't actually
> > have to defer I/O completion to process context.  We could just complete
> > the I/O as we normally would, but also queue up an
> > invalidate_inode_pages2_range work item.  It will be asynchronous, but
> > this is best effort, anyway.
> > 
> > As Eric mentioned, the thing that bothers me is that we have invalid
> > data lingering in the page cache indefinitely.
> 
> Given that the requirement is that the page cache
> gets invalidated after IO completion, would it be
> possible to defer only the page cache invalidation
> to task context, and handle the rest of the IO
> completion in interrupt context?

Hi,

If I am reading it correctly, that is basically how it works now for IO
that has defer_completion set (filesystems set it to do extent conversion
at completion time). We'd use the same path here for the invalidation.

-Lukas

> 
> -- 
> All rights reversed

