[PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes
@ 2016-08-31  2:56 NeilBrown
  2016-08-31  2:58 ` [PATCH 1/2] cephfs: ignore error from invalidate_inode_pages2_range() in direct write NeilBrown
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: NeilBrown @ 2016-08-31  2:56 UTC (permalink / raw)
  To: Yan, Zheng, Sage Weil, Ilya Dryomov; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 559 bytes --]

Hi,
 it is quite possible for O_DIRECT and buffered writes to a file to
 race.
 The xfstests test suite has a test - generic/036 - which tests this
 case.

 Unlike most filesystems, cephfs does not hold inode_lock() across
 direct writes.  This means that buffer pages can become dirty while
 direct writes are happening.  This confused ceph a little.

 The following two patches allow ceph to handle this possibility a
 little more cleanly.  The more important patch removes a WARN_ON() for
 a circumstance which can easily be triggered.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] cephfs: ignore error from invalidate_inode_pages2_range() in direct write.
  2016-08-31  2:56 [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes NeilBrown
@ 2016-08-31  2:58 ` NeilBrown
  2016-08-31 13:47   ` Jeff Layton
  2016-08-31  2:59 ` [PATCH 2/2] cephfs: remove warning when ceph_releasepage() is called on dirty page NeilBrown
  2016-08-31  9:19 ` [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes Yan, Zheng
  2 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2016-08-31  2:58 UTC (permalink / raw)
  To: Yan, Zheng, Sage Weil, Ilya Dryomov; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1571 bytes --]

This call can fail if there are dirty pages.  The preceding call to
filemap_write_and_wait_range() will normally remove dirty pages, but
as inode_lock() is not held over calls to ceph_direct_read_write(), it
could race with non-direct writes and pages could be dirtied
immediately after filemap_write_and_wait_range() returns

If there are dirty pages, they will be removed by the subsequent call
to truncate_inode_pages_range(), so having them here is not a problem.

If the 'ret' value is left holding an error, then in the async IO case
(aio_req is not NULL) the loop that would normally call
ceph_osdc_start_request() will see the error in 'ret' and abort all
requests.  This doesn't seem like correct behaviour.

So clear 'ret' and ignore the error (other than the dout() message).

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/ceph/file.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 0f5375d8e030..1ca6e29edcc9 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -905,8 +905,14 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 		ret = invalidate_inode_pages2_range(inode->i_mapping,
 					pos >> PAGE_SHIFT,
 					(pos + count) >> PAGE_SHIFT);
-		if (ret < 0)
+		if (ret < 0) {
 			dout("invalidate_inode_pages2_range returned %d\n", ret);
+			/*
+			 * Error is not fatal as we truncate_inode_pages_range()
+			 * below.
+			 */
+			ret = 0;
+		}

 		flags = CEPH_OSD_FLAG_ORDERSNAP |
 			CEPH_OSD_FLAG_ONDISK |
-- 
2.9.3

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] cephfs: remove warning when ceph_releasepage() is called on dirty page.
  2016-08-31  2:56 [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes NeilBrown
  2016-08-31  2:58 ` [PATCH 1/2] cephfs: ignore error from invalidate_inode_pages2_range() in direct write NeilBrown
@ 2016-08-31  2:59 ` NeilBrown
  2016-08-31 13:48   ` Jeff Layton
  2016-08-31  9:19 ` [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes Yan, Zheng
  2 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2016-08-31  2:59 UTC (permalink / raw)
  To: Yan, Zheng, Sage Weil, Ilya Dryomov; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1357 bytes --]


If O_DIRECT writes are racing with buffered writes, then
the call to invalidate_inode_pages2_range() can call ceph_releasepage()
on dirty pages.

Most filesystems hold inode_lock() across O_DIRECT writes so they do not
suffer this race, but cephfs deliberately drops the lock, and opens a window
for the race.

This race can be triggered with the generic/036 test from the xfstests
test suite.  It doesn't happen every time, but it does happen often.

As the possibilty is expected, remove the warning, and instead include
the PageDirty() status in the debug message.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/ceph/addr.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index d5b6f959a3c3..f657da7d12ff 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -175,9 +175,8 @@ static void ceph_invalidatepage(struct page *page, unsigned int offset,
 
 static int ceph_releasepage(struct page *page, gfp_t g)
 {
-	dout("%p releasepage %p idx %lu\n", page->mapping->host,
-	     page, page->index);
-	WARN_ON(PageDirty(page));
+	dout("%p releasepage %p idx %lu (%sdirty)\n", page->mapping->host,
+	     page, page->index, PageDirty(page) ? "" : "not ");
 
 	/* Can we release the page from the cache? */
 	if (!ceph_release_fscache_page(page, g))
-- 
2.9.3


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes
  2016-08-31  2:56 [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes NeilBrown
  2016-08-31  2:58 ` [PATCH 1/2] cephfs: ignore error from invalidate_inode_pages2_range() in direct write NeilBrown
  2016-08-31  2:59 ` [PATCH 2/2] cephfs: remove warning when ceph_releasepage() is called on dirty page NeilBrown
@ 2016-08-31  9:19 ` Yan, Zheng
  2016-09-01  1:13   ` NeilBrown
  2 siblings, 1 reply; 9+ messages in thread
From: Yan, Zheng @ 2016-08-31  9:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: Yan, Zheng, Sage Weil, Ilya Dryomov, ceph-devel

On Wed, Aug 31, 2016 at 10:56 AM, NeilBrown <neilb@suse.com> wrote:
>
> Hi,
>  it is quite possible for O_DIRECT and buffered writes to a file to
>  race.
>  The xfstests test suite has a test - generic/036 - which tests this
>  case.
>
>  Unlike most filesystems, cephfs does not hold inode_lock() across
>  direct writes.  This means that buffer pages can become dirty while
>  direct writes are happening.  This confused ceph a little.

how about make ceph_write_iter() hold inode_lock() for direct/sync write?

Regards
Yan, Zheng
>
>  The following two patches allow ceph to handle this possibility a
>  little more cleanly.  The more important patch removes a WARN_ON() for
>  a circumstance which can easily be triggered.
>
> Thanks,
> NeilBrown

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] cephfs: ignore error from invalidate_inode_pages2_range() in direct write.
  2016-08-31  2:58 ` [PATCH 1/2] cephfs: ignore error from invalidate_inode_pages2_range() in direct write NeilBrown
@ 2016-08-31 13:47   ` Jeff Layton
  2016-09-01  1:01     ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Layton @ 2016-08-31 13:47 UTC (permalink / raw)
  To: NeilBrown, Yan, Zheng, Sage Weil, Ilya Dryomov; +Cc: ceph-devel

On Wed, 2016-08-31 at 12:58 +1000, NeilBrown wrote:
> This call can fail if there are dirty pages.  The preceding call to
> filemap_write_and_wait_range() will normally remove dirty pages, but
> as inode_lock() is not held over calls to ceph_direct_read_write(), it
> could race with non-direct writes and pages could be dirtied
> immediately after filemap_write_and_wait_range() returns
> 
> If there are dirty pages, they will be removed by the subsequent call
> to truncate_inode_pages_range(), so having them here is not a problem.
> 
> If the 'ret' value is left holding an error, then in the async IO case
> (aio_req is not NULL) the loop that would normally call
> ceph_osdc_start_request() will see the error in 'ret' and abort all
> requests.  This doesn't seem like correct behaviour.
> 
> So clear 'ret' and ignore the error (other than the dout() message).
> 
> > Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/ceph/file.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 0f5375d8e030..1ca6e29edcc9 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -905,8 +905,14 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
> >  		ret = invalidate_inode_pages2_range(inode->i_mapping,
> >  					pos >> PAGE_SHIFT,
> >  					(pos + count) >> PAGE_SHIFT);
> > -		if (ret < 0)
> > +		if (ret < 0) {
> >  			dout("invalidate_inode_pages2_range returned %d\n", ret);
> > +			/*
> > +			 * Error is not fatal as we truncate_inode_pages_range()
> > +			 * below.
> > +			 */
> > +			ret = 0;
> > +		}
>  
> >  		flags = CEPH_OSD_FLAG_ORDERSNAP |
> >  			CEPH_OSD_FLAG_ONDISK |


Good catch. Even better might be to just declare a int ret2 and not
clobber "ret" at all.

Clearly, mixing buffered and direct I/O is gross, but I suppose you
could hit the occasional problem here with a real workload
occasionally.

Should this go to stable? The patch seems safe enough.

Reviewed-by: Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] cephfs: remove warning when ceph_releasepage() is called on dirty page.
  2016-08-31  2:59 ` [PATCH 2/2] cephfs: remove warning when ceph_releasepage() is called on dirty page NeilBrown
@ 2016-08-31 13:48   ` Jeff Layton
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff Layton @ 2016-08-31 13:48 UTC (permalink / raw)
  To: NeilBrown, Yan, Zheng, Sage Weil, Ilya Dryomov; +Cc: ceph-devel

On Wed, 2016-08-31 at 12:59 +1000, NeilBrown wrote:
> If O_DIRECT writes are racing with buffered writes, then
> the call to invalidate_inode_pages2_range() can call ceph_releasepage()
> on dirty pages.
> 
> Most filesystems hold inode_lock() across O_DIRECT writes so they do not
> suffer this race, but cephfs deliberately drops the lock, and opens a window
> for the race.
> 
> This race can be triggered with the generic/036 test from the xfstests
> test suite.  It doesn't happen every time, but it does happen often.
> 
> As the possibilty is expected, remove the warning, and instead include
> the PageDirty() status in the debug message.
> 
> > Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/ceph/addr.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index d5b6f959a3c3..f657da7d12ff 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -175,9 +175,8 @@ static void ceph_invalidatepage(struct page *page, unsigned int offset,
>  
>  static int ceph_releasepage(struct page *page, gfp_t g)
>  {
> > -	dout("%p releasepage %p idx %lu\n", page->mapping->host,
> > -	     page, page->index);
> > -	WARN_ON(PageDirty(page));
> > +	dout("%p releasepage %p idx %lu (%sdirty)\n", page->mapping->host,
> > +	     page, page->index, PageDirty(page) ? "" : "not ");
>  
> >  	/* Can we release the page from the cache? */
> >  	if (!ceph_release_fscache_page(page, g))


Reviewed-by: Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] cephfs: ignore error from invalidate_inode_pages2_range() in direct write.
  2016-08-31 13:47   ` Jeff Layton
@ 2016-09-01  1:01     ` NeilBrown
  0 siblings, 0 replies; 9+ messages in thread
From: NeilBrown @ 2016-09-01  1:01 UTC (permalink / raw)
  To: Jeff Layton, Yan, Zheng, Sage Weil, Ilya Dryomov; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2212 bytes --]

On Wed, Aug 31 2016, Jeff Layton wrote:
>
> Good catch. Even better might be to just declare a int ret2 and not
> clobber "ret" at all.

Like the following?  Must better, yes.

>
> Clearly, mixing buffered and direct I/O is gross, but I suppose you
> could hit the occasional problem here with a real workload
> occasionally.
>
> Should this go to stable? The patch seems safe enough.

Hardly seems worth it, but certainly safe enough.

>
> Reviewed-by: Jeff Layton <jlayton@redhat.com>

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Subject: [PATCH] cephfs: ignore error from invalidate_inode_pages2_range() in
 direct write.

This call can fail if there are dirty pages.  The preceding call to
filemap_write_and_wait_range() will normally remove dirty pages, but
as inode_lock() is not held over calls to ceph_direct_read_write(), it
could race with non-direct writes and pages could be dirtied
immediately after filemap_write_and_wait_range() returns

If there are dirty pages, they will be removed by the subsequent call
to truncate_inode_pages_range(), so having them here is not a problem.

If the 'ret' value is left holding an error, then in the async IO case
(aio_req is not NULL) the loop that would normally call
ceph_osdc_start_request() will see the error in 'ret' and abort all
requests.  This doesn't seem like correct behaviour.

So use separate 'ret2' instead of overloading 'ret'

Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 0f5375d8e030..395c7fcb1cea 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -902,10 +902,10 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 		return ret;

 	if (write) {
-		ret = invalidate_inode_pages2_range(inode->i_mapping,
+		int ret2 = invalidate_inode_pages2_range(inode->i_mapping,
 					pos >> PAGE_SHIFT,
 					(pos + count) >> PAGE_SHIFT);
-		if (ret < 0)
+		if (ret2 < 0)
 			dout("invalidate_inode_pages2_range returned %d\n", ret);

 		flags = CEPH_OSD_FLAG_ORDERSNAP |
-- 
2.9.3

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes
  2016-08-31  9:19 ` [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes Yan, Zheng
@ 2016-09-01  1:13   ` NeilBrown
  2016-09-01 14:30     ` Yan, Zheng
  0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2016-09-01  1:13 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: Yan, Zheng, Sage Weil, Ilya Dryomov, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]

On Wed, Aug 31 2016, Yan, Zheng wrote:

> On Wed, Aug 31, 2016 at 10:56 AM, NeilBrown <neilb@suse.com> wrote:
>>
>> Hi,
>>  it is quite possible for O_DIRECT and buffered writes to a file to
>>  race.
>>  The xfstests test suite has a test - generic/036 - which tests this
>>  case.
>>
>>  Unlike most filesystems, cephfs does not hold inode_lock() across
>>  direct writes.  This means that buffer pages can become dirty while
>>  direct writes are happening.  This confused ceph a little.
>
> how about make ceph_write_iter() hold inode_lock() for direct/sync write?

You could probably do that.  You could even use inode_lock_shared() to
allow multiple writers to perform O_DIRECT writes to the same file in
parallel.

But I don't know why ceph did this differently from every other
filesystem, and the git commit message doesn't shine any light on that
question.  I didn't want to propose a change that I didn't understand
the consequences of.

If you did extend the locking over ceph_direct_read_write(), it would
probably make sense to remove the truncate_inode_pages_range() call as
it should be redundant.

Thanks,
NeilBrown


>
> Regards
> Yan, Zheng
>>
>>  The following two patches allow ceph to handle this possibility a
>>  little more cleanly.  The more important patch removes a WARN_ON() for
>>  a circumstance which can easily be triggered.
>>
>> Thanks,
>> NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes
  2016-09-01  1:13   ` NeilBrown
@ 2016-09-01 14:30     ` Yan, Zheng
  0 siblings, 0 replies; 9+ messages in thread
From: Yan, Zheng @ 2016-09-01 14:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: Zheng Yan, Sage Weil, Ilya Dryomov, ceph-devel


> On Sep 1, 2016, at 09:13, NeilBrown <neilb@suse.com> wrote:
> 
> On Wed, Aug 31 2016, Yan, Zheng wrote:
> 
>> On Wed, Aug 31, 2016 at 10:56 AM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> Hi,
>>> it is quite possible for O_DIRECT and buffered writes to a file to
>>> race.
>>> The xfstests test suite has a test - generic/036 - which tests this
>>> case.
>>> 
>>> Unlike most filesystems, cephfs does not hold inode_lock() across
>>> direct writes.  This means that buffer pages can become dirty while
>>> direct writes are happening.  This confused ceph a little.
>> 
>> how about make ceph_write_iter() hold inode_lock() for direct/sync write?
> 
> You could probably do that.  You could even use inode_lock_shared() to
> allow multiple writers to perform O_DIRECT writes to the same file in
> parallel.
> 
> But I don't know why ceph did this differently from every other
> filesystem, and the git commit message doesn't shine any light on that
> question.  I didn't want to propose a change that I didn't understand
> the consequences of.
> 
> If you did extend the locking over ceph_direct_read_write(), it would
> probably make sense to remove the truncate_inode_pages_range() call as
> it should be redundant.

both patches applied

Regards
Yan, Zheng


> 
> Thanks,
> NeilBrown
> 
> 
>> 
>> Regards
>> Yan, Zheng
>>> 
>>> The following two patches allow ceph to handle this possibility a
>>> little more cleanly.  The more important patch removes a WARN_ON() for
>>> a circumstance which can easily be triggered.
>>> 
>>> Thanks,
>>> NeilBrown


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2016-09-01 14:31 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-31  2:56 [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes NeilBrown
2016-08-31  2:58 ` [PATCH 1/2] cephfs: ignore error from invalidate_inode_pages2_range() in direct write NeilBrown
2016-08-31 13:47   ` Jeff Layton
2016-09-01  1:01     ` NeilBrown
2016-08-31  2:59 ` [PATCH 2/2] cephfs: remove warning when ceph_releasepage() is called on dirty page NeilBrown
2016-08-31 13:48   ` Jeff Layton
2016-08-31  9:19 ` [PATCH 0/2]: CEPHFS: allow for races between direct and buffered writes Yan, Zheng
2016-09-01  1:13   ` NeilBrown
2016-09-01 14:30     ` Yan, Zheng

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.