All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] xfs: fix COW writeback race
@ 2017-01-17  7:48 Christoph Hellwig
  2017-01-17 13:44 ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2017-01-17  7:48 UTC (permalink / raw)
  To: linux-xfs; +Cc: darrick.wong

Due to the way how xfs_iomap_write_allocate tries to convert the whole
found extents from delalloc to real space we can run into a race
condition with multiple threads doing writes to this same extent.
For the non-COW case that is harmless as the only thing that can happen
is that we call xfs_bmapi_write on an extent that has already been
converted to a real allocation.  For COW writes where we move the extent
from the COW to the data fork after I/O completion the race is, however,
not quite as harmless.  In the worst case we are now calling
xfs_bmapi_write on a region that contains hole in the COW work, which
will trip up an assert in debug builds or lead to file system corruption
in non-debug builds.  This seems to be reproducible with workloads of
small O_DSYNC write, although so far I've not managed to come up with
a with an isolated reproducer.

The fix for the issue is relatively simple:  tell xfs_bmapi_write
that we are only asked to convert delayed allocations and skip holes
in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_bmap.c | 41 +++++++++++++++++++++++++++++------------
 fs/xfs/libxfs/xfs_bmap.h |  6 +++++-
 fs/xfs/xfs_iomap.c       |  2 +-
 3 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 44773c9..ceae779 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4514,8 +4514,6 @@ xfs_bmapi_write(
 	int			n;		/* current extent index */
 	xfs_fileoff_t		obno;		/* old block number (offset) */
 	int			whichfork;	/* data or attr fork */
-	char			inhole;		/* current location is hole in file */
-	char			wasdelay;	/* old extent was delayed */
 
 #ifdef DEBUG
 	xfs_fileoff_t		orig_bno;	/* original block number value */
@@ -4603,22 +4601,41 @@ xfs_bmapi_write(
 	bma.firstblock = firstblock;
 
 	while (bno < end && n < *nmap) {
-		inhole = eof || bma.got.br_startoff > bno;
-		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
+		bool			need_alloc = false, wasdelay = false;
 
-		/*
-		 * Make sure we only reflink into a hole.
-		 */
-		if (flags & XFS_BMAPI_REMAP)
-			ASSERT(inhole);
-		if (flags & XFS_BMAPI_COWFORK)
-			ASSERT(!inhole);
+		/* in hole or beyoned EOF? */
+		if (eof || bma.got.br_startoff > bno) {
+			if (flags & XFS_BMAPI_DELALLOC) {
+				/*
+				 * For the COW fork we can reasonably get a
+				 * request for converting an extent that races
+				 * with other threads already having converted
+				 * part of it, as there converting COW to
+				 * regular blocks is not protected using the
+				 * IOLOCK.
+				 */
+				ASSERT(flags & XFS_BMAPI_COWFORK);
+				if (!(flags & XFS_BMAPI_COWFORK)) {
+					error = -EIO;
+					goto error0;
+				}
+			} else {
+				need_alloc = true;
+			}
+		} else {
+			/*
+			 * Make sure we only reflink into a hole.
+			 */
+			ASSERT(!(flags & XFS_BMAPI_REMAP));
+			if (isnullstartblock(bma.got.br_startblock))
+				wasdelay = true;
+		}
 
 		/*
 		 * First, deal with the hole before the allocated space
 		 * that we found, if any.
 		 */
-		if (inhole || wasdelay) {
+		if (need_alloc || wasdelay) {
 			bma.eof = eof;
 			bma.conv = !!(flags & XFS_BMAPI_CONVERT);
 			bma.wasdel = wasdelay;
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index cecd094..cdef87d 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -110,6 +110,9 @@ struct xfs_extent_free_item
 /* Map something in the CoW fork. */
 #define XFS_BMAPI_COWFORK	0x200
 
+/* Only convert delalloc space, don't allocate entirely new extents */
+#define XFS_BMAPI_DELALLOC	0x400
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -120,7 +123,8 @@ struct xfs_extent_free_item
 	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
 	{ XFS_BMAPI_ZERO,	"ZERO" }, \
 	{ XFS_BMAPI_REMAP,	"REMAP" }, \
-	{ XFS_BMAPI_COWFORK,	"COWFORK" }
+	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
+	{ XFS_BMAPI_DELALLOC,	"DELALLOC" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ca137b7..7ee8629 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -795,7 +795,7 @@ xfs_iomap_write_allocate(
 	xfs_trans_t	*tp;
 	int		nimaps;
 	int		error = 0;
-	int		flags = 0;
+	int		flags = XFS_BMAPI_DELALLOC;
 	int		nres;
 
 	if (whichfork == XFS_COW_FORK)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: fix COW writeback race
  2017-01-17  7:48 [PATCH v2] xfs: fix COW writeback race Christoph Hellwig
@ 2017-01-17 13:44 ` Brian Foster
  2017-01-17 14:37   ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2017-01-17 13:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, darrick.wong

On Tue, Jan 17, 2017 at 08:48:51AM +0100, Christoph Hellwig wrote:
> Due to the way how xfs_iomap_write_allocate tries to convert the whole
> found extents from delalloc to real space we can run into a race
> condition with multiple threads doing writes to this same extent.
> For the non-COW case that is harmless as the only thing that can happen
> is that we call xfs_bmapi_write on an extent that has already been
> converted to a real allocation.  For COW writes where we move the extent
> from the COW to the data fork after I/O completion the race is, however,
> not quite as harmless.  In the worst case we are now calling
> xfs_bmapi_write on a region that contains hole in the COW work, which
> will trip up an assert in debug builds or lead to file system corruption
> in non-debug builds.  This seems to be reproducible with workloads of
> small O_DSYNC write, although so far I've not managed to come up with
> a with an isolated reproducer.
> 

Any reason we don't try to address the core race rather than shake up
the affected code to accommodate it? I ask for a couple reasons: 1.) I'm
not quite following the specific race from the description and 2.) I
considered doing the exact same thing at first for the eofblocks i_size
issue, but more digging rooted out the problem in the eofblocks code.
This one may not be as straightforward a fix, of course... (but if not,
the commit log should probably explain why).

> The fix for the issue is relatively simple:  tell xfs_bmapi_write
> that we are only asked to convert delayed allocations and skip holes
> in that case.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 41 +++++++++++++++++++++++++++++------------
>  fs/xfs/libxfs/xfs_bmap.h |  6 +++++-
>  fs/xfs/xfs_iomap.c       |  2 +-
>  3 files changed, 35 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 44773c9..ceae779 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4514,8 +4514,6 @@ xfs_bmapi_write(
>  	int			n;		/* current extent index */
>  	xfs_fileoff_t		obno;		/* old block number (offset) */
>  	int			whichfork;	/* data or attr fork */
> -	char			inhole;		/* current location is hole in file */
> -	char			wasdelay;	/* old extent was delayed */
>  
>  #ifdef DEBUG
>  	xfs_fileoff_t		orig_bno;	/* original block number value */
> @@ -4603,22 +4601,41 @@ xfs_bmapi_write(
>  	bma.firstblock = firstblock;
>  
>  	while (bno < end && n < *nmap) {
> -		inhole = eof || bma.got.br_startoff > bno;
> -		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
> +		bool			need_alloc = false, wasdelay = false;
>  
> -		/*
> -		 * Make sure we only reflink into a hole.
> -		 */
> -		if (flags & XFS_BMAPI_REMAP)
> -			ASSERT(inhole);
> -		if (flags & XFS_BMAPI_COWFORK)
> -			ASSERT(!inhole);
> +		/* in hole or beyoned EOF? */
> +		if (eof || bma.got.br_startoff > bno) {
> +			if (flags & XFS_BMAPI_DELALLOC) {
> +				/*
> +				 * For the COW fork we can reasonably get a
> +				 * request for converting an extent that races
> +				 * with other threads already having converted
> +				 * part of it, as there converting COW to
> +				 * regular blocks is not protected using the
> +				 * IOLOCK.
> +				 */
> +				ASSERT(flags & XFS_BMAPI_COWFORK);
> +				if (!(flags & XFS_BMAPI_COWFORK)) {
> +					error = -EIO;
> +					goto error0;
> +				}

What happens in this case if eof is true? It looks like got could be
bogus, yet we still carry on using it in the post-allocation part of the
loop. The fact that the allocation code breaks out of the loop if
allocation doesn't occur is a bit of a red flag that the post-allocation
code may very well expect to always have an allocated mapping.

That aside... if we do want to do something like this, I wonder whether
it's more cleanly handled by the caller.

Brian

> +			} else {
> +				need_alloc = true;
> +			}
> +		} else {
> +			/*
> +			 * Make sure we only reflink into a hole.
> +			 */
> +			ASSERT(!(flags & XFS_BMAPI_REMAP));
> +			if (isnullstartblock(bma.got.br_startblock))
> +				wasdelay = true;
> +		}
>  
>  		/*
>  		 * First, deal with the hole before the allocated space
>  		 * that we found, if any.
>  		 */
> -		if (inhole || wasdelay) {
> +		if (need_alloc || wasdelay) {
>  			bma.eof = eof;
>  			bma.conv = !!(flags & XFS_BMAPI_CONVERT);
>  			bma.wasdel = wasdelay;
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index cecd094..cdef87d 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -110,6 +110,9 @@ struct xfs_extent_free_item
>  /* Map something in the CoW fork. */
>  #define XFS_BMAPI_COWFORK	0x200
>  
> +/* Only convert delalloc space, don't allocate entirely new extents */
> +#define XFS_BMAPI_DELALLOC	0x400
> +
>  #define XFS_BMAPI_FLAGS \
>  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
>  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> @@ -120,7 +123,8 @@ struct xfs_extent_free_item
>  	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
>  	{ XFS_BMAPI_ZERO,	"ZERO" }, \
>  	{ XFS_BMAPI_REMAP,	"REMAP" }, \
> -	{ XFS_BMAPI_COWFORK,	"COWFORK" }
> +	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
> +	{ XFS_BMAPI_DELALLOC,	"DELALLOC" }
>  
>  
>  static inline int xfs_bmapi_aflag(int w)
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ca137b7..7ee8629 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -795,7 +795,7 @@ xfs_iomap_write_allocate(
>  	xfs_trans_t	*tp;
>  	int		nimaps;
>  	int		error = 0;
> -	int		flags = 0;
> +	int		flags = XFS_BMAPI_DELALLOC;
>  	int		nres;
>  
>  	if (whichfork == XFS_COW_FORK)
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: fix COW writeback race
  2017-01-17 13:44 ` Brian Foster
@ 2017-01-17 14:37   ` Christoph Hellwig
  2017-01-17 17:14     ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2017-01-17 14:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Tue, Jan 17, 2017 at 08:44:21AM -0500, Brian Foster wrote:
> Any reason we don't try to address the core race rather than shake up
> the affected code to accommodate it?

I think there are two aspects to the whole thing.  One is the way
xfs_bmapi_write currently works is fundamentally wrong - if the
caller only needs a conversion from delalloc to real space trying
to allocate space is always wrong and we should catch it early.
The second is if we should do the eager conversion of the whole
found extent for either the data and/or the COW fork.

> I ask for a couple reasons: 1.) I'm
> not quite following the specific race from the description and 2.) I
> considered doing the exact same thing at first for the eofblocks i_size
> issue, but more digging rooted out the problem in the eofblocks code.
> This one may not be as straightforward a fix, of course... (but if not,
> the commit log should probably explain why).

My hope was that the long commit message explained the issue, but
I guess I need to go into even more details.

The writeback code (both COW and real) works like this

take ilock
read in an extent at offset O
drop ilock

if extent is delalloc:
  while !done with the whole extent
    take ilock
    convert partial extent to a real allocation
    drop ilock

But if multiple threads are doing writeback on pages next to
each other another thread might have already converted parts
of all of the extent found in the beginning to a real allocation.
That on it's own is not a problem because xfs_bmapi_write
handles a call to allocate an already real allocation as a no-op.
But with the COW code moving real extents from the COW fork to
the data fork after I/O completion this can become a problem,
because we now might not only have delalloc or real extents in
the area covered by extent returned from the inital xfs_bmapi_read
call, but also a hole in the COW work.  At which point it blows up.

As for why we're doing the eager conversion:  at least for the data
fork this was initentional to get better allocation patterns, I
remember a discussion with Dave on that a long time ago.  Maybe we
shouldn't do it for the COW for to avoid these sorts of races, but
then again without xfs_bmapi_write being stupid the race is harmless.

> What happens in this case if eof is true? It looks like got could be
> bogus, yet we still carry on using it in the post-allocation part of the
> loop.

For an initial EOF lookup it could indeed be bogus.  To properly
work it would need something like the trick xfs_bmapi_read uses
for this case.

> The fact that the allocation code breaks out of the loop if
> allocation doesn't occur is a bit of a red flag that the post-allocation
> code may very well expect to always have an allocated mapping.

The post-allocation cleanup code bust handle xfs_bmapi_allocate
returning an error before doing anything, and because of that it's
full of conditionals for everything that could or could not have
happened.

> That aside... if we do want to do something like this, I wonder whether
> it's more cleanly handled by the caller.

I don't see how it could be done in the caller - the caller wants
the bmap code to convert a delayed allocation and not allocate
entirely new blocks.  The best way to do so is to tell the bmapi
code not do allocate new blocks.  Now if you mean splitting up
xfs_bmapi_write into different functions for allocating real blocks,
converting delalloc blocks or just flipping the unwritten bit: that's
something I'd like to look into - the current interface is just
too confusing.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: fix COW writeback race
  2017-01-17 14:37   ` Christoph Hellwig
@ 2017-01-17 17:14     ` Brian Foster
  2017-01-17 18:39       ` Darrick J. Wong
                         ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Brian Foster @ 2017-01-17 17:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, darrick.wong

On Tue, Jan 17, 2017 at 03:37:21PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 17, 2017 at 08:44:21AM -0500, Brian Foster wrote:
> > Any reason we don't try to address the core race rather than shake up
> > the affected code to accommodate it?
> 
> I think there are two aspects to the whole thing.  One is the way
> xfs_bmapi_write currently works is fundamentally wrong - if the
> caller only needs a conversion from delalloc to real space trying
> to allocate space is always wrong and we should catch it early.
> The second is if we should do the eager conversion of the whole
> found extent for either the data and/or the COW fork.
> 

Makes sense. I agree that the former is probably the right thing to do,
it just seems more like an error check than a solution for a race. The
second is probably a bigger question, as I assume we do that to request
as large allocations as possible.

> > I ask for a couple reasons: 1.) I'm
> > not quite following the specific race from the description and 2.) I
> > considered doing the exact same thing at first for the eofblocks i_size
> > issue, but more digging rooted out the problem in the eofblocks code.
> > This one may not be as straightforward a fix, of course... (but if not,
> > the commit log should probably explain why).
> 
> My hope was that the long commit message explained the issue, but
> I guess I need to go into even more details.
> 
> The writeback code (both COW and real) works like this
> 
> take ilock
> read in an extent at offset O
> drop ilock
> 
> if extent is delalloc:
>   while !done with the whole extent
>     take ilock
>     convert partial extent to a real allocation
>     drop ilock
> 
> But if multiple threads are doing writeback on pages next to
> each other another thread might have already converted parts
> of all of the extent found in the beginning to a real allocation.
> That on it's own is not a problem because xfs_bmapi_write
> handles a call to allocate an already real allocation as a no-op.
> But with the COW code moving real extents from the COW fork to
> the data fork after I/O completion this can become a problem,
> because we now might not only have delalloc or real extents in
> the area covered by extent returned from the inital xfs_bmapi_read
> call, but also a hole in the COW work.  At which point it blows up.
> 

Got it, thanks. So all of the writeback stuff is protected via
page/buffer locks, and even if we still had those locks, it doesn't
matter because the same extent is obviously covered by many page/buffer
objects.

> As for why we're doing the eager conversion:  at least for the data
> fork this was initentional to get better allocation patterns, I
> remember a discussion with Dave on that a long time ago.  Maybe we
> shouldn't do it for the COW for to avoid these sorts of races, but
> then again without xfs_bmapi_write being stupid the race is harmless.
> 

Yeah, and doing otherwise may break the assumption that larger delallocs
produce larger physical allocs (re: cowextsz hint and potentially
preallocation).

> > What happens in this case if eof is true? It looks like got could be
> > bogus, yet we still carry on using it in the post-allocation part of the
> > loop.
> 
> For an initial EOF lookup it could indeed be bogus.  To properly
> work it would need something like the trick xfs_bmapi_read uses
> for this case.
> 

That seems reasonable so long as we skip the parts of the loop that are
expecting a real (non-hole) startblock.

> > The fact that the allocation code breaks out of the loop if
> > allocation doesn't occur is a bit of a red flag that the post-allocation
> > code may very well expect to always have an allocated mapping.
> 
> The post-allocation cleanup code bust handle xfs_bmapi_allocate
> returning an error before doing anything, and because of that it's
> full of conditionals for everything that could or could not have
> happened.
> 

We have the following in the need_alloc block:

                        error = xfs_bmapi_allocate(&bma);
                        if (error)
                                goto error0;
                        if (bma.blkno == NULLFSBLOCK)
                                break;

... which breaks out of the loop on error or allocation failure. The
first call after that block is xfs_bmapi_trim_map(), which uses got
without any consideration for holes that I can see.

> > That aside... if we do want to do something like this, I wonder whether
> > it's more cleanly handled by the caller.
> 
> I don't see how it could be done in the caller - the caller wants
> the bmap code to convert a delayed allocation and not allocate
> entirely new blocks.  The best way to do so is to tell the bmapi
> code not do allocate new blocks.  Now if you mean splitting up
> xfs_bmapi_write into different functions for allocating real blocks,
> converting delalloc blocks or just flipping the unwritten bit: that's
> something I'd like to look into - the current interface is just
> too confusing.

Things like the above had me thinking it might be more clear to
explicitly read the extent and check for delalloc in the caller while
under the appropriate lock (and if XFS_COW_FORK). That's kind of what I
was alluding to above wrt to closing the race. That's just an idea,
however, and doesn't necessarily improve the error handling in the way
that this patch does (to avoid the transaction overrun). Given that, I'm
not against what this patch is currently doing so long as we fix up the
rest of the loop. Your idea of xfs_bmapi_convert() or some such sounds
like a nice potential cleanup at some point too.

Brian

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: fix COW writeback race
  2017-01-17 17:14     ` Brian Foster
@ 2017-01-17 18:39       ` Darrick J. Wong
  2017-01-17 18:58         ` Brian Foster
  2017-01-18  8:45       ` Christoph Hellwig
  2017-01-18  8:49       ` Christoph Hellwig
  2 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2017-01-17 18:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

On Tue, Jan 17, 2017 at 12:14:51PM -0500, Brian Foster wrote:
> On Tue, Jan 17, 2017 at 03:37:21PM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 17, 2017 at 08:44:21AM -0500, Brian Foster wrote:
> > > Any reason we don't try to address the core race rather than shake up
> > > the affected code to accommodate it?
> > 
> > I think there are two aspects to the whole thing.  One is the way
> > xfs_bmapi_write currently works is fundamentally wrong - if the
> > caller only needs a conversion from delalloc to real space trying
> > to allocate space is always wrong and we should catch it early.
> > The second is if we should do the eager conversion of the whole
> > found extent for either the data and/or the COW fork.
> > 
> 
> Makes sense. I agree that the former is probably the right thing to do,
> it just seems more like an error check than a solution for a race. The
> second is probably a bigger question, as I assume we do that to request
> as large allocations as possible.

I'm under the impression that yes, we do #2 to maximize request sizes.
AFAICT this patch preserves that behavior and gets rid of the behavior
where cow delalloc conversion creates real extents where there
previously were holes.

> > > I ask for a couple reasons: 1.) I'm
> > > not quite following the specific race from the description and 2.) I
> > > considered doing the exact same thing at first for the eofblocks i_size
> > > issue, but more digging rooted out the problem in the eofblocks code.
> > > This one may not be as straightforward a fix, of course... (but if not,
> > > the commit log should probably explain why).
> > 
> > My hope was that the long commit message explained the issue, but
> > I guess I need to go into even more details.
> > 
> > The writeback code (both COW and real) works like this
> > 
> > take ilock
> > read in an extent at offset O
> > drop ilock
> > 
> > if extent is delalloc:
> >   while !done with the whole extent
> >     take ilock
> >     convert partial extent to a real allocation
> >     drop ilock
> > 
> > But if multiple threads are doing writeback on pages next to
> > each other another thread might have already converted parts
> > of all of the extent found in the beginning to a real allocation.
> > That on it's own is not a problem because xfs_bmapi_write
> > handles a call to allocate an already real allocation as a no-op.
> > But with the COW code moving real extents from the COW fork to
> > the data fork after I/O completion this can become a problem,
> > because we now might not only have delalloc or real extents in
> > the area covered by extent returned from the inital xfs_bmapi_read
> > call, but also a hole in the COW work.  At which point it blows up.
> > 
> 
> Got it, thanks. So all of the writeback stuff is protected via
> page/buffer locks, and even if we still had those locks, it doesn't
> matter because the same extent is obviously covered by many page/buffer
> objects.
> 
> > As for why we're doing the eager conversion:  at least for the data
> > fork this was initentional to get better allocation patterns, I
> > remember a discussion with Dave on that a long time ago.  Maybe we
> > shouldn't do it for the COW for to avoid these sorts of races, but
> > then again without xfs_bmapi_write being stupid the race is harmless.
> > 
> 
> Yeah, and doing otherwise may break the assumption that larger delallocs
> produce larger physical allocs (re: cowextsz hint and potentially
> preallocation).

Yep.

> > > What happens in this case if eof is true? It looks like got could be
> > > bogus, yet we still carry on using it in the post-allocation part of the
> > > loop.
> > 
> > For an initial EOF lookup it could indeed be bogus.  To properly
> > work it would need something like the trick xfs_bmapi_read uses
> > for this case.
> > 
> 
> That seems reasonable so long as we skip the parts of the loop that are
> expecting a real (non-hole) startblock.
> 
> > > The fact that the allocation code breaks out of the loop if
> > > allocation doesn't occur is a bit of a red flag that the post-allocation
> > > code may very well expect to always have an allocated mapping.
> > 
> > The post-allocation cleanup code bust handle xfs_bmapi_allocate
> > returning an error before doing anything, and because of that it's
> > full of conditionals for everything that could or could not have
> > happened.
> > 
> 
> We have the following in the need_alloc block:
> 
>                         error = xfs_bmapi_allocate(&bma);
>                         if (error)
>                                 goto error0;
>                         if (bma.blkno == NULLFSBLOCK)
>                                 break;
> 
> ... which breaks out of the loop on error or allocation failure. The
> first call after that block is xfs_bmapi_trim_map(), which uses got
> without any consideration for holes that I can see.

Wait, what?  The break gets us out of the while loop, not the
"if (inhole || wasdelay)" clause that precedes the _trim_map.

The while loop ends at "*nmap = n", correct?  So the NULLFSBLOCK case
shouldn't be calling _trim_map with uninitialized got.

> > > That aside... if we do want to do something like this, I wonder whether
> > > it's more cleanly handled by the caller.
> > 
> > I don't see how it could be done in the caller - the caller wants
> > the bmap code to convert a delayed allocation and not allocate
> > entirely new blocks.  The best way to do so is to tell the bmapi
> > code not do allocate new blocks.  Now if you mean splitting up
> > xfs_bmapi_write into different functions for allocating real blocks,
> > converting delalloc blocks or just flipping the unwritten bit: that's
> > something I'd like to look into - the current interface is just
> > too confusing.
> 
> Things like the above had me thinking it might be more clear to
> explicitly read the extent and check for delalloc in the caller while
> under the appropriate lock (and if XFS_COW_FORK). That's kind of what I
> was alluding to above wrt to closing the race. That's just an idea,
> however, and doesn't necessarily improve the error handling in the way
> that this patch does (to avoid the transaction overrun). Given that, I'm
> not against what this patch is currently doing so long as we fix up the
> rest of the loop. Your idea of xfs_bmapi_convert() or some such sounds
> like a nice potential cleanup at some point too.

I'd wondered when I was writing all the cow code if it would make the
code easier to understand if the _bmapi_write was split into different
frontend wrappers of the underlying implementation.  It's a little weird
that for a remapping you have to stuff the new physical block in a
_fsblock_t and pass that in as a pointer argument.

--D

> 
> Brian
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: fix COW writeback race
  2017-01-17 18:39       ` Darrick J. Wong
@ 2017-01-17 18:58         ` Brian Foster
  2017-01-17 20:02           ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2017-01-17 18:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Tue, Jan 17, 2017 at 10:39:00AM -0800, Darrick J. Wong wrote:
> On Tue, Jan 17, 2017 at 12:14:51PM -0500, Brian Foster wrote:
> > On Tue, Jan 17, 2017 at 03:37:21PM +0100, Christoph Hellwig wrote:
> > > On Tue, Jan 17, 2017 at 08:44:21AM -0500, Brian Foster wrote:
> > > > Any reason we don't try to address the core race rather than shake up
> > > > the affected code to accommodate it?
> > > 
> > > I think there are two aspects to the whole thing.  One is the way
> > > xfs_bmapi_write currently works is fundamentally wrong - if the
> > > caller only needs a conversion from delalloc to real space trying
> > > to allocate space is always wrong and we should catch it early.
> > > The second is if we should do the eager conversion of the whole
> > > found extent for either the data and/or the COW fork.
> > > 
> > 
> > Makes sense. I agree that the former is probably the right thing to do,
> > it just seems more like an error check than a solution for a race. The
> > second is probably a bigger question, as I assume we do that to request
> > as large allocations as possible.
> 
> I'm under the impression that yes, we do #2 to maximize request sizes.
> AFAICT this patch preserves that behavior and gets rid of the behavior
> where cow delalloc conversion creates real extents where there
> previously were holes.
> 

Yeah, Christoph is just pointing out how that behavior contributes to
the race. I'm not suggesting that this patch changes that. Rather, I'm
agreeing that we probably don't want to go the route of changing that to
address this issue.

> > > > I ask for a couple reasons: 1.) I'm
> > > > not quite following the specific race from the description and 2.) I
> > > > considered doing the exact same thing at first for the eofblocks i_size
> > > > issue, but more digging rooted out the problem in the eofblocks code.
> > > > This one may not be as straightforward a fix, of course... (but if not,
> > > > the commit log should probably explain why).
> > > 
> > > My hope was that the long commit message explained the issue, but
> > > I guess I need to go into even more details.
> > > 
> > > The writeback code (both COW and real) works like this
> > > 
> > > take ilock
> > > read in an extent at offset O
> > > drop ilock
> > > 
> > > if extent is delalloc:
> > >   while !done with the whole extent
> > >     take ilock
> > >     convert partial extent to a real allocation
> > >     drop ilock
> > > 
> > > But if multiple threads are doing writeback on pages next to
> > > each other another thread might have already converted parts
> > > of all of the extent found in the beginning to a real allocation.
> > > That on it's own is not a problem because xfs_bmapi_write
> > > handles a call to allocate an already real allocation as a no-op.
> > > But with the COW code moving real extents from the COW fork to
> > > the data fork after I/O completion this can become a problem,
> > > because we now might not only have delalloc or real extents in
> > > the area covered by extent returned from the inital xfs_bmapi_read
> > > call, but also a hole in the COW work.  At which point it blows up.
> > > 
> > 
> > Got it, thanks. So all of the writeback stuff is protected via
> > page/buffer locks, and even if we still had those locks, it doesn't
> > matter because the same extent is obviously covered by many page/buffer
> > objects.
> > 
> > > As for why we're doing the eager conversion:  at least for the data
> > > fork this was initentional to get better allocation patterns, I
> > > remember a discussion with Dave on that a long time ago.  Maybe we
> > > shouldn't do it for the COW for to avoid these sorts of races, but
> > > then again without xfs_bmapi_write being stupid the race is harmless.
> > > 
> > 
> > Yeah, and doing otherwise may break the assumption that larger delallocs
> > produce larger physical allocs (re: cowextsz hint and potentially
> > preallocation).
> 
> Yep.
> 
> > > > What happens in this case if eof is true? It looks like got could be
> > > > bogus, yet we still carry on using it in the post-allocation part of the
> > > > loop.
> > > 
> > > For an initial EOF lookup it could indeed be bogus.  To properly
> > > work it would need something like the trick xfs_bmapi_read uses
> > > for this case.
> > > 
> > 
> > That seems reasonable so long as we skip the parts of the loop that are
> > expecting a real (non-hole) startblock.
> > 
> > > > The fact that the allocation code breaks out of the loop if
> > > > allocation doesn't occur is a bit of a red flag that the post-allocation
> > > > code may very well expect to always have an allocated mapping.
> > > 
> > > The post-allocation cleanup code bust handle xfs_bmapi_allocate
> > > returning an error before doing anything, and because of that it's
> > > full of conditionals for everything that could or could not have
> > > happened.
> > > 
> > 
> > We have the following in the need_alloc block:
> > 
> >                         error = xfs_bmapi_allocate(&bma);
> >                         if (error)
> >                                 goto error0;
> >                         if (bma.blkno == NULLFSBLOCK)
> >                                 break;
> > 
> > ... which breaks out of the loop on error or allocation failure. The
> > first call after that block is xfs_bmapi_trim_map(), which uses got
> > without any consideration for holes that I can see.
> 
> Wait, what?  The break gets us out of the while loop, not the
> "if (inhole || wasdelay)" clause that precedes the _trim_map.
> 
> The while loop ends at "*nmap = n", correct?  So the NULLFSBLOCK case
> shouldn't be calling _trim_map with uninitialized got.
> 

Precisely. As it stands, this shouldn't happen. With this patch,
however, it is now possible to get down to _trim_map() with a completely
uninitialized got. E.g., consider the first iteration of the loop if eof
is true (which means got is invalid) and XFS_BMAPI_DELALLOC is set.

(I suspect the whole 'if (xyz) { do_error_checks() }' pattern somewhat
obfuscates this..)

Brian

> > > > That aside... if we do want to do something like this, I wonder whether
> > > > it's more cleanly handled by the caller.
> > > 
> > > I don't see how it could be done in the caller - the caller wants
> > > the bmap code to convert a delayed allocation and not allocate
> > > entirely new blocks.  The best way to do so is to tell the bmapi
> > > code not do allocate new blocks.  Now if you mean splitting up
> > > xfs_bmapi_write into different functions for allocating real blocks,
> > > converting delalloc blocks or just flipping the unwritten bit: that's
> > > something I'd like to look into - the current interface is just
> > > too confusing.
> > 
> > Things like the above had me thinking it might be more clear to
> > explicitly read the extent and check for delalloc in the caller while
> > under the appropriate lock (and if XFS_COW_FORK). That's kind of what I
> > was alluding to above wrt to closing the race. That's just an idea,
> > however, and doesn't necessarily improve the error handling in the way
> > that this patch does (to avoid the transaction overrun). Given that, I'm
> > not against what this patch is currently doing so long as we fix up the
> > rest of the loop. Your idea of xfs_bmapi_convert() or some such sounds
> > like a nice potential cleanup at some point too.
> 
> I'd wondered when I was writing all the cow code if it would make the
> code easier to understand if the _bmapi_write was split into different
> frontend wrappers of the underlying implementation.  It's a little weird
> that for a remapping you have to stuff the new physical block in a
> _fsblock_t and pass that in as a pointer argument.
> 
> --D
> 
> > 
> > Brian
> > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: fix COW writeback race
  2017-01-17 18:58         ` Brian Foster
@ 2017-01-17 20:02           ` Darrick J. Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2017-01-17 20:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

On Tue, Jan 17, 2017 at 01:58:34PM -0500, Brian Foster wrote:
> On Tue, Jan 17, 2017 at 10:39:00AM -0800, Darrick J. Wong wrote:
> > On Tue, Jan 17, 2017 at 12:14:51PM -0500, Brian Foster wrote:
> > > On Tue, Jan 17, 2017 at 03:37:21PM +0100, Christoph Hellwig wrote:
> > > > On Tue, Jan 17, 2017 at 08:44:21AM -0500, Brian Foster wrote:
> > > > > Any reason we don't try to address the core race rather than shake up
> > > > > the affected code to accommodate it?
> > > > 
> > > > I think there are two aspects to the whole thing.  One is the way
> > > > xfs_bmapi_write currently works is fundamentally wrong - if the
> > > > caller only needs a conversion from delalloc to real space trying
> > > > to allocate space is always wrong and we should catch it early.
> > > > The second is if we should do the eager conversion of the whole
> > > > found extent for either the data and/or the COW fork.
> > > > 
> > > 
> > > Makes sense. I agree that the former is probably the right thing to do,
> > > it just seems more like an error check than a solution for a race. The
> > > second is probably a bigger question, as I assume we do that to request
> > > as large allocations as possible.
> > 
> > I'm under the impression that yes, we do #2 to maximize request sizes.
> > AFAICT this patch preserves that behavior and gets rid of the behavior
> > where cow delalloc conversion creates real extents where there
> > previously were holes.
> > 
> 
> Yeah, Christoph is just pointing out how that behavior contributes to
> the race. I'm not suggesting that this patch changes that. Rather, I'm
> agreeing that we probably don't want to go the route of changing that to
> address this issue.

Oh.  So am I.  I think we're /all/ in agreement then. :)

> > > > > I ask for a couple reasons: 1.) I'm
> > > > > not quite following the specific race from the description and 2.) I
> > > > > considered doing the exact same thing at first for the eofblocks i_size
> > > > > issue, but more digging rooted out the problem in the eofblocks code.
> > > > > This one may not be as straightforward a fix, of course... (but if not,
> > > > > the commit log should probably explain why).
> > > > 
> > > > My hope was that the long commit message explained the issue, but
> > > > I guess I need to go into even more details.
> > > > 
> > > > The writeback code (both COW and real) works like this
> > > > 
> > > > take ilock
> > > > read in an extent at offset O
> > > > drop ilock
> > > > 
> > > > if extent is delalloc:
> > > >   while !done with the whole extent
> > > >     take ilock
> > > >     convert partial extent to a real allocation
> > > >     drop ilock
> > > > 
> > > > But if multiple threads are doing writeback on pages next to
> > > > each other another thread might have already converted parts
> > > > of all of the extent found in the beginning to a real allocation.
> > > > That on it's own is not a problem because xfs_bmapi_write
> > > > handles a call to allocate an already real allocation as a no-op.
> > > > But with the COW code moving real extents from the COW fork to
> > > > the data fork after I/O completion this can become a problem,
> > > > because we now might not only have delalloc or real extents in
> > > > the area covered by extent returned from the inital xfs_bmapi_read
> > > > call, but also a hole in the COW work.  At which point it blows up.
> > > > 
> > > 
> > > Got it, thanks. So all of the writeback stuff is protected via
> > > page/buffer locks, and even if we still had those locks, it doesn't
> > > matter because the same extent is obviously covered by many page/buffer
> > > objects.
> > > 
> > > > As for why we're doing the eager conversion:  at least for the data
> > > > fork this was initentional to get better allocation patterns, I
> > > > remember a discussion with Dave on that a long time ago.  Maybe we
> > > > shouldn't do it for the COW for to avoid these sorts of races, but
> > > > then again without xfs_bmapi_write being stupid the race is harmless.
> > > > 
> > > 
> > > Yeah, and doing otherwise may break the assumption that larger delallocs
> > > produce larger physical allocs (re: cowextsz hint and potentially
> > > preallocation).
> > 
> > Yep.
> > 
> > > > > What happens in this case if eof is true? It looks like got could be
> > > > > bogus, yet we still carry on using it in the post-allocation part of the
> > > > > loop.
> > > > 
> > > > For an initial EOF lookup it could indeed be bogus.  To properly
> > > > work it would need something like the trick xfs_bmapi_read uses
> > > > for this case.
> > > > 
> > > 
> > > That seems reasonable so long as we skip the parts of the loop that are
> > > expecting a real (non-hole) startblock.
> > > 
> > > > > The fact that the allocation code breaks out of the loop if
> > > > > allocation doesn't occur is a bit of a red flag that the post-allocation
> > > > > code may very well expect to always have an allocated mapping.
> > > > 
> > > > The post-allocation cleanup code bust handle xfs_bmapi_allocate
> > > > returning an error before doing anything, and because of that it's
> > > > full of conditionals for everything that could or could not have
> > > > happened.
> > > > 
> > > 
> > > We have the following in the need_alloc block:
> > > 
> > >                         error = xfs_bmapi_allocate(&bma);
> > >                         if (error)
> > >                                 goto error0;
> > >                         if (bma.blkno == NULLFSBLOCK)
> > >                                 break;
> > > 
> > > ... which breaks out of the loop on error or allocation failure. The
> > > first call after that block is xfs_bmapi_trim_map(), which uses got
> > > without any consideration for holes that I can see.
> > 
> > Wait, what?  The break gets us out of the while loop, not the
> > "if (inhole || wasdelay)" clause that precedes the _trim_map.
> > 
> > The while loop ends at "*nmap = n", correct?  So the NULLFSBLOCK case
> > shouldn't be calling _trim_map with uninitialized got.
> > 
> 
> Precisely. As it stands, this shouldn't happen. With this patch,
> however, it is now possible to get down to _trim_map() with a completely
> uninitialized got. E.g., consider the first iteration of the loop if eof
> is true (which means got is invalid) and XFS_BMAPI_DELALLOC is set.
> 
> (I suspect the whole 'if (xyz) { do_error_checks() }' pattern somewhat
> obfuscates this..)

Aha, yes, I see it now.  Thanks for having a sharper pair of eyes! :)

--D

> 
> Brian
> 
> > > > > That aside... if we do want to do something like this, I wonder whether
> > > > > it's more cleanly handled by the caller.
> > > > 
> > > > I don't see how it could be done in the caller - the caller wants
> > > > the bmap code to convert a delayed allocation and not allocate
> > > > entirely new blocks.  The best way to do so is to tell the bmapi
> > > > code not do allocate new blocks.  Now if you mean splitting up
> > > > xfs_bmapi_write into different functions for allocating real blocks,
> > > > converting delalloc blocks or just flipping the unwritten bit: that's
> > > > something I'd like to look into - the current interface is just
> > > > too confusing.
> > > 
> > > Things like the above had me thinking it might be more clear to
> > > explicitly read the extent and check for delalloc in the caller while
> > > under the appropriate lock (and if XFS_COW_FORK). That's kind of what I
> > > was alluding to above wrt to closing the race. That's just an idea,
> > > however, and doesn't necessarily improve the error handling in the way
> > > that this patch does (to avoid the transaction overrun). Given that, I'm
> > > not against what this patch is currently doing so long as we fix up the
> > > rest of the loop. Your idea of xfs_bmapi_convert() or some such sounds
> > > like a nice potential cleanup at some point too.
> > 
> > I'd wondered when I was writing all the cow code if it would make the
> > code easier to understand if the _bmapi_write was split into different
> > frontend wrappers of the underlying implementation.  It's a little weird
> > that for a remapping you have to stuff the new physical block in a
> > _fsblock_t and pass that in as a pointer argument.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: fix COW writeback race
  2017-01-17 17:14     ` Brian Foster
  2017-01-17 18:39       ` Darrick J. Wong
@ 2017-01-18  8:45       ` Christoph Hellwig
  2017-01-18  8:49       ` Christoph Hellwig
  2 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2017-01-18  8:45 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Tue, Jan 17, 2017 at 12:14:51PM -0500, Brian Foster wrote:
> Got it, thanks. So all of the writeback stuff is protected via
> page/buffer locks, and even if we still had those locks, it doesn't
> matter because the same extent is obviously covered by many page/buffer
> objects.

Yes.

> Yeah, and doing otherwise may break the assumption that larger delallocs
> produce larger physical allocs (re: cowextsz hint and potentially
> preallocation).

Yes - especially for the COW case this might be very important.

> That seems reasonable so long as we skip the parts of the loop that are
> expecting a real (non-hole) startblock.

Agreed.

> Things like the above had me thinking it might be more clear to
> explicitly read the extent and check for delalloc in the caller while
> under the appropriate lock (and if XFS_COW_FORK). That's kind of what I
> was alluding to above wrt to closing the race. That's just an idea,
> however, and doesn't necessarily improve the error handling in the way
> that this patch does (to avoid the transaction overrun). Given that, I'm
> not against what this patch is currently doing so long as we fix up the
> rest of the loop. Your idea of xfs_bmapi_convert() or some such sounds
> like a nice potential cleanup at some point too.

I don't like that idea because it just means even more extent lookups.
xfs_bmapi_write has to read in the extents anyway, so instead of doing
another read under the same lock we'd better reuse this one.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: fix COW writeback race
  2017-01-17 17:14     ` Brian Foster
  2017-01-17 18:39       ` Darrick J. Wong
  2017-01-18  8:45       ` Christoph Hellwig
@ 2017-01-18  8:49       ` Christoph Hellwig
  2 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2017-01-18  8:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

It seems like this relative diff is all that's needed.  I'll run it
through the reproducer and will resend later.


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-01-18  8:53 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-17  7:48 [PATCH v2] xfs: fix COW writeback race Christoph Hellwig
2017-01-17 13:44 ` Brian Foster
2017-01-17 14:37   ` Christoph Hellwig
2017-01-17 17:14     ` Brian Foster
2017-01-17 18:39       ` Darrick J. Wong
2017-01-17 18:58         ` Brian Foster
2017-01-17 20:02           ` Darrick J. Wong
2017-01-18  8:45       ` Christoph Hellwig
2017-01-18  8:49       ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.