linux-kernel.vger.kernel.org archive mirror
* spurious -ENOSPC on XFS
@ 2009-01-12 11:14 Mikulas Patocka
  2009-01-12 15:11 ` Christoph Hellwig
  2009-01-13 21:49 ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: Mikulas Patocka @ 2009-01-12 11:14 UTC (permalink / raw)
  To: xfs; +Cc: linux-kernel

Hi

I discovered a bug in XFS in delayed allocation.

When you take a small partition (52MB in my case) and copy many small
files onto it (source code) that barely fit there, you get -ENOSPC. Then
you sync the partition, some free space pops up, you click "retry" in MC
and the copy continues. Then you get -ENOSPC again, you must sync, click
"retry" and go on. And so on a few times until the source code finally
fits on the XFS partition.
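
For reference, here is a minimal userspace sketch of the copy-and-retry
loop I am describing (the mount point, file count and file size are only
illustrative, not the exact workload I used):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write many small files; on ENOSPC, sync() and retry, mimicking the
 * manual "sync + retry in MC" cycle.  /mnt/xfs-small is a made-up path.
 */
int main(void)
{
	char buf[4096], path[256];
	int i, fd, retries;

	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < 10000; i++) {
		snprintf(path, sizeof(path), "/mnt/xfs-small/f%05d", i);
		for (retries = 0; retries < 10; retries++) {
			fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
			if (fd >= 0 && write(fd, buf, sizeof(buf)) == sizeof(buf)) {
				close(fd);
				break;			/* this file is done */
			}
			if (errno != ENOSPC) {
				perror(path);
				return 1;
			}
			if (fd >= 0)
				close(fd);
			printf("ENOSPC at file %d, syncing and retrying\n", i);
			sync();				/* free space "pops up" here */
		}
	}
	return 0;
}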

This misbehavior is apparently caused by delayed allocation: delayed
allocation does not know exactly how much space will be occupied by the
data, so it makes an upper-bound guess. Because the free space count is
only a guess, not the actual space consumed by the data, XFS should not
return -ENOSPC based on it. When the free space runs out, XFS should sync
itself, retry the allocation and only return -ENOSPC if it fails the
second time, after the sync.

Mikulas


* Re: spurious -ENOSPC on XFS
  2009-01-12 11:14 spurious -ENOSPC on XFS Mikulas Patocka
@ 2009-01-12 15:11 ` Christoph Hellwig
  2009-01-13  5:58   ` Lachlan McIlroy
  2009-01-13 21:49 ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2009-01-12 15:11 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: xfs, linux-kernel

On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
> Hi
> 
> I discovered a bug in XFS in delayed allocation.
> 
> When you take a small partition (52MB in my case) and copy many small
> files onto it (source code) that barely fit there, you get -ENOSPC. Then
> you sync the partition, some free space pops up, you click "retry" in MC
> and the copy continues. Then you get -ENOSPC again, you must sync, click
> "retry" and go on. And so on a few times until the source code finally
> fits on the XFS partition.
> 
> This misbehavior is apparently caused by delayed allocation: delayed
> allocation does not know exactly how much space will be occupied by the
> data, so it makes an upper-bound guess. Because the free space count is
> only a guess, not the actual space consumed by the data, XFS should not
> return -ENOSPC based on it. When the free space runs out, XFS should sync
> itself, retry the allocation and only return -ENOSPC if it fails the
> second time, after the sync.

This looks a lot like: http://oss.sgi.com/bugzilla/show_bug.cgi?id=724
It's on my short-term todo list to turn the testcase in that entry
into a proper xfsqa testcase and followup on the investigation by
Dave and Eric.


* Re: spurious -ENOSPC on XFS
  2009-01-12 15:11 ` Christoph Hellwig
@ 2009-01-13  5:58   ` Lachlan McIlroy
  2009-01-14 22:16     ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Lachlan McIlroy @ 2009-01-13  5:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Mikulas Patocka, linux-kernel, xfs

Christoph Hellwig wrote:
> On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
>> Hi
>>
>> I discovered a bug in XFS in delayed allocation.
>>
>> When you take a small partition (52MB in my case) and copy many small
>> files onto it (source code) that barely fit there, you get -ENOSPC. Then
>> you sync the partition, some free space pops up, you click "retry" in MC
>> and the copy continues. Then you get -ENOSPC again, you must sync, click
>> "retry" and go on. And so on a few times until the source code finally
>> fits on the XFS partition.
>>
>> This misbehavior is apparently caused by delayed allocation: delayed
>> allocation does not know exactly how much space will be occupied by the
>> data, so it makes an upper-bound guess. Because the free space count is
>> only a guess, not the actual space consumed by the data, XFS should not
>> return -ENOSPC based on it. When the free space runs out, XFS should sync
>> itself, retry the allocation and only return -ENOSPC if it fails the
>> second time, after the sync.
This sounds like a problem with speculative allocation - delayed allocations
beyond eof.  Even if we write a small file, say 4k, a 64k chunk of delayed
allocation will be credited to the file.  If we extend the file we use more
of that chunk instead of allocating more.  When the file is closed and the
last inode reference is dropped the excess delayed allocations are released
back to freespace - this must be happening during the sync.
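
A quick way to see that from userspace is to watch the free block count
around a small write and the subsequent close/sync. A rough sketch (the
mount point is only an example, and the exact amount of preallocation
depends on the filesystem and kernel):

#include <fcntl.h>
#include <stdio.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Print the free block count of the filesystem under an example path. */
static void show(const char *when)
{
	struct statvfs sv;

	if (statvfs("/mnt/xfs-small", &sv) == 0)
		printf("%-18s free blocks: %llu\n", when,
		       (unsigned long long)sv.f_bfree);
}

int main(void)
{
	char buf[4096] = { 0 };
	int fd;

	show("before write");
	fd = open("/mnt/xfs-small/testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0)
		return 1;
	if (write(fd, buf, sizeof(buf)) != sizeof(buf))
		return 1;
	fsync(fd);
	show("after write+fsync");	/* may drop by much more than 4k */
	close(fd);
	sync();
	show("after close+sync");	/* excess preallocation given back */
	return 0;
}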

> 
> This looks a lot like: http://oss.sgi.com/bugzilla/show_bug.cgi?id=724
> It's on my short-term todo list to turn the testcase in that entry
> into a proper xfsqa testcase and followup on the investigation by
> Dave and Eric.



* Re: spurious -ENOSPC on XFS
  2009-01-12 11:14 spurious -ENOSPC on XFS Mikulas Patocka
  2009-01-12 15:11 ` Christoph Hellwig
@ 2009-01-13 21:49 ` Dave Chinner
  2009-01-14  4:28   ` Mikulas Patocka
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2009-01-13 21:49 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: xfs, linux-kernel

On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
> Hi
> 
> I discovered a bug in XFS in delayed allocation.
> 
> When you take a small partition (52MB in my case) and copy many small
> files onto it (source code) that barely fit there, you get -ENOSPC. Then
> you sync the partition, some free space pops up, you click "retry" in MC
> and the copy continues. Then you get -ENOSPC again, you must sync, click
> "retry" and go on. And so on a few times until the source code finally
> fits on the XFS partition.

Not a Bug. This is by design.

> This misbehavior is apparently caused by delayed allocation: delayed
> allocation does not know exactly how much space will be occupied by the
> data, so it makes an upper-bound guess.

No, we know *exactly* how much space is consumed by the data. What
we don't know is how much space will be required for additional
*metadata* to do the allocation, so we reserve the worst-case need so
that we should never get an ENOSPC during async writeback, when we
can't report the error to anyone.  The worst case is 4 metadata blocks
per allocation (delalloc extent, really).
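
To put rough numbers on it (purely illustrative, assuming 4k blocks): a 52MB
filesystem has on the order of 13,000 data blocks, so a few thousand small
files, each carrying a worst-case reservation of 4 extra metadata blocks, can
tie up a large fraction of the remaining free space, which is why ENOSPC can
show up well before the data itself fills the filesystem.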

If we ENOSPC in the delalloc path, we have two choices:

	1. potentially lock the system up due to OOM and being
	   unable to flush pages
	2. throw away user data without being able to report an
	   error to the application that wrote it originally.

Personally, I don't like either option, so premature ENOSPC at
write() time is fine by me....

> Because the free space count is
> only a guess, not the actual space consumed by the data, XFS should not
> return -ENOSPC based on it. When the free space runs out, XFS should sync
> itself, retry the allocation and only return -ENOSPC if it fails the
> second time, after the sync.

It does, by graduated response (see xfs_iomap_write_delay() and
xfs_flush_space()):

	1. trigger async flush of the inode and retry
	2. retry again
	3. start a filesystem wide flush, wait 500ms and try again
	4. really ENOSPC now.

It could probably be improved but, quite frankly, XFS wasn't designed
for small filesystems so I don't think this is worth investing any
major effort in changing/fixing.
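
In pseudo-C, the shape of that ladder is roughly the following. This is a
simplified sketch only; the helper names are hypothetical stand-ins, not the
real xfs_iomap_write_delay()/xfs_flush_space() code:

#include <errno.h>

/* Hypothetical stand-ins for the real XFS reservation/flush primitives. */
static int try_delalloc_reserve(void);		/* 0 on success, -ENOSPC otherwise */
static void flush_inode_async(void);		/* async flush of this inode */
static void flush_fs_and_wait_500ms(void);	/* filesystem-wide flush + delay */

/* Graduated response to a failed delayed-allocation reservation. */
static int reserve_with_flush_ladder(void)
{
	if (try_delalloc_reserve() == 0)
		return 0;

	flush_inode_async();			/* 1. flush the inode and retry */
	if (try_delalloc_reserve() == 0)
		return 0;

	if (try_delalloc_reserve() == 0)	/* 2. retry again */
		return 0;

	flush_fs_and_wait_500ms();		/* 3. global flush, wait, retry */
	if (try_delalloc_reserve() == 0)
		return 0;

	return -ENOSPC;				/* 4. really ENOSPC now */
}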

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: spurious -ENOSPC on XFS
  2009-01-13 21:49 ` Dave Chinner
@ 2009-01-14  4:28   ` Mikulas Patocka
  2009-01-18 17:31     ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2009-01-14  4:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-kernel

> > This misbehavior is apparently caused by delayed allocation: delayed
> > allocation does not know exactly how much space will be occupied by the
> > data, so it makes an upper-bound guess.
> 
> No, we know *exactly* how much space is consumed by the data. What
> we don't know is how much space will be required for additional
> *metadata* to do the allocation, so we reserve the worst-case need so
> that we should never get an ENOSPC during async writeback, when we
> can't report the error to anyone.  The worst case is 4 metadata blocks
> per allocation (delalloc extent, really).
> 
> If we ENOSPC in the delalloc path, we have two choices:
> 
> 	1. potentially lock the system up due to OOM and being
> 	   unable to flush pages
> 	2. throw away user data without being able to report an
> 	   error to the application that wrote it originally.
> 
> Personally, I don't like either option, so premature ENOSPC at
> write() time is fine by me....
> 
> > Because the free space count is
> > only a guess, not the actual space consumed by the data, XFS should not
> > return -ENOSPC based on it. When the free space runs out, XFS should sync
> > itself, retry the allocation and only return -ENOSPC if it fails the
> > second time, after the sync.
> 
> It does, by graduated response (see xfs_iomap_write_delay() and
> xfs_flush_space()):
> 
> 	1. trigger async flush of the inode and retry
> 	2. retry again
> 	3. start a filesystem wide flush, wait 500ms and try again

The result must not depend on magic timer values. If it does, you end up 
with undebuggable nondeterministic failures.

Why don't you change that 500ms wait to "wait until the flush finishes"? 
That would be correct.

> 	4. really ENOSPC now.
> 
> It could probably be improved but, quite frankly, XFS wasn't designed
> for small filesystems so I don't think this is worth investing any
> major effort in changing/fixing.
> 
> Cheers,
> 
> Dave.

Mikulas


* Re: spurious -ENOSPC on XFS
  2009-01-13  5:58   ` Lachlan McIlroy
@ 2009-01-14 22:16     ` Dave Chinner
  2009-01-15  0:57       ` Lachlan McIlroy
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2009-01-14 22:16 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: Christoph Hellwig, Mikulas Patocka, linux-kernel, xfs

On Tue, Jan 13, 2009 at 04:58:01PM +1100, Lachlan McIlroy wrote:
> Christoph Hellwig wrote:
>> On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
>>> Hi
>>>
>>> I discovered a bug in XFS in delayed allocation.
>>>
>>> When you take a small partition (52MB in my case) and copy many small
>>> files onto it (source code) that barely fit there, you get -ENOSPC.
>>> Then you sync the partition, some free space pops up, you click "retry"
>>> in MC and the copy continues. Then you get -ENOSPC again, you must sync,
>>> click "retry" and go on. And so on a few times until the source code
>>> finally fits on the XFS partition.
>>>
>>> This misbehavior is apparently caused by delayed allocation: delayed
>>> allocation does not know exactly how much space will be occupied by
>>> the data, so it makes an upper-bound guess. Because the free space count
>>> is only a guess, not the actual space consumed by the data, XFS should
>>> not return -ENOSPC based on it. When the free space runs out, XFS
>>> should sync itself, retry the allocation and only return -ENOSPC if it
>>> fails the second time, after the sync.
> This sounds like a problem with speculative allocation - delayed allocations
> beyond eof.  Even if we write a small file, say 4k, a 64k chunk of delayed
> allocation will be credited to the file. 

The second retry occurs without speculative EOF allocation. That's
what the BMAPI_SYNC flag does....

That being said, it can't truncate away pre-existing speculative
allocations on other files, which is why there is a global flush
and wait before the third retry.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: spurious -ENOSPC on XFS
  2009-01-14 22:16     ` Dave Chinner
@ 2009-01-15  0:57       ` Lachlan McIlroy
  2009-01-15  8:47         ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Lachlan McIlroy @ 2009-01-15  0:57 UTC (permalink / raw)
  To: Lachlan McIlroy, Christoph Hellwig, Mikulas Patocka, linux-kernel, xfs

Dave Chinner wrote:
> On Tue, Jan 13, 2009 at 04:58:01PM +1100, Lachlan McIlroy wrote:
>> Christoph Hellwig wrote:
>>> On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
>>>> Hi
>>>>
>>>> I discovered a bug in XFS in delayed allocation.
>>>>
>>>> When you take a small partition (52MB in my case) and copy many small
>>>> files onto it (source code) that barely fit there, you get -ENOSPC.
>>>> Then you sync the partition, some free space pops up, you click "retry"
>>>> in MC and the copy continues. Then you get -ENOSPC again, you must
>>>> sync, click "retry" and go on. And so on a few times until the source
>>>> code finally fits on the XFS partition.
>>>>
>>>> This misbehavior is apparently caused by delayed allocation: delayed
>>>> allocation does not know exactly how much space will be occupied by
>>>> the data, so it makes an upper-bound guess. Because the free space
>>>> count is only a guess, not the actual space consumed by the data, XFS
>>>> should not return -ENOSPC based on it. When the free space runs out,
>>>> XFS should sync itself, retry the allocation and only return -ENOSPC
>>>> if it fails the second time, after the sync.
>> This sounds like a problem with speculative allocation - delayed allocations
>> beyond eof.  Even if we write a small file, say 4k, a 64k chunk of delayed
>> allocation will be credited to the file. 
> 
> The second retry occurs without speculative EOF allocation. That's
> what the BMAPI_SYNC flag does....
By then it's too late.  There could already be many files with delayed
allocations beyond eof that are unnecessarily consuming space.  I suspect
that when those files are flushed some are not able to convert the full
delayed allocation in one extent and only convert what is needed to write
out data.  The remaining unused delayed allocation is released and that's
why the freespace is going up and down.

> 
> That being said, it can't truncate away pre-existing speculative
> allocations on other files, which is why there is a global flush
> and wait before the third retry.....
> 
> Cheers,
> 
> Dave.



* Re: spurious -ENOSPC on XFS
  2009-01-15  0:57       ` Lachlan McIlroy
@ 2009-01-15  8:47         ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2009-01-15  8:47 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: Christoph Hellwig, Mikulas Patocka, linux-kernel, xfs

On Thu, Jan 15, 2009 at 11:57:08AM +1100, Lachlan McIlroy wrote:
> Dave Chinner wrote:
> > On Tue, Jan 13, 2009 at 04:58:01PM +1100, Lachlan McIlroy wrote:
> >> Christoph Hellwig wrote:
> >>> On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
> >>>> Hi
> >>>>
> >>>> I discovered a bug in XFS in delayed allocation.
> >>>>
> >>>> When you take a small partition (52MB in my case) and copy many small
> >>>> files onto it (source code) that barely fit there, you get -ENOSPC.
> >>>> Then you sync the partition, some free space pops up, you click "retry"
> >>>> in MC and the copy continues. Then you get -ENOSPC again, you must
> >>>> sync, click "retry" and go on. And so on a few times until the source
> >>>> code finally fits on the XFS partition.
> >>>>
> >>>> This misbehavior is apparently caused by delayed allocation: delayed
> >>>> allocation does not know exactly how much space will be occupied by
> >>>> the data, so it makes an upper-bound guess. Because the free space
> >>>> count is only a guess, not the actual space consumed by the data, XFS
> >>>> should not return -ENOSPC based on it. When the free space runs out,
> >>>> XFS should sync itself, retry the allocation and only return -ENOSPC
> >>>> if it fails the second time, after the sync.
> >> This sounds like a problem with speculative allocation - delayed allocations
> >> beyond eof.  Even if we write a small file, say 4k, a 64k chunk of delayed
> >> allocation will be credited to the file. 
> > 
> > The second retry occurs without speculative EOF allocation. That's
> > what the BMAPI_SYNC flag does....
> By then it's too late.  There could already be many files with delayed
> allocations beyond eof that are unnecessarily consuming space.

Yes:

> > That being said, it can't truncate away pre-existing speculative
> > allocations on other files, which is why there is a global flush
> > and wait before the third retry.....

> I suspect
> that when those files are flushed some are not able to convert the full
> delayed allocation in one extent and only convert what is needed to write
> out data.  The remaining unused delayed allocation is released and that's
> why the freespace is going up and down.

It takes time to flush several thousand files. The 500ms
sleep hack is probably not a long enough delay when there are lots
of small files to flush that haven't had their speculative
allocations truncated yet.

I suspect that xfs_release() might not be truncating the speculative
EOF allocation on file close when the inode has no extents yet:

1190         if (ip->i_d.di_nlink != 0) {
1191                 if ((((ip->i_d.di_mode & S_IFMT) == S_IFREG) &&
1192                      ((ip->i_size > 0) || (VN_CACHED(VFS_I(ip)) > 0 ||
1193                        ip->i_delayed_blks > 0)) &&
1194   >>>>>>>>>          (ip->i_df.if_flags & XFS_IFEXTENTS))  &&
1195                     (!(ip->i_d.di_flags &
1196                                 (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))) {
1197                         error = xfs_free_eofblocks(mp, ip, XFS_FREE_EOF_LOCK);
1198                         if (error)
1199                                 return error;
1200                 }
1201         }

I think that check will prevent truncation of the speculative
allocation on new files that haven't been flushed or had allocation
done on them yet. That'll be the cause of the buildup, I think...

It's interesting that the only consideration for EOF truncation on
file close is when there are extents in the inode itself, not when in
btree format....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: spurious -ENOSPC on XFS
  2009-01-14  4:28   ` Mikulas Patocka
@ 2009-01-18 17:31     ` Christoph Hellwig
  2009-01-20 19:38       ` Mikulas Patocka
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2009-01-18 17:31 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Dave Chinner, xfs, linux-kernel

On Tue, Jan 13, 2009 at 11:28:58PM -0500, Mikulas Patocka wrote:
> The result must not depend on magic timer values. If it does, you end up 
> with undebuggable nondeterministic failures.
> 
> Why don't you change that 500ms wait to "wait until the flush finishes"? 
> That would be correct.

Yes, this probably would be better.  Could I motivate you to come up with
a patch for that?



* Re: spurious -ENOSPC on XFS
  2009-01-18 17:31     ` Christoph Hellwig
@ 2009-01-20 19:38       ` Mikulas Patocka
  2009-01-20 23:24         ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2009-01-20 19:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, xfs, linux-kernel



On Sun, 18 Jan 2009, Christoph Hellwig wrote:

> On Tue, Jan 13, 2009 at 11:28:58PM -0500, Mikulas Patocka wrote:
> > The result must not depend on magic timer values. If it does, you end up 
> > with undebuggable nondeterministic failures.
> > 
> > Why don't you change that 500ms wait to "wait until the flush finishes"? 
> > That would be correct.
> 
> Yes, this probably would be better.  Could I motivate you to come up with
> a patch for that?
> 

Hi

I looked at the source and found out that it uses sync_blockdev for 
syncing --- but sync_blockdev writes only metadata buffers, it doesn't 
touch inodes and pages and doesn't resolve delayed allocations. So it 
really doesn't sync anything.

I replaced it with correct syncing of all inodes. With this patch it 
passes my testcase (no more spurious -ENOSPCs), but it still isn't 
correct; there is still that 500ms delay --- if the machine was so overloaded
that it couldn't sync within 500ms, you still get a spurious -ENOSPC.

There are concerns about possible deadlocks (the syncer may lock against
the process that is waiting for the sync to finish); that's why removing
that 500ms delay isn't as easy as it seems. I don't have the XFS knowledge
to check for the deadlocks; it should be done by the XFS developers. Also,
when you resolve the deadlocks and drop the timeout, replace WB_SYNC_NONE
with WB_SYNC_ALL in this patch.

Mikulas

---

Properly sync when the filesystem runs out of space because of delayed
allocation overaccounting.

Note that sync_blockdev writes just dirty metadata buffers, it doesn't touch
inodes or data buffers --- so it had no effect.

WB_SYNC_ALL should be used instead of WB_SYNC_NONE, but WB_SYNC_ALL gets worse
results --- likely because the WB_SYNC_ALL sync locks against the process
doing the write, stalling the whole sync. This needs to be investigated further
by the XFS developers. Also, further investigation is needed to remove that
delay(msecs_to_jiffies(500)).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 fs/xfs/linux-2.6/xfs_sync.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6.29-rc1-devel/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- linux-2.6.29-rc1-devel.orig/fs/xfs/linux-2.6/xfs_sync.c	2009-01-20 20:13:32.000000000 +0100
+++ linux-2.6.29-rc1-devel/fs/xfs/linux-2.6/xfs_sync.c	2009-01-20 20:24:31.000000000 +0100
@@ -446,7 +446,13 @@ xfs_flush_device_work(
 	void		*arg)
 {
 	struct inode	*inode = arg;
-	sync_blockdev(mp->m_super->s_bdev);
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nr_to_write = LONG_MAX,
+	};
+	generic_sync_sb_inodes(mp->m_super, &wbc);
 	iput(inode);
 }
 


* Re: spurious -ENOSPC on XFS
  2009-01-20 19:38       ` Mikulas Patocka
@ 2009-01-20 23:24         ` Dave Chinner
  2009-01-22 20:59           ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2009-01-20 23:24 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Christoph Hellwig, xfs, linux-kernel

On Tue, Jan 20, 2009 at 02:38:27PM -0500, Mikulas Patocka wrote:
> 
> 
> On Sun, 18 Jan 2009, Christoph Hellwig wrote:
> 
> > On Tue, Jan 13, 2009 at 11:28:58PM -0500, Mikulas Patocka wrote:
> > > The result must not depend on magic timer values. If it does, you end up 
> > > with undebuggable nondeterministic failures.
> > > 
> > > Why don't you change that 500ms wait to "wait until the flush finishes"? 
> > > That would be correct.
> > 
> > Yes, this probably would be better.  Could I motivate you to come up with
> > a patch for that?
> > 
> 
> Hi
> 
> I looked at the source and found out that it uses sync_blockdev for 
> syncing --- but sync_blockdev writes only metadata buffers, it doesn't 
> touch inodes and pages and doesn't resolve delayed allocations. So it 
> really doesn't sync anything.

Ah, bugger. Thanks for finding this.

> I replaced it with correct syncing of all inodes. With this patch it 
> passes my testcase (no more spurious -ENOSPCs), but it still isn't 
> correct; there is still that 500ms delay --- if the machine was so overloaded
> that it couldn't sync within 500ms, you still get a spurious -ENOSPC.

That's VFS level data syncing - there may be other XFS level stuff
that can be done as well (e.g. cleanup/truncate of unlinked inodes)
that will release space.

> There are concerns about possible deadlocks (the syncer may lock against
> the process that is waiting for the sync to finish); that's why removing
> that 500ms delay isn't as easy as it seems. I don't have the XFS knowledge
> to check for the deadlocks; it should be done by the XFS developers. Also,
> when you resolve the deadlocks and drop the timeout, replace WB_SYNC_NONE
> with WB_SYNC_ALL in this patch.

Right, so you need to use internal xfs sync functions that don't
have these problems. That is:

	error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);

will do a blocking flush of all the inodes without deadlocks occurring.
Then you can remove the 500ms wait.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: spurious -ENOSPC on XFS
  2009-01-20 23:24         ` Dave Chinner
@ 2009-01-22 20:59           ` Christoph Hellwig
  2009-01-22 22:43             ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2009-01-22 20:59 UTC (permalink / raw)
  To: Mikulas Patocka, Christoph Hellwig, xfs, linux-kernel

On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> Right, so you need to use internal xfs sync functions that don't
> have these problems. That is:
> 
> 	error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
> 
> will do a blocking flush of all the inodes without deadlocks occurring.
> Then you can remove the 500ms wait.

I've given this a try with Eric's testcase from #724 in the oss bugzilla,
but it's not enough yet.  I think that's because SYNC_WAIT is rather
meaningless for data writeout, and we need SYNC_IOWAIT instead.  The
patch below gets the testcase working for me:


Index: linux-2.6/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_sync.c	2009-01-22 06:29:30.646141628 +0100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_sync.c	2009-01-22 21:57:53.073864570 +0100
@@ -446,7 +446,9 @@ xfs_flush_device_work(
 	void		*arg)
 {
 	struct inode	*inode = arg;
-	sync_blockdev(mp->m_super->s_bdev);
+
+	xfs_sync_inodes(mp, SYNC_DELWRI);
+	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);
 	iput(inode);
 }
 



* Re: spurious -ENOSPC on XFS
  2009-01-22 20:59           ` Christoph Hellwig
@ 2009-01-22 22:43             ` Christoph Hellwig
  2009-01-23 20:14               ` Mikulas Patocka
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2009-01-22 22:43 UTC (permalink / raw)
  To: Mikulas Patocka, Christoph Hellwig, xfs, linux-kernel

On Thu, Jan 22, 2009 at 03:59:13PM -0500, Christoph Hellwig wrote:
> On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> > Right, so you need to use internal xfs sync functions that don't
> > have these problems. That is:
> > 
> > 	error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
> > 
> > will do a blocking flush of all the inodes without deadlocks occurring.
> > Then you can remove the 500ms wait.
> 
> I've given this a try with Eric's testcase from #724 in the oss bugzilla,
> but it's not enough yet.  I think that's because SYNC_WAIT is rather
> meaningless for data writeout, and we need SYNC_IOWAIT instead.  The
> patch below gets the testcase working for me:

Actually I still see failures happening sometimes.  I guess that's because
our flush is still asynchronous due to the schedule_work...


* Re: spurious -ENOSPC on XFS
  2009-01-22 22:43             ` Christoph Hellwig
@ 2009-01-23 20:14               ` Mikulas Patocka
  2009-01-24  7:12                 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2009-01-23 20:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, linux-kernel



On Thu, 22 Jan 2009, Christoph Hellwig wrote:

> On Thu, Jan 22, 2009 at 03:59:13PM -0500, Christoph Hellwig wrote:
> > On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> > > Right, so you need to use internal xfs sync functions that don't
> > > have these problems. That is:
> > > 
> > > 	error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
> > > 
> > > will do a blocking flush of all the inodes without deadlocks occurring.
> > > Then you can remove the 500ms wait.
> > 
> > I've given this a try with Eric's testcase from #724 in the oss bugzilla,
> > but it's not enough yet.  I think that's because SYNC_WAIT is rather
> > meaningless for data writeout, and we need SYNC_IOWAIT instead.  The
> > patch below gets the testcase working for me:
> 
> Actually I still see failures happening sometimes.  I guess that's because
> our flush is still asynchronous due to the schedule_work...

If I placed
xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
directly to xfs_flush_device, I got lock dependency warning (though not a 
real deadlock).

So synchronous flushing definitely needs some thinking and lock 
rearchitecting. If you sync it asynchronously, a deadlock possibility will 
actually turn into spurious -ENOSPC (because the flushing thread will wait 
until the main thread sleeps for 500ms, drops the lock and returns
ENOSPC).

Mikulas

=============================================
[ INFO: possible recursive locking detected ]
2.6.29-rc1-devel #2
---------------------------------------------
mc/336 is trying to acquire lock:
 (&(&ip->i_iolock)->mr_lock){----}, at: [<0000000010186b88>] 
xfs_ilock+0x68/0xa0 [xfs]

but task is already holding lock:
 (&(&ip->i_iolock)->mr_lock){----}, at: [<0000000010186bb0>] 
xfs_ilock+0x90/0xa0 [xfs]

other info that might help us debug this:
2 locks held by mc/336:
 #0:  (&sb->s_type->i_mutex_key#8){--..}, at: [<00000000101b1560>] 
xfs_write+0x340/0x6a0 [xfs]
 #1:  (&(&ip->i_iolock)->mr_lock){----}, at: [<0000000010186bb0>] 
xfs_ilock+0x90/0xa0 [xfs]

stack backtrace:
Call Trace:
 [000000000047e924] validate_chain+0x1184/0x1460
 [000000000047ee88] __lock_acquire+0x288/0xae0
 [000000000047f734] lock_acquire+0x54/0x80
 [0000000000470b20] down_read_nested+0x40/0x60
 [0000000010186b88] xfs_ilock+0x68/0xa0 [xfs]
 [00000000101b4f78] xfs_sync_inodes_ag+0x218/0x320 [xfs]
 [00000000101b5100] xfs_sync_inodes+0x80/0x100 [xfs]
 [00000000101b51a0] xfs_flush_device+0x20/0x60 [xfs]
 [000000001018e654] xfs_flush_space+0x94/0x100 [xfs]
 [000000001018e95c] xfs_iomap_write_delay+0x13c/0x240 [xfs]
 [000000001018f100] xfs_iomap+0x280/0x300 [xfs]
 [00000000101a84d4] __xfs_get_blocks+0x74/0x260 [xfs]
 [00000000101a8724] xfs_get_blocks+0x24/0x40 [xfs]
 [00000000004e6800] __block_prepare_write+0x220/0x4e0
 [00000000004e6b68] block_write_begin+0x48/0x100
 [00000000101a785c] xfs_vm_write_begin+0x3c/0x60 [xfs]



* Re: spurious -ENOSPC on XFS
  2009-01-23 20:14               ` Mikulas Patocka
@ 2009-01-24  7:12                 ` Dave Chinner
  2009-01-29 16:39                   ` Mikulas Patocka
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2009-01-24  7:12 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Christoph Hellwig, xfs, linux-kernel

On Fri, Jan 23, 2009 at 03:14:30PM -0500, Mikulas Patocka wrote:
> 
> 
> On Thu, 22 Jan 2009, Christoph Hellwig wrote:
> 
> > On Thu, Jan 22, 2009 at 03:59:13PM -0500, Christoph Hellwig wrote:
> > > On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> > > > Right, so you need to use internal xfs sync functions that don't
> > > > have these problems. That is:
> > > > 
> > > > 	error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
> > > > 
> > > > will do a blocking flush of all the inodes without deadlocks occurring.
> > > > Then you can remove the 500ms wait.
> > > 
> > > I've given this a try with Eric's testcase from #724 in the oss bugzilla,
> > > but it's not enough yet.  I think that's because SYNC_WAIT is rather
> > > meaningless for data writeout, and we need SYNC_IOWAIT instead.  The
> > > patch below gets the testcase working for me:
> > 
> > Actually I still see failures happening sometimes.  I guess that's because
> > our flush is still asynchronous due to the schedule_work...
> 
> If I placed
> xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
> xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
> directly to xfs_flush_device, I got lock dependency warning (though not a 
> real deadlock).

Same reason memory reclaim gives lockdep warnings on XFS - we're
recursing into operations that take inode locks while we currently
hold an inode lock.  However, it shouldn't deadlock because
we should never try to take the iolock on the inode that we currently
hold it on.

> So synchronous flushing definitely needs some thinking and lock 
> rearchitecting.

No, not at all. At most the grabbing of the iolock in
xfs_sync_inodes_ag() needs to become a trylock....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: spurious -ENOSPC on XFS
  2009-01-24  7:12                 ` Dave Chinner
@ 2009-01-29 16:39                   ` Mikulas Patocka
  2009-01-29 16:45                     ` Mikulas Patocka
  2009-01-31 23:57                     ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: Mikulas Patocka @ 2009-01-29 16:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs, linux-kernel



On Sat, 24 Jan 2009, Dave Chinner wrote:

> On Fri, Jan 23, 2009 at 03:14:30PM -0500, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 22 Jan 2009, Christoph Hellwig wrote:
> > 
> > > On Thu, Jan 22, 2009 at 03:59:13PM -0500, Christoph Hellwig wrote:
> > > > On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> > > > > Right, so you need to use internal xfs sync functions that don't
> > > > > have these problems. That is:
> > > > > 
> > > > > 	error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
> > > > > 
> > > > > will do a blocking flush of all the inodes without deadlocks occurring.
> > > > > Then you can remove the 500ms wait.
> > > > 
> > > > I've given this a try with Eric's testcase from #724 in the oss bugzilla,
> > > > but it's not enough yet.  I think that's because SYNC_WAIT is rather
> > > > meaningless for data writeout, and we need SYNC_IOWAIT instead.  The
> > > > patch below gets the testcase working for me:
> > > 
> > > Actually I still see failures happening sometimes.  I guess that's because
> > > our flush is still asynchronous due to the schedule_work...
> > 
> > If I placed
> > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
> > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
> > directly to xfs_flush_device, I got lock dependency warning (though not a 
> > real deadlock).
> 
> Same reason memory reclaim gives lockdep warnings on XFS - we're
> recursing into operations that take inode locks while we currently
> hold an inode lock.  However, it shouldn't deadlock because
> we should never try to take the iolock on the inode that we currently
> hold it on.
> 
> > So synchronous flushing definitely needs some thinking and lock 
> > rearchitecting.
> 
> No, not at all. At most the grabbing of the iolock in
> xfs_sync_inodes_ag() needs to become a trylock....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

You are wrong, the comments in the code are right. It really deadlocks if 
it is modified to use synchronous wait for the end of the flush and if 
the flush is done with xfs_sync_inodes:

xfssyncd      D 0000000000606808     0  4819      2
Call Trace:
 [0000000000606788] rwsem_down_failed_common+0x1ac/0x1d8
 [0000000000606808] rwsem_down_read_failed+0x24/0x34
 [0000000000606848] __down_read+0x30/0x40
 [0000000000468a64] down_read_nested+0x48/0x58
 [00000000100e6cc8] xfs_ilock+0x48/0x8c [xfs]
 [000000001011018c] xfs_sync_inodes_ag+0x17c/0x2dc [xfs]
 [000000001011034c] xfs_sync_inodes+0x60/0xc4 [xfs]
 [00000000101103c4] xfs_flush_device_work+0x14/0x2c [xfs]
 [000000001010ff1c] xfssyncd+0x110/0x174 [xfs]
 [000000000046556c] kthread+0x54/0x88
 [000000000042b2a0] kernel_thread+0x3c/0x54
 [00000000004653b8] kthreadd+0xac/0x160
dd            D 00000000006045c4     0  4829   2500
Call Trace:
 [00000000006047d8] schedule_timeout+0x24/0xb8
 [00000000006045c4] wait_for_common+0xe4/0x14c
 [0000000000604700] wait_for_completion+0x1c/0x2c
 [00000000101106b0] xfs_flush_device+0x68/0x88 [xfs]
 [00000000100edba4] xfs_flush_space+0xa8/0xd0 [xfs]
 [00000000100edec0] xfs_iomap_write_delay+0x1bc/0x228 [xfs]
 [00000000100ee4b8] xfs_iomap+0x1c4/0x2e0 [xfs]
 [0000000010104f04] __xfs_get_blocks+0x74/0x240 [xfs]
 [000000001010512c] xfs_get_blocks+0x24/0x38 [xfs]
 [00000000004d05f0] __block_prepare_write+0x184/0x41c
 [00000000004d095c] block_write_begin+0x84/0xe8
 [000000001010447c] xfs_vm_write_begin+0x3c/0x50 [xfs]
 [0000000000485258] generic_file_buffered_write+0xe8/0x28c
 [000000001010cec4] xfs_write+0x40c/0x604 [xfs]
 [0000000010108d3c] xfs_file_aio_write+0x74/0x84 [xfs]
 [00000000004ae70c] do_sync_write+0x8c/0xdc

Mikulas


* Re: spurious -ENOSPC on XFS
  2009-01-29 16:39                   ` Mikulas Patocka
@ 2009-01-29 16:45                     ` Mikulas Patocka
  2009-01-31 23:57                     ` Dave Chinner
  1 sibling, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2009-01-29 16:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs, linux-kernel

> > > So synchronous flushing definitely needs some thinking and lock 
> > > rearchitecting.
> > 
> > No, not at all. At most the grabbing of the iolock in
> > xfs_sync_inodes_ag() needs to become a trylock....
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 
> You are wrong, the comments in the code are right. It really deadlocks if 
> it is modified to use synchronous wait for the end of the flush and if 
> the flush is done with xfs_sync_inodes:
> 
> xfssyncd      D 0000000000606808     0  4819      2
> Call Trace:
>  [0000000000606788] rwsem_down_failed_common+0x1ac/0x1d8
>  [0000000000606808] rwsem_down_read_failed+0x24/0x34
>  [0000000000606848] __down_read+0x30/0x40
>  [0000000000468a64] down_read_nested+0x48/0x58
>  [00000000100e6cc8] xfs_ilock+0x48/0x8c [xfs]
>  [000000001011018c] xfs_sync_inodes_ag+0x17c/0x2dc [xfs]
>  [000000001011034c] xfs_sync_inodes+0x60/0xc4 [xfs]
>  [00000000101103c4] xfs_flush_device_work+0x14/0x2c [xfs]
>  [000000001010ff1c] xfssyncd+0x110/0x174 [xfs]
>  [000000000046556c] kthread+0x54/0x88
>  [000000000042b2a0] kernel_thread+0x3c/0x54
>  [00000000004653b8] kthreadd+0xac/0x160
> dd            D 00000000006045c4     0  4829   2500
> Call Trace:
>  [00000000006047d8] schedule_timeout+0x24/0xb8
>  [00000000006045c4] wait_for_common+0xe4/0x14c
>  [0000000000604700] wait_for_completion+0x1c/0x2c
>  [00000000101106b0] xfs_flush_device+0x68/0x88 [xfs]
>  [00000000100edba4] xfs_flush_space+0xa8/0xd0 [xfs]
>  [00000000100edec0] xfs_iomap_write_delay+0x1bc/0x228 [xfs]
>  [00000000100ee4b8] xfs_iomap+0x1c4/0x2e0 [xfs]
>  [0000000010104f04] __xfs_get_blocks+0x74/0x240 [xfs]
>  [000000001010512c] xfs_get_blocks+0x24/0x38 [xfs]
>  [00000000004d05f0] __block_prepare_write+0x184/0x41c
>  [00000000004d095c] block_write_begin+0x84/0xe8
>  [000000001010447c] xfs_vm_write_begin+0x3c/0x50 [xfs]
>  [0000000000485258] generic_file_buffered_write+0xe8/0x28c
>  [000000001010cec4] xfs_write+0x40c/0x604 [xfs]
>  [0000000010108d3c] xfs_file_aio_write+0x74/0x84 [xfs]
>  [00000000004ae70c] do_sync_write+0x8c/0xdc
> 
> Mikulas

BTW, I modified it to use a completion to wait for the end of the sync work
(instead of that fixed 500ms wait) and got this deadlock quite quickly.

filemap_flush used to deadlock, so I disabled it and used only
xfs_sync_inodes, but xfs_sync_inodes also deadlocks.

Mikulas


* Re: spurious -ENOSPC on XFS
  2009-01-29 16:39                   ` Mikulas Patocka
  2009-01-29 16:45                     ` Mikulas Patocka
@ 2009-01-31 23:57                     ` Dave Chinner
  2009-02-02 17:36                       ` Mikulas Patocka
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2009-01-31 23:57 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Christoph Hellwig, xfs, linux-kernel

On Thu, Jan 29, 2009 at 11:39:00AM -0500, Mikulas Patocka wrote:
> On Sat, 24 Jan 2009, Dave Chinner wrote:
> > On Fri, Jan 23, 2009 at 03:14:30PM -0500, Mikulas Patocka wrote:
> > > If I placed
> > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
> > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
> > > directly to xfs_flush_device, I got lock dependency warning (though not a 
> > > real deadlock).
> > 
> > Same reason memory reclaim gives lockdep warnings on XFS - we're
> > recursing into operations that take inode locks while we currently
> > hold an inode lock.  However, it shouldn't deadlock because
> > we should never try to take the iolock on the inode that we currently
> > hold it on.
> > 
> > > So synchronous flushing definitely needs some thinking and lock 
> > > rearchitecting.
> > 
> > No, not at all. At most the grabbing of the iolock in
> > xfs_sync_inodes_ag() needs to become a trylock....
> 
> You are wrong, the comments in the code are right. It really
> deadlocks if it is modified to use synchronous wait for the end of
> the flush and if the flush is done with xfs_sync_inodes:
> 
> xfssyncd      D 0000000000606808     0  4819      2
> Call Trace:
>  [0000000000606788] rwsem_down_failed_common+0x1ac/0x1d8
>  [0000000000606808] rwsem_down_read_failed+0x24/0x34
>  [0000000000606848] __down_read+0x30/0x40
>  [0000000000468a64] down_read_nested+0x48/0x58
>  [00000000100e6cc8] xfs_ilock+0x48/0x8c [xfs]
>  [000000001011018c] xfs_sync_inodes_ag+0x17c/0x2dc [xfs]
>  [000000001011034c] xfs_sync_inodes+0x60/0xc4 [xfs]
>  [00000000101103c4] xfs_flush_device_work+0x14/0x2c [xfs]
>  [000000001010ff1c] xfssyncd+0x110/0x174 [xfs]
>  [000000000046556c] kthread+0x54/0x88
>  [000000000042b2a0] kernel_thread+0x3c/0x54
>  [00000000004653b8] kthreadd+0xac/0x160

So it is stuck:

127                 /*
128                  * If we have to flush data or wait for I/O completion
129                  * we need to hold the iolock.
130                  */
131                 if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
132     >>>>>>>>            xfs_ilock(ip, XFS_IOLOCK_SHARED);
133                         lock_flags |= XFS_IOLOCK_SHARED;
134                         error = xfs_flush_pages(ip, 0, -1, fflag, FI_NONE);
135                         if (flags & SYNC_IOWAIT)
136                                 xfs_ioend_wait(ip);
137                 }

Given that we are stuck on the iolock in xfs_sync_inodes_ag(), I
suspect you should re-read my comments above about "lock
re-architecting" ;).

If you make the xfs_ilock() there xfs_ilock_nowait() and avoid data
writeback if we don't get the lock the deadlock goes away, right?
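
i.e. something along these lines, hand-edited against the snippet quoted
above (untested sketch, not a real patch):

		/*
		 * Sketch: skip data flushing for inodes whose iolock is
		 * already held instead of blocking on it.
		 */
		if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
			if (xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
				lock_flags |= XFS_IOLOCK_SHARED;
				error = xfs_flush_pages(ip, 0, -1, fflag, FI_NONE);
				if (flags & SYNC_IOWAIT)
					xfs_ioend_wait(ip);
			}
		}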

BTW, can you post the patch you are working on?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: spurious -ENOSPC on XFS
  2009-01-31 23:57                     ` Dave Chinner
@ 2009-02-02 17:36                       ` Mikulas Patocka
  2009-02-03  3:27                         ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2009-02-02 17:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs, linux-kernel



On Sun, 1 Feb 2009, Dave Chinner wrote:

> On Thu, Jan 29, 2009 at 11:39:00AM -0500, Mikulas Patocka wrote:
> > On Sat, 24 Jan 2009, Dave Chinner wrote:
> > > On Fri, Jan 23, 2009 at 03:14:30PM -0500, Mikulas Patocka wrote:
> > > > If I placed
> > > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
> > > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
> > > > directly to xfs_flush_device, I got lock dependency warning (though not a 
> > > > real deadlock).
> > > 
> > > Same reason memory reclaim gives lockdep warnings on XFS - we're
> > > recursing into operations that take inode locks while we currently
> > > hold an inode lock.  However, it shouldn't deadlock because
> > > we should never try to take the iolock on the inode that we currently
> > > hold it on.
> > > 
> > > > So synchronous flushing definitely needs some thinking and lock 
> > > > rearchitecting.
> > > 
> > > No, not at all. At most the grabbing of the iolock in
> > > xfs_sync_inodes_ag() needs to become a trylock....
> > 
> > You are wrong, the comments in the code are right. It really
> > deadlocks if it is modified to use synchronous wait for the end of
> > the flush and if the flush is done with xfs_sync_inodes:
> > 
> > xfssyncd      D 0000000000606808     0  4819      2
> > Call Trace:
> >  [0000000000606788] rwsem_down_failed_common+0x1ac/0x1d8
> >  [0000000000606808] rwsem_down_read_failed+0x24/0x34
> >  [0000000000606848] __down_read+0x30/0x40
> >  [0000000000468a64] down_read_nested+0x48/0x58
> >  [00000000100e6cc8] xfs_ilock+0x48/0x8c [xfs]
> >  [000000001011018c] xfs_sync_inodes_ag+0x17c/0x2dc [xfs]
> >  [000000001011034c] xfs_sync_inodes+0x60/0xc4 [xfs]
> >  [00000000101103c4] xfs_flush_device_work+0x14/0x2c [xfs]
> >  [000000001010ff1c] xfssyncd+0x110/0x174 [xfs]
> >  [000000000046556c] kthread+0x54/0x88
> >  [000000000042b2a0] kernel_thread+0x3c/0x54
> >  [00000000004653b8] kthreadd+0xac/0x160
> 
> So it is stuck:
> 
> 127                 /*
> 128                  * If we have to flush data or wait for I/O completion
> 129                  * we need to hold the iolock.
> 130                  */
> 131                 if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
> 132     >>>>>>>>            xfs_ilock(ip, XFS_IOLOCK_SHARED);
> 133                         lock_flags |= XFS_IOLOCK_SHARED;
> 134                         error = xfs_flush_pages(ip, 0, -1, fflag, FI_NONE);
> 135                         if (flags & SYNC_IOWAIT)
> 136                                 xfs_ioend_wait(ip);
> 137                 }
> 
> Given that we are stuck on the iolock in xfs_sync_inodes_ag(), I
> suspect you should re-read my comments above about "lock
> re-architecting" ;).

No, if you turn it into a trylock, it will randomly fail if any other 
process is holding the lock for whatever reason. So you will still get 
spurious -ENOSPCs, although with lower probability.

> If you make the xfs_ilock() there xfs_ilock_nowait() and avoid data
> writeback if we don't get the lock the deadlock goes away, right?

But that premature ENOSPC possibility will still exist.

The only right solution that I see is to drop this flushing code altogether
(because it's unfixable), catch -ENOSPCs exactly before you are about to
return them to the Linux VFS, flush at that point (you are not holding any
locks there) and retry.
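
Roughly this shape (hand-waved pseudo-C; do_write_locked() and
flush_all_delalloc() are placeholders, not real functions):

#include <errno.h>

static long do_write_locked(void);	/* placeholder: the normal write path */
static void flush_all_delalloc(void);	/* placeholder: blocking flush of delalloc */

/* Catch ENOSPC at the topmost level, where no locks are held, and retry. */
static long write_top_level(void)
{
	long ret;

	ret = do_write_locked();
	if (ret == -ENOSPC) {
		flush_all_delalloc();		/* safe to block: no locks held here */
		ret = do_write_locked();	/* if it fails again, ENOSPC is real */
	}
	return ret;
}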

There seem to be more bugs with forgetting to flush delayed allocation
--- you should flush delayed allocation after *every* failed allocation
attempt, but the code flushes it only after a failed delayed allocation
attempt --- i.e. non-delayed allocators, such as xfs_iomap_write_direct
(and maybe other places), will still return -ENOSPC without attempting to
flush other inodes with delayed allocation.

The syncing should really be moved to the topmost place.

Mikulas

> BTW, can you post the patch you are working on?
> 
> Cheers,
> 
> Dave.

This is what I tried: I turned that 500ms wait into a completion:


Use xfs_sync_inodes and not sync_blockdev. sync_blockdev writes dirty 
metadata buffers, it doesn't touch inodes and pages at all.

Also, change that 500ms delay to a wait for completion, so that the
behavior is not dependent on magic timeout values. XFS developers insist
that it can't deadlock (I hope they're right).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 fs/xfs/linux-2.6/xfs_sync.c |   25 +++++++++++++++++++------
 fs/xfs/linux-2.6/xfs_sync.h |    1 +
 2 files changed, 20 insertions(+), 6 deletions(-)

Index: linux-2.6.29-rc3-devel/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- linux-2.6.29-rc3-devel.orig/fs/xfs/linux-2.6/xfs_sync.c	2009-01-29 15:49:25.000000000 +0100
+++ linux-2.6.29-rc3-devel/fs/xfs/linux-2.6/xfs_sync.c	2009-01-29 16:57:53.000000000 +0100
@@ -389,12 +389,15 @@ xfs_quiesce_attr(
  * - It saves on stack space, which is tight in certain situations
  * - It can be used (with care) as a mechanism to avoid deadlocks.
  * Flushing while allocating in a full filesystem requires both.
+ *
+ * Dave Chinner claims that the deadlock can't happen. -- mikulas
  */
 STATIC void
 xfs_syncd_queue_work(
 	struct xfs_mount *mp,
 	void		*data,
-	void		(*syncer)(struct xfs_mount *, void *))
+	void		(*syncer)(struct xfs_mount *, void *),
+	struct completion *completion)
 {
 	struct bhv_vfs_sync_work *work;
 
@@ -403,6 +406,7 @@ xfs_syncd_queue_work(
 	work->w_syncer = syncer;
 	work->w_data = data;
 	work->w_mount = mp;
+	work->w_completion = completion;
 	spin_lock(&mp->m_sync_lock);
 	list_add_tail(&work->w_list, &mp->m_sync_list);
 	spin_unlock(&mp->m_sync_lock);
@@ -415,6 +419,7 @@ xfs_syncd_queue_work(
  * has failed with ENOSPC and we are in the process of scratching our
  * heads, looking about for more room...
  */
+#if 0
 STATIC void
 xfs_flush_inode_work(
 	struct xfs_mount *mp,
@@ -424,16 +429,20 @@ xfs_flush_inode_work(
 	filemap_flush(inode->i_mapping);
 	iput(inode);
 }
+#endif
 
 void
 xfs_flush_inode(
 	xfs_inode_t	*ip)
 {
+#if 0
 	struct inode	*inode = VFS_I(ip);
+	DECLARE_COMPLETION_ONSTACK(completion);
 
 	igrab(inode);
-	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work);
-	delay(msecs_to_jiffies(500));
+	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work, &completion);
+	wait_for_completion(&completion);
+#endif
 }
 
 /*
@@ -446,7 +455,7 @@ xfs_flush_device_work(
 	void		*arg)
 {
 	struct inode	*inode = arg;
-	sync_blockdev(mp->m_super->s_bdev);
+	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);
 	iput(inode);
 }
 
@@ -455,10 +464,11 @@ xfs_flush_device(
 	xfs_inode_t	*ip)
 {
 	struct inode	*inode = VFS_I(ip);
+	DECLARE_COMPLETION_ONSTACK(completion);
 
 	igrab(inode);
-	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_device_work);
-	delay(msecs_to_jiffies(500));
+	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_device_work, &completion);
+	wait_for_completion(&completion);
 	xfs_log_force(ip->i_mount, (xfs_lsn_t)0, XFS_LOG_FORCE|XFS_LOG_SYNC);
 }
 
@@ -526,6 +536,8 @@ xfssyncd(
 		list_for_each_entry_safe(work, n, &tmp, w_list) {
 			(*work->w_syncer)(mp, work->w_data);
 			list_del(&work->w_list);
+			if (work->w_completion)
+				complete(work->w_completion);
 			if (work == &mp->m_sync_work)
 				continue;
 			kmem_free(work);
@@ -541,6 +553,7 @@ xfs_syncd_init(
 {
 	mp->m_sync_work.w_syncer = xfs_sync_worker;
 	mp->m_sync_work.w_mount = mp;
+	mp->m_sync_work.w_completion = NULL;
 	mp->m_sync_task = kthread_run(xfssyncd, mp, "xfssyncd");
 	if (IS_ERR(mp->m_sync_task))
 		return -PTR_ERR(mp->m_sync_task);
Index: linux-2.6.29-rc3-devel/fs/xfs/linux-2.6/xfs_sync.h
===================================================================
--- linux-2.6.29-rc3-devel.orig/fs/xfs/linux-2.6/xfs_sync.h	2009-01-29 15:52:28.000000000 +0100
+++ linux-2.6.29-rc3-devel/fs/xfs/linux-2.6/xfs_sync.h	2009-01-29 15:52:56.000000000 +0100
@@ -25,6 +25,7 @@ typedef struct bhv_vfs_sync_work {
 	struct xfs_mount	*w_mount;
 	void			*w_data;	/* syncer routine argument */
 	void			(*w_syncer)(struct xfs_mount *, void *);
+	struct completion	*w_completion;
 } bhv_vfs_sync_work_t;
 
 #define SYNC_ATTR		0x0001	/* sync attributes */


* Re: spurious -ENOSPC on XFS
  2009-02-02 17:36                       ` Mikulas Patocka
@ 2009-02-03  3:27                         ` Dave Chinner
  2009-02-03 20:05                           ` Mikulas Patocka
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2009-02-03  3:27 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Christoph Hellwig, xfs, linux-kernel

On Mon, Feb 02, 2009 at 12:36:40PM -0500, Mikulas Patocka wrote:
> On Sun, 1 Feb 2009, Dave Chinner wrote:
> > On Thu, Jan 29, 2009 at 11:39:00AM -0500, Mikulas Patocka wrote:
> > > On Sat, 24 Jan 2009, Dave Chinner wrote:
> > > > On Fri, Jan 23, 2009 at 03:14:30PM -0500, Mikulas Patocka wrote:
> > > > > If I placed
> > > > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
> > > > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
> > > > > directly to xfs_flush_device, I got lock dependency warning (though not a 
> > > > > real deadlock).
> > > > 
> > > > Same reason memory reclaim gives lockdep warnings on XFS - we're
> > > > recursing into operations that take inode locks while we currently
> > > > hold an inode lock.  However, it shouldn't deadlock because
> > > > we should never try to take the iolock on the inode that we currently
> > > > hold it on.
> > > > 
> > > > > So synchronous flushing definitely needs some thinking and lock 
> > > > > rearchitecting.
> > > > 
> > > > No, not at all. At most the grabbing of the iolock in
> > > > xfs_sync_inodes_ag() needs to become a trylock....
> > > 
> > > You are wrong, the comments in the code are right. It really
> > > deadlocks if it is modified to use synchronous wait for the end of
> > > the flush and if the flush is done with xfs_sync_inodes:
....
> > So it is stuck:
> > 
> > 127                 /*
> > 128                  * If we have to flush data or wait for I/O completion
> > 129                  * we need to hold the iolock.
> > 130                  */
> > 131                 if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
> > 132     >>>>>>>>            xfs_ilock(ip, XFS_IOLOCK_SHARED);
> > 133                         lock_flags |= XFS_IOLOCK_SHARED;
> > 134                         error = xfs_flush_pages(ip, 0, -1, fflag, FI_NONE);
> > 135                         if (flags & SYNC_IOWAIT)
> > 136                                 xfs_ioend_wait(ip);
> > 137                 }
> > 
> > Given that we are stuck on the iolock in xfs_sync_inodes_ag(), I
> > suspect you should re-read my comments above about "lock
> > re-architecting" ;).
> 
> No, if you turn it into a trylock, it will randomly fail if any other 
> process is holding the lock for whatever reason. So you will still get 
> spurious -ENOSPCs, although with lower probability.

The flush is never, ever going to be perfect. While we are blocking
waiting for the flush, other concurrent writes could be doing more
delayed allocation. The flush won't catch these, so we can still get
spurious ENOSPCs even with a *perfect* flush.

Hence there is no point in trying to make the code perfect - we
can look at more complex solutions if and only if the simple
solution is not sufficient.

> > If you make the xfs_ilock() there xfs_ilock_nowait() and avoid data
> > writeback if we don't get the lock the deadlock goes away, right?
> 
> But that premature ENOSPC possibility will still exist.
> The only right solution that I see is to drop this flushing code altogether
> (because it's unfixable), catch -ENOSPCs exactly before you are about to
> return them to the Linux VFS, flush at that point (you are not holding any
> locks there) and retry.

Still not perfect as soon as you consider concurrency (as per above).

> There seem to be more bugs with forgetting to flush delayed allocation
> --- you should flush delayed allocation after *every* failed allocation
> attempt, but the code flushes it only after a failed delayed allocation
> attempt --- i.e. non-delayed allocators, such as xfs_iomap_write_direct
> (and maybe other places), will still return -ENOSPC without attempting to
> flush other inodes with delayed allocation.

xfs_iomap_write_direct() is for direct IO, and if that fails due to
ENOSPC, we're going to return the error rather than try to flush
delayed allocation blocks. Users of direct IO care more about
deterministic response than trying to use every last byte of the
filesystem. Such users also don't tend to mix buffered writes
(hence delayed allocation) and direct IO on the same filesystem
precisely because of the non-deterministic nature of buffered IO.
Hence we don't flush here.

xfs_iomap_write_allocate() does the allocation of delayed extents,
which we've already guaranteed that there is space for due to the
flushing in xfs_iomap_write_delay(). Hence we don't flush here,
either.

For metadata allocations (all over the place), we take a reservation
first and so we could trigger a flush in certain circumstances if
the reservation fails (to avoid recursion and transaction subsystem
deadlocks). However, if you are not getting spurious enospc on
creating inodes (as opposed to writing to them) then I see no
immediate need for flushes there, either....

> Syncing should be really moved at the topmost place.

This can only be considered a best effort flush, not a sync.

> This is what I tried: I turned that 500ms wait into a completion:
> 
> 
> Use xfs_sync_inodes and not sync_blockdev. sync_blockdev writes dirty 
> metadata buffers, it doesn't touch inodes and pages at all.
> 
> Also, change that 500ms delay to a wait for completion, so that the
> behavior is not dependent on magic timeout values. XFS developers insist
> that it can't deadlock (I hope they're right).
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
.....
> @@ -415,6 +419,7 @@ xfs_syncd_queue_work(
>   * has failed with ENOSPC and we are in the process of scratching our
>   * heads, looking about for more room...
>   */
> +#if 0
>  STATIC void
>  xfs_flush_inode_work(
>  	struct xfs_mount *mp,
> @@ -424,16 +429,20 @@ xfs_flush_inode_work(
>  	filemap_flush(inode->i_mapping);
>  	iput(inode);
>  }
> +#endif
>  
>  void
>  xfs_flush_inode(
>  	xfs_inode_t	*ip)
>  {
> +#if 0
>  	struct inode	*inode = VFS_I(ip);
> +	DECLARE_COMPLETION_ONSTACK(completion);
>  
>  	igrab(inode);
> -	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work);
> -	delay(msecs_to_jiffies(500));
> +	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work, &completion);
> +	wait_for_completion(&completion);
> +#endif

Why did you turn off the initial inode flush?

>  }
>  
>  /*
> @@ -446,7 +455,7 @@ xfs_flush_device_work(
>  	void		*arg)
>  {
>  	struct inode	*inode = arg;
> -	sync_blockdev(mp->m_super->s_bdev);
> +	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);
>  	iput(inode);

This should be:

	xfs_sync_inodes(mp, SYNC_DELWRI);
	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);

As it will flush the inodes asynchronously and then the second
pass will wait for all the IO to complete. That will be much
more efficient than synchronous flushing.
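
For illustration, rolled into one helper (xfs_flush_delalloc is a made-up
name, not a function in the tree; the calls are exactly the two above):

	STATIC void
	xfs_flush_delalloc(
		struct xfs_mount	*mp)
	{
		/* pass 1: start asynchronous writeback of all delalloc pages */
		xfs_sync_inodes(mp, SYNC_DELWRI);
		/* pass 2: walk the inodes again and wait for that I/O to finish */
		xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);
	}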

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: spurious -ENOSPC on XFS
  2009-02-03  3:27                         ` Dave Chinner
@ 2009-02-03 20:05                           ` Mikulas Patocka
  2009-02-04 12:08                             ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2009-02-03 20:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs, linux-kernel

On Tue, 3 Feb 2009, Dave Chinner wrote:

> > No, if you turn it into a trylock, it will randomly fail if any other 
> > process is holding the lock for whatever reason. So you will still get 
> > spurious -ENOSPCs, although with lower probability.
> 
> The flush is never, ever going to be perfect. While we are blocking
> waiting for the flush, other concurrent writes could be doing more
> delayed allocation. The flush won't catch these, so we can still get
> spurious ENOSPCs even with a *perfect* flush.

So you can turn it into a loop --- while there are delayed allocations, 
flush and retry.
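
In rough pseudo-C (every name below is hypothetical, just to pin down the
idea), that loop would be something like:

	do {
		error = xfs_reserve_blocks(ip, nblks);		/* hypothetical */
		if (error != ENOSPC)
			break;
		/* push delalloc out so over-reserved space is returned */
		xfs_flush_delalloc(ip->i_mount);		/* hypothetical */
	} while (xfs_mount_has_delalloc(ip->i_mount));		/* hypothetical */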

... and if you turn it into trylock, what are you going to do with the 
inode that is just being written to? You should definitely flush it, but 
trylock will skip it because it's already locked.

> Hence there is no point in trying to make the code perfect - we
> can look at more complex solutions if and only if the simple
> solution is not sufficient.
> 
> > > If you make the xfs_ilock() there xfs_ilock_nowait() and avoid data
> > > writeback if we don't get the lock the deadlock goes away, right?
> > 
> > But that premature ENOSPC possibility will still exist.
> > The only right solution that I see is to drop this flushing code at all 
> > (because it's unfixable), catch -ENOSPCs exactly before you are about to 
> > return them to Linux VFS, flush at this point (you are not holding any 
> > locks here) and retry.
> 
> Still not perfect as soon as you consider concurrency (as per above).
> 
> > There seems to be more bugs with forgetting to flush delayed allocation 
> > --- you should flush delayed allocation after *every* failed allocate 
> > attempt, but the code flushes it only after failed delayed allocate 
> > attempt --- i.e. nondelayed allocators, such as xfs_iomap_write_direct 
> > (and maybe other places) will still return -ENOSPC without attempting to 
> > flush other inodes with delayed allocation.
> 
> xfs_iomap_write_direct() is for direct IO, and if that fails due to
> ENOSPC, we're going to return the error rather than try to flush
> delayed allocation blocks. Users of direct IO care more about
> deterministic response than trying to use every last byte of the
> filesystem. Such users also don't tend to mix buffered writes
> (hence delayed allocation) and direct IO on the same filesystem
> precisely because of the non-deterministic nature of buffered IO.
> Hence we don't flush here.
> 
> xfs_iomap_write_allocate() does the allocation of delayed extents,
> which we've already guaranteed that there is space for due to the
> flushing in xfs_iomap_write_delay(). Hence we don't flush here,
> either.
> 
> For metadata allocations (all over the place), we take a reservation
> first

And what if the reservation fails? Do you flush in this case?

> and so we could trigger a flush in certain circumstances if
> the reservation fails (to avoid recursion and transaction subsystem
> deadlocks). However, if you are not getting spurious enospc on
> creating inodes (as opposed to writing to them) then I see no
> immediate need for flushes there, either....
> 
> > Syncing should be really moved at the topmost place.
> 
> This can only be considered a best effort flush, not a sync.
> 
> > This is what I tried: I turned that 500ms wait into a completion:
> > 
> > 
> > Use xfs_sync_inodes and not sync_blockdev. sync_blockdev writes dirty 
> > metadata buffers, it doesn't touch inodes and pages at all.
> > 
> > Also, change that 500ms delay to a wait for completion, so that the
> > behavior is not dependent on magic timeout values. XFS developers insist
> > that it can't deadlock (I hope they're right).
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> .....
> > @@ -415,6 +419,7 @@ xfs_syncd_queue_work(
> >   * has failed with ENOSPC and we are in the process of scratching our
> >   * heads, looking about for more room...
> >   */
> > +#if 0
> >  STATIC void
> >  xfs_flush_inode_work(
> >  	struct xfs_mount *mp,
> > @@ -424,16 +429,20 @@ xfs_flush_inode_work(
> >  	filemap_flush(inode->i_mapping);
> >  	iput(inode);
> >  }
> > +#endif
> >  
> >  void
> >  xfs_flush_inode(
> >  	xfs_inode_t	*ip)
> >  {
> > +#if 0
> >  	struct inode	*inode = VFS_I(ip);
> > +	DECLARE_COMPLETION_ONSTACK(completion);
> >  
> >  	igrab(inode);
> > -	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work);
> > -	delay(msecs_to_jiffies(500));
> > +	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work, &completion);
> > +	wait_for_completion(&completion);
> > +#endif
> 
> Why did you turn off the initial inode flush?

Because it recurses into Linux VFS layer and deadlocks.

You can avoid XFS deadlock with trylock, but you can't avoid Linux VFS 
deadlocks --- you just must not call it.

> >  }
> >  
> >  /*
> > @@ -446,7 +455,7 @@ xfs_flush_device_work(
> >  	void		*arg)
> >  {
> >  	struct inode	*inode = arg;
> > -	sync_blockdev(mp->m_super->s_bdev);
> > +	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);
> >  	iput(inode);
> 
> This should be:
> 
> 	xfs_sync_inodes(mp, SYNC_DELWRI);
> 	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);
> 
> As it will flush the inodes asynchronously and then the second
> pass will wait for all the IO to complete. That will be much
> more efficient than synchronous flushing.

I see, I didn't know about this trick.

Mikulas

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: spurious -ENOSPC on XFS
  2009-02-03 20:05                           ` Mikulas Patocka
@ 2009-02-04 12:08                             ` Dave Chinner
  2009-02-05  4:31                               ` Mikulas Patocka
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2009-02-04 12:08 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Christoph Hellwig, xfs, linux-kernel

On Tue, Feb 03, 2009 at 03:05:07PM -0500, Mikulas Patocka wrote:
> On Tue, 3 Feb 2009, Dave Chinner wrote:
> 
> > > No, if you turn it into a trylock, it will randomly fail if any other 
> > > process is holding the lock for whatever reason. So you will still get 
> > > spurious -ENOSPCs, although with lower probability.
> > 
> > The flush is never, ever going to be perfect. While we are blocking
> > waiting for the flush, other concurrent writes could be doing more
> > delayed allocation. The flush won't catch these, so we can still get
> > spurious ENOSPCs even with a *perfect* flush.
> 
> So you can turn it into a loop --- while there are delayed allocations, 
> flush and retry.

I don't think that is necessary, having just instrumented the code:

[42949497.350000] Ai 7040, tl 7037, f 7037, w 0
[42949497.510000] i 6528, tl 6528, f 6528, w 0
[42949497.510000] i 3648, tl 3648, f 3648, w 0
[42949497.530000] i 1977, tl 1977, f 1977, w 0
[42949497.530000] Bi 7040, tl 7037, f 7037, w 7037
[42949497.550000] i 6528, tl 6528, f 6528, w 6528
[42949497.550000] i 3648, tl 3648, f 3648, w 3648
[42949497.550000] i 1977, tl 1977, f 1977, w 1977
[42949497.550000] CDd

A = start of xfs_flush_device
i = inodes in AG scanned for writeback
tl = trylocks that succeeded
f = number flushed
w = number waited for.
(there are 4 of these groups due to 4 AGs in the fs)
B = startof blocking flush
C = completion signaled
D = completion complete
d = xfs_flush_device done.

As you can see, there are only 3 inodes (of ~19,200) that weren't
successfully trylocked in the flush. I haven't seen any more than
that, either.

> ... and if you turn it into trylock, what are you going to do with the 
> inode that is just being written to? You should definitely flush it, but 
> trylock will skip it because it's already locked.

We've already flushed it directly. You disabled that code fearing
deadlocks. I've made it synchronous (i.e. not handed off to
xfssyncd) because the flush path requires us to hold the lock we are
already holding....

> > > There seems to be more bugs with forgetting to flush delayed allocation 
> > > --- you should flush delayed allocation after *every* failed allocate 
> > > attempt, but the code flushes it only after failed delayed allocate 
> > > attempt --- i.e. nondelayed allocators, such as xfs_iomap_write_direct 
> > > (and maybe other places) will still return -ENOSPC without attempting to 
> > > flush other inodes with delayed allocation.
.....
> > For metadata allocations (all over the place), we take a reservation
> > first
> 
> And what if the reservation fails? Do you flush in this case?

Well, if you read the rest of the paragraph you'd have the answer
to that question. I'll leave it quoted because it is relevant:

> > and so we could trigger a flush in certain circumstances if
> > the reservation fails (to avoid recursion and transaction subsystem
> > deadlocks). However, if you are not getting spurious enospc on
> > creating inodes (as opposed to writing to them) then I see no
> > immediate need for flushes there, either....

It is relevant because the test case Eric provided fails repeatedly
in xfs_create() on a reservation rather than on a data write, and
without a flush there, no further data writes will occur to flush out
the blocks.

So a flush there is needed, too.
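
In outline that is just a flush-and-retry wrapped around the reservation,
which is what the WIP patch at the end of this mail does in xfs_create():

	error = xfs_trans_reserve(tp, resblks, XFS_CREATE_LOG_RES(mp), 0,
			XFS_TRANS_PERM_LOG_RES, XFS_CREATE_LOG_COUNT);
	if (error == ENOSPC) {
		/* flush delalloc blocks and retry the reservation once */
		xfs_flush_device(dp);
		error = xfs_trans_reserve(tp, resblks, XFS_CREATE_LOG_RES(mp), 0,
				XFS_TRANS_PERM_LOG_RES, XFS_CREATE_LOG_COUNT);
	}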

> > > @@ -415,6 +419,7 @@ xfs_syncd_queue_work(
> > >   * has failed with ENOSPC and we are in the process of scratching our
> > >   * heads, looking about for more room...
> > >   */
> > > +#if 0
> > >  STATIC void
> > >  xfs_flush_inode_work(
> > >  	struct xfs_mount *mp,
> > > @@ -424,16 +429,20 @@ xfs_flush_inode_work(
> > >  	filemap_flush(inode->i_mapping);
> > >  	iput(inode);
> > >  }
> > > +#endif
> > >  
> > >  void
> > >  xfs_flush_inode(
> > >  	xfs_inode_t	*ip)
> > >  {
> > > +#if 0
> > >  	struct inode	*inode = VFS_I(ip);
> > > +	DECLARE_COMPLETION_ONSTACK(completion);
> > >  
> > >  	igrab(inode);
> > > -	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work);
> > > -	delay(msecs_to_jiffies(500));
> > > +	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work, &completion);
> > > +	wait_for_completion(&completion);
> > > +#endif
> > 
> > Why did you turn off the initial inode flush?
> 
> Because it recurses into Linux VFS layer and deadlocks.

Not at all. It calls into the writeback code (not the VFS)
which does not take any locks except page locks before it
calls back into ->writepages....

I've attached the WIP patch I have right now (based on yours) that
allows the test case to pass. It will need a bit of cleanup before
it is ready for prime-time, but it passes xfsqa here, including the
ENOSPC stress tests.  Can you see if it fixes your test case and
whether it deadlocks?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Use xfs_sync_inodes and not sync_blockdev. XFS keeps its dirty
metadata in different address spaces to the block device, so
sync_blockdev does nothing.

Change that 500ms delay to a wait for completion, so that the
behavior is not dependent on magic timeout values.
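
Pulled out of the patch below and simplified, the handoff in
xfs_flush_device() becomes: queue the work item with a completion attached,
then sleep until xfssyncd has run the syncer and signalled it.

	struct inode	*inode = VFS_I(ip);
	DECLARE_COMPLETION_ONSTACK(completion);

	igrab(inode);
	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_device_work,
			     &completion);
	wait_for_completion(&completion);	/* complete()d by xfssyncd */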

Add a trylock option to xfs_sync_inodes() so that it won't
deadlock if we are already holding an inode lock.

Add xfs_flush_device() to inode allocation failure in the create
path as that is the other place that typically falls over due to
outstanding delayed allocation extents.

Based on work from Mikulas Patocka <mpatocka@redhat.com>.

---
 fs/xfs/linux-2.6/xfs_fs_subr.c |   14 ++++----
 fs/xfs/linux-2.6/xfs_sync.c    |   66 +++++++++++++++++++--------------------
 fs/xfs/linux-2.6/xfs_sync.h    |    6 ++-
 fs/xfs/xfs_mount.h             |    2 +-
 fs/xfs/xfs_vnodeops.c          |    6 ++++
 5 files changed, 50 insertions(+), 44 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_fs_subr.c b/fs/xfs/linux-2.6/xfs_fs_subr.c
index 5aeb777..08be36d 100644
--- a/fs/xfs/linux-2.6/xfs_fs_subr.c
+++ b/fs/xfs/linux-2.6/xfs_fs_subr.c
@@ -74,14 +74,14 @@ xfs_flush_pages(
 
 	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 		xfs_iflags_clear(ip, XFS_ITRUNCATED);
-		ret = filemap_fdatawrite(mapping);
-		if (flags & XFS_B_ASYNC)
-			return -ret;
-		ret2 = filemap_fdatawait(mapping);
-		if (!ret)
-			ret = ret2;
+		ret = -filemap_fdatawrite(mapping);
 	}
-	return -ret;
+	if (flags & XFS_B_ASYNC)
+		return ret;
+	ret2 = xfs_wait_on_pages(ip, first, last);
+	if (!ret)
+		ret = ret2;
+	return ret;
 }
 
 int
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index a608e72..fddb9de 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -62,12 +62,6 @@ xfs_sync_inodes_ag(
 	uint32_t	first_index = 0;
 	int		error = 0;
 	int		last_error = 0;
-	int		fflag = XFS_B_ASYNC;
-
-	if (flags & SYNC_DELWRI)
-		fflag = XFS_B_DELWRI;
-	if (flags & SYNC_WAIT)
-		fflag = 0;		/* synchronous overrides all */
 
 	do {
 		struct inode	*inode;
@@ -128,12 +122,23 @@ xfs_sync_inodes_ag(
 		 * If we have to flush data or wait for I/O completion
 		 * we need to hold the iolock.
 		 */
-		if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
-			xfs_ilock(ip, XFS_IOLOCK_SHARED);
-			lock_flags |= XFS_IOLOCK_SHARED;
-			error = xfs_flush_pages(ip, 0, -1, fflag, FI_NONE);
-			if (flags & SYNC_IOWAIT)
+		if (flags & SYNC_DELWRI) {
+			if (VN_DIRTY(inode)) {
+				if (flags & SYNC_TRYLOCK) {
+					if (xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
+						lock_flags |= XFS_IOLOCK_SHARED;
+				} else {
+					xfs_ilock(ip, XFS_IOLOCK_SHARED);
+					lock_flags |= XFS_IOLOCK_SHARED;
+				}
+				if (lock_flags & XFS_IOLOCK_SHARED) {
+					error = xfs_flush_pages(ip, 0, -1,
+							XFS_B_ASYNC, FI_NONE);
+				}
+			}
+			if (VN_CACHED(inode) && (flags & SYNC_IOWAIT)) {
 				xfs_ioend_wait(ip);
+			}
 		}
 		xfs_ilock(ip, XFS_ILOCK_SHARED);
 
@@ -388,7 +393,7 @@ xfs_quiesce_attr(
 }
 
 /*
- * Enqueue a work item to be picked up by the vfs xfssyncd thread.
+ * Enqueue a work item to be picked up by the xfssyncd thread.
  * Doing this has two advantages:
  * - It saves on stack space, which is tight in certain situations
  * - It can be used (with care) as a mechanism to avoid deadlocks.
@@ -398,15 +403,17 @@ STATIC void
 xfs_syncd_queue_work(
 	struct xfs_mount *mp,
 	void		*data,
-	void		(*syncer)(struct xfs_mount *, void *))
+	void		(*syncer)(struct xfs_mount *, void *),
+	struct completion *completion)
 {
-	struct bhv_vfs_sync_work *work;
+	struct xfs_sync_work *work;
 
-	work = kmem_alloc(sizeof(struct bhv_vfs_sync_work), KM_SLEEP);
+	work = kmem_zalloc(sizeof(struct xfs_sync_work), KM_SLEEP);
 	INIT_LIST_HEAD(&work->w_list);
 	work->w_syncer = syncer;
 	work->w_data = data;
 	work->w_mount = mp;
+	work->w_completion = completion;
 	spin_lock(&mp->m_sync_lock);
 	list_add_tail(&work->w_list, &mp->m_sync_list);
 	spin_unlock(&mp->m_sync_lock);
@@ -419,25 +426,11 @@ xfs_syncd_queue_work(
  * has failed with ENOSPC and we are in the process of scratching our
  * heads, looking about for more room...
  */
-STATIC void
-xfs_flush_inode_work(
-	struct xfs_mount *mp,
-	void		*arg)
-{
-	struct inode	*inode = arg;
-	filemap_flush(inode->i_mapping);
-	iput(inode);
-}
-
 void
 xfs_flush_inode(
 	xfs_inode_t	*ip)
 {
-	struct inode	*inode = VFS_I(ip);
-
-	igrab(inode);
-	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work);
-	delay(msecs_to_jiffies(500));
+	xfs_flush_pages(ip, 0, -1, 0, FI_NONE);
 }
 
 /*
@@ -450,7 +443,8 @@ xfs_flush_device_work(
 	void		*arg)
 {
 	struct inode	*inode = arg;
-	sync_blockdev(mp->m_super->s_bdev);
+	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_TRYLOCK);
+	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_TRYLOCK | SYNC_IOWAIT);
 	iput(inode);
 }
 
@@ -459,10 +453,11 @@ xfs_flush_device(
 	xfs_inode_t	*ip)
 {
 	struct inode	*inode = VFS_I(ip);
+	DECLARE_COMPLETION_ONSTACK(completion);
 
 	igrab(inode);
-	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_device_work);
-	delay(msecs_to_jiffies(500));
+	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_device_work, &completion);
+	wait_for_completion(&completion);
 	xfs_log_force(ip->i_mount, (xfs_lsn_t)0, XFS_LOG_FORCE|XFS_LOG_SYNC);
 }
 
@@ -497,7 +492,7 @@ xfssyncd(
 {
 	struct xfs_mount	*mp = arg;
 	long			timeleft;
-	bhv_vfs_sync_work_t	*work, *n;
+	xfs_sync_work_t		*work, *n;
 	LIST_HEAD		(tmp);
 
 	set_freezable();
@@ -530,6 +525,8 @@ xfssyncd(
 		list_for_each_entry_safe(work, n, &tmp, w_list) {
 			(*work->w_syncer)(mp, work->w_data);
 			list_del(&work->w_list);
+			if (work->w_completion)
+				complete(work->w_completion);
 			if (work == &mp->m_sync_work)
 				continue;
 			kmem_free(work);
@@ -545,6 +542,7 @@ xfs_syncd_init(
 {
 	mp->m_sync_work.w_syncer = xfs_sync_worker;
 	mp->m_sync_work.w_mount = mp;
+	mp->m_sync_work.w_completion = NULL;
 	mp->m_sync_task = kthread_run(xfssyncd, mp, "xfssyncd");
 	if (IS_ERR(mp->m_sync_task))
 		return -PTR_ERR(mp->m_sync_task);
diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
index 5f6de1e..de87a7f 100644
--- a/fs/xfs/linux-2.6/xfs_sync.h
+++ b/fs/xfs/linux-2.6/xfs_sync.h
@@ -20,18 +20,20 @@
 
 struct xfs_mount;
 
-typedef struct bhv_vfs_sync_work {
+typedef struct xfs_sync_work {
 	struct list_head	w_list;
 	struct xfs_mount	*w_mount;
 	void			*w_data;	/* syncer routine argument */
 	void			(*w_syncer)(struct xfs_mount *, void *);
-} bhv_vfs_sync_work_t;
+	struct completion	*w_completion;
+} xfs_sync_work_t;
 
 #define SYNC_ATTR		0x0001	/* sync attributes */
 #define SYNC_DELWRI		0x0002	/* look at delayed writes */
 #define SYNC_WAIT		0x0004	/* wait for i/o to complete */
 #define SYNC_BDFLUSH		0x0008	/* BDFLUSH is calling -- don't block */
 #define SYNC_IOWAIT		0x0010  /* wait for all I/O to complete */
+#define SYNC_TRYLOCK		0x0020  /* only try to lock inodes */
 
 int xfs_syncd_init(struct xfs_mount *mp);
 void xfs_syncd_stop(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index f5e9937..775a2ac 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -322,7 +322,7 @@ typedef struct xfs_mount {
 #endif
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
 	struct task_struct	*m_sync_task;	/* generalised sync thread */
-	bhv_vfs_sync_work_t	m_sync_work;	/* work item for VFS_SYNC */
+	xfs_sync_work_t		m_sync_work;	/* work item for VFS_SYNC */
 	struct list_head	m_sync_list;	/* sync thread work item list */
 	spinlock_t		m_sync_lock;	/* work item list lock */
 	int			m_sync_seq;	/* sync thread generation no. */
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 0e55c5d..e1099e7 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -1449,6 +1449,12 @@ xfs_create(
 	error = xfs_trans_reserve(tp, resblks, XFS_CREATE_LOG_RES(mp), 0,
 			XFS_TRANS_PERM_LOG_RES, XFS_CREATE_LOG_COUNT);
 	if (error == ENOSPC) {
+		/* flush delalloc blocks and retry */
+		xfs_flush_device(dp);
+		error = xfs_trans_reserve(tp, resblks, XFS_CREATE_LOG_RES(mp), 0,
+			XFS_TRANS_PERM_LOG_RES, XFS_CREATE_LOG_COUNT);
+	}
+	if (error == ENOSPC) {
 		resblks = 0;
 		error = xfs_trans_reserve(tp, 0, XFS_CREATE_LOG_RES(mp), 0,
 				XFS_TRANS_PERM_LOG_RES, XFS_CREATE_LOG_COUNT);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: spurious -ENOSPC on XFS
  2009-02-04 12:08                             ` Dave Chinner
@ 2009-02-05  4:31                               ` Mikulas Patocka
  2009-02-05  7:43                                 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2009-02-05  4:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs, linux-kernel

> > ... and if you turn it into trylock, what are you going to do with the 
> > inode that is just being written to? You should definitely flush it, but 
> > trylock will skip it because it's already locked.
> 
> We've already flushed it directly. You disabled that code fearing
> deadlocks. I've made it synchronous (i.e. not handed off to
> xfssyncd) because the flush path requires us to hold the lock we are
> already holding....

This is not "fearing deadlocks". I was getting a real deadlock:

Jan 29 16:51:55 slunicko kernel: SysRq : Show Blocked State
Jan 29 16:51:55 slunicko kernel:   task                        PC stack   pid father
Jan 29 16:51:55 slunicko kernel: xfssyncd      D 00000000004848e0     0 13321      2
Jan 29 16:51:55 slunicko kernel: Call Trace:
Jan 29 16:51:55 slunicko kernel:  [000000000060475c] io_schedule+0x20/0x44
Jan 29 16:51:55 slunicko kernel:  [00000000004848e0] sync_page+0x64/0x74
Jan 29 16:51:55 slunicko kernel:  [000000000060494c] __wait_on_bit_lock+0x5c/0x9c
Jan 29 16:51:55 slunicko kernel:  [0000000000484848] __lock_page+0x58/0x68
Jan 29 16:51:55 slunicko kernel:  [000000000048b3fc] write_cache_pages+0x170/0x38c
Jan 29 16:51:55 slunicko kernel:  [000000000048b64c] generic_writepages+0x34/0x48
Jan 29 16:51:55 slunicko kernel:  [00000000101d0d48] xfs_vm_writepages+0x3c/0x50 [xfs]
Jan 29 16:51:55 slunicko kernel:  [000000000048b6a4] do_writepages+0x44/0x7c
Jan 29 16:51:55 slunicko kernel:  [00000000004850f4] __filemap_fdatawrite_range+0x58/0x6c
Jan 29 16:51:55 slunicko kernel:  [0000000000485cc8] filemap_flush+0x20/0x34
Jan 29 16:51:55 slunicko kernel:  [00000000101dbc64] xfs_flush_inode_work+0x10/0x28 [xfs]
Jan 29 16:51:55 slunicko kernel:  [00000000101dbf2c] xfssyncd+0x110/0x174 [xfs]
Jan 29 16:51:55 slunicko kernel:  [000000000046556c] kthread+0x54/0x88
Jan 29 16:51:55 slunicko kernel:  [000000000042b2a0] kernel_thread+0x3c/0x54
Jan 29 16:51:55 slunicko kernel:  [00000000004653b8] kthreadd+0xac/0x160
Jan 29 16:51:55 slunicko kernel: pdflush       D 00000000004848e0     0 15226      2
Jan 29 16:51:55 slunicko kernel: Call Trace:
Jan 29 16:51:55 slunicko kernel:  [000000000060475c] io_schedule+0x20/0x44
Jan 29 16:51:55 slunicko kernel:  [00000000004848e0] sync_page+0x64/0x74
Jan 29 16:51:55 slunicko kernel:  [000000000060494c] __wait_on_bit_lock+0x5c/0x9c
Jan 29 16:51:55 slunicko kernel:  [0000000000484848] __lock_page+0x58/0x68
Jan 29 16:51:55 slunicko kernel:  [000000000048b3fc] write_cache_pages+0x170/0x38c
Jan 29 16:51:55 slunicko kernel:  [000000000048b64c] generic_writepages+0x34/0x48
Jan 29 16:51:55 slunicko kernel:  [00000000101d0d48] xfs_vm_writepages+0x3c/0x50 [xfs]
Jan 29 16:51:55 slunicko kernel:  [000000000048b6a4] do_writepages+0x44/0x7c
Jan 29 16:51:55 slunicko kernel:  [00000000004ca350] __writeback_single_inode+0x168/0x2e4
Jan 29 16:51:55 slunicko kernel:  [00000000004ca90c] generic_sync_sb_inodes+0x204/0x3a0
Jan 29 16:51:55 slunicko kernel:  [00000000004caabc] sync_sb_inodes+0x14/0x24
Jan 29 16:51:55 slunicko kernel:  [00000000004cacd8] writeback_inodes+0x8c/0x100
Jan 29 16:51:55 slunicko kernel:  [000000000048c274] wb_kupdate+0xc4/0x15c
Jan 29 16:51:55 slunicko kernel:  [000000000048c8dc] pdflush+0xfc/0x1cc
Jan 29 16:51:55 slunicko kernel:  [000000000046556c] kthread+0x54/0x88
Jan 29 16:51:55 slunicko kernel:  [000000000042b2a0] kernel_thread+0x3c/0x54
Jan 29 16:51:55 slunicko kernel: dd            D 00000000006045c4     0 16610  16597
Jan 29 16:51:55 slunicko kernel: Call Trace:
Jan 29 16:51:55 slunicko kernel:  [00000000006047d8] schedule_timeout+0x24/0xb8
Jan 29 16:51:55 slunicko kernel:  [00000000006045c4] wait_for_common+0xe4/0x14c
Jan 29 16:51:55 slunicko kernel:  [0000000000604700] wait_for_completion+0x1c/0x2c
Jan 29 16:51:55 slunicko kernel:  [00000000101dc7d8] xfs_flush_inode+0xb0/0xc0 [xfs]
Jan 29 16:51:55 slunicko kernel:  [00000000101b9b50] xfs_flush_space+0x54/0xd0 [xfs]
Jan 29 16:51:55 slunicko kernel:  [00000000101b9ec0] xfs_iomap_write_delay+0x1bc/0x228 [xfs]
Jan 29 16:51:55 slunicko kernel:  [00000000101ba4b8] xfs_iomap+0x1c4/0x2e0 [xfs]
Jan 29 16:51:55 slunicko kernel:  [00000000101d0f04] __xfs_get_blocks+0x74/0x240 [xfs]
Jan 29 16:51:55 slunicko kernel:  [00000000101d112c] xfs_get_blocks+0x24/0x38 [xfs]
Jan 29 16:51:55 slunicko kernel:  [00000000004d05f0] __block_prepare_write+0x184/0x41c
Jan 29 16:51:55 slunicko kernel:  [00000000004d095c] block_write_begin+0x84/0xe8
Jan 29 16:51:55 slunicko kernel:  [00000000101d047c] xfs_vm_write_begin+0x3c/0x50 [xfs]
Jan 29 16:51:55 slunicko kernel:  [0000000000485258] generic_file_buffered_write+0xe8/0x28c
Jan 29 16:51:55 slunicko kernel:  [00000000101d8ec4] xfs_write+0x40c/0x604 [xfs]
Jan 29 16:51:55 slunicko kernel:  [00000000101d4d3c] xfs_file_aio_write+0x74/0x84 [xfs]
Jan 29 16:51:55 slunicko kernel:  [00000000004ae70c] do_sync_write+0x8c/0xdc

This one was obtained on a machine with 4k filesystem blocks, 8k pages and
dd bs=1 on a nearly full filesystem. Apparently the flush attempts to lock
the page that the write in progress has already locked.

The deadlock happened when I modified it to use completions (instead of 
500ms wait) with the patch I already posted.
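
Condensed from the two traces above, the cycle is:

  dd:        xfs_write -> generic_file_buffered_write -> block_write_begin
               (locks page P) -> xfs_get_blocks -> xfs_iomap_write_delay
               (hits ENOSPC) -> xfs_flush_space -> xfs_flush_inode
               (queues work to xfssyncd and waits for the completion)

  xfssyncd:  xfs_flush_inode_work -> filemap_flush -> write_cache_pages
               (tries to lock page P again and blocks forever)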

> > > > There seems to be more bugs with forgetting to flush delayed allocation 
> > > > --- you should flush delayed allocation after *every* failed allocate 
> > > > attempt, but the code flushes it only after failed delayed allocate 
> > > > attempt --- i.e. nondelayed allocators, such as xfs_iomap_write_direct 
> > > > (and maybe other places) will still return -ENOSPC without attempting to 
> > > > flush other inodes with delayed allocation.
> .....
> > > For metadata allocations (all over the place), we take a reservation
> > > first
> > 
> > And what if the reservation fails? Do you flush in this case?
> 
> Well, if you read the rest of the paragraph you'd have the answer
> to that question. I'll leave it quoted because it is relevant:
> 
> > > and so we could trigger a flush in certain circumstances if
> > > the reservation fails (to avoid recursion and transaction subsystem
> > > deadlocks). However, if you are not getting spurious enospc on
> > > creating inodes (as opposed to writing to them) then I see no
> > > immediate need for flushes there, either....
> 
> It is relevant because the test case Eric provided fails repeatedly
> in xfs_create() on a reservation rather than on a data write, and
> without a flush there no further data writes will occur to flush out
> the blocks.
> 
> So a flush there is needed, too.
> 
> > > > @@ -415,6 +419,7 @@ xfs_syncd_queue_work(
> > > >   * has failed with ENOSPC and we are in the process of scratching our
> > > >   * heads, looking about for more room...
> > > >   */
> > > > +#if 0
> > > >  STATIC void
> > > >  xfs_flush_inode_work(
> > > >  	struct xfs_mount *mp,
> > > > @@ -424,16 +429,20 @@ xfs_flush_inode_work(
> > > >  	filemap_flush(inode->i_mapping);
> > > >  	iput(inode);
> > > >  }
> > > > +#endif
> > > >  
> > > >  void
> > > >  xfs_flush_inode(
> > > >  	xfs_inode_t	*ip)
> > > >  {
> > > > +#if 0
> > > >  	struct inode	*inode = VFS_I(ip);
> > > > +	DECLARE_COMPLETION_ONSTACK(completion);
> > > >  
> > > >  	igrab(inode);
> > > > -	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work);
> > > > -	delay(msecs_to_jiffies(500));
> > > > +	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work, &completion);
> > > > +	wait_for_completion(&completion);
> > > > +#endif
> > > 
> > > Why did you turn off the initial inode flush?
> > 
> > Because it recurses into Linux VFS layer and deadlocks.
> 
> Not at all. It calls into the writeback code (not the VFS)
> which does not take any locks except page locks before it
> calls back into ->writepages....

See the backtrace above.

> I've attached the WIP patch I have right now (based on yours) that
> allows the test case to pass. It will need a bit of cleanup before
> it is ready for prime-time, but is passes xfsqa here including the
> ENOSPC stress tests.  Can you see if it fixes your test case and
> whether it deadlocks?

I can try it, but that filemap_fdatawrite is still wrong. I tried to call 
filemap_fdatawrite myself with my patch and got the deadlock.

You just can't hold a page lock and then call a generic kernel function
that takes other page locks. Even if you somehow hack it into working now,
over the long term it is unmaintainable.

Mikulas

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> Use xfs_sync_inodes and not sync_blockdev. XFS keeps its dirty
> metadata in different address spaces to the block device, so
> sync_blockdev does nothing.
> 
> Change that 500ms delay to a wait for completion, so that the
> behavior is not dependent on magic timeout values.
> 
> Add a trylock option to xfs_sync_inodes() so that it won't
> deadlock if we are already holding an inode lock.
> 
> Add xfs_flush_device() to inode allocation failure in the create
> path as that is the other place that typically falls over due to
> outstanding delayed allocation extents.
> 
> Based on work from Mikulas Patocka <mpatocka@redhat.com>.
> 
> ---
>  fs/xfs/linux-2.6/xfs_fs_subr.c |   14 ++++----
>  fs/xfs/linux-2.6/xfs_sync.c    |   66 +++++++++++++++++++--------------------
>  fs/xfs/linux-2.6/xfs_sync.h    |    6 ++-
>  fs/xfs/xfs_mount.h             |    2 +-
>  fs/xfs/xfs_vnodeops.c          |    6 ++++
>  5 files changed, 50 insertions(+), 44 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_fs_subr.c b/fs/xfs/linux-2.6/xfs_fs_subr.c
> index 5aeb777..08be36d 100644
> --- a/fs/xfs/linux-2.6/xfs_fs_subr.c
> +++ b/fs/xfs/linux-2.6/xfs_fs_subr.c
> @@ -74,14 +74,14 @@ xfs_flush_pages(
>  
>  	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  		xfs_iflags_clear(ip, XFS_ITRUNCATED);
> -		ret = filemap_fdatawrite(mapping);
> -		if (flags & XFS_B_ASYNC)
> -			return -ret;
> -		ret2 = filemap_fdatawait(mapping);
> -		if (!ret)
> -			ret = ret2;
> +		ret = -filemap_fdatawrite(mapping);
>  	}
> -	return -ret;
> +	if (flags & XFS_B_ASYNC)
> +		return ret;
> +	ret2 = xfs_wait_on_pages(ip, first, last);
> +	if (!ret)
> +		ret = ret2;
> +	return ret;
>  }
>  
>  int
> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index a608e72..fddb9de 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -62,12 +62,6 @@ xfs_sync_inodes_ag(
>  	uint32_t	first_index = 0;
>  	int		error = 0;
>  	int		last_error = 0;
> -	int		fflag = XFS_B_ASYNC;
> -
> -	if (flags & SYNC_DELWRI)
> -		fflag = XFS_B_DELWRI;
> -	if (flags & SYNC_WAIT)
> -		fflag = 0;		/* synchronous overrides all */
>  
>  	do {
>  		struct inode	*inode;
> @@ -128,12 +122,23 @@ xfs_sync_inodes_ag(
>  		 * If we have to flush data or wait for I/O completion
>  		 * we need to hold the iolock.
>  		 */
> -		if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
> -			xfs_ilock(ip, XFS_IOLOCK_SHARED);
> -			lock_flags |= XFS_IOLOCK_SHARED;
> -			error = xfs_flush_pages(ip, 0, -1, fflag, FI_NONE);
> -			if (flags & SYNC_IOWAIT)
> +		if (flags & SYNC_DELWRI) {
> +			if (VN_DIRTY(inode)) {
> +				if (flags & SYNC_TRYLOCK) {
> +					if (xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
> +						lock_flags |= XFS_IOLOCK_SHARED;
> +				} else {
> +					xfs_ilock(ip, XFS_IOLOCK_SHARED);
> +					lock_flags |= XFS_IOLOCK_SHARED;
> +				}
> +				if (lock_flags & XFS_IOLOCK_SHARED) {
> +					error = xfs_flush_pages(ip, 0, -1,
> +							XFS_B_ASYNC, FI_NONE);
> +				}
> +			}
> +			if (VN_CACHED(inode) && (flags & SYNC_IOWAIT)) {
>  				xfs_ioend_wait(ip);
> +			}
>  		}
>  		xfs_ilock(ip, XFS_ILOCK_SHARED);
>  
> @@ -388,7 +393,7 @@ xfs_quiesce_attr(
>  }
>  
>  /*
> - * Enqueue a work item to be picked up by the vfs xfssyncd thread.
> + * Enqueue a work item to be picked up by the xfssyncd thread.
>   * Doing this has two advantages:
>   * - It saves on stack space, which is tight in certain situations
>   * - It can be used (with care) as a mechanism to avoid deadlocks.
> @@ -398,15 +403,17 @@ STATIC void
>  xfs_syncd_queue_work(
>  	struct xfs_mount *mp,
>  	void		*data,
> -	void		(*syncer)(struct xfs_mount *, void *))
> +	void		(*syncer)(struct xfs_mount *, void *),
> +	struct completion *completion)
>  {
> -	struct bhv_vfs_sync_work *work;
> +	struct xfs_sync_work *work;
>  
> -	work = kmem_alloc(sizeof(struct bhv_vfs_sync_work), KM_SLEEP);
> +	work = kmem_zalloc(sizeof(struct xfs_sync_work), KM_SLEEP);
>  	INIT_LIST_HEAD(&work->w_list);
>  	work->w_syncer = syncer;
>  	work->w_data = data;
>  	work->w_mount = mp;
> +	work->w_completion = completion;
>  	spin_lock(&mp->m_sync_lock);
>  	list_add_tail(&work->w_list, &mp->m_sync_list);
>  	spin_unlock(&mp->m_sync_lock);
> @@ -419,25 +426,11 @@ xfs_syncd_queue_work(
>   * has failed with ENOSPC and we are in the process of scratching our
>   * heads, looking about for more room...
>   */
> -STATIC void
> -xfs_flush_inode_work(
> -	struct xfs_mount *mp,
> -	void		*arg)
> -{
> -	struct inode	*inode = arg;
> -	filemap_flush(inode->i_mapping);
> -	iput(inode);
> -}
> -
>  void
>  xfs_flush_inode(
>  	xfs_inode_t	*ip)
>  {
> -	struct inode	*inode = VFS_I(ip);
> -
> -	igrab(inode);
> -	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inode_work);
> -	delay(msecs_to_jiffies(500));
> +	xfs_flush_pages(ip, 0, -1, 0, FI_NONE);
>  }
>  
>  /*
> @@ -450,7 +443,8 @@ xfs_flush_device_work(
>  	void		*arg)
>  {
>  	struct inode	*inode = arg;
> -	sync_blockdev(mp->m_super->s_bdev);
> +	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_TRYLOCK);
> +	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_TRYLOCK | SYNC_IOWAIT);
>  	iput(inode);
>  }
>  
> @@ -459,10 +453,11 @@ xfs_flush_device(
>  	xfs_inode_t	*ip)
>  {
>  	struct inode	*inode = VFS_I(ip);
> +	DECLARE_COMPLETION_ONSTACK(completion);
>  
>  	igrab(inode);
> -	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_device_work);
> -	delay(msecs_to_jiffies(500));
> +	xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_device_work, &completion);
> +	wait_for_completion(&completion);
>  	xfs_log_force(ip->i_mount, (xfs_lsn_t)0, XFS_LOG_FORCE|XFS_LOG_SYNC);
>  }
>  
> @@ -497,7 +492,7 @@ xfssyncd(
>  {
>  	struct xfs_mount	*mp = arg;
>  	long			timeleft;
> -	bhv_vfs_sync_work_t	*work, *n;
> +	xfs_sync_work_t		*work, *n;
>  	LIST_HEAD		(tmp);
>  
>  	set_freezable();
> @@ -530,6 +525,8 @@ xfssyncd(
>  		list_for_each_entry_safe(work, n, &tmp, w_list) {
>  			(*work->w_syncer)(mp, work->w_data);
>  			list_del(&work->w_list);
> +			if (work->w_completion)
> +				complete(work->w_completion);
>  			if (work == &mp->m_sync_work)
>  				continue;
>  			kmem_free(work);
> @@ -545,6 +542,7 @@ xfs_syncd_init(
>  {
>  	mp->m_sync_work.w_syncer = xfs_sync_worker;
>  	mp->m_sync_work.w_mount = mp;
> +	mp->m_sync_work.w_completion = NULL;
>  	mp->m_sync_task = kthread_run(xfssyncd, mp, "xfssyncd");
>  	if (IS_ERR(mp->m_sync_task))
>  		return -PTR_ERR(mp->m_sync_task);
> diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
> index 5f6de1e..de87a7f 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.h
> +++ b/fs/xfs/linux-2.6/xfs_sync.h
> @@ -20,18 +20,20 @@
>  
>  struct xfs_mount;
>  
> -typedef struct bhv_vfs_sync_work {
> +typedef struct xfs_sync_work {
>  	struct list_head	w_list;
>  	struct xfs_mount	*w_mount;
>  	void			*w_data;	/* syncer routine argument */
>  	void			(*w_syncer)(struct xfs_mount *, void *);
> -} bhv_vfs_sync_work_t;
> +	struct completion	*w_completion;
> +} xfs_sync_work_t;
>  
>  #define SYNC_ATTR		0x0001	/* sync attributes */
>  #define SYNC_DELWRI		0x0002	/* look at delayed writes */
>  #define SYNC_WAIT		0x0004	/* wait for i/o to complete */
>  #define SYNC_BDFLUSH		0x0008	/* BDFLUSH is calling -- don't block */
>  #define SYNC_IOWAIT		0x0010  /* wait for all I/O to complete */
> +#define SYNC_TRYLOCK		0x0020  /* only try to lock inodes */
>  
>  int xfs_syncd_init(struct xfs_mount *mp);
>  void xfs_syncd_stop(struct xfs_mount *mp);
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index f5e9937..775a2ac 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -322,7 +322,7 @@ typedef struct xfs_mount {
>  #endif
>  	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
>  	struct task_struct	*m_sync_task;	/* generalised sync thread */
> -	bhv_vfs_sync_work_t	m_sync_work;	/* work item for VFS_SYNC */
> +	xfs_sync_work_t		m_sync_work;	/* work item for VFS_SYNC */
>  	struct list_head	m_sync_list;	/* sync thread work item list */
>  	spinlock_t		m_sync_lock;	/* work item list lock */
>  	int			m_sync_seq;	/* sync thread generation no. */
> diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
> index 0e55c5d..e1099e7 100644
> --- a/fs/xfs/xfs_vnodeops.c
> +++ b/fs/xfs/xfs_vnodeops.c
> @@ -1449,6 +1449,12 @@ xfs_create(
>  	error = xfs_trans_reserve(tp, resblks, XFS_CREATE_LOG_RES(mp), 0,
>  			XFS_TRANS_PERM_LOG_RES, XFS_CREATE_LOG_COUNT);
>  	if (error == ENOSPC) {
> +		/* flush delalloc blocks and retry */
> +		xfs_flush_device(dp);
> +		error = xfs_trans_reserve(tp, resblks, XFS_CREATE_LOG_RES(mp), 0,
> +			XFS_TRANS_PERM_LOG_RES, XFS_CREATE_LOG_COUNT);
> +	}
> +	if (error == ENOSPC) {
>  		resblks = 0;
>  		error = xfs_trans_reserve(tp, 0, XFS_CREATE_LOG_RES(mp), 0,
>  				XFS_TRANS_PERM_LOG_RES, XFS_CREATE_LOG_COUNT);
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: spurious -ENOSPC on XFS
  2009-02-05  4:31                               ` Mikulas Patocka
@ 2009-02-05  7:43                                 ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2009-02-05  7:43 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Christoph Hellwig, xfs, linux-kernel

On Wed, Feb 04, 2009 at 11:31:25PM -0500, Mikulas Patocka wrote:
> > > ... and if you turn it into trylock, what are you going to do with the 
> > > inode that is just being written to? You should definitely flush it, but 
> > > trylock will skip it because it's already locked.
> > 
> > We've already flushed it directly. You disabled that code fearing
> > deadlocks. I've made it synchronous (i.e. not handed off to
> > xfssyncd) because the flush path requires us to hold the lock we are
> > already holding....
> 
> This is not "fearing deadlocks". This was getting a real deadlock:

<sigh>

Thank you for *finally* telling me exactly what the deadlock is that
you've been handwaving about for the last week. It's not a VFS
deadlock, nor is it an inode lock deadlock - it's a page lock deadlock.

Perhaps next time you will post the stack trace instead of vaguely
describing a deadlock so you don't waste several hours of another
developer's time looking for deadlocks in all the wrong places?

> This one was obtained on a machine with 4k filesystem blocks, 8k pages and 
> dd bs=1 on a nearly full filesystem.

That's helpful, too. I can write a test case to exercise that.

So, now I understand why you were suggesting going all the way back up
to the top of the IO path and flushing from there - so we don't hold
a page lock.

Perhaps we should just cull the direct inode flush completely.
If that inode has any significant delayed allocation space on it,
then the only reason it gets to an ENOSPC is that is has converted
all the speculative preallocation that it already has reserved
and is trying to allocate new space. Hence flushing it will not
return any extra space.

Hmmmmm - given that we hold the iolock exclusively, the trylock I
added into xfs_sync_inodes_ag() will fail on the inode we currently
hold page locks on (tries to get iolock shared) so that should avoid
deadlock on the page we currently hold locked.  Can you remove the
direct inode flush and just run with the modified device flush to see
if that triggers the deadlock you've been seeing?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2009-02-05  7:44 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-12 11:14 spurious -ENOSPC on XFS Mikulas Patocka
2009-01-12 15:11 ` Christoph Hellwig
2009-01-13  5:58   ` Lachlan McIlroy
2009-01-14 22:16     ` Dave Chinner
2009-01-15  0:57       ` Lachlan McIlroy
2009-01-15  8:47         ` Dave Chinner
2009-01-13 21:49 ` Dave Chinner
2009-01-14  4:28   ` Mikulas Patocka
2009-01-18 17:31     ` Christoph Hellwig
2009-01-20 19:38       ` Mikulas Patocka
2009-01-20 23:24         ` Dave Chinner
2009-01-22 20:59           ` Christoph Hellwig
2009-01-22 22:43             ` Christoph Hellwig
2009-01-23 20:14               ` Mikulas Patocka
2009-01-24  7:12                 ` Dave Chinner
2009-01-29 16:39                   ` Mikulas Patocka
2009-01-29 16:45                     ` Mikulas Patocka
2009-01-31 23:57                     ` Dave Chinner
2009-02-02 17:36                       ` Mikulas Patocka
2009-02-03  3:27                         ` Dave Chinner
2009-02-03 20:05                           ` Mikulas Patocka
2009-02-04 12:08                             ` Dave Chinner
2009-02-05  4:31                               ` Mikulas Patocka
2009-02-05  7:43                                 ` Dave Chinner
