All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH, RFC] ext4: flush delalloc blocks when space is low
@ 2009-10-20 20:59 Eric Sandeen
  2009-10-21  2:37 ` Eric Sandeen
  2009-10-21 19:51 ` [PATCH, RFC V2] " Eric Sandeen
  0 siblings, 2 replies; 6+ messages in thread
From: Eric Sandeen @ 2009-10-20 20:59 UTC (permalink / raw)
  To: ext4 development

Creating many small files in rapid succession on a small
filesystem can lead to spurious ENOSPC; on a 104MB filesystem:

for i in `seq 1 22500`; do
    echo -n > $SCRATCH_MNT/$i
    echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
done

leads to ENOSPC even though after a sync, 40% of the fs is free
again.

This is because we reserve worst-case metadata for delalloc writes,
and when data is allocated that worst-case reservation was not
needed.

I've added 2 flushers here:

 * when free space is low compared to dirty blocks, do an async flush
 * when we get a hard ENOSPC, do a sync flush before retry

This resolves the testcase for me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1d04189..63519fc 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -605,6 +605,17 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
  */
 int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 {
+	/* try a sync to flush delalloc space & free resvd metadata */
+	if (test_opt(sb, DELALLOC) &&
+	    *retries == 0 &&
+	    !ext4_has_free_blocks(EXT4_SB(sb), 1)) {
+		down_read(&sb->s_umount);
+		sync_inodes_sb(sb);
+		up_read(&sb->s_umount);
+		(*retries)++;
+		return 1;
+	}
+
 	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
 	    (*retries)++ > 3 ||
 	    !EXT4_SB(sb)->s_journal)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c5bc5d..27c8b9b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3024,11 +3024,18 @@ static int ext4_nonda_switch(struct super_block *sb)
 	if (2 * free_blocks < 3 * dirty_blocks ||
 		free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
 		/*
-		 * free block count is less that 150% of dirty blocks
-		 * or free blocks is less that watermark
+		 * free block count is less than 150% of dirty blocks
+		 * or free blocks is less than watermark
 		 */
 		return 1;
 	}
+	/*
+	 * Even if we don't switch but are nearing capacity,
+	 * start pushing delalloc when 1/2 of free blocks are dirty.
+	 */
+	if (free_blocks < 2 * dirty_blocks)
+		writeback_inodes_sb(sb);
+
 	return 0;
 }
 



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH, RFC] ext4: flush delalloc blocks when space is low
  2009-10-20 20:59 [PATCH, RFC] ext4: flush delalloc blocks when space is low Eric Sandeen
@ 2009-10-21  2:37 ` Eric Sandeen
  2009-10-21 19:51 ` [PATCH, RFC V2] " Eric Sandeen
  1 sibling, 0 replies; 6+ messages in thread
From: Eric Sandeen @ 2009-10-21  2:37 UTC (permalink / raw)
  To: ext4 development

Eric Sandeen wrote:
> Creating many small files in rapid succession on a small
> filesystem can lead to spurious ENOSPC; on a 104MB filesystem:
> 
> for i in `seq 1 22500`; do
>    echo -n > $SCRATCH_MNT/$i
>    echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
> done
> 
> leads to ENOSPC even though after a sync, 40% of the fs is free
> again.
> 
> This is because we reserve worst-case metadata for delalloc writes,
> and when data is allocated that worst-case reservation was not
> needed.
> 
> I've added 2 flushers here:
> 
> * when free space is low compared to dirty blocks, do an async flush
> * when we get a hard ENOSPC, do a sync flush before retry
> 
> This resolves the testcase for me.

argh, but fails xfstests 083 at least:

# Exercise filesystem full behaviour - run numerous fsstress
# processes in write mode on a small filesystem.  NB: delayed
# allocate flushing is quite deadlock prone at the filesystem
# full boundary due to the fact that we will retry allocation
# several times after flushing, before giving back ENOSPC.

and indeed I deadlock ;)  so don't merge this yet!

-Eric

> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
> 
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 1d04189..63519fc 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -605,6 +605,17 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
>  */
> int ext4_should_retry_alloc(struct super_block *sb, int *retries)
> {
> +    /* try a sync to flush delalloc space & free resvd metadata */
> +    if (test_opt(sb, DELALLOC) &&
> +        *retries == 0 &&
> +        !ext4_has_free_blocks(EXT4_SB(sb), 1)) {
> +        down_read(&sb->s_umount);
> +        sync_inodes_sb(sb);
> +        up_read(&sb->s_umount);
> +        (*retries)++;
> +        return 1;
> +    }
> +
>     if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
>         (*retries)++ > 3 ||
>         !EXT4_SB(sb)->s_journal)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5c5bc5d..27c8b9b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3024,11 +3024,18 @@ static int ext4_nonda_switch(struct super_block 
> *sb)
>     if (2 * free_blocks < 3 * dirty_blocks ||
>         free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
>         /*
> -         * free block count is less that 150% of dirty blocks
> -         * or free blocks is less that watermark
> +         * free block count is less than 150% of dirty blocks
> +         * or free blocks is less than watermark
>          */
>         return 1;
>     }
> +    /*
> +     * Even if we don't switch but are nearing capacity,
> +     * start pushing delalloc when 1/2 of free blocks are dirty.
> +     */
> +    if (free_blocks < 2 * dirty_blocks)
> +        writeback_inodes_sb(sb);
> +
>     return 0;
> }
> 
> 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH, RFC V2] ext4: flush delalloc blocks when space is low
  2009-10-20 20:59 [PATCH, RFC] ext4: flush delalloc blocks when space is low Eric Sandeen
  2009-10-21  2:37 ` Eric Sandeen
@ 2009-10-21 19:51 ` Eric Sandeen
  2009-11-05 14:09   ` Jan Kara
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Sandeen @ 2009-10-21 19:51 UTC (permalink / raw)
  To: ext4 development

Creating many small files in rapid succession on a small
filesystem can lead to spurious ENOSPC; on a 104MB filesystem:

for i in `seq 1 22500`; do
    echo -n > $SCRATCH_MNT/$i
    echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
done

leads to ENOSPC even though after a sync, 40% of the fs is free
again.

This is because we reserve worst-case metadata for delalloc writes,
and when data is allocated that worst-case reservation was not
needed.

I've added 2 flushers here:

 * when free space is low compared to dirty blocks, do an async flush
 * when we get a hard ENOSPC, do a sync flush before retry

This resolves the testcase for me, and survives all 4 generic
ENOSPC tests in xfstests.

V2: don't try to sync if we're still in a (probably nested) transaction.

Thanks to Josef for pointing out that possibility.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1d04189..28bde58 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -605,11 +605,27 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
  */
 int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 {
-	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
+	s64 dirtyblocks = 0;
+	struct percpu_counter *dbc = &EXT4_SB(sb)->s_dirtyblocks_counter;
+
+	if (test_opt(sb, DELALLOC))
+		dirtyblocks = percpu_counter_read_positive(dbc);
+
+	if ((!ext4_has_free_blocks(EXT4_SB(sb), 1) && !dirtyblocks) ||
 	    (*retries)++ > 3 ||
 	    !EXT4_SB(sb)->s_journal)
 		return 0;
 
+	/* try a sync to flush delalloc space & free resvd metadata */
+	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
+		if (!ext4_journal_current_handle()) {
+			down_read(&sb->s_umount);
+			sync_inodes_sb(sb);
+			up_read(&sb->s_umount);
+			return 1;
+		}
+	}
+
 	jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
 
 	return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c5bc5d..27c8b9b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3024,11 +3024,18 @@ static int ext4_nonda_switch(struct super_block *sb)
 	if (2 * free_blocks < 3 * dirty_blocks ||
 		free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
 		/*
-		 * free block count is less that 150% of dirty blocks
-		 * or free blocks is less that watermark
+		 * free block count is less than 150% of dirty blocks
+		 * or free blocks is less than watermark
 		 */
 		return 1;
 	}
+	/*
+	 * Even if we don't switch but are nearing capacity,
+	 * start pushing delalloc when 1/2 of free blocks are dirty.
+	 */
+	if (free_blocks < 2 * dirty_blocks)
+		writeback_inodes_sb(sb);
+
 	return 0;
 }
 



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH, RFC V2] ext4: flush delalloc blocks when space is low
  2009-10-21 19:51 ` [PATCH, RFC V2] " Eric Sandeen
@ 2009-11-05 14:09   ` Jan Kara
  2009-11-05 15:45     ` Eric Sandeen
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2009-11-05 14:09 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: ext4 development

> Creating many small files in rapid succession on a small
> filesystem can lead to spurious ENOSPC; on a 104MB filesystem:
> 
> for i in `seq 1 22500`; do
>     echo -n > $SCRATCH_MNT/$i
>     echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
> done
> 
> leads to ENOSPC even though after a sync, 40% of the fs is free
> again.
> 
> This is because we reserve worst-case metadata for delalloc writes,
> and when data is allocated that worst-case reservation was not
> needed.
> 
> I've added 2 flushers here:
> 
>  * when free space is low compared to dirty blocks, do an async flush
>  * when we get a hard ENOSPC, do a sync flush before retry
> 
> This resolves the testcase for me, and survives all 4 generic
> ENOSPC tests in xfstests.
> 
> V2: don't try to sync if we're still in a (probably nested) transaction.
> 
> Thanks to Josef for pointing out that possibility.
  I still think it's deadlockable... See below.

> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 1d04189..28bde58 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -605,11 +605,27 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
>   */
>  int ext4_should_retry_alloc(struct super_block *sb, int *retries)
>  {
> -	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
> +	s64 dirtyblocks = 0;
> +	struct percpu_counter *dbc = &EXT4_SB(sb)->s_dirtyblocks_counter;
> +
> +	if (test_opt(sb, DELALLOC))
> +		dirtyblocks = percpu_counter_read_positive(dbc);
> +
> +	if ((!ext4_has_free_blocks(EXT4_SB(sb), 1) && !dirtyblocks) ||
>  	    (*retries)++ > 3 ||
>  	    !EXT4_SB(sb)->s_journal)
>  		return 0;
>  
> +	/* try a sync to flush delalloc space & free resvd metadata */
> +	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
> +		if (!ext4_journal_current_handle()) {
> +			down_read(&sb->s_umount);
> +			sync_inodes_sb(sb);
> +			up_read(&sb->s_umount);
  ext4_should_retry_alloc() is called quite deep from the filesystem. In
particular we can hold i_mutex of some inodes etc. So I'd almost bet
that taking s_umount sem here violates lock ranking in some code paths
(an easy check would be to enable lockdep and stress the filesystem a
bit).
  Also calling sync_inodes_sb() with i_mutex held just seems as a bad
thing to do although I don't see where it could deadlock and so it's
probably just a matter of taste...
  If we start writeback from ext4_nonda_switch as you do below, I think
that we should get decent results even without synchronous writeback in
the allocation path (maybe we'd need to tweak a bit the logic in
ext4_nonda_switch to provide more time for writeback thread to catchup).

								Honza

> +			return 1;
> +		}
> +	}
> +
>  	jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
>  
>  	return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5c5bc5d..27c8b9b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3024,11 +3024,18 @@ static int ext4_nonda_switch(struct super_block *sb)
>  	if (2 * free_blocks < 3 * dirty_blocks ||
>  		free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
>  		/*
> -		 * free block count is less that 150% of dirty blocks
> -		 * or free blocks is less that watermark
> +		 * free block count is less than 150% of dirty blocks
> +		 * or free blocks is less than watermark
>  		 */
>  		return 1;
>  	}
> +	/*
> +	 * Even if we don't switch but are nearing capacity,
> +	 * start pushing delalloc when 1/2 of free blocks are dirty.
> +	 */
> +	if (free_blocks < 2 * dirty_blocks)
> +		writeback_inodes_sb(sb);
> +
>  	return 0;
>  }
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH, RFC V2] ext4: flush delalloc blocks when space is low
  2009-11-05 14:09   ` Jan Kara
@ 2009-11-05 15:45     ` Eric Sandeen
  2009-11-05 16:05       ` Jan Kara
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Sandeen @ 2009-11-05 15:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: ext4 development

Jan Kara wrote:
...

>> +	/* try a sync to flush delalloc space & free resvd metadata */
>> +	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
>> +		if (!ext4_journal_current_handle()) {
>> +			down_read(&sb->s_umount);
>> +			sync_inodes_sb(sb);
>> +			up_read(&sb->s_umount);
>   ext4_should_retry_alloc() is called quite deep from the filesystem. In
> particular we can hold i_mutex of some inodes etc. So I'd almost bet
> that taking s_umount sem here violates lock ranking in some code paths
> (an easy check would be to enable lockdep and stress the filesystem a
> bit).
>   Also calling sync_inodes_sb() with i_mutex held just seems as a bad
> thing to do although I don't see where it could deadlock and so it's
> probably just a matter of taste...

Well, to be honest I agree with you ;)  It does still feel like a hack.

>   If we start writeback from ext4_nonda_switch as you do below, I think
> that we should get decent results even without synchronous writeback in
> the allocation path (maybe we'd need to tweak a bit the logic in
> ext4_nonda_switch to provide more time for writeback thread to catchup).

I think starting writeback helps a lot, but it seems that in the end we
still need a synchronous attempt when we hit a real enocpc... after I
finish dealing with this corruption thing I'll come back and look at this.

Maybe we should put the writeback in for now, and worry about the
synchronous sync-up later?

Thanks for the review,

-Eric

> 								Honza

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH, RFC V2] ext4: flush delalloc blocks when space is low
  2009-11-05 15:45     ` Eric Sandeen
@ 2009-11-05 16:05       ` Jan Kara
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2009-11-05 16:05 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Jan Kara, ext4 development

> Jan Kara wrote:
> ...
> 
> >> +	/* try a sync to flush delalloc space & free resvd metadata */
> >> +	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
> >> +		if (!ext4_journal_current_handle()) {
> >> +			down_read(&sb->s_umount);
> >> +			sync_inodes_sb(sb);
> >> +			up_read(&sb->s_umount);
> >   ext4_should_retry_alloc() is called quite deep from the filesystem. In
> > particular we can hold i_mutex of some inodes etc. So I'd almost bet
> > that taking s_umount sem here violates lock ranking in some code paths
> > (an easy check would be to enable lockdep and stress the filesystem a
> > bit).
> >   Also calling sync_inodes_sb() with i_mutex held just seems as a bad
> > thing to do although I don't see where it could deadlock and so it's
> > probably just a matter of taste...
> 
> Well, to be honest I agree with you ;)  It does still feel like a hack.
> 
> >   If we start writeback from ext4_nonda_switch as you do below, I think
> > that we should get decent results even without synchronous writeback in
> > the allocation path (maybe we'd need to tweak a bit the logic in
> > ext4_nonda_switch to provide more time for writeback thread to catchup).
> 
> I think starting writeback helps a lot, but it seems that in the end we
> still need a synchronous attempt when we hit a real enocpc... after I
> finish dealing with this corruption thing I'll come back and look at this.
  Without the synchronous attempt, it will never be perfect, that is
correct. But it could be quite close to perfect...

> Maybe we should put the writeback in for now, and worry about the
> synchronous sync-up later?
  Yes, I'd do that for now.

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-11-05 16:05 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-20 20:59 [PATCH, RFC] ext4: flush delalloc blocks when space is low Eric Sandeen
2009-10-21  2:37 ` Eric Sandeen
2009-10-21 19:51 ` [PATCH, RFC V2] " Eric Sandeen
2009-11-05 14:09   ` Jan Kara
2009-11-05 15:45     ` Eric Sandeen
2009-11-05 16:05       ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.