* [PATCH] avoid scanning bitmaps for group preallocation
@ 2010-03-22 22:03 Andreas Dilger
  2010-03-26 10:28 ` Aneesh Kumar K. V
  0 siblings, 1 reply; 3+ messages in thread
From: Andreas Dilger @ 2010-03-22 22:03 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o

[-- Attachment #1: Type: text/plain, Size: 1215 bytes --]

Here is the patch I mentioned today on the call.  It avoids (or at  
least reduces) serious latency (10 minutes or more) on a large  
filesystem (8TB+) on the first write, if the filesystem is nearly  
full.  The latency is entirely due to seeking to read the block  
bitmaps, so it is considerably less serious on flex_bg formatted
filesystems.
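
As a rough back-of-the-envelope check on those numbers, the sketch below
estimates the worst-case scan time if every block bitmap costs one seek.
The 4KB block size (hence 32768 blocks, or 128MB, per group) and the 10ms
seek time are assumptions chosen for illustration, not measurements from
this report, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Number of block groups, assuming one bitmap block per group, so each
 * group covers 8 * block_size blocks (the usual ext4 default). */
static uint64_t nr_groups(uint64_t fs_bytes, uint32_t block_size)
{
	uint64_t group_bytes = (uint64_t)block_size * 8 * block_size;

	return (fs_bytes + group_bytes - 1) / group_bytes;
}

/* Worst-case time in milliseconds to read every block bitmap, assuming
 * one seek of seek_ms per bitmap (no flex_bg, nothing cached). */
static uint64_t bitmap_scan_ms(uint64_t fs_bytes, uint32_t block_size,
			       uint32_t seek_ms)
{
	return nr_groups(fs_bytes, block_size) * seek_ms;
}
```

Under these assumptions an 8TB filesystem has 65536 groups, and
bitmap_scan_ms(8ULL << 40, 4096, 10) gives 655360ms, or roughly 11
minutes, which is in line with the latency described above.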

A better long-term approach would be to store in the superblock the  
last group that had space to allocate a stripe-sized chunk and/or flag  
in the group descriptor if there is not a large amount of contiguous  
free space therein (cleared on freeing blocks in the group).
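
A minimal sketch of that idea follows. Everything in it is hypothetical:
struct sb_hint, struct group_hint and NO_LARGE_FREE are illustrative
stand-ins, not existing ext4 structures or flags, and the free-block test
is a simplification of what a real allocator would check.

```c
#include <stdint.h>

#define NO_LARGE_FREE 0x1	/* hypothetical "no large free extent" flag */

struct group_hint {		/* hypothetical stand-in for a group descriptor */
	uint32_t free_blocks;
	uint32_t flags;
};

struct sb_hint {		/* hypothetical stand-in for superblock state */
	uint32_t last_good_group;	/* last group with a stripe-sized chunk */
	uint32_t ngroups;
};

/* Pick the group to start scanning from: begin at the superblock hint and
 * skip groups flagged as lacking large free extents, without touching any
 * bitmap.  Returns sb->ngroups if no candidate exists. */
static uint32_t pick_start_group(const struct sb_hint *sb,
				 const struct group_hint *g)
{
	for (uint32_t i = 0; i < sb->ngroups; i++) {
		uint32_t grp = (sb->last_good_group + i) % sb->ngroups;

		if (!(g[grp].flags & NO_LARGE_FREE) && g[grp].free_blocks > 0)
			return grp;
	}
	return sb->ngroups;
}

/* Freeing blocks may create a large extent again, so clear the flag. */
static void note_blocks_freed(struct group_hint *g, uint32_t count)
{
	g->free_blocks += count;
	g->flags &= ~(uint32_t)NO_LARGE_FREE;
}
```

The flag is only a negative hint: a cleared flag does not promise a large
extent, it just means the group is worth loading.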

Having the mount-time buddy-bitmap (and checksum verifying) scanning  
thread start at mount would only help if the first write to the  
filesystem is not immediately after mount (which it is in Lustre at  
least).  Having a filesystem-wide (r)btree for the freespace (ala XFS)  
would also only help if the btree could be (at least partially) built  
from bitmaps before the first write, unless we cache the bitmap on  
disk, which caused Lustre plenty of problems in the past, so I'm leery of doing it.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


[-- Attachment #2: ext4-mballoc-skip.diff --]
[-- Type: application/octet-stream, Size: 2517 bytes --]

Reduce the size of group preallocation chunks to a single RAID
stripe, or 1MB if that is not specified.  Since it is likely that
small files will be read back in an unrelated order anyway, the
main benefit of aggregation is to avoid read-modify-write, which
is still satisfied by the smaller default size.

Also skip reading of block bitmaps if there is absolutely no chance
of finding a better extent in that group, because the number of
free blocks is less than the number of blocks in the best extent
found so far.  A better decision can be made after the bitmap is
loaded, but in a large filesystem there can be tens of thousands
of groups, and reading them all in can take minutes if the filesystem
is nearly full.

Signed-off-by: Andreas Dilger <adilger@sun.com>

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 54df209..6038fad 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -125,8 +125,7 @@
  * list. In case of inode preallocation we follow a list of heuristics
  * based on file size. This can be found in ext4_mb_normalize_request. If
  * we are doing a group prealloc we try to normalize the request to
- * sbi->s_mb_group_prealloc. Default value of s_mb_group_prealloc is
- * 512 blocks. This can be tuned via
+ * sbi->s_mb_group_prealloc.  This can be tuned via
  * /sys/fs/ext4/<partition/mb_group_prealloc. The value is represented in
  * terms of number of blocks. If we have mounted the file system with -O
  * stripe=<value> option the group prealloc request is normalized to the
@@ -2029,9 +2028,12 @@ repeat:
 			if (group == ngroups)
 				group = 0;
 
-			/* quick check to skip empty groups */
+			/* If there's no chance that this group has a better
+			 * extent, just skip it instead of seeking to read
+			 * block bitmap from disk. Initially ac_b_ex.fe_len = 0,
+			 * so this always skips groups with no free space. */
 			grp = ext4_get_group_info(sb, group);
-			if (grp->bb_free == 0)
+			if (grp->bb_free <= ac->ac_b_ex.fe_len)
 				continue;
 
 			err = ext4_mb_load_buddy(sb, group, &e4b);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index b619322..df516c8 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -90,9 +90,9 @@ extern u8 mb_enable_debug;
 #define MB_DEFAULT_ORDER2_REQS		2
 
 /*
- * default group prealloc size 512 blocks
+ * default group prealloc size in blocks
  */
-#define MB_DEFAULT_GROUP_PREALLOC	512
+#define MB_DEFAULT_GROUP_PREALLOC	256
 
 
 struct ext4_free_data {


* Re: [PATCH] avoid scanning bitmaps for group preallocation
  2010-03-22 22:03 [PATCH] avoid scanning bitmaps for group preallocation Andreas Dilger
@ 2010-03-26 10:28 ` Aneesh Kumar K. V
  2010-03-26 17:58   ` Andreas Dilger
  0 siblings, 1 reply; 3+ messages in thread
From: Aneesh Kumar K. V @ 2010-03-26 10:28 UTC (permalink / raw)
  To: Andreas Dilger, Ext4 Developers List; +Cc: Theodore Ts'o

On Mon, 22 Mar 2010 16:03:10 -0600, Andreas Dilger <adilger@sun.com> wrote:
> Here is the patch I mentioned today on the call.  It avoids (or at  
> least reduces) serious latency (10 minutes or more) on a large  
> filesystem (8TB+) on the first write, if the filesystem is nearly  
> full.  The latency is entirely due to seeking to read the block  
> bitmaps, so is considerably less serious on flex_bg formatted  
> filesystems.
> 
> A better long-term approach would be to store in the superblock the  
> last group that had space to allocate a stripe-sized chunk and/or flag  
> in the group descriptor if there is not a large amount of contiguous  
> free space therein (cleared on freeing blocks in the group).
> 
> Having the mount-time buddy-bitmap (and checksum verifying) scanning  
> thread start at mount would only help if the first write to the  
> filesystem is not immediately after mount (which it is in Lustre at  
> least).  Having a filesystem-wide (r)btree for the freespace (ala XFS)  
> would also only help if the btree could be (at least partially) built  
> from bitmaps before the first write, unless we cache the bitmap on  
> disk, which caused Lustre plenty in the past and I'm leery to do it.
> 
> 

@@ -125,8 +125,7 @@
  * list. In case of inode preallocation we follow a list of heuristics
  * based on file size. This can be found in ext4_mb_normalize_request. If
  * we are doing a group prealloc we try to normalize the request to
- * sbi->s_mb_group_prealloc. Default value of s_mb_group_prealloc is
- * 512 blocks. This can be tuned via
+ * sbi->s_mb_group_prealloc.  This can be tuned via
  * /sys/fs/ext4/<partition/mb_group_prealloc. The value is represented in
  * terms of number of blocks. If we have mounted the file system with -O
  * stripe=<value> option the group prealloc request is normalized to the
@@ -2029,9 +2028,12 @@ repeat:
			if (group == ngroups)
				group = 0;
 
-			/* quick check to skip empty groups */
+			/* If there's no chance that this group has a better
+			 * extent, just skip it instead of seeking to read
+			 * block bitmap from disk. Initially ac_b_ex.fe_len = 0,
+			 * so this always skips groups with no free space. */
			grp = ext4_get_group_info(sb, group);
-			if (grp->bb_free == 0)
+			if (grp->bb_free <= ac->ac_b_ex.fe_len)
				continue;
 
			err = ext4_mb_load_buddy(sb, group, &e4b);

I was wondering whether we need to make sure we also use the criteria
value when checking bb_free. If we are really low on space we may want
to return what is left, right? Or does ac_b_ex take care of that?

-aneesh


* Re: [PATCH] avoid scanning bitmaps for group preallocation
  2010-03-26 10:28 ` Aneesh Kumar K. V
@ 2010-03-26 17:58   ` Andreas Dilger
  0 siblings, 0 replies; 3+ messages in thread
From: Andreas Dilger @ 2010-03-26 17:58 UTC (permalink / raw)
  To: Aneesh Kumar K. V; +Cc: Ext4 Developers List, Theodore Ts'o

On 2010-03-26, at 04:28, Aneesh Kumar K. V wrote:
> On Mon, 22 Mar 2010 16:03:10 -0600, Andreas Dilger <adilger@sun.com> wrote:
>> Here is the patch I mentioned today on the call.  It avoids (or at
>> least reduces) serious latency (10 minutes or more) on a large
>> filesystem (8TB+) on the first write, if the filesystem is nearly
>> full.  The latency is entirely due to seeking to read the block
>> bitmaps, so is considerably less serious on flex_bg formatted
>> filesystems.
>
> @@ -2029,9 +2028,12 @@ repeat:
> +		/* If there's no chance that this group has a better
> +		 * extent, just skip it instead of seeking to read
> +		 * block bitmap from disk. Initially ac_b_ex.fe_len = 0,
> +		 * so this always skips groups with no free space. */
>  		grp = ext4_get_group_info(sb, group);
> -		if (grp->bb_free == 0)
> +		if (grp->bb_free <= ac->ac_b_ex.fe_len)
>  			continue;
>
> I was wondering whether we need to make sure we also use the criteria
> value when checking bb_free. If we are really low on space we may want
> to return what is left, right? Or does ac_b_ex take care of that?


ac_b_ex is the best currently ALLOCATED extent, so mballoc wouldn't  
ever select an extent that is smaller than ac_b_ex.fe_len.  That means  
it is pointless to even look at a group which has fewer free blocks  
than ac_b_ex.fe_len.

Later, after the group information is loaded, ext4_mb_good_group()
(ldiskfs_mb_good_group() in Lustre's ldiskfs) will skip the group if
the average fragment size is smaller than the
GOAL extent, but only for certain criterion levels.  At the highest  
criterion, any group with free blocks will be scanned.
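
To illustrate the effect of the pre-check, here is a minimal sketch that
counts how many groups would still need their bitmap loaded.  It is a
simplified model, not the mballoc code: groups_to_load() is a hypothetical
helper, bb_free[] stands in for the in-memory per-group free counts, and
best_len plays the role of ac_b_ex.fe_len but is held fixed for the whole
scan instead of growing as better extents are found.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the groups whose bitmap would have to be read from disk.
 * A group is skipped when it cannot possibly hold an extent longer than
 * best_len, because it has no more than best_len free blocks in total. */
static size_t groups_to_load(const uint32_t *bb_free, size_t ngroups,
			     uint32_t best_len)
{
	size_t loads = 0;

	for (size_t i = 0; i < ngroups; i++) {
		if (bb_free[i] <= best_len)
			continue;	/* no chance of a better extent */
		loads++;
	}
	return loads;
}
```

With best_len == 0 this degenerates to the old "skip empty groups" test,
so the new check never loads more bitmaps than the old one did.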

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

