[PATCH] avoid scanning bitmaps for group preallocation

* [PATCH] avoid scanning bitmaps for group preallocation
@ 2010-03-22 22:03 Andreas Dilger
  2010-03-26 10:28 ` Aneesh Kumar K. V
  0 siblings, 1 reply; 3+ messages in thread
From: Andreas Dilger @ 2010-03-22 22:03 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

[-- Attachment #1: Type: text/plain, Size: 1215 bytes --]

Here is the patch I mentioned today on the call.  It avoids (or at
least reduces) serious latency (10 minutes or more) on the first
write to a large (8TB+) filesystem that is nearly full.  The latency
is entirely due to seeking to read the block bitmaps, so it is
considerably less serious on flex_bg formatted filesystems, where the
bitmaps of neighbouring groups are packed together on disk.
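
For illustration only (this is not part of the attached patch), the
skip check can be modelled in userspace roughly as below.  The struct
and function names are made up for the example; only bb_free and
ac_b_ex.fe_len, mentioned in the comments, come from the patch itself.
The point is that the per-group free block count is already in memory,
so comparing it against the best extent found so far costs no I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the in-memory per-group summary. */
struct group_summary {
	unsigned int free_blocks;	/* models grp->bb_free */
};

/*
 * Count the groups whose block bitmap would still have to be read,
 * given the length of the best extent found so far (this models
 * ac->ac_b_ex.fe_len).  A group with no more free blocks than that
 * cannot contain a longer extent, so it is skipped without any I/O.
 */
static size_t groups_to_read(const struct group_summary *groups,
			     size_t ngroups, unsigned int best_len)
{
	size_t i, to_read = 0;

	assert(groups != NULL || ngroups == 0);

	for (i = 0; i < ngroups; i++) {
		if (groups[i].free_blocks <= best_len)
			continue;	/* no chance of a better extent */
		to_read++;
	}
	return to_read;
}
```

With best_len == 0 this degenerates to the old "skip empty groups"
check; as soon as some extent has been found, every group with fewer
free blocks than that extent drops out of the scan as well.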

A better long-term approach would be to store in the superblock the
last group that had space to allocate a stripe-sized chunk, and/or to
set a flag in the group descriptor when the group has no large amount
of contiguous free space (cleared when blocks are freed in that
group).
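
A rough userspace sketch of that idea, purely to make the bookkeeping
concrete.  Everything here is hypothetical: GRP_NO_LARGE_FREE is not
an existing ext4 flag, and the structs only stand in for the
superblock and group descriptors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag: "no stripe-sized free chunk was found here". */
#define GRP_NO_LARGE_FREE	0x1u

struct group_hint {
	unsigned int flags;		/* would live in the group descriptor */
};

struct fs_hint {
	size_t last_good_group;		/* would live in the superblock */
	size_t ngroups;
	struct group_hint *groups;
};

/* A bitmap scan of this group found no stripe-sized chunk. */
static void hint_mark_scan_failed(struct fs_hint *fs, size_t group)
{
	assert(group < fs->ngroups);
	fs->groups[group].flags |= GRP_NO_LARGE_FREE;
}

/* Blocks were freed in this group, so it is worth scanning again. */
static void hint_blocks_freed(struct fs_hint *fs, size_t group)
{
	assert(group < fs->ngroups);
	fs->groups[group].flags &= ~GRP_NO_LARGE_FREE;
}

/* Remember the last group that satisfied a stripe-sized allocation. */
static void hint_alloc_succeeded(struct fs_hint *fs, size_t group)
{
	assert(group < fs->ngroups);
	fs->last_good_group = group;
}

/*
 * Return the first group worth scanning, starting at the remembered
 * group and wrapping around, or fs->ngroups if every group is flagged.
 */
static size_t hint_next_candidate(const struct fs_hint *fs)
{
	size_t i;

	for (i = 0; i < fs->ngroups; i++) {
		size_t group = (fs->last_good_group + i) % fs->ngroups;

		if (!(fs->groups[group].flags & GRP_NO_LARGE_FREE))
			return group;
	}
	return fs->ngroups;
}
```

The flag is deliberately conservative: clearing it on any free only
means the group gets scanned again, not that a large chunk is
guaranteed to be there.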

Starting the buddy-bitmap scanning (and checksum-verifying) thread at
mount time would only help if the first write to the filesystem does
not come immediately after mount (which it does in Lustre at least).
Having a filesystem-wide (r)btree for the free space (à la XFS) would
also only help if the btree could be (at least partially) built from
the bitmaps before the first write, unless we cache the bitmap on
disk, which caused Lustre plenty of trouble in the past and which I'm
leery of doing.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


[-- Attachment #2: ext4-mballoc-skip.diff --]
[-- Type: application/octet-stream, Size: 2517 bytes --]

Reduce the size of group preallocation chunks to a single RAID
stripe, or 1MB (256 blocks at a 4kB block size) if that is not
specified.  Since small files are likely to be read back in an
unrelated order anyway, the main benefit of aggregation is to avoid
read-modify-write, which is still satisfied by the smaller default
size.

Also skip reading the block bitmap of a group if there is absolutely
no chance of finding a better extent in that group, because its
number of free blocks is no greater than the number of blocks in the
best extent found so far.  A better decision could be made after the
bitmap is loaded, but a large filesystem can have tens of thousands
of groups, and reading all of their bitmaps can take minutes if the
filesystem is nearly full.

Signed-off-by: Andreas Dilger <adilger@sun.com>

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 54df209..6038fad 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -125,8 +125,7 @@
  * list. In case of inode preallocation we follow a list of heuristics
  * based on file size. This can be found in ext4_mb_normalize_request. If
  * we are doing a group prealloc we try to normalize the request to
- * sbi->s_mb_group_prealloc. Default value of s_mb_group_prealloc is
- * 512 blocks. This can be tuned via
+ * sbi->s_mb_group_prealloc.  This can be tuned via
  * /sys/fs/ext4/<partition/mb_group_prealloc. The value is represented in
  * terms of number of blocks. If we have mounted the file system with -O
  * stripe=<value> option the group prealloc request is normalized to the
@@ -2029,9 +2028,12 @@ repeat:
 			if (group == ngroups)
 				group = 0;
 
-			/* quick check to skip empty groups */
+			/* If there's no chance that this group has a better
+			 * extent, just skip it instead of seeking to read
+			 * block bitmap from disk. Initially ac_b_ex.fe_len = 0,
+			 * so this always skips groups with no free space. */
 			grp = ext4_get_group_info(sb, group);
-			if (grp->bb_free == 0)
+			if (grp->bb_free <= ac->ac_b_ex.fe_len)
 				continue;
 
 			err = ext4_mb_load_buddy(sb, group, &e4b);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index b619322..df516c8 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -90,9 +90,9 @@ extern u8 mb_enable_debug;
 #define MB_DEFAULT_ORDER2_REQS		2
 
 /*
- * default group prealloc size 512 blocks
+ * default group prealloc size in blocks
  */
-#define MB_DEFAULT_GROUP_PREALLOC	512
+#define MB_DEFAULT_GROUP_PREALLOC	256
 
 
 struct ext4_free_data {
