* [PATCH] Clustering indirect blocks in Ext3
@ 2007-11-16  5:02 Abhishek Rai
  2007-11-16  7:02 ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Abhishek Rai @ 2007-11-16  5:02 UTC (permalink / raw)
  To: akpm; +Cc: Andreas Dilger, linux-kernel, Ken Chen, Mike Waychison

(This patch was previously posted on linux-ext4 where Andreas Dilger
offered some valuable comments on it).

This patch modifies the block allocation strategy in ext3 in order to
improve fsck performance. This was initially sent out as a patch for
ext2, but given the lack of ongoing development on ext2, I have
crossported it to ext3 instead. Slow fsck is not a serious problem on
ext3 due to journaling, but once in a while users do need to run full
fsck on their ext3 file systems. This can be due to several reasons:
(1) bad disk, bad crash, etc, (2) bug in jbd/ext3, and (3) every few
reboots, it's good to run fsck anyway. This patch will help reduce
full fsck time for ext3. I've seen 50-65% reduction in fsck time when
using this patch on a near-full file system. With some fsck
optimizations, this figure becomes 80%.

Most of Ext3 metadata is clustered on disk. For example, Ext3
partitions the block space into block groups and stores the metadata
for each block group (inode table, block bitmap, inode bitmap) at the
beginning of the block group. Clustering related metadata together not
only helps ext3 I/O performance by keeping data and related metadata
close together, but also helps fsck since it is able to find all the
metadata in one place. However, indirect blocks are an exception.
Indirect blocks are allocated on-demand and are spread out along with
the data. This layout enables good I/O performance due to the close
proximity between an indirect block and its data blocks but it makes
things difficult for fsck which must now rotate almost the entire disk
in order to read all indirect blocks. In fact, our measurements have
indicated that for most ext3 disks on which fsck takes a long time,
>80% of the time is spent reading indirect blocks. So speeding up
indirect block read accesses in fsck can significantly improve fsck
times.
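
To see why, a back-of-the-envelope estimate (illustrative numbers only,
assuming 4KB blocks, 4-byte block pointers, and roughly 7ms per random
disk read):

  blocks mapped per indirect block = 4096 / 4 = 1024 blocks = 4MB of data
  indirect blocks on a near-full 400GB disk ~ 400GB / 4MB ~ 100,000
  reading them one seek at a time ~ 100,000 * 7ms ~ 700s, roughly 12 minutes

which is in the same ballpark as the vanilla fsck time in Benchmark 5
below, whereas the same ~400MB of indirect blocks read sequentially from
a handful of clustered locations takes well under a minute.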

One solution to this problem implemented in this patch is to cluster
indirect blocks together on a per group basis, similar to how inodes
and bitmaps are clustered. Indirect block clusters (metaclusters) help
fsck performance by enabling fsck to fetch all indirect blocks by
reading from a few locations on the disk instead of rotating through
the entire disk. Unfortunately, a naive clustering scheme for indirect
blocks can hurt I/O performance, as it separates out indirect blocks
and corresponding direct blocks on the disk. So an I/O to a direct
block whose indirect block is not in the page cache now needs to incur
a longer seek+rotational delay in moving the disk head from the
indirect block to the direct block.

So our goal then is to implement metaclustering without having any
impact (<0.1%) on I/O performance. Fortunately, the current ext3 I/O
algorithm is not the most efficient; improving it can camouflage the
performance hit we suffer due to metaclustering. In fact,
metaclustering automatically enables one such optimization. When doing
sequential read from a file and reading an indirect block for it, we
readahead several indirect blocks of the file from the same
metacluster. Moreover, when possible we do this asynchronously. This
reduces the seek+rotational latency associated with seeking between
data and indirect blocks during a (long) sequential read.

There is one more design choice that affects the performance of this
patch: location and number of metaclusters per block group. Currently
we have one metacluster per block group and it is located at the
center of the block group. We adopted this scheme after evaluating
three possible locations of metaclusters: beginning, middle, and end
of block group. We did not evaluate configurations with >1 metacluster
per block group. In our experiments, the middle configuration did not
cause any performance degradation for sequential and random reads.
Whereas putting the metacluster at the beginning of the block group
yields best performance for sequential reads (write performance is
unaffected by this change), putting it in the middle helps random
reads. Since the "middle path" maintains status quo, we adopted that
in our change.
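
To make the metacluster geometry concrete, here is a minimal user-space
sketch (not part of the patch) that mirrors the ext3_get_grp_metacluster()
helper added below: the metacluster starts at the middle of the block
group and spans blocks_per_group/128 blocks.

#include <stdio.h>

/*
 * Same arithmetic as ext3_get_grp_metacluster() in the patch: start at
 * the middle of the block group and reserve blocks_per_group/128 blocks.
 * (When the metacluster option is off, the real helper returns an empty
 * range, i.e. mc_end == mc_start.)
 */
static void get_grp_metacluster(unsigned long blocks_per_group,
				unsigned long *mc_start,
				unsigned long *mc_end)	/* exclusive */
{
	*mc_start = blocks_per_group / 2;
	*mc_end = *mc_start + (blocks_per_group >> 7);
}

int main(void)
{
	/* A typical 4KB-block ext3 group has 32768 blocks (128MB). */
	unsigned long mc_start, mc_end;

	get_grp_metacluster(32768, &mc_start, &mc_end);
	printf("metacluster: blocks %lu-%lu of the group "
	       "(%lu blocks, %luKB)\n", mc_start, mc_end - 1,
	       mc_end - mc_start, (mc_end - mc_start) * 4);
	return 0;
}

So with 4KB blocks each group reserves 256 blocks (1MB) out of 32768,
i.e. less than 1% of the group, for indirect blocks.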

Performance evaluation results:
Setup:
RAM: 8GB
Disk: 400GB disk.
CPU: Dual core hyperthreaded

All measurements were taken 10 times or more, until the standard deviation
was <2%. The machine was rebooted between runs and the file system freshly
formatted; we also made sure that nothing else was running on the
machine at the time of the test.

Notation:
- 'vanilla': regular ext3 without any changes
- 'mc': metaclustering ext3

Benchmark 1: Sequential write to a 10GB file followed by 'sync'
1. vanilla:
  Total: 3m9.0s
  User: 0.08
  System: 23s-48s (very high variance)
2. mc:
  Total: 3m6.1s
  User: 0.08s
  System: 48.1s

Benchmark 2: Sequential read from a 10GB file.
Description: the file is created using the same type of ext3 (mc or vanilla)
1. vanilla:
  Total: 3m14.5s
  User: 0.04s
  System: 13.4s
2. mc:
  Total: 3m14.5s
  User: 0.04s
  System: 13.3s

Benchmark 3: Random read from a 300GB file
Description: read using 512-byte chunks, 5MB read in total
1. vanilla:
  Total: 3m56.4s
  User: ~0
  System: 0.6s
2. mc:
  Total: 3m51.4s
  User: ~0
  System: 0.8s

Benchmark 4: Random read from a 300GB file
Description: read using 512KB chunks, totaling 1% of the file size
1. vanilla:
  Total: 4m46.3s
  User: ~0
  System: 3.9s
2. mc:
  Total: 4m44.4s
  User: ~0
  System: 3.9s

Benchmark 5: fsck
Description: Prepare a newly formatted 400GB disk as follows: create
200 files of 0.5GB each, 100 files of 1GB each, 40 files of 2.5GB each,
and 10 files of 10GB each. fsck command line: fsck -f -n
1. vanilla:
  Total: 12m18.1s
  User: 15.9s
  System: 18.3s
2. mc:
  Total: 4m47.0s
  User: 16.0s
  System: 17.1s


Benchmark 6: kernbench (this was done on an 8cpu machine with 32GB RAM)
1. vanilla:
  Elapsed: 35.60
  User: 228.79
  System: 21.10
2. mc:
  Elapsed: 35.12
  User: 228.47
  System: 21.08

Note:
1. This patch does not affect ext3 on-disk layout compatibility in any
way. Existing disks continue to work with new code, and disks modified
by new code continue to work with existing machines. In contrast, the
extents patch will also probably solve this problem but it breaks on-disk
compatibility.
2. Metaclustering is a mount-time option (-o metacluster); see the usage
example after these notes. The option only affects the write path: when
it is specified, indirect blocks are allocated in clusters; when it is
not, they are allocated alongside data blocks as before. The read path
is unaffected by the option - read behavior depends on the data layout
on disk: if a read discovers metaclusters on disk it will do
prefetching, otherwise it will not.
3. e2fsck speedup with metaclustering varies from disk
to disk with most benefit coming from disks which have a large number
of indirect blocks. For disks which have few indirect blocks, fsck
usually doesn't take too long anyway and hence it's OK not to deliver
a huge speedup there. But in all cases, metaclustering doesn't cause
any degradation in IO performance as seen in the benchmarks above.
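
For reference, the mount option from note 2 is used like any other ext3
mount option, e.g. (device and mount point are placeholders):

  mount -t ext3 -o metacluster /dev/sdXN /mnt

Since only the write path is affected, files written before the option
was enabled keep their old layout; only newly allocated indirect blocks
go into the metaclusters.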

Thanks,
Abhishek

Signed-off-by: Abhishek Rai <abhishekrai@google.com>

diff -uprdN linux-2.6.23mm1-clean/fs/ext3/balloc.c linux-2.6.23mm1-ext3mc/fs/ext3/balloc.c
--- linux-2.6.23mm1-clean/fs/ext3/balloc.c	2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/balloc.c	2007-11-15 11:23:51.000000000 -0800
@@ -711,6 +711,7 @@ bitmap_search_next_usable_block(ext3_grp
 	ext3_grpblk_t next;
 	struct journal_head *jh = bh2jh(bh);

+	BUG_ON(start > maxblocks);
 	while (start < maxblocks) {
 		next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
 		if (next >= maxblocks)
@@ -841,10 +842,12 @@ claim_block(spinlock_t *lock, ext3_grpbl
 static ext3_grpblk_t
 ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
 			struct buffer_head *bitmap_bh, ext3_grpblk_t grp_goal,
-			unsigned long *count, struct ext3_reserve_window *my_rsv)
+			int use_metacluster, unsigned long *count,
+			struct ext3_reserve_window *my_rsv)
 {
 	ext3_fsblk_t group_first_block;
 	ext3_grpblk_t start, end;
+	ext3_grpblk_t mc_start, mc_end, start2 = -1, end2 = -1;
 	unsigned long num = 0;

 	/* we do allocation within the reservation window if we have a window */
@@ -872,12 +875,48 @@ ext3_try_to_allocate(struct super_block
 	}

 	BUG_ON(start > EXT3_BLOCKS_PER_GROUP(sb));
+	/* start must have been set to grp_goal if one still exists. */
+	BUG_ON(grp_goal >= 0 && start != grp_goal);
+
+	if (test_opt(sb, METACLUSTER) && !use_metacluster) {
+		ext3_get_grp_metacluster(sb, &mc_start, &mc_end);
+
+		/*
+	 	 * If there is an overlap with metacluster range, adjust our
+		 * range to remove overlap, splitting our range into two if
+		 * needed.
+	 	 */
+		if (mc_end > mc_start) {
+			if (mc_start <= start)
+				start = max_t(ext3_grpblk_t, start, mc_end);
+			else if (mc_end >= end)
+				end = min_t(ext3_grpblk_t, end, mc_start);
+			else {
+				start2 = mc_end;
+				end2 = end;
+				end = mc_start;
+			}
+		}
+	}
+
+	if (start >= end)
+		goto fail_access;
+
+	if (grp_goal > 0)
+		grp_goal = start;

 repeat:
 	if (grp_goal < 0 || !ext3_test_allocatable(grp_goal, bitmap_bh)) {
 		grp_goal = find_next_usable_block(start, bitmap_bh, end);
-		if (grp_goal < 0)
+		if (grp_goal < 0) {
+			if (start2 >= 0) {
+				start = start2;
+				end = end2;
+				start2 = -1;
+				goto repeat;
+			}
 			goto fail_access;
+		}
 		if (!my_rsv) {
 			int i;

@@ -898,8 +937,15 @@ repeat:
 		 */
 		start++;
 		grp_goal++;
-		if (start >= end)
-			goto fail_access;
+		if (start >= end) {
+			if (start2 < 0)
+				goto fail_access;
+
+			start = start2;
+			end = end2;
+			start2 = -1;
+			grp_goal = -1;
+		}
 		goto repeat;
 	}
 	num++;
@@ -1084,6 +1130,7 @@ static int alloc_new_reservation(struct
 	unsigned long size;
 	int ret;
 	spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
+	ext3_grpblk_t mc_start, mc_end;

 	group_first_block = ext3_group_first_block_no(sb, group);
 	group_end_block = group_first_block + (EXT3_BLOCKS_PER_GROUP(sb) - 1);
@@ -1143,6 +1190,7 @@ static int alloc_new_reservation(struct
 	 * To make sure the reservation window has a free bit inside it, we
 	 * need to check the bitmap after we found a reservable window.
 	 */
+	ext3_get_grp_metacluster(sb, &mc_start, &mc_end);
 retry:
 	ret = find_next_reservable_window(search_head, my_rsv, sb,
 						start_block, group_end_block);
@@ -1170,6 +1218,11 @@ retry:
 			my_rsv->rsv_start - group_first_block,
 			bitmap_bh, group_end_block - group_first_block + 1);

+	if (first_free_block >= mc_start && first_free_block < mc_end) {
+		start_block = mc_end;
+		goto next;
+	}
+
 	if (first_free_block < 0) {
 		/*
 		 * no free block left on the bitmap, no point
@@ -1195,6 +1248,7 @@ retry:
 	 * start from where the free block is,
 	 * we also shift the list head to where we stopped last time
 	 */
+next:
 	search_head = my_rsv;
 	spin_lock(rsv_lock);
 	goto retry;
@@ -1223,12 +1277,18 @@ static void try_to_extend_reservation(st
 	struct ext3_reserve_window_node *next_rsv;
 	struct rb_node *next;
 	spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
+	ext3_grpblk_t mc_start, mc_end;

 	if (!spin_trylock(rsv_lock))
 		return;

 	next = rb_next(&my_rsv->rsv_node);

+	ext3_get_grp_metacluster(sb, &mc_start, &mc_end);
+
+	if (my_rsv->rsv_end >= mc_start && my_rsv->rsv_end < mc_end)
+		size += mc_end - 1 - my_rsv->rsv_end;
+
 	if (!next)
 		my_rsv->rsv_end += size;
 	else {
@@ -1274,7 +1334,7 @@ static void try_to_extend_reservation(st
 static ext3_grpblk_t
 ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
 			unsigned int group, struct buffer_head *bitmap_bh,
-			ext3_grpblk_t grp_goal,
+			ext3_grpblk_t grp_goal, int use_metacluster,
 			struct ext3_reserve_window_node * my_rsv,
 			unsigned long *count, int *errp)
 {
@@ -1305,7 +1365,8 @@ ext3_try_to_allocate_with_rsv(struct sup
 	 */
 	if (my_rsv == NULL ) {
 		ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh,
-						grp_goal, count, NULL);
+						grp_goal, use_metacluster,
+						count, NULL);
 		goto out;
 	}
 	/*
@@ -1361,7 +1422,8 @@ ext3_try_to_allocate_with_rsv(struct sup
 			BUG();
 		}
 		ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh,
-					   grp_goal, &num, &my_rsv->rsv_window);
+						grp_goal, use_metacluster,
+						&num, &my_rsv->rsv_window);
 		if (ret >= 0) {
 			my_rsv->rsv_alloc_hit += num;
 			*count = num;
@@ -1455,6 +1517,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 	int bgi;			/* blockgroup iteration index */
 	int fatal = 0, err;
 	int performed_allocation = 0;
+	int use_metacluster = 0;
 	ext3_grpblk_t free_blocks;	/* number of free blocks in a group */
 	struct super_block *sb;
 	struct ext3_group_desc *gdp;
@@ -1473,6 +1536,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 	sb = inode->i_sb;
 	if (!sb) {
 		printk("ext3_new_block: nonexistent device");
+		*errp = -ENODEV;
 		return 0;
 	}

@@ -1487,6 +1551,11 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 	sbi = EXT3_SB(sb);
 	es = EXT3_SB(sb)->s_es;
 	ext3_debug("goal=%lu.\n", goal);
+
+	/* Caller should ensure this. */
+	BUG_ON(goal < le32_to_cpu(es->s_first_data_block) ||
+	       goal >= le32_to_cpu(es->s_blocks_count));
+
 	/*
 	 * Allocate a block from reservation only when
 	 * filesystem is mounted with reservation(default,-o reservation), and
@@ -1507,9 +1576,6 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 	/*
 	 * First, test whether the goal block is free.
 	 */
-	if (goal < le32_to_cpu(es->s_first_data_block) ||
-	    goal >= le32_to_cpu(es->s_blocks_count))
-		goal = le32_to_cpu(es->s_first_data_block);
 	group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
 			EXT3_BLOCKS_PER_GROUP(sb);
 	goal_group = group_no;
@@ -1535,7 +1601,7 @@ retry_alloc:
 			goto io_error;
 		grp_alloc_blk = ext3_try_to_allocate_with_rsv(sb, handle,
 					group_no, bitmap_bh, grp_target_blk,
-					my_rsv,	&num, &fatal);
+					use_metacluster, my_rsv, &num, &fatal);
 		if (fatal)
 			goto out;
 		if (grp_alloc_blk >= 0)
@@ -1573,8 +1639,8 @@ retry_alloc:
 		 * try to allocate block(s) from this group, without a goal(-1).
 		 */
 		grp_alloc_blk = ext3_try_to_allocate_with_rsv(sb, handle,
-					group_no, bitmap_bh, -1, my_rsv,
-					&num, &fatal);
+					group_no, bitmap_bh, -1,
+					use_metacluster, my_rsv, &num, &fatal);
 		if (fatal)
 			goto out;
 		if (grp_alloc_blk >= 0)
@@ -1593,6 +1659,10 @@ retry_alloc:
 		group_no = goal_group;
 		goto retry_alloc;
 	}
+	if (test_opt(sb, METACLUSTER) && use_metacluster == 0) {
+		use_metacluster = 1;
+		goto retry_alloc;
+	}
 	/* No space left on the device */
 	*errp = -ENOSPC;
 	goto out;
@@ -1713,6 +1783,161 @@ ext3_fsblk_t ext3_new_block(handle_t *ha
 	return ext3_new_blocks(handle, inode, goal, &count, errp);
 }

+/*
+ * ext3_new_indirect_blocks() -- allocate indirect blocks for inode.
+ * @inode:		file inode
+ * @count:		target number of indirect blocks to allocate
+ * @new_blocks[]:       used for returning block numbers allocated
+ *
+ * return: 0 on success, appropriate error code otherwise. Upon return, *count
+ * contains the number of blocks successfully allocated which is non-zero only
+ * in the success case.
+ *
+ * Allocate maximum of *count indirect blocks from the indirect block metadata
+ * area of the inode's group and store the block numbers in new_blocks[]. Since
+ * the allocation is in a predetermined region of the block group, caller just
+ * needs to pass a group number here which is where the goal and/or the
+ * reservation window may fall.
+ */
+int ext3_new_indirect_blocks(handle_t *handle, struct inode *inode,
+			unsigned long group_no, unsigned long *count,
+			ext3_fsblk_t new_blocks[])
+{
+	struct super_block *sb;
+	struct ext3_sb_info *sbi;
+	struct buffer_head *bitmap_bh = NULL;
+	struct buffer_head *gdp_bh;
+	struct ext3_group_desc *gdp;
+	ext3_grpblk_t group_first_block;      /* first block in the group */
+	ext3_grpblk_t free_blocks;	/* number of free blocks in the group */
+	ext3_grpblk_t mc_start, mc_end;
+	int blk, done = 0;
+	int err = 0;
+
+	BUG_ON(*count > 3);
+
+	sb = inode->i_sb;
+	if (!sb) {
+		printk(KERN_INFO "ext3_new_indirect_blocks: "
+			"nonexistent device");
+		return -ENODEV;
+	}
+	BUG_ON(!test_opt(sb, METACLUSTER));
+	sbi = EXT3_SB(sb);
+
+	if (DQUOT_ALLOC_BLOCK(inode, *count))
+		return -EDQUOT;
+
+	if (!ext3_has_free_blocks(sbi)) {
+		err = -ENOSPC;
+		goto out;
+	}
+
+	gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
+	if (!gdp) {
+		err = -EIO;
+		goto out;
+	}
+
+	free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
+	if (free_blocks == 0) {
+		err = -ENOSPC;
+		goto out;
+	}
+
+	bitmap_bh = read_block_bitmap(sb, group_no);
+	if (!bitmap_bh) {
+		err = -EIO;
+		goto out;
+	}
+
+	/*
+	 * Make sure we use undo access for the bitmap, because it is critical
+	 * that we do the frozen_data COW on bitmap buffers in all cases even
+	 * if the buffer is in BJ_Forget state in the committing transaction.
+	 */
+	BUFFER_TRACE(bitmap_bh, "get undo access for new indirect block");
+	err = ext3_journal_get_undo_access(handle, bitmap_bh);
+	if (err)
+		goto out;
+
+	err = -ENOSPC;
+	group_first_block = ext3_group_first_block_no(sb, group_no);
+	ext3_get_grp_metacluster(sb, &mc_start, &mc_end);
+	blk = mc_start;
+
+	while (done < *count && blk < mc_end) {
+		if (!ext3_test_allocatable(blk, bitmap_bh)) {
+			/*
+			 * Don't use find_next_usable_block() here as it may
+			 * skip free blocks that are not close to the goal.
+			 * Since our goal is always fixed (mc_start), we may
+			 * be trying to allocate slightly far from it and that
+			 * will be a problem.
+			 */
+			blk = bitmap_search_next_usable_block(blk, bitmap_bh,
+								mc_end);
+			continue;
+		}
+		if (claim_block(sb_bgl_lock(sbi, group_no), blk,
+				bitmap_bh)) {
+			new_blocks[done++] = group_first_block + blk;
+		} else {
+			/*
+		 	 * The block was allocated by another thread, or it
+			 * was allocated and then freed by another thread
+		 	 */
+			cpu_relax();
+		}
+		blk++;
+	}
+
+	if (!done) {
+		BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
+		ext3_journal_release_buffer(handle, bitmap_bh);
+		goto out;
+	}
+
+	BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for bitmap block");
+	err = ext3_journal_dirty_metadata(handle, bitmap_bh);
+	if (err)
+		goto out;
+
+	BUFFER_TRACE(gdp_bh, "get_write_access");
+	err = ext3_journal_get_write_access(handle, gdp_bh);
+	if (err)
+		goto out;
+
+	/*
+	 * Caller is responsible for adding the new indirect block buffers
+	 * to the journal list.
+	 */
+
+	spin_lock(sb_bgl_lock(sbi, group_no));
+	gdp->bg_free_blocks_count =
+		cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - done);
+	spin_unlock(sb_bgl_lock(sbi, group_no));
+	percpu_counter_sub(&sbi->s_freeblocks_counter, done);
+
+	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
+	err = ext3_journal_dirty_metadata(handle, gdp_bh);
+	sb->s_dirt = 1;
+	if (err)
+		goto out;
+
+out:
+	if (bitmap_bh)
+		brelse(bitmap_bh);
+
+	DQUOT_FREE_BLOCK(inode, *count - done);
+	*count = done;
+
+	if (err && err != -ENOSPC)
+		ext3_error(sb, "ext3_new_indirect_blocks", "error %d", err);
+
+	return err;
+}
+
 /**
  * ext3_count_free_blocks() -- count filesystem free blocks
  * @sb:		superblock
diff -uprdN linux-2.6.23mm1-clean/fs/ext3/inode.c linux-2.6.23mm1-ext3mc/fs/ext3/inode.c
--- linux-2.6.23mm1-clean/fs/ext3/inode.c	2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/inode.c	2007-11-15 11:21:07.000000000 -0800
@@ -39,7 +39,29 @@
 #include "xattr.h"
 #include "acl.h"

+typedef struct {
+	__le32	*p;
+	__le32	key;
+	struct buffer_head *bh;
+} Indirect;
+
+struct ext3_ind_read_info {
+	int                     count;
+	int                     seq_prefetch;
+	long                    size;
+	struct buffer_head      *bh[0];
+};
+
+# define EXT3_IND_READ_INFO_SIZE(_c)        \
+	(sizeof(struct ext3_ind_read_info) + \
+	 sizeof(struct buffer_head *) * (_c))
+
+# define EXT3_IND_READ_MAX     	(32)
+
 static int ext3_writepage_trans_blocks(struct inode *inode);
+static Indirect *ext3_read_indblocks(struct inode *inode, int iblock,
+					int depth, int offsets[4],
+					Indirect chain[4], int *err);

 /*
  * Test whether an inode is a fast symlink.
@@ -233,12 +255,6 @@ no_delete:
 	clear_inode(inode);	/* We must guarantee clearing of inode... */
 }

-typedef struct {
-	__le32	*p;
-	__le32	key;
-	struct buffer_head *bh;
-} Indirect;
-
 static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
 {
 	p->key = *(p->p = v);
@@ -352,18 +368,21 @@ static int ext3_block_to_path(struct ino
  *	the whole chain, all way to the data (returns %NULL, *err == 0).
  */
 static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
-				 Indirect chain[4], int *err)
+				 Indirect chain[4], int ind_readahead, int *err)
 {
 	struct super_block *sb = inode->i_sb;
 	Indirect *p = chain;
 	struct buffer_head *bh;
+	int index;

 	*err = 0;
 	/* i_data is not going away, no lock needed */
 	add_chain (chain, NULL, EXT3_I(inode)->i_data + *offsets);
 	if (!p->key)
 		goto no_block;
-	while (--depth) {
+	for (index = 0; index < depth - 1; index++) {
+		if (ind_readahead && depth > 2 && index == depth - 2)
+			break;
 		bh = sb_bread(sb, le32_to_cpu(p->key));
 		if (!bh)
 			goto failure;
@@ -396,7 +415,15 @@ no_block:
  *	It is used when heuristic for sequential allocation fails.
  *	Rules are:
  *	  + if there is a block to the left of our position - allocate near it.
- *	  + if pointer will live in indirect block - allocate near that block.
+ *	  + If METACLUSTER options is not specified, allocate the data
+ *	  block close to the metadata block. Otherwise, if pointer will live in
+ *	  indirect block, we cannot allocate near the indirect block since
+ *	  indirect blocks are allocated in a reserved area. Even if we allocate
+ *	  this block right after the preceding logical file block, we'll still
+ *	  have to incur extra seek due to the indirect block (unless we
+ *	  prefetch the indirect block separately). So for now (until
+ *	  prefetching is turned on), it's OK not to return a sequential goal -
+ *	  just put in the same cylinder group as the inode.
  *	  + if pointer will live in inode - allocate in the same
  *	    cylinder group.
  *
@@ -421,9 +448,11 @@ static ext3_fsblk_t ext3_find_near(struc
 			return le32_to_cpu(*p);
 	}

-	/* No such thing, so let's try location of indirect block */
-	if (ind->bh)
-		return ind->bh->b_blocknr;
+	if (!test_opt(inode->i_sb, METACLUSTER)) {
+		/* No such thing, so let's try location of indirect block */
+		if (ind->bh)
+			return ind->bh->b_blocknr;
+	}

 	/*
 	 * It is going to be referred to from the inode itself? OK, just put it
@@ -475,8 +504,7 @@ static ext3_fsblk_t ext3_find_goal(struc
  *	@blks: number of data blocks to be mapped.
  *	@blocks_to_boundary:  the offset in the indirect block
  *
- *	return the total number of blocks to be allocate, including the
- *	direct and indirect blocks.
+ *	return the total number of direct blocks to be allocated.
  */
 static int ext3_blks_to_allocate(Indirect *branch, int k, unsigned long blks,
 		int blocks_to_boundary)
@@ -508,22 +536,39 @@ static int ext3_blks_to_allocate(Indirec
  *	ext3_alloc_blocks: multiple allocate blocks needed for a branch
  *	@indirect_blks: the number of blocks need to allocate for indirect
  *			blocks
- *
+ *	@blks: the number of direct blocks to be allocated
  *	@new_blocks: on return it will store the new block numbers for
  *	the indirect blocks(if needed) and the first direct block,
- *	@blks:	on return it will store the total number of allocated
- *		direct blocks
+ *
+ *	returns the number of direct blocks allocated, error via *err, and
+ *	new block numbers via new_blocks[]
  */
 static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
 			ext3_fsblk_t goal, int indirect_blks, int blks,
 			ext3_fsblk_t new_blocks[4], int *err)
 {
+	struct super_block *sb;
+	struct ext3_super_block *es;
 	int target, i;
-	unsigned long count = 0;
+	unsigned long count = 0, goal_group;
 	int index = 0;
 	ext3_fsblk_t current_block = 0;
 	int ret = 0;

+	BUG_ON(blks <= 0);
+
+	sb = inode->i_sb;
+	if (!sb) {
+		printk(KERN_INFO "ext3_alloc_blocks: nonexistent device");
+		*err = -ENODEV;
+		return 0;
+	}
+	es = EXT3_SB(sb)->s_es;
+
+	if (goal < le32_to_cpu(es->s_first_data_block) ||
+	    goal >= le32_to_cpu(es->s_blocks_count))
+		goal = le32_to_cpu(es->s_first_data_block);
+
 	/*
 	 * Here we try to allocate the requested multiple blocks at once,
 	 * on a best-effort basis.
@@ -534,6 +579,41 @@ static int ext3_alloc_blocks(handle_t *h
 	 */
 	target = blks + indirect_blks;

+	/*
+	 * Try to allocate indirect blocks in the metacluster region of block
+	 * group in which goal falls. This should not only give us clustered
+	 * metablock allocation, but also allocate new metablocks close to the
+	 * corresponding data blocks (by putting them in the same block group).
+	 * Note that allocation of indirect blocks is only guided by goal and
+	 * not by reservation window since the goal mostly falls within the
+	 * reservation window for sequential allocation.
+	 * If the indirect blocks could not be allocated in this block group,
+	 * we fall back to sequential allocation of indirect block alongside
+	 * the data block instead of trying other block groups as that can
+	 * separate indirect and data blocks too far out.
+	 */
+	if (test_opt(sb, METACLUSTER) && indirect_blks) {
+		count = indirect_blks;
+		goal_group = (goal - le32_to_cpu(es->s_first_data_block)) /
+				EXT3_BLOCKS_PER_GROUP(sb);
+		*err = ext3_new_indirect_blocks(handle, inode, goal_group,
+						&count, new_blocks + index);
+		if (*err && *err != -ENOSPC) {
+			printk(KERN_ERR "ext3_alloc_blocks failed to allocate "
+				"indirect blocks: %d", *err);
+			goto failed_out;
+		} else if (*err == 0) {
+			BUG_ON(count == 0);
+		}
+		*err = 0;
+
+		if (count > 0) {
+			index += count;
+			target -= count;
+			BUG_ON(index > indirect_blks);
+		}
+	}
+
 	while (1) {
 		count = target;
 		/* allocating blocks for indirect blocks and direct blocks */
@@ -542,7 +622,7 @@ static int ext3_alloc_blocks(handle_t *h
 			goto failed_out;

 		target -= count;
-		/* allocate blocks for indirect blocks */
+		/* store indirect block numbers we just allocated */
 		while (index < indirect_blks && count) {
 			new_blocks[index++] = current_block++;
 			count--;
@@ -570,10 +650,14 @@ failed_out:
  *	@inode: owner
  *	@indirect_blks: number of allocated indirect blocks
  *	@blks: number of allocated direct blocks
+ *	@goal: goal for allocation
  *	@offsets: offsets (in the blocks) to store the pointers to next.
  *	@branch: place to store the chain in.
  *
- *	This function allocates blocks, zeroes out all but the last one,
+ *	returns error and number of direct blocks allocated via *blks
+ *
+ *	This function allocates indirect_blks + *blks, zeroes out all
+ *	indirect blocks,
  *	links them into chain and (if we are synchronous) writes them to disk.
  *	In other words, it prepares a branch that can be spliced onto the
  *	inode. It stores the information about that chain in the branch[], in
@@ -799,17 +883,24 @@ int ext3_get_blocks_handle(handle_t *han
 	int blocks_to_boundary = 0;
 	int depth;
 	struct ext3_inode_info *ei = EXT3_I(inode);
-	int count = 0;
+	int count = 0, ind_readahead;
 	ext3_fsblk_t first_block = 0;

-
+	BUG_ON(!create &&
+		iblock >= (inode->i_size + inode->i_sb->s_blocksize - 1) >>
+					inode->i_sb->s_blocksize_bits);
 	J_ASSERT(handle != NULL || create == 0);
 	depth = ext3_block_to_path(inode,iblock,offsets,&blocks_to_boundary);

 	if (depth == 0)
 		goto out;

-	partial = ext3_get_branch(inode, depth, offsets, chain, &err);
+	ind_readahead = !create && depth > 2;
+	partial = ext3_get_branch(inode, depth, offsets, chain,
+				  ind_readahead, &err);
+	if (!partial && ind_readahead)
+		partial = ext3_read_indblocks(inode, iblock, depth,
+					      offsets, chain, &err);

 	/* Simplest case - block found, no allocation needed */
 	if (!partial) {
@@ -844,7 +935,7 @@ int ext3_get_blocks_handle(handle_t *han
 	}

 	/* Next simple case - plain lookup or failed read of indirect block */
-	if (!create || err == -EIO)
+	if (!create || (err && err != -EAGAIN))
 		goto cleanup;

 	mutex_lock(&ei->truncate_mutex);
@@ -866,7 +957,8 @@ int ext3_get_blocks_handle(handle_t *han
 			brelse(partial->bh);
 			partial--;
 		}
-		partial = ext3_get_branch(inode, depth, offsets, chain, &err);
+		partial = ext3_get_branch(inode, depth, offsets, chain, 0,
+					&err);
 		if (!partial) {
 			count++;
 			mutex_unlock(&ei->truncate_mutex);
@@ -1974,7 +2066,7 @@ static Indirect *ext3_find_shared(struct
 	/* Make k index the deepest non-null offest + 1 */
 	for (k = depth; k > 1 && !offsets[k-1]; k--)
 		;
-	partial = ext3_get_branch(inode, k, offsets, chain, &err);
+	partial = ext3_get_branch(inode, k, offsets, chain, 0, &err);
 	/* Writer: pointers */
 	if (!partial)
 		partial = chain + k-1;
@@ -3297,3 +3389,508 @@ int ext3_change_inode_journal_flag(struc

 	return err;
 }
+
+/*
+ * ext3_ind_read_end_bio --
+ *
+ * 	bio callback for read IO issued from ext3_read_indblocks.
+ * 	Will be called only once, when all I/O has completed.
+ * 	Frees read_info and bio.
+ */
+static void ext3_ind_read_end_bio(struct bio *bio, int err)
+{
+	struct ext3_ind_read_info *read_info = bio->bi_private;
+	struct buffer_head *bh;
+	int uptodate = !err && test_bit(BIO_UPTODATE, &bio->bi_flags);
+	int i;
+
+	BUG_ON(read_info->count <= 0);
+
+	if (err == -EOPNOTSUPP)
+		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+
+	for (i = 0; i < read_info->count; i++) {
+		bh = read_info->bh[i];
+		BUG_ON(bh == NULL);
+
+		if (err == -EOPNOTSUPP)
+			set_bit(BH_Eopnotsupp, &bh->b_state);
+
+		if (uptodate) {
+			BUG_ON(buffer_uptodate(bh));
+			BUG_ON(ext3_buffer_prefetch(bh));
+			set_buffer_uptodate(bh);
+			if (read_info->seq_prefetch)
+				ext3_set_buffer_prefetch(bh);
+		}
+
+		unlock_buffer(bh);
+		brelse(bh);
+	}
+
+	kfree(read_info);
+	bio_put(bio);
+}
+
+/*
+ * ext3_get_max_read --
+ * 	@inode: inode of file.
+ * 	@block: block number in file (starting from zero).
+ * 	@offset_in_dind_block: offset of the indirect block inside it's
+ * 	parent doubly-indirect block.
+ *
+ *      Compute the maximum no. of indirect blocks that can be read
+ *      satisfying following constraints:
+ *              - Don't read indirect blocks beyond the end of current
+ *              doubly-indirect block.
+ *              - Don't read beyond eof.
+ */
+static inline unsigned long ext3_get_max_read(const struct inode *inode,
+						  int block,
+						  int offset_in_dind_block)
+{
+	const struct super_block *sb = inode->i_sb;
+	unsigned long max_read;
+	unsigned long ptrs = EXT3_ADDR_PER_BLOCK(inode->i_sb);
+	unsigned long ptrs_bits = EXT3_ADDR_PER_BLOCK_BITS(inode->i_sb);
+	unsigned long blocks_in_file =
+		(inode->i_size + sb->s_blocksize - 1) >> sb->s_blocksize_bits;
+	unsigned long remaining_ind_blks_in_dind =
+		(ptrs >= offset_in_dind_block) ? (ptrs - offset_in_dind_block)
+					       : 0;
+	unsigned long remaining_ind_blks_before_eof =
+		((blocks_in_file - EXT3_NDIR_BLOCKS + ptrs - 1) >> ptrs_bits) -
+		((block - EXT3_NDIR_BLOCKS) >> ptrs_bits);
+
+	BUG_ON(block >= blocks_in_file);
+
+	max_read = min_t(unsigned long, remaining_ind_blks_in_dind,
+			 remaining_ind_blks_before_eof);
+
+	BUG_ON(max_read < 1);
+
+	return max_read;
+}
+
+static void ext3_read_indblocks_submit(struct bio **pbio,
+					struct ext3_ind_read_info **pread_info,
+					int *read_cnt, int seq_prefetch)
+{
+	struct bio *bio = *pbio;
+	struct ext3_ind_read_info *read_info = *pread_info;
+
+	BUG_ON(*read_cnt < 1);
+
+	read_info->seq_prefetch = seq_prefetch;
+	read_info->count = *read_cnt;
+	read_info->size = bio->bi_size;
+	bio->bi_private = read_info;
+	bio->bi_end_io = ext3_ind_read_end_bio;
+	submit_bio(READ, bio);
+
+	*pbio = NULL;
+	*pread_info = NULL;
+	*read_cnt = 0;
+}
+
+/*
+ * ext3_read_indblocks_async --
+ *      @sb:            super block
+ *      @ind_blocks[]:  array of indirect block numbers on disk
+ *      @count:         maximum number of indirect blocks to read
+ *      @first_bh:      buffer_head for indirect block ind_blocks[0], may be
+ *                      NULL
+ *      @seq_prefetch:  if this is part of a sequential prefetch and buffers'
+ *                      prefetch bit must be set.
+ *      @blocks_done:   number of blocks considered for prefetching.
+ *
+ *      Issue a single bio request to read up to count buffers identified in
+ *      ind_blocks[]. Fewer than count buffers may be read in some cases:
+ *      - If a buffer is found to be uptodate and its prefetch bit is set, we
+ *      don't look at any more buffers as they will most likely be in the
+ *      cache.
+ *      - We skip buffers we cannot lock without blocking (except for first_bh
+ *      if specified).
+ *      - We skip buffers beyond a certain range on disk.
+ *
+ *      This function must issue read on first_bh if specified unless of course
+ *      it's already uptodate.
+ */
+static int ext3_read_indblocks_async(struct super_block *sb,
+				     __le32 ind_blocks[], int count,
+				     struct buffer_head *first_bh,
+				     int seq_prefetch,
+				     unsigned long *blocks_done)
+{
+	struct buffer_head *bh;
+	struct bio *bio = NULL;
+	struct ext3_ind_read_info *read_info = NULL;
+	int read_cnt = 0, blk;
+	ext3_fsblk_t prev_blk = 0, io_start_blk = 0, curr;
+	int err = 0;
+
+	BUG_ON(count < 1);
+	/* Don't move this to ext3_get_max_read() since callers often need to
+	 * trim the count returned by that function. So this bound must only
+	 * be imposed at the last moment. */
+	count = min_t(unsigned long, count, EXT3_IND_READ_MAX);
+	*blocks_done = 0UL;
+
+	if (count == 1 && first_bh) {
+		lock_buffer(first_bh);
+		get_bh(first_bh);
+		first_bh->b_end_io = end_buffer_read_sync;
+		submit_bh(READ, first_bh);
+		*blocks_done = 1UL;
+		return 0;
+	}
+
+	for (blk = 0; blk < count; blk++) {
+		curr = le32_to_cpu(ind_blocks[blk]);
+
+		if (!curr)
+			continue;
+
+		if (io_start_blk > 0) {
+			if (max(io_start_blk, curr) - min(io_start_blk, curr) >=
+					EXT3_IND_READ_MAX)
+				continue;
+		}
+
+		if (prev_blk > 0 && curr != prev_blk + 1) {
+			ext3_read_indblocks_submit(&bio, &read_info,
+						&read_cnt, seq_prefetch);
+			prev_blk = 0;
+			break;
+		}
+
+		if (blk == 0 && first_bh) {
+			bh = first_bh;
+			get_bh(first_bh);
+		} else {
+			bh = sb_getblk(sb, curr);
+			if (unlikely(!bh)) {
+				err = -ENOMEM;
+				goto failure;
+			}
+		}
+
+		if (buffer_uptodate(bh)) {
+			if (ext3_buffer_prefetch(bh)) {
+				brelse(bh);
+				break;
+			}
+			brelse(bh);
+			continue;
+		}
+
+		/* Lock the buffer without blocking, skipping any buffers
+		 * which would require us to block. first_bh when specified is
+		 * an exception as caller typically wants it to be read for
+		 * sure (e.g., ext3_read_indblocks_sync).
+		 */
+		if (bh == first_bh) {
+			lock_buffer(bh);
+		} else if (test_set_buffer_locked(bh)) {
+			brelse(bh);
+			continue;
+		}
+
+		/* Check again with the buffer locked. */
+		if (buffer_uptodate(bh)) {
+			if (ext3_buffer_prefetch(bh)) {
+				unlock_buffer(bh);
+				brelse(bh);
+				break;
+			}
+			unlock_buffer(bh);
+			brelse(bh);
+			continue;
+		}
+
+		if (read_cnt == 0) {
+			/* read_info freed in ext3_ind_read_end_bio(). */
+			read_info = kmalloc(EXT3_IND_READ_INFO_SIZE(count),
+					    GFP_KERNEL);
+			if (unlikely(!read_info)) {
+				err = -ENOMEM;
+				goto failure;
+			}
+
+			bio = bio_alloc(GFP_KERNEL, count);
+			if (unlikely(!bio)) {
+				err = -ENOMEM;
+				goto failure;
+			}
+			bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
+			bio->bi_bdev = bh->b_bdev;
+		}
+
+		if (bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh))
+				< bh->b_size) {
+			brelse(bh);
+			if (read_cnt == 0)
+				goto failure;
+
+			break;
+		}
+
+		read_info->bh[read_cnt++] = bh;
+
+		prev_blk = curr;
+		if (io_start_blk == 0)
+			io_start_blk = curr;
+	}
+
+	if (read_cnt == 0)
+		goto done;
+
+	ext3_read_indblocks_submit(&bio, &read_info, &read_cnt, seq_prefetch);
+
+	*blocks_done = blk;
+	return 0;
+
+failure:
+	while (--read_cnt >= 0) {
+		unlock_buffer(read_info->bh[read_cnt]);
+		brelse(read_info->bh[read_cnt]);
+	}
+
+done:
+	if (read_info)
+		kfree(read_info);
+
+	if (bio)
+		bio_put(bio);
+
+	return err;
+}
+
+/*
+ * ext3_read_indblocks_sync --
+ *      @sb:            super block
+ *      @ind_blocks[]:  array of indirect block numbers on disk
+ *      @count:         maximum number of indirect blocks to read
+ *      @first_bh:      buffer_head for indirect block ind_blocks[0], must be
+ *                      non-NULL.
+ *      @seq_prefetch:  set prefetch bit of buffers, used when this is part of
+ *                      a sequential prefetch.
+ *      @blocks_done:   number of blocks considered for prefetching.
+ *
+ *      Synchronously read at most count indirect blocks listed in
+ *      ind_blocks[]. This function calls ext3_read_indblocks_async() to do all
+ *      the hard work. It waits for read to complete on first_bh before
+ *      returning.
+ */
+
+static int ext3_read_indblocks_sync(struct super_block *sb,
+				    __le32 ind_blocks[], int count,
+				    struct buffer_head *first_bh,
+				    int seq_prefetch,
+				    unsigned long *blocks_done)
+{
+	int err;
+
+	BUG_ON(count < 1);
+	BUG_ON(!first_bh);
+
+	err = ext3_read_indblocks_async(sb, ind_blocks, count, first_bh,
+					seq_prefetch, blocks_done);
+	if (err)
+		return err;
+
+	wait_on_buffer(first_bh);
+	if (!buffer_uptodate(first_bh))
+		err = -EIO;
+
+	/* if seq_prefetch != 0, ext3_read_indblocks_async() sets prefetch bit
+	 * for all buffers, but the first buffer for sync IO is never a prefetch
+	 * buffer since it's needed presently so mark it so.
+	 */
+	if (seq_prefetch)
+		ext3_clear_buffer_prefetch(first_bh);
+
+	BUG_ON(ext3_buffer_prefetch(first_bh));
+
+	return err;
+}
+
+/*
+ * ext3_read_indblocks --
+ *
+ * 	@inode: inode of file
+ * 	@iblock: block number inside file (starting from 0).
+ * 	@depth: depth of path from inode to data block.
+ * 	@offsets: array of offsets within blocks identified in 'chain'.
+ * 	@chain: array of Indirect with info about all levels of blocks until
+ * 	the data block.
+ * 	@err: error pointer.
+ *
+ * 	This function is called after reading all metablocks leading to 'iblock'
+ * 	except the (singly) indirect block. It reads the indirect block if not
+ * 	already in the cache and may also prefetch next few indirect blocks.
+ * 	It uses a combination of synchronous and asynchronous requests to
+ * 	accomplish this. We do prefetching even for random reads by reading
+ * 	ahead one indirect block since reads of size >=512KB have at least 12%
+ * 	chance of spanning two indirect blocks.
+ */
+
+static Indirect *ext3_read_indblocks(struct inode *inode, int iblock,
+				     int depth, int offsets[4],
+				     Indirect chain[4], int *err)
+{
+	struct super_block *sb = inode->i_sb;
+	struct buffer_head *first_bh, *prev_bh;
+	unsigned long max_read, blocks_done = 0;
+	__le32 *ind_blocks;
+
+	/* Must have doubly indirect block for prefetching indirect blocks. */
+	BUG_ON(depth <= 2);
+	BUG_ON(!chain[depth-2].key);
+
+	*err = 0;
+
+	/* Handle first block */
+	ind_blocks = chain[depth-2].p;
+	first_bh = sb_getblk(sb, le32_to_cpu(ind_blocks[0]));
+	if (unlikely(!first_bh)) {
+		printk(KERN_ERR "Failed to get block %u for sb %p\n",
+		       le32_to_cpu(ind_blocks[0]), sb);
+		goto failure;
+	}
+
+	BUG_ON(first_bh->b_size != sb->s_blocksize);
+
+	if (buffer_uptodate(first_bh)) {
+		/* Found the buffer in cache, either it was accessed recently or
+		 * it was prefetched while reading previous indirect block(s).
+		 * We need to figure out if we need to prefetch the following
+		 * indirect blocks.
+		 */
+		if (!ext3_buffer_prefetch(first_bh)) {
+			/* Either we've seen this indirect block before while
+			 * accessing another data block, or this is a random
+			 * read. In the former case, we must have done the
+			 * needful the first time we had a cache hit on this
+			 * indirect block, in the latter case we obviously
+			 * don't need to do any prefetching.
+			 */
+			goto done;
+		}
+
+		max_read = ext3_get_max_read(inode, iblock,
+					     offsets[depth-2]);
+
+		/* This indirect block is in the cache due to prefetching and
+		 * this is its first cache hit, clear the prefetch bit and
+		 * make sure the following blocks are also prefetched.
+		 */
+		ext3_clear_buffer_prefetch(first_bh);
+
+		if (max_read >= 2) {
+			/* ext3_read_indblocks_async() stops at the first
+			 * indirect block which has the prefetch bit set which
+			 * will most likely be the very next indirect block.
+			 */
+			ext3_read_indblocks_async(sb, &ind_blocks[1],
+						  max_read - 1,
+						  NULL, 1, &blocks_done);
+		}
+
+	} else {
+		/* Buffer is not in memory, we need to read it. If we are
+		 * reading sequentially from the previous indirect block, we
+		 * have just detected a sequential read and we must prefetch
+		 * some indirect blocks for future.
+		 */
+
+		max_read = ext3_get_max_read(inode, iblock,
+					     offsets[depth-2]);
+
+		if ((ind_blocks - (__le32 *)chain[depth-2].bh->b_data) >= 1) {
+			prev_bh = sb_getblk(sb, le32_to_cpu(ind_blocks[-1]));
+			if (buffer_uptodate(prev_bh) &&
+			    !ext3_buffer_prefetch(prev_bh)) {
+				/* Detected sequential read. */
+				brelse(prev_bh);
+
+				/* Sync read indirect block, also read the next
+				 * few indirect blocks.
+				 */
+				*err = ext3_read_indblocks_sync(sb, ind_blocks,
+							 max_read, first_bh, 1,
+							 &blocks_done);
+
+				if (*err)
+					goto out;
+
+				/* In case the very next indirect block is
+				 * discontiguous by a non-trivial amount,
+				 * ext3_read_indblocks_sync() above won't
+				 * prefetch it (indicated by blocks_done < 2).
+				 * So to help sequential read, schedule an
+				 * async request for reading the next
+				 * contiguous indirect block range (which
+				 * in metaclustering case would be the next
+				 * metacluster, without metaclustering it
+				 * would be the next indirect block). This is
+				 * expected to benefit the non-metaclustering
+				 * case.
+				 */
+				if (max_read >= 2 && blocks_done < 2)
+					ext3_read_indblocks_async(sb,
+							&ind_blocks[1],
+							max_read - 1,
+							NULL, 1, &blocks_done);
+
+				goto done;
+			}
+			brelse(prev_bh);
+		}
+
+		/* Either random read, or sequential detection failed above.
+		 * We always prefetch the next indirect block in this case
+		 * whenever possible.
+		 * This is because for random reads of size ~512KB, there is
+		 * >12% chance that a read will span two indirect blocks.
+		 */
+		*err = ext3_read_indblocks_sync(sb, ind_blocks,
+						(max_read >= 2) ? 2 : 1,
+						first_bh, 0, &blocks_done);
+		if (*err)
+			goto out;
+	}
+
+done:
+	/* Reader: pointers */
+	if (!verify_chain(chain, &chain[depth - 2])) {
+		brelse(first_bh);
+		goto changed;
+	}
+	add_chain(&chain[depth - 1], first_bh,
+		  (__le32*)first_bh->b_data + offsets[depth - 1]);
+	/* Reader: end */
+	if (!chain[depth - 1].key)
+		goto out;
+
+	BUG_ON(!buffer_uptodate(first_bh));
+	return NULL;
+
+changed:
+	*err = -EAGAIN;
+	goto out;
+failure:
+	*err = -EIO;
+out:
+	if (*err) {
+		ext3_debug("Error %d reading indirect blocks\n", *err);
+		return &chain[depth - 2];
+	} else
+		return &chain[depth - 1];
+}
+
diff -uprdN linux-2.6.23mm1-clean/fs/ext3/super.c linux-2.6.23mm1-ext3mc/fs/ext3/super.c
--- linux-2.6.23mm1-clean/fs/ext3/super.c	2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/super.c	2007-11-09 16:46:29.000000000 -0800
@@ -625,6 +625,9 @@ static int ext3_show_options(struct seq_
 	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
 		seq_puts(seq, ",data=writeback");

+	if (test_opt(sb, METACLUSTER))
+		seq_puts(seq, ",metacluster");
+
 	ext3_show_quota_options(seq, sb);

 	return 0;
@@ -758,7 +761,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
 	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
-	Opt_grpquota
+	Opt_grpquota, Opt_metacluster
 };

 static match_table_t tokens = {
@@ -808,6 +811,7 @@ static match_table_t tokens = {
 	{Opt_quota, "quota"},
 	{Opt_usrquota, "usrquota"},
 	{Opt_barrier, "barrier=%u"},
+	{Opt_metacluster, "metacluster"},
 	{Opt_err, NULL},
 	{Opt_resize, "resize"},
 };
@@ -1140,6 +1144,9 @@ clear_qf_name:
 		case Opt_bh:
 			clear_opt(sbi->s_mount_opt, NOBH);
 			break;
+		case Opt_metacluster:
+			set_opt(sbi->s_mount_opt, METACLUSTER);
+			break;
 		default:
 			printk (KERN_ERR
 				"EXT3-fs: Unrecognized mount option \"%s\" "
diff -uprdN linux-2.6.23mm1-clean/include/linux/ext3_fs.h linux-2.6.23mm1-ext3mc/include/linux/ext3_fs.h
--- linux-2.6.23mm1-clean/include/linux/ext3_fs.h	2007-10-17 18:31:43.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/include/linux/ext3_fs.h	2007-11-15 12:03:48.000000000 -0800
@@ -380,6 +380,7 @@ struct ext3_inode {
 #define EXT3_MOUNT_QUOTA		0x80000 /* Some quota option set */
 #define EXT3_MOUNT_USRQUOTA		0x100000 /* "old" user quota */
 #define EXT3_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
+#define EXT3_MOUNT_METACLUSTER		0x400000 /* Indirect block clustering */

 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
@@ -493,6 +494,7 @@ struct ext3_super_block {
 #ifdef __KERNEL__
 #include <linux/ext3_fs_i.h>
 #include <linux/ext3_fs_sb.h>
+#include <linux/buffer_head.h>
 static inline struct ext3_sb_info * EXT3_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -722,6 +724,11 @@ struct dir_private_info {
 	__u32		next_hash;
 };

+/* Special bh flag used by the metacluster readahead logic. */
+enum ext3_bh_state_bits {
+	EXT3_BH_PREFETCH = BH_JBD_Sentinel,
+};
+
 /* calculate the first block number of the group */
 static inline ext3_fsblk_t
 ext3_group_first_block_no(struct super_block *sb, unsigned long group_no)
@@ -730,6 +737,24 @@ ext3_group_first_block_no(struct super_b
 		le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block);
 }

+static inline void
+ext3_set_buffer_prefetch(struct buffer_head *bh)
+{
+	set_bit(EXT3_BH_PREFETCH, &bh->b_state);
+}
+
+static inline void
+ext3_clear_buffer_prefetch(struct buffer_head *bh)
+{
+	clear_bit(EXT3_BH_PREFETCH, &bh->b_state);
+}
+
+static inline int
+ext3_buffer_prefetch(struct buffer_head *bh)
+{
+	return test_bit(EXT3_BH_PREFETCH, &bh->b_state);
+}
+
 /*
  * Special error return code only used by dx_probe() and its callers.
  */
@@ -752,6 +777,9 @@ extern int ext3_bg_has_super(struct supe
 extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
 extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode,
 			ext3_fsblk_t goal, int *errp);
+extern int ext3_new_indirect_blocks(handle_t *handle, struct inode *,
+				unsigned long group_no, unsigned long *,
+				ext3_fsblk_t new_blocks[]);
 extern ext3_fsblk_t ext3_new_blocks (handle_t *handle, struct inode *inode,
 			ext3_fsblk_t goal, unsigned long *count, int *errp);
 extern void ext3_free_blocks (handle_t *handle, struct inode *inode,
@@ -870,6 +898,31 @@ extern const struct inode_operations ext
 extern const struct inode_operations ext3_symlink_inode_operations;
 extern const struct inode_operations ext3_fast_symlink_inode_operations;

+/*
+ * ext3_get_grp_metacluster:
+ *
+ * 	Determines metacluster block range for all block groups of the file
+ * 	system.
+ *
+ * 	Number of metacluster blocks = blocks_per_group/128. This allows us
+ * 	to fit all indirect blocks in a block group with average file size of
+ * 	256KB into the group's metacluster. We want to avoid having large
+ * 	metaclusters because then we'll run out of data blocks sooner, and
+ * 	once out of data blocks, metaclustering goes for a toss.
+ * 	
+ */
+static inline void
+ext3_get_grp_metacluster(struct super_block *sb,
+				ext3_grpblk_t *mc_start,
+				ext3_grpblk_t *mc_end)	/* exclusive */
+{
+	*mc_start = EXT3_BLOCKS_PER_GROUP(sb) / 2;
+	if (test_opt(sb, METACLUSTER)) {
+		*mc_end = *mc_start + (EXT3_BLOCKS_PER_GROUP(sb) >> 7);
+	} else {
+		*mc_end = *mc_start;
+	}
+}

 #endif	/* __KERNEL__ */

diff -uprdN linux-2.6.23mm1-clean/include/linux/jbd.h linux-2.6.23mm1-ext3mc/include/linux/jbd.h
--- linux-2.6.23mm1-clean/include/linux/jbd.h	2007-10-17 18:31:43.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/include/linux/jbd.h	2007-11-09 16:46:29.000000000 -0800
@@ -294,6 +294,7 @@ enum jbd_state_bits {
 	BH_State,		/* Pins most journal_head state */
 	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
 	BH_Unshadow,		/* Dummy bit, for BJ_Shadow wakeup filtering */
+	BH_JBD_Sentinel,	/* Start bit for clients of jbd */
 };

 BUFFER_FNS(JBD, jbd)


* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-16  5:02 [PATCH] Clustering indirect blocks in Ext3 Abhishek Rai
@ 2007-11-16  7:02 ` Andrew Morton
  2007-11-16  7:37   ` Matt Mackall
                     ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Andrew Morton @ 2007-11-16  7:02 UTC (permalink / raw)
  To: Abhishek Rai; +Cc: Andreas Dilger, linux-kernel, Ken Chen, Mike Waychison

On Thu, 15 Nov 2007 21:02:46 -0800 "Abhishek Rai" <abhishekrai@google.com> wrote:

> (This patch was previously posted on linux-ext4 where Andreas Dilger
> offered some valuable comments on it).
> 
> This patch modifies the block allocation strategy in ext3 in order to
> improve fsck performance. This was initially sent out as a patch for
> ext2, but given the lack of ongoing development on ext2, I have
> crossported it to ext3 instead. Slow fsck is not a serious problem on
> ext3 due to journaling, but once in a while users do need to run full
> fsck on their ext3 file systems. This can be due to several reasons:
> (1) bad disk, bad crash, etc, (2) bug in jbd/ext3, and (3) every few
> reboots, it's good to run fsck anyway. This patch will help reduce
> full fsck time for ext3. I've seen 50-65% reduction in fsck time when
> using this patch on a near-full file system. With some fsck
> optimizations, this figure becomes 80%.
> 
> Most of Ext3 metadata is clustered on disk. For example, Ext3
> partitions the block space into block groups and stores the metadata
> for each block group (inode table, block bitmap, inode bitmap) at the
> beginning of the block group. Clustering related metadata together not
> only helps ext3 I/O performance by keeping data and related metadata
> close together, but also helps fsck since it is able to find all the
> metadata in one place. However, indirect blocks are an exception.
> Indirect blocks are allocated on-demand and are spread out along with
> the data. This layout enables good I/O performance due to the close
> proximity between an indirect block and its data blocks but it makes
> things difficult for fsck which must now rotate almost the entire disk
> in order to read all indirect blocks. In fact, our measurements have
> indicated that for most ext3 disks on which fsck takes a long time,
> >80% of the time is spent reading indirect blocks. So speeding up
> indirect block read accesses in fsck can significantly improve fsck
> times.
> 
> One solution to this problem implemented in this patch is to cluster
> indirect blocks together on a per group basis, similar to how inodes
> and bitmaps are clustered.

So we have a section of blocks around the middle of the blockgroup which
are used for indirect blocks.

Presumably it starts around 50% of the way into the blockgroup?

How do you decide its size?

What happens when it fills up but we still have room for more data blocks
in that blockgroup?

Can this reserved area cause disk space wastage (all data blocks used,
metacluster area not yet full)?

The file data block allocator now needs to avoid allocating blocks from
inside this reserved area.  How is this implemented?  It is awfully similar
to the existing reservations code - does it utilise that code?

> Indirect block clusters (metaclusters) help
> fsck performance by enabling fsck to fetch all indirect blocks by
> reading from a few locations on the disk instead of rotating through
> the entire disk. Unfortunately, a naive clustering scheme for indirect
> blocks can hurt I/O performance, as it separates out indirect blocks
> and corresponding direct blocks on the disk. So an I/O to a direct
> block whose indirect block is not in the page cache now needs to incur
> a longer seek+rotational delay in moving the disk head from the
> indirect block to the direct block.
> 
> So our goal then is to implement metaclustering without having any
> impact (<0.1%) on I/O performance. Fortunately, the current ext3 I/O
> algorithm is not the most efficient; improving it can camouflage the
> performance hit we suffer due to metaclustering. In fact,
> metaclustering automatically enables one such optimization. When doing
> sequential read from a file and reading an indirect block for it, we
> readahead several indirect blocks of the file from the same
> metacluster. Moreover, when possible we do this asynchronously. This
> reduces the seek+rotational latency associated with seeking between
> data and indirect blocks during a (long) sequential read.
> 
> There is one more design choice that affects the performance of this
> patch: location and number of metaclusters per block group. Currently
> we have one metacluster per block group and it is located at the
> center of the block group. We adopted this scheme after evaluating
> three possible locations of metaclusters: beginning, middle, and end
> of block group. We did not evaluate configurations with >1 metacluster
> per block group. In our experiments, the middle configuration did not
> cause any performance degradation for sequential and random reads.
> Whereas putting the metacluster at the beginning of the block group
> yields best performance for sequential reads (write performance is
> unaffected by this change), putting it in the middle helps random
> reads. Since the "middle path" maintains status quo, we adopted that
> in our change.
> 
> Performance evaluation results:
> Setup:
> RAM: 8GB
> Disk: 400GB disk.
> CPU: Dual core hyperthreaded
> 
> All measurements were taken 10 times or more, until the standard deviation
> was <2%. The machine was rebooted between runs and the file system freshly
> formatted; we also made sure that nothing else was running on the
> machine at the time of the test.
> 
> Notation:
> - 'vanilla': regular ext3 without any changes
> - 'mc': metaclustering ext3
> 
> Benchmark 1: Sequential write to a 10GB file followed by 'sync'
> 1. vanilla:
>   Total: 3m9.0s
>   User: 0.08
>   System: 23s-48s (very high variance)

hm, system time variance is weird.  You might have found an ext3 bug (or a
cpu time accounting bug).

Execution profiling would tell, I guess.

> 2. mc:
>   Total: 3m6.1s
>   User: 0.08s
>   System: 48.1s
>
> Benchmark 2: Sequential read from a 10GB file.
> Description: the file is created using the same type of ext3 (mc or vanilla)
> 1. vanilla:
>   Total: 3m14.5s
>   User: 0.04s
>   System: 13.4s
> 2. mc:
>   Total: 3m14.5s
>   User: 0.04s
>   System: 13.3s
> 
> Benchmark 3: Random read from a 300GB file
> Description: read using 512-byte chunks, 5MB read in total
> 1. vanilla:
>   Total: 3m56.4s
>   User: ~0
>   System: 0.6s
> 2. mc:
>   Total: 3m51.4s
>   User: ~0
>   System: 0.8s
> 
> Benchmark 4: Random read from a 300GB file
> Description: read using 512KB chunks, totaling 1% of the file size
> 1. vanilla:
>   Total: 4m46.3s
>   User: ~0
>   System: 3.9s
> 2. mc:
>   Total: 4m44.4s
>   User: ~0
>   System: 3.9s
> 
> Benchmark 5: fsck
> Description: Prepare a newly formatted 400GB disk as follows: create
> 200 files of 0.5GB each, 100 files of 1GB each, 40 files of 2.5GB each,
> and 10 files of 10GB each. fsck command line: fsck -f -n
> 1. vanilla:
>   Total: 12m18.1s
>   User: 15.9s
>   System: 18.3s
> 2. mc:
>   Total: 4m47.0s
>   User: 16.0s
>   System: 17.1s
> 

They're large files.  It would be interesting to see what the numbers are
for more and smaller files.

> 
> Benchmark 6: kernbench (this was done on an 8cpu machine with 32GB RAM)
> 1. vanilla:
>   Elapsed: 35.60
>   User: 228.79
>   System: 21.10
> 2. mc:
>   Elapsed: 35.12
>   User: 228.47
>   System: 21.08
> 
> Note:
> 1. This patch does not affect ext3 on-disk layout compatibility in any
> way. Existing disks continue to work with new code, and disks modified
> by new code continue to work with existing machines. In contrast, the
> extents patch will also probably solve this problem but it breaks on-disk
> compatibility.
> 2. Metaclustering is a mount time option (-o metacluster). This option
> only affects the write path, when this option is specified indirect
> blocks are allocated in clusters, when it is not specified they are
> allocated alongside data blocks. The read path is unaffected by the
> option, read behavior depends on the data layout on disk - if read
> discovers metaclusters on disk it will do prefetching otherwise it
> will not.
> 3. e2fsck speedup with metaclustering varies from disk
> to disk with most benefit coming from disks which have a large number
> of indirect blocks. For disks which have few indirect blocks, fsck
> usually doesn't take too long anyway and hence it's OK not to deliver
> a huge speedup there. But in all cases, metaclustering doesn't cause
> any degradation in IO performance as seen in the benchmarks above.

Less speedup, for more-and-smaller files, it appears.

An important question is: how does it stand up over time?  Simply laying
files out a single time on a fresh fs is the easy case.  But what happens
if that disk has been in continuous create/delete/truncate/append usage for
six months?

> 
> [implementation]
>

We can get onto that later ;)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-16  7:02 ` Andrew Morton
@ 2007-11-16  7:37   ` Matt Mackall
  2007-11-18 15:52     ` Abhishek Rai
  2007-11-16 11:28   ` Andreas Dilger
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Matt Mackall @ 2007-11-16  7:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Abhishek Rai, Andreas Dilger, linux-kernel, Ken Chen, Mike Waychison

On Thu, Nov 15, 2007 at 11:02:19PM -0800, Andrew Morton wrote:
> On Thu, 15 Nov 2007 21:02:46 -0800 "Abhishek Rai" <abhishekrai@google.com> wrote:
...
> > 3. e2fsck speedup with metaclustering varies from disk
> > to disk with most benefit coming from disks which have a large number
> > of indirect blocks. For disks which have few indirect blocks, fsck
> > usually doesn't take too long anyway and hence it's OK not to deliver
> > a huge speedup there. But in all cases, metaclustering doesn't cause
> > any degradation in IO performance as seen in the benchmarks above.
> 
> Less speedup, for more-and-smaller files, it appears.
> 
> An important question is: how does it stand up over time?  Simply laying
> files out a single time on a fresh fs is the easy case.  But what happens
> if that disk has been in continuous create/delete/truncate/append usage for
> six months?

Try Chris Mason's compilebench, which is a decent aging simulation.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-16  7:02 ` Andrew Morton
  2007-11-16  7:37   ` Matt Mackall
@ 2007-11-16 11:28   ` Andreas Dilger
  2007-11-16 21:11   ` Theodore Tso
  2007-11-16 22:27   ` Abhishek Rai
  3 siblings, 0 replies; 21+ messages in thread
From: Andreas Dilger @ 2007-11-16 11:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Abhishek Rai, linux-kernel, Ken Chen, Mike Waychison

On Nov 15, 2007  23:02 -0800, Andrew Morton wrote:
> So we have a section of blocks around the middle of the blockgroup which
> are used for indirect blocks.
> 
> Presumably it starts around 50% of the way into the blockgroup?
> 
> An important question is: how does it stand up over time?  Simply laying
> files out a single time on a fresh fs is the easy case.  But what happens
> if that disk has been in continuous create/delete/truncate/append usage for
> six months?

In the ext4-devel discussion, I asked about placement of the reserved
blocks.  Placement at the beginning of the group showed at worst
marginally less performance and in some cases better performance.
I suspect putting the reserved blocks at the beginning of the group
would have a better long-term effect on performance because they are
not in the middle of large contiguous allocations in the middle of
the group.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-16  7:02 ` Andrew Morton
  2007-11-16  7:37   ` Matt Mackall
  2007-11-16 11:28   ` Andreas Dilger
@ 2007-11-16 21:11   ` Theodore Tso
  2007-11-17  0:25     ` Abhishek Rai
  2007-11-16 22:27   ` Abhishek Rai
  3 siblings, 1 reply; 21+ messages in thread
From: Theodore Tso @ 2007-11-16 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Abhishek Rai, Andreas Dilger, linux-kernel, Ken Chen, Mike Waychison

On Thu, Nov 15, 2007 at 11:02:19PM -0800, Andrew Morton wrote:
> 
> Presumably it starts around 50% of the way into the blockgroup?

Yes.

> How do you decide its size?

It's fixed at 1/128th (0.78%) of the blockgroup.

> What happens when it fills up but we still have room for more data blocks
> in that blockgroup?

It does fall back, but it does so starting from the beginning of the
block group by using the old-style allocation routines if it can't
find any space in the metacluster region.  What I'd suggest that it do
instead is to start searching from the end of metacluster region, and
then wrap around to the beginning of the block group, and then if it
can't find any blocks when it reaches the beginning of the metacluster
region, then go to the next block group that would be used by
ext3_new_blocks(), and start searching in the metacluster region ---
that way a smart e2fsck that is doing clustering could just arrange to
pre-read the metacluster region for each block group, and if it finds
an indirect block that is another block group's metacluster region, it
could try reading in those blocks too.

In order to do this, I'd suggest considering to fold ext3_new_blocks
and ext3_new_indirect_blocks() into the same function, with just a
passed-in flag to indicate whether for each block group the
metacluster region or the non-metacluster region should be searched
first.  This would also eliminate some duplicated code.

> Can this reserved area cause disk space wastage (all data blocks used,
> metacluster area not yet full).

No, not as far as I can see.

> Less speedup, for more-and-smaller files, it appears.
> 
> An important question is: how does it stand up over time?  Simply laying
> files out a single time on a fresh fs is the easy case.  But what happens
> if that disk has been in continuous create/delete/truncate/append usage for
> six months?

Another question is how does it stand up if the average size of files
is different from what you anticipate?  If the files are bigger than
you expect, or smaller than you expect, then the ratio of indirect
blocks to data blocks will be different, at which point allocations
won't be perfectly split between the metacluster and data regions.

For this reason, the exact size of the metacluster region should
probably be a superblock tunable --- and once we have the superblock
tunable, I'd use the non-zero metacluster size to determine whether or
not to enable this feature, and not to use a mount option.  Mount
options really should be avoided whenever possible, in favor of
settings in the superblock.
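
A minimal sketch of what a superblock-driven setting could look like,
purely for illustration (s_mc_blocks_per_group is a hypothetical field,
not an existing ext3 superblock member; zero would mean "metaclustering
off"):

	/*
	 * Hypothetical: read the metacluster size from the superblock
	 * instead of taking a mount option.  s_mc_blocks_per_group is an
	 * invented field name used only for illustration.
	 */
	static unsigned int ext3_mc_blocks_per_group(struct super_block *sb)
	{
		struct ext3_sb_info *sbi = EXT3_SB(sb);
		unsigned int mc = le16_to_cpu(sbi->s_es->s_mc_blocks_per_group);

		if (mc >= EXT3_BLOCKS_PER_GROUP(sb))
			return 0;	/* treat nonsense values as "off" */
		return mc;		/* 0 means metaclustering disabled */
	}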

						- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-16  7:02 ` Andrew Morton
                     ` (2 preceding siblings ...)
  2007-11-16 21:11   ` Theodore Tso
@ 2007-11-16 22:27   ` Abhishek Rai
  3 siblings, 0 replies; 21+ messages in thread
From: Abhishek Rai @ 2007-11-16 22:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andreas Dilger, linux-kernel, Ken Chen, Mike Waychison

On Nov 15, 2007 11:02 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 15 Nov 2007 21:02:46 -0800 "Abhishek Rai" <abhishekrai@google.com> wrote:
> > One solution to this problem implemented in this patch is to cluster
> > indirect blocks together on a per group basis, similar to how inodes
> > and bitmaps are clustered.
>
> So we have a section of blocks around the middle of the blockgroup which
> are used for indirect blocks.
>
> Presumably it starts around 50% of the way into the blockgroup?
>
> How do you decide its size?

There are a couple of factors to consider when choosing a size:
1. The size cannot be too small, or the metacluster will fill up too
quickly and then we'll have to fall back to regular indirect block
allocation. E.g., if the average file size in a block group is
512KB, a default block group having 32K blocks of 4KB each will need
~256 indirect blocks, one for each file.
2. If the number of metacluster blocks is too high, there will be less
space for data block allocation, making it more likely that we run
out of data blocks and start using blocks from the metacluster, which
makes metaclustering useless.

Considering these factors, I think we should have <1% of blocks
reserved for the metacluster. The current patch uses (blocks_per_group
/ 128).
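
For what it's worth, a quick back-of-the-envelope check of those
numbers (illustrative user-space arithmetic only, not code from the
patch):

	/*
	 * Sanity check of the sizing above, assuming the ext3 defaults
	 * mentioned in this thread: 4KB blocks, 32768 blocks per group,
	 * metacluster = blocks_per_group / 128.
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned long block_size = 4096;
		unsigned long blocks_per_group = 32768;
		unsigned long mc_blocks = blocks_per_group / 128;	/* 256 */

		/* The group holds 128MB of data; at 512KB per file that is
		 * ~256 files, and every file >48KB needs one indirect block. */
		unsigned long files = (blocks_per_group * block_size) /
					(512 * 1024);			/* 256 */

		printf("metacluster blocks: %lu, indirect blocks needed: ~%lu\n",
			mc_blocks, files);
		return 0;
	}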

> What happens when it fills up but we still have room for more data blocks
> in that blockgroup?

Metaclustering is honored only as long as we have free data blocks and
free metacluster blocks. If we run out of either, we start using the
other. Of course, once that happens indirect blocks may not be
clustered anymore.

> Can this reserved area cause disk space wastage (all data blocks used,
> metacluster area not yet full).

No, because of the above reason.

> The file data block allocator now needs to avoid allocating blocks from
> inside this reserved area.  How is this implemented?  It is awfully similar
> to the existing reservations code - does it utilise that code?

It is actually much simpler than the reservation code, so I haven't
used it. The logic is implemented in <20 lines in
ext3_try_to_allocate().
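
For illustration, the core of that check is roughly the sketch below
(the field names s_nonmc_blocks_per_group and bgi_free_nonmc_blocks_count
follow the revised patch posted later in this thread; treat this as a
sketch of the idea, not the exact hunk):

	/*
	 * Sketch: a data allocation may land inside the metacluster region
	 * only when the non-metacluster part of the group is (almost) out
	 * of free blocks.  The real code in ext3_try_to_allocate() differs
	 * in detail.
	 */
	static int mc_region_allowed(struct ext3_sb_info *sbi,
				     struct ext3_bg_info *bgi,
				     ext3_grpblk_t blk)
	{
		if (blk < sbi->s_nonmc_blocks_per_group)
			return 1;	/* outside the metacluster: always OK */
		return bgi->bgi_free_nonmc_blocks_count < 8;
	}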

>
> > Notation:
> > - 'vanilla': regular ext3 without any changes
> > - 'mc': metaclustering ext3
> >
> > Benchmark 1: Sequential write to a 10GB file followed by 'sync'
> > 1. vanilla:
> >   Total: 3m9.0s
> >   User: 0.08
> >   System: 23s-48s (very high variance)
>
> hm, system time variance is weird.  You might have found an ext3 bug (or a
> cpu time accounting bug).
>
> Excecution profiling would tell, I guess.

OK, I'll investigate this further.

> > Benchmark 5: fsck
> > Description: Prepare a newly formatted 400GB disk as follows: create
> > 200 files of 0.5GB each, 100 files of 1GB each, 40 files of 2.5GB each,
> > and 10 files of 10GB each. fsck command line: fsck -f -n
> > 1. vanilla:
> >   Total: 12m18.1s
> >   User: 15.9s
> >   System: 18.3s
> > 2. mc:
> >   Total: 4m47.0s
> >   User: 16.0s
> >   System: 17.1s
> >
>
> They're large files.  It would be interesting to see what the numbers are
> for more and smaller files.
>

kernbench below shows the behavior with small files. I'll also post
results from running
compilebench.

> >
> > Benchmark 6: kernbench (this was done on an 8cpu machine with 32GB RAM)
> > 1. vanilla:
> >   Elapsed: 35.60
> >   User: 228.79
> >   System: 21.10
> > 2. mc:
> >   Elapsed: 35.12
> >   User: 228.47
> >   System: 21.08
> >
> > Note:
> > 1. This patch does not affect ext3 on-disk layout compatibility in any
> > way. Existing disks continue to work with new code, and disks modified
> > by new code continue to work with existing machines. In contrast, the
> > extents patch will also probably solve this problem but it breaks on-disk
> > compatibility.
> > 2. Metaclustering is a mount time option (-o metacluster). This option
> > only affects the write path, when this option is specified indirect
> > blocks are allocated in clusters, when it is not specified they are
> > allocated alongside data blocks. The read path is unaffected by the
> > option, read behavior depends on the data layout on disk - if read
> > discovers metaclusters on disk it will do prefetching otherwise it
> > will not.
> > 3. e2fsck speedup with metaclustering varies from disk
> > to disk with most benefit coming from disks which have a large number
> > of indirect blocks. For disks which have few indirect blocks, fsck
> > usually doesn't take too long anyway and hence it's OK not to deliver
> > a huge speedup there. But in all cases, metaclustering doesn't cause
> > any degradation in IO performance as seen in the benchmarks above.
>
> Less speedup, for more-and-smaller files, it appears.

Not necessarily. If a lot of files use indirect blocks, which happens when
file length is >48KB on a 4KB-blocksize file system (12 direct block
pointers * 4KB = 48KB), then we have a lot of indirect blocks to read
during fsck and hence this patch will be useful. But if most files are
<= 48KB, then the speedup is smaller or none, of course.

>
> An important question is: how does it stand up over time?  Simply laying
> files out a single time on a fresh fs is the easy case.  But what happens
> if that disk has been in continuous create/delete/truncate/append usage for
> six months?

I'll post results of running compilebench shortly.

> >
> > [implementation]
> >
>
> We can get onto that later ;)
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-16 21:11   ` Theodore Tso
@ 2007-11-17  0:25     ` Abhishek Rai
  2007-11-17  2:58       ` Theodore Tso
  0 siblings, 1 reply; 21+ messages in thread
From: Abhishek Rai @ 2007-11-17  0:25 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, Abhishek Rai, Andreas Dilger,
	linux-kernel, Ken Chen, Mike Waychison

Thanks for the great feedback.

On Nov 16, 2007 1:11 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Thu, Nov 15, 2007 at 11:02:19PM -0800, Andrew Morton wrote:
> > What happens when it fills up but we still have room for more data blocks
> > in that blockgroup?
>
> It does fall back, but it does so starting from the beginning of the
> block group by using the old-style allocation routines if it can't
> find any space in the metacluster region.  What I'd suggest that it do
> instead is to start searching from the end of metacluster region, and
> then wrap around to the beginning of the block group,

When we fall back to old-style allocation, new blocks get allocated next
to the goal and are thus co-located with the corresponding data blocks.

> and then if it
> can't find any blocks when it reaches the beginning of the metacluster
> region, then go to the next block group that would be used by
> ext3_new_blocks(), and start searching in the metacluster region ---
> that way a smart e2fsck that is doing clustering could just arrange to
> pre-read the metacluster region for each block group, and if it finds
> an indirect block that is another block group's metacluster region, it
> could try reading in those blocks too.

Ideally, this is how things should be done, but I feel in practice, it
will make little difference. To summarize, the difference between
my approach and above approach is that when out of free blocks in a
block group while allocating indirect block, the above approach repeats
the same allocation algorithm in the next block group, while I fully
fall back to old-style allocation meaning the indirect block gets
co-located with the data block in the next block group with a free
block. In practice, this will make a difference only for one indirect
block as from next request onwards the goal will be updated to the new
group making the behavior like what you propose. Still, I think your
suggestion is cleaner and I'll change to that.

>
> In order to do this, I'd suggest considering to fold ext3_new_blocks
> and ext3_new_indirect_blocks() into the same function, with just a
> passed-in flag to indicate whether for each block group the
> metacluster region or the non-metacluster region should be searched
> first.  This would also eliminate some duplicated code.

Makes sense, will do.

>
> > Can this reserved area cause disk space wastage (all data blocks used,
> > metacluster area not yet full).
>
> No, not as far as I can see.
>
> > Less speedup, for more-and-smaller files, it appears.
> >
> > An important question is: how does it stand up over time?  Simply laying
> > files out a single time on a fresh fs is the easy case.  But what happens
> > if that disk has been in continuous create/delete/truncate/append usage for
> > six months?
>
> Another question is how does it stand up if the average size of files
> is different from what you anticipate?  If the files are bigger than
> you expect, or smaller than you expect, then the ratio of indirect
> blocks to data blocks will be different, at which point allocations
> won't be perfectly split between the metacluster and data regions.
>
> For this reason, the exact size of the metacluster region should
> probably be a superblock tunable --- and once we have the superblock
> tunable, I'd use the non-zero metacluster size to determine whether or
> not to enable this feature, and not to use a mount option.  Mount
> options really should be avoided whenever possible, in favor of
> settings in the superblock.
>
>                                                 - Ted
>

We initially avoided making metaclustering a superblock tunable as we
didn't want to make any changes to the on-disk format as then ext4
extents are also a good option. If metaclustering gains acceptance
it might make sense to make it a superblock tunable. However, I would
avoid putting the metacluster size into the superblock for the following
reasons. Ideally, we should not have to bother about finding the sweet
spot of metacluster size as
(1) a given file system can be used for storing different kinds
of files at different times and it would be a pain to tune it every now
and then, and
(2) it opens the possibility of the metacluster size being blamed for
unrelated ext3/fsck performance anomalies.
The user should be able to just enable metaclustering and ext3 should
take care of the rest as best as it can. That said, some type of coarse
metaclustering advice can definitely be stored in the superblock.

Allow me to propose a solution that will most likely address the above
issue and please ignore its complexity for a moment. Instead of a two
level partitioning in the block space between data blocks and
metacluster blocks, have a 3 or 4 level partitioning. E.g., a block
group with 'd' blocks can have d/32 blocks in metacluster level 1,
d/64 blocks in metacluster level 2, and d/128 blocks in metacluster
level 3 (define level 0 has having the remaining blocks = d - d/32 -
d/64 - d/128). Data block allocation starts looking for a free block
starting from the lowest possible level. If it is unable to find any
free blocks at that level in all block groups, it moves up a level and
so on. Indirect block allocation proceeds in the opposite direction
starting from higher levels. This approach has several benefits:

In traditional metaclustering, once we run out of metacluster blocks
or data blocks, all bets are off. This forces us to keep small
metaclusters in order to avoid this situation altogether. But with small
metaclusters, we cannot optimize indirect block allocation on file
systems with many small files (>48KB). There is only one glitch in
implementing this. If a block group doesn't have any free blocks at a
given level, we should be able to find that out quickly instead of
having to scan its entire bitmap. gdp->bg_free_blocks_count is not good
enough for this.
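
As a concrete illustration of the level sizes in the scheme proposed
above (the helper below is not from the patch; where each level sits
inside the group is a separate question, discussed further down the
thread):

	/*
	 * Illustration only: split a block group of 'd' blocks into the
	 * proposed levels.  Level 1 gets d/32 blocks, level 2 d/64,
	 * level 3 d/128, and level 0 whatever remains.  Data allocation
	 * would start at level 0 and overflow upwards; indirect block
	 * allocation would start at level 3 and overflow downwards.
	 */
	static void mc_level_sizes(unsigned long d, unsigned long len[4])
	{
		len[1] = d / 32;
		len[2] = d / 64;
		len[3] = d / 128;
		len[0] = d - len[1] - len[2] - len[3];
	}

	/* For d = 32768 (4KB blocks): 30976, 1024, 512 and 256 blocks. */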

Thanks,
Abhishek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-17  0:25     ` Abhishek Rai
@ 2007-11-17  2:58       ` Theodore Tso
  2007-11-17  8:58         ` Abhishek Rai
  2007-12-21 14:15         ` Abhishek Rai
  0 siblings, 2 replies; 21+ messages in thread
From: Theodore Tso @ 2007-11-17  2:58 UTC (permalink / raw)
  To: Abhishek Rai
  Cc: Andrew Morton, Andreas Dilger, linux-kernel, Ken Chen, Mike Waychison

On Fri, Nov 16, 2007 at 04:25:38PM -0800, Abhishek Rai wrote:
> Ideally, this is how things should be done, but I feel in practice, it
> will make little difference. To summarize, the difference between
> my approach and above approach is that when out of free blocks in a
> block group while allocating indirect block, the above approach repeats
> the same allocation algorithm in the next block group, while I fully
> fall back to old-style allocation meaning the indirect block gets
> co-located with the data block in the next block group with a free
> block.

Well, also I suggested that if the metacluster region is full, that it
attempt to find a block starting at end of the metacluster region and
then wrap around, instead of starting at the beginning of the block
group.  That way it's more likely that subsequent metadata block is
nearer to the previous metadata blocks.

> In practice, this will make a difference only for one indirect
> block as from next request onwards the goal will be updated to the new
> group making the behavior like what you propose. Still, I think your
> suggestion is cleaner and I'll change to that.

The practice of starting the search in the next block group's
metadata area only makes a difference for one indirect block, yes, but
it's the right thing to do.  And if you fold the ext3_new_blocks and
ext3_new_indirect_blocks(), it's really not that hard.  You can
basically do something like this:

	/* Each hex digit names a search region: 1 = before the
	 * metacluster, 2 = the metacluster itself, 3 = after the
	 * metacluster.  The lowest digit is searched first, so metadata
	 * tries the metacluster first and data tries it last. */
	if (alloc_for_metadata)
		strategy = 0x132;
	else
		strategy = 0x231;
	for (; strategy; strategy = strategy >> 4) {
		switch (strategy & 0xF) {
		case 1:
			start = block_group_start;
			end = mc_start - 1;
			break;
		case 2:
			start = mc_start;
			end = mc_end;
			break;
		case 3:
			start = mc_end + 1;
			end = block_group_end;
			break;
		}
		<search region between start.. end>
	}

> We initially avoided making metaclustering a superblock tunable as we
> didn't want to make any changes to the on-disk format as then ext4
> extents are also a good option.

Allocating a superblock field is no big deal.  I'll note further that
metaclustering is not necessarily mutually exclusive with ext4
extents.  Allocating the extent tree data blocks out of the
metacluster blocks can be a good idea, depending on the average size
of the blocks and how fragmented the filesystem gets (and hence how
many contiguous extents can be expected).  If the filesystem is
storing lots of really big files where being contiguous across
multiple blockgroups are productive, then the metacluster area would
actually be counterproductive.  And if files are all small so the
extents fit the inode, the metadata cluster area wouldn't be necessary
at all.  But if there are multiple external extent blocks in a block
group, it would be useful for them to be allocated together.  

> If metaclustering gains acceptance
> it might make sense to make it a superblock tunable. However, I would
> avoid putting the metacluster size into the superblock for the following
> reasons. Ideally, we should not have to bother about finding the sweet
> spot of metacluster size as
> (1) a given file system can be used for storing different kinds
> of files at different times and it would be a pain to tune it every now
> and then, and

Yes, it doesn't make sense to retune the filesystem.  I was assuming
that this would only be done at mke2fs time.

> (2) it opens the possibility of the metacluster size being blamed for
> unrelated ext3/fsck performance anomalies.

I'm not sure I understand your concern.  The reality is that 99% of
the time users will never change it from the defaults, but making it
tunable makes it much, much easier for us to try various experiments
to determine what is the best initial value for different workloads.
What might get used for a Usenet news spool or a Squid cache might be
quite different from series of DVD image files.

> Allow me to propose a solution that will most likely address the above
> issue and please ignore its complexity for a moment. Instead of a two
> level partitioning in the block space between data blocks and
> metacluster blocks, have a 3 or 4 level partitioning. E.g., a block
> group with 'd' blocks can have d/32 blocks in metacluster level 1,
> d/64 blocks in metacluster level 2, and d/128 blocks in metacluster
> level 3 (define level 0 has having the remaining blocks = d - d/32 -
> d/64 - d/128). Data block allocation starts looking for a free block
> starting from the lowest possible level. If it is unable to find any
> free blocks at that level in all block groups, it moves up a level and
> so on. Indirect block allocation proceeds in the opposite direction
> starting from higher levels. This approach has several benefits:

That is clever.  Oh, one other thing.  You didn't mention what
happened when the metacluster field was placed at the end of the block
group.  I assume you tried that in your experiments; what were the
results?  The obvious thing to do to avoid further fragmentation of
the block group would be to put level 1 at the end of the block group,
level 2 just before it, and level 3 before that, and then allocate the
data blocks starting at the beginning of the block group, i.e:

+----------------------------------+---------------+---------+-------+
|     data                         | level 3       | level 2 | lvl 1 |
+----------------------------------+---------------+---------+-------+


> In traditional metaclustering, once we run out of metacluster blocks
> or data blocks, all bets are off. This forces us to keep small
> metaclusters in order to avoid this situation altogether. But with small
> metaclusters, we cannot optimize indirect block allocation on file
> systems with many small files (>48KB). There is only one glitch in
> implementing this. If a block group doesn't have any free blocks at a
> given level, we should be able to find that out quickly instead of
> having to scan its entire bitmap. gdp->bg_free_blocks_count is not good
> enough for this.

Ideally, true, but this was a defect with the original metacluster
scheme as well.  We could steal some bits in the block_group
descriptor structure to indicate whether a particular level is full,
though.  This would be another data format change that would require
e2fsprogs support, though.
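
Purely as an illustration of that idea, a per-level "full" hint might
look something like the following (bg_mc_full_map is hypothetical;
ext3_group_desc has no such field today, which is exactly the on-disk
change being discussed):

	/*
	 * Hypothetical: one bit per metacluster level in the group
	 * descriptor, set when that level has no free blocks left, so the
	 * allocator can skip the group without scanning its bitmap.  Not
	 * part of the existing ext3 format; would need e2fsprogs support.
	 */
	static int ext3_mc_level_full(struct ext3_group_desc *gdp, int level)
	{
		return le16_to_cpu(gdp->bg_mc_full_map) & (1 << level);
	}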

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-17  2:58       ` Theodore Tso
@ 2007-11-17  8:58         ` Abhishek Rai
  2007-12-21 14:15         ` Abhishek Rai
  1 sibling, 0 replies; 21+ messages in thread
From: Abhishek Rai @ 2007-11-17  8:58 UTC (permalink / raw)
  To: Theodore Tso, Abhishek Rai, Andrew Morton, Andreas Dilger,
	linux-kernel, Ken Chen, Mike Waychison

Thanks for the comments.

On Nov 16, 2007 6:58 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Fri, Nov 16, 2007 at 04:25:38PM -0800, Abhishek Rai wrote:
> > Ideally, this is how things should be done, but I feel in practice, it
> > will make little difference. To summarize, the difference between
> > my approach and above approach is that when out of free blocks in a
> > block group while allocating indirect block, the above approach repeats
> > the same allocation algorithm in the next block group, while I fully
> > fall back to old-style allocation meaning the indirect block gets
> > co-located with the data block in the next block group with a free
> > block.
>
> Well, also I suggested that if the metacluster region is full, that it
> attempt to find a block starting at end of the metacluster region and
> then wrap around, instead of starting at the beginning of the block
> group.  That way it's more likely that subsequent metadata block is
> nearer to the previous metadata blocks.

Ah ok. I think a generalization of this idea is that when we must mix
indirect blocks and data blocks, at least start the search for a free
block for them from different parts of the block group. Of course, an
added benefit of your suggestion is that the non-metaclustered
indirect blocks will likely be close to the metacluster.

I think this approach will help fsck but from a design point of view it
may not be good for IO performance. E.g., the reason sequential
read performance with metaclustering is the same as with regular ext3 is
that we prefetch co-located indirect blocks (I have verified this by
turning off prefetching). When allocating indirect
blocks outside the metacluster using the above scheme, co-location of
indirect blocks is still likely but less so. OTOH, by falling back to the
old-style allocation routines, we at least make sure that the indirect
block is close to its data block which is good for performance. I think
both approaches are pretty close. One rule of thumb we've followed during
metaclustering design is: "when in doubt, favor IO performance over
fsck performance", so I tend to lean towards the latter approach.

> [More comments which I will incorporate in the code]
>
> > Allow me to propose a solution that will most likely address the above
> > issue and please ignore its complexity for a moment. Instead of a two
> > level partitioning in the block space between data blocks and
> > metacluster blocks, have a 3 or 4 level partitioning. E.g., a block
> > group with 'd' blocks can have d/32 blocks in metacluster level 1,
> > d/64 blocks in metacluster level 2, and d/128 blocks in metacluster
> > level 3 (define level 0 has having the remaining blocks = d - d/32 -
> > d/64 - d/128). Data block allocation starts looking for a free block
> > starting from the lowest possible level. If it is unable to find any
> > free blocks at that level in all block groups, it moves up a level and
> > so on. Indirect block allocation proceeds in the opposite direction
> > starting from higher levels. This approach has several benefits:
>
> That is clever.  Oh, one other thing.  You didn't mention what
> happened when the metacluster field was placed at the end of the block
> group.  I assume you tried that in your experiments; what were the
> results?  The obvious thing to do to avoid further fragmentation of
> the block group would be to put level 1 at the end of the block group,
> level 2 just before it, and level 3 before that, and then allocate the
> data blocks starting at the beginning of the block group, i.e:
>
> +----------------------------------+---------------+---------+-------+
> |     data                         | level 3       | level 2 | lvl 1 |
> +----------------------------------+---------------+---------+-------+
>

Thanks for this nice visualization :-)

I agree with your and Andreas' concern about fragmentation due to the
current scheme of putting metacluster in the middle of the block group.
Here are some stats concerning different metacluster locations:
- Placing the metacluster at the end of the block group results in a 2%
degradation in sequential reads from large files. Putting it at the
beginning improves sequential read performance by 0.5%.
- For random reads, the beginning and ending configurations have
identical performance, which is almost the same as regular ext3's
but 1% worse than the middle configuration.
- I haven't compared the different metacluster locations for sequential
reads from small files, but in general I've found the behavior to be
very similar to random reads from a large file.

So I think putting metacluster levels at the beginning of the block group
is an obvious choice.

>
> > In traditional metaclustering, once we run out of metacluster blocks
> > or data blocks, all bets are off. This forces us to keep small
> > metaclusters in order to avoid this situation altogether. But with small
> > metaclusters, we cannot optimize indirect block allocation on file
> > systems with many small files (>48KB). There is only one glitch in
> > implementing this. If a block group doesn't have any free blocks at a
> > given level, we should be able to find that out quickly instead of
> > having to scan its entire bitmap. gdp->bg_free_blocks_count is not good
> > enough for this.
>
> Ideally, true, but this was a defect with the original metacluster
> scheme as well.  We could steal some bits in the block_group
> descriptor structure to indicate whether a particular level is full,
> though.  This would be another data format change that would require
> e2fsprogs support, though.
>
> Regards,
>
>                                                 - Ted
>

Thanks,
Abhishek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-16  7:37   ` Matt Mackall
@ 2007-11-18 15:52     ` Abhishek Rai
  2007-11-18 20:47       ` Matt Mackall
  2007-11-20 20:25       ` John Stoffel
  0 siblings, 2 replies; 21+ messages in thread
From: Abhishek Rai @ 2007-11-18 15:52 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andrew Morton, Andreas Dilger, linux-kernel, Ken Chen, Mike Waychison

Thanks for the suggestion Matt.

It took me some time to get compilebench working due to the known
issue with drop_caches due to circular lock dependency between
j_list_lock and inode_lock (compilebench triggers drop_caches quite
frequently). Here are the results for compilebench run with options
"-i 30 -r 30". I repeated the test 5 times on each of vanilla and mc
configurations.

Setup: 4 cpu, 8GB RAM, 400GB disk.

Average vanilla results
==========================================================================
intial create total runs 30 avg 46.49 MB/s (user 1.12s sys 2.25s)
create total runs 5 avg 12.90 MB/s (user 1.08s sys 1.97s)
patch total runs 4 avg 8.70 MB/s (user 0.60s sys 2.31s)
compile total runs 7 avg 21.44 MB/s (user 0.32s sys 2.95s)
clean total runs 4 avg 59.91 MB/s (user 0.05s sys 0.26s)
read tree total runs 2 avg 21.85 MB/s (user 1.12s sys 2.89s)
read compiled tree total runs 1 avg 23.47 MB/s (user 1.45s sys 4.89s)
delete tree total runs 2 avg 13.18 seconds (user 0.64s sys 1.02s)
no runs for delete compiled tree
stat tree total runs 4 avg 4.76 seconds (user 0.70s sys 0.50s)
stat compiled tree total runs 1 avg 7.84 seconds (user 0.74s sys 0.54s)

Average metaclustering results
==========================================================================
intial create total runs 30 avg 45.04 MB/s (user 1.13s sys 2.42s)
create total runs 5 avg 15.64 MB/s (user 1.08s sys 1.98s)
patch total runs 4 avg 10.50 MB/s (user 0.61s sys 3.11s)
compile total runs 7 avg 28.07 MB/s (user 0.33s sys 4.06s)
clean total runs 4 avg 83.27 MB/s (user 0.04s sys 0.27s)
read tree total runs 2 avg 21.17 MB/s (user 1.15s sys 2.91s)
read compiled tree total runs 1 avg 22.79 MB/s (user 1.38s sys 4.89s)
delete tree total runs 2 avg 9.23 seconds (user 0.62s sys 1.01s)
no runs for delete compiled tree
stat tree total runs 4 avg 4.72 seconds (user 0.71s sys 0.50s)
stat compiled tree total runs 1 avg 6.50 seconds (user 0.79s sys 0.53s)

Overall, metaclustering does better than vanilla except in a few cases.

Thanks,
Abhishek

On Nov 15, 2007 11:37 PM, Matt Mackall <mpm@selenic.com> wrote:
> On Thu, Nov 15, 2007 at 11:02:19PM -0800, Andrew Morton wrote:
> > On Thu, 15 Nov 2007 21:02:46 -0800 "Abhishek Rai" <abhishekrai@google.com> wrote:
> ...
> > > 3. e2fsck speedup with metaclustering varies from disk
> > > to disk with most benefit coming from disks which have a large number
> > > of indirect blocks. For disks which have few indirect blocks, fsck
> > > usually doesn't take too long anyway and hence it's OK not to deliver
> > > a huge speedup there. But in all cases, metaclustering doesn't cause
> > > any degradation in IO performance as seen in the benchmarks above.
> >
> > Less speedup, for more-and-smaller files, it appears.
> >
> > An important question is: how does it stand up over time?  Simply laying
> > files out a single time on a fresh fs is the easy case.  But what happens
> > if that disk has been in continuous create/delete/truncate/append usage for
> > six months?
>
> Try Chris Mason's compilebench, which is a decent aging simulation.
>
> --
> Mathematics is the supreme nostalgia of our time.
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-18 15:52     ` Abhishek Rai
@ 2007-11-18 20:47       ` Matt Mackall
  2007-11-19 10:34         ` Kyungmin Park
  2007-11-20 20:25       ` John Stoffel
  1 sibling, 1 reply; 21+ messages in thread
From: Matt Mackall @ 2007-11-18 20:47 UTC (permalink / raw)
  To: Abhishek Rai
  Cc: Andrew Morton, Andreas Dilger, linux-kernel, Ken Chen, Mike Waychison

On Sun, Nov 18, 2007 at 07:52:36AM -0800, Abhishek Rai wrote:
> Thanks for the suggestion Matt.
> 
> It took me some time to get compilebench working due to the known
> issue with drop_caches due to circular lock dependency between
> j_list_lock and inode_lock (compilebench triggers drop_caches quite
> frequently). Here are the results for compilebench run with options
> "-i 30 -r 30". I repeated the test 5 times on each of vanilla and mc
> configurations.
> 
> Setup: 4 cpu, 8GB RAM, 400GB disk.
> 
> Average vanilla results
> ==========================================================================
> intial create total runs 30 avg 46.49 MB/s (user 1.12s sys 2.25s)
> create total runs 5 avg 12.90 MB/s (user 1.08s sys 1.97s)
> patch total runs 4 avg 8.70 MB/s (user 0.60s sys 2.31s)
> compile total runs 7 avg 21.44 MB/s (user 0.32s sys 2.95s)
> clean total runs 4 avg 59.91 MB/s (user 0.05s sys 0.26s)
> read tree total runs 2 avg 21.85 MB/s (user 1.12s sys 2.89s)
> read compiled tree total runs 1 avg 23.47 MB/s (user 1.45s sys 4.89s)
> delete tree total runs 2 avg 13.18 seconds (user 0.64s sys 1.02s)
> no runs for delete compiled tree
> stat tree total runs 4 avg 4.76 seconds (user 0.70s sys 0.50s)
> stat compiled tree total runs 1 avg 7.84 seconds (user 0.74s sys 0.54s)
> 
> Average metaclustering results
> ==========================================================================
> intial create total runs 30 avg 45.04 MB/s (user 1.13s sys 2.42s)
> create total runs 5 avg 15.64 MB/s (user 1.08s sys 1.98s)
> patch total runs 4 avg 10.50 MB/s (user 0.61s sys 3.11s)
> compile total runs 7 avg 28.07 MB/s (user 0.33s sys 4.06s)
> clean total runs 4 avg 83.27 MB/s (user 0.04s sys 0.27s)
> read tree total runs 2 avg 21.17 MB/s (user 1.15s sys 2.91s)
> read compiled tree total runs 1 avg 22.79 MB/s (user 1.38s sys 4.89s)
> delete tree total runs 2 avg 9.23 seconds (user 0.62s sys 1.01s)
> no runs for delete compiled tree
> stat tree total runs 4 avg 4.72 seconds (user 0.71s sys 0.50s)
> stat compiled tree total runs 1 avg 6.50 seconds (user 0.79s sys 0.53s)
> 
> Overall, metaclustering does better than vanilla except in a few cases.

Well it strikes me as about half up and half down, but the ups are
indeed much more substantial. Looks quite promising.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-18 20:47       ` Matt Mackall
@ 2007-11-19 10:34         ` Kyungmin Park
  0 siblings, 0 replies; 21+ messages in thread
From: Kyungmin Park @ 2007-11-19 10:34 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Abhishek Rai, Andrew Morton, Andreas Dilger, linux-kernel,
	Ken Chen, Mike Waychison

Hi,

> >
> > Setup: 4 cpu, 8GB RAM, 400GB disk.
> >
> > Average vanilla results
> > ==========================================================================
> > intial create total runs 30 avg 46.49 MB/s (user 1.12s sys 2.25s)
> > create total runs 5 avg 12.90 MB/s (user 1.08s sys 1.97s)
> > patch total runs 4 avg 8.70 MB/s (user 0.60s sys 2.31s)
> > compile total runs 7 avg 21.44 MB/s (user 0.32s sys 2.95s)
> > clean total runs 4 avg 59.91 MB/s (user 0.05s sys 0.26s)
> > read tree total runs 2 avg 21.85 MB/s (user 1.12s sys 2.89s)
> > read compiled tree total runs 1 avg 23.47 MB/s (user 1.45s sys 4.89s)
> > delete tree total runs 2 avg 13.18 seconds (user 0.64s sys 1.02s)
> > no runs for delete compiled tree
> > stat tree total runs 4 avg 4.76 seconds (user 0.70s sys 0.50s)
> > stat compiled tree total runs 1 avg 7.84 seconds (user 0.74s sys 0.54s)
> >
> > Average metaclustering results
> > ==========================================================================
> > intial create total runs 30 avg 45.04 MB/s (user 1.13s sys 2.42s)
> > create total runs 5 avg 15.64 MB/s (user 1.08s sys 1.98s)
> > patch total runs 4 avg 10.50 MB/s (user 0.61s sys 3.11s)
> > compile total runs 7 avg 28.07 MB/s (user 0.33s sys 4.06s)
> > clean total runs 4 avg 83.27 MB/s (user 0.04s sys 0.27s)
> > read tree total runs 2 avg 21.17 MB/s (user 1.15s sys 2.91s)
> > read compiled tree total runs 1 avg 22.79 MB/s (user 1.38s sys 4.89s)
> > delete tree total runs 2 avg 9.23 seconds (user 0.62s sys 1.01s)
> > no runs for delete compiled tree
> > stat tree total runs 4 avg 4.72 seconds (user 0.71s sys 0.50s)
> > stat compiled tree total runs 1 avg 6.50 seconds (user 0.79s sys 0.53s)
> >
> > Overall, metaclustering does better than vanilla except in a few cases.
>

I think the above test cases were run with normal (buffered) I/O only.
Did you test direct I/O with this patch?
With the metaclustering patch, I can't run 'fsstress' since it oopses during direct I/O.
Or does it depend on other ext3 patches?
I tested it with the latest kernel, 2.6.24-rc3, instead of the -mm kernel.

Thank you,
Kyungmin Park

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-18 15:52     ` Abhishek Rai
  2007-11-18 20:47       ` Matt Mackall
@ 2007-11-20 20:25       ` John Stoffel
  1 sibling, 0 replies; 21+ messages in thread
From: John Stoffel @ 2007-11-20 20:25 UTC (permalink / raw)
  To: Abhishek Rai
  Cc: Matt Mackall, Andrew Morton, Andreas Dilger, linux-kernel,
	Ken Chen, Mike Waychison


Abhishek> It took me some time to get compilebench working due to the
Abhishek> known issue with drop_caches due to circular lock dependency
Abhishek> between j_list_lock and inode_lock (compilebench triggers
Abhishek> drop_caches quite frequently). Here are the results for
Abhishek> compilebench run with options "-i 30 -r 30". I repeated the
Abhishek> test 5 times on each of vanilla and mc configurations.

Abhishek> Setup: 4 cpu, 8GB RAM, 400GB disk.

How about running these tests on a more pedestrian system which people
will actually have?  Like 1GB RAM, 1 CPU and 400GB on a single disk?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-11-17  2:58       ` Theodore Tso
  2007-11-17  8:58         ` Abhishek Rai
@ 2007-12-21 14:15         ` Abhishek Rai
  2008-01-10 21:17           ` Abhishek Rai
  1 sibling, 1 reply; 21+ messages in thread
From: Abhishek Rai @ 2007-12-21 14:15 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, Andreas Dilger, linux-kernel,
	Ken Chen, Mike Waychison

I have implemented a revised patch that addresses the concerns raised
with the previous patch. To summarize, here were the three main
concerns:

1. Metacluster size is sensitive to the average file size in the
block-group/file-system, so how do we find a good metacluster size?
The last discussion we had on this topic was to have multiple
metaclusters of different sizes per group and have direct blocks
overflow towards smaller metacluster sizes and indirect blocks
overflow towards bigger metaclusters. The overflowing from one
metacluster level to the next would happen only when none of the block
groups have any free blocks left at that metacluster level, IOW
overflowing was a file system wide event.

2. When indirect block allocation in the metacluster fails, don't fully
fall back to the old-style allocation scheme; instead fall back to the
old-style scheme only for that block group and repeat this for each
block group. All of this now happens in ext3_new_blocks(). To control the
size of that function, I created a few helper functions.

3. Don't have separate functions for allocating indirect and direct
blocks as there is considerable overlap especially in the journaling
code: Rolled the two functions into one ext3_new_blocks() function
which is directly called from ext3_alloc_branch() instead of via
ext3_alloc_blocks() (which is nuked now).

The current approach I've implemented is similar in principle but it
fixes a problem with the above scheme. The above scheme of overflowing
into the next "metacluster level" upon exhaustion of current
metacluster level across all block groups results in increased
fragmentation. E.g., say a block group BG1 ran out of blocks at
metacluster (mc) level X and now wants to use the next level. It
checks and finds that a different block group BG2 still has free blocks
at mc level X so it starts using level X in BG2 which results in the
file getting fragmented. In the new patch, we'd continue using level
X+1 in BG1 to reduce overall fragmentation and so the new patch
results in overflow only within the same block group.

Also, I've chosen a simpler implementation for this multi-level
metaclustering scheme by not having real metacluster levels but by
having direct blocks and indirect blocks grow towards each other from
opposite ends of the block group. Both are conceptually the same. Now
the overflow condition is that direct block allocation cannot spill
into indirect block region (the metacluster) and vice versa unless it
has run out of free blocks in its own region. This information is now
available through a new memory-only per-block counter that keeps a
count of the number of free blocks in the non-metacluster region. This
also addresses Andreas Dilger's concern with the previous
implementation regarding metaclusters increasing fragmentation by
splitting the block group into two halves.

Putting the metacluster at the end of the block group gives slightly
inferior sequential read throughput compared to putting it in the
beginning or the middle, but the difference is very tiny and exists
only for large files that span multiple block groups.

Since there are couple of ways in which this change differs from
before, I repeated the testing and performance evaluation. The new
change passed fsx, fsstress, and bonnie - both with and without
metaclustering.
Also, I checked that the block layout on disk is coming out to be what
one would expect from the code.

Here are the performance numbers. The setup was somewhat different
from the previous setup so I've gotten fresh numbers for the vanilla
case as well.

Setup:
RAM: 8GB
Disk: 400GB disk.
CPU: Dual core hyperthreaded

All measurements were taken 10 times or more until the standard deviation
was <2%. The machine was rebooted between runs and the file system freshly
formatted; we also made sure that nothing else was running on the
machine at the time of the test.

Notation:
- 'vanilla': regular ext3 without any changes
- 'mc': metaclustering ext3 (new)

Benchmark 1: Sequential write to a 10GB file followed by 'sync'
1. vanilla:
 Total: 3m39.1s
 User: 0.08
 System: 51.9s
2. mc:
 Total: 3m11.5s
 User: 0.06s
 System: 53.6s

Benchmark 2: Sequential read from a 10GB file.
Description: the file is created using the same type of ext3 (mc or vanilla)
1. vanilla:
 Total: 3m6.5s
 User: 0.04s
 System: 13.4s
2. mc:
 Total: 3m7.0s
 User: 0.05s
 System: 13.1s

Benchmark 3: Random read from a 300GB file
Description: read using 512 byte chunk total 5MB
1. vanilla:
 Total: 3m57.0s
 User: ~0
 System: 0.8s
2. mc:
 Total: 3m56.4s
 User: ~0
 System: 0.9s

Benchmark 4: Random read from a 300GB file
Description: read using 512KB chunk total 1% size of the file
1. vanilla:
 Total: 4m50.3s
 User: ~0
 System: 3.9s
2. mc:
 Total: 4m56.9s
 User: ~0
 System: 3.9s

Benchmark 5: fsck
Description: Prepare a newly formatted 400GB disk as follows: create
200 files of 0.5GB each, 100 files of 1GB each, 40 files of 2.5GB each,
and 10 files of 10GB each. fsck command line: fsck -f -n
1. vanilla:
 Total: 11m25.3s
 User: 13.4s
 System: 13.2s
2. mc:
 Total: 3m11.0s
 User: 13.1s
 System: 12.9s


Note: I'll report results from kernbench and compilebench shortly.

Observations:
Sequential write performance is much better with metaclustering than
with vanilla. To better understand it, I ran the same benchmark with the
new code but with the metaclustering option turned off, and I got the
same performance as vanilla, which makes me believe that there is
something about metaclustering that helps write performance, though I
don't have a very good handle on what that might be.

Thanks,
Abhishek

Signed-off-by: Abhishek Rai <abhishekrai@google.com>

diff -uprdN linux-2.6.23mm1-clean/fs/ext3/balloc.c
linux-2.6.23mm1-ext3mc/fs/ext3/balloc.c
--- linux-2.6.23mm1-clean/fs/ext3/balloc.c	2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/balloc.c	2007-12-21 05:34:35.000000000 -0800
@@ -33,6 +33,29 @@
  * super block.  Each descriptor contains the number of the bitmap block and
  * the free blocks count in the block.  The descriptors are loaded in memory
  * when a file system is mounted (see ext3_fill_super).
+ *
+ * A note on ext3 metaclustering:
+ *
+ * 	Start of						End of
+ * 	block group						block group
+ * 	 ________________________________________________________________
+ * 	|	NON-MC REGION			|	MC REGION	 |
+ * 	|					|Overflow		 |
+ * 	|Data blocks and			|data		Indirect |
+ * 	|overflow indirect blocks		|blocks		blocks	 |
+ * 	|----------> 				|------->	<--------|
+ * 	|________________________________________________________________|
+ *
+ * 	Every block group has at its end a semi-reserved region called the
+ * 	metacluster mostly used for allocating indirect blocks. Under normal
+ * 	circumstances, the metacluster is used only for allocating indirect
+ * 	blocks which are allocated in decreasing order of block numbers.
+ * 	The non-Metacluster region is used for data block allocation which are
+ * 	allocated in increasing order of block numbers. However, when the MC
+ * 	runs out of space, indirect blocks can be allocated in the non-MC
+ * 	region along with the data blocks in the forward direction. Similarly,
+ * 	when non-MC runs out of space, new data blocks are allocated in MC but
+ * 	in the forward direction.
  */


@@ -147,6 +170,88 @@ error_out:
 			block_group, bitmap_blk);
 	return NULL;
 }
+
+
+/*
+ * Count number of free blocks in a block group that don't lie in the
+ * metacluster region of the block group.
+ */
+static void
+ext3_init_grp_free_nonmc_blocks(struct super_block *sb,
+				struct buffer_head *bitmap_bh,
+				unsigned long block_group)
+{
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
+	struct ext3_bg_info *bgi = &sbi->s_bginfo[block_group];
+
+	BUG_ON(!test_opt(sb, METACLUSTER));
+
+	spin_lock(sb_bgl_lock(sbi, block_group));
+	if (bgi->bgi_free_nonmc_blocks_count >= 0)
+		goto out;
+
+	bgi->bgi_free_nonmc_blocks_count =
+		ext3_count_free(bitmap_bh, sbi->s_nonmc_blocks_per_group/8);
+
+out:
+	spin_unlock(sb_bgl_lock(sbi, block_group));
+	BUG_ON(bgi->bgi_free_nonmc_blocks_count >
+		sbi->s_nonmc_blocks_per_group);
+}
+
+/*
+ * ext3_update_nonmc_block_count:
+ *	Update bgi_free_nonmc_blocks_count for block group 'group_no' following
+ *	an allocation or deallocation.
+ *
+ *	@group_no:	affected block group
+ *	@start:		start of the [de]allocated range
+ *	@count:		number of blocks [de]allocated
+ *	@allocation:	1 if blocks were allocated, 0 otherwise.
+ */
+static inline void
+ext3_update_nonmc_block_count(struct ext3_sb_info *sbi, unsigned long group_no,
+				ext3_grpblk_t start, unsigned long count,
+				int allocation)
+{
+	struct ext3_bg_info *bginfo = &sbi->s_bginfo[group_no];
+	ext3_grpblk_t change;
+
+	BUG_ON(bginfo->bgi_free_nonmc_blocks_count < 0);
+	BUG_ON(start >= sbi->s_nonmc_blocks_per_group);
+
+	change = min_t(ext3_grpblk_t, start + count,
+			sbi->s_nonmc_blocks_per_group) - start;
+
+	spin_lock(sb_bgl_lock(sbi, group_no));
+	BUG_ON(bginfo->bgi_free_nonmc_blocks_count >
+		sbi->s_nonmc_blocks_per_group);
+	BUG_ON(allocation && bginfo->bgi_free_nonmc_blocks_count < change);
+
+	bginfo->bgi_free_nonmc_blocks_count += (allocation ? -change : change);
+
+	BUG_ON(bginfo->bgi_free_nonmc_blocks_count >
+		sbi->s_nonmc_blocks_per_group);
+	spin_unlock(sb_bgl_lock(sbi, group_no));
+}
+
+/*
+ * allow_mc_alloc:
+ * 	Check if we can use metacluster region of a block group for general
+ * 	allocation if needed. Ideally, we should allow this only if
+ * 	bgi_free_nonmc_blocks_count == 0, but sometimes there is a small number
+ * 	of blocks which don't get allocated in the first pass, no point
+ * 	breaking our file at the metacluster boundary because of that, so we
+ * 	relax the limit to 8.
+ */
+static inline int allow_mc_alloc(struct ext3_sb_info *sbi,
+					struct ext3_bg_info *bgi,
+					ext3_grpblk_t blk)
+{
+	return !(blk >= 0 && blk >= sbi->s_nonmc_blocks_per_group &&
+		bgi->bgi_free_nonmc_blocks_count >= 8);
+}
+
 /*
  * The reservation window structure operations
  * --------------------------------------------
@@ -463,6 +568,7 @@ void ext3_free_blocks_sb(handle_t *handl
 	struct ext3_group_desc * desc;
 	struct ext3_super_block * es;
 	struct ext3_sb_info *sbi;
+	struct ext3_bg_info *bgi;
 	int err = 0, ret;
 	ext3_grpblk_t group_freed;

@@ -502,6 +608,13 @@ do_more:
 	if (!desc)
 		goto error_return;

+	if (test_opt(sb, METACLUSTER)) {
+		bgi = &sbi->s_bginfo[block_group];
+		if (bgi->bgi_free_nonmc_blocks_count < 0)
+			ext3_init_grp_free_nonmc_blocks(sb, bitmap_bh,
+							block_group);
+	}
+
 	if (in_range (le32_to_cpu(desc->bg_block_bitmap), block, count) ||
 	    in_range (le32_to_cpu(desc->bg_inode_bitmap), block, count) ||
 	    in_range (block, le32_to_cpu(desc->bg_inode_table),
@@ -621,6 +734,9 @@ do_more:
 	if (!err) err = ret;
 	*pdquot_freed_blocks += group_freed;

+	if (test_opt(sb, METACLUSTER) && bit < sbi->s_nonmc_blocks_per_group)
+		ext3_update_nonmc_block_count(sbi, block_group, bit, count, 0);
+
 	if (overflow && !err) {
 		block += count;
 		count = overflow;
@@ -726,6 +842,50 @@ bitmap_search_next_usable_block(ext3_grp
 	return -1;
 }

+static ext3_grpblk_t
+bitmap_find_prev_zero_bit(char *map, ext3_grpblk_t start, ext3_grpblk_t lowest)
+{
+	ext3_grpblk_t k, blk;
+
+	k = start & ~7;
+	while (lowest <= k) {
+		if (map[k/8] != '\377' &&
+			(blk = ext3_find_next_zero_bit(map, k + 8, k))
+			 < (k + 8))
+				return blk;
+
+		k -= 8;
+	}
+	return -1;
+}
+
+static ext3_grpblk_t
+bitmap_search_prev_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
+					ext3_grpblk_t lowest)
+{
+	ext3_grpblk_t next;
+	struct journal_head *jh = bh2jh(bh);
+
+	/*
+	 * The bitmap search --- search backward alternately through the actual
+	 * bitmap and the last-committed copy until we find a bit free in
+	 * both
+	 */
+	while (start >= lowest) {
+		next = bitmap_find_prev_zero_bit(bh->b_data, start, lowest);
+		if (next < lowest)
+			return -1;
+		if (ext3_test_allocatable(next, bh))
+			return next;
+		jbd_lock_bh_state(bh);
+		if (jh->b_committed_data)
+			start = bitmap_find_prev_zero_bit(jh->b_committed_data,
+								next, lowest);
+		jbd_unlock_bh_state(bh);
+	}
+	return -1;
+}
+
 /**
  * find_next_usable_block()
  * @start:		the starting block (group relative) to find next
@@ -833,19 +993,27 @@ claim_block(spinlock_t *lock, ext3_grpbl
  *	file's own reservation window;
  *	Otherwise, the allocation range starts from the give goal block, ends at
  *	the block group's last block.
- *
- * If we failed to allocate the desired block then we may end up crossing to a
- * new bitmap.  In that case we must release write access to the old one via
- * ext3_journal_release_buffer(), else we'll run out of credits.
  */
 static ext3_grpblk_t
 ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
 			struct buffer_head *bitmap_bh, ext3_grpblk_t grp_goal,
 			unsigned long *count, struct ext3_reserve_window *my_rsv)
 {
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
+	struct ext3_group_desc *gdp;
+	struct ext3_bg_info *bgi = NULL;
+	struct buffer_head *gdp_bh;
 	ext3_fsblk_t group_first_block;
 	ext3_grpblk_t start, end;
 	unsigned long num = 0;
+	const int metaclustering = test_opt(sb, METACLUSTER);
+
+	if (metaclustering)
+		bgi = &sbi->s_bginfo[group];
+
+	gdp = ext3_get_group_desc(sb, group, &gdp_bh);
+	if (!gdp)
+		goto fail_access;

 	/* we do allocation within the reservation window if we have a window */
 	if (my_rsv) {
@@ -890,8 +1058,10 @@ repeat:
 	}
 	start = grp_goal;

-	if (!claim_block(sb_bgl_lock(EXT3_SB(sb), group),
-		grp_goal, bitmap_bh)) {
+	if (metaclustering && !allow_mc_alloc(sbi, bgi, grp_goal))
+		goto fail_access;
+
+	if (!claim_block(sb_bgl_lock(sbi, group), grp_goal, bitmap_bh)) {
 		/*
 		 * The block was allocated by another thread, or it was
 		 * allocated and then freed by another thread
@@ -906,8 +1076,8 @@ repeat:
 	grp_goal++;
 	while (num < *count && grp_goal < end
 		&& ext3_test_allocatable(grp_goal, bitmap_bh)
-		&& claim_block(sb_bgl_lock(EXT3_SB(sb), group),
-				grp_goal, bitmap_bh)) {
+		&& (!metaclustering || allow_mc_alloc(sbi, bgi, grp_goal))
+		&& claim_block(sb_bgl_lock(sbi, group), grp_goal, bitmap_bh)) {
 		num++;
 		grp_goal++;
 	}
@@ -1138,7 +1308,9 @@ static int alloc_new_reservation(struct

 	/*
 	 * find_next_reservable_window() simply finds a reservable window
-	 * inside the given range(start_block, group_end_block).
+	 * inside the given range(start_block, group_end_block). The
+	 * reservation window must have a reservable free bit inside it for our
+	 * callers to work correctly.
 	 *
 	 * To make sure the reservation window has a free bit inside it, we
 	 * need to check the bitmap after we found a reservable window.
@@ -1170,10 +1342,17 @@ retry:
 			my_rsv->rsv_start - group_first_block,
 			bitmap_bh, group_end_block - group_first_block + 1);

-	if (first_free_block < 0) {
+	if (first_free_block < 0 ||
+		(test_opt(sb, METACLUSTER)
+		 && !allow_mc_alloc(EXT3_SB(sb), &EXT3_SB(sb)->s_bginfo[group],
+			 		first_free_block))) {
 		/*
-		 * no free block left on the bitmap, no point
-		 * to reserve the space. return failed.
+		 * No free block left on the bitmap, so there is no point in
+		 * reserving space; return failure. We also fail here if
+		 * metaclustering is enabled and the first free block in the
+		 * window lies in the metacluster while there are free non-mc
+		 * blocks in the block group: such a window, or any window
+		 * following it, is of no use to us.
 		 */
 		spin_lock(rsv_lock);
 		if (!rsv_is_empty(&my_rsv->rsv_window))
@@ -1276,25 +1455,17 @@ ext3_try_to_allocate_with_rsv(struct sup
 			unsigned int group, struct buffer_head *bitmap_bh,
 			ext3_grpblk_t grp_goal,
 			struct ext3_reserve_window_node * my_rsv,
-			unsigned long *count, int *errp)
+			unsigned long *count)
 {
+	struct ext3_bg_info *bgi;
 	ext3_fsblk_t group_first_block, group_last_block;
 	ext3_grpblk_t ret = 0;
-	int fatal;
 	unsigned long num = *count;

-	*errp = 0;
-
-	/*
-	 * Make sure we use undo access for the bitmap, because it is critical
-	 * that we do the frozen_data COW on bitmap buffers in all cases even
-	 * if the buffer is in BJ_Forget state in the committing transaction.
-	 */
-	BUFFER_TRACE(bitmap_bh, "get undo access for new block");
-	fatal = ext3_journal_get_undo_access(handle, bitmap_bh);
-	if (fatal) {
-		*errp = fatal;
-		return -1;
+	if (test_opt(sb, METACLUSTER)) {
+		bgi = &EXT3_SB(sb)->s_bginfo[group];
+		if (bgi->bgi_free_nonmc_blocks_count < 0)
+			ext3_init_grp_free_nonmc_blocks(sb, bitmap_bh, group);
 	}

 	/*
@@ -1370,19 +1541,6 @@ ext3_try_to_allocate_with_rsv(struct sup
 		num = *count;
 	}
 out:
-	if (ret >= 0) {
-		BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for "
-					"bitmap block");
-		fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
-		if (fatal) {
-			*errp = fatal;
-			return -1;
-		}
-		return ret;
-	}
-
-	BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
-	ext3_journal_release_buffer(handle, bitmap_bh);
 	return ret;
 }

@@ -1428,22 +1586,149 @@ int ext3_should_retry_alloc(struct super
 	return journal_force_commit_nested(EXT3_SB(sb)->s_journal);
 }

+/*
+ * ext3_alloc_indirect_blocks:
+ * 	Helper function for ext3_new_blocks. Allocates indirect blocks from the
+ * 	metacluster region only and stores their numbers in new_blocks[].
+ */
+int ext3_alloc_indirect_blocks(struct super_block *sb,
+			struct buffer_head *bitmap_bh,
+			struct ext3_group_desc *gdp,
+			int group_no, unsigned long indirect_blks,
+			ext3_fsblk_t new_blocks[])
+{
+	struct ext3_bg_info *bgi = &EXT3_SB(sb)->s_bginfo[group_no];
+	ext3_grpblk_t blk = EXT3_BLOCKS_PER_GROUP(sb) - 1;
+	ext3_grpblk_t mc_start = EXT3_SB(sb)->s_nonmc_blocks_per_group;
+	ext3_fsblk_t group_first_block;
+	int allocated = 0;
+
+	BUG_ON(!test_opt(sb, METACLUSTER));
+
+	/* This check is racy but that wouldn't harm us. */
+	if (bgi->bgi_free_nonmc_blocks_count >=
+		le16_to_cpu(gdp->bg_free_blocks_count))
+		return 0;
+
+	group_first_block = ext3_group_first_block_no(sb, group_no);
+	while (allocated < indirect_blks && blk >= mc_start) {
+		if (!ext3_test_allocatable(blk, bitmap_bh)) {
+			blk = bitmap_search_prev_usable_block(blk, bitmap_bh,
+								mc_start);
+			continue;
+		}
+		if (claim_block(sb_bgl_lock(EXT3_SB(sb), group_no), blk,
+				bitmap_bh)) {
+			new_blocks[allocated++] = group_first_block + blk;
+		} else {
+			/*
+			 * The block was allocated by another thread, or it
+			 * was allocated and then freed by another thread
+			 */
+			cpu_relax();
+		}
+		if (allocated < indirect_blks)
+			blk = bitmap_search_prev_usable_block(blk, bitmap_bh,
+								mc_start);
+	}
+	return allocated;
+}
+
+/*
+ * check_allocated_blocks:
+ * 	Helper function for ext3_new_blocks. Checks newly allocated block
+ * 	numbers.
+ */
+int check_allocated_blocks(ext3_fsblk_t blk, unsigned long num,
+				struct super_block *sb, int group_no,
+				struct ext3_group_desc *gdp,
+				struct buffer_head *bitmap_bh)
+{
+	struct ext3_super_block *es = EXT3_SB(sb)->s_es;
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
+	ext3_fsblk_t grp_blk = blk - ext3_group_first_block_no(sb, group_no);
+
+	if (in_range(le32_to_cpu(gdp->bg_block_bitmap), blk, num) ||
+		in_range(le32_to_cpu(gdp->bg_inode_bitmap), blk, num) ||
+		in_range(blk, le32_to_cpu(gdp->bg_inode_table),
+				EXT3_SB(sb)->s_itb_per_group) ||
+		in_range(blk + num - 1, le32_to_cpu(gdp->bg_inode_table),
+				EXT3_SB(sb)->s_itb_per_group))
+		ext3_error(sb, "ext3_new_blocks",
+				"Allocating block in system zone - "
+				"blocks from "E3FSBLK", length %lu",
+				blk, num);
+
+#ifdef CONFIG_JBD_DEBUG
+	{
+		struct buffer_head *debug_bh;
+
+		/* Record bitmap buffer state in the newly allocated block */
+		debug_bh = sb_find_get_block(sb, blk);
+		if (debug_bh) {
+			BUFFER_TRACE(debug_bh, "state when allocated");
+			BUFFER_TRACE2(debug_bh, bitmap_bh, "bitmap state");
+			brelse(debug_bh);
+		}
+	}
+	jbd_lock_bh_state(bitmap_bh);
+	spin_lock(sb_bgl_lock(sbi, group_no));
+	if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) {
+		int i;
+
+		for (i = 0; i < num; i++) {
+			if (ext3_test_bit(grp_blk+i,
+					bh2jh(bitmap_bh)->b_committed_data))
+				printk(KERN_ERR "%s: block was unexpectedly set"
+					" in b_committed_data\n", __FUNCTION__);
+		}
+	}
+	ext3_debug("found bit %d\n", grp_blk);
+	spin_unlock(sb_bgl_lock(sbi, group_no));
+	jbd_unlock_bh_state(bitmap_bh);
+#endif
+
+	if (blk + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
+		ext3_error(sb, "ext3_new_blocks",
+				"block("E3FSBLK") >= blocks count(%d) - "
+				"block_group = %d, es == %p ", blk,
+				le32_to_cpu(es->s_blocks_count), group_no, es);
+		return 1;
+	}
+
+	return 0;
+}
+
 /**
- * ext3_new_blocks() -- core block(s) allocation function
- * @handle:		handle to this transaction
- * @inode:		file inode
- * @goal:		given target block(filesystem wide)
- * @count:		target number of blocks to allocate
- * @errp:		error code
+ * ext3_new_blocks - allocate indirect blocks and direct blocks.
+ *	@handle:	handle to this transaction
+ *	@inode:		file inode
+ *	@goal:		given target block(filesystem wide)
+ * 	@indirect_blks:	number of indirect blocks to allocate
+ * 	@blks:		number of direct blocks to allocate
+ * 	@new_blocks:	this will store the block numbers of indirect blocks
+ * 			and direct blocks upon return.
  *
- * ext3_new_blocks uses a goal block to assist allocation.  It tries to
- * allocate block(s) from the block group contains the goal block first. If that
- * fails, it will try to allocate block(s) from other block groups without
- * any specific goal block.
+ * 	Returns the number of direct blocks allocated. Fewer direct blocks
+ * 	than requested may be allocated, but all requested indirect blocks
+ * 	must be allocated for the call to succeed.
  *
+ *	Without metaclustering, ext3_new_block allocates all blocks using a
+ *	goal block to assist allocation.  It tries to allocate block(s) from
+ *	the block group containing the goal block first. If that fails, it will
+ *	try to allocate block(s) from other block groups without any specific
+ *	goal block.
+ *
+ *	With metaclustering, the only difference is that indirect block
+ *	allocation is first attempted in the metacluster region of the same
+ *	block group; failing that, they are allocated along with direct blocks.
+ *
+ *	This function also updates quota and i_blocks field.
  */
-ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
-			ext3_fsblk_t goal, unsigned long *count, int *errp)
+int ext3_new_blocks(handle_t *handle, struct inode *inode,
+			ext3_fsblk_t goal, int indirect_blks, int blks,
+			ext3_fsblk_t new_blocks[4], int *errp)
+
 {
 	struct buffer_head *bitmap_bh = NULL;
 	struct buffer_head *gdp_bh;
@@ -1452,10 +1737,16 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 	ext3_grpblk_t grp_target_blk;	/* blockgroup relative goal block */
 	ext3_grpblk_t grp_alloc_blk;	/* blockgroup-relative allocated block*/
 	ext3_fsblk_t ret_block;		/* filesyetem-wide allocated block */
+	ext3_fsblk_t group_first_block; /* first block in the group */
 	int bgi;			/* blockgroup iteration index */
 	int fatal = 0, err;
 	int performed_allocation = 0;
 	ext3_grpblk_t free_blocks;	/* number of free blocks in a group */
+	unsigned long ngroups;
+	unsigned long grp_mc_alloc;/* blocks allocated from mc in a group */
+	unsigned long grp_alloc;   /* blocks allocated outside mc in a group */
+	int indirect_blks_done = 0;/* total ind blocks allocated so far */
+	int blks_done = 0;	   /* total direct blocks allocated */
 	struct super_block *sb;
 	struct ext3_group_desc *gdp;
 	struct ext3_super_block *es;
@@ -1463,23 +1754,23 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 	struct ext3_reserve_window_node *my_rsv = NULL;
 	struct ext3_block_alloc_info *block_i;
 	unsigned short windowsz = 0;
+	int i;
 #ifdef EXT3FS_DEBUG
 	static int goal_hits, goal_attempts;
 #endif
-	unsigned long ngroups;
-	unsigned long num = *count;

 	*errp = -ENOSPC;
 	sb = inode->i_sb;
 	if (!sb) {
-		printk("ext3_new_block: nonexistent device");
+		printk(KERN_INFO "ext3_new_blocks: nonexistent device");
+		*errp = -ENODEV;
 		return 0;
 	}

 	/*
 	 * Check quota for allocation of this block.
 	 */
-	if (DQUOT_ALLOC_BLOCK(inode, num)) {
+	if (DQUOT_ALLOC_BLOCK(inode, indirect_blks + blks)) {
 		*errp = -EDQUOT;
 		return 0;
 	}
@@ -1513,73 +1804,194 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 	group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
 			EXT3_BLOCKS_PER_GROUP(sb);
 	goal_group = group_no;
-retry_alloc:
-	gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
-	if (!gdp)
-		goto io_error;
-
-	free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
-	/*
-	 * if there is not enough free blocks to make a new resevation
-	 * turn off reservation for this allocation
-	 */
-	if (my_rsv && (free_blocks < windowsz)
-		&& (rsv_is_empty(&my_rsv->rsv_window)))
-		my_rsv = NULL;
-
-	if (free_blocks > 0) {
-		grp_target_blk = ((goal - le32_to_cpu(es->s_first_data_block)) %
-				EXT3_BLOCKS_PER_GROUP(sb));
-		bitmap_bh = read_block_bitmap(sb, group_no);
-		if (!bitmap_bh)
-			goto io_error;
-		grp_alloc_blk = ext3_try_to_allocate_with_rsv(sb, handle,
-					group_no, bitmap_bh, grp_target_blk,
-					my_rsv,	&num, &fatal);
-		if (fatal)
-			goto out;
-		if (grp_alloc_blk >= 0)
-			goto allocated;
-	}

+retry_alloc:
+	grp_target_blk = ((goal - le32_to_cpu(es->s_first_data_block)) %
+			EXT3_BLOCKS_PER_GROUP(sb));
 	ngroups = EXT3_SB(sb)->s_groups_count;
 	smp_rmb();

 	/*
-	 * Now search the rest of the groups.  We assume that
-	 * i and gdp correctly point to the last group visited.
+	 * Iterate over successive block groups for allocating (any) indirect
+	 * blocks and direct blocks until at least one direct block has been
+	 * allocated. If metaclustering is enabled, we try allocating indirect
+	 * blocks first in the metacluster region and then in the general
+	 * region and if that fails too, we repeat the same algorithm in the
+	 * next block group and so on. This not only keeps the indirect blocks
+	 * together in the metacluster, but also keeps them in close proximity
+	 * to their corresponding direct blocks.
+	 *
+	 * The search begins and ends at the goal group, though the second time
+	 * we are at the goal group we try allocating without a goal.
 	 */
-	for (bgi = 0; bgi < ngroups; bgi++) {
-		group_no++;
+	bgi = 0;
+	while (bgi < ngroups + 1) {
+		grp_mc_alloc = 0;
+
 		if (group_no >= ngroups)
 			group_no = 0;
+
 		gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
 		if (!gdp)
 			goto io_error;
+
 		free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
-		/*
-		 * skip this group if the number of
-		 * free blocks is less than half of the reservation
-		 * window size.
-		 */
-		if (free_blocks <= (windowsz/2))
-			continue;
+		if (group_no == goal_group) {
+			if (my_rsv && (free_blocks < windowsz)
+				&& (rsv_is_empty(&my_rsv->rsv_window)))
+				my_rsv = NULL;
+			if (free_blocks <= 0)
+				goto next;
+		} else if (free_blocks <= windowsz/2)
+			goto next;

-		brelse(bitmap_bh);
 		bitmap_bh = read_block_bitmap(sb, group_no);
 		if (!bitmap_bh)
 			goto io_error;
+
 		/*
-		 * try to allocate block(s) from this group, without a goal(-1).
+		 * Make sure we use undo access for the bitmap, because it is
+		 * critical that we do the frozen_data COW on bitmap buffers in
+		 * all cases even if the buffer is in BJ_Forget state in the
+		 * committing transaction.
+		 */
+		BUFFER_TRACE(bitmap_bh, "get undo access for new block");
+		fatal = ext3_journal_get_undo_access(handle, bitmap_bh);
+		if (fatal)
+			goto out;
+
+		/*
+		 * If metaclustering is enabled, first try to allocate indirect
+		 * blocks in the metacluster.
 		 */
+		if (test_opt(sb, METACLUSTER) &&
+			indirect_blks_done < indirect_blks)
+			grp_mc_alloc = ext3_alloc_indirect_blocks(sb,
+					bitmap_bh, gdp, group_no,
+					indirect_blks - indirect_blks_done,
+					new_blocks + indirect_blks_done);
+
+		/* Allocate data blocks and any leftover indirect blocks. */
+		grp_alloc = indirect_blks + blks
+				- (indirect_blks_done + grp_mc_alloc);
 		grp_alloc_blk = ext3_try_to_allocate_with_rsv(sb, handle,
-					group_no, bitmap_bh, -1, my_rsv,
-					&num, &fatal);
+					group_no, bitmap_bh, grp_target_blk,
+					my_rsv, &grp_alloc);
+		if (grp_alloc_blk < 0)
+			grp_alloc = 0;
+
+		/*
+		 * If we couldn't allocate anything, there is nothing more to
+		 * do with this block group, so move on to the next. But
+		 * before that we must release write access to the bitmap via
+		 * ext3_journal_release_buffer(), else we'll run out of credits.
+		 */
+		if (grp_mc_alloc == 0 && grp_alloc == 0) {
+			BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
+			ext3_journal_release_buffer(handle, bitmap_bh);
+			goto next;
+		}
+
+		BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for "
+					"bitmap block");
+		fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
 		if (fatal)
 			goto out;
-		if (grp_alloc_blk >= 0)
+
+		ext3_debug("using block group %d(%d)\n",
+				group_no, gdp->bg_free_blocks_count);
+
+		BUFFER_TRACE(gdp_bh, "get_write_access");
+		fatal = ext3_journal_get_write_access(handle, gdp_bh);
+		if (fatal)
+			goto out;
+
+		/* Should this be called before ext3_journal_dirty_metadata? */
+		for (i = 0; i < grp_mc_alloc; i++) {
+			if (check_allocated_blocks(
+				new_blocks[indirect_blks_done + i], 1, sb,
+				group_no, gdp, bitmap_bh))
+				goto out;
+		}
+		if (grp_alloc > 0) {
+			ret_block = ext3_group_first_block_no(sb, group_no) +
+				grp_alloc_blk;
+			if (check_allocated_blocks(ret_block, grp_alloc, sb,
+						group_no, gdp, bitmap_bh))
+				goto out;
+		}
+
+		indirect_blks_done += grp_mc_alloc;
+		performed_allocation = 1;
+
+		/* The caller will add the new buffer to the journal. */
+		if (grp_alloc > 0)
+			ext3_debug("allocating block %lu. "
+					"Goal hits %d of %d.\n",
+					ret_block, goal_hits, goal_attempts);
+
+		spin_lock(sb_bgl_lock(sbi, group_no));
+		gdp->bg_free_blocks_count =
+			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) -
+					(grp_mc_alloc + grp_alloc));
+		spin_unlock(sb_bgl_lock(sbi, group_no));
+		percpu_counter_sub(&sbi->s_freeblocks_counter,
+				(grp_mc_alloc + grp_alloc));
+
+		BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for "
+				"group descriptor");
+		err = ext3_journal_dirty_metadata(handle, gdp_bh);
+		if (!fatal)
+			fatal = err;
+
+		sb->s_dirt = 1;
+		if (fatal)
+			goto out;
+
+		brelse(bitmap_bh);
+		bitmap_bh = NULL;
+
+		if (grp_alloc == 0)
+			goto next;
+
+		/* Update block group non-mc block count since we used some. */
+		if (test_opt(sb, METACLUSTER) &&
+			grp_alloc_blk < sbi->s_nonmc_blocks_per_group)
+			ext3_update_nonmc_block_count(sbi, group_no,
+				grp_alloc_blk, grp_alloc, 1);
+
+		/*
+		 * Hand out the blocks allocated outside the metacluster: first
+		 * to any remaining indirect blocks, then to the direct blocks.
+		 */
+		group_first_block = ext3_group_first_block_no(sb, group_no);
+		while (grp_alloc > 0 && indirect_blks_done < indirect_blks) {
+			new_blocks[indirect_blks_done++] =
+				group_first_block + grp_alloc_blk;
+			grp_alloc_blk++;
+			grp_alloc--;
+		}
+
+		if (grp_alloc > 0) {
+			blks_done = grp_alloc;
+			new_blocks[indirect_blks_done] =
+				group_first_block + grp_alloc_blk;
 			goto allocated;
+		}
+
+		/*
+		 * If we allocated something but not the minimum required,
+		 * it's OK to retry in this group as it might have more free
+		 * blocks.
+		 */
+		continue;
+
+next:
+		bgi++;
+		group_no++;
+		grp_target_blk = -1;
 	}
+
 	/*
 	 * We may end up a bogus ealier ENOSPC error due to
 	 * filesystem is "full" of reservations, but
@@ -1598,96 +2010,11 @@ retry_alloc:
 	goto out;

 allocated:
-
-	ext3_debug("using block group %d(%d)\n",
-			group_no, gdp->bg_free_blocks_count);
-
-	BUFFER_TRACE(gdp_bh, "get_write_access");
-	fatal = ext3_journal_get_write_access(handle, gdp_bh);
-	if (fatal)
-		goto out;
-
-	ret_block = grp_alloc_blk + ext3_group_first_block_no(sb, group_no);
-
-	if (in_range(le32_to_cpu(gdp->bg_block_bitmap), ret_block, num) ||
-	    in_range(le32_to_cpu(gdp->bg_inode_bitmap), ret_block, num) ||
-	    in_range(ret_block, le32_to_cpu(gdp->bg_inode_table),
-		      EXT3_SB(sb)->s_itb_per_group) ||
-	    in_range(ret_block + num - 1, le32_to_cpu(gdp->bg_inode_table),
-		      EXT3_SB(sb)->s_itb_per_group))
-		ext3_error(sb, "ext3_new_block",
-			    "Allocating block in system zone - "
-			    "blocks from "E3FSBLK", length %lu",
-			     ret_block, num);
-
-	performed_allocation = 1;
-
-#ifdef CONFIG_JBD_DEBUG
-	{
-		struct buffer_head *debug_bh;
-
-		/* Record bitmap buffer state in the newly allocated block */
-		debug_bh = sb_find_get_block(sb, ret_block);
-		if (debug_bh) {
-			BUFFER_TRACE(debug_bh, "state when allocated");
-			BUFFER_TRACE2(debug_bh, bitmap_bh, "bitmap state");
-			brelse(debug_bh);
-		}
-	}
-	jbd_lock_bh_state(bitmap_bh);
-	spin_lock(sb_bgl_lock(sbi, group_no));
-	if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) {
-		int i;
-
-		for (i = 0; i < num; i++) {
-			if (ext3_test_bit(grp_alloc_blk+i,
-					bh2jh(bitmap_bh)->b_committed_data)) {
-				printk("%s: block was unexpectedly set in "
-					"b_committed_data\n", __FUNCTION__);
-			}
-		}
-	}
-	ext3_debug("found bit %d\n", grp_alloc_blk);
-	spin_unlock(sb_bgl_lock(sbi, group_no));
-	jbd_unlock_bh_state(bitmap_bh);
-#endif
-
-	if (ret_block + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
-		ext3_error(sb, "ext3_new_block",
-			    "block("E3FSBLK") >= blocks count(%d) - "
-			    "block_group = %d, es == %p ", ret_block,
-			le32_to_cpu(es->s_blocks_count), group_no, es);
-		goto out;
-	}
-
-	/*
-	 * It is up to the caller to add the new buffer to a journal
-	 * list of some description.  We don't know in advance whether
-	 * the caller wants to use it as metadata or data.
-	 */
-	ext3_debug("allocating block %lu. Goal hits %d of %d.\n",
-			ret_block, goal_hits, goal_attempts);
-
-	spin_lock(sb_bgl_lock(sbi, group_no));
-	gdp->bg_free_blocks_count =
-			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
-	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_sub(&sbi->s_freeblocks_counter, num);
-
-	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
-	err = ext3_journal_dirty_metadata(handle, gdp_bh);
-	if (!fatal)
-		fatal = err;
-
-	sb->s_dirt = 1;
-	if (fatal)
-		goto out;
-
 	*errp = 0;
-	brelse(bitmap_bh);
-	DQUOT_FREE_BLOCK(inode, *count-num);
-	*count = num;
-	return ret_block;
+	DQUOT_FREE_BLOCK(inode,
+			indirect_blks + blks - indirect_blks_done - blks_done);
+
+	return blks_done;

 io_error:
 	*errp = -EIO;
@@ -1700,7 +2027,13 @@ out:
 	 * Undo the block allocation
 	 */
 	if (!performed_allocation)
-		DQUOT_FREE_BLOCK(inode, *count);
+		DQUOT_FREE_BLOCK(inode, indirect_blks + blks);
+	/*
+	 * Free any indirect blocks we allocated already. If the transaction
+	 * has been aborted this is essentially a no-op.
+	 */
+	for (i = 0; i < indirect_blks_done; i++)
+		ext3_free_blocks(handle, inode, new_blocks[i], 1);
 	brelse(bitmap_bh);
 	return 0;
 }
@@ -1708,9 +2041,13 @@ out:
 ext3_fsblk_t ext3_new_block(handle_t *handle, struct inode *inode,
 			ext3_fsblk_t goal, int *errp)
 {
-	unsigned long count = 1;
+	ext3_fsblk_t new_blocks[4];

-	return ext3_new_blocks(handle, inode, goal, &count, errp);
+	ext3_new_blocks(handle, inode, goal, 0, 1, new_blocks, errp);
+	if (*errp)
+		return 0;
+
+	return new_blocks[0];
 }

 /**
diff -uprdN linux-2.6.23mm1-clean/fs/ext3/bitmap.c linux-2.6.23mm1-ext3mc/fs/ext3/bitmap.c
--- linux-2.6.23mm1-clean/fs/ext3/bitmap.c	2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/bitmap.c	2007-12-20 18:12:17.000000000 -0800
@@ -11,8 +11,6 @@
 #include <linux/jbd.h>
 #include <linux/ext3_fs.h>

-#ifdef EXT3FS_DEBUG
-
 static const int nibblemap[] = {4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0};

 unsigned long ext3_count_free (struct buffer_head * map, unsigned int numchars)
@@ -27,6 +25,3 @@ unsigned long ext3_count_free (struct bu
 			nibblemap[(map->b_data[i] >> 4) & 0xf];
 	return (sum);
 }
-
-#endif  /*  EXT3FS_DEBUG  */
-
diff -uprdN linux-2.6.23mm1-clean/fs/ext3/inode.c linux-2.6.23mm1-ext3mc/fs/ext3/inode.c
--- linux-2.6.23mm1-clean/fs/ext3/inode.c	2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/inode.c	2007-12-21 05:38:41.000000000 -0800
@@ -36,10 +36,33 @@
 #include <linux/mpage.h>
 #include <linux/uio.h>
 #include <linux/bio.h>
+#include <linux/sort.h>
 #include "xattr.h"
 #include "acl.h"

+typedef struct {
+	__le32	*p;
+	__le32	key;
+	struct buffer_head *bh;
+} Indirect;
+
+struct ext3_ind_read_info {
+	int                     count;
+	int                     seq_prefetch;
+	long                    size;
+	struct buffer_head      *bh[0];
+};
+
+# define EXT3_IND_READ_INFO_SIZE(_c)        \
+	(sizeof(struct ext3_ind_read_info) + \
+	 sizeof(struct buffer_head *) * (_c))
+
+# define EXT3_IND_READ_MAX     	(32)
+
 static int ext3_writepage_trans_blocks(struct inode *inode);
+static Indirect *ext3_read_indblocks(struct inode *inode, int iblock,
+				     int depth, int offsets[4],
+				     Indirect chain[4], int *err);

 /*
  * Test whether an inode is a fast symlink.
@@ -233,12 +256,6 @@ no_delete:
 	clear_inode(inode);	/* We must guarantee clearing of inode... */
 }

-typedef struct {
-	__le32	*p;
-	__le32	key;
-	struct buffer_head *bh;
-} Indirect;
-
 static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
 {
 	p->key = *(p->p = v);
@@ -352,18 +369,21 @@ static int ext3_block_to_path(struct ino
  *	the whole chain, all way to the data (returns %NULL, *err == 0).
  */
 static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
-				 Indirect chain[4], int *err)
+				 Indirect chain[4], int ind_readahead, int *err)
 {
 	struct super_block *sb = inode->i_sb;
 	Indirect *p = chain;
 	struct buffer_head *bh;
+	int index;

 	*err = 0;
 	/* i_data is not going away, no lock needed */
 	add_chain (chain, NULL, EXT3_I(inode)->i_data + *offsets);
 	if (!p->key)
 		goto no_block;
-	while (--depth) {
+	for (index = 0; index < depth - 1; index++) {
+		if (ind_readahead && depth > 2 && index == depth - 2)
+			break;
 		bh = sb_bread(sb, le32_to_cpu(p->key));
 		if (!bh)
 			goto failure;
@@ -396,7 +416,11 @@ no_block:
  *	It is used when heuristic for sequential allocation fails.
  *	Rules are:
  *	  + if there is a block to the left of our position - allocate near it.
- *	  + if pointer will live in indirect block - allocate near that block.
+ *	  + If the METACLUSTER option is not specified, allocate the data
+ *	  block close to the metadata block. Otherwise, if the pointer will
+ *	  live in an indirect block, we cannot allocate near the indirect
+ *	  block since indirect blocks are allocated in the metacluster, so
+ *	  just put it in the same cylinder group as the inode.
  *	  + if pointer will live in inode - allocate in the same
  *	    cylinder group.
  *
@@ -421,9 +445,11 @@ static ext3_fsblk_t ext3_find_near(struc
 			return le32_to_cpu(*p);
 	}

-	/* No such thing, so let's try location of indirect block */
-	if (ind->bh)
-		return ind->bh->b_blocknr;
+	if (!test_opt(inode->i_sb, METACLUSTER)) {
+		/* No such thing, so let's try location of indirect block */
+		if (ind->bh)
+			return ind->bh->b_blocknr;
+	}

 	/*
 	 * It is going to be referred to from the inode itself? OK, just put it
@@ -475,8 +501,7 @@ static ext3_fsblk_t ext3_find_goal(struc
  *	@blks: number of data blocks to be mapped.
  *	@blocks_to_boundary:  the offset in the indirect block
  *
- *	return the total number of blocks to be allocate, including the
- *	direct and indirect blocks.
+ *	return the total number of direct blocks to be allocated.
  */
 static int ext3_blks_to_allocate(Indirect *branch, int k, unsigned long blks,
 		int blocks_to_boundary)
@@ -505,75 +530,18 @@ static int ext3_blks_to_allocate(Indirec
 }

 /**
- *	ext3_alloc_blocks: multiple allocate blocks needed for a branch
- *	@indirect_blks: the number of blocks need to allocate for indirect
- *			blocks
- *
- *	@new_blocks: on return it will store the new block numbers for
- *	the indirect blocks(if needed) and the first direct block,
- *	@blks:	on return it will store the total number of allocated
- *		direct blocks
- */
-static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
-			ext3_fsblk_t goal, int indirect_blks, int blks,
-			ext3_fsblk_t new_blocks[4], int *err)
-{
-	int target, i;
-	unsigned long count = 0;
-	int index = 0;
-	ext3_fsblk_t current_block = 0;
-	int ret = 0;
-
-	/*
-	 * Here we try to allocate the requested multiple blocks at once,
-	 * on a best-effort basis.
-	 * To build a branch, we should allocate blocks for
-	 * the indirect blocks(if not allocated yet), and at least
-	 * the first direct block of this branch.  That's the
-	 * minimum number of blocks need to allocate(required)
-	 */
-	target = blks + indirect_blks;
-
-	while (1) {
-		count = target;
-		/* allocating blocks for indirect blocks and direct blocks */
-		current_block = ext3_new_blocks(handle,inode,goal,&count,err);
-		if (*err)
-			goto failed_out;
-
-		target -= count;
-		/* allocate blocks for indirect blocks */
-		while (index < indirect_blks && count) {
-			new_blocks[index++] = current_block++;
-			count--;
-		}
-
-		if (count > 0)
-			break;
-	}
-
-	/* save the new block number for the first direct block */
-	new_blocks[index] = current_block;
-
-	/* total number of blocks allocated for direct blocks */
-	ret = count;
-	*err = 0;
-	return ret;
-failed_out:
-	for (i = 0; i <index; i++)
-		ext3_free_blocks(handle, inode, new_blocks[i], 1);
-	return ret;
-}
-
-/**
  *	ext3_alloc_branch - allocate and set up a chain of blocks.
  *	@inode: owner
  *	@indirect_blks: number of allocated indirect blocks
  *	@blks: number of allocated direct blocks
+ *	@goal: goal for allocation
  *	@offsets: offsets (in the blocks) to store the pointers to next.
  *	@branch: place to store the chain in.
  *
- *	This function allocates blocks, zeroes out all but the last one,
+ *	Returns an error code; the number of direct blocks allocated is returned via *blks.
+ *
+ *	This function allocates indirect_blks + *blks blocks, zeroes out all
+ *	indirect blocks,
  *	links them into chain and (if we are synchronous) writes them to disk.
  *	In other words, it prepares a branch that can be spliced onto the
  *	inode. It stores the information about that chain in the branch[], in
@@ -602,7 +570,7 @@ static int ext3_alloc_branch(handle_t *h
 	ext3_fsblk_t new_blocks[4];
 	ext3_fsblk_t current_block;

-	num = ext3_alloc_blocks(handle, inode, goal, indirect_blks,
+	num = ext3_new_blocks(handle, inode, goal, indirect_blks,
 				*blks, new_blocks, &err);
 	if (err)
 		return err;
@@ -799,17 +767,21 @@ int ext3_get_blocks_handle(handle_t *han
 	int blocks_to_boundary = 0;
 	int depth;
 	struct ext3_inode_info *ei = EXT3_I(inode);
-	int count = 0;
+	int count = 0, ind_readahead;
 	ext3_fsblk_t first_block = 0;

-
 	J_ASSERT(handle != NULL || create == 0);
 	depth = ext3_block_to_path(inode,iblock,offsets,&blocks_to_boundary);

 	if (depth == 0)
 		goto out;

-	partial = ext3_get_branch(inode, depth, offsets, chain, &err);
+	ind_readahead = !create && depth > 2;
+	partial = ext3_get_branch(inode, depth, offsets, chain,
+				  ind_readahead, &err);
+	if (!partial && ind_readahead)
+		partial = ext3_read_indblocks(inode, iblock, depth,
+					      offsets, chain, &err);

 	/* Simplest case - block found, no allocation needed */
 	if (!partial) {
@@ -844,7 +816,7 @@ int ext3_get_blocks_handle(handle_t *han
 	}

 	/* Next simple case - plain lookup or failed read of indirect block */
-	if (!create || err == -EIO)
+	if (!create || (err && err != -EAGAIN))
 		goto cleanup;

 	mutex_lock(&ei->truncate_mutex);
@@ -866,7 +838,8 @@ int ext3_get_blocks_handle(handle_t *han
 			brelse(partial->bh);
 			partial--;
 		}
-		partial = ext3_get_branch(inode, depth, offsets, chain, &err);
+		partial = ext3_get_branch(inode, depth, offsets, chain, 0,
+					&err);
 		if (!partial) {
 			count++;
 			mutex_unlock(&ei->truncate_mutex);
@@ -1974,7 +1947,7 @@ static Indirect *ext3_find_shared(struct
 	/* Make k index the deepest non-null offest + 1 */
 	for (k = depth; k > 1 && !offsets[k-1]; k--)
 		;
-	partial = ext3_get_branch(inode, k, offsets, chain, &err);
+	partial = ext3_get_branch(inode, k, offsets, chain, 0, &err);
 	/* Writer: pointers */
 	if (!partial)
 		partial = chain + k-1;
@@ -3297,3 +3270,560 @@ int ext3_change_inode_journal_flag(struc

 	return err;
 }
+
+/*
+ * ext3_ind_read_end_bio --
+ *
+ * 	bio callback for read IO issued from ext3_read_indblocks.
+ * 	May be called multiple times until the whole I/O completes at
+ * 	which point bio->bi_size = 0 and it frees read_info and bio.
+ * 	The first time it is called, first_bh is unlocked so that any sync
+ * 	waiter can unblock.
+ */
+static void ext3_ind_read_end_bio(struct bio *bio, int err)
+{
+	struct ext3_ind_read_info *read_info = bio->bi_private;
+	struct buffer_head *bh;
+	int uptodate = !err && test_bit(BIO_UPTODATE, &bio->bi_flags);
+	int i;
+
+	if (err == -EOPNOTSUPP)
+		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+
+	/* Wait for all buffers to finish - is this needed? */
+	if (bio->bi_size)
+		return;
+
+	for (i = 0; i < read_info->count; i++) {
+		bh = read_info->bh[i];
+		if (err == -EOPNOTSUPP)
+			set_bit(BH_Eopnotsupp, &bh->b_state);
+
+		if (uptodate) {
+			BUG_ON(buffer_uptodate(bh));
+			BUG_ON(ext3_buffer_prefetch(bh));
+			set_buffer_uptodate(bh);
+			if (read_info->seq_prefetch)
+				ext3_set_buffer_prefetch(bh);
+		}
+
+		unlock_buffer(bh);
+		brelse(bh);
+	}
+
+	kfree(read_info);
+	bio_put(bio);
+}
+
+/*
+ * ext3_get_max_read --
+ * 	@inode: inode of file.
+ * 	@block: block number in file (starting from zero).
+ * 	@offset_in_dind_block: offset of the indirect block inside its
+ * 	parent doubly-indirect block.
+ *
+ *      Compute the maximum no. of indirect blocks that can be read
+ *      satisfying the following constraints:
+ *              - Don't read indirect blocks beyond the end of the current
+ *              doubly-indirect block.
+ *              - Don't read beyond eof.
+ */
+static inline unsigned long ext3_get_max_read(const struct inode *inode,
+						  int block,
+						  int offset_in_dind_block)
+{
+	const struct super_block *sb = inode->i_sb;
+	unsigned long max_read;
+	unsigned long ptrs = EXT3_ADDR_PER_BLOCK(inode->i_sb);
+	unsigned long ptrs_bits = EXT3_ADDR_PER_BLOCK_BITS(inode->i_sb);
+	unsigned long blocks_in_file =
+		(inode->i_size + sb->s_blocksize - 1) >> sb->s_blocksize_bits;
+	unsigned long remaining_ind_blks_in_dind =
+		(ptrs >= offset_in_dind_block) ? (ptrs - offset_in_dind_block)
+					       : 0;
+	unsigned long remaining_ind_blks_before_eof =
+		((blocks_in_file - EXT3_NDIR_BLOCKS + ptrs - 1) >> ptrs_bits) -
+		((block - EXT3_NDIR_BLOCKS) >> ptrs_bits);
+
+	BUG_ON(block >= blocks_in_file);
+
+	max_read = min_t(unsigned long, remaining_ind_blks_in_dind,
+			 remaining_ind_blks_before_eof);
+
+	BUG_ON(max_read < 1);
+
+	return max_read;
+}
+
+static void ext3_read_indblocks_submit(struct bio **pbio,
+					struct ext3_ind_read_info **pread_info,
+					int *read_cnt, int seq_prefetch)
+{
+	struct bio *bio = *pbio;
+	struct ext3_ind_read_info *read_info = *pread_info;
+
+	BUG_ON(*read_cnt < 1);
+
+	read_info->seq_prefetch = seq_prefetch;
+	read_info->count = *read_cnt;
+	read_info->size = bio->bi_size;
+	bio->bi_private = read_info;
+	bio->bi_end_io = ext3_ind_read_end_bio;
+	submit_bio(READ, bio);
+
+	*pbio = NULL;
+	*pread_info = NULL;
+	*read_cnt = 0;
+}
+
+struct ind_block_info {
+	ext3_fsblk_t		blockno;
+	struct buffer_head	*bh;
+};
+
+static int ind_info_cmp(const void *a, const void *b)
+{
+	struct ind_block_info *info_a = (struct ind_block_info *)a;
+	struct ind_block_info *info_b = (struct ind_block_info *)b;
+
+	return info_a->blockno - info_b->blockno;
+}
+
+static void ind_info_swap(void *a, void *b, int size)
+{
+	struct ind_block_info *info_a = (struct ind_block_info *)a;
+	struct ind_block_info *info_b = (struct ind_block_info *)b;
+	struct ind_block_info tmp;
+
+	tmp = *info_a;
+	*info_a = *info_b;
+	*info_b = tmp;
+}
+
+/*
+ * ext3_read_indblocks_async --
+ *      @sb:            super block
+ *      @ind_blocks[]:  array of indirect block numbers on disk
+ *      @count:         maximum number of indirect blocks to read
+ *      @first_bh:      buffer_head for indirect block ind_blocks[0], may be
+ *                      NULL
+ *      @seq_prefetch:  if this is part of a sequential prefetch and buffers'
+ *                      prefetch bit must be set.
+ *      @blocks_done:   number of blocks considered for prefetching.
+ *
+ *      Issue a single bio request to read up to count buffers identified in
+ *      ind_blocks[]. Fewer than count buffers may be read in some cases:
+ *      - If a buffer is found to be uptodate and its prefetch bit is set, we
+ *      don't look at any more buffers as they will most likely be in the cache.
+ *      - We skip buffers we cannot lock without blocking (except for first_bh
+ *      if specified).
+ *      - We skip buffers beyond a certain range on disk.
+ *
+ *      This function must issue read on first_bh if specified unless of course
+ *      it's already uptodate.
+ */
+static int ext3_read_indblocks_async(struct super_block *sb,
+				     const __le32 ind_blocks[], int count,
+				     struct buffer_head *first_bh,
+				     int seq_prefetch,
+				     unsigned long *blocks_done)
+{
+	struct buffer_head *bh;
+	struct bio *bio = NULL;
+	struct ext3_ind_read_info *read_info = NULL;
+	int read_cnt = 0, blk;
+	ext3_fsblk_t prev_blk = 0, io_start_blk = 0, curr;
+	struct ind_block_info *ind_info = NULL;
+	int err = 0, ind_info_count = 0;
+
+	BUG_ON(count < 1);
+	/* Don't move this to ext3_get_max_read() since callers often need to
+	 * trim the count returned by that function. So this bound must only
+	 * be imposed at the last moment. */
+	count = min_t(unsigned long, count, EXT3_IND_READ_MAX);
+	*blocks_done = 0UL;
+
+	if (count == 1 && first_bh) {
+		lock_buffer(first_bh);
+		get_bh(first_bh);
+		first_bh->b_end_io = end_buffer_read_sync;
+		submit_bh(READ, first_bh);
+		*blocks_done = 1UL;
+		return 0;
+	}
+
+	ind_info = kmalloc(count * sizeof(*ind_info), GFP_KERNEL);
+	if (unlikely(!ind_info))
+		return -ENOMEM;
+
+	/*
+	 * First pass: sort block numbers for all indirect blocks that we'll
+	 * read. This allows us to scan blocks in sequential order during the
+	 * second pass, which helps coalesce requests to contiguous blocks.
+	 * Since we sort block numbers here instead of assuming any specific
+	 * layout on the disk, we have some protection against different
+	 * indirect block layout strategies as long as they keep all indirect
+	 * blocks close by.
+	 */
+	for (blk = 0; blk < count; blk++) {
+		curr = le32_to_cpu(ind_blocks[blk]);
+		if (!curr)
+			continue;
+
+		/*
+		 * Skip this block if it lies too far from blocks we have
+		 * already decided to read. "Too far" should typically indicate
+		 * lying on a different track on the disk. EXT3_IND_READ_MAX
+		 * seems reasonable for most disks.
+		 */
+		if (io_start_blk > 0 &&
+			(max(io_start_blk, curr) - min(io_start_blk, curr) >=
+				EXT3_IND_READ_MAX))
+			continue;
+
+		if (blk == 0 && first_bh) {
+			bh = first_bh;
+			get_bh(first_bh);
+		} else {
+			bh = sb_getblk(sb, curr);
+			if (unlikely(!bh)) {
+				err = -ENOMEM;
+				goto failure;
+			}
+		}
+
+		if (buffer_uptodate(bh)) {
+			if (ext3_buffer_prefetch(bh)) {
+				brelse(bh);
+				break;
+			}
+			brelse(bh);
+			continue;
+		}
+
+		if (io_start_blk == 0)
+			io_start_blk = curr;
+
+		ind_info[ind_info_count].blockno = curr;
+		ind_info[ind_info_count].bh = bh;
+		ind_info_count++;
+	}
+	*blocks_done = blk;
+
+	sort(ind_info, ind_info_count, sizeof(*ind_info),
+		ind_info_cmp, ind_info_swap);
+
+	/* Second pass: compose bio requests and issue them. */
+	for (blk = 0; blk < ind_info_count; blk++) {
+		bh = ind_info[blk].bh;
+		curr = ind_info[blk].blockno;
+
+		if (prev_blk > 0 && curr != prev_blk + 1) {
+			ext3_read_indblocks_submit(&bio, &read_info,
+						&read_cnt, seq_prefetch);
+			prev_blk = 0;
+		}
+
+		/* Lock the buffer without blocking, skipping any buffers
+		 * which would require us to block. first_bh when specified is
+		 * an exception as caller typically wants it to be read for
+		 * sure (e.g., ext3_read_indblocks_sync).
+		 */
+		if (bh == first_bh) {
+			lock_buffer(bh);
+		} else if (test_set_buffer_locked(bh)) {
+			brelse(bh);
+			continue;
+		}
+
+		/* Check again with the buffer locked. */
+		if (buffer_uptodate(bh)) {
+			if (ext3_buffer_prefetch(bh)) {
+				unlock_buffer(bh);
+				brelse(bh);
+				break;
+			}
+			unlock_buffer(bh);
+			brelse(bh);
+			continue;
+		}
+
+		if (read_cnt == 0) {
+			/* read_info freed in ext3_ind_read_end_bio(). */
+			read_info = kmalloc(EXT3_IND_READ_INFO_SIZE(count),
+					    GFP_KERNEL);
+			if (unlikely(!read_info)) {
+				err = -ENOMEM;
+				goto failure;
+			}
+
+			bio = bio_alloc(GFP_KERNEL, count);
+			if (unlikely(!bio)) {
+				err = -ENOMEM;
+				goto failure;
+			}
+			bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
+			bio->bi_bdev = bh->b_bdev;
+		}
+
+		if (bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh))
+				< bh->b_size) {
+			brelse(bh);
+			if (read_cnt == 0)
+				goto failure;
+
+			break;
+		}
+
+		read_info->bh[read_cnt++] = bh;
+		prev_blk = curr;
+	}
+
+	if (read_cnt == 0)
+		goto done;
+
+	ext3_read_indblocks_submit(&bio, &read_info, &read_cnt, seq_prefetch);
+
+	kfree(ind_info);
+	return 0;
+
+failure:
+	while (--read_cnt >= 0) {
+		unlock_buffer(read_info->bh[read_cnt]);
+		brelse(read_info->bh[read_cnt]);
+	}
+	*blocks_done = 0UL;
+
+done:
+	kfree(read_info);
+
+	if (bio)
+		bio_put(bio);
+
+	kfree(ind_info);
+	return err;
+}
+
+/*
+ * ext3_read_indblocks_sync --
+ *      @sb:            super block
+ *      @ind_blocks[]:  array of indirect block numbers on disk
+ *      @count:         maximum number of indirect blocks to read
+ *      @first_bh:      buffer_head for indirect block ind_blocks[0], must be
+ *                      non-NULL.
+ *      @seq_prefetch:  set prefetch bit of buffers, used when this is part of
+ *                      a sequential prefetch.
+ *      @blocks_done:   number of blocks considered for prefetching.
+ *
+ *      Synchronously read at most count indirect blocks listed in
+ *      ind_blocks[]. This function calls ext3_read_indblocks_async() to do all
+ *      the hard work. It waits for read to complete on first_bh before
+ *      returning.
+ */
+
+static int ext3_read_indblocks_sync(struct super_block *sb,
+				    const __le32 ind_blocks[], int count,
+				    struct buffer_head *first_bh,
+				    int seq_prefetch,
+				    unsigned long *blocks_done)
+{
+	int err;
+
+	BUG_ON(count < 1);
+	BUG_ON(!first_bh);
+
+	err = ext3_read_indblocks_async(sb, ind_blocks, count, first_bh,
+					seq_prefetch, blocks_done);
+	if (err)
+		return err;
+
+	wait_on_buffer(first_bh);
+	if (!buffer_uptodate(first_bh))
+		err = -EIO;
+
+	/* If seq_prefetch != 0, ext3_read_indblocks_async() sets the prefetch
+	 * bit on all buffers, but the first buffer of a sync read is never a
+	 * prefetch buffer since it is needed right away, so clear its bit.
+	 */
+	if (seq_prefetch)
+		ext3_clear_buffer_prefetch(first_bh);
+
+	BUG_ON(ext3_buffer_prefetch(first_bh));
+
+	return err;
+}
+
+/*
+ * ext3_read_indblocks --
+ *
+ * 	@inode: inode of file
+ * 	@iblock: block number inside file (starting from 0).
+ * 	@depth: depth of path from inode to data block.
+ * 	@offsets: array of offsets within blocks identified in 'chain'.
+ * 	@chain: array of Indirect with info about all levels of blocks until
+ * 	the data block.
+ * 	@err: error pointer.
+ *
+ * 	This function is called after reading all metablocks leading to 'iblock'
+ * 	except the (singly) indirect block. It reads the indirect block if not
+ * 	already in the cache and may also prefetch the next few indirect blocks.
+ * 	It uses a combination of synchronous and asynchronous requests to
+ * 	accomplish this. We do prefetching even for random reads by reading
+ * 	ahead one indirect block since reads of size >=512KB have at least 12%
+ * 	chance of spanning two indirect blocks.
+ */
+
+static Indirect *ext3_read_indblocks(struct inode *inode, int iblock,
+				     int depth, int offsets[4],
+				     Indirect chain[4], int *err)
+{
+	struct super_block *sb = inode->i_sb;
+	struct buffer_head *first_bh, *prev_bh;
+	unsigned long max_read, blocks_done = 0;
+	__le32 *ind_blocks;
+
+	/* Must have doubly indirect block for prefetching indirect blocks. */
+	BUG_ON(depth <= 2);
+	BUG_ON(!chain[depth-2].key);
+
+	*err = 0;
+
+	/* Handle first block */
+	ind_blocks = chain[depth-2].p;
+	first_bh = sb_getblk(sb, le32_to_cpu(ind_blocks[0]));
+	if (unlikely(!first_bh)) {
+		printk(KERN_ERR "Failed to get block %u for sb %p\n",
+		       le32_to_cpu(ind_blocks[0]), sb);
+		goto failure;
+	}
+
+	BUG_ON(first_bh->b_size != sb->s_blocksize);
+
+	if (buffer_uptodate(first_bh)) {
+		/* Found the buffer in cache, either it was accessed recently or
+		 * it was prefetched while reading previous indirect block(s).
+		 * We need to figure out if we need to prefetch the following
+		 * indirect blocks.
+		 */
+		if (!ext3_buffer_prefetch(first_bh)) {
+			/* Either we've seen this indirect block before while
+			 * accessing another data block, or this is a random
+			 * read. In the former case, we must have done the
+			 * needful the first time we had a cache hit on this
+			 * indirect block, in the latter case we obviously
+			 * don't need to do any prefetching.
+			 */
+			goto done;
+		}
+
+		max_read = ext3_get_max_read(inode, iblock,
+					     offsets[depth-2]);
+
+		/* This indirect block is in the cache due to prefetching and
+		 * this is its first cache hit, clear the prefetch bit and
+		 * make sure the following blocks are also prefetched.
+		 */
+		ext3_clear_buffer_prefetch(first_bh);
+
+		if (max_read >= 2) {
+			/* ext3_read_indblocks_async() stops at the first
+			 * indirect block which has the prefetch bit set which
+			 * will most likely be the very next indirect block.
+			 */
+			ext3_read_indblocks_async(sb, &ind_blocks[1],
+						  max_read - 1,
+						  NULL, 1, &blocks_done);
+		}
+
+	} else {
+		/* Buffer is not in memory, we need to read it. If we are
+		 * reading sequentially from the previous indirect block, we
+		 * have just detected a sequential read and we must prefetch
+		 * some indirect blocks for future.
+		 */
+
+		max_read = ext3_get_max_read(inode, iblock,
+					     offsets[depth-2]);
+
+		if ((ind_blocks - (__le32 *)chain[depth-2].bh->b_data) >= 1) {
+			prev_bh = sb_getblk(sb, le32_to_cpu(ind_blocks[-1]));
+			if (buffer_uptodate(prev_bh) &&
+			    !ext3_buffer_prefetch(prev_bh)) {
+				/* Detected sequential read. */
+				brelse(prev_bh);
+
+				/* Sync read indirect block, also read the next
+				 * few indirect blocks.
+				 */
+				*err = ext3_read_indblocks_sync(sb, ind_blocks,
+							 max_read, first_bh, 1,
+							 &blocks_done);
+
+				if (*err)
+					goto out;
+
+				/* In case the very next indirect block is
+				 * discontiguous by a non-trivial amount,
+				 * ext3_read_indblocks_sync() above won't
+				 * prefetch it (indicated by blocks_done < 2).
+				 * So to help sequential read, schedule an
+				 * async request for reading the next
+				 * contiguous indirect block range (which
+				 * in metaclustering case would be the next
+				 * metacluster, without metaclustering it
+				 * would be the next indirect block). This is
+				 * expected to benefit the non-metaclustering
+				 * case.
+				 */
+				if (max_read >= 2 && blocks_done < 2)
+					ext3_read_indblocks_async(sb,
+							&ind_blocks[1],
+							max_read - 1,
+							NULL, 1, &blocks_done);
+
+				goto done;
+			}
+			brelse(prev_bh);
+		}
+
+		/* Either random read, or sequential detection failed above.
+		 * We always prefetch the next indirect block in this case
+		 * whenever possible.
+		 * This is because for random reads of size ~512KB, there is
+		 * >12% chance that a read will span two indirect blocks.
+		 */
+		*err = ext3_read_indblocks_sync(sb, ind_blocks,
+						(max_read >= 2) ? 2 : 1,
+						first_bh, 0, &blocks_done);
+		if (*err)
+			goto out;
+	}
+
+done:
+	/* Reader: pointers */
+	if (!verify_chain(chain, &chain[depth - 2])) {
+		brelse(first_bh);
+		goto changed;
+	}
+	add_chain(&chain[depth - 1], first_bh,
+		  (__le32*)first_bh->b_data + offsets[depth - 1]);
+	/* Reader: end */
+	if (!chain[depth - 1].key)
+		goto out;
+
+	BUG_ON(!buffer_uptodate(first_bh));
+	return NULL;
+
+changed:
+	*err = -EAGAIN;
+	goto out;
+failure:
+	*err = -EIO;
+out:
+	if (*err) {
+		ext3_debug("Error %d reading indirect blocks\n", *err);
+		return &chain[depth - 2];
+	} else
+		return &chain[depth - 1];
+}
+
diff -uprdN linux-2.6.23mm1-clean/fs/ext3/super.c linux-2.6.23mm1-ext3mc/fs/ext3/super.c
--- linux-2.6.23mm1-clean/fs/ext3/super.c	2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/super.c	2007-12-20 18:11:14.000000000 -0800
@@ -625,6 +625,9 @@ static int ext3_show_options(struct seq_
 	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
 		seq_puts(seq, ",data=writeback");

+	if (test_opt(sb, METACLUSTER))
+		seq_puts(seq, ",metacluster");
+
 	ext3_show_quota_options(seq, sb);

 	return 0;
@@ -758,7 +761,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
 	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
-	Opt_grpquota
+	Opt_grpquota, Opt_metacluster
 };

 static match_table_t tokens = {
@@ -808,6 +811,7 @@ static match_table_t tokens = {
 	{Opt_quota, "quota"},
 	{Opt_usrquota, "usrquota"},
 	{Opt_barrier, "barrier=%u"},
+	{Opt_metacluster, "metacluster"},
 	{Opt_err, NULL},
 	{Opt_resize, "resize"},
 };
@@ -1140,6 +1144,9 @@ clear_qf_name:
 		case Opt_bh:
 			clear_opt(sbi->s_mount_opt, NOBH);
 			break;
+		case Opt_metacluster:
+			set_opt(sbi->s_mount_opt, METACLUSTER);
+			break;
 		default:
 			printk (KERN_ERR
 				"EXT3-fs: Unrecognized mount option \"%s\" "
@@ -1674,6 +1681,13 @@ static int ext3_fill_super (struct super
 	}
 	sbi->s_frags_per_block = 1;
 	sbi->s_blocks_per_group = le32_to_cpu(es->s_blocks_per_group);
+	if (test_opt(sb, METACLUSTER)) {
+		sbi->s_nonmc_blocks_per_group = sbi->s_blocks_per_group -
+			sbi->s_blocks_per_group / 12;
+		sbi->s_nonmc_blocks_per_group &= ~7;
+	} else
+		sbi->s_nonmc_blocks_per_group = sbi->s_blocks_per_group;
+
 	sbi->s_frags_per_group = le32_to_cpu(es->s_frags_per_group);
 	sbi->s_inodes_per_group = le32_to_cpu(es->s_inodes_per_group);
 	if (EXT3_INODE_SIZE(sb) == 0)
@@ -1783,6 +1797,18 @@ static int ext3_fill_super (struct super
 	sbi->s_rsv_window_head.rsv_goal_size = 0;
 	ext3_rsv_window_add(sb, &sbi->s_rsv_window_head);

+	if (test_opt(sb, METACLUSTER)) {
+		sbi->s_bginfo = kmalloc(sbi->s_groups_count *
+					sizeof(*sbi->s_bginfo), GFP_KERNEL);
+		if (!sbi->s_bginfo) {
+			printk(KERN_ERR "EXT3-fs: not enough memory\n");
+			goto failed_mount3;
+		}
+		for (i = 0; i < sbi->s_groups_count; i++)
+			sbi->s_bginfo[i].bgi_free_nonmc_blocks_count = -1;
+	} else
+		sbi->s_bginfo = NULL;
+
 	/*
 	 * set up enough so that it can read an inode
 	 */
@@ -1808,16 +1834,16 @@ static int ext3_fill_super (struct super
 	if (!test_opt(sb, NOLOAD) &&
 	    EXT3_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL)) {
 		if (ext3_load_journal(sb, es, journal_devnum))
-			goto failed_mount3;
+			goto failed_mount4;
 	} else if (journal_inum) {
 		if (ext3_create_journal(sb, es, journal_inum))
-			goto failed_mount3;
+			goto failed_mount4;
 	} else {
 		if (!silent)
 			printk (KERN_ERR
 				"ext3: No journal on filesystem on %s\n",
 				sb->s_id);
-		goto failed_mount3;
+		goto failed_mount4;
 	}

 	/* We have now updated the journal if required, so we can
@@ -1840,7 +1866,7 @@ static int ext3_fill_super (struct super
 		    (sbi->s_journal, 0, 0, JFS_FEATURE_INCOMPAT_REVOKE)) {
 			printk(KERN_ERR "EXT3-fs: Journal does not support "
 			       "requested data journaling mode\n");
-			goto failed_mount4;
+			goto failed_mount5;
 		}
 	default:
 		break;
@@ -1863,13 +1889,13 @@ static int ext3_fill_super (struct super
 	if (!sb->s_root) {
 		printk(KERN_ERR "EXT3-fs: get root inode failed\n");
 		iput(root);
-		goto failed_mount4;
+		goto failed_mount5;
 	}
 	if (!S_ISDIR(root->i_mode) || !root->i_blocks || !root->i_size) {
 		dput(sb->s_root);
 		sb->s_root = NULL;
 		printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
-		goto failed_mount4;
+		goto failed_mount5;
 	}

 	ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
@@ -1901,8 +1927,10 @@ cantfind_ext3:
 		       sb->s_id);
 	goto failed_mount;

-failed_mount4:
+failed_mount5:
 	journal_destroy(sbi->s_journal);
+failed_mount4:
+	kfree(sbi->s_bginfo);
 failed_mount3:
 	percpu_counter_destroy(&sbi->s_freeblocks_counter);
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
diff -uprdN linux-2.6.23mm1-clean/include/linux/ext3_fs.h linux-2.6.23mm1-ext3mc/include/linux/ext3_fs.h
--- linux-2.6.23mm1-clean/include/linux/ext3_fs.h	2007-10-17 18:31:43.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/include/linux/ext3_fs.h	2007-12-21 05:40:05.000000000 -0800
@@ -380,6 +380,7 @@ struct ext3_inode {
 #define EXT3_MOUNT_QUOTA		0x80000 /* Some quota option set */
 #define EXT3_MOUNT_USRQUOTA		0x100000 /* "old" user quota */
 #define EXT3_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
+#define EXT3_MOUNT_METACLUSTER		0x400000 /* Indirect block clustering */

 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
@@ -493,6 +494,7 @@ struct ext3_super_block {
 #ifdef __KERNEL__
 #include <linux/ext3_fs_i.h>
 #include <linux/ext3_fs_sb.h>
+#include <linux/buffer_head.h>
 static inline struct ext3_sb_info * EXT3_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -722,6 +724,11 @@ struct dir_private_info {
 	__u32		next_hash;
 };

+/* Special bh flag used by the metacluster readahead logic. */
+enum ext3_bh_state_bits {
+	EXT3_BH_PREFETCH = BH_JBD_Sentinel,
+};
+
 /* calculate the first block number of the group */
 static inline ext3_fsblk_t
 ext3_group_first_block_no(struct super_block *sb, unsigned long group_no)
@@ -730,6 +737,24 @@ ext3_group_first_block_no(struct super_b
 		le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block);
 }

+static inline void
+ext3_set_buffer_prefetch(struct buffer_head *bh)
+{
+	set_bit(EXT3_BH_PREFETCH, &bh->b_state);
+}
+
+static inline void
+ext3_clear_buffer_prefetch(struct buffer_head *bh)
+{
+	clear_bit(EXT3_BH_PREFETCH, &bh->b_state);
+}
+
+static inline int
+ext3_buffer_prefetch(struct buffer_head *bh)
+{
+	return test_bit(EXT3_BH_PREFETCH, &bh->b_state);
+}
+
 /*
  * Special error return code only used by dx_probe() and its callers.
  */
@@ -752,8 +777,9 @@ extern int ext3_bg_has_super(struct supe
 extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
 extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode,
 			ext3_fsblk_t goal, int *errp);
-extern ext3_fsblk_t ext3_new_blocks (handle_t *handle, struct inode *inode,
-			ext3_fsblk_t goal, unsigned long *count, int *errp);
+extern int ext3_new_blocks(handle_t *handle, struct inode *inode,
+			ext3_fsblk_t goal, int indirect_blks, int blks,
+			ext3_fsblk_t new_blocks[], int *errp);
 extern void ext3_free_blocks (handle_t *handle, struct inode *inode,
 			ext3_fsblk_t block, unsigned long count);
 extern void ext3_free_blocks_sb (handle_t *handle, struct super_block *sb,
diff -uprdN linux-2.6.23mm1-clean/include/linux/ext3_fs_sb.h linux-2.6.23mm1-ext3mc/include/linux/ext3_fs_sb.h
--- linux-2.6.23mm1-clean/include/linux/ext3_fs_sb.h	2007-10-17 18:31:43.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/include/linux/ext3_fs_sb.h	2007-12-20 18:11:14.000000000 -0800
@@ -24,6 +24,8 @@
 #endif
 #include <linux/rbtree.h>

+struct ext3_bg_info;
+
 /*
  * third extended-fs super-block data in memory
  */
@@ -33,6 +35,7 @@ struct ext3_sb_info {
 	unsigned long s_inodes_per_block;/* Number of inodes per block */
 	unsigned long s_frags_per_group;/* Number of fragments in a group */
 	unsigned long s_blocks_per_group;/* Number of blocks in a group */
+	unsigned long s_nonmc_blocks_per_group;/* Number of non-metacluster blocks in a group */
 	unsigned long s_inodes_per_group;/* Number of inodes in a group */
 	unsigned long s_itb_per_group;	/* Number of inode table blocks per group */
 	unsigned long s_gdb_count;	/* Number of group descriptor blocks */
@@ -67,6 +70,9 @@ struct ext3_sb_info {
 	struct rb_root s_rsv_window_root;
 	struct ext3_reserve_window_node s_rsv_window_head;

+	/* array of per-bg in-memory info */
+	struct ext3_bg_info *s_bginfo;
+
 	/* Journaling */
 	struct inode * s_journal_inode;
 	struct journal_s * s_journal;
@@ -83,4 +89,11 @@ struct ext3_sb_info {
 #endif
 };

+/*
+ * in-memory data associated with each block group.
+ */
+struct ext3_bg_info {
+	int bgi_free_nonmc_blocks_count;/* Number of free non-metacluster blocks in group */
+};
+
 #endif	/* _LINUX_EXT3_FS_SB */
diff -uprdN linux-2.6.23mm1-clean/include/linux/jbd.h linux-2.6.23mm1-ext3mc/include/linux/jbd.h
--- linux-2.6.23mm1-clean/include/linux/jbd.h	2007-10-17 18:31:43.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/include/linux/jbd.h	2007-12-20 18:11:14.000000000 -0800
@@ -294,6 +294,7 @@ enum jbd_state_bits {
 	BH_State,		/* Pins most journal_head state */
 	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
 	BH_Unshadow,		/* Dummy bit, for BJ_Shadow wakeup filtering */
+	BH_JBD_Sentinel,	/* Start bit for clients of jbd */
 };

 BUFFER_FNS(JBD, jbd)
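
For reference, here is a rough sketch of how the reworked ext3_new_blocks()
interface is meant to be called. It is modeled on the ext3_alloc_branch()
change above; the function name below is purely illustrative and is not part
of the patch:

	/* Sketch only: allocate 'indirect_blks' indirect blocks plus up to
	 * 'blks' direct blocks for one branch, the way ext3_alloc_branch()
	 * does after this patch. */
	static int example_branch_alloc(handle_t *handle, struct inode *inode,
					ext3_fsblk_t goal, int indirect_blks,
					int blks)
	{
		ext3_fsblk_t new_blocks[4];
		int err = 0;
		int num;

		/* Returns the number of direct blocks allocated. On success
		 * all requested indirect blocks have been allocated as well
		 * (from the metacluster when possible):
		 * new_blocks[0..indirect_blks-1] hold the indirect blocks and
		 * new_blocks[indirect_blks] is the first of 'num' contiguous
		 * direct blocks. */
		num = ext3_new_blocks(handle, inode, goal, indirect_blks, blks,
				      new_blocks, &err);
		if (err)
			return err;

		/* ... link the branch together as ext3_alloc_branch() does ... */

		return num;
	}

For sizing, with the default in ext3_fill_super() above a 32768-block group
gets a non-metacluster region of 32768 - 32768/12 = 30038 blocks, rounded
down to a multiple of 8, i.e. 30032 blocks; the remaining blocks 30032-32767
(2736 blocks, roughly 8.3% of the group) form the metacluster.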








On Nov 16, 2007 6:58 PM, Theodore Tso <tytso@mit.edu> wrote:
> The practice of starting search in the next block block in the
> metadata area only makes a difference for one indirect block, yes, but
> it's the right thing to do.  And if you fold the ext3_new_blocks and
> ext3_new_indirect_blocks(), it's really not that hard.  You can
> basically do something like this:
>
>         if (alloc_for_metadata)
>                 strategy = 0x132;
>         else
>                 strategy = 0x231;
>         for (; strategy; strategy = strategy >> 8) {
>                 switch (strategy & 0xF) {
>                 case 1:
>                      start = block_group_start;
>                      end = mc_start - 1;
>                      break;
>                 case 2:
>                      start = mc_start;
>                      end = mc_end;
>                      break;
>                 case 3:
>                      start = mc_end + 1;
>                      end = block_group_end;
>                      break;
>                 }
>                 <search region between start.. end>
>         }
>
> > We initially avoided making metaclustering a superblock tunable as we
> > didn't want to make any changes to the on-disk format as then ext4
> > extents are also a good option.
>
> Allocating a superblock field is no big deal.  I'll note further that
> metaclustering is not necessarily mutually exclusive with ext4
> extents.  Allocating the extent tree data blocks out of the
> metacluster blocks can be a good idea, depending on the average size
> of the blocks and how fragmented the filesystem gets (and hence how
> many contiguous extents can be expected).  If the filesystem is
> storing lots of really big files where being contiguous across
> multiple blockgroups are productive, then the metacluster area would
> actually be counterproductive.  And if files are all small so the
> extents fit the inode, the metadata cluster area wouldn't be necessary
> at all.  But if there are multiple external extent blocks in a block
> group, it would be useful for them to be allocated together.
>
> > If metaclustering gains acceptance
> > it might make sense to make it a superblock tunable. However, I would
> > avoid putting metacluster size into the superblock for the following
> > reason. Ideally, we should not have to bother about finding the sweet
> > spot of metacluster size as
> > (1) a given file system can be used for storing different kinds
> > of files at different times and it would be a pain to tune it every now
> > and then, and
>
> Yes, it doesn't make sense to retune the filesystem.  I was assuming
> that this would only be done at mke2fs time.
>
> > (2) it opens the possibility of doubting metacluster size for unrelated
> > ext3/fsck performance anomalies.
>
> I'm not sure I understand your concern.  The reality is that 99% of
> the time users will never change it from the defaults, but making it
> tunable makes it much, much easier for us to try various experiments
> to determine what is the best initial value for different workloads.
> What might get used for a Usenet news spool or a Squid cache might be
> quite different from series of DVD image files.
>
> > Allow me to propose a solution that will most likely address the above
> > issue and please ignore its complexity for a moment. Instead of a two
> > level partitioning in the block space between data blocks and
> > metacluster blocks, have a 3 or 4 level partitioning. E.g., a block
> > group with 'd' blocks can have d/32 blocks in metacluster level 1,
> > d/64 blocks in metacluster level 2, and d/128 blocks in metacluster
> > level 3 (define level 0 has having the remaining blocks = d - d/32 -
> > d/64 - d/128). Data block allocation starts looking for a free block
> > starting from the lowest possible level. If it is unable to find any
> > free blocks at that level in all block groups, it moves up a level and
> > so on. Indirect block allocation proceeds in the opposite direction
> > starting from higher levels. This approach has several benefits:
>
> That is clever.  Oh, one other thing.  You didn't mention what
> happened when the metacluster field was placed at the end of the block
> group.  I assume you tried that in your experiments; what were the
> results?  The obvious thing to do to avoid further fragmentation of
> the block group would be to put level 1 at the end of the block group,
> level 2 just before it, and level 3 before that, and then allocate the
> data blocks starting at the beginning of the block group, i.e:
>
> +----------------------------------+---------------+---------+-------+
> |     data                         | level 3       | level 2 | lvl 1 |
> +----------------------------------+---------------+---------+-------+
>
>
> > In traditional metaclustering, once we run out of metacluster blocks
> > or data blocks, all bets are off. This forces us to keep small
> > metaclusters in order to avoid this situation altogether. But with small
> > metaclusters, we cannot optimize indirect block allocation on file
> > systems with many small files (>48KB). There is only one glitch in
> > implementing this. If a block group doesn't have any free blocks at a
> > given level, we should be able to find that out quickly instead of
> > having to scan its entire bitmap. gdp->bg_free_blocks_count is not good
> > enough for this.
>
> Ideally, true, but this was a defect with the original metacluster
> scheme as well.  We could steal some bits in the block_group
> descriptor structure to indicate whether a particular level is full,
> though.  This would be another data format change that would require
> e2fsprogs support.
>
> Regards,
>
>                                                 - Ted
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2007-12-21 14:15         ` Abhishek Rai
@ 2008-01-10 21:17           ` Abhishek Rai
  2008-01-11 17:05             ` Daniel Phillips
  0 siblings, 1 reply; 21+ messages in thread
From: Abhishek Rai @ 2008-01-10 21:17 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, Andreas Dilger, linux-kernel,
	Ken Chen, Mike Waychison, rohitseth

Hi,

Resending this email as I haven't received any feedback on this patch
probably due to the holiday season. Any feedback will be greatly
appreciated!

Thanks


I have implemented a revised patch that addresses the concerns raised
with the previous patch. To summarize, here were the three main
concerns:

1. Metacluster size is sensitive to the average file size in the
block-group/file-system, so how do we find a good metacluster size?
The last discussion we had on this topic was to have multiple
metaclusters of different sizes per group and have direct blocks
overflow towards smaller metacluster sizes and indirect blocks
overflow towards bigger metaclusters. The overflowing from one
metacluster level to the next would happen only when none of the block
groups had any free blocks left at that metacluster level, IOW
overflowing was a file-system-wide event.

2. When indirect block allocation in the metacluster fails, don't fully
fall back to the old-style allocation scheme; instead, fall back to the
old-style scheme only for that block group and repeat this for each
block group. All of this now happens in ext3_new_blocks(); to control
the size of that function, I created a few helper functions.

3. Don't have separate functions for allocating indirect and direct
blocks, as there is considerable overlap, especially in the journaling
code. The two functions are now rolled into a single ext3_new_blocks()
which is called directly from ext3_alloc_branch() instead of via
ext3_alloc_blocks() (which is nuked now). See the sketch of the new
interface below.
The current approach I've implemented is similar in principle but it
fixes a problem with the above scheme. The above scheme of overflowing
into the next "metacluster level" upon exhaustion of the current
metacluster level across all block groups results in increased
fragmentation. E.g., say a block group BG1 ran out of blocks at
metacluster (mc) level X and now wants to use the next level. It
checks and finds that a different block group BG2 still has free blocks
at mc level X, so it starts using level X in BG2, which results in the
file getting fragmented. In the new patch, we'd continue using level
X+1 in BG1 to reduce overall fragmentation and so the new patch
results in overflow only within the same block group.

Also, I've chosen a simpler implementation for this multi-level
metaclustering scheme by not having real metacluster levels but by
having direct blocks and indirect blocks grow towards each other from
opposite ends of the block group. Both are conceptually the same. Now
the overflow condition is that direct block allocation cannot spill
into the indirect block region (the metacluster) and vice versa unless it
has run out of free blocks in its own region. This information is now
available through a new memory-only per-block-group counter that keeps a
count of the number of free blocks in the non-metacluster region. This
also addresses Andreas Dilger's concern with the previous
implementation regarding metaclusters increasing fragmentation by
splitting the block group into two halves.
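
To make the overflow condition concrete, here is the gist of the check
applied before a data block allocation is allowed to claim a given
block (a simplified sketch of allow_mc_alloc() from the patch below;
the blk < 0 "no goal" case is omitted and the slack of 8 blocks is
explained in the code comment there):

	/* blk is the group-relative block being considered.  The
	 * metacluster occupies the blocks from s_nonmc_blocks_per_group
	 * to the end of the group; a data allocation may spill into it
	 * only once the non-MC region is (nearly) exhausted. */
	if (blk >= sbi->s_nonmc_blocks_per_group &&
	    bgi->bgi_free_nonmc_blocks_count >= 8)
		return 0;	/* stay out of the metacluster */
	return 1;		/* OK to claim this block */

Indirect blocks work the same way in the opposite direction:
ext3_alloc_indirect_blocks() scans the bitmap backwards from the end of
the group, and any indirect blocks it cannot place there are allocated
along with the data blocks in the non-MC region.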

Putting the metacluster at the end of the block group gives slightly
inferior sequential read throughput compared to putting it in the
beginning or the middle, but the difference is very tiny and exists
only for large files that span multiple block groups.

Since there are a couple of ways in which this change differs from the
previous one, I repeated the testing and performance evaluation. The
new change passed fsx, fsstress, and bonnie - both with and without
metaclustering. Also, I checked that the block layout on disk comes
out to be what one would expect from the code.

Here are the performance numbers. The setup was somewhat different
from the previous setup so I've gotten fresh numbers for the vanilla
case as well.

Setup:
RAM: 8GB
Disk: 400GB disk.
CPU: Dual core hyperthreaded

All measurements were taken 10 times or more until the standard deviation
was <2%. The machine was rebooted between runs and the file system
freshly formatted; we also made sure that nothing else was running on
the machine at the time of the test.

Notation:
- 'vanilla': regular ext3 without any changes
- 'mc': metaclustering ext3 (new)

Benchmark 1: Sequential write to a 10GB file followed by 'sync'
1. vanilla:
 Total: 3m39.1s
 User: 0.08
 System: 51.9s
2. mc:
 Total: 3m11.5s
 User: 0.06s
 System: 53.6s

Benchmark 2: Sequential read from a 10GB file.
Description: the file is created on the same type of file system (mc or vanilla)
1. vanilla:
 Total: 3m6.5s
 User: 0.04s
 System: 13.4s
2. mc:
 Total: 3m7.0s
 User: 0.05s
 System: 13.1s

Benchmark 3: Random read from a 300GB file
Description: read in 512-byte chunks, 5MB in total
1. vanilla:
 Total: 3m57.0s
 User: ~0
 System: 0.8s
2. mc:
 Total: 3m56.4s
 User: ~0
 System: 0.9s

Benchmark 4: Random read from a 300GB file
Description: read in 512KB chunks, totaling 1% of the file size
1. vanilla:
 Total: 4m50.3s
 User: ~0
 System: 3.9s
2. mc:
 Total: 4m56.9s
 User: ~0
 System: 3.9s

Benchmark 5: fsck
Description: Prepare a newly formatted 400GB disk as follows: create
200 files of 0.5GB each, 100 files of 1GB each, 40 files of 2.5GB each,
and 10 files of 10GB each. fsck command line: fsck -f -n
1. vanilla:
 Total: 11m25.3s
 User: 13.4s
 System: 13.2s
2. mc:
 Total: 3m11.0s
 User: 13.1s
 System: 12.9s


Note: I'll report results from kernbench and compilebench shortly.

Observations:
Sequential write performance is much better with metaclustering than
with vanilla. To better understand this, I ran the same benchmark with
the new code but with the metaclustering option turned off and got the
same performance as vanilla, which makes me believe that there is
something about metaclustering that helps write performance, though I
don't have a very good handle on what that thing might be.

Thanks,
Abhishek

Signed-off-by: Abhishek Rai <abhishekrai@google.com>

diff -uprdN linux-2.6.23mm1-clean/fs/ext3/balloc.c linux-2.6.23mm1-ext3mc/fs/ext3/balloc.c
--- linux-2.6.23mm1-clean/fs/ext3/balloc.c      2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/balloc.c     2007-12-21 05:34:35.000000000 -0800
@@ -33,6 +33,29 @@
  * super block.  Each descriptor contains the number of the bitmap block and
  * the free blocks count in the block.  The descriptors are loaded in memory
  * when a file system is mounted (see ext3_fill_super).
+ *
+ * A note on ext3 metaclustering:
+ *
+ *     Start of                                                End of
+ *     block group                                             block group
+ *      ________________________________________________________________
+ *     |       NON-MC REGION                   |       MC REGION        |
+ *     |                                       |Overflow                |
+ *     |Data blocks and                        |data           Indirect |
+ *     |overflow indirect blocks               |blocks         blocks   |
+ *     |---------->                            |------->       <--------|
+ *     |________________________________________________________________|
+ *
+ *     Every block group has at its end a semi-reserved region called the
+ *     metacluster mostly used for allocating indirect blocks. Under normal
+ *     circumstances, the metacluster is used only for allocating indirect
+ *     blocks which are allocated in decreasing order of block numbers.
+ *     The non-metacluster region is used for data blocks, which are
+ *     allocated in increasing order of block numbers. However, when the MC
+ *     runs out of space, indirect blocks can be allocated in the non-MC
+ *     region along with the data blocks in the forward direction. Similarly,
+ *     when non-MC runs out of space, new data blocks are allocated in MC but
+ *     in the forward direction.
  */


@@ -147,6 +170,88 @@ error_out:
                        block_group, bitmap_blk);
        return NULL;
 }
+
+
+/*
+ * Count number of free blocks in a block group that don't lie in the
+ * metacluster region of the block group.
+ */
+static void
+ext3_init_grp_free_nonmc_blocks(struct super_block *sb,
+                               struct buffer_head *bitmap_bh,
+                               unsigned long block_group)
+{
+       struct ext3_sb_info *sbi = EXT3_SB(sb);
+       struct ext3_bg_info *bgi = &sbi->s_bginfo[block_group];
+
+       BUG_ON(!test_opt(sb, METACLUSTER));
+
+       spin_lock(sb_bgl_lock(sbi, block_group));
+       if (bgi->bgi_free_nonmc_blocks_count >= 0)
+               goto out;
+
+       bgi->bgi_free_nonmc_blocks_count =
+               ext3_count_free(bitmap_bh, sbi->s_nonmc_blocks_per_group/8);
+
+out:
+       spin_unlock(sb_bgl_lock(sbi, block_group));
+       BUG_ON(bgi->bgi_free_nonmc_blocks_count >
+               sbi->s_nonmc_blocks_per_group);
+}
+
+/*
+ * ext3_update_nonmc_block_count:
+ *     Update bgi_free_nonmc_blocks_count for block group 'group_no' following
+ *     an allocation or deallocation.
+ *
+ *     @group_no:      affected block group
+ *     @start:         start of the [de]allocated range
+ *     @count:         number of blocks [de]allocated
+ *     @allocation:    1 if blocks were allocated, 0 otherwise.
+ */
+static inline void
+ext3_update_nonmc_block_count(struct ext3_sb_info *sbi, unsigned long group_no,
+                               ext3_grpblk_t start, unsigned long count,
+                               int allocation)
+{
+       struct ext3_bg_info *bginfo = &sbi->s_bginfo[group_no];
+       ext3_grpblk_t change;
+
+       BUG_ON(bginfo->bgi_free_nonmc_blocks_count < 0);
+       BUG_ON(start >= sbi->s_nonmc_blocks_per_group);
+
+       change = min_t(ext3_grpblk_t, start + count,
+                       sbi->s_nonmc_blocks_per_group) - start;
+
+       spin_lock(sb_bgl_lock(sbi, group_no));
+       BUG_ON(bginfo->bgi_free_nonmc_blocks_count >
+               sbi->s_nonmc_blocks_per_group);
+       BUG_ON(allocation && bginfo->bgi_free_nonmc_blocks_count < change);
+
+       bginfo->bgi_free_nonmc_blocks_count += (allocation ? -change : change);
+
+       BUG_ON(bginfo->bgi_free_nonmc_blocks_count >
+               sbi->s_nonmc_blocks_per_group);
+       spin_unlock(sb_bgl_lock(sbi, group_no));
+}
+
+/*
+ * allow_mc_alloc:
+ *     Check if we can use metacluster region of a block group for general
+ *     allocation if needed. Ideally, we should allow this only if
+ *     bgi_free_nonmc_blocks_count == 0, but sometimes there is a small number
+ *     of blocks which don't get allocated in the first pass, no point
+ *     breaking our file at the metacluster boundary because of that, so we
+ *     relax the limit to 8.
+ */
+static inline int allow_mc_alloc(struct ext3_sb_info *sbi,
+                                       struct ext3_bg_info *bgi,
+                                       ext3_grpblk_t blk)
+{
+       return !(blk >= 0 && blk >= sbi->s_nonmc_blocks_per_group &&
+               bgi->bgi_free_nonmc_blocks_count >= 8);
+}
+
 /*
  * The reservation window structure operations
  * --------------------------------------------
@@ -463,6 +568,7 @@ void ext3_free_blocks_sb(handle_t *handl
        struct ext3_group_desc * desc;
        struct ext3_super_block * es;
        struct ext3_sb_info *sbi;
+       struct ext3_bg_info *bgi;
        int err = 0, ret;
        ext3_grpblk_t group_freed;

@@ -502,6 +608,13 @@ do_more:
        if (!desc)
                goto error_return;

+       if (test_opt(sb, METACLUSTER)) {
+               bgi = &sbi->s_bginfo[block_group];
+               if (bgi->bgi_free_nonmc_blocks_count < 0)
+                       ext3_init_grp_free_nonmc_blocks(sb, bitmap_bh,
+                                                       block_group);
+       }
+
        if (in_range (le32_to_cpu(desc->bg_block_bitmap), block, count) ||
            in_range (le32_to_cpu(desc->bg_inode_bitmap), block, count) ||
            in_range (block, le32_to_cpu(desc->bg_inode_table),
@@ -621,6 +734,9 @@ do_more:
        if (!err) err = ret;
        *pdquot_freed_blocks += group_freed;

+       if (test_opt(sb, METACLUSTER) && bit < sbi->s_nonmc_blocks_per_group)
+               ext3_update_nonmc_block_count(sbi, block_group, bit, count, 0);
+
        if (overflow && !err) {
                block += count;
                count = overflow;
@@ -726,6 +842,50 @@ bitmap_search_next_usable_block(ext3_grp
        return -1;
 }

+static ext3_grpblk_t
+bitmap_find_prev_zero_bit(char *map, ext3_grpblk_t start, ext3_grpblk_t lowest)
+{
+       ext3_grpblk_t k, blk;
+
+       k = start & ~7;
+       while (lowest <= k) {
+               if (map[k/8] != '\377' &&
+                       (blk = ext3_find_next_zero_bit(map, k + 8, k))
+                        < (k + 8))
+                               return blk;
+
+               k -= 8;
+       }
+       return -1;
+}
+
+static ext3_grpblk_t
+bitmap_search_prev_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
+                                       ext3_grpblk_t lowest)
+{
+       ext3_grpblk_t next;
+       struct journal_head *jh = bh2jh(bh);
+
+       /*
+        * The bitmap search --- search backward alternately through the actual
+        * bitmap and the last-committed copy until we find a bit free in
+        * both
+        */
+       while (start >= lowest) {
+               next = bitmap_find_prev_zero_bit(bh->b_data, start, lowest);
+               if (next < lowest)
+                       return -1;
+               if (ext3_test_allocatable(next, bh))
+                       return next;
+               jbd_lock_bh_state(bh);
+               if (jh->b_committed_data)
+                       start = bitmap_find_prev_zero_bit(jh->b_committed_data,
+                                                               next, lowest);
+               jbd_unlock_bh_state(bh);
+       }
+       return -1;
+}
+
 /**
  * find_next_usable_block()
  * @start:             the starting block (group relative) to find next
@@ -833,19 +993,27 @@ claim_block(spinlock_t *lock, ext3_grpbl
  *     file's own reservation window;
  *     Otherwise, the allocation range starts from the give goal block, ends at
  *     the block group's last block.
- *
- * If we failed to allocate the desired block then we may end up crossing to a
- * new bitmap.  In that case we must release write access to the old one via
- * ext3_journal_release_buffer(), else we'll run out of credits.
  */
 static ext3_grpblk_t
 ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
                        struct buffer_head *bitmap_bh, ext3_grpblk_t grp_goal,
                        unsigned long *count, struct ext3_reserve_window *my_rsv)
 {
+       struct ext3_sb_info *sbi = EXT3_SB(sb);
+       struct ext3_group_desc *gdp;
+       struct ext3_bg_info *bgi = NULL;
+       struct buffer_head *gdp_bh;
        ext3_fsblk_t group_first_block;
        ext3_grpblk_t start, end;
        unsigned long num = 0;
+       const int metaclustering = test_opt(sb, METACLUSTER);
+
+       if (metaclustering)
+               bgi = &sbi->s_bginfo[group];
+
+       gdp = ext3_get_group_desc(sb, group, &gdp_bh);
+       if (!gdp)
+               goto fail_access;

        /* we do allocation within the reservation window if we have a window */
        if (my_rsv) {
@@ -890,8 +1058,10 @@ repeat:
        }
        start = grp_goal;

-       if (!claim_block(sb_bgl_lock(EXT3_SB(sb), group),
-               grp_goal, bitmap_bh)) {
+       if (metaclustering && !allow_mc_alloc(sbi, bgi, grp_goal))
+               goto fail_access;
+
+       if (!claim_block(sb_bgl_lock(sbi, group), grp_goal, bitmap_bh)) {
                /*
                 * The block was allocated by another thread, or it was
                 * allocated and then freed by another thread
@@ -906,8 +1076,8 @@ repeat:
        grp_goal++;
        while (num < *count && grp_goal < end
                && ext3_test_allocatable(grp_goal, bitmap_bh)
-               && claim_block(sb_bgl_lock(EXT3_SB(sb), group),
-                               grp_goal, bitmap_bh)) {
+               && (!metaclustering || allow_mc_alloc(sbi, bgi, grp_goal))
+               && claim_block(sb_bgl_lock(sbi, group), grp_goal, bitmap_bh)) {
                num++;
                grp_goal++;
        }
@@ -1138,7 +1308,9 @@ static int alloc_new_reservation(struct

        /*
         * find_next_reservable_window() simply finds a reservable window
-        * inside the given range(start_block, group_end_block).
+        * inside the given range(start_block, group_end_block). The
+        * reservation window must have a reservable free bit inside it for our
+        * callers to work correctly.
         *
         * To make sure the reservation window has a free bit inside it, we
         * need to check the bitmap after we found a reservable window.
@@ -1170,10 +1342,17 @@ retry:
                        my_rsv->rsv_start - group_first_block,
                        bitmap_bh, group_end_block - group_first_block + 1);

-       if (first_free_block < 0) {
+       if (first_free_block < 0 ||
+               (test_opt(sb, METACLUSTER)
+                && !allow_mc_alloc(EXT3_SB(sb), &EXT3_SB(sb)->s_bginfo[group],
+                                       first_free_block))) {
                /*
-                * no free block left on the bitmap, no point
-                * to reserve the space. return failed.
+                * No free block left on the bitmap, no point to reserve space,
+                * return failed. We also fail here if metaclustering is enabled
+                * and the first free block in the window lies in the
+                * metacluster while there are free non-mc blocks in the block
+                * group, such a window or any window following it is not useful
+                * to us.
                 */
                spin_lock(rsv_lock);
                if (!rsv_is_empty(&my_rsv->rsv_window))
@@ -1276,25 +1455,17 @@ ext3_try_to_allocate_with_rsv(struct sup
                        unsigned int group, struct buffer_head *bitmap_bh,
                        ext3_grpblk_t grp_goal,
                        struct ext3_reserve_window_node * my_rsv,
-                       unsigned long *count, int *errp)
+                       unsigned long *count)
 {
+       struct ext3_bg_info *bgi;
        ext3_fsblk_t group_first_block, group_last_block;
        ext3_grpblk_t ret = 0;
-       int fatal;
        unsigned long num = *count;

-       *errp = 0;
-
-       /*
-        * Make sure we use undo access for the bitmap, because it is critical
-        * that we do the frozen_data COW on bitmap buffers in all cases even
-        * if the buffer is in BJ_Forget state in the committing transaction.
-        */
-       BUFFER_TRACE(bitmap_bh, "get undo access for new block");
-       fatal = ext3_journal_get_undo_access(handle, bitmap_bh);
-       if (fatal) {
-               *errp = fatal;
-               return -1;
+       if (test_opt(sb, METACLUSTER)) {
+               bgi = &EXT3_SB(sb)->s_bginfo[group];
+               if (bgi->bgi_free_nonmc_blocks_count < 0)
+                       ext3_init_grp_free_nonmc_blocks(sb, bitmap_bh, group);
        }

        /*
@@ -1370,19 +1541,6 @@ ext3_try_to_allocate_with_rsv(struct sup
                num = *count;
        }
 out:
-       if (ret >= 0) {
-               BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for "
-                                       "bitmap block");
-               fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
-               if (fatal) {
-                       *errp = fatal;
-                       return -1;
-               }
-               return ret;
-       }
-
-       BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
-       ext3_journal_release_buffer(handle, bitmap_bh);
        return ret;
 }

@@ -1428,22 +1586,149 @@ int ext3_should_retry_alloc(struct super
        return journal_force_commit_nested(EXT3_SB(sb)->s_journal);
 }

+/*
+ * ext3_alloc_indirect_blocks:
+ *     Helper function for ext3_new_blocks. Allocates indirect blocks from the
+ *     metacluster region only and stores their numbers in new_blocks[].
+ */
+int ext3_alloc_indirect_blocks(struct super_block *sb,
+                       struct buffer_head *bitmap_bh,
+                       struct ext3_group_desc *gdp,
+                       int group_no, unsigned long indirect_blks,
+                       ext3_fsblk_t new_blocks[])
+{
+       struct ext3_bg_info *bgi = &EXT3_SB(sb)->s_bginfo[group_no];
+       ext3_grpblk_t blk = EXT3_BLOCKS_PER_GROUP(sb) - 1;
+       ext3_grpblk_t mc_start = EXT3_SB(sb)->s_nonmc_blocks_per_group;
+       ext3_fsblk_t group_first_block;
+       int allocated = 0;
+
+       BUG_ON(!test_opt(sb, METACLUSTER));
+
+       /* This check is racy but that wouldn't harm us. */
+       if (bgi->bgi_free_nonmc_blocks_count >=
+               le16_to_cpu(gdp->bg_free_blocks_count))
+               return 0;
+
+       group_first_block = ext3_group_first_block_no(sb, group_no);
+       while (allocated < indirect_blks && blk >= mc_start) {
+               if (!ext3_test_allocatable(blk, bitmap_bh)) {
+                       blk = bitmap_search_prev_usable_block(blk, bitmap_bh,
+                                                               mc_start);
+                       continue;
+               }
+               if (claim_block(sb_bgl_lock(EXT3_SB(sb), group_no), blk,
+                               bitmap_bh)) {
+                       new_blocks[allocated++] = group_first_block + blk;
+               } else {
+                       /*
+                        * The block was allocated by another thread, or it
+                        * was allocated and then freed by another thread
+                        */
+                       cpu_relax();
+               }
+               if (allocated < indirect_blks)
+                       blk = bitmap_search_prev_usable_block(blk, bitmap_bh,
+                                                               mc_start);
+       }
+       return allocated;
+}
+
+/*
+ * check_allocated_blocks:
+ *     Helper function for ext3_new_blocks. Checks newly allocated block
+ *     numbers.
+ */
+int check_allocated_blocks(ext3_fsblk_t blk, unsigned long num,
+                               struct super_block *sb, int group_no,
+                               struct ext3_group_desc *gdp,
+                               struct buffer_head *bitmap_bh)
+{
+       struct ext3_super_block *es = EXT3_SB(sb)->s_es;
+       struct ext3_sb_info *sbi = EXT3_SB(sb);
+       ext3_fsblk_t grp_blk = blk - ext3_group_first_block_no(sb, group_no);
+
+       if (in_range(le32_to_cpu(gdp->bg_block_bitmap), blk, num) ||
+               in_range(le32_to_cpu(gdp->bg_inode_bitmap), blk, num) ||
+               in_range(blk, le32_to_cpu(gdp->bg_inode_table),
+                               EXT3_SB(sb)->s_itb_per_group) ||
+               in_range(blk + num - 1, le32_to_cpu(gdp->bg_inode_table),
+                               EXT3_SB(sb)->s_itb_per_group))
+               ext3_error(sb, "ext3_new_blocks",
+                               "Allocating block in system zone - "
+                               "blocks from "E3FSBLK", length %lu",
+                               blk, num);
+
+#ifdef CONFIG_JBD_DEBUG
+       {
+               struct buffer_head *debug_bh;
+
+               /* Record bitmap buffer state in the newly allocated block */
+               debug_bh = sb_find_get_block(sb, blk);
+               if (debug_bh) {
+                       BUFFER_TRACE(debug_bh, "state when allocated");
+                       BUFFER_TRACE2(debug_bh, bitmap_bh, "bitmap state");
+                       brelse(debug_bh);
+               }
+       }
+       jbd_lock_bh_state(bitmap_bh);
+       spin_lock(sb_bgl_lock(sbi, group_no));
+       if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) {
+               int i;
+
+               for (i = 0; i < num; i++) {
+                       if (ext3_test_bit(grp_blk+i,
+                                       bh2jh(bitmap_bh)->b_committed_data))
+                               printk(KERN_ERR "%s: block was unexpectedly set"
+                                       " in b_committed_data\n", __FUNCTION__);
+               }
+       }
+       ext3_debug("found bit %d\n", grp_blk);
+       spin_unlock(sb_bgl_lock(sbi, group_no));
+       jbd_unlock_bh_state(bitmap_bh);
+#endif
+
+       if (blk + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
+               ext3_error(sb, "ext3_new_blocks",
+                               "block("E3FSBLK") >= blocks count(%d) - "
+                               "block_group = %d, es == %p ", blk,
+                               le32_to_cpu(es->s_blocks_count), group_no, es);
+               return 1;
+       }
+
+       return 0;
+}
+
 /**
- * ext3_new_blocks() -- core block(s) allocation function
- * @handle:            handle to this transaction
- * @inode:             file inode
- * @goal:              given target block(filesystem wide)
- * @count:             target number of blocks to allocate
- * @errp:              error code
+ * ext3_new_blocks - allocate indirect blocks and direct blocks.
+ *     @handle:        handle to this transaction
+ *     @inode:         file inode
+ *     @goal:          given target block(filesystem wide)
+ *     @indirect_blks  number of indirect blocks to allocate
+ *     @blks           number of direct blocks to allocate
+ *     @new_blocks     this will store the block numbers of indirect blocks
+ *                     and direct blocks upon return.
  *
- * ext3_new_blocks uses a goal block to assist allocation.  It tries to
- * allocate block(s) from the block group contains the goal block first. If that
- * fails, it will try to allocate block(s) from other block groups without
- * any specific goal block.
+ *     returns the number of direct blocks allocated. Fewer than requested
+ *     number of direct blocks may be allocated but all requested indirect
+ *     blocks must be allocated in order to return success.
  *
+ *     Without metaclustering, ext3_new_block allocates all blocks using a
+ *     goal block to assist allocation.  It tries to allocate block(s) from
+ *     the block group containing the goal block first. If that fails, it will
+ *     try to allocate block(s) from other block groups without any specific
+ *     goal block.
+ *
+ *     With metaclustering, the only difference is that indirect block
+ *     allocation is first attempted in the metacluster region of the same
+ *     block group failing which they are allocated along with direct blocks.
+ *
+ *     This function also updates quota and i_blocks field.
  */
-ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
-                       ext3_fsblk_t goal, unsigned long *count, int *errp)
+int ext3_new_blocks(handle_t *handle, struct inode *inode,
+                       ext3_fsblk_t goal, int indirect_blks, int blks,
+                       ext3_fsblk_t new_blocks[4], int *errp)
+
 {
        struct buffer_head *bitmap_bh = NULL;
        struct buffer_head *gdp_bh;
@@ -1452,10 +1737,16 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
        ext3_grpblk_t grp_target_blk;   /* blockgroup relative goal block */
        ext3_grpblk_t grp_alloc_blk;    /* blockgroup-relative allocated block*/
        ext3_fsblk_t ret_block;         /* filesyetem-wide allocated block */
+       ext3_fsblk_t group_first_block; /* first block in the group */
        int bgi;                        /* blockgroup iteration index */
        int fatal = 0, err;
        int performed_allocation = 0;
        ext3_grpblk_t free_blocks;      /* number of free blocks in a group */
+       unsigned long ngroups;
+       unsigned long grp_mc_alloc;/* blocks allocated from mc in a group */
+       unsigned long grp_alloc;   /* blocks allocated outside mc in a group */
+       int indirect_blks_done = 0;/* total ind blocks allocated so far */
+       int blks_done = 0;         /* total direct blocks allocated */
        struct super_block *sb;
        struct ext3_group_desc *gdp;
        struct ext3_super_block *es;
@@ -1463,23 +1754,23 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
        struct ext3_reserve_window_node *my_rsv = NULL;
        struct ext3_block_alloc_info *block_i;
        unsigned short windowsz = 0;
+       int i;
 #ifdef EXT3FS_DEBUG
        static int goal_hits, goal_attempts;
 #endif
-       unsigned long ngroups;
-       unsigned long num = *count;

        *errp = -ENOSPC;
        sb = inode->i_sb;
        if (!sb) {
-               printk("ext3_new_block: nonexistent device");
+               printk(KERN_INFO "ext3_new_blocks: nonexistent device");
+               *errp = -ENODEV;
                return 0;
        }

        /*
         * Check quota for allocation of this block.
         */
-       if (DQUOT_ALLOC_BLOCK(inode, num)) {
+       if (DQUOT_ALLOC_BLOCK(inode, indirect_blks + blks)) {
                *errp = -EDQUOT;
                return 0;
        }
@@ -1513,73 +1804,194 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
        group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
                        EXT3_BLOCKS_PER_GROUP(sb);
        goal_group = group_no;
-retry_alloc:
-       gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
-       if (!gdp)
-               goto io_error;
-
-       free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
-       /*
-        * if there is not enough free blocks to make a new resevation
-        * turn off reservation for this allocation
-        */
-       if (my_rsv && (free_blocks < windowsz)
-               && (rsv_is_empty(&my_rsv->rsv_window)))
-               my_rsv = NULL;
-
-       if (free_blocks > 0) {
-               grp_target_blk = ((goal - le32_to_cpu(es->s_first_data_block)) %
-                               EXT3_BLOCKS_PER_GROUP(sb));
-               bitmap_bh = read_block_bitmap(sb, group_no);
-               if (!bitmap_bh)
-                       goto io_error;
-               grp_alloc_blk = ext3_try_to_allocate_with_rsv(sb, handle,
-                                       group_no, bitmap_bh, grp_target_blk,
-                                       my_rsv, &num, &fatal);
-               if (fatal)
-                       goto out;
-               if (grp_alloc_blk >= 0)
-                       goto allocated;
-       }

+retry_alloc:
+       grp_target_blk = ((goal - le32_to_cpu(es->s_first_data_block)) %
+                       EXT3_BLOCKS_PER_GROUP(sb));
        ngroups = EXT3_SB(sb)->s_groups_count;
        smp_rmb();

        /*
-        * Now search the rest of the groups.  We assume that
-        * i and gdp correctly point to the last group visited.
+        * Iterate over successive block groups for allocating (any) indirect
+        * blocks and direct blocks until at least one direct block has been
+        * allocated. If metaclustering is enabled, we try allocating indirect
+        * blocks first in the metacluster region and then in the general
+        * region and if that fails too, we repeat the same algorithm in the
+        * next block group and so on. This not only keeps the indirect blocks
+        * together in the metacluster, but also keeps them in close proximity
+        * to their corresponding direct blocks.
+        *
+        * The search begins and ends at the goal group, though the second time
+        * we are at the goal group we try allocating without a goal.
         */
-       for (bgi = 0; bgi < ngroups; bgi++) {
-               group_no++;
+       bgi = 0;
+       while (bgi < ngroups + 1) {
+               grp_mc_alloc = 0;
+
                if (group_no >= ngroups)
                        group_no = 0;
+
                gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
                if (!gdp)
                        goto io_error;
+
                free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
-               /*
-                * skip this group if the number of
-                * free blocks is less than half of the reservation
-                * window size.
-                */
-               if (free_blocks <= (windowsz/2))
-                       continue;
+               if (group_no == goal_group) {
+                       if (my_rsv && (free_blocks < windowsz)
+                               && (rsv_is_empty(&my_rsv->rsv_window)))
+                               my_rsv = NULL;
+                       if (free_blocks <= 0)
+                               goto next;
+               } else if (free_blocks <= windowsz/2)
+                       goto next;

-               brelse(bitmap_bh);
                bitmap_bh = read_block_bitmap(sb, group_no);
                if (!bitmap_bh)
                        goto io_error;
+
                /*
-                * try to allocate block(s) from this group, without a goal(-1).
+                * Make sure we use undo access for the bitmap, because it is
+                * critical that we do the frozen_data COW on bitmap buffers in
+                * all cases even if the buffer is in BJ_Forget state in the
+                * committing transaction.
+                */
+               BUFFER_TRACE(bitmap_bh, "get undo access for new block");
+               fatal = ext3_journal_get_undo_access(handle, bitmap_bh);
+               if (fatal)
+                       goto out;
+
+               /*
+                * If metaclustering is enabled, first try to allocate indirect
+                * blocks in the metacluster.
                 */
+               if (test_opt(sb, METACLUSTER) &&
+                       indirect_blks_done < indirect_blks)
+                       grp_mc_alloc = ext3_alloc_indirect_blocks(sb,
+                                       bitmap_bh, gdp, group_no,
+                                       indirect_blks - indirect_blks_done,
+                                       new_blocks + indirect_blks_done);
+
+               /* Allocate data blocks and any leftover indirect blocks. */
+               grp_alloc = indirect_blks + blks
+                               - (indirect_blks_done + grp_mc_alloc);
                grp_alloc_blk = ext3_try_to_allocate_with_rsv(sb, handle,
-                                       group_no, bitmap_bh, -1, my_rsv,
-                                       &num, &fatal);
+                                       group_no, bitmap_bh, grp_target_blk,
+                                       my_rsv, &grp_alloc);
+               if (grp_alloc_blk < 0)
+                       grp_alloc = 0;
+
+               /*
+                * If we couldn't allocate anything, there is nothing more to
+                * do with this block group, so move over to the next. But
+                * before that we must release write access to the old one via
+                * ext3_journal_release_buffer(), else we'll run out of credits.
+                */
+               if (grp_mc_alloc == 0 && grp_alloc == 0) {
+                       BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
+                       ext3_journal_release_buffer(handle, bitmap_bh);
+                       goto next;
+               }
+
+               BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for "
+                                       "bitmap block");
+               fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
                if (fatal)
                        goto out;
-               if (grp_alloc_blk >= 0)
+
+               ext3_debug("using block group %d(%d)\n",
+                               group_no, gdp->bg_free_blocks_count);
+
+               BUFFER_TRACE(gdp_bh, "get_write_access");
+               fatal = ext3_journal_get_write_access(handle, gdp_bh);
+               if (fatal)
+                       goto out;
+
+               /* Should this be called before ext3_journal_dirty_metadata? */
+               for (i = 0; i < grp_mc_alloc; i++) {
+                       if (check_allocated_blocks(
+                               new_blocks[indirect_blks_done + i], 1, sb,
+                               group_no, gdp, bitmap_bh))
+                               goto out;
+               }
+               if (grp_alloc > 0) {
+                       ret_block = ext3_group_first_block_no(sb, group_no) +
+                               grp_alloc_blk;
+                       if (check_allocated_blocks(ret_block, grp_alloc, sb,
+                                               group_no, gdp, bitmap_bh))
+                               goto out;
+               }
+
+               indirect_blks_done += grp_mc_alloc;
+               performed_allocation = 1;
+
+               /* The caller will add the new buffer to the journal. */
+               if (grp_alloc > 0)
+                       ext3_debug("allocating block %lu. "
+                                       "Goal hits %d of %d.\n",
+                                       ret_block, goal_hits, goal_attempts);
+
+               spin_lock(sb_bgl_lock(sbi, group_no));
+               gdp->bg_free_blocks_count =
+                       cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) -
+                                       (grp_mc_alloc + grp_alloc));
+               spin_unlock(sb_bgl_lock(sbi, group_no));
+               percpu_counter_sub(&sbi->s_freeblocks_counter,
+                               (grp_mc_alloc + grp_alloc));
+
+               BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for "
+                               "group descriptor");
+               err = ext3_journal_dirty_metadata(handle, gdp_bh);
+               if (!fatal)
+                       fatal = err;
+
+               sb->s_dirt = 1;
+               if (fatal)
+                       goto out;
+
+               brelse(bitmap_bh);
+               bitmap_bh = NULL;
+
+               if (grp_alloc == 0)
+                       goto next;
+
+               /* Update block group non-mc block count since we used some. */
+               if (test_opt(sb, METACLUSTER) &&
+                       grp_alloc_blk < sbi->s_nonmc_blocks_per_group)
+                       ext3_update_nonmc_block_count(sbi, group_no,
+                               grp_alloc_blk, grp_alloc, 1);
+
+               /*
+                * Assign all the non-mc blocks that we allocated from this
+                * block group.
+                */
+               group_first_block = ext3_group_first_block_no(sb, group_no);
+               while (grp_alloc > 0 && indirect_blks_done < indirect_blks) {
+                       new_blocks[indirect_blks_done++] =
+                               group_first_block + grp_alloc_blk;
+                       grp_alloc_blk++;
+                       grp_alloc--;
+               }
+
+               if (grp_alloc > 0) {
+                       blks_done = grp_alloc;
+                       new_blocks[indirect_blks_done] =
+                               group_first_block + grp_alloc_blk;
                        goto allocated;
+               }
+
+               /*
+                * If we allocated something but not the minimum required,
+                * it's OK to retry in this group as it might have more free
+                * blocks.
+                */
+               continue;
+
+next:
+               bgi++;
+               group_no++;
+               grp_target_blk = -1;
        }
+
        /*
         * We may end up a bogus ealier ENOSPC error due to
         * filesystem is "full" of reservations, but
@@ -1598,96 +2010,11 @@ retry_alloc:
        goto out;

 allocated:
-
-       ext3_debug("using block group %d(%d)\n",
-                       group_no, gdp->bg_free_blocks_count);
-
-       BUFFER_TRACE(gdp_bh, "get_write_access");
-       fatal = ext3_journal_get_write_access(handle, gdp_bh);
-       if (fatal)
-               goto out;
-
-       ret_block = grp_alloc_blk + ext3_group_first_block_no(sb, group_no);
-
-       if (in_range(le32_to_cpu(gdp->bg_block_bitmap), ret_block, num) ||
-           in_range(le32_to_cpu(gdp->bg_inode_bitmap), ret_block, num) ||
-           in_range(ret_block, le32_to_cpu(gdp->bg_inode_table),
-                     EXT3_SB(sb)->s_itb_per_group) ||
-           in_range(ret_block + num - 1, le32_to_cpu(gdp->bg_inode_table),
-                     EXT3_SB(sb)->s_itb_per_group))
-               ext3_error(sb, "ext3_new_block",
-                           "Allocating block in system zone - "
-                           "blocks from "E3FSBLK", length %lu",
-                            ret_block, num);
-
-       performed_allocation = 1;
-
-#ifdef CONFIG_JBD_DEBUG
-       {
-               struct buffer_head *debug_bh;
-
-               /* Record bitmap buffer state in the newly allocated block */
-               debug_bh = sb_find_get_block(sb, ret_block);
-               if (debug_bh) {
-                       BUFFER_TRACE(debug_bh, "state when allocated");
-                       BUFFER_TRACE2(debug_bh, bitmap_bh, "bitmap state");
-                       brelse(debug_bh);
-               }
-       }
-       jbd_lock_bh_state(bitmap_bh);
-       spin_lock(sb_bgl_lock(sbi, group_no));
-       if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) {
-               int i;
-
-               for (i = 0; i < num; i++) {
-                       if (ext3_test_bit(grp_alloc_blk+i,
-                                       bh2jh(bitmap_bh)->b_committed_data)) {
-                               printk("%s: block was unexpectedly set in "
-                                       "b_committed_data\n", __FUNCTION__);
-                       }
-               }
-       }
-       ext3_debug("found bit %d\n", grp_alloc_blk);
-       spin_unlock(sb_bgl_lock(sbi, group_no));
-       jbd_unlock_bh_state(bitmap_bh);
-#endif
-
-       if (ret_block + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
-               ext3_error(sb, "ext3_new_block",
-                           "block("E3FSBLK") >= blocks count(%d) - "
-                           "block_group = %d, es == %p ", ret_block,
-                       le32_to_cpu(es->s_blocks_count), group_no, es);
-               goto out;
-       }
-
-       /*
-        * It is up to the caller to add the new buffer to a journal
-        * list of some description.  We don't know in advance whether
-        * the caller wants to use it as metadata or data.
-        */
-       ext3_debug("allocating block %lu. Goal hits %d of %d.\n",
-                       ret_block, goal_hits, goal_attempts);
-
-       spin_lock(sb_bgl_lock(sbi, group_no));
-       gdp->bg_free_blocks_count =
-                       cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
-       spin_unlock(sb_bgl_lock(sbi, group_no));
-       percpu_counter_sub(&sbi->s_freeblocks_counter, num);
-
-       BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
-       err = ext3_journal_dirty_metadata(handle, gdp_bh);
-       if (!fatal)
-               fatal = err;
-
-       sb->s_dirt = 1;
-       if (fatal)
-               goto out;
-
        *errp = 0;
-       brelse(bitmap_bh);
-       DQUOT_FREE_BLOCK(inode, *count-num);
-       *count = num;
-       return ret_block;
+       DQUOT_FREE_BLOCK(inode,
+                       indirect_blks + blks - indirect_blks_done - blks_done);
+
+       return blks_done;

 io_error:
        *errp = -EIO;
@@ -1700,7 +2027,13 @@ out:
         * Undo the block allocation
         */
        if (!performed_allocation)
-               DQUOT_FREE_BLOCK(inode, *count);
+               DQUOT_FREE_BLOCK(inode, indirect_blks + blks);
+       /*
+        * Free any indirect blocks we allocated already. If the transaction
+        * has been aborted this is essentially a no-op.
+        */
+       for (i = 0; i < indirect_blks_done; i++)
+               ext3_free_blocks(handle, inode, new_blocks[i], 1);
        brelse(bitmap_bh);
        return 0;
 }
@@ -1708,9 +2041,13 @@ out:
 ext3_fsblk_t ext3_new_block(handle_t *handle, struct inode *inode,
                        ext3_fsblk_t goal, int *errp)
 {
-       unsigned long count = 1;
+       ext3_fsblk_t new_blocks[4];

-       return ext3_new_blocks(handle, inode, goal, &count, errp);
+       ext3_new_blocks(handle, inode, goal, 0, 1, new_blocks, errp);
+       if (*errp)
+               return 0;
+
+       return new_blocks[0];
 }

 /**
diff -uprdN linux-2.6.23mm1-clean/fs/ext3/bitmap.c linux-2.6.23mm1-ext3mc/fs/ext3/bitmap.c
--- linux-2.6.23mm1-clean/fs/ext3/bitmap.c      2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/bitmap.c     2007-12-20 18:12:17.000000000 -0800
@@ -11,8 +11,6 @@
 #include <linux/jbd.h>
 #include <linux/ext3_fs.h>

-#ifdef EXT3FS_DEBUG
-
 static const int nibblemap[] = {4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0};

 unsigned long ext3_count_free (struct buffer_head * map, unsigned int numchars)
@@ -27,6 +25,3 @@ unsigned long ext3_count_free (struct bu
                        nibblemap[(map->b_data[i] >> 4) & 0xf];
        return (sum);
 }
-
-#endif  /*  EXT3FS_DEBUG  */
-
diff -uprdN linux-2.6.23mm1-clean/fs/ext3/inode.c linux-2.6.23mm1-ext3mc/fs/ext3/inode.c
--- linux-2.6.23mm1-clean/fs/ext3/inode.c       2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/inode.c      2007-12-21 05:38:41.000000000 -0800
@@ -36,10 +36,33 @@
 #include <linux/mpage.h>
 #include <linux/uio.h>
 #include <linux/bio.h>
+#include <linux/sort.h>
 #include "xattr.h"
 #include "acl.h"

+typedef struct {
+       __le32  *p;
+       __le32  key;
+       struct buffer_head *bh;
+} Indirect;
+
+struct ext3_ind_read_info {
+       int                     count;
+       int                     seq_prefetch;
+       long                    size;
+       struct buffer_head      *bh[0];
+};
+
+# define EXT3_IND_READ_INFO_SIZE(_c)        \
+       (sizeof(struct ext3_ind_read_info) + \
+        sizeof(struct buffer_head *) * (_c))
+
+# define EXT3_IND_READ_MAX             (32)
+
 static int ext3_writepage_trans_blocks(struct inode *inode);
+static Indirect *ext3_read_indblocks(struct inode *inode, int iblock,
+                                    int depth, int offsets[4],
+                                    Indirect chain[4], int *err);

 /*
  * Test whether an inode is a fast symlink.
@@ -233,12 +256,6 @@ no_delete:
        clear_inode(inode);     /* We must guarantee clearing of inode... */
 }

-typedef struct {
-       __le32  *p;
-       __le32  key;
-       struct buffer_head *bh;
-} Indirect;
-
 static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
 {
        p->key = *(p->p = v);
@@ -352,18 +369,21 @@ static int ext3_block_to_path(struct ino
  *     the whole chain, all way to the data (returns %NULL, *err == 0).
  */
 static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
-                                Indirect chain[4], int *err)
+                                Indirect chain[4], int ind_readahead, int *err)
 {
        struct super_block *sb = inode->i_sb;
        Indirect *p = chain;
        struct buffer_head *bh;
+       int index;

        *err = 0;
        /* i_data is not going away, no lock needed */
        add_chain (chain, NULL, EXT3_I(inode)->i_data + *offsets);
        if (!p->key)
                goto no_block;
-       while (--depth) {
+       for (index = 0; index < depth - 1; index++) {
+               if (ind_readahead && depth > 2 && index == depth - 2)
+                       break;
                bh = sb_bread(sb, le32_to_cpu(p->key));
                if (!bh)
                        goto failure;
@@ -396,7 +416,11 @@ no_block:
  *     It is used when heuristic for sequential allocation fails.
  *     Rules are:
  *       + if there is a block to the left of our position - allocate near it.
- *       + if pointer will live in indirect block - allocate near that block.
+ *       + If the METACLUSTER option is not specified, allocate the data
+ *       block close to the metadata block. Otherwise, if pointer will live in
+ *       indirect block, we cannot allocate near the indirect block since
+ *       indirect blocks are allocated in the metacluster, just put in the same
+ *       cylinder group as the inode.
  *       + if pointer will live in inode - allocate in the same
  *         cylinder group.
  *
@@ -421,9 +445,11 @@ static ext3_fsblk_t ext3_find_near(struc
                        return le32_to_cpu(*p);
        }

-       /* No such thing, so let's try location of indirect block */
-       if (ind->bh)
-               return ind->bh->b_blocknr;
+       if (!test_opt(inode->i_sb, METACLUSTER)) {
+               /* No such thing, so let's try location of indirect block */
+               if (ind->bh)
+                       return ind->bh->b_blocknr;
+       }

        /*
         * It is going to be referred to from the inode itself? OK, just put it
@@ -475,8 +501,7 @@ static ext3_fsblk_t ext3_find_goal(struc
  *     @blks: number of data blocks to be mapped.
  *     @blocks_to_boundary:  the offset in the indirect block
  *
- *     return the total number of blocks to be allocate, including the
- *     direct and indirect blocks.
+ *     return the total number of direct blocks to be allocated.
  */
 static int ext3_blks_to_allocate(Indirect *branch, int k, unsigned long blks,
                int blocks_to_boundary)
@@ -505,75 +530,18 @@ static int ext3_blks_to_allocate(Indirec
 }

 /**
- *     ext3_alloc_blocks: multiple allocate blocks needed for a branch
- *     @indirect_blks: the number of blocks need to allocate for indirect
- *                     blocks
- *
- *     @new_blocks: on return it will store the new block numbers for
- *     the indirect blocks(if needed) and the first direct block,
- *     @blks:  on return it will store the total number of allocated
- *             direct blocks
- */
-static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
-                       ext3_fsblk_t goal, int indirect_blks, int blks,
-                       ext3_fsblk_t new_blocks[4], int *err)
-{
-       int target, i;
-       unsigned long count = 0;
-       int index = 0;
-       ext3_fsblk_t current_block = 0;
-       int ret = 0;
-
-       /*
-        * Here we try to allocate the requested multiple blocks at once,
-        * on a best-effort basis.
-        * To build a branch, we should allocate blocks for
-        * the indirect blocks(if not allocated yet), and at least
-        * the first direct block of this branch.  That's the
-        * minimum number of blocks need to allocate(required)
-        */
-       target = blks + indirect_blks;
-
-       while (1) {
-               count = target;
-               /* allocating blocks for indirect blocks and direct blocks */
-               current_block = ext3_new_blocks(handle,inode,goal,&count,err);
-               if (*err)
-                       goto failed_out;
-
-               target -= count;
-               /* allocate blocks for indirect blocks */
-               while (index < indirect_blks && count) {
-                       new_blocks[index++] = current_block++;
-                       count--;
-               }
-
-               if (count > 0)
-                       break;
-       }
-
-       /* save the new block number for the first direct block */
-       new_blocks[index] = current_block;
-
-       /* total number of blocks allocated for direct blocks */
-       ret = count;
-       *err = 0;
-       return ret;
-failed_out:
-       for (i = 0; i <index; i++)
-               ext3_free_blocks(handle, inode, new_blocks[i], 1);
-       return ret;
-}
-
-/**
  *     ext3_alloc_branch - allocate and set up a chain of blocks.
  *     @inode: owner
  *     @indirect_blks: number of allocated indirect blocks
  *     @blks: number of allocated direct blocks
+ *     @goal: goal for allocation
  *     @offsets: offsets (in the blocks) to store the pointers to next.
  *     @branch: place to store the chain in.
  *
- *     This function allocates blocks, zeroes out all but the last one,
+ *     returns an error code; the number of direct blocks allocated is
+ *     returned via *blks
+ *
+ *     This function allocates indirect_blks + *blks blocks, zeroes out all
+ *     indirect blocks,
  *     links them into chain and (if we are synchronous) writes them to disk.
  *     In other words, it prepares a branch that can be spliced onto the
  *     inode. It stores the information about that chain in the branch[], in
@@ -602,7 +570,7 @@ static int ext3_alloc_branch(handle_t *h
        ext3_fsblk_t new_blocks[4];
        ext3_fsblk_t current_block;

-       num = ext3_alloc_blocks(handle, inode, goal, indirect_blks,
+       num = ext3_new_blocks(handle, inode, goal, indirect_blks,
                                *blks, new_blocks, &err);
        if (err)
                return err;
@@ -799,17 +767,21 @@ int ext3_get_blocks_handle(handle_t *han
        int blocks_to_boundary = 0;
        int depth;
        struct ext3_inode_info *ei = EXT3_I(inode);
-       int count = 0;
+       int count = 0, ind_readahead;
        ext3_fsblk_t first_block = 0;

-
        J_ASSERT(handle != NULL || create == 0);
        depth = ext3_block_to_path(inode,iblock,offsets,&blocks_to_boundary);

        if (depth == 0)
                goto out;

-       partial = ext3_get_branch(inode, depth, offsets, chain, &err);
+       ind_readahead = !create && depth > 2;
+       partial = ext3_get_branch(inode, depth, offsets, chain,
+                                 ind_readahead, &err);
+       if (!partial && ind_readahead)
+               partial = ext3_read_indblocks(inode, iblock, depth,
+                                             offsets, chain, &err);

        /* Simplest case - block found, no allocation needed */
        if (!partial) {
@@ -844,7 +816,7 @@ int ext3_get_blocks_handle(handle_t *han
        }

        /* Next simple case - plain lookup or failed read of indirect block */
-       if (!create || err == -EIO)
+       if (!create || (err && err != -EAGAIN))
                goto cleanup;

        mutex_lock(&ei->truncate_mutex);
@@ -866,7 +838,8 @@ int ext3_get_blocks_handle(handle_t *han
                        brelse(partial->bh);
                        partial--;
                }
-               partial = ext3_get_branch(inode, depth, offsets, chain, &err);
+               partial = ext3_get_branch(inode, depth, offsets, chain, 0,
+                                       &err);
                if (!partial) {
                        count++;
                        mutex_unlock(&ei->truncate_mutex);
@@ -1974,7 +1947,7 @@ static Indirect *ext3_find_shared(struct
        /* Make k index the deepest non-null offest + 1 */
        for (k = depth; k > 1 && !offsets[k-1]; k--)
                ;
-       partial = ext3_get_branch(inode, k, offsets, chain, &err);
+       partial = ext3_get_branch(inode, k, offsets, chain, 0, &err);
        /* Writer: pointers */
        if (!partial)
                partial = chain + k-1;
@@ -3297,3 +3270,560 @@ int ext3_change_inode_journal_flag(struc

        return err;
 }
+
+/*
+ * ext3_ind_read_end_bio --
+ *
+ *     bio callback for read IO issued from ext3_read_indblocks.
+ *     May be called multiple times until the whole I/O completes at
+ *     which point bio->bi_size = 0 and it frees read_info and bio.
+ *     The first time it is called, first_bh is unlocked so that any sync
+ *     waiter can unblock.
+ */
+static void ext3_ind_read_end_bio(struct bio *bio, int err)
+{
+       struct ext3_ind_read_info *read_info = bio->bi_private;
+       struct buffer_head *bh;
+       int uptodate = !err && test_bit(BIO_UPTODATE, &bio->bi_flags);
+       int i;
+
+       if (err == -EOPNOTSUPP)
+               set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+
+       /* Wait for all buffers to finish - is this needed? */
+       if (bio->bi_size)
+               return;
+
+       for (i = 0; i < read_info->count; i++) {
+               bh = read_info->bh[i];
+
+               if (err == -EOPNOTSUPP)
+                       set_bit(BH_Eopnotsupp, &bh->b_state);
+
+               if (uptodate) {
+                       BUG_ON(buffer_uptodate(bh));
+                       BUG_ON(ext3_buffer_prefetch(bh));
+                       set_buffer_uptodate(bh);
+                       if (read_info->seq_prefetch)
+                               ext3_set_buffer_prefetch(bh);
+               }
+
+               unlock_buffer(bh);
+               brelse(bh);
+       }
+
+       kfree(read_info);
+       bio_put(bio);
+}
+
+/*
+ * ext3_get_max_read --
+ *     @inode: inode of file.
+ *     @block: block number in file (starting from zero).
+ *     @offset_in_dind_block: offset of the indirect block inside its
+ *     parent doubly-indirect block.
+ *
+ *      Compute the maximum no. of indirect blocks that can be read
+ *      satisfying following constraints:
+ *              - Don't read indirect blocks beyond the end of current
+ *              doubly-indirect block.
+ *              - Don't read beyond eof.
+ */
+static inline unsigned long ext3_get_max_read(const struct inode *inode,
+                                                 int block,
+                                                 int offset_in_dind_block)
+{
+       const struct super_block *sb = inode->i_sb;
+       unsigned long max_read;
+       unsigned long ptrs = EXT3_ADDR_PER_BLOCK(inode->i_sb);
+       unsigned long ptrs_bits = EXT3_ADDR_PER_BLOCK_BITS(inode->i_sb);
+       unsigned long blocks_in_file =
+               (inode->i_size + sb->s_blocksize - 1) >> sb->s_blocksize_bits;
+       unsigned long remaining_ind_blks_in_dind =
+               (ptrs >= offset_in_dind_block) ? (ptrs - offset_in_dind_block)
+                                              : 0;
+       unsigned long remaining_ind_blks_before_eof =
+               ((blocks_in_file - EXT3_NDIR_BLOCKS + ptrs - 1) >> ptrs_bits) -
+               ((block - EXT3_NDIR_BLOCKS) >> ptrs_bits);
+
+       BUG_ON(block >= blocks_in_file);
+
+       max_read = min_t(unsigned long, remaining_ind_blks_in_dind,
+                        remaining_ind_blks_before_eof);
+
+       BUG_ON(max_read < 1);
+
+       return max_read;
+}
+
+static void ext3_read_indblocks_submit(struct bio **pbio,
+                                       struct ext3_ind_read_info **pread_info,
+                                       int *read_cnt, int seq_prefetch)
+{
+       struct bio *bio = *pbio;
+       struct ext3_ind_read_info *read_info = *pread_info;
+
+       BUG_ON(*read_cnt < 1);
+
+       read_info->seq_prefetch = seq_prefetch;
+       read_info->count = *read_cnt;
+       read_info->size = bio->bi_size;
+       bio->bi_private = read_info;
+       bio->bi_end_io = ext3_ind_read_end_bio;
+       submit_bio(READ, bio);
+
+       *pbio = NULL;
+       *pread_info = NULL;
+       *read_cnt = 0;
+}
+
+struct ind_block_info {
+       ext3_fsblk_t            blockno;
+       struct buffer_head      *bh;
+};
+
+static int ind_info_cmp(const void *a, const void *b)
+{
+       struct ind_block_info *info_a = (struct ind_block_info *)a;
+       struct ind_block_info *info_b = (struct ind_block_info *)b;
+
+       return info_a->blockno - info_b->blockno;
+}
+
+static void ind_info_swap(void *a, void *b, int size)
+{
+       struct ind_block_info *info_a = (struct ind_block_info *)a;
+       struct ind_block_info *info_b = (struct ind_block_info *)b;
+       struct ind_block_info tmp;
+
+       tmp = *info_a;
+       *info_a = *info_b;
+       *info_b = tmp;
+}
+
+/*
+ * ext3_read_indblocks_async --
+ *      @sb:            super block
+ *      @ind_blocks[]:  array of indirect block numbers on disk
+ *      @count:         maximum number of indirect blocks to read
+ *      @first_bh:      buffer_head for indirect block ind_blocks[0], may be
+ *                      NULL
+ *      @seq_prefetch:  if this is part of a sequential prefetch and buffers'
+ *                      prefetch bit must be set.
+ *      @blocks_done:   number of blocks considered for prefetching.
+ *
+ *      Issue a single bio request to read up to count buffers identified in
+ *      ind_blocks[]. Fewer than count buffers may be read in some cases:
+ *      - If a buffer is found to be uptodate and its prefetch bit is set, we
+ *      don't look at any more buffers as they will most likely be in
+ *      the cache.
+ *      - We skip buffers we cannot lock without blocking (except for first_bh
+ *      if specified).
+ *      - We skip buffers beyond a certain range on disk.
+ *
+ *      This function must issue read on first_bh if specified unless of course
+ *      it's already uptodate.
+ */
+static int ext3_read_indblocks_async(struct super_block *sb,
+                                    const __le32 ind_blocks[], int count,
+                                    struct buffer_head *first_bh,
+                                    int seq_prefetch,
+                                    unsigned long *blocks_done)
+{
+       struct buffer_head *bh;
+       struct bio *bio = NULL;
+       struct ext3_ind_read_info *read_info = NULL;
+       int read_cnt = 0, blk;
+       ext3_fsblk_t prev_blk = 0, io_start_blk = 0, curr;
+       struct ind_block_info *ind_info = NULL;
+       int err = 0, ind_info_count = 0;
+
+       BUG_ON(count < 1);
+       /* Don't move this to ext3_get_max_read() since callers often need to
+        * trim the count returned by that function. So this bound must only
+        * be imposed at the last moment. */
+       count = min_t(unsigned long, count, EXT3_IND_READ_MAX);
+       *blocks_done = 0UL;
+
+       if (count == 1 && first_bh) {
+               lock_buffer(first_bh);
+               get_bh(first_bh);
+               first_bh->b_end_io = end_buffer_read_sync;
+               submit_bh(READ, first_bh);
+               *blocks_done = 1UL;
+               return 0;
+       }
+
+       ind_info = kmalloc(count * sizeof(*ind_info), GFP_KERNEL);
+       if (unlikely(!ind_info))
+               return -ENOMEM;
+
+       /*
+        * First pass: sort block numbers for all indirect blocks that we'll
+        * read. This allows us to scan blocks in sequential order during the
+        * second pass, which helps coalesce requests to contiguous blocks.
+        * Since we sort block numbers here instead of assuming any specific
+        * layout on the disk, we have some protection against different
+        * indirect block layout strategies as long as they keep all indirect
+        * blocks close by.
+        */
+       for (blk = 0; blk < count; blk++) {
+               curr = le32_to_cpu(ind_blocks[blk]);
+               if (!curr)
+                       continue;
+
+               /*
+                * Skip this block if it lies too far from blocks we have
+                * already decided to read. "Too far" should typically indicate
+                * lying on a different track on the disk. EXT3_IND_READ_MAX
+                * seems reasonable for most disks.
+                */
+               if (io_start_blk > 0 &&
+                       (max(io_start_blk, curr) - min(io_start_blk, curr) >=
+                               EXT3_IND_READ_MAX))
+                       continue;
+
+               if (blk == 0 && first_bh) {
+                       bh = first_bh;
+                       get_bh(first_bh);
+               } else {
+                       bh = sb_getblk(sb, curr);
+                       if (unlikely(!bh)) {
+                               err = -ENOMEM;
+                               goto failure;
+                       }
+               }
+
+               if (buffer_uptodate(bh)) {
+                       if (ext3_buffer_prefetch(bh)) {
+                               brelse(bh);
+                               break;
+                       }
+                       brelse(bh);
+                       continue;
+               }
+
+               if (io_start_blk == 0)
+                       io_start_blk = curr;
+
+               ind_info[ind_info_count].blockno = curr;
+               ind_info[ind_info_count].bh = bh;
+               ind_info_count++;
+       }
+       *blocks_done = blk;
+
+       sort(ind_info, ind_info_count, sizeof(*ind_info),
+               ind_info_cmp, ind_info_swap);
+
+       /* Second pass: compose bio requests and issue them. */
+       for (blk = 0; blk < ind_info_count; blk++) {
+               bh = ind_info[blk].bh;
+               curr = ind_info[blk].blockno;
+
+               if (prev_blk > 0 && curr != prev_blk + 1) {
+                       ext3_read_indblocks_submit(&bio, &read_info,
+                                               &read_cnt, seq_prefetch);
+                       prev_blk = 0;
+               }
+
+               /* Lock the buffer without blocking, skipping any buffers
+                * which would require us to block. first_bh when specified is
+                * an exception as caller typically wants it to be read for
+                * sure (e.g., ext3_read_indblocks_sync).
+                */
+               if (bh == first_bh) {
+                       lock_buffer(bh);
+               } else if (test_set_buffer_locked(bh)) {
+                       brelse(bh);
+                       continue;
+               }
+
+               /* Check again with the buffer locked. */
+               if (buffer_uptodate(bh)) {
+                       if (ext3_buffer_prefetch(bh)) {
+                               unlock_buffer(bh);
+                               brelse(bh);
+                               break;
+                       }
+                       unlock_buffer(bh);
+                       brelse(bh);
+                       continue;
+               }
+
+               if (read_cnt == 0) {
+                       /* read_info freed in ext3_ind_read_end_bio(). */
+                       read_info = kmalloc(EXT3_IND_READ_INFO_SIZE(count),
+                                           GFP_KERNEL);
+                       if (unlikely(!read_info)) {
+                               err = -ENOMEM;
+                               goto failure;
+                       }
+
+                       bio = bio_alloc(GFP_KERNEL, count);
+                       if (unlikely(!bio)) {
+                               err = -ENOMEM;
+                               goto failure;
+                       }
+                       bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
+                       bio->bi_bdev = bh->b_bdev;
+               }
+
+               if (bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh))
+                               < bh->b_size) {
+                       brelse(bh);
+                       if (read_cnt == 0)
+                               goto failure;
+
+                       break;
+               }
+
+               read_info->bh[read_cnt++] = bh;
+               prev_blk = curr;
+       }
+
+       if (read_cnt == 0)
+               goto done;
+
+       ext3_read_indblocks_submit(&bio, &read_info, &read_cnt, seq_prefetch);
+
+       kfree(ind_info);
+       return 0;
+
+failure:
+       while (--read_cnt >= 0) {
+               unlock_buffer(read_info->bh[read_cnt]);
+               brelse(read_info->bh[read_cnt]);
+       }
+       *blocks_done = 0UL;
+
+done:
+       kfree(read_info);
+
+       if (bio)
+               bio_put(bio);
+
+       kfree(ind_info);
+       return err;
+}
+
+/*
+ * ext3_read_indblocks_sync --
+ *      @sb:            super block
+ *      @ind_blocks[]:  array of indirect block numbers on disk
+ *      @count:         maximum number of indirect blocks to read
+ *      @first_bh:      buffer_head for indirect block ind_blocks[0], must be
+ *                      non-NULL.
+ *      @seq_prefetch:  set prefetch bit of buffers, used when this is part of
+ *                      a sequential prefetch.
+ *      @blocks_done:   number of blocks considered for prefetching.
+ *
+ *      Synchronously read at most count indirect blocks listed in
+ *      ind_blocks[]. This function calls ext3_read_indblocks_async() to do all
+ *      the hard work. It waits for read to complete on first_bh before
+ *      returning.
+ */
+
+static int ext3_read_indblocks_sync(struct super_block *sb,
+                                   const __le32 ind_blocks[], int count,
+                                   struct buffer_head *first_bh,
+                                   int seq_prefetch,
+                                   unsigned long *blocks_done)
+{
+       int err;
+
+       BUG_ON(count < 1);
+       BUG_ON(!first_bh);
+
+       err = ext3_read_indblocks_async(sb, ind_blocks, count, first_bh,
+                                       seq_prefetch, blocks_done);
+       if (err)
+               return err;
+
+       wait_on_buffer(first_bh);
+       if (!buffer_uptodate(first_bh))
+               err = -EIO;
+
+       /* If seq_prefetch != 0, ext3_read_indblocks_async() sets the prefetch
+        * bit on all buffers. The first buffer of a sync IO is needed right
+        * away, so it is never treated as a prefetch buffer; clear its bit.
+        */
+       if (seq_prefetch)
+               ext3_clear_buffer_prefetch(first_bh);
+
+       BUG_ON(ext3_buffer_prefetch(first_bh));
+
+       return err;
+}
+
+/*
+ * ext3_read_indblocks --
+ *
+ *     @inode: inode of file
+ *     @iblock: block number inside file (starting from 0).
+ *     @depth: depth of path from inode to data block.
+ *     @offsets: array of offsets within blocks identified in 'chain'.
+ *     @chain: array of Indirect with info about all levels of blocks until
+ *     the data block.
+ *     @err: error pointer.
+ *
+ *     This function is called after reading all metablocks leading to 'iblock'
+ *     except the (singly) indirect block. It reads the indirect block if not
+ *     already in the cache and may also prefetch next few indirect blocks.
+ *     It uses a combination of synchronous and asynchronous requests to
+ *     accomplish this. We do prefetching even for random reads by reading
+ *     ahead one indirect block since reads of size >=512KB have at least 12%
+ *     chance of spanning two indirect blocks.
+ */
+
+static Indirect *ext3_read_indblocks(struct inode *inode, int iblock,
+                                    int depth, int offsets[4],
+                                    Indirect chain[4], int *err)
+{
+       struct super_block *sb = inode->i_sb;
+       struct buffer_head *first_bh, *prev_bh;
+       unsigned long max_read, blocks_done = 0;
+       __le32 *ind_blocks;
+
+       /* Must have doubly indirect block for prefetching indirect blocks. */
+       BUG_ON(depth <= 2);
+       BUG_ON(!chain[depth-2].key);
+
+       *err = 0;
+
+       /* Handle first block */
+       ind_blocks = chain[depth-2].p;
+       first_bh = sb_getblk(sb, le32_to_cpu(ind_blocks[0]));
+       if (unlikely(!first_bh)) {
+               printk(KERN_ERR "Failed to get block %u for sb %p\n",
+                      le32_to_cpu(ind_blocks[0]), sb);
+               goto failure;
+       }
+
+       BUG_ON(first_bh->b_size != sb->s_blocksize);
+
+       if (buffer_uptodate(first_bh)) {
+               /* Found the buffer in cache, either it was accessed recently or
+                * it was prefetched while reading previous indirect block(s).
+                * We need to figure out if we need to prefetch the following
+                * indirect blocks.
+                */
+               if (!ext3_buffer_prefetch(first_bh)) {
+                       /* Either we've seen this indirect block before while
+                        * accessing another data block, or this is a random
+                        * read. In the former case, we must have done the
+                        * needful the first time we had a cache hit on this
+                        * indirect block, in the latter case we obviously
+                        * don't need to do any prefetching.
+                        */
+                       goto done;
+               }
+
+               max_read = ext3_get_max_read(inode, iblock,
+                                            offsets[depth-2]);
+
+               /* This indirect block is in the cache due to prefetching and
+                * this is its first cache hit, clear the prefetch bit and
+                * make sure the following blocks are also prefetched.
+                */
+               ext3_clear_buffer_prefetch(first_bh);
+
+               if (max_read >= 2) {
+                       /* ext3_read_indblocks_async() stops at the first
+                        * indirect block which has the prefetch bit set which
+                        * will most likely be the very next indirect block.
+                        */
+                       ext3_read_indblocks_async(sb, &ind_blocks[1],
+                                                 max_read - 1,
+                                                 NULL, 1, &blocks_done);
+               }
+
+       } else {
+               /* Buffer is not in memory, we need to read it. If we are
+                * reading sequentially from the previous indirect block, we
+                * have just detected a sequential read and we must prefetch
+                * some indirect blocks for the future.
+                */
+
+               max_read = ext3_get_max_read(inode, iblock,
+                                            offsets[depth-2]);
+
+               if ((ind_blocks - (__le32 *)chain[depth-2].bh->b_data) >= 1) {
+                       prev_bh = sb_getblk(sb, le32_to_cpu(ind_blocks[-1]));
+                       /* sb_getblk() can fail; treat that as no detection. */
+                       if (prev_bh && buffer_uptodate(prev_bh) &&
+                           !ext3_buffer_prefetch(prev_bh)) {
+                               /* Detected sequential read. */
+                               brelse(prev_bh);
+
+                               /* Sync read indirect block, also read the next
+                                * few indirect blocks.
+                                */
+                               *err = ext3_read_indblocks_sync(sb, ind_blocks,
+                                                        max_read, first_bh, 1,
+                                                        &blocks_done);
+
+                               if (*err)
+                                       goto out;
+
+                               /* In case the very next indirect block is
+                                * discontiguous by a non-trivial amount,
+                                * ext3_read_indblocks_sync() above won't
+                                * prefetch it (indicated by blocks_done < 2).
+                                * So to help sequential read, schedule an
+                                * async request for reading the next
+                                * contiguous indirect block range (which
+                                * in metaclustering case would be the next
+                                * metacluster, without metaclustering it
+                                * would be the next indirect block). This is
+                                * expected to benefit the non-metaclustering
+                                * case.
+                                */
+                               if (max_read >= 2 && blocks_done < 2)
+                                       ext3_read_indblocks_async(sb,
+                                                       &ind_blocks[1],
+                                                       max_read - 1,
+                                                       NULL, 1, &blocks_done);
+
+                               goto done;
+                       }
+                       brelse(prev_bh);
+               }
+
+               /* Either random read, or sequential detection failed above.
+                * We always prefetch the next indirect block in this case
+                * whenever possible.
+                * This is because for random reads of size ~512KB, there is
+                * >12% chance that a read will span two indirect blocks.
+                */
+               *err = ext3_read_indblocks_sync(sb, ind_blocks,
+                                               (max_read >= 2) ? 2 : 1,
+                                               first_bh, 0, &blocks_done);
+               if (*err)
+                       goto out;
+       }
+
+done:
+       /* Reader: pointers */
+       if (!verify_chain(chain, &chain[depth - 2])) {
+               brelse(first_bh);
+               goto changed;
+       }
+       add_chain(&chain[depth - 1], first_bh,
+                 (__le32*)first_bh->b_data + offsets[depth - 1]);
+       /* Reader: end */
+       if (!chain[depth - 1].key)
+               goto out;
+
+       BUG_ON(!buffer_uptodate(first_bh));
+       return NULL;
+
+changed:
+       *err = -EAGAIN;
+       goto out;
+failure:
+       *err = -EIO;
+out:
+       if (*err) {
+               ext3_debug("Error %d reading indirect blocks\n", *err);
+               return &chain[depth - 2];
+       } else
+               return &chain[depth - 1];
+}
+
diff -uprdN linux-2.6.23mm1-clean/fs/ext3/super.c linux-2.6.23mm1-ext3mc/fs/ext3/super.c
--- linux-2.6.23mm1-clean/fs/ext3/super.c       2007-10-17 18:31:42.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/fs/ext3/super.c      2007-12-20 18:11:14.000000000 -0800
@@ -625,6 +625,9 @@ static int ext3_show_options(struct seq_
        else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
                seq_puts(seq, ",data=writeback");

+       if (test_opt(sb, METACLUSTER))
+               seq_puts(seq, ",metacluster");
+
        ext3_show_quota_options(seq, sb);

        return 0;
@@ -758,7 +761,7 @@ enum {
        Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
        Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
        Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
-       Opt_grpquota
+       Opt_grpquota, Opt_metacluster
 };

 static match_table_t tokens = {
@@ -808,6 +811,7 @@ static match_table_t tokens = {
        {Opt_quota, "quota"},
        {Opt_usrquota, "usrquota"},
        {Opt_barrier, "barrier=%u"},
+       {Opt_metacluster, "metacluster"},
        {Opt_err, NULL},
        {Opt_resize, "resize"},
 };
@@ -1140,6 +1144,9 @@ clear_qf_name:
                case Opt_bh:
                        clear_opt(sbi->s_mount_opt, NOBH);
                        break;
+               case Opt_metacluster:
+                       set_opt(sbi->s_mount_opt, METACLUSTER);
+                       break;
                default:
                        printk (KERN_ERR
                                "EXT3-fs: Unrecognized mount option \"%s\" "
@@ -1674,6 +1681,13 @@ static int ext3_fill_super (struct super
        }
        sbi->s_frags_per_block = 1;
        sbi->s_blocks_per_group = le32_to_cpu(es->s_blocks_per_group);
+       if (test_opt(sb, METACLUSTER)) {
+               sbi->s_nonmc_blocks_per_group = sbi->s_blocks_per_group -
+                       sbi->s_blocks_per_group / 12;
+               sbi->s_nonmc_blocks_per_group &= ~7;
+       } else
+               sbi->s_nonmc_blocks_per_group = sbi->s_blocks_per_group;
+
        sbi->s_frags_per_group = le32_to_cpu(es->s_frags_per_group);
        sbi->s_inodes_per_group = le32_to_cpu(es->s_inodes_per_group);
        if (EXT3_INODE_SIZE(sb) == 0)
@@ -1783,6 +1797,18 @@ static int ext3_fill_super (struct super
        sbi->s_rsv_window_head.rsv_goal_size = 0;
        ext3_rsv_window_add(sb, &sbi->s_rsv_window_head);

+       if (test_opt(sb, METACLUSTER)) {
+               sbi->s_bginfo = kmalloc(sbi->s_groups_count *
+                                       sizeof(*sbi->s_bginfo), GFP_KERNEL);
+               if (!sbi->s_bginfo) {
+                       printk(KERN_ERR "EXT3-fs: not enough memory\n");
+                       goto failed_mount3;
+               }
+               for (i = 0; i < sbi->s_groups_count; i++)
+                       sbi->s_bginfo[i].bgi_free_nonmc_blocks_count = -1;
+       } else
+               sbi->s_bginfo = NULL;
+
        /*
         * set up enough so that it can read an inode
         */
@@ -1808,16 +1834,16 @@ static int ext3_fill_super (struct super
        if (!test_opt(sb, NOLOAD) &&
            EXT3_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL)) {
                if (ext3_load_journal(sb, es, journal_devnum))
-                       goto failed_mount3;
+                       goto failed_mount4;
        } else if (journal_inum) {
                if (ext3_create_journal(sb, es, journal_inum))
-                       goto failed_mount3;
+                       goto failed_mount4;
        } else {
                if (!silent)
                        printk (KERN_ERR
                                "ext3: No journal on filesystem on %s\n",
                                sb->s_id);
-               goto failed_mount3;
+               goto failed_mount4;
        }

        /* We have now updated the journal if required, so we can
@@ -1840,7 +1866,7 @@ static int ext3_fill_super (struct super
                    (sbi->s_journal, 0, 0, JFS_FEATURE_INCOMPAT_REVOKE)) {
                        printk(KERN_ERR "EXT3-fs: Journal does not support "
                               "requested data journaling mode\n");
-                       goto failed_mount4;
+                       goto failed_mount5;
                }
        default:
                break;
@@ -1863,13 +1889,13 @@ static int ext3_fill_super (struct super
        if (!sb->s_root) {
                printk(KERN_ERR "EXT3-fs: get root inode failed\n");
                iput(root);
-               goto failed_mount4;
+               goto failed_mount5;
        }
        if (!S_ISDIR(root->i_mode) || !root->i_blocks || !root->i_size) {
                dput(sb->s_root);
                sb->s_root = NULL;
                printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
-               goto failed_mount4;
+               goto failed_mount5;
        }

        ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
@@ -1901,8 +1927,10 @@ cantfind_ext3:
                       sb->s_id);
        goto failed_mount;

-failed_mount4:
+failed_mount5:
        journal_destroy(sbi->s_journal);
+failed_mount4:
+       kfree(sbi->s_bginfo);
 failed_mount3:
        percpu_counter_destroy(&sbi->s_freeblocks_counter);
        percpu_counter_destroy(&sbi->s_freeinodes_counter);
diff -uprdN linux-2.6.23mm1-clean/include/linux/ext3_fs.h linux-2.6.23mm1-ext3mc/include/linux/ext3_fs.h
--- linux-2.6.23mm1-clean/include/linux/ext3_fs.h       2007-10-17 18:31:43.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/include/linux/ext3_fs.h      2007-12-21 05:40:05.000000000 -0800
@@ -380,6 +380,7 @@ struct ext3_inode {
 #define EXT3_MOUNT_QUOTA               0x80000 /* Some quota option set */
 #define EXT3_MOUNT_USRQUOTA            0x100000 /* "old" user quota */
 #define EXT3_MOUNT_GRPQUOTA            0x200000 /* "old" group quota */
+#define EXT3_MOUNT_METACLUSTER         0x400000 /* Indirect block clustering */

 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
@@ -493,6 +494,7 @@ struct ext3_super_block {
 #ifdef __KERNEL__
 #include <linux/ext3_fs_i.h>
 #include <linux/ext3_fs_sb.h>
+#include <linux/buffer_head.h>
 static inline struct ext3_sb_info * EXT3_SB(struct super_block *sb)
 {
        return sb->s_fs_info;
@@ -722,6 +724,11 @@ struct dir_private_info {
        __u32           next_hash;
 };

+/* Special bh flag used by the metacluster readahead logic. */
+enum ext3_bh_state_bits {
+       EXT3_BH_PREFETCH = BH_JBD_Sentinel,
+};
+
 /* calculate the first block number of the group */
 static inline ext3_fsblk_t
 ext3_group_first_block_no(struct super_block *sb, unsigned long group_no)
@@ -730,6 +737,24 @@ ext3_group_first_block_no(struct super_b
                le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block);
 }

+static inline void
+ext3_set_buffer_prefetch(struct buffer_head *bh)
+{
+       set_bit(EXT3_BH_PREFETCH, &bh->b_state);
+}
+
+static inline void
+ext3_clear_buffer_prefetch(struct buffer_head *bh)
+{
+       clear_bit(EXT3_BH_PREFETCH, &bh->b_state);
+}
+
+static inline int
+ext3_buffer_prefetch(struct buffer_head *bh)
+{
+       return test_bit(EXT3_BH_PREFETCH, &bh->b_state);
+}
+
 /*
  * Special error return code only used by dx_probe() and its callers.
  */
@@ -752,8 +777,9 @@ extern int ext3_bg_has_super(struct supe
 extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
 extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode,
                        ext3_fsblk_t goal, int *errp);
-extern ext3_fsblk_t ext3_new_blocks (handle_t *handle, struct inode *inode,
-                       ext3_fsblk_t goal, unsigned long *count, int *errp);
+extern int ext3_new_blocks(handle_t *handle, struct inode *inode,
+                       ext3_fsblk_t goal, int indirect_blks, int blks,
+                       ext3_fsblk_t new_blocks[], int *errp);
 extern void ext3_free_blocks (handle_t *handle, struct inode *inode,
                        ext3_fsblk_t block, unsigned long count);
 extern void ext3_free_blocks_sb (handle_t *handle, struct super_block *sb,
diff -uprdN linux-2.6.23mm1-clean/include/linux/ext3_fs_sb.h linux-2.6.23mm1-ext3mc/include/linux/ext3_fs_sb.h
--- linux-2.6.23mm1-clean/include/linux/ext3_fs_sb.h    2007-10-17 18:31:43.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/include/linux/ext3_fs_sb.h   2007-12-20 18:11:14.000000000 -0800
@@ -24,6 +24,8 @@
 #endif
 #include <linux/rbtree.h>

+struct ext3_bg_info;
+
 /*
  * third extended-fs super-block data in memory
  */
@@ -33,6 +35,7 @@ struct ext3_sb_info {
        unsigned long s_inodes_per_block;/* Number of inodes per block */
        unsigned long s_frags_per_group;/* Number of fragments in a group */
        unsigned long s_blocks_per_group;/* Number of blocks in a group */
+       unsigned long s_nonmc_blocks_per_group;/* Number of non-metacluster blocks in a group */
        unsigned long s_inodes_per_group;/* Number of inodes in a group */
        unsigned long s_itb_per_group;  /* Number of inode table blocks per group */
        unsigned long s_gdb_count;      /* Number of group descriptor blocks */
@@ -67,6 +70,9 @@ struct ext3_sb_info {
        struct rb_root s_rsv_window_root;
        struct ext3_reserve_window_node s_rsv_window_head;

+       /* array of per-bg in-memory info */
+       struct ext3_bg_info *s_bginfo;
+
        /* Journaling */
        struct inode * s_journal_inode;
        struct journal_s * s_journal;
@@ -83,4 +89,11 @@ struct ext3_sb_info {
 #endif
 };

+/*
+ * in-memory data associated with each block group.
+ */
+struct ext3_bg_info {
+       int bgi_free_nonmc_blocks_count;/* Number of free non-metacluster blocks in group */
+};
+
 #endif /* _LINUX_EXT3_FS_SB */
diff -uprdN linux-2.6.23mm1-clean/include/linux/jbd.h linux-2.6.23mm1-ext3mc/include/linux/jbd.h
--- linux-2.6.23mm1-clean/include/linux/jbd.h   2007-10-17 18:31:43.000000000 -0700
+++ linux-2.6.23mm1-ext3mc/include/linux/jbd.h  2007-12-20 18:11:14.000000000 -0800
@@ -294,6 +294,7 @@ enum jbd_state_bits {
        BH_State,               /* Pins most journal_head state */
        BH_JournalHead,         /* Pins bh->b_private and jh->b_bh */
        BH_Unshadow,            /* Dummy bit, for BJ_Shadow wakeup filtering */
+       BH_JBD_Sentinel,        /* Start bit for clients of jbd */
 };

BUFFER_FNS(JBD, jbd)

On Nov 16, 2007 6:58 PM, Theodore Tso <tytso@mit.edu> wrote:
> The practice of starting search in the next block in the
> metadata area only makes a difference for one indirect block, yes, but
> it's the right thing to do.  And if you fold the ext3_new_blocks and
> ext3_new_indirect_blocks(), it's really not that hard.  You can
> basically do something like this:
>
>         if (alloc_for_metadata)
>                 strategy = 0x132;
>         else
>                 strategy = 0x231;
>         for (; strategy; strategy = strategy >> 4) {
>                 switch (strategy & 0xF) {
>                 case 1:
>                      start = block_group_start;
>                      end = mc_start - 1;
>                      break;
>                 case 2:
>                      start = mc_start;
>                      end = mc_end;
>                      break;
>                 case 3:
>                      start = mc_end + 1;
>                      end = block_group_end;
>                      break;
>                 }
>                 <search region between start.. end>
>         }
>
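
For concreteness, a minimal C sketch of the folded search loop described
above might look like this; try_alloc_in_range(), group_start/group_end and
mc_start/mc_end are hypothetical placeholders standing in for the real
bitmap search and block group bounds, not code from the patch:

    /*
     * Sketch only: walk the regions of one block group in the order
     * encoded by the "strategy" nibbles (1 = data area before the
     * metacluster, 2 = the metacluster, 3 = data area after it).
     * ext3_fsblk_t is the ext3 block number type from the ext3 headers.
     */
    static ext3_fsblk_t alloc_with_strategy(int alloc_for_metadata,
                    ext3_fsblk_t group_start, ext3_fsblk_t group_end,
                    ext3_fsblk_t mc_start, ext3_fsblk_t mc_end)
    {
            unsigned int strategy = alloc_for_metadata ? 0x132 : 0x231;
            ext3_fsblk_t start, end, blk;

            for (; strategy; strategy >>= 4) {  /* one hex digit per region */
                    switch (strategy & 0xF) {
                    case 1:                     /* data area before the mc */
                            start = group_start;
                            end = mc_start - 1;
                            break;
                    case 2:                     /* the metacluster itself */
                            start = mc_start;
                            end = mc_end;
                            break;
                    case 3:                     /* data area after the mc */
                            start = mc_end + 1;
                            end = group_end;
                            break;
                    default:
                            continue;
                    }
                    blk = try_alloc_in_range(start, end);  /* hypothetical */
                    if (blk)
                            return blk;
            }
            return 0;
    }
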
> > We initially avoided making metaclustering a superblock tunable as we
> > didn't want to make any changes to the on-disk format as then ext4
> > extents are also a good option.
>
> Allocating a superblock field is no big deal.  I'll note further that
> metaclustering is not necessarily mutually exclusive with ext4
> extents.  Allocating the extent tree data blocks out of the
> metacluster blocks can be a good idea, depending on the average size
> of the blocks and how fragmented the filesystem gets (and hence how
> many contiguous extents can be expected).  If the filesystem is
> storing lots of really big files where being contiguous across
> multiple blockgroups are productive, then the metacluster area would
> actually be counterproductive.  And if files are all small so the
> extents fit the inode, the metadata cluster area wouldn't be necessary
> at all.  But if there are multiple external extent blocks in a block
> group, it would be useful for them to be allocated together.
>
> > If metaclustering gains acceptance
> > it might make sense to make it a superblock tunable. However, I would
> > avoid putting metacluster size into the superblock for the following
> > reason. Ideally, we should not have to bother about finding the sweet
> > spot of metacluster size as
> > (1) a given file system can be used for storing different kinds
> > of files at different times and it would be a pain to tune it every now
> > and then, and
>
> Yes, it doesn't make sense to retune the filesystem.  I was assuming
> that this would only be done at mke2fs time.
>
> > (2) it opens the possibility of doubting metacluster size for unrelated
> > ext3/fsck performance anomalies.
>
> I'm not sure I understand your concern.  The reality is that 99% of
> the time users will never change it from the defaults, but making it
> tunable makes it much, much easier for us to try various experiments
> to determine what is the best initial value for different workloads.
> What might get used for a Usenet news spool or a Squid cache might be
> quite different from series of DVD image files.
>
> > Allow me to propose a solution that will most likely address the above
> > issue and please ignore its complexity for a moment. Instead of a two
> > level partitioning in the block space between data blocks and
> > metacluster blocks, have a 3 or 4 level partitioning. E.g., a block
> > group with 'd' blocks can have d/32 blocks in metacluster level 1,
> > d/64 blocks in metacluster level 2, and d/128 blocks in metacluster
> > level 3 (define level 0 has having the remaining blocks = d - d/32 -
> > d/64 - d/128). Data block allocation starts looking for a free block
> > starting from the lowest possible level. If it is unable to find any
> > free blocks at that level in all block groups, it moves up a level and
> > so on. Indirect block allocation proceeds in the opposite direction
> > starting from higher levels. This approach has several benefits:
>
> That is clever.  Oh, one other thing.  You didn't mention what
> happened when the metacluster field was placed at the end of the block
> group.  I assume you tried that in your experiments; what were the
> results?  The obvious thing to do to avoid further fragmentation of
> the block group would be to put level 1 at the end of the block group,
> level 2 just before it, and level 3 before that, and then allocate the
> data blocks starting at the beginning of the block group, i.e:
>
> +----------------------------------+---------------+---------+-------+
> |     data                         | level 3       | level 2 | lvl 1 |
> +----------------------------------+---------------+---------+-------+
>
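
Putting rough numbers on that layout: with the d/32, d/64, d/128 split
proposed above, the level boundaries could be computed along these lines (a
sketch only; the struct and function names are made up, not from the patch).
For a 32768-block group this gives 1024, 512 and 256 blocks for levels 1, 2
and 3 respectively.

    /*
     * Sketch only: boundaries for a group of d blocks carved into
     * levels sized d/32, d/64 and d/128 at the end of the group, as
     * in the diagram above.  ext3_fsblk_t is the ext3 block number type.
     */
    struct mc_levels {
            ext3_fsblk_t level3_start;      /* d/128 blocks */
            ext3_fsblk_t level2_start;      /* d/64 blocks */
            ext3_fsblk_t level1_start;      /* d/32 blocks, at the very end */
    };

    static void mc_compute_levels(ext3_fsblk_t group_start, unsigned long d,
                                  struct mc_levels *mc)
    {
            ext3_fsblk_t group_end = group_start + d;  /* one past last block */

            mc->level1_start = group_end - d / 32;
            mc->level2_start = mc->level1_start - d / 64;
            mc->level3_start = mc->level2_start - d / 128;
            /* [group_start, level3_start) is level 0, i.e. data blocks */
    }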
>
> > In traditional metaclustering, once we run out of metacluster blocks
> > or data blocks, all bets are off. This forces us to keep small
> > metaclusters in order to avoid this situation altogether. But with small
> > metaclusters, we cannot optimize indirect block allocation on file
> > systems with many small files (>48KB). There is only one glitch in
> > implementing this. If a block group doesn't have any free blocks at a
> > given level, we should be able to find that out quickly instead of
> > having to scan its entire bitmap. gdp->bg_free_blocks_count is not good
> > enough for this.
>
> Ideally, true, but this was a defect with the original metacluster
> scheme as well.  We could steal some bits in the block_group
> descriptor structure to indicate whether a particular level is full,
> though.  This would be another data format change that would require
> e2fsprogs support, though.
>
> Regards,
>
>                                                 - Ted
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2008-01-10 21:17           ` Abhishek Rai
@ 2008-01-11 17:05             ` Daniel Phillips
  2008-01-12  0:04               ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel Phillips @ 2008-01-11 17:05 UTC (permalink / raw)
  To: Abhishek Rai
  Cc: Theodore Tso, Andrew Morton, Andreas Dilger, linux-kernel,
	Ken Chen, Mike Waychison, rohitseth

On Thursday 10 January 2008 13:17, Abhishek Rai wrote:
> Benchmark 5: fsck
> Description: Prepare a newly formatted 400GB disk as follows: create
> 200 files of 0.5GB each, 100 files of 1GB each, 40 files of 2.5GB
> each, and 10 files of 10GB each. fsck command line: fsck -f -n
> 1. vanilla:
>  Total: 11m25.3s
>  User: 13.4s
>  System: 13.2s
> 2. mc:
>  Total: 3m11.0s
>  User: 13.1s
>  System: 12.9s
>
> Note: I'll report results from kernbench and compilebench shortly.
>
> Observations:
> Sequential write performance is much better with metaclustering than
> with vanilla. To better understand it, I ran the same benchmark with
> the new code but with the metaclustering option turned off and I got
> the same performance as vanilla which makes me believe that there is
> something about metaclustering that helps write performance though I
> don't have a very good handle of what that thing might be.

Your results are very impressive.   In my opinion, the sooner this goes 
in, the better, since everybody hates waiting for fsck.  The only issue 
that jumps out at me is, the patch is big and changes a significant 
amount of Ext3 code outside of the metacluster path, which is not a bad 
thing except that these changes are going to need to be tested fairly 
heavily.

The way to do that is, put a big [CALL FOR TESTING] in your subject line 
the next time you post, and use an attention-getting subject line 
like "Make Ext3 fsck way faster".   Diff the patch against the latest 
stable kernel to make things as easy as possible for the people who are 
hopefully going to download your patch, try it, and report their 
results.

The other way is just to ask Andrew to put it in -mm when you feel 
ready, but your chances are much better if you already have people 
sending in mails saying how great your patch is.

Another thing you might consider is a port to Ext4.  After all, the 
world has waited this long for your patch, so it can likely survive 
waiting a little longer.

You somehow seem to have missed attracting the attention of Jon Corbet, 
a rare occurrence for a patch of this significance.  With the subject 
line modified as above, you are more likely to get the attention you 
deserve.  Good luck!

Regards,

Daniel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2008-01-11 17:05             ` Daniel Phillips
@ 2008-01-12  0:04               ` Andrew Morton
  2008-01-12  6:05                 ` Daniel Phillips
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2008-01-12  0:04 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: abhishekrai, tytso, adilger, linux-kernel, kenchen, mikew, rohitseth

On Fri, 11 Jan 2008 09:05:17 -0800
Daniel Phillips <phillips@phunq.net> wrote:

> On Thursday 10 January 2008 13:17, Abhishek Rai wrote:
> > Benchmark 5: fsck
> > Description: Prepare a newly formatted 400GB disk as follows: create
> > 200 files of 0.5GB each, 100 files of 1GB each, 40 files of 2.5GB
> > each, and 10 files of 10GB each. fsck command line: fsck -f -n
> > 1. vanilla:
> >  Total: 11m25.3s
> >  User: 13.4s
> >  System: 13.2s
> > 2. mc:
> >  Total: 3m11.0s
> >  User: 13.1s
> >  System: 12.9s
> >
> > Note: I'll report results from kernbench and compilebench shortly.
> >
> > Observations:
> > Sequential write performance is much better with metaclustering than
> > with vanilla. To better understand it, I ran the same benchmark with
> > the new code but with the metaclustering option turned off and I got
> > the same performance as vanilla which makes me believe that there is
> > something about metaclustering that helps write performance though I
> > don't have a very good handle of what that thing might be.
> 
> Your results are very impressive.   In my opinion, the sooner this goes 
> in, the better, since everybody hates waiting for fsck.  The only issue 
> that jumps out at me is, the patch is big and changes a significant 
> amount of Ext3 code outside of the metacluster path, which is not a bad 
> thing except that these changes are going to need to be tested fairly 
> heavily.

It needs to be reviewed.  In exhaustive detail.  Few people can do that and
fewer are inclined to do so.

> The way to do that is, put a big [CALL FOR TESTING] in your subject line 
> the next time you post, and use an attention-getting subject line 
> like "Make Ext3 fsck way faster".   Diff the patch against the latest 
> stable kernel to make things as easy as possible for the people who are 
> hopefully going to download your patch, try it, and report their 
> results.
> 
> The other way is just to ask Andrew to put it in -mm when you feel 
> ready, but your chances are much better if you already have people 
> sending in mails saying how great your patch is.

I went to merge it so it could get some testing while we await review but
the patch has all its tabs replaced with spaces, is seriously wordwrapped
and has random newlines added to it.  Please fix email client and resend
(offlist is OK if it is unaltered).


We should have a think about which workloads are most likely to be
adversely affected by this change.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2008-01-12  0:04               ` Andrew Morton
@ 2008-01-12  6:05                 ` Daniel Phillips
  2008-01-13  5:06                   ` Abhishek Rai
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel Phillips @ 2008-01-12  6:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: abhishekrai, tytso, adilger, linux-kernel, kenchen, mikew, rohitseth

On Friday 11 January 2008 16:04, Andrew Morton wrote:
> It needs to be reviewed.  In exhaustive detail.  Few people can do
> that and fewer are inclined to do so.

Agreed, there just have to be a few bugs in this many lines of code.
I spent a couple of hours going through it, not really looking at the 
algorithms but just the superficial details.  I only found minor nits, 
and not many of those.

For example, I do not like to see "if (free_blocks == 0)" written as "if 
(free_blocks <= 0)" in an attempt to increase robustness.  What it 
actually does is make the effect of an error more subtle, or 
even "corrects" it.  Firmly in the niggle category.

I checked the locking of sbi->bginfo and didn't see a flaw, good.

I see a missing KERN_INFO added to a printk, it technically counts as an 
unrelated change but oh well.

Stylistically this new code is hard to tell apart from the incumbent 
code, except for being more heavily commented.  I wish all kernel code 
was written this clearly.

At this point I will run away in favor of for-real Ext3 hackers (you 
know who you are:-)

> I went to merge it so it could get some testing while we await review
> but the patch has all its tabs replaced with spaces, is seriously
> wordwrapped and has random newlines added to it.  Please fix email
> client and resend (offlist is OK if it is unaltered).

Odd, the original post has tabs and the updated one does not, though the 
client seems to be kmail in both cases.

> We should have a think about which workloads are most likely to be
> adversely affected by this change.

I was just rolling up my sleeves to construct the nasty sequential case 
where the head keeps seeking back to the center of the group after 
picking up each 4 MB of doubly indexed data when I realized that even 
the most simple minded disk cache makes this case a non-issue.  The 
drive will most likely suck a full track (roughly .5 MB) or big chunk 
thereof into cache the first time it seeks to the index cluster, thus 
having a whole group of double index blocks in cache and then will 
proceed to chew happily and linearly through the data blocks.
It seems like placing those second level index blocks all together 
really helps this case.  Hmm, how to break it.

How about having a disk full of 100 MB files and skipping all over the 
disk randomly reading one block each time.  That will fill the disk 
cache, and each random read then requires seeking to two places that 
were hopefully close together without index node clustering, and now 
will be an average of 32 MB apart.  Each of these "extra" seeks costs a 
couple of ms worth of head travel plus average rotational latency of 4 
ms or so, for a total 6 ms.  However, even with a perfect non-clustered 
layout, the index mode will still be an average of 2 MB away from the 
data block, so the rotational latency is still incurred and only the 
head travel is a little less, say 1 ms less.  So the "extra" seek time 
for clustered is 6 ms vs 5 ms for non-clustered.  Add in 8 ms for the 
long random seek and we have 14 ms vs 13 ms, or about 8% difference. 
Only a small regression there, and I tried hard.  Barring mistakes in 
my estimates the sequential improvement above is large while the 
regression for the nasty random construction is small.
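
Spelling the estimate out (the rough per-read figures used above, not 
measurements):

    clustered:      2 ms extra head travel + 4 ms rotation + 8 ms long seek = 14 ms
    non-clustered:  1 ms extra head travel + 4 ms rotation + 8 ms long seek = 13 ms
    difference:     14 / 13 - 1 ~= 7.7%, i.e. the "about 8%" above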

Maybe somebody else will have better luck breaking it.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Clustering indirect blocks in Ext3
  2008-01-12  6:05                 ` Daniel Phillips
@ 2008-01-13  5:06                   ` Abhishek Rai
  0 siblings, 0 replies; 21+ messages in thread
From: Abhishek Rai @ 2008-01-13  5:06 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrew Morton, tytso, adilger, linux-kernel, kenchen, mikew, rohitseth

Thanks for the great feedback Daniel!

Following this email, I'll be sending out two separate emails with the
actual patches, one against the latest stable kernel and one against
the latest mm patch, using the format suggested by you. Sorry about
the tabs and spaces thing, I've fixed my email client now.

Thanks for pointing out problems with the (free_blocks <= 0)
expression, I totally agree with you and have fixed it in the new
patch. Regarding some of the other points that you raised:

1. Worse performance with random read on large files using metaclustering:
This is a genuine drawback of this kind of metaclustering. In short,
the choice is between a slightly slower random read and a much faster
fsck. However, I believe that going forward, especially if and when we
port metaclustering to Ext4, slower random read will probably be less
of an issue, since there'll be fewer indirect blocks (due to the use
of extents) and so we'll be able to do more aggressive prefetching of
indirect blocks to help random reads.

That said, it would be great to see what random read performance other
users report since in my own experiments the degradation has been
somewhat smaller than I'd expect (I've also tried more complex random
read non-standard benchmarks that I haven't reported numbers for and
they did "reasonably ok" with metaclustering, but of course standard
reproducible results are always better).

2. Porting to Ext4:
It seems that popular opinion is that some form of metaclustering
could be useful for Ext4 as some other Ext4 hackers have also
suggested the same on LKML and I'd be glad to work on it. However, I
think metaclustering provides genuine value to current users of Ext3
and Ext2 and most people will agree that these two file systems are
very likely to remain popular for quite some time now (the backport of
metaclustering to Ext2 is quite trivial, so if metaclustering gets
accepted in Ext3, I'll probably release a "use-at-your-own-risk" patch
for Ext2 users).

Thanks!
Abhishek

On Jan 12, 2008 1:05 AM, Daniel Phillips <phillips@phunq.net> wrote:
> On Friday 11 January 2008 16:04, Andrew Morton wrote:
> > It needs to be reviewed.  In exhaustive detail.  Few people can do
> > that and fewer are inclined to do so.
>
> Agreed, there just have to be a few bugs in this many lines of code.
> I spent a couple of hours going through it, not really looking at the
> algorithms but just the superficial details.  I only found minor nits,
> and not many of those.
>
> For example, I do not like to see "if (free_blocks == 0)" written as "if
> (free_blocks <= 0)" in an attempt to increase robustness.  What it
> actually does is make the effect of an error more subtle, or
> even "corrects" it.  Firmly in the niggle category.
>
> I checked the locking of sbi->bginfo and didn't see a flaw, good.
>
> I see a missing KERN_INFO added to a printk, it technically counts as an
> unrelated change but oh well.
>
> Stylistically this new code is hard to tell apart from the incumbent
> code, except for being more heavily commented.  I wish all kernel code
> was written this clearly.
>
> At this point I will run away in favor of for-real Ext3 hackers (you
> know who you are:-)
>
> > I went to merge it so it could get some testing while we await review
> > but the patch has all its tabs replaced with spaces, is seriously
> > wordwrapped and has random newlines added to it.  Please fix email
> > client and resend (offlist is OK if it is unaltered).
>
> Odd, the original post has tabs and the updated one does not, though the
> client seems to be kmail in both cases.
>
> > We should have a think about which workloads are most likely to be
> > adversely affected by this change.
>
> I was just rolling up my sleeves to construct the nasty sequential case
> where the head keeps seeking back to the center of the group after
> picking up each 4 MB of doubly indexed data when I realized that even
> the most simple minded disk cache makes this case a non-issue.  The
> drive will most likely suck a full track (roughly .5 MB) or big chunk
> thereof into cache the first time it seeks to the index cluster, thus
> having a whole group of double index blocks in cache and then will
> proceed to chew happily and linearly through the data blocks.
> It seems like placing those second level index blocks all together
> really helps this case.  Hmm, how to break it.
>
> How about having a disk full of 100 MB files and skipping all over the
> disk randomly reading one block each time.  That will fill the disk
> cache, and each random read then requires seeking to two places that
> were hopefully close together without index node clustering, and now
> will be an average of 32 MB apart.  Each of these "extra" seeks costs a
> couple of ms worth of head travel plus average rotational latency of 4
> ms or so, for a total 6 ms.  However, even with a perfect non-clustered
> layout, the index node will still be an average of 2 MB away from the
> data block, so the rotational latency is still incurred and only the
> head travel is a little less, say 1 ms less.  So the "extra" seek time
> for clustered is 6 ms vs 5 ms for non-clustered.  Add in 8 ms for the
> long random seek and we have 14 ms vs 13 ms, or about 8% difference.
> Only a small regression there, and I tried hard.  Barring mistakes in
> my estimates the sequential improvement above is large while the
> regression for the nasty random construction is small.
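
Spelling that estimate out as a toy calculation (purely illustrative;
the constants are the rough figures quoted above, not measurements):

	#include <stdio.h>

	/*
	 * Back-of-envelope model of the random-read case: a long random
	 * seek plus one "extra" seek from the data block to its indirect
	 * block, which is ~32 MB away with clustering and ~2 MB away
	 * without it.
	 */
	int main(void)
	{
		double rotational = 4.0;	/* avg rotational latency, ms */
		double long_seek  = 8.0;	/* long random seek, ms */

		double extra_clustered     = 2.0 + rotational;	/* ~6 ms */
		double extra_non_clustered = 1.0 + rotational;	/* ~5 ms */

		double total_c  = long_seek + extra_clustered;		/* 14 ms */
		double total_nc = long_seek + extra_non_clustered;	/* 13 ms */

		printf("clustered %.0f ms, non-clustered %.0f ms, regression %.1f%%\n",
		       total_c, total_nc,
		       100.0 * (total_c - total_nc) / total_nc);
		return 0;
	}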
>
> Maybe somebody else will have better luck breaking it.
>
> Regards,
>
> Daniel
>

* Re: [PATCH] Clustering indirect blocks in Ext3
  2008-01-11 14:12             ` Bodo Eggert
@ 2008-01-11 14:49               ` Abhishek Rai
  0 siblings, 0 replies; 21+ messages in thread
From: Abhishek Rai @ 2008-01-11 14:49 UTC (permalink / raw)
  To: 7eggert
  Cc: ak, ebiederm, rdreier, gregkh, airlied, davej, mingo, tglx, akpm,
	arjan, Jesse, davem, linux-kernel, Suresh B, Linus Torvalds

That will surely help sequential read performance for large
unfragmented files and we have considered it before. There are two
main reasons why we want the data blocks and the corresponding
indirect blocks to share the same block group.

1. When a block group runs out of a certain type of block (data blocks
or indirect blocks), we fall back to using blocks of the other type for
allocation (a toy sketch of this fallback follows after point 2).
Consequently, if data blocks and their corresponding indirect blocks
share the same block group, we run out of data blocks in the block
group at exactly the same time as we run out of indirect blocks, so we
know we have fully utilized the block group and can move on to the next
one. This keeps things simple and results in low fragmentation.
However, if data blocks and their indirect blocks went into two
different block groups, we could run out of one kind of block in one
group while the other kind is still available in the other group, since
the two are now independent. We would then need to decide which kind of
allocation to move over to which block group. That requires slightly
more advanced heuristics, and I didn't want to add this complexity for
the small gain it offers.

2. I think sharing a block group the way it's done currently is a
cleaner design since allocation is quite self-contained within a block
group. I'd argue in the long run it's good to stick to a cleaner
design even if it is 1-2% worse in performance in some cases. Among
other things, cleaner designs are easier to change and enhance in the
future. More importantly, in this case our goal is to speed up fsck
without slowing down I/O, and we are comfortably achieving that goal.
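
A self-contained toy model of the fallback in point (1) above
(hypothetical code, not the patch's allocator) might look like this:

	#include <stdio.h>

	/*
	 * Toy model only: data and indirect allocations share one block
	 * group.  Each prefers its own region but borrows from the other
	 * when that region runs dry, so both kinds run out at the same
	 * time and the allocator can simply move on to the next group.
	 */
	struct toy_group {
		long data_free;	/* free blocks outside the metacluster */
		long meta_free;	/* free blocks inside the metacluster  */
	};

	/* Returns 1 on success, 0 when the whole group is full. */
	int toy_alloc(struct toy_group *g, int want_indirect)
	{
		long *pref = want_indirect ? &g->meta_free : &g->data_free;
		long *alt  = want_indirect ? &g->data_free : &g->meta_free;

		if (*pref > 0) {
			--*pref;
			return 1;
		}
		if (*alt > 0) {		/* preferred region exhausted: borrow */
			--*alt;
			return 1;
		}
		return 0;		/* both empty: move to the next group */
	}

	int main(void)
	{
		struct toy_group g = { .data_free = 3, .meta_free = 1 };
		int a, b;

		/* The second indirect allocation borrows from the data
		 * region once the one-block metacluster is used up. */
		a = toy_alloc(&g, 1);
		b = toy_alloc(&g, 1);
		printf("%d %d -> data_free=%ld meta_free=%ld\n",
		       a, b, g.data_free, g.meta_free);
		return 0;
	}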

Thanks,
Abhishek

On Jan 11, 2008 9:12 AM, Bodo Eggert <7eggert@gmx.de> wrote:
> Abhishek Rai <abhishekrai@google.com> wrote:
>
> > Putting the metacluster at the end of the block group gives slightly
> > inferior sequential read throughput compared to putting it at the
> > beginning or in the middle, but the difference is very tiny and exists
> > only for large files that span multiple block groups.
>
> Just an idea:
>
> What about putting it at the end of the previous block group (except for
> the first group, of course) and starting to read the block group a little
> earlier (readahead/~before)? I imagine it might be about as good as placing
> it at the beginning while avoiding the fragmentation.
>
>

* Re: [PATCH] Clustering indirect blocks in Ext3
       [not found]           ` <9KcYS-46E-27@gated-at.bofh.it>
@ 2008-01-11 14:12             ` Bodo Eggert
  2008-01-11 14:49               ` Abhishek Rai
  0 siblings, 1 reply; 21+ messages in thread
From: Bodo Eggert @ 2008-01-11 14:12 UTC (permalink / raw)
  To: Abhishek Rai, ak, ebiederm, rdreier, gregkh, airlied, davej,
	mingo, tglx, akpm, arjan, ?missing.closing.'

Abhishek Rai <abhishekrai@google.com> wrote:

> Putting the metacluster at the end of the block group gives slightly
> inferior sequential read throughput compared to putting it at the
> beginning or in the middle, but the difference is very tiny and exists
> only for large files that span multiple block groups.

Just an idea:

What about putting it at the end of the previous block group (except for
the first group, of course) and starting to read the block group a little
earlier (readahead/~before)? I imagine it might be about as good as placing
it at the beginning while avoiding the fragmentation.


Thread overview: 21+ messages
2007-11-16  5:02 [PATCH] Clustering indirect blocks in Ext3 Abhishek Rai
2007-11-16  7:02 ` Andrew Morton
2007-11-16  7:37   ` Matt Mackall
2007-11-18 15:52     ` Abhishek Rai
2007-11-18 20:47       ` Matt Mackall
2007-11-19 10:34         ` Kyungmin Park
2007-11-20 20:25       ` John Stoffel
2007-11-16 11:28   ` Andreas Dilger
2007-11-16 21:11   ` Theodore Tso
2007-11-17  0:25     ` Abhishek Rai
2007-11-17  2:58       ` Theodore Tso
2007-11-17  8:58         ` Abhishek Rai
2007-12-21 14:15         ` Abhishek Rai
2008-01-10 21:17           ` Abhishek Rai
2008-01-11 17:05             ` Daniel Phillips
2008-01-12  0:04               ` Andrew Morton
2008-01-12  6:05                 ` Daniel Phillips
2008-01-13  5:06                   ` Abhishek Rai
2007-11-16 22:27   ` Abhishek Rai
2008-01-11 14:12 ` Bodo Eggert
2008-01-11 14:49   ` Abhishek Rai
