All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5
@ 2014-08-09  4:26 Darrick J. Wong
  2014-08-09  4:26 ` [PATCH 1/6] libext2fs: create inlinedata symlinks Darrick J. Wong
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-09  4:26 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4

Hi all,

This is part 5 of the Summer 2014 e2fsprogs patchset.  Whereas all the
patches I've sent previously were to fix problems in the library and
e2fsck uncovered by the e2fuzz metadata fuzzer, we're now to the part
of the patch set that includes new features.  I'm only posting things
that have been discussed recently; the other "new" features (that have
been out for review for quite some time) haven't changed since May.
They'll be in part 6, which I'm holding onto until Ted has a chance to
look at the inline data pile of stuff.

The first new feature was suggested by Pu Hou -- the ability to create
inline data symlinks.  That's patch 1, and it requires the patches in
part 4 + the v2 patches I sent out today.

Patch 2 fixes a number of gcc warnings.

Patch 3-6 implement v2 of the e2fsck readahead functionality, which
promises to reduce fsck runtime by 10-30%.  You might want to read
the report: http://marc.info/?l=linux-ext4&m=140755433701165&w=2
("e2fsck readahead speedup performance report") for all the juicy
details!

I've tested these e2fsprogs changes against the -next branch as of
8/4.  The patches follow the end of "Summer 2014 patchbomb part 4",
and have been tested against the 'make check' suite.

Comments and questions are, as always, welcome.

--D

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] libext2fs: create inlinedata symlinks
  2014-08-09  4:26 [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Darrick J. Wong
@ 2014-08-09  4:26 ` Darrick J. Wong
  2014-08-24 16:15   ` Theodore Ts'o
  2014-08-09  4:26 ` [PATCH 2/6] misc: fix gcc warnings Darrick J. Wong
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-09  4:26 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4, Pu Hou

Add to ext2fs_symlink the ability to create inline data symlinks.

Suggested-by: Pu Hou <houpu.hp@alibaba-inc.com>
Cc: Pu Hou <houpu.hp@alibaba-inc.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 lib/ext2fs/symlink.c           |   34 ++++++++++++++++++++---
 tests/f_create_symlinks/expect |   36 ++++++++++++++++++++++++
 tests/f_create_symlinks/name   |    1 +
 tests/f_create_symlinks/script |   59 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 125 insertions(+), 5 deletions(-)
 create mode 100644 tests/f_create_symlinks/expect
 create mode 100644 tests/f_create_symlinks/name
 create mode 100644 tests/f_create_symlinks/script


diff --git a/lib/ext2fs/symlink.c b/lib/ext2fs/symlink.c
index ba8ed8e..f6eb6b6 100644
--- a/lib/ext2fs/symlink.c
+++ b/lib/ext2fs/symlink.c
@@ -35,7 +35,7 @@ errcode_t ext2fs_symlink(ext2_filsys fs, ext2_ino_t parent, ext2_ino_t ino,
 	struct ext2_inode	inode;
 	ext2_ino_t		scratch_ino;
 	blk64_t			blk;
-	int			fastlink;
+	int			fastlink, inlinelink;
 	unsigned int		target_len;
 	char			*block_buf = 0;
 
@@ -77,15 +77,36 @@ errcode_t ext2fs_symlink(ext2_filsys fs, ext2_ino_t parent, ext2_ino_t ino,
 	memset(&inode, 0, sizeof(struct ext2_inode));
 	inode.i_mode = LINUX_S_IFLNK | 0777;
 	inode.i_uid = inode.i_gid = 0;
-	ext2fs_iblk_set(fs, &inode, fastlink ? 0 : 1);
 	inode.i_links_count = 1;
 	ext2fs_inode_size_set(fs, &inode, target_len);
 	/* The time fields are set by ext2fs_write_new_inode() */
 
+	inlinelink = !fastlink &&
+		     (fs->super->s_feature_incompat &
+					EXT4_FEATURE_INCOMPAT_INLINE_DATA) &&
+		     (target_len < fs->blocksize);
 	if (fastlink) {
 		/* Fast symlinks, target stored in inode */
 		strcpy((char *)&inode.i_block, target);
+	} else if (inlinelink) {
+		/* Try inserting an inline data symlink */
+		inode.i_flags |= EXT4_INLINE_DATA_FL;
+		retval = ext2fs_write_new_inode(fs, ino, &inode);
+		if (retval)
+			goto cleanup;
+		retval = ext2fs_inline_data_set(fs, ino, &inode, target,
+						target_len);
+		if (retval) {
+			inode.i_flags &= ~EXT4_INLINE_DATA_FL;
+			inlinelink = 0;
+			goto need_block;
+		}
+		retval = ext2fs_read_inode(fs, ino, &inode);
+		if (retval)
+			goto cleanup;
 	} else {
+need_block:
+		ext2fs_iblk_set(fs, &inode, 1);
 		/* Slow symlinks, target stored in the first block */
 		memset(block_buf, 0, fs->blocksize);
 		strcpy(block_buf, target);
@@ -104,11 +125,14 @@ errcode_t ext2fs_symlink(ext2_filsys fs, ext2_ino_t parent, ext2_ino_t ino,
 	 * number is assigned by write_new_inode, which means that the
 	 * operations using ino must come after it.
 	 */
-	retval = ext2fs_write_new_inode(fs, ino, &inode);
+	if (inlinelink)
+		retval = ext2fs_write_inode(fs, ino, &inode);
+	else
+		retval = ext2fs_write_new_inode(fs, ino, &inode);
 	if (retval)
 		goto cleanup;
 
-	if (!fastlink) {
+	if (!fastlink && !inlinelink) {
 		retval = ext2fs_bmap2(fs, ino, &inode, NULL, BMAP_SET, 0, NULL,
 				      &blk);
 		if (retval)
@@ -139,7 +163,7 @@ errcode_t ext2fs_symlink(ext2_filsys fs, ext2_ino_t parent, ext2_ino_t ino,
 	/*
 	 * Update accounting....
 	 */
-	if (!fastlink)
+	if (!fastlink && !inlinelink)
 		ext2fs_block_alloc_stats2(fs, blk, +1);
 	ext2fs_inode_alloc_stats2(fs, ino, +1, 0);
 
diff --git a/tests/f_create_symlinks/expect b/tests/f_create_symlinks/expect
new file mode 100644
index 0000000..847e092
--- /dev/null
+++ b/tests/f_create_symlinks/expect
@@ -0,0 +1,36 @@
+mke2fs -q -F -o Linux -b 1024 -g 256 -O inline_data,extents -I 256 test.img 1024
+Pass 1: Checking inodes, blocks, and sizes
+Pass 2: Checking directory structure
+Pass 3: Checking directory connectivity
+Pass 4: Checking reference counts
+Pass 5: Checking group summary information
+test_filesys: 11/128 files (0.0% non-contiguous), 441/1024 blocks
+Exit status is 0
+stat /l_30
+Inode: 12   Type: symlink    Mode:  0777   Flags: 0x0
+User:     0   Group:     0   Size: 31
+Links: 1   Blockcount: 0
+Fragment:  Address: 0    Number: 0    Size: 0
+Fast link dest: "/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+stat /l_70
+Inode: 13   Type: symlink    Mode:  0777   Flags: 0x10000000
+User:     0   Group:     0   Size: 71
+Links: 1   Blockcount: 0
+Fragment:  Address: 0    Number: 0    Size: 0
+Extended attributes:
+  system.data (11)
+Fast link dest: "/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+stat /l_500
+Inode: 14   Type: symlink    Mode:  0777   Flags: 0x80000
+User:     0   Group:     0   Size: 501
+Links: 1   Blockcount: 2
+Fragment:  Address: 0    Number: 0    Size: 0
+stat /l_1500
+/l_1500: File not found by ext2_lookup 
+Pass 1: Checking inodes, blocks, and sizes
+Pass 2: Checking directory structure
+Pass 3: Checking directory connectivity
+Pass 4: Checking reference counts
+Pass 5: Checking group summary information
+test_filesys: 14/128 files (0.0% non-contiguous), 442/1024 blocks
+Exit status is 0
diff --git a/tests/f_create_symlinks/name b/tests/f_create_symlinks/name
new file mode 100644
index 0000000..0930e79
--- /dev/null
+++ b/tests/f_create_symlinks/name
@@ -0,0 +1 @@
+create fast, inlinedata, and regular symlinks
diff --git a/tests/f_create_symlinks/script b/tests/f_create_symlinks/script
new file mode 100644
index 0000000..c49825a
--- /dev/null
+++ b/tests/f_create_symlinks/script
@@ -0,0 +1,59 @@
+if test -x $DEBUGFS_EXE; then
+
+FSCK_OPT=-yf
+OUT=$test_name.log
+if [ -f $test_dir/expect.gz ]; then
+	EXP=$test_name.tmp
+	gunzip < $test_dir/expect.gz > $EXP1
+else
+	EXP=$test_dir/expect
+fi
+
+cp /dev/null $OUT
+
+dd if=/dev/zero of=$TMPFILE bs=1k count=512 > /dev/null 2>&1
+
+echo mke2fs -q -F -o Linux -b 1024 -g 256 -O inline_data,extents -I 256 test.img 1024 >> $OUT
+$MKE2FS -q -F -o Linux -b 1024 -g 256 -O inline_data,extents -I 256 $TMPFILE 1024 2>&1 |
+	sed -f $cmd_dir/filter.sed -e "s;$TMPFILE;test.img;" >> $OUT
+
+$FSCK $FSCK_OPT  -N test_filesys $TMPFILE > $OUT.new 2>&1
+status=$?
+echo Exit status is $status >> $OUT.new
+sed -f $cmd_dir/filter.sed -e "s;$TMPFILE;test.img;" $OUT.new >> $OUT
+rm -f $OUT.new
+
+for i in 30 70 500 1500; do
+	echo "symlink /l_$i /$(perl -e "print 'x' x $i;")"
+done | $DEBUGFS -w $TMPFILE >/dev/null 2>&1
+
+for i in 30 70 500 1500; do
+	echo "stat /l_$i" >> $OUT
+	$DEBUGFS -R "stat /l_$i" $TMPFILE 2>&1 | egrep '(File not found|^Inode|^Fast link dest|Blockcount:|^Extended attributes:|system.data|Size:)' >> $OUT
+done
+
+$FSCK $FSCK_OPT  -N test_filesys $TMPFILE > $OUT.new 2>&1
+status=$?
+echo Exit status is $status >> $OUT.new
+sed -f $cmd_dir/filter.sed -e "s;$TMPFILE;test.img;" $OUT.new >> $OUT
+rm -f $OUT.new
+
+rm -f $TMPFILE
+
+cmp -s $OUT $EXP
+status=$?
+
+if [ "$status" = 0 ] ; then
+	echo "$test_name: $test_description: ok"
+	touch $test_name.ok
+else
+	echo "$test_name: $test_description: failed"
+	diff $DIFF_OPTS $EXP $OUT > $test_name.failed
+	rm -f $test_name.tmp
+fi
+
+unset IMAGE FSCK_OPT OUT EXP
+
+else #if test -a -x $DEBUGFS_EXE; then
+	echo "$test_name: $test_description: skipped"
+fi


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/6] misc: fix gcc warnings
  2014-08-09  4:26 [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Darrick J. Wong
  2014-08-09  4:26 ` [PATCH 1/6] libext2fs: create inlinedata symlinks Darrick J. Wong
@ 2014-08-09  4:26 ` Darrick J. Wong
  2014-08-24 16:24   ` Theodore Ts'o
  2014-08-09  4:26 ` [PATCH 3/6] mke2fs: set block_validity as a default mount option Darrick J. Wong
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-09  4:26 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 debugfs/debugfs.c     |   11 +++--------
 debugfs/xattrs.c      |    3 ---
 debugfs/zap.c         |    1 -
 e2fsck/pass1.c        |    1 -
 e2fsck/pass2.c        |    1 -
 e2fsck/pass3.c        |    2 +-
 e2fsck/pass5.c        |    4 ----
 lib/ext2fs/ext_attr.c |   13 ++-----------
 lib/ext2fs/newdir.c   |    2 --
 misc/create_inode.c   |    5 +----
 misc/e2fuzz.c         |    3 ++-
 misc/tune2fs.c        |    4 +---
 12 files changed, 10 insertions(+), 40 deletions(-)


diff --git a/debugfs/debugfs.c b/debugfs/debugfs.c
index 6366465..0d8e9e8 100644
--- a/debugfs/debugfs.c
+++ b/debugfs/debugfs.c
@@ -498,11 +498,6 @@ static void internal_dump_inode_extra(FILE *out,
 				      ext2_ino_t inode_num EXT2FS_ATTR((unused)),
 				      struct ext2_inode_large *inode)
 {
-	struct ext2_ext_attr_entry *entry;
-	__u32 *magic;
-	char *start, *end;
-	unsigned int storage_size;
-
 	fprintf(out, "Size of extra inode fields: %u\n", inode->i_extra_isize);
 	if (inode->i_extra_isize > EXT2_INODE_SIZE(current_fs->super) -
 			EXT2_GOOD_OLD_INODE_SIZE) {
@@ -1595,9 +1590,9 @@ void do_write(int argc, char *argv[])
 
 void do_mknod(int argc, char *argv[])
 {
-	unsigned long	mode, major, minor;
+	unsigned long	major, minor;
 	errcode_t 	retval;
-	int		filetype, nr;
+	int		nr;
 	struct stat	st;
 
 	if (check_fs_open(argv[0]))
@@ -1608,7 +1603,7 @@ void do_mknod(int argc, char *argv[])
 		return;
 	}
 
-	mode = minor = major = 0;
+	minor = major = 0;
 	switch (argv[2][0]) {
 		case 'p':
 			st.st_mode = S_IFIFO;
diff --git a/debugfs/xattrs.c b/debugfs/xattrs.c
index 1c19b2f..a80fd0b 100644
--- a/debugfs/xattrs.c
+++ b/debugfs/xattrs.c
@@ -288,9 +288,6 @@ void do_rm_xattr(int argc, char **argv)
 		goto out;
 
 	for (i = 2; i < argc; i++) {
-		size_t buflen;
-		char *buf;
-
 		err = ext2fs_xattr_remove(h, argv[i]);
 		if (err)
 			goto out;
diff --git a/debugfs/zap.c b/debugfs/zap.c
index 917bddf..0a1ae9b 100644
--- a/debugfs/zap.c
+++ b/debugfs/zap.c
@@ -174,7 +174,6 @@ void do_block_dump(int argc, char *argv[])
 	errcode_t	errcode;
 	blk64_t		block;
 	char		*file = NULL;
-	unsigned int	i, j;
 	int		c, err;
 
 	if (check_fs_open(argv[0]))
diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
index a473bd7..4fc5311 100644
--- a/e2fsck/pass1.c
+++ b/e2fsck/pass1.c
@@ -2383,7 +2383,6 @@ static void scan_extent_node(e2fsck_t ctx, struct problem_context *pctx,
 		if (try_repairs && problem) {
 report_problem:
 			if (fix_problem(ctx, problem, pctx)) {
-fix_problem_now:
 				if (ctx->invalid_bitmaps) {
 					/*
 					 * If fsck knows the bitmaps are bad,
diff --git a/e2fsck/pass2.c b/e2fsck/pass2.c
index fb98af5..23310f1 100644
--- a/e2fsck/pass2.c
+++ b/e2fsck/pass2.c
@@ -746,7 +746,6 @@ static errcode_t insert_dirent_tail(ext2_filsys fs, void *dirbuf)
 	struct ext2_dir_entry *d;
 	void *top;
 	struct ext2_dir_entry_tail *t;
-	unsigned int rec_len;
 
 	d = dirbuf;
 	top = EXT2_DIRENT_TAIL(dirbuf, fs->blocksize);
diff --git a/e2fsck/pass3.c b/e2fsck/pass3.c
index 63b1d70..9860cdf 100644
--- a/e2fsck/pass3.c
+++ b/e2fsck/pass3.c
@@ -831,7 +831,7 @@ errcode_t e2fsck_expand_directory(e2fsck_t ctx, ext2_ino_t dir,
 	errcode_t	retval;
 	struct expand_dir_struct es;
 	struct ext2_inode	inode;
-	blk64_t		sz, before, after;
+	blk64_t		sz;
 
 	if (!(fs->flags & EXT2_FLAG_RW))
 		return EXT2_ET_RO_FILSYS;
diff --git a/e2fsck/pass5.c b/e2fsck/pass5.c
index bf3f733..86ac9fd 100644
--- a/e2fsck/pass5.c
+++ b/e2fsck/pass5.c
@@ -327,11 +327,7 @@ static void check_block_bitmaps(e2fsck_t ctx)
 	problem_t	problem, save_problem;
 	int		fixit, had_problem;
 	errcode_t	retval;
-	int	old_desc_blocks = 0;
-	int	count = 0;
-	int	cmp_block = 0;
 	int	redo_flag = 0;
-	blk64_t	super_blk, old_desc_blk, new_desc_blk;
 	char *actual_buf, *bitmap_buf;
 
 	actual_buf = (char *) e2fsck_allocate_memory(ctx, fs->blocksize,
diff --git a/lib/ext2fs/ext_attr.c b/lib/ext2fs/ext_attr.c
index fc191f5..f9b9208 100644
--- a/lib/ext2fs/ext_attr.c
+++ b/lib/ext2fs/ext_attr.c
@@ -302,8 +302,7 @@ errcode_t ext2fs_free_ext_attr(ext2_filsys fs, ext2_ino_t ino,
 {
 	struct ext2_ext_attr_header *header;
 	void *block_buf = NULL;
-	dgrp_t grp;
-	blk64_t blk, goal;
+	blk64_t blk;
 	errcode_t err;
 	struct ext2_inode_large i;
 
@@ -621,7 +620,6 @@ static errcode_t read_xattrs_from_buffer(struct ext2_xattr_handle *handle,
 	struct ext2_xattr *x;
 	struct ext2_ext_attr_entry *entry;
 	const char *prefix;
-	void *ptr;
 	unsigned int remain, prefix_len;
 	errcode_t err;
 	unsigned int values_size = storage_size +
@@ -719,7 +717,6 @@ static void xattrs_free_keys(struct ext2_xattr_handle *h)
 
 errcode_t ext2fs_xattrs_read(struct ext2_xattr_handle *handle)
 {
-	struct ext2_xattr *attrs = NULL, *x;
 	struct ext2_inode_large *inode;
 	struct ext2_ext_attr_header *header;
 	__u32 ea_inode_magic;
@@ -822,7 +819,6 @@ errcode_t ext2fs_xattrs_iterate(struct ext2_xattr_handle *h,
 				void *data)
 {
 	struct ext2_xattr *x;
-	errcode_t err;
 	int ret;
 
 	EXT2_CHECK_MAGIC(h, EXT2_ET_MAGIC_EA_HANDLE);
@@ -869,11 +865,10 @@ errcode_t ext2fs_xattr_get(struct ext2_xattr_handle *h, const char *key,
 errcode_t ext2fs_xattr_inode_max_size(ext2_filsys fs, ext2_ino_t ino,
 				      size_t *size)
 {
-	struct ext2_ext_attr_header *header;
 	struct ext2_ext_attr_entry *entry;
 	struct ext2_inode_large *inode;
 	__u32 ea_inode_magic;
-	unsigned int storage_size, freesize, minoff;
+	unsigned int minoff;
 	void *start;
 	int i;
 	errcode_t err;
@@ -903,9 +898,6 @@ errcode_t ext2fs_xattr_inode_max_size(ext2_filsys fs, ext2_ino_t ino,
 	       inode->i_extra_isize, sizeof(__u32));
 	if (ea_inode_magic == EXT2_EXT_ATTR_MAGIC) {
 		/* has xattrs.  calculate the size */
-		storage_size = EXT2_INODE_SIZE(fs->super) -
-			EXT2_GOOD_OLD_INODE_SIZE - inode->i_extra_isize -
-			sizeof(__u32);
 		start= ((char *) inode) + EXT2_GOOD_OLD_INODE_SIZE +
 			inode->i_extra_isize + sizeof(__u32);
 		entry = start;
@@ -1003,7 +995,6 @@ errcode_t ext2fs_xattr_remove(struct ext2_xattr_handle *handle,
 			      const char *key)
 {
 	struct ext2_xattr *x;
-	errcode_t err;
 
 	EXT2_CHECK_MAGIC(handle, EXT2_ET_MAGIC_EA_HANDLE);
 	for (x = handle->attrs; x < handle->attrs + handle->length; x++) {
diff --git a/lib/ext2fs/newdir.c b/lib/ext2fs/newdir.c
index 506d609..f9bf8fc 100644
--- a/lib/ext2fs/newdir.c
+++ b/lib/ext2fs/newdir.c
@@ -102,9 +102,7 @@ errcode_t ext2fs_new_dir_inline_data(ext2_filsys fs, ext2_ino_t dir_ino,
 {
 	struct ext2_dir_entry 	*dir = NULL;
 	errcode_t		retval;
-	char			*buf;
 	int			rec_len;
-	int			filetype = 0;
 
 	EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
 
diff --git a/misc/create_inode.c b/misc/create_inode.c
index 4d56719..7f57979 100644
--- a/misc/create_inode.c
+++ b/misc/create_inode.c
@@ -288,8 +288,6 @@ errcode_t do_symlink_internal(ext2_filsys fs, ext2_ino_t cwd, const char *name,
 	char			*cp;
 	ext2_ino_t		parent_ino;
 	errcode_t		retval;
-	struct ext2_inode	inode;
-	struct stat		st;
 
 	cp = strrchr(name, '/');
 	if (cp) {
@@ -324,9 +322,8 @@ errcode_t do_mkdir_internal(ext2_filsys fs, ext2_ino_t cwd, const char *name,
 			    struct stat *st, ext2_ino_t root)
 {
 	char			*cp;
-	ext2_ino_t		parent_ino, ino;
+	ext2_ino_t		parent_ino;
 	errcode_t		retval;
-	struct ext2_inode	inode;
 
 
 	cp = strrchr(name, '/');
diff --git a/misc/e2fuzz.c b/misc/e2fuzz.c
index 91aafe5..741dd67 100644
--- a/misc/e2fuzz.c
+++ b/misc/e2fuzz.c
@@ -52,7 +52,8 @@ int getseed(void)
 		perror("open");
 		exit(0);
 	}
-	read(fd, &r, sizeof(r));
+	if (read(fd, &r, sizeof(r)) != sizeof(r))
+		printf("Unable to read random seed!\n");
 	close(fd);
 	return r;
 }
diff --git a/misc/tune2fs.c b/misc/tune2fs.c
index 75a2735..ee47c04 100644
--- a/misc/tune2fs.c
+++ b/misc/tune2fs.c
@@ -186,7 +186,6 @@ static __u32 clear_ok_features[3] = {
 static int get_journal_sb(ext2_filsys jfs, char buf[SUPERBLOCK_SIZE])
 {
 	int retval;
-	int start;
 	journal_superblock_t *jsb;
 
 	if (!(jfs->super->s_feature_incompat &
@@ -805,7 +804,7 @@ static void enable_uninit_bg(ext2_filsys fs)
 static errcode_t zero_empty_inodes(ext2_filsys fs)
 {
 	int length = EXT2_INODE_SIZE(fs->super);
-	struct ext2_inode *inode;
+	struct ext2_inode *inode = NULL;
 	ext2_inode_scan	scan;
 	errcode_t	retval;
 	ext2_ino_t	ino;
@@ -845,7 +844,6 @@ static void disable_uninit_bg(ext2_filsys fs, __u32 csum_feature_flag)
 	dgrp_t i;
 	errcode_t retval;
 	blk64_t b, c, d;
-	int has_super;
 
 	/* Load bitmaps to ensure that the uninit ones get written out */
 	fs->super->s_feature_ro_compat |= csum_feature_flag;


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/6] mke2fs: set block_validity as a default mount option
  2014-08-09  4:26 [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Darrick J. Wong
  2014-08-09  4:26 ` [PATCH 1/6] libext2fs: create inlinedata symlinks Darrick J. Wong
  2014-08-09  4:26 ` [PATCH 2/6] misc: fix gcc warnings Darrick J. Wong
@ 2014-08-09  4:26 ` Darrick J. Wong
  2014-08-24 22:47   ` Theodore Ts'o
  2014-08-09  4:26 ` [PATCH 4/6] ext2fs: add readahead method to improve scanning Darrick J. Wong
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-09  4:26 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4

The block_validity mount option spot-checks block allocations against
a bitmap of known group metadata blocks.  This helps us to prevent
self-inflicted catastrophic failures such as trying to "share"
critical metadata (think bitmaps) with file data, which usually
results in filesystem destruction.

In order to test the overhead of the mount option, I re-used the speed
tests in the metadata checksum testing script.  In short, the program
creates what looks like 15 copies of a kernel source tree, except that
it uses fallocate to strip out the overhead of writing the file data
so that we can focus on metadata overhead.  On a 64G RAM disk, the
overhead was generally about 0.9% and at most 1.6%.  On a 160G USB
disk, the overhead was about 0.8% and peaked at 1.2%.

When I changed the test to write out files instead of merely
fallocating space, the overhead was negligible.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 misc/mke2fs.conf.in |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/misc/mke2fs.conf.in b/misc/mke2fs.conf.in
index 4c5dba7..de0250d 100644
--- a/misc/mke2fs.conf.in
+++ b/misc/mke2fs.conf.in
@@ -1,6 +1,6 @@
 [defaults]
 	base_features = sparse_super,filetype,resize_inode,dir_index,ext_attr
-	default_mntopts = acl,user_xattr
+	default_mntopts = acl,user_xattr,block_validity
 	enable_periodic_fsck = 0
 	blocksize = 4096
 	inode_size = 256


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/6] ext2fs: add readahead method to improve scanning
  2014-08-09  4:26 [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2014-08-09  4:26 ` [PATCH 3/6] mke2fs: set block_validity as a default mount option Darrick J. Wong
@ 2014-08-09  4:26 ` Darrick J. Wong
  2014-08-09  4:26 ` [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata Darrick J. Wong
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-09  4:26 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4, Andreas Dilger

Frøm: Andreas Dilger <adilger@whamcloud.com>

Add a readahead method for prefetching ranges of disk blocks.  This is
useful for inode table scanning, and other large contiguous ranges of
blocks, and may also prove useful for random block prefetch, since it
will allow reordering of the IO without waiting synchronously for the
reads to complete.

It is currently using the posix_fadvise(POSIX_FADV_WILLNEED)
interface, as this proved most efficient during our testing.

[darrick.wong@oracle.com]
Make the arguments to the readahead function take the same ULL values
as the other IO functions, and return an appropriate error code when
fadvise isn't available.

Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 lib/ext2fs/ext2_io.h    |    8 +++++++-
 lib/ext2fs/io_manager.c |    9 +++++++++
 lib/ext2fs/unix_io.c    |   27 ++++++++++++++++++++++++---
 3 files changed, 40 insertions(+), 4 deletions(-)


diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index 1894fb8..4c5a5c5 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -90,7 +90,10 @@ struct struct_io_manager {
 					int count, const void *data);
 	errcode_t (*discard)(io_channel channel, unsigned long long block,
 			     unsigned long long count);
-	long	reserved[16];
+	errcode_t (*cache_readahead)(io_channel channel,
+				     unsigned long long block,
+				     unsigned long long count);
+	long	reserved[15];
 };
 
 #define IO_FLAG_RW		0x0001
@@ -124,6 +127,9 @@ extern errcode_t io_channel_discard(io_channel channel,
 				    unsigned long long count);
 extern errcode_t io_channel_alloc_buf(io_channel channel,
 				      int count, void *ptr);
+extern errcode_t io_channel_cache_readahead(io_channel io,
+					    unsigned long long block,
+					    unsigned long long count);
 
 /* unix_io.c */
 extern io_manager unix_io_manager;
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index 34e4859..dc5888d 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -128,3 +128,12 @@ errcode_t io_channel_alloc_buf(io_channel io, int count, void *ptr)
 	else
 		return ext2fs_get_mem(size, ptr);
 }
+
+errcode_t io_channel_cache_readahead(io_channel io, unsigned long long block,
+				     unsigned long long count)
+{
+	if (!io->manager->cache_readahead)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io->manager->cache_readahead(io, block, count);
+}
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 0dd1490..5c797e8 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -15,6 +15,9 @@
  * %End-Header%
  */
 
+#define _XOPEN_SOURCE 600
+#define _DARWIN_C_SOURCE
+#define _FILE_OFFSET_BITS 64
 #define _LARGEFILE_SOURCE
 #define _LARGEFILE64_SOURCE
 #ifndef _GNU_SOURCE
@@ -35,6 +38,9 @@
 #ifdef __linux__
 #include <sys/utsname.h>
 #endif
+#if HAVE_SYS_TYPES_H
+#include <sys/types.h>
+#endif
 #ifdef HAVE_SYS_IOCTL_H
 #include <sys/ioctl.h>
 #endif
@@ -44,9 +50,6 @@
 #if HAVE_SYS_STAT_H
 #include <sys/stat.h>
 #endif
-#if HAVE_SYS_TYPES_H
-#include <sys/types.h>
-#endif
 #if HAVE_SYS_RESOURCE_H
 #include <sys/resource.h>
 #endif
@@ -810,6 +813,23 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 #endif /* NO_IO_CACHE */
 }
 
+static errcode_t unix_cache_readahead(io_channel channel,
+				      unsigned long long block,
+				      unsigned long long count)
+{
+#ifdef POSIX_FADV_WILLNEED
+	struct unix_private_data *data;
+
+	data = (struct unix_private_data *)channel->private_data;
+	return posix_fadvise(data->dev,
+			     (ext2_loff_t)block * channel->block_size,
+			     (ext2_loff_t)count * channel->block_size,
+			     POSIX_FADV_WILLNEED);
+#else
+	return EXT2_ET_OP_NOT_SUPPORTED;
+#endif
+}
+
 static errcode_t unix_write_blk(io_channel channel, unsigned long block,
 				int count, const void *buf)
 {
@@ -961,6 +981,7 @@ static struct struct_io_manager struct_unix_manager = {
 	.read_blk64	= unix_read_blk64,
 	.write_blk64	= unix_write_blk64,
 	.discard	= unix_discard,
+	.cache_readahead	= unix_cache_readahead,
 };
 
 io_manager unix_io_manager = &struct_unix_manager;

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-09  4:26 [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Darrick J. Wong
                   ` (3 preceding siblings ...)
  2014-08-09  4:26 ` [PATCH 4/6] ext2fs: add readahead method to improve scanning Darrick J. Wong
@ 2014-08-09  4:26 ` Darrick J. Wong
  2014-08-11  5:21   ` Darrick J. Wong
  2014-08-09  4:26 ` [PATCH 6/6] e2fsck: read-ahead metadata during passes 1, 2, and 4 Darrick J. Wong
  2014-08-09  5:53 ` [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Theodore Ts'o
  6 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-09  4:26 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4

This patch adds to e2fsck the ability to pre-fetch metadata into the
page cache in the hopes of speeding up fsck runs.  There are two new
functions -- the first allows a caller to readahead a list of blocks,
and the second is a helper function that uses that first mechanism to
load group data (bitmaps, inode tables).

These new e2fsck routines require the addition of a dblist API to
allow us to iterate a subset of a dblist.  This will enable
incremental directory block readahead in e2fsck pass 2.

There's also a function to estimate the readahead given a FS.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 configure           |    2 
 configure.in        |    1 
 e2fsck/Makefile.in  |    8 +-
 e2fsck/e2fsck.h     |   18 ++++
 e2fsck/readahead.c  |  213 +++++++++++++++++++++++++++++++++++++++++++++++++++
 e2fsck/util.c       |   51 ++++++++++++
 lib/config.h.in     |    3 +
 lib/ext2fs/dblist.c |   21 ++++-
 lib/ext2fs/ext2fs.h |   10 ++
 9 files changed, 319 insertions(+), 8 deletions(-)
 create mode 100644 e2fsck/readahead.c


diff --git a/configure b/configure
index 8ad1408..71778e1 100755
--- a/configure
+++ b/configure
@@ -12404,7 +12404,7 @@ fi
 done
 
 fi
-for ac_header in  	dirent.h 	errno.h 	execinfo.h 	getopt.h 	malloc.h 	mntent.h 	paths.h 	semaphore.h 	setjmp.h 	signal.h 	stdarg.h 	stdint.h 	stdlib.h 	termios.h 	termio.h 	unistd.h 	utime.h 	attr/xattr.h 	linux/falloc.h 	linux/fd.h 	linux/major.h 	linux/loop.h 	net/if_dl.h 	netinet/in.h 	sys/disklabel.h 	sys/disk.h 	sys/file.h 	sys/ioctl.h 	sys/mkdev.h 	sys/mman.h 	sys/mount.h 	sys/prctl.h 	sys/resource.h 	sys/select.h 	sys/socket.h 	sys/sockio.h 	sys/stat.h 	sys/syscall.h 	sys/sysmacros.h 	sys/time.h 	sys/types.h 	sys/un.h 	sys/wait.h
+for ac_header in  	dirent.h 	errno.h 	execinfo.h 	getopt.h 	malloc.h 	mntent.h 	paths.h 	semaphore.h 	setjmp.h 	signal.h 	stdarg.h 	stdint.h 	stdlib.h 	termios.h 	termio.h 	unistd.h 	utime.h 	attr/xattr.h 	linux/falloc.h 	linux/fd.h 	linux/major.h 	linux/loop.h 	net/if_dl.h 	netinet/in.h 	sys/disklabel.h 	sys/disk.h 	sys/file.h 	sys/ioctl.h 	sys/mkdev.h 	sys/mman.h 	sys/mount.h 	sys/prctl.h 	sys/resource.h 	sys/select.h 	sys/socket.h 	sys/sockio.h 	sys/stat.h 	sys/syscall.h 	sys/sysctl.h 	sys/sysmacros.h 	sys/time.h 	sys/types.h 	sys/un.h 	sys/wait.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
diff --git a/configure.in b/configure.in
index 3c0d64f..e881204 100644
--- a/configure.in
+++ b/configure.in
@@ -941,6 +941,7 @@ AC_CHECK_HEADERS(m4_flatten([
 	sys/sockio.h
 	sys/stat.h
 	sys/syscall.h
+	sys/sysctl.h
 	sys/sysmacros.h
 	sys/time.h
 	sys/types.h
diff --git a/e2fsck/Makefile.in b/e2fsck/Makefile.in
index c40b188..3a9d7b5 100644
--- a/e2fsck/Makefile.in
+++ b/e2fsck/Makefile.in
@@ -62,7 +62,7 @@ OBJS= dict.o unix.o e2fsck.o super.o pass1.o pass1b.o pass2.o \
 	pass3.o pass4.o pass5.o journal.o badblocks.o util.o dirinfo.o \
 	dx_dirinfo.o ehandler.o problem.o message.o quota.o recovery.o \
 	region.o revoke.o ea_refcount.o rehash.o profile.o prof_err.o \
-	logfile.o sigcatcher.o $(MTRACE_OBJ)
+	logfile.o sigcatcher.o readahead.o $(MTRACE_OBJ)
 
 PROFILED_OBJS= profiled/dict.o profiled/unix.o profiled/e2fsck.o \
 	profiled/super.o profiled/pass1.o profiled/pass1b.o \
@@ -73,7 +73,7 @@ PROFILED_OBJS= profiled/dict.o profiled/unix.o profiled/e2fsck.o \
 	profiled/recovery.o profiled/region.o profiled/revoke.o \
 	profiled/ea_refcount.o profiled/rehash.o profiled/profile.o \
 	profiled/prof_err.o profiled/logfile.o \
-	profiled/sigcatcher.o
+	profiled/sigcatcher.o profiled/readahead.o
 
 SRCS= $(srcdir)/e2fsck.c \
 	$(srcdir)/dict.c \
@@ -97,6 +97,7 @@ SRCS= $(srcdir)/e2fsck.c \
 	$(srcdir)/message.c \
 	$(srcdir)/ea_refcount.c \
 	$(srcdir)/rehash.c \
+	$(srcdir)/readahead.c \
 	$(srcdir)/region.c \
 	$(srcdir)/profile.c \
 	$(srcdir)/sigcatcher.c \
@@ -527,3 +528,6 @@ quota.o: $(srcdir)/quota.c $(top_builddir)/lib/config.h \
  $(srcdir)/profile.h prof_err.h $(top_srcdir)/lib/quota/quotaio.h \
  $(top_srcdir)/lib/quota/dqblk_v2.h $(top_srcdir)/lib/quota/quotaio_tree.h \
  $(top_srcdir)/lib/../e2fsck/dict.h $(srcdir)/problem.h
+readahead.o: $(srcdir)/readahead.c $(top_builddir)/lib/config.h \
+ $(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/lib/ext2fs/ext2_fs.h \
+ $(top_builddir)/lib/ext2fs/ext2_err.h $(srcdir)/e2fsck.h
diff --git a/e2fsck/e2fsck.h b/e2fsck/e2fsck.h
index 8f16218..ead546e 100644
--- a/e2fsck/e2fsck.h
+++ b/e2fsck/e2fsck.h
@@ -490,6 +490,23 @@ extern ext2_ino_t e2fsck_get_lost_and_found(e2fsck_t ctx, int fix);
 extern errcode_t e2fsck_adjust_inode_count(e2fsck_t ctx, ext2_ino_t ino,
 					   int adj);
 
+/* readahead.c */
+#define E2FSCK_READA_SUPER	(0x01)
+#define E2FSCK_READA_GDT	(0x02)
+#define E2FSCK_READA_BBITMAP	(0x04)
+#define E2FSCK_READA_IBITMAP	(0x08)
+#define E2FSCK_READA_ITABLE	(0x10)
+#define E2FSCK_READA_ALL_FLAGS	(0x1F)
+errcode_t e2fsck_readahead(ext2_filsys fs, int flags, dgrp_t start,
+			   dgrp_t ngroups);
+#define E2FSCK_RA_DBLIST_IGNORE_BLOCKCNT	(0x01)
+#define E2FSCK_RA_DBLIST_ALL_FLAGS		(0x01)
+errcode_t e2fsck_readahead_dblist(ext2_filsys fs, int flags,
+				  ext2_dblist dblist,
+				  unsigned long long start,
+				  unsigned long long count);
+int e2fsck_can_readahead(ext2_filsys fs);
+unsigned long long e2fsck_guess_readahead(ext2_filsys fs);
 
 /* region.c */
 extern region_t region_create(region_addr_t min, region_addr_t max);
@@ -579,6 +596,7 @@ extern errcode_t e2fsck_allocate_subcluster_bitmap(ext2_filsys fs,
 						   int default_type,
 						   const char *profile_name,
 						   ext2fs_block_bitmap *ret);
+unsigned long long get_memory_size(void);
 
 /* unix.c */
 extern void e2fsck_clear_progbar(e2fsck_t ctx);
diff --git a/e2fsck/readahead.c b/e2fsck/readahead.c
new file mode 100644
index 0000000..0395975
--- /dev/null
+++ b/e2fsck/readahead.c
@@ -0,0 +1,213 @@
+/*
+ * readahead.c -- Prefetch filesystem metadata to speed up fsck.
+ *
+ * Copyright (C) 2014 Oracle.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Library
+ * General Public License, version 2.
+ * %End-Header%
+ */
+
+#include "config.h"
+#include <string.h>
+
+#include "e2fsck.h"
+
+#undef DEBUG
+
+#ifdef DEBUG
+# define dbg_printf(f, a...)  do {printf(f, ## a); fflush(stdout); } while (0)
+#else
+# define dbg_printf(f, a...)
+#endif
+
+struct read_dblist {
+	errcode_t err;
+	blk64_t run_start;
+	blk64_t run_len;
+	int flags;
+};
+
+static EXT2_QSORT_TYPE readahead_dir_block_cmp(const void *a, const void *b)
+{
+	const struct ext2_db_entry2 *db_a =
+		(const struct ext2_db_entry2 *) a;
+	const struct ext2_db_entry2 *db_b =
+		(const struct ext2_db_entry2 *) b;
+
+	return (int) (db_a->blk - db_b->blk);
+}
+
+static int readahead_dir_block(ext2_filsys fs, struct ext2_db_entry2 *db,
+			       void *priv_data)
+{
+	struct read_dblist *pr = priv_data;
+	e2_blkcnt_t count = (pr->flags & E2FSCK_RA_DBLIST_IGNORE_BLOCKCNT ?
+			     1 : db->blockcnt);
+
+	if (!pr->run_len || db->blk != pr->run_start + pr->run_len) {
+		if (pr->run_len) {
+			pr->err = io_channel_cache_readahead(fs->io,
+							     pr->run_start,
+							     pr->run_len);
+			dbg_printf("readahead start=%llu len=%llu err=%d\n",
+				   pr->run_start, pr->run_len,
+				   (int)pr->err);
+		}
+		pr->run_start = db->blk;
+		pr->run_len = 0;
+	}
+	pr->run_len += count;
+
+	return pr->err ? DBLIST_ABORT : 0;
+}
+
+errcode_t e2fsck_readahead_dblist(ext2_filsys fs, int flags,
+				  ext2_dblist dblist,
+				  unsigned long long start,
+				  unsigned long long count)
+{
+	errcode_t err;
+	struct read_dblist pr;
+
+	dbg_printf("%s: flags=0x%x\n", __func__, flags);
+	if (flags & ~E2FSCK_RA_DBLIST_ALL_FLAGS)
+		return EXT2_ET_INVALID_ARGUMENT;
+
+	memset(&pr, 0, sizeof(pr));
+	pr.flags = flags;
+	err = ext2fs_dblist_iterate3(dblist, readahead_dir_block, start,
+				     count, &pr);
+	if (pr.err)
+		return pr.err;
+	if (err)
+		return err;
+
+	if (pr.run_len)
+		err = io_channel_cache_readahead(fs->io, pr.run_start,
+						 pr.run_len);
+
+	return err;
+}
+
+errcode_t e2fsck_readahead(ext2_filsys fs, int flags, dgrp_t start,
+			   dgrp_t ngroups)
+{
+	blk64_t		super, old_gdt, new_gdt;
+	blk_t		blocks;
+	dgrp_t		i;
+	ext2_dblist	dblist;
+	dgrp_t		end = start + ngroups;
+	errcode_t	err = 0;
+
+	dbg_printf("%s: flags=0x%x start=%d groups=%d\n", __func__, flags,
+		   start, ngroups);
+	if (flags & ~E2FSCK_READA_ALL_FLAGS)
+		return EXT2_ET_INVALID_ARGUMENT;
+
+	if (end > fs->group_desc_count)
+		end = fs->group_desc_count;
+
+	if (flags == 0)
+		return 0;
+
+	err = ext2fs_init_dblist(fs, &dblist);
+	if (err)
+		return err;
+
+	for (i = start; i < end; i++) {
+		err = ext2fs_super_and_bgd_loc2(fs, i, &super, &old_gdt,
+						&new_gdt, &blocks);
+		if (err)
+			break;
+
+		if (flags & E2FSCK_READA_SUPER) {
+			err = ext2fs_add_dir_block2(dblist, 0, super, 0);
+			if (err)
+				break;
+		}
+
+		if (flags & E2FSCK_READA_GDT) {
+			if (old_gdt)
+				err = ext2fs_add_dir_block2(dblist, 0, old_gdt,
+							    blocks);
+			else if (new_gdt)
+				err = ext2fs_add_dir_block2(dblist, 0, new_gdt,
+							    blocks);
+			else
+				err = 0;
+			if (err)
+				break;
+		}
+
+		if ((flags & E2FSCK_READA_BBITMAP) &&
+		    !ext2fs_bg_flags_test(fs, i, EXT2_BG_BLOCK_UNINIT) &&
+		    ext2fs_bg_free_blocks_count(fs, i) <
+				fs->super->s_blocks_per_group) {
+			super = ext2fs_block_bitmap_loc(fs, i);
+			err = ext2fs_add_dir_block2(dblist, 0, super, 1);
+			if (err)
+				break;
+		}
+
+		if ((flags & E2FSCK_READA_IBITMAP) &&
+		    !ext2fs_bg_flags_test(fs, i, EXT2_BG_INODE_UNINIT) &&
+		    ext2fs_bg_free_inodes_count(fs, i) <
+				fs->super->s_inodes_per_group) {
+			super = ext2fs_inode_bitmap_loc(fs, i);
+			err = ext2fs_add_dir_block2(dblist, 0, super, 1);
+			if (err)
+				break;
+		}
+
+		if ((flags & E2FSCK_READA_ITABLE) &&
+		    ext2fs_bg_free_inodes_count(fs, i) <
+				fs->super->s_inodes_per_group) {
+			super = ext2fs_inode_table_loc(fs, i);
+			blocks = fs->inode_blocks_per_group -
+				 (ext2fs_bg_itable_unused(fs, i) *
+				  EXT2_INODE_SIZE(fs->super) / fs->blocksize);
+			err = ext2fs_add_dir_block2(dblist, 0, super, blocks);
+			if (err)
+				break;
+		}
+	}
+
+	if (!err) {
+		ext2fs_dblist_sort2(dblist, readahead_dir_block_cmp);
+		err = e2fsck_readahead_dblist(fs, 0, dblist, 0,
+					      ext2fs_dblist_count2(dblist));
+	}
+
+	ext2fs_free_dblist(dblist);
+	return err;
+}
+
+int e2fsck_can_readahead(ext2_filsys fs)
+{
+	errcode_t err;
+
+	err = io_channel_cache_readahead(fs->io, 0, 1);
+	dbg_printf("%s: supp=%d\n", __func__, err != EXT2_ET_OP_NOT_SUPPORTED);
+	return err != EXT2_ET_OP_NOT_SUPPORTED;
+}
+
+unsigned long long e2fsck_guess_readahead(ext2_filsys fs)
+{
+	unsigned long long guess;
+
+	/*
+	 * The optimal readahead sizes were experimentally determined by
+	 * djwong in August 2014.  Setting the RA size to one block group's
+	 * worth of inode table blocks seems to yield the largest reductions
+	 * in e2fsck runtime.
+	 */
+	guess = fs->blocksize * fs->inode_blocks_per_group;
+
+	/* Disable RA if it'd use more 1/100th of RAM. */
+	if (get_memory_size() > (guess * 100))
+		return guess / 1024;
+
+	return 0;
+}
diff --git a/e2fsck/util.c b/e2fsck/util.c
index 8237328..74f20062 100644
--- a/e2fsck/util.c
+++ b/e2fsck/util.c
@@ -37,6 +37,10 @@
 #include <errno.h>
 #endif
 
+#ifdef HAVE_SYS_SYSCTL_H
+#include <sys/sysctl.h>
+#endif
+
 #include "e2fsck.h"
 
 extern e2fsck_t e2fsck_global_ctx;   /* Try your very best not to use this! */
@@ -848,3 +852,50 @@ errcode_t e2fsck_allocate_subcluster_bitmap(ext2_filsys fs, const char *descr,
 	fs->default_bitmap_type = save_type;
 	return retval;
 }
+
+/* Return memory size in bytes */
+unsigned long long get_memory_size(void)
+{
+#if defined(_SC_PHYS_PAGES)
+# if defined(_SC_PAGESIZE)
+	return (unsigned long long)sysconf(_SC_PHYS_PAGES) *
+	       (unsigned long long)sysconf(_SC_PAGESIZE);
+# elif defined(_SC_PAGE_SIZE)
+	return (unsigned long long)sysconf(_SC_PHYS_PAGES) *
+	       (unsigned long long)sysconf(_SC_PAGE_SIZE);
+# endif
+#elif defined(CTL_HW)
+# if (defined(HW_MEMSIZE) || defined(HW_PHYSMEM64))
+#  define CTL_HW_INT64
+# elif (defined(HW_PHYSMEM) || defined(HW_REALMEM))
+#  define CTL_HW_UINT
+# endif
+	int mib[2];
+
+	mib[0] = CTL_HW;
+# if defined(HW_MEMSIZE)
+	mib[1] = HW_MEMSIZE;
+# elif defined(HW_PHYSMEM64)
+	mib[1] = HW_PHYSMEM64;
+# elif defined(HW_REALMEM)
+	mib[1] = HW_REALMEM;
+# elif defined(HW_PYSMEM)
+	mib[1] = HW_PHYSMEM;
+# endif
+# if defined(CTL_HW_INT64)
+	unsigned long long size = 0;
+# elif defined(CTL_HW_UINT)
+	unsigned int size = 0;
+# endif
+# if defined(CTL_HW_INT64) || defined(CTL_HW_UINT)
+	size_t len = sizeof(size);
+
+	if (sysctl(mib, 2, &size, &len, NULL, 0) == 0)
+		return (unsigned long long)size;
+# endif
+	return 0;
+#else
+# warning "Don't know how to detect memory on your platform?"
+	return 0;
+#endif
+}
diff --git a/lib/config.h.in b/lib/config.h.in
index 1d2382b..b598d1e 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -500,6 +500,9 @@
 /* Define to 1 if you have the <sys/syscall.h> header file. */
 #undef HAVE_SYS_SYSCALL_H
 
+/* Define to 1 if you have the <sys/sysctl.h> header file. */
+#undef HAVE_SYS_SYSCTL_H
+
 /* Define to 1 if you have the <sys/sysmacros.h> header file. */
 #undef HAVE_SYS_SYSMACROS_H
 
diff --git a/lib/ext2fs/dblist.c b/lib/ext2fs/dblist.c
index 942c4f0..bbdb221 100644
--- a/lib/ext2fs/dblist.c
+++ b/lib/ext2fs/dblist.c
@@ -194,20 +194,25 @@ void ext2fs_dblist_sort2(ext2_dblist dblist,
 /*
  * This function iterates over the directory block list
  */
-errcode_t ext2fs_dblist_iterate2(ext2_dblist dblist,
+errcode_t ext2fs_dblist_iterate3(ext2_dblist dblist,
 				 int (*func)(ext2_filsys fs,
 					     struct ext2_db_entry2 *db_info,
 					     void	*priv_data),
+				 unsigned long long start,
+				 unsigned long long count,
 				 void *priv_data)
 {
-	unsigned long long	i;
+	unsigned long long	i, end;
 	int		ret;
 
 	EXT2_CHECK_MAGIC(dblist, EXT2_ET_MAGIC_DBLIST);
 
+	end = start + count;
 	if (!dblist->sorted)
 		ext2fs_dblist_sort2(dblist, 0);
-	for (i=0; i < dblist->count; i++) {
+	if (end > dblist->count)
+		end = dblist->count;
+	for (i = start; i < end; i++) {
 		ret = (*func)(dblist->fs, &dblist->list[i], priv_data);
 		if (ret & DBLIST_ABORT)
 			return 0;
@@ -215,6 +220,16 @@ errcode_t ext2fs_dblist_iterate2(ext2_dblist dblist,
 	return 0;
 }
 
+errcode_t ext2fs_dblist_iterate2(ext2_dblist dblist,
+				 int (*func)(ext2_filsys fs,
+					     struct ext2_db_entry2 *db_info,
+					     void	*priv_data),
+				 void *priv_data)
+{
+	return ext2fs_dblist_iterate3(dblist, func, 0, dblist->count,
+				      priv_data);
+}
+
 static EXT2_QSORT_TYPE dir_block_cmp2(const void *a, const void *b)
 {
 	const struct ext2_db_entry2 *db_a =
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index b4a9f84..04a1f4a 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -1052,11 +1052,17 @@ extern void ext2fs_dblist_sort2(ext2_dblist dblist,
 extern errcode_t ext2fs_dblist_iterate(ext2_dblist dblist,
 	int (*func)(ext2_filsys fs, struct ext2_db_entry *db_info,
 		    void	*priv_data),
-       void *priv_data);
+	void *priv_data);
 extern errcode_t ext2fs_dblist_iterate2(ext2_dblist dblist,
 	int (*func)(ext2_filsys fs, struct ext2_db_entry2 *db_info,
 		    void	*priv_data),
-       void *priv_data);
+	void *priv_data);
+extern errcode_t ext2fs_dblist_iterate3(ext2_dblist dblist,
+	int (*func)(ext2_filsys fs, struct ext2_db_entry2 *db_info,
+		    void	*priv_data),
+	unsigned long long start,
+	unsigned long long count,
+	void *priv_data);
 extern errcode_t ext2fs_set_dir_block(ext2_dblist dblist, ext2_ino_t ino,
 				      blk_t blk, int blockcnt);
 extern errcode_t ext2fs_set_dir_block2(ext2_dblist dblist, ext2_ino_t ino,


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/6] e2fsck: read-ahead metadata during passes 1, 2, and 4
  2014-08-09  4:26 [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Darrick J. Wong
                   ` (4 preceding siblings ...)
  2014-08-09  4:26 ` [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata Darrick J. Wong
@ 2014-08-09  4:26 ` Darrick J. Wong
  2014-08-09  5:53 ` [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Theodore Ts'o
  6 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-09  4:26 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4

e2fsck pass1 is modified to use the block group data prefetch function
to try to fetch the inode tables into the pagecache before it is
needed.  We iterate through the blockgroups until we have enough inode
tables that need reading such that we can issue readahead; then we sit
and wait until the last inode table block read of the last group to
start fetching the next bunch.

pass2 is modified to use the dirblock prefetching function to prefetch
the list of directory blocks that are assembled in pass1.  We use the
"iterate a subset of a dblist" and avoid copying the dblist.  Directory
blocks are fetched incrementally as we walk through the directory
block list.  In previous iterations of this patch we would free the
directory blocks after processing, but the performance hit to e2fsck
itself wasn't worth it.  Furthermore, it is anticipated that most
users will then mount the FS and start using the directories, so they
may as well remain in the page cache.

pass4 is modified to prefetch the block and inode bitmaps in
anticipation of pass 5, because pass4 is entirely CPU bound.

In general, these mechanisms can decrease fsck time by 10-40%, if the
host system has sufficient memory and the storage system can provide a
lot of IOPs.  Pretty much any storage system capable of handling
multiple IOs in-flight at any time will see a fairly large performance
boost.  (Single-issue USB mass storage disks seem to suffer badly.)

By default, the readahead buffer size will be set to the size of a block
group's inode table (which is 2MiB for a regular ext4 FS).  The -E
readahead_kb= option can be given to specify the amount of memory to
use for readahead or zero to disable it entirely; or an option can be
given in e2fsck.conf.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 e2fsck/e2fsck.8.in      |    7 +++++
 e2fsck/e2fsck.conf.5.in |   15 +++++++++++
 e2fsck/e2fsck.h         |    3 ++
 e2fsck/pass1.c          |   63 +++++++++++++++++++++++++++++++++++++++++++++++
 e2fsck/pass2.c          |   38 ++++++++++++++++++++++++++++
 e2fsck/pass4.c          |   14 ++++++++++
 e2fsck/unix.c           |   28 +++++++++++++++++++++
 7 files changed, 167 insertions(+), 1 deletion(-)


diff --git a/e2fsck/e2fsck.8.in b/e2fsck/e2fsck.8.in
index f5ed758..84ae50f 100644
--- a/e2fsck/e2fsck.8.in
+++ b/e2fsck/e2fsck.8.in
@@ -207,6 +207,13 @@ option may prevent you from further manual data recovery.
 .BI nodiscard
 Do not attempt to discard free blocks and unused inode blocks. This option is
 exactly the opposite of discard option. This is set as default.
+.TP
+.BI readahead_kb
+Use this many KiB of memory to pre-fetch metadata in the hopes of reducing
+e2fsck runtime.  By default, this is set to the size of a block group's inode
+table (typically 2MiB on a regular ext4 filesystem); if this amount is more
+than 1/100 of total physical memory, readahead is disabled.  Set this to zero
+to disable readahead entirely.
 .RE
 .TP
 .B \-f
diff --git a/e2fsck/e2fsck.conf.5.in b/e2fsck/e2fsck.conf.5.in
index 9ebfbbf..e1d0518 100644
--- a/e2fsck/e2fsck.conf.5.in
+++ b/e2fsck/e2fsck.conf.5.in
@@ -205,6 +205,21 @@ of that type are squelched.  This can be useful if the console is slow
 (i.e., connected to a serial port) and so a large amount of output could
 end up delaying the boot process for a long time (potentially hours).
 .TP
+.I readahead_mem_pct
+Use this percentage of memory to try to read in metadata blocks ahead of the
+main e2fsck thread.  This should reduce run times, depending on the speed of
+the underlying storage and the amount of free memory.  There is no default, but
+see
+.B readahead_mem_pct
+for more details.
+.TP
+.I readahead_kb
+Use this amount of memory to read in metadata blocks ahead of the main checking
+thread.  Setting this value to zero disables readahead entirely.  By default,
+this is set the size of one block group's inode table (typically 2MiB on a
+regular ext4 filesystem); if this amount is more than 1/100th of total physical
+memory, readahead is disabled.
+.TP
 .I report_features
 If this boolean relation is true, e2fsck will print the file system
 features as part of its verbose reporting (i.e., if the
diff --git a/e2fsck/e2fsck.h b/e2fsck/e2fsck.h
index ead546e..39aa353 100644
--- a/e2fsck/e2fsck.h
+++ b/e2fsck/e2fsck.h
@@ -378,6 +378,9 @@ struct e2fsck_struct {
 	 */
 	void *priv_data;
 	ext2fs_block_bitmap block_metadata_map; /* Metadata blocks */
+
+	/* How much are we allowed to readahead? */
+	unsigned long long readahead_kb;
 };
 
 /* Used by the region allocation code */
diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
index 4fc5311..87fa538 100644
--- a/e2fsck/pass1.c
+++ b/e2fsck/pass1.c
@@ -867,6 +867,58 @@ out:
 	return 0;
 }
 
+static void pass1_readahead(e2fsck_t ctx, dgrp_t *group, ext2_ino_t *next_ino)
+{
+	ext2_ino_t inodes_in_group = 0, inodes_per_block, inodes_per_buffer;
+	dgrp_t start = *group, grp;
+	blk64_t blocks_to_read = 0;
+	errcode_t err = EXT2_ET_INVALID_ARGUMENT;
+
+	if (ctx->readahead_kb == 0)
+		goto out;
+
+	/* Keep iterating groups until we have enough to readahead */
+	inodes_per_block = EXT2_INODES_PER_BLOCK(ctx->fs->super);
+	for (grp = start; grp < ctx->fs->group_desc_count; grp++) {
+		if (ext2fs_bg_flags_test(ctx->fs, grp, EXT2_BG_INODE_UNINIT))
+			continue;
+		inodes_in_group = ctx->fs->super->s_inodes_per_group -
+					ext2fs_bg_itable_unused(ctx->fs, grp);
+		blocks_to_read += (inodes_in_group + inodes_per_block - 1) /
+					inodes_per_block;
+		if (blocks_to_read * ctx->fs->blocksize >
+		    ctx->readahead_kb * 1024)
+			break;
+	}
+
+	err = e2fsck_readahead(ctx->fs, E2FSCK_READA_ITABLE, start,
+			       grp - start + 1);
+	if (err == EAGAIN) {
+		ctx->readahead_kb /= 2;
+		err = 0;
+	}
+
+out:
+	if (err) {
+		/* Error; disable itable readahead */
+		*group = ctx->fs->group_desc_count;
+		*next_ino = ctx->fs->super->s_inodes_count;
+	} else {
+		/*
+		 * Don't do more readahead until we've reached the first inode
+		 * of the last inode scan buffer block for the last group.
+		 */
+		*group = grp + 1;
+		inodes_per_buffer = (ctx->inode_buffer_blocks ?
+				     ctx->inode_buffer_blocks : 8) *
+				    ctx->fs->blocksize /
+				    EXT2_INODE_SIZE(ctx->fs->super);
+		*next_ino = inodes_in_group -
+			    (inodes_in_group % inodes_per_buffer) + 1 +
+			    (grp * ctx->fs->super->s_inodes_per_group);
+	}
+}
+
 void e2fsck_pass1(e2fsck_t ctx)
 {
 	int	i;
@@ -889,10 +941,19 @@ void e2fsck_pass1(e2fsck_t ctx)
 	int		low_dtime_check = 1;
 	int		inode_size;
 	int		failed_csum = 0;
+	ext2_ino_t	ino_threshold = 0;
+	dgrp_t		ra_group = 0;
 
 	init_resource_track(&rtrack, ctx->fs->io);
 	clear_problem_context(&pctx);
 
+	/* If we can do readahead, figure out how many groups to pull in. */
+	if (!e2fsck_can_readahead(ctx->fs))
+		ctx->readahead_kb = 0;
+	else if (ctx->readahead_kb == ~0ULL)
+		ctx->readahead_kb = e2fsck_guess_readahead(ctx->fs);
+	pass1_readahead(ctx, &ra_group, &ino_threshold);
+
 	if (!(ctx->options & E2F_OPT_PREEN))
 		fix_problem(ctx, PR_1_PASS_HEADER, &pctx);
 
@@ -1069,6 +1130,8 @@ void e2fsck_pass1(e2fsck_t ctx)
 			if (e2fsck_mmp_update(fs))
 				fatal_error(ctx, 0);
 		}
+		if (ino > ino_threshold)
+			pass1_readahead(ctx, &ra_group, &ino_threshold);
 		old_op = ehandler_operation(_("getting next inode from scan"));
 		pctx.errcode = ext2fs_get_next_inode_full(scan, &ino,
 							  inode, inode_size);
diff --git a/e2fsck/pass2.c b/e2fsck/pass2.c
index 23310f1..5bf55ad 100644
--- a/e2fsck/pass2.c
+++ b/e2fsck/pass2.c
@@ -61,6 +61,9 @@
  * Keeps track of how many times an inode is referenced.
  */
 static void deallocate_inode(e2fsck_t ctx, ext2_ino_t ino, char* block_buf);
+static int check_dir_block2(ext2_filsys fs,
+			   struct ext2_db_entry2 *dir_blocks_info,
+			   void *priv_data);
 static int check_dir_block(ext2_filsys fs,
 			   struct ext2_db_entry2 *dir_blocks_info,
 			   void *priv_data);
@@ -77,6 +80,9 @@ struct check_dir_struct {
 	struct problem_context	pctx;
 	int	count, max;
 	e2fsck_t ctx;
+	unsigned long long list_offset;
+	unsigned long long ra_entries;
+	unsigned long long next_ra_off;
 };
 
 void e2fsck_pass2(e2fsck_t ctx)
@@ -96,6 +102,9 @@ void e2fsck_pass2(e2fsck_t ctx)
 	int			i, depth;
 	problem_t		code;
 	int			bad_dir;
+	int (*check_dir_func)(ext2_filsys fs,
+			      struct ext2_db_entry2 *dir_blocks_info,
+			      void *priv_data);
 
 	init_resource_track(&rtrack, ctx->fs->io);
 	clear_problem_context(&cd.pctx);
@@ -139,6 +148,9 @@ void e2fsck_pass2(e2fsck_t ctx)
 	cd.ctx = ctx;
 	cd.count = 1;
 	cd.max = ext2fs_dblist_count2(fs->dblist);
+	cd.list_offset = 0;
+	cd.ra_entries = ctx->readahead_kb * 1024 / ctx->fs->blocksize;
+	cd.next_ra_off = 0;
 
 	if (ctx->progress)
 		(void) (ctx->progress)(ctx, 2, 0, cd.max);
@@ -146,7 +158,8 @@ void e2fsck_pass2(e2fsck_t ctx)
 	if (fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_DIR_INDEX)
 		ext2fs_dblist_sort2(fs->dblist, special_dir_block_cmp);
 
-	cd.pctx.errcode = ext2fs_dblist_iterate2(fs->dblist, check_dir_block,
+	check_dir_func = cd.ra_entries ? check_dir_block2 : check_dir_block;
+	cd.pctx.errcode = ext2fs_dblist_iterate2(fs->dblist, check_dir_func,
 						 &cd);
 	if (ctx->flags & E2F_FLAG_SIGNAL_MASK || ctx->flags & E2F_FLAG_RESTART)
 		return;
@@ -824,6 +837,29 @@ err:
 	return retval;
 }
 
+static int check_dir_block2(ext2_filsys fs,
+			   struct ext2_db_entry2 *db,
+			   void *priv_data)
+{
+	int err;
+	struct check_dir_struct *cd = priv_data;
+
+	if (cd->ra_entries && cd->list_offset >= cd->next_ra_off) {
+		err = e2fsck_readahead_dblist(fs,
+					E2FSCK_RA_DBLIST_IGNORE_BLOCKCNT,
+					fs->dblist,
+					cd->list_offset + cd->ra_entries / 8,
+					cd->ra_entries);
+		if (err)
+			cd->ra_entries = 0;
+		cd->next_ra_off = cd->list_offset + (cd->ra_entries * 7 / 8);
+	}
+
+	err = check_dir_block(fs, db, priv_data);
+	cd->list_offset++;
+	return err;
+}
+
 static int check_dir_block(ext2_filsys fs,
 			   struct ext2_db_entry2 *db,
 			   void *priv_data)
diff --git a/e2fsck/pass4.c b/e2fsck/pass4.c
index 21d93f0..ee40095 100644
--- a/e2fsck/pass4.c
+++ b/e2fsck/pass4.c
@@ -100,12 +100,26 @@ void e2fsck_pass4(e2fsck_t ctx)
 	__u16	link_count, link_counted;
 	char	*buf = 0;
 	dgrp_t	group, maxgroup;
+	errcode_t	err;
 
 	init_resource_track(&rtrack, ctx->fs->io);
 
 #ifdef MTRACE
 	mtrace_print("Pass 4");
 #endif
+	/*
+	 * Since pass4 is mostly CPU bound, start readahead of bitmaps
+	 * ahead of pass 5 if we haven't already loaded them.
+	 */
+	if (ctx->readahead_kb &&
+	    (fs->block_map == NULL || fs->inode_map == NULL)) {
+		err = e2fsck_readahead(fs, E2FSCK_READA_BBITMAP |
+					   E2FSCK_READA_IBITMAP,
+				       0, fs->group_desc_count);
+		if (err)
+			com_err(ctx->program_name, err, "%s",
+				_("while starting pass5 readahead"));
+	}
 
 	clear_problem_context(&pctx);
 
diff --git a/e2fsck/unix.c b/e2fsck/unix.c
index f71afbc..13ce466 100644
--- a/e2fsck/unix.c
+++ b/e2fsck/unix.c
@@ -649,6 +649,7 @@ static void parse_extended_opts(e2fsck_t ctx, const char *opts)
 	char	*buf, *token, *next, *p, *arg;
 	int	ea_ver;
 	int	extended_usage = 0;
+	unsigned long long reada_kb;
 
 	buf = string_copy(ctx, opts, 0);
 	for (token = buf; token && *token; token = next) {
@@ -677,6 +678,15 @@ static void parse_extended_opts(e2fsck_t ctx, const char *opts)
 				continue;
 			}
 			ctx->ext_attr_ver = ea_ver;
+		} else if (strcmp(token, "readahead_kb") == 0) {
+			reada_kb = strtoull(arg, &p, 0);
+			if (*p) {
+				fprintf(stderr, "%s",
+					_("Invalid readahead buffer size.\n"));
+				extended_usage++;
+				continue;
+			}
+			ctx->readahead_kb = reada_kb;
 		} else if (strcmp(token, "fragcheck") == 0) {
 			ctx->options |= E2F_OPT_FRAGCHECK;
 			continue;
@@ -716,6 +726,7 @@ static void parse_extended_opts(e2fsck_t ctx, const char *opts)
 		fputs(("\tjournal_only\n"), stderr);
 		fputs(("\tdiscard\n"), stderr);
 		fputs(("\tnodiscard\n"), stderr);
+		fputs(("\treadahead_kb=<buffer size>\n"), stderr);
 		fputc('\n', stderr);
 		exit(1);
 	}
@@ -749,6 +760,7 @@ static errcode_t PRS(int argc, char *argv[], e2fsck_t *ret_ctx)
 #ifdef CONFIG_JBD_DEBUG
 	char 		*jbd_debug;
 #endif
+	unsigned long long phys_mem_kb;
 
 	retval = e2fsck_allocate_context(&ctx);
 	if (retval)
@@ -776,6 +788,8 @@ static errcode_t PRS(int argc, char *argv[], e2fsck_t *ret_ctx)
 	else
 		ctx->program_name = "e2fsck";
 
+	phys_mem_kb = get_memory_size() / 1024;
+	ctx->readahead_kb = ~0ULL;
 	while ((c = getopt (argc, argv, "panyrcC:B:dE:fvtFVM:b:I:j:P:l:L:N:SsDk")) != EOF)
 		switch (c) {
 		case 'C':
@@ -960,6 +974,20 @@ static errcode_t PRS(int argc, char *argv[], e2fsck_t *ret_ctx)
 	if (c)
 		verbose = 1;
 
+	if (ctx->readahead_kb == ~0ULL) {
+		profile_get_integer(ctx->profile, "options",
+				    "readahead_mem_pct", 0, -1, &c);
+		if (c >= 0 && c <= 100)
+			ctx->readahead_kb = phys_mem_kb * c / 100;
+		profile_get_integer(ctx->profile, "options",
+				    "readahead_kb", 0, -1, &c);
+		if (c >= 0)
+			ctx->readahead_kb = c;
+		if (ctx->readahead_kb != ~0ULL &&
+		    ctx->readahead_kb > phys_mem_kb)
+			ctx->readahead_kb = phys_mem_kb;
+	}
+
 	/* Turn off discard in read-only mode */
 	if ((ctx->options & E2F_OPT_NO) &&
 	    (ctx->options & E2F_OPT_DISCARD))


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5
  2014-08-09  4:26 [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Darrick J. Wong
                   ` (5 preceding siblings ...)
  2014-08-09  4:26 ` [PATCH 6/6] e2fsck: read-ahead metadata during passes 1, 2, and 4 Darrick J. Wong
@ 2014-08-09  5:53 ` Theodore Ts'o
  2014-08-09  5:59   ` Darrick J. Wong
  6 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-08-09  5:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4

On Fri, Aug 08, 2014 at 09:26:10PM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> This is part 5 of the Summer 2014 e2fsprogs patchset.  

Did you skip "part 4"?  Or am I confused?

					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5
  2014-08-09  5:53 ` [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Theodore Ts'o
@ 2014-08-09  5:59   ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-09  5:59 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

On Sat, Aug 09, 2014 at 01:53:10AM -0400, Theodore Ts'o wrote:
> On Fri, Aug 08, 2014 at 09:26:10PM -0700, Darrick J. Wong wrote:
> > Hi all,
> > 
> > This is part 5 of the Summer 2014 e2fsprogs patchset.  
> 
> Did you skip "part 4"?  Or am I confused?

I got confused and thought I'd already sent a part 4.  The sequence goes 1,
2, 2.1, 3, 5.  I'm glad my vacation is coming up in 3 weeks. <cough> 8-)

--D
> 
> 					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-09  4:26 ` [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata Darrick J. Wong
@ 2014-08-11  5:21   ` Darrick J. Wong
  2014-08-11  6:24     ` Theodore Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-11  5:21 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

On Fri, Aug 08, 2014 at 09:26:43PM -0700, Darrick J. Wong wrote:
> This patch adds to e2fsck the ability to pre-fetch metadata into the
> page cache in the hopes of speeding up fsck runs.  There are two new
> functions -- the first allows a caller to readahead a list of blocks,
> and the second is a helper function that uses that first mechanism to
> load group data (bitmaps, inode tables).
> 
> These new e2fsck routines require the addition of a dblist API to
> allow us to iterate a subset of a dblist.  This will enable
> incremental directory block readahead in e2fsck pass 2.
> 
> There's also a function to estimate the readahead given a FS.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  configure           |    2 
>  configure.in        |    1 
>  e2fsck/Makefile.in  |    8 +-
>  e2fsck/e2fsck.h     |   18 ++++
>  e2fsck/readahead.c  |  213 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  e2fsck/util.c       |   51 ++++++++++++
>  lib/config.h.in     |    3 +
>  lib/ext2fs/dblist.c |   21 ++++-
>  lib/ext2fs/ext2fs.h |   10 ++
>  9 files changed, 319 insertions(+), 8 deletions(-)
>  create mode 100644 e2fsck/readahead.c
> 
> 
> diff --git a/configure b/configure
> index 8ad1408..71778e1 100755
> --- a/configure
> +++ b/configure
> @@ -12404,7 +12404,7 @@ fi
>  done
>  
>  fi
> -for ac_header in  	dirent.h 	errno.h 	execinfo.h 	getopt.h 	malloc.h 	mntent.h 	paths.h 	semaphore.h 	setjmp.h 	signal.h 	stdarg.h 	stdint.h 	stdlib.h 	termios.h 	termio.h 	unistd.h 	utime.h 	attr/xattr.h 	linux/falloc.h 	linux/fd.h 	linux/major.h 	linux/loop.h 	net/if_dl.h 	netinet/in.h 	sys/disklabel.h 	sys/disk.h 	sys/file.h 	sys/ioctl.h 	sys/mkdev.h 	sys/mman.h 	sys/mount.h 	sys/prctl.h 	sys/resource.h 	sys/select.h 	sys/socket.h 	sys/sockio.h 	sys/stat.h 	sys/syscall.h 	sys/sysmacros.h 	sys/time.h 	sys/types.h 	sys/un.h 	sys/wait.h
> +for ac_header in  	dirent.h 	errno.h 	execinfo.h 	getopt.h 	malloc.h 	mntent.h 	paths.h 	semaphore.h 	setjmp.h 	signal.h 	stdarg.h 	stdint.h 	stdlib.h 	termios.h 	termio.h 	unistd.h 	utime.h 	attr/xattr.h 	linux/falloc.h 	linux/fd.h 	linux/major.h 	linux/loop.h 	net/if_dl.h 	netinet/in.h 	sys/disklabel.h 	sys/disk.h 	sys/file.h 	sys/ioctl.h 	sys/mkdev.h 	sys/mman.h 	sys/mount.h 	sys/prctl.h 	sys/resource.h 	sys/select.h 	sys/socket.h 	sys/sockio.h 	sys/stat.h 	sys/syscall.h 	sys/sysctl.h 	sys/sysmacros.h 	sys/time.h 	sys/types.h 	sys/un.h 	sys/wait.h
>  do :
>    as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
>  ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
> diff --git a/configure.in b/configure.in
> index 3c0d64f..e881204 100644
> --- a/configure.in
> +++ b/configure.in
> @@ -941,6 +941,7 @@ AC_CHECK_HEADERS(m4_flatten([
>  	sys/sockio.h
>  	sys/stat.h
>  	sys/syscall.h
> +	sys/sysctl.h
>  	sys/sysmacros.h
>  	sys/time.h
>  	sys/types.h
> diff --git a/e2fsck/Makefile.in b/e2fsck/Makefile.in
> index c40b188..3a9d7b5 100644
> --- a/e2fsck/Makefile.in
> +++ b/e2fsck/Makefile.in
> @@ -62,7 +62,7 @@ OBJS= dict.o unix.o e2fsck.o super.o pass1.o pass1b.o pass2.o \
>  	pass3.o pass4.o pass5.o journal.o badblocks.o util.o dirinfo.o \
>  	dx_dirinfo.o ehandler.o problem.o message.o quota.o recovery.o \
>  	region.o revoke.o ea_refcount.o rehash.o profile.o prof_err.o \
> -	logfile.o sigcatcher.o $(MTRACE_OBJ)
> +	logfile.o sigcatcher.o readahead.o $(MTRACE_OBJ)
>  
>  PROFILED_OBJS= profiled/dict.o profiled/unix.o profiled/e2fsck.o \
>  	profiled/super.o profiled/pass1.o profiled/pass1b.o \
> @@ -73,7 +73,7 @@ PROFILED_OBJS= profiled/dict.o profiled/unix.o profiled/e2fsck.o \
>  	profiled/recovery.o profiled/region.o profiled/revoke.o \
>  	profiled/ea_refcount.o profiled/rehash.o profiled/profile.o \
>  	profiled/prof_err.o profiled/logfile.o \
> -	profiled/sigcatcher.o
> +	profiled/sigcatcher.o profiled/readahead.o
>  
>  SRCS= $(srcdir)/e2fsck.c \
>  	$(srcdir)/dict.c \
> @@ -97,6 +97,7 @@ SRCS= $(srcdir)/e2fsck.c \
>  	$(srcdir)/message.c \
>  	$(srcdir)/ea_refcount.c \
>  	$(srcdir)/rehash.c \
> +	$(srcdir)/readahead.c \
>  	$(srcdir)/region.c \
>  	$(srcdir)/profile.c \
>  	$(srcdir)/sigcatcher.c \
> @@ -527,3 +528,6 @@ quota.o: $(srcdir)/quota.c $(top_builddir)/lib/config.h \
>   $(srcdir)/profile.h prof_err.h $(top_srcdir)/lib/quota/quotaio.h \
>   $(top_srcdir)/lib/quota/dqblk_v2.h $(top_srcdir)/lib/quota/quotaio_tree.h \
>   $(top_srcdir)/lib/../e2fsck/dict.h $(srcdir)/problem.h
> +readahead.o: $(srcdir)/readahead.c $(top_builddir)/lib/config.h \
> + $(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/lib/ext2fs/ext2_fs.h \
> + $(top_builddir)/lib/ext2fs/ext2_err.h $(srcdir)/e2fsck.h
> diff --git a/e2fsck/e2fsck.h b/e2fsck/e2fsck.h
> index 8f16218..ead546e 100644
> --- a/e2fsck/e2fsck.h
> +++ b/e2fsck/e2fsck.h
> @@ -490,6 +490,23 @@ extern ext2_ino_t e2fsck_get_lost_and_found(e2fsck_t ctx, int fix);
>  extern errcode_t e2fsck_adjust_inode_count(e2fsck_t ctx, ext2_ino_t ino,
>  					   int adj);
>  
> +/* readahead.c */
> +#define E2FSCK_READA_SUPER	(0x01)
> +#define E2FSCK_READA_GDT	(0x02)
> +#define E2FSCK_READA_BBITMAP	(0x04)
> +#define E2FSCK_READA_IBITMAP	(0x08)
> +#define E2FSCK_READA_ITABLE	(0x10)
> +#define E2FSCK_READA_ALL_FLAGS	(0x1F)
> +errcode_t e2fsck_readahead(ext2_filsys fs, int flags, dgrp_t start,
> +			   dgrp_t ngroups);
> +#define E2FSCK_RA_DBLIST_IGNORE_BLOCKCNT	(0x01)
> +#define E2FSCK_RA_DBLIST_ALL_FLAGS		(0x01)
> +errcode_t e2fsck_readahead_dblist(ext2_filsys fs, int flags,
> +				  ext2_dblist dblist,
> +				  unsigned long long start,
> +				  unsigned long long count);
> +int e2fsck_can_readahead(ext2_filsys fs);
> +unsigned long long e2fsck_guess_readahead(ext2_filsys fs);
>  
>  /* region.c */
>  extern region_t region_create(region_addr_t min, region_addr_t max);
> @@ -579,6 +596,7 @@ extern errcode_t e2fsck_allocate_subcluster_bitmap(ext2_filsys fs,
>  						   int default_type,
>  						   const char *profile_name,
>  						   ext2fs_block_bitmap *ret);
> +unsigned long long get_memory_size(void);
>  
>  /* unix.c */
>  extern void e2fsck_clear_progbar(e2fsck_t ctx);
> diff --git a/e2fsck/readahead.c b/e2fsck/readahead.c
> new file mode 100644
> index 0000000..0395975
> --- /dev/null
> +++ b/e2fsck/readahead.c
> @@ -0,0 +1,213 @@
> +/*
> + * readahead.c -- Prefetch filesystem metadata to speed up fsck.
> + *
> + * Copyright (C) 2014 Oracle.
> + *
> + * %Begin-Header%
> + * This file may be redistributed under the terms of the GNU Library
> + * General Public License, version 2.
> + * %End-Header%
> + */
> +
> +#include "config.h"
> +#include <string.h>
> +
> +#include "e2fsck.h"
> +
> +#undef DEBUG
> +
> +#ifdef DEBUG
> +# define dbg_printf(f, a...)  do {printf(f, ## a); fflush(stdout); } while (0)
> +#else
> +# define dbg_printf(f, a...)
> +#endif
> +
> +struct read_dblist {
> +	errcode_t err;
> +	blk64_t run_start;
> +	blk64_t run_len;
> +	int flags;
> +};
> +
> +static EXT2_QSORT_TYPE readahead_dir_block_cmp(const void *a, const void *b)
> +{
> +	const struct ext2_db_entry2 *db_a =
> +		(const struct ext2_db_entry2 *) a;
> +	const struct ext2_db_entry2 *db_b =
> +		(const struct ext2_db_entry2 *) b;
> +
> +	return (int) (db_a->blk - db_b->blk);
> +}
> +
> +static int readahead_dir_block(ext2_filsys fs, struct ext2_db_entry2 *db,
> +			       void *priv_data)
> +{
> +	struct read_dblist *pr = priv_data;
> +	e2_blkcnt_t count = (pr->flags & E2FSCK_RA_DBLIST_IGNORE_BLOCKCNT ?
> +			     1 : db->blockcnt);
> +
> +	if (!pr->run_len || db->blk != pr->run_start + pr->run_len) {
> +		if (pr->run_len) {
> +			pr->err = io_channel_cache_readahead(fs->io,
> +							     pr->run_start,
> +							     pr->run_len);
> +			dbg_printf("readahead start=%llu len=%llu err=%d\n",
> +				   pr->run_start, pr->run_len,
> +				   (int)pr->err);
> +		}
> +		pr->run_start = db->blk;
> +		pr->run_len = 0;
> +	}
> +	pr->run_len += count;
> +
> +	return pr->err ? DBLIST_ABORT : 0;
> +}
> +
> +errcode_t e2fsck_readahead_dblist(ext2_filsys fs, int flags,
> +				  ext2_dblist dblist,
> +				  unsigned long long start,
> +				  unsigned long long count)
> +{
> +	errcode_t err;
> +	struct read_dblist pr;
> +
> +	dbg_printf("%s: flags=0x%x\n", __func__, flags);
> +	if (flags & ~E2FSCK_RA_DBLIST_ALL_FLAGS)
> +		return EXT2_ET_INVALID_ARGUMENT;
> +
> +	memset(&pr, 0, sizeof(pr));
> +	pr.flags = flags;
> +	err = ext2fs_dblist_iterate3(dblist, readahead_dir_block, start,
> +				     count, &pr);
> +	if (pr.err)
> +		return pr.err;
> +	if (err)
> +		return err;
> +
> +	if (pr.run_len)
> +		err = io_channel_cache_readahead(fs->io, pr.run_start,
> +						 pr.run_len);
> +
> +	return err;
> +}
> +
> +errcode_t e2fsck_readahead(ext2_filsys fs, int flags, dgrp_t start,
> +			   dgrp_t ngroups)
> +{
> +	blk64_t		super, old_gdt, new_gdt;
> +	blk_t		blocks;
> +	dgrp_t		i;
> +	ext2_dblist	dblist;
> +	dgrp_t		end = start + ngroups;
> +	errcode_t	err = 0;
> +
> +	dbg_printf("%s: flags=0x%x start=%d groups=%d\n", __func__, flags,
> +		   start, ngroups);
> +	if (flags & ~E2FSCK_READA_ALL_FLAGS)
> +		return EXT2_ET_INVALID_ARGUMENT;
> +
> +	if (end > fs->group_desc_count)
> +		end = fs->group_desc_count;
> +
> +	if (flags == 0)
> +		return 0;
> +
> +	err = ext2fs_init_dblist(fs, &dblist);

It turns out that each of the calls to ext2fs_resize_mem in the
ext2fs_add_dir_block2() function is costing us ~2ms for each call to this
function.  I'll add a new ext2fs_init_dblist() APi that lets us specify the
initial size of the list.  This seems to reduce the fsck runtime by a few more
seconds.

--D

> +	if (err)
> +		return err;
> +
> +	for (i = start; i < end; i++) {
> +		err = ext2fs_super_and_bgd_loc2(fs, i, &super, &old_gdt,
> +						&new_gdt, &blocks);
> +		if (err)
> +			break;
> +
> +		if (flags & E2FSCK_READA_SUPER) {
> +			err = ext2fs_add_dir_block2(dblist, 0, super, 0);
> +			if (err)
> +				break;
> +		}
> +
> +		if (flags & E2FSCK_READA_GDT) {
> +			if (old_gdt)
> +				err = ext2fs_add_dir_block2(dblist, 0, old_gdt,
> +							    blocks);
> +			else if (new_gdt)
> +				err = ext2fs_add_dir_block2(dblist, 0, new_gdt,
> +							    blocks);
> +			else
> +				err = 0;
> +			if (err)
> +				break;
> +		}
> +
> +		if ((flags & E2FSCK_READA_BBITMAP) &&
> +		    !ext2fs_bg_flags_test(fs, i, EXT2_BG_BLOCK_UNINIT) &&
> +		    ext2fs_bg_free_blocks_count(fs, i) <
> +				fs->super->s_blocks_per_group) {
> +			super = ext2fs_block_bitmap_loc(fs, i);
> +			err = ext2fs_add_dir_block2(dblist, 0, super, 1);
> +			if (err)
> +				break;
> +		}
> +
> +		if ((flags & E2FSCK_READA_IBITMAP) &&
> +		    !ext2fs_bg_flags_test(fs, i, EXT2_BG_INODE_UNINIT) &&
> +		    ext2fs_bg_free_inodes_count(fs, i) <
> +				fs->super->s_inodes_per_group) {
> +			super = ext2fs_inode_bitmap_loc(fs, i);
> +			err = ext2fs_add_dir_block2(dblist, 0, super, 1);
> +			if (err)
> +				break;
> +		}
> +
> +		if ((flags & E2FSCK_READA_ITABLE) &&
> +		    ext2fs_bg_free_inodes_count(fs, i) <
> +				fs->super->s_inodes_per_group) {
> +			super = ext2fs_inode_table_loc(fs, i);
> +			blocks = fs->inode_blocks_per_group -
> +				 (ext2fs_bg_itable_unused(fs, i) *
> +				  EXT2_INODE_SIZE(fs->super) / fs->blocksize);
> +			err = ext2fs_add_dir_block2(dblist, 0, super, blocks);
> +			if (err)
> +				break;
> +		}
> +	}
> +
> +	if (!err) {
> +		ext2fs_dblist_sort2(dblist, readahead_dir_block_cmp);
> +		err = e2fsck_readahead_dblist(fs, 0, dblist, 0,
> +					      ext2fs_dblist_count2(dblist));
> +	}
> +
> +	ext2fs_free_dblist(dblist);
> +	return err;
> +}
> +
> +int e2fsck_can_readahead(ext2_filsys fs)
> +{
> +	errcode_t err;
> +
> +	err = io_channel_cache_readahead(fs->io, 0, 1);
> +	dbg_printf("%s: supp=%d\n", __func__, err != EXT2_ET_OP_NOT_SUPPORTED);
> +	return err != EXT2_ET_OP_NOT_SUPPORTED;
> +}
> +
> +unsigned long long e2fsck_guess_readahead(ext2_filsys fs)
> +{
> +	unsigned long long guess;
> +
> +	/*
> +	 * The optimal readahead sizes were experimentally determined by
> +	 * djwong in August 2014.  Setting the RA size to one block group's
> +	 * worth of inode table blocks seems to yield the largest reductions
> +	 * in e2fsck runtime.
> +	 */
> +	guess = fs->blocksize * fs->inode_blocks_per_group;
> +
> +	/* Disable RA if it'd use more 1/100th of RAM. */
> +	if (get_memory_size() > (guess * 100))
> +		return guess / 1024;
> +
> +	return 0;
> +}
> diff --git a/e2fsck/util.c b/e2fsck/util.c
> index 8237328..74f20062 100644
> --- a/e2fsck/util.c
> +++ b/e2fsck/util.c
> @@ -37,6 +37,10 @@
>  #include <errno.h>
>  #endif
>  
> +#ifdef HAVE_SYS_SYSCTL_H
> +#include <sys/sysctl.h>
> +#endif
> +
>  #include "e2fsck.h"
>  
>  extern e2fsck_t e2fsck_global_ctx;   /* Try your very best not to use this! */
> @@ -848,3 +852,50 @@ errcode_t e2fsck_allocate_subcluster_bitmap(ext2_filsys fs, const char *descr,
>  	fs->default_bitmap_type = save_type;
>  	return retval;
>  }
> +
> +/* Return memory size in bytes */
> +unsigned long long get_memory_size(void)
> +{
> +#if defined(_SC_PHYS_PAGES)
> +# if defined(_SC_PAGESIZE)
> +	return (unsigned long long)sysconf(_SC_PHYS_PAGES) *
> +	       (unsigned long long)sysconf(_SC_PAGESIZE);
> +# elif defined(_SC_PAGE_SIZE)
> +	return (unsigned long long)sysconf(_SC_PHYS_PAGES) *
> +	       (unsigned long long)sysconf(_SC_PAGE_SIZE);
> +# endif
> +#elif defined(CTL_HW)
> +# if (defined(HW_MEMSIZE) || defined(HW_PHYSMEM64))
> +#  define CTL_HW_INT64
> +# elif (defined(HW_PHYSMEM) || defined(HW_REALMEM))
> +#  define CTL_HW_UINT
> +# endif
> +	int mib[2];
> +
> +	mib[0] = CTL_HW;
> +# if defined(HW_MEMSIZE)
> +	mib[1] = HW_MEMSIZE;
> +# elif defined(HW_PHYSMEM64)
> +	mib[1] = HW_PHYSMEM64;
> +# elif defined(HW_REALMEM)
> +	mib[1] = HW_REALMEM;
> +# elif defined(HW_PYSMEM)
> +	mib[1] = HW_PHYSMEM;
> +# endif
> +# if defined(CTL_HW_INT64)
> +	unsigned long long size = 0;
> +# elif defined(CTL_HW_UINT)
> +	unsigned int size = 0;
> +# endif
> +# if defined(CTL_HW_INT64) || defined(CTL_HW_UINT)
> +	size_t len = sizeof(size);
> +
> +	if (sysctl(mib, 2, &size, &len, NULL, 0) == 0)
> +		return (unsigned long long)size;
> +# endif
> +	return 0;
> +#else
> +# warning "Don't know how to detect memory on your platform?"
> +	return 0;
> +#endif
> +}
> diff --git a/lib/config.h.in b/lib/config.h.in
> index 1d2382b..b598d1e 100644
> --- a/lib/config.h.in
> +++ b/lib/config.h.in
> @@ -500,6 +500,9 @@
>  /* Define to 1 if you have the <sys/syscall.h> header file. */
>  #undef HAVE_SYS_SYSCALL_H
>  
> +/* Define to 1 if you have the <sys/sysctl.h> header file. */
> +#undef HAVE_SYS_SYSCTL_H
> +
>  /* Define to 1 if you have the <sys/sysmacros.h> header file. */
>  #undef HAVE_SYS_SYSMACROS_H
>  
> diff --git a/lib/ext2fs/dblist.c b/lib/ext2fs/dblist.c
> index 942c4f0..bbdb221 100644
> --- a/lib/ext2fs/dblist.c
> +++ b/lib/ext2fs/dblist.c
> @@ -194,20 +194,25 @@ void ext2fs_dblist_sort2(ext2_dblist dblist,
>  /*
>   * This function iterates over the directory block list
>   */
> -errcode_t ext2fs_dblist_iterate2(ext2_dblist dblist,
> +errcode_t ext2fs_dblist_iterate3(ext2_dblist dblist,
>  				 int (*func)(ext2_filsys fs,
>  					     struct ext2_db_entry2 *db_info,
>  					     void	*priv_data),
> +				 unsigned long long start,
> +				 unsigned long long count,
>  				 void *priv_data)
>  {
> -	unsigned long long	i;
> +	unsigned long long	i, end;
>  	int		ret;
>  
>  	EXT2_CHECK_MAGIC(dblist, EXT2_ET_MAGIC_DBLIST);
>  
> +	end = start + count;
>  	if (!dblist->sorted)
>  		ext2fs_dblist_sort2(dblist, 0);
> -	for (i=0; i < dblist->count; i++) {
> +	if (end > dblist->count)
> +		end = dblist->count;
> +	for (i = start; i < end; i++) {
>  		ret = (*func)(dblist->fs, &dblist->list[i], priv_data);
>  		if (ret & DBLIST_ABORT)
>  			return 0;
> @@ -215,6 +220,16 @@ errcode_t ext2fs_dblist_iterate2(ext2_dblist dblist,
>  	return 0;
>  }
>  
> +errcode_t ext2fs_dblist_iterate2(ext2_dblist dblist,
> +				 int (*func)(ext2_filsys fs,
> +					     struct ext2_db_entry2 *db_info,
> +					     void	*priv_data),
> +				 void *priv_data)
> +{
> +	return ext2fs_dblist_iterate3(dblist, func, 0, dblist->count,
> +				      priv_data);
> +}
> +
>  static EXT2_QSORT_TYPE dir_block_cmp2(const void *a, const void *b)
>  {
>  	const struct ext2_db_entry2 *db_a =
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index b4a9f84..04a1f4a 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -1052,11 +1052,17 @@ extern void ext2fs_dblist_sort2(ext2_dblist dblist,
>  extern errcode_t ext2fs_dblist_iterate(ext2_dblist dblist,
>  	int (*func)(ext2_filsys fs, struct ext2_db_entry *db_info,
>  		    void	*priv_data),
> -       void *priv_data);
> +	void *priv_data);
>  extern errcode_t ext2fs_dblist_iterate2(ext2_dblist dblist,
>  	int (*func)(ext2_filsys fs, struct ext2_db_entry2 *db_info,
>  		    void	*priv_data),
> -       void *priv_data);
> +	void *priv_data);
> +extern errcode_t ext2fs_dblist_iterate3(ext2_dblist dblist,
> +	int (*func)(ext2_filsys fs, struct ext2_db_entry2 *db_info,
> +		    void	*priv_data),
> +	unsigned long long start,
> +	unsigned long long count,
> +	void *priv_data);
>  extern errcode_t ext2fs_set_dir_block(ext2_dblist dblist, ext2_ino_t ino,
>  				      blk_t blk, int blockcnt);
>  extern errcode_t ext2fs_set_dir_block2(ext2_dblist dblist, ext2_ino_t ino,
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-11  5:21   ` Darrick J. Wong
@ 2014-08-11  6:24     ` Theodore Ts'o
  2014-08-11  6:31       ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-08-11  6:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4

On Sun, Aug 10, 2014 at 10:21:51PM -0700, Darrick J. Wong wrote:
> 
> It turns out that each of the calls to ext2fs_resize_mem in the
> ext2fs_add_dir_block2() function is costing us ~2ms for each call to this
> function.  I'll add a new ext2fs_init_dblist() APi that lets us specify the
> initial size of the list.  This seems to reduce the fsck runtime by a few more
> seconds.

I suspect dblist is the wrong abstraction to use here.  Since the
blocks we want to read ahead are generally going to be contiguous, why
not use a rbtree bitmap, setting the blocks that should be subject to
readahead, and then use the find_first_set() function to iterate over
the bitmap?  Or if you want to be even more efficient, create an
interator function which takes a bitmap and returns blocks that are
set in the bitmap, one by one.

						- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-11  6:24     ` Theodore Ts'o
@ 2014-08-11  6:31       ` Darrick J. Wong
  2014-08-11 14:34         ` Theodore Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-11  6:31 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

On Mon, Aug 11, 2014 at 02:24:15AM -0400, Theodore Ts'o wrote:
> On Sun, Aug 10, 2014 at 10:21:51PM -0700, Darrick J. Wong wrote:
> > 
> > It turns out that each of the calls to ext2fs_resize_mem in the
> > ext2fs_add_dir_block2() function is costing us ~2ms for each call to this
> > function.  I'll add a new ext2fs_init_dblist() APi that lets us specify the
> > initial size of the list.  This seems to reduce the fsck runtime by a few more
> > seconds.
> 
> I suspect dblist is the wrong abstraction to use here.  Since the
> blocks we want to read ahead are generally going to be contiguous, why
> not use a rbtree bitmap, setting the blocks that should be subject to
> readahead, and then use the find_first_set() function to iterate over
> the bitmap?  Or if you want to be even more efficient, create an
> interator function which takes a bitmap and returns blocks that are
> set in the bitmap, one by one.

Hmm, I'll try that, though I'm almost tempted just to issue io_cache_readahead
calls directly from the for loop.  The only reason I'm using the list here at
all is to handle the case of reading multiple groups in a flexbg (pass5 RA).

--D
> 
> 						- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-11  6:31       ` Darrick J. Wong
@ 2014-08-11 14:34         ` Theodore Ts'o
  2014-08-11 18:05           ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-08-11 14:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4

On Sun, Aug 10, 2014 at 11:31:20PM -0700, Darrick J. Wong wrote:
> 
> Hmm, I'll try that, though I'm almost tempted just to issue io_cache_readahead
> calls directly from the for loop.  The only reason I'm using the list here at
> all is to handle the case of reading multiple groups in a flexbg (pass5 RA).

In that case you could just use an accumulator for each of the block
and inode bitmaps, and so long as the next group's block and inode
bitmaps are contiguous with the previous one, you can bump a counter.
Once you reach a discontiguous bitmap block, you can issue a single
readahead request for N blocks starting at block B.  That way you only
issue a single syscall, instead of one for every single block, and you
don't have the overhead involved with storing the list of blocks and
then sorting them.

					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-11 14:34         ` Theodore Ts'o
@ 2014-08-11 18:05           ` Darrick J. Wong
  2014-08-11 18:32             ` Theodore Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-11 18:05 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

On Mon, Aug 11, 2014 at 10:34:23AM -0400, Theodore Ts'o wrote:
> On Sun, Aug 10, 2014 at 11:31:20PM -0700, Darrick J. Wong wrote:
> > 
> > Hmm, I'll try that, though I'm almost tempted just to issue io_cache_readahead
> > calls directly from the for loop.  The only reason I'm using the list here at
> > all is to handle the case of reading multiple groups in a flexbg (pass5 RA).
> 
> In that case you could just use an accumulator for each of the block
> and inode bitmaps, and so long as the next group's block and inode
> bitmaps are contiguous with the previous one, you can bump a counter.
> Once you reach a discontiguous bitmap block, you can issue a single
> readahead request for N blocks starting at block B.  That way you only
> issue a single syscall, instead of one for every single block, and you
> don't have the overhead involved with storing the list of blocks and
> then sorting them.

Using the bitmap turns out to be pretty quick (~130us to start RA for 4 groups
vs. ~70us per group if I issue the RA directly).  Each fadvise call seems to
cost us ~1ms, so I'll keep using the bitmap to minimize the number of fadvise
calls, since it's also a lot less code.

--D

> 
> 					- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-11 18:05           ` Darrick J. Wong
@ 2014-08-11 18:32             ` Theodore Ts'o
  2014-08-11 18:55               ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-08-11 18:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4

On Mon, Aug 11, 2014 at 11:05:09AM -0700, Darrick J. Wong wrote:
> 
> Using the bitmap turns out to be pretty quick (~130us to start RA for 4 groups
> vs. ~70us per group if I issue the RA directly).  Each fadvise call seems to
> cost us ~1ms, so I'll keep using the bitmap to minimize the number of fadvise
> calls, since it's also a lot less code.

4 groups?  Since the default flex_bg size is 16 block groups, I would
have expected that you would want to start RA every 16 groups.

(And BTW, I've been wondering whether we should increase the flex_bg
size for bigger file systems.  By the time we get to 4TB disks, Having
a flex_bg every 2GB seems a little small.)

						- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-11 18:32             ` Theodore Ts'o
@ 2014-08-11 18:55               ` Darrick J. Wong
  2014-08-11 20:10                 ` Theodore Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-11 18:55 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

On Mon, Aug 11, 2014 at 02:32:58PM -0400, Theodore Ts'o wrote:
> On Mon, Aug 11, 2014 at 11:05:09AM -0700, Darrick J. Wong wrote:
> > 
> > Using the bitmap turns out to be pretty quick (~130us to start RA for 4 groups
> > vs. ~70us per group if I issue the RA directly).  Each fadvise call seems to
> > cost us ~1ms, so I'll keep using the bitmap to minimize the number of fadvise
> > calls, since it's also a lot less code.
> 
> 4 groups?  Since the default flex_bg size is 16 block groups, I would
> have expected that you would want to start RA every 16 groups.

I was expecting 16 groups (32M readahead) to win, but as the observations in my
spreadsheet show, 2MB tends to win.  I _think_ the reason is that if we
encounter indirect map blocks or ETB blocks, they tend to be fairly close to
the file blocks in the block group, and if we're trying to do a large readahead
at the same time, we end up with a largeish seek penalty (half the flexbg on
average) for every ETB/map block.

I figured out what was going on with the 1TB SSD -- it has a huge RAM cache big
enough to store most of the metadata.  At that point, reads are essentially
free, but readahead costs us ~1ms per fadvise call.  If you use a RA buffer
that's big enough that there aren't many fadvise calls then you still come out
ahead (ditto if you shove the RA into a separate thread) but otherwise the
fadvise calls add up, badly.

Actually, I'd considered using a default of flexbg_size * itable_size, but (a)
the USB results are pretty bad for 32M v. 2M, and (b) I was thinking that 2MB
of readahead might be small enough that we could enable it by default without
having to worry about the mal-effects of parallel e2fsck runs.

A logical next step might be to do ETB/block map readahead, but let's keep it
simple for now.  I should have time to update the spreadsheet to reflect
performance of the new bitmap code while I go mess with fixing the jbd2
problems.

> (And BTW, I've been wondering whether we should increase the flex_bg
> size for bigger file systems.  By the time we get to 4TB disks, Having
> a flex_bg every 2GB seems a little small.)

:)

--D
> 
> 						- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-11 18:55               ` Darrick J. Wong
@ 2014-08-11 20:10                 ` Theodore Ts'o
  2014-08-11 20:50                   ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-08-11 20:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4

On Mon, Aug 11, 2014 at 11:55:32AM -0700, Darrick J. Wong wrote:
> I was expecting 16 groups (32M readahead) to win, but as the observations in my
> spreadsheet show, 2MB tends to win.  I _think_ the reason is that if we
> encounter indirect map blocks or ETB blocks, they tend to be fairly close to
> the file blocks in the block group, and if we're trying to do a large readahead
> at the same time, we end up with a largeish seek penalty (half the flexbg on
> average) for every ETB/map block.

Hmm, that might be an argument for not trying to increase the flex_bg
size, since we want to keep seek distances within a flex_bg to be
dominated by settling time, and not by the track-to-track
accelleration/coasting/deaccelleration time.

> I figured out what was going on with the 1TB SSD -- it has a huge RAM cache big
> enough to store most of the metadata.  At that point, reads are essentially
> free, but readahead costs us ~1ms per fadvise call. 

Do we understand why fadvise() takes 1ms?   Is that something we can fix?

And readahead(2) was even worse, right?

							- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata
  2014-08-11 20:10                 ` Theodore Ts'o
@ 2014-08-11 20:50                   ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-11 20:50 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

On Mon, Aug 11, 2014 at 04:10:30PM -0400, Theodore Ts'o wrote:
> On Mon, Aug 11, 2014 at 11:55:32AM -0700, Darrick J. Wong wrote:
> > I was expecting 16 groups (32M readahead) to win, but as the observations in my
> > spreadsheet show, 2MB tends to win.  I _think_ the reason is that if we
> > encounter indirect map blocks or ETB blocks, they tend to be fairly close to
> > the file blocks in the block group, and if we're trying to do a large readahead
> > at the same time, we end up with a largeish seek penalty (half the flexbg on
> > average) for every ETB/map block.
> 
> Hmm, that might be an argument for not trying to increase the flex_bg
> size, since we want to keep seek distances within a flex_bg to be
> dominated by settling time, and not by the track-to-track
> accelleration/coasting/deaccelleration time.

It might not be too horrible of a regression, since the distance between tracks
has gotten shorter and cylinders themselves have gotten bigger.  I suppose
you'd have to test a variety of flexbg sizes against a disk from, say, 5 years
ago.  If you know the size of the files you'll be storing at mkfs time (such as
with the mk_hugefiles.c options) then increasing flexbg size is probably ok to
avoid fragmenting.

But yes, I was sort of enjoying how stuff within a flexbg gets (sort of) faster
as disks get bigger. :)

> > I figured out what was going on with the 1TB SSD -- it has a huge RAM cache big
> > enough to store most of the metadata.  At that point, reads are essentially
> > free, but readahead costs us ~1ms per fadvise call. 
> 
> Do we understand why fadvise() takes 1ms?   Is that something we can fix?
> 
> And readahead(2) was even worse, right?

>From the readahead(2) manpage:

"readahead() blocks until the specified data has been read."

The fadvise time is pretty consistently 1ms, but with readahead you have to
wait for it to read everything off the disk.  That's fine for threaded
readahead, but for our single-thread readahead it's not much better than
regular blocking reads.  Letting the kernel do the readahead in the background
is way faster.

I don't know why fadvise takes so long.  I'll ftrace it to see where it goes.

--D
> 
> 							- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] libext2fs: create inlinedata symlinks
  2014-08-09  4:26 ` [PATCH 1/6] libext2fs: create inlinedata symlinks Darrick J. Wong
@ 2014-08-24 16:15   ` Theodore Ts'o
  0 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-08-24 16:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4, Pu Hou

On Fri, Aug 08, 2014 at 09:26:17PM -0700, Darrick J. Wong wrote:
> Add to ext2fs_symlink the ability to create inline data symlinks.

Applied with the following modification to the test script and
expected out.  This makes the test more self-documenting --- by
looking at the expect script it's more clear what is going on --- and
it also makes it easier to understand what might have gone wrong if
the test fails.  In particular it includes the error messages when
trying to create a symlink longer than what is allowed given the file
system block size.

Thanks,

						- Ted

diff --git a/tests/f_create_symlinks/expect b/tests/f_create_symlinks/expect
index 847e092..13768a1 100644
--- a/tests/f_create_symlinks/expect
+++ b/tests/f_create_symlinks/expect
@@ -6,31 +6,80 @@ Pass 4: Checking reference counts
 Pass 5: Checking group summary information
 test_filesys: 11/128 files (0.0% non-contiguous), 441/1024 blocks
 Exit status is 0
-stat /l_30
+debugfs -R "symlink /l_30 /xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" test.img
+debugfs -R "symlink /l_70 /xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" test.img
+debugfs -R "symlink /l_500 /xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" test.img
+debugfs -R "symlink /l_1023 /xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" test.img
+debugfs -R "symlink /l_1024 /xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" test.img
+ext2fs_symlink: Invalid argument passed to ext2 library 
+symlink: Invalid argument passed to ext2 library 
+debugfs -R "symlink /l_1500 /xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" test.img
+ext2fs_symlink: Invalid argument passed to ext2 library 
+symlink: Invalid argument passed to ext2 library 
+debugfs -R "stat /l_30" test.img
 Inode: 12   Type: symlink    Mode:  0777   Flags: 0x0
+Generation: 0    Version: 0x00000000:00000000
 User:     0   Group:     0   Size: 31
+File ACL: 0    Directory ACL: 0
 Links: 1   Blockcount: 0
 Fragment:  Address: 0    Number: 0    Size: 0
+ ctime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+ atime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+ mtime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+crtime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+Size of extra inode fields: 28
 Fast link dest: "/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
-stat /l_70
+debugfs -R "stat /l_70" test.img
 Inode: 13   Type: symlink    Mode:  0777   Flags: 0x10000000
+Generation: 0    Version: 0x00000000:00000000
 User:     0   Group:     0   Size: 71
+File ACL: 0    Directory ACL: 0
 Links: 1   Blockcount: 0
 Fragment:  Address: 0    Number: 0    Size: 0
+ ctime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+ atime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+ mtime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+crtime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+Size of extra inode fields: 28
 Extended attributes:
   system.data (11)
 Fast link dest: "/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
-stat /l_500
+debugfs -R "stat /l_500" test.img
 Inode: 14   Type: symlink    Mode:  0777   Flags: 0x80000
+Generation: 0    Version: 0x00000000:00000000
 User:     0   Group:     0   Size: 501
+File ACL: 0    Directory ACL: 0
 Links: 1   Blockcount: 2
 Fragment:  Address: 0    Number: 0    Size: 0
-stat /l_1500
+ ctime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+ atime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+ mtime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+crtime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+Size of extra inode fields: 28
+EXTENTS:
+(0):153
+debugfs -R "stat /l_1023" test.img
+Inode: 15   Type: symlink    Mode:  0777   Flags: 0x80000
+Generation: 0    Version: 0x00000000:00000000
+User:     0   Group:     0   Size: 1024
+File ACL: 0    Directory ACL: 0
+Links: 1   Blockcount: 2
+Fragment:  Address: 0    Number: 0    Size: 0
+ ctime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+ atime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+ mtime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+crtime: 0x53fa0e06:00000000 -- Sun Aug 24 16:08:38 2014
+Size of extra inode fields: 28
+EXTENTS:
+(0):154
+debugfs -R "stat /l_1024" test.img
+/l_1024: File not found by ext2_lookup 
+debugfs -R "stat /l_1500" test.img
 /l_1500: File not found by ext2_lookup 
 Pass 1: Checking inodes, blocks, and sizes
 Pass 2: Checking directory structure
 Pass 3: Checking directory connectivity
 Pass 4: Checking reference counts
 Pass 5: Checking group summary information
-test_filesys: 14/128 files (0.0% non-contiguous), 442/1024 blocks
+test_filesys: 15/128 files (0.0% non-contiguous), 443/1024 blocks
 Exit status is 0
diff --git a/tests/f_create_symlinks/script b/tests/f_create_symlinks/script
index c49825a..958347a 100644
--- a/tests/f_create_symlinks/script
+++ b/tests/f_create_symlinks/script
@@ -23,15 +23,20 @@ echo Exit status is $status >> $OUT.new
 sed -f $cmd_dir/filter.sed -e "s;$TMPFILE;test.img;" $OUT.new >> $OUT
 rm -f $OUT.new
 
-for i in 30 70 500 1500; do
-	echo "symlink /l_$i /$(perl -e "print 'x' x $i;")"
-done | $DEBUGFS -w $TMPFILE >/dev/null 2>&1
+for i in 30 70 500 1023 1024 1500; do
+	echo "debugfs -R \"symlink /l_$i /$(perl -e "print 'x' x $i;")\" test.img" >> $OUT
+	$DEBUGFS -w -R "symlink /l_$i /$(perl -e "print 'x' x $i;")" $TMPFILE \
+		 2>&1 | sed -f $cmd_dir/filter.sed >> $OUT
+done
 
-for i in 30 70 500 1500; do
-	echo "stat /l_$i" >> $OUT
-	$DEBUGFS -R "stat /l_$i" $TMPFILE 2>&1 | egrep '(File not found|^Inode|^Fast link dest|Blockcount:|^Extended attributes:|system.data|Size:)' >> $OUT
+for i in 30 70 500 1023 1024 1500; do
+	echo "debugfs -R \"stat /l_$i\" test.img" >> $OUT
+	$DEBUGFS -R "stat /l_$i" $TMPFILE 2>&1 | \
+		 sed -f $cmd_dir/filter.sed >> $OUT
 done
 
+/bin/cp $TMPFILE /tmp/foo.img
+
 $FSCK $FSCK_OPT  -N test_filesys $TMPFILE > $OUT.new 2>&1
 status=$?
 echo Exit status is $status >> $OUT.new

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] misc: fix gcc warnings
  2014-08-09  4:26 ` [PATCH 2/6] misc: fix gcc warnings Darrick J. Wong
@ 2014-08-24 16:24   ` Theodore Ts'o
  0 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-08-24 16:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4

On Fri, Aug 08, 2014 at 09:26:24PM -0700, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Applied, thanks.

						- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] mke2fs: set block_validity as a default mount option
  2014-08-09  4:26 ` [PATCH 3/6] mke2fs: set block_validity as a default mount option Darrick J. Wong
@ 2014-08-24 22:47   ` Theodore Ts'o
  2014-08-25 15:52     ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-08-24 22:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4

On Fri, Aug 08, 2014 at 09:26:30PM -0700, Darrick J. Wong wrote:
> The block_validity mount option spot-checks block allocations against
> a bitmap of known group metadata blocks.  This helps us to prevent
> self-inflicted catastrophic failures such as trying to "share"
> critical metadata (think bitmaps) with file data, which usually
> results in filesystem destruction.
> 
> In order to test the overhead of the mount option, I re-used the speed
> tests in the metadata checksum testing script.  In short, the program
> creates what looks like 15 copies of a kernel source tree, except that
> it uses fallocate to strip out the overhead of writing the file data
> so that we can focus on metadata overhead.  On a 64G RAM disk, the
> overhead was generally about 0.9% and at most 1.6%.  On a 160G USB
> disk, the overhead was about 0.8% and peaked at 1.2%.

I was doing a spot check of the additional memory impact of
block_validity mount option, and it's for a 20T file system, assuming
the basic flex_bg size of 16 block groups, it's a bit over 400k of
kernel memory.  That's not a *huge* amount of memory, but it could
potentially be noticeable on a bookshelf NAS server.

However, I could imagine that for a system with say, two dozen 10T
drives (which aren't that far off in the future) in a tray, that's
around 4 megabytes of memory, which starts being non-trivial.

That being said, I suspect for most users, it's not that big of a deal
--- so maybe this is something we should just simply enable by default
in the kernel, let those folks who want to disable specify a
noblock_validity mount option.

The other thing to consider is that for big raid arrays, maybe we
should use a larger flex_bg size.  The main reason for keeping the
size small is to minimize the seek time between the inode table and a
block in the flex_bg.  But for raid devices, we could probably afford
to increase flex_bg size, which would decrease the numer of system
zones that the block validity code would need to track.

      	       	     	      	   	      - Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] mke2fs: set block_validity as a default mount option
  2014-08-24 22:47   ` Theodore Ts'o
@ 2014-08-25 15:52     ` Darrick J. Wong
  2014-08-25 16:36       ` [PATCH] ext4: enable block_validity by default Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-25 15:52 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

On Sun, Aug 24, 2014 at 06:47:21PM -0400, Theodore Ts'o wrote:
> On Fri, Aug 08, 2014 at 09:26:30PM -0700, Darrick J. Wong wrote:
> > The block_validity mount option spot-checks block allocations against
> > a bitmap of known group metadata blocks.  This helps us to prevent
> > self-inflicted catastrophic failures such as trying to "share"
> > critical metadata (think bitmaps) with file data, which usually
> > results in filesystem destruction.
> > 
> > In order to test the overhead of the mount option, I re-used the speed
> > tests in the metadata checksum testing script.  In short, the program
> > creates what looks like 15 copies of a kernel source tree, except that
> > it uses fallocate to strip out the overhead of writing the file data
> > so that we can focus on metadata overhead.  On a 64G RAM disk, the
> > overhead was generally about 0.9% and at most 1.6%.  On a 160G USB
> > disk, the overhead was about 0.8% and peaked at 1.2%.
> 
> I was doing a spot check of the additional memory impact of
> block_validity mount option, and it's for a 20T file system, assuming
> the basic flex_bg size of 16 block groups, it's a bit over 400k of
> kernel memory.  That's not a *huge* amount of memory, but it could
> potentially be noticeable on a bookshelf NAS server.
> 
> However, I could imagine that for a system with say, two dozen 10T
> drives (which aren't that far off in the future) in a tray, that's
> around 4 megabytes of memory, which starts being non-trivial.
> 
> That being said, I suspect for most users, it's not that big of a deal
> --- so maybe this is something we should just simply enable by default
> in the kernel, let those folks who want to disable specify a
> noblock_validity mount option.

Should there be a noblock_validity default mount option?

I suppose I can simply send in a one-liner making b_v the kernel default and
see if anyone screams....

> The other thing to consider is that for big raid arrays, maybe we
> should use a larger flex_bg size.  The main reason for keeping the
> size small is to minimize the seek time between the inode table and a
> block in the flex_bg.  But for raid devices, we could probably afford
> to increase flex_bg size, which would decrease the numer of system
> zones that the block validity code would need to track.

One could make the default flexbg size = 16 * stride_width / stripe_width as a
start.

--D
> 
>       	       	     	      	   	      - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] ext4: enable block_validity by default
  2014-08-25 15:52     ` Darrick J. Wong
@ 2014-08-25 16:36       ` Darrick J. Wong
  2014-09-02  2:02         ` Theodore Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2014-08-25 16:36 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

Enable by default the block_validity feature, which checks for
collisions between newly allocated blocks and critical system
metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ext4/super.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b28b36..bbf515c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3519,8 +3519,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		set_opt(sb, ERRORS_CONT);
 	else
 		set_opt(sb, ERRORS_RO);
-	if (def_mount_opts & EXT4_DEFM_BLOCK_VALIDITY)
-		set_opt(sb, BLOCK_VALIDITY);
+	/* block_validity enabled by default; disable with noblock_validity */
+	set_opt(sb, BLOCK_VALIDITY);
 	if (def_mount_opts & EXT4_DEFM_DISCARD)
 		set_opt(sb, DISCARD);
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] ext4: enable block_validity by default
  2014-08-25 16:36       ` [PATCH] ext4: enable block_validity by default Darrick J. Wong
@ 2014-09-02  2:02         ` Theodore Ts'o
  0 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-09-02  2:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-ext4

On Mon, Aug 25, 2014 at 09:36:32AM -0700, Darrick J. Wong wrote:
> Enable by default the block_validity feature, which checks for
> collisions between newly allocated blocks and critical system
> metadata.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Applied, thanks.

					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-09-02  2:02 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-09  4:26 [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Darrick J. Wong
2014-08-09  4:26 ` [PATCH 1/6] libext2fs: create inlinedata symlinks Darrick J. Wong
2014-08-24 16:15   ` Theodore Ts'o
2014-08-09  4:26 ` [PATCH 2/6] misc: fix gcc warnings Darrick J. Wong
2014-08-24 16:24   ` Theodore Ts'o
2014-08-09  4:26 ` [PATCH 3/6] mke2fs: set block_validity as a default mount option Darrick J. Wong
2014-08-24 22:47   ` Theodore Ts'o
2014-08-25 15:52     ` Darrick J. Wong
2014-08-25 16:36       ` [PATCH] ext4: enable block_validity by default Darrick J. Wong
2014-09-02  2:02         ` Theodore Ts'o
2014-08-09  4:26 ` [PATCH 4/6] ext2fs: add readahead method to improve scanning Darrick J. Wong
2014-08-09  4:26 ` [PATCH 5/6] libext2fs/e2fsck: provide routines to read-ahead metadata Darrick J. Wong
2014-08-11  5:21   ` Darrick J. Wong
2014-08-11  6:24     ` Theodore Ts'o
2014-08-11  6:31       ` Darrick J. Wong
2014-08-11 14:34         ` Theodore Ts'o
2014-08-11 18:05           ` Darrick J. Wong
2014-08-11 18:32             ` Theodore Ts'o
2014-08-11 18:55               ` Darrick J. Wong
2014-08-11 20:10                 ` Theodore Ts'o
2014-08-11 20:50                   ` Darrick J. Wong
2014-08-09  4:26 ` [PATCH 6/6] e2fsck: read-ahead metadata during passes 1, 2, and 4 Darrick J. Wong
2014-08-09  5:53 ` [PATCH 0/6] e2fsprogs Summer 2014 patchbomb, part 5 Theodore Ts'o
2014-08-09  5:59   ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.