* [RFC PATCH v2 0/3] e2fsck metadata prefetch
@ 2014-02-01 10:37 Darrick J. Wong
  2014-02-01 10:37 ` [PATCH 1/3] ext2fs: add readahead method to improve scanning Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Darrick J. Wong @ 2014-02-01 10:37 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4

This is the second go at a patchset that tries to reduce e2fsck run
times by pre-loading ext4 metadata concurrently with e2fsck execution.
The first patch is Andreas Dilger's patch to add a readahead method to
the IO manager interface.  The second patch extends libext2fs with a
function call to invoke readahead on a list of blocks, and a second
call that invokes readahead on the bitmaps and inode tables of a bunch
of groups.  The third patch enhances e2fsck to start threads that call
the readahead functions.

Crude testing has been done via:
# echo 3 > /proc/sys/vm/drop_caches
# READAHEAD=1 /usr/bin/time ./e2fsck/e2fsck -Fnfvtt /dev/XXX

So far in my crude testing on a cold system, I've seen about a 20%
speedup on an SSD, about a 40% speedup on a 3x RAID1 SATA array, and maybe
a 5% speedup on a single-spindle SATA disk.  On a single-queue USB
HDD, performance doesn't change much.  It looks as though in general,
single-spindle HDDs will not benefit, which doesn't surprise me.  The
SSD numbers are harder to quantify since they're already fast.

This second version of the patch uses posix_fadvise to hint to the
kernel that it really wants to have the blocks loaded in the page
cache ready to go.  This is much easier to manage, because all we need
to do is throw a list of blocks at it and let it go... and if we're
careful not to change any FS state, we can easily offload the
readahead work to a thread without weird crashes.
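
For illustration, the hint boils down to one advisory call per block
range.  The sketch below is not the patch's code: prefetch_blocks() and
its parameters are hypothetical names, and it assumes a POSIX system
that provides posix_fadvise().

```c
#define _XOPEN_SOURCE 600	/* expose posix_fadvise() */
#define _FILE_OFFSET_BITS 64	/* 64-bit off_t on 32-bit hosts */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of the patch set): ask the kernel to
 * start loading `count` blocks, beginning at `block`, into the page
 * cache.  Returns 0 on success or an errno value, which is the same
 * convention posix_fadvise() itself uses.
 */
static int prefetch_blocks(int fd, unsigned long long block,
			   unsigned long long count, unsigned int block_size)
{
	off_t offset, len;

	if (block_size == 0)
		return EINVAL;

	offset = (off_t)block * block_size;
	len = (off_t)count * block_size;

	/* Advisory only: the call does not wait for the reads to finish. */
	return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}
```

Since the call is only a hint, a caller can treat a nonzero return as
"no prefetch happened" and carry on with ordinary reads.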

Note that this draft code does little to prevent page cache thrashing.
It doesn't hold back from issuing a large flood of IO.  It's not clear
if it's better to try to constrain how far the prefetcher gets ahead
of the checker code, or better to let the kernel sort it out.
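
One way to constrain the prefetcher would be a sliding window of block
groups.  The sketch below only shows the arithmetic; struct ra_window,
next_ra_window() and the window size are assumptions made for
illustration, and nothing like this exists in the posted patches.

```c
/* Hypothetical types and names; not part of the posted patches. */
struct ra_window {
	unsigned int start;	/* first group to prefetch */
	unsigned int ngroups;	/* groups to prefetch now; 0 means wait */
};

/*
 * Given the group the checker is currently scanning and the number of
 * groups already handed to readahead ([0, issued_upto)), return the next
 * range to prefetch so that the prefetcher never gets more than `window`
 * groups ahead of the checker.
 */
static struct ra_window next_ra_window(unsigned int checker_group,
				       unsigned int issued_upto,
				       unsigned int window,
				       unsigned int group_count)
{
	struct ra_window w = { issued_upto, 0 };
	unsigned int limit;

	if (checker_group > group_count)
		checker_group = group_count;

	/* Highest group (exclusive) the prefetcher may reach right now. */
	if (window > group_count - checker_group)
		limit = group_count;	/* also avoids unsigned overflow */
	else
		limit = checker_group + window;

	if (issued_upto < limit)
		w.ngroups = limit - issued_upto;
	return w;
}
```

The returned start/ngroups pair has the same shape as the last two
arguments of ext2fs_readahead() in patch 2, so a bounded prefetcher
could call it once per window instead of once for the whole filesystem.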

I've tested these e2fsprogs changes against the -next branch as of
1/31.  These days, I use an 8GB ramdisk and whatever hardware I have
lying around.  The make check tests should pass.

Comments and questions are, as always, welcome.

--D

* [PATCH 1/3] ext2fs: add readahead method to improve scanning
  2014-02-01 10:37 [RFC PATCH v2 0/3] e2fsck metadata prefetch Darrick J. Wong
@ 2014-02-01 10:37 ` Darrick J. Wong
  2014-02-01 10:37 ` [PATCH 2/3] libext2fs: allow clients to read-ahead metadata Darrick J. Wong
  2014-02-01 10:37 ` [PATCH 3/3] e2fsck: read-ahead metadata during pass1 and pass2 Darrick J. Wong
  2 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2014-02-01 10:37 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4, Andreas Dilger

From: Andreas Dilger <adilger@whamcloud.com>

Add a readahead method for prefetching ranges of disk blocks.
This is useful for inode table scanning, and other large
contiguous ranges of blocks, and may also prove useful for
random block prefetch, since it will allow reordering of the
IO without waiting synchronously for the reads to complete.

It is currently using the posix_fadvise(POSIX_FADV_WILLNEED)
interface, as this proved most efficient during our testing.

[darrick.wong@oracle.com]
Make the arguments to the readahead function take the same
ULL values as the other IO functions, and return an
appropriate error code when fadvise isn't available.

Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 lib/ext2fs/ext2_io.h    |    4 ++++
 lib/ext2fs/io_manager.c |    9 +++++++++
 lib/ext2fs/unix_io.c    |   28 +++++++++++++++++++++++++---
 3 files changed, 38 insertions(+), 3 deletions(-)


diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index 1894fb8..00e22e0 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -90,6 +90,8 @@ struct struct_io_manager {
 					int count, const void *data);
 	errcode_t (*discard)(io_channel channel, unsigned long long block,
 			     unsigned long long count);
+	errcode_t (*readahead)(io_channel channel, unsigned long long block,
+			       unsigned long long count);
 	long	reserved[16];
 };
 
@@ -124,6 +126,8 @@ extern errcode_t io_channel_discard(io_channel channel,
 				    unsigned long long count);
 extern errcode_t io_channel_alloc_buf(io_channel channel,
 				      int count, void *ptr);
+extern errcode_t io_channel_readahead(io_channel io, unsigned long long block,
+				      unsigned long long count);
 
 /* unix_io.c */
 extern io_manager unix_io_manager;
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index 34e4859..1acbb1d 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -128,3 +128,12 @@ errcode_t io_channel_alloc_buf(io_channel io, int count, void *ptr)
 	else
 		return ext2fs_get_mem(size, ptr);
 }
+
+errcode_t io_channel_readahead(io_channel io, unsigned long long block,
+			       unsigned long long count)
+{
+	if (!io->manager->readahead)
+		return EXT2_ET_OP_NOT_SUPPORTED;
+
+	return io->manager->readahead(io, block, nblocks);
+}
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 0cc0f52..bc4490c 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -15,6 +15,9 @@
  * %End-Header%
  */
 
+#define _XOPEN_SOURCE 600
+#define _DARWIN_C_SOURCE
+#define _FILE_OFFSET_BITS 64
 #define _LARGEFILE_SOURCE
 #define _LARGEFILE64_SOURCE
 #ifndef _GNU_SOURCE
@@ -35,6 +38,9 @@
 #ifdef __linux__
 #include <sys/utsname.h>
 #endif
+#if HAVE_SYS_TYPES_H
+#include <sys/types.h>
+#endif
 #ifdef HAVE_SYS_IOCTL_H
 #include <sys/ioctl.h>
 #endif
@@ -44,9 +50,6 @@
 #if HAVE_SYS_STAT_H
 #include <sys/stat.h>
 #endif
-#if HAVE_SYS_TYPES_H
-#include <sys/types.h>
-#endif
 #if HAVE_SYS_RESOURCE_H
 #include <sys/resource.h>
 #endif
@@ -119,6 +122,8 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 				int count, const void *data);
 static errcode_t unix_discard(io_channel channel, unsigned long long block,
 			      unsigned long long count);
+static errcode_t unix_readahead(io_channel channel, unsigned long long block,
+				unsigned long long count);
 
 static struct struct_io_manager struct_unix_manager = {
 	EXT2_ET_MAGIC_IO_MANAGER,
@@ -135,6 +140,7 @@ static struct struct_io_manager struct_unix_manager = {
 	unix_read_blk64,
 	unix_write_blk64,
 	unix_discard,
+	unix_readahead,
 };
 
 io_manager unix_io_manager = &struct_unix_manager;
@@ -828,6 +834,22 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
 #endif /* NO_IO_CACHE */
 }
 
+static errcode_t unix_readahead(io_channel channel, unsigned long long block,
+				unsigned long long count)
+{
+#ifdef POSIX_FADV_WILLNEED
+	struct unix_private_data *data;
+
+	data = (struct unix_private_data *)channel->private_data;
+	posix_fadvise(data->dev, (ext2_loff_t)block * channel->block_size,
+		      (ext2_loff_t)count * channel->block_size,
+		      POSIX_FADV_WILLNEED);
+	return 0;
+#else
+	return EXT2_ET_OP_NOT_SUPPORTED;
+#endif
+}
+
 static errcode_t unix_write_blk(io_channel channel, unsigned long block,
 				int count, const void *buf)
 {

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* [PATCH 2/3] libext2fs: allow clients to read-ahead metadata
  2014-02-01 10:37 [RFC PATCH v2 0/3] e2fsck metadata prefetch Darrick J. Wong
  2014-02-01 10:37 ` [PATCH 1/3] ext2fs: add readahead method to improve scanning Darrick J. Wong
@ 2014-02-01 10:37 ` Darrick J. Wong
  2014-02-01 10:41   ` Darrick J. Wong
  2014-02-03 21:32   ` Andreas Dilger
  2014-02-01 10:37 ` [PATCH 3/3] e2fsck: read-ahead metadata during pass1 and pass2 Darrick J. Wong
  2 siblings, 2 replies; 9+ messages in thread
From: Darrick J. Wong @ 2014-02-01 10:37 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4

This patch adds to libext2fs the ability to pre-fetch metadata
into the page cache in the hopes of speeding up libext2fs' clients.
There are two new library functions -- the first allows a client to
readahead a list of blocks, and the second is a helper function that
uses that first mechanism to load group data (bitmaps, inode tables).

e2fsck will employ both of these methods to speed itself up.
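
The block-list idea can be sketched as "sort, then merge neighbours
into runs", so that each contiguous range costs one readahead call.
This is a simplified stand-in rather than the code in readahead.c: it
works on a plain array with one block per entry, and blk64, struct
blk_run and coalesce_runs() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned long long blk64;	/* stand-in for libext2fs blk64_t */

struct blk_run {
	blk64 start;	/* first block of the run */
	blk64 len;	/* number of contiguous blocks */
};

/* Compare without subtracting, so large 64-bit gaps keep their sign. */
static int cmp_blk(const void *a, const void *b)
{
	blk64 x = *(const blk64 *)a;
	blk64 y = *(const blk64 *)b;

	return (x > y) - (x < y);
}

/*
 * Sort blks[] in place and merge adjacent block numbers into runs.
 * Duplicates are folded into the run that already covers them.
 * Stores at most max_runs runs and returns how many were stored.
 */
static size_t coalesce_runs(blk64 *blks, size_t n,
			    struct blk_run *runs, size_t max_runs)
{
	size_t nruns = 0, i;

	assert(blks != NULL && runs != NULL);
	qsort(blks, n, sizeof(*blks), cmp_blk);

	for (i = 0; i < n; i++) {
		if (nruns) {
			struct blk_run *last = &runs[nruns - 1];

			if (blks[i] < last->start + last->len)
				continue;	/* duplicate block */
			if (blks[i] == last->start + last->len) {
				last->len++;	/* extends the current run */
				continue;
			}
		}
		if (nruns == max_runs)
			break;
		runs[nruns].start = blks[i];
		runs[nruns].len = 1;
		nruns++;
	}
	return nruns;
}
```

Each resulting run maps onto one (block, count) pair of the kind
io_channel_readahead() takes.  The comparator uses comparisons rather
than a subtraction cast to int, because a 64-bit difference truncated to
int can come out with the wrong sign.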

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 lib/ext2fs/Makefile.in  |    4 +
 lib/ext2fs/ext2fs.h     |   10 +++
 lib/ext2fs/io_manager.c |    2 -
 lib/ext2fs/readahead.c  |  153 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 168 insertions(+), 1 deletion(-)
 create mode 100644 lib/ext2fs/readahead.c


diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index 92b6ab0..8f98f4b 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -77,6 +77,7 @@ OBJS= $(DEBUGFS_LIB_OBJS) $(RESIZE_LIB_OBJS) $(E2IMAGE_LIB_OBJS) \
 	qcow2.o \
 	read_bb.o \
 	read_bb_file.o \
+	readahead.o \
 	res_gdt.o \
 	rw_bitmaps.o \
 	swapfs.o \
@@ -153,6 +154,7 @@ SRCS= ext2_err.c \
 	$(srcdir)/qcow2.c \
 	$(srcdir)/read_bb.c \
 	$(srcdir)/read_bb_file.c \
+	$(srcdir)/readahead.c \
 	$(srcdir)/res_gdt.c \
 	$(srcdir)/rw_bitmaps.c \
 	$(srcdir)/swapfs.c \
@@ -887,6 +889,8 @@ read_bb_file.o: $(srcdir)/read_bb_file.c $(top_builddir)/lib/config.h \
  $(srcdir)/ext2_fs.h $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h \
  $(srcdir)/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
  $(srcdir)/ext2_ext_attr.h $(srcdir)/bitops.h
+readahead.o: $(srcdir)/readahead.c $(top_builddir)/lib/config.h \
+ $(srcdir)/ext2fs.h $(srcdir)/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_err.h
 res_gdt.o: $(srcdir)/res_gdt.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(srcdir)/ext2_fs.h \
  $(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2fs.h \
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 069c1b6..1e06791 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -1543,6 +1543,16 @@ extern errcode_t ext2fs_read_bb_FILE(ext2_filsys fs, FILE *f,
 				     void (*invalid)(ext2_filsys fs,
 						     blk_t blk));
 
+/* readahead.c */
+#define EXT2FS_READ_SUPER	0x01
+#define EXT2FS_READ_GDT		0x02
+#define EXT2FS_READ_BBITMAP	0x04
+#define EXT2FS_READ_IBITMAP	0x08
+#define EXT2FS_READ_ITABLE	0x10
+errcode_t ext2fs_readahead(ext2_filsys fs, int flags, dgrp_t start,
+			   dgrp_t ngroups);
+errcode_t ext2fs_readahead_dblist(ext2_filsys fs, ext2_dblist dblist);
+
 /* res_gdt.c */
 extern errcode_t ext2fs_create_resize_inode(ext2_filsys fs);
 
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index 1acbb1d..aae0a4b 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -135,5 +135,5 @@ errcode_t io_channel_readahead(io_channel io, unsigned long long block,
 	if (!io->manager->readahead)
 		return EXT2_ET_OP_NOT_SUPPORTED;
 
-	return io->manager->readahead(io, block, nblocks);
+	return io->manager->readahead(io, block, count);
 }
diff --git a/lib/ext2fs/readahead.c b/lib/ext2fs/readahead.c
new file mode 100644
index 0000000..05f6135
--- /dev/null
+++ b/lib/ext2fs/readahead.c
@@ -0,0 +1,153 @@
+/*
+ * readahead.c -- Try to convince the OS to prefetch metadata.
+ *
+ * Copyright (C) 2014 Oracle.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Library
+ * General Public License, version 2.
+ * %End-Header%
+ */
+
+#include "config.h"
+#include <string.h>
+
+#include "ext2_fs.h"
+#include "ext2fs.h"
+
+struct read_dblist {
+	errcode_t err;
+	blk64_t run_start;
+	blk64_t run_len;
+};
+
+static EXT2_QSORT_TYPE readahead_dir_block_cmp(const void *a, const void *b)
+{
+	const struct ext2_db_entry2 *db_a =
+		(const struct ext2_db_entry2 *) a;
+	const struct ext2_db_entry2 *db_b =
+		(const struct ext2_db_entry2 *) b;
+
+	return (int) (db_a->blk - db_b->blk);
+}
+
+static int readahead_dir_block(ext2_filsys fs, struct ext2_db_entry2 *db,
+			       void *priv_data)
+{
+	errcode_t err = 0;
+	struct read_dblist *pr = priv_data;
+
+	if (!pr->run_len || db->blk != pr->run_start + pr->run_len) {
+		if (pr->run_len)
+			pr->err = io_channel_readahead(fs->io, pr->run_start,
+						       pr->run_len);
+		pr->run_start = db->blk;
+		pr->run_len = 0;
+	}
+	pr->run_len += db->blockcnt;
+
+	return pr->err ? DBLIST_ABORT : 0;
+}
+
+errcode_t ext2fs_readahead_dblist(ext2_filsys fs, ext2_dblist dblist)
+{
+	errcode_t err;
+	struct read_dblist pr;
+
+	ext2fs_dblist_sort2(dblist, readahead_dir_block_cmp);
+
+	memset(&pr, 0, sizeof(pr));
+	err = ext2fs_dblist_iterate2(dblist, readahead_dir_block, &pr);
+	if (pr.err)
+		return pr.err;
+	if (err)
+		return err;
+
+	if (pr.run_len)
+		err = io_channel_readahead(fs->io, pr.run_start, pr.run_len);
+
+	return err;
+}
+
+errcode_t ext2fs_readahead(ext2_filsys fs, int flags, dgrp_t start,
+			   dgrp_t ngroups)
+{
+	blk64_t		super, old_gdt, new_gdt;
+	blk_t		blocks;
+	dgrp_t		i;
+	ext2_dblist	dblist;
+	dgrp_t		end = start + ngroups;
+	errcode_t	err = 0;
+
+	if (end > fs->group_desc_count)
+		end = fs->group_desc_count;
+
+	if (flags == 0)
+		return 0;
+
+	err = ext2fs_init_dblist(fs, &dblist);
+	if (err)
+		return err;
+
+	for (i = start; i < end; i++) {
+		err = ext2fs_super_and_bgd_loc2(fs, i, &super, &old_gdt,
+						&new_gdt, &blocks);
+		if (err)
+			break;
+
+		if (flags & EXT2FS_READ_SUPER) {
+			err = ext2fs_add_dir_block2(dblist, 0, super, 0);
+			if (err)
+				break;
+		}
+
+		if (flags & EXT2FS_READ_GDT) {
+			if (old_gdt)
+				err = ext2fs_add_dir_block2(dblist, 0, old_gdt,
+							    blocks);
+			else if (new_gdt)
+				err = ext2fs_add_dir_block2(dblist, 0, new_gdt,
+							    blocks);
+			else
+				err = 0;
+			if (err)
+				break;
+		}
+
+		if ((flags & EXT2FS_READ_BBITMAP) &&
+		    !ext2fs_bg_flags_test(fs, i, EXT2_BG_BLOCK_UNINIT) &&
+		    ext2fs_bg_free_blocks_count(fs, i) <
+				fs->super->s_blocks_per_group) {
+			super = ext2fs_block_bitmap_loc(fs, i);
+			err = ext2fs_add_dir_block2(dblist, 0, super, 1);
+			if (err)
+				break;
+		}
+
+		if ((flags & EXT2FS_READ_IBITMAP) &&
+		    !ext2fs_bg_flags_test(fs, i, EXT2_BG_INODE_UNINIT) &&
+		    ext2fs_bg_free_inodes_count(fs, i) <
+				fs->super->s_inodes_per_group) {
+			super = ext2fs_inode_bitmap_loc(fs, i);
+			err = ext2fs_add_dir_block2(dblist, 0, super, 1);
+			if (err)
+				break;
+		}
+
+		if ((flags & EXT2FS_READ_ITABLE) &&
+		    ext2fs_bg_free_inodes_count(fs, i) <
+				fs->super->s_inodes_per_group) {
+			super = ext2fs_inode_table_loc(fs, i);
+			err = ext2fs_add_dir_block2(dblist, 0, super,
+					fs->inode_blocks_per_group);
+			if (err)
+				break;
+		}
+	}
+
+	if (!err)
+		err = ext2fs_readahead_dblist(fs, dblist);
+
+	ext2fs_free_dblist(dblist);
+	return err;
+}


* [PATCH 3/3] e2fsck: read-ahead metadata during pass1 and pass2
  2014-02-01 10:37 [RFC PATCH v2 0/3] e2fsck metadata prefetch Darrick J. Wong
  2014-02-01 10:37 ` [PATCH 1/3] ext2fs: add readahead method to improve scanning Darrick J. Wong
  2014-02-01 10:37 ` [PATCH 2/3] libext2fs: allow clients to read-ahead metadata Darrick J. Wong
@ 2014-02-01 10:37 ` Darrick J. Wong
  2014-02-03 21:20   ` Andreas Dilger
  2 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2014-02-01 10:37 UTC (permalink / raw)
  To: tytso, darrick.wong; +Cc: linux-ext4

e2fsck pass1 is modified to use the block group data prefetch function
to try to fetch the data into the pagecache before it is needed.
pass2 is modified to use the dirblock prefetching function to prefetch
the list of directory blocks that are assembled in pass1.

In general, these mechanisms can halve fsck time... if the host system
has sufficient memory.  SSDs and multi-spindle RAIDs see the most
speedup, and single-spindle USB mass storage devices see hardly any
benefit.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 MCONFIG.in         |    1 +
 configure          |   47 ++++++++++++++++++++++++++++++++++++++++
 configure.in       |    5 ++++
 e2fsck/Makefile.in |    4 ++-
 e2fsck/pass1.c     |   26 ++++++++++++++++++++++
 e2fsck/pass2.c     |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/config.h.in    |    6 +++++
 7 files changed, 148 insertions(+), 2 deletions(-)


diff --git a/MCONFIG.in b/MCONFIG.in
index 114de0a..528c35e 100644
--- a/MCONFIG.in
+++ b/MCONFIG.in
@@ -111,6 +111,7 @@ LIBFUSE = @FUSE_LIB@
 LIBQUOTA = @STATIC_LIBQUOTA@
 LIBBLKID = @LIBBLKID@ @PRIVATE_LIBS_CMT@ $(LIBUUID)
 LIBINTL = @LIBINTL@
+LIBPTHREADS = @PTHREADS_LIB@
 SYSLIBS = @LIBS@
 DEPLIBSS = $(LIB)/libss@LIB_EXT@
 DEPLIBCOM_ERR = $(LIB)/libcom_err@LIB_EXT@
diff --git a/configure b/configure
index 5d032ce..f1f9b1b 100755
--- a/configure
+++ b/configure
@@ -639,6 +639,7 @@ CYGWIN_CMT
 LINUX_CMT
 UNI_DIFF_OPTS
 SEM_INIT_LIB
+PTHREADS_LIB
 FUSE_CMT
 FUSE_LIB
 SOCKET_LIB
@@ -11492,6 +11493,52 @@ if test $ac_cv_have_optreset = yes; then
 $as_echo "#define HAVE_OPTRESET 1" >>confdefs.h
 
 fi
+PTHREADS_LIB='-lpthread'
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for pthread_create in -lpthread" >&5
+$as_echo_n "checking for pthread_create in -lpthread... " >&6; }
+if ${ac_cv_lib_pthread_pthread_create+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpthread  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pthread_create ();
+int
+main ()
+{
+return pthread_create ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pthread_pthread_create=yes
+else
+  ac_cv_lib_pthread_pthread_create=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pthread_pthread_create" >&5
+$as_echo "$ac_cv_lib_pthread_pthread_create" >&6; }
+if test "x$ac_cv_lib_pthread_pthread_create" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPTHREAD 1
+_ACEOF
+
+  LIBS="-lpthread $LIBS"
+
+fi
+
 
 SEM_INIT_LIB=''
 ac_fn_c_check_func "$LINENO" "sem_init" "ac_cv_func_sem_init"
diff --git a/configure.in b/configure.in
index 2eda7ae..f130d7e 100644
--- a/configure.in
+++ b/configure.in
@@ -1212,6 +1212,11 @@ if test $ac_cv_have_optreset = yes; then
   AC_DEFINE(HAVE_OPTRESET, 1, [Define to 1 if optreset for getopt is present])
 fi
 dnl
+dnl Test for pthread_create in -lpthread
+dnl
+PTHREADS_LIB='-lpthread'
+AC_CHECK_LIB(pthread, pthread_create, AC_SUBST(PTHREADS_LIB))
+dnl
 dnl Test for sem_init, and which library it might require:
 dnl
 AH_TEMPLATE([HAVE_SEM_INIT], [Define to 1 if sem_init() exists])
diff --git a/e2fsck/Makefile.in b/e2fsck/Makefile.in
index 8ca329b..7e8e78e 100644
--- a/e2fsck/Makefile.in
+++ b/e2fsck/Makefile.in
@@ -16,13 +16,13 @@ MANPAGES=	e2fsck.8
 FMANPAGES=	e2fsck.conf.5
 
 LIBS= $(LIBQUOTA) $(LIBEXT2FS) $(LIBCOM_ERR) $(LIBBLKID) $(LIBUUID) \
-	$(LIBINTL) $(LIBE2P) $(SYSLIBS)
+	$(LIBINTL) $(LIBE2P) $(SYSLIBS) $(LIBPTHREADS)
 DEPLIBS= $(DEPLIBQUOTA) $(LIBEXT2FS) $(DEPLIBCOM_ERR) $(DEPLIBBLKID) \
 	 $(DEPLIBUUID) $(DEPLIBE2P)
 
 STATIC_LIBS= $(STATIC_LIBQUOTA) $(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) \
 	     $(STATIC_LIBBLKID) $(STATIC_LIBUUID) $(LIBINTL) $(STATIC_LIBE2P) \
-	     $(SYSLIBS)
+	     $(SYSLIBS) $(LIBPTHREADS)
 STATIC_DEPLIBS= $(DEPSTATIC_LIBQUOTA) $(STATIC_LIBEXT2FS) \
 		$(DEPSTATIC_LIBCOM_ERR) $(DEPSTATIC_LIBBLKID) \
 		$(DEPSTATIC_LIBUUID) $(DEPSTATIC_LIBE2P)
diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
index 7554f4e..590e1bd 100644
--- a/e2fsck/pass1.c
+++ b/e2fsck/pass1.c
@@ -44,6 +44,9 @@
 #ifdef HAVE_ERRNO_H
 #include <errno.h>
 #endif
+#ifdef HAVE_PTHREAD_H
+#include <pthread.h>
+#endif
 
 #include "e2fsck.h"
 #include <ext2fs/ext2_ext_attr.h>
@@ -574,6 +577,20 @@ static errcode_t recheck_bad_inode_checksum(ext2_filsys fs, ext2_ino_t ino,
 	return 0;
 }
 
+static void *pass1_readahead(void *p)
+{
+	errcode_t err;
+	e2fsck_t ctx = (e2fsck_t)p;
+
+	printf("%s: START READAHEAD\n", __func__);
+	err = ext2fs_readahead(ctx->fs, EXT2FS_READ_BBITMAP |
+			       EXT2FS_READ_IBITMAP | EXT2FS_READ_ITABLE,
+			       0, ctx->fs->group_desc_count);
+	printf("%s: READAHEAD=%d\n", __func__, (int)err);
+
+	return NULL;
+}
+
 void e2fsck_pass1(e2fsck_t ctx)
 {
 	int	i;
@@ -600,6 +617,15 @@ void e2fsck_pass1(e2fsck_t ctx)
 	init_resource_track(&rtrack, ctx->fs->io);
 	clear_problem_context(&pctx);
 
+	if (getenv("READAHEAD")) {
+#ifdef HAVE_PTHREAD_H
+		pthread_t tid;
+		pthread_create(&tid, NULL, pass1_readahead, ctx);
+#else
+		pass1_readahead(ctx);
+#endif
+	}
+
 	if (!(ctx->options & E2F_OPT_PREEN))
 		fix_problem(ctx, PR_1_PASS_HEADER, &pctx);
 
diff --git a/e2fsck/pass2.c b/e2fsck/pass2.c
index 5a2745a..bd7323f 100644
--- a/e2fsck/pass2.c
+++ b/e2fsck/pass2.c
@@ -44,6 +44,9 @@
 #define _GNU_SOURCE 1 /* get strnlen() */
 #include "config.h"
 #include <string.h>
+#ifdef HAVE_PTHREAD_H
+#include <pthread.h>
+#endif
 
 #include "e2fsck.h"
 #include "problem.h"
@@ -79,6 +82,29 @@ struct check_dir_struct {
 	e2fsck_t ctx;
 };
 
+struct pass2_readahead_data {
+	ext2_filsys fs;
+	ext2_dblist dblist;
+};
+
+static int readahead_dir_block(ext2_filsys fs, struct ext2_db_entry2 *db,
+			       void *priv_data)
+{
+	db->blockcnt = 1;
+}
+
+static void *pass2_readahead(void *p)
+{
+	errcode_t err;
+	struct pass2_readahead_data *pr = p;
+
+	printf("%s: START READAHEAD\n", __func__);
+	err = ext2fs_readahead_dblist(pr->fs, pr->dblist);
+	ext2fs_free_dblist(pr->dblist);
+	ext2fs_free_mem(&pr);
+	printf("%s: END READAHEAD %d\n", __func__, (int)err);
+}
+
 void e2fsck_pass2(e2fsck_t ctx)
 {
 	struct ext2_super_block *sb = ctx->fs->super;
@@ -146,6 +172,41 @@ void e2fsck_pass2(e2fsck_t ctx)
 	if (fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_DIR_INDEX)
 		ext2fs_dblist_sort2(fs->dblist, special_dir_block_cmp);
 
+	if (getenv("READAHEAD")) {
+#ifdef HAVE_PTHREAD_H
+		pthread_t tid;
+#endif
+		struct pass2_readahead_data *pr;
+		errcode_t err;
+
+		err = ext2fs_get_mem(sizeof(*pr), &pr);
+		if (err)
+			goto no_readahead;
+		pr->fs = fs;
+		err = ext2fs_copy_dblist(fs->dblist, &pr->dblist);
+		if (err) {
+			ext2fs_free_mem(&pr);
+			goto no_readahead;
+		}
+		err = ext2fs_dblist_iterate2(pr->dblist, readahead_dir_block,
+					     NULL);
+		if (err) {
+			ext2fs_free_dblist(pr->dblist);
+			ext2fs_free_mem(&pr);
+			goto no_readahead;
+		}
+#ifdef HAVE_PTHREAD_H
+		err = pthread_create(&tid, NULL, pass2_readahead, pr);
+#else
+		pass2_readahead(pr);
+#endif
+		if (err) {
+			ext2fs_free_dblist(pr->dblist);
+			ext2fs_free_mem(&pr);
+		}
+	}
+
+no_readahead:
 	cd.pctx.errcode = ext2fs_dblist_iterate2(fs->dblist, check_dir_block,
 						 &cd);
 	if (ctx->flags & E2F_FLAG_SIGNAL_MASK || ctx->flags & E2F_FLAG_RESTART)
diff --git a/lib/config.h.in b/lib/config.h.in
index 35ece01..1dd33b4 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -206,6 +206,9 @@
 /* Define if your <locale.h> file defines LC_MESSAGES. */
 #undef HAVE_LC_MESSAGES
 
+/* Define to 1 if you have the `pthread' library (-lpthread). */
+#undef HAVE_LIBPTHREAD
+
 /* Define to 1 if you have the <limits.h> header file. */
 #undef HAVE_LIMITS_H
 
@@ -314,6 +317,9 @@
 /* Define to 1 if you have the `prctl' function. */
 #undef HAVE_PRCTL
 
+/* Define to 1 if you have the <pthread.h> header file. */
+#undef HAVE_PTHREAD_H
+
 /* Define to 1 if you have the `putenv' function. */
 #undef HAVE_PUTENV
 


* Re: [PATCH 2/3] libext2fs: allow clients to read-ahead metadata
  2014-02-01 10:37 ` [PATCH 2/3] libext2fs: allow clients to read-ahead metadata Darrick J. Wong
@ 2014-02-01 10:41   ` Darrick J. Wong
  2014-02-03 21:32   ` Andreas Dilger
  1 sibling, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2014-02-01 10:41 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

On Sat, Feb 01, 2014 at 02:37:35AM -0800, Darrick J. Wong wrote:
> This patch adds to libext2fs the ability to pre-fetch metadata
> into the page cache in the hopes of speeding up libext2fs' clients.
> There are two new library functions -- the first allows a client to
> readahead a list of blocks, and the second is a helper function that
> uses that first mechanism to load group data (bitmaps, inode tables).
> 
> e2fsck will employ both of these methods to speed itself up.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  lib/ext2fs/Makefile.in  |    4 +
>  lib/ext2fs/ext2fs.h     |   10 +++
>  lib/ext2fs/io_manager.c |    2 -
>  lib/ext2fs/readahead.c  |  153 +++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 168 insertions(+), 1 deletion(-)
>  create mode 100644 lib/ext2fs/readahead.c
> 
> 
> diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
> index 92b6ab0..8f98f4b 100644
> --- a/lib/ext2fs/Makefile.in
> +++ b/lib/ext2fs/Makefile.in
> @@ -77,6 +77,7 @@ OBJS= $(DEBUGFS_LIB_OBJS) $(RESIZE_LIB_OBJS) $(E2IMAGE_LIB_OBJS) \
>  	qcow2.o \
>  	read_bb.o \
>  	read_bb_file.o \
> +	readahead.o \
>  	res_gdt.o \
>  	rw_bitmaps.o \
>  	swapfs.o \
> @@ -153,6 +154,7 @@ SRCS= ext2_err.c \
>  	$(srcdir)/qcow2.c \
>  	$(srcdir)/read_bb.c \
>  	$(srcdir)/read_bb_file.c \
> +	$(srcdir)/readahead.c \
>  	$(srcdir)/res_gdt.c \
>  	$(srcdir)/rw_bitmaps.c \
>  	$(srcdir)/swapfs.c \
> @@ -887,6 +889,8 @@ read_bb_file.o: $(srcdir)/read_bb_file.c $(top_builddir)/lib/config.h \
>   $(srcdir)/ext2_fs.h $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h \
>   $(srcdir)/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
>   $(srcdir)/ext2_ext_attr.h $(srcdir)/bitops.h
> +readahead.o: $(srcdir)/readahead.c $(top_builddir)/lib/config.h \
> + $(srcdir)/ext2fs.h $(srcdir)/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_err.h
>  res_gdt.o: $(srcdir)/res_gdt.c $(top_builddir)/lib/config.h \
>   $(top_builddir)/lib/dirpaths.h $(srcdir)/ext2_fs.h \
>   $(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2fs.h \
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index 069c1b6..1e06791 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -1543,6 +1543,16 @@ extern errcode_t ext2fs_read_bb_FILE(ext2_filsys fs, FILE *f,
>  				     void (*invalid)(ext2_filsys fs,
>  						     blk_t blk));
>  
> +/* readahead.c */
> +#define EXT2FS_READ_SUPER	0x01
> +#define EXT2FS_READ_GDT		0x02
> +#define EXT2FS_READ_BBITMAP	0x04
> +#define EXT2FS_READ_IBITMAP	0x08
> +#define EXT2FS_READ_ITABLE	0x10
> +errcode_t ext2fs_readahead(ext2_filsys fs, int flags, dgrp_t start,
> +			   dgrp_t ngroups);
> +errcode_t ext2fs_readahead_dblist(ext2_filsys fs, ext2_dblist dblist);
> +
>  /* res_gdt.c */
>  extern errcode_t ext2fs_create_resize_inode(ext2_filsys fs);
>  
> diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
> index 1acbb1d..aae0a4b 100644
> --- a/lib/ext2fs/io_manager.c
> +++ b/lib/ext2fs/io_manager.c
> @@ -135,5 +135,5 @@ errcode_t io_channel_readahead(io_channel io, unsigned long long block,
>  	if (!io->manager->readahead)
>  		return EXT2_ET_OP_NOT_SUPPORTED;
>  
> -	return io->manager->readahead(io, block, nblocks);
> +	return io->manager->readahead(io, block, count);
>  }

Oops, this hunk of course goes in the previous patch.

--D

> diff --git a/lib/ext2fs/readahead.c b/lib/ext2fs/readahead.c
> new file mode 100644
> index 0000000..05f6135
> --- /dev/null
> +++ b/lib/ext2fs/readahead.c
> @@ -0,0 +1,153 @@
> +/*
> + * readahead.c -- Try to convince the OS to prefetch metadata.
> + *
> + * Copyright (C) 2014 Oracle.
> + *
> + * %Begin-Header%
> + * This file may be redistributed under the terms of the GNU Library
> + * General Public License, version 2.
> + * %End-Header%
> + */
> +
> +#include "config.h"
> +#include <string.h>
> +
> +#include "ext2_fs.h"
> +#include "ext2fs.h"
> +
> +struct read_dblist {
> +	errcode_t err;
> +	blk64_t run_start;
> +	blk64_t run_len;
> +};
> +
> +static EXT2_QSORT_TYPE readahead_dir_block_cmp(const void *a, const void *b)
> +{
> +	const struct ext2_db_entry2 *db_a =
> +		(const struct ext2_db_entry2 *) a;
> +	const struct ext2_db_entry2 *db_b =
> +		(const struct ext2_db_entry2 *) b;
> +
> +	return (int) (db_a->blk - db_b->blk);
> +}
> +
> +static int readahead_dir_block(ext2_filsys fs, struct ext2_db_entry2 *db,
> +			       void *priv_data)
> +{
> +	errcode_t err = 0;
> +	struct read_dblist *pr = priv_data;
> +
> +	if (!pr->run_len || db->blk != pr->run_start + pr->run_len) {
> +		if (pr->run_len)
> +			pr->err = io_channel_readahead(fs->io, pr->run_start,
> +						       pr->run_len);
> +		pr->run_start = db->blk;
> +		pr->run_len = 0;
> +	}
> +	pr->run_len += db->blockcnt;
> +
> +	return pr->err ? DBLIST_ABORT : 0;
> +}
> +
> +errcode_t ext2fs_readahead_dblist(ext2_filsys fs, ext2_dblist dblist)
> +{
> +	errcode_t err;
> +	struct read_dblist pr;
> +
> +	ext2fs_dblist_sort2(dblist, readahead_dir_block_cmp);
> +
> +	memset(&pr, 0, sizeof(pr));
> +	err = ext2fs_dblist_iterate2(dblist, readahead_dir_block, &pr);
> +	if (pr.err)
> +		return pr.err;
> +	if (err)
> +		return err;
> +
> +	if (pr.run_len)
> +		err = io_channel_readahead(fs->io, pr.run_start, pr.run_len);
> +
> +	return err;
> +}
> +
> +errcode_t ext2fs_readahead(ext2_filsys fs, int flags, dgrp_t start,
> +			   dgrp_t ngroups)
> +{
> +	blk64_t		super, old_gdt, new_gdt;
> +	blk_t		blocks;
> +	dgrp_t		i;
> +	ext2_dblist	dblist;
> +	dgrp_t		end = start + ngroups;
> +	errcode_t	err = 0;
> +
> +	if (end > fs->group_desc_count)
> +		end = fs->group_desc_count;
> +
> +	if (flags == 0)
> +		return 0;
> +
> +	err = ext2fs_init_dblist(fs, &dblist);
> +	if (err)
> +		return err;
> +
> +	for (i = start; i < end; i++) {
> +		err = ext2fs_super_and_bgd_loc2(fs, i, &super, &old_gdt,
> +						&new_gdt, &blocks);
> +		if (err)
> +			break;
> +
> +		if (flags & EXT2FS_READ_SUPER) {
> +			err = ext2fs_add_dir_block2(dblist, 0, super, 0);
> +			if (err)
> +				break;
> +		}
> +
> +		if (flags & EXT2FS_READ_GDT) {
> +			if (old_gdt)
> +				err = ext2fs_add_dir_block2(dblist, 0, old_gdt,
> +							    blocks);
> +			else if (new_gdt)
> +				err = ext2fs_add_dir_block2(dblist, 0, new_gdt,
> +							    blocks);
> +			else
> +				err = 0;
> +			if (err)
> +				break;
> +		}
> +
> +		if ((flags & EXT2FS_READ_BBITMAP) &&
> +		    !ext2fs_bg_flags_test(fs, i, EXT2_BG_BLOCK_UNINIT) &&
> +		    ext2fs_bg_free_blocks_count(fs, i) <
> +				fs->super->s_blocks_per_group) {
> +			super = ext2fs_block_bitmap_loc(fs, i);
> +			err = ext2fs_add_dir_block2(dblist, 0, super, 1);
> +			if (err)
> +				break;
> +		}
> +
> +		if ((flags & EXT2FS_READ_IBITMAP) &&
> +		    !ext2fs_bg_flags_test(fs, i, EXT2_BG_INODE_UNINIT) &&
> +		    ext2fs_bg_free_inodes_count(fs, i) <
> +				fs->super->s_inodes_per_group) {
> +			super = ext2fs_inode_bitmap_loc(fs, i);
> +			err = ext2fs_add_dir_block2(dblist, 0, super, 1);
> +			if (err)
> +				break;
> +		}
> +
> +		if ((flags & EXT2FS_READ_ITABLE) &&
> +		    ext2fs_bg_free_inodes_count(fs, i) <
> +				fs->super->s_inodes_per_group) {
> +			super = ext2fs_inode_table_loc(fs, i);
> +			err = ext2fs_add_dir_block2(dblist, 0, super,
> +					fs->inode_blocks_per_group);
> +			if (err)
> +				break;
> +		}
> +	}
> +
> +	if (!err)
> +		err = ext2fs_readahead_dblist(fs, dblist);
> +
> +	ext2fs_free_dblist(dblist);
> +	return err;
> +}
> 

* Re: [PATCH 3/3] e2fsck: read-ahead metadata during pass1 and pass2
  2014-02-01 10:37 ` [PATCH 3/3] e2fsck: read-ahead metadata during pass1 and pass2 Darrick J. Wong
@ 2014-02-03 21:20   ` Andreas Dilger
  2014-02-04  1:26     ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2014-02-03 21:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Theodore Ts'o, linux-ext4

On Feb 1, 2014, at 3:37 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
> index 7554f4e..590e1bd 100644
> --- a/e2fsck/pass1.c
> +++ b/e2fsck/pass1.c
> @@ -574,6 +577,20 @@ static errcode_t recheck_bad_inode_checksum(ext2_filsys fs, ext2_ino_t ino,
> 	return 0;
> }
> 
> +static void *pass1_readahead(void *p)
> +{
> +	errcode_t err;
> +	e2fsck_t ctx = (e2fsck_t)p;
> +
> +	printf("%s: START READAHEAD\n", __func__);
> +	err = ext2fs_readahead(ctx->fs, EXT2FS_READ_BBITMAP |
> +			       EXT2FS_READ_IBITMAP | EXT2FS_READ_ITABLE,
> +			       0, ctx->fs->group_desc_count);

This is basically launching readahead for the whole filesystem in one
shot.  That might be OK for small filesystems or running a single large
filesystem on a big machine, but could cause memory pressure and cache
eviction for many/large filesystems.

Have you done any tests to see what a limited readahead would do for
performance (say 8-16 groups ahead)?

Also, the bitmaps are not needed until pass 5, but would benefit from
being prefetched along with the inode table for non-flex_bg filesystems.
Probably there is little to no benefit to prefetching them in pass1 for
flex_bg filesystems.

Cheers, Andreas

> +	printf("%s: READAHEAD=%d\n", __func__, (int)err);
> +
> +	return NULL;
> +}
> +
> void e2fsck_pass1(e2fsck_t ctx)
> {
> 	int	i;
> @@ -600,6 +617,15 @@ void e2fsck_pass1(e2fsck_t ctx)
> 	init_resource_track(&rtrack, ctx->fs->io);
> 	clear_problem_context(&pctx);
> 
> +	if (getenv("READAHEAD")) {
> +#ifdef HAVE_PTHREAD_H
> +		pthread_t tid;
> +		pthread_create(&tid, NULL, pass1_readahead, ctx);
> +#else
> +		pass1_readahead(ctx);
> +#endif
> +	}
> +
> 	if (!(ctx->options & E2F_OPT_PREEN))
> 		fix_problem(ctx, PR_1_PASS_HEADER, &pctx);
> 
> diff --git a/e2fsck/pass2.c b/e2fsck/pass2.c
> index 5a2745a..bd7323f 100644
> --- a/e2fsck/pass2.c
> +++ b/e2fsck/pass2.c
> @@ -44,6 +44,9 @@
> #define _GNU_SOURCE 1 /* get strnlen() */
> #include "config.h"
> #include <string.h>
> +#ifdef HAVE_PTHREAD_H
> +#include <pthread.h>
> +#endif
> 
> #include "e2fsck.h"
> #include "problem.h"
> @@ -79,6 +82,31 @@ struct check_dir_struct {
> 	e2fsck_t ctx;
> };
> 
> +struct pass2_readahead_data {
> +	ext2_filsys fs;
> +	ext2_dblist dblist;
> +};
> +
> +static int readahead_dir_block(ext2_filsys fs, struct ext2_db_entry2 *db,
> +			       void *priv_data)
> +{
> +	db->blockcnt = 1;
> +	return 0;
> +}
> +
> +static void *pass2_readahead(void *p)
> +{
> +	errcode_t err;
> +	struct pass2_readahead_data *pr = p;
> +
> +	printf("%s: START READAHEAD\n", __func__);
> +	err = ext2fs_readahead_dblist(pr->fs, pr->dblist);
> +	ext2fs_free_dblist(pr->dblist);
> +	ext2fs_free_mem(&pr);
> +	printf("%s: END READAHEAD %d\n", __func__, (int)err);
> +	return NULL;
> +}
> +
> void e2fsck_pass2(e2fsck_t ctx)
> {
> 	struct ext2_super_block *sb = ctx->fs->super;
> @@ -146,6 +174,41 @@ void e2fsck_pass2(e2fsck_t ctx)
> 	if (fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_DIR_INDEX)
> 		ext2fs_dblist_sort2(fs->dblist, special_dir_block_cmp);
> 
> +	if (getenv("READAHEAD")) {
> +#ifdef HAVE_PTHREAD_H
> +		pthread_t tid;
> +#endif
> +		struct pass2_readahead_data *pr;
> +		errcode_t err;
> +
> +		err = ext2fs_get_mem(sizeof(*pr), &pr);
> +		if (err)
> +			goto no_readahead;
> +		pr->fs = fs;
> +		err = ext2fs_copy_dblist(fs->dblist, &pr->dblist);
> +		if (err) {
> +			ext2fs_free_mem(&pr);
> +			goto no_readahead;
> +		}
> +		err = ext2fs_dblist_iterate2(pr->dblist, readahead_dir_block,
> +					     NULL);
> +		if (err) {
> +			ext2fs_free_dblist(pr->dblist);
> +			ext2fs_free_mem(&pr);
> +			goto no_readahead;
> +		}
> +#ifdef HAVE_PTHREAD_H
> +		err = pthread_create(&tid, NULL, pass2_readahead, pr);
> +#else
> +		pass2_readahead(pr);
> +#endif
> +		if (err) {
> +			ext2fs_free_dblist(pr->dblist);
> +			ext2fs_free_mem(&pr);
> +		}
> +	}
> +
> +no_readahead:
> 	cd.pctx.errcode = ext2fs_dblist_iterate2(fs->dblist, check_dir_block,
> 						 &cd);
> 	if (ctx->flags & E2F_FLAG_SIGNAL_MASK || ctx->flags & E2F_FLAG_RESTART)
> diff --git a/lib/config.h.in b/lib/config.h.in
> index 35ece01..1dd33b4 100644
> --- a/lib/config.h.in
> +++ b/lib/config.h.in
> @@ -206,6 +206,9 @@
> /* Define if your <locale.h> file defines LC_MESSAGES. */
> #undef HAVE_LC_MESSAGES
> 
> +/* Define to 1 if you have the `pthread' library (-lpthread). */
> +#undef HAVE_LIBPTHREAD
> +
> /* Define to 1 if you have the <limits.h> header file. */
> #undef HAVE_LIMITS_H
> 
> @@ -314,6 +317,9 @@
> /* Define to 1 if you have the `prctl' function. */
> #undef HAVE_PRCTL
> 
> +/* Define to 1 if you have the <pthread.h> header file. */
> +#undef HAVE_PTHREAD_H
> +
> /* Define to 1 if you have the `putenv' function. */
> #undef HAVE_PUTENV
> 
> 


Cheers, Andreas

* Re: [PATCH 2/3] libext2fs: allow clients to read-ahead metadata
  2014-02-01 10:37 ` [PATCH 2/3] libext2fs: allow clients to read-ahead metadata Darrick J. Wong
  2014-02-01 10:41   ` Darrick J. Wong
@ 2014-02-03 21:32   ` Andreas Dilger
  2014-02-03 23:26     ` Darrick J. Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2014-02-03 21:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Theodore Ts'o, linux-ext4

On Feb 1, 2014, at 3:37 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> This patch adds to libext2fs the ability to pre-fetch metadata
> into the page cache in the hopes of speeding up libext2fs' clients.
> There are two new library functions -- the first allows a client to
> readahead a list of blocks, and the second is a helper function that
> uses that first mechanism to load group data (bitmaps, inode tables).
> 
> e2fsck will employ both of these methods to speed itself up.
> 
> diff --git a/lib/ext2fs/readahead.c b/lib/ext2fs/readahead.c
> new file mode 100644
> index 0000000..05f6135
> --- /dev/null
> +++ b/lib/ext2fs/readahead.c
> +errcode_t ext2fs_readahead(ext2_filsys fs, int flags, dgrp_t start,
> +			   dgrp_t ngroups)
> +{
> +
> +	for (i = start; i < end; i++) {
> +		if ((flags & EXT2FS_READ_ITABLE) &&
> +		    ext2fs_bg_free_inodes_count(fs, i) <
> +				fs->super->s_inodes_per_group) {
> +			super = ext2fs_inode_table_loc(fs, i);
> +			err = ext2fs_add_dir_block2(dblist, 0, super,
> +					fs->inode_blocks_per_group);

This prefetches all of the inode table blocks, when it could instead
just prefetch the in-use blocks using:

		if ((flags & EXT2FS_READ_ITABLE) &&
		    ext2fs_bg_itable_unused(fs, i) <
		    fs->inode_blocks_per_group)
			err = ext2fs_add_dir_block2(dblist, 0, super,
					fs->inode_blocks_per_group -
					ext2fs_bg_itable_unused(fs, i));

If the filesystem is corrupt and the "unused" blocks need to be read
later, that cost is probably more than offset by not reading genuinely
unused blocks the rest of the time.


Cheers, Andreas

* Re: [PATCH 2/3] libext2fs: allow clients to read-ahead metadata
  2014-02-03 21:32   ` Andreas Dilger
@ 2014-02-03 23:26     ` Darrick J. Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2014-02-03 23:26 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Theodore Ts'o, linux-ext4

On Mon, Feb 03, 2014 at 02:32:45PM -0700, Andreas Dilger wrote:
> On Feb 1, 2014, at 3:37 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > This patch adds to libext2fs the ability to pre-fetch metadata
> > into the page cache in the hopes of speeding up libext2fs' clients.
> > There are two new library functions -- the first allows a client to
> > readahead a list of blocks, and the second is a helper function that
> > uses that first mechanism to load group data (bitmaps, inode tables).
> > 
> > e2fsck will employ both of these methods to speed itself up.
> > 
> > diff --git a/lib/ext2fs/readahead.c b/lib/ext2fs/readahead.c
> > new file mode 100644
> > index 0000000..05f6135
> > --- /dev/null
> > +++ b/lib/ext2fs/readahead.c
> > +errcode_t ext2fs_readahead(ext2_filsys fs, int flags, dgrp_t start,
> > +			   dgrp_t ngroups)
> > +{
> > +
> > +	for (i = start; i < end; i++) {
> > +		if ((flags & EXT2FS_READ_ITABLE) &&
> > +		    ext2fs_bg_free_inodes_count(fs, i) <
> > +				fs->super->s_inodes_per_group) {
> > +			super = ext2fs_inode_table_loc(fs, i);
> > +			err = ext2fs_add_dir_block2(dblist, 0, super,
> > +					fs->inode_blocks_per_group);
> 
> This prefetches all of the inode table blocks, when it could instead
> just prefetch the in-use blocks using:
> 
> 		if ((flags & EXT2FS_READ_ITABLE) &&
> 		    ext2fs_bg_itable_unused(fs, i) <
> 		    fs->inode_blocks_per_group)
> 			err = ext2fs_add_dir_block2(dblist, 0, super,
> 					fs->inode_blocks_per_group -
> 					ext2fs_bg_itable_unused(fs, i));

I think you need to convert ext2fs_bg_itable_unused() to blocks there, but
point taken.  Actually, the first insane-o patch had this, but I forgot it when
writing up the second version.
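
The conversion mentioned above could look like the following minimal sketch.
ext2fs_bg_itable_unused() counts inodes, not blocks, so the unused count has
to be turned into a block count before it is compared with or subtracted from
fs->inode_blocks_per_group.  The helper name itable_blocks_in_use() and its
parameters are hypothetical and not part of libext2fs; the caller is assumed
to pass in s_inodes_per_group, the group's itable_unused value, the inode size
and the block size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not libext2fs API): number of leading inode table
 * blocks in a group that contain at least one in-use inode.
 */
uint32_t itable_blocks_in_use(uint32_t inodes_per_group,
			      uint32_t itable_unused,
			      uint32_t inode_size,
			      uint32_t block_size)
{
	uint32_t inodes_per_block, used_inodes;

	assert(inode_size > 0 && block_size >= inode_size);
	inodes_per_block = block_size / inode_size;

	/* A corrupt descriptor may claim more unused inodes than exist. */
	if (itable_unused >= inodes_per_group)
		return 0;

	used_inodes = inodes_per_group - itable_unused;

	/* Round up so that a partially used block is still prefetched. */
	return (used_inodes + inodes_per_block - 1) / inodes_per_block;
}
```

The result would then be passed as the block count to
ext2fs_add_dir_block2() in place of fs->inode_blocks_per_group, and the
group skipped entirely when the helper returns 0.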

> If the filesystem is corrupt and the "unused" blocks need to be read
> later, that cost is probably more than offset by not reading genuinely
> unused blocks the rest of the time.

<shrug> I'm not particularly concerned about less than optimal IO throughput on
broken filesystems.

--D
> 
> 
> Cheers, Andreas
> 
> 
> 
> 
> 

* Re: [PATCH 3/3] e2fsck: read-ahead metadata during pass1 and pass2
  2014-02-03 21:20   ` Andreas Dilger
@ 2014-02-04  1:26     ` Darrick J. Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2014-02-04  1:26 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Theodore Ts'o, linux-ext4

On Mon, Feb 03, 2014 at 02:20:01PM -0700, Andreas Dilger wrote:
> On Feb 1, 2014, at 3:37 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
> > index 7554f4e..590e1bd 100644
> > --- a/e2fsck/pass1.c
> > +++ b/e2fsck/pass1.c
> > @@ -574,6 +577,20 @@ static errcode_t recheck_bad_inode_checksum(ext2_filsys fs, ext2_ino_t ino,
> > 	return 0;
> > }
> > 
> > +static void *pass1_readahead(void *p)
> > +{
> > +	errcode_t err;
> > +	e2fsck_t ctx = (e2fsck_t)p;
> > +
> > +	printf("%s: START READAHEAD\n", __func__);
> > +	err = ext2fs_readahead(ctx->fs, EXT2FS_READ_BBITMAP |
> > +			       EXT2FS_READ_IBITMAP | EXT2FS_READ_ITABLE,
> > +			       0, ctx->fs->group_desc_count);
> 
> This is basically launching readahead for the whole filesystem in one
> shot.  That might be OK for small filesystems or running a single large
> filesystem on a big machine, but could cause memory pressure and cache
> eviction for many/large filesystems.
> 
> Have you done any tests to see what a limited readahead would do for
> performance (say 8-16 groups ahead)?

Yes.  I didn't see any significant speedups with a flexbg filesystem unless I
could readahead at least a couple of flexbgs' worth.  On the other hand, getting
so far ahead of the checker thread that it thrashes memory is clearly
counterproductive.

For now I've set it to calculate the number of groups it takes to fill half of
memory with full inode tables, and it does incremental readahead in that
amount.  Partially filled (or totally empty) blockgroups of course reduce the
amount of memory used even further, but at least this lets us establish some
sort of upper bound.  Unfortunately, it's still a crude one since I'm using
sysconf(_SC_PHYS_PAGES).
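
As a rough illustration of that sizing rule (a sketch under stated
assumptions, not the code from the patch): take half of physical memory and
divide it by the size of one group's full inode table.  The function names
readahead_groups() and physical_memory_bytes() are hypothetical, and the
sketch assumes the glibc sysconf() names _SC_PHYS_PAGES and _SC_PAGESIZE.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: how many block groups' worth of full inode tables
 * fit into half of mem_bytes.  Always returns at least 1 so that the
 * readahead thread makes progress on very small machines.
 */
uint64_t readahead_groups(uint64_t mem_bytes,
			  uint32_t inode_blocks_per_group,
			  uint32_t block_size)
{
	uint64_t itable_bytes, groups;

	assert(inode_blocks_per_group > 0 && block_size > 0);
	itable_bytes = (uint64_t)inode_blocks_per_group * block_size;
	groups = (mem_bytes / 2) / itable_bytes;

	return groups ? groups : 1;
}

/* Crude estimate of physical memory; returns 0 if it cannot be queried. */
uint64_t physical_memory_bytes(void)
{
	long pages = sysconf(_SC_PHYS_PAGES);
	long page_size = sysconf(_SC_PAGESIZE);

	if (pages <= 0 || page_size <= 0)
		return 0;
	return (uint64_t)pages * (uint64_t)page_size;
}
```

A caller would compute readahead_groups(physical_memory_bytes(), ...) once
and then issue readahead in chunks of that many groups.  Since partially
filled groups use less memory than a full inode table, this only gives an
upper bound, as noted above.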

With this incremental thing hooked up, I can still observe speedups even on
low memory VMs (64GB fs, 2.5G metadata, 512M RAM).

As far as pass2 goes, I put in some more code so that we can call
fadvise(DONTNEED) on dir blocks after we're done with them.  I've seen a rather
small improvement.
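
For readers unfamiliar with the call, dropping a block range from the page
cache can be sketched as below.  This is only an illustration: the helper
name drop_cached_blocks() and its parameters are hypothetical, and fd is
assumed to be the open descriptor of the device or image being checked.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/*
 * Hypothetical helper: tell the kernel that nblocks filesystem blocks
 * starting at first_block will not be needed again.  posix_fadvise()
 * returns 0 on success or an error number (it does not set errno).
 */
int drop_cached_blocks(int fd, uint64_t first_block, uint64_t nblocks,
		       uint32_t block_size)
{
	assert(block_size > 0);
	return posix_fadvise(fd,
			     (off_t)(first_block * block_size),
			     (off_t)(nblocks * block_size),
			     POSIX_FADV_DONTNEED);
}
```

The advice is only a hint, so the kernel is free to keep the pages cached.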

> Also, the bitmaps are not needed until pass 5, but would benefit from
> being prefetched along with the inode table for non-flex_bg filesystems.
> Probably there is little to no benefit to prefetching them in pass1 for
> flex_bg filesystems.

If run during pass 5, the bitmap readahead thread doesn't seem to be able to
stay far enough ahead of ext2fs_read_bitmaps() to matter much.  However, pass 4
seems fairly IO-light and CPU-heavy, so when I moved bitmap readahead to pass
4, the (rather tiny) amount of time spent in P5 decreased.

I didn't do much with P3a other than hoping that everything we read in P2 is
still in cache.  It ran slower anyway.

--D

> Cheers, Andreas
> 
> > +	printf("%s: READAHEAD=%d\n", __func__, (int)err);
> > +
> > +	return NULL;
> > +}
> > +
> > void e2fsck_pass1(e2fsck_t ctx)
> > {
> > 	int	i;
> > @@ -600,6 +617,15 @@ void e2fsck_pass1(e2fsck_t ctx)
> > 	init_resource_track(&rtrack, ctx->fs->io);
> > 	clear_problem_context(&pctx);
> > 
> > +	if (getenv("READAHEAD")) {
> > +#ifdef HAVE_PTHREAD_H
> > +		pthread_t tid;
> > +		pthread_create(&tid, NULL, pass1_readahead, ctx);
> > +#else
> > +		pass1_readahead(ctx);
> > +#endif
> > +	}
> > +
> > 	if (!(ctx->options & E2F_OPT_PREEN))
> > 		fix_problem(ctx, PR_1_PASS_HEADER, &pctx);
> > 
> > diff --git a/e2fsck/pass2.c b/e2fsck/pass2.c
> > index 5a2745a..bd7323f 100644
> > --- a/e2fsck/pass2.c
> > +++ b/e2fsck/pass2.c
> > @@ -44,6 +44,9 @@
> > #define _GNU_SOURCE 1 /* get strnlen() */
> > #include "config.h"
> > #include <string.h>
> > +#ifdef HAVE_PTHREAD_H
> > +#include <pthread.h>
> > +#endif
> > 
> > #include "e2fsck.h"
> > #include "problem.h"
> > @@ -79,6 +82,31 @@ struct check_dir_struct {
> > 	e2fsck_t ctx;
> > };
> > 
> > +struct pass2_readahead_data {
> > +	ext2_filsys fs;
> > +	ext2_dblist dblist;
> > +};
> > +
> > +static int readahead_dir_block(ext2_filsys fs, struct ext2_db_entry2 *db,
> > +			       void *priv_data)
> > +{
> > +	db->blockcnt = 1;
> > +	return 0;
> > +}
> > +
> > +static void *pass2_readahead(void *p)
> > +{
> > +	errcode_t err;
> > +	struct pass2_readahead_data *pr = p;
> > +
> > +	printf("%s: START READAHEAD\n", __func__);
> > +	err = ext2fs_readahead_dblist(pr->fs, pr->dblist);
> > +	ext2fs_free_dblist(pr->dblist);
> > +	ext2fs_free_mem(&pr);
> > +	printf("%s: END READAHEAD %d\n", __func__, (int)err);
> > +	return NULL;
> > +}
> > +
> > void e2fsck_pass2(e2fsck_t ctx)
> > {
> > 	struct ext2_super_block *sb = ctx->fs->super;
> > @@ -146,6 +174,41 @@ void e2fsck_pass2(e2fsck_t ctx)
> > 	if (fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_DIR_INDEX)
> > 		ext2fs_dblist_sort2(fs->dblist, special_dir_block_cmp);
> > 
> > +	if (getenv("READAHEAD")) {
> > +#ifdef HAVE_PTHREAD_H
> > +		pthread_t tid;
> > +#endif
> > +		struct pass2_readahead_data *pr;
> > +		errcode_t err;
> > +
> > +		err = ext2fs_get_mem(sizeof(*pr), &pr);
> > +		if (err)
> > +			goto no_readahead;
> > +		pr->fs = fs;
> > +		err = ext2fs_copy_dblist(fs->dblist, &pr->dblist);
> > +		if (err) {
> > +			ext2fs_free_mem(&pr);
> > +			goto no_readahead;
> > +		}
> > +		err = ext2fs_dblist_iterate2(pr->dblist, readahead_dir_block,
> > +					     NULL);
> > +		if (err) {
> > +			ext2fs_free_dblist(pr->dblist);
> > +			ext2fs_free_mem(&pr);
> > +			goto no_readahead;
> > +		}
> > +#ifdef HAVE_PTHREAD_H
> > +		err = pthread_create(&tid, NULL, pass2_readahead, pr);
> > +#else
> > +		pass2_readahead(pr);
> > +#endif
> > +		if (err) {
> > +			ext2fs_free_dblist(pr->dblist);
> > +			ext2fs_free_mem(&pr);
> > +		}
> > +	}
> > +
> > +no_readahead:
> > 	cd.pctx.errcode = ext2fs_dblist_iterate2(fs->dblist, check_dir_block,
> > 						 &cd);
> > 	if (ctx->flags & E2F_FLAG_SIGNAL_MASK || ctx->flags & E2F_FLAG_RESTART)
> > diff --git a/lib/config.h.in b/lib/config.h.in
> > index 35ece01..1dd33b4 100644
> > --- a/lib/config.h.in
> > +++ b/lib/config.h.in
> > @@ -206,6 +206,9 @@
> > /* Define if your <locale.h> file defines LC_MESSAGES. */
> > #undef HAVE_LC_MESSAGES
> > 
> > +/* Define to 1 if you have the `pthread' library (-lpthread). */
> > +#undef HAVE_LIBPTHREAD
> > +
> > /* Define to 1 if you have the <limits.h> header file. */
> > #undef HAVE_LIMITS_H
> > 
> > @@ -314,6 +317,9 @@
> > /* Define to 1 if you have the `prctl' function. */
> > #undef HAVE_PRCTL
> > 
> > +/* Define to 1 if you have the <pthread.h> header file. */
> > +#undef HAVE_PTHREAD_H
> > +
> > /* Define to 1 if you have the `putenv' function. */
> > #undef HAVE_PUTENV
> > 
> > 
> 
> 
> Cheers, Andreas
> 
> 
> 
> 
> 

end of thread, other threads:[~2014-02-04  1:26 UTC | newest]

Thread overview: 9+ messages
2014-02-01 10:37 [RFC PATCH v2 0/3] e2fsck metadata prefetch Darrick J. Wong
2014-02-01 10:37 ` [PATCH 1/3] ext2fs: add readahead method to improve scanning Darrick J. Wong
2014-02-01 10:37 ` [PATCH 2/3] libext2fs: allow clients to read-ahead metadata Darrick J. Wong
2014-02-01 10:41   ` Darrick J. Wong
2014-02-03 21:32   ` Andreas Dilger
2014-02-03 23:26     ` Darrick J. Wong
2014-02-01 10:37 ` [PATCH 3/3] e2fsck: read-ahead metadata during pass1 and pass2 Darrick J. Wong
2014-02-03 21:20   ` Andreas Dilger
2014-02-04  1:26     ` Darrick J. Wong