* [PATCH 00/10, v3] xfsprogs: repair scalability and fixes
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

Hi folks,

Another update to the repair scalability fixes patchset. I've
addressed all the review comments from the v2 patch set, and only
patches 8-10 still need review.

The only major change in this posting is that I reordered the patch
set to put all the reviewed patches first. Most patches have no
dependencies on each other, so none of the patches changed as a
result of the reordering. If no one objects, I'll commit the reviewed
patches (1-7) tomorrow to get the fixes and improvements into the
tree for wider testing as soon as possible...

Cheers,

Dave.


* [PATCH 01/10] repair: translation lookups limit scalability
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

A bit of perf magic showed that scalability was limited to 3-4
concurrent threads due to contention on a lock inside something
called __dcigettext(). That lives in one of the libraries repair is
linked against, and it turns out to be inside the translation
infrastructure that supports the _() string mechanism:

# Samples: 34K of event 'cs'
# Event count (approx.): 495567
#
# Overhead        Command      Shared Object          Symbol
# ........  .............  .................  ..............
#
    60.30%     xfs_repair  [kernel.kallsyms]  [k] __schedule
               |
               --- 0x63fffff9c
                   process_bmbt_reclist_int
                  |
                  |--39.95%-- __dcigettext
                  |          __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                  |
                  |--8.91%-- __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                   --51.13%-- [...]

Fix this by initialising global variables that hold the translated
strings at startup, hence avoiding the need to do repeated runtime
translation of the same strings.
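
The pattern, in a minimal standalone sketch (an illustration of the
technique only, not the repair code itself; the names are made up):

#include <libintl.h>
#include <locale.h>

#define _(str)	gettext(str)

/* looked up once at startup; read-only and lock-free thereafter */
static const char *msg_data;
static const char *msg_attr;

static void
translations_init(void)
{
	msg_data = _("data");
	msg_attr = _("attr");
}

int
main(void)
{
	setlocale(LC_ALL, "");
	translations_init();
	/*
	 * Hot, multi-threaded paths now read msg_data/msg_attr
	 * directly instead of calling gettext() and serialising on
	 * its internal mutex for every message.
	 */
	return 0;
}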

Runtime of an unpatched xfs_repair is roughly:

        XFS_REPAIR Summary    Fri Dec  6 11:15:50 2013

Phase           Start           End             Duration
Phase 1:        12/06 10:56:21  12/06 10:56:21
Phase 2:        12/06 10:56:21  12/06 10:56:23  2 seconds
Phase 3:        12/06 10:56:23  12/06 11:01:31  5 minutes, 8 seconds
Phase 4:        12/06 11:01:31  12/06 11:07:08  5 minutes, 37 seconds
Phase 5:        12/06 11:07:08  12/06 11:07:09  1 second
Phase 6:        12/06 11:07:09  12/06 11:15:49  8 minutes, 40 seconds
Phase 7:        12/06 11:15:49  12/06 11:15:50  1 second

Total run time: 19 minutes, 29 seconds

Patched version:

Phase           Start           End             Duration
Phase 1:        12/06 10:36:29  12/06 10:36:29
Phase 2:        12/06 10:36:29  12/06 10:36:31  2 seconds
Phase 3:        12/06 10:36:31  12/06 10:40:08  3 minutes, 37 seconds
Phase 4:        12/06 10:40:08  12/06 10:43:42  3 minutes, 34 seconds
Phase 5:        12/06 10:43:42  12/06 10:43:42
Phase 6:        12/06 10:43:42  12/06 10:50:28  6 minutes, 46 seconds
Phase 7:        12/06 10:50:28  12/06 10:50:29  1 second

Total run time: 14 minutes

Big win!

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 repair/dinode.c     | 49 +++++++++++++++++++++++++++++++++++--------------
 repair/dinode.h     |  3 +++
 repair/scan.c       |  7 +------
 repair/xfs_repair.c |  2 ++
 4 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/repair/dinode.c b/repair/dinode.c
index 3115bd0..4953a56 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -32,6 +32,37 @@
 #include "threads.h"
 
 /*
+ * gettext lookups for translations of strings use mutexes internally to
+ * the library. Hence when we come through here doing parallel scans in
+ * multiple AGs, then all do concurrent text conversions and serialise
+ * on the translation string lookups. Let's avoid doing repeated lookups
+ * by making them static variables and only assigning the translation
+ * once.
+ */
+static char	*forkname_data;
+static char	*forkname_attr;
+static char	*ftype_real_time;
+static char	*ftype_regular;
+
+void
+dinode_bmbt_translation_init(void)
+{
+	forkname_data = _("data");
+	forkname_attr = _("attr");
+	ftype_real_time = _("real-time");
+	ftype_regular = _("regular");
+}
+
+char *
+get_forkname(int whichfork)
+{
+
+	if (whichfork == XFS_DATA_FORK)
+		return forkname_data;
+	return forkname_attr;
+}
+
+/*
  * inode clearing routines
  */
 
@@ -542,7 +573,7 @@ process_bmbt_reclist_int(
 	xfs_dfiloff_t		op = 0;		/* prev offset */
 	xfs_dfsbno_t		b;
 	char			*ftype;
-	char			*forkname;
+	char			*forkname = get_forkname(whichfork);
 	int			i;
 	int			state;
 	xfs_agnumber_t		agno;
@@ -552,15 +583,10 @@ process_bmbt_reclist_int(
 	xfs_agnumber_t		locked_agno = -1;
 	int			error = 1;
 
-	if (whichfork == XFS_DATA_FORK)
-		forkname = _("data");
-	else
-		forkname = _("attr");
-
 	if (type == XR_INO_RTDATA)
-		ftype = _("real-time");
+		ftype = ftype_real_time;
 	else
-		ftype = _("regular");
+		ftype = ftype_regular;
 
 	for (i = 0; i < *numrecs; i++) {
 		libxfs_bmbt_disk_get_all(rp + i, &irec);
@@ -1110,7 +1136,7 @@ process_btinode(
 	xfs_ino_t		lino;
 	xfs_bmbt_ptr_t		*pp;
 	xfs_bmbt_key_t		*pkey;
-	char			*forkname;
+	char			*forkname = get_forkname(whichfork);
 	int			i;
 	int			level;
 	int			numrecs;
@@ -1122,11 +1148,6 @@ process_btinode(
 	*tot = 0;
 	*nex = 0;
 
-	if (whichfork == XFS_DATA_FORK)
-		forkname = _("data");
-	else
-		forkname = _("attr");
-
 	magic = xfs_sb_version_hascrc(&mp->m_sb) ? XFS_BMAP_CRC_MAGIC
 						 : XFS_BMAP_MAGIC;
 
diff --git a/repair/dinode.h b/repair/dinode.h
index d9197c1..7521521 100644
--- a/repair/dinode.h
+++ b/repair/dinode.h
@@ -127,4 +127,7 @@ get_bmapi(xfs_mount_t		*mp,
 		xfs_dfiloff_t	bno,
 		int             whichfork );
 
+void dinode_bmbt_translation_init(void);
+char * get_forkname(int whichfork);
+
 #endif /* _XR_DINODE_H */
diff --git a/repair/scan.c b/repair/scan.c
index 73b4581..b12f48b 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -171,17 +171,12 @@ scan_bmapbt(
 	xfs_bmbt_rec_t		*rp;
 	xfs_dfiloff_t		first_key;
 	xfs_dfiloff_t		last_key;
-	char			*forkname;
+	char			*forkname = get_forkname(whichfork);
 	int			numrecs;
 	xfs_agnumber_t		agno;
 	xfs_agblock_t		agbno;
 	int			state;
 
-	if (whichfork == XFS_DATA_FORK)
-		forkname = _("data");
-	else
-		forkname = _("attr");
-
 	/*
 	 * unlike the ag freeblock btrees, if anything looks wrong
 	 * in an inode bmap tree, just bail.  it's possible that
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 9e0502a..bac334f 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -29,6 +29,7 @@
 #include "prefetch.h"
 #include "threads.h"
 #include "progress.h"
+#include "dinode.h"
 
 #define	rounddown(x, y)	(((x)/(y))*(y))
 
@@ -533,6 +534,7 @@ main(int argc, char **argv)
 	setlocale(LC_ALL, "");
 	bindtextdomain(PACKAGE, LOCALEDIR);
 	textdomain(PACKAGE);
+	dinode_bmbt_translation_init();
 
 	temp_mp = &xfs_m;
 	setbuf(stdout, NULL);
-- 
1.8.4.rc3


* [PATCH 02/10] repair: per AG locks contend for cachelines
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The per-AG locks used to protect the per-AG block lists are located in a
tightly packed array. That means they share cachelines, so space them
out into separate 64 byte regions in the array.

pahole confirms the padding is correctly applied:

struct aglock {
        pthread_mutex_t            lock;                 /*     0    40 */

        /* size: 64, cachelines: 1, members: 1 */
        /* padding: 24 */
};
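
As a build-time alternative to checking with pahole, a compile-time
assertion can pin the layout down (a sketch, assuming a typical
x86-64/glibc build where pthread_mutex_t is 40 bytes; this assertion
is not part of the patch):

#include <pthread.h>

struct aglock {
	pthread_mutex_t	lock __attribute__((__aligned__(64)));
};

/* fails to compile if the alignment attribute stops padding to 64 bytes */
_Static_assert(sizeof(struct aglock) == 64,
		"struct aglock must occupy a full cacheline");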

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 repair/dino_chunks.c | 24 ++++++++++++------------
 repair/dinode.c      |  6 +++---
 repair/globals.h     |  5 ++++-
 repair/incore.c      |  4 ++--
 repair/scan.c        |  4 ++--
 5 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
index 65281e4..afb26e0 100644
--- a/repair/dino_chunks.c
+++ b/repair/dino_chunks.c
@@ -141,7 +141,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 		if (check_aginode_block(mp, agno, agino) == 0)
 			return 0;
 
-		pthread_mutex_lock(&ag_locks[agno]);
+		pthread_mutex_lock(&ag_locks[agno].lock);
 
 		state = get_bmap(agno, agbno);
 		switch (state) {
@@ -166,7 +166,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 		_("inode block %d/%d multiply claimed, (state %d)\n"),
 				agno, agbno, state);
 			set_bmap(agno, agbno, XR_E_MULT);
-			pthread_mutex_unlock(&ag_locks[agno]);
+			pthread_mutex_unlock(&ag_locks[agno].lock);
 			return(0);
 		default:
 			do_warn(
@@ -176,7 +176,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 			break;
 		}
 
-		pthread_mutex_unlock(&ag_locks[agno]);
+		pthread_mutex_unlock(&ag_locks[agno].lock);
 
 		start_agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
 		*start_ino = XFS_AGINO_TO_INO(mp, agno, start_agino);
@@ -424,7 +424,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 	 * user data -- we're probably here as a result of a directory
 	 * entry or an iunlinked pointer
 	 */
-	pthread_mutex_lock(&ag_locks[agno]);
+	pthread_mutex_lock(&ag_locks[agno].lock);
 	for (cur_agbno = chunk_start_agbno;
 	     cur_agbno < chunk_stop_agbno;
 	     cur_agbno += blen)  {
@@ -438,7 +438,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 	_("inode block %d/%d multiply claimed, (state %d)\n"),
 				agno, cur_agbno, state);
 			set_bmap_ext(agno, cur_agbno, blen, XR_E_MULT);
-			pthread_mutex_unlock(&ag_locks[agno]);
+			pthread_mutex_unlock(&ag_locks[agno].lock);
 			return 0;
 		case XR_E_INO:
 			do_error(
@@ -449,7 +449,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 			break;
 		}
 	}
-	pthread_mutex_unlock(&ag_locks[agno]);
+	pthread_mutex_unlock(&ag_locks[agno].lock);
 
 	/*
 	 * ok, chunk is good.  put the record into the tree if required,
@@ -472,7 +472,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 
 	set_inode_used(irec_p, agino - start_agino);
 
-	pthread_mutex_lock(&ag_locks[agno]);
+	pthread_mutex_lock(&ag_locks[agno].lock);
 
 	for (cur_agbno = chunk_start_agbno;
 	     cur_agbno < chunk_stop_agbno;
@@ -505,7 +505,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 			break;
 		}
 	}
-	pthread_mutex_unlock(&ag_locks[agno]);
+	pthread_mutex_unlock(&ag_locks[agno].lock);
 
 	return(ino_cnt);
 }
@@ -736,7 +736,7 @@ process_inode_chunk(
 	/*
 	 * mark block as an inode block in the incore bitmap
 	 */
-	pthread_mutex_lock(&ag_locks[agno]);
+	pthread_mutex_lock(&ag_locks[agno].lock);
 	state = get_bmap(agno, agbno);
 	switch (state) {
 	case XR_E_INO:	/* already marked */
@@ -755,7 +755,7 @@ process_inode_chunk(
 			XFS_AGB_TO_FSB(mp, agno, agbno), state);
 		break;
 	}
-	pthread_mutex_unlock(&ag_locks[agno]);
+	pthread_mutex_unlock(&ag_locks[agno].lock);
 
 	for (;;) {
 		/*
@@ -925,7 +925,7 @@ process_inode_chunk(
 			ibuf_offset = 0;
 			agbno++;
 
-			pthread_mutex_lock(&ag_locks[agno]);
+			pthread_mutex_lock(&ag_locks[agno].lock);
 			state = get_bmap(agno, agbno);
 			switch (state) {
 			case XR_E_INO:	/* already marked */
@@ -946,7 +946,7 @@ process_inode_chunk(
 					XFS_AGB_TO_FSB(mp, agno, agbno), state);
 				break;
 			}
-			pthread_mutex_unlock(&ag_locks[agno]);
+			pthread_mutex_unlock(&ag_locks[agno].lock);
 
 		} else if (irec_offset == XFS_INODES_PER_CHUNK)  {
 			/*
diff --git a/repair/dinode.c b/repair/dinode.c
index 4953a56..48f17ac 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -707,8 +707,8 @@ _("Fatal error: inode %" PRIu64 " - blkmap_set_ext(): %s\n"
 		ebno = agbno + irec.br_blockcount;
 		if (agno != locked_agno) {
 			if (locked_agno != -1)
-				pthread_mutex_unlock(&ag_locks[locked_agno]);
-			pthread_mutex_lock(&ag_locks[agno]);
+				pthread_mutex_unlock(&ag_locks[locked_agno].lock);
+			pthread_mutex_lock(&ag_locks[agno].lock);
 			locked_agno = agno;
 		}
 
@@ -777,7 +777,7 @@ _("illegal state %d in block map %" PRIu64 "\n"),
 	error = 0;
 done:
 	if (locked_agno != -1)
-		pthread_mutex_unlock(&ag_locks[locked_agno]);
+		pthread_mutex_unlock(&ag_locks[locked_agno].lock);
 
 	if (i != *numrecs) {
 		ASSERT(i < *numrecs);
diff --git a/repair/globals.h b/repair/globals.h
index aef8b79..cbb2ce7 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -186,7 +186,10 @@ EXTERN xfs_extlen_t	sb_inoalignmt;
 EXTERN __uint32_t	sb_unit;
 EXTERN __uint32_t	sb_width;
 
-EXTERN pthread_mutex_t	*ag_locks;
+struct aglock {
+	pthread_mutex_t	lock __attribute__((__aligned__(64)));
+};
+EXTERN struct aglock	*ag_locks;
 
 EXTERN int 		report_interval;
 EXTERN __uint64_t 	*prog_rpt_done;
diff --git a/repair/incore.c b/repair/incore.c
index 3590464..a8d497e 100644
--- a/repair/incore.c
+++ b/repair/incore.c
@@ -294,13 +294,13 @@ init_bmaps(xfs_mount_t *mp)
 	if (!ag_bmap)
 		do_error(_("couldn't allocate block map btree roots\n"));
 
-	ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(pthread_mutex_t));
+	ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(struct aglock));
 	if (!ag_locks)
 		do_error(_("couldn't allocate block map locks\n"));
 
 	for (i = 0; i < mp->m_sb.sb_agcount; i++)  {
 		btree_init(&ag_bmap[i]);
-		pthread_mutex_init(&ag_locks[i], NULL);
+		pthread_mutex_init(&ag_locks[i].lock, NULL);
 	}
 
 	init_rt_bmap(mp);
diff --git a/repair/scan.c b/repair/scan.c
index b12f48b..f1411a2 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -268,7 +268,7 @@ _("bad back (left) sibling pointer (saw %llu should be NULL (0))\n"
 		agno = XFS_FSB_TO_AGNO(mp, bno);
 		agbno = XFS_FSB_TO_AGBNO(mp, bno);
 
-		pthread_mutex_lock(&ag_locks[agno]);
+		pthread_mutex_lock(&ag_locks[agno].lock);
 		state = get_bmap(agno, agbno);
 		switch (state) {
 		case XR_E_UNKNOWN:
@@ -314,7 +314,7 @@ _("bad state %d, inode %" PRIu64 " bmap block 0x%" PRIx64 "\n"),
 				state, ino, bno);
 			break;
 		}
-		pthread_mutex_unlock(&ag_locks[agno]);
+		pthread_mutex_unlock(&ag_locks[agno].lock);
 	} else  {
 		/*
 		 * attribute fork for realtime files is in the regular
-- 
1.8.4.rc3


* [PATCH 03/10] libxfs: buffer cache hashing is suboptimal
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The hashkey calculation is very simplistic, and throws away a lot of
entropy that should be folded into the hash. The result is a
sub-optimal distribution across the hash table. For example, with a
default 512 entry table, phase 2 results in this:

Max supported entries = 4096
Max utilized entries = 3970
Active entries = 3970
Hash table size = 512
Hits = 0
Misses = 3970
Hit ratio =  0.00
Hash buckets with   0 entries     12 (  0%)
Hash buckets with   1 entries      3 (  0%)
Hash buckets with   2 entries     10 (  0%)
Hash buckets with   3 entries      2 (  0%)
Hash buckets with   4 entries    129 ( 12%)
Hash buckets with   5 entries     20 (  2%)
Hash buckets with   6 entries     54 (  8%)
Hash buckets with   7 entries     22 (  3%)
Hash buckets with   8 entries    150 ( 30%)
Hash buckets with   9 entries     14 (  3%)
Hash buckets with  10 entries     16 (  4%)
Hash buckets with  11 entries      7 (  1%)
Hash buckets with  12 entries     38 ( 11%)
Hash buckets with  13 entries      5 (  1%)
Hash buckets with  14 entries      4 (  1%)
Hash buckets with  17 entries      1 (  0%)
Hash buckets with  19 entries      1 (  0%)
Hash buckets with  23 entries      1 (  0%)
Hash buckets with >24 entries     23 ( 16%)

Now, given a perfect distribution, we should have 8 entries per
chain. What we end up with is nothing like that.

Unfortunately, for phases 3/4 and others, the number of cached
objects results in the cache being expanded to 256k entries, and
so the stats just give this:

Hits = 262276
Misses = 8130393
Hit ratio =  3.13
Hash buckets with >24 entries    512 (100%)

We can't evaluate the efficiency of the hashing algorithm here.
Let's increase the size of the hash table to 32768 entries and go
from there:

Phase 2:

Hash buckets with   0 entries  31884 (  0%)
Hash buckets with   1 entries     35 (  0%)
Hash buckets with   2 entries     78 (  3%)
Hash buckets with   3 entries     30 (  2%)
Hash buckets with   4 entries    649 ( 65%)
Hash buckets with   5 entries     12 (  1%)
Hash buckets with   6 entries     13 (  1%)
Hash buckets with   8 entries     40 (  8%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  15 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with  24 entries     17 ( 10%)
Hash buckets with >24 entries      6 (  4%)

There's a significant number of collisions given the population is
only 15% of the size of the table itself....

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262090
Hash table size = 32768
Hits = 530844
Misses = 7164575
Hit ratio =  6.90
Hash buckets with   0 entries  11898 (  0%)
....
Hash buckets with  12 entries   5513 ( 25%)
Hash buckets with  13 entries   4188 ( 20%)
Hash buckets with  14 entries   2073 ( 11%)
Hash buckets with  15 entries   1811 ( 10%)
Hash buckets with  16 entries   1994 ( 12%)
....
Hash buckets with >24 entries    339 (  4%)

So, a third of the hash table does not even have any entries in it,
despite having more than 7.5 million entries run through the cache.
Median chain lengths are 12-16 entries; the ideal is 8. And there are
lots of collisions on the longer-than-24-entry chains...

Phase 6:

Hash buckets with   0 entries  14573 (  0%)
....
Hash buckets with >24 entries   2291 ( 36%)

Ouch. Not a good distribution at all.

Overall runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:35:04  12/06 11:35:04
Phase 2:        12/06 11:35:04  12/06 11:35:07  3 seconds
Phase 3:        12/06 11:35:07  12/06 11:38:27  3 minutes, 20 seconds
Phase 4:        12/06 11:38:27  12/06 11:41:32  3 minutes, 5 seconds
Phase 5:        12/06 11:41:32  12/06 11:41:32
Phase 6:        12/06 11:41:32  12/06 11:42:29  57 seconds
Phase 7:        12/06 11:42:29  12/06 11:42:30  1 second

Total run time: 7 minutes, 26 seconds

Modify the hash to be something more workable - steal the Linux
kernel inode hash calculation and try that:

Phase 2:

Max supported entries = 262144
Max utilized entries = 3970
Active entries = 3970
Hash table size = 32768
Hits = 0
Misses = 3972
Hit ratio =  0.00
Hash buckets with   0 entries  29055 (  0%)
Hash buckets with   1 entries   3464 ( 87%)
Hash buckets with   2 entries    241 ( 12%)
Hash buckets with   3 entries      8 (  0%)

Close to perfect.

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262118
Hash table size = 32768
Hits = 567900
Misses = 7118749
Hit ratio =  7.39
Hash buckets with   5 entries   1572 (  2%)
Hash buckets with   6 entries   2186 (  5%)
Hash buckets with   7 entries   9217 ( 24%)
Hash buckets with   8 entries   8757 ( 26%)
Hash buckets with   9 entries   6135 ( 21%)
Hash buckets with  10 entries   3166 ( 12%)
Hash buckets with  11 entries   1257 (  5%)
Hash buckets with  12 entries    364 (  1%)
Hash buckets with  13 entries     94 (  0%)
Hash buckets with  14 entries     14 (  0%)
Hash buckets with  15 entries      5 (  0%)

A near-perfect bell curve centered on the optimal distribution
number of 8 entries per chain.

Phase 6:

Hash buckets with   0 entries     24 (  0%)
Hash buckets with   1 entries    190 (  0%)
Hash buckets with   2 entries    571 (  0%)
Hash buckets with   3 entries   1263 (  1%)
Hash buckets with   4 entries   2465 (  3%)
Hash buckets with   5 entries   3399 (  6%)
Hash buckets with   6 entries   4002 (  9%)
Hash buckets with   7 entries   4186 ( 11%)
Hash buckets with   8 entries   3773 ( 11%)
Hash buckets with   9 entries   3240 ( 11%)
Hash buckets with  10 entries   2523 (  9%)
Hash buckets with  11 entries   2074 (  8%)
Hash buckets with  12 entries   1582 (  7%)
Hash buckets with  13 entries   1206 (  5%)
Hash buckets with  14 entries    863 (  4%)
Hash buckets with  15 entries    601 (  3%)
Hash buckets with  16 entries    386 (  2%)
Hash buckets with  17 entries    205 (  1%)
Hash buckets with  18 entries    122 (  0%)
Hash buckets with  19 entries     48 (  0%)
Hash buckets with  20 entries     24 (  0%)
Hash buckets with  21 entries     13 (  0%)
Hash buckets with  22 entries      8 (  0%)

A much wider bell curve than phase 3, but still centered around the
optimal value and far, far better than the distribution of the
current hash calculation. Runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:47:21  12/06 11:47:21
Phase 2:        12/06 11:47:21  12/06 11:47:23  2 seconds
Phase 3:        12/06 11:47:23  12/06 11:50:50  3 minutes, 27 seconds
Phase 4:        12/06 11:50:50  12/06 11:53:57  3 minutes, 7 seconds
Phase 5:        12/06 11:53:57  12/06 11:53:58  1 second
Phase 6:        12/06 11:53:58  12/06 11:54:51  53 seconds
Phase 7:        12/06 11:54:51  12/06 11:54:52  1 second

Total run time: 7 minutes, 31 seconds

Essentially unchanged - this is somewhat of a "swings and
roundabouts" test because what it is really measuring is the
cache-miss overhead.

FWIW, the comparison above is pretty much a best case for the existing
hash calculation. On a less populated filesystem (5m inodes rather
than 50m inodes) the typical hash distribution was:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262094
Hash table size = 32768
Hits = 626228
Misses = 800166
Hit ratio = 43.90
Hash buckets with   0 entries  29274 (  0%)
Hash buckets with   3 entries      1 (  0%)
Hash buckets with   4 entries      1 (  0%)
Hash buckets with   7 entries      1 (  0%)
Hash buckets with   8 entries      1 (  0%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  12 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  16 entries      2 (  0%)
Hash buckets with  18 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with >24 entries   3483 ( 99%)

Total and utter crap. Same filesystem, new hash function:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262103
Hash table size = 32768
Hits = 673208
Misses = 838265
Hit ratio = 44.54
Hash buckets with   3 entries    558 (  0%)
Hash buckets with   4 entries   1126 (  1%)
Hash buckets with   5 entries   2440 (  4%)
Hash buckets with   6 entries   4249 (  9%)
Hash buckets with   7 entries   5280 ( 14%)
Hash buckets with   8 entries   5598 ( 17%)
Hash buckets with   9 entries   5446 ( 18%)
Hash buckets with  10 entries   3879 ( 14%)
Hash buckets with  11 entries   2405 ( 10%)
Hash buckets with  12 entries   1187 (  5%)
Hash buckets with  13 entries    447 (  2%)
Hash buckets with  14 entries    125 (  0%)
Hash buckets with  15 entries     25 (  0%)
Hash buckets with  16 entries      3 (  0%)

Kinda says it all, really...
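
For the curious, here are both calculations in a tiny standalone
harness (a sketch: the real code hashes a struct xfs_bufkey, reduced
here to a bare block number):

#include <stdint.h>
#include <stdio.h>

/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
#define GOLDEN_RATIO_PRIME	0x9e37fffffffc0001UL
#define CACHE_LINE_SIZE		64

/* old hash: shift then mod - runs of nearby block numbers collide */
static unsigned int
old_hash(uint64_t blkno, unsigned int hashsize)
{
	return ((unsigned int)blkno >> 5) % hashsize;
}

/* new hash: fold the high bits back in via a large prime */
static unsigned int
new_hash(uint64_t blkno, unsigned int hashsize, unsigned int hashshift)
{
	uint64_t	tmp;

	tmp = blkno ^ (GOLDEN_RATIO_PRIME + blkno) / CACHE_LINE_SIZE;
	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
	return tmp % hashsize;
}

int
main(void)
{
	uint64_t	b;

	/* 512 buckets, hashshift = log2(512) = 9 */
	for (b = 0; b < 16; b++)
		printf("blk %4llu: old %3u  new %3u\n",
		       (unsigned long long)(b * 8),
		       old_hash(b * 8, 512), new_hash(b * 8, 512, 9));
	return 0;
}

Feeding it a strided run of block numbers shows the old calculation
collapsing groups of four onto the same bucket, while the new one
spreads them across the table.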

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 include/cache.h |  4 +++-
 libxfs/cache.c  |  7 +++++--
 libxfs/rdwr.c   | 12 ++++++++++--
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/cache.h b/include/cache.h
index 76cb234..0a84c69 100644
--- a/include/cache.h
+++ b/include/cache.h
@@ -66,7 +66,8 @@ typedef void (*cache_walk_t)(struct cache_node *);
 typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
 typedef void (*cache_node_flush_t)(struct cache_node *);
 typedef void (*cache_node_relse_t)(struct cache_node *);
-typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int);
+typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
+					  unsigned int);
 typedef int (*cache_node_compare_t)(struct cache_node *, cache_key_t);
 typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
 
@@ -112,6 +113,7 @@ struct cache {
 	cache_node_compare_t	compare;	/* comparison routine */
 	cache_bulk_relse_t	bulkrelse;	/* bulk release routine */
 	unsigned int		c_hashsize;	/* hash bucket count */
+	unsigned int		c_hashshift;	/* hash key shift */
 	struct cache_hash	*c_hash;	/* hash table buckets */
 	struct cache_mru	c_mrus[CACHE_MAX_PRIORITY + 1];
 	unsigned long long	c_misses;	/* cache misses */
diff --git a/libxfs/cache.c b/libxfs/cache.c
index 84d2860..dc69689 100644
--- a/libxfs/cache.c
+++ b/libxfs/cache.c
@@ -25,6 +25,7 @@
 #include <xfs/platform_defs.h>
 #include <xfs/list.h>
 #include <xfs/cache.h>
+#include <xfs/libxfs.h>
 
 #define CACHE_DEBUG 1
 #undef CACHE_DEBUG
@@ -61,6 +62,7 @@ cache_init(
 	cache->c_misses = 0;
 	cache->c_maxcount = maxcount;
 	cache->c_hashsize = hashsize;
+	cache->c_hashshift = libxfs_highbit32(hashsize);
 	cache->hash = cache_operations->hash;
 	cache->alloc = cache_operations->alloc;
 	cache->flush = cache_operations->flush;
@@ -343,7 +345,7 @@ cache_node_get(
 	int			priority = 0;
 	int			purged = 0;
 
-	hashidx = cache->hash(key, cache->c_hashsize);
+	hashidx = cache->hash(key, cache->c_hashsize, cache->c_hashshift);
 	hash = cache->c_hash + hashidx;
 	head = &hash->ch_list;
 
@@ -515,7 +517,8 @@ cache_node_purge(
 	struct cache_hash *	hash;
 	int			count = -1;
 
-	hash = cache->c_hash + cache->hash(key, cache->c_hashsize);
+	hash = cache->c_hash + cache->hash(key, cache->c_hashsize,
+					   cache->c_hashshift);
 	head = &hash->ch_list;
 	pthread_mutex_lock(&hash->ch_mutex);
 	for (pos = head->next, n = pos->next; pos != head;
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index d0ff15b..1b691fb 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -313,10 +313,18 @@ struct xfs_bufkey {
 	int			nmaps;
 };
 
+/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME	0x9e37fffffffc0001UL
+#define CACHE_LINE_SIZE		64
 static unsigned int
-libxfs_bhash(cache_key_t key, unsigned int hashsize)
+libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
 {
-	return (((unsigned int)((struct xfs_bufkey *)key)->blkno) >> 5) % hashsize;
+	uint64_t	hashval = ((struct xfs_bufkey *)key)->blkno;
+	uint64_t	tmp;
+
+	tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
+	return tmp % hashsize;
 }
 
 static int
-- 
1.8.4.rc3


* [PATCH 04/10] repair: limit auto-striding concurrency appropriately
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

It's possible to have filesystems with hundreds of AGs on systems
with little concurrency and few resources. In this case, we can easily
exhaust memory, fail to create threads, and hit all sorts of
interesting problems.

xfs/250 can cause this to occur, with failures like:

        - agno = 707
        - agno = 692
fatal error -- cannot create worker threads, error = [11] Resource temporarily unavailable

And this:

        - agno = 484
        - agno = 782
failed to create prefetch thread: Resource temporarily unavailable

Because it's trying to create more threads than a poor little 512MB
single CPU ia32 box can handle.

So, limit concurrency to a maximum of numcpus * 8 to prevent this.
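
To make the stride adjustment concrete (illustrative numbers only):
with 800 AGs and the default stride of 15, repair wants
ceil(800/15) = 54 threads. On a single CPU box max_threads is 8, so
the stride doubles 15 -> 30 -> 60 -> 120, at which point
ceil(800/120) = 7 threads fits under the limit.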

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 include/libxfs.h    |  1 +
 libxfs/init.h       |  1 -
 repair/xfs_repair.c | 21 ++++++++++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/libxfs.h b/include/libxfs.h
index bb0369f..f688598 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -144,6 +144,7 @@ extern void	libxfs_device_close (dev_t);
 extern int	libxfs_device_alignment (void);
 extern void	libxfs_report(FILE *);
 extern void	platform_findsizes(char *path, int fd, long long *sz, int *bsz);
+extern int	platform_nproc(void);
 
 /* check or write log footer: specify device, log size in blocks & uuid */
 typedef xfs_caddr_t (libxfs_get_block_t)(xfs_caddr_t, int, void *);
diff --git a/libxfs/init.h b/libxfs/init.h
index f0b8cb6..112febb 100644
--- a/libxfs/init.h
+++ b/libxfs/init.h
@@ -31,7 +31,6 @@ extern char *platform_findrawpath (char *path);
 extern char *platform_findblockpath (char *path);
 extern int platform_direct_blockdev (void);
 extern int platform_align_blockdev (void);
-extern int platform_nproc(void);
 extern unsigned long platform_physmem(void);	/* in kilobytes */
 extern int platform_has_uuid;
 
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index bac334f..6327076 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -629,13 +629,32 @@ main(int argc, char **argv)
 	 * to target these for an increase in thread count. Hence a stride value
 	 * of 15 is chosen to ensure we get at least 2 AGs being scanned at once
 	 * on such filesystems.
+	 *
+	 * Limit the maximum thread count based on the available CPU power that
+	 * is available. If we use too many threads, we might run out of memory
+	 * and CPU power before we run out of IO concurrency. We limit to 8
+	 * threads/CPU as this is enough threads to saturate a CPU on fast
+	 * devices, yet few enough that it will saturate but won't overload slow
+	 * devices.
 	 */
 	if (!ag_stride && glob_agcount >= 16 && do_prefetch)
 		ag_stride = 15;
 
 	if (ag_stride) {
+		int max_threads = platform_nproc() * 8;
+
 		thread_count = (glob_agcount + ag_stride - 1) / ag_stride;
-		thread_init();
+		while (thread_count > max_threads) {
+			ag_stride *= 2;
+			thread_count = (glob_agcount + ag_stride - 1) /
+								ag_stride;
+		}
+		if (thread_count > 0)
+			thread_init();
+		else {
+			thread_count = 1;
+			ag_stride = 0;
+		}
 	}
 
 	if (ag_stride && report_interval) {
-- 
1.8.4.rc3


* [PATCH 05/10] repair: use a listhead for the dotdot list
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Cleanup suggested by Christoph Hellwig - removes another open-coded
list implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 repair/phase6.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index cdbf4db..7be68b3 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -43,13 +43,13 @@ static struct xfs_name		xfs_name_dot = {(unsigned char *)".",
  * entries are updated. These must be rebuilt after the initial pass
  */
 typedef struct dotdot_update {
-	struct dotdot_update	*next;
+	struct list_head	list;
 	ino_tree_node_t		*irec;
 	xfs_agnumber_t		agno;
 	int			ino_offset;
 } dotdot_update_t;
 
-static dotdot_update_t		*dotdot_update_list;
+static LIST_HEAD(dotdot_update_list);
 static int			dotdot_update;
 
 static void
@@ -64,12 +64,12 @@ add_dotdot_update(
 		do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
 			sizeof(dotdot_update_t));
 
-	dir->next = dotdot_update_list;
+	INIT_LIST_HEAD(&dir->list);
 	dir->irec = irec;
 	dir->agno = agno;
 	dir->ino_offset = ino_offset;
 
-	dotdot_update_list = dir;
+	list_add(&dir->list, &dotdot_update_list);
 }
 
 /*
@@ -3021,9 +3021,10 @@ update_missing_dotdot_entries(
 	 * set dotdot_update flag so processing routines do not count links
 	 */
 	dotdot_update = 1;
-	while (dotdot_update_list) {
-		dir = dotdot_update_list;
-		dotdot_update_list = dir->next;
+	while (!list_empty(&dotdot_update_list)) {
+		dir = list_entry(dotdot_update_list.prev, struct dotdot_update,
+				 list);
+		list_del(&dir->list);
 		process_dir_inode(mp, dir->agno, dir->irec, dir->ino_offset);
 		free(dir);
 	}
-- 
1.8.4.rc3


* [PATCH 06/10] repair: fix prefetch queue limiting
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The length of the prefetch queue is limited by a semaphore. To avoid
an ABBA deadlock, we only trywait on the semaphore, so if we fail to
get it we can kick the IO queues before sleeping. Unfortunately,
the "need to sleep" detection is just a little wrong - it needs to
look at errno, not the return value, for EAGAIN.

Hence this queue throttling has not been working for a long time.
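
The API detail, as a minimal sketch (the helper name is made up; only
the errno handling matters):

#include <errno.h>
#include <semaphore.h>

/*
 * sem_trywait() returns 0 on success and -1 on failure, with the
 * reason in errno. Comparing the return value itself against EAGAIN
 * can never match, so the throttle never fired.
 */
static int
queue_has_space(sem_t *sem)
{
	if (sem_trywait(sem) == 0)
		return 1;		/* got a slot */
	if (errno == EAGAIN)
		return 0;		/* full: kick the IO queues, then wait */
	return -1;			/* some other error */
}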

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase6.c   | 9 ++++++++-
 repair/prefetch.c | 2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index 7be68b3..63359d1 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -2999,8 +2999,15 @@ traverse_function(
 		if (irec->ino_isa_dir == 0)
 			continue;
 
-		if (pf_args)
+		if (pf_args) {
 			sem_post(&pf_args->ra_count);
+#ifdef XR_PF_TRACE
+			sem_getvalue(&pf_args->ra_count, &i);
+			pftrace(
+		"processing inode chunk %p in AG %d (sem count = %d)",
+				irec, agno, i);
+#endif
+		}
 
 		for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
 			if (inode_isadir(irec, i))
diff --git a/repair/prefetch.c b/repair/prefetch.c
index 984beda..f4f3d71 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -723,7 +723,7 @@ pf_queuing_worker(
 			irec, args->agno, i);
 #endif
 		err = sem_trywait(&args->ra_count);
-		if (err == EAGAIN) {
+		if (err < 0 && errno == EAGAIN) {
 			/*
 			 * Kick the queue once we have reached the limit;
 			 * without this the threads processing the inodes
-- 
1.8.4.rc3


* [PATCH 07/10] repair: BMBT prefetch needs to be CRC aware
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

I'd been trying to track down a behavioural difference between
non-CRC and CRC enabled filesystems that resulted in the non-CRC
filesystem executing prefetch almost 3x faster than the CRC
filesystem. After many ratholes, I finally stumbled on the fact that
btree format directories were not being prefetched due to a missing
magic number check: it was rejecting all XFS_BMAP_CRC_MAGIC format
BMBT buffers. This makes prefetch on CRC enabled filesystems behave
the same as for non-CRC filesystems.

The difference a single line of code can make on a 50 million inode
filesystem with a single threaded, prefetch enabled run is pretty
amazing. It goes from 3,000 IOPS @ 50MB/s to 2,000 IOPS @ 800MB/s
and the cache hit rate goes from 3% to 49%. The runtime difference:

Unpatched:

Phase           Start           End             Duration
Phase 1:        02/21 18:34:12  02/21 18:34:12
Phase 2:        02/21 18:34:12  02/21 18:34:15  3 seconds
Phase 3:        02/21 18:34:15  02/21 18:40:09  5 minutes, 54 seconds
Phase 4:        02/21 18:40:09  02/21 18:46:36  6 minutes, 27 seconds
Phase 5:        02/21 18:46:36  02/21 18:46:37  1 second
Phase 6:        02/21 18:46:37  02/21 18:52:51  6 minutes, 14 seconds
Phase 7:        02/21 18:52:51  02/21 18:52:52  1 second

Total run time: 18 minutes, 40 seconds

Patched:

Phase           Start           End             Duration
Phase 1:        02/21 19:58:23  02/21 19:58:23
Phase 2:        02/21 19:58:23  02/21 19:58:27  4 seconds
Phase 3:        02/21 19:58:27  02/21 19:59:20  53 seconds
Phase 4:        02/21 19:59:20  02/21 20:00:07  47 seconds
Phase 5:        02/21 20:00:07  02/21 20:00:08  1 second
Phase 6:        02/21 20:00:08  02/21 20:00:50  42 seconds
Phase 7:        02/21 20:00:50  02/21 20:00:50

Total run time: 2 minutes, 27 seconds

Is no less impressive. Shame it's just a regression fix. ;)
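
Back of the envelope on those IOPS numbers (illustrative arithmetic,
not measured): 50MB/s at 3,000 IOPS is roughly 17KB per IO -
individual metadata blocks read one at a time - while 800MB/s at
2,000 IOPS is 400KB per IO, i.e. large contiguous prefetch reads.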

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 repair/prefetch.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/repair/prefetch.c b/repair/prefetch.c
index f4f3d71..946e822 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -298,7 +298,8 @@ pf_scanfunc_bmap(
 	/*
 	 * do some validation on the block contents
 	 */
-	if ((be32_to_cpu(block->bb_magic) != XFS_BMAP_MAGIC) ||
+	if ((block->bb_magic != cpu_to_be32(XFS_BMAP_MAGIC) &&
+	     block->bb_magic != cpu_to_be32(XFS_BMAP_CRC_MAGIC)) ||
 			(be16_to_cpu(block->bb_level) != level))
 		return 0;
 
-- 
1.8.4.rc3


* [PATCH 08/10] repair: factor out threading setup code
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The same code is repeated in different places to set up
multithreaded prefetching. This can all be factored into a single
implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/dinode.h   | 15 ++++++------
 repair/phase3.c   | 40 +++----------------------------
 repair/phase4.c   | 48 +++----------------------------------
 repair/phase6.c   | 22 ++++-------------
 repair/prefetch.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 repair/prefetch.h | 10 ++++++++
 6 files changed, 98 insertions(+), 108 deletions(-)

diff --git a/repair/dinode.h b/repair/dinode.h
index 7521521..5ee51ca 100644
--- a/repair/dinode.h
+++ b/repair/dinode.h
@@ -18,9 +18,8 @@
 #ifndef _XR_DINODE_H
 #define _XR_DINODE_H
 
-#include "prefetch.h"
-
 struct blkmap;
+struct prefetch_args;
 
 int
 verify_agbno(xfs_mount_t	*mp,
@@ -103,12 +102,12 @@ int
 process_uncertain_aginodes(xfs_mount_t		*mp,
 				xfs_agnumber_t	agno);
 void
-process_aginodes(xfs_mount_t	*mp,
-		prefetch_args_t	*pf_args,
-		xfs_agnumber_t	agno,
-		int		check_dirs,
-		int		check_dups,
-		int		extra_attr_check);
+process_aginodes(xfs_mount_t		*mp,
+		struct prefetch_args	*pf_args,
+		xfs_agnumber_t		agno,
+		int			check_dirs,
+		int			check_dups,
+		int			extra_attr_check);
 
 void
 check_uncertain_aginodes(xfs_mount_t	*mp,
diff --git a/repair/phase3.c b/repair/phase3.c
index 3e43938..213d368 100644
--- a/repair/phase3.c
+++ b/repair/phase3.c
@@ -17,6 +17,8 @@
  */
 
 #include <libxfs.h>
+#include "threads.h"
+#include "prefetch.h"
 #include "avl.h"
 #include "globals.h"
 #include "agheader.h"
@@ -24,9 +26,7 @@
 #include "protos.h"
 #include "err_protos.h"
 #include "dinode.h"
-#include "threads.h"
 #include "progress.h"
-#include "prefetch.h"
 
 static void
 process_agi_unlinked(
@@ -82,41 +82,7 @@ static void
 process_ags(
 	xfs_mount_t		*mp)
 {
-	int 			i, j;
-	xfs_agnumber_t 		agno;
-	work_queue_t		*queues;
-	prefetch_args_t		*pf_args[2];
-
-	queues = malloc(thread_count * sizeof(work_queue_t));
-
-	if (ag_stride) {
-		/*
-		 * create one worker thread for each segment of the volume
-		 */
-		for (i = 0, agno = 0; i < thread_count; i++) {
-			create_work_queue(&queues[i], mp, 1);
-			pf_args[0] = NULL;
-			for (j = 0; j < ag_stride && agno < mp->m_sb.sb_agcount;
-					j++, agno++) {
-				pf_args[0] = start_inode_prefetch(agno, 0, pf_args[0]);
-				queue_work(&queues[i], process_ag_func, agno, pf_args[0]);
-			}
-		}
-		/*
-		 * wait for workers to complete
-		 */
-		for (i = 0; i < thread_count; i++)
-			destroy_work_queue(&queues[i]);
-	} else {
-		queues[0].mp = mp;
-		pf_args[0] = start_inode_prefetch(0, 0, NULL);
-		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
-			pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 0,
-					pf_args[i & 1]);
-			process_ag_func(&queues[0], i, pf_args[i & 1]);
-		}
-	}
-	free(queues);
+	do_inode_prefetch(mp, ag_stride, process_ag_func, false, false);
 }
 
 void
diff --git a/repair/phase4.c b/repair/phase4.c
index a822aaa..189eeb9 100644
--- a/repair/phase4.c
+++ b/repair/phase4.c
@@ -17,6 +17,8 @@
  */
 
 #include <libxfs.h>
+#include "threads.h"
+#include "prefetch.h"
 #include "avl.h"
 #include "globals.h"
 #include "agheader.h"
@@ -27,9 +29,7 @@
 #include "bmap.h"
 #include "versions.h"
 #include "dir2.h"
-#include "threads.h"
 #include "progress.h"
-#include "prefetch.h"
 
 
 /*
@@ -150,49 +150,7 @@ static void
 process_ags(
 	xfs_mount_t		*mp)
 {
-	int 			i, j;
-	xfs_agnumber_t 		agno;
-	work_queue_t		*queues;
-	prefetch_args_t		*pf_args[2];
-
-	queues = malloc(thread_count * sizeof(work_queue_t));
-
-	if (!libxfs_bcache_overflowed()) {
-		queues[0].mp = mp;
-		create_work_queue(&queues[0], mp, libxfs_nproc());
-		for (i = 0; i < mp->m_sb.sb_agcount; i++)
-			queue_work(&queues[0], process_ag_func, i, NULL);
-		destroy_work_queue(&queues[0]);
-	} else {
-		if (ag_stride) {
-			/*
-			 * create one worker thread for each segment of the volume
-			 */
-			for (i = 0, agno = 0; i < thread_count; i++) {
-				create_work_queue(&queues[i], mp, 1);
-				pf_args[0] = NULL;
-				for (j = 0; j < ag_stride && agno < mp->m_sb.sb_agcount;
-						j++, agno++) {
-					pf_args[0] = start_inode_prefetch(agno, 0, pf_args[0]);
-					queue_work(&queues[i], process_ag_func, agno, pf_args[0]);
-				}
-			}
-			/*
-			 * wait for workers to complete
-			 */
-			for (i = 0; i < thread_count; i++)
-				destroy_work_queue(&queues[i]);
-		} else {
-			queues[0].mp = mp;
-			pf_args[0] = start_inode_prefetch(0, 0, NULL);
-			for (i = 0; i < mp->m_sb.sb_agcount; i++) {
-				pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
-						0, pf_args[i & 1]);
-				process_ag_func(&queues[0], i, pf_args[i & 1]);
-			}
-		}
-	}
-	free(queues);
+	do_inode_prefetch(mp, ag_stride, process_ag_func, true, false);
 }
 
 
diff --git a/repair/phase6.c b/repair/phase6.c
index 63359d1..446f3ee 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -17,6 +17,8 @@
  */
 
 #include <libxfs.h>
+#include "threads.h"
+#include "prefetch.h"
 #include "avl.h"
 #include "globals.h"
 #include "agheader.h"
@@ -25,9 +27,7 @@
 #include "protos.h"
 #include "err_protos.h"
 #include "dinode.h"
-#include "prefetch.h"
 #include "progress.h"
-#include "threads.h"
 #include "versions.h"
 
 static struct cred		zerocr;
@@ -3039,23 +3039,9 @@ update_missing_dotdot_entries(
 
 static void
 traverse_ags(
-	xfs_mount_t 		*mp)
+	struct xfs_mount	*mp)
 {
-	int			i;
-	work_queue_t		queue;
-	prefetch_args_t		*pf_args[2];
-
-	/*
-	 * we always do prefetch for phase 6 as it will fill in the gaps
-	 * not read during phase 3 prefetch.
-	 */
-	queue.mp = mp;
-	pf_args[0] = start_inode_prefetch(0, 1, NULL);
-	for (i = 0; i < glob_agcount; i++) {
-		pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
-				pf_args[i & 1]);
-		traverse_function(&queue, i, pf_args[i & 1]);
-	}
+	do_inode_prefetch(mp, 0, traverse_function, false, true);
 }
 
 void
diff --git a/repair/prefetch.c b/repair/prefetch.c
index 946e822..aee6342 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -866,6 +866,77 @@ start_inode_prefetch(
 	return args;
 }
 
+/*
+ * Do inode prefetch in the most optimal way for the context under which repair
+ * has been run.
+ */
+void
+do_inode_prefetch(
+	struct xfs_mount	*mp,
+	int			stride,
+	void			(*func)(struct work_queue *,
+					xfs_agnumber_t, void *),
+	bool			check_cache,
+	bool			dirs_only)
+{
+	int			i, j;
+	xfs_agnumber_t		agno;
+	struct work_queue	queue;
+	struct work_queue	*queues;
+	struct prefetch_args	*pf_args[2];
+
+	/*
+	 * If the previous phases of repair have not overflowed the buffer
+	 * cache, then we don't need to re-read any of the metadata in the
+	 * filesystem - it's all in the cache. In that case, run a thread per
+	 * CPU to maximise parallelism of the queue to be processed.
+	 */
+	if (check_cache && !libxfs_bcache_overflowed()) {
+		queue.mp = mp;
+		create_work_queue(&queue, mp, libxfs_nproc());
+		for (i = 0; i < mp->m_sb.sb_agcount; i++)
+			queue_work(&queue, func, i, NULL);
+		destroy_work_queue(&queue);
+		return;
+	}
+
+	/*
+	 * single threaded behaviour - single prefetch thread, processed
+	 * directly after each AG is queued.
+	 */
+	if (!stride) {
+		queue.mp = mp;
+		pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
+		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
+			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
+					dirs_only, pf_args[i & 1]);
+			func(&queue, i, pf_args[i & 1]);
+		}
+		return;
+	}
+
+	/*
+	 * create one worker thread for each segment of the volume
+	 */
+	queues = malloc(thread_count * sizeof(work_queue_t));
+	for (i = 0, agno = 0; i < thread_count; i++) {
+		create_work_queue(&queues[i], mp, 1);
+		pf_args[0] = NULL;
+		for (j = 0; j < stride && agno < mp->m_sb.sb_agcount;
+				j++, agno++) {
+			pf_args[0] = start_inode_prefetch(agno, dirs_only,
+							  pf_args[0]);
+			queue_work(&queues[i], func, agno, pf_args[0]);
+		}
+	}
+	/*
+	 * wait for workers to complete
+	 */
+	for (i = 0; i < thread_count; i++)
+		destroy_work_queue(&queues[i]);
+	free(queues);
+}
+
 void
 wait_for_inode_prefetch(
 	prefetch_args_t		*args)
diff --git a/repair/prefetch.h b/repair/prefetch.h
index 44a406c..b837752 100644
--- a/repair/prefetch.h
+++ b/repair/prefetch.h
@@ -4,6 +4,7 @@
 #include <semaphore.h>
 #include "incore.h"
 
+struct work_queue;
 
 extern int 	do_prefetch;
 
@@ -41,6 +42,15 @@ start_inode_prefetch(
 	prefetch_args_t		*prev_args);
 
 void
+do_inode_prefetch(
+	struct xfs_mount	*mp,
+	int			stride,
+	void			(*func)(struct work_queue *,
+					xfs_agnumber_t, void *),
+	bool			check_cache,
+	bool			dirs_only);
+
+void
 wait_for_inode_prefetch(
 	prefetch_args_t		*args);
 
-- 
1.8.4.rc3


* [PATCH 09/10] repair: prefetch runs too far ahead
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When trying to work out why a non-CRC filesystem took 1m57s to repair
and the same CRC enabled filesystem took 11m35s to repair, I noticed
that there was way too much CRC checking going on. Prefetched buffers
should not be CRCed, yet shortly after starting this began
to happen. perf profiling also showed an awful lot of time doing
buffer cache lookups, and the cache profile output indicated that
the hit rate was way below 3%. IOWs, the readahead was getting so
far ahead of the processing that it was thrashing the cache.

That there is a difference in processing rate between CRC and
non-CRC filesystems is not surprising. What is surprising is the
readahead behaviour - it basically just keeps reading ahead until it
has read everything on an AG, and then it goes on to the next AG,
and reads everything on it, and then goes on to the next AG,....

This goes on until it pushes all the buffers the processing threads
need out of the cache, and suddenly they start re-reading from disk
with the various CRC checking verifiers enabled, and we end up going
-really- slow. Yes, threading made up for it a bit, but it's just
wrong.

Basically, the code assumes that IO is going to be slower than
processing, so it doesn't throttle prefetch across AGs to slow
down prefetch to match the processing rate.

So, to fix this, don't let a prefetch thread get more than a single
AG ahead of its processing thread, just as occurs for single
threaded (i.e. -o ag_stride=-1) operation.
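
With this change, each prefetch worker is handed a contiguous range
of AGs - thread i covers [i * stride, min((i + 1) * stride, agcount))
- and within that range prefetch for AG n + 1 only starts as AG n
begins processing, the same one-AG-ahead pipeline the single threaded
path uses. (Illustrative numbers: 100 AGs with a stride of 15 gives
seven threads covering AGs 0-14, 15-29, ..., 90-99.)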

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/prefetch.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 61 insertions(+), 18 deletions(-)

diff --git a/repair/prefetch.c b/repair/prefetch.c
index aee6342..7d3efde 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -866,6 +866,48 @@ start_inode_prefetch(
 	return args;
 }
 
+void
+prefetch_ag_range(
+	struct work_queue	*work,
+	xfs_agnumber_t		start_ag,
+	xfs_agnumber_t		end_ag,
+	bool			dirs_only,
+	void			(*func)(struct work_queue *,
+					xfs_agnumber_t, void *))
+{
+	int			i;
+	struct prefetch_args	*pf_args[2];
+
+	pf_args[start_ag & 1] = start_inode_prefetch(start_ag, dirs_only, NULL);
+	for (i = start_ag; i < end_ag; i++) {
+		/* Don't prefetch end_ag */
+		if (i + 1 < end_ag)
+			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
+						dirs_only, pf_args[i & 1]);
+		func(work, i, pf_args[i & 1]);
+	}
+}
+
+struct pf_work_args {
+	xfs_agnumber_t	start_ag;
+	xfs_agnumber_t	end_ag;
+	bool		dirs_only;
+	void		(*func)(struct work_queue *, xfs_agnumber_t, void *);
+};
+
+static void
+prefetch_ag_range_work(
+	struct work_queue	*work,
+	xfs_agnumber_t		unused,
+	void			*args)
+{
+	struct pf_work_args *wargs = args;
+
+	prefetch_ag_range(work, wargs->start_ag, wargs->end_ag, 
+			  wargs->dirs_only, wargs->func);
+	free(args);
+}
+
 /*
  * Do inode prefetch in the most optimal way for the context under which repair
  * has been run.
@@ -879,11 +921,9 @@ do_inode_prefetch(
 	bool			check_cache,
 	bool			dirs_only)
 {
-	int			i, j;
-	xfs_agnumber_t		agno;
+	int			i;
 	struct work_queue	queue;
 	struct work_queue	*queues;
-	struct prefetch_args	*pf_args[2];
 
 	/*
 	 * If the previous phases of repair have not overflowed the buffer
@@ -906,12 +946,8 @@ do_inode_prefetch(
 	 */
 	if (!stride) {
 		queue.mp = mp;
-		pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
-		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
-			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
-					dirs_only, pf_args[i & 1]);
-			func(&queue, i, pf_args[i & 1]);
-		}
+		prefetch_ag_range(&queue, 0, mp->m_sb.sb_agcount,
+				  dirs_only, func);
 		return;
 	}
 
@@ -919,20 +955,27 @@ do_inode_prefetch(
 	 * create one worker thread for each segment of the volume
 	 */
 	queues = malloc(thread_count * sizeof(work_queue_t));
-	for (i = 0, agno = 0; i < thread_count; i++) {
+	for (i = 0; i < thread_count; i++) {
+		struct pf_work_args *wargs;
+
+		wargs = malloc(sizeof(struct pf_work_args));
+		wargs->start_ag = i * stride;
+		wargs->end_ag = min((i + 1) * stride,
+				    mp->m_sb.sb_agcount);
+		wargs->dirs_only = dirs_only;
+		wargs->func = func;
+
 		create_work_queue(&queues[i], mp, 1);
-		pf_args[0] = NULL;
-		for (j = 0; j < stride && agno < mp->m_sb.sb_agcount;
-				j++, agno++) {
-			pf_args[0] = start_inode_prefetch(agno, dirs_only,
-							  pf_args[0]);
-			queue_work(&queues[i], func, agno, pf_args[0]);
-		}
+		queue_work(&queues[i], prefetch_ag_range_work, 0, wargs);
+
+		if (wargs->end_ag >= mp->m_sb.sb_agcount)
+			break;
 	}
+
 	/*
 	 * wait for workers to complete
 	 */
-	for (i = 0; i < thread_count; i++)
+	for (; i >= 0; i--)
 		destroy_work_queue(&queues[i]);
 	free(queues);
 }
-- 
1.8.4.rc3


* [PATCH 10/10] libxfs: remove a couple of locks
From: Dave Chinner @ 2014-02-27  9:51 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The libxfs cache has a lot of false scalability about it. I can't
get lookups to scale past about one and a half CPUs, with one of the
key problems being a preponderance of global locks.

For example, just after doing a hash lookup that is careful to take
only the hash chain lock, it takes a global cache lock merely to
update the cache hit statistic. Scalable? Not at all. (A statistic
like this can tolerate the odd lost increment, so it doesn't need a
lock at all.)

The node priority stuff is protected by the object locks (i.e. the
buffer lock) and so it doesn't need locks, either.

This doesn't do very much to improve scalability, but it's a small
start.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 libxfs/cache.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/libxfs/cache.c b/libxfs/cache.c
index dc69689..9f7f8f4 100644
--- a/libxfs/cache.c
+++ b/libxfs/cache.c
@@ -391,9 +391,7 @@ cache_node_get(
 			pthread_mutex_unlock(&node->cn_mutex);
 			pthread_mutex_unlock(&hash->ch_mutex);
 
-			pthread_mutex_lock(&cache->c_mutex);
 			cache->c_hits++;
-			pthread_mutex_unlock(&cache->c_mutex);
 
 			*nodep = node;
 			return 0;
@@ -482,10 +480,8 @@ cache_node_set_priority(
 	else if (priority > CACHE_MAX_PRIORITY)
 		priority = CACHE_MAX_PRIORITY;
 
-	pthread_mutex_lock(&node->cn_mutex);
 	ASSERT(node->cn_count > 0);
 	node->cn_priority = priority;
-	pthread_mutex_unlock(&node->cn_mutex);
 }
 
 int
@@ -494,9 +490,7 @@ cache_node_get_priority(
 {
 	int			priority;
 
-	pthread_mutex_lock(&node->cn_mutex);
 	priority = node->cn_priority;
-	pthread_mutex_unlock(&node->cn_mutex);
 
 	return priority;
 }
-- 
1.8.4.rc3

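The race this leaves on the hit counter is benign - c_hits is a pure
statistic, so concurrent unlocked increments can at worst lose a few
counts. If exact numbers ever mattered, a relaxed C11 atomic would keep
the counter race-free without reintroducing a global lock. A minimal
sketch of that alternative - not what the patch does, and cache_stats
is a made-up structure, not the real libxfs layout:

#include <stdatomic.h>
#include <stdio.h>

/* Illustrative stats structure only. */
struct cache_stats {
	atomic_ulong	c_hits;
};

static void
cache_record_hit(struct cache_stats *stats)
{
	/* relaxed ordering suffices: this is a statistic, not a sync point */
	atomic_fetch_add_explicit(&stats->c_hits, 1, memory_order_relaxed);
}

int
main(void)
{
	struct cache_stats	stats = { 0 };

	cache_record_hit(&stats);
	printf("hits: %lu\n",
	       atomic_load_explicit(&stats.c_hits, memory_order_relaxed));
	return 0;
}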

* Re: [PATCH 08/10] repair: factor out threading setup code
  2014-02-27  9:51 ` [PATCH 08/10] repair: factor out threading setup code Dave Chinner
@ 2014-02-27 14:05   ` Brian Foster
  0 siblings, 0 replies; 16+ messages in thread
From: Brian Foster @ 2014-02-27 14:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Feb 27, 2014 at 08:51:13PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The same code is repeated in different places to set up
> multithreaded prefetching. This can all be factored into a single
> implementation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  repair/dinode.h   | 15 ++++++------
>  repair/phase3.c   | 40 +++----------------------------
>  repair/phase4.c   | 48 +++----------------------------------
>  repair/phase6.c   | 22 ++++-------------
>  repair/prefetch.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  repair/prefetch.h | 10 ++++++++
>  6 files changed, 98 insertions(+), 108 deletions(-)
> 
> diff --git a/repair/dinode.h b/repair/dinode.h
> index 7521521..5ee51ca 100644
> --- a/repair/dinode.h
> +++ b/repair/dinode.h
> @@ -18,9 +18,8 @@
>  #ifndef _XR_DINODE_H
>  #define _XR_DINODE_H
>  
> -#include "prefetch.h"
> -
>  struct blkmap;
> +struct prefetch_args;
>  
>  int
>  verify_agbno(xfs_mount_t	*mp,
> @@ -103,12 +102,12 @@ int
>  process_uncertain_aginodes(xfs_mount_t		*mp,
>  				xfs_agnumber_t	agno);
>  void
> -process_aginodes(xfs_mount_t	*mp,
> -		prefetch_args_t	*pf_args,
> -		xfs_agnumber_t	agno,
> -		int		check_dirs,
> -		int		check_dups,
> -		int		extra_attr_check);
> +process_aginodes(xfs_mount_t		*mp,
> +		struct prefetch_args	*pf_args,
> +		xfs_agnumber_t		agno,
> +		int			check_dirs,
> +		int			check_dups,
> +		int			extra_attr_check);
>  
>  void
>  check_uncertain_aginodes(xfs_mount_t	*mp,
> diff --git a/repair/phase3.c b/repair/phase3.c
> index 3e43938..213d368 100644
> --- a/repair/phase3.c
> +++ b/repair/phase3.c
> @@ -17,6 +17,8 @@
>   */
>  
>  #include <libxfs.h>
> +#include "threads.h"
> +#include "prefetch.h"
>  #include "avl.h"
>  #include "globals.h"
>  #include "agheader.h"
> @@ -24,9 +26,7 @@
>  #include "protos.h"
>  #include "err_protos.h"
>  #include "dinode.h"
> -#include "threads.h"
>  #include "progress.h"
> -#include "prefetch.h"
>  
>  static void
>  process_agi_unlinked(
> @@ -82,41 +82,7 @@ static void
>  process_ags(
>  	xfs_mount_t		*mp)
>  {
> -	int 			i, j;
> -	xfs_agnumber_t 		agno;
> -	work_queue_t		*queues;
> -	prefetch_args_t		*pf_args[2];
> -
> -	queues = malloc(thread_count * sizeof(work_queue_t));
> -
> -	if (ag_stride) {
> -		/*
> -		 * create one worker thread for each segment of the volume
> -		 */
> -		for (i = 0, agno = 0; i < thread_count; i++) {
> -			create_work_queue(&queues[i], mp, 1);
> -			pf_args[0] = NULL;
> -			for (j = 0; j < ag_stride && agno < mp->m_sb.sb_agcount;
> -					j++, agno++) {
> -				pf_args[0] = start_inode_prefetch(agno, 0, pf_args[0]);
> -				queue_work(&queues[i], process_ag_func, agno, pf_args[0]);
> -			}
> -		}
> -		/*
> -		 * wait for workers to complete
> -		 */
> -		for (i = 0; i < thread_count; i++)
> -			destroy_work_queue(&queues[i]);
> -	} else {
> -		queues[0].mp = mp;
> -		pf_args[0] = start_inode_prefetch(0, 0, NULL);
> -		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
> -			pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 0,
> -					pf_args[i & 1]);
> -			process_ag_func(&queues[0], i, pf_args[i & 1]);
> -		}
> -	}
> -	free(queues);
> +	do_inode_prefetch(mp, ag_stride, process_ag_func, false, false);
>  }
>  
>  void
> diff --git a/repair/phase4.c b/repair/phase4.c
> index a822aaa..189eeb9 100644
> --- a/repair/phase4.c
> +++ b/repair/phase4.c
> @@ -17,6 +17,8 @@
>   */
>  
>  #include <libxfs.h>
> +#include "threads.h"
> +#include "prefetch.h"
>  #include "avl.h"
>  #include "globals.h"
>  #include "agheader.h"
> @@ -27,9 +29,7 @@
>  #include "bmap.h"
>  #include "versions.h"
>  #include "dir2.h"
> -#include "threads.h"
>  #include "progress.h"
> -#include "prefetch.h"
>  
>  
>  /*
> @@ -150,49 +150,7 @@ static void
>  process_ags(
>  	xfs_mount_t		*mp)
>  {
> -	int 			i, j;
> -	xfs_agnumber_t 		agno;
> -	work_queue_t		*queues;
> -	prefetch_args_t		*pf_args[2];
> -
> -	queues = malloc(thread_count * sizeof(work_queue_t));
> -
> -	if (!libxfs_bcache_overflowed()) {
> -		queues[0].mp = mp;
> -		create_work_queue(&queues[0], mp, libxfs_nproc());
> -		for (i = 0; i < mp->m_sb.sb_agcount; i++)
> -			queue_work(&queues[0], process_ag_func, i, NULL);
> -		destroy_work_queue(&queues[0]);
> -	} else {
> -		if (ag_stride) {
> -			/*
> -			 * create one worker thread for each segment of the volume
> -			 */
> -			for (i = 0, agno = 0; i < thread_count; i++) {
> -				create_work_queue(&queues[i], mp, 1);
> -				pf_args[0] = NULL;
> -				for (j = 0; j < ag_stride && agno < mp->m_sb.sb_agcount;
> -						j++, agno++) {
> -					pf_args[0] = start_inode_prefetch(agno, 0, pf_args[0]);
> -					queue_work(&queues[i], process_ag_func, agno, pf_args[0]);
> -				}
> -			}
> -			/*
> -			 * wait for workers to complete
> -			 */
> -			for (i = 0; i < thread_count; i++)
> -				destroy_work_queue(&queues[i]);
> -		} else {
> -			queues[0].mp = mp;
> -			pf_args[0] = start_inode_prefetch(0, 0, NULL);
> -			for (i = 0; i < mp->m_sb.sb_agcount; i++) {
> -				pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
> -						0, pf_args[i & 1]);
> -				process_ag_func(&queues[0], i, pf_args[i & 1]);
> -			}
> -		}
> -	}
> -	free(queues);
> +	do_inode_prefetch(mp, ag_stride, process_ag_func, true, false);
>  }
>  
>  
> diff --git a/repair/phase6.c b/repair/phase6.c
> index 63359d1..446f3ee 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -17,6 +17,8 @@
>   */
>  
>  #include <libxfs.h>
> +#include "threads.h"
> +#include "prefetch.h"
>  #include "avl.h"
>  #include "globals.h"
>  #include "agheader.h"
> @@ -25,9 +27,7 @@
>  #include "protos.h"
>  #include "err_protos.h"
>  #include "dinode.h"
> -#include "prefetch.h"
>  #include "progress.h"
> -#include "threads.h"
>  #include "versions.h"
>  
>  static struct cred		zerocr;
> @@ -3039,23 +3039,9 @@ update_missing_dotdot_entries(
>  
>  static void
>  traverse_ags(
> -	xfs_mount_t 		*mp)
> +	struct xfs_mount	*mp)
>  {
> -	int			i;
> -	work_queue_t		queue;
> -	prefetch_args_t		*pf_args[2];
> -
> -	/*
> -	 * we always do prefetch for phase 6 as it will fill in the gaps
> -	 * not read during phase 3 prefetch.
> -	 */
> -	queue.mp = mp;
> -	pf_args[0] = start_inode_prefetch(0, 1, NULL);
> -	for (i = 0; i < glob_agcount; i++) {
> -		pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> -				pf_args[i & 1]);
> -		traverse_function(&queue, i, pf_args[i & 1]);
> -	}
> +	do_inode_prefetch(mp, 0, traverse_function, false, true);
>  }
>  
>  void
> diff --git a/repair/prefetch.c b/repair/prefetch.c
> index 946e822..aee6342 100644
> --- a/repair/prefetch.c
> +++ b/repair/prefetch.c
> @@ -866,6 +866,77 @@ start_inode_prefetch(
>  	return args;
>  }
>  
> +/*
> + * Do inode prefetch in the most optimal way for the context under which repair
> + * has been run.
> + */
> +void
> +do_inode_prefetch(
> +	struct xfs_mount	*mp,
> +	int			stride,
> +	void			(*func)(struct work_queue *,
> +					xfs_agnumber_t, void *),
> +	bool			check_cache,
> +	bool			dirs_only)
> +{
> +	int			i, j;
> +	xfs_agnumber_t		agno;
> +	struct work_queue	queue;
> +	struct work_queue	*queues;
> +	struct prefetch_args	*pf_args[2];
> +
> +	/*
> +	 * If the previous phases of repair have not overflowed the buffer
> +	 * cache, then we don't need to re-read any of the metadata in the
> +	 * filesystem - it's all in the cache. In that case, run a thread per
> +	 * CPU to maximise parallelism of the queue to be processed.
> +	 */
> +	if (check_cache && !libxfs_bcache_overflowed()) {
> +		queue.mp = mp;
> +		create_work_queue(&queue, mp, libxfs_nproc());
> +		for (i = 0; i < mp->m_sb.sb_agcount; i++)
> +			queue_work(&queue, func, i, NULL);
> +		destroy_work_queue(&queue);
> +		return;
> +	}
> +
> +	/*
> +	 * single threaded behaviour - single prefetch thread, processed
> +	 * directly after each AG is queued.
> +	 */
> +	if (!stride) {
> +		queue.mp = mp;
> +		pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
> +		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
> +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
> +					dirs_only, pf_args[i & 1]);
> +			func(&queue, i, pf_args[i & 1]);
> +		}
> +		return;
> +	}
> +
> +	/*
> +	 * create one worker thread for each segment of the volume
> +	 */
> +	queues = malloc(thread_count * sizeof(work_queue_t));
> +	for (i = 0, agno = 0; i < thread_count; i++) {
> +		create_work_queue(&queues[i], mp, 1);
> +		pf_args[0] = NULL;
> +		for (j = 0; j < stride && agno < mp->m_sb.sb_agcount;
> +				j++, agno++) {
> +			pf_args[0] = start_inode_prefetch(agno, dirs_only,
> +							  pf_args[0]);
> +			queue_work(&queues[i], func, agno, pf_args[0]);
> +		}
> +	}
> +	/*
> +	 * wait for workers to complete
> +	 */
> +	for (i = 0; i < thread_count; i++)
> +		destroy_work_queue(&queues[i]);
> +	free(queues);
> +}
> +
>  void
>  wait_for_inode_prefetch(
>  	prefetch_args_t		*args)
> diff --git a/repair/prefetch.h b/repair/prefetch.h
> index 44a406c..b837752 100644
> --- a/repair/prefetch.h
> +++ b/repair/prefetch.h
> @@ -4,6 +4,7 @@
>  #include <semaphore.h>
>  #include "incore.h"
>  
> +struct work_queue;
>  
>  extern int 	do_prefetch;
>  
> @@ -41,6 +42,15 @@ start_inode_prefetch(
>  	prefetch_args_t		*prev_args);
>  
>  void
> +do_inode_prefetch(
> +	struct xfs_mount	*mp,
> +	int			stride,
> +	void			(*func)(struct work_queue *,
> +					xfs_agnumber_t, void *),
> +	bool			check_cache,
> +	bool			dirs_only);
> +
> +void
>  wait_for_inode_prefetch(
>  	prefetch_args_t		*args);
>  
> -- 
> 1.8.4.rc3
> 

* Re: [PATCH 09/10] repair: prefetch runs too far ahead
  2014-02-27  9:51 ` [PATCH 09/10] repair: prefetch runs too far ahead Dave Chinner
@ 2014-02-27 14:08   ` Brian Foster
  2014-02-27 20:01     ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Foster @ 2014-02-27 14:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Feb 27, 2014 at 08:51:14PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 

Hmm, I replied to this one in the previous thread, but now I notice that
it apparently never made it to the list. Dave, did you happen to see
that in your inbox? Anyways, I had a couple minor comments/questions
that I'll duplicate here (which probably don't require another
repost)...

> When trying to work out why a non-crc filesystem took 1m57 to repair
> and the same CRC enabled filesystem took 11m35 to repair, I noticed
> that the was way to much CRC checking going on. Prefetched buffers
       there
> should not be being CRCed, yet shortly after the starting this began
		^?				^?
> to happen. perf profiling also showed up an awful lot of time doing
> buffer cache lookups, and the cache profile output indicated that
> the hit rate was way below 3%. IOWs, the readahead was getting so
> far ahead of the processing that it was thrashing the cache.
> 
> That there is a difference in processing rate between CRC and
> non-CRC filesystems is not surprising. What is surprising is the
> readahead behaviour - it basically just keeps reading ahead until it
> has read everything on an AG, and then it goes on to the next AG,
> and reads everything on it, and then goes on to the next AG,....
> 
> This goes on until it pushes all the buffers the processing threads
> need out of the cache, and suddening they start re-reading from disk
			     suddenly
> with the various CRC checking verifiers enabled, and we end up going
> -really- slow. Yes, threading made up for it a bit, but it's just
> wrong.
> 
> Basically, the code assumes that IO is going to be slower than
> processing, so it doesn't throttle prefetch across AGs to slow
> down prefetch to match the processing rate.
> 
> So, to fix this, don't let a prefetch thread get more than a single
> AG ahead of it's processing thread, just like occurs for single
> threaded (i.e. -o ag_stride=-1) operation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/prefetch.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 61 insertions(+), 18 deletions(-)
> 
> diff --git a/repair/prefetch.c b/repair/prefetch.c
> index aee6342..7d3efde 100644
> --- a/repair/prefetch.c
> +++ b/repair/prefetch.c
> @@ -866,6 +866,48 @@ start_inode_prefetch(
>  	return args;
>  }
>  

A brief comment before the prefetch_ag_range bits that explain the
implicit design constraints (e.g., throttle prefetch based on
processing) would be nice. :)

> +void
> +prefetch_ag_range(
> +	struct work_queue	*work,
> +	xfs_agnumber_t		start_ag,
> +	xfs_agnumber_t		end_ag,
> +	bool			dirs_only,
> +	void			(*func)(struct work_queue *,
> +					xfs_agnumber_t, void *))
> +{
> +	int			i;
> +	struct prefetch_args	*pf_args[2];
> +
> +	pf_args[start_ag & 1] = start_inode_prefetch(start_ag, dirs_only, NULL);
> +	for (i = start_ag; i < end_ag; i++) {
> +		/* Don't prefetch end_ag */
> +		if (i + 1 < end_ag)
> +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
> +						dirs_only, pf_args[i & 1]);
> +		func(work, i, pf_args[i & 1]);
> +	}
> +}
> +
> +struct pf_work_args {
> +	xfs_agnumber_t	start_ag;
> +	xfs_agnumber_t	end_ag;
> +	bool		dirs_only;
> +	void		(*func)(struct work_queue *, xfs_agnumber_t, void *);
> +};
> +
> +static void
> +prefetch_ag_range_work(
> +	struct work_queue	*work,
> +	xfs_agnumber_t		unused,
> +	void			*args)
> +{
> +	struct pf_work_args *wargs = args;
> +
> +	prefetch_ag_range(work, wargs->start_ag, wargs->end_ag, 
> +			  wargs->dirs_only, wargs->func);
> +	free(args);
> +}
> +
>  /*
>   * Do inode prefetch in the most optimal way for the context under which repair
>   * has been run.
> @@ -879,11 +921,9 @@ do_inode_prefetch(
>  	bool			check_cache,
>  	bool			dirs_only)
>  {
> -	int			i, j;
> -	xfs_agnumber_t		agno;
> +	int			i;
>  	struct work_queue	queue;
>  	struct work_queue	*queues;
> -	struct prefetch_args	*pf_args[2];
>  
>  	/*
>  	 * If the previous phases of repair have not overflowed the buffer
> @@ -906,12 +946,8 @@ do_inode_prefetch(
>  	 */
>  	if (!stride) {
>  		queue.mp = mp;
> -		pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
> -		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
> -			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
> -					dirs_only, pf_args[i & 1]);
> -			func(&queue, i, pf_args[i & 1]);
> -		}
> +		prefetch_ag_range(&queue, 0, mp->m_sb.sb_agcount,
> +				  dirs_only, func);
>  		return;
>  	}
>  
> @@ -919,20 +955,27 @@ do_inode_prefetch(
>  	 * create one worker thread for each segment of the volume
>  	 */
>  	queues = malloc(thread_count * sizeof(work_queue_t));
> -	for (i = 0, agno = 0; i < thread_count; i++) {
> +	for (i = 0; i < thread_count; i++) {
> +		struct pf_work_args *wargs;
> +
> +		wargs = malloc(sizeof(struct pf_work_args));
> +		wargs->start_ag = i * stride;
> +		wargs->end_ag = min((i + 1) * stride,
> +				    mp->m_sb.sb_agcount);
> +		wargs->dirs_only = dirs_only;
> +		wargs->func = func;
> +
>  		create_work_queue(&queues[i], mp, 1);
> -		pf_args[0] = NULL;
> -		for (j = 0; j < stride && agno < mp->m_sb.sb_agcount;
> -				j++, agno++) {
> -			pf_args[0] = start_inode_prefetch(agno, dirs_only,
> -							  pf_args[0]);
> -			queue_work(&queues[i], func, agno, pf_args[0]);
> -		}
> +		queue_work(&queues[i], prefetch_ag_range_work, 0, wargs);
> +
> +		if (wargs->end_ag >= mp->m_sb.sb_agcount)
> +			break;
>  	}

Ok, so instead of giving prefetch a green light on every single AG (and
queueing the "work" functions), we queue a series of prefetch(next) then
do_work() instances based on the stride. The prefetch "greenlight" (to
distinguish from the prefetch itself) is now offloaded to the threads
doing the work, which will only green light the next AG in the sequence.

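Concretely, with 16 AGs and a stride of 4 the loop above carves the
filesystem into the ranges [0,4), [4,8), [8,12) and [12,16), queueing
one prefetch_ag_range_work item per worker. A tiny sketch of just that
partitioning arithmetic, with made-up values:

#include <stdio.h>

#define min(a, b)	((a) < (b) ? (a) : (b))

int
main(void)
{
	/* illustrative values, not taken from the patch */
	unsigned int	agcount = 16, stride = 4, thread_count = 8;
	unsigned int	i;

	for (i = 0; i < thread_count; i++) {
		unsigned int	start_ag = i * stride;
		unsigned int	end_ag = min((i + 1) * stride, agcount);

		printf("worker %u: AGs [%u, %u)\n", i, start_ag, end_ag);
		if (end_ag >= agcount)
			break;	/* remaining workers are never given work */
	}
	return 0;
}
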
The code looks reasonable to me. Does the non-crc fs referenced in the
commit log, the one that repaired in 1m57, still run at that rate with
this throttling in place?

Brian

> +
>  	/*
>  	 * wait for workers to complete
>  	 */
> -	for (i = 0; i < thread_count; i++)
> +	for (; i >= 0; i--)
>  		destroy_work_queue(&queues[i]);
>  	free(queues);
>  }
> -- 
> 1.8.4.rc3
> 

* Re: [PATCH 09/10] repair: prefetch runs too far ahead
  2014-02-27 14:08   ` Brian Foster
@ 2014-02-27 20:01     ` Dave Chinner
  2014-02-27 20:24       ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2014-02-27 20:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Feb 27, 2014 at 09:08:46AM -0500, Brian Foster wrote:
> On Thu, Feb 27, 2014 at 08:51:14PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> 
> Hmm, I replied to this one in the previous thread, but now I notice that
> it apparently never made it to the list. Dave, did you happen to see
> that in your inbox? Anyways, I had a couple minor comments/questions
> that I'll duplicate here (which probably don't require another
> repost)...

No, I didn't.

[snip typos that need fixing]

> > diff --git a/repair/prefetch.c b/repair/prefetch.c
> > index aee6342..7d3efde 100644
> > --- a/repair/prefetch.c
> > +++ b/repair/prefetch.c
> > @@ -866,6 +866,48 @@ start_inode_prefetch(
> >  	return args;
> >  }
> >  
> 
> A brief comment before the prefetch_ag_range bits that explain the
> implicit design constraints (e.g., throttle prefetch based on
> processing) would be nice. :)

Can do.

> > @@ -919,20 +955,27 @@ do_inode_prefetch(
> >  	 * create one worker thread for each segment of the volume
> >  	 */
> >  	queues = malloc(thread_count * sizeof(work_queue_t));
> > -	for (i = 0, agno = 0; i < thread_count; i++) {
> > +	for (i = 0; i < thread_count; i++) {
> > +		struct pf_work_args *wargs;
> > +
> > +		wargs = malloc(sizeof(struct pf_work_args));
> > +		wargs->start_ag = i * stride;
> > +		wargs->end_ag = min((i + 1) * stride,
> > +				    mp->m_sb.sb_agcount);
> > +		wargs->dirs_only = dirs_only;
> > +		wargs->func = func;
> > +
> >  		create_work_queue(&queues[i], mp, 1);
> > -		pf_args[0] = NULL;
> > -		for (j = 0; j < stride && agno < mp->m_sb.sb_agcount;
> > -				j++, agno++) {
> > -			pf_args[0] = start_inode_prefetch(agno, dirs_only,
> > -							  pf_args[0]);
> > -			queue_work(&queues[i], func, agno, pf_args[0]);
> > -		}
> > +		queue_work(&queues[i], prefetch_ag_range_work, 0, wargs);
> > +
> > +		if (wargs->end_ag >= mp->m_sb.sb_agcount)
> > +			break;
> >  	}
> 
> Ok, so instead of giving prefetch a green light on every single AG (and
> queueing the "work" functions), we queue a series of prefetch(next) then
> do_work() instances based on the stride. The prefetch "greenlight" (to
> distinguish from the prefetch itself) is now offloaded to the threads
> doing the work, which will only green light the next AG in the sequence.

Right - prefetch is now limited to one AG ahead of the AG being
processed by each worker thread.

> The code looks reasonable to me. Does the non-crc fs referenced in the
> commit log to repair at 1m57 still run at that rate with this enabled?

It's within the run-to-run variation:

<recreate 50m inode filesystem without CRCs>
....

Run single threaded:

$ time sudo xfs_repair -v -v -o bhash=32768  -t 1 -o ag_stride=-1 /dev/vdc
.....

        XFS_REPAIR Summary    Fri Feb 28 06:53:45 2014

Phase       Start       End             Duration
Phase 1:    02/28 06:51:54      02/28 06:51:54  
Phase 2:    02/28 06:51:54      02/28 06:52:02  8 seconds
Phase 3:    02/28 06:52:02      02/28 06:52:37  35 seconds
Phase 4:    02/28 06:52:37      02/28 06:53:03  26 seconds
Phase 5:    02/28 06:53:03      02/28 06:53:03  
Phase 6:    02/28 06:53:03      02/28 06:53:44  41 seconds
Phase 7:    02/28 06:53:44      02/28 06:53:44  

Total run time: 1 minute, 50 seconds
done

Run auto-threaded:

$ time sudo xfs_repair -v -v -o bhash=32768  -t 1 /dev/vdc
.....
        XFS_REPAIR Summary    Fri Feb 28 06:58:08 2014

Phase       Start       End             Duration
Phase 1:    02/28 06:56:13      02/28 06:56:14  1 second
Phase 2:    02/28 06:56:14      02/28 06:56:20  6 seconds
Phase 3:    02/28 06:56:20      02/28 06:56:59  39 seconds
Phase 4:    02/28 06:56:59      02/28 06:57:28  29 seconds
Phase 5:    02/28 06:57:28      02/28 06:57:28  
Phase 6:    02/28 06:57:28      02/28 06:58:08  40 seconds
Phase 7:    02/28 06:58:08      02/28 06:58:08  

Total run time: 1 minute, 55 seconds
done

Even single AG prefetching on this test is bandwidth bound (pair of
SSDs in RAID0, reading 900MB/s @ 2,500 IOPS), so multi-threading
doesn't make it any faster.
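At 900MB/s and 2,500 IOPS the average IO size works out to roughly
360KB per read, i.e. the prefetch is already issuing nice large IOs.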

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 09/10] repair: prefetch runs too far ahead
  2014-02-27 20:01     ` Dave Chinner
@ 2014-02-27 20:24       ` Dave Chinner
  2014-02-27 20:45         ` Brian Foster
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2014-02-27 20:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Feb 28, 2014 at 07:01:50AM +1100, Dave Chinner wrote:
> On Thu, Feb 27, 2014 at 09:08:46AM -0500, Brian Foster wrote:
> > On Thu, Feb 27, 2014 at 08:51:14PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > 
> > Hmm, I replied to this one in the previous thread, but now I notice that
> > it apparently never made it to the list. Dave, did you happen to see
> > that in your inbox? Anyways, I had a couple minor comments/questions
> > that I'll duplicate here (which probably don't require another
> > repost)...
> 
> No, I didn't.
> 
> [snip typos that need fixing]
> 
> > > diff --git a/repair/prefetch.c b/repair/prefetch.c
> > > index aee6342..7d3efde 100644
> > > --- a/repair/prefetch.c
> > > +++ b/repair/prefetch.c
> > > @@ -866,6 +866,48 @@ start_inode_prefetch(
> > >  	return args;
> > >  }
> > >  
> > 
> > A brief comment before the prefetch_ag_range bits that explain the
> > implicit design constraints (e.g., throttle prefetch based on
> > processing) would be nice. :)
> 
> Can do.

Added this:

/*
 * prefetch_ag_range runs a prefetch-and-process loop across a range of AGs. It
 * begins with @start_ag, and finishes with @end_ag - 1 (i.e. does not prefetch
 * or process @end_ag). The function starts prefetch on the first AG, then loops
 * starting prefetch on the next AG and then blocks processing the current AG as
 * the prefetch queue brings inodes into the processing queue.
 *
 * There is only one prefetch taking place at a time, so the prefetch on the
 * next AG only starts once the current AG has been completely prefetched. Hence
 * the prefetch of the next AG will start some time before the processing of the
 * current AG finishes, ensuring that when we iterate an start processing the
 * next AG there is already a significant queue of inodes to process.
 *
 * Prefetch is done this way to prevent it from running too far ahead of the
 * processing. Allowing it to do so can cause cache thrashing, where new
 * prefetch causes previously prefetched buffers to be reclaimed before the
 * processing thread uses them. This results in reading all the inodes and
 * metadata twice per phase and it greatly slows down the processing. Hence we
 * have to carefully control how far ahead we prefetch...
 */

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 09/10] repair: prefetch runs too far ahead
  2014-02-27 20:24       ` Dave Chinner
@ 2014-02-27 20:45         ` Brian Foster
  0 siblings, 0 replies; 16+ messages in thread
From: Brian Foster @ 2014-02-27 20:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Feb 28, 2014 at 07:24:05AM +1100, Dave Chinner wrote:
> On Fri, Feb 28, 2014 at 07:01:50AM +1100, Dave Chinner wrote:
> > On Thu, Feb 27, 2014 at 09:08:46AM -0500, Brian Foster wrote:
> > > On Thu, Feb 27, 2014 at 08:51:14PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > 
> > > Hmm, I replied to this one in the previous thread, but now I notice that
> > > it apparently never made it to the list. Dave, did you happen to see
> > > that in your inbox? Anyways, I had a couple minor comments/questions
> > > that I'll duplicate here (which probably don't require another
> > > repost)...
> > 
> > No, I didn't.
> > 
> > [snip typos that need fixing]
> > 
> > > > diff --git a/repair/prefetch.c b/repair/prefetch.c
> > > > index aee6342..7d3efde 100644
> > > > --- a/repair/prefetch.c
> > > > +++ b/repair/prefetch.c
> > > > @@ -866,6 +866,48 @@ start_inode_prefetch(
> > > >  	return args;
> > > >  }
> > > >  
> > > 
> > > A brief comment before the prefetch_ag_range bits that explain the
> > > implicit design constraints (e.g., throttle prefetch based on
> > > processing) would be nice. :)
> > 
> > Can do.
> 
> Added this:
> 
> /*
>  * prefetch_ag_range runs a prefetch-and-process loop across a range of AGs. It
>  * begins with @start_ag, and finishes with @end_ag - 1 (i.e. does not prefetch
>  * or process @end_ag). The function starts prefetch on the first AG, then loops
>  * starting prefetch on the next AG and then blocks processing the current AG as
>  * the prefetch queue brings inodes into the processing queue.
>  *
>  * There is only one prefetch taking place at a time, so the prefetch on the
>  * next AG only starts once the current AG has been completely prefetched. Hence
>  * the prefetch of the next AG will start some time before the processing of the
>  * current AG finishes, ensuring that when we iterate an start processing the
							and
>  * next AG there is already a significant queue of inodes to process.
>  *
>  * Prefetch is done this way to prevent it from running too far ahead of the
>  * processing. Allowing it to do so can cause cache thrashing, where new
>  * prefetch causes previously prefetched buffers to be reclaimed before the
>  * processing thread uses them. This results in reading all the inodes and
>  * metadata twice per phase and it greatly slows down the processing. Hence we
>  * have to carefully control how far ahead we prefetch...
>  */
> 

Looks good, thanks!

Reviewed-by: Brian Foster <bfoster@redhat.com>

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


end of thread

Thread overview: 16+ messages
2014-02-27  9:51 [PATCH 00/10, v3] xfsprogs: reapir scalability and fixes Dave Chinner
2014-02-27  9:51 ` [PATCH 01/10] repair: translation lookups limit scalability Dave Chinner
2014-02-27  9:51 ` [PATCH 02/10] repair: per AG locks contend for cachelines Dave Chinner
2014-02-27  9:51 ` [PATCH 03/10] libxfs: buffer cache hashing is suboptimal Dave Chinner
2014-02-27  9:51 ` [PATCH 04/10] repair: limit auto-striding concurrency apprpriately Dave Chinner
2014-02-27  9:51 ` [PATCH 05/10] repair: use a listhead for the dotdot list Dave Chinner
2014-02-27  9:51 ` [PATCH 06/10] repair: fix prefetch queue limiting Dave Chinner
2014-02-27  9:51 ` [PATCH 07/10] repair: BMBT prefetch needs to be CRC aware Dave Chinner
2014-02-27  9:51 ` [PATCH 08/10] repair: factor out threading setup code Dave Chinner
2014-02-27 14:05   ` Brian Foster
2014-02-27  9:51 ` [PATCH 09/10] repair: prefetch runs too far ahead Dave Chinner
2014-02-27 14:08   ` Brian Foster
2014-02-27 20:01     ` Dave Chinner
2014-02-27 20:24       ` Dave Chinner
2014-02-27 20:45         ` Brian Foster
2014-02-27  9:51 ` [PATCH 10/10] libxfs: remove a couple of locks Dave Chinner
