* [PATCH 00/10, v2] repair: scalability and prefetch fixes
@ 2014-02-24  6:29 Dave Chinner
  2014-02-24  6:29 ` [PATCH 01/10] repair: translation lookups limit scalability Dave Chinner
                   ` (9 more replies)
  0 siblings, 10 replies; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

Hi folks,

This is a followup to the patchset posted here:

http://oss.sgi.com/archives/xfs/2013-12/msg00495.html

I've made various changes to address the review comments, and
dropped the parallelisation of phase 6 after I realised that it was
causing occasional problems (i.e. the simple patch didn't make phase
6 entirely threadsafe).

There are various other prefetching fixes added to the series as
well that help with scalability - mainly to do with reworking the
way prefetching runs ahead of the processing threads. The old code
allowed unbounded prefetching when the number of inode chunks was
less than the queue depth (typically 16384 inode chunks, or 65536
inode cluster buffers on 512 byte inode filesystems). This caused
the prefetching to blow out the caches and resulted in all metadata
being read twice - once for the readahead, then again when it was
actually needed.
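
(For scale, assuming the default 8k inode cluster: an inode chunk is
64 inodes and a cluster buffer holds 8192 / 512 = 16 inodes, so each
chunk spans 4 cluster buffers - which is how 16384 chunks map to the
65536 cluster buffers quoted above.)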

The prefetching was changed to only readahead the AG following the
current one being processed, hence preventing thrashing and an awful
lot of unnecessary IO and buffer cache churn.

I'm not seeing any problems with this series, and performance on CRC
filesystems is now on a par with non-CRC filesystems, so there are no
known xfs_repair performance issues remaining on CRC enabled
filesystems once this patch set is applied.

Comments, review, flames and testing welcome!

Cheers,

Dave.


* [PATCH 01/10] repair: translation lookups limit scalability
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-24 20:42   ` Brian Foster
  2014-02-25 20:01   ` Christoph Hellwig
  2014-02-24  6:29 ` [PATCH 02/10] repair: per AG locks contend for cachelines Dave Chinner
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

A bit of perf magic showed that scalability was limited to 3-4
concurrent threads due to contention on a lock inside something
called __dcigettext(). That's some library somewhere that repair is
linked against, and it turns out to be inside the translation
infrastructure to support the _() string mechanism:

# Samples: 34K of event 'cs'
# Event count (approx.): 495567
#
# Overhead        Command      Shared Object          Symbol
# ........  .............  .................  ..............
#
    60.30%     xfs_repair  [kernel.kallsyms]  [k] __schedule
               |
               --- 0x63fffff9c
                   process_bmbt_reclist_int
                  |
                  |--39.95%-- __dcigettext
                  |          __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                  |
                  |--8.91%-- __lll_lock_wait
                  |          system_call_fastpath
                  |          SyS_futex
                  |          do_futex
                  |          futex_wait
                  |          futex_wait_queue_me
                  |          schedule
                  |          __schedule
                   --51.13%-- [...]

Fix this by initialising global variables that hold the translated
strings at startup, hence avoiding the need to do repeated runtime
translation of the same strings.
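
For easy reference, the pattern being removed and its replacement
look like this (a minimal sketch collated from the diff below; _()
is the usual libintl gettext() wrapper):

	/* before: every call re-enters libintl's locked lookup */
	if (whichfork == XFS_DATA_FORK)
		forkname = _("data");
	else
		forkname = _("attr");

	/* after: translate once at startup, hand out cached pointers */
	static char	*forkname_data;
	static char	*forkname_attr;

	void
	dinode_bmbt_translation_init(void)
	{
		forkname_data = _("data");
		forkname_attr = _("attr");
	}

	char *
	get_forkname(int whichfork)
	{
		if (whichfork == XFS_DATA_FORK)
			return forkname_data;
		return forkname_attr;
	}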

Runtime of an unpatched xfs_repair is roughly:

        XFS_REPAIR Summary    Fri Dec  6 11:15:50 2013

Phase           Start           End             Duration
Phase 1:        12/06 10:56:21  12/06 10:56:21
Phase 2:        12/06 10:56:21  12/06 10:56:23  2 seconds
Phase 3:        12/06 10:56:23  12/06 11:01:31  5 minutes, 8 seconds
Phase 4:        12/06 11:01:31  12/06 11:07:08  5 minutes, 37 seconds
Phase 5:        12/06 11:07:08  12/06 11:07:09  1 second
Phase 6:        12/06 11:07:09  12/06 11:15:49  8 minutes, 40 seconds
Phase 7:        12/06 11:15:49  12/06 11:15:50  1 second

Total run time: 19 minutes, 29 seconds

Patched version:

Phase           Start           End             Duration
Phase 1:        12/06 10:36:29  12/06 10:36:29
Phase 2:        12/06 10:36:29  12/06 10:36:31  2 seconds
Phase 3:        12/06 10:36:31  12/06 10:40:08  3 minutes, 37 seconds
Phase 4:        12/06 10:40:08  12/06 10:43:42  3 minutes, 34 seconds
Phase 5:        12/06 10:43:42  12/06 10:43:42
Phase 6:        12/06 10:43:42  12/06 10:50:28  6 minutes, 46 seconds
Phase 7:        12/06 10:50:28  12/06 10:50:29  1 second

Total run time: 14 minutes

Big win!

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/dinode.c     | 49 +++++++++++++++++++++++++++++++++++--------------
 repair/dinode.h     |  3 +++
 repair/scan.c       |  7 +------
 repair/xfs_repair.c |  2 ++
 4 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/repair/dinode.c b/repair/dinode.c
index 3115bd0..4953a56 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -32,6 +32,37 @@
 #include "threads.h"
 
 /*
+ * gettext lookups for translations of strings use mutexes internally to
+ * the library. Hence when we come through here doing parallel scans in
+ * multiple AGs, then all do concurrent text conversions and serialise
+ * on the translation string lookups. Let's avoid doing repeated lookups
+ * by making them static variables and only assigning the translation
+ * once.
+ */
+static char	*forkname_data;
+static char	*forkname_attr;
+static char	*ftype_real_time;
+static char	*ftype_regular;
+
+void
+dinode_bmbt_translation_init(void)
+{
+	forkname_data = _("data");
+	forkname_attr = _("attr");
+	ftype_real_time = _("real-time");
+	ftype_regular = _("regular");
+}
+
+char *
+get_forkname(int whichfork)
+{
+
+	if (whichfork == XFS_DATA_FORK)
+		return forkname_data;
+	return forkname_attr;
+}
+
+/*
  * inode clearing routines
  */
 
@@ -542,7 +573,7 @@ process_bmbt_reclist_int(
 	xfs_dfiloff_t		op = 0;		/* prev offset */
 	xfs_dfsbno_t		b;
 	char			*ftype;
-	char			*forkname;
+	char			*forkname = get_forkname(whichfork);
 	int			i;
 	int			state;
 	xfs_agnumber_t		agno;
@@ -552,15 +583,10 @@ process_bmbt_reclist_int(
 	xfs_agnumber_t		locked_agno = -1;
 	int			error = 1;
 
-	if (whichfork == XFS_DATA_FORK)
-		forkname = _("data");
-	else
-		forkname = _("attr");
-
 	if (type == XR_INO_RTDATA)
-		ftype = _("real-time");
+		ftype = ftype_real_time;
 	else
-		ftype = _("regular");
+		ftype = ftype_regular;
 
 	for (i = 0; i < *numrecs; i++) {
 		libxfs_bmbt_disk_get_all(rp + i, &irec);
@@ -1110,7 +1136,7 @@ process_btinode(
 	xfs_ino_t		lino;
 	xfs_bmbt_ptr_t		*pp;
 	xfs_bmbt_key_t		*pkey;
-	char			*forkname;
+	char			*forkname = get_forkname(whichfork);
 	int			i;
 	int			level;
 	int			numrecs;
@@ -1122,11 +1148,6 @@ process_btinode(
 	*tot = 0;
 	*nex = 0;
 
-	if (whichfork == XFS_DATA_FORK)
-		forkname = _("data");
-	else
-		forkname = _("attr");
-
 	magic = xfs_sb_version_hascrc(&mp->m_sb) ? XFS_BMAP_CRC_MAGIC
 						 : XFS_BMAP_MAGIC;
 
diff --git a/repair/dinode.h b/repair/dinode.h
index d9197c1..7521521 100644
--- a/repair/dinode.h
+++ b/repair/dinode.h
@@ -127,4 +127,7 @@ get_bmapi(xfs_mount_t		*mp,
 		xfs_dfiloff_t	bno,
 		int             whichfork );
 
+void dinode_bmbt_translation_init(void);
+char * get_forkname(int whichfork);
+
 #endif /* _XR_DINODE_H */
diff --git a/repair/scan.c b/repair/scan.c
index 73b4581..b12f48b 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -171,17 +171,12 @@ scan_bmapbt(
 	xfs_bmbt_rec_t		*rp;
 	xfs_dfiloff_t		first_key;
 	xfs_dfiloff_t		last_key;
-	char			*forkname;
+	char			*forkname = get_forkname(whichfork);
 	int			numrecs;
 	xfs_agnumber_t		agno;
 	xfs_agblock_t		agbno;
 	int			state;
 
-	if (whichfork == XFS_DATA_FORK)
-		forkname = _("data");
-	else
-		forkname = _("attr");
-
 	/*
 	 * unlike the ag freeblock btrees, if anything looks wrong
 	 * in an inode bmap tree, just bail.  it's possible that
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 9e0502a..bac334f 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -29,6 +29,7 @@
 #include "prefetch.h"
 #include "threads.h"
 #include "progress.h"
+#include "dinode.h"
 
 #define	rounddown(x, y)	(((x)/(y))*(y))
 
@@ -533,6 +534,7 @@ main(int argc, char **argv)
 	setlocale(LC_ALL, "");
 	bindtextdomain(PACKAGE, LOCALEDIR);
 	textdomain(PACKAGE);
+	dinode_bmbt_translation_init();
 
 	temp_mp = &xfs_m;
 	setbuf(stdout, NULL);
-- 
1.8.4.rc3


* [PATCH 02/10] repair: per AG locks contend for cachelines
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
  2014-02-24  6:29 ` [PATCH 01/10] repair: translation lookups limit scalability Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-24  6:29 ` [PATCH 03/10] libxfs: buffer cache hashing is suboptimal Dave Chinner
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The per-ag locks used to protect per-ag block lists are located in a
tightly packed array. That means that they share cachelines, so
separate them out into separate 64 byte regions in the array.

pahole confirms the padding is correctly applied:

struct aglock {
        pthread_mutex_t            lock;                 /*     0    40 */

        /* size: 64, cachelines: 1, members: 1 */
        /* padding: 24 */
};
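
If the alignment attribute is ever dropped, the padding disappears
silently, so a compile-time guard is cheap insurance (a sketch, not
part of the patch - it assumes a C11 compiler for _Static_assert):

	#include <pthread.h>

	struct aglock {
		pthread_mutex_t	lock __attribute__((__aligned__(64)));
	};

	/* each lock must fill exactly one 64 byte cacheline */
	_Static_assert(sizeof(struct aglock) == 64,
		       "struct aglock must be cacheline sized");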

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 repair/dino_chunks.c | 24 ++++++++++++------------
 repair/dinode.c      |  6 +++---
 repair/globals.h     |  5 ++++-
 repair/incore.c      |  4 ++--
 repair/scan.c        |  4 ++--
 5 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
index 65281e4..afb26e0 100644
--- a/repair/dino_chunks.c
+++ b/repair/dino_chunks.c
@@ -141,7 +141,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 		if (check_aginode_block(mp, agno, agino) == 0)
 			return 0;
 
-		pthread_mutex_lock(&ag_locks[agno]);
+		pthread_mutex_lock(&ag_locks[agno].lock);
 
 		state = get_bmap(agno, agbno);
 		switch (state) {
@@ -166,7 +166,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 		_("inode block %d/%d multiply claimed, (state %d)\n"),
 				agno, agbno, state);
 			set_bmap(agno, agbno, XR_E_MULT);
-			pthread_mutex_unlock(&ag_locks[agno]);
+			pthread_mutex_unlock(&ag_locks[agno].lock);
 			return(0);
 		default:
 			do_warn(
@@ -176,7 +176,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 			break;
 		}
 
-		pthread_mutex_unlock(&ag_locks[agno]);
+		pthread_mutex_unlock(&ag_locks[agno].lock);
 
 		start_agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
 		*start_ino = XFS_AGINO_TO_INO(mp, agno, start_agino);
@@ -424,7 +424,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 	 * user data -- we're probably here as a result of a directory
 	 * entry or an iunlinked pointer
 	 */
-	pthread_mutex_lock(&ag_locks[agno]);
+	pthread_mutex_lock(&ag_locks[agno].lock);
 	for (cur_agbno = chunk_start_agbno;
 	     cur_agbno < chunk_stop_agbno;
 	     cur_agbno += blen)  {
@@ -438,7 +438,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 	_("inode block %d/%d multiply claimed, (state %d)\n"),
 				agno, cur_agbno, state);
 			set_bmap_ext(agno, cur_agbno, blen, XR_E_MULT);
-			pthread_mutex_unlock(&ag_locks[agno]);
+			pthread_mutex_unlock(&ag_locks[agno].lock);
 			return 0;
 		case XR_E_INO:
 			do_error(
@@ -449,7 +449,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 			break;
 		}
 	}
-	pthread_mutex_unlock(&ag_locks[agno]);
+	pthread_mutex_unlock(&ag_locks[agno].lock);
 
 	/*
 	 * ok, chunk is good.  put the record into the tree if required,
@@ -472,7 +472,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 
 	set_inode_used(irec_p, agino - start_agino);
 
-	pthread_mutex_lock(&ag_locks[agno]);
+	pthread_mutex_lock(&ag_locks[agno].lock);
 
 	for (cur_agbno = chunk_start_agbno;
 	     cur_agbno < chunk_stop_agbno;
@@ -505,7 +505,7 @@ verify_inode_chunk(xfs_mount_t		*mp,
 			break;
 		}
 	}
-	pthread_mutex_unlock(&ag_locks[agno]);
+	pthread_mutex_unlock(&ag_locks[agno].lock);
 
 	return(ino_cnt);
 }
@@ -736,7 +736,7 @@ process_inode_chunk(
 	/*
 	 * mark block as an inode block in the incore bitmap
 	 */
-	pthread_mutex_lock(&ag_locks[agno]);
+	pthread_mutex_lock(&ag_locks[agno].lock);
 	state = get_bmap(agno, agbno);
 	switch (state) {
 	case XR_E_INO:	/* already marked */
@@ -755,7 +755,7 @@ process_inode_chunk(
 			XFS_AGB_TO_FSB(mp, agno, agbno), state);
 		break;
 	}
-	pthread_mutex_unlock(&ag_locks[agno]);
+	pthread_mutex_unlock(&ag_locks[agno].lock);
 
 	for (;;) {
 		/*
@@ -925,7 +925,7 @@ process_inode_chunk(
 			ibuf_offset = 0;
 			agbno++;
 
-			pthread_mutex_lock(&ag_locks[agno]);
+			pthread_mutex_lock(&ag_locks[agno].lock);
 			state = get_bmap(agno, agbno);
 			switch (state) {
 			case XR_E_INO:	/* already marked */
@@ -946,7 +946,7 @@ process_inode_chunk(
 					XFS_AGB_TO_FSB(mp, agno, agbno), state);
 				break;
 			}
-			pthread_mutex_unlock(&ag_locks[agno]);
+			pthread_mutex_unlock(&ag_locks[agno].lock);
 
 		} else if (irec_offset == XFS_INODES_PER_CHUNK)  {
 			/*
diff --git a/repair/dinode.c b/repair/dinode.c
index 4953a56..48f17ac 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -707,8 +707,8 @@ _("Fatal error: inode %" PRIu64 " - blkmap_set_ext(): %s\n"
 		ebno = agbno + irec.br_blockcount;
 		if (agno != locked_agno) {
 			if (locked_agno != -1)
-				pthread_mutex_unlock(&ag_locks[locked_agno]);
-			pthread_mutex_lock(&ag_locks[agno]);
+				pthread_mutex_unlock(&ag_locks[locked_agno].lock);
+			pthread_mutex_lock(&ag_locks[agno].lock);
 			locked_agno = agno;
 		}
 
@@ -777,7 +777,7 @@ _("illegal state %d in block map %" PRIu64 "\n"),
 	error = 0;
 done:
 	if (locked_agno != -1)
-		pthread_mutex_unlock(&ag_locks[locked_agno]);
+		pthread_mutex_unlock(&ag_locks[locked_agno].lock);
 
 	if (i != *numrecs) {
 		ASSERT(i < *numrecs);
diff --git a/repair/globals.h b/repair/globals.h
index aef8b79..cbb2ce7 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -186,7 +186,10 @@ EXTERN xfs_extlen_t	sb_inoalignmt;
 EXTERN __uint32_t	sb_unit;
 EXTERN __uint32_t	sb_width;
 
-EXTERN pthread_mutex_t	*ag_locks;
+struct aglock {
+	pthread_mutex_t	lock __attribute__((__aligned__(64)));
+};
+EXTERN struct aglock	*ag_locks;
 
 EXTERN int 		report_interval;
 EXTERN __uint64_t 	*prog_rpt_done;
diff --git a/repair/incore.c b/repair/incore.c
index 3590464..a8d497e 100644
--- a/repair/incore.c
+++ b/repair/incore.c
@@ -294,13 +294,13 @@ init_bmaps(xfs_mount_t *mp)
 	if (!ag_bmap)
 		do_error(_("couldn't allocate block map btree roots\n"));
 
-	ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(pthread_mutex_t));
+	ag_locks = calloc(mp->m_sb.sb_agcount, sizeof(struct aglock));
 	if (!ag_locks)
 		do_error(_("couldn't allocate block map locks\n"));
 
 	for (i = 0; i < mp->m_sb.sb_agcount; i++)  {
 		btree_init(&ag_bmap[i]);
-		pthread_mutex_init(&ag_locks[i], NULL);
+		pthread_mutex_init(&ag_locks[i].lock, NULL);
 	}
 
 	init_rt_bmap(mp);
diff --git a/repair/scan.c b/repair/scan.c
index b12f48b..f1411a2 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -268,7 +268,7 @@ _("bad back (left) sibling pointer (saw %llu should be NULL (0))\n"
 		agno = XFS_FSB_TO_AGNO(mp, bno);
 		agbno = XFS_FSB_TO_AGBNO(mp, bno);
 
-		pthread_mutex_lock(&ag_locks[agno]);
+		pthread_mutex_lock(&ag_locks[agno].lock);
 		state = get_bmap(agno, agbno);
 		switch (state) {
 		case XR_E_UNKNOWN:
@@ -314,7 +314,7 @@ _("bad state %d, inode %" PRIu64 " bmap block 0x%" PRIx64 "\n"),
 				state, ino, bno);
 			break;
 		}
-		pthread_mutex_unlock(&ag_locks[agno]);
+		pthread_mutex_unlock(&ag_locks[agno].lock);
 	} else  {
 		/*
 		 * attribute fork for realtime files is in the regular
-- 
1.8.4.rc3


* [PATCH 03/10] libxfs: buffer cache hashing is suboptimal
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
  2014-02-24  6:29 ` [PATCH 01/10] repair: translation lookups limit scalability Dave Chinner
  2014-02-24  6:29 ` [PATCH 02/10] repair: per AG locks contend for cachelines Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-24  6:29 ` [PATCH 04/10] repair: limit auto-striding concurrency appropriately Dave Chinner
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The hashkey calculation is very simplistic, and throws away an amount
of entropy that should be folded into the hash. The result is
sub-optimal distribution across the hash tables. For example, with a
default 512 entry table, phase 2 results in this:

Max supported entries = 4096
Max utilized entries = 3970
Active entries = 3970
Hash table size = 512
Hits = 0
Misses = 3970
Hit ratio =  0.00
Hash buckets with   0 entries     12 (  0%)
Hash buckets with   1 entries      3 (  0%)
Hash buckets with   2 entries     10 (  0%)
Hash buckets with   3 entries      2 (  0%)
Hash buckets with   4 entries    129 ( 12%)
Hash buckets with   5 entries     20 (  2%)
Hash buckets with   6 entries     54 (  8%)
Hash buckets with   7 entries     22 (  3%)
Hash buckets with   8 entries    150 ( 30%)
Hash buckets with   9 entries     14 (  3%)
Hash buckets with  10 entries     16 (  4%)
Hash buckets with  11 entries      7 (  1%)
Hash buckets with  12 entries     38 ( 11%)
Hash buckets with  13 entries      5 (  1%)
Hash buckets with  14 entries      4 (  1%)
Hash buckets with  17 entries      1 (  0%)
Hash buckets with  19 entries      1 (  0%)
Hash buckets with  23 entries      1 (  0%)
Hash buckets with >24 entries     23 ( 16%)

Now, given a perfect distribution, we should have 8 entries per
chain. What we end up with is nothing like that.

Unfortunately, for phase 3/4 and others, the number of cached
objects results in the cache being expanded to 256k entries, and
so the stats just give this:

Hits = 262276
Misses = 8130393
Hit ratio =  3.13
Hash buckets with >24 entries    512 (100%)

We can't evaluate the efficiency of the hashing algorithm here.
Let's increase the size of the hash table to 32768 entries and go
from there:

Phase 2:

Hash buckets with   0 entries  31884 (  0%)
Hash buckets with   1 entries     35 (  0%)
Hash buckets with   2 entries     78 (  3%)
Hash buckets with   3 entries     30 (  2%)
Hash buckets with   4 entries    649 ( 65%)
Hash buckets with   5 entries     12 (  1%)
Hash buckets with   6 entries     13 (  1%)
Hash buckets with   8 entries     40 (  8%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  15 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with  24 entries     17 ( 10%)
Hash buckets with >24 entries      6 (  4%)

There's a significant number of collisions given the population is
only 15% of the size of the table itself....

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262090
Hash table size = 32768
Hits = 530844
Misses = 7164575
Hit ratio =  6.90
Hash buckets with   0 entries  11898 (  0%)
....
Hash buckets with  12 entries   5513 ( 25%)
Hash buckets with  13 entries   4188 ( 20%)
Hash buckets with  14 entries   2073 ( 11%)
Hash buckets with  15 entries   1811 ( 10%)
Hash buckets with  16 entries   1994 ( 12%)
....
Hash buckets with >24 entries    339 (  4%)

So, a third of the hash table buckets do not even have any entries
in them, despite more than 7.5 million entries having run through
the cache. Median chain lengths are 12-16 entries where the ideal is
8, and there are lots of collisions on the chains longer than 24
entries...

Phase 6:

Hash buckets with   0 entries  14573 (  0%)
....
Hash buckets with >24 entries   2291 ( 36%)

Ouch. Not a good distribution at all.

Overall runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:35:04  12/06 11:35:04
Phase 2:        12/06 11:35:04  12/06 11:35:07  3 seconds
Phase 3:        12/06 11:35:07  12/06 11:38:27  3 minutes, 20 seconds
Phase 4:        12/06 11:38:27  12/06 11:41:32  3 minutes, 5 seconds
Phase 5:        12/06 11:41:32  12/06 11:41:32
Phase 6:        12/06 11:41:32  12/06 11:42:29  57 seconds
Phase 7:        12/06 11:42:29  12/06 11:42:30  1 second

Total run time: 7 minutes, 26 seconds

Modify the hash to be something more workable - steal the linux
kernel inode hash calculation and try that:

phase 2:

Max supported entries = 262144
Max utilized entries = 3970
Active entries = 3970
Hash table size = 32768
Hits = 0
Misses = 3972
Hit ratio =  0.00
Hash buckets with   0 entries  29055 (  0%)
Hash buckets with   1 entries   3464 ( 87%)
Hash buckets with   2 entries    241 ( 12%)
Hash buckets with   3 entries      8 (  0%)

Close to perfect.

Phase 3:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262118
Hash table size = 32768
Hits = 567900
Misses = 7118749
Hit ratio =  7.39
Hash buckets with   5 entries   1572 (  2%)
Hash buckets with   6 entries   2186 (  5%)
Hash buckets with   7 entries   9217 ( 24%)
Hash buckets with   8 entries   8757 ( 26%)
Hash buckets with   9 entries   6135 ( 21%)
Hash buckets with  10 entries   3166 ( 12%)
Hash buckets with  11 entries   1257 (  5%)
Hash buckets with  12 entries    364 (  1%)
Hash buckets with  13 entries     94 (  0%)
Hash buckets with  14 entries     14 (  0%)
Hash buckets with  15 entries      5 (  0%)

A near-perfect bell curve centered on the optimal distribution
number of 8 entries per chain.

Phase 6:

Hash buckets with   0 entries     24 (  0%)
Hash buckets with   1 entries    190 (  0%)
Hash buckets with   2 entries    571 (  0%)
Hash buckets with   3 entries   1263 (  1%)
Hash buckets with   4 entries   2465 (  3%)
Hash buckets with   5 entries   3399 (  6%)
Hash buckets with   6 entries   4002 (  9%)
Hash buckets with   7 entries   4186 ( 11%)
Hash buckets with   8 entries   3773 ( 11%)
Hash buckets with   9 entries   3240 ( 11%)
Hash buckets with  10 entries   2523 (  9%)
Hash buckets with  11 entries   2074 (  8%)
Hash buckets with  12 entries   1582 (  7%)
Hash buckets with  13 entries   1206 (  5%)
Hash buckets with  14 entries    863 (  4%)
Hash buckets with  15 entries    601 (  3%)
Hash buckets with  16 entries    386 (  2%)
Hash buckets with  17 entries    205 (  1%)
Hash buckets with  18 entries    122 (  0%)
Hash buckets with  19 entries     48 (  0%)
Hash buckets with  20 entries     24 (  0%)
Hash buckets with  21 entries     13 (  0%)
Hash buckets with  22 entries      8 (  0%)

A much wider bell curve than phase 3, but still centered around the
optimal value and far, far better than the distribution of the
current hash calculation. Runtime:

Phase           Start           End             Duration
Phase 1:        12/06 11:47:21  12/06 11:47:21
Phase 2:        12/06 11:47:21  12/06 11:47:23  2 seconds
Phase 3:        12/06 11:47:23  12/06 11:50:50  3 minutes, 27 seconds
Phase 4:        12/06 11:50:50  12/06 11:53:57  3 minutes, 7 seconds
Phase 5:        12/06 11:53:57  12/06 11:53:58  1 second
Phase 6:        12/06 11:53:58  12/06 11:54:51  53 seconds
Phase 7:        12/06 11:54:51  12/06 11:54:52  1 second

Total run time: 7 minutes, 31 seconds

Essentially unchanged - this is somewhat of a "swings and
roundabouts" test here because what it is testing is the cache-miss
overhead.

FWIW, the comparison above is actually a favourable case for the
existing hash calculation. On a less populated filesystem (5m inodes
rather than 50m inodes) the typical hash distribution was:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262094
Hash table size = 32768
Hits = 626228
Misses = 800166
Hit ratio = 43.90
Hash buckets with   0 entries  29274 (  0%)
Hash buckets with   3 entries      1 (  0%)
Hash buckets with   4 entries      1 (  0%)
Hash buckets with   7 entries      1 (  0%)
Hash buckets with   8 entries      1 (  0%)
Hash buckets with   9 entries      1 (  0%)
Hash buckets with  12 entries      1 (  0%)
Hash buckets with  13 entries      1 (  0%)
Hash buckets with  16 entries      2 (  0%)
Hash buckets with  18 entries      1 (  0%)
Hash buckets with  22 entries      1 (  0%)
Hash buckets with >24 entries   3483 ( 99%)

Total and utter crap. Same filesystem, new hash function:

Max supported entries = 262144
Max utilized entries = 262144
Active entries = 262103
Hash table size = 32768
Hits = 673208
Misses = 838265
Hit ratio = 44.54
Hash buckets with   3 entries    558 (  0%)
Hash buckets with   4 entries   1126 (  1%)
Hash buckets with   5 entries   2440 (  4%)
Hash buckets with   6 entries   4249 (  9%)
Hash buckets with   7 entries   5280 ( 14%)
Hash buckets with   8 entries   5598 ( 17%)
Hash buckets with   9 entries   5446 ( 18%)
Hash buckets with  10 entries   3879 ( 14%)
Hash buckets with  11 entries   2405 ( 10%)
Hash buckets with  12 entries   1187 (  5%)
Hash buckets with  13 entries    447 (  2%)
Hash buckets with  14 entries    125 (  0%)
Hash buckets with  15 entries     25 (  0%)
Hash buckets with  16 entries      3 (  0%)

Kinda says it all, really...
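
For anyone who wants to experiment, the new hash lifts straight out
into a standalone test program (a sketch: the constants and hash are
from the patch, the key pattern is made up for the demo, and the
shift of 9 is what libxfs_highbit32() returns for a 512 bucket
table):

	#include <stdio.h>
	#include <stdint.h>

	#define GOLDEN_RATIO_PRIME	0x9e37fffffffc0001UL
	#define CACHE_LINE_SIZE		64

	static unsigned int
	bhash(uint64_t blkno, unsigned int hashsize, unsigned int hashshift)
	{
		uint64_t	tmp;

		tmp = blkno ^ (GOLDEN_RATIO_PRIME + blkno) / CACHE_LINE_SIZE;
		tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
		return tmp % hashsize;
	}

	int
	main(void)
	{
		unsigned int	counts[512] = { 0 };
		uint64_t	b;
		int		i;

		/* 4096 keys into 512 buckets: the ideal is 8 per chain */
		for (b = 0; b < 4096; b++)
			counts[bhash(b * 8, 512, 9)]++;

		/* report any bucket with more than twice the ideal load */
		for (i = 0; i < 512; i++)
			if (counts[i] > 16)
				printf("bucket %d: %u entries\n", i, counts[i]);
		return 0;
	}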

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 include/cache.h |  4 +++-
 libxfs/cache.c  |  7 +++++--
 libxfs/rdwr.c   | 12 ++++++++++--
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/cache.h b/include/cache.h
index 76cb234..0a84c69 100644
--- a/include/cache.h
+++ b/include/cache.h
@@ -66,7 +66,8 @@ typedef void (*cache_walk_t)(struct cache_node *);
 typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
 typedef void (*cache_node_flush_t)(struct cache_node *);
 typedef void (*cache_node_relse_t)(struct cache_node *);
-typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int);
+typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
+					  unsigned int);
 typedef int (*cache_node_compare_t)(struct cache_node *, cache_key_t);
 typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
 
@@ -112,6 +113,7 @@ struct cache {
 	cache_node_compare_t	compare;	/* comparison routine */
 	cache_bulk_relse_t	bulkrelse;	/* bulk release routine */
 	unsigned int		c_hashsize;	/* hash bucket count */
+	unsigned int		c_hashshift;	/* hash key shift */
 	struct cache_hash	*c_hash;	/* hash table buckets */
 	struct cache_mru	c_mrus[CACHE_MAX_PRIORITY + 1];
 	unsigned long long	c_misses;	/* cache misses */
diff --git a/libxfs/cache.c b/libxfs/cache.c
index 84d2860..dc69689 100644
--- a/libxfs/cache.c
+++ b/libxfs/cache.c
@@ -25,6 +25,7 @@
 #include <xfs/platform_defs.h>
 #include <xfs/list.h>
 #include <xfs/cache.h>
+#include <xfs/libxfs.h>
 
 #define CACHE_DEBUG 1
 #undef CACHE_DEBUG
@@ -61,6 +62,7 @@ cache_init(
 	cache->c_misses = 0;
 	cache->c_maxcount = maxcount;
 	cache->c_hashsize = hashsize;
+	cache->c_hashshift = libxfs_highbit32(hashsize);
 	cache->hash = cache_operations->hash;
 	cache->alloc = cache_operations->alloc;
 	cache->flush = cache_operations->flush;
@@ -343,7 +345,7 @@ cache_node_get(
 	int			priority = 0;
 	int			purged = 0;
 
-	hashidx = cache->hash(key, cache->c_hashsize);
+	hashidx = cache->hash(key, cache->c_hashsize, cache->c_hashshift);
 	hash = cache->c_hash + hashidx;
 	head = &hash->ch_list;
 
@@ -515,7 +517,8 @@ cache_node_purge(
 	struct cache_hash *	hash;
 	int			count = -1;
 
-	hash = cache->c_hash + cache->hash(key, cache->c_hashsize);
+	hash = cache->c_hash + cache->hash(key, cache->c_hashsize,
+					   cache->c_hashshift);
 	head = &hash->ch_list;
 	pthread_mutex_lock(&hash->ch_mutex);
 	for (pos = head->next, n = pos->next; pos != head;
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index d0ff15b..1b691fb 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -313,10 +313,18 @@ struct xfs_bufkey {
 	int			nmaps;
 };
 
+/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME	0x9e37fffffffc0001UL
+#define CACHE_LINE_SIZE		64
 static unsigned int
-libxfs_bhash(cache_key_t key, unsigned int hashsize)
+libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
 {
-	return (((unsigned int)((struct xfs_bufkey *)key)->blkno) >> 5) % hashsize;
+	uint64_t	hashval = ((struct xfs_bufkey *)key)->blkno;
+	uint64_t	tmp;
+
+	tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
+	return tmp % hashsize;
 }
 
 static int
-- 
1.8.4.rc3


* [PATCH 04/10] repair: limit auto-striding concurrency appropriately
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
                   ` (2 preceding siblings ...)
  2014-02-24  6:29 ` [PATCH 03/10] libxfs: buffer cache hashing is suboptimal Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-24  6:29 ` [PATCH 05/10] repair: factor out threading setup code Dave Chinner
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

It's possible to have filesystems with hundreds of AGs on systems
with little concurrency and resources. In this case, we can easily
exhaust memory and fail to create threads and have all sorts of
interesting problems.

xfs/250 can cause this to occur, with failures like:

        - agno = 707
        - agno = 692
fatal error -- cannot create worker threads, error = [11] Resource temporarily unavailable

And this:

        - agno = 484
        - agno = 782
failed to create prefetch thread: Resource temporarily unavailable

Because it's trying to create more threads than a poor little 512MB
single CPU ia32 box can handle.

So, limit concurrency to a maximum of numcpus * 8 to prevent this.
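
To make the numbers concrete (hypothetical arithmetic following the
patch logic, not measured output): with 800 AGs on a single CPU box,
max_threads = 1 * 8 = 8. The initial stride of 15 gives
(800 + 14) / 15 = 54 threads, so the stride doubles through 30 (27
threads) and 60 (14 threads) to 120, which gives 7 threads - under
the limit, so repair runs 7 workers covering 120 AGs each instead of
trying to spawn 54.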

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 include/libxfs.h    |  1 +
 libxfs/init.h       |  1 -
 repair/xfs_repair.c | 21 ++++++++++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/libxfs.h b/include/libxfs.h
index bb0369f..f688598 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -144,6 +144,7 @@ extern void	libxfs_device_close (dev_t);
 extern int	libxfs_device_alignment (void);
 extern void	libxfs_report(FILE *);
 extern void	platform_findsizes(char *path, int fd, long long *sz, int *bsz);
+extern int	platform_nproc(void);
 
 /* check or write log footer: specify device, log size in blocks & uuid */
 typedef xfs_caddr_t (libxfs_get_block_t)(xfs_caddr_t, int, void *);
diff --git a/libxfs/init.h b/libxfs/init.h
index f0b8cb6..112febb 100644
--- a/libxfs/init.h
+++ b/libxfs/init.h
@@ -31,7 +31,6 @@ extern char *platform_findrawpath (char *path);
 extern char *platform_findblockpath (char *path);
 extern int platform_direct_blockdev (void);
 extern int platform_align_blockdev (void);
-extern int platform_nproc(void);
 extern unsigned long platform_physmem(void);	/* in kilobytes */
 extern int platform_has_uuid;
 
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index bac334f..6327076 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -629,13 +629,32 @@ main(int argc, char **argv)
 	 * to target these for an increase in thread count. Hence a stride value
 	 * of 15 is chosen to ensure we get at least 2 AGs being scanned at once
 	 * on such filesystems.
+	 *
+	 * Limit the maximum thread count based on the available CPU power that
+	 * is available. If we use too many threads, we might run out of memory
+	 * and CPU power before we run out of IO concurrency. We limit to 8
+	 * threads/CPU as this is enough threads to saturate a CPU on fast
+	 * devices, yet few enough that it will saturate but won't overload slow
+	 * devices.
 	 */
 	if (!ag_stride && glob_agcount >= 16 && do_prefetch)
 		ag_stride = 15;
 
 	if (ag_stride) {
+		int max_threads = platform_nproc() * 8;
+
 		thread_count = (glob_agcount + ag_stride - 1) / ag_stride;
-		thread_init();
+		while (thread_count > max_threads) {
+			ag_stride *= 2;
+			thread_count = (glob_agcount + ag_stride - 1) /
+								ag_stride;
+		}
+		if (thread_count > 0)
+			thread_init();
+		else {
+			thread_count = 1;
+			ag_stride = 0;
+		}
 	}
 
 	if (ag_stride && report_interval) {
-- 
1.8.4.rc3


* [PATCH 05/10] repair: factor out threading setup code
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
                   ` (3 preceding siblings ...)
  2014-02-24  6:29 ` [PATCH 04/10] repair: limit auto-striding concurrency appropriately Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-24 20:43   ` Brian Foster
  2014-02-24  6:29 ` [PATCH 06/10] repair: use a listhead for the dotdot list Dave Chinner
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The same code is repeated in different places to set up
multithreaded prefetching. This can all be factored into a single
implementation.
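
After the factoring, the three per-phase call sites collapse to
one-liners (collated from the diff below for easy reference):

	/* phase 3: striped prefetch, all inode types */
	do_inode_prefetch(mp, ag_stride, process_ag_func, false, false);

	/* phase 4: as phase 3, but skip prefetch entirely if the cache
	 * never overflowed and just run a thread per CPU instead */
	do_inode_prefetch(mp, ag_stride, process_ag_func, true, false);

	/* phase 6: single stride, directory inodes only */
	do_inode_prefetch(mp, 0, traverse_function, true, true);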

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/dinode.h   | 15 ++++++------
 repair/phase3.c   | 40 +++----------------------------
 repair/phase4.c   | 48 +++----------------------------------
 repair/phase6.c   | 22 ++++-------------
 repair/prefetch.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 repair/prefetch.h | 10 ++++++++
 6 files changed, 98 insertions(+), 108 deletions(-)

diff --git a/repair/dinode.h b/repair/dinode.h
index 7521521..5ee51ca 100644
--- a/repair/dinode.h
+++ b/repair/dinode.h
@@ -18,9 +18,8 @@
 #ifndef _XR_DINODE_H
 #define _XR_DINODE_H
 
-#include "prefetch.h"
-
 struct blkmap;
+struct prefetch_args;
 
 int
 verify_agbno(xfs_mount_t	*mp,
@@ -103,12 +102,12 @@ int
 process_uncertain_aginodes(xfs_mount_t		*mp,
 				xfs_agnumber_t	agno);
 void
-process_aginodes(xfs_mount_t	*mp,
-		prefetch_args_t	*pf_args,
-		xfs_agnumber_t	agno,
-		int		check_dirs,
-		int		check_dups,
-		int		extra_attr_check);
+process_aginodes(xfs_mount_t		*mp,
+		struct prefetch_args	*pf_args,
+		xfs_agnumber_t		agno,
+		int			check_dirs,
+		int			check_dups,
+		int			extra_attr_check);
 
 void
 check_uncertain_aginodes(xfs_mount_t	*mp,
diff --git a/repair/phase3.c b/repair/phase3.c
index 3e43938..213d368 100644
--- a/repair/phase3.c
+++ b/repair/phase3.c
@@ -17,6 +17,8 @@
  */
 
 #include <libxfs.h>
+#include "threads.h"
+#include "prefetch.h"
 #include "avl.h"
 #include "globals.h"
 #include "agheader.h"
@@ -24,9 +26,7 @@
 #include "protos.h"
 #include "err_protos.h"
 #include "dinode.h"
-#include "threads.h"
 #include "progress.h"
-#include "prefetch.h"
 
 static void
 process_agi_unlinked(
@@ -82,41 +82,7 @@ static void
 process_ags(
 	xfs_mount_t		*mp)
 {
-	int 			i, j;
-	xfs_agnumber_t 		agno;
-	work_queue_t		*queues;
-	prefetch_args_t		*pf_args[2];
-
-	queues = malloc(thread_count * sizeof(work_queue_t));
-
-	if (ag_stride) {
-		/*
-		 * create one worker thread for each segment of the volume
-		 */
-		for (i = 0, agno = 0; i < thread_count; i++) {
-			create_work_queue(&queues[i], mp, 1);
-			pf_args[0] = NULL;
-			for (j = 0; j < ag_stride && agno < mp->m_sb.sb_agcount;
-					j++, agno++) {
-				pf_args[0] = start_inode_prefetch(agno, 0, pf_args[0]);
-				queue_work(&queues[i], process_ag_func, agno, pf_args[0]);
-			}
-		}
-		/*
-		 * wait for workers to complete
-		 */
-		for (i = 0; i < thread_count; i++)
-			destroy_work_queue(&queues[i]);
-	} else {
-		queues[0].mp = mp;
-		pf_args[0] = start_inode_prefetch(0, 0, NULL);
-		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
-			pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 0,
-					pf_args[i & 1]);
-			process_ag_func(&queues[0], i, pf_args[i & 1]);
-		}
-	}
-	free(queues);
+	do_inode_prefetch(mp, ag_stride, process_ag_func, false, false);
 }
 
 void
diff --git a/repair/phase4.c b/repair/phase4.c
index a822aaa..189eeb9 100644
--- a/repair/phase4.c
+++ b/repair/phase4.c
@@ -17,6 +17,8 @@
  */
 
 #include <libxfs.h>
+#include "threads.h"
+#include "prefetch.h"
 #include "avl.h"
 #include "globals.h"
 #include "agheader.h"
@@ -27,9 +29,7 @@
 #include "bmap.h"
 #include "versions.h"
 #include "dir2.h"
-#include "threads.h"
 #include "progress.h"
-#include "prefetch.h"
 
 
 /*
@@ -150,49 +150,7 @@ static void
 process_ags(
 	xfs_mount_t		*mp)
 {
-	int 			i, j;
-	xfs_agnumber_t 		agno;
-	work_queue_t		*queues;
-	prefetch_args_t		*pf_args[2];
-
-	queues = malloc(thread_count * sizeof(work_queue_t));
-
-	if (!libxfs_bcache_overflowed()) {
-		queues[0].mp = mp;
-		create_work_queue(&queues[0], mp, libxfs_nproc());
-		for (i = 0; i < mp->m_sb.sb_agcount; i++)
-			queue_work(&queues[0], process_ag_func, i, NULL);
-		destroy_work_queue(&queues[0]);
-	} else {
-		if (ag_stride) {
-			/*
-			 * create one worker thread for each segment of the volume
-			 */
-			for (i = 0, agno = 0; i < thread_count; i++) {
-				create_work_queue(&queues[i], mp, 1);
-				pf_args[0] = NULL;
-				for (j = 0; j < ag_stride && agno < mp->m_sb.sb_agcount;
-						j++, agno++) {
-					pf_args[0] = start_inode_prefetch(agno, 0, pf_args[0]);
-					queue_work(&queues[i], process_ag_func, agno, pf_args[0]);
-				}
-			}
-			/*
-			 * wait for workers to complete
-			 */
-			for (i = 0; i < thread_count; i++)
-				destroy_work_queue(&queues[i]);
-		} else {
-			queues[0].mp = mp;
-			pf_args[0] = start_inode_prefetch(0, 0, NULL);
-			for (i = 0; i < mp->m_sb.sb_agcount; i++) {
-				pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
-						0, pf_args[i & 1]);
-				process_ag_func(&queues[0], i, pf_args[i & 1]);
-			}
-		}
-	}
-	free(queues);
+	do_inode_prefetch(mp, ag_stride, process_ag_func, true, false);
 }
 
 
diff --git a/repair/phase6.c b/repair/phase6.c
index cdbf4db..08f78d2 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -17,6 +17,8 @@
  */
 
 #include <libxfs.h>
+#include "threads.h"
+#include "prefetch.h"
 #include "avl.h"
 #include "globals.h"
 #include "agheader.h"
@@ -25,9 +27,7 @@
 #include "protos.h"
 #include "err_protos.h"
 #include "dinode.h"
-#include "prefetch.h"
 #include "progress.h"
-#include "threads.h"
 #include "versions.h"
 
 static struct cred		zerocr;
@@ -3031,23 +3031,9 @@ update_missing_dotdot_entries(
 
 static void
 traverse_ags(
-	xfs_mount_t 		*mp)
+	struct xfs_mount	*mp)
 {
-	int			i;
-	work_queue_t		queue;
-	prefetch_args_t		*pf_args[2];
-
-	/*
-	 * we always do prefetch for phase 6 as it will fill in the gaps
-	 * not read during phase 3 prefetch.
-	 */
-	queue.mp = mp;
-	pf_args[0] = start_inode_prefetch(0, 1, NULL);
-	for (i = 0; i < glob_agcount; i++) {
-		pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
-				pf_args[i & 1]);
-		traverse_function(&queue, i, pf_args[i & 1]);
-	}
+	do_inode_prefetch(mp, 0, traverse_function, true, true);
 }
 
 void
diff --git a/repair/prefetch.c b/repair/prefetch.c
index 984beda..e573e35 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -865,6 +865,77 @@ start_inode_prefetch(
 	return args;
 }
 
+/*
+ * Do inode prefetch in the most optimal way for the context under which repair
+ * has been run.
+ */
+void
+do_inode_prefetch(
+	struct xfs_mount	*mp,
+	int			stride,
+	void			(*func)(struct work_queue *,
+					xfs_agnumber_t, void *),
+	bool			check_cache,
+	bool			dirs_only)
+{
+	int			i, j;
+	xfs_agnumber_t		agno;
+	struct work_queue	queue;
+	struct work_queue	*queues;
+	struct prefetch_args	*pf_args[2];
+
+	/*
+	 * If the previous phases of repair have not overflowed the buffer
+	 * cache, then we don't need to re-read any of the metadata in the
+	 * filesystem - it's all in the cache. In that case, run a thread per
+	 * CPU to maximise parallelism of the queue to be processed.
+	 */
+	if (check_cache && !libxfs_bcache_overflowed()) {
+		queue.mp = mp;
+		create_work_queue(&queue, mp, libxfs_nproc());
+		for (i = 0; i < mp->m_sb.sb_agcount; i++)
+			queue_work(&queue, func, i, NULL);
+		destroy_work_queue(&queue);
+		return;
+	}
+
+	/*
+	 * single threaded behaviour - single prefetch thread, processed
+	 * directly after each AG is queued.
+	 */
+	if (!stride) {
+		queue.mp = mp;
+		pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
+		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
+			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
+					dirs_only, pf_args[i & 1]);
+			func(&queue, i, pf_args[i & 1]);
+		}
+		return;
+	}
+
+	/*
+	 * create one worker thread for each segment of the volume
+	 */
+	queues = malloc(thread_count * sizeof(work_queue_t));
+	for (i = 0, agno = 0; i < thread_count; i++) {
+		create_work_queue(&queues[i], mp, 1);
+		pf_args[0] = NULL;
+		for (j = 0; j < stride && agno < mp->m_sb.sb_agcount;
+				j++, agno++) {
+			pf_args[0] = start_inode_prefetch(agno, dirs_only,
+							  pf_args[0]);
+			queue_work(&queues[i], func, agno, pf_args[0]);
+		}
+	}
+	/*
+	 * wait for workers to complete
+	 */
+	for (i = 0; i < thread_count; i++)
+		destroy_work_queue(&queues[i]);
+	free(queues);
+}
+
 void
 wait_for_inode_prefetch(
 	prefetch_args_t		*args)
diff --git a/repair/prefetch.h b/repair/prefetch.h
index 44a406c..b837752 100644
--- a/repair/prefetch.h
+++ b/repair/prefetch.h
@@ -4,6 +4,7 @@
 #include <semaphore.h>
 #include "incore.h"
 
+struct work_queue;
 
 extern int 	do_prefetch;
 
@@ -41,6 +42,15 @@ start_inode_prefetch(
 	prefetch_args_t		*prev_args);
 
 void
+do_inode_prefetch(
+	struct xfs_mount	*mp,
+	int			stride,
+	void			(*func)(struct work_queue *,
+					xfs_agnumber_t, void *),
+	bool			check_cache,
+	bool			dirs_only);
+
+void
 wait_for_inode_prefetch(
 	prefetch_args_t		*args);
 
-- 
1.8.4.rc3


* [PATCH 06/10] repair: use a listhead for the dotdot list
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
                   ` (4 preceding siblings ...)
  2014-02-24  6:29 ` [PATCH 05/10] repair: factor out threading setup code Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-25 20:03   ` Christoph Hellwig
  2014-02-24  6:29 ` [PATCH 07/10] repair: prefetch runs too far ahead Dave Chinner
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Cleanup suggested by Christoph Hellwig - removes another open coded
list implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase6.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index 08f78d2..c8bfa09 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -43,13 +43,13 @@ static struct xfs_name		xfs_name_dot = {(unsigned char *)".",
  * entries are updated. These must be rebuilt after the initial pass
  */
 typedef struct dotdot_update {
-	struct dotdot_update	*next;
+	struct list_head	list;
 	ino_tree_node_t		*irec;
 	xfs_agnumber_t		agno;
 	int			ino_offset;
 } dotdot_update_t;
 
-static dotdot_update_t		*dotdot_update_list;
+static LIST_HEAD(dotdot_update_list);
 static int			dotdot_update;
 
 static void
@@ -64,12 +64,12 @@ add_dotdot_update(
 		do_error(_("malloc failed add_dotdot_update (%zu bytes)\n"),
 			sizeof(dotdot_update_t));
 
-	dir->next = dotdot_update_list;
+	INIT_LIST_HEAD(&dir->list);
 	dir->irec = irec;
 	dir->agno = agno;
 	dir->ino_offset = ino_offset;
 
-	dotdot_update_list = dir;
+	list_add(&dir->list, &dotdot_update_list);
 }
 
 /*
@@ -3021,9 +3021,10 @@ update_missing_dotdot_entries(
 	 * set dotdot_update flag so processing routines do not count links
 	 */
 	dotdot_update = 1;
-	while (dotdot_update_list) {
-		dir = dotdot_update_list;
-		dotdot_update_list = dir->next;
+	while (!list_empty(&dotdot_update_list)) {
+		dir = list_entry(dotdot_update_list.next, struct dotdot_update,
+				 list);
+		list_del(&dir->list);
 		process_dir_inode(mp, dir->agno, dir->irec, dir->ino_offset);
 		free(dir);
 	}
-- 
1.8.4.rc3


* [PATCH 07/10] repair: prefetch runs too far ahead
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
                   ` (5 preceding siblings ...)
  2014-02-24  6:29 ` [PATCH 06/10] repair: use a listhead for the dotdot list Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-26  1:52   ` Christoph Hellwig
  2014-02-24  6:29 ` [PATCH 08/10] libxfs: remove a couple of locks Dave Chinner
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When trying to work out why a non-crc filesystem took 1m57s to repair
and the same CRC enabled filesystem took 11m35s to repair, I noticed
that there was way too much CRC checking going on. Prefetched buffers
should not be CRCed, yet shortly after starting this began to happen.
perf profiling also showed an awful lot of time being spent on buffer
cache lookups, and the cache profile output indicated that the hit
rate was way below 3%. IOWs, the readahead was getting so far ahead
of the processing that it was thrashing the cache.

That there is a difference in processing rate between CRC and
non-CRC filesystems is not surprising. What is surprising is the
readahead behaviour - it basically just keeps reading ahead until it
has read everything on an AG, and then it goes on to the next AG,
and reads everything on it, and then goes on to the next AG,....

This goes on until it pushes all the buffers the processing threads
need out of the cache, and suddenly they start re-reading from disk
with the various CRC checking verifiers enabled, and we end up going
-really- slow. Yes, threading made up for it a bit, but it's just
wrong.

Basically, the code assumes that IO is going to be slower than
processing, so it doesn't throttle prefetch across AGs to match the
processing rate.

So, to fix this, don't let a prefetch thread get more than a single
AG ahead of its processing thread, just like occurs for single
threaded (i.e. -o ag_stride=-1) operation.
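
The shape of the throttle, collated from the diff below: two
prefetch slots ping-pong so that AG i + 1 is being read while AG i
is processed, and never any further ahead than that:

	pf_args[start_ag & 1] = start_inode_prefetch(start_ag, dirs_only, NULL);
	for (i = start_ag; i < end_ag; i++) {
		/* start reading ahead AG i + 1 (if there is one)... */
		if (i + 1 < end_ag)
			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
						dirs_only, pf_args[i & 1]);
		/* ...while AG i is processed */
		func(work, i, pf_args[i & 1]);
	}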

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/prefetch.c | 81 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 62 insertions(+), 19 deletions(-)

diff --git a/repair/prefetch.c b/repair/prefetch.c
index e573e35..7135d67 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -842,7 +842,7 @@ start_inode_prefetch(
 	 * and not any other associated metadata like directories
 	 */
 
-	max_queue = libxfs_bcache->c_maxcount / thread_count / 8;
+	max_queue = libxfs_bcache->c_maxcount / thread_count / 32;
 	if (XFS_INODE_CLUSTER_SIZE(mp) > mp->m_sb.sb_blocksize)
 		max_queue = max_queue * (XFS_INODE_CLUSTER_SIZE(mp) >>
 				mp->m_sb.sb_blocklog) / XFS_IALLOC_BLOCKS(mp);
@@ -865,6 +865,48 @@ start_inode_prefetch(
 	return args;
 }
 
+void
+prefetch_ag_range(
+	struct work_queue	*work,
+	xfs_agnumber_t		start_ag,
+	xfs_agnumber_t		end_ag,
+	bool			dirs_only,
+	void			(*func)(struct work_queue *,
+					xfs_agnumber_t, void *))
+{
+	int			i;
+	struct prefetch_args	*pf_args[2];
+
+	pf_args[start_ag & 1] = start_inode_prefetch(start_ag, dirs_only, NULL);
+	for (i = start_ag; i < end_ag; i++) {
+		/* Don't prefetch end_ag */
+		if (i + 1 < end_ag)
+			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
+						dirs_only, pf_args[i & 1]);
+		func(work, i, pf_args[i & 1]);
+	}
+}
+
+struct pf_work_args {
+	xfs_agnumber_t	start_ag;
+	xfs_agnumber_t	end_ag;
+	bool		dirs_only;
+	void		(*func)(struct work_queue *, xfs_agnumber_t, void *);
+};
+
+static void
+prefetch_ag_range_work(
+	struct work_queue	*work,
+	xfs_agnumber_t		unused,
+	void			*args)
+{
+	struct pf_work_args *wargs = args;
+
+	prefetch_ag_range(work, wargs->start_ag, wargs->end_ag, 
+			  wargs->dirs_only, wargs->func);
+	free(args);
+}
+
 /*
  * Do inode prefetch in the most optimal way for the context under which repair
  * has been run.
@@ -878,11 +920,9 @@ do_inode_prefetch(
 	bool			check_cache,
 	bool			dirs_only)
 {
-	int			i, j;
-	xfs_agnumber_t		agno;
+	int			i;
 	struct work_queue	queue;
 	struct work_queue	*queues;
-	struct prefetch_args	*pf_args[2];
 
 	/*
 	 * If the previous phases of repair have not overflowed the buffer
@@ -905,12 +945,8 @@ do_inode_prefetch(
 	 */
 	if (!stride) {
 		queue.mp = mp;
-		pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
-		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
-			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
-					dirs_only, pf_args[i & 1]);
-			func(&queue, i, pf_args[i & 1]);
-		}
+		prefetch_ag_range(&queue, 0, mp->m_sb.sb_agcount,
+				  dirs_only, func);
 		return;
 	}
 
@@ -918,20 +954,27 @@ do_inode_prefetch(
 	 * create one worker thread for each segment of the volume
 	 */
 	queues = malloc(thread_count * sizeof(work_queue_t));
-	for (i = 0, agno = 0; i < thread_count; i++) {
+	for (i = 0; i < thread_count; i++) {
+		struct pf_work_args *wargs;
+
+		wargs = malloc(sizeof(struct pf_work_args));
+		wargs->start_ag = i * stride;
+		wargs->end_ag = min((i + 1) * stride,
+				    mp->m_sb.sb_agcount);
+		wargs->dirs_only = dirs_only;
+		wargs->func = func;
+
 		create_work_queue(&queues[i], mp, 1);
-		pf_args[0] = NULL;
-		for (j = 0; j < stride && agno < mp->m_sb.sb_agcount;
-				j++, agno++) {
-			pf_args[0] = start_inode_prefetch(agno, dirs_only,
-							  pf_args[0]);
-			queue_work(&queues[i], func, agno, pf_args[0]);
-		}
+		queue_work(&queues[i], prefetch_ag_range_work, 0, wargs);
+
+		if (wargs->end_ag >= mp->m_sb.sb_agcount)
+			break;
 	}
+
 	/*
 	 * wait for workers to complete
 	 */
-	for (i = 0; i < thread_count; i++)
+	for (; i >= 0; i--)
 		destroy_work_queue(&queues[i]);
 	free(queues);
 }
-- 
1.8.4.rc3


* [PATCH 08/10] libxfs: remove a couple of locks
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
                   ` (6 preceding siblings ...)
  2014-02-24  6:29 ` [PATCH 07/10] repair: prefetch runs too far ahead Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-25 20:05   ` Christoph Hellwig
  2014-02-24  6:29 ` [PATCH 09/10] repair: fix prefetch queue limiting Dave Chinner
  2014-02-24  6:29 ` [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware Dave Chinner
  9 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The libxfs cache has a lot of false scalability about it. I can't
get lookups to scale past about one and a half CPUs, with one of the
key problems being a preponderance of global locks.

For example, just after doing a hash lookup that is careful to take
only the hash chain lock, it takes a global cache lock to update the
cache hit statistic. Scalable? Not at all.

The node priority stuff is protected by the object locks (i.e the
buffer lock) and so it doesn't need locks, either.

This doesn't do very much to improve scalability, but it's a small
start.
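
The hit counter change boils down to this (collated from the diff
below - the unlocked increment can lose the odd update, which is an
acceptable trade for a counter that is only ever a statistic):

	/* before: global lock taken just to bump a counter */
	pthread_mutex_lock(&cache->c_mutex);
	cache->c_hits++;
	pthread_mutex_unlock(&cache->c_mutex);

	/* after: racy but harmless */
	cache->c_hits++;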

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 libxfs/cache.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/libxfs/cache.c b/libxfs/cache.c
index dc69689..9f7f8f4 100644
--- a/libxfs/cache.c
+++ b/libxfs/cache.c
@@ -391,9 +391,7 @@ cache_node_get(
 			pthread_mutex_unlock(&node->cn_mutex);
 			pthread_mutex_unlock(&hash->ch_mutex);
 
-			pthread_mutex_lock(&cache->c_mutex);
 			cache->c_hits++;
-			pthread_mutex_unlock(&cache->c_mutex);
 
 			*nodep = node;
 			return 0;
@@ -482,10 +480,8 @@ cache_node_set_priority(
 	else if (priority > CACHE_MAX_PRIORITY)
 		priority = CACHE_MAX_PRIORITY;
 
-	pthread_mutex_lock(&node->cn_mutex);
 	ASSERT(node->cn_count > 0);
 	node->cn_priority = priority;
-	pthread_mutex_unlock(&node->cn_mutex);
 }
 
 int
@@ -494,9 +490,7 @@ cache_node_get_priority(
 {
 	int			priority;
 
-	pthread_mutex_lock(&node->cn_mutex);
 	priority = node->cn_priority;
-	pthread_mutex_unlock(&node->cn_mutex);
 
 	return priority;
 }
-- 
1.8.4.rc3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 09/10] repair: fix prefetch queue limiting
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
                   ` (7 preceding siblings ...)
  2014-02-24  6:29 ` [PATCH 08/10] libxfs: remove a couple of locks Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-25 20:08   ` Christoph Hellwig
  2014-02-24  6:29 ` [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware Dave Chinner
  9 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The length of the prefetch queue is limited by a semaphore. To avoid
an ABBA deadlock, we only trywait on the semaphore, so if we fail to
get it we can kick the IO queues before sleeping. Unfortunately, the
"need to sleep" detection is just a little wrong - it needs to look
at errno, not err, for the EAGAIN value.

Hence this queue throttling has not been working for a long time.
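
For reference, the POSIX contract the old code tripped over:
sem_trywait() returns -1 on failure and reports the reason via
errno, so EAGAIN never appears in the return value (a minimal
sketch, not repair code):

	#include <semaphore.h>
	#include <errno.h>

	static int
	try_queue_slot(sem_t *ra_count)
	{
		if (sem_trywait(ra_count) == 0)
			return 1;	/* got a queue slot */
		if (errno == EAGAIN)
			return 0;	/* queue full - kick the IO queue */
		return -1;		/* real error */
	}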

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase6.c   | 9 ++++++++-
 repair/prefetch.c | 2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index c8bfa09..ff16120 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -2999,8 +2999,15 @@ traverse_function(
 		if (irec->ino_isa_dir == 0)
 			continue;
 
-		if (pf_args)
+		if (pf_args) {
 			sem_post(&pf_args->ra_count);
+#ifdef XR_PF_TRACE
+			sem_getvalue(&pf_args->ra_count, &i);
+			pftrace(
+		"processing inode chunk %p in AG %d (sem count = %d)",
+				irec, agno, i);
+#endif
+		}
 
 		for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
 			if (inode_isadir(irec, i))
diff --git a/repair/prefetch.c b/repair/prefetch.c
index 7135d67..5158863 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -723,7 +723,7 @@ pf_queuing_worker(
 			irec, args->agno, i);
 #endif
 		err = sem_trywait(&args->ra_count);
-		if (err == EAGAIN) {
+		if (err < 0 && errno == EAGAIN) {
 			/*
 			 * Kick the queue once we have reached the limit;
 			 * without this the threads processing the inodes
-- 
1.8.4.rc3


* [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware
  2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
                   ` (8 preceding siblings ...)
  2014-02-24  6:29 ` [PATCH 09/10] repair: fix prefetch queue limiting Dave Chinner
@ 2014-02-24  6:29 ` Dave Chinner
  2014-02-25 17:25   ` Christoph Hellwig
  2014-02-26  1:44   ` Christoph Hellwig
  9 siblings, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2014-02-24  6:29 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

I'd been trying to track down a behavioural difference between
non-CRC and CRC enabled filesystems that was resulting in non-CRC
filesystems executing prefetch almost 3x faster than CRC filesystems.
After many ratholes, I finally stumbled on the fact that btree
format directories are not being prefetched due to a missing magic
number check: the validation rejects all XFS_BMAP_CRC_MAGIC format
BMBT buffers. Fixing that makes prefetch on CRC enabled filesystems
behave the same as on non-CRC filesystems.

The difference a single line of code can make on a 50 million inode
filesystem with a single threaded, prefetch enabled run is pretty
amazing. It goes from 3,000 IOPS @ 50MB/s to 2,000 IOPS @ 800MB/s,
and the cache hit rate goes from 3% to 49%. The runtimes:

Unpatched:

Phase           Start           End             Duration
Phase 1:        02/21 18:34:12  02/21 18:34:12
Phase 2:        02/21 18:34:12  02/21 18:34:15  3 seconds
Phase 3:        02/21 18:34:15  02/21 18:40:09  5 minutes, 54 seconds
Phase 4:        02/21 18:40:09  02/21 18:46:36  6 minutes, 27 seconds
Phase 5:        02/21 18:46:36  02/21 18:46:37  1 second
Phase 6:        02/21 18:46:37  02/21 18:52:51  6 minutes, 14 seconds
Phase 7:        02/21 18:52:51  02/21 18:52:52  1 second

Total run time: 18 minutes, 40 seconds

Patched:

Phase           Start           End             Duration
Phase 1:        02/21 19:58:23  02/21 19:58:23
Phase 2:        02/21 19:58:23  02/21 19:58:27  4 seconds
Phase 3:        02/21 19:58:27  02/21 19:59:20  53 seconds
Phase 4:        02/21 19:59:20  02/21 20:00:07  47 seconds
Phase 5:        02/21 20:00:07  02/21 20:00:08  1 second
Phase 6:        02/21 20:00:08  02/21 20:00:50  42 seconds
Phase 7:        02/21 20:00:50  02/21 20:00:50

Total run time: 2 minutes, 27 seconds

The runtime difference is no less impressive. Shame it's just a
regression fix. ;)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/prefetch.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/repair/prefetch.c b/repair/prefetch.c
index 5158863..7e11d1d 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -298,7 +298,8 @@ pf_scanfunc_bmap(
 	/*
 	 * do some validation on the block contents
 	 */
-	if ((be32_to_cpu(block->bb_magic) != XFS_BMAP_MAGIC) ||
+	if ((block->bb_magic != cpu_to_be32(XFS_BMAP_MAGIC) &&
+	     block->bb_magic != cpu_to_be32(XFS_BMAP_CRC_MAGIC)) ||
 			(be16_to_cpu(block->bb_level) != level))
 		return 0;
 
-- 
1.8.4.rc3

* Re: [PATCH 01/10] repair: translation lookups limit scalability
  2014-02-24  6:29 ` [PATCH 01/10] repair: translation lookups limit scalability Dave Chinner
@ 2014-02-24 20:42   ` Brian Foster
  2014-02-25 20:01   ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Brian Foster @ 2014-02-24 20:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Feb 24, 2014 at 05:29:20PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> A bit of perf magic showed that scalability was limits to 3-4
> concurrent threads due to contention on a lock inside in something
> called __dcigettext(). That some library somewhere that repair is
> linked against, and it turns out to be inside the translation
> infrastructure to support the _() string mechanism:
> 
> # Samples: 34K of event 'cs'
> # Event count (approx.): 495567
> #
> # Overhead        Command      Shared Object          Symbol
> # ........  .............  .................  ..............
> #
>     60.30%     xfs_repair  [kernel.kallsyms]  [k] __schedule
>                |
>                --- 0x63fffff9c
>                    process_bmbt_reclist_int
>                   |
>                   |--39.95%-- __dcigettext
>                   |          __lll_lock_wait
>                   |          system_call_fastpath
>                   |          SyS_futex
>                   |          do_futex
>                   |          futex_wait
>                   |          futex_wait_queue_me
>                   |          schedule
>                   |          __schedule
>                   |
>                   |--8.91%-- __lll_lock_wait
>                   |          system_call_fastpath
>                   |          SyS_futex
>                   |          do_futex
>                   |          futex_wait
>                   |          futex_wait_queue_me
>                   |          schedule
>                   |          __schedule
>                    --51.13%-- [...]
> 
> Fix this by initialising global variables that hold the translated
> strings at startup, hence avoiding the need to do repeated runtime
> translation of the same strings.
> 
> Runtime of an unpatched xfs_repair is roughly:
> 
>         XFS_REPAIR Summary    Fri Dec  6 11:15:50 2013
> 
> Phase           Start           End             Duration
> Phase 1:        12/06 10:56:21  12/06 10:56:21
> Phase 2:        12/06 10:56:21  12/06 10:56:23  2 seconds
> Phase 3:        12/06 10:56:23  12/06 11:01:31  5 minutes, 8 seconds
> Phase 4:        12/06 11:01:31  12/06 11:07:08  5 minutes, 37 seconds
> Phase 5:        12/06 11:07:08  12/06 11:07:09  1 second
> Phase 6:        12/06 11:07:09  12/06 11:15:49  8 minutes, 40 seconds
> Phase 7:        12/06 11:15:49  12/06 11:15:50  1 second
> 
> Total run time: 19 minutes, 29 seconds
> 
> Patched version:
> 
> Phase           Start           End             Duration
> Phase 1:        12/06 10:36:29  12/06 10:36:29
> Phase 2:        12/06 10:36:29  12/06 10:36:31  2 seconds
> Phase 3:        12/06 10:36:31  12/06 10:40:08  3 minutes, 37 seconds
> Phase 4:        12/06 10:40:08  12/06 10:43:42  3 minutes, 34 seconds
> Phase 5:        12/06 10:43:42  12/06 10:43:42
> Phase 6:        12/06 10:43:42  12/06 10:50:28  6 minutes, 46 seconds
> Phase 7:        12/06 10:50:28  12/06 10:50:29  1 second
> 
> Total run time: 14 minutes
> 
> Big win!
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/dinode.c     | 49 +++++++++++++++++++++++++++++++++++--------------
>  repair/dinode.h     |  3 +++
>  repair/scan.c       |  7 +------
>  repair/xfs_repair.c |  2 ++
>  4 files changed, 41 insertions(+), 20 deletions(-)
> 
> diff --git a/repair/dinode.c b/repair/dinode.c
> index 3115bd0..4953a56 100644
> --- a/repair/dinode.c
> +++ b/repair/dinode.c
> @@ -32,6 +32,37 @@
>  #include "threads.h"
>  
>  /*
> + * gettext lookups for translations of strings use mutexes internally to
> + * the library. Hence when we come through here doing parallel scans in
> + * multiple AGs, then all do concurrent text conversions and serialise
Typo:		    they ?

Reviewed-by: Brian Foster <bfoster@redhat.com>

> + * on the translation string lookups. Let's avoid doing repeated lookups
> + * by making them static variables and only assigning the translation
> + * once.
> + */
> +static char	*forkname_data;
> +static char	*forkname_attr;
> +static char	*ftype_real_time;
> +static char	*ftype_regular;
> +
> +void
> +dinode_bmbt_translation_init(void)
> +{
> +	forkname_data = _("data");
> +	forkname_attr = _("attr");
> +	ftype_real_time = _("real-time");
> +	ftype_regular = _("regular");
> +}
> +
> +char *
> +get_forkname(int whichfork)
> +{
> +
> +	if (whichfork == XFS_DATA_FORK)
> +		return forkname_data;
> +	return forkname_attr;
> +}
> +
> +/*
>   * inode clearing routines
>   */
>  
> @@ -542,7 +573,7 @@ process_bmbt_reclist_int(
>  	xfs_dfiloff_t		op = 0;		/* prev offset */
>  	xfs_dfsbno_t		b;
>  	char			*ftype;
> -	char			*forkname;
> +	char			*forkname = get_forkname(whichfork);
>  	int			i;
>  	int			state;
>  	xfs_agnumber_t		agno;
> @@ -552,15 +583,10 @@ process_bmbt_reclist_int(
>  	xfs_agnumber_t		locked_agno = -1;
>  	int			error = 1;
>  
> -	if (whichfork == XFS_DATA_FORK)
> -		forkname = _("data");
> -	else
> -		forkname = _("attr");
> -
>  	if (type == XR_INO_RTDATA)
> -		ftype = _("real-time");
> +		ftype = ftype_real_time;
>  	else
> -		ftype = _("regular");
> +		ftype = ftype_regular;
>  
>  	for (i = 0; i < *numrecs; i++) {
>  		libxfs_bmbt_disk_get_all(rp + i, &irec);
> @@ -1110,7 +1136,7 @@ process_btinode(
>  	xfs_ino_t		lino;
>  	xfs_bmbt_ptr_t		*pp;
>  	xfs_bmbt_key_t		*pkey;
> -	char			*forkname;
> +	char			*forkname = get_forkname(whichfork);
>  	int			i;
>  	int			level;
>  	int			numrecs;
> @@ -1122,11 +1148,6 @@ process_btinode(
>  	*tot = 0;
>  	*nex = 0;
>  
> -	if (whichfork == XFS_DATA_FORK)
> -		forkname = _("data");
> -	else
> -		forkname = _("attr");
> -
>  	magic = xfs_sb_version_hascrc(&mp->m_sb) ? XFS_BMAP_CRC_MAGIC
>  						 : XFS_BMAP_MAGIC;
>  
> diff --git a/repair/dinode.h b/repair/dinode.h
> index d9197c1..7521521 100644
> --- a/repair/dinode.h
> +++ b/repair/dinode.h
> @@ -127,4 +127,7 @@ get_bmapi(xfs_mount_t		*mp,
>  		xfs_dfiloff_t	bno,
>  		int             whichfork );
>  
> +void dinode_bmbt_translation_init(void);
> +char * get_forkname(int whichfork);
> +
>  #endif /* _XR_DINODE_H */
> diff --git a/repair/scan.c b/repair/scan.c
> index 73b4581..b12f48b 100644
> --- a/repair/scan.c
> +++ b/repair/scan.c
> @@ -171,17 +171,12 @@ scan_bmapbt(
>  	xfs_bmbt_rec_t		*rp;
>  	xfs_dfiloff_t		first_key;
>  	xfs_dfiloff_t		last_key;
> -	char			*forkname;
> +	char			*forkname = get_forkname(whichfork);
>  	int			numrecs;
>  	xfs_agnumber_t		agno;
>  	xfs_agblock_t		agbno;
>  	int			state;
>  
> -	if (whichfork == XFS_DATA_FORK)
> -		forkname = _("data");
> -	else
> -		forkname = _("attr");
> -
>  	/*
>  	 * unlike the ag freeblock btrees, if anything looks wrong
>  	 * in an inode bmap tree, just bail.  it's possible that
> diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> index 9e0502a..bac334f 100644
> --- a/repair/xfs_repair.c
> +++ b/repair/xfs_repair.c
> @@ -29,6 +29,7 @@
>  #include "prefetch.h"
>  #include "threads.h"
>  #include "progress.h"
> +#include "dinode.h"
>  
>  #define	rounddown(x, y)	(((x)/(y))*(y))
>  
> @@ -533,6 +534,7 @@ main(int argc, char **argv)
>  	setlocale(LC_ALL, "");
>  	bindtextdomain(PACKAGE, LOCALEDIR);
>  	textdomain(PACKAGE);
> +	dinode_bmbt_translation_init();
>  
>  	temp_mp = &xfs_m;
>  	setbuf(stdout, NULL);
> -- 
> 1.8.4.rc3
> 

* Re: [PATCH 05/10] repair: factor out threading setup code
  2014-02-24  6:29 ` [PATCH 05/10] repair: factor out threading setup code Dave Chinner
@ 2014-02-24 20:43   ` Brian Foster
  2014-02-24 23:16     ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Brian Foster @ 2014-02-24 20:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Feb 24, 2014 at 05:29:24PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The same code is repeated in different places to set up
> multithreaded prefetching. This can all be factored into a single
> implementation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/dinode.h   | 15 ++++++------
>  repair/phase3.c   | 40 +++----------------------------
>  repair/phase4.c   | 48 +++----------------------------------
>  repair/phase6.c   | 22 ++++-------------
>  repair/prefetch.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  repair/prefetch.h | 10 ++++++++
>  6 files changed, 98 insertions(+), 108 deletions(-)
> 
...
> diff --git a/repair/phase6.c b/repair/phase6.c
> index cdbf4db..08f78d2 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -17,6 +17,8 @@
>   */
>  
>  #include <libxfs.h>
> +#include "threads.h"
> +#include "prefetch.h"
>  #include "avl.h"
>  #include "globals.h"
>  #include "agheader.h"
> @@ -25,9 +27,7 @@
>  #include "protos.h"
>  #include "err_protos.h"
>  #include "dinode.h"
> -#include "prefetch.h"
>  #include "progress.h"
> -#include "threads.h"
>  #include "versions.h"
>  
>  static struct cred		zerocr;
> @@ -3031,23 +3031,9 @@ update_missing_dotdot_entries(
>  
>  static void
>  traverse_ags(
> -	xfs_mount_t 		*mp)
> +	struct xfs_mount	*mp)
>  {
> -	int			i;
> -	work_queue_t		queue;
> -	prefetch_args_t		*pf_args[2];
> -
> -	/*
> -	 * we always do prefetch for phase 6 as it will fill in the gaps
> -	 * not read during phase 3 prefetch.
> -	 */
> -	queue.mp = mp;
> -	pf_args[0] = start_inode_prefetch(0, 1, NULL);
> -	for (i = 0; i < glob_agcount; i++) {
> -		pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> -				pf_args[i & 1]);
> -		traverse_function(&queue, i, pf_args[i & 1]);
> -	}
> +	do_inode_prefetch(mp, 0, traverse_function, true, true);

The cover letter indicates the parallelization of phase 6 was dropped,
but this appears to (conditionally) enable it.

Brian

>  }
>  
>  void
> diff --git a/repair/prefetch.c b/repair/prefetch.c
> index 984beda..e573e35 100644
> --- a/repair/prefetch.c
> +++ b/repair/prefetch.c
> @@ -865,6 +865,77 @@ start_inode_prefetch(
>  	return args;
>  }
>  
> +/*
> + * Do inode prefetch in the most optimal way for the context under which repair
> + * has been run.
> + */
> +void
> +do_inode_prefetch(
> +	struct xfs_mount	*mp,
> +	int			stride,
> +	void			(*func)(struct work_queue *,
> +					xfs_agnumber_t, void *),
> +	bool			check_cache,
> +	bool			dirs_only)
> +{
> +	int			i, j;
> +	xfs_agnumber_t		agno;
> +	struct work_queue	queue;
> +	struct work_queue	*queues;
> +	struct prefetch_args	*pf_args[2];
> +
> +	/*
> +	 * If the previous phases of repair have not overflowed the buffer
> +	 * cache, then we don't need to re-read any of the metadata in the
> +	 * filesystem - it's all in the cache. In that case, run a thread per
> +	 * CPU to maximise parallelism of the queue to be processed.
> +	 */
> +	if (check_cache && !libxfs_bcache_overflowed()) {
> +		queue.mp = mp;
> +		create_work_queue(&queue, mp, libxfs_nproc());
> +		for (i = 0; i < mp->m_sb.sb_agcount; i++)
> +			queue_work(&queue, func, i, NULL);
> +		destroy_work_queue(&queue);
> +		return;
> +	}
> +
> +	/*
> +	 * single threaded behaviour - single prefetch thread, processed
> +	 * directly after each AG is queued.
> +	 */
> +	if (!stride) {
> +		queue.mp = mp;
> +		pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
> +		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
> +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
> +					dirs_only, pf_args[i & 1]);
> +			func(&queue, i, pf_args[i & 1]);
> +		}
> +		return;
> +	}
> +
> +	/*
> +	 * create one worker thread for each segment of the volume
> +	 */
> +	queues = malloc(thread_count * sizeof(work_queue_t));
> +	for (i = 0, agno = 0; i < thread_count; i++) {
> +		create_work_queue(&queues[i], mp, 1);
> +		pf_args[0] = NULL;
> +		for (j = 0; j < stride && agno < mp->m_sb.sb_agcount;
> +				j++, agno++) {
> +			pf_args[0] = start_inode_prefetch(agno, dirs_only,
> +							  pf_args[0]);
> +			queue_work(&queues[i], func, agno, pf_args[0]);
> +		}
> +	}
> +	/*
> +	 * wait for workers to complete
> +	 */
> +	for (i = 0; i < thread_count; i++)
> +		destroy_work_queue(&queues[i]);
> +	free(queues);
> +}
> +
>  void
>  wait_for_inode_prefetch(
>  	prefetch_args_t		*args)
> diff --git a/repair/prefetch.h b/repair/prefetch.h
> index 44a406c..b837752 100644
> --- a/repair/prefetch.h
> +++ b/repair/prefetch.h
> @@ -4,6 +4,7 @@
>  #include <semaphore.h>
>  #include "incore.h"
>  
> +struct work_queue;
>  
>  extern int 	do_prefetch;
>  
> @@ -41,6 +42,15 @@ start_inode_prefetch(
>  	prefetch_args_t		*prev_args);
>  
>  void
> +do_inode_prefetch(
> +	struct xfs_mount	*mp,
> +	int			stride,
> +	void			(*func)(struct work_queue *,
> +					xfs_agnumber_t, void *),
> +	bool			check_cache,
> +	bool			dirs_only);
> +
> +void
>  wait_for_inode_prefetch(
>  	prefetch_args_t		*args);
>  
> -- 
> 1.8.4.rc3
> 

* Re: [PATCH 05/10] repair: factor out threading setup code
  2014-02-24 20:43   ` Brian Foster
@ 2014-02-24 23:16     ` Dave Chinner
  2014-02-24 23:30       ` Brian Foster
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2014-02-24 23:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Mon, Feb 24, 2014 at 03:43:05PM -0500, Brian Foster wrote:
> On Mon, Feb 24, 2014 at 05:29:24PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The same code is repeated in different places to set up
> > multithreaded prefetching. This can all be factored into a single
> > implementation.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
....
> >  static void
> >  traverse_ags(
> > -	xfs_mount_t 		*mp)
> > +	struct xfs_mount	*mp)
> >  {
> > -	int			i;
> > -	work_queue_t		queue;
> > -	prefetch_args_t		*pf_args[2];
> > -
> > -	/*
> > -	 * we always do prefetch for phase 6 as it will fill in the gaps
> > -	 * not read during phase 3 prefetch.
> > -	 */
> > -	queue.mp = mp;
> > -	pf_args[0] = start_inode_prefetch(0, 1, NULL);
> > -	for (i = 0; i < glob_agcount; i++) {
> > -		pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> > -				pf_args[i & 1]);
> > -		traverse_function(&queue, i, pf_args[i & 1]);
> > -	}
> > +	do_inode_prefetch(mp, 0, traverse_function, true, true);
> 
> The cover letter indicates the parallelization of phase 6 was dropped,
> but this appears to (conditionally) enable it.

No, it enables prefetch, it does not enable threading. The second
parameter is "0" which means that do_inode_prefetch() executes the
single threaded prefetch walk like the above code. i.e.:

> > + */
> > +void
> > +do_inode_prefetch(
> > +	struct xfs_mount	*mp,
> > +	int			stride,

stride = 0

> > +	void			(*func)(struct work_queue *,
> > +					xfs_agnumber_t, void *),
> > +	bool			check_cache,
> > +	bool			dirs_only)
> > +{
> > +	int			i, j;
> > +	xfs_agnumber_t		agno;
> > +	struct work_queue	queue;
> > +	struct work_queue	*queues;
> > +	struct prefetch_args	*pf_args[2];
> > +
> > +	/*
> > +	 * If the previous phases of repair have not overflowed the buffer
> > +	 * cache, then we don't need to re-read any of the metadata in the
> > +	 * filesystem - it's all in the cache. In that case, run a thread per
> > +	 * CPU to maximise parallelism of the queue to be processed.
> > +	 */
> > +	if (check_cache && !libxfs_bcache_overflowed()) {
> > +		queue.mp = mp;
> > +		create_work_queue(&queue, mp, libxfs_nproc());
> > +		for (i = 0; i < mp->m_sb.sb_agcount; i++)
> > +			queue_work(&queue, func, i, NULL);
> > +		destroy_work_queue(&queue);
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * single threaded behaviour - single prefetch thread, processed
> > +	 * directly after each AG is queued.
> > +	 */
> > +	if (!stride) {
> > +		queue.mp = mp;
> > +		pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
> > +		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
> > +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
> > +					dirs_only, pf_args[i & 1]);
> > +			func(&queue, i, pf_args[i & 1]);
> > +		}
> > +		return;
> > +	}

So we run this "!stride" code. Hmmmm - maybe you are commenting on
the "check_cache" code? I probably should prevent that from
triggering, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 05/10] repair: factor out threading setup code
  2014-02-24 23:16     ` Dave Chinner
@ 2014-02-24 23:30       ` Brian Foster
  0 siblings, 0 replies; 29+ messages in thread
From: Brian Foster @ 2014-02-24 23:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Feb 25, 2014 at 10:16:20AM +1100, Dave Chinner wrote:
> On Mon, Feb 24, 2014 at 03:43:05PM -0500, Brian Foster wrote:
> > On Mon, Feb 24, 2014 at 05:29:24PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The same code is repeated in different places to set up
> > > multithreaded prefetching. This can all be factored into a single
> > > implementation.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ....
> > >  static void
> > >  traverse_ags(
> > > -	xfs_mount_t 		*mp)
> > > +	struct xfs_mount	*mp)
> > >  {
> > > -	int			i;
> > > -	work_queue_t		queue;
> > > -	prefetch_args_t		*pf_args[2];
> > > -
> > > -	/*
> > > -	 * we always do prefetch for phase 6 as it will fill in the gaps
> > > -	 * not read during phase 3 prefetch.
> > > -	 */
> > > -	queue.mp = mp;
> > > -	pf_args[0] = start_inode_prefetch(0, 1, NULL);
> > > -	for (i = 0; i < glob_agcount; i++) {
> > > -		pf_args[(~i) & 1] = start_inode_prefetch(i + 1, 1,
> > > -				pf_args[i & 1]);
> > > -		traverse_function(&queue, i, pf_args[i & 1]);
> > > -	}
> > > +	do_inode_prefetch(mp, 0, traverse_function, true, true);
> > 
> > The cover letter indicates the parallelization of phase 6 was dropped,
> > but this appears to (conditionally) enable it.
> 
> No, it enables prefetch, it does not enable threading. The second
> parameter is "0" which means that do_inode_prefetch() executes the
> single threaded prefetch walk like the above code. i.e.:
> 
> > > + */
> > > +void
> > > +do_inode_prefetch(
> > > +	struct xfs_mount	*mp,
> > > +	int			stride,
> 
> stride = 0
> 
> > > +	void			(*func)(struct work_queue *,
> > > +					xfs_agnumber_t, void *),
> > > +	bool			check_cache,
> > > +	bool			dirs_only)
> > > +{
> > > +	int			i, j;
> > > +	xfs_agnumber_t		agno;
> > > +	struct work_queue	queue;
> > > +	struct work_queue	*queues;
> > > +	struct prefetch_args	*pf_args[2];
> > > +
> > > +	/*
> > > +	 * If the previous phases of repair have not overflowed the buffer
> > > +	 * cache, then we don't need to re-read any of the metadata in the
> > > +	 * filesystem - it's all in the cache. In that case, run a thread per
> > > +	 * CPU to maximise parallelism of the queue to be processed.
> > > +	 */
> > > +	if (check_cache && !libxfs_bcache_overflowed()) {
> > > +		queue.mp = mp;
> > > +		create_work_queue(&queue, mp, libxfs_nproc());
> > > +		for (i = 0; i < mp->m_sb.sb_agcount; i++)
> > > +			queue_work(&queue, func, i, NULL);
> > > +		destroy_work_queue(&queue);
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > > +	 * single threaded behaviour - single prefetch thread, processed
> > > +	 * directly after each AG is queued.
> > > +	 */
> > > +	if (!stride) {
> > > +		queue.mp = mp;
> > > +		pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
> > > +		for (i = 0; i < mp->m_sb.sb_agcount; i++) {
> > > +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
> > > +					dirs_only, pf_args[i & 1]);
> > > +			func(&queue, i, pf_args[i & 1]);
> > > +		}
> > > +		return;
> > > +	}
> 
> So we run this "!stride" code. Hmmmm - maybe you are commenting on
> the "check_cache" code? I probably should prevent that from
> triggering, too.
> 

Sorry, I could have been more clear on that. Yes, I'm referring
specifically to setting check_cache to true.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware
  2014-02-24  6:29 ` [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware Dave Chinner
@ 2014-02-25 17:25   ` Christoph Hellwig
  2014-02-25 23:51     ` Dave Chinner
  2014-02-26  1:44   ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2014-02-25 17:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

>  	/*
>  	 * do some validation on the block contents
>  	 */
> -	if ((be32_to_cpu(block->bb_magic) != XFS_BMAP_MAGIC) ||
> +	if ((block->bb_magic != cpu_to_be32(XFS_BMAP_MAGIC) &&
> +	     block->bb_magic != cpu_to_be32(XFS_BMAP_CRC_MAGIC)) ||
>  			(be16_to_cpu(block->bb_level) != level))

Seems like this should be factored into a well-documented helper
function.
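
Something along these lines, say (an untested sketch - the helper
name is invented here):

	/* does this buffer carry a plausible BMBT magic number? */
	static inline bool
	xfs_bmbt_magic_ok(
		struct xfs_btree_block	*block)
	{
		return block->bb_magic == cpu_to_be32(XFS_BMAP_MAGIC) ||
		       block->bb_magic == cpu_to_be32(XFS_BMAP_CRC_MAGIC);
	}

so the check in pf_scanfunc_bmap() would read:

	if (!xfs_bmbt_magic_ok(block) ||
	    be16_to_cpu(block->bb_level) != level)
		return 0;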

* Re: [PATCH 01/10] repair: translation lookups limit scalability
  2014-02-24  6:29 ` [PATCH 01/10] repair: translation lookups limit scalability Dave Chinner
  2014-02-24 20:42   ` Brian Foster
@ 2014-02-25 20:01   ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2014-02-25 20:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 06/10] repair: use a listhead for the dotdot list
  2014-02-24  6:29 ` [PATCH 06/10] repair: use a listhead for the dotdot list Dave Chinner
@ 2014-02-25 20:03   ` Christoph Hellwig
  2014-02-27  2:06     ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2014-02-25 20:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Feb 24, 2014 at 05:29:25PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Cleanup suggested by Christoph Hellwig - removes another open coded
> list implementation.

Looks good.  Question for both the old and new code:  any good reason to
do the dotdot updates LIFO?  If yes it might be worth adding a comment
about that fact.

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 08/10] libxfs: remove a couple of locks
  2014-02-24  6:29 ` [PATCH 08/10] libxfs: remove a couple of locks Dave Chinner
@ 2014-02-25 20:05   ` Christoph Hellwig
  2014-02-25 23:43     ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2014-02-25 20:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

So what's protecting the cache hits statistics now?

Longer term it might make sense to just port the XFS buffercache from
the kernel to libxfs..

* Re: [PATCH 09/10] repair: fix prefetch queue limiting
  2014-02-24  6:29 ` [PATCH 09/10] repair: fix prefetch queue limiting Dave Chinner
@ 2014-02-25 20:08   ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2014-02-25 20:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Great - most other POSIX realtime functions actually do return the
errno value, but this one doesn't.  This would really be a useful
check for some sort of semantic checker tool in userspace.

This bug has been there since day 1 of the sem_trywait call btw, so it
never really worked.

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 08/10] libxfs: remove a couple of locks
  2014-02-25 20:05   ` Christoph Hellwig
@ 2014-02-25 23:43     ` Dave Chinner
  2014-02-26  1:54       ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2014-02-25 23:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Feb 25, 2014 at 12:05:58PM -0800, Christoph Hellwig wrote:
> So what's protecting the cache hits statistics now?

Nothing. And I don't care because the statistic is meaningless from
the point of view of tuning xfs_repair performance.

> Longer term it might make sense to just port the XFS buffercache from
> the kernel to libxfs..

*nod*

We badly need to separate the buffer cache into per-AG caches
because most of the per-thread operations are isolated to a single
AG and so a global hash simply doesn't scale from a lock contention
or a table sizing point of view....
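
i.e. something shaped roughly like this - purely a sketch of the
idea, none of these names exist in libxfs today:

	/* hypothetical per-AG buffer cache - illustration only */
	struct ag_bcache {
		pthread_mutex_t		agc_lock;	/* per-AG, not global */
		struct cache_node	**agc_hash;	/* hash sized for one AG */
		unsigned int		agc_hashsize;
	};

	/* one cache per AG rather than one global hash for everything */
	struct ag_bcache		*ag_bcaches;	/* sb_agcount entries */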

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware
  2014-02-25 17:25   ` Christoph Hellwig
@ 2014-02-25 23:51     ` Dave Chinner
  2014-02-26  1:40       ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2014-02-25 23:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Feb 25, 2014 at 09:25:18AM -0800, Christoph Hellwig wrote:
> >  	/*
> >  	 * do some validation on the block contents
> >  	 */
> > -	if ((be32_to_cpu(block->bb_magic) != XFS_BMAP_MAGIC) ||
> > +	if ((block->bb_magic != cpu_to_be32(XFS_BMAP_MAGIC) &&
> > +	     block->bb_magic != cpu_to_be32(XFS_BMAP_CRC_MAGIC)) ||
> >  			(be16_to_cpu(block->bb_level) != level))
> 
> Seems like this should be factored into a well-documented helper
> function.

Sure, but it's way outside the scope of fixing this bug.
We'd need to start with the kernel code, then port it to userspace
and propagate it into all the utilities. Patches welcome ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware
  2014-02-25 23:51     ` Dave Chinner
@ 2014-02-26  1:40       ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2014-02-26  1:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Wed, Feb 26, 2014 at 10:51:12AM +1100, Dave Chinner wrote:
> Sure, but it's way outside the scope of fixing this bug.
> We'd need to start with the kernel code, then port it is userspace
> and propagate it into all the utilities. Patches welcome ;)

Just doing that repair-local would be a start.

* Re: [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware
  2014-02-24  6:29 ` [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware Dave Chinner
  2014-02-25 17:25   ` Christoph Hellwig
@ 2014-02-26  1:44   ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2014-02-26  1:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks fine even if I'd prefer a broader fix:


Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 07/10] repair: prefetch runs too far ahead
  2014-02-24  6:29 ` [PATCH 07/10] repair: prefetch runs too far ahead Dave Chinner
@ 2014-02-26  1:52   ` Christoph Hellwig
  2014-02-26  5:51     ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2014-02-26  1:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Feb 24, 2014 at 05:29:26PM +1100, Dave Chinner wrote:
> @@ -842,7 +842,7 @@ start_inode_prefetch(
>  	 * and not any other associated metadata like directories
>  	 */
>  
> -	max_queue = libxfs_bcache->c_maxcount / thread_count / 8;
> +	max_queue = libxfs_bcache->c_maxcount / thread_count / 32;

I can't correlate this to anything mentioned in the changelog.
Also if you're touching it anyway it might be a good idea to document
the magic number here.
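
e.g. a comment along these lines - the rationale here is a guess from
the cover letter, not something taken from the code:

	/*
	 * Bound the prefetch queue to a fraction of the buffer cache
	 * per prefetch thread so readahead cannot cycle buffers out
	 * of the cache before the processing threads consume them.
	 */
	max_queue = libxfs_bcache->c_maxcount / thread_count / 8;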

> +void
> +prefetch_ag_range(
> +	struct work_queue	*work,
> +	xfs_agnumber_t		start_ag,
> +	xfs_agnumber_t		end_ag,
> +	bool			dirs_only,
> +	void			(*func)(struct work_queue *,
> +					xfs_agnumber_t, void *))
> +{
> +	int			i;
> +	struct prefetch_args	*pf_args[2];
> +
> +	pf_args[start_ag & 1] = start_inode_prefetch(start_ag, dirs_only, NULL);
> +	for (i = start_ag; i < end_ag; i++) {
> +		/* Don't prefetch end_ag */
> +		if (i + 1 < end_ag)
> +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
> +						dirs_only, pf_args[i & 1]);
> +		func(work, i, pf_args[i & 1]);
> +	}
> +}

This seems to largely duplicate the common code added in patch 5.
Having _range variants of those that the non-range ones wrap with 0 and
mp->m_sb.sb_agcount as default parameters would avoid that duplication.

* Re: [PATCH 08/10] libxfs: remove a couple of locks
  2014-02-25 23:43     ` Dave Chinner
@ 2014-02-26  1:54       ` Christoph Hellwig
  2014-02-26  5:53         ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2014-02-26  1:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Feb 26, 2014 at 10:43:57AM +1100, Dave Chinner wrote:
> On Tue, Feb 25, 2014 at 12:05:58PM -0800, Christoph Hellwig wrote:
> > So what's protecting the cache hits statistics now?
> 
> Nothing. And I don't care because the statistic is meaningless from
> the point of view of tuning xfs_repair performance.

In that case I think it's better to just remove c_hits instead of
updating it in a racy way.
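
(Or keep it for debugging and bump it locklessly, e.g. with the gcc
atomic builtin - just a sketch:

	__sync_fetch_and_add(&cache->c_hits, 1);

which keeps the count accurate without reintroducing the global
lock.)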

* Re: [PATCH 07/10] repair: prefetch runs too far ahead
  2014-02-26  1:52   ` Christoph Hellwig
@ 2014-02-26  5:51     ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2014-02-26  5:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Feb 25, 2014 at 05:52:50PM -0800, Christoph Hellwig wrote:
> On Mon, Feb 24, 2014 at 05:29:26PM +1100, Dave Chinner wrote:
> > @@ -842,7 +842,7 @@ start_inode_prefetch(
> >  	 * and not any other associated metadata like directories
> >  	 */
> >  
> > -	max_queue = libxfs_bcache->c_maxcount / thread_count / 8;
> > +	max_queue = libxfs_bcache->c_maxcount / thread_count / 32;
> 
> I can't correlate this to anything mentioned in the changelog.
> Also if you're touching it anyway it might be a good idea to document
> the magic number here.

I was fiddling with the magic number to see if it affected the
readahead behaviour (it didn't) and forgot to set it back to the
original value. Will fix.

> 
> > +void
> > +prefetch_ag_range(
> > +	struct work_queue	*work,
> > +	xfs_agnumber_t		start_ag,
> > +	xfs_agnumber_t		end_ag,
> > +	bool			dirs_only,
> > +	void			(*func)(struct work_queue *,
> > +					xfs_agnumber_t, void *))
> > +{
> > +	int			i;
> > +	struct prefetch_args	*pf_args[2];
> > +
> > +	pf_args[start_ag & 1] = start_inode_prefetch(start_ag, dirs_only, NULL);
> > +	for (i = start_ag; i < end_ag; i++) {
> > +		/* Don't prefetch end_ag */
> > +		if (i + 1 < end_ag)
> > +			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
> > +						dirs_only, pf_args[i & 1]);
> > +		func(work, i, pf_args[i & 1]);
> > +	}
> > +}
> 
> This seems to largely duplicate the common code added in patch 5.
> Having _range variants of those that the non-range ones wrap with 0 and
> mp->m_sb.sb_agcount as default parameters would avoid that duplication.

Actually, that's pretty much what this patch does in this hunk:

@@ -905,12 +945,8 @@ do_inode_prefetch(
         */
        if (!stride) {
                queue.mp = mp;
-               pf_args[0] = start_inode_prefetch(0, dirs_only, NULL);
-               for (i = 0; i < mp->m_sb.sb_agcount; i++) {
-                       pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
-                                       dirs_only, pf_args[i & 1]);
-                       func(&queue, i, pf_args[i & 1]);
-               }
+               prefetch_ag_range(&queue, 0, mp->m_sb.sb_agcount,
+                                 dirs_only, func);
                return;
        }


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 08/10] libxfs: remove a couple of locks
  2014-02-26  1:54       ` Christoph Hellwig
@ 2014-02-26  5:53         ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2014-02-26  5:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Feb 25, 2014 at 05:54:32PM -0800, Christoph Hellwig wrote:
> On Wed, Feb 26, 2014 at 10:43:57AM +1100, Dave Chinner wrote:
> > On Tue, Feb 25, 2014 at 12:05:58PM -0800, Christoph Hellwig wrote:
> > > So what's protecting the cache hits statistics now?
> > 
> > Nothing. And I don't care because the statistic is meaningless from
> > the point of view of tuning xfs_repair performance.
> 
> In that case I think it's better to just remove c_hits instead of
> updating it in a racy way.

I didn't say it's completely meaningless - it's handy for debugging,
just not useful to end users and so accuracy isn't important enough
to use a global lock on.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 06/10] repair: use a listhead for the dotdot list
  2014-02-25 20:03   ` Christoph Hellwig
@ 2014-02-27  2:06     ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2014-02-27  2:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Feb 25, 2014 at 12:03:32PM -0800, Christoph Hellwig wrote:
> On Mon, Feb 24, 2014 at 05:29:25PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Cleanup suggested by Christoph Hellwig - removes another open coded
> > list implementation.
> 
> Looks good.  Question for both the old and new code:  any good reason to
> do the dotdot updates LIFO?  If yes it might be worth adding a comment
> about that fact.

No, it didn't even occur to me that it might matter. I'll change it
back to FIFO.
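
With a listhead that's just the difference between these two at queue
time (a sketch - the field names here are made up):

	list_add(&dd->du_list, &dotdot_update_list);      /* LIFO */
	list_add_tail(&dd->du_list, &dotdot_update_list); /* FIFO */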

> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 29+ messages
2014-02-24  6:29 [PATCH 00/10, v2] repair: scalability and prefetch fixes Dave Chinner
2014-02-24  6:29 ` [PATCH 01/10] repair: translation lookups limit scalability Dave Chinner
2014-02-24 20:42   ` Brian Foster
2014-02-25 20:01   ` Christoph Hellwig
2014-02-24  6:29 ` [PATCH 02/10] repair: per AG locks contend for cachelines Dave Chinner
2014-02-24  6:29 ` [PATCH 03/10] libxfs: buffer cache hashing is suboptimal Dave Chinner
2014-02-24  6:29 ` [PATCH 04/10] repair: limit auto-striding concurrency apprpriately Dave Chinner
2014-02-24  6:29 ` [PATCH 05/10] repair: factor out threading setup code Dave Chinner
2014-02-24 20:43   ` Brian Foster
2014-02-24 23:16     ` Dave Chinner
2014-02-24 23:30       ` Brian Foster
2014-02-24  6:29 ` [PATCH 06/10] repair: use a listhead for the dotdot list Dave Chinner
2014-02-25 20:03   ` Christoph Hellwig
2014-02-27  2:06     ` Dave Chinner
2014-02-24  6:29 ` [PATCH 07/10] repair: prefetch runs too far ahead Dave Chinner
2014-02-26  1:52   ` Christoph Hellwig
2014-02-26  5:51     ` Dave Chinner
2014-02-24  6:29 ` [PATCH 08/10] libxfs: remove a couple of locks Dave Chinner
2014-02-25 20:05   ` Christoph Hellwig
2014-02-25 23:43     ` Dave Chinner
2014-02-26  1:54       ` Christoph Hellwig
2014-02-26  5:53         ` Dave Chinner
2014-02-24  6:29 ` [PATCH 09/10] repair: fix prefetch queue limiting Dave Chinner
2014-02-25 20:08   ` Christoph Hellwig
2014-02-24  6:29 ` [PATCH 10/10] repair: BMBT prefetch needs to be CRC aware Dave Chinner
2014-02-25 17:25   ` Christoph Hellwig
2014-02-25 23:51     ` Dave Chinner
2014-02-26  1:40       ` Christoph Hellwig
2014-02-26  1:44   ` Christoph Hellwig
