* [PATCH 0/7] xfs_repair: scale to 150,000 iops
@ 2018-10-30 11:20 Dave Chinner
  2018-10-30 11:20 ` [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption" Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 11:20 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

This patchset enables me to successfully repair a rather large
metadump image (~500GB of metadata) that was provided to us because
it crashed xfs_repair. Darrick and Eric have already posted patches
to fix the crash bugs, and this series is built on top of them.
Those patches are:

	libxfs: add missing agfl free deferred op type
	xfs_repair: initialize realloced bplist in longform_dir2_entry_check
	xfs_repair: continue after xfs_bunmapi deadlock avoidance

This series starts with another couple of regression fixes - the
revert is for a change in 4.18, and the unlinked list issue is only
in the 4.19 dev tree.

The third patch prevents a problem I had during development that
resulted in blowing the buffer cache size out to > 100GB RAM and
causing xfs_repair to be OOM-killed on my 128GB RAM machine. If
there was a sudden prefetch demand or a set of queues were allowed
to grow very deep (e.g. lots of AGs all starting prefetch at the
same time) then they would all race to expand the cache, causing
multiple expansions within a few milliseconds. Only one expansion
was needed, so I rate limited it.
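
The fix is just a timestamp check under the existing cache mutex -
roughly this sketch (the full change is patch 3 below):

static void
cache_expand(
	struct cache	*cache)
{
	time_t		now;

	pthread_mutex_lock(&cache->c_mutex);
	time(&now);
	/* only allow one doubling per 10 second window */
	if (now - cache->expand_time > 10) {
		cache->c_maxcount *= 2;
		cache->expand_time = now;
	}
	pthread_mutex_unlock(&cache->c_mutex);
}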

The fourth patch is what actually solved the runaway queueing
problems I was having, but I figured it was still a good idea to
prevent unnecessary cache growth. It allowed me to bound how much
work was queued internally to an AG in phase 6, so the queue didn't
suck up the entire AG's readahead in one go....

Patches 5 and 6 protect objects/structures that have concurrent
access in phase 6 - the bad inode list and the inode chunk records
in the per-AG AVL trees. The trees themselves aren't modified in
phase 6, so they don't need any additional concurrency protection.

Patch 7 enables concurrency in phase 6. Firstly it parallelises
across AGs like phases 3 and 4, but because phase 6 is largely CPU
bound processing directories one at a time, it also uses a workqueue
to parallelise processing of individual inode chunk records. This
is convenient and easy to do, and is very effective. Even if you
have the IO capability, phase 6 will still run as a CPU-bound
workload - I watched it use 30 of 32 CPUs for 15 minutes before the
long tail of large directories slowly burnt down.
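
Structurally, the per-AG worker in phase 6 ends up looking something
like this (condensed from patch 7 below; prefetch wiring and logging
trimmed):

/*
 * Each AG worker creates a small bounded workqueue and feeds it one
 * inode chunk record at a time. Directory processing parallelises
 * within the AG, while the bounded queue throttles how far prefetch
 * can run ahead of processing.
 */
static void
traverse_function(
	struct workqueue	*wq,
	xfs_agnumber_t		agno,
	void			*arg)
{
	struct xfs_mount	*mp = wq->wq_ctx;
	struct ino_tree_node	*irec;
	struct workqueue	lwq;

	workqueue_create_bound(&lwq, mp, ag_stride, 1000);

	for (irec = findfirst_inode_rec(agno); irec; irec = next_ino_rec(irec)) {
		if (irec->ino_isa_dir == 0)
			continue;
		/* blocks when the queue is full, throttling the producer */
		queue_work(&lwq, do_dir_inode, agno, irec);
	}
	destroy_work_queue(&lwq);
}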

While burning all that CPU, it also sustained about 160k IOPS from
the SSDs. Phases 3 and 4 also ran at about 130-150k IOPS, but that is
about the current limit of the prefetching and IO infrastructure we
have in xfsprogs.

Comments, thoughts, ideas, testing all welcome!

Cheers,

Dave.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption"
  2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
@ 2018-10-30 11:20 ` Dave Chinner
  2018-10-30 17:20   ` Darrick J. Wong
  2018-10-30 11:20 ` [PATCH 2/7] repair: don't dirty inodes which are not unlinked Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 11:20 UTC (permalink / raw)
  To: linux-xfs

This reverts commit 67a79e2cc9320aaf269cd00e9c8d16892931886d.

A root LEAFN block can exist in a directory. When we convert from
leaf format (LEAF1 - internal free list) to node format (LEAFN -
external free list) the only change to the single root leaf block is
that its magic number is changed from LEAF1 to LEAFN.

We don't actually end up with DA nodes in the tree until the LEAFN
node is split, and that requires a couple more dirents to be added
to the directory to fill the LEAFN block up completely. Then it will
split and create a DA node root block pointing to multiple LEAFN
leaf blocks.

Hence restore the old behaviour where we skip the DA node tree
rebuild if there is a LEAFN root block found as there is no tree to
rebuild.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/dir2.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/repair/dir2.c b/repair/dir2.c
index 3374ae722bf9..ba5763ed3d26 100644
--- a/repair/dir2.c
+++ b/repair/dir2.c
@@ -1242,11 +1242,11 @@ process_node_dir2(
 		return 1;
 
 	/*
-	 * Directories with a root marked XFS_DIR2_LEAFN_MAGIC are corrupt
+	 * Skip directories with a root marked XFS_DIR2_LEAFN_MAGIC
 	 */
 	if (bno == 0) {
-		err_release_da_cursor(mp, &da_cursor, 0);
-		return 1;
+		release_da_cursor(mp, &da_cursor, 0);
+		return 0;
 	} else {
 		/*
 		 * Now pass cursor and bno into leaf-block processing routine.
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/7] repair: don't dirty inodes which are not unlinked
  2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
  2018-10-30 11:20 ` [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption" Dave Chinner
@ 2018-10-30 11:20 ` Dave Chinner
  2018-10-30 17:26   ` Darrick J. Wong
  2018-10-30 20:03   ` Eric Sandeen
  2018-10-30 11:20 ` [PATCH 3/7] cache: prevent expansion races Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 11:20 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

I noticed phase 4 writing back lots of inode buffers during recent
testing. The recent rework of clear_dinode() in commit 0724d0f4cb53
("xfs_repair: clear_dinode should simply clear, not check contents")
accidentally caught a call to clear_dinode_unlinked() as well,
resulting in all inodes being marked dirty whether they needed
updating or not.

Fix it by reverting the erroneous hunk and adding warnings so that
this corruption is no longer silently fixed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/dinode.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/repair/dinode.c b/repair/dinode.c
index 379f85cf1268..90400128d4bb 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -2675,9 +2675,15 @@ _("bad (negative) size %" PRId64 " on inode %" PRIu64 "\n"),
 	 * we're going to find.  check_dups is set to 1 only during
 	 * phase 4.  Ugly.
 	 */
-	if (check_dups && !no_modify) {
-		clear_dinode_unlinked(mp, dino);
-		*dirty += 1;
+	if (check_dups && clear_dinode_unlinked(mp, dino)) {
+		if (no_modify) {
+			do_warn(
+	_("Would clear unlinked_next in inode %" PRIu64 "\n"), lino);
+		} else  {
+			do_warn(
+	_("Cleared unlinked_next in inode %" PRIu64 "\n"), lino);
+			*dirty += 1;
+		}
 	}
 
 	/* set type and map type info */
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/7] cache: prevent expansion races
  2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
  2018-10-30 11:20 ` [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption" Dave Chinner
  2018-10-30 11:20 ` [PATCH 2/7] repair: don't dirty inodes which are not unlinked Dave Chinner
@ 2018-10-30 11:20 ` Dave Chinner
  2018-10-30 17:39   ` Darrick J. Wong
  2018-10-31 17:13   ` Brian Foster
  2018-10-30 11:20 ` [PATCH 4/7] workqueue: bound maximum queue depth Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 11:20 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When a sudden buffer cache growth demand occurs from multiple
concurrent sources, they can all fail shaking the cache at the same
time and expand the cache. This results in the cache doubling in
size multiple times when only a single expansion was really
necessary. The resultant massive cache can cause repair to OOM as
the cache will grow much larger than memory can actually hold.

Prevent this sudden and unnecessary expansion by rate limiting cache
growth to one size increase every ten seconds. This is sufficient
to prevent racing expansion calls from doing the wrong thing, but
not so long that it prevents necessary cache growth when the cache
is undersized.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/cache.h |  1 +
 libxfs/cache.c  | 17 ++++++++++++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/cache.h b/include/cache.h
index 552b92489e46..1b774619f7a0 100644
--- a/include/cache.h
+++ b/include/cache.h
@@ -114,6 +114,7 @@ struct cache {
 	unsigned long long	c_misses;	/* cache misses */
 	unsigned long long	c_hits;		/* cache hits */
 	unsigned int 		c_max;		/* max nodes ever used */
+	time_t			expand_time;	/* time of last expansion */
 };
 
 struct cache *cache_init(int, unsigned int, struct cache_operations *);
diff --git a/libxfs/cache.c b/libxfs/cache.c
index 139c7c1b9e71..e10df395e60e 100644
--- a/libxfs/cache.c
+++ b/libxfs/cache.c
@@ -62,6 +62,7 @@ cache_init(
 	cache->bulkrelse = cache_operations->bulkrelse ?
 		cache_operations->bulkrelse : cache_generic_bulkrelse;
 	pthread_mutex_init(&cache->c_mutex, NULL);
+	time(&cache->expand_time);
 
 	for (i = 0; i < hashsize; i++) {
 		list_head_init(&cache->c_hash[i].ch_list);
@@ -77,15 +78,25 @@ cache_init(
 	return cache;
 }
 
+/*
+ * rate limit expansion so several concurrent shakes don't instantly all
+ * expand the cache multiple times and blow repair to OOM death.
+ */
 static void
 cache_expand(
-	struct cache *		cache)
+	struct cache	*cache)
 {
+	time_t		now;
+
 	pthread_mutex_lock(&cache->c_mutex);
+	time(&now);
+	if (now - cache->expand_time > 10) {
 #ifdef CACHE_DEBUG
-	fprintf(stderr, "doubling cache size to %d\n", 2 * cache->c_maxcount);
+		fprintf(stderr, "doubling cache size to %d\n", 2 * cache->c_maxcount);
 #endif
-	cache->c_maxcount *= 2;
+		cache->c_maxcount *= 2;
+		cache->expand_time = now;
+	}
 	pthread_mutex_unlock(&cache->c_mutex);
 }
 
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 4/7] workqueue: bound maximum queue depth
  2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
                   ` (2 preceding siblings ...)
  2018-10-30 11:20 ` [PATCH 3/7] cache: prevent expansion races Dave Chinner
@ 2018-10-30 11:20 ` Dave Chinner
  2018-10-30 17:58   ` Darrick J. Wong
  2018-10-30 11:20 ` [PATCH 5/7] repair: Protect bad inode list with mutex Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 11:20 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Existing users of workqueues have bound maximum queue depths in
their external algorithms (e.g. prefetch counts). For parallelising
work that doesn't have an external bound, allow workqueues to
throttle incoming requests at a maximum bound. Bounded workqueues
also need to distribute work over all worker threads themselves as
there is no external bounding or worker function throttling
provided.

Existing callers are not throttled and retain direct control of
worker threads, only users of the new create interface will be
throttled and concurrency managed.
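
For illustration only, a caller of the bounded interface looks
something like the sketch below - process_item() and run_bounded()
are made-up names, and the real user is the phase 6 rework later in
this series:

/* worker callback: runs in one of the nr_workers threads */
static void
process_item(
	struct workqueue	*wq,
	uint32_t		index,
	void			*arg)
{
	/* ... per-item processing ... */
}

static void
run_bounded(
	void			*ctx,
	void			**items,
	unsigned int		nr_items)
{
	struct workqueue	wq;
	unsigned int		i;

	/* 8 workers, at most 1000 queued items at any time */
	workqueue_create_bound(&wq, ctx, 8, 1000);

	/*
	 * workqueue_add() now blocks once 1000 items are pending, so a
	 * fast producer is throttled to the rate the workers drain the
	 * queue instead of queueing everything up front.
	 */
	for (i = 0; i < nr_items; i++)
		workqueue_add(&wq, process_item, i, items[i]);

	/* tear the queue down once the workers have drained it */
	workqueue_destroy(&wq);
}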

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/workqueue.h |  4 ++++
 libfrog/workqueue.c | 30 +++++++++++++++++++++++++++---
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/workqueue.h b/include/workqueue.h
index c45dc4fbcf64..504da9403b85 100644
--- a/include/workqueue.h
+++ b/include/workqueue.h
@@ -30,10 +30,14 @@ struct workqueue {
 	unsigned int		item_count;
 	unsigned int		thread_count;
 	bool			terminate;
+	int			max_queued;
+	pthread_cond_t		queue_full;
 };
 
 int workqueue_create(struct workqueue *wq, void *wq_ctx,
 		unsigned int nr_workers);
+int workqueue_create_bound(struct workqueue *wq, void *wq_ctx,
+		unsigned int nr_workers, int max_queue);
 int workqueue_add(struct workqueue *wq, workqueue_func_t fn,
 		uint32_t index, void *arg);
 void workqueue_destroy(struct workqueue *wq);
diff --git a/libfrog/workqueue.c b/libfrog/workqueue.c
index 7311477374b4..8fe0dc7249f5 100644
--- a/libfrog/workqueue.c
+++ b/libfrog/workqueue.c
@@ -40,13 +40,21 @@ workqueue_thread(void *arg)
 		}
 
 		/*
-		 *  Dequeue work from the head of the list.
+		 *  Dequeue work from the head of the list. If the queue was
+		 *  full then send a wakeup if we're configured to do so.
 		 */
 		assert(wq->item_count > 0);
+		if (wq->max_queued && wq->item_count == wq->max_queued)
+			pthread_cond_signal(&wq->queue_full);
+
 		wi = wq->next_item;
 		wq->next_item = wi->next;
 		wq->item_count--;
 
+		if (wq->max_queued && wq->next_item) {
+			/* more work, wake up another worker */
+			pthread_cond_signal(&wq->wakeup);
+		}
 		pthread_mutex_unlock(&wq->lock);
 
 		(wi->function)(wi->queue, wi->index, wi->arg);
@@ -58,22 +66,25 @@ workqueue_thread(void *arg)
 
 /* Allocate a work queue and threads. */
 int
-workqueue_create(
+workqueue_create_bound(
 	struct workqueue	*wq,
 	void			*wq_ctx,
-	unsigned int		nr_workers)
+	unsigned int		nr_workers,
+	int			max_queue)
 {
 	unsigned int		i;
 	int			err = 0;
 
 	memset(wq, 0, sizeof(*wq));
 	pthread_cond_init(&wq->wakeup, NULL);
+	pthread_cond_init(&wq->queue_full, NULL);
 	pthread_mutex_init(&wq->lock, NULL);
 
 	wq->wq_ctx = wq_ctx;
 	wq->thread_count = nr_workers;
 	wq->threads = malloc(nr_workers * sizeof(pthread_t));
 	wq->terminate = false;
+	wq->max_queued = max_queue;
 
 	for (i = 0; i < nr_workers; i++) {
 		err = pthread_create(&wq->threads[i], NULL, workqueue_thread,
@@ -87,6 +98,15 @@ workqueue_create(
 	return err;
 }
 
+int
+workqueue_create(
+	struct workqueue	*wq,
+	void			*wq_ctx,
+	unsigned int		nr_workers)
+{
+	return workqueue_create_bound(wq, wq_ctx, nr_workers, 0);
+}
+
 /*
  * Create a work item consisting of a function and some arguments and
  * schedule the work item to be run via the thread pool.
@@ -122,6 +142,9 @@ workqueue_add(
 		assert(wq->item_count == 0);
 		pthread_cond_signal(&wq->wakeup);
 	} else {
+		/* throttle on a full queue if configured */
+		if (wq->max_queued && wq->item_count == wq->max_queued)
+			pthread_cond_wait(&wq->queue_full, &wq->lock);
 		wq->last_item->next = wi;
 	}
 	wq->last_item = wi;
@@ -153,5 +176,6 @@ workqueue_destroy(
 	free(wq->threads);
 	pthread_mutex_destroy(&wq->lock);
 	pthread_cond_destroy(&wq->wakeup);
+	pthread_cond_destroy(&wq->queue_full);
 	memset(wq, 0, sizeof(*wq));
 }
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 5/7] repair: Protect bad inode list with mutex
  2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
                   ` (3 preceding siblings ...)
  2018-10-30 11:20 ` [PATCH 4/7] workqueue: bound maximum queue depth Dave Chinner
@ 2018-10-30 11:20 ` Dave Chinner
  2018-10-30 17:44   ` Darrick J. Wong
  2018-10-30 11:20 ` [PATCH 6/7] repair: protect inode chunk tree records with a mutex Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 11:20 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To enable phase 6 parallelisation, we need to protect the bad inode
list from concurrent modification and/or access. Wrap it with a
mutex and clean up the nasty typedefs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/dir2.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/repair/dir2.c b/repair/dir2.c
index ba5763ed3d26..a73a675b97c8 100644
--- a/repair/dir2.c
+++ b/repair/dir2.c
@@ -20,40 +20,50 @@
  * Known bad inode list.  These are seen when the leaf and node
  * block linkages are incorrect.
  */
-typedef struct dir2_bad {
+struct dir2_bad {
 	xfs_ino_t	ino;
 	struct dir2_bad	*next;
-} dir2_bad_t;
+};
 
-static dir2_bad_t *dir2_bad_list;
+static struct dir2_bad	*dir2_bad_list;
+pthread_mutex_t		dir2_bad_list_lock = PTHREAD_MUTEX_INITIALIZER;
 
 static void
 dir2_add_badlist(
 	xfs_ino_t	ino)
 {
-	dir2_bad_t	*l;
+	struct dir2_bad	*l;
 
-	if ((l = malloc(sizeof(dir2_bad_t))) == NULL) {
+	l = malloc(sizeof(*l));
+	if (!l) {
 		do_error(
 _("malloc failed (%zu bytes) dir2_add_badlist:ino %" PRIu64 "\n"),
-			sizeof(dir2_bad_t), ino);
+			sizeof(*l), ino);
 		exit(1);
 	}
+	pthread_mutex_lock(&dir2_bad_list_lock);
 	l->next = dir2_bad_list;
 	dir2_bad_list = l;
 	l->ino = ino;
+	pthread_mutex_unlock(&dir2_bad_list_lock);
 }
 
 int
 dir2_is_badino(
 	xfs_ino_t	ino)
 {
-	dir2_bad_t	*l;
+	struct dir2_bad	*l;
+	int		ret = 0;
 
-	for (l = dir2_bad_list; l; l = l->next)
-		if (l->ino == ino)
-			return 1;
-	return 0;
+	pthread_mutex_lock(&dir2_bad_list_lock);
+	for (l = dir2_bad_list; l; l = l->next) {
+		if (l->ino == ino) {
+			ret = 1;
+			break;
+		}
+	}
+	pthread_mutex_unlock(&dir2_bad_list_lock);
+	return ret;
 }
 
 /*
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 6/7] repair: protect inode chunk tree records with a mutex
  2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
                   ` (4 preceding siblings ...)
  2018-10-30 11:20 ` [PATCH 5/7] repair: Protect bad inode list with mutex Dave Chinner
@ 2018-10-30 11:20 ` Dave Chinner
  2018-10-30 17:46   ` Darrick J. Wong
  2018-10-30 11:20 ` [PATCH 7/7] repair: parallelise phase 6 Dave Chinner
  2018-11-07  5:44 ` [PATCH 0/7] xfs_repair: scale to 150,000 iops Arkadiusz Miśkiewicz
  7 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 11:20 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Phase 6 accesses inode chunk records mostly in an isolated manner.
However, when it finds a corruption in a directory or there are
multiple hardlinks to an inode, there can be concurrent access
to the inode chunk record to update state.

Hence the inode record itself needs a mutex. This protects all state
changes within the inode chunk record, as well as inode link counts
and chunk references. That allows us to process multiple chunks at
once, providing concurrency within an AG as well as across AGs.

The inode chunk tree itself is not modified in phase 6 - it's built
in phases 3 and 4 - and so we do not need to worry about locking
for AVL tree lookups to find the inode chunk records themselves.
Hence internal locking is all we need here.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/incore.h     | 23 +++++++++++++++++++++++
 repair/incore_ino.c | 15 +++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/repair/incore.h b/repair/incore.h
index 5b29d5d1efd8..6564e0d38963 100644
--- a/repair/incore.h
+++ b/repair/incore.h
@@ -282,6 +282,7 @@ typedef struct ino_tree_node  {
 		parent_list_t	*plist;		/* phases 2-5 */
 	} ino_un;
 	uint8_t			*ftypes;	/* phases 3,6 */
+	pthread_mutex_t		lock;
 } ino_tree_node_t;
 
 #define INOS_PER_IREC	(sizeof(uint64_t) * NBBY)
@@ -412,7 +413,9 @@ next_free_ino_rec(ino_tree_node_t *ino_rec)
  */
 static inline void add_inode_refchecked(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_un.ex_data->ino_processed |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int is_inode_refchecked(struct ino_tree_node *irec, int offset)
@@ -438,12 +441,16 @@ static inline int is_inode_confirmed(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_isadir(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_isa_dir |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline void clear_inode_isadir(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_isa_dir &= ~IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int inode_isadir(struct ino_tree_node *irec, int offset)
@@ -456,15 +463,19 @@ static inline int inode_isadir(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_free(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	set_inode_confirmed(irec, offset);
 	irec->ir_free |= XFS_INOBT_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 
 }
 
 static inline void set_inode_used(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	set_inode_confirmed(irec, offset);
 	irec->ir_free &= ~XFS_INOBT_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int is_inode_free(struct ino_tree_node *irec, int offset)
@@ -477,7 +488,9 @@ static inline int is_inode_free(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_sparse(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ir_sparse |= XFS_INOBT_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline bool is_inode_sparse(struct ino_tree_node *irec, int offset)
@@ -490,12 +503,16 @@ static inline bool is_inode_sparse(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_was_rl(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_was_rl |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline void clear_inode_was_rl(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_was_rl &= ~IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int inode_was_rl(struct ino_tree_node *irec, int offset)
@@ -508,12 +525,16 @@ static inline int inode_was_rl(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_is_rl(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_is_rl |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline void clear_inode_is_rl(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_is_rl &= ~IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int inode_is_rl(struct ino_tree_node *irec, int offset)
@@ -546,7 +567,9 @@ static inline int is_inode_reached(struct ino_tree_node *irec, int offset)
 static inline void add_inode_reached(struct ino_tree_node *irec, int offset)
 {
 	add_inode_ref(irec, offset);
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_un.ex_data->ino_reached |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 /*
diff --git a/repair/incore_ino.c b/repair/incore_ino.c
index 82956ae93005..299e4f949e5e 100644
--- a/repair/incore_ino.c
+++ b/repair/incore_ino.c
@@ -91,6 +91,7 @@ void add_inode_ref(struct ino_tree_node *irec, int ino_offset)
 {
 	ASSERT(irec->ino_un.ex_data != NULL);
 
+	pthread_mutex_lock(&irec->lock);
 	switch (irec->nlink_size) {
 	case sizeof(uint8_t):
 		if (irec->ino_un.ex_data->counted_nlinks.un8[ino_offset] < 0xff) {
@@ -112,6 +113,7 @@ void add_inode_ref(struct ino_tree_node *irec, int ino_offset)
 	default:
 		ASSERT(0);
 	}
+	pthread_mutex_unlock(&irec->lock);
 }
 
 void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
@@ -120,6 +122,7 @@ void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
 
 	ASSERT(irec->ino_un.ex_data != NULL);
 
+	pthread_mutex_lock(&irec->lock);
 	switch (irec->nlink_size) {
 	case sizeof(uint8_t):
 		ASSERT(irec->ino_un.ex_data->counted_nlinks.un8[ino_offset] > 0);
@@ -139,6 +142,7 @@ void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
 
 	if (refs == 0)
 		irec->ino_un.ex_data->ino_reached &= ~IREC_MASK(ino_offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 uint32_t num_inode_references(struct ino_tree_node *irec, int ino_offset)
@@ -161,6 +165,7 @@ uint32_t num_inode_references(struct ino_tree_node *irec, int ino_offset)
 void set_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset,
 		uint32_t nlinks)
 {
+	pthread_mutex_lock(&irec->lock);
 	switch (irec->nlink_size) {
 	case sizeof(uint8_t):
 		if (nlinks < 0xff) {
@@ -182,6 +187,7 @@ void set_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset,
 	default:
 		ASSERT(0);
 	}
+	pthread_mutex_unlock(&irec->lock);
 }
 
 uint32_t get_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset)
@@ -253,6 +259,7 @@ alloc_ino_node(
 	irec->nlink_size = sizeof(uint8_t);
 	irec->disk_nlinks.un8 = alloc_nlink_array(irec->nlink_size);
 	irec->ftypes = alloc_ftypes_array(mp);
+	pthread_mutex_init(&irec->lock, NULL);
 	return irec;
 }
 
@@ -294,6 +301,7 @@ free_ino_tree_node(
 	}
 
 	free(irec->ftypes);
+	pthread_mutex_destroy(&irec->lock);
 	free(irec);
 }
 
@@ -600,6 +608,7 @@ set_inode_parent(
 	uint64_t		bitmask;
 	parent_entry_t		*tmp;
 
+	pthread_mutex_lock(&irec->lock);
 	if (full_ino_ex_data)
 		ptbl = irec->ino_un.ex_data->parents;
 	else
@@ -625,6 +634,7 @@ set_inode_parent(
 #endif
 		ptbl->pentries[0] = parent;
 
+		pthread_mutex_unlock(&irec->lock);
 		return;
 	}
 
@@ -642,6 +652,7 @@ set_inode_parent(
 #endif
 		ptbl->pentries[target] = parent;
 
+		pthread_mutex_unlock(&irec->lock);
 		return;
 	}
 
@@ -682,6 +693,7 @@ set_inode_parent(
 #endif
 	ptbl->pentries[target] = parent;
 	ptbl->pmask |= (1ULL << offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 xfs_ino_t
@@ -692,6 +704,7 @@ get_inode_parent(ino_tree_node_t *irec, int offset)
 	int		i;
 	int		target;
 
+	pthread_mutex_lock(&irec->lock);
 	if (full_ino_ex_data)
 		ptbl = irec->ino_un.ex_data->parents;
 	else
@@ -709,9 +722,11 @@ get_inode_parent(ino_tree_node_t *irec, int offset)
 #ifdef DEBUG
 		ASSERT(target < ptbl->cnt);
 #endif
+		pthread_mutex_unlock(&irec->lock);
 		return(ptbl->pentries[target]);
 	}
 
+	pthread_mutex_unlock(&irec->lock);
 	return(0LL);
 }
 
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 7/7] repair: parallelise phase 6
  2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
                   ` (5 preceding siblings ...)
  2018-10-30 11:20 ` [PATCH 6/7] repair: protect inode chunk tree records with a mutex Dave Chinner
@ 2018-10-30 11:20 ` Dave Chinner
  2018-10-30 17:51   ` Darrick J. Wong
  2018-11-07  5:44 ` [PATCH 0/7] xfs_repair: scale to 150,000 iops Arkadiusz Miśkiewicz
  7 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 11:20 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

A recent metadump provided to us caused repair to take hours in
phase 6. It wasn't IO bound - it was fully CPU bound the entire time.
The only way to speed it up is to make phase 6 run multiple
concurrent processing threads.

The obvious way to do this is to spread the concurrency across AGs,
like the other phases, and while this works it is not optimal. When
a processing thread hits a really large directory, it essentially
sits CPU bound until that directory is processed. If an AG has lots
of large directories, we end up with a really long single threaded
tail that limits concurrency.

Hence we also need to have concurrency /within/ the AG. This is
relatively easy, as the inode chunk records allow for a simple
concurrency mechanism within an AG. We can simply feed each chunk
record to a workqueue, and we get concurrency within the AG for
free. However, this allows prefetch to run way ahead of processing
and this blows out the buffer cache size and can cause OOM.

However, we can use the new workqueue depth limiting to limit the
number of inode chunks queued, and this then backs up the inode
prefetching to its maximum queue depth. Hence we prevent the
prefetch code from queueing the entire AG's inode chunks on the
workqueue and blowing out memory, by throttling the prefetch
consumer.

This takes phase 6 from taking many, many hours down to:

Phase 6:        10/30 21:12:58  10/30 21:40:48  27 minutes, 50 seconds

And burning 20-30 cpus that entire time on my test rig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase6.c | 43 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index 9477bc2527f3..422a53bca6c7 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -6,6 +6,7 @@
 
 #include "libxfs.h"
 #include "threads.h"
+#include "threads.h"
 #include "prefetch.h"
 #include "avl.h"
 #include "globals.h"
@@ -3169,20 +3170,45 @@ check_for_orphaned_inodes(
 }
 
 static void
-traverse_function(
+do_dir_inode(
 	struct workqueue	*wq,
-	xfs_agnumber_t 		agno,
+	xfs_agnumber_t		agno,
 	void			*arg)
 {
-	ino_tree_node_t 	*irec;
+	struct ino_tree_node	*irec = arg;
 	int			i;
+
+	for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
+		if (inode_isadir(irec, i))
+			process_dir_inode(wq->wq_ctx, agno, irec, i);
+	}
+}
+
+static void
+traverse_function(
+	struct workqueue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct ino_tree_node	*irec;
 	prefetch_args_t		*pf_args = arg;
+	struct workqueue	lwq;
+	struct xfs_mount	*mp = wq->wq_ctx;
+
 
 	wait_for_inode_prefetch(pf_args);
 
 	if (verbose)
 		do_log(_("        - agno = %d\n"), agno);
 
+	/*
+	 * The more AGs we have in flight at once, the fewer processing threads
+	 * per AG. This means we don't overwhelm the machine with hundreds of
+	 * threads when we start acting on lots of AGs at once. We just want
+	 * enough that we can keep multiple CPUs busy across multiple AGs.
+	 */
+	workqueue_create_bound(&lwq, mp, ag_stride, 1000);
+
 	for (irec = findfirst_inode_rec(agno); irec; irec = next_ino_rec(irec)) {
 		if (irec->ino_isa_dir == 0)
 			continue;
@@ -3190,18 +3216,19 @@ traverse_function(
 		if (pf_args) {
 			sem_post(&pf_args->ra_count);
 #ifdef XR_PF_TRACE
+			{
+			int	i;
 			sem_getvalue(&pf_args->ra_count, &i);
 			pftrace(
 		"processing inode chunk %p in AG %d (sem count = %d)",
 				irec, agno, i);
+			}
 #endif
 		}
 
-		for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
-			if (inode_isadir(irec, i))
-				process_dir_inode(wq->wq_ctx, agno, irec, i);
-		}
+		queue_work(&lwq, do_dir_inode, agno, irec);
 	}
+	destroy_work_queue(&lwq);
 	cleanup_inode_prefetch(pf_args);
 }
 
@@ -3229,7 +3256,7 @@ static void
 traverse_ags(
 	struct xfs_mount	*mp)
 {
-	do_inode_prefetch(mp, 0, traverse_function, false, true);
+	do_inode_prefetch(mp, ag_stride, traverse_function, false, true);
 }
 
 void
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption"
  2018-10-30 11:20 ` [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption" Dave Chinner
@ 2018-10-30 17:20   ` Darrick J. Wong
  2018-10-30 19:35     ` Eric Sandeen
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2018-10-30 17:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:20:37PM +1100, Dave Chinner wrote:
> This reverts commit 67a79e2cc9320aaf269cd00e9c8d16892931886d.
> 
> A root LEAFN block can exist in a directory. When we convert from
> leaf format (LEAF1 - internal free list) to node format (LEAFN -
> external free list) the only change to the single root leaf block is
> that its magic number is changed from LEAF1 to LEAFN.
> 
> We don't actually end up with DA nodes in the tree until the LEAFN
> node is split, and that requires a couple more dirents to be added
> to the directory to fill the LEAFN block up completely. Then it will
> split and create a DA node root block pointing to multiple LEAFN
> leaf blocks.
> 
> Hence restore the old behaviour where we skip the DA node tree
> rebuild if there is a LEAFN root block found as there is no tree to
> rebuild.

The trouble with reverting the patch is that xfs_repair goes back to
tripping over an assertion in release_da_cursor_int if the reason why
we got bno == 0 is that the directory has a da btree block with
nbtree[0].before pointing to zero.

Will post a better fix + regression tests for both issues shortly.

--D

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/dir2.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/repair/dir2.c b/repair/dir2.c
> index 3374ae722bf9..ba5763ed3d26 100644
> --- a/repair/dir2.c
> +++ b/repair/dir2.c
> @@ -1242,11 +1242,11 @@ process_node_dir2(
>  		return 1;
>  
>  	/*
> -	 * Directories with a root marked XFS_DIR2_LEAFN_MAGIC are corrupt
> +	 * Skip directories with a root marked XFS_DIR2_LEAFN_MAGIC
>  	 */
>  	if (bno == 0) {
> -		err_release_da_cursor(mp, &da_cursor, 0);
> -		return 1;
> +		release_da_cursor(mp, &da_cursor, 0);
> +		return 0;
>  	} else {
>  		/*
>  		 * Now pass cursor and bno into leaf-block processing routine.
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] repair: don't dirty inodes which are not unlinked
  2018-10-30 11:20 ` [PATCH 2/7] repair: don't dirty inodes which are not unlinked Dave Chinner
@ 2018-10-30 17:26   ` Darrick J. Wong
  2018-10-30 20:03   ` Eric Sandeen
  1 sibling, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2018-10-30 17:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:20:38PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> I noticed phase 4 writing back lots of inode buffers during recent
> testing. The recent rework of clear_dinode() in commit 0724d0f4cb53
> ("xfs_repair: clear_dinode should simply clear, not check contents")
> accidentally caught a call to clear_dinode_unlinked() as well,
> resulting in all inodes being marked dirty whether they needed
> updating or not.
> 
> Fix it by reverting the erroneous hunk and adding warnings so that
> this corruption is no longer silently fixed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Oops... that explains a lot. :/

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  repair/dinode.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/repair/dinode.c b/repair/dinode.c
> index 379f85cf1268..90400128d4bb 100644
> --- a/repair/dinode.c
> +++ b/repair/dinode.c
> @@ -2675,9 +2675,15 @@ _("bad (negative) size %" PRId64 " on inode %" PRIu64 "\n"),
>  	 * we're going to find.  check_dups is set to 1 only during
>  	 * phase 4.  Ugly.
>  	 */
> -	if (check_dups && !no_modify) {
> -		clear_dinode_unlinked(mp, dino);
> -		*dirty += 1;
> +	if (check_dups && clear_dinode_unlinked(mp, dino)) {
> +		if (no_modify) {
> +			do_warn(
> +	_("Would clear unlinked_next in inode %" PRIu64 "\n"), lino);
> +		} else  {
> +			do_warn(
> +	_("Cleared unlinked_next in inode %" PRIu64 "\n"), lino);
> +			*dirty += 1;
> +		}
>  	}
>  
>  	/* set type and map type info */
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/7] cache: prevent expansion races
  2018-10-30 11:20 ` [PATCH 3/7] cache: prevent expansion races Dave Chinner
@ 2018-10-30 17:39   ` Darrick J. Wong
  2018-10-30 20:35     ` Dave Chinner
  2018-10-31 17:13   ` Brian Foster
  1 sibling, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2018-10-30 17:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:20:39PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a sudden buffer cache growth demand occurs from multiple
> concurrent sources, they can all fail shaking the cache at the same
> time and expand the cache. This results in the cache doubling in
> size multiple times when only a single expansion was really
> necessary. The resultant massive cache can cause repair to OOM as
> the cache will grow much larger than memory can actually hold.
> 
> Prevent this sudden and unnecessary expansion by rate limiting cache
> growth to one size increase every ten seconds. This is sufficient
> to prevent racing expansion calls from doing the wrong thing, but
> not so long that it prevents necessary cache growth when the cache
> is undersized.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/cache.h |  1 +
>  libxfs/cache.c  | 17 ++++++++++++++---
>  2 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/include/cache.h b/include/cache.h
> index 552b92489e46..1b774619f7a0 100644
> --- a/include/cache.h
> +++ b/include/cache.h
> @@ -114,6 +114,7 @@ struct cache {
>  	unsigned long long	c_misses;	/* cache misses */
>  	unsigned long long	c_hits;		/* cache hits */
>  	unsigned int 		c_max;		/* max nodes ever used */
> +	time_t			expand_time;	/* time of last expansion */
>  };
>  
>  struct cache *cache_init(int, unsigned int, struct cache_operations *);
> diff --git a/libxfs/cache.c b/libxfs/cache.c
> index 139c7c1b9e71..e10df395e60e 100644
> --- a/libxfs/cache.c
> +++ b/libxfs/cache.c
> @@ -62,6 +62,7 @@ cache_init(
>  	cache->bulkrelse = cache_operations->bulkrelse ?
>  		cache_operations->bulkrelse : cache_generic_bulkrelse;
>  	pthread_mutex_init(&cache->c_mutex, NULL);
> +	time(&cache->expand_time);
>  
>  	for (i = 0; i < hashsize; i++) {
>  		list_head_init(&cache->c_hash[i].ch_list);
> @@ -77,15 +78,25 @@ cache_init(
>  	return cache;
>  }
>  
> +/*
> + * rate limit expansion so several concurrent shakes don't instantly all
> + * expand the cache multiple times and blow repair to OOM death.
> + */
>  static void
>  cache_expand(
> -	struct cache *		cache)
> +	struct cache	*cache)
>  {
> +	time_t		now;
> +
>  	pthread_mutex_lock(&cache->c_mutex);
> +	time(&now);
> +	if (now - cache->expand_time > 10) {

At first I wondered to myself about what happens if we fill the cache
fast enough that we run out of space in less than ten seconds, but
(afaict) the cache_expand caller will keep trying shake/expand until it
gets something, right?

--D

>  #ifdef CACHE_DEBUG
> -	fprintf(stderr, "doubling cache size to %d\n", 2 * cache->c_maxcount);
> +		fprintf(stderr, "doubling cache size to %d\n", 2 * cache->c_maxcount);
>  #endif
> -	cache->c_maxcount *= 2;
> +		cache->c_maxcount *= 2;
> +		cache->expand_time = now;
> +	}
>  	pthread_mutex_unlock(&cache->c_mutex);
>  }
>  
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/7] repair: Protect bad inode list with mutex
  2018-10-30 11:20 ` [PATCH 5/7] repair: Protect bad inode list with mutex Dave Chinner
@ 2018-10-30 17:44   ` Darrick J. Wong
  2018-10-30 20:54     ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2018-10-30 17:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:20:41PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To enable phase 6 parallelisation, we need to protect the bad inode
> list from concurrent modification and/or access. Wrap it with a
> mutex and clean up the nasty typedefs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/dir2.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/repair/dir2.c b/repair/dir2.c
> index ba5763ed3d26..a73a675b97c8 100644
> --- a/repair/dir2.c
> +++ b/repair/dir2.c
> @@ -20,40 +20,50 @@
>   * Known bad inode list.  These are seen when the leaf and node
>   * block linkages are incorrect.
>   */
> -typedef struct dir2_bad {
> +struct dir2_bad {
>  	xfs_ino_t	ino;
>  	struct dir2_bad	*next;
> -} dir2_bad_t;
> +};

Just out of curiosity, how fast does this linked list grow?  In theory
we could hoist scrub/bitmap.c to libfrog and turn this into a bitmap,
but that probably depends on how often we find inodes with bad link
counts...?  And admittedly, avlnodes aren't cheap either...

Anyway, this patch itself looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

>  
> -static dir2_bad_t *dir2_bad_list;
> +static struct dir2_bad	*dir2_bad_list;
> +pthread_mutex_t		dir2_bad_list_lock = PTHREAD_MUTEX_INITIALIZER;
>  
>  static void
>  dir2_add_badlist(
>  	xfs_ino_t	ino)
>  {
> -	dir2_bad_t	*l;
> +	struct dir2_bad	*l;
>  
> -	if ((l = malloc(sizeof(dir2_bad_t))) == NULL) {
> +	l = malloc(sizeof(*l));
> +	if (!l) {
>  		do_error(
>  _("malloc failed (%zu bytes) dir2_add_badlist:ino %" PRIu64 "\n"),
> -			sizeof(dir2_bad_t), ino);
> +			sizeof(*l), ino);
>  		exit(1);
>  	}
> +	pthread_mutex_lock(&dir2_bad_list_lock);
>  	l->next = dir2_bad_list;
>  	dir2_bad_list = l;
>  	l->ino = ino;
> +	pthread_mutex_unlock(&dir2_bad_list_lock);
>  }
>  
>  int
>  dir2_is_badino(
>  	xfs_ino_t	ino)
>  {
> -	dir2_bad_t	*l;
> +	struct dir2_bad	*l;
> +	int		ret = 0;
>  
> -	for (l = dir2_bad_list; l; l = l->next)
> -		if (l->ino == ino)
> -			return 1;
> -	return 0;
> +	pthread_mutex_lock(&dir2_bad_list_lock);
> +	for (l = dir2_bad_list; l; l = l->next) {
> +		if (l->ino == ino) {
> +			ret = 1;
> +			break;
> +		}
> +	}
> +	pthread_mutex_unlock(&dir2_bad_list_lock);
> +	return ret;
>  }
>  
>  /*
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] repair: protect inode chunk tree records with a mutex
  2018-10-30 11:20 ` [PATCH 6/7] repair: protect inode chunk tree records with a mutex Dave Chinner
@ 2018-10-30 17:46   ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2018-10-30 17:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:20:42PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Phase 6 accesses inode chunk records mostly in an isolated manner.
> However, when it finds a corruption in a directory or there are
> multiple hardlinks to an inode, there can be concurrent access
> to the inode chunk record to update state.
> 
> Hence the inode record itself needs a mutex. This protects all state
> changes within the inode chunk record, as well as inode link counts
> and chunk references. That allows us to process multiple chunks at
> once, providing concurrency within an AG as well as across AGs.
> 
> The inode chunk tree itself is not modified in phase 6 - it's built
> in phases 3 and 4 - and so we do not need to worry about locking
> for AVL tree lookups to find the inode chunk records themselves.
> Hence internal locking is all we need here.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  repair/incore.h     | 23 +++++++++++++++++++++++
>  repair/incore_ino.c | 15 +++++++++++++++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/repair/incore.h b/repair/incore.h
> index 5b29d5d1efd8..6564e0d38963 100644
> --- a/repair/incore.h
> +++ b/repair/incore.h
> @@ -282,6 +282,7 @@ typedef struct ino_tree_node  {
>  		parent_list_t	*plist;		/* phases 2-5 */
>  	} ino_un;
>  	uint8_t			*ftypes;	/* phases 3,6 */
> +	pthread_mutex_t		lock;
>  } ino_tree_node_t;
>  
>  #define INOS_PER_IREC	(sizeof(uint64_t) * NBBY)
> @@ -412,7 +413,9 @@ next_free_ino_rec(ino_tree_node_t *ino_rec)
>   */
>  static inline void add_inode_refchecked(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_un.ex_data->ino_processed |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int is_inode_refchecked(struct ino_tree_node *irec, int offset)
> @@ -438,12 +441,16 @@ static inline int is_inode_confirmed(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_isadir(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_isa_dir |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline void clear_inode_isadir(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_isa_dir &= ~IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int inode_isadir(struct ino_tree_node *irec, int offset)
> @@ -456,15 +463,19 @@ static inline int inode_isadir(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_free(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	set_inode_confirmed(irec, offset);
>  	irec->ir_free |= XFS_INOBT_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  
>  }
>  
>  static inline void set_inode_used(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	set_inode_confirmed(irec, offset);
>  	irec->ir_free &= ~XFS_INOBT_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int is_inode_free(struct ino_tree_node *irec, int offset)
> @@ -477,7 +488,9 @@ static inline int is_inode_free(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_sparse(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ir_sparse |= XFS_INOBT_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline bool is_inode_sparse(struct ino_tree_node *irec, int offset)
> @@ -490,12 +503,16 @@ static inline bool is_inode_sparse(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_was_rl(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_was_rl |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline void clear_inode_was_rl(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_was_rl &= ~IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int inode_was_rl(struct ino_tree_node *irec, int offset)
> @@ -508,12 +525,16 @@ static inline int inode_was_rl(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_is_rl(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_is_rl |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline void clear_inode_is_rl(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_is_rl &= ~IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int inode_is_rl(struct ino_tree_node *irec, int offset)
> @@ -546,7 +567,9 @@ static inline int is_inode_reached(struct ino_tree_node *irec, int offset)
>  static inline void add_inode_reached(struct ino_tree_node *irec, int offset)
>  {
>  	add_inode_ref(irec, offset);
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_un.ex_data->ino_reached |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  /*
> diff --git a/repair/incore_ino.c b/repair/incore_ino.c
> index 82956ae93005..299e4f949e5e 100644
> --- a/repair/incore_ino.c
> +++ b/repair/incore_ino.c
> @@ -91,6 +91,7 @@ void add_inode_ref(struct ino_tree_node *irec, int ino_offset)
>  {
>  	ASSERT(irec->ino_un.ex_data != NULL);
>  
> +	pthread_mutex_lock(&irec->lock);
>  	switch (irec->nlink_size) {
>  	case sizeof(uint8_t):
>  		if (irec->ino_un.ex_data->counted_nlinks.un8[ino_offset] < 0xff) {
> @@ -112,6 +113,7 @@ void add_inode_ref(struct ino_tree_node *irec, int ino_offset)
>  	default:
>  		ASSERT(0);
>  	}
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
> @@ -120,6 +122,7 @@ void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
>  
>  	ASSERT(irec->ino_un.ex_data != NULL);
>  
> +	pthread_mutex_lock(&irec->lock);
>  	switch (irec->nlink_size) {
>  	case sizeof(uint8_t):
>  		ASSERT(irec->ino_un.ex_data->counted_nlinks.un8[ino_offset] > 0);
> @@ -139,6 +142,7 @@ void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
>  
>  	if (refs == 0)
>  		irec->ino_un.ex_data->ino_reached &= ~IREC_MASK(ino_offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  uint32_t num_inode_references(struct ino_tree_node *irec, int ino_offset)
> @@ -161,6 +165,7 @@ uint32_t num_inode_references(struct ino_tree_node *irec, int ino_offset)
>  void set_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset,
>  		uint32_t nlinks)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	switch (irec->nlink_size) {
>  	case sizeof(uint8_t):
>  		if (nlinks < 0xff) {
> @@ -182,6 +187,7 @@ void set_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset,
>  	default:
>  		ASSERT(0);
>  	}
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  uint32_t get_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset)
> @@ -253,6 +259,7 @@ alloc_ino_node(
>  	irec->nlink_size = sizeof(uint8_t);
>  	irec->disk_nlinks.un8 = alloc_nlink_array(irec->nlink_size);
>  	irec->ftypes = alloc_ftypes_array(mp);
> +	pthread_mutex_init(&irec->lock, NULL);
>  	return irec;
>  }
>  
> @@ -294,6 +301,7 @@ free_ino_tree_node(
>  	}
>  
>  	free(irec->ftypes);
> +	pthread_mutex_destroy(&irec->lock);
>  	free(irec);
>  }
>  
> @@ -600,6 +608,7 @@ set_inode_parent(
>  	uint64_t		bitmask;
>  	parent_entry_t		*tmp;
>  
> +	pthread_mutex_lock(&irec->lock);
>  	if (full_ino_ex_data)
>  		ptbl = irec->ino_un.ex_data->parents;
>  	else
> @@ -625,6 +634,7 @@ set_inode_parent(
>  #endif
>  		ptbl->pentries[0] = parent;
>  
> +		pthread_mutex_unlock(&irec->lock);
>  		return;
>  	}
>  
> @@ -642,6 +652,7 @@ set_inode_parent(
>  #endif
>  		ptbl->pentries[target] = parent;
>  
> +		pthread_mutex_unlock(&irec->lock);
>  		return;
>  	}
>  
> @@ -682,6 +693,7 @@ set_inode_parent(
>  #endif
>  	ptbl->pentries[target] = parent;
>  	ptbl->pmask |= (1ULL << offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  xfs_ino_t
> @@ -692,6 +704,7 @@ get_inode_parent(ino_tree_node_t *irec, int offset)
>  	int		i;
>  	int		target;
>  
> +	pthread_mutex_lock(&irec->lock);
>  	if (full_ino_ex_data)
>  		ptbl = irec->ino_un.ex_data->parents;
>  	else
> @@ -709,9 +722,11 @@ get_inode_parent(ino_tree_node_t *irec, int offset)
>  #ifdef DEBUG
>  		ASSERT(target < ptbl->cnt);
>  #endif
> +		pthread_mutex_unlock(&irec->lock);
>  		return(ptbl->pentries[target]);
>  	}
>  
> +	pthread_mutex_unlock(&irec->lock);
>  	return(0LL);
>  }
>  
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/7] repair: parallelise phase 6
  2018-10-30 11:20 ` [PATCH 7/7] repair: parallelise phase 6 Dave Chinner
@ 2018-10-30 17:51   ` Darrick J. Wong
  2018-10-30 20:55     ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2018-10-30 17:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:20:43PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> A recent metadump provided to us caused repair to take hours in
> phase 6. It wasn't IO bound - it was fully CPU bound the entire time.
> The only way to speed it up is to make phase 6 run multiple
> concurrent processing threads.
> 
> The obvious way to do this is to spread the concurrency across AGs,
> like the other phases, and while this works it is not optimal. When
> a processing thread hits a really large directory, it essentially
> sits CPU bound until that directory is processed. If an AG has lots
> of large directories, we end up with a really long single threaded
> tail that limits concurrency.
> 
> Hence we also need to have concurrency /within/ the AG. This is
> relatively easy, as the inode chunk records allow for a simple
> concurrency mechanism within an AG. We can simply feed each chunk
> record to a workqueue, and we get concurrency within the AG for
> free. However, this allows prefetch to run way ahead of processing
> and this blows out the buffer cache size and can cause OOM.
> 
> However, we can use the new workqueue depth limiting to limit the
> number of inode chunks queued, and this then backs up the inode
> prefetching to its maximum queue depth. Hence we prevent the
> prefetch code from queueing the entire AG's inode chunks on the
> workqueue and blowing out memory, by throttling the prefetch
> consumer.
> 
> This takes phase 6 from taking many, many hours down to:
> 
> Phase 6:        10/30 21:12:58  10/30 21:40:48  27 minutes, 50 seconds
> 
> And burning 20-30 cpus that entire time on my test rig.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/phase6.c | 43 +++++++++++++++++++++++++++++++++++--------
>  1 file changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/repair/phase6.c b/repair/phase6.c
> index 9477bc2527f3..422a53bca6c7 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -6,6 +6,7 @@
>  
>  #include "libxfs.h"
>  #include "threads.h"
> +#include "threads.h"
>  #include "prefetch.h"
>  #include "avl.h"
>  #include "globals.h"
> @@ -3169,20 +3170,45 @@ check_for_orphaned_inodes(
>  }
>  
>  static void
> -traverse_function(
> +do_dir_inode(
>  	struct workqueue	*wq,
> -	xfs_agnumber_t 		agno,
> +	xfs_agnumber_t		agno,
>  	void			*arg)
>  {
> -	ino_tree_node_t 	*irec;
> +	struct ino_tree_node	*irec = arg;
>  	int			i;
> +
> +	for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
> +		if (inode_isadir(irec, i))
> +			process_dir_inode(wq->wq_ctx, agno, irec, i);
> +	}
> +}
> +
> +static void
> +traverse_function(
> +	struct workqueue	*wq,
> +	xfs_agnumber_t		agno,
> +	void			*arg)
> +{
> +	struct ino_tree_node	*irec;
>  	prefetch_args_t		*pf_args = arg;
> +	struct workqueue	lwq;
> +	struct xfs_mount	*mp = wq->wq_ctx;
> +
>  
>  	wait_for_inode_prefetch(pf_args);
>  
>  	if (verbose)
>  		do_log(_("        - agno = %d\n"), agno);
>  
> +	/*
> +	 * The more AGs we have in flight at once, the fewer processing threads
> +	 * per AG. This means we don't overwhelm the machine with hundreds of
> +	 * threads when we start acting on lots of AGs at once. We just want
> +	 * enough that we can keep multiple CPUs busy across multiple AGs.
> +	 */
> +	workqueue_create_bound(&lwq, mp, ag_stride, 1000);

Unlikely to happen, but do we need to check for errors here?
create_work_queue aborts repair if workqueue_create fails.

--D

> +
>  	for (irec = findfirst_inode_rec(agno); irec; irec = next_ino_rec(irec)) {
>  		if (irec->ino_isa_dir == 0)
>  			continue;
> @@ -3190,18 +3216,19 @@ traverse_function(
>  		if (pf_args) {
>  			sem_post(&pf_args->ra_count);
>  #ifdef XR_PF_TRACE
> +			{
> +			int	i;
>  			sem_getvalue(&pf_args->ra_count, &i);
>  			pftrace(
>  		"processing inode chunk %p in AG %d (sem count = %d)",
>  				irec, agno, i);
> +			}
>  #endif
>  		}
>  
> -		for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
> -			if (inode_isadir(irec, i))
> -				process_dir_inode(wq->wq_ctx, agno, irec, i);
> -		}
> +		queue_work(&lwq, do_dir_inode, agno, irec);
>  	}
> +	destroy_work_queue(&lwq);
>  	cleanup_inode_prefetch(pf_args);
>  }
>  
> @@ -3229,7 +3256,7 @@ static void
>  traverse_ags(
>  	struct xfs_mount	*mp)
>  {
> -	do_inode_prefetch(mp, 0, traverse_function, false, true);
> +	do_inode_prefetch(mp, ag_stride, traverse_function, false, true);
>  }
>  
>  void
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/7] workqueue: bound maximum queue depth
  2018-10-30 11:20 ` [PATCH 4/7] workqueue: bound maximum queue depth Dave Chinner
@ 2018-10-30 17:58   ` Darrick J. Wong
  2018-10-30 20:53     ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2018-10-30 17:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:20:40PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Existing users of workqueues have bound maximum queue depths in
> their external algorithms (e.g. prefetch counts). For parallelising
> work that doesn't have an external bound, allow workqueues to
> throttle incoming requests at a maximum bound. Bounded workqueues
> also need to distribute work over all worker threads themselves as
> there is no external bounding or worker function throttling
> provided.
> 
> Existing callers are not throttled and retain direct control of
> worker threads, only users of the new create interface will be
> throttled and concurrency managed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/workqueue.h |  4 ++++
>  libfrog/workqueue.c | 30 +++++++++++++++++++++++++++---
>  2 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/include/workqueue.h b/include/workqueue.h
> index c45dc4fbcf64..504da9403b85 100644
> --- a/include/workqueue.h
> +++ b/include/workqueue.h
> @@ -30,10 +30,14 @@ struct workqueue {
>  	unsigned int		item_count;
>  	unsigned int		thread_count;
>  	bool			terminate;
> +	int			max_queued;
> +	pthread_cond_t		queue_full;
>  };
>  
>  int workqueue_create(struct workqueue *wq, void *wq_ctx,
>  		unsigned int nr_workers);
> +int workqueue_create_bound(struct workqueue *wq, void *wq_ctx,
> +		unsigned int nr_workers, int max_queue);

What does negative max_queue mean?

>  int workqueue_add(struct workqueue *wq, workqueue_func_t fn,
>  		uint32_t index, void *arg);
>  void workqueue_destroy(struct workqueue *wq);
> diff --git a/libfrog/workqueue.c b/libfrog/workqueue.c
> index 7311477374b4..8fe0dc7249f5 100644
> --- a/libfrog/workqueue.c
> +++ b/libfrog/workqueue.c
> @@ -40,13 +40,21 @@ workqueue_thread(void *arg)
>  		}
>  
>  		/*
> -		 *  Dequeue work from the head of the list.
> +		 *  Dequeue work from the head of the list. If the queue was
> +		 *  full then send a wakeup if we're configured to do so.
>  		 */
>  		assert(wq->item_count > 0);
> +		if (wq->max_queued && wq->item_count == wq->max_queued)
> +			pthread_cond_signal(&wq->queue_full);
> +
>  		wi = wq->next_item;
>  		wq->next_item = wi->next;
>  		wq->item_count--;
>  
> +		if (wq->max_queued && wq->next_item) {
> +			/* more work, wake up another worker */
> +			pthread_cond_signal(&wq->wakeup);
> +		}

It seems a little funny to me that the worker thread wakes up other
worker threads when there is more work to do (vs. workqueue_add which
actually added more work)...

--D

>  		pthread_mutex_unlock(&wq->lock);
>  
>  		(wi->function)(wi->queue, wi->index, wi->arg);
> @@ -58,22 +66,25 @@ workqueue_thread(void *arg)
>  
>  /* Allocate a work queue and threads. */
>  int
> -workqueue_create(
> +workqueue_create_bound(
>  	struct workqueue	*wq,
>  	void			*wq_ctx,
> -	unsigned int		nr_workers)
> +	unsigned int		nr_workers,
> +	int			max_queue)
>  {
>  	unsigned int		i;
>  	int			err = 0;
>  
>  	memset(wq, 0, sizeof(*wq));
>  	pthread_cond_init(&wq->wakeup, NULL);
> +	pthread_cond_init(&wq->queue_full, NULL);
>  	pthread_mutex_init(&wq->lock, NULL);
>  
>  	wq->wq_ctx = wq_ctx;
>  	wq->thread_count = nr_workers;
>  	wq->threads = malloc(nr_workers * sizeof(pthread_t));
>  	wq->terminate = false;
> +	wq->max_queued = max_queue;
>  
>  	for (i = 0; i < nr_workers; i++) {
>  		err = pthread_create(&wq->threads[i], NULL, workqueue_thread,
> @@ -87,6 +98,15 @@ workqueue_create(
>  	return err;
>  }
>  
> +int
> +workqueue_create(
> +	struct workqueue	*wq,
> +	void			*wq_ctx,
> +	unsigned int		nr_workers)
> +{
> +	return workqueue_create_bound(wq, wq_ctx, nr_workers, 0);
> +}
> +
>  /*
>   * Create a work item consisting of a function and some arguments and
>   * schedule the work item to be run via the thread pool.
> @@ -122,6 +142,9 @@ workqueue_add(
>  		assert(wq->item_count == 0);
>  		pthread_cond_signal(&wq->wakeup);
>  	} else {
> +		/* throttle on a full queue if configured */
> +		if (wq->max_queued && wq->item_count == wq->max_queued)
> +			pthread_cond_wait(&wq->queue_full, &wq->lock);
>  		wq->last_item->next = wi;
>  	}
>  	wq->last_item = wi;
> @@ -153,5 +176,6 @@ workqueue_destroy(
>  	free(wq->threads);
>  	pthread_mutex_destroy(&wq->lock);
>  	pthread_cond_destroy(&wq->wakeup);
> +	pthread_cond_destroy(&wq->queue_full);
>  	memset(wq, 0, sizeof(*wq));
>  }
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption"
  2018-10-30 17:20   ` Darrick J. Wong
@ 2018-10-30 19:35     ` Eric Sandeen
  2018-10-30 20:11       ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Sandeen @ 2018-10-30 19:35 UTC (permalink / raw)
  To: Darrick J. Wong, Dave Chinner; +Cc: linux-xfs



On 10/30/18 12:20 PM, Darrick J. Wong wrote:
> On Tue, Oct 30, 2018 at 10:20:37PM +1100, Dave Chinner wrote:
>> This reverts commit 67a79e2cc9320aaf269cd00e9c8d16892931886d.
>>
>> A root LEAFN block can exist in a directory. When we convert from
>> leaf format (LEAF1 - internal free list) to node format (LEAFN -
>> external free list) the only change to the single root leaf block is
>> that it's magic number is changed from LEAF1 to LEAFN.
>>
>> We don't actually end up with DA nodes in the tree until the LEAFN
>> node is split, and that requires a couple more dirents to be added
>> to the directory to fill the LEAFN block up completely. Then it will
>> split and create a DA node root block pointing to multiple LEAFN
>> leaf blocks.
>>
>> Hence restore the old behaviour where we skip the DA node tree
>> rebuild if there is a LEAFN root block found as there is no tree to
>> rebuild.
> 
> The trouble with reverting the patch is that xfs_repair goes back to
> tripping over an assertion in release_da_cursor_int if the reason why
> we got bno == 0 is that the directory has a da btree block with
> nbtree[0].before pointing to zero.
> 
> Will post a better fix + regression tests for both issues shortly.

Well let's discuss pros/cons & risks/rewards on that, I'm semi-inclined
to just revert it for 4.19 and fix it properly in 4.20, because I don't think
it's ever been seen in the wild...?

-eric

> --D
> 
>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>> ---
>>  repair/dir2.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/repair/dir2.c b/repair/dir2.c
>> index 3374ae722bf9..ba5763ed3d26 100644
>> --- a/repair/dir2.c
>> +++ b/repair/dir2.c
>> @@ -1242,11 +1242,11 @@ process_node_dir2(
>>  		return 1;
>>  
>>  	/*
>> -	 * Directories with a root marked XFS_DIR2_LEAFN_MAGIC are corrupt
>> +	 * Skip directories with a root marked XFS_DIR2_LEAFN_MAGIC
>>  	 */
>>  	if (bno == 0) {
>> -		err_release_da_cursor(mp, &da_cursor, 0);
>> -		return 1;
>> +		release_da_cursor(mp, &da_cursor, 0);
>> +		return 0;
>>  	} else {
>>  		/*
>>  		 * Now pass cursor and bno into leaf-block processing routine.
>> -- 
>> 2.19.1
>>
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] repair: don't dirty inodes which are not unlinked
  2018-10-30 11:20 ` [PATCH 2/7] repair: don't dirty inodes which are not unlinked Dave Chinner
  2018-10-30 17:26   ` Darrick J. Wong
@ 2018-10-30 20:03   ` Eric Sandeen
  2018-10-30 20:09     ` Eric Sandeen
  1 sibling, 1 reply; 35+ messages in thread
From: Eric Sandeen @ 2018-10-30 20:03 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On 10/30/18 6:20 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> I noticed phase 4 writing back lots of inode buffers during recent
> testing. The recent rework of clear_inode() in commit 0724d0f4cb53
> ("xfs_repair: clear_dinode should simply clear, not check contents")
> accidentally caught a call to clear_inode_unlinked() as well,
> resulting in all inodes being marked dirty whether they needed
> updating or not.
> 
> Fix it by reverting the erroneous hunk and adding warnings so that
> this corruption is no longer silently fixed.

I find it confusing that "clear_dinode_unlinked" may or may not clear
unlinked.  Can we change it to actually do as it's named, and move the
test to the (one) caller where it's needed, something like:

===

repair: don't dirty inodes unconditionally when testing unlinked state

Dave noticed phase 4 writing back lots of inode buffers during recent
testing. The recent rework of clear_inode() in commit 0724d0f4cb53
("xfs_repair: clear_dinode should simply clear, not check contents")
accidentally caught a call to clear_inode_unlinked() as well,
resulting in all inodes being marked dirty whether they needed
updating or not.

Fix it by making clear_inode_unlinked unconditionally do the clear
(as was done for clear_inode), and move the test to the caller.
Add warnings as well so that this corruption is no longer silently
fixed.

Reported-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

(I can keep Dave's SOB if preferred with a [sandeen: do it different]
tag if this looks ok and that approach is preferred)

(this also fixes unlinked_next lysdexia in the warnings)

diff --git a/repair/dinode.c b/repair/dinode.c
index 379f85c..c0fa3df 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -125,23 +125,16 @@ clear_dinode_core(struct xfs_mount *mp, xfs_dinode_t *dinoc, xfs_ino_t ino_num)
 	return;
 }
 
-static int
+static void
 clear_dinode_unlinked(xfs_mount_t *mp, xfs_dinode_t *dino)
 {
 
-	if (be32_to_cpu(dino->di_next_unlinked) != NULLAGINO)  {
-		if (!no_modify)
-			dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
-		return(1);
-	}
-
-	return(0);
+	dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
 }
 
 /*
  * this clears the unlinked list too so it should not be called
  * until after the agi unlinked lists are walked in phase 3.
- * returns > zero if the inode has been altered while being cleared
  */
 static void
 clear_dinode(xfs_mount_t *mp, xfs_dinode_t *dino, xfs_ino_t ino_num)
@@ -2675,9 +2668,16 @@ _("bad (negative) size %" PRId64 " on inode %" PRIu64 "\n"),
 	 * we're going to find.  check_dups is set to 1 only during
 	 * phase 4.  Ugly.
 	 */
-	if (check_dups && !no_modify) {
-		clear_dinode_unlinked(mp, dino);
-		*dirty += 1;
+	if (check_dups && be32_to_cpu(dino->di_next_unlinked) != NULLAGINO) {
+		if (no_modify) {
+			do_warn(
+	_("Would clear next_unlinked in inode %" PRIu64 "\n"), lino);
+		} else  {
+			clear_dinode_unlinked(mp, dino);
+			do_warn(
+	_("Cleared next_unlinked in inode %" PRIu64 "\n"), lino);
+			*dirty += 1;
+		}
 	}
 
 	/* set type and map type info */

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] repair: don't dirty inodes which are not unlinked
  2018-10-30 20:03   ` Eric Sandeen
@ 2018-10-30 20:09     ` Eric Sandeen
  2018-10-30 20:34       ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Sandeen @ 2018-10-30 20:09 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

So I'm not stealing commits: ;)

===

repair: don't dirty inodes unconditionally when testing unlinked state

From: Dave Chinner <dchinner@redhat.com>

I noticed phase 4 writing back lots of inode buffers during recent
testing. The recent rework of clear_inode() in commit 0724d0f4cb53
("xfs_repair: clear_dinode should simply clear, not check contents")
accidentally caught a call to clear_inode_unlinked() as well,
resulting in all inodes being marked dirty whether they needed
updating or not.

Fix it by making clear_inode_unlinked unconditionally do the clear
(as was done for clear_inode), and move the test to the caller.
Add warnings as well so that this corruption is no longer silently
fixed.

Reported-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[sandeen: rework so clear_inode_unlinked is unconditional]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
diff --git a/repair/dinode.c b/repair/dinode.c
index 379f85c..f466e77 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -125,23 +125,16 @@ clear_dinode_core(struct xfs_mount *mp, xfs_dinode_t *dinoc, xfs_ino_t ino_num)
 	return;
 }
 
-static int
+static void
 clear_dinode_unlinked(xfs_mount_t *mp, xfs_dinode_t *dino)
 {
 
-	if (be32_to_cpu(dino->di_next_unlinked) != NULLAGINO)  {
-		if (!no_modify)
-			dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
-		return(1);
-	}
-
-	return(0);
+	dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
 }
 
 /*
  * this clears the unlinked list too so it should not be called
  * until after the agi unlinked lists are walked in phase 3.
- * returns > zero if the inode has been altered while being cleared
  */
 static void
 clear_dinode(xfs_mount_t *mp, xfs_dinode_t *dino, xfs_ino_t ino_num)
@@ -2675,9 +2668,16 @@ _("bad (negative) size %" PRId64 " on inode %" PRIu64 "\n"),
 	 * we're going to find.  check_dups is set to 1 only during
 	 * phase 4.  Ugly.
 	 */
-	if (check_dups && !no_modify) {
-		clear_dinode_unlinked(mp, dino);
-		*dirty += 1;
+	if (check_dups && be32_to_cpu(dino->di_next_unlinked) != NULLAGINO) {
+		if (no_modify) {
+			do_warn(
+	_("Would clear next_unlinked in inode %" PRIu64 "\n"), lino);
+		} else  {
+			clear_dinode_unlinked(mp, dino);
+			do_warn(
+	_("Cleared next_unlinked in inode %" PRIu64 "\n"), lino);
+			*dirty += 1;
+		}
 	}
 
 	/* set type and map type info */

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption"
  2018-10-30 19:35     ` Eric Sandeen
@ 2018-10-30 20:11       ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 20:11 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Darrick J. Wong, linux-xfs

On Tue, Oct 30, 2018 at 02:35:29PM -0500, Eric Sandeen wrote:
> 
> 
> On 10/30/18 12:20 PM, Darrick J. Wong wrote:
> > On Tue, Oct 30, 2018 at 10:20:37PM +1100, Dave Chinner wrote:
> >> This reverts commit 67a79e2cc9320aaf269cd00e9c8d16892931886d.
> >>
> >> A root LEAFN block can exist in a directory. When we convert from
> >> leaf format (LEAF1 - internal free list) to node format (LEAFN -
> >> external free list) the only change to the single root leaf block is
> >> that it's magic number is changed from LEAF1 to LEAFN.
> >>
> >> We don't actually end up with DA nodes in the tree until the LEAFN
> >> node is split, and that requires a couple more dirents to be added
> >> to the directory to fill the LEAFN block up completely. Then it will
> >> split and create a DA node root block pointing to multiple LEAFN
> >> leaf blocks.
> >>
> >> Hence restore the old behaviour where we skip the DA node tree
> >> rebuild if there is a LEAFN root block found as there is no tree to
> >> rebuild.
> > 
> > The trouble with reverting the patch is that xfs_repair goes back to
> > tripping over an assertion in release_da_cursor_int if the reason why
> > we got bno == 0 is that the directory has a da btree block with
> > nbtree[0].before pointing to zero.
> > 
> > Will post a better fix + regression tests for both issues shortly.
> 
> Well let's discuss pros/cons & risks/rewards on that, I'm semi-inclined
> to just revert it for 4.19 and fix it properly in 4.20, because I don't think
> it's ever been seen in the wild...?

I'd prefer a revert if we are closing in on a release.

A fix is essentially going to have to first revert the change
anyway, so if the replacement fix is simple and tested then it can
just go straight in on top of the revert...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] repair: don't dirty inodes which are not unlinked
  2018-10-30 20:09     ` Eric Sandeen
@ 2018-10-30 20:34       ` Dave Chinner
  2018-10-30 20:40         ` Eric Sandeen
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 20:34 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 03:09:56PM -0500, Eric Sandeen wrote:
> So I'm not stealing commits: ;)
> 
> ===
> 
> repair: don't dirty inodes unconditionally when testing unlinked state
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> I noticed phase 4 writing back lots of inode buffers during recent
> testing. The recent rework of clear_inode() in commit 0724d0f4cb53
> ("xfs_repair: clear_dinode should simply clear, not check contents")
> accidentally caught a call to clear_inode_unlinked() as well,
> resulting in all inodes being marked dirty whether they needed
> updating or not.
> 
> Fix it by making clear_inode_unlinked unconditionally do the clear
> (as was done for clear_inode), and move the test to the caller.
> Add warnings as well so that this corruption is no longer silently
> fixed.
> 
> Reported-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
> [sandeen: rework so clear_inode_unlinked is unconditional]
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> diff --git a/repair/dinode.c b/repair/dinode.c
> index 379f85c..f466e77 100644
> --- a/repair/dinode.c
> +++ b/repair/dinode.c
> @@ -125,23 +125,16 @@ clear_dinode_core(struct xfs_mount *mp, xfs_dinode_t *dinoc, xfs_ino_t ino_num)
>  	return;
>  }
>  
> -static int
> +static void
>  clear_dinode_unlinked(xfs_mount_t *mp, xfs_dinode_t *dino)
>  {
>  
> -	if (be32_to_cpu(dino->di_next_unlinked) != NULLAGINO)  {
> -		if (!no_modify)
> -			dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
> -		return(1);
> -	}
> -
> -	return(0);
> +	dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
>  }

What's the point of keeping this wrapper now?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/7] cache: prevent expansion races
  2018-10-30 17:39   ` Darrick J. Wong
@ 2018-10-30 20:35     ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 20:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:39:01AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 30, 2018 at 10:20:39PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When a sudden buffer cache growth demand occurs from multiple
> > concurrent sources, they can all fail shaking the cache at the same
> > time and expand the cache. This results in the cache doubling in
> > size multiple times when only a single expansion was really
> > necessary. The resultant massive cache can cause repair to OOM as
> > the cache will grow to much larger than memory can actually hold.
> > 
> > Prevent this sudden and unnecessary expansion by rate limiting cache
> > growth to one size increase every ten seconds. This is sufficient
> > to prevent racing expansion calls from doing the wrong thing, but
> > not too long to prevent necessary cache growth when it is undersized.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  include/cache.h |  1 +
> >  libxfs/cache.c  | 17 ++++++++++++++---
> >  2 files changed, 15 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/cache.h b/include/cache.h
> > index 552b92489e46..1b774619f7a0 100644
> > --- a/include/cache.h
> > +++ b/include/cache.h
> > @@ -114,6 +114,7 @@ struct cache {
> >  	unsigned long long	c_misses;	/* cache misses */
> >  	unsigned long long	c_hits;		/* cache hits */
> >  	unsigned int 		c_max;		/* max nodes ever used */
> > +	time_t			expand_time;	/* time of last expansion */
> >  };
> >  
> >  struct cache *cache_init(int, unsigned int, struct cache_operations *);
> > diff --git a/libxfs/cache.c b/libxfs/cache.c
> > index 139c7c1b9e71..e10df395e60e 100644
> > --- a/libxfs/cache.c
> > +++ b/libxfs/cache.c
> > @@ -62,6 +62,7 @@ cache_init(
> >  	cache->bulkrelse = cache_operations->bulkrelse ?
> >  		cache_operations->bulkrelse : cache_generic_bulkrelse;
> >  	pthread_mutex_init(&cache->c_mutex, NULL);
> > +	time(&cache->expand_time);
> >  
> >  	for (i = 0; i < hashsize; i++) {
> >  		list_head_init(&cache->c_hash[i].ch_list);
> > @@ -77,15 +78,25 @@ cache_init(
> >  	return cache;
> >  }
> >  
> > +/*
> > + * rate limit expansion so several concurrent shakes don't instantly all
> > + * expand the cache multiple times and blow repair to OOM death.
> > + */
> >  static void
> >  cache_expand(
> > -	struct cache *		cache)
> > +	struct cache	*cache)
> >  {
> > +	time_t		now;
> > +
> >  	pthread_mutex_lock(&cache->c_mutex);
> > +	time(&now);
> > +	if (now - cache->expand_time > 10) {
> 
> At first I wondered to myself about what happens if we fill the cache
> fast enough that we run out of space in less than ten seconds, but
> (afaict) the cache_expand caller will keep trying shake/expand until it
> gets something, right?

Yes. Expansion occurs when the shaker fails to make progress because
the cache is full and nothing can be reclaimed after many attempts.
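
Roughly, the caller side looks like this (paraphrased from memory, not
an exact quote of cache_node_get()):

	do {
		node = cache_node_allocate(cache, key);
		if (node)
			break;
		/* cache full - try to reclaim something at this priority */
		priority = cache_shake(cache, priority, false);
		if (priority > CACHE_MAX_PRIORITY) {
			/* shaken at every priority and got nothing - grow */
			priority = 0;
			cache_expand(cache);
		}
	} while (1);

so a rate limited cache_expand() just sends a racing caller back to
shaking for a while instead of doubling the cache again.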

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] repair: don't dirty inodes which are not unlinked
  2018-10-30 20:34       ` Dave Chinner
@ 2018-10-30 20:40         ` Eric Sandeen
  2018-10-30 20:58           ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Sandeen @ 2018-10-30 20:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs



On 10/30/18 3:34 PM, Dave Chinner wrote:
> On Tue, Oct 30, 2018 at 03:09:56PM -0500, Eric Sandeen wrote:
>> So I'm not stealing commits: ;)
>>
>> ===
>>
>> repair: don't dirty inodes unconditionally when testing unlinked state
>>
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> I noticed phase 4 writing back lots of inode buffers during recent
>> testing. The recent rework of clear_inode() in commit 0724d0f4cb53
>> ("xfs_repair: clear_dinode should simply clear, not check contents")
>> accidentally caught a call to clear_inode_unlinked() as well,
>> resulting in all inodes being marked dirty whether they needed
>> updating or not.
>>
>> Fix it by making clear_inode_unlinked unconditionally do the clear
>> (as was done for clear_inode), and move the test to the caller.
>> Add warnings as well so that this corruption is no longer silently
>> fixed.
>>
>> Reported-by: Dave Chinner <dchinner@redhat.com>
>> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
>> [sandeen: rework so clear_inode_unlinked is unconditional]
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> diff --git a/repair/dinode.c b/repair/dinode.c
>> index 379f85c..f466e77 100644
>> --- a/repair/dinode.c
>> +++ b/repair/dinode.c
>> @@ -125,23 +125,16 @@ clear_dinode_core(struct xfs_mount *mp, xfs_dinode_t *dinoc, xfs_ino_t ino_num)
>>  	return;
>>  }
>>  
>> -static int
>> +static void
>>  clear_dinode_unlinked(xfs_mount_t *mp, xfs_dinode_t *dino)
>>  {
>>  
>> -	if (be32_to_cpu(dino->di_next_unlinked) != NULLAGINO)  {
>> -		if (!no_modify)
>> -			dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
>> -		return(1);
>> -	}
>> -
>> -	return(0);
>> +	dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
>>  }
> 
> What's the point of keeping this wrapper now?

Not much other than making clear_dinode() slightly prettier.

Happy to nuke it if preferred.

-Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/7] workqueue: bound maximum queue depth
  2018-10-30 17:58   ` Darrick J. Wong
@ 2018-10-30 20:53     ` Dave Chinner
  2018-10-31 17:14       ` Brian Foster
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 20:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:58:39AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 30, 2018 at 10:20:40PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Existing users of workqueues have bound maximum queue depths in
> > their external algorithms (e.g. prefetch counts). For parallelising
> > work that doesn't have an external bound, allow workqueues to
> > throttle incoming requests at a maximum bound. bounded workqueues
> > also need to distribute work over all worker threads themselves as
> > there is no external bounding or worker function throttling
> > provided.
> > 
> > Existing callers are not throttled and retain direct control of
> > worker threads, only users of the new create interface will be
> > throttled and concurrency managed.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  include/workqueue.h |  4 ++++
> >  libfrog/workqueue.c | 30 +++++++++++++++++++++++++++---
> >  2 files changed, 31 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/workqueue.h b/include/workqueue.h
> > index c45dc4fbcf64..504da9403b85 100644
> > --- a/include/workqueue.h
> > +++ b/include/workqueue.h
> > @@ -30,10 +30,14 @@ struct workqueue {
> >  	unsigned int		item_count;
> >  	unsigned int		thread_count;
> >  	bool			terminate;
> > +	int			max_queued;
> > +	pthread_cond_t		queue_full;
> >  };
> >  
> >  int workqueue_create(struct workqueue *wq, void *wq_ctx,
> >  		unsigned int nr_workers);
> > +int workqueue_create_bound(struct workqueue *wq, void *wq_ctx,
> > +		unsigned int nr_workers, int max_queue);
> 
> What does negative max_queue mean?

Nothing. It can be made unsigned.

> 
> >  int workqueue_add(struct workqueue *wq, workqueue_func_t fn,
> >  		uint32_t index, void *arg);
> >  void workqueue_destroy(struct workqueue *wq);
> > diff --git a/libfrog/workqueue.c b/libfrog/workqueue.c
> > index 7311477374b4..8fe0dc7249f5 100644
> > --- a/libfrog/workqueue.c
> > +++ b/libfrog/workqueue.c
> > @@ -40,13 +40,21 @@ workqueue_thread(void *arg)
> >  		}
> >  
> >  		/*
> > -		 *  Dequeue work from the head of the list.
> > +		 *  Dequeue work from the head of the list. If the queue was
> > +		 *  full then send a wakeup if we're configured to do so.
> >  		 */
> >  		assert(wq->item_count > 0);
> > +		if (wq->max_queued && wq->item_count == wq->max_queued)
> > +			pthread_cond_signal(&wq->queue_full);
> > +
> >  		wi = wq->next_item;
> >  		wq->next_item = wi->next;
> >  		wq->item_count--;
> >  
> > +		if (wq->max_queued && wq->next_item) {
> > +			/* more work, wake up another worker */
> > +			pthread_cond_signal(&wq->wakeup);
> > +		}
> 
> It seems a little funny to me that the worker thread wakes up other
> worker threads when there is more work to do (vs. workqueue_add which
> actually added more work)...

The problem is that workqueue_add() delegates all concurrency and
queue throttling to the worker thread callback function. The work
queue doesn't function as a "queue" at all - it functions as a
method of starting long-running functions that do their own work
queuing and throttling.  Hence these externally co-ordinated worker
threads only require kicking off when the first work item is queued,
otherwise they completely manage themselves and never return to the
worker thread itself until they are done.

This is one of the reasons the prefetch code is so damn complex - it
has to do all this queue throttling and worker thread co-ordination
itself with its own infrastructure, rather than just having a
thread walking the block maps calling "queue_work" on each object it
needs to read. Instead it's got counting semaphores,
start/done/restart/maybe start/maybe stop logic to manage the queue
depth, etc.

What the above change does is enable us to use workqueues for
queuing small pieces of work that need to be processed, and allows
them to be processed concurrently without the caller having to do
anything to manage that concurrency. This way the concurrency will
grow automatically to the maximum bound of the workqueue and we
don't have to worry about doing any extra wakeups or tracking
anything in workqueue_add...
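
As a concrete (condensed, untested) sketch, the phase 6 hunk quoted
earlier in the thread boils down to:

	struct workqueue	lwq;

	/* bound the per-AG queue so it can't grow without limit */
	workqueue_create_bound(&lwq, mp, ag_stride, 1000);

	for (irec = findfirst_inode_rec(agno); irec;
	     irec = next_ino_rec(irec)) {
		if (irec->ino_isa_dir == 0)
			continue;
		/* blocks here once 1000 chunks are queued */
		queue_work(&lwq, do_dir_inode, agno, irec);
	}
	destroy_work_queue(&lwq);

The caller just queues inode chunks; the bounded workqueue does the
throttling and the worker wakeups itself.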

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/7] repair: Protect bad inode list with mutex
  2018-10-30 17:44   ` Darrick J. Wong
@ 2018-10-30 20:54     ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 20:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:44:34AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 30, 2018 at 10:20:41PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To enable phase 6 parallelisation, we need to protect the bad inode
> > list from concurrent modification and/or access. Wrap it with a
> > mutex and clean up the nasty typedefs.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  repair/dir2.c | 32 +++++++++++++++++++++-----------
> >  1 file changed, 21 insertions(+), 11 deletions(-)
> > 
> > diff --git a/repair/dir2.c b/repair/dir2.c
> > index ba5763ed3d26..a73a675b97c8 100644
> > --- a/repair/dir2.c
> > +++ b/repair/dir2.c
> > @@ -20,40 +20,50 @@
> >   * Known bad inode list.  These are seen when the leaf and node
> >   * block linkages are incorrect.
> >   */
> > -typedef struct dir2_bad {
> > +struct dir2_bad {
> >  	xfs_ino_t	ino;
> >  	struct dir2_bad	*next;
> > -} dir2_bad_t;
> > +};
> 
> Just out of curiosity, how fast does this linked list grow?  In theory
> we could hoist scrub/bitmap.c to libfrog and turn this into a bitmap,
> but that probably depends on how often we find inodes with bad link
> counts...?  And admittedly, avlnodes aren't cheap either...

Never seen it in profiles, so I don't see it as a problem we need to
solve at this point. The dir_hash code, OTOH....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/7] repair: parallelise phase 6
  2018-10-30 17:51   ` Darrick J. Wong
@ 2018-10-30 20:55     ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 20:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:51:58AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 30, 2018 at 10:20:43PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > +static void
> > +traverse_function(
> > +	struct workqueue	*wq,
> > +	xfs_agnumber_t		agno,
> > +	void			*arg)
> > +{
> > +	struct ino_tree_node	*irec;
> >  	prefetch_args_t		*pf_args = arg;
> > +	struct workqueue	lwq;
> > +	struct xfs_mount	*mp = wq->wq_ctx;
> > +
> >  
> >  	wait_for_inode_prefetch(pf_args);
> >  
> >  	if (verbose)
> >  		do_log(_("        - agno = %d\n"), agno);
> >  
> > +	/*
> > +	 * The more AGs we have in flight at once, the fewer processing threads
> > +	 * per AG. This means we don't overwhelm the machine with hundreds of
> > +	 * threads when we start acting on lots of AGs at once. We just want
> > +	 * enough that we can keep multiple CPUs busy across multiple AGs.
> > +	 */
> > +	workqueue_create_bound(&lwq, mp, ag_stride, 1000);
> 
> Unlikely to happen, but do we need to check for errors here?
> create_work_queue aborts repair if workqueue_create fails.

Yup, will fix.
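
Probably something as simple as (untested):

	error = workqueue_create_bound(&lwq, mp, ag_stride, 1000);
	if (error)
		do_error(_("cannot create bounded worker queue, error = %d\n"),
				error);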


-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] repair: don't dirty inodes which are not unlinked
  2018-10-30 20:40         ` Eric Sandeen
@ 2018-10-30 20:58           ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2018-10-30 20:58 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 03:40:20PM -0500, Eric Sandeen wrote:
> 
> 
> On 10/30/18 3:34 PM, Dave Chinner wrote:
> > On Tue, Oct 30, 2018 at 03:09:56PM -0500, Eric Sandeen wrote:
> >> So I'm not stealing commits: ;)
> >>
> >> ===
> >>
> >> repair: don't dirty inodes unconditionally when testing unlinked state
> >>
> >> From: Dave Chinner <dchinner@redhat.com>
> >>
> >> I noticed phase 4 writing back lots of inode buffers during recent
> >> testing. The recent rework of clear_inode() in commit 0724d0f4cb53
> >> ("xfs_repair: clear_dinode should simply clear, not check contents")
> >> accidentally caught a call to clear_inode_unlinked() as well,
> >> resulting in all inodes being marked dirty whether they needed
> >> updating or not.
> >>
> >> Fix it by making clear_inode_unlinked unconditionally do the clear
> >> (as was done for clear_inode), and move the test to the caller.
> >> Add warnings as well so that this corruption is no longer silently
> >> fixed.
> >>
> >> Reported-by: Dave Chinner <dchinner@redhat.com>
> >> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
> >> [sandeen: rework so clear_inode_unlinked is unconditional]
> >> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> >> diff --git a/repair/dinode.c b/repair/dinode.c
> >> index 379f85c..f466e77 100644
> >> --- a/repair/dinode.c
> >> +++ b/repair/dinode.c
> >> @@ -125,23 +125,16 @@ clear_dinode_core(struct xfs_mount *mp, xfs_dinode_t *dinoc, xfs_ino_t ino_num)
> >>  	return;
> >>  }
> >>  
> >> -static int
> >> +static void
> >>  clear_dinode_unlinked(xfs_mount_t *mp, xfs_dinode_t *dino)
> >>  {
> >>  
> >> -	if (be32_to_cpu(dino->di_next_unlinked) != NULLAGINO)  {
> >> -		if (!no_modify)
> >> -			dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
> >> -		return(1);
> >> -	}
> >> -
> >> -	return(0);
> >> +	dino->di_next_unlinked = cpu_to_be32(NULLAGINO);
> >>  }
> > 
> > What's the point of keeping this wrapper now?
> 
> Not much other than making clear_dinode() slightly prettier.
> 
> Happy to nuke it if preferred.

The code is already a mess with all the no_modify/fix branches and
warnings, so it really doesn't matter that much. Keep it as is...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/7] cache: prevent expansion races
  2018-10-30 11:20 ` [PATCH 3/7] cache: prevent expansion races Dave Chinner
  2018-10-30 17:39   ` Darrick J. Wong
@ 2018-10-31 17:13   ` Brian Foster
  2018-11-01  1:27     ` Dave Chinner
  1 sibling, 1 reply; 35+ messages in thread
From: Brian Foster @ 2018-10-31 17:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 30, 2018 at 10:20:39PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a sudden buffer cache growth demand occurs from multiple
> concurrent sources, they can all fail shaking the cache at the same
> time and expand the cache. This results in the cache doubling in
> size multiple times when only a single expansion was really
> necessary. The resultant massive cache can cause repair to OOM as
> the cache will grow to much larger than memory can actually hold.
> 
> Prevent this sudden and unnecessary expansion by rate limiting cache
> growth to one size increase every ten seconds. This is sufficient
> to prevent racing expansion calls from doing the wrong thing, but
> not too long to prevent necessary cache growth when it is undersized.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

This seems fairly harmless, but I'm not really a fan of this kind of
time-based algorithm in general. Could we do something similarly
effective that doesn't require a magic time value?

For example, suppose we sample the current cache size at the top of the
function, pass that value along to cache_expand() and have the latter
only bump ->c_maxcount if it still matches the parameter value. Then
return-assign the local var with the latest value:

	max_sample = cache_expand(cache, max_sample);

The end result is that an allocator must shake every mru under a given
cache size before it's allowed to grow the cache. Hm?
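
E.g., roughly (untested):

	static unsigned int
	cache_expand(
		struct cache	*cache,
		unsigned int	max_sample)
	{
		pthread_mutex_lock(&cache->c_mutex);
		/* only grow if nobody grew the cache since we sampled it */
		if (cache->c_maxcount == max_sample)
			cache->c_maxcount *= 2;
		max_sample = cache->c_maxcount;
		pthread_mutex_unlock(&cache->c_mutex);
		return max_sample;
	}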

Brian

>  include/cache.h |  1 +
>  libxfs/cache.c  | 17 ++++++++++++++---
>  2 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/include/cache.h b/include/cache.h
> index 552b92489e46..1b774619f7a0 100644
> --- a/include/cache.h
> +++ b/include/cache.h
> @@ -114,6 +114,7 @@ struct cache {
>  	unsigned long long	c_misses;	/* cache misses */
>  	unsigned long long	c_hits;		/* cache hits */
>  	unsigned int 		c_max;		/* max nodes ever used */
> +	time_t			expand_time;	/* time of last expansion */
>  };
>  
>  struct cache *cache_init(int, unsigned int, struct cache_operations *);
> diff --git a/libxfs/cache.c b/libxfs/cache.c
> index 139c7c1b9e71..e10df395e60e 100644
> --- a/libxfs/cache.c
> +++ b/libxfs/cache.c
> @@ -62,6 +62,7 @@ cache_init(
>  	cache->bulkrelse = cache_operations->bulkrelse ?
>  		cache_operations->bulkrelse : cache_generic_bulkrelse;
>  	pthread_mutex_init(&cache->c_mutex, NULL);
> +	time(&cache->expand_time);
>  
>  	for (i = 0; i < hashsize; i++) {
>  		list_head_init(&cache->c_hash[i].ch_list);
> @@ -77,15 +78,25 @@ cache_init(
>  	return cache;
>  }
>  
> +/*
> + * rate limit expansion so several concurrent shakes don't instantly all
> + * expand the cache multiple times and blow repair to OOM death.
> + */
>  static void
>  cache_expand(
> -	struct cache *		cache)
> +	struct cache	*cache)
>  {
> +	time_t		now;
> +
>  	pthread_mutex_lock(&cache->c_mutex);
> +	time(&now);
> +	if (now - cache->expand_time > 10) {
>  #ifdef CACHE_DEBUG
> -	fprintf(stderr, "doubling cache size to %d\n", 2 * cache->c_maxcount);
> +		fprintf(stderr, "doubling cache size to %d\n", 2 * cache->c_maxcount);
>  #endif
> -	cache->c_maxcount *= 2;
> +		cache->c_maxcount *= 2;
> +		cache->expand_time = now;
> +	}
>  	pthread_mutex_unlock(&cache->c_mutex);
>  }
>  
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/7] workqueue: bound maximum queue depth
  2018-10-30 20:53     ` Dave Chinner
@ 2018-10-31 17:14       ` Brian Foster
  0 siblings, 0 replies; 35+ messages in thread
From: Brian Foster @ 2018-10-31 17:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Wed, Oct 31, 2018 at 07:53:20AM +1100, Dave Chinner wrote:
> On Tue, Oct 30, 2018 at 10:58:39AM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 30, 2018 at 10:20:40PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Existing users of workqueues have bound maximum queue depths in
> > > their external algorithms (e.g. prefetch counts). For parallelising
> > > work that doesn't have an external bound, allow workqueues to
> > > throttle incoming requests at a maximum bound. bounded workqueues
> > > also need to distribute work over all worker threads themselves as
> > > there is no external bounding or worker function throttling
> > > provided.
> > > 
> > > Existing callers are not throttled and retain direct control of
> > > worker threads, only users of the new create interface will be
> > > throttled and concurrency managed.
> > > 

Might be helpful to add a sentence or two on what this is for.

> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  include/workqueue.h |  4 ++++
> > >  libfrog/workqueue.c | 30 +++++++++++++++++++++++++++---
> > >  2 files changed, 31 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/include/workqueue.h b/include/workqueue.h
> > > index c45dc4fbcf64..504da9403b85 100644
> > > --- a/include/workqueue.h
> > > +++ b/include/workqueue.h
> > > @@ -30,10 +30,14 @@ struct workqueue {
> > >  	unsigned int		item_count;
> > >  	unsigned int		thread_count;
> > >  	bool			terminate;
> > > +	int			max_queued;
> > > +	pthread_cond_t		queue_full;
> > >  };
> > >  
> > >  int workqueue_create(struct workqueue *wq, void *wq_ctx,
> > >  		unsigned int nr_workers);
> > > +int workqueue_create_bound(struct workqueue *wq, void *wq_ctx,
> > > +		unsigned int nr_workers, int max_queue);
> > 
> > What does negative max_queue mean?
> 
> Nothing. It can be made unsigned.
> 
> > 
> > >  int workqueue_add(struct workqueue *wq, workqueue_func_t fn,
> > >  		uint32_t index, void *arg);
> > >  void workqueue_destroy(struct workqueue *wq);
> > > diff --git a/libfrog/workqueue.c b/libfrog/workqueue.c
> > > index 7311477374b4..8fe0dc7249f5 100644
> > > --- a/libfrog/workqueue.c
> > > +++ b/libfrog/workqueue.c
> > > @@ -40,13 +40,21 @@ workqueue_thread(void *arg)
> > >  		}
> > >  
> > >  		/*
> > > -		 *  Dequeue work from the head of the list.
> > > +		 *  Dequeue work from the head of the list. If the queue was
> > > +		 *  full then send a wakeup if we're configured to do so.
> > >  		 */
> > >  		assert(wq->item_count > 0);
> > > +		if (wq->max_queued && wq->item_count == wq->max_queued)
> > > +			pthread_cond_signal(&wq->queue_full);
> > > +
> > >  		wi = wq->next_item;
> > >  		wq->next_item = wi->next;
> > >  		wq->item_count--;
> > >  
> > > +		if (wq->max_queued && wq->next_item) {
> > > +			/* more work, wake up another worker */
> > > +			pthread_cond_signal(&wq->wakeup);
> > > +		}
> > 
> > It seems a little funny to me that the worker thread wakes up other
> > worker threads when there is more work to do (vs. workqueue_add which
> > actually added more work)...
> 

FWIW, I had the same thought when looking at this patch..

> The problem is that workqueue_add() delegates all concurrency and
> queue throttling to the worker thread callback function. The work
> queue doesn't function as a "queue" at all - it functions as a
> method of starting long-running functions that do their own work
> queuing and throttling.  Hence these externally co-ordinated worker
> threads only require kicking off when the first work item is queued,
> otherwise they completely manage themselves and never return to the
> worker thread itself until they are done.
> 

Ok, so the existing workqueue client code doesn't need additional
wakeups..

> This is one of the reasons the prefetch code is so damn complex - it
> has to do all this queue throttling and worker thread co-ordination
> itself with its own infrastructure, rather than just having a
> thread walking the block maps calling "queue_work" on each object it
> needs to read. Instead it's got counting semaphores,
> start/done/restart/maybe start/maybe stop logic to manage the queue
> depth, etc.
> 

It's been a while since I've looked at that cache prefetch code, but
that makes sense..

> What the above change does is enable us to use workqueues for
> queuing small pieces of work that need to be processed, and allows
> them to be processed concurrently without the caller having to do
> anything to manage that concurrency. This way the concurrency will
> grow automatically to the maximum bound of the workqueue and we
> don't have to worry about doing any extra wakeups or tracking
> anything in workqueue_add...
> 

Also makes sense, but I'm not sure this answers the original question:
why not do this in workqueue_add()? It looks like the current
workqueue_add() does exactly what you describe here in that it only
kicks the worker thread when the initial item is added. The above
explains why that's a problem, but why can't workqueue_add() kick the
worker on every add to a bounded queue (or any queue, if that doesn't
cause problems for !bounded)?
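
i.e., something like this in workqueue_add() (untested, purely to
illustrate the question):

	if (wq->item_count == 0) {
		pthread_cond_signal(&wq->wakeup);
	} else {
		/* throttle on a full queue if configured */
		if (wq->max_queued) {
			if (wq->item_count == wq->max_queued)
				pthread_cond_wait(&wq->queue_full, &wq->lock);
			/* kick another worker for the item we're adding */
			pthread_cond_signal(&wq->wakeup);
		}
		wq->last_item->next = wi;
	}
	wq->last_item = wi;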

Also, have you considered whether pthread_cond_broadcast() may be more
appropriate than pthread_cond_signal() in the described use case with
multiple workers? The man page for the latter says it "unblocks at least
one of the threads that are blocked on the specified condition
variable," but that isn't exactly the most helpful description. :P

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/7] cache: prevent expansion races
  2018-10-31 17:13   ` Brian Foster
@ 2018-11-01  1:27     ` Dave Chinner
  2018-11-01 13:17       ` Brian Foster
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2018-11-01  1:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Oct 31, 2018 at 01:13:02PM -0400, Brian Foster wrote:
> On Tue, Oct 30, 2018 at 10:20:39PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When a sudden buffer cache growth demand occurs from multiple
> > concurrent sources, they can all fail shaking the cache at the same
> > time and expand the cache. This results in the cache doubling in
> > size multiple times when only a single expansion was really
> > necessary. The resultant massive cache can cause repair to OOM as
> > the cache will grow to much larger than memory can actually hold.
> > 
> > Prevent this sudden and unnecessary expansion by rate limiting cache
> > growth to one size increase every ten seconds. This is sufficient
> > to prevent racing expansion calls from doing the wrong thing, but
> > not too long to prevent necessary cache growth when it is undersized.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> This seems fairly harmless, but I'm not really a fan of this kind of
> time-based algorithm in general. Could we do something similarly
> effective that doesn't require a magic time value?
> 
> For example, suppose we sample the current cache size at the top of the
> function, pass that value along to cache_expand() and have the latter
> only bump ->c_maxcount if it still matches the parameter value. Then
> return-assign the local var with the latest value:
> 
> 	max_sample = cache_expand(cache, max_sample);
> 
> The end result is that an allocator must shake every mru under a given
> cache size before it's allowed to grow the cache. Hm?

I tried that and it still causes excessive cache expansion on my
really fast IO subsystem.  I'm seeing peaks of 300,000 IOPS when cache
expansion bursts occur. They are lasting for up to 20s on my machine
and it's enough to cause the cache to double in size multiple times.
Once the initial bursts have run their course, demand drops to a
steady state of 130-150kiops which the original cache size is quite
capable of sustaining.

These bursts are driven by the initial read-ahead queues being
filled and stop when the queues fill up.  If the IO is slow, then
there isn't cache pressure because processing keeps up with
readahead. If IO is fast, it fills the RA queues and then throttles
back to processing speed. It's filling the RA queues that causes the
buffer cache growth pressure, and it's really not necessary - the RA
queues are already full enough to maximise performance in phase 6,
and so growing memory footprint doesn't gain us anything.

i.e. the buffer cache on large filesystems like this is used as a
prefetch buffer. It is not a cache that stores anything for repeated
use - there's 500GB of metadata (>450 million inodes) in this
filesystem and we're only running w/ 128GB RAM, so we have no hope
of caching the entire working set in RAM. Hence we want to keep the
cache size down to just what is necessary to sustain effective
prefetch.

The problem with throttling on "scanned the whole cache" is that
this happens really quickly when you have a buffer demand of ~300k/s
When I start with a cache size of 800k buffers (-o bhash=101031)
doubling the cache size from 800k objects to 1.6m objects occurs
within a couple of seconds of phase 6 starting. A couple of seconds
later it doubles again, and by the time the initial 20s burst has
finished, it's doubled 3 or 4 times.
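
(i.e. 800k -> 1.6m -> 3.2m -> 6.4m, maybe 12.8m buffers - an 8-16x
increase in the cache ceiling from a single 20s burst.)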

After the last expansion, the buffer cache can now grow way beyond
what we can fit in memory and repair eventually gets OOM killed on a
128GB RAM machine even though it really only needs ~70GB RAM to
complete successfully.

IOWs, throttling expansion based on "cache full" demand doesn't
throttle hard enough to prevent OOM kill in these cases. The time
based throttling is based around preventing bursts from driving
unnecessary expansion. If the sudden burst turns out to be sustained
demand and hence the cache really does need to be expanded multiple
times, then the cache will grow quickly enough to meet that demand.
We just don't want bursts to trigger that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/7] cache: prevent expansion races
  2018-11-01  1:27     ` Dave Chinner
@ 2018-11-01 13:17       ` Brian Foster
  2018-11-01 21:23         ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Brian Foster @ 2018-11-01 13:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Nov 01, 2018 at 12:27:39PM +1100, Dave Chinner wrote:
> On Wed, Oct 31, 2018 at 01:13:02PM -0400, Brian Foster wrote:
> > On Tue, Oct 30, 2018 at 10:20:39PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > When a sudden buffer cache growth demand occurs from multiple
> > > concurrent sources, they can all fail shaking the cache at the same
> > > time and expand the cache. This results in the cache doubling in
> > > size multiple times when only a single expansion was really
> > > necessary. The resultant massive cache can cause repair to OOM as
> > > the cache will grow to much larger than memory can actually hold.
> > > 
> > > Prevent this sudden and unnecessary expansion by rate limiting cache
> > > growth to one size increase every ten seconds. This is sufficient
> > > to prevent racing expansion calls from doing the wrong thing, but
> > > not too long to prevent necessary cache growth when it is undersized.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > This seems fairly harmless, but I'm not really a fan of this kind of
> > time-based algorithm in general. Could we do something similarly
> > effective that doesn't require a magic time value?
> > 
> > For example, suppose we sample the current cache size at the top of the
> > function, pass that value along to cache_expand() and have the latter
> > only bump ->c_maxcount if it still matches the parameter value. Then
> > return-assign the local var with the latest value:
> > 
> > 	max_sample = cache_expand(cache, max_sample);
> > 
> > The end result is that an allocator must shake every mru under a given
> > cache size before it's allowed to grow the cache. Hm?
> 
> I tried that and it still causes excessive cache expansion on my
> really fast IO subsystem.  I'm seeing peaks of 300,000 IOPS when cache
> expansion bursts occur. They are lasting for up to 20s on my machine
> and it's enough to cause the cache to double in size multiple times.
> Once the initial bursts have run their course, demand drops to a
> steady state of 130-150kiops which the original cache size is quite
> capable of sustaining.
> 

Interesting.

> These bursts are driven by the initial read-ahead queues being
> filled and stop when the queues fill up.  If the IO is slow, then
> there isn't cache pressure because processing keeps up with
> readahead. If IO is fast, it fills the RA queues and then throttles
> back to processing speed. It's filling the RA queues that causes the
> buffer cache growth pressure, and it's really not necessary - the RA
> queues are already full enough to maximise performance in phase 6,
> and so growing memory footprint doesn't gain us anything.
> 

Makes sense..

> i.e. the buffer cache on large filesystems like this is used as a
> prefetch buffer. It is not a cache that stores anything for repeated
> use - there's 500GB of metadata (>450 million inodes) in this
> filesystem and we're only running w/ 128GB RAM, so we have no hope
> of caching the entire working set in RAM. Hence we want to keep the
> cache size down to just what is necessary to sustain effective
> prefetch.
> 
> The problem with throttling on "scanned the whole cache" is that
> this happens really quickly when you have a buffer demand of ~300k/s
> When I start with a cache size of 800k buffers (-o bhash=101031)
> doubling the cache size from 800k objects to 1.6m objects occurs
> within a couple of seconds of phase 6 starting. A couple of seconds
> later it doubles again, and by the time the initial 20s burst has
> finished, it's doubled 3 or 4 times.
> 

Ok, but I'm curious how much of this boils down to lock contention
exacerbated by the fast I/O. I can see how reduced I/O latency could
contribute to the rapid growth rate. At the same time, it looks like all
the shaker does in this pattern is flush buffers if dirty (which seems
unlikely if we're in some kind of prefetch overload?) and move them to
another list (for potential reuse of memory). 

We clearly have multiple lookup/fail/expand races going on as evidenced
by the need for this patch in the first place. The cache growth is
driven by cache misses (i.e. prefetch). The buffer lookups acquire hash
locks and the cache lock (for the node allocation attempt). If the alloc
is allowed, we read the buf and add it to the hash. If not, we shake the
mru and retry the same mru or next. The shake trylocks each node (buf)
and hash lock in the mru. If either lock fails, we skip to the next
buffer.

I'm not sure how likely this is, but what are the chances the shakes are
contending with each other such that lock contention prevents real work
from being done in the shaker? That seems like a reasonable possibility
to me given that you're clearly reproducing enough parallel shakes to
explode the cache rather quickly. Each shake is essentially iterating
through the same dataset, right? Of course, I don't have the complete
picture so perhaps there are other factors at play. I think it's at
least worth looking into if you haven't already because if something
like that is going on, it could be more likely that the time-based
implementation basically trades one kind of resource (memory) wastage
for another (cpu).

Brian

> After the last expansion, the buffer cache can now grow way beyond
> what we can fit in memory and repair eventually gets OOM killed on a
> 128GB RAM machine even though it really only needs ~70GB RAM to
> complete successfully.
> 
> IOWs, throttling expansion based on "cache full" demand doesn't
> throttle hard enough to prevent OOM kill in these cases. The time
> based throttling is based around preventing bursts from driving
> unnecessary expansion. If the sudden burst turns out to be sustained
> demand and hence the cache really does need to be expanded multiple
> times, then the cache will grow quickly enough to meet that demand.
> We just don't want bursts to trigger that.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/7] cache: prevent expansion races
  2018-11-01 13:17       ` Brian Foster
@ 2018-11-01 21:23         ` Dave Chinner
  2018-11-02 11:31           ` Brian Foster
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2018-11-01 21:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Nov 01, 2018 at 09:17:32AM -0400, Brian Foster wrote:
> On Thu, Nov 01, 2018 at 12:27:39PM +1100, Dave Chinner wrote:
> > On Wed, Oct 31, 2018 at 01:13:02PM -0400, Brian Foster wrote:
> > > On Tue, Oct 30, 2018 at 10:20:39PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > When a sudden buffer cache growth demand occurs from multiple
> > > > concurrent sources, they can all fail shaking the cache at the same
> > > > time and expand the cache. This results in the cache doubling in
> > > > size multiple times when only a single expansion was really
> > > > necessary. The resultant massive cache can cause repair to OOM as
> > > > the cache will grow to much larger than memory can actually hold.
> > > > 
> > > > Prevent this sudden and unnecessary expansion by rate limiting cache
> > > > growth to one size increase every ten seconds. This is sufficient
> > > > to prevent racing expansion calls from doing the wrong thing, but
> > > > not too long to prevent necessary cache growth when it is undersized.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > 
> > > This seems fairly harmless, but I'm not really a fan of this kind of
> > > time-based algorithm in general. Could we do something similarly
> > > effective that doesn't require a magic time value?
> > > 
> > > For example, suppose we sample the current cache size at the top of the
> > > function, pass that value along to cache_expand() and have the latter
> > > only bump ->c_maxcount if it still matches the parameter value. Then
> > > return-assign the local var with the latest value:
> > > 
> > > 	max_sample = cache_expand(cache, max_sample);
> > > 
> > > The end result is that an allocator must shake every mru under a given
> > > cache size before it's allowed to grow the cache. Hm?
> > 
> > I tried that and it still causes excessive cache expansion on my
> > really fast IO subsystem.  I'm seeing peaks of 300,000 IOPS when cache
> > expansion bursts occur. They are lasting for up to 20s on my machine
> > and it's enough to cause the cache to double in size multiple times.
> > Once the initial bursts have run their course, demand drops to a
> > steady state of 130-150kiops which the original cache size is quite
> > capable of sustaining.
> > 
> 
> Interesting.
> 
> > These bursts are driven by the initial read-ahead queues being
> > filled and stop when the queues fill up.  If the IO is slow, then
> > there isn't cache pressure because processing keeps up with
> > readahead. If IO is fast, it fills the RA queues and then throttles
> > back to processing speed. It's filling the RA queues that causes the
> > buffer cache growth pressure, and it's really not necessary - the RA
> > queues are already full enough to maximise performance in phase 6,
> > and so growing memory footprint doesn't gain us anything.
> > 
> 
> Makes sense..
> 
> > i.e. the buffer cache on large filesystems like this is used as a
> > prefetch buffer. It is not a cache that stores anything for repeated
> > use - there's 500GB of metadata (>450 million inodes) in this
> > filesystem and we're only running w/ 128GB RAM, so we have no hope
> > of caching the entire working set in RAM. Hence we want to keep the
> > cache size down to just what is necessary to sustain effective
> > prefetch.
> > 
> > The problem with throttling on "scanned the whole cache" is that
> > this happens really quickly when you have a buffer demand of ~300k/s
> > When I start with a cache size of 800k buffers (-o bhash=101031)
> > doubling the cache size from 800k objects to 1.6m objects occurs
> > within a couple of seconds of phase 6 starting. A couple of seconds
> > later it doubles again, and by the time the initial 20s burst has
> > finished, it's doubled 3 or 4 times.
> > 
> 
> Ok, but I'm curious how much of this boils down to lock contention
> exacerbated by the fast I/O.

There is a fair bit of lock contention during the expansion period
at the start of phase 6, but that appears to be mostly due to memory
allocation causing serialisation on the process mmap_sem in page
faults.  The time spent contending on userspace locks and shaking
caches appears to be small compared to, say, the CPU time we spend
CRCing the metadata being pulled in from disk. The lock contention
slowly reduces as the heap footprint grows large enough not to
require frequent memory allocation....

> I can see how reduced I/O latency could
> contribute to the rapid growth rate. At the same time, it looks like all
> the shaker does in this pattern is flush buffers if dirty (which seems
> unlikely if we're in some kind of prefetch overload?) and move them to
> another list (for potential reuse of memory). 
> 
> We clearly have multiple lookup/fail/expand races going on as evidenced
> by the need for this patch in the first place. The cache growth is
> driven by cache misses (i.e. prefetch). The buffer lookups acquire hash
> locks and the cache lock (for the node allocation attempt). If the alloc
> is allowed, we read the buf and add it to the hash. If not, we shake the
> mru and retry the same mru or next. The shake trylocks each node (buf)
> and hash lock in the mru. If either lock fails, we skip to the next
> buffer.
> 
> I'm not sure how likely this is, but what are the chances the shakes are
> contending with each other such that lock contention prevents real work
> from being done in the shaker?

Yes, it's definitely a possibility - the cache infrastructure needs
an overhaul to scale beyond about 8 AGs concurrently prefetching.
I've had a long term plan to replace the global hash based cache
with something closer to the kernel per-ag cache code to get rid
of a lot of the cross-ag cache contention (because most of the
processing is per-ag in repair).

> That seems like a reasonable possibility
> to me given that you're clearly reproducing enough parallel shakes to
> explode the cache rather quickly. Each shake is essentially iterating
> through the same dataset, right? Of course, I don't have the complete
> picture so perhaps there are other factors at play. I think it's at
> least worth looking into if you haven't already because if something
> like that is going on, it could be more likely that the time-based
> implementation basically trades one kind of resource (memory) wastage
> for another (cpu).

Given that we are talking about filesystems that take hours/days to
run repair over, 10s here or there is just noise. It's a simple,
easy fix that doesn't require a massive investment of time or
resources. It's a trade off - I've got way more things
to do than I have time for, so if a simple "throttle to one change
per 10s" avoids a gnarly OOM-killer death hours later for a user
with an immediate "can't repair a broken filesystem because...",
then that's a no-brainer...
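
To put a size on "simple": the throttle amounts to something along
these lines (a sketch of the shape only, not the actual patch - a
real version would keep the timestamp in the cache structure rather
than in a static):

	static void
	cache_expand(struct cache *cache)
	{
		static time_t	last_expand;	/* sketch only */

		pthread_mutex_lock(&cache->c_mutex);
		/* one doubling per 10s, no matter how many racers miss */
		if (time(NULL) - last_expand >= 10) {
			cache->c_maxcount *= 2;
			last_expand = time(NULL);
		}
		pthread_mutex_unlock(&cache->c_mutex);
	}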

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/7] cache: prevent expansion races
  2018-11-01 21:23         ` Dave Chinner
@ 2018-11-02 11:31           ` Brian Foster
  2018-11-02 23:26             ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Brian Foster @ 2018-11-02 11:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Nov 02, 2018 at 08:23:44AM +1100, Dave Chinner wrote:
> On Thu, Nov 01, 2018 at 09:17:32AM -0400, Brian Foster wrote:
> > On Thu, Nov 01, 2018 at 12:27:39PM +1100, Dave Chinner wrote:
> > > On Wed, Oct 31, 2018 at 01:13:02PM -0400, Brian Foster wrote:
> > > > On Tue, Oct 30, 2018 at 10:20:39PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > When a sudden buffer cache growth demand occurs from multiple
> > > > > concurrent sources, they can all fail shaking the cache at the same
> > > > > time and expand the cache. This results in the cache doubling in
> > > > > size multiple times when only a single expansion was really
> > > > > necessary. The resultant massive cache can cause repair to OOM as
> > > > > the cache will grow to much larger than memory can actually hold.
> > > > > 
> > > > > Prevent this sudden and unnecessary expansion by rate limiting cache
> > > > > growth to one size increase every ten seconds. This is sufficient
> > > > > to prevent racing expansion calls from doing the wrong thing, but
> > > > > not so long that it prevents necessary cache growth when the cache
> > > > > is undersized.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > 
> > > > This seems fairly harmless, but I'm not really a fan of this kind of
> > > > time-based algorithm in general. Could we do something similarly
> > > > effective that doesn't require a magic time value?
> > > > 
> > > > For example, suppose we sample the current cache size at the top of the
> > > > function, pass that value along to cache_expand() and have the latter
> > > > only bump ->c_maxcount if it still matches the parameter value. Then
> > > > return-assign the local var with the latest value:
> > > > 
> > > > 	max_sample = cache_expand(cache, max_sample);
> > > > 
> > > > The end result is that an allocator must shake every mru under a given
> > > > cache size before it's allowed to grow the cache. Hm?
> > > 
> > > I tried that and it still causes excessive cache expansion on my
> > > really fast IO subsystem.  I'm seeing peaks of 300,000 IOPS when cache
> > > expansion bursts occur. They are lasting for up to 20s on my machine
> > > and it's enough to cause the cache to double in size multiple times.
> > > Once the initial bursts have run their course, demand drops to a
> > > steady state of 130-150k IOPS, which the original cache size is quite
> > > capable of sustaining.
> > > 
> > 
> > Interesting.
> > 
> > > These bursts are driven by the initial read-ahead queues being
> > > filled and stop when the queues fill up.  If the IO is slow, then
> > > there isn't cache pressure because processing keeps up with
> > > readahead. If IO is fast, it fills the RA queues and then throttles
> > > back to processing speed. It's filling the RA queues that causes the
> > > buffer cache growth pressure, and it's really not necessary - the RA
> > > queues are already full enough to maximise performance in phase 6,
> > > and so growing memory footprint doesn't gain us anything.
> > > 
> > 
> > Makes sense..
> > 
> > > i.e. the buffer cache on large filesystems like this is used as a
> > > prefetch buffer. It is not a cache that stores anything for repeated
> > > use - there's 500GB of metadata (>450 million inodes) in this
> > > filesystem and we're only running w/ 128GB RAM, so we have no hope
> > > of caching the entire working set in RAM. Hence we want to keep the
> > > cache size down to just what is necessary to sustain effective
> > > prefetch.
> > > 
> > > The problem with throttling on "scanned the whole cache" is that
> > > this happens really quickly when you have a buffer demand of ~300k/s.
> > > When I start with a cache size of 800k buffers (-o bhash=101031)
> > > doubling the cache size from 800k objects to 1.6m objects occurs
> > > within a couple of seconds of phase 6 starting. A couple of seconds
> > > later it doubles again, and by the time the initial 20s burst has
> > > finished, it's doubled 3 or 4 times.
> > > 
> > 
> > Ok, but I'm curious how much of this boils down to lock contention
> > exacerbated by the fast I/O.
> 
> There is a fair bit of lock contention during the expansion period
> at the start of phase 6, but that appears to be mostly due to memory
> allocation causing serialisation on the process mmap_sem in page
> faults.  The time spent contending on userspace locks and shaking
> caches appears to be small compared to, say, the CPU time we spend
> CRCing the metadata being pulled in from disk. The lock contention
> slowly reduces as the heap footprint grows large enough not to
> require frequent memory allocation....
> 
> > I can see how reduced I/O latency could
> > contribute to the rapid growth rate. At the same time, it looks like all
> > the shaker does in this pattern is flush buffers if dirty (which seems
> > unlikely if we're in some kind of prefetch overload?) and move them to
> > another list (for potential reuse of memory). 
> > 
> > We clearly have multiple lookup/fail/expand races going on as evidenced
> > by the need for this patch in the first place. The cache growth is
> > driven by cache misses (i.e. prefetch). The buffer lookups acquire hash
> > locks and the cache lock (for the node allocation attempt). If the alloc
> > is allowed, we read the buf and add it to the hash. If not, we shake the
> > mru and retry the same mru or next. The shake trylocks each node (buf)
> > and hash lock in the mru. If either lock fails, we skip to the next
> > buffer.
> > 
> > I'm not sure how likely this is, but what are the chances the shakes are
> > contending with each other such that lock contention prevents real work
> > from being done in the shaker?
> 
> Yes, it's definitely a possibility - the cache infrastructure needs
> an overhaul to scale beyond about 8 AGs concurrently prefetching.
> I've had a long term plan to replace the global hash based cache
> with something closer to the kernel per-ag cache code to get rid
> of a lot of the cross-ag cache contention (because most of the
> processing is per-ag in repair).
> 

Ok.

> > That seems like a reasonable possibility
> > to me given that you're clearly reproducing enough parallel shakes to
> > explode the cache rather quickly. Each shake is essentially iterating
> > through the same dataset, right? Of course, I don't have the complete
> > picture so perhaps there are other factors at play. I think it's at
> > least worth looking into if you haven't already because if something
> > like that is going on, it could be more likely that the time-based
> > implementation basically trades one kind of resource (memory) wastage
> > for another (cpu).
> 
> Given that we are talking about filesystems that take hours/days to
> run repair over, 10s here or there is just noise. It's a simple,
> easy fix that doesn't require a massive investment of time or
> resources. It's a trade off - I've got way more things
> to do than I have time for, so if a simple "throttle to one change
> per 10s" avoids a gnarly OOM-killer death hours later for a user
> with an immediate "can't repair a broken filesystem because...",
> then that's a no-brainer...
> 

Fair enough, but I'm still curious if doing something like changing the
hash trylock in the shaker to a blocking lock would improve shaker
effectiveness enough to avoid the need for the time-based hackery. It's
possible it has no effect or just replaces one problem with a set of
others, but it's an even more trivial change than this patch is.
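
i.e. roughly this kind of change in the shaker's buffer walk (written
from the description above rather than against the source, so the
lock name is approximate):

	-	if (pthread_mutex_trylock(&hash->ch_mutex) != 0)
	-		continue;	/* hash chain busy, skip this buf */
	+	/* block until we can do real work on this buf */
	+	pthread_mutex_lock(&hash->ch_mutex);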

Another approach may be to lift the cache shaker from a per-lookup-miss
execution context to something more central (like its own thread(s))
such that lookups can block on (bounded) shakes without introducing too
much concurrency therein. That is certainly more involved than bolting a
throttle onto cache expansion and may not be worth the effort if the
long term plan is to replace the whole cache mechanism.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/7] cache: prevent expansion races
  2018-11-02 11:31           ` Brian Foster
@ 2018-11-02 23:26             ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2018-11-02 23:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Nov 02, 2018 at 07:31:38AM -0400, Brian Foster wrote:
> Fair enough, but I'm still curious if doing something like changing the
> hash trylock in the shaker to a blocking lock would improve shaker
> effectiveness enough to avoid the need for the time-based hackery. It's
> possible it has no effect or just replaces one problem with a set of
> others, but it's an even more trivial change than this patch is.
> 
> Another approach may be to lift the cache shaker from a per-lookup-miss
> execution context to something more central (like its own thread(s))
> such that lookups can block on (bounded) shakes without introducing too
> much concurrency therein. That is certainly more involved than bolting a
> throttle onto cache expansion and may not be worth the effort if the
> long term plan is to replace the whole cache mechanism.

I'm more inclined to kill the entire libxfs buffer cache
implementation and MRUs and port the kernel code across with it's
reference based LRU and shrinker. And with that, use AIO so that we
don't need huge numbers of prefetch threads to keep IO in flight.

Getting rid of the repair prefetcher threads removes the major
concurrency component that is placed on the cache. Using the kernel
infrastructure also moves from a global cache to a per-ag cache
which matches how xfs_repair operates and hence further reduces lock
contention. i.e. it allows threads working within an AG to work
without interference from other AGs.

Basically, we are hitting the scalability limits of the libxfs
architecture right now, and we have an architecture in the kernel
that scales way beyond what we have in userspace. And it fits right
in with the way userspace algorithms operate.

Add to that the need for AIO and delwri lists to scale mkfs to
really large filesystems, and it really comes down to "we need to
port the kernel code and move to AIO" rather than tinkering around
the edges with an architecture that we can't easily scale further
than it currently does...
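
The AIO side of that is conceptually simple - one io context can keep
hundreds of reads in flight from a single submitting thread, which is
what the prefetch threads currently exist to simulate. Roughly this
shape (libaio, illustrative only, not repair code):

	#include <sys/types.h>
	#include <libaio.h>

	/* Submit a read and return immediately; reap it later. */
	static void
	aio_read_sketch(int fd, void *buf, size_t len, off_t off)
	{
		io_context_t	ctx = 0;
		struct iocb	cb, *cbp = &cb;
		struct io_event	ev;

		io_setup(256, &ctx);		/* up to 256 IOs in flight */
		io_prep_pread(&cb, fd, buf, len, off);
		io_submit(ctx, 1, &cbp);	/* queue it, don't wait */
		/* ... submit more, do other work ... */
		io_getevents(ctx, 1, 1, &ev, NULL);	/* reap completion */
		io_destroy(ctx);
	}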

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/7] xfs_repair: scale to 150,000 iops
  2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
                   ` (6 preceding siblings ...)
  2018-10-30 11:20 ` [PATCH 7/7] repair: parallelise phase 6 Dave Chinner
@ 2018-11-07  5:44 ` Arkadiusz Miśkiewicz
  2018-11-07  6:48   ` Dave Chinner
  7 siblings, 1 reply; 35+ messages in thread
From: Arkadiusz Miśkiewicz @ 2018-11-07  5:44 UTC (permalink / raw)
  To: linux-xfs

On 30/10/2018 12:20, Dave Chinner wrote:
> Hi folks,
> 
> This patchset enables me to successfully repair a rather large
> metadump image (~500GB of metadata) that was provided to us because
> it crashed xfs_repair. Darrick and Eric have already posted patches
> to fix the crash bugs, and this series is built on top of them.

I was finally able to repair my big fs using for-next + these patches.

But it wasn't as easy as just running repair.

With the default bhash, repair got OOM-killed about 1/3 of the way
through phase 6 (128GB of RAM + 50GB of SSD swap). bhash=256000 worked.
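
(That's just the bhash option on the repair command line, i.e.
something like "xfs_repair -o bhash=256000 /dev/XXX", with the real
device path in place of /dev/XXX.)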

Sometimes a segfault happens, but unfortunately I don't have a stack
trace, and trying to reproduce it on my other test machine gave me
no luck.

One time I got:
xfs_repair: workqueue.c:142: workqueue_add: Assertion `wq->item_count ==
0' failed.

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/7] xfs_repair: scale to 150,000 iops
  2018-11-07  5:44 ` [PATCH 0/7] xfs_repair: scale to 150,000 iops Arkadiusz Miśkiewicz
@ 2018-11-07  6:48   ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2018-11-07  6:48 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz; +Cc: linux-xfs

On Wed, Nov 07, 2018 at 06:44:54AM +0100, Arkadiusz Miśkiewicz wrote:
> On 30/10/2018 12:20, Dave Chinner wrote:
> > Hi folks,
> > 
> > This patchset enables me to successfully repair a rather large
> > metadump image (~500GB of metadata) that was provided to us because
> > it crashed xfs_repair. Darrick and Eric have already posted patches
> > to fix the crash bugs, and this series is built on top of them.
> 
> I was finally able to repair my big fs using for-next + these patches.
> 
> But it wasn't as easy as just running repair.
> 
> With the default bhash, repair got OOM-killed about 1/3 of the way
> through phase 6 (128GB of RAM + 50GB of SSD swap). bhash=256000 worked.

Yup, we need to work on the default bhash sizing. It comes out at
about 750,000 for 128GB of RAM on your fs. It needs to be much smaller.

> Sometimes a segfault happens, but unfortunately I don't have a stack
> trace, and trying to reproduce it on my other test machine gave me
> no luck.
> 
> One time I got:
> xfs_repair: workqueue.c:142: workqueue_add: Assertion `wq->item_count ==
> 0' failed.

Yup, I think I've fixed that - a throttling wakeup related race
condition - but I'm still trying to reproduce it to confirm I've
fixed it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2018-11-07 16:17 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
2018-10-30 11:20 ` [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption" Dave Chinner
2018-10-30 17:20   ` Darrick J. Wong
2018-10-30 19:35     ` Eric Sandeen
2018-10-30 20:11       ` Dave Chinner
2018-10-30 11:20 ` [PATCH 2/7] repair: don't dirty inodes which are not unlinked Dave Chinner
2018-10-30 17:26   ` Darrick J. Wong
2018-10-30 20:03   ` Eric Sandeen
2018-10-30 20:09     ` Eric Sandeen
2018-10-30 20:34       ` Dave Chinner
2018-10-30 20:40         ` Eric Sandeen
2018-10-30 20:58           ` Dave Chinner
2018-10-30 11:20 ` [PATCH 3/7] cache: prevent expansion races Dave Chinner
2018-10-30 17:39   ` Darrick J. Wong
2018-10-30 20:35     ` Dave Chinner
2018-10-31 17:13   ` Brian Foster
2018-11-01  1:27     ` Dave Chinner
2018-11-01 13:17       ` Brian Foster
2018-11-01 21:23         ` Dave Chinner
2018-11-02 11:31           ` Brian Foster
2018-11-02 23:26             ` Dave Chinner
2018-10-30 11:20 ` [PATCH 4/7] workqueue: bound maximum queue depth Dave Chinner
2018-10-30 17:58   ` Darrick J. Wong
2018-10-30 20:53     ` Dave Chinner
2018-10-31 17:14       ` Brian Foster
2018-10-30 11:20 ` [PATCH 5/7] repair: Protect bad inode list with mutex Dave Chinner
2018-10-30 17:44   ` Darrick J. Wong
2018-10-30 20:54     ` Dave Chinner
2018-10-30 11:20 ` [PATCH 6/7] repair: protect inode chunk tree records with a mutex Dave Chinner
2018-10-30 17:46   ` Darrick J. Wong
2018-10-30 11:20 ` [PATCH 7/7] repair: parallelise phase 6 Dave Chinner
2018-10-30 17:51   ` Darrick J. Wong
2018-10-30 20:55     ` Dave Chinner
2018-11-07  5:44 ` [PATCH 0/7] xfs_repair: scale to 150,000 iops Arkadiusz Miśkiewicz
2018-11-07  6:48   ` Dave Chinner
