* [PATCH 0/7] repair: Phase 6 performance improvements
@ 2020-10-22  5:15 Dave Chinner
  2020-10-22  5:15 ` [PATCH 1/7] workqueue: bound maximum queue depth Dave Chinner
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  5:15 UTC (permalink / raw)
  To: linux-xfs

Hi folks

Phase 6 is single threaded, processing a single AG at a time and a
single directory inode at a time.  Phase 6 is often IO latency bound
despite the prefetching it does, resulting in low disk utilisation
and high runtimes. The solution for this is the same as phases 3 and
4 - scan multiple AGs at once for directory inodes to process. This
patch set enables phase 6 to scan multiple AGs at once, and hence
requires concurrent updates of inode records as they can now be
accessed and modified by multiple scanning threads. We also need to
protect the bad inodes list from concurrent access and then we can
enable concurrent processing of directories.

However, directory entry checking and reconstruction can also be CPU
bound - large directories overwhelm the directory name hash
structures because the algorithms have poor scalability - one is
O(n + n^2), another is O(n^2) when the number of dirents greatly
exceeds the hash table size. Hence we need to do more than just
parallelise across AGs - we need to parallelise processing within
AGs so that a single large directory doesn't completely serialise
processing within an AG.  This is done by using bound-depth
workqueues to allow inode records to be processed asynchronously as
the inode records are fetched from disk.

Further, we need to fix the bad algorithmic scalability of the
in-memory directory tracking structures. This is done through a
combination of better structures and more appropriate dynamic size
choices.

The results on a filesystem with a single 10 million entry directory
containing 400MB of directory entry data are as follows:

v5.6.0 (Baseline)

       XFS_REPAIR Summary    Thu Oct 22 12:10:52 2020

Phase           Start           End             Duration
Phase 1:        10/22 12:06:41  10/22 12:06:41
Phase 2:        10/22 12:06:41  10/22 12:06:41
Phase 3:        10/22 12:06:41  10/22 12:07:00  19 seconds
Phase 4:        10/22 12:07:00  10/22 12:07:12  12 seconds
Phase 5:        10/22 12:07:12  10/22 12:07:13  1 second
Phase 6:        10/22 12:07:13  10/22 12:10:51  3 minutes, 38 seconds
Phase 7:        10/22 12:10:51  10/22 12:10:51

Total run time: 4 minutes, 10 seconds

real	4m11.151s
user	4m20.083s
sys	0m14.744s


5.9.0-rc1 + patchset:

        XFS_REPAIR Summary    Thu Oct 22 13:19:02 2020

Phase           Start           End             Duration
Phase 1:        10/22 13:18:09  10/22 13:18:09
Phase 2:        10/22 13:18:09  10/22 13:18:09
Phase 3:        10/22 13:18:09  10/22 13:18:31  22 seconds
Phase 4:        10/22 13:18:31  10/22 13:18:45  14 seconds
Phase 5:        10/22 13:18:45  10/22 13:18:45
Phase 6:        10/22 13:18:45  10/22 13:19:00  15 seconds
Phase 7:        10/22 13:19:00  10/22 13:19:00

Total run time: 51 seconds

real	0m52.375s
user	1m3.739s
sys	0m20.346s


Performance improvements on filesystems with small directories and
really fast storage are, at best, modest. The big improvements are
seen with really large directories and/or relatively slow
devices that are IO latency bound and can benefit from having more
IO in flight at once.

Cheers,

Dave.


Dave Chinner (7):
  workqueue: bound maximum queue depth
  repair: Protect bad inode list with mutex
  repair: protect inode chunk tree records with a mutex
  repair: parallelise phase 6
  repair: don't duplicate names in phase 6
  repair: convert the dir byaddr hash to a radix tree
  repair: scale duplicate name checking in phase 6.

 libfrog/radix-tree.c |  46 +++++
 libfrog/workqueue.c  |  42 ++++-
 libfrog/workqueue.h  |   4 +
 repair/dir2.c        |  32 ++--
 repair/incore.h      |  23 +++
 repair/incore_ino.c  |  15 ++
 repair/phase6.c      | 396 +++++++++++++++++++++----------------------
 7 files changed, 338 insertions(+), 220 deletions(-)

-- 
2.28.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/7] workqueue: bound maximum queue depth
  2020-10-22  5:15 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
@ 2020-10-22  5:15 ` Dave Chinner
  2020-10-22  5:54   ` Darrick J. Wong
  2020-10-25  4:41   ` Darrick J. Wong
  2020-10-22  5:15 ` [PATCH 2/7] repair: Protect bad inode list with mutex Dave Chinner
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  5:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Existing users of workqueues have bound maximum queue depths in
their external algorithms (e.g. prefetch counts). For parallelising
work that doesn't have an external bound, allow workqueues to
throttle incoming requests at a maximum bound. bounded workqueues
also need to distribute work over all worker threads themselves as
there is no external bounding or worker function throttling
provided.

Existing callers are not throttled and retain direct control of
worker threads, only users of the new create interface will be
throttled and concurrency managed.
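
As an illustration of how the new interface is intended to be used
(this sketch is not part of the patch - do_work(), run_bounded(), the
items array and the worker/queue depth numbers are all invented for
the example):

#include "libfrog/workqueue.h"

static void
do_work(
	struct workqueue	*wq,
	uint32_t		index,
	void			*arg)
{
	/* process one queued item; arg is whatever the producer queued */
}

static int
run_bounded(
	void			**items,
	uint32_t		nr_items)
{
	struct workqueue	wq;
	uint32_t		i;
	int			error;

	/* 8 workers, at most 128 queued items before producers block */
	error = workqueue_create_bound(&wq, NULL, 8, 128);
	if (error)
		return error;

	for (i = 0; i < nr_items; i++) {
		/*
		 * Once 128 items are queued, this blocks until a worker
		 * dequeues an item, throttling the producer.
		 */
		error = workqueue_add(&wq, do_work, i, items[i]);
		if (error)
			break;
	}

	workqueue_terminate(&wq);	/* waits for queued work to drain */
	workqueue_destroy(&wq);
	return error;
}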

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 libfrog/workqueue.c | 42 +++++++++++++++++++++++++++++++++++++++---
 libfrog/workqueue.h |  4 ++++
 2 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/libfrog/workqueue.c b/libfrog/workqueue.c
index fe3de4289379..e42b2a2f678b 100644
--- a/libfrog/workqueue.c
+++ b/libfrog/workqueue.c
@@ -40,13 +40,21 @@ workqueue_thread(void *arg)
 		}
 
 		/*
-		 *  Dequeue work from the head of the list.
+		 *  Dequeue work from the head of the list. If the queue was
+		 *  full then send a wakeup if we're configured to do so.
 		 */
 		assert(wq->item_count > 0);
+		if (wq->max_queued && wq->item_count == wq->max_queued)
+			pthread_cond_signal(&wq->queue_full);
+
 		wi = wq->next_item;
 		wq->next_item = wi->next;
 		wq->item_count--;
 
+		if (wq->max_queued && wq->next_item) {
+			/* more work, wake up another worker */
+			pthread_cond_signal(&wq->wakeup);
+		}
 		pthread_mutex_unlock(&wq->lock);
 
 		(wi->function)(wi->queue, wi->index, wi->arg);
@@ -58,10 +66,11 @@ workqueue_thread(void *arg)
 
 /* Allocate a work queue and threads.  Returns zero or negative error code. */
 int
-workqueue_create(
+workqueue_create_bound(
 	struct workqueue	*wq,
 	void			*wq_ctx,
-	unsigned int		nr_workers)
+	unsigned int		nr_workers,
+	unsigned int		max_queue)
 {
 	unsigned int		i;
 	int			err = 0;
@@ -70,12 +79,16 @@ workqueue_create(
 	err = -pthread_cond_init(&wq->wakeup, NULL);
 	if (err)
 		return err;
+	err = -pthread_cond_init(&wq->queue_full, NULL);
+	if (err)
+		goto out_wake;
 	err = -pthread_mutex_init(&wq->lock, NULL);
 	if (err)
 		goto out_cond;
 
 	wq->wq_ctx = wq_ctx;
 	wq->thread_count = nr_workers;
+	wq->max_queued = max_queue;
 	wq->threads = malloc(nr_workers * sizeof(pthread_t));
 	if (!wq->threads) {
 		err = -errno;
@@ -102,10 +115,21 @@ workqueue_create(
 out_mutex:
 	pthread_mutex_destroy(&wq->lock);
 out_cond:
+	pthread_cond_destroy(&wq->queue_full);
+out_wake:
 	pthread_cond_destroy(&wq->wakeup);
 	return err;
 }
 
+int
+workqueue_create(
+	struct workqueue	*wq,
+	void			*wq_ctx,
+	unsigned int		nr_workers)
+{
+	return workqueue_create_bound(wq, wq_ctx, nr_workers, 0);
+}
+
 /*
  * Create a work item consisting of a function and some arguments and schedule
  * the work item to be run via the thread pool.  Returns zero or a negative
@@ -140,6 +164,7 @@ workqueue_add(
 
 	/* Now queue the new work structure to the work queue. */
 	pthread_mutex_lock(&wq->lock);
+restart:
 	if (wq->next_item == NULL) {
 		assert(wq->item_count == 0);
 		ret = -pthread_cond_signal(&wq->wakeup);
@@ -150,6 +175,16 @@ workqueue_add(
 		}
 		wq->next_item = wi;
 	} else {
+		/* throttle on a full queue if configured */
+		if (wq->max_queued && wq->item_count == wq->max_queued) {
+			pthread_cond_wait(&wq->queue_full, &wq->lock);
+			/*
+			 * Queue might be empty or even still full by the time
+			 * we get the lock back, so restart the lookup so we do
+			 * the right thing with the current state of the queue.
+			 */
+			goto restart;
+		}
 		wq->last_item->next = wi;
 	}
 	wq->last_item = wi;
@@ -201,5 +236,6 @@ workqueue_destroy(
 	free(wq->threads);
 	pthread_mutex_destroy(&wq->lock);
 	pthread_cond_destroy(&wq->wakeup);
+	pthread_cond_destroy(&wq->queue_full);
 	memset(wq, 0, sizeof(*wq));
 }
diff --git a/libfrog/workqueue.h b/libfrog/workqueue.h
index a56d1cf14081..a9c108d0e66a 100644
--- a/libfrog/workqueue.h
+++ b/libfrog/workqueue.h
@@ -31,10 +31,14 @@ struct workqueue {
 	unsigned int		thread_count;
 	bool			terminate;
 	bool			terminated;
+	int			max_queued;
+	pthread_cond_t		queue_full;
 };
 
 int workqueue_create(struct workqueue *wq, void *wq_ctx,
 		unsigned int nr_workers);
+int workqueue_create_bound(struct workqueue *wq, void *wq_ctx,
+		unsigned int nr_workers, unsigned int max_queue);
 int workqueue_add(struct workqueue *wq, workqueue_func_t fn,
 		uint32_t index, void *arg);
 int workqueue_terminate(struct workqueue *wq);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/7] repair: Protect bad inode list with mutex
  2020-10-22  5:15 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
  2020-10-22  5:15 ` [PATCH 1/7] workqueue: bound maximum queue depth Dave Chinner
@ 2020-10-22  5:15 ` Dave Chinner
  2020-10-22  5:45   ` Darrick J. Wong
  2020-10-29  9:35   ` Christoph Hellwig
  2020-10-22  5:15 ` [PATCH 3/7] repair: protect inode chunk tree records with a mutex Dave Chinner
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  5:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To enable phase 6 parallelisation, we need to protect the bad inode
list from concurrent modification and/or access. Wrap it with a
mutex and clean up the nasty typedefs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/dir2.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/repair/dir2.c b/repair/dir2.c
index eabdb4f2d497..23333e59a382 100644
--- a/repair/dir2.c
+++ b/repair/dir2.c
@@ -20,40 +20,50 @@
  * Known bad inode list.  These are seen when the leaf and node
  * block linkages are incorrect.
  */
-typedef struct dir2_bad {
+struct dir2_bad {
 	xfs_ino_t	ino;
 	struct dir2_bad	*next;
-} dir2_bad_t;
+};
 
-static dir2_bad_t *dir2_bad_list;
+static struct dir2_bad	*dir2_bad_list;
+pthread_mutex_t		dir2_bad_list_lock = PTHREAD_MUTEX_INITIALIZER;
 
 static void
 dir2_add_badlist(
 	xfs_ino_t	ino)
 {
-	dir2_bad_t	*l;
+	struct dir2_bad	*l;
 
-	if ((l = malloc(sizeof(dir2_bad_t))) == NULL) {
+	l = malloc(sizeof(*l));
+	if (!l) {
 		do_error(
 _("malloc failed (%zu bytes) dir2_add_badlist:ino %" PRIu64 "\n"),
-			sizeof(dir2_bad_t), ino);
+			sizeof(*l), ino);
 		exit(1);
 	}
+	pthread_mutex_lock(&dir2_bad_list_lock);
 	l->next = dir2_bad_list;
 	dir2_bad_list = l;
 	l->ino = ino;
+	pthread_mutex_unlock(&dir2_bad_list_lock);
 }
 
 int
 dir2_is_badino(
 	xfs_ino_t	ino)
 {
-	dir2_bad_t	*l;
+	struct dir2_bad	*l;
+	int		ret = 0;
 
-	for (l = dir2_bad_list; l; l = l->next)
-		if (l->ino == ino)
-			return 1;
-	return 0;
+	pthread_mutex_lock(&dir2_bad_list_lock);
+	for (l = dir2_bad_list; l; l = l->next) {
+		if (l->ino == ino) {
+			ret = 1;
+			break;
+		}
+	}
+	pthread_mutex_unlock(&dir2_bad_list_lock);
+	return ret;
 }
 
 /*
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/7] repair: protect inode chunk tree records with a mutex
  2020-10-22  5:15 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
  2020-10-22  5:15 ` [PATCH 1/7] workqueue: bound maximum queue depth Dave Chinner
  2020-10-22  5:15 ` [PATCH 2/7] repair: Protect bad inode list with mutex Dave Chinner
@ 2020-10-22  5:15 ` Dave Chinner
  2020-10-22  6:02   ` Darrick J. Wong
  2020-10-22  5:15 ` [PATCH 4/7] repair: parallelise phase 6 Dave Chinner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  5:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Phase 6 accesses inode chunk records mostly in an isolated manner.
However, when it finds a corruption in a directory or there are
multiple hardlinks to an inode, there can be concurrent access
to the inode chunk record to update state.

Hence the inode record itself needs a mutex. This protects all state
changes within the inode chunk record, as well as inode link counts
and chunk references. That allows us to process multiple chunks at
once, providing concurrency within an AG as well as across AGs.

The inode chunk tree itself is not modified in phase 6 - it's built
in phases 3 and 4 - and so we do not need to worry about locking
for AVL tree lookups to find the inode chunk records themselves.
Hence internal locking is all we need here.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/incore.h     | 23 +++++++++++++++++++++++
 repair/incore_ino.c | 15 +++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/repair/incore.h b/repair/incore.h
index 5b29d5d1efd8..6564e0d38963 100644
--- a/repair/incore.h
+++ b/repair/incore.h
@@ -282,6 +282,7 @@ typedef struct ino_tree_node  {
 		parent_list_t	*plist;		/* phases 2-5 */
 	} ino_un;
 	uint8_t			*ftypes;	/* phases 3,6 */
+	pthread_mutex_t		lock;
 } ino_tree_node_t;
 
 #define INOS_PER_IREC	(sizeof(uint64_t) * NBBY)
@@ -412,7 +413,9 @@ next_free_ino_rec(ino_tree_node_t *ino_rec)
  */
 static inline void add_inode_refchecked(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_un.ex_data->ino_processed |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int is_inode_refchecked(struct ino_tree_node *irec, int offset)
@@ -438,12 +441,16 @@ static inline int is_inode_confirmed(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_isadir(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_isa_dir |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline void clear_inode_isadir(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_isa_dir &= ~IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int inode_isadir(struct ino_tree_node *irec, int offset)
@@ -456,15 +463,19 @@ static inline int inode_isadir(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_free(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	set_inode_confirmed(irec, offset);
 	irec->ir_free |= XFS_INOBT_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 
 }
 
 static inline void set_inode_used(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	set_inode_confirmed(irec, offset);
 	irec->ir_free &= ~XFS_INOBT_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int is_inode_free(struct ino_tree_node *irec, int offset)
@@ -477,7 +488,9 @@ static inline int is_inode_free(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_sparse(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ir_sparse |= XFS_INOBT_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline bool is_inode_sparse(struct ino_tree_node *irec, int offset)
@@ -490,12 +503,16 @@ static inline bool is_inode_sparse(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_was_rl(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_was_rl |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline void clear_inode_was_rl(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_was_rl &= ~IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int inode_was_rl(struct ino_tree_node *irec, int offset)
@@ -508,12 +525,16 @@ static inline int inode_was_rl(struct ino_tree_node *irec, int offset)
  */
 static inline void set_inode_is_rl(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_is_rl |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline void clear_inode_is_rl(struct ino_tree_node *irec, int offset)
 {
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_is_rl &= ~IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 static inline int inode_is_rl(struct ino_tree_node *irec, int offset)
@@ -546,7 +567,9 @@ static inline int is_inode_reached(struct ino_tree_node *irec, int offset)
 static inline void add_inode_reached(struct ino_tree_node *irec, int offset)
 {
 	add_inode_ref(irec, offset);
+	pthread_mutex_lock(&irec->lock);
 	irec->ino_un.ex_data->ino_reached |= IREC_MASK(offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 /*
diff --git a/repair/incore_ino.c b/repair/incore_ino.c
index 82956ae93005..299e4f949e5e 100644
--- a/repair/incore_ino.c
+++ b/repair/incore_ino.c
@@ -91,6 +91,7 @@ void add_inode_ref(struct ino_tree_node *irec, int ino_offset)
 {
 	ASSERT(irec->ino_un.ex_data != NULL);
 
+	pthread_mutex_lock(&irec->lock);
 	switch (irec->nlink_size) {
 	case sizeof(uint8_t):
 		if (irec->ino_un.ex_data->counted_nlinks.un8[ino_offset] < 0xff) {
@@ -112,6 +113,7 @@ void add_inode_ref(struct ino_tree_node *irec, int ino_offset)
 	default:
 		ASSERT(0);
 	}
+	pthread_mutex_unlock(&irec->lock);
 }
 
 void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
@@ -120,6 +122,7 @@ void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
 
 	ASSERT(irec->ino_un.ex_data != NULL);
 
+	pthread_mutex_lock(&irec->lock);
 	switch (irec->nlink_size) {
 	case sizeof(uint8_t):
 		ASSERT(irec->ino_un.ex_data->counted_nlinks.un8[ino_offset] > 0);
@@ -139,6 +142,7 @@ void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
 
 	if (refs == 0)
 		irec->ino_un.ex_data->ino_reached &= ~IREC_MASK(ino_offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 uint32_t num_inode_references(struct ino_tree_node *irec, int ino_offset)
@@ -161,6 +165,7 @@ uint32_t num_inode_references(struct ino_tree_node *irec, int ino_offset)
 void set_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset,
 		uint32_t nlinks)
 {
+	pthread_mutex_lock(&irec->lock);
 	switch (irec->nlink_size) {
 	case sizeof(uint8_t):
 		if (nlinks < 0xff) {
@@ -182,6 +187,7 @@ void set_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset,
 	default:
 		ASSERT(0);
 	}
+	pthread_mutex_unlock(&irec->lock);
 }
 
 uint32_t get_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset)
@@ -253,6 +259,7 @@ alloc_ino_node(
 	irec->nlink_size = sizeof(uint8_t);
 	irec->disk_nlinks.un8 = alloc_nlink_array(irec->nlink_size);
 	irec->ftypes = alloc_ftypes_array(mp);
+	pthread_mutex_init(&irec->lock, NULL);
 	return irec;
 }
 
@@ -294,6 +301,7 @@ free_ino_tree_node(
 	}
 
 	free(irec->ftypes);
+	pthread_mutex_destroy(&irec->lock);
 	free(irec);
 }
 
@@ -600,6 +608,7 @@ set_inode_parent(
 	uint64_t		bitmask;
 	parent_entry_t		*tmp;
 
+	pthread_mutex_lock(&irec->lock);
 	if (full_ino_ex_data)
 		ptbl = irec->ino_un.ex_data->parents;
 	else
@@ -625,6 +634,7 @@ set_inode_parent(
 #endif
 		ptbl->pentries[0] = parent;
 
+		pthread_mutex_unlock(&irec->lock);
 		return;
 	}
 
@@ -642,6 +652,7 @@ set_inode_parent(
 #endif
 		ptbl->pentries[target] = parent;
 
+		pthread_mutex_unlock(&irec->lock);
 		return;
 	}
 
@@ -682,6 +693,7 @@ set_inode_parent(
 #endif
 	ptbl->pentries[target] = parent;
 	ptbl->pmask |= (1ULL << offset);
+	pthread_mutex_unlock(&irec->lock);
 }
 
 xfs_ino_t
@@ -692,6 +704,7 @@ get_inode_parent(ino_tree_node_t *irec, int offset)
 	int		i;
 	int		target;
 
+	pthread_mutex_lock(&irec->lock);
 	if (full_ino_ex_data)
 		ptbl = irec->ino_un.ex_data->parents;
 	else
@@ -709,9 +722,11 @@ get_inode_parent(ino_tree_node_t *irec, int offset)
 #ifdef DEBUG
 		ASSERT(target < ptbl->cnt);
 #endif
+		pthread_mutex_unlock(&irec->lock);
 		return(ptbl->pentries[target]);
 	}
 
+	pthread_mutex_unlock(&irec->lock);
 	return(0LL);
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/7] repair: parallelise phase 6
  2020-10-22  5:15 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
                   ` (2 preceding siblings ...)
  2020-10-22  5:15 ` [PATCH 3/7] repair: protect inode chunk tree records with a mutex Dave Chinner
@ 2020-10-22  5:15 ` Dave Chinner
  2020-10-22  6:11   ` Darrick J. Wong
  2020-10-22  5:15 ` [PATCH 5/7] repair: don't duplicate names in " Dave Chinner
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  5:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

A recent metadump provided to us caused repair to take hours in
phase 6. It wasn't IO bound - it was fully CPU bound the entire time.
The only way to speed it up is to make phase 6 run multiple
concurrent processing threads.

The obvious way to do this is to spread the concurrency across AGs,
like the other phases, and while this works it is not optimal. When
a processing thread hits a really large directory, it essentially
sits CPU bound until that directory is processed. If an AG has lots
of large directories, we end up with a really long single threaded
tail that limits concurrency.

Hence we also need to have concurrency /within/ the AG. This is
relatively easy, as the inode chunk records allow for a simple
concurrency mechanism within an AG. We can simply feed each chunk
record to a workqueue, and we get concurrency within the AG for
free. However, this allows prefetch to run way ahead of processing
and this blows out the buffer cache size and can cause OOM.

However, we can use the new workqueue depth limiting to limit the
number of inode chunks queued, and this then backs up the inode
prefetching to its maximum queue depth. Hence we prevent the
prefetch code from queueing the entire AG's inode chunks on the
workqueue and blowing out memory, by throttling the prefetch
consumer.

This takes phase 6 from taking many, many hours down to:

Phase 6:        10/30 21:12:58  10/30 21:40:48  27 minutes, 50 seconds

And burning 20-30 CPUs that entire time on my test rig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase6.c | 43 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index 70d32089bb57..bf0719c186fb 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -6,6 +6,7 @@
 
 #include "libxfs.h"
 #include "threads.h"
+#include "threads.h"
 #include "prefetch.h"
 #include "avl.h"
 #include "globals.h"
@@ -3109,20 +3110,45 @@ check_for_orphaned_inodes(
 }
 
 static void
-traverse_function(
+do_dir_inode(
 	struct workqueue	*wq,
-	xfs_agnumber_t 		agno,
+	xfs_agnumber_t		agno,
 	void			*arg)
 {
-	ino_tree_node_t 	*irec;
+	struct ino_tree_node	*irec = arg;
 	int			i;
+
+	for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
+		if (inode_isadir(irec, i))
+			process_dir_inode(wq->wq_ctx, agno, irec, i);
+	}
+}
+
+static void
+traverse_function(
+	struct workqueue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct ino_tree_node	*irec;
 	prefetch_args_t		*pf_args = arg;
+	struct workqueue	lwq;
+	struct xfs_mount	*mp = wq->wq_ctx;
+
 
 	wait_for_inode_prefetch(pf_args);
 
 	if (verbose)
 		do_log(_("        - agno = %d\n"), agno);
 
+	/*
+	 * The more AGs we have in flight at once, the fewer processing threads
+	 * per AG. This means we don't overwhelm the machine with hundreds of
+	 * threads when we start acting on lots of AGs at once. We just want
+	 * enough that we can keep multiple CPUs busy across multiple AGs.
+	 */
+	workqueue_create_bound(&lwq, mp, ag_stride, 1000);
+
 	for (irec = findfirst_inode_rec(agno); irec; irec = next_ino_rec(irec)) {
 		if (irec->ino_isa_dir == 0)
 			continue;
@@ -3130,18 +3156,19 @@ traverse_function(
 		if (pf_args) {
 			sem_post(&pf_args->ra_count);
 #ifdef XR_PF_TRACE
+			{
+			int	i;
 			sem_getvalue(&pf_args->ra_count, &i);
 			pftrace(
 		"processing inode chunk %p in AG %d (sem count = %d)",
 				irec, agno, i);
+			}
 #endif
 		}
 
-		for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
-			if (inode_isadir(irec, i))
-				process_dir_inode(wq->wq_ctx, agno, irec, i);
-		}
+		queue_work(&lwq, do_dir_inode, agno, irec);
 	}
+	destroy_work_queue(&lwq);
 	cleanup_inode_prefetch(pf_args);
 }
 
@@ -3169,7 +3196,7 @@ static void
 traverse_ags(
 	struct xfs_mount	*mp)
 {
-	do_inode_prefetch(mp, 0, traverse_function, false, true);
+	do_inode_prefetch(mp, ag_stride, traverse_function, false, true);
 }
 
 void
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/7] repair: don't duplicate names in phase 6
  2020-10-22  5:15 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
                   ` (3 preceding siblings ...)
  2020-10-22  5:15 ` [PATCH 4/7] repair: parallelise phase 6 Dave Chinner
@ 2020-10-22  5:15 ` Dave Chinner
  2020-10-22  6:21   ` Darrick J. Wong
  2020-10-29  9:39   ` Christoph Hellwig
  2020-10-22  5:15 ` [PATCH 6/7] repair: convert the dir byaddr hash to a radix tree Dave Chinner
  2020-10-22  5:15 ` [PATCH 7/7] repair: scale duplicate name checking in phase 6 Dave Chinner
  6 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  5:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The name hash in phase 6 is constructed by using names that point
directly into the directory buffers. Hence before the buffers can be
released, the constructed name hash has to duplicate all those names
into memory it owns via dir_hash_dup_names().

Given that the structure that holds the name is dynamically
allocated, it makes no sense to store a pointer to the name in
dir_hash_add() and then later dynamically allocate a copy of the name.

Extend the name hash allocation to contain space for the name
itself, and copy the name into the name hash structure in
dir_hash_add(). This allows us to get rid of dir_hash_dup_names(),
and the directory checking code no longer needs to hold all the
directory buffers in memory until the entire directory walk is
complete and the names duplicated.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase6.c | 101 ++++++++++++++----------------------------------
 1 file changed, 29 insertions(+), 72 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index bf0719c186fb..79c87495656f 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -72,7 +72,7 @@ typedef struct dir_hash_ent {
 	struct dir_hash_ent	*nextbyorder;	/* next in order added */
 	xfs_dahash_t		hashval;	/* hash value of name */
 	uint32_t		address;	/* offset of data entry */
-	xfs_ino_t 		inum;		/* inode num of entry */
+	xfs_ino_t		inum;		/* inode num of entry */
 	short			junkit;		/* name starts with / */
 	short			seen;		/* have seen leaf entry */
 	struct xfs_name		name;
@@ -80,7 +80,6 @@ typedef struct dir_hash_ent {
 
 typedef struct dir_hash_tab {
 	int			size;		/* size of hash tables */
-	int			names_duped;	/* 1 = ent names malloced */
 	dir_hash_ent_t		*first;		/* ptr to first added entry */
 	dir_hash_ent_t		*last;		/* ptr to last added entry */
 	dir_hash_ent_t		**byhash;	/* ptr to name hash buckets */
@@ -171,8 +170,6 @@ dir_hash_add(
 	short			junk;
 	struct xfs_name		xname;
 
-	ASSERT(!hashtab->names_duped);
-
 	xname.name = name;
 	xname.len = namelen;
 	xname.type = ftype;
@@ -199,7 +196,12 @@ dir_hash_add(
 		}
 	}
 
-	if ((p = malloc(sizeof(*p))) == NULL)
+	/*
+	 * Allocate enough space for the hash entry and the name in a single
+	 * allocation so we can store our own copy of the name for later use.
+	 */
+	p = calloc(1, sizeof(*p) + namelen + 1);
+	if (!p)
 		do_error(_("malloc failed in dir_hash_add (%zu bytes)\n"),
 			sizeof(*p));
 
@@ -220,7 +222,12 @@ dir_hash_add(
 	p->address = addr;
 	p->inum = inum;
 	p->seen = 0;
-	p->name = xname;
+
+	/* Set up the name in the region trailing the hash entry. */
+	memcpy(p + 1, name, namelen);
+	p->name.name = (const unsigned char *)(p + 1);
+	p->name.len = namelen;
+	p->name.type = ftype;
 
 	return !dup;
 }
@@ -287,8 +294,6 @@ dir_hash_done(
 	for (i = 0; i < hashtab->size; i++) {
 		for (p = hashtab->byaddr[i]; p; p = n) {
 			n = p->nextbyaddr;
-			if (hashtab->names_duped)
-				free((void *)p->name.name);
 			free(p);
 		}
 	}
@@ -385,27 +390,6 @@ dir_hash_see_all(
 	return j == stale ? DIR_HASH_CK_OK : DIR_HASH_CK_BADSTALE;
 }
 
-/*
- * Convert name pointers into locally allocated memory.
- * This must only be done after all the entries have been added.
- */
-static void
-dir_hash_dup_names(dir_hash_tab_t *hashtab)
-{
-	unsigned char		*name;
-	dir_hash_ent_t		*p;
-
-	if (hashtab->names_duped)
-		return;
-
-	for (p = hashtab->first; p; p = p->nextbyorder) {
-		name = malloc(p->name.len);
-		memcpy(name, p->name.name, p->name.len);
-		p->name.name = name;
-	}
-	hashtab->names_duped = 1;
-}
-
 /*
  * Given a block number in a fork, return the next valid block number
  * (not a hole).
@@ -1387,6 +1371,7 @@ dir2_kill_block(
 		res_failed(error);
 	libxfs_trans_ijoin(tp, ip, 0);
 	libxfs_trans_bjoin(tp, bp);
+	libxfs_trans_bhold(tp, bp);
 	memset(&args, 0, sizeof(args));
 	args.dp = ip;
 	args.trans = tp;
@@ -1418,7 +1403,7 @@ longform_dir2_entry_check_data(
 	int			*need_dot,
 	ino_tree_node_t		*current_irec,
 	int			current_ino_offset,
-	struct xfs_buf		**bpp,
+	struct xfs_buf		*bp,
 	dir_hash_tab_t		*hashtab,
 	freetab_t		**freetabp,
 	xfs_dablk_t		da_bno,
@@ -1426,7 +1411,6 @@ longform_dir2_entry_check_data(
 {
 	xfs_dir2_dataptr_t	addr;
 	xfs_dir2_leaf_entry_t	*blp;
-	struct xfs_buf		*bp;
 	xfs_dir2_block_tail_t	*btp;
 	struct xfs_dir2_data_hdr *d;
 	xfs_dir2_db_t		db;
@@ -1457,7 +1441,6 @@ longform_dir2_entry_check_data(
 	};
 
 
-	bp = *bpp;
 	d = bp->b_addr;
 	ptr = (char *)d + mp->m_dir_geo->data_entry_offset;
 	nbad = 0;
@@ -1558,10 +1541,8 @@ longform_dir2_entry_check_data(
 			dir2_kill_block(mp, ip, da_bno, bp);
 		} else {
 			do_warn(_("would junk block\n"));
-			libxfs_buf_relse(bp);
 		}
 		freetab->ents[db].v = NULLDATAOFF;
-		*bpp = NULL;
 		return;
 	}
 
@@ -2219,17 +2200,15 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
 			int		ino_offset,
 			dir_hash_tab_t	*hashtab)
 {
-	struct xfs_buf		**bplist;
+	struct xfs_buf		*bp;
 	xfs_dablk_t		da_bno;
 	freetab_t		*freetab;
-	int			num_bps;
 	int			i;
 	int			isblock;
 	int			isleaf;
 	xfs_fileoff_t		next_da_bno;
 	int			seeval;
 	int			fixit = 0;
-	xfs_dir2_db_t		db;
 	struct xfs_da_args	args;
 
 	*need_dot = 1;
@@ -2246,11 +2225,6 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
 		freetab->ents[i].v = NULLDATAOFF;
 		freetab->ents[i].s = 0;
 	}
-	num_bps = freetab->naents;
-	bplist = calloc(num_bps, sizeof(struct xfs_buf*));
-	if (!bplist)
-		do_error(_("calloc failed in %s (%zu bytes)\n"),
-			__func__, num_bps * sizeof(struct xfs_buf*));
 
 	/* is this a block, leaf, or node directory? */
 	args.dp = ip;
@@ -2279,28 +2253,12 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
 			break;
 		}
 
-		db = xfs_dir2_da_to_db(mp->m_dir_geo, da_bno);
-		if (db >= num_bps) {
-			int last_size = num_bps;
-
-			/* more data blocks than expected */
-			num_bps = db + 1;
-			bplist = realloc(bplist, num_bps * sizeof(struct xfs_buf*));
-			if (!bplist)
-				do_error(_("realloc failed in %s (%zu bytes)\n"),
-					__func__,
-					num_bps * sizeof(struct xfs_buf*));
-			/* Initialize the new elements */
-			for (i = last_size; i < num_bps; i++)
-				bplist[i] = NULL;
-		}
-
 		if (isblock)
 			ops = &xfs_dir3_block_buf_ops;
 		else
 			ops = &xfs_dir3_data_buf_ops;
 
-		error = dir_read_buf(ip, da_bno, &bplist[db], ops, &fixit);
+		error = dir_read_buf(ip, da_bno, &bp, ops, &fixit);
 		if (error) {
 			do_warn(
 	_("can't read data block %u for directory inode %" PRIu64 " error %d\n"),
@@ -2320,21 +2278,25 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
 		}
 
 		/* check v5 metadata */
-		d = bplist[db]->b_addr;
+		d = bp->b_addr;
 		if (be32_to_cpu(d->magic) == XFS_DIR3_BLOCK_MAGIC ||
 		    be32_to_cpu(d->magic) == XFS_DIR3_DATA_MAGIC) {
-			struct xfs_buf		 *bp = bplist[db];
-
 			error = check_dir3_header(mp, bp, ino);
 			if (error) {
 				fixit++;
+				if (isblock)
+					goto out_fix;
 				continue;
 			}
 		}
 
 		longform_dir2_entry_check_data(mp, ip, num_illegal, need_dot,
-				irec, ino_offset, &bplist[db], hashtab,
+				irec, ino_offset, bp, hashtab,
 				&freetab, da_bno, isblock);
+		if (isblock)
+			break;
+
+		libxfs_buf_relse(bp);
 	}
 	fixit |= (*num_illegal != 0) || dir2_is_badino(ino) || *need_dot;
 
@@ -2345,7 +2307,7 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
 			xfs_dir2_block_tail_t	*btp;
 			xfs_dir2_leaf_entry_t	*blp;
 
-			block = bplist[0]->b_addr;
+			block = bp->b_addr;
 			btp = xfs_dir2_block_tail_p(mp->m_dir_geo, block);
 			blp = xfs_dir2_block_leaf_p(btp);
 			seeval = dir_hash_see_all(hashtab, blp,
@@ -2362,11 +2324,10 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
 		}
 	}
 out_fix:
+	if (isblock && bp)
+		libxfs_buf_relse(bp);
+
 	if (!no_modify && (fixit || dotdot_update)) {
-		dir_hash_dup_names(hashtab);
-		for (i = 0; i < num_bps; i++)
-			if (bplist[i])
-				libxfs_buf_relse(bplist[i]);
 		longform_dir2_rebuild(mp, ino, ip, irec, ino_offset, hashtab);
 		*num_illegal = 0;
 		*need_dot = 0;
@@ -2374,12 +2335,8 @@ out_fix:
 		if (fixit || dotdot_update)
 			do_warn(
 	_("would rebuild directory inode %" PRIu64 "\n"), ino);
-		for (i = 0; i < num_bps; i++)
-			if (bplist[i])
-				libxfs_buf_relse(bplist[i]);
 	}
 
-	free(bplist);
 	free(freetab);
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/7] repair: convert the dir byaddr hash to a radix tree
  2020-10-22  5:15 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
                   ` (4 preceding siblings ...)
  2020-10-22  5:15 ` [PATCH 5/7] repair: don't duplicate names in " Dave Chinner
@ 2020-10-22  5:15 ` Dave Chinner
  2020-10-29 16:41   ` Darrick J. Wong
  2020-10-22  5:15 ` [PATCH 7/7] repair: scale duplicate name checking in phase 6 Dave Chinner
  6 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  5:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Phase 6 uses a hash table to track the data segment addresses of the
entries it has seen in a directory. This is indexed by the offset
into the data segment for the dirent, and is used to check if the
entry exists, is a duplicate or has a bad hash value. The lookup
operations involve walking long hash chains on large directories and
they are done for every entry in the directory. This means certain
operations have O(n^2) scalability (or worse!) and hence hurt on
very large directories.

It is also used to determine if the directory has unseen entries,
which involves a full hash traversal that is very expensive on large
directories. Hence the directory checking for unseen entries ends up
being roughly an O(n^2 + n) algorithm.

Switch the byaddr indexing to a radix tree. While a radix tree will
burn more memory than the linked list, it gives us O(log n) lookup
operations instead of O(n) on large directories, and using tags
gives us O(1) determination of whether all entries have been seen or
not. This brings the "entry seen" algorithm scalability back to
O(n log n) and so is a major improvement for processing large
directories.
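
Condensed, the byaddr side of the new scheme looks like the fragment
below (illustrative only - the patch is the real code; hashtab, addr
and p mirror the names used in the patch):

#define HT_UNSEEN	1

/* add entry: index it by its data segment address, tag it as unseen */
error = radix_tree_insert(&hashtab->byaddr, addr, p);
if (!error)
	radix_tree_tag_set(&hashtab->byaddr, addr, HT_UNSEEN);

/* leaf walk sees the entry: O(log n) lookup, clear the tag */
if (radix_tree_lookup(&hashtab->byaddr, addr))
	radix_tree_tag_clear(&hashtab->byaddr, addr, HT_UNSEEN);

/* "any data entries without a leaf entry?" is a single tag check */
if (radix_tree_tagged(&hashtab->byaddr, HT_UNSEEN))
	seeval = DIR_HASH_CK_NOLEAF;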

Given a filesystem with 10M empty files in a single directory, we
see:

5.6.0:

  97.56%  xfs_repair              [.] dir_hash_add.lto_priv.0
   0.38%  xfs_repair              [.] avl_ino_start.lto_priv.0
   0.37%  libc-2.31.so            [.] malloc
   0.34%  xfs_repair              [.] longform_dir2_entry_check_data.lto_priv.0

Phase 6:        10/22 12:07:13  10/22 12:10:51  3 minutes, 38 seconds

Patched:

  97.11%  xfs_repair          [.] dir_hash_add
   0.38%  xfs_repair          [.] longform_dir2_entry_check_data
   0.34%  libc-2.31.so        [.] __libc_calloc
   0.32%  xfs_repair          [.] avl_ino_start

Phase 6:        10/22 12:11:40  10/22 12:14:28  2 minutes, 48 seconds

So there's some improvement, but we are clearly still CPU bound due
to the O(n^2) scalability of the duplicate name checking algorithm.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 libfrog/radix-tree.c |  46 +++++++++
 repair/phase6.c      | 222 ++++++++++++++++++++-----------------------
 2 files changed, 148 insertions(+), 120 deletions(-)

diff --git a/libfrog/radix-tree.c b/libfrog/radix-tree.c
index c1c74876964c..261fc2487de9 100644
--- a/libfrog/radix-tree.c
+++ b/libfrog/radix-tree.c
@@ -312,6 +312,52 @@ void *radix_tree_lookup_first(struct radix_tree_root *root, unsigned long *index
 
 #ifdef RADIX_TREE_TAGS
 
+/**
+ * radix_tree_tag_get - get a tag on a radix tree node
+ * @root:		radix tree root
+ * @index:		index key
+ * @tag:		tag index (< RADIX_TREE_MAX_TAGS)
+ *
+ * Return values:
+ *
+ *  0: tag not present or not set
+ *  1: tag set
+ *
+ * Note that the return value of this function may not be relied on, even if
+ * the RCU lock is held, unless tag modification and node deletion are excluded
+ * from concurrency.
+ */
+int radix_tree_tag_get(struct radix_tree_root *root,
+			unsigned long index, unsigned int tag)
+{
+	unsigned int height, shift;
+	struct radix_tree_node *slot;
+
+	height = root->height;
+	if (index > radix_tree_maxindex(height))
+		return 0;
+
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+	slot = root->rnode;
+
+	while (height > 0) {
+		int offset;
+
+		if (slot == NULL)
+			return 0;
+
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		if (!tag_get(slot, tag, offset))
+			return 0;
+
+		slot = slot->slots[offset];
+		ASSERT(slot != NULL);
+		shift -= RADIX_TREE_MAP_SHIFT;
+		height--;
+	}
+	return 1;
+}
+
 /**
  *	radix_tree_tag_set - set a tag on a radix tree node
  *	@root:		radix tree root
diff --git a/repair/phase6.c b/repair/phase6.c
index 79c87495656f..21f49dd748e1 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -66,8 +66,7 @@ add_dotdot_update(
  * and whether their leaf entry has been seen. Also used for name
  * duplicate checking and rebuilding step if required.
  */
-typedef struct dir_hash_ent {
-	struct dir_hash_ent	*nextbyaddr;	/* next in addr bucket */
+struct dir_hash_ent {
 	struct dir_hash_ent	*nextbyhash;	/* next in name bucket */
 	struct dir_hash_ent	*nextbyorder;	/* next in order added */
 	xfs_dahash_t		hashval;	/* hash value of name */
@@ -76,18 +75,19 @@ typedef struct dir_hash_ent {
 	short			junkit;		/* name starts with / */
 	short			seen;		/* have seen leaf entry */
 	struct xfs_name		name;
-} dir_hash_ent_t;
+};
 
-typedef struct dir_hash_tab {
+struct dir_hash_tab {
 	int			size;		/* size of hash tables */
-	dir_hash_ent_t		*first;		/* ptr to first added entry */
-	dir_hash_ent_t		*last;		/* ptr to last added entry */
-	dir_hash_ent_t		**byhash;	/* ptr to name hash buckets */
-	dir_hash_ent_t		**byaddr;	/* ptr to addr hash buckets */
-} dir_hash_tab_t;
+	struct dir_hash_ent	*first;		/* ptr to first added entry */
+	struct dir_hash_ent	*last;		/* ptr to last added entry */
+	struct dir_hash_ent	**byhash;	/* ptr to name hash buckets */
+#define HT_UNSEEN		1
+	struct radix_tree_root	byaddr;
+};
 
 #define	DIR_HASH_TAB_SIZE(n)	\
-	(sizeof(dir_hash_tab_t) + (sizeof(dir_hash_ent_t *) * (n) * 2))
+	(sizeof(struct dir_hash_tab) + (sizeof(struct dir_hash_ent *) * (n)))
 #define	DIR_HASH_FUNC(t,a)	((a) % (t)->size)
 
 /*
@@ -154,8 +154,8 @@ dir_read_buf(
  */
 static int
 dir_hash_add(
-	xfs_mount_t		*mp,
-	dir_hash_tab_t		*hashtab,
+	struct xfs_mount	*mp,
+	struct dir_hash_tab	*hashtab,
 	uint32_t		addr,
 	xfs_ino_t		inum,
 	int			namelen,
@@ -163,19 +163,18 @@ dir_hash_add(
 	uint8_t			ftype)
 {
 	xfs_dahash_t		hash = 0;
-	int			byaddr;
 	int			byhash = 0;
-	dir_hash_ent_t		*p;
+	struct dir_hash_ent	*p;
 	int			dup;
 	short			junk;
 	struct xfs_name		xname;
+	int			error;
 
 	xname.name = name;
 	xname.len = namelen;
 	xname.type = ftype;
 
 	junk = name[0] == '/';
-	byaddr = DIR_HASH_FUNC(hashtab, addr);
 	dup = 0;
 
 	if (!junk) {
@@ -205,8 +204,14 @@ dir_hash_add(
 		do_error(_("malloc failed in dir_hash_add (%zu bytes)\n"),
 			sizeof(*p));
 
-	p->nextbyaddr = hashtab->byaddr[byaddr];
-	hashtab->byaddr[byaddr] = p;
+	error = radix_tree_insert(&hashtab->byaddr, addr, p);
+	if (error == EEXIST) {
+		do_warn(_("duplicate addrs %u in directory!\n"), addr);
+		free(p);
+		return 0;
+	}
+	radix_tree_tag_set(&hashtab->byaddr, addr, HT_UNSEEN);
+
 	if (hashtab->last)
 		hashtab->last->nextbyorder = p;
 	else
@@ -232,33 +237,14 @@ dir_hash_add(
 	return !dup;
 }
 
-/*
- * checks to see if any data entries are not in the leaf blocks
- */
-static int
-dir_hash_unseen(
-	dir_hash_tab_t	*hashtab)
-{
-	int		i;
-	dir_hash_ent_t	*p;
-
-	for (i = 0; i < hashtab->size; i++) {
-		for (p = hashtab->byaddr[i]; p; p = p->nextbyaddr) {
-			if (p->seen == 0)
-				return 1;
-		}
-	}
-	return 0;
-}
-
 static int
 dir_hash_check(
-	dir_hash_tab_t	*hashtab,
-	xfs_inode_t	*ip,
-	int		seeval)
+	struct dir_hash_tab	*hashtab,
+	struct xfs_inode	*ip,
+	int			seeval)
 {
-	static char	*seevalstr[DIR_HASH_CK_TOTAL];
-	static int	done;
+	static char		*seevalstr[DIR_HASH_CK_TOTAL];
+	static int		done;
 
 	if (!done) {
 		seevalstr[DIR_HASH_CK_OK] = _("ok");
@@ -270,7 +256,8 @@ dir_hash_check(
 		done = 1;
 	}
 
-	if (seeval == DIR_HASH_CK_OK && dir_hash_unseen(hashtab))
+	if (seeval == DIR_HASH_CK_OK &&
+	    radix_tree_tagged(&hashtab->byaddr, HT_UNSEEN))
 		seeval = DIR_HASH_CK_NOLEAF;
 	if (seeval == DIR_HASH_CK_OK)
 		return 0;
@@ -285,27 +272,28 @@ dir_hash_check(
 
 static void
 dir_hash_done(
-	dir_hash_tab_t	*hashtab)
+	struct dir_hash_tab	*hashtab)
 {
-	int		i;
-	dir_hash_ent_t	*n;
-	dir_hash_ent_t	*p;
+	int			i;
+	struct dir_hash_ent	*n;
+	struct dir_hash_ent	*p;
 
 	for (i = 0; i < hashtab->size; i++) {
-		for (p = hashtab->byaddr[i]; p; p = n) {
-			n = p->nextbyaddr;
+		for (p = hashtab->byhash[i]; p; p = n) {
+			n = p->nextbyhash;
+			radix_tree_delete(&hashtab->byaddr, p->address);
 			free(p);
 		}
 	}
 	free(hashtab);
 }
 
-static dir_hash_tab_t *
+static struct dir_hash_tab *
 dir_hash_init(
-	xfs_fsize_t	size)
+	xfs_fsize_t		size)
 {
-	dir_hash_tab_t	*hashtab;
-	int		hsize;
+	struct dir_hash_tab	*hashtab;
+	int			hsize;
 
 	hsize = size / (16 * 4);
 	if (hsize > 65536)
@@ -315,51 +303,43 @@ dir_hash_init(
 	if ((hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1)) == NULL)
 		do_error(_("calloc failed in dir_hash_init\n"));
 	hashtab->size = hsize;
-	hashtab->byhash = (dir_hash_ent_t**)((char *)hashtab +
-		sizeof(dir_hash_tab_t));
-	hashtab->byaddr = (dir_hash_ent_t**)((char *)hashtab +
-		sizeof(dir_hash_tab_t) + sizeof(dir_hash_ent_t*) * hsize);
+	hashtab->byhash = (struct dir_hash_ent **)((char *)hashtab +
+		sizeof(struct dir_hash_tab));
+	INIT_RADIX_TREE(&hashtab->byaddr, 0);
 	return hashtab;
 }
 
 static int
 dir_hash_see(
-	dir_hash_tab_t		*hashtab,
+	struct dir_hash_tab	*hashtab,
 	xfs_dahash_t		hash,
 	xfs_dir2_dataptr_t	addr)
 {
-	int			i;
-	dir_hash_ent_t		*p;
+	struct dir_hash_ent	*p;
 
-	i = DIR_HASH_FUNC(hashtab, addr);
-	for (p = hashtab->byaddr[i]; p; p = p->nextbyaddr) {
-		if (p->address != addr)
-			continue;
-		if (p->seen)
-			return DIR_HASH_CK_DUPLEAF;
-		if (p->junkit == 0 && p->hashval != hash)
-			return DIR_HASH_CK_BADHASH;
-		p->seen = 1;
-		return DIR_HASH_CK_OK;
-	}
-	return DIR_HASH_CK_NODATA;
+	p = radix_tree_lookup(&hashtab->byaddr, addr);
+	if (!p)
+		return DIR_HASH_CK_NODATA;
+	if (!radix_tree_tag_get(&hashtab->byaddr, addr, HT_UNSEEN))
+		return DIR_HASH_CK_DUPLEAF;
+	if (p->junkit == 0 && p->hashval != hash)
+		return DIR_HASH_CK_BADHASH;
+	radix_tree_tag_clear(&hashtab->byaddr, addr, HT_UNSEEN);
+	return DIR_HASH_CK_OK;
 }
 
 static void
 dir_hash_update_ftype(
-	dir_hash_tab_t		*hashtab,
+	struct dir_hash_tab	*hashtab,
 	xfs_dir2_dataptr_t	addr,
 	uint8_t			ftype)
 {
-	int			i;
-	dir_hash_ent_t		*p;
+	struct dir_hash_ent	*p;
 
-	i = DIR_HASH_FUNC(hashtab, addr);
-	for (p = hashtab->byaddr[i]; p; p = p->nextbyaddr) {
-		if (p->address != addr)
-			continue;
-		p->name.type = ftype;
-	}
+	p = radix_tree_lookup(&hashtab->byaddr, addr);
+	if (!p)
+		return;
+	p->name.type = ftype;
 }
 
 /*
@@ -368,7 +348,7 @@ dir_hash_update_ftype(
  */
 static int
 dir_hash_see_all(
-	dir_hash_tab_t		*hashtab,
+	struct dir_hash_tab	*hashtab,
 	xfs_dir2_leaf_entry_t	*ents,
 	int			count,
 	int			stale)
@@ -1226,19 +1206,19 @@ dir_binval(
 
 static void
 longform_dir2_rebuild(
-	xfs_mount_t		*mp,
+	struct xfs_mount	*mp,
 	xfs_ino_t		ino,
-	xfs_inode_t		*ip,
-	ino_tree_node_t		*irec,
+	struct xfs_inode	*ip,
+	struct ino_tree_node	*irec,
 	int			ino_offset,
-	dir_hash_tab_t		*hashtab)
+	struct dir_hash_tab	*hashtab)
 {
 	int			error;
 	int			nres;
-	xfs_trans_t		*tp;
+	struct xfs_trans	*tp;
 	xfs_fileoff_t		lastblock;
-	xfs_inode_t		pip;
-	dir_hash_ent_t		*p;
+	struct xfs_inode	pip;
+	struct dir_hash_ent	*p;
 	int			done = 0;
 
 	/*
@@ -1397,14 +1377,14 @@ _("directory shrink failed (%d)\n"), error);
  */
 static void
 longform_dir2_entry_check_data(
-	xfs_mount_t		*mp,
-	xfs_inode_t		*ip,
+	struct xfs_mount	*mp,
+	struct xfs_inode	*ip,
 	int			*num_illegal,
 	int			*need_dot,
-	ino_tree_node_t		*current_irec,
+	struct ino_tree_node	*current_irec,
 	int			current_ino_offset,
 	struct xfs_buf		*bp,
-	dir_hash_tab_t		*hashtab,
+	struct dir_hash_tab	*hashtab,
 	freetab_t		**freetabp,
 	xfs_dablk_t		da_bno,
 	int			isblock)
@@ -1931,10 +1911,10 @@ check_dir3_header(
  */
 static int
 longform_dir2_check_leaf(
-	xfs_mount_t		*mp,
-	xfs_inode_t		*ip,
-	dir_hash_tab_t		*hashtab,
-	freetab_t		*freetab)
+	struct xfs_mount	*mp,
+	struct xfs_inode	*ip,
+	struct dir_hash_tab	*hashtab,
+	struct freetab		*freetab)
 {
 	int			badtail;
 	__be16			*bestsp;
@@ -2016,10 +1996,10 @@ longform_dir2_check_leaf(
  */
 static int
 longform_dir2_check_node(
-	xfs_mount_t		*mp,
-	xfs_inode_t		*ip,
-	dir_hash_tab_t		*hashtab,
-	freetab_t		*freetab)
+	struct xfs_mount	*mp,
+	struct xfs_inode	*ip,
+	struct dir_hash_tab	*hashtab,
+	struct freetab		*freetab)
 {
 	struct xfs_buf		*bp;
 	xfs_dablk_t		da_bno;
@@ -2191,14 +2171,15 @@ longform_dir2_check_node(
  * (ie. get libxfs to do all the grunt work)
  */
 static void
-longform_dir2_entry_check(xfs_mount_t	*mp,
-			xfs_ino_t	ino,
-			xfs_inode_t	*ip,
-			int		*num_illegal,
-			int		*need_dot,
-			ino_tree_node_t	*irec,
-			int		ino_offset,
-			dir_hash_tab_t	*hashtab)
+longform_dir2_entry_check(
+	struct xfs_mount	*mp,
+	xfs_ino_t		ino,
+	struct xfs_inode	*ip,
+	int			*num_illegal,
+	int			*need_dot,
+	struct ino_tree_node	*irec,
+	int			ino_offset,
+	struct dir_hash_tab	*hashtab)
 {
 	struct xfs_buf		*bp;
 	xfs_dablk_t		da_bno;
@@ -2401,13 +2382,14 @@ shortform_dir2_junk(
 }
 
 static void
-shortform_dir2_entry_check(xfs_mount_t	*mp,
-			xfs_ino_t	ino,
-			xfs_inode_t	*ip,
-			int		*ino_dirty,
-			ino_tree_node_t	*current_irec,
-			int		current_ino_offset,
-			dir_hash_tab_t	*hashtab)
+shortform_dir2_entry_check(
+	struct xfs_mount	*mp,
+	xfs_ino_t		ino,
+	struct xfs_inode	*ip,
+	int			*ino_dirty,
+	struct ino_tree_node	*current_irec,
+	int			current_ino_offset,
+	struct dir_hash_tab	*hashtab)
 {
 	xfs_ino_t		lino;
 	xfs_ino_t		parent;
@@ -2749,15 +2731,15 @@ _("entry \"%s\" (ino %" PRIu64 ") in dir %" PRIu64 " is a duplicate name"),
  */
 static void
 process_dir_inode(
-	xfs_mount_t 		*mp,
+	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
-	ino_tree_node_t		*irec,
+	struct ino_tree_node	*irec,
 	int			ino_offset)
 {
 	xfs_ino_t		ino;
-	xfs_inode_t		*ip;
-	xfs_trans_t		*tp;
-	dir_hash_tab_t		*hashtab;
+	struct xfs_inode	*ip;
+	struct xfs_trans	*tp;
+	struct dir_hash_tab	*hashtab;
 	int			need_dot;
 	int			dirty, num_illegal, error, nres;
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 7/7] repair: scale duplicate name checking in phase 6.
  2020-10-22  5:15 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
                   ` (5 preceding siblings ...)
  2020-10-22  5:15 ` [PATCH 6/7] repair: convert the dir byaddr hash to a radix tree Dave Chinner
@ 2020-10-22  5:15 ` Dave Chinner
  2020-10-29 16:29   ` Darrick J. Wong
  6 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  5:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Phase 6 on large directories is CPU bound on duplicate name checking
due to the algorithm having effectively O(n^2) scalability. Hence
when the duplicate name hash table size is far smaller than the
number of directory entries, we end up with long hash chains that
are searched linearly on every new entry that is found in the
directory to do duplicate detection.

The in-memory hash table size is limited to 64k entries. Hence when
we have millions of entries in a directory, duplicate entry lookups
on the hash table have substantial overhead. Scale this table out to
larger sizes so that we keep the chain lengths short and hence the
O(n^2) scalability impact is limited because N is always small.
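
To put rough numbers on the chain lengths (using the 10M entry, 400MB
directory from the example below; both figures are approximate):

  old:  10M entries / 64k buckets   ~ 150+ entries per chain
  new:  10M entries / 6.4M buckets  ~ 1-2 entries per chain

so the linear search done for each new entry stays effectively
constant length.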

For a 10M entry directory consuming 400MB of directory data, the
hash table now sizes at 6.4 million entries instead of ~64k - it is
~100x larger. While the hash table now consumes ~50MB of RAM, the
xfs_repair footprint barely changes as it is already consuming ~9GB
of RAM at this point in time. IOWs, the incremental memory usage
change is noise, but the directory checking time:

Unpatched:

  97.11%  xfs_repair          [.] dir_hash_add
   0.38%  xfs_repair          [.] longform_dir2_entry_check_data
   0.34%  libc-2.31.so        [.] __libc_calloc
   0.32%  xfs_repair          [.] avl_ino_start

Phase 6:        10/22 12:11:40  10/22 12:14:28  2 minutes, 48 seconds

Patched:

  46.74%  xfs_repair          [.] radix_tree_lookup
  32.13%  xfs_repair          [.] dir_hash_see_all
   7.70%  xfs_repair          [.] radix_tree_tag_get
   3.92%  xfs_repair          [.] dir_hash_add
   3.52%  xfs_repair          [.] radix_tree_tag_clear
   2.43%  xfs_repair          [.] crc32c_le

Phase 6:        10/22 13:11:01  10/22 13:11:18  17 seconds

has been reduced by an order of magnitude.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase6.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index 21f49dd748e1..7dd6130056ee 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -288,19 +288,37 @@ dir_hash_done(
 	free(hashtab);
 }
 
+/*
+ * Create a directory hash index structure based on the size of the directory we
+ * are about to try to repair. The size passed in is the size of the data
+ * segment of the directory in bytes, so we don't really know exactly how many
+ * entries are in it. Hence assume an entry size of around 64 bytes - that's a
+ * name length of 40+ bytes so should cover a most situations with large
+ * really directories.
+ */
 static struct dir_hash_tab *
 dir_hash_init(
 	xfs_fsize_t		size)
 {
-	struct dir_hash_tab	*hashtab;
+	struct dir_hash_tab	*hashtab = NULL;
 	int			hsize;
 
-	hsize = size / (16 * 4);
-	if (hsize > 65536)
-		hsize = 63336;
-	else if (hsize < 16)
+	hsize = size / 64;
+	if (hsize < 16)
 		hsize = 16;
-	if ((hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1)) == NULL)
+
+	/*
+	 * Try to allocate as large a hash table as possible. Failure to
+	 * allocate isn't fatal, it will just result in slower performance as we
+	 * reduce the size of the table.
+	 */
+	while (hsize >= 16) {
+		hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1);
+		if (hashtab)
+			break;
+		hsize /= 2;
+	}
+	if (!hashtab)
 		do_error(_("calloc failed in dir_hash_init\n"));
 	hashtab->size = hsize;
 	hashtab->byhash = (struct dir_hash_ent **)((char *)hashtab +
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] repair: Protect bad inode list with mutex
  2020-10-22  5:15 ` [PATCH 2/7] repair: Protect bad inode list with mutex Dave Chinner
@ 2020-10-22  5:45   ` Darrick J. Wong
  2020-10-29  9:35   ` Christoph Hellwig
  1 sibling, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-22  5:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 04:15:32PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To enable phase 6 parallelisation, we need to protect the bad inode
> list from concurrent modification and/or access. Wrap it with a
> mutex and clean up the nasty typedefs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

/methinks the typedef removal ought to be a separate commit,
but otherwise:

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  repair/dir2.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/repair/dir2.c b/repair/dir2.c
> index eabdb4f2d497..23333e59a382 100644
> --- a/repair/dir2.c
> +++ b/repair/dir2.c
> @@ -20,40 +20,50 @@
>   * Known bad inode list.  These are seen when the leaf and node
>   * block linkages are incorrect.
>   */
> -typedef struct dir2_bad {
> +struct dir2_bad {
>  	xfs_ino_t	ino;
>  	struct dir2_bad	*next;
> -} dir2_bad_t;
> +};
>  
> -static dir2_bad_t *dir2_bad_list;
> +static struct dir2_bad	*dir2_bad_list;
> +pthread_mutex_t		dir2_bad_list_lock = PTHREAD_MUTEX_INITIALIZER;
>  
>  static void
>  dir2_add_badlist(
>  	xfs_ino_t	ino)
>  {
> -	dir2_bad_t	*l;
> +	struct dir2_bad	*l;
>  
> -	if ((l = malloc(sizeof(dir2_bad_t))) == NULL) {
> +	l = malloc(sizeof(*l));
> +	if (!l) {
>  		do_error(
>  _("malloc failed (%zu bytes) dir2_add_badlist:ino %" PRIu64 "\n"),
> -			sizeof(dir2_bad_t), ino);
> +			sizeof(*l), ino);
>  		exit(1);
>  	}
> +	pthread_mutex_lock(&dir2_bad_list_lock);
>  	l->next = dir2_bad_list;
>  	dir2_bad_list = l;
>  	l->ino = ino;
> +	pthread_mutex_unlock(&dir2_bad_list_lock);
>  }
>  
>  int
>  dir2_is_badino(
>  	xfs_ino_t	ino)
>  {
> -	dir2_bad_t	*l;
> +	struct dir2_bad	*l;
> +	int		ret = 0;
>  
> -	for (l = dir2_bad_list; l; l = l->next)
> -		if (l->ino == ino)
> -			return 1;
> -	return 0;
> +	pthread_mutex_lock(&dir2_bad_list_lock);
> +	for (l = dir2_bad_list; l; l = l->next) {
> +		if (l->ino == ino) {
> +			ret = 1;
> +			break;
> +		}
> +	}
> +	pthread_mutex_unlock(&dir2_bad_list_lock);
> +	return ret;
>  }
>  
>  /*
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/7] workqueue: bound maximum queue depth
  2020-10-22  5:15 ` [PATCH 1/7] workqueue: bound maximum queue depth Dave Chinner
@ 2020-10-22  5:54   ` Darrick J. Wong
  2020-10-22  8:11     ` Dave Chinner
  2020-10-25  4:41   ` Darrick J. Wong
  1 sibling, 1 reply; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-22  5:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 04:15:31PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Existing users of workqueues have bound maximum queue depths in
> their external algorithms (e.g. prefetch counts). For parallelising
> work that doesn't have an external bound, allow workqueues to
> throttle incoming requests at a maximum bound. bounded workqueues

Nit: capitalize the 'B' in 'bounded'.

> also need to distribute work over all worker threads themselves as
> there is no external bounding or worker function throttling
> provided.
> 
> Existing callers are not throttled and retain direct control of
> worker threads, only users of the new create interface will be
> throttled and concurrency managed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  libfrog/workqueue.c | 42 +++++++++++++++++++++++++++++++++++++++---
>  libfrog/workqueue.h |  4 ++++
>  2 files changed, 43 insertions(+), 3 deletions(-)
> 
> diff --git a/libfrog/workqueue.c b/libfrog/workqueue.c
> index fe3de4289379..e42b2a2f678b 100644
> --- a/libfrog/workqueue.c
> +++ b/libfrog/workqueue.c
> @@ -40,13 +40,21 @@ workqueue_thread(void *arg)
>  		}
>  
>  		/*
> -		 *  Dequeue work from the head of the list.
> +		 *  Dequeue work from the head of the list. If the queue was
> +		 *  full then send a wakeup if we're configured to do so.
>  		 */
>  		assert(wq->item_count > 0);
> +		if (wq->max_queued && wq->item_count == wq->max_queued)
> +			pthread_cond_signal(&wq->queue_full);
> +
>  		wi = wq->next_item;
>  		wq->next_item = wi->next;
>  		wq->item_count--;
>  
> +		if (wq->max_queued && wq->next_item) {
> +			/* more work, wake up another worker */
> +			pthread_cond_signal(&wq->wakeup);

Hmm.  The net effect of this is that we wake up workers faster when a
ton of work comes in, right?  And I bet none of the current workqueue
users have suffered from this because they queue a bunch of work and
then call workqueue_terminate, which wakes all the threads, and they
never go to sleep again.

Does it make sense to simplify the test to "if (wq->next_item) {"?

Other than that, looks good!

--D

> +		}
>  		pthread_mutex_unlock(&wq->lock);
>  
>  		(wi->function)(wi->queue, wi->index, wi->arg);
> @@ -58,10 +66,11 @@ workqueue_thread(void *arg)
>  
>  /* Allocate a work queue and threads.  Returns zero or negative error code. */
>  int
> -workqueue_create(
> +workqueue_create_bound(
>  	struct workqueue	*wq,
>  	void			*wq_ctx,
> -	unsigned int		nr_workers)
> +	unsigned int		nr_workers,
> +	unsigned int		max_queue)
>  {
>  	unsigned int		i;
>  	int			err = 0;
> @@ -70,12 +79,16 @@ workqueue_create(
>  	err = -pthread_cond_init(&wq->wakeup, NULL);
>  	if (err)
>  		return err;
> +	err = -pthread_cond_init(&wq->queue_full, NULL);
> +	if (err)
> +		goto out_wake;
>  	err = -pthread_mutex_init(&wq->lock, NULL);
>  	if (err)
>  		goto out_cond;
>  
>  	wq->wq_ctx = wq_ctx;
>  	wq->thread_count = nr_workers;
> +	wq->max_queued = max_queue;
>  	wq->threads = malloc(nr_workers * sizeof(pthread_t));
>  	if (!wq->threads) {
>  		err = -errno;
> @@ -102,10 +115,21 @@ workqueue_create(
>  out_mutex:
>  	pthread_mutex_destroy(&wq->lock);
>  out_cond:
> +	pthread_cond_destroy(&wq->queue_full);
> +out_wake:
>  	pthread_cond_destroy(&wq->wakeup);
>  	return err;
>  }
>  
> +int
> +workqueue_create(
> +	struct workqueue	*wq,
> +	void			*wq_ctx,
> +	unsigned int		nr_workers)
> +{
> +	return workqueue_create_bound(wq, wq_ctx, nr_workers, 0);
> +}
> +
>  /*
>   * Create a work item consisting of a function and some arguments and schedule
>   * the work item to be run via the thread pool.  Returns zero or a negative
> @@ -140,6 +164,7 @@ workqueue_add(
>  
>  	/* Now queue the new work structure to the work queue. */
>  	pthread_mutex_lock(&wq->lock);
> +restart:
>  	if (wq->next_item == NULL) {
>  		assert(wq->item_count == 0);
>  		ret = -pthread_cond_signal(&wq->wakeup);
> @@ -150,6 +175,16 @@ workqueue_add(
>  		}
>  		wq->next_item = wi;
>  	} else {
> +		/* throttle on a full queue if configured */
> +		if (wq->max_queued && wq->item_count == wq->max_queued) {
> +			pthread_cond_wait(&wq->queue_full, &wq->lock);
> +			/*
> +			 * Queue might be empty or even still full by the time
> +			 * we get the lock back, so restart the lookup so we do
> +			 * the right thing with the current state of the queue.
> +			 */
> +			goto restart;
> +		}
>  		wq->last_item->next = wi;
>  	}
>  	wq->last_item = wi;
> @@ -201,5 +236,6 @@ workqueue_destroy(
>  	free(wq->threads);
>  	pthread_mutex_destroy(&wq->lock);
>  	pthread_cond_destroy(&wq->wakeup);
> +	pthread_cond_destroy(&wq->queue_full);
>  	memset(wq, 0, sizeof(*wq));
>  }
> diff --git a/libfrog/workqueue.h b/libfrog/workqueue.h
> index a56d1cf14081..a9c108d0e66a 100644
> --- a/libfrog/workqueue.h
> +++ b/libfrog/workqueue.h
> @@ -31,10 +31,14 @@ struct workqueue {
>  	unsigned int		thread_count;
>  	bool			terminate;
>  	bool			terminated;
> +	int			max_queued;
> +	pthread_cond_t		queue_full;
>  };
>  
>  int workqueue_create(struct workqueue *wq, void *wq_ctx,
>  		unsigned int nr_workers);
> +int workqueue_create_bound(struct workqueue *wq, void *wq_ctx,
> +		unsigned int nr_workers, unsigned int max_queue);
>  int workqueue_add(struct workqueue *wq, workqueue_func_t fn,
>  		uint32_t index, void *arg);
>  int workqueue_terminate(struct workqueue *wq);
> -- 
> 2.28.0
> 


* Re: [PATCH 3/7] repair: protect inode chunk tree records with a mutex
  2020-10-22  5:15 ` [PATCH 3/7] repair: protect inode chunk tree records with a mutex Dave Chinner
@ 2020-10-22  6:02   ` Darrick J. Wong
  2020-10-22  8:15     ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-22  6:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 04:15:33PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Phase 6 accesses inode chunk records mostly in an isolated manner.
> However, when it finds a corruption in a directory or there are
> multiple hardlinks to an inode, there can be concurrent access
> to the inode chunk record to update state.
> 
> Hence the inode record itself needs a mutex. This protects all state
> changes within the inode chunk record, as well as inode link counts
> and chunk references. That allows us to process multiple chunks at
> once, providing concurrency within an AG as well as across AGs.
> 
> The inode chunk tree itself is not modified in phase 6 - it's built

Well, that's not 100% true -- mk_orphanage can do that, but AFAICT
that's outside the scope of the parallel processing (and I don't see
much point in parallelizing that part) so I guess that's fine?

> in phases 3 and 4  - and so we do not need to worry about locking
> for AVL tree lookups to find the inode chunk records themselves.
> Hence internal locking is all we need here.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>

TBH I wonder if all the phase6.c code to recreate the root dir, the
orphanage, and the realtime inodes ought to get moved to another file,
particularly since the metadata directory patches add quite a bit more
stuff here?  But that's a topic for another patch...

I /think/ this looks ok, but I guess I'll go look at the rest of the
series.

--D

> ---
>  repair/incore.h     | 23 +++++++++++++++++++++++
>  repair/incore_ino.c | 15 +++++++++++++++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/repair/incore.h b/repair/incore.h
> index 5b29d5d1efd8..6564e0d38963 100644
> --- a/repair/incore.h
> +++ b/repair/incore.h
> @@ -282,6 +282,7 @@ typedef struct ino_tree_node  {
>  		parent_list_t	*plist;		/* phases 2-5 */
>  	} ino_un;
>  	uint8_t			*ftypes;	/* phases 3,6 */
> +	pthread_mutex_t		lock;
>  } ino_tree_node_t;
>  
>  #define INOS_PER_IREC	(sizeof(uint64_t) * NBBY)
> @@ -412,7 +413,9 @@ next_free_ino_rec(ino_tree_node_t *ino_rec)
>   */
>  static inline void add_inode_refchecked(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_un.ex_data->ino_processed |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int is_inode_refchecked(struct ino_tree_node *irec, int offset)
> @@ -438,12 +441,16 @@ static inline int is_inode_confirmed(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_isadir(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_isa_dir |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline void clear_inode_isadir(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_isa_dir &= ~IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int inode_isadir(struct ino_tree_node *irec, int offset)
> @@ -456,15 +463,19 @@ static inline int inode_isadir(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_free(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	set_inode_confirmed(irec, offset);
>  	irec->ir_free |= XFS_INOBT_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  
>  }
>  
>  static inline void set_inode_used(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	set_inode_confirmed(irec, offset);
>  	irec->ir_free &= ~XFS_INOBT_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int is_inode_free(struct ino_tree_node *irec, int offset)
> @@ -477,7 +488,9 @@ static inline int is_inode_free(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_sparse(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ir_sparse |= XFS_INOBT_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline bool is_inode_sparse(struct ino_tree_node *irec, int offset)
> @@ -490,12 +503,16 @@ static inline bool is_inode_sparse(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_was_rl(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_was_rl |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline void clear_inode_was_rl(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_was_rl &= ~IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int inode_was_rl(struct ino_tree_node *irec, int offset)
> @@ -508,12 +525,16 @@ static inline int inode_was_rl(struct ino_tree_node *irec, int offset)
>   */
>  static inline void set_inode_is_rl(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_is_rl |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline void clear_inode_is_rl(struct ino_tree_node *irec, int offset)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_is_rl &= ~IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  static inline int inode_is_rl(struct ino_tree_node *irec, int offset)
> @@ -546,7 +567,9 @@ static inline int is_inode_reached(struct ino_tree_node *irec, int offset)
>  static inline void add_inode_reached(struct ino_tree_node *irec, int offset)
>  {
>  	add_inode_ref(irec, offset);
> +	pthread_mutex_lock(&irec->lock);
>  	irec->ino_un.ex_data->ino_reached |= IREC_MASK(offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  /*
> diff --git a/repair/incore_ino.c b/repair/incore_ino.c
> index 82956ae93005..299e4f949e5e 100644
> --- a/repair/incore_ino.c
> +++ b/repair/incore_ino.c
> @@ -91,6 +91,7 @@ void add_inode_ref(struct ino_tree_node *irec, int ino_offset)
>  {
>  	ASSERT(irec->ino_un.ex_data != NULL);
>  
> +	pthread_mutex_lock(&irec->lock);
>  	switch (irec->nlink_size) {
>  	case sizeof(uint8_t):
>  		if (irec->ino_un.ex_data->counted_nlinks.un8[ino_offset] < 0xff) {
> @@ -112,6 +113,7 @@ void add_inode_ref(struct ino_tree_node *irec, int ino_offset)
>  	default:
>  		ASSERT(0);
>  	}
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
> @@ -120,6 +122,7 @@ void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
>  
>  	ASSERT(irec->ino_un.ex_data != NULL);
>  
> +	pthread_mutex_lock(&irec->lock);
>  	switch (irec->nlink_size) {
>  	case sizeof(uint8_t):
>  		ASSERT(irec->ino_un.ex_data->counted_nlinks.un8[ino_offset] > 0);
> @@ -139,6 +142,7 @@ void drop_inode_ref(struct ino_tree_node *irec, int ino_offset)
>  
>  	if (refs == 0)
>  		irec->ino_un.ex_data->ino_reached &= ~IREC_MASK(ino_offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  uint32_t num_inode_references(struct ino_tree_node *irec, int ino_offset)
> @@ -161,6 +165,7 @@ uint32_t num_inode_references(struct ino_tree_node *irec, int ino_offset)
>  void set_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset,
>  		uint32_t nlinks)
>  {
> +	pthread_mutex_lock(&irec->lock);
>  	switch (irec->nlink_size) {
>  	case sizeof(uint8_t):
>  		if (nlinks < 0xff) {
> @@ -182,6 +187,7 @@ void set_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset,
>  	default:
>  		ASSERT(0);
>  	}
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  uint32_t get_inode_disk_nlinks(struct ino_tree_node *irec, int ino_offset)
> @@ -253,6 +259,7 @@ alloc_ino_node(
>  	irec->nlink_size = sizeof(uint8_t);
>  	irec->disk_nlinks.un8 = alloc_nlink_array(irec->nlink_size);
>  	irec->ftypes = alloc_ftypes_array(mp);
> +	pthread_mutex_init(&irec->lock, NULL);
>  	return irec;
>  }
>  
> @@ -294,6 +301,7 @@ free_ino_tree_node(
>  	}
>  
>  	free(irec->ftypes);
> +	pthread_mutex_destroy(&irec->lock);
>  	free(irec);
>  }
>  
> @@ -600,6 +608,7 @@ set_inode_parent(
>  	uint64_t		bitmask;
>  	parent_entry_t		*tmp;
>  
> +	pthread_mutex_lock(&irec->lock);
>  	if (full_ino_ex_data)
>  		ptbl = irec->ino_un.ex_data->parents;
>  	else
> @@ -625,6 +634,7 @@ set_inode_parent(
>  #endif
>  		ptbl->pentries[0] = parent;
>  
> +		pthread_mutex_unlock(&irec->lock);
>  		return;
>  	}
>  
> @@ -642,6 +652,7 @@ set_inode_parent(
>  #endif
>  		ptbl->pentries[target] = parent;
>  
> +		pthread_mutex_unlock(&irec->lock);
>  		return;
>  	}
>  
> @@ -682,6 +693,7 @@ set_inode_parent(
>  #endif
>  	ptbl->pentries[target] = parent;
>  	ptbl->pmask |= (1ULL << offset);
> +	pthread_mutex_unlock(&irec->lock);
>  }
>  
>  xfs_ino_t
> @@ -692,6 +704,7 @@ get_inode_parent(ino_tree_node_t *irec, int offset)
>  	int		i;
>  	int		target;
>  
> +	pthread_mutex_lock(&irec->lock);
>  	if (full_ino_ex_data)
>  		ptbl = irec->ino_un.ex_data->parents;
>  	else
> @@ -709,9 +722,11 @@ get_inode_parent(ino_tree_node_t *irec, int offset)
>  #ifdef DEBUG
>  		ASSERT(target < ptbl->cnt);
>  #endif
> +		pthread_mutex_unlock(&irec->lock);
>  		return(ptbl->pentries[target]);
>  	}
>  
> +	pthread_mutex_unlock(&irec->lock);
>  	return(0LL);
>  }
>  
> -- 
> 2.28.0
> 


* Re: [PATCH 4/7] repair: parallelise phase 6
  2020-10-22  5:15 ` [PATCH 4/7] repair: parallelise phase 6 Dave Chinner
@ 2020-10-22  6:11   ` Darrick J. Wong
  2020-10-27  5:10     ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-22  6:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 04:15:34PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> A recent metadump provided to us caused repair to take hours in
> phase6. It wasn't IO bound - it was fully CPU bound the entire time.
> The only way to speed it up is to make phase 6 run multiple
> concurrent processing threads.
> 
> The obvious way to do this is to spread the concurrency across AGs,
> like the other phases, and while this works it is not optimal. When
> a processing thread hits a really large directory, it essentially
> sits CPU bound until that directory is processed. If an AG has lots
> of large directories, we end up with a really long single threaded
> tail that limits concurrency.
> 
> Hence we also need to have concurrency /within/ the AG. This is
> relatively easy, as the inode chunk records allow for a simple
> concurrency mechanism within an AG. We can simply feed each chunk
> record to a workqueue, and we get concurrency within the AG for
> free. However, this allows prefetch to run way ahead of processing
> and this blows out the buffer cache size and can cause OOM.
> 
> However, we can use the new workqueue depth limiting to limit the
> number of inode chunks queued, and this then backs up the inode
> prefetching to its maximum queue depth.

I'm interested in (some day) hooking up xfs_scrub to max_queued, since
it has the same concurrency problem when one of the AGs has a number of
hugely fragmented files.

> Hence we prevent having the
> prefetch code queue the entire AG's inode chunks on the workqueue
> blowing out memory by throttling the prefetch consumer.
> 
> This takes phase 6 from taking many, many hours down to:
> 
> Phase 6:        10/30 21:12:58  10/30 21:40:48  27 minutes, 50 seconds
> 
> And burning 20-30 cpus that entire time on my test rig.

Yay!

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/phase6.c | 43 +++++++++++++++++++++++++++++++++++--------
>  1 file changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/repair/phase6.c b/repair/phase6.c
> index 70d32089bb57..bf0719c186fb 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -6,6 +6,7 @@
>  
>  #include "libxfs.h"
>  #include "threads.h"
> +#include "threads.h"
>  #include "prefetch.h"
>  #include "avl.h"
>  #include "globals.h"
> @@ -3109,20 +3110,45 @@ check_for_orphaned_inodes(
>  }
>  
>  static void
> -traverse_function(
> +do_dir_inode(
>  	struct workqueue	*wq,
> -	xfs_agnumber_t 		agno,
> +	xfs_agnumber_t		agno,
>  	void			*arg)
>  {
> -	ino_tree_node_t 	*irec;
> +	struct ino_tree_node	*irec = arg;
>  	int			i;
> +
> +	for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
> +		if (inode_isadir(irec, i))
> +			process_dir_inode(wq->wq_ctx, agno, irec, i);
> +	}
> +}
> +
> +static void
> +traverse_function(
> +	struct workqueue	*wq,
> +	xfs_agnumber_t		agno,
> +	void			*arg)
> +{
> +	struct ino_tree_node	*irec;
>  	prefetch_args_t		*pf_args = arg;
> +	struct workqueue	lwq;
> +	struct xfs_mount	*mp = wq->wq_ctx;
> +
>  
>  	wait_for_inode_prefetch(pf_args);
>  
>  	if (verbose)
>  		do_log(_("        - agno = %d\n"), agno);
>  
> +	/*
> +	 * The more AGs we have in flight at once, the fewer processing threads
> +	 * per AG. This means we don't overwhelm the machine with hundreds of
> +	 * threads when we start acting on lots of AGs at once. We just want
> +	 * enough that we can keep multiple CPUs busy across multiple AGs.
> +	 */
> +	workqueue_create_bound(&lwq, mp, ag_stride, 1000);

Eeeeee, magic number! :)

/me tosses in obligatory hand-wringing about 2000 CPU systems running
out of work.  How about ag_stride * 50 or something? :P

(Aside from that this all looks ok to me)

--D

> +
>  	for (irec = findfirst_inode_rec(agno); irec; irec = next_ino_rec(irec)) {
>  		if (irec->ino_isa_dir == 0)
>  			continue;
> @@ -3130,18 +3156,19 @@ traverse_function(
>  		if (pf_args) {
>  			sem_post(&pf_args->ra_count);
>  #ifdef XR_PF_TRACE
> +			{
> +			int	i;
>  			sem_getvalue(&pf_args->ra_count, &i);
>  			pftrace(
>  		"processing inode chunk %p in AG %d (sem count = %d)",
>  				irec, agno, i);
> +			}
>  #endif
>  		}
>  
> -		for (i = 0; i < XFS_INODES_PER_CHUNK; i++)  {
> -			if (inode_isadir(irec, i))
> -				process_dir_inode(wq->wq_ctx, agno, irec, i);
> -		}
> +		queue_work(&lwq, do_dir_inode, agno, irec);
>  	}
> +	destroy_work_queue(&lwq);
>  	cleanup_inode_prefetch(pf_args);
>  }
>  
> @@ -3169,7 +3196,7 @@ static void
>  traverse_ags(
>  	struct xfs_mount	*mp)
>  {
> -	do_inode_prefetch(mp, 0, traverse_function, false, true);
> +	do_inode_prefetch(mp, ag_stride, traverse_function, false, true);
>  }
>  
>  void
> -- 
> 2.28.0
> 


* Re: [PATCH 5/7] repair: don't duplicate names in phase 6
  2020-10-22  5:15 ` [PATCH 5/7] repair: don't duplicate names in " Dave Chinner
@ 2020-10-22  6:21   ` Darrick J. Wong
  2020-10-22  8:23     ` Dave Chinner
  2020-10-29  9:39   ` Christoph Hellwig
  1 sibling, 1 reply; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-22  6:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 04:15:35PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The name hash in phase 6 is constructed by using names that point
> directly into the directory buffers. Hence before the buffers can be
> released, the constructed name hash has to duplicate all those names
> into memory it owns via dir_hash_dup_names().
> 
> Given that the structure that holds the name is dynamically
> allocated, it makes no sense to store a pointer to the name in
> dir_hash_add() and then later dynamically allocate a copy of the name.
> 
> Extend the name hash allocation to contain space for the name
> itself, and copy the name into the name hash structure in
> dir_hash_add(). This allows us to get rid of dir_hash_dup_names(),
> and the directory checking code no longer needs to hold all the
> directory buffers in memory until the entire directory walk is
> complete and the names duplicated.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/phase6.c | 101 ++++++++++++++----------------------------------
>  1 file changed, 29 insertions(+), 72 deletions(-)
> 
> diff --git a/repair/phase6.c b/repair/phase6.c
> index bf0719c186fb..79c87495656f 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -72,7 +72,7 @@ typedef struct dir_hash_ent {
>  	struct dir_hash_ent	*nextbyorder;	/* next in order added */
>  	xfs_dahash_t		hashval;	/* hash value of name */
>  	uint32_t		address;	/* offset of data entry */
> -	xfs_ino_t 		inum;		/* inode num of entry */
> +	xfs_ino_t		inum;		/* inode num of entry */
>  	short			junkit;		/* name starts with / */
>  	short			seen;		/* have seen leaf entry */
>  	struct xfs_name		name;
> @@ -80,7 +80,6 @@ typedef struct dir_hash_ent {
>  
>  typedef struct dir_hash_tab {
>  	int			size;		/* size of hash tables */
> -	int			names_duped;	/* 1 = ent names malloced */
>  	dir_hash_ent_t		*first;		/* ptr to first added entry */
>  	dir_hash_ent_t		*last;		/* ptr to last added entry */
>  	dir_hash_ent_t		**byhash;	/* ptr to name hash buckets */
> @@ -171,8 +170,6 @@ dir_hash_add(
>  	short			junk;
>  	struct xfs_name		xname;
>  
> -	ASSERT(!hashtab->names_duped);
> -
>  	xname.name = name;
>  	xname.len = namelen;
>  	xname.type = ftype;
> @@ -199,7 +196,12 @@ dir_hash_add(
>  		}
>  	}
>  
> -	if ((p = malloc(sizeof(*p))) == NULL)
> +	/*
> +	 * Allocate enough space for the hash entry and the name in a single
> +	 * allocation so we can store our own copy of the name for later use.
> +	 */
> +	p = calloc(1, sizeof(*p) + namelen + 1);
> +	if (!p)
>  		do_error(_("malloc failed in dir_hash_add (%zu bytes)\n"),
>  			sizeof(*p));
>  
> @@ -220,7 +222,12 @@ dir_hash_add(
>  	p->address = addr;
>  	p->inum = inum;
>  	p->seen = 0;
> -	p->name = xname;
> +
> +	/* Set up the name in the region trailing the hash entry. */
> +	memcpy(p + 1, name, namelen);
> +	p->name.name = (const unsigned char *)(p + 1);
> +	p->name.len = namelen;
> +	p->name.type = ftype;
>  
>  	return !dup;
>  }
> @@ -287,8 +294,6 @@ dir_hash_done(
>  	for (i = 0; i < hashtab->size; i++) {
>  		for (p = hashtab->byaddr[i]; p; p = n) {
>  			n = p->nextbyaddr;
> -			if (hashtab->names_duped)
> -				free((void *)p->name.name);
>  			free(p);
>  		}
>  	}
> @@ -385,27 +390,6 @@ dir_hash_see_all(
>  	return j == stale ? DIR_HASH_CK_OK : DIR_HASH_CK_BADSTALE;
>  }
>  
> -/*
> - * Convert name pointers into locally allocated memory.
> - * This must only be done after all the entries have been added.
> - */
> -static void
> -dir_hash_dup_names(dir_hash_tab_t *hashtab)
> -{
> -	unsigned char		*name;
> -	dir_hash_ent_t		*p;
> -
> -	if (hashtab->names_duped)
> -		return;
> -
> -	for (p = hashtab->first; p; p = p->nextbyorder) {
> -		name = malloc(p->name.len);
> -		memcpy(name, p->name.name, p->name.len);
> -		p->name.name = name;
> -	}
> -	hashtab->names_duped = 1;
> -}
> -
>  /*
>   * Given a block number in a fork, return the next valid block number
>   * (not a hole).
> @@ -1387,6 +1371,7 @@ dir2_kill_block(
>  		res_failed(error);
>  	libxfs_trans_ijoin(tp, ip, 0);
>  	libxfs_trans_bjoin(tp, bp);
> +	libxfs_trans_bhold(tp, bp);

Why hold on to the buffer?  We killed the block, why keep the reference
around so that someone else has to remember to drop it later?

Hooray for killing that bplist thing later though. :)

--D

>  	memset(&args, 0, sizeof(args));
>  	args.dp = ip;
>  	args.trans = tp;
> @@ -1418,7 +1403,7 @@ longform_dir2_entry_check_data(
>  	int			*need_dot,
>  	ino_tree_node_t		*current_irec,
>  	int			current_ino_offset,
> -	struct xfs_buf		**bpp,
> +	struct xfs_buf		*bp,
>  	dir_hash_tab_t		*hashtab,
>  	freetab_t		**freetabp,
>  	xfs_dablk_t		da_bno,
> @@ -1426,7 +1411,6 @@ longform_dir2_entry_check_data(
>  {
>  	xfs_dir2_dataptr_t	addr;
>  	xfs_dir2_leaf_entry_t	*blp;
> -	struct xfs_buf		*bp;
>  	xfs_dir2_block_tail_t	*btp;
>  	struct xfs_dir2_data_hdr *d;
>  	xfs_dir2_db_t		db;
> @@ -1457,7 +1441,6 @@ longform_dir2_entry_check_data(
>  	};
>  
>  
> -	bp = *bpp;
>  	d = bp->b_addr;
>  	ptr = (char *)d + mp->m_dir_geo->data_entry_offset;
>  	nbad = 0;
> @@ -1558,10 +1541,8 @@ longform_dir2_entry_check_data(
>  			dir2_kill_block(mp, ip, da_bno, bp);
>  		} else {
>  			do_warn(_("would junk block\n"));
> -			libxfs_buf_relse(bp);
>  		}
>  		freetab->ents[db].v = NULLDATAOFF;
> -		*bpp = NULL;
>  		return;
>  	}
>  
> @@ -2219,17 +2200,15 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
>  			int		ino_offset,
>  			dir_hash_tab_t	*hashtab)
>  {
> -	struct xfs_buf		**bplist;
> +	struct xfs_buf		*bp;
>  	xfs_dablk_t		da_bno;
>  	freetab_t		*freetab;
> -	int			num_bps;
>  	int			i;
>  	int			isblock;
>  	int			isleaf;
>  	xfs_fileoff_t		next_da_bno;
>  	int			seeval;
>  	int			fixit = 0;
> -	xfs_dir2_db_t		db;
>  	struct xfs_da_args	args;
>  
>  	*need_dot = 1;
> @@ -2246,11 +2225,6 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
>  		freetab->ents[i].v = NULLDATAOFF;
>  		freetab->ents[i].s = 0;
>  	}
> -	num_bps = freetab->naents;
> -	bplist = calloc(num_bps, sizeof(struct xfs_buf*));
> -	if (!bplist)
> -		do_error(_("calloc failed in %s (%zu bytes)\n"),
> -			__func__, num_bps * sizeof(struct xfs_buf*));
>  
>  	/* is this a block, leaf, or node directory? */
>  	args.dp = ip;
> @@ -2279,28 +2253,12 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
>  			break;
>  		}
>  
> -		db = xfs_dir2_da_to_db(mp->m_dir_geo, da_bno);
> -		if (db >= num_bps) {
> -			int last_size = num_bps;
> -
> -			/* more data blocks than expected */
> -			num_bps = db + 1;
> -			bplist = realloc(bplist, num_bps * sizeof(struct xfs_buf*));
> -			if (!bplist)
> -				do_error(_("realloc failed in %s (%zu bytes)\n"),
> -					__func__,
> -					num_bps * sizeof(struct xfs_buf*));
> -			/* Initialize the new elements */
> -			for (i = last_size; i < num_bps; i++)
> -				bplist[i] = NULL;
> -		}
> -
>  		if (isblock)
>  			ops = &xfs_dir3_block_buf_ops;
>  		else
>  			ops = &xfs_dir3_data_buf_ops;
>  
> -		error = dir_read_buf(ip, da_bno, &bplist[db], ops, &fixit);
> +		error = dir_read_buf(ip, da_bno, &bp, ops, &fixit);
>  		if (error) {
>  			do_warn(
>  	_("can't read data block %u for directory inode %" PRIu64 " error %d\n"),
> @@ -2320,21 +2278,25 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
>  		}
>  
>  		/* check v5 metadata */
> -		d = bplist[db]->b_addr;
> +		d = bp->b_addr;
>  		if (be32_to_cpu(d->magic) == XFS_DIR3_BLOCK_MAGIC ||
>  		    be32_to_cpu(d->magic) == XFS_DIR3_DATA_MAGIC) {
> -			struct xfs_buf		 *bp = bplist[db];
> -
>  			error = check_dir3_header(mp, bp, ino);
>  			if (error) {
>  				fixit++;
> +				if (isblock)
> +					goto out_fix;
>  				continue;
>  			}
>  		}
>  
>  		longform_dir2_entry_check_data(mp, ip, num_illegal, need_dot,
> -				irec, ino_offset, &bplist[db], hashtab,
> +				irec, ino_offset, bp, hashtab,
>  				&freetab, da_bno, isblock);
> +		if (isblock)
> +			break;
> +
> +		libxfs_buf_relse(bp);
>  	}
>  	fixit |= (*num_illegal != 0) || dir2_is_badino(ino) || *need_dot;
>  
> @@ -2345,7 +2307,7 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
>  			xfs_dir2_block_tail_t	*btp;
>  			xfs_dir2_leaf_entry_t	*blp;
>  
> -			block = bplist[0]->b_addr;
> +			block = bp->b_addr;
>  			btp = xfs_dir2_block_tail_p(mp->m_dir_geo, block);
>  			blp = xfs_dir2_block_leaf_p(btp);
>  			seeval = dir_hash_see_all(hashtab, blp,
> @@ -2362,11 +2324,10 @@ longform_dir2_entry_check(xfs_mount_t	*mp,
>  		}
>  	}
>  out_fix:
> +	if (isblock && bp)
> +		libxfs_buf_relse(bp);
> +
>  	if (!no_modify && (fixit || dotdot_update)) {
> -		dir_hash_dup_names(hashtab);
> -		for (i = 0; i < num_bps; i++)
> -			if (bplist[i])
> -				libxfs_buf_relse(bplist[i]);
>  		longform_dir2_rebuild(mp, ino, ip, irec, ino_offset, hashtab);
>  		*num_illegal = 0;
>  		*need_dot = 0;
> @@ -2374,12 +2335,8 @@ out_fix:
>  		if (fixit || dotdot_update)
>  			do_warn(
>  	_("would rebuild directory inode %" PRIu64 "\n"), ino);
> -		for (i = 0; i < num_bps; i++)
> -			if (bplist[i])
> -				libxfs_buf_relse(bplist[i]);
>  	}
>  
> -	free(bplist);
>  	free(freetab);
>  }
>  
> -- 
> 2.28.0
> 


* Re: [PATCH 1/7] workqueue: bound maximum queue depth
  2020-10-22  5:54   ` Darrick J. Wong
@ 2020-10-22  8:11     ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  8:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Oct 21, 2020 at 10:54:25PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 22, 2020 at 04:15:31PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Existing users of workqueues have bound maximum queue depths in
> > their external algorithms (e.g. prefetch counts). For parallelising
> > work that doesn't have an external bound, allow workqueues to
> > throttle incoming requests at a maximum bound. bounded workqueues
> 
> Nit: capitalize the 'B' in 'bounded'.
> 
> > also need to distribute work over all worker threads themselves as
> > there is no external bounding or worker function throttling
> > provided.
> > 
> > Existing callers are not throttled and retain direct control of
> > worker threads, only users of the new create interface will be
> > throttled and concurrency managed.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  libfrog/workqueue.c | 42 +++++++++++++++++++++++++++++++++++++++---
> >  libfrog/workqueue.h |  4 ++++
> >  2 files changed, 43 insertions(+), 3 deletions(-)
> > 
> > diff --git a/libfrog/workqueue.c b/libfrog/workqueue.c
> > index fe3de4289379..e42b2a2f678b 100644
> > --- a/libfrog/workqueue.c
> > +++ b/libfrog/workqueue.c
> > @@ -40,13 +40,21 @@ workqueue_thread(void *arg)
> >  		}
> >  
> >  		/*
> > -		 *  Dequeue work from the head of the list.
> > +		 *  Dequeue work from the head of the list. If the queue was
> > +		 *  full then send a wakeup if we're configured to do so.
> >  		 */
> >  		assert(wq->item_count > 0);
> > +		if (wq->max_queued && wq->item_count == wq->max_queued)
> > +			pthread_cond_signal(&wq->queue_full);
> > +
> >  		wi = wq->next_item;
> >  		wq->next_item = wi->next;
> >  		wq->item_count--;
> >  
> > +		if (wq->max_queued && wq->next_item) {
> > +			/* more work, wake up another worker */
> > +			pthread_cond_signal(&wq->wakeup);
> 
> Hmm.  The net effect of this is that we wake up workers faster when a
> ton of work comes in, right?

Effectively. What it does is increase the concurrency of processing
only when the current worker threads cannot keep up with the
incoming work....

> And I bet none of the current workqueue
> users have suffered from this because they queue a bunch of work and
> then call workqueue_terminate, which wakes all the threads, and they
> never go to sleep again.
> 
> Does it make sense to simplify the test to "if (wq->next_item) {"?

Perhaps so, but I didn't want to make subtle changes to the way the
prefetch stuff works - that's a tangled ball of string that is easy
to deadlock and really hard to debug....

> Other than that, looks good!

Ta!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/7] repair: protect inode chunk tree records with a mutex
  2020-10-22  6:02   ` Darrick J. Wong
@ 2020-10-22  8:15     ` Dave Chinner
  2020-10-29 16:45       ` Darrick J. Wong
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  8:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Oct 21, 2020 at 11:02:56PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 22, 2020 at 04:15:33PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Phase 6 accesses inode chunk records mostly in an isolated manner.
> > However, when it finds a corruption in a directory or there are
> > multiple hardlinks to an inode, there can be concurrent access
> > to the inode chunk record to update state.
> > 
> > Hence the inode record itself needs a mutex. This protects all state
> > changes within the inode chunk record, as well as inode link counts
> > and chunk references. That allows us to process multiple chunks at
> > once, providing concurrency within an AG as well as across AGs.
> > 
> > The inode chunk tree itself is not modified in phase 6 - it's built
> 
> Well, that's not 100% true -- mk_orphanage can do that, but AFAICT
> that's outside the scope of the parallel processing (and I don't see
> much point in parallelizing that part) so I guess that's fine?

AFAICT, yes.

> > in phases 3 and 4  - and so we do not need to worry about locking
> > for AVL tree lookups to find the inode chunk records themselves.
> > Hence internal locking is all we need here.
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> TBH I wonder if all the phase6.c code to recreate the root dir, the
> orphanage, and the realtime inodes ought to get moved to another file,
> particularly since the metadata directory patches add quite a bit more
> stuff here?  But that's a topic for another patch...

Probably should and yes, separate patch :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/7] repair: don't duplicate names in phase 6
  2020-10-22  6:21   ` Darrick J. Wong
@ 2020-10-22  8:23     ` Dave Chinner
  2020-10-22 15:53       ` Darrick J. Wong
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2020-10-22  8:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Oct 21, 2020 at 11:21:52PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 22, 2020 at 04:15:35PM +1100, Dave Chinner wrote:
....
> > @@ -1387,6 +1371,7 @@ dir2_kill_block(
> >  		res_failed(error);
> >  	libxfs_trans_ijoin(tp, ip, 0);
> >  	libxfs_trans_bjoin(tp, bp);
> > +	libxfs_trans_bhold(tp, bp);
> 
> Why hold on to the buffer?  We killed the block, why keep the reference
> around so that someone else has to remember to drop it later?

Consistency in buffer handling. This "kill block" path nulled out
the buffer in the bplist array (which is the reason it's passed as a
**bpp to longform_dir2_entry_check_data()). This path releases the
buffer through the trans_commit(); the alternate path here:

> > @@ -1558,10 +1541,8 @@ longform_dir2_entry_check_data(
> >  			dir2_kill_block(mp, ip, da_bno, bp);
> >  		} else {
> >  			do_warn(_("would junk block\n"));
> > -			libxfs_buf_relse(bp);
> >  		}
> >  		freetab->ents[db].v = NULLDATAOFF;
> > -		*bpp = NULL;
> >  		return;
> >  	}

does an explicit release, and all other paths end up with the
buffers being released way back where the bplist is defined.

If the directory is in block form, nulling out the buffer in the
bplist here will result in dereferencing a null pointer later when
the buffer is pulled from bplist[0] without checking.

So I changed longform_dir2_entry_check_data() to never release the
buffer so that the caller knows that it has always got a valid
buffer reference and the isblock path will always work correctly....

Cheers,

Dave.


-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/7] repair: don't duplicate names in phase 6
  2020-10-22  8:23     ` Dave Chinner
@ 2020-10-22 15:53       ` Darrick J. Wong
  0 siblings, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-22 15:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 07:23:54PM +1100, Dave Chinner wrote:
> On Wed, Oct 21, 2020 at 11:21:52PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 22, 2020 at 04:15:35PM +1100, Dave Chinner wrote:
> ....
> > > @@ -1387,6 +1371,7 @@ dir2_kill_block(
> > >  		res_failed(error);
> > >  	libxfs_trans_ijoin(tp, ip, 0);
> > >  	libxfs_trans_bjoin(tp, bp);
> > > +	libxfs_trans_bhold(tp, bp);
> > 
> > Why hold on to the buffer?  We killed the block, why keep the reference
> > around so that someone else has to remember to drop it later?
> 
> Consistency in buffer handling. This "kill block" path nulled out
> the buffer in the bplist array (the reason it's passed as a **bpp
> to longform_dir2_entry_check_data(). This path releases the buffer
> through the trans_commit(), the alternate path here:
> 
> > > @@ -1558,10 +1541,8 @@ longform_dir2_entry_check_data(
> > >  			dir2_kill_block(mp, ip, da_bno, bp);
> > >  		} else {
> > >  			do_warn(_("would junk block\n"));
> > > -			libxfs_buf_relse(bp);
> > >  		}
> > >  		freetab->ents[db].v = NULLDATAOFF;
> > > -		*bpp = NULL;
> > >  		return;
> > >  	}
> 
> does an explicit release, and all other paths end up with the
> buffers being released way back where the bplist is defined.
> 
> If the directory is in block form, nulling out the buffer in the
> bplist here will result in dereferencing a null pointer later when
> the buffer is pulled from bplist[0] without checking.
> 
> So I changed longform_dir2_entry_check_data() to never release the
> buffer so that the caller knows that it has always got a valid
> buffer reference and the isblock path will always work correctly....

Hmmm, looking through the directory code, I see that it calls binval to
stale the buffer, so the kill_block caller won't be able to do much with
its incore reference.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> 
> Cheers,
> 
> Dave.
> 
> 
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 1/7] workqueue: bound maximum queue depth
  2020-10-22  5:15 ` [PATCH 1/7] workqueue: bound maximum queue depth Dave Chinner
  2020-10-22  5:54   ` Darrick J. Wong
@ 2020-10-25  4:41   ` Darrick J. Wong
  2020-10-26 22:29     ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-25  4:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 04:15:31PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Existing users of workqueues have bound maximum queue depths in
> their external algorithms (e.g. prefetch counts). For parallelising
> work that doesn't have an external bound, allow workqueues to
> throttle incoming requests at a maximum bound. bounded workqueues
> also need to distribute work over all worker threads themselves as
> there is no external bounding or worker function throttling
> provided.
> 
> Existing callers are not throttled and retain direct control of
> worker threads, only users of the new create interface will be
> throttled and concurrency managed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  libfrog/workqueue.c | 42 +++++++++++++++++++++++++++++++++++++++---
>  libfrog/workqueue.h |  4 ++++
>  2 files changed, 43 insertions(+), 3 deletions(-)
> 
> diff --git a/libfrog/workqueue.c b/libfrog/workqueue.c
> index fe3de4289379..e42b2a2f678b 100644
> --- a/libfrog/workqueue.c
> +++ b/libfrog/workqueue.c
> @@ -40,13 +40,21 @@ workqueue_thread(void *arg)
>  		}
>  
>  		/*
> -		 *  Dequeue work from the head of the list.
> +		 *  Dequeue work from the head of the list. If the queue was
> +		 *  full then send a wakeup if we're configured to do so.
>  		 */
>  		assert(wq->item_count > 0);
> +		if (wq->max_queued && wq->item_count == wq->max_queued)
> +			pthread_cond_signal(&wq->queue_full);
> +
>  		wi = wq->next_item;
>  		wq->next_item = wi->next;
>  		wq->item_count--;
>  
> +		if (wq->max_queued && wq->next_item) {
> +			/* more work, wake up another worker */
> +			pthread_cond_signal(&wq->wakeup);
> +		}
>  		pthread_mutex_unlock(&wq->lock);
>  
>  		(wi->function)(wi->queue, wi->index, wi->arg);
> @@ -58,10 +66,11 @@ workqueue_thread(void *arg)
>  
>  /* Allocate a work queue and threads.  Returns zero or negative error code. */
>  int
> -workqueue_create(
> +workqueue_create_bound(
>  	struct workqueue	*wq,
>  	void			*wq_ctx,
> -	unsigned int		nr_workers)
> +	unsigned int		nr_workers,
> +	unsigned int		max_queue)
>  {
>  	unsigned int		i;
>  	int			err = 0;
> @@ -70,12 +79,16 @@ workqueue_create(
>  	err = -pthread_cond_init(&wq->wakeup, NULL);
>  	if (err)
>  		return err;
> +	err = -pthread_cond_init(&wq->queue_full, NULL);
> +	if (err)
> +		goto out_wake;
>  	err = -pthread_mutex_init(&wq->lock, NULL);
>  	if (err)
>  		goto out_cond;
>  
>  	wq->wq_ctx = wq_ctx;
>  	wq->thread_count = nr_workers;
> +	wq->max_queued = max_queue;
>  	wq->threads = malloc(nr_workers * sizeof(pthread_t));
>  	if (!wq->threads) {
>  		err = -errno;
> @@ -102,10 +115,21 @@ workqueue_create(
>  out_mutex:
>  	pthread_mutex_destroy(&wq->lock);
>  out_cond:
> +	pthread_cond_destroy(&wq->queue_full);
> +out_wake:
>  	pthread_cond_destroy(&wq->wakeup);
>  	return err;
>  }
>  
> +int
> +workqueue_create(
> +	struct workqueue	*wq,
> +	void			*wq_ctx,
> +	unsigned int		nr_workers)
> +{
> +	return workqueue_create_bound(wq, wq_ctx, nr_workers, 0);
> +}
> +
>  /*
>   * Create a work item consisting of a function and some arguments and schedule
>   * the work item to be run via the thread pool.  Returns zero or a negative
> @@ -140,6 +164,7 @@ workqueue_add(
>  
>  	/* Now queue the new work structure to the work queue. */
>  	pthread_mutex_lock(&wq->lock);
> +restart:
>  	if (wq->next_item == NULL) {
>  		assert(wq->item_count == 0);
>  		ret = -pthread_cond_signal(&wq->wakeup);
> @@ -150,6 +175,16 @@ workqueue_add(
>  		}
>  		wq->next_item = wi;
>  	} else {
> +		/* throttle on a full queue if configured */
> +		if (wq->max_queued && wq->item_count == wq->max_queued) {
> +			pthread_cond_wait(&wq->queue_full, &wq->lock);

I ported xfs_scrub to use max_queued for the inode scanner, and got a
hang here.  It uses two workqueues -- the first is an unbounded workqueue
that receives one work item per AG in which each work item calls
INUMBERS, creates a work item for the returned inode chunk, and throws
it at the second workqueue.  The second workqueue is a bounded workqueue
that calls BULKSTAT on the INUMBERS work item and then calls the
iteration function on each bulkstat record returned.
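
Roughly, the shape of that scan looks like the sketch below.  It only
uses the workqueue API from this patch; scan_inodes(),
next_inode_chunk(), the queue bound of 64 and the thread counts are
placeholders, not the real xfs_scrub code, and error handling is
elided:

	#include <stdint.h>
	#include "libfrog/workqueue.h"

	struct scan_ctx {
		struct workqueue	bulkstat_wq;	/* bounded */
	};

	/* placeholder for the INUMBERS loop; NULL when the AG is exhausted */
	extern void *next_inode_chunk(uint32_t agno);

	static void
	bulkstat_worker(
		struct workqueue	*wq,
		uint32_t		agno,
		void			*arg)
	{
		/* BULKSTAT the inode chunk described by 'arg', then call
		 * the iteration function on each record returned. */
	}

	static void
	inumbers_worker(
		struct workqueue	*wq,
		uint32_t		agno,
		void			*arg)
	{
		struct scan_ctx		*sc = arg;
		void			*chunk;

		while ((chunk = next_inode_chunk(agno)) != NULL) {
			/* throttles here once bulkstat_wq hits its bound */
			workqueue_add(&sc->bulkstat_wq, bulkstat_worker,
					agno, chunk);
		}
	}

	void
	scan_inodes(
		void			*mp,		/* wq_ctx */
		uint32_t		agcount,
		unsigned int		nr_threads)
	{
		struct workqueue	inumbers_wq;	/* unbounded */
		struct scan_ctx		sc;
		uint32_t		agno;

		workqueue_create_bound(&sc.bulkstat_wq, mp, nr_threads, 64);
		workqueue_create(&inumbers_wq, mp, nr_threads);

		/* one work item per AG feeds the bounded bulkstat queue */
		for (agno = 0; agno < agcount; agno++)
			workqueue_add(&inumbers_wq, inumbers_worker, agno, &sc);

		/* finish the INUMBERS side before tearing down bulkstat */
		workqueue_terminate(&inumbers_wq);
		workqueue_destroy(&inumbers_wq);
		workqueue_terminate(&sc.bulkstat_wq);
		workqueue_destroy(&sc.bulkstat_wq);
	}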

The hang happens when the inumbers workqueue has more than one thread
running.  Both* threads notice the full workqueue and wait on
queue_full.  One of the workers in the second workqueue goes to pull off
the next work item, ends up in this if body, signals one of the sleeping
threads, and starts calling bulkstat.

In the time it takes to wake up the sleeping thread from wq 1, the
second workqueue pulls far enough ahead that the single thread from wq1
never manages to fill wq2 again.  Often, the wq1 thread was sleeping so
that it could add the last inode chunk of that AG to wq2.  We therefore
never wake up the *other* sleeping thread from wq1, and the whole app
stalls.

I dunno if that's a sane way to structure an inumbers/bulkstat scan, but
it seemed reasonable to me.  I can envision two possible fixes here: (1)
use pthread_cond_broadcast to wake everything up; or (2) always call
pthread_cond_wait when we pull a work item off the queue.  Thoughts?
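
For reference, (1) would be a one-line change in the dequeue path,
something like (untested):

		/* queue was full: wake all throttled producers, not just one */
		if (wq->max_queued && wq->item_count == wq->max_queued)
			pthread_cond_broadcast(&wq->queue_full);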

--D

*by "both threads" I really mean "the other 31 threads that are asleep
trying to add things to wq2" but afaict this is a general problem.

> +			/*
> +			 * Queue might be empty or even still full by the time
> +			 * we get the lock back, so restart the lookup so we do
> +			 * the right thing with the current state of the queue.
> +			 */
> +			goto restart;
> +		}
>  		wq->last_item->next = wi;
>  	}
>  	wq->last_item = wi;
> @@ -201,5 +236,6 @@ workqueue_destroy(
>  	free(wq->threads);
>  	pthread_mutex_destroy(&wq->lock);
>  	pthread_cond_destroy(&wq->wakeup);
> +	pthread_cond_destroy(&wq->queue_full);
>  	memset(wq, 0, sizeof(*wq));
>  }
> diff --git a/libfrog/workqueue.h b/libfrog/workqueue.h
> index a56d1cf14081..a9c108d0e66a 100644
> --- a/libfrog/workqueue.h
> +++ b/libfrog/workqueue.h
> @@ -31,10 +31,14 @@ struct workqueue {
>  	unsigned int		thread_count;
>  	bool			terminate;
>  	bool			terminated;
> +	int			max_queued;
> +	pthread_cond_t		queue_full;
>  };
>  
>  int workqueue_create(struct workqueue *wq, void *wq_ctx,
>  		unsigned int nr_workers);
> +int workqueue_create_bound(struct workqueue *wq, void *wq_ctx,
> +		unsigned int nr_workers, unsigned int max_queue);
>  int workqueue_add(struct workqueue *wq, workqueue_func_t fn,
>  		uint32_t index, void *arg);
>  int workqueue_terminate(struct workqueue *wq);
> -- 
> 2.28.0
> 


* Re: [PATCH 1/7] workqueue: bound maximum queue depth
  2020-10-25  4:41   ` Darrick J. Wong
@ 2020-10-26 22:29     ` Dave Chinner
  2020-10-26 22:40       ` Darrick J. Wong
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2020-10-26 22:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sat, Oct 24, 2020 at 09:41:14PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 22, 2020 at 04:15:31PM +1100, Dave Chinner wrote:
> > @@ -140,6 +164,7 @@ workqueue_add(
> >  
> >  	/* Now queue the new work structure to the work queue. */
> >  	pthread_mutex_lock(&wq->lock);
> > +restart:
> >  	if (wq->next_item == NULL) {
> >  		assert(wq->item_count == 0);
> >  		ret = -pthread_cond_signal(&wq->wakeup);
> > @@ -150,6 +175,16 @@ workqueue_add(
> >  		}
> >  		wq->next_item = wi;
> >  	} else {
> > +		/* throttle on a full queue if configured */
> > +		if (wq->max_queued && wq->item_count == wq->max_queued) {
> > +			pthread_cond_wait(&wq->queue_full, &wq->lock);
> 
> I ported xfs_scrub to use max_queued for the inode scanner, and got a
> hang here.  It uses two workqueues -- the first is an unbounded workqueue
> that receives one work item per AG in which each work item calls
> INUMBERS, creates a work item for the returned inode chunk, and throws
> it at the second workqueue.  The second workqueue is a bounded workqueue
> that calls BULKSTAT on the INUMBERS work item and then calls the
> iteration function on each bulkstat record returned.
> 
> The hang happens when the inumbers workqueue has more than one thread
> running.

IIUC, that means you have multiple producer threads? IIRC, the usage
in this patchset is single producer, so it won't hit this problem...

> Both* threads notice the full workqueue and wait on
> queue_full.  One of the workers in the second workqueue goes to pull off
> the next work item, ends up in this if body, signals one of the sleeping
> threads, and starts calling bulkstat.
> 
> In the time it takes to wake up the sleeping thread from wq 1, the
> second workqueue pulls far enough ahead that the single thread from wq1
> never manages to fill wq2 again.  Often, the wq1 thread was sleeping so
> that it could add the last inode chunk of that AG to wq2.  We therefore
> never wake up the *other* sleeping thread from wq1, and the whole app
> stalls.
> 
> I dunno if that's a sane way to structure an inumbers/bulkstat scan, but
> it seemed reasonable to me.  I can envision two possible fixes here: (1)
> use pthread_cond_broadcast to wake everything up; or (2) always call
> pthread_cond_wait when we pull a work item off the queue.  Thoughts?

pthread_cond_broadcast() makes more sense, but I suspect there will
be other issues with multiple producers that render the throttling
ineffective. I suspect supporting multiple producers should be a
separate patchset...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/7] workqueue: bound maximum queue depth
  2020-10-26 22:29     ` Dave Chinner
@ 2020-10-26 22:40       ` Darrick J. Wong
  2020-10-26 22:57         ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-26 22:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 27, 2020 at 09:29:43AM +1100, Dave Chinner wrote:
> On Sat, Oct 24, 2020 at 09:41:14PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 22, 2020 at 04:15:31PM +1100, Dave Chinner wrote:
> > > @@ -140,6 +164,7 @@ workqueue_add(
> > >  
> > >  	/* Now queue the new work structure to the work queue. */
> > >  	pthread_mutex_lock(&wq->lock);
> > > +restart:
> > >  	if (wq->next_item == NULL) {
> > >  		assert(wq->item_count == 0);
> > >  		ret = -pthread_cond_signal(&wq->wakeup);
> > > @@ -150,6 +175,16 @@ workqueue_add(
> > >  		}
> > >  		wq->next_item = wi;
> > >  	} else {
> > > +		/* throttle on a full queue if configured */
> > > +		if (wq->max_queued && wq->item_count == wq->max_queued) {
> > > +			pthread_cond_wait(&wq->queue_full, &wq->lock);
> > 
> > I ported xfs_scrub to use max_queued for the inode scanner, and got a
> > hang here.  It uses two workqueues -- the first is an unbounded workqueue
> > that receives one work item per AG in which each work item calls
> > INUMBERS, creates a work item for the returned inode chunk, and throws
> > it at the second workqueue.  The second workqueue is a bounded workqueue
> > that calls BULKSTAT on the INUMBERS work item and then calls the
> > iteration function on each bulkstat record returned.
> > 
> > The hang happens when the inumbers workqueue has more than one thread
> > running.
> 
> IIUC, that means you have multiple producer threads? IIRC, the usage
> in this patchset is single producer, so it won't hit this problem...

Right, there are multiple producers, because that seemed like fun.
At this stage, I'm merely using this patch in anger (as it were) to
prototype a fix to scrub.

I'm not even sure it's a reasonable enhancement for xfs_scrub since each
of those bulkstat workers will then fight with the inumbers workers for
the AGI, but so it goes.  It does seem to eliminate the problem of one
thread working hard on an AG that has one huge fragmented file while the
other threads sit idle, but otherwise I mostly just see more buffer lock
contention.  The total runtime doesn't change on more balanced
filesystems, at least. :P

> > Both* threads notice the full workqueue and wait on
> > queue_full.  One of the workers in the second workqueue goes to pull off
> > the next work item, ends up in this if body, signals one of the sleeping
> > threads, and starts calling bulkstat.
> > 
> > In the time it takes to wake up the sleeping thread from wq 1, the
> > second workqueue pulls far enough ahead that the single thread from wq1
> > never manages to fill wq2 again.  Often, the wq1 thread was sleeping so
> > that it could add the last inode chunk of that AG to wq2.  We therefore
> > never wake up the *other* sleeping thread from wq1, and the whole app
> > stalls.
> > 
> > I dunno if that's a sane way to structure an inumbers/bulkstat scan, but
> > it seemed reasonable to me.  I can envision two possible fixes here: (1)
> > use pthread_cond_broadcast to wake everything up; or (2) always call
> > pthread_cond_wait when we pull a work item off the queue.  Thoughts?
> 
> pthread_cond_broadcast() makes more sense, but I suspect there will
> be other issues with multiple producers that render the throttling
> ineffective. I suspect supporting multiple producers should be a
> separate patchset...

<nod> Making change (2) seems to work for multiple producers, but I
guess I'll keep poking at perf to see if I discover anything exciting.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 1/7] workqueue: bound maximum queue depth
  2020-10-26 22:40       ` Darrick J. Wong
@ 2020-10-26 22:57         ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2020-10-26 22:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Oct 26, 2020 at 03:40:51PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 27, 2020 at 09:29:43AM +1100, Dave Chinner wrote:
> > On Sat, Oct 24, 2020 at 09:41:14PM -0700, Darrick J. Wong wrote:
> > > On Thu, Oct 22, 2020 at 04:15:31PM +1100, Dave Chinner wrote:
> > > > @@ -140,6 +164,7 @@ workqueue_add(
> > > >  
> > > >  	/* Now queue the new work structure to the work queue. */
> > > >  	pthread_mutex_lock(&wq->lock);
> > > > +restart:
> > > >  	if (wq->next_item == NULL) {
> > > >  		assert(wq->item_count == 0);
> > > >  		ret = -pthread_cond_signal(&wq->wakeup);
> > > > @@ -150,6 +175,16 @@ workqueue_add(
> > > >  		}
> > > >  		wq->next_item = wi;
> > > >  	} else {
> > > > +		/* throttle on a full queue if configured */
> > > > +		if (wq->max_queued && wq->item_count == wq->max_queued) {
> > > > +			pthread_cond_wait(&wq->queue_full, &wq->lock);
> > > 
> > > I ported xfs_scrub to use max_queued for the inode scanner, and got a
> > > hang here.  It uses two workqueues -- the first is an unbounded workqueue
> > > that receives one work item per AG in which each work item calls
> > > INUMBERS, creates a work item for the returned inode chunk, and throws
> > > it at the second workqueue.  The second workqueue is a bounded workqueue
> > > that calls BULKSTAT on the INUMBERS work item and then calls the
> > > iteration function on each bulkstat record returned.
> > > 
> > > The hang happens when the inumbers workqueue has more than one thread
> > > running.
> > 
> > IIUC, that means you have multiple producer threads? IIRC, the usage
> > in this patchset is single producer, so it won't hit this problem...
> 
> Right, there are multiple producers, because that seemed like fun.
> At this stage, I'm merely using this patch in anger (as it were) to
> prototype a fix to scrub.
> 
> I'm not even sure it's a reasonable enhancement for xfs_scrub since each
> of those bulkstat workers will then fight with the inumbers workers for
> the AGI, but so it goes.  It does seem to eliminate the problem of one
> thread working hard on an AG that has one huge fragmented file while the
> other threads sit idle, but otherwise I mostly just see more buffer lock
> contention.

Yeah, that's exactly the problem it solves for phase6, too. i.e. it
stops a single huge directory from preventing other directories in
that AG from being processed even when everything else is done.

> The total runtime doesn't change on more balanced
> filesystems, at least. :P

That's a start :)

> > > I dunno if that's a sane way to structure an inumbers/bulkstat scan, but
> > > it seemed reasonable to me.  I can envision two possible fixes here: (1)
> > > use pthread_cond_broadcast to wake everything up; or (2) always call
> > > pthread_cond_wait when we pull a work item off the queue.  Thoughts?
> > 
> > pthread_cond_broadcast() makes more sense, but I suspect there will
> > be other issues with multiple producers that render the throttling
> > ineffective. I suspect supporting multiple producers should be a
> > separate patchset...
> 
> <nod> Making change (2) seems to work for multiple producers, but I
> guess I'll keep poking at perf to see if I discover anything exciting.

I'll look at the multiple producer thing in a bit more detail later
today...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/7] repair: parallelise phase 6
  2020-10-22  6:11   ` Darrick J. Wong
@ 2020-10-27  5:10     ` Dave Chinner
  2020-10-29 17:20       ` Darrick J. Wong
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2020-10-27  5:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Oct 21, 2020 at 11:11:00PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 22, 2020 at 04:15:34PM +1100, Dave Chinner wrote:
> > +static void
> > +traverse_function(
> > +	struct workqueue	*wq,
> > +	xfs_agnumber_t		agno,
> > +	void			*arg)
> > +{
> > +	struct ino_tree_node	*irec;
> >  	prefetch_args_t		*pf_args = arg;
> > +	struct workqueue	lwq;
> > +	struct xfs_mount	*mp = wq->wq_ctx;
> > +
> >  
> >  	wait_for_inode_prefetch(pf_args);
> >  
> >  	if (verbose)
> >  		do_log(_("        - agno = %d\n"), agno);
> >  
> > +	/*
> > +	 * The more AGs we have in flight at once, the fewer processing threads
> > +	 * per AG. This means we don't overwhelm the machine with hundreds of
> > +	 * threads when we start acting on lots of AGs at once. We just want
> > +	 * enough that we can keep multiple CPUs busy across multiple AGs.
> > +	 */
> > +	workqueue_create_bound(&lwq, mp, ag_stride, 1000);
> 
> Eeeeee, magic number! :)
> 
> /me tosses in obligatory hand-wringing about 2000 CPU systems running
> out of work.  How about ag_stride * 50 or something? :P

ag_stride already determines concurrency via how many AGs are being
scanned at once. However, it provides no insight into the depth of
the queue we need to use per AG.

What this magic number does is bound how deep the work queue gets
before we ask another worker thread to start also processing the
queue. We've already got async threads doing inode prefetch, so the
bound here throttles the rate at which inodes are
prefetched into the buffer cache. In general, we're going to be IO
bound and waiting on readahead rather than throttling on processing
the inodes, so all this bound is doing is preventing readahead from
running too far ahead of processing and potentially causing cache
thrashing.

However, we don't want to go using lots of threads to parallelise
the work within the AG when we have already parallelised across AGs.
We want the initial worker thread per AG to just keep working away
burning CPU while the prefetch code is blocking waiting for more
inodes from disk. Then we get another burst of work being queued,
and so on.

Hence the queue needs to be quite deep so that we can soak up the
bursts of processing that readahead triggers without asking lots of
worker threads to do work. However, if the worker thread hits some
big directories and starts falling behind readahead, that's when it
will hit the maximum queue depth and kick another thread to do work.

IOWs, the queue depth needs to be deep enough to prevent bursts from
triggering extra workers from running, but shallow enough that extra
workers will be scheduled when processing falls behind readahead.

I really don't have a good way to automatically calculate this
depth. I just figure that if we have 1000 inodes queued up for
processing, we really should kick another thread to start working on
them. It's a simple solution, so I'd like to see if we have problems
with this simple threshold before we try to replace the magic number
with a magic heuristic....
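
To make the shape of that concrete, the per-AG driver ends up looking
roughly like the sketch below. Only the workqueue calls match the
series; first_dir_irec(), next_dir_irec() and process_dir_irec() are
stand-ins for the real inode record walk and worker function:

	struct workqueue	lwq;
	struct ino_tree_node	*irec;

	/* ag_stride workers for this AG, producers throttled at 1000 */
	workqueue_create_bound(&lwq, mp, ag_stride, 1000);

	for (irec = first_dir_irec(agno); irec; irec = next_dir_irec(irec)) {
		/*
		 * Once 1000 records are queued this blocks, which throttles
		 * the readahead feeding us and is the point where another
		 * worker is asked to help drain the queue.
		 */
		workqueue_add(&lwq, process_dir_irec, agno, irec);
	}

	/* let the queued records drain, then tear the queue down */
	workqueue_terminate(&lwq);
	workqueue_destroy(&lwq);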

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] repair: Protect bad inode list with mutex
  2020-10-22  5:15 ` [PATCH 2/7] repair: Protect bad inode list with mutex Dave Chinner
  2020-10-22  5:45   ` Darrick J. Wong
@ 2020-10-29  9:35   ` Christoph Hellwig
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-29  9:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 04:15:32PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To enable phase 6 parallelisation, we need to protect the bad inode
> list from concurrent modification and/or access. Wrap it with a
> mutex and clean up the nasty typedefs.

The patch itself looks good, but if you touch this code anyway, the
linked list here seems like an incredibly suboptimal data structure.
Even just a simple array that gets realloced would seem better.
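
Purely illustrative, with made-up names and do_error()/the usual repair
includes assumed - the sort of thing I mean is:

static pthread_mutex_t	bad_ino_lock = PTHREAD_MUTEX_INITIALIZER;
static xfs_ino_t	*bad_inos;
static size_t		bad_ino_count;
static size_t		bad_ino_alloc;

static void
bad_ino_add(
	xfs_ino_t	ino)
{
	pthread_mutex_lock(&bad_ino_lock);
	if (bad_ino_count == bad_ino_alloc) {
		/* grow geometrically so realloc stays rare */
		size_t	n = bad_ino_alloc ? bad_ino_alloc * 2 : 64;

		bad_inos = realloc(bad_inos, n * sizeof(*bad_inos));
		if (!bad_inos)
			do_error(_("realloc failed for bad inode list\n"));
		bad_ino_alloc = n;
	}
	bad_inos[bad_ino_count++] = ino;
	pthread_mutex_unlock(&bad_ino_lock);
}

static int
bad_ino_present(
	xfs_ino_t	ino)
{
	int	found = 0;
	size_t	i;

	pthread_mutex_lock(&bad_ino_lock);
	for (i = 0; i < bad_ino_count; i++) {
		if (bad_inos[i] == ino) {
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&bad_ino_lock);
	return found;
}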

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] repair: don't duplicate names in phase 6
  2020-10-22  5:15 ` [PATCH 5/7] repair: don't duplicate names in " Dave Chinner
  2020-10-22  6:21   ` Darrick J. Wong
@ 2020-10-29  9:39   ` Christoph Hellwig
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-29  9:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> -	xfs_ino_t 		inum;		/* inode num of entry */
> +	xfs_ino_t		inum;		/* inode num of entry */

spurious unrelated whitespace change?

> +	/* Set up the name in the region trailing the hash entry. */
> +	memcpy(p + 1, name, namelen);
> +	p->name.name = (const unsigned char *)(p + 1);
> +	p->name.len = namelen;
> +	p->name.type = ftype;

just add a

	char namebuf[];

to the end of struct dir_hash_ent and use that.  This has the same
memory usage, but avoids the cast and pointer arithmetic magic.
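
i.e. roughly (sketch only, with the unrelated fields omitted):

struct dir_hash_ent {
	...
	struct xfs_name		name;
	char			namebuf[];	/* name bytes live here */
};

	p = malloc(sizeof(*p) + namelen);
	if (!p)
		do_error(_("malloc failed in dir_hash_add (%zu bytes)\n"),
				sizeof(*p) + namelen);

	/* Set up the name in the flexible array member. */
	memcpy(p->namebuf, name, namelen);
	p->name.name = (const unsigned char *)p->namebuf;
	p->name.len = namelen;
	p->name.type = ftype;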

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/7] repair: scale duplicate name checking in phase 6.
  2020-10-22  5:15 ` [PATCH 7/7] repair: scale duplicate name checking in phase 6 Dave Chinner
@ 2020-10-29 16:29   ` Darrick J. Wong
  0 siblings, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-29 16:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 04:15:37PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> phase 6 on large directories is cpu bound on duplicate name checking
> due to the algorithm having effectively O(n^2) scalability. Hence
> when the duplicate name hash table  size is far smaller than the
> number of directory entries, we end up with long hash chains that
> are searched linearly on every new entry that is found in the
> directory to do duplicate detection.
> 
> The in-memory hash table size is limited to 64k entries. Hence when
> we have millions of entries in a directory, duplicate entry lookups
> on the hash table have substantial overhead. Scale this table out to
> larger sizes so that we keep the chain lengths short and hence the
> O(n^2) scalability impact is limited because N is always small.
> 
> For a 10M entry directory consuming 400MB of directory data, the
> hash table now sizes at 6.4 million entries instead of ~64k - it is
> ~100x larger. While the hash table now consumes ~50MB of RAM, the
> xfs_repair footprint barely changes as it's already consuming
> ~9GB of RAM at this point in time. IOWs, the incremental memory
> usage change is noise, but the directory checking time:
> 
> Unpatched:
> 
>   97.11%  xfs_repair          [.] dir_hash_add
>    0.38%  xfs_repair          [.] longform_dir2_entry_check_data
>    0.34%  libc-2.31.so        [.] __libc_calloc
>    0.32%  xfs_repair          [.] avl_ino_start
> 
> Phase 6:        10/22 12:11:40  10/22 12:14:28  2 minutes, 48 seconds
> 
> Patched:
> 
>   46.74%  xfs_repair          [.] radix_tree_lookup
>   32.13%  xfs_repair          [.] dir_hash_see_all
>    7.70%  xfs_repair          [.] radix_tree_tag_get
>    3.92%  xfs_repair          [.] dir_hash_add
>    3.52%  xfs_repair          [.] radix_tree_tag_clear
>    2.43%  xfs_repair          [.] crc32c_le
> 
> Phase 6:        10/22 13:11:01  10/22 13:11:18  17 seconds
> 
> has been reduced by an order of magnitude.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/phase6.c | 30 ++++++++++++++++++++++++------
>  1 file changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/repair/phase6.c b/repair/phase6.c
> index 21f49dd748e1..7dd6130056ee 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -288,19 +288,37 @@ dir_hash_done(
>  	free(hashtab);
>  }
>  
> +/*
> + * Create a directory hash index structure based on the size of the directory we
> + * are about to try to repair. The size passed in is the size of the data
> + * segment of the directory in bytes, so we don't really know exactly how many
> + * entries are in it. Hence assume an entry size of around 64 bytes - that's a
> + * name length of 40+ bytes so should cover a most situations with large
> + * really directories.

"...with really large directories." ?

> + */
>  static struct dir_hash_tab *
>  dir_hash_init(
>  	xfs_fsize_t		size)
>  {
> -	struct dir_hash_tab	*hashtab;
> +	struct dir_hash_tab	*hashtab = NULL;
>  	int			hsize;
>  
> -	hsize = size / (16 * 4);
> -	if (hsize > 65536)
> -		hsize = 63336;
> -	else if (hsize < 16)
> +	hsize = size / 64;
> +	if (hsize < 16)
>  		hsize = 16;

Since I'm not that familiar with the directory hash table, I'm curious
about our choice of hash table size (which is hsize, right?).  IIRC most
CS textbooks tell you to pick a "clever" hash table size that's a prime
number just in case the values are unevenly distributed.

I don't know if that's the case here because I haven't studied the
directory name hash in detail, but I wonder, do we have a way to measure
the length of the hash chains?  Or at least the evenness of them?

I'm vaguely wondering if there's more gains to be had here.  Mmm science
projects...
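
(A quick hack along these lines would answer the chain length question,
fwiw - sketch only, written against the dir_hash_tab/dir_hash_ent
fields used in this series:)

static void
dir_hash_chain_stats(
	struct dir_hash_tab	*hashtab)
{
	int			i;
	int			used = 0;
	int			maxlen = 0;
	long			total = 0;

	for (i = 0; i < hashtab->size; i++) {
		struct dir_hash_ent	*p;
		int			len = 0;

		for (p = hashtab->byhash[i]; p; p = p->nextbyhash)
			len++;
		if (!len)
			continue;
		used++;
		total += len;
		if (len > maxlen)
			maxlen = len;
	}
	fprintf(stderr,
		"dir hash: %d/%d buckets used, avg chain %.1f, max chain %d\n",
		used, hashtab->size,
		used ? (double)total / used : 0.0, maxlen);
}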

(The rest of the code here looks reasonable to me, fwiw.)

--D

> -	if ((hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1)) == NULL)
> +
> +	/*
> +	 * Try to allocate as large a hash table as possible. Failure to
> +	 * allocate isn't fatal, it will just result in slower performance as we
> +	 * reduce the size of the table.
> +	 */
> +	while (hsize >= 16) {
> +		hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1);
> +		if (hashtab)
> +			break;
> +		hsize /= 2;
> +	}
> +	if (!hashtab)
>  		do_error(_("calloc failed in dir_hash_init\n"));
>  	hashtab->size = hsize;
>  	hashtab->byhash = (struct dir_hash_ent **)((char *)hashtab +
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] repair: convert the dir byaddr hash to a radix tree
  2020-10-22  5:15 ` [PATCH 6/7] repair: convert the dir byaddr hash to a radix tree Dave Chinner
@ 2020-10-29 16:41   ` Darrick J. Wong
  0 siblings, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-29 16:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 04:15:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Phase 6 uses a hash table to track the data segment addresses of the
> entries it has seen in a directory. This is indexed by the offset
> into the data segment for the dirent, and is used to check if the
> entry exists, is a duplicate or has a bad hash value. The lookup
> operations involve walking long hash chains on large directories and
> they are done for every entry in the directory. This means certain
> operations have O(n^2) scalability (or worse!) and hence hurt on
> very large directories.
> 
> It is also used to determine if the directory has unseen entries,
> which involves a full hash traversal that is very expensive on large
> directories. Hence checking the directory for unseen entries ends up
> being roughly an O(n^2 + n) algorithm.
> 
> Switch the byaddr indexing to a radix tree. While a radix tree will
> burn more memory than the linked list, it gives us O(log n) lookup
> operations instead of O(n) on large directories, and using tags
> gives us O(1) determination of whether all entries have been seen or
> not. This brings the "entry seen" algorithm scalability back to
> O(n log n) and so is a major improvement for processing large
> directories.
> 
> Given a filesystem with 10M empty files in a single directory, we
> see:
> 
> 5.6.0:
> 
>   97.56%  xfs_repair              [.] dir_hash_add.lto_priv.0
>    0.38%  xfs_repair              [.] avl_ino_start.lto_priv.0
>    0.37%  libc-2.31.so            [.] malloc
>    0.34%  xfs_repair              [.] longform_dir2_entry_check_data.lto_priv.0
> 
> Phase 6:        10/22 12:07:13  10/22 12:10:51  3 minutes, 38 seconds
> 
> Patched:
> 
>   97.11%  xfs_repair          [.] dir_hash_add
>    0.38%  xfs_repair          [.] longform_dir2_entry_check_data
>    0.34%  libc-2.31.so        [.] __libc_calloc
>    0.32%  xfs_repair          [.] avl_ino_start
> 
> Phase 6:        10/22 12:11:40  10/22 12:14:28  2 minutes, 48 seconds
> 
> So there's some improvement, but we are clearly still CPU bound due
> to the O(n^2) scalability of the duplicate name checking algorithm.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

woooooooo
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  libfrog/radix-tree.c |  46 +++++++++
>  repair/phase6.c      | 222 ++++++++++++++++++++-----------------------
>  2 files changed, 148 insertions(+), 120 deletions(-)
> 
> diff --git a/libfrog/radix-tree.c b/libfrog/radix-tree.c
> index c1c74876964c..261fc2487de9 100644
> --- a/libfrog/radix-tree.c
> +++ b/libfrog/radix-tree.c
> @@ -312,6 +312,52 @@ void *radix_tree_lookup_first(struct radix_tree_root *root, unsigned long *index
>  
>  #ifdef RADIX_TREE_TAGS
>  
> +/**
> + * radix_tree_tag_get - get a tag on a radix tree node
> + * @root:		radix tree root
> + * @index:		index key
> + * @tag:		tag index (< RADIX_TREE_MAX_TAGS)
> + *
> + * Return values:
> + *
> + *  0: tag not present or not set
> + *  1: tag set
> + *
> + * Note that the return value of this function may not be relied on, even if
> + * the RCU lock is held, unless tag modification and node deletion are excluded
> + * from concurrency.
> + */
> +int radix_tree_tag_get(struct radix_tree_root *root,
> +			unsigned long index, unsigned int tag)
> +{
> +	unsigned int height, shift;
> +	struct radix_tree_node *slot;
> +
> +	height = root->height;
> +	if (index > radix_tree_maxindex(height))
> +		return 0;
> +
> +	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> +	slot = root->rnode;
> +
> +	while (height > 0) {
> +		int offset;
> +
> +		if (slot == NULL)
> +			return 0;
> +
> +		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
> +		if (!tag_get(slot, tag, offset))
> +			return 0;
> +
> +		slot = slot->slots[offset];
> +		ASSERT(slot != NULL);
> +		shift -= RADIX_TREE_MAP_SHIFT;
> +		height--;
> +	}
> +	return 1;
> +}
> +
>  /**
>   *	radix_tree_tag_set - set a tag on a radix tree node
>   *	@root:		radix tree root
> diff --git a/repair/phase6.c b/repair/phase6.c
> index 79c87495656f..21f49dd748e1 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -66,8 +66,7 @@ add_dotdot_update(
>   * and whether their leaf entry has been seen. Also used for name
>   * duplicate checking and rebuilding step if required.
>   */
> -typedef struct dir_hash_ent {
> -	struct dir_hash_ent	*nextbyaddr;	/* next in addr bucket */
> +struct dir_hash_ent {
>  	struct dir_hash_ent	*nextbyhash;	/* next in name bucket */
>  	struct dir_hash_ent	*nextbyorder;	/* next in order added */
>  	xfs_dahash_t		hashval;	/* hash value of name */
> @@ -76,18 +75,19 @@ typedef struct dir_hash_ent {
>  	short			junkit;		/* name starts with / */
>  	short			seen;		/* have seen leaf entry */
>  	struct xfs_name		name;
> -} dir_hash_ent_t;
> +};
>  
> -typedef struct dir_hash_tab {
> +struct dir_hash_tab {
>  	int			size;		/* size of hash tables */
> -	dir_hash_ent_t		*first;		/* ptr to first added entry */
> -	dir_hash_ent_t		*last;		/* ptr to last added entry */
> -	dir_hash_ent_t		**byhash;	/* ptr to name hash buckets */
> -	dir_hash_ent_t		**byaddr;	/* ptr to addr hash buckets */
> -} dir_hash_tab_t;
> +	struct dir_hash_ent	*first;		/* ptr to first added entry */
> +	struct dir_hash_ent	*last;		/* ptr to last added entry */
> +	struct dir_hash_ent	**byhash;	/* ptr to name hash buckets */
> +#define HT_UNSEEN		1
> +	struct radix_tree_root	byaddr;
> +};
>  
>  #define	DIR_HASH_TAB_SIZE(n)	\
> -	(sizeof(dir_hash_tab_t) + (sizeof(dir_hash_ent_t *) * (n) * 2))
> +	(sizeof(struct dir_hash_tab) + (sizeof(struct dir_hash_ent *) * (n)))
>  #define	DIR_HASH_FUNC(t,a)	((a) % (t)->size)
>  
>  /*
> @@ -154,8 +154,8 @@ dir_read_buf(
>   */
>  static int
>  dir_hash_add(
> -	xfs_mount_t		*mp,
> -	dir_hash_tab_t		*hashtab,
> +	struct xfs_mount	*mp,
> +	struct dir_hash_tab	*hashtab,
>  	uint32_t		addr,
>  	xfs_ino_t		inum,
>  	int			namelen,
> @@ -163,19 +163,18 @@ dir_hash_add(
>  	uint8_t			ftype)
>  {
>  	xfs_dahash_t		hash = 0;
> -	int			byaddr;
>  	int			byhash = 0;
> -	dir_hash_ent_t		*p;
> +	struct dir_hash_ent	*p;
>  	int			dup;
>  	short			junk;
>  	struct xfs_name		xname;
> +	int			error;
>  
>  	xname.name = name;
>  	xname.len = namelen;
>  	xname.type = ftype;
>  
>  	junk = name[0] == '/';
> -	byaddr = DIR_HASH_FUNC(hashtab, addr);
>  	dup = 0;
>  
>  	if (!junk) {
> @@ -205,8 +204,14 @@ dir_hash_add(
>  		do_error(_("malloc failed in dir_hash_add (%zu bytes)\n"),
>  			sizeof(*p));
>  
> -	p->nextbyaddr = hashtab->byaddr[byaddr];
> -	hashtab->byaddr[byaddr] = p;
> +	error = radix_tree_insert(&hashtab->byaddr, addr, p);
> +	if (error == EEXIST) {
> +		do_warn(_("duplicate addrs %u in directory!\n"), addr);
> +		free(p);
> +		return 0;
> +	}
> +	radix_tree_tag_set(&hashtab->byaddr, addr, HT_UNSEEN);
> +
>  	if (hashtab->last)
>  		hashtab->last->nextbyorder = p;
>  	else
> @@ -232,33 +237,14 @@ dir_hash_add(
>  	return !dup;
>  }
>  
> -/*
> - * checks to see if any data entries are not in the leaf blocks
> - */
> -static int
> -dir_hash_unseen(
> -	dir_hash_tab_t	*hashtab)
> -{
> -	int		i;
> -	dir_hash_ent_t	*p;
> -
> -	for (i = 0; i < hashtab->size; i++) {
> -		for (p = hashtab->byaddr[i]; p; p = p->nextbyaddr) {
> -			if (p->seen == 0)
> -				return 1;
> -		}
> -	}
> -	return 0;
> -}
> -
>  static int
>  dir_hash_check(
> -	dir_hash_tab_t	*hashtab,
> -	xfs_inode_t	*ip,
> -	int		seeval)
> +	struct dir_hash_tab	*hashtab,
> +	struct xfs_inode	*ip,
> +	int			seeval)
>  {
> -	static char	*seevalstr[DIR_HASH_CK_TOTAL];
> -	static int	done;
> +	static char		*seevalstr[DIR_HASH_CK_TOTAL];
> +	static int		done;
>  
>  	if (!done) {
>  		seevalstr[DIR_HASH_CK_OK] = _("ok");
> @@ -270,7 +256,8 @@ dir_hash_check(
>  		done = 1;
>  	}
>  
> -	if (seeval == DIR_HASH_CK_OK && dir_hash_unseen(hashtab))
> +	if (seeval == DIR_HASH_CK_OK &&
> +	    radix_tree_tagged(&hashtab->byaddr, HT_UNSEEN))
>  		seeval = DIR_HASH_CK_NOLEAF;
>  	if (seeval == DIR_HASH_CK_OK)
>  		return 0;
> @@ -285,27 +272,28 @@ dir_hash_check(
>  
>  static void
>  dir_hash_done(
> -	dir_hash_tab_t	*hashtab)
> +	struct dir_hash_tab	*hashtab)
>  {
> -	int		i;
> -	dir_hash_ent_t	*n;
> -	dir_hash_ent_t	*p;
> +	int			i;
> +	struct dir_hash_ent	*n;
> +	struct dir_hash_ent	*p;
>  
>  	for (i = 0; i < hashtab->size; i++) {
> -		for (p = hashtab->byaddr[i]; p; p = n) {
> -			n = p->nextbyaddr;
> +		for (p = hashtab->byhash[i]; p; p = n) {
> +			n = p->nextbyhash;
> +			radix_tree_delete(&hashtab->byaddr, p->address);
>  			free(p);
>  		}
>  	}
>  	free(hashtab);
>  }
>  
> -static dir_hash_tab_t *
> +static struct dir_hash_tab *
>  dir_hash_init(
> -	xfs_fsize_t	size)
> +	xfs_fsize_t		size)
>  {
> -	dir_hash_tab_t	*hashtab;
> -	int		hsize;
> +	struct dir_hash_tab	*hashtab;
> +	int			hsize;
>  
>  	hsize = size / (16 * 4);
>  	if (hsize > 65536)
> @@ -315,51 +303,43 @@ dir_hash_init(
>  	if ((hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1)) == NULL)
>  		do_error(_("calloc failed in dir_hash_init\n"));
>  	hashtab->size = hsize;
> -	hashtab->byhash = (dir_hash_ent_t**)((char *)hashtab +
> -		sizeof(dir_hash_tab_t));
> -	hashtab->byaddr = (dir_hash_ent_t**)((char *)hashtab +
> -		sizeof(dir_hash_tab_t) + sizeof(dir_hash_ent_t*) * hsize);
> +	hashtab->byhash = (struct dir_hash_ent **)((char *)hashtab +
> +		sizeof(struct dir_hash_tab));
> +	INIT_RADIX_TREE(&hashtab->byaddr, 0);
>  	return hashtab;
>  }
>  
>  static int
>  dir_hash_see(
> -	dir_hash_tab_t		*hashtab,
> +	struct dir_hash_tab	*hashtab,
>  	xfs_dahash_t		hash,
>  	xfs_dir2_dataptr_t	addr)
>  {
> -	int			i;
> -	dir_hash_ent_t		*p;
> +	struct dir_hash_ent	*p;
>  
> -	i = DIR_HASH_FUNC(hashtab, addr);
> -	for (p = hashtab->byaddr[i]; p; p = p->nextbyaddr) {
> -		if (p->address != addr)
> -			continue;
> -		if (p->seen)
> -			return DIR_HASH_CK_DUPLEAF;
> -		if (p->junkit == 0 && p->hashval != hash)
> -			return DIR_HASH_CK_BADHASH;
> -		p->seen = 1;
> -		return DIR_HASH_CK_OK;
> -	}
> -	return DIR_HASH_CK_NODATA;
> +	p = radix_tree_lookup(&hashtab->byaddr, addr);
> +	if (!p)
> +		return DIR_HASH_CK_NODATA;
> +	if (!radix_tree_tag_get(&hashtab->byaddr, addr, HT_UNSEEN))
> +		return DIR_HASH_CK_DUPLEAF;
> +	if (p->junkit == 0 && p->hashval != hash)
> +		return DIR_HASH_CK_BADHASH;
> +	radix_tree_tag_clear(&hashtab->byaddr, addr, HT_UNSEEN);
> +	return DIR_HASH_CK_OK;
>  }
>  
>  static void
>  dir_hash_update_ftype(
> -	dir_hash_tab_t		*hashtab,
> +	struct dir_hash_tab	*hashtab,
>  	xfs_dir2_dataptr_t	addr,
>  	uint8_t			ftype)
>  {
> -	int			i;
> -	dir_hash_ent_t		*p;
> +	struct dir_hash_ent	*p;
>  
> -	i = DIR_HASH_FUNC(hashtab, addr);
> -	for (p = hashtab->byaddr[i]; p; p = p->nextbyaddr) {
> -		if (p->address != addr)
> -			continue;
> -		p->name.type = ftype;
> -	}
> +	p = radix_tree_lookup(&hashtab->byaddr, addr);
> +	if (!p)
> +		return;
> +	p->name.type = ftype;
>  }
>  
>  /*
> @@ -368,7 +348,7 @@ dir_hash_update_ftype(
>   */
>  static int
>  dir_hash_see_all(
> -	dir_hash_tab_t		*hashtab,
> +	struct dir_hash_tab	*hashtab,
>  	xfs_dir2_leaf_entry_t	*ents,
>  	int			count,
>  	int			stale)
> @@ -1226,19 +1206,19 @@ dir_binval(
>  
>  static void
>  longform_dir2_rebuild(
> -	xfs_mount_t		*mp,
> +	struct xfs_mount	*mp,
>  	xfs_ino_t		ino,
> -	xfs_inode_t		*ip,
> -	ino_tree_node_t		*irec,
> +	struct xfs_inode	*ip,
> +	struct ino_tree_node	*irec,
>  	int			ino_offset,
> -	dir_hash_tab_t		*hashtab)
> +	struct dir_hash_tab	*hashtab)
>  {
>  	int			error;
>  	int			nres;
> -	xfs_trans_t		*tp;
> +	struct xfs_trans	*tp;
>  	xfs_fileoff_t		lastblock;
> -	xfs_inode_t		pip;
> -	dir_hash_ent_t		*p;
> +	struct xfs_inode	pip;
> +	struct dir_hash_ent	*p;
>  	int			done = 0;
>  
>  	/*
> @@ -1397,14 +1377,14 @@ _("directory shrink failed (%d)\n"), error);
>   */
>  static void
>  longform_dir2_entry_check_data(
> -	xfs_mount_t		*mp,
> -	xfs_inode_t		*ip,
> +	struct xfs_mount	*mp,
> +	struct xfs_inode	*ip,
>  	int			*num_illegal,
>  	int			*need_dot,
> -	ino_tree_node_t		*current_irec,
> +	struct ino_tree_node	*current_irec,
>  	int			current_ino_offset,
>  	struct xfs_buf		*bp,
> -	dir_hash_tab_t		*hashtab,
> +	struct dir_hash_tab	*hashtab,
>  	freetab_t		**freetabp,
>  	xfs_dablk_t		da_bno,
>  	int			isblock)
> @@ -1931,10 +1911,10 @@ check_dir3_header(
>   */
>  static int
>  longform_dir2_check_leaf(
> -	xfs_mount_t		*mp,
> -	xfs_inode_t		*ip,
> -	dir_hash_tab_t		*hashtab,
> -	freetab_t		*freetab)
> +	struct xfs_mount	*mp,
> +	struct xfs_inode	*ip,
> +	struct dir_hash_tab	*hashtab,
> +	struct freetab		*freetab)
>  {
>  	int			badtail;
>  	__be16			*bestsp;
> @@ -2016,10 +1996,10 @@ longform_dir2_check_leaf(
>   */
>  static int
>  longform_dir2_check_node(
> -	xfs_mount_t		*mp,
> -	xfs_inode_t		*ip,
> -	dir_hash_tab_t		*hashtab,
> -	freetab_t		*freetab)
> +	struct xfs_mount	*mp,
> +	struct xfs_inode	*ip,
> +	struct dir_hash_tab	*hashtab,
> +	struct freetab		*freetab)
>  {
>  	struct xfs_buf		*bp;
>  	xfs_dablk_t		da_bno;
> @@ -2191,14 +2171,15 @@ longform_dir2_check_node(
>   * (ie. get libxfs to do all the grunt work)
>   */
>  static void
> -longform_dir2_entry_check(xfs_mount_t	*mp,
> -			xfs_ino_t	ino,
> -			xfs_inode_t	*ip,
> -			int		*num_illegal,
> -			int		*need_dot,
> -			ino_tree_node_t	*irec,
> -			int		ino_offset,
> -			dir_hash_tab_t	*hashtab)
> +longform_dir2_entry_check(
> +	struct xfs_mount	*mp,
> +	xfs_ino_t		ino,
> +	struct xfs_inode	*ip,
> +	int			*num_illegal,
> +	int			*need_dot,
> +	struct ino_tree_node	*irec,
> +	int			ino_offset,
> +	struct dir_hash_tab	*hashtab)
>  {
>  	struct xfs_buf		*bp;
>  	xfs_dablk_t		da_bno;
> @@ -2401,13 +2382,14 @@ shortform_dir2_junk(
>  }
>  
>  static void
> -shortform_dir2_entry_check(xfs_mount_t	*mp,
> -			xfs_ino_t	ino,
> -			xfs_inode_t	*ip,
> -			int		*ino_dirty,
> -			ino_tree_node_t	*current_irec,
> -			int		current_ino_offset,
> -			dir_hash_tab_t	*hashtab)
> +shortform_dir2_entry_check(
> +	struct xfs_mount	*mp,
> +	xfs_ino_t		ino,
> +	struct xfs_inode	*ip,
> +	int			*ino_dirty,
> +	struct ino_tree_node	*current_irec,
> +	int			current_ino_offset,
> +	struct dir_hash_tab	*hashtab)
>  {
>  	xfs_ino_t		lino;
>  	xfs_ino_t		parent;
> @@ -2749,15 +2731,15 @@ _("entry \"%s\" (ino %" PRIu64 ") in dir %" PRIu64 " is a duplicate name"),
>   */
>  static void
>  process_dir_inode(
> -	xfs_mount_t 		*mp,
> +	struct xfs_mount	*mp,
>  	xfs_agnumber_t		agno,
> -	ino_tree_node_t		*irec,
> +	struct ino_tree_node	*irec,
>  	int			ino_offset)
>  {
>  	xfs_ino_t		ino;
> -	xfs_inode_t		*ip;
> -	xfs_trans_t		*tp;
> -	dir_hash_tab_t		*hashtab;
> +	struct xfs_inode	*ip;
> +	struct xfs_trans	*tp;
> +	struct dir_hash_tab	*hashtab;
>  	int			need_dot;
>  	int			dirty, num_illegal, error, nres;
>  
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] repair: protect inode chunk tree records with a mutex
  2020-10-22  8:15     ` Dave Chinner
@ 2020-10-29 16:45       ` Darrick J. Wong
  0 siblings, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-29 16:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 22, 2020 at 07:15:05PM +1100, Dave Chinner wrote:
> On Wed, Oct 21, 2020 at 11:02:56PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 22, 2020 at 04:15:33PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Phase 6 accesses inode chunk records mostly in an isolated manner.
> > > However, when it finds a corruption in a directory or there are
> > > multiple hardlinks to an inode, there can be concurrent access
> > > to the inode chunk record to update state.
> > > 
> > > Hence the inode record itself needs a mutex. This protects all state
> > > changes within the inode chunk record, as well as inode link counts
> > > and chunk references. That allows us to process multiple chunks at
> > > once, providing concurrency within an AG as well as across AGs.
> > > 
> > > The inode chunk tree itself is not modified in phase 6 - it's built
> > 
> > Well, that's not 100% true -- mk_orphanage can do that, but AFAICT
> > that's outside the scope of the parallel processing (and I don't see
> > much point in parallelizing that part) so I guess that's fine?
> 
> AFAICT, yes.

Ok, good, I'm confident I understand what's going on here. :)

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> > > in phases 3 and 4 - and so we do not need to worry about locking
> > > for AVL tree lookups to find the inode chunk records themselves.
> > > Hence internal locking is all we need here.
> > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > 
> > TBH I wonder if all the phase6.c code to recreate the root dir, the
> > orphanage, and the realtime inodes ought to get moved to another file,
> > particularly since the metadata directory patches add quite a bit more
> > stuff here?  But that's a topic for another patch...
> 
> Probably should and yes, separate patch :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] repair: parallelise phase 6
  2020-10-27  5:10     ` Dave Chinner
@ 2020-10-29 17:20       ` Darrick J. Wong
  0 siblings, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2020-10-29 17:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 27, 2020 at 04:10:44PM +1100, Dave Chinner wrote:
> On Wed, Oct 21, 2020 at 11:11:00PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 22, 2020 at 04:15:34PM +1100, Dave Chinner wrote:
> > > +static void
> > > +traverse_function(
> > > +	struct workqueue	*wq,
> > > +	xfs_agnumber_t		agno,
> > > +	void			*arg)
> > > +{
> > > +	struct ino_tree_node	*irec;
> > >  	prefetch_args_t		*pf_args = arg;
> > > +	struct workqueue	lwq;
> > > +	struct xfs_mount	*mp = wq->wq_ctx;
> > > +
> > >  
> > >  	wait_for_inode_prefetch(pf_args);
> > >  
> > >  	if (verbose)
> > >  		do_log(_("        - agno = %d\n"), agno);
> > >  
> > > +	/*
> > > +	 * The more AGs we have in flight at once, the fewer processing threads
> > > +	 * per AG. This means we don't overwhelm the machine with hundreds of
> > > +	 * threads when we start acting on lots of AGs at once. We just want
> > > +	 * enough that we can keep multiple CPUs busy across multiple AGs.
> > > +	 */
> > > +	workqueue_create_bound(&lwq, mp, ag_stride, 1000);
> > 
> > Eeeeee, magic number! :)
> > 
> > /me tosses in obligatory hand-wringing about 2000 CPU systems running
> > out of work.  How about ag_stride * 50 or something? :P
> 
> ag_stride already determines concurrency via how many AGs are being
> scanned at once. However, it provides no insight into the depth of
> the queue we need to use per AG.
> 
> What this magic number does is bound how deep the work queue gets
> before we ask another worker thread to start also processing the
> queue.

It does?  I didn't think we'd wake up extra worker threads when the
queue depth reached max_queued:

	if (wq->max_queued && wq->next_item) {
		/* more work, wake up another worker */
		pthread_cond_signal(&wq->wakeup);
	}

AFAICT, any time a worker dequeues a work item, observes that we have
a max queue depth, and sees that there's still more work to do, it'll
wake up another worker.

TBH I think this is better for cpu utilization because we should never
have idle workers while there's more work to do.  At least for the scrub
case...

<shrug> Either that or I think I've misunderstood something?

> We've already got async threads doing inode prefetch, so the
> bound here throttles the rate at which inodes are
> prefetched into the buffer cache. In general, we're going to be IO
> bound and waiting on readahead rather than throttling on processing
> the inodes, so all this bound is doing is preventing readahead from
> running too far ahead of processing and potentially causing cache
> thrashing.
> 
> However, we don't want to go using lots of threads to parallelise
> the work within the AG when we have already parallelised across AGs.
> We want the initial worker thread per AG to just keep working away
> burning CPU while the prefetch code is blocking waiting for more
> inodes from disk. Then we get another burst of work being queued,
> and so on.
> Hence the queue needs to be quite deep so that we can soak up the
> bursts of processing that readahead triggers without asking lots of
> worker threads to do work. However, if the worker thread hits some
> big directories and starts falling behind readahead, that's when it
> will hit the maximum queue depth and kick another thread to do work.

...ah, I think I grok the differences between what repair and scrub are
trying to do with the workqueue.  Repair readahead is driving
workqueue_add() calls, so you don't really want to increase parallelism
of the workqueue until (a) you can be reasonably certain that the
workers won't block on IO and (b) the workers have bogged down on huge
directories such that the buffer cache is filling up with memory that
has no immediate consumer.

It's a little different from what I'm doing with scrub, which
effectively reads inobt records and queues each of them separately for
processing.  In this case, the queue depth merely restrains our work
item memory allocations.

> IOWs, the queue depth needs to be deep enough to prevent bursts from
> triggering extra workers from running, but shallow enough that extra
> workers will be scheduled when processing falls behind readahead.
> 
> I really don't have a good way to automatically calculate this
> depth. I just figure that if we have a 1000 inodes queued up for
> processing, we really should kick another thread to start working on
> them. It's a simple solution, so I'd like to see if we have problems
> with this simple threshold before we try to replace the magic number
> with a magic heuristic....

Hm, in that case, shouldn't that code snippet above read:

	if (wq->max_queued && wq->item_count == wq->max_queued - 1) {
		/* more work, wake up another worker */
		pthread_cond_signal(&wq->wakeup);
	}

That would seem to wake up another worker, but only after 1000 inodes
have been added to the queue.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] repair: Phase 6 performance improvements
  2021-03-24  1:26       ` Gao Xiang
@ 2021-03-24  2:08         ` Darrick J. Wong
  0 siblings, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2021-03-24  2:08 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Dave Chinner, linux-xfs

On Wed, Mar 24, 2021 at 09:26:55AM +0800, Gao Xiang wrote:
> Hi Dave and Darrick,
> 
> On Sat, Mar 20, 2021 at 10:09:31AM +0800, Gao Xiang wrote:
> > On Fri, Mar 19, 2021 at 11:22:21AM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 19, 2021 at 09:38:45AM +0800, Gao Xiang wrote:
> > > > On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > This is largely a repost of my current code so that Xiang can take
> > > > > over and finish it off. It applies against 5.11.0 and the
> > > > > performance numbers are still valid. I can't remember how much of
> > > > > the review comments I addressed from the first time I posted it, so
> > > > > the changelog is poor....
> > > > 
> > > > Yeah, I will catch what's missing (now looking the previous review),
> > > > and follow up then...
> > > 
> > > :)
> > > 
> > > While you're revising the patches, you might as well convert:
> > > 
> > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > into:
> > > 
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > because Exchange is so awful for inline replies that I don't use that
> > > email address anymore.
> > 
> > Yeah, I'm just starting sorting out all previous opinions
> > and patches diff. Will update in the next version.
> > 
> 
> Sorry for bothering... After reading the previous discussion for a while,
> I'm fine with the trivial cleanups. Yet, it seems that there are mainly
> 2 open discussions still unresolved...
> 
> 1 is magic number 1000,
> https://lore.kernel.org/r/20201029172045.GP1061252@magnolia
> 
> I also don't have a better idea for this (and have no idea why a queue
> depth of 1000 is optimal compared with other configurations), so it'd be
> better to get your thoughts about this in advance (e.g. just leave it
> as-is, or... plus, I don't have a test setup with that many CPUs)
> 
> 2 is the hash size modification,
> https://lore.kernel.org/r/20201029162922.GM1061252@magnolia/
> 
> it seems hash entries were previously limited to 64k, and this patch
> relaxes that limitation; for huge directories I'm not sure about the
> hash table utilization, but from the previous commit message it seems
> the extra memory usage can be ignored.
> 
> Anyway, I'm fine with just leaving them as-is if we agree on these.

FWIW I didn't have any specific objections to either magic number, I
simply wanted to know where they came from and why. :)

--D

> Thanks,
> Gao Xiang
> 
> > Thanks,
> > Gao Xiang
> > 
> > > 
> > > --D
> > > 
> > > > Thanks,
> > > > Gao Xiang
> > > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] repair: Phase 6 performance improvements
  2021-03-20  2:09     ` Gao Xiang
@ 2021-03-24  1:26       ` Gao Xiang
  2021-03-24  2:08         ` Darrick J. Wong
  0 siblings, 1 reply; 34+ messages in thread
From: Gao Xiang @ 2021-03-24  1:26 UTC (permalink / raw)
  To: Darrick J. Wong, Dave Chinner; +Cc: linux-xfs

Hi Dave and Darrick,

On Sat, Mar 20, 2021 at 10:09:31AM +0800, Gao Xiang wrote:
> On Fri, Mar 19, 2021 at 11:22:21AM -0700, Darrick J. Wong wrote:
> > On Fri, Mar 19, 2021 at 09:38:45AM +0800, Gao Xiang wrote:
> > > On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > This is largely a repost of my current code so that Xiang can take
> > > > over and finish it off. It applies against 5.11.0 and the
> > > > performance numbers are still valid. I can't remember how much of
> > > > the review comments I addressed from the first time I posted it, so
> > > > the changelog is poor....
> > > 
> > > Yeah, I will catch what's missing (now looking the previous review),
> > > and follow up then...
> > 
> > :)
> > 
> > While you're revising the patches, you might as well convert:
> > 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > into:
> > 
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > because Exchange is so awful for inline replies that I don't use that
> > email address anymore.
> 
> Yeah, I'm just starting sorting out all previous opinions
> and patches diff. Will update in the next version.
> 

Sorry for bothering... After reading the previous discussion for a while,
I'm fine with the trivial cleanups. Yet, it seems that there are mainly
2 open discussions still unresolved...

1 is magic number 1000,
https://lore.kernel.org/r/20201029172045.GP1061252@magnolia

I also don't have a better idea for this (and have no idea why a queue
depth of 1000 is optimal compared with other configurations), so it'd be
better to get your thoughts about this in advance (e.g. just leave it
as-is, or... plus, I don't have a test setup with that many CPUs)

2 is the hash size modification,
https://lore.kernel.org/r/20201029162922.GM1061252@magnolia/

it seems hash entries were previously limited to 64k, and this patch
relaxes that limitation; for huge directories I'm not sure about the
hash table utilization, but from the previous commit message it seems
the extra memory usage can be ignored.

Anyway, I'm fine with just leaving them as-is if we agree on these.

Thanks,
Gao Xiang

> Thanks,
> Gao Xiang
> 
> > 
> > --D
> > 
> > > Thanks,
> > > Gao Xiang
> > > 
> > 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] repair: Phase 6 performance improvements
  2021-03-19 18:22   ` Darrick J. Wong
@ 2021-03-20  2:09     ` Gao Xiang
  2021-03-24  1:26       ` Gao Xiang
  0 siblings, 1 reply; 34+ messages in thread
From: Gao Xiang @ 2021-03-20  2:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Fri, Mar 19, 2021 at 11:22:21AM -0700, Darrick J. Wong wrote:
> On Fri, Mar 19, 2021 at 09:38:45AM +0800, Gao Xiang wrote:
> > On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > This is largely a repost of my current code so that Xiang can take
> > > over and finish it off. It applies against 5.11.0 and the
> > > performance numbers are still valid. I can't remember how much of
> > > the review comments I addressed from the first time I posted it, so
> > > the changelog is poor....
> > 
> > Yeah, I will catch what's missing (now looking the previous review),
> > and follow up then...
> 
> :)
> 
> While you're revising the patches, you might as well convert:
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> into:
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> because Exchange is so awful for inline replies that I don't use that
> email address anymore.

Yeah, I'm just starting sorting out all previous opinions
and patches diff. Will update in the next version.

Thanks,
Gao Xiang

> 
> --D
> 
> > Thanks,
> > Gao Xiang
> > 
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] repair: Phase 6 performance improvements
  2021-03-19  1:38 ` Gao Xiang
@ 2021-03-19 18:22   ` Darrick J. Wong
  2021-03-20  2:09     ` Gao Xiang
  0 siblings, 1 reply; 34+ messages in thread
From: Darrick J. Wong @ 2021-03-19 18:22 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Dave Chinner, linux-xfs

On Fri, Mar 19, 2021 at 09:38:45AM +0800, Gao Xiang wrote:
> On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > This is largely a repost of my current code so that Xiang can take
> > over and finish it off. It applies against 5.11.0 and the
> > performance numbers are still valid. I can't remember how much of
> > the review comments I addressed from the first time I posted it, so
> > the changelog is poor....
> 
> Yeah, I will catch what's missing (now looking the previous review),
> and follow up then...

:)

While you're revising the patches, you might as well convert:

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

into:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

because Exchange is so awful for inline replies that I don't use that
email address anymore.

--D

> Thanks,
> Gao Xiang
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] repair: Phase 6 performance improvements
  2021-03-19  1:33 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
@ 2021-03-19  1:38 ` Gao Xiang
  2021-03-19 18:22   ` Darrick J. Wong
  0 siblings, 1 reply; 34+ messages in thread
From: Gao Xiang @ 2021-03-19  1:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Mar 19, 2021 at 12:33:48PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> This is largely a repost of my current code so that Xiang can take
> over and finish it off. It applies against 5.11.0 and the
> performance numbers are still valid. I can't remember how much of
> the review comments I addressed from the first time I posted it, so
> the changelog is poor....

Yeah, I will catch what's missing (now looking the previous review),
and follow up then...

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 0/7] repair: Phase 6 performance improvements
@ 2021-03-19  1:33 Dave Chinner
  2021-03-19  1:38 ` Gao Xiang
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2021-03-19  1:33 UTC (permalink / raw)
  To: linux-xfs; +Cc: hsiangkao

Hi folks,

This is largely a repost of my current code so that Xiang can take
over and finish it off. It applies against 5.11.0 and the
performance numbers are still valid. I can't remember how much of
the review comments I addressed from the first time I posted it, so
the changelog is poor....

Original description:

Phase 6 is single threaded, processing a single AG at a time and a
single directory inode at a time.  Phase 6 is often IO latency bound
despite the prefetching it does, resulting in low disk utilisation
and high runtimes. The solution for this is the same as phase 3 and
4 - scan multiple AGs at once for directory inodes to process. This
patch set enables phase 6 to scan multiple AGs at once, and hence
requires concurrent updates of inode records as they can be accessed
and modified by multiple scanning threads now. We also need to
protect the bad inodes list from concurrent access and then we can
enable concurrent processing of directories.

However, directory entry checking and reconstruction can also be CPU
bound - large directories overwhelm the directory name hash
structures because the algorithms have poor scalability - one is O(n
+ n^2), another is O(n^2) when the number of dirents greatly
outsizes the hash table sizes. Hence we need to do more than just
parallelise across AGs - we need to parallelise processing within
AGs so that a single large directory doesn't completely serialise
processing within an AG.  This is done by using bound-depth
workqueues to allow inode records to be processed asynchronously as
the inode records are fetched from disk.

Further, we need to fix the bad algorithmic scalability of the in
memory directory tracking structures. This is done through a
combination of better structures and more appropriate dynamic size
choices.

The results on a filesystem with a single 10 million entry directory
containing 400MB of directory entry data is as follows:

v5.6.0 (Baseline)

       XFS_REPAIR Summary    Thu Oct 22 12:10:52 2020

Phase           Start           End             Duration
Phase 1:        10/22 12:06:41  10/22 12:06:41
Phase 2:        10/22 12:06:41  10/22 12:06:41
Phase 3:        10/22 12:06:41  10/22 12:07:00  19 seconds
Phase 4:        10/22 12:07:00  10/22 12:07:12  12 seconds
Phase 5:        10/22 12:07:12  10/22 12:07:13  1 second
Phase 6:        10/22 12:07:13  10/22 12:10:51  3 minutes, 38 seconds
Phase 7:        10/22 12:10:51  10/22 12:10:51

Total run time: 4 minutes, 10 seconds

real	4m11.151s
user	4m20.083s
sys	0m14.744s


5.9.0-rc1 + patchset:

        XFS_REPAIR Summary    Thu Oct 22 13:19:02 2020

Phase           Start           End             Duration
Phase 1:        10/22 13:18:09  10/22 13:18:09
Phase 2:        10/22 13:18:09  10/22 13:18:09
Phase 3:        10/22 13:18:09  10/22 13:18:31  22 seconds
Phase 4:        10/22 13:18:31  10/22 13:18:45  14 seconds
Phase 5:        10/22 13:18:45  10/22 13:18:45
Phase 6:        10/22 13:18:45  10/22 13:19:00  15 seconds
Phase 7:        10/22 13:19:00  10/22 13:19:00

Total run time: 51 seconds

real	0m52.375s
user	1m3.739s
sys	0m20.346s


Performance improvements on filesystems with small directories and
really fast storage are, at best, modest. The big improvements are
seen with either really large directories and/or relatively slow
devices that are IO latency bound and can benefit from having more
IO in flight at once.

Cheers,

Dave.

Version 2
- use pthread_cond_broadcast() to wakeup throttled waiters in
  bounded workqueues.
- other changes I didn't record when I made them so I've forgotten
  about them.



Dave Chinner (7):
  workqueue: bound maximum queue depth
  repair: Protect bad inode list with mutex
  repair: protect inode chunk tree records with a mutex
  repair: parallelise phase 6
  repair: don't duplicate names in phase 6
  repair: convert the dir byaddr hash to a radix tree
  repair: scale duplicate name checking in phase 6.

 libfrog/radix-tree.c |  46 +++++
 libfrog/workqueue.c  |  42 ++++-
 libfrog/workqueue.h  |   4 +
 repair/dir2.c        |  32 ++--
 repair/incore.h      |  23 +++
 repair/incore_ino.c  |  15 ++
 repair/phase6.c      | 396 +++++++++++++++++++++----------------------
 7 files changed, 338 insertions(+), 220 deletions(-)

-- 
2.30.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-03-24  2:08 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-22  5:15 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
2020-10-22  5:15 ` [PATCH 1/7] workqueue: bound maximum queue depth Dave Chinner
2020-10-22  5:54   ` Darrick J. Wong
2020-10-22  8:11     ` Dave Chinner
2020-10-25  4:41   ` Darrick J. Wong
2020-10-26 22:29     ` Dave Chinner
2020-10-26 22:40       ` Darrick J. Wong
2020-10-26 22:57         ` Dave Chinner
2020-10-22  5:15 ` [PATCH 2/7] repair: Protect bad inode list with mutex Dave Chinner
2020-10-22  5:45   ` Darrick J. Wong
2020-10-29  9:35   ` Christoph Hellwig
2020-10-22  5:15 ` [PATCH 3/7] repair: protect inode chunk tree records with a mutex Dave Chinner
2020-10-22  6:02   ` Darrick J. Wong
2020-10-22  8:15     ` Dave Chinner
2020-10-29 16:45       ` Darrick J. Wong
2020-10-22  5:15 ` [PATCH 4/7] repair: parallelise phase 6 Dave Chinner
2020-10-22  6:11   ` Darrick J. Wong
2020-10-27  5:10     ` Dave Chinner
2020-10-29 17:20       ` Darrick J. Wong
2020-10-22  5:15 ` [PATCH 5/7] repair: don't duplicate names in " Dave Chinner
2020-10-22  6:21   ` Darrick J. Wong
2020-10-22  8:23     ` Dave Chinner
2020-10-22 15:53       ` Darrick J. Wong
2020-10-29  9:39   ` Christoph Hellwig
2020-10-22  5:15 ` [PATCH 6/7] repair: convert the dir byaddr hash to a radix tree Dave Chinner
2020-10-29 16:41   ` Darrick J. Wong
2020-10-22  5:15 ` [PATCH 7/7] repair: scale duplicate name checking in phase 6 Dave Chinner
2020-10-29 16:29   ` Darrick J. Wong
2021-03-19  1:33 [PATCH 0/7] repair: Phase 6 performance improvements Dave Chinner
2021-03-19  1:38 ` Gao Xiang
2021-03-19 18:22   ` Darrick J. Wong
2021-03-20  2:09     ` Gao Xiang
2021-03-24  1:26       ` Gao Xiang
2021-03-24  2:08         ` Darrick J. Wong
