From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 7/7] repair: scale duplicate name checking in phase 6.
Date: Thu, 29 Oct 2020 09:29:22 -0700
Message-ID: <20201029162922.GM1061252@magnolia>
In-Reply-To: <20201022051537.2286402-8-david@fromorbit.com>

On Thu, Oct 22, 2020 at 04:15:37PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Phase 6 on large directories is CPU bound on duplicate name checking
> due to the algorithm having effectively O(n^2) scalability. Hence
> when the duplicate name hash table size is far smaller than the
> number of directory entries, we end up with long hash chains that
> are searched linearly for every new entry found in the directory
> during duplicate detection.
> 
> The in-memory hash table size is limited to 64k entries. Hence when
> we have millions of entries in a directory, duplicate entry lookups
> on the hash table have substantial overhead. Scale this table out to
> larger sizes so that we keep the chain lengths short and hence the
> O(n^2) scalability impact is limited because N is always small.
> 
> For a 10M entry directory consuming 400MB of directory data, the
> hash table now sizes at 6.4 million entries instead of ~64k - it is
> ~100x larger. While the hash table now consumes ~50MB of RAM, the
> xfs_repair footprint barely changes as it is already consuming
> ~9GB of RAM at this point. IOWs, the incremental memory usage
> change is noise, but the directory checking time:
> 
> Unpatched:
> 
>   97.11%  xfs_repair          [.] dir_hash_add
>    0.38%  xfs_repair          [.] longform_dir2_entry_check_data
>    0.34%  libc-2.31.so        [.] __libc_calloc
>    0.32%  xfs_repair          [.] avl_ino_start
> 
> Phase 6:        10/22 12:11:40  10/22 12:14:28  2 minutes, 48 seconds
> 
> Patched:
> 
>   46.74%  xfs_repair          [.] radix_tree_lookup
>   32.13%  xfs_repair          [.] dir_hash_see_all
>    7.70%  xfs_repair          [.] radix_tree_tag_get
>    3.92%  xfs_repair          [.] dir_hash_add
>    3.52%  xfs_repair          [.] radix_tree_tag_clear
>    2.43%  xfs_repair          [.] crc32c_le
> 
> Phase 6:        10/22 13:11:01  10/22 13:11:18  17 seconds
> 
> has been reduced by an order of magnitude.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  repair/phase6.c | 30 ++++++++++++++++++++++++------
>  1 file changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/repair/phase6.c b/repair/phase6.c
> index 21f49dd748e1..7dd6130056ee 100644
> --- a/repair/phase6.c
> +++ b/repair/phase6.c
> @@ -288,19 +288,37 @@ dir_hash_done(
>  	free(hashtab);
>  }
>  
> +/*
> + * Create a directory hash index structure based on the size of the directory we
> + * are about to try to repair. The size passed in is the size of the data
> + * segment of the directory in bytes, so we don't really know exactly how many
> + * entries are in it. Hence assume an entry size of around 64 bytes - that's a
> + * name length of 40+ bytes so should cover a most situations with large
> + * really directories.

"...with really large directories." ?

> + */
>  static struct dir_hash_tab *
>  dir_hash_init(
>  	xfs_fsize_t		size)
>  {
> -	struct dir_hash_tab	*hashtab;
> +	struct dir_hash_tab	*hashtab = NULL;
>  	int			hsize;
>  
> -	hsize = size / (16 * 4);
> -	if (hsize > 65536)
> -		hsize = 63336;
> -	else if (hsize < 16)
> +	hsize = size / 64;
> +	if (hsize < 16)
>  		hsize = 16;

Since I'm not that familiar with the directory hash table, I'm curious
about our choice of hash table size (which is hsize, right?).  IIRC most
CS textbooks tell you to pick a "clever" hash table size that's a prime
number just in case the values are unevenly distributed.

I don't know if that's the case here because I haven't studied the
directory name hash in detail, but I wonder, do we have a way to measure
the length of the hash chains?  Or at least the evenness of them?

I'm vaguely wondering if there's more gains to be had here.  Mmm science
projects...

(The rest of the code here looks reasonable to me, fwiw.)

--D

> -	if ((hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1)) == NULL)
> +
> +	/*
> +	 * Try to allocate as large a hash table as possible. Failure to
> +	 * allocate isn't fatal, it will just result in slower performance as we
> +	 * reduce the size of the table.
> +	 */
> +	while (hsize >= 16) {
> +		hashtab = calloc(DIR_HASH_TAB_SIZE(hsize), 1);
> +		if (hashtab)
> +			break;
> +		hsize /= 2;
> +	}
> +	if (!hashtab)
>  		do_error(_("calloc failed in dir_hash_init\n"));
>  	hashtab->size = hsize;
>  	hashtab->byhash = (struct dir_hash_ent **)((char *)hashtab +
> -- 
> 2.28.0
> 

