linux-kernel.vger.kernel.org archive mirror
From: Guoqing Jiang <guoqing.jiang@linux.dev>
To: Logan Gunthorpe <logang@deltatee.com>,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	Song Liu <song@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	Stephen Bates <sbates@raithlin.com>,
	Martin Oliveira <Martin.Oliveira@eideticom.com>,
	David Sloan <David.Sloan@eideticom.com>
Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
Date: Sun, 24 Apr 2022 15:53:54 +0800
Message-ID: <243b3e7f-1fa1-700c-a850-caaf45d95cde@linux.dev>
In-Reply-To: <20220420195425.34911-1-logang@deltatee.com>



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> Hi,
>
> This is v2 of this series which addresses Christoph's feedback and
> fixes some bugs. The first posting is at [1]. A git branch is
> available at [2].
>
> --
>
> I've been doing some work trying to improve the bulk write performance
> of raid5 on large systems with fast NVMe drives. The bottleneck appears
> largely to be lock contention on the hash_lock and device_lock. This
> series improves the situation slightly by addressing a couple of low
> hanging fruit ways to take the lock fewer times in the request path.
>
> Patch 9 adjusts how batching works by keeping a reference to the
> previous stripe_head in raid5_make_request(). Under most situations,
> this removes the need to take the hash_lock in stripe_add_to_batch_list()
> which should reduce the number of times the lock is taken by a factor of
> about 2.
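>
> To make this concrete, here is a stand-alone toy model of the idea (not
> the actual md/raid5 code; every name in it is invented for the sketch):
>
> #include <stdio.h>
>
> struct stripe_head {
>         long sector;
>         struct stripe_head *next;
> };
>
> static struct stripe_head *hash_table[64];
> static long hash_lock_trips;
>
> static struct stripe_head *find_get_stripe(long sector)
> {
>         struct stripe_head *sh;
>
>         hash_lock_trips++;              /* spin_lock(hash_lock) in md */
>         for (sh = hash_table[sector % 64]; sh; sh = sh->next)
>                 if (sh->sector == sector)
>                         break;
>         return sh;                      /* spin_unlock(hash_lock) */
> }
>
> /*
>  * A stripe can be batched with the stripe immediately preceding it.
>  * Without a cached reference, that previous stripe must be re-found
>  * through the hash table, costing a hash_lock round trip per stripe;
>  * if the request loop remembers the stripe_head it processed last,
>  * the sequential case costs nothing.
>  */
> static struct stripe_head *find_batch_head(long sector,
>                                            struct stripe_head *last)
> {
>         if (last && last->sector == sector - 1)
>                 return last;                    /* fast path: no lock */
>         return find_get_stripe(sector - 1);     /* slow path */
> }
>
> int main(void)
> {
>         struct stripe_head stripes[8];
>         struct stripe_head *last = &stripes[0];
>
>         for (long s = 0; s < 8; s++) {
>                 stripes[s].sector = s;
>                 stripes[s].next = hash_table[s % 64];
>                 hash_table[s % 64] = &stripes[s];
>         }
>
>         /* sequential request over stripes 1..7 */
>         for (long s = 1; s < 8; s++) {
>                 find_batch_head(s, last);
>                 last = &stripes[s];
>         }
>
>         printf("hash_lock trips for batching: %ld (uncached: 7)\n",
>                hash_lock_trips);
>         return 0;
> }
>
> (In the real code the batch linking still takes the two stripes' own
> locks; only the search under hash_lock is what goes away.)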
>
> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
> code must find the stripe_head for every 4KB page in the request, so each
> stripe head must be found once for every data disk. The patch changes this
> so that all the data disks can be added to a stripe_head at once and the
> number of times the stripe_head must be found (and thus the number of
> times the hash_lock is taken) should be reduced by a factor roughly equal
> to the number of data disks.
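>
> Schematically, that's the counting model below (illustration only; the
> real loops in drivers/md/raid5.c are more involved). The loop shape is
> the point: before, the outer iteration is per 4KB page; after, it is
> per stripe:
>
> #include <stdio.h>
>
> static long hash_lock_trips;
>
> static void make_request_before(long stripes, long data_disks)
> {
>         for (long s = 0; s < stripes; s++)
>                 for (long d = 0; d < data_disks; d++) {
>                         hash_lock_trips++;  /* find stripe per page */
>                         /* ... add_stripe_bio() for this one page ... */
>                 }
> }
>
> static void make_request_after(long stripes, long data_disks)
> {
>         for (long s = 0; s < stripes; s++) {
>                 hash_lock_trips++;          /* find stripe once */
>                 for (long d = 0; d < data_disks; d++) {
>                         /* ... add every disk's page to the held stripe ... */
>                 }
>         }
> }
>
> int main(void)
> {
>         long before, after;
>
>         hash_lock_trips = 0;
>         make_request_before(1024, 4);   /* 5-disk RAID5: 4 data disks */
>         before = hash_lock_trips;
>
>         hash_lock_trips = 0;
>         make_request_after(1024, 4);
>         after = hash_lock_trips;
>
>         printf("hash_lock trips: %ld -> %ld (%ldx fewer)\n",
>                before, after, before / after);
>         return 0;
> }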
>
> The remaining patches are just cleanup and prep patches for those two
> patches.
>
> Doing apples-to-apples testing of this series on a small VM with 5 ram
> disks, I saw a bandwidth increase of roughly 14%, and lock contention
> on the hash_lock (as reported by lockstat) was reduced by more than a
> factor of 5 (though it is still significantly contended).
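>
> (For reference, the contention numbers come from the kernel's lockstat
> facility: build with CONFIG_LOCK_STAT=y, enable collection via
> /proc/sys/kernel/lock_stat, then read /proc/lock_stat. Below is a
> minimal filter; the "hash_lock" substring is an assumption about how
> the lock class shows up in the output.)
>
> #include <stdio.h>
> #include <string.h>
>
> int main(void)
> {
>         char line[1024];
>         FILE *f = fopen("/proc/lock_stat", "r");
>
>         if (!f) {
>                 perror("/proc/lock_stat");
>                 return 1;
>         }
>         while (fgets(line, sizeof(line), f))
>                 if (strstr(line, "hash_lock"))
>                         fputs(line, stdout);
>         fclose(f);
>         return 0;
> }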
>
> Testing on larger systems with NVMe drives saw similar small bandwidth
> increases, from 3% to 20% depending on the parameters. Oddly, smaller
> arrays had larger gains, likely because they had lower starting
> bandwidths; I would have expected larger gains with larger arrays,
> since there should have been even fewer locks taken in
> raid5_make_request().
>
> Logan
>
> [1] https://lkml.kernel.org/r/20220407164511.8472-1-logang@deltatee.com
> [2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2
>
> --
>
> Changes since v1:
>    - Rebased on current md-next branch (190a901246c69d79)
>    - Added patch to create a helper for checking if a sector
>      is ahead of the reshape (per Christoph)
>    - Reworked the __find_stripe() patch to create a find_get_stripe()
>      helper (per Christoph)
>    - Added more patches to further refactor raid5_make_request() and
>      pull most of the loop body into a helper function (per Christoph)
>    - A few other minor cleanups (boolean return, dropping casting when
>      printing sectors, commit message grammar) as suggested by Christoph.
>    - Fixed two uncommon but bad data corruption bugs that were found.
>
> --
>
> Logan Gunthorpe (12):
>    md/raid5: Factor out ahead_of_reshape() function
>    md/raid5: Refactor raid5_make_request loop
>    md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
>    md/raid5: Move common stripe count increment code into __find_stripe()
>    md/raid5: Factor out helper from raid5_make_request() loop
>    md/raid5: Drop the do_prepare flag in raid5_make_request()
>    md/raid5: Move read_seqcount_begin() into make_stripe_request()
>    md/raid5: Refactor for loop in raid5_make_request() into while loop
>    md/raid5: Keep a reference to last stripe_head for batch
>    md/raid5: Refactor add_stripe_bio()
>    md/raid5: Check all disks in a stripe_head for reshape progress
>    md/raid5: Pivot raid5_make_request()

Generally, I don't object to the cleanup patches, since the code looks
cleaner. But my concern is that some additional function calls are added
to a hot path (raid5_make_request); could the performance be affected?

And I think patch 9 and patch 12 are the ones that help performance.
Did you measure the performance without those cleanup patches?

Thanks,
Guoqing

Thread overview: 59+ messages
2022-04-20 19:54 [PATCH v2 00/12] Improve Raid5 Lock Contention Logan Gunthorpe
2022-04-20 19:54 ` [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function Logan Gunthorpe
2022-04-21  6:07   ` Christoph Hellwig
2022-04-21  9:17   ` Paul Menzel
2022-04-21 16:05     ` Logan Gunthorpe
2022-04-21 23:33       ` Wol
2022-04-27  1:28   ` Guoqing Jiang
2022-04-27 16:07     ` Logan Gunthorpe
2022-04-28  1:49       ` Guoqing Jiang
2022-04-28 15:44         ` Logan Gunthorpe
2022-04-29  0:24           ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 02/12] md/raid5: Refactor raid5_make_request loop Logan Gunthorpe
2022-04-21  6:08   ` Christoph Hellwig
2022-04-27  1:32   ` Guoqing Jiang
2022-04-27 16:08     ` Logan Gunthorpe
2022-04-28  1:16       ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 03/12] md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio() Logan Gunthorpe
2022-04-27  1:33   ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 04/12] md/raid5: Move common stripe count increment code into __find_stripe() Logan Gunthorpe
2022-04-21  6:10   ` Christoph Hellwig
2022-04-27  1:33   ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 05/12] md/raid5: Factor out helper from raid5_make_request() loop Logan Gunthorpe
2022-04-21  6:14   ` Christoph Hellwig
2022-04-20 19:54 ` [PATCH v2 06/12] md/raid5: Drop the do_prepare flag in raid5_make_request() Logan Gunthorpe
2022-04-21  6:15   ` Christoph Hellwig
2022-04-27  2:11   ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request() Logan Gunthorpe
2022-04-21  6:15   ` Christoph Hellwig
2022-04-27  2:13   ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 08/12] md/raid5: Refactor for loop in raid5_make_request() into while loop Logan Gunthorpe
2022-04-21  6:16   ` Christoph Hellwig
2022-04-20 19:54 ` [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch Logan Gunthorpe
2022-04-21  6:17   ` Christoph Hellwig
2022-04-27  1:36   ` Guoqing Jiang
2022-04-27 23:27     ` Logan Gunthorpe
2022-04-20 19:54 ` [PATCH v2 10/12] md/raid5: Refactor add_stripe_bio() Logan Gunthorpe
2022-04-21  6:18   ` Christoph Hellwig
2022-04-20 19:54 ` [PATCH v2 11/12] md/raid5: Check all disks in a stripe_head for reshape progress Logan Gunthorpe
2022-04-21  6:18   ` Christoph Hellwig
2022-04-27  1:53   ` Guoqing Jiang
2022-04-27 16:11     ` Logan Gunthorpe
2022-04-20 19:54 ` [PATCH v2 12/12] md/raid5: Pivot raid5_make_request() Logan Gunthorpe
2022-04-21  6:43   ` Christoph Hellwig
2022-04-21 15:54     ` Logan Gunthorpe
2022-04-27  2:06   ` Guoqing Jiang
2022-04-27 16:18     ` Logan Gunthorpe
2022-04-28  1:32       ` Guoqing Jiang
2022-04-21  8:45 ` [PATCH v2 00/12] Improve Raid5 Lock Contention Xiao Ni
2022-04-21 16:02   ` Logan Gunthorpe
2022-04-24  8:00     ` Guoqing Jiang
2022-04-25 15:39       ` Logan Gunthorpe
2022-04-25 16:12         ` Xiao Ni
2022-04-28 21:22           ` Logan Gunthorpe
2022-04-29  0:49             ` Guoqing Jiang
2022-04-29 16:01               ` Logan Gunthorpe
2022-04-30  1:44                 ` Guoqing Jiang
2022-04-24  7:53 ` Guoqing Jiang [this message]
2022-04-25 15:37   ` Logan Gunthorpe
2022-04-25 23:07 ` Song Liu
