linux-raid.vger.kernel.org archive mirror
From: Paul E Luse <paul.e.luse@linux.intel.com>
To: tada keisuke <keisuke1.tada@kioxia.com>
Cc: Yu Kuai <yukuai1@huaweicloud.com>,
	"song@kernel.org" <song@kernel.org>,
	"yukuai (C)" <yukuai3@huawei.com>,
	"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Luse, Paul E" <paul.e.luse@intel.com>
Subject: Re: [PATCH v2 08/11] md: add atomic mode switching in RAID 1/10
Date: Thu, 25 Apr 2024 02:12:24 -0700
Message-ID: <20240425021224.6419ee2c@peluse-desk5>
In-Reply-To: <4b8f2f7499bd497c9db280775ae4ea81@kioxia.com>

On Fri, 26 Apr 2024 08:01:55 +0000
tada keisuke <keisuke1.tada@kioxia.com> wrote:

> > > > Hi,
> > > >
> > > > On 2024/04/18 13:44, tada keisuke wrote:
> > > > > This patch depends on patch 07.
> > > > >
> > > > > All rdevs running in RAID 1/10 switch nr_pending to atomic
> > > > > mode. The value of nr_pending is read during normal operation
> > > > > (choose_best_rdev()), so nr_pending must always be consistent.
> > > > >
> > > > > Signed-off-by: Keisuke TADA <keisuke1.tada@kioxia.com>
> > > > > Signed-off-by: Toshifumi OHTAKE <toshifumi.ootake@kioxia.com>
> > > > > ---
> > > > >   drivers/md/md.h     | 14 ++++++++++++++
> > > > >   drivers/md/raid1.c  |  7 +++++++
> > > > >   drivers/md/raid10.c |  4 ++++
> > > > >   3 files changed, 25 insertions(+)
> > > > >
> > > > > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > > > > index ab09e312c9bb..57b09b567ffa 100644
> > > > > --- a/drivers/md/md.h
> > > > > +++ b/drivers/md/md.h
> > > > > @@ -236,6 +236,20 @@ static inline unsigned long nr_pending_read(struct md_rdev *rdev)
> > > > >  	return atomic_long_read(&rdev->nr_pending.data->count);
> > > > >  }
> > > > >
> > > > > +static inline bool nr_pending_is_percpu_mode(struct md_rdev *rdev)
> > > > > +{
> > > > > +	unsigned long __percpu *percpu_count;
> > > > > +
> > > > > +	return __ref_is_percpu(&rdev->nr_pending, &percpu_count);
> > > > > +}
> > > > > +
> > > > > +static inline bool nr_pending_is_atomic_mode(struct md_rdev *rdev)
> > > > > +{
> > > > > +	unsigned long __percpu *percpu_count;
> > > > > +
> > > > > +	return !__ref_is_percpu(&rdev->nr_pending, &percpu_count);
> > > > > +}
> > > > > +
> > > > >  static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
> > > > >  			      sector_t *first_bad, int *bad_sectors)
> > > > >  {
> > > > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > > > > index 12318fb15a88..c38ae13aadab 100644
> > > > > --- a/drivers/md/raid1.c
> > > > > +++ b/drivers/md/raid1.c
> > > > > @@ -784,6 +784,7 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
> > > > >  		if (ctl.readable_disks++ == 1)
> > > > >  			set_bit(R1BIO_FailFast, &r1_bio->state);
> > > > >
> > > > > +		WARN_ON_ONCE(nr_pending_is_percpu_mode(rdev));
> > > > >  		pending = nr_pending_read(rdev);
> > > > >  		dist = abs(r1_bio->sector - conf->mirrors[disk].head_position);
> > > > > @@ -1930,6 +1931,7 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> > > > >  			if (err)
> > > > >  				return err;
> > > > >
> > > > > +			percpu_ref_switch_to_atomic_sync(&rdev->nr_pending);
> > > > >  			raid1_add_conf(conf, rdev, mirror, false);
> > > > >  			/* As all devices are equivalent, we don't need a full recovery
> > > > >  			 * if this was recently any drive of the array
> > > > > @@ -1949,6 +1951,7 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> > > > >  			set_bit(Replacement, &rdev->flags);
> > > > >  			raid1_add_conf(conf, rdev, repl_slot, true);
> > > > >  			err = 0;
> > > > > +			percpu_ref_switch_to_atomic_sync(&rdev->nr_pending);
> > > >
> > > > I don't understand the point here: 'nr_pending' will be used
> > > > when the rdev is issuing IO, and there it is always used in
> > > > atomic mode, so there is no difference.
> > > >
> > > > Given that 'nr_pending' must be read from the IO fast path,
> > > > using it as an atomic is something we must accept. Unless
> > > > someone comes up with a plan to avoid reading the 'inflight'
> > > > counter from the fast path, as the generic block layer does,
> > > > it's not ok to me to switch to percpu_ref for now.
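
For context on the API trade-off described here: a percpu_ref is cheap
to get/put while in percpu mode, but it only has a single coherent
count while in atomic mode, which is what reading it from the fast
path forces. A minimal sketch of the relevant calls, with illustrative
names that are not from this patchset:

#include <linux/percpu-refcount.h>

/* Illustrative only -- not code from this patchset. */
static struct percpu_ref inflight;

static void inflight_release(struct percpu_ref *ref)
{
	/* Called once the last reference is dropped. */
}

static int example_init(void)
{
	/* A percpu_ref starts in (cheap) percpu mode by default. */
	return percpu_ref_init(&inflight, inflight_release, 0, GFP_KERNEL);
}

static void example_io_fast_path(void)
{
	percpu_ref_get(&inflight);	/* per-CPU increment */
	/* ... submit IO ... */
	percpu_ref_put(&inflight);	/* per-CPU decrement */
}

static void example_before_reading(void)
{
	/*
	 * Collapse the per-CPU counters into one atomic count and wait
	 * for the switch to finish; only after this does a read of the
	 * count return a coherent value.
	 */
	percpu_ref_switch_to_atomic_sync(&inflight);
}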
> 
> The main purpose of this patchset is to improve RAID5 performance.
> In the current RAID 1/10 design, the value of nr_pending is
> intentionally kept in atomic mode at all times because it must be
> read in the IO fast path. Unless the design of reading nr_pending
> changes, I believe this patchset is a reasonable design: RAID1
> performance is about the same as it was with atomic_t before this
> patchset was applied. Paul's results also show that.
> 
> Best Regards,
> Keisuke
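
Concretely, the invariant described above is established once, when an
rdev joins a RAID 1/10 array, so the read side can stay a plain load.
A condensed sketch of that pattern, reusing the helper names from the
patch but otherwise illustrative:

/* Condensed from the raid1_add_disk() hunks above; illustrative only,
 * and assumes the declarations in drivers/md/md.h. */
static void example_on_add_disk(struct md_rdev *rdev)
{
	/*
	 * Flip nr_pending to atomic mode, and wait for the switch to
	 * complete, before the disk becomes visible to readers.
	 */
	percpu_ref_switch_to_atomic_sync(&rdev->nr_pending);
}

static unsigned long example_read_side(struct md_rdev *rdev)
{
	/* Coherent only because of the switch above. */
	WARN_ON_ONCE(nr_pending_is_percpu_mode(rdev));
	return nr_pending_read(rdev);
}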

I only tested RAID1, and since I believe that simpler is better I
would prefer not to change the RAID1 code.  I can run some RAID5 tests
on this as well, unless you already have wide-sweeping results? I
would love to see more RAID5 performance improvements.  Shushu has
another RAID5 perf patch out there that I think has very good
potential; it would be good if you could take a look at that one.

-Paul

> 
> > > > +CC Paul
> > > >
> > > > Hi Paul, perhaps your RR mode doesn't need such an 'inflight'
> > > > counter anymore?
> > > >
> > >
> > > I too am struggling to see the benefit, but I am curious enough
> > > that I will run some tests of my own, as I have fresh baseline
> > > RAID1 large-sweep performance data ready right now.
> > >
> > > So my WIP round robin patch won't need nr_pending for SSDs, but
> > > I think it will for HDDs. I also need a new atomic counter for
> > > round robin for SSDs anyway, and I see no performance impact so
> > > far from adding it.
> > >
> > > -Paul
> > >
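
The WIP round-robin patch and its counter are not part of this thread;
purely as a hypothetical sketch of the kind of per-array atomic pick
counter described above:

#include <linux/atomic.h>

/* Hypothetical sketch -- not from the WIP patch referenced above. */
struct rr_pick {
	atomic_t next;		/* monotonically increasing pick index */
};

/* Rotate reads across mirrors in strict round-robin order. */
static int rr_next_disk(struct rr_pick *rr, int raid_disks)
{
	/* One shared atomic add per read; unsigned wraparound is harmless. */
	unsigned int n = (unsigned int)atomic_inc_return(&rr->next);

	return n % raid_disks;
}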
> > 
> > I can run more if others are interested (RAID5 or HDD, for
> > example), but at least for RAID1 there's really no difference.
> > This was a quick run, just 40 sec each with 16 jobs; the rest of
> > the fio params are on the charts. Two-disk RAID1. The baseline is
> > 6.8.0 from the md repo. Using my favorite drives, of course,
> > KIOXIA KCMYDVUG3T20 :)
> > 
> > Here's the results: https://photos.app.goo.gl/Avyw64eXCqWFWrs78
> > 
> > NOTE:  There are a few small-random cases where it appears to
> > help, but I assumed that was because those are small randoms with
> > very short run times. So I reran the 4K mixed r/w randoms with a
> > 5-minute run time; that chart is at the very bottom and shows that
> > over a longer duration it's a wash, with no clear winner. I'm sure
> > an even longer run would show even more consistently close results.
> 
> 


Thread overview: 6+ messages
2024-04-18  5:44 [PATCH v2 08/11] md: add atomic mode switching in RAID 1/10 tada keisuke
2024-04-18  6:39 ` Yu Kuai
2024-04-16 14:38   ` Paul E Luse
2024-04-16 22:41     ` Paul E Luse
2024-04-26  8:01       ` tada keisuke
2024-04-25  9:12         ` Paul E Luse [this message]
