Re: [dm-devel] [dm-6.4 PATCH v2 3/9] dm bufio: improve concurrent IO performance

From: Mike Snitzer <snitzer@kernel.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: ebiggers@kernel.org, keescook@chromium.org, heinzm@redhat.com,
	dm-devel@redhat.com, nhuck@google.com,
	linux-block@vger.kernel.org, ejt@redhat.com, mpatocka@redhat.com,
	luomeng12@huawei.com
Subject: Re: [dm-devel] [dm-6.4 PATCH v2 3/9] dm bufio: improve concurrent IO performance
Date: Sat, 25 Mar 2023 00:21:44 -0400	[thread overview]
Message-ID: <ZB522MSroOmkyzBC@redhat.com> (raw)
In-Reply-To: <e552ad80-37fe-247e-aaf1-064572ccc154@kernel.dk>

On Fri, Mar 24 2023 at  7:11P -0400,
Jens Axboe <axboe@kernel.dk> wrote:

> On 3/24/23 4:53?PM, Mike Snitzer wrote:
> > On Fri, Mar 24 2023 at  3:34P -0400,
> > Jens Axboe <axboe@kernel.dk> wrote:
> > 
> >> Just some random drive-by comments.
> >>
> >>> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> >>> index 1de1bdcda1ce..a58f8ac3ba75 100644
> >>> --- a/drivers/md/dm-bufio.c
> >>> +++ b/drivers/md/dm-bufio.c
> >>> +static void lru_destroy(struct lru *lru)
> >>> +{
> >>> +	BUG_ON(lru->cursor);
> >>> +	BUG_ON(!list_empty(&lru->iterators));
> >>> +}
> >>
> >> Ehm no, WARN_ON_ONCE() for these presumably.
> > 
> > Yeah, I raised concern about the BUG_ONs with Joe. Will try to rid the
> > code of BUG_ONs as follow-on work.
> 
> Please do.

FYI, you'll cringe if you open the latest upstream
drivers/dm/dm-bufio.c  Mikulas made heavy use of BUG_ONs. Because of
that Joe started working a code that was setting him up for
persecution (and me by association for allowing it in to begin with).

Good news, the dm-6.4-dev branch I point to below already addresses
many of the historic and new BUG_ONs.

> >>> @@ -116,9 +366,579 @@ struct dm_buffer {
> >>>  #endif
> >>>  };
> >>>  
> >>> +/*--------------------------------------------------------------*/
> >>> +
> >>> +/*
> >>> + * The buffer cache manages buffers, particularly:
> >>> + *  - inc/dec of holder count
> >>> + *  - setting the last_accessed field
> >>> + *  - maintains clean/dirty state along with lru
> >>> + *  - selecting buffers that match predicates
> >>> + *
> >>> + * It does *not* handle:
> >>> + *  - allocation/freeing of buffers.
> >>> + *  - IO
> >>> + *  - Eviction or cache sizing.
> >>> + *
> >>> + * cache_get() and cache_put() are threadsafe, you do not need to
> >>> + * protect these calls with a surrounding mutex.  All the other
> >>> + * methods are not threadsafe; they do use locking primitives, but
> >>> + * only enough to ensure get/put are threadsafe.
> >>> + */
> >>> +
> >>> +#define NR_LOCKS 64
> >>> +#define LOCKS_MASK (NR_LOCKS - 1)
> >>> +
> >>> +struct tree_lock {
> >>> +	struct rw_semaphore lock;
> >>> +} ____cacheline_aligned_in_smp;
> >>> +
> >>> +struct dm_buffer_cache {
> >>> +	/*
> >>> +	 * We spread entries across multiple trees to reduce contention
> >>> +	 * on the locks.
> >>> +	 */
> >>> +	struct tree_lock locks[NR_LOCKS];
> >>> +	struct rb_root roots[NR_LOCKS];
> >>> +	struct lru lru[LIST_SIZE];
> >>> +};
> >>
> >> This:
> >>
> >> struct foo_tree {
> >> 	struct rw_semaphore lock;
> >> 	struct rb_root root;
> >> 	struct lru lru;
> >> } ____cacheline_aligned_in_smp;
> >>
> >> would be a lot better.
> >>
> >> And where does this NR_LOCKS come from? Don't make up magic values out
> >> of thin air. Should this be per-cpu? per-node? N per node? I'll bet you
> >> that 64 is way too much for most use cases, and too little for others.
> > 
> > I cannot speak to the 64 magic value (other than I think it worked
> > well for Joe's testbed).  But the point of this exercise is to split
> > the lock to avoid contention.  Using 64 accomplishes this. Having
> > there be more or less isn't _that_ critical.  The hash to get the
> > region index isn't a function of cpu.  We aren't after lockless here.
> 
> I don't doubt it worked well in his setup, and it'll probably be fine in
> a lot of setups. But the point still stands - it's a magic value, it
> should at least be documented. And 64 is likely way too much on a lot of
> machines.

Yeap.

> > Now that said, will certainly discuss further with Joe and see if we
> > can be smarter here.
> > 
> > Your suggestion to combine members certainly makes a lot of sense.  I
> > ran with it relative to the bio-prison-v1 (patch 9) changes which have
> > the same layout. Definitely a win on in-core memory as well as
> > avoiding cacheline thrashing while accessing the lock and then the
> > rb_root members (as is always expected):
> 
> Right, this is why I suggested doing it like that. It's not very smart
> to split related members like that, wastes both memory and is less
> efficient than doing the right thing.

Was just an oversight. The code has had quite some review internally
at Red Hat. But we've been focused on ensuring the code stable. Case
of not seeing the forest from the trees... *shrug*  But nice catch,
appreciate it.

> >> I stopped reading here, the patch is just too long. Surely this could be
> >> split up?
> >>
> >>  1 file changed, 1292 insertions(+), 477 deletions(-)
> >>
> >> That's not a patch, that's a patch series.
> > 
> > I don't want to upset you or the community but saying this but:
> > in this narrow instance where a sizable percentage of the file got
> > rewritten: to properly review this work you need to look at the full
> > scope of the changes in combination.
> 
> That's nonsense. That's like saying "to see what this series does, apply
> the whole thing and compare it with the file before". With that logic,
> why even split changes ever.
> 
> A big patch that could be split is harder to properly review than a lot
> of small patches. Nobody ever reviews a 1000+ line patch. But I guess we
> can just stop reviewing?
> 
> It should be split, it's really not up for debate.

It should. And I've started to do it. I was trying to avoid making
Joe do it after the fact. And I loathe to do it.  It is developing in
reverse, it sucks (but I'm hopeful Joe has the work split out in old
devel trees, will check with him Monday).

For this specific code, the issue is that while the code changes are
split logically between:
1) LRU with clock algorithm
2) DM buffer cache manager
They are pretty tightly coupled (I see no point trying to implement
1 without 2 just to check a review box... it's thankless make-work,
100% so if nobody actually reviews the code that closely). Wanting
code to always be in that form isn't wrong, but it is very much a
suspect requirement if the evolution needs to be recreated in
post-mortum fashion.

And then having reviewers take time to understand various intermediate
transformations the code might go through to reach the final result
isn't necessarily a good use of reviewers' time.

BUT I _have_ already split out commits like this:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-6.4-dev&id=8de2bc47d6ed5f96855e6262eb928a59e8a2a9cb

And if you look at the broader changes I made tonight in this
dm-6.4-dev tree you'll see I'm taking your review feedback seriously:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-6.4-dev

You'll hear from me with a v3 of this patchset next week.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel