Re: [PATCH v3] bcache: dynamic incremental gc

From: Eric Wheeler <bcache@lists.ewheeler.net>
To: Zou Mingzhe <mingzhe.zou@easystack.cn>
Cc: linux-bcache@vger.kernel.org, zoumingzhe@qq.com
Subject: Re: [PATCH v3] bcache: dynamic incremental gc
Date: Fri, 20 May 2022 11:24:05 -0700 (PDT)	[thread overview]
Message-ID: <37d75ff-877c-5453-b6a0-81c8d737299@ewheeler.net> (raw)
In-Reply-To: <112eaaf7-05fd-3b4f-0190-958d0c85fa1f@easystack.cn>

[-- Attachment #1: Type: text/plain, Size: 4922 bytes --]

On Fri, 20 May 2022, Zou Mingzhe wrote:
> 在 2022/5/12 21:41, Coly Li 写道:
> > On 5/11/22 3:39 PM, mingzhe.zou@easystack.cn wrote:
> >> From: ZouMingzhe <mingzhe.zou@easystack.cn>
> >>
> >> Currently, GC wants no more than 100 times, with at least
> >> 100 nodes each time, the maximum number of nodes each time
> >> is not limited.
> >>
> >> ```
> >> static size_t btree_gc_min_nodes(struct cache_set *c)
> >> {
> >>          ......
> >>          min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;
> >>          if (min_nodes < MIN_GC_NODES)
> >>                  min_nodes = MIN_GC_NODES;
> >>
> >>          return min_nodes;
> >> }
> >> ```
> >>
> >> According to our test data, when nvme is used as the cache,
> >> it takes about 1ms for GC to handle each node (block 4k and
> >> bucket 512k). This means that the latency during GC is at
> >> least 100ms. During GC, IO performance would be reduced by
> >> half or more.
> >>
> >> I want to optimize the IOPS and latency under high pressure.
> >> This patch hold the inflight peak. When IO depth up to maximum,
> >> GC only process very few(10) nodes, then sleep immediately and
> >> handle these requests.
> >>
> >> bch_bucket_alloc() maybe wait for bch_allocator_thread() to
> >> wake up, and and bch_allocator_thread() needs to wait for gc
> >> to complete, in which case gc needs to end quickly. So, add
> >> bucket_alloc_inflight to cache_set in v3.
> >>
> >> ```
> >> long bch_bucket_alloc(struct cache *ca, unsigned int reserve, bool wait)
> >> {
> >>          ......
> >>          do {
> >> prepare_to_wait(&ca->set->bucket_wait, &w,
> >>                                  TASK_UNINTERRUPTIBLE);
> >>
> >>                  mutex_unlock(&ca->set->bucket_lock);
> >>                  schedule();
> >>                  mutex_lock(&ca->set->bucket_lock);
> >>          } while (!fifo_pop(&ca->free[RESERVE_NONE], r) &&
> >>                   !fifo_pop(&ca->free[reserve], r));
> >>          ......
> >> }
> >>
> >> static int bch_allocator_thread(void *arg)
> >> {
> >>     ......
> >>     allocator_wait(ca, bch_allocator_push(ca, bucket));
> >>     wake_up(&ca->set->btree_cache_wait);
> >>     wake_up(&ca->set->bucket_wait);
> >>     ......
> >> }
> >>
> >> static void bch_btree_gc(struct cache_set *c)
> >> {
> >>     ......
> >>     bch_btree_gc_finish(c);
> >>     wake_up_allocators(c);
> >>     ......
> >> }
> >> ```
> >>
> >> Apply this patch, each GC maybe only process very few nodes,
> >> GC would last a long time if sleep 100ms each time. So, the
> >> sleep time should be calculated dynamically based on gc_cost.
> >>
> >> At the same time, I added some cost statistics in gc_stat,
> >> hoping to provide useful information for future work.
> >
> >
> > Hi Mingzhe,
> >
> > From the first glance, I feel this change may delay the small GC period, and
> > finally result a large GC period, which is not expected.
> >
> > But it is possible that my feeling is incorrect. Do you have detailed
> > performance number about both I/O latency  and GC period, then I can have
> > more understanding for this effort.
> >
> > BTW, I will add this patch to my testing set and experience myself.
> >
> >
> > Thanks.
> >
> >
> > Coly Li
> >
> >
> Hi Coly,
> 
> First, your feeling is right. Then, I have some performance number abort
> before and after this patch.
> Since the mailing list does not accept attachments, I put them on the gist.
> 
> Please visit the page for details:
> https://gist.github.com/zoumingzhe/69a353e7c6fffe43142c2f42b94a67b5
> mingzhe

The graphs certainly show that peak latency is much lower, that is 
improvement, and dmesg shows the avail_nbuckets stays about the same so GC 
is keeping up.

Questions: 

1. Why is the after-"BW NO GC" graph so much flatter than the before-"BW 
   NO GC" graph?  I would expect your control measurements to be about the 
   same before vs after.  You might `blkdiscard` the cachedev and 
   re-format between runs in case the FTL is getting in the way, or maybe 
   something in the patch is affecting the "NO GC" graphs.

2. I wonder how the latency looks if you zoom into to the latency graph: 
   If you truncate the before-"LATENCY DO GC" graph at 3000 us then how 
   does the average latency look between the two?

3. This may be solved if you can fix the control graph issue in #1, but 
   the before vs after of "BW DO GC" shows about a 30% decrease in 
   bandwidth performance outside of the GC spikes.  "IOPS DO GC" is lower 
   with the patch too.  Do you think that your dynamic incremental gc 
   algorithm be tuned to deal with GC latency and still provide nearly the 
   same IOPS and bandwidth as before?

--
Eric Wheeler

> >
> >
> > [snipped]
> >
> >
> >
> 
>