* Writeback cache all used.
       [not found] <1012241948.1268315.1680082721600.ref@mail.yahoo.com>
@ 2023-03-29  9:38 ` Adriano Silva
  2023-03-29 19:18   ` Eric Wheeler
  2023-03-30  4:55   ` Martin McClure
  0 siblings, 2 replies; 28+ messages in thread
From: Adriano Silva @ 2023-03-29  9:38 UTC (permalink / raw)
  To: Bcache Linux

Hey guys,

I'm using bcache to support Ceph. Ten cluster nodes each have a bcache device consisting of an HDD backing device and an NVMe cache. But I am noticing what I consider to be a problem: my cache is 100% used even though I still have 80% of the space available on my HDD.

It is true that more data has been written than would fit in the cache. However, I would expect most of it to live only on the HDD and not in the cache, as it is cold data that is almost never used.

I noticed a significant drop in write performance on the disks and went to check; benchmark tests confirmed it. The cache was 100% full and 85% evictable, with a little dirty data. I found a message on the internet about the garbage collector, so I tried the following:

echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc

That doesn't seem to have helped.
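
For reference, the main numbers can also be read straight from sysfs; assuming the usual /sys/block/bcache0 layout (attribute names can vary slightly between kernel versions):

echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
cat /sys/block/bcache0/bcache/state
cat /sys/block/bcache0/bcache/dirty_data
cat /sys/block/bcache0/bcache/cache/cache_available_percent   # roughly the "evictable" figure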

Then I collected the following data:

--- bcache ---
Device /dev/sdc (8:32)
UUID 38e81dff-a7c9-449f-9ddd-182128a19b69
Block Size 4.00KiB
Bucket Size 256.00KiB
Congested? False
Read Congestion 0.0ms
Write Congestion 0.0ms
Total Cache Size 553.31GiB
Total Cache Used 547.78GiB (99%)
Total Unused Cache 5.53GiB (1%)
Dirty Data 0B (0%)
Evictable Cache 503.52GiB (91%)
Replacement Policy [lru] fifo random
Cache Mode writethrough [writeback] writearound none
Total Hits 33361829 (99%)
Total Misses 185029
Total Bypass Hits 6203 (100%)
Total Bypass Misses 0
Total Bypassed 59.20MiB
--- Cache Device ---
   Device /dev/nvme0n1p1 (259:1)
   Size 553.31GiB
   Block Size 4.00KiB
   Bucket Size 256.00KiB
   Replacement Policy [lru] fifo random
   Discard? False
   I/O Errors 0
   Metadata Written 395.00GiB
   Data Written 1.50 TiB
   Buckets 2266376
   Cache Used 547.78GiB (99%)
   Cache Unused 5.53GiB (0%)
--- Backing Device ---
   Device /dev/sdc (8:32)
   Size 5.46TiB
   Cache Mode writethrough [writeback] writearound none
   Readahead
   Sequential Cutoff 0B
   Sequential merge? False
   state clean
   Writeback? true
   Dirty Data 0B
   Total Hits 32903077 (99%)
   Total Misses 185029
   Total Bypass Hits 6203 (100%)
   Total Bypass Misses 0
   Total Bypassed 59.20MiB

The dirty data has disappeared, but cache utilization remains at 99%, down just 1%, while the evictable cache has increased to 91%!

My impression is that this hurts the write cache. That is, if I need to write again, the data goes straight to the HDDs, as there is no space available in the cache.

Shouldn't bcache evict the least used part of the cache?

Does anyone know why this isn't happening?

I may be talking nonsense, but isn't there a way to tell bcache to automatically keep some free space in the cache for writes? Or even to do it manually, with some command I could trigger during periods of low disk activity?

Thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-03-29  9:38 ` Writeback cache all used Adriano Silva
@ 2023-03-29 19:18   ` Eric Wheeler
  2023-03-30  1:38     ` Adriano Silva
  2023-03-30  4:55   ` Martin McClure
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Wheeler @ 2023-03-29 19:18 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Bcache Linux

[-- Attachment #1: Type: text/plain, Size: 3183 bytes --]

On Wed, 29 Mar 2023, Adriano Silva wrote:

> Hey guys,
> 
> I'm using bcache to support Ceph. Ten Cluster nodes have a bcache device 
> each consisting of an HDD block device and an NVMe cache. But I am 
> noticing what I consider to be a problem: My cache is 100% used even 
> though I still have 80% of the space available on my HDD.
> 
> It is true that there is more data written than would fit in the cache. 
> However, I imagine that most of them should only be on the HDD and not 
> in the cache, as they are cold data, almost never used.
> 
> I noticed that there was a significant drop in performance on the disks 
> (writes) and went to check. Benchmark tests confirmed this. Then I 
> noticed that there was 100% cache full and 85% cache evictable. There 
> was a bit of dirty cache. I found an internet message talking about the 
> garbage collector, so I tried the following:
> 
> echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc

What kernel version are you running?  There are some gc cache fixes out 
there, about v5.18 IIRC, that might help things.
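
A quick way to check what is actually running, for reference:

uname -r
modinfo -F filename bcache    # if bcache is built as a module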

--
Eric Wheeler



> 
[snipped]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-03-29 19:18   ` Eric Wheeler
@ 2023-03-30  1:38     ` Adriano Silva
  0 siblings, 0 replies; 28+ messages in thread
From: Adriano Silva @ 2023-03-30  1:38 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Bcache Linux

Hello!

I use kernel version 5.15, the default for the Proxmox Virtualization Environment. I will try to change the kernel as soon as possible.

Is there anything I can do in the meantime, while I can't switch kernel versions?

Strangely, cache usage does not drop even when the machine is completely idle, with virtually no disk activity.

Thanks!



On Wednesday, March 29, 2023 at 16:18:39 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote:





[snipped]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-03-29  9:38 ` Writeback cache all used Adriano Silva
  2023-03-29 19:18   ` Eric Wheeler
@ 2023-03-30  4:55   ` Martin McClure
  2023-03-31  0:17     ` Adriano Silva
  1 sibling, 1 reply; 28+ messages in thread
From: Martin McClure @ 2023-03-30  4:55 UTC (permalink / raw)
  To: Adriano Silva, Bcache Linux

On 3/29/23 02:38, Adriano Silva wrote:
> Hey guys,
>
> I'm using bcache to support Ceph. Ten Cluster nodes have a bcache device each consisting of an HDD block device and an NVMe cache. But I am noticing what I consider to be a problem: My cache is 100% used even though I still have 80% of the space available on my HDD.
>
> It is true that there is more data written than would fit in the cache. However, I imagine that most of them should only be on the HDD and not in the cache, as they are cold data, almost never used.
>
> I noticed that there was a significant drop in performance on the disks (writes) and went to check. Benchmark tests confirmed this. Then I noticed that there was 100% cache full and 85% cache evictable. There was a bit of dirty cache. I found an internet message talking about the garbage collector, so I tried the following:
>
> echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
>
> That doesn't seem to have helped.
>
> Then I collected the following data:
>
> --- bcache ---
> Device /dev/sdc (8:32)
> UUID 38e81dff-a7c9-449f-9ddd-182128a19b69
> Block Size 4.00KiB
> Bucket Size 256.00KiB
> Congested? False
> Read Congestion 0.0ms
> Write Congestion 0.0ms
> Total Cache Size 553.31GiB
> Total Cache Used 547.78GiB (99%)
> Total Unused Cache 5.53GiB (1%)
> Dirty Data 0B (0%)
> Evictable Cache 503.52GiB (91%)
> Replacement Policy [lru] fifo random
> Cache Mode writethrough [writeback] writearound none
> Total Hits 33361829 (99%)
> Total Misses 185029
> Total Bypass Hits 6203 (100%)
> Total Bypass Misses 0
> Total Bypassed 59.20MiB
> --- Cache Device ---
>     Device /dev/nvme0n1p1 (259:1)
>     Size 553.31GiB
>     Block Size 4.00KiB
>     Bucket Size 256.00KiB
>     Replacement Policy [lru] fifo random
>     Discard? False
>     I/O Errors 0
>     Metadata Written 395.00GiB
>     Data Written 1.50 TiB
>     Buckets 2266376
>     Cache Used 547.78GiB (99%)
>     Cache Unused 5.53GiB (0%)
> --- Backing Device ---
>     Device /dev/sdc (8:32)
>     Size 5.46TiB
>     Cache Mode writethrough [writeback] writearound none
>     Readahead
>     Sequential Cutoff 0B
>     Sequential merge? False
>     state clean
>     Writeback? true
>     Dirty Data 0B
>     Total Hits 32903077 (99%)
>     Total Misses 185029
>     Total Bypass Hits 6203 (100%)
>     Total Bypass Misses 0
>     Total Bypassed 59.20MiB
>
> The dirty data has disappeared. But the cache remains 99% utilization, down just 1%. Already the evictable cache increased to 91%!
>
> The impression I have is that this harms the write cache. That is, if I need to write again, the data goes straight to the HDD disks, as there is no space available in the Cache.
>
> Shouldn't bcache remove the least used part of the cache?

I don't know for sure, but I'd think that since 91% of the cache is 
evictable, writing would just evict some data from the cache (without 
writing to the HDD, since it's not dirty data) and write to that area of 
the cache, *not* to the HDD. It wouldn't make sense in many cases to 
actually remove data from the cache, because then any reads of that data 
would have to read from the HDD; leaving it in the cache has very little 
cost and would speed up any reads of that data.

Regards,
-Martin

>
> Does anyone know why this isn't happening?
>
> I may be talking nonsense, but isn't there a way to tell bcache to keep a write-free space rate in the cache automatically? Or even if it was manually by some command that I would trigger at low disk access times?
>
> Thanks!


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-03-30  4:55   ` Martin McClure
@ 2023-03-31  0:17     ` Adriano Silva
  2023-04-02  0:01       ` Eric Wheeler
  0 siblings, 1 reply; 28+ messages in thread
From: Adriano Silva @ 2023-03-31  0:17 UTC (permalink / raw)
  To: Bcache Linux, Martin McClure

Thank you very much!

> I don't know for sure, but I'd think that since 91% of the cache is
> evictable, writing would just evict some data from the cache (without
> writing to the HDD, since it's not dirty data) and write to that area of
> the cache, *not* to the HDD. It wouldn't make sense in many cases to
> actually remove data from the cache, because then any reads of that data
> would have to read from the HDD; leaving it in the cache has very little
> cost and would speed up any reads of that data.

Maybe you're right; it does seem to be writing to the cache, even though it reports the cache as 100% full.

I noticed that read performance is still excellent, but write performance dropped a lot once the cache filled up. It is still better than the bare HDD, but much lower than when the cache is half full or empty.

Sequential write tests with fio now show 240MB/s, where it was 900MB/s when the cache was still half full. Write latency has also increased. IOPS on random 4K writes are now in the 5K range; it was 16K with a half-used cache. Random 4K latency measured with ioping also went up: with a half-full cache it was 500us, now it is 945us.
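
Roughly, the tests look like this (run directly against a scratch /dev/bcache0, since the write tests are destructive; the command lines shown are only an approximation of what I ran):

# sequential write throughput
fio --name=seqwrite --filename=/dev/bcache0 --rw=write --bs=1M \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based

# random 4K write IOPS
fio --name=randwrite --filename=/dev/bcache0 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based

# 4K write latency (ioping needs -W three times to allow writes to a block device)
ioping -D -WWW -s 4k -c 20 /dev/bcache0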

For reading, nothing has changed.

However, for systems where write latency is critical, it makes a significant difference. If possible I would like to always keep a reasonable amount of free space in the cache to improve write response, mostly to reduce 4K latency. Even if I had to schedule a cron job so that, overnight, the system runs a command to evict some percentage of the cache (say 30%) that has gone unused the longest. That would probably make the cache more efficient for writes as well.
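
Something like the sketch below is what I have in mind; the only knob I know of is the gc trigger used above, which by itself did not free space here, so this is just an illustration of the idea (paths assumed to match my layout):

# /etc/cron.d/bcache-gc -- hypothetical nightly gc trigger
0 3 * * * root echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc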

If anyone knows a solution, thanks!



[snipped]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-03-31  0:17     ` Adriano Silva
@ 2023-04-02  0:01       ` Eric Wheeler
  2023-04-03  7:14         ` Coly Li
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Wheeler @ 2023-04-02  0:01 UTC (permalink / raw)
  To: Coly Li; +Cc: Bcache Linux, Martin McClure, Adriano Silva

[-- Attachment #1: Type: text/plain, Size: 5976 bytes --]

On Fri, 31 Mar 2023, Adriano Silva wrote:
> Thank you very much!
> 
> > I don't know for sure, but I'd think that since 91% of the cache is
> > evictable, writing would just evict some data from the cache (without
> > writing to the HDD, since it's not dirty data) and write to that area of
> > the cache, *not* to the HDD. It wouldn't make sense in many cases to
> > actually remove data from the cache, because then any reads of that data
> > would have to read from the HDD; leaving it in the cache has very little
> > cost and would speed up any reads of that data.
> 
> Maybe you're right, it seems to be writing to the cache, despite it 
> indicating that the cache is at 100% full.
> 
> I noticed that it has excellent reading performance, but the writing 
> performance dropped a lot when the cache was full. It's still a higher 
> performance than the HDD, but much lower than it is when it's half full 
> or empty.
> 
> Sequential writing tests with "_fio" now show me 240MB/s of writing, 
> which was already 900MB/s when the cache was still half full. Write 
> latency has also increased. IOPS on random 4K writes are now in the 5K 
> range. It was 16K with half used cache. At random 4K with Ioping, 
> latency went up. With half cache it was 500us. It is now 945us.
> 
> For reading, nothing has changed.
> 
> However, for systems where writing time is critical, it makes a 
> significant difference. If possible I would like to always keep it with 
> a reasonable amount of empty space, to improve writing responses. Reduce 
> 4K latency, mostly. Even if it were for me to program a script in 
> crontab or something like that, so that during the night or something 
> like that the system executes a command for it to clear a percentage of 
> the cache (about 30% for example) that has been unused for the longest 
> time . This would possibly make the cache more efficient on writes as 
> well.

That is an interesting idea since it saves latency. Keeping a few unused 
buckets ready to go would prevent GC during a cached write. 

Coly, would this be an easy feature to add?

Bcache would need a `cache_min_free` tunable that would (asynchronously) 
free the least recently used buckets that are not dirty.

-Eric

> 
> If anyone knows a solution, thanks!
[snipped]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-02  0:01       ` Eric Wheeler
@ 2023-04-03  7:14         ` Coly Li
  2023-04-03 19:27           ` Eric Wheeler
  0 siblings, 1 reply; 28+ messages in thread
From: Coly Li @ 2023-04-03  7:14 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Bcache Linux, Martin McClure, Adriano Silva



> On 2 Apr 2023, at 08:01, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> 
> On Fri, 31 Mar 2023, Adriano Silva wrote:
>> Thank you very much!
>> 
>>> I don't know for sure, but I'd think that since 91% of the cache is
>>> evictable, writing would just evict some data from the cache (without
>>> writing to the HDD, since it's not dirty data) and write to that area of
>>> the cache, *not* to the HDD. It wouldn't make sense in many cases to
>>> actually remove data from the cache, because then any reads of that data
>>> would have to read from the HDD; leaving it in the cache has very little
>>> cost and would speed up any reads of that data.
>> 
>> Maybe you're right, it seems to be writing to the cache, despite it 
>> indicating that the cache is at 100% full.
>> 
>> I noticed that it has excellent reading performance, but the writing 
>> performance dropped a lot when the cache was full. It's still a higher 
>> performance than the HDD, but much lower than it is when it's half full 
>> or empty.
>> 
>> Sequential writing tests with "_fio" now show me 240MB/s of writing, 
>> which was already 900MB/s when the cache was still half full. Write 
>> latency has also increased. IOPS on random 4K writes are now in the 5K 
>> range. It was 16K with half used cache. At random 4K with Ioping, 
>> latency went up. With half cache it was 500us. It is now 945us.
>> 
>> For reading, nothing has changed.
>> 
>> However, for systems where writing time is critical, it makes a 
>> significant difference. If possible I would like to always keep it with 
>> a reasonable amount of empty space, to improve writing responses. Reduce 
>> 4K latency, mostly. Even if it were for me to program a script in 
>> crontab or something like that, so that during the night or something 
>> like that the system executes a command for it to clear a percentage of 
>> the cache (about 30% for example) that has been unused for the longest 
>> time . This would possibly make the cache more efficient on writes as 
>> well.
> 
> That is an intersting idea since it saves latency. Keeping a few unused 
> ready to go would prevent GC during a cached write. 
> 

Currently around 10% is already reserved; if dirty data exceeds that threshold, further writes go directly to the backing device.

Reserving more space doesn't change much if busy write requests keep arriving. As for occupied clean cache space, I tested this years ago: the space can be shrunk very fast and it won't be a performance bottleneck. If the situation has changed since then, please let me know.

> Coly, would this be an easy feature to add?
> 

To make it, the change wouldn't be complex. But I don't feel it would solve the original write performance issue when space is almost full. In the code we already have similar lists to hold available buckets for future data/metadata allocation. But if those lists are empty, time is still required to do the dirty writeback and garbage collection if necessary.

> Bcache would need a `cache_min_free` tunable that would (asynchronously) 
> free the least recently used buckets that are not dirty.
> 

For clean cache space, this already exists. Shrinking clean cache space is very fast; in a test I did two years ago, it took no more than 10 seconds to reclaim around 1TB+ of clean cache space. I guess the time might be even less, because reading the information from the priorities file also takes time.



Coly Li



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-03  7:14         ` Coly Li
@ 2023-04-03 19:27           ` Eric Wheeler
  2023-04-04  8:19             ` Coly Li
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Wheeler @ 2023-04-03 19:27 UTC (permalink / raw)
  To: Coly Li; +Cc: Bcache Linux, Martin McClure, Adriano Silva

[-- Attachment #1: Type: text/plain, Size: 5326 bytes --]

On Mon, 3 Apr 2023, Coly Li wrote:
> > On 2 Apr 2023, at 08:01, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > 
> > On Fri, 31 Mar 2023, Adriano Silva wrote:
> >> Thank you very much!
> >> 
> >>> I don't know for sure, but I'd think that since 91% of the cache is
> >>> evictable, writing would just evict some data from the cache (without
> >>> writing to the HDD, since it's not dirty data) and write to that area of
> >>> the cache, *not* to the HDD. It wouldn't make sense in many cases to
> >>> actually remove data from the cache, because then any reads of that data
> >>> would have to read from the HDD; leaving it in the cache has very little
> >>> cost and would speed up any reads of that data.
> >> 
> >> Maybe you're right, it seems to be writing to the cache, despite it 
> >> indicating that the cache is at 100% full.
> >> 
> >> I noticed that it has excellent reading performance, but the writing 
> >> performance dropped a lot when the cache was full. It's still a higher 
> >> performance than the HDD, but much lower than it is when it's half full 
> >> or empty.
> >> 
> >> Sequential writing tests with "_fio" now show me 240MB/s of writing, 
> >> which was already 900MB/s when the cache was still half full. Write 
> >> latency has also increased. IOPS on random 4K writes are now in the 5K 
> >> range. It was 16K with half used cache. At random 4K with Ioping, 
> >> latency went up. With half cache it was 500us. It is now 945us.
> >> 
> >> For reading, nothing has changed.
> >> 
> >> However, for systems where writing time is critical, it makes a 
> >> significant difference. If possible I would like to always keep it with 
> >> a reasonable amount of empty space, to improve writing responses. Reduce 
> >> 4K latency, mostly. Even if it were for me to program a script in 
> >> crontab or something like that, so that during the night or something 
> >> like that the system executes a command for it to clear a percentage of 
> >> the cache (about 30% for example) that has been unused for the longest 
> >> time . This would possibly make the cache more efficient on writes as 
> >> well.
> > 
> > That is an intersting idea since it saves latency. Keeping a few unused 
> > ready to go would prevent GC during a cached write. 
> > 
> 
> Currently there are around 10% reserved already, if dirty data exceeds 
> the threshold further writing will go into backing device directly.

It doesn't sound like he is referring to dirty data.  If I understand 
correctly, he means that when the cache is 100% allocated (whether or not 
anything is dirty) that latency is 2x what it could be compared to when 
there are unallocated buckets ready for writing (ie, after formatting the 
cache, but before it fully allocates).  His sequential throughput was 4x 
slower with a 100% allocated cache: 900MB/s at 50% full after a format, 
but only 240MB/s when the cache buckets are 100% allocated.

> Reserve more space doesn’t change too much, if there are always busy 
> write request arriving. For occupied clean cache space, I tested years 
> ago, the space can be shrunk very fast and it won’t be a performance 
> bottleneck. If the situation changes now, please inform me.

His performance specs above indicate that a 100% occupied but clean cache 
increases latency (due to release/re-allocate overhead).  The increased 
latency reduces effective throughput.

> > Coly, would this be an easy feature to add?
> 
> To make it, the change won’t be complexed. But I don’t feel it may solve 
> the original writing performance issue when space is almost full. In the 
> code we already have similar lists to hold available buckets for future 
> data/metadata allocation.

Questions: 

- Right after a cache is formatted and attached to a bcache volume, which 
  list contains the buckets that have never been used?

- Where does bcache insert back onto that list?  Does it?

> But if the lists are empty, there is still time required to do the dirty 
> writeback and garbage collection if necessary.

True, that code would always remain, no change there.

> > Bcache would need a `cache_min_free` tunable that would (asynchronously) 
> > free the least recently used buckets that are not dirty.
> 
> For clean cache space, it has been already.

I'm not sure what you mean by "it has been already" - do you mean this is 
already implemented?  If so, what/where is the sysfs tunable (or 
hard-coded min-free-buckets value) ?

> This is very fast to shrink clean cache space, I did a test 2 years ago, 
> it was just not more than 10 seconds to reclaim around 1TB+ clean cache 
> space. I guess the time might be much less, because reading the 
> information from priorities file also takes time.

Reclaiming large chunks of cache is probably fast in one shot, but 
reclaiming one "clean but allocated" bucket (or even a few buckets) with 
each WRITE has latency overhead associated with it.  Early reclaim to some 
reasonable (configurable) minimum free-space value could hide that latency 
in many workloads.

Long-running bcache volumes are usually 100% allocated, and if freeing 
batches of clean buckets is fast, then doing it early would save metadata 
handling during clean bucket re-allocation for new writes (and maybe 
read-promotion, too).


--
Eric Wheeler

> 
> 
> Coly Li
> 
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-03 19:27           ` Eric Wheeler
@ 2023-04-04  8:19             ` Coly Li
  2023-04-04 20:29               ` Adriano Silva
  0 siblings, 1 reply; 28+ messages in thread
From: Coly Li @ 2023-04-04  8:19 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Bcache Linux, Martin McClure, Adriano Silva



> On 4 Apr 2023, at 03:27, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> 
> On Mon, 3 Apr 2023, Coly Li wrote:
>>> On 2 Apr 2023, at 08:01, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
>>> 
>>> On Fri, 31 Mar 2023, Adriano Silva wrote:
>>>> Thank you very much!
>>>> 
>>>>> I don't know for sure, but I'd think that since 91% of the cache is
>>>>> evictable, writing would just evict some data from the cache (without
>>>>> writing to the HDD, since it's not dirty data) and write to that area of
>>>>> the cache, *not* to the HDD. It wouldn't make sense in many cases to
>>>>> actually remove data from the cache, because then any reads of that data
>>>>> would have to read from the HDD; leaving it in the cache has very little
>>>>> cost and would speed up any reads of that data.
>>>> 
>>>> Maybe you're right, it seems to be writing to the cache, despite it 
>>>> indicating that the cache is at 100% full.
>>>> 
>>>> I noticed that it has excellent reading performance, but the writing 
>>>> performance dropped a lot when the cache was full. It's still a higher 
>>>> performance than the HDD, but much lower than it is when it's half full 
>>>> or empty.
>>>> 
>>>> Sequential writing tests with "_fio" now show me 240MB/s of writing, 
>>>> which was already 900MB/s when the cache was still half full. Write 
>>>> latency has also increased. IOPS on random 4K writes are now in the 5K 
>>>> range. It was 16K with half used cache. At random 4K with Ioping, 
>>>> latency went up. With half cache it was 500us. It is now 945us.
>>>> 
>>>> For reading, nothing has changed.
>>>> 
>>>> However, for systems where writing time is critical, it makes a 
>>>> significant difference. If possible I would like to always keep it with 
>>>> a reasonable amount of empty space, to improve writing responses. Reduce 
>>>> 4K latency, mostly. Even if it were for me to program a script in 
>>>> crontab or something like that, so that during the night or something 
>>>> like that the system executes a command for it to clear a percentage of 
>>>> the cache (about 30% for example) that has been unused for the longest 
>>>> time . This would possibly make the cache more efficient on writes as 
>>>> well.
>>> 
>>> That is an intersting idea since it saves latency. Keeping a few unused 
>>> ready to go would prevent GC during a cached write. 
>>> 
>> 
>> Currently there are around 10% reserved already, if dirty data exceeds 
>> the threshold further writing will go into backing device directly.
> 
> It doesn't sound like he is referring to dirty data.  If I understand 
> correctly, he means that when the cache is 100% allocated (whether or not 
> anything is dirty) that latency is 2x what it could be compared to when 
> there are unallocated buckets ready for writing (ie, after formatting the 
> cache, but before it fully allocates).  His sequential throughput was 4x 
> slower with a 100% allocated cache: 900MB/s at 50% full after a format, 
> but only 240MB/s when the cache buckets are 100% allocated.
> 

It sounds like a large cache with limited memory for caching B+tree nodes?

If memory is limited and not all the B+tree nodes on the hot I/O paths can stay in memory, such behavior is possible. In this case, shrinking the cached data may reduce the metadata, and consequently the in-memory B+tree nodes, as well. Yes, it may be helpful for such a scenario.

But what is the I/O pattern here? Is all the cache space occupied by clean data from read requests, while write performance is what matters? Is this a write-heavy, read-heavy, or mixed workload?

I suspect that the metadata I/O needed to read in missing B+tree nodes contributes to the I/O performance degradation. But this is only a guess on my part.


>> Reserve more space doesn’t change too much, if there are always busy 
>> write request arriving. For occupied clean cache space, I tested years 
>> ago, the space can be shrunk very fast and it won’t be a performance 
>> bottleneck. If the situation changes now, please inform me.
> 
> His performance specs above indicate that 100% occupided but clean cache 
> increases latency (due to release/re-allocate overhead).  The increased 
> latency reduces effective throughput.
> 

Maybe increasing memory a bit would help a lot. But if this is not a server machine, e.g. a laptop, increasing DIMM size might be unfeasible with current models.

Let's first check whether memory is insufficient in this case.


>>> Coly, would this be an easy feature to add?
>> 
>> To make it, the change won’t be complexed. But I don’t feel it may solve 
>> the original writing performance issue when space is almost full. In the 
>> code we already have similar lists to hold available buckets for future 
>> data/metadata allocation.
> 
> Questions: 
> 
> - Right after a cache is formated and attached to a bcache volume, which 
>  list contains the buckets that have never been used?

All buckets on the cache are allocated in a roughly sequential order; there is no dedicated list to track never-used buckets. There are arrays to track the buckets' dirty state and gen numbers, but not for this purpose. Maintaining such a list is expensive when the cache size is large.

> 
> - Where does bcache insert back onto that list?  Does it?

Even for synchronous bucket invalidation by fifo/lru/random, the selection is decided at run time by a heap sort, which is still quite slow. But this is on a very slow code path, so it doesn't hurt too much.


> 
>> But if the lists are empty, there is still time required to do the dirty 
>> writeback and garbage collection if necessary.
> 
> True, that code would always remain, no change there.
> 
>>> Bcache would need a `cache_min_free` tunable that would (asynchronously) 
>>> free the least recently used buckets that are not dirty.
>> 
>> For clean cache space, it has been already.
> 
> I'm not sure what you mean by "it has been already" - do you mean this is 
> already implemented?  If so, what/where is the sysfs tunable (or 
> hard-coded min-free-buckets value) ?

For read requests, the gc thread is still woken up from time to time. See the code path:

cached_dev_read_done_bh() ==> cached_dev_read_done() ==> bch_data_insert() ==> bch_data_insert_start() ==> wake_up_gc()

When a read misses in the cache, writing the clean data from the backing device into the cache device still occupies cache space, and by default, for every 1/16 of the space that gets allocated/occupied, the gc thread is woken up asynchronously.
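
If you want to see whether gc is actually firing, recent kernels expose gc timing statistics under the cache set's internal directory; exact attribute names vary by version, so this is only a hint:

grep . /sys/block/bcache0/bcache/cache/internal/btree_gc_* 2>/dev/null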


> 
>> This is very fast to shrink clean cache space, I did a test 2 years ago, 
>> it was just not more than 10 seconds to reclaim around 1TB+ clean cache 
>> space. I guess the time might be much less, because reading the 
>> information from priorities file also takes time.
> 
> Reclaiming large chunks of cache is probably fast in one shot, but 
> reclaiming one "clean but allocated" bucket (or even a few buckets) with 
> each WRITE has latency overhead associated with it.  Early reclaim to some 
> reasonable (configrable) minimum free-space value could hide that latency 
> in many workloads.
> 

As I explained, this reclaim is already there. But it cannot help much if busy I/O requests keep coming and the writeback and gc threads have no spare time to run.

If incoming I/O exceeds the service capacity of the cache, disappointed requesters can be expected. 


> Long-running bcache volumes are usually 100% allocated, and if freeing 
> batches of clean buckets is fast, then doing it early would save metadata 
> handling during clean bucket re-allocation for new writes (and maybe 
> read-promotion, too).

Let's check whether it is just because of insufficient memory to hold the hot B+tree nodes in memory.

Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-04  8:19             ` Coly Li
@ 2023-04-04 20:29               ` Adriano Silva
  2023-04-05 13:57                 ` Coly Li
  0 siblings, 1 reply; 28+ messages in thread
From: Adriano Silva @ 2023-04-04 20:29 UTC (permalink / raw)
  To: Eric Wheeler, Coly Li; +Cc: Bcache Linux, Martin McClure

Hello,

> It sounds like a large cache size with limit memory cache 
> for B+tree nodes?

> If the memory is limited and all B+tree nodes in the hot I/O 
> paths cannot stay in memory, it is possible for such 
> behavior happens. In this case, shrink the cached data 
> may deduce the meta data and consequential in-memory 
> B+tree nodes as well. Yes it may be helpful for such 
> scenario.

There are several servers (ten in total), all with 128 GB of RAM, of which around 100GB (on average) is reported by the OS as free. The cache is 594GB of enterprise NVMe; mass storage is 6TB. The configuration is the same on all of them. They run Ceph OSDs serving a pool of disks accessed by servers (other nodes, including themselves).

All show the same behavior.

When they were installed, they did not occupy the entire cache. Over time the cache gradually filled up and never decreased in size. I have another five servers in another cluster going the same way. During the night their workload is reduced.

> But what is the I/O pattern here? If all the cache space 
> occupied by clean data for read request, and write 
> performance is cared about then. Is this a write tended, 
> or read tended workload, or mixed?

The workload is heavier on writes. Both reads and writes are important, but write latency is critical. These are virtual machine disks stored on Ceph. Inside them we have mixed loads: Windows with terminal services, Linux, and a database where direct write latency is critical.

> As I explained, the re-reclaim has been here already. 
> But it cannot help too much if busy I/O requests always 
> coming and writeback and gc threads have no spare 
> time to perform.

> If coming I/Os exceeds the service capacity of the 
> cache service window, disappointed requesters can 
> be expected.

Today the ten servers have been without I/O for at least 24 hours. Nothing has changed; they still show 100% cache occupancy. I believe that should have given the GC enough time, no?
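
For what it's worth, this is what I check to conclude that writeback has finished (assuming the usual layout; writeback_rate_debug may not exist on every kernel):

cat /sys/block/bcache0/bcache/state                  # reports "clean" once writeback is done
cat /sys/block/bcache0/bcache/writeback_rate_debug   # dirty/target figures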

> Let’s check whether it is just becasue of insuffecient 
> memory to hold the hot B+tree node in memory.

Does bcache have any option to reserve RAM? Or would 100GB of free RAM be insufficient for the 594GB NVMe cache? For that amount of cache, how much RAM should I have reserved for bcache? Is there a command or parameter I should use to tell bcache to reserve that RAM? I haven't done anything about this; how would I do it?

Another question: How do I know if I should trigger a TRIM (discard) for my NVMe with bcache?


Best regards,





On Tuesday, April 4, 2023 at 05:20:06 BRT, Coly Li <colyli@suse.de> wrote:




[snipped]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-04 20:29               ` Adriano Silva
@ 2023-04-05 13:57                 ` Coly Li
  2023-04-05 19:24                   ` Eric Wheeler
  2023-04-05 19:31                   ` Adriano Silva
  0 siblings, 2 replies; 28+ messages in thread
From: Coly Li @ 2023-04-05 13:57 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Eric Wheeler, Bcache Linux, Martin McClure



> On 5 Apr 2023, at 04:29, Adriano Silva <adriano_da_silva@yahoo.com.br> wrote:
> 
> Hello,
> 
>> It sounds like a large cache size with limit memory cache 
>> for B+tree nodes?
> 
>> If the memory is limited and all B+tree nodes in the hot I/O 
>> paths cannot stay in memory, it is possible for such 
>> behavior happens. In this case, shrink the cached data 
>> may deduce the meta data and consequential in-memory 
>> B+tree nodes as well. Yes it may be helpful for such 
>> scenario.
> 
> There are several servers (TEN) all with 128 GB of RAM, of which around 100GB (on average) are presented by the OS as free. Cache is 594GB in size on enterprise NVMe, mass storage is 6TB. The configuration on all is the same. They run Ceph OSD to service a pool of disks accessed by servers (others including themselves).
> 
> All show the same behavior.
> 
> When they were installed, they did not occupy the entire cache. Throughout use, the cache gradually filled up and  never decreased in size. I have another five servers in  another cluster going the same way. During the night  their workload is reduced.

Copied.

> 
>> But what is the I/O pattern here? If all the cache space 
>> occupied by clean data for read request, and write 
>> performance is cared about then. Is this a write tended, 
>> or read tended workload, or mixed?
> 
> The workload is greater in writing. Both are important, read and write. But write latency is critical. These are virtual machine disks that are stored on Ceph. Inside we have mixed loads, Windows with terminal service, Linux, including a database where direct write latency is critical.


Copied.

> 
>> As I explained, the re-reclaim has been here already. 
>> But it cannot help too much if busy I/O requests always 
>> coming and writeback and gc threads have no spare 
>> time to perform.
> 
>> If coming I/Os exceeds the service capacity of the 
>> cache service window, disappointed requesters can 
>> be expected.
> 
> Today, the ten servers have been without I/O operation for at least 24 hours. Nothing has changed, they continue with 100% cache occupancy. I believe I should have given time for the GC, no?

This is nice. Now we have maximum writeback throughput after I/O has been idle for a while, so after 24 hours all dirty data should have been written back and the whole cache might be clean.

I guess just a gc is needed here.

Can you try to write 1 to the cache set sysfs file gc_after_writeback? When it is set, a gc will be woken up automatically after all writeback is accomplished. Then most of the clean cache might be shrunk and the B+tree nodes reduced quite a lot.


> 
>> Let’s check whether it is just becasue of insuffecient 
>> memory to hold the hot B+tree node in memory.
> 
> Does the bcache configuration have any RAM memory reservation options? Or would the 100GB of RAM be insufficient for the 594GB of NVMe Cache? For that amount of Cache, how much RAM should I have reserved for bcache? Is there any command or parameter I should use to signal bcache that it should reserve this RAM memory? I didn't do anything about this matter. How would I do it?
> 

Currently there is no option to limit how much memory the bcache in-memory B+tree node cache occupies, but when the I/O load reduces, this memory consumption may drop very fast via the reaper in the system memory management code. So it won’t be a problem. Bcache will try to use any possible memory for the B+tree node cache if necessary, and will throttle I/O performance to return this memory to the memory management code when available system memory is low. By default it should work well and nothing needs to be done by the user.
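
As a rough sketch of that mechanism (paraphrased from bch_btree_cache_alloc() in drivers/md/bcache/btree.c; field names and the register_shrinker() signature vary between kernel versions), the B+tree node cache is registered as a kernel shrinker when the cache set is created, so it is bounded by memory-pressure reclaim rather than by a sysfs knob:

	/* Paraphrased sketch, not verbatim kernel code. */
	c->shrink.count_objects = bch_mca_count;  /* how many nodes could be reclaimed */
	c->shrink.scan_objects  = bch_mca_scan;   /* free up to nr_to_scan of them */
	c->shrink.seeks = 4;
	c->shrink.batch = c->btree_pages * 2;
	register_shrinker(&c->shrink);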

> Another question: How do I know if I should trigger a TRIM (discard) for my NVMe with bcache?

Bcache doesn’t issue trim requests proactively. The bcache program from bcache-tools may issue a discard request when you run,
	bcache make -C <cache device path>
to create a cache device.

At run time, the bcache code only forwards trim requests to the backing device (not the cache device).


Thanks.

Coly Li
 

> 
[snipped]


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-05 13:57                 ` Coly Li
@ 2023-04-05 19:24                   ` Eric Wheeler
  2023-04-05 19:31                   ` Adriano Silva
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Wheeler @ 2023-04-05 19:24 UTC (permalink / raw)
  To: Coly Li; +Cc: Adriano Silva, Bcache Linux, Martin McClure

[-- Attachment #1: Type: text/plain, Size: 5498 bytes --]

On Wed, 5 Apr 2023, Coly Li wrote:
> > On Apr 5, 2023, at 04:29, Adriano Silva <adriano_da_silva@yahoo.com.br> wrote:
> > 
> > Hello,
> > 
> >> It sounds like a large cache size with limit memory cache 
> >> for B+tree nodes?
> > 
> >> If the memory is limited and all B+tree nodes in the hot I/O 
> >> paths cannot stay in memory, it is possible for such 
> >> behavior happens. In this case, shrink the cached data 
> >> may deduce the meta data and consequential in-memory 
> >> B+tree nodes as well. Yes it may be helpful for such 
> >> scenario.
> > 
> > There are several servers (TEN) all with 128 GB of RAM, of which 
> > around 100GB (on average) are presented by the OS as free. Cache is 
> > 594GB in size on enterprise NVMe, mass storage is 6TB. The 
> > configuration on all is the same. They run Ceph OSD to service a pool 
> > of disks accessed by servers (others including themselves).
> > 
> > All show the same behavior.
> > 
> > When they were installed, they did not occupy the entire cache. 
> > Throughout use, the cache gradually filled up and never decreased in 
> > size. I have another five servers in another cluster going the same 
> > way. During the night their workload is reduced.
> 
> Copied.
> 
> >> But what is the I/O pattern here? If all the cache space occupied by 
> >> clean data for read request, and write performance is cared about 
> >> then. Is this a write tended, or read tended workload, or mixed?
> > 
> > The workload is greater in writing. Both are important, read and 
> > write. But write latency is critical. These are virtual machine disks 
> > that are stored on Ceph. Inside we have mixed loads, Windows with 
> > terminal service, Linux, including a database where direct write 
> > latency is critical.
> 
> 
> Copied.
> 
> >> As I explained, the re-reclaim has been here already. But it cannot 
> >> help too much if busy I/O requests always coming and writeback and gc 
> >> threads have no spare time to perform.
> >>
> >> If coming I/Os exceeds the service capacity of the cache service 
> >> window, disappointed requesters can be expected.
> > 
> > Today, the ten servers have been without I/O operation for at least 24 
> > hours. Nothing has changed, they continue with 100% cache occupancy. I 
> > believe I should have given time for the GC, no?
> 
> This is nice. Now we have the maximum writeback thoughput after I/O idle 
> for a while, so after 24 hours all dirty data should be written back and 
> the whole cache might be clean.
> 
> I guess just a gc is needed here.
> 
> Can you try to write 1 to cache set sysfs file gc_after_writeback? When 
> it is set, a gc will be waken up automatically after all writeback 
> accomplished. Then most of the clean cache might be shunk and the B+tree 
> nodes will be deduced quite a lot.

If writeback is done then you might need this to trigger it, too:
	echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc

Question for Coly: Will `gc_after_writeback` evict read-promoted pages, 
too, making this effectively a writeback-only cache?

Here is the commit:
	https://patchwork.kernel.org/project/linux-block/patch/20181213145357.38528-9-colyli@suse.de/
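
For context, the linked commit works roughly as follows (paraphrased from 
the patch; the bit names are from the patch, the call site is simplified): 
writing 1 to gc_after_writeback sets an enable bit, the writeback code sets 
a second bit while it is flushing dirty data, and once the writeback thread 
finds the device clean with both bits set it clears the second bit and 
forces a gc run:

	/* Paraphrased sketch of the gc_after_writeback mechanism. */
	#define BCH_ENABLE_AUTO_GC	1	/* set by the sysfs write */
	#define BCH_DO_AUTO_GC		2	/* set while writeback is running */

	/* in the writeback thread, once all dirty data has been flushed: */
	if (c->gc_after_writeback == (BCH_ENABLE_AUTO_GC | BCH_DO_AUTO_GC)) {
		c->gc_after_writeback &= ~BCH_DO_AUTO_GC;
		force_wake_up_gc(c);
	}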

> >> Let’s check whether it is just becasue of insuffecient 
> >> memory to hold the hot B+tree node in memory.
> > 
> > Does the bcache configuration have any RAM memory reservation options? 
> > Or would the 100GB of RAM be insufficient for the 594GB of NVMe Cache? 
> > For that amount of Cache, how much RAM should I have reserved for 
> > bcache? Is there any command or parameter I should use to signal 
> > bcache that it should reserve this RAM memory? I didn't do anything 
> > about this matter. How would I do it?
> > 
> 
> Currently there is no such option for limit bcache in-memory B+tree 
> nodes cache occupation, but when I/O load reduces, such memory 
> consumption may drop very fast by the reaper from system memory 
> management code. So it won’t be a problem. Bcache will try to use any 
> possible memory for B+tree nodes cache if it is necessary, and throttle 
> I/O performance to return these memory back to memory

Does bcache intentionally throttle I/O under memory pressure, or is the 
I/O throttling just a side-effect of increased memory pressure caused by 
fewer B+tree nodes in cache?

> management code when the available system memory is low. By default, it 
> should work well and nothing should be done from user.
> 
> > Another question: How do I know if I should trigger a TRIM (discard) 
> > for my NVMe with bcache?
> 
> Bcache doesn’t issue trim request proactively. 

Are you sure?  Maybe I misunderstood the code here, but it looks like 
buckets get a discard during allocation:
	https://elixir.bootlin.com/linux/v6.3-rc5/source/drivers/md/bcache/alloc.c#L335

	static int bch_allocator_thread(void *arg)
	{
		...
		while (1) {
			long bucket;

			if (!fifo_pop(&ca->free_inc, bucket))
				break;

			if (ca->discard) {
				mutex_unlock(&ca->set->bucket_lock);
				blkdev_issue_discard(ca->bdev, <<<<<<<<<<<<<
					bucket_to_sector(ca->set, bucket),
					ca->sb.bucket_size, GFP_KERNEL);
				mutex_lock(&ca->set->bucket_lock);
			}
			...
		}
		...
	}
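
For what it's worth, ca->discard above appears to correspond to the 
per-cache "discard" sysfs attribute (alongside priority_stats, e.g. 
/sys/block/bcache0/bcache/cache/cache0/discard), which defaults to off. A 
simplified sketch of the store handler in drivers/md/bcache/sysfs.c (the 
real code also checks that the device supports discard before setting it):

	if (attr == &sysfs_discard)
		ca->discard = strtoul_or_return(buf);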

-Eric


> The bcache program from bcache-tools may issue a discard request when 
> you run,
> 	bcache make -C <cache device path>
> to create a cache device.
> 
> In run time, bcache code only forward the trim request to backing device (not cache device).



> 
> 
> Thanks.
> 
> Coly Li
>  
> 
> > 
> [snipped]
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-05 13:57                 ` Coly Li
  2023-04-05 19:24                   ` Eric Wheeler
@ 2023-04-05 19:31                   ` Adriano Silva
  2023-04-06 21:21                     ` Eric Wheeler
  2023-04-09 16:37                     ` Coly Li
  1 sibling, 2 replies; 28+ messages in thread
From: Adriano Silva @ 2023-04-05 19:31 UTC (permalink / raw)
  To: Coly Li; +Cc: Eric Wheeler, Bcache Linux, Martin McClure

Hello Coly.

Yes, the server is always on. I allowed it to stay on for more than 24 hours with zero disk I/O to the bcache device. The result is that there is no movement on the cache or data disks, nor on the bcache device, as we can see:

root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
--dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|     time     
  54k  154k: 301k  221k: 223k  169k|0.67  0.54 :6.99  20.5 :6.77  12.3 |05-04 14:45:50
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:51
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:52
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:53
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:54
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:55
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:56
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:57
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:58
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:59
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:00
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:01
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:02
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:03
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:04
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:05
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:06
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:07

It can stay like that for hours, showing zero data flow, read or write, on any of the devices.

root@pve-00-005:~# cat /sys/block/bcache0/bcache/state
clean
root@pve-00-005:~#

But look how strange: another command (priority_stats) shows that there is still 1% dirty data in the cache, and 0% unused cache space, even after hours with the server on and completely idle:

root@pve-00-005:~# cat /sys/devices/pci0000:80/0000:80:01.1/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/bcache/priority_stats
Unused:         0%
Clean:          98%
Dirty:          1%
Metadata:       0%
Average:        1137
Sectors per Q:  36245232
Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]

Why is this happening?

> Can you try to write 1 to cache set sysfs file 
> gc_after_writeback? 
> When it is set, a gc will be waken up automatically after 
> all writeback accomplished. Then most of the clean cache 
> might be shunk and the B+tree nodes will be deduced 
> quite a lot.

Is this the command you asked me for?

root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback

If this command is correct, I can already say that it did not give the expected result. The cache remains 100% occupied. Nothing has changed, despite the cache being clean and despite having written the command you recommended. Let's see:

root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         0%
Clean:          98%
Dirty:          1%
Metadata:       0%
Average:        1137
Sectors per Q:  36245232
Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]

But if there was any movement on the disks after the command, I couldn't detect it:

root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
--dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|     time     
  54k  153k: 300k  221k: 222k  169k|0.67  0.53 :6.97  20.4 :6.76  12.3 |05-04 15:28:57
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:28:58
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:28:59
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:00
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:01
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:02
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:03
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:04^C
root@pve-00-005:~#

Why were there no changes?

> Currently there is no such option for limit bcache 
> in-memory B+tree nodes cache occupation, but when I/O 
> load reduces, such memory consumption may drop very 
> fast by the reaper from system memory management 
> code. So it won’t be a problem. Bcache will try to use any 
> possible memory for B+tree nodes cache if it is 
> necessary, and throttle I/O performance to return these 
> memory back to memory management code when the 
> available system memory is low. By default, it should 
> work well and nothing should be done from user.

I've been following the server's operation a lot and I've never seen less than 50 GB of free RAM memory. Let's see: 

root@pve-00-005:~# free
               total        used        free      shared  buff/cache   available
Mem:       131980688    72670448    19088648       76780    40221592    57335704
Swap:              0           0           0
root@pve-00-005:~#

There is always plenty of free RAM, which makes me ask: Could there really be a problem related to a lack of RAM?

> Bcache doesn’t issue trim request proactively. 
> [...]
> In run time, bcache code only forward the trim request to backing device (not cache device).

Wouldn't it be advantageous if bcache sent TRIM (discard) to the cache device from time to time? I believe the flash drives (SSD or NVMe) typically used as bcache caches need TRIM to maintain top performance. So I think that if the TRIM command were issued regularly by bcache in the background (only for clean and free buckets), at a controlled frequency, or even by a background task triggered manually by the user (again, only for clean and free buckets), it could help reduce the write latency of the cache. I believe it would help writeback efficiency a lot. What do you think about this?

Anyway, this issue of the free buckets not appearing is keeping me awake at night. Could it be a problem with my Kernel version (Linux 5.15)?

As I mentioned before, I saw in the bcache documentation (https://docs.kernel.org/admin-guide/bcache.html) a tunable (freelist_percent) that was supposed to control a minimum percentage of free buckets. Could it be a solution? I don't know. But in practice I didn't find this file on my system (could it be because of the OS version?)

Thank you very much!




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-05 19:31                   ` Adriano Silva
@ 2023-04-06 21:21                     ` Eric Wheeler
  2023-04-07  3:15                       ` Adriano Silva
  2023-04-09 16:37                     ` Coly Li
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Wheeler @ 2023-04-06 21:21 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Coly Li, Bcache Linux, Martin McClure

[-- Attachment #1: Type: text/plain, Size: 9689 bytes --]

On Wed, 5 Apr 2023, Adriano Silva wrote:
> > Can you try to write 1 to cache set sysfs file 
> > gc_after_writeback? 
> > When it is set, a gc will be waken up automatically after 
> > all writeback accomplished. Then most of the clean cache 
> > might be shunk and the B+tree nodes will be deduced 
> > quite a lot.
> 
> Would this be the command you ask me for?
> 
> root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
> 
> If this command is correct, I already advance that it did not give the 
> expected result. The Cache continues with 100% of the occupied space. 
> Nothing has changed despite the cache being cleaned and having written 
> the command you recommended. Let's see:

Did you try to trigger gc after setting gc_after_writeback=1?

        echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc

The `gc_after_writeback=1` setting might not trigger until writeback 
finishes, but if writeback is already finished and there is no new IO then 
it may never trigger unless it is forced via `trigger_gc`
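
(The trigger_gc sysfs write maps, roughly, to force_wake_up_gc(); a 
paraphrased sketch from drivers/md/bcache/btree.h, which also shows why a 
plain wake-up is not enough when the device is idle:)

	/* Paraphrased sketch, not verbatim kernel code. */
	static inline void force_wake_up_gc(struct cache_set *c)
	{
		/*
		 * The gc thread only proceeds when sectors_to_gc < 0, so arm
		 * the trigger unconditionally before waking the thread.
		 */
		atomic_set(&c->sectors_to_gc, -1);
		wake_up_gc(c);
	}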

-Eric


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-06 21:21                     ` Eric Wheeler
@ 2023-04-07  3:15                       ` Adriano Silva
  0 siblings, 0 replies; 28+ messages in thread
From: Adriano Silva @ 2023-04-07  3:15 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Coly Li, Bcache Linux, Martin McClure

Hello Eric,

> The `gc_after_writeback=1` setting might not trigger until writeback
> finishes, but if writeback is already finished and there is no new IO then
> it may never trigger unless it is forced via `trigger_gc`

Yes, I tried both commands, but I didn't get the expected result and continued with (almost) no free cache.

After executing the two commands, some dirty data was cleaned and a little free space appeared in the cache, but it was almost insignificant.

However, it stopped there. No more available space appeared. I tried both commands again and again there were no changes.

Note that on all the servers I have between 185 and at most 203 GB of total data occupying a 5.6 TiB bcache device.

root@pve-00-005:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  5.57269   1.00000  5.6 TiB  185 GiB   68 GiB    6 KiB  1.3 GiB  5.4 TiB  3.25  0.95   24      up
 1    hdd  5.57269   1.00000  5.6 TiB  197 GiB   80 GiB  2.8 MiB  1.4 GiB  5.4 TiB  3.46  1.01   31      up
 2    hdd  5.57269   1.00000  5.6 TiB  203 GiB   86 GiB  2.8 MiB  1.6 GiB  5.4 TiB  3.56  1.04   30      up
 3    hdd  5.57269   1.00000  5.6 TiB  197 GiB   80 GiB  2.8 MiB  1.5 GiB  5.4 TiB  3.45  1.01   31      up
 4    hdd  5.57269   1.00000  5.6 TiB  194 GiB   76 GiB    5 KiB  361 MiB  5.4 TiB  3.39  0.99   26      up
 5    hdd  5.57269   1.00000  5.6 TiB  187 GiB   69 GiB    5 KiB  1.1 GiB  5.4 TiB  3.27  0.96   25      up
 6    hdd  5.57269   1.00000  5.6 TiB  202 GiB   84 GiB    5 KiB  1.5 GiB  5.4 TiB  3.54  1.04   28      up
                       TOTAL   39 TiB  1.3 TiB  543 GiB  8.4 MiB  8.8 GiB   38 TiB  3.42                   
MIN/MAX VAR: 0.95/1.04  STDDEV: 0.11
root@pve-00-005:~#

But when I look inside the bcache devices, the caches are all pretty much full, with at most 5% free (at best), even after many hours idle and after the aforementioned commands.

root@pve-00-001:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         4%
Clean:          95%
Dirty:          0%
Metadata:       0%
Average:        1145
Sectors per Q:  36244576
Quantiles:      [8 24 39 56 84 112 155 256 392 476 605 714 825 902 988 1070 1184 1273 1369 1475 1568 1686 1775 1890 1994 2088 2212 2323 2441 2553 2693]
root@pve-00-001:~#

root@pve-00-002:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         4%
Clean:          95%
Dirty:          0%
Metadata:       0%
Average:        1143
Sectors per Q:  36245072
Quantiles:      [10 25 42 78 107 147 201 221 304 444 529 654 757 863 962 1057 1146 1264 1355 1469 1568 1664 1773 1885 2001 2111 2241 2368 2490 2613 2779]
root@pve-00-002:~#

root@pve-00-003:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         2%
Clean:          97%
Dirty:          0%
Metadata:       0%
Average:        971
Sectors per Q:  36244400
Quantiles:      [8 21 36 51 87 127 161 181 217 278 435 535 627 741 825 919 993 1080 1165 1239 1340 1428 1503 1611 1716 1815 1945 2037 2129 2248 2357]
root@pve-00-003:~#

root@pve-00-004:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         5%
Clean:          94%
Dirty:          0%
Metadata:       0%
Average:        1133
Sectors per Q:  36243024
Quantiles:      [10 26 41 57 92 121 152 192 289 440 550 645 806 913 989 1068 1170 1243 1371 1455 1567 1656 1746 1887 1996 2107 2201 2318 2448 2588 2729]
root@pve-00-004:~#

root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         2%
Clean:          97%
Dirty:          0%
Metadata:       0%
Average:        1076
Sectors per Q:  36245312
Quantiles:      [10 25 42 59 93 115 139 218 276 368 478 568 676 770 862 944 1090 1178 1284 1371 1453 1589 1700 1814 1904 1990 2147 2264 2386 2509 2679]
root@pve-00-005:~#

root@pve-00-006:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         4%
Clean:          95%
Dirty:          0%
Metadata:       0%
Average:        1085
Sectors per Q:  36244688
Quantiles:      [10 27 45 68 101 137 175 234 365 448 547 651 757 834 921 1001 1098 1185 1283 1379 1470 1575 1673 1781 1892 1994 2102 2216 2336 2461 2606]
root@pve-00-006:~#

root@pve-00-007:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         4%
Clean:          95%
Dirty:          0%
Metadata:       0%
Average:        1061
Sectors per Q:  36244160
Quantiles:      [10 24 40 56 94 132 177 233 275 326 495 602 704 846 928 1014 1091 1180 1276 1355 1471 1572 1665 1759 1862 1952 2087 2179 2292 2417 2537]
root@pve-00-007:~#

As you can see, across the 7 servers the unused cache space ranges from 2% to at most 5%, even though none of them has even 4% of the backing disk space occupied.

Little has changed after many hours with the system on but no new writes or reads to the bcache device. This time I did turn on a virtual machine, but only for a short while. Even so, as you can see, almost nothing has changed and there are still caches with almost no space available. I would bet that, in this situation, a few minutes of use would fill them all to 100% again.

It's even a funny situation, because the real data stored on the bcache device doesn't amount to even half of what is occupied in the cache. It's as if the cache were holding on to data that has already been deleted from the device.

Is there a solution?

Thank you,


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-05 19:31                   ` Adriano Silva
  2023-04-06 21:21                     ` Eric Wheeler
@ 2023-04-09 16:37                     ` Coly Li
  2023-04-09 20:14                       ` Adriano Silva
  1 sibling, 1 reply; 28+ messages in thread
From: Coly Li @ 2023-04-09 16:37 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Eric Wheeler, Bcache Linux, Martin McClure



> On Apr 6, 2023, at 03:31, Adriano Silva <adriano_da_silva@yahoo.com.br> wrote:
> 
> Hello Coly.
> 
> Yes, the server is always on. I allowed it to stay on for more than 24 hours with zero disk I/O to the bcache device. The result is that there are no movements on the cache or data disks, nor on the bcache device as we can see:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|     time     
>   54k  154k: 301k  221k: 223k  169k|0.67  0.54 :6.99  20.5 :6.77  12.3 |05-04 14:45:50
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:51
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:52
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:53
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:54
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:55
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:56
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:57
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:58
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:59
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:00
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:01
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:02
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:03
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:04
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:05
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:06
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:07
> 
> It can stay like that for hours without showing any, zero data flow, either read or write on any of the devices.
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/state
> clean
> root@pve-00-005:~#
> 
> But look how strange, in another command (priority_stats), it shows that there is still 1% of dirt in the cache. And 0% unused cache space. Even after hours of server on and completely idle:
> 
> root@pve-00-005:~# cat /sys/devices/pci0000:80/0000:80:01.1/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/bcache/priority_stats
> Unused:         0%
> Clean:          98%
> Dirty:          1%
> Metadata:       0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> Why is this happening?
> 
>> Can you try to write 1 to cache set sysfs file 
>> gc_after_writeback? 
>> When it is set, a gc will be waken up automatically after 
>> all writeback accomplished. Then most of the clean cache 
>> might be shunk and the B+tree nodes will be deduced 
>> quite a lot.
> 
> Would this be the command you ask me for?
> 
> root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
> 
> If this command is correct, I already advance that it did not give the expected result. The Cache continues with 100% of the occupied space. Nothing has changed despite the cache being cleaned and having written the command you recommended. Let's see:
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
> Unused:         0%
> Clean:          98%
> Dirty:          1%
> Metadata:       0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> But if there was any movement on the disks after the command, I couldn't detect it:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|     time     
>   54k  153k: 300k  221k: 222k  169k|0.67  0.53 :6.97  20.4 :6.76  12.3 |05-04 15:28:57
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:28:58
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:28:59
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:00
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:01
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:02
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:03
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:04^C
> root@pve-00-005:~#
> 
> Why were there no changes?

Thanks for the above information. The result is unexpected to me. Let me check whether the B+tree nodes are not being shrunk; that is something that should be improved. As for discard: write-erase time really matters for write requests mainly under heavy write load, and in that condition the LBAs of the collected buckets may be allocated out again very soon, even before the SSD controller finishes the internal write-erasure hinted by the discard/trim. Therefore issuing discard/trim right before writing to an LBA doesn’t help write performance at all and just puts extra, unnecessary workload on the SSD controller.

And for today's SATA/NVMe SSDs, with the workload you described above, the write performance drawback can almost be ignored.

> 
>> Currently there is no such option for limit bcache 
>> in-memory B+tree nodes cache occupation, but when I/O 
>> load reduces, such memory consumption may drop very 
>> fast by the reaper from system memory management 
>> code. So it won’t be a problem. Bcache will try to use any 
>> possible memory for B+tree nodes cache if it is 
>> necessary, and throttle I/O performance to return these 
>> memory back to memory management code when the 
>> available system memory is low. By default, it should 
>> work well and nothing should be done from user.
> 
> I've been following the server's operation a lot and I've never seen less than 50 GB of free RAM memory. Let's see: 
> 
> root@pve-00-005:~# free               total        used        free      shared  buff/cache   available
> Mem:       131980688    72670448    19088648       76780    40221592    57335704
> Swap:              0           0           0
> root@pve-00-005:~#
> 
> There is always plenty of free RAM, which makes me ask: Could there really be a problem related to a lack of RAM?

No, this is not because of insufficient memory. From your information the memory is enough.

> 
>> Bcache doesn’t issue trim request proactively. 
>> [...]
>> In run time, bcache code only forward the trim request to backing device (not cache device).
> 
> Wouldn't it be advantageous if bcache sent TRIM (discard) to the cache temporarily? I believe flash drives (SSD or NVMe) that need TRIM to maintain top performance are typically used as a cache for bcache. So, I think that if the TRIM command was used regularly by bcache, in the background (only for clean and free buckets), with a controlled frequency, or even if executed by a manually triggered by the user background task (always only for clean and free buckets), it could help to reduce the write latency of the cache. I believe it would help the writeback efficiency a lot. What do you think about this?

There was such an attempt, but indeed it didn’t help at all. The reason is that bcache only knows which buckets can be discarded when they are handled by garbage collection.


> 
> Anyway, this issue of the free buckets not appearing is keeping me awake at night. Could it be a problem with my Kernel version (Linux 5.15)?
> 
> As I mentioned before, I saw in the bcache documentation (https://docs.kernel.org/admin-guide/bcache.html) a variable (freelist_percent) that was supposed to control a minimum rate of free buckets. Could it be a solution? I don't know. But in practice, I didn't find this variable in my system (could it be because of the OS version?)

Let me look into this…

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-09 16:37                     ` Coly Li
@ 2023-04-09 20:14                       ` Adriano Silva
  2023-04-09 21:07                         ` Adriano Silva
  0 siblings, 1 reply; 28+ messages in thread
From: Adriano Silva @ 2023-04-09 20:14 UTC (permalink / raw)
  To: Coly Li; +Cc: Eric Wheeler, Bcache Linux, Martin McClure

Hello Eric!

> Did you try to trigger gc after setting gc_after_writeback=1?
> 
>         echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
> 
> The `gc_after_writeback=1` setting might not trigger until writeback
> finishes, but if writeback is already finished and there is no new IO then
> it may never trigger unless it is forced via `tigger_gc`
> 
> -Eric


Yes, I ran the two commands several times: one after the other, first one then the other, then in reverse order, repeatedly, after hours of zero disk reads and writes, and on more than one server. I tested all my servers, actually. And on all of them the results are similar: no significant amount of cache space is freed.

And to make matters worse, in other performance tests I realized that depending on the block size I use, the difference in performance is frightening. With 4MB blocks I can write 691MB/s with a freshly formatted cache.

root@pve-01-007:~# ceph tell osd.0 bench                
{                
    bytes_written: 1073741824,                
    blocksize: 4194304,                
    elapsed_sec: 1.5536911500000001,                
    bytes_per_sec: 691090905.67967761,
    iops: 164.76891176216068
}                
root@pve-01-007:~#

In the same test I only get 142MB/s when the cache is fully occupied.

root@pve-00-005:~# ceph tell osd.0 bench
{
    bytes_written: 1073741824,
    blocksize: 4194304,
    elapsed_sec: 7.5302066820000002,
    bytes_per_sec: 142591281.93209398,
    iops: 33.996410830520148
}
root@pve-00-005:~#

That is, with the cache fully occupied, bcache writes at only 21% of the performance obtained with a newly formatted cache. It doesn't even look like the same hardware... same NVMe, same processors, same RAM, same server, same OS, same bcache settings. If you reformat the cache, it returns to the original performance.

I'm looking at the bcache source code to see if I can pick up anything that might be useful to me. But the code is big and complex. I confess that it is not quick to understand.

I created a little C program to try and call a built-in bcache function for testing, but I spent Sunday and couldn't even compile the program. It is funny.

But what would the garbage collector actually reclaim in this case? My understanding is that the "garbage" consists of bucket fragments that were never reused and got "lost" outside the c->free list and the free_inc list. I think collecting them would help, yes, but maybe only in a very limited way. Is that the condition most in-use buckets are in?

As it seems to me (and I could be talking nonsense), what would solve the problem is getting bcache to keep an adequate number of buckets on the c->free list. I see this being handled in bcache/alloc.c.

Would that be through invalidate_buckets(ca), called from the bch_allocator_thread(void *arg) thread? I don't know. What is limiting the action of this thread? I could not figure it out.

But here, in my anxious ignorance, I keep thinking this might be the way: some means of calling that function to invalidate many clean buckets in LRU order and discard them. So I looked for an external interface that calls it, but I didn't find one.
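
In the meantime, the closest I can get from userspace is to watch the allocator indirectly, roughly like this (a rough sketch; paths as used elsewhere in this thread, and it assumes cache_available_percent is exposed on this kernel):

    before=$(cat /sys/block/bcache0/bcache/cache/cache_available_percent)
    echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
    sleep 30
    after=$(cat /sys/block/bcache0/bcache/cache/cache_available_percent)
    echo "cache_available_percent: ${before}% -> ${after}%"
    grep Unused /sys/block/bcache0/bcache/cache/cache0/priority_stats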

Thank you very much!

Em domingo, 9 de abril de 2023 às 13:37:32 BRT, Coly Li <colyli@suse.de> escreveu: 







> 2023年4月6日 03:31,Adriano Silva <adriano_da_silva@yahoo.com.br> 写道:
> 
> Hello Coly.
> 
> Yes, the server is always on. I allowed it to stay on for more than 24 hours with zero disk I/O to the bcache device. The result is that there are no movements on the cache or data disks, nor on the bcache device as we can see:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
>  54k  154k: 301k  221k: 223k  169k|0.67  0.54 :6.99  20.5 :6.77  12.3 |05-04 14:45:50
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:51
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:52
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:53
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:54
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:55
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:56
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:57
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:58
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:59
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:00
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:01
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:02
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:03
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:04
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:05
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:06
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:07
> 
> It can stay like that for hours without showing any, zero data flow, either read or write on any of the devices.
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/state
> clean
> root@pve-00-005:~#
> 
> But look how strange, in another command (priority_stats), it shows that there is still 1% of dirt in the cache. And 0% unused cache space. Even after hours of server on and completely idle:
> 
> root@pve-00-005:~# cat /sys/devices/pci0000:80/0000:80:01.1/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/bcache/priority_stats
> Unused:        0%
> Clean:          98%
> Dirty:          1%
> Metadata:      0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> Why is this happening?
> 
>> Can you try to write 1 to cache set sysfs file 
>> gc_after_writeback? 
>> When it is set, a gc will be waken up automatically after 
>> all writeback accomplished. Then most of the clean cache 
>> might be shrunk and the B+tree nodes will be reduced 
>> quite a lot.
> 
> Would this be the command you ask me for?
> 
> root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
> 
> If this command is correct, I already advance that it did not give the expected result. The Cache continues with 100% of the occupied space. Nothing has changed despite the cache being cleaned and having written the command you recommended. Let's see:
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
> Unused:        0%
> Clean:          98%
> Dirty:          1%
> Metadata:      0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> But if there was any movement on the disks after the command, I couldn't detect it:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
>  54k  153k: 300k  221k: 222k  169k|0.67  0.53 :6.97  20.4 :6.76  12.3 |05-04 15:28:57
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:58
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:59
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:00
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:01
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:02
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:03
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:04^C
> root@pve-00-005:~#
> 
> Why were there no changes?

Thanks for the above information. The result is unexpected to me. Let me check whether the B+tree nodes are not being shrunk; that is something that should be improved. And when write-erase time matters for write requests, it is normally because a heavy write load is arriving. In such a situation, the LBAs of the collected buckets may be allocated again very soon, even before the SSD controller has finished the internal erasure hinted by the discard/trim. Therefore issuing a discard/trim right before writing to that LBA does not help write performance at all, and just puts extra, unnecessary work on the SSD controller.

And for today's SATA/NVMe SSDs, with the workload you describe above, the write performance drawback can almost be ignored.

> 
>> Currently there is no such option for limit bcache 
>> in-memory B+tree nodes cache occupation, but when I/O 
>> load reduces, such memory consumption may drop very 
>> fast by the reaper from system memory management 
>> code. So it won’t be a problem. Bcache will try to use any 
>> possible memory for B+tree nodes cache if it is 
>> necessary, and throttle I/O performance to return these 
>> memory back to memory management code when the 
>> available system memory is low. By default, it should 
>> work well and nothing should be done from user.
> 
> I've been following the server's operation a lot and I've never seen less than 50 GB of free RAM memory. Let's see: 
> 
> root@pve-00-005:~# free
>                total        used        free      shared  buff/cache  available
> Mem:      131980688    72670448    19088648      76780    40221592    57335704
> Swap:              0          0          0
> root@pve-00-005:~#
> 
> There is always plenty of free RAM, which makes me ask: Could there really be a problem related to a lack of RAM?

No, this is not because of insufficient memory. From your information the memory is enough.

> 
>> Bcache doesn’t issue trim request proactively. 
>> [...]
>> In run time, bcache code only forward the trim request to backing device (not cache device).
> 
> Wouldn't it be advantageous if bcache sent TRIM (discard) to the cache temporarily? I believe flash drives (SSD or NVMe) that need TRIM to maintain top performance are typically used as a cache for bcache. So, I think that if the TRIM command was used regularly by bcache, in the background (only for clean and free buckets), with a controlled frequency, or even if executed by a manually triggered by the user background task (always only for clean and free buckets), it could help to reduce the write latency of the cache. I believe it would help the writeback efficiency a lot. What do you think about this?

There was such attempt but indeed doesn’t help at all. The reason is, bcache can only know which bucket can be discarded when it is handled by garbage collection. 


> 
> Anyway, this issue of the free buckets not appearing is keeping me awake at night. Could it be a problem with my Kernel version (Linux 5.15)?
> 
> As I mentioned before, I saw in the bcache documentation (https://docs.kernel.org/admin-guide/bcache.html) a variable (freelist_percent) that was supposed to control a minimum rate of free buckets. Could it be a solution? I don't know. But in practice, I didn't find this variable in my system (could it be because of the OS version?)

Let me look into this…


Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-09 20:14                       ` Adriano Silva
@ 2023-04-09 21:07                         ` Adriano Silva
  2023-04-20 11:35                           ` Adriano Silva
  0 siblings, 1 reply; 28+ messages in thread
From: Adriano Silva @ 2023-04-09 21:07 UTC (permalink / raw)
  To: Coly Li; +Cc: Eric Wheeler, Bcache Linux, Martin McClure

Hi Coly! 

Talking about the TRIM (discard) made in the cache...

> There was such attempt but indeed doesn’t help at all. 
> The reason is, bcache can only know which bucket can 
> be discarded when it is handled by garbage collection.

Come to think of it, I mentioned something curious to Eric earlier, though I could be wrong. What I understand about the "garbage collector" is that the "garbage" consists of bucket fragments that were never reused and got "lost" outside the c->free list and the free_inc list. If my perception is correct, I think the garbage collector would help very little in my case. Of course, all help is welcome, but I'm already thinking about the bigger problem.

If I'm thinking correctly, in my case most of the buckets would not be collected by the garbage collector, because they hold data that has not been deleted in the file system. They would need to be cleaned (written back to the backing device) and, after some time without being accessed, evicted from the cache. That way the cache would hold only hot data, i.e. recently accessed data (LRU), while never being allowed to fill up completely.

Using the same logic that bcache already uses to choose a bucket to be erased and replaced (when the cache is completely full and a new write arrives), it could do the same thing proactively: free up space by erasing the data in many buckets ahead of time, whenever it notices the cache is getting very close to full. This could be done in the background, asynchronously. In that case I understand TRIM/discard should help a lot. Don't you think?
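
Something along these lines could even be prototyped from userspace as a stopgap (a rough sketch; the 10% threshold and 5-minute interval are arbitrary, and trigger_gc can only reclaim what gc already considers reclaimable):

    while sleep 300; do
        unused=$(awk '/Unused/ {gsub("%","",$2); print $2}' \
            /sys/block/bcache0/bcache/cache/cache0/priority_stats)
        if [ "$unused" -lt 10 ]; then
            echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
        fi
    done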

So my question would be: is bcache able to rank buckets by recency, i.e. distinguish levels of more and less recently accessed buckets?

I think the variable I mentioned, which I saw in the kernel documentation (freelist_percent), may have been designed for this purpose.
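
For what it's worth, a quick way to check whether that attribute exists on a given kernel (on my 5.15 it does not appear, which is what made me wonder whether it ever landed):

    find /sys/fs/bcache -name freelist_percent
    ls /sys/block/bcache0/bcache/cache/cache0/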

Coly, thank you very much!



Em domingo, 9 de abril de 2023 às 17:14:57 BRT, Adriano Silva <adriano_da_silva@yahoo.com.br> escreveu: 


Hello Eric !

> Did you try to trigger gc after setting gc_after_writeback=1?
> 
>         echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
> 
> The `gc_after_writeback=1` setting might not trigger until writeback
> finishes, but if writeback is already finished and there is no new IO then
> it may never trigger unless it is forced via `tigger_gc`
> 
> -Eric


Yes, I use the two commands indicated several times, one after the other, first one, then the other, then in reversed order... successive times, after hours of zero disk writing/reading. On more than one server. I tested it on all my servers actually. And in all, the results are similar, there is no significant cache space flush.

And to make matters worse, in other performance tests, I realized that depending on the size of the block I manipulate, the difference in performance is frightening. With 4MB blocks I can write 691MB/s with freshly formatted cache.

root@pve-01-007:~# ceph tell osd.0 bench                
{                
    bytes_written: 1073741824,                
    blocksize: 4194304,                
    elapsed_sec: 1.5536911500000001,                
    bytes_per_sec: 691090905.67967761,
    iops: 164.76891176216068
}                
root@pve-01-007:~#

In the same test I only get 142MB/s when all cache is occupied.

root@pve-00-005:~# ceph tell osd.0 bench
{
    bytes_written: 1073741824,
    blocksize: 4194304,
    elapsed_sec: 7.5302066820000002,
    bytes_per_sec: 142591281.93209398,
    iops: 33.996410830520148
}
root@pve-00-005:~#

That is, with the cache after all occupied, the bcache can write with only 21% of the performance obtained with the newly formatted cache. It doesn't look like we're talking about exactly the same hardware... Same NVME, same processors, same RAM, same server, same OS, same bcache settings..... If you format the cache, it returns to the original performance.

I'm looking at the bcache source code to see if I can pick up anything that might be useful to me. But the code is big and complex. I confess that it is not quick to understand.

I created a little C program to try and call a built-in bcache function for testing, but I spent Sunday and couldn't even compile the program. It is funny.

But what would the garbage collector be in this case? What I understand is that the "garbage" would be parts of buckets (blocks) that would not have been reused and were "lost" outside the c->free list and also outside the free_inc list. I think that would help yes, but maybe in a very limited way. Is this the condition of most buckets that are in use?

As it seems to me (I could be talking nonsense), what would solve the problem would be to get bcache to allocate an adequate amount of buckets in the c->free list. I see this being mentioned in bcache/alloc.c

Would it be through invalidate_buckets(ca) called through the bch_allocator_thread(void *arg) thread? I don't know. What is limiting the action of this thread? I could not understand.

But here in my anxious ignorance, I'm left thinking maybe this was the way, a way to call this function to invalidate many clean buckets in the lru order and discard them. So I looked for an external interface that calls it, but I didn't find it.

Thank you very much!

Em domingo, 9 de abril de 2023 às 13:37:32 BRT, Coly Li <colyli@suse.de> escreveu: 







> 2023年4月6日 03:31,Adriano Silva <adriano_da_silva@yahoo.com.br> 写道:
> 
> Hello Coly.
> 
> Yes, the server is always on. I allowed it to stay on for more than 24 hours with zero disk I/O to the bcache device. The result is that there are no movements on the cache or data disks, nor on the bcache device as we can see:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
>  54k  154k: 301k  221k: 223k  169k|0.67  0.54 :6.99  20.5 :6.77  12.3 |05-04 14:45:50
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:51
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:52
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:53
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:54
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:55
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:56
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:57
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:58
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:59
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:00
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:01
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:02
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:03
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:04
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:05
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:06
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:07
> 
> It can stay like that for hours without showing any, zero data flow, either read or write on any of the devices.
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/state
> clean
> root@pve-00-005:~#
> 
> But look how strange, in another command (priority_stats), it shows that there is still 1% of dirt in the cache. And 0% unused cache space. Even after hours of server on and completely idle:
> 
> root@pve-00-005:~# cat /sys/devices/pci0000:80/0000:80:01.1/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/bcache/priority_stats
> Unused:        0%
> Clean:          98%
> Dirty:          1%
> Metadata:      0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> Why is this happening?
> 
>> Can you try to write 1 to cache set sysfs file 
>> gc_after_writeback? 
>> When it is set, a gc will be waken up automatically after 
>> all writeback accomplished. Then most of the clean cache 
>> might be shunk and the B+tree nodes will be deduced 
>> quite a lot.
> 
> Would this be the command you ask me for?
> 
> root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
> 
> If this command is correct, I already advance that it did not give the expected result. The Cache continues with 100% of the occupied space. Nothing has changed despite the cache being cleaned and having written the command you recommended. Let's see:
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
> Unused:        0%
> Clean:          98%
> Dirty:          1%
> Metadata:      0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> But if there was any movement on the disks after the command, I couldn't detect it:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
>  54k  153k: 300k  221k: 222k  169k|0.67  0.53 :6.97  20.4 :6.76  12.3 |05-04 15:28:57
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:58
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:59
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:00
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:01
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:02
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:03
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:04^C
> root@pve-00-005:~#
> 
> Why were there no changes?

Thanks for the above information. The result is unexpected to me. Let me check whether the B+tree nodes are not being shrunk; that is something that should be improved. And when write-erase time matters for write requests, it is normally because a heavy write load is arriving. In such a situation, the LBAs of the collected buckets may be allocated again very soon, even before the SSD controller has finished the internal erasure hinted by the discard/trim. Therefore issuing a discard/trim right before writing to that LBA does not help write performance at all, and just puts extra, unnecessary work on the SSD controller.

And for today's SATA/NVMe SSDs, with the workload you describe above, the write performance drawback can almost be ignored.

> 
>> Currently there is no such option for limit bcache 
>> in-memory B+tree nodes cache occupation, but when I/O 
>> load reduces, such memory consumption may drop very 
>> fast by the reaper from system memory management 
>> code. So it won’t be a problem. Bcache will try to use any 
>> possible memory for B+tree nodes cache if it is 
>> necessary, and throttle I/O performance to return these 
>> memory back to memory management code when the 
>> available system memory is low. By default, it should 
>> work well and nothing should be done from user.
> 
> I've been following the server's operation a lot and I've never seen less than 50 GB of free RAM memory. Let's see: 
> 
> root@pve-00-005:~# free              total        used        free      shared  buff/cache  available
> Mem:      131980688    72670448    19088648      76780    40221592    57335704
> Swap:              0          0          0
> root@pve-00-005:~#
> 
> There is always plenty of free RAM, which makes me ask: Could there really be a problem related to a lack of RAM?

No, this is not because of insufficient memory. From your information the memory is enough.

> 
>> Bcache doesn’t issue trim request proactively. 
>> [...]
>> In run time, bcache code only forward the trim request to backing device (not cache device).
> 
> Wouldn't it be advantageous if bcache sent TRIM (discard) to the cache temporarily? I believe flash drives (SSD or NVMe) that need TRIM to maintain top performance are typically used as a cache for bcache. So, I think that if the TRIM command was used regularly by bcache, in the background (only for clean and free buckets), with a controlled frequency, or even if executed by a manually triggered by the user background task (always only for clean and free buckets), it could help to reduce the write latency of the cache. I believe it would help the writeback efficiency a lot. What do you think about this?

There was such attempt but indeed doesn’t help at all. The reason is, bcache can only know which bucket can be discarded when it is handled by garbage collection. 


> 
> Anyway, this issue of the free buckets not appearing is keeping me awake at night. Could it be a problem with my Kernel version (Linux 5.15)?
> 
> As I mentioned before, I saw in the bcache documentation (https://docs.kernel.org/admin-guide/bcache.html) a variable (freelist_percent) that was supposed to control a minimum rate of free buckets. Could it be a solution? I don't know. But in practice, I didn't find this variable in my system (could it be because of the OS version?)

Let me look into this…


Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-09 21:07                         ` Adriano Silva
@ 2023-04-20 11:35                           ` Adriano Silva
  2023-05-02 20:34                             ` Eric Wheeler
  0 siblings, 1 reply; 28+ messages in thread
From: Adriano Silva @ 2023-04-20 11:35 UTC (permalink / raw)
  To: Coly Li; +Cc: Eric Wheeler, Bcache Linux, Martin McClure

Hey guys. All right with you?

I am continuing to investigate the situation. There is indeed a performance gain when the bcache device is only half full versus completely full: direct-write latency is both lower and more stable, which improves my scenario.

But it should be noted that the difference only shows up after we let the device rest (reorganize itself internally) once it has been cleaned. Maybe so it can clear its internal caches?
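
To see when it has "settled", I simply watch the stats for a while (nothing fancy; the path is the same priority_stats file used earlier in this thread):

    watch -n 60 'grep -E "Unused|Clean|Dirty" /sys/block/bcache0/bcache/cache/cache0/priority_stats'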

I thought about keeping gc_after_writeback on all the time and also turning on bcache's discard option to see if that improves things. But my backing device is an HDD.
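
If I go that route, it would be something like this at boot (a sketch; the gc_after_writeback path matches the one used earlier in this thread, while the per-cache-device discard path is my assumption):

    echo 1 > /sys/block/bcache0/bcache/cache/internal/gc_after_writeback
    echo 1 > /sys/block/bcache0/bcache/cache/cache0/discard    # assumption: per-cache-device discard switch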

One thing that hasn't been clear to me since the last conversation is the bcache discard option: Coly said the discard would be passed only to the backing device, but Eric pulled up a snippet of source code that seemed to indicate otherwise and asked Coly whether there could be a mistake. Anyway, Coly, can you confirm whether or not a discard is issued for buckets evicted from the cache, or is it really only forwarded to the backing device?
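
In the meantime, maybe this can be observed empirically (a sketch; it assumes a sysstat version recent enough to report the discard columns): enable the discard option, force some bucket turnover, and watch whether discards ever hit the cache device:

    iostat -dx 1 nvme0n1    # watch the d/s (discards per second) column while gc runs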

Thank you all!



Em domingo, 9 de abril de 2023 às 18:07:41 BRT, Adriano Silva <adriano_da_silva@yahoo.com.br> escreveu: 





Hi Coly! 

Talking about the TRIM (discard) made in the cache...

> There was such attempt but indeed doesn’t help at all. 
> The reason is, bcache can only know which bucket can 
> be discarded when it is handled by garbage collection.

Come to think of it, I spoke to Eric before something curious, but I could be wrong. What I understand about the "garbage collector" is that the "garbage" would be parts of buckets (blocks) that would not have been reused and were "lost" outside the c->free list and also outside the free_inc list. If I'm correct in my perception, I think the garbage collector would help very little in my case. Of course, all help is welcome. But I'm already thinking about the bigger one.

If I think correctly, I don't think that in my case most of the buckets would be collected by the garbage collector. Because it is data that has not been erased in the file system. They would need to be cleaned (saved to the mass device) and after some time passed without access, removed from the cache. That is, in the cache would only be hot data. That is recently accessed data (LRU), but never allowing the cache to fill completely.

Using the same logic that bcache already uses to choose a bucket to be erased and replaced (in case the cache is already completely full and a new write is requested), it would do the same, allocating empty space by erasing the data in the bucket (in many buckets) previously whenever you notice that the cache is very close to being full. You can do this in the background, asynchronously. So in this case I understand that TRIM/discard should help a lot. Do not you think?

So my question would be: is bcache capable of ranking recently accessed buckets, differentiating into lines (levels) of more or less recently accessed buckets?

I think the variable I mentioned, which I saw in the kernel documentation (freelist_percent), may have been designed for this purpose.

Coly, thank you very much!



Em domingo, 9 de abril de 2023 às 17:14:57 BRT, Adriano Silva <adriano_da_silva@yahoo.com.br> escreveu: 


Hello Eric !

> Did you try to trigger gc after setting gc_after_writeback=1?
> 
>         echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
> 
> The `gc_after_writeback=1` setting might not trigger until writeback
> finishes, but if writeback is already finished and there is no new IO then
> it may never trigger unless it is forced via `tigger_gc`
> 
> -Eric


Yes, I use the two commands indicated several times, one after the other, first one, then the other, then in reversed order... successive times, after hours of zero disk writing/reading. On more than one server. I tested it on all my servers actually. And in all, the results are similar, there is no significant cache space flush.

And to make matters worse, in other performance tests, I realized that depending on the size of the block I manipulate, the difference in performance is frightening. With 4MB blocks I can write 691MB/s with freshly formatted cache.

root@pve-01-007:~# ceph tell osd.0 bench                
{                
    bytes_written: 1073741824,                
    blocksize: 4194304,                
    elapsed_sec: 1.5536911500000001,                
    bytes_per_sec: 691090905.67967761,
    iops: 164.76891176216068
}                
root@pve-01-007:~#

In the same test I only get 142MB/s when all cache is occupied.

root@pve-00-005:~# ceph tell osd.0 bench
{
    bytes_written: 1073741824,
    blocksize: 4194304,
    elapsed_sec: 7.5302066820000002,
    bytes_per_sec: 142591281.93209398,
    iops: 33.996410830520148
}
root@pve-00-005:~#

That is, with the cache after all occupied, the bcache can write with only 21% of the performance obtained with the newly formatted cache. It doesn't look like we're talking about exactly the same hardware... Same NVME, same processors, same RAM, same server, same OS, same bcache settings..... If you format the cache, it returns to the original performance.

I'm looking at the bcache source code to see if I can pick up anything that might be useful to me. But the code is big and complex. I confess that it is not quick to understand.

I created a little C program to try and call a built-in bcache function for testing, but I spent Sunday and couldn't even compile the program. It is funny.

But what would the garbage collector be in this case? What I understand is that the "garbage" would be parts of buckets (blocks) that would not have been reused and were "lost" outside the c->free list and also outside the free_inc list. I think that would help yes, but maybe in a very limited way. Is this the condition of most buckets that are in use?

As it seems to me (I could be talking nonsense), what would solve the problem would be to get bcache to allocate an adequate amount of buckets in the c->free list. I see this being mentioned in bcache/alloc.c

Would it be through invalidate_buckets(ca) called through the bch_allocator_thread(void *arg) thread? I don't know. What is limiting the action of this thread? I could not understand.

But here in my anxious ignorance, I'm left thinking maybe this was the way, a way to call this function to invalidate many clean buckets in the lru order and discard them. So I looked for an external interface that calls it, but I didn't find it.

Thank you very much!

Em domingo, 9 de abril de 2023 às 13:37:32 BRT, Coly Li <colyli@suse.de> escreveu: 







> 2023年4月6日 03:31,Adriano Silva <adriano_da_silva@yahoo.com.br> 写道:
> 
> Hello Coly.
> 
> Yes, the server is always on. I allowed it to stay on for more than 24 hours with zero disk I/O to the bcache device. The result is that there are no movements on the cache or data disks, nor on the bcache device as we can see:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
>  54k  154k: 301k  221k: 223k  169k|0.67  0.54 :6.99  20.5 :6.77  12.3 |05-04 14:45:50
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:51
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:52
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:53
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:54
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:55
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:56
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:57
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:58
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:59
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:00
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:01
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:02
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:03
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:04
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:05
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:06
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:07
> 
> It can stay like that for hours without showing any, zero data flow, either read or write on any of the devices.
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/state
> clean
> root@pve-00-005:~#
> 
> But look how strange, in another command (priority_stats), it shows that there is still 1% of dirt in the cache. And 0% unused cache space. Even after hours of server on and completely idle:
> 
> root@pve-00-005:~# cat /sys/devices/pci0000:80/0000:80:01.1/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/bcache/priority_stats
> Unused:        0%
> Clean:          98%
> Dirty:          1%
> Metadata:      0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> Why is this happening?
> 
>> Can you try to write 1 to cache set sysfs file 
>> gc_after_writeback? 
>> When it is set, a gc will be waken up automatically after 
>> all writeback accomplished. Then most of the clean cache 
>> might be shunk and the B+tree nodes will be deduced 
>> quite a lot.
> 
> Would this be the command you ask me for?
> 
> root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
> 
> If this command is correct, I already advance that it did not give the expected result. The Cache continues with 100% of the occupied space. Nothing has changed despite the cache being cleaned and having written the command you recommended. Let's see:
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
> Unused:        0%
> Clean:          98%
> Dirty:          1%
> Metadata:      0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> But if there was any movement on the disks after the command, I couldn't detect it:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
>  54k  153k: 300k  221k: 222k  169k|0.67  0.53 :6.97  20.4 :6.76  12.3 |05-04 15:28:57
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:58
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:59
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:00
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:01
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:02
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:03
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:04^C
> root@pve-00-005:~#
> 
> Why were there no changes?

Thanks for the above information. The result is unexpected to me. Let me check whether the B+tree nodes are not being shrunk; that is something that should be improved. And when write-erase time matters for write requests, it is normally because a heavy write load is arriving. In such a situation, the LBAs of the collected buckets may be allocated again very soon, even before the SSD controller has finished the internal erasure hinted by the discard/trim. Therefore issuing a discard/trim right before writing to that LBA does not help write performance at all, and just puts extra, unnecessary work on the SSD controller.

And for today's SATA/NVMe SSDs, with the workload you describe above, the write performance drawback can almost be ignored.

> 
>> Currently there is no such option for limit bcache 
>> in-memory B+tree nodes cache occupation, but when I/O 
>> load reduces, such memory consumption may drop very 
>> fast by the reaper from system memory management 
>> code. So it won’t be a problem. Bcache will try to use any 
>> possible memory for B+tree nodes cache if it is 
>> necessary, and throttle I/O performance to return these 
>> memory back to memory management code when the 
>> available system memory is low. By default, it should 
>> work well and nothing should be done from user.
> 
> I've been following the server's operation a lot and I've never seen less than 50 GB of free RAM memory. Let's see: 
> 
> root@pve-00-005:~# free              total        used        free      shared  buff/cache  available
> Mem:      131980688    72670448    19088648      76780    40221592    57335704
> Swap:              0          0          0
> root@pve-00-005:~#
> 
> There is always plenty of free RAM, which makes me ask: Could there really be a problem related to a lack of RAM?

No, this is not because of insufficient memory. From your information the memory is enough.

> 
>> Bcache doesn’t issue trim request proactively. 
>> [...]
>> In run time, bcache code only forward the trim request to backing device (not cache device).
> 
> Wouldn't it be advantageous if bcache sent TRIM (discard) to the cache temporarily? I believe flash drives (SSD or NVMe) that need TRIM to maintain top performance are typically used as a cache for bcache. So, I think that if the TRIM command was used regularly by bcache, in the background (only for clean and free buckets), with a controlled frequency, or even if executed by a manually triggered by the user background task (always only for clean and free buckets), it could help to reduce the write latency of the cache. I believe it would help the writeback efficiency a lot. What do you think about this?

There was such attempt but indeed doesn’t help at all. The reason is, bcache can only know which bucket can be discarded when it is handled by garbage collection. 


> 
> Anyway, this issue of the free buckets not appearing is keeping me awake at night. Could it be a problem with my Kernel version (Linux 5.15)?
> 
> As I mentioned before, I saw in the bcache documentation (https://docs.kernel.org/admin-guide/bcache.html) a variable (freelist_percent) that was supposed to control a minimum rate of free buckets. Could it be a solution? I don't know. But in practice, I didn't find this variable in my system (could it be because of the OS version?)

Let me look into this…


Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-04-20 11:35                           ` Adriano Silva
@ 2023-05-02 20:34                             ` Eric Wheeler
  2023-05-04  4:56                               ` Coly Li
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Wheeler @ 2023-05-02 20:34 UTC (permalink / raw)
  To: Coly Li; +Cc: Adriano Silva, Bcache Linux, Martin McClure


On Thu, 20 Apr 2023, Adriano Silva wrote:
> I continue to investigate the situation. There is actually a performance 
> gain when the bcache device is only half filled versus full. There is a 
> reduction and greater stability in the latency of direct writes and this 
> improves my scenario.

Hi Coly, have you been able to look at this?

This sounds like a great optimization and Adriano is in a place to test 
this now and report his findings.

I think you said this should be a simple hack to add early reclaim, so 
maybe you can throw a quick patch together (even a rough first-pass with 
hard-coded reclaim values).

If we can get back to Adriano quickly then he can test while he has an 
easy-to-reproduce environment.  Indeed, this could benefit all bcache 
users.

--
Eric Wheeler



> 
> But it should be noted that the difference is noticed when we wait for 
> the device to rest (organize itself internally) after being cleaned. 
> Maybe for him to clear his internal caches?
> 
> I thought about keeping gc_after_writeback on all the time and also 
> turning on bcache's discard option to see if that improves. But my back 
> device is an HDD.
> 
> One thing that wasn't clear to me since the last conversation is about 
> the bcache discard option, because Coly even said that the discard would 
> be passed only to the back device. However, Eric pulled up a snippet of 
> source code that supposedly could indicate something different, asking 
> Coly if there could be a mistake. Anyway Coly, can you confirm whether 
> or not the discard is passed on to the buckets deleted from the cache? 
> Or does it confirm that it would really only be for the back device?
> 
> Thank you all!
> 
> 
> 
> Em domingo, 9 de abril de 2023 às 18:07:41 BRT, Adriano Silva <adriano_da_silva@yahoo.com.br> escreveu: 
> 
> 
> 
> 
> 
> Hi Coly! 
> 
> Talking about the TRIM (discard) made in the cache...
> 
> > There was such attempt but indeed doesn’t help at all. 
> > The reason is, bcache can only know which bucket can 
> > be discarded when it is handled by garbage collection.
> 
> Come to think of it, I spoke to Eric before something curious, but I could be wrong. What I understand about the "garbage collector" is that the "garbage" would be parts of buckets (blocks) that would not have been reused and were "lost" outside the c->free list and also outside the free_inc list. If I'm correct in my perception, I think the garbage collector would help very little in my case. Of course, all help is welcome. But I'm already thinking about the bigger one.
> 
> If I think correctly, I don't think that in my case most of the buckets would be collected by the garbage collector. Because it is data that has not been erased in the file system. They would need to be cleaned (saved to the mass device) and after some time passed without access, removed from the cache. That is, in the cache would only be hot data. That is recently accessed data (LRU), but never allowing the cache to fill completely.
> 
> Using the same logic that bcache already uses to choose a bucket to be erased and replaced (in case the cache is already completely full and a new write is requested), it would do the same, allocating empty space by erasing the data in the bucket (in many buckets) previously whenever you notice that the cache is very close to being full. You can do this in the background, asynchronously. So in this case I understand that TRIM/discard should help a lot. Do not you think?
> 
> So my question would be: is bcache capable of ranking recently accessed buckets, differentiating into lines (levels) of more or less recently accessed buckets?
> 
> I think the variable I mentioned, which I saw in the kernel documentation (freelist_percent), may have been designed for this purpose.
> 
> Coly, thank you very much!
> 
> 
> 
> Em domingo, 9 de abril de 2023 às 17:14:57 BRT, Adriano Silva <adriano_da_silva@yahoo.com.br> escreveu: 
> 
> 
> Hello Eric !
> 
> > Did you try to trigger gc after setting gc_after_writeback=1?
> > 
> >         echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
> > 
> > The `gc_after_writeback=1` setting might not trigger until writeback
> > finishes, but if writeback is already finished and there is no new IO then
> > it may never trigger unless it is forced via `tigger_gc`
> > 
> > -Eric
> 
> 
> Yes, I use the two commands indicated several times, one after the other, first one, then the other, then in reversed order... successive times, after hours of zero disk writing/reading. On more than one server. I tested it on all my servers actually. And in all, the results are similar, there is no significant cache space flush.
> 
> And to make matters worse, in other performance tests, I realized that depending on the size of the block I manipulate, the difference in performance is frightening. With 4MB blocks I can write 691MB/s with freshly formatted cache.
> 
> root@pve-01-007:~# ceph tell osd.0 bench                
> {                
>     bytes_written: 1073741824,                
>     blocksize: 4194304,                
>     elapsed_sec: 1.5536911500000001,                
>     bytes_per_sec: 691090905.67967761,
>     iops: 164.76891176216068
> }                
> root@pve-01-007:~#
> 
> In the same test I only get 142MB/s when all cache is occupied.
> 
> root@pve-00-005:~# ceph tell osd.0 bench
> {
>     bytes_written: 1073741824,
>     blocksize: 4194304,
>     elapsed_sec: 7.5302066820000002,
>     bytes_per_sec: 142591281.93209398,
>     iops: 33.996410830520148
> }
> root@pve-00-005:~#
> 
> That is, with the cache after all occupied, the bcache can write with only 21% of the performance obtained with the newly formatted cache. It doesn't look like we're talking about exactly the same hardware... Same NVME, same processors, same RAM, same server, same OS, same bcache settings..... If you format the cache, it returns to the original performance.
> 
> I'm looking at the bcache source code to see if I can pick up anything that might be useful to me. But the code is big and complex. I confess that it is not quick to understand.
> 
> I created a little C program to try and call a built-in bcache function for testing, but I spent Sunday and couldn't even compile the program. It is funny.
> 
> But what would the garbage collector be in this case? What I understand is that the "garbage" would be parts of buckets (blocks) that would not have been reused and were "lost" outside the c->free list and also outside the free_inc list. I think that would help yes, but maybe in a very limited way. Is this the condition of most buckets that are in use?
> 
> As it seems to me (I could be talking nonsense), what would solve the problem would be to get bcache to allocate an adequate amount of buckets in the c->free list. I see this being mentioned in bcache/alloc.c
> 
> Would it be through invalidate_buckets(ca) called through the bch_allocator_thread(void *arg) thread? I don't know. What is limiting the action of this thread? I could not understand.
> 
> But here in my anxious ignorance, I'm left thinking maybe this was the way, a way to call this function to invalidate many clean buckets in the lru order and discard them. So I looked for an external interface that calls it, but I didn't find it.
> 
> Thank you very much!
> 
> Em domingo, 9 de abril de 2023 às 13:37:32 BRT, Coly Li <colyli@suse.de> escreveu: 
> 
> 
> 
> 
> 
> 
> 
> > 2023年4月6日 03:31,Adriano Silva <adriano_da_silva@yahoo.com.br> 写道:
> > 
> > Hello Coly.
> > 
> > Yes, the server is always on. I allowed it to stay on for more than 24 hours with zero disk I/O to the bcache device. The result is that there are no movements on the cache or data disks, nor on the bcache device as we can see:
> > 
> > root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> > --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
> >  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
> >  54k  154k: 301k  221k: 223k  169k|0.67  0.54 :6.99  20.5 :6.77  12.3 |05-04 14:45:50
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:51
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:52
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:53
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:54
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:55
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:56
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:57
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:58
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:59
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:00
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:01
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:02
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:03
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:04
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:05
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:06
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:07
> > 
> > It can stay like that for hours without showing any, zero data flow, either read or write on any of the devices.
> > 
> > root@pve-00-005:~# cat /sys/block/bcache0/bcache/state
> > clean
> > root@pve-00-005:~#
> > 
> > But look how strange, in another command (priority_stats), it shows that there is still 1% of dirt in the cache. And 0% unused cache space. Even after hours of server on and completely idle:
> > 
> > root@pve-00-005:~# cat /sys/devices/pci0000:80/0000:80:01.1/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/bcache/priority_stats
> > Unused:        0%
> > Clean:          98%
> > Dirty:          1%
> > Metadata:      0%
> > Average:        1137
> > Sectors per Q:  36245232
> > Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> > 
> > Why is this happening?
> > 
> >> Can you try to write 1 to cache set sysfs file 
> >> gc_after_writeback? 
> >> When it is set, a gc will be waken up automatically after 
> >> all writeback accomplished. Then most of the clean cache 
> >> might be shunk and the B+tree nodes will be deduced 
> >> quite a lot.
> > 
> > Would this be the command you ask me for?
> > 
> > root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
> > 
> > If this command is correct, I already advance that it did not give the expected result. The Cache continues with 100% of the occupied space. Nothing has changed despite the cache being cleaned and having written the command you recommended. Let's see:
> > 
> > root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
> > Unused:        0%
> > Clean:          98%
> > Dirty:          1%
> > Metadata:      0%
> > Average:        1137
> > Sectors per Q:  36245232
> > Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> > 
> > But if there was any movement on the disks after the command, I couldn't detect it:
> > 
> > root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> > --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
> >  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
> >  54k  153k: 300k  221k: 222k  169k|0.67  0.53 :6.97  20.4 :6.76  12.3 |05-04 15:28:57
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:58
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:59
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:00
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:01
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:02
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:03
> >    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:04^C
> > root@pve-00-005:~#
> > 
> > Why were there no changes?
> 
> Thanks for the above information. The result is unexpected to me. Let me check whether the B+tree nodes are not being shrunk; that is something that should be improved. And when write-erase time matters for write requests, it is normally because a heavy write load is arriving. In such a situation, the LBAs of the collected buckets may be allocated again very soon, even before the SSD controller has finished the internal erasure hinted by the discard/trim. Therefore issuing a discard/trim right before writing to that LBA does not help write performance at all, and just puts extra, unnecessary work on the SSD controller.
> 
> And for today's SATA/NVMe SSDs, with the workload you describe above, the write performance drawback can almost be ignored.
> 
> > 
> >> Currently there is no such option for limit bcache 
> >> in-memory B+tree nodes cache occupation, but when I/O 
> >> load reduces, such memory consumption may drop very 
> >> fast by the reaper from system memory management 
> >> code. So it won’t be a problem. Bcache will try to use any 
> >> possible memory for B+tree nodes cache if it is 
> >> necessary, and throttle I/O performance to return these 
> >> memory back to memory management code when the 
> >> available system memory is low. By default, it should 
> >> work well and nothing should be done from user.
> > 
> > I've been following the server's operation a lot and I've never seen less than 50 GB of free RAM memory. Let's see: 
> > 
> > root@pve-00-005:~# free              total        used        free      shared  buff/cache  available
> > Mem:      131980688    72670448    19088648      76780    40221592    57335704
> > Swap:              0          0          0
> > root@pve-00-005:~#
> > 
> > There is always plenty of free RAM, which makes me ask: Could there really be a problem related to a lack of RAM?
> 
> No, this is not because of insufficient memory. From your information the memory is enough.
> 
> > 
> >> Bcache doesn’t issue trim request proactively. 
> >> [...]
> >> In run time, bcache code only forward the trim request to backing device (not cache device).
> > 
> > Wouldn't it be advantageous if bcache sent TRIM (discard) to the cache temporarily? I believe flash drives (SSD or NVMe) that need TRIM to maintain top performance are typically used as a cache for bcache. So, I think that if the TRIM command was used regularly by bcache, in the background (only for clean and free buckets), with a controlled frequency, or even if executed by a manually triggered by the user background task (always only for clean and free buckets), it could help to reduce the write latency of the cache. I believe it would help the writeback efficiency a lot. What do you think about this?
> 
> There was such attempt but indeed doesn’t help at all. The reason is, bcache can only know which bucket can be discarded when it is handled by garbage collection. 
> 
> 
> > 
> > Anyway, this issue of the free buckets not appearing is keeping me awake at night. Could it be a problem with my Kernel version (Linux 5.15)?
> > 
> > As I mentioned before, I saw in the bcache documentation (https://docs.kernel.org/admin-guide/bcache.html) a variable (freelist_percent) that was supposed to control a minimum rate of free buckets. Could it be a solution? I don't know. But in practice, I didn't find this variable in my system (could it be because of the OS version?)
> 
> Let me look into this…
> 
> 
> Thanks.
> 
> Coly Li
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-05-02 20:34                             ` Eric Wheeler
@ 2023-05-04  4:56                               ` Coly Li
  2023-05-04 14:34                                 ` Adriano Silva
  2023-05-09  0:42                                 ` Eric Wheeler
  0 siblings, 2 replies; 28+ messages in thread
From: Coly Li @ 2023-05-04  4:56 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Adriano Silva, Bcache Linux, Martin McClure



> 2023年5月3日 04:34,Eric Wheeler <bcache@lists.ewheeler.net> 写道:
> 
> On Thu, 20 Apr 2023, Adriano Silva wrote:
>> I continue to investigate the situation. There is actually a performance 
>> gain when the bcache device is only half filled versus full. There is a 
>> reduction and greater stability in the latency of direct writes and this 
>> improves my scenario.
> 
> Hi Coly, have you been able to look at this?
> 
> This sounds like a great optimization and Adriano is in a place to test 
> this now and report his findings.
> 
> I think you said this should be a simple hack to add early reclaim, so 
> maybe you can throw a quick patch together (even a rough first-pass with 
> hard-coded reclaim values)
> 
> If we can get back to Adriano quickly then he can test while he has an 
> easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> users.

My current to-do list is a little bit long. Yes, I would like to do this and plan to, but I cannot estimate the response time.

Coly Li

[snipped]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-05-04  4:56                               ` Coly Li
@ 2023-05-04 14:34                                 ` Adriano Silva
  2023-05-09  0:29                                   ` Eric Wheeler
  2023-05-09  0:42                                 ` Eric Wheeler
  1 sibling, 1 reply; 28+ messages in thread
From: Adriano Silva @ 2023-05-04 14:34 UTC (permalink / raw)
  To: Eric Wheeler, Coly Li; +Cc: Bcache Linux, Martin McClure

Hi Coly,

If I can help you with anything, please let me know.

Thanks!


Guys, can I take advantage and ask one more question? If you prefer, I'll open another thread, but since it is related to the subject discussed here, I'll ask it here for now.

For now, I decided to write a bash script that changes the cache parameters as a temporary, manual workaround for the issue on at least one of my clusters.

So: in production I use cache_mode writeback and writeback_percent set to 2 (I think a low value is safer and gives a faster flush, and keeping it at 10 hasn't shown better performance in my case; am I wrong?). I keep discard set to false, since discarding each bucket as it is modified is slow (I believe the discard would need to be done in large batches of free buckets). I set sequential_cutoff to 0 (zero) because with the bluestore backend (from Ceph) any other value seems to make bcache treat everything as sequential and bypass it to the backing disk. I also set congested_read_threshold_us and congested_write_threshold_us to 0 (zero), as that seems to give slightly better performance and lower latency. I always set rotational to 1 and never change it; people say it works better for Ceph, and I have used it that way ever since. I apply these parameters at system startup.

So, at 01:00 I run a bash script that changes these parameters in order to drain the cache before I back up my databases and other data. First I set writeback_percent to 0 (zero) so that all the dirty data is flushed from the cache, and I keep checking the status until it reports "clean". I then switch cache_mode to writethrough.
Next I confirm that the cache is still "clean". Once it is "clean", I change cache_mode to "none", and then comes the following line:

echo $cache_cset > /sys/block/$bcache_device/bcache/cache/unregister

Here ends the script that runs at 01:00 am.
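
In rough form, that 01:00 routine looks something like the sketch below (the device names and the 10-second polling interval are just examples from my test setup, adjust them for yours):

          #!/bin/bash
          # Rough sketch of the 01:00 drain routine described above.
          bcache_device=bcache0
          cache_device=nvme0n1p1
          cache_cset=$(bcache-super-show /dev/$cache_device | grep cset | awk '{ print $2 }')

          # Ask the writeback thread to flush all dirty data.
          echo 0 > /sys/block/$bcache_device/bcache/writeback_percent

          # Poll until the backing device reports a clean state.
          until [ "$(cat /sys/block/$bcache_device/bcache/state)" = "clean" ]; do
                  sleep 10
          done

          echo writethrough > /sys/block/$bcache_device/bcache/cache_mode

          # Confirm it is still clean before dropping the cache entirely.
          [ "$(cat /sys/block/$bcache_device/bcache/state)" = "clean" ] || exit 1

          echo none > /sys/block/$bcache_device/bcache/cache_mode
          echo $cache_cset > /sys/block/$bcache_device/bcache/cache/unregister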

Then I perform my backups without all of that read data passing through, and being written into, my cache. (Am I thinking about this correctly?)

Users will continue to use the system normally, though it will be slower because the Ceph OSD will be running on top of the bcache device without a cache. But lower performance during that window is acceptable in my case.

After the backup is complete, at 05:00 am I run the following sequence:

          wipefs -a /dev/nvme0n1p1
          sleep 1
          blkdiscard /dev/nvme0n1p1
          sleep 1
          makebcache=$(make-bcache --wipe-bcache -w 4k --bucket 256K -C /dev/$cache_device)
          sleep 1
          cache_cset=$(bcache-super-show /dev/$cache_device | grep cset | awk '{ print $2 }')
          echo $cache_cset > /sys/block/bcache0/bcache/attach

One thing to point out here is the bucket size I use (256K), which I chose based on the performance tests I ran. While I didn't notice any big performance differences during those tests, 256K seemed to be the smallest well-performing bucket size for my NVMe device, an enterprise drive with a non-volatile cache. However, I have no information about its minimum erase block size; I could not find it anywhere, the manufacturer would not tell me, and the nvme-cli tool didn't show it either. Would 256K really be a good number to use?

Anyway, after attaching the cache again, I return the parameters to what I have been using in production:

          echo writeback > /sys/block/$bcache_device/bcache/cache_mode
          echo 1 > /sys/devices/virtual/block/$bcache_device/queue/rotational
          echo 1 > /sys/fs/bcache/$cache_cset/internal/gc_after_writeback
          echo 1 > /sys/block/$bcache_device/bcache/cache/internal/trigger_gc
          echo 2 > /sys/block/$bcache_device/bcache/writeback_percent
          echo 0 > /sys/fs/bcache/$cache_cset/cache0/discard
          echo 0 > /sys/block/$bcache_device/bcache/sequential_cutoff
          echo 0 > /sys/fs/bcache/$cache_cset/congested_read_threshold_us
          echo 0 > /sys/fs/bcache/$cache_cset/congested_write_threshold_us

I created the scripts in a test environment and they seem to have worked as expected.

My question: would this be a reasonable way to work around the problem temporarily, as a palliative? Is it safe to do this with a mounted file system, with files in use by users and databases running? Are there bigger risks involved in putting this into production? Do you see any problems, or anything that should be done differently?

Thanks!



On Thursday, May 4, 2023 at 01:56:23 BRT, Coly Li <colyli@suse.de> wrote:

> On May 3, 2023, at 04:34, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> 
> On Thu, 20 Apr 2023, Adriano Silva wrote:
>> I continue to investigate the situation. There is actually a performance 
>> gain when the bcache device is only half filled versus full. There is a 
>> reduction and greater stability in the latency of direct writes and this 
>> improves my scenario.
> 
> Hi Coly, have you been able to look at this?
> 
> This sounds like a great optimization and Adriano is in a place to test 
> this now and report his findings.
> 
> I think you said this should be a simple hack to add early reclaim, so 
> maybe you can throw a quick patch together (even a rough first-pass with 
> hard-coded reclaim values)
> 
> If we can get back to Adriano quickly then he can test while he has an 
> easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> users.

My current to-do list on hand is a little bit long. Yes I’d like and plan to do it, but the response time cannot be estimated.


Coly Li


[snipped]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-05-04 14:34                                 ` Adriano Silva
@ 2023-05-09  0:29                                   ` Eric Wheeler
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Wheeler @ 2023-05-09  0:29 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Coly Li, Bcache Linux, Martin McClure

[-- Attachment #1: Type: text/plain, Size: 7703 bytes --]

On Thu, 4 May 2023, Adriano Silva wrote:

> Hi Coly,
> 
> If I can help you with anything, please let me know.
> 
> Thanks!
> 
> 
> Guys, can I take advantage and ask one more question? If you prefer, 
> I'll open another topic, but as it has something to do with the subject 
> discussed here, I'll ask for now right here.
> 
> I decided to make (for now) a bash script to change the cache parameters 
> trying a temporary workaround to solve the issue manually in at least 
> one of my clusters.
> 
> So: I use in production cache_mode as writeback, writeback_percent to 2 
> (I think low is safer and faster for a flush, while staying at 10 hasn't 
> shown better performance in my case - i am wrong?). I use discard as 
> false, as it is slow to discard each bucket that is modified (I believe 
> the discard would need to be by large batches of free buckets). I use 0 
> (zero) in sequence_cutoff because using the bluestore file system (from 
> ceph), it seems to me that using any other value in this variable, 
> bcache understands everything as sequential and bypasses it to the back 
> disk. I also use congested_read_threshold_us and 
> congested_write_threshold_us to 0 (zero) as it seems to give slightly 
> better performance, lower latency. I always use rotational as 1, never 
> change it. They always say that for Ceph it works better, I've been 
> using it ever since. I put these parameters at system startup.
> 
> So, I decided at 01:00 that I'm going to run a bash script to change 
> these parameters in order to clear the cache and use it to back up my 
> data from databases and others. So, I change writeback_percent to 0 
> (zero) for it to clean all the dirt from the cache. Then I keep checking 
> the status until it's "cleared". I then pass the cache_mode to 
> writethrough. In the sequence I confirm if the cache remains "clean". 
> Being "clean", I change cache_mode to "none" and then comes the 
> following line:
> 
> echo $cache_cset > /sys/block/$bcache_device/bcache/cache/unregister

These are my notes for hot-removal of a cache.  You need to detach and 
then unregister:

~] bcache-super-show /dev/sdz1 # cache device
[...]
cset.uuid		3db83b23-1af7-43a6-965d-c277b402b16a

~] echo 3db83b23-1af7-43a6-965d-c277b402b16a > /sys/block/bcache0/bcache/detach
~] watch cat /sys/block/bcache0/bcache/dirty_data # wait until 0
~] echo 1 > /sys/fs/bcache/3db83b23-1af7-43a6-965d-c277b402b16a/unregister


> Here ends the script that runs at 01:00 am.
> 
> So, then I perform backups of my data, without the reading of this data 
> going through and being all written in my cache. (Am I thinking 
> correctly?)
> 
> Users will continue to use the system normally, however the system will 
> be slower because the Ceph OSD will be working on top of the bcache 
> device without having a cache. But a lower performance at that time, for 
> my case, is acceptable at that time.
> 
> After the backup is complete, at 05:00 am I run the following sequence:
> 
>           wipefs -a /dev/nvme0n1p1
>           sleep 1
>           blkdiscard /dev/nvme0n1p1
>           sleep 1
>           makebcache=$(make-bcache --wipe-bcache -w 4k --bucket 256K -C /dev/$cache_device)
>           sleep 1
>           cache_cset=$(bcache-super-show /dev/$cache_device | grep cset | awk '{ print $2 }')
>           echo $cache_cset > /sys/block/bcache0/bcache/attach
> 
> One thing to point out here is the size of the bucket I use (256K) which 
> I defined according to the performance tests I did.

How do you test performance? Is it automated?

> While I didn't notice any big performance differences during these 
> tests, I thought 256K was the best performing smallest block I got with 
> my NVMe device, which is an enterprise device (with non-volatile cache), 
> but I don't have information about the size minimum erasure block. I did 
> not find this information about the smallest erase block of this device 
> anywhere. I looked in several ways, the manufacturer didn't inform me, 
> the nvme-cli tool didn't show me either. Would 256 really be a good 
> number to use?
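
One thing you could try (no guarantees, and I'm only guessing that your 
drive exposes it): NVMe 1.4+ devices can report a preferred write 
granularity/alignment in the namespace data, which is at least a hint at 
the internal geometry even when the vendor won't document the erase block.  
Many drives leave these fields at zero, and older nvme-cli builds don't 
decode them, so treat it as a hint only:

~] nvme id-ns -H /dev/nvme0n1 | grep -iE 'npwg|npwa|npdg|npda'
    # npwg/npwa: preferred write granularity/alignment
    # npdg/npda: preferred deallocate granularity/alignment (in logical blocks)

If those come back nonzero, a bucket size that is a multiple of the 
preferred write granularity is probably a sensible starting point.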

If you can automate performance testing, then you could use something like 
Simplex [1] to optimize the following values for your infrastructure:

	- bucket size # `make-bcache --bucket N`
	- /sys/block/<nvme>/queue/nr_requests	
	- /sys/block/<nvme>/queue/io_poll_delay
	- /sys/block/<nvme>/queue/max_sectors_kb
	- /sys/block/<bdev>/queue/nr_requests
	- other IO tunables? maybe: /sys/block/bcache0/bcache/
		writeback_percent
		writeback_rate
		writeback_delay

	Selfless plug: I've always wanted to tune Linux with Simplex, but 
	haven't gotten to it: [1] https://metacpan.org/pod/PDL::Opt::Simplex::Simple
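
	Even without a real optimizer, a crude sweep shows the shape of the 
	curve.  This is a rough sketch only: the fio job, runtime, and device 
	paths are just examples, it writes directly to the device (so never 
	point it at a disk holding data you care about), and it assumes a 
	reasonably recent fio with JSON clat_ns output plus jq:

	#!/bin/bash
	# Sweep one tunable, record mean 4k random-write completion latency.
	dev=/dev/bcache0    # test device: data on it will be destroyed
	for kb in 128 256 512 1024; do
		echo $kb > /sys/block/nvme0n1/queue/max_sectors_kb
		lat=$(fio --name=sweep --filename=$dev --rw=randwrite --bs=4k \
			--iodepth=32 --direct=1 --runtime=60 --time_based \
			--ioengine=libaio --output-format=json \
			| jq '.jobs[0].write.clat_ns.mean')
		echo "max_sectors_kb=$kb mean_write_clat_ns=$lat"
	done

	Feed those numbers to whatever search you like (Simplex, or just 
	eyeballing a table) and repeat per tunable.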
 
> Anyway, after attaching the cache again, I return the parameters to what 
> I have been using in production:
> 
>           echo writeback > /sys/block/$bcache_device/bcache/cache_mode
>           echo 1 > /sys/devices/virtual/block/$bcache_device/queue/rotational
>           echo 1 > /sys/fs/bcache/$cache_cset/internal/gc_after_writeback
>           echo 1 > /sys/block/$bcache_device/bcache/cache/internal/trigger_gc
>           echo 2 > /sys/block/$bcache_device/bcache/writeback_percent
>           echo 0 > /sys/fs/bcache/$cache_cset/cache0/discard
>           echo 0 > /sys/block/$bcache_device/bcache/sequential_cutoff
>           echo 0 > /sys/fs/bcache/$cache_cset/congested_read_threshold_us
>           echo 0 > /sys/fs/bcache/$cache_cset/congested_write_threshold_us
>
> 
> I created the scripts in a test environment and it seems to have worked as expected.

Looks good.  I might try those myself...
 
> My question: Would it be a correct way to temporarily solve the problem 
> as a palliative? 

If it works, then I don't see a problem except that you are evicting your 
cache.  

> Is it safe to do it this way with a mounted file 
> system, with files in use by users and databases in working order? 

Yes (at least by design, it is safe to detach bcache cdevs).  We have done 
it many times in production.

> Are there greater risks involved in putting this into production? Do you 
> see any problems or anything that could be different?

Only that you are turning on options that have not been tested much, 
simply because they are not default.  You might hit a bug... but if so, 
then report it to be fixed!

-Eric

> 
> Thanks!
> 
> 
> 
> On Thursday, May 4, 2023 at 01:56:23 BRT, Coly Li <colyli@suse.de> wrote:
> 
> > On May 3, 2023, at 04:34, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > 
> > On Thu, 20 Apr 2023, Adriano Silva wrote:
> >> I continue to investigate the situation. There is actually a performance 
> >> gain when the bcache device is only half filled versus full. There is a 
> >> reduction and greater stability in the latency of direct writes and this 
> >> improves my scenario.
> > 
> > Hi Coly, have you been able to look at this?
> > 
> > This sounds like a great optimization and Adriano is in a place to test 
> > this now and report his findings.
> > 
> > I think you said this should be a simple hack to add early reclaim, so 
> > maybe you can throw a quick patch together (even a rough first-pass with 
> > hard-coded reclaim values)
> > 
> > If we can get back to Adriano quickly then he can test while he has an 
> > easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> > users.
> 
> My current to-do list on hand is a little bit long. Yes I’d like and plan to do it, but the response time cannot be estimated.
> 
> 
> Coly Li
> 
> 
> [snipped]
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-05-04  4:56                               ` Coly Li
  2023-05-04 14:34                                 ` Adriano Silva
@ 2023-05-09  0:42                                 ` Eric Wheeler
  2023-05-09  2:21                                   ` Adriano Silva
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Wheeler @ 2023-05-09  0:42 UTC (permalink / raw)
  To: Coly Li; +Cc: Adriano Silva, Bcache Linux, Martin McClure

[-- Attachment #1: Type: text/plain, Size: 2268 bytes --]

On Thu, 4 May 2023, Coly Li wrote:
> > On May 3, 2023, at 04:34, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > 
> > On Thu, 20 Apr 2023, Adriano Silva wrote:
> >> I continue to investigate the situation. There is actually a performance 
> >> gain when the bcache device is only half filled versus full. There is a 
> >> reduction and greater stability in the latency of direct writes and this 
> >> improves my scenario.
> > 
> > Hi Coly, have you been able to look at this?
> > 
> > This sounds like a great optimization and Adriano is in a place to test 
> > this now and report his findings.
> > 
> > I think you said this should be a simple hack to add early reclaim, so 
> > maybe you can throw a quick patch together (even a rough first-pass with 
> > hard-coded reclaim values)
> > 
> > If we can get back to Adriano quickly then he can test while he has an 
> > easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> > users.
> 
> My current to-do list on hand is a little bit long. Yes I’d like and 
> plan to do it, but the response time cannot be estimated.

I understand.  Maybe I can put something together if you can provide some 
pointers since you are _the_ expert on bcache these days.  Here are a few 
questions:

Q's for Coly:

- It looks like it could be a simple change to bch_allocator_thread().  
  Is this the right place? 
  https://elixir.bootlin.com/linux/v6.3-rc5/source/drivers/md/bcache/alloc.c#L317
    - On alloc.c:332
	if (!fifo_pop(&ca->free_inc, bucket))
      does it just need to be modified to something like this:
	if (!fifo_pop(&ca->free_inc, bucket) || 
		total_unused_cache_percent() < 20)
      if so, where does bcache store the concept of "Total Unused Cache" ?

- If I'm going about it wrong above, then where is the code path in bcache 
  that frees a bucket such that it is completely unused (ie, as it was
  after `make-bcache -C`?)


Q's Adriano:

Where did you get these cache details from your earlier post?  In /sys 
somewhere, probably, but I didn't find them:

	Total Cache Size 553.31GiB
	Total Cache Used 547.78GiB (99%)
	Total Unused Cache 5.53GiB (1%)
	Dirty Data 0B (0%)
	Evictable Cache 503.52GiB (91%)



--
Eric Wheeler



> 
> Coly Li
> 
> [snipped]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-05-09  0:42                                 ` Eric Wheeler
@ 2023-05-09  2:21                                   ` Adriano Silva
  2023-05-11 23:10                                     ` Eric Wheeler
  0 siblings, 1 reply; 28+ messages in thread
From: Adriano Silva @ 2023-05-09  2:21 UTC (permalink / raw)
  To: Coly Li, Eric Wheeler; +Cc: Bcache Linux, Martin McClure

Thanks.

Hi Guys,

Eric:

I got the parameters with this script, although I also checked /sys; doing the math, everything adds up.

https://gist.github.com/damoxc/6267899


Thanks.


On Monday, May 8, 2023 at 21:42:26 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote:

On Thu, 4 May 2023, Coly Li wrote:
> > On May 3, 2023, at 04:34, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > 
> > On Thu, 20 Apr 2023, Adriano Silva wrote:
> >> I continue to investigate the situation. There is actually a performance 
> >> gain when the bcache device is only half filled versus full. There is a 
> >> reduction and greater stability in the latency of direct writes and this 
> >> improves my scenario.
> > 
> > Hi Coly, have you been able to look at this?
> > 
> > This sounds like a great optimization and Adriano is in a place to test 
> > this now and report his findings.
> > 
> > I think you said this should be a simple hack to add early reclaim, so 
> > maybe you can throw a quick patch together (even a rough first-pass with 
> > hard-coded reclaim values)
> > 
> > If we can get back to Adriano quickly then he can test while he has an 
> > easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> > users.
> 
> My current to-do list on hand is a little bit long. Yes I’d like and 
> plan to do it, but the response time cannot be estimated.

I understand.  Maybe I can put something together if you can provide some 
pointers since you are _the_ expert on bcache these days.  Here are a few 
questions:

Q's for Coly:

- It looks like it could be a simple change to bch_allocator_thread().  
  Is this the right place? 
  https://elixir.bootlin.com/linux/v6.3-rc5/source/drivers/md/bcache/alloc.c#L317
    - On alloc.c:332
    if (!fifo_pop(&ca->free_inc, bucket))
      does it just need to be modified to something like this:
    if (!fifo_pop(&ca->free_inc, bucket) || 
        total_unused_cache_percent() < 20)
      if so, where does bcache store the concept of "Total Unused Cache" ?

- If I'm going about it wrong above, then where is the code path in bcache 
  that frees a bucket such that it is completely unused (ie, as it was
  after `make-bcache -C`?)


Q's Adriano:

Where did you get these cache details from your earlier post?  In /sys 
somewhere, probably, but I didn't find them:

    Total Cache Size 553.31GiB
    Total Cache Used 547.78GiB (99%)
    Total Unused Cache 5.53GiB (1%)
    Dirty Data 0B (0%)
    Evictable Cache 503.52GiB (91%)




--
Eric Wheeler



> 
> Coly Li
> 
> [snipped]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-05-09  2:21                                   ` Adriano Silva
@ 2023-05-11 23:10                                     ` Eric Wheeler
  2023-05-12  5:13                                       ` Coly Li
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Wheeler @ 2023-05-11 23:10 UTC (permalink / raw)
  To: Coly Li; +Cc: Adriano Silva, Bcache Linux, Martin McClure

[-- Attachment #1: Type: text/plain, Size: 4078 bytes --]

On Tue, 9 May 2023, Adriano Silva wrote:
> I got the parameters with this script, although I also checked / sys, doing the math everything is right.
> 
> https://gist.github.com/damoxc/6267899

Thanks.  prio_stats gives me what I'm looking for.  More below.
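
For anyone following along, the same breakdown can be read straight from 
sysfs; the per-cache priority_stats file reports the unused, dirty, and 
metadata percentages plus priority quantiles (the exact fields vary a 
little by kernel version):

~] cat /sys/fs/bcache/<cset-uuid>/cache0/priority_stats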
 
> On Monday, May 8, 2023 at 21:42:26 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> On Thu, 4 May 2023, Coly Li wrote:
> > > On May 3, 2023, at 04:34, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > > 
> > > On Thu, 20 Apr 2023, Adriano Silva wrote:
> > >> I continue to investigate the situation. There is actually a performance 
> > >> gain when the bcache device is only half filled versus full. There is a 
> > >> reduction and greater stability in the latency of direct writes and this 
> > >> improves my scenario.
> > > 
> > > Hi Coly, have you been able to look at this?
> > > 
> > > This sounds like a great optimization and Adriano is in a place to test 
> > > this now and report his findings.
> > > 
> > > I think you said this should be a simple hack to add early reclaim, so 
> > > maybe you can throw a quick patch together (even a rough first-pass with 
> > > hard-coded reclaim values)
> > > 
> > > If we can get back to Adriano quickly then he can test while he has an 
> > > easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> > > users.
> > 
> > My current to-do list on hand is a little bit long. Yes I’d like and 
> > plan to do it, but the response time cannot be estimated.
> 
> I understand.  Maybe I can put something together if you can provide some 
> pointers since you are _the_ expert on bcache these days.  Here are a few 
> questions:
> 
> Q's for Coly:


It _looks_ like bcache only frees buckets until the `ca->free_inc` list is 
full, but it could go further.  Consider this hypothetical:

https://elixir.bootlin.com/linux/v6.4-rc1/source/drivers/md/bcache/alloc.c#L179

	static void invalidate_buckets_lru(struct cache *ca)
	{
	...
+	int available = 0;
+	mutex_lock(&ca->set->bucket_lock);
+	for_each_bucket(b, ca) {
+		if (!GC_SECTORS_USED(b))
+			unused++;
+		if (GC_MARK(b) == GC_MARK_RECLAIMABLE)
+			available++;
+		if (GC_MARK(b) == GC_MARK_DIRTY)
+			dirty++;
+		if (GC_MARK(b) == GC_MARK_METADATA)
+			meta++;
+	}
+	mutex_unlock(&ca->set->bucket_lock);
+
-	while (!fifo_full(&ca->free_inc)) {
+	while (!fifo_full(&ca->free_inc) || available < TARGET_AVAIL_BUCKETS) {
		if (!heap_pop(&ca->heap, b, bucket_min_cmp)) {
			/*
			 * We don't want to be calling invalidate_buckets()
			 * multiple times when it can't do anything
			 */
			ca->invalidate_needs_gc = 1;
			wake_up_gc(ca->set);
			return;
		}

		bch_invalidate_one_bucket(ca, b);  <<<< this does the work
	}

(TARGET_AVAIL_BUCKETS is a placeholder, ultimately it would be a sysfs 
setting, probably a percentage.)


Coly, would this work?

Can you think of any serious issues with this (besides the fact that for_each_bucket is slow)?



-Eric

> 
> - It looks like it could be a simple change to bch_allocator_thread().  
>   Is this the right place? 
>   https://elixir.bootlin.com/linux/v6.3-rc5/source/drivers/md/bcache/alloc.c#L317
>     - On alloc.c:332
>     if (!fifo_pop(&ca->free_inc, bucket))
>       does it just need to be modified to something like this:
>     if (!fifo_pop(&ca->free_inc, bucket) || 
>         total_unused_cache_percent() < 20)
>       if so, where does bcache store the concept of "Total Unused Cache" ?
> 
> - If I'm going about it wrong above, then where is the code path in bcache 
>   that frees a bucket such that it is completely unused (ie, as it was
>   after `make-bcache -C`?)
> 
> 
> Q's Adriano:
> 
> Where did you get these cache details from your earlier post?  In /sys 
> somewhere, probably, but I didn't find them:
> 
>     Total Cache Size 553.31GiB
>     Total Cache Used 547.78GiB (99%)
>     Total Unused Cache 5.53GiB (1%)
>     Dirty Data 0B (0%)
>     Evictable Cache 503.52GiB (91%)
> 
> 
> 
> 
> --
> Eric Wheeler
> 
> 
> 
> > 
> > Coly Li
> > 
> > [snipped]
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-05-11 23:10                                     ` Eric Wheeler
@ 2023-05-12  5:13                                       ` Coly Li
  2023-05-13 21:05                                         ` Eric Wheeler
  0 siblings, 1 reply; 28+ messages in thread
From: Coly Li @ 2023-05-12  5:13 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Adriano Silva, Bcache Linux, Martin McClure

On Thu, May 11, 2023 at 04:10:51PM -0700, Eric Wheeler wrote:
> On Tue, 9 May 2023, Adriano Silva wrote:
> > I got the parameters with this script, although I also checked / sys, doing the math everything is right.
> > 
> > https://gist.github.com/damoxc/6267899
> 
> Thanks.  prio_stats gives me what I'm looking for.  More below.
>  
> > On Monday, May 8, 2023 at 21:42:26 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > On Thu, 4 May 2023, Coly Li wrote:
> > > > On May 3, 2023, at 04:34, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > > > 
> > > > On Thu, 20 Apr 2023, Adriano Silva wrote:
> > > >> I continue to investigate the situation. There is actually a performance 
> > > >> gain when the bcache device is only half filled versus full. There is a 
> > > >> reduction and greater stability in the latency of direct writes and this 
> > > >> improves my scenario.
> > > > 
> > > > Hi Coly, have you been able to look at this?
> > > > 
> > > > This sounds like a great optimization and Adriano is in a place to test 
> > > > this now and report his findings.
> > > > 
> > > > I think you said this should be a simple hack to add early reclaim, so 
> > > > maybe you can throw a quick patch together (even a rough first-pass with 
> > > > hard-coded reclaim values)
> > > > 
> > > > If we can get back to Adriano quickly then he can test while he has an 
> > > > easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> > > > users.
> > > 
> > > My current to-do list on hand is a little bit long. Yes I’d like and 
> > > plan to do it, but the response time cannot be estimated.
> > 
> > I understand.  Maybe I can put something together if you can provide some 
> > pointers since you are _the_ expert on bcache these days.  Here are a few 
> > questions:
> > 
> > Q's for Coly:
> 
> 
> It _looks_ like bcache frees buckets while the `ca->free_inc` list is 
> full, but it could go further.  Consider this hypothetical:
> 

Hi Eric,

Bcache starts to invalidate buckets when ca->free_inc is not full, and selects the
buckets to be invalidated according to the replacement policy. It then keeps calling
bch_invalidate_one_bucket() and pushing the invalidated buckets into ca->free_inc
until that list is full or there are no more candidate buckets to invalidate.


> https://elixir.bootlin.com/linux/v6.4-rc1/source/drivers/md/bcache/alloc.c#L179
> 
> 	static void invalidate_buckets_lru(struct cache *ca)
> 	{
> 	...
> +	int available = 0;
> +	mutex_lock(&ca->set->bucket_lock);

The mutex_lock()/unlock may introduce a deadlock: before invalidate_buckets() is
called, after allocator_wait() returns, the mutex bucket_lock is already held again.

> +	for_each_bucket(b, ca) {
> +		if (!GC_SECTORS_USED(b))
> +			unused++;
> +		if (GC_MARK(b) == GC_MARK_RECLAIMABLE)
> +			available++;
> +		if (GC_MARK(b) == GC_MARK_DIRTY)
> +			dirty++;
> +		if (GC_MARK(b) == GC_MARK_METADATA)
> +			meta++;
> +	}
> +	mutex_unlock(&ca->set->bucket_lock);
> +
> -	while (!fifo_full(&ca->free_inc)) {
> +	while (!fifo_full(&ca->free_inc) || available < TARGET_AVAIL_BUCKETS) {

If ca->free_inc is full and you still try to invalidate more candidate buckets, the
next bucket selected by heap_pop will be invalidated in
bch_invalidate_one_bucket() and pushed into ca->free_inc. But ca->free_inc is
already full, so the next time invalidate_buckets_lru() is called, this already
invalidated bucket will be accessed and checked again in for_each_bucket(). That is
just a waste of CPU cycles.

Furthermore, __bch_invalidate_one_bucket() will increase the bucket's gen number and
its pin counter. Doing this without pushing the bucket into ca->free_inc makes me
feel uncomfortable.


> 		if (!heap_pop(&ca->heap, b, bucket_min_cmp)) {



> 			/*
> 			 * We don't want to be calling invalidate_buckets()
> 			 * multiple times when it can't do anything
> 			 */
> 			ca->invalidate_needs_gc = 1;
> 			wake_up_gc(ca->set);
> 			return;
> 		}
> 
> 		bch_invalidate_one_bucket(ca, b);  <<<< this does the work
> 	}
> 
> (TARGET_AVAIL_BUCKETS is a placeholder, ultimately it would be a sysfs 
> setting, probably a percentage.)
> 
> 
> Coly, would this work?
> 

It should work for some kinds of workloads, but will draw complaints for other kinds of workloads.

> Can you think of any serious issues with this (besides the fact that for_each_bucket is slow)?
> 

I don't feel this change can help bcache invalidate clean buckets without extra cost.

It is not simple for me to suggest a solution without careful thought; this is a
tradeoff between gain and cost...

Thanks.

[snipped]

-- 
Coly Li

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Writeback cache all used.
  2023-05-12  5:13                                       ` Coly Li
@ 2023-05-13 21:05                                         ` Eric Wheeler
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Wheeler @ 2023-05-13 21:05 UTC (permalink / raw)
  To: Coly Li; +Cc: Adriano Silva, Bcache Linux, Martin McClure

[-- Attachment #1: Type: text/plain, Size: 7024 bytes --]

On Fri, 12 May 2023, Coly Li wrote:
> On Thu, May 11, 2023 at 04:10:51PM -0700, Eric Wheeler wrote:
> > On Tue, 9 May 2023, Adriano Silva wrote:
> > > I got the parameters with this script, although I also checked / sys, doing the math everything is right.
> > > 
> > > https://gist.github.com/damoxc/6267899
> > 
> > Thanks.  prio_stats gives me what I'm looking for.  More below.
> >  
> > > On Monday, May 8, 2023 at 21:42:26 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > > On Thu, 4 May 2023, Coly Li wrote:
> > > > > On May 3, 2023, at 04:34, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > > > > 
> > > > > On Thu, 20 Apr 2023, Adriano Silva wrote:
> > > > >> I continue to investigate the situation. There is actually a performance 
> > > > >> gain when the bcache device is only half filled versus full. There is a 
> > > > >> reduction and greater stability in the latency of direct writes and this 
> > > > >> improves my scenario.
> > > > > 
> > > > > Hi Coly, have you been able to look at this?
> > > > > 
> > > > > This sounds like a great optimization and Adriano is in a place to test 
> > > > > this now and report his findings.
> > > > > 
> > > > > I think you said this should be a simple hack to add early reclaim, so 
> > > > > maybe you can throw a quick patch together (even a rough first-pass with 
> > > > > hard-coded reclaim values)
> > > > > 
> > > > > If we can get back to Adriano quickly then he can test while he has an 
> > > > > easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> > > > > users.
> > > > 
> > > > My current to-do list on hand is a little bit long. Yes I’d like and 
> > > > plan to do it, but the response time cannot be estimated.
> > > 
> > > I understand.  Maybe I can put something together if you can provide some 
> > > pointers since you are _the_ expert on bcache these days.  Here are a few 
> > > questions:
> > > 
> > > Q's for Coly:
> > 
> > 
> > It _looks_ like bcache frees buckets while the `ca->free_inc` list is 
> > full, but it could go further.  Consider this hypothetical:
> > 
> 
> Hi Eric,
> 
> Bcache starts to invalidate bucket when ca->free_inc is full, and selects some
> buckets to be invalidate by the replacement policy. Then continues to call
> bch_invalidate_one_bucket() and pushes the invalidated bucket into ca->free_inc
> until this list is full or no more candidate bucket to invalidate.

Understood.  The goal:  In an attempt to work around Adriano's performance 
issue, we wish to invalidate buckets even after free_inc is full.  If we 
can keep ~20% of buckets unused (ie, !GC_SECTORS_USED(b) ) then I think it 
will fix his issue.  That is the theory we wish to test and validate.

> > https://elixir.bootlin.com/linux/v6.4-rc1/source/drivers/md/bcache/alloc.c#L179
> > 
> > 	static void invalidate_buckets_lru(struct cache *ca)
> > 	{
> > 	...
> 
> the mutex_lock()/unlock may introduce deadlock. Before invadliate_buckets() is
> called, after allocator_wait() returns the mutex lock bucket_lock is held again. 

I see what you mean.  Maybe the bucket lock is already held; if so then I 
don't need to grab it again. For now I've pulled the mutex_lock lines for 
discussion.

We only use for_each_bucket() to get a "fuzzy" count of `available` 
buckets (pseudo code updated below). It doesn't need to be exact.

Here is some cleaned up and concise pseudo code for discussion (I've not 
yet compile tested):

+	int available = 0;
+
+	//mutex_lock(&ca->set->bucket_lock);
+	for_each_bucket(b, ca) {
+		if (GC_MARK(b) == GC_MARK_RECLAIMABLE)
+			available++;
+	}
+	//mutex_unlock(&ca->set->bucket_lock);
+
-	while (!fifo_full(&ca->free_inc)) {
+	while (!fifo_full(&ca->free_inc) || available < TARGET_AVAIL_BUCKETS) {
		...
 		bch_invalidate_one_bucket(ca, b);  <<<< this does the work
+               available++;
	}

Changes from previous post:

  - `available` was not incremented before; now it is, so the loop can 
    terminate.
  - Removed the other counters for clarity, we only care about 
    GC_MARK_RECLAIMABLE for this discussion.
  - Ignore locking for now
  
(TARGET_AVAIL_BUCKETS is a placeholder, ultimately it would be a sysfs 
setting, probably a percentage.)
 
> If ca->free_inc is full, and you still try to invalidate more candidate 
> buckets, the following selected bucket (by the heap_pop) will be 
> invalidate in bch_invalidate_one_bucket() and pushed into ca->free_inc. 
> But now ca->free_inc is full, so next time when invalidate_buckets_lru() 
> is called again, this already invalidated bucket will be accessed and 
> checked again in for_each_bucket(). This is just a waste of CPU cycles.

True. I was aware of this performance issue when I wrote that; bcache 
takes ~1s to iterate for_each_bucket() on my system.  Right now we just 
want to keep ~20% of buckets completely unused and verify 
correctness...and then I can work on hiding the bucket counting overhead 
caused by for_each_bucket().

> Further more, __bch_invalidate_one_bucket() will include the bucket's gen number and
> its pin counter. Doing this without pushing the bucket into ca->free_inc, makes me
> feel uncomfortable.

Questions for my understanding: 

- Is free_inc just a reserve list such that most buckets are in the heap 
  after a fresh `make-bcache -C <cdev>`?

- What is the difference between buckets in free_inc and buckets in the 
  heap? Do they overlap?

I assume you mean this:

	void __bch_invalidate_one_bucket(...) { 
		...
		bch_inc_gen(ca, b);
		b->prio = INITIAL_PRIO;
		atomic_inc(&b->pin);

If I understand the internals of bcache, the `gen` is just a counter that 
increments to make the bucket "newer" than another referenced version.  
Incrementing the `gen` on an unused bucket should be safe, but please 
correct me if I'm wrong here.

I'm not familiar with b->pin; it doesn't appear to be commented in `struct 
bucket`, and I didn't see it used in bch_allocator_push().  

What is b->pin used for?

> > Coly, would this work?
> 
> It should work on some kind of workloads, but will introduce complains for other kind of workloads.

As this is written above, I agree.  Right now I'm just trying to 
understand the code well enough to free buckets preemptively so allocation 
doesn't happen during IO.  For now please ignore the cost of 
for_each_bucket().

> > Can you think of any serious issues with this (besides the fact that 
> > for_each_bucket is slow)?
> > 
> 
> I don't feel this change may help to make bcache invalidate the clean 
> buckets without extra cost.

For the example pseudo code above that is true, and for now I am _not_ 
trying to address performance.
 
> It is not simple for me to tell a solution without careful thought, this 
> is a tradeoff of gain and pay...

Certainly that is the end goal, but first I need to understand the code 
well enough to invalidate buckets down to 20% free and still maintain 
correctness.

Thanks for your help understanding this.

-Eric

> 
> Thanks.
> 
> [snipped]
> 
> -- 
> Coly Li
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-05-13 21:05 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1012241948.1268315.1680082721600.ref@mail.yahoo.com>
2023-03-29  9:38 ` Writeback cache all used Adriano Silva
2023-03-29 19:18   ` Eric Wheeler
2023-03-30  1:38     ` Adriano Silva
2023-03-30  4:55   ` Martin McClure
2023-03-31  0:17     ` Adriano Silva
2023-04-02  0:01       ` Eric Wheeler
2023-04-03  7:14         ` Coly Li
2023-04-03 19:27           ` Eric Wheeler
2023-04-04  8:19             ` Coly Li
2023-04-04 20:29               ` Adriano Silva
2023-04-05 13:57                 ` Coly Li
2023-04-05 19:24                   ` Eric Wheeler
2023-04-05 19:31                   ` Adriano Silva
2023-04-06 21:21                     ` Eric Wheeler
2023-04-07  3:15                       ` Adriano Silva
2023-04-09 16:37                     ` Coly Li
2023-04-09 20:14                       ` Adriano Silva
2023-04-09 21:07                         ` Adriano Silva
2023-04-20 11:35                           ` Adriano Silva
2023-05-02 20:34                             ` Eric Wheeler
2023-05-04  4:56                               ` Coly Li
2023-05-04 14:34                                 ` Adriano Silva
2023-05-09  0:29                                   ` Eric Wheeler
2023-05-09  0:42                                 ` Eric Wheeler
2023-05-09  2:21                                   ` Adriano Silva
2023-05-11 23:10                                     ` Eric Wheeler
2023-05-12  5:13                                       ` Coly Li
2023-05-13 21:05                                         ` Eric Wheeler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.