Re: Large latency with bcache for Ceph OSD

From: "Norman.Kern" <norman.kern@gmx.com>
To: Coly Li <colyli@suse.de>
Cc: linux-block@vger.kernel.org, axboe@kernel.dk,
	linux-bcache@vger.kernel.org
Subject: Re: Large latency with bcache for Ceph OSD
Date: Thu, 25 Feb 2021 21:00:43 +0800
Message-ID: <cfe2746f-18a7-a768-ea72-901793a3133e@gmx.com>
In-Reply-To: <07bcb6c8-21e1-11de-d1f0-ffd417bd36ff@gmx.com>

I made a test:

- Stop writing and wait for the dirty data to be written back

$ lsblk
NAME                                                                                                   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdf                                                                                                      8:80   0   7.3T  0 disk
└─bcache0                                                                                              252:0    0   7.3T  0 disk
  └─ceph--32a481f9--313c--417e--aaf7--bdd74515fd86-osd--data--2f670929--3c8a--45dd--bcef--c60ce3ee08e1 253:1    0   7.3T  0 lvm 
sdd                                                                                                      8:48   0   7.3T  0 disk
sdb                                                                                                      8:16   0   7.3T  0 disk
sdk                                                                                                      8:160  0 893.8G  0 disk
└─bcache0                                                                                              252:0    0   7.3T  0 disk
  └─ceph--32a481f9--313c--417e--aaf7--bdd74515fd86-osd--data--2f670929--3c8a--45dd--bcef--c60ce3ee08e1 253:1    0   7.3T  0 lvm 
$ cat /sys/block/bcache0/bcache/dirty_data
0.0k

root@WXS0106:~# bcache-super-show /dev/sdf
sb.magic                ok
sb.first_sector         8 [match]
sb.csum                 71DA9CA968B4A625 [match]
sb.version              1 [backing device]

dev.label               (empty)
dev.uuid                d07dc435-129d-477d-8378-a6af75199852
dev.sectors_per_block   8
dev.sectors_per_bucket  1024
dev.data.first_sector   16
dev.data.cache_mode     1 [writeback]
dev.data.cache_state    1 [clean]
cset.uuid               d87713c6-2e76-4a09-8517-d48306468659

- Check the available cache

# cat /sys/fs/bcache/d87713c6-2e76-4a09-8517-d48306468659/cache_available_percent
27

As the doc describes:

cache_available_percent
    Percentage of cache device which doesn’t contain dirty data, and could potentially be used for writeback. This doesn’t mean this space isn’t used for clean cached data; the unused statistic (in priority_stats) is typically much lower.
When all dirty data has been written back, why is cache_available_percent not 100?
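
One way to relate this number to the cutoff_writeback_sync value of 70 quoted further down is the sketch below. It rests on an assumption that is not confirmed in this thread: from a reading of should_writeback() in drivers/md/bcache/writeback.h, the in-use percentage is taken to be 100 - cache_available_percent, and writes are taken to bypass the cache once that value is above cutoff_writeback_sync. The helper names cache_in_use and writeback_allowed are hypothetical.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: in_use = 100 - cache_available_percent, and
# writeback caching is skipped when in_use > cutoff_writeback_sync.
# Both helper names are hypothetical, not part of bcache-tools.

# cache_in_use AVAILABLE_PERCENT
# Prints the in-use percentage derived from cache_available_percent.
cache_in_use() {
    echo $((100 - $1))
}

# writeback_allowed AVAILABLE_PERCENT CUTOFF_SYNC
# Prints "no" when the derived in-use value is above the cutoff, else "yes".
writeback_allowed() {
    if [ "$(cache_in_use "$1")" -gt "$2" ]; then
        echo no
    else
        echo yes
    fi
}
```

With the values in this mail, "writeback_allowed 27 70" prints "no" under that assumption, because 100 - 27 = 73 is above 70. The live inputs could be read from cache_available_percent and internal/cutoff_writeback_sync under /sys/fs/bcache/<cset.uuid>/.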

And when I start write I/O, the new writes don't replace the clean cache (does it think the cache is dirty now?), so they go to the HDD with large latency:

./bin/iosnoop -Q -d '8,80'

<...>        73338  WS   8,80     3513701472   4096     217.69
<...>        73338  WS   8,80     3513759360   4096     448.80
<...>        73338  WS   8,80     3562211912   4096     511.69
<...>        73335  WS   8,80     3562212528   4096     505.08
<...>        73339  WS   8,80     3562213376   4096     501.19
<...>        73336  WS   8,80     3562213992   4096     511.16
<...>        73343  WS   8,80     3562214016   4096     511.74
<...>        73340  WS   8,80     3562214128   4096     512.95
<...>        73329  WS   8,80     3562214208   4096     510.48
<...>        73338  WS   8,80     3562214600   4096     518.64
<...>        73341  WS   8,80     3562214632   4096     519.09
<...>        73342  WS   8,80     3562214664   4096     518.28
<...>        73336  WS   8,80     3562214688   4096     519.27
<...>        73343  WS   8,80     3562214736   4096     528.31
<...>        73339  WS   8,80     3562214784   4096     530.13
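
To put numbers on a trace like the one above, the WS lines can be summarized with awk. This is a sketch under assumptions: it expects the iosnoop column layout shown here (TYPE is the fifth field from the end, LATms is the last field), and summarize_ws is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only. summarize_ws is a hypothetical helper.
# Reads iosnoop output on stdin and prints the count, mean and maximum
# latency (ms) of synchronous writes (TYPE == "WS").
summarize_ws() {
    awk 'NF >= 5 && $(NF-4) == "WS" {
             n++                      # number of WS records seen
             sum += $NF               # LATms is the last column
             if ($NF + 0 > max) max = $NF + 0
         }
         END {
             if (n) printf "n=%d avg=%.2f max=%.2f\n", n, sum / n, max
             else   print "n=0"
         }'
}
```

It could be fed from a bounded run, for example ./bin/iosnoop -Q -d '8,80' 10 | summarize_ws, assuming the installed iosnoop accepts a trailing duration argument. Header and blank lines are skipped because they never carry WS in the TYPE position.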

On 2021/2/25 10:23 AM, Norman.Kern wrote:
> On 2021/2/24 4:52 PM, Coly Li wrote:
>> On 2/22/21 7:48 AM, Norman.Kern wrote:
>>> Ping.
>>>
>>> I'm confused about SYNC I/O on bcache. Why must SYNC I/O be written back
>>> for a persistent cache? It can cause some latency.
>>>
>>> @Coly, can you help explain why bcache handles O_SYNC like this?
>>>
>>>
>> Hmm, normally we won't observe the application issuing I/Os on backing
>> device except for,
>> - I/O bypass by SSD congestion
>> - Sequential I/O request
>> - Dirty buckets exceeds the cutoff threshold
>> - Write through mode
>>
>> Do you set the write/read congestion threshold to 0 ?
> Thanks for your reply.
>
> I have set the thresholds to zero. All configs:
>
> #make-bcache -C -b 4m -w 4k --discard --cache_replacement_policy=lru /dev/sdm
> #make-bcache -B --writeback -w 4KiB /dev/sdn --wipe-bcache
> congested_read_threshold_us = 0
> congested_write_threshold_us = 0
>
> # I tried to set sequential_cutoff to 0, but it didn't solve it.
>
> sequential_cutoff = 4194304
> writeback_percent = 40
> cache_mode = writeback
>
> I renewed the cluster, ran it for hours, and reproduced the problem. I checked the cache status:
>
> root@WXS0106:/root/perf-tools# cat /sys/fs/bcache/d87713c6-2e76-4a09-8517-d48306468659/cache_available_percent
> 29
> root@WXS0106:/root/perf-tools# cat /sys/fs/bcache/d87713c6-2e76-4a09-8517-d48306468659/internal/cutoff_writeback_sync
> 70
> Did 'Dirty buckets exceeds the cutoff threshold' cause the problem? Are my configs wrong, or is there another reason?
>
>> Coly Li
>>
>>> On 2021/2/18 3:56 PM, Norman.Kern wrote:
>>>> Hi guys,
>>>>
>>>> I am testing Ceph with bcache. I found that some I/O with O_SYNC is written back
>>>> to the HDD, which causes large latency on the HDD. I traced the I/O with iosnoop:
>>>>
>>>> ./iosnoop  -Q -ts -d '8,192'
>>>>
>>>> Tracing block I/O for 1 seconds (buffered)...
>>>> STARTs          ENDs            COMM         PID    TYPE DEV      BLOCK        BYTES     LATms
>>>> 1809296.292350  1809296.319052  tp_osd_tp    22191  R    8,192    4578940240   16384     26.70
>>>> 1809296.292330  1809296.320974  tp_osd_tp    22191  R    8,192    4577938704   16384     28.64
>>>> 1809296.292614  1809296.323292  tp_osd_tp    22191  R    8,192    4600404304   16384     30.68
>>>> 1809296.292353  1809296.325300  tp_osd_tp    22191  R    8,192    4578343088   16384     32.95
>>>> 1809296.292340  1809296.328013  tp_osd_tp    22191  R    8,192    4578055472   16384     35.67
>>>> 1809296.292606  1809296.330518  tp_osd_tp    22191  R    8,192    4578581648   16384     37.91
>>>> 1809295.169266  1809296.334041  bstore_kv_fi 17266  WS   8,192    4244996360   4096    1164.78
>>>> 1809296.292618  1809296.336349  tp_osd_tp    22191  R    8,192    4602631760   16384     43.73
>>>> 1809296.292618  1809296.338812  tp_osd_tp    22191  R    8,192    4602632976   16384     46.19
>>>> 1809296.030103  1809296.342780  tp_osd_tp    22180  WS   8,192    4741276048   131072   312.68
>>>> 1809296.292347  1809296.345045  tp_osd_tp    22191  R    8,192    4609037872   16384     52.70
>>>> 1809296.292620  1809296.345109  tp_osd_tp    22191  R    8,192    4609037904   16384     52.49
>>>> 1809296.292612  1809296.347251  tp_osd_tp    22191  R    8,192    4578937616   16384     54.64
>>>> 1809296.292621  1809296.351136  tp_osd_tp    22191  R    8,192    4612654992   16384     58.51
>>>> 1809296.292341  1809296.353428  tp_osd_tp    22191  R    8,192    4578220656   16384     61.09
>>>> 1809296.292342  1809296.353864  tp_osd_tp    22191  R    8,192    4578220880   16384     61.52
>>>> 1809295.167650  1809296.358510  bstore_kv_fi 17266  WS   8,192    4923695960   4096    1190.86
>>>> 1809296.292347  1809296.361885  tp_osd_tp    22191  R    8,192    4607437136   16384     69.54
>>>> 1809296.029363  1809296.367313  tp_osd_tp    22180  WS   8,192    4739824400   98304    337.95
>>>> 1809296.292349  1809296.370245  tp_osd_tp    22191  R    8,192    4591379888   16384     77.90
>>>> 1809296.292348  1809296.376273  tp_osd_tp    22191  R    8,192    4591289552   16384     83.92
>>>> 1809296.292353  1809296.378659  tp_osd_tp    22191  R    8,192    4578248656   16384     86.31
>>>> 1809296.292619  1809296.384835  tp_osd_tp    22191  R    8,192    4617494160   65536     92.22
>>>> 1809295.165451  1809296.393715  bstore_kv_fi 17266  WS   8,192    1355703120   4096    1228.26
>>>> 1809295.168595  1809296.401560  bstore_kv_fi 17266  WS   8,192    1122200      4096    1232.96
>>>> 1809295.165221  1809296.408018  bstore_kv_fi 17266  WS   8,192    960656       4096    1242.80
>>>> 1809295.166737  1809296.411505  bstore_kv_fi 17266  WS   8,192    57682504     4096    1244.77
>>>> 1809296.292352  1809296.418123  tp_osd_tp    22191  R    8,192    4579459056   32768    125.77
>>>>
>>>> I'm confused about why a write with O_SYNC must be written back to the backend
>>>> storage device. And after I had used bcache for a while,
>>>> the latency increased a lot (the SSD is not very busy). Are there any
>>>> best practices for configuration?
>>>>