* A lot of flush requests to the backing device
@ 2021-11-05 11:21 Aleksei Zakharov
  2021-11-08  5:38 ` Dongdong Tao
  0 siblings, 1 reply; 6+ messages in thread
From: Aleksei Zakharov @ 2021-11-05 11:21 UTC (permalink / raw)
  To: linux-bcache

Hi all,
 
I've been using bcache a lot for the last three years, mostly in writeback mode with ceph, and I've run into some strange behavior. When there's a heavy write load on a bcache device with a lot of fsync()/fdatasync() requests, the bcache device issues a lot of flush requests to the backing device. If the writeback rate is low, there can be hundreds of flush requests per second issued to the backing device.
 
If the writeback rate grows, the latency of the flush requests increases. The latency of the bcache device increases as a result, and the application experiences higher disk latency. So this behavior of bcache slows down the application's I/O when the writeback rate becomes high.
 
This workload pattern with a lot of fsync()/fdatasync() requests is common for latency-sensitive applications, and it seems that this bcache behavior slows down this type of workload.
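
For what it's worth, a load like this can be approximated with fio -- just a rough sketch, not the exact ceph pattern; the mount point and job sizes below are made up:

  # fsync-heavy small writes on a filesystem on top of /dev/bcache0,
  # assumed here to be mounted at /mnt/bcache
  fio --name=fsync-heavy --directory=/mnt/bcache \
      --rw=randwrite --bs=4k --size=1G \
      --ioengine=psync --numjobs=4 \
      --fdatasync=1 --time_based --runtime=60

With fdatasync=1 every write is followed by an fdatasync(), which is what produces the flush requests described above.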
 
As I understand it, when a write request with REQ_PREFLUSH is issued to a bcache device, bcache issues a new empty write request with REQ_PREFLUSH to the backing device. What is the purpose of this behavior? It looks like it could be eliminated for better performance.

--
Regards,
Aleksei Zakharov
alexzzz.ru


* Re: A lot of flush requests to the backing device
  2021-11-05 11:21 A lot of flush requests to the backing device Aleksei Zakharov
@ 2021-11-08  5:38 ` Dongdong Tao
  2021-11-08  6:35   ` Kai Krakow
  2021-11-10 14:35   ` A lot of flush requests to the backing device Aleksei Zakharov
  0 siblings, 2 replies; 6+ messages in thread
From: Dongdong Tao @ 2021-11-08  5:38 UTC (permalink / raw)
  To: Aleksei Zakharov; +Cc: linux-bcache

[Sorry for the Spam detection ... ]

Hi Aleksei,

This is a very interesting finding. I understand that ceph bluestore
will issue fdatasync requests when it tries to flush data or metadata
(via bluefs) to the OSD device, but I'm surprised by how much
pressure it can put on the backing device.
May I know how you measured the number of flush requests per second
that are sent from bcache to the backing device with the
REQ_PREFLUSH flag?  (ftrace on some bcache tracepoint?)
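
Something along these lines (bpftrace on the generic block
tracepoints rather than a bcache-specific one) is roughly what I had
in mind -- an untested sketch, and the exact rwbs encoding of the
flush flag may differ between kernel versions:

  # count requests per (dev_t, rwbs) once a second; flush requests
  # show up with an "F" in the rwbs string
  bpftrace -e '
    tracepoint:block:block_rq_issue {
      @[args->dev, str(args->rwbs)] = count();
    }
    interval:s:1 { print(@); clear(@); }
  '

args->dev is the kernel dev_t, so the bcache device and its backing
device can be told apart by their major:minor numbers.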

My understanding is that bcache doesn't need to wait for the flush
requests to the backing device to complete in order to finish the
write request, since it uses a separate "flush" bio for the backing
device.
So I don't think this will increase the fdatasync latency as long as
the write can be served in writeback mode.  It does increase the
read latency if the read I/O misses the cache.
Or maybe I am missing something; let me know how you observed the
latency increase at the bcache layer, as I would like to do some
experiments as well.

Regards,
Dongdong


On Fri, Nov 5, 2021 at 7:21 PM Aleksei Zakharov <zakharov.a.g@yandex.ru> wrote:
>
> Hi all,
>
> I've been using bcache a lot for the last three years, mostly in writeback mode with ceph, and I've run into some strange behavior. When there's a heavy write load on a bcache device with a lot of fsync()/fdatasync() requests, the bcache device issues a lot of flush requests to the backing device. If the writeback rate is low, there can be hundreds of flush requests per second issued to the backing device.
>
> If the writeback rate grows, the latency of the flush requests increases. The latency of the bcache device increases as a result, and the application experiences higher disk latency. So this behavior of bcache slows down the application's I/O when the writeback rate becomes high.
>
> This workload pattern with a lot of fsync()/fdatasync() requests is common for latency-sensitive applications, and it seems that this bcache behavior slows down this type of workload.
>
> As I understand it, when a write request with REQ_PREFLUSH is issued to a bcache device, bcache issues a new empty write request with REQ_PREFLUSH to the backing device. What is the purpose of this behavior? It looks like it could be eliminated for better performance.
>
> --
> Regards,
> Aleksei Zakharov
> alexzzz.ru


* Re: A lot of flush requests to the backing device
  2021-11-08  5:38 ` Dongdong Tao
@ 2021-11-08  6:35   ` Kai Krakow
  2021-11-08  8:11     ` Coly Li
  2021-11-10 14:35   ` A lot of flush requests to the backing device Aleksei Zakharov
  1 sibling, 1 reply; 6+ messages in thread
From: Kai Krakow @ 2021-11-08  6:35 UTC (permalink / raw)
  To: Dongdong Tao; +Cc: Aleksei Zakharov, linux-bcache

On Mon, Nov 8, 2021 at 6:38 AM Dongdong Tao
<dongdong.tao@canonical.com> wrote:
>
> My understanding is that bcache doesn't need to wait for the flush
> requests to the backing device to complete in order to finish the
> write request, since it uses a separate "flush" bio for the backing
> device.

That's probably true for requests going to the writeback cache. But
requests that bypass the cache must also pass the flush request to the
backing device - otherwise it would violate transactional guarantees.
bcache still guarantees the presence of the dirty data when it later
replays all dirty data to the backing device (and it can probably
reduce flushes here and only flush just before removing the writeback
log from its cache).

Personally, I've turned writeback caching off due to increasingly high
latencies as seen by applications [1]. Writes may be slower
throughput-wise, but overall latency is lower, which "feels" faster.

I wonder if maybe a lot of writes with flush requests bypass the cache...
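
(If someone wants to check: the per-device bypass counters in sysfs
should show it. A sketch, with paths assuming the device is bcache0:

  cd /sys/block/bcache0/bcache
  grep . stats_five_minute/bypassed \
         stats_five_minute/cache_bypass_hits \
         stats_five_minute/cache_bypass_misses

If "bypassed" grows about as fast as the write load, most of the IO is
going straight to the backing device.)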

That said, initial releases of bcache felt a lot smoother here. But
I'd like to add that I only ever used it for desktop workflows; I
never used ceph.

Regards,
Kai

[1]: And some odd behavior where bcache would detach dirty caches on
caching device problems, which sometimes happens for me at reboot just
after bcache was detected (probably due to an SSD firmware hiccup: the
device temporarily goes missing and re-appears) - and then all dirty
data is lost and discarded. As a consequence, on the next reboot, cache
mode is set to "none" and the devices need to be re-attached. But until
then, the dirty data is long gone.


* Re: A lot of flush requests to the backing device
  2021-11-08  6:35   ` Kai Krakow
@ 2021-11-08  8:11     ` Coly Li
  2021-11-08 11:29       ` Latency, performance, detach behavior (was: A lot of flush requests to the backing device) Kai Krakow
  0 siblings, 1 reply; 6+ messages in thread
From: Coly Li @ 2021-11-08  8:11 UTC (permalink / raw)
  To: Kai Krakow; +Cc: Aleksei Zakharov, Dongdong Tao, linux-bcache

On 11/8/21 2:35 PM, Kai Krakow wrote:
> On Mon, Nov 8, 2021 at 6:38 AM Dongdong Tao
> <dongdong.tao@canonical.com> wrote:
>> My understanding is that bcache doesn't need to wait for the flush
>> requests to the backing device to complete in order to finish the
>> write request, since it uses a separate "flush" bio for the backing
>> device.
> That's probably true for requests going to the writeback cache. But
> requests that bypass the cache must also pass the flush request to the
> backing device - otherwise it would violate transactional guarantees.
> bcache still guarantees the presence of the dirty data when it later
> replays all dirty data to the backing device (and it can probably
> reduce flushes here and only flush just before removing the writeback
> log from its cache).
>
> Personally, I've turned writeback caching off due to increasingly high
> latencies as seen by applications [1]. Writes may be slower
> throughput-wise, but overall latency is lower, which "feels" faster.
>
> I wonder if maybe a lot of writes with flush requests bypass the cache...
>
> That said, initial releases of bcache felt a lot smoother here. But
> I'd like to add that I only ever used it for desktop workflows; I
> never used ceph.
>
> Regards,
> Kai
>
> [1]: And some odd behavior where bcache would detach dirty caches on
> caching device problems, which sometimes happens for me at reboot just
> after bcache was detected (probably due to an SSD firmware hiccup: the
> device temporarily goes missing and re-appears) - and then all dirty
> data is lost and discarded. As a consequence, on the next reboot, cache
> mode is set to "none" and the devices need to be re-attached. But until
> then, the dirty data is long gone.

Just an off-topic question: when you experienced the above situation,
what was the kernel version?
We recently had a bkey oversize regression triggered in Linux v5.12 or
v5.13, which behaved quite similarly to the above description.
The issue was fixed in Linux v5.13 by the following commits:

commit 1616a4c2ab1a ("bcache: remove bcache device self-defined readahead")
commit 41fe8d088e96 ("bcache: avoid oversized read request in cache 
missing code path")

Coly Li


* Latency, performance, detach behavior (was: A lot of flush requests to the backing device)
  2021-11-08  8:11     ` Coly Li
@ 2021-11-08 11:29       ` Kai Krakow
  0 siblings, 0 replies; 6+ messages in thread
From: Kai Krakow @ 2021-11-08 11:29 UTC (permalink / raw)
  To: Coly Li; +Cc: Aleksei Zakharov, Dongdong Tao, linux-bcache

On Mon, Nov 8, 2021 at 9:11 AM Coly Li <colyli@suse.de> wrote:
> On 11/8/21 2:35 PM, Kai Krakow wrote:
> > [1]: And some odd behavior where bcache would detach dirty caches on
> > caching device problems, which sometimes happens for me at reboot just
> > after bcache was detected (probably due to an SSD firmware hiccup: the
> > device temporarily goes missing and re-appears) - and then all dirty
> > data is lost and discarded. As a consequence, on the next reboot, cache
> > mode is set to "none" and the devices need to be re-attached. But until
> > then, the dirty data is long gone.
>
> Just an off-topic question: when you experienced the above situation,
> what was the kernel version?
> We recently had a bkey oversize regression triggered in Linux v5.12 or
> v5.13, which behaved quite similarly to the above description.
> The issue was fixed in Linux v5.13 by the following commits:

Do you mean exactly the above-mentioned situation, or the latency problems?

I'm using LTS kernels, currently the 5.10 series, and I usually
update as soon as possible. I haven't switched to 5.15 yet.

Latency problems: that's a long-standing issue, and it may be more
related to how btrfs works on top of bcache. It has improved during
the course of 5.10, probably due to changes in btrfs. But it seems
that using bcache writeback causes more writeback blocking than it
should, while without bcache writeback, dirty writeback takes longer
but doesn't block the desktop as much. It may also be related to the
sometimes varying latency of Samsung Evo SSD drives.

> commit 1616a4c2ab1a ("bcache: remove bcache device self-defined readahead")
> commit 41fe8d088e96 ("bcache: avoid oversized read request in cache
> missing code path")

Without having looked at the commits, this mostly sounds like it would
affect latency and performance.

So your request was probably NOT about the detach-on-error situation.

Just for completeness: that one isn't really a software problem (I'll
just ditch Samsung on the next SSD swap, maybe going to a Seagate
Ironwolf instead, which was recommended by Zygo, who created bees and
works on btrfs). I expect that situation not to occur again then; I
never experienced it back when I used Crucial MX drives (which also
had better latency behavior). Since using Samsung SSDs, I've lost
parts of the EFI partition more than once (2 MB were just zeroed out
in vfat), which hasn't happened again since I turned TRIM off (some
filesystems or even bcache seem to enable it, and the kernel doesn't
blacklist the feature for my model). This also caused bcache to
sometimes complain about a broken journal structure. But well, this
is not the lost-data-on-TRIM situation:

Due to the nature of the problem, I cannot really pinpoint when it
first happened. The problem is, usually on cold boots, that the SSD
firmware detaches from SATA shortly after the power-cycle and comes
back; since I use fast-boot UEFI, that means it can happen when the
kernel has already booted and bcache is loaded. This never happens on
a running system, only during boot/POST. The problematic bcache commit
introduced behavior that detaches errored caching backends, which in
turn invalidates the dirty cache data. Looking at the cache status
after such an incident, the cache mode of the detached members is set
to "none" and they are no longer attached, but the cache device still
holds the same amount of data, so the data of the detached device was
not freed from the cache. But on re-attach, dirty data won't be
replayed, dirty data stays at 0, and btrfs tells me that expected
transaction numbers are some 300 generations behind (which is usually
not fixable; I was lucky this time because only one btrfs member had
dirty data, and scrub fixed it). bcache still keeps its usage level
(around 90%, or 860GB in my case), and it seems to just discard the
old "stale" data from before the detach.

I still think that bcache should not detach backends when the cache
device goes missing with dirty data - instead it must reply with IO
errors and/or go into read-only mode, until I either manually bring
the cache back or decide to resolve the situation by declaring the
dirty data lost. Even simple RAID controllers do that: if the cache
contents are lost or broken, they won't "auto-fix" themselves by
purging the cache; they halt on boot, telling me that I can either
work without the device set or accept that the dirty data is lost.
bcache should go into read-only mode and leave the cache attached but
marked missing/errored, until I decide to either accept the data loss
or resolve the situation with the missing cache device. Another
work-around would be if I could instruct bcache to flush all dirty
data during shutdown.
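
As a stopgap I could probably script that myself before shutdown -- a
rough sketch, assuming the device is bcache0: switch to writethrough
so no new dirty data is produced, let writeback_percent=0 push the
writeback rate up, and wait until the state reports clean:

  echo writethrough > /sys/block/bcache0/bcache/cache_mode
  echo 0 > /sys/block/bcache0/bcache/writeback_percent
  until grep -q clean /sys/block/bcache0/bcache/state; do
      cat /sys/block/bcache0/bcache/dirty_data
      sleep 5
  done

But of course that doesn't help when the cache goes missing before the
data has been written back.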

Regards,
Kai


* Re: A lot of flush requests to the backing device
  2021-11-08  5:38 ` Dongdong Tao
  2021-11-08  6:35   ` Kai Krakow
@ 2021-11-10 14:35   ` Aleksei Zakharov
  1 sibling, 0 replies; 6+ messages in thread
From: Aleksei Zakharov @ 2021-11-10 14:35 UTC (permalink / raw)
  To: Dongdong Tao; +Cc: linux-bcache



> [Sorry for the Spam detection ... ]
> 
> Hi Aleksei,
> 
> This is a very interesting finding. I understand that ceph bluestore
> will issue fdatasync requests when it tries to flush data or metadata
> (via bluefs) to the OSD device, but I'm surprised by how much
> pressure it can put on the backing device.
> May I know how you measured the number of flush requests per second
> that are sent from bcache to the backing device with the
> REQ_PREFLUSH flag? (ftrace on some bcache tracepoint?)
That was easy: the writeback rate was minimal, there were a lot of
write requests to the backing device in iostat -xtd 1 output, and the
bytes/s were too small for that number of writes. It was a relatively
old kernel, so flushes were not yet separated in the block layer stats.
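
On newer kernels this should be visible directly: if I remember
correctly, flush requests got their own counters in the block layer
around 5.5 (the last two fields of /sys/block/<dev>/stat), and a
recent sysstat shows them as the f/s and f_await columns of iostat -x.
So on such a kernel something like this on the backing device (sdb is
just an example name) would show the flush rate and latency:

  iostat -xtd 1
  # or the raw counters (flushes completed, time spent flushing):
  cat /sys/block/sdb/stat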

> 
> My understanding is that bcache doesn't need to wait for the flush
> requests to the backing device to complete in order to finish the
> write request, since it uses a separate "flush" bio for the backing
> device.
> So I don't think this will increase the fdatasync latency as long as
> the write can be served in writeback mode. It does increase the
> read latency if the read I/O misses the cache.
Hm, that might be true for the reads; I'll do some experiments.
But I don't see any reason to send a flush request to the backing
device if there's nothing to flush.

> Or maybe I am missing something; let me know how you observed the
> latency increase at the bcache layer, as I would like to do some
> experiments as well.
I'll do some experiments and come back with more details on the
issue in a week! I've already quit that job and don't work with ceph
anymore, but I'm still thinking about this interesting issue.

> 
> Regards,
> Dongdong
> 
> On Fri, Nov 5, 2021 at 7:21 PM Aleksei Zakharov <zakharov.a.g@yandex.ru> wrote:
> 
>> Hi all,
>>
>> I've been using bcache a lot for the last three years, mostly in writeback mode with ceph, and I've run into some strange behavior. When there's a heavy write load on a bcache device with a lot of fsync()/fdatasync() requests, the bcache device issues a lot of flush requests to the backing device. If the writeback rate is low, there can be hundreds of flush requests per second issued to the backing device.
>>
>> If the writeback rate grows, the latency of the flush requests increases. The latency of the bcache device increases as a result, and the application experiences higher disk latency. So this behavior of bcache slows down the application's I/O when the writeback rate becomes high.
>>
>> This workload pattern with a lot of fsync()/fdatasync() requests is common for latency-sensitive applications, and it seems that this bcache behavior slows down this type of workload.
>>
>> As I understand it, when a write request with REQ_PREFLUSH is issued to a bcache device, bcache issues a new empty write request with REQ_PREFLUSH to the backing device. What is the purpose of this behavior? It looks like it could be eliminated for better performance.
>>
>> --
>> Regards,
>> Aleksei Zakharov
>> alexzzz.ru
--
Regards,
Aleksei Zakharov
alexzzz.ru



Thread overview: 6+ messages
2021-11-05 11:21 A lot of flush requests to the backing device Aleksei Zakharov
2021-11-08  5:38 ` Dongdong Tao
2021-11-08  6:35   ` Kai Krakow
2021-11-08  8:11     ` Coly Li
2021-11-08 11:29       ` Latency, performance, detach behavior (was: A lot of flush requests to the backing device) Kai Krakow
2021-11-10 14:35   ` A lot of flush requests to the backing device Aleksei Zakharov
