linux-kernel.vger.kernel.org archive mirror
* Injecting delays into block layer
@ 2019-11-21  7:13 Oleksandr Natalenko
  2019-11-21  8:00 ` Paolo Valente
  0 siblings, 1 reply; 5+ messages in thread
From: Oleksandr Natalenko @ 2019-11-21  7:13 UTC
  To: linux-kernel; +Cc: linux-block, paolo.valente

Hi Paolo et al.

I strongly suspect that something goes wrong when the underlying block
device responds with a large delay. What makes me think so is that I run
a VM at a cloud provider whose block devices have substantial latency,
resulting in permanently high (~20%) iowait. The latency spikes
occasionally when their cluster is overloaded, and when that happens,
the I/O in my VM may stop and never recover. This is rare, but it does
happen.

What's worse, so far I've seen this behaviour with BFQ only. I'm still
testing other schedulers, though.

Important note: I have no hard evidence that this is *the* case, so
I'm asking for suggestions. My idea is to fire up a local VM and
inject delays into a block device while performing some I/O from within
the VM.

So the question is: how can those delays be injected? Using dm-delay? 
Can those delays be random?

Thanks in advance.

-- 
   Oleksandr Natalenko (post-factum)


* Re: Injecting delays into block layer
  2019-11-21  7:13 Injecting delays into block layer Oleksandr Natalenko
@ 2019-11-21  8:00 ` Paolo Valente
  2019-12-06 16:17   ` Paolo Valente
  0 siblings, 1 reply; 5+ messages in thread
From: Paolo Valente @ 2019-11-21  8:00 UTC
  To: Oleksandr Natalenko; +Cc: linux-kernel, linux-block



> On 21 Nov 2019, at 08:13, Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> 
> Hi Paolo et al.
> 

Hi

> I strongly suspect that something goes wrong when the underlying block device responds with a large delay. What makes me think so is that I run a VM at a cloud provider whose block devices have substantial latency, resulting in permanently high (~20%) iowait. The latency spikes occasionally when their cluster is overloaded, and when that happens, the I/O in my VM may stop and never recover. This is rare, but it does happen.
> 
> What's worse, so far I've seen this behaviour with BFQ only. I'm still testing other schedulers, though.
> 
> Important note: I have no hard evidence that this is *the* case, so I'm asking for suggestions. My idea is to fire up a local VM and inject delays into a block device while performing some I/O from within the VM.
> 
> So the question is: how can those delays be injected? Using dm-delay? Can those delays be random?
> 

So far I have used scsi_debug [1] for this kind of test.  In my S
suite [2], it boils down to setting SCSI_DEBUG=yes in the S config
file, and then launching any of the benchmarks.  Unfortunately, AFAIK
scsi_debug gives you only constant delays; but you can emulate delay
spikes very easily, by changing the delay parameter manually during
the test.
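
As a rough sketch of that manual approach (not taken from the S suite):
the snippet below assumes the scsi_debug module is loaded and that its
delay parameter, expressed in jiffies, is writable at
/sys/bus/pseudo/drivers/scsi_debug/delay; check both on the kernel in
use.  The helper names set_delay and spike are made up for
illustration, and the driver may refuse the write while commands are
still queued.

```python
import time
from pathlib import Path

# Assumed sysfs location of scsi_debug's runtime-writable delay parameter.
SCSI_DEBUG_DELAY = Path("/sys/bus/pseudo/drivers/scsi_debug/delay")


def set_delay(jiffies: int, path: Path = SCSI_DEBUG_DELAY) -> None:
    """Write a new response delay (in jiffies) to the given parameter file."""
    if jiffies < 0:
        raise ValueError("delay must be non-negative")
    path.write_text(f"{jiffies}\n")


def spike(high: int, low: int, hold_s: float,
          path: Path = SCSI_DEBUG_DELAY) -> None:
    """Raise the delay to `high` for `hold_s` seconds, then restore `low`."""
    set_delay(high, path)
    try:
        time.sleep(hold_s)
    finally:
        # Restore the baseline even if the sleep is interrupted.
        set_delay(low, path)
```

Run as root while a benchmark is in progress, something like
spike(high=250, low=1, hold_s=5.0) would emulate one delay spike; the
numbers are arbitrary.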

If this option sounds reasonable to you, then I'm happy to help you
at every step.

Thanks,
Paolo

[1] http://sg.danny.cz/sg/sdebug26.html
[2] https://github.com/Algodev-github/S

> Thanks in advance.
> 
> -- 
>  Oleksandr Natalenko (post-factum)



* Re: Injecting delays into block layer
  2019-11-21  8:00 ` Paolo Valente
@ 2019-12-06 16:17   ` Paolo Valente
  2019-12-06 19:50     ` Oleksandr Natalenko
  2019-12-23  0:10     ` Oleksandr Natalenko
  0 siblings, 2 replies; 5+ messages in thread
From: Paolo Valente @ 2019-12-06 16:17 UTC
  To: Oleksandr Natalenko; +Cc: linux-kernel, linux-block, SIMONE RICHETTI




Hi Oleksandr,
Simone (in CC) and I have worked a little bit on reproducing the I/O
freeze you report.  Simone made a small change in SCSI_debug, which
makes the latter serve I/O with a highly varying random delay (100ms -
1s), about twice a second.

Then, to generate some fluctuating and heavy I/O, he ran the
comm_startup_lat.sh script of my S suite with SCSI_debug a few times.
Unfortunately, he didn't succeed in reproducing the problem.  If you
want, we can send you a patch with his change for SCSI_debug.
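
To make the delay pattern concrete, here is a small user-space model of
it.  This is not Simone's SCSI_debug change, only a sketch under stated
assumptions: delays are drawn uniformly from 100 ms to 1 s, and the
gaps between delayed requests are exponentially distributed with a
mean of 0.5 s.  The function name and its parameters are hypothetical.

```python
import random
from typing import Iterator, Tuple


def delay_schedule(duration_s: float, seed: int = 0,
                   min_delay_s: float = 0.1, max_delay_s: float = 1.0,
                   mean_interval_s: float = 0.5
                   ) -> Iterator[Tuple[float, float]]:
    """Yield (time_s, delay_s) pairs for delayed requests.

    Injection times form a Poisson process with the given mean interval
    (an assumption); each delay is uniform in [min_delay_s, max_delay_s].
    """
    rng = random.Random(seed)
    t = 0.0
    while True:
        # expovariate takes the rate, i.e. 1 / mean interval.
        t += rng.expovariate(1.0 / mean_interval_s)
        if t >= duration_s:
            return
        yield t, rng.uniform(min_delay_s, max_delay_s)
```

Iterating over delay_schedule(60.0) gives roughly two delayed requests
per simulated second over one minute.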

Any news on your side?

Thanks,
Paolo




* Re: Injecting delays into block layer
  2019-12-06 16:17   ` Paolo Valente
@ 2019-12-06 19:50     ` Oleksandr Natalenko
  2019-12-23  0:10     ` Oleksandr Natalenko
  1 sibling, 0 replies; 5+ messages in thread
From: Oleksandr Natalenko @ 2019-12-06 19:50 UTC
  To: Paolo Valente; +Cc: linux-kernel, linux-block, SIMONE RICHETTI

Hello.

On 06.12.2019 17:17, Paolo Valente wrote:
> Simone (in CC) and I have worked a little bit on reproducing the I/O
> freeze you report.  Simone made a small change in SCSI_debug, which
> makes the latter serve I/O with a highly varying random delay (100ms -
> 1s), about twice a second.
> 
> Then, to generate some fluctuating and heavy I/O, he ran the
> comm_startup_lat.sh script of my S suite with SCSI_debug a few times.
> Unfortunately, he didn't succeed in reproducing the problem.  If you
> want, we can send you a patch with his change for SCSI_debug.
> 
> Any news on your side?

I've been playing with dm-delay in an isolated VM, but so far no luck.
I'll try to find another way to trigger this (if the bug is still 
present in 5.4) and get back to you in case of success.
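
For reference, a sketch of how such a dm-delay setup can be driven.
The table line follows the documented dm-delay format
(<device> <offset> <delay in ms>); the device path, mapping name and
helper names are placeholders, and the size in 512-byte sectors is
assumed to come from `blockdev --getsz`.  dm-delay itself applies a
fixed delay, so varying it means reloading the table with a new value.

```python
from typing import List


def delay_table(device: str, size_sectors: int, delay_ms: int) -> str:
    """Return a dm-delay table line delaying all I/O to `device`."""
    if size_sectors <= 0 or delay_ms < 0:
        raise ValueError("size must be positive and delay non-negative")
    # Format: <start> <length> delay <device> <offset> <delay_ms>
    return f"0 {size_sectors} delay {device} 0 {delay_ms}"


def create_cmd(name: str, device: str, size_sectors: int,
               delay_ms: int) -> List[str]:
    """Command that creates /dev/mapper/<name> on top of `device`."""
    return ["dmsetup", "create", name, "--table",
            delay_table(device, size_sectors, delay_ms)]


def retune_cmds(name: str, device: str, size_sectors: int,
                delay_ms: int) -> List[List[str]]:
    """Commands that swap in a new delay on an existing mapping."""
    table = delay_table(device, size_sectors, delay_ms)
    return [
        ["dmsetup", "suspend", name],
        ["dmsetup", "reload", name, "--table", table],
        ["dmsetup", "resume", name],
    ]
```

Each command list can be passed to subprocess.run(cmd, check=True) as
root; calling retune_cmds in a loop with random delay_ms values gives
a crude approximation of random delays.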

For me it is a rare occurrence in production, and since I've upgraded to 
5.4 and disabled BFQ I haven't seen any at all. At this point I'm not 
even sure what I'm looking at. I'll try to re-enable BFQ soon to stress 
my production VMs again.

Thank you.

-- 
   Oleksandr Natalenko (post-factum)


* Re: Injecting delays into block layer
  2019-12-06 16:17   ` Paolo Valente
  2019-12-06 19:50     ` Oleksandr Natalenko
@ 2019-12-23  0:10     ` Oleksandr Natalenko
  1 sibling, 0 replies; 5+ messages in thread
From: Oleksandr Natalenko @ 2019-12-23  0:10 UTC
  To: Paolo Valente; +Cc: linux-kernel, linux-block, SIMONE RICHETTI, tytso

Hi.

On 06.12.2019 17:17, Paolo Valente wrote:
> Simone (in CC) and I have worked a little bit on reproducing the I/O
> freeze you report.  Simone made a small change in SCSI_debug, which
> makes the latter serve I/O with a highly varying random delay (100ms -
> 1s), about twice a second.
> 
> Then, to generate some fluctuating and heavy I/O, he ran the
> comm_startup_lat.sh script of my S suite with SCSI_debug a few times.
> Unfortunately, he didn't succeed in reproducing the problem.  If you
> want, we can send you a patch with his change for SCSI_debug.
> 
> Any news on your side?

FWIW, I guess it's safe to rule out BFQ at the moment, since I've
encountered a very similar issue without BFQ enabled.

Also, I think this might not be related to the block layer at all. I
suspect there's some race between MADV_MERGEABLE and MADV_DONTNEED,
since that is what is hammering the affected tasks and what I see in
the call traces.

I'll investigate further and probably talk to MM people instead. Sorry 
for the noise.

-- 
   Oleksandr Natalenko (post-factum)

