* Injecting delays into block layer
From: Oleksandr Natalenko @ 2019-11-21 7:13 UTC
To: linux-kernel; +Cc: linux-block, paolo.valente

Hi Paolo et al.

I have a strong suspicion that something goes wrong when the underlying block device responds with a large delay. What makes me think so is that I use a VM on some cloud provider, and they have substantial block device latency resulting in permanently high (~20%) iowait. It spikes occasionally when their cluster is overloaded, and when that happens, the I/O in my VM may stop and never recover. This is a rare occurrence, but it really happens.

What's worse, so far I've seen such behaviour with BFQ only. I'm still testing other schedulers though.

Important note: I have no strict evidence that this is *the* case, thus I'm asking for some suggestions. My idea is to fire up a local VM and inject delays into a block device while performing some I/O from within the VM.

So the question is: how can those delays be injected? Using dm-delay? Can those delays be random?

Thanks in advance.

--
Oleksandr Natalenko (post-factum)
* Re: Injecting delays into block layer
From: Paolo Valente @ 2019-11-21 8:00 UTC
To: Oleksandr Natalenko; +Cc: linux-kernel, linux-block

> On 21 Nov 2019, at 08:13, Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
>
> Hi Paolo et al.
>

Hi

> I have a strong suspicion that something goes wrong when the underlying block device responds with a large delay. What makes me think so is that I use a VM on some cloud provider, and they have substantial block device latency resulting in permanently high (~20%) iowait. It spikes occasionally when their cluster is overloaded, and when that happens, the I/O in my VM may stop and never recover. This is a rare occurrence, but it really happens.
>
> What's worse, so far I've seen such behaviour with BFQ only. I'm still testing other schedulers though.
>
> Important note: I have no strict evidence that this is *the* case, thus I'm asking for some suggestions. My idea is to fire up a local VM and inject delays into a block device while performing some I/O from within the VM.
>
> So the question is: how can those delays be injected? Using dm-delay? Can those delays be random?
>

So far I have used scsi_debug [1] for this kind of test. In my S suite [2], it boils down to setting SCSI_DEBUG=yes in the S config file, and then launching any of the benchmarks. Unfortunately, AFAIK scsi_debug gives you only constant delays, but you can emulate delay spikes very easily by changing the delay parameter manually during the test.

If this option sounds reasonable to you, then I'm willing to help you with every step.

Thanks,
Paolo

[1] http://sg.danny.cz/sg/sdebug26.html
[2] https://github.com/Algodev-github/S

> Thanks in advance.
>
> --
> Oleksandr Natalenko (post-factum)
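
A minimal sketch of the "change the delay parameter manually during the test" idea, not taken from the S suite. It assumes the scsi_debug module is loaded and that its runtime-writable `delay` attribute (response delay in jiffies) lives at /sys/bus/pseudo/drivers/scsi_debug/delay; the helper names `set_delay` and `spike` are hypothetical.

```python
import time
from pathlib import Path

# Assumption: runtime-writable scsi_debug attribute; adjust if your kernel
# exposes it elsewhere (e.g. under /sys/module/scsi_debug/parameters/).
DELAY_ATTR = "/sys/bus/pseudo/drivers/scsi_debug/delay"


def set_delay(value, attr=DELAY_ATTR):
    """Write a new response delay (jiffies for the "delay" attribute)."""
    Path(attr).write_text(f"{int(value)}\n")


def spike(base, peak, hold_s, attr=DELAY_ATTR, sleep=time.sleep):
    """Raise the delay to `peak` for `hold_s` seconds, then restore `base`."""
    set_delay(peak, attr)
    try:
        sleep(hold_s)
    finally:
        # Restore the baseline even if the sleep is interrupted.
        set_delay(base, attr)
```

Something like `spike(base=1, peak=250, hold_s=5)`, run as root while a benchmark is in progress, would emulate one latency spike. As far as I know the driver may refuse the write while commands are still outstanding, so a real script would probably need to retry.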
* Re: Injecting delays into block layer
From: Paolo Valente @ 2019-12-06 16:17 UTC
To: Oleksandr Natalenko; +Cc: linux-kernel, linux-block, SIMONE RICHETTI

> On 21 Nov 2019, at 09:00, Paolo Valente <paolo.valente@linaro.org> wrote:
>
>> On 21 Nov 2019, at 08:13, Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
>>
>> Hi Paolo et al.
>>
>
> Hi
>
>> I have a strong suspicion that something goes wrong when the underlying block device responds with a large delay. What makes me think so is that I use a VM on some cloud provider, and they have substantial block device latency resulting in permanently high (~20%) iowait. It spikes occasionally when their cluster is overloaded, and when that happens, the I/O in my VM may stop and never recover. This is a rare occurrence, but it really happens.
>>
>> What's worse, so far I've seen such behaviour with BFQ only. I'm still testing other schedulers though.
>>
>> Important note: I have no strict evidence that this is *the* case, thus I'm asking for some suggestions. My idea is to fire up a local VM and inject delays into a block device while performing some I/O from within the VM.
>>
>> So the question is: how can those delays be injected? Using dm-delay? Can those delays be random?
>>
>
> So far I have used scsi_debug [1] for this kind of test. In my S
> suite [2], it boils down to setting SCSI_DEBUG=yes in the S config
> file, and then launching any of the benchmarks. Unfortunately, AFAIK
> scsi_debug gives you only constant delays, but you can emulate delay
> spikes very easily by changing the delay parameter manually during
> the test.
>
> If this option sounds reasonable to you, then I'm willing to help you
> with every step.
>

Hi Oleksandr,
Simone (in CC) and I have worked a little bit on reproducing the I/O
freeze you report. Simone made a small change in SCSI_debug, which
makes the latter serve I/O with a highly varying random delay (100ms -
1s), about twice a second.

Then, to generate some fluctuating and heavy I/O, he ran the
comm_startup_lat.sh script of my S suite with SCSI_debug a few times.
Unfortunately, he didn't succeed in reproducing the problem. If you
want, we can send you a patch with his change for SCSI_debug.

Any news on your side?

Thanks,
Simone

> Thanks,
> Paolo
>
> [1] http://sg.danny.cz/sg/sdebug26.html
> [2] https://github.com/Algodev-github/S
>
>> Thanks in advance.
>>
>> --
>> Oleksandr Natalenko (post-factum)
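
The delay pattern described in this message (a random delay between 100 ms and 1 s, drawn about twice a second) can be pictured as a simple schedule. The sketch below is only an illustration of that pattern in user space; it is not Simone's SCSI_debug patch, which is not shown in this thread, and the function name `delay_schedule` and its parameters are hypothetical.

```python
import random


def delay_schedule(duration_s, period_s=0.5, lo_ms=100, hi_ms=1000, seed=None):
    """Return a list of (time_offset_s, delay_ms) pairs.

    A new delay is drawn uniformly from [lo_ms, hi_ms] every `period_s`
    seconds, i.e. about twice a second with the defaults.
    """
    rng = random.Random(seed)  # private RNG so a seed gives repeatable runs
    plan = []
    t = 0.0
    while t < duration_s:
        plan.append((t, rng.randint(lo_ms, hi_ms)))
        t += period_s
    return plan
```

A driver loop could walk such a plan and apply each delay through whatever injection mechanism is in use; passing a fixed `seed` keeps a failing run reproducible.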
* Re: Injecting delays into block layer
From: Oleksandr Natalenko @ 2019-12-06 19:50 UTC
To: Paolo Valente; +Cc: linux-kernel, linux-block, SIMONE RICHETTI

Hello.

On 06.12.2019 17:17, Paolo Valente wrote:
> Simone (in CC) and I have worked a little bit on reproducing the I/O
> freeze you report. Simone made a small change in SCSI_debug, which
> makes the latter serve I/O with a highly varying random delay (100ms -
> 1s), about twice a second.
>
> Then, to generate some fluctuating and heavy I/O, he ran the
> comm_startup_lat.sh script of my S suite with SCSI_debug a few times.
> Unfortunately, he didn't succeed in reproducing the problem. If you
> want, we can send you a patch with his change for SCSI_debug.
>
> Any news on your side?

I was playing with dm-delay in an isolated VM, but so far had no luck. I'll try to find another way to trigger this (if the bug is still present in 5.4) and get back to you in case of success.

For me it is a rare occurrence in production, and since I upgraded to 5.4 and disabled BFQ I haven't seen any at all. At this point I'm not even sure what I'm looking at. I'll try to re-enable BFQ soon to stress my production VMs again.

Thank you.

--
Oleksandr Natalenko (post-factum)
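
For readers unfamiliar with dm-delay: a mapping is described by one device-mapper table line of the form `<start> <length> delay <device> <offset> <delay_ms>`, optionally followed by a second `<device> <offset> <delay_ms>` triple that applies to writes. The sketch below only builds that line; the helper name `dm_delay_table` and the example device are hypothetical, and this is not the setup actually used in the thread.

```python
def dm_delay_table(device, sectors, delay_ms, write_ms=None, offset=0):
    """Build a dm-delay table line for `device`.

    `sectors` is the device size in 512-byte sectors (what
    `blockdev --getsz` reports). With only `delay_ms`, the delay applies
    to all I/O; with `write_ms`, writes get their own delay.
    """
    line = f"0 {sectors} delay {device} {offset} {delay_ms}"
    if write_ms is not None:
        line += f" {device} {offset} {write_ms}"
    return line
```

The resulting string can be piped to `dmsetup create <name>`. The delay in a table is fixed, so varying it over time would, as far as I can tell, mean loading a new table with `dmsetup suspend`, `dmsetup reload` and `dmsetup resume`.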
* Re: Injecting delays into block layer
From: Oleksandr Natalenko @ 2019-12-23 0:10 UTC
To: Paolo Valente; +Cc: linux-kernel, linux-block, SIMONE RICHETTI, tytso

Hi.

On 06.12.2019 17:17, Paolo Valente wrote:
> Simone (in CC) and I have worked a little bit on reproducing the I/O
> freeze you report. Simone made a small change in SCSI_debug, which
> makes the latter serve I/O with a highly varying random delay (100ms -
> 1s), about twice a second.
>
> Then, to generate some fluctuating and heavy I/O, he ran the
> comm_startup_lat.sh script of my S suite with SCSI_debug a few times.
> Unfortunately, he didn't succeed in reproducing the problem. If you
> want, we can send you a patch with his change for SCSI_debug.
>
> Any news on your side?

FWIW, I guess it's safe to rule out BFQ at the moment, since I've encountered a very similar issue without having BFQ enabled.

Also, I think this might not be related to the block layer at all. I suspect there's some race between MADV_MERGEABLE and MADV_DONTNEED, since this is what's hammering the affected tasks and what I see in the call traces. I'll investigate further and probably talk to the MM people instead.

Sorry for the noise.

--
Oleksandr Natalenko (post-factum)