linux-kernel.vger.kernel.org archive mirror
* [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
@ 2017-11-29 18:39 vcaputo
  2017-12-01 21:33 ` vcaputo
  0 siblings, 1 reply; 11+ messages in thread
From: vcaputo @ 2017-11-29 18:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: timmurray, tj

[-- Attachment #1: Type: text/plain, Size: 3396 bytes --]

Hello,

Recently I noticed substantial audio dropouts when listening to MP3s in
`cmus` while doing big and churny `git checkout` commands in my linux git
tree.

It's not something I've done much of over the last couple of months, so
I hadn't noticed until yesterday, but I didn't remember this being a
problem in recent history.

As there's quite an accumulation of similarly configured and built kernels
in my grub menu, it was trivial to determine approximately when this began:

4.11.0: no dropouts
4.12.0-rc7: dropouts
4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)

Watching top while this is going on in the various kernel versions, it's
apparent that the kworker behavior changed.  Both the priority and
quantity of running kworker threads are elevated in kernels experiencing
dropouts.

Searching through the commit history for v4.11..v4.12 uncovered:

commit a1b89132dc4f61071bdeaab92ea958e0953380a1
Author: Tim Murray <timmurray@google.com>
Date:   Fri Apr 21 11:11:36 2017 +0200

    dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
    
    Running dm-crypt with workqueues at the standard priority results in IO
    competing for CPU time with standard user apps, which can lead to
    pipeline bubbles and seriously degraded performance.  Move to using
    WQ_HIGHPRI workqueues to protect against that.
    
    Signed-off-by: Tim Murray <timmurray@google.com>
    Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

---

Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
problem completely.

Looking at the diff in that commit, it looks like the commit message
isn't even accurate: not only is the priority of the dmcrypt workqueues
being changed, they're also being made "CPU intensive" workqueues as
well.
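
For reference, the workqueue allocations in drivers/md/dm-crypt.c before
and after that commit look roughly like this (paraphrased from memory of
the diff, so treat the exact lines and context as approximate):

    - cc->io_queue = alloc_workqueue("kcryptd_io", WQ_MEM_RECLAIM, 1);
    + cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);

    - cc->crypt_queue = alloc_workqueue("kcryptd", WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
    + cc->crypt_queue = alloc_workqueue("kcryptd", WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);

(so after the change both queues end up WQ_HIGHPRI | WQ_CPU_INTENSIVE;
there's also a WQ_UNBOUND variant of the kcryptd allocation for the
non-same_cpu case, which gains WQ_HIGHPRI in the same way)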

This combination appears to result in both elevated scheduling priority
and a greater quantity of participating worker threads, effectively
starving any normal-priority user task during periods of heavy IO on
dmcrypt volumes.
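
For what it's worth, my rough understanding of the flags involved,
paraphrasing the workqueue documentation from memory (so don't treat the
details here as authoritative):

    /*
     * WQ_HIGHPRI       - work items are served by a separate per-CPU worker
     *                    pool whose kworkers run at elevated priority
     *                    (nice -20, HIGHPRI_NICE_LEVEL in kernel/workqueue.c);
     *                    these show up as the kworker/N:MH threads
     * WQ_CPU_INTENSIVE - work items are exempt from concurrency management,
     *                    so a long-running item doesn't stop the pool from
     *                    starting other queued work on that CPU
     * WQ_MEM_RECLAIM   - the workqueue is guaranteed a rescuer thread so it
     *                    can make forward progress under memory pressure
     */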

I don't know what the right solution is here.  It seems to me we're lacking
the appropriate mechanism for charging CPU resources consumed on behalf of
user processes in kworker threads to the work-causing process.

What effectively happens is my normal `git` user process is able to
greatly amplify what share of CPU it takes from the system by generating IO
on what happens to be a high-priority CPU-intensive storage volume.

It looks potentially complicated to fix properly, but I suspect at its
core this may be a fairly longstanding shortcoming of the page cache and
its asynchronous design, something that has been exacerbated
substantially by the introduction of CPU-intensive storage subsystems
like dmcrypt.

If we imagine the whole stack simplified, where all the IO was being done
synchronously in-band, and the dmcrypt kernel code simply ran in the
IO-causing process context, it would be getting charged to the calling
process and scheduled accordingly.  The resource accounting and scheduling
problems all emerge with the page cache, buffered IO, and async background
writeback in a pool of unrelated worker threads, etc.  That's how it
appears to me anyways...

The system used is an X61s Thinkpad (1.8GHz) with an 840 EVO SSD, LVM
on dmcrypt.
The kernel .config is attached in case it's of interest.

Thanks,
Vito Caputo

[-- Attachment #2: config-x61s.gz --]
[-- Type: application/gzip, Size: 25327 bytes --]


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2017-11-29 18:39 [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns vcaputo
@ 2017-12-01 21:33 ` vcaputo
  2017-12-18  9:25   ` Enric Balletbo Serra
  0 siblings, 1 reply; 11+ messages in thread
From: vcaputo @ 2017-12-01 21:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: timmurray, tj

On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
> Hello,
> 
> Recently I noticed substantial audio dropouts when listening to MP3s in
> `cmus` while doing big and churny `git checkout` commands in my linux git
> tree.
> 
> It's not something I've done much of over the last couple months so I
> hadn't noticed until yesterday, but didn't remember this being a problem in
> recent history.
> 
> As there's quite an accumulation of similarly configured and built kernels
> in my grub menu, it was trivial to determine approximately when this began:
> 
> 4.11.0: no dropouts
> 4.12.0-rc7: dropouts
> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> 
> Watching top while this is going on in the various kernel versions, it's
> apparent that the kworker behavior changed.  Both the priority and quantity
> of running kworker threads is elevated in kernels experiencing dropouts.
> 
> Searching through the commit history for v4.11..v4.12 uncovered:
> 
> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> Author: Tim Murray <timmurray@google.com>
> Date:   Fri Apr 21 11:11:36 2017 +0200
> 
>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
>     
>     Running dm-crypt with workqueues at the standard priority results in IO
>     competing for CPU time with standard user apps, which can lead to
>     pipeline bubbles and seriously degraded performance.  Move to using
>     WQ_HIGHPRI workqueues to protect against that.
>     
>     Signed-off-by: Tim Murray <timmurray@google.com>
>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> 
> ---
> 
> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> problem completely.
> 
> Looking at the diff in that commit, it looks like the commit message isn't
> even accurate; not only is the priority of the dmcrypt workqueues being
> changed - they're also being made "CPU intensive" workqueues as well.
> 
> This combination appears to result in both elevated scheduling priority and
> greater quantity of participant worker threads effectively starving any
> normal priority user task under periods of heavy IO on dmcrypt volumes.
> 
> I don't know what the right solution is here.  It seems to me we're lacking
> the appropriate mechanism for charging CPU resources consumed on behalf of
> user processes in kworker threads to the work-causing process.
> 
> What effectively happens is my normal `git` user process is able to
> greatly amplify what share of CPU it takes from the system by generating IO
> on what happens to be a high-priority CPU-intensive storage volume.
> 
> It looks potentially complicated to fix properly, but I suspect at its core
> this may be a fairly longstanding shortcoming of the page cache and its
> asynchronous design.  Something that has been exacerbated substantially by
> the introduction of CPU-intensive storage subsystems like dmcrypt.
> 
> If we imagine the whole stack simplified, where all the IO was being done
> synchronously in-band, and the dmcrypt kernel code simply ran in the
> IO-causing process context, it would be getting charged to the calling
> process and scheduled accordingly.  The resource accounting and scheduling
> problems all emerge with the page cache, buffered IO, and async background
> writeback in a pool of unrelated worker threads, etc.  That's how it
> appears to me anyways...
> 
> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
> The kernel .config is attached in case it's of interest.
> 
> Thanks,
> Vito Caputo



Ping...

Could somebody please at least ACK receiving this so I'm not left wondering
if my mails to lkml are somehow winding up flagged as spam, thanks!


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2017-12-01 21:33 ` vcaputo
@ 2017-12-18  9:25   ` Enric Balletbo Serra
  2018-01-17 22:48     ` vcaputo
  0 siblings, 1 reply; 11+ messages in thread
From: Enric Balletbo Serra @ 2017-12-18  9:25 UTC (permalink / raw)
  To: vcaputo; +Cc: linux-kernel, timmurray, tj

Hi Vito,

2017-12-01 22:33 GMT+01:00  <vcaputo@pengaru.com>:
> On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
>> Hello,
>>
>> Recently I noticed substantial audio dropouts when listening to MP3s in
>> `cmus` while doing big and churny `git checkout` commands in my linux git
>> tree.
>>
>> It's not something I've done much of over the last couple months so I
>> hadn't noticed until yesterday, but didn't remember this being a problem in
>> recent history.
>>
>> As there's quite an accumulation of similarly configured and built kernels
>> in my grub menu, it was trivial to determine approximately when this began:
>>
>> 4.11.0: no dropouts
>> 4.12.0-rc7: dropouts
>> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
>>
>> Watching top while this is going on in the various kernel versions, it's
>> apparent that the kworker behavior changed.  Both the priority and quantity
>> of running kworker threads is elevated in kernels experiencing dropouts.
>>
>> Searching through the commit history for v4.11..v4.12 uncovered:
>>
>> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
>> Author: Tim Murray <timmurray@google.com>
>> Date:   Fri Apr 21 11:11:36 2017 +0200
>>
>>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
>>
>>     Running dm-crypt with workqueues at the standard priority results in IO
>>     competing for CPU time with standard user apps, which can lead to
>>     pipeline bubbles and seriously degraded performance.  Move to using
>>     WQ_HIGHPRI workqueues to protect against that.
>>
>>     Signed-off-by: Tim Murray <timmurray@google.com>
>>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
>>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>
>> ---
>>
>> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
>> problem completely.
>>
>> Looking at the diff in that commit, it looks like the commit message isn't
>> even accurate; not only is the priority of the dmcrypt workqueues being
>> changed - they're also being made "CPU intensive" workqueues as well.
>>
>> This combination appears to result in both elevated scheduling priority and
>> greater quantity of participant worker threads effectively starving any
>> normal priority user task under periods of heavy IO on dmcrypt volumes.
>>
>> I don't know what the right solution is here.  It seems to me we're lacking
>> the appropriate mechanism for charging CPU resources consumed on behalf of
>> user processes in kworker threads to the work-causing process.
>>
>> What effectively happens is my normal `git` user process is able to
>> greatly amplify what share of CPU it takes from the system by generating IO
>> on what happens to be a high-priority CPU-intensive storage volume.
>>
>> It looks potentially complicated to fix properly, but I suspect at its core
>> this may be a fairly longstanding shortcoming of the page cache and its
>> asynchronous design.  Something that has been exacerbated substantially by
>> the introduction of CPU-intensive storage subsystems like dmcrypt.
>>
>> If we imagine the whole stack simplified, where all the IO was being done
>> synchronously in-band, and the dmcrypt kernel code simply ran in the
>> IO-causing process context, it would be getting charged to the calling
>> process and scheduled accordingly.  The resource accounting and scheduling
>> problems all emerge with the page cache, buffered IO, and async background
>> writeback in a pool of unrelated worker threads, etc.  That's how it
>> appears to me anyways...
>>
>> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
>> The kernel .config is attached in case it's of interest.
>>
>> Thanks,
>> Vito Caputo
>
>
>
> Ping...
>
> Could somebody please at least ACK receiving this so I'm not left wondering
> if my mails to lkml are somehow winding up flagged as spam, thanks!

Sorry, I did not notice your email before you pinged me directly. That
issue is interesting, though we did not notice the problem ourselves.
It has been a while since I tested this patch, but I'll set up the
environment again and do more tests to better understand what is
happening.

Thanks,
 Enric


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2017-12-18  9:25   ` Enric Balletbo Serra
@ 2018-01-17 22:48     ` vcaputo
  2018-01-19 10:57       ` Enric Balletbo Serra
  0 siblings, 1 reply; 11+ messages in thread
From: vcaputo @ 2018-01-17 22:48 UTC (permalink / raw)
  To: Enric Balletbo Serra; +Cc: linux-kernel, timmurray, tj

On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2017-12-01 22:33 GMT+01:00  <vcaputo@pengaru.com>:
> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
> >> Hello,
> >>
> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> `cmus` while doing big and churny `git checkout` commands in my linux git
> >> tree.
> >>
> >> It's not something I've done much of over the last couple months so I
> >> hadn't noticed until yesterday, but didn't remember this being a problem in
> >> recent history.
> >>
> >> As there's quite an accumulation of similarly configured and built kernels
> >> in my grub menu, it was trivial to determine approximately when this began:
> >>
> >> 4.11.0: no dropouts
> >> 4.12.0-rc7: dropouts
> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >>
> >> Watching top while this is going on in the various kernel versions, it's
> >> apparent that the kworker behavior changed.  Both the priority and quantity
> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >>
> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >>
> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> Author: Tim Murray <timmurray@google.com>
> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >>
> >>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >>
> >>     Running dm-crypt with workqueues at the standard priority results in IO
> >>     competing for CPU time with standard user apps, which can lead to
> >>     pipeline bubbles and seriously degraded performance.  Move to using
> >>     WQ_HIGHPRI workqueues to protect against that.
> >>
> >>     Signed-off-by: Tim Murray <timmurray@google.com>
> >>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
> >>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >>
> >> ---
> >>
> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> problem completely.
> >>
> >> Looking at the diff in that commit, it looks like the commit message isn't
> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> changed - they're also being made "CPU intensive" workqueues as well.
> >>
> >> This combination appears to result in both elevated scheduling priority and
> >> greater quantity of participant worker threads effectively starving any
> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >>
> >> I don't know what the right solution is here.  It seems to me we're lacking
> >> the appropriate mechanism for charging CPU resources consumed on behalf of
> >> user processes in kworker threads to the work-causing process.
> >>
> >> What effectively happens is my normal `git` user process is able to
> >> greatly amplify what share of CPU it takes from the system by generating IO
> >> on what happens to be a high-priority CPU-intensive storage volume.
> >>
> >> It looks potentially complicated to fix properly, but I suspect at its core
> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> asynchronous design.  Something that has been exacerbated substantially by
> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >>
> >> If we imagine the whole stack simplified, where all the IO was being done
> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> IO-causing process context, it would be getting charged to the calling
> >> process and scheduled accordingly.  The resource accounting and scheduling
> >> problems all emerge with the page cache, buffered IO, and async background
> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> appears to me anyways...
> >>
> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
> >> The kernel .config is attached in case it's of interest.
> >>
> >> Thanks,
> >> Vito Caputo
> >
> >
> >
> > Ping...
> >
> > Could somebody please at least ACK receiving this so I'm not left wondering
> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> 
> Sorry I did not notice your email before you ping me directly. It's
> interesting that issue, though we didn't notice this problem. It's a
> bit far since I tested this patch but I'll setup the environment again
> and do more tests to understand better what is happening.
> 

Any update on this?

I still experience it on 4.15-rc7 when doing sustained heavyweight git
checkouts without a1b8913 reverted.
 
Thanks,
Vito Caputo


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2018-01-17 22:48     ` vcaputo
@ 2018-01-19 10:57       ` Enric Balletbo Serra
  2018-01-25  6:45         ` vcaputo
                           ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Enric Balletbo Serra @ 2018-01-19 10:57 UTC (permalink / raw)
  To: vcaputo; +Cc: linux-kernel, Tim Murray, tj

Hi Vito,

2018-01-17 23:48 GMT+01:00  <vcaputo@pengaru.com>:
> On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
>> Hi Vito,
>>
>> 2017-12-01 22:33 GMT+01:00  <vcaputo@pengaru.com>:
>> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
>> >> Hello,
>> >>
>> >> Recently I noticed substantial audio dropouts when listening to MP3s in
>> >> `cmus` while doing big and churny `git checkout` commands in my linux git
>> >> tree.
>> >>
>> >> It's not something I've done much of over the last couple months so I
>> >> hadn't noticed until yesterday, but didn't remember this being a problem in
>> >> recent history.
>> >>
>> >> As there's quite an accumulation of similarly configured and built kernels
>> >> in my grub menu, it was trivial to determine approximately when this began:
>> >>
>> >> 4.11.0: no dropouts
>> >> 4.12.0-rc7: dropouts
>> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
>> >>
>> >> Watching top while this is going on in the various kernel versions, it's
>> >> apparent that the kworker behavior changed.  Both the priority and quantity
>> >> of running kworker threads is elevated in kernels experiencing dropouts.
>> >>
>> >> Searching through the commit history for v4.11..v4.12 uncovered:
>> >>
>> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
>> >> Author: Tim Murray <timmurray@google.com>
>> >> Date:   Fri Apr 21 11:11:36 2017 +0200
>> >>
>> >>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
>> >>
>> >>     Running dm-crypt with workqueues at the standard priority results in IO
>> >>     competing for CPU time with standard user apps, which can lead to
>> >>     pipeline bubbles and seriously degraded performance.  Move to using
>> >>     WQ_HIGHPRI workqueues to protect against that.
>> >>
>> >>     Signed-off-by: Tim Murray <timmurray@google.com>
>> >>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
>> >>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>> >>
>> >> ---
>> >>
>> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
>> >> problem completely.
>> >>
>> >> Looking at the diff in that commit, it looks like the commit message isn't
>> >> even accurate; not only is the priority of the dmcrypt workqueues being
>> >> changed - they're also being made "CPU intensive" workqueues as well.
>> >>
>> >> This combination appears to result in both elevated scheduling priority and
>> >> greater quantity of participant worker threads effectively starving any
>> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
>> >>
>> >> I don't know what the right solution is here.  It seems to me we're lacking
>> >> the appropriate mechanism for charging CPU resources consumed on behalf of
>> >> user processes in kworker threads to the work-causing process.
>> >>
>> >> What effectively happens is my normal `git` user process is able to
>> >> greatly amplify what share of CPU it takes from the system by generating IO
>> >> on what happens to be a high-priority CPU-intensive storage volume.
>> >>
>> >> It looks potentially complicated to fix properly, but I suspect at its core
>> >> this may be a fairly longstanding shortcoming of the page cache and its
>> >> asynchronous design.  Something that has been exacerbated substantially by
>> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
>> >>
>> >> If we imagine the whole stack simplified, where all the IO was being done
>> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
>> >> IO-causing process context, it would be getting charged to the calling
>> >> process and scheduled accordingly.  The resource accounting and scheduling
>> >> problems all emerge with the page cache, buffered IO, and async background
>> >> writeback in a pool of unrelated worker threads, etc.  That's how it
>> >> appears to me anyways...
>> >>
>> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
>> >> The kernel .config is attached in case it's of interest.
>> >>
>> >> Thanks,
>> >> Vito Caputo
>> >
>> >
>> >
>> > Ping...
>> >
>> > Could somebody please at least ACK receiving this so I'm not left wondering
>> > if my mails to lkml are somehow winding up flagged as spam, thanks!
>>
>> Sorry I did not notice your email before you ping me directly. It's
>> interesting that issue, though we didn't notice this problem. It's a
>> bit far since I tested this patch but I'll setup the environment again
>> and do more tests to understand better what is happening.
>>
>
> Any update on this?
>

I have not been able to reproduce the issue so far. Can you try what
happens if you remove WQ_CPU_INTENSIVE from the kcryptd_io workqueue?

- cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
+ cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);

> I still experience it on 4.15-rc7 when doing sustained heavyweight git
> checkouts without a1b8913 reverted.
>
> Thanks,
> Vito Caputo


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2018-01-19 10:57       ` Enric Balletbo Serra
@ 2018-01-25  6:45         ` vcaputo
  2018-01-25  7:49         ` vcaputo
  2018-01-25  8:33         ` vcaputo
  2 siblings, 0 replies; 11+ messages in thread
From: vcaputo @ 2018-01-25  6:45 UTC (permalink / raw)
  To: Enric Balletbo Serra; +Cc: linux-kernel, Tim Murray, tj

On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2018-01-17 23:48 GMT+01:00  <vcaputo@pengaru.com>:
> > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> Hi Vito,
> >>
> >> 2017-12-01 22:33 GMT+01:00  <vcaputo@pengaru.com>:
> >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
> >> >> Hello,
> >> >>
> >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> >> `cmus` while doing big and churny `git checkout` commands in my linux git
> >> >> tree.
> >> >>
> >> >> It's not something I've done much of over the last couple months so I
> >> >> hadn't noticed until yesterday, but didn't remember this being a problem in
> >> >> recent history.
> >> >>
> >> >> As there's quite an accumulation of similarly configured and built kernels
> >> >> in my grub menu, it was trivial to determine approximately when this began:
> >> >>
> >> >> 4.11.0: no dropouts
> >> >> 4.12.0-rc7: dropouts
> >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> >>
> >> >> Watching top while this is going on in the various kernel versions, it's
> >> >> apparent that the kworker behavior changed.  Both the priority and quantity
> >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> >>
> >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> >>
> >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> >> Author: Tim Murray <timmurray@google.com>
> >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> >>
> >> >>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> >>
> >> >>     Running dm-crypt with workqueues at the standard priority results in IO
> >> >>     competing for CPU time with standard user apps, which can lead to
> >> >>     pipeline bubbles and seriously degraded performance.  Move to using
> >> >>     WQ_HIGHPRI workqueues to protect against that.
> >> >>
> >> >>     Signed-off-by: Tim Murray <timmurray@google.com>
> >> >>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
> >> >>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >> >>
> >> >> ---
> >> >>
> >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> >> problem completely.
> >> >>
> >> >> Looking at the diff in that commit, it looks like the commit message isn't
> >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> >>
> >> >> This combination appears to result in both elevated scheduling priority and
> >> >> greater quantity of participant worker threads effectively starving any
> >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> >>
> >> >> I don't know what the right solution is here.  It seems to me we're lacking
> >> >> the appropriate mechanism for charging CPU resources consumed on behalf of
> >> >> user processes in kworker threads to the work-causing process.
> >> >>
> >> >> What effectively happens is my normal `git` user process is able to
> >> >> greatly amplify what share of CPU it takes from the system by generating IO
> >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> >>
> >> >> It looks potentially complicated to fix properly, but I suspect at its core
> >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> >> asynchronous design.  Something that has been exacerbated substantially by
> >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> >>
> >> >> If we imagine the whole stack simplified, where all the IO was being done
> >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> >> IO-causing process context, it would be getting charged to the calling
> >> >> process and scheduled accordingly.  The resource accounting and scheduling
> >> >> problems all emerge with the page cache, buffered IO, and async background
> >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> >> appears to me anyways...
> >> >>
> >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
> >> >> The kernel .config is attached in case it's of interest.
> >> >>
> >> >> Thanks,
> >> >> Vito Caputo
> >> >
> >> >
> >> >
> >> > Ping...
> >> >
> >> > Could somebody please at least ACK receiving this so I'm not left wondering
> >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >>
> >> Sorry I did not notice your email before you ping me directly. It's
> >> interesting that issue, though we didn't notice this problem. It's a
> >> bit far since I tested this patch but I'll setup the environment again
> >> and do more tests to understand better what is happening.
> >>
> >
> > Any update on this?
> >
> 
> I did not reproduce the issue for now. Can you try what happens if you
> remove the WQ_CPU_INTENSIVE in the kcryptd_io workqueue?
> 
> - cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
> + cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
> 

No, this doesn't appear to fix the problem.

I'm surprised this isn't trivial to reproduce.  You just need a small
enough machine that the music player and dmcrypt threads are
substantially contending for CPU.

Thanks,
Vito Caputo


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2018-01-19 10:57       ` Enric Balletbo Serra
  2018-01-25  6:45         ` vcaputo
@ 2018-01-25  7:49         ` vcaputo
  2018-01-25  8:33         ` vcaputo
  2 siblings, 0 replies; 11+ messages in thread
From: vcaputo @ 2018-01-25  7:49 UTC (permalink / raw)
  To: Enric Balletbo Serra; +Cc: linux-kernel, Tim Murray, tj

On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2018-01-17 23:48 GMT+01:00  <vcaputo@pengaru.com>:
> > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> Hi Vito,
> >>
> >> 2017-12-01 22:33 GMT+01:00  <vcaputo@pengaru.com>:
> >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
> >> >> Hello,
> >> >>
> >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> >> `cmus` while doing big and churny `git checkout` commands in my linux git
> >> >> tree.
> >> >>
> >> >> It's not something I've done much of over the last couple months so I
> >> >> hadn't noticed until yesterday, but didn't remember this being a problem in
> >> >> recent history.
> >> >>
> >> >> As there's quite an accumulation of similarly configured and built kernels
> >> >> in my grub menu, it was trivial to determine approximately when this began:
> >> >>
> >> >> 4.11.0: no dropouts
> >> >> 4.12.0-rc7: dropouts
> >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> >>
> >> >> Watching top while this is going on in the various kernel versions, it's
> >> >> apparent that the kworker behavior changed.  Both the priority and quantity
> >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> >>
> >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> >>
> >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> >> Author: Tim Murray <timmurray@google.com>
> >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> >>
> >> >>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> >>
> >> >>     Running dm-crypt with workqueues at the standard priority results in IO
> >> >>     competing for CPU time with standard user apps, which can lead to
> >> >>     pipeline bubbles and seriously degraded performance.  Move to using
> >> >>     WQ_HIGHPRI workqueues to protect against that.
> >> >>
> >> >>     Signed-off-by: Tim Murray <timmurray@google.com>
> >> >>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
> >> >>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >> >>
> >> >> ---
> >> >>
> >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> >> problem completely.
> >> >>
> >> >> Looking at the diff in that commit, it looks like the commit message isn't
> >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> >>
> >> >> This combination appears to result in both elevated scheduling priority and
> >> >> greater quantity of participant worker threads effectively starving any
> >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> >>
> >> >> I don't know what the right solution is here.  It seems to me we're lacking
> >> >> the appropriate mechanism for charging CPU resources consumed on behalf of
> >> >> user processes in kworker threads to the work-causing process.
> >> >>
> >> >> What effectively happens is my normal `git` user process is able to
> >> >> greatly amplify what share of CPU it takes from the system by generating IO
> >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> >>
> >> >> It looks potentially complicated to fix properly, but I suspect at its core
> >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> >> asynchronous design.  Something that has been exacerbated substantially by
> >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> >>
> >> >> If we imagine the whole stack simplified, where all the IO was being done
> >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> >> IO-causing process context, it would be getting charged to the calling
> >> >> process and scheduled accordingly.  The resource accounting and scheduling
> >> >> problems all emerge with the page cache, buffered IO, and async background
> >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> >> appears to me anyways...
> >> >>
> >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
> >> >> The kernel .config is attached in case it's of interest.
> >> >>
> >> >> Thanks,
> >> >> Vito Caputo
> >> >
> >> >
> >> >
> >> > Ping...
> >> >
> >> > Could somebody please at least ACK receiving this so I'm not left wondering
> >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >>
> >> Sorry I did not notice your email before you ping me directly. It's
> >> interesting that issue, though we didn't notice this problem. It's a
> >> bit far since I tested this patch but I'll setup the environment again
> >> and do more tests to understand better what is happening.
> >>
> >
> > Any update on this?
> >
> 
> I did not reproduce the issue for now. Can you try what happens if you
> remove the WQ_CPU_INTENSIVE in the kcryptd_io workqueue?
> 
> - cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
> + cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
> 

FYI I also tried just removing WQ_HIGHPRI while retaining
WQ_CPU_INTENSIVE, with equally bad results.
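
(That test was, going by memory, just dropping WQ_HIGHPRI from the
kcryptd_io allocation and leaving kcryptd alone, i.e. roughly:

    cc->io_queue = alloc_workqueue("kcryptd_io", WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);

with the kcryptd crypt_queue still WQ_HIGHPRI | WQ_CPU_INTENSIVE.)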

So far just reverting a1b8913 has been the best solution.

I haven't studied the dmcrypt code; is there a reason to observe the
effects of these changes on both of the workqueues touched by a1b8913,
rather than just kcryptd_io?

Regards,
Vito Caputo


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2018-01-19 10:57       ` Enric Balletbo Serra
  2018-01-25  6:45         ` vcaputo
  2018-01-25  7:49         ` vcaputo
@ 2018-01-25  8:33         ` vcaputo
  2018-05-28  3:32           ` Vito Caputo
  2 siblings, 1 reply; 11+ messages in thread
From: vcaputo @ 2018-01-25  8:33 UTC (permalink / raw)
  To: Enric Balletbo Serra; +Cc: linux-kernel, Tim Murray, tj

On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2018-01-17 23:48 GMT+01:00  <vcaputo@pengaru.com>:
> > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> Hi Vito,
> >>
> >> 2017-12-01 22:33 GMT+01:00  <vcaputo@pengaru.com>:
> >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
> >> >> Hello,
> >> >>
> >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> >> `cmus` while doing big and churny `git checkout` commands in my linux git
> >> >> tree.
> >> >>
> >> >> It's not something I've done much of over the last couple months so I
> >> >> hadn't noticed until yesterday, but didn't remember this being a problem in
> >> >> recent history.
> >> >>
> >> >> As there's quite an accumulation of similarly configured and built kernels
> >> >> in my grub menu, it was trivial to determine approximately when this began:
> >> >>
> >> >> 4.11.0: no dropouts
> >> >> 4.12.0-rc7: dropouts
> >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> >>
> >> >> Watching top while this is going on in the various kernel versions, it's
> >> >> apparent that the kworker behavior changed.  Both the priority and quantity
> >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> >>
> >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> >>
> >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> >> Author: Tim Murray <timmurray@google.com>
> >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> >>
> >> >>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> >>
> >> >>     Running dm-crypt with workqueues at the standard priority results in IO
> >> >>     competing for CPU time with standard user apps, which can lead to
> >> >>     pipeline bubbles and seriously degraded performance.  Move to using
> >> >>     WQ_HIGHPRI workqueues to protect against that.
> >> >>
> >> >>     Signed-off-by: Tim Murray <timmurray@google.com>
> >> >>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
> >> >>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >> >>
> >> >> ---
> >> >>
> >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> >> problem completely.
> >> >>
> >> >> Looking at the diff in that commit, it looks like the commit message isn't
> >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> >>
> >> >> This combination appears to result in both elevated scheduling priority and
> >> >> greater quantity of participant worker threads effectively starving any
> >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> >>
> >> >> I don't know what the right solution is here.  It seems to me we're lacking
> >> >> the appropriate mechanism for charging CPU resources consumed on behalf of
> >> >> user processes in kworker threads to the work-causing process.
> >> >>
> >> >> What effectively happens is my normal `git` user process is able to
> >> >> greatly amplify what share of CPU it takes from the system by generating IO
> >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> >>
> >> >> It looks potentially complicated to fix properly, but I suspect at its core
> >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> >> asynchronous design.  Something that has been exacerbated substantially by
> >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> >>
> >> >> If we imagine the whole stack simplified, where all the IO was being done
> >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> >> IO-causing process context, it would be getting charged to the calling
> >> >> process and scheduled accordingly.  The resource accounting and scheduling
> >> >> problems all emerge with the page cache, buffered IO, and async background
> >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> >> appears to me anyways...
> >> >>
> >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
> >> >> The kernel .config is attached in case it's of interest.
> >> >>
> >> >> Thanks,
> >> >> Vito Caputo
> >> >
> >> >
> >> >
> >> > Ping...
> >> >
> >> > Could somebody please at least ACK receiving this so I'm not left wondering
> >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >>
> >> Sorry I did not notice your email before you ping me directly. It's
> >> interesting that issue, though we didn't notice this problem. It's a
> >> bit far since I tested this patch but I'll setup the environment again
> >> and do more tests to understand better what is happening.
> >>
> >
> > Any update on this?
> >
> 
> I did not reproduce the issue for now. Can you try what happens if you
> remove the WQ_CPU_INTENSIVE in the kcryptd_io workqueue?
> 
> - cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
> + cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
> 

FWIW if I change both "kcryptd" and "kcryptd_io" workqueues to just
WQ_CPU_INTENSIVE, removing WQ_HIGHPRI, the problem goes away.

Doing this to "kcryptd_io" alone, as mentioned in my previous email, was
ineffective.

Perhaps revert just the WQ_HIGHPRI bit from the dmcrypt workqueues?
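
That would leave them looking roughly like this (keeping WQ_MEM_RECLAIM
as-is, and ignoring the WQ_UNBOUND variant of kcryptd for the
non-same_cpu case):

    cc->io_queue = alloc_workqueue("kcryptd_io", WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
    cc->crypt_queue = alloc_workqueue("kcryptd", WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);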

Regards,
Vito Caputo


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2018-01-25  8:33         ` vcaputo
@ 2018-05-28  3:32           ` Vito Caputo
  2018-05-28 17:01             ` Enric Balletbo Serra
  0 siblings, 1 reply; 11+ messages in thread
From: Vito Caputo @ 2018-05-28  3:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: Enric Balletbo Serra, Tim Murray, tj

On Thu, Jan 25, 2018 at 12:33:21AM -0800, vcaputo@pengaru.com wrote:
> On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> > Hi Vito,
> > 
> > 2018-01-17 23:48 GMT+01:00  <vcaputo@pengaru.com>:
> > > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> > >> Hi Vito,
> > >>
> > >> 2017-12-01 22:33 GMT+01:00  <vcaputo@pengaru.com>:
> > >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
> > >> >> Hello,
> > >> >>
> > >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> > >> >> `cmus` while doing big and churny `git checkout` commands in my linux git
> > >> >> tree.
> > >> >>
> > >> >> It's not something I've done much of over the last couple months so I
> > >> >> hadn't noticed until yesterday, but didn't remember this being a problem in
> > >> >> recent history.
> > >> >>
> > >> >> As there's quite an accumulation of similarly configured and built kernels
> > >> >> in my grub menu, it was trivial to determine approximately when this began:
> > >> >>
> > >> >> 4.11.0: no dropouts
> > >> >> 4.12.0-rc7: dropouts
> > >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> > >> >>
> > >> >> Watching top while this is going on in the various kernel versions, it's
> > >> >> apparent that the kworker behavior changed.  Both the priority and quantity
> > >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> > >> >>
> > >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> > >> >>
> > >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> > >> >> Author: Tim Murray <timmurray@google.com>
> > >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> > >> >>
> > >> >>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> > >> >>
> > >> >>     Running dm-crypt with workqueues at the standard priority results in IO
> > >> >>     competing for CPU time with standard user apps, which can lead to
> > >> >>     pipeline bubbles and seriously degraded performance.  Move to using
> > >> >>     WQ_HIGHPRI workqueues to protect against that.
> > >> >>
> > >> >>     Signed-off-by: Tim Murray <timmurray@google.com>
> > >> >>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
> > >> >>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > >> >>
> > >> >> ---
> > >> >>
> > >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> > >> >> problem completely.
> > >> >>
> > >> >> Looking at the diff in that commit, it looks like the commit message isn't
> > >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> > >> >> changed - they're also being made "CPU intensive" workqueues as well.
> > >> >>
> > >> >> This combination appears to result in both elevated scheduling priority and
> > >> >> greater quantity of participant worker threads effectively starving any
> > >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> > >> >>
> > >> >> I don't know what the right solution is here.  It seems to me we're lacking
> > >> >> the appropriate mechanism for charging CPU resources consumed on behalf of
> > >> >> user processes in kworker threads to the work-causing process.
> > >> >>
> > >> >> What effectively happens is my normal `git` user process is able to
> > >> >> greatly amplify what share of CPU it takes from the system by generating IO
> > >> >> on what happens to be a high-priority CPU-intensive storage volume.
> > >> >>
> > >> >> It looks potentially complicated to fix properly, but I suspect at its core
> > >> >> this may be a fairly longstanding shortcoming of the page cache and its
> > >> >> asynchronous design.  Something that has been exacerbated substantially by
> > >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> > >> >>
> > >> >> If we imagine the whole stack simplified, where all the IO was being done
> > >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> > >> >> IO-causing process context, it would be getting charged to the calling
> > >> >> process and scheduled accordingly.  The resource accounting and scheduling
> > >> >> problems all emerge with the page cache, buffered IO, and async background
> > >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> > >> >> appears to me anyways...
> > >> >>
> > >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
> > >> >> The kernel .config is attached in case it's of interest.
> > >> >>
> > >> >> Thanks,
> > >> >> Vito Caputo
> > >> >
> > >> >
> > >> >
> > >> > Ping...
> > >> >
> > >> > Could somebody please at least ACK receiving this so I'm not left wondering
> > >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> > >>
> > >> Sorry I did not notice your email before you ping me directly. It's
> > >> interesting that issue, though we didn't notice this problem. It's a
> > >> bit far since I tested this patch but I'll setup the environment again
> > >> and do more tests to understand better what is happening.
> > >>
> > >
> > > Any update on this?
> > >
> > 
> > I did not reproduce the issue for now. Can you try what happens if you
> > remove the WQ_CPU_INTENSIVE in the kcryptd_io workqueue?
> > 
> > - cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
> > + cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
> > 
> 
> FWIW if I change both "kcryptd" and "kcryptd_io" workqueues to just
> WQ_CPU_INTENSIVE, removing WQ_HIGHPRIO, the problem goes away.
> 
> Doing this to "kcryptd_io" alone, as mentioned in my previous email, was
> ineffective.
> 
> Perhaps revert just the WQ_HIGHPRIO bit from the dmcrypt workqueues?
> 


Guys... this is still a problem in 4.17-rc6.

I don't understand why this is being ignored.  It's pathetic; my laptop
can't even do a git checkout of the linux tree while playing mp3s
without the music skipping.

Reverting a1b8913 completely eliminates the problem.  What gives?

Regards,
Vito Caputo


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2018-05-28  3:32           ` Vito Caputo
@ 2018-05-28 17:01             ` Enric Balletbo Serra
  2018-05-28 17:34               ` Vito Caputo
  0 siblings, 1 reply; 11+ messages in thread
From: Enric Balletbo Serra @ 2018-05-28 17:01 UTC (permalink / raw)
  To: Vito Caputo
  Cc: linux-kernel, Tim Murray, tj, dm-devel, Alasdair Kergon, Mike Snitzer

Hi Vito,

cc: dm-devel, Alasdair and Mike Snitzer

2018-05-28 5:32 GMT+02:00 Vito Caputo <vcaputo@pengaru.com>:
> On Thu, Jan 25, 2018 at 12:33:21AM -0800, vcaputo@pengaru.com wrote:
>> On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
>> > Hi Vito,
>> >
>> > 2018-01-17 23:48 GMT+01:00  <vcaputo@pengaru.com>:
>> > > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
>> > >> Hi Vito,
>> > >>
>> > >> 2017-12-01 22:33 GMT+01:00  <vcaputo@pengaru.com>:
>> > >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
>> > >> >> Hello,
>> > >> >>
>> > >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
>> > >> >> `cmus` while doing big and churny `git checkout` commands in my linux git
>> > >> >> tree.
>> > >> >>
>> > >> >> It's not something I've done much of over the last couple months so I
>> > >> >> hadn't noticed until yesterday, but didn't remember this being a problem in
>> > >> >> recent history.
>> > >> >>
>> > >> >> As there's quite an accumulation of similarly configured and built kernels
>> > >> >> in my grub menu, it was trivial to determine approximately when this began:
>> > >> >>
>> > >> >> 4.11.0: no dropouts
>> > >> >> 4.12.0-rc7: dropouts
>> > >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
>> > >> >>
>> > >> >> Watching top while this is going on in the various kernel versions, it's
>> > >> >> apparent that the kworker behavior changed.  Both the priority and quantity
>> > >> >> of running kworker threads is elevated in kernels experiencing dropouts.
>> > >> >>
>> > >> >> Searching through the commit history for v4.11..v4.12 uncovered:
>> > >> >>
>> > >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
>> > >> >> Author: Tim Murray <timmurray@google.com>
>> > >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
>> > >> >>
>> > >> >>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
>> > >> >>
>> > >> >>     Running dm-crypt with workqueues at the standard priority results in IO
>> > >> >>     competing for CPU time with standard user apps, which can lead to
>> > >> >>     pipeline bubbles and seriously degraded performance.  Move to using
>> > >> >>     WQ_HIGHPRI workqueues to protect against that.
>> > >> >>
>> > >> >>     Signed-off-by: Tim Murray <timmurray@google.com>
>> > >> >>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
>> > >> >>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>> > >> >>
>> > >> >> ---
>> > >> >>
>> > >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
>> > >> >> problem completely.
>> > >> >>
>> > >> >> Looking at the diff in that commit, it looks like the commit message isn't
>> > >> >> even accurate; not only is the priority of the dmcrypt workqueues being
>> > >> >> changed - they're also being made "CPU intensive" workqueues as well.
>> > >> >>
>> > >> >> This combination appears to result in both elevated scheduling priority and
>> > >> >> greater quantity of participant worker threads effectively starving any
>> > >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
>> > >> >>
>> > >> >> I don't know what the right solution is here.  It seems to me we're lacking
>> > >> >> the appropriate mechanism for charging CPU resources consumed on behalf of
>> > >> >> user processes in kworker threads to the work-causing process.
>> > >> >>
>> > >> >> What effectively happens is my normal `git` user process is able to
>> > >> >> greatly amplify what share of CPU it takes from the system by generating IO
>> > >> >> on what happens to be a high-priority CPU-intensive storage volume.
>> > >> >>
>> > >> >> It looks potentially complicated to fix properly, but I suspect at its core
>> > >> >> this may be a fairly longstanding shortcoming of the page cache and its
>> > >> >> asynchronous design.  Something that has been exacerbated substantially by
>> > >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
>> > >> >>
>> > >> >> If we imagine the whole stack simplified, where all the IO was being done
>> > >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
>> > >> >> IO-causing process context, it would be getting charged to the calling
>> > >> >> process and scheduled accordingly.  The resource accounting and scheduling
>> > >> >> problems all emerge with the page cache, buffered IO, and async background
>> > >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
>> > >> >> appears to me anyways...
>> > >> >>
>> > >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
>> > >> >> The kernel .config is attached in case it's of interest.
>> > >> >>
>> > >> >> Thanks,
>> > >> >> Vito Caputo
>> > >> >
>> > >> >
>> > >> >
>> > >> > Ping...
>> > >> >
>> > >> > Could somebody please at least ACK receiving this so I'm not left wondering
>> > >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
>> > >>
>> > >> Sorry I did not notice your email before you ping me directly. It's
>> > >> interesting that issue, though we didn't notice this problem. It's a
>> > >> bit far since I tested this patch but I'll setup the environment again
>> > >> and do more tests to understand better what is happening.
>> > >>
>> > >
>> > > Any update on this?
>> > >
>> >
>> > I did not reproduce the issue for now. Can you try what happens if you
>> > remove the WQ_CPU_INTENSIVE in the kcryptd_io workqueue?
>> >
>> > - cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
>> > + cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
>> >
>>
>> FWIW if I change both "kcryptd" and "kcryptd_io" workqueues to just
>> WQ_CPU_INTENSIVE, removing WQ_HIGHPRIO, the problem goes away.
>>
>> Doing this to "kcryptd_io" alone, as mentioned in my previous email, was
>> ineffective.
>>
>> Perhaps revert just the WQ_HIGHPRIO bit from the dmcrypt workqueues?
>>
>
>
> Guys... this is still a problem in 4.17-rc6.
>
> I don't understand why this is being ignored.  It's pathetic, my laptop
> can't even do a git checkout of the linux tree while playing mp3s
> without the music skipping.
>

Sorry, but it's easy to lose something on lkml, so I'm adding the
dm-devel ML and the maintainers.

> Reverting a1b8913 completely eliminates the problem.  What gives?
>

IIRC the patch has been there since 4.12, and I tried to reproduce the
issue on at least two devices, my laptop and a Chromebook Pixel 2,
without luck. Also, I am a bit surprised that nobody else has
complained; maybe I missed it, and *of course* this doesn't mean the
issue is not there.

So, has anyone else experienced the same issue?

Regards,
 Enric

> Regards,
> Vito Caputo


* Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns
  2018-05-28 17:01             ` Enric Balletbo Serra
@ 2018-05-28 17:34               ` Vito Caputo
  0 siblings, 0 replies; 11+ messages in thread
From: Vito Caputo @ 2018-05-28 17:34 UTC (permalink / raw)
  To: Enric Balletbo Serra
  Cc: linux-kernel, Tim Murray, tj, dm-devel, Alasdair Kergon, Mike Snitzer

On Mon, May 28, 2018 at 07:01:36PM +0200, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> cc: dm-devel, Alasdair and Mike Snitzer
> 
> 2018-05-28 5:32 GMT+02:00 Vito Caputo <vcaputo@pengaru.com>:
> > On Thu, Jan 25, 2018 at 12:33:21AM -0800, vcaputo@pengaru.com wrote:
> >> On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> >> > Hi Vito,
> >> >
> >> > 2018-01-17 23:48 GMT+01:00  <vcaputo@pengaru.com>:
> >> > > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> > >> Hi Vito,
> >> > >>
> >> > >> 2017-12-01 22:33 GMT+01:00  <vcaputo@pengaru.com>:
> >> > >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcaputo@pengaru.com wrote:
> >> > >> >> Hello,
> >> > >> >>
> >> > >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> > >> >> `cmus` while doing big and churny `git checkout` commands in my linux git
> >> > >> >> tree.
> >> > >> >>
> >> > >> >> It's not something I've done much of over the last couple months so I
> >> > >> >> hadn't noticed until yesterday, but didn't remember this being a problem in
> >> > >> >> recent history.
> >> > >> >>
> >> > >> >> As there's quite an accumulation of similarly configured and built kernels
> >> > >> >> in my grub menu, it was trivial to determine approximately when this began:
> >> > >> >>
> >> > >> >> 4.11.0: no dropouts
> >> > >> >> 4.12.0-rc7: dropouts
> >> > >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> > >> >>
> >> > >> >> Watching top while this is going on in the various kernel versions, it's
> >> > >> >> apparent that the kworker behavior changed.  Both the priority and quantity
> >> > >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> > >> >>
> >> > >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> > >> >>
> >> > >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> > >> >> Author: Tim Murray <timmurray@google.com>
> >> > >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> > >> >>
> >> > >> >>     dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> > >> >>
> >> > >> >>     Running dm-crypt with workqueues at the standard priority results in IO
> >> > >> >>     competing for CPU time with standard user apps, which can lead to
> >> > >> >>     pipeline bubbles and seriously degraded performance.  Move to using
> >> > >> >>     WQ_HIGHPRI workqueues to protect against that.
> >> > >> >>
> >> > >> >>     Signed-off-by: Tim Murray <timmurray@google.com>
> >> > >> >>     Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
> >> > >> >>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >> > >> >>
> >> > >> >> ---
> >> > >> >>
> >> > >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> > >> >> problem completely.
> >> > >> >>
> >> > >> >> Looking at the diff in that commit, it looks like the commit message isn't
> >> > >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> > >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> > >> >>
> >> > >> >> This combination appears to result in both elevated scheduling priority and
> >> > >> >> greater quantity of participant worker threads effectively starving any
> >> > >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> > >> >>
> >> > >> >> I don't know what the right solution is here.  It seems to me we're lacking
> >> > >> >> the appropriate mechanism for charging CPU resources consumed on behalf of
> >> > >> >> user processes in kworker threads to the work-causing process.
> >> > >> >>
> >> > >> >> What effectively happens is my normal `git` user process is able to
> >> > >> >> greatly amplify what share of CPU it takes from the system by generating IO
> >> > >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> > >> >>
> >> > >> >> It looks potentially complicated to fix properly, but I suspect at its core
> >> > >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> > >> >> asynchronous design.  Something that has been exacerbated substantially by
> >> > >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> > >> >>
> >> > >> >> If we imagine the whole stack simplified, where all the IO was being done
> >> > >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> > >> >> IO-causing process context, it would be getting charged to the calling
> >> > >> >> process and scheduled accordingly.  The resource accounting and scheduling
> >> > >> >> problems all emerge with the page cache, buffered IO, and async background
> >> > >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> > >> >> appears to me anyways...
> >> > >> >>
> >> > >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on dmcrypt.
> >> > >> >> The kernel .config is attached in case it's of interest.
> >> > >> >>
> >> > >> >> Thanks,
> >> > >> >> Vito Caputo
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > Ping...
> >> > >> >
> >> > >> > Could somebody please at least ACK receiving this so I'm not left wondering
> >> > >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >> > >>
> >> > >> Sorry I did not notice your email before you ping me directly. It's
> >> > >> interesting that issue, though we didn't notice this problem. It's a
> >> > >> bit far since I tested this patch but I'll setup the environment again
> >> > >> and do more tests to understand better what is happening.
> >> > >>
> >> > >
> >> > > Any update on this?
> >> > >
> >> >
> >> > I did not reproduce the issue for now. Can you try what happens if you
> >> > remove the WQ_CPU_INTENSIVE in the kcryptd_io workqueue?
> >> >
> >> > - cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
> >> > + cc->io_queue = alloc_workqueue("kcryptd_io", WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
> >> >
> >>
> >> FWIW if I change both "kcryptd" and "kcryptd_io" workqueues to just
> >> WQ_CPU_INTENSIVE, removing WQ_HIGHPRIO, the problem goes away.
> >>
> >> Doing this to "kcryptd_io" alone, as mentioned in my previous email, was
> >> ineffective.
> >>
> >> Perhaps revert just the WQ_HIGHPRIO bit from the dmcrypt workqueues?
> >>
> >
> >
> > Guys... this is still a problem in 4.17-rc6.
> >
> > I don't understand why this is being ignored.  It's pathetic, my laptop
> > can't even do a git checkout of the linux tree while playing mp3s
> > without the music skipping.
> >
> 
> Sorry, but it's easy to lost something on lkml, so adding the dm-devel
> ML and the maintainers.
> 
> > Reverting a1b8913 completely eliminates the problem.  What gives?
> >
> 
> IIRC the patch is there since 4.12 and I tried to reproduce the issue
> on at least two devices, my laptop and a Chromebook Pixel 2 without
> luck. Also, I am a bit surprised that nobody else has complained,
> maybe I missed it, and *of course*, this doesn't mean the issue is not
> there.
> 
> So, did anyone experience the same issue?
> 

FYI I've created https://bugzilla.kernel.org/show_bug.cgi?id=199857 to
track this issue more formally.

Thanks,
Vito Caputo


end of thread, other threads:[~2018-05-28 17:34 UTC | newest]

Thread overview: 11+ messages
2017-11-29 18:39 [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns vcaputo
2017-12-01 21:33 ` vcaputo
2017-12-18  9:25   ` Enric Balletbo Serra
2018-01-17 22:48     ` vcaputo
2018-01-19 10:57       ` Enric Balletbo Serra
2018-01-25  6:45         ` vcaputo
2018-01-25  7:49         ` vcaputo
2018-01-25  8:33         ` vcaputo
2018-05-28  3:32           ` Vito Caputo
2018-05-28 17:01             ` Enric Balletbo Serra
2018-05-28 17:34               ` Vito Caputo
