* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
       [not found] <bug-172981-27@https.bugzilla.kernel.org/>
@ 2016-09-27 18:10 ` Andrew Morton
  2016-09-28  2:03   ` Johannes Weiner
  2016-09-28  3:13   ` Doug Smythies
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2016-09-27 18:10 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: bugzilla-daemon, dsmythies, linux-mm


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 27 Sep 2016 17:57:08 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=172981
> 
>             Bug ID: 172981
>            Summary: [bisected] SLAB: extreme load averages and over 2000
>                     kworker threads
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 4.7+
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Slab Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: dsmythies@telus.net
>         Regression: No
> 
> Immediately after boot, extreme load averages and over 2000 kworker
> processes are observed on my main Linux test computer (basically an Ubuntu
> 16.04 server, no GUI). The worker threads appear to be idle, and do disappear
> after the nominal 5-minute timeout, depending on whatever else might run
> in the meantime. However, the number of threads can increase hugely again.
> The issue occurs readily on kernels compiled with SLAB.
> 
> For SLAB, kernel bisection gave:
> 801faf0db8947e01877920e848a4d338dd7a99e7
> "mm/slab: lockless decision to grow cache"
> 
> The following monitoring script was used for the below examples:
> 
> #!/bin/dash
> 
> while [ 1 ];
> do
>   echo $(uptime) ::: $(ps -A --no-headers | wc -l) ::: $(ps aux | grep kworker
> | grep -v u | grep -v H | wc -l)
>   sleep 10.0
> done
> 
> Example (SLAB):
> 
> After boot:
> 
> 22:26:21 up 1 min, 2 users, load average: 295.98, 85.67, 29.47 ::: 2240 :::
> 2074
> 22:26:31 up 1 min, 2 users, load average: 250.47, 82.85, 29.15 ::: 2240 :::
> 2074
> 22:26:41 up 1 min, 2 users, load average: 211.96, 80.12, 28.84 ::: 2240 :::
> 2074
> ...
> 22:52:34 up 27 min, 3 users, load average: 0.00, 0.43, 5.40 ::: 165 ::: 17
> 22:52:44 up 27 min, 3 users, load average: 0.00, 0.42, 5.34 ::: 165 ::: 17
> 
> Now type: sudo echo "bla":
> 
> 22:53:14 up 27 min, 3 users, load average: 0.00, 0.38, 5.17 ::: 493 ::: 345
> 22:53:24 up 28 min, 3 users, load average: 0.00, 0.36, 5.11 ::: 493 ::: 345
> 
> Caused 328 new kworker threads.
> Now queue just a few (8 in this case) very simple jobs.
> 
> 22:55:45 up 30 min, 3 users, load average: 0.11, 0.27, 4.38 ::: 493 ::: 345
> 22:55:55 up 30 min, 3 users, load average: 0.09, 0.26, 4.34 ::: 2207 ::: 2059
> 22:56:05 up 30 min, 3 users, load average: 0.08, 0.25, 4.29 ::: 2207 ::: 2059
> 
> If I look at linux/Documentation/workqueue.txt and do:
> 
> echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
> 
> and:
> 
> cat /sys/kernel/debug/tracing/trace_pipe > out.txt
> 
> I get somewhere between 10,000 and 20,000 occurrences of
> memcg_kmem_cache_create_func in the file (using my simple test method).
> 
> Also tested with kernel 4.8-rc7.
> 
> -- 
> You are receiving this mail because:
> You are the assignee for the bug.

* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-27 18:10 ` [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads Andrew Morton
@ 2016-09-28  2:03   ` Johannes Weiner
  2016-09-28  8:09     ` Vladimir Davydov
  2016-09-28  3:13   ` Doug Smythies
  1 sibling, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2016-09-28  2:03 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov
  Cc: Joonsoo Kim, bugzilla-daemon, dsmythies, linux-mm

[CC Vladimir]

These are the delayed memcg cache allocations, where in a fresh memcg
that doesn't have per-memcg caches yet, every accounted allocation
schedules a kmalloc work item in __memcg_schedule_kmem_cache_create()
until the cache is finally available. It looks like those can be many
more than the number of slab caches in existence, if there is a storm
of slab allocations before the workers get a chance to run.

Vladimir, what do you think of embedding the work item into the
memcg_cache_array? That way we make sure we have exactly one work per
cache and not an unbounded number of them. The downside of course is
that we'd have to keep these things around as long as the memcg is in
existence, but that's the only place I can think of that allows us to
serialize this.
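
Roughly, a minimal sketch of that idea -- illustrative only, the slot
struct and helper names below are made up:

	/* one entry of the per-memcg cache array */
	struct memcg_cache_slot {
		struct kmem_cache *cache;	/* NULL until creation completes */
		struct work_struct create_work;	/* embedded: at most one per cache */
		unsigned long create_pending;	/* bit 0: creation already queued */
	};

	static void memcg_maybe_schedule_create(struct memcg_cache_slot *slot)
	{
		/* only the first racing allocation queues the work */
		if (!test_and_set_bit(0, &slot->create_pending))
			schedule_work(&slot->create_work);
	}

With the work item embedded in the array entry, a storm of accounted
allocations can queue at most one creation work per missing cache,
instead of one per allocation.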

On Tue, Sep 27, 2016 at 11:10:59AM -0700, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Tue, 27 Sep 2016 17:57:08 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=172981
> > 
> >             Bug ID: 172981
> >            Summary: [bisected] SLAB: extreme load averages and over 2000
> >                     kworker threads
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 4.7+
> >           Hardware: All
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Slab Allocator
> >           Assignee: akpm@linux-foundation.org
> >           Reporter: dsmythies@telus.net
> >         Regression: No
> > 
> > Immediately after boot, extreme load averages and over 2000 kworker
> > processes are observed on my main Linux test computer (basically an Ubuntu
> > 16.04 server, no GUI). The worker threads appear to be idle, and do disappear
> > after the nominal 5-minute timeout, depending on whatever else might run
> > in the meantime. However, the number of threads can increase hugely again.
> > The issue occurs readily on kernels compiled with SLAB.
> > 
> > For SLAB, kernel bisection gave:
> > 801faf0db8947e01877920e848a4d338dd7a99e7
> > "mm/slab: lockless decision to grow cache"
> > 
> > The following monitoring script was used for the below examples:
> > 
> > #!/bin/dash
> > 
> > while [ 1 ];
> > do
> >   echo $(uptime) ::: $(ps -A --no-headers | wc -l) ::: $(ps aux | grep kworker
> > | grep -v u | grep -v H | wc -l)
> >   sleep 10.0
> > done
> > 
> > Example (SLAB):
> > 
> > After boot:
> > 
> > 22:26:21 up 1 min, 2 users, load average: 295.98, 85.67, 29.47 ::: 2240 :::
> > 2074
> > 22:26:31 up 1 min, 2 users, load average: 250.47, 82.85, 29.15 ::: 2240 :::
> > 2074
> > 22:26:41 up 1 min, 2 users, load average: 211.96, 80.12, 28.84 ::: 2240 :::
> > 2074
> > ...
> > 22:52:34 up 27 min, 3 users, load average: 0.00, 0.43, 5.40 ::: 165 ::: 17
> > 22:52:44 up 27 min, 3 users, load average: 0.00, 0.42, 5.34 ::: 165 ::: 17
> > 
> > Now type: sudo echo "bla":
> > 
> > 22:53:14 up 27 min, 3 users, load average: 0.00, 0.38, 5.17 ::: 493 ::: 345
> > 22:53:24 up 28 min, 3 users, load average: 0.00, 0.36, 5.11 ::: 493 ::: 345
> > 
> > Caused 328 new kworker threads.
> > Now queue just a few (8 in this case) very simple jobs.
> > 
> > 22:55:45 up 30 min, 3 users, load average: 0.11, 0.27, 4.38 ::: 493 ::: 345
> > 22:55:55 up 30 min, 3 users, load average: 0.09, 0.26, 4.34 ::: 2207 ::: 2059
> > 22:56:05 up 30 min, 3 users, load average: 0.08, 0.25, 4.29 ::: 2207 ::: 2059
> > 
> > If I look at linux/Documentation/workqueue.txt and do:
> > 
> > echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
> > 
> > and:
> > 
> > cat /sys/kernel/debug/tracing/trace_pipe > out.txt
> > 
> > I get somewhere between 10,000 and 20,000 occurrences of
> > memcg_kmem_cache_create_func in the file (using my simple test method).
> > 
> > Also tested with kernel 4.8-rc7.
> > 
> > -- 
> > You are receiving this mail because:
> > You are the assignee for the bug.

* RE: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-27 18:10 ` [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads Andrew Morton
  2016-09-28  2:03   ` Johannes Weiner
@ 2016-09-28  3:13   ` Doug Smythies
  2016-09-28  5:18     ` Joonsoo Kim
  1 sibling, 1 reply; 16+ messages in thread
From: Doug Smythies @ 2016-09-28  3:13 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton',
	'Vladimir Davydov'
  Cc: 'Joonsoo Kim', bugzilla-daemon, linux-mm, Doug Smythies

By the way, I can eliminate the problem by doing this:
(see also: https://bugzilla.kernel.org/show_bug.cgi?id=172991)

diff --git a/mm/slab.c b/mm/slab.c
index b672710..a4edbfa 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -965,7 +965,7 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
         * freed after synchronize_sched().
         */
        if (force_change)
-               synchronize_sched();
+               kick_all_cpus_sync();

 fail:
        kfree(old_shared);


* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-28  3:13   ` Doug Smythies
@ 2016-09-28  5:18     ` Joonsoo Kim
  2016-09-28  6:20       ` Joonsoo Kim
  2016-09-28 15:22       ` Doug Smythies
  0 siblings, 2 replies; 16+ messages in thread
From: Joonsoo Kim @ 2016-09-28  5:18 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Johannes Weiner', 'Andrew Morton',
	'Vladimir Davydov',
	bugzilla-daemon, linux-mm

On Tue, Sep 27, 2016 at 08:13:58PM -0700, Doug Smythies wrote:
> By the way, I can eliminate the problem by doing this:
> (see also: https://bugzilla.kernel.org/show_bug.cgi?id=172991)

I think that Johannes found the root cause of the problem, and they
(Johannes and Vladimir) will solve it.

However, there is still something useful to do on the SLAB side.
Could you test the following patch, please?

Thanks.

---------->8--------------
diff --git a/mm/slab.c b/mm/slab.c
index 0eb6691..39e3bf2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -965,7 +965,7 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
         * guaranteed to be valid until irq is re-enabled, because it will be
         * freed after synchronize_sched().
         */
-       if (force_change)
+       if (n->shared && force_change)
                synchronize_sched();
 
 fail:

* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-28  5:18     ` Joonsoo Kim
@ 2016-09-28  6:20       ` Joonsoo Kim
  2016-09-28 15:22       ` Doug Smythies
  1 sibling, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2016-09-28  6:20 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Johannes Weiner', 'Andrew Morton',
	'Vladimir Davydov',
	bugzilla-daemon, linux-mm

On Wed, Sep 28, 2016 at 02:18:42PM +0900, Joonsoo Kim wrote:
> On Tue, Sep 27, 2016 at 08:13:58PM -0700, Doug Smythies wrote:
> > By the way, I can eliminate the problem by doing this:
> > (see also: https://bugzilla.kernel.org/show_bug.cgi?id=172991)
> 
> I think that Johannes found the root cause of the problem, and they
> (Johannes and Vladimir) will solve it.
>
> However, there is still something useful to do on the SLAB side.
> Could you test the following patch, please?
> 
> Thanks.
> 
> ---------->8--------------
> diff --git a/mm/slab.c b/mm/slab.c
> index 0eb6691..39e3bf2 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -965,7 +965,7 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
>          * guaranteed to be valid until irq is re-enabled, because it will be
>          * freed after synchronize_sched().
>          */
> -       if (force_change)
> +       if (n->shared && force_change)
>                 synchronize_sched();

Oops...

s/n->shared/old_shared/
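
That is, the intended hunk becomes:

-       if (force_change)
+       if (old_shared && force_change)
                synchronize_sched();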

Thanks.

* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-28  2:03   ` Johannes Weiner
@ 2016-09-28  8:09     ` Vladimir Davydov
  2016-09-29  2:00       ` Joonsoo Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Davydov @ 2016-09-28  8:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Joonsoo Kim, bugzilla-daemon, dsmythies, linux-mm

On Tue, Sep 27, 2016 at 10:03:47PM -0400, Johannes Weiner wrote:
> [CC Vladimir]
> 
> These are the delayed memcg cache allocations, where in a fresh memcg
> that doesn't have per-memcg caches yet, every accounted allocation
> schedules a kmalloc work item in __memcg_schedule_kmem_cache_create()
> until the cache is finally available. It looks like those can be many
> more than the number of slab caches in existence, if there is a storm
> of slab allocations before the workers get a chance to run.
> 
> Vladimir, what do you think of embedding the work item into the
> memcg_cache_array? That way we make sure we have exactly one work per
> cache and not an unbounded number of them. The downside of course is
> that we'd have to keep these things around as long as the memcg is in
> existence, but that's the only place I can think of that allows us to
> serialize this.

We could set the entry of the root_cache->memcg_params.memcg_caches
array corresponding to the cache being created to a special value, say
(void *)1, and skip scheduling the cache creation work on kmalloc if the
caller sees it. I'm not sure it's really worth it, though, because a
work_struct isn't so big (at least in comparison with the cache itself)
that embedding it needs to be avoided at all costs.
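
Something like this, as a rough sketch (slot lookup simplified; only
__memcg_schedule_kmem_cache_create() is a real name here):

	#define MEMCG_CACHE_CREATING	((struct kmem_cache *)1)

	/* allocation path */
	cachep = READ_ONCE(arr->entries[idx]);
	if (!cachep) {
		/* first racing caller claims the slot, schedules creation */
		if (!cmpxchg(&arr->entries[idx], NULL, MEMCG_CACHE_CREATING))
			__memcg_schedule_kmem_cache_create(memcg, root_cache);
		cachep = root_cache;	/* fall back until the cache exists */
	} else if (cachep == MEMCG_CACHE_CREATING) {
		cachep = root_cache;	/* creation already scheduled */
	}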

* RE: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-28  5:18     ` Joonsoo Kim
  2016-09-28  6:20       ` Joonsoo Kim
@ 2016-09-28 15:22       ` Doug Smythies
  2016-09-29  1:50         ` Joonsoo Kim
  1 sibling, 1 reply; 16+ messages in thread
From: Doug Smythies @ 2016-09-28 15:22 UTC (permalink / raw)
  To: 'Joonsoo Kim'
  Cc: 'Johannes Weiner', 'Andrew Morton',
	'Vladimir Davydov',
	bugzilla-daemon, linux-mm

On 2016.09.27 23:20 Joonsoo Kim wrote:
> On Wed, Sep 28, 2016 at 02:18:42PM +0900, Joonsoo Kim wrote:
>> On Tue, Sep 27, 2016 at 08:13:58PM -0700, Doug Smythies wrote:
>>> By the way, I can eliminate the problem by doing this:
>>> (see also: https://bugzilla.kernel.org/show_bug.cgi?id=172991)
>> 
>> I think that Johannes found the root cause of the problem, and they
>> (Johannes and Vladimir) will solve it.
>>
>> However, there is still something useful to do on the SLAB side.
>> Could you test the following patch, please?
>> 
>> Thanks.
>> 
>> ---------->8--------------
>> diff --git a/mm/slab.c b/mm/slab.c
>> index 0eb6691..39e3bf2 100644
>> --- a/mm/slab.c
>> +++ b/mm/slab.c
>> @@ -965,7 +965,7 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
>>          * guaranteed to be valid until irq is re-enabled, because it will be
>>          * freed after synchronize_sched().
>>          */
>> -       if (force_change)
>> +       if (n->shared && force_change)
>>                 synchronize_sched();
>
> Oops...
>
> s/n->shared/old_shared/

Yes, that seems to work fine. After boot everything is good.
Then I tried and tried to get it to mess up, but could not.


* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-28 15:22       ` Doug Smythies
@ 2016-09-29  1:50         ` Joonsoo Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2016-09-29  1:50 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Johannes Weiner', 'Andrew Morton',
	'Vladimir Davydov',
	bugzilla-daemon, linux-mm

On Wed, Sep 28, 2016 at 08:22:24AM -0700, Doug Smythies wrote:
> On 2016.09.27 23:20 Joonsoo Kim wrote:
> > On Wed, Sep 28, 2016 at 02:18:42PM +0900, Joonsoo Kim wrote:
> >> On Tue, Sep 27, 2016 at 08:13:58PM -0700, Doug Smythies wrote:
> >>> By the way, I can eliminate the problem by doing this:
> >>> (see also: https://bugzilla.kernel.org/show_bug.cgi?id=172991)
> >> 
> >> I think that Johannes found the root cause of the problem, and they
> >> (Johannes and Vladimir) will solve it.
> >>
> >> However, there is still something useful to do on the SLAB side.
> >> Could you test the following patch, please?
> >> 
> >> Thanks.
> >> 
> >> ---------->8--------------
> >> diff --git a/mm/slab.c b/mm/slab.c
> >> index 0eb6691..39e3bf2 100644
> >> --- a/mm/slab.c
> >> +++ b/mm/slab.c
> >> @@ -965,7 +965,7 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
> >>          * guaranteed to be valid until irq is re-enabled, because it will be
> >>          * freed after synchronize_sched().
> >>          */
> >> -       if (force_change)
> >> +       if (n->shared && force_change)
> >>                 synchronize_sched();
> >
> > Oops...
> >
> > s/n->shared/old_shared/
> 
> Yes, that seems to work fine. After boot everything is good.
> Then I tried and tried to get it to mess up, but could not.

Thanks for confirming.
I will send a formal patch soon.

Thanks.

* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-28  8:09     ` Vladimir Davydov
@ 2016-09-29  2:00       ` Joonsoo Kim
  2016-09-29 13:45         ` Vladimir Davydov
  0 siblings, 1 reply; 16+ messages in thread
From: Joonsoo Kim @ 2016-09-29  2:00 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Andrew Morton, bugzilla-daemon, dsmythies, linux-mm

On Wed, Sep 28, 2016 at 11:09:53AM +0300, Vladimir Davydov wrote:
> On Tue, Sep 27, 2016 at 10:03:47PM -0400, Johannes Weiner wrote:
> > [CC Vladimir]
> > 
> > These are the delayed memcg cache allocations, where in a fresh memcg
> > that doesn't have per-memcg caches yet, every accounted allocation
> > schedules a kmalloc work item in __memcg_schedule_kmem_cache_create()
> > until the cache is finally available. It looks like those can be many
> > more than the number of slab caches in existence, if there is a storm
> > of slab allocations before the workers get a chance to run.
> > 
> > Vladimir, what do you think of embedding the work item into the
> > memcg_cache_array? That way we make sure we have exactly one work per
> > cache and not an unbounded number of them. The downside of course is
> > that we'd have to keep these things around as long as the memcg is in
> > existence, but that's the only place I can think of that allows us to
> > serialize this.
> 
> We could set the entry of the root_cache->memcg_params.memcg_caches
> array corresponding to the cache being created to a special value, say
> (void *)1, and skip scheduling the cache creation work on kmalloc if the
> caller sees it. I'm not sure it's really worth it, though, because a
> work_struct isn't so big (at least in comparison with the cache itself)
> that embedding it needs to be avoided at all costs.

Hello, Johannes and Vladimir.

I'm not familiar with memcg, so I have a question about this solution.
It will solve the current issue, but if a burst of memcg creation
happens, a similar issue would arise again. Is my understanding correct?

I think that the other cause of the problem is that we call
synchronize_sched(), which is rather slow, while holding the slab_mutex,
and that blocks further kmem_cache creation. Should we fix that, too?

Thanks.

* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-29  2:00       ` Joonsoo Kim
@ 2016-09-29 13:45         ` Vladimir Davydov
  2016-09-30  8:19           ` Joonsoo Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Davydov @ 2016-09-29 13:45 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Johannes Weiner, Andrew Morton, bugzilla-daemon, dsmythies, linux-mm

On Thu, Sep 29, 2016 at 11:00:50AM +0900, Joonsoo Kim wrote:
> On Wed, Sep 28, 2016 at 11:09:53AM +0300, Vladimir Davydov wrote:
> > On Tue, Sep 27, 2016 at 10:03:47PM -0400, Johannes Weiner wrote:
> > > [CC Vladimir]
> > > 
> > > These are the delayed memcg cache allocations, where in a fresh memcg
> > > that doesn't have per-memcg caches yet, every accounted allocation
> > > schedules a kmalloc work item in __memcg_schedule_kmem_cache_create()
> > > until the cache is finally available. It looks like those can be many
> > > more than the number of slab caches in existence, if there is a storm
> > > of slab allocations before the workers get a chance to run.
> > > 
> > > Vladimir, what do you think of embedding the work item into the
> > > memcg_cache_array? That way we make sure we have exactly one work per
> > > cache and not an unbounded number of them. The downside of course is
> > > that we'd have to keep these things around as long as the memcg is in
> > > existence, but that's the only place I can think of that allows us to
> > > serialize this.
> > 
> > We could set the entry of the root_cache->memcg_params.memcg_caches
> > array corresponding to the cache being created to a special value, say
> > (void *)1, and skip scheduling the cache creation work on kmalloc if the
> > caller sees it. I'm not sure it's really worth it, though, because a
> > work_struct isn't so big (at least in comparison with the cache itself)
> > that embedding it needs to be avoided at all costs.
> 
> Hello, Johannes and Vladimir.
> 
> I'm not familiar with memcg, so I have a question about this solution.
> It will solve the current issue, but if a burst of memcg creation
> happens, a similar issue would arise again. Is my understanding correct?

Yes, I think you're right - embedding the work_struct responsible for
cache creation in the kmem_cache struct won't help if a thousand
different cgroups call kmem_cache_alloc() simultaneously for a cache
they haven't used yet.

Come to think of it, we could fix the issue by simply introducing a
special single-threaded workqueue used exclusively for cache creation
works - cache creation is done mostly under the slab_mutex, anyway. This
way, we wouldn't have to keep those used-once work_structs around for
the whole kmem_cache lifetime.
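
For instance (a sketch; the workqueue name and the init hook are
assumed):

	static struct workqueue_struct *memcg_cache_create_wq;

	static int __init memcg_cache_create_wq_init(void)
	{
		/* ordered: at most one cache-creation work runs at a time */
		memcg_cache_create_wq =
			alloc_ordered_workqueue("memcg_cache_create", 0);
		return memcg_cache_create_wq ? 0 : -ENOMEM;
	}

with __memcg_schedule_kmem_cache_create() doing
queue_work(memcg_cache_create_wq, ...) instead of schedule_work().
That bounds the number of kworkers regardless of how many creations
are pending.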

> 
> I think that the other cause of the problem is that we call
> synchronize_sched(), which is rather slow, while holding the slab_mutex,
> and that blocks further kmem_cache creation. Should we fix that, too?

Well, the patch you posted looks pretty obvious and it helps the
reporter, so personally I don't see any reason for not applying it.

* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-29 13:45         ` Vladimir Davydov
@ 2016-09-30  8:19           ` Joonsoo Kim
  2016-09-30 19:58             ` Vladimir Davydov
  2016-10-06  5:04             ` Doug Smythies
  0 siblings, 2 replies; 16+ messages in thread
From: Joonsoo Kim @ 2016-09-30  8:19 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Andrew Morton, bugzilla-daemon, dsmythies, linux-mm

On Thu, Sep 29, 2016 at 04:45:50PM +0300, Vladimir Davydov wrote:
> On Thu, Sep 29, 2016 at 11:00:50AM +0900, Joonsoo Kim wrote:
> > On Wed, Sep 28, 2016 at 11:09:53AM +0300, Vladimir Davydov wrote:
> > > On Tue, Sep 27, 2016 at 10:03:47PM -0400, Johannes Weiner wrote:
> > > > [CC Vladimir]
> > > > 
> > > > These are the delayed memcg cache allocations, where in a fresh memcg
> > > > that doesn't have per-memcg caches yet, every accounted allocation
> > > > schedules a kmalloc work item in __memcg_schedule_kmem_cache_create()
> > > > until the cache is finally available. It looks like those can be many
> > > > more than the number of slab caches in existence, if there is a storm
> > > > of slab allocations before the workers get a chance to run.
> > > > 
> > > > Vladimir, what do you think of embedding the work item into the
> > > > memcg_cache_array? That way we make sure we have exactly one work per
> > > > cache and not an unbounded number of them. The downside of course is
> > > > that we'd have to keep these things around as long as the memcg is in
> > > > existence, but that's the only place I can think of that allows us to
> > > > serialize this.
> > > 
> > > We could set the entry of the root_cache->memcg_params.memcg_caches
> > > array corresponding to the cache being created to a special value, say
> > > (void *)1, and skip scheduling the cache creation work on kmalloc if the
> > > caller sees it. I'm not sure it's really worth it, though, because a
> > > work_struct isn't so big (at least in comparison with the cache itself)
> > > that embedding it needs to be avoided at all costs.
> > 
> > Hello, Johannes and Vladimir.
> > 
> > I'm not familiar with memcg, so I have a question about this solution.
> > It will solve the current issue, but if a burst of memcg creation
> > happens, a similar issue would arise again. Is my understanding correct?
> 
> Yes, I think you're right - embedding the work_struct responsible for
> cache creation in the kmem_cache struct won't help if a thousand
> different cgroups call kmem_cache_alloc() simultaneously for a cache
> they haven't used yet.
>
> Come to think of it, we could fix the issue by simply introducing a
> special single-threaded workqueue used exclusively for cache creation
> works - cache creation is done mostly under the slab_mutex, anyway. This
> way, we wouldn't have to keep those used-once work_structs around for
> the whole kmem_cache lifetime.
> 
> > 
> > I think that the other cause of the problem is that we call
> > synchronize_sched(), which is rather slow, while holding the slab_mutex,
> > and that blocks further kmem_cache creation. Should we fix that, too?
> 
> Well, the patch you posted looks pretty obvious and it helps the
> reporter, so personally I don't see any reason for not applying it.

Oops... I forgot to mention why I asked that.

There is another report that a similar problem also happens in SLUB. There,
synchronize_sched() is called in the cache shrinking path while holding the
slab_mutex. I guess that it blocks further kmem_cache creation.

If we use a special single-threaded workqueue, the number of kworkers would
be limited, but kmem_cache creation would still be delayed for a long time in
a burst memcg creation/destroy scenario.

https://bugzilla.kernel.org/show_bug.cgi?id=172991

Do we need to remove synchronize_sched() in SLUB and find another
solution?

Thanks.

* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-30  8:19           ` Joonsoo Kim
@ 2016-09-30 19:58             ` Vladimir Davydov
  2016-10-06  5:04             ` Doug Smythies
  1 sibling, 0 replies; 16+ messages in thread
From: Vladimir Davydov @ 2016-09-30 19:58 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Johannes Weiner, Andrew Morton, bugzilla-daemon, dsmythies, linux-mm

On Fri, Sep 30, 2016 at 05:19:41PM +0900, Joonsoo Kim wrote:
> On Thu, Sep 29, 2016 at 04:45:50PM +0300, Vladimir Davydov wrote:
> > On Thu, Sep 29, 2016 at 11:00:50AM +0900, Joonsoo Kim wrote:
> > > On Wed, Sep 28, 2016 at 11:09:53AM +0300, Vladimir Davydov wrote:
> > > > On Tue, Sep 27, 2016 at 10:03:47PM -0400, Johannes Weiner wrote:
> > > > > [CC Vladimir]
> > > > > 
> > > > > These are the delayed memcg cache allocations, where in a fresh memcg
> > > > > that doesn't have per-memcg caches yet, every accounted allocation
> > > > > schedules a kmalloc work item in __memcg_schedule_kmem_cache_create()
> > > > > until the cache is finally available. It looks like those can be many
> > > > > more than the number of slab caches in existence, if there is a storm
> > > > > of slab allocations before the workers get a chance to run.
> > > > > 
> > > > > Vladimir, what do you think of embedding the work item into the
> > > > > memcg_cache_array? That way we make sure we have exactly one work per
> > > > > cache and not an unbounded number of them. The downside of course is
> > > > > that we'd have to keep these things around as long as the memcg is in
> > > > > existence, but that's the only place I can think of that allows us to
> > > > > serialize this.
> > > > 
> > > > We could set the entry of the root_cache->memcg_params.memcg_caches
> > > > array corresponding to the cache being created to a special value, say
> > > > (void *)1, and skip scheduling the cache creation work on kmalloc if the
> > > > caller sees it. I'm not sure it's really worth it, though, because a
> > > > work_struct isn't so big (at least in comparison with the cache itself)
> > > > that embedding it needs to be avoided at all costs.
> > > 
> > > Hello, Johannes and Vladimir.
> > > 
> > > I'm not familiar with memcg, so I have a question about this solution.
> > > It will solve the current issue, but if a burst of memcg creation
> > > happens, a similar issue would arise again. Is my understanding correct?
> > 
> > Yes, I think you're right - embedding the work_struct responsible for
> > cache creation in the kmem_cache struct won't help if a thousand
> > different cgroups call kmem_cache_alloc() simultaneously for a cache
> > they haven't used yet.
> >
> > Come to think of it, we could fix the issue by simply introducing a
> > special single-threaded workqueue used exclusively for cache creation
> > works - cache creation is done mostly under the slab_mutex, anyway. This
> > way, we wouldn't have to keep those used-once work_structs around for
> > the whole kmem_cache lifetime.
> > 
> > > 
> > > I think that the other cause of the problem is that we call
> > > synchronize_sched(), which is rather slow, while holding the slab_mutex,
> > > and that blocks further kmem_cache creation. Should we fix that, too?
> > 
> > Well, the patch you posted looks pretty obvious and it helps the
> > reporter, so personally I don't see any reason for not applying it.
> 
> Oops... I forgot to mention why I asked that.
> 
> There is another report that a similar problem also happens in SLUB. There,
> synchronize_sched() is called in the cache shrinking path while holding the
> slab_mutex. I guess that it blocks further kmem_cache creation.
>
> If we use a special single-threaded workqueue, the number of kworkers would
> be limited, but kmem_cache creation would still be delayed for a long time in
> a burst memcg creation/destroy scenario.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=172991
>
> Do we need to remove synchronize_sched() in SLUB and find another
> solution?

Yeah, you're right. We'd better do something about this
synchronize_sched(). I think moving it out of the slab_mutex and calling
it once for all caches in memcg_deactivate_kmem_caches() would resolve
the issue. I'll post the patches tomorrow.
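
Roughly like this (a sketch; the per-cache deactivation step is hidden
behind an assumed helper):

	void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
	{
		struct kmem_cache *s;

		/* one grace period for the whole batch, outside the mutex */
		synchronize_sched();

		mutex_lock(&slab_mutex);
		list_for_each_entry(s, &slab_caches, list)
			__kmem_cache_deactivate(s, memcg);	/* assumed helper */
		mutex_unlock(&slab_mutex);
	}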

* RE: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-09-30  8:19           ` Joonsoo Kim
  2016-09-30 19:58             ` Vladimir Davydov
@ 2016-10-06  5:04             ` Doug Smythies
  2016-10-06  6:35               ` Joonsoo Kim
                                 ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Doug Smythies @ 2016-10-06  5:04 UTC (permalink / raw)
  To: 'Vladimir Davydov', 'Joonsoo Kim'
  Cc: 'Johannes Weiner', 'Andrew Morton',
	bugzilla-daemon, linux-mm

On 2016.09.30 12:59 Vladimir Davydov wrote:

> Yeah, you're right. We'd better do something about this
> synchronize_sched(). I think moving it out of the slab_mutex and calling
> it once for all caches in memcg_deactivate_kmem_caches() would resolve
> the issue. I'll post the patches tomorrow.

Would someone please be kind enough to send me the patch set?

I didn't get them, and would like to test them.
I have searched and searched and did manage to find:
"[PATCH 2/2] slub: move synchronize_sched out of slab_mutex on shrink"
And a thread about patch 1 of 2:
"Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches"
There I am listed as "Reported-by", but I guess "Reported-by" people don't get the e-mails.
I haven't found PATCH 0/2, nor do I know whether what I did find is current.

... Doug


* Re: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-10-06  5:04             ` Doug Smythies
@ 2016-10-06  6:35               ` Joonsoo Kim
  2016-10-06 16:02               ` Doug Smythies
  2016-10-07 15:55               ` Doug Smythies
  2 siblings, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2016-10-06  6:35 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Vladimir Davydov', 'Johannes Weiner',
	'Andrew Morton',
	bugzilla-daemon, linux-mm

On Wed, Oct 05, 2016 at 10:04:27PM -0700, Doug Smythies wrote:
> On 2016.09.30 12:59 Vladimir Davydov wrote:
> 
> > Yeah, you're right. We'd better do something about this
> > synchronize_sched(). I think moving it out of the slab_mutex and calling
> > it once for all caches in memcg_deactivate_kmem_caches() would resolve
> > the issue. I'll post the patches tomorrow.
> 
> Would someone please be kind enough to send me the patch set?
> 
> I didn't get them, and would like to test them.
> I have searched and searched and did manage to find:
> "[PATCH 2/2] slub: move synchronize_sched out of slab_mutex on shrink"
> And a thread about patch 1 of 2:
> "Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches"
> There I am listed as "Reported-by", but I guess "Reported-by" people don't get the e-mails.
> I haven't found PATCH 0/2, nor do I know whether what I did find is current.

I think that what you found is the correct one. It has no cover letter,
so there is no [PATCH 0/2]. Anyway, to clarify, here are links to these
patches.

https://patchwork.kernel.org/patch/9361853
https://patchwork.kernel.org/patch/9359271

It would be very helpful if you could test these patches.

Thanks.

* RE: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-10-06  5:04             ` Doug Smythies
  2016-10-06  6:35               ` Joonsoo Kim
@ 2016-10-06 16:02               ` Doug Smythies
  2016-10-07 15:55               ` Doug Smythies
  2 siblings, 0 replies; 16+ messages in thread
From: Doug Smythies @ 2016-10-06 16:02 UTC (permalink / raw)
  To: 'Joonsoo Kim'
  Cc: 'Vladimir Davydov', 'Johannes Weiner',
	'Andrew Morton',
	bugzilla-daemon, linux-mm

On 2016.10.05 23:35 Joonsoo Kim wrote:
> On Wed, Oct 05, 2016 at 10:04:27PM -0700, Doug Smythies wrote:
>> On 2016.09.30 12:59 Vladimir Davydov wrote:
>> 
>>> Yeah, you're right. We'd better do something about this
>>> synchronize_sched(). I think moving it out of the slab_mutex and calling
>>> it once for all caches in memcg_deactivate_kmem_caches() would resolve
>>> the issue. I'll post the patches tomorrow.
>> 
>> Would someone please be kind enough to send me the patch set?
>> 
>> I didn't get them, and would like to test them.
>> I have searched and searched and did manage to find:
>> "[PATCH 2/2] slub: move synchronize_sched out of slab_mutex on shrink"
>> And a thread about patch 1 of 2:
>> "Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches"
>> There I am listed as "Reported-by", but I guess "Reported-by" people don't get the e-mails.
>> I haven't found PATCH 0/2, nor do I know whether what I did find is current.
>
> I think that what you found is the correct one. It has no cover letter,
> so there is no [PATCH 0/2]. Anyway, to clarify, here are links to these
> patches.
>
> https://patchwork.kernel.org/patch/9361853
> https://patchwork.kernel.org/patch/9359271
>
> It would be very helpful if you could test these patches.

Yes, as best as I am able to test, the two-patch set
resolves both this SLAB bug report and the other SLUB one.


* RE: [Bug 172981] New: [bisected] SLAB: extreme load averages and over 2000 kworker threads
  2016-10-06  5:04             ` Doug Smythies
  2016-10-06  6:35               ` Joonsoo Kim
  2016-10-06 16:02               ` Doug Smythies
@ 2016-10-07 15:55               ` Doug Smythies
  2 siblings, 0 replies; 16+ messages in thread
From: Doug Smythies @ 2016-10-07 15:55 UTC (permalink / raw)
  To: 'Joonsoo Kim'
  Cc: 'Vladimir Davydov', 'Johannes Weiner',
	'Andrew Morton',
	bugzilla-daemon, linux-mm, 'Doug Smythies'

On 2016.10.06 09:02 Doug Smythies wrote:
> On 2016.10.05 23:35 Joonsoo Kim wrote:
>> On Wed, Oct 05, 2016 at 10:04:27PM -0700, Doug Smythies wrote:
>>> On 2016.09.30 12:59 Vladimir Davydov wrote:
>>> 
>>>> Yeah, you're right. We'd better do something about this
>>>> synchronize_sched(). I think moving it out of the slab_mutex and calling
>>>> it once for all caches in memcg_deactivate_kmem_caches() would resolve
>>>> the issue. I'll post the patches tomorrow.
>>> 
>>> Would someone please be kind enough to send me the patch set?
>>> 
>>> I didn't get them, and would like to test them.
>>> I have searched and searched and did manage to find:
>>> "[PATCH 2/2] slub: move synchronize_sched out of slab_mutex on shrink"
>>> And a thread about patch 1 of 2:
>>> "Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches"
>>> There I am listed as "Reported-by", but I guess "Reported-by" people don't get the e-mails.
>>> I haven't found PATCH 0/2, nor do I know whether what I did find is current.
>>
>> I think that what you found is the correct one. It has no cover letter,
>> so there is no [PATCH 0/2]. Anyway, to clarify, here are links to these
>> patches.
>>
>> https://patchwork.kernel.org/patch/9361853
>> https://patchwork.kernel.org/patch/9359271
>>
>> It would be very helpful if you could test these patches.
>
> Yes, as best as I am able to test, the two-patch set
> resolves both this SLAB bug report and the other SLUB one.

I tested the patch from the other thread on top of these two,
and things continued to work fine. The additional patch
does seem a little faster under some of my hammering conditions.

Reference:
https://marc.info/?l=linux-kernel&m=147573486705407&w=2


