* [LSF/MM TOPIC ATTEND]
@ 2015-01-06 16:14 Michal Hocko
  2015-01-06 23:27 ` Greg Thelen
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Michal Hocko @ 2015-01-06 16:14 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm

Hi,
I would like to attend this year's (2015) LSF/MM conference. I am
particularly interested in the MM track. I would like to discuss (among
other topics already suggested) the following:
General MM topics:
- THP success rate has become one of the metrics for reclaim/compaction
  changes, but I feel it is missing one important aspect, and that is
  cost/benefit analysis. It might be better to have more THP pages in
  some loads, but the whole advantage can easily go away when the
  initial cost is higher than all the aggregated savings. When it comes
  to benchmarks and numbers we are usually missing the latter.
  This becomes even more of an issue with memcg when close to the limit.
  Does it make sense to do a heavy reclaim (with a THP-sized target) to
  fulfill THP allocations? If not, memcg acts against the global MM and
  ruins the effort; on the other hand, reclaiming 512 pages (one 2MB THP
  worth of 4kB base pages) can take quite some time.
  The memcg part could be worked around by either precharging THP pages
  or reclaiming only clean page cache pages, which would handle most
  use cases IMO, but it would be better to think about a !memcg
  solution. Do we really want to allocate THP pages unconditionally, or
  rather build them up if it seems worthwhile?

- As it turned out recently, GFP_KERNEL mimicking GFP_NOFAIL for !costly
  allocations sometimes comes back to bite us, because we are basically
  creating invisible lock dependencies which might livelock the whole
  system under OOM conditions (a minimal sketch of the pattern follows
  below).
  That leads to attempts to add more hacks into the OOM killer,
  which is tricky enough as it is. Changing the current state is
  quite risky because we do not really know how many places in the
  kernel silently depend on this behavior. As per Johannes' attempt
  (http://marc.info/?l=linux-mm&m=141932770811346) it is clear that
  we are not there yet! I do not have very good ideas how to deal with
  this, unfortunately...
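
  To make the "invisible lock dependency" concrete, here is a minimal,
  hypothetical sketch (the names are made up and not taken from any real
  driver) of the pattern that turns the implicit no-fail behavior into a
  potential OOM livelock:

#include <linux/mutex.h>
#include <linux/slab.h>

static DEFINE_MUTEX(obj_lock);

/*
 * The writer allocates while holding obj_lock and never checks for
 * failure, because a small GFP_KERNEL allocation "never fails" --
 * the allocator loops and invokes the OOM killer instead of
 * returning NULL.
 */
static int update_object(size_t size)
{
	void *buf;

	mutex_lock(&obj_lock);
	buf = kmalloc(size, GFP_KERNEL);	/* may loop forever under OOM */
	/* ... fill and publish buf, no error handling ... */
	kfree(buf);
	mutex_unlock(&obj_lock);
	return 0;
}

/*
 * If the OOM killer picks a victim that is currently blocked in
 * mutex_lock(&obj_lock), the victim cannot exit and release its memory
 * until update_object() finishes its allocation -- which in turn waits
 * for the victim's memory.  Nothing in the gfp mask or the lock graph
 * makes this dependency visible.
 */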

And as a memcg co-maintainer I would also like to discuss the following
topics.
- We should finally settle on a set of core knobs exported with
  the new unified hierarchy cgroups API. I have proposed this already
  http://marc.info/?l=linux-mm&m=140552160325228&w=2 but there is no
  clear consensus and the discussion died out later on. I feel it would
  be more productive to sit together and come up with a reasonable
  compromise between "let's start from the beginning" and "keep the
  useful and reasonable features".
  
- kmem accounting is seeing a lot of activity, mainly thanks to Vladimir.
  He is basically the only active developer in this area. I would be
  happy if he could attend as well and discuss his future plans in the
  area. The work overlaps with slab allocators and slab shrinkers, so
  having people familiar with these areas would be more than welcome.
-- 
Michal Hocko
SUSE Labs


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-06 16:14 [LSF/MM TOPIC ATTEND] Michal Hocko
@ 2015-01-06 23:27 ` Greg Thelen
  2015-01-07 14:28   ` Michal Hocko
  2015-01-07  8:58 ` Vladimir Davydov
  2015-02-02  8:37 ` [LSF/MM TOPIC ATTEND] - THP benefits Vlastimil Babka
  2 siblings, 1 reply; 13+ messages in thread
From: Greg Thelen @ 2015-01-06 23:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Tue, Jan 06 2015, Michal Hocko wrote:

> - As it turned out recently GFP_KERNEL mimicing GFP_NOFAIL for !costly
>   allocation is sometimes kicking us back because we are basically
>   creating an invisible lock dependencies which might livelock the whole
>   system under OOM conditions.
>   That leads to attempts to add more hacks into the OOM killer
>   which is tricky enough as is. Changing the current state is
>   quite risky because we do not really know how many places in the
>   kernel silently depend on this behavior. As per Johannes attempt
>   (http://marc.info/?l=linux-mm&m=141932770811346) it is clear that
>   we are not yet there! I do not have very good ideas how to deal with
>   this unfortunatelly...

We've internally been fighting similar deadlocks between memcg kmem
accounting and the memcg oom killer.  I wouldn't call it a very good
idea, because it falls in the realm of further complicating the oom
killer, but what about introducing an async oom killer which runs
outside of the context of the current task?  An async killer won't hold
any locks, so it won't block the intended oom victim from terminating.
After queuing a deferred oom kill, the allocating thread would then be
able to dip into memory reserves to satisfy its too-small-to-fail
allocation.  (A rough sketch of this shape is below.)
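
None of this exists upstream; purely as a sketch of the deferral
described above (the async_oom_* names are placeholders and the actual
victim selection/killing is elided), it could look roughly like this:

#include <linux/workqueue.h>

/* One deferred OOM request, handled outside of the allocation context. */
static void async_oom_fn(struct work_struct *work)
{
	/*
	 * Select and kill a victim here.  Running from a workqueue means
	 * no allocation-site locks are held, so the victim is not blocked
	 * by the thread that hit OOM.
	 */
}

static DECLARE_WORK(async_oom_work, async_oom_fn);

/* Called from the allocation slow path instead of a synchronous OOM kill. */
static void queue_async_oom(void)
{
	schedule_work(&async_oom_work);
	/*
	 * The caller would then retry its too-small-to-fail allocation
	 * with access to the memory reserves while the victim is killed
	 * asynchronously -- which is exactly the weak spot discussed in
	 * the replies: every such caller gets at the reserves.
	 */
}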


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-06 16:14 [LSF/MM TOPIC ATTEND] Michal Hocko
  2015-01-06 23:27 ` Greg Thelen
@ 2015-01-07  8:58 ` Vladimir Davydov
  2015-01-07 14:38   ` Michal Hocko
  2015-02-02  8:37 ` [LSF/MM TOPIC ATTEND] - THP benefits Vlastimil Babka
  2 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2015-01-07  8:58 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Tue, Jan 06, 2015 at 05:14:35PM +0100, Michal Hocko wrote:
[...]
> And as a memcg co-maintainer I would like to also discuss the following
> topics.
> - We should finally settle down with a set of core knobs exported with
>   the new unified hierarchy cgroups API. I have proposed this already
>   http://marc.info/?l=linux-mm&m=140552160325228&w=2 but there is no
>   clear consensus and the discussion has died later on. I feel it would
>   be more productive to sit together and come up with a reasonable
>   compromise between - let's start from the begining and keep useful and
>   reasonable features.
>   
> - kmem accounting is seeing a lot of activity mainly thanks to Vladimir.
>   He is basically the only active developer in this area. I would be
>   happy if he can attend as well and discuss his future plans in the
>   area. The work overlaps with slab allocators and slab shrinkers so
>   having people familiar with these areas would be more than welcome

One more memcg-related topic that is worth discussing IMO:

 - On global memory pressure we walk over all memory cgroups and scan
   pages from each of them. Since there can be hundreds or even
   thousands of memory cgroups, such a walk can be quite expensive,
   especially if the cgroups are small, so that to reclaim anything from
   them we have to descend to a lower scan priority. The problem is
   aggravated by offline memory cgroups, which can now dangle for an
   indefinitely long time.

   That's why I think we should work out a better algorithm for the
   memory reclaimer (the current walk is sketched below). Maybe we could
   rank memory cgroups somehow (by their age, memory consumption?) and
   try to scan only the top-ranked cgroup during a reclaimer run. This
   topic is also very close to the soft limit reclaim improvements,
   which Michal has been working on for a while.
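
For reference, the walk in question is the memcg iteration in the
per-zone reclaim path; a simplified paraphrase of the loop (not verbatim
mm/vmscan.c -- the exact argument lists vary between kernel versions)
looks like this:

struct mem_cgroup *memcg;

memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

	/* scan this memcg's LRU lists */
	shrink_lruvec(lruvec, swappiness, sc);

	/* fairness / nr_to_reclaim checks elided */
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

Every memcg in the hierarchy below 'root' is visited, so the cost grows
with the number of (possibly tiny or offline) cgroups.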

Thanks,
Vladimir


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-06 23:27 ` Greg Thelen
@ 2015-01-07 14:28   ` Michal Hocko
  2015-01-07 18:54     ` Greg Thelen
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Michal Hocko @ 2015-01-07 14:28 UTC (permalink / raw)
  To: Greg Thelen; +Cc: lsf-pc, linux-mm

On Tue 06-01-15 15:27:27, Greg Thelen wrote:
> On Tue, Jan 06 2015, Michal Hocko wrote:
> 
> > - As it turned out recently GFP_KERNEL mimicing GFP_NOFAIL for !costly
> >   allocation is sometimes kicking us back because we are basically
> >   creating an invisible lock dependencies which might livelock the whole
> >   system under OOM conditions.
> >   That leads to attempts to add more hacks into the OOM killer
> >   which is tricky enough as is. Changing the current state is
> >   quite risky because we do not really know how many places in the
> >   kernel silently depend on this behavior. As per Johannes attempt
> >   (http://marc.info/?l=linux-mm&m=141932770811346) it is clear that
> >   we are not yet there! I do not have very good ideas how to deal with
> >   this unfortunatelly...
> 
> We've internally been fighting similar deadlocks between memcg kmem
> accounting and memcg oom killer.  I wouldn't call it a very good idea,
> because it falls in the realm of further complicating the oom killer,
> but what about introducing an async oom killer which runs outside of the
> context of the current task. 

I am not sure I understand you properly. We have something similar for
memcg in upstream. It is still run from the context of the task which
has tripped over the OOM, but it happens down in the page fault path
where no locks are held. This has fixed a similar lock dependency
problem with memcg charges, which can happen on top of arbitrary locks,
but it is still not enough; see below.

> An async killer won't hold any locks so it
> won't block the indented oom victim from terminating.  After queuing a
> deferred oom kill the allocating thread would then be able to dip into
> memory reserves to satisfy its too-small-to-fail allocation.

What would prevent the current task from consuming all the memory
reserves because the victim wouldn't die early enough (e.g. it won't be
scheduled or spends a lot of time on an unrelated lock)? Each "current"
which blocks the oom victim would have to get access to the reserves.
There might be really a lot of them...

I think that we shouldn't give anybody but the OOM victim access to
the reserves, because there is a good chance that the victim will
not use too much of them (unless there is a bug somewhere where the
victim allocates an unbounded amount of memory without bailing out on
fatal_signal_pending()).
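
As an illustration of the bail-out referred to above (a hedged sketch,
not code from any particular subsystem), a well-behaved allocation loop
in the victim is expected to stop as soon as the task has a fatal signal
pending, so the reserves are only ever tapped a bounded number of times:

#include <linux/sched.h>
#include <linux/slab.h>

static int fill_buffers(void **bufs, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		/* OOM-killed: stop allocating and let the task exit */
		if (fatal_signal_pending(current))
			return -EINTR;
		bufs[i] = kmalloc(PAGE_SIZE, GFP_KERNEL);
		if (!bufs[i])
			return -ENOMEM;
	}
	return 0;
}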

I am pretty sure that we can extend lockdep to report when an OOM victim
is going to block on a lock which is held by a task that is allocating
with an almost-never-fail gfp (there is already GFP_FS tracking
implemented, AFAIR). But that wouldn't solve the problem, because it
would turn into, as Dave pointed out, a "whack-a-mole" game.

Instead we shouldn't pretend that GFP_KERNEL is basically GFP_NOFAIL.
The question is how to get there without too many regressions, IMHO.
Or maybe we should simply bite the bullet, not be cowards, and deal
with bugs as they come. If something really cannot deal with the
failure, it should say so with a proper flag.
-- 
Michal Hocko
SUSE Labs


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-07  8:58 ` Vladimir Davydov
@ 2015-01-07 14:38   ` Michal Hocko
  2015-01-08  8:33     ` Vladimir Davydov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2015-01-07 14:38 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: lsf-pc, linux-mm

On Wed 07-01-15 11:58:28, Vladimir Davydov wrote:
> On Tue, Jan 06, 2015 at 05:14:35PM +0100, Michal Hocko wrote:
> [...]
> > And as a memcg co-maintainer I would like to also discuss the following
> > topics.
> > - We should finally settle down with a set of core knobs exported with
> >   the new unified hierarchy cgroups API. I have proposed this already
> >   http://marc.info/?l=linux-mm&m=140552160325228&w=2 but there is no
> >   clear consensus and the discussion has died later on. I feel it would
> >   be more productive to sit together and come up with a reasonable
> >   compromise between - let's start from the begining and keep useful and
> >   reasonable features.
> >   
> > - kmem accounting is seeing a lot of activity mainly thanks to Vladimir.
> >   He is basically the only active developer in this area. I would be
> >   happy if he can attend as well and discuss his future plans in the
> >   area. The work overlaps with slab allocators and slab shrinkers so
> >   having people familiar with these areas would be more than welcome
> 
> One more memcg related topic that is worth discussing IMO:
> 
>  - On global memory pressure we walk over all memory cgroups and scan
>    pages from each of them. Since there can be hundreds or even
>    thousands of memory cgroups, such a walk can be quite expensive,
>    especially if the cgroups are small so that to reclaim anything from
>    them we have to descend to a lower scan priority.

     We do not get to lower priorities just to scan small cgroups. They
     will simply get ignored unless we are force scanning them.

>    The problem is
>    augmented by offline memory cgroups, which now can be dangling for
>    indefinitely long time.

OK, but shrink_lruvec shouldn't do too much work on a memcg which
doesn't have any pages to scan for the given priority. Or have you seen
this in some profiles?

>    That's why I think we should work out a better algorithm for the
>    memory reclaimer. May be, we could rank memory cgroups somehow (by
>    their age, memory consumption?) and try to scan only the top ranked
>    cgroup during a reclaimer run.

We still have to keep some fairness and reclaim all groups
proportionally, and balancing this would be quite non-trivial. I am not
saying we couldn't implement our iterators in a more intelligent way,
but this code is quite complex already and I haven't seen this as a big
problem yet. Some overhead is to be expected when thousands of groups
are configured, right?

>    This topic is also very close to the
>    soft limit reclaim improvements, which Michal has been working on for
>    a while.

The patches I have for the low limit reclaim didn't care about
intelligent filtering of non-reclaimable groups, because I thought it
would be too early to complicate the code at this stage, especially when
non-reclaimable groups will be a very small minority in real life. This
wasn't the case with the old soft limit, because we had the opposite
situation there.

Nevertheless I am definitely open to discussing improvements.
-- 
Michal Hocko
SUSE Labs


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-07 14:28   ` Michal Hocko
@ 2015-01-07 18:54     ` Greg Thelen
  2015-01-07 19:00     ` Greg Thelen
  2015-01-14 21:27     ` Andrea Arcangeli
  2 siblings, 0 replies; 13+ messages in thread
From: Greg Thelen @ 2015-01-07 18:54 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Wed, Jan 07 2015, Michal Hocko wrote:

> On Tue 06-01-15 15:27:27, Greg Thelen wrote:
>> On Tue, Jan 06 2015, Michal Hocko wrote:
>> 
>> > - As it turned out recently GFP_KERNEL mimicing GFP_NOFAIL for !costly
>> >   allocation is sometimes kicking us back because we are basically
>> >   creating an invisible lock dependencies which might livelock the whole
>> >   system under OOM conditions.
>> >   That leads to attempts to add more hacks into the OOM killer
>> >   which is tricky enough as is. Changing the current state is
>> >   quite risky because we do not really know how many places in the
>> >   kernel silently depend on this behavior. As per Johannes attempt
>> >   (http://marc.info/?l=linux-mm&m=141932770811346) it is clear that
>> >   we are not yet there! I do not have very good ideas how to deal with
>> >   this unfortunatelly...
>> 
>> We've internally been fighting similar deadlocks between memcg kmem
>> accounting and memcg oom killer.  I wouldn't call it a very good idea,
>> because it falls in the realm of further complicating the oom killer,
>> but what about introducing an async oom killer which runs outside of the
>> context of the current task. 
>
> I am not sure I understand you properly. We have something similar for
> memcg in upstream. It is still from the context of the task which has
> tripped over the OOM but it happens down in the page fault path where no
> locks are held. This has fixed the similar lock dependency problem in
> memcg charges, which can happen on top of any locks, but it is still not
> enough, see below.

Nod.  I'm working with an older kernel which does oom killing in the
allocation context rather than failing the allocation and expecting the
end of page fault processing to queue an oom kill.  Such older kernels
thus don't fail small GFP_KERNEL kmem allocations due to memcg oom, but
they run the risk of lockups.  Newer kernels fail small GFP_KERNEL
allocations for memcg oom, but won't fail them for page allocator
shortages.  I assume we want consistency in the handling of small
GFP_KERNEL allocations for memcg and machine oom.

>> An async killer won't hold any locks so it
>> won't block the indented oom victim from terminating.  After queuing a
>> deferred oom kill the allocating thread would then be able to dip into
>> memory reserves to satisfy its too-small-to-fail allocation.
>
> What would prevent the current to consume all the memory reserves
> because the victim wouldn't die early enough (e.g. it won't be scheduled
> or spend a lot of time on an unrelated lock)? Each "current" which
> blocks the oom victim would have to get access to the reserves. There
> might be really lots of them...

Yeah, this is the weak spot.

> I think that we shouldn't give anybody but OOM victim access to
> the reserves because there is a good chance that the victim will
> not use too much of it (unless there is a bug somewhere where the
> victim allocates unbounded amount of memory without bailing out on
> fatal_signals_pending).
>
> I am pretty sure that we can extend lockdep to report when OOM victim
> is going to block on a lock which is held by a task which is allocating
> on almost-never-fail gfp (there is already GFP_FS tracking implemented
> AFAIR). But that wouldn't solve the problem, though, because it would
> turn into, as Dave pointed out, "whack a mole" game.

Close, but I think the lockdep complaint would need to be wider - it
shouldn't catch only actual oom kill victims but also potential oom kill
victims.  Lockdep would need to complain whenever any thread attempts an
almost-never-fail allocation while holding any lock which any user
thread (a possible oom kill victim) has ever grabbed in a
non-interruptible fashion.  This might catch a lot of allocations.

> Instead we shouldn't pretend that GFP_KERNEL is basically GFP_NOFAIL.
> The question is how to get there without too many regressions IMHO.
> Or maybe we should simply bite a bullet and don't be cowards and simply
> deal with bugs as they come. If something really cannot deal with the
> failure it should tell that by a proper flag.

I'm not opposed to this, but we'll still have a lot of places where the
only response to a small GFP_KERNEL allocation failure is to call the
oom killer.  These allocation sites would presumably add __GFP_NOFAIL to
instruct the page allocator to call the oom killer rather than fail.
Thus we still need to either start enforcing the above lockdep rule or
have "some sort of" async oom killer.  But I admit the async killer has
a serious reserve exhaustion issue.


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-07 14:28   ` Michal Hocko
  2015-01-07 18:54     ` Greg Thelen
@ 2015-01-07 19:00     ` Greg Thelen
  2015-01-14 21:27     ` Andrea Arcangeli
  2 siblings, 0 replies; 13+ messages in thread
From: Greg Thelen @ 2015-01-07 19:00 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Wed, Jan 07 2015, Michal Hocko wrote:

> On Tue 06-01-15 15:27:27, Greg Thelen wrote:
>> On Tue, Jan 06 2015, Michal Hocko wrote:
>> 
>> > - As it turned out recently GFP_KERNEL mimicing GFP_NOFAIL for !costly
>> >   allocation is sometimes kicking us back because we are basically
>> >   creating an invisible lock dependencies which might livelock the whole
>> >   system under OOM conditions.
>> >   That leads to attempts to add more hacks into the OOM killer
>> >   which is tricky enough as is. Changing the current state is
>> >   quite risky because we do not really know how many places in the
>> >   kernel silently depend on this behavior. As per Johannes attempt
>> >   (http://marc.info/?l=linux-mm&m=141932770811346) it is clear that
>> >   we are not yet there! I do not have very good ideas how to deal with
>> >   this unfortunatelly...
>> 
>> We've internally been fighting similar deadlocks between memcg kmem
>> accounting and memcg oom killer.  I wouldn't call it a very good idea,
>> because it falls in the realm of further complicating the oom killer,
>> but what about introducing an async oom killer which runs outside of the
>> context of the current task. 
>
> I am not sure I understand you properly. We have something similar for
> memcg in upstream. It is still from the context of the task which has
> tripped over the OOM but it happens down in the page fault path where no
> locks are held. This has fixed the similar lock dependency problem in
> memcg charges, which can happen on top of any locks, but it is still not
> enough, see below.

Nod.  I'm working with an older kernel which does oom killing in the
allocation context rather than failing the allocation and expecting the
end of page fault processing to queue an oom kill.  Such older kernels
thus don't fail small GFP_KERNEL kmem allocations due to memcg oom, but
they run the risk of lockups.  Newer kernels fail small GFP_KERNEL
allocations for memcg oom, but won't fail them for page allocator
shortages.  I assume we want consistency in the handling of small
GFP_KERNEL allocations for memcg and machine oom.

>> An async killer won't hold any locks so it
>> won't block the indented oom victim from terminating.  After queuing a
>> deferred oom kill the allocating thread would then be able to dip into
>> memory reserves to satisfy its too-small-to-fail allocation.
>
> What would prevent the current to consume all the memory reserves
> because the victim wouldn't die early enough (e.g. it won't be scheduled
> or spend a lot of time on an unrelated lock)? Each "current" which
> blocks the oom victim would have to get access to the reserves. There
> might be really lots of them...

Yeah, this is the weak spot.

> I think that we shouldn't give anybody but OOM victim access to
> the reserves because there is a good chance that the victim will
> not use too much of it (unless there is a bug somewhere where the
> victim allocates unbounded amount of memory without bailing out on
> fatal_signals_pending).
>
> I am pretty sure that we can extend lockdep to report when OOM victim
> is going to block on a lock which is held by a task which is allocating
> on almost-never-fail gfp (there is already GFP_FS tracking implemented
> AFAIR). But that wouldn't solve the problem, though, because it would
> turn into, as Dave pointed out, "whack a mole" game.

Close, but I think the lockdep complaint would need to be wider - it
shouldn't catch only actual oom kill victims but also potential oom kill
victims.  Lockdep would need to complain whenever any thread attempts an
almost-never-fail allocation while holding any lock which any user
thread (a possible oom kill victim) has ever grabbed in a
non-interruptible fashion.  This might catch a lot of allocations.

> Instead we shouldn't pretend that GFP_KERNEL is basically GFP_NOFAIL.
> The question is how to get there without too many regressions IMHO.
> Or maybe we should simply bite a bullet and don't be cowards and simply
> deal with bugs as they come. If something really cannot deal with the
> failure it should tell that by a proper flag.

I'm not opposed to this, but we'll still have a lot of places where the
only response to a small GFP_KERNEL allocation failure is to call the
oom killer.  These allocation sites would presumably add __GFP_NOFAIL to
instruct the page allocator to call the oom killer rather than fail.
Thus we still need to either start enforcing the above lockdep rule or
have "some sort of" async oom killer.  But I admit the async killer has
a serious reserve exhaustion issue.


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-07 14:38   ` Michal Hocko
@ 2015-01-08  8:33     ` Vladimir Davydov
  2015-01-08  9:09       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2015-01-08  8:33 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Wed, Jan 07, 2015 at 03:38:58PM +0100, Michal Hocko wrote:
> On Wed 07-01-15 11:58:28, Vladimir Davydov wrote:
> > On Tue, Jan 06, 2015 at 05:14:35PM +0100, Michal Hocko wrote:
> > [...]
> > > And as a memcg co-maintainer I would like to also discuss the following
> > > topics.
> > > - We should finally settle down with a set of core knobs exported with
> > >   the new unified hierarchy cgroups API. I have proposed this already
> > >   http://marc.info/?l=linux-mm&m=140552160325228&w=2 but there is no
> > >   clear consensus and the discussion has died later on. I feel it would
> > >   be more productive to sit together and come up with a reasonable
> > >   compromise between - let's start from the begining and keep useful and
> > >   reasonable features.
> > >   
> > > - kmem accounting is seeing a lot of activity mainly thanks to Vladimir.
> > >   He is basically the only active developer in this area. I would be
> > >   happy if he can attend as well and discuss his future plans in the
> > >   area. The work overlaps with slab allocators and slab shrinkers so
> > >   having people familiar with these areas would be more than welcome
> > 
> > One more memcg related topic that is worth discussing IMO:
> > 
> >  - On global memory pressure we walk over all memory cgroups and scan
> >    pages from each of them. Since there can be hundreds or even
> >    thousands of memory cgroups, such a walk can be quite expensive,
> >    especially if the cgroups are small so that to reclaim anything from
> >    them we have to descend to a lower scan priority.
> 
>      We do not get to lower priorities just to scan small cgroups. They
>      will simply get ignored unless we are force scanning them.

That means that small cgroups (< 16 MB) may not be scanned at all if
there are enough reclaimable pages in bigger cgroups. I'm not sure if
anyone will mix small and big cgroups on the same host though. However,
currently this may leave offline memory cgroups hanging around forever
if they still have some memory at destruction time, because they will
become small due to global reclaim sooner or later. OTOH, we could
always forcefully scan lruvecs that belong to dead cgroups, or limit the
maximum number of dead cgroups, w/o reworking the reclaimer.

> 
> >    The problem is
> >    augmented by offline memory cgroups, which now can be dangling for
> >    indefinitely long time.
> 
> OK, but shrink_lruvec shouldn't do too much work on a memcg which
> doesn't have any pages to scan for the given priority. Or have you
> seen this in some profiles?

In real life, no.

> 
> >    That's why I think we should work out a better algorithm for the
> >    memory reclaimer. May be, we could rank memory cgroups somehow (by
> >    their age, memory consumption?) and try to scan only the top ranked
> >    cgroup during a reclaimer run.
> 
> We still have to keep some fairness and reclaim all groups
> proportionally and balancing this would be quite non-trivial. I am not
> saying we couldn't implement our iterators in a more intelligent way but
> this code is quite complex already and I haven't seen this as a big
> problem yet. Some overhead is to be expected when thousands of groups
> are configured, right?

Right, sounds convincing. Let's cross out this topic then until we see
complaints from real users. No need to spend time on it right now.

Sorry for the noise.

Thanks,
Vladimir


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-08  8:33     ` Vladimir Davydov
@ 2015-01-08  9:09       ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2015-01-08  9:09 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: lsf-pc, linux-mm

On Thu 08-01-15 11:33:53, Vladimir Davydov wrote:
> On Wed, Jan 07, 2015 at 03:38:58PM +0100, Michal Hocko wrote:
> > On Wed 07-01-15 11:58:28, Vladimir Davydov wrote:
> > > On Tue, Jan 06, 2015 at 05:14:35PM +0100, Michal Hocko wrote:
> > > [...]
> > > > And as a memcg co-maintainer I would like to also discuss the following
> > > > topics.
> > > > - We should finally settle down with a set of core knobs exported with
> > > >   the new unified hierarchy cgroups API. I have proposed this already
> > > >   http://marc.info/?l=linux-mm&m=140552160325228&w=2 but there is no
> > > >   clear consensus and the discussion has died later on. I feel it would
> > > >   be more productive to sit together and come up with a reasonable
> > > >   compromise between - let's start from the begining and keep useful and
> > > >   reasonable features.
> > > >   
> > > > - kmem accounting is seeing a lot of activity mainly thanks to Vladimir.
> > > >   He is basically the only active developer in this area. I would be
> > > >   happy if he can attend as well and discuss his future plans in the
> > > >   area. The work overlaps with slab allocators and slab shrinkers so
> > > >   having people familiar with these areas would be more than welcome
> > > 
> > > One more memcg related topic that is worth discussing IMO:
> > > 
> > >  - On global memory pressure we walk over all memory cgroups and scan
> > >    pages from each of them. Since there can be hundreds or even
> > >    thousands of memory cgroups, such a walk can be quite expensive,
> > >    especially if the cgroups are small so that to reclaim anything from
> > >    them we have to descend to a lower scan priority.
> > 
> >      We do not get to lower priorities just to scan small cgroups. They
> >      will simply get ignored unless we are force scanning them.
> 
> That means that small cgroups (< 16 M) may not be scanned at all if
> there are enough reclaimable pages in bigger cgroups. I'm not sure if
> anyone will mix small and big cgroups on the same host though. However,
> currently this may render offline memory cgroups hanging around forever
> if they have some memory on destruction, because they will become small
> due to global reclaim sooner or later. OTOH, we could always forcefully
> scan lruvecs that belong to dead cgroups, or limit the maximal number of
> dead cgroups, w/o reworking the reclaimer.

Makes sense! Now that we do not reparent on offline this might indeed be
a problem. Care to send a patch? I will cook up something if you do not
have time for that.

Something along these lines should work but I haven't thought about that
very much to be honest:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e29f411b38ac..277585176a9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1935,7 +1935,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd() && !zone_reclaimable(zone))
+	if (current_is_kswapd() && (!zone_reclaimable(zone) || mem_cgroup_need_force_scan(sc->target_mem_cgroup)))
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
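
For clarity: mem_cgroup_need_force_scan() above does not exist in the
tree; it is a helper being sketched on the spot.  One minimal form it
might take (the exact "is this cgroup offline?" predicate is a
placeholder, and whether the test belongs on the reclaim target or on
each memcg visited by the iterator is left open) could be:

/* Hypothetical helper matching the diff above -- not an existing API. */
static bool mem_cgroup_need_force_scan(struct mem_cgroup *memcg)
{
	if (!memcg)		/* plain global reclaim, no memcg target */
		return false;
	/*
	 * Force the scan when the cgroup has already been taken offline,
	 * so that dead cgroups holding a handful of pages do not dangle
	 * forever just because they are too small to be scanned normally.
	 * memcg_is_offline() stands in for whatever offline test the
	 * memcg code exposes.
	 */
	return memcg_is_offline(memcg);
}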
 
> > >    The problem is
> > >    augmented by offline memory cgroups, which now can be dangling for
> > >    indefinitely long time.
> > 
> > OK, but shrink_lruvec shouldn't do too much work on a memcg which
> > doesn't have any pages to scan for the given priority. Or have you
> > seen this in some profiles?
> 
> In real life, no.
> 
> > 
> > >    That's why I think we should work out a better algorithm for the
> > >    memory reclaimer. May be, we could rank memory cgroups somehow (by
> > >    their age, memory consumption?) and try to scan only the top ranked
> > >    cgroup during a reclaimer run.
> > 
> > We still have to keep some fairness and reclaim all groups
> > proportionally and balancing this would be quite non-trivial. I am not
> > saying we couldn't implement our iterators in a more intelligent way but
> > this code is quite complex already and I haven't seen this as a big
> > problem yet. Some overhead is to be expected when thousands of groups
> > are configured, right?
> 
> Right, sounds convincing. Let's cross out this topic then until we see
> complains from real users. No need to spend time on it right now.
> 
> Sorry for the noise.

No noise at all!

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-07 14:28   ` Michal Hocko
  2015-01-07 18:54     ` Greg Thelen
  2015-01-07 19:00     ` Greg Thelen
@ 2015-01-14 21:27     ` Andrea Arcangeli
  2015-01-15 14:06       ` Michal Hocko
  2 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2015-01-14 21:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Greg Thelen, linux-mm

Hello everyone,

On Wed, Jan 07, 2015 at 03:28:04PM +0100, Michal Hocko wrote:
> Instead we shouldn't pretend that GFP_KERNEL is basically GFP_NOFAIL.
> The question is how to get there without too many regressions IMHO.
> Or maybe we should simply bite a bullet and don't be cowards and simply
> deal with bugs as they come. If something really cannot deal with the
> failure it should tell that by a proper flag.

Not related to memcg but related to GFP_NOFAIL behavior: a couple of
months ago, while stress testing some code I've been working on, I ran
into several OOM livelocks which may be the same ones you're reporting
here, and I reliably fixed those (at least for my load) so I could keep
going with my work. I didn't try to submit these changes yet, but this
discussion rings a bell... so I'm sharing my changes below in this
thread in case they may help:

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=00e91f97df9861454f7e0701944d7de2c382ffb9
http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=a0fcf2323b2e4cffd750c1abc1d2c138acdefcc8
http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=798b7f9d549664f8c0007c6416a2568eedd75d6a

Thanks,
Andrea


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-14 21:27     ` Andrea Arcangeli
@ 2015-01-15 14:06       ` Michal Hocko
  2015-01-15 20:58         ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2015-01-15 14:06 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Greg Thelen, linux-mm

On Wed 14-01-15 22:27:45, Andrea Arcangeli wrote:
> Hello everyone,
> 
> On Wed, Jan 07, 2015 at 03:28:04PM +0100, Michal Hocko wrote:
> > Instead we shouldn't pretend that GFP_KERNEL is basically GFP_NOFAIL.
> > The question is how to get there without too many regressions IMHO.
> > Or maybe we should simply bite a bullet and don't be cowards and simply
> > deal with bugs as they come. If something really cannot deal with the
> > failure it should tell that by a proper flag.
> 
> Not related to memcg but related to GFP_NOFAIL behavior, a couple of
> months ago while stress testing some code I've been working on, I run
> into several OOM livelocks which may be the same you're reporting here
> and I reliably fixed those (at least for my load) so I could keep
> going with my work. I didn't try to submit these changes yet, but this
> discussion rings a bell... so I'm sharing my changes below in this
> thread in case it may help:
> 
> http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=00e91f97df9861454f7e0701944d7de2c382ffb9

OK, this is interesting. We do fail !GFP_FS allocations, but
did_some_progress might prevent us from reaching __alloc_pages_may_oom
where we fail. This can lead to thrashing when the reclaim makes some
progress but it doesn't help the allocation succeed. This can take many
retries until no progress can be made, and the allocation fails much
later.

I do agree that failing earlier is slightly better, even though the
result would be more allocation failures, which have a hard-to-predict
outcome. Anyway, callers should be prepared for the failure, and we can
hardly think about performance under such conditions. I would happily
ack such a patch if you post it.

> http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=a0fcf2323b2e4cffd750c1abc1d2c138acdefcc8

I am not sure about this one, because TIF_MEMDIE is there to give access
to memory reserves. GFP_NOFAIL shouldn't mean the same, because then it
would be much harder to "guarantee" that the reserves wouldn't be
depleted completely. So I do not like this much. Besides that, I think
that a GFP_NOFAIL allocation blocking the OOM victim is a plain bug.
grow_dev_page is relying on GFP_NOFAIL, but I am wondering whether ext4
can do something to pre-allocate so that it doesn't have to call it.
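
For context, the reliance mentioned above is in fs/buffer.c: roughly
(this is a paraphrase from memory of that era's code, not a verbatim
quote), grow_dev_page() forces the no-fail semantics because the buffer
cache path has no way to report an allocation failure to its callers:

/* paraphrased sketch of the allocation in grow_dev_page() */
gfp_mask = mapping_gfp_mask(bdev->bd_inode->i_mapping) & ~__GFP_FS;
gfp_mask |= __GFP_MOVABLE;
gfp_mask |= __GFP_NOFAIL;	/* __getblk_slow() cannot handle failure */

page = find_or_create_page(bdev->bd_inode->i_mapping, index, gfp_mask);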

> http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=798b7f9d549664f8c0007c6416a2568eedd75d6a

I think this should be fixed in the filesystem rather than papered
over.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [LSF/MM TOPIC ATTEND]
  2015-01-15 14:06       ` Michal Hocko
@ 2015-01-15 20:58         ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2015-01-15 20:58 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Greg Thelen, linux-mm

Hi Michal,

On Thu, Jan 15, 2015 at 03:06:54PM +0100, Michal Hocko wrote:
> 
> > http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=a0fcf2323b2e4cffd750c1abc1d2c138acdefcc8
> 
> I am not sure about this one because TIF_MEMDIE is there to give an
> access to memory reserves. GFP_NOFAIL shouldn't mean the same because
> then it would be much harder to "guarantee" that the reserves wouldn't
> be depleted completely. So I do not like this much. Besides that I think
> that GFP_NOFAIL allocation blocking OOM victim is a plain bug.
> grow_dev_page is relying on GFP_NOFAIL but I am wondering whether ext4
> can do something to pre-allocate so that it doesn't have to call it.

Well, this is just the longstanding GFP_NOFAIL livelock; it always
existed deep down in the buffer head allocation, even before GFP_NOFAIL
existed. GFP_NOFAIL just generalized the livelocking concept.

There's no proper fix for that other than to teach the filesystem to
deal with allocation errors and remove GFP_NOFAIL (in this case
__GFP_NOFAIL was set):

 #0 get_page_from_freelist (gfp_mask=0x20858,
    nodemask=0x0 <irq_stack_union>, order=0x0,
    zonelist=0xffff88007fffc100, high_zoneidx=0x2, alloc_flags=0xc0,
    preferred_zone=0xffff88007fffa840,
    classzone_idx=classzone_idx@entry=0x1,
    migratetype=migratetype@entry=0x2) at mm/page_alloc.c:1953

gfp_mask=0x20858 & 0x800u (__GFP_NOFAIL) = 0x800, i.e. the flag is set.

If we're OOM and a GFP_NOFAIL allocation actually fails to allocate
memory, this patch simply tries to mitigate the potential livelock by
giving it a chance to use the memory reserves that are normally used
only for high-priority allocations.

If __GFP_NOFAIL hits OOM, I think it's fair to say it is very high
priority (higher priority than a GFP_ATOMIC allocation or anything that
can fail gracefully).

So the above second patch looks quite safe to me conceptually as well:
at least we put those last 50M of RAM to good use instead of livelocking
while 50M are still free.

> > http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=798b7f9d549664f8c0007c6416a2568eedd75d6a
> 
> I think this should be fixed in the filesystem rather than paper over
> it.

No doubt, this third patch basically undoes the fix in the first patch.
It makes !__GFP_FS allocations again not fail when invoked from kernel
thread context, where TIF_MEMDIE can never be set.

However, there was no way I could run without this third patch on my own
production systems, with ext4 potentially mounting itself readonly on me
during OOM killing (as a result of the livelock fix in the first patch).

Ideally the ext4 developers should revert this third patch (which must
be kept separate from the first patch exactly for this reason), start an
OOM-killing loop to reproduce the problem, and fix it, so that we can
then revert the third patch for good.

In short:

1) first patch makes !__GFP_FS not equivalent to __GFP_NOFAIL anymore
   (when invoked by kernel threads where TIF_MEMDIE cannot be set)

2) second patch deals with a genuine __GFP_NOFAIL livelock using the
   memory reserves (this is orthogonal to 1)

3) third patch undoes 1 and uses the memory reserves for !__GFP_FS too,
   like patch 2 used them to mitigate the genuine __GFP_NOFAIL
   livelock.  Undoing patch 1 is needed because patch 1 causes ext4 to
   remount itself readonly and complain about metadata corruption.

I later tested the ext4 trouble further after applying only patch 1, and
it seems ext4 thinks it's corrupted, but e2fsck -f shows it's actually
clean. So it's probably an in-memory issue only, but still, having ext4
remount itself readonly during OOM killing isn't exactly acceptable or
graceful (until it is fixed). Hence the reason for patch 3.

Of course it took a long time before the trouble with patch 1 saw the
light of day; in fact, I first hit the genuine __GFP_NOFAIL livelock
fixed by patch 2 before I could ever hit the ext4 error paths.

Let me know what you'd like me to submit; I just don't think submitting
only the first patch, as you suggested, is a safe idea.

I also think allowing __GFP_NOFAIL to access the emergency reserves is
OK if __GFP_NOFAIL is hitting an OOM condition (what else could be more
urgent than letting a potentially livelocking __GFP_NOFAIL allocation
succeed?).

I think the combination of the 3 patches is safe, and in practice it
solves all OOM-related livelocks I ran into. It also allows the ext4
developers to trivially (by reverting patch 3) fix their bugs, and then
we can revert the third patch upstream so that !__GFP_FS allocations
from kernel threads become theoretically safe too. By contrast,
__GFP_NOFAIL is never theoretically safe, but that's much harder to fix
than the already existing ext4 error paths that aren't using
__GFP_NOFAIL but haven't been properly exercised, simply because they
couldn't be exercised without patch 1 applied (kernel thread allocations
without __GFP_FS set cannot fail currently, and making them fail by
applying patch 1 exercises those untested error paths for the first
time).

Thanks!
Andrea


* Re: [LSF/MM TOPIC ATTEND] - THP benefits
  2015-01-06 16:14 [LSF/MM TOPIC ATTEND] Michal Hocko
  2015-01-06 23:27 ` Greg Thelen
  2015-01-07  8:58 ` Vladimir Davydov
@ 2015-02-02  8:37 ` Vlastimil Babka
  2 siblings, 0 replies; 13+ messages in thread
From: Vlastimil Babka @ 2015-02-02  8:37 UTC (permalink / raw)
  To: Michal Hocko, lsf-pc
  Cc: linux-mm, Kirill A. Shutemov, Mel Gorman, Hugh Dickins

On 01/06/2015 05:14 PM, Michal Hocko wrote:
> - THP success rate has become one of the metric for reclaim/compaction
>   changes which I feel is missing one important aspect and that is
>   cost/benefit analysis. It might be better to have more THP pages in
>   some loads but the whole advantage might easily go away when the
>   initial cost is higher than all aggregated saves. When it comes to
>   benchmarks and numbers we are usually missing the later.

So what I think would help in this discussion are some numbers on how
much hugepages (and thus THP) actually help performance nowadays. Does
anyone have such results on recent hardware from e.g. SPEC CPU2006 or
even production workloads?



Thread overview: 13+ messages
2015-01-06 16:14 [LSF/MM TOPIC ATTEND] Michal Hocko
2015-01-06 23:27 ` Greg Thelen
2015-01-07 14:28   ` Michal Hocko
2015-01-07 18:54     ` Greg Thelen
2015-01-07 19:00     ` Greg Thelen
2015-01-14 21:27     ` Andrea Arcangeli
2015-01-15 14:06       ` Michal Hocko
2015-01-15 20:58         ` Andrea Arcangeli
2015-01-07  8:58 ` Vladimir Davydov
2015-01-07 14:38   ` Michal Hocko
2015-01-08  8:33     ` Vladimir Davydov
2015-01-08  9:09       ` Michal Hocko
2015-02-02  8:37 ` [LSF/MM TOPIC ATTEND] - THP benefits Vlastimil Babka
