linux-mm.kvack.org archive mirror
* Re: [failures] mm-vmscan-remove-unnecessary-lruvec-adding.patch removed from -mm tree
       [not found]     ` <97EE83E1-FEC9-48B6-98E8-07FB3FECB961@lca.pw>
@ 2020-03-06  4:17       ` Hugh Dickins
  2020-03-06  4:42         ` Alex Shi
                           ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Hugh Dickins @ 2020-03-06  4:17 UTC (permalink / raw)
  To: Qian Cai
  Cc: Matthew Wilcox, LKML, Andrew Morton, aarcange, Alex Shi,
	daniel.m.jordan, hannes, hughd, khlebnikov, kirill, kravetz,
	mhocko, mm-commits, tj, vdavydov.dev, yang.shi, linux-mm

On Thu, 5 Mar 2020, Qian Cai wrote:
> > On Mar 5, 2020, at 10:38 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > On Thu, Mar 05, 2020 at 10:32:18PM -0500, Qian Cai wrote:
> >>> On Mar 5, 2020, at 9:50 PM, akpm@linux-foundation.org wrote:
> >>> The patch titled
> >>>    Subject: mm/vmscan: remove unnecessary lruvec adding
> >>> has been removed from the -mm tree.  Its filename was
> >>>    mm-vmscan-remove-unnecessary-lruvec-adding.patch
> >>> 
> >>> This patch was dropped because it had testing failures
> >> 
> >> Andrew, do you have more information about this failure? I hit a bug
> >> here under memory pressure and am wondering if this is related
> >> which might save me some time digging…

Very likely related.

> > 
> > See Hugh's message from a few minutes ago:

Thanks Matthew.

> > 
> > Subject: Re: [PATCH v9 00/21] per lruvec lru_lock for memcg
> 
> I don’t see it on lore.kernel or anywhere. Private email?

You're right, sorry I didn't notice, lots of ccs but
neither lkml nor linux-mm were on that thread from the start:

From hughd@google.com Thu Mar  5 18:16:06 2020
Date: Thu, 5 Mar 2020 18:15:40 -0800 (PST)
From: Hugh Dickins <hughd@google.com>
To: Andrew Morton <akpm@linux-foundation.org>, Alex Shi <alex.shi@linux.alibaba.com>
Cc: cgroups@vger.kernel.org, mgorman@techsingularity.net, tj@kernel.org, hughd@google.com, khlebnikov@yandex-team.ru, daniel.m.jordan@oracle.com, yang.shi@linux.alibaba.com, willy@infradead.org, hannes@cmpxchg.org, lkp@intel.com, Fengguang Wu <fengguang.wu@intel.com>, Rong Chen <rong.a.chen@intel.com>
Subject: Re: [PATCH v9 00/21] per lruvec lru_lock for memcg

On Tue, 3 Mar 2020, Alex Shi wrote:
> On 2020/3/3 at 6:12 AM, Andrew Morton wrote:
> >> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> >> and Yun Wang.
> > I'm not seeing a lot of evidence of review and test activity yet.  But
> > I think I'll grab patches 01-06 as they look like fairly
> > straightforward improvements.
> 
> cc Fengguang and Rong Chen
> 
> I did some local functional testing and kselftest; they all look fine.
> 0day only warns me if some case fails. So is no news good news? :)

And now the bad news.

Andrew, please revert those six (or seven as they ended up in mmotm).
5.6-rc4-mm1 without them runs my tmpfs+loop+swapping+memcg+ksm kernel
build loads fine (did four hours just now), but 5.6-rc4-mm1 itself
crashed just after starting - seconds or minutes I didn't see,
but it did not complete an iteration.

I thought maybe those six would be harmless (though I've not looked
at them at all); but knew already that the full series is not good yet:
I gave it a try over 5.6-rc4 on Monday, and crashed very soon on simpler
testing, in different ways from what hits mmotm.

The first thing wrong with the full set was when I tried tmpfs+loop+
swapping kernel builds in "mem=700M cgroup_disabled=memory", of course
with CONFIG_DEBUG_LIST=y. That soon collapsed in a splurge of OOM kills
and list_del corruption messages: __list_del_entry_valid < list_del <
__page_cache_release < __put_page < put_page < __try_to_reclaim_swap <
free_swap_and_cache < shmem_free_swap < shmem_undo_range.
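
For readers unfamiliar with CONFIG_DEBUG_LIST, here is a minimal user-space
sketch of the kind of check that can produce list_del corruption messages
like those above. It is not the kernel's code: list_del_entry_valid(),
list_del_checked(), list_del_demo() and the use of NULL as poison are
simplifications assumed purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* User-space stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Hypothetical check in the spirit of CONFIG_DEBUG_LIST: an entry may be
 * unlinked only if it is not poisoned and both neighbours point back at it.
 */
static bool list_del_entry_valid(const struct list_head *entry)
{
    if (entry->next == NULL || entry->prev == NULL) {
        fprintf(stderr, "list_del corruption: entry already deleted\n");
        return false;
    }
    if (entry->prev->next != entry || entry->next->prev != entry) {
        fprintf(stderr, "list_del corruption: neighbours disagree\n");
        return false;
    }
    return true;
}

/* Unlink entry if the check passes; otherwise leave the list untouched. */
static bool list_del_checked(struct list_head *entry)
{
    if (!list_del_entry_valid(entry))
        return false;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;   /* NULL stands in for the kernel's poison values */
    entry->prev = NULL;
    return true;
}

/* A second delete of the same entry must be refused, not silently applied. */
static void list_del_demo(void)
{
    struct list_head lru, page;

    list_init(&lru);
    list_add(&page, &lru);
    assert(list_del_checked(&page));
    assert(!list_del_checked(&page));
    assert(lru.next == &lru && lru.prev == &lru);
}
```

Calling list_del_demo() from a small main() exercises the double-delete
case: the second delete is reported on stderr instead of corrupting the
list. A page removed from an LRU list twice, or under the wrong lock, is
the sort of bug such a check would be expected to catch.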

When I next tried with "mem=1G" and memcg enabled (but not being used),
that managed some iterations, no OOM kills, no list_del warnings (was
it swapping? perhaps, perhaps not, I was trying to go easy on it just
to see if "cgroup_disabled=memory" had been the problem); but when
rebooting after that, again list_del corruption messages and crash
(I didn't note them down).

So I didn't take much notice of what the mmotm crash backtrace showed
(but IIRC shmem and swap were in it).

Alex, I'm afraid you're focusing too much on performance results,
without doing the basic testing needed - I thought we had given you
some hints on the challenging areas (swapping, move_charge_at_immigrate,
page migration) when we attached a *correctly working* 5.3 version back
on 23rd August:

https://lore.kernel.org/linux-mm/alpine.LSU.2.11.1908231736001.16920@eggly.anvils/

(Correctly working, except missing two patches I'd mistakenly dropped
as unnecessary in earlier rebases: but our discussions with Johannes
later showed to be very necessary, though their races rarely seen.)

I have not had the time (and do not expect to have the time) to review
your series: maybe it's one or two small fixes away from being complete,
or maybe it's still fundamentally flawed, I do not know.  I had naively
hoped that you would help with a patchset that worked, rather than
cutting it down into something which does not.

Submitting your series to routine testing is much easier for me than
reviewing it: but then, yes, it's a pity that I don't find the time
to report the results on intervening versions, which also crashed.

What I have to do now, is set aside time today and tomorrow, to package
up the old scripts I use, describe them and their environment, and send
them to you (cc akpm in case I fall under a bus): so that you can
reproduce the crashes for yourself, and get to work on them.

Hugh


* Re: [failures] mm-vmscan-remove-unnecessary-lruvec-adding.patch removed from -mm tree
  2020-03-06  4:17       ` [failures] mm-vmscan-remove-unnecessary-lruvec-adding.patch removed from -mm tree Hugh Dickins
@ 2020-03-06  4:42         ` Alex Shi
  2020-03-06  4:46           ` Qian Cai
  2020-03-06 13:30         ` Alex Shi
  2020-03-06 14:54         ` Johannes Weiner
  2 siblings, 1 reply; 5+ messages in thread
From: Alex Shi @ 2020-03-06  4:42 UTC (permalink / raw)
  To: Hugh Dickins, Qian Cai
  Cc: Matthew Wilcox, LKML, Andrew Morton, aarcange, daniel.m.jordan,
	hannes, khlebnikov, kirill, kravetz, mhocko, mm-commits, tj,
	vdavydov.dev, yang.shi, linux-mm



On 2020/3/6 at 12:17 PM, Hugh Dickins wrote:
> On Thu, 5 Mar 2020, Qian Cai wrote:
>>> On Mar 5, 2020, at 10:38 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>>
>>> On Thu, Mar 05, 2020 at 10:32:18PM -0500, Qian Cai wrote:
>>>>> On Mar 5, 2020, at 9:50 PM, akpm@linux-foundation.org wrote:
>>>>> The patch titled
>>>>>    Subject: mm/vmscan: remove unnecessary lruvec adding
>>>>> has been removed from the -mm tree.  Its filename was
>>>>>    mm-vmscan-remove-unnecessary-lruvec-adding.patch
>>>>>
>>>>> This patch was dropped because it had testing failures
>>>> Andrew, do you have more information about this failure? I hit a bug
>>>> here under memory pressure and am wondering if this is related
>>>> which might save me some time digging…
> Very likely related.
> 

Hi all,

Apologies for the trouble!
And many thanks to you all for the reports!
Obviously, I skipped the memory stress testing that I should have done. Apologies again!

Qian Cai,
Which test case are you using? Could you share the reproduction steps with me?

Hugh,
Many thanks for the help! I will look for some memory stress test cases while waiting for yours.


Thank you all!
Alex



* Re: [failures] mm-vmscan-remove-unnecessary-lruvec-adding.patch removed from -mm tree
  2020-03-06  4:42         ` Alex Shi
@ 2020-03-06  4:46           ` Qian Cai
  0 siblings, 0 replies; 5+ messages in thread
From: Qian Cai @ 2020-03-06  4:46 UTC (permalink / raw)
  To: Alex Shi
  Cc: Hugh Dickins, Matthew Wilcox, LKML, Andrew Morton, aarcange,
	daniel.m.jordan, hannes, khlebnikov, kirill, kravetz, mhocko,
	mm-commits, tj, vdavydov.dev, yang.shi, linux-mm



> On Mar 5, 2020, at 11:42 PM, Alex Shi <alex.shi@linux.alibaba.com> wrote:
> 
> 
> 
> On 2020/3/6 at 12:17 PM, Hugh Dickins wrote:
>> On Thu, 5 Mar 2020, Qian Cai wrote:
>>>> On Mar 5, 2020, at 10:38 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>>> 
>>>> On Thu, Mar 05, 2020 at 10:32:18PM -0500, Qian Cai wrote:
>>>>>> On Mar 5, 2020, at 9:50 PM, akpm@linux-foundation.org wrote:
>>>>>> The patch titled
>>>>>>   Subject: mm/vmscan: remove unnecessary lruvec adding
>>>>>> has been removed from the -mm tree.  Its filename was
>>>>>>   mm-vmscan-remove-unnecessary-lruvec-adding.patch
>>>>>> 
>>>>>> This patch was dropped because it had testing failures
>>>>> Andrew, do you have more information about this failure? I hit a bug
>>>>> here under memory pressure and am wondering if this is related
>>>>> which might save me some time digging…
>> Very likely related.
>> 
> 
> Hi all,
> 
> Apologies for the trouble!
> And many thanks to you all for the reports!
> Obviously, I skipped the memory stress testing that I should have done. Apologies again!
> 
> Qian Cai,
> Which test case are you using? Could you share the reproduction steps with me?


LTP oom01 in a tight loop with swap:

# i=0; while :; do echo $((i++)); oom01; sleep 5; done

> 
> Hugh,
> Many thanks for the help! I will look for some memory stress test cases while waiting for yours.
> 
> 
> Thank you all!
> Alex




* Re: [failures] mm-vmscan-remove-unnecessary-lruvec-adding.patch removed from -mm tree
  2020-03-06  4:17       ` [failures] mm-vmscan-remove-unnecessary-lruvec-adding.patch removed from -mm tree Hugh Dickins
  2020-03-06  4:42         ` Alex Shi
@ 2020-03-06 13:30         ` Alex Shi
  2020-03-06 14:54         ` Johannes Weiner
  2 siblings, 0 replies; 5+ messages in thread
From: Alex Shi @ 2020-03-06 13:30 UTC (permalink / raw)
  To: Hugh Dickins, Qian Cai
  Cc: Matthew Wilcox, LKML, Andrew Morton, aarcange, daniel.m.jordan,
	hannes, khlebnikov, kirill, kravetz, mhocko, mm-commits, tj,
	vdavydov.dev, yang.shi, linux-mm



On 2020/3/6 at 12:17 PM, Hugh Dickins wrote:
>>>
>>> Subject: Re: [PATCH v9 00/21] per lruvec lru_lock for memcg
>>
>> I don’t see it on lore.kernel or anywhere. Private email?
> 
> You're right, sorry I didn't notice, lots of ccs but
> neither lkml nor linux-mm were on that thread from the start:

My fault. I thought people would often comment on each patch; I will take care of this from now on.

> 
> And now the bad news.
> 
> Andrew, please revert those six (or seven as they ended up in mmotm).
> 5.6-rc4-mm1 without them runs my tmpfs+loop+swapping+memcg+ksm kernel
> build loads fine (did four hours just now), but 5.6-rc4-mm1 itself
> crashed just after starting - seconds or minutes I didn't see,
> but it did not complete an iteration.
> 
> I thought maybe those six would be harmless (though I've not looked
> at them at all); but knew already that the full series is not good yet:
> I gave it a try over 5.6-rc4 on Monday, and crashed very soon on simpler
> testing, in different ways from what hits mmotm.
> 
> The first thing wrong with the full set was when I tried tmpfs+loop+
> swapping kernel builds in "mem=700M cgroup_disabled=memory", of course
> with CONFIG_DEBUG_LIST=y. That soon collapsed in a splurge of OOM kills
> and list_del corruption messages: __list_del_entry_valid < list_del <
> __page_cache_release < __put_page < put_page < __try_to_reclaim_swap <
> free_swap_and_cache < shmem_free_swap < shmem_undo_range.

I have been running kernel builds for 3 hours in a qemu-kvm guest booted with
"mem=700M cgroup_disabled=memory" and a swapfile. I hope to catch something
while waiting for your reproduction scripts. Thanks Hugh!

> 
> When I next tried with "mem=1G" and memcg enabled (but not being used),
> that managed some iterations, no OOM kills, no list_del warnings (was
> it swapping? perhaps, perhaps not, I was trying to go easy on it just
> to see if "cgroup_disabled=memory" had been the problem); but when
> rebooting after that, again list_del corruption messages and crash
> (I didn't note them down).
> 
> So I didn't take much notice of what the mmotm crash backtrace showed
> (but IIRC shmem and swap were in it).

Is there some place to get mmotm's crash backtrace?

> 
> Alex, I'm afraid you're focusing too much on performance results,
> without doing the basic testing needed - I thought we had given you
> some hints on the challenging areas (swapping, move_charge_at_immigrate,
> page migration) when we attached a *correctly working* 5.3 version back
> on 23rd August:
> 
> https://lore.kernel.org/linux-mm/alpine.LSU.2.11.1908231736001.16920@eggly.anvils/
> 
> (Correctly working, except missing two patches I'd mistakenly dropped
> as unnecessary in earlier rebases: but our discussions with Johannes
> later showed to be very necessary, though their races rarely seen.)
> 

Did you mean Johannes's question about the race on page->memcg in the previous email?

"> I don't see what prevents the lruvec from changing under compaction,
> neither in your patches nor in Hugh's. Maybe I'm missing something?"

https://lkml.org/lkml/2019/11/22/2153

Since then, I have tried two solutions to protect page->memcg: first
lock_page_memcg() (which was wrong), and second a new solution that takes the
PageLRU bit as a precondition for page isolation, which may work for memcg
migration, page migration in compaction, etc. Could you give some comments on this?

> I have not had the time (and do not expect to have the time) to review
> your series: maybe it's one or two small fixes away from being complete,
> or maybe it's still fundamentally flawed, I do not know.  I had naively
> hoped that you would help with a patchset that worked, rather than
> cutting it down into something which does not.

Sorry, Hugh, I didn't know you had a per-memcg lru_lock patchset before I sent
out my first version.

> Submitting your series to routine testing is much easier for me than
> reviewing it: but then, yes, it's a pity that I don't find the time
> to report the results on intervening versions, which also crashed.
> 
> What I have to do now, is set aside time today and tomorrow, to package
> up the old scripts I use, describe them and their environment, and send
> them to you (cc akpm in case I fall under a bus): so that you can
> reproduce the crashes for yourself, and get to work on them.
> 

Thanks in advance for the testing scripts; I believe they will help a lot.

BTW, I tried my best to organize these patches to make them straightforward, so
it should not cost a senior expert like you much time to go through the whole
series and give some valuable comments!

I am looking forward to hearing comments from you. :)

Thanks
Alex



* Re: [failures] mm-vmscan-remove-unnecessary-lruvec-adding.patch removed from -mm tree
  2020-03-06  4:17       ` [failures] mm-vmscan-remove-unnecessary-lruvec-adding.patch removed from -mm tree Hugh Dickins
  2020-03-06  4:42         ` Alex Shi
  2020-03-06 13:30         ` Alex Shi
@ 2020-03-06 14:54         ` Johannes Weiner
  2 siblings, 0 replies; 5+ messages in thread
From: Johannes Weiner @ 2020-03-06 14:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Qian Cai, Matthew Wilcox, LKML, Andrew Morton, aarcange,
	Alex Shi, daniel.m.jordan, khlebnikov, kirill, kravetz, mhocko,
	mm-commits, tj, vdavydov.dev, yang.shi, linux-mm

On Thu, Mar 05, 2020 at 08:17:46PM -0800, Hugh Dickins wrote:
> On Tue, 3 Mar 2020, Alex Shi wrote:
> > On 2020/3/3 at 6:12 AM, Andrew Morton wrote:
> > >> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> > >> and Yun Wang.
> > > I'm not seeing a lot of evidence of review and test activity yet.  But
> > > I think I'll grab patches 01-06 as they look like fairly
> > > straightforward improvements.
> > 
> > cc Fengguang and Rong Chen
> > 
> > I did some local functional testing and kselftest; they all look fine.
> > 0day only warns me if some case fails. So is no news good news? :)
> 
> And now the bad news.
> 
> Andrew, please revert those six (or seven as they ended up in mmotm).
> 5.6-rc4-mm1 without them runs my tmpfs+loop+swapping+memcg+ksm kernel
> build loads fine (did four hours just now), but 5.6-rc4-mm1 itself
> crashed just after starting - seconds or minutes I didn't see,
> but it did not complete an iteration.
> 
> I thought maybe those six would be harmless (though I've not looked
> at them at all); but knew already that the full series is not good yet:
> I gave it a try over 5.6-rc4 on Monday, and crashed very soon on simpler
> testing, in different ways from what hits mmotm.
> 
> The first thing wrong with the full set was when I tried tmpfs+loop+
> swapping kernel builds in "mem=700M cgroup_disabled=memory", of course
> with CONFIG_DEBUG_LIST=y. That soon collapsed in a splurge of OOM kills
> and list_del corruption messages: __list_del_entry_valid < list_del <
> __page_cache_release < __put_page < put_page < __try_to_reclaim_swap <
> free_swap_and_cache < shmem_free_swap < shmem_undo_range.
> 
> When I next tried with "mem=1G" and memcg enabled (but not being used),
> that managed some iterations, no OOM kills, no list_del warnings (was
> it swapping? perhaps, perhaps not, I was trying to go easy on it just
> to see if "cgroup_disabled=memory" had been the problem); but when
> rebooting after that, again list_del corruption messages and crash
> (I didn't note them down).
> 
> So I didn't take much notice of what the mmotm crash backtrace showed
> (but IIRC shmem and swap were in it).
> 
> Alex, I'm afraid you're focusing too much on performance results,
> without doing the basic testing needed - I thought we had given you
> some hints on the challenging areas (swapping, move_charge_at_immigrate,
> page migration) when we attached a *correctly working* 5.3 version back
> on 23rd August:
> 
> https://lore.kernel.org/linux-mm/alpine.LSU.2.11.1908231736001.16920@eggly.anvils/
> 
> (Correctly working, except missing two patches I'd mistakenly dropped
> as unnecessary in earlier rebases: but our discussions with Johannes
> later showed to be very necessary, though their races rarely seen.)
>
> I have not had the time (and do not expect to have the time) to review
> your series: maybe it's one or two small fixes away from being complete,
> or maybe it's still fundamentally flawed, I do not know.  I had naively
> hoped that you would help with a patchset that worked, rather than
> cutting it down into something which does not.

I'm a bit confused by this. I, and I believe Alex, kept going down a
different path because it didn't sound like there was a solution to
the compaction race. As I remember, the conversation ended on this:

: Your race here (again, lruvec lock taken then PageLRU observed, but
: page->mem_cgroup changed in between) really questions my whole scheme:
: I am not going to propose a solution now, I'll have to go back and
: recheck my assumptions all over.  Certainly isolate_migratepage_block()
: has a harder job than any other, but I need to re-review it all.

https://lore.kernel.org/lkml/alpine.LSU.2.11.1911221616580.1144@eggly.anvils/

That's certainly why I kept looking and eventually proposed using
PageLRU clearing as a lock. Maybe there is a better way to do it, but
I didn't see it.
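
The idea of using PageLRU clearing as a lock can be illustrated with a
minimal user-space sketch using C11 atomics. It is only a model under stated
assumptions, not the code from the series: struct fake_page, PG_LRU_BIT,
page_test_clear_lru(), page_set_lru(), page_move_memcg() and page_lru_demo()
are hypothetical names, and memcg_id merely stands in for page->mem_cgroup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LRU_BIT 0x1u   /* hypothetical flag bit, not the kernel's value */

struct fake_page {
    atomic_uint flags;
    int memcg_id;         /* stand-in for page->mem_cgroup */
};

/* Atomically clear the LRU bit; only the caller that saw it set wins. */
static bool page_test_clear_lru(struct fake_page *page)
{
    unsigned int old = atomic_fetch_and(&page->flags, ~PG_LRU_BIT);

    return (old & PG_LRU_BIT) != 0;
}

/* Put the page back: make it visible to other isolators again. */
static void page_set_lru(struct fake_page *page)
{
    atomic_fetch_or(&page->flags, PG_LRU_BIT);
}

/* Change the owning memcg only while this caller holds the page isolated. */
static bool page_move_memcg(struct fake_page *page, int new_memcg)
{
    if (!page_test_clear_lru(page))
        return false;     /* someone else has isolated the page */
    page->memcg_id = new_memcg;
    page_set_lru(page);
    return true;
}

/* Walk through the ordering that the scheme is meant to enforce. */
static void page_lru_demo(void)
{
    struct fake_page page;

    atomic_init(&page.flags, PG_LRU_BIT);
    page.memcg_id = 1;

    /* A compaction-like isolator wins the bit first. */
    assert(page_test_clear_lru(&page));
    /* A memcg move attempted meanwhile backs off; memcg_id is unchanged. */
    assert(!page_move_memcg(&page, 2));
    assert(page.memcg_id == 1);
    /* After putback, the move can proceed. */
    page_set_lru(&page);
    assert(page_move_memcg(&page, 2));
    assert(page.memcg_id == 2);
}
```

In this model, whoever clears the bit owns the page until putback, so the
memcg cannot change underneath an isolator that has already won the bit.
Whether the real locking scheme closes every window in the same way is
exactly what the testing discussed in this thread is meant to establish.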

An LRU list corruption in page_cache_release() suggests a bug in the
way this new locking scheme works or is applied - rather than a
gratuitous divergence from your series that could have been avoided.

> Submitting your series to routine testing is much easier for me than
> reviewing it: but then, yes, it's a pity that I don't find the time
> to report the results on intervening versions, which also crashed.
> 
> What I have to do now, is set aside time today and tomorrow, to package
> up the old scripts I use, describe them and their environment, and send
> them to you (cc akpm in case I fall under a bus): so that you can
> reproduce the crashes for yourself, and get to work on them.

I think that would be very useful. tmpfs+loop+swapping+memcg+ksm
kernel builds aren't exactly a go-to test case for most mm developers
(although maybe they should be!)

