linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>,
	akpm@linux-foundation.org,  mgorman@techsingularity.net,
	tj@kernel.org, khlebnikov@yandex-team.ru,
	 daniel.m.jordan@oracle.com, yang.shi@linux.alibaba.com,
	 willy@infradead.org, hannes@cmpxchg.org, lkp@intel.com,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, shakeelb@google.com,
	 iamjoonsoo.kim@lge.com, richard.weiyang@gmail.com
Subject: Re: [PATCH v11 00/16] per memcg lru lock
Date: Thu, 11 Jun 2020 15:09:54 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.11.2006111409280.10801@eggly.anvils> (raw)
In-Reply-To: <730c595b-f4bf-b16a-562e-de25b9b7eb97@linux.alibaba.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 7338 bytes --]

On Thu, 11 Jun 2020, Alex Shi wrote:
> 在 2020/6/10 上午11:22, Hugh Dickins 写道:
> > On Mon, 8 Jun 2020, Alex Shi wrote:
> >> 在 2020/6/8 下午12:15, Hugh Dickins 写道:
> >>>>  24 files changed, 487 insertions(+), 312 deletions(-)
> >>> Hi Alex,
> >>>
> >>> I didn't get to try v10 at all, waited until Johannes's preparatory
> >>> memcg swap cleanup was in mmotm; but I have spent a while thrashing
> >>> this v11, and can happily report that it is much better than v9 etc:
> >>> I believe this memcg lru_lock work will soon be ready for v5.9.
> >>>
> >>> I've not yet found any flaw at the swapping end, but fixes are needed
> >>> for isolate_migratepages_block() and mem_cgroup_move_account(): I've
> >>> got a series of 4 fix patches to send you (I guess two to fold into
> >>> existing patches of yours, and two to keep as separate from me).
> >>>
> >>> I haven't yet written the patch descriptions, will return to that
> >>> tomorrow.  I expect you will be preparing a v12 rebased on v5.8-rc1
> >>> or v5.8-rc2, and will be able to include these fixes in that.
> >>
> >> I am very glad to get your help on this feature! 
> >>
> >> and looking forward for your fixes tomorrow. :)
> >>
> >> Thanks a lot!
> >> Alex
> > 
> > Sorry, Alex, the news is not so good today.
> > 
> > You'll have noticed I sent nothing yesterday. That's because I got
> > stuck on my second patch: could not quite convince myself that it
> > was safe.
> 
> Hi Hugh,
> 
> Thanks a lot for your help and effort! I very appreciate for this.
> 
> > 
> > I keep hinting at these patches, and I can't complete their writeups
> > until I'm convinced; but to give you a better idea of what they do:
> > 
> > 1. Fixes isolate_fail and isolate_abort in isolate_migratepages_block().
> 
> I guess I know this after mm-compaction-avoid-vm_bug_onpageslab-in-page_mapcount.patch
> was removed.

No, I already assumed you had backed that out: these are fixes beyond that.

> 
> > 2. Fixes unsafe use of trylock_page() in __isolate_lru_page_prepare().
> > 3. Reverts 07/16 inversion of lock ordering in split_huge_page_to_list().
> > 4. Adds lruvec lock protection in mem_cgroup_move_account().
> 
> Sorry for can't follow you for above issues.

Indeed, more explanation needed: coming.

> Anyway, I will send out new patchset
> with the first issue fixed. and then let's discussion base on it.

Sigh. I wish you had waited for me to send you fixes, or waited for an
identifiable tag like 5.8-rc1.  Andrew has been very hard at work with
mm patches to Linus, but it looks like there are still "data_race" mods
to come before -rc1, which may stop your v12 from applying cleanly.

> 
> > 
> > In the second, I was using rcu_read_lock() instead of trylock_page()
> > (like in my own patchset), but could not quite be sure of the case when
> > PageSwapCache gets set at the wrong moment. Gave up for the night, and
> > in the morning abandoned that, instead just shifting the call to
> > __isolate_lru_page_prepare() after the get_page_unless_zero(),
> > where that trylock_page() becomes safe (no danger of stomping on page
> > flags while page is being freed or newly allocated to another owner).
> 
> Sorry, I don't know the problem of trylock_page here? Could you like to
> describe it as a race?

Races, yes. Look, I'll send you now patches 1 and 2: at least with those
in it should be safe for you and others to test compaction (if 5.8-rc1
turns out well: I think so much has gone in that it might have unrelated
problems, and often the -rc2 is much more stable).

But no point in sending 3 and 4 at this point, since ...

> 
> > 
> > I thought that a very safe change, but best to do some test runs with
> > it in before finalizing. And was then unpleasantly surprised to hit a
> > VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
> > lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
> > pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
> > Then similar but < rotate_reclaimable_page after 8 hours on another.
> > 
> > Only seen once before: that's what drove me to add patch 4 (with 3 to
> > revert the locking before it): somehow, when adding the lruvec locking
> > there, I just took it for granted that your patchset would have the
> > appropriate locking (or TestClearPageLRU magic) at the other end.
> > 
> > But apparently not. And I'm beginning to think that TestClearPageLRU
> > was just to distract the audience from the lack of proper locking.
> > 
> > I have certainly not concluded that yet, but I'm having to think about
> > an area of the code which I'd imagined you had under control (and I'm
> > puzzled why my testing has found it so very hard to hit). If we're
> > lucky, I'll find that pagevec_move_tail is a special case, and
> > nothing much else needs changing; but I doubt that will be so.

... shows that your locking primitives are not yet good enough
to handle the case when tasks are moved between memcgs with
move_charge_at_immigrate set.  "bin/cg m" in the tests I sent,
but today I'm changing its "seconds=60" to "seconds=1" in hope
of speeding up the reproduction.

Ah, good, two machines crashed in 1.5 hours: but I don't need to
examine the crashes, now that it's obvious there's no protection -
please, think about rotate_reclaimable_page() (there will be more
cases, but in practice that seems easiest to hit, so focus on that)
and how it is not protected from mem_cgroup_move_account().

I'm thinking too. Maybe judicious use of lock_page_memcg() can fix it
(8 years ago it was unsuitable, but a lot has changed for the better
since then); otherwise it's back to what I've been doing all along,
taking the likely lruvec lock, and checking under that lock whether
we have the right lock (as your lruvec_memcg_debug() does), retrying
if not. Which may be more efficient than involving lock_page_memcg().

But I guess still worth sending my first two patches, since most of us
use move_charge_at_immigrate only for... testing move_charge_at_immigrate.
Whereas compaction bugs can hit any of us at any time.

> > 
> > There's one other unexplained and unfixed bug I've seen several times
> > while exercising mem_cgroup_move_account(): refcount_warn_saturate()
> > from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
> > I'll be glad if that goes away when the lruvec locking is fixed,
> > but don't understand the connection. And it's quite possible that
> > this refcounting bug has nothing to do with your changes: I have
> > not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
> > but I didn't really try long enough to be sure.

I got one of those quite quickly too after setting "cg m"'s seconds=1.
I think the best thing I can do while thinking and researching, is
give 5.7-rc7-mm1 a run on that machine with the speeded up moving,
to see whether or not that refcount bug reproduces.

> > 
> > (I should also warn, that I'm surprised by the amount of change
> > 11/16 makes to mm/mlock.c: I've not been exercising mlock at all.)
> 
> yes, that is a bit complex. I have tried the mlock cases in selftest with
> your swap&build case. They are all fine with 300 times run.

Good, thanks.

Hugh

  reply	other threads:[~2020-06-11 22:10 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-28 11:00 [PATCH v11 00/16] per memcg lru lock Alex Shi
2020-05-28 11:00 ` [PATCH v11 01/16] mm/vmscan: remove unnecessary lruvec adding Alex Shi
2020-05-28 11:00 ` [PATCH v11 02/16] mm/page_idle: no unlikely double check for idle page counting Alex Shi
2020-05-28 11:00 ` [PATCH v11 03/16] mm/compaction: correct the comments of compact_defer_shift Alex Shi
2020-05-28 11:00 ` [PATCH v11 04/16] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
2020-05-28 11:00 ` [PATCH v11 05/16] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
2020-05-28 11:00 ` [PATCH v11 06/16] mm/thp: clean up lru_add_page_tail Alex Shi
2020-05-28 11:00 ` [PATCH v11 07/16] mm/thp: narrow lru locking Alex Shi
2020-05-28 11:00 ` [PATCH v11 08/16] mm/memcg: add debug checking in lock_page_memcg Alex Shi
2020-05-28 11:00 ` [PATCH v11 09/16] mm/lru: introduce TestClearPageLRU Alex Shi
2020-05-28 11:00 ` [PATCH v11 10/16] mm/compaction: do page isolation first in compaction Alex Shi
2020-05-28 11:00 ` [PATCH v11 11/16] mm/mlock: reorder isolation sequence during munlock Alex Shi
2020-05-28 11:00 ` [PATCH v11 12/16] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
2020-05-28 11:00 ` [PATCH v11 13/16] mm/lru: introduce the relock_page_lruvec function Alex Shi
2020-05-28 11:00 ` [PATCH v11 14/16] mm/vmscan: use relock for move_pages_to_lru Alex Shi
2020-05-28 11:00 ` [PATCH v11 15/16] mm/pgdat: remove pgdat lru_lock Alex Shi
2020-05-28 11:00 ` [PATCH v11 16/16] mm/lru: revise the comments of lru_lock Alex Shi
2020-06-08  4:15 ` [PATCH v11 00/16] per memcg lru lock Hugh Dickins
2020-06-08  6:13   ` Alex Shi
2020-06-10  3:22     ` Hugh Dickins
2020-06-11  6:06       ` Alex Shi
2020-06-11 22:09         ` Hugh Dickins [this message]
2020-06-12 10:43           ` Alex Shi
2020-06-16  6:14           ` Alex Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LSU.2.11.2006111409280.10801@eggly.anvils \
    --to=hughd@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.shi@linux.alibaba.com \
    --cc=cgroups@vger.kernel.org \
    --cc=daniel.m.jordan@oracle.com \
    --cc=hannes@cmpxchg.org \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=khlebnikov@yandex-team.ru \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lkp@intel.com \
    --cc=mgorman@techsingularity.net \
    --cc=richard.weiyang@gmail.com \
    --cc=shakeelb@google.com \
    --cc=tj@kernel.org \
    --cc=willy@infradead.org \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).