On Thu, 11 Jun 2020, Alex Shi wrote: > 在 2020/6/10 上午11:22, Hugh Dickins 写道: > > On Mon, 8 Jun 2020, Alex Shi wrote: > >> 在 2020/6/8 下午12:15, Hugh Dickins 写道: > >>>> 24 files changed, 487 insertions(+), 312 deletions(-) > >>> Hi Alex, > >>> > >>> I didn't get to try v10 at all, waited until Johannes's preparatory > >>> memcg swap cleanup was in mmotm; but I have spent a while thrashing > >>> this v11, and can happily report that it is much better than v9 etc: > >>> I believe this memcg lru_lock work will soon be ready for v5.9. > >>> > >>> I've not yet found any flaw at the swapping end, but fixes are needed > >>> for isolate_migratepages_block() and mem_cgroup_move_account(): I've > >>> got a series of 4 fix patches to send you (I guess two to fold into > >>> existing patches of yours, and two to keep as separate from me). > >>> > >>> I haven't yet written the patch descriptions, will return to that > >>> tomorrow. I expect you will be preparing a v12 rebased on v5.8-rc1 > >>> or v5.8-rc2, and will be able to include these fixes in that. > >> > >> I am very glad to get your help on this feature! > >> > >> and looking forward for your fixes tomorrow. :) > >> > >> Thanks a lot! > >> Alex > > > > Sorry, Alex, the news is not so good today. > > > > You'll have noticed I sent nothing yesterday. That's because I got > > stuck on my second patch: could not quite convince myself that it > > was safe. > > Hi Hugh, > > Thanks a lot for your help and effort! I very appreciate for this. > > > > > I keep hinting at these patches, and I can't complete their writeups > > until I'm convinced; but to give you a better idea of what they do: > > > > 1. Fixes isolate_fail and isolate_abort in isolate_migratepages_block(). > > I guess I know this after mm-compaction-avoid-vm_bug_onpageslab-in-page_mapcount.patch > was removed. No, I already assumed you had backed that out: these are fixes beyond that. > > > 2. Fixes unsafe use of trylock_page() in __isolate_lru_page_prepare(). > > 3. Reverts 07/16 inversion of lock ordering in split_huge_page_to_list(). > > 4. Adds lruvec lock protection in mem_cgroup_move_account(). > > Sorry for can't follow you for above issues. Indeed, more explanation needed: coming. > Anyway, I will send out new patchset > with the first issue fixed. and then let's discussion base on it. Sigh. I wish you had waited for me to send you fixes, or waited for an identifiable tag like 5.8-rc1. Andrew has been very hard at work with mm patches to Linus, but it looks like there are still "data_race" mods to come before -rc1, which may stop your v12 from applying cleanly. > > > > > In the second, I was using rcu_read_lock() instead of trylock_page() > > (like in my own patchset), but could not quite be sure of the case when > > PageSwapCache gets set at the wrong moment. Gave up for the night, and > > in the morning abandoned that, instead just shifting the call to > > __isolate_lru_page_prepare() after the get_page_unless_zero(), > > where that trylock_page() becomes safe (no danger of stomping on page > > flags while page is being freed or newly allocated to another owner). > > Sorry, I don't know the problem of trylock_page here? Could you like to > describe it as a race? Races, yes. Look, I'll send you now patches 1 and 2: at least with those in it should be safe for you and others to test compaction (if 5.8-rc1 turns out well: I think so much has gone in that it might have unrelated problems, and often the -rc2 is much more stable). But no point in sending 3 and 4 at this point, since ... > > > > > I thought that a very safe change, but best to do some test runs with > > it in before finalizing. And was then unpleasantly surprised to hit a > > VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from > > lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn < > > pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine. > > Then similar but < rotate_reclaimable_page after 8 hours on another. > > > > Only seen once before: that's what drove me to add patch 4 (with 3 to > > revert the locking before it): somehow, when adding the lruvec locking > > there, I just took it for granted that your patchset would have the > > appropriate locking (or TestClearPageLRU magic) at the other end. > > > > But apparently not. And I'm beginning to think that TestClearPageLRU > > was just to distract the audience from the lack of proper locking. > > > > I have certainly not concluded that yet, but I'm having to think about > > an area of the code which I'd imagined you had under control (and I'm > > puzzled why my testing has found it so very hard to hit). If we're > > lucky, I'll find that pagevec_move_tail is a special case, and > > nothing much else needs changing; but I doubt that will be so. ... shows that your locking primitives are not yet good enough to handle the case when tasks are moved between memcgs with move_charge_at_immigrate set. "bin/cg m" in the tests I sent, but today I'm changing its "seconds=60" to "seconds=1" in hope of speeding up the reproduction. Ah, good, two machines crashed in 1.5 hours: but I don't need to examine the crashes, now that it's obvious there's no protection - please, think about rotate_reclaimable_page() (there will be more cases, but in practice that seems easiest to hit, so focus on that) and how it is not protected from mem_cgroup_move_account(). I'm thinking too. Maybe judicious use of lock_page_memcg() can fix it (8 years ago it was unsuitable, but a lot has changed for the better since then); otherwise it's back to what I've been doing all along, taking the likely lruvec lock, and checking under that lock whether we have the right lock (as your lruvec_memcg_debug() does), retrying if not. Which may be more efficient than involving lock_page_memcg(). But I guess still worth sending my first two patches, since most of us use move_charge_at_immigrate only for... testing move_charge_at_immigrate. Whereas compaction bugs can hit any of us at any time. > > > > There's one other unexplained and unfixed bug I've seen several times > > while exercising mem_cgroup_move_account(): refcount_warn_saturate() > > from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many(). > > I'll be glad if that goes away when the lruvec locking is fixed, > > but don't understand the connection. And it's quite possible that > > this refcounting bug has nothing to do with your changes: I have > > not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1, > > but I didn't really try long enough to be sure. I got one of those quite quickly too after setting "cg m"'s seconds=1. I think the best thing I can do while thinking and researching, is give 5.7-rc7-mm1 a run on that machine with the speeded up moving, to see whether or not that refcount bug reproduces. > > > > (I should also warn, that I'm surprised by the amount of change > > 11/16 makes to mm/mlock.c: I've not been exercising mlock at all.) > > yes, that is a bit complex. I have tried the mlock cases in selftest with > your swap&build case. They are all fine with 300 times run. Good, thanks. Hugh