From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 81F33C2BB1D for ; Tue, 14 Apr 2020 04:53:00 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 26C4A206D5 for ; Tue, 14 Apr 2020 04:53:00 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 26C4A206D5 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id AB5C68E0003; Tue, 14 Apr 2020 00:52:59 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A65B98E0001; Tue, 14 Apr 2020 00:52:59 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 954878E0003; Tue, 14 Apr 2020 00:52:59 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0140.hostedemail.com [216.40.44.140]) by kanga.kvack.org (Postfix) with ESMTP id 7EEFD8E0001 for ; Tue, 14 Apr 2020 00:52:59 -0400 (EDT) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 362AC5008 for ; Tue, 14 Apr 2020 04:52:59 +0000 (UTC) X-FDA: 76705240878.24.land82_8e515b7caad47 X-HE-Tag: land82_8e515b7caad47 X-Filterd-Recvd-Size: 13217 Received: from out30-133.freemail.mail.aliyun.com (out30-133.freemail.mail.aliyun.com [115.124.30.133]) by imf22.hostedemail.com (Postfix) with ESMTP for ; Tue, 14 Apr 2020 04:52:57 +0000 (UTC) X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01f04397;MF=alex.shi@linux.alibaba.com;NM=1;PH=DS;RN=38;SR=0;TI=SMTPD_---0TvUnJ20_1586839969; Received: from IT-FVFX43SYHV2H.local(mailfrom:alex.shi@linux.alibaba.com fp:SMTPD_---0TvUnJ20_1586839969) by smtp.aliyun-inc.com(127.0.0.1); Tue, 14 Apr 2020 12:52:50 +0800 Subject: Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock To: Johannes Weiner Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, mgorman@techsingularity.net, tj@kernel.org, hughd@google.com, khlebnikov@yandex-team.ru, daniel.m.jordan@oracle.com, yang.shi@linux.alibaba.com, willy@infradead.org, shakeelb@google.com, Michal Hocko , Vladimir Davydov , Roman Gushchin , Chris Down , Thomas Gleixner , Vlastimil Babka , Qian Cai , Andrey Ryabinin , "Kirill A. Shutemov" , =?UTF-8?B?SsOpcsO0bWUgR2xpc3Nl?= , Andrea Arcangeli , David Rientjes , "Aneesh Kumar K.V" , swkhack , "Potyra, Stefan" , Mike Rapoport , Stephen Rothwell , Colin Ian King , Jason Gunthorpe , Mauro Carvalho Chehab , Peng Fan , Nikolay Borisov , Ira Weiny , Kirill Tkhai , Yafang Shao References: <1579143909-156105-1-git-send-email-alex.shi@linux.alibaba.com> <1579143909-156105-4-git-send-email-alex.shi@linux.alibaba.com> <20200116215222.GA64230@cmpxchg.org> <20200413180725.GA99267@cmpxchg.org> From: Alex Shi Message-ID: <8e7bf170-2bb5-f862-c12b-809f7f7d96cb@linux.alibaba.com> Date: Tue, 14 Apr 2020 12:52:30 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:68.0) Gecko/20100101 Thunderbird/68.6.0 MIME-Version: 1.0 In-Reply-To: <20200413180725.GA99267@cmpxchg.org> Content-Type: text/plain; charset=gbk Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: =D4=DA 2020/4/14 =C9=CF=CE=E72:07, Johannes Weiner =D0=B4=B5=C0: > On Mon, Apr 13, 2020 at 06:48:22PM +0800, Alex Shi wrote: >>> In a previous review, I pointed out the following race condition >>> between page charging and compaction: >>> >>> compaction: generic_file_buffered_read: >>> >>> page_cache_alloc() >>> >>> !PageBuddy() >>> >>> lock_page_lruvec(page) >>> lruvec =3D mem_cgroup_page_lruvec() >>> spin_lock(&lruvec->lru_lock) >>> if lruvec !=3D mem_cgroup_page_lruvec() >>> goto again >>> >>> add_to_page_cache_lru() >>> mem_cgroup_commit_charge() >>> page->mem_cgroup =3D foo >>> lru_cache_add() >>> __pagevec_lru_add() >>> SetPageLRU() >>> >>> if PageLRU(page): >>> __isolate_lru_page() >>> >>> As far as I can see, you have not addressed this. You have added >>> lock_page_memcg(), but that prevents charged pages from moving betwee= n >>> cgroups, it does not prevent newly allocated pages from being charged= . >>> >>> It doesn't matter how many times you check the lruvec before and afte= r >>> locking - if you're looking at a free page, it might get allocated, >>> charged and put on a new lruvec after you're done checking, and then >>> you isolate a page from an unlocked lruvec. >>> >>> You simply cannot serialize on page->mem_cgroup->lruvec when >>> page->mem_cgroup isn't stable. You need to serialize on the page >>> itself, one way or another, to make this work. >>> >>> >>> So here is a crazy idea that may be worth exploring: >>> >>> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's >>> linked list. >>> >>> Can we make PageLRU atomic and use it to stabilize the lru_lock >>> instead, and then use the lru_lock only serialize list operations? >>> >>> I.e. in compaction, you'd do >>> >>> if (!TestClearPageLRU(page)) >>> goto isolate_fail; >>> /* >>> * We isolated the page's LRU state and thereby locked out all >>> * other isolators, including cgroup page moving, page reclaim, >>> * page freeing etc. That means page->mem_cgroup is now stable >>> * and we can safely look up the correct lruvec and take the >>> * page off its physical LRU list. >>> */ >>> lruvec =3D mem_cgroup_page_lruvec(page); >>> spin_lock_irq(&lruvec->lru_lock); >>> del_page_from_lru_list(page, lruvec, page_lru(page)); >>> >>> Putback would mostly remain the same (although you could take the >>> PageLRU setting out of the list update locked section, as long as it'= s >>> set after the page is physically linked): >>> >>> /* LRU isolation pins page->mem_cgroup */ >>> lruvec =3D mem_cgroup_page_lruvec(page) >>> spin_lock_irq(&lruvec->lru_lock); >>> add_page_to_lru_list(...); >>> spin_unlock_irq(&lruvec->lru_lock); >>> >>> SetPageLRU(page); >>> >>> And you'd have to carefully review and rework other sites that rely o= n >>> PageLRU: reclaim, __page_cache_release(), __activate_page() etc. >>> >>> Especially things like activate_page(), which used to only check >>> PageLRU to shuffle the page on the LRU list would now have to briefly >>> clear PageLRU and then set it again afterwards. >>> >>> However, aside from a bit more churn in those cases, and the >>> unfortunate additional atomic operations, I currently can't think of = a >>> fundamental reason why this wouldn't work. >>> >>> Hugh, what do you think? >>> >> >> Hi Johannes >> >> As to the idea of TestClearPageLRU, we except the following scenario >> compaction commit_charge >> if (TestClearPageLRU) >> !TestClearPageLRU lock_page_lruvec >> goto isolate_fail; del_from_lru_list >> unlock_page_lruvec >> >> But there is a difficult situation to handle: >> >> compaction commit_charge >> TestClearPageLRU >> !TestClearPageLRU >> >> page possible state: >> a, reclaiming, b, moving between l= ru list, c, migrating, like in compaction >> d, mlocking, e, split_huge_page, >> >> If the page lru bit was cleared in commit_charge with lrucare, >> we still have no idea if the page was isolated by the reason from a~e >> or the page is never on LRU, to deal with different reasons is high co= st. >> >> So as to the above issue you mentioned, Maybe the better idea is to >> set lrucare when do mem_cgroup_commit_charge(), since the charge actio= n >> is not often. What's your idea of this solution? >=20 > Hm, yes, the lrucare scenario is a real problem. If it can isolate the > page, fine, but if not, it changes page->mem_cgroup on a page that > somebody else has isolated, having indeed no idea who they are and how > they are going to access page->mem_cgroup. >=20 > Right now it's safe because of secondary protection on top of > isolation: split_huge_page keeps the lru_lock held throughout; > reclaim, cgroup migration, page migration, compaction etc. hold the > page lock which locks out swapcache charging. >=20 > But it complicates the serialization model immensely and makes it > subtle and error prone. >=20 > I'm not sure how unconditionally taking the lru_lock when charging > would help. Can you lay out what you have in mind in prototype code, > like I'm using below, for isolation, putback, charging, compaction? The situation would back to relock scheme, the lru_lock will compete with= =20 the some root_memcg->lru_lock in practical. So no needs to distinguish=20 putback, compaction etc.=20 But I don't know how much impact on this alloc path... compaction: generic_file_buffered_read: page_cache_alloc() !PageBuddy() lock_page_lruvec(page) lruvec =3D mem_cgroup_page_lruvec() spin_lock(&lruvec->lru_lock) if lruvec !=3D mem_cgroup_page_lruvec() goto again add_to_page_cache_lru() mem_cgroup_commit_charge() spin_lock_irq(page->memcg->lruvec->lru_lock) page->mem_cgroup =3D foo spin_unlock_irq(page->memcg->lruvec->lru_lock) lru_cache_add() __pagevec_lru_add() SetPageLRU() if PageLRU(page): __isolate_lru_page() >=20 > That said, charging actually is a hotpath. I'm reluctant to > unconditionally take the LRU lock there. But if you can make things a > lot simpler this way, it could be worth exploring. >=20 > In the PageLRU locking scheme, I can see a way to make putback safe > wrt lrucare charging, but I'm not sure about isolation: >=20 > putback: > lruvec =3D page->mem_cgroup->lruvecs[pgdat] > spin_lock(lruvec->lru_lock) > if lruvec !=3D page->mem_cgroup->lruvecs[pgdat]: > /* > * commit_charge(lrucare=3Dtrue) can charge an uncharged swapcache > * page while we had it isolated. This changes page->mem_cgroup, > * but it can only happen once. Look up the new cgroup. > */ > spin_unlock(lruvec->lru_lock) > lruvec =3D page->mem_cgroup->lruvecs[pgdat] > spin_lock(lruvec->lru_lock) > add_page_to_lru_list(page, lruvec, ...) > SetPageLRU(page); > spin_unlock(lruvec->lru_lock) >=20 > commit_charge: > if (lrucare) > spin_lock(root_memcg->lru_lock) > /* > * If we can isolate the page, we'll move it to the new > * cgroup's LRU list. If somebody else has the page > * isolated, we need their putback to move it to the > * new cgroup. If they see the old cgroup - the root - > * they will spin until we're done and recheck. > */ > if ((lru =3D TestClearPageLRU(page))) > del_page_from_lru_list() > page->mem_cgroup =3D new; > if (lrucare) > spin_unlock(root_memcg->lru_lock) > if (lru) > spin_lock(new->lru_lock) > add_page_to_lru_list() > spin_unlock(new->lru_lock); > SetPageLRU(page) >=20 > putback would need to 1) recheck once after acquiring the lock and 2) > SetPageLRU while holding the lru_lock after all. But it works because > we know the old cgroup: if the putback sees the old cgroup, we know > it's the root cgroup, and we have that locked until we're done with > the update. And if putback manages to lock the old cgroup before us, > we will spin until the isolator is done, and then either be able to > isolate it ourselves or, if racing with yet another isolator, hold the > lock and delay putback until we're done. >=20 > But isolation actually needs to lock out charging, or it would operate > on the wrong list: >=20 > isolation: commit_charge: > if (TestClearPageLRU(page)) > page->mem_cgroup =3D new > // page is still physically on > // the root_mem_cgroup's LRU. We're > // updating the wrong list: > memcg =3D page->mem_cgroup > spin_lock(memcg->lru_lock) > del_page_from_lru_list(page, memcg) > spin_unlock(memcg->lru_lock) >=20 Yes, this is the problem I encountered now for mem_cgroup_lru_size incorr= ect. =20 > lrucare really is a mess. Even before this patch series, it makes > things tricky and subtle and error prone. >=20 > The only reason we're doing it is for when there is swapping without > swap tracking, in which case swap reahadead needs to put pages on the > LRU but cannot charge them until we have a faulting vma later. >=20 > But it's not clear how practical such a configuration is. Both memory > and swap are shared resources, and isolation isn't really effective > when you restrict access to memory but then let workloads swap freely. Yes, we didn't figure a good usage on MEMCG_SWAP too. And if swaping happ= ens often, the different memcg's memory were swaped to same disk and mixed to= gether which cause readahead useless. >=20 > Plus, the overhead of tracking is tiny - 512k per G of swap (0.04%). >=20 > Maybe we should just delete MEMCG_SWAP and unconditionally track swap > entry ownership when the memory controller is enabled. I don't see a > good reason not to, and it would simplify the entire swapin path, the > LRU locking, and the page->mem_cgroup stabilization rules. >=20 Sorry for not follow you up, did you mean just remove the MEMCG_SWAP conf= iguration and keep the feature in default memcg?=20 That does can remove lrucare, but PageLRU lock scheme still fails since we can not isolate the page during commit_charge, is that right? Thanks a lot! Alex