Subject: Re: [PATCH v3 7/8] mm/mempolicy: use a standard migration target
 allocation callback
To: Hugh Dickins
Cc: Qian Cai, js1304@gmail.com, Andrew Morton, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@lge.com, Vlastimil Babka,
 Christoph Hellwig, Roman Gushchin, Naoya Horiguchi, Michal Hocko,
 Joonsoo Kim
References: <1592892828-1934-1-git-send-email-iamjoonsoo.kim@lge.com>
 <1592892828-1934-8-git-send-email-iamjoonsoo.kim@lge.com>
 <20200708012044.GC992@lca.pw>
From: Mike Kravetz
Message-ID: <68e09cac-d2d9-9ddf-6e10-0b8cc0befe82@oracle.com>
Date: Fri, 9 Oct 2020 10:42:55 -0700

On 10/8/20 10:50 PM, Hugh Dickins wrote:
> On Thu, 8 Oct 2020, Mike Kravetz wrote:
>> On 10/7/20 8:21 PM, Hugh Dickins wrote:
>>>
>>> Mike, j'accuse... your 5.7 commit c0d0381ade79 ("hugetlbfs:
>>> use i_mmap_rwsem for more pmd sharing synchronization"), in which
>>> unmap_and_move_huge_page() now passes the TTU_RMAP_LOCKED flag to
>>> try_to_unmap(), because it's already holding mapping->i_mmap_rwsem:
>>> but that is not the right lock to secure an anon_vma lookup.
>>
>> Thanks Hugh! Your analysis is correct and the code in that commit is
>> not correct. I was so focused on the file mapping case, I overlooked
>> (actually introduced) this issue for anon mappings.
>>
>> Let me verify that this indeed is the root cause. However, since
>> move_pages12 migrates anon hugetlb pages, it certainly does look to
>> be the case.
>>
>>> I intended to send a patch, passing TTU_RMAP_LOCKED only in the
>>> !PageAnon case (and, see vma_adjust(), anon_vma lock conveniently
>>> nests inside i_mmap_rwsem); but then wondered if i_mmap_rwsem was
>>> needed in that case or not, so looked deeper into c0d0381ade79.
>>>
>>> Hmm, not even you liked it! But the worst of it looks simply
>>> unnecessary to me, and I hope can be deleted - but better by you
>>> than by me (in particular, you were trying to kill 1) and 2) birds
>>> with one stone, and I've always given up on understanding hugetlb's
>>> reservations: I suspect that side of it is irrelevant here,
>>> but I wouldn't pretend to be sure).
>>>
>>> How could you ever find a PageAnon page in a vma_shareable() area?
>>>
>>> It is all rather confusing (vma_shareable() depending on VM_MAYSHARE,
>>> whereas is_cow_mapping() on VM_SHARED and VM_MAYWRITE: they have to
>>> be studied together with do_mmap()'s
>>> 		vm_flags |= VM_SHARED | VM_MAYSHARE;
>>> 		if (!(file->f_mode & FMODE_WRITE))
>>> 			vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
>>>
>>> (And let me add to the confusion by admitting that, prior to 3.15's
>>> cda540ace6a1 "mm: get_user_pages(write,force) refuse to COW in
>>> shared areas", maybe it was possible to find a PageAnon there.)
>>>
>>> But my belief (best confirmed by you running your tests with a
>>> suitably placed BUG_ON or WARN_ON) is that you'll never find a
>>> PageAnon in a vma_shareable() area, so will never need try_to_unmap()
>>> to unshare a pagetable in the PageAnon case, so won't need
>>> i_mmap_rwsem for PageAnon there, and _get_hugetlb_page_mapping()
>>> (your function that deduces an address_space from an anon_vma) can
>>> just be deleted.
>>
>> Yes, it is confusing. Let me look into this. I would be really happy
>> to delete that ugly function.
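
To make sure we are thinking about the same fix: below is roughly what
I plan to test in unmap_and_move_huge_page() (only a sketch, not a
tested patch; I would also put a WARN_ON in the pmd sharing path to
confirm your belief about PageAnon, not shown here):

	if (page_mapped(hpage)) {
		enum ttu_flags ttu = TTU_MIGRATION | TTU_IGNORE_MLOCK |
				     TTU_IGNORE_ACCESS;

		/*
		 * i_mmap_rwsem secures the rmap walk only for file
		 * pages. For anon pages, try_to_unmap() must take
		 * the anon_vma lock itself, so do not claim that the
		 * rmap lock is already held.
		 */
		if (!PageAnon(hpage))
			ttu |= TTU_RMAP_LOCKED;

		try_to_unmap(hpage, ttu);
		page_was_mapped = 1;
	}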
>>> (And in passing, may I ask what hugetlb_page_mapping_lock_write()'s
>>> hpage->_mapcount inc and dec are for? You comment it as a hack,
>>> but don't explain what needs that hack, and I don't see it.)
>>
>> We are trying to lock the mapping (mapping->i_mmap_rwsem). We know
>> mapping is valid, because we obtained it from page_mapping() and it
>> will remain valid because we have the page locked. Page needs to be
>> unlocked to unmap. However, we have to drop page lock in order to
>> acquire i_mmap_rwsem. Once we drop page lock, mapping could become
>> invalid. So, the code artificially increments mapcount so that
>> mapping will remain valid when unmapping the page.
>
> No, unless you can point me to some other hugetlbfs-does-it-differently
> (and I didn't see it there in that commit), raising _mapcount does not
> provide any such protection; but does add the possibility of a
> "BUG: Bad page cache" and leak from unaccount_page_cache_page().
>
> Earlier in the day I was trying to work out what to recommend instead,
> but had to turn aside to something else: I'll try again tomorrow.
>
> It's a problem I've faced before in tmpfs, keeping a hold on the
> mapping while page lock is dropped. Quite awkward: igrab() looks as
> if it's the right thing to use, but turns out to give no protection
> against umount. Last time around, I ended up with a stop_eviction
> count in the shmem inode, which shmem_evict_inode() waits on if
> necessary. Something like that could be done for hugetlbfs too,
> but I'd prefer to do it without adding extra, if there is a way.

Thanks.

>>
>> As mentioned above, I hope all this can be removed.
>
> If you continue to nest page lock inside i_mmap_rwsem for hugetlbfs,
> then I think that part of hugetlb_page_mapping_lock_write() has to
> remain. I'd much prefer that hugetlbfs did not reverse the usual
> nesting, but accept that you had reasons for doing it that way.

Yes, that is necessary with the change to lock order.

Yesterday I found another issue/case to handle in the hugetlb COW code
caused by the difference in lock nesting. As a result, I am rethinking
the decision to change the locking order.

The primary reason for changing the lock order was to make the hugetlb
page fault code not have to worry about pte pointers changing. The
issue is that the pte pointer you get (or create) while walking the
table without the page table lock could go away due to unsharing of
the PMD. We can walk the table again after acquiring the lock, then
validate and possibly retry. However, the backout code in this area,
which previously had to deal with truncation as well, was quite
fragile and did not work in all corner cases. This was mostly due to
lovely huge page reservations. I thought that adding more complexity
to the backout code was going to cause more issues. Changing the
locking order eliminated the pte pointer race as well as the races
with truncation. However, it created new locking issues. :(
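
To illustrate the validate-and-retry alternative I keep referring to,
it would look very roughly like the following in the fault path (a
sketch only: it glosses over the lifetime of the ptl itself when the
pmd page is freed by unsharing, and the reservation backout is where
it really gets messy):

	pte_t *ptep;
	spinlock_t *ptl;

retry:
	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (!ptep)
		ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	ptl = huge_pte_lock(h, mm, ptep);

	/*
	 * The pte pointer from the lockless walk can be freed by a
	 * concurrent huge_pmd_unshare(). Re-walk now that the lock
	 * is held; if the pte moved, back out and start over.
	 */
	if (ptep != huge_pte_offset(mm, haddr, huge_page_size(h))) {
		spin_unlock(ptl);
		goto retry;
	}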
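
On the stop_eviction idea, just to check I understand the shmem
scheme: for hugetlbfs I imagine it would be something like the below,
with a new (purely hypothetical) stop_eviction count added to struct
hugetlbfs_inode_info:

	struct hugetlbfs_inode_info *info = HUGETLBFS_I(mapping->host);

	/* pin the mapping against eviction before dropping page lock */
	atomic_inc(&info->stop_eviction);	/* hypothetical field */
	unlock_page(hpage);
	i_mmap_lock_write(mapping);

	/* ... do the work that needs i_mmap_rwsem ... */

	if (atomic_dec_and_test(&info->stop_eviction))
		wake_up_var(&info->stop_eviction);

and hugetlbfs_evict_inode() would wait for any raisers:

	wait_var_event(&info->stop_eviction,
		       !atomic_read(&info->stop_eviction));

Though like you, I would prefer to avoid adding the extra count if the
i_mmap_rwsem use can simply be dropped for the anon case.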
In parallel to working the locking issue, I am also exploring enhanced
backout code to handle all the needed cases.
-- 
Mike Kravetz