From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing
To: Mike Kravetz, Matthew Wilcox, Andrew Morton,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Peter Zijlstra, Ingo Molnar, Will Deacon
References: <20191107190628.22667-1-longman@redhat.com>
 <20191107195441.GF11823@bombadil.infradead.org>
 <20191108020456.sulyjskhq3s5zcaa@linux-p48b>
 <5059733e-95aa-2c9e-6f5d-4f45f6a130b3@oracle.com>
 <33d1fcc8-ebec-bd37-2969-daa0125479df@oracle.com>
From: Waiman Long
Organization: Red Hat
Message-ID: <28b721b1-38f3-f10a-60e3-20e247c4ecb7@redhat.com>
Date: Tue, 12 Nov 2019 21:55:00 -0500
In-Reply-To: <33d1fcc8-ebec-bd37-2969-daa0125479df@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/12/19 6:11 PM, Mike Kravetz wrote:
> On 11/12/19 9:27 AM, Waiman Long wrote:
>> On 11/8/19 8:47 PM, Mike Kravetz wrote:
>>> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>>>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>>>
>>>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>>>> held just in read mode. It is OK to do increments in parallel without
>>>>>> synchronization. However, we don't want anyone else changing the count
>>>>>> while that check in huge_pmd_unshare is happening. Hence, the need for
>>>>>> taking the semaphore in write mode.
>>>>> This would be a nice addition to the changelog methinks.
>>>> Last night I remembered there is one place where we currently take
>>>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare. That
>>>> is in try_to_unmap_one. Yes, there is a potential race here today.
>>> Actually there is no race there today. Callers to huge_pmd_unshare
>>> hold the page table lock. So, this synchronizes those unshare calls
>>> from page migration and page poisoning.
>>>
>>>> But that race is somewhat contained as you need two threads doing some
>>>> combination of page migration and page poisoning to race. This change
>>>> now allows migration or poisoning to race with page fault. I would
>>>> really prefer if we do not open up the race window in this manner.
>>> But, we do open a race window by changing huge_pmd_share to take the
>>> i_mmap_rwsem in read mode as in the original patch.
>>>
>>> Here is the additional code needed to take the semaphore in write mode
>>> for the huge_pmd_unshare calls via try_to_unmap_one. We would need to
>>> combine this with Longman's patch. Please take a look and provide feedback.
>>> Some of the changes are subtle, especially the exception for MAP_PRIVATE
>>> mappings, but I tried to add sufficient comments.
>>>
>>> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz
>>> Date: Fri, 8 Nov 2019 17:25:37 -0800
>>> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>>>  call to huge_pmd_unshare in try_to_unmap_one.
>>>
>>> Signed-off-by: Mike Kravetz
>>> ---
>>>  mm/hugetlb.c        |  9 ++++++++-
>>>  mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>>>  mm/migrate.c        | 27 +++++++++++++++++++++++++--
>>>  3 files changed, 60 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index f78891f92765..73d9136549a5 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>>   * indicated by page_count > 1, unmap is achieved by clearing pud and
>>>   * decrementing the ref count. If count == 1, the pte page is not shared.
>>>   *
>>> - * called with page table lock held.
>>> + * Must be called while holding page table lock.
>>> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
>>> + * This is to prevent races with page faults calling huge_pmd_share which
>>> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
>>> + * in read mode. It is possible to call without holding i_mmap_rwsem in
>>> + * write mode if the caller KNOWS the page table is associated with a private
>>> + * mapping. This is because private mappings can not share PMDs and can
>>> + * not race with huge_pmd_share calls during page faults.
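
[As an aside, the locking convention the new comment describes can be
sketched roughly as below. This is a simplified illustration only, not the
actual call sites: the two function names are invented placeholders, while
i_mmap_lock_read()/i_mmap_lock_write() are the real i_mmap_rwsem helpers.]

#include <linux/fs.h>
#include <linux/spinlock.h>

/*
 * Fault path: huge_pmd_share() runs with i_mmap_rwsem held only in
 * read mode, so multiple faults may try to share PMDs concurrently.
 */
static void fault_path_sketch(struct address_space *mapping)
{
	i_mmap_lock_read(mapping);
	/* ... huge_pmd_share(): get_page() on the shared pte page ... */
	i_mmap_unlock_read(mapping);
}

/*
 * Unshare path: huge_pmd_unshare() needs the page table lock plus
 * i_mmap_rwsem in write mode, so its page_count() check cannot overlap
 * with the get_page() above.  Private mappings never share PMDs and
 * may skip the write lock.
 */
static void unshare_path_sketch(struct address_space *mapping, spinlock_t *ptl)
{
	i_mmap_lock_write(mapping);
	spin_lock(ptl);
	/* ... huge_pmd_unshare(): check page_count(), clear pud ... */
	spin_unlock(ptl);
	i_mmap_unlock_write(mapping);
}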
>> So the page table lock here is the huge_pte_lock(). Right? In
>> huge_pmd_share(), the pte lock has to be taken before one can share it.
>> So would you mind explaining where exactly is the race?
> My concern was that this code in huge_pmd_share:
>
>                 saddr = page_table_shareable(svma, vma, addr, idx);
>                 if (saddr) {
>                         spte = huge_pte_offset(svma->vm_mm, saddr,
>                                                vma_mmu_pagesize(svma));
>                         if (spte) {
>                                 get_page(virt_to_page(spte));
>                                 break;
>                         }
>                 }
>
> and this code in huge_pmd_unshare:
>
>         BUG_ON(page_count(virt_to_page(ptep)) == 0);
>         if (page_count(virt_to_page(ptep)) == 1)
>                 return 0;
>
> could run concurrently. Note that the return code for huge_pmd_unshare is
> specified as,
>
>  * returns: 1 successfully unmapped a shared pte page
>             0 the underlying pte page is not shared, or it is the last user
>
> And, callers take different action depending on the return value.
>
> Now, since the code blocks above can run concurrently it is possible that:
> - huge_pmd_unshare sees page_count == 1
> - huge_pmd_share increments page_count in anticipation of sharing
> - huge_pmd_unshare returns 0 indicating there is no pmd sharing
> - huge_pmd_share waits for page table lock to insert pmd page before it
>   starts sharing
>
> My concern was with what might happen if a huge_pmd_unshare caller received
> a 0 return code and thought there were no other users of the pmd. In the
> specific case mentioned here (try_to_unmap_one) I now do not believe there
> are any issues. The caller is simply going to clear one entry on the pmd
> page. I can not think of a way this could impact the other thread waiting
> to share the pmd. It will simply start sharing the pmd after it gets the
> page table lock and the entry has been removed.
>
> As the result of your question, I went back and took a really close look
> at the code in question. While that huge_pmd_share/huge_pmd_unshare code
> running concurrently does make me a bit nervous, I can not find any specific
> problem. In addition, I ran a test that causes this race for several hours
> without issue.
>
> Sorry for the churn/second thoughts. This code is subtle, and sometimes hard
> to understand. IMO, Longman's patch (V2) can go forward, but please delete
> the following blob in the commit message from me:
>
> "Note that huge_pmd_share now increments the page count with the
> semaphore held just in read mode. It is OK to do increments in
> parallel without synchronization. However, we don't want anyone else
> changing the count while that check in huge_pmd_unshare is happening.
> Hence, the need for taking the semaphore in write mode."
>
Thanks for the explanation. I understand that it is oftentimes hard to
figure out whether a race exists or not. I got confused myself many times.

Anyway, as long as any destructive action is only taken while holding the
page table lock, it should be safe.

Cheers,
Longman
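
[To visualize the interleaving Mike describes above, here is a minimal
userspace model of the window between the page_count() check and the
get_page() increment. This is plain C11 with pthreads; all names are
invented, and it only mimics the refcount check, not real kernel behavior.
Build with: cc -std=c11 -pthread race_model.c]

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* models page_count(virt_to_page(ptep)), starting at 1 (unshared) */
static atomic_int page_count = 1;

static void *share_thread(void *arg)
{
	(void)arg;
	/* models get_page() in huge_pmd_share(), done with the rwsem held
	 * only in read mode -- so it can overlap with the check below */
	atomic_fetch_add(&page_count, 1);
	return NULL;
}

static void *unshare_thread(void *arg)
{
	(void)arg;
	/* models the check in huge_pmd_unshare(): with only the page table
	 * lock held, it may observe 1 just before the increment lands and
	 * report "not shared" (return 0) to its caller */
	if (atomic_load(&page_count) == 1)
		printf("unshare: saw count == 1, reporting pte page not shared\n");
	else
		printf("unshare: saw count > 1, would unmap the shared pte page\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, share_thread, NULL);
	pthread_create(&b, NULL, unshare_thread, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

[Depending on scheduling, either message can print. The "saw count == 1"
outcome is the interleaving that taking i_mmap_rwsem in write mode on the
unshare side is meant to rule out.]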