Subject: Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing
To: Mike Kravetz, Matthew Wilcox, Andrew Morton,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Peter Zijlstra, Ingo Molnar, Will Deacon
References: <20191107190628.22667-1-longman@redhat.com>
 <20191107195441.GF11823@bombadil.infradead.org>
 <20191108020456.sulyjskhq3s5zcaa@linux-p48b>
 <5059733e-95aa-2c9e-6f5d-4f45f6a130b3@oracle.com>
From: Waiman Long
Organization: Red Hat
Date: Tue, 12 Nov 2019 12:27:40 -0500
In-Reply-To: <5059733e-95aa-2c9e-6f5d-4f45f6a130b3@oracle.com>

On 11/8/19 8:47 PM, Mike Kravetz wrote:
> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>
>>>> Note that huge_pmd_share now increments the page count with the
>>>> semaphore held just in read mode. It is OK to do increments in
>>>> parallel without synchronization. However, we don't want anyone else
>>>> changing the count while that check in huge_pmd_unshare is happening.
>>>> Hence, the need for taking the semaphore in write mode.
>>> This would be a nice addition to the changelog methinks.
>> Last night I remembered there is one place where we currently take
>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare. That
>> is in try_to_unmap_one. Yes, there is a potential race here today.
> Actually there is no race there today. Callers to huge_pmd_unshare
> hold the page table lock. So, this synchronizes those unshare calls
> from page migration and page poisoning.
>
>> But that race is somewhat contained as you need two threads doing some
>> combination of page migration and page poisoning to race. This change
>> now allows migration or poisoning to race with page fault. I would
>> really prefer if we do not open up the race window in this manner.
> But, we do open a race window by changing huge_pmd_share to take the
> i_mmap_rwsem in read mode as in the original patch.
>
> Here is the additional code needed to take the semaphore in write mode
> for the huge_pmd_unshare calls via try_to_unmap_one. We would need to
> combine this with Longman's patch. Please take a look and provide feedback.
> Some of the changes are subtle, especially the exception for MAP_PRIVATE
> mappings, but I tried to add sufficient comments.
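Before commenting on the patch itself: if I am following the discussion
above, the counting protocol being relied on is roughly this (my own
simplified sketch for discussion, not the actual mm/hugetlb.c code; the
pte page lookup and error handling are omitted):

/*
 * Sketch of the i_mmap_rwsem protocol under discussion. Increments of
 * the pte page count may run concurrently under the semaphore in read
 * mode; the count check and decrement must exclude them by taking the
 * semaphore in write mode.
 */
static void share_sketch(struct address_space *mapping, pte_t *spte)
{
	i_mmap_lock_read(mapping);		/* read mode is sufficient */
	get_page(virt_to_page(spte));		/* parallel increments are OK */
	i_mmap_unlock_read(mapping);
}

static int unshare_sketch(struct address_space *mapping, pte_t *ptep)
{
	int unshared = 0;

	i_mmap_lock_write(mapping);		/* excludes the increment above */
	if (page_count(virt_to_page(ptep)) > 1) {
		put_page(virt_to_page(ptep));	/* still shared: drop one ref */
		unshared = 1;
	}
	i_mmap_unlock_write(mapping);
	return unshared;
}

Please correct me below if that is not the intended protocol.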
>
> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz
> Date: Fri, 8 Nov 2019 17:25:37 -0800
> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>  call to huge_pmd_unshare in try_to_unmap_one.
>
> Signed-off-by: Mike Kravetz
> ---
>  mm/hugetlb.c        |  9 ++++++++-
>  mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>  mm/migrate.c        | 27 +++++++++++++++++++++++++--
>  3 files changed, 60 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f78891f92765..73d9136549a5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   * indicated by page_count > 1, unmap is achieved by clearing pud and
>   * decrementing the ref count. If count == 1, the pte page is not shared.
>   *
> - * called with page table lock held.
> + * Must be called while holding page table lock.
> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
> + * This is to prevent races with page faults calling huge_pmd_share which
> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
> + * in read mode. It is possible to call without holding i_mmap_rwsem in
> + * write mode if the caller KNOWS the page table is associated with a private
> + * mapping. This is because private mappings can not share PMDs and can
> + * not race with huge_pmd_share calls during page faults.

So the page table lock here is the huge_pte_lock(). Right? In
huge_pmd_share(), the pte lock has to be taken before one can share it.
So would you mind explaining where exactly the race is?
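For reference, the ordering I have in mind looks like this, heavily
trimmed from huge_pmd_share(); find_shared_pte() is a made-up stand-in
for the interval-tree walk, and error handling is omitted:

/* Trimmed sketch of huge_pmd_share()'s ordering; not the real code. */
static pte_t *share_ordering_sketch(struct mm_struct *mm, struct hstate *h,
				    struct address_space *mapping,
				    pgoff_t idx, pud_t *pud)
{
	spinlock_t *ptl;
	pte_t *spte;

	i_mmap_lock_read(mapping);		/* read mode with the original patch */
	spte = find_shared_pte(mapping, idx);	/* hypothetical helper */
	get_page(virt_to_page(spte));

	ptl = huge_pte_lock(h, mm, spte);	/* pte lock before actually sharing */
	if (pud_none(*pud))
		pud_populate(mm, pud,
			     (pmd_t *)((unsigned long)spte & PAGE_MASK));
	else
		put_page(virt_to_page(spte));	/* lost the race, drop the ref */
	spin_unlock(ptl);
	i_mmap_unlock_read(mapping);

	return spte;
}

That is, the pud is only populated while the pte lock is held, which is
why I am not seeing what the unshare side's page table lock fails to
cover.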
Thanks,
Longman

>   *
>   * returns: 1 successfully unmapped a shared pte page
>   *	    0 the underlying pte page is not shared, or it is the last user
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3151c87dff73..8f52b22cf71b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1030,7 +1030,33 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
> 	if (kill)
> 		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
>
> -	unmap_success = try_to_unmap(hpage, ttu);
> +	if (!PageHuge(hpage)) {
> +		unmap_success = try_to_unmap(hpage, ttu);
> +	} else {
> +		mapping = page_mapping(hpage);
> +		if (mapping) {
> +			/*
> +			 * For hugetlb pages, try_to_unmap could potentially
> +			 * call huge_pmd_unshare. Because of this, take
> +			 * semaphore in write mode here and set TTU_RMAP_LOCKED
> +			 * to indicate we have taken the lock at this higher
> +			 * level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			unmap_success = try_to_unmap(hpage,
> +							ttu|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			unmap_success = try_to_unmap(hpage, ttu);
> +		}
> +	}
> +
> 	if (!unmap_success)
> 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
> 		       pfn, page_mapcount(hpage));
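IOW, for a shared mapping both call sites in this patch reduce to the
same pattern:

	i_mmap_lock_write(mapping);	/* excludes huge_pmd_share() */
	try_to_unmap(hpage, ttu | TTU_RMAP_LOCKED);
	i_mmap_unlock_write(mapping);

with TTU_RMAP_LOCKED telling the rmap walk that the lock is already
held, while the !mapping (MAP_PRIVATE) case keeps the plain
try_to_unmap() call. The migrate.c hunk below follows the same scheme.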
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4fe45d1428c8..9cae5a4f1e48 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1333,8 +1333,31 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> 		goto put_anon;
>
> 	if (page_mapped(hpage)) {
> -		try_to_unmap(hpage,
> -			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +		struct address_space *mapping = page_mapping(hpage);
> +
> +		if (mapping) {
> +			/*
> +			 * try_to_unmap could potentially call huge_pmd_unshare.
> +			 * Because of this, take semaphore in write mode here
> +			 * and set TTU_RMAP_LOCKED to indicate we have taken
> +			 * the lock at this higher level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS);
> +		}
> 		page_was_mapped = 1;
> 	}
>