Date: Mon, 4 Nov 2019 13:52:38 -0500
From: Jerome Glisse
To: John Hubbard
Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
 Björn Töpel, Christoph Hellwig, Dan Williams, Daniel Vetter,
 Dave Chinner, David Airlie,
Miller" , Ira Weiny , Jan Kara , Jason Gunthorpe , Jens Axboe , Jonathan Corbet , Magnus Karlsson , Mauro Carvalho Chehab , Michael Ellerman , Michal Hocko , Mike Kravetz , Paul Mackerras , Shuah Khan , Vlastimil Babka , bpf@vger.kernel.org, dri-devel@lists.freedesktop.org, kvm@vger.kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, linux-rdma@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, netdev@vger.kernel.org, linux-mm@kvack.org, LKML Subject: Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages Message-ID: <20191104185238.GG5134@redhat.com> References: <20191103211813.213227-1-jhubbard@nvidia.com> <20191103211813.213227-13-jhubbard@nvidia.com> MIME-Version: 1.0 In-Reply-To: <20191103211813.213227-13-jhubbard@nvidia.com> User-Agent: Mutt/1.12.1 (2019-06-15) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-MC-Unique: BYegrK4JPIWXqhL6bCqv8A-1 X-Mimecast-Spam-Score: 0 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Sender: linux-kselftest-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote: > Add tracking of pages that were pinned via FOLL_PIN. >=20 > As mentioned in the FOLL_PIN documentation, callers who effectively set > FOLL_PIN are required to ultimately free such pages via put_user_page(). > The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET > for DIO and/or RDMA use". >=20 > Pages that have been pinned via FOLL_PIN are identifiable via a > new function call: >=20 > bool page_dma_pinned(struct page *page); >=20 > What to do in response to encountering such a page, is left to later > patchsets. There is discussion about this in [1]. >=20 > This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask(). >=20 > This also has a couple of trivial, non-functional change fixes to > try_get_compound_head(). That function got moved to the top of the > file. Maybe split that as a separate trivial patch. >=20 > This includes the following fix from Ira Weiny: >=20 > DAX requires detection of a page crossing to a ref count of 1. Fix this > for GUP pages by introducing put_devmap_managed_user_page() which > accounts for GUP_PIN_COUNTING_BIAS now used by GUP. Please do the put_devmap_managed_page() changes in a separate patch, it would be a lot easier to follow, also on that front see comments below. 
> 
> [1] https://lwn.net/Articles/784574/ "Some slow progress on
> get_user_pages()"
> 
> Suggested-by: Jan Kara
> Suggested-by: Jérôme Glisse
> Signed-off-by: Ira Weiny
> Signed-off-by: John Hubbard
> ---
>  include/linux/mm.h       |  80 +++++++++++----
>  include/linux/mmzone.h   |   2 +
>  include/linux/page_ref.h |  10 ++
>  mm/gup.c                 | 213 +++++++++++++++++++++++++++++++--------
>  mm/huge_memory.c         |  32 +++++-
>  mm/hugetlb.c             |  28 ++++-
>  mm/memremap.c            |   4 +-
>  mm/vmstat.c              |   2 +
>  8 files changed, 300 insertions(+), 71 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cdfb6fedb271..03b3600843b7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page)
>  #endif
> 
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void __put_devmap_managed_page(struct page *page, int count);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
> 	if (!static_branch_unlikely(&devmap_managed_key))
> 		return false;
> @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page)
> 	switch (page->pgmap->type) {
> 	case MEMORY_DEVICE_PRIVATE:
> 	case MEMORY_DEVICE_FS_DAX:
> -		__put_devmap_managed_page(page);
> 		return true;
> 	default:
> 		break;
> @@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page)
> 	return false;
>  }
> 
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> +	bool is_devmap = page_is_devmap_managed(page);
> +
> +	if (is_devmap) {
> +		int count = page_ref_dec_return(page);
> +
> +		__put_devmap_managed_page(page, count);
> +	}
> +
> +	return is_devmap;
> +}

I think __put_devmap_managed_page() should be renamed to
free_devmap_managed_page(), and the count != 1 case should move into
this inline function, i.e.:

static inline bool put_devmap_managed_page(struct page *page)
{
	bool is_devmap = page_is_devmap_managed(page);

	if (is_devmap) {
		int count = page_ref_dec_return(page);

		/*
		 * If the refcount is 1 then the page is freed and the
		 * refcount is stable, as nobody holds a reference on the
		 * page.
		 */
		if (count == 1)
			free_devmap_managed_page(page, count);
		else if (!count)
			__put_page(page);
	}

	return is_devmap;
}

> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
> 	return true;
>  }
> 
> +__must_check bool user_page_ref_inc(struct page *page);
> +

What about having it as an inline here, as it is pretty small?
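Something like this, maybe (just a rough sketch to illustrate the idea;
it calls page_ref_add_unless() directly, which is my shorthand here,
whereas the patch apparently goes through the page_cache_add_speculative()
machinery, so the real thing may need more care):

static inline __must_check bool user_page_ref_inc(struct page *page)
{
	page = compound_head(page);

	/*
	 * Only take the pin if the refcount is not already zero, in the
	 * spirit of page_cache_add_speculative(), and take it
	 * GUP_PIN_COUNTING_BIAS at a time so pins remain distinguishable
	 * from ordinary references.
	 */
	return page_ref_add_unless(page, GUP_PIN_COUNTING_BIAS, 0);
}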
>  static inline void put_page(struct page *page)
>  {
> 	page = compound_head(page);
> @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page)
> 		__put_page(page);
>  }
> 
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page: pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original page
> + * reference count, and also a new count of how many get_user_pages() calls were
> + * made against the page. ("gup-pinned" is another term for the latter).
> + *
> + * With this scheme, get_user_pages() becomes special: such pages are marked
> + * as distinct from normal pages. As such, the new put_user_page() call (and
> + * its variants) must be used in order to release gup-pinned pages.
> + *
> + * Choice of value:
>  *
> - * Pages that were pinned via get_user_pages*() must be released via
> - * either put_user_page(), or one of the put_user_pages*() routines
> - * below. This is so that eventually, pages that are pinned via
> - * get_user_pages*() can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special
> - * handling.
> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
> + * due to the fact that adding an even power of two to the page refcount has
> + * the effect of using only the upper N bits, for the code that counts up using
> + * the bias value. This means that the lower bits are left for the exclusive
> + * use of the original code that increments and decrements by one (or at least,
> + * by much smaller values than the bias value).
>  *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with get_user_page() calls.
> + * Of course, once the lower bits overflow into the upper bits (and this is
> + * OK, because subtraction recovers the original values), then visual inspection
> + * no longer suffices to directly view the separate counts. However, for normal
> + * applications that don't have huge page reference counts, this won't be an
> + * issue.
> + *
> + * Locking: the lockless algorithm described in page_cache_get_speculative()
> + * and page_cache_gup_pin_speculative() provides safe operation for
> + * get_user_pages and page_mkclean and other calls that race to set up page
> + * table entries.
>  */
> -static inline void put_user_page(struct page *page)
> -{
> -	put_page(page);
> -}
> +#define GUP_PIN_COUNTING_BIAS (1UL << 10)
> 
> +void put_user_page(struct page *page);
>  void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> 			       bool make_dirty);
> -
>  void put_user_pages(struct page **pages, unsigned long npages);
> 
> +/**
> + * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
> + * or pin_longterm_pages*()
> + * @page:	pointer to page to be queried.
> + * @Return:	True, if it is likely that the page has been "dma-pinned".
> + *		False, if the page is definitely not dma-pinned.
> + */

Maybe add a small comment about wrap around :)

> +static inline bool page_dma_pinned(struct page *page)
> +{
> +	return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS;
> +}
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif

[...]
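If "wrap around" here means ordinary references spilling into the bias
range, then a possible wording for that note (illustrative only, to be
folded into the kernel-doc above) would be:

/*
 * Because this only compares the refcount against GUP_PIN_COUNTING_BIAS,
 * a page that has accumulated on the order of 1024 ordinary get_page()
 * references will also report true here, even though it was never
 * dma-pinned. False positives are therefore possible, and callers must
 * treat the result as a hint rather than a guarantee.
 */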
> diff --git a/mm/gup.c b/mm/gup.c
> index 1aea48427879..c9727e65fad3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c

[...]

> @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> 
> 		pgmap = get_dev_pagemap(pfn, pgmap);
> 		if (unlikely(!pgmap)) {
> -			undo_dev_pagemap(nr, nr_start, pages);
> +			undo_dev_pagemap(nr, nr_start, flags, pages);
> 			return 0;
> 		}
> 		SetPageReferenced(page);
> 		pages[*nr] = page;
> -		get_page(page);
> +
> +		if (flags & FOLL_PIN) {
> +			if (unlikely(!user_page_ref_inc(page))) {
> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> +				return 0;
> +			}

Maybe add a comment about a case that should never happen, i.e.
user_page_ref_inc() failing after the second iteration of the loop; that
would be broken, and it would be a bug to call undo_dev_pagemap() after
the first iteration of that loop.

Also, I believe this should never happen at all: if the first iteration
succeeds, then __page_cache_add_speculative() will succeed for all the
remaining iterations.

Note that the pgmap case above follows the same pattern, i.e. the call to
get_dev_pagemap() can only fail on the first iteration of the loop (well,
I assume you can never have a huge device page that spans different
pgmaps, i.e. different devices, which is a reasonable assumption). So
maybe this code needs fixing, i.e.:

		pgmap = get_dev_pagemap(pfn, pgmap);
		if (unlikely(!pgmap))
			return 0;

> +		} else
> +			get_page(page);
> +
> 		(*nr)++;
> 		pfn++;
> 	} while (addr += PAGE_SIZE, addr != end);

[...]

> @@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
> 	unsigned long addr, len, end;
> 	int nr = 0, ret = 0;
> 
> -	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> +	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))

Maybe add a comment to explain, something like:

/*
 * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN.
 *
 * Note that get_user_pages_fast() implies the FOLL_GET flag by default,
 * but callers can override this default to the pin case by setting
 * FOLL_PIN.
 */

> 		return -EINVAL;
> 
> 	start = untagged_addr(start) & PAGE_MASK;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 13cc93785006..66bf4c8b88f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c

[...]

> @@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
> 	if (!*pgmap)
> 		return ERR_PTR(-EFAULT);
> 	page = pfn_to_page(pfn);
> -	get_page(page);
> +
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +	else if (flags & FOLL_PIN)
> +		if (unlikely(!user_page_ref_inc(page)))
> +			page = ERR_PTR(-ENOMEM);

While I agree that user_page_ref_inc() (i.e. page_cache_add_speculative())
should never fail here, since we are holding the pmd lock and thus no one
can unmap the pmd and free the page it points to, I believe you should
return -EFAULT as in the pgmap case, not -ENOMEM; the pgmap lookup should
not fail either, for the same reason, so it would be better to have a
consistent error. Maybe also add a comment explaining that it should not
fail here.

> 
> 	return page;
>  }

[...]
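Concretely, for follow_devmap_pmd() the suggestion above would look
something like this (untested sketch):

	if (flags & FOLL_GET)
		get_page(page);
	else if (flags & FOLL_PIN) {
		/*
		 * We hold the pmd lock, so the page cannot be unmapped and
		 * freed under us; user_page_ref_inc() is not expected to
		 * fail. Use -EFAULT to stay consistent with the pgmap case.
		 */
		if (unlikely(!user_page_ref_inc(page)))
			page = ERR_PTR(-EFAULT);
	}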
> @@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> 	 * device mapped pages can only be returned if the
> 	 * caller will manage the page reference count.
> 	 */
> -	if (!(flags & FOLL_GET))
> +	if (!(flags & (FOLL_GET | FOLL_PIN)))
> 		return ERR_PTR(-EEXIST);

Maybe add a comment that FOLL_GET or FOLL_PIN must be set.

> 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
> @@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> 	if (!*pgmap)
> 		return ERR_PTR(-EFAULT);
> 	page = pfn_to_page(pfn);
> -	get_page(page);
> +
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +	else if (flags & FOLL_PIN)
> +		if (unlikely(!user_page_ref_inc(page)))
> +			page = ERR_PTR(-ENOMEM);

Same as for follow_devmap_pmd(), see above.

> 
> 	return page;
>  }
> @@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  skip_mlock:
> 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
> +
> 	if (flags & FOLL_GET)
> 		get_page(page);
> +	else if (flags & FOLL_PIN)
> +		if (unlikely(!user_page_ref_inc(page)))
> +			page = NULL;

This should not fail either, as we are holding the pmd lock; maybe add a
comment. Not sure whether we want a WARN() or something to catch this
degenerate case, or to dump the page.

> 
>  out:
> 	return page;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b45a95363a84..da335b1cd798 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  same_page:
> 		if (pages) {
> 			pages[i] = mem_map_offset(page, pfn_offset);
> -			get_page(pages[i]);
> +
> +			if (flags & FOLL_GET)
> +				get_page(pages[i]);
> +			else if (flags & FOLL_PIN)
> +				if (unlikely(!user_page_ref_inc(pages[i]))) {
> +					spin_unlock(ptl);
> +					remainder = 0;
> +					err = -ENOMEM;
> +					WARN_ON_ONCE(1);
> +					break;
> +				}
> 		}

user_page_ref_inc() should not fail here either, because we hold the ptl,
so the WARN_ON_ONCE() is right, but maybe add a comment.

> 
> 		if (vmas)

[...]

> @@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> 	pte = huge_ptep_get((pte_t *)pmd);
> 	if (pte_present(pte)) {
> 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +
> 		if (flags & FOLL_GET)
> 			get_page(page);
> +		else if (flags & FOLL_PIN)
> +			if (unlikely(!user_page_ref_inc(page))) {
> +				page = NULL;
> +				goto out;
> +			}

This should not fail either (again we are holding the pmd lock); not sure
whether we want a warning or something to catch this degenerate case.

> 	} else {
> 		if (is_hugetlb_entry_migration(pte)) {
> 			spin_unlock(ptl);

[...]
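To make the "warn plus comment" idea concrete for follow_huge_pmd(), one
possible shape (sketch only, mirroring what the follow_hugetlb_page()
hunk above already does with its WARN_ON_ONCE()):

		if (flags & FOLL_GET)
			get_page(page);
		else if (flags & FOLL_PIN) {
			/*
			 * We hold the pmd lock, so the page cannot go away
			 * under us; a failure here would indicate a
			 * refcounting bug rather than a runtime condition.
			 */
			if (WARN_ON_ONCE(!user_page_ref_inc(page))) {
				page = NULL;
				goto out;
			}
		}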