linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	 Matthew Wilcox <willy@infradead.org>,
	David Hildenbrand <david@redhat.com>,
	 Suren Baghdasaryan <surenb@google.com>,
	 Qi Zheng <zhengqi.arch@bytedance.com>,
	Yang Shi <shy828301@gmail.com>,
	 Mel Gorman <mgorman@techsingularity.net>,
	 Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>,  Yu Zhao <yuzhao@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	 Ralph Campbell <rcampbell@nvidia.com>,
	Ira Weiny <ira.weiny@intel.com>,
	 Steven Price <steven.price@arm.com>,
	SeongJae Park <sj@kernel.org>,
	 Naoya Horiguchi <naoya.horiguchi@nec.com>,
	 Christophe Leroy <christophe.leroy@csgroup.eu>,
	 Zack Rusin <zackr@vmware.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	 Axel Rasmussen <axelrasmussen@google.com>,
	 Anshuman Khandual <anshuman.khandual@arm.com>,
	 Pasha Tatashin <pasha.tatashin@soleen.com>,
	 Miaohe Lin <linmiaohe@huawei.com>,
	Minchan Kim <minchan@kernel.org>,
	 Christoph Hellwig <hch@infradead.org>,
	Song Liu <song@kernel.org>,
	 Thomas Hellstrom <thomas.hellstrom@linux.intel.com>,
	 Russell King <linux@armlinux.org.uk>,
	 "David S. Miller" <davem@davemloft.net>,
	 Michael Ellerman <mpe@ellerman.id.au>,
	 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	 Heiko Carstens <hca@linux.ibm.com>,
	 Christian Borntraeger <borntraeger@linux.ibm.com>,
	 Claudio Imbrenda <imbrenda@linux.ibm.com>,
	 Alexander Gordeev <agordeev@linux.ibm.com>,
	Jann Horn <jannh@google.com>,
	 linux-arm-kernel@lists.infradead.org,
	sparclinux@vger.kernel.org,  linuxppc-dev@lists.ozlabs.org,
	linux-s390@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
Date: Tue, 30 May 2023 17:38:25 -0700 (PDT)	[thread overview]
Message-ID: <48c679e9-8eed-568a-1de1-c57e315c693c@google.com> (raw)
In-Reply-To: <ZHU0m+QIChZNdOdg@x1n>

Thanks for looking, Peter: I was well aware of you dropping several hints
that you wanted to see what's intended before passing judgment on earlier
series, and I preferred to get on with showing this series, than go into
detail in responses to you there - thanks for your patience :)

On Mon, 29 May 2023, Peter Xu wrote:
> On Sun, May 28, 2023 at 11:25:15PM -0700, Hugh Dickins wrote:
...
> > @@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
> >  	mmap_write_unlock(mm);
> >  }
> >  
> > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > -			       struct mm_struct *target_mm,
> > -			       unsigned long target_addr, struct page *hpage,
> > -			       struct collapse_control *cc)
> > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >  {
> >  	struct vm_area_struct *vma;
> > -	int target_result = SCAN_FAIL;
> >  
> > -	i_mmap_lock_write(mapping);
> > +	i_mmap_lock_read(mapping);
> >  	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > -		int result = SCAN_FAIL;
> > -		struct mm_struct *mm = NULL;
> > -		unsigned long addr = 0;
> > -		pmd_t *pmd;
> > -		bool is_target = false;
> > +		struct mm_struct *mm;
> > +		unsigned long addr;
> > +		pmd_t *pmd, pgt_pmd;
> > +		spinlock_t *pml;
> > +		spinlock_t *ptl;
> >  
> >  		/*
> >  		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > -		 * got written to. These VMAs are likely not worth investing
> > -		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
> > -		 * later.
> > +		 * got written to. These VMAs are likely not worth removing
> > +		 * page tables from, as PMD-mapping is likely to be split later.
> >  		 *
> > -		 * Note that vma->anon_vma check is racy: it can be set up after
> > -		 * the check but before we took mmap_lock by the fault path.
> > -		 * But page lock would prevent establishing any new ptes of the
> > -		 * page, so we are safe.
> > -		 *
> > -		 * An alternative would be drop the check, but check that page
> > -		 * table is clear before calling pmdp_collapse_flush() under
> > -		 * ptl. It has higher chance to recover THP for the VMA, but
> > -		 * has higher cost too. It would also probably require locking
> > -		 * the anon_vma.
> > +		 * Note that vma->anon_vma check is racy: it can be set after
> > +		 * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> > +		 * prevented establishing new ptes of the page. So we are safe
> > +		 * to remove page table below, without even checking it's empty.
> >  		 */
> > -		if (READ_ONCE(vma->anon_vma)) {
> > -			result = SCAN_PAGE_ANON;
> > -			goto next;
> > -		}
> > +		if (READ_ONCE(vma->anon_vma))
> > +			continue;
> 
> Not directly related to current patch, but I just realized there seems to
> have similar issue as what ab0c3f1251b4 wanted to fix.
> 
> IIUC any shmem vma that used to have uprobe/bp installed will have anon_vma
> set here, then does it mean that any vma used to get debugged will never be
> able to merge into a thp (with either madvise or khugepaged)?
> 
> I think it'll only make a difference when the page cache is not huge yet
> when bp was uninstalled, but then it becomes a thp candidate somehow.  Even
> if so, I think the anon_vma should still be there.
> 
> Did I miss something, or maybe that's not even a problem?

Finding vma->anon_vma set would discourage retract_page_tables() from
doing its business with that previously uprobed area; but it does not stop
collapse_pte_mapped_thp() (which uprobes unregister calls directly) from
dealing with it, and MADV_COLLAPSE works on anon_vma'ed areas too.  It's
just a heuristic in retract_page_tables(), when it chooses to skip the
anon_vma'ed areas as often not worth bothering with.

As to vma merges: I haven't actually checked since the maple tree and other
rewrites of vma merging, but previously one vma with anon_vma set could be
merged with adjacent vma before or after without anon_vma set - the
anon_vma comparison is not just equality of anon_vma, but allows NULL too -
so the anon_vma will still be there, but extends to cover the wider extent.
Right, I find is_mergeable_anon_vma() still following that rule.

(And once vmas are merged, so that the whole of the huge page falls within
a single vma, khugepaged can consider it, and do collapse_pte_mapped_thp()
on it - before or after 11/12 I think.)

As to whether it would even be a problem: generally no, the vma is supposed
just to be an internal representation, and so long as the code resists
proliferating them unnecessarily, occasional failures to merge should not
matter.  The one place that forever sticks in my mind as mattering (perhaps
there are others I'm unaware of, but I'd call them bugs) is mremap(): which
is sufficiently awkward and bug-prone already, that nobody ever had the
courage to make it independent of vma boundaries; but ideally, it's
mremap() that we should fix.

But I may have written three answers, yet still missed your point.

...
> > +
> > +		mm = vma->vm_mm;
> > +		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> > +			continue;
> > +
> > +		if (hpage_collapse_test_exit(mm))
> > +			continue;
> > +		/*
> > +		 * When a vma is registered with uffd-wp, we cannot recycle
> > +		 * the page table because there may be pte markers installed.
> > +		 * Other vmas can still have the same file mapped hugely, but
> > +		 * skip this one: it will always be mapped in small page size
> > +		 * for uffd-wp registered ranges.
> > +		 *
> > +		 * What if VM_UFFD_WP is set a moment after this check?  No
> > +		 * problem, huge page lock is still held, stopping new mappings
> > +		 * of page which might then get replaced by pte markers: only
> > +		 * existing markers need to be protected here.  (We could check
> > +		 * after getting ptl below, but this comment distracting there!)
> > +		 */
> > +		if (userfaultfd_wp(vma))
> > +			continue;
> 
> IIUC here with the new code we only hold (1) hpage lock, and (2)
> i_mmap_lock read.  Then could it possible that right after checking this
> and found !UFFD_WP, but then someone quickly (1) register uffd-wp on this
> vma, then UFFDIO_WRITEPROTECT to install some pte markers, before below
> pgtable locks taken?
> 
> The thing is installation of pte markers may not need either of the locks
> iiuc..
> 
> Would taking mmap read lock help in this case?

Isn't my comment above it a good enough answer?  If I misunderstand the
uffd-wp pte marker ("If"? certainly I don't understand it well enough,
but I may or may not be too wrong about it here), and actually it can
spring up in places where the page has not even been mapped yet, then
I'd *much* rather just move that check down into the pte_locked area,
than involve mmap read lock (which, though easier to acquire than its
write lock, would I think take us back to square 1 in terms of needing
trylock); but I did prefer not to have a big uffd-wp comment distracting
from the code flow there.

I expect now, that if I follow up UFFDIO_WRITEPROTECT, I shall indeed
find it inserting pte markers where the page has not even been mapped
yet.  A "Yes" from you will save me looking, but probably I shall have
to move that check down (oh well, the comment will be smaller there).

Thanks,
Hugh


  reply	other threads:[~2023-05-31  0:38 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-29  6:11 [PATCH 00/12] mm: free retracted page table by RCU Hugh Dickins
2023-05-29  6:14 ` [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
2023-05-31 17:06   ` Jann Horn
2023-05-29  6:16 ` [PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map() Hugh Dickins
2023-05-29 13:56   ` Matthew Wilcox
2023-05-29  6:17 ` [PATCH 03/12] arm: adjust_pte() use pte_offset_map_nolock() Hugh Dickins
2023-05-29  6:18 ` [PATCH 04/12] powerpc: assert_pte_locked() " Hugh Dickins
2023-05-29  6:20 ` [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-05-29 14:02   ` Matthew Wilcox
2023-05-29 14:36     ` Hugh Dickins
     [not found]     ` <ZHn6n5eVTsr4Wl8x@ziepe.ca>
     [not found]       ` <4df4909f-f5dd-6f94-9792-8f2949f542b3@google.com>
     [not found]         ` <ZH95oobIqN0WO5MK@ziepe.ca>
     [not found]           ` <ZH+DAxLhIYpTlIFc@x1n>
     [not found]             ` <ZH+EMp9RuEVOjVNb@ziepe.ca>
2023-06-07  3:49               ` Hugh Dickins
2023-05-29  6:21 ` [PATCH 06/12] sparc: " Hugh Dickins
2023-05-29  6:22 ` [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async() Hugh Dickins
     [not found]   ` <175ebec8-761-c3f-2d98-6c3bd87161c8@google.com>
2023-06-06 19:40     ` Gerald Schaefer
2023-06-08  3:35       ` Hugh Dickins
2023-06-08 13:58         ` Jason Gunthorpe
2023-06-08 15:47         ` Gerald Schaefer
     [not found]     ` <ZH99cLKeALvUCIH8@ziepe.ca>
2023-06-08  2:46       ` Hugh Dickins
2023-05-29  6:23 ` [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page Hugh Dickins
2023-05-29  6:25 ` [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock Hugh Dickins
2023-05-29 23:26   ` Peter Xu
2023-05-31  0:38     ` Hugh Dickins [this message]
2023-05-31 15:34   ` Jann Horn
2023-05-29  6:26 ` [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
2023-05-31 17:25   ` Jann Horn
2023-05-29  6:28 ` [PATCH 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
2023-05-29  6:30 ` [PATCH 12/12] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=48c679e9-8eed-568a-1de1-c57e315c693c@google.com \
    --to=hughd@google.com \
    --cc=agordeev@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=christophe.leroy@csgroup.eu \
    --cc=davem@davemloft.net \
    --cc=david@redhat.com \
    --cc=hca@linux.ibm.com \
    --cc=hch@infradead.org \
    --cc=imbrenda@linux.ibm.com \
    --cc=ira.weiny@intel.com \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mgorman@techsingularity.net \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=naoya.horiguchi@nec.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rcampbell@nvidia.com \
    --cc=rppt@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=sj@kernel.org \
    --cc=song@kernel.org \
    --cc=sparclinux@vger.kernel.org \
    --cc=steven.price@arm.com \
    --cc=surenb@google.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    --cc=zackr@vmware.com \
    --cc=zhengqi.arch@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).