All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chih-En Lin <shiyn.lin@gmail.com>
To: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Christian Brauner <brauner@kernel.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	William Kucharski <william.kucharski@oracle.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Yunsheng Lin <linyunsheng@huawei.com>,
	Arnd Bergmann <arnd@arndb.de>,
	Suren Baghdasaryan <surenb@google.com>,
	Colin Cross <ccross@google.com>, Feng Tang <feng.tang@intel.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Mike Rapoport <rppt@kernel.org>,
	Geert Uytterhoeven <geert@linux-m68k.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Daniel Axtens <dja@axtens.net>,
	Jonathan Marek <jonathan@marek.ca>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Peter Xu <peterx@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andy Lutomirski <luto@kernel.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Fenghua Yu <fenghua.yu@intel.com>,
	David Hildenbrand <david@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Kaiyang Zhao <zhao776@purdue.edu>,
	Huichun Feng <foxhoundsk.tw@gmail.com>,
	Jim Huang <jserv.tw@gmail.com>
Subject: Re: [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table
Date: Sat, 21 May 2022 12:38:53 +0800	[thread overview]
Message-ID: <20220521043853.GC1508515@strix-laptop> (raw)
In-Reply-To: <0c152776-6851-eb3e-b5c2-16826779891f@csgroup.eu>

On Fri, May 20, 2022 at 02:49:31PM +0000, Christophe Leroy wrote:
> 
> 
> Le 19/05/2022 à 20:31, Chih-En Lin a écrit :
> > This patch adds the Copy-On-Write (COW) mechanism to the PTE table.
> > To enable the COW page table use the clone3() system call with the
> > CLONE_COW_PGTABLE flag. It will set the MMF_COW_PGTABLE flag to the
> > processes.
> > 
> > It uses the MMF_COW_PGTABLE flag to distinguish the default page table
> > and the COW one. Moreover, it is difficult to distinguish whether the
> > entire page table is out of COW state. So the MMF_COW_PGTABLE flag won't
> > be disabled after its setup.
> > 
> > Since the memory space of the page table is distinctive for each process
> > in kernel space. It uses the address of the PMD index for the ownership
> > of the PTE table to identify which one of the processes needs to update
> > the page table state. In other words, only the owner will update COW PTE
> > state, like the RSS and pgtable_bytes.
> > 
> > It uses the reference count to control the lifetime of COW PTE table.
> > When someone breaks COW, it will copy the COW PTE table and decrease the
> > reference count. But if the reference count is equal to one before the
> > break COW, it will reuse the COW PTE table.
> > 
> > This patch modifies the part of the copy page table to do the basic COW.
> > For the break COW, it modifies the part of a page fault, zaps page table
> > , unmapping, and remapping.
> > 
> > Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> > ---
> >   include/linux/pgtable.h |   3 +
> >   mm/memory.c             | 262 ++++++++++++++++++++++++++++++++++++----
> >   mm/mmap.c               |   4 +
> >   mm/mremap.c             |   5 +
> >   4 files changed, 251 insertions(+), 23 deletions(-)
> > 
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 33c01fec7b92..357ce3722ee8 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -631,6 +631,9 @@ static inline int cow_pte_refcount_read(pmd_t *pmd)
> >          return atomic_read(&pmd_page(*pmd)->cow_pgtable_refcount);
> >   }
> > 
> > +extern int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > +               unsigned long addr, bool alloc);
> > +
> >   #ifndef pte_access_permitted
> >   #define pte_access_permitted(pte, write) \
> >          (pte_present(pte) && (!(write) || pte_write(pte)))
> > diff --git a/mm/memory.c b/mm/memory.c
> > index aa66af76e214..ff3fcbe4dfb5 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -247,6 +247,8 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
> >                  next = pmd_addr_end(addr, end);
> >                  if (pmd_none_or_clear_bad(pmd))
> >                          continue;
> > +               BUG_ON(cow_pte_refcount_read(pmd) != 1);
> > +               BUG_ON(!cow_pte_owner_is_same(pmd, NULL));
> 
> See comment on a previous patch of this series, there seem to be a huge 
> number of new BUG_ONs.

Got it.

> >                  free_pte_range(tlb, pmd, addr);
> >          } while (pmd++, addr = next, addr != end);
> > 
> > @@ -1031,7 +1033,7 @@ static inline void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
> >   static int
> >   copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >                 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> > -              unsigned long end)
> > +              unsigned long end, bool is_src_pte_locked)
> >   {
> >          struct mm_struct *dst_mm = dst_vma->vm_mm;
> >          struct mm_struct *src_mm = src_vma->vm_mm;
> > @@ -1053,8 +1055,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >                  goto out;
> >          }
> >          src_pte = pte_offset_map(src_pmd, addr);
> > -       src_ptl = pte_lockptr(src_mm, src_pmd);
> > -       spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > +       if (!is_src_pte_locked) {
> > +               src_ptl = pte_lockptr(src_mm, src_pmd);
> > +               spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > +       }
> 
> Odd construct, that kind of construct often leads to messy errors.
> 
> Could you construct things differently by refactoring the code ?

Sure, I will try my best.
It's probably why here have the bug when doing the stress testing.

> > @@ -1180,11 +1186,55 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >                                  continue;
> >                          /* fall through */
> >                  }
> > -               if (pmd_none_or_clear_bad(src_pmd))
> > -                       continue;
> > -               if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
> > -                                  addr, next))
> > +
> > +               if (test_bit(MMF_COW_PGTABLE, &src_mm->flags)) {
> > +
> > +                        if (pmd_none(*src_pmd))
> > +                               continue;
> 
> Why not keep the pmd_none_or_clear_bad(src_pmd) instead ?
> 
> > +
> > +                       /* XXX: Skip if the PTE already COW this time. */
> > +                       if (!pmd_none(*dst_pmd) &&
> 
> Shouldn't is be a pmd_none_or_clear_bad() ?
> 
> > +                           cow_pte_refcount_read(src_pmd) > 1)
> > +                               continue;
> > +
> > +                       /* If PTE doesn't have an owner, the parent needs to
> > +                        * take this PTE.
> > +                        */
> > +                       if (cow_pte_owner_is_same(src_pmd, NULL)) {
> > +                               set_cow_pte_owner(src_pmd, src_pmd);
> > +                               /* XXX: The process may COW PTE fork two times.
> > +                                * But in some situations, owner has cleared.
> > +                                * Previously Child (This time is the parent)
> > +                                * COW PTE forking, but previously parent, owner
> > +                                * , break COW. So it needs to add back the RSS
> > +                                * state and pgtable bytes.
> > +                                */
> > +                               if (!pmd_write(*src_pmd)) {
> > +                                       unsigned long pte_start =
> > +                                               addr & PMD_MASK;
> > +                                       unsigned long pte_end =
> > +                                               (addr + PMD_SIZE) & PMD_MASK;
> > +                                       cow_pte_rss(src_mm, src_vma, src_pmd,
> > +                                           pte_start, pte_end, true /* inc */);
> > +                                       mm_inc_nr_ptes(src_mm);
> > +                                       smp_wmb();
> > +                                       pmd_populate(src_mm, src_pmd,
> > +                                                       pmd_page(*src_pmd));
> > +                               }
> > +                       }
> > +
> > +                       pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > +
> > +                       /* Child reference count */
> > +                       pmd_get_pte(src_pmd);
> > +
> > +                       /* COW for PTE table */
> > +                       set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
> > +               } else if (!pmd_none_or_clear_bad(src_pmd) &&
> 
> Can't we keep pmd_none_or_clear_bad(src_pmd) common to both cases ?
> 

You are right.
I will change to pmd_none_or_clear_bad().

> > +                           copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
> > +                                   addr, next, false)) {
> >                          return -ENOMEM;
> > +               }
> >          } while (dst_pmd++, src_pmd++, addr = next, addr != end);
> >          return 0;
> >   }
> > @@ -1336,6 +1386,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >   struct zap_details {
> >          struct folio *single_folio;     /* Locked folio to be unmapped */
> >          bool even_cows;                 /* Zap COWed private pages too? */
> > +       bool cow_pte;                   /* Do not free COW PTE */
> >   };
> > 
> >   /* Whether we should zap all COWed (private) pages too */
> > @@ -1398,8 +1449,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                          page = vm_normal_page(vma, addr, ptent);
> >                          if (unlikely(!should_zap_page(details, page)))
> >                                  continue;
> > -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> > -                                                       tlb->fullmm);
> > +                       if (!details || !details->cow_pte)
> > +                               ptent = ptep_get_and_clear_full(mm, addr, pte,
> > +                                                               tlb->fullmm);
> >                          tlb_remove_tlb_entry(tlb, pte, addr);
> >                          if (unlikely(!page))
> >                                  continue;
> > @@ -1413,8 +1465,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                                      likely(!(vma->vm_flags & VM_SEQ_READ)))
> >                                          mark_page_accessed(page);
> >                          }
> > -                       rss[mm_counter(page)]--;
> > -                       page_remove_rmap(page, vma, false);
> > +                       if (!details || !details->cow_pte) {
> > +                               rss[mm_counter(page)]--;
> > +                               page_remove_rmap(page, vma, false);
> > +                       } else
> > +                               continue;
> 
> Can you do the reverse:
> 
> 			if (details && details->cow_pte)
> 				continue;
> 
> 			rss[mm_counter(page)]--;
> 			page_remove_rmap(page, vma, false);

It's better than I wrote.
Thanks.

> 
> >                          if (unlikely(page_mapcount(page) < 0))
> >                                  print_bad_pte(vma, addr, ptent, page);
> >                          if (unlikely(__tlb_remove_page(tlb, page))) {
> > @@ -1425,6 +1480,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                          continue;
> >                  }
> > 
> > +               // TODO: Deal COW PTE with swap
> > +
> >                  entry = pte_to_swp_entry(ptent);
> >                  if (is_device_private_entry(entry) ||
> >                      is_device_exclusive_entry(entry)) {
> > @@ -1513,16 +1570,34 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> >                          spin_unlock(ptl);
> >                  }
> > 
> > -               /*
> > -                * Here there can be other concurrent MADV_DONTNEED or
> > -                * trans huge page faults running, and if the pmd is
> > -                * none or trans huge it can change under us. This is
> > -                * because MADV_DONTNEED holds the mmap_lock in read
> > -                * mode.
> > -                */
> > -               if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> > -                       goto next;
> > -               next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> > +
> > +               if (test_bit(MMF_COW_PGTABLE, &tlb->mm->flags) &&
> > +                   !pmd_none(*pmd) && !pmd_write(*pmd)) {
> 
> Can't you use pmd_none_or_trans_huge_or_clear_bad() and keep it common ? ...

Sure.

> > +static int zap_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > +               unsigned long addr)
> > +{
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       unsigned long start, end;
> > +
> > +       if (pmd_put_pte(vma, pmd, addr)) {
> > +               // fallback
> > +               return 1;
> > +       }
> 
> No { } for a single line if. The comment could go just before the if.
> 
> > +
> > +       start = addr & PMD_MASK;
> > +       end = (addr + PMD_SIZE) & PMD_MASK;
> > +
> > +       /* If PMD entry is owner, clear the ownership, and decrease RSS state
> > +        * and pgtable_bytes.
> > +        */
> 
> Please follow the standard comments style:
> 
> /*
>   * Some text
>   * More text
>   */
> 

Got it.

> > +       if (cow_pte_owner_is_same(pmd, pmd)) {
> > +               set_cow_pte_owner(pmd, NULL);
> > +               cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
> > +               mm_dec_nr_ptes(mm);
> > +       }
> > +
> > +       pmd_clear(pmd);
> > +       return 0;
> > +}
> > +
> > +/* If alloc set means it won't break COW. For this case, it will just decrease
> > + * the reference count. The address needs to be at the beginning of the PTE page
> > + * since COW PTE is copy-on-write the entire PTE.
> > + * If pmd is NULL, it will get the pmd from vma and check it is cowing.
> > + */
> > +int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > +               unsigned long addr, bool alloc)
> > +{
> > +       pgd_t *pgd;
> > +       p4d_t *p4d;
> > +       pud_t *pud;
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       int ret = 0;
> > +       spinlock_t *ptl = NULL;
> > +
> > +       if (!pmd) {
> > +               pgd = pgd_offset(mm, addr);
> > +               if (pgd_none_or_clear_bad(pgd))
> > +                       return 0;
> > +               p4d = p4d_offset(pgd, addr);
> > +               if (p4d_none_or_clear_bad(p4d))
> > +                       return 0;
> > +               pud = pud_offset(p4d, addr);
> > +               if (pud_none_or_clear_bad(pud))
> > +                       return 0;
> > +               pmd = pmd_offset(pud, addr);
> > +               if (pmd_none(*pmd) || pmd_write(*pmd))
> > +                       return 0;
> > +       }
> > +
> > +       // TODO: handle COW PTE with swap
> > +       BUG_ON(is_swap_pmd(*pmd));
> > +       BUG_ON(pmd_trans_huge(*pmd));
> > +       BUG_ON(pmd_devmap(*pmd));
> > +
> > +       BUG_ON(pmd_none(*pmd));
> > +       BUG_ON(pmd_write(*pmd));
> 
> So many BUG_ON ? All this has a cost during the execution.

I will consider it again.

> > +
> > +       ptl = pte_lockptr(mm, pmd);
> > +       spin_lock(ptl);
> > +       if (!alloc)
> > +               ret = zap_cow_pte(vma, pmd, addr);
> > +       else
> > +               ret = break_cow_pte(vma, pmd, addr);
> 
> Better as
> 
> 	if (alloc)
> 		break_cow_pte()
> 	else
> 		zap_cow_pte()

Great!
Thanks.

> > +       spin_unlock(ptl);
> > +
> > +       return ret;
> > +}
> > +
> >   /*
> >    * These routines also need to handle stuff like marking pages dirty
> >    * and/or accessed for architectures that don't do it in hardware (most
> > @@ -4825,6 +5028,19 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> >                                  return 0;
> >                          }
> >                  }
> > +
> > +               /* When the PMD entry is set with write protection, it needs to
> > +                * handle the on-demand PTE. It will allocate a new PTE and copy
> > +                * the old one, then set this entry writeable and decrease the
> > +                * reference count at COW PTE.
> > +                */
> > +               if (test_bit(MMF_COW_PGTABLE, &mm->flags) &&
> > +                   !pmd_none(vmf.orig_pmd) && !pmd_write(vmf.orig_pmd)) {
> > +                       if (handle_cow_pte(vmf.vma, vmf.pmd, vmf.real_address,
> > +                          (cow_pte_refcount_read(&vmf.orig_pmd) > 1) ?
> > +                          true : false) < 0)
> 
> (condition ? true : false) is exactly the same as (condition)
> 

I knew. ;-)

Again, thanks!

  reply	other threads:[~2022-05-21  4:39 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 1/6] mm: Add a new mm flag for Copy-On-Write PTE table Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 2/6] mm: clone3: Add CLONE_COW_PGTABLE flag Chih-En Lin
2022-05-20 14:13   ` Christophe Leroy
2022-05-21  3:50     ` Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table Chih-En Lin
2022-05-19 23:07   ` kernel test robot
2022-05-20  0:08   ` kernel test robot
2022-05-20 14:15   ` Christophe Leroy
2022-05-21  4:03     ` Chih-En Lin
2022-05-21  4:02   ` Matthew Wilcox
2022-05-21  5:01     ` Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 4/6] mm: Add COW PTE fallback function Chih-En Lin
2022-05-20  0:20   ` kernel test robot
2022-05-20 14:21   ` Christophe Leroy
2022-05-21  4:15     ` Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE Chih-En Lin
2022-05-20 14:30   ` Christophe Leroy
2022-05-21  4:22     ` Chih-En Lin
2022-05-21  4:08   ` Matthew Wilcox
2022-05-21  5:10     ` Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table Chih-En Lin
2022-05-20 14:49   ` Christophe Leroy
2022-05-21  4:38     ` Chih-En Lin [this message]
2022-05-21  8:59 ` [External] [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Qi Zheng
2022-05-21 19:08   ` Chih-En Lin
2022-05-21 16:07 ` David Hildenbrand
2022-05-21 18:50   ` Chih-En Lin
2022-05-21 20:28     ` David Hildenbrand
2022-05-21 20:12   ` Matthew Wilcox
2022-05-21 20:22     ` David Hildenbrand
2022-05-21 22:19     ` Andy Lutomirski
2022-05-22  0:31       ` Matthew Wilcox
2022-05-22 15:20         ` Andy Lutomirski
2022-05-22 19:40           ` Matthew Wilcox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220521043853.GC1508515@strix-laptop \
    --to=shiyn.lin@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=anshuman.khandual@arm.com \
    --cc=arnd@arndb.de \
    --cc=bigeasy@linutronix.de \
    --cc=brauner@kernel.org \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=ccross@google.com \
    --cc=christophe.leroy@csgroup.eu \
    --cc=david@redhat.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=dja@axtens.net \
    --cc=ebiederm@xmission.com \
    --cc=feng.tang@intel.com \
    --cc=fenghua.yu@intel.com \
    --cc=foxhoundsk.tw@gmail.com \
    --cc=geert@linux-m68k.org \
    --cc=jhubbard@nvidia.com \
    --cc=jonathan@marek.ca \
    --cc=jserv.tw@gmail.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linyunsheng@huawei.com \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=tglx@linutronix.de \
    --cc=vbabka@suse.cz \
    --cc=vincent.guittot@linaro.org \
    --cc=william.kucharski@oracle.com \
    --cc=willy@infradead.org \
    --cc=zhao776@purdue.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.