From: Song Liu <songliubraving@fb.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: lkml <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"matthew.wilcox@oracle.com" <matthew.wilcox@oracle.com>,
	"kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>,
	"oleg@redhat.com" <oleg@redhat.com>,
	Kernel Team <Kernel-team@fb.com>,
	"william.kucharski@oracle.com" <william.kucharski@oracle.com>,
	"srikar@linux.vnet.ibm.com" <srikar@linux.vnet.ibm.com>
Subject: Re: [PATCH 1/2] khugepaged: enable collapse pmd for pte-mapped THP
Date: Tue, 30 Jul 2019 18:39:54 +0000
Message-ID: <48DAF4DE-AB27-487A-B9B2-E733FA30A7B1@fb.com>
In-Reply-To: <452746EE-186C-43D8-B15C-9921E587BA3A@fb.com>



> On Jul 30, 2019, at 10:28 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Jul 30, 2019, at 7:59 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>> 
>> On Sun, Jul 28, 2019 at 10:43:34PM -0700, Song Liu wrote:
>>> khugepaged needs exclusive mmap_sem to access the page table. When it
>>> fails to lock mmap_sem, the page will fault in as a pte-mapped THP. As
>>> the page is already a THP, khugepaged will not handle this pmd again.
>>> 
>>> This patch enables khugepaged to retry collapsing the page table.
>>> 
>>> struct mm_slot (in khugepaged.c) is extended with an array containing
>>> the addresses of pte-mapped THPs. We use an array here for simplicity;
>>> it can easily be replaced with a more advanced data structure when
>>> needed. The array is protected by khugepaged_mm_lock.
>>> 
>>> In khugepaged_scan_mm_slot(), if the mm contains a pte-mapped THP, we
>>> try to collapse the page table.
>>> 
>>> Since the collapse may happen at a later time, some pages may have
>>> already faulted in. collapse_pte_mapped_thp() is added to properly
>>> handle these pages. collapse_pte_mapped_thp() also double-checks that
>>> all ptes in this pmd map to the same THP. This is necessary because
>>> some subpages of the THP may have been replaced, for example by
>>> uprobe. In such cases, it is not possible to collapse the pmd.
>>> 
>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>> ---
>>> include/linux/khugepaged.h |  15 ++++
>>> mm/khugepaged.c            | 136 +++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 151 insertions(+)
>>> 
>>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
>>> index 082d1d2a5216..2d700830fe0e 100644
>>> --- a/include/linux/khugepaged.h
>>> +++ b/include/linux/khugepaged.h
>>> @@ -15,6 +15,16 @@ extern int __khugepaged_enter(struct mm_struct *mm);
>>> extern void __khugepaged_exit(struct mm_struct *mm);
>>> extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
>>> 				      unsigned long vm_flags);
>>> +#ifdef CONFIG_SHMEM
>>> +extern int khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
>>> +					 unsigned long addr);
>>> +#else
>>> +static inline int khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
>>> +						unsigned long addr)
>>> +{
>>> +	return 0;
>>> +}
>>> +#endif
>>> 
>>> #define khugepaged_enabled()					       \
>>> 	(transparent_hugepage_flags &				       \
>>> @@ -73,6 +83,11 @@ static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
>>> {
>>> 	return 0;
>>> }
>>> +static inline int khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
>>> +						unsigned long addr)
>>> +{
>>> +	return 0;
>>> +}
>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>> 
>>> #endif /* _LINUX_KHUGEPAGED_H */
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index eaaa21b23215..247c25aeb096 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -76,6 +76,7 @@ static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>> 
>>> static struct kmem_cache *mm_slot_cache __read_mostly;
>>> 
>>> +#define MAX_PTE_MAPPED_THP 8
>> 
>> Is the MAX_PTE_MAPPED_THP value arbitrary, or do you have a
>> justification for it?
> 
> In our use cases, we only have a small number (< 10) of huge pages for
> the text section, so 8 should be enough to cover the worst case.
> 
> If this is not sufficient, we can make it a list. 
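> 
> A minimal sketch of that list-based alternative (hypothetical, not part
> of this patch; the names are made up), in case the fixed array ever
> becomes a limitation:
> 
> 	/* one entry per pte-mapped THP, chained off struct mm_slot */
> 	struct pte_mapped_thp {
> 		struct list_head list;
> 		unsigned long addr;	/* PMD_SIZE-aligned user address */
> 	};
> 
> 	/* in struct mm_slot, replacing the fixed array: */
> 	struct list_head pte_mapped_thp_list; /* protected by khugepaged_mm_lock */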
> 
>> 
>> Please add an empty line after it.
>> 
>>> /**
>>> * struct mm_slot - hash lookup from mm to mm_slot
>>> * @hash: hash collision list
>>> @@ -86,6 +87,10 @@ struct mm_slot {
>>> 	struct hlist_node hash;
>>> 	struct list_head mm_node;
>>> 	struct mm_struct *mm;
>>> +
>>> +	/* pte-mapped THP in this mm */
>>> +	int nr_pte_mapped_thp;
>>> +	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
>>> };
>>> 
>>> /**
>>> @@ -1281,11 +1286,141 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>> 			up_write(&vma->vm_mm->mmap_sem);
>>> 			mm_dec_nr_ptes(vma->vm_mm);
>>> 			pte_free(vma->vm_mm, pmd_pgtable(_pmd));
>>> +		} else if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
>>> +			/* need down_read for khugepaged_test_exit() */
>>> +			khugepaged_add_pte_mapped_thp(vma->vm_mm, addr);
>>> +			up_read(&vma->vm_mm->mmap_sem);
>>> 		}
>>> 	}
>>> 	i_mmap_unlock_write(mapping);
>>> }
>>> 
>>> +/*
>>> + * Notify khugepaged that the given addr of this mm is a pte-mapped THP.
>>> + * Then khugepaged should try to collapse the page table.
>>> + */
>>> +int khugepaged_add_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>> 
>> What is the contract about addr alignment? Do we expect it to be
>> PAGE_SIZE aligned or PMD_SIZE aligned? Do we want to enforce it?
> 
> It is PMD_SIZE aligned. Let me add a VM_BUG_ON() for it.
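> 
> Something like this (just a sketch, using the usual HPAGE_PMD_MASK
> alignment idiom):
> 
> 	/* callers must pass a PMD_SIZE-aligned address */
> 	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);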
> 
>> 
>>> +{
>>> +	struct mm_slot *mm_slot;
>>> +	int ret = 0;
>>> +
>>> +	/* hold mmap_sem for khugepaged_test_exit() */
>>> +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
>>> +
>>> +	if (unlikely(khugepaged_test_exit(mm)))
>>> +		return 0;
>>> +
>>> +	if (!test_bit(MMF_VM_HUGEPAGE, &mm->flags) &&
>>> +	    !test_bit(MMF_DISABLE_THP, &mm->flags)) {
>>> +		ret = __khugepaged_enter(mm);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>> 
>> Any reason not to call khugepaged_enter() here?
> 
> No specific reasons... Let me try it. 

Actually, khugepaged_enter() takes a vma and vm_flags, while here we only
have the mm. I guess we should just use __khugepaged_enter(). Once we
remove all the checks on vm_flags, khugepaged_enter() is about the same
as the logic above; see the sketch below.
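
For reference, khugepaged_enter() in include/linux/khugepaged.h looks
roughly like this (paraphrased from memory of the current tree, so the
details may be slightly off):

	static inline int khugepaged_enter(struct vm_area_struct *vma,
					   unsigned long vm_flags)
	{
		if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
			if ((khugepaged_always() ||
			     (khugepaged_req_madv() &&
			      (vm_flags & VM_HUGEPAGE))) &&
			    !(vm_flags & VM_NOHUGEPAGE) &&
			    !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
				if (__khugepaged_enter(vma->vm_mm))
					return -ENOMEM;
		return 0;
	}

With the vm_flags checks dropped, what is left is exactly the
MMF_VM_HUGEPAGE / MMF_DISABLE_THP test plus the __khugepaged_enter()
call, i.e. the logic in the patch.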

Thanks,
Song

[...]

Thread overview: 10+ messages
2019-07-29  5:43 [PATCH 0/2] khugepaged: collapse pmd for pte-mapped THP Song Liu
2019-07-29  5:43 ` [PATCH 1/2] khugepaged: enable " Song Liu
2019-07-30 14:59   ` Kirill A. Shutemov
2019-07-30 17:28     ` Song Liu
2019-07-30 18:39       ` Song Liu [this message]
2019-07-29  5:43 ` [PATCH 2/2] uprobe: collapse THP pmd after removing all uprobes Song Liu
2019-07-30 15:01   ` Kirill A. Shutemov
2019-07-30 17:02     ` Song Liu
2019-07-31 16:16   ` Oleg Nesterov
2019-07-31 16:36     ` Song Liu
