linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] Fixes for THP in page cache
@ 2019-10-16  7:37 Song Liu
  2019-10-16  7:37 ` [PATCH 1/4] proc/meminfo: fix output alignment Song Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Song Liu @ 2019-10-16  7:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kernel-team, william.kucharski, Song Liu

This set includes a few fixes for THP in page cache. They are based on
Linus's master branch.

Thanks,
Song

Kirill A. Shutemov (3):
  proc/meminfo: fix output alignment
  mm/thp: fix node page state in split_huge_page_to_list()
  mm/thp: allow drop THP from page cache

Song Liu (1):
  uprobe: only do FOLL_SPLIT_PMD for uprobe register

 fs/proc/meminfo.c       |  4 ++--
 kernel/events/uprobes.c | 10 ++++++++--
 mm/huge_memory.c        |  9 +++++++--
 mm/truncate.c           | 12 ++++++++++++
 mm/vmscan.c             |  3 ++-
 5 files changed, 31 insertions(+), 7 deletions(-)

--
2.17.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] proc/meminfo: fix output alignment
  2019-10-16  7:37 [PATCH 0/4] Fixes for THP in page cache Song Liu
@ 2019-10-16  7:37 ` Song Liu
  2019-10-16  7:37 ` [PATCH 2/4] mm/thp: fix node page state in split_huge_page_to_list() Song Liu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Song Liu @ 2019-10-16  7:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kernel-team, william.kucharski,
	Kirill A. Shutemov, Song Liu

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Add extra space for FileHugePages and FilePmdMapped, so the output is
aligned with other rows.

Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 fs/proc/meminfo.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index ac9247371871..8c1f1bb1a5ce 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -132,9 +132,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		    global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
 	show_val_kb(m, "ShmemPmdMapped: ",
 		    global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR);
-	show_val_kb(m, "FileHugePages: ",
+	show_val_kb(m, "FileHugePages:  ",
 		    global_node_page_state(NR_FILE_THPS) * HPAGE_PMD_NR);
-	show_val_kb(m, "FilePmdMapped: ",
+	show_val_kb(m, "FilePmdMapped:  ",
 		    global_node_page_state(NR_FILE_PMDMAPPED) * HPAGE_PMD_NR);
 #endif
 
-- 
2.17.1



* [PATCH 2/4] mm/thp: fix node page state in split_huge_page_to_list()
  2019-10-16  7:37 [PATCH 0/4] Fixes for THP in page cache Song Liu
  2019-10-16  7:37 ` [PATCH 1/4] proc/meminfo: fix output alignment Song Liu
@ 2019-10-16  7:37 ` Song Liu
  2019-10-16  7:37 ` [PATCH 3/4] mm/thp: allow drop THP from page cache Song Liu
  2019-10-16  7:37 ` [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register Song Liu
  3 siblings, 0 replies; 13+ messages in thread
From: Song Liu @ 2019-10-16  7:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kernel-team, william.kucharski,
	Kirill A. Shutemov, Song Liu

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Make sure split_huge_page_to_list() handles the node page state of shmem
THPs and file THPs properly.

Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 mm/huge_memory.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c5cb6dcd6c69..13cc93785006 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2789,8 +2789,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
 		}
-		if (mapping)
-			__dec_node_page_state(page, NR_SHMEM_THPS);
+		if (mapping) {
+			if (PageSwapBacked(page))
+				__dec_node_page_state(page, NR_SHMEM_THPS);
+			else
+				__dec_node_page_state(page, NR_FILE_THPS);
+		}
+
 		spin_unlock(&ds_queue->split_queue_lock);
 		__split_huge_page(page, list, end, flags);
 		if (PageSwapCache(head)) {
-- 
2.17.1



* [PATCH 3/4] mm/thp: allow drop THP from page cache
  2019-10-16  7:37 [PATCH 0/4] Fixes for THP in page cache Song Liu
  2019-10-16  7:37 ` [PATCH 1/4] proc/meminfo: fix output alignment Song Liu
  2019-10-16  7:37 ` [PATCH 2/4] mm/thp: fix node page state in split_huge_page_to_list() Song Liu
@ 2019-10-16  7:37 ` Song Liu
  2019-10-17 16:12   ` Matthew Wilcox
  2019-10-16  7:37 ` [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register Song Liu
  3 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2019-10-16  7:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kernel-team, william.kucharski,
	Kirill A. Shutemov, Song Liu

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Once a THP is added to the page cache, it cannot be dropped via
/proc/sys/vm/drop_caches. Fix this issue with proper handling in
invalidate_mapping_pages() and __remove_mapping().

Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 mm/truncate.c | 12 ++++++++++++
 mm/vmscan.c   |  3 ++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 8563339041f6..dd9ebc1da356 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -592,6 +592,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					unlock_page(page);
 					continue;
 				}
+
+				/* Take a pin outside pagevec */
+				get_page(page);
+
+				/*
+				 * Drop extra pins before trying to invalidate
+				 * the huge page.
+				 */
+				pagevec_remove_exceptionals(&pvec);
+				pagevec_release(&pvec);
 			}
 
 			ret = invalidate_inode_page(page);
@@ -602,6 +612,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 			 */
 			if (!ret)
 				deactivate_file_page(page);
+			if (PageTransHuge(page))
+				put_page(page);
 			count += ret;
 		}
 		pagevec_remove_exceptionals(&pvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6659bb758a4..1d80a188ad4a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -932,7 +932,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	 * Note that if SetPageDirty is always performed via set_page_dirty,
 	 * and thus under the i_pages lock, then this ordering is not required.
 	 */
-	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
+	if (unlikely(PageTransHuge(page)) &&
+			(PageSwapCache(page) || !PageSwapBacked(page)))
 		refcount = 1 + HPAGE_PMD_NR;
 	else
 		refcount = 2;
-- 
2.17.1



* [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register
  2019-10-16  7:37 [PATCH 0/4] Fixes for THP in page cache Song Liu
                   ` (2 preceding siblings ...)
  2019-10-16  7:37 ` [PATCH 3/4] mm/thp: allow drop THP from page cache Song Liu
@ 2019-10-16  7:37 ` Song Liu
  2019-10-16 12:10   ` Oleg Nesterov
  3 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2019-10-16  7:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kernel-team, william.kucharski, Song Liu,
	Kirill A . Shutemov, Srikar Dronamraju, Oleg Nesterov

Attaching a uprobe to the text section of a THP splits the PMD-mapped page
table entry into PTE-mapped entries. On uprobe detach, we would like to
regroup the PTE-mapped entries back into a PMD mapping to regain the
performance benefit of THP.

However, the regroup is broken for perf_event-based trace_uprobe. This is
because perf_event-based trace_uprobe calls uprobe_unregister twice on
close: first in TRACE_REG_PERF_CLOSE, then in TRACE_REG_PERF_UNREGISTER.
The second call will split the PMD-mapped page table entry, which is not
the desired behavior.

Fix this by only using FOLL_SPLIT_PMD for the uprobe register case.

Also add a WARN() to confirm that uprobe unregister never works on huge
pages.

Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/events/uprobes.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 94d38a39d72e..d7a556cc589e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -474,14 +474,17 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	struct vm_area_struct *vma;
 	int ret, is_register, ref_ctr_updated = 0;
 	bool orig_page_huge = false;
+	unsigned int gup_flags = FOLL_FORCE;
 
 	is_register = is_swbp_insn(&opcode);
 	uprobe = container_of(auprobe, struct uprobe, arch);
 
 retry:
+	if (is_register)
+		gup_flags |= FOLL_SPLIT_PMD;
 	/* Read the page with vaddr into memory */
-	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
-			FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
+	ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags,
+				    &old_page, &vma, NULL);
 	if (ret <= 0)
 		return ret;
 
@@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (ret <= 0)
 		goto put_old;
 
+	WARN(!is_register && PageCompound(old_page),
+	     "uprobe unregister should never work on compound page\n");
+
 	/* We are going to replace instruction, update ref_ctr. */
 	if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
 		ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1);
-- 
2.17.1



* Re: [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register
  2019-10-16  7:37 ` [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register Song Liu
@ 2019-10-16 12:10   ` Oleg Nesterov
  2019-10-16 16:10     ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2019-10-16 12:10 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, linux-mm, akpm, matthew.wilcox, kernel-team,
	william.kucharski, Kirill A . Shutemov, Srikar Dronamraju

On 10/16, Song Liu wrote:
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -474,14 +474,17 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	int ret, is_register, ref_ctr_updated = 0;
>  	bool orig_page_huge = false;
> +	unsigned int gup_flags = FOLL_FORCE;
>  
>  	is_register = is_swbp_insn(&opcode);
>  	uprobe = container_of(auprobe, struct uprobe, arch);
>  
>  retry:
> +	if (is_register)
> +		gup_flags |= FOLL_SPLIT_PMD;
>  	/* Read the page with vaddr into memory */
> -	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
> -			FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags,
> +				    &old_page, &vma, NULL);
>  	if (ret <= 0)
>  		return ret;
>  
> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  	if (ret <= 0)
>  		goto put_old;
>  
> +	WARN(!is_register && PageCompound(old_page),
> +	     "uprobe unregister should never work on compound page\n");

But this can happen with the change above. You can't know if *vaddr was
previously changed by install_breakpoint() or not.

If not, verify_opcode() should likely save us, but we can't rely on it.
Say, someone can write "int3" into vm_file at uprobe->offset.

And I am not sure it is safe to continue in this case, I'd suggest to
return -EWHATEVER to avoid the possible crash.

Oleg.



* Re: [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register
  2019-10-16 12:10   ` Oleg Nesterov
@ 2019-10-16 16:10     ` Song Liu
  2019-10-17  8:47       ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2019-10-16 16:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: open list, linux-mm, akpm, matthew.wilcox, Kernel Team,
	william.kucharski, Kirill A . Shutemov, Srikar Dronamraju



> On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 10/16, Song Liu wrote:
>> 
>> --- a/kernel/events/uprobes.c
>> +++ b/kernel/events/uprobes.c
>> @@ -474,14 +474,17 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>> 	struct vm_area_struct *vma;
>> 	int ret, is_register, ref_ctr_updated = 0;
>> 	bool orig_page_huge = false;
>> +	unsigned int gup_flags = FOLL_FORCE;
>> 
>> 	is_register = is_swbp_insn(&opcode);
>> 	uprobe = container_of(auprobe, struct uprobe, arch);
>> 
>> retry:
>> +	if (is_register)
>> +		gup_flags |= FOLL_SPLIT_PMD;
>> 	/* Read the page with vaddr into memory */
>> -	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
>> -			FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
>> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags,
>> +				    &old_page, &vma, NULL);
>> 	if (ret <= 0)
>> 		return ret;
>> 
>> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>> 	if (ret <= 0)
>> 		goto put_old;
>> 
>> +	WARN(!is_register && PageCompound(old_page),
>> +	     "uprobe unregister should never work on compound page\n");
> 
> But this can happen with the change above. You can't know if *vaddr was
> previously changed by install_breakpoint() or not.

> If not, verify_opcode() should likely save us, but we can't rely on it.
> Say, someone can write "int3" into vm_file at uprobe->offset.

I think this won't really happen. With is_register == false, we already
know the opcode is not "int3", so the current call must be from
set_orig_insn(). Therefore, old_page must have been installed by uprobe,
and cannot be compound.

The other way is not guaranteed. With is_register == true, it is still
possible that the current call is from set_orig_insn(). However, we do
not rely on this path.

Does this make sense? Or did I miss anything?

> 
> And I am not sure it is safe to continue in this case, I'd suggest to
> return -EWHATEVER to avoid the possible crash.

I think we can return -ESOMETHING here to be safe. However, if the 
analysis above makes sense, it is not necessary. 

Thanks,
Song




* Re: [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register
  2019-10-16 16:10     ` Song Liu
@ 2019-10-17  8:47       ` Oleg Nesterov
  2019-10-17 14:05         ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2019-10-17  8:47 UTC (permalink / raw)
  To: Song Liu
  Cc: open list, linux-mm, akpm, matthew.wilcox, Kernel Team,
	william.kucharski, Kirill A . Shutemov, Srikar Dronamraju

On 10/16, Song Liu wrote:
>
> > On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> >> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >> 	if (ret <= 0)
> >> 		goto put_old;
> >>
> >> +	WARN(!is_register && PageCompound(old_page),
> >> +	     "uprobe unregister should never work on compound page\n");
> >
> > But this can happen with the change above. You can't know if *vaddr was
> > previously changed by install_breakpoint() or not.
>
> > If not, verify_opcode() should likely save us, but we can't rely on it.
> > Say, someone can write "int3" into vm_file at uprobe->offset.
>
> I think this won't really happen. With is_register == false, we already
> know the opcode is not "int3", so the current call must be from
> set_orig_insn(). Therefore, old_page must have been installed by uprobe,
> and cannot be compound.
>
> The other way is not guaranteed. With is_register == true, it is still
> possible that the current call is from set_orig_insn(). However, we do
> not rely on this path.

Quite contrary.

When is_register == true we know that a) the caller is install_breakpoint()
and b) the original insn is NOT int3 unless this page was already COW'ed by
userspace, say, by gdb.

If is_register == false we only know that the caller is remove_breakpoint().
We can't know if this page was COW'ed by uprobes or userspace, we can not
know if the insn we are going to replace is int3 or not, thus we can not
assume that verify_opcode() will fail and save us.

Oleg.



* Re: [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register
  2019-10-17  8:47       ` Oleg Nesterov
@ 2019-10-17 14:05         ` Song Liu
  2019-10-17 14:28           ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2019-10-17 14:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: open list, linux-mm, akpm, matthew.wilcox, Kernel Team,
	william.kucharski, Kirill A . Shutemov, Srikar Dronamraju



> On Oct 17, 2019, at 1:47 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 10/16, Song Liu wrote:
>> 
>>> On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> 
>>>> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>>>> 	if (ret <= 0)
>>>> 		goto put_old;
>>>> 
>>>> +	WARN(!is_register && PageCompound(old_page),
>>>> +	     "uprobe unregister should never work on compound page\n");
>>> 
>>> But this can happen with the change above. You can't know if *vaddr was
>>> previously changed by install_breakpoint() or not.
>> 
>>> If not, verify_opcode() should likely save us, but we can't rely on it.
>>> Say, someone can write "int3" into vm_file at uprobe->offset.
>> 
>> I think this won't really happen. With is_register == false, we already
>> know the opcode is not "int3", so the current call must be from
>> set_orig_insn(). Therefore, old_page must have been installed by uprobe,
>> and cannot be compound.
>>
>> The other way is not guaranteed. With is_register == true, it is still
>> possible that the current call is from set_orig_insn(). However, we do
>> not rely on this path.
> 
> Quite contrary.
> 
> When is_register == true we know that a) the caller is install_breakpoint()
> > and b) the original insn is NOT int3 unless this page was already COW'ed by
> userspace, say, by gdb.
> 
> If is_register == false we only know that the caller is remove_breakpoint().
> We can't know if this page was COW'ed by uprobes or userspace, we can not
> know if the insn we are going to replace is int3 or not, thus we can not
> assume that verify_opcode() will fail and save us.

So the case we worry about is: 
old_page is COW by user space, target insn is int3, and it is a huge page; 
then uprobe calls remove_breakpoint(); 

Yeah, I guess this could happen. 

For the fix, I guess return -Esomething in such case should be sufficient?

Thanks,
Song 



* Re: [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register
  2019-10-17 14:05         ` Song Liu
@ 2019-10-17 14:28           ` Oleg Nesterov
  2019-10-17 15:34             ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2019-10-17 14:28 UTC (permalink / raw)
  To: Song Liu
  Cc: open list, linux-mm, akpm, matthew.wilcox, Kernel Team,
	william.kucharski, Kirill A . Shutemov, Srikar Dronamraju

On 10/17, Song Liu wrote:
>
>
> > On Oct 17, 2019, at 1:47 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 10/16, Song Liu wrote:
> >>
> >>> On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >>>
> >>>> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >>>> 	if (ret <= 0)
> >>>> 		goto put_old;
> >>>>
> >>>> +	WARN(!is_register && PageCompound(old_page),
> >>>> +	     "uprobe unregister should never work on compound page\n");
> >>>
> >>> But this can happen with the change above. You can't know if *vaddr was
> >>> previously changed by install_breakpoint() or not.
> >>
> >>> If not, verify_opcode() should likely save us, but we can't rely on it.
> >>> Say, someone can write "int3" into vm_file at uprobe->offset.
> >>
> >> I think this won't really happen. With is_register == false, we already
> >> know the opcode is not "int3", so the current call must be from
> >> set_orig_insn(). Therefore, old_page must have been installed by uprobe,
> >> and cannot be compound.
> >>
> >> The other way is not guaranteed. With is_register == true, it is still
> >> possible that the current call is from set_orig_insn(). However, we do
> >> not rely on this path.
> >
> > Quite contrary.
> >
> > When is_register == true we know that a) the caller is install_breakpoint()
> > and b) the original insn is NOT int3 unless this page was already COW'ed by
> > userspace, say, by gdb.
> >
> > If is_register == false we only know that the caller is remove_breakpoint().
> > We can't know if this page was COW'ed by uprobes or userspace, we can not
> > know if the insn we are going to replace is int3 or not, thus we can not
> > assume that verify_opcode() will fail and save us.
>
> So the case we worry about is:
> old_page is COW by user space,

no, in this case the page shouldn't be huge,

> target insn is int3, and it is a huge page;
> then uprobe calls remove_breakpoint();

Yes,

> Yeah, I guess this could happen.

Yes,

> For the fix, I guess return -Esomething in such case should be sufficient?

this is what I tried to suggest from the very beginning.

Oleg.



* Re: [PATCH 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register
  2019-10-17 14:28           ` Oleg Nesterov
@ 2019-10-17 15:34             ` Song Liu
  0 siblings, 0 replies; 13+ messages in thread
From: Song Liu @ 2019-10-17 15:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: open list, linux-mm, akpm, matthew.wilcox, Kernel Team,
	william.kucharski, Kirill A . Shutemov, Srikar Dronamraju



> On Oct 17, 2019, at 7:28 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 10/17, Song Liu wrote:
>> 
>> 
>>> On Oct 17, 2019, at 1:47 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> 
>>> On 10/16, Song Liu wrote:
>>>> 
>>>>> On Oct 16, 2019, at 5:10 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>>> 
>>>>>> @@ -489,6 +492,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>>>>>> 	if (ret <= 0)
>>>>>> 		goto put_old;
>>>>>> 
>>>>>> +	WARN(!is_register && PageCompound(old_page),
>>>>>> +	     "uprobe unregister should never work on compound page\n");
>>>>> 
>>>>> But this can happen with the change above. You can't know if *vaddr was
>>>>> previously changed by install_breakpoint() or not.
>>>> 
>>>>> If not, verify_opcode() should likely save us, but we can't rely on it.
>>>>> Say, someone can write "int3" into vm_file at uprobe->offset.
>>>> 
>>>> I think this won't really happen. With is_register == false, we already
>>>> know the opcode is not "int3", so the current call must be from
>>>> set_orig_insn(). Therefore, old_page must have been installed by uprobe,
>>>> and cannot be compound.
>>>>
>>>> The other way is not guaranteed. With is_register == true, it is still
>>>> possible that the current call is from set_orig_insn(). However, we do
>>>> not rely on this path.
>>> 
>>> Quite contrary.
>>> 
>>> When is_register == true we know that a) the caller is install_breakpoint()
>>> and b) the original insn is NOT int3 unless this page was already COW'ed by
>>> userspace, say, by gdb.
>>> 
>>> If is_register == false we only know that the caller is remove_breakpoint().
>>> We can't know if this page was COW'ed by uprobes or userspace, we can not
>>> know if the insn we are going to replace is int3 or not, thus we can not
>>> assume that verify_opcode() will fail and save us.
>> 
>> So the case we worry about is:
>> old_page is COW by user space,
> 
> no, in this case the page shouldn't be huge,
> 
>> target insn is int3, and it is a huge page;
>> then uprobe calls remove_breakpoint();
> 
> Yes,
> 
>> Yeah, I guess this could happen.
> 
> Yes,
> 
>> For the fix, I guess return -Esomething in such case should be sufficient?
> 
> this is what I tried to suggest from the very beginning.
> 

Thanks Oleg!

Attached is v2 of 4/4. 

Song


============================ 8< ===============================

From 0ab451f31b8fa6315d9de86eebe5d6c44dabac7e Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Tue, 15 Oct 2019 22:38:55 -0700
Subject: [PATCH v2 4/4] uprobe: only do FOLL_SPLIT_PMD for uprobe register

Attaching a uprobe to the text section of a THP splits the PMD-mapped page
table entry into PTE-mapped entries. On uprobe detach, we would like to
regroup the PTE-mapped entries back into a PMD mapping to regain the
performance benefit of THP.

However, the regroup is broken for perf_event-based trace_uprobe. This is
because perf_event-based trace_uprobe calls uprobe_unregister twice on
close: first in TRACE_REG_PERF_CLOSE, then in TRACE_REG_PERF_UNREGISTER.
The second call will split the PMD-mapped page table entry, which is not
the desired behavior.

Fix this by only using FOLL_SPLIT_PMD for the uprobe register case.

Add a WARN() to confirm that uprobe unregister never works on huge pages,
and abort the operation when this WARN() triggers.

Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

---
Changes v1 -> v2:
Return -EINVAL if the WARN() triggers. (Oleg Nesterov)
---
 kernel/events/uprobes.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 94d38a39d72e..c74761004ee5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -474,14 +474,17 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
        struct vm_area_struct *vma;
        int ret, is_register, ref_ctr_updated = 0;
        bool orig_page_huge = false;
+       unsigned int gup_flags = FOLL_FORCE;

        is_register = is_swbp_insn(&opcode);
        uprobe = container_of(auprobe, struct uprobe, arch);

 retry:
+       if (is_register)
+               gup_flags |= FOLL_SPLIT_PMD;
        /* Read the page with vaddr into memory */
-       ret = get_user_pages_remote(NULL, mm, vaddr, 1,
-                       FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
+       ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags,
+                                   &old_page, &vma, NULL);
        if (ret <= 0)
                return ret;

@@ -489,6 +492,12 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
        if (ret <= 0)
                goto put_old;

+       if (WARN(!is_register && PageCompound(old_page),
+                "uprobe unregister should never work on compound page\n")) {
+               ret = -EINVAL;
+               goto put_old;
+       }
+
        /* We are going to replace instruction, update ref_ctr. */
        if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
                ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1);
--
2.17.1




* Re: [PATCH 3/4] mm/thp: allow drop THP from page cache
  2019-10-16  7:37 ` [PATCH 3/4] mm/thp: allow drop THP from page cache Song Liu
@ 2019-10-17 16:12   ` Matthew Wilcox
  2019-10-17 16:36     ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2019-10-17 16:12 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, linux-mm, akpm, matthew.wilcox, kernel-team,
	william.kucharski, Kirill A. Shutemov

On Wed, Oct 16, 2019 at 12:37:30AM -0700, Song Liu wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Once a THP is added to the page cache, it cannot be dropped via
> /proc/sys/vm/drop_caches. Fix this issue with proper handling in
> invalidate_mapping_pages() and __remove_mapping().
> 
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  mm/truncate.c | 12 ++++++++++++
>  mm/vmscan.c   |  3 ++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c6659bb758a4..1d80a188ad4a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -932,7 +932,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  	 * Note that if SetPageDirty is always performed via set_page_dirty,
>  	 * and thus under the i_pages lock, then this ordering is not required.
>  	 */
> -	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
> +	if (unlikely(PageTransHuge(page)) &&
> +			(PageSwapCache(page) || !PageSwapBacked(page)))
>  		refcount = 1 + HPAGE_PMD_NR;
>  	else
>  		refcount = 2;

Kirill suggests that this patch would be better (for this part of the patch;
the part in truncate.c should remain as it is)

commit ddcee327f96d57cb9d5310486d21e43892b7a368
Author: William Kucharski <william.kucharski@oracle.com>
Date:   Fri Sep 20 16:14:51 2019 -0400

    mm: Support removing arbitrary sized pages from mapping
    
    __remove_mapping() assumes that pages can only be either base pages
    or HPAGE_PMD_SIZE.  Further, it assumes that large pages are
    swap-backed.  Support all kinds of pages by unconditionally asking how
    many pages this page references.
    
    Signed-off-by: William Kucharski <william.kucharski@oracle.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6659bb758a4..f870da1f4bb7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -932,10 +932,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	 * Note that if SetPageDirty is always performed via set_page_dirty,
 	 * and thus under the i_pages lock, then this ordering is not required.
 	 */
-	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
-		refcount = 1 + HPAGE_PMD_NR;
-	else
-		refcount = 2;
+	refcount = 1 + compound_nr(page);
 	if (!page_ref_freeze(page, refcount))
 		goto cannot_free;
 	/* note: atomic_cmpxchg in page_ref_freeze provides the smp_rmb */


* Re: [PATCH 3/4] mm/thp: allow drop THP from page cache
  2019-10-17 16:12   ` Matthew Wilcox
@ 2019-10-17 16:36     ` Song Liu
  0 siblings, 0 replies; 13+ messages in thread
From: Song Liu @ 2019-10-17 16:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: open list, linux-mm, akpm, matthew.wilcox, Kernel Team,
	william.kucharski, Kirill A. Shutemov



> On Oct 17, 2019, at 9:12 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Wed, Oct 16, 2019 at 12:37:30AM -0700, Song Liu wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> 
>> Once a THP is added to the page cache, it cannot be dropped via
>> /proc/sys/vm/drop_caches. Fix this issue with proper handling in
>> invalidate_mapping_pages() and __remove_mapping().
>> 
>> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Tested-by: Song Liu <songliubraving@fb.com>
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> mm/truncate.c | 12 ++++++++++++
>> mm/vmscan.c   |  3 ++-
>> 2 files changed, 14 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c6659bb758a4..1d80a188ad4a 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -932,7 +932,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>> 	 * Note that if SetPageDirty is always performed via set_page_dirty,
>> 	 * and thus under the i_pages lock, then this ordering is not required.
>> 	 */
>> -	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
>> +	if (unlikely(PageTransHuge(page)) &&
>> +			(PageSwapCache(page) || !PageSwapBacked(page)))
>> 		refcount = 1 + HPAGE_PMD_NR;
>> 	else
>> 		refcount = 2;
> 
> Kirill suggests that this patch would be better (for this part of the patch;
> the part in truncate.c should remain as it is)
> 
> commit ddcee327f96d57cb9d5310486d21e43892b7a368
> Author: William Kucharski <william.kucharski@oracle.com>
> Date:   Fri Sep 20 16:14:51 2019 -0400
> 
>    mm: Support removing arbitrary sized pages from mapping
> 
>    __remove_mapping() assumes that pages can only be either base pages
>    or HPAGE_PMD_SIZE.  Further, it assumes that large pages are
>    swap-backed.  Support all kinds of pages by unconditionally asking how
>    many pages this page references.
> 
>    Signed-off-by: William Kucharski <william.kucharski@oracle.com>
>    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c6659bb758a4..f870da1f4bb7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -932,10 +932,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> 	 * Note that if SetPageDirty is always performed via set_page_dirty,
> 	 * and thus under the i_pages lock, then this ordering is not required.
> 	 */
> -	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
> -		refcount = 1 + HPAGE_PMD_NR;
> -	else
> -		refcount = 2;
> +	refcount = 1 + compound_nr(page);
> 	if (!page_ref_freeze(page, refcount))
> 		goto cannot_free;
> 	/* note: atomic_cmpxchg in page_ref_freeze provides the smp_rmb */

This does look cleaner, and works fine in my tests.

Let me include it in v2 set. 

Thanks,
Song

