linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] fix hugepage coredump
@ 2013-04-03 18:35 Naoya Horiguchi
  2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2013-04-03 18:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	linux-kernel, Naoya Horiguchi

Hi,

Here is 3nd version of hugepage coredump fix.
I added some minor changes based on Hatayama-san's and Michal's comments.
Thank you very much.

Naoya Horiguchi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)
  2013-04-03 18:35 [PATCH v3 0/3] fix hugepage coredump Naoya Horiguchi
@ 2013-04-03 18:35 ` Naoya Horiguchi
  2013-04-03 19:20   ` Rik van Riel
  2013-04-05 18:32   ` KOSAKI Motohiro
  2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi
  2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi
  2 siblings, 2 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2013-04-03 18:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	linux-kernel, Naoya Horiguchi

Currently we fail to include any data on hugepages into coredump,
because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently
introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and
mm->reserved_vm counter". This looks to me a serious regression,
so let's fix it.

ChangeLog v3:
 - move 'return 0' into a separate patch

ChangeLog v2:
 - add 'return 0' in hugepage memory check

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: stable@vger.kernel.org
---
 fs/hugetlbfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git v3.9-rc3.orig/fs/hugetlbfs/inode.c v3.9-rc3/fs/hugetlbfs/inode.c
index 84e3d85..523464e 100644
--- v3.9-rc3.orig/fs/hugetlbfs/inode.c
+++ v3.9-rc3/fs/hugetlbfs/inode.c
@@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * way when do_mmap_pgoff unwinds (may be important on powerpc
 	 * and ia64).
 	 */
-	vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
 	vma->vm_ops = &hugetlb_vm_ops;
 
 	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size()
  2013-04-03 18:35 [PATCH v3 0/3] fix hugepage coredump Naoya Horiguchi
  2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi
@ 2013-04-03 18:35 ` Naoya Horiguchi
  2013-04-03 19:22   ` Rik van Riel
                     ` (3 more replies)
  2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi
  2 siblings, 4 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2013-04-03 18:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	linux-kernel, Naoya Horiguchi

Documentation/filesystems/proc.txt says about coredump_filter bitmask,

  Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
  effected by bit 5-6.

However current code can go into the subsequent flag checks of bit 0-4
for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
as written in the document.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org
---
 fs/binfmt_elf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git v3.9-rc3.orig/fs/binfmt_elf.c v3.9-rc3/fs/binfmt_elf.c
index 3939829..86af964 100644
--- v3.9-rc3.orig/fs/binfmt_elf.c
+++ v3.9-rc3/fs/binfmt_elf.c
@@ -1137,6 +1137,7 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
 			goto whole;
 		if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE))
 			goto whole;
+		return 0;
 	}
 
 	/* Do not dump I/O mapped devices or special mappings */
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
  2013-04-03 18:35 [PATCH v3 0/3] fix hugepage coredump Naoya Horiguchi
  2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi
  2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi
@ 2013-04-03 18:35 ` Naoya Horiguchi
  2013-04-03 20:01   ` Rik van Riel
  2013-04-05 18:59   ` KOSAKI Motohiro
  2 siblings, 2 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2013-04-03 18:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	linux-kernel, Naoya Horiguchi

With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
error happens on a hugepage and the affected processes try to access
the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0)
in get_page().

The reason for this bug is that coredump-related code doesn't recognise
"hugepage hwpoison entry" with which a pmd entry is replaced when a memory
error occurs on a hugepage.
In other words, physical address information is stored in different bit layout
between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page()
which is called in get_dump_page() returns a wrong page from a given address.

We need to filter out only hwpoison hugepages to have data on healthy
hugepages in coredump. So this patch makes follow_hugetlb_page() avoid
trying to get page when a pmd is in swap entry like format.

ChangeLog v3:
 - add comment about using is_swap_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: stable@vger.kernel.org
---
 mm/hugetlb.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
index 0d1705b..3bc20bd 100644
--- v3.9-rc3.orig/mm/hugetlb.c
+++ v3.9-rc3/mm/hugetlb.c
@@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
 		 * first, for the page indexing below to work.
+		 *
+		 * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
+		 * and hugepages under migration in which case
+		 * hugetlb_fault waits for the migration and bails out
+		 * properly for HWPosined pages.
 		 */
 		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
-		absent = !pte || huge_pte_none(huge_ptep_get(pte));
+		absent = !pte || huge_pte_none(huge_ptep_get(pte)) ||
+			is_swap_pte(huge_ptep_get(pte));
 
 		/*
 		 * When coredumping, it suits get_dump_page if we just return
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)
  2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi
@ 2013-04-03 19:20   ` Rik van Riel
  2013-04-05 18:32   ` KOSAKI Motohiro
  1 sibling, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2013-04-03 19:20 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	linux-kernel

On 04/03/2013 02:35 PM, Naoya Horiguchi wrote:
> Currently we fail to include any data on hugepages into coredump,
> because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently
> introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and
> mm->reserved_vm counter". This looks to me a serious regression,
> so let's fix it.
>
> ChangeLog v3:
>   - move 'return 0' into a separate patch
>
> ChangeLog v2:
>   - add 'return 0' in hugepage memory check
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> Acked-by: Michal Hocko <mhocko@suse.cz>
> Cc: stable@vger.kernel.org

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size()
  2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi
@ 2013-04-03 19:22   ` Rik van Riel
  2013-04-03 21:14   ` Michal Hocko
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2013-04-03 19:22 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	linux-kernel

On 04/03/2013 02:35 PM, Naoya Horiguchi wrote:
> Documentation/filesystems/proc.txt says about coredump_filter bitmask,
>
>    Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
>    effected by bit 5-6.
>
> However current code can go into the subsequent flag checks of bit 0-4
> for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
> as written in the document.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
  2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi
@ 2013-04-03 20:01   ` Rik van Riel
  2013-04-05 18:59   ` KOSAKI Motohiro
  1 sibling, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2013-04-03 20:01 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	linux-kernel

On 04/03/2013 02:35 PM, Naoya Horiguchi wrote:
> With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
> initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
> error happens on a hugepage and the affected processes try to access
> the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0)
> in get_page().
>
> The reason for this bug is that coredump-related code doesn't recognise
> "hugepage hwpoison entry" with which a pmd entry is replaced when a memory
> error occurs on a hugepage.
> In other words, physical address information is stored in different bit layout
> between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page()
> which is called in get_dump_page() returns a wrong page from a given address.
>
> We need to filter out only hwpoison hugepages to have data on healthy
> hugepages in coredump. So this patch makes follow_hugetlb_page() avoid
> trying to get page when a pmd is in swap entry like format.
>
> ChangeLog v3:
>   - add comment about using is_swap_pte()
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> Cc: stable@vger.kernel.org

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size()
  2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi
  2013-04-03 19:22   ` Rik van Riel
@ 2013-04-03 21:14   ` Michal Hocko
  2013-04-04  0:42   ` HATAYAMA Daisuke
  2013-04-05 18:33   ` KOSAKI Motohiro
  3 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2013-04-03 21:14 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Konstantin Khlebnikov, HATAYAMA Daisuke,
	linux-mm, linux-kernel

On Wed 03-04-13 14:35:37, Naoya Horiguchi wrote:
> Documentation/filesystems/proc.txt says about coredump_filter bitmask,
> 
>   Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
>   effected by bit 5-6.
> 
> However current code can go into the subsequent flag checks of bit 0-4
> for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
> as written in the document.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: stable@vger.kernel.org

Just for the record. It should be stable for 3.7+ since (314e51b98)
becuase then have lost VM_RESERVED check which used to stop hugetlb
mappings.

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  fs/binfmt_elf.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git v3.9-rc3.orig/fs/binfmt_elf.c v3.9-rc3/fs/binfmt_elf.c
> index 3939829..86af964 100644
> --- v3.9-rc3.orig/fs/binfmt_elf.c
> +++ v3.9-rc3/fs/binfmt_elf.c
> @@ -1137,6 +1137,7 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
>  			goto whole;
>  		if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE))
>  			goto whole;
> +		return 0;
>  	}
>  
>  	/* Do not dump I/O mapped devices or special mappings */
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size()
  2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi
  2013-04-03 19:22   ` Rik van Riel
  2013-04-03 21:14   ` Michal Hocko
@ 2013-04-04  0:42   ` HATAYAMA Daisuke
  2013-04-05 18:33   ` KOSAKI Motohiro
  3 siblings, 0 replies; 18+ messages in thread
From: HATAYAMA Daisuke @ 2013-04-04  0:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, linux-mm,
	linux-kernel

(2013/04/04 3:35), Naoya Horiguchi wrote:
> Documentation/filesystems/proc.txt says about coredump_filter bitmask,
> 
>    Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
>    effected by bit 5-6.
> 
> However current code can go into the subsequent flag checks of bit 0-4
> for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
> as written in the document.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: stable@vger.kernel.org
> ---
>   fs/binfmt_elf.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git v3.9-rc3.orig/fs/binfmt_elf.c v3.9-rc3/fs/binfmt_elf.c
> index 3939829..86af964 100644
> --- v3.9-rc3.orig/fs/binfmt_elf.c
> +++ v3.9-rc3/fs/binfmt_elf.c
> @@ -1137,6 +1137,7 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
>   			goto whole;
>   		if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE))
>   			goto whole;
> +		return 0;
>   	}
>   
>   	/* Do not dump I/O mapped devices or special mappings */
> 

Thanks for splitting this fix. Now it's easier to keep track of this fix.

Reviewed-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>

Thanks.
HATAYAMA, Daisuke



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)
  2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi
  2013-04-03 19:20   ` Rik van Riel
@ 2013-04-05 18:32   ` KOSAKI Motohiro
  2013-04-05 18:40     ` Naoya Horiguchi
  1 sibling, 1 reply; 18+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 18:32 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko,
	HATAYAMA Daisuke, linux-mm, linux-kernel, kosaki.motohiro

(4/3/13 2:35 PM), Naoya Horiguchi wrote:
> Currently we fail to include any data on hugepages into coredump,
> because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently
> introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and
> mm->reserved_vm counter". This looks to me a serious regression,
> so let's fix it.

I don't think this is enough explanations. Let's explain the code meaning
time to time order.

First, we had no madvice(DONTDUMP) nor coredump_filter(HUGETLB). then hugetlb
pages were never dumped.

Second, I added coredump_filter(HUGETLB). and then vm_dump_size became..

vm_dump_size()
{
	/* Hugetlb memory check */
	if (vma->vm_flags & VM_HUGETLB) {
		..
		goto whole;
	}
	if (vma->vm_flags & VM_RESERVED)
		return 0;

The point is, hugetlb was checked before VM_RESERVED. i.e. hugetlb core dump ignored
VM_RESERVED. At this time, "if (vma->vm_flags & VM_HUGETLB)" statement don't need
return 0 because VM_RESERVED prevented to go into the subsequent flag checks.

Third, Jason added madvise(DONTDUMP). then vm_dump_size became...

vm_dump_size()
{
       if (vma->vm_flags & VM_NODUMP)
               return 0;

	/* Hugetlb memory check */
	if (vma->vm_flags & VM_HUGETLB) {
		..
		goto whole;
	}
	if (vma->vm_flags & VM_RESERVED)
		return 0;

Look, VM_NODUMP and VM_RESERVED had similar and different meanings at this time.

Finally, Konstantin removed VM_RESERVED and hugetlb coredump behavior
has been changed.

Thus, patch [1/3] and [2/3] should be marked [stable for v3.6 or later].

Anyway, this patch is correct. Thank you!

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size()
  2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi
                     ` (2 preceding siblings ...)
  2013-04-04  0:42   ` HATAYAMA Daisuke
@ 2013-04-05 18:33   ` KOSAKI Motohiro
  3 siblings, 0 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 18:33 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko,
	HATAYAMA Daisuke, linux-mm, linux-kernel, kosaki.motohiro

(4/3/13 2:35 PM), Naoya Horiguchi wrote:
> Documentation/filesystems/proc.txt says about coredump_filter bitmask,
> 
>   Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
>   effected by bit 5-6.
> 
> However current code can go into the subsequent flag checks of bit 0-4
> for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
> as written in the document.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: stable@vger.kernel.org

If I were you, I merge this patch into [1/3] because both patches treat the same
regression. but it is no big matter.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)
  2013-04-05 18:32   ` KOSAKI Motohiro
@ 2013-04-05 18:40     ` Naoya Horiguchi
  0 siblings, 0 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2013-04-05 18:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko,
	HATAYAMA Daisuke, linux-mm, linux-kernel

On Fri, Apr 05, 2013 at 02:32:08PM -0400, KOSAKI Motohiro wrote:
> (4/3/13 2:35 PM), Naoya Horiguchi wrote:
> > Currently we fail to include any data on hugepages into coredump,
> > because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently
> > introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and
> > mm->reserved_vm counter". This looks to me a serious regression,
> > so let's fix it.
> 
> I don't think this is enough explanations. Let's explain the code meaning
> time to time order.
> 
> First, we had no madvice(DONTDUMP) nor coredump_filter(HUGETLB). then hugetlb
> pages were never dumped.
> 
> Second, I added coredump_filter(HUGETLB). and then vm_dump_size became..
> 
> vm_dump_size()
> {
> 	/* Hugetlb memory check */
> 	if (vma->vm_flags & VM_HUGETLB) {
> 		..
> 		goto whole;
> 	}
> 	if (vma->vm_flags & VM_RESERVED)
> 		return 0;
> 
> The point is, hugetlb was checked before VM_RESERVED. i.e. hugetlb core dump ignored
> VM_RESERVED. At this time, "if (vma->vm_flags & VM_HUGETLB)" statement don't need
> return 0 because VM_RESERVED prevented to go into the subsequent flag checks.
> 
> Third, Jason added madvise(DONTDUMP). then vm_dump_size became...
> 
> vm_dump_size()
> {
>        if (vma->vm_flags & VM_NODUMP)
>                return 0;
> 
> 	/* Hugetlb memory check */
> 	if (vma->vm_flags & VM_HUGETLB) {
> 		..
> 		goto whole;
> 	}
> 	if (vma->vm_flags & VM_RESERVED)
> 		return 0;
> 
> Look, VM_NODUMP and VM_RESERVED had similar and different meanings at this time.
> 
> Finally, Konstantin removed VM_RESERVED and hugetlb coredump behavior
> has been changed.

Things were/are quite complicated, and this is a wonderful step-by-step explanation.
Thank you very much.

Naoya

> Thus, patch [1/3] and [2/3] should be marked [stable for v3.6 or later].
> 
> Anyway, this patch is correct. Thank you!
> 
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
  2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi
  2013-04-03 20:01   ` Rik van Riel
@ 2013-04-05 18:59   ` KOSAKI Motohiro
  2013-04-08 19:27     ` Naoya Horiguchi
  1 sibling, 1 reply; 18+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 18:59 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko,
	HATAYAMA Daisuke, linux-mm, linux-kernel, kosaki.motohiro

(4/3/13 2:35 PM), Naoya Horiguchi wrote:
> With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
> initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
> error happens on a hugepage and the affected processes try to access
> the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0)
> in get_page().
> 
> The reason for this bug is that coredump-related code doesn't recognise
> "hugepage hwpoison entry" with which a pmd entry is replaced when a memory
> error occurs on a hugepage.
> In other words, physical address information is stored in different bit layout
> between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page()
> which is called in get_dump_page() returns a wrong page from a given address.
> 
> We need to filter out only hwpoison hugepages to have data on healthy
> hugepages in coredump. So this patch makes follow_hugetlb_page() avoid
> trying to get page when a pmd is in swap entry like format.
> 
> ChangeLog v3:
>  - add comment about using is_swap_pte()
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> Cc: stable@vger.kernel.org
> ---
>  mm/hugetlb.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index 0d1705b..3bc20bd 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * Some archs (sparc64, sh*) have multiple pte_ts to
>  		 * each hugepage.  We have to make sure we get the
>  		 * first, for the page indexing below to work.
> +		 *
> +		 * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
> +		 * and hugepages under migration in which case
> +		 * hugetlb_fault waits for the migration and bails out
> +		 * properly for HWPosined pages.
>  		 */
>  		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
> -		absent = !pte || huge_pte_none(huge_ptep_get(pte));
> +		absent = !pte || huge_pte_none(huge_ptep_get(pte)) ||
> +			is_swap_pte(huge_ptep_get(pte));

Hmmm...

Now absent has two meanings. 1) skip hugetlb_fault() and return immediately if FOLL_DUMP is used.
2) call hugetlb_fault() if to be need page population or cow.

The description of this patch only explain about (2). and I'm not convinced why we don't need to
dump pages under migraion.













^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
  2013-04-05 18:59   ` KOSAKI Motohiro
@ 2013-04-08 19:27     ` Naoya Horiguchi
  2013-04-08 20:57       ` KOSAKI Motohiro
  0 siblings, 1 reply; 18+ messages in thread
From: Naoya Horiguchi @ 2013-04-08 19:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko,
	HATAYAMA Daisuke, linux-mm, linux-kernel

On Fri, Apr 05, 2013 at 02:59:43PM -0400, KOSAKI Motohiro wrote:
> (4/3/13 2:35 PM), Naoya Horiguchi wrote:
> > With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
> > initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
> > error happens on a hugepage and the affected processes try to access
> > the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0)
> > in get_page().
> > 
> > The reason for this bug is that coredump-related code doesn't recognise
> > "hugepage hwpoison entry" with which a pmd entry is replaced when a memory
> > error occurs on a hugepage.
> > In other words, physical address information is stored in different bit layout
> > between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page()
> > which is called in get_dump_page() returns a wrong page from a given address.
> > 
> > We need to filter out only hwpoison hugepages to have data on healthy
> > hugepages in coredump. So this patch makes follow_hugetlb_page() avoid
> > trying to get page when a pmd is in swap entry like format.
> > 
> > ChangeLog v3:
> >  - add comment about using is_swap_pte()
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Reviewed-by: Michal Hocko <mhocko@suse.cz>
> > Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> > Cc: stable@vger.kernel.org
> > ---
> >  mm/hugetlb.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> > index 0d1705b..3bc20bd 100644
> > --- v3.9-rc3.orig/mm/hugetlb.c
> > +++ v3.9-rc3/mm/hugetlb.c
> > @@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		 * Some archs (sparc64, sh*) have multiple pte_ts to
> >  		 * each hugepage.  We have to make sure we get the
> >  		 * first, for the page indexing below to work.
> > +		 *
> > +		 * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
> > +		 * and hugepages under migration in which case
> > +		 * hugetlb_fault waits for the migration and bails out
> > +		 * properly for HWPosined pages.
> >  		 */
> >  		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
> > -		absent = !pte || huge_pte_none(huge_ptep_get(pte));
> > +		absent = !pte || huge_pte_none(huge_ptep_get(pte)) ||
> > +			is_swap_pte(huge_ptep_get(pte));
> 
> Hmmm...
> 
> Now absent has two meanings. 1) skip hugetlb_fault() and return immediately if FOLL_DUMP is used.
> 2) call hugetlb_fault() if to be need page population or cow.
> 
> The description of this patch only explain about (2). and I'm not convinced why we don't need to
> dump pages under migraion.

We can/should dump hugepages under migration, and to do that we have to
put is_swap_pte() in the check of the hugetlb_falut block.
I updated this patch like below.

# I suspended Reviewed and Acked given for the previous version, because
# it has a non-minor change. If you want to restore it, please let me know.

Thanks,
Naoya
------------------------------------------------------------------------
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Thu, 28 Mar 2013 10:17:38 -0400
Subject: [PATCH v4] hugetlbfs: add swap entry check in follow_hugetlb_page()

With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
error happens on a hugepage and the affected processes try to access
the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0)
in get_page().

The reason for this bug is that coredump-related code doesn't recognise
"hugepage hwpoison entry" with which a pmd entry is replaced when a memory
error occurs on a hugepage.
In other words, physical address information is stored in different bit layout
between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page()
which is called in get_dump_page() returns a wrong page from a given address.

The expected behavior is like this:

  absent    is_swap_pte    FOLL_DUMP   Expected behavior
  ---------------------------------------------------------------------
   true      false          false      hugetlb_fault
   false     true           false      hugetlb_fault
   false     false          false      return page
   true      false          true       skip page (to avoid allocation)
   false     true           true       hugetlb_fault
   false     false          true       return page

With this patch, we can call hugetlb_fault() and take proper actions
(we can wait for migration entries, fail with VM_FAULT_HWPOISON_LARGE for
hwpoisoned entries,) and as the result we can dump all hugepages except
for hwpoisoned ones.

ChangeLog v4:
 - move is_swap_page() to right place.

ChangeLog v3:
 - add comment about using is_swap_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: stable@vger.kernel.org
---
 mm/hugetlb.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0d1705b..f155e59 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2983,7 +2983,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			break;
 		}
 
-		if (absent ||
+		/*
+		 * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
+		 * and hugepages under migration in which case
+		 * hugetlb_fault waits for the migration and bails out
+		 * properly for HWPosined pages.
+		 */
+		if (absent || is_swap_pte(huge_ptep_get(pte)) ||
 		    ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
 			int ret;
 
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
  2013-04-08 19:27     ` Naoya Horiguchi
@ 2013-04-08 20:57       ` KOSAKI Motohiro
  2013-04-09 22:00         ` Naoya Horiguchi
  0 siblings, 1 reply; 18+ messages in thread
From: KOSAKI Motohiro @ 2013-04-08 20:57 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	LKML

> -               if (absent ||
> +               /*
> +                * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
> +                * and hugepages under migration in which case
> +                * hugetlb_fault waits for the migration and bails out
> +                * properly for HWPosined pages.
> +                */
> +               if (absent || is_swap_pte(huge_ptep_get(pte)) ||
>                     ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
>                         int ret;

Your comment describe what the code is. However we want the comment describe
why. In migration case, calling hugetlb_fault() is natural. but in
hwpoison case, it is
needed more explanation. Why can't we call is_hugetlb_hwpoisoned() directly?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
  2013-04-08 20:57       ` KOSAKI Motohiro
@ 2013-04-09 22:00         ` Naoya Horiguchi
  2013-04-10  1:53           ` KOSAKI Motohiro
  2013-04-10  8:07           ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2013-04-09 22:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	LKML

On Mon, Apr 08, 2013 at 04:57:44PM -0400, KOSAKI Motohiro wrote:
> > -               if (absent ||
> > +               /*
> > +                * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
> > +                * and hugepages under migration in which case
> > +                * hugetlb_fault waits for the migration and bails out
> > +                * properly for HWPosined pages.
> > +                */
> > +               if (absent || is_swap_pte(huge_ptep_get(pte)) ||
> >                     ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
> >                         int ret;
> 
> Your comment describe what the code is. However we want the comment describe
> why. In migration case, calling hugetlb_fault() is natural. but in
> hwpoison case, it is
> needed more explanation.

We should call hugetlb_fault() when we encounter any kind of swap
type entry. It's consistent with handling of normal pages.

> Why can't we call is_hugetlb_hwpoisoned() directly?

We can use it, but I like to make code simple.

I rewrite the comment here, how about this?

-		if (absent ||
+		/*
+		 * We need call hugetlb_fault for both hugepages under migration
+		 * (in which case hugetlb_fault waits for the migration,) and
+		 * hwpoisoned hugepages (in which case we need to prevent the
+		 * caller from accessing to them.) In order to do this, we use
+		 * here is_swap_pte instead of is_hugetlb_entry_migration and
+		 * is_hugetlb_entry_hwpoisoned. This is because it simply covers
+		 * both cases, and because we can't follow correct pages directly
+		 * from any kind of swap entries.
+		 */
+		if (absent || is_swap_pte(huge_ptep_get(pte)) ||
 		    ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
 			int ret;

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
  2013-04-09 22:00         ` Naoya Horiguchi
@ 2013-04-10  1:53           ` KOSAKI Motohiro
  2013-04-10  8:07           ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2013-04-10  1:53 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel,
	Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm,
	LKML

> I rewrite the comment here, how about this?
>
> -               if (absent ||
> +               /*
> +                * We need call hugetlb_fault for both hugepages under migration
> +                * (in which case hugetlb_fault waits for the migration,) and
> +                * hwpoisoned hugepages (in which case we need to prevent the
> +                * caller from accessing to them.) In order to do this, we use
> +                * here is_swap_pte instead of is_hugetlb_entry_migration and
> +                * is_hugetlb_entry_hwpoisoned. This is because it simply covers
> +                * both cases, and because we can't follow correct pages directly
> +                * from any kind of swap entries.
> +                */
> +               if (absent || is_swap_pte(huge_ptep_get(pte)) ||
>                     ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
>                         int ret;

Looks ok to me.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
  2013-04-09 22:00         ` Naoya Horiguchi
  2013-04-10  1:53           ` KOSAKI Motohiro
@ 2013-04-10  8:07           ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2013-04-10  8:07 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: KOSAKI Motohiro, Andrew Morton, Mel Gorman, Hugh Dickins,
	Rik van Riel, Konstantin Khlebnikov, HATAYAMA Daisuke, linux-mm,
	LKML

On Tue 09-04-13 18:00:34, Naoya Horiguchi wrote:
[...]
> I rewrite the comment here, how about this?
> 
> -		if (absent ||
> +		/*
> +		 * We need call hugetlb_fault for both hugepages under migration
> +		 * (in which case hugetlb_fault waits for the migration,) and
> +		 * hwpoisoned hugepages (in which case we need to prevent the
> +		 * caller from accessing to them.) In order to do this, we use
> +		 * here is_swap_pte instead of is_hugetlb_entry_migration and
> +		 * is_hugetlb_entry_hwpoisoned. This is because it simply covers
> +		 * both cases, and because we can't follow correct pages directly
> +		 * from any kind of swap entries.
> +		 */
> +		if (absent || is_swap_pte(huge_ptep_get(pte)) ||
>  		    ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
>  			int ret;

OK, thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2013-04-10  8:07 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-03 18:35 [PATCH v3 0/3] fix hugepage coredump Naoya Horiguchi
2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi
2013-04-03 19:20   ` Rik van Riel
2013-04-05 18:32   ` KOSAKI Motohiro
2013-04-05 18:40     ` Naoya Horiguchi
2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi
2013-04-03 19:22   ` Rik van Riel
2013-04-03 21:14   ` Michal Hocko
2013-04-04  0:42   ` HATAYAMA Daisuke
2013-04-05 18:33   ` KOSAKI Motohiro
2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi
2013-04-03 20:01   ` Rik van Riel
2013-04-05 18:59   ` KOSAKI Motohiro
2013-04-08 19:27     ` Naoya Horiguchi
2013-04-08 20:57       ` KOSAKI Motohiro
2013-04-09 22:00         ` Naoya Horiguchi
2013-04-10  1:53           ` KOSAKI Motohiro
2013-04-10  8:07           ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).