* [PATCH v3 0/3] fix hugepage coredump @ 2013-04-03 18:35 Naoya Horiguchi 2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Naoya Horiguchi @ 2013-04-03 18:35 UTC (permalink / raw) To: Andrew Morton Cc: Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel, Naoya Horiguchi Hi, Here is 3nd version of hugepage coredump fix. I added some minor changes based on Hatayama-san's and Michal's comments. Thank you very much. Naoya Horiguchi ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) 2013-04-03 18:35 [PATCH v3 0/3] fix hugepage coredump Naoya Horiguchi @ 2013-04-03 18:35 ` Naoya Horiguchi 2013-04-03 19:20 ` Rik van Riel 2013-04-05 18:32 ` KOSAKI Motohiro 2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi 2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi 2 siblings, 2 replies; 18+ messages in thread From: Naoya Horiguchi @ 2013-04-03 18:35 UTC (permalink / raw) To: Andrew Morton Cc: Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel, Naoya Horiguchi Currently we fail to include any data on hugepages into coredump, because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and mm->reserved_vm counter". This looks to me a serious regression, so let's fix it. ChangeLog v3: - move 'return 0' into a separate patch ChangeLog v2: - add 'return 0' in hugepage memory check Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: stable@vger.kernel.org --- fs/hugetlbfs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git v3.9-rc3.orig/fs/hugetlbfs/inode.c v3.9-rc3/fs/hugetlbfs/inode.c index 84e3d85..523464e 100644 --- v3.9-rc3.orig/fs/hugetlbfs/inode.c +++ v3.9-rc3/fs/hugetlbfs/inode.c @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) * way when do_mmap_pgoff unwinds (may be important on powerpc * and ia64). */ - vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND | VM_DONTDUMP; + vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND; vma->vm_ops = &hugetlb_vm_ops; if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT)) -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) 2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi @ 2013-04-03 19:20 ` Rik van Riel 2013-04-05 18:32 ` KOSAKI Motohiro 1 sibling, 0 replies; 18+ messages in thread From: Rik van Riel @ 2013-04-03 19:20 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel On 04/03/2013 02:35 PM, Naoya Horiguchi wrote: > Currently we fail to include any data on hugepages into coredump, > because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently > introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and > mm->reserved_vm counter". This looks to me a serious regression, > so let's fix it. > > ChangeLog v3: > - move 'return 0' into a separate patch > > ChangeLog v2: > - add 'return 0' in hugepage memory check > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> > Acked-by: Michal Hocko <mhocko@suse.cz> > Cc: stable@vger.kernel.org Reviewed-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) 2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi 2013-04-03 19:20 ` Rik van Riel @ 2013-04-05 18:32 ` KOSAKI Motohiro 2013-04-05 18:40 ` Naoya Horiguchi 1 sibling, 1 reply; 18+ messages in thread From: KOSAKI Motohiro @ 2013-04-05 18:32 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel, kosaki.motohiro (4/3/13 2:35 PM), Naoya Horiguchi wrote: > Currently we fail to include any data on hugepages into coredump, > because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently > introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and > mm->reserved_vm counter". This looks to me a serious regression, > so let's fix it. I don't think this is enough explanations. Let's explain the code meaning time to time order. First, we had no madvice(DONTDUMP) nor coredump_filter(HUGETLB). then hugetlb pages were never dumped. Second, I added coredump_filter(HUGETLB). and then vm_dump_size became.. vm_dump_size() { /* Hugetlb memory check */ if (vma->vm_flags & VM_HUGETLB) { .. goto whole; } if (vma->vm_flags & VM_RESERVED) return 0; The point is, hugetlb was checked before VM_RESERVED. i.e. hugetlb core dump ignored VM_RESERVED. At this time, "if (vma->vm_flags & VM_HUGETLB)" statement don't need return 0 because VM_RESERVED prevented to go into the subsequent flag checks. Third, Jason added madvise(DONTDUMP). then vm_dump_size became... vm_dump_size() { if (vma->vm_flags & VM_NODUMP) return 0; /* Hugetlb memory check */ if (vma->vm_flags & VM_HUGETLB) { .. goto whole; } if (vma->vm_flags & VM_RESERVED) return 0; Look, VM_NODUMP and VM_RESERVED had similar and different meanings at this time. Finally, Konstantin removed VM_RESERVED and hugetlb coredump behavior has been changed. Thus, patch [1/3] and [2/3] should be marked [stable for v3.6 or later]. Anyway, this patch is correct. Thank you! Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) 2013-04-05 18:32 ` KOSAKI Motohiro @ 2013-04-05 18:40 ` Naoya Horiguchi 0 siblings, 0 replies; 18+ messages in thread From: Naoya Horiguchi @ 2013-04-05 18:40 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel On Fri, Apr 05, 2013 at 02:32:08PM -0400, KOSAKI Motohiro wrote: > (4/3/13 2:35 PM), Naoya Horiguchi wrote: > > Currently we fail to include any data on hugepages into coredump, > > because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently > > introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and > > mm->reserved_vm counter". This looks to me a serious regression, > > so let's fix it. > > I don't think this is enough explanations. Let's explain the code meaning > time to time order. > > First, we had no madvice(DONTDUMP) nor coredump_filter(HUGETLB). then hugetlb > pages were never dumped. > > Second, I added coredump_filter(HUGETLB). and then vm_dump_size became.. > > vm_dump_size() > { > /* Hugetlb memory check */ > if (vma->vm_flags & VM_HUGETLB) { > .. > goto whole; > } > if (vma->vm_flags & VM_RESERVED) > return 0; > > The point is, hugetlb was checked before VM_RESERVED. i.e. hugetlb core dump ignored > VM_RESERVED. At this time, "if (vma->vm_flags & VM_HUGETLB)" statement don't need > return 0 because VM_RESERVED prevented to go into the subsequent flag checks. > > Third, Jason added madvise(DONTDUMP). then vm_dump_size became... > > vm_dump_size() > { > if (vma->vm_flags & VM_NODUMP) > return 0; > > /* Hugetlb memory check */ > if (vma->vm_flags & VM_HUGETLB) { > .. > goto whole; > } > if (vma->vm_flags & VM_RESERVED) > return 0; > > Look, VM_NODUMP and VM_RESERVED had similar and different meanings at this time. > > Finally, Konstantin removed VM_RESERVED and hugetlb coredump behavior > has been changed. Things were/are quite complicated, and this is a wonderful step-by-step explanation. Thank you very much. Naoya > Thus, patch [1/3] and [2/3] should be marked [stable for v3.6 or later]. > > Anyway, this patch is correct. Thank you! > > Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() 2013-04-03 18:35 [PATCH v3 0/3] fix hugepage coredump Naoya Horiguchi 2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi @ 2013-04-03 18:35 ` Naoya Horiguchi 2013-04-03 19:22 ` Rik van Riel ` (3 more replies) 2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi 2 siblings, 4 replies; 18+ messages in thread From: Naoya Horiguchi @ 2013-04-03 18:35 UTC (permalink / raw) To: Andrew Morton Cc: Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel, Naoya Horiguchi Documentation/filesystems/proc.txt says about coredump_filter bitmask, Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only effected by bit 5-6. However current code can go into the subsequent flag checks of bit 0-4 for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work as written in the document. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: stable@vger.kernel.org --- fs/binfmt_elf.c | 1 + 1 file changed, 1 insertion(+) diff --git v3.9-rc3.orig/fs/binfmt_elf.c v3.9-rc3/fs/binfmt_elf.c index 3939829..86af964 100644 --- v3.9-rc3.orig/fs/binfmt_elf.c +++ v3.9-rc3/fs/binfmt_elf.c @@ -1137,6 +1137,7 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma, goto whole; if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE)) goto whole; + return 0; } /* Do not dump I/O mapped devices or special mappings */ -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() 2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi @ 2013-04-03 19:22 ` Rik van Riel 2013-04-03 21:14 ` Michal Hocko ` (2 subsequent siblings) 3 siblings, 0 replies; 18+ messages in thread From: Rik van Riel @ 2013-04-03 19:22 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel On 04/03/2013 02:35 PM, Naoya Horiguchi wrote: > Documentation/filesystems/proc.txt says about coredump_filter bitmask, > > Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only > effected by bit 5-6. > > However current code can go into the subsequent flag checks of bit 0-4 > for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work > as written in the document. > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Cc: stable@vger.kernel.org Reviewed-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() 2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi 2013-04-03 19:22 ` Rik van Riel @ 2013-04-03 21:14 ` Michal Hocko 2013-04-04 0:42 ` HATAYAMA Daisuke 2013-04-05 18:33 ` KOSAKI Motohiro 3 siblings, 0 replies; 18+ messages in thread From: Michal Hocko @ 2013-04-03 21:14 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, HATAYAMA Daisuke, linux-mm, linux-kernel On Wed 03-04-13 14:35:37, Naoya Horiguchi wrote: > Documentation/filesystems/proc.txt says about coredump_filter bitmask, > > Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only > effected by bit 5-6. > > However current code can go into the subsequent flag checks of bit 0-4 > for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work > as written in the document. > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Cc: stable@vger.kernel.org Just for the record. It should be stable for 3.7+ since (314e51b98) becuase then have lost VM_RESERVED check which used to stop hugetlb mappings. Acked-by: Michal Hocko <mhocko@suse.cz> > --- > fs/binfmt_elf.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git v3.9-rc3.orig/fs/binfmt_elf.c v3.9-rc3/fs/binfmt_elf.c > index 3939829..86af964 100644 > --- v3.9-rc3.orig/fs/binfmt_elf.c > +++ v3.9-rc3/fs/binfmt_elf.c > @@ -1137,6 +1137,7 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma, > goto whole; > if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE)) > goto whole; > + return 0; > } > > /* Do not dump I/O mapped devices or special mappings */ > -- > 1.7.11.7 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() 2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi 2013-04-03 19:22 ` Rik van Riel 2013-04-03 21:14 ` Michal Hocko @ 2013-04-04 0:42 ` HATAYAMA Daisuke 2013-04-05 18:33 ` KOSAKI Motohiro 3 siblings, 0 replies; 18+ messages in thread From: HATAYAMA Daisuke @ 2013-04-04 0:42 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, linux-mm, linux-kernel (2013/04/04 3:35), Naoya Horiguchi wrote: > Documentation/filesystems/proc.txt says about coredump_filter bitmask, > > Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only > effected by bit 5-6. > > However current code can go into the subsequent flag checks of bit 0-4 > for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work > as written in the document. > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Cc: stable@vger.kernel.org > --- > fs/binfmt_elf.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git v3.9-rc3.orig/fs/binfmt_elf.c v3.9-rc3/fs/binfmt_elf.c > index 3939829..86af964 100644 > --- v3.9-rc3.orig/fs/binfmt_elf.c > +++ v3.9-rc3/fs/binfmt_elf.c > @@ -1137,6 +1137,7 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma, > goto whole; > if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE)) > goto whole; > + return 0; > } > > /* Do not dump I/O mapped devices or special mappings */ > Thanks for splitting this fix. Now it's easier to keep track of this fix. Reviewed-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Thanks. HATAYAMA, Daisuke ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() 2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi ` (2 preceding siblings ...) 2013-04-04 0:42 ` HATAYAMA Daisuke @ 2013-04-05 18:33 ` KOSAKI Motohiro 3 siblings, 0 replies; 18+ messages in thread From: KOSAKI Motohiro @ 2013-04-05 18:33 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel, kosaki.motohiro (4/3/13 2:35 PM), Naoya Horiguchi wrote: > Documentation/filesystems/proc.txt says about coredump_filter bitmask, > > Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only > effected by bit 5-6. > > However current code can go into the subsequent flag checks of bit 0-4 > for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work > as written in the document. > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Cc: stable@vger.kernel.org If I were you, I merge this patch into [1/3] because both patches treat the same regression. but it is no big matter. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() 2013-04-03 18:35 [PATCH v3 0/3] fix hugepage coredump Naoya Horiguchi 2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi 2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi @ 2013-04-03 18:35 ` Naoya Horiguchi 2013-04-03 20:01 ` Rik van Riel 2013-04-05 18:59 ` KOSAKI Motohiro 2 siblings, 2 replies; 18+ messages in thread From: Naoya Horiguchi @ 2013-04-03 18:35 UTC (permalink / raw) To: Andrew Morton Cc: Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel, Naoya Horiguchi With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory error happens on a hugepage and the affected processes try to access the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) in get_page(). The reason for this bug is that coredump-related code doesn't recognise "hugepage hwpoison entry" with which a pmd entry is replaced when a memory error occurs on a hugepage. In other words, physical address information is stored in different bit layout between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page() which is called in get_dump_page() returns a wrong page from a given address. We need to filter out only hwpoison hugepages to have data on healthy hugepages in coredump. So this patch makes follow_hugetlb_page() avoid trying to get page when a pmd is in swap entry like format. ChangeLog v3: - add comment about using is_swap_pte() Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: stable@vger.kernel.org --- mm/hugetlb.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c index 0d1705b..3bc20bd 100644 --- v3.9-rc3.orig/mm/hugetlb.c +++ v3.9-rc3/mm/hugetlb.c @@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, * Some archs (sparc64, sh*) have multiple pte_ts to * each hugepage. We have to make sure we get the * first, for the page indexing below to work. + * + * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned + * and hugepages under migration in which case + * hugetlb_fault waits for the migration and bails out + * properly for HWPosined pages. */ pte = huge_pte_offset(mm, vaddr & huge_page_mask(h)); - absent = !pte || huge_pte_none(huge_ptep_get(pte)); + absent = !pte || huge_pte_none(huge_ptep_get(pte)) || + is_swap_pte(huge_ptep_get(pte)); /* * When coredumping, it suits get_dump_page if we just return -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() 2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi @ 2013-04-03 20:01 ` Rik van Riel 2013-04-05 18:59 ` KOSAKI Motohiro 1 sibling, 0 replies; 18+ messages in thread From: Rik van Riel @ 2013-04-03 20:01 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel On 04/03/2013 02:35 PM, Naoya Horiguchi wrote: > With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in > initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory > error happens on a hugepage and the affected processes try to access > the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) > in get_page(). > > The reason for this bug is that coredump-related code doesn't recognise > "hugepage hwpoison entry" with which a pmd entry is replaced when a memory > error occurs on a hugepage. > In other words, physical address information is stored in different bit layout > between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page() > which is called in get_dump_page() returns a wrong page from a given address. > > We need to filter out only hwpoison hugepages to have data on healthy > hugepages in coredump. So this patch makes follow_hugetlb_page() avoid > trying to get page when a pmd is in swap entry like format. > > ChangeLog v3: > - add comment about using is_swap_pte() > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Reviewed-by: Michal Hocko <mhocko@suse.cz> > Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> > Cc: stable@vger.kernel.org Reviewed-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() 2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi 2013-04-03 20:01 ` Rik van Riel @ 2013-04-05 18:59 ` KOSAKI Motohiro 2013-04-08 19:27 ` Naoya Horiguchi 1 sibling, 1 reply; 18+ messages in thread From: KOSAKI Motohiro @ 2013-04-05 18:59 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel, kosaki.motohiro (4/3/13 2:35 PM), Naoya Horiguchi wrote: > With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in > initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory > error happens on a hugepage and the affected processes try to access > the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) > in get_page(). > > The reason for this bug is that coredump-related code doesn't recognise > "hugepage hwpoison entry" with which a pmd entry is replaced when a memory > error occurs on a hugepage. > In other words, physical address information is stored in different bit layout > between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page() > which is called in get_dump_page() returns a wrong page from a given address. > > We need to filter out only hwpoison hugepages to have data on healthy > hugepages in coredump. So this patch makes follow_hugetlb_page() avoid > trying to get page when a pmd is in swap entry like format. > > ChangeLog v3: > - add comment about using is_swap_pte() > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Reviewed-by: Michal Hocko <mhocko@suse.cz> > Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> > Cc: stable@vger.kernel.org > --- > mm/hugetlb.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c > index 0d1705b..3bc20bd 100644 > --- v3.9-rc3.orig/mm/hugetlb.c > +++ v3.9-rc3/mm/hugetlb.c > @@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, > * Some archs (sparc64, sh*) have multiple pte_ts to > * each hugepage. We have to make sure we get the > * first, for the page indexing below to work. > + * > + * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned > + * and hugepages under migration in which case > + * hugetlb_fault waits for the migration and bails out > + * properly for HWPosined pages. > */ > pte = huge_pte_offset(mm, vaddr & huge_page_mask(h)); > - absent = !pte || huge_pte_none(huge_ptep_get(pte)); > + absent = !pte || huge_pte_none(huge_ptep_get(pte)) || > + is_swap_pte(huge_ptep_get(pte)); Hmmm... Now absent has two meanings. 1) skip hugetlb_fault() and return immediately if FOLL_DUMP is used. 2) call hugetlb_fault() if to be need page population or cow. The description of this patch only explain about (2). and I'm not convinced why we don't need to dump pages under migraion. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() 2013-04-05 18:59 ` KOSAKI Motohiro @ 2013-04-08 19:27 ` Naoya Horiguchi 2013-04-08 20:57 ` KOSAKI Motohiro 0 siblings, 1 reply; 18+ messages in thread From: Naoya Horiguchi @ 2013-04-08 19:27 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, KOSAKI Motohiro, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, linux-kernel On Fri, Apr 05, 2013 at 02:59:43PM -0400, KOSAKI Motohiro wrote: > (4/3/13 2:35 PM), Naoya Horiguchi wrote: > > With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in > > initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory > > error happens on a hugepage and the affected processes try to access > > the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) > > in get_page(). > > > > The reason for this bug is that coredump-related code doesn't recognise > > "hugepage hwpoison entry" with which a pmd entry is replaced when a memory > > error occurs on a hugepage. > > In other words, physical address information is stored in different bit layout > > between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page() > > which is called in get_dump_page() returns a wrong page from a given address. > > > > We need to filter out only hwpoison hugepages to have data on healthy > > hugepages in coredump. So this patch makes follow_hugetlb_page() avoid > > trying to get page when a pmd is in swap entry like format. > > > > ChangeLog v3: > > - add comment about using is_swap_pte() > > > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > > Reviewed-by: Michal Hocko <mhocko@suse.cz> > > Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> > > Cc: stable@vger.kernel.org > > --- > > mm/hugetlb.c | 8 +++++++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c > > index 0d1705b..3bc20bd 100644 > > --- v3.9-rc3.orig/mm/hugetlb.c > > +++ v3.9-rc3/mm/hugetlb.c > > @@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, > > * Some archs (sparc64, sh*) have multiple pte_ts to > > * each hugepage. We have to make sure we get the > > * first, for the page indexing below to work. > > + * > > + * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned > > + * and hugepages under migration in which case > > + * hugetlb_fault waits for the migration and bails out > > + * properly for HWPosined pages. > > */ > > pte = huge_pte_offset(mm, vaddr & huge_page_mask(h)); > > - absent = !pte || huge_pte_none(huge_ptep_get(pte)); > > + absent = !pte || huge_pte_none(huge_ptep_get(pte)) || > > + is_swap_pte(huge_ptep_get(pte)); > > Hmmm... > > Now absent has two meanings. 1) skip hugetlb_fault() and return immediately if FOLL_DUMP is used. > 2) call hugetlb_fault() if to be need page population or cow. > > The description of this patch only explain about (2). and I'm not convinced why we don't need to > dump pages under migraion. We can/should dump hugepages under migration, and to do that we have to put is_swap_pte() in the check of the hugetlb_falut block. I updated this patch like below. # I suspended Reviewed and Acked given for the previous version, because # it has a non-minor change. If you want to restore it, please let me know. Thanks, Naoya ------------------------------------------------------------------------ From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Date: Thu, 28 Mar 2013 10:17:38 -0400 Subject: [PATCH v4] hugetlbfs: add swap entry check in follow_hugetlb_page() With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory error happens on a hugepage and the affected processes try to access the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) in get_page(). The reason for this bug is that coredump-related code doesn't recognise "hugepage hwpoison entry" with which a pmd entry is replaced when a memory error occurs on a hugepage. In other words, physical address information is stored in different bit layout between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page() which is called in get_dump_page() returns a wrong page from a given address. The expected behavior is like this: absent is_swap_pte FOLL_DUMP Expected behavior --------------------------------------------------------------------- true false false hugetlb_fault false true false hugetlb_fault false false false return page true false true skip page (to avoid allocation) false true true hugetlb_fault false false true return page With this patch, we can call hugetlb_fault() and take proper actions (we can wait for migration entries, fail with VM_FAULT_HWPOISON_LARGE for hwpoisoned entries,) and as the result we can dump all hugepages except for hwpoisoned ones. ChangeLog v4: - move is_swap_page() to right place. ChangeLog v3: - add comment about using is_swap_pte() Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: stable@vger.kernel.org --- mm/hugetlb.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 0d1705b..f155e59 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2983,7 +2983,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, break; } - if (absent || + /* + * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned + * and hugepages under migration in which case + * hugetlb_fault waits for the migration and bails out + * properly for HWPosined pages. + */ + if (absent || is_swap_pte(huge_ptep_get(pte)) || ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) { int ret; -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() 2013-04-08 19:27 ` Naoya Horiguchi @ 2013-04-08 20:57 ` KOSAKI Motohiro 2013-04-09 22:00 ` Naoya Horiguchi 0 siblings, 1 reply; 18+ messages in thread From: KOSAKI Motohiro @ 2013-04-08 20:57 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, LKML > - if (absent || > + /* > + * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned > + * and hugepages under migration in which case > + * hugetlb_fault waits for the migration and bails out > + * properly for HWPosined pages. > + */ > + if (absent || is_swap_pte(huge_ptep_get(pte)) || > ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) { > int ret; Your comment describe what the code is. However we want the comment describe why. In migration case, calling hugetlb_fault() is natural. but in hwpoison case, it is needed more explanation. Why can't we call is_hugetlb_hwpoisoned() directly? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() 2013-04-08 20:57 ` KOSAKI Motohiro @ 2013-04-09 22:00 ` Naoya Horiguchi 2013-04-10 1:53 ` KOSAKI Motohiro 2013-04-10 8:07 ` Michal Hocko 0 siblings, 2 replies; 18+ messages in thread From: Naoya Horiguchi @ 2013-04-09 22:00 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, LKML On Mon, Apr 08, 2013 at 04:57:44PM -0400, KOSAKI Motohiro wrote: > > - if (absent || > > + /* > > + * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned > > + * and hugepages under migration in which case > > + * hugetlb_fault waits for the migration and bails out > > + * properly for HWPosined pages. > > + */ > > + if (absent || is_swap_pte(huge_ptep_get(pte)) || > > ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) { > > int ret; > > Your comment describe what the code is. However we want the comment describe > why. In migration case, calling hugetlb_fault() is natural. but in > hwpoison case, it is > needed more explanation. We should call hugetlb_fault() when we encounter any kind of swap type entry. It's consistent with handling of normal pages. > Why can't we call is_hugetlb_hwpoisoned() directly? We can use it, but I like to make code simple. I rewrite the comment here, how about this? - if (absent || + /* + * We need call hugetlb_fault for both hugepages under migration + * (in which case hugetlb_fault waits for the migration,) and + * hwpoisoned hugepages (in which case we need to prevent the + * caller from accessing to them.) In order to do this, we use + * here is_swap_pte instead of is_hugetlb_entry_migration and + * is_hugetlb_entry_hwpoisoned. This is because it simply covers + * both cases, and because we can't follow correct pages directly + * from any kind of swap entries. + */ + if (absent || is_swap_pte(huge_ptep_get(pte)) || ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) { int ret; Thanks, Naoya ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() 2013-04-09 22:00 ` Naoya Horiguchi @ 2013-04-10 1:53 ` KOSAKI Motohiro 2013-04-10 8:07 ` Michal Hocko 1 sibling, 0 replies; 18+ messages in thread From: KOSAKI Motohiro @ 2013-04-10 1:53 UTC (permalink / raw) To: Naoya Horiguchi Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, Konstantin Khlebnikov, Michal Hocko, HATAYAMA Daisuke, linux-mm, LKML > I rewrite the comment here, how about this? > > - if (absent || > + /* > + * We need call hugetlb_fault for both hugepages under migration > + * (in which case hugetlb_fault waits for the migration,) and > + * hwpoisoned hugepages (in which case we need to prevent the > + * caller from accessing to them.) In order to do this, we use > + * here is_swap_pte instead of is_hugetlb_entry_migration and > + * is_hugetlb_entry_hwpoisoned. This is because it simply covers > + * both cases, and because we can't follow correct pages directly > + * from any kind of swap entries. > + */ > + if (absent || is_swap_pte(huge_ptep_get(pte)) || > ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) { > int ret; Looks ok to me. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() 2013-04-09 22:00 ` Naoya Horiguchi 2013-04-10 1:53 ` KOSAKI Motohiro @ 2013-04-10 8:07 ` Michal Hocko 1 sibling, 0 replies; 18+ messages in thread From: Michal Hocko @ 2013-04-10 8:07 UTC (permalink / raw) To: Naoya Horiguchi Cc: KOSAKI Motohiro, Andrew Morton, Mel Gorman, Hugh Dickins, Rik van Riel, Konstantin Khlebnikov, HATAYAMA Daisuke, linux-mm, LKML On Tue 09-04-13 18:00:34, Naoya Horiguchi wrote: [...] > I rewrite the comment here, how about this? > > - if (absent || > + /* > + * We need call hugetlb_fault for both hugepages under migration > + * (in which case hugetlb_fault waits for the migration,) and > + * hwpoisoned hugepages (in which case we need to prevent the > + * caller from accessing to them.) In order to do this, we use > + * here is_swap_pte instead of is_hugetlb_entry_migration and > + * is_hugetlb_entry_hwpoisoned. This is because it simply covers > + * both cases, and because we can't follow correct pages directly > + * from any kind of swap entries. > + */ > + if (absent || is_swap_pte(huge_ptep_get(pte)) || > ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) { > int ret; OK, thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2013-04-10 8:07 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-04-03 18:35 [PATCH v3 0/3] fix hugepage coredump Naoya Horiguchi 2013-04-03 18:35 ` [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB) Naoya Horiguchi 2013-04-03 19:20 ` Rik van Riel 2013-04-05 18:32 ` KOSAKI Motohiro 2013-04-05 18:40 ` Naoya Horiguchi 2013-04-03 18:35 ` [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size() Naoya Horiguchi 2013-04-03 19:22 ` Rik van Riel 2013-04-03 21:14 ` Michal Hocko 2013-04-04 0:42 ` HATAYAMA Daisuke 2013-04-05 18:33 ` KOSAKI Motohiro 2013-04-03 18:35 ` [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page() Naoya Horiguchi 2013-04-03 20:01 ` Rik van Riel 2013-04-05 18:59 ` KOSAKI Motohiro 2013-04-08 19:27 ` Naoya Horiguchi 2013-04-08 20:57 ` KOSAKI Motohiro 2013-04-09 22:00 ` Naoya Horiguchi 2013-04-10 1:53 ` KOSAKI Motohiro 2013-04-10 8:07 ` Michal Hocko
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).