From: Martin Schwidefsky <schwidefsky@de.ibm.com>
To: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>,
	kexec@lists.infradead.org,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Jan Willeke <willeke@de.ibm.com>,
	linux-kernel@vger.kernel.org, Vivek Goyal <vgoyal@redhat.com>
Subject: Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()
Date: Mon, 15 Jul 2013 11:21:09 +0200	[thread overview]
Message-ID: <20130715112109.5eaa4dd9@mschwide> (raw)
In-Reply-To: <51E028AA.7070203@jp.fujitsu.com>

On Sat, 13 Jul 2013 01:02:50 +0900
HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:

> (2013/07/10 20:00), Michael Holzheu wrote:
> > On Wed, 10 Jul 2013 18:50:18 +0900
> > HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
> >
> > [snip]
> >
> >> (2013/07/10 17:42), Michael Holzheu wrote:
> >>> My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the same
> >>> effect as your suggestion for all architectures other than s390. And for s390 we
> >>> take the risk that a programming error would result in poor /proc/vmcore
> >>> performance.
> >>>
> >>
> >> If you want to avoid looking up vmcore_list, which takes time linear in the number
> >> of its elements, you can still calculate the range of offsets in /proc/vmcore
> >> corresponding to the HSA during /proc/vmcore initialization.
> >>
> >> Also, could you tell me how often and how much of the HSA region is read during crash
> >> dumping? I guess the reads from the HSA happen mainly during the early part of the
> >> crash dumping process. According to the code, it is at most 64 MiB. So I feel
> >> performance is not a big issue.
> >
> > Currently it is 32 MiB and normally it is read only once.
> >
> >>
> >> Also, the cost of WARN_ONCE() is only one memory access on the 2nd and later calls.
> >> I don't think that is too much overhead...
> >
> > I was more concerned about in_valid_fault_range(). But I was most concerned about the
> > additional interface that introduces more complexity to the code, and that just to
> > implement a sanity check that in our opinion we don't really need.
> >
> > And what makes it even worse:
> >
> 
> Your view that the sanity check is unnecessary is completely wrong. The design assumes that
> page faults only ever happen on the HSA region. If a page fault happens on any other part
> of the mmap()ed region, it means that somehow no page table entry has been created for that
> address. That is a bug, possibly caused by mmap() itself, by the page table creation, by
> other components in the kernel, by a bit flip due to broken hardware, etc. In any case, the
> program cannot detect what kind of bug has occurred. There is no guarantee that the program
> keeps running safely, and that of course applies to the page cache creation as well. We
> cannot and must not expect such a buggy process, which is already in an invalid state, to
> behave as designed. The result is undefined behaviour. The only thing we can do is kill the
> buggy process as soon as possible.

I don't quite get this point, so please bear with me. If you compare the situation before
and after the introduction of the fault handler, the possible error scenarios seem almost
identical to me:
1) If an access is made outside of the mapped memory region, the first-level fault handler
   (do_exception for s390, __do_page_fault for x86) won't find a vma and will force a SIGSEGV
   right away, independent of the existence of a hole and of the vmcore fault handler.
2) If there is a hardware bug that corrupts a page table, the behaviour depends on how the
   entries are corrupted. If the outcome is a valid pte, an incorrect memory area will be
   accessed, the same with or without the vmcore fault handler. If the corrupted pte is
   invalid, it can come out as a swap pte, a file pte, or an empty pte. The behaviour
   does not change for swap and file ptes; you will get funny results in both cases.
   For empty ptes the new behaviour will call the vmcore fault handler for the address
   in question. If the read() function can satisfy the request, we will get a page cache
   copy of the missing page; if it cannot, it returns an error which is translated to a
   SIGBUS (see the sketch after this list).
   This new behaviour is IMHO better than the old one; it successfully recovers from at
   least one type of corruption. For x86 that would be the case if the page table is
   overwritten with zeroes; for s390 a specific bit pattern in the pte is required.
3) In the case of a programming error with regard to remap_pfn_range(), the new behaviour
   will provide page cache copies and the dump will technically be correct. Performance
   might suffer a little because the CPU has to create the page cache copies, but compared
   to the I/O involved in writing a dump this is negligible, no?
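
As a rough illustration of the fallback path described in 2) and 3), a simplified sketch of
such a fault handler could look like the following. This is not the code from the patch; it
assumes a __read_vmcore()-style helper for the regular /proc/vmcore read path and trims the
error handling down to the essentials:

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Regular /proc/vmcore read path; name and signature assumed for this sketch. */
ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos, int userbuf);

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t index = vmf->pgoff;
	struct page *page;
	loff_t offset;
	ssize_t rc;

	/* Find or allocate the page cache page backing this file offset. */
	page = find_or_create_page(mapping, index, GFP_KERNEL);
	if (!page)
		return VM_FAULT_OOM;

	if (!PageUptodate(page)) {
		/*
		 * Fill the page via the regular read path, e.g. by copying
		 * from the HSA on s390.  page_address() is sufficient here
		 * since the architectures in question have no highmem.
		 */
		offset = (loff_t) index << PAGE_SHIFT;
		rc = __read_vmcore(page_address(page), PAGE_SIZE, &offset, 0);
		if (rc < 0) {
			/* The read path cannot satisfy the request -> SIGBUS. */
			unlock_page(page);
			page_cache_release(page);
			return VM_FAULT_SIGBUS;
		}
		SetPageUptodate(page);
	}
	unlock_page(page);
	vmf->page = page;	/* the page cache copy mentioned above */
	return 0;
}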

It seems to me that the warning message you want to see in the fault handler would be
a debugging aid for the developer, to check that the mmap() and remap_pfn_range() calls
match up. Something similar to a single VM_WARN_ON() message would be appropriate, no?
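
As a rough illustration only, such a debugging aid might boil down to a single check against
an offset range computed once during /proc/vmcore initialization. The in_valid_fault_range()
helper is the one discussed above; its implementation and the variable names here are guesses,
not taken from any of the patches:

/*
 * Range of offsets in /proc/vmcore that is expected to fault, e.g. the part
 * backed by the HSA on s390; filled in during /proc/vmcore initialization.
 */
static loff_t fault_range_start, fault_range_end;

static bool in_valid_fault_range(pgoff_t pgoff)
{
	loff_t offset = (loff_t) pgoff << PAGE_SHIFT;

	return offset >= fault_range_start && offset < fault_range_end;
}

	/* At the top of the fault handler, purely as a debugging aid: */
	WARN_ONCE(!in_valid_fault_range(vmf->pgoff),
		  "vmcore: page fault outside the range expected to fault\n");

On architectures other than s390 the valid range would simply stay empty, so any fault at all
would trigger the warning, which is close to the #ifndef CONFIG_S390 variant proposed earlier
in this thread, while on s390 only faults outside the HSA-backed part would warn.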

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


Thread overview:
2013-07-01 19:32 [PATCH v6 0/5] kdump: Allow ELF header creation in new kernel Michael Holzheu
2013-07-01 19:32 ` [PATCH v6 1/5] vmcore: Introduce ELF header in new memory feature Michael Holzheu
2013-07-02 15:27   ` Vivek Goyal
2013-07-01 19:32 ` [PATCH v6 2/5] s390/vmcore: Use " Michael Holzheu
2013-07-02 16:23   ` Vivek Goyal
2013-07-03  7:59     ` Michael Holzheu
2013-07-03 14:15       ` Vivek Goyal
2013-07-03 14:39         ` Michael Holzheu
2013-07-03 14:50           ` Vivek Goyal
2013-07-01 19:32 ` [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range() Michael Holzheu
2013-07-02 15:42   ` Vivek Goyal
2013-07-03 13:59     ` Michael Holzheu
2013-07-03 14:16       ` Vivek Goyal
2013-07-15 13:44     ` Michael Holzheu
2013-07-15 14:27       ` Vivek Goyal
2013-07-16  9:25         ` Michael Holzheu
2013-07-16 14:04           ` Vivek Goyal
2013-07-16 15:37             ` Michael Holzheu
2013-07-16 15:55               ` Vivek Goyal
2013-07-08  5:32   ` HATAYAMA Daisuke
2013-07-08  9:28     ` Michael Holzheu
2013-07-08 14:28       ` Vivek Goyal
2013-07-09  5:49         ` HATAYAMA Daisuke
2013-07-10  8:42           ` Michael Holzheu
2013-07-10  9:50             ` HATAYAMA Daisuke
2013-07-10 11:00               ` Michael Holzheu
2013-07-12 16:02                 ` HATAYAMA Daisuke
2013-07-15  9:21                   ` Martin Schwidefsky [this message]
2013-07-16  0:51                     ` HATAYAMA Daisuke
2013-07-10 14:33               ` Vivek Goyal
2013-07-12 11:05                 ` HATAYAMA Daisuke
2013-07-15 14:20                   ` Vivek Goyal
2013-07-16  0:27                     ` HATAYAMA Daisuke
2013-07-16  9:40                       ` HATAYAMA Daisuke
2013-07-09  5:31       ` HATAYAMA Daisuke
2013-07-01 19:32 ` [PATCH v6 4/5] s390/vmcore: Implement remap_oldmem_pfn_range for s390 Michael Holzheu
2013-07-01 19:32 ` [PATCH v6 5/5] s390/vmcore: Use vmcore for zfcpdump Michael Holzheu
