From: Dan Williams <dan.j.williams@intel.com>
To: Andy Lutomirski <luto@kernel.org>
Cc: "Luck, Tony" <tony.luck@intel.com>, Linus Torvalds <torvalds@linux-foundation.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Borislav Petkov <bp@alien8.de>, stable <stable@vger.kernel.org>, the arch/x86 maintainers <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>, Paul Mackerras <paulus@samba.org>, Benjamin Herrenschmidt <benh@kernel.crashing.org>, "Tsaur, Erwin" <erwin.tsaur@intel.com>, Michael Ellerman <mpe@ellerman.id.au>, Arnaldo Carvalho de Melo <acme@kernel.org>, linux-nvdimm <linux-nvdimm@lists.01.org>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()
Date: Mon, 4 May 2020 14:30:54 -0700
Message-ID: <CAPcyv4g9nTLTMjhQOJdu+v8n-Sc9L566KfnSjcz+0TS_Ge15Fw@mail.gmail.com>
In-Reply-To: <CALCETrVAsppM5kRz0HicAQ8o_x06=7Nd0q64sEre3MEShWPaLw@mail.gmail.com>

On Mon, May 4, 2020 at 1:26 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Mon, May 4, 2020 at 1:05 PM Luck, Tony <tony.luck@intel.com> wrote:
> >
> > > When a copy function hits a bad page and the page is not yet known to
> > > be bad, what does it do? (I.e. the page was believed to be fine but
> > > the copy function gets #MC.) Does it unmap it right away? What does
> > > it return?
> >
> > I suspect that we will only ever find a handful of situations where the
> > kernel can recover from memory that has gone bad that are worth fixing
> > (got to be some code path that touches a meaningful fraction of memory,
> > otherwise we get code complexity without any meaningful payoff).
> >
> > I don't think we'd want different actions for the cases of "we just found out
> > now that this page is bad" and "we got a notification an hour ago that this
> > page had gone bad". Currently we treat those the same for application
> > errors ... SIGBUS either way[1].
>
> Oh, I agree that the end result should be the same. I'm thinking more
> about the mechanism and the internal API. As a somewhat silly example
> of why there's a difference, the first time we try to read from bad
> memory, we can expect #MC (I assume, on a sensibly functioning
> platform). But, once we get the #MC, I imagine that the #MC handler
> will want to unmap the page to prevent a storm of additional #MC
> events on the same page -- given the awful x86 #MC design, too many
> all at once is fatal. So the next time we copy_mc_to_user() or
> whatever from the memory, we'll get #PF instead. Or maybe that #MC
> will defer the unmap?

After the consumption the PMEM driver arranges for the page to never
be mapped again via its "badblocks" list.

>
> So the point of my questions is that the overall design should be at
> least somewhat settled before anyone tries to review just the copy
> functions.

I would say that DAX / PMEM stretches the Linux memory error handling
model beyond what it was originally designed for. The primary concepts
that bend the assumptions of mm/memory-failure.c are:

1/ DAX pages can not be offlined via the page allocator.

2/ DAX pages (well, cachelines in those pages) can be asynchronously
marked poisoned by a platform or device patrol scrub facility.

3/ DAX pages might be repaired by writes.

Currently 1/ and 2/ are managed by a per-block-device "badblocks" list
that is populated by scrub results and also amended when #MC is raised
(see nfit_handle_mce()). When fs/dax.c services faults it will decline
to map the page if the physical file extent intersects a bad block.

There is also support for sending SIGBUS if userspace races the
scrubber to consume the badblock. However, that uses the standard
'struct page' error model and assumes that a file backed page is 1:1
mapped to a file. This requirement prevents filesystems from enabling
reflink.
That collision and the desire to enable reflink is why we are now
investigating supplanting the mm/memory-failure.c model. When the page
is "owned" by a filesystem, invoke the filesystem to handle the memory
error across all impacted files.

The presence of 3/ means that any action error handling takes to
disable access to the page needs to be capable of being undone, which
runs counter to the mm/memory-failure.c assumption that offlining is a
one-way trip.
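
[Editor's illustration] The badblocks behaviour described in the message (record a range when scrub or #MC reports it, decline to map an extent that intersects a recorded range, and undo the record when a write repairs the media) might be modelled roughly as below. This is a userspace sketch only, not the kernel's badblocks implementation: struct bad_range, struct bad_list, and the bad_list_*() helpers are hypothetical names, the fixed-size array is an arbitrary simplification, and a write is assumed to repair a range only when it covers that range completely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BAD 16 /* arbitrary capacity for this sketch */

/* Hypothetical bad range [start, start + len), in device sectors. */
struct bad_range {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical per-device list; initialise with count = 0. */
struct bad_list {
	struct bad_range r[MAX_BAD];
	size_t count;
};

/* Record a range reported by a scrub result or an #MC event. */
static bool bad_list_add(struct bad_list *bl, uint64_t start, uint64_t len)
{
	if (len == 0 || bl->count == MAX_BAD)
		return false;
	bl->r[bl->count].start = start;
	bl->r[bl->count].len = len;
	bl->count++;
	return true;
}

/*
 * Fault-time check: true if the extent overlaps any recorded range,
 * in which case the caller would decline to map it.
 * Assumes start + len does not overflow uint64_t.
 */
static bool bad_list_intersects(const struct bad_list *bl,
				uint64_t start, uint64_t len)
{
	for (size_t i = 0; i < bl->count; i++) {
		const struct bad_range *b = &bl->r[i];

		if (start < b->start + b->len && b->start < start + len)
			return true;
	}
	return false;
}

/*
 * Repair-by-write: drop every recorded range that the written extent
 * fully covers. Partially covered ranges are kept as-is (an assumption
 * of this sketch), so disabling access stays reversible but conservative.
 */
static void bad_list_clear(struct bad_list *bl, uint64_t start, uint64_t len)
{
	size_t kept = 0;

	for (size_t i = 0; i < bl->count; i++) {
		const struct bad_range *b = &bl->r[i];
		bool covered = start <= b->start &&
			       b->start + b->len <= start + len;

		if (!covered)
			bl->r[kept++] = *b;
	}
	bl->count = kept;
}
```

In this model a fault handler would call bad_list_intersects() for the extent it is about to map and refuse on true, while a successful overwrite would call bad_list_clear() so that later faults can map the range again. The clear step is the part that has no counterpart in a one-way offlining model.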