From: Dan Williams <dan.j.williams@intel.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tony Luck <tony.luck@intel.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	Peter Zijlstra <peterz@infradead.org>,
	the arch/x86 maintainers <x86@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 0/6] use memcpy_mcsafe() for copy_to_iter()
Date: Tue, 1 May 2018 19:25:57 -0700
Message-ID: <CAPcyv4i=cjQr9xvxt+Mjp-fhzyNJdTTp7uaAtpJN9R4gPg_j-Q@mail.gmail.com>
In-Reply-To: <CA+55aFwZ3hrrOJ5W-C8gdam3aGNxz8FEAq9gPnRBkVmwu4BvYA@mail.gmail.com>

On Tue, May 1, 2018 at 5:09 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, May 1, 2018 at 4:03 PM Dan Williams <dan.j.williams@intel.com>
> wrote:
>
>> I'm confused. Are you talking about getting rid of the block-layer
>> bypass or changing how MCS errors are handled?
>
> The latter.
>
>> If it's the latter, MCS error handling, I don't see how to get
>> around something like copy_to_iter_mcsafe().
>
> So the basic issue is that since everybody wants mmap() to be at least an
> option (and preferably one of the _main_ options), I think that the whole
> "MCS errors are fatal" is fundamentally flawed.
>
> Which means that MCS errors can't be fatal.
>
> Which in turn means that the whole "special memcpy" seems very suspect.
>
> Can't we just do
>
>   - use a normal memcpy()
>
>   - basically set an "IO error flag" on MCE.
>
>   - for a user access the IO error flag potentially causes a SIGBUS as you
> mention, but even there it's not 100% clear that's necessarily possible or
> a good idea (I'm assuming that it can be damned hard to figure out _who_
> caused the problem if it was a cached write that causes an MCE much much
> later).

Writes don't trigger MCE. Only consumed poison / media errors trigger
MCE. I.e. even a read-modify-write operation to write back a partially
dirty cacheline will not trigger an MCE, because the read is consumed
only by the cache, not by the core. We'll get notified when that
happens, but only by a CMCI interrupt, not an MCE exception.

>   - for the kernel, the "IO error flag" can hopefully then (again,
> assuming you can correlate the MCE with the right process) be turned into
> EIO.

This is precisely the current implementation / usage of
memcpy_mcsafe(). Reads go through the driver and the driver does the
right / simple thing to turn an MCE into EIO. I'd like to make this
the only model and kill the driver bypass in fs/dax.c so that the vfs
does not need to contend with these low level architecture details.
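
As a rough illustration of the "driver turns a failed copy into EIO"
model described above, here is a minimal userspace sketch. It is not
the kernel's memcpy_mcsafe(); the names model_copy_mcsafe(),
model_driver_read(), media[] and poison_at are hypothetical, and the
"bytes remaining" return convention is an assumption made for the
example.

```c
#include <errno.h>
#include <stddef.h>

#define MEDIA_SIZE 64

/* Hypothetical backing store standing in for persistent memory. */
static unsigned char media[MEDIA_SIZE];

/* Hypothetical poisoned byte offset in media[]; -1 means no poison. */
static long poison_at = -1;

/*
 * Model of a machine-check-safe copy: copy byte by byte and stop when
 * the poisoned offset would be read. Returns the number of bytes that
 * were NOT copied, so 0 means the whole range was copied.
 */
static size_t model_copy_mcsafe(void *dst, size_t off, size_t len)
{
	unsigned char *d = dst;
	size_t i;

	for (i = 0; i < len; i++) {
		if ((long)(off + i) == poison_at)
			return len - i;
		d[i] = media[off + i];
	}
	return 0;
}

/*
 * Model of the driver read path: the caller only ever sees 0, -EINVAL
 * or -EIO, never the details of how the media error was detected.
 */
static int model_driver_read(void *dst, size_t off, size_t len)
{
	if (off > MEDIA_SIZE || len > MEDIA_SIZE - off)
		return -EINVAL;
	return model_copy_mcsafe(dst, off, len) ? -EIO : 0;
}
```

In this sketch the layer above the driver needs no knowledge of
poison or machine checks; it only checks the return value for -EIO.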

To be clear, I'm not against dax-specific optimizations that do not go
through the block layer, but they should still be a driver call.

>> You mention mmap. Yes, we want the predominant access model to be
>> dax-mmap for Persistent Memory, but there's still the question about
>> what to do with media errors. To date we are trying to mirror the
>> error handling model for System Memory, i.e. SIGBUS to the process
>> that consumed the error. Is that error handling model also problematic
>> in your view?
>
> See above: if you can handle user space errors "gracefully" (ie with a
> SIGBUS, no crazy "system fatal (reboot)" garbage), then I really don't see
> why you can't do the same for the kernel accesses.
>
> IOW, why do we need that special "copy_to_iter_mcsafe()", when a normal
> "copy_to_iter()" should just work (and basically _has_ to work) anyway?
>
> Put another way: I think the whole basic premise of your patch is wrong,
> because (to quote your original patch descriptor), the fundamental starting
> point is garbage:
>
>     The result of the bypass is that the kernel treats machine checks during
>     read as system fatal (reboot) [..]
>
> See? If you are able to map that memory into user space, and recover, then
> why the whole crazy "system fatal" thing for kernel accesses?

Right, but the only way to make MCE non-fatal is to teach the machine
check handler about recoverable conditions. This patch teaches the
machine check handler how to recover copy_to_iter() errors.

We already have copy_from_iter_flushcache() that is used as a 'struct
dax_operations' op. I can do the same for this copy_to_iter() case so
at least it's up to the driver and not the vfs (fs/dax.c) to decide
how to handle this case.
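
A minimal userspace sketch of that idea follows: the generic layer
calls through a per-driver operation instead of choosing the copy
routine itself. This is not the real 'struct dax_operations'; the
names model_dax_ops, copy_to_buf, plain_copy(), careful_copy(),
model_poison and model_generic_read() are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-driver operations table. */
struct model_dax_ops {
	/* Copies up to len bytes; returns the number of bytes copied. */
	size_t (*copy_to_buf)(void *dst, const void *src, size_t len);
};

/* Hypothetical address of one bad source byte; NULL means none. */
static const unsigned char *model_poison;

/* Copy for a driver whose media never reports errors. */
static size_t plain_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return len;
}

/* Copy for a driver that stops short when it reaches the bad byte. */
static size_t careful_copy(void *dst, const void *src, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	size_t i;

	for (i = 0; i < len; i++) {
		if (s + i == model_poison)
			break;
		d[i] = s[i];
	}
	return i;
}

/* Generic layer: it has no policy of its own and defers to the driver. */
static size_t model_generic_read(const struct model_dax_ops *ops,
				 void *dst, const void *src, size_t len)
{
	return ops->copy_to_buf(dst, src, len);
}
```

A short return from model_generic_read() is the only signal the
generic layer sees; which copy routine produced it is the driver's
choice.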

