* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
@ 2020-04-18 20:30 Andy Lutomirski
  2020-04-18 20:52 ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2020-04-18 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Williams, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm



--Andy

> On Apr 18, 2020, at 12:42 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>>> On Fri, Apr 17, 2020 at 5:12 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>> 
>>> @@ -106,12 +108,10 @@ static __always_inline __must_check unsigned long
>>> memcpy_mcsafe(void *dst, const void *src, size_t cnt)
>>> {
>>> #ifdef CONFIG_X86_MCE
>>> -       if (static_branch_unlikely(&mcsafe_key))
>>> -               return __memcpy_mcsafe(dst, src, cnt);
>>> -       else
>>> +       if (static_branch_unlikely(&mcsafe_slow_key))
>>> +               return memcpy_mcsafe_slow(dst, src, cnt);
>>> #endif
>>> -               memcpy(dst, src, cnt);
>>> -       return 0;
>>> +       return memcpy_mcsafe_fast(dst, src, cnt);
>>> }
> 
> It strikes me that I see no advantages to making this an inline function at all.
> 
> Even for the good case - where it turns into just a memcpy because MCE
> is entirely disabled - it doesn't seem to matter.
> 
> The only case that really helps is when the memcpy can be turned into
> a single access. Which - and I checked - does exist, with people doing
> 
>        r = memcpy_mcsafe(&sb_seq_count, &sb(wc)->seq_count, sizeof(uint64_t));
> 
> to read a single 64-bit field which looks aligned to me.
> 
> But that code is incredible garbage anyway, since even on a broken
> machine, there's no actual reason to use the slow variant for that
> whole access that I can tell. The mcsafe copy routines do not do
> anything worthwhile for a single access.

Maybe I’m missing something obvious, but what’s the alternative?  The _mcsafe variants don’t just avoid the REP mess — they also tell the kernel that this particular access is recoverable via extable. With a regular memory access, the CPU may not explode, but do_machine_check() will, at very best, OOPS, and even that requires a certain degree of optimism.  A panic is more likely.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-18 20:30 [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast Andy Lutomirski
@ 2020-04-18 20:52 ` Linus Torvalds
  2020-04-20  5:08   ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2020-04-18 20:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dan Williams, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Sat, Apr 18, 2020 at 1:30 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> Maybe I’m missing something obvious, but what’s the alternative?  The _mcsafe variants don’t just avoid the REP mess — they also tell the kernel that this particular access is recoverable via extable.

.. which they could easily do exactly the same way the user space
accessors do, just with a much simplified model that doesn't even care
about multiple sizes, since unaligned accesses weren't valid anyway.

The thing is, all of the MCS code has been nasty. There's no reason
for it what-so-ever that I can tell. The hardware has been so
incredibly broken that it's basically unusable, and most of the
software around it seems to have been about testing.

So I absolutely abhor that thing. Everything about that code has
screamed "yeah, we completely mis-designed the hardware, we're pushing
the problems into software, and nobody even uses it or can test it so
there's like 5 people who care".

And I'm pushing back on it, because I think that the least the code
can do is to at least be simple.

For example, none of those optimizations should exist. That function
shouldn't have been inline to begin with. And if it really really
matters from a performance angle that it was inline (which I doubt),
it shouldn't have looked like a memory copy, it should have looked
like "get_user()" (except without all the complications of actually
having to test addresses or worry about different sizes).

And it almost certainly shouldn't have been done in low-level asm
either. It could have been a single "read aligned word" interface
using an inline asm, and then everything else could have been done as
C code around it.
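
For the sake of argument, a minimal sketch of that single primitive
(the mcsafe_read_word() name and error convention are made up here;
_ASM_EXTABLE_FAULT is the existing exception-table flavor the mcsafe
code already relies on, and its handler stuffs the trap number into
%eax):

        /*
         * Hypothetical sketch only: one aligned word read, annotated so
         * that a #MC (or #PF) on the load resumes at label "2:" instead
         * of taking the machine down.  ex_handler_fault() writes the
         * trap number into regs->ax, so "ret" is non-zero on failure.
         */
        static __always_inline int mcsafe_read_word(unsigned long *val,
                                                    const unsigned long *src)
        {
                int ret = 0;

                asm volatile("1:     movq (%[src]), %[v]\n"
                             "2:\n"
                             _ASM_EXTABLE_FAULT(1b, 2b)
                             : "+a" (ret), [v] "=r" (*val)
                             : [src] "r" (src)
                             : "memory");
                return ret;
        }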

But no. The software side is almost as messy as the hardware side is.
I hate it. And since nobody sane can test it, and the broken hardware
is _so_ broken that nobody should ever use it, I have continually
pushed back against this kind of ugly nasty special code.

We know the writes can't fault, since they are buffered. So they
aren't special at all.

We know the acceptable reads for the broken hardware basically boil
down to a single simple word-size aligned read, so you need _one_
special inline asm for that. The rest of the cases can be handled by
masking and shifting if you really really need to - and done better
that way than with byte accesses anyway.

Then you have _one_ C file that implements everything using that
single operation (ok, if people absolutely want to do sizes, I guess
they can if they can just hide it in that one file), and you have one
header file that exposes the interfaces to it, and you're done.
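
And again purely as a sketch (assuming a word-read primitive like the
mcsafe_read_word() above, and hand-waving away alignment and tail
handling), the C side could be as dumb as:

        /* Hypothetical sketch: bulk copy built on the single primitive,
         * returning the number of bytes NOT copied, like memcpy_mcsafe().
         * Assumes word-aligned dst/src/len for brevity. */
        unsigned long mcsafe_copy(void *dst, const void *src, size_t len)
        {
                unsigned long *d = dst;
                const unsigned long *s = src;

                while (len >= sizeof(long)) {
                        unsigned long word;

                        if (mcsafe_read_word(&word, s))
                                return len;
                        *d++ = word;
                        s++;
                        len -= sizeof(long);
                }
                return len;
        }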

And you strive hard as hell to not impact anything else, because you
know that the hardware is unacceptable until all those special rules
go away. Which they apparently finally have.

                Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-18 20:52 ` Linus Torvalds
@ 2020-04-20  5:08   ` Dan Williams
  2020-04-20 17:28     ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-04-20  5:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Sat, Apr 18, 2020 at 1:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, Apr 18, 2020 at 1:30 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Maybe I’m missing something obvious, but what’s the alternative?  The _mcsafe variants don’t just avoid the REP mess — they also tell the kernel that this particular access is recoverable via extable.
>
> .. which they could easily do exactly the same way the user space
> accessors do, just with a much simplified model that doesn't even care
> about multiple sizes, since unaligned accesses weren't valid anyway.
>
> The thing is, all of the MCS code has been nasty. There's no reason
> for it what-so-ever that I can tell. The hardware has been so
> incredibly broken that it's basically unusable, and most of the
> software around it seems to have been about testing.
>
> So I absolutely abhor that thing. Everything about that code has
> screamed "yeah, we completely mis-designed the hardware, we're pushing
> the problems into software, and nobody even uses it or can test it so
> there's like 5 people who care".
>
> And I'm pushing back on it, because I think that the least the code
> can do is to at least be simple.
>
> For example, none of those optimizations should exist. That function
> shouldn't have been inline to begin with. And if it really really
> matters from a performance angle that it was inline (which I doubt),
> it shouldn't have looked like a memory copy, it should have looked
> like "get_user()" (except without all the complications of actually
> having to test addresses or worry about different sizes).
>
>
> And it almost certainly shouldn't have been done in low-level asm
> either. It could have been a single "read aligned word" interface
> using an inline asm, and then everything else could have been done as
> C code around it.

Do we have examples of doing exception handling from C? I thought all
the exception handling copy routines were assembly routines?

>
> But no. The software side is almost as messy as the hardware side is.
> I hate it. And since nobody sane can test it, and the broken hardware
> is _so_ broken that nobody should ever use it, I have continually
> pushed back against this kind of ugly nasty special code.
>
> We know the writes can't fault, since they are buffered. So they
> aren't special at all.

The writes can mmu-fault now that memcpy_mcsafe() is also used by
_copy_to_iter_mcsafe(). This allows a clean bypass of the block layer
in fs/dax.c in addition to the pmem driver access of poisoned memory.
Now that the fallback is a sane rep; movs; it can be considered for
plain copy_to_iter() for other user copies so you get exception
handling on kernel access of poison outside of persistent memory. To
Andy's point I think a recoverable copy (for exceptions or faults) is
generally useful.

> We know the acceptable reads for the broken hardware basically boil
> down to a single simple word-size aligned read, so you need _one_
> special inline asm for that. The rest of the cases can be handled by
> masking and shifting if you really really need to - and done better
> that way than with byte accesses anyway.
>
> Then you have _one_ C file that implements everything using that
> single operation (ok, if people absolutely want to do sizes, I guess
> they can if they can just hide it in that one file), and you have one
> header file that exposes the interfaces to it, and you're done.
>
> And you strive hard as hell to not impact anything else, because you
> know that the hardware is unacceptable until all those special rules
> go away. Which they apparently finally have.

I understand the gripes about the mcsafe_slow() implementation, but
how do I implement mcsafe_fast() any better than how it is currently
organized given that, setting aside machine check handling,
memcpy_mcsafe() is the core of a copy_to_iter*() front-end that can
mmu-fault on either source or destination access?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20  5:08   ` Dan Williams
@ 2020-04-20 17:28     ` Linus Torvalds
  2020-04-20 18:20       ` Dan Williams
  2020-10-06  9:57       ` [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() tip-bot2 for Dan Williams
  0 siblings, 2 replies; 30+ messages in thread
From: Linus Torvalds @ 2020-04-20 17:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Sun, Apr 19, 2020 at 10:08 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Do we have examples of doing exception handling from C? I thought all
> the exception handling copy routines were assembly routines?

You need assembler for the actual access, but that's a _single_
instruction - best done as inline asm.

The best example of something that does *exactly* what you want to do is likely

        unsafe_get_user();
        unsafe_put_user();

which basically turns into a single instruction with exception
handling, with the exception handler jumping directly to an error
label.
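
For reference, the existing usage pattern looks roughly like this (the
read_one_u64() wrapper and its names are just for illustration; a
kernel-pointer variant would drop the access_ok()/stac parts hidden
inside user_access_begin()):

        static int read_one_u64(u64 __user *uptr, u64 *val)
        {
                if (!user_access_begin(uptr, sizeof(*val)))
                        return -EFAULT;
                /* single load; a fault jumps straight to the Efault label */
                unsafe_get_user(*val, uptr, Efault);
                user_access_end();
                return 0;
        Efault:
                user_access_end();
                return -EFAULT;
        }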

Ok, so right now gcc can't do that for inline asm with outputs, so it
generates fairly nasty code (a secondary register with the error state
that then causes a conditional branch on it), but that's a compiler
limitation that will eventually go away (where "eventually" means that
it already works in LLVM with experimental patches).

You could literally mis-use those helpers as-is (please don't - the
code generation is correct, but at the very least we'd have to
re-organize a bit to make it a better interface, ie have an
alternative name like "unsafe_get_kernel()" for kernel pointer
accesses).

You'd have to do the alignment guarantees yourself, but there are
examples of that in this area too (strnlen_user() does exactly that -
aligned word accesses).

So the point here is that the current interfaces are garbage, _if_ the
whole "access a single value" is actually performance-critical.

And if that is *not* the case, then the best thing to do is likely to
just use a static call. No inlining of single instructions at all,
just always use a function call, and then pick the function
appropriately.

Honestly, I can't imagine that the "single access" case is so
timing-critical that the static call isn't the right model. Your use
case is _not_ as important or common as doing user accesses.
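
Something like this, say - just a sketch on top of the still-in-flight
static_call() infrastructure, reusing the memcpy_mcsafe_fast/_slow
names from the patch; none of this is the actual proposed
implementation:

        /* Hypothetical sketch using the proposed static_call() API. */
        DEFINE_STATIC_CALL(mc_copy, memcpy_mcsafe_fast);

        unsigned long copy_safe(void *dst, const void *src, size_t len)
        {
                return static_call(mc_copy)(dst, src, len);
        }

        /* called from early boot quirk detection, only on broken parts */
        void mc_copy_select_slow(void)
        {
                static_call_update(mc_copy, memcpy_mcsafe_slow);
        }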

Finally, the big question is whether the completely broken hardware
even matters. Are there actual customers that actually use the garbage
"we can crash the machine" stuff?

Because when it comes to things like nvdimms etc, the _primary_
performance target would be getting the kernel entirely out of the
way, and allowing databases etc to just access the damn thing
directly.

And if you allow user space to access it directly, then you just have
to admit that it's not a software issue any more - it's the hardware
that is terminally broken and unusable garbage. It's not even
interesting to work around things in the kernel, because user space
can just crash the machine directly.

This is why I absolutely detest that thing so much. The hardware is
_so_ fundamentally broken that I have always considered the kernel
workarounds to basically be "staging" level stuff - good enough for
some random testing of known-broken stuff, but not something that
anybody sane should ever use.

So my preference would actually be to just call the broken cases to be
largely ignored, at least from a performance angle. If you can only
access it through the kernel, the biggest performance limitation is
that you cannot do any DAX-like thing at all safely, so then the
performance of some kernel accessors is completely secondary and
meaningless. When a kernel entry/exit takes a few thousand cycles on
the broken hardware (due to _other_ bugs), what's the point about
worrying about trying to inline some single access to the nvdimm?

Did the broken hardware ever spread out into the general public?
Because if not, then the proper thing to do might be to just make it a
compile-time option for the (few) customers that signed up for testing
the initial broken stuff, and make the way _forward_ be a clean model
without the need to worry about any exceptions at all.

> The writes can mmu-fault now that memcpy_mcsafe() is also used by
> _copy_to_iter_mcsafe(). This allows a clean bypass of the block layer
> in fs/dax.c in addition to the pmem driver access of poisoned memory.
> Now that the fallback is a sane rep; movs; it can be considered for
> plain copy_to_iter() for other user copies so you get exception
> handling on kernel access of poison outside of persistent memory. To
> Andy's point I think a recoverable copy (for exceptions or faults) is
> generally useful.

I think that's completely independent.

If we have good reasons for having targets with exception handling,
then that has absolutely nothing to do with machine checks or buggy
hardware.

And it sure shouldn't be called "mcsafe", since it has nothing to do
with that situation any more.

> I understand the gripes about the mcsafe_slow() implementation, but
> how do I implement mcsafe_fast() any better than how it is currently
> organized given that, setting aside machine check handling,
> memcpy_mcsafe() is the core of a copy_to_iter*() front-end that can
> mmu-fault on either source or destination access?

So honestly, once it is NOT about the broken machine check garbage,
then it should be sold on its own independent reasons.

Do we want to have a "copy_to_iter_safe" that can handle page faults?
Because we have lots of those kinds of things, we have

 - load_unaligned_zeropad()

   This loads a single word knowing that the _first_ byte is valid,
but can take an exception and zero-pad if it crosses a page boundary

 - probe_kernel_read()/write()

   This is a kernel memcpy() with the source/destination perhaps being unmapped.

 - various perf and tracing helpers that have special semantics.

but once it's about some generic interface, then it also needs to take
other architectures into account.

               Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 17:28     ` Linus Torvalds
@ 2020-04-20 18:20       ` Dan Williams
  2020-04-20 19:05         ` Linus Torvalds
  2020-10-06  9:57       ` [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() tip-bot2 for Dan Williams
  1 sibling, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-04-20 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 10:29 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, Apr 19, 2020 at 10:08 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Do we have examples of doing exception handling from C? I thought all
> > the exception handling copy routines were assembly routines?
>
> You need assembler for the actual access, but that's a _single_
> instruction - best done as inline asm.
>
> The best example of something that does *exactly* what you want to do is likely
>
>         unsafe_get_user();
>         unsafe_put_user();
>
> which basically turns into a single instruction with exception
> handling, with the exception handler jumping directly to an error
> label.
>
> Ok, so right now gcc can't do that for inline asm with outputs, so it
> generates fairly nasty code (a secondary register with the error state
> that then causes a conditional branch on it), but that's a compiler
> limitation that will eventually go away (where "eventually" means that
> it already works in LLVM with experimental patches).
>
> You could literally mis-use those helpers as-is (please don't - the
> code generation is correct, but at the very least we'd have to
> re-organize a bit to make it a better interface, ie have an
> alternative name like "unsafe_get_kernel()" for kernel pointer
> accesses).
>
> You'd have to do the alignment guarantees yourself, but there are
> examples of that in this area too (strnlen_user() does exactly that -
> aligned word accesses).

Ok, I'll take a deeper look, but my initial reaction is that this
approach could be a way to replace memcpy_mcsafe_slow(), but would be
unfortunate for the fast case which can just use rep; movs;

>
> So the point here is that the current interfaces are garbage, _if_ the
> whole "access a single value" is actually performance-critical.
>
> And if that is *not* the case, then the best thing to do is likely to
> just use a static call. No inlining of single instructions at all,
> just always use a function call, and then pick the function
> appropriately.
>
> Honestly, I can't imagine that the "single access" case is so
> timing-critical that the static call isn't the right model. Your use
> case is _not_ as important or common as doing user accesses.
>
> Finally, the big question is whether the completely broken hardware
> even matters. Are there actual customers that actually use the garbage
> "we can crash the machine" stuff?
>
> Because when it comes to things like nvdimms etc, the _primary_
> performance target would be getting the kernel entirely out of the
> way, and allowing databases etc to just access the damn thing
> directly.
>
> And if you allow user space to access it directly, then you just have
> to admit that it's not a software issue any more - it's the hardware
> that is terminally broken and unusable garbage. It's not even
> interesting to work around things in the kernel, because user space
> can just crash the machine directly.
>
> This is why I absolutely detest that thing so much. The hardware is
> _so_ fundamentally broken that I have always considered the kernel
> workarounds to basically be "staging" level stuff - good enough for
> some random testing of known-broken stuff, but not something that
> anybody sane should ever use.
>
> So my preference would actually be to just call the broken cases to be
> largely ignored, at least from a performance angle. If you can only
> access it through the kernel, the biggest performance limitation is
> that you cannot do any DAX-like thing at all safely, so then the
> performance of some kernel accessors is completely secondary and
> meaningless. When a kernel entry/exit takes a few thousand cycles on
> the broken hardware (due to _other_ bugs), what's the point about
> worrying about trying to inline some single access to the nvdimm?
>
> Did the broken hardware ever spread out into the general public?
> Because if not, then the proper thing to do might be to just make it a
> compile-time option for the (few) customers that signed up for testing
> the initial broken stuff, and make the way _forward_ be a clean model
> without the need to worry about any exceptions at all.

There are 3 classes of hardware; all of them trigger exceptions*, it's
just the recoverability of those exceptions that is different:

1/ Recovery not supported: all exceptions report PCC=1 (processor
context corrupted) and the kernel decides to panic.

2/ Recovery supported: user processes killed on consumption, panic on
kernel consumption outside of an exception-handler-instrumented code
path. Unfortunately there is no architectural way to distinguish
class 1 from class 2 outside of a PCI quirk whitelist. The kernel prior
to memcpy_mcsafe() just unconditionally enabled recovery for user
consumption, panicked on kernel consumption, and hoped that exceptions
were sent with PCC=0. The introduction of memcpy_mcsafe() to handle
poison consumption from kernel-space without a panic introduced the
not-so-pretty caveat that some platforms needed to do a special
snowflake memcpy to avoid known scenarios** that trigger PCC=1 when
they could otherwise trigger PCC=0. So a PCI quirk whitelist was added
for those.

3/ Recovery supported *and* special snowflake memcpy rules relaxed.
Still no architectural way to discover this state so let us just
enable memcpy_mcsafe_fast() always and let the former whitelist become
a blacklist of "this CPU requires you to do a dance to avoid some
PCC=1 cases, but that dance impacts performance".

* I'm at a loss as to why you seem to be suggesting that hardware
should / could avoid all exceptions. What else could hardware do
besides throw an exception on consumption of a naturally occurring
multi-bit ECC error? Data is gone, and only software might know how to
recover.

** Unaligned access across a cacheline boundary, fast string consumption...

> > The writes can mmu-fault now that memcpy_mcsafe() is also used by
> > _copy_to_iter_mcsafe(). This allows a clean bypass of the block layer
> > in fs/dax.c in addition to the pmem driver access of poisoned memory.
> > Now that the fallback is a sane rep; movs; it can be considered for
> > plain copy_to_iter() for other user copies so you get exception
> > handling on kernel access of poison outside of persistent memory. To
> > Andy's point I think a recoverable copy (for exceptions or faults) is
> > generally useful.
>
> I think that's completely independent.
>
> If we have good reasons for having targets with exception handling,
> then that has absolutely nothing to do with machine checks or buggy
> hardware.
>
> And it sure shouldn't be called "mcsafe", since it has nothing to do
> with that situation any more.

Ok, true.

> > I understand the gripes about the mcsafe_slow() implementation, but
> > how do I implement mcsafe_fast() any better than how it is currently
> > organized given that, setting aside machine check handling,
> > memcpy_mcsafe() is the core of a copy_to_iter*() front-end that can
> > mmu-fault on either source or destination access?
>
> So honestly, once it is NOT about the broken machine check garbage,
> then it should be sold on its own independent reasons.
>
> Do we want to have a "copy_to_iter_safe" that can handle page faults?
> Because we have lots of those kinds of things, we have
>
>  - load_unaligned_zeropad()
>
>    This loads a single word knowing that the _first_ byte is valid,
> but can take an exception and zero-pad if it crosses a page boundary
>
>  - probe_kernel_read()/write()
>
>    This is a kernel memcpy() with the source/destination perhaps being unmapped.
>
>  - various perf and tracing helpers that have special semantics.
>
> but once it's about some generic interface, then it also needs to take
> other architectures into account.

For the fast case it's really just copy_user_enhanced_fast_string()
without forced stac/clac when the source and dest are both kernel
accesses.
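
Roughly along these lines, i.e. a bare rep movsb with an exception
annotation and no stac/clac - a sketch only, the annotation details
are made up here and the real thing would presumably live in a .S file
next to the other copy routines:

        /*
         * Hypothetical sketch: "fast" kernel-to-kernel copy as a plain
         * rep movsb.  On a #MC/#PF mid-copy the fixup resumes at "2:",
         * %rcx still holds the bytes not copied, and the fault handler
         * writes the trap number into %eax (hence the rax clobber).
         */
        unsigned long memcpy_mcsafe_fast(void *dst, const void *src,
                                         size_t len)
        {
                unsigned long rem = len;

                asm volatile("1:     rep movsb\n"
                             "2:\n"
                             _ASM_EXTABLE_FAULT(1b, 2b)
                             : "+D" (dst), "+S" (src), "+c" (rem)
                             : : "memory", "rax");
                return rem;     /* bytes not copied */
        }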

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 18:20       ` Dan Williams
@ 2020-04-20 19:05         ` Linus Torvalds
  2020-04-20 19:29           ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2020-04-20 19:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 11:20 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> * I'm at a loss of why you seem to be suggesting that hardware should
> / could avoid all exceptions. What else could hardware do besides
> throw an exception on consumption of a naturally occuring multi-bit
> ECC error? Data is gone, and only software might know how to recover.

This is classic bogus thinking.

If Intel ever makes ECC DRAM available to everybody, there would be a
_shred_ of logic to that thinking, but right now it's some hw designer
in their mom's basement that has told you that hardware has to throw a
synchronous exception because hardware doesn't know any better.

That hardware designer really doesn't have a _clue_ about the big issues.

The fact is, a synchronous machine check exception is about the
_worst_ thing you can ever do when you encounter something like a
memory error.

It literally means that the software cannot possibly do anything sane
to recover, because the software is in some random place. The hardware
designer didn't think about the fact that the low-level access is
hidden from the software by a compiler and/or a lot of other
infrastructure - maybe microcode, maybe the OS, maybe a scripting
language, yadda yadda.

Absolutely NOBODY can recover at the level of one instruction. The
microcode people already proved that. At the level of "memcpy()", you
do not have a recovery option.

A hardware designer that tells you that you have to magically recover
at an instruction boundary fundamentally DOES NOT UNDERSTAND THE
PROBLEM.

IOW, you have completely uncritically just taken that incorrect
statement of "what else could hardware do" without questioning that
hardware designer AT ALL.

And remember, this is likely the same hardware designer that told you
that it's a good idea to then make machine checks go to every single
CPU in the system.

And this is the same hardware designer that then didn't even give you
enough information to recover.

And this is the same hardware designer that made recovery impossible
anyway, because if the error happened in microcode or in some other
situation, the microcode COULDN'T HANDLE IT EITHER!

In other words, you are talking to people WHO ARE KNOWN TO BE INCOMPETENT.

Seriously. Question them. When they tell you that "it's the only
thing we can possibly do", they do so from being incompetent, and we
have the history to PROVE it.

I don't understand why nobody else seems to be pushing back against
the completely insane and known garbage that is the Intel machine
checks. They are wrong.

The fact is, synchronous errors are the absolute worst possible
interface, exactly because they cause problems in various nasty corner
cases.

We _know_ a lot of those corner cases, for chrissake:

 - random standard library routine like "memcpy". How the hell do you
think a memcpy can recover? It can't.

 - Unaligned handling where "one" access isn't actually a single access.

 - microcode. Intel saw this problem themselves, but instead of making
people realize "oh, synchronous exceptions don't work that well" and
think about the problem, they wasted our time for decades, and then
probably spent a lot of effort in trying to make them work.

 - random generic code that isn't able to handle the fault, because IT
SHOULDN'T NEED TO CARE. Low-level filesystems, user mappings, the list
just goes on.

The only thing that can recover tends to be at a *MUCH* higher level
than one instruction access.

So the next time somebody tells you "there's nothing else we can do",
call them out on being incompetent, and call them out on the fact that
history has _shown_ that they are incompetent and wrong. Over and over
again.

I can _trivially_ point to a number of "what else could we do" that
are much better options.

 (a) let the corrupted value through, notify things ASYNCHRONOUSLY
that there were problems, and let people handle the issue later when
they are ready to do so.

Yeah, the data was corrupt - but not allowing the user to read it may
mean that the whole file is now inaccessible, even if it was just a
single ECC block that was wrong. I don't know the block-size people
use for ECC, and the fact is, software shouldn't really even need to
care. I may be able to recover from the situation at a higher level.
The data may be recoverable other ways - including just a "I want even
corrupted data, because I might have enough context that I can make
sense of it anyway".

 (b) if you have other issues so that you don't have data at all
(maybe it was metadata that was corrupted, don't ask me how that would
happen), just return zeroes, and notify about the problem
ASYNCHRONOUSLY.

And when notifying, give as much useful information as possible: the
virtual and physical address of the data, please, in addition to maybe
lower level bank information. Add a bit for "multiple errors", so that
whoever then later tries to maybe recover, can tell if it has complete
data or not.

The OS can send a SIGBUS with that information to user land that can
then maybe recover. Or it can say "hey, I'm in some recovery mode,
I'll try to limp along with incomplete data". Sometimes "recover"
means "try to limp along, notify the user that their hw is going bad,
but try to use the data anyway".

Again, what Intel engineers actually did with the synchronous
non-recoverable thing was not "the only thing I could possibly have
done".

It was literally the ABSOLUTE WORST POSSIBLE THING they could do, with
potentially  a dead machine as a result.

And now - after years of pain - you have the gall to repeat that
idiocy that you got from people who have shown themselves to be
completely wrong in the past?

Why?

Why do you take their known wrong approach as unthinking gospel? Just
because it's been handed down over generations on stone slabs?

I really really detest the whole mcsafe garbage. And I absolutely
*ABHOR* how nobody inside of Intel has apparently ever questioned the
brokenness at a really fundamental level.

That "I throw my hands in the air and just give up" thing is a
disease. It's absolutely not "what else could we do".

                  Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 19:05         ` Linus Torvalds
@ 2020-04-20 19:29           ` Dan Williams
  2020-04-20 20:07             ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-04-20 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 12:13 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Apr 20, 2020 at 11:20 AM Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> I really really detest the whole mcsafe garbage. And I absolutely
> *ABHOR* how nobody inside of Intel has apparently ever questioned the
> brokenness at a really fundamental level.
>
> That "I throw my hands in the air and just give up" thing is a
> disease. It's absolutely not "what else could we do".

So I grew up in the early part of my career validating ARM CPUs where
a data-abort was either precise or imprecise and the precise error
could be handled like a page fault as you know which instruction
faulted and how to restart the thread. So I didn't take x86 CPU
designers' word for it, I honestly thought that "hmm the x86 machine
check thingy looks like it's trying to implement precise vs imprecise
data-aborts, and precise / synchronous is maybe a good thing because
it's akin to a page fault". I didn't consider asynchronous to be
better because that means there is a gap between when the data
corruption is detected and when it might escape the system that some
external agent could trust the result and start acting on before the
asynchronous signal is delivered.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 19:29           ` Dan Williams
@ 2020-04-20 20:07             ` Linus Torvalds
  2020-04-20 20:23               ` Luck, Tony
  2020-04-20 20:24               ` Dan Williams
  0 siblings, 2 replies; 30+ messages in thread
From: Linus Torvalds @ 2020-04-20 20:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 12:29 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
>  I didn't consider asynchronous to be
> better because that means there is a gap between when the data
> corruption is detected and when it might escape the system that some
> external agent could trust the result and start acting on before the
> asynchronous signal is delivered.

The thing is, absolutely nobody cares whether you start acting on the
wrong data or not.

Even if you use the wrong data as a pointer, and get other wrong data
behind it (or take an exception), that doesn't change a thing: the end
result is still the exact same thing "unpredictably wrong data". You
didn't really make things worse by continuing, and people who care
about some very particular case can easily add a barrier to get any
error before they go past some important point.

It's not like Intel hasn't done that before. It's how legacy FP
instructions work - and while the legacy FP unit had horrendous
problems, the delayed exception wasn't one of them (routing it to
irq13 was, but that's on IBM, not Intel). Async exception handling
simplifies a lot of problems, and in fact, if you then have explicit
"check now" points (which the Intel FPU also had, as long as you
didn't have that nasty irq13 routing), it simplifies software too.

So there is no _advantage_ to being synchronous to the source of the
problem.  There is only pain.

There are lots and lots of disadvantages, most of them being
variations - on many different levels of - "it hit at a point where I
cannot recover".

The microcode example is one such example that Intel engineers should
really take to heart. But the real take-away is that it is only _one_
such example. And making the microcode able to recover doesn't fix all
the _other_ places that aren't able to recover at that point, or fix
the fundamental pain.

For example, for the kernel, NMI's tend to be pretty special things,
and we simply may not be able to handle errors inside an NMI context.
It's not all that unlike the intel microcode example, just at another
level.

But we have lots of other situations. We have random
compiler-generated code, and we're simply not able to take exceptions
at random stages and expect to recover. We have very strict rules
about "this might be unsafe", and a normal memory access of a trusted
pointer IS NOT IT.

But at a higher level, something like "strnlen()" - or any other
low-level library routine - isn't really all that different from
microcode for most programs. We don't want to have a million copies of
"standard library routine, except for nvram".

The whole - and ONLY - point of something like nvram is that it's
supposed to act like memory. It's part of the name, and it's very much
part of the whole argument for why it's a good idea.

And the advantage of it being memory is that you are supposed to be
able to do all those random operations that you just do normally on
any memory region - ask for string lengths, do copies, add up numbers,
look for patterns. Without having to do an "IO" operation for it.

Taking random synchronous faults is simply not acceptable. It breaks
that fundamental principle.

Maybe you have a language with a try/catch/throw model of exception
handling - but one where throwing exceptions is constrained to happen
for documented operations, not any random memory access. That's very
common, and kind of like what the kernel exception handling does by
hand.

So an allocation failure can throw an exception, but once you've
allocated memory, the language runtime simply depends on knowing that
it has pointer safety etc, and normal accesses won't throw an
exception.

In that kind of model, you can easily add a model where "SIGBUS sets a
flag, we'll handle it at catch time", but that depends on things being
able to just continue _despite_ the error.

So a source-synchronous error can be really hard to handle. Exception
handling at random points is simply really really tough in general -
and there are few more random and more painful points than some "I
accessed memory".

So no. An ECC error is nothing like a page fault. A page fault you
have *control* over. You can have a language without the ability to
generate wild pointers, or you can have a language like C where wild
pointers are possible, but they are a bug. So you can work on the
assumption that a page fault is always entirely invisible to the
program at hand.

A synchronous ECC error doesn't have that "entirely invisible to the
program at hand" behavior.

And yes, I've asked for ECC for regular memory for regular CPU's from
Intel for over two decades now. I do NOT want their broken machine
check model with it. I don't even necessarily want the "correction"
part of it - I want reporting of "your hardware / the results of your
computation is no longer reliable". I don't want a dead machine, I
don't want some worry about "what happens if I get an ECC error in
NMI", I don't want any of the problems that come from taking an
absolute "ECC errors are fatal" approach. I just want "an error
happened", with as much information about where it happened as
possible.

                 Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:07             ` Linus Torvalds
@ 2020-04-20 20:23               ` Luck, Tony
  2020-04-20 20:27                 ` Linus Torvalds
  2020-04-20 20:24               ` Dan Williams
  1 sibling, 1 reply; 30+ messages in thread
From: Luck, Tony @ 2020-04-20 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Williams, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, stable, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 01:07:09PM -0700, Linus Torvalds wrote:
> On Mon, Apr 20, 2020 at 12:29 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >  I didn't consider asynchronous to be
> > better because that means there is a gap between when the data
> > corruption is detected and when it might escape the system that some
> > external agent could trust the result and start acting on before the
> > asynchronous signal is delivered.
> 
> The thing is, absolutely nobody cares whether you start acting on the
> wrong data or not.

I think they do. If the result of the wrong data has already
been sent out the network before you process the signal, then you
will need far smarter application software than has ever been written
to hunt it down and stop the spread of the bogus result.

Stopping dead on the instruction before it consumes the data
means you can "recover" by killing just one process, or just one
VMM guest.

I'm in total agreement the machine check (especially broadcast)
was a bad choice for how to "stop on a dime". But I can't see
how you could possibly decide what to do if you let thousands
of instructions retire based on a bad data value before you even
know that it happened.

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:07             ` Linus Torvalds
  2020-04-20 20:23               ` Luck, Tony
@ 2020-04-20 20:24               ` Dan Williams
  2020-04-20 20:46                 ` Linus Torvalds
  1 sibling, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-04-20 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:07 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Apr 20, 2020 at 12:29 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >  I didn't consider asynchronous to be
> > better because that means there is a gap between when the data
> > corruption is detected and when it might escape the system that some
> > external agent could trust the result and start acting on before the
> > asynchronous signal is delivered.
>
> The thing is, absolutely nobody cares whether you start acting on the
> wrong data or not.
>
> Even if you use the wrong data as a pointer, and get other wrong data
> behind it (or take an exception), that doesn't change a thing: the end
> result is still the exact same thing "unpredictably wrong data". You
> didn't really make things worse by continuing, and people who care
> about some very particular case can easily add a barrier to get any
> error before they go past some important point.
>
> It's not like Intel hasn't done that before. It's how legacy FP
> instructions work - and while the legacy FP unit had horrendous
> problems, the delayed exception wasn't one of them (routing it to
> irq13 was, but that's on IBM, not Intel). Async exception handling
> simplifies a lot of problems, and in fact, if you then have explicit
> "check now" points (which the Intel FPU also had, as long as you
> didn't have that nasty irq13 routing), it simplifies software too.

Ah, now it's finally sinking in: if the model had, from the beginning,
required software to issue a barrier when it wants to resolve pending
memory errors, that would have freed up the implementation to leave
error handling to a higher-level / less constrained part of the
software stack.

>
> So there is no _advantage_ to being synchronous to the source of the
> problem.  There is only pain.
>
> There are lots and lots of disadvantages, most of them being
> variations - on many different levels of - "it hit at a point where I
> cannot recover".
>
> The microcode example is one such example that Intel engineers should
> really take to heart. But the real take-away is that it is only _one_
> such example. And making the microcode able to recover doesn't fix all
> the _other_ places that aren't able to recover at that point, or fix
> the fundamental pain.
>
> For example, for the kernel, NMI's tend to be pretty special things,
> and we simply may not be able to handle errors inside an NMI context.
> It's not all that unlike the intel microcode example, just at another
> level.
>
> But we have lots of other situations. We have random
> compiler-generated code, and we're simply not able to take exceptions
> at random stages and expect to recover. We have very strict rules
> about "this might be unsafe", and a normal memory access of a trusted
> pointer IS NOT IT.
>
> But at a higher level, something like "strnlen()" - or any other
> low-level library routine - isn't really all that different from
> microcode for most programs. We don't want to have a million copies of
> "standard library routine, except for nvram".
>
> The whole - and ONLY - point of something like nvram is that it's
> supposed to act like memory. It's part of the name, and it's very much
> part of the whole argument for why it's a good idea.

Right, and I had a short cry/laugh when realizing that the current
implementation still correctly sends SIGBUS for userspace-consumed
poison, but the kernel regressed because its special synchronous
handling was no longer being correctly deployed.

>
> And the advantage of it being memory is that you are supposed to be
> able to do all those random operations that you just do normally on
> any memory region - ask for string lengths, do copies, add up numbers,
> look for patterns. Without having to do an "IO" operation for it.
>
> Taking random synchronous faults is simply not acceptable. It breaks
> that fundamental principle.
>
> Maybe you have a language with a try/catch/throw model of exception
> handling - but one where throwing exceptions is constrained to happen
> for documented operations, not any random memory access. That's very
> common, and kind of like what the kernel exception handling does by
> hand.
>
> So an allocation failure can throw an exception, but once you've
> allocated memory, the language runtime simply depends on knowing that
> it has pointer safety etc, and normal accesses won't throw an
> exception.
>
> In that kind of model, you can easily add a model where "SIGBUS sets a
> flag, we'll handle it at catch time", but that depends on things being
> able to just continue _despite_ the error.
>
> So a source-synchronous error can be really hard to handle. Exception
> handling at random points is simply really really tough in general -
> and there are few more random and more painful points than some "I
> accessed memory".
>
> So no. An ECC error is nothing like a page fault. A page fault you
> have *control* over. You can have a language without the ability to
> generate wild pointers, or you can have a language like C where wild
> pointers are possible, but they are a bug. So you can work on the
> assumption that a page fault is always entirely invisible to the
> program at hand.
>
> A synchronous ECC error doesn't have that "entirely invisible to the
> program at hand" behavior.
>
> And yes, I've asked for ECC for regular memory for regular CPU's from
> Intel for over two decades now. I do NOT want their broken machine
> check model with it. I don't even necessarily want the "correction"
> part of it - I want reporting of "your hardware / the results of your
> computation is no longer reliable". I don't want a dead machine, I
> don't want some worry about "what happens if I get an ECC error in
> NMI", I don't want any of the problems that come from taking an
> absolute "ECC errors are fatal" approach. I just want "an error
> happened", with as much information about where it happened as
> possible.

...but also some kind of barrier semantic, right? Because there are
systems that want some guarantees when they can commit or otherwise
shoot the machine if they can not.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:23               ` Luck, Tony
@ 2020-04-20 20:27                 ` Linus Torvalds
  2020-04-20 20:45                   ` Luck, Tony
  0 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2020-04-20 20:27 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Dan Williams, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, stable, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:23 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> I think they do. If the result of the wrong data has already
> been sent out the network before you process the signal, then you
> will need far smarter application software than has ever been written
> to hunt it down and stop the spread of the bogus result.

Bah. That's a completely red herring argument.

By "asynchronous" I don't mean "hours later".

Make it be "interrupts are enabled, before serializing instruction".

Yes, we want bounded error handling latency. But that doesn't mean "synchronous"

           Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:27                 ` Linus Torvalds
@ 2020-04-20 20:45                   ` Luck, Tony
  2020-04-20 20:56                     ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread
From: Luck, Tony @ 2020-04-20 20:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Williams, Dan J, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, stable, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	Tsaur, Erwin, Linux Kernel Mailing List, linux-nvdimm

> By "asynchronous" I don't mean "hours later".
>
> Make it be "interrupts are enabled, before serializing instruction".
>
> Yes, we want bounded error handling latency. But that doesn't mean "synchronous"

Another X86 vendor seems to be adding something like that. See MCOMMIT
in https://www.amd.com/system/files/TechDocs/24594.pdf

But I wonder how an OS will know whether it is running some smart
MCOMMIT-aware application that can figure out what to do with bad
data, or a legacy application that should probably be stopped before
it hurts somebody.

I also wonder how expensive MCOMMIT is (since it is essentially
polling for "did any errors happen").

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:24               ` Dan Williams
@ 2020-04-20 20:46                 ` Linus Torvalds
  2020-04-20 20:57                   ` Luck, Tony
  0 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2020-04-20 20:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:25 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> ...but also some kind of barrier semantic, right? Because there are
> systems that want some guarantees when they can commit or otherwise
> shoot the machine if they can not.

The optimal model would likely be a new instruction that could be done
in user space to test for it, possibly without any exception at all
(because the thing that checks for errors is also presumably the only
thing that can decide how to recover - so raising an exception doesn't
necessarily help).

Together with a way for the kernel to save/restore the exception state
on task switch (presumably in the xsave area) so that the error state
of one process doesn't affect another one. Bonus points if it's all
per-security level, so that a pure user-level error report doesn't
poison the kernel state and vice versa.

That is _very_ similar to how FPU exceptions work right now. User
space can literally do an operation that creates an error on one CPU,
get re-scheduled to another one, and take the actual signal and read
the exception state on that other CPU.

(Of course, the "not even take an exception" part is different).

An alternate very simple model that doesn't require any new
instructions or any new architecturally visible state (except of
course the actual error data) would be to just raise a *maskable*
trap (with the Intel definition of trap vs exception: a trap happens
_after_ the instruction).

The trap could be on the next instruction if people really want to be
that precise, but I don't think it even matters. If it's delayed until
the next serializing instruction, that would probably be just fine
too.

But the important thing is that it

 (a) is a trap, not an exception - so the instruction has been done,
and you don't need to try to emulate it or anything to continue.

 (b) is maskable, so that the trap handler can decide to just mask it
and return (and set a separate flag to then handle it later)

With domain transfers either being barriers, or masking it (so NMI and
external interrupts would presumably mask it for latency reasons)?

I dunno. Wild handwaving. But much better than that crazy
unrecoverable machine check model.

                   Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:45                   ` Luck, Tony
@ 2020-04-20 20:56                     ` Linus Torvalds
  0 siblings, 0 replies; 30+ messages in thread
From: Linus Torvalds @ 2020-04-20 20:56 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Williams, Dan J, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, stable, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	Tsaur, Erwin, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:45 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> Another X86 vendor seems to be adding something like that. See MCOMMIT
> in https://www.amd.com/system/files/TechDocs/24594.pdf

That sounds potentially very expensive.

Particularly, as you say, something like the kernel (or
virtualization) may want to at least test for it cheaply on entry or
switch or whatever.

I do think you want the mcommit kind of thing for writing, but I think
the intel model of (no longer pcommit) using a writeback instruction
with a range, and then just sfence is better than a "commit
everything" thing.

But that's just for writing things, and that's fundamentally very
different from the read side errors.

So for the read-side you'd want some kind of "lfence and report"
instruction for the "did I see load errors". Very cheap like lfence,
so that there wouldn't really be a cost for the people if somebody
_really_ want to get notified immediately.

And again, it might just be part of any serializing instruction. I
don't care that much, although overloading "serializing instruction"
even more sounds like a bad idea.

               Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:46                 ` Linus Torvalds
@ 2020-04-20 20:57                   ` Luck, Tony
  2020-04-20 21:16                     ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread
From: Luck, Tony @ 2020-04-20 20:57 UTC (permalink / raw)
  To: Linus Torvalds, Williams, Dan J
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tsaur, Erwin,
	Linux Kernel Mailing List, linux-nvdimm

>  (a) is a trap, not an exception - so the instruction has been done,
> and you don't need to try to emulate it or anything to continue.

Maybe for errors on the data side of the pipeline. On the instruction
side we can usually recover from user space instruction fetches by
just throwing away the page with the corrupted instructions and reading
from disk into a new page. Then just point the page table to the new
page, and hey presto, it's all transparently fixed (modulo time lost fixing
things).

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:57                   ` Luck, Tony
@ 2020-04-20 21:16                     ` Linus Torvalds
  0 siblings, 0 replies; 30+ messages in thread
From: Linus Torvalds @ 2020-04-20 21:16 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Williams, Dan J, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, stable, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	Tsaur, Erwin, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:57 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> >  (a) is a trap, not an exception - so the instruction has been done,
> > and you don't need to try to emulate it or anything to continue.
>
> Maybe for errors on the data side of the pipeline. On the instruction
> side we can usually recover from user space instruction fetches by
> just throwing away the page with the corrupted instructions and reading
> from disk into a new page. Then just point the page table to the new
> page, and hey presto, it's all transparently fixed (modulo the time lost
> fixing things).

That's true for things like ECC on real RAM, with traditional executables.

It's not so true of something like nvram that you execute out of
directly. There is not necessarily a disk to re-read things from.

But it's also not true of things like JITs, and they are kind of a big
deal. Asking the JIT to handle "hey, I faulted at a random point, you
need to re-JIT" is no different from all the other "that's a _really_
painful recovery point, please delay it" cases.

Sure, the JIT environment will probably just have to kill that thread
anyway, but I do think this falls under the same "you're better off
giving the _option_ to just continue and hope for the best" rule, rather
than forcing a non-recoverable state.

For regular ECC, I literally would like the machine to just always
continue. I'd like to be informed that there's something bad going on
(because it might be RAM going bad, but it might also be a rowhammer
attack), but the decision to kill things or not should ultimately be
the *user's*, not the JIT's, and not the kernel's.

So the basic rule should be that you should always have the _option_
to just continue. The corrupted state might not be critical - or it
might be the ECC bits themselves, not the data.

There are situations where stopping everything is worse than "let's
continue as best we can, and inform the user with a big red blinking
light".

ECC should not make things less reliable, even if it's another 10+% of
bits that can go wrong.

It should also be noted that even a good ECC pattern _can_ miss
corruption if you're unlucky with the corruption. So the whole
black-and-white model of "ECC means you need to stop everything" is
questionable to begin with, because the signal isn't that absolute in
the first place.

So when somebody brings up a "what if I use corrupted data and make
things worse" scenario, they are making an intellectually dishonest
argument: what if you saw corrupted data and simply never caught it,
because it was an unlucky multi-bit failure?

There is no "absolute" thing about ECC. The only thing that is _never_
wrong is to report it and try to continue, and let some higher-level
entity decide what to do.

And that final decision might literally be "I ran this simulation for
2 days, I see that there's an error report, I will buy a new machine.
For now I'll use the data it generated, but I'll re-run to validate it
later".

                 Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  2020-04-20 17:28     ` Linus Torvalds
  2020-04-20 18:20       ` Dan Williams
@ 2020-10-06  9:57       ` tip-bot2 for Dan Williams
  2020-10-07 11:14         ` Borislav Petkov
  1 sibling, 1 reply; 30+ messages in thread
From: tip-bot2 for Dan Williams @ 2020-10-06  9:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dan Williams, Borislav Petkov, Tony Luck, Michael Ellerman,
	stable, x86, LKML

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     ec6347bb43395cb92126788a1a5b25302543f815
Gitweb:        https://git.kernel.org/tip/ec6347bb43395cb92126788a1a5b25302543f815
Author:        Dan Williams <dan.j.williams@intel.com>
AuthorDate:    Mon, 05 Oct 2020 20:40:16 -07:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Tue, 06 Oct 2020 11:18:04 +02:00

x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()

In reaction to a proposal to introduce a memcpy_mcsafe_fast()
implementation, Linus points out that memcpy_mcsafe() is poorly named
relative to communicating the scope of the interface: specifically,
which addresses are valid to pass as source and destination, and which
faults / exceptions are handled.

Of particular concern is that even though x86 might be able to handle
the semantics of copy_mc_to_user() with its common copy_user_generic()
implementation, other archs likely need / want an explicit path for this
case:

  On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
  >
  > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
  > >
  > > However now I see that copy_user_generic() works for the wrong reason.
  > > It works because the exception on the source address due to poison
  > > looks no different than a write fault on the user address to the
  > > caller, it's still just a short copy. So it makes copy_to_user() work
  > > for the wrong reason relative to the name.
  >
  > Right.
  >
  > And it won't work that way on other architectures. On x86, we have a
  > generic function that can take faults on either side, and we use it
  > for both cases (and for the "in_user" case too), but that's an
  > artifact of the architecture oddity.
  >
  > In fact, it's probably wrong even on x86 - because it can hide bugs -
  > but writing those things is painful enough that everybody prefers
  > having just one function.

Replace a single top-level memcpy_mcsafe() with either
copy_mc_to_user(), or copy_mc_to_kernel().
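
For example (an illustrative sketch only, with a made-up helper name;
the convention, visible in the pmem changes below, is that
copy_mc_to_kernel() returns the number of bytes *not* copied, so any
non-zero return means poison was encountered on the source side):

	static int example_read_pmem(void *dst, const void *pmem_src, size_t len)
	{
		unsigned long rem = copy_mc_to_kernel(dst, pmem_src, len);

		return rem ? -EIO : 0;	/* rem trailing bytes were not copied */
	}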

Introduce an x86 copy_mc_fragile() name as the rename for the
low-level x86 implementation formerly named memcpy_mcsafe(). It is used
as the slow / careful backend that is supplanted by a fast
copy_mc_generic() in a follow-on patch.

One side-effect of this reorganization is that separating copy_mc_64.S
to its own file means that perf no longer needs to track dependencies
for its memcpy_64.S benchmarks.

 [ bp: Massage a bit. ]

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
---
 arch/powerpc/Kconfig                                         |   2 +-
 arch/powerpc/include/asm/string.h                            |   2 +-
 arch/powerpc/include/asm/uaccess.h                           |  40 +-
 arch/powerpc/lib/Makefile                                    |   2 +-
 arch/powerpc/lib/copy_mc_64.S                                | 242 +++++++-
 arch/powerpc/lib/memcpy_mcsafe_64.S                          | 242 +-------
 arch/x86/Kconfig                                             |   2 +-
 arch/x86/Kconfig.debug                                       |   2 +-
 arch/x86/include/asm/copy_mc_test.h                          |  75 ++-
 arch/x86/include/asm/mce.h                                   |   9 +-
 arch/x86/include/asm/mcsafe_test.h                           |  75 +--
 arch/x86/include/asm/string_64.h                             |  32 +-
 arch/x86/include/asm/uaccess.h                               |   9 +-
 arch/x86/include/asm/uaccess_64.h                            |  20 +-
 arch/x86/kernel/cpu/mce/core.c                               |   8 +-
 arch/x86/kernel/quirks.c                                     |  10 +-
 arch/x86/lib/Makefile                                        |   1 +-
 arch/x86/lib/copy_mc.c                                       |  82 ++-
 arch/x86/lib/copy_mc_64.S                                    | 127 ++++-
 arch/x86/lib/memcpy_64.S                                     | 115 +---
 arch/x86/lib/usercopy_64.c                                   |  21 +-
 drivers/md/dm-writecache.c                                   |  15 +-
 drivers/nvdimm/claim.c                                       |   2 +-
 drivers/nvdimm/pmem.c                                        |   6 +-
 include/linux/string.h                                       |   9 +-
 include/linux/uaccess.h                                      |  13 +-
 include/linux/uio.h                                          |  10 +-
 lib/Kconfig                                                  |   7 +-
 lib/iov_iter.c                                               |  48 +-
 tools/arch/x86/include/asm/mcsafe_test.h                     |  13 +-
 tools/arch/x86/lib/memcpy_64.S                               | 115 +---
 tools/objtool/check.c                                        |   4 +-
 tools/perf/bench/Build                                       |   1 +-
 tools/perf/bench/mem-memcpy-x86-64-lib.c                     |  24 +-
 tools/testing/nvdimm/test/nfit.c                             |  49 +-
 tools/testing/selftests/powerpc/copyloops/.gitignore         |   2 +-
 tools/testing/selftests/powerpc/copyloops/Makefile           |   6 +-
 tools/testing/selftests/powerpc/copyloops/copy_mc_64.S       |   1 +-
 tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S |   1 +-
 39 files changed, 673 insertions(+), 771 deletions(-)
 create mode 100644 arch/powerpc/lib/copy_mc_64.S
 delete mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S
 create mode 100644 arch/x86/include/asm/copy_mc_test.h
 delete mode 100644 arch/x86/include/asm/mcsafe_test.h
 create mode 100644 arch/x86/lib/copy_mc.c
 create mode 100644 arch/x86/lib/copy_mc_64.S
 delete mode 100644 tools/arch/x86/include/asm/mcsafe_test.h
 delete mode 100644 tools/perf/bench/mem-memcpy-x86-64-lib.c
 create mode 120000 tools/testing/selftests/powerpc/copyloops/copy_mc_64.S
 delete mode 120000 tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbf..76473ed 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -136,7 +136,7 @@ config PPC
 	select ARCH_HAS_STRICT_KERNEL_RWX	if (PPC32 && !HIBERNATION)
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UACCESS_FLUSHCACHE
-	select ARCH_HAS_UACCESS_MCSAFE		if PPC64
+	select ARCH_HAS_COPY_MC			if PPC64
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_KEEP_MEMBLOCK
diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
index 283552c..2aa0e31 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -53,9 +53,7 @@ void *__memmove(void *to, const void *from, __kernel_size_t n);
 #ifndef CONFIG_KASAN
 #define __HAVE_ARCH_MEMSET32
 #define __HAVE_ARCH_MEMSET64
-#define __HAVE_ARCH_MEMCPY_MCSAFE
 
-extern int memcpy_mcsafe(void *dst, const void *src, __kernel_size_t sz);
 extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t);
 extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
 extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 0069990..20a3537 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -435,6 +435,32 @@ do {								\
 extern unsigned long __copy_tofrom_user(void __user *to,
 		const void __user *from, unsigned long size);
 
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+unsigned long __must_check
+copy_mc_generic(void *to, const void *from, unsigned long size);
+
+static inline unsigned long __must_check
+copy_mc_to_kernel(void *to, const void *from, unsigned long size)
+{
+	return copy_mc_generic(to, from, size);
+}
+#define copy_mc_to_kernel copy_mc_to_kernel
+
+static inline unsigned long __must_check
+copy_mc_to_user(void __user *to, const void *from, unsigned long n)
+{
+	if (likely(check_copy_size(from, n, true))) {
+		if (access_ok(to, n)) {
+			allow_write_to_user(to, n);
+			n = copy_mc_generic((void *)to, from, n);
+			prevent_write_to_user(to, n);
+		}
+	}
+
+	return n;
+}
+#endif
+
 #ifdef __powerpc64__
 static inline unsigned long
 raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)
@@ -523,20 +549,6 @@ raw_copy_to_user(void __user *to, const void *from, unsigned long n)
 	return ret;
 }
 
-static __always_inline unsigned long __must_check
-copy_to_user_mcsafe(void __user *to, const void *from, unsigned long n)
-{
-	if (likely(check_copy_size(from, n, true))) {
-		if (access_ok(to, n)) {
-			allow_write_to_user(to, n);
-			n = memcpy_mcsafe((void *)to, from, n);
-			prevent_write_to_user(to, n);
-		}
-	}
-
-	return n;
-}
-
 unsigned long __arch_clear_user(void __user *addr, unsigned long size);
 
 static inline unsigned long clear_user(void __user *addr, unsigned long size)
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index d66a645..69a91b5 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -39,7 +39,7 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o copypage_power7.o \
 			       memcpy_power7.o
 
 obj64-y	+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
-	   memcpy_64.o memcpy_mcsafe_64.o
+	   memcpy_64.o copy_mc_64.o
 
 ifndef CONFIG_PPC_QUEUED_SPINLOCKS
 obj64-$(CONFIG_SMP)	+= locks.o
diff --git a/arch/powerpc/lib/copy_mc_64.S b/arch/powerpc/lib/copy_mc_64.S
new file mode 100644
index 0000000..88d46c4
--- /dev/null
+++ b/arch/powerpc/lib/copy_mc_64.S
@@ -0,0 +1,242 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) IBM Corporation, 2011
+ * Derived from copyuser_power7.s by Anton Blanchard <anton@au.ibm.com>
+ * Author - Balbir Singh <bsingharora@gmail.com>
+ */
+#include <asm/ppc_asm.h>
+#include <asm/errno.h>
+#include <asm/export.h>
+
+	.macro err1
+100:
+	EX_TABLE(100b,.Ldo_err1)
+	.endm
+
+	.macro err2
+200:
+	EX_TABLE(200b,.Ldo_err2)
+	.endm
+
+	.macro err3
+300:	EX_TABLE(300b,.Ldone)
+	.endm
+
+.Ldo_err2:
+	ld	r22,STK_REG(R22)(r1)
+	ld	r21,STK_REG(R21)(r1)
+	ld	r20,STK_REG(R20)(r1)
+	ld	r19,STK_REG(R19)(r1)
+	ld	r18,STK_REG(R18)(r1)
+	ld	r17,STK_REG(R17)(r1)
+	ld	r16,STK_REG(R16)(r1)
+	ld	r15,STK_REG(R15)(r1)
+	ld	r14,STK_REG(R14)(r1)
+	addi	r1,r1,STACKFRAMESIZE
+.Ldo_err1:
+	/* Do a byte by byte copy to get the exact remaining size */
+	mtctr	r7
+46:
+err3;	lbz	r0,0(r4)
+	addi	r4,r4,1
+err3;	stb	r0,0(r3)
+	addi	r3,r3,1
+	bdnz	46b
+	li	r3,0
+	blr
+
+.Ldone:
+	mfctr	r3
+	blr
+
+
+_GLOBAL(copy_mc_generic)
+	mr	r7,r5
+	cmpldi	r5,16
+	blt	.Lshort_copy
+
+.Lcopy:
+	/* Get the source 8B aligned */
+	neg	r6,r4
+	mtocrf	0x01,r6
+	clrldi	r6,r6,(64-3)
+
+	bf	cr7*4+3,1f
+err1;	lbz	r0,0(r4)
+	addi	r4,r4,1
+err1;	stb	r0,0(r3)
+	addi	r3,r3,1
+	subi	r7,r7,1
+
+1:	bf	cr7*4+2,2f
+err1;	lhz	r0,0(r4)
+	addi	r4,r4,2
+err1;	sth	r0,0(r3)
+	addi	r3,r3,2
+	subi	r7,r7,2
+
+2:	bf	cr7*4+1,3f
+err1;	lwz	r0,0(r4)
+	addi	r4,r4,4
+err1;	stw	r0,0(r3)
+	addi	r3,r3,4
+	subi	r7,r7,4
+
+3:	sub	r5,r5,r6
+	cmpldi	r5,128
+
+	mflr	r0
+	stdu	r1,-STACKFRAMESIZE(r1)
+	std	r14,STK_REG(R14)(r1)
+	std	r15,STK_REG(R15)(r1)
+	std	r16,STK_REG(R16)(r1)
+	std	r17,STK_REG(R17)(r1)
+	std	r18,STK_REG(R18)(r1)
+	std	r19,STK_REG(R19)(r1)
+	std	r20,STK_REG(R20)(r1)
+	std	r21,STK_REG(R21)(r1)
+	std	r22,STK_REG(R22)(r1)
+	std	r0,STACKFRAMESIZE+16(r1)
+
+	blt	5f
+	srdi	r6,r5,7
+	mtctr	r6
+
+	/* Now do cacheline (128B) sized loads and stores. */
+	.align	5
+4:
+err2;	ld	r0,0(r4)
+err2;	ld	r6,8(r4)
+err2;	ld	r8,16(r4)
+err2;	ld	r9,24(r4)
+err2;	ld	r10,32(r4)
+err2;	ld	r11,40(r4)
+err2;	ld	r12,48(r4)
+err2;	ld	r14,56(r4)
+err2;	ld	r15,64(r4)
+err2;	ld	r16,72(r4)
+err2;	ld	r17,80(r4)
+err2;	ld	r18,88(r4)
+err2;	ld	r19,96(r4)
+err2;	ld	r20,104(r4)
+err2;	ld	r21,112(r4)
+err2;	ld	r22,120(r4)
+	addi	r4,r4,128
+err2;	std	r0,0(r3)
+err2;	std	r6,8(r3)
+err2;	std	r8,16(r3)
+err2;	std	r9,24(r3)
+err2;	std	r10,32(r3)
+err2;	std	r11,40(r3)
+err2;	std	r12,48(r3)
+err2;	std	r14,56(r3)
+err2;	std	r15,64(r3)
+err2;	std	r16,72(r3)
+err2;	std	r17,80(r3)
+err2;	std	r18,88(r3)
+err2;	std	r19,96(r3)
+err2;	std	r20,104(r3)
+err2;	std	r21,112(r3)
+err2;	std	r22,120(r3)
+	addi	r3,r3,128
+	subi	r7,r7,128
+	bdnz	4b
+
+	clrldi	r5,r5,(64-7)
+
+	/* Up to 127B to go */
+5:	srdi	r6,r5,4
+	mtocrf	0x01,r6
+
+6:	bf	cr7*4+1,7f
+err2;	ld	r0,0(r4)
+err2;	ld	r6,8(r4)
+err2;	ld	r8,16(r4)
+err2;	ld	r9,24(r4)
+err2;	ld	r10,32(r4)
+err2;	ld	r11,40(r4)
+err2;	ld	r12,48(r4)
+err2;	ld	r14,56(r4)
+	addi	r4,r4,64
+err2;	std	r0,0(r3)
+err2;	std	r6,8(r3)
+err2;	std	r8,16(r3)
+err2;	std	r9,24(r3)
+err2;	std	r10,32(r3)
+err2;	std	r11,40(r3)
+err2;	std	r12,48(r3)
+err2;	std	r14,56(r3)
+	addi	r3,r3,64
+	subi	r7,r7,64
+
+7:	ld	r14,STK_REG(R14)(r1)
+	ld	r15,STK_REG(R15)(r1)
+	ld	r16,STK_REG(R16)(r1)
+	ld	r17,STK_REG(R17)(r1)
+	ld	r18,STK_REG(R18)(r1)
+	ld	r19,STK_REG(R19)(r1)
+	ld	r20,STK_REG(R20)(r1)
+	ld	r21,STK_REG(R21)(r1)
+	ld	r22,STK_REG(R22)(r1)
+	addi	r1,r1,STACKFRAMESIZE
+
+	/* Up to 63B to go */
+	bf	cr7*4+2,8f
+err1;	ld	r0,0(r4)
+err1;	ld	r6,8(r4)
+err1;	ld	r8,16(r4)
+err1;	ld	r9,24(r4)
+	addi	r4,r4,32
+err1;	std	r0,0(r3)
+err1;	std	r6,8(r3)
+err1;	std	r8,16(r3)
+err1;	std	r9,24(r3)
+	addi	r3,r3,32
+	subi	r7,r7,32
+
+	/* Up to 31B to go */
+8:	bf	cr7*4+3,9f
+err1;	ld	r0,0(r4)
+err1;	ld	r6,8(r4)
+	addi	r4,r4,16
+err1;	std	r0,0(r3)
+err1;	std	r6,8(r3)
+	addi	r3,r3,16
+	subi	r7,r7,16
+
+9:	clrldi	r5,r5,(64-4)
+
+	/* Up to 15B to go */
+.Lshort_copy:
+	mtocrf	0x01,r5
+	bf	cr7*4+0,12f
+err1;	lwz	r0,0(r4)	/* Less chance of a reject with word ops */
+err1;	lwz	r6,4(r4)
+	addi	r4,r4,8
+err1;	stw	r0,0(r3)
+err1;	stw	r6,4(r3)
+	addi	r3,r3,8
+	subi	r7,r7,8
+
+12:	bf	cr7*4+1,13f
+err1;	lwz	r0,0(r4)
+	addi	r4,r4,4
+err1;	stw	r0,0(r3)
+	addi	r3,r3,4
+	subi	r7,r7,4
+
+13:	bf	cr7*4+2,14f
+err1;	lhz	r0,0(r4)
+	addi	r4,r4,2
+err1;	sth	r0,0(r3)
+	addi	r3,r3,2
+	subi	r7,r7,2
+
+14:	bf	cr7*4+3,15f
+err1;	lbz	r0,0(r4)
+err1;	stb	r0,0(r3)
+
+15:	li	r3,0
+	blr
+
+EXPORT_SYMBOL_GPL(copy_mc_generic);
diff --git a/arch/powerpc/lib/memcpy_mcsafe_64.S b/arch/powerpc/lib/memcpy_mcsafe_64.S
deleted file mode 100644
index cb882d9..0000000
--- a/arch/powerpc/lib/memcpy_mcsafe_64.S
+++ /dev/null
@@ -1,242 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Copyright (C) IBM Corporation, 2011
- * Derived from copyuser_power7.s by Anton Blanchard <anton@au.ibm.com>
- * Author - Balbir Singh <bsingharora@gmail.com>
- */
-#include <asm/ppc_asm.h>
-#include <asm/errno.h>
-#include <asm/export.h>
-
-	.macro err1
-100:
-	EX_TABLE(100b,.Ldo_err1)
-	.endm
-
-	.macro err2
-200:
-	EX_TABLE(200b,.Ldo_err2)
-	.endm
-
-	.macro err3
-300:	EX_TABLE(300b,.Ldone)
-	.endm
-
-.Ldo_err2:
-	ld	r22,STK_REG(R22)(r1)
-	ld	r21,STK_REG(R21)(r1)
-	ld	r20,STK_REG(R20)(r1)
-	ld	r19,STK_REG(R19)(r1)
-	ld	r18,STK_REG(R18)(r1)
-	ld	r17,STK_REG(R17)(r1)
-	ld	r16,STK_REG(R16)(r1)
-	ld	r15,STK_REG(R15)(r1)
-	ld	r14,STK_REG(R14)(r1)
-	addi	r1,r1,STACKFRAMESIZE
-.Ldo_err1:
-	/* Do a byte by byte copy to get the exact remaining size */
-	mtctr	r7
-46:
-err3;	lbz	r0,0(r4)
-	addi	r4,r4,1
-err3;	stb	r0,0(r3)
-	addi	r3,r3,1
-	bdnz	46b
-	li	r3,0
-	blr
-
-.Ldone:
-	mfctr	r3
-	blr
-
-
-_GLOBAL(memcpy_mcsafe)
-	mr	r7,r5
-	cmpldi	r5,16
-	blt	.Lshort_copy
-
-.Lcopy:
-	/* Get the source 8B aligned */
-	neg	r6,r4
-	mtocrf	0x01,r6
-	clrldi	r6,r6,(64-3)
-
-	bf	cr7*4+3,1f
-err1;	lbz	r0,0(r4)
-	addi	r4,r4,1
-err1;	stb	r0,0(r3)
-	addi	r3,r3,1
-	subi	r7,r7,1
-
-1:	bf	cr7*4+2,2f
-err1;	lhz	r0,0(r4)
-	addi	r4,r4,2
-err1;	sth	r0,0(r3)
-	addi	r3,r3,2
-	subi	r7,r7,2
-
-2:	bf	cr7*4+1,3f
-err1;	lwz	r0,0(r4)
-	addi	r4,r4,4
-err1;	stw	r0,0(r3)
-	addi	r3,r3,4
-	subi	r7,r7,4
-
-3:	sub	r5,r5,r6
-	cmpldi	r5,128
-
-	mflr	r0
-	stdu	r1,-STACKFRAMESIZE(r1)
-	std	r14,STK_REG(R14)(r1)
-	std	r15,STK_REG(R15)(r1)
-	std	r16,STK_REG(R16)(r1)
-	std	r17,STK_REG(R17)(r1)
-	std	r18,STK_REG(R18)(r1)
-	std	r19,STK_REG(R19)(r1)
-	std	r20,STK_REG(R20)(r1)
-	std	r21,STK_REG(R21)(r1)
-	std	r22,STK_REG(R22)(r1)
-	std	r0,STACKFRAMESIZE+16(r1)
-
-	blt	5f
-	srdi	r6,r5,7
-	mtctr	r6
-
-	/* Now do cacheline (128B) sized loads and stores. */
-	.align	5
-4:
-err2;	ld	r0,0(r4)
-err2;	ld	r6,8(r4)
-err2;	ld	r8,16(r4)
-err2;	ld	r9,24(r4)
-err2;	ld	r10,32(r4)
-err2;	ld	r11,40(r4)
-err2;	ld	r12,48(r4)
-err2;	ld	r14,56(r4)
-err2;	ld	r15,64(r4)
-err2;	ld	r16,72(r4)
-err2;	ld	r17,80(r4)
-err2;	ld	r18,88(r4)
-err2;	ld	r19,96(r4)
-err2;	ld	r20,104(r4)
-err2;	ld	r21,112(r4)
-err2;	ld	r22,120(r4)
-	addi	r4,r4,128
-err2;	std	r0,0(r3)
-err2;	std	r6,8(r3)
-err2;	std	r8,16(r3)
-err2;	std	r9,24(r3)
-err2;	std	r10,32(r3)
-err2;	std	r11,40(r3)
-err2;	std	r12,48(r3)
-err2;	std	r14,56(r3)
-err2;	std	r15,64(r3)
-err2;	std	r16,72(r3)
-err2;	std	r17,80(r3)
-err2;	std	r18,88(r3)
-err2;	std	r19,96(r3)
-err2;	std	r20,104(r3)
-err2;	std	r21,112(r3)
-err2;	std	r22,120(r3)
-	addi	r3,r3,128
-	subi	r7,r7,128
-	bdnz	4b
-
-	clrldi	r5,r5,(64-7)
-
-	/* Up to 127B to go */
-5:	srdi	r6,r5,4
-	mtocrf	0x01,r6
-
-6:	bf	cr7*4+1,7f
-err2;	ld	r0,0(r4)
-err2;	ld	r6,8(r4)
-err2;	ld	r8,16(r4)
-err2;	ld	r9,24(r4)
-err2;	ld	r10,32(r4)
-err2;	ld	r11,40(r4)
-err2;	ld	r12,48(r4)
-err2;	ld	r14,56(r4)
-	addi	r4,r4,64
-err2;	std	r0,0(r3)
-err2;	std	r6,8(r3)
-err2;	std	r8,16(r3)
-err2;	std	r9,24(r3)
-err2;	std	r10,32(r3)
-err2;	std	r11,40(r3)
-err2;	std	r12,48(r3)
-err2;	std	r14,56(r3)
-	addi	r3,r3,64
-	subi	r7,r7,64
-
-7:	ld	r14,STK_REG(R14)(r1)
-	ld	r15,STK_REG(R15)(r1)
-	ld	r16,STK_REG(R16)(r1)
-	ld	r17,STK_REG(R17)(r1)
-	ld	r18,STK_REG(R18)(r1)
-	ld	r19,STK_REG(R19)(r1)
-	ld	r20,STK_REG(R20)(r1)
-	ld	r21,STK_REG(R21)(r1)
-	ld	r22,STK_REG(R22)(r1)
-	addi	r1,r1,STACKFRAMESIZE
-
-	/* Up to 63B to go */
-	bf	cr7*4+2,8f
-err1;	ld	r0,0(r4)
-err1;	ld	r6,8(r4)
-err1;	ld	r8,16(r4)
-err1;	ld	r9,24(r4)
-	addi	r4,r4,32
-err1;	std	r0,0(r3)
-err1;	std	r6,8(r3)
-err1;	std	r8,16(r3)
-err1;	std	r9,24(r3)
-	addi	r3,r3,32
-	subi	r7,r7,32
-
-	/* Up to 31B to go */
-8:	bf	cr7*4+3,9f
-err1;	ld	r0,0(r4)
-err1;	ld	r6,8(r4)
-	addi	r4,r4,16
-err1;	std	r0,0(r3)
-err1;	std	r6,8(r3)
-	addi	r3,r3,16
-	subi	r7,r7,16
-
-9:	clrldi	r5,r5,(64-4)
-
-	/* Up to 15B to go */
-.Lshort_copy:
-	mtocrf	0x01,r5
-	bf	cr7*4+0,12f
-err1;	lwz	r0,0(r4)	/* Less chance of a reject with word ops */
-err1;	lwz	r6,4(r4)
-	addi	r4,r4,8
-err1;	stw	r0,0(r3)
-err1;	stw	r6,4(r3)
-	addi	r3,r3,8
-	subi	r7,r7,8
-
-12:	bf	cr7*4+1,13f
-err1;	lwz	r0,0(r4)
-	addi	r4,r4,4
-err1;	stw	r0,0(r3)
-	addi	r3,r3,4
-	subi	r7,r7,4
-
-13:	bf	cr7*4+2,14f
-err1;	lhz	r0,0(r4)
-	addi	r4,r4,2
-err1;	sth	r0,0(r3)
-	addi	r3,r3,2
-	subi	r7,r7,2
-
-14:	bf	cr7*4+3,15f
-err1;	lbz	r0,0(r4)
-err1;	stb	r0,0(r3)
-
-15:	li	r3,0
-	blr
-
-EXPORT_SYMBOL_GPL(memcpy_mcsafe);
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac6..e876b3a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -75,7 +75,7 @@ config X86
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_UACCESS_FLUSHCACHE	if X86_64
-	select ARCH_HAS_UACCESS_MCSAFE		if X86_64 && X86_MCE
+	select ARCH_HAS_COPY_MC			if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index ee1d3c5..27b5e2b 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -62,7 +62,7 @@ config EARLY_PRINTK_USB_XDBC
 	  You should normally say N here, unless you want to debug early
 	  crashes or need a very simple printk logging facility.
 
-config MCSAFE_TEST
+config COPY_MC_TEST
 	def_bool n
 
 config EFI_PGT_DUMP
diff --git a/arch/x86/include/asm/copy_mc_test.h b/arch/x86/include/asm/copy_mc_test.h
new file mode 100644
index 0000000..e4991ba
--- /dev/null
+++ b/arch/x86/include/asm/copy_mc_test.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _COPY_MC_TEST_H_
+#define _COPY_MC_TEST_H_
+
+#ifndef __ASSEMBLY__
+#ifdef CONFIG_COPY_MC_TEST
+extern unsigned long copy_mc_test_src;
+extern unsigned long copy_mc_test_dst;
+
+static inline void copy_mc_inject_src(void *addr)
+{
+	if (addr)
+		copy_mc_test_src = (unsigned long) addr;
+	else
+		copy_mc_test_src = ~0UL;
+}
+
+static inline void copy_mc_inject_dst(void *addr)
+{
+	if (addr)
+		copy_mc_test_dst = (unsigned long) addr;
+	else
+		copy_mc_test_dst = ~0UL;
+}
+#else /* CONFIG_COPY_MC_TEST */
+static inline void copy_mc_inject_src(void *addr)
+{
+}
+
+static inline void copy_mc_inject_dst(void *addr)
+{
+}
+#endif /* CONFIG_COPY_MC_TEST */
+
+#else /* __ASSEMBLY__ */
+#include <asm/export.h>
+
+#ifdef CONFIG_COPY_MC_TEST
+.macro COPY_MC_TEST_CTL
+	.pushsection .data
+	.align 8
+	.globl copy_mc_test_src
+	copy_mc_test_src:
+		.quad 0
+	EXPORT_SYMBOL_GPL(copy_mc_test_src)
+	.globl copy_mc_test_dst
+	copy_mc_test_dst:
+		.quad 0
+	EXPORT_SYMBOL_GPL(copy_mc_test_dst)
+	.popsection
+.endm
+
+.macro COPY_MC_TEST_SRC reg count target
+	leaq \count(\reg), %r9
+	cmp copy_mc_test_src, %r9
+	ja \target
+.endm
+
+.macro COPY_MC_TEST_DST reg count target
+	leaq \count(\reg), %r9
+	cmp copy_mc_test_dst, %r9
+	ja \target
+.endm
+#else
+.macro COPY_MC_TEST_CTL
+.endm
+
+.macro COPY_MC_TEST_SRC reg count target
+.endm
+
+.macro COPY_MC_TEST_DST reg count target
+.endm
+#endif /* CONFIG_COPY_MC_TEST */
+#endif /* __ASSEMBLY__ */
+#endif /* _COPY_MC_TEST_H_ */
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 109af5c..ba2062d 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -174,6 +174,15 @@ extern void mce_unregister_decode_chain(struct notifier_block *nb);
 
 extern int mce_p5_enabled;
 
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+extern void enable_copy_mc_fragile(void);
+unsigned long __must_check copy_mc_fragile(void *dst, const void *src, unsigned cnt);
+#else
+static inline void enable_copy_mc_fragile(void)
+{
+}
+#endif
+
 #ifdef CONFIG_X86_MCE
 int mcheck_init(void);
 void mcheck_cpu_init(struct cpuinfo_x86 *c);
diff --git a/arch/x86/include/asm/mcsafe_test.h b/arch/x86/include/asm/mcsafe_test.h
deleted file mode 100644
index eb59804..0000000
--- a/arch/x86/include/asm/mcsafe_test.h
+++ /dev/null
@@ -1,75 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _MCSAFE_TEST_H_
-#define _MCSAFE_TEST_H_
-
-#ifndef __ASSEMBLY__
-#ifdef CONFIG_MCSAFE_TEST
-extern unsigned long mcsafe_test_src;
-extern unsigned long mcsafe_test_dst;
-
-static inline void mcsafe_inject_src(void *addr)
-{
-	if (addr)
-		mcsafe_test_src = (unsigned long) addr;
-	else
-		mcsafe_test_src = ~0UL;
-}
-
-static inline void mcsafe_inject_dst(void *addr)
-{
-	if (addr)
-		mcsafe_test_dst = (unsigned long) addr;
-	else
-		mcsafe_test_dst = ~0UL;
-}
-#else /* CONFIG_MCSAFE_TEST */
-static inline void mcsafe_inject_src(void *addr)
-{
-}
-
-static inline void mcsafe_inject_dst(void *addr)
-{
-}
-#endif /* CONFIG_MCSAFE_TEST */
-
-#else /* __ASSEMBLY__ */
-#include <asm/export.h>
-
-#ifdef CONFIG_MCSAFE_TEST
-.macro MCSAFE_TEST_CTL
-	.pushsection .data
-	.align 8
-	.globl mcsafe_test_src
-	mcsafe_test_src:
-		.quad 0
-	EXPORT_SYMBOL_GPL(mcsafe_test_src)
-	.globl mcsafe_test_dst
-	mcsafe_test_dst:
-		.quad 0
-	EXPORT_SYMBOL_GPL(mcsafe_test_dst)
-	.popsection
-.endm
-
-.macro MCSAFE_TEST_SRC reg count target
-	leaq \count(\reg), %r9
-	cmp mcsafe_test_src, %r9
-	ja \target
-.endm
-
-.macro MCSAFE_TEST_DST reg count target
-	leaq \count(\reg), %r9
-	cmp mcsafe_test_dst, %r9
-	ja \target
-.endm
-#else
-.macro MCSAFE_TEST_CTL
-.endm
-
-.macro MCSAFE_TEST_SRC reg count target
-.endm
-
-.macro MCSAFE_TEST_DST reg count target
-.endm
-#endif /* CONFIG_MCSAFE_TEST */
-#endif /* __ASSEMBLY__ */
-#endif /* _MCSAFE_TEST_H_ */
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3..6e45082 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -82,38 +82,6 @@ int strcmp(const char *cs, const char *ct);
 
 #endif
 
-#define __HAVE_ARCH_MEMCPY_MCSAFE 1
-__must_check unsigned long __memcpy_mcsafe(void *dst, const void *src,
-		size_t cnt);
-DECLARE_STATIC_KEY_FALSE(mcsafe_key);
-
-/**
- * memcpy_mcsafe - copy memory with indication if a machine check happened
- *
- * @dst:	destination address
- * @src:	source address
- * @cnt:	number of bytes to copy
- *
- * Low level memory copy function that catches machine checks
- * We only call into the "safe" function on systems that can
- * actually do machine check recovery. Everyone else can just
- * use memcpy().
- *
- * Return 0 for success, or number of bytes not copied if there was an
- * exception.
- */
-static __always_inline __must_check unsigned long
-memcpy_mcsafe(void *dst, const void *src, size_t cnt)
-{
-#ifdef CONFIG_X86_MCE
-	if (static_branch_unlikely(&mcsafe_key))
-		return __memcpy_mcsafe(dst, src, cnt);
-	else
-#endif
-		memcpy(dst, src, cnt);
-	return 0;
-}
-
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
 void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index ecefaff..eff7fb8 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -455,6 +455,15 @@ extern __must_check long strnlen_user(const char __user *str, long n);
 unsigned long __must_check clear_user(void __user *mem, unsigned long len);
 unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
 
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+unsigned long __must_check
+copy_mc_to_kernel(void *to, const void *from, unsigned len);
+#define copy_mc_to_kernel copy_mc_to_kernel
+
+unsigned long __must_check
+copy_mc_to_user(void *to, const void *from, unsigned len);
+#endif
+
 /*
  * movsl can be slow when source and dest are not both 8-byte aligned
  */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index bc10e3d..e7265a5 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -47,22 +47,6 @@ copy_user_generic(void *to, const void *from, unsigned len)
 }
 
 static __always_inline __must_check unsigned long
-copy_to_user_mcsafe(void *to, const void *from, unsigned len)
-{
-	unsigned long ret;
-
-	__uaccess_begin();
-	/*
-	 * Note, __memcpy_mcsafe() is explicitly used since it can
-	 * handle exceptions / faults.  memcpy_mcsafe() may fall back to
-	 * memcpy() which lacks this handling.
-	 */
-	ret = __memcpy_mcsafe(to, from, len);
-	__uaccess_end();
-	return ret;
-}
-
-static __always_inline __must_check unsigned long
 raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
 {
 	return copy_user_generic(dst, (__force void *)src, size);
@@ -102,8 +86,4 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
-
-unsigned long
-mcsafe_handle_tail(char *to, char *from, unsigned len);
-
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 11b6697..b5b70f4 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -40,7 +40,6 @@
 #include <linux/debugfs.h>
 #include <linux/irq_work.h>
 #include <linux/export.h>
-#include <linux/jump_label.h>
 #include <linux/set_memory.h>
 #include <linux/sync_core.h>
 #include <linux/task_work.h>
@@ -2124,7 +2123,7 @@ void mce_disable_bank(int bank)
 	and older.
  * mce=nobootlog Don't log MCEs from before booting.
  * mce=bios_cmci_threshold Don't program the CMCI threshold
- * mce=recovery force enable memcpy_mcsafe()
+ * mce=recovery force enable copy_mc_fragile()
  */
 static int __init mcheck_enable(char *str)
 {
@@ -2732,13 +2731,10 @@ static void __init mcheck_debugfs_init(void)
 static void __init mcheck_debugfs_init(void) { }
 #endif
 
-DEFINE_STATIC_KEY_FALSE(mcsafe_key);
-EXPORT_SYMBOL_GPL(mcsafe_key);
-
 static int __init mcheck_late_init(void)
 {
 	if (mca_cfg.recovery)
-		static_branch_inc(&mcsafe_key);
+		enable_copy_mc_fragile();
 
 	mcheck_debugfs_init();
 
diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 1b10717..6d0df6a 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -8,6 +8,7 @@
 
 #include <asm/hpet.h>
 #include <asm/setup.h>
+#include <asm/mce.h>
 
 #if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)
 
@@ -624,10 +625,6 @@ static void amd_disable_seq_and_redirect_scrub(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_NB_F3,
 			amd_disable_seq_and_redirect_scrub);
 
-#if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE)
-#include <linux/jump_label.h>
-#include <asm/string_64.h>
-
 /* Ivy Bridge, Haswell, Broadwell */
 static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev)
 {
@@ -636,7 +633,7 @@ static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev)
 	pci_read_config_dword(pdev, 0x84, &capid0);
 
 	if (capid0 & 0x10)
-		static_branch_inc(&mcsafe_key);
+		enable_copy_mc_fragile();
 }
 
 /* Skylake */
@@ -653,7 +650,7 @@ static void quirk_intel_purley_xeon_ras_cap(struct pci_dev *pdev)
 	 * enabled, so memory machine check recovery is also enabled.
 	 */
 	if ((capid0 & 0xc0) == 0xc0 || (capid5 & 0x1e0))
-		static_branch_inc(&mcsafe_key);
+		enable_copy_mc_fragile();
 
 }
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0ec3, quirk_intel_brickland_xeon_ras_cap);
@@ -661,7 +658,6 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2fc0, quirk_intel_brickland_xeon_
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6fc0, quirk_intel_brickland_xeon_ras_cap);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2083, quirk_intel_purley_xeon_ras_cap);
 #endif
-#endif
 
 bool x86_apple_machine;
 EXPORT_SYMBOL(x86_apple_machine);
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index d46fff1..f7d23c9 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -44,6 +44,7 @@ obj-$(CONFIG_SMP) += msr-smp.o cache-smp.o
 lib-y := delay.o misc.o cmdline.o cpu.o
 lib-y += usercopy_$(BITS).o usercopy.o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
+lib-$(CONFIG_ARCH_HAS_COPY_MC) += copy_mc.o copy_mc_64.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
 lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
 lib-$(CONFIG_FUNCTION_ERROR_INJECTION)	+= error-inject.o
diff --git a/arch/x86/lib/copy_mc.c b/arch/x86/lib/copy_mc.c
new file mode 100644
index 0000000..2633635
--- /dev/null
+++ b/arch/x86/lib/copy_mc.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-2020 Intel Corporation. All rights reserved. */
+
+#include <linux/jump_label.h>
+#include <linux/uaccess.h>
+#include <linux/export.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mce.h>
+
+#ifdef CONFIG_X86_MCE
+/*
+ * See COPY_MC_TEST for self-test of the copy_mc_fragile()
+ * implementation.
+ */
+static DEFINE_STATIC_KEY_FALSE(copy_mc_fragile_key);
+
+void enable_copy_mc_fragile(void)
+{
+	static_branch_inc(&copy_mc_fragile_key);
+}
+#define copy_mc_fragile_enabled (static_branch_unlikely(&copy_mc_fragile_key))
+
+/*
+ * Similar to copy_user_handle_tail, probe for the write fault point, or
+ * source exception point.
+ */
+__visible notrace unsigned long
+copy_mc_fragile_handle_tail(char *to, char *from, unsigned len)
+{
+	for (; len; --len, to++, from++)
+		if (copy_mc_fragile(to, from, 1))
+			break;
+	return len;
+}
+#else
+/*
+ * No point in doing careful copying, or consulting a static key when
+ * there is no #MC handler in the CONFIG_X86_MCE=n case.
+ */
+void enable_copy_mc_fragile(void)
+{
+}
+#define copy_mc_fragile_enabled (0)
+#endif
+
+/**
+ * copy_mc_to_kernel - memory copy that handles source exceptions
+ *
+ * @dst:	destination address
+ * @src:	source address
+ * @len:	number of bytes to copy
+ *
+ * Call into the 'fragile' version on systems that can actually
+ * do machine check recovery. Everyone else can just
+ * use memcpy().
+ *
+ * Return 0 for success, or number of bytes not copied if there was an
+ * exception.
+ */
+unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigned len)
+{
+	if (copy_mc_fragile_enabled)
+		return copy_mc_fragile(dst, src, len);
+	memcpy(dst, src, len);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(copy_mc_to_kernel);
+
+unsigned long __must_check copy_mc_to_user(void *dst, const void *src, unsigned len)
+{
+	unsigned long ret;
+
+	if (!copy_mc_fragile_enabled)
+		return copy_user_generic(dst, src, len);
+
+	__uaccess_begin();
+	ret = copy_mc_fragile(dst, src, len);
+	__uaccess_end();
+	return ret;
+}
diff --git a/arch/x86/lib/copy_mc_64.S b/arch/x86/lib/copy_mc_64.S
new file mode 100644
index 0000000..c3b613c
--- /dev/null
+++ b/arch/x86/lib/copy_mc_64.S
@@ -0,0 +1,127 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2016-2020 Intel Corporation. All rights reserved. */
+
+#include <linux/linkage.h>
+#include <asm/copy_mc_test.h>
+#include <asm/export.h>
+#include <asm/asm.h>
+
+#ifndef CONFIG_UML
+
+#ifdef CONFIG_X86_MCE
+COPY_MC_TEST_CTL
+
+/*
+ * copy_mc_fragile - copy memory with indication if an exception / fault happened
+ *
+ * The 'fragile' version is opted into by platform quirks and takes
+ * pains to avoid unrecoverable corner cases like 'fast-string'
+ * instruction sequences, and consuming poison across a cacheline
+ * boundary. The non-fragile version is equivalent to memcpy()
+ * regardless of CPU machine-check-recovery capability.
+ */
+SYM_FUNC_START(copy_mc_fragile)
+	cmpl $8, %edx
+	/* Less than 8 bytes? Go to byte copy loop */
+	jb .L_no_whole_words
+
+	/* Check for bad alignment of source */
+	testl $7, %esi
+	/* Already aligned */
+	jz .L_8byte_aligned
+
+	/* Copy one byte at a time until source is 8-byte aligned */
+	movl %esi, %ecx
+	andl $7, %ecx
+	subl $8, %ecx
+	negl %ecx
+	subl %ecx, %edx
+.L_read_leading_bytes:
+	movb (%rsi), %al
+	COPY_MC_TEST_SRC %rsi 1 .E_leading_bytes
+	COPY_MC_TEST_DST %rdi 1 .E_leading_bytes
+.L_write_leading_bytes:
+	movb %al, (%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_read_leading_bytes
+
+.L_8byte_aligned:
+	movl %edx, %ecx
+	andl $7, %edx
+	shrl $3, %ecx
+	jz .L_no_whole_words
+
+.L_read_words:
+	movq (%rsi), %r8
+	COPY_MC_TEST_SRC %rsi 8 .E_read_words
+	COPY_MC_TEST_DST %rdi 8 .E_write_words
+.L_write_words:
+	movq %r8, (%rdi)
+	addq $8, %rsi
+	addq $8, %rdi
+	decl %ecx
+	jnz .L_read_words
+
+	/* Any trailing bytes? */
+.L_no_whole_words:
+	andl %edx, %edx
+	jz .L_done_memcpy_trap
+
+	/* Copy trailing bytes */
+	movl %edx, %ecx
+.L_read_trailing_bytes:
+	movb (%rsi), %al
+	COPY_MC_TEST_SRC %rsi 1 .E_trailing_bytes
+	COPY_MC_TEST_DST %rdi 1 .E_trailing_bytes
+.L_write_trailing_bytes:
+	movb %al, (%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_read_trailing_bytes
+
+	/* Copy successful. Return zero */
+.L_done_memcpy_trap:
+	xorl %eax, %eax
+.L_done:
+	ret
+SYM_FUNC_END(copy_mc_fragile)
+EXPORT_SYMBOL_GPL(copy_mc_fragile)
+
+	.section .fixup, "ax"
+	/*
+	 * Return number of bytes not copied for any failure. Note that
+	 * there is no "tail" handling since the source buffer is 8-byte
+	 * aligned and poison is cacheline aligned.
+	 */
+.E_read_words:
+	shll	$3, %ecx
+.E_leading_bytes:
+	addl	%edx, %ecx
+.E_trailing_bytes:
+	mov	%ecx, %eax
+	jmp	.L_done
+
+	/*
+	 * For write fault handling, given the destination is unaligned,
+	 * we handle faults on multi-byte writes with a byte-by-byte
+	 * copy up to the write-protected page.
+	 */
+.E_write_words:
+	shll	$3, %ecx
+	addl	%edx, %ecx
+	movl	%ecx, %edx
+	jmp copy_mc_fragile_handle_tail
+
+	.previous
+
+	_ASM_EXTABLE_FAULT(.L_read_leading_bytes, .E_leading_bytes)
+	_ASM_EXTABLE_FAULT(.L_read_words, .E_read_words)
+	_ASM_EXTABLE_FAULT(.L_read_trailing_bytes, .E_trailing_bytes)
+	_ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes)
+	_ASM_EXTABLE(.L_write_words, .E_write_words)
+	_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
+#endif /* CONFIG_X86_MCE */
+#endif /* !CONFIG_UML */
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index bbcc05b..037faac 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -4,7 +4,6 @@
 #include <linux/linkage.h>
 #include <asm/errno.h>
 #include <asm/cpufeatures.h>
-#include <asm/mcsafe_test.h>
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
 
@@ -187,117 +186,3 @@ SYM_FUNC_START_LOCAL(memcpy_orig)
 SYM_FUNC_END(memcpy_orig)
 
 .popsection
-
-#ifndef CONFIG_UML
-
-MCSAFE_TEST_CTL
-
-/*
- * __memcpy_mcsafe - memory copy with machine check exception handling
- * Note that we only catch machine checks when reading the source addresses.
- * Writes to target are posted and don't generate machine checks.
- */
-SYM_FUNC_START(__memcpy_mcsafe)
-	cmpl $8, %edx
-	/* Less than 8 bytes? Go to byte copy loop */
-	jb .L_no_whole_words
-
-	/* Check for bad alignment of source */
-	testl $7, %esi
-	/* Already aligned */
-	jz .L_8byte_aligned
-
-	/* Copy one byte at a time until source is 8-byte aligned */
-	movl %esi, %ecx
-	andl $7, %ecx
-	subl $8, %ecx
-	negl %ecx
-	subl %ecx, %edx
-.L_read_leading_bytes:
-	movb (%rsi), %al
-	MCSAFE_TEST_SRC %rsi 1 .E_leading_bytes
-	MCSAFE_TEST_DST %rdi 1 .E_leading_bytes
-.L_write_leading_bytes:
-	movb %al, (%rdi)
-	incq %rsi
-	incq %rdi
-	decl %ecx
-	jnz .L_read_leading_bytes
-
-.L_8byte_aligned:
-	movl %edx, %ecx
-	andl $7, %edx
-	shrl $3, %ecx
-	jz .L_no_whole_words
-
-.L_read_words:
-	movq (%rsi), %r8
-	MCSAFE_TEST_SRC %rsi 8 .E_read_words
-	MCSAFE_TEST_DST %rdi 8 .E_write_words
-.L_write_words:
-	movq %r8, (%rdi)
-	addq $8, %rsi
-	addq $8, %rdi
-	decl %ecx
-	jnz .L_read_words
-
-	/* Any trailing bytes? */
-.L_no_whole_words:
-	andl %edx, %edx
-	jz .L_done_memcpy_trap
-
-	/* Copy trailing bytes */
-	movl %edx, %ecx
-.L_read_trailing_bytes:
-	movb (%rsi), %al
-	MCSAFE_TEST_SRC %rsi 1 .E_trailing_bytes
-	MCSAFE_TEST_DST %rdi 1 .E_trailing_bytes
-.L_write_trailing_bytes:
-	movb %al, (%rdi)
-	incq %rsi
-	incq %rdi
-	decl %ecx
-	jnz .L_read_trailing_bytes
-
-	/* Copy successful. Return zero */
-.L_done_memcpy_trap:
-	xorl %eax, %eax
-.L_done:
-	ret
-SYM_FUNC_END(__memcpy_mcsafe)
-EXPORT_SYMBOL_GPL(__memcpy_mcsafe)
-
-	.section .fixup, "ax"
-	/*
-	 * Return number of bytes not copied for any failure. Note that
-	 * there is no "tail" handling since the source buffer is 8-byte
-	 * aligned and poison is cacheline aligned.
-	 */
-.E_read_words:
-	shll	$3, %ecx
-.E_leading_bytes:
-	addl	%edx, %ecx
-.E_trailing_bytes:
-	mov	%ecx, %eax
-	jmp	.L_done
-
-	/*
-	 * For write fault handling, given the destination is unaligned,
-	 * we handle faults on multi-byte writes with a byte-by-byte
-	 * copy up to the write-protected page.
-	 */
-.E_write_words:
-	shll	$3, %ecx
-	addl	%edx, %ecx
-	movl	%ecx, %edx
-	jmp mcsafe_handle_tail
-
-	.previous
-
-	_ASM_EXTABLE_FAULT(.L_read_leading_bytes, .E_leading_bytes)
-	_ASM_EXTABLE_FAULT(.L_read_words, .E_read_words)
-	_ASM_EXTABLE_FAULT(.L_read_trailing_bytes, .E_trailing_bytes)
-	_ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes)
-	_ASM_EXTABLE(.L_write_words, .E_write_words)
-	_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
-#endif
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index b0dfac3..5f1d4a9 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -56,27 +56,6 @@ unsigned long clear_user(void __user *to, unsigned long n)
 }
 EXPORT_SYMBOL(clear_user);
 
-/*
- * Similar to copy_user_handle_tail, probe for the write fault point,
- * but reuse __memcpy_mcsafe in case a new read error is encountered.
- * clac() is handled in _copy_to_iter_mcsafe().
- */
-__visible notrace unsigned long
-mcsafe_handle_tail(char *to, char *from, unsigned len)
-{
-	for (; len; --len, to++, from++) {
-		/*
-		 * Call the assembly routine back directly since
-		 * memcpy_mcsafe() may silently fallback to memcpy.
-		 */
-		unsigned long rem = __memcpy_mcsafe(to, from, 1);
-
-		if (rem)
-			break;
-	}
-	return len;
-}
-
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 /**
  * clean_cache_range - write back a cache range with CLWB
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 86dbe0c..9fc18fa 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -49,7 +49,7 @@ do {								\
 #define pmem_assign(dest, src)	((dest) = (src))
 #endif
 
-#if defined(__HAVE_ARCH_MEMCPY_MCSAFE) && defined(DM_WRITECACHE_HAS_PMEM)
+#if IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC) && defined(DM_WRITECACHE_HAS_PMEM)
 #define DM_WRITECACHE_HANDLE_HARDWARE_ERRORS
 #endif
 
@@ -984,7 +984,8 @@ static void writecache_resume(struct dm_target *ti)
 	}
 	wc->freelist_size = 0;
 
-	r = memcpy_mcsafe(&sb_seq_count, &sb(wc)->seq_count, sizeof(uint64_t));
+	r = copy_mc_to_kernel(&sb_seq_count, &sb(wc)->seq_count,
+			      sizeof(uint64_t));
 	if (r) {
 		writecache_error(wc, r, "hardware memory error when reading superblock: %d", r);
 		sb_seq_count = cpu_to_le64(0);
@@ -1000,7 +1001,8 @@ static void writecache_resume(struct dm_target *ti)
 			e->seq_count = -1;
 			continue;
 		}
-		r = memcpy_mcsafe(&wme, memory_entry(wc, e), sizeof(struct wc_memory_entry));
+		r = copy_mc_to_kernel(&wme, memory_entry(wc, e),
+				      sizeof(struct wc_memory_entry));
 		if (r) {
 			writecache_error(wc, r, "hardware memory error when reading metadata entry %lu: %d",
 					 (unsigned long)b, r);
@@ -1198,7 +1200,7 @@ static void bio_copy_block(struct dm_writecache *wc, struct bio *bio, void *data
 
 		if (rw == READ) {
 			int r;
-			r = memcpy_mcsafe(buf, data, size);
+			r = copy_mc_to_kernel(buf, data, size);
 			flush_dcache_page(bio_page(bio));
 			if (unlikely(r)) {
 				writecache_error(wc, r, "hardware memory error when reading data: %d", r);
@@ -2341,7 +2343,7 @@ invalid_optional:
 		}
 	}
 
-	r = memcpy_mcsafe(&s, sb(wc), sizeof(struct wc_memory_superblock));
+	r = copy_mc_to_kernel(&s, sb(wc), sizeof(struct wc_memory_superblock));
 	if (r) {
 		ti->error = "Hardware memory error when reading superblock";
 		goto bad;
@@ -2352,7 +2354,8 @@ invalid_optional:
 			ti->error = "Unable to initialize device";
 			goto bad;
 		}
-		r = memcpy_mcsafe(&s, sb(wc), sizeof(struct wc_memory_superblock));
+		r = copy_mc_to_kernel(&s, sb(wc),
+				      sizeof(struct wc_memory_superblock));
 		if (r) {
 			ti->error = "Hardware memory error when reading superblock";
 			goto bad;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 45964ac..22d865b 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -268,7 +268,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 	if (rw == READ) {
 		if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align)))
 			return -EIO;
-		if (memcpy_mcsafe(buf, nsio->addr + offset, size) != 0)
+		if (copy_mc_to_kernel(buf, nsio->addr + offset, size) != 0)
 			return -EIO;
 		return 0;
 	}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fab29b5..5c6939e 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -125,7 +125,7 @@ static blk_status_t read_pmem(struct page *page, unsigned int off,
 	while (len) {
 		mem = kmap_atomic(page);
 		chunk = min_t(unsigned int, len, PAGE_SIZE - off);
-		rem = memcpy_mcsafe(mem + off, pmem_addr, chunk);
+		rem = copy_mc_to_kernel(mem + off, pmem_addr, chunk);
 		kunmap_atomic(mem);
 		if (rem)
 			return BLK_STS_IOERR;
@@ -304,7 +304,7 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 
 /*
  * Use the 'no check' versions of copy_from_iter_flushcache() and
- * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds
+ * copy_mc_to_iter() to bypass HARDENED_USERCOPY overhead. Bounds
  * checking, both file offset and device offset, is handled by
  * dax_iomap_actor()
  */
@@ -317,7 +317,7 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i)
 {
-	return _copy_to_iter_mcsafe(addr, bytes, i);
+	return _copy_mc_to_iter(addr, bytes, i);
 }
 
 static const struct dax_operations pmem_dax_ops = {
diff --git a/include/linux/string.h b/include/linux/string.h
index 9b7a063..b1f3894 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -161,20 +161,13 @@ extern int bcmp(const void *,const void *,__kernel_size_t);
 #ifndef __HAVE_ARCH_MEMCHR
 extern void * memchr(const void *,int,__kernel_size_t);
 #endif
-#ifndef __HAVE_ARCH_MEMCPY_MCSAFE
-static inline __must_check unsigned long memcpy_mcsafe(void *dst,
-		const void *src, size_t cnt)
-{
-	memcpy(dst, src, cnt);
-	return 0;
-}
-#endif
 #ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE
 static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
 {
 	memcpy(dst, src, cnt);
 }
 #endif
+
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *s, char old, char new);
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 94b2854..1ae36bc 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -179,6 +179,19 @@ copy_in_user(void __user *to, const void __user *from, unsigned long n)
 }
 #endif
 
+#ifndef copy_mc_to_kernel
+/*
+ * Without arch opt-in this generic copy_mc_to_kernel() will not handle
+ * #MC (or arch equivalent) during source read.
+ */
+static inline unsigned long __must_check
+copy_mc_to_kernel(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+	return 0;
+}
+#endif
+
 static __always_inline void pagefault_disabled_inc(void)
 {
 	current->pagefault_disabled++;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 3835a8a..f14410c 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -185,10 +185,10 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i);
 #define _copy_from_iter_flushcache _copy_from_iter_nocache
 #endif
 
-#ifdef CONFIG_ARCH_HAS_UACCESS_MCSAFE
-size_t _copy_to_iter_mcsafe(const void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 #else
-#define _copy_to_iter_mcsafe _copy_to_iter
+#define _copy_mc_to_iter _copy_to_iter
 #endif
 
 static __always_inline __must_check
@@ -201,12 +201,12 @@ size_t copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
 }
 
 static __always_inline __must_check
-size_t copy_to_iter_mcsafe(void *addr, size_t bytes, struct iov_iter *i)
+size_t copy_mc_to_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (unlikely(!check_copy_size(addr, bytes, true)))
 		return 0;
 	else
-		return _copy_to_iter_mcsafe(addr, bytes, i);
+		return _copy_mc_to_iter(addr, bytes, i);
 }
 
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
diff --git a/lib/Kconfig b/lib/Kconfig
index b4b98a0..b46a9fd 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -635,7 +635,12 @@ config UACCESS_MEMCPY
 config ARCH_HAS_UACCESS_FLUSHCACHE
 	bool
 
-config ARCH_HAS_UACCESS_MCSAFE
+# arch has a concept of a recoverable synchronous exception due to a
+# memory-read error like x86 machine-check or ARM data-abort, and
+# implements copy_mc_to_{user,kernel} to abort and report
+# 'bytes-transferred' if that exception fires when accessing the source
+# buffer.
+config ARCH_HAS_COPY_MC
 	bool
 
 # Temporary. Goes away when all archs are cleaned up
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 5e40786..d13304a 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -637,30 +637,30 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(_copy_to_iter);
 
-#ifdef CONFIG_ARCH_HAS_UACCESS_MCSAFE
-static int copyout_mcsafe(void __user *to, const void *from, size_t n)
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+static int copyout_mc(void __user *to, const void *from, size_t n)
 {
 	if (access_ok(to, n)) {
 		instrument_copy_to_user(to, from, n);
-		n = copy_to_user_mcsafe((__force void *) to, from, n);
+		n = copy_mc_to_user((__force void *) to, from, n);
 	}
 	return n;
 }
 
-static unsigned long memcpy_mcsafe_to_page(struct page *page, size_t offset,
+static unsigned long copy_mc_to_page(struct page *page, size_t offset,
 		const char *from, size_t len)
 {
 	unsigned long ret;
 	char *to;
 
 	to = kmap_atomic(page);
-	ret = memcpy_mcsafe(to + offset, from, len);
+	ret = copy_mc_to_kernel(to + offset, from, len);
 	kunmap_atomic(to);
 
 	return ret;
 }
 
-static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes,
+static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
 				struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
@@ -678,7 +678,7 @@ static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes,
 		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
 		unsigned long rem;
 
-		rem = memcpy_mcsafe_to_page(pipe->bufs[i_head & p_mask].page,
+		rem = copy_mc_to_page(pipe->bufs[i_head & p_mask].page,
 					    off, addr, chunk);
 		i->head = i_head;
 		i->iov_offset = off + chunk - rem;
@@ -695,18 +695,17 @@ static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes,
 }
 
 /**
- * _copy_to_iter_mcsafe - copy to user with source-read error exception handling
+ * _copy_mc_to_iter - copy to iter with source memory error exception handling
  * @addr: source kernel address
  * @bytes: total transfer length
  * @iter: destination iterator
  *
- * The pmem driver arranges for filesystem-dax to use this facility via
- * dax_copy_to_iter() for protecting read/write to persistent memory.
- * Unless / until an architecture can guarantee identical performance
- * between _copy_to_iter_mcsafe() and _copy_to_iter() it would be a
- * performance regression to switch more users to the mcsafe version.
+ * The pmem driver deploys this for the dax operation
+ * (dax_copy_to_iter()) for dax reads (bypass page-cache and the
+ * block-layer). Upon #MC read(2) aborts and returns EIO or the bytes
+ * successfully copied.
  *
- * Otherwise, the main differences between this and typical _copy_to_iter().
+ * The main differences between this and typical _copy_to_iter() are:
  *
  * * Typical tail/residue handling after a fault retries the copy
  *   byte-by-byte until the fault happens again. Re-triggering machine
@@ -717,23 +716,22 @@ static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes,
  * * ITER_KVEC, ITER_PIPE, and ITER_BVEC can return short copies.
  *   Compare to copy_to_iter() where only ITER_IOVEC attempts might return
  *   a short copy.
- *
- * See MCSAFE_TEST for self-test.
  */
-size_t _copy_to_iter_mcsafe(const void *addr, size_t bytes, struct iov_iter *i)
+size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	const char *from = addr;
 	unsigned long rem, curr_addr, s_addr = (unsigned long) addr;
 
 	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_pipe_to_iter_mcsafe(addr, bytes, i);
+		return copy_mc_pipe_to_iter(addr, bytes, i);
 	if (iter_is_iovec(i))
 		might_fault();
 	iterate_and_advance(i, bytes, v,
-		copyout_mcsafe(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len),
+		copyout_mc(v.iov_base, (from += v.iov_len) - v.iov_len,
+			   v.iov_len),
 		({
-		rem = memcpy_mcsafe_to_page(v.bv_page, v.bv_offset,
-                               (from += v.bv_len) - v.bv_len, v.bv_len);
+		rem = copy_mc_to_page(v.bv_page, v.bv_offset,
+				      (from += v.bv_len) - v.bv_len, v.bv_len);
 		if (rem) {
 			curr_addr = (unsigned long) from;
 			bytes = curr_addr - s_addr - rem;
@@ -741,8 +739,8 @@ size_t _copy_to_iter_mcsafe(const void *addr, size_t bytes, struct iov_iter *i)
 		}
 		}),
 		({
-		rem = memcpy_mcsafe(v.iov_base, (from += v.iov_len) - v.iov_len,
-				v.iov_len);
+		rem = copy_mc_to_kernel(v.iov_base, (from += v.iov_len)
+					- v.iov_len, v.iov_len);
 		if (rem) {
 			curr_addr = (unsigned long) from;
 			bytes = curr_addr - s_addr - rem;
@@ -753,8 +751,8 @@ size_t _copy_to_iter_mcsafe(const void *addr, size_t bytes, struct iov_iter *i)
 
 	return bytes;
 }
-EXPORT_SYMBOL_GPL(_copy_to_iter_mcsafe);
-#endif /* CONFIG_ARCH_HAS_UACCESS_MCSAFE */
+EXPORT_SYMBOL_GPL(_copy_mc_to_iter);
+#endif /* CONFIG_ARCH_HAS_COPY_MC */
 
 size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
diff --git a/tools/arch/x86/include/asm/mcsafe_test.h b/tools/arch/x86/include/asm/mcsafe_test.h
deleted file mode 100644
index 2ccd588..0000000
--- a/tools/arch/x86/include/asm/mcsafe_test.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _MCSAFE_TEST_H_
-#define _MCSAFE_TEST_H_
-
-.macro MCSAFE_TEST_CTL
-.endm
-
-.macro MCSAFE_TEST_SRC reg count target
-.endm
-
-.macro MCSAFE_TEST_DST reg count target
-.endm
-#endif /* _MCSAFE_TEST_H_ */
diff --git a/tools/arch/x86/lib/memcpy_64.S b/tools/arch/x86/lib/memcpy_64.S
index 45f8e1b..0b5b8ae 100644
--- a/tools/arch/x86/lib/memcpy_64.S
+++ b/tools/arch/x86/lib/memcpy_64.S
@@ -4,7 +4,6 @@
 #include <linux/linkage.h>
 #include <asm/errno.h>
 #include <asm/cpufeatures.h>
-#include <asm/mcsafe_test.h>
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
 
@@ -187,117 +186,3 @@ SYM_FUNC_START(memcpy_orig)
 SYM_FUNC_END(memcpy_orig)
 
 .popsection
-
-#ifndef CONFIG_UML
-
-MCSAFE_TEST_CTL
-
-/*
- * __memcpy_mcsafe - memory copy with machine check exception handling
- * Note that we only catch machine checks when reading the source addresses.
- * Writes to target are posted and don't generate machine checks.
- */
-SYM_FUNC_START(__memcpy_mcsafe)
-	cmpl $8, %edx
-	/* Less than 8 bytes? Go to byte copy loop */
-	jb .L_no_whole_words
-
-	/* Check for bad alignment of source */
-	testl $7, %esi
-	/* Already aligned */
-	jz .L_8byte_aligned
-
-	/* Copy one byte at a time until source is 8-byte aligned */
-	movl %esi, %ecx
-	andl $7, %ecx
-	subl $8, %ecx
-	negl %ecx
-	subl %ecx, %edx
-.L_read_leading_bytes:
-	movb (%rsi), %al
-	MCSAFE_TEST_SRC %rsi 1 .E_leading_bytes
-	MCSAFE_TEST_DST %rdi 1 .E_leading_bytes
-.L_write_leading_bytes:
-	movb %al, (%rdi)
-	incq %rsi
-	incq %rdi
-	decl %ecx
-	jnz .L_read_leading_bytes
-
-.L_8byte_aligned:
-	movl %edx, %ecx
-	andl $7, %edx
-	shrl $3, %ecx
-	jz .L_no_whole_words
-
-.L_read_words:
-	movq (%rsi), %r8
-	MCSAFE_TEST_SRC %rsi 8 .E_read_words
-	MCSAFE_TEST_DST %rdi 8 .E_write_words
-.L_write_words:
-	movq %r8, (%rdi)
-	addq $8, %rsi
-	addq $8, %rdi
-	decl %ecx
-	jnz .L_read_words
-
-	/* Any trailing bytes? */
-.L_no_whole_words:
-	andl %edx, %edx
-	jz .L_done_memcpy_trap
-
-	/* Copy trailing bytes */
-	movl %edx, %ecx
-.L_read_trailing_bytes:
-	movb (%rsi), %al
-	MCSAFE_TEST_SRC %rsi 1 .E_trailing_bytes
-	MCSAFE_TEST_DST %rdi 1 .E_trailing_bytes
-.L_write_trailing_bytes:
-	movb %al, (%rdi)
-	incq %rsi
-	incq %rdi
-	decl %ecx
-	jnz .L_read_trailing_bytes
-
-	/* Copy successful. Return zero */
-.L_done_memcpy_trap:
-	xorl %eax, %eax
-.L_done:
-	ret
-SYM_FUNC_END(__memcpy_mcsafe)
-EXPORT_SYMBOL_GPL(__memcpy_mcsafe)
-
-	.section .fixup, "ax"
-	/*
-	 * Return number of bytes not copied for any failure. Note that
-	 * there is no "tail" handling since the source buffer is 8-byte
-	 * aligned and poison is cacheline aligned.
-	 */
-.E_read_words:
-	shll	$3, %ecx
-.E_leading_bytes:
-	addl	%edx, %ecx
-.E_trailing_bytes:
-	mov	%ecx, %eax
-	jmp	.L_done
-
-	/*
-	 * For write fault handling, given the destination is unaligned,
-	 * we handle faults on multi-byte writes with a byte-by-byte
-	 * copy up to the write-protected page.
-	 */
-.E_write_words:
-	shll	$3, %ecx
-	addl	%edx, %ecx
-	movl	%ecx, %edx
-	jmp mcsafe_handle_tail
-
-	.previous
-
-	_ASM_EXTABLE_FAULT(.L_read_leading_bytes, .E_leading_bytes)
-	_ASM_EXTABLE_FAULT(.L_read_words, .E_read_words)
-	_ASM_EXTABLE_FAULT(.L_read_trailing_bytes, .E_trailing_bytes)
-	_ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes)
-	_ASM_EXTABLE(.L_write_words, .E_write_words)
-	_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
-#endif
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index e034a8f..893f021 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -548,8 +548,8 @@ static const char *uaccess_safe_builtin[] = {
 	"__ubsan_handle_shift_out_of_bounds",
 	/* misc */
 	"csum_partial_copy_generic",
-	"__memcpy_mcsafe",
-	"mcsafe_handle_tail",
+	"copy_mc_fragile",
+	"copy_mc_fragile_handle_tail",
 	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
 	NULL
 };
diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index dd68a40..878db6a 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -13,7 +13,6 @@ perf-y += synthesize.o
 perf-y += kallsyms-parse.o
 perf-y += find-bit-bench.o
 
-perf-$(CONFIG_X86_64) += mem-memcpy-x86-64-lib.o
 perf-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o
 perf-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o
 
diff --git a/tools/perf/bench/mem-memcpy-x86-64-lib.c b/tools/perf/bench/mem-memcpy-x86-64-lib.c
deleted file mode 100644
index 4130734..0000000
--- a/tools/perf/bench/mem-memcpy-x86-64-lib.c
+++ /dev/null
@@ -1,24 +0,0 @@
-/*
- * From code in arch/x86/lib/usercopy_64.c, copied to keep tools/ copy
- * of the kernel's arch/x86/lib/memcpy_64.s used in 'perf bench mem memcpy'
- * happy.
- */
-#include <linux/types.h>
-
-unsigned long __memcpy_mcsafe(void *dst, const void *src, size_t cnt);
-unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len);
-
-unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len)
-{
-	for (; len; --len, to++, from++) {
-		/*
-		 * Call the assembly routine back directly since
-		 * memcpy_mcsafe() may silently fallback to memcpy.
-		 */
-		unsigned long rem = __memcpy_mcsafe(to, from, 1);
-
-		if (rem)
-			break;
-	}
-	return len;
-}
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index a1a5dc6..2ac0fff 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -23,7 +23,8 @@
 #include "nfit_test.h"
 #include "../watermark.h"
 
-#include <asm/mcsafe_test.h>
+#include <asm/copy_mc_test.h>
+#include <asm/mce.h>
 
 /*
  * Generate an NFIT table to describe the following topology:
@@ -3283,7 +3284,7 @@ static struct platform_driver nfit_test_driver = {
 	.id_table = nfit_test_id,
 };
 
-static char mcsafe_buf[PAGE_SIZE] __attribute__((__aligned__(PAGE_SIZE)));
+static char copy_mc_buf[PAGE_SIZE] __attribute__((__aligned__(PAGE_SIZE)));
 
 enum INJECT {
 	INJECT_NONE,
@@ -3291,7 +3292,7 @@ enum INJECT {
 	INJECT_DST,
 };
 
-static void mcsafe_test_init(char *dst, char *src, size_t size)
+static void copy_mc_test_init(char *dst, char *src, size_t size)
 {
 	size_t i;
 
@@ -3300,7 +3301,7 @@ static void mcsafe_test_init(char *dst, char *src, size_t size)
 		src[i] = (char) i;
 }
 
-static bool mcsafe_test_validate(unsigned char *dst, unsigned char *src,
+static bool copy_mc_test_validate(unsigned char *dst, unsigned char *src,
 		size_t size, unsigned long rem)
 {
 	size_t i;
@@ -3321,12 +3322,12 @@ static bool mcsafe_test_validate(unsigned char *dst, unsigned char *src,
 	return true;
 }
 
-void mcsafe_test(void)
+void copy_mc_test(void)
 {
 	char *inject_desc[] = { "none", "source", "destination" };
 	enum INJECT inj;
 
-	if (IS_ENABLED(CONFIG_MCSAFE_TEST)) {
+	if (IS_ENABLED(CONFIG_COPY_MC_TEST)) {
 		pr_info("%s: run...\n", __func__);
 	} else {
 		pr_info("%s: disabled, skip.\n", __func__);
@@ -3344,31 +3345,31 @@ void mcsafe_test(void)
 
 			switch (inj) {
 			case INJECT_NONE:
-				mcsafe_inject_src(NULL);
-				mcsafe_inject_dst(NULL);
-				dst = &mcsafe_buf[2048];
-				src = &mcsafe_buf[1024 - i];
+				copy_mc_inject_src(NULL);
+				copy_mc_inject_dst(NULL);
+				dst = &copy_mc_buf[2048];
+				src = &copy_mc_buf[1024 - i];
 				expect = 0;
 				break;
 			case INJECT_SRC:
-				mcsafe_inject_src(&mcsafe_buf[1024]);
-				mcsafe_inject_dst(NULL);
-				dst = &mcsafe_buf[2048];
-				src = &mcsafe_buf[1024 - i];
+				copy_mc_inject_src(&copy_mc_buf[1024]);
+				copy_mc_inject_dst(NULL);
+				dst = &copy_mc_buf[2048];
+				src = &copy_mc_buf[1024 - i];
 				expect = 512 - i;
 				break;
 			case INJECT_DST:
-				mcsafe_inject_src(NULL);
-				mcsafe_inject_dst(&mcsafe_buf[2048]);
-				dst = &mcsafe_buf[2048 - i];
-				src = &mcsafe_buf[1024];
+				copy_mc_inject_src(NULL);
+				copy_mc_inject_dst(&copy_mc_buf[2048]);
+				dst = &copy_mc_buf[2048 - i];
+				src = &copy_mc_buf[1024];
 				expect = 512 - i;
 				break;
 			}
 
-			mcsafe_test_init(dst, src, 512);
-			rem = __memcpy_mcsafe(dst, src, 512);
-			valid = mcsafe_test_validate(dst, src, 512, expect);
+			copy_mc_test_init(dst, src, 512);
+			rem = copy_mc_fragile(dst, src, 512);
+			valid = copy_mc_test_validate(dst, src, 512, expect);
 			if (rem == expect && valid)
 				continue;
 			pr_info("%s: copy(%#lx, %#lx, %d) off: %d rem: %ld %s expect: %ld\n",
@@ -3380,8 +3381,8 @@ void mcsafe_test(void)
 		}
 	}
 
-	mcsafe_inject_src(NULL);
-	mcsafe_inject_dst(NULL);
+	copy_mc_inject_src(NULL);
+	copy_mc_inject_dst(NULL);
 }
 
 static __init int nfit_test_init(void)
@@ -3392,7 +3393,7 @@ static __init int nfit_test_init(void)
 	libnvdimm_test();
 	acpi_nfit_test();
 	device_dax_test();
-	mcsafe_test();
+	copy_mc_test();
 	dax_pmem_test();
 	dax_pmem_core_test();
 #ifdef CONFIG_DEV_DAX_PMEM_COMPAT
diff --git a/tools/testing/selftests/powerpc/copyloops/.gitignore b/tools/testing/selftests/powerpc/copyloops/.gitignore
index ddaf140..994b11a 100644
--- a/tools/testing/selftests/powerpc/copyloops/.gitignore
+++ b/tools/testing/selftests/powerpc/copyloops/.gitignore
@@ -12,4 +12,4 @@ memcpy_p7_t1
 copyuser_64_exc_t0
 copyuser_64_exc_t1
 copyuser_64_exc_t2
-memcpy_mcsafe_64
+copy_mc_64
diff --git a/tools/testing/selftests/powerpc/copyloops/Makefile b/tools/testing/selftests/powerpc/copyloops/Makefile
index 0917983..3095b1f 100644
--- a/tools/testing/selftests/powerpc/copyloops/Makefile
+++ b/tools/testing/selftests/powerpc/copyloops/Makefile
@@ -12,7 +12,7 @@ ASFLAGS = $(CFLAGS) -Wa,-mpower4
 TEST_GEN_PROGS := copyuser_64_t0 copyuser_64_t1 copyuser_64_t2 \
 		copyuser_p7_t0 copyuser_p7_t1 \
 		memcpy_64_t0 memcpy_64_t1 memcpy_64_t2 \
-		memcpy_p7_t0 memcpy_p7_t1 memcpy_mcsafe_64 \
+		memcpy_p7_t0 memcpy_p7_t1 copy_mc_64 \
 		copyuser_64_exc_t0 copyuser_64_exc_t1 copyuser_64_exc_t2
 
 EXTRA_SOURCES := validate.c ../harness.c stubs.S
@@ -45,9 +45,9 @@ $(OUTPUT)/memcpy_p7_t%:	memcpy_power7.S $(EXTRA_SOURCES)
 		-D SELFTEST_CASE=$(subst memcpy_p7_t,,$(notdir $@)) \
 		-o $@ $^
 
-$(OUTPUT)/memcpy_mcsafe_64: memcpy_mcsafe_64.S $(EXTRA_SOURCES)
+$(OUTPUT)/copy_mc_64: copy_mc_64.S $(EXTRA_SOURCES)
 	$(CC) $(CPPFLAGS) $(CFLAGS) \
-		-D COPY_LOOP=test_memcpy_mcsafe \
+		-D COPY_LOOP=test_copy_mc_generic \
 		-o $@ $^
 
 $(OUTPUT)/copyuser_64_exc_t%: copyuser_64.S exc_validate.c ../harness.c \
diff --git a/tools/testing/selftests/powerpc/copyloops/copy_mc_64.S b/tools/testing/selftests/powerpc/copyloops/copy_mc_64.S
new file mode 120000
index 0000000..dcbe06d
--- /dev/null
+++ b/tools/testing/selftests/powerpc/copyloops/copy_mc_64.S
@@ -0,0 +1 @@
+../../../../../arch/powerpc/lib/copy_mc_64.S
\ No newline at end of file
diff --git a/tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S b/tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S
deleted file mode 120000
index f0feef3..0000000
--- a/tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S
+++ /dev/null
@@ -1 +0,0 @@
-../../../../../arch/powerpc/lib/memcpy_mcsafe_64.S
\ No newline at end of file

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  2020-10-06  9:57       ` [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() tip-bot2 for Dan Williams
@ 2020-10-07 11:14         ` Borislav Petkov
  2020-10-07 16:45           ` Borislav Petkov
                             ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Borislav Petkov @ 2020-10-07 11:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-tip-commits, Dan Williams, Borislav Petkov, Tony Luck,
	Michael Ellerman, stable, x86, LKML

On Tue, Oct 06, 2020 at 09:57:09AM -0000, tip-bot2 for Dan Williams wrote:
> +	/* Copy successful. Return zero */
> +.L_done_memcpy_trap:
> +	xorl %eax, %eax
> +.L_done:
> +	ret
> +SYM_FUNC_END(copy_mc_fragile)
> +EXPORT_SYMBOL_GPL(copy_mc_fragile)

That export together with CONFIG_MODVERSIONS causes

WARNING: modpost: EXPORT symbol "copy_mc_fragile" [vmlinux] version generation failed, symbol will not be versioned.

here.

I don't see why tho...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  2020-10-07 11:14         ` Borislav Petkov
@ 2020-10-07 16:45           ` Borislav Petkov
  2020-10-07 17:03             ` Borislav Petkov
  2020-10-07 17:51             ` Dan Williams
  2020-10-07 18:24           ` [PATCH] x86/mce: Gate copy_mc_fragile() export by CONFIG_COPY_MC_TEST=y Dan Williams
  2020-10-08  9:01           ` [tip: ras/core] x86/mce: Allow for copy_mc_fragile symbol checksum to be generated tip-bot2 for Borislav Petkov
  2 siblings, 2 replies; 30+ messages in thread
From: Borislav Petkov @ 2020-10-07 16:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-tip-commits, Borislav Petkov, Tony Luck, Michael Ellerman,
	stable, x86, LKML

On Wed, Oct 07, 2020 at 01:14:47PM +0200, Borislav Petkov wrote:
> On Tue, Oct 06, 2020 at 09:57:09AM -0000, tip-bot2 for Dan Williams wrote:
> > +	/* Copy successful. Return zero */
> > +.L_done_memcpy_trap:
> > +	xorl %eax, %eax
> > +.L_done:
> > +	ret
> > +SYM_FUNC_END(copy_mc_fragile)
> > +EXPORT_SYMBOL_GPL(copy_mc_fragile)
> 
> That export together with CONFIG_MODVERSIONS causes
> 
> WARNING: modpost: EXPORT symbol "copy_mc_fragile" [vmlinux] version generation failed, symbol will not be versioned.
> 
> here.
> 
> I don't see why tho...

It doesn't look like it is toolchain-specific and in both cases,
copy_mc_fragile's checksum is 0.

SUSE Leap 15.1:

Name           : binutils                                        
Version        : 2.32-lp151.3.6.1

$ grep -E "(copy_mc_fragile|copy_user_generic_unrolled)" Module.symvers
0x00000000      copy_mc_fragile vmlinux EXPORT_SYMBOL_GPL
0xecdcabd2      copy_user_generic_unrolled      vmlinux EXPORT_SYMBOL

debian testing:

Package: binutils
Version: 2.35-2

$ grep -E "(copy_mc_fragile|copy_user_generic_unrolled)" Module.symvers 
0x00000000      copy_mc_fragile vmlinux EXPORT_SYMBOL_GPL
0xecdcabd2      copy_user_generic_unrolled      vmlinux EXPORT_SYMBOL

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  2020-10-07 16:45           ` Borislav Petkov
@ 2020-10-07 17:03             ` Borislav Petkov
  2020-10-07 18:53               ` Dan Williams
  2020-10-07 17:51             ` Dan Williams
  1 sibling, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2020-10-07 17:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-tip-commits, Tony Luck, Michael Ellerman, stable, x86, LKML

On Wed, Oct 07, 2020 at 06:45:36PM +0200, Borislav Petkov wrote:
> It doesn't look like it is toolchain-specific and in both cases,
> copy_mc_fragile's checksum is 0.
> 
> SUSE Leap 15.1:
> 
> Name           : binutils                                        
> Version        : 2.32-lp151.3.6.1
> 
> $ grep -E "(copy_mc_fragile|copy_user_generic_unrolled)" Module.symvers
> 0x00000000      copy_mc_fragile vmlinux EXPORT_SYMBOL_GPL
> 0xecdcabd2      copy_user_generic_unrolled      vmlinux EXPORT_SYMBOL
> 
> debian testing:
> 
> Package: binutils
> Version: 2.35-2
> 
> $ grep -E "(copy_mc_fragile|copy_user_generic_unrolled)" Module.symvers 
> 0x00000000      copy_mc_fragile vmlinux EXPORT_SYMBOL_GPL
> 0xecdcabd2      copy_user_generic_unrolled      vmlinux EXPORT_SYMBOL

Ok, I think I have it:

---
From: Borislav Petkov <bp@suse.de>
Date: Wed, 7 Oct 2020 18:55:35 +0200
Subject: [PATCH] x86/mce: Allow for copy_mc_fragile symbol checksum to be generated

Add asm/mce.h to asm/asm-prototypes.h so that that asm symbol's checksum
can be generated in order to support CONFIG_MODVERSIONS with it and fix:

  WARNING: modpost: EXPORT symbol "copy_mc_fragile" [vmlinux] version \
	  generation failed, symbol will not be versioned.

For reference see:

  4efca4ed05cb ("kbuild: modversions for EXPORT_SYMBOL() for asm")
  334bb7738764 ("x86/kbuild: enable modversions for symbols exported from asm")

Fixes: ec6347bb4339 ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/asm-prototypes.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h
index 5a42f9206138..51e2bf27cc9b 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -5,6 +5,7 @@
 #include <asm/string.h>
 #include <asm/page.h>
 #include <asm/checksum.h>
+#include <asm/mce.h>
 
 #include <asm-generic/asm-prototypes.h>
 
-- 
2.21.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 30+ messages in thread
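
For context on why including a header fixes a versioning warning: with
CONFIG_MODVERSIONS=y, genksyms computes each exported symbol's CRC from a C
prototype it can parse. A symbol exported only from a .S file has no such
prototype visible by default, which is why copy_mc_fragile's checksum
degenerates to 0x00000000 in the Module.symvers output above, and
asm-prototypes.h exists precisely to route those prototypes to genksyms
(see the two commits referenced in the patch). A minimal sketch of the
pattern, assuming (as the patch implies) that the copy_mc_fragile()
declaration lives in <asm/mce.h>:

	/* arch/x86/include/asm/asm-prototypes.h, sketch of the pattern only */
	#include <asm/string.h>    /* string helpers implemented in .S       */
	#include <asm/page.h>
	#include <asm/checksum.h>  /* checksum helpers implemented in .S     */
	#include <asm/mce.h>       /* copy_mc_fragile(), added by this fix   */

	#include <asm-generic/asm-prototypes.h>

After a rebuild, copy_mc_fragile should then get a non-zero CRC in
Module.symvers instead of the 0x00000000 shown above.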

* Re: [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  2020-10-07 16:45           ` Borislav Petkov
  2020-10-07 17:03             ` Borislav Petkov
@ 2020-10-07 17:51             ` Dan Williams
  1 sibling, 0 replies; 30+ messages in thread
From: Dan Williams @ 2020-10-07 17:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-tip-commits, Borislav Petkov, Tony Luck, Michael Ellerman,
	stable, x86, LKML, kbuild test robot

[ add kbuild robot so they can add this check to their reports ]

On Wed, Oct 7, 2020 at 9:47 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Oct 07, 2020 at 01:14:47PM +0200, Borislav Petkov wrote:
> > On Tue, Oct 06, 2020 at 09:57:09AM -0000, tip-bot2 for Dan Williams wrote:
> > > +   /* Copy successful. Return zero */
> > > +.L_done_memcpy_trap:
> > > +   xorl %eax, %eax
> > > +.L_done:
> > > +   ret
> > > +SYM_FUNC_END(copy_mc_fragile)
> > > +EXPORT_SYMBOL_GPL(copy_mc_fragile)
> >
> > That export together with CONFIG_MODVERSIONS causes
> >
> > WARNING: modpost: EXPORT symbol "copy_mc_fragile" [vmlinux] version generation failed, symbol will not be versioned.
> >
> > here.
> >
> > I don't see why tho...
>
> It doesn't look like it is toolchain-specific and in both cases,
> copy_mc_fragile's checksum is 0.
>
> SUSE Leap 15.1:
>
> Name           : binutils
> Version        : 2.32-lp151.3.6.1
>
> $ grep -E "(copy_mc_fragile|copy_user_generic_unrolled)" Module.symvers
> 0x00000000      copy_mc_fragile vmlinux EXPORT_SYMBOL_GPL
> 0xecdcabd2      copy_user_generic_unrolled      vmlinux EXPORT_SYMBOL
>
> debian testing:
>
> Package: binutils
> Version: 2.35-2
>
> $ grep -E "(copy_mc_fragile|copy_user_generic_unrolled)" Module.symvers
> 0x00000000      copy_mc_fragile vmlinux EXPORT_SYMBOL_GPL
> 0xecdcabd2      copy_user_generic_unrolled      vmlinux EXPORT_SYMBOL

Yes, I'm seeing this too on Fedora, I'll take a look.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH] x86/mce: Gate copy_mc_fragile() export by CONFIG_COPY_MC_TEST=y
  2020-10-07 11:14         ` Borislav Petkov
  2020-10-07 16:45           ` Borislav Petkov
@ 2020-10-07 18:24           ` Dan Williams
  2020-10-08  9:01           ` [tip: ras/core] x86/mce: Allow for copy_mc_fragile symbol checksum to be generated tip-bot2 for Borislav Petkov
  2 siblings, 0 replies; 30+ messages in thread
From: Dan Williams @ 2020-10-07 18:24 UTC (permalink / raw)
  To: bp
  Cc: Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	H. Peter Anvin, Tony Luck, linux-kernel, linux-nvdimm

It appears that modpost is not happy about exporting assembly symbols
that are not consumed in the same build. As Boris reports:

    WARNING: modpost: EXPORT symbol "copy_mc_fragile" [vmlinux] version generation failed, symbol will not be versioned.

The export is only consumed in the CONFIG_COPY_MC_TEST=y case, and even
then not in a way that modpost could see. CONFIG_COPY_MC_TEST uses a
module built in tools/testing/nvdimm/ to exercise the copy_mc_fragile()
corner cases.  Given the test already requires manually editing the
config entry for CONFIG_COPY_MC_TEST to make it "def_bool y", the
additional requirement of CONFIG_MODVERSIONS=n is not too onerous.

Alternatively, COPY_MC_TEST and its related infrastructure could just be
ripped out because it has served its purpose. For now, just stop
exporting the symbol by default, and add the MODVERSIONS dependency to
the test.

Fixes: ec6347bb4339 ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
Reported-by: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig.debug    |    1 +
 arch/x86/lib/copy_mc_64.S |    2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 27b5e2bc6a01..6f0f5d8ac62e 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -63,6 +63,7 @@ config EARLY_PRINTK_USB_XDBC
 	  crashes or need a very simple printk logging facility.
 
 config COPY_MC_TEST
+	depends on !MODVERSIONS
 	def_bool n
 
 config EFI_PGT_DUMP
diff --git a/arch/x86/lib/copy_mc_64.S b/arch/x86/lib/copy_mc_64.S
index 892d8915f609..50fb05256751 100644
--- a/arch/x86/lib/copy_mc_64.S
+++ b/arch/x86/lib/copy_mc_64.S
@@ -88,7 +88,9 @@ SYM_FUNC_START(copy_mc_fragile)
 .L_done:
 	ret
 SYM_FUNC_END(copy_mc_fragile)
+#ifdef CONFIG_COPY_MC_TEST
 EXPORT_SYMBOL_GPL(copy_mc_fragile)
+#endif
 
 	.section .fixup, "ax"
 	/*


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  2020-10-07 17:03             ` Borislav Petkov
@ 2020-10-07 18:53               ` Dan Williams
  2020-10-07 19:25                 ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-10-07 18:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-tip-commits, Tony Luck, Michael Ellerman, stable, x86, LKML

On Wed, Oct 7, 2020 at 11:47 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Oct 07, 2020 at 06:45:36PM +0200, Borislav Petkov wrote:
> > It doesn't look like it is toolchain-specific and in both cases,
> > copy_mc_fragile's checksum is 0.
> >
> > SUSE Leap 15.1:
> >
> > Name           : binutils
> > Version        : 2.32-lp151.3.6.1
> >
> > $ grep -E "(copy_mc_fragile|copy_user_generic_unrolled)" Module.symvers
> > 0x00000000      copy_mc_fragile vmlinux EXPORT_SYMBOL_GPL
> > 0xecdcabd2      copy_user_generic_unrolled      vmlinux EXPORT_SYMBOL
> >
> > debian testing:
> >
> > Package: binutils
> > Version: 2.35-2
> >
> > $ grep -E "(copy_mc_fragile|copy_user_generic_unrolled)" Module.symvers
> > 0x00000000      copy_mc_fragile vmlinux EXPORT_SYMBOL_GPL
> > 0xecdcabd2      copy_user_generic_unrolled      vmlinux EXPORT_SYMBOL
>
> Ok, I think I have it:
>
> ---
> From: Borislav Petkov <bp@suse.de>
> Date: Wed, 7 Oct 2020 18:55:35 +0200
> Subject: [PATCH] x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
>
> Add asm/mce.h to asm/asm-prototypes.h so that that asm symbol's checksum
> can be generated in order to support CONFIG_MODVERSIONS with it and fix:
>
>   WARNING: modpost: EXPORT symbol "copy_mc_fragile" [vmlinux] version \
>           generation failed, symbol will not be versioned.
>
> For reference see:
>
>   4efca4ed05cb ("kbuild: modversions for EXPORT_SYMBOL() for asm")
>   334bb7738764 ("x86/kbuild: enable modversions for symbols exported from asm")

Oh nice! I just sent a patch [1] to fix this up as well, but mine goes
after minimizing when it is exported, I think perhaps both are needed.

http://lore.kernel.org/r/160209507277.2768223.9933672492157583642.stgit@dwillia2-desk3.amr.corp.intel.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  2020-10-07 18:53               ` Dan Williams
@ 2020-10-07 19:25                 ` Borislav Petkov
  2020-10-08 16:59                   ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2020-10-07 19:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-tip-commits, Tony Luck, Michael Ellerman, stable, x86, LKML

On Wed, Oct 07, 2020 at 11:53:15AM -0700, Dan Williams wrote:
> Oh nice! I just sent a patch [1] to fix this up as well,

Yeah, for some reason it took you a while to see it - wondering if your
mail servers are slow again.

> but mine goes
> after minimizing when it is exported, I think perhaps both are needed.
> 
> http://lore.kernel.org/r/160209507277.2768223.9933672492157583642.stgit@dwillia2-desk3.amr.corp.intel.com

Looks like it. 

Why not rip out COPY_MC_TEST altogether though? Or you wanna do that
after the merge window?

It would be good to not have that export in 5.10 final if it is not
really needed.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [tip: ras/core] x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
  2020-10-07 11:14         ` Borislav Petkov
  2020-10-07 16:45           ` Borislav Petkov
  2020-10-07 18:24           ` [PATCH] x86/mce: Gate copy_mc_fragile() export by CONFIG_COPY_MC_TEST=y Dan Williams
@ 2020-10-08  9:01           ` tip-bot2 for Borislav Petkov
  2 siblings, 0 replies; 30+ messages in thread
From: tip-bot2 for Borislav Petkov @ 2020-10-08  9:01 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Borislav Petkov, x86, LKML

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     b3149ffcdb31a8eb854cc442a389ae0b539bf28a
Gitweb:        https://git.kernel.org/tip/b3149ffcdb31a8eb854cc442a389ae0b539bf28a
Author:        Borislav Petkov <bp@suse.de>
AuthorDate:    Wed, 07 Oct 2020 18:55:35 +02:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Thu, 08 Oct 2020 10:39:21 +02:00

x86/mce: Allow for copy_mc_fragile symbol checksum to be generated

Add asm/mce.h to asm/asm-prototypes.h so that that asm symbol's checksum
can be generated in order to support CONFIG_MODVERSIONS with it and fix:

  WARNING: modpost: EXPORT symbol "copy_mc_fragile" [vmlinux] version \
	  generation failed, symbol will not be versioned.

For reference see:

  4efca4ed05cb ("kbuild: modversions for EXPORT_SYMBOL() for asm")
  334bb7738764 ("x86/kbuild: enable modversions for symbols exported from asm")

Fixes: ec6347bb4339 ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201007111447.GA23257@zn.tnic
---
 arch/x86/include/asm/asm-prototypes.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h
index 5a42f92..51e2bf2 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -5,6 +5,7 @@
 #include <asm/string.h>
 #include <asm/page.h>
 #include <asm/checksum.h>
+#include <asm/mce.h>
 
 #include <asm-generic/asm-prototypes.h>
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  2020-10-07 19:25                 ` Borislav Petkov
@ 2020-10-08 16:59                   ` Dan Williams
  2020-10-08 17:08                     ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-10-08 16:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-tip-commits, Tony Luck, Michael Ellerman, stable, x86, LKML

On Wed, Oct 7, 2020 at 12:54 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Oct 07, 2020 at 11:53:15AM -0700, Dan Williams wrote:
> > Oh nice! I just sent a patch [1] to fix this up as well,
>
> Yeah, for some reason it took you a while to see it - wondering if your
> mail servers are slow again.
>
> > but mine goes
> > after minimizing when it is exported, I think perhaps both are needed.
> >
> > http://lore.kernel.org/r/160209507277.2768223.9933672492157583642.stgit@dwillia2-desk3.amr.corp.intel.com
>
> Looks like it.
>
> Why not rip out COPY_MC_TEST altogether though? Or you wanna do that
> after the merge window?
>
> It would be good to not have that export in 5.10 final if it is not
> really needed.

I'll draft a patch to rip it out. If you have a chance to grab it
before the merge window, great, if not I can funnel it through
nvdimm.git since the bulk of it will be touching
tools/testing/nvdimm/.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  2020-10-08 16:59                   ` Dan Williams
@ 2020-10-08 17:08                     ` Borislav Petkov
  0 siblings, 0 replies; 30+ messages in thread
From: Borislav Petkov @ 2020-10-08 17:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-tip-commits, Tony Luck, Michael Ellerman, stable, x86, LKML

On Thu, Oct 08, 2020 at 09:59:26AM -0700, Dan Williams wrote:
> I'll draft a patch to rip it out. If you have a chance to grab it
> before the merge window, great, if not I can funnel it through
> nvdimm.git since the bulk of it will be touching
> tools/testing/nvdimm/.

Yeah, this is not urgent stuff to queue it now so I'd prefer it if
you funneled it after the merge window. It is enough that it lands in
5.10-final, I'd say, so that the export doesn't see an official release.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-18  0:12 ` Dan Williams
@ 2020-04-18 19:42   ` Linus Torvalds
  0 siblings, 0 replies; 30+ messages in thread
From: Linus Torvalds @ 2020-04-18 19:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Ingo Molnar, X86 ML, stable, Borislav Petkov,
	H. Peter Anvin, Peter Zijlstra, Tony Luck, Erwin Tsaur,
	Linux Kernel Mailing List, linux-nvdimm

On Fri, Apr 17, 2020 at 5:12 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> > @@ -106,12 +108,10 @@ static __always_inline __must_check unsigned long
> >  memcpy_mcsafe(void *dst, const void *src, size_t cnt)
> >  {
> >  #ifdef CONFIG_X86_MCE
> > -       if (static_branch_unlikely(&mcsafe_key))
> > -               return __memcpy_mcsafe(dst, src, cnt);
> > -       else
> > +       if (static_branch_unlikely(&mcsafe_slow_key))
> > +               return memcpy_mcsafe_slow(dst, src, cnt);
> >  #endif
> > -               memcpy(dst, src, cnt);
> > -       return 0;
> > +       return memcpy_mcsafe_fast(dst, src, cnt);
> >  }

It strikes me that I see no advantages to making this an inline function at all.

Even for the good case - where it turns into just a memcpy because MCE
is entirely disabled - it doesn't seem to matter.

The only case that really helps is when the memcpy can be turned into
a single access. Which - and I checked - does exist, with people doing

        r = memcpy_mcsafe(&sb_seq_count, &sb(wc)->seq_count, sizeof(uint64_t));

to read a single 64-bit field which looks aligned to me.

But that code is incredible garbage anyway, since even on a broken
machine, there's no actual reason to use the slow variant for that
whole access that I can tell. The mcsafe copy routines do not do
anything worthwhile for a single access.

So my reaction remains that a lot of this is just completely wrong and
incredibly mis-designed.

Yes, the hardware was buggy garbage. But why should we make things
worse with making the software be incomprehensibly bad too?

              Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread
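
The "single access" case above is the one place where an inline helper could
matter: a lone, aligned load only needs an exception-table annotation to
become recoverable, rather than a byte-careful copy loop. A hypothetical
sketch of that idea, not taken from any patch in this thread, using the
2020-era .fixup/_ASM_EXTABLE_FAULT machinery that the patches themselves rely
on (the helper name and its exact shape are made up for illustration):

	#include <linux/errno.h>
	#include <linux/types.h>
	#include <asm/asm.h>	/* _ASM_EXTABLE_FAULT */

	/* Read one aligned u64; 0 on success, -EFAULT if the load faulted. */
	static inline int read_u64_recoverable(u64 *val, const u64 *src)
	{
		int ret = 0;

		asm volatile("1:	movq %[src], %[val]\n"
			     "2:\n"
			     ".section .fixup, \"ax\"\n"
			     "3:	movl %[efault], %[ret]\n"
			     "	jmp 2b\n"
			     ".previous\n"
			     _ASM_EXTABLE_FAULT(1b, 3b)
			     : [val] "=r" (*val), [ret] "+r" (ret)
			     : [src] "m" (*src), [efault] "i" (-EFAULT));

		return ret;
	}

Whether a #MC taken on that movq is actually recoverable still depends on the
CPU and on the machine check handler finding the _ASM_EXTABLE_FAULT entry,
which is the recoverability question the rest of the thread is about.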

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-10 17:49 [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast Dan Williams
@ 2020-04-18  0:12 ` Dan Williams
  2020-04-18 19:42   ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-04-18  0:12 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: X86 ML, stable, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	Tony Luck, Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm,
	Linus Torvalds

[ add Linus because he had comments the last time memcpy_mcsafe() was
reworked  [1] ]

On Fri, Apr 10, 2020 at 11:06 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> The original memcpy_mcsafe() implementation satisfied two primary
> concerns. It provided a copy routine that avoided known unrecoverable
> scenarios (poison consumption via fast-string instructions / accesses
> across cacheline boundaries), and it provided a fallback to fast plain
> memcpy if the platform did not indicate recovery capability.
>
> However, the platform indication for recovery capability is not
> architectural so it was always going to require an ever growing list of
> opt-in quirks. Also, ongoing hardware improvements to recoverability
> mean that the cross-cacheline-boundary and fast-string poison
> consumption scenarios are no longer concerns on newer platforms. The
> introduction of memcpy_mcsafe_fast(), and resulting reworks, recovers
> performance for these newer CPUs, but without the maintenance burden of
> an opt-in whitelist.
>
> With memcpy_mcsafe_fast() the existing opt-in becomes a blacklist for
> CPUs that benefit from a more careful slower implementation. Every other
> CPU gets a fast, but optionally recoverable implementation.
>
> With this in place memcpy_mcsafe() never falls back to plain memcpy().
> It can be used in any scenario where the caller needs guarantees that
> source or destination access faults / exceptions will be handled.
> Systems without recovery support continue to default to a fast
> implementation while newer systems both default to fast, and support
> recovery, without needing an explicit quirk.
>
> Cc: x86@kernel.org
> Cc: <stable@vger.kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Acked-by: Tony Luck <tony.luck@intel.com>
> Reported-by: Erwin Tsaur <erwin.tsaur@intel.com>
> Tested-by: Erwin Tsaur <erwin.tsaur@intel.com>
> Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Note that I marked this for stable and included a Fixes tag because it
> is arguably a regression that old kernels stop recovering from poison
> consumption on new hardware.

This still applies cleanly to tip/master as of yesterday. Any concerns?

[1]: https://lore.kernel.org/lkml/CA+55aFwJ=DvvcrM725KOYAF_c=uKiQcuHz4taPjbKveOVPycKA@mail.gmail.com/

>
>  arch/x86/include/asm/string_64.h         |   24 ++++++-------
>  arch/x86/include/asm/uaccess_64.h        |    7 +---
>  arch/x86/kernel/cpu/mce/core.c           |    6 ++-
>  arch/x86/kernel/quirks.c                 |    4 +-
>  arch/x86/lib/memcpy_64.S                 |   57 +++++++++++++++++++++++++++---
>  arch/x86/lib/usercopy_64.c               |    6 +--
>  tools/objtool/check.c                    |    3 +-
>  tools/perf/bench/mem-memcpy-x86-64-lib.c |    8 +---
>  tools/testing/nvdimm/test/nfit.c         |    2 +
>  9 files changed, 75 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index 75314c3dbe47..07840fa3582a 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -83,21 +83,23 @@ int strcmp(const char *cs, const char *ct);
>  #endif
>
>  #define __HAVE_ARCH_MEMCPY_MCSAFE 1
> -__must_check unsigned long __memcpy_mcsafe(void *dst, const void *src,
> +__must_check unsigned long memcpy_mcsafe_slow(void *dst, const void *src,
>                 size_t cnt);
> -DECLARE_STATIC_KEY_FALSE(mcsafe_key);
> +__must_check unsigned long memcpy_mcsafe_fast(void *dst, const void *src,
> +               size_t cnt);
> +DECLARE_STATIC_KEY_FALSE(mcsafe_slow_key);
>
>  /**
> - * memcpy_mcsafe - copy memory with indication if a machine check happened
> + * memcpy_mcsafe - copy memory with indication if an exception / fault happened
>   *
>   * @dst:       destination address
>   * @src:       source address
>   * @cnt:       number of bytes to copy
>   *
> - * Low level memory copy function that catches machine checks
> - * We only call into the "safe" function on systems that can
> - * actually do machine check recovery. Everyone else can just
> - * use memcpy().
> + * The slow version is opted into by platform quirks. The fast version
> + * is equivalent to memcpy() regardless of CPU machine-check-recovery
> + * capability, but may still fall back to the slow version if the CPU
> + * lacks fast-string instruction support.
>   *
>   * Return 0 for success, or number of bytes not copied if there was an
>   * exception.
> @@ -106,12 +108,10 @@ static __always_inline __must_check unsigned long
>  memcpy_mcsafe(void *dst, const void *src, size_t cnt)
>  {
>  #ifdef CONFIG_X86_MCE
> -       if (static_branch_unlikely(&mcsafe_key))
> -               return __memcpy_mcsafe(dst, src, cnt);
> -       else
> +       if (static_branch_unlikely(&mcsafe_slow_key))
> +               return memcpy_mcsafe_slow(dst, src, cnt);
>  #endif
> -               memcpy(dst, src, cnt);
> -       return 0;
> +       return memcpy_mcsafe_fast(dst, src, cnt);
>  }
>
>  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 5cd1caa8bc65..f8c0d38c3f45 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -52,12 +52,7 @@ copy_to_user_mcsafe(void *to, const void *from, unsigned len)
>         unsigned long ret;
>
>         __uaccess_begin();
> -       /*
> -        * Note, __memcpy_mcsafe() is explicitly used since it can
> -        * handle exceptions / faults.  memcpy_mcsafe() may fall back to
> -        * memcpy() which lacks this handling.
> -        */
> -       ret = __memcpy_mcsafe(to, from, len);
> +       ret = memcpy_mcsafe(to, from, len);
>         __uaccess_end();
>         return ret;
>  }
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 2c4f949611e4..6bf94d39dc7f 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -2579,13 +2579,13 @@ static void __init mcheck_debugfs_init(void)
>  static void __init mcheck_debugfs_init(void) { }
>  #endif
>
> -DEFINE_STATIC_KEY_FALSE(mcsafe_key);
> -EXPORT_SYMBOL_GPL(mcsafe_key);
> +DEFINE_STATIC_KEY_FALSE(mcsafe_slow_key);
> +EXPORT_SYMBOL_GPL(mcsafe_slow_key);
>
>  static int __init mcheck_late_init(void)
>  {
>         if (mca_cfg.recovery)
> -               static_branch_inc(&mcsafe_key);
> +               static_branch_inc(&mcsafe_slow_key);
>
>         mcheck_debugfs_init();
>         cec_init();
> diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
> index 896d74cb5081..89c88d9de5c4 100644
> --- a/arch/x86/kernel/quirks.c
> +++ b/arch/x86/kernel/quirks.c
> @@ -636,7 +636,7 @@ static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev)
>         pci_read_config_dword(pdev, 0x84, &capid0);
>
>         if (capid0 & 0x10)
> -               static_branch_inc(&mcsafe_key);
> +               static_branch_inc(&mcsafe_slow_key);
>  }
>
>  /* Skylake */
> @@ -653,7 +653,7 @@ static void quirk_intel_purley_xeon_ras_cap(struct pci_dev *pdev)
>          * enabled, so memory machine check recovery is also enabled.
>          */
>         if ((capid0 & 0xc0) == 0xc0 || (capid5 & 0x1e0))
> -               static_branch_inc(&mcsafe_key);
> +               static_branch_inc(&mcsafe_slow_key);
>
>  }
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0ec3, quirk_intel_brickland_xeon_ras_cap);
> diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
> index 56b243b14c3a..b5e4fc1cf99d 100644
> --- a/arch/x86/lib/memcpy_64.S
> +++ b/arch/x86/lib/memcpy_64.S
> @@ -189,11 +189,18 @@ SYM_FUNC_END(memcpy_orig)
>  MCSAFE_TEST_CTL
>
>  /*
> - * __memcpy_mcsafe - memory copy with machine check exception handling
> - * Note that we only catch machine checks when reading the source addresses.
> - * Writes to target are posted and don't generate machine checks.
> + * memcpy_mcsafe_slow() - memory copy with exception handling
> + *
> + * In contrast to memcpy_mcsafe_fast() this version is careful to
> + * never perform a read across a cacheline boundary, and not use
> + * fast-string instruction sequences which are known to be unrecoverable
> + * on CPUs identified by mcsafe_slow_key.
> + *
> + * Note that this only catches machine check exceptions when reading the
> + * source addresses.  Writes to target are posted and don't generate
> + * machine checks. However this does handle protection faults on writes.
>   */
> -SYM_FUNC_START(__memcpy_mcsafe)
> +SYM_FUNC_START(memcpy_mcsafe_slow)
>         cmpl $8, %edx
>         /* Less than 8 bytes? Go to byte copy loop */
>         jb .L_no_whole_words
> @@ -260,8 +267,8 @@ SYM_FUNC_START(__memcpy_mcsafe)
>         xorl %eax, %eax
>  .L_done:
>         ret
> -SYM_FUNC_END(__memcpy_mcsafe)
> -EXPORT_SYMBOL_GPL(__memcpy_mcsafe)
> +SYM_FUNC_END(memcpy_mcsafe_slow)
> +EXPORT_SYMBOL_GPL(memcpy_mcsafe_slow)
>
>         .section .fixup, "ax"
>         /*
> @@ -296,4 +303,42 @@ EXPORT_SYMBOL_GPL(__memcpy_mcsafe)
>         _ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes)
>         _ASM_EXTABLE(.L_write_words, .E_write_words)
>         _ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
> +
> +/*
> + * memcpy_mcsafe_fast - memory copy with exception handling
> + *
> + * Fast string copy + exception handling. If the CPU does support
> + * machine check exception recovery, but does not support recovering
> + * from fast-string exceptions then this CPU needs to be added to the
> + * mcsafe_slow_key set of quirks. Otherwise, absent any machine check
> + * recovery support this version should be no slower than standard
> + * memcpy.
> + */
> +SYM_FUNC_START(memcpy_mcsafe_fast)
> +       ALTERNATIVE "jmp memcpy_mcsafe_slow", "", X86_FEATURE_ERMS
> +       movq %rdi, %rax
> +       movq %rdx, %rcx
> +.L_copy:
> +       rep movsb
> +       /* Copy successful. Return zero */
> +       xorl %eax, %eax
> +       ret
> +SYM_FUNC_END(memcpy_mcsafe_fast)
> +EXPORT_SYMBOL_GPL(memcpy_mcsafe_fast)
> +
> +       .section .fixup, "ax"
> +.E_copy:
> +       /*
> +        * On fault %rcx is updated such that the copy instruction could
> +        * optionally be restarted at the fault position, i.e. it
> +        * contains 'bytes remaining'. A non-zero return indicates error
> +        * to memcpy_mcsafe() users, or indicate short transfers to
> +        * user-copy routines.
> +        */
> +       movq %rcx, %rax
> +       ret
> +
> +       .previous
> +
> +       _ASM_EXTABLE_FAULT(.L_copy, .E_copy)
>  #endif
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index fff28c6f73a2..348c9331748e 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -64,11 +64,7 @@ __visible notrace unsigned long
>  mcsafe_handle_tail(char *to, char *from, unsigned len)
>  {
>         for (; len; --len, to++, from++) {
> -               /*
> -                * Call the assembly routine back directly since
> -                * memcpy_mcsafe() may silently fallback to memcpy.
> -                */
> -               unsigned long rem = __memcpy_mcsafe(to, from, 1);
> +               unsigned long rem = memcpy_mcsafe(to, from, 1);
>
>                 if (rem)
>                         break;
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 4768d91c6d68..bae1f77aae90 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -485,7 +485,8 @@ static const char *uaccess_safe_builtin[] = {
>         "__ubsan_handle_shift_out_of_bounds",
>         /* misc */
>         "csum_partial_copy_generic",
> -       "__memcpy_mcsafe",
> +       "memcpy_mcsafe_slow",
> +       "memcpy_mcsafe_fast",
>         "mcsafe_handle_tail",
>         "ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
>         NULL
> diff --git a/tools/perf/bench/mem-memcpy-x86-64-lib.c b/tools/perf/bench/mem-memcpy-x86-64-lib.c
> index 4130734dde84..23e1747b3a67 100644
> --- a/tools/perf/bench/mem-memcpy-x86-64-lib.c
> +++ b/tools/perf/bench/mem-memcpy-x86-64-lib.c
> @@ -5,17 +5,13 @@
>   */
>  #include <linux/types.h>
>
> -unsigned long __memcpy_mcsafe(void *dst, const void *src, size_t cnt);
> +unsigned long memcpy_mcsafe(void *dst, const void *src, size_t cnt);
>  unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len);
>
>  unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len)
>  {
>         for (; len; --len, to++, from++) {
> -               /*
> -                * Call the assembly routine back directly since
> -                * memcpy_mcsafe() may silently fallback to memcpy.
> -                */
> -               unsigned long rem = __memcpy_mcsafe(to, from, 1);
> +               unsigned long rem = memcpy_mcsafe(to, from, 1);
>
>                 if (rem)
>                         break;
> diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
> index bf6422a6af7f..282722d96f8e 100644
> --- a/tools/testing/nvdimm/test/nfit.c
> +++ b/tools/testing/nvdimm/test/nfit.c
> @@ -3136,7 +3136,7 @@ void mcsafe_test(void)
>                         }
>
>                         mcsafe_test_init(dst, src, 512);
> -                       rem = __memcpy_mcsafe(dst, src, 512);
> +                       rem = memcpy_mcsafe_slow(dst, src, 512);
>                         valid = mcsafe_test_validate(dst, src, 512, expect);
>                         if (rem == expect && valid)
>                                 continue;
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
@ 2020-04-10 17:49 Dan Williams
  2020-04-18  0:12 ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-04-10 17:49 UTC (permalink / raw)
  To: tglx, mingo
  Cc: x86, stable, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	Tony Luck, Erwin Tsaur, Erwin Tsaur, linux-kernel, linux-nvdimm

The original memcpy_mcsafe() implementation satisfied two primary
concerns. It provided a copy routine that avoided known unrecoverable
scenarios (poison consumption via fast-string instructions / accesses
across cacheline boundaries), and it provided a fallback to fast plain
memcpy if the platform did not indicate recovery capability.

However, the platform indication for recovery capability is not
architectural so it was always going to require an ever growing list of
opt-in quirks. Also, ongoing hardware improvements to recoverability
mean that the cross-cacheline-boundary and fast-string poison
consumption scenarios are no longer concerns on newer platforms. The
introduction of memcpy_mcsafe_fast(), and resulting reworks, recovers
performance for these newer CPUs, but without the maintenance burden of
an opt-in whitelist.

With memcpy_mcsafe_fast() the existing opt-in becomes a blacklist for
CPUs that benefit from a more careful slower implementation. Every other
CPU gets a fast, but optionally recoverable implementation.

With this in place memcpy_mcsafe() never falls back to plain memcpy().
It can be used in any scenario where the caller needs guarantees that
source or destination access faults / exceptions will be handled.
Systems without recovery support continue to default to a fast
implementation while newer systems both default to fast, and support
recovery, without needing an explicit quirk.

Cc: x86@kernel.org
Cc: <stable@vger.kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Reported-by: Erwin Tsaur <erwin.tsaur@intel.com>
Tested-by: Erwin Tsaur <erwin.tsaur@intel.com>
Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Note that I marked this for stable and included a Fixes tag because it
is arguably a regression that old kernels stop recovering from poison
consumption on new hardware.

 arch/x86/include/asm/string_64.h         |   24 ++++++-------
 arch/x86/include/asm/uaccess_64.h        |    7 +---
 arch/x86/kernel/cpu/mce/core.c           |    6 ++-
 arch/x86/kernel/quirks.c                 |    4 +-
 arch/x86/lib/memcpy_64.S                 |   57 +++++++++++++++++++++++++++---
 arch/x86/lib/usercopy_64.c               |    6 +--
 tools/objtool/check.c                    |    3 +-
 tools/perf/bench/mem-memcpy-x86-64-lib.c |    8 +---
 tools/testing/nvdimm/test/nfit.c         |    2 +
 9 files changed, 75 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3dbe47..07840fa3582a 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -83,21 +83,23 @@ int strcmp(const char *cs, const char *ct);
 #endif
 
 #define __HAVE_ARCH_MEMCPY_MCSAFE 1
-__must_check unsigned long __memcpy_mcsafe(void *dst, const void *src,
+__must_check unsigned long memcpy_mcsafe_slow(void *dst, const void *src,
 		size_t cnt);
-DECLARE_STATIC_KEY_FALSE(mcsafe_key);
+__must_check unsigned long memcpy_mcsafe_fast(void *dst, const void *src,
+		size_t cnt);
+DECLARE_STATIC_KEY_FALSE(mcsafe_slow_key);
 
 /**
- * memcpy_mcsafe - copy memory with indication if a machine check happened
+ * memcpy_mcsafe - copy memory with indication if an exception / fault happened
  *
  * @dst:	destination address
  * @src:	source address
  * @cnt:	number of bytes to copy
  *
- * Low level memory copy function that catches machine checks
- * We only call into the "safe" function on systems that can
- * actually do machine check recovery. Everyone else can just
- * use memcpy().
+ * The slow version is opted into by platform quirks. The fast version
+ * is equivalent to memcpy() regardless of CPU machine-check-recovery
+ * capability, but may still fall back to the slow version if the CPU
+ * lacks fast-string instruction support.
  *
  * Return 0 for success, or number of bytes not copied if there was an
  * exception.
@@ -106,12 +108,10 @@ static __always_inline __must_check unsigned long
 memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 {
 #ifdef CONFIG_X86_MCE
-	if (static_branch_unlikely(&mcsafe_key))
-		return __memcpy_mcsafe(dst, src, cnt);
-	else
+	if (static_branch_unlikely(&mcsafe_slow_key))
+		return memcpy_mcsafe_slow(dst, src, cnt);
 #endif
-		memcpy(dst, src, cnt);
-	return 0;
+	return memcpy_mcsafe_fast(dst, src, cnt);
 }
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 5cd1caa8bc65..f8c0d38c3f45 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -52,12 +52,7 @@ copy_to_user_mcsafe(void *to, const void *from, unsigned len)
 	unsigned long ret;
 
 	__uaccess_begin();
-	/*
-	 * Note, __memcpy_mcsafe() is explicitly used since it can
-	 * handle exceptions / faults.  memcpy_mcsafe() may fall back to
-	 * memcpy() which lacks this handling.
-	 */
-	ret = __memcpy_mcsafe(to, from, len);
+	ret = memcpy_mcsafe(to, from, len);
 	__uaccess_end();
 	return ret;
 }
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c4f949611e4..6bf94d39dc7f 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2579,13 +2579,13 @@ static void __init mcheck_debugfs_init(void)
 static void __init mcheck_debugfs_init(void) { }
 #endif
 
-DEFINE_STATIC_KEY_FALSE(mcsafe_key);
-EXPORT_SYMBOL_GPL(mcsafe_key);
+DEFINE_STATIC_KEY_FALSE(mcsafe_slow_key);
+EXPORT_SYMBOL_GPL(mcsafe_slow_key);
 
 static int __init mcheck_late_init(void)
 {
 	if (mca_cfg.recovery)
-		static_branch_inc(&mcsafe_key);
+		static_branch_inc(&mcsafe_slow_key);
 
 	mcheck_debugfs_init();
 	cec_init();
diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 896d74cb5081..89c88d9de5c4 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -636,7 +636,7 @@ static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev)
 	pci_read_config_dword(pdev, 0x84, &capid0);
 
 	if (capid0 & 0x10)
-		static_branch_inc(&mcsafe_key);
+		static_branch_inc(&mcsafe_slow_key);
 }
 
 /* Skylake */
@@ -653,7 +653,7 @@ static void quirk_intel_purley_xeon_ras_cap(struct pci_dev *pdev)
 	 * enabled, so memory machine check recovery is also enabled.
 	 */
 	if ((capid0 & 0xc0) == 0xc0 || (capid5 & 0x1e0))
-		static_branch_inc(&mcsafe_key);
+		static_branch_inc(&mcsafe_slow_key);
 
 }
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0ec3, quirk_intel_brickland_xeon_ras_cap);
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 56b243b14c3a..b5e4fc1cf99d 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -189,11 +189,18 @@ SYM_FUNC_END(memcpy_orig)
 MCSAFE_TEST_CTL
 
 /*
- * __memcpy_mcsafe - memory copy with machine check exception handling
- * Note that we only catch machine checks when reading the source addresses.
- * Writes to target are posted and don't generate machine checks.
+ * memcpy_mcsafe_slow() - memory copy with exception handling
+ *
+ * In contrast to memcpy_mcsafe_fast() this version is careful to
+ * never perform a read across a cacheline boundary, and not use
+ * fast-string instruction sequences which are known to be unrecoverable
+ * on CPUs identified by mcsafe_slow_key.
+ *
+ * Note that this only catches machine check exceptions when reading the
+ * source addresses.  Writes to target are posted and don't generate
+ * machine checks. However this does handle protection faults on writes.
  */
-SYM_FUNC_START(__memcpy_mcsafe)
+SYM_FUNC_START(memcpy_mcsafe_slow)
 	cmpl $8, %edx
 	/* Less than 8 bytes? Go to byte copy loop */
 	jb .L_no_whole_words
@@ -260,8 +267,8 @@ SYM_FUNC_START(__memcpy_mcsafe)
 	xorl %eax, %eax
 .L_done:
 	ret
-SYM_FUNC_END(__memcpy_mcsafe)
-EXPORT_SYMBOL_GPL(__memcpy_mcsafe)
+SYM_FUNC_END(memcpy_mcsafe_slow)
+EXPORT_SYMBOL_GPL(memcpy_mcsafe_slow)
 
 	.section .fixup, "ax"
 	/*
@@ -296,4 +303,42 @@ EXPORT_SYMBOL_GPL(__memcpy_mcsafe)
 	_ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes)
 	_ASM_EXTABLE(.L_write_words, .E_write_words)
 	_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
+
+/*
+ * memcpy_mcsafe_fast - memory copy with exception handling
+ *
+ * Fast string copy + exception handling. If the CPU does support
+ * machine check exception recovery, but does not support recovering
+ * from fast-string exceptions then this CPU needs to be added to the
+ * mcsafe_slow_key set of quirks. Otherwise, absent any machine check
+ * recovery support this version should be no slower than standard
+ * memcpy.
+ */
+SYM_FUNC_START(memcpy_mcsafe_fast)
+	ALTERNATIVE "jmp memcpy_mcsafe_slow", "", X86_FEATURE_ERMS
+	movq %rdi, %rax
+	movq %rdx, %rcx
+.L_copy:
+	rep movsb
+	/* Copy successful. Return zero */
+	xorl %eax, %eax
+	ret
+SYM_FUNC_END(memcpy_mcsafe_fast)
+EXPORT_SYMBOL_GPL(memcpy_mcsafe_fast)
+
+	.section .fixup, "ax"
+.E_copy:
+	/*
+	 * On fault %rcx is updated such that the copy instruction could
+	 * optionally be restarted at the fault position, i.e. it
+	 * contains 'bytes remaining'. A non-zero return indicates error
+	 * to memcpy_mcsafe() users, or indicate short transfers to
+	 * user-copy routines.
+	 */
+	movq %rcx, %rax
+	ret
+
+	.previous
+
+	_ASM_EXTABLE_FAULT(.L_copy, .E_copy)
 #endif
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6f73a2..348c9331748e 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -64,11 +64,7 @@ __visible notrace unsigned long
 mcsafe_handle_tail(char *to, char *from, unsigned len)
 {
 	for (; len; --len, to++, from++) {
-		/*
-		 * Call the assembly routine back directly since
-		 * memcpy_mcsafe() may silently fallback to memcpy.
-		 */
-		unsigned long rem = __memcpy_mcsafe(to, from, 1);
+		unsigned long rem = memcpy_mcsafe(to, from, 1);
 
 		if (rem)
 			break;
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 4768d91c6d68..bae1f77aae90 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -485,7 +485,8 @@ static const char *uaccess_safe_builtin[] = {
 	"__ubsan_handle_shift_out_of_bounds",
 	/* misc */
 	"csum_partial_copy_generic",
-	"__memcpy_mcsafe",
+	"memcpy_mcsafe_slow",
+	"memcpy_mcsafe_fast",
 	"mcsafe_handle_tail",
 	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
 	NULL
diff --git a/tools/perf/bench/mem-memcpy-x86-64-lib.c b/tools/perf/bench/mem-memcpy-x86-64-lib.c
index 4130734dde84..23e1747b3a67 100644
--- a/tools/perf/bench/mem-memcpy-x86-64-lib.c
+++ b/tools/perf/bench/mem-memcpy-x86-64-lib.c
@@ -5,17 +5,13 @@
  */
 #include <linux/types.h>
 
-unsigned long __memcpy_mcsafe(void *dst, const void *src, size_t cnt);
+unsigned long memcpy_mcsafe(void *dst, const void *src, size_t cnt);
 unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len);
 
 unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len)
 {
 	for (; len; --len, to++, from++) {
-		/*
-		 * Call the assembly routine back directly since
-		 * memcpy_mcsafe() may silently fallback to memcpy.
-		 */
-		unsigned long rem = __memcpy_mcsafe(to, from, 1);
+		unsigned long rem = memcpy_mcsafe(to, from, 1);
 
 		if (rem)
 			break;
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index bf6422a6af7f..282722d96f8e 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -3136,7 +3136,7 @@ void mcsafe_test(void)
 			}
 
 			mcsafe_test_init(dst, src, 512);
-			rem = __memcpy_mcsafe(dst, src, 512);
+			rem = memcpy_mcsafe_slow(dst, src, 512);
 			valid = mcsafe_test_validate(dst, src, 512, expect);
 			if (rem == expect && valid)
 				continue;


^ permalink raw reply related	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-10-08 17:08 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-18 20:30 [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast Andy Lutomirski
2020-04-18 20:52 ` Linus Torvalds
2020-04-20  5:08   ` Dan Williams
2020-04-20 17:28     ` Linus Torvalds
2020-04-20 18:20       ` Dan Williams
2020-04-20 19:05         ` Linus Torvalds
2020-04-20 19:29           ` Dan Williams
2020-04-20 20:07             ` Linus Torvalds
2020-04-20 20:23               ` Luck, Tony
2020-04-20 20:27                 ` Linus Torvalds
2020-04-20 20:45                   ` Luck, Tony
2020-04-20 20:56                     ` Linus Torvalds
2020-04-20 20:24               ` Dan Williams
2020-04-20 20:46                 ` Linus Torvalds
2020-04-20 20:57                   ` Luck, Tony
2020-04-20 21:16                     ` Linus Torvalds
2020-10-06  9:57       ` [tip: ras/core] x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() tip-bot2 for Dan Williams
2020-10-07 11:14         ` Borislav Petkov
2020-10-07 16:45           ` Borislav Petkov
2020-10-07 17:03             ` Borislav Petkov
2020-10-07 18:53               ` Dan Williams
2020-10-07 19:25                 ` Borislav Petkov
2020-10-08 16:59                   ` Dan Williams
2020-10-08 17:08                     ` Borislav Petkov
2020-10-07 17:51             ` Dan Williams
2020-10-07 18:24           ` [PATCH] x86/mce: Gate copy_mc_fragile() export by CONFIG_COPY_MC_TEST=y Dan Williams
2020-10-08  9:01           ` [tip: ras/core] x86/mce: Allow for copy_mc_fragile symbol checksum to be generated tip-bot2 for Borislav Petkov
  -- strict thread matches above, loose matches on Subject: below --
2020-04-10 17:49 [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast Dan Williams
2020-04-18  0:12 ` Dan Williams
2020-04-18 19:42   ` Linus Torvalds
