linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines
@ 2018-10-05  8:13 Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 1/9] kernel: add support for patchable function pointers Ard Biesheuvel
                   ` (9 more replies)
  0 siblings, 10 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

Linux's crypto API is widely regarded as being difficult to use for cases
where support for asynchronous accelerators or runtime dispatch of algorithms
(i.e., passed in as a string) are not needed. This leads to kludgy code and
also to actual security issues [0], although arguably, using AES in the wrong
mode can be done using any kind of crypto abstraction.

At the moment, the Zinc library [1] is being proposed as a solution for that,
and while it does address the usability problems, it does a lot more than
that, and what we end up with is a lot less flexible than what we have now.

Currently, arch specific implementations (based on SIMD or optimized
assembler) live in arch/*/crypto, where each sub-community is in charge
of its own specialized versions of the various primitives (although still
under the purview of the crypto maintainer). Any proposal to change this
model should be judged on its own merit, and not blindly accepted as the
fallout of cleaning up some library code.

Also, Zinc removes the possibility to plug different versions of a routine,
and instead, keeps all versions in the same module. Currently, the kernel's
module support permits user land to take charge of the policies that decide
which version to use in which context (by loading or blacklisting modules).
Instead, Zinc moves to a model where the policy is hardcoded into the module
(e.g., ChaCha20 on ARM uses NEON unless on Cortex-A5 or A7). In the ARM
community, we have always attempted to avoid hardcoding policy like that:
the ARM SoC space is a very heteregeneous one, and putting policy like that
into the code will lead to an endless stream of patches from silicon vendors
that want tweaks for their cores into various parts of the kernel.

Creating monolithic modules for algorithms also makes it very difficult to
support driver based implementations. The kernel's driver model is heavily
based on modules, using alias strings to match drivers against the hardware.
As an example, there ARM ST platforms that support synchronous hardware based
CRC32, and plugging that into the monolithic Zinc model is difficult if not
impossible.

The primary justification for moving away from the module system is that it
depends heavily on function pointers, and in the post-Spectre world, those
are vulnerable and costly. For this reason, this series proposes a patchable
function pointer abstraction that, in its generic incarnation, consists
simply of function pointers and some plumbing to set or reset them (patch #1)

Patch #2 illustrates how architectures could optimize these abstractions by
using code patching techniques to get rid of the indirect calls, and use
fixed jumps instead. This has the side effect of making the function
pointer const, making it more robust as well.

The remainder of the series shows how we can use this abstraction to clean
up the CRC-T10DIF handling in the kernel, which is a pretty depressing
example of how cumbersome it is to expose crypto API algorithms as library
routines.

Note that the various arch specific implementations are kept in their
original place as modules, which can be automatically dispatched by udev
(as before) based on CPU feature bits.

Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel at vger.kernel.org
Cc: linux-crypto at vger.kernel.org
Cc: linux-arm-kernel at lists.infradead.org
Cc: linuxppc-dev at lists.ozlabs.org

[0] commit 428490e38b2e ("security/keys: rewrite all of big_key crypto")
[1] https://lore.kernel.org/lkml/20180925145622.29959-3-Jason at zx2c4.com/

Ard Biesheuvel (9):
  kernel: add support for patchable function pointers
  arm64: kernel: add arch support for patchable function pointers
  crypto: crc-t10dif - make crc_t10dif a static inline
  crypto: crc-t10dif - use patchable function pointer for core update
    routine
  crypto: crc-t10dif/arm64 - move PMULL based code into core library
  crypto: crc-t10dif/arm - move PMULL based code into core library
  crypto: crct10dif/generic - switch crypto API driver to core library
  crypto: crc-t10dif/powerpc - move PMULL based code into core library
  crypto: crc-t10dif/x86 - move PMULL based code into core library

 arch/Kconfig                                |   3 +
 arch/arm/crypto/crct10dif-ce-glue.c         |  78 +++-------
 arch/arm64/Kconfig                          |   1 +
 arch/arm64/crypto/crct10dif-ce-glue.c       |  61 ++------
 arch/arm64/include/asm/ffp.h                |  35 +++++
 arch/arm64/kernel/insn.c                    |  22 +++
 arch/powerpc/crypto/crct10dif-vpmsum_glue.c |  55 +-------
 arch/x86/crypto/crct10dif-pclmul_glue.c     |  98 ++-----------
 crypto/Kconfig                              |   1 +
 crypto/Makefile                             |   2 +-
 crypto/crct10dif_common.c                   |  82 -----------
 crypto/crct10dif_generic.c                  |   4 +-
 include/linux/crc-t10dif.h                  |  24 +++-
 include/linux/ffp.h                         |  43 ++++++
 lib/Kconfig                                 |   2 -
 lib/crc-t10dif.c                            | 149 +++++++-------------
 16 files changed, 235 insertions(+), 425 deletions(-)
 create mode 100644 arch/arm64/include/asm/ffp.h
 delete mode 100644 crypto/crct10dif_common.c
 create mode 100644 include/linux/ffp.h

-- 
2.11.0

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
@ 2018-10-05  8:13 ` Ard Biesheuvel
  2018-10-05 13:57   ` Peter Zijlstra
  2018-10-05 14:14   ` Peter Zijlstra
  2018-10-05  8:13 ` [RFC PATCH 2/9] arm64: kernel: add arch " Ard Biesheuvel
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

Add a function pointer abstraction that can be implemented by the arch
in a manner that avoids the downsides of function pointers, i.e., the
fact that they are typically located in a writable data section, and
their vulnerability to Spectre like defects.

The FFP (or fast function pointer) is callable as a function, since
the generic incarnation is simply that. However, due to the fact that
C does not distinguish between functions and function pointers at the
call site, the architecture can instead emit it as a patchable sequence
of instructions consisting of ordinary branches.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/Kconfig        |  3 ++
 include/linux/ffp.h | 43 ++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..2af3442a61d3 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -862,6 +862,9 @@ config HAVE_ARCH_PREL32_RELOCATIONS
 	  architectures, and don't require runtime relocation on relocatable
 	  kernels.
 
+config HAVE_ARCH_FFP
+	bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/include/linux/ffp.h b/include/linux/ffp.h
new file mode 100644
index 000000000000..8fc3b4c9b38f
--- /dev/null
+++ b/include/linux/ffp.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_FFP_H
+#define __LINUX_FFP_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+
+#ifdef CONFIG_HAVE_ARCH_FFP
+#include <asm/ffp.h>
+#else
+
+struct ffp {
+	void	(**fn)(void);
+	void	(*default_fn)(void);
+};
+
+#define DECLARE_FFP(_fn, _def)						\
+	extern typeof(_def) *_fn;					\
+	extern struct ffp const __ffp_ ## _fn
+
+#define DEFINE_FFP(_fn, _def)						\
+	typeof(_def) *_fn = &_def;					\
+	struct ffp const __ffp_ ## _fn					\
+		= { (void(**)(void))&_fn, (void(*)(void))&_def };	\
+	EXPORT_SYMBOL(__ffp_ ## _fn)
+
+static inline void ffp_set_target(const struct ffp *m, void *new_fn)
+{
+	WRITE_ONCE(*m->fn, new_fn);
+}
+
+static inline void ffp_reset_target(const struct ffp *m)
+{
+	WRITE_ONCE(*m->fn, m->default_fn);
+}
+
+#endif
+
+#define SET_FFP(_fn, _new)	ffp_set_target(&__ffp_ ## _fn, _new)
+#define RESET_FFP(_fn)		ffp_reset_target(&__ffp_ ## _fn)
+
+#endif
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 2/9] arm64: kernel: add arch support for patchable function pointers
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 1/9] kernel: add support for patchable function pointers Ard Biesheuvel
@ 2018-10-05  8:13 ` Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 3/9] crypto: crc-t10dif - make crc_t10dif a static inline Ard Biesheuvel
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

Implement arm64 support for patchable function pointers by emitting
them as branch instructions (and a couple of NOPs in case the new
target is out of range of a normal branch instruction.)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/Kconfig           |  1 +
 arch/arm64/include/asm/ffp.h | 35 ++++++++++++++++++++
 arch/arm64/kernel/insn.c     | 22 ++++++++++++
 3 files changed, 58 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1b1a0e95c751..db8c9e51c56d 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -102,6 +102,7 @@ config ARM64
 	select HAVE_ALIGNED_STRUCT_PAGE if SLUB
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_BITREVERSE
+	select HAVE_ARCH_FFP
 	select HAVE_ARCH_HUGE_VMAP
 	select HAVE_ARCH_JUMP_LABEL
 	select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
diff --git a/arch/arm64/include/asm/ffp.h b/arch/arm64/include/asm/ffp.h
new file mode 100644
index 000000000000..678dc1262218
--- /dev/null
+++ b/arch/arm64/include/asm/ffp.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_FFP_H
+#define __ASM_FFP_H
+
+struct ffp {
+	u32	insn[5];
+	u32	def_branch;
+};
+
+#define DECLARE_FFP(_fn, _def)						\
+	extern typeof(_def) _fn;					\
+	extern struct ffp const __ffp_ ## _fn
+
+#define DEFINE_FFP(_fn, _def)						\
+	DECLARE_FFP(_fn, _def);						\
+	asm("	.pushsection	\".text\", \"ax\", %progbits	\n"	\
+	    "	.align		3				\n"	\
+	    "	.globl		" #_fn "			\n"	\
+	    "	.globl		__ffp_" #_fn "			\n"	\
+	    #_fn " :						\n"	\
+	    "__ffp_" #_fn " :					\n"	\
+	    "		b	" #_def "			\n"	\
+	    "		nop					\n"	\
+	    "		nop					\n"	\
+	    "		nop					\n"	\
+	    "		nop					\n"	\
+	    "		b	" #_def "			\n"	\
+	    "	.popsection					\n");	\
+	EXPORT_SYMBOL(__ffp_ ## _fn)
+
+extern void ffp_set_target(const struct ffp *m, void *new_fn);
+extern void ffp_reset_target(const struct ffp *m);
+
+#endif
diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
index 2b3413549734..a2ed547fd171 100644
--- a/arch/arm64/kernel/insn.c
+++ b/arch/arm64/kernel/insn.c
@@ -30,6 +30,7 @@
 #include <asm/cacheflush.h>
 #include <asm/debug-monitors.h>
 #include <asm/fixmap.h>
+#include <asm/ffp.h>
 #include <asm/insn.h>
 #include <asm/kprobes.h>
 
@@ -1603,3 +1604,24 @@ u32 aarch64_insn_gen_extr(enum aarch64_insn_variant variant,
 	insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, Rn);
 	return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, Rm);
 }
+
+void ffp_set_target(const struct ffp *m, void *new_fn)
+{
+	u32 branch = aarch64_insn_gen_branch_imm((u64)m, (u64)new_fn,
+						 AARCH64_INSN_BRANCH_NOLINK);
+
+	if (branch == AARCH64_BREAK_FAULT) {
+		/* TODO out of range - use a PLT sequence instead */
+	} else {
+		aarch64_insn_patch_text((void *[]){ (void *)m }, &branch, 1);
+	}
+}
+EXPORT_SYMBOL(ffp_set_target);
+
+void ffp_reset_target(const struct ffp *m)
+{
+	u32 branch = le32_to_cpu(m->def_branch);
+
+	aarch64_insn_patch_text((void *[]){ (void *)m }, &branch, 1);
+}
+EXPORT_SYMBOL(ffp_reset_target);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 3/9] crypto: crc-t10dif - make crc_t10dif a static inline
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 1/9] kernel: add support for patchable function pointers Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 2/9] arm64: kernel: add arch " Ard Biesheuvel
@ 2018-10-05  8:13 ` Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 4/9] crypto: crc-t10dif - use patchable function pointer for core update routine Ard Biesheuvel
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

crc_t10dif() is a trivial wrapper around crc_t10dif_update() so move
it into the header file as a static inline function.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 include/linux/crc-t10dif.h | 6 +++++-
 lib/crc-t10dif.c           | 6 ------
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/include/linux/crc-t10dif.h b/include/linux/crc-t10dif.h
index 6bb0c0bf357b..4dfe09ff4cf2 100644
--- a/include/linux/crc-t10dif.h
+++ b/include/linux/crc-t10dif.h
@@ -10,7 +10,11 @@
 
 extern __u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer,
 				size_t len);
-extern __u16 crc_t10dif(unsigned char const *, size_t);
 extern __u16 crc_t10dif_update(__u16 crc, unsigned char const *, size_t);
 
+static inline __u16 crc_t10dif(const unsigned char *buffer, size_t len)
+{
+	return crc_t10dif_update(0, buffer, len);
+}
+
 #endif
diff --git a/lib/crc-t10dif.c b/lib/crc-t10dif.c
index 4d0d47c1ffbd..036ee664c9e1 100644
--- a/lib/crc-t10dif.c
+++ b/lib/crc-t10dif.c
@@ -81,12 +81,6 @@ __u16 crc_t10dif_update(__u16 crc, const unsigned char *buffer, size_t len)
 }
 EXPORT_SYMBOL(crc_t10dif_update);
 
-__u16 crc_t10dif(const unsigned char *buffer, size_t len)
-{
-	return crc_t10dif_update(0, buffer, len);
-}
-EXPORT_SYMBOL(crc_t10dif);
-
 static int __init crc_t10dif_mod_init(void)
 {
 	crypto_register_notifier(&crc_t10dif_nb);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 4/9] crypto: crc-t10dif - use patchable function pointer for core update routine
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2018-10-05  8:13 ` [RFC PATCH 3/9] crypto: crc-t10dif - make crc_t10dif a static inline Ard Biesheuvel
@ 2018-10-05  8:13 ` Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 5/9] crypto: crc-t10dif/arm64 - move PMULL based code into core library Ard Biesheuvel
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

Use a patchable function pointer for the core CRC-T10DIF transform so
that we can allow modules to supersede this transform based on platform
capabilities.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/Kconfig             |   1 +
 crypto/Makefile            |   2 +-
 crypto/crct10dif_common.c  |  82 -----------
 include/linux/crc-t10dif.h |  18 ++-
 lib/Kconfig                |   2 -
 lib/crc-t10dif.c           | 143 +++++++-------------
 6 files changed, 69 insertions(+), 179 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 63ef279322d2..009c736a082e 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -611,6 +611,7 @@ config CRYPTO_CRC32_MIPS
 config CRYPTO_CRCT10DIF
 	tristate "CRCT10DIF algorithm"
 	select CRYPTO_HASH
+	select CRCT10DIF
 	help
 	  CRC T10 Data Integrity Field computation is being cast as
 	  a crypto transform.  This allows for faster crc t10 diff
diff --git a/crypto/Makefile b/crypto/Makefile
index 5c207c76abf7..0c2491a5a657 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -122,7 +122,7 @@ obj-$(CONFIG_CRYPTO_DEFLATE) += deflate.o
 obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o
 obj-$(CONFIG_CRYPTO_CRC32C) += crc32c_generic.o
 obj-$(CONFIG_CRYPTO_CRC32) += crc32_generic.o
-obj-$(CONFIG_CRYPTO_CRCT10DIF) += crct10dif_common.o crct10dif_generic.o
+obj-$(CONFIG_CRYPTO_CRCT10DIF) += crct10dif_generic.o
 obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o
 obj-$(CONFIG_CRYPTO_LZO) += lzo.o
 obj-$(CONFIG_CRYPTO_LZ4) += lz4.o
diff --git a/crypto/crct10dif_common.c b/crypto/crct10dif_common.c
deleted file mode 100644
index b2fab366f518..000000000000
--- a/crypto/crct10dif_common.c
+++ /dev/null
@@ -1,82 +0,0 @@
-/*
- * Cryptographic API.
- *
- * T10 Data Integrity Field CRC16 Crypto Transform
- *
- * Copyright (c) 2007 Oracle Corporation.  All rights reserved.
- * Written by Martin K. Petersen <martin.petersen@oracle.com>
- * Copyright (C) 2013 Intel Corporation
- * Author: Tim Chen <tim.c.chen@linux.intel.com>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or (at your option)
- * any later version.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- */
-
-#include <linux/crc-t10dif.h>
-#include <linux/module.h>
-#include <linux/kernel.h>
-
-/* Table generated using the following polynomium:
- * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
- * gt: 0x8bb7
- */
-static const __u16 t10_dif_crc_table[256] = {
-	0x0000, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B,
-	0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6,
-	0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6,
-	0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B,
-	0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1,
-	0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C,
-	0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C,
-	0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781,
-	0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8,
-	0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255,
-	0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925,
-	0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698,
-	0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472,
-	0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF,
-	0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF,
-	0xCA69, 0x41DE, 0x56B0, 0xDD07, 0x786C, 0xF3DB, 0xE4B5, 0x6F02,
-	0x3AB1, 0xB106, 0xA668, 0x2DDF, 0x88B4, 0x0303, 0x146D, 0x9FDA,
-	0xD50C, 0x5EBB, 0x49D5, 0xC262, 0x6709, 0xECBE, 0xFBD0, 0x7067,
-	0x6E7C, 0xE5CB, 0xF2A5, 0x7912, 0xDC79, 0x57CE, 0x40A0, 0xCB17,
-	0x81C1, 0x0A76, 0x1D18, 0x96AF, 0x33C4, 0xB873, 0xAF1D, 0x24AA,
-	0x932B, 0x189C, 0x0FF2, 0x8445, 0x212E, 0xAA99, 0xBDF7, 0x3640,
-	0x7C96, 0xF721, 0xE04F, 0x6BF8, 0xCE93, 0x4524, 0x524A, 0xD9FD,
-	0xC7E6, 0x4C51, 0x5B3F, 0xD088, 0x75E3, 0xFE54, 0xE93A, 0x628D,
-	0x285B, 0xA3EC, 0xB482, 0x3F35, 0x9A5E, 0x11E9, 0x0687, 0x8D30,
-	0xE232, 0x6985, 0x7EEB, 0xF55C, 0x5037, 0xDB80, 0xCCEE, 0x4759,
-	0x0D8F, 0x8638, 0x9156, 0x1AE1, 0xBF8A, 0x343D, 0x2353, 0xA8E4,
-	0xB6FF, 0x3D48, 0x2A26, 0xA191, 0x04FA, 0x8F4D, 0x9823, 0x1394,
-	0x5942, 0xD2F5, 0xC59B, 0x4E2C, 0xEB47, 0x60F0, 0x779E, 0xFC29,
-	0x4BA8, 0xC01F, 0xD771, 0x5CC6, 0xF9AD, 0x721A, 0x6574, 0xEEC3,
-	0xA415, 0x2FA2, 0x38CC, 0xB37B, 0x1610, 0x9DA7, 0x8AC9, 0x017E,
-	0x1F65, 0x94D2, 0x83BC, 0x080B, 0xAD60, 0x26D7, 0x31B9, 0xBA0E,
-	0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3
-};
-
-__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len)
-{
-	unsigned int i;
-
-	for (i = 0 ; i < len ; i++)
-		crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 0xff];
-
-	return crc;
-}
-EXPORT_SYMBOL(crc_t10dif_generic);
-
-MODULE_DESCRIPTION("T10 DIF CRC calculation common code");
-MODULE_LICENSE("GPL");
diff --git a/include/linux/crc-t10dif.h b/include/linux/crc-t10dif.h
index 4dfe09ff4cf2..ae4f46723f5d 100644
--- a/include/linux/crc-t10dif.h
+++ b/include/linux/crc-t10dif.h
@@ -3,14 +3,28 @@
 #define _LINUX_CRC_T10DIF_H
 
 #include <linux/types.h>
+#include <linux/ffp.h>
+#include <linux/rcupdate.h>
 
 #define CRC_T10DIF_DIGEST_SIZE 2
 #define CRC_T10DIF_BLOCK_SIZE 1
-#define CRC_T10DIF_STRING "crct10dif"
 
 extern __u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer,
 				size_t len);
-extern __u16 crc_t10dif_update(__u16 crc, unsigned char const *, size_t);
+
+DECLARE_FFP(crc_t10dif_update_arch, crc_t10dif_generic);
+
+static inline __u16 crc_t10dif_update(__u16 crc, unsigned char const *buffer,
+				      size_t len)
+{
+	__u16 ret;
+
+	rcu_read_lock();
+	ret = crc_t10dif_update_arch(crc, buffer, len);
+	rcu_read_unlock();
+
+	return ret;
+}
 
 static inline __u16 crc_t10dif(const unsigned char *buffer, size_t len)
 {
diff --git a/lib/Kconfig b/lib/Kconfig
index a3928d4438b5..01a8a8daf08f 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -89,8 +89,6 @@ config CRC16
 
 config CRC_T10DIF
 	tristate "CRC calculation for the T10 Data Integrity Field"
-	select CRYPTO
-	select CRYPTO_CRCT10DIF
 	help
 	  This option is only needed if a module that's not in the
 	  kernel tree needs to calculate CRC checks for use with the
diff --git a/lib/crc-t10dif.c b/lib/crc-t10dif.c
index 036ee664c9e1..54dd4b7a763c 100644
--- a/lib/crc-t10dif.c
+++ b/lib/crc-t10dif.c
@@ -3,9 +3,13 @@
  *
  * Copyright (c) 2007 Oracle Corporation.  All rights reserved.
  * Written by Martin K. Petersen <martin.petersen@oracle.com>
+ * Copyright (C) 2013 Intel Corporation
+ * Author: Tim Chen <tim.c.chen@linux.intel.com>
  *
- * This source code is licensed under the GNU General Public License,
- * Version 2. See the file COPYING for more details.
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
  */
 
 #include <linux/types.h>
@@ -13,104 +17,59 @@
 #include <linux/crc-t10dif.h>
 #include <linux/err.h>
 #include <linux/init.h>
-#include <crypto/hash.h>
-#include <crypto/algapi.h>
-#include <linux/static_key.h>
-#include <linux/notifier.h>
 
-static struct crypto_shash __rcu *crct10dif_tfm;
-static struct static_key crct10dif_fallback __read_mostly;
-static DEFINE_MUTEX(crc_t10dif_mutex);
-
-static int crc_t10dif_rehash(struct notifier_block *self, unsigned long val, void *data)
-{
-	struct crypto_alg *alg = data;
-	struct crypto_shash *new, *old;
-
-	if (val != CRYPTO_MSG_ALG_LOADED ||
-	    static_key_false(&crct10dif_fallback) ||
-	    strncmp(alg->cra_name, CRC_T10DIF_STRING, strlen(CRC_T10DIF_STRING)))
-		return 0;
-
-	mutex_lock(&crc_t10dif_mutex);
-	old = rcu_dereference_protected(crct10dif_tfm,
-					lockdep_is_held(&crc_t10dif_mutex));
-	if (!old) {
-		mutex_unlock(&crc_t10dif_mutex);
-		return 0;
-	}
-	new = crypto_alloc_shash("crct10dif", 0, 0);
-	if (IS_ERR(new)) {
-		mutex_unlock(&crc_t10dif_mutex);
-		return 0;
-	}
-	rcu_assign_pointer(crct10dif_tfm, new);
-	mutex_unlock(&crc_t10dif_mutex);
-
-	synchronize_rcu();
-	crypto_free_shash(old);
-	return 0;
-}
-
-static struct notifier_block crc_t10dif_nb = {
-	.notifier_call = crc_t10dif_rehash,
+/* Table generated using the following polynomium:
+ * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
+ * gt: 0x8bb7
+ */
+static const __u16 t10_dif_crc_table[256] __cacheline_aligned = {
+	0x0000, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B,
+	0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6,
+	0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6,
+	0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B,
+	0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1,
+	0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C,
+	0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C,
+	0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781,
+	0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8,
+	0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255,
+	0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925,
+	0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698,
+	0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472,
+	0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF,
+	0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF,
+	0xCA69, 0x41DE, 0x56B0, 0xDD07, 0x786C, 0xF3DB, 0xE4B5, 0x6F02,
+	0x3AB1, 0xB106, 0xA668, 0x2DDF, 0x88B4, 0x0303, 0x146D, 0x9FDA,
+	0xD50C, 0x5EBB, 0x49D5, 0xC262, 0x6709, 0xECBE, 0xFBD0, 0x7067,
+	0x6E7C, 0xE5CB, 0xF2A5, 0x7912, 0xDC79, 0x57CE, 0x40A0, 0xCB17,
+	0x81C1, 0x0A76, 0x1D18, 0x96AF, 0x33C4, 0xB873, 0xAF1D, 0x24AA,
+	0x932B, 0x189C, 0x0FF2, 0x8445, 0x212E, 0xAA99, 0xBDF7, 0x3640,
+	0x7C96, 0xF721, 0xE04F, 0x6BF8, 0xCE93, 0x4524, 0x524A, 0xD9FD,
+	0xC7E6, 0x4C51, 0x5B3F, 0xD088, 0x75E3, 0xFE54, 0xE93A, 0x628D,
+	0x285B, 0xA3EC, 0xB482, 0x3F35, 0x9A5E, 0x11E9, 0x0687, 0x8D30,
+	0xE232, 0x6985, 0x7EEB, 0xF55C, 0x5037, 0xDB80, 0xCCEE, 0x4759,
+	0x0D8F, 0x8638, 0x9156, 0x1AE1, 0xBF8A, 0x343D, 0x2353, 0xA8E4,
+	0xB6FF, 0x3D48, 0x2A26, 0xA191, 0x04FA, 0x8F4D, 0x9823, 0x1394,
+	0x5942, 0xD2F5, 0xC59B, 0x4E2C, 0xEB47, 0x60F0, 0x779E, 0xFC29,
+	0x4BA8, 0xC01F, 0xD771, 0x5CC6, 0xF9AD, 0x721A, 0x6574, 0xEEC3,
+	0xA415, 0x2FA2, 0x38CC, 0xB37B, 0x1610, 0x9DA7, 0x8AC9, 0x017E,
+	0x1F65, 0x94D2, 0x83BC, 0x080B, 0xAD60, 0x26D7, 0x31B9, 0xBA0E,
+	0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3
 };
 
-__u16 crc_t10dif_update(__u16 crc, const unsigned char *buffer, size_t len)
-{
-	struct {
-		struct shash_desc shash;
-		char ctx[2];
-	} desc;
-	int err;
-
-	if (static_key_false(&crct10dif_fallback))
-		return crc_t10dif_generic(crc, buffer, len);
-
-	rcu_read_lock();
-	desc.shash.tfm = rcu_dereference(crct10dif_tfm);
-	desc.shash.flags = 0;
-	*(__u16 *)desc.ctx = crc;
-
-	err = crypto_shash_update(&desc.shash, buffer, len);
-	rcu_read_unlock();
-
-	BUG_ON(err);
-
-	return *(__u16 *)desc.ctx;
-}
-EXPORT_SYMBOL(crc_t10dif_update);
-
-static int __init crc_t10dif_mod_init(void)
+__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len)
 {
-	crypto_register_notifier(&crc_t10dif_nb);
-	crct10dif_tfm = crypto_alloc_shash("crct10dif", 0, 0);
-	if (IS_ERR(crct10dif_tfm)) {
-		static_key_slow_inc(&crct10dif_fallback);
-		crct10dif_tfm = NULL;
-	}
-	return 0;
-}
+	unsigned int i;
 
-static void __exit crc_t10dif_mod_fini(void)
-{
-	crypto_unregister_notifier(&crc_t10dif_nb);
-	crypto_free_shash(crct10dif_tfm);
-}
-
-module_init(crc_t10dif_mod_init);
-module_exit(crc_t10dif_mod_fini);
-
-static int crc_t10dif_transform_show(char *buffer, const struct kernel_param *kp)
-{
-	if (static_key_false(&crct10dif_fallback))
-		return sprintf(buffer, "fallback\n");
+	for (i = 0 ; i < len ; i++)
+		crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 0xff];
 
-	return sprintf(buffer, "%s\n",
-		crypto_tfm_alg_driver_name(crypto_shash_tfm(crct10dif_tfm)));
+	return crc;
 }
+EXPORT_SYMBOL(crc_t10dif_generic);
 
-module_param_call(transform, NULL, crc_t10dif_transform_show, NULL, 0644);
+DEFINE_FFP(crc_t10dif_update_arch, crc_t10dif_generic);
+EXPORT_SYMBOL(crc_t10dif_update_arch);
 
 MODULE_DESCRIPTION("T10 DIF CRC calculation");
 MODULE_LICENSE("GPL");
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 5/9] crypto: crc-t10dif/arm64 - move PMULL based code into core library
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2018-10-05  8:13 ` [RFC PATCH 4/9] crypto: crc-t10dif - use patchable function pointer for core update routine Ard Biesheuvel
@ 2018-10-05  8:13 ` Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 6/9] crypto: crc-t10dif/arm " Ard Biesheuvel
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

Move the PMULL based routines out of the crypto API into the core
CRC-T10DIF library.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/crct10dif-ce-glue.c | 61 +++++---------------
 1 file changed, 14 insertions(+), 47 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-glue.c b/arch/arm64/crypto/crct10dif-ce-glue.c
index b461d62023f2..c67db86aba2c 100644
--- a/arch/arm64/crypto/crct10dif-ce-glue.c
+++ b/arch/arm64/crypto/crct10dif-ce-glue.c
@@ -10,13 +10,12 @@
 
 #include <linux/cpufeature.h>
 #include <linux/crc-t10dif.h>
+#include <linux/ffp.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/string.h>
 
-#include <crypto/internal/hash.h>
-
 #include <asm/neon.h>
 #include <asm/simd.h>
 
@@ -25,27 +24,17 @@
 asmlinkage u16 crc_t10dif_pmull_p64(u16 init_crc, const u8 buf[], u64 len);
 asmlinkage u16 crc_t10dif_pmull_p8(u16 init_crc, const u8 buf[], u64 len);
 
-static u16 (*crc_t10dif_pmull)(u16 init_crc, const u8 buf[], u64 len);
-
-static int crct10dif_init(struct shash_desc *desc)
-{
-	u16 *crc = shash_desc_ctx(desc);
-
-	*crc = 0;
-	return 0;
-}
+DEFINE_FFP(crc_t10dif_pmull, crc_t10dif_pmull_p8);
 
-static int crct10dif_update(struct shash_desc *desc, const u8 *data,
-			    unsigned int length)
+static __u16 crct10dif_update_arm64(u16 crc, const u8 *data, unsigned int length)
 {
-	u16 *crc = shash_desc_ctx(desc);
 	unsigned int l;
 
 	if (unlikely((u64)data % CRC_T10DIF_PMULL_CHUNK_SIZE)) {
 		l = min_t(u32, length, CRC_T10DIF_PMULL_CHUNK_SIZE -
 			  ((u64)data % CRC_T10DIF_PMULL_CHUNK_SIZE));
 
-		*crc = crc_t10dif_generic(*crc, data, l);
+		crc = crc_t10dif_generic(crc, data, l);
 
 		length -= l;
 		data += l;
@@ -54,51 +43,31 @@ static int crct10dif_update(struct shash_desc *desc, const u8 *data,
 	if (length > 0) {
 		if (may_use_simd()) {
 			kernel_neon_begin();
-			*crc = crc_t10dif_pmull(*crc, data, length);
+			crc = crc_t10dif_pmull(crc, data, length);
 			kernel_neon_end();
 		} else {
-			*crc = crc_t10dif_generic(*crc, data, length);
+			crc = crc_t10dif_generic(crc, data, length);
 		}
 	}
 
-	return 0;
-}
-
-static int crct10dif_final(struct shash_desc *desc, u8 *out)
-{
-	u16 *crc = shash_desc_ctx(desc);
-
-	*(u16 *)out = *crc;
-	return 0;
+	return crc;
 }
 
-static struct shash_alg crc_t10dif_alg = {
-	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
-	.init			= crct10dif_init,
-	.update			= crct10dif_update,
-	.final			= crct10dif_final,
-	.descsize		= CRC_T10DIF_DIGEST_SIZE,
-
-	.base.cra_name		= "crct10dif",
-	.base.cra_driver_name	= "crct10dif-arm64-ce",
-	.base.cra_priority	= 200,
-	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-};
-
 static int __init crc_t10dif_mod_init(void)
 {
 	if (elf_hwcap & HWCAP_PMULL)
-		crc_t10dif_pmull = crc_t10dif_pmull_p64;
-	else
-		crc_t10dif_pmull = crc_t10dif_pmull_p8;
+		SET_FFP(crc_t10dif_pmull, crc_t10dif_pmull_p64);
 
-	return crypto_register_shash(&crc_t10dif_alg);
+	SET_FFP(crc_t10dif_update_arch, crct10dif_update_arm64);
+
+	return 0;
 }
 
 static void __exit crc_t10dif_mod_exit(void)
 {
-	crypto_unregister_shash(&crc_t10dif_alg);
+	RESET_FFP(crc_t10dif_update_arch);
+	/* wait for all users to switch back to the old version */
+	synchronize_rcu();
 }
 
 module_cpu_feature_match(ASIMD, crc_t10dif_mod_init);
@@ -106,5 +75,3 @@ module_exit(crc_t10dif_mod_exit);
 
 MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
 MODULE_LICENSE("GPL v2");
-MODULE_ALIAS_CRYPTO("crct10dif");
-MODULE_ALIAS_CRYPTO("crct10dif-arm64-ce");
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 6/9] crypto: crc-t10dif/arm - move PMULL based code into core library
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2018-10-05  8:13 ` [RFC PATCH 5/9] crypto: crc-t10dif/arm64 - move PMULL based code into core library Ard Biesheuvel
@ 2018-10-05  8:13 ` Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 7/9] crypto: crct10dif/generic - switch crypto API driver to " Ard Biesheuvel
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

Move the PMULL based routines out of the crypto API into the core
CRC-T10DIF library.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/crct10dif-ce-glue.c | 78 ++++++--------------
 1 file changed, 22 insertions(+), 56 deletions(-)

diff --git a/arch/arm/crypto/crct10dif-ce-glue.c b/arch/arm/crypto/crct10dif-ce-glue.c
index d428355cf38d..2020d9c29ff1 100644
--- a/arch/arm/crypto/crct10dif-ce-glue.c
+++ b/arch/arm/crypto/crct10dif-ce-glue.c
@@ -10,12 +10,11 @@
 
 #include <linux/crc-t10dif.h>
 #include <linux/init.h>
+#include <linux/ffp.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/string.h>
 
-#include <crypto/internal/hash.h>
-
 #include <asm/neon.h>
 #include <asm/simd.h>
 
@@ -23,74 +22,42 @@
 
 asmlinkage u16 crc_t10dif_pmull(u16 init_crc, const u8 buf[], u32 len);
 
-static int crct10dif_init(struct shash_desc *desc)
+static __u16 crct10dif_update_arm(__u16 crc, const u8 *data, unsigned int length)
 {
-	u16 *crc = shash_desc_ctx(desc);
-
-	*crc = 0;
-	return 0;
-}
-
-static int crct10dif_update(struct shash_desc *desc, const u8 *data,
-			    unsigned int length)
-{
-	u16 *crc = shash_desc_ctx(desc);
 	unsigned int l;
 
-	if (!may_use_simd()) {
-		*crc = crc_t10dif_generic(*crc, data, length);
-	} else {
-		if (unlikely((u32)data % CRC_T10DIF_PMULL_CHUNK_SIZE)) {
-			l = min_t(u32, length, CRC_T10DIF_PMULL_CHUNK_SIZE -
-				  ((u32)data % CRC_T10DIF_PMULL_CHUNK_SIZE));
+	if (!may_use_simd())
+		return crc_t10dif_generic(crc, data, length);
 
-			*crc = crc_t10dif_generic(*crc, data, l);
+	if (unlikely((u32)data % CRC_T10DIF_PMULL_CHUNK_SIZE)) {
+		l = min_t(u32, length, CRC_T10DIF_PMULL_CHUNK_SIZE -
+			  ((u32)data % CRC_T10DIF_PMULL_CHUNK_SIZE));
 
-			length -= l;
-			data += l;
-		}
-		if (length > 0) {
-			kernel_neon_begin();
-			*crc = crc_t10dif_pmull(*crc, data, length);
-			kernel_neon_end();
-		}
-	}
-	return 0;
-}
+		crc = crc_t10dif_generic(crc, data, l);
 
-static int crct10dif_final(struct shash_desc *desc, u8 *out)
-{
-	u16 *crc = shash_desc_ctx(desc);
-
-	*(u16 *)out = *crc;
-	return 0;
+		length -= l;
+		data += l;
+	}
+	if (length > 0) {
+		kernel_neon_begin();
+		crc = crc_t10dif_pmull(crc, data, length);
+		kernel_neon_end();
+	}
+	return crc;
 }
 
-static struct shash_alg crc_t10dif_alg = {
-	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
-	.init			= crct10dif_init,
-	.update			= crct10dif_update,
-	.final			= crct10dif_final,
-	.descsize		= CRC_T10DIF_DIGEST_SIZE,
-
-	.base.cra_name		= "crct10dif",
-	.base.cra_driver_name	= "crct10dif-arm-ce",
-	.base.cra_priority	= 200,
-	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
-	.base.cra_module	= THIS_MODULE,
-};
-
 static int __init crc_t10dif_mod_init(void)
 {
-	if (!(elf_hwcap2 & HWCAP2_PMULL))
-		return -ENODEV;
+	SET_FFP(crc_t10dif_update_arch, crct10dif_update_arm);
 
-	return crypto_register_shash(&crc_t10dif_alg);
+	return 0;
 }
 
 static void __exit crc_t10dif_mod_exit(void)
 {
-	crypto_unregister_shash(&crc_t10dif_alg);
+	RESET_FFP(crc_t10dif_update_arch);
+	/* wait for all users to switch back to the old version */
+	synchronize_rcu();
 }
 
 module_init(crc_t10dif_mod_init);
@@ -98,4 +65,3 @@ module_exit(crc_t10dif_mod_exit);
 
 MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
 MODULE_LICENSE("GPL v2");
-MODULE_ALIAS_CRYPTO("crct10dif");
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 7/9] crypto: crct10dif/generic - switch crypto API driver to core library
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
                   ` (5 preceding siblings ...)
  2018-10-05  8:13 ` [RFC PATCH 6/9] crypto: crc-t10dif/arm " Ard Biesheuvel
@ 2018-10-05  8:13 ` Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 8/9] crypto: crc-t10dif/powerpc - move PMULL based code into " Ard Biesheuvel
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/crct10dif_generic.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/crypto/crct10dif_generic.c b/crypto/crct10dif_generic.c
index 8e94e29dc6fc..9ea4242c4921 100644
--- a/crypto/crct10dif_generic.c
+++ b/crypto/crct10dif_generic.c
@@ -53,7 +53,7 @@ static int chksum_update(struct shash_desc *desc, const u8 *data,
 {
 	struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
 
-	ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
+	ctx->crc = crc_t10dif_update(ctx->crc, data, length);
 	return 0;
 }
 
@@ -68,7 +68,7 @@ static int chksum_final(struct shash_desc *desc, u8 *out)
 static int __chksum_finup(__u16 *crcp, const u8 *data, unsigned int len,
 			u8 *out)
 {
-	*(__u16 *)out = crc_t10dif_generic(*crcp, data, len);
+	*(__u16 *)out = crc_t10dif_update(*crcp, data, len);
 	return 0;
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 8/9] crypto: crc-t10dif/powerpc - move PMULL based code into core library
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
                   ` (6 preceding siblings ...)
  2018-10-05  8:13 ` [RFC PATCH 7/9] crypto: crct10dif/generic - switch crypto API driver to " Ard Biesheuvel
@ 2018-10-05  8:13 ` Ard Biesheuvel
  2018-10-05  8:13 ` [RFC PATCH 9/9] crypto: crc-t10dif/x86 " Ard Biesheuvel
  2018-10-05 13:37 ` [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Jason A. Donenfeld
  9 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

Move the PMULL based routines out of the crypto API into the core
CRC-T10DIF library.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/powerpc/crypto/crct10dif-vpmsum_glue.c | 55 +++-----------------
 1 file changed, 6 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/crypto/crct10dif-vpmsum_glue.c b/arch/powerpc/crypto/crct10dif-vpmsum_glue.c
index 02ea277863d1..a4db7ecc0ccc 100644
--- a/arch/powerpc/crypto/crct10dif-vpmsum_glue.c
+++ b/arch/powerpc/crypto/crct10dif-vpmsum_glue.c
@@ -11,8 +11,8 @@
  */
 
 #include <linux/crc-t10dif.h>
-#include <crypto/internal/hash.h>
 #include <linux/init.h>
+#include <linux/ffp.h>
 #include <linux/module.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
@@ -63,59 +63,18 @@ static u16 crct10dif_vpmsum(u16 crci, unsigned char const *p, size_t len)
 	return crc & 0xffff;
 }
 
-static int crct10dif_vpmsum_init(struct shash_desc *desc)
-{
-	u16 *crc = shash_desc_ctx(desc);
-
-	*crc = 0;
-	return 0;
-}
-
-static int crct10dif_vpmsum_update(struct shash_desc *desc, const u8 *data,
-			    unsigned int length)
-{
-	u16 *crc = shash_desc_ctx(desc);
-
-	*crc = crct10dif_vpmsum(*crc, data, length);
-
-	return 0;
-}
-
-
-static int crct10dif_vpmsum_final(struct shash_desc *desc, u8 *out)
-{
-	u16 *crcp = shash_desc_ctx(desc);
-
-	*(u16 *)out = *crcp;
-	return 0;
-}
-
-static struct shash_alg alg = {
-	.init		= crct10dif_vpmsum_init,
-	.update		= crct10dif_vpmsum_update,
-	.final		= crct10dif_vpmsum_final,
-	.descsize	= CRC_T10DIF_DIGEST_SIZE,
-	.digestsize	= CRC_T10DIF_DIGEST_SIZE,
-	.base		= {
-		.cra_name		= "crct10dif",
-		.cra_driver_name	= "crct10dif-vpmsum",
-		.cra_priority		= 200,
-		.cra_blocksize		= CRC_T10DIF_BLOCK_SIZE,
-		.cra_module		= THIS_MODULE,
-	}
-};
-
 static int __init crct10dif_vpmsum_mod_init(void)
 {
-	if (!cpu_has_feature(CPU_FTR_ARCH_207S))
-		return -ENODEV;
+	SET_FFP(crc_t10dif_update_arch, crct10dif_vpmsum);
 
-	return crypto_register_shash(&alg);
+	return 0;
 }
 
 static void __exit crct10dif_vpmsum_mod_fini(void)
 {
-	crypto_unregister_shash(&alg);
+	RESET_FFP(crc_t10dif_update_arch);
+	/* wait for all users to switch back to the old version */
+	synchronize_rcu();
 }
 
 module_cpu_feature_match(PPC_MODULE_FEATURE_VEC_CRYPTO, crct10dif_vpmsum_mod_init);
@@ -124,5 +83,3 @@ module_exit(crct10dif_vpmsum_mod_fini);
 MODULE_AUTHOR("Daniel Axtens <dja@axtens.net>");
 MODULE_DESCRIPTION("CRCT10DIF using vector polynomial multiply-sum instructions");
 MODULE_LICENSE("GPL");
-MODULE_ALIAS_CRYPTO("crct10dif");
-MODULE_ALIAS_CRYPTO("crct10dif-vpmsum");
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 9/9] crypto: crc-t10dif/x86 - move PMULL based code into core library
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
                   ` (7 preceding siblings ...)
  2018-10-05  8:13 ` [RFC PATCH 8/9] crypto: crc-t10dif/powerpc - move PMULL based code into " Ard Biesheuvel
@ 2018-10-05  8:13 ` Ard Biesheuvel
  2018-10-05 13:37 ` [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Jason A. Donenfeld
  9 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05  8:13 UTC (permalink / raw)
  To: linux-arm-kernel

Move the PMULL based routines out of the crypto API into the core
CRC-T10DIF library.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/crct10dif-pclmul_glue.c | 98 +++-----------------
 1 file changed, 13 insertions(+), 85 deletions(-)

diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
index cd4df9322501..47f0017ad7f3 100644
--- a/arch/x86/crypto/crct10dif-pclmul_glue.c
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -25,7 +25,7 @@
 #include <linux/types.h>
 #include <linux/module.h>
 #include <linux/crc-t10dif.h>
-#include <crypto/internal/hash.h>
+#include <linux/ffp.h>
 #include <linux/init.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
@@ -36,90 +36,17 @@
 asmlinkage __u16 crc_t10dif_pcl(__u16 crc, const unsigned char *buf,
 				size_t len);
 
-struct chksum_desc_ctx {
-	__u16 crc;
-};
-
-/*
- * Steps through buffer one byte at at time, calculates reflected
- * crc using table.
- */
-
-static int chksum_init(struct shash_desc *desc)
-{
-	struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
-
-	ctx->crc = 0;
-
-	return 0;
-}
-
-static int chksum_update(struct shash_desc *desc, const u8 *data,
-			 unsigned int length)
-{
-	struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
-
-	if (irq_fpu_usable()) {
-		kernel_fpu_begin();
-		ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
-		kernel_fpu_end();
-	} else
-		ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
-	return 0;
-}
-
-static int chksum_final(struct shash_desc *desc, u8 *out)
-{
-	struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
-
-	*(__u16 *)out = ctx->crc;
-	return 0;
-}
-
-static int __chksum_finup(__u16 *crcp, const u8 *data, unsigned int len,
-			u8 *out)
+static __u16 crct10dif_update_x86(__u16 crc, const u8 *data, unsigned int length)
 {
 	if (irq_fpu_usable()) {
 		kernel_fpu_begin();
-		*(__u16 *)out = crc_t10dif_pcl(*crcp, data, len);
+		crc = crc_t10dif_pcl(crc, data, length);
 		kernel_fpu_end();
-	} else
-		*(__u16 *)out = crc_t10dif_generic(*crcp, data, len);
-	return 0;
-}
-
-static int chksum_finup(struct shash_desc *desc, const u8 *data,
-			unsigned int len, u8 *out)
-{
-	struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
-
-	return __chksum_finup(&ctx->crc, data, len, out);
-}
-
-static int chksum_digest(struct shash_desc *desc, const u8 *data,
-			 unsigned int length, u8 *out)
-{
-	struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
-
-	return __chksum_finup(&ctx->crc, data, length, out);
-}
-
-static struct shash_alg alg = {
-	.digestsize		=	CRC_T10DIF_DIGEST_SIZE,
-	.init		=	chksum_init,
-	.update		=	chksum_update,
-	.final		=	chksum_final,
-	.finup		=	chksum_finup,
-	.digest		=	chksum_digest,
-	.descsize		=	sizeof(struct chksum_desc_ctx),
-	.base			=	{
-		.cra_name		=	"crct10dif",
-		.cra_driver_name	=	"crct10dif-pclmul",
-		.cra_priority		=	200,
-		.cra_blocksize		=	CRC_T10DIF_BLOCK_SIZE,
-		.cra_module		=	THIS_MODULE,
+	} else {
+		crc = crc_t10dif_generic(crc, data, length);
 	}
-};
+	return crc;
+}
 
 static const struct x86_cpu_id crct10dif_cpu_id[] = {
 	X86_FEATURE_MATCH(X86_FEATURE_PCLMULQDQ),
@@ -132,12 +59,16 @@ static int __init crct10dif_intel_mod_init(void)
 	if (!x86_match_cpu(crct10dif_cpu_id))
 		return -ENODEV;
 
-	return crypto_register_shash(&alg);
+	SET_FFP(crc_t10dif_update_arch, crct10dif_update_x86);
+
+	return 0;
 }
 
 static void __exit crct10dif_intel_mod_fini(void)
 {
-	crypto_unregister_shash(&alg);
+	RESET_FFP(crc_t10dif_update_arch);
+	/* wait for all users to switch back to the old version */
+	synchronize_rcu();
 }
 
 module_init(crct10dif_intel_mod_init);
@@ -146,6 +77,3 @@ module_exit(crct10dif_intel_mod_fini);
 MODULE_AUTHOR("Tim Chen <tim.c.chen@linux.intel.com>");
 MODULE_DESCRIPTION("T10 DIF CRC calculation accelerated with PCLMULQDQ.");
 MODULE_LICENSE("GPL");
-
-MODULE_ALIAS_CRYPTO("crct10dif");
-MODULE_ALIAS_CRYPTO("crct10dif-pclmul");
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines
  2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
                   ` (8 preceding siblings ...)
  2018-10-05  8:13 ` [RFC PATCH 9/9] crypto: crc-t10dif/x86 " Ard Biesheuvel
@ 2018-10-05 13:37 ` Jason A. Donenfeld
  2018-10-05 17:15   ` Ard Biesheuvel
  9 siblings, 1 reply; 30+ messages in thread
From: Jason A. Donenfeld @ 2018-10-05 13:37 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ard,

On Fri, Oct 5, 2018 at 10:13 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> At the moment, the Zinc library [1] is being proposed as a solution for that,
> and while it does address the usability problems, it does a lot more than
> that, and what we end up with is a lot less flexible than what we have now.
>
> Currently, arch specific implementations (based on SIMD or optimized
> assembler) live in arch/*/crypto, where each sub-community is in charge
> of its own specialized versions of the various primitives (although still
> under the purview of the crypto maintainer). Any proposal to change this
> model should be judged on its own merit, and not blindly accepted as the
> fallout of cleaning up some library code.
>
> Also, Zinc removes the possibility to plug different versions of a routine,
> and instead, keeps all versions in the same module. Currently, the kernel's
> module support permits user land to take charge of the policies that decide
> which version to use in which context (by loading or blacklisting modules).

I think this explanation misunderstands many of the design goals of Zinc and
also points to why going your direction instead is a bad idea that will cause
even more problems down the road.

Zinc does several important things:

- Introduces direct C function calls throughout, as a central way of being
  implemented and as a central way of being used by consumers of the
  API. This has various obvious benefits for the consumers of the API,
  but it also has big benefits for the developers of the library as
  well. Namely, it keeps the relationship between different parts
  extremely clear and direct. It's an explicit choice for simplicity.
  And by being the simpler and more direct solution, it also gives gcc
  an important opportunity to optimize and inline.

- Reorganizes crypto routines so that they're grouped together by
  primitive. This again leads to a much simpler design and layout,
  making it more obvious what's happening, and making things generally
  cleaner. It is not only useful and clearer for developers, but it
  also makes contributors and auditors more easily aware of what
  implementations are available.

- Has higher standards for code and implementations that are introduced.
  Zinc prefers code that has been formally verified, that has been in
  widespread usage and has received many eyeballs and fuzzing hours,
  that has been fuzzed extensively, that is simple in design, that comes
  from well-known high-quality authors -- in roughly but not precisely
  that order of preference. That's a bit different from the current
  practices of the existing crypto API.

- Has a simpler mechanism that is just as effective for choosing
  available implementations. This is, again, more obvious and direct
  than the present crypto API's module approach, leads to smaller code
  size, and has the potential of being just as flexible with the
  inevitable desire for nobs, adjustable from userspace, from
  kernelspace, or from elsewhere.

- Is designed to promote collaboration with the larger cryptography
  community and with academia, which will yield better implementations
  and for assurance.

- Can easily be extracted to userspace libraries (perhaps a future
  libzinc could be easily based on it), which makes testing and fuzzing
  using tools like libfuzzer and afl more accessible.

- Has faster implementations than the current crypto API.

- Has, again, a very strong focus on being simple and minimal, as
  opposed to bloated and complicated, so that it's actually possible to
  understand and audit the library.

Therefore, I think this patch goes in exactly the wrong direction. I
mean, if you want to introduce dynamic patching as a means for making
the crypto API's dynamic dispatch stuff not as slow in a post-spectre
world, sure, go for it; that may very well be a good idea. But
presenting it as an alternative to Zinc very widely misses the point and
serves to prolong a series of bad design choices, which are now able to
be rectified by putting energy into Zinc instead.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05  8:13 ` [RFC PATCH 1/9] kernel: add support for patchable function pointers Ard Biesheuvel
@ 2018-10-05 13:57   ` Peter Zijlstra
  2018-10-05 14:03     ` Ard Biesheuvel
  2018-10-05 14:14   ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2018-10-05 13:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> Add a function pointer abstraction that can be implemented by the arch
> in a manner that avoids the downsides of function pointers, i.e., the
> fact that they are typically located in a writable data section, and
> their vulnerability to Spectre like defects.
> 
> The FFP (or fast function pointer) is callable as a function, since
> the generic incarnation is simply that. However, due to the fact that
> C does not distinguish between functions and function pointers at the
> call site, the architecture can instead emit it as a patchable sequence
> of instructions consisting of ordinary branches.

This is basically a static_key, except for indirection function calls?
So why not call the thing static_func or static_call or something like
that?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 13:57   ` Peter Zijlstra
@ 2018-10-05 14:03     ` Ard Biesheuvel
  0 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05 14:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 5 October 2018 at 15:57, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
>> Add a function pointer abstraction that can be implemented by the arch
>> in a manner that avoids the downsides of function pointers, i.e., the
>> fact that they are typically located in a writable data section, and
>> their vulnerability to Spectre like defects.
>>
>> The FFP (or fast function pointer) is callable as a function, since
>> the generic incarnation is simply that. However, due to the fact that
>> C does not distinguish between functions and function pointers at the
>> call site, the architecture can instead emit it as a patchable sequence
>> of instructions consisting of ordinary branches.
>
> This is basically a static_key, except for indirection function calls?

Yes, that is why I put you on cc :-)

> So why not call the thing static_func or static_call or something like
> that?

Yep that sounds better.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05  8:13 ` [RFC PATCH 1/9] kernel: add support for patchable function pointers Ard Biesheuvel
  2018-10-05 13:57   ` Peter Zijlstra
@ 2018-10-05 14:14   ` Peter Zijlstra
  2018-10-05 14:57     ` Ard Biesheuvel
  2018-10-05 15:08     ` Andy Lutomirski
  1 sibling, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2018-10-05 14:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
> new file mode 100644
> index 000000000000..8fc3b4c9b38f
> --- /dev/null
> +++ b/include/linux/ffp.h
> @@ -0,0 +1,43 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __LINUX_FFP_H
> +#define __LINUX_FFP_H
> +
> +#include <linux/types.h>
> +#include <linux/compiler.h>
> +
> +#ifdef CONFIG_HAVE_ARCH_FFP
> +#include <asm/ffp.h>
> +#else
> +
> +struct ffp {
> +	void	(**fn)(void);
> +	void	(*default_fn)(void);
> +};
> +
> +#define DECLARE_FFP(_fn, _def)						\
> +	extern typeof(_def) *_fn;					\
> +	extern struct ffp const __ffp_ ## _fn
> +
> +#define DEFINE_FFP(_fn, _def)						\
> +	typeof(_def) *_fn = &_def;					\
> +	struct ffp const __ffp_ ## _fn					\
> +		= { (void(**)(void))&_fn, (void(*)(void))&_def };	\
> +	EXPORT_SYMBOL(__ffp_ ## _fn)
> +
> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
> +{
> +	WRITE_ONCE(*m->fn, new_fn);
> +}
> +
> +static inline void ffp_reset_target(const struct ffp *m)
> +{
> +	WRITE_ONCE(*m->fn, m->default_fn);
> +}
> +
> +#endif
> +
> +#define SET_FFP(_fn, _new)	ffp_set_target(&__ffp_ ## _fn, _new)
> +#define RESET_FFP(_fn)		ffp_reset_target(&__ffp_ ## _fn)
> +
> +#endif

I don't understand this interface. There is no wrapper for the call
site, so how are we going to patch all call-sites when you update the
target?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 14:14   ` Peter Zijlstra
@ 2018-10-05 14:57     ` Ard Biesheuvel
  2018-10-05 15:08     ` Andy Lutomirski
  1 sibling, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05 14:57 UTC (permalink / raw)
  To: linux-arm-kernel

On 5 October 2018 at 16:14, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
>> new file mode 100644
>> index 000000000000..8fc3b4c9b38f
>> --- /dev/null
>> +++ b/include/linux/ffp.h
>> @@ -0,0 +1,43 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef __LINUX_FFP_H
>> +#define __LINUX_FFP_H
>> +
>> +#include <linux/types.h>
>> +#include <linux/compiler.h>
>> +
>> +#ifdef CONFIG_HAVE_ARCH_FFP
>> +#include <asm/ffp.h>
>> +#else
>> +
>> +struct ffp {
>> +     void    (**fn)(void);
>> +     void    (*default_fn)(void);
>> +};
>> +
>> +#define DECLARE_FFP(_fn, _def)                                               \
>> +     extern typeof(_def) *_fn;                                       \
>> +     extern struct ffp const __ffp_ ## _fn
>> +
>> +#define DEFINE_FFP(_fn, _def)                                                \
>> +     typeof(_def) *_fn = &_def;                                      \
>> +     struct ffp const __ffp_ ## _fn                                  \
>> +             = { (void(**)(void))&_fn, (void(*)(void))&_def };       \
>> +     EXPORT_SYMBOL(__ffp_ ## _fn)
>> +
>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
>> +{
>> +     WRITE_ONCE(*m->fn, new_fn);
>> +}
>> +
>> +static inline void ffp_reset_target(const struct ffp *m)
>> +{
>> +     WRITE_ONCE(*m->fn, m->default_fn);
>> +}
>> +
>> +#endif
>> +
>> +#define SET_FFP(_fn, _new)   ffp_set_target(&__ffp_ ## _fn, _new)
>> +#define RESET_FFP(_fn)               ffp_reset_target(&__ffp_ ## _fn)
>> +
>> +#endif
>
> I don't understand this interface. There is no wrapper for the call
> site, so how are we going to patch all call-sites when you update the
> target?

This is the generic implementation, and in this case, it is just a
function pointer which gets dereferenced and called indirectly at each
call site.

In the arch specific implementations (for ones where it matters, see
2/9), it is more like a PLT entry, which gets called from each call
site, and which can be patched to point to another function.

So we replace a single indirect call with a direct call plus a direct jump.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 14:14   ` Peter Zijlstra
  2018-10-05 14:57     ` Ard Biesheuvel
@ 2018-10-05 15:08     ` Andy Lutomirski
  2018-10-05 15:24       ` Ard Biesheuvel
  1 sibling, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2018-10-05 15:08 UTC (permalink / raw)
  To: linux-arm-kernel



> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
>> new file mode 100644
>> index 000000000000..8fc3b4c9b38f
>> --- /dev/null
>> +++ b/include/linux/ffp.h
>> @@ -0,0 +1,43 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef __LINUX_FFP_H
>> +#define __LINUX_FFP_H
>> +
>> +#include <linux/types.h>
>> +#include <linux/compiler.h>
>> +
>> +#ifdef CONFIG_HAVE_ARCH_FFP
>> +#include <asm/ffp.h>
>> +#else
>> +
>> +struct ffp {
>> +    void    (**fn)(void);
>> +    void    (*default_fn)(void);
>> +};
>> +
>> +#define DECLARE_FFP(_fn, _def)                        \
>> +    extern typeof(_def) *_fn;                    \
>> +    extern struct ffp const __ffp_ ## _fn
>> +
>> +#define DEFINE_FFP(_fn, _def)                        \
>> +    typeof(_def) *_fn = &_def;                    \
>> +    struct ffp const __ffp_ ## _fn                    \
>> +        = { (void(**)(void))&_fn, (void(*)(void))&_def };    \
>> +    EXPORT_SYMBOL(__ffp_ ## _fn)
>> +
>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
>> +{
>> +    WRITE_ONCE(*m->fn, new_fn);
>> +}
>> +
>> +static inline void ffp_reset_target(const struct ffp *m)
>> +{
>> +    WRITE_ONCE(*m->fn, m->default_fn);
>> +}
>> +
>> +#endif
>> +
>> +#define SET_FFP(_fn, _new)    ffp_set_target(&__ffp_ ## _fn, _new)
>> +#define RESET_FFP(_fn)        ffp_reset_target(&__ffp_ ## _fn)
>> +
>> +#endif
> 
> I don't understand this interface. There is no wrapper for the call
> site, so how are we going to patch all call-sites when you update the
> target?

I?m also confused.

Anyway, we have patchable functions on x86. They?re called PVOPs, and they?re way overcomplicated.

I?ve proposed a better way that should generate better code, be more portable, and be more maintainable.  It goes like this.

To call the function, you literally just call  the default implementation.  It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we?re using the ?nonexistent wrapper? approach, we can link in a .S or linker script to alias them to the default implementation.

To patch them, we just patch them. It can?t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment.  Or we ask gcc for an extension to align calls to a function.)

Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 15:08     ` Andy Lutomirski
@ 2018-10-05 15:24       ` Ard Biesheuvel
  2018-10-05 16:58         ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05 15:24 UTC (permalink / raw)
  To: linux-arm-kernel

On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
>>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
>>> new file mode 100644
>>> index 000000000000..8fc3b4c9b38f
>>> --- /dev/null
>>> +++ b/include/linux/ffp.h
>>> @@ -0,0 +1,43 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +
>>> +#ifndef __LINUX_FFP_H
>>> +#define __LINUX_FFP_H
>>> +
>>> +#include <linux/types.h>
>>> +#include <linux/compiler.h>
>>> +
>>> +#ifdef CONFIG_HAVE_ARCH_FFP
>>> +#include <asm/ffp.h>
>>> +#else
>>> +
>>> +struct ffp {
>>> +    void    (**fn)(void);
>>> +    void    (*default_fn)(void);
>>> +};
>>> +
>>> +#define DECLARE_FFP(_fn, _def)                        \
>>> +    extern typeof(_def) *_fn;                    \
>>> +    extern struct ffp const __ffp_ ## _fn
>>> +
>>> +#define DEFINE_FFP(_fn, _def)                        \
>>> +    typeof(_def) *_fn = &_def;                    \
>>> +    struct ffp const __ffp_ ## _fn                    \
>>> +        = { (void(**)(void))&_fn, (void(*)(void))&_def };    \
>>> +    EXPORT_SYMBOL(__ffp_ ## _fn)
>>> +
>>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
>>> +{
>>> +    WRITE_ONCE(*m->fn, new_fn);
>>> +}
>>> +
>>> +static inline void ffp_reset_target(const struct ffp *m)
>>> +{
>>> +    WRITE_ONCE(*m->fn, m->default_fn);
>>> +}
>>> +
>>> +#endif
>>> +
>>> +#define SET_FFP(_fn, _new)    ffp_set_target(&__ffp_ ## _fn, _new)
>>> +#define RESET_FFP(_fn)        ffp_reset_target(&__ffp_ ## _fn)
>>> +
>>> +#endif
>>
>> I don't understand this interface. There is no wrapper for the call
>> site, so how are we going to patch all call-sites when you update the
>> target?
>
> I?m also confused.
>
> Anyway, we have patchable functions on x86. They?re called PVOPs, and they?re way overcomplicated.
>
> I?ve proposed a better way that should generate better code, be more portable, and be more maintainable.  It goes like this.
>
> To call the function, you literally just call  the default implementation.  It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we?re using the ?nonexistent wrapper? approach, we can link in a .S or linker script to alias them to the default implementation.
>
> To patch them, we just patch them. It can?t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment.  Or we ask gcc for an extension to align calls to a function.)
>
> Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module.

Yeah nothing is ever simple on x86 :-(

So are you saying the approach i use in patch #2 (which would
translate to emitting a jmpq instruction pointing to the default
implementation, and patching it at runtime to point elsewhere) would
not fly on x86?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 15:24       ` Ard Biesheuvel
@ 2018-10-05 16:58         ` Andy Lutomirski
  2018-10-05 17:11           ` Ard Biesheuvel
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2018-10-05 16:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >
> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
> >>> new file mode 100644
> >>> index 000000000000..8fc3b4c9b38f
> >>> --- /dev/null
> >>> +++ b/include/linux/ffp.h
> >>> @@ -0,0 +1,43 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 */
> >>> +
> >>> +#ifndef __LINUX_FFP_H
> >>> +#define __LINUX_FFP_H
> >>> +
> >>> +#include <linux/types.h>
> >>> +#include <linux/compiler.h>
> >>> +
> >>> +#ifdef CONFIG_HAVE_ARCH_FFP
> >>> +#include <asm/ffp.h>
> >>> +#else
> >>> +
> >>> +struct ffp {
> >>> +    void    (**fn)(void);
> >>> +    void    (*default_fn)(void);
> >>> +};
> >>> +
> >>> +#define DECLARE_FFP(_fn, _def)                        \
> >>> +    extern typeof(_def) *_fn;                    \
> >>> +    extern struct ffp const __ffp_ ## _fn
> >>> +
> >>> +#define DEFINE_FFP(_fn, _def)                        \
> >>> +    typeof(_def) *_fn = &_def;                    \
> >>> +    struct ffp const __ffp_ ## _fn                    \
> >>> +        = { (void(**)(void))&_fn, (void(*)(void))&_def };    \
> >>> +    EXPORT_SYMBOL(__ffp_ ## _fn)
> >>> +
> >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
> >>> +{
> >>> +    WRITE_ONCE(*m->fn, new_fn);
> >>> +}
> >>> +
> >>> +static inline void ffp_reset_target(const struct ffp *m)
> >>> +{
> >>> +    WRITE_ONCE(*m->fn, m->default_fn);
> >>> +}
> >>> +
> >>> +#endif
> >>> +
> >>> +#define SET_FFP(_fn, _new)    ffp_set_target(&__ffp_ ## _fn, _new)
> >>> +#define RESET_FFP(_fn)        ffp_reset_target(&__ffp_ ## _fn)
> >>> +
> >>> +#endif
> >>
> >> I don't understand this interface. There is no wrapper for the call
> >> site, so how are we going to patch all call-sites when you update the
> >> target?
> >
> > I?m also confused.
> >
> > Anyway, we have patchable functions on x86. They?re called PVOPs, and they?re way overcomplicated.
> >
> > I?ve proposed a better way that should generate better code, be more portable, and be more maintainable.  It goes like this.
> >
> > To call the function, you literally just call  the default implementation.  It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we?re using the ?nonexistent wrapper? approach, we can link in a .S or linker script to alias them to the default implementation.
> >
> > To patch them, we just patch them. It can?t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment.  Or we ask gcc for an extension to align calls to a function.)
> >
> > Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module.
>
> Yeah nothing is ever simple on x86 :-(
>
> So are you saying the approach i use in patch #2 (which would
> translate to emitting a jmpq instruction pointing to the default
> implementation, and patching it at runtime to point elsewhere) would
> not fly on x86?

After getting some more sleep, I'm obviously wrong.  The
text_poke_bp() mechanism will work.  It's just really slow.

Let me try to summarize some of the issues.  First, when emitting
jumps and calls from inline asm on x86, there are a few considerations
that are annoying:

1. Following the x86_64 ABI calling conventions is basically
impossible.  x86_64 requires a 128-byte redzone and 16-byte stack
alignment.  After much discussion a while back, we decided that it was
flat-out impossible on current gcc to get the stack pointer aligned in
a known manner in an inline asm statement.  Instead, if we actually
need alignment, we need to align manually.  Fortunately, the kernel is
built with an override that forces only 8-byte alignment (on *most*
GCC versions).  But for crypto in particular, it sucks extra, since
the crypto code is basically the only thing in the kernel that
actually wants 16-byte alignment.  I don't think this is a huge
problem in practice, but it's annoying.  And the kernel is built
without a redzone.

2. On x86_64, depending on config, we either need frame pointers or
ORC.  ORC is no big deal -- it Just Works (tm).  Frame pointers need
extra asm hackery.  It's doable, but it's still annoying.

3. Actually getting the asm constraints right to do what a C
programmer expects is distinctly nontrivial.  I just fixed an
extremely longstanding bug in the vDSO code in which the asm
constraints for the syscall fallback were wrong in such a way that GCC
didn't notice that the fallback wrote to its output parameter.
Whoops.

And having all this asm hackery per architecture is ugly and annoying.

So my suggestion is to do it like a regular relocation.  Call a
function the normal way (make it literally be a C function and call
it), and rig up the noinline and noclone attributes and whatever else
is needed to make sure that it's a *relocatable* call.  Then the
toolchain emits ELF relocations saying exactly what part of the text
needs patching, and we can patch it at runtime.  On x86, this is a bit
extra annoying because we can't fully reliably parse backwards to find
the beginning of the instruction, but objtool could doit.

And then we get something that is mostly arch-neutral!  Because surely
ARM can also use a relocation-based mechanism.

I will generally object to x86 containing more than one
inline-asm-hackery-based patchable call mechanism, which your series
will add.  I would *love* to see a non-inline-asm one, and then we
could move most of the x86 paravirt crap over to use it for a big win
in readability and maintainability.

--Andy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 16:58         ` Andy Lutomirski
@ 2018-10-05 17:11           ` Ard Biesheuvel
  2018-10-05 17:20             ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05 17:11 UTC (permalink / raw)
  To: linux-arm-kernel

On 5 October 2018 at 18:58, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>>
>> On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote:
>> >
>> >
>> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >>
>> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
>> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
>> >>> new file mode 100644
>> >>> index 000000000000..8fc3b4c9b38f
>> >>> --- /dev/null
>> >>> +++ b/include/linux/ffp.h
>> >>> @@ -0,0 +1,43 @@
>> >>> +/* SPDX-License-Identifier: GPL-2.0 */
>> >>> +
>> >>> +#ifndef __LINUX_FFP_H
>> >>> +#define __LINUX_FFP_H
>> >>> +
>> >>> +#include <linux/types.h>
>> >>> +#include <linux/compiler.h>
>> >>> +
>> >>> +#ifdef CONFIG_HAVE_ARCH_FFP
>> >>> +#include <asm/ffp.h>
>> >>> +#else
>> >>> +
>> >>> +struct ffp {
>> >>> +    void    (**fn)(void);
>> >>> +    void    (*default_fn)(void);
>> >>> +};
>> >>> +
>> >>> +#define DECLARE_FFP(_fn, _def)                        \
>> >>> +    extern typeof(_def) *_fn;                    \
>> >>> +    extern struct ffp const __ffp_ ## _fn
>> >>> +
>> >>> +#define DEFINE_FFP(_fn, _def)                        \
>> >>> +    typeof(_def) *_fn = &_def;                    \
>> >>> +    struct ffp const __ffp_ ## _fn                    \
>> >>> +        = { (void(**)(void))&_fn, (void(*)(void))&_def };    \
>> >>> +    EXPORT_SYMBOL(__ffp_ ## _fn)
>> >>> +
>> >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
>> >>> +{
>> >>> +    WRITE_ONCE(*m->fn, new_fn);
>> >>> +}
>> >>> +
>> >>> +static inline void ffp_reset_target(const struct ffp *m)
>> >>> +{
>> >>> +    WRITE_ONCE(*m->fn, m->default_fn);
>> >>> +}
>> >>> +
>> >>> +#endif
>> >>> +
>> >>> +#define SET_FFP(_fn, _new)    ffp_set_target(&__ffp_ ## _fn, _new)
>> >>> +#define RESET_FFP(_fn)        ffp_reset_target(&__ffp_ ## _fn)
>> >>> +
>> >>> +#endif
>> >>
>> >> I don't understand this interface. There is no wrapper for the call
>> >> site, so how are we going to patch all call-sites when you update the
>> >> target?
>> >
>> > I?m also confused.
>> >
>> > Anyway, we have patchable functions on x86. They?re called PVOPs, and they?re way overcomplicated.
>> >
>> > I?ve proposed a better way that should generate better code, be more portable, and be more maintainable.  It goes like this.
>> >
>> > To call the function, you literally just call  the default implementation.  It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we?re using the ?nonexistent wrapper? approach, we can link in a .S or linker script to alias them to the default implementation.
>> >
>> > To patch them, we just patch them. It can?t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment.  Or we ask gcc for an extension to align calls to a function.)
>> >
>> > Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module.
>>
>> Yeah nothing is ever simple on x86 :-(
>>
>> So are you saying the approach i use in patch #2 (which would
>> translate to emitting a jmpq instruction pointing to the default
>> implementation, and patching it at runtime to point elsewhere) would
>> not fly on x86?
>
> After getting some more sleep, I'm obviously wrong.  The
> text_poke_bp() mechanism will work.  It's just really slow.
>

OK

> Let me try to summarize some of the issues.  First, when emitting
> jumps and calls from inline asm on x86, there are a few considerations
> that are annoying:
>
> 1. Following the x86_64 ABI calling conventions is basically
> impossible.  x86_64 requires a 128-byte redzone and 16-byte stack
> alignment.  After much discussion a while back, we decided that it was
> flat-out impossible on current gcc to get the stack pointer aligned in
> a known manner in an inline asm statement.  Instead, if we actually
> need alignment, we need to align manually.  Fortunately, the kernel is
> built with an override that forces only 8-byte alignment (on *most*
> GCC versions).  But for crypto in particular, it sucks extra, since
> the crypto code is basically the only thing in the kernel that
> actually wants 16-byte alignment.  I don't think this is a huge
> problem in practice, but it's annoying.  And the kernel is built
> without a redzone.
>
> 2. On x86_64, depending on config, we either need frame pointers or
> ORC.  ORC is no big deal -- it Just Works (tm).  Frame pointers need
> extra asm hackery.  It's doable, but it's still annoying.
>
> 3. Actually getting the asm constraints right to do what a C
> programmer expects is distinctly nontrivial.  I just fixed an
> extremely longstanding bug in the vDSO code in which the asm
> constraints for the syscall fallback were wrong in such a way that GCC
> didn't notice that the fallback wrote to its output parameter.
> Whoops.
>

OK, so the thing I am missing is why this all matters.

Note that the compiler should take care of all of this. It emits a
call a function with external linkage having prototype X, and all the
inline asm does is emit a jmp to some function having that same
prototype, either the default one or the one we patched in.

Apologies if I am missing something obvious here: as you know, x86 is
not my focus in general.

So in the arm64 case, we have

    " .globl " #_fn " \n" \
    #_fn " : \n" \
    " b " #_def " \n" \
    " nop \n" \
    " nop \n" \
    " nop \n" \
    " nop \n" \

and all the patching does is replace the target of that branch (the
NOPs are there for jumps that are more then 128 MB away, which require
a indirect jump on arm64)

> And having all this asm hackery per architecture is ugly and annoying.
>

True. But note that only the architectures that cannot tolerate the
default instantiation using function pointers will require a special
implementation.

> So my suggestion is to do it like a regular relocation.  Call a
> function the normal way (make it literally be a C function and call
> it), and rig up the noinline and noclone attributes and whatever else
> is needed to make sure that it's a *relocatable* call.  Then the
> toolchain emits ELF relocations saying exactly what part of the text
> needs patching, and we can patch it at runtime.  On x86, this is a bit
> extra annoying because we can't fully reliably parse backwards to find
> the beginning of the instruction, but objtool could doit.
>
> And then we get something that is mostly arch-neutral!  Because surely
> ARM can also use a relocation-based mechanism.
>

Not really. We don't have any build time tooling for this, and
CONFIG_RELOCATABLE only produces R_AARCH64_RELATIVE relocations for
absolute quantities.

So it would mean we'd have to start building vmlinux with
--emit-relocs, add tooling to parse all of that etc etc.

> I will generally object to x86 containing more than one
> inline-asm-hackery-based patchable call mechanism, which your series
> will add.  I would *love* to see a non-inline-asm one, and then we
> could move most of the x86 paravirt crap over to use it for a big win
> in readability and maintainability.
>

Fair enough.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines
  2018-10-05 13:37 ` [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Jason A. Donenfeld
@ 2018-10-05 17:15   ` Ard Biesheuvel
  2018-10-05 17:26     ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05 17:15 UTC (permalink / raw)
  To: linux-arm-kernel

On 5 October 2018 at 15:37, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
...
> Therefore, I think this patch goes in exactly the wrong direction. I
> mean, if you want to introduce dynamic patching as a means for making
> the crypto API's dynamic dispatch stuff not as slow in a post-spectre
> world, sure, go for it; that may very well be a good idea. But
> presenting it as an alternative to Zinc very widely misses the point and
> serves to prolong a series of bad design choices, which are now able to
> be rectified by putting energy into Zinc instead.
>

This series has nothing to do with dynamic dispatch: the call sites
call crypto functions using ordinary function calls (although my
example uses CRC-T10DIF), and these calls are redirected via what is
essentially a PLT entry, so that we can supsersede those routines at
runtime.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 17:11           ` Ard Biesheuvel
@ 2018-10-05 17:20             ` Andy Lutomirski
  2018-10-05 17:23               ` Ard Biesheuvel
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2018-10-05 17:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 5, 2018 at 10:11 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On 5 October 2018 at 18:58, Andy Lutomirski <luto@kernel.org> wrote:
> > On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >>
> >> On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >
> >> >
> >> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> >>
> >> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> >> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
> >> >>> new file mode 100644
> >> >>> index 000000000000..8fc3b4c9b38f
> >> >>> --- /dev/null
> >> >>> +++ b/include/linux/ffp.h
> >> >>> @@ -0,0 +1,43 @@
> >> >>> +/* SPDX-License-Identifier: GPL-2.0 */
> >> >>> +
> >> >>> +#ifndef __LINUX_FFP_H
> >> >>> +#define __LINUX_FFP_H
> >> >>> +
> >> >>> +#include <linux/types.h>
> >> >>> +#include <linux/compiler.h>
> >> >>> +
> >> >>> +#ifdef CONFIG_HAVE_ARCH_FFP
> >> >>> +#include <asm/ffp.h>
> >> >>> +#else
> >> >>> +
> >> >>> +struct ffp {
> >> >>> +    void    (**fn)(void);
> >> >>> +    void    (*default_fn)(void);
> >> >>> +};
> >> >>> +
> >> >>> +#define DECLARE_FFP(_fn, _def)                        \
> >> >>> +    extern typeof(_def) *_fn;                    \
> >> >>> +    extern struct ffp const __ffp_ ## _fn
> >> >>> +
> >> >>> +#define DEFINE_FFP(_fn, _def)                        \
> >> >>> +    typeof(_def) *_fn = &_def;                    \
> >> >>> +    struct ffp const __ffp_ ## _fn                    \
> >> >>> +        = { (void(**)(void))&_fn, (void(*)(void))&_def };    \
> >> >>> +    EXPORT_SYMBOL(__ffp_ ## _fn)
> >> >>> +
> >> >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
> >> >>> +{
> >> >>> +    WRITE_ONCE(*m->fn, new_fn);
> >> >>> +}
> >> >>> +
> >> >>> +static inline void ffp_reset_target(const struct ffp *m)
> >> >>> +{
> >> >>> +    WRITE_ONCE(*m->fn, m->default_fn);
> >> >>> +}
> >> >>> +
> >> >>> +#endif
> >> >>> +
> >> >>> +#define SET_FFP(_fn, _new)    ffp_set_target(&__ffp_ ## _fn, _new)
> >> >>> +#define RESET_FFP(_fn)        ffp_reset_target(&__ffp_ ## _fn)
> >> >>> +
> >> >>> +#endif
> >> >>
> >> >> I don't understand this interface. There is no wrapper for the call
> >> >> site, so how are we going to patch all call-sites when you update the
> >> >> target?
> >> >
> >> > I?m also confused.
> >> >
> >> > Anyway, we have patchable functions on x86. They?re called PVOPs, and they?re way overcomplicated.
> >> >
> >> > I?ve proposed a better way that should generate better code, be more portable, and be more maintainable.  It goes like this.
> >> >
> >> > To call the function, you literally just call  the default implementation.  It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we?re using the ?nonexistent wrapper? approach, we can link in a .S or linker script to alias them to the default implementation.
> >> >
> >> > To patch them, we just patch them. It can?t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment.  Or we ask gcc for an extension to align calls to a function.)
> >> >
> >> > Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module.
> >>
> >> Yeah nothing is ever simple on x86 :-(
> >>
> >> So are you saying the approach i use in patch #2 (which would
> >> translate to emitting a jmpq instruction pointing to the default
> >> implementation, and patching it at runtime to point elsewhere) would
> >> not fly on x86?
> >
> > After getting some more sleep, I'm obviously wrong.  The
> > text_poke_bp() mechanism will work.  It's just really slow.
> >
>
> OK
>
> > Let me try to summarize some of the issues.  First, when emitting
> > jumps and calls from inline asm on x86, there are a few considerations
> > that are annoying:
> >
> > 1. Following the x86_64 ABI calling conventions is basically
> > impossible.  x86_64 requires a 128-byte redzone and 16-byte stack
> > alignment.  After much discussion a while back, we decided that it was
> > flat-out impossible on current gcc to get the stack pointer aligned in
> > a known manner in an inline asm statement.  Instead, if we actually
> > need alignment, we need to align manually.  Fortunately, the kernel is
> > built with an override that forces only 8-byte alignment (on *most*
> > GCC versions).  But for crypto in particular, it sucks extra, since
> > the crypto code is basically the only thing in the kernel that
> > actually wants 16-byte alignment.  I don't think this is a huge
> > problem in practice, but it's annoying.  And the kernel is built
> > without a redzone.
> >
> > 2. On x86_64, depending on config, we either need frame pointers or
> > ORC.  ORC is no big deal -- it Just Works (tm).  Frame pointers need
> > extra asm hackery.  It's doable, but it's still annoying.
> >
> > 3. Actually getting the asm constraints right to do what a C
> > programmer expects is distinctly nontrivial.  I just fixed an
> > extremely longstanding bug in the vDSO code in which the asm
> > constraints for the syscall fallback were wrong in such a way that GCC
> > didn't notice that the fallback wrote to its output parameter.
> > Whoops.
> >
>
> OK, so the thing I am missing is why this all matters.
>
> Note that the compiler should take care of all of this. It emits a
> call a function with external linkage having prototype X, and all the
> inline asm does is emit a jmp to some function having that same
> prototype, either the default one or the one we patched in.
>
> Apologies if I am missing something obvious here: as you know, x86 is
> not my focus in general.

The big issue that bothers me isn't the x86-ism so much as the nasty
interactions with the optimizer.  On x86, we have all of this working.
It's in arch/x86/include/asm/paravirt_types.h, and it looks roughly
like:

                        asm volatile(pre                                \
                                     paravirt_alt(PARAVIRT_CALL)        \
                                     post                               \
                                     : call_clbr, ASM_CALL_CONSTRAINT   \
                                     : paravirt_type(op),               \
                                       paravirt_clobber(clbr),          \
                                       ##__VA_ARGS__                    \
                                     : "memory", "cc" extra_clbr);      \

With some extra magic for the constraints.  And I don't even think
this is strictly correct -- from very recent experience, telling the
compiler that "memory" is clobbered and that a bunch of arguments are
used as numeric inputs may not actually imply that the asm modifies
the target of pointer arguments.  Checks this lovely bug out:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=715bd9d12f84d8f5cc8ad21d888f9bc304a8eb0b

As far as I can tell, the whole PVOP infrastructure has the same bug.
And I don't see how to avoid it generically on x86 or any other
architecture.  (PeterZ, am I wrong?  Are we really just getting lucky
that x86 pvop calls actually work?  Or do we not have enough of them
that take pointers as arguments for this to matter?)

Plus, asm volatile ( ..., "memory" ) is a barrier and makes code
generation suck.

Whereas, if we use my suggestion the semantics are precisely those of
any other C function call because, as far as GCC is concerned, it *is*
a C function call.  So the generated code *is* a function call.

--Andy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 17:20             ` Andy Lutomirski
@ 2018-10-05 17:23               ` Ard Biesheuvel
  2018-10-05 17:28                 ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05 17:23 UTC (permalink / raw)
  To: linux-arm-kernel

On 5 October 2018 at 19:20, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Oct 5, 2018 at 10:11 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>>
>> On 5 October 2018 at 18:58, Andy Lutomirski <luto@kernel.org> wrote:
>> > On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> >>
>> >> On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> >
>> >> >
>> >> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> >>
>> >> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
>> >> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
>> >> >>> new file mode 100644
>> >> >>> index 000000000000..8fc3b4c9b38f
>> >> >>> --- /dev/null
>> >> >>> +++ b/include/linux/ffp.h
>> >> >>> @@ -0,0 +1,43 @@
>> >> >>> +/* SPDX-License-Identifier: GPL-2.0 */
>> >> >>> +
>> >> >>> +#ifndef __LINUX_FFP_H
>> >> >>> +#define __LINUX_FFP_H
>> >> >>> +
>> >> >>> +#include <linux/types.h>
>> >> >>> +#include <linux/compiler.h>
>> >> >>> +
>> >> >>> +#ifdef CONFIG_HAVE_ARCH_FFP
>> >> >>> +#include <asm/ffp.h>
>> >> >>> +#else
>> >> >>> +
>> >> >>> +struct ffp {
>> >> >>> +    void    (**fn)(void);
>> >> >>> +    void    (*default_fn)(void);
>> >> >>> +};
>> >> >>> +
>> >> >>> +#define DECLARE_FFP(_fn, _def)                        \
>> >> >>> +    extern typeof(_def) *_fn;                    \
>> >> >>> +    extern struct ffp const __ffp_ ## _fn
>> >> >>> +
>> >> >>> +#define DEFINE_FFP(_fn, _def)                        \
>> >> >>> +    typeof(_def) *_fn = &_def;                    \
>> >> >>> +    struct ffp const __ffp_ ## _fn                    \
>> >> >>> +        = { (void(**)(void))&_fn, (void(*)(void))&_def };    \
>> >> >>> +    EXPORT_SYMBOL(__ffp_ ## _fn)
>> >> >>> +
>> >> >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
>> >> >>> +{
>> >> >>> +    WRITE_ONCE(*m->fn, new_fn);
>> >> >>> +}
>> >> >>> +
>> >> >>> +static inline void ffp_reset_target(const struct ffp *m)
>> >> >>> +{
>> >> >>> +    WRITE_ONCE(*m->fn, m->default_fn);
>> >> >>> +}
>> >> >>> +
>> >> >>> +#endif
>> >> >>> +
>> >> >>> +#define SET_FFP(_fn, _new)    ffp_set_target(&__ffp_ ## _fn, _new)
>> >> >>> +#define RESET_FFP(_fn)        ffp_reset_target(&__ffp_ ## _fn)
>> >> >>> +
>> >> >>> +#endif
>> >> >>
>> >> >> I don't understand this interface. There is no wrapper for the call
>> >> >> site, so how are we going to patch all call-sites when you update the
>> >> >> target?
>> >> >
>> >> > I?m also confused.
>> >> >
>> >> > Anyway, we have patchable functions on x86. They?re called PVOPs, and they?re way overcomplicated.
>> >> >
>> >> > I?ve proposed a better way that should generate better code, be more portable, and be more maintainable.  It goes like this.
>> >> >
>> >> > To call the function, you literally just call  the default implementation.  It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we?re using the ?nonexistent wrapper? approach, we can link in a .S or linker script to alias them to the default implementation.
>> >> >
>> >> > To patch them, we just patch them. It can?t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment.  Or we ask gcc for an extension to align calls to a function.)
>> >> >
>> >> > Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module.
>> >>
>> >> Yeah nothing is ever simple on x86 :-(
>> >>
>> >> So are you saying the approach i use in patch #2 (which would
>> >> translate to emitting a jmpq instruction pointing to the default
>> >> implementation, and patching it at runtime to point elsewhere) would
>> >> not fly on x86?
>> >
>> > After getting some more sleep, I'm obviously wrong.  The
>> > text_poke_bp() mechanism will work.  It's just really slow.
>> >
>>
>> OK
>>
>> > Let me try to summarize some of the issues.  First, when emitting
>> > jumps and calls from inline asm on x86, there are a few considerations
>> > that are annoying:
>> >
>> > 1. Following the x86_64 ABI calling conventions is basically
>> > impossible.  x86_64 requires a 128-byte redzone and 16-byte stack
>> > alignment.  After much discussion a while back, we decided that it was
>> > flat-out impossible on current gcc to get the stack pointer aligned in
>> > a known manner in an inline asm statement.  Instead, if we actually
>> > need alignment, we need to align manually.  Fortunately, the kernel is
>> > built with an override that forces only 8-byte alignment (on *most*
>> > GCC versions).  But for crypto in particular, it sucks extra, since
>> > the crypto code is basically the only thing in the kernel that
>> > actually wants 16-byte alignment.  I don't think this is a huge
>> > problem in practice, but it's annoying.  And the kernel is built
>> > without a redzone.
>> >
>> > 2. On x86_64, depending on config, we either need frame pointers or
>> > ORC.  ORC is no big deal -- it Just Works (tm).  Frame pointers need
>> > extra asm hackery.  It's doable, but it's still annoying.
>> >
>> > 3. Actually getting the asm constraints right to do what a C
>> > programmer expects is distinctly nontrivial.  I just fixed an
>> > extremely longstanding bug in the vDSO code in which the asm
>> > constraints for the syscall fallback were wrong in such a way that GCC
>> > didn't notice that the fallback wrote to its output parameter.
>> > Whoops.
>> >
>>
>> OK, so the thing I am missing is why this all matters.
>>
>> Note that the compiler should take care of all of this. It emits a
>> call a function with external linkage having prototype X, and all the
>> inline asm does is emit a jmp to some function having that same
>> prototype, either the default one or the one we patched in.
>>
>> Apologies if I am missing something obvious here: as you know, x86 is
>> not my focus in general.
>
> The big issue that bothers me isn't the x86-ism so much as the nasty
> interactions with the optimizer.  On x86, we have all of this working.
> It's in arch/x86/include/asm/paravirt_types.h, and it looks roughly
> like:
>
>                         asm volatile(pre                                \
>                                      paravirt_alt(PARAVIRT_CALL)        \
>                                      post                               \
>                                      : call_clbr, ASM_CALL_CONSTRAINT   \
>                                      : paravirt_type(op),               \
>                                        paravirt_clobber(clbr),          \
>                                        ##__VA_ARGS__                    \
>                                      : "memory", "cc" extra_clbr);      \
>
> With some extra magic for the constraints.  And I don't even think
> this is strictly correct -- from very recent experience, telling the
> compiler that "memory" is clobbered and that a bunch of arguments are
> used as numeric inputs may not actually imply that the asm modifies
> the target of pointer arguments.  Checks this lovely bug out:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=715bd9d12f84d8f5cc8ad21d888f9bc304a8eb0b
>
> As far as I can tell, the whole PVOP infrastructure has the same bug.
> And I don't see how to avoid it generically on x86 or any other
> architecture.  (PeterZ, am I wrong?  Are we really just getting lucky
> that x86 pvop calls actually work?  Or do we not have enough of them
> that take pointers as arguments for this to matter?)
>
> Plus, asm volatile ( ..., "memory" ) is a barrier and makes code
> generation suck.
>
> Whereas, if we use my suggestion the semantics are precisely those of
> any other C function call because, as far as GCC is concerned, it *is*
> a C function call.  So the generated code *is* a function call.
>

But it is the *compiler* that actually emits the call.

Only, the target of that call is a jmpq to another location where some
version of the routine lives, and all have the same prototype.

How is that any different from PLTs in shared libraries?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines
  2018-10-05 17:15   ` Ard Biesheuvel
@ 2018-10-05 17:26     ` Andy Lutomirski
  2018-10-05 17:28       ` Ard Biesheuvel
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2018-10-05 17:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 5, 2018 at 10:15 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On 5 October 2018 at 15:37, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> ...
> > Therefore, I think this patch goes in exactly the wrong direction. I
> > mean, if you want to introduce dynamic patching as a means for making
> > the crypto API's dynamic dispatch stuff not as slow in a post-spectre
> > world, sure, go for it; that may very well be a good idea. But
> > presenting it as an alternative to Zinc very widely misses the point and
> > serves to prolong a series of bad design choices, which are now able to
> > be rectified by putting energy into Zinc instead.
> >
>
> This series has nothing to do with dynamic dispatch: the call sites
> call crypto functions using ordinary function calls (although my
> example uses CRC-T10DIF), and these calls are redirected via what is
> essentially a PLT entry, so that we can supsersede those routines at
> runtime.

If you really want to do it PLT-style, then just do:

extern void whatever_func(args);

Call it like:
whatever_func(args here);

And rig up something to emit asm like:

GLOBAL(whatever_func)
  jmpq default_whatever_func
ENDPROC(whatever_func)

Architectures without support can instead do:

void whatever_func(args)
{
  READ_ONCE(patchable_function_struct_for_whatever_func->ptr)(args);
}

and patch the asm function for basic support.  It will be slower than
necessary, but maybe the relocation trick could be used on top of this
to redirect the call to whatever_func directly to the target for
architectures that want to squeeze out the last bit of performance.
This might actually be the best of all worlds: easy implementation on
all architectures, no inline asm, and the totally non-magical version
works with okay performance.

(Is this what your code is doing?  I admit I didn't follow all the way
through all the macros.)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines
  2018-10-05 17:26     ` Andy Lutomirski
@ 2018-10-05 17:28       ` Ard Biesheuvel
  2018-10-05 18:00         ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05 17:28 UTC (permalink / raw)
  To: linux-arm-kernel

On 5 October 2018 at 19:26, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Oct 5, 2018 at 10:15 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>>
>> On 5 October 2018 at 15:37, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>> ...
>> > Therefore, I think this patch goes in exactly the wrong direction. I
>> > mean, if you want to introduce dynamic patching as a means for making
>> > the crypto API's dynamic dispatch stuff not as slow in a post-spectre
>> > world, sure, go for it; that may very well be a good idea. But
>> > presenting it as an alternative to Zinc very widely misses the point and
>> > serves to prolong a series of bad design choices, which are now able to
>> > be rectified by putting energy into Zinc instead.
>> >
>>
>> This series has nothing to do with dynamic dispatch: the call sites
>> call crypto functions using ordinary function calls (although my
>> example uses CRC-T10DIF), and these calls are redirected via what is
>> essentially a PLT entry, so that we can supsersede those routines at
>> runtime.
>
> If you really want to do it PLT-style, then just do:
>
> extern void whatever_func(args);
>
> Call it like:
> whatever_func(args here);
>
> And rig up something to emit asm like:
>
> GLOBAL(whatever_func)
>   jmpq default_whatever_func
> ENDPROC(whatever_func)
>
> Architectures without support can instead do:
>
> void whatever_func(args)
> {
>   READ_ONCE(patchable_function_struct_for_whatever_func->ptr)(args);
> }
>
> and patch the asm function for basic support.  It will be slower than
> necessary, but maybe the relocation trick could be used on top of this
> to redirect the call to whatever_func directly to the target for
> architectures that want to squeeze out the last bit of performance.
> This might actually be the best of all worlds: easy implementation on
> all architectures, no inline asm, and the totally non-magical version
> works with okay performance.
>
> (Is this what your code is doing?  I admit I didn't follow all the way
> through all the macros.)

Basically

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 17:23               ` Ard Biesheuvel
@ 2018-10-05 17:28                 ` Andy Lutomirski
  2018-10-05 17:37                   ` Jason A. Donenfeld
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2018-10-05 17:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 5, 2018 at 10:23 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On 5 October 2018 at 19:20, Andy Lutomirski <luto@kernel.org> wrote:
> > On Fri, Oct 5, 2018 at 10:11 AM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> >>
> >> On 5 October 2018 at 18:58, Andy Lutomirski <luto@kernel.org> wrote:
> >> > On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >> >>
> >> >> On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >> >
> >> >> >
> >> >> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> >> >>
> >> >> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> >> >> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
> >> >> >>> new file mode 100644
> >> >> >>> index 000000000000..8fc3b4c9b38f
> >> >> >>> --- /dev/null
> >> >> >>> +++ b/include/linux/ffp.h
> >> >> >>> @@ -0,0 +1,43 @@
> >> >> >>> +/* SPDX-License-Identifier: GPL-2.0 */
> >> >> >>> +
> >> >> >>> +#ifndef __LINUX_FFP_H
> >> >> >>> +#define __LINUX_FFP_H
> >> >> >>> +
> >> >> >>> +#include <linux/types.h>
> >> >> >>> +#include <linux/compiler.h>
> >> >> >>> +
> >> >> >>> +#ifdef CONFIG_HAVE_ARCH_FFP
> >> >> >>> +#include <asm/ffp.h>
> >> >> >>> +#else
> >> >> >>> +
> >> >> >>> +struct ffp {
> >> >> >>> +    void    (**fn)(void);
> >> >> >>> +    void    (*default_fn)(void);
> >> >> >>> +};
> >> >> >>> +
> >> >> >>> +#define DECLARE_FFP(_fn, _def)                        \
> >> >> >>> +    extern typeof(_def) *_fn;                    \
> >> >> >>> +    extern struct ffp const __ffp_ ## _fn
> >> >> >>> +
> >> >> >>> +#define DEFINE_FFP(_fn, _def)                        \
> >> >> >>> +    typeof(_def) *_fn = &_def;                    \
> >> >> >>> +    struct ffp const __ffp_ ## _fn                    \
> >> >> >>> +        = { (void(**)(void))&_fn, (void(*)(void))&_def };    \
> >> >> >>> +    EXPORT_SYMBOL(__ffp_ ## _fn)
> >> >> >>> +
> >> >> >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
> >> >> >>> +{
> >> >> >>> +    WRITE_ONCE(*m->fn, new_fn);
> >> >> >>> +}
> >> >> >>> +
> >> >> >>> +static inline void ffp_reset_target(const struct ffp *m)
> >> >> >>> +{
> >> >> >>> +    WRITE_ONCE(*m->fn, m->default_fn);
> >> >> >>> +}
> >> >> >>> +
> >> >> >>> +#endif
> >> >> >>> +
> >> >> >>> +#define SET_FFP(_fn, _new)    ffp_set_target(&__ffp_ ## _fn, _new)
> >> >> >>> +#define RESET_FFP(_fn)        ffp_reset_target(&__ffp_ ## _fn)
> >> >> >>> +
> >> >> >>> +#endif
> >> >> >>
> >> >> >> I don't understand this interface. There is no wrapper for the call
> >> >> >> site, so how are we going to patch all call-sites when you update the
> >> >> >> target?
> >> >> >
> >> >> > I?m also confused.
> >> >> >
> >> >> > Anyway, we have patchable functions on x86. They?re called PVOPs, and they?re way overcomplicated.
> >> >> >
> >> >> > I?ve proposed a better way that should generate better code, be more portable, and be more maintainable.  It goes like this.
> >> >> >
> >> >> > To call the function, you literally just call  the default implementation.  It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we?re using the ?nonexistent wrapper? approach, we can link in a .S or linker script to alias them to the default implementation.
> >> >> >
> >> >> > To patch them, we just patch them. It can?t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment.  Or we ask gcc for an extension to align calls to a function.)
> >> >> >
> >> >> > Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module.
> >> >>
> >> >> Yeah nothing is ever simple on x86 :-(
> >> >>
> >> >> So are you saying the approach i use in patch #2 (which would
> >> >> translate to emitting a jmpq instruction pointing to the default
> >> >> implementation, and patching it at runtime to point elsewhere) would
> >> >> not fly on x86?
> >> >
> >> > After getting some more sleep, I'm obviously wrong.  The
> >> > text_poke_bp() mechanism will work.  It's just really slow.
> >> >
> >>
> >> OK
> >>
> >> > Let me try to summarize some of the issues.  First, when emitting
> >> > jumps and calls from inline asm on x86, there are a few considerations
> >> > that are annoying:
> >> >
> >> > 1. Following the x86_64 ABI calling conventions is basically
> >> > impossible.  x86_64 requires a 128-byte redzone and 16-byte stack
> >> > alignment.  After much discussion a while back, we decided that it was
> >> > flat-out impossible on current gcc to get the stack pointer aligned in
> >> > a known manner in an inline asm statement.  Instead, if we actually
> >> > need alignment, we need to align manually.  Fortunately, the kernel is
> >> > built with an override that forces only 8-byte alignment (on *most*
> >> > GCC versions).  But for crypto in particular, it sucks extra, since
> >> > the crypto code is basically the only thing in the kernel that
> >> > actually wants 16-byte alignment.  I don't think this is a huge
> >> > problem in practice, but it's annoying.  And the kernel is built
> >> > without a redzone.
> >> >
> >> > 2. On x86_64, depending on config, we either need frame pointers or
> >> > ORC.  ORC is no big deal -- it Just Works (tm).  Frame pointers need
> >> > extra asm hackery.  It's doable, but it's still annoying.
> >> >
> >> > 3. Actually getting the asm constraints right to do what a C
> >> > programmer expects is distinctly nontrivial.  I just fixed an
> >> > extremely longstanding bug in the vDSO code in which the asm
> >> > constraints for the syscall fallback were wrong in such a way that GCC
> >> > didn't notice that the fallback wrote to its output parameter.
> >> > Whoops.
> >> >
> >>
> >> OK, so the thing I am missing is why this all matters.
> >>
> >> Note that the compiler should take care of all of this. It emits a
> >> call a function with external linkage having prototype X, and all the
> >> inline asm does is emit a jmp to some function having that same
> >> prototype, either the default one or the one we patched in.
> >>
> >> Apologies if I am missing something obvious here: as you know, x86 is
> >> not my focus in general.
> >
> > The big issue that bothers me isn't the x86-ism so much as the nasty
> > interactions with the optimizer.  On x86, we have all of this working.
> > It's in arch/x86/include/asm/paravirt_types.h, and it looks roughly
> > like:
> >
> >                         asm volatile(pre                                \
> >                                      paravirt_alt(PARAVIRT_CALL)        \
> >                                      post                               \
> >                                      : call_clbr, ASM_CALL_CONSTRAINT   \
> >                                      : paravirt_type(op),               \
> >                                        paravirt_clobber(clbr),          \
> >                                        ##__VA_ARGS__                    \
> >                                      : "memory", "cc" extra_clbr);      \
> >
> > With some extra magic for the constraints.  And I don't even think
> > this is strictly correct -- from very recent experience, telling the
> > compiler that "memory" is clobbered and that a bunch of arguments are
> > used as numeric inputs may not actually imply that the asm modifies
> > the target of pointer arguments.  Checks this lovely bug out:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=715bd9d12f84d8f5cc8ad21d888f9bc304a8eb0b
> >
> > As far as I can tell, the whole PVOP infrastructure has the same bug.
> > And I don't see how to avoid it generically on x86 or any other
> > architecture.  (PeterZ, am I wrong?  Are we really just getting lucky
> > that x86 pvop calls actually work?  Or do we not have enough of them
> > that take pointers as arguments for this to matter?)
> >
> > Plus, asm volatile ( ..., "memory" ) is a barrier and makes code
> > generation suck.
> >
> > Whereas, if we use my suggestion the semantics are precisely those of
> > any other C function call because, as far as GCC is concerned, it *is*
> > a C function call.  So the generated code *is* a function call.
> >
>
> But it is the *compiler* that actually emits the call.
>
> Only, the target of that call is a jmpq to another location where some
> version of the routine lives, and all have the same prototype.
>
> How is that any different from PLTs in shared libraries?

Ah, I see, I misunderstood your code.

See other email.  I think that, if you rework your series a bit to
have a generic version that works on all architectures, then do it
like you did on ARM, and make sure you leave the door open for the
inline patching approach, then it looks pretty good.

(None of this is to say that I disagree with Jason, though -- I'm not
entirely convinced that this makes sense for Zinc.  But maybe it can
be done in a way that makes everyone happy.)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 17:28                 ` Andy Lutomirski
@ 2018-10-05 17:37                   ` Jason A. Donenfeld
  2018-10-05 17:44                     ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Jason A. Donenfeld @ 2018-10-05 17:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 5, 2018 at 7:29 PM Andy Lutomirski <luto@kernel.org> wrote:
> (None of this is to say that I disagree with Jason, though -- I'm not
> entirely convinced that this makes sense for Zinc.  But maybe it can
> be done in a way that makes everyone happy.)

Zinc indeed will continue to push in the simpler and more minimal
direction. Down the line I'm open to trying and benching a few
different ways of going about it with dynamic patching -- something
that will be pretty easy to experiment with given the lean structure
of Zinc -- but for the initial merge I intend to do it the way it is,
which is super fast and pretty straightforward to follow.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 17:37                   ` Jason A. Donenfeld
@ 2018-10-05 17:44                     ` Andy Lutomirski
  2018-10-05 18:28                       ` Jason A. Donenfeld
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2018-10-05 17:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 5, 2018 at 10:38 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Fri, Oct 5, 2018 at 7:29 PM Andy Lutomirski <luto@kernel.org> wrote:
> > (None of this is to say that I disagree with Jason, though -- I'm not
> > entirely convinced that this makes sense for Zinc.  But maybe it can
> > be done in a way that makes everyone happy.)
>
> Zinc indeed will continue to push in the simpler and more minimal
> direction. Down the line I'm open to trying and benching a few
> different ways of going about it with dynamic patching -- something
> that will be pretty easy to experiment with given the lean structure
> of Zinc -- but for the initial merge I intend to do it the way it is,
> which is super fast and pretty straightforward to follow.
>

I *think* the only change to Zinc per se would be that the calls like
chacha20_simd() would be static calls or patchable functions or
whatever we want to call them.  And there could be a debugfs to
override the default selection.

Ard, I don't think that sticking this in udev rules makes sense.  The
kernel has bascially complete information as to what the right choice
is, and that will change over time as the implementation gets tuned,
and the udev rules will never get updated in sync.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines
  2018-10-05 17:28       ` Ard Biesheuvel
@ 2018-10-05 18:00         ` Andy Lutomirski
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2018-10-05 18:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 5, 2018 at 10:28 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On 5 October 2018 at 19:26, Andy Lutomirski <luto@kernel.org> wrote:
> > On Fri, Oct 5, 2018 at 10:15 AM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> >>
> >> On 5 October 2018 at 15:37, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >> ...
> >> > Therefore, I think this patch goes in exactly the wrong direction. I
> >> > mean, if you want to introduce dynamic patching as a means for making
> >> > the crypto API's dynamic dispatch stuff not as slow in a post-spectre
> >> > world, sure, go for it; that may very well be a good idea. But
> >> > presenting it as an alternative to Zinc very widely misses the point and
> >> > serves to prolong a series of bad design choices, which are now able to
> >> > be rectified by putting energy into Zinc instead.
> >> >
> >>
> >> This series has nothing to do with dynamic dispatch: the call sites
> >> call crypto functions using ordinary function calls (although my
> >> example uses CRC-T10DIF), and these calls are redirected via what is
> >> essentially a PLT entry, so that we can supsersede those routines at
> >> runtime.
> >
> > If you really want to do it PLT-style, then just do:
> >
> > extern void whatever_func(args);
> >
> > Call it like:
> > whatever_func(args here);
> >
> > And rig up something to emit asm like:
> >
> > GLOBAL(whatever_func)
> >   jmpq default_whatever_func
> > ENDPROC(whatever_func)
> >
> > Architectures without support can instead do:
> >
> > void whatever_func(args)
> > {
> >   READ_ONCE(patchable_function_struct_for_whatever_func->ptr)(args);
> > }
> >
> > and patch the asm function for basic support.  It will be slower than
> > necessary, but maybe the relocation trick could be used on top of this
> > to redirect the call to whatever_func directly to the target for
> > architectures that want to squeeze out the last bit of performance.
> > This might actually be the best of all worlds: easy implementation on
> > all architectures, no inline asm, and the totally non-magical version
> > works with okay performance.
> >
> > (Is this what your code is doing?  I admit I didn't follow all the way
> > through all the macros.)
>
> Basically

Adding Josh Poimboeuf.

Here's a sketch of how this could work for better performance.  For a
static call "foo" that returns void and takes no arguments, the
generic implementation does something like this:

extern void foo(void);

struct static_call {
  void (*target)(void);

  /* arch-specific part containing an array of struct static_call_site */
};

void foo(void)
{
  READ_ONCE(__static_call_foo->target)();
}

Arch code overrides it to:

GLOBAL(foo)
  jmpq *__static_call_foo(%rip)
ENDPROC(foo)

and some extra asm to emit a static_call_site object saying that the
address "foo" is a jmp/call instruction where the operand is at offset
1 into the instruction.  (Or whatever the offset is.)

The patch code is like:

void set_static_call(struct static_call *call, void *target)
{
  /* take a spinlock? */
  WRITE_ONCE(call->target, target);
  arch_set_static_call(call, target);
}

and the arch code patches the call site if needed.

On x86, an even better implementation would have objtool make a bunch
of additional static_call_site objects for each call to foo, and
arch_set_static_call() would update all of them, too.  Using
text_poke_bp() if needed, and "if needed" can maybe be clever and
check the alignment of the instruction.  I admit that I never actually
remember the full rules for atomically patching an instruction on x86
SMP.  (Hmm.  This will be really epically slow.  Maybe we don't care.
Or we could finally optimize text_poke, etc to take a list of pokes to
do and do them as a batch.  But that's not a prerequisite for the rest
of this.)

What do you all think?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 17:44                     ` Andy Lutomirski
@ 2018-10-05 18:28                       ` Jason A. Donenfeld
  2018-10-05 19:13                         ` Ard Biesheuvel
  0 siblings, 1 reply; 30+ messages in thread
From: Jason A. Donenfeld @ 2018-10-05 18:28 UTC (permalink / raw)
  To: linux-arm-kernel

Hey Andy,

On Fri, Oct 5, 2018 at 7:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> I *think* the only change to Zinc per se would be that the calls like
> chacha20_simd() would be static calls or patchable functions or
> whatever we want to call them.  And there could be a debugfs to
> override the default selection.

Yea, right, exactly. It turns out this is really easy to do with the
way it's structured now. I'd actually experimented considerably with
using the static keys a while back, but couldn't find any performance
difference on any platform at all (four ARM microarchitectures, three
MIPS, various random intel, an old powerpc), so went with the simplest
solution. But we can certainly play with more elaborate patching
mechanisms later on and see how those turn out. Also, even with the
simple bools as we have now, it's quite easy to make all the
parameters toggle-able.

> Ard, I don't think that sticking this in udev rules makes sense.  The
> kernel has bascially complete information as to what the right choice
> is, and that will change over time as the implementation gets tuned,
> and the udev rules will never get updated in sync.

Yes, I agree with this.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/9] kernel: add support for patchable function pointers
  2018-10-05 18:28                       ` Jason A. Donenfeld
@ 2018-10-05 19:13                         ` Ard Biesheuvel
  0 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2018-10-05 19:13 UTC (permalink / raw)
  To: linux-arm-kernel



> On 5 Oct 2018, at 20:28, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> 
> Hey Andy,
> 
>> On Fri, Oct 5, 2018 at 7:44 PM Andy Lutomirski <luto@kernel.org> wrote:
>> I *think* the only change to Zinc per se would be that the calls like
>> chacha20_simd() would be static calls or patchable functions or
>> whatever we want to call them.  And there could be a debugfs to
>> override the default selection.
> 
> Yea, right, exactly. It turns out this is really easy to do with the
> way it's structured now. I'd actually experimented considerably with
> using the static keys a while back, but couldn't find any performance
> difference on any platform at all (four ARM microarchitectures, three
> MIPS, various random intel, an old powerpc), so went with the simplest
> solution. But we can certainly play with more elaborate patching
> mechanisms later on and see how those turn out. Also, even with the
> simple bools as we have now, it's quite easy to make all the
> parameters toggle-able.
> 
>> Ard, I don't think that sticking this in udev rules makes sense.  The
>> kernel has bascially complete information as to what the right choice
>> is, and that will change over time as the implementation gets tuned,
>> and the udev rules will never get updated in sync.
> 
> Yes, I agree with this.
> 
> 

I am not referring to udev rules. I just mean the current way that udev autoloads modules based on CPU features.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2018-10-05 19:13 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-05  8:13 [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Ard Biesheuvel
2018-10-05  8:13 ` [RFC PATCH 1/9] kernel: add support for patchable function pointers Ard Biesheuvel
2018-10-05 13:57   ` Peter Zijlstra
2018-10-05 14:03     ` Ard Biesheuvel
2018-10-05 14:14   ` Peter Zijlstra
2018-10-05 14:57     ` Ard Biesheuvel
2018-10-05 15:08     ` Andy Lutomirski
2018-10-05 15:24       ` Ard Biesheuvel
2018-10-05 16:58         ` Andy Lutomirski
2018-10-05 17:11           ` Ard Biesheuvel
2018-10-05 17:20             ` Andy Lutomirski
2018-10-05 17:23               ` Ard Biesheuvel
2018-10-05 17:28                 ` Andy Lutomirski
2018-10-05 17:37                   ` Jason A. Donenfeld
2018-10-05 17:44                     ` Andy Lutomirski
2018-10-05 18:28                       ` Jason A. Donenfeld
2018-10-05 19:13                         ` Ard Biesheuvel
2018-10-05  8:13 ` [RFC PATCH 2/9] arm64: kernel: add arch " Ard Biesheuvel
2018-10-05  8:13 ` [RFC PATCH 3/9] crypto: crc-t10dif - make crc_t10dif a static inline Ard Biesheuvel
2018-10-05  8:13 ` [RFC PATCH 4/9] crypto: crc-t10dif - use patchable function pointer for core update routine Ard Biesheuvel
2018-10-05  8:13 ` [RFC PATCH 5/9] crypto: crc-t10dif/arm64 - move PMULL based code into core library Ard Biesheuvel
2018-10-05  8:13 ` [RFC PATCH 6/9] crypto: crc-t10dif/arm " Ard Biesheuvel
2018-10-05  8:13 ` [RFC PATCH 7/9] crypto: crct10dif/generic - switch crypto API driver to " Ard Biesheuvel
2018-10-05  8:13 ` [RFC PATCH 8/9] crypto: crc-t10dif/powerpc - move PMULL based code into " Ard Biesheuvel
2018-10-05  8:13 ` [RFC PATCH 9/9] crypto: crc-t10dif/x86 " Ard Biesheuvel
2018-10-05 13:37 ` [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines Jason A. Donenfeld
2018-10-05 17:15   ` Ard Biesheuvel
2018-10-05 17:26     ` Andy Lutomirski
2018-10-05 17:28       ` Ard Biesheuvel
2018-10-05 18:00         ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).