Linux-Security-Module Archive on lore.kernel.org
* [RFC v1 PATCH 00/17] prmem: protected memory
@ 2018-10-23 21:34 Igor Stoppa
  2018-10-23 21:34 ` [PATCH 01/17] prmem: linker section for static write rare Igor Stoppa
                   ` (17 more replies)
  0 siblings, 18 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott

-- Summary --

Preliminary version of the memory protection patchset, including a sample
use case: making the IMA measurement list write-rare.

The core idea is to introduce two new types of memory protection, besides
const and __ro_after_init, which will support:
- statically allocated "write rare" memory
- dynamically allocated "read only" and "write rare" memory

On top of that follows a set of patches which create a "write rare"
counterpart of the kernel infrastructure used in the example chosen for
hardening: the IMA measurement list.

-- Mechanism --

Statically allocated protected memory is identified by the __wr_after_init
tag, which will cause the linker to place it in a special section.

Dynamically allocated memory is obtained through vmalloc, compacting each
allocation, where possible, into the most recently obtained vmap_area.

The write rare mechanism is implemented by creating a temporary alternate
writable mapping, applying the change through this mapping and then
removing it.

All of this is possible thanks to the system MMU, which must be able to
provide write protection.

-- Brief history --

I have sent out various versions of the memory protection patchset over the
last year or so; however, this one is significantly expanded, including
several helper data structures and a use case, so I decided to reset the
numbering to v1.

As reference, the latest "old" version is here [1].

The current version is not yet ready for merging, but I think it is
sufficiently complete to support an end-to-end discussion.

Eventually, I plan to write a white paper, once the code is in better shape.
In the meantime, an overview is available in these slides [2], the support
material for my presentation at the Linux Security Summit Europe 2018.

-- Validation --

Most of the testing is done on a Fedora image, with QEMU x86_64;
the code has also been tested on a real x86_64 PC, yielding
similarly positive results.
For arm64, I use a custom Debian installation, still with QEMU, and I have
observed similar failures when testing on a real device, using a
Kirin 970.

I have written some test cases for the most basic parts, and the behaviour
of IMA and of the Fedora image in general does not seem to be negatively
affected when used in conjunction with this patchset.
However, this is far from exhaustive testing, and the RCU torture test is
completely missing.

-- Known Issues --

As said, this version is preliminary and certain parts need rework.
This is a short and incomplete list of known issues:

* arm64 support is broken for __wr_after_init
  I must create a separate section with proper mappings, similar to the
  ones used for vmalloc()

* alignment of data structures has not been thoroughly checked
  There are probably several redundant forced alignments

* there is no fallback for platforms missing MMU write protection

* some additional care might be needed when dealing with double mapping vs
  data cache coherency, on multicore systems

* lots of additional (stress) tests are needed

* memory reuse (object caches) is probably needed, to support converting
  more use cases, and with them other data structures.

* credits for original code: I have reimplemented various data structures
  and I am not sure I have credited the original authors correctly.

* documentation for the re-implemented data structures is missing

* confirm that the hardened usercopy logic is correct

-- Q&As --

During reviews of the older patchset, several objections and questions
were formulated.

They are collected here in Q&A format, with both some old and new answers:

1 - "The protection can still be undone"
Yes, it is true. Using a hypervisor, as is done in certain Huawei and
Samsung phones, provides a better level of protection.
However, even without one, this still gives a significantly better level
of protection than not protecting the memory at all.
The main advantage of this patchset is that an attack now has to focus on
the page table, which is a significantly smaller target than the whole of
kernel data.
It is my intention to eventually provide support also for interaction
with a FOSS hypervisor (e.g. KVM), but this patchset should also cover
those cases where it is not even possible to have a hypervisor, so it
seems simpler to start from here. The hypervisor is not mandatory.


2 - "Do not provide a chain of trust, but protect some memory and refer to
it with a writable pointer."
This might be ok for protecting against bugs, but against an attacker
trying to compromise the system, the unprotected pointer simply becomes
the new target; it doesn't change much.
Samsung does use a similar implementation, for protecting LSM hooks,
however that solution also adds a pointer, from the protected memory back
to the writable memory, as a validation loop. The price to pay is that
every time the unprotected pointer is used, it must first be validated,
to point into a certain memory range and to have a specific alignment.
It is an alternative to a full chain of trust, and each approach has its
specific advantages, depending on the data structures one wants to
protect.


3 - "Do not use a secondary mapping, unprotect the current one"
The purpose of the secondary mapping is to create a hard-to-spot window of
writability at a random address, which cannot be easily exploited.
Unprotecting the primary mapping would allow an attack where a core
busy-loops, trying to detect when a specific location becomes writable,
racing against the legitimate writer. For the same reason, interrupts
are disabled on the core performing the write-rare operation.


4 - "Do not create another allocator over vmalloc(), use it plain"
This is not good, for various reasons:
a) vmalloc() allocates at least one page for every request it receives,
typically leaving most of the page unused. While this might not be a big
deal on large systems, IoT-class devices often pair relatively powerful
cores with relatively little memory.
Taking as an example a system using SELinux, a relatively small set of
rules can generate a few thousand allocations (SELinux is deny-by-default).
Modeling each allocation as about 64 bytes, on a system with 4kB pages,
and assuming a grand total of 100k allocations, that means
                 100k * 4kB = 390MB
while packing each allocation into a 64-byte slot within a page yields:
                 100k * 64B = 6MB
The first case would not be very compatible with a system having only
512MB or 1GB of memory.
b) even worse, the amount of TLB thrashing would be terrible, with each
allocation having its own translation.

--

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>


-- References --

[1]: https://lkml.org/lkml/2018/4/23/508
[2]: https://events.linuxfoundation.org/wp-content/uploads/2017/12/Kernel-Hardening-Protecting-the-Protection-Mechanisms-Igor-Stoppa-Huawei.pdf

-- List of patches --

[PATCH 01/17] prmem: linker section for static write rare
[PATCH 02/17] prmem: write rare for static allocation
[PATCH 03/17] prmem: vmalloc support for dynamic allocation
[PATCH 04/17] prmem: dynamic allocation
[PATCH 05/17] prmem: shorthands for write rare on common types
[PATCH 06/17] prmem: test cases for memory protection
[PATCH 07/17] prmem: lkdtm tests for memory protection
[PATCH 08/17] prmem: struct page: track vmap_area
[PATCH 09/17] prmem: hardened usercopy
[PATCH 10/17] prmem: documentation
[PATCH 11/17] prmem: llist: use designated initializer
[PATCH 12/17] prmem: linked list: set alignment
[PATCH 13/17] prmem: linked list: disable layout randomization
[PATCH 14/17] prmem: llist, hlist, both plain and rcu
[PATCH 15/17] prmem: test cases for prlist and prhlist
[PATCH 16/17] prmem: pratomic-long
[PATCH 17/17] prmem: ima: turn the measurements list write rare

-- Diffstat --

 Documentation/core-api/index.rst          |   1 +
 Documentation/core-api/prmem.rst          | 172 +++++
 MAINTAINERS                               |  14 +
 drivers/misc/lkdtm/core.c                 |  13 +
 drivers/misc/lkdtm/lkdtm.h                |  13 +
 drivers/misc/lkdtm/perms.c                | 248 +++++++
 include/asm-generic/vmlinux.lds.h         |  20 +
 include/linux/cache.h                     |  17 +
 include/linux/list.h                      |   5 +-
 include/linux/mm_types.h                  |  25 +-
 include/linux/pratomic-long.h             |  73 ++
 include/linux/prlist.h                    | 934 ++++++++++++++++++++++++
 include/linux/prmem.h                     | 446 +++++++++++
 include/linux/prmemextra.h                | 133 ++++
 include/linux/types.h                     |  20 +-
 include/linux/vmalloc.h                   |  11 +-
 lib/Kconfig.debug                         |   9 +
 lib/Makefile                              |   1 +
 lib/test_prlist.c                         | 252 +++++++
 mm/Kconfig                                |   6 +
 mm/Kconfig.debug                          |   9 +
 mm/Makefile                               |   2 +
 mm/prmem.c                                | 273 +++++++
 mm/test_pmalloc.c                         | 629 ++++++++++++++++
 mm/test_write_rare.c                      | 236 ++++++
 mm/usercopy.c                             |   5 +
 mm/vmalloc.c                              |   7 +
 security/integrity/ima/ima.h              |  18 +-
 security/integrity/ima/ima_api.c          |  29 +-
 security/integrity/ima/ima_fs.c           |  12 +-
 security/integrity/ima/ima_main.c         |   6 +
 security/integrity/ima/ima_queue.c        |  28 +-
 security/integrity/ima/ima_template.c     |  14 +-
 security/integrity/ima/ima_template_lib.c |  14 +-
 34 files changed, 3635 insertions(+), 60 deletions(-)


^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/17] prmem: linker section for static write rare
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-23 21:34 ` [PATCH 02/17] prmem: write rare for static allocation Igor Stoppa
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Arnd Bergmann, Thomas Gleixner, Kate Stewart, Greg Kroah-Hartman,
	Philippe Ombredanne, linux-arch, linux-kernel

Introduce a section and a label for statically allocated write rare
data.

The label is named "__wr_after_init".
As the name implies, after the init phase is completed, this section
will be modifiable only by invoking write rare functions.

NOTE:
this needs rework, because the current write-rare mechanism works
only on x86_64, not on arm64, due to the arm64 memory mappings.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Arnd Bergmann <arnd@arndb.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Kate Stewart <kstewart@linuxfoundation.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Philippe Ombredanne <pombredanne@nexb.com>
CC: linux-arch@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 include/asm-generic/vmlinux.lds.h | 20 ++++++++++++++++++++
 include/linux/cache.h             | 17 +++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index d7701d466b60..fd40a15e3b24 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -300,6 +300,25 @@
 	. = __start_init_task + THREAD_SIZE;				\
 	__end_init_task = .;
 
+/*
+ * Allow architectures to handle wr_after_init data on their
+ * own by defining an empty WR_AFTER_INIT_DATA.
+ * However, it's important that pages containing WR_RARE data do not
+ * hold anything else, to avoid both accidentally unprotecting something
+ * that is supposed to stay read-only all the time and also to protect
+ * something else that is supposed to be writeable all the time.
+ */
+#ifndef WR_AFTER_INIT_DATA
+#define WR_AFTER_INIT_DATA(align)					\
+	. = ALIGN(PAGE_SIZE);						\
+	__start_wr_after_init = .;					\
+	. = ALIGN(align);						\
+	*(.data..wr_after_init)						\
+	. = ALIGN(PAGE_SIZE);						\
+	__end_wr_after_init = .;					\
+	. = ALIGN(align);
+#endif
+
 /*
  * Allow architectures to handle ro_after_init data on their
  * own by defining an empty RO_AFTER_INIT_DATA.
@@ -320,6 +339,7 @@
 		__start_rodata = .;					\
 		*(.rodata) *(.rodata.*)					\
 		RO_AFTER_INIT_DATA	/* Read only after init */	\
+		WR_AFTER_INIT_DATA(align) /* wr after init */	\
 		KEEP(*(__vermagic))	/* Kernel version magic */	\
 		. = ALIGN(8);						\
 		__start___tracepoints_ptrs = .;				\
diff --git a/include/linux/cache.h b/include/linux/cache.h
index 750621e41d1c..9a7e7134b887 100644
--- a/include/linux/cache.h
+++ b/include/linux/cache.h
@@ -31,6 +31,23 @@
 #define __ro_after_init __attribute__((__section__(".data..ro_after_init")))
 #endif
 
+/*
+ * __wr_after_init is used to mark objects that cannot be modified
+ * directly after init (i.e. after mark_rodata_ro() has been called).
+ * These objects become effectively read-only, from the perspective of
+ * performing a direct write, like a variable assignment.
+ * However, they can be altered through a dedicated function.
+ * It is intended for those objects which are occasionally modified after
+ * init, however they are modified so seldomly, that the extra cost from
+ * the indirect modification is either negligible or worth paying, for the
+ * sake of the protection gained.
+ */
+#ifndef __wr_after_init
+#define __wr_after_init \
+		__attribute__((__section__(".data..wr_after_init")))
+#endif
+
+
 #ifndef ____cacheline_aligned
 #define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
 #endif
-- 
2.17.1



* [PATCH 02/17] prmem: write rare for static allocation
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
  2018-10-23 21:34 ` [PATCH 01/17] prmem: linker section for static write rare Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-25  0:24   ` Dave Hansen
  2018-10-26  9:41   ` Peter Zijlstra
  2018-10-23 21:34 ` [PATCH 03/17] prmem: vmalloc support for dynamic allocation Igor Stoppa
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

Implementation of write rare for statically allocated data, located in a
specific memory section through the use of the __wr_after_init label.

The basic functions are wr_memcpy() and wr_memset(): the write rare
counterparts of memcpy() and memset() respectively.

To minimize chances of attacks, this implementation does not unprotect
existing memory pages.
Instead, it remaps them, one by one, at random free locations, as writable.
Each page is mapped as writable strictly for the time needed to perform
changes in said page.
While a page is remapped, interrupts are disabled on the core performing
the write-rare operation, to avoid being frozen mid-operation by an attack
using interrupts to stretch the duration of the alternate mapping.
OTOH, to avoid introducing unpredictable delays, interrupts are re-enabled
in between page remappings, when write operations are either completed or
not yet started, and there is no alternate, writable mapping to exploit.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Michal Hocko <mhocko@kernel.org>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Tatashin <pasha.tatashin@oracle.com>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
---
 MAINTAINERS           |   7 ++
 include/linux/prmem.h | 213 ++++++++++++++++++++++++++++++++++++++++++
 mm/Makefile           |   1 +
 mm/prmem.c            |  10 ++
 4 files changed, 231 insertions(+)
 create mode 100644 include/linux/prmem.h
 create mode 100644 mm/prmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index b2f710eee67a..e566c5d09faf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9454,6 +9454,13 @@ F:	kernel/sched/membarrier.c
 F:	include/uapi/linux/membarrier.h
 F:	arch/powerpc/include/asm/membarrier.h
 
+MEMORY HARDENING
+M:	Igor Stoppa <igor.stoppa@gmail.com>
+L:	kernel-hardening@lists.openwall.com
+S:	Maintained
+F:	include/linux/prmem.h
+F:	mm/prmem.c
+
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
 W:	http://www.linux-mm.org
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
new file mode 100644
index 000000000000..3ba41d76a582
--- /dev/null
+++ b/include/linux/prmem.h
@@ -0,0 +1,213 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * prmem.h: Header for memory protection library
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ *
+ * Support for:
+ * - statically allocated write rare data
+ */
+
+#ifndef _LINUX_PRMEM_H
+#define _LINUX_PRMEM_H
+
+#include <linux/set_memory.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/compiler.h>
+#include <linux/irqflags.h>
+#include <linux/set_memory.h>
+
+/* ============================ Write Rare ============================ */
+
+extern const char WR_ERR_RANGE_MSG[];
+extern const char WR_ERR_PAGE_MSG[];
+
+/*
+ * The following two variables are statically allocated by the linker
+ * script at the boundaries of the memory region (rounded up to
+ * multiples of PAGE_SIZE) reserved for __wr_after_init.
+ */
+extern long __start_wr_after_init;
+extern long __end_wr_after_init;
+
+static __always_inline bool __is_wr_after_init(const void *ptr, size_t size)
+{
+	size_t start = (size_t)&__start_wr_after_init;
+	size_t end = (size_t)&__end_wr_after_init;
+	size_t low = (size_t)ptr;
+	size_t high = (size_t)ptr + size;
+
+	return likely(start <= low && low < high && high <= end);
+}
+
+/**
+ * wr_memset() - sets n bytes of the destination to the c value
+ * @dst: beginning of the memory to write to
+ * @c: byte to replicate
+ * @size: amount of bytes to copy
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_memset(const void *dst, const int c, size_t n_bytes)
+{
+	size_t size;
+	unsigned long flags;
+	uintptr_t d = (uintptr_t)dst;
+
+	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
+		return false;
+	while (n_bytes) {
+		struct page *page;
+		uintptr_t base;
+		uintptr_t offset;
+		uintptr_t offset_complement;
+
+		local_irq_save(flags);
+		page = virt_to_page(d);
+		offset = d & ~PAGE_MASK;
+		offset_complement = PAGE_SIZE - offset;
+		size = min(n_bytes, offset_complement);
+		base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
+		if (WARN(!base, WR_ERR_PAGE_MSG)) {
+			local_irq_restore(flags);
+			return false;
+		}
+		memset((void *)(base + offset), c, size);
+		vunmap((void *)base);
+		d += size;
+		n_bytes -= size;
+		local_irq_restore(flags);
+	}
+	return true;
+}
+
+/**
+ * wr_memcpy() - copies n bytes from source to destination
+ * @dst: beginning of the memory to write to
+ * @src: beginning of the memory to read from
+ * @n_bytes: amount of bytes to copy
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_memcpy(const void *dst, const void *src, size_t n_bytes)
+{
+	size_t size;
+	unsigned long flags;
+	uintptr_t d = (uintptr_t)dst;
+	uintptr_t s = (uintptr_t)src;
+
+	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
+		return false;
+	while (n_bytes) {
+		struct page *page;
+		uintptr_t base;
+		uintptr_t offset;
+		uintptr_t offset_complement;
+
+		local_irq_save(flags);
+		page = virt_to_page(d);
+		offset = d & ~PAGE_MASK;
+		offset_complement = PAGE_SIZE - offset;
+		size = (size_t)min(n_bytes, offset_complement);
+		base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
+		if (WARN(!base, WR_ERR_PAGE_MSG)) {
+			local_irq_restore(flags);
+			return false;
+		}
+		__write_once_size((void *)(base + offset), (void *)s, size);
+		vunmap((void *)base);
+		d += size;
+		s += size;
+		n_bytes -= size;
+		local_irq_restore(flags);
+	}
+	return true;
+}
+
+/*
+ * rcu_assign_pointer is a macro, which takes advantage of being able to
+ * take the address of the destination parameter "p", so that it can be
+ * passed to WRITE_ONCE(), which is called in one of the branches of
+ * rcu_assign_pointer() and also, being a macro, can rely on the
+ * preprocessor for taking the address of its parameter.
+ * For the sake of staying compatible with the API, also
+ * wr_rcu_assign_pointer() is a macro that accepts a pointer as parameter,
+ * instead of the address of said pointer.
+ * However it is simply a wrapper to __wr_rcu_ptr(), which receives the
+ * address of the pointer.
+ */
+static __always_inline
+uintptr_t __wr_rcu_ptr(const void *dst_p_p, const void *src_p)
+{
+	unsigned long flags;
+	struct page *page;
+	void *base;
+	uintptr_t offset;
+	const size_t size = sizeof(void *);
+
+	if (WARN(!__is_wr_after_init(dst_p_p, size), WR_ERR_RANGE_MSG))
+		return (uintptr_t)NULL;
+	local_irq_save(flags);
+	page = virt_to_page(dst_p_p);
+	offset = (uintptr_t)dst_p_p & ~PAGE_MASK;
+	base = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
+	if (WARN(!base, WR_ERR_PAGE_MSG)) {
+		local_irq_restore(flags);
+		return (uintptr_t)NULL;
+	}
+	rcu_assign_pointer((*(void **)(offset + (uintptr_t)base)), src_p);
+	vunmap(base);
+	local_irq_restore(flags);
+	return (uintptr_t)src_p;
+}
+
+#define wr_rcu_assign_pointer(p, v)	__wr_rcu_ptr(&p, v)
+
+#define __wr_simple(dst_ptr, src_ptr)					\
+	wr_memcpy(dst_ptr, src_ptr, sizeof(*(src_ptr)))
+
+#define __wr_safe(dst_ptr, src_ptr,					\
+		  unique_dst_ptr, unique_src_ptr)			\
+({									\
+	typeof(dst_ptr) unique_dst_ptr = (dst_ptr);			\
+	typeof(src_ptr) unique_src_ptr = (src_ptr);			\
+									\
+	wr_memcpy(unique_dst_ptr, unique_src_ptr,			\
+		  sizeof(*(unique_src_ptr)));				\
+})
+
+#define __safe_ops(dst, src)	\
+	(__typecheck(dst, src) && __no_side_effects(dst, src))
+
+/**
+ * wr - copies an object over another of same type and size
+ * @dst_ptr: address of the destination object
+ * @src_ptr: address of the source object
+ */
+#define wr(dst_ptr, src_ptr)						\
+	__builtin_choose_expr(__safe_ops(dst_ptr, src_ptr),		\
+			      __wr_simple(dst_ptr, src_ptr),		\
+			      __wr_safe(dst_ptr, src_ptr,		\
+						__UNIQUE_ID(__dst_ptr),	\
+						__UNIQUE_ID(__src_ptr)))
+
+/**
+ * wr_ptr() - alters a pointer in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_ptr(const void *dst, const void *val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+#endif
diff --git a/mm/Makefile b/mm/Makefile
index 26ef77a3883b..215c6a6d7304 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -64,6 +64,7 @@ obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_PRMEM) += prmem.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/prmem.c b/mm/prmem.c
new file mode 100644
index 000000000000..de9258f5f29a
--- /dev/null
+++ b/mm/prmem.c
@@ -0,0 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * prmem.c: Memory Protection Library
+ *
+ * (C) Copyright 2017-2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ */
+
+const char WR_ERR_RANGE_MSG[] = "Write rare on invalid memory range.";
+const char WR_ERR_PAGE_MSG[] = "Failed to remap write rare page.";
-- 
2.17.1



* [PATCH 03/17] prmem: vmalloc support for dynamic allocation
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
  2018-10-23 21:34 ` [PATCH 01/17] prmem: linker section for static write rare Igor Stoppa
  2018-10-23 21:34 ` [PATCH 02/17] prmem: write rare for static allocation Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-25  0:26   ` Dave Hansen
  2018-10-23 21:34 ` [PATCH 04/17] prmem: " Igor Stoppa
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Andrew Morton, Chintan Pandya, Joe Perches, Luis R. Rodriguez,
	Thomas Gleixner, Kate Stewart, Greg Kroah-Hartman,
	Philippe Ombredanne, linux-mm, linux-kernel

Prepare vmalloc for:
- tagging areas used for dynamic allocation of protected memory
- supporting various tags, related to the property that an area might have
- extrapolating the pool containing a given area
- chaining the areas in each pool
- extrapolating the area containing a given memory address

NOTE:
Since the list_head structure (the purge_list field) is used only when
disposing of the allocation, its two pointers are up for grabs until the
time comes to free the allocation.
To avoid increasing the size of the vmap_area structure, the chain of
vmap_areas is therefore tracked with a singly linked list, spending only
one of those pointers, while the other one provides a direct link to the
parent pool, instead of using a standard doubly linked list.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Michal Hocko <mhocko@kernel.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Chintan Pandya <cpandya@codeaurora.org>
CC: Joe Perches <joe@perches.com>
CC: "Luis R. Rodriguez" <mcgrof@kernel.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Kate Stewart <kstewart@linuxfoundation.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Philippe Ombredanne <pombredanne@nexb.com>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
---
 include/linux/vmalloc.h | 12 +++++++++++-
 mm/vmalloc.c            |  2 +-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..4d14a3b8089e 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -21,6 +21,9 @@ struct notifier_block;		/* in notifier.h */
 #define VM_UNINITIALIZED	0x00000020	/* vm_struct is not fully initialized */
 #define VM_NO_GUARD		0x00000040      /* don't add guard page */
 #define VM_KASAN		0x00000080      /* has allocated kasan shadow memory */
+#define VM_PMALLOC		0x00000100	/* pmalloc area - see docs */
+#define VM_PMALLOC_WR		0x00000200	/* pmalloc write rare area */
+#define VM_PMALLOC_PROTECTED	0x00000400	/* pmalloc protected area */
 /* bits [20..32] reserved for arch specific ioremap internals */
 
 /*
@@ -48,7 +51,13 @@ struct vmap_area {
 	unsigned long flags;
 	struct rb_node rb_node;         /* address sorted rbtree */
 	struct list_head list;          /* address sorted list */
-	struct llist_node purge_list;    /* "lazy purge" list */
+	union {
+		struct llist_node purge_list;    /* "lazy purge" list */
+		struct {
+			struct vmap_area *next;
+			struct pmalloc_pool *pool;
+		};
+	};
 	struct vm_struct *vm;
 	struct rcu_head rcu_head;
 };
@@ -134,6 +143,7 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
 					const void *caller);
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
+extern struct vmap_area *find_vmap_area(unsigned long addr);
 
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
 			struct page **pages);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a728fc492557..15850005fea5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -742,7 +742,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	free_vmap_area_noflush(va);
 }
 
-static struct vmap_area *find_vmap_area(unsigned long addr)
+struct vmap_area *find_vmap_area(unsigned long addr)
 {
 	struct vmap_area *va;
 
-- 
2.17.1



* [PATCH 04/17] prmem: dynamic allocation
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (2 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 03/17] prmem: vmalloc support for dynamic allocation Igor Stoppa
@ 2018-10-23 21:34 ` " Igor Stoppa
  2018-10-23 21:34 ` [PATCH 05/17] prmem: shorthands for write rare on common types Igor Stoppa
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

Extension of protected memory to dynamic allocations.

Allocations are performed from "pools".
A pool is a list of virtual memory areas, in various states of
protection.

Supported cases
===============

Read Only Pool
--------------
Memory is allocated from the pool, in writable state.
Then it gets written and the content of the pool is write protected and
it cannot be altered anymore. It is only possible to destroy the pool.

Auto Read Only Pool
-------------------
Same as the plain read only, but every time a memory area is full and
phased out, it is automatically marked as read only.

Write Rare Pool
---------------
Memory is allocated from the pool, in writable state.
Then it gets written and the content of the pool is write protected and
it can be altered only by invoking special write rare functions.

Auto Write Rare Pool
--------------------
Same as the plain write rare, but every time a memory area is full and
phased out, it is automatically marked as write rare.

Start Write Rare Pool
---------------------
The memory handed out is already in write rare mode and the only way to
alter it is to use write rare functions.

When a pool is destroyed, all the memory that was obtained from it is
automatically freed. This is the only way to release protected memory.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Michal Hocko <mhocko@kernel.org>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Tatashin <pasha.tatashin@oracle.com>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
---
 include/linux/prmem.h | 220 +++++++++++++++++++++++++++++++++--
 mm/Kconfig            |   6 +
 mm/prmem.c            | 263 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 482 insertions(+), 7 deletions(-)

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 3ba41d76a582..26fd48410d97 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -7,6 +7,8 @@
  *
  * Support for:
  * - statically allocated write rare data
+ * - dynamically allocated read only data
+ * - dynamically allocated write rare data
  */
 
 #ifndef _LINUX_PRMEM_H
@@ -22,6 +24,11 @@
 #include <linux/irqflags.h>
 #include <linux/set_memory.h>
 
+#define VM_PMALLOC_MASK \
+		(VM_PMALLOC | VM_PMALLOC_WR | VM_PMALLOC_PROTECTED)
+#define VM_PMALLOC_WR_MASK		(VM_PMALLOC | VM_PMALLOC_WR)
+#define VM_PMALLOC_PROTECTED_MASK	(VM_PMALLOC | VM_PMALLOC_PROTECTED)
+
 /* ============================ Write Rare ============================ */
 
 extern const char WR_ERR_RANGE_MSG[];
@@ -45,11 +52,23 @@ static __always_inline bool __is_wr_after_init(const void *ptr, size_t size)
 	return likely(start <= low && low < high && high <= end);
 }
 
+static __always_inline bool __is_wr_pool(const void *ptr, size_t size)
+{
+	struct vmap_area *area;
+
+	if (!is_vmalloc_addr(ptr))
+		return false;
+	area = find_vmap_area((unsigned long)ptr);
+	return area && area->vm && (area->vm->size >= size) &&
+		((area->vm->flags & (VM_PMALLOC | VM_PMALLOC_WR)) ==
+		 (VM_PMALLOC | VM_PMALLOC_WR));
+}
+
 /**
  * wr_memset() - sets n bytes of the destination to the c value
  * @dst: beginning of the memory to write to
  * @c: byte to replicate
- * @size: amount of bytes to copy
+ * @n_bytes: number of bytes to set
  *
  * Returns true on success, false otherwise.
  */
@@ -59,8 +78,10 @@ bool wr_memset(const void *dst, const int c, size_t n_bytes)
 	size_t size;
 	unsigned long flags;
 	uintptr_t d = (uintptr_t)dst;
+	bool is_virt = __is_wr_after_init(dst, n_bytes);
 
-	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
+	if (WARN(!(is_virt || likely(__is_wr_pool(dst, n_bytes))),
+		 WR_ERR_RANGE_MSG))
 		return false;
 	while (n_bytes) {
 		struct page *page;
@@ -69,7 +90,10 @@ bool wr_memset(const void *dst, const int c, size_t n_bytes)
 		uintptr_t offset_complement;
 
 		local_irq_save(flags);
-		page = virt_to_page(d);
+		if (is_virt)
+			page = virt_to_page(d);
+		else
+			page = vmalloc_to_page((void *)d);
 		offset = d & ~PAGE_MASK;
 		offset_complement = PAGE_SIZE - offset;
 		size = min(n_bytes, offset_complement);
@@ -102,8 +126,10 @@ bool wr_memcpy(const void *dst, const void *src, size_t n_bytes)
 	unsigned long flags;
 	uintptr_t d = (uintptr_t)dst;
 	uintptr_t s = (uintptr_t)src;
+	bool is_virt = __is_wr_after_init(dst, n_bytes);
 
-	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
+	if (WARN(!(is_virt || likely(__is_wr_pool(dst, n_bytes))),
+		 WR_ERR_RANGE_MSG))
 		return false;
 	while (n_bytes) {
 		struct page *page;
@@ -112,7 +138,10 @@ bool wr_memcpy(const void *dst, const void *src, size_t n_bytes)
 		uintptr_t offset_complement;
 
 		local_irq_save(flags);
-		page = virt_to_page(d);
+		if (is_virt)
+			page = virt_to_page(d);
+		else
+			page = vmalloc_to_page((void *)d);
 		offset = d & ~PAGE_MASK;
 		offset_complement = PAGE_SIZE - offset;
 		size = (size_t)min(n_bytes, offset_complement);
@@ -151,11 +180,13 @@ uintptr_t __wr_rcu_ptr(const void *dst_p_p, const void *src_p)
 	void *base;
 	uintptr_t offset;
 	const size_t size = sizeof(void *);
+	bool is_virt = __is_wr_after_init(dst_p_p, size);
 
-	if (WARN(!__is_wr_after_init(dst_p_p, size), WR_ERR_RANGE_MSG))
+	if (WARN(!(is_virt || likely(__is_wr_pool(dst_p_p, size))),
+		 WR_ERR_RANGE_MSG))
 		return (uintptr_t)NULL;
 	local_irq_save(flags);
-	page = virt_to_page(dst_p_p);
+	page = is_virt ? virt_to_page(dst_p_p) : vmalloc_to_page(dst_p_p);
 	offset = (uintptr_t)dst_p_p & ~PAGE_MASK;
 	base = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
 	if (WARN(!base, WR_ERR_PAGE_MSG)) {
@@ -210,4 +241,179 @@ bool wr_ptr(const void *dst, const void *val)
 {
 	return wr_memcpy(dst, &val, sizeof(val));
 }
+
+/* ============================ Allocator ============================ */
+
+#define PMALLOC_REFILL_DEFAULT (0)
+#define PMALLOC_DEFAULT_REFILL_SIZE PAGE_SIZE
+#define PMALLOC_ALIGN_ORDER_DEFAULT ilog2(ARCH_KMALLOC_MINALIGN)
+
+#define PMALLOC_RO		0x00
+#define PMALLOC_WR		0x01
+#define PMALLOC_AUTO		0x02
+#define PMALLOC_START		0x04
+#define PMALLOC_MASK		(PMALLOC_WR | PMALLOC_AUTO | PMALLOC_START)
+
+#define PMALLOC_MODE_RO		PMALLOC_RO
+#define PMALLOC_MODE_WR		PMALLOC_WR
+#define PMALLOC_MODE_AUTO_RO	(PMALLOC_RO | PMALLOC_AUTO)
+#define PMALLOC_MODE_AUTO_WR	(PMALLOC_WR | PMALLOC_AUTO)
+#define PMALLOC_MODE_START_WR	(PMALLOC_WR | PMALLOC_START)
+
+struct pmalloc_pool {
+	struct mutex mutex;
+	struct list_head pool_node;
+	struct vmap_area *area;
+	size_t align;
+	size_t refill;
+	size_t offset;
+	uint8_t mode;
+};
+
+/*
+ * The write rare functionality is fully implemented as __always_inline,
+ * to prevent having an internal function call that is capable of modifying
+ * write protected memory.
+ * Fully inlining the function allows the compiler to optimize away its
+ * interface, making it harder for an attacker to hijack it.
+ * This still leaves the door open to attacks that try to reuse part of
+ * the code by jumping into the middle of the function; however, this
+ * can be mitigated by a compiler plugin that enforces Control Flow
+ * Integrity (CFI).
+ * Any addition/modification to the write rare path must follow the same
+ * approach.
+ */
+
+void pmalloc_init_custom_pool(struct pmalloc_pool *pool, size_t refill,
+			      short align_order, uint8_t mode);
+
+struct pmalloc_pool *pmalloc_create_custom_pool(size_t refill,
+						short align_order,
+						uint8_t mode);
+
+/**
+ * pmalloc_create_pool() - create a protectable memory pool
+ * @mode: whether the data can be altered after protection
+ *
+ * Shorthand for pmalloc_create_custom_pool() with default argument:
+ * * refill is set to PMALLOC_REFILL_DEFAULT
+ * * align_order is set to PMALLOC_ALIGN_ORDER_DEFAULT
+ *
+ * Returns:
+ * * pointer to the new pool	- success
+ * * NULL			- error
+ */
+static inline struct pmalloc_pool *pmalloc_create_pool(uint8_t mode)
+{
+	return pmalloc_create_custom_pool(PMALLOC_REFILL_DEFAULT,
+					  PMALLOC_ALIGN_ORDER_DEFAULT,
+					  mode);
+}
+
+void *pmalloc(struct pmalloc_pool *pool, size_t size);
+
+/**
+ * pzalloc() - zero-initialized version of pmalloc()
+ * @pool: handle to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ *
+ * Executes pmalloc(), initializing the memory requested to 0, before
+ * returning its address.
+ *
+ * Return:
+ * * pointer to the memory requested	- success
+ * * NULL				- error
+ */
+static inline void *pzalloc(struct pmalloc_pool *pool, size_t size)
+{
+	void *ptr = pmalloc(pool, size);
+
+	if (unlikely(!ptr))
+		return ptr;
+	if ((pool->mode & PMALLOC_MODE_START_WR) == PMALLOC_MODE_START_WR)
+		wr_memset(ptr, 0, size);
+	else
+		memset(ptr, 0, size);
+	return ptr;
+}
+
+/**
+ * pmalloc_array() - array version of pmalloc()
+ * @pool: handle to the pool to be used for memory allocation
+ * @n: number of elements in the array
+ * @size: amount of memory (in bytes) requested for each element
+ *
+ * Executes pmalloc() on an array.
+ *
+ * Return:
+ * * the pmalloc result	- success
+ * * NULL		- error
+ */
+
+static inline
+void *pmalloc_array(struct pmalloc_pool *pool, size_t n, size_t size)
+{
+	size_t total_size = n * size;
+
+	if (unlikely(!(n && (total_size / n == size))))
+		return NULL;
+	return pmalloc(pool, n * size);
+}
+
+/**
+ * pcalloc() - array version of pzalloc()
+ * @pool: handle to the pool to be used for memory allocation
+ * @n: number of elements in the array
+ * @size: amount of memory (in bytes) requested for each element
+ *
+ * Executes pzalloc() on an array.
+ *
+ * Return:
+ * * the pmalloc result	- success
+ * * NULL		- error
+ */
+static inline
+void *pcalloc(struct pmalloc_pool *pool, size_t n, size_t size)
+{
+	size_t total_size = n * size;
+
+	if (unlikely(!(n && (total_size / n == size))))
+		return NULL;
+	return pzalloc(pool, n * size);
+}
+
+/**
+ * pstrdup() - duplicate a string, using pmalloc()
+ * @pool: handle to the pool to be used for memory allocation
+ * @s: string to duplicate
+ *
+ * Generates a copy of the given string, allocating sufficient memory
+ * from the given pmalloc pool.
+ *
+ * Return:
+ * * pointer to the replica	- success
+ * * NULL			- error
+ */
+static inline char *pstrdup(struct pmalloc_pool *pool, const char *s)
+{
+	size_t len;
+	char *buf;
+
+	len = strlen(s) + 1;
+	buf = pmalloc(pool, len);
+	if (unlikely(!buf))
+		return buf;
+	if ((pool->mode & PMALLOC_MODE_START_WR) == PMALLOC_MODE_START_WR)
+		wr_memcpy(buf, s, len);
+	else
+		strncpy(buf, s, len);
+	return buf;
+}
+
+
+void pmalloc_protect_pool(struct pmalloc_pool *pool);
+
+void pmalloc_make_pool_ro(struct pmalloc_pool *pool);
+
+void pmalloc_destroy_pool(struct pmalloc_pool *pool);
 #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index de64ea658716..1885f5565cbc 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -764,4 +764,10 @@ config GUP_BENCHMARK
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
+config PRMEM
+    bool
+    depends on MMU
+    depends on ARCH_HAS_SET_MEMORY
+    default y
+
 endmenu
diff --git a/mm/prmem.c b/mm/prmem.c
index de9258f5f29a..7dd13ea43304 100644
--- a/mm/prmem.c
+++ b/mm/prmem.c
@@ -6,5 +6,268 @@
  * Author: Igor Stoppa <igor.stoppa@huawei.com>
  */
 
+#include <linux/printk.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/kernel.h>
+#include <linux/log2.h>
+#include <linux/slab.h>
+#include <linux/set_memory.h>
+#include <linux/bug.h>
+#include <linux/mutex.h>
+#include <linux/llist.h>
+#include <asm/cacheflush.h>
+#include <asm/page.h>
+
+#include <linux/prmem.h>
+
 const char WR_ERR_RANGE_MSG[] = "Write rare on invalid memory range.";
 const char WR_ERR_PAGE_MSG[] = "Failed to remap write rare page.";
+
+static LIST_HEAD(pools_list);
+static DEFINE_MUTEX(pools_mutex);
+
+#define MAX_ALIGN_ORDER (ilog2(sizeof(void *)))
+
+
+/* Various helper functions. Inlined, to reduce the attack surface. */
+
+static __always_inline void protect_area(struct vmap_area *area)
+{
+	set_memory_ro(area->va_start, area->vm->nr_pages);
+	area->vm->flags |= VM_PMALLOC_PROTECTED_MASK;
+}
+
+static __always_inline bool empty(struct pmalloc_pool *pool)
+{
+	return unlikely(!pool->area);
+}
+
+/* Allocation from a protected area is allowed only for a START_WR pool. */
+static __always_inline bool unwritable(struct pmalloc_pool *pool)
+{
+	return  (pool->area->vm->flags & VM_PMALLOC_PROTECTED) &&
+		!((pool->area->vm->flags & VM_PMALLOC_WR) &&
+		  (pool->mode & PMALLOC_START));
+}
+
+static __always_inline
+bool exhausted(struct pmalloc_pool *pool, size_t size)
+{
+	size_t space_before;
+	size_t space_after;
+
+	space_before = round_down(pool->offset, pool->align);
+	space_after = pool->offset - space_before;
+	return unlikely(space_after < size && space_before < size);
+}
+
+static __always_inline
+bool space_needed(struct pmalloc_pool *pool, size_t size)
+{
+	return empty(pool) || unwritable(pool) || exhausted(pool, size);
+}
+
+/**
+ * pmalloc_init_custom_pool() - initialize a protectable memory pool
+ * @pool: the pointer to the struct pmalloc_pool to initialize
+ * @refill: the minimum size to allocate when in need of more memory.
+ *          It will be rounded up to a multiple of PAGE_SIZE.
+ *          The value of 0 gives the default amount of PAGE_SIZE.
+ * @align_order: log2 of the alignment to use when allocating memory
+ *               Negative values give ARCH_KMALLOC_MINALIGN
+ * @mode: whether the data is read-only or write-rare, and whether it
+ *        should be provided already in protected mode.
+ *        The value is one of:
+ *        PMALLOC_MODE_RO, PMALLOC_MODE_WR, PMALLOC_MODE_AUTO_RO
+ *        PMALLOC_MODE_AUTO_WR, PMALLOC_MODE_START_WR
+ *
+ * Initializes an empty memory pool, for allocation of protectable
+ * memory. Memory will be allocated upon request (through pmalloc).
+ */
+void pmalloc_init_custom_pool(struct pmalloc_pool *pool, size_t refill,
+			      short align_order, uint8_t mode)
+{
+	mutex_init(&pool->mutex);
+	pool->area = NULL;
+	if (align_order < 0)
+		pool->align = ARCH_KMALLOC_MINALIGN;
+	else
+		pool->align = 1UL << align_order;
+	pool->refill = refill ? PAGE_ALIGN(refill) :
+				PMALLOC_DEFAULT_REFILL_SIZE;
+	mode &= PMALLOC_MASK;
+	if (mode & PMALLOC_START)
+		mode |= PMALLOC_WR;
+	pool->mode = mode & PMALLOC_MASK;
+	pool->offset = 0;
+	mutex_lock(&pools_mutex);
+	list_add(&pool->pool_node, &pools_list);
+	mutex_unlock(&pools_mutex);
+}
+EXPORT_SYMBOL(pmalloc_init_custom_pool);
+
+/**
+ * pmalloc_create_custom_pool() - create a new protectable memory pool
+ * @refill: the minimum size to allocate when in need of more memory.
+ *          It will be rounded up to a multiple of PAGE_SIZE.
+ *          The value of 0 gives the default amount of PAGE_SIZE.
+ * @align_order: log2 of the alignment to use when allocating memory
+ *               Negative values give ARCH_KMALLOC_MINALIGN
+ * @mode: whether the data can be altered after protection
+ *
+ * Creates a new (empty) memory pool for allocation of protectable
+ * memory. Memory will be allocated upon request (through pmalloc).
+ *
+ * Return:
+ * * pointer to the new pool	- success
+ * * NULL			- error
+ */
+struct pmalloc_pool *pmalloc_create_custom_pool(size_t refill,
+						short align_order,
+						uint8_t mode)
+{
+	struct pmalloc_pool *pool;
+
+	pool = kmalloc(sizeof(struct pmalloc_pool), GFP_KERNEL);
+	if (WARN(!pool, "Could not allocate pool metadata."))
+		return NULL;
+	pmalloc_init_custom_pool(pool, refill, align_order, mode);
+	return pool;
+}
+EXPORT_SYMBOL(pmalloc_create_custom_pool);
+
+static int grow(struct pmalloc_pool *pool, size_t min_size)
+{
+	void *addr;
+	struct vmap_area *new_area;
+	unsigned long size;
+	uint32_t tag_mask;
+
+	size = (min_size > pool->refill) ? min_size : pool->refill;
+	addr = vmalloc(size);
+	if (WARN(!addr, "Failed to allocate %zd bytes", PAGE_ALIGN(size)))
+		return -ENOMEM;
+
+	new_area = find_vmap_area((uintptr_t)addr);
+	tag_mask = VM_PMALLOC;
+	if (pool->mode & PMALLOC_WR)
+		tag_mask |= VM_PMALLOC_WR;
+	new_area->vm->flags |= (tag_mask & VM_PMALLOC_MASK);
+	new_area->pool = pool;
+	if (pool->mode & PMALLOC_START)
+		protect_area(new_area);
+	if (pool->mode & PMALLOC_AUTO && !empty(pool))
+		protect_area(pool->area);
+	/* The area size backed by pages, excluding the guard page. */
+	pool->offset = new_area->vm->nr_pages * PAGE_SIZE;
+	new_area->next = pool->area;
+	pool->area = new_area;
+	return 0;
+}
+
+/**
+ * pmalloc() - allocate protectable memory from a pool
+ * @pool: handle to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ *
+ * Allocates memory from a pool.
+ * If needed, the pool will automatically allocate enough memory to
+ * either satisfy the request or meet the "refill" parameter received
+ * upon creation.
+ * A new allocation can also happen if the memory currently in the pool
+ * is already write-protected.
+ * Allocation happens with the pool mutex held, which guarantees
+ * exclusive write access to both the pool structure and the list of
+ * vmap_areas while inside the lock.
+ *
+ * Return:
+ * * pointer to the memory requested	- success
+ * * NULL				- error
+ */
+void *pmalloc(struct pmalloc_pool *pool, size_t size)
+{
+	void *retval = NULL;
+
+	mutex_lock(&pool->mutex);
+	if (unlikely(space_needed(pool, size)) &&
+	    unlikely(grow(pool, size) != 0))
+		goto error;
+	pool->offset = round_down(pool->offset - size, pool->align);
+	retval = (void *)(pool->area->va_start + pool->offset);
+error:
+	mutex_unlock(&pool->mutex);
+	return retval;
+}
+EXPORT_SYMBOL(pmalloc);
+
+/**
+ * pmalloc_protect_pool() - write-protects the memory in the pool
+ * @pool: the pool associated to the memory to write-protect
+ *
+ * Write-protects all the memory areas currently assigned to the pool
+ * that are still unprotected.
+ * This does not prevent further allocation of additional memory, that
+ * can be initialized and protected.
+ * The catch is that protecting a pool makes whatever free memory it
+ * still contains unavailable; subsequent allocations will grab fresh
+ * pages instead.
+ */
+void pmalloc_protect_pool(struct pmalloc_pool *pool)
+{
+	struct vmap_area *area;
+
+	mutex_lock(&pool->mutex);
+	for (area = pool->area; area; area = area->next)
+		protect_area(area);
+	mutex_unlock(&pool->mutex);
+}
+EXPORT_SYMBOL(pmalloc_protect_pool);
+
+
+/**
+ * pmalloc_make_pool_ro() - forces a pool to become read-only
+ * @pool: the pool associated to the memory to make ro
+ *
+ * Drops the ability to perform controlled writes on both the pool
+ * metadata and all the vm_area structures associated with the pool.
+ * In case the pool was configured to automatically protect memory when
+ * allocating it, the configuration is dropped.
+ */
+void pmalloc_make_pool_ro(struct pmalloc_pool *pool)
+{
+	struct vmap_area *area;
+
+	mutex_lock(&pool->mutex);
+	pool->mode &= ~(PMALLOC_WR | PMALLOC_START);
+	for (area = pool->area; area; area = area->next)
+		protect_area(area);
+	mutex_unlock(&pool->mutex);
+}
+EXPORT_SYMBOL(pmalloc_make_pool_ro);
+
+/**
+ * pmalloc_destroy_pool() - destroys a pool and all the associated memory
+ * @pool: the pool to destroy
+ *
+ * All the memory associated with the pool will be freed, including the
+ * metadata used for the pool.
+ */
+void pmalloc_destroy_pool(struct pmalloc_pool *pool)
+{
+	struct vmap_area *area;
+
+	mutex_lock(&pools_mutex);
+	list_del(&pool->pool_node);
+	mutex_unlock(&pools_mutex);
+	while (pool->area) {
+		area = pool->area;
+		pool->area = area->next;
+		set_memory_rw(area->va_start, area->vm->nr_pages);
+		area->vm->flags &= ~VM_PMALLOC_MASK;
+		vfree((void *)area->va_start);
+	}
+	kfree(pool);
+}
+EXPORT_SYMBOL(pmalloc_destroy_pool);
-- 
2.17.1



* [PATCH 05/17] prmem: shorthands for write rare on common types
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (3 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 04/17] prmem: " Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-25  0:28   ` Dave Hansen
  2018-10-23 21:34 ` [PATCH 06/17] prmem: test cases for memory protection Igor Stoppa
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

Wrappers around the basic write-rare functionality, covering several
common data types found in the kernel and allowing the new values to be
specified as immediates, such as constants and defines.

Note:
The list is not complete and could be expanded.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Michal Hocko <mhocko@kernel.org>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Tatashin <pasha.tatashin@oracle.com>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
---
 MAINTAINERS                |   1 +
 include/linux/prmemextra.h | 133 +++++++++++++++++++++++++++++++++++++
 2 files changed, 134 insertions(+)
 create mode 100644 include/linux/prmemextra.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e566c5d09faf..df7221eca160 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9459,6 +9459,7 @@ M:	Igor Stoppa <igor.stoppa@gmail.com>
 L:	kernel-hardening@lists.openwall.com
 S:	Maintained
 F:	include/linux/prmem.h
+F:	include/linux/prmemextra.h
 F:	mm/prmem.c
 
 MEMORY MANAGEMENT
diff --git a/include/linux/prmemextra.h b/include/linux/prmemextra.h
new file mode 100644
index 000000000000..36995717720e
--- /dev/null
+++ b/include/linux/prmemextra.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * prmemextra.h: Shorthands for write rare of basic data types
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ *
+ */
+
+#ifndef _LINUX_PRMEMEXTRA_H
+#define _LINUX_PRMEMEXTRA_H
+
+#include <linux/prmem.h>
+
+/**
+ * wr_char - alters a char in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_char(const char *dst, const char val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+
+/**
+ * wr_short - alters a short in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_short(const short *dst, const short val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+
+/**
+ * wr_ushort - alters an unsigned short in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_ushort(const unsigned short *dst, const unsigned short val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+
+/**
+ * wr_int - alters an int in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_int(const int *dst, const int val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+
+/**
+ * wr_uint - alters an unsigned int in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_uint(const unsigned int *dst, const unsigned int val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+
+/**
+ * wr_long - alters a long in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_long(const long *dst, const long val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+
+/**
+ * wr_ulong - alters an unsigned long in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_ulong(const unsigned long *dst, const unsigned long val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+
+/**
+ * wr_longlong - alters a long long in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_longlong(const long long *dst, const long long val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+
+/**
+ * wr_ulonglong - alters an unsigned long long in write rare memory
+ * @dst: target for write
+ * @val: new value
+ *
+ * Returns true on success, false otherwise.
+ */
+static __always_inline
+bool wr_ulonglong(const unsigned long long *dst,
+			  const unsigned long long val)
+{
+	return wr_memcpy(dst, &val, sizeof(val));
+}
+
+#endif
-- 
2.17.1



* [PATCH 06/17] prmem: test cases for memory protection
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (4 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 05/17] prmem: shorthands for write rare on common types Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-24  3:27   ` Randy Dunlap
  2018-10-25 16:43   ` Dave Hansen
  2018-10-23 21:34 ` [PATCH 07/17] prmem: lkdtm tests " Igor Stoppa
                   ` (11 subsequent siblings)
  17 siblings, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

The test cases verify the various interfaces offered by both prmem.h
and prmemextra.h.

The tests avoid triggering crashes by not performing actions that would
be treated as illegal; that part is handled in the lkdtm patch.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Michal Hocko <mhocko@kernel.org>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Tatashin <pasha.tatashin@oracle.com>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
---
 MAINTAINERS          |   2 +
 mm/Kconfig.debug     |   9 +
 mm/Makefile          |   1 +
 mm/test_pmalloc.c    | 633 +++++++++++++++++++++++++++++++++++++++++++
 mm/test_write_rare.c | 236 ++++++++++++++++
 5 files changed, 881 insertions(+)
 create mode 100644 mm/test_pmalloc.c
 create mode 100644 mm/test_write_rare.c

diff --git a/MAINTAINERS b/MAINTAINERS
index df7221eca160..ea979a5a9ec9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9461,6 +9461,8 @@ S:	Maintained
 F:	include/linux/prmem.h
 F:	include/linux/prmemextra.h
 F:	mm/prmem.c
+F:	mm/test_write_rare.c
+F:	mm/test_pmalloc.c
 
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 9a7b8b049d04..57de5b3c0bae 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -94,3 +94,12 @@ config DEBUG_RODATA_TEST
     depends on STRICT_KERNEL_RWX
     ---help---
       This option enables a testcase for the setting rodata read-only.
+
+config DEBUG_PRMEM_TEST
+    tristate "Run self test for protected memory"
+    depends on STRICT_KERNEL_RWX
+    select PRMEM
+    default n
+    help
+      Tries to verify that the memory protection works correctly and that
+      the memory is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index 215c6a6d7304..93b503d4659f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -65,6 +65,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_PRMEM) += prmem.o
+obj-$(CONFIG_DEBUG_PRMEM_TEST) += test_write_rare.o test_pmalloc.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c
new file mode 100644
index 000000000000..f9ee8fb29eea
--- /dev/null
+++ b/mm/test_pmalloc.c
@@ -0,0 +1,633 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_pmalloc.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/bug.h>
+#include <linux/prmemextra.h>
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+static const char MSG_NO_POOL[] = "Cannot allocate memory for the pool.";
+static const char MSG_NO_PMEM[] = "Cannot allocate memory from the pool.";
+
+#define pr_success(test_name)	\
+	pr_info(test_name " test passed")
+
+/* --------------- tests the basic life-cycle of a pool --------------- */
+
+static bool is_address_protected(void *p)
+{
+	struct page *page;
+	struct vmap_area *area;
+
+	if (unlikely(!is_vmalloc_addr(p)))
+		return false;
+	page = vmalloc_to_page(p);
+	if (unlikely(!page))
+		return false;
+	wmb(); /* Flush changes to the page table - is it needed? */
+	area = find_vmap_area((uintptr_t)p);
+	if (unlikely((!area) || (!area->vm) ||
+		     ((area->vm->flags & VM_PMALLOC_PROTECTED_MASK) !=
+		      VM_PMALLOC_PROTECTED_MASK)))
+		return false;
+	return true;
+}
+
+static bool create_and_destroy_pool(void)
+{
+	static struct pmalloc_pool *pool;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_RO);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	pmalloc_destroy_pool(pool);
+	pr_success("pool creation and destruction");
+	return true;
+}
+
+/*  verifies that it's possible to allocate from the pool */
+static bool test_alloc(void)
+{
+	static struct pmalloc_pool *pool;
+	static void *p;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_RO);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	p = pmalloc(pool,  SIZE_1 - 1);
+	pmalloc_destroy_pool(pool);
+	if (WARN(!p, MSG_NO_PMEM))
+		return false;
+	pr_success("allocation capability");
+	return true;
+}
+
+/* ----------------------- tests self protection ----------------------- */
+
+static bool test_auto_ro(void)
+{
+	struct pmalloc_pool *pool;
+	int *first_chunk;
+	int *second_chunk;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_AUTO_RO);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	first_chunk = (int *)pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE);
+	if (WARN(!first_chunk, MSG_NO_PMEM))
+		goto error;
+	second_chunk = (int *)pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE);
+	if (WARN(!second_chunk, MSG_NO_PMEM))
+		goto error;
+	if (WARN(!is_address_protected(first_chunk),
+		 "Failed to automatically write protect exhausted vmarea"))
+		goto error;
+	pr_success("AUTO_RO");
+	retval = true;
+error:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_auto_wr(void)
+{
+	struct pmalloc_pool *pool;
+	int *first_chunk;
+	int *second_chunk;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_AUTO_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	first_chunk = (int *)pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE);
+	if (WARN(!first_chunk, MSG_NO_PMEM))
+		goto error;
+	second_chunk = (int *)pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE);
+	if (WARN(!second_chunk, MSG_NO_PMEM))
+		goto error;
+	if (WARN(!is_address_protected(first_chunk),
+		 "Failed to automatically write protect exhausted vmarea"))
+		goto error;
+	pr_success("AUTO_WR");
+	retval = true;
+error:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_start_wr(void)
+{
+	struct pmalloc_pool *pool;
+	int *chunks[2];
+	bool retval = false;
+	int i;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_START_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	for (i = 0; i < 2; i++) {
+		chunks[i] = (int *)pmalloc(pool, 1);
+		if (WARN(!chunks[i], MSG_NO_PMEM))
+			goto error;
+		if (WARN(!is_address_protected(chunks[i]),
+			 "vmarea was not protected from the start"))
+			goto error;
+	}
+	if (WARN(vmalloc_to_page(chunks[0]) != vmalloc_to_page(chunks[1]),
+		 "START_WR: mostly empty vmap area not reused"))
+		goto error;
+	pr_success("START_WR");
+	retval = true;
+error:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_self_protection(void)
+{
+	if (WARN(!(test_auto_ro() &&
+		   test_auto_wr() &&
+		   test_start_wr()),
+		 "self protection tests failed"))
+		return false;
+	pr_success("self protection");
+	return true;
+}
+
+/* ----------------- tests basic write rare functions ----------------- */
+
+#define INSERT_OFFSET (PAGE_SIZE * 3 / 2)
+#define INSERT_SIZE (PAGE_SIZE * 2)
+#define REGION_SIZE (PAGE_SIZE * 5)
+static bool test_wr_memset(void)
+{
+	struct pmalloc_pool *pool;
+	char *region;
+	unsigned int i;
+	int retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_START_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	region = pzalloc(pool, REGION_SIZE);
+	if (WARN(!region, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < REGION_SIZE; i++)
+		if (WARN(region[i], "Failed to memset wr memory"))
+			goto destroy_pool;
+	retval = !wr_memset(region + INSERT_OFFSET, 1, INSERT_SIZE);
+	if (WARN(retval, "wr_memset failed"))
+		goto destroy_pool;
+	for (i = 0; i < REGION_SIZE; i++)
+		if (i >= INSERT_OFFSET &&
+		    i < (INSERT_SIZE + INSERT_OFFSET)) {
+			if (WARN(!region[i],
+				 "Failed to alter target area"))
+				goto destroy_pool;
+		} else {
+			if (WARN(region[i] != 0,
+				 "Unexpected alteration outside region"))
+				goto destroy_pool;
+		}
+	retval = true;
+	pr_success("wr_memset");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_strdup(void)
+{
+	const char src[] = "Some text for testing pstrdup()";
+	struct pmalloc_pool *pool;
+	char *dst;
+	int retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	dst = pstrdup(pool, src);
+	if (WARN(!dst || strcmp(src, dst), "pmalloc wr strdup failed"))
+		goto destroy_pool;
+	retval = true;
+	pr_success("pmalloc wr strdup");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+/* Verify write rare across multiple pages, unaligned to PAGE_SIZE. */
+static bool test_wr_copy(void)
+{
+	struct pmalloc_pool *pool;
+	char *region;
+	char *mod;
+	unsigned int i;
+	int retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	region = pzalloc(pool, REGION_SIZE);
+	if (WARN(!region, MSG_NO_PMEM))
+		goto destroy_pool;
+	mod = vmalloc(INSERT_SIZE);
+	if (WARN(!mod, "Failed to allocate memory from vmalloc"))
+		goto destroy_pool;
+	memset(mod, 0xA5, INSERT_SIZE);
+	pmalloc_protect_pool(pool);
+	retval = !wr_memcpy(region + INSERT_OFFSET, mod, INSERT_SIZE);
+	if (WARN(retval, "wr_copy failed"))
+		goto free_mod;
+
+	for (i = 0; i < REGION_SIZE; i++)
+		if (i >= INSERT_OFFSET &&
+		    i < (INSERT_SIZE + INSERT_OFFSET)) {
+			if (WARN(region[i] != (char)0xA5,
+				 "Failed to alter target area"))
+				goto free_mod;
+		} else {
+			if (WARN(region[i] != 0,
+				 "Unexpected alteration outside region"))
+				goto free_mod;
+		}
+	retval = true;
+	pr_success("wr_copy");
+free_mod:
+	vfree(mod);
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+/* ----------------- tests specialized write actions ------------------- */
+
+#define TEST_ARRAY_SIZE 5
+#define TEST_ARRAY_TARGET (TEST_ARRAY_SIZE / 2)
+
+static bool test_wr_char(void)
+{
+	struct pmalloc_pool *pool;
+	char *array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(char) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = (char)0xA5;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_char(array + TEST_ARRAY_TARGET, (char)0x5A),
+		 "Failed to alter char variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ?
+				      (char)0x5A : (char)0xA5),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_char");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_short(void)
+{
+	struct pmalloc_pool *pool;
+	short *array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(short) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = (short)0xA5;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_short(array + TEST_ARRAY_TARGET, (short)0x5A),
+		 "Failed to alter short variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ?
+				      (short)0x5A : (short)0xA5),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_short");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_ushort(void)
+{
+	struct pmalloc_pool *pool;
+	unsigned short *array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(unsigned short) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = (unsigned short)0xA5;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_ushort(array + TEST_ARRAY_TARGET,
+				    (unsigned short)0x5A),
+		 "Failed to alter unsigned short variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ?
+				      (unsigned short)0x5A :
+				      (unsigned short)0xA5),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_ushort");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_int(void)
+{
+	struct pmalloc_pool *pool;
+	int *array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(int) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = 0xA5;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_int(array + TEST_ARRAY_TARGET, 0x5A),
+		 "Failed to alter int variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 0x5A : 0xA5),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_int");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_uint(void)
+{
+	struct pmalloc_pool *pool;
+	unsigned int *array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(unsigned int) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = 0xA5;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_uint(array + TEST_ARRAY_TARGET, 0x5A),
+		 "Failed to alter unsigned int variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 0x5A : 0xA5),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_uint");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_long(void)
+{
+	struct pmalloc_pool *pool;
+	long *array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(long) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = 0xA5;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_long(array + TEST_ARRAY_TARGET, 0x5A),
+		 "Failed to alter long variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 0x5A : 0xA5),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_long");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_ulong(void)
+{
+	struct pmalloc_pool *pool;
+	unsigned long *array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(unsigned long) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = 0xA5;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_ulong(array + TEST_ARRAY_TARGET, 0x5A),
+		 "Failed to alter unsigned long variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 0x5A : 0xA5),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_ulong");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_longlong(void)
+{
+	struct pmalloc_pool *pool;
+	long long *array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(long long) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = 0xA5;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_longlong(array + TEST_ARRAY_TARGET, 0x5A),
+		 "Failed to alter long variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 0x5A : 0xA5),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_longlong");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_ulonglong(void)
+{
+	struct pmalloc_pool *pool;
+	unsigned long long *array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(unsigned long long) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = 0xA5;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_ulonglong(array + TEST_ARRAY_TARGET, 0x5A),
+		 "Failed to alter unsigned long long variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 0x5A : 0xA5),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_ulonglong");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_wr_ptr(void)
+{
+	struct pmalloc_pool *pool;
+	int **array;
+	unsigned int i;
+	bool retval = false;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return false;
+	array = pmalloc(pool, sizeof(int *) * TEST_ARRAY_SIZE);
+	if (WARN(!array, MSG_NO_PMEM))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		array[i] = NULL;
+	pmalloc_protect_pool(pool);
+	if (WARN(!wr_ptr(array + TEST_ARRAY_TARGET, array),
+		 "Failed to alter ptr variable"))
+		goto destroy_pool;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		if (WARN(array[i] != (i == TEST_ARRAY_TARGET ?
+				      (void *)array : NULL),
+			 "Unexpected value in test array."))
+			goto destroy_pool;
+	retval = true;
+	pr_success("wr_ptr");
+destroy_pool:
+	pmalloc_destroy_pool(pool);
+	return retval;
+
+}
+
+static bool test_specialized_wrs(void)
+{
+	if (WARN(!(test_wr_char() &&
+		   test_wr_short() &&
+		   test_wr_ushort() &&
+		   test_wr_int() &&
+		   test_wr_uint() &&
+		   test_wr_long() &&
+		   test_wr_ulong() &&
+		   test_wr_longlong() &&
+		   test_wr_ulonglong() &&
+		   test_wr_ptr()),
+		 "specialized write rare failed"))
+		return false;
+	pr_success("specialized write rare");
+	return true;
+
+}
+
+/*
+ * test_pmalloc() - main entry point for running the test cases
+ */
+static int __init test_pmalloc_init_module(void)
+{
+	if (WARN(!(create_and_destroy_pool() &&
+		   test_alloc() &&
+		   test_self_protection() &&
+		   test_wr_memset() &&
+		   test_wr_strdup() &&
+		   test_wr_copy() &&
+		   test_specialized_wrs()),
+		 "protected memory allocator test failed"))
+		return -EFAULT;
+	pr_success("protected memory allocator");
+	return 0;
+}
+
+module_init(test_pmalloc_init_module);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Igor Stoppa <igor.stoppa@huawei.com>");
+MODULE_DESCRIPTION("Test module for pmalloc.");
diff --git a/mm/test_write_rare.c b/mm/test_write_rare.c
new file mode 100644
index 000000000000..e19473bb319b
--- /dev/null
+++ b/mm/test_write_rare.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_write_rare.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ *
+ * Caveat: the tests which perform modifications are run *during* init, so
+ * the memory they use could still be altered through a direct write
+ * operation. But the purpose of these tests is to confirm that the
+ * modification through remapping works correctly. This doesn't depend on
+ * the read/write status of the original mapping.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/bug.h>
+#include <linux/prmemextra.h>
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#define pr_success(test_name)	\
+	pr_info(test_name " test passed\n")
+
+static int scalar __wr_after_init = 0xA5A5;
+
+/* The section must occupy a non-zero number of whole pages */
+static bool test_alignment(void)
+{
+	size_t pstart = (size_t)&__start_wr_after_init;
+	size_t pend = (size_t)&__end_wr_after_init;
+
+	if (WARN((pstart & ~PAGE_MASK) || (pend & ~PAGE_MASK) ||
+		 (pstart >= pend), "Boundaries test failed."))
+		return false;
+	pr_success("Boundaries");
+	return true;
+}
+
+/* Alter a scalar value */
+static bool test_simple_write(void)
+{
+	int new_val = 0x5A5A;
+
+	if (WARN(!__is_wr_after_init(&scalar, sizeof(scalar)),
+		 "The __wr_after_init modifier did NOT work."))
+		return false;
+
+	if (WARN(!wr(&scalar, &new_val) || scalar != new_val,
+		 "Scalar write rare test failed"))
+		return false;
+
+	pr_success("Scalar write rare");
+	return true;
+}
+
+#define LARGE_SIZE (PAGE_SIZE * 5)
+#define CHANGE_SIZE (PAGE_SIZE * 2)
+#define CHANGE_OFFSET (PAGE_SIZE / 2)
+
+static char large[LARGE_SIZE] __wr_after_init;
+
+
+/* Alter data across multiple pages */
+static bool test_cross_page_write(void)
+{
+	unsigned int i;
+	char *src;
+	bool check;
+
+	src = vmalloc(PAGE_SIZE * 2);
+	if (WARN(!src, "could not allocate memory"))
+		return false;
+
+	for (i = 0; i < LARGE_SIZE; i++)
+		large[i] = 0xA5;
+
+	for (i = 0; i < CHANGE_SIZE; i++)
+		src[i] = 0x5A;
+
+	check = wr_memcpy(large + CHANGE_OFFSET, src, CHANGE_SIZE);
+	vfree(src);
+	if (WARN(!check, "The wr_memcpy() failed"))
+		return false;
+
+	for (i = CHANGE_OFFSET; i < CHANGE_OFFSET + CHANGE_SIZE; i++)
+		if (WARN(large[i] != 0x5A,
+			 "Cross-page write rare test failed"))
+			return false;
+
+	pr_success("Cross-page write rare");
+	return true;
+}
+
+static bool test_memsetting(void)
+{
+	unsigned int i;
+
+	wr_memset(large, 0, LARGE_SIZE);
+	for (i = 0; i < LARGE_SIZE; i++)
+		if (WARN(large[i], "Failed to reset memory"))
+			return false;
+	wr_memset(large + CHANGE_OFFSET, 1, CHANGE_SIZE);
+	for (i = 0; i < CHANGE_OFFSET; i++)
+		if (WARN(large[i], "Failed to set memory"))
+			return false;
+	for (i = CHANGE_OFFSET; i < CHANGE_OFFSET + CHANGE_SIZE; i++)
+		if (WARN(!large[i], "Failed to set memory"))
+			return false;
+	for (i = CHANGE_OFFSET + CHANGE_SIZE; i < LARGE_SIZE; i++)
+		if (WARN(large[i], "Failed to set memory"))
+			return false;
+	pr_success("Memsetting");
+	return true;
+}
+
+#define INIT_VAL 1
+#define END_VAL 4
+
+/* Various tests for the shorthands provided for standard types. */
+static char char_var __wr_after_init = INIT_VAL;
+static bool test_char(void)
+{
+	return wr_char(&char_var, END_VAL) && char_var == END_VAL;
+}
+
+static short short_var __wr_after_init = INIT_VAL;
+static bool test_short(void)
+{
+	return wr_short(&short_var, END_VAL) &&
+		short_var == END_VAL;
+}
+
+static unsigned short ushort_var __wr_after_init = INIT_VAL;
+static bool test_ushort(void)
+{
+	return wr_ushort(&ushort_var, END_VAL) &&
+		ushort_var == END_VAL;
+}
+
+static int int_var __wr_after_init = INIT_VAL;
+static bool test_int(void)
+{
+	return wr_int(&int_var, END_VAL) &&
+		int_var == END_VAL;
+}
+
+static unsigned int uint_var __wr_after_init = INIT_VAL;
+static bool test_uint(void)
+{
+	return wr_uint(&uint_var, END_VAL) &&
+		uint_var == END_VAL;
+}
+
+static long long_var __wr_after_init = INIT_VAL;
+static bool test_long(void)
+{
+	return wr_long(&long_var, END_VAL) &&
+		long_var == END_VAL;
+}
+
+static unsigned long ulong_var __wr_after_init = INIT_VAL;
+static bool test_ulong(void)
+{
+	return wr_ulong(&ulong_var, END_VAL) &&
+		ulong_var == END_VAL;
+}
+
+static long long longlong_var __wr_after_init = INIT_VAL;
+static bool test_longlong(void)
+{
+	return wr_longlong(&longlong_var, END_VAL) &&
+		longlong_var == END_VAL;
+}
+
+static unsigned long long ulonglong_var __wr_after_init = INIT_VAL;
+static bool test_ulonglong(void)
+{
+	return wr_ulonglong(&ulonglong_var, END_VAL) &&
+		ulonglong_var == END_VAL;
+}
+
+static int referred_value = INIT_VAL;
+static int *reference __wr_after_init;
+static bool test_ptr(void)
+{
+	return wr_ptr(&reference, &referred_value) &&
+		reference == &referred_value;
+}
+
+static int *rcu_ptr __wr_after_init __aligned(sizeof(void *));
+static bool test_rcu_ptr(void)
+{
+	uintptr_t addr = wr_rcu_assign_pointer(rcu_ptr, &referred_value);
+
+	return  (addr == (uintptr_t)&referred_value) &&
+		referred_value == *(int *)addr;
+}
+
+static bool test_specialized_write_rare(void)
+{
+	if (WARN(!(test_char() && test_short() &&
+		   test_ushort() && test_int() &&
+		   test_uint() && test_long() &&
+		   test_ulong() &&
+		   test_longlong() && test_ulonglong() &&
+		   test_ptr() && test_rcu_ptr()),
+		 "Specialized write rare test failed"))
+		return false;
+	pr_success("Specialized write rare");
+	return true;
+}
+
+static int __init test_static_wr_init_module(void)
+{
+	if (WARN(!(test_alignment() &&
+		   test_simple_write() &&
+		   test_cross_page_write() &&
+		   test_memsetting() &&
+		   test_specialized_write_rare()),
+		 "static rare-write test failed"))
+		return -EFAULT;
+	pr_success("static write_rare");
+	return 0;
+}
+
+module_init(test_static_wr_init_module);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Igor Stoppa <igor.stoppa@huawei.com>");
+MODULE_DESCRIPTION("Test module for static write rare.");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/17] prmem: lkdtm tests for memory protection
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (5 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 06/17] prmem: test cases for memory protection Igor Stoppa
@ 2018-10-23 21:34 ` " Igor Stoppa
  2018-10-23 21:34 ` [PATCH 08/17] prmem: struct page: track vmap_area Igor Stoppa
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Greg Kroah-Hartman, Arnd Bergmann, linux-mm, linux-kernel

Various cases meant to verify that illegal operations on protected
memory will either BUG() or WARN().

The test cases fall into two main categories:
- trying to overwrite (directly) something that is write protected
- trying to use write rare functions on something that is not write rare

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Kees Cook <keescook@chromium.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Arnd Bergmann <arnd@arndb.de>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
---
 drivers/misc/lkdtm/core.c  |  13 ++
 drivers/misc/lkdtm/lkdtm.h |  13 ++
 drivers/misc/lkdtm/perms.c | 248 +++++++++++++++++++++++++++++++++++++
 3 files changed, 274 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2154d1bfd18b..41a3ba16bc57 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -155,6 +155,19 @@ static const struct crashtype crashtypes[] = {
 	CRASHTYPE(ACCESS_USERSPACE),
 	CRASHTYPE(WRITE_RO),
 	CRASHTYPE(WRITE_RO_AFTER_INIT),
+	CRASHTYPE(WRITE_WR_AFTER_INIT),
+	CRASHTYPE(WRITE_WR_AFTER_INIT_ON_RO_AFTER_INIT),
+	CRASHTYPE(WRITE_WR_AFTER_INIT_ON_CONST),
+#ifdef CONFIG_PRMEM
+	CRASHTYPE(WRITE_RO_PMALLOC),
+	CRASHTYPE(WRITE_AUTO_RO_PMALLOC),
+	CRASHTYPE(WRITE_WR_PMALLOC),
+	CRASHTYPE(WRITE_AUTO_WR_PMALLOC),
+	CRASHTYPE(WRITE_START_WR_PMALLOC),
+	CRASHTYPE(WRITE_WR_PMALLOC_ON_RO_PMALLOC),
+	CRASHTYPE(WRITE_WR_PMALLOC_ON_CONST),
+	CRASHTYPE(WRITE_WR_PMALLOC_ON_RO_AFT_INIT),
+#endif
 	CRASHTYPE(WRITE_KERN),
 	CRASHTYPE(REFCOUNT_INC_OVERFLOW),
 	CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 9e513dcfd809..08368c4545f7 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -38,6 +38,19 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
 void __init lkdtm_perms_init(void);
 void lkdtm_WRITE_RO(void);
 void lkdtm_WRITE_RO_AFTER_INIT(void);
+void lkdtm_WRITE_WR_AFTER_INIT(void);
+void lkdtm_WRITE_WR_AFTER_INIT_ON_RO_AFTER_INIT(void);
+void lkdtm_WRITE_WR_AFTER_INIT_ON_CONST(void);
+#ifdef CONFIG_PRMEM
+void lkdtm_WRITE_RO_PMALLOC(void);
+void lkdtm_WRITE_AUTO_RO_PMALLOC(void);
+void lkdtm_WRITE_WR_PMALLOC(void);
+void lkdtm_WRITE_AUTO_WR_PMALLOC(void);
+void lkdtm_WRITE_START_WR_PMALLOC(void);
+void lkdtm_WRITE_WR_PMALLOC_ON_RO_PMALLOC(void);
+void lkdtm_WRITE_WR_PMALLOC_ON_CONST(void);
+void lkdtm_WRITE_WR_PMALLOC_ON_RO_AFT_INIT(void);
+#endif
 void lkdtm_WRITE_KERN(void);
 void lkdtm_EXEC_DATA(void);
 void lkdtm_EXEC_STACK(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 53b85c9d16b8..3c14fd4d90ac 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mman.h>
 #include <linux/uaccess.h>
+#include <linux/prmemextra.h>
 #include <asm/cacheflush.h>
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -27,6 +28,10 @@ static const unsigned long rodata = 0xAA55AA55;
 /* This is marked __ro_after_init, so it should ultimately be .rodata. */
 static unsigned long ro_after_init __ro_after_init = 0x55AA5500;
 
+/* This is marked __wr_after_init, so it lands in the write-rare section. */
+static
+unsigned long wr_after_init __wr_after_init = 0x55AA5500;
+
 /*
  * This just returns to the caller. It is designed to be copied into
  * non-executable memory regions.
@@ -104,6 +109,247 @@ void lkdtm_WRITE_RO_AFTER_INIT(void)
 	*ptr ^= 0xabcd1234;
 }
 
+void lkdtm_WRITE_WR_AFTER_INIT(void)
+{
+	unsigned long *ptr = &wr_after_init;
+
+	/*
+	 * Verify we were written to during init. Since an Oops
+	 * is considered a "success", a failure is to just skip the
+	 * real test.
+	 */
+	if ((*ptr & 0xAA) != 0xAA) {
+		pr_info("%p was NOT written during init!?\n", ptr);
+		return;
+	}
+
+	pr_info("attempting bad wr_after_init write at %p\n", ptr);
+	*ptr ^= 0xabcd1234;
+}
+
+#define INIT_VAL 0x5A
+#define END_VAL 0xA5
+
+/* Verify that write rare will not work against read-only memory. */
+static int ro_after_init_data __ro_after_init = INIT_VAL;
+void lkdtm_WRITE_WR_AFTER_INIT_ON_RO_AFTER_INIT(void)
+{
+	pr_info("attempting illegal write rare to __ro_after_init");
+	if (wr_int(&ro_after_init_data, END_VAL) ||
+	     ro_after_init_data == END_VAL)
+		pr_info("Unexpected successful write to __ro_after_init");
+}
+
+/*
+ * "volatile" to force the compiler to not optimize away the reading back.
+ * Is there a better way to do it, than using volatile?
+ */
+static volatile const int const_data = INIT_VAL;
+void lkdtm_WRITE_WR_AFTER_INIT_ON_CONST(void)
+{
+	pr_info("attempting illegal write rare to const data");
+	if (wr_int((int *)&const_data, END_VAL) || const_data == END_VAL)
+		pr_info("Unexpected successful write to const memory");
+}
+
+#ifdef CONFIG_PRMEM
+
+#define MSG_NO_POOL "Cannot allocate memory for the pool."
+#define MSG_NO_PMEM "Cannot allocate memory from the pool."
+
+void lkdtm_WRITE_RO_PMALLOC(void)
+{
+	struct pmalloc_pool *pool;
+	int *i;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_RO);
+	if (!pool) {
+		pr_info(MSG_NO_POOL);
+		return;
+	}
+	i = pmalloc(pool, sizeof(int));
+	if (!i) {
+		pr_info(MSG_NO_PMEM);
+		pmalloc_destroy_pool(pool);
+		return;
+	}
+	*i = INT_MAX;
+	pmalloc_protect_pool(pool);
+	pr_info("attempting bad pmalloc write at %p\n", i);
+	*i = 0; /* Note: this will crash and leak the pool memory. */
+}
+
+void lkdtm_WRITE_AUTO_RO_PMALLOC(void)
+{
+	struct pmalloc_pool *pool;
+	int *i;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_AUTO_RO);
+	if (!pool) {
+		pr_info(MSG_NO_POOL);
+		return;
+	}
+	i = pmalloc(pool, sizeof(int));
+	if (!i) {
+		pr_info(MSG_NO_PMEM);
+		pmalloc_destroy_pool(pool);
+		return;
+	}
+	*i = INT_MAX;
+	pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE);
+	pr_info("attempting bad pmalloc write at %p\n", i);
+	*i = 0; /* Note: this will crash and leak the pool memory. */
+}
+
+void lkdtm_WRITE_WR_PMALLOC(void)
+{
+	struct pmalloc_pool *pool;
+	int *i;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (!pool) {
+		pr_info(MSG_NO_POOL);
+		return;
+	}
+	i = pmalloc(pool, sizeof(int));
+	if (!i) {
+		pr_info(MSG_NO_PMEM);
+		pmalloc_destroy_pool(pool);
+		return;
+	}
+	*i = INT_MAX;
+	pmalloc_protect_pool(pool);
+	pr_info("attempting bad pmalloc write at %p\n", i);
+	*i = 0; /* Note: this will crash and leak the pool memory. */
+}
+
+void lkdtm_WRITE_AUTO_WR_PMALLOC(void)
+{
+	struct pmalloc_pool *pool;
+	int *i;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_AUTO_WR);
+	if (!pool) {
+		pr_info(MSG_NO_POOL);
+		return;
+	}
+	i = pmalloc(pool, sizeof(int));
+	if (!i) {
+		pr_info(MSG_NO_PMEM);
+		pmalloc_destroy_pool(pool);
+		return;
+	}
+	*i = INT_MAX;
+	pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE);
+	pr_info("attempting bad pmalloc write at %p\n", i);
+	*i = 0; /* Note: this will crash and leak the pool memory. */
+}
+
+void lkdtm_WRITE_START_WR_PMALLOC(void)
+{
+	struct pmalloc_pool *pool;
+	int *i;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_START_WR);
+	if (!pool) {
+		pr_info(MSG_NO_POOL);
+		return;
+	}
+	i = pmalloc(pool, sizeof(int));
+	if (!i) {
+		pr_info(MSG_NO_PMEM);
+		pmalloc_destroy_pool(pool);
+		return;
+	}
+	*i = INT_MAX;
+	pr_info("attempting bad pmalloc write at %p\n", i);
+	*i = 0; /* Note: this will crash and leak the pool memory. */
+}
+
+void lkdtm_WRITE_WR_PMALLOC_ON_RO_PMALLOC(void)
+{
+	struct pmalloc_pool *pool;
+	int *var_ptr;
+
+	pool = pmalloc_create_pool(PMALLOC_MODE_RO);
+	if (!pool) {
+		pr_info(MSG_NO_POOL);
+		return;
+	}
+	var_ptr = pmalloc(pool, sizeof(int));
+	if (!var_ptr) {
+		pr_info(MSG_NO_PMEM);
+		pmalloc_destroy_pool(pool);
+		return;
+	}
+	*var_ptr = INIT_VAL;
+	pmalloc_protect_pool(pool);
+	pr_info("attempting illegal write rare to R/O pool");
+	if (wr_int(var_ptr, END_VAL))
+		pr_info("Unexpected successful write to R/O pool");
+	pmalloc_destroy_pool(pool);
+}
+
+void lkdtm_WRITE_WR_PMALLOC_ON_CONST(void)
+{
+	struct pmalloc_pool *pool;
+	int *dummy;
+	bool write_result;
+
+	/*
+	 * The pool operations are only meant to simulate an attacker
+	 * using a random pool as parameter for the attack against the
+	 * const.
+	 */
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (!pool) {
+		pr_info(MSG_NO_POOL);
+		return;
+	}
+	dummy = pmalloc(pool, sizeof(*dummy));
+	if (!dummy) {
+		pr_info(MSG_NO_PMEM);
+		pmalloc_destroy_pool(pool);
+		return;
+	}
+	*dummy = 1;
+	pmalloc_protect_pool(pool);
+	pr_info("attempting illegal write rare to const data");
+	write_result = wr_int((int *)&const_data, END_VAL);
+	pmalloc_destroy_pool(pool);
+	if (write_result || const_data != INIT_VAL)
+		pr_info("Unexpected successful write to const memory");
+}
+
+void lkdtm_WRITE_WR_PMALLOC_ON_RO_AFT_INIT(void)
+{
+	struct pmalloc_pool *pool;
+	int *dummy;
+	bool write_result;
+
+	/*
+	 * The pool operations are only meant to simulate an attacker
+	 * using a random pool as parameter for the attack against the
+	 * __ro_after_init variable.
+	 */
+	pool = pmalloc_create_pool(PMALLOC_MODE_WR);
+	if (WARN(!pool, MSG_NO_POOL))
+		return;
+	dummy = pmalloc(pool, sizeof(*dummy));
+	if (WARN(!dummy, MSG_NO_PMEM)) {
+		pmalloc_destroy_pool(pool);
+		return;
+	}
+	*dummy = 1;
+	pmalloc_protect_pool(pool);
+	pr_info("attempting illegal write rare to ro_after_init");
+	write_result = wr_int(&ro_after_init_data, END_VAL);
+	pmalloc_destroy_pool(pool);
+	WARN(write_result || ro_after_init_data != INIT_VAL,
+	     "Unexpected successful write to ro_after_init memory");
+}
+#endif
+
 void lkdtm_WRITE_KERN(void)
 {
 	size_t size;
@@ -200,4 +446,6 @@ void __init lkdtm_perms_init(void)
 	/* Make sure we can write to __ro_after_init values during __init */
 	ro_after_init |= 0xAA;
 
+	/* Make sure we can write to __wr_after_init during __init */
+	wr_after_init |= 0xAA;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/17] prmem: struct page: track vmap_area
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (6 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 07/17] prmem: lkdtm tests " Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-24  3:12   ` Matthew Wilcox
  2018-10-23 21:34 ` [PATCH 09/17] prmem: hardened usercopy Igor Stoppa
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

When a page is used for virtual memory, it is often necessary to obtain
a handle to the corresponding vmap_area, which describes the virtually
contiguous area created by vmalloc.

The struct page has a "private" field, which can be reused to store a
pointer to the parent area.

Note: in practice a virtual memory area is characterized both by a
struct vmap_area and a struct vm_struct.

The reason for referring from a page to its vmap_area, rather than to
the vm_struct, is that the vmap_area contains a struct vm_struct *vm
field, which can also be used to reach the information stored in the
corresponding vm_struct. This link, however, is unidirectional: given
a reference to a vm_struct, there is no easy way to identify the
corresponding vmap_area.

Furthermore, the struct vmap_area contains a list head node which is
normally used only when it's queued for free and can be put to some
other use during normal operations.

The connection between each page and its vmap_area avoids more expensive
searches through the red-black tree of vmap_areas.

Therefore, find_vmap_area can become a static function again, while the
rest of the code relies on the direct reference from struct page.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Michal Hocko <mhocko@kernel.org>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Tatashin <pasha.tatashin@oracle.com>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
---
 include/linux/mm_types.h | 25 ++++++++++++++++++-------
 include/linux/prmem.h    | 13 ++++++++-----
 include/linux/vmalloc.h  |  1 -
 mm/prmem.c               |  2 +-
 mm/test_pmalloc.c        | 12 ++++--------
 mm/vmalloc.c             |  9 ++++++++-
 6 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..8403bdd12d1f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -87,13 +87,24 @@ struct page {
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
 			pgoff_t index;		/* Our offset within mapping. */
-			/**
-			 * @private: Mapping-private opaque data.
-			 * Usually used for buffer_heads if PagePrivate.
-			 * Used for swp_entry_t if PageSwapCache.
-			 * Indicates order in the buddy system if PageBuddy.
-			 */
-			unsigned long private;
+			union {
+				/**
+				 * @private: Mapping-private opaque data.
+				 * Usually used for buffer_heads if
+				 * PagePrivate.
+				 * Used for swp_entry_t if PageSwapCache.
+				 * Indicates order in the buddy system if
+				 * PageBuddy.
+				 */
+				unsigned long private;
+				/**
+				 * @area: reference to the containing area
+				 * For pages that are mapped into a virtually
+				 * contiguous area, avoids performing a more
+				 * expensive lookup.
+				 */
+				struct vmap_area *area;
+			};
 		};
 		struct {	/* slab, slob and slub */
 			union {
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 26fd48410d97..cf713fc1c8bb 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -54,14 +54,17 @@ static __always_inline bool __is_wr_after_init(const void *ptr, size_t size)
 
 static __always_inline bool __is_wr_pool(const void *ptr, size_t size)
 {
-	struct vmap_area *area;
+	struct vm_struct *vm;
+	struct page *page;
 
 	if (!is_vmalloc_addr(ptr))
 		return false;
-	area = find_vmap_area((unsigned long)ptr);
-	return area && area->vm && (area->vm->size >= size) &&
-		((area->vm->flags & (VM_PMALLOC | VM_PMALLOC_WR)) ==
-		 (VM_PMALLOC | VM_PMALLOC_WR));
+	page = vmalloc_to_page(ptr);
+	if (!(page && page->area && page->area->vm))
+		return false;
+	vm = page->area->vm;
+	return ((vm->size >= size) &&
+		((vm->flags & VM_PMALLOC_WR_MASK) == VM_PMALLOC_WR_MASK));
 }
 
 /**
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 4d14a3b8089e..43a444f8b1e9 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -143,7 +143,6 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
 					const void *caller);
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
-extern struct vmap_area *find_vmap_area(unsigned long addr);
 
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
 			struct page **pages);
diff --git a/mm/prmem.c b/mm/prmem.c
index 7dd13ea43304..96abf04909e7 100644
--- a/mm/prmem.c
+++ b/mm/prmem.c
@@ -150,7 +150,7 @@ static int grow(struct pmalloc_pool *pool, size_t min_size)
 	if (WARN(!addr, "Failed to allocate %zd bytes", PAGE_ALIGN(size)))
 		return -ENOMEM;
 
-	new_area = find_vmap_area((uintptr_t)addr);
+	new_area = vmalloc_to_page(addr)->area;
 	tag_mask = VM_PMALLOC;
 	if (pool->mode & PMALLOC_WR)
 		tag_mask |= VM_PMALLOC_WR;
diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c
index f9ee8fb29eea..c64872ff05ea 100644
--- a/mm/test_pmalloc.c
+++ b/mm/test_pmalloc.c
@@ -38,15 +38,11 @@ static bool is_address_protected(void *p)
 	if (unlikely(!is_vmalloc_addr(p)))
 		return false;
 	page = vmalloc_to_page(p);
-	if (unlikely(!page))
+	if (unlikely(!(page && page->area && page->area->vm)))
 		return false;
-	wmb(); /* Flush changes to the page table - is it needed? */
-	area = find_vmap_area((uintptr_t)p);
-	if (unlikely((!area) || (!area->vm) ||
-		     ((area->vm->flags & VM_PMALLOC_PROTECTED_MASK) !=
-		      VM_PMALLOC_PROTECTED_MASK)))
-		return false;
-	return true;
+	area = page->area;
+	return (area->vm->flags & VM_PMALLOC_PROTECTED_MASK) ==
+		VM_PMALLOC_PROTECTED_MASK;
 }
 
 static bool create_and_destroy_pool(void)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 15850005fea5..ffef705f0523 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -742,7 +742,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	free_vmap_area_noflush(va);
 }
 
-struct vmap_area *find_vmap_area(unsigned long addr)
+static struct vmap_area *find_vmap_area(unsigned long addr)
 {
 	struct vmap_area *va;
 
@@ -1523,6 +1523,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
 			struct page *page = area->pages[i];
 
 			BUG_ON(!page);
+			page->area = NULL;
 			__free_pages(page, 0);
 		}
 
@@ -1731,8 +1732,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			const void *caller)
 {
 	struct vm_struct *area;
+	struct vmap_area *va;
 	void *addr;
 	unsigned long real_size = size;
+	unsigned int i;
 
 	size = PAGE_ALIGN(size);
 	if (!size || (size >> PAGE_SHIFT) > totalram_pages)
@@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 	if (!addr)
 		return NULL;
 
+	va = __find_vmap_area((unsigned long)addr);
+	for (i = 0; i < va->vm->nr_pages; i++)
+		va->vm->pages[i]->area = va;
+
 	/*
 	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
 	 * flag. It means that vm_struct is not fully initialized.
-- 
2.17.1



* [PATCH 09/17] prmem: hardened usercopy
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (7 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 08/17] prmem: struct page: track vmap_area Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-29 11:45   ` Chris von Recklinghausen
  2018-10-23 21:34 ` [PATCH 10/17] prmem: documentation Igor Stoppa
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Chris von Recklinghausen, linux-mm, linux-kernel

Prevent leaks of protected memory to userspace.
Protection from overwrites originating from userspace is already
available, once the memory is write protected.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Kees Cook <keescook@chromium.org>
CC: Chris von Recklinghausen <crecklin@redhat.com>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
---
 include/linux/prmem.h | 24 ++++++++++++++++++++++++
 mm/usercopy.c         |  5 +++++
 2 files changed, 29 insertions(+)

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index cf713fc1c8bb..919d853ddc15 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -273,6 +273,30 @@ struct pmalloc_pool {
 	uint8_t mode;
 };
 
+void __noreturn usercopy_abort(const char *name, const char *detail,
+			       bool to_user, unsigned long offset,
+			       unsigned long len);
+
+/**
+ * check_pmalloc_object() - helper for hardened usercopy
+ * @ptr: the beginning of the memory to check
+ * @n: the size of the memory to check
+ * @to_user: copy to userspace or from userspace
+ *
+ * If the check is ok, it will fall-through, otherwise it will abort.
+ * The function is inlined, to minimize the performance impact of the
+ * extra check that can end up on a hot path.
+ * Non-exhaustive micro benchmarking with QEMU x86_64 shows a reduction of
+ * the time spent in this fragment by 60%, when inlined.
+ */
+static inline
+void check_pmalloc_object(const void *ptr, unsigned long n, bool to_user)
+{
+	if (unlikely(__is_wr_after_init(ptr, n) || __is_wr_pool(ptr, n)))
+		usercopy_abort("pmalloc", "accessing pmalloc obj", to_user,
+			       (const unsigned long)ptr, n);
+}
+
 /*
  * The write rare functionality is fully implemented as __always_inline,
  * to prevent having an internal function call that is capable of modifying
diff --git a/mm/usercopy.c b/mm/usercopy.c
index 852eb4e53f06..a080dd37b684 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -22,8 +22,10 @@
 #include <linux/thread_info.h>
 #include <linux/atomic.h>
 #include <linux/jump_label.h>
+#include <linux/prmem.h>
 #include <asm/sections.h>
 
+
 /*
  * Checks if a given pointer and length is contained by the current
  * stack frame (if possible).
@@ -284,6 +286,9 @@ void __check_object_size(const void *ptr, unsigned long n, bool to_user)
 
 	/* Check for object in kernel to avoid text exposure. */
 	check_kernel_text_object((const unsigned long)ptr, n, to_user);
+
+	/* Check if object is from a pmalloc chunk. */
+	check_pmalloc_object(ptr, n, to_user);
 }
 EXPORT_SYMBOL(__check_object_size);
 
-- 
2.17.1



* [PATCH 10/17] prmem: documentation
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (8 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 09/17] prmem: hardened usercopy Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-24  3:48   ` Randy Dunlap
                     ` (2 more replies)
  2018-10-23 21:34 ` [PATCH 11/17] prmem: llist: use designated initializer Igor Stoppa
                   ` (7 subsequent siblings)
  17 siblings, 3 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, linux-doc, linux-kernel

Documentation for protected memory.

Topics covered:
* static memory allocation
* dynamic memory allocation
* write-rare

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Randy Dunlap <rdunlap@infradead.org>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 Documentation/core-api/index.rst |   1 +
 Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++
 MAINTAINERS                      |   1 +
 3 files changed, 174 insertions(+)
 create mode 100644 Documentation/core-api/prmem.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 26b735cefb93..1a90fa878d8d 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
    gfp_mask-from-fs-io
    timekeeping
    boot-time-mm
+   prmem
 
 Interfaces for kernel debugging
 ===============================
diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
new file mode 100644
index 000000000000..16d7edfe327a
--- /dev/null
+++ b/Documentation/core-api/prmem.rst
@@ -0,0 +1,172 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _prmem:
+
+Memory Protection
+=================
+
+:Date: October 2018
+:Author: Igor Stoppa <igor.stoppa@huawei.com>
+
+Foreword
+--------
+- In a typical system using some sort of RAM as execution environment,
+  **all** memory is initially writable.
+
+- It must be initialized with the appropriate content, be it code or data.
+
+- Said content typically undergoes modifications, i.e. relocations or
+  relocation-induced changes.
+
+- The present document doesn't address such transient states.
+
+- Kernel code is protected at system level and, unlike data, it doesn't
+  require special attention.
+
+Protection mechanism
+--------------------
+
+- When available, the MMU can write protect memory pages that would be
+  otherwise writable.
+
+- The protection has page-level granularity.
+
+- An attempt to overwrite a protected page will trigger an exception.
+- **Write protected data must go exclusively to write protected pages**
+- **Writable data must go exclusively to writable pages**
+
+Available protections for kernel data
+-------------------------------------
+
+- **constant**
+   Labelled as **const**, the data is never supposed to be altered.
+   It is statically allocated - if it has any memory footprint at all.
+   The compiler can even optimize it away, where possible, by replacing
+   references to a **const** with its actual value.
+
+- **read only after init**
+   By tagging an otherwise ordinary statically allocated variable with
+   **__ro_after_init**, it is placed in a special segment that will
+   become write protected, at the end of the kernel init phase.
+   The compiler has no notion of this restriction and it will treat any
+   write operation on such variable as legal. However, assignments that
+   are attempted after the write protection is in place, will cause
+   exceptions.
+
+- **write rare after init**
+   This can be seen as a variant of read only after init, which uses
+   the tag **__wr_after_init**. It is also limited to statically
+   allocated memory. Variables of this type can still be altered
+   after the kernel init phase is complete, however only through
+   special functions, instead of the assignment operator. Using the
+   assignment operator after conclusion of the init phase will still
+   trigger an exception. It is not possible to transition a variable
+   from **__wr_after_init** to a permanent read-only status at
+   runtime.
+
+- **dynamically allocated write-rare / read-only**
+   After defining a pool, memory can be obtained through it, primarily
+   through the **pmalloc()** allocator. The exact writability state of the
+   memory obtained from **pmalloc()** and friends can be configured when
+   creating the pool. At any point, the memory currently associated
+   with the pool can be transitioned to a less permissive write status.
+   Once memory has become read-only, the only valid operation, besides
+   reading, is to release it, by destroying the pool it belongs to.
+
+
+Protecting dynamically allocated memory
+---------------------------------------
+
+When dealing with dynamically allocated memory, three options are
+available for configuring its writability state:
+
+- **Options selected when creating a pool**
+   When creating the pool, it is possible to choose one of the following:
+    - **PMALLOC_MODE_RO**
+       - Writability at allocation time: *WRITABLE*
+       - Writability at protection time: *NONE*
+    - **PMALLOC_MODE_WR**
+       - Writability at allocation time: *WRITABLE*
+       - Writability at protection time: *WRITE-RARE*
+    - **PMALLOC_MODE_AUTO_RO**
+       - Writability at allocation time:
+           - the latest allocation: *WRITABLE*
+           - every other allocation: *NONE*
+       - Writability at protection time: *NONE*
+    - **PMALLOC_MODE_AUTO_WR**
+       - Writability at allocation time:
+           - the latest allocation: *WRITABLE*
+           - every other allocation: *WRITE-RARE*
+       - Writability at protection time: *WRITE-RARE*
+    - **PMALLOC_MODE_START_WR**
+       - Writability at allocation time: *WRITE-RARE*
+       - Writability at protection time: *WRITE-RARE*
+
+   **Remarks:**
+    - The "AUTO" modes perform automatic protection of the content, whenever
+       the current vmap_area is used up and a new one is allocated.
+        - At that point, the vmap_area being phased out is protected.
+        - The size of the vmap_area depends on various parameters.
+        - It might not be possible to know for sure *when* certain data will
+          be protected.
+        - The functionality is provided as a tradeoff between hardening and speed.
+        - Its usefulness depends on the specific use case at hand.
+    - The "START_WR" mode is the only one providing immediate protection, at the cost of speed.
+
+- **Protecting the pool**
+   This is achieved with **pmalloc_protect_pool()**
+    - Any vmap_area currently in the pool is write-protected according to its initial configuration.
+    - Any residual space still available in the current vmap_area is lost, as the area is protected.
+    - **Protecting a pool after every allocation will likely be very wasteful.**
+    - Using PMALLOC_MODE_START_WR is likely a better choice.
+
+- **Upgrading the protection level**
+   This is achieved with **pmalloc_make_pool_ro()**
+    - It makes the present content of a write-rare pool read-only.
+    - It can be useful once the content of the memory has settled.
+
+
+Caveats
+-------
+- Freeing of memory is not supported. Pages will be returned to the
+  system upon destruction of their memory pool.
+
+- The address range available for vmalloc (and thus for pmalloc too) is
+  limited on 32-bit systems. However, this shouldn't be an issue, since
+  not much data is expected to be dynamically allocated and then
+  write protected.
+
+- On SMP systems, changing the state of pages and altering mappings
+  requires performing cross-processor synchronization of page tables.
+  This is an additional reason for limiting the use of write rare.
+
+- Not only the pmalloc memory must be protected, but also any reference to
+  it that might become the target for an attack. The attack would replace
+  a reference to the protected memory with a reference to some other,
+  unprotected, memory.
+
+- The users of rare write must take care of ensuring the atomicity of
+  the action, with respect to the way they use the data being altered;
+  for example, take a lock before making a copy of the value to modify
+  (if it's relevant), then alter it, issue the call to rare write and
+  finally release the lock. Some special scenarios might be exempt from
+  the need for locking, but in general rare write must be treated as an
+  operation that can incur races.
+
+- pmalloc relies on virtual memory areas and will therefore use more
+  TLB entries. It still does a better job of it, compared to invoking
+  vmalloc for each allocation, but it is undeniably less optimized with
+  respect to TLB use than using the physmap directly, through kmalloc or similar.
+
+
+Utilization
+-----------
+
+**add examples here**
+
+API
+---
+
+.. kernel-doc:: include/linux/prmem.h
+.. kernel-doc:: mm/prmem.c
+.. kernel-doc:: include/linux/prmemextra.h
diff --git a/MAINTAINERS b/MAINTAINERS
index ea979a5a9ec9..246b1a1cc8bb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9463,6 +9463,7 @@ F:	include/linux/prmemextra.h
 F:	mm/prmem.c
 F:	mm/test_write_rare.c
 F:	mm/test_pmalloc.c
+F:	Documentation/core-api/prmem.rst
 
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
-- 
2.17.1



* [PATCH 11/17] prmem: llist: use designated initializer
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (9 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 10/17] prmem: documentation Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-23 21:34 ` [PATCH 12/17] prmem: linked list: set alignment Igor Stoppa
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Kate Stewart, David S. Miller, Edward Cree, Philippe Ombredanne,
	Greg Kroah-Hartman, linux-kernel

Using a list_head in an unnamed union poses a problem with the current
implementation of the initializer, since it doesn't specify the names of
the fields it is initializing.

This patch makes it use designated initializers.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Kate Stewart <kstewart@linuxfoundation.org>
CC: "David S. Miller" <davem@davemloft.net>
CC: Edward Cree <ecree@solarflare.com>
CC: Philippe Ombredanne <pombredanne@nexb.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: linux-kernel@vger.kernel.org
---
 include/linux/list.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index de04cc5ed536..184a7b60436f 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -18,7 +18,10 @@
  * using the generic single-entry routines.
  */
 
-#define LIST_HEAD_INIT(name) { &(name), &(name) }
+#define LIST_HEAD_INIT(name) {	\
+	.next = &(name),	\
+	.prev = &(name),	\
+}
 
 #define LIST_HEAD(name) \
 	struct list_head name = LIST_HEAD_INIT(name)
-- 
2.17.1



* [PATCH 12/17] prmem: linked list: set alignment
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (10 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 11/17] prmem: llist: use designated initializer Igor Stoppa
@ 2018-10-23 21:34 ` Igor Stoppa
  2018-10-26  9:31   ` Peter Zijlstra
  2018-10-23 21:35 ` [PATCH 13/17] prmem: linked list: disable layout randomization Igor Stoppa
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:34 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Greg Kroah-Hartman, Andrew Morton, Masahiro Yamada,
	Alexey Dobriyan, Pekka Enberg, Paul E. McKenney, Lihao Liang,
	linux-kernel

As preparation for using write rare on the nodes of various types of
lists, specify that the fields in the basic data structures must be
aligned to sizeof(void *).

It is meant to ensure that any static allocation will not cross a page
boundary, to allow pointers to be updated in one step.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Masahiro Yamada <yamada.masahiro@socionext.com>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Pekka Enberg <penberg@kernel.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Lihao Liang <lianglihao@huawei.com>
CC: linux-kernel@vger.kernel.org
---
 include/linux/types.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/linux/types.h b/include/linux/types.h
index 9834e90aa010..53609bbdcf0f 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -183,17 +183,29 @@ typedef struct {
 } atomic64_t;
 #endif
 
+#ifdef CONFIG_PRMEM
 struct list_head {
-	struct list_head *next, *prev;
-};
+	struct list_head *next __aligned(sizeof(void *));
+	struct list_head *prev __aligned(sizeof(void *));
+} __aligned(sizeof(void *));
 
-struct hlist_head {
-	struct hlist_node *first;
+struct hlist_node {
+	struct hlist_node *next __aligned(sizeof(void *));
+	struct hlist_node **pprev __aligned(sizeof(void *));
+} __aligned(sizeof(void *));
+#else
+struct list_head {
+	struct list_head *next, *prev;
 };
 
 struct hlist_node {
 	struct hlist_node *next, **pprev;
 };
+#endif
+
+struct hlist_head {
+	struct hlist_node *first;
+};
 
 struct ustat {
 	__kernel_daddr_t	f_tfree;
-- 
2.17.1



* [PATCH 13/17] prmem: linked list: disable layout randomization
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (11 preceding siblings ...)
  2018-10-23 21:34 ` [PATCH 12/17] prmem: linked list: set alignment Igor Stoppa
@ 2018-10-23 21:35 ` Igor Stoppa
  2018-10-24 13:43   ` Alexey Dobriyan
  2018-10-26  9:32   ` Peter Zijlstra
  2018-10-23 21:35 ` [PATCH 14/17] prmem: llist, hlist, both plain and rcu Igor Stoppa
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:35 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Greg Kroah-Hartman, Andrew Morton, Masahiro Yamada,
	Alexey Dobriyan, Pekka Enberg, Paul E. McKenney, Lihao Liang,
	linux-kernel

Some of the data structures used in list management are composed of two
pointers. Since the kernel is now configured by default to randomize the
layout of data structures solely composed of pointers, this might
prevent correct type punning between these structures and their write
rare counterparts.

It shouldn't be a big loss in terms of security anyway: with only two
fields, there is a 50% chance of guessing the layout correctly.
The randomization is disabled only when write rare is enabled.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Kees Cook <keescook@chromium.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Masahiro Yamada <yamada.masahiro@socionext.com>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Pekka Enberg <penberg@kernel.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Lihao Liang <lianglihao@huawei.com>
CC: linux-kernel@vger.kernel.org
---
 include/linux/types.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/types.h b/include/linux/types.h
index 53609bbdcf0f..a9f6f6515fdc 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -187,12 +187,12 @@ typedef struct {
 struct list_head {
 	struct list_head *next __aligned(sizeof(void *));
 	struct list_head *prev __aligned(sizeof(void *));
-} __aligned(sizeof(void *));
+} __no_randomize_layout __aligned(sizeof(void *));
 
 struct hlist_node {
 	struct hlist_node *next __aligned(sizeof(void *));
 	struct hlist_node **pprev __aligned(sizeof(void *));
-} __aligned(sizeof(void *));
+} __no_randomize_layout __aligned(sizeof(void *));
 #else
 struct list_head {
 	struct list_head *next, *prev;
-- 
2.17.1



* [PATCH 14/17] prmem: llist, hlist, both plain and rcu
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (12 preceding siblings ...)
  2018-10-23 21:35 ` [PATCH 13/17] prmem: linked list: disable layout randomization Igor Stoppa
@ 2018-10-23 21:35 ` Igor Stoppa
  2018-10-24 11:37   ` Mathieu Desnoyers
  2018-10-26  9:38   ` Peter Zijlstra
  2018-10-23 21:35 ` [PATCH 15/17] prmem: test cases for prlist and prhlist Igor Stoppa
                   ` (3 subsequent siblings)
  17 siblings, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:35 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Thomas Gleixner, Kate Stewart, David S. Miller,
	Greg Kroah-Hartman, Philippe Ombredanne, Paul E. McKenney,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	linux-kernel

In some cases, all the data needing protection can be allocated from a pool
in one go, as directly writable, then initialized and protected.
The sequence is relatively short and it's acceptable to leave the entire
data set unprotected while it runs.

In other cases, this is not possible, because the data will trickle in
over a relatively long period of time, in a non-predictable way,
possibly for the entire duration of the operations.

For these cases, the safe approach is to have the memory already write
protected, when allocated. However, this will require replacing any
direct assignment with calls to functions that can perform write rare.

Since lists are one of the most commonly used data structures in the
kernel, they are the first candidate for receiving write rare
extensions.

This patch implements basic functionality for altering said lists.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Kate Stewart <kstewart@linuxfoundation.org>
CC: "David S. Miller" <davem@davemloft.net>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Philippe Ombredanne <pombredanne@nexb.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Lai Jiangshan <jiangshanlai@gmail.com>
CC: linux-kernel@vger.kernel.org
---
 MAINTAINERS            |   1 +
 include/linux/prlist.h | 934 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 935 insertions(+)
 create mode 100644 include/linux/prlist.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 246b1a1cc8bb..f5689c014e07 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9464,6 +9464,7 @@ F:	mm/prmem.c
 F:	mm/test_write_rare.c
 F:	mm/test_pmalloc.c
 F:	Documentation/core-api/prmem.rst
+F:	include/linux/prlist.h
 
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
diff --git a/include/linux/prlist.h b/include/linux/prlist.h
new file mode 100644
index 000000000000..0387c78f8be8
--- /dev/null
+++ b/include/linux/prlist.h
@@ -0,0 +1,934 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * prlist.h: Header for Protected Lists
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ *
+ * Code from <linux/list.h> and <linux/rculist.h>, adapted to perform
+ * writes on write-rare data.
+ * These functions and macros rely on data structures that allow the reuse
+ * of what is already provided for reading the content of their non-write
+ * rare variant.
+ */
+
+#ifndef _LINUX_PRLIST_H
+#define _LINUX_PRLIST_H
+
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/prmemextra.h>
+
+/* --------------- Circular Protected Doubly Linked List --------------- */
+union prlist_head {
+	struct list_head list __aligned(sizeof(void *));
+	struct {
+		union prlist_head *next __aligned(sizeof(void *));
+		union prlist_head *prev __aligned(sizeof(void *));
+	} __no_randomize_layout;
+} __aligned(sizeof(void *));
+
+static __always_inline
+union prlist_head *to_prlist_head(struct list_head *list)
+{
+	return container_of(list, union prlist_head, list);
+}
+
+#define PRLIST_HEAD_INIT(name) {	\
+	.list = LIST_HEAD_INIT(name),	\
+}
+
+#define PRLIST_HEAD(name) \
+	union prlist_head name __wr_after_init = PRLIST_HEAD_INIT(name.list)
+
+static __always_inline
+struct pmalloc_pool *prlist_create_custom_pool(size_t refill,
+					       unsigned short align_order)
+{
+	return pmalloc_create_custom_pool(refill, align_order,
+					  PMALLOC_MODE_START_WR);
+}
+
+static __always_inline struct pmalloc_pool *prlist_create_pool(void)
+{
+	return prlist_create_custom_pool(PMALLOC_REFILL_DEFAULT,
+					 PMALLOC_ALIGN_ORDER_DEFAULT);
+}
+
+static __always_inline
+void prlist_set_prev(union prlist_head *head,
+		     const union prlist_head *prev)
+{
+	wr_ptr(&head->prev, prev);
+}
+
+static __always_inline
+void prlist_set_next(union prlist_head *head,
+		     const union prlist_head *next)
+{
+	wr_ptr(&head->next, next);
+}
+
+static __always_inline void INIT_PRLIST_HEAD(union prlist_head *head)
+{
+	prlist_set_prev(head, head);
+	prlist_set_next(head, head);
+}
+
+/*
+ * Insert a new entry between two known consecutive entries.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ */
+static __always_inline
+void __prlist_add(union prlist_head *new, union prlist_head *prev,
+		  union prlist_head *next)
+{
+	if (!__list_add_valid(&new->list, &prev->list, &next->list))
+		return;
+
+	prlist_set_prev(next, new);
+	prlist_set_next(new, next);
+	prlist_set_prev(new, prev);
+	prlist_set_next(prev, new);
+}
+
+/**
+ * prlist_add - add a new entry
+ * @new: new entry to be added
+ * @head: prlist head to add it after
+ *
+ * Insert a new entry after the specified head.
+ * This is good for implementing stacks.
+ */
+static __always_inline
+void prlist_add(union prlist_head *new, union prlist_head *head)
+{
+	__prlist_add(new, head, head->next);
+}
+
+/**
+ * prlist_add_tail - add a new entry
+ * @new: new entry to be added
+ * @head: list head to add it before
+ *
+ * Insert a new entry before the specified head.
+ * This is useful for implementing queues.
+ */
+static __always_inline
+void prlist_add_tail(union prlist_head *new, union prlist_head *head)
+{
+	__prlist_add(new, head->prev, head);
+}
+
+/*
+ * Delete a prlist entry by making the prev/next entries
+ * point to each other.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ */
+static __always_inline
+void __prlist_del(union prlist_head *prev, union prlist_head *next)
+{
+	prlist_set_prev(next, prev);
+	prlist_set_next(prev, next);
+}
+
+/**
+ * prlist_del - deletes entry from list.
+ * @entry: the element to delete from the list.
+ * Note: list_empty() on entry does not return true after this, the entry is
+ * in an undefined state.
+ */
+static inline void __prlist_del_entry(union prlist_head *entry)
+{
+	if (!__list_del_entry_valid(&entry->list))
+		return;
+	__prlist_del(entry->prev, entry->next);
+}
+
+static __always_inline void prlist_del(union prlist_head *entry)
+{
+	__prlist_del_entry(entry);
+	prlist_set_next(entry, LIST_POISON1);
+	prlist_set_prev(entry, LIST_POISON2);
+}
+
+/**
+ * prlist_replace - replace old entry by new one
+ * @old : the element to be replaced
+ * @new : the new element to insert
+ *
+ * If @old was empty, it will be overwritten.
+ */
+static __always_inline
+void prlist_replace(union prlist_head *old, union prlist_head *new)
+{
+	prlist_set_next(new, old->next);
+	prlist_set_prev(new->next, new);
+	prlist_set_prev(new, old->prev);
+	prlist_set_next(new->prev, new);
+}
+
+static __always_inline
+void prlist_replace_init(union prlist_head *old, union prlist_head *new)
+{
+	prlist_replace(old, new);
+	INIT_PRLIST_HEAD(old);
+}
+
+/**
+ * prlist_del_init - deletes entry from list and reinitialize it.
+ * @entry: the element to delete from the list.
+ */
+static __always_inline void prlist_del_init(union prlist_head *entry)
+{
+	__prlist_del_entry(entry);
+	INIT_PRLIST_HEAD(entry);
+}
+
+/**
+ * prlist_move - delete from one list and add as another's head
+ * @list: the entry to move
+ * @head: the head that will precede our entry
+ */
+static __always_inline
+void prlist_move(union prlist_head *list, union prlist_head *head)
+{
+	__prlist_del_entry(list);
+	prlist_add(list, head);
+}
+
+/**
+ * prlist_move_tail - delete from one list and add as another's tail
+ * @list: the entry to move
+ * @head: the head that will follow our entry
+ */
+static __always_inline
+void prlist_move_tail(union prlist_head *list, union prlist_head *head)
+{
+	__prlist_del_entry(list);
+	prlist_add_tail(list, head);
+}
+
+/**
+ * prlist_rotate_left - rotate the list to the left
+ * @head: the head of the list
+ */
+static __always_inline void prlist_rotate_left(union prlist_head *head)
+{
+	union prlist_head *first;
+
+	if (!list_empty(&head->list)) {
+		first = head->next;
+		prlist_move_tail(first, head);
+	}
+}
+
+static __always_inline
+void __prlist_cut_position(union prlist_head *list, union prlist_head *head,
+			   union prlist_head *entry)
+{
+	union prlist_head *new_first = entry->next;
+
+	prlist_set_next(list, head->next);
+	prlist_set_prev(list->next, list);
+	prlist_set_prev(list, entry);
+	prlist_set_next(entry, list);
+	prlist_set_next(head, new_first);
+	prlist_set_prev(new_first, head);
+}
+
+/**
+ * prlist_cut_position - cut a list into two
+ * @list: a new list to add all removed entries
+ * @head: a list with entries
+ * @entry: an entry within head, could be the head itself
+ *	and if so we won't cut the list
+ *
+ * This helper moves the initial part of @head, up to and
+ * including @entry, from @head to @list. You should
+ * pass on @entry an element you know is on @head. @list
+ * should be an empty list or a list you do not care about
+ * losing its data.
+ *
+ */
+static __always_inline
+void prlist_cut_position(union prlist_head *list, union prlist_head *head,
+			 union prlist_head *entry)
+{
+	if (list_empty(&head->list))
+		return;
+	if (list_is_singular(&head->list) &&
+		(head->next != entry && head != entry))
+		return;
+	if (entry == head)
+		INIT_PRLIST_HEAD(list);
+	else
+		__prlist_cut_position(list, head, entry);
+}
+
+/**
+ * prlist_cut_before - cut a list into two, before given entry
+ * @list: a new list to add all removed entries
+ * @head: a list with entries
+ * @entry: an entry within head, could be the head itself
+ *
+ * This helper moves the initial part of @head, up to but
+ * excluding @entry, from @head to @list.  You should pass
+ * in @entry an element you know is on @head.  @list should
+ * be an empty list or a list you do not care about losing
+ * its data.
+ * If @entry == @head, all entries on @head are moved to
+ * @list.
+ */
+static __always_inline
+void prlist_cut_before(union prlist_head *list, union prlist_head *head,
+		       union prlist_head *entry)
+{
+	if (head->next == entry) {
+		INIT_PRLIST_HEAD(list);
+		return;
+	}
+	prlist_set_next(list, head->next);
+	prlist_set_prev(list->next, list);
+	prlist_set_prev(list, entry->prev);
+	prlist_set_next(list->prev, list);
+	prlist_set_next(head, entry);
+	prlist_set_prev(entry, head);
+}
+
+static __always_inline
+void __prlist_splice(const union prlist_head *list, union prlist_head *prev,
+				 union prlist_head *next)
+{
+	union prlist_head *first = list->next;
+	union prlist_head *last = list->prev;
+
+	prlist_set_prev(first, prev);
+	prlist_set_next(prev, first);
+	prlist_set_next(last, next);
+	prlist_set_prev(next, last);
+}
+
+/**
+ * prlist_splice - join two lists, this is designed for stacks
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ */
+static __always_inline
+void prlist_splice(const union prlist_head *list, union prlist_head *head)
+{
+	if (!list_empty(&list->list))
+		__prlist_splice(list, head, head->next);
+}
+
+/**
+ * prlist_splice_tail - join two lists, each list being a queue
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ */
+static __always_inline
+void prlist_splice_tail(union prlist_head *list, union prlist_head *head)
+{
+	if (!list_empty(&list->list))
+		__prlist_splice(list, head->prev, head);
+}
+
+/**
+ * prlist_splice_init - join two lists and reinitialise the emptied list.
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ *
+ * The list at @list is reinitialised
+ */
+static __always_inline
+void prlist_splice_init(union prlist_head *list, union prlist_head *head)
+{
+	if (!list_empty(&list->list)) {
+		__prlist_splice(list, head, head->next);
+		INIT_PRLIST_HEAD(list);
+	}
+}
+
+/**
+ * prlist_splice_tail_init - join 2 lists and reinitialise the emptied list
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ *
+ * Each of the lists is a queue.
+ * The list at @list is reinitialised
+ */
+static __always_inline
+void prlist_splice_tail_init(union prlist_head *list,
+			     union prlist_head *head)
+{
+	if (!list_empty(&list->list)) {
+		__prlist_splice(list, head->prev, head);
+		INIT_PRLIST_HEAD(list);
+	}
+}
+
+/* ---- Protected Doubly Linked List with single pointer list head ---- */
+union prhlist_head {
+	struct hlist_head head __aligned(sizeof(void *));
+	union prhlist_node *first __aligned(sizeof(void *));
+} __aligned(sizeof(void *));
+
+union prhlist_node {
+	struct hlist_node node __aligned(sizeof(void *));
+	struct {
+		union prhlist_node *next __aligned(sizeof(void *));
+		union prhlist_node **pprev __aligned(sizeof(void *));
+	} __no_randomize_layout;
+} __aligned(sizeof(void *));
+
+#define PRHLIST_HEAD_INIT	{	\
+	.head = HLIST_HEAD_INIT,	\
+}
+
+#define PRHLIST_HEAD(name) \
+	union prhlist_head name __wr_after_init = PRHLIST_HEAD_INIT
+
+
+#define is_static(object) \
+	unlikely(wr_check_boundaries(object, sizeof(*object)))
+
+static __always_inline
+struct pmalloc_pool *prhlist_create_custom_pool(size_t refill,
+						unsigned short align_order)
+{
+	return pmalloc_create_custom_pool(refill, align_order,
+					  PMALLOC_MODE_AUTO_WR);
+}
+
+static __always_inline
+struct pmalloc_pool *prhlist_create_pool(void)
+{
+	return prhlist_create_custom_pool(PMALLOC_REFILL_DEFAULT,
+					  PMALLOC_ALIGN_ORDER_DEFAULT);
+}
+
+static __always_inline
+void prhlist_set_first(union prhlist_head *head, union prhlist_node *first)
+{
+	wr_ptr(&head->first, first);
+}
+
+static __always_inline
+void prhlist_set_next(union prhlist_node *node, union prhlist_node *next)
+{
+	wr_ptr(&node->next, next);
+}
+
+static __always_inline
+void prhlist_set_pprev(union prhlist_node *node, union prhlist_node **pprev)
+{
+	wr_ptr(&node->pprev, pprev);
+}
+
+static __always_inline
+void prhlist_set_prev(union prhlist_node *node, union prhlist_node *prev)
+{
+	wr_ptr(node->pprev, prev);
+}
+
+static __always_inline void INIT_PRHLIST_HEAD(union prhlist_head *head)
+{
+	prhlist_set_first(head, NULL);
+}
+
+static __always_inline void INIT_PRHLIST_NODE(union prhlist_node *node)
+{
+	prhlist_set_next(node, NULL);
+	prhlist_set_pprev(node, NULL);
+}
+
+static __always_inline void __prhlist_del(union prhlist_node *n)
+{
+	union prhlist_node *next = n->next;
+	union prhlist_node **pprev = n->pprev;
+
+	wr_ptr(pprev, next);
+	if (next)
+		prhlist_set_pprev(next, pprev);
+}
+
+static __always_inline void prhlist_del(union prhlist_node *n)
+{
+	__prhlist_del(n);
+	prhlist_set_next(n, LIST_POISON1);
+	prhlist_set_pprev(n, LIST_POISON2);
+}
+
+static __always_inline void prhlist_del_init(union prhlist_node *n)
+{
+	if (!hlist_unhashed(&n->node)) {
+		__prhlist_del(n);
+		INIT_PRHLIST_NODE(n);
+	}
+}
+
+static __always_inline
+void prhlist_add_head(union prhlist_node *n, union prhlist_head *h)
+{
+	union prhlist_node *first = h->first;
+
+	prhlist_set_next(n, first);
+	if (first)
+		prhlist_set_pprev(first, &n->next);
+	prhlist_set_first(h, n);
+	prhlist_set_pprev(n, &h->first);
+}
+
+/* next must be != NULL */
+static __always_inline
+void prhlist_add_before(union prhlist_node *n, union prhlist_node *next)
+{
+	prhlist_set_pprev(n, next->pprev);
+	prhlist_set_next(n, next);
+	prhlist_set_pprev(next, &n->next);
+	prhlist_set_prev(n, n);
+}
+
+static __always_inline
+void prhlist_add_behind(union prhlist_node *n, union prhlist_node *prev)
+{
+	prhlist_set_next(n, prev->next);
+	prhlist_set_next(prev, n);
+	prhlist_set_pprev(n, &prev->next);
+	if (n->next)
+		prhlist_set_pprev(n->next, &n->next);
+}
+
+/* after that we'll appear to be on some hlist and hlist_del will work */
+static __always_inline void prhlist_add_fake(union prhlist_node *n)
+{
+	prhlist_set_pprev(n, &n->next);
+}
+
+/*
+ * Move a list from one list head to another. Fixup the pprev
+ * reference of the first entry if it exists.
+ */
+static __always_inline
+void prhlist_move_list(union prhlist_head *old, union prhlist_head *new)
+{
+	prhlist_set_first(new, old->first);
+	if (new->first)
+		prhlist_set_pprev(new->first, &new->first);
+	prhlist_set_first(old, NULL);
+}
+
+/* ------------------------ RCU list and hlist ------------------------ */
+
+/*
+ * INIT_PRLIST_HEAD_RCU - Initialize a prlist_head visible to RCU readers
+ * @head: prlist to be initialized
+ *
+ * It is exactly equivalent to INIT_PRLIST_HEAD()
+ */
+static __always_inline void INIT_PRLIST_HEAD_RCU(union prlist_head *head)
+{
+	INIT_PRLIST_HEAD(head);
+}
+
+/*
+ * Insert a new entry between two known consecutive entries.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ */
+static __always_inline
+void __prlist_add_rcu(union prlist_head *new, union prlist_head *prev,
+		      union prlist_head *next)
+{
+	if (!__list_add_valid(&new->list, &prev->list, &next->list))
+		return;
+	prlist_set_next(new, next);
+	prlist_set_prev(new, prev);
+	wr_rcu_assign_pointer(list_next_rcu(&prev->list), new);
+	prlist_set_prev(next, new);
+}
+
+/**
+ * prlist_add_rcu - add a new entry to rcu-protected prlist
+ * @new: new entry to be added
+ * @head: prlist head to add it after
+ *
+ * Insert a new entry after the specified head.
+ * This is good for implementing stacks.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another prlist-mutation primitive, such as prlist_add_rcu()
+ * or prlist_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * list_for_each_entry_rcu().
+ */
+static __always_inline
+void prlist_add_rcu(union prlist_head *new, union prlist_head *head)
+{
+	__prlist_add_rcu(new, head, head->next);
+}
+
+/**
+ * prlist_add_tail_rcu - add a new entry to rcu-protected prlist
+ * @new: new entry to be added
+ * @head: prlist head to add it before
+ *
+ * Insert a new entry before the specified head.
+ * This is useful for implementing queues.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another prlist-mutation primitive, such as prlist_add_tail_rcu()
+ * or prlist_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * list_for_each_entry_rcu().
+ */
+static __always_inline
+void prlist_add_tail_rcu(union prlist_head *new, union prlist_head *head)
+{
+	__prlist_add_rcu(new, head->prev, head);
+}
+
+/**
+ * prlist_del_rcu - deletes entry from prlist without re-initialization
+ * @entry: the element to delete from the prlist.
+ *
+ * Note: list_empty() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU-based
+ * lockfree traversal.
+ *
+ * In particular, it means that we cannot poison the forward
+ * pointers that may still be used for walking the list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as prlist_del_rcu()
+ * or prlist_add_rcu(), running on this same prlist.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * list_for_each_entry_rcu().
+ *
+ * Note that the caller is not permitted to immediately free
+ * the newly deleted entry.  Instead, either synchronize_rcu()
+ * or call_rcu() must be used to defer freeing until an RCU
+ * grace period has elapsed.
+ */
+static __always_inline void prlist_del_rcu(union prlist_head *entry)
+{
+	__prlist_del_entry(entry);
+	prlist_set_prev(entry, LIST_POISON2);
+}
+
+/**
+ * prhlist_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_unhashed() on the node returns true after this. It is
+ * useful for RCU-based lockfree read traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we cannot poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so hlist_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_add_head_rcu() or
+ * hlist_del_rcu(), running on this same list.  However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_for_each_entry_rcu().
+ */
+static __always_inline void prhlist_del_init_rcu(union prhlist_node *n)
+{
+	if (!hlist_unhashed(&n->node)) {
+		__prhlist_del(n);
+		prhlist_set_pprev(n, NULL);
+	}
+}
+
+/**
+ * prlist_replace_rcu - replace old entry by new one
+ * @old : the element to be replaced
+ * @new : the new element to insert
+ *
+ * The @old entry will be replaced with the @new entry atomically.
+ * Note: @old should not be empty.
+ */
+static __always_inline
+void prlist_replace_rcu(union prlist_head *old, union prlist_head *new)
+{
+	prlist_set_next(new, old->next);
+	prlist_set_prev(new, old->prev);
+	wr_rcu_assign_pointer(list_next_rcu(&new->prev->list), new);
+	prlist_set_prev(new->next, new);
+	prlist_set_prev(old, LIST_POISON2);
+}
+
+/**
+ * __prlist_splice_init_rcu - join an RCU-protected list into an existing list.
+ * @list:	the RCU-protected list to splice
+ * @prev:	points to the last element of the existing list
+ * @next:	points to the first element of the existing list
+ * @sync:	function to sync: synchronize_rcu(), synchronize_sched(), ...
+ *
+ * The list pointed to by @prev and @next can be RCU-read traversed
+ * concurrently with this function.
+ *
+ * Note that this function blocks.
+ *
+ * Important note: the caller must take whatever action is necessary to prevent
+ * any other updates to the existing list.  In principle, it is possible to
+ * modify the list as soon as sync() begins execution. If this sort of thing
+ * becomes necessary, an alternative version based on call_rcu() could be
+ * created.  But only if -really- needed -- there is no shortage of RCU API
+ * members.
+ */
+static __always_inline
+void __prlist_splice_init_rcu(union prlist_head *list,
+			      union prlist_head *prev,
+			      union prlist_head *next, void (*sync)(void))
+{
+	union prlist_head *first = list->next;
+	union prlist_head *last = list->prev;
+
+	/*
+	 * "first" and "last" track the source list, so initialize them.
+	 * RCU readers have access to this list, so we must use
+	 * INIT_PRLIST_HEAD_RCU() instead of INIT_PRLIST_HEAD().
+	 */
+
+	INIT_PRLIST_HEAD_RCU(list);
+
+	/*
+	 * At this point, the list body still points to the source list.
+	 * Wait for any readers to finish using the list before splicing
+	 * the list body into the new list.  Any new readers will see
+	 * an empty list.
+	 */
+
+	sync();
+
+	/*
+	 * Readers are finished with the source list, so perform splice.
+	 * The order is important if the new list is global and accessible
+	 * to concurrent RCU readers.  Note that RCU readers are not
+	 * permitted to traverse the prev pointers without excluding
+	 * this function.
+	 */
+
+	prlist_set_next(last, next);
+	wr_rcu_assign_pointer(list_next_rcu(&prev->list), first);
+	prlist_set_prev(first, prev);
+	prlist_set_prev(next, last);
+}
+
+/**
+ * prlist_splice_init_rcu - splice an RCU-protected list into an existing
+ *                          list, designed for stacks.
+ * @list:	the RCU-protected list to splice
+ * @head:	the place in the existing list to splice the first list into
+ * @sync:	function to sync: synchronize_rcu(), synchronize_sched(), ...
+ */
+static __always_inline
+void prlist_splice_init_rcu(union prlist_head *list,
+			    union prlist_head *head,
+			    void (*sync)(void))
+{
+	if (!list_empty(&list->list))
+		__prlist_splice_init_rcu(list, head, head->next, sync);
+}
+
+/**
+ * prlist_splice_tail_init_rcu - splice an RCU-protected list into an
+ *                               existing list, designed for queues.
+ * @list:	the RCU-protected list to splice
+ * @head:	the place in the existing list to splice the first list into
+ * @sync:	function to sync: synchronize_rcu(), synchronize_sched(), ...
+ */
+static __always_inline
+void prlist_splice_tail_init_rcu(union prlist_head *list,
+				 union prlist_head *head,
+				 void (*sync)(void))
+{
+	if (!list_empty(&list->list))
+		__prlist_splice_init_rcu(list, head->prev, head, sync);
+}
+
+/**
+ * prhlist_del_rcu - deletes entry from hash list without re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_unhashed() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU-based
+ * lockfree traversal.
+ *
+ * In particular, it means that we cannot poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_add_head_rcu()
+ * or hlist_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_for_each_entry().
+ */
+static __always_inline void prhlist_del_rcu(union prhlist_node *n)
+{
+	__prhlist_del(n);
+	prhlist_set_pprev(n, LIST_POISON2);
+}
+
+/**
+ * prhlist_replace_rcu - replace old entry by new one
+ * @old : the element to be replaced
+ * @new : the new element to insert
+ *
+ * The @old entry will be replaced with the @new entry atomically.
+ */
+static __always_inline
+void prhlist_replace_rcu(union prhlist_node *old, union prhlist_node *new)
+{
+	union prhlist_node *next = old->next;
+
+	prhlist_set_next(new, next);
+	prhlist_set_pprev(new, old->pprev);
+	wr_rcu_assign_pointer(*(union prhlist_node __rcu **)new->pprev, new);
+	if (next)
+		prhlist_set_pprev(new->next, &new->next);
+	prhlist_set_pprev(old, LIST_POISON2);
+}
+
+/**
+ * prhlist_add_head_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_add_head_rcu()
+ * or hlist_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.  Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static __always_inline
+void prhlist_add_head_rcu(union prhlist_node *n, union prhlist_head *h)
+{
+	union prhlist_node *first = h->first;
+
+	prhlist_set_next(n, first);
+	prhlist_set_pprev(n, &h->first);
+	wr_rcu_assign_pointer(hlist_first_rcu(&h->head), n);
+	if (first)
+		prhlist_set_pprev(first, &n->next);
+}
+
+/**
+ * prhlist_add_tail_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as prhlist_add_head_rcu()
+ * or prhlist_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.  Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static __always_inline
+void prhlist_add_tail_rcu(union prhlist_node *n, union prhlist_head *h)
+{
+	union prhlist_node *i, *last = NULL;
+
+	/* Note: write side code, so rcu accessors are not needed. */
+	for (i = h->first; i; i = i->next)
+		last = i;
+
+	if (last) {
+		prhlist_set_next(n, last->next);
+		prhlist_set_pprev(n, &last->next);
+		wr_rcu_assign_pointer(hlist_next_rcu(&last->node), n);
+	} else {
+		prhlist_add_head_rcu(n, h);
+	}
+}
+
+/**
+ * prhlist_add_before_rcu
+ * @n: the new element to add to the hash list.
+ * @next: the existing element to add the new element before.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist
+ * before the specified node while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as prhlist_add_head_rcu()
+ * or prhlist_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.
+ */
+static __always_inline
+void prhlist_add_before_rcu(union prhlist_node *n, union prhlist_node *next)
+{
+	prhlist_set_pprev(n, next->pprev);
+	prhlist_set_next(n, next);
+	wr_rcu_assign_pointer(hlist_pprev_rcu(&n->node), n);
+	prhlist_set_pprev(next, &n->next);
+}
+
+/**
+ * prhlist_add_behind_rcu
+ * @n: the new element to add to the hash list.
+ * @prev: the existing element to add the new element after.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist
+ * after the specified node while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as prhlist_add_head_rcu()
+ * or prhlist_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.
+ */
+static __always_inline
+void prhlist_add_behind_rcu(union prhlist_node *n, union prhlist_node *prev)
+{
+	prhlist_set_next(n, prev->next);
+	prhlist_set_pprev(n, &prev->next);
+	wr_rcu_assign_pointer(hlist_next_rcu(&prev->node), n);
+	if (n->next)
+		prhlist_set_pprev(n->next, &n->next);
+}
+#endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/17] prmem: test cases for prlist and prhlist
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (13 preceding siblings ...)
  2018-10-23 21:35 ` [PATCH 14/17] prmem: llist, hlist, both plain and rcu Igor Stoppa
@ 2018-10-23 21:35 ` Igor Stoppa
  2018-10-23 21:35 ` [PATCH 16/17] prmem: pratomic-long Igor Stoppa
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:35 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Kate Stewart, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Edward Cree, linux-kernel

These test cases focus on the basic operations required to work with
both prlist and prhlist data, in particular creating, growing,
shrinking and destroying lists.

They can also serve as a reference for practical use of write-rare lists.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Kate Stewart <kstewart@linuxfoundation.org>
CC: Philippe Ombredanne <pombredanne@nexb.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Edward Cree <ecree@solarflare.com>
CC: linux-kernel@vger.kernel.org
---
 MAINTAINERS       |   1 +
 lib/Kconfig.debug |   9 ++
 lib/Makefile      |   1 +
 lib/test_prlist.c | 252 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 263 insertions(+)
 create mode 100644 lib/test_prlist.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f5689c014e07..e7f7cb1682a6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9465,6 +9465,7 @@ F:	mm/test_write_rare.c
 F:	mm/test_pmalloc.c
 F:	Documentation/core-api/prmem.rst
 F:	include/linux/prlist.h
+F:	lib/test_prlist.c
 
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 4966c4fbe7f7..40039992f05f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2034,6 +2034,15 @@ config IO_STRICT_DEVMEM
 
 	  If in doubt, say Y.
 
+config DEBUG_PRLIST_TEST
+	bool "Testcase for Protected Linked List"
+	depends on STRICT_KERNEL_RWX && PRMEM
+	help
+	  This option enables the testing of the linked list implementation
+	  based on write-rare memory.
+	  The test cases can also be used as examples of how to use the
+	  prlist data structure(s).
+
 source "arch/$(SRCARCH)/Kconfig.debug"
 
 endmenu # Kernel hacking
diff --git a/lib/Makefile b/lib/Makefile
index 423876446810..fe7200e84c5f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -270,3 +270,4 @@ obj-$(CONFIG_GENERIC_LIB_LSHRDI3) += lshrdi3.o
 obj-$(CONFIG_GENERIC_LIB_MULDI3) += muldi3.o
 obj-$(CONFIG_GENERIC_LIB_CMPDI2) += cmpdi2.o
 obj-$(CONFIG_GENERIC_LIB_UCMPDI2) += ucmpdi2.o
+obj-$(CONFIG_DEBUG_PRLIST_TEST) += test_prlist.o
diff --git a/lib/test_prlist.c b/lib/test_prlist.c
new file mode 100644
index 000000000000..8ee46795d72a
--- /dev/null
+++ b/lib/test_prlist.c
@@ -0,0 +1,252 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_prlist.c: Test cases for protected doubly linked list
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/bug.h>
+#include <linux/prlist.h>
+
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+static struct pmalloc_pool *pool;
+
+static PRLIST_HEAD(test_prlist_head);
+
+/* ---------------------- prlist test functions ---------------------- */
+static bool test_init_prlist_head(void)
+{
+	if (WARN(test_prlist_head.prev != &test_prlist_head ||
+		 test_prlist_head.next != &test_prlist_head,
+		 "static initialization of static prlist_head failed"))
+		return false;
+	wr_ptr(&test_prlist_head.next, NULL);
+	wr_ptr(&test_prlist_head.prev, NULL);
+	if (WARN(test_prlist_head.prev || test_prlist_head.next,
+		 "resetting of static prlist_head failed"))
+		return false;
+	INIT_PRLIST_HEAD(&test_prlist_head);
+	if (WARN(test_prlist_head.prev != &test_prlist_head ||
+		 test_prlist_head.next != &test_prlist_head,
+		 "initialization of static prlist_head failed"))
+		return false;
+	pr_info("initialization of static prlist_head passed");
+	return true;
+}
+
+struct prlist_data {
+	int d_int;
+	union prlist_head node;
+	unsigned long long d_ulonglong;
+};
+
+
+#define LIST_INTERVAL 5
+#define LIST_INTERVALS 3
+#define LIST_NODES (LIST_INTERVALS * LIST_INTERVAL)
+static bool test_build_prlist(void)
+{
+	short i;
+	struct prlist_data *data;
+	int delta;
+
+	pool = prlist_create_pool();
+	if (WARN(!pool, "could not create pool"))
+		return false;
+
+	for (i = 0; i < LIST_NODES; i++) {
+		data = (struct prlist_data *)pmalloc(pool, sizeof(*data));
+		if (WARN(!data, "Failed to allocate prlist node"))
+			goto out;
+		wr_int(&data->d_int, i);
+		wr_ulonglong(&data->d_ulonglong, i);
+		prlist_add_tail(&data->node, &test_prlist_head);
+	}
+	for (i = 1; i < LIST_NODES; i++) {
+		data = (struct prlist_data *)pmalloc(pool, sizeof(*data));
+		if (WARN(!data, "Failed to allocate prlist node"))
+			goto out;
+		wr_int(&data->d_int, i);
+		wr_ulonglong(&data->d_ulonglong, i);
+		prlist_add(&data->node, &test_prlist_head);
+	}
+	i = LIST_NODES;
+	delta = -1;
+	list_for_each_entry(data, &test_prlist_head, node) {
+		i += delta;
+		if (!i)
+			delta = 1;
+		if (WARN(data->d_int != i || data->d_ulonglong != i,
+			 "unexpected value in prlist, build test failed"))
+			goto out;
+	}
+	pr_info("build prlist test passed");
+	return true;
+out:
+	pmalloc_destroy_pool(pool);
+	return false;
+}
+
+static bool test_teardown_prlist(void)
+{
+	short i;
+
+	for (i = 0; !list_empty(&test_prlist_head.list); i++)
+		prlist_del(test_prlist_head.next);
+	if (WARN(i != LIST_NODES * 2 - 1, "teardown prlist test failed"))
+		return false;
+	pmalloc_destroy_pool(pool);
+	pr_info("teardown prlist test passed");
+	return true;
+}
+
+static bool test_prlist(void)
+{
+	if (WARN(!(test_init_prlist_head() &&
+		   test_build_prlist() &&
+		   test_teardown_prlist()),
+		 "prlist test failed"))
+		return false;
+	pr_info("prlist test passed");
+	return true;
+}
+
+/* ---------------------- prhlist test functions ---------------------- */
+static PRHLIST_HEAD(test_prhlist_head);
+
+static bool test_init_prhlist_head(void)
+{
+	if (WARN(test_prhlist_head.first,
+		 "static initialization of static prhlist_head failed"))
+		return false;
+	wr_ptr(&test_prhlist_head.first, (void *)-1);
+	if (WARN(!test_prhlist_head.first,
+		 "resetting of static prhlist_head failed"))
+		return false;
+	INIT_PRHLIST_HEAD(&test_prhlist_head);
+	if (WARN(test_prhlist_head.first,
+		 "initialization of static prhlist_head failed"))
+		return false;
+	pr_info("initialization of static prhlist_head passed");
+	return true;
+}
+
+struct prhlist_data {
+	int d_int;
+	union prhlist_node node;
+	unsigned long long d_ulonglong;
+};
+
+static bool test_build_prhlist(void)
+{
+	short i;
+	struct prhlist_data *data;
+	union prhlist_node *anchor;
+
+	pool = prhlist_create_pool();
+	if (WARN(!pool, "could not create pool"))
+		return false;
+
+	for (i = 2 * LIST_INTERVAL - 1; i >= LIST_INTERVAL; i--) {
+		data = (struct prhlist_data *)pmalloc(pool, sizeof(*data));
+		if (WARN(!data, "Failed to allocate prhlist node"))
+			goto out;
+		wr_int(&data->d_int, i);
+		wr_ulonglong(&data->d_ulonglong, i);
+		prhlist_add_head(&data->node, &test_prhlist_head);
+	}
+	anchor = test_prhlist_head.first;
+	for (i = 0; i < LIST_INTERVAL; i++) {
+		data = (struct prhlist_data *)pmalloc(pool, sizeof(*data));
+		if (WARN(!data, "Failed to allocate prhlist node"))
+			goto out;
+		wr_int(&data->d_int, i);
+		wr_ulonglong(&data->d_ulonglong, i);
+		prhlist_add_before(&data->node, anchor);
+	}
+	hlist_for_each_entry(data, &test_prhlist_head, node)
+		if (!data->node.next)
+			anchor = &data->node;
+	for (i = 3 * LIST_INTERVAL - 1; i >= 2 * LIST_INTERVAL; i--) {
+		data = (struct prhlist_data *)pmalloc(pool, sizeof(*data));
+		if (WARN(!data, "Failed to allocate prhlist node"))
+			goto out;
+		wr_int(&data->d_int, i);
+		wr_ulonglong(&data->d_ulonglong, i);
+		prhlist_add_behind(&data->node, anchor);
+	}
+	i = 0;
+	hlist_for_each_entry(data, &test_prhlist_head, node) {
+		if (WARN(data->d_int != i || data->d_ulonglong != i,
+			 "unexpected value in prhlist, build test failed"))
+			goto out;
+		i++;
+	}
+	if (WARN(i != LIST_NODES,
+		 "wrong number of nodes: %d, expected %d", i, LIST_NODES))
+		goto out;
+	pr_info("build prhlist test passed");
+	return true;
+out:
+	pmalloc_destroy_pool(pool);
+	return false;
+}
+
+static bool test_teardown_prhlist(void)
+{
+	union prhlist_node **pnode;
+	bool retval = false;
+
+	for (pnode = &test_prhlist_head.first->next; *pnode;) {
+		if (WARN(*(*pnode)->pprev != *pnode,
+			 "inconsistent pprev value, delete test failed"))
+			goto err;
+		prhlist_del(*pnode);
+	}
+	prhlist_del(test_prhlist_head.first);
+	if (WARN(!hlist_empty(&test_prhlist_head.head),
+		 "prhlist is not empty, delete test failed"))
+		goto err;
+	pr_info("deletion of prhlist passed");
+	retval = true;
+err:
+	pmalloc_destroy_pool(pool);
+	return retval;
+}
+
+static bool test_prhlist(void)
+{
+	if (WARN(!(test_init_prhlist_head() &&
+		   test_build_prhlist() &&
+		   test_teardown_prhlist()),
+		 "prhlist test failed"))
+		return false;
+	pr_info("prhlist test passed");
+	return true;
+}
+
+static int __init test_prlists_init_module(void)
+{
+	if (WARN(!(test_prlist() &&
+		   test_prhlist()),
+		 "protected lists test failed"))
+		return -EFAULT;
+	pr_info("protected lists test passed");
+	return 0;
+}
+
+module_init(test_prlists_init_module);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Igor Stoppa <igor.stoppa@huawei.com>");
+MODULE_DESCRIPTION("Test module for protected doubly linked list.");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 16/17] prmem: pratomic-long
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (14 preceding siblings ...)
  2018-10-23 21:35 ` [PATCH 15/17] prmem: test cases for prlist and prhlist Igor Stoppa
@ 2018-10-23 21:35 ` Igor Stoppa
  2018-10-25  0:13   ` Peter Zijlstra
  2018-10-23 21:35 ` [PATCH 17/17] prmem: ima: turn the measurements list write rare Igor Stoppa
  2018-10-24 23:03 ` [RFC v1 PATCH 00/17] prmem: protected memory Dave Chinner
  17 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:35 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Will Deacon, Peter Zijlstra, Boqun Feng, Arnd Bergmann,
	linux-arch, linux-kernel

Minimal functionality providing a write-rare version of
atomic_long_t data.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Arnd Bergmann <arnd@arndb.de>
CC: linux-arch@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 MAINTAINERS                   |  1 +
 include/linux/pratomic-long.h | 73 +++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+)
 create mode 100644 include/linux/pratomic-long.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e7f7cb1682a6..9d72688d00a3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9466,6 +9466,7 @@ F:	mm/test_pmalloc.c
 F:	Documentation/core-api/prmem.rst
 F:	include/linux/prlist.h
 F:	lib/test_prlist.c
+F:	include/linux/pratomic-long.h
 
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
diff --git a/include/linux/pratomic-long.h b/include/linux/pratomic-long.h
new file mode 100644
index 000000000000..8f1408593733
--- /dev/null
+++ b/include/linux/pratomic-long.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Atomic operations for write rare memory */
+#ifndef _LINUX_PRATOMIC_LONG_H
+#define _LINUX_PRATOMIC_LONG_H
+#include <linux/prmem.h>
+#include <linux/compiler.h>
+#include <asm-generic/atomic-long.h>
+
+struct pratomic_long_t {
+	atomic_long_t l __aligned(sizeof(atomic_long_t));
+} __aligned(sizeof(atomic_long_t));
+
+#define PRATOMIC_LONG_INIT(i)	{	\
+	.l = ATOMIC_LONG_INIT((i)),	\
+}
+
+static __always_inline
+bool __pratomic_long_op(bool inc, struct pratomic_long_t *l)
+{
+	struct page *page;
+	uintptr_t base;
+	uintptr_t offset;
+	unsigned long flags;
+	size_t size = sizeof(*l);
+	bool is_virt = __is_wr_after_init(l, size);
+
+	if (WARN(!(is_virt || likely(__is_wr_pool(l, size))),
+		 WR_ERR_RANGE_MSG))
+		return false;
+	local_irq_save(flags);
+	if (is_virt)
+		page = virt_to_page(l);
+	else
+		page = vmalloc_to_page(l);
+	offset = (~PAGE_MASK) & (uintptr_t)l;
+	base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
+	if (WARN(!base, WR_ERR_PAGE_MSG)) {
+		local_irq_restore(flags);
+		return false;
+	}
+	if (inc)
+		atomic_long_inc((atomic_long_t *)(base + offset));
+	else
+		atomic_long_dec((atomic_long_t *)(base + offset));
+	vunmap((void *)base);
+	local_irq_restore(flags);
+	return true;
+
+}
+
+/**
+ * pratomic_long_inc - atomic increment of rare write long
+ * @l: address of the variable of type struct pratomic_long_t
+ *
+ * Return: true on success, false otherwise
+ */
+static __always_inline bool pratomic_long_inc(struct pratomic_long_t *l)
+{
+	return __pratomic_long_op(true, l);
+}
+
+/**
+ * pratomic_long_dec - atomic decrement of rare write long
+ * @l: address of the variable of type struct pratomic_long_t
+ *
+ * Return: true on success, false otherwise
+ */
+static __always_inline bool pratomic_long_dec(struct pratomic_long_t *l)
+{
+	return __pratomic_long_op(false, l);
+}
+
+#endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 17/17] prmem: ima: turn the measurements list write rare
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (15 preceding siblings ...)
  2018-10-23 21:35 ` [PATCH 16/17] prmem: pratomic-long Igor Stoppa
@ 2018-10-23 21:35 ` Igor Stoppa
  2018-10-24 23:03 ` [RFC v1 PATCH 00/17] prmem: protected memory Dave Chinner
  17 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-23 21:35 UTC (permalink / raw)
  To: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Dmitry Kasatkin, Serge E. Hallyn, linux-kernel

The measurement list is moved to write-rare memory, including
related data structures.

Various boilerplate Linux data structures and related functions are
replaced by their write-rare counterparts.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
CC: Mimi Zohar <zohar@linux.vnet.ibm.com>
CC: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
CC: James Morris <jmorris@namei.org>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: linux-integrity@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 security/integrity/ima/ima.h              | 18 ++++++++------
 security/integrity/ima/ima_api.c          | 29 +++++++++++++----------
 security/integrity/ima/ima_fs.c           | 12 +++++-----
 security/integrity/ima/ima_main.c         |  6 +++++
 security/integrity/ima/ima_queue.c        | 28 +++++++++++++---------
 security/integrity/ima/ima_template.c     | 14 ++++++-----
 security/integrity/ima/ima_template_lib.c | 14 +++++++----
 7 files changed, 74 insertions(+), 47 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 67db9d9454ca..5f5959753bf5 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -24,6 +24,8 @@
 #include <linux/hash.h>
 #include <linux/tpm.h>
 #include <linux/audit.h>
+#include <linux/prlist.h>
+#include <linux/pratomic-long.h>
 #include <crypto/hash_info.h>
 
 #include "../integrity.h"
@@ -84,7 +86,7 @@ struct ima_template_field {
 
 /* IMA template descriptor definition */
 struct ima_template_desc {
-	struct list_head list;
+	union prlist_head list;
 	char *name;
 	char *fmt;
 	int num_fields;
@@ -100,11 +102,13 @@ struct ima_template_entry {
 };
 
 struct ima_queue_entry {
-	struct hlist_node hnext;	/* place in hash collision list */
-	struct list_head later;		/* place in ima_measurements list */
+	union prhlist_node hnext;	/* place in hash collision list */
+	union prlist_head later;	/* place in ima_measurements list */
 	struct ima_template_entry *entry;
 };
-extern struct list_head ima_measurements;	/* list of all measurements */
+
+/* list of all measurements */
+extern union prlist_head ima_measurements __wr_after_init;
 
 /* Some details preceding the binary serialized measurement list */
 struct ima_kexec_hdr {
@@ -160,9 +164,9 @@ void ima_init_template_list(void);
 extern spinlock_t ima_queue_lock;
 
 struct ima_h_table {
-	atomic_long_t len;	/* number of stored measurements in the list */
-	atomic_long_t violations;
-	struct hlist_head queue[IMA_MEASURE_HTABLE_SIZE];
+	struct pratomic_long_t len;	/* # of measurements in the list */
+	struct pratomic_long_t violations;
+	union prhlist_head queue[IMA_MEASURE_HTABLE_SIZE];
 };
 extern struct ima_h_table ima_htable;
 
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index a02c5acfd403..4fc28c2478b0 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -19,9 +19,12 @@
 #include <linux/xattr.h>
 #include <linux/evm.h>
 #include <linux/iversion.h>
+#include <linux/prmemextra.h>
+#include <linux/pratomic-long.h>
 
 #include "ima.h"
 
+extern struct pmalloc_pool ima_pool;
 /*
  * ima_free_template_entry - free an existing template entry
  */
@@ -29,10 +32,10 @@ void ima_free_template_entry(struct ima_template_entry *entry)
 {
 	int i;
 
-	for (i = 0; i < entry->template_desc->num_fields; i++)
-		kfree(entry->template_data[i].data);
+//	for (i = 0; i < entry->template_desc->num_fields; i++)
+//		kfree(entry->template_data[i].data);
 
-	kfree(entry);
+//	kfree(entry);
 }
 
 /*
@@ -44,12 +47,13 @@ int ima_alloc_init_template(struct ima_event_data *event_data,
 	struct ima_template_desc *template_desc = ima_template_desc_current();
 	int i, result = 0;
 
-	*entry = kzalloc(sizeof(**entry) + template_desc->num_fields *
-			 sizeof(struct ima_field_data), GFP_NOFS);
+	*entry = pzalloc(&ima_pool,
+			 sizeof(**entry) + template_desc->num_fields *
+			 sizeof(struct ima_field_data));
 	if (!*entry)
 		return -ENOMEM;
 
-	(*entry)->template_desc = template_desc;
+	wr_ptr(&((*entry)->template_desc), template_desc);
 	for (i = 0; i < template_desc->num_fields; i++) {
 		struct ima_template_field *field = template_desc->fields[i];
 		u32 len;
@@ -59,9 +63,10 @@ int ima_alloc_init_template(struct ima_event_data *event_data,
 		if (result != 0)
 			goto out;
 
-		len = (*entry)->template_data[i].len;
-		(*entry)->template_data_len += sizeof(len);
-		(*entry)->template_data_len += len;
+		len = (*entry)->template_data_len + sizeof(len) +
+			(*entry)->template_data[i].len;
+		wr_memcpy(&(*entry)->template_data_len, &len,
+			  sizeof(len));
 	}
 	return 0;
 out:
@@ -113,9 +118,9 @@ int ima_store_template(struct ima_template_entry *entry,
 					    audit_cause, result, 0);
 			return result;
 		}
-		memcpy(entry->digest, hash.hdr.digest, hash.hdr.length);
+		wr_memcpy(entry->digest, hash.hdr.digest, hash.hdr.length);
 	}
-	entry->pcr = pcr;
+	wr_int(&entry->pcr, pcr);
 	result = ima_add_template_entry(entry, violation, op, inode, filename);
 	return result;
 }
@@ -139,7 +144,7 @@ void ima_add_violation(struct file *file, const unsigned char *filename,
 	int result;
 
 	/* can overflow, only indicator */
-	atomic_long_inc(&ima_htable.violations);
+	pratomic_long_inc(&ima_htable.violations);
 
 	result = ima_alloc_init_template(&event_data, &entry);
 	if (result < 0) {
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index ae9d5c766a3c..ab20da1161c7 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -57,7 +57,8 @@ static ssize_t ima_show_htable_violations(struct file *filp,
 					  char __user *buf,
 					  size_t count, loff_t *ppos)
 {
-	return ima_show_htable_value(buf, count, ppos, &ima_htable.violations);
+	return ima_show_htable_value(buf, count, ppos,
+				     &ima_htable.violations.l);
 }
 
 static const struct file_operations ima_htable_violations_ops = {
@@ -69,8 +70,7 @@ static ssize_t ima_show_measurements_count(struct file *filp,
 					   char __user *buf,
 					   size_t count, loff_t *ppos)
 {
-	return ima_show_htable_value(buf, count, ppos, &ima_htable.len);
-
+	return ima_show_htable_value(buf, count, ppos, &ima_htable.len.l);
 }
 
 static const struct file_operations ima_measurements_count_ops = {
@@ -86,7 +86,7 @@ static void *ima_measurements_start(struct seq_file *m, loff_t *pos)
 
 	/* we need a lock since pos could point beyond last element */
 	rcu_read_lock();
-	list_for_each_entry_rcu(qe, &ima_measurements, later) {
+	list_for_each_entry_rcu(qe, &ima_measurements.list, later.list) {
 		if (!l--) {
 			rcu_read_unlock();
 			return qe;
@@ -303,7 +303,7 @@ static ssize_t ima_read_policy(char *path)
 		size -= rc;
 	}
 
-	vfree(data);
+//	vfree(data);
 	if (rc < 0)
 		return rc;
 	else if (size)
@@ -350,7 +350,7 @@ static ssize_t ima_write_policy(struct file *file, const char __user *buf,
 	}
 	mutex_unlock(&ima_write_mutex);
 out_free:
-	kfree(data);
+//	kfree(data);
 out:
 	if (result < 0)
 		valid_policy = 0;
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 2d31921fbda4..d52e59006781 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -29,6 +29,7 @@
 #include <linux/ima.h>
 #include <linux/iversion.h>
 #include <linux/fs.h>
+#include <linux/prlist.h>
 
 #include "ima.h"
 
@@ -536,10 +537,15 @@ int ima_load_data(enum kernel_load_data_id id)
 	return 0;
 }
 
+struct pmalloc_pool ima_pool;
+
+#define IMA_POOL_ALLOC_CHUNK (16 * PAGE_SIZE)
 static int __init init_ima(void)
 {
 	int error;
 
+	pmalloc_init_custom_pool(&ima_pool, IMA_POOL_ALLOC_CHUNK, 3,
+				 PMALLOC_MODE_START_WR);
 	ima_init_template_list();
 	hash_setup(CONFIG_IMA_DEFAULT_HASH);
 	error = ima_init();
diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
index b186819bd5aa..444c47b745d8 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -24,11 +24,14 @@
 #include <linux/module.h>
 #include <linux/rculist.h>
 #include <linux/slab.h>
+#include <linux/prmemextra.h>
+#include <linux/prlist.h>
+#include <linux/pratomic-long.h>
 #include "ima.h"
 
 #define AUDIT_CAUSE_LEN_MAX 32
 
-LIST_HEAD(ima_measurements);	/* list of all measurements */
+PRLIST_HEAD(ima_measurements);	/* list of all measurements */
 #ifdef CONFIG_IMA_KEXEC
 static unsigned long binary_runtime_size;
 #else
@@ -36,9 +39,9 @@ static unsigned long binary_runtime_size = ULONG_MAX;
 #endif
 
 /* key: inode (before secure-hashing a file) */
-struct ima_h_table ima_htable = {
-	.len = ATOMIC_LONG_INIT(0),
-	.violations = ATOMIC_LONG_INIT(0),
+struct ima_h_table ima_htable __wr_after_init = {
+	.len = PRATOMIC_LONG_INIT(0),
+	.violations = PRATOMIC_LONG_INIT(0),
 	.queue[0 ... IMA_MEASURE_HTABLE_SIZE - 1] = HLIST_HEAD_INIT
 };
 
@@ -58,7 +61,7 @@ static struct ima_queue_entry *ima_lookup_digest_entry(u8 *digest_value,
 
 	key = ima_hash_key(digest_value);
 	rcu_read_lock();
-	hlist_for_each_entry_rcu(qe, &ima_htable.queue[key], hnext) {
+	hlist_for_each_entry_rcu(qe, &ima_htable.queue[key], hnext.node) {
 		rc = memcmp(qe->entry->digest, digest_value, TPM_DIGEST_SIZE);
 		if ((rc == 0) && (qe->entry->pcr == pcr)) {
 			ret = qe;
@@ -87,6 +90,8 @@ static int get_binary_runtime_size(struct ima_template_entry *entry)
 	return size;
 }
 
+extern struct pmalloc_pool ima_pool;
+
 /* ima_add_template_entry helper function:
  * - Add template entry to the measurement list and hash table, for
  *   all entries except those carried across kexec.
@@ -99,20 +104,21 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
 	struct ima_queue_entry *qe;
 	unsigned int key;
 
-	qe = kmalloc(sizeof(*qe), GFP_KERNEL);
+	qe = pmalloc(&ima_pool, sizeof(*qe));
 	if (qe == NULL) {
 		pr_err("OUT OF MEMORY ERROR creating queue entry\n");
 		return -ENOMEM;
 	}
-	qe->entry = entry;
+	wr_ptr(&qe->entry, entry);
+	INIT_PRLIST_HEAD(&qe->later);
+	prlist_add_tail_rcu(&qe->later, &ima_measurements);
+
 
-	INIT_LIST_HEAD(&qe->later);
-	list_add_tail_rcu(&qe->later, &ima_measurements);
+	pratomic_long_inc(&ima_htable.len);
 
-	atomic_long_inc(&ima_htable.len);
 	if (update_htable) {
 		key = ima_hash_key(entry->digest);
-		hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
+		prhlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
 	}
 
 	if (binary_runtime_size != ULONG_MAX) {
diff --git a/security/integrity/ima/ima_template.c b/security/integrity/ima/ima_template.c
index 30db39b23804..40ae57a17d89 100644
--- a/security/integrity/ima/ima_template.c
+++ b/security/integrity/ima/ima_template.c
@@ -22,14 +22,15 @@
 enum header_fields { HDR_PCR, HDR_DIGEST, HDR_TEMPLATE_NAME,
 		     HDR_TEMPLATE_DATA, HDR__LAST };
 
-static struct ima_template_desc builtin_templates[] = {
+static struct ima_template_desc builtin_templates[] __wr_after_init = {
 	{.name = IMA_TEMPLATE_IMA_NAME, .fmt = IMA_TEMPLATE_IMA_FMT},
 	{.name = "ima-ng", .fmt = "d-ng|n-ng"},
 	{.name = "ima-sig", .fmt = "d-ng|n-ng|sig"},
 	{.name = "", .fmt = ""},	/* placeholder for a custom format */
 };
 
-static LIST_HEAD(defined_templates);
+static PRLIST_HEAD(defined_templates);
+
 static DEFINE_SPINLOCK(template_list);
 
 static struct ima_template_field supported_fields[] = {
@@ -114,7 +115,8 @@ static struct ima_template_desc *lookup_template_desc(const char *name)
 	int found = 0;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(template_desc, &defined_templates, list) {
+	list_for_each_entry_rcu(template_desc, &defined_templates.list,
+				list.list) {
 		if ((strcmp(template_desc->name, name) == 0) ||
 		    (strcmp(template_desc->fmt, name) == 0)) {
 			found = 1;
@@ -207,12 +209,12 @@ void ima_init_template_list(void)
 {
 	int i;
 
-	if (!list_empty(&defined_templates))
+	if (!list_empty(&defined_templates.list))
 		return;
 
 	spin_lock(&template_list);
 	for (i = 0; i < ARRAY_SIZE(builtin_templates); i++) {
-		list_add_tail_rcu(&builtin_templates[i].list,
+		prlist_add_tail_rcu(&builtin_templates[i].list,
 				  &defined_templates);
 	}
 	spin_unlock(&template_list);
@@ -266,7 +268,7 @@ static struct ima_template_desc *restore_template_fmt(char *template_name)
 		goto out;
 
 	spin_lock(&template_list);
-	list_add_tail_rcu(&template_desc->list, &defined_templates);
+	prlist_add_tail_rcu(&template_desc->list, &defined_templates);
 	spin_unlock(&template_list);
 out:
 	return template_desc;
diff --git a/security/integrity/ima/ima_template_lib.c b/security/integrity/ima/ima_template_lib.c
index 43752002c222..a6d10eabf0e5 100644
--- a/security/integrity/ima/ima_template_lib.c
+++ b/security/integrity/ima/ima_template_lib.c
@@ -15,8 +15,12 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/printk.h>
+#include <linux/prmemextra.h>
 #include "ima_template_lib.h"
 
+extern struct pmalloc_pool ima_pool;
+
 static bool ima_template_hash_algo_allowed(u8 algo)
 {
 	if (algo == HASH_ALGO_SHA1 || algo == HASH_ALGO_MD5)
@@ -42,11 +46,11 @@ static int ima_write_template_field_data(const void *data, const u32 datalen,
 	if (datafmt == DATA_FMT_STRING)
 		buflen = datalen + 1;
 
-	buf = kzalloc(buflen, GFP_KERNEL);
+	buf = pzalloc(&ima_pool, buflen);
 	if (!buf)
 		return -ENOMEM;
 
-	memcpy(buf, data, datalen);
+	wr_memcpy(buf, data, datalen);
 
 	/*
 	 * Replace all space characters with underscore for event names and
@@ -58,11 +62,11 @@ static int ima_write_template_field_data(const void *data, const u32 datalen,
 	if (datafmt == DATA_FMT_STRING) {
 		for (buf_ptr = buf; buf_ptr - buf < datalen; buf_ptr++)
 			if (*buf_ptr == ' ')
-				*buf_ptr = '_';
+				wr_char(buf_ptr, '_');
 	}
 
-	field_data->data = buf;
-	field_data->len = buflen;
+	wr_ptr(&field_data->data, buf);
+	wr_memcpy(&field_data->len, &buflen, sizeof(buflen));
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 08/17] prmem: struct page: track vmap_area
  2018-10-23 21:34 ` [PATCH 08/17] prmem: struct page: track vmap_area Igor Stoppa
@ 2018-10-24  3:12   ` Matthew Wilcox
  2018-10-24 23:01     ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-24  3:12 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Dave Chinner, James Morris, Michal Hocko,
	kernel-hardening, linux-integrity, linux-security-module,
	igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

On Wed, Oct 24, 2018 at 12:34:55AM +0300, Igor Stoppa wrote:
> The connection between each page and its vmap_area avoids more expensive
> searches through the btree of vmap_areas.

Typo -- it's an rbtree.

> +++ b/include/linux/mm_types.h
> @@ -87,13 +87,24 @@ struct page {
>  			/* See page-flags.h for PAGE_MAPPING_FLAGS */
>  			struct address_space *mapping;
>  			pgoff_t index;		/* Our offset within mapping. */
> -			/**
> -			 * @private: Mapping-private opaque data.
> -			 * Usually used for buffer_heads if PagePrivate.
> -			 * Used for swp_entry_t if PageSwapCache.
> -			 * Indicates order in the buddy system if PageBuddy.
> -			 */
> -			unsigned long private;
> +			union {
> +				/**
> +				 * @private: Mapping-private opaque data.
> +				 * Usually used for buffer_heads if
> +				 * PagePrivate.
> +				 * Used for swp_entry_t if PageSwapCache.
> +				 * Indicates order in the buddy system if
> +				 * PageBuddy.
> +				 */
> +				unsigned long private;
> +				/**
> +				 * @area: reference to the containing area
> +				 * For pages that are mapped into a virtually
> +				 * contiguous area, avoids performing a more
> +				 * expensive lookup.
> +				 */
> +				struct vmap_area *area;
> +			};

Not like this.  Make it part of a different struct in the existing union,
not a part of the pagecache struct.  And there's no need to use ->private
explicitly.

> @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  	if (!addr)
>  		return NULL;
>  
> +	va = __find_vmap_area((unsigned long)addr);
> +	for (i = 0; i < va->vm->nr_pages; i++)
> +		va->vm->pages[i]->area = va;

I don't like it that you're calling this for _every_ vmalloc() caller
when most of them will never use this.  Perhaps have page->va be initially
NULL and then cache the lookup in it when it's accessed for the first time.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 06/17] prmem: test cases for memory protection
  2018-10-23 21:34 ` [PATCH 06/17] prmem: test cases for memory protection Igor Stoppa
@ 2018-10-24  3:27   ` Randy Dunlap
  2018-10-24 14:24     ` Igor Stoppa
  2018-10-25 16:43   ` Dave Hansen
  1 sibling, 1 reply; 140+ messages in thread
From: Randy Dunlap @ 2018-10-24  3:27 UTC (permalink / raw)
  To: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

On 10/23/18 2:34 PM, Igor Stoppa wrote:
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 9a7b8b049d04..57de5b3c0bae 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -94,3 +94,12 @@ config DEBUG_RODATA_TEST
>      depends on STRICT_KERNEL_RWX
>      ---help---
>        This option enables a testcase for the setting rodata read-only.
> +
> +config DEBUG_PRMEM_TEST
> +    tristate "Run self test for protected memory"
> +    depends on STRICT_KERNEL_RWX
> +    select PRMEM
> +    default n
> +    help
> +      Tries to verify that the memory protection works correctly and that
> +      the memory is effectively protected.

Hi,

a. It seems backwards (or upside down) to have a test case select a feature (PRMEM)
instead of depending on that feature.

b. Since PRMEM depends on MMU (in patch 04/17), the "select" here could try to
enable PRMEM even when MMU is not enabled.

Changing this to "depends on PRMEM" would solve both of these issues.

c. Don't use "default n".  That is already the default.
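
For reference, a sketch of how the entry might look with all three
points applied (dropping the "select" and the redundant "default n"):

```kconfig
config DEBUG_PRMEM_TEST
	tristate "Run self test for protected memory"
	depends on PRMEM
	help
	  Tries to verify that the memory protection works correctly and that
	  the memory is effectively protected.
```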


thanks,
-- 
~Randy

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-23 21:34 ` [PATCH 10/17] prmem: documentation Igor Stoppa
@ 2018-10-24  3:48   ` Randy Dunlap
  2018-10-24 14:30     ` Igor Stoppa
  2018-10-24 23:04   ` Mike Rapoport
  2018-10-26  9:26   ` Peter Zijlstra
  2 siblings, 1 reply; 140+ messages in thread
From: Randy Dunlap @ 2018-10-24  3:48 UTC (permalink / raw)
  To: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Mike Rapoport, linux-doc, linux-kernel

Hi,

On 10/23/18 2:34 PM, Igor Stoppa wrote:
> Documentation for protected memory.
> 
> Topics covered:
> * static memory allocation
> * dynamic memory allocation
> * write-rare
> 
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Jonathan Corbet <corbet@lwn.net>
> CC: Randy Dunlap <rdunlap@infradead.org>
> CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> CC: linux-doc@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  Documentation/core-api/index.rst |   1 +
>  Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++
>  MAINTAINERS                      |   1 +
>  3 files changed, 174 insertions(+)
>  create mode 100644 Documentation/core-api/prmem.rst


> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa@huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> +  **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, i.e. relocations or
> +  relocation-induced changes.
> +
> +- The present document doesn't address such transient.

                                               transience.

> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> +  require special attention.
> +
> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> +  otherwise writable.
> +
> +- The protection has page-level granularity.
> +
> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**
> +
> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> +   Labelled as **const**, the data is never supposed to be altered.
> +   It is statically allocated - if it has any memory footprint at all.
> +   The compiler can even optimize it away, where possible, by replacing
> +   references to a **const** with its actual value.
> +
> +- **read only after init**
> +   By tagging an otherwise ordinary statically allocated variable with
> +   **__ro_after_init**, it is placed in a special segment that will
> +   become write protected, at the end of the kernel init phase.
> +   The compiler has no notion of this restriction and it will treat any
> +   write operation on such variable as legal. However, assignments that
> +   are attempted after the write protection is in place, will cause

no comma.

> +   exceptions.
> +
> +- **write rare after init**
> +   This can be seen as variant of read only after init, which uses the
> +   tag **__wr_after_init**. It is also limited to statically allocated
> +   memory. It is still possible to alter this type of variable after
> +   the kernel init phase is complete, however it can be done exclusively
> +   with special functions, instead of the assignment operator. Using the
> +   assignment operator after conclusion of the init phase will still
> +   trigger an exception. It is not possible to transition a certain
> +   variable from __wr_after_init to a permanent read-only status at
> +   runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> +   After defining a pool, memory can be obtained through it, primarily
> +   through the **pmalloc()** allocator. The exact writability state of the
> +   memory obtained from **pmalloc()** and friends can be configured when
> +   creating the pool. At any point it is possible to transition the
> +   memory currently associated with the pool to a less permissive write
> +   status. Once memory has become read-only, the only valid operation,
> +   besides reading, is to release it, by destroying the pool it belongs
> +   to.
> +
> +
> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> + available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> +   When creating the pool, it is possible to choose one of the following:
> +    - **PMALLOC_MODE_RO**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_WR**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_AUTO_RO**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *NONE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_AUTO_WR**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_START_WR**
> +       - Writability at allocation time: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*
> +
> +   **Remarks:**
> +    - The "AUTO" modes perform automatic protection of the content, whenever
> +       the current vmap_area is used up and a new one is allocated.
> +        - At that point, the vmap_area being phased out is protected.
> +        - The size of the vmap_area depends on various parameters.
> +        - It might not be possible to know for sure *when* certain data will
> +          be protected.
> +        - The functionality is provided as a tradeoff between hardening and speed.
> +        - Its usefulness depends on the specific use case at hand

end above sentence with a period, please, like all of the others above it.

> +    - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.

Please try to keep the line above and a few below to < 80 characters in length.
(because some of us read rst files as text files, with a text editor, and line
wrap is ugly)

> +
> +- **Protecting the pool**
> +   This is achieved with **pmalloc_protect_pool()**
> +    - Any vmap_area currently in the pool is write-protected according to its initial configuration.
> +    - Any residual space still available from the current vmap_area is lost, as the area is protected.
> +    - **protecting a pool after every allocation will likely be very wasteful**
> +    - Using PMALLOC_MODE_START_WR is likely a better choice.
> +
> +- **Upgrading the protection level**
> +   This is achieved with **pmalloc_make_pool_ro()**
> +    - it turns the present content of a write-rare pool into read-only
> +    - can be useful when the content of the memory has settled
> +
> +
> +Caveats
> +-------
> +- Freeing of memory is not supported. Pages will be returned to the
> +  system upon destruction of their memory pool.
> +
> +- The address range available for vmalloc (and thus for pmalloc too) is
> +  limited, on 32-bit systems. However, it shouldn't be an issue, since
> +  not much data is expected to be dynamically allocated and then
> +  write-protected.
> +
> +- Regarding SMP systems, changing state of pages and altering mappings
> +  requires performing cross-processor synchronizations of page tables.
> +  This is an additional reason for limiting the use of write rare.
> +
> +- Not only the pmalloc memory must be protected, but also any reference to
> +  it that might become the target for an attack. The attack would replace
> +  a reference to the protected memory with a reference to some other,
> +  unprotected, memory.
> +
> +- The users of rare write must take care of ensuring the atomicity of the

s/rare write/write rare/ ?

> +  action, respect to the way they use the data being altered; for example,

  This "respect to the way" phrasing is awkward, but I don't know what to
change it to.

> +  take a lock before making a copy of the value to modify (if it's
> +  relevant), then alter it, issue the call to rare write and finally
> +  release the lock. Some special scenario might be exempt from the need
> +  for locking, but in general rare-write must be treated as an operation

It seemed to me that "write-rare" (or write rare) was the going name, but now
it's being called "rare write" (or rare-write).  Just be consistent, please.


> +  that can incur into races.
> +
> +- pmalloc relies on virtual memory areas and will therefore use more
> +  tlb entries. It still does a better job of it, compared to invoking

     TLB

> +  vmalloc for each allocation, but it is undeniably less optimized wrt to

s/wrt/with respect to/

> +  TLB use than using the physmap directly, through kmalloc or similar.
> +
> +
> +Utilization
> +-----------
> +
> +**add examples here**
> +
> +API
> +---
> +
> +.. kernel-doc:: include/linux/prmem.h
> +.. kernel-doc:: mm/prmem.c
> +.. kernel-doc:: include/linux/prmemextra.h


Thanks for the documentation.

-- 
~Randy

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
  2018-10-23 21:35 ` [PATCH 14/17] prmem: llist, hlist, both plain and rcu Igor Stoppa
@ 2018-10-24 11:37   ` Mathieu Desnoyers
  2018-10-24 14:03     ` Igor Stoppa
  2018-10-26  9:38   ` Peter Zijlstra
  1 sibling, 1 reply; 140+ messages in thread
From: Mathieu Desnoyers @ 2018-10-24 11:37 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Thomas Gleixner, Kate Stewart, David S. Miller,
	Greg Kroah-Hartman, Philippe Ombredanne, Paul E. McKenney,
	Josh Triplett, rostedt, Lai Jiangshan, linux-kernel

----- On Oct 23, 2018, at 10:35 PM, Igor Stoppa igor.stoppa@gmail.com wrote:

> In some cases, all the data needing protection can be allocated from a pool
> in one go, as directly writable, then initialized and protected.
> The sequence is relatively short and it's acceptable to leave the entire
> data set unprotected.
> 
> In other cases, this is not possible, because the data will trickle over
> a relatively long period of time, in an unpredictable way, possibly for
> the entire duration of the operations.
> 
> For these cases, the safe approach is to have the memory already write
> protected, when allocated. However, this will require replacing any
> direct assignment with calls to functions that can perform write rare.
> 
> Since lists are one of the most commonly used data structures in the
> kernel, they are the first candidate for receiving write rare extensions.
> 
> This patch implements basic functionality for altering said lists.

I could not find a description of the overall context of this patch
(e.g. a patch 00/17 ?) that would explain the attack vectors this aims
to protect against.

This might help figuring out whether this added complexity in the core
kernel is worth it.

Also, is it the right approach to duplicate existing APIs, or should we
rather hook into page fault handlers and let the kernel do those "shadow"
mappings under the hood ?

Adding a new GFP flags for dynamic allocation, and a macro mapping to
a section attribute might suffice for allocation or definition of such
mostly-read-only/seldom-updated data.

Thanks,

Mathieu


> 
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Kate Stewart <kstewart@linuxfoundation.org>
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> CC: Philippe Ombredanne <pombredanne@nexb.com>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Josh Triplett <josh@joshtriplett.org>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: Lai Jiangshan <jiangshanlai@gmail.com>
> CC: linux-kernel@vger.kernel.org
> ---
> MAINTAINERS            |   1 +
> include/linux/prlist.h | 934 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 935 insertions(+)
> create mode 100644 include/linux/prlist.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 246b1a1cc8bb..f5689c014e07 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9464,6 +9464,7 @@ F:	mm/prmem.c
> F:	mm/test_write_rare.c
> F:	mm/test_pmalloc.c
> F:	Documentation/core-api/prmem.rst
> +F:	include/linux/prlist.h
> 
> MEMORY MANAGEMENT
> L:	linux-mm@kvack.org
> diff --git a/include/linux/prlist.h b/include/linux/prlist.h
> new file mode 100644
> index 000000000000..0387c78f8be8
> --- /dev/null
> +++ b/include/linux/prlist.h
> @@ -0,0 +1,934 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * prlist.h: Header for Protected Lists
> + *
> + * (C) Copyright 2018 Huawei Technologies Co. Ltd.
> + * Author: Igor Stoppa <igor.stoppa@huawei.com>
> + *
> + * Code from <linux/list.h> and <linux/rculist.h>, adapted to perform
> + * writes on write-rare data.
> + * These functions and macros rely on data structures that allow the reuse
> + * of what is already provided for reading the content of their non-write
> + * rare variant.
> + */
> +
> +#ifndef _LINUX_PRLIST_H
> +#define _LINUX_PRLIST_H
> +
> +#include <linux/list.h>
> +#include <linux/kernel.h>
> +#include <linux/prmemextra.h>
> +
> +/* --------------- Circular Protected Doubly Linked List --------------- */
> +union prlist_head {
> +	struct list_head list __aligned(sizeof(void *));
> +	struct {
> +		union prlist_head *next __aligned(sizeof(void *));
> +		union prlist_head *prev __aligned(sizeof(void *));
> +	} __no_randomize_layout;
> +} __aligned(sizeof(void *));
> +
> +static __always_inline
> +union prlist_head *to_prlist_head(struct list_head *list)
> +{
> +	return container_of(list, union prlist_head, list);
> +}
> +
> +#define PRLIST_HEAD_INIT(name) {	\
> +	.list = LIST_HEAD_INIT(name),	\
> +}
> +
> +#define PRLIST_HEAD(name) \
> +	union prlist_head name __wr_after_init = PRLIST_HEAD_INIT(name.list)
> +
> +static __always_inline
> +struct pmalloc_pool *prlist_create_custom_pool(size_t refill,
> +					       unsigned short align_order)
> +{
> +	return pmalloc_create_custom_pool(refill, align_order,
> +					  PMALLOC_MODE_START_WR);
> +}
> +
> +static __always_inline struct pmalloc_pool *prlist_create_pool(void)
> +{
> +	return prlist_create_custom_pool(PMALLOC_REFILL_DEFAULT,
> +					 PMALLOC_ALIGN_ORDER_DEFAULT);
> +}
> +
> +static __always_inline
> +void prlist_set_prev(union prlist_head *head,
> +		     const union prlist_head *prev)
> +{
> +	wr_ptr(&head->prev, prev);
> +}
> +
> +static __always_inline
> +void prlist_set_next(union prlist_head *head,
> +		     const union prlist_head *next)
> +{
> +	wr_ptr(&head->next, next);
> +}
> +
> +static __always_inline void INIT_PRLIST_HEAD(union prlist_head *head)
> +{
> +	prlist_set_prev(head, head);
> +	prlist_set_next(head, head);
> +}
> +
> +/*
> + * Insert a new entry between two known consecutive entries.
> + *
> + * This is only for internal list manipulation where we know
> + * the prev/next entries already!
> + */
> +static __always_inline
> +void __prlist_add(union prlist_head *new, union prlist_head *prev,
> +		  union prlist_head *next)
> +{
> +	if (!__list_add_valid(&new->list, &prev->list, &next->list))
> +		return;
> +
> +	prlist_set_prev(next, new);
> +	prlist_set_next(new, next);
> +	prlist_set_prev(new, prev);
> +	prlist_set_next(prev, new);
> +}
> +
> +/**
> + * prlist_add - add a new entry
> + * @new: new entry to be added
> + * @head: prlist head to add it after
> + *
> + * Insert a new entry after the specified head.
> + * This is good for implementing stacks.
> + */
> +static __always_inline
> +void prlist_add(union prlist_head *new, union prlist_head *head)
> +{
> +	__prlist_add(new, head, head->next);
> +}
> +
> +/**
> + * prlist_add_tail - add a new entry
> + * @new: new entry to be added
> + * @head: list head to add it before
> + *
> + * Insert a new entry before the specified head.
> + * This is useful for implementing queues.
> + */
> +static __always_inline
> +void prlist_add_tail(union prlist_head *new, union prlist_head *head)
> +{
> +	__prlist_add(new, head->prev, head);
> +}
> +
> +/*
> + * Delete a prlist entry by making the prev/next entries
> + * point to each other.
> + *
> + * This is only for internal list manipulation where we know
> + * the prev/next entries already!
> + */
> +static __always_inline
> +void __prlist_del(union prlist_head *prev, union prlist_head *next)
> +{
> +	prlist_set_prev(next, prev);
> +	prlist_set_next(prev, next);
> +}
> +
> +/**
> + * prlist_del - deletes entry from list.
> + * @entry: the element to delete from the list.
> + * Note: list_empty() on entry does not return true after this, the entry is
> + * in an undefined state.
> + */
> +static inline void __prlist_del_entry(union prlist_head *entry)
> +{
> +	if (!__list_del_entry_valid(&entry->list))
> +		return;
> +	__prlist_del(entry->prev, entry->next);
> +}
> +
> +static __always_inline void prlist_del(union prlist_head *entry)
> +{
> +	__prlist_del_entry(entry);
> +	prlist_set_next(entry, LIST_POISON1);
> +	prlist_set_prev(entry, LIST_POISON2);
> +}
> +
> +/**
> + * prlist_replace - replace old entry by new one
> + * @old : the element to be replaced
> + * @new : the new element to insert
> + *
> + * If @old was empty, it will be overwritten.
> + */
> +static __always_inline
> +void prlist_replace(union prlist_head *old, union prlist_head *new)
> +{
> +	prlist_set_next(new, old->next);
> +	prlist_set_prev(new->next, new);
> +	prlist_set_prev(new, old->prev);
> +	prlist_set_next(new->prev, new);
> +}
> +
> +static __always_inline
> +void prlist_replace_init(union prlist_head *old, union prlist_head *new)
> +{
> +	prlist_replace(old, new);
> +	INIT_PRLIST_HEAD(old);
> +}
> +
> +/**
> + * prlist_del_init - deletes entry from list and reinitialize it.
> + * @entry: the element to delete from the list.
> + */
> +static __always_inline void prlist_del_init(union prlist_head *entry)
> +{
> +	__prlist_del_entry(entry);
> +	INIT_PRLIST_HEAD(entry);
> +}
> +
> +/**
> + * prlist_move - delete from one list and add as another's head
> + * @list: the entry to move
> + * @head: the head that will precede our entry
> + */
> +static __always_inline
> +void prlist_move(union prlist_head *list, union prlist_head *head)
> +{
> +	__prlist_del_entry(list);
> +	prlist_add(list, head);
> +}
> +
> +/**
> + * prlist_move_tail - delete from one list and add as another's tail
> + * @list: the entry to move
> + * @head: the head that will follow our entry
> + */
> +static __always_inline
> +void prlist_move_tail(union prlist_head *list, union prlist_head *head)
> +{
> +	__prlist_del_entry(list);
> +	prlist_add_tail(list, head);
> +}
> +
> +/**
> + * prlist_rotate_left - rotate the list to the left
> + * @head: the head of the list
> + */
> +static __always_inline void prlist_rotate_left(union prlist_head *head)
> +{
> +	union prlist_head *first;
> +
> +	if (!list_empty(&head->list)) {
> +		first = head->next;
> +		prlist_move_tail(first, head);
> +	}
> +}
> +
> +static __always_inline
> +void __prlist_cut_position(union prlist_head *list, union prlist_head *head,
> +			   union prlist_head *entry)
> +{
> +	union prlist_head *new_first = entry->next;
> +
> +	prlist_set_next(list, head->next);
> +	prlist_set_prev(list->next, list);
> +	prlist_set_prev(list, entry);
> +	prlist_set_next(entry, list);
> +	prlist_set_next(head, new_first);
> +	prlist_set_prev(new_first, head);
> +}
> +
> +/**
> + * prlist_cut_position - cut a list into two
> + * @list: a new list to add all removed entries
> + * @head: a list with entries
> + * @entry: an entry within head, could be the head itself
> + *	and if so we won't cut the list
> + *
> + * This helper moves the initial part of @head, up to and
> + * including @entry, from @head to @list. You should
> + * pass on @entry an element you know is on @head. @list
> + * should be an empty list or a list you do not care about
> + * losing its data.
> + *
> + */
> +static __always_inline
> +void prlist_cut_position(union prlist_head *list, union prlist_head *head,
> +			 union prlist_head *entry)
> +{
> +	if (list_empty(&head->list))
> +		return;
> +	if (list_is_singular(&head->list) &&
> +		(head->next != entry && head != entry))
> +		return;
> +	if (entry == head)
> +		INIT_PRLIST_HEAD(list);
> +	else
> +		__prlist_cut_position(list, head, entry);
> +}
> +
> +/**
> + * prlist_cut_before - cut a list into two, before given entry
> + * @list: a new list to add all removed entries
> + * @head: a list with entries
> + * @entry: an entry within head, could be the head itself
> + *
> + * This helper moves the initial part of @head, up to but
> + * excluding @entry, from @head to @list.  You should pass
> + * in @entry an element you know is on @head.  @list should
> + * be an empty list or a list you do not care about losing
> + * its data.
> + * If @entry == @head, all entries on @head are moved to
> + * @list.
> + */
> +static __always_inline
> +void prlist_cut_before(union prlist_head *list, union prlist_head *head,
> +		       union prlist_head *entry)
> +{
> +	if (head->next == entry) {
> +		INIT_PRLIST_HEAD(list);
> +		return;
> +	}
> +	prlist_set_next(list, head->next);
> +	prlist_set_prev(list->next, list);
> +	prlist_set_prev(list, entry->prev);
> +	prlist_set_next(list->prev, list);
> +	prlist_set_next(head, entry);
> +	prlist_set_prev(entry, head);
> +}
> +
> +static __always_inline
> +void __prlist_splice(const union prlist_head *list, union prlist_head *prev,
> +				 union prlist_head *next)
> +{
> +	union prlist_head *first = list->next;
> +	union prlist_head *last = list->prev;
> +
> +	prlist_set_prev(first, prev);
> +	prlist_set_next(prev, first);
> +	prlist_set_next(last, next);
> +	prlist_set_prev(next, last);
> +}
> +
> +/**
> + * prlist_splice - join two lists, this is designed for stacks
> + * @list: the new list to add.
> + * @head: the place to add it in the first list.
> + */
> +static __always_inline
> +void prlist_splice(const union prlist_head *list, union prlist_head *head)
> +{
> +	if (!list_empty(&list->list))
> +		__prlist_splice(list, head, head->next);
> +}
> +
> +/**
> + * prlist_splice_tail - join two lists, each list being a queue
> + * @list: the new list to add.
> + * @head: the place to add it in the first list.
> + */
> +static __always_inline
> +void prlist_splice_tail(union prlist_head *list, union prlist_head *head)
> +{
> +	if (!list_empty(&list->list))
> +		__prlist_splice(list, head->prev, head);
> +}
> +
> +/**
> + * prlist_splice_init - join two lists and reinitialise the emptied list.
> + * @list: the new list to add.
> + * @head: the place to add it in the first list.
> + *
> + * The list at @list is reinitialised
> + */
> +static __always_inline
> +void prlist_splice_init(union prlist_head *list, union prlist_head *head)
> +{
> +	if (!list_empty(&list->list)) {
> +		__prlist_splice(list, head, head->next);
> +		INIT_PRLIST_HEAD(list);
> +	}
> +}
> +
> +/**
> + * prlist_splice_tail_init - join 2 lists and reinitialise the emptied list
> + * @list: the new list to add.
> + * @head: the place to add it in the first list.
> + *
> + * Each of the lists is a queue.
> + * The list at @list is reinitialised
> + */
> +static __always_inline
> +void prlist_splice_tail_init(union prlist_head *list,
> +			     union prlist_head *head)
> +{
> +	if (!list_empty(&list->list)) {
> +		__prlist_splice(list, head->prev, head);
> +		INIT_PRLIST_HEAD(list);
> +	}
> +}
> +
> +/* ---- Protected Doubly Linked List with single pointer list head ---- */
> +union prhlist_head {
> +		struct hlist_head head __aligned(sizeof(void *));
> +		union prhlist_node *first __aligned(sizeof(void *));
> +} __aligned(sizeof(void *));
> +
> +union prhlist_node {
> +	struct hlist_node node __aligned(sizeof(void *));
> +	struct {
> +		union prhlist_node *next __aligned(sizeof(void *));
> +		union prhlist_node **pprev __aligned(sizeof(void *));
> +	} __no_randomize_layout;
> +} __aligned(sizeof(void *));
> +
> +#define PRHLIST_HEAD_INIT	{	\
> +	.head = HLIST_HEAD_INIT,	\
> +}
> +
> +#define PRHLIST_HEAD(name) \
> +	union prhlist_head name __wr_after_init = PRHLIST_HEAD_INIT
> +
> +
> +#define is_static(object) \
> +	unlikely(wr_check_boundaries(object, sizeof(*object)))
> +
> +static __always_inline
> +struct pmalloc_pool *prhlist_create_custom_pool(size_t refill,
> +						unsigned short align_order)
> +{
> +	return pmalloc_create_custom_pool(refill, align_order,
> +					  PMALLOC_MODE_AUTO_WR);
> +}
> +
> +static __always_inline
> +struct pmalloc_pool *prhlist_create_pool(void)
> +{
> +	return prhlist_create_custom_pool(PMALLOC_REFILL_DEFAULT,
> +					  PMALLOC_ALIGN_ORDER_DEFAULT);
> +}
> +
> +static __always_inline
> +void prhlist_set_first(union prhlist_head *head, union prhlist_node *first)
> +{
> +	wr_ptr(&head->first, first);
> +}
> +
> +static __always_inline
> +void prhlist_set_next(union prhlist_node *node, union prhlist_node *next)
> +{
> +	wr_ptr(&node->next, next);
> +}
> +
> +static __always_inline
> +void prhlist_set_pprev(union prhlist_node *node, union prhlist_node **pprev)
> +{
> +	wr_ptr(&node->pprev, pprev);
> +}
> +
> +static __always_inline
> +void prhlist_set_prev(union prhlist_node *node, union prhlist_node *prev)
> +{
> +	wr_ptr(node->pprev, prev);
> +}
> +
> +static __always_inline void INIT_PRHLIST_HEAD(union prhlist_head *head)
> +{
> +	prhlist_set_first(head, NULL);
> +}
> +
> +static __always_inline void INIT_PRHLIST_NODE(union prhlist_node *node)
> +{
> +	prhlist_set_next(node, NULL);
> +	prhlist_set_pprev(node, NULL);
> +}
> +
> +static __always_inline void __prhlist_del(union prhlist_node *n)
> +{
> +	union prhlist_node *next = n->next;
> +	union prhlist_node **pprev = n->pprev;
> +
> +	wr_ptr(pprev, next);
> +	if (next)
> +		prhlist_set_pprev(next, pprev);
> +}
> +
> +static __always_inline void prhlist_del(union prhlist_node *n)
> +{
> +	__prhlist_del(n);
> +	prhlist_set_next(n, LIST_POISON1);
> +	prhlist_set_pprev(n, LIST_POISON2);
> +}
> +
> +static __always_inline void prhlist_del_init(union prhlist_node *n)
> +{
> +	if (!hlist_unhashed(&n->node)) {
> +		__prhlist_del(n);
> +		INIT_PRHLIST_NODE(n);
> +	}
> +}
> +
> +static __always_inline
> +void prhlist_add_head(union prhlist_node *n, union prhlist_head *h)
> +{
> +	union prhlist_node *first = h->first;
> +
> +	prhlist_set_next(n, first);
> +	if (first)
> +		prhlist_set_pprev(first, &n->next);
> +	prhlist_set_first(h, n);
> +	prhlist_set_pprev(n, &h->first);
> +}
> +
> +/* next must be != NULL */
> +static __always_inline
> +void prhlist_add_before(union prhlist_node *n, union prhlist_node *next)
> +{
> +	prhlist_set_pprev(n, next->pprev);
> +	prhlist_set_next(n, next);
> +	prhlist_set_pprev(next, &n->next);
> +	prhlist_set_prev(n, n);
> +}
> +
> +static __always_inline
> +void prhlist_add_behind(union prhlist_node *n, union prhlist_node *prev)
> +{
> +	prhlist_set_next(n, prev->next);
> +	prhlist_set_next(prev, n);
> +	prhlist_set_pprev(n, &prev->next);
> +	if (n->next)
> +		prhlist_set_pprev(n->next, &n->next);
> +}
> +
> +/* after that we'll appear to be on some hlist and hlist_del will work */
> +static __always_inline void prhlist_add_fake(union prhlist_node *n)
> +{
> +	prhlist_set_pprev(n, &n->next);
> +}
> +
> +/*
> + * Move a list from one list head to another. Fixup the pprev
> + * reference of the first entry if it exists.
> + */
> +static __always_inline
> +void prhlist_move_list(union prhlist_head *old, union prhlist_head *new)
> +{
> +	prhlist_set_first(new, old->first);
> +	if (new->first)
> +		prhlist_set_pprev(new->first, &new->first);
> +	prhlist_set_first(old, NULL);
> +}
> +
> +/* ------------------------ RCU list and hlist ------------------------ */
> +
> +/*
> + * INIT_LIST_HEAD_RCU - Initialize a list_head visible to RCU readers
> + * @head: list to be initialized
> + *
> + * It is exactly equivalent to INIT_LIST_HEAD()
> + */
> +static __always_inline void INIT_PRLIST_HEAD_RCU(union prlist_head *head)
> +{
> +	INIT_PRLIST_HEAD(head);
> +}
> +
> +/*
> + * Insert a new entry between two known consecutive entries.
> + *
> + * This is only for internal list manipulation where we know
> + * the prev/next entries already!
> + */
> +static __always_inline
> +void __prlist_add_rcu(union prlist_head *new, union prlist_head *prev,
> +		      union prlist_head *next)
> +{
> +	if (!__list_add_valid(&new->list, &prev->list, &next->list))
> +		return;
> +	prlist_set_next(new, next);
> +	prlist_set_prev(new, prev);
> +	wr_rcu_assign_pointer(list_next_rcu(&prev->list), new);
> +	prlist_set_prev(next, new);
> +}
> +
> +/**
> + * prlist_add_rcu - add a new entry to rcu-protected prlist
> + * @new: new entry to be added
> + * @head: prlist head to add it after
> + *
> + * Insert a new entry after the specified head.
> + * This is good for implementing stacks.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another prlist-mutation primitive, such as prlist_add_rcu()
> + * or prlist_del_rcu(), running on this same list.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * list_for_each_entry_rcu().
> + */
> +static __always_inline
> +void prlist_add_rcu(union prlist_head *new, union prlist_head *head)
> +{
> +	__prlist_add_rcu(new, head, head->next);
> +}
> +
> +/**
> + * prlist_add_tail_rcu - add a new entry to rcu-protected prlist
> + * @new: new entry to be added
> + * @head: prlist head to add it before
> + *
> + * Insert a new entry before the specified head.
> + * This is useful for implementing queues.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another prlist-mutation primitive, such as prlist_add_tail_rcu()
> + * or prlist_del_rcu(), running on this same list.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * list_for_each_entry_rcu().
> + */
> +static __always_inline
> +void prlist_add_tail_rcu(union prlist_head *new, union prlist_head *head)
> +{
> +	__prlist_add_rcu(new, head->prev, head);
> +}
> +
> +/**
> + * prlist_del_rcu - deletes entry from prlist without re-initialization
> + * @entry: the element to delete from the prlist.
> + *
> + * Note: list_empty() on entry does not return true after this,
> + * the entry is in an undefined state. It is useful for RCU based
> + * lockfree traversal.
> + *
> + * In particular, it means that we can not poison the forward
> + * pointers that may still be used for walking the list.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another list-mutation primitive, such as prlist_del_rcu()
> + * or prlist_add_rcu(), running on this same prlist.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * list_for_each_entry_rcu().
> + *
> + * Note that the caller is not permitted to immediately free
> + * the newly deleted entry.  Instead, either synchronize_rcu()
> + * or call_rcu() must be used to defer freeing until an RCU
> + * grace period has elapsed.
> + */
> +static __always_inline void prlist_del_rcu(union prlist_head *entry)
> +{
> +	__prlist_del_entry(entry);
> +	prlist_set_prev(entry, LIST_POISON2);
> +}
> +
> +/**
> + * prhlist_del_init_rcu - deletes entry from hash list with re-initialization
> + * @n: the element to delete from the hash list.
> + *
> + * Note: list_unhashed() on the node returns true after this. It is
> + * useful for RCU based read lockfree traversal if the writer side
> + * must know if the list entry is still hashed or already unhashed.
> + *
> + * In particular, it means that we can not poison the forward pointers
> + * that may still be used for walking the hash list and we can only
> + * zero the pprev pointer so list_unhashed() will return true after
> + * this.
> + *
> + * The caller must take whatever precautions are necessary (such as
> + * holding appropriate locks) to avoid racing with another
> + * list-mutation primitive, such as hlist_add_head_rcu() or
> + * hlist_del_rcu(), running on this same list.  However, it is
> + * perfectly legal to run concurrently with the _rcu list-traversal
> + * primitives, such as hlist_for_each_entry_rcu().
> + */
> +static __always_inline void prhlist_del_init_rcu(union prhlist_node *n)
> +{
> +	if (!hlist_unhashed(&n->node)) {
> +		__prhlist_del(n);
> +		prhlist_set_pprev(n, NULL);
> +	}
> +}
> +
> +/**
> + * prlist_replace_rcu - replace old entry by new one
> + * @old : the element to be replaced
> + * @new : the new element to insert
> + *
> + * The @old entry will be replaced with the @new entry atomically.
> + * Note: @old should not be empty.
> + */
> +static __always_inline
> +void prlist_replace_rcu(union prlist_head *old, union prlist_head *new)
> +{
> +	prlist_set_next(new, old->next);
> +	prlist_set_prev(new, old->prev);
> +	wr_rcu_assign_pointer(list_next_rcu(&new->prev->list), new);
> +	prlist_set_prev(new->next, new);
> +	prlist_set_prev(old, LIST_POISON2);
> +}
> +
> +/**
> + * __prlist_splice_init_rcu - join an RCU-protected list into an existing list.
> + * @list:	the RCU-protected list to splice
> + * @prev:	points to the last element of the existing list
> + * @next:	points to the first element of the existing list
> + * @sync:	function to sync: synchronize_rcu(), synchronize_sched(), ...
> + *
> + * The list pointed to by @prev and @next can be RCU-read traversed
> + * concurrently with this function.
> + *
> + * Note that this function blocks.
> + *
> + * Important note: the caller must take whatever action is necessary to prevent
> + * any other updates to the existing list.  In principle, it is possible to
> + * modify the list as soon as sync() begins execution. If this sort of thing
> + * becomes necessary, an alternative version based on call_rcu() could be
> + * created.  But only if -really- needed -- there is no shortage of RCU API
> + * members.
> + */
> +static __always_inline
> +void __prlist_splice_init_rcu(union prlist_head *list,
> +			      union prlist_head *prev,
> +			      union prlist_head *next, void (*sync)(void))
> +{
> +	union prlist_head *first = list->next;
> +	union prlist_head *last = list->prev;
> +
> +	/*
> +	 * "first" and "last" tracking list, so initialize it.  RCU readers
> +	 * have access to this list, so we must use INIT_LIST_HEAD_RCU()
> +	 * instead of INIT_LIST_HEAD().
> +	 */
> +
> +	INIT_PRLIST_HEAD_RCU(list);
> +
> +	/*
> +	 * At this point, the list body still points to the source list.
> +	 * Wait for any readers to finish using the list before splicing
> +	 * the list body into the new list.  Any new readers will see
> +	 * an empty list.
> +	 */
> +
> +	sync();
> +
> +	/*
> +	 * Readers are finished with the source list, so perform splice.
> +	 * The order is important if the new list is global and accessible
> +	 * to concurrent RCU readers.  Note that RCU readers are not
> +	 * permitted to traverse the prev pointers without excluding
> +	 * this function.
> +	 */
> +
> +	prlist_set_next(last, next);
> +	wr_rcu_assign_pointer(list_next_rcu(&prev->list), first);
> +	prlist_set_prev(first, prev);
> +	prlist_set_prev(next, last);
> +}
> +
> +/**
> + * prlist_splice_init_rcu - splice an RCU-protected list into an existing
> + *                          list, designed for stacks.
> + * @list:	the RCU-protected list to splice
> + * @head:	the place in the existing list to splice the first list into
> + * @sync:	function to sync: synchronize_rcu(), synchronize_sched(), ...
> + */
> +static __always_inline
> +void prlist_splice_init_rcu(union prlist_head *list,
> +			    union prlist_head *head,
> +			    void (*sync)(void))
> +{
> +	if (!list_empty(&list->list))
> +		__prlist_splice_init_rcu(list, head, head->next, sync);
> +}
> +
> +/**
> + * prlist_splice_tail_init_rcu - splice an RCU-protected list into an
> + *                               existing list, designed for queues.
> + * @list:	the RCU-protected list to splice
> + * @head:	the place in the existing list to splice the first list into
> + * @sync:	function to sync: synchronize_rcu(), synchronize_sched(), ...
> + */
> +static __always_inline
> +void prlist_splice_tail_init_rcu(union prlist_head *list,
> +				 union prlist_head *head,
> +				 void (*sync)(void))
> +{
> +	if (!list_empty(&list->list))
> +		__prlist_splice_init_rcu(list, head->prev, head, sync);
> +}
> +
> +/**
> + * prhlist_del_rcu - deletes entry from hash list without re-initialization
> + * @n: the element to delete from the hash list.
> + *
> + * Note: list_unhashed() on entry does not return true after this,
> + * the entry is in an undefined state. It is useful for RCU based
> + * lockfree traversal.
> + *
> + * In particular, it means that we can not poison the forward
> + * pointers that may still be used for walking the hash list.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another list-mutation primitive, such as hlist_add_head_rcu()
> + * or hlist_del_rcu(), running on this same list.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * hlist_for_each_entry().
> + */
> +static __always_inline void prhlist_del_rcu(union prhlist_node *n)
> +{
> +	__prhlist_del(n);
> +	prhlist_set_pprev(n, LIST_POISON2);
> +}
> +
> +/**
> + * prhlist_replace_rcu - replace old entry by new one
> + * @old : the element to be replaced
> + * @new : the new element to insert
> + *
> + * The @old entry will be replaced with the @new entry atomically.
> + */
> +static __always_inline
> +void prhlist_replace_rcu(union prhlist_node *old, union prhlist_node *new)
> +{
> +	union prhlist_node *next = old->next;
> +
> +	prhlist_set_next(new, next);
> +	prhlist_set_pprev(new, old->pprev);
> +	wr_rcu_assign_pointer(*(union prhlist_node __rcu **)new->pprev, new);
> +	if (next)
> +		prhlist_set_pprev(new->next, &new->next);
> +	prhlist_set_pprev(old, LIST_POISON2);
> +}
> +
> +/**
> + * prhlist_add_head_rcu
> + * @n: the element to add to the hash list.
> + * @h: the list to add to.
> + *
> + * Description:
> + * Adds the specified element to the specified hlist,
> + * while permitting racing traversals.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another list-mutation primitive, such as hlist_add_head_rcu()
> + * or hlist_del_rcu(), running on this same list.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * hlist_for_each_entry_rcu(), used to prevent memory-consistency
> + * problems on Alpha CPUs.  Regardless of the type of CPU, the
> + * list-traversal primitive must be guarded by rcu_read_lock().
> + */
> +static __always_inline
> +void prhlist_add_head_rcu(union prhlist_node *n, union prhlist_head *h)
> +{
> +	union prhlist_node *first = h->first;
> +
> +	prhlist_set_next(n, first);
> +	prhlist_set_pprev(n, &h->first);
> +	wr_rcu_assign_pointer(hlist_first_rcu(&h->head), n);
> +	if (first)
> +		prhlist_set_pprev(first, &n->next);
> +}
> +
> +/**
> + * prhlist_add_tail_rcu
> + * @n: the element to add to the hash list.
> + * @h: the list to add to.
> + *
> + * Description:
> + * Adds the specified element to the specified hlist,
> + * while permitting racing traversals.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another list-mutation primitive, such as prhlist_add_head_rcu()
> + * or prhlist_del_rcu(), running on this same list.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * hlist_for_each_entry_rcu(), used to prevent memory-consistency
> + * problems on Alpha CPUs.  Regardless of the type of CPU, the
> + * list-traversal primitive must be guarded by rcu_read_lock().
> + */
> +static __always_inline
> +void prhlist_add_tail_rcu(union prhlist_node *n, union prhlist_head *h)
> +{
> +	union prhlist_node *i, *last = NULL;
> +
> +	/* Note: write side code, so rcu accessors are not needed. */
> +	for (i = h->first; i; i = i->next)
> +		last = i;
> +
> +	if (last) {
> +		prhlist_set_next(n, last->next);
> +		prhlist_set_pprev(n, &last->next);
> +		wr_rcu_assign_pointer(hlist_next_rcu(&last->node), n);
> +	} else {
> +		prhlist_add_head_rcu(n, h);
> +	}
> +}
> +
> +/**
> + * prhlist_add_before_rcu
> + * @n: the new element to add to the hash list.
> + * @next: the existing element to add the new element before.
> + *
> + * Description:
> + * Adds the specified element to the specified hlist
> + * before the specified node while permitting racing traversals.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another list-mutation primitive, such as prhlist_add_head_rcu()
> + * or prhlist_del_rcu(), running on this same list.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * hlist_for_each_entry_rcu(), used to prevent memory-consistency
> + * problems on Alpha CPUs.
> + */
> +static __always_inline
> +void prhlist_add_before_rcu(union prhlist_node *n, union prhlist_node *next)
> +{
> +	prhlist_set_pprev(n, next->pprev);
> +	prhlist_set_next(n, next);
> +	wr_rcu_assign_pointer(hlist_pprev_rcu(&n->node), n);
> +	prhlist_set_pprev(next, &n->next);
> +}
> +
> +/**
> + * prhlist_add_behind_rcu
> + * @n: the new element to add to the hash list.
> + * @prev: the existing element to add the new element after.
> + *
> + * Description:
> + * Adds the specified element to the specified hlist
> + * after the specified node while permitting racing traversals.
> + *
> + * The caller must take whatever precautions are necessary
> + * (such as holding appropriate locks) to avoid racing
> + * with another list-mutation primitive, such as prhlist_add_head_rcu()
> + * or prhlist_del_rcu(), running on this same list.
> + * However, it is perfectly legal to run concurrently with
> + * the _rcu list-traversal primitives, such as
> + * hlist_for_each_entry_rcu(), used to prevent memory-consistency
> + * problems on Alpha CPUs.
> + */
> +static __always_inline
> +void prhlist_add_behind_rcu(union prhlist_node *n, union prhlist_node *prev)
> +{
> +	prhlist_set_next(n, prev->next);
> +	prhlist_set_pprev(n, &prev->next);
> +	wr_rcu_assign_pointer(hlist_next_rcu(&prev->node), n);
> +	if (n->next)
> +		prhlist_set_pprev(n->next, &n->next);
> +}
> +#endif
> --
> 2.17.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 13/17] prmem: linked list: disable layout randomization
  2018-10-23 21:35 ` [PATCH 13/17] prmem: linked list: disable layout randomization Igor Stoppa
@ 2018-10-24 13:43   ` Alexey Dobriyan
  2018-10-29 19:40     ` Igor Stoppa
  2018-10-26  9:32   ` Peter Zijlstra
  1 sibling, 1 reply; 140+ messages in thread
From: Alexey Dobriyan @ 2018-10-24 13:43 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Greg Kroah-Hartman, Andrew Morton, Masahiro Yamada,
	Pekka Enberg, Paul E. McKenney, Lihao Liang, linux-kernel

On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
> Some of the data structures used in list management are composed by two
> pointers. Since the kernel is now configured by default to randomize the
> layout of data structures solely composed of pointers,

Isn't this true for function pointers?

* Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
  2018-10-24 11:37   ` Mathieu Desnoyers
@ 2018-10-24 14:03     ` Igor Stoppa
  2018-10-24 14:56       ` Tycho Andersen
  2018-10-28  9:52       ` Steven Rostedt
  0 siblings, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-24 14:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Thomas Gleixner, Kate Stewart, David S. Miller,
	Greg Kroah-Hartman, Philippe Ombredanne, Paul E. McKenney,
	Josh Triplett, rostedt, Lai Jiangshan, linux-kernel

On 24/10/18 14:37, Mathieu Desnoyers wrote:

> I could not find a description of the overall context of this patch
> (e.g. a patch 00/17 ?) that would explain the attack vectors this aims
> to protect against.

Apologies, I have to admit I was a bit baffled about what to do: the 
patchset spans several subsystems and I was reluctant to spam 
all the mailing lists.

I was hoping that by CC-ing kernel.org, the explicit recipients would 
get both the mail directly intended for them (as subsystem 
maintainers/supporters) and the rest.

The next time I'll err in the opposite direction.

In the meanwhile, please find the whole set here:

https://www.openwall.com/lists/kernel-hardening/2018/10/23/3

> This might help figuring out whether this added complexity in the core
> kernel is worth it.

I hope it will.

> Also, is it the right approach to duplicate existing APIs, or should we
> rather hook into page fault handlers and let the kernel do those "shadow"
> mappings under the hood ?

This question is probably a good candidate for the small Q&A section I 
have in the 00/17.


> Adding a new GFP flags for dynamic allocation, and a macro mapping to
> a section attribute might suffice for allocation or definition of such
> mostly-read-only/seldom-updated data.

I think what you are proposing makes sense from a pure hardening 
standpoint. From a more defensive one, I'd rather minimise the chances 
of giving a free pass to an attacker.

Maybe there is a better implementation of this than what I have in 
mind. But, based on my current understanding of what you are describing, 
there would be a few issues:

1) where would the pool go? The pool is a way to manage multiple vmas 
and express the common properties they share, even before a vma is 
associated with the pool.

2) there would be more code that can seamlessly deal with both protected 
and regular data. Based on what? Some parameter, I suppose.
That parameter would become the new target.
If the code is "duplicated", as you say, the actual differences are 
baked in at compile time. The "duplication" would also allow having 
always-inlined functions for write-rare, while leaving more freedom to 
the compiler for their non-protected versions.

Besides, I think the separate wr version also makes it very clear to 
the user of the API that there is a price to pay, in terms of 
performance. The more seamless alternative might make this price less 
obvious.

--
igor

* Re: [PATCH 06/17] prmem: test cases for memory protection
  2018-10-24  3:27   ` Randy Dunlap
@ 2018-10-24 14:24     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-24 14:24 UTC (permalink / raw)
  To: Randy Dunlap, Mimi Zohar, Kees Cook, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
	linux-integrity, linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

Hi,

On 24/10/18 06:27, Randy Dunlap wrote:

> a. It seems backwards (or upside down) to have a test case select a feature (PRMEM)
> instead of depending on that feature.
> 
> b. Since PRMEM depends on MMU (in patch 04/17), the "select" here could try to
> enabled PRMEM even when MMU is not enabled.
> 
> Changing this to "depends on PRMEM" would solve both of these issues.

The weird dependency you pointed out is partially caused by the 
incompleteness of PRMEM.

What I have in mind is to have a fallback version of it for systems 
without an MMU capable of write protection, possibly defaulting to 
kvmalloc. In that case there would not be any need for a configuration 
option.

> c. Don't use "default n".  That is already the default.

ok

--
igor

* Re: [PATCH 10/17] prmem: documentation
  2018-10-24  3:48   ` Randy Dunlap
@ 2018-10-24 14:30     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-24 14:30 UTC (permalink / raw)
  To: Randy Dunlap, Mimi Zohar, Kees Cook, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
	linux-integrity, linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Mike Rapoport, linux-doc, linux-kernel

Hi,

On 24/10/18 06:48, Randy Dunlap wrote:
> Hi,
> 
> On 10/23/18 2:34 PM, Igor Stoppa wrote:

[...]

>> +- The present document doesn't address such transient.
> 
>                                                 transience.

ok

[...]

>> +   are attempted after the write protection is in place, will cause
> 
> no comma.

ok

[...]

>> +        - Its usefulness depends on the specific use case at hand
> 
> end above sentence with a period, please, like all of the others above it.

ok


>> +    - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.
> 
> Please try to keep the line above and a few below to < 80 characters in length.
> (because some of us read rst files as text files, with a text editor, and line
> wrap is ugly)

ok, I still have to master .rst :-(

[...]

>> +- The users of rare write must take care of ensuring the atomicity of the
> 
> s/rare write/write rare/ ?

thanks

>> +  action, respect to the way they use the data being altered; for example,
> 
>    This ..   "respect to the way" is awkward, but I don't know what to
> change it to.
> 
>> +  take a lock before making a copy of the value to modify (if it's
>> +  relevant), then alter it, issue the call to rare write and finally
>> +  release the lock. Some special scenario might be exempt from the need
>> +  for locking, but in general rare-write must be treated as an operation
> 
> It seemed to me that "write-rare" (or write rare) was the going name, but now
> it's being called "rare write" (or rare-write).  Just be consistent, please.


write-rare it is, because it can be shortened as wr_xxx

rare_write becomes rw_xxx

which wrongly hints at read/write, which it definitely is not

>> +  tlb entries. It still does a better job of it, compared to invoking
> 
>       TLB

ok

>> +  vmalloc for each allocation, but it is undeniably less optimized wrt to
> 
> s/wrt/with respect to/

yes

> Thanks for the documentation.

thanks for the review :-)

--
igor

* Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
  2018-10-24 14:03     ` Igor Stoppa
@ 2018-10-24 14:56       ` Tycho Andersen
  2018-10-24 22:52         ` Igor Stoppa
  2018-10-28  9:52       ` Steven Rostedt
  1 sibling, 1 reply; 140+ messages in thread
From: Tycho Andersen @ 2018-10-24 14:56 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mathieu Desnoyers, Mimi Zohar, Kees Cook, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
	linux-integrity, linux-security-module, igor stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Thomas Gleixner, Kate Stewart,
	David S. Miller, Greg Kroah-Hartman, Philippe Ombredanne,
	Paul E. McKenney, Josh Triplett, rostedt, Lai Jiangshan,
	linux-kernel

On Wed, Oct 24, 2018 at 05:03:01PM +0300, Igor Stoppa wrote:
> On 24/10/18 14:37, Mathieu Desnoyers wrote:
> > Also, is it the right approach to duplicate existing APIs, or should we
> > rather hook into page fault handlers and let the kernel do those "shadow"
> > mappings under the hood ?
> 
> This question is probably a good candidate for the small Q&A section I have
> in the 00/17.
> 
> 
> > Adding a new GFP flags for dynamic allocation, and a macro mapping to
> > a section attribute might suffice for allocation or definition of such
> > mostly-read-only/seldom-updated data.
> 
> I think what you are proposing makes sense from a pure hardening standpoint.
> From a more defensive one, I'd rather minimise the chances of giving a free
> pass to an attacker.
> 
> Maybe there is a better implementation of this than what I have in mind.
> But, based on my current understanding of what you are describing, there
> would be a few issues:
> 
> 1) where would the pool go? The pool is a way to manage multiple vmas and
> express the common properties they share, even before a vma is associated
> with the pool.
> 
> 2) there would be more code that can seamlessly deal with both protected and
> regular data. Based on what? Some parameter, I suppose.
> That parameter would become the new target.
> If the code is "duplicated", as you say, the actual differences are baked in
> at compile time. The "duplication" would also allow having always-inlined
> functions for write-rare, while leaving more freedom to the compiler for
> their non-protected versions.
> 
> Besides, I think the separate wr version also makes it very clear to the
> user of the API that there is a price to pay, in terms of performance.
> The more seamless alternative might make this price less obvious.

What about something in the middle, where we move list to list_impl.h,
and add a few macros where you have list_set_prev() in prlist now, so
we could do,

// prlist.h

#define list_set_next(head, next) wr_ptr(&head->next, next)
#define list_set_prev(head, prev) wr_ptr(&head->prev, prev)

#include <linux/list_impl.h>

// list.h

#define list_set_next(head, next) (head->next = next)
#define list_set_prev(head, prev) (head->prev = prev)

#include <linux/list_impl.h>

I wonder then if you can get rid of some of the type punning too? It's
not clear exactly why that's necessary from the series, but perhaps
I'm missing something obvious :)

I also wonder how much difference having this baked in at
compile time makes. Most (all?) of this code is inlined.

Cheers,

Tycho

* Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
  2018-10-24 14:56       ` Tycho Andersen
@ 2018-10-24 22:52         ` Igor Stoppa
  2018-10-25  8:11           ` Tycho Andersen
  0 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-24 22:52 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Mathieu Desnoyers, Mimi Zohar, Kees Cook, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
	linux-integrity, linux-security-module, igor stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Thomas Gleixner, Kate Stewart,
	David S. Miller, Greg Kroah-Hartman, Philippe Ombredanne,
	Paul E. McKenney, Josh Triplett, rostedt, Lai Jiangshan,
	linux-kernel

On 24/10/2018 17:56, Tycho Andersen wrote:
> On Wed, Oct 24, 2018 at 05:03:01PM +0300, Igor Stoppa wrote:
>> On 24/10/18 14:37, Mathieu Desnoyers wrote:
>>> Also, is it the right approach to duplicate existing APIs, or should we
>>> rather hook into page fault handlers and let the kernel do those "shadow"
>>> mappings under the hood ?
>>
>> This question is probably a good candidate for the small Q&A section I have
>> in the 00/17.
>>
>>
>>> Adding a new GFP flags for dynamic allocation, and a macro mapping to
>>> a section attribute might suffice for allocation or definition of such
>>> mostly-read-only/seldom-updated data.
>>
>> I think what you are proposing makes sense from a pure hardening standpoint.
>>  From a more defensive one, I'd rather minimise the chances of giving a free
>> pass to an attacker.
>>
>> Maybe there is a better implementation of this than what I have in mind.
>> But, based on my current understanding of what you are describing, there
>> would be a few issues:
>>
>> 1) where would the pool go? The pool is a way to manage multiple vmas and
>> express the common properties they share, even before a vma is associated
>> with the pool.
>>
>> 2) there would be more code that can seamlessly deal with both protected and
>> regular data. Based on what? Some parameter, I suppose.
>> That parameter would become the new target.
>> If the code is "duplicated", as you say, the actual differences are baked in
>> at compile time. The "duplication" would also allow having always-inlined
>> functions for write-rare, while leaving more freedom to the compiler for
>> their non-protected versions.
>>
>> Besides, I think the separate wr version also makes it very clear to the
>> user of the API that there is a price to pay, in terms of performance.
>> The more seamless alternative might make this price less obvious.
> 
> What about something in the middle, where we move list to list_impl.h,
> and add a few macros where you have list_set_prev() in prlist now, so
> we could do,
> 
> // prlist.h
> 
> #define list_set_next(head, next) wr_ptr(&head->next, next)
> #define list_set_prev(head, prev) wr_ptr(&head->prev, prev)
> 
> #include <linux/list_impl.h>
> 
> // list.h
> 
> #define list_set_next(head, next) (head->next = next)
> #define list_set_prev(head, prev) (head->prev = prev)
> 
> #include <linux/list_impl.h>
> 
> I wonder then if you can get rid of some of the type punning too? It's
> not clear exactly why that's necessary from the series, but perhaps
> I'm missing something obvious :)

nothing obvious; probably there is only half a reference in the slides I 
linked to in the cover letter :-)

So far I have minimized the number of "intrinsic" write-rare functions, 
mostly because I want first to reach an agreement on the implementation 
of the core write-rare.

However, once that is done, it might be good to also convert the prlists 
into "intrinsics". A list node is 2 pointers.
If that were the alignment, i.e. __align(sizeof(list_head)), it might be 
possible to speed up list handling a lot, even as write-rare.

Taking the insertion operation as an example, it would probably be 
sufficient, in most cases, to have only two remappings:
- one covering the page with the latest two nodes
- one covering the page with the list head

That is 2 vs 8 remappings, and a good deal fewer memory barriers.

This would be incompatible with what you are proposing, yet it would be 
justifiable, I think, because it would provide better performance for 
prlist, potentially widening its adoption where performance is a concern.

> I also wonder how much difference having this baked in at
> compile time makes.

If the inlined function expects to receive a prlist_head *, instead of a 
list_head *, doesn't it help turn runtime bugs into build-time bugs?

Or maybe I'm missing your point?

--
igor

* Re: [PATCH 08/17] prmem: struct page: track vmap_area
  2018-10-24  3:12   ` Matthew Wilcox
@ 2018-10-24 23:01     ` Igor Stoppa
  2018-10-25  2:13       ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-24 23:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mimi Zohar, Kees Cook, Dave Chinner, James Morris, Michal Hocko,
	kernel-hardening, linux-integrity, linux-security-module,
	igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel



On 24/10/2018 06:12, Matthew Wilcox wrote:
> On Wed, Oct 24, 2018 at 12:34:55AM +0300, Igor Stoppa wrote:
>> The connection between each page and its vmap_area avoids more expensive
>> searches through the btree of vmap_areas.
> 
> Typo -- it's an rbtree.

ack

>> +++ b/include/linux/mm_types.h
>> @@ -87,13 +87,24 @@ struct page {
>>   			/* See page-flags.h for PAGE_MAPPING_FLAGS */
>>   			struct address_space *mapping;
>>   			pgoff_t index;		/* Our offset within mapping. */
>> -			/**
>> -			 * @private: Mapping-private opaque data.
>> -			 * Usually used for buffer_heads if PagePrivate.
>> -			 * Used for swp_entry_t if PageSwapCache.
>> -			 * Indicates order in the buddy system if PageBuddy.
>> -			 */
>> -			unsigned long private;
>> +			union {
>> +				/**
>> +				 * @private: Mapping-private opaque data.
>> +				 * Usually used for buffer_heads if
>> +				 * PagePrivate.
>> +				 * Used for swp_entry_t if PageSwapCache.
>> +				 * Indicates order in the buddy system if
>> +				 * PageBuddy.
>> +				 */
>> +				unsigned long private;
>> +				/**
>> +				 * @area: reference to the containing area
>> +				 * For pages that are mapped into a virtually
>> +				 * contiguous area, avoids performing a more
>> +				 * expensive lookup.
>> +				 */
>> +				struct vmap_area *area;
>> +			};
> 
> Not like this.  Make it part of a different struct in the existing union,
> not a part of the pagecache struct.  And there's no need to use ->private
> explicitly.

Ok, I'll have a look at the googledoc you made

>> @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>>   	if (!addr)
>>   		return NULL;
>>   
>> +	va = __find_vmap_area((unsigned long)addr);
>> +	for (i = 0; i < va->vm->nr_pages; i++)
>> +		va->vm->pages[i]->area = va;
> 
> I don't like it that you're calling this for _every_ vmalloc() caller
> when most of them will never use this.  Perhaps have page->va be initially
> NULL and then cache the lookup in it when it's accessed for the first time.
> 

If __find_vmap_area() were part of the API, this loop could be left out 
of __vmalloc_node_range() and the user of the allocation could 
initialize the field, if needed.

What is the reason for keeping __find_vmap_area() private?

--
igor

* Re: [RFC v1 PATCH 00/17] prmem: protected memory
  2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
                   ` (16 preceding siblings ...)
  2018-10-23 21:35 ` [PATCH 17/17] prmem: ima: turn the measurements list write rare Igor Stoppa
@ 2018-10-24 23:03 ` Dave Chinner
  2018-10-29 19:47   ` Igor Stoppa
  17 siblings, 1 reply; 140+ messages in thread
From: Dave Chinner @ 2018-10-24 23:03 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, James Morris,
	Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott

On Wed, Oct 24, 2018 at 12:34:47AM +0300, Igor Stoppa wrote:
> -- Summary --
> 
> Preliminary version of memory protection patchset, including a sample use
> case, turning into write-rare the IMA measurement list.

I haven't looked at the patches yet, but I see a significant issue
from the subject lines. "prmem" is very similar to "pmem"
(persistent memory) and that's going to cause confusion. Especially
if people start talking about prmem and pmem in the context of write
protect pmem with prmem...

Naming is hard :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 10/17] prmem: documentation
  2018-10-23 21:34 ` [PATCH 10/17] prmem: documentation Igor Stoppa
  2018-10-24  3:48   ` Randy Dunlap
@ 2018-10-24 23:04   ` Mike Rapoport
  2018-10-29 19:05     ` Igor Stoppa
  2018-10-26  9:26   ` Peter Zijlstra
  2 siblings, 1 reply; 140+ messages in thread
From: Mike Rapoport @ 2018-10-24 23:04 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
	linux-kernel

Hi Igor,

On Wed, Oct 24, 2018 at 12:34:57AM +0300, Igor Stoppa wrote:
> Documentation for protected memory.
> 
> Topics covered:
> * static memory allocation
> * dynamic memory allocation
> * write-rare
> 
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Jonathan Corbet <corbet@lwn.net>
> CC: Randy Dunlap <rdunlap@infradead.org>
> CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> CC: linux-doc@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  Documentation/core-api/index.rst |   1 +
>  Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++

Thanks for having docs a part of the patchset!

>  MAINTAINERS                      |   1 +
>  3 files changed, 174 insertions(+)
>  create mode 100644 Documentation/core-api/prmem.rst
> 
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index 26b735cefb93..1a90fa878d8d 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
>     gfp_mask-from-fs-io
>     timekeeping
>     boot-time-mm
> +   prmem
> 
>  Interfaces for kernel debugging
>  ===============================
> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa@huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> +  **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, i.e. relocations or
> +  relocation-induced changes.
> +
> +- The present document doesn't address such transient.
> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> +  require special attention.
> +

I feel that foreword should include a sentence or two saying why we need
the memory protection and when it can/should be used.

> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> +  otherwise writable.
> +
> +- The protection has page-level granularity.
> +
> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**
> +
> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> +   Labelled as **const**, the data is never supposed to be altered.
> +   It is statically allocated - if it has any memory footprint at all.
> +   The compiler can even optimize it away, where possible, by replacing
> +   references to a **const** with its actual value.
> +
> +- **read only after init**
> +   By tagging an otherwise ordinary statically allocated variable with
> +   **__ro_after_init**, it is placed in a special segment that will
> +   become write protected, at the end of the kernel init phase.
> +   The compiler has no notion of this restriction and it will treat any
> +   write operation on such variable as legal. However, assignments that
> +   are attempted after the write protection is in place, will cause
> +   exceptions.
> +
> +- **write rare after init**
> +   This can be seen as variant of read only after init, which uses the
> +   tag **__wr_after_init**. It is also limited to statically allocated
> +   memory. It is still possible to alter this type of variables, after

                                                         no comma ^

> +   the kernel init phase is complete, however it can be done exclusively
> +   with special functions, instead of the assignment operator. Using the
> +   assignment operator after conclusion of the init phase will still
> +   trigger an exception. It is not possible to transition a certain
> +   variable from __wr_ater_init to a permanent read-only status, at

                    __wr_aFter_init

> +   runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> +   After defining a pool, memory can be obtained through it, primarily
> +   through the **pmalloc()** allocator. The exact writability state of the
> +   memory obtained from **pmalloc()** and friends can be configured when
> +   creating the pool. At any point it is possible to transition to a less
> +   permissive write status the memory currently associated to the pool.
> +   Once memory has become read-only, it the only valid operation, beside

... become read-only, the only valid operation

> +   reading, is to released it, by destroying the pool it belongs to.
> +
> +
> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> + available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> +   When creating the pool, it is possible to choose one of the following:
> +    - **PMALLOC_MODE_RO**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_WR**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_AUTO_RO**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *NONE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_AUTO_WR**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_START_WR**
> +       - Writability at allocation time: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*

For me this part is completely blind. Maybe arranging this as a table would
make the states more clearly visible.

> +
> +   **Remarks:**
> +    - The "AUTO" modes perform automatic protection of the content, whenever
> +       the current vmap_area is used up and a new one is allocated.
> +        - At that point, the vmap_area being phased out is protected.
> +        - The size of the vmap_area depends on various parameters.
> +        - It might not be possible to know for sure *when* certain data will
> +          be protected.
> +        - The functionality is provided as tradeoff between hardening and speed.
> +        - Its usefulness depends on the specific use case at hand
> +    - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.
> +
> +- **Protecting the pool**
> +   This is achieved with **pmalloc_protect_pool()**
> +    - Any vmap_area currently in the pool is write-protected according to its initial configuration.
> +    - Any residual space still available from the current vmap_area is lost, as the area is protected.
> +    - **protecting a pool after every allocation will likely be very wasteful**
> +    - Using PMALLOC_MODE_START_WR is likely a better choice.
> +
> +- **Upgrading the protection level**
> +   This is achieved with **pmalloc_make_pool_ro()**
> +    - it turns the present content of a write-rare pool into read-only
> +    - can be useful when the content of the memory has settled
> +
> +
> +Caveats
> +-------
> +- Freeing of memory is not supported. Pages will be returned to the
> +  system upon destruction of their memory pool.
> +
> +- The address range available for vmalloc (and thus for pmalloc too) is
> +  limited, on 32-bit systems. However it shouldn't be an issue, since not

   no comma ^

> +  much data is expected to be dynamically allocated and turned into
> +  write-protected.
> +
> +- Regarding SMP systems, changing state of pages and altering mappings
> +  requires performing cross-processor synchronizations of page tables.
> +  This is an additional reason for limiting the use of write rare.
> +
> +- Not only the pmalloc memory must be protected, but also any reference to
> +  it that might become the target for an attack. The attack would replace
> +  a reference to the protected memory with a reference to some other,
> +  unprotected, memory.
> +
> +- The users of rare write must take care of ensuring the atomicity of the
> +  action, respect to the way they use the data being altered; for example,
> +  take a lock before making a copy of the value to modify (if it's
> +  relevant), then alter it, issue the call to rare write and finally
> +  release the lock. Some special scenario might be exempt from the need
> +  for locking, but in general rare-write must be treated as an operation
> +  that can incur into races.
> +
> +- pmalloc relies on virtual memory areas and will therefore use more
> +  tlb entries. It still does a better job of it, compared to invoking
> +  vmalloc for each allocation, but it is undeniably less optimized wrt
> +  TLB use than using the physmap directly, through kmalloc or similar.
> +
> +
> +Utilization
> +-----------
> +
> +**add examples here**
> +
> +API
> +---
> +
> +.. kernel-doc:: include/linux/prmem.h
> +.. kernel-doc:: mm/prmem.c
> +.. kernel-doc:: include/linux/prmemextra.h
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ea979a5a9ec9..246b1a1cc8bb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9463,6 +9463,7 @@ F:	include/linux/prmemextra.h
>  F:	mm/prmem.c
>  F:	mm/test_write_rare.c
>  F:	mm/test_pmalloc.c
> +F:	Documentation/core-api/prmem.rst

I think the MAINTAINERS update can go in one chunk as the last patch in the
series.
 
>  MEMORY MANAGEMENT
>  L:	linux-mm@kvack.org
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 16/17] prmem: pratomic-long
  2018-10-23 21:35 ` [PATCH 16/17] prmem: pratomic-long Igor Stoppa
@ 2018-10-25  0:13   ` Peter Zijlstra
  2018-10-29 21:17     ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-25  0:13 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Will Deacon, Boqun Feng, Arnd Bergmann, linux-arch,
	linux-kernel

On Wed, Oct 24, 2018 at 12:35:03AM +0300, Igor Stoppa wrote:
> +static __always_inline
> +bool __pratomic_long_op(bool inc, struct pratomic_long_t *l)
> +{
> +	struct page *page;
> +	uintptr_t base;
> +	uintptr_t offset;
> +	unsigned long flags;
> +	size_t size = sizeof(*l);
> +	bool is_virt = __is_wr_after_init(l, size);
> +
> +	if (WARN(!(is_virt || likely(__is_wr_pool(l, size))),
> +		 WR_ERR_RANGE_MSG))
> +		return false;
> +	local_irq_save(flags);
> +	if (is_virt)
> +		page = virt_to_page(l);
> +	else
> +		vmalloc_to_page(l);
> +	offset = (~PAGE_MASK) & (uintptr_t)l;
> +	base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> +	if (WARN(!base, WR_ERR_PAGE_MSG)) {
> +		local_irq_restore(flags);
> +		return false;
> +	}
> +	if (inc)
> +		atomic_long_inc((atomic_long_t *)(base + offset));
> +	else
> +		atomic_long_dec((atomic_long_t *)(base + offset));
> +	vunmap((void *)base);
> +	local_irq_restore(flags);
> +	return true;
> +
> +}

That's just hideously nasty.. and horribly broken.

We're not going to duplicate all these kernel interfaces wrapped in gunk
like that. Also, you _cannot_ call vunmap() with IRQs disabled. Clearly
you've never tested this with debug bits enabled.


* Re: [PATCH 02/17] prmem: write rare for static allocation
  2018-10-23 21:34 ` [PATCH 02/17] prmem: write rare for static allocation Igor Stoppa
@ 2018-10-25  0:24   ` Dave Hansen
  2018-10-29 18:03     ` Igor Stoppa
  2018-10-26  9:41   ` Peter Zijlstra
  1 sibling, 1 reply; 140+ messages in thread
From: Dave Hansen @ 2018-10-25  0:24 UTC (permalink / raw)
  To: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

> +static __always_inline bool __is_wr_after_init(const void *ptr, size_t size)
> +{
> +	size_t start = (size_t)&__start_wr_after_init;
> +	size_t end = (size_t)&__end_wr_after_init;
> +	size_t low = (size_t)ptr;
> +	size_t high = (size_t)ptr + size;
> +
> +	return likely(start <= low && low < high && high <= end);
> +}

size_t is an odd type choice for doing address arithmetic.

> +/**
> + * wr_memset() - sets n bytes of the destination to the c value
> + * @dst: beginning of the memory to write to
> + * @c: byte to replicate
> + * @size: amount of bytes to copy
> + *
> + * Returns true on success, false otherwise.
> + */
> +static __always_inline
> +bool wr_memset(const void *dst, const int c, size_t n_bytes)
> +{
> +	size_t size;
> +	unsigned long flags;
> +	uintptr_t d = (uintptr_t)dst;
> +
> +	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
> +		return false;
> +	while (n_bytes) {
> +		struct page *page;
> +		uintptr_t base;
> +		uintptr_t offset;
> +		uintptr_t offset_complement;

Again, these are really odd choices for types.  vmap() returns a void*
pointer, on which you can do arithmetic.  Why bother keeping another
type to which you have to cast to and from?

BTW, our usual "pointer stored in an integer type" is 'unsigned long',
if a pointer needs to be manipulated.

> +		local_irq_save(flags);

Why are you doing the local_irq_save()?

> +		page = virt_to_page(d);
> +		offset = d & ~PAGE_MASK;
> +		offset_complement = PAGE_SIZE - offset;
> +		size = min(n_bytes, offset_complement);
> +		base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);

Can you even call vmap() (which sleeps) with interrupts off?

> +		if (WARN(!base, WR_ERR_PAGE_MSG)) {
> +			local_irq_restore(flags);
> +			return false;
> +		}

You really need some kmap_atomic()-style accessors to wrap this stuff
for you.  This little pattern is repeated over and over.

...
> +const char WR_ERR_RANGE_MSG[] = "Write rare on invalid memory range.";
> +const char WR_ERR_PAGE_MSG[] = "Failed to remap write rare page.";

Doesn't the compiler de-duplicate duplicated strings for you?  Is there
any reason to declare these like this?



* Re: [PATCH 03/17] prmem: vmalloc support for dynamic allocation
  2018-10-23 21:34 ` [PATCH 03/17] prmem: vmalloc support for dynamic allocation Igor Stoppa
@ 2018-10-25  0:26   ` Dave Hansen
  2018-10-29 18:07     ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Dave Hansen @ 2018-10-25  0:26 UTC (permalink / raw)
  To: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Andrew Morton, Chintan Pandya, Joe Perches, Luis R. Rodriguez,
	Thomas Gleixner, Kate Stewart, Greg Kroah-Hartman,
	Philippe Ombredanne, linux-mm, linux-kernel

On 10/23/18 2:34 PM, Igor Stoppa wrote:
> +#define VM_PMALLOC		0x00000100	/* pmalloc area - see docs */
> +#define VM_PMALLOC_WR		0x00000200	/* pmalloc write rare area */
> +#define VM_PMALLOC_PROTECTED	0x00000400	/* pmalloc protected area */

Please introduce things as you use them.  It's impossible to review a
patch that just says "see docs" that doesn't contain any docs. :)


* Re: [PATCH 05/17] prmem: shorthands for write rare on common types
  2018-10-23 21:34 ` [PATCH 05/17] prmem: shorthands for write rare on common types Igor Stoppa
@ 2018-10-25  0:28   ` Dave Hansen
  2018-10-29 18:12     ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Dave Hansen @ 2018-10-25  0:28 UTC (permalink / raw)
  To: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

On 10/23/18 2:34 PM, Igor Stoppa wrote:
> Wrappers around the basic write rare functionality, addressing several
> common data types found in the kernel, allowing to specify the new
> values through immediates, like constants and defines.

I have to wonder whether this is the right way, or whether a
size-agnostic function like put_user() is the right way.  put_user() is
certainly easier to use.


* Re: [PATCH 08/17] prmem: struct page: track vmap_area
  2018-10-24 23:01     ` Igor Stoppa
@ 2018-10-25  2:13       ` Matthew Wilcox
  2018-10-29 18:21         ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-25  2:13 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Dave Chinner, James Morris, Michal Hocko,
	kernel-hardening, linux-integrity, linux-security-module,
	igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

On Thu, Oct 25, 2018 at 02:01:02AM +0300, Igor Stoppa wrote:
> > > @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > >   	if (!addr)
> > >   		return NULL;
> > > +	va = __find_vmap_area((unsigned long)addr);
> > > +	for (i = 0; i < va->vm->nr_pages; i++)
> > > +		va->vm->pages[i]->area = va;
> > 
> > I don't like it that you're calling this for _every_ vmalloc() caller
> > when most of them will never use this.  Perhaps have page->va be initially
> > NULL and then cache the lookup in it when it's accessed for the first time.
> > 
> 
> If __find_vmap_area() was part of the API, this loop could be left out from
> __vmalloc_node_range() and the user of the allocation could initialize the
> field, if needed.
> 
> What is the reason for keeping __find_vmap_area() private?

Well, for one, you're walking the rbtree without holding the spinlock,
so you're going to get crashes.  I don't see why we shouldn't export
find_vmap_area() though.

Another way we could approach this is to embed the vmap_area in the
vm_struct.  It'd require a bit of juggling of the alloc/free paths in
vmalloc, but it might be worthwhile.


* Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
  2018-10-24 22:52         ` Igor Stoppa
@ 2018-10-25  8:11           ` Tycho Andersen
  0 siblings, 0 replies; 140+ messages in thread
From: Tycho Andersen @ 2018-10-25  8:11 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mathieu Desnoyers, Mimi Zohar, Kees Cook, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
	linux-integrity, linux-security-module, igor stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Thomas Gleixner, Kate Stewart,
	David S. Miller, Greg Kroah-Hartman, Philippe Ombredanne,
	Paul E. McKenney, Josh Triplett, rostedt, Lai Jiangshan,
	linux-kernel

On Thu, Oct 25, 2018 at 01:52:11AM +0300, Igor Stoppa wrote:
> On 24/10/2018 17:56, Tycho Andersen wrote:
> > On Wed, Oct 24, 2018 at 05:03:01PM +0300, Igor Stoppa wrote:
> > > On 24/10/18 14:37, Mathieu Desnoyers wrote:
> > > > Also, is it the right approach to duplicate existing APIs, or should we
> > > > rather hook into page fault handlers and let the kernel do those "shadow"
> > > > mappings under the hood ?
> > > 
> > > This question is probably a good candidate for the small Q&A section I have
> > > in the 00/17.
> > > 
> > > 
> > > > Adding a new GFP flags for dynamic allocation, and a macro mapping to
> > > > a section attribute might suffice for allocation or definition of such
> > > > mostly-read-only/seldom-updated data.
> > > 
> > > I think what you are proposing makes sense from a pure hardening standpoint.
> > >  From a more defensive one, I'd rather minimise the chances of giving a free
> > > pass to an attacker.
> > > 
> > > Maybe there is a better implementation of this, than what I have in mind.
> > > But, based on my current understanding of what you are describing, there
> > > would be few issues:
> > > 
> > > 1) where would the pool go? The pool is a way to manage multiple vmas and
> > > express common property they share. Even before a vma is associated to the
> > > pool.
> > > 
> > > 2) there would be more code that can seamlessly deal with both protected and
> > > regular data. Based on what? Some parameter, I suppose.
> > > That parameter would be the new target.
> > > If the code is "duplicated", as you say, the actual differences are baked in
> > > at compile time. The "duplication" would also allow to have always inlined
> > > functions for write-rare and leave more freedom to the compiler for their
> > > non-protected version.
> > > 
> > > Besides, I think the separate wr version also makes it very clear, to the
> > > user of the API, that there will be a price to pay, in terms of performance.
> > > The more seamlessly alternative might make this price less obvious.
> > 
> > What about something in the middle, where we move list to list_impl.h,
> > and add a few macros where you have list_set_prev() in prlist now, so
> > we could do,
> > 
> > // prlist.h
> > 
> > #define list_set_next(head, next) wr_ptr(&head->next, next)
> > #define list_set_prev(head, prev) wr_ptr(&head->prev, prev)
> > 
> > #include <linux/list_impl.h>
> > 
> > // list.h
> > 
> > #define list_set_next(head, next) (head->next = next)
> > #define list_set_prev(head, prev) (head->prev = prev)
> > 
> > #include <linux/list_impl.h>
> > 
> > I wonder then if you can get rid of some of the type punning too? It's
> > not clear exactly why that's necessary from the series, but perhaps
> > I'm missing something obvious :)
> 
> nothing obvious, probably there is only half a reference in the slides I
> linked-to in the cover letter :-)
> 
> So far I have minimized the number of "intrinsic" write rare functions,
> mostly because I would want first to reach an agreement on the
> implementation of the core write-rare.
> 
> However, once that is done, it might be good to convert also the prlists to
> be "intrinsics". A list node is 2 pointers.
> If that was the alignment, i.e. __align(sizeof(list_head)), it might be
> possible to speed up the list handling a lot, even as write rare.
> 
> Taking as example the insertion operation, it would be probably sufficient,
> in most cases, to have only two remappings:
> - one covering the page with the latest two nodes
> - one covering the page with the list head
> 
> That is 2 vs 8 remappings, and a good deal of memory barriers less.
> 
> This would be incompatible with what you are proposing, yet it would be
> justifiable, I think, because it would provide better performance to prlist,
> potentially widening its adoption, where performance is a concern.

I guess the writes to these are rare, right? So perhaps it's not such
a big deal :)

> > I also wonder how much the actual differences being baked in at
> > compile time makes. Most (all?) of this code is inlined.
> 
> If the inlined function expects to receive a prlist_head *, instead of a
> list_head *, doesn't it help turning runtime bugs into buildtime bugs?

In principle it's not a bug to use the prmem helpers where the regular
ones would do, it's just slower (assuming the types are the same). But
mostly, it's a way to avoid actually copying and pasting most of the
implementations of most of the data structures. I see some other
replies in the thread already, but this seems not so good to me.

Tycho


* Re: [PATCH 06/17] prmem: test cases for memory protection
  2018-10-23 21:34 ` [PATCH 06/17] prmem: test cases for memory protection Igor Stoppa
  2018-10-24  3:27   ` Randy Dunlap
@ 2018-10-25 16:43   ` Dave Hansen
  2018-10-29 18:16     ` Igor Stoppa
  1 sibling, 1 reply; 140+ messages in thread
From: Dave Hansen @ 2018-10-25 16:43 UTC (permalink / raw)
  To: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

> +static bool is_address_protected(void *p)
> +{
> +	struct page *page;
> +	struct vmap_area *area;
> +
> +	if (unlikely(!is_vmalloc_addr(p)))
> +		return false;
> +	page = vmalloc_to_page(p);
> +	if (unlikely(!page))
> +		return false;
> +	wmb(); /* Flush changes to the page table - is it needed? */

No.

The rest of this is just pretty verbose and seems to have been very
heavily copied and pasted.  I guess that's OK for test code, though.


* Re: [PATCH 10/17] prmem: documentation
  2018-10-23 21:34 ` [PATCH 10/17] prmem: documentation Igor Stoppa
  2018-10-24  3:48   ` Randy Dunlap
  2018-10-24 23:04   ` Mike Rapoport
@ 2018-10-26  9:26   ` Peter Zijlstra
  2018-10-26 10:20     ` Matthew Wilcox
                       ` (4 more replies)
  2 siblings, 5 replies; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-26  9:26 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
	linux-kernel

Jon,

So the below document is a prime example for why I think RST sucks. As a
text document readability is greatly diminished by all the markup
nonsense.

This stuff should not become write-only content like html and other
gunk. The actual text file is still the primary means of reading this.

> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index 26b735cefb93..1a90fa878d8d 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
>     gfp_mask-from-fs-io
>     timekeeping
>     boot-time-mm
> +   prmem
>  
>  Interfaces for kernel debugging
>  ===============================
> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa@huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> +  **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, i.e. relocations or
> +  relocation-induced changes.
> +
> +- The present document doesn't address such transient.
> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> +  require special attention.

What does this even mean?

> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> +  otherwise writable.

Again; what does this really want to say?

> +- The protection has page-level granularity.

I don't think Linux supports non-paging MMUs.

> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**

WTH is with all those ** ?

> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> +   Labelled as **const**, the data is never supposed to be altered.
> +   It is statically allocated - if it has any memory footprint at all.
> +   The compiler can even optimize it away, where possible, by replacing
> +   references to a **const** with its actual value.
> +
> +- **read only after init**
> +   By tagging an otherwise ordinary statically allocated variable with
> +   **__ro_after_init**, it is placed in a special segment that will
> +   become write protected, at the end of the kernel init phase.
> +   The compiler has no notion of this restriction and it will treat any
> +   write operation on such variable as legal. However, assignments that
> +   are attempted after the write protection is in place, will cause
> +   exceptions.
> +
> +- **write rare after init**
> +   This can be seen as variant of read only after init, which uses the
> +   tag **__wr_after_init**. It is also limited to statically allocated
> +   memory. It is still possible to alter this type of variables, after
> +   the kernel init phase is complete, however it can be done exclusively
> +   with special functions, instead of the assignment operator. Using the
> +   assignment operator after conclusion of the init phase will still
> +   trigger an exception. It is not possible to transition a certain
> +   variable from __wr_ater_init to a permanent read-only status, at
> +   runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> +   After defining a pool, memory can be obtained through it, primarily
> +   through the **pmalloc()** allocator. The exact writability state of the
> +   memory obtained from **pmalloc()** and friends can be configured when
> +   creating the pool. At any point it is possible to transition to a less
> +   permissive write status the memory currently associated to the pool.
> +   Once memory has become read-only, it the only valid operation, beside
> +   reading, is to released it, by destroying the pool it belongs to.

Can we ditch all the ** nonsense and put whitespace in there? More paragraphs
and whitespace are more good.

Also, I really don't like how you differentiate between static and
dynamic wr.

> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> + available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> +   When creating the pool, it is possible to choose one of the following:
> +    - **PMALLOC_MODE_RO**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_WR**
> +       - Writability at allocation time: *WRITABLE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_AUTO_RO**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *NONE*
> +       - Writability at protection time: *NONE*
> +    - **PMALLOC_MODE_AUTO_WR**
> +       - Writability at allocation time:
> +           - the latest allocation: *WRITABLE*
> +           - every other allocation: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*
> +    - **PMALLOC_MODE_START_WR**
> +       - Writability at allocation time: *WRITE-RARE*
> +       - Writability at protection time: *WRITE-RARE*

That's just unreadable gibberish from here. Also what?

We already have RO, why do you need more RO?

> +
> +   **Remarks:**
> +    - The "AUTO" modes perform automatic protection of the content, whenever
> +       the current vmap_area is used up and a new one is allocated.
> +        - At that point, the vmap_area being phased out is protected.
> +        - The size of the vmap_area depends on various parameters.
> +        - It might not be possible to know for sure *when* certain data will
> +          be protected.

Surely that is a problem?

> +        - The functionality is provided as tradeoff between hardening and speed.

Which you fail to explain.

> +        - Its usefulness depends on the specific use case at hand

How about you write sensible text inside the option descriptions
instead?

This is not a presentation; less bullets, more content.

> +- Not only the pmalloc memory must be protected, but also any reference to
> +  it that might become the target for an attack. The attack would replace
> +  a reference to the protected memory with a reference to some other,
> +  unprotected, memory.

I still don't really understand the whole write-rare thing; how does it
really help? If we can write in kernel memory, we can write to
page-tables too.

And I don't think this document even begins to explain _why_ you're
doing any of this. How does it help?

> +- The users of rare write must take care of ensuring the atomicity of the
> +  action, respect to the way they use the data being altered; for example,
> +  take a lock before making a copy of the value to modify (if it's
> +  relevant), then alter it, issue the call to rare write and finally
> +  release the lock. Some special scenario might be exempt from the need
> +  for locking, but in general rare-write must be treated as an operation
> +  that can incur into races.

What?!


* Re: [PATCH 12/17] prmem: linked list: set alignment
  2018-10-23 21:34 ` [PATCH 12/17] prmem: linked list: set alignment Igor Stoppa
@ 2018-10-26  9:31   ` Peter Zijlstra
  0 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-26  9:31 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Greg Kroah-Hartman, Andrew Morton, Masahiro Yamada,
	Alexey Dobriyan, Pekka Enberg, Paul E. McKenney, Lihao Liang,
	linux-kernel

On Wed, Oct 24, 2018 at 12:34:59AM +0300, Igor Stoppa wrote:
> As preparation to using write rare on the nodes of various types of
> lists, specify that the fields in the basic data structures must be
> aligned to sizeof(void *)
> 
> It is meant to ensure that any static allocation will not cross a page
> boundary, to allow pointers to be updated in one step.
> 
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Masahiro Yamada <yamada.masahiro@socionext.com>
> CC: Alexey Dobriyan <adobriyan@gmail.com>
> CC: Pekka Enberg <penberg@kernel.org>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Lihao Liang <lianglihao@huawei.com>
> CC: linux-kernel@vger.kernel.org
> ---
>  include/linux/types.h | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 9834e90aa010..53609bbdcf0f 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -183,17 +183,29 @@ typedef struct {
>  } atomic64_t;
>  #endif
>  
> +#ifdef CONFIG_PRMEM
>  struct list_head {
> -	struct list_head *next, *prev;
> -};
> +	struct list_head *next __aligned(sizeof(void *));
> +	struct list_head *prev __aligned(sizeof(void *));
> +} __aligned(sizeof(void *));
>  
> -struct hlist_head {
> -	struct hlist_node *first;
> +struct hlist_node {
> +	struct hlist_node *next __aligned(sizeof(void *));
> +	struct hlist_node **pprev __aligned(sizeof(void *));
> +} __aligned(sizeof(void *));

Argh.. are we really supporting platforms that do not naturally align
this? If so, which and can't we fix those?

Also, if you force alignment on a member, the structure as a whole
inherits the largest member alignment.

Also, you made something that was simple an unreadable mess without
proper justification (ie. you fail to show need).


* Re: [PATCH 13/17] prmem: linked list: disable layout randomization
  2018-10-23 21:35 ` [PATCH 13/17] prmem: linked list: disable layout randomization Igor Stoppa
  2018-10-24 13:43   ` Alexey Dobriyan
@ 2018-10-26  9:32   ` Peter Zijlstra
  2018-10-26 10:17     ` Matthew Wilcox
  1 sibling, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-26  9:32 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Greg Kroah-Hartman, Andrew Morton, Masahiro Yamada,
	Alexey Dobriyan, Pekka Enberg, Paul E. McKenney, Lihao Liang,
	linux-kernel

On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
> Some of the data structures used in list management are composed by two
> pointers. Since the kernel is now configured by default to randomize the
> layout of data structures solely composed of pointers, this might
> prevent correct type punning between these structures and their write
> rare counterpart.

'might' doesn't really work for me. Either it does or it does not.


* Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
  2018-10-23 21:35 ` [PATCH 14/17] prmem: llist, hlist, both plain and rcu Igor Stoppa
  2018-10-24 11:37   ` Mathieu Desnoyers
@ 2018-10-26  9:38   ` Peter Zijlstra
  1 sibling, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-26  9:38 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Thomas Gleixner, Kate Stewart, David S. Miller,
	Greg Kroah-Hartman, Philippe Ombredanne, Paul E. McKenney,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	linux-kernel

On Wed, Oct 24, 2018 at 12:35:01AM +0300, Igor Stoppa wrote:
> In some cases, all the data needing protection can be allocated from a pool
> in one go, as directly writable, then initialized and protected.
> The sequence is relatively short and it's acceptable to leave the entire
> data set unprotected.
> 
> In other cases, this is not possible, because the data will trickle over
> a relatively long period of time, in an unpredictable way, possibly for
> the entire duration of the operations.
> 
> For these cases, the safe approach is to have the memory already write
> protected, when allocated. However, this will require replacing any
> direct assignment with calls to functions that can perform write rare.
> 
> Since lists are one of the most commonly used data structures in the kernel,
> they are the first candidate for receiving write rare extensions.
> 
> This patch implements basic functionality for altering said lists.
> 
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Kate Stewart <kstewart@linuxfoundation.org>
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> CC: Philippe Ombredanne <pombredanne@nexb.com>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Josh Triplett <josh@joshtriplett.org>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: Lai Jiangshan <jiangshanlai@gmail.com>
> CC: linux-kernel@vger.kernel.org
> ---
>  MAINTAINERS            |   1 +
>  include/linux/prlist.h | 934 +++++++++++++++++++++++++++++++++++++++++

I'm not at all sure I understand the Changelog, or how it justifies
duplicating almost 1k lines of code.

Sure lists aren't the most complicated thing we have, but duplicating
that much is still very _very_ bad form. Why are we doing this?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 02/17] prmem: write rare for static allocation
  2018-10-23 21:34 ` [PATCH 02/17] prmem: write rare for static allocation Igor Stoppa
  2018-10-25  0:24   ` Dave Hansen
@ 2018-10-26  9:41   ` Peter Zijlstra
  2018-10-29 20:01     ` Igor Stoppa
  1 sibling, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-26  9:41 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

On Wed, Oct 24, 2018 at 12:34:49AM +0300, Igor Stoppa wrote:
> +static __always_inline

That's far too large for inline.

> +bool wr_memset(const void *dst, const int c, size_t n_bytes)
> +{
> +	size_t size;
> +	unsigned long flags;
> +	uintptr_t d = (uintptr_t)dst;
> +
> +	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
> +		return false;
> +	while (n_bytes) {
> +		struct page *page;
> +		uintptr_t base;
> +		uintptr_t offset;
> +		uintptr_t offset_complement;
> +
> +		local_irq_save(flags);
> +		page = virt_to_page(d);
> +		offset = d & ~PAGE_MASK;
> +		offset_complement = PAGE_SIZE - offset;
> +		size = min(n_bytes, offset_complement);
> +		base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> +		if (WARN(!base, WR_ERR_PAGE_MSG)) {
> +			local_irq_restore(flags);
> +			return false;
> +		}
> +		memset((void *)(base + offset), c, size);
> +		vunmap((void *)base);

BUG

> +		d += size;
> +		n_bytes -= size;
> +		local_irq_restore(flags);
> +	}
> +	return true;
> +}
> +
> +static __always_inline

Similarly large

> +bool wr_memcpy(const void *dst, const void *src, size_t n_bytes)
> +{
> +	size_t size;
> +	unsigned long flags;
> +	uintptr_t d = (uintptr_t)dst;
> +	uintptr_t s = (uintptr_t)src;
> +
> +	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
> +		return false;
> +	while (n_bytes) {
> +		struct page *page;
> +		uintptr_t base;
> +		uintptr_t offset;
> +		uintptr_t offset_complement;
> +
> +		local_irq_save(flags);
> +		page = virt_to_page(d);
> +		offset = d & ~PAGE_MASK;
> +		offset_complement = PAGE_SIZE - offset;
> +		size = (size_t)min(n_bytes, offset_complement);
> +		base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> +		if (WARN(!base, WR_ERR_PAGE_MSG)) {
> +			local_irq_restore(flags);
> +			return false;
> +		}
> +		__write_once_size((void *)(base + offset), (void *)s, size);
> +		vunmap((void *)base);

Similarly BUG.

> +		d += size;
> +		s += size;
> +		n_bytes -= size;
> +		local_irq_restore(flags);
> +	}
> +	return true;
> +}

> +static __always_inline

Guess what..

> +uintptr_t __wr_rcu_ptr(const void *dst_p_p, const void *src_p)
> +{
> +	unsigned long flags;
> +	struct page *page;
> +	void *base;
> +	uintptr_t offset;
> +	const size_t size = sizeof(void *);
> +
> +	if (WARN(!__is_wr_after_init(dst_p_p, size), WR_ERR_RANGE_MSG))
> +		return (uintptr_t)NULL;
> +	local_irq_save(flags);
> +	page = virt_to_page(dst_p_p);
> +	offset = (uintptr_t)dst_p_p & ~PAGE_MASK;
> +	base = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> +	if (WARN(!base, WR_ERR_PAGE_MSG)) {
> +		local_irq_restore(flags);
> +		return (uintptr_t)NULL;
> +	}
> +	rcu_assign_pointer((*(void **)(offset + (uintptr_t)base)), src_p);
> +	vunmap(base);

Also still bug.

> +	local_irq_restore(flags);
> +	return (uintptr_t)src_p;
> +}

Also, I see an amount of duplication here that shows you're not nearly
lazy enough.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 13/17] prmem: linked list: disable layout randomization
  2018-10-26  9:32   ` Peter Zijlstra
@ 2018-10-26 10:17     ` Matthew Wilcox
  2018-10-30 15:39       ` Peter Zijlstra
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-26 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Igor Stoppa, Mimi Zohar, Kees Cook, Dave Chinner, James Morris,
	Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Greg Kroah-Hartman, Andrew Morton, Masahiro Yamada,
	Alexey Dobriyan, Pekka Enberg, Paul E. McKenney, Lihao Liang,
	linux-kernel

On Fri, Oct 26, 2018 at 11:32:05AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
> > Some of the data structures used in list management are composed of two
> > pointers. Since the kernel is now configured by default to randomize the
> > layout of data structures solely composed of pointers, this might
> > prevent correct type punning between these structures and their write
> > rare counterpart.
> 
> 'might' doesn't really work for me. Either it does or it does not.

He means "Depending on the random number generator, the two pointers
might be AB or BA.  If they're of opposite polarity (50% of the time),
it _will_ break, and 50% of the time it _won't_ break."

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-26  9:26   ` Peter Zijlstra
@ 2018-10-26 10:20     ` Matthew Wilcox
  2018-10-29 19:28       ` Igor Stoppa
  2018-10-26 10:46     ` Kees Cook
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-26 10:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Igor Stoppa, Mimi Zohar, Kees Cook, Dave Chinner, James Morris,
	Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
	linux-kernel

On Fri, Oct 26, 2018 at 11:26:09AM +0200, Peter Zijlstra wrote:
> Jon,
> 
> So the below document is a prime example for why I think RST sucks. As a
> text document readability is greatly diminished by all the markup
> nonsense.
> 
> This stuff should not become write-only content like html and other
> gunk. The actual text file is still the primary means of reading this.

I think Igor neglected to read doc-guide/sphinx.rst:

Specific guidelines for the kernel documentation
------------------------------------------------

Here are some specific guidelines for the kernel documentation:

* Please don't go overboard with reStructuredText markup. Keep it
  simple. For the most part the documentation should be plain text with
  just enough consistency in formatting that it can be converted to
  other formats.

I agree that it's overly marked up.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-26  9:26   ` Peter Zijlstra
  2018-10-26 10:20     ` Matthew Wilcox
@ 2018-10-26 10:46     ` Kees Cook
  2018-10-28 18:31       ` Peter Zijlstra
  2018-10-26 11:09     ` Markus Heiser
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 140+ messages in thread
From: Kees Cook @ 2018-10-26 10:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Igor Stoppa, Mimi Zohar, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML

On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> I still don't really understand the whole write-rare thing; how does it
> really help? If we can write in kernel memory, we can write to
> page-tables too.

I can speak to the general goal. The specific implementation here may
not fit all the needs, but it's a good starting point for discussion.

One aspect of hardening the kernel against attack is reducing the
internal attack surface. Not all flaws are created equal, so there is
variation in what limitations an attacker may have when exploiting
flaws (not many flaws end up being a fully controlled "write anything,
anywhere, at any time"). By making the more sensitive data structures
of the kernel read-only, we reduce the risk of an attacker finding a
path to manipulating the kernel's behavior in a significant way.

Examples of typical sensitive targets are function pointers, security
policy, and page tables. Having these "read only at rest" makes them
much harder to control by an attacker using memory integrity flaws. We
already have the .rodata section for "const" things. Those are
trivially read-only. For example there is a long history of making
sure that function pointer tables are marked "const". However, some
things need to stay writable. Some of those in .data aren't changed
after __init, so we added the __ro_after_init which made them
read-only for the life of the system after __init. However, we're
still left with a lot of sensitive structures either in .data or
dynamically allocated, that need a way to be made read-only for most
of the time, but get written to during very specific times.

The "write rarely" name itself may not sufficiently describe what is
wanted either (I'll take the blame for the inaccurate name), so I'm
open to new ideas there. The implementation requirements for the
"sensitive data read-only at rest" feature are rather tricky:

- allow writes only from specific places in the kernel
- keep those locations inline to avoid making them trivial ROP targets
- keep the writeability window open only to a single uninterruptable CPU
- fast enough to deal with page table updates

The proposal I made a while back only covered .data things (and used
x86-specific features). Igor's proposal builds on this by including a
way to do this with dynamic allocation too, which greatly expands the
scope of structures that can be protected. Given that the x86-only
method of write-window creation was firmly rejected, this is a new
proposal for how to do it (vmap window). Using switch_mm() has also
been suggested, etc.

We need to find a good way to do the write-windowing that works well
for static and dynamic structures _and_ for the page tables... this
continues to be tricky.

Making it resilient against ROP-style targets makes it difficult to
deal with certain data structures (like list manipulation). In my
earlier RFC, I tried to provide enough examples of where this could
get used to let people see some of the complexity[1]. Igor's series
expands this to even more examples using dynamic allocation.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/write-rarely

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-26  9:26   ` Peter Zijlstra
  2018-10-26 10:20     ` Matthew Wilcox
  2018-10-26 10:46     ` Kees Cook
@ 2018-10-26 11:09     ` Markus Heiser
  2018-10-29 19:35       ` Igor Stoppa
  2018-10-26 15:05     ` Jonathan Corbet
  2018-10-29 20:35     ` Igor Stoppa
  4 siblings, 1 reply; 140+ messages in thread
From: Markus Heiser @ 2018-10-26 11:09 UTC (permalink / raw)
  To: Peter Zijlstra, Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
	linux-kernel

Am 26.10.18 um 11:26 schrieb Peter Zijlstra:
 >> Jon,
 >>
 >> So the below document is a prime example for why I think RST sucks. As a
 >> text document readability is greatly diminished by all the markup
 >> nonsense.
 >>
 >> This stuff should not become write-only content like html and other
 >> gunk. The actual text file is still the primary means of reading this.
 >
 > I think Igor neglected to read doc-guide/sphinx.rst:
 >
 > Specific guidelines for the kernel documentation
 > ------------------------------------------------
 >
 > Here are some specific guidelines for the kernel documentation:
 >
 > * Please don't go overboard with reStructuredText markup. Keep it
 >    simple. For the most part the documentation should be plain text with
 >    just enough consistency in formatting that it can be converted to
 >    other formats.
 >
 > I agree that it's overly marked up.

To add my two cent ..

 > WTH is with all those ** ?

I guess Igor was looking for a definition list ...

 >> +Available protections for kernel data
 >> +-------------------------------------
 >> +
 >> +- **constant**
 >> +   Labelled as **const**, the data is never supposed to be altered.
 >> +   It is statically allocated - if it has any memory footprint at all.
 >> +   The compiler can even optimize it away, where possible, by replacing
 >> +   references to a **const** with its actual value.


+Available protections for kernel data
+-------------------------------------
+
+constant
+   Labelled as const, the data is never supposed to be altered.
+   It is statically allocated - if it has any memory footprint at all.
+   The compiler can even optimize it away, where possible, by replacing
+   references to a const with its actual value.

see "Lists and Quote-like blocks" [1]

[1] 
http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#lists-and-quote-like-blocks

-- Markus --

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-26  9:26   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2018-10-26 11:09     ` Markus Heiser
@ 2018-10-26 15:05     ` Jonathan Corbet
  2018-10-29 19:38       ` Igor Stoppa
  2018-10-29 20:35     ` Igor Stoppa
  4 siblings, 1 reply; 140+ messages in thread
From: Jonathan Corbet @ 2018-10-26 15:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Laura Abbott,
	Randy Dunlap, Mike Rapoport, linux-doc, linux-kernel

On Fri, 26 Oct 2018 11:26:09 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Jon,
> 
> So the below document is a prime example for why I think RST sucks. As a
> text document readability is greatly diminished by all the markup
> nonsense.

Please don't confuse RST with the uses of it.  The fact that one can do
amusing things with the C preprocessor doesn't, on its own, make C
suck... 
> 
> This stuff should not become write-only content like html and other
> gunk. The actual text file is still the primary means of reading this.

I agree that the text file comes first, and I agree that the markup in
this particular document is excessive.

Igor, just like #ifdef's make code hard to read, going overboard with RST
markup makes text hard to read.  It's a common trap to fall into, because
it lets you make slick-looking web pages, but that is not our primary
goal here.  Many thanks for writing documentation for this feature, that
already puts you way ahead of a discouraging number of contributors.  But
could I ask you, please, to make a pass over it and reduce the markup to
a minimum?  Using lists as suggested by Markus would help here.

Thanks,

jon

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
  2018-10-24 14:03     ` Igor Stoppa
  2018-10-24 14:56       ` Tycho Andersen
@ 2018-10-28  9:52       ` Steven Rostedt
  2018-10-29 19:43         ` Igor Stoppa
  1 sibling, 1 reply; 140+ messages in thread
From: Steven Rostedt @ 2018-10-28  9:52 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mathieu Desnoyers, Mimi Zohar, Kees Cook, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
	linux-integrity, linux-security-module, igor stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Thomas Gleixner, Kate Stewart,
	David S. Miller, Greg Kroah-Hartman, Philippe Ombredanne,
	Paul E. McKenney, Josh Triplett, Lai Jiangshan, linux-kernel

On Wed, 24 Oct 2018 17:03:01 +0300
Igor Stoppa <igor.stoppa@gmail.com> wrote:

> I was hoping that by CC-ing kernel.org, the explicit recipients would 
> get both the mail directly intended for them (as subsystem 
> maintainers/supporters) and the rest.
> 
> The next time I'll err in the opposite direction.

Please don't.

> 
> In the meanwhile, please find the whole set here:
> 
> https://www.openwall.com/lists/kernel-hardening/2018/10/23/3

Note, it is critical that every change log stands on its own. You do
not need to Cc the entire patch set to everyone. Each patch should have
enough information in it to know exactly what the patch does.

It's OK if each change log has duplicate information from other
patches. The important part is that one should be able to look at the
change log of a specific patch and understand exactly what the patch is
doing.

This is because 5 years from now, if someone does a git blame and comes
up with this commit, they won't have access to the patch series. All
they will have is this single commit log to explain why these changes
were done.

If a change log depends on other commits for context, it is
insufficient.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-26 10:46     ` Kees Cook
@ 2018-10-28 18:31       ` Peter Zijlstra
  2018-10-29 21:04         ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-28 18:31 UTC (permalink / raw)
  To: Kees Cook
  Cc: Igor Stoppa, Mimi Zohar, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML

On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
> On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > I still don't really understand the whole write-rare thing; how does it
> > really help? If we can write in kernel memory, we can write to
> > page-tables too.

> One aspect of hardening the kernel against attack is reducing the
> internal attack surface. Not all flaws are created equal, so there is
> variation in what limitations an attacker may have when exploiting
> flaws (not many flaws end up being a fully controlled "write anything,
> anywhere, at any time"). By making the more sensitive data structures
> of the kernel read-only, we reduce the risk of an attacker finding a
> path to manipulating the kernel's behavior in a significant way.
> 
> Examples of typical sensitive targets are function pointers, security
> policy, and page tables. Having these "read only at rest" makes them
> much harder to control by an attacker using memory integrity flaws.

Because 'write-anywhere' exploits are easier than (and the typical first
step to) arbitrary code execution thingies?

> The "write rarely" name itself may not sufficiently describe what is
> wanted either (I'll take the blame for the inaccurate name), so I'm
> open to new ideas there. The implementation requirements for the
> "sensitive data read-only at rest" feature are rather tricky:
> 
> - allow writes only from specific places in the kernel
> - keep those locations inline to avoid making them trivial ROP targets
> - keep the writeability window open only to a single uninterruptable CPU

The current patch set does not achieve that because it uses a global
address space for the alias mapping (vmap) which is equally accessible
from all CPUs.

> - fast enough to deal with page table updates

The proposed implementation needs page-tables for the alias; I don't see
how you could ever do R/O page-tables when you need page-tables to
modify your page-tables.

And this is entirely irrespective of performance.

> The proposal I made a while back only covered .data things (and used
> x86-specific features).

Oh, right, that CR0.WP stuff.

> Igor's proposal builds on this by including a
> way to do this with dynamic allocation too, which greatly expands the
> scope of structures that can be protected. Given that the x86-only
> method of write-window creation was firmly rejected, this is a new
> proposal for how to do it (vmap window). Using switch_mm() has also
> been suggested, etc.

Right... /me goes find the patches we did for text_poke. Hmm, those
never seem to have made it:

  https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com

like that. That approach will in fact work and not be a completely
broken mess like this thing.

> We need to find a good way to do the write-windowing that works well
> for static and dynamic structures _and_ for the page tables... this
> continues to be tricky.
> 
> Making it resilient against ROP-style targets makes it difficult to
> deal with certain data structures (like list manipulation). In my
> earlier RFC, I tried to provide enough examples of where this could
> get used to let people see some of the complexity[1]. Igor's series
> expands this to even more examples using dynamic allocation.

Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
fast enough for much of anything.

And I don't suppose we can take the WP fault and then fix up from there,
because if we're doing R/O page-tables, that'll increase the fault depth
and we'll double fault all the time, and triple fault where we
currently double fault. And we all know how _awesome_ triple faults
are.

But duplicating (and wrapping in gunk) whole APIs is just not going to
work.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 09/17] prmem: hardened usercopy
  2018-10-23 21:34 ` [PATCH 09/17] prmem: hardened usercopy Igor Stoppa
@ 2018-10-29 11:45   ` Chris von Recklinghausen
  2018-10-29 18:24     ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Chris von Recklinghausen @ 2018-10-29 11:45 UTC (permalink / raw)
  To: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	linux-mm, linux-kernel

On 10/23/2018 05:34 PM, Igor Stoppa wrote:
> Prevent leaks of protected memory to userspace.
> The protection from overwrites from userspace is already available, once
> the memory is write protected.
>
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Kees Cook <keescook@chromium.org>
> CC: Chris von Recklinghausen <crecklin@redhat.com>
> CC: linux-mm@kvack.org
> CC: linux-kernel@vger.kernel.org
> ---
>  include/linux/prmem.h | 24 ++++++++++++++++++++++++
>  mm/usercopy.c         |  5 +++++
>  2 files changed, 29 insertions(+)
>
> diff --git a/include/linux/prmem.h b/include/linux/prmem.h
> index cf713fc1c8bb..919d853ddc15 100644
> --- a/include/linux/prmem.h
> +++ b/include/linux/prmem.h
> @@ -273,6 +273,30 @@ struct pmalloc_pool {
>  	uint8_t mode;
>  };
>  
> +void __noreturn usercopy_abort(const char *name, const char *detail,
> +			       bool to_user, unsigned long offset,
> +			       unsigned long len);
> +
> +/**
> + * check_pmalloc_object() - helper for hardened usercopy
> + * @ptr: the beginning of the memory to check
> + * @n: the size of the memory to check
> + * @to_user: copy to userspace or from userspace
> + *
> + * If the check is ok, it will fall-through, otherwise it will abort.
> + * The function is inlined, to minimize the performance impact of the
> + * extra check that can end up on a hot path.
> + * Non-exhaustive micro benchmarking with QEMU x86_64 shows a reduction of
> + * the time spent in this fragment by 60%, when inlined.
> + */
> +static inline
> +void check_pmalloc_object(const void *ptr, unsigned long n, bool to_user)
> +{
> +	if (unlikely(__is_wr_after_init(ptr, n) || __is_wr_pool(ptr, n)))
> +		usercopy_abort("pmalloc", "accessing pmalloc obj", to_user,
> +			       (const unsigned long)ptr, n);
> +}
> +
>  /*
>   * The write rare functionality is fully implemented as __always_inline,
>   * to prevent having an internal function call that is capable of modifying
> diff --git a/mm/usercopy.c b/mm/usercopy.c
> index 852eb4e53f06..a080dd37b684 100644
> --- a/mm/usercopy.c
> +++ b/mm/usercopy.c
> @@ -22,8 +22,10 @@
>  #include <linux/thread_info.h>
>  #include <linux/atomic.h>
>  #include <linux/jump_label.h>
> +#include <linux/prmem.h>
>  #include <asm/sections.h>
>  
> +
>  /*
>   * Checks if a given pointer and length is contained by the current
>   * stack frame (if possible).
> @@ -284,6 +286,9 @@ void __check_object_size(const void *ptr, unsigned long n, bool to_user)
>  
>  	/* Check for object in kernel to avoid text exposure. */
>  	check_kernel_text_object((const unsigned long)ptr, n, to_user);
> +
> +	/* Check if object is from a pmalloc chunk. */
> +	check_pmalloc_object(ptr, n, to_user);
>  }
>  EXPORT_SYMBOL(__check_object_size);
>  

Could you add code somewhere (lkdtm driver if possible) to demonstrate
the issue and verify the code change?

Thanks,

Chris


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 02/17] prmem: write rare for static allocation
  2018-10-25  0:24   ` Dave Hansen
@ 2018-10-29 18:03     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 18:03 UTC (permalink / raw)
  To: Dave Hansen, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

On 25/10/2018 01:24, Dave Hansen wrote:
>> +static __always_inline bool __is_wr_after_init(const void *ptr, size_t size)
>> +{
>> +	size_t start = (size_t)&__start_wr_after_init;
>> +	size_t end = (size_t)&__end_wr_after_init;
>> +	size_t low = (size_t)ptr;
>> +	size_t high = (size_t)ptr + size;
>> +
>> +	return likely(start <= low && low < high && high <= end);
>> +}
> 
> size_t is an odd type choice for doing address arithmetic.

it seemed more portable than unsigned long

>> +/**
>> + * wr_memset() - sets n bytes of the destination to the c value
>> + * @dst: beginning of the memory to write to
>> + * @c: byte to replicate
>> + * @size: amount of bytes to copy
>> + *
>> + * Returns true on success, false otherwise.
>> + */
>> +static __always_inline
>> +bool wr_memset(const void *dst, const int c, size_t n_bytes)
>> +{
>> +	size_t size;
>> +	unsigned long flags;
>> +	uintptr_t d = (uintptr_t)dst;
>> +
>> +	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
>> +		return false;
>> +	while (n_bytes) {
>> +		struct page *page;
>> +		uintptr_t base;
>> +		uintptr_t offset;
>> +		uintptr_t offset_complement;
> 
> Again, these are really odd choices for types.  vmap() returns a void*
> pointer, on which you can do arithmetic.  

I wasn't sure how much I could rely on the compiler not doing some 
unwanted optimizations.

> Why bother keeping another
> type to which you have to cast to and from?

For the above reason. If I'm worrying unnecessarily, I can switch back 
to void *
It certainly is easier to use.

> BTW, our usual "pointer stored in an integer type" is 'unsigned long',
> if a pointer needs to be manipulated.

yes, I noticed that, but it seemed strange ...
size_t corresponds to unsigned long, afaik

but it seems that I have not fully understood where to use it

anyway, I can stick to the convention with unsigned long

> 
>> +		local_irq_save(flags);
> 
> Why are you doing the local_irq_save()?

The idea was to avoid the case where an attack would somehow freeze the 
core doing the write-rare operation, while the temporary mapping is 
accessible.

I have seen comments about using mappings that are private to the 
current core (and I will reply to those comments as well), but this 
approach seems architecture-dependent, while I was looking for a 
solution that, albeit not 100% reliable, would work on any system with 
an MMU. This would not prevent each arch from coming up with its own 
custom implementation that provides better coverage, performance, etc.

>> +		page = virt_to_page(d);
>> +		offset = d & ~PAGE_MASK;
>> +		offset_complement = PAGE_SIZE - offset;
>> +		size = min(n_bytes, offset_complement);
>> +		base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> 
> Can you even call vmap() (which sleeps) with interrupts off?

I accidentally disabled "sleeping while atomic" debugging and I totally 
missed this problem :-(

However, to answer your question, nothing exploded while I was testing 
(without that type of debugging).

I suspect I was just "lucky". Or maybe I was simply not triggering the 
sleeping sub-case.

As I understood the code, sleeping _might_ happen, but it's not going to 
happen systematically.

I wonder if I could split vmap() into two parts: first the sleeping one, 
with interrupts enabled, then the non-sleeping one, with interrupts 
disabled.

I need to read the code more carefully, but it seems that sleeping might 
happen when memory for the mapping meta data is not immediately available.

BTW, wouldn't the might_sleep() call belong more to the part which 
really sleeps, rather than to the whole vmap() ?

>> +		if (WARN(!base, WR_ERR_PAGE_MSG)) {
>> +			local_irq_restore(flags);
>> +			return false;
>> +		}
> 
> You really need some kmap_atomic()-style accessors to wrap this stuff
> for you.  This little pattern is repeated over and over.

I really need to learn more about the way the kernel works and is 
structured. It's a work in progress. Thanks for the advice.

> ...
>> +const char WR_ERR_RANGE_MSG[] = "Write rare on invalid memory range.";
>> +const char WR_ERR_PAGE_MSG[] = "Failed to remap write rare page.";
> 
> Doesn't the compiler de-duplicate duplicated strings for you?  Is there
> any reason to declare these like this?

I noticed I have made some accidental modifications in a couple of 
cases, when replicating the command.

So I thought that if I really want to use the same string, why not do it 
explicitly? It also seemed easier, in case I want to tweak the 
message: I need to do it only in one place.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 03/17] prmem: vmalloc support for dynamic allocation
  2018-10-25  0:26   ` Dave Hansen
@ 2018-10-29 18:07     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 18:07 UTC (permalink / raw)
  To: Dave Hansen, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Andrew Morton, Chintan Pandya, Joe Perches, Luis R. Rodriguez,
	Thomas Gleixner, Kate Stewart, Greg Kroah-Hartman,
	Philippe Ombredanne, linux-mm, linux-kernel



On 25/10/2018 01:26, Dave Hansen wrote:
> On 10/23/18 2:34 PM, Igor Stoppa wrote:
>> +#define VM_PMALLOC		0x00000100	/* pmalloc area - see docs */
>> +#define VM_PMALLOC_WR		0x00000200	/* pmalloc write rare area */
>> +#define VM_PMALLOC_PROTECTED	0x00000400	/* pmalloc protected area */
> 
> Please introduce things as you use them.  It's impossible to review a
> patch that just says "see docs" that doesn't contain any docs. :)

Yes, otoh it's a big pain in the neck to keep the docs split into 
smaller patches interleaved with the code, at least while the code is 
still in flux.

And since the docs refer to the sources, for the automated documentation 
of the API, I cannot just put the documentation at the beginning of the 
patchset.

Can I keep the docs as they are, for now, till the code is more stable?

--
igor


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 05/17] prmem: shorthands for write rare on common types
  2018-10-25  0:28   ` Dave Hansen
@ 2018-10-29 18:12     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 18:12 UTC (permalink / raw)
  To: Dave Hansen, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel



On 25/10/2018 01:28, Dave Hansen wrote:
> On 10/23/18 2:34 PM, Igor Stoppa wrote:
>> Wrappers around the basic write rare functionality, addressing several
>> common data types found in the kernel, allowing to specify the new
>> values through immediates, like constants and defines.
> 
> I have to wonder whether this is the right way, or whether a
> size-agnostic function like put_user() is the right way.  put_user() is
> certainly easier to use.

I definitely did not like it either.
But it was the best that came to my mind ...
The main purpose of this code was to show what I wanted to do.
Once more, thanks for pointing out a better way to do it.

--
igor


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 06/17] prmem: test cases for memory protection
  2018-10-25 16:43   ` Dave Hansen
@ 2018-10-29 18:16     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 18:16 UTC (permalink / raw)
  To: Dave Hansen, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel



On 25/10/2018 17:43, Dave Hansen wrote:
>> +static bool is_address_protected(void *p)
>> +{
>> +	struct page *page;
>> +	struct vmap_area *area;
>> +
>> +	if (unlikely(!is_vmalloc_addr(p)))
>> +		return false;
>> +	page = vmalloc_to_page(p);
>> +	if (unlikely(!page))
>> +		return false;
>> +	wmb(); /* Flush changes to the page table - is it needed? */
> 
> No.

ok

> The rest of this is just pretty verbose and seems to have been very
> heavily copied and pasted.  I guess that's OK for test code, though.

I was tempted to play with macros, as templates to generate tests on the 
fly, according to the type being passed.

But I was afraid it might generate an even stronger rejection than the 
rest of the patchset already has.

Would it be acceptable/preferable?

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 08/17] prmem: struct page: track vmap_area
  2018-10-25  2:13       ` Matthew Wilcox
@ 2018-10-29 18:21         ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 18:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mimi Zohar, Kees Cook, Dave Chinner, James Morris, Michal Hocko,
	kernel-hardening, linux-integrity, linux-security-module,
	igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel

On 25/10/2018 03:13, Matthew Wilcox wrote:
> On Thu, Oct 25, 2018 at 02:01:02AM +0300, Igor Stoppa wrote:
>>>> @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>>>>    	if (!addr)
>>>>    		return NULL;
>>>> +	va = __find_vmap_area((unsigned long)addr);
>>>> +	for (i = 0; i < va->vm->nr_pages; i++)
>>>> +		va->vm->pages[i]->area = va;
>>>
>>> I don't like it that you're calling this for _every_ vmalloc() caller
>>> when most of them will never use this.  Perhaps have page->va be initially
>>> NULL and then cache the lookup in it when it's accessed for the first time.
>>>
>>
>> If __find_vmap_area() was part of the API, this loop could be left out from
>> __vmalloc_node_range() and the user of the allocation could initialize the
>> field, if needed.
>>
>> What is the reason for keeping __find_vmap_area() private?
> 
> Well, for one, you're walking the rbtree without holding the spinlock,
> so you're going to get crashes.  I don't see why we shouldn't export
> find_vmap_area() though.

Argh, yes, sorry. But find_vmap_area() would be enough for what I need.

> Another way we could approach this is to embed the vmap_area in the
> vm_struct.  It'd require a bit of juggling of the alloc/free paths in
> vmalloc, but it might be worthwhile.

I have a feeling of unease about the whole vmap_area / vm_struct 
duality. They clearly are different types, with different purposes, but 
here and there, there are functions named after some "area" which 
actually take vm_struct pointers.

I wouldn't mind spending some time understanding the reason for this 
apparently bizarre choice, but after I have emerged from the prmem swamp.

For now I'd stick to find_vmap_area().

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 09/17] prmem: hardened usercopy
  2018-10-29 11:45   ` Chris von Recklinghausen
@ 2018-10-29 18:24     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 18:24 UTC (permalink / raw)
  To: crecklin, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module
  Cc: igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	linux-mm, linux-kernel



On 29/10/2018 11:45, Chris von Recklinghausen wrote:

[...]

> Could you add code somewhere (lkdtm driver if possible) to demonstrate
> the issue and verify the code change?

Sure.

Eventually, I'd like to add test cases for each piece of functionality.
I didn't do it right away for those parts which either are not 
immediately needed for the main functionality, or which I'm still not 
confident enough won't change radically.

--

igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-24 23:04   ` Mike Rapoport
@ 2018-10-29 19:05     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 19:05 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
	linux-kernel

Hello,

On 25/10/2018 00:04, Mike Rapoport wrote:

> I feel that foreword should include a sentence or two saying why we need
> the memory protection and when it can/should be used.

I was somewhat lost about what the content of this sort of document 
should actually be.

In past reviews of older versions of the docs, I got the impression 
that it should focus more on the API and how to use it.

And it seems that different people expect to find different types of 
information here.

What would be the best place to put a more extensive description of 
the feature?

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-26 10:20     ` Matthew Wilcox
@ 2018-10-29 19:28       ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 19:28 UTC (permalink / raw)
  To: Matthew Wilcox, Peter Zijlstra
  Cc: Mimi Zohar, Kees Cook, Dave Chinner, James Morris, Michal Hocko,
	kernel-hardening, linux-integrity, linux-security-module,
	igor.stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, linux-doc, linux-kernel



On 26/10/2018 11:20, Matthew Wilcox wrote:
> On Fri, Oct 26, 2018 at 11:26:09AM +0200, Peter Zijlstra wrote:
>> Jon,
>>
>> So the below document is a prime example for why I think RST sucks. As a
>> text document readability is greatly diminished by all the markup
>> nonsense.
>>
>> This stuff should not become write-only content like html and other
>> gunk. The actual text file is still the primary means of reading this.
> 
> I think Igor neglected to read doc-guide/sphinx.rst:

Guilty as charged :-/

> Specific guidelines for the kernel documentation
> ------------------------------------------------
> 
> Here are some specific guidelines for the kernel documentation:
> 
> * Please don't go overboard with reStructuredText markup. Keep it
>    simple. For the most part the documentation should be plain text with
>    just enough consistency in formatting that it can be converted to
>    other formats.

> I agree that it's overly marked up.

I'll fix it

--
igor




^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-26 11:09     ` Markus Heiser
@ 2018-10-29 19:35       ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 19:35 UTC (permalink / raw)
  To: Markus Heiser, Peter Zijlstra
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
	linux-kernel

On 26/10/2018 12:09, Markus Heiser wrote:

> I guess Igor was looking for a definition list ...

It was meant to be bold.
Even after reading the guideline "keep it simple", what exactly is 
simple can be subjective.

If certain rst constructs are not acceptable and can be detected 
unequivocally, maybe it would help to teach checkpatch.pl to flag them.

[...]

> see "Lists and Quote-like blocks" [1]
> 
> [1] 
> http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#lists-and-quote-like-blocks 

thank you

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-26 15:05     ` Jonathan Corbet
@ 2018-10-29 19:38       ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 19:38 UTC (permalink / raw)
  To: Jonathan Corbet, Peter Zijlstra
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Laura Abbott,
	Randy Dunlap, Mike Rapoport, linux-doc, linux-kernel



On 26/10/2018 16:05, Jonathan Corbet wrote:
> But
> could I ask you, please, to make a pass over it and reduce the markup to
> a minimum?  Using lists as suggested by Markus would help here.

Sure, it's even easier for me to maintain the doc :-)
As I just wrote in a related reply, it might be worth having checkpatch 
detect formatting constructs that are not welcome.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 13/17] prmem: linked list: disable layout randomization
  2018-10-24 13:43   ` Alexey Dobriyan
@ 2018-10-29 19:40     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 19:40 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Greg Kroah-Hartman, Andrew Morton, Masahiro Yamada,
	Pekka Enberg, Paul E. McKenney, Lihao Liang, linux-kernel



On 24/10/2018 14:43, Alexey Dobriyan wrote:
> On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
>> Some of the data structures used in list management are composed by two
>> pointers. Since the kernel is now configured by default to randomize the
>> layout of data structures solely composed of pointers,
> 
> Isn't this true for function pointers?

Yes, you are right.
Thanks for pointing this out.

I can drop this patch.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 14/17] prmem: llist, hlist, both plain and rcu
  2018-10-28  9:52       ` Steven Rostedt
@ 2018-10-29 19:43         ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 19:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Mimi Zohar, Kees Cook, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, kernel-hardening,
	linux-integrity, linux-security-module, igor stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Thomas Gleixner, Kate Stewart,
	David S. Miller, Greg Kroah-Hartman, Philippe Ombredanne,
	Paul E. McKenney, Josh Triplett, Lai Jiangshan, linux-kernel

On 28/10/2018 09:52, Steven Rostedt wrote:

> If a change log depends on other commits for context, it is
> insufficient.

ok, I will adjust the change logs accordingly

--
thanks, igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [RFC v1 PATCH 00/17] prmem: protected memory
  2018-10-24 23:03 ` [RFC v1 PATCH 00/17] prmem: protected memory Dave Chinner
@ 2018-10-29 19:47   ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 19:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, James Morris,
	Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott



On 25/10/2018 00:03, Dave Chinner wrote:
> On Wed, Oct 24, 2018 at 12:34:47AM +0300, Igor Stoppa wrote:
>> -- Summary --
>>
>> Preliminary version of memory protection patchset, including a sample use
>> case, turning into write-rare the IMA measurement list.
> 
> I haven't looked at the patches yet, but I see a significant issue
> from the subject lines. "prmem" is very similar to "pmem"
> (persistent memory) and that's going to cause confusion. Especially
> if people start talking about prmem and pmem in the context of write
> protect pmem with prmem...
> 
> Naming is hard :/

yes, at some point I had to go from rare-write to write-rare because I 
realized that the acronym "rw" was likely to be interpreted as "read/write"

I propose to keep prmem for the time being, just to avoid adding extra 
changes that are not functional.
Once the code has somewhat settled down, I can proceed with the 
renaming. In the meanwhile, a better name could be discussed.

For example, would "prmemory" be sufficiently unambiguous?

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 02/17] prmem: write rare for static allocation
  2018-10-26  9:41   ` Peter Zijlstra
@ 2018-10-29 20:01     ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 20:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Pavel Tatashin, linux-mm, linux-kernel



On 26/10/2018 10:41, Peter Zijlstra wrote:
> On Wed, Oct 24, 2018 at 12:34:49AM +0300, Igor Stoppa wrote:
>> +static __always_inline
> 
> That's far too large for inline.

The reason for it is that it's supposed to minimize the presence of 
gadgets that might be used in JOP attacks.
I am ready to stand corrected, if I'm wrong, but this is the reason why 
I did it.

Regarding the function being too large, yes, I would not normally choose 
it for inlining.

Actually, I would not normally use "__always_inline" and instead I would 
limit myself to plain "inline", at most.

> 
>> +bool wr_memset(const void *dst, const int c, size_t n_bytes)
>> +{
>> +	size_t size;
>> +	unsigned long flags;
>> +	uintptr_t d = (uintptr_t)dst;
>> +
>> +	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
>> +		return false;
>> +	while (n_bytes) {
>> +		struct page *page;
>> +		uintptr_t base;
>> +		uintptr_t offset;
>> +		uintptr_t offset_complement;
>> +
>> +		local_irq_save(flags);
>> +		page = virt_to_page(d);
>> +		offset = d & ~PAGE_MASK;
>> +		offset_complement = PAGE_SIZE - offset;
>> +		size = min(n_bytes, offset_complement);
>> +		base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
>> +		if (WARN(!base, WR_ERR_PAGE_MSG)) {
>> +			local_irq_restore(flags);
>> +			return false;
>> +		}
>> +		memset((void *)(base + offset), c, size);
>> +		vunmap((void *)base);
> 
> BUG

yes, somehow I managed to drop this debug configuration from the debug 
builds I made.

[...]

> Also, I see an amount of duplication here that shows you're not nearly
> lazy enough.

I did notice a certain amount of duplication, but I didn't know how to 
factor it out.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-26  9:26   ` Peter Zijlstra
                       ` (3 preceding siblings ...)
  2018-10-26 15:05     ` Jonathan Corbet
@ 2018-10-29 20:35     ` Igor Stoppa
  4 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 20:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport, linux-doc,
	linux-kernel

On 26/10/2018 10:26, Peter Zijlstra wrote:

>> +- Kernel code is protected at system level and, unlike data, it doesn't
>> +  require special attention.
> 
> What does this even mean?

I was trying to convey the notion that the pages containing kernel code 
do not require any special handling by the author of a generic kernel 
component, for example a kernel driver.

Pages containing either statically or dynamically allocated data, 
instead, are not automatically protected.

But yes, the sentence is far from being clear.

> 
>> +Protection mechanism
>> +--------------------
>> +
>> +- When available, the MMU can write protect memory pages that would be
>> +  otherwise writable.
> 
> Again; what does this really want to say?

That it is possible to use the MMU also for write-protecting pages 
containing data which was not declared as constant.
> 
>> +- The protection has page-level granularity.
> 
> I don't think Linux supports non-paging MMUs.

This probably came from a model I had in mind where a separate 
execution environment, such as a hypervisor, could trap and filter 
writes to certain parts of a page, rejecting some and performing 
others, effectively emulating sub-page granularity.

> 
>> +- An attempt to overwrite a protected page will trigger an exception.
>> +- **Write protected data must go exclusively to write protected pages**
>> +- **Writable data must go exclusively to writable pages**
> 
> WTH is with all those ** ?

[...]

> Can we ditch all the ** nonsense and put whitespace in there? More paragraphs
> and whitespace are more good.

Yes

> Also, I really don't like how you differentiate between static and
> dynamic wr.

ok, but why? what would you suggest, instead?

[...]

> We already have RO, why do you need more RO?

I can explain, but I'm at a loss about the best place: I was under the 
impression that this sort of document should focus mostly on the API 
and its use. I was even considering removing most of the explanation 
and putting it in a separate document.

> 
>> +
>> +   **Remarks:**
>> +    - The "AUTO" modes perform automatic protection of the content, whenever
>> +       the current vmap_area is used up and a new one is allocated.
>> +        - At that point, the vmap_area being phased out is protected.
>> +        - The size of the vmap_area depends on various parameters.
>> +        - It might not be possible to know for sure *when* certain data will
>> +          be protected.
> 
> Surely that is a problem?
> 
>> +        - The functionality is provided as tradeoff between hardening and speed.
> 
> Which you fail to explain.
> 
>> +        - Its usefulness depends on the specific use case at hand
> 
> How about you write sensible text inside the option descriptions
> instead?
> 
> This is not a presentation; less bullets, more content.

I tried to say something, without saying too much, but it was clearly a 
bad choice.

Where should i put a thoroughly detailed explanation?
Here or in a separate document?

> 
>> +- Not only the pmalloc memory must be protected, but also any reference to
>> +  it that might become the target for an attack. The attack would replace
>> +  a reference to the protected memory with a reference to some other,
>> +  unprotected, memory.
> 
> I still don't really understand the whole write-rare thing; how does it
> really help? If we can write in kernel memory, we can write to
> page-tables too.

Kees has already answered this, but I'll also provide my own answer:

the exploits used to write to kernel memory are specific to certain 
products and SW builds, so it's not possible to generalize too much. 
However, there might be some limitation in the reach of a specific 
vulnerability. For example, if a memory location is referred to as an 
offset from a base address, the maximum size of the offset limits the 
scope of the attack. That might make it impossible to use that specific 
vulnerability for writing directly to the page table, while something 
else, say a statically allocated variable, might still be within reach.

That said, there is also the almost orthogonal use case of providing 
increased robustness by trapping accidental modifications caused by 
bugs.

> And I don't think this document even begins to explain _why_ you're
> doing any of this. How does it help?

ok, point taken

>> +- The users of rare write must take care of ensuring the atomicity of the
>> +  action, respect to the way they use the data being altered; for example,
>> +  take a lock before making a copy of the value to modify (if it's
>> +  relevant), then alter it, issue the call to rare write and finally
>> +  release the lock. Some special scenario might be exempt from the need
>> +  for locking, but in general rare-write must be treated as an operation
>> +  that can incur into races.
> 
> What?!

Probably something along the lines of:

"users of write-rare are responsible of using mechanisms that allow 
reading/writing data in a consistent way"

and if it seems obvious, I can just drop it.

One of the problems I have faced is deciding what level of knowledge 
or understanding I should expect from the reader of such a document: 
what I can take for granted and what I should explain.

--
igor





^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-28 18:31       ` Peter Zijlstra
@ 2018-10-29 21:04         ` Igor Stoppa
  2018-10-30 15:26           ` Peter Zijlstra
  0 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 21:04 UTC (permalink / raw)
  To: Peter Zijlstra, Kees Cook
  Cc: Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
	Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML


On 28/10/2018 18:31, Peter Zijlstra wrote:
> On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
>> On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> I still don't really understand the whole write-rare thing; how does it
>>> really help? If we can write in kernel memory, we can write to
>>> page-tables too.
> 
>> One aspect of hardening the kernel against attack is reducing the
>> internal attack surface. Not all flaws are created equal, so there is
>> variation in what limitations an attacker may have when exploiting
>> flaws (not many flaws end up being a fully controlled "write anything,
>> anywhere, at any time"). By making the more sensitive data structures
>> of the kernel read-only, we reduce the risk of an attacker finding a
>> path to manipulating the kernel's behavior in a significant way.
>>
>> Examples of typical sensitive targets are function pointers, security
>> policy, and page tables. Having these "read only at rest" makes them
>> much harder to control by an attacker using memory integrity flaws.
> 
> Because 'write-anywhere' exploits are easier than (and the typical first
> step to) arbitrary code execution thingies?
> 
>> The "write rarely" name itself may not sufficiently describe what is
>> wanted either (I'll take the blame for the inaccurate name), so I'm
>> open to new ideas there. The implementation requirements for the
>> "sensitive data read-only at rest" feature are rather tricky:
>>
>> - allow writes only from specific places in the kernel
>> - keep those locations inline to avoid making them trivial ROP targets
>> - keep the writeability window open only to a single uninterruptable CPU
> 
> The current patch set does not achieve that because it uses a global
> address space for the alias mapping (vmap) which is equally accessible
> from all CPUs.

I never claimed to achieve 100% resilience to attacks.
While it's true that the address space is accessible to all the CPUs, it 
is also true that the access has a limited duration in time, and the 
location where the access can be performed is not fixed.

Iow, assuming that the CPUs not involved in the write-rare operations 
are compromised and that they are trying to perform a concurrent access 
to the data in the writable page, they have a limited window of opportunity.

That said, what I have posted is just a tentative implementation.
My primary intent was to at least give an idea of what I'd like to do: 
alter some data in a way that is not easily exploitable.

>> - fast enough to deal with page table updates
> 
> The proposed implementation needs page-tables for the alias; I don't see
> how you could ever do R/O page-tables when you need page-tables to
> modify your page-tables.

It's not all-or-nothing.
I hope we agree at least on the reasoning that having only a limited 
amount of address space directly attackable, instead of the whole set 
of pages containing exploitable data, reduces the attack surface.

Furthermore, if we think about possible limitations that the attack 
might have (maximum reach), the level of protection might be even 
higher. I have to use "might" because I cannot foresee the vulnerability.

Furthermore, taking a different angle: your average attacker is not 
necessarily very social and inclined to share the vulnerabilities found.
It is safe to assume that, in most cases, each attacker has to identify 
the attack strategy autonomously.
Reducing the number of individuals who can perform an attack, by 
increasing the expertise required, is also a way of doing damage control.

> And this is entirely irrespective of performance.

I have not completely given up on performance, but, being write-rare, I 
see improved performance as just a way of widening the range of possible 
recipients for the hardening.

>> The proposal I made a while back only covered .data things (and used
>> x86-specific features).
> 
> Oh, right, that CR0.WP stuff.
> 
>> Igor's proposal builds on this by including a
>> way to do this with dynamic allocation too, which greatly expands the
>> scope of structures that can be protected. Given that the x86-only
>> method of write-window creation was firmly rejected, this is a new
>> proposal for how to do it (vmap window). Using switch_mm() has also
>> been suggested, etc.
> 
> Right... /me goes find the patches we did for text_poke. Hmm, those
> never seem to have made it:
> 
>    https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com
> 
> like that. That approach will in fact work and not be a completely
> broken mess like this thing.

That approach is x86 specific.
I preferred to start with something that would work on a broader set of 
architectures, even if with less resilience.

It can be done so that individual architectures can implement their own 
specific way and obtain better results.

>> We need to find a good way to do the write-windowing that works well
>> for static and dynamic structures _and_ for the page tables... this
>> continues to be tricky.
>>
>> Making it resilient against ROP-style targets makes it difficult to
>> deal with certain data structures (like list manipulation). In my
>> earlier RFC, I tried to provide enough examples of where this could
>> get used to let people see some of the complexity[1]. Igor's series
>> expands this to even more examples using dynamic allocation.
> 
> Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
> fast enough for much of anything.

Part of the reason for the duplication is that, even if the wr_API I 
came up with can be collapsed with the regular API, the implementation 
needs to be different, to be faster.

Ex:
* force the list_head structure to be aligned so that it will always be 
fully contained by a single page
* create alternate mapping for all the pages involved (max 1 page for 
each list_head)
* do the list operation on the remapped list_heads
* destroy all the mappings

I can try to introduce some wrapper similar to kmap_atomic(), as 
suggested by Dave Hansen, which can improve the coding, but it will not 
change the actual set of operations performed.


> And I don't suppose we can take the WP fault and then fix up from there,
> because if we're doing R/O page-tables, that'll incrase the fault depth
> and we'll double fault all the time, and tripple fault where we
> currently double fault. And we all know how _awesome_ tripple faults
> are.
> 
> But duplicating (and wrapping in gunk) whole APIs is just not going to
> work.

Would something like kmap_atomic() be acceptable?
Do you have some better proposal, now that (I hope) it should be more 
clear what I'm trying to do and why?

--
igor


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 16/17] prmem: pratomic-long
  2018-10-25  0:13   ` Peter Zijlstra
@ 2018-10-29 21:17     ` Igor Stoppa
  2018-10-30 15:58       ` Peter Zijlstra
  0 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-29 21:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Will Deacon, Boqun Feng, Arnd Bergmann, linux-arch,
	linux-kernel



On 25/10/2018 01:13, Peter Zijlstra wrote:
> On Wed, Oct 24, 2018 at 12:35:03AM +0300, Igor Stoppa wrote:
>> +static __always_inline
>> +bool __pratomic_long_op(bool inc, struct pratomic_long_t *l)
>> +{
>> +	struct page *page;
>> +	uintptr_t base;
>> +	uintptr_t offset;
>> +	unsigned long flags;
>> +	size_t size = sizeof(*l);
>> +	bool is_virt = __is_wr_after_init(l, size);
>> +
>> +	if (WARN(!(is_virt || likely(__is_wr_pool(l, size))),
>> +		 WR_ERR_RANGE_MSG))
>> +		return false;
>> +	local_irq_save(flags);
>> +	if (is_virt)
>> +		page = virt_to_page(l);
>> +	else
>> +		vmalloc_to_page(l);
>> +	offset = (~PAGE_MASK) & (uintptr_t)l;
>> +	base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
>> +	if (WARN(!base, WR_ERR_PAGE_MSG)) {
>> +		local_irq_restore(flags);
>> +		return false;
>> +	}
>> +	if (inc)
>> +		atomic_long_inc((atomic_long_t *)(base + offset));
>> +	else
>> +		atomic_long_dec((atomic_long_t *)(base + offset));
>> +	vunmap((void *)base);
>> +	local_irq_restore(flags);
>> +	return true;
>> +
>> +}
> 
> That's just hideously nasty.. and horribly broken.
> 
> We're not going to duplicate all these kernel interfaces wrapped in gunk
> like that. 

One possibility would be to have macros which use typeof() on the 
parameter being passed to decide which implementation to use: regular or 
write-rare.

This means that type punning would still be needed, to select the 
implementation.

Would this be enough? Is there some better way?

> Also, you _cannot_ call vunmap() with IRQs disabled. Clearly
> you've never tested this with debug bits enabled.

I thought I had them. And I _did_ have them enabled, at some point.
But I must have messed up with the configuration and I failed to notice 
this.

I can think of a way it might work, although it's not going to be very pretty:

* for the vmap(): if I understand correctly, it might sleep while 
obtaining memory for creating the mapping. This part could be executed 
before disabling interrupts. The rest of the function, instead, would be 
executed after interrupts are disabled.

* for vunmap(): after the writing is done, also change the alternate 
mapping to read-only, then enable interrupts and destroy the alternate 
mapping. Making the secondary mapping read-only as well makes it as 
secure as the primary, which means it can stay visible even with 
interrupts enabled.


--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-29 21:04         ` Igor Stoppa
@ 2018-10-30 15:26           ` Peter Zijlstra
  2018-10-30 16:37             ` Kees Cook
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-30 15:26 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Kees Cook, Mimi Zohar, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Andy Lutomirski, Thomas Gleixner

On Mon, Oct 29, 2018 at 11:04:01PM +0200, Igor Stoppa wrote:
> On 28/10/2018 18:31, Peter Zijlstra wrote:
> > On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
> > > On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > I still don't really understand the whole write-rare thing; how does it
> > > > really help? If we can write in kernel memory, we can write to
> > > > page-tables too.
> > 
> > > One aspect of hardening the kernel against attack is reducing the
> > > internal attack surface. Not all flaws are created equal, so there is
> > > variation in what limitations an attacker may have when exploiting
> > > flaws (not many flaws end up being a fully controlled "write anything,
> > > anywhere, at any time"). By making the more sensitive data structures
> > > of the kernel read-only, we reduce the risk of an attacker finding a
> > > path to manipulating the kernel's behavior in a significant way.
> > > 
> > > Examples of typical sensitive targets are function pointers, security
> > > policy, and page tables. Having these "read only at rest" makes them
> > > much harder to control by an attacker using memory integrity flaws.
> > 
> > Because 'write-anywhere' exploits are easier than (and the typical first
> > step to) arbitrary code execution thingies?
> > 
> > > The "write rarely" name itself may not sufficiently describe what is
> > > wanted either (I'll take the blame for the inaccurate name), so I'm
> > > open to new ideas there. The implementation requirements for the
> > > "sensitive data read-only at rest" feature are rather tricky:
> > > 
> > > - allow writes only from specific places in the kernel
> > > - keep those locations inline to avoid making them trivial ROP targets
> > > - keep the writeability window open only to a single uninterruptable CPU
> > 
> > The current patch set does not achieve that because it uses a global
> > address space for the alias mapping (vmap) which is equally accessible
> > from all CPUs.
> 
> I never claimed to achieve 100% resilience to attacks.

You never even began explaining what you were defending against. Let
alone how well it achieved its stated goals.

> While it's true that the address space is accessible to all the CPUs, it is
> also true that the access has a limited duration in time, and the location
> where the access can be performed is not fixed.
> 
> Iow, assuming that the CPUs not involved in the write-rare operations are
> compromised and that they are trying to perform a concurrent access to the
> data in the writable page, they have a limited window of opportunity.

That sounds like security through obscurity. Sure it makes it harder,
but it doesn't stop anything.

> That said, what I have posted is just a tentative implementation.
> My primary intent was to at least give an idea of what I'd like to do: alter
> some data in a way that is not easily exploitable.

Since you need to modify page-tables in order to achieve this, the
page-tables are also there for the attacker to write to.

> > > - fast enough to deal with page table updates
> > 
> > The proposed implementation needs page-tables for the alias; I don't see
> > how you could ever do R/O page-tables when you need page-tables to
> > modify your page-tables.
> 
> It's not all-or-nothing.

I really am still struggling with what this thing is supposed to
do. As it stands, I see very little value. Yes it makes a few things a
little harder, but it doesn't really do away with things.

> I hope we agree at least on the reasoning that having only a limited amount
> of address space directly attackable, instead of the whole set of pages
> containing exploitable data, reduces the attack surface.

I do not in fact agree. Most pages are not interesting for an attack at
all. So simply reducing the set of pages you can write to isn't
sufficient.

IOW, removing all noninteresting pages from the writable set will
satisfy your criteria, but it avoids exactly 0 exploits.

What you need to argue is that we remove common exploit targets, and
you've utterly failed to do so.

Kees mentioned: function pointers, page-tables and a few other things.
You mentioned nothing.

> Furthermore, if we think about possible limitations that the attack might
> have (maximum reach), the level of protection might be even higher. I have
> to use "might" because I cannot foresee the vulnerability.

That's just a bunch of words together that don't really say anything
afaict.

> Furthermore, taking a different angle: your average attacker is not
> necessarily very social and inclined to share the vulnerability found.
> It is safe to assume that in most cases each attacker has to identify the
> attack strategy autonomously.
> Reducing the number of individuals who can perform an attack, by increasing
> the expertise required, is also a way of doing damage control.

If you make RO all noninteresting pages, you in fact increase the
density of interesting targets and make things easier.

> > And this is entirely irrespective of performance.
> 
> I have not completely given up on performance, but, being write-rare, I see
> improved performance as just a way of widening the range of possible
> recipients for the hardening.

What?! Are you saying you don't care about performance?

> > Right... /me goes find the patches we did for text_poke. Hmm, those
> > never seem to have made it:
> > 
> >    https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com
> > 
> > like that. That approach will in fact work and not be a completely
> > broken mess like this thing.
> 
> That approach is x86 specific.

It is not. Every architecture does switch_mm() with IRQs disabled, as
that is exactly what the scheduler does.

And keeping a second init_mm around also isn't very architecture
specific I feel.

> > > We need to find a good way to do the write-windowing that works well
> > > for static and dynamic structures _and_ for the page tables... this
> > > continues to be tricky.
> > > 
> > > Making it resilient against ROP-style targets makes it difficult to
> > > deal with certain data structures (like list manipulation). In my
> > > earlier RFC, I tried to provide enough examples of where this could
> > > get used to let people see some of the complexity[1]. Igor's series
> > > expands this to even more examples using dynamic allocation.
> > 
> > Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
> > fast enough for much of anything.
> 
> Part of the reason for the duplication is that, even if the wr_API I came up
> with can be collapsed with the regular API, the implementation needs to be
> different, in order to be faster.
> 
> Ex:
> * force the list_head structure to be aligned so that it will always be
> fully contained by a single page
> * create alternate mapping for all the pages involved (max 1 page for each
> list_head)
> * do the list operation on the remapped list_heads
> * destroy all the mappings

Those constraints are due to single page aliases.

> I can try to introduce some wrapper similar to kmap_atomic(), as suggested
> by Dave Hansen, which can improve the coding, but it will not change the
> actual set of operations performed.

See below, I don't think kmap_atomic() is either correct or workable.

One thing that might be interesting is teaching objtool about the
write-enable write-disable markers and making it guarantee we reach a
disable after every enable. IOW, ensure we never leak this state.

I think that was part of why we hated on that initial thing Kees did --
well, that and the fact that disabling all WP is of course completely insane ;-).

> > And I don't suppose we can take the WP fault and then fix up from there,
> > because if we're doing R/O page-tables, that'll increase the fault depth
> > and we'll double fault all the time, and triple fault where we
> > currently double fault. And we all know how _awesome_ triple faults
> > are.
> > 
> > But duplicating (and wrapping in gunk) whole APIs is just not going to
> > work.
> 
> Would something like kmap_atomic() be acceptable?

Don't think so. kmap_atomic() on x86_32 (64bit doesn't have it at all)
only does the TLB invalidate on the one CPU, which we've established is
incorrect.

Also, kmap_atomic is still page-table based, which means not all
page-tables can be read-only.

> Do you have some better proposal, now that (I hope) it is clearer
> what I'm trying to do and why?

You've still not talked about any actual attack vectors and how they're
mitigated by these patches.

I suppose the 'normal' attack goes like:

 1) find buffer-overrun / bound check failure
 2) use that to write to 'interesting' location
 3) that write results arbitrary code execution
 4) win

Of course, if the store of 2 is to the current cred structure, and
simply sets the effective uid to 0, we can skip 3.

Which seems to suggest all cred structures should be made r/o like this.
But I'm not sure I remember these patches doing that.


Also, there is an inverse situation with all this. If you make
everything R/O, then you need this allow-write for everything you do,
which then is about to include a case with an overflow / bound check
fail, and we're back to square 1.

What are you doing to avoid that?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 13/17] prmem: linked list: disable layout randomization
  2018-10-26 10:17     ` Matthew Wilcox
@ 2018-10-30 15:39       ` Peter Zijlstra
  0 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-30 15:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Igor Stoppa, Mimi Zohar, Kees Cook, Dave Chinner, James Morris,
	Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Greg Kroah-Hartman, Andrew Morton, Masahiro Yamada,
	Alexey Dobriyan, Pekka Enberg, Paul E. McKenney, Lihao Liang,
	linux-kernel

On Fri, Oct 26, 2018 at 03:17:07AM -0700, Matthew Wilcox wrote:
> On Fri, Oct 26, 2018 at 11:32:05AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
> > > Some of the data structures used in list management are composed by two
> > > pointers. Since the kernel is now configured by default to randomize the
> > > layout of data structures solely composed of pointers, this might
> > > prevent correct type punning between these structures and their write
> > > rare counterpart.
> > 
> > 'might' doesn't really work for me. Either it does or it does not.
> 
> He means "Depending on the random number generator, the two pointers
> might be AB or BA.  If they're of opposite polarity (50% of the time),
> it _will_ break, and 50% of the time it _won't_ break."

So don't do that then. If he were to include struct list_head inside his
prlist_head, then there is only the one randomization and things will
just work.

Also, I really don't see why he needs that second type and all that type
punning crap in the first place.



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 16/17] prmem: pratomic-long
  2018-10-29 21:17     ` Igor Stoppa
@ 2018-10-30 15:58       ` Peter Zijlstra
  2018-10-30 16:28         ` Will Deacon
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-30 15:58 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Will Deacon, Boqun Feng, Arnd Bergmann, linux-arch,
	linux-kernel

On Mon, Oct 29, 2018 at 11:17:14PM +0200, Igor Stoppa wrote:
> 
> 
> On 25/10/2018 01:13, Peter Zijlstra wrote:
> > On Wed, Oct 24, 2018 at 12:35:03AM +0300, Igor Stoppa wrote:
> > > +static __always_inline
> > > +bool __pratomic_long_op(bool inc, struct pratomic_long_t *l)
> > > +{
> > > +	struct page *page;
> > > +	uintptr_t base;
> > > +	uintptr_t offset;
> > > +	unsigned long flags;
> > > +	size_t size = sizeof(*l);
> > > +	bool is_virt = __is_wr_after_init(l, size);
> > > +
> > > +	if (WARN(!(is_virt || likely(__is_wr_pool(l, size))),
> > > +		 WR_ERR_RANGE_MSG))
> > > +		return false;
> > > +	local_irq_save(flags);
> > > +	if (is_virt)
> > > +		page = virt_to_page(l);
> > > +	else
> > > +		vmalloc_to_page(l);
> > > +	offset = (~PAGE_MASK) & (uintptr_t)l;
> > > +	base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> > > +	if (WARN(!base, WR_ERR_PAGE_MSG)) {
> > > +		local_irq_restore(flags);
> > > +		return false;
> > > +	}
> > > +	if (inc)
> > > +		atomic_long_inc((atomic_long_t *)(base + offset));
> > > +	else
> > > +		atomic_long_dec((atomic_long_t *)(base + offset));
> > > +	vunmap((void *)base);
> > > +	local_irq_restore(flags);
> > > +	return true;
> > > +
> > > +}
> > 
> > That's just hideously nasty.. and horribly broken.
> > 
> > We're not going to duplicate all these kernel interfaces wrapped in gunk
> > like that.
> 
> One possibility would be to have macros which use typeof() on the parameter
> being passed to decide which implementation to use: regular or write-rare.
> 
> This means that type punning would still be needed, to select the
> implementation.
> 
> Would this be enough? Is there some better way?

Like mentioned elsewhere; if you do write_enable() + write_disable()
thingies, it all becomes:

	write_enable();
	atomic_foo(&bar);
	write_disable();

No magic gunk infested duplication at all. Of course, ideally you'd then
teach objtool about this (or a GCC plugin I suppose) to ensure any
enable reached a disable.

The alternative is something like:

#define ALLOW_WRITE(stmt) do { write_enable(); do { stmt; } while (0); write_disable(); } while (0)

which then allows you to write:

	ALLOW_WRITE(atomic_foo(&bar));

No duplication.

> > Also, you _cannot_ call vunmap() with IRQs disabled. Clearly
> > you've never tested this with debug bits enabled.
> 
> I thought I had them. And I _did_ have them enabled, at some point.
> But I must have messed up with the configuration and I failed to notice
> this.
> 
> I can think of a way it might work, although it's not going to be very pretty:
> 
> * for the vmap(): if I understand correctly, it might sleep while obtaining
> memory for creating the mapping. This part could be executed before
> disabling interrupts. The rest of the function, instead, would be executed
> after interrupts are disabled.
> 
> * for vunmap(): after the writing is done, also change the alternate mapping
> to read-only, then enable interrupts and destroy the alternate mapping.
> Making the secondary mapping read-only as well makes it as secure as the
> primary, which means it can stay visible even with interrupts enabled.

That doesn't work if you wanted to do this write while you already have
IRQs disabled for example.



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 16/17] prmem: pratomic-long
  2018-10-30 15:58       ` Peter Zijlstra
@ 2018-10-30 16:28         ` Will Deacon
  2018-10-31  9:10           ` Peter Zijlstra
  0 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2018-10-30 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Boqun Feng, Arnd Bergmann, linux-arch,
	linux-kernel

On Tue, Oct 30, 2018 at 04:58:41PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 11:17:14PM +0200, Igor Stoppa wrote:
> > 
> > 
> > On 25/10/2018 01:13, Peter Zijlstra wrote:
> > > On Wed, Oct 24, 2018 at 12:35:03AM +0300, Igor Stoppa wrote:
> > > > +static __always_inline
> > > > +bool __pratomic_long_op(bool inc, struct pratomic_long_t *l)
> > > > +{
> > > > +	struct page *page;
> > > > +	uintptr_t base;
> > > > +	uintptr_t offset;
> > > > +	unsigned long flags;
> > > > +	size_t size = sizeof(*l);
> > > > +	bool is_virt = __is_wr_after_init(l, size);
> > > > +
> > > > +	if (WARN(!(is_virt || likely(__is_wr_pool(l, size))),
> > > > +		 WR_ERR_RANGE_MSG))
> > > > +		return false;
> > > > +	local_irq_save(flags);
> > > > +	if (is_virt)
> > > > +		page = virt_to_page(l);
> > > > +	else
> > > > +		vmalloc_to_page(l);
> > > > +	offset = (~PAGE_MASK) & (uintptr_t)l;
> > > > +	base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> > > > +	if (WARN(!base, WR_ERR_PAGE_MSG)) {
> > > > +		local_irq_restore(flags);
> > > > +		return false;
> > > > +	}
> > > > +	if (inc)
> > > > +		atomic_long_inc((atomic_long_t *)(base + offset));
> > > > +	else
> > > > +		atomic_long_dec((atomic_long_t *)(base + offset));
> > > > +	vunmap((void *)base);
> > > > +	local_irq_restore(flags);
> > > > +	return true;
> > > > +
> > > > +}
> > > 
> > > That's just hideously nasty.. and horribly broken.
> > > 
> > > We're not going to duplicate all these kernel interfaces wrapped in gunk
> > > like that.
> > 
> > One possibility would be to have macros which use typeof() on the parameter
> > being passed to decide which implementation to use: regular or write-rare.
> > 
> > This means that type punning would still be needed, to select the
> > implementation.
> > 
> > Would this be enough? Is there some better way?
> 
> Like mentioned elsewhere; if you do write_enable() + write_disable()
> thingies, it all becomes:
> 
> 	write_enable();
> 	atomic_foo(&bar);
> 	write_disable();
> 
> No magic gunk infested duplication at all. Of course, ideally you'd then
> teach objtool about this (or a GCC plugin I suppose) to ensure any
> enable reached a disable.

Isn't the issue here that we don't want to change the page tables for the
mapping of &bar, but instead want to create a temporary writable alias
at a random virtual address?

So you'd want:

	wbar = write_enable(&bar);
	atomic_foo(wbar);
	write_disable(wbar);

which is probably better expressed as a map/unmap API. I suspect this
would also be the only way to do things for cmpxchg() loops, where you
want to create the mapping outside of the loop to minimise your time in
the critical section.

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 15:26           ` Peter Zijlstra
@ 2018-10-30 16:37             ` Kees Cook
  2018-10-30 17:06               ` Andy Lutomirski
  2018-10-31  9:27               ` Igor Stoppa
  0 siblings, 2 replies; 140+ messages in thread
From: Kees Cook @ 2018-10-30 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Igor Stoppa, Mimi Zohar, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Andy Lutomirski, Thomas Gleixner

On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> I suppose the 'normal' attack goes like:
>
>  1) find buffer-overrun / bound check failure
>  2) use that to write to 'interesting' location
>  3) that write results in arbitrary code execution
>  4) win
>
> Of course, if the store of 2 is to the current cred structure, and
> simply sets the effective uid to 0, we can skip 3.

In most cases, yes, gaining root is game over. However, I don't want
to discount other threat models: some systems have been designed not
to trust root, so a cred attack doesn't always get an attacker full
control (e.g. lockdown series, signed modules, encrypted VMs, etc).

> Which seems to suggest all cred structures should be made r/o like this.
> But I'm not sure I remember these patches doing that.

There are things that attempt to protect cred (and other things, like
page tables) via hypervisors (see Samsung KNOX) or via syscall
boundary checking (see Linux Kernel Runtime Guard). They're pretty
interesting, but I'm not sure if there is a clear way forward on it
working in upstream, but that's why I think these discussions are
useful.

> Also, there is an inverse situation with all this. If you make
> everything R/O, then you need this allow-write for everything you do,
> which then is about to include a case with an overflow / bound check
> fail, and we're back to square 1.

Sure -- this is the fine line in trying to build these defenses. The
point is to narrow the scope of attack. Stupid metaphor follows: right
now we have only a couple walls; if we add walls we can focus on making
sure the doors and windows are safe. If we make the relatively
easy-to-find-in-memory page tables read-only-at-rest then a whole
class of very powerful exploits that depend on page table attacks go
away.

As part of all of this is the observation that there are two types of
things clearly worth protecting: that which is updated rarely (no need
to leave it writable for so much of its lifetime), and that which is
especially sensitive (page tables, security policy, function pointers,
etc). Finding a general purpose way to deal with these (like we have
for other data-lifetime cases like const and __ro_after_init) would be
very nice. I don't think there is a slippery slope here.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 16:37             ` Kees Cook
@ 2018-10-30 17:06               ` Andy Lutomirski
  2018-10-30 17:58                 ` Matthew Wilcox
  2018-11-13 14:25                 ` Igor Stoppa
  2018-10-31  9:27               ` Igor Stoppa
  1 sibling, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-30 17:06 UTC (permalink / raw)
  To: Kees Cook
  Cc: Peter Zijlstra, Igor Stoppa, Mimi Zohar, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner


> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> 
>> On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> I suppose the 'normal' attack goes like:
>> 
>> 1) find buffer-overrun / bound check failure
>> 2) use that to write to 'interesting' location
>> 3) that write results in arbitrary code execution
>> 4) win
>> 
>> Of course, if the store of 2 is to the current cred structure, and
>> simply sets the effective uid to 0, we can skip 3.
> 
> In most cases, yes, gaining root is game over. However, I don't want
> to discount other threat models: some systems have been designed not
> to trust root, so a cred attack doesn't always get an attacker full
> control (e.g. lockdown series, signed modules, encrypted VMs, etc).
> 
>> Which seems to suggest all cred structures should be made r/o like this.
>> But I'm not sure I remember these patches doing that.
> 
> There are things that attempt to protect cred (and other things, like
> page tables) via hypervisors (see Samsung KNOX) or via syscall
> boundary checking (see Linux Kernel Runtime Guard). They're pretty
> interesting, but I'm not sure if there is a clear way forward on it
> working in upstream, but that's why I think these discussions are
> useful.
> 
>> Also, there is an inverse situation with all this. If you make
>> everything R/O, then you need this allow-write for everything you do,
>> which then is about to include a case with an overflow / bound check
>> fail, and we're back to square 1.
> 
> Sure -- this is the fine line in trying to build these defenses. The
> point is to narrow the scope of attack. Stupid metaphor follows: right
> now we have only a couple walls; if we add walls we can focus on making
> sure the doors and windows are safe. If we make the relatively
> easy-to-find-in-memory page tables read-only-at-rest then a whole
> class of very powerful exploits that depend on page table attacks go
> away.
> 
> As part of all of this is the observation that there are two types of
> things clearly worth protecting: that which is updated rarely (no need
> to leave it writable for so much of its lifetime), and that which is
> especially sensitive (page tables, security policy, function pointers,
> etc). Finding a general purpose way to deal with these (like we have
> for other data-lifetime cases like const and __ro_after_init) would be
> very nice. I don't think there is a slippery slope here.
> 
> 

Since I wasn’t cc’d on this series:

I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.

If anyone wants to use CR0.WP instead, I’ll remind them that they have to fix up the entry code and justify the added complexity. And make performance not suck in a VM (i.e. CR0 reads on entry are probably off the table).  None of these will be easy.

If anyone wants to use kmap_atomic-like tricks, I’ll point out that we already have enough problems with dangling TLB entries due to SMP issues. The last thing we need is more of them. If someone proposes a viable solution that doesn’t involve CR3 fiddling, I’ll be surprised.

Keep in mind that switch_mm() is actually decently fast on modern CPUs.  It’s probably considerably faster than writing CR0, although I haven’t benchmarked it. It’s certainly faster than writing CR4.  It’s also faster than INVPCID, surprisingly, which means that it will be quite hard to get better performance using any sort of trickery.

Nadav’s patch set would be an excellent starting point.

P.S. EFI is sort of grandfathered in as a hackish alternate page table hierarchy. We’re not adding another one.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 17:06               ` Andy Lutomirski
@ 2018-10-30 17:58                 ` Matthew Wilcox
  2018-10-30 18:03                   ` Dave Hansen
                                     ` (3 more replies)
  2018-11-13 14:25                 ` Igor Stoppa
  1 sibling, 4 replies; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-30 17:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Peter Zijlstra, Igor Stoppa, Mimi Zohar, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> I support the addition of a rare-write mechanism to the upstream kernel.
> And I think that there is only one sane way to implement it: using an
> mm_struct. That mm_struct, just like any sane mm_struct, should only
> differ from init_mm in that it has extra mappings in the *user* region.

I'd like to understand this approach a little better.  In a syscall path,
we run with the user task's mm.  What you're proposing is that when we
want to modify rare data, we switch to rare_mm which contains a
writable mapping to all the kernel data which is rare-write.

So the API might look something like this:

	void *p = rare_alloc(...);	/* writable pointer */
	p->a = x;
	q = rare_protect(p);		/* read-only pointer */

To subsequently modify q,

	p = rare_modify(q);
	q->a = y;
	rare_protect(p);

Under the covers, rare_modify() would switch to the rare_mm and return
(void *)((unsigned long)q + ARCH_RARE_OFFSET).  All of the rare data
would then be modifiable, although you don't have any other pointers
to it.  rare_protect() would switch back to the previous mm and return
(p - ARCH_RARE_OFFSET).

Does this satisfy Igor's requirements?  We wouldn't be able to
copy_to/from_user() while rare_mm was active.  I think that's a feature
though!  It certainly satisfies my interests (kernel code being able to
mark things as dynamically-allocated-and-read-only-after-initialisation).

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 17:58                 ` Matthew Wilcox
@ 2018-10-30 18:03                   ` Dave Hansen
  2018-10-31  9:18                     ` Peter Zijlstra
  2018-10-30 18:28                   ` Tycho Andersen
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 140+ messages in thread
From: Dave Hansen @ 2018-10-30 18:03 UTC (permalink / raw)
  To: Matthew Wilcox, Andy Lutomirski
  Cc: Kees Cook, Peter Zijlstra, Igor Stoppa, Mimi Zohar, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On 10/30/18 10:58 AM, Matthew Wilcox wrote:
> Does this satisfy Igor's requirements?  We wouldn't be able to
> copy_to/from_user() while rare_mm was active.  I think that's a feature
> though!  It certainly satisfies my interests (kernel code being able to
> mark things as dynamically-allocated-and-read-only-after-initialisation).

It has to be more than copy_to/from_user(), though, I think.

rare_modify(q) either has to preempt_disable(), or we need to teach the
context-switching code when and how to switch in/out of the rare_mm.
preempt_disable() would also keep us from sleeping.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 17:58                 ` Matthew Wilcox
  2018-10-30 18:03                   ` Dave Hansen
@ 2018-10-30 18:28                   ` Tycho Andersen
  2018-10-30 19:20                     ` Matthew Wilcox
  2018-10-30 18:51                   ` Andy Lutomirski
  2018-10-31  9:36                   ` Peter Zijlstra
  3 siblings, 1 reply; 140+ messages in thread
From: Tycho Andersen @ 2018-10-30 18:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andy Lutomirski, Kees Cook, Peter Zijlstra, Igor Stoppa,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > I support the addition of a rare-write mechanism to the upstream kernel.
> > And I think that there is only one sane way to implement it: using an
> > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > differ from init_mm in that it has extra mappings in the *user* region.
> 
> I'd like to understand this approach a little better.  In a syscall path,
> we run with the user task's mm.  What you're proposing is that when we
> want to modify rare data, we switch to rare_mm which contains a
> writable mapping to all the kernel data which is rare-write.
> 
> So the API might look something like this:
> 
> 	void *p = rare_alloc(...);	/* writable pointer */
> 	p->a = x;
> 	q = rare_protect(p);		/* read-only pointer */
> 
> To subsequently modify q,
> 
> 	p = rare_modify(q);
> 	q->a = y;

Do you mean

    p->a = y;

here? I assume the intent is that q is never writable, but is the
pointer we keep in the structure at rest.

Tycho

> 	rare_protect(p);
> 
> Under the covers, rare_modify() would switch to the rare_mm and return
> (void *)((unsigned long)q + ARCH_RARE_OFFSET).  All of the rare data
> would then be modifiable, although you don't have any other pointers
> to it.  rare_protect() would switch back to the previous mm and return
> (p - ARCH_RARE_OFFSET).
> 
> Does this satisfy Igor's requirements?  We wouldn't be able to
> copy_to/from_user() while rare_mm was active.  I think that's a feature
> though!  It certainly satisfies my interests (kernel code being able to
> mark things as dynamically-allocated-and-read-only-after-initialisation).

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 17:58                 ` Matthew Wilcox
  2018-10-30 18:03                   ` Dave Hansen
  2018-10-30 18:28                   ` Tycho Andersen
@ 2018-10-30 18:51                   ` Andy Lutomirski
  2018-10-30 19:14                     ` Kees Cook
                                       ` (2 more replies)
  2018-10-31  9:36                   ` Peter Zijlstra
  3 siblings, 3 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-30 18:51 UTC (permalink / raw)
  To: Matthew Wilcox, nadav.amit
  Cc: Kees Cook, Peter Zijlstra, Igor Stoppa, Mimi Zohar, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner



> On Oct 30, 2018, at 10:58 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>> I support the addition of a rare-write mechanism to the upstream kernel.
>> And I think that there is only one sane way to implement it: using an
>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>> differ from init_mm in that it has extra mappings in the *user* region.
> 
> I'd like to understand this approach a little better.  In a syscall path,
> we run with the user task's mm.  What you're proposing is that when we
> want to modify rare data, we switch to rare_mm which contains a
> writable mapping to all the kernel data which is rare-write.
> 
> So the API might look something like this:
> 
>    void *p = rare_alloc(...);    /* writable pointer */
>    p->a = x;
>    q = rare_protect(p);        /* read-only pointer */
> 
> To subsequently modify q,
> 
>    p = rare_modify(q);
>    q->a = y;
>    rare_protect(p);

How about:

rare_write(&q->a, y);

Or, for big writes:

rare_write_copy(&q, local_q);

This avoids a whole ton of issues.  In practice, actually running with a special mm requires preemption disabled as well as some other stuff, which Nadav carefully dealt with.

Also, can we maybe focus on getting something merged for statically allocated data first?

Finally, one issue: rare_alloc() is going to utterly suck performance-wise due to the global IPI when the region gets zapped out of the direct map or otherwise made RO.  This is the same issue that makes all existing XPO efforts so painful. We need to either optimize the crap out of it somehow or we need to make sure it’s not called except during rare events like device enumeration.

Nadav, want to resubmit your series?  IIRC the only thing wrong with it was that it was a big change and we wanted a simpler fix to backport. But that’s all done now, and I, at least, rather liked your code. :)

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 18:51                   ` Andy Lutomirski
@ 2018-10-30 19:14                     ` Kees Cook
  2018-10-30 21:25                     ` Matthew Wilcox
  2018-10-30 23:18                     ` Nadav Amit
  2 siblings, 0 replies; 140+ messages in thread
From: Kees Cook @ 2018-10-30 19:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matthew Wilcox, Nadav Amit, Peter Zijlstra, Igor Stoppa,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Tue, Oct 30, 2018 at 11:51 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Oct 30, 2018, at 10:58 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>> And I think that there is only one sane way to implement it: using an
>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>> differ from init_mm in that it has extra mappings in the *user* region.
>>
>> I'd like to understand this approach a little better.  In a syscall path,
>> we run with the user task's mm.  What you're proposing is that when we
>> want to modify rare data, we switch to rare_mm which contains a
>> writable mapping to all the kernel data which is rare-write.
>>
>> So the API might look something like this:
>>
>>    void *p = rare_alloc(...);    /* writable pointer */
>>    p->a = x;
>>    q = rare_protect(p);        /* read-only pointer */
>>
>> To subsequently modify q,
>>
>>    p = rare_modify(q);
>>    q->a = y;
>>    rare_protect(p);
>
> How about:
>
> rare_write(&q->a, y);
>
> Or, for big writes:
>
> rare_write_copy(&q, local_q);
>
> This avoids a whole ton of issues.  In practice, actually running with a special mm requires preemption disabled as well as some other stuff, which Nadav carefully dealt with.

This is what I had before, yes:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=9ab0cb2618ebbc51f830ceaa06b7d2182fe1a52d

It just needs the switch_mm() backend.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 18:28                   ` Tycho Andersen
@ 2018-10-30 19:20                     ` Matthew Wilcox
  2018-10-30 20:43                       ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-30 19:20 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andy Lutomirski, Kees Cook, Peter Zijlstra, Igor Stoppa,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > > I support the addition of a rare-write mechanism to the upstream kernel.
> > > And I think that there is only one sane way to implement it: using an
> > > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > > differ from init_mm in that it has extra mappings in the *user* region.
> > 
> > I'd like to understand this approach a little better.  In a syscall path,
> > we run with the user task's mm.  What you're proposing is that when we
> > want to modify rare data, we switch to rare_mm which contains a
> > writable mapping to all the kernel data which is rare-write.
> > 
> > So the API might look something like this:
> > 
> > 	void *p = rare_alloc(...);	/* writable pointer */
> > 	p->a = x;
> > 	q = rare_protect(p);		/* read-only pointer */
> > 
> > To subsequently modify q,
> > 
> > 	p = rare_modify(q);
> > 	q->a = y;
> 
> Do you mean
> 
>     p->a = y;
> 
> here? I assume the intent is that q isn't writable ever, but that's
> the one we have in the structure at rest.

Yes, that was my intent, thanks.

To handle the list case that Igor has pointed out, you might want to
do something like this:

	list_for_each_entry(x, &xs, entry) {
		struct foo *writable = rare_modify(entry);
		kref_get(&writable->ref);
		rare_protect(writable);
	}

but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 19:20                     ` Matthew Wilcox
@ 2018-10-30 20:43                       ` Igor Stoppa
  2018-10-30 21:02                         ` Andy Lutomirski
  2018-10-30 21:35                         ` Matthew Wilcox
  0 siblings, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-30 20:43 UTC (permalink / raw)
  To: Matthew Wilcox, Tycho Andersen
  Cc: Andy Lutomirski, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On 30/10/2018 21:20, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>> And I think that there is only one sane way to implement it: using an
>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>
>>> I'd like to understand this approach a little better.  In a syscall path,
>>> we run with the user task's mm.  What you're proposing is that when we
>>> want to modify rare data, we switch to rare_mm which contains a
>>> writable mapping to all the kernel data which is rare-write.
>>>
>>> So the API might look something like this:
>>>
>>> 	void *p = rare_alloc(...);	/* writable pointer */
>>> 	p->a = x;
>>> 	q = rare_protect(p);		/* read-only pointer */

With pools and memory allocated from vmap_areas, I was able to say

protect(pool)

and that would do a sweep over all the pages currently in use.
In the SELinux policyDB, for example, one doesn't really want to 
individually protect each allocation.

The loading phase happens usually at boot, when the system can be 
assumed to be sane (one might even preload a bare-bone set of rules from 
initramfs and then replace it later on, with the full blown set).

There is no need to process each of these tens of thousands of 
allocations and initializations as write-rare.

Would it be possible to do the same here?

>>>
>>> To subsequently modify q,
>>>
>>> 	p = rare_modify(q);
>>> 	q->a = y;
>>
>> Do you mean
>>
>>      p->a = y;
>>
>> here? I assume the intent is that q isn't writable ever, but that's
>> the one we have in the structure at rest.
> 
> Yes, that was my intent, thanks.
> 
> To handle the list case that Igor has pointed out, you might want to
> do something like this:
> 
> 	list_for_each_entry(x, &xs, entry) {
> 		struct foo *writable = rare_modify(entry);

Would this mapping be impossible for other cores to snoop?

I'm asking this because, from what I understand, local interrupts are 
enabled here, so an attacker could freeze the core performing the 
write-rare operation, while another core scrapes the memory.

But blocking interrupts for the entire body of the loop would make RT 
latency unpredictable.

> 		kref_get(&writable->ref);
> 		rare_protect(writable);
> 	}
> 
> but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.

This seems suspiciously close to the duplication of kernel interfaces 
that I was roasted for :-)

--
igor


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 20:43                       ` Igor Stoppa
@ 2018-10-30 21:02                         ` Andy Lutomirski
  2018-10-30 21:07                           ` Kees Cook
                                             ` (2 more replies)
  2018-10-30 21:35                         ` Matthew Wilcox
  1 sibling, 3 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-30 21:02 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Matthew Wilcox, Tycho Andersen, Kees Cook, Peter Zijlstra,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner



> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
> 
>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>>>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>>> And I think that there is only one sane way to implement it: using an
>>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>> 
>>>> I'd like to understand this approach a little better.  In a syscall path,
>>>> we run with the user task's mm.  What you're proposing is that when we
>>>> want to modify rare data, we switch to rare_mm which contains a
>>>> writable mapping to all the kernel data which is rare-write.
>>>> 
>>>> So the API might look something like this:
>>>> 
>>>>    void *p = rare_alloc(...);    /* writable pointer */
>>>>    p->a = x;
>>>>    q = rare_protect(p);        /* read-only pointer */
> 
> With pools and memory allocated from vmap_areas, I was able to say
> 
> protect(pool)
> 
> and that would do a swipe on all the pages currently in use.
> In the SELinux policyDB, for example, one doesn't really want to individually protect each allocation.
> 
> The loading phase happens usually at boot, when the system can be assumed to be sane (one might even preload a bare-bone set of rules from initramfs and then replace it later on, with the full blown set).
> 
> There is no need to process each of these tens of thousands allocations and initialization as write-rare.
> 
> Would it be possible to do the same here?

I don’t see why not, although getting the API right will be a tad complicated.

> 
>>>> 
>>>> To subsequently modify q,
>>>> 
>>>>    p = rare_modify(q);
>>>>    q->a = y;
>>> 
>>> Do you mean
>>> 
>>>     p->a = y;
>>> 
>>> here? I assume the intent is that q isn't writable ever, but that's
>>> the one we have in the structure at rest.
>> Yes, that was my intent, thanks.
>> To handle the list case that Igor has pointed out, you might want to
>> do something like this:
>>    list_for_each_entry(x, &xs, entry) {
>>        struct foo *writable = rare_modify(entry);
> 
> Would this mapping be impossible to spoof by other cores?
> 

Indeed. Only the core with the special mm loaded could see it.

But I dislike allowing regular writes in the protected region. We really only need four write primitives:

1. Just write one value.  Call at any time (except NMI).

2. Just copy some bytes. Same as (1) but any number of bytes.

3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.

Actually getting a modifiable pointer should be disallowed for two reasons:

1. Some architectures may want to use a special write-different-address-space operation. Heck, x86 could, too: make the actual offset be a secret and shove the offset into FSBASE or similar. Then %fs-prefixed writes would do the rare writes.

2. Alternatively, x86 could set the U bit. Then the actual writes would use the uaccess helpers, giving extra protection via SMAP.

We don’t really want a situation where an unchecked pointer in the rare write region completely defeats the mechanism.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:02                         ` Andy Lutomirski
@ 2018-10-30 21:07                           ` Kees Cook
  2018-10-30 21:25                             ` Igor Stoppa
  2018-10-30 22:15                           ` Igor Stoppa
  2018-10-31  9:45                           ` Peter Zijlstra
  2 siblings, 1 reply; 140+ messages in thread
From: Kees Cook @ 2018-10-30 21:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Igor Stoppa, Matthew Wilcox, Tycho Andersen, Peter Zijlstra,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Tue, Oct 30, 2018 at 2:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>
>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>>>>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>>>> And I think that there is only one sane way to implement it: using an
>>>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>>>
>>>>> I'd like to understand this approach a little better.  In a syscall path,
>>>>> we run with the user task's mm.  What you're proposing is that when we
>>>>> want to modify rare data, we switch to rare_mm which contains a
>>>>> writable mapping to all the kernel data which is rare-write.
>>>>>
>>>>> So the API might look something like this:
>>>>>
>>>>>    void *p = rare_alloc(...);    /* writable pointer */
>>>>>    p->a = x;
>>>>>    q = rare_protect(p);        /* read-only pointer */
>>
>> With pools and memory allocated from vmap_areas, I was able to say
>>
>> protect(pool)
>>
>> and that would do a swipe on all the pages currently in use.
>> In the SELinux policyDB, for example, one doesn't really want to individually protect each allocation.
>>
>> The loading phase happens usually at boot, when the system can be assumed to be sane (one might even preload a bare-bone set of rules from initramfs and then replace it later on, with the full blown set).
>>
>> There is no need to process each of these tens of thousands allocations and initialization as write-rare.
>>
>> Would it be possible to do the same here?
>
> I don’t see why not, although getting the API right will be a tad complicated.
>
>>
>>>>>
>>>>> To subsequently modify q,
>>>>>
>>>>>    p = rare_modify(q);
>>>>>    q->a = y;
>>>>
>>>> Do you mean
>>>>
>>>>     p->a = y;
>>>>
>>>> here? I assume the intent is that q isn't writable ever, but that's
>>>> the one we have in the structure at rest.
>>> Yes, that was my intent, thanks.
>>> To handle the list case that Igor has pointed out, you might want to
>>> do something like this:
>>>    list_for_each_entry(x, &xs, entry) {
>>>        struct foo *writable = rare_modify(entry);
>>
>> Would this mapping be impossible to spoof by other cores?
>>
>
> Indeed. Only the core with the special mm loaded could see it.
>
> But I dislike allowing regular writes in the protected region. We really only need four write primitives:
>
> 1. Just write one value.  Call at any time (except NMI).
>
> 2. Just copy some bytes. Same as (1) but any number of bytes.
>
> 3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.
>
> Actually getting a modifiable pointer should be disallowed for two reasons:
>
> 1. Some architectures may want to use a special write-different-address-space operation. Heck, x86 could, too: make the actual offset be a secret and shove the offset into FSBASE or similar. Then %fs-prefixed writes would do the rare writes.
>
> 2. Alternatively, x86 could set the U bit. Then the actual writes would use the uaccess helpers, giving extra protection via SMAP.
>
> We don’t really want a situation where an unchecked pointer in the rare write region completely defeats the mechanism.

We still have to deal with certain structures under the write-rare
window. For example, see:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=60430b4d3b113aae4adab66f8339074986276474
They are wrappers to non-inline functions that have the same sanity-checking.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 18:51                   ` Andy Lutomirski
  2018-10-30 19:14                     ` Kees Cook
@ 2018-10-30 21:25                     ` Matthew Wilcox
  2018-10-30 21:55                       ` Igor Stoppa
  2018-10-31  9:29                       ` Peter Zijlstra
  2018-10-30 23:18                     ` Nadav Amit
  2 siblings, 2 replies; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-30 21:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: nadav.amit, Kees Cook, Peter Zijlstra, Igor Stoppa, Mimi Zohar,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> Finally, one issue: rare_alloc() is going to utterly suck
> performance-wise due to the global IPI when the region gets zapped out
> of the direct map or otherwise made RO.  This is the same issue that
> makes all existing XPO efforts so painful. We need to either optimize
> the crap out of it somehow or we need to make sure it’s not called
> except during rare events like device enumeration.

Batching operations is kind of the whole point of the VM ;-)  Either
this rare memory gets used a lot, in which case we'll want to create slab
caches for it, make it a MM zone and the whole nine yards, or it's not
used very much, in which case it doesn't matter that performance sucks.

For now, I'd suggest allocating 2MB chunks as needed, and having a
shrinker to hand back any unused pieces.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:07                           ` Kees Cook
@ 2018-10-30 21:25                             ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-30 21:25 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski
  Cc: Matthew Wilcox, Tycho Andersen, Peter Zijlstra, Mimi Zohar,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner



On 30/10/2018 23:07, Kees Cook wrote:

> We still have to deal with certain structures under the write-rare
> window. For example, see:
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=60430b4d3b113aae4adab66f8339074986276474
> They are wrappers to non-inline functions that have the same sanity-checking.

Even though I also did something similar, it was not intended to be 
final, just a stop-gap solution until the write-rare mechanism is settled.

If the whole list_head is aligned to its own size, then the whole 
list_head structure can be given an alternate mapping and the plain list 
functions can be used through that mapping.

That could halve the overhead, or better: the typical case is when both 
list_heads fall within the same page, like when allocating and queuing 
multiple items from the same pool. One single temporary mapping would 
then be enough.

But it becomes tricky to do this without generating code that is
almost-but-not-quite-identical.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 20:43                       ` Igor Stoppa
  2018-10-30 21:02                         ` Andy Lutomirski
@ 2018-10-30 21:35                         ` Matthew Wilcox
  2018-10-30 21:49                           ` Igor Stoppa
  2018-10-31  4:41                           ` Andy Lutomirski
  1 sibling, 2 replies; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-30 21:35 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Tycho Andersen, Andy Lutomirski, Kees Cook, Peter Zijlstra,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
> On 30/10/2018 21:20, Matthew Wilcox wrote:
> > > > So the API might look something like this:
> > > > 
> > > > 	void *p = rare_alloc(...);	/* writable pointer */
> > > > 	p->a = x;
> > > > 	q = rare_protect(p);		/* read-only pointer */
> 
> With pools and memory allocated from vmap_areas, I was able to say
> 
> protect(pool)
> 
> and that would do a swipe on all the pages currently in use.
> In the SELinux policyDB, for example, one doesn't really want to
> individually protect each allocation.
> 
> The loading phase happens usually at boot, when the system can be assumed to
> be sane (one might even preload a bare-bone set of rules from initramfs and
> then replace it later on, with the full blown set).
> 
> There is no need to process each of these tens of thousands allocations and
> initialization as write-rare.
> 
> Would it be possible to do the same here?

What Andy is proposing effectively puts all rare allocations into
one pool.  Although I suppose it could be generalised to multiple pools
... one mm_struct per pool.  Andy, what do you think to doing that?

> > but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
> 
> This seems suspiciously close to the duplication of kernel interfaces that I
> was roasted for :-)

Can you not see the difference between adding one syntactic sugar function
and duplicating the entire infrastructure?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:35                         ` Matthew Wilcox
@ 2018-10-30 21:49                           ` Igor Stoppa
  2018-10-31  4:41                           ` Andy Lutomirski
  1 sibling, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-30 21:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tycho Andersen, Andy Lutomirski, Kees Cook, Peter Zijlstra,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On 30/10/2018 23:35, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:


>> Would it be possible to do the same here?
> 
> What Andy is proposing effectively puts all rare allocations into
> one pool.  Although I suppose it could be generalised to multiple pools
> ... one mm_struct per pool.  Andy, what do you think to doing that?

The reason to have pools is that, continuing the SELinux example, it 
supports reloading the policyDB.

In this case it seems to me that it would be faster to drop the entire 
pool in one go, and create a new one when re-initializing the rules.

Or maybe the pool could be flushed, without destroying the metadata.

One more reason for having pools is to assign certain properties to the 
pool and then rely on them being applied to every subsequent allocation.

I've also been wondering if pools can be expected to have some 
well-defined properties.

One might be that they do not need to be created on the fly: the number 
of pools would be known at compilation time. At least the metadata could 
be static, but I do not know how this would work with modules.

>>> but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
>>
>> This seems suspiciously close to the duplication of kernel interfaces that I
>> was roasted for :-)
> 
> Can you not see the difference between adding one syntactic sugar function
> and duplicating the entire infrastructure?

And list_add()? Or any of the other list-related functions?
Don't they end up receiving similar treatment?

I might have misunderstood your proposal, but it seems to me that they 
too will need the same kind of rare_modify()/rare_protect() pairs. No?

And same for hlist, including the _rcu variants.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:25                     ` Matthew Wilcox
@ 2018-10-30 21:55                       ` Igor Stoppa
  2018-10-30 22:08                         ` Matthew Wilcox
  2018-10-31  9:29                       ` Peter Zijlstra
  1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-30 21:55 UTC (permalink / raw)
  To: Matthew Wilcox, Andy Lutomirski
  Cc: nadav.amit, Kees Cook, Peter Zijlstra, Mimi Zohar, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner



On 30/10/2018 23:25, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
>> Finally, one issue: rare_alloc() is going to utterly suck
>> performance-wise due to the global IPI when the region gets zapped out
>> of the direct map or otherwise made RO.  This is the same issue that
>> makes all existing XPO efforts so painful. We need to either optimize
>> the crap out of it somehow or we need to make sure it’s not called
>> except during rare events like device enumeration.
> 
> Batching operations is kind of the whole point of the VM ;-)  Either
> this rare memory gets used a lot, in which case we'll want to create slab
> caches for it, make it a MM zone and the whole nine yards, or it's not
> used very much in which case it doesn't matter that performance sucks.
> 
> For now, I'd suggest allocating 2MB chunks as needed, and having a
> shrinker to hand back any unused pieces.

One of the prime candidates for this sort of protection is IMA.
In the IMA case, there are ever-growing lists which are populated when 
files are accessed.
This ends up on the critical path of any performance-critical use case 
that accesses files for the first time, such as boot or application 
startup.

The SELinux AVC is also based on lists. It uses an object cache, but it 
still grows and sits on the critical path of evaluating the callbacks 
from the LSM hooks. A lot of them.

These are the main two reasons, so far, for me advocating an 
optimization of the write-rare version of the (h)list.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:55                       ` Igor Stoppa
@ 2018-10-30 22:08                         ` Matthew Wilcox
  0 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-30 22:08 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Andy Lutomirski, nadav.amit, Kees Cook, Peter Zijlstra,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Tue, Oct 30, 2018 at 11:55:46PM +0200, Igor Stoppa wrote:
> On 30/10/2018 23:25, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> > > Finally, one issue: rare_alloc() is going to utterly suck
> > > performance-wise due to the global IPI when the region gets zapped out
> > > of the direct map or otherwise made RO.  This is the same issue that
> > > makes all existing XPO efforts so painful. We need to either optimize
> > > the crap out of it somehow or we need to make sure it’s not called
> > > except during rare events like device enumeration.
> > 
> > Batching operations is kind of the whole point of the VM ;-)  Either
> > this rare memory gets used a lot, in which case we'll want to create slab
> > caches for it, make it a MM zone and the whole nine yards, or it's not
> > used very much in which case it doesn't matter that performance sucks.
> > 
> > For now, I'd suggest allocating 2MB chunks as needed, and having a
> > shrinker to hand back any unused pieces.
> 
> One of the prime candidates for this sort of protection is IMA.
> In the IMA case, there are ever-growing lists which are populated when
> accessing files.
> It's something that ends up on the critical path of any usual performance
> critical use case, when accessing files for the first time, like at
> boot/application startup.
> 
> Also the SELinux AVC is based on lists. It uses an object cache, but it is
> still something that grows and is on the critical path of evaluating the
> callbacks from the LSM hooks. A lot of them.
> 
> These are the main two reasons, so far, for me advocating an optimization of
> the write-rare version of the (h)list.

I think these are both great examples of why doubly-linked lists _suck_.
You have to modify three cachelines to add an entry to a list.  Walking a
linked list is an exercise in cache misses.  Far better to use an XArray /
IDR for this purpose.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:02                         ` Andy Lutomirski
  2018-10-30 21:07                           ` Kees Cook
@ 2018-10-30 22:15                           ` Igor Stoppa
  2018-10-31 10:11                             ` Peter Zijlstra
  2018-10-31  9:45                           ` Peter Zijlstra
  2 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-30 22:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matthew Wilcox, Tycho Andersen, Kees Cook, Peter Zijlstra,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner



On 30/10/2018 23:02, Andy Lutomirski wrote:
> 
> 
>> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:


>> There is no need to process each of these tens of thousands of allocations and initializations as write-rare.
>>
>> Would it be possible to do the same here?
> 
> I don’t see why not, although getting the API right will be a tad complicated.

yes, I have some first-hand experience with this :-/

>>
>>>>>
>>>>> To subsequently modify q,
>>>>>
>>>>>     p = rare_modify(q);
>>>>>     q->a = y;
>>>>
>>>> Do you mean
>>>>
>>>>      p->a = y;
>>>>
>>>> here? I assume the intent is that q isn't writable ever, but that's
>>>> the one we have in the structure at rest.
>>> Yes, that was my intent, thanks.
>>> To handle the list case that Igor has pointed out, you might want to
>>> do something like this:
>>>     list_for_each_entry(x, &xs, entry) {
>>>         struct foo *writable = rare_modify(entry);
>>
>> Would this mapping be impossible to spoof by other cores?
>>
> 
> Indeed. Only the core with the special mm loaded could see it.
> 
> But I dislike allowing regular writes in the protected region. We really only need four write primitives:
> 
> 1. Just write one value.  Call at any time (except NMI).
> 
> 2. Just copy some bytes. Same as (1) but any number of bytes.
> 
> 3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.

Atomic? RCU?

Yes, they are technically just memory writes, but shouldn't the "normal" 
operation be executed on the writable mapping?

It is possible to sandwich any call between a rare_modify/rare_protect 
pair, but isn't that pretty close to having a write-rare version of 
these plain functions?

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 18:51                   ` Andy Lutomirski
  2018-10-30 19:14                     ` Kees Cook
  2018-10-30 21:25                     ` Matthew Wilcox
@ 2018-10-30 23:18                     ` Nadav Amit
  2018-10-31  9:08                       ` Peter Zijlstra
  2 siblings, 1 reply; 140+ messages in thread
From: Nadav Amit @ 2018-10-30 23:18 UTC (permalink / raw)
  To: Andy Lutomirski, Matthew Wilcox, Peter Zijlstra
  Cc: Kees Cook, Igor Stoppa, Mimi Zohar, Dave Chinner, James Morris,
	Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

From: Andy Lutomirski
Sent: October 30, 2018 at 6:51:17 PM GMT
> To: Matthew Wilcox <willy@infradead.org>, nadav.amit@gmail.com
> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> 
> 
>> On Oct 30, 2018, at 10:58 AM, Matthew Wilcox <willy@infradead.org> wrote:
>> 
>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>> And I think that there is only one sane way to implement it: using an
>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>> differ from init_mm in that it has extra mappings in the *user* region.
>> 
>> I'd like to understand this approach a little better.  In a syscall path,
>> we run with the user task's mm.  What you're proposing is that when we
>> want to modify rare data, we switch to rare_mm which contains a
>> writable mapping to all the kernel data which is rare-write.
>> 
>> So the API might look something like this:
>> 
>>   void *p = rare_alloc(...);    /* writable pointer */
>>   p->a = x;
>>   q = rare_protect(p);        /* read-only pointer */
>> 
>> To subsequently modify q,
>> 
>>   p = rare_modify(q);
>>   q->a = y;
>>   rare_protect(p);
> 
> How about:
> 
> rare_write(&q->a, y);
> 
> Or, for big writes:
> 
> rare_write_copy(&q, local_q);
> 
> This avoids a whole ton of issues. In practice, actually running with a
> special mm requires preemption disabled as well as some other stuff, which
> Nadav carefully dealt with.
> 
> Also, can we maybe focus on getting something merged for statically
> allocated data first?
> 
> Finally, one issue: rare_alloc() is going to utterly suck performance-wise
> due to the global IPI when the region gets zapped out of the direct map or
> otherwise made RO. This is the same issue that makes all existing XPO
> efforts so painful. We need to either optimize the crap out of it somehow
> or we need to make sure it’s not called except during rare events like
> device enumeration.
> 
> Nadav, want to resubmit your series? IIRC the only thing wrong with it was
> that it was a big change and we wanted a simpler fix to backport. But
> that’s all done now, and I, at least, rather liked your code. :)

I guess since it was based on your ideas…

Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
make kgdb's use of poke_text() less disgusting. I still don’t know exactly
how to deal with it.

Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
v2.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:35                         ` Matthew Wilcox
  2018-10-30 21:49                           ` Igor Stoppa
@ 2018-10-31  4:41                           ` Andy Lutomirski
  2018-10-31  9:08                             ` Igor Stoppa
  2018-10-31 10:02                             ` Peter Zijlstra
  1 sibling, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31  4:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Igor Stoppa, Tycho Andersen, Kees Cook, Peter Zijlstra,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
> > On 30/10/2018 21:20, Matthew Wilcox wrote:
> > > > > So the API might look something like this:
> > > > >
> > > > >         void *p = rare_alloc(...);      /* writable pointer */
> > > > >         p->a = x;
> > > > >         q = rare_protect(p);            /* read-only pointer */
> >
> > With pools and memory allocated from vmap_areas, I was able to say
> >
> > protect(pool)
> >
> > and that would do a sweep on all the pages currently in use.
> > In the SELinux policyDB, for example, one doesn't really want to
> > individually protect each allocation.
> >
> > The loading phase happens usually at boot, when the system can be assumed to
> > be sane (one might even preload a bare-bone set of rules from initramfs and
> > then replace it later on, with the full blown set).
> >
> > There is no need to process each of these tens of thousands of
> > allocations and initializations as write-rare.
> >
> > Would it be possible to do the same here?
>
> What Andy is proposing effectively puts all rare allocations into
> one pool.  Although I suppose it could be generalised to multiple pools
> ... one mm_struct per pool.  Andy, what do you think to doing that?

Hmm.  Let's see.

To clarify some of this thread, I think that the fact that rare_write
uses an mm_struct and alias mappings under the hood should be
completely invisible to users of the API.  No one should ever be
handed a writable pointer to rare_write memory (except perhaps during
bootup or when initializing a large complex data structure that will
be rare_write but isn't yet, e.g. the policy db).

For example, there could easily be architectures where having a
writable alias is problematic.  On such architectures, an entirely
different mechanism might work better.  And, if a tool like KNOX ever
becomes a *part* of the Linux kernel (hint hint!)

If you have multiple pools and one mm_struct per pool, you'll need a
way to find the mm_struct from a given allocation.  Regardless of how
the mm_structs are set up, changing rare_write memory to normal memory
or vice versa will require a global TLB flush (all ASIDs and global
pages) on all CPUs, so having extra mm_structs doesn't seem to buy
much.

(It's just possible that changing rare_write back to normal might be
able to avoid the flush if the spurious faults can be handled
reliably.)

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 23:18                     ` Nadav Amit
@ 2018-10-31  9:08                       ` Peter Zijlstra
  2018-11-01 16:31                         ` Nadav Amit
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31  9:08 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Matthew Wilcox, Kees Cook, Igor Stoppa,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Tue, Oct 30, 2018 at 04:18:39PM -0700, Nadav Amit wrote:
> > Nadav, want to resubmit your series? IIRC the only thing wrong with it was
> > that it was a big change and we wanted a simpler fix to backport. But
> > that’s all done now, and I, at least, rather liked your code. :)
> 
> I guess since it was based on your ideas…
> 
> Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
> make kgdb's use of poke_text() less disgusting. I still don’t know exactly
> how to deal with it.
> 
> Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
> v2.

Oh man, I completely forgot about kgdb :/

Also, would it be possible to entirely remove that kmap fallback path?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31  4:41                           ` Andy Lutomirski
@ 2018-10-31  9:08                             ` Igor Stoppa
  2018-10-31 19:38                               ` Igor Stoppa
  2018-10-31 10:02                             ` Peter Zijlstra
  1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-31  9:08 UTC (permalink / raw)
  To: Andy Lutomirski, Matthew Wilcox, Paul Moore, Stephen Smalley
  Cc: Tycho Andersen, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, LSM List, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner, selinux

Adding SELinux folks and the SELinux ml

I think it's better if they participate in this discussion.

On 31/10/2018 06:41, Andy Lutomirski wrote:
> On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>>>> So the API might look something like this:
>>>>>>
>>>>>>          void *p = rare_alloc(...);      /* writable pointer */
>>>>>>          p->a = x;
>>>>>>          q = rare_protect(p);            /* read-only pointer */
>>>
>>> With pools and memory allocated from vmap_areas, I was able to say
>>>
>>> protect(pool)
>>>
>>> and that would do a sweep on all the pages currently in use.
>>> In the SELinux policyDB, for example, one doesn't really want to
>>> individually protect each allocation.
>>>
>>> The loading phase happens usually at boot, when the system can be assumed to
>>> be sane (one might even preload a bare-bone set of rules from initramfs and
>>> then replace it later on, with the full blown set).
>>>
>>> There is no need to process each of these tens of thousands of
>>> allocations and initializations as write-rare.
>>>
>>> Would it be possible to do the same here?
>>
>> What Andy is proposing effectively puts all rare allocations into
>> one pool.  Although I suppose it could be generalised to multiple pools
>> ... one mm_struct per pool.  Andy, what do you think to doing that?
> 
> Hmm.  Let's see.
> 
> To clarify some of this thread, I think that the fact that rare_write
> uses an mm_struct and alias mappings under the hood should be
> completely invisible to users of the API.  

I agree.

> No one should ever be
> handed a writable pointer to rare_write memory (except perhaps during
> bootup or when initializing a large complex data structure that will
> be rare_write but isn't yet, e.g. the policy db).

The policy db doesn't need to be write rare.
Actually, it really shouldn't be write rare.

Maybe it's just a matter of wording, but effectively the policyDB can be 
treated with this sequence:

1) allocate various data structures in writable form

2) initialize them

3) go back to 1 as needed

4) lock down everything that has been allocated, as Read-Only

The reason I stress ReadOnly is that differentiating what is really 
ReadOnly from what is WriteRare provides an extra edge against attacks, 
because attempts to alter ReadOnly data through a WriteRare API could be 
detected

5) read any part of the policyDB during regular operations

6) in case of update, create a temporary new version, using steps 1..3

7) if update successful, use the new one and destroy the old one

8) if the update failed, destroy the new one

The destruction at points 7 and 8 is not so much a write operation, as 
it is a release of the memory.

So we might have a somewhat different interpretation of what write-rare 
means with regard to destroying the memory and its content.

To clarify: I've been using write-rare to indicate primarily small 
operations that one would otherwise achieve with "=", memcpy or memset 
or more complex variants, like atomic ops, rcu pointer assignment, etc.

Tearing down an entire set of allocations like the policyDB doesn't fit 
very well with that model.

The only part which _needs_ to be write rare, in the policyDB, is the 
set of pointers which are used to access all the dynamically allocated 
data set.

These pointers must be updated when a new policyDB is allocated.

> For example, there could easily be architectures where having a
> writable alias is problematic.  On such architectures, an entirely
> different mechanism might work better.  And, if a tool like KNOX ever
> becomes a *part* of the Linux kernel (hint hint!)

Something related, albeit not identical is going on here [1]
Eventually, it could be expanded to deal also with write rare.

> If you have multiple pools and one mm_struct per pool, you'll need a
> way to find the mm_struct from a given allocation. 

Indeed. In my patchset, based on vmas, I do the following:
* a private field in the page struct points to the vma using that page
* inside the vma there is a list_head used only during deletion
   - one pointer is used to chain vmas from the same pool
   - one pointer points to the pool struct
* the pool struct holds the properties to use for all the associated 
allocations: is it write-rare, read-only, does it auto-protect, etc.

> Regardless of how
> the mm_structs are set up, changing rare_write memory to normal memory
> or vice versa will require a global TLB flush (all ASIDs and global
> pages) on all CPUs, so having extra mm_structs doesn't seem to buy
> much.

1) it supports different levels of protection:
   temporarily unprotected vs read-only vs write-rare

2) the change of write permission should be possible only toward more 
restrictive rules (writable -> write-rare -> read-only) and only up to 
the level that was specified while creating the pool, to avoid DoS 
attacks where a write-rare pool is flipped into read-only and further 
updates fail
(ex: prevent IMA from registering modifications to a file, by not 
letting it store new information - I'm not 100% sure this would work, 
but it gives the idea, I think)

3) being able to track all the allocations related to a pool makes it 
possible to perform mass operations, like reducing the writability or 
destroying all the allocations.

> (It's just possible that changing rare_write back to normal might be
> able to avoid the flush if the spurious faults can be handled
> reliably.)

I do not see the need for degrading the write permissions of an 
allocation, unless it refers to releasing a pool of allocations (see 
updating the SELinux policyDB).

[1] https://www.openwall.com/lists/kernel-hardening/2018/10/26/11

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 16/17] prmem: pratomic-long
  2018-10-30 16:28         ` Will Deacon
@ 2018-10-31  9:10           ` Peter Zijlstra
  2018-11-01  3:28             ` Kees Cook
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31  9:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Igor Stoppa, Mimi Zohar, Kees Cook, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, kernel-hardening, linux-integrity,
	linux-security-module, igor.stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Boqun Feng, Arnd Bergmann, linux-arch,
	linux-kernel

On Tue, Oct 30, 2018 at 04:28:16PM +0000, Will Deacon wrote:
> On Tue, Oct 30, 2018 at 04:58:41PM +0100, Peter Zijlstra wrote:
> > Like mentioned elsewhere; if you do write_enable() + write_disable()
> > thingies, it all becomes:
> > 
> > 	write_enable();
> > 	atomic_foo(&bar);
> > 	write_disable();
> > 
> > No magic gunk infested duplication at all. Of course, ideally you'd then
> > teach objtool about this (or a GCC plugin I suppose) to ensure any
> > enable reached a disable.
> 
> Isn't the issue here that we don't want to change the page tables for the
> mapping of &bar, but instead want to create a temporary writable alias
> at a random virtual address?
> 
> So you'd want:
> 
> 	wbar = write_enable(&bar);
> 	atomic_foo(wbar);
> 	write_disable(wbar);
> 
> which is probably better expressed as a map/unmap API. I suspect this
> would also be the only way to do things for cmpxchg() loops, where you
> want to create the mapping outside of the loop to minimise your time in
> the critical section.

Ah, so I was thinking that the alternative mm would have stuff in the
same location, just RW instead of RO.

But yes, if we, as Andy suggests, use the userspace address range for
the aliases, then we need to do as you suggest.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 18:03                   ` Dave Hansen
@ 2018-10-31  9:18                     ` Peter Zijlstra
  0 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31  9:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Matthew Wilcox, Andy Lutomirski, Kees Cook, Igor Stoppa,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Tue, Oct 30, 2018 at 11:03:51AM -0700, Dave Hansen wrote:
> On 10/30/18 10:58 AM, Matthew Wilcox wrote:
> > Does this satisfy Igor's requirements?  We wouldn't be able to
> > copy_to/from_user() while rare_mm was active.  I think that's a feature
> > though!  It certainly satisfies my interests (kernel code be able to
> > mark things as dynamically-allocated-and-read-only-after-initialisation)
> 
> It has to be more than copy_to/from_user(), though, I think.
> 
> rare_modify(q) either has to preempt_disable(), or we need to teach the
> context-switching code when and how to switch in/out of the rare_mm.
> preempt_disable() would also keep us from sleeping.

Yes, I think we want to have preempt disable at the very least. We could
indeed make the context-switch code smart and teach it about this state,
but I think allowing preemption in such sections is a bad idea. We want
to keep these sections short and simple (and bounded), such that they
can be analyzed for correctness.

Once you allow preemption, things tend to grow large and complex.

Ideally we'd even disable interrupts over this, to further limit what
code runs in the rare_mm context. NMIs need special care anyway.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 16:37             ` Kees Cook
  2018-10-30 17:06               ` Andy Lutomirski
@ 2018-10-31  9:27               ` Igor Stoppa
  1 sibling, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-31  9:27 UTC (permalink / raw)
  To: Kees Cook, Peter Zijlstra
  Cc: Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
	Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Andy Lutomirski, Thomas Gleixner



On 30/10/2018 18:37, Kees Cook wrote:
> On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> I suppose the 'normal' attack goes like:
>>
>>   1) find buffer-overrun / bound check failure
>>   2) use that to write to 'interesting' location
>>   3) that write results arbitrary code execution
>>   4) win
>>
>> Of course, if the store of 2 is to the current cred structure, and
>> simply sets the effective uid to 0, we can skip 3.
> 
> In most cases, yes, gaining root is game over. However, I don't want
> to discount other threat models: some systems have been designed not
> to trust root, so a cred attack doesn't always get an attacker full
> control (e.g. lockdown series, signed modules, encrypted VMs, etc).

Reading these points I see where I was not clear.

Maybe I can fix that. SELinux is a good example of a safeguard against a 
takeover of root, so it is a primary target. Unfortunately it contains 
some state variables that can be used to disable it.

Another attack that comes to mind, for executing code within the kernel 
without sweating too much with ROP, is the following:

1) find a rabbit hole to write kernel data, whatever it might be
2) spray the keyring with one's own key
3) use the module-loading infrastructure to load one's own module, 
signed with the key from point 2)

Just to say that direct arbitrary code execution is becoming harder to 
perform, but there are ways around it which rely more on overwriting 
unprotected data.

They are a lower hanging fruit for an attacker.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:25                     ` Matthew Wilcox
  2018-10-30 21:55                       ` Igor Stoppa
@ 2018-10-31  9:29                       ` Peter Zijlstra
  1 sibling, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31  9:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andy Lutomirski, nadav.amit, Kees Cook, Igor Stoppa, Mimi Zohar,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Oct 30, 2018 at 02:25:51PM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> > Finally, one issue: rare_alloc() is going to utterly suck
> > performance-wise due to the global IPI when the region gets zapped out
> > of the direct map or otherwise made RO.  This is the same issue that
> > makes all existing XPO efforts so painful. We need to either optimize
> > the crap out of it somehow or we need to make sure it’s not called
> > except during rare events like device enumeration.
> 
> Batching operations is kind of the whole point of the VM ;-)  Either
> this rare memory gets used a lot, in which case we'll want to create slab
> caches for it, make it a MM zone and the whole nine yards, or it's not
> used very much in which case it doesn't matter that performance sucks.

Yes, for the dynamic case something along those lines would be needed.
If we have a single rare zone, we could even have __GFP_RARE or whatever
that manages this.

The page allocator would have to grow a rare memblock type, and every
rare alloc would allocate from a rare memblock, when none is available,
creation of a rare block would set up the mappings etc..

> For now, I'd suggest allocating 2MB chunks as needed, and having a
> shrinker to hand back any unused pieces.

Something like the percpu allocator?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 17:58                 ` Matthew Wilcox
                                     ` (2 preceding siblings ...)
  2018-10-30 18:51                   ` Andy Lutomirski
@ 2018-10-31  9:36                   ` Peter Zijlstra
  2018-10-31 11:33                     ` Matthew Wilcox
  3 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31  9:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andy Lutomirski, Kees Cook, Igor Stoppa, Mimi Zohar,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > I support the addition of a rare-write mechanism to the upstream kernel.
> > And I think that there is only one sane way to implement it: using an
> > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > differ from init_mm in that it has extra mappings in the *user* region.
> 
> I'd like to understand this approach a little better.  In a syscall path,
> we run with the user task's mm.  What you're proposing is that when we
> want to modify rare data, we switch to rare_mm which contains a
> writable mapping to all the kernel data which is rare-write.
> 
> So the API might look something like this:
> 
> 	void *p = rare_alloc(...);	/* writable pointer */
> 	p->a = x;
> 	q = rare_protect(p);		/* read-only pointer */
> 
> To subsequently modify q,
> 
> 	p = rare_modify(q);
> 	p->a = y;
> 	rare_protect(p);

Why would you have rare_alloc() imply rare_modify()? Would you have the
allocator metadata inside the rare section?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 21:02                         ` Andy Lutomirski
  2018-10-30 21:07                           ` Kees Cook
  2018-10-30 22:15                           ` Igor Stoppa
@ 2018-10-31  9:45                           ` Peter Zijlstra
  2 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31  9:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Igor Stoppa, Matthew Wilcox, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Tue, Oct 30, 2018 at 02:02:12PM -0700, Andy Lutomirski wrote:
> But I dislike allowing regular writes in the protected region. We
> really only need four write primitives:
> 
> 1. Just write one value.  Call at any time (except NMI).
> 
> 2. Just copy some bytes. Same as (1) but any number of bytes.

Given the !preempt/!IRQ constraints I'd certainly put an upper limit on
the number of bytes there.

> 3,4: Same as 1 and 2 but must be called inside a special rare write
> region. This is purely an optimization.
> 
> Actually getting a modifiable pointer should be disallowed for two
> reasons:
> 
> 1. Some architectures may want to use a special
> write-different-address-space operation.

You're thinking of s390 ? :-)

> Heck, x86 could, too: make
> the actual offset be a secret and shove the offset into FSBASE or
> similar. Then %fs-prefixed writes would do the rare writes.

> 2. Alternatively, x86 could set the U bit. Then the actual writes
> would use the uaccess helpers, giving extra protection via SMAP.

Cute, and yes, something like that would be nice.

> We don’t really want a situation where an unchecked pointer in the
> rare write region completely defeats the mechanism.

Agreed.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31  4:41                           ` Andy Lutomirski
  2018-10-31  9:08                             ` Igor Stoppa
@ 2018-10-31 10:02                             ` Peter Zijlstra
  2018-10-31 20:36                               ` Andy Lutomirski
  1 sibling, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31 10:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matthew Wilcox, Igor Stoppa, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
> To clarify some of this thread, I think that the fact that rare_write
> uses an mm_struct and alias mappings under the hood should be
> completely invisible to users of the API.  No one should ever be
> handed a writable pointer to rare_write memory (except perhaps during
> bootup or when initializing a large complex data structure that will
> be rare_write but isn't yet, e.g. the policy db).

Being able to use pointers would make it far easier to do atomics and
other things though.

> For example, there could easily be architectures where having a
> writable alias is problematic.

Mostly we'd just have to be careful of cache aliases, alignment should
be able to sort that I think.

> If you have multiple pools and one mm_struct per pool, you'll need a
> way to find the mm_struct from a given allocation.

Or keep track of it externally. For example by context. If you modify
page-tables you pick the page-table pool, if you modify selinux state,
you pick the selinux pool.

> Regardless of how the mm_structs are set up, changing rare_write
> memory to normal memory or vice versa will require a global TLB flush
> (all ASIDs and global pages) on all CPUs, so having extra mm_structs
> doesn't seem to buy much.

The way I understand it, the point is that if you stick page-tables and
selinux state in different pools, a stray write in one will never affect
the other.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 22:15                           ` Igor Stoppa
@ 2018-10-31 10:11                             ` Peter Zijlstra
  2018-10-31 20:38                               ` Andy Lutomirski
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31 10:11 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Andy Lutomirski, Matthew Wilcox, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
> On 30/10/2018 23:02, Andy Lutomirski wrote:

> > But I dislike allowing regular writes in the protected region. We
> > really only need four write primitives:
> > 
> > 1. Just write one value.  Call at any time (except NMI).
> > 
> > 2. Just copy some bytes. Same as (1) but any number of bytes.
> > 
> > 3,4: Same as 1 and 2 but must be called inside a special rare write
> > region. This is purely an optimization.
> 
> Atomic? RCU?

RCU can be done, that's not really a problem. Atomics otoh are a
problem. Having pointers makes them just work.

Andy; I understand your reason for not wanting them, but I really don't
want to duplicate everything. Is there something we can do with static
analysis to make you more comfortable with the pointer thing?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31  9:36                   ` Peter Zijlstra
@ 2018-10-31 11:33                     ` Matthew Wilcox
  0 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2018-10-31 11:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Kees Cook, Igor Stoppa, Mimi Zohar,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On Wed, Oct 31, 2018 at 10:36:59AM +0100, Peter Zijlstra wrote:
> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > > I support the addition of a rare-write mechanism to the upstream kernel.
> > > And I think that there is only one sane way to implement it: using an
> > > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > > differ from init_mm in that it has extra mappings in the *user* region.
> > 
> > I'd like to understand this approach a little better.  In a syscall path,
> > we run with the user task's mm.  What you're proposing is that when we
> > want to modify rare data, we switch to rare_mm which contains a
> > writable mapping to all the kernel data which is rare-write.
> > 
> > So the API might look something like this:
> > 
> > 	void *p = rare_alloc(...);	/* writable pointer */
> > 	p->a = x;
> > 	q = rare_protect(p);		/* read-only pointer */
> > 
> > To subsequently modify q,
> > 
> > 	p = rare_modify(q);
> > 	p->a = y;
> > 	rare_protect(p);
> 
> Why would you have rare_alloc() imply rare_modify()? Would you have the
> allocator metadata inside the rare section?

Normally when I allocate some memory I need to initialise it before
doing anything else with it ;-)

I mean, you could do:

	ro = rare_alloc(..);
	rare = rare_modify(ro);
	rare->a = x;
	rare_protect(rare);

but that's more typing.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31  9:08                             ` Igor Stoppa
@ 2018-10-31 19:38                               ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-10-31 19:38 UTC (permalink / raw)
  To: Andy Lutomirski, Matthew Wilcox, Paul Moore, Stephen Smalley
  Cc: Tycho Andersen, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, LSM List, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner, selinux



On 31/10/2018 11:08, Igor Stoppa wrote:
> Adding SELinux folks and the SELinux ml
> 
> I think it's better if they participate in this discussion.
> 
> On 31/10/2018 06:41, Andy Lutomirski wrote:
>> On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> 
>> wrote:
>>>
>>> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>>>>> So the API might look something like this:
>>>>>>>
>>>>>>>          void *p = rare_alloc(...);      /* writable pointer */
>>>>>>>          p->a = x;
>>>>>>>          q = rare_protect(p);            /* read-only pointer */
>>>>
>>>> With pools and memory allocated from vmap_areas, I was able to say
>>>>
>>>> protect(pool)
>>>>
>>>> and that would do a sweep of all the pages currently in use.
>>>> In the SELinux policyDB, for example, one doesn't really want to
>>>> individually protect each allocation.
>>>>
>>>> The loading phase happens usually at boot, when the system can be 
>>>> assumed to
>>>> be sane (one might even preload a bare-bone set of rules from 
>>>> initramfs and
>>>> then replace it later on, with the full blown set).
>>>>
>>>> There is no need to process each of these tens of thousands of
>>>> allocations and initializations as write-rare.
>>>>
>>>> Would it be possible to do the same here?
>>>
>>> What Andy is proposing effectively puts all rare allocations into
>>> one pool.  Although I suppose it could be generalised to multiple pools
>>> ... one mm_struct per pool.  Andy, what do you think to doing that?
>>
>> Hmm.  Let's see.
>>
>> To clarify some of this thread, I think that the fact that rare_write
>> uses an mm_struct and alias mappings under the hood should be
>> completely invisible to users of the API. 
> 
> I agree.
> 
>> No one should ever be
>> handed a writable pointer to rare_write memory (except perhaps during
>> bootup or when initializing a large complex data structure that will
>> be rare_write but isn't yet, e.g. the policy db).
> 
> The policy db doesn't need to be write rare.
> Actually, it really shouldn't be write rare.
> 
> Maybe it's just a matter of wording, but effectively the policyDB can be 
> treated with this sequence:
> 
> 1) allocate various data structures in writable form
> 
> 2) initialize them
> 
> 3) go back to 1 as needed
> 
> 4) lock down everything that has been allocated, as Read-Only
> The reason why I stress ReadOnly is that differentiating what is really 
> ReadOnly from what is WriteRare provides an extra edge against attacks, 
> because attempts to alter ReadOnly data through a WriteRare API could be 
> detected
> 
> 5) read any part of the policyDB during regular operations
> 
> 6) in case of update, create a temporary new version, using steps 1..3
> 
> 7) if update successful, use the new one and destroy the old one
> 
> 8) if the update failed, destroy the new one
> 
> The destruction at points 7 and 8 is not so much a write operation, as 
> it is a release of the memory.
> 
> So, we might have a bit different interpretation of what write-rare 
> means wrt destroying the memory and its content.
> 
> To clarify: I've been using write-rare to indicate primarily small 
> operations that one would otherwise achieve with "=", memcpy or memset 
> or more complex variants, like atomic ops, rcu pointer assignment, etc.
> 
> Tearing down an entire set of allocations like the policyDB doesn't fit 
> very well with that model.
> 
> The only part which _needs_ to be write rare, in the policyDB, is the 
> set of pointers which are used to access all the dynamically allocated 
> data set.
> 
> These pointers must be updated when a new policyDB is allocated.
> 
>> For example, there could easily be architectures where having a
>> writable alias is problematic.  On such architectures, an entirely
>> different mechanism might work better.  And, if a tool like KNOX ever
>> becomes a *part* of the Linux kernel (hint hint!)
> 
> Something related, albeit not identical, is going on here [1]
> Eventually, it could be expanded to deal also with write rare.
> 
>> If you have multiple pools and one mm_struct per pool, you'll need a
>> way to find the mm_struct from a given allocation. 
> 
> Indeed. In my patchset, based on vmas, I do the following:
> * a private field from the page struct points to the vma using that page
> * inside the vma there is a list_head used only during deletion
>    - one pointer is used to chain vmas from the same pool
>    - one pointer points to the pool struct
> * the pool struct has the property to use for all the associated 
> allocations: is it write-rare, read-only, does it auto protect, etc.
> 
>> Regardless of how
>> the mm_structs are set up, changing rare_write memory to normal memory
>> or vice versa will require a global TLB flush (all ASIDs and global
>> pages) on all CPUs, so having extra mm_structs doesn't seem to buy
>> much.
> 
> 1) it supports different levels of protection:
>    temporarily unprotected vs read-only vs write-rare
> 
> 2) the change of write permission should be possible only toward more 
> restrictive rules (writable -> write-rare -> read-only) and only to the 
> point that was specified while creating the pool, to avoid DOS attacks, 
> where a write-rare is flipped into read-only and further updates fail
> (ex: prevent IMA from registering modifications to a file, by not 
> letting it store new information - I'm not 100% sure this would work, 
> but it gives the idea, I think)
> 
> 3) being able to track all the allocations related to a pool would allow 
> performing mass operations, like reducing the writability or destroying 
> all the allocations.
> 
>> (It's just possible that changing rare_write back to normal might be
>> able to avoid the flush if the spurious faults can be handled
>> reliably.)
> 
> I do not see the need for such a case of degrading the write permissions 
> of an allocation, unless it refers to the release of a pool of 
> allocations (see updating the SELinux policy DB)

A few more thoughts about pools. Not sure if they are all correct.
Note: I stick to "pool", instead of mm_struct, because what I'll say is 
mostly independent of the implementation.

- As Peter Zijlstra wrote earlier, protecting a target moves the focus 
of the attack to something else. In this case, probably, the "something 
else" would be the metadata of the pool(s).

- The number of pools needed should be known at compile time, so the 
metadata used for pools could be statically allocated and any change to 
it could be treated as write-rare.

- only certain fields of a pool structure would be writable, even as 
write-rare, after the pool is initialized.
In case the pool is a mm_struct or a superset (containing also 
additional properties, like the type of writability: RO or WR), the field

struct vm_area_struct *mmap;

is an example of what could be protected. It should be alterable only 
when creating/destroying the pool and during its first initialization.

- to speed up and also improve the validation of the target of a 
write-rare operation, it would be really desirable if the target had 
some intrinsic property which clearly differentiates it from 
non-write-rare memory. Its address, for example. The amount of write 
rare data needed by a full kernel should not exceed a few tens of 
megabytes. On a 64-bit system it shouldn't be so bad to reserve an 
address range maybe one order of magnitude larger than that.
It could even become a parameter for the creation of a pool.
SELinux, for example, should fit within 10-20MB. Or it could be a 
command line parameter.

- even if a hypervisor were present, it would be preferable to use it 
exclusively as extra protection, which triggers an exception only when 
something abnormal happens. The hypervisor should not become aware of 
the actual meaning of kernel (meta)data. Ideally, it would be mostly 
used for trapping unexpected writes to pages which are not supposed to 
be modified.

- one more reason for using pools is that, if each pool also acted 
as a memory cache for its users, attacks relying on use-after-free would 
not have access to possible vulnerabilities, because the memory and 
addresses associated with a pool would stay with it.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 10:02                             ` Peter Zijlstra
@ 2018-10-31 20:36                               ` Andy Lutomirski
  2018-10-31 21:00                                 ` Peter Zijlstra
  0 siblings, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 20:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Wilcox, Igor Stoppa, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner


> On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
>> To clarify some of this thread, I think that the fact that rare_write
>> uses an mm_struct and alias mappings under the hood should be
>> completely invisible to users of the API.  No one should ever be
>> handed a writable pointer to rare_write memory (except perhaps during
>> bootup or when initializing a large complex data structure that will
>> be rare_write but isn't yet, e.g. the policy db).
> 
> Being able to use pointers would make it far easier to do atomics and
> other things though.

This stuff is called *rare* write for a reason. Do we really want to allow atomics beyond just store-release?  Taking a big lock and then writing in the right order should cover everything, no?

> 
>> For example, there could easily be architectures where having a
>> writable alias is problematic.
> 
> Mostly we'd just have to be careful of cache aliases, alignment should
> be able to sort that I think.
> 
>> If you have multiple pools and one mm_struct per pool, you'll need a
>> way to find the mm_struct from a given allocation.
> 
> Or keep track of it externally. For example by context. If you modify
> page-tables you pick the page-table pool, if you modify selinux state,
> you pick the selinux pool.
> 
>> Regardless of how the mm_structs are set up, changing rare_write
>> memory to normal memory or vice versa will require a global TLB flush
>> (all ASIDs and global pages) on all CPUs, so having extra mm_structs
>> doesn't seem to buy much.
> 
> The way I understand it, the point is that if you stick page-tables and
> selinux state in different pools, a stray write in one will never affect
> the other.
> 

Hmm. That’s not totally crazy, but the API would need to be carefully designed. And some argument would have to be made as to why it’s better to use a different address space as opposed to checking in software along the lines of the uaccess checking.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 10:11                             ` Peter Zijlstra
@ 2018-10-31 20:38                               ` Andy Lutomirski
  2018-10-31 20:53                                 ` Andy Lutomirski
  0 siblings, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 20:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Igor Stoppa, Matthew Wilcox, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner



> On Oct 31, 2018, at 3:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
>> On 30/10/2018 23:02, Andy Lutomirski wrote:
> 
>>> But I dislike allowing regular writes in the protected region. We
>>> really only need four write primitives:
>>> 
>>> 1. Just write one value.  Call at any time (except NMI).
>>> 
>>> 2. Just copy some bytes. Same as (1) but any number of bytes.
>>> 
>>> 3,4: Same as 1 and 2 but must be called inside a special rare write
>>> region. This is purely an optimization.
>> 
>> Atomic? RCU?
> 
> RCU can be done, that's not really a problem. Atomics otoh are a
> problem. Having pointers makes them just work.
> 
> Andy; I understand your reason for not wanting them, but I really don't
> want to duplicate everything. Is there something we can do with static
> analysis to make you more comfortable with the pointer thing?

I’m sure we could do something with static analysis, but I think seeing a real use case where all this fanciness makes sense would be good.

And I don’t know if s390 *can* have an efficient implementation that uses pointers. OTOH they have all kinds of magic stuff, so who knows?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 20:38                               ` Andy Lutomirski
@ 2018-10-31 20:53                                 ` Andy Lutomirski
  0 siblings, 0 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Igor Stoppa, Matthew Wilcox, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner



> On Oct 31, 2018, at 1:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>>> On Oct 31, 2018, at 3:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
>>> On 30/10/2018 23:02, Andy Lutomirski wrote:
>> 
>>>> But I dislike allowing regular writes in the protected region. We
>>>> really only need four write primitives:
>>>> 
>>>> 1. Just write one value.  Call at any time (except NMI).
>>>> 
>>>> 2. Just copy some bytes. Same as (1) but any number of bytes.
>>>> 
>>>> 3,4: Same as 1 and 2 but must be called inside a special rare write
>>>> region. This is purely an optimization.
>>> 
>>> Atomic? RCU?
>> 
>> RCU can be done, that's not really a problem. Atomics otoh are a
>> problem. Having pointers makes them just work.
>> 
>> Andy; I understand your reason for not wanting them, but I really don't
>> want to duplicate everything. Is there something we can do with static
>> analysis to make you more comfortable with the pointer thing?
> 
> I’m sure we could do something with static analysis, but I think seeing a real use case where all this fanciness makes sense would be good.
> 
> And I don’t know if s390 *can* have an efficient implementation that uses pointers. OTOH they have all kinds of magic stuff, so who knows?

Also, if we’re using a hypervisor, then there are a couple ways it could be done:

1. VMFUNC.  Pointers work fine.  This is stronger than any amount of CR3 trickery because it can’t be defeated by page table attacks.

2. A hypercall to do the write. No pointers.

Basically, I think that if we can get away without writable pointers, we get more flexibility and less need for fancy static analysis. If we do need pointers, then so be it.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 20:36                               ` Andy Lutomirski
@ 2018-10-31 21:00                                 ` Peter Zijlstra
  2018-10-31 22:57                                   ` Andy Lutomirski
  0 siblings, 1 reply; 140+ messages in thread
From: Peter Zijlstra @ 2018-10-31 21:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matthew Wilcox, Igor Stoppa, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:
> 
> > On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
> >> To clarify some of this thread, I think that the fact that rare_write
> >> uses an mm_struct and alias mappings under the hood should be
> >> completely invisible to users of the API.  No one should ever be
> >> handed a writable pointer to rare_write memory (except perhaps during
> >> bootup or when initializing a large complex data structure that will
> >> be rare_write but isn't yet, e.g. the policy db).
> > 
> > Being able to use pointers would make it far easier to do atomics and
> > other things though.
> 
> This stuff is called *rare* write for a reason. Do we really want to
> allow atomics beyond just store-release?  Taking a big lock and then
> writing in the right order should cover everything, no?

Ah, so no. That naming is very misleading.

We modify page-tables a _lot_. The point is that only a few sanctioned
sites are allowed writing to it, not everybody.

I _think_ the use-case for atomics is updating the reference counts of
objects that are in this write-rare domain. But I'm not entirely clear
on that myself either. I just really want to avoid duplicating that
stuff.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 21:00                                 ` Peter Zijlstra
@ 2018-10-31 22:57                                   ` Andy Lutomirski
  2018-10-31 23:10                                     ` Igor Stoppa
  2018-11-01 17:08                                     ` Peter Zijlstra
  0 siblings, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 22:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Wilcox, Igor Stoppa, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner



> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:
>> 
>>>> On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> 
>>>> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
>>>> To clarify some of this thread, I think that the fact that rare_write
>>>> uses an mm_struct and alias mappings under the hood should be
>>>> completely invisible to users of the API.  No one should ever be
>>>> handed a writable pointer to rare_write memory (except perhaps during
>>>> bootup or when initializing a large complex data structure that will
>>>> be rare_write but isn't yet, e.g. the policy db).
>>> 
>>> Being able to use pointers would make it far easier to do atomics and
>>> other things though.
>> 
>> This stuff is called *rare* write for a reason. Do we really want to
>> allow atomics beyond just store-release?  Taking a big lock and then
>> writing in the right order should cover everything, no?
> 
> Ah, so no. That naming is very misleading.
> 
> We modify page-tables a _lot_. The point is that only a few sanctioned
> sites are allowed writing to it, not everybody.
> 
> I _think_ the use-case for atomics is updating the reference counts of
> objects that are in this write-rare domain. But I'm not entirely clear
> on that myself either. I just really want to avoid duplicating that
> stuff.

Sounds nuts. Doing a rare-write is many hundreds of cycles at best. Using that for a reference count sounds wacky.

Can we see a *real* use case before we over complicate the API?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 22:57                                   ` Andy Lutomirski
@ 2018-10-31 23:10                                     ` Igor Stoppa
  2018-10-31 23:19                                       ` Andy Lutomirski
  2018-11-01 17:08                                     ` Peter Zijlstra
  1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-31 23:10 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Matthew Wilcox, Tycho Andersen, Kees Cook, Mimi Zohar,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, LSM List, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner



On 01/11/2018 00:57, Andy Lutomirski wrote:
> 
> 
>> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:



>> I _think_ the use-case for atomics is updating the reference counts of
>> objects that are in this write-rare domain. But I'm not entirely clear
>> on that myself either. I just really want to avoid duplicating that
>> stuff.
> 
> Sounds nuts. Doing a rare-write is many hundreds of cycles at best. Using that for a reference count sounds wacky.
> 
> Can we see a *real* use case before we over complicate the API?
> 


Does patch #14 of this set not qualify? ima_htable.len ?

https://www.openwall.com/lists/kernel-hardening/2018/10/23/20

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 23:10                                     ` Igor Stoppa
@ 2018-10-31 23:19                                       ` Andy Lutomirski
  2018-10-31 23:26                                         ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-10-31 23:19 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Peter Zijlstra, Matthew Wilcox, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

On Wed, Oct 31, 2018 at 4:10 PM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
>
>
> On 01/11/2018 00:57, Andy Lutomirski wrote:
> >
> >
> >> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>
>
> >> I _think_ the use-case for atomics is updating the reference counts of
> >> objects that are in this write-rare domain. But I'm not entirely clear
> >> on that myself either. I just really want to avoid duplicating that
> >> stuff.
> >
> > Sounds nuts. Doing a rare-write is many hundreds of cycles at best. Using that for a reference count sounds wacky.
> >
> > Can we see a *real* use case before we over complicate the API?
> >
>
>
> Does patch #14 of this set not qualify? ima_htable.len ?
>
> https://www.openwall.com/lists/kernel-hardening/2018/10/23/20
>

Do you mean this (sorry for whitespace damage):

+ pratomic_long_inc(&ima_htable.len);

- atomic_long_inc(&ima_htable.len);
  if (update_htable) {
    key = ima_hash_key(entry->digest);
-   hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
+   prhlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
  }

ISTM you don't need that atomic operation -- you could take a spinlock
and then just add one directly to the variable.

--Andy

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 23:19                                       ` Andy Lutomirski
@ 2018-10-31 23:26                                         ` Igor Stoppa
  2018-11-01  8:21                                           ` Thomas Gleixner
  0 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-10-31 23:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Matthew Wilcox, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner



On 01/11/2018 01:19, Andy Lutomirski wrote:

> ISTM you don't need that atomic operation -- you could take a spinlock
> and then just add one directly to the variable.

It was my intention to provide a 1:1 conversion of existing code, as it 
should be easier to verify the correctness of the conversion, as long as 
there isn't any significant degradation in performance.

The rework could be done afterward.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 16/17] prmem: pratomic-long
  2018-10-31  9:10           ` Peter Zijlstra
@ 2018-11-01  3:28             ` Kees Cook
  0 siblings, 0 replies; 140+ messages in thread
From: Kees Cook @ 2018-11-01  3:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Igor Stoppa, Mimi Zohar, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, linux-security-module, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Boqun Feng, Arnd Bergmann,
	linux-arch, LKML

On Wed, Oct 31, 2018 at 2:10 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 30, 2018 at 04:28:16PM +0000, Will Deacon wrote:
>> On Tue, Oct 30, 2018 at 04:58:41PM +0100, Peter Zijlstra wrote:
>> > Like mentioned elsewhere; if you do write_enable() + write_disable()
>> > thingies, it all becomes:
>> >
>> >     write_enable();
>> >     atomic_foo(&bar);
>> >     write_disable();
>> >
>> > No magic gunk infested duplication at all. Of course, ideally you'd then
>> > teach objtool about this (or a GCC plugin I suppose) to ensure any
>> > enable reached a disable.
>>
>> Isn't the issue here that we don't want to change the page tables for the
>> mapping of &bar, but instead want to create a temporary writable alias
>> at a random virtual address?
>>
>> So you'd want:
>>
>>       wbar = write_enable(&bar);
>>       atomic_foo(wbar);
>>       write_disable(wbar);
>>
>> which is probably better expressed as a map/unmap API. I suspect this
>> would also be the only way to do things for cmpxchg() loops, where you
>> want to create the mapping outside of the loop to minimise your time in
>> the critical section.
>
> Ah, so I was thinking that the alternative mm would have stuff in the
> same location, just RW instead of RO.

I was hoping for the same location too. That allows us to use a gcc
plugin to mark, say, function pointer tables, as read-only, and
annotate their rare updates with write_rare() without any
recalculation.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 23:26                                         ` Igor Stoppa
@ 2018-11-01  8:21                                           ` Thomas Gleixner
  2018-11-01 15:58                                             ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Thomas Gleixner @ 2018-11-01  8:21 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Andy Lutomirski, Peter Zijlstra, Matthew Wilcox, Tycho Andersen,
	Kees Cook, Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML

Igor,

On Thu, 1 Nov 2018, Igor Stoppa wrote:
> On 01/11/2018 01:19, Andy Lutomirski wrote:
> 
> > ISTM you don't need that atomic operation -- you could take a spinlock
> > and then just add one directly to the variable.
> 
> It was my intention to provide a 1:1 conversion of existing code, as it should
> be easier to verify the correctness of the conversion, as long as there isn't
> any significant degradation in performance.
> 
> The rework could be done afterward.

Please don't go there. The usual approach is to

  1) Rework existing code in a way that the new functionality can be added
     with minimal effort afterwards and without creating API horrors.

  2) Verify correctness of the rework

  3) Add the new functionality
  
That avoids creation of odd functionality and APIs in the first place, so
they won't be used in other places and does not leave half cleaned up code
around which will stick for a long time.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-01  8:21                                           ` Thomas Gleixner
@ 2018-11-01 15:58                                             ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-01 15:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Peter Zijlstra, Matthew Wilcox, Tycho Andersen,
	Kees Cook, Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML

On 01/11/2018 10:21, Thomas Gleixner wrote:
> On Thu, 1 Nov 2018, Igor Stoppa wrote:

>> The rework could be done afterward.
> 
> Please don't go there. The usual approach is to
> 
>    1) Rework existing code in a way that the new functionality can be added
>       with minimal effort afterwards and without creating API horrors.


Thanks a lot for the advice.
It makes things even easier for me, as I can start the rework, while the 
discussion on the actual write-rare mechanism continues.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31  9:08                       ` Peter Zijlstra
@ 2018-11-01 16:31                         ` Nadav Amit
  2018-11-02 21:11                           ` Nadav Amit
  0 siblings, 1 reply; 140+ messages in thread
From: Nadav Amit @ 2018-11-01 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Matthew Wilcox, Kees Cook, Igor Stoppa,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, linux-security-module,
	Igor Stoppa, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

From: Peter Zijlstra
Sent: October 31, 2018 at 9:08:35 AM GMT
> To: Nadav Amit <nadav.amit@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>, Matthew Wilcox <willy@infradead.org>, Kees Cook <keescook@chromium.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> On Tue, Oct 30, 2018 at 04:18:39PM -0700, Nadav Amit wrote:
>>> Nadav, want to resubmit your series? IIRC the only thing wrong with it was
>>> that it was a big change and we wanted a simpler fix to backport. But
>>> that’s all done now, and I, at least, rather liked your code. :)
>> 
>> I guess since it was based on your ideas…
>> 
>> Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
>> make kgdb use of poke_text() less disgusting. I still don’t know exactly
>> how to deal with it.
>> 
>> Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
>> v2.
> 
> Oh man, I completely forgot about kgdb :/
> 
> Also, would it be possible to entirely remove that kmap fallback path?

Let me see what I can do about it.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-31 22:57                                   ` Andy Lutomirski
  2018-10-31 23:10                                     ` Igor Stoppa
@ 2018-11-01 17:08                                     ` Peter Zijlstra
  1 sibling, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2018-11-01 17:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matthew Wilcox, Igor Stoppa, Tycho Andersen, Kees Cook,
	Mimi Zohar, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

On Wed, Oct 31, 2018 at 03:57:21PM -0700, Andy Lutomirski wrote:
> > On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:

> >> This stuff is called *rare* write for a reason. Do we really want to
> >> allow atomics beyond just store-release?  Taking a big lock and then
> >> writing in the right order should cover everything, no?
> > 
> > Ah, so no. That naming is very misleading.
> > 
> > We modify page-tables a _lot_. The point is that only a few sanctioned
> > sites are allowed writing to it, not everybody.
> > 
> > I _think_ the use-case for atomics is updating the reference counts of
> > objects that are in this write-rare domain. But I'm not entirely clear
> > on that myself either. I just really want to avoid duplicating that
> > stuff.
> 
> Sounds nuts. Doing a rare-write is many hundreds of cycles at best.

Yes, which is why I'm somewhat sceptical of the whole endeavour.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-01 16:31                         ` Nadav Amit
@ 2018-11-02 21:11                           ` Nadav Amit
  0 siblings, 0 replies; 140+ messages in thread
From: Nadav Amit @ 2018-11-02 21:11 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Matthew Wilcox, Kees Cook, Igor Stoppa, Mimi Zohar, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

From: Nadav Amit
Sent: November 1, 2018 at 4:31:59 PM GMT
> To: Peter Zijlstra <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@amacapital.net>, Matthew Wilcox <willy@infradead.org>, Kees Cook <keescook@chromium.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> From: Peter Zijlstra
> Sent: October 31, 2018 at 9:08:35 AM GMT
>> To: Nadav Amit <nadav.amit@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>, Matthew Wilcox <willy@infradead.org>, Kees Cook <keescook@chromium.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
>> Subject: Re: [PATCH 10/17] prmem: documentation
>> 
>> 
>> On Tue, Oct 30, 2018 at 04:18:39PM -0700, Nadav Amit wrote:
>>>> Nadav, want to resubmit your series? IIRC the only thing wrong with it was
>>>> that it was a big change and we wanted a simpler fix to backport. But
>>>> that’s all done now, and I, at least, rather liked your code. :)
>>> 
>>> I guess since it was based on your ideas…
>>> 
>>> Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
>>> make kgdb use of poke_text() less disgusting. I still don’t know exactly
>>> how to deal with it.
>>> 
>>> Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
>>> v2.
>> 
>> Oh man, I completely forgot about kgdb :/
>> 
>> Also, would it be possible to entirely remove that kmap fallback path?
> 
> Let me see what I can do about it.

My patches had several embarrassing bugs. I’m fixing and will resend later,
hopefully today.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-10-30 17:06               ` Andy Lutomirski
  2018-10-30 17:58                 ` Matthew Wilcox
@ 2018-11-13 14:25                 ` Igor Stoppa
  2018-11-13 17:16                   ` Andy Lutomirski
  1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 14:25 UTC (permalink / raw)
  To: Andy Lutomirski, Kees Cook, Peter Zijlstra, Nadav Amit
  Cc: Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
	Michal Hocko, Kernel Hardening, linux-integrity,
	linux-security-module, Igor Stoppa, Dave Hansen, Jonathan Corbet,
	Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

Hello,
I've been studying v4 of the patch-set [1] that Nadav has been working on.
Incidentally, I think it would be useful to cc also the 
security/hardening ml.
The patch-set seems to be close to final, so I am resuming this discussion.

On 30/10/2018 19:06, Andy Lutomirski wrote:

> I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.

After reading the code, I see what you meant.
I think I can work with it.

But I have a couple of questions wrt the use of this mechanism, in the 
context of write rare.


1) mm_struct.

Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code 
(live patch?), which seems to happen sequentially and in a relatively 
standardized way, like replacing the NOPs specifically placed in the 
functions that need patching.

This is a bit different from the more generic write-rare case, applied 
to data.

As example, I have in mind a system where both IMA and SELinux are in use.

In this system, a file is accessed for the first time.

That would trigger 2 things:
- evaluation of the SELinux rules and probably update of the AVC cache
- IMA measurement and update of the measurements

Both of them could be write protected, meaning that they would both have 
to be modified through the write rare mechanism.

While the events, for 1 specific file, would be sequential, it's not 
difficult to imagine that multiple files could be accessed at the same time.

If the update of the data structures in both IMA and SELinux must use 
the same mm_struct, that would have to be somehow regulated and it would 
introduce an unnecessary (imho) dependency.

How about having one mm_struct for each writer (core or thread)?



2) Iiuc, the purpose of the 2 pages being remapped is that the target of 
the patch might spill across the page boundary, however if I deal with 
the modification of generic data, I shouldn't (shouldn't I?) assume that 
the data will not span across multiple pages.

If the data spans across multiple pages, in an unknown amount, I suppose 
that I should not keep interrupts disabled for an unknown time, as it 
would hurt preemption.

What I thought, in my initial patch-set, was to iterate over each page 
that must be written to, in a loop, re-enabling interrupts in-between 
iterations, to give pending interrupts a chance to be served.

This would mean that the data being written to would not be consistent, 
but it's a problem that would have to be addressed anyways, since it can 
be still read by other cores, while the write is ongoing.


Is this a valid concern/approach?


--
igor


[1] https://lkml.org/lkml/2018/11/11/110

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 14:25                 ` Igor Stoppa
@ 2018-11-13 17:16                   ` Andy Lutomirski
  2018-11-13 17:43                     ` Nadav Amit
  2018-11-13 18:26                     ` Igor Stoppa
  0 siblings, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 17:16 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Kees Cook, Peter Zijlstra, Nadav Amit, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
> Hello,
> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> Incidentally, I think it would be useful to cc also the
> security/hardening ml.
> The patch-set seems to be close to final, so I am resuming this discussion.
>
> On 30/10/2018 19:06, Andy Lutomirski wrote:
>
> > I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>
> After reading the code, I see what you meant.
> I think I can work with it.
>
> But I have a couple of questions wrt the use of this mechanism, in the
> context of write rare.
>
>
> 1) mm_struct.
>
> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> (live patch?), which seems to happen sequentially and in a relatively
> standardized way, like replacing the NOPs specifically placed in the
> functions that need patching.
>
> This is a bit different from the more generic write-rare case, applied
> to data.
>
> As example, I have in mind a system where both IMA and SELinux are in use.
>
> In this system, a file is accessed for the first time.
>
> That would trigger 2 things:
> - evaluation of the SELinux rules and probably update of the AVC cache
> - IMA measurement and update of the measurements
>
> Both of them could be write protected, meaning that they would both have
> to be modified through the write rare mechanism.
>
> While the events, for 1 specific file, would be sequential, it's not
> difficult to imagine that multiple files could be accessed at the same time.
>
> If the update of the data structures in both IMA and SELinux must use
> the same mm_struct, that would have to be somehow regulated and it would
> introduce an unnecessary (imho) dependency.
>
> How about having one mm_struct for each writer (core or thread)?
>

I don't think that helps anything.  I think the mm_struct used for
prmem (or rare_write or whatever you want to call it) should be
entirely abstracted away by an appropriate API, so neither SELinux nor
IMA need to be aware that there's an mm_struct involved.  It's also
entirely possible that some architectures won't even use an mm_struct
behind the scenes -- x86, for example, could have avoided it if there
were a kernel equivalent of PKRU.  Sadly, there isn't.

>
>
> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> the patch might spill across the page boundary, however if I deal with
> the modification of generic data, I shouldn't (shouldn't I?) assume that
> the data will not span across multiple pages.

The reason for the particular architecture of text_poke() is to avoid
memory allocation to get it working.  I think that prmem/rare_write
should have each rare-writable kernel address map to a unique user
address, possibly just by offsetting everything by a constant.  For
rare_write, you don't actually need it to work as such until fairly
late in boot, since the rare_writable data will just be writable early
on.

>
> If the data spans across multiple pages, in unknown amount, I suppose
> that I should not keep interrupts disabled for an unknown time, as it
> would hurt preemption.
>
> What I thought, in my initial patch-set, was to iterate over each page
> that must be written to, in a loop, re-enabling interrupts in-between
> iterations, to give pending interrupts a chance to be served.
>
> This would mean that the data being written to would not be consistent,
> but it's a problem that would have to be addressed anyways, since it can
> be still read by other cores, while the write is ongoing.

This probably makes sense, except that enabling and disabling
interrupts means you also need to restore the original mm_struct (most
likely), which is slow.  I don't think there's a generic way to check
whether an interrupt is pending without turning interrupts on.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 17:16                   ` Andy Lutomirski
@ 2018-11-13 17:43                     ` Nadav Amit
  2018-11-13 17:47                       ` Andy Lutomirski
  2018-11-13 18:26                     ` Igor Stoppa
  1 sibling, 1 reply; 140+ messages in thread
From: Nadav Amit @ 2018-11-13 17:43 UTC (permalink / raw)
  To: Andy Lutomirski, Igor Stoppa
  Cc: Kees Cook, Peter Zijlstra, Mimi Zohar, Matthew Wilcox,
	Dave Chinner, James Morris, Michal Hocko, Kernel Hardening,
	linux-integrity, LSM List, Igor Stoppa, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

From: Andy Lutomirski
Sent: November 13, 2018 at 5:16:09 PM GMT
> To: Igor Stoppa <igor.stoppa@gmail.com>
> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>> Hello,
>> I've been studying v4 of the patch-set [1] that Nadav has been working on.
>> Incidentally, I think it would be useful to cc also the
>> security/hardening ml.
>> The patch-set seems to be close to final, so I am resuming this discussion.
>> 
>> On 30/10/2018 19:06, Andy Lutomirski wrote:
>> 
>>> I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>> 
>> After reading the code, I see what you meant.
>> I think I can work with it.
>> 
>> But I have a couple of questions wrt the use of this mechanism, in the
>> context of write rare.
>> 
>> 
>> 1) mm_struct.
>> 
>> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
>> (live patch?), which seems to happen sequentially and in a relatively
>> standardized way, like replacing the NOPs specifically placed in the
>> functions that need patching.
>> 
>> This is a bit different from the more generic write-rare case, applied
>> to data.
>> 
>> As example, I have in mind a system where both IMA and SELinux are in use.
>> 
>> In this system, a file is accessed for the first time.
>> 
>> That would trigger 2 things:
>> - evaluation of the SELinux rules and probably update of the AVC cache
>> - IMA measurement and update of the measurements
>> 
>> Both of them could be write protected, meaning that they would both have
>> to be modified through the write rare mechanism.
>> 
>> While the events, for 1 specific file, would be sequential, it's not
>> difficult to imagine that multiple files could be accessed at the same time.
>> 
>> If the update of the data structures in both IMA and SELinux must use
>> the same mm_struct, that would have to be somehow regulated and it would
>> introduce an unnecessary (imho) dependency.
>> 
>> How about having one mm_struct for each writer (core or thread)?
> 
> I don't think that helps anything.  I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it) should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved.  It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU.  Sadly, there isn't.
> 
>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
> 
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working.  i think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant.  For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.
> 
>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>> 
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>> 
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
> 
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow.  I don't think there's a generic way to check
> whether in interrupt is pending without turning interrupts on.

I guess that enabling IRQs might break some hidden assumptions in the code,
but is there a fundamental reason that IRQs need to be disabled? use_mm()
runs with them enabled, although it is only suitable for kernel threads.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 17:43                     ` Nadav Amit
@ 2018-11-13 17:47                       ` Andy Lutomirski
  2018-11-13 18:06                         ` Nadav Amit
  2018-11-13 18:31                         ` Igor Stoppa
  0 siblings, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 17:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Andy Lutomirski
> Sent: November 13, 2018 at 5:16:09 PM GMT
> > To: Igor Stoppa <igor.stoppa@gmail.com>
> > Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> > Subject: Re: [PATCH 10/17] prmem: documentation
> >
> >
> > On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
> >> Hello,
> >> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> >> Incidentally, I think it would be useful to cc also the
> >> security/hardening ml.
> >> The patch-set seems to be close to final, so I am resuming this discussion.
> >>
> >> On 30/10/2018 19:06, Andy Lutomirski wrote:
> >>
> >>> I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
> >>
> >> After reading the code, I see what you meant.
> >> I think I can work with it.
> >>
> >> But I have a couple of questions wrt the use of this mechanism, in the
> >> context of write rare.
> >>
> >>
> >> 1) mm_struct.
> >>
> >> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> >> (live patch?), which seems to happen sequentially and in a relatively
> >> standardized way, like replacing the NOPs specifically placed in the
> >> functions that need patching.
> >>
> >> This is a bit different from the more generic write-rare case, applied
> >> to data.
> >>
> >> As example, I have in mind a system where both IMA and SELinux are in use.
> >>
> >> In this system, a file is accessed for the first time.
> >>
> >> That would trigger 2 things:
> >> - evaluation of the SELinux rules and probably update of the AVC cache
> >> - IMA measurement and update of the measurements
> >>
> >> Both of them could be write protected, meaning that they would both have
> >> to be modified through the write rare mechanism.
> >>
> >> While the events, for 1 specific file, would be sequential, it's not
> >> difficult to imagine that multiple files could be accessed at the same time.
> >>
> >> If the update of the data structures in both IMA and SELinux must use
> >> the same mm_struct, that would have to be somehow regulated and it would
> >> introduce an unnecessary (imho) dependency.
> >>
> >> How about having one mm_struct for each writer (core or thread)?
> >
> > I don't think that helps anything.  I think the mm_struct used for
> > prmem (or rare_write or whatever you want to call it) should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved.  It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU.  Sadly, there isn't.
> >
> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working.  I think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant.  For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
> >
> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow.  I don't think there's a generic way to check
> > whether an interrupt is pending without turning interrupts on.
>
> I guess that enabling IRQs might break some hidden assumptions in the code,
> but is there a fundamental reason that IRQs need to be disabled? use_mm()
> got them enabled, although it is only suitable for kernel threads.
>

For general rare-writish stuff, I don't think we want IRQs running
with them mapped anywhere for write.  For AVC and IMA, I'm less sure.

--Andy

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 17:47                       ` Andy Lutomirski
@ 2018-11-13 18:06                         ` Nadav Amit
  2018-11-13 18:31                         ` Igor Stoppa
  1 sibling, 0 replies; 140+ messages in thread
From: Nadav Amit @ 2018-11-13 18:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Igor Stoppa,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

From: Andy Lutomirski
Sent: November 13, 2018 at 5:47:16 PM GMT
> To: Nadav Amit <nadav.amit@gmail.com>
> Cc: Igor Stoppa <igor.stoppa@gmail.com>, Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Andy Lutomirski
>> Sent: November 13, 2018 at 5:16:09 PM GMT
>>> To: Igor Stoppa <igor.stoppa@gmail.com>
>>> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
>>> Subject: Re: [PATCH 10/17] prmem: documentation
>>> 
>>> 
>>> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>>> Hello,
>>>> I've been studying v4 of the patch-set [1] that Nadav has been working on.
>>>> Incidentally, I think it would be useful to cc also the
>>>> security/hardening ml.
>>>> The patch-set seems to be close to final, so I am resuming this discussion.
>>>> 
>>>> On 30/10/2018 19:06, Andy Lutomirski wrote:
>>>> 
>>>>> I support the addition of a rare-write mechanism to the upstream kernel.  And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>>>> 
>>>> After reading the code, I see what you meant.
>>>> I think I can work with it.
>>>> 
>>>> But I have a couple of questions wrt the use of this mechanism, in the
>>>> context of write rare.
>>>> 
>>>> 
>>>> 1) mm_struct.
>>>> 
>>>> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
>>>> (live patch?), which seems to happen sequentially and in a relatively
>>>> standardized way, like replacing the NOPs specifically placed in the
>>>> functions that need patching.
>>>> 
>>>> This is a bit different from the more generic write-rare case, applied
>>>> to data.
>>>> 
>>>> As example, I have in mind a system where both IMA and SELinux are in use.
>>>> 
>>>> In this system, a file is accessed for the first time.
>>>> 
>>>> That would trigger 2 things:
>>>> - evaluation of the SELinux rules and probably update of the AVC cache
>>>> - IMA measurement and update of the measurements
>>>> 
>>>> Both of them could be write protected, meaning that they would both have
>>>> to be modified through the write rare mechanism.
>>>> 
>>>> While the events, for 1 specific file, would be sequential, it's not
>>>> difficult to imagine that multiple files could be accessed at the same time.
>>>> 
>>>> If the update of the data structures in both IMA and SELinux must use
>>>> the same mm_struct, that would have to be somehow regulated and it would
>>>> introduce an unnecessary (imho) dependency.
>>>> 
>>>> How about having one mm_struct for each writer (core or thread)?
>>> 
>>> I don't think that helps anything.  I think the mm_struct used for
>>> prmem (or rare_write or whatever you want to call it) should be
>>> entirely abstracted away by an appropriate API, so neither SELinux nor
>>> IMA need to be aware that there's an mm_struct involved.  It's also
>>> entirely possible that some architectures won't even use an mm_struct
>>> behind the scenes -- x86, for example, could have avoided it if there
>>> were a kernel equivalent of PKRU.  Sadly, there isn't.
>>> 
>>>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>>>> the patch might spill across the page boundary, however if I deal with
>>>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>>>> the data will not span across multiple pages.
>>> 
>>> The reason for the particular architecture of text_poke() is to avoid
>>> memory allocation to get it working.  I think that prmem/rare_write
>>> should have each rare-writable kernel address map to a unique user
>>> address, possibly just by offsetting everything by a constant.  For
>>> rare_write, you don't actually need it to work as such until fairly
>>> late in boot, since the rare_writable data will just be writable early
>>> on.
>>> 
>>>> If the data spans across multiple pages, in unknown amount, I suppose
>>>> that I should not keep interrupts disabled for an unknown time, as it
>>>> would hurt preemption.
>>>> 
>>>> What I thought, in my initial patch-set, was to iterate over each page
>>>> that must be written to, in a loop, re-enabling interrupts in-between
>>>> iterations, to give pending interrupts a chance to be served.
>>>> 
>>>> This would mean that the data being written to would not be consistent,
>>>> but it's a problem that would have to be addressed anyways, since it can
>>>> be still read by other cores, while the write is ongoing.
>>> 
>>> This probably makes sense, except that enabling and disabling
>>> interrupts means you also need to restore the original mm_struct (most
>>> likely), which is slow.  I don't think there's a generic way to check
>>> whether an interrupt is pending without turning interrupts on.
>> 
>> I guess that enabling IRQs might break some hidden assumptions in the code,
>> but is there a fundamental reason that IRQs need to be disabled? use_mm()
>> got them enabled, although it is only suitable for kernel threads.
> 
> For general rare-writish stuff, I don't think we want IRQs running
> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.

Oh... of course. It is sort of a measure to prevent a single malicious/faulty
write from corrupting the sensitive memory. Doing so limits the amount of code
that could compromise security by writing to the protected data structures
(rephrasing for myself).

I think I should add it as a comment in your patch.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 17:16                   ` Andy Lutomirski
  2018-11-13 17:43                     ` Nadav Amit
@ 2018-11-13 18:26                     ` Igor Stoppa
  2018-11-13 18:35                       ` Andy Lutomirski
  1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 18:26 UTC (permalink / raw)
  To: Andy Lutomirski, Igor Stoppa
  Cc: Kees Cook, Peter Zijlstra, Nadav Amit, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On 13/11/2018 19:16, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:

[...]

>> How about having one mm_struct for each writer (core or thread)?
>>
> 
> I don't think that helps anything.  I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it)

write_rare / rarely can be shortened to wr_, which is somewhat less
confusing than rare_write, since that would become rw_ and be easier to
confuse with R/W.

Any advice for better naming is welcome.

> should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved.

Yes, that is fine. In my proposal I was thinking about tying it to the
core/thread that performs the actual write.

The high level API could be something like:

wr_memcpy(void *src, void *dst, uint_t size)

>  It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU.  Sadly, there isn't.

The mm_struct - or whatever is the means to do the write on that
architecture - can be kept hidden from the API.

But the reason why I was proposing to have one mm_struct per writer is
that, iiuc, the secondary mapping is created in the secondary mm_struct
for each writer using it.

So the updating of IMA measurements would have, theoretically, also
write access to the SELinux AVC. Which I was trying to avoid.
And similarly any other write rare updater. Is this correct?

>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
> 
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working.  I think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant.  For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.

Yes, that is true. I think it's safe to assume, from an attack pattern,
that as long as user space is not started, the system can be considered
ok. Even user-space code run from initrd should be ok, since it can be
bundled (and signed) as a single binary with the kernel.

Modules loaded from a regular filesystem are a bit more risky, because
an attack might inject a rogue key in the key-ring and use it to load
malicious modules.

>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>>
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>>
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
> 
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow.  I don't think there's a generic way to check
> whether an interrupt is pending without turning interrupts on.

The only "excuse" I have is that write_rare is opt-in and is "rare".
Maybe the enabling/disabling of interrupts - and the consequent switch
of mm_struct - could be somehow tied to the latency configuration?

If preemption is disabled, the expectations on the system latency are
anyway more relaxed.

But I'm not sure how it would work against I/O.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 17:47                       ` Andy Lutomirski
  2018-11-13 18:06                         ` Nadav Amit
@ 2018-11-13 18:31                         ` Igor Stoppa
  2018-11-13 18:33                           ` Igor Stoppa
  2018-11-13 18:48                           ` Andy Lutomirski
  1 sibling, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 18:31 UTC (permalink / raw)
  To: Andy Lutomirski, Nadav Amit
  Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On 13/11/2018 19:47, Andy Lutomirski wrote:

> For general rare-writish stuff, I don't think we want IRQs running
> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.

Why would these be less sensitive?

But I see a big difference between my initial implementation and this one.

In my case, by using a shared mapping, visible to all cores, freezing
the core that is performing the write would have exposed the writable
mapping to a potential attack run from another core.

If the mapping is private to the core performing the write, even if it
is frozen, it's much harder to figure out what it had mapped and where,
from another core.

To access that mapping, the attack should be performed from the ISR, I
think.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 18:31                         ` Igor Stoppa
@ 2018-11-13 18:33                           ` Igor Stoppa
  2018-11-13 18:36                             ` Andy Lutomirski
  2018-11-13 18:48                           ` Andy Lutomirski
  1 sibling, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 18:33 UTC (permalink / raw)
  To: Andy Lutomirski, Nadav Amit
  Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

I forgot one sentence :-(

On 13/11/2018 20:31, Igor Stoppa wrote:
> On 13/11/2018 19:47, Andy Lutomirski wrote:
> 
>> For general rare-writish stuff, I don't think we want IRQs running
>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
> 
> Why would these be less sensitive?
> 
> But I see a big difference between my initial implementation and this one.
> 
> In my case, by using a shared mapping, visible to all cores, freezing
> the core that is performing the write would have exposed the writable
> mapping to a potential attack run from another core.
> 
> If the mapping is private to the core performing the write, even if it
> is frozen, it's much harder to figure out what it had mapped and where,
> from another core.
> 
> To access that mapping, the attack should be performed from the ISR, I
> think.

Unless the secondary mapping is also available to other cores, through
the shared mm_struct ?

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 18:26                     ` Igor Stoppa
@ 2018-11-13 18:35                       ` Andy Lutomirski
  2018-11-13 19:01                         ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 18:35 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Nadav Amit, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> On 13/11/2018 19:16, Andy Lutomirski wrote:
>
> > should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved.
>
> Yes, that is fine. In my proposal I was thinking about tying it to the
> core/thread that performs the actual write.
>
> The high level API could be something like:
>
> wr_memcpy(void *src, void *dst, uint_t size)
>
> >  It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU.  Sadly, there isn't.
>
> The mm_struct - or whatever is the means to do the write on that
> architecture - can be kept hidden from the API.
>
> But the reason why I was proposing to have one mm_struct per writer is
> that, iiuc, the secondary mapping is created in the secondary mm_struct
> for each writer using it.
>
> So the updating of IMA measurements would have, theoretically, also
> write access to the SELinux AVC. Which I was trying to avoid.
> And similarly any other write rare updater. Is this correct?

If you call a wr_memcpy() function with the signature you suggested,
then you can overwrite any memory of this type.  Having a different
mm_struct under the hood makes no difference.  As far as I'm
concerned, for *dynamically allocated* rare-writable memory, you might
as well allocate all the paging structures at the same time, so the
mm_struct will always contain the mappings.  If there are serious bugs
in wr_memcpy() that cause it to write to the wrong place, we have
bigger problems.

I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
but that can wait for v2.  And it still doesn't obviously need
multiple mm_structs.

>
> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working.  I think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant.  For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
>
> Yes, that is true. I think it's safe to assume, from an attack pattern,
> that as long as user space is not started, the system can be considered
> ok. Even user-space code run from initrd should be ok, since it can be
> bundled (and signed) as a single binary with the kernel.
>
> Modules loaded from a regular filesystem are a bit more risky, because
> an attack might inject a rogue key in the key-ring and use it to load
> malicious modules.

If a malicious module is loaded, the game is over.

>
> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow.  I don't think there's a generic way to check
> > whether an interrupt is pending without turning interrupts on.
>
> The only "excuse" I have is that write_rare is opt-in and is "rare".
> Maybe the enabling/disabling of interrupts - and the consequent switch
> of mm_struct - could be somehow tied to the latency configuration?
>
> If preemption is disabled, the expectations on the system latency are
> anyway more relaxed.
>
> But I'm not sure how it would work against I/O.

I think it's entirely reasonable for the API to internally break up
very large memcpys.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 18:33                           ` Igor Stoppa
@ 2018-11-13 18:36                             ` Andy Lutomirski
  2018-11-13 19:03                               ` Igor Stoppa
  2018-11-21 16:34                               ` Igor Stoppa
  0 siblings, 2 replies; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 18:36 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Nadav Amit, Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> I forgot one sentence :-(
>
> On 13/11/2018 20:31, Igor Stoppa wrote:
> > On 13/11/2018 19:47, Andy Lutomirski wrote:
> >
> >> For general rare-writish stuff, I don't think we want IRQs running
> >> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
> >
> > Why would these be less sensitive?
> >
> > But I see a big difference between my initial implementation and this one.
> >
> > In my case, by using a shared mapping, visible to all cores, freezing
> > the core that is performing the write would have exposed the writable
> > mapping to a potential attack run from another core.
> >
> > If the mapping is private to the core performing the write, even if it
> > is frozen, it's much harder to figure out what it had mapped and where,
> > from another core.
> >
> > To access that mapping, the attack should be performed from the ISR, I
> > think.
>
> Unless the secondary mapping is also available to other cores, through
> the shared mm_struct ?
>

I don't think this matters much.  The other cores will only be able to
use that mapping when they're doing a rare write.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 18:31                         ` Igor Stoppa
  2018-11-13 18:33                           ` Igor Stoppa
@ 2018-11-13 18:48                           ` Andy Lutomirski
  2018-11-13 19:35                             ` Igor Stoppa
  1 sibling, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-13 18:48 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Nadav Amit, Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> On 13/11/2018 19:47, Andy Lutomirski wrote:
>
> > For general rare-writish stuff, I don't think we want IRQs running
> > with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>
> Why would these be less sensitive?

I'm not really saying they're less sensitive so much as that the
considerations are different.  I think the original rare-write code is
based on ideas from grsecurity, and it was intended to protect static
data like structs full of function pointers.   Those targets have some
different properties:

 - Static targets are at addresses that are much more guessable, so
they're easier targets for most attacks.  (Not spraying attacks like
the ones you're interested in, though.)

 - Static targets are higher value.  No offense to IMA or AVC, but
outright execution of shellcode, hijacking of control flow, or complete
disablement of core security features is higher impact than bypassing
SELinux or IMA.  Why would you bother corrupting the AVC if you could
instead just set enforcing=0?  (I suppose that corrupting the AVC is
less likely to be noticed by monitoring tools.)

 - Static targets are small.  This means that the interrupt latency
would be negligible, especially in comparison to the latency of
replacing the entire SELinux policy object.

Anyway, I'm not all that familiar with SELinux under the hood, but I'm
wondering if a different approach to things like the policy database
might be appropriate.  When the policy is changed, rather than
allocating rare-write memory and writing to it, what if we instead
allocated normal memory, wrote to it, write-protected it, and then
used the rare-write infrastructure to do a much smaller write to
replace the pointer?

Admittedly, this creates a window where another core could corrupt the
data as it's being written.  That may not matter so much if an
attacker can't force a policy update.  Alternatively, the update code
could re-verify the policy after write-protecting it, or there could
be a fancy API to allocate some temporarily-writable memory (by
creating a whole new mm_struct, mapping the memory writable just in
that mm_struct, and activating it) so that only the actual policy
loader could touch the memory.  But I'm mostly speculating here, since
I'm not familiar with the code in question.

Anyway, I tend to think that the right way to approach mainlining all
this is to first get the basic rare write support for static data into
place and then to build on that.  I think it's great that you're
pushing this effort, but doing this for SELinux and IMA is a bigger
project than doing it for static data, and it might make sense to do
it in bite-sized pieces.

Does any of this make sense?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 18:35                       ` Andy Lutomirski
@ 2018-11-13 19:01                         ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 19:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Igor Stoppa, Kees Cook, Peter Zijlstra, Nadav Amit, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner



On 13/11/2018 20:35, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:

[...]

>> The high level API could be something like:
>>
>> wr_memcpy(void *src, void *dst, uint_t size)

[...]

> If you call a wr_memcpy() function with the signature you suggested,
> then you can overwrite any memory of this type.  Having a different
> mm_struct under the hood makes no difference.  As far as I'm
> concerned, for *dynamically allocated* rare-writable memory, you might
> as well allocate all the paging structures at the same time, so the
> mm_struct will always contain the mappings.  If there are serious bugs
> in wr_memcpy() that cause it to write to the wrong place, we have
> bigger problems.

Beside bugs, I'm also thinking about possible vulnerability.
It might be overthinking, though.

I do not think it's possible to really protect against control flow
attacks, unless there is some support from the HW and/or the compiler.

What is left, are data-based attacks. In this case, it would be an
attacker using one existing wr_ invocation with doctored parameters.

However, there is always the objection that it would be possible to come
up with a "writing kit" for plowing through the page tables and
unprotecting anything that might be of value.

Ideally, that should be the only type of data-based attack left.

In practice, it might just be an excess of paranoia from my side.

> I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
> but that can wait for v2.  And it still doesn't obviously need
> multiple mm_structs.

I left that out, for now, but yes, typing would play some role here.

[...]

> I think it's entirely reasonable for the API to internally break up
> very large memcpys.

The criteria for deciding if/how to break it down are not clear to me,
though. A single page was an easy choice, but it might be (probably is)
too much.

--
igor

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 18:36                             ` Andy Lutomirski
@ 2018-11-13 19:03                               ` Igor Stoppa
  2018-11-21 16:34                               ` Igor Stoppa
  1 sibling, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 19:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On 13/11/2018 20:36, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:

[...]

>> Unless the secondary mapping is also available to other cores, through
>> the shared mm_struct ?
>>
> 
> I don't think this matters much.  The other cores will only be able to
> use that mapping when they're doing a rare write.

Which has now been promoted to a target for attacks :-)

--
igor



* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 18:48                           ` Andy Lutomirski
@ 2018-11-13 19:35                             ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-13 19:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Igor Stoppa, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

On 13/11/2018 20:48, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>
>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>
>>> For general rare-writish stuff, I don't think we want IRQs running
>>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>>
>> Why would these be less sensitive?
> 
> I'm not really saying they're less sensitive so much as that the
> considerations are different.  I think the original rare-write code is
> based on ideas from grsecurity, and it was intended to protect static
> data like structs full of function pointers.   Those targets have some
> different properties:
> 
>  - Static targets are at addresses that are much more guessable, so
> they're easier targets for most attacks.  (Not spraying attacks like
> the ones you're interested in, though.)
> 
>  - Static targets are higher value.  No offense to IMA or AVC, but
> outright execution of shellcode, hijacking of control flow, or complete
> disablement of core security features is higher impact than bypassing
> SELinux or IMA.  Why would you bother corrupting the AVC if you could
> instead just set enforcing=0?  (I suppose that corrupting the AVC is
> less likely to be noticed by monitoring tools.)
> 
>  - Static targets are small.  This means that the interrupt latency
> would be negligible, especially in comparison to the latency of
> replacing the entire SELinux policy object.

Your analysis is correct.
In my case, having already taken care of those, I was *also* going after
the next target in line.
Admittedly, flipping a bit located at a fixed offset is way easier than
spraying dynamically allocated data structures.

However, once the bit is not easily writable, the only options are to
either find another way to flip it (unprotect it or subvert something
that can write it) or to identify another target that is still writable.

AVC and policyDB fit the latter description.

> Anyway, I'm not all that familiar with SELinux under the hood, but I'm
> wondering if a different approach to thinks like the policy database
> might be appropriate.  When the policy is changed, rather than
> allocating rare-write memory and writing to it, what if we instead
> allocated normal memory, wrote to it, write-protected it, and then
> used the rare-write infrastructure to do a much smaller write to
> replace the pointer?

Actually, that's exactly what I did.

I did not want to overload this discussion, but since you brought it up,
I'm not sure write rare is enough.

* write_rare is for data that keeps changing, sometimes constantly, ex: the AVC

* dynamic read only is for stuff that at some point should not be
modified anymore, but could still be destroyed. Ex: policyDB

I think it would be good to differentiate, at runtime, between the two,
to minimize the chance that a write_rare function is used against some
read_only data.

Releasing dynamically allocated protected memory is also a big topic.
In some cases it's allocated and released continuously, like in the AVC.
Maybe it can be optimized, or maybe it can be turned into an object
cache of protected objects.

But for releasing, it would be good, I think, to have a mechanism for
freeing all the memory in one loop, like having a pool containing all
the memory that was allocated for a specific use (ex: policyDB).

> Admittedly, this creates a window where another core could corrupt the
> data as it's being written.  That may not matter so much if an
> attacker can't force a policy update.  Alternatively, the update code
> could re-verify the policy after write-protecting it, or there could
> be a fancy API to allocate some temporarily-writable memory (by
> creating a whole new mm_struct, mapping the memory writable just in
> that mm_struct, and activating it) so that only the actual policy
> loader could touch the memory.  But I'm mostly speculating here, since
> I'm not familiar with the code in question.

They are all corner cases ... possible but unlikely.
Another, maybe more critical, one is that the policyDB is not available
at boot.
There is a window of opportunity before it's loaded, but it could be
mitigated by loading a barebones set of rules, either from initrd or even
as "firmware".

> Anyway, I tend to think that the right way to approach mainlining all
> this is to first get the basic rare write support for static data into
> place and then to build on that.  I think it's great that you're
> pushing this effort, but doing this for SELinux and IMA is a bigger
> project than doing it for static data, and it might make sense to do
> it in bite-sized pieces.
> 
> Does any of this make sense?

Yes, sure.

I *have* to do SELinux, but I do not necessarily have to wait for the
final version to be merged upstream. And anyways Android is on a
different kernel.

However, I think both SELinux and IMA have a value in being sufficiently
complex cases to be used for validating the design as it evolves.

Each of them has static data that could be the first target for
protection, in a smaller patch.

Lists of write rare data are probably the next big thing, in terms of
defining the API.

But I could start with introducing __wr_after_init.

--
igor


* Re: [PATCH 10/17] prmem: documentation
  2018-11-13 18:36                             ` Andy Lutomirski
  2018-11-13 19:03                               ` Igor Stoppa
@ 2018-11-21 16:34                               ` Igor Stoppa
  2018-11-21 17:36                                 ` Nadav Amit
  2018-11-21 18:15                                 ` Andy Lutomirski
  1 sibling, 2 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-21 16:34 UTC (permalink / raw)
  To: Andy Lutomirski, Igor Stoppa
  Cc: Nadav Amit, Kees Cook, Peter Zijlstra, Mimi Zohar,
	Matthew Wilcox, Dave Chinner, James Morris, Michal Hocko,
	Kernel Hardening, linux-integrity, LSM List, Dave Hansen,
	Jonathan Corbet, Laura Abbott, Randy Dunlap, Mike Rapoport,
	open list:DOCUMENTATION, LKML, Thomas Gleixner

Hi,

On 13/11/2018 20:36, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>
>> I forgot one sentence :-(
>>
>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>
>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>>>
>>> Why would these be less sensitive?
>>>
>>> But I see a big difference between my initial implementation and this one.
>>>
>>> In my case, by using a shared mapping, visible to all cores, freezing
>>> the core that is performing the write would have exposed the writable
>>> mapping to a potential attack run from another core.
>>>
>>> If the mapping is private to the core performing the write, even if it
>>> is frozen, it's much harder to figure out what it had mapped and where,
>>> from another core.
>>>
>>> To access that mapping, the attack should be performed from the ISR, I
>>> think.
>>
>> Unless the secondary mapping is also available to other cores, through
>> the shared mm_struct ?
>>
> 
> I don't think this matters much.  The other cores will only be able to
> use that mapping when they're doing a rare write.


I'm still mulling over this.
There might be other reasons for replicating the mm_struct.

If I understand correctly how the text patching works, it happens 
sequentially, because of the text_mutex used by arch_jump_label_transform().

That might be fine for this specific case, but I think I shouldn't 
introduce a global mutex when it comes to data.
Most likely, if two or more cores want to perform a write rare 
operation, there is no correlation between them, so they could proceed in 
parallel. And if there really is one, then the user of the API should 
introduce its own locking, for that specific case.

A somewhat unrelated question about text patching: I see that each 
patching operation is validated, but wouldn't it be more robust to first 
validate all of them, and only after they are all found to be 
compliant, proceed with the actual modifications?

And about the actual implementation of the write rare for the statically 
allocated variables, is it expected that I use Nadav's function?
Or that I refactor the code?

The name, referring to text, would definitely not be OK for data.
I would also have to generalize it, to deal with larger amounts of data.

I would find it easier, as first cut, to replicate its behavior and 
refactor only later, once it has stabilized and possibly Nadav's patches 
have been acked.

--
igor


* Re: [PATCH 10/17] prmem: documentation
  2018-11-21 16:34                               ` Igor Stoppa
@ 2018-11-21 17:36                                 ` Nadav Amit
  2018-11-21 18:01                                   ` Igor Stoppa
  2018-11-21 18:15                                 ` Andy Lutomirski
  1 sibling, 1 reply; 140+ messages in thread
From: Nadav Amit @ 2018-11-21 17:36 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Andy Lutomirski, Igor Stoppa, Kees Cook, Peter Zijlstra,
	Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
	Michal Hocko, Kernel Hardening, linux-integrity, LSM List,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

> On Nov 21, 2018, at 8:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
> 
> Hi,
> 
> On 13/11/2018 20:36, Andy Lutomirski wrote:
>> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>> I forgot one sentence :-(
>>> 
>>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>> 
>>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>>>> 
>>>> Why would these be less sensitive?
>>>> 
>>>> But I see a big difference between my initial implementation and this one.
>>>> 
>>>> In my case, by using a shared mapping, visible to all cores, freezing
>>>> the core that is performing the write would have exposed the writable
>>>> mapping to a potential attack run from another core.
>>>> 
>>>> If the mapping is private to the core performing the write, even if it
>>>> is frozen, it's much harder to figure out what it had mapped and where,
>>>> from another core.
>>>> 
>>>> To access that mapping, the attack should be performed from the ISR, I
>>>> think.
>>> 
>>> Unless the secondary mapping is also available to other cores, through
>>> the shared mm_struct ?
>> I don't think this matters much.  The other cores will only be able to
>> use that mapping when they're doing a rare write.
> 
> 
> I'm still mulling over this.
> There might be other reasons for replicating the mm_struct.
> 
> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
> 
> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce own locking, for that specific case.

I think that if you create per-CPU temporary mms as you proposed, you do not
need a global lock. You would need to protect against module unloading, and
maybe refactor the code. Specifically, I’m not sure whether protection
against IRQs is something that you need. I’m also not familiar with this
patch-set, so I’m not sure what atomicity guarantees you need.

> A bit unrelated question, related to text patching: I see that each patching operation is validated, but wouldn't it be more robust to first validate  all of then, and only after they are all found to be compliant, to proceed with the actual modifications?
> 
> And about the actual implementation of the write rare for the statically allocated variables, is it expected that I use Nadav's function?

It’s not “my” function. ;-)

IMHO the code is in relatively good and stable state. The last couple of
versions were due to me being afraid to add BUG_ONs as Peter asked me to.

The code is rather simple, but there are a couple of pitfalls that hopefully
I avoided correctly.

Regards,
Nadav


* Re: [PATCH 10/17] prmem: documentation
  2018-11-21 17:36                                 ` Nadav Amit
@ 2018-11-21 18:01                                   ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-11-21 18:01 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Igor Stoppa, Kees Cook, Peter Zijlstra,
	Mimi Zohar, Matthew Wilcox, Dave Chinner, James Morris,
	Michal Hocko, Kernel Hardening, linux-integrity, LSM List,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner



On 21/11/2018 19:36, Nadav Amit wrote:
>> On Nov 21, 2018, at 8:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:

[...]

>> There might be other reasons for replicating the mm_struct.
>>
>> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
>>
>> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
>> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce own locking, for that specific case.
> 
> I think that if you create per-CPU temporary mms as you proposed, you do not
> need a global lock. You would need to protect against module unloading,

Yes, it's unlikely to happen, and it's probably a bug in the module if it 
tries to write while being unloaded, but I can do it.

> and
> maybe refactor the code. Specifically, I’m not sure whether protection
> against IRQs is something that you need.

With my initial implementation of write rare, which created a temporary 
mapping visible to every core, disabling IRQs was meant to prevent the 
"writer" core from being frozen and the mappings then being scrubbed for 
the page left in a writable state.

Without a shared mapping of the page, the only way to attack it should be 
to generate an interrupt on the "writer" core while the writing is 
ongoing, and to perform the attack from the interrupt itself, because it 
runs on the same core that has the writable mapping.

Maybe it's possible, but it seems to have become quite a corner case.

> I’m also not familiar with this
> patch-set so I’m not sure what atomicity guarantees do you need.

At the very least, I think I need to ensure that pointers are updated 
atomically, like with WRITE_ONCE(), and spinlocks.
Maybe atomic types can be left out.

>> A bit unrelated question, related to text patching: I see that each patching operation is validated, but wouldn't it be more robust to first validate  all of then, and only after they are all found to be compliant, to proceed with the actual modifications?
>>
>> And about the actual implementation of the write rare for the statically allocated variables, is it expected that I use Nadav's function?
> 
> It’s not “my” function. ;-)

:-P

ok, what I meant is that the signature of the __text_poke() function is 
quite specific to what it's meant to do.

I do not rule out that it might be eventually refactored as a special 
case of a more generic __write_rare() function, that would operate on 
different targets, but I'd rather do the refactoring after I have a 
clear understanding of how to alter write-protected data.

The refactoring could be the last patch of the write rare patchset.

> IMHO the code is in relatively good and stable state. The last couple of
> versions were due to me being afraid to add BUG_ONs as Peter asked me to.
> 
> The code is rather simple, but there are a couple of pitfalls that hopefully
> I avoided correctly.

Yes, I did not mean to question the quality of the code, but I'd prefer 
not to have to also carry this patchset while it's not yet merged.

I actually hope it gets merged soon :-)
--
igor



* Re: [PATCH 10/17] prmem: documentation
  2018-11-21 16:34                               ` Igor Stoppa
  2018-11-21 17:36                                 ` Nadav Amit
@ 2018-11-21 18:15                                 ` Andy Lutomirski
  2018-11-22 19:27                                   ` Igor Stoppa
  1 sibling, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-21 18:15 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Andy Lutomirski, Igor Stoppa, Nadav Amit, Kees Cook,
	Peter Zijlstra, Mimi Zohar, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	LSM List, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner



> On Nov 21, 2018, at 9:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
> 
> Hi,
> 
>>> On 13/11/2018 20:36, Andy Lutomirski wrote:
>>> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>> 
>>> I forgot one sentence :-(
>>> 
>>>>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>>> 
>>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>>>> 
>>>> Why would these be less sensitive?
>>>> 
>>>> But I see a big difference between my initial implementation and this one.
>>>> 
>>>> In my case, by using a shared mapping, visible to all cores, freezing
>>>> the core that is performing the write would have exposed the writable
>>>> mapping to a potential attack run from another core.
>>>> 
>>>> If the mapping is private to the core performing the write, even if it
>>>> is frozen, it's much harder to figure out what it had mapped and where,
>>>> from another core.
>>>> 
>>>> To access that mapping, the attack should be performed from the ISR, I
>>>> think.
>>> 
>>> Unless the secondary mapping is also available to other cores, through
>>> the shared mm_struct ?
>> I don't think this matters much.  The other cores will only be able to
>> use that mapping when they're doing a rare write.
> 
> 
> I'm still mulling over this.
> There might be other reasons for replicating the mm_struct.
> 
> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
> 
> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce own locking, for that specific case.

Text patching uses the same VA for different physical addresses, so it needs a mutex to avoid conflicts. I think that, for rare writes, you should just map each rare-writable address at a *different* VA.  You’ll still need a mutex (mmap_sem) to synchronize allocation and freeing of rare-writable ranges, but that shouldn’t have much contention.

> 
> A bit unrelated question, related to text patching: I see that each patching operation is validated, but wouldn't it be more robust to first validate  all of then, and only after they are all found to be compliant, to proceed with the actual modifications?
> 
> And about the actual implementation of the write rare for the statically allocated variables, is it expected that I use Nadav's function?
> Or that I refactor the code?

I would either refactor it or create a new function to handle the write. The main thing that Nadav is adding that I think you’ll want to use is the infrastructure for temporarily switching mms from a non-kernel-thread context.

> 
> The name, referring to text would definitely not be ok for data.
> And I would have to also generalize it, to deal with larger amounts of data.
> 
> I would find it easier, as first cut, to replicate its behavior and refactor only later, once it has stabilized and possibly Nadav's patches have been acked.
> 

Sounds reasonable. You’ll still want Nadav’s code for setting up the mm in the first place, though.

> --
> igor


* Re: [PATCH 10/17] prmem: documentation
  2018-11-21 18:15                                 ` Andy Lutomirski
@ 2018-11-22 19:27                                   ` Igor Stoppa
  2018-11-22 20:04                                     ` Matthew Wilcox
  0 siblings, 1 reply; 140+ messages in thread
From: Igor Stoppa @ 2018-11-22 19:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Igor Stoppa, Nadav Amit, Kees Cook,
	Peter Zijlstra, Mimi Zohar, Matthew Wilcox, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	LSM List, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On 21/11/2018 20:15, Andy Lutomirski wrote:

>> On Nov 21, 2018, at 9:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:

[...]

>> There might be other reasons for replicating the mm_struct.
>>
>> If I understand correctly how the text patching works, it happens sequentially, because of the text_mutex used by arch_jump_label_transform
>>
>> Which might be fine for this specific case, but I think I shouldn't introduce a global mutex, when it comes to data.
>> Most likely, if two or more cores want to perform a write rare operation, there is no correlation between them, they could proceed in parallel. And if there really is, then the user of the API should introduce own locking, for that specific case.
> 
> Text patching uses the same VA for different physical addresses, so it need a mutex to avoid conflicts. I think that, for rare writes, you should just map each rare-writable address at a *different* VA.  You’ll still need a mutex (mmap_sem) to synchronize allocation and freeing of rare-writable ranges, but that shouldn’t have much contention.

I have studied the code involved with Nadav's patchset.
I am perplexed about these sentences you wrote.

More to the point (to the best of my understanding):

poking_init()
-------------
   1. it gets one random poking address and ensures there are at least 2
      consecutive PTEs from the same PMD
   2. it then proceeds to map/unmap an address from the first of the 2
      consecutive PTEs, so that, later on, there will be no need to
      allocate pages, which might fail, if poking from atomic context.
   3. at this point, the page tables are populated, for the address that
      was obtained at point 1, and this is ok, because the address is fixed

write_rare
----------
   4. it can happen on any available core / thread at any time, therefore
      each of them needs a different address
   5. CPUs support hotplug, but from what I have read, I might get away
      with having up to nr_cpus different addresses (determined at init)
      and I would follow the same technique used by Nadav, of forcing the
      mapping of 1 or 2 (1 could be enough, I have to loop anyway, at
      some point) pages at each address, to ensure the population of the
      page tables

so far, so good, but ...

   6. the addresses used by each CPU are fixed
   7. I do not understand the reference you make to
      "allocation and freeing of rare-writable ranges", because if I
      treat the range as such, then there is a risk that I need to
      populate more entries in the page table, which would have problems
      with the atomic context, unless write_rare from atomic is ruled
      out. If write_rare from atomic can be avoided, then I can also have
      one-use randomized addresses at each write-rare operation, instead
      of fixed ones, like in point 6.

and, apologies for being dense: the following is still not clear to me:

   8. you wrote:
      > You’ll still need a mutex (mmap_sem) to synchronize allocation
      > and freeing of rare-writable ranges, but that shouldn’t have much
      > contention.
      What causes the contention? It's the fact that the various cores
      are using the same mm, if I understood correctly.
      However, if there was one mm for each core, wouldn't that make it
      unnecessary to have any mutex?
      I feel there must be some obvious reason why multiple mms are not a
      good idea, yet I cannot grasp it :-(

      switch_mm_irqs_off() seems to have lots of references to
      "this_cpu_something"; if there is any optimization from having the
      same next across multiple cores, I'm missing it

[...]

> I would either refactor it or create a new function to handle the write. The main thing that Nadav is adding that I think you’ll want to use is the infrastructure for temporarily switching mms from a non-kernel-thread context.

yes

[...]


> You’ll still want Nadav’s code for setting up the mm in the first place, though.

yes

--
thanks, igor


* Re: [PATCH 10/17] prmem: documentation
  2018-11-22 19:27                                   ` Igor Stoppa
@ 2018-11-22 20:04                                     ` Matthew Wilcox
  2018-11-22 20:53                                       ` Andy Lutomirski
  0 siblings, 1 reply; 140+ messages in thread
From: Matthew Wilcox @ 2018-11-22 20:04 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Andy Lutomirski, Andy Lutomirski, Igor Stoppa, Nadav Amit,
	Kees Cook, Peter Zijlstra, Mimi Zohar, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	LSM List, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
> I have studied the code involved with Nadav's patchset.
> I am perplexed about these sentences you wrote.
> 
> More to the point (to the best of my understanding):
> 
> poking_init()
> -------------
>   1. it gets one random poking address and ensures to have at least 2
>      consecutive PTEs from the same PMD
>   2. it then proceeds to map/unmap an address from the first of the 2
>      consecutive PTEs, so that, later on, there will be no need to
>      allocate pages, which might fail, if poking from atomic context.
>   3. at this point, the page tables are populated, for the address that
>      was obtained at point 1, and this is ok, because the address is fixed
> 
> write_rare
> ----------
>   4. it can happen on any available core / thread at any time, therefore
>      each of them needs a different address

No?  Each CPU has its own CR3 (eg each CPU might be running a different
user task).  If you have _one_ address for each allocation, it may or
may not be mapped on other CPUs at the same time -- you simply don't care.

The writable address can even be computed with a simple formula from
the read-only address; you don't have to allocate an address in the
writable mapping space.



* Re: [PATCH 10/17] prmem: documentation
  2018-11-22 20:04                                     ` Matthew Wilcox
@ 2018-11-22 20:53                                       ` Andy Lutomirski
  2018-12-04 12:34                                         ` Igor Stoppa
  0 siblings, 1 reply; 140+ messages in thread
From: Andy Lutomirski @ 2018-11-22 20:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Igor Stoppa, Andrew Lutomirski, Igor Stoppa, Nadav Amit,
	Kees Cook, Peter Zijlstra, Mimi Zohar, Dave Chinner,
	James Morris, Michal Hocko, Kernel Hardening, linux-integrity,
	LSM List, Dave Hansen, Jonathan Corbet, Laura Abbott,
	Randy Dunlap, Mike Rapoport, open list:DOCUMENTATION, LKML,
	Thomas Gleixner

On Thu, Nov 22, 2018 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
> > I have studied the code involved with Nadav's patchset.
> > I am perplexed about these sentences you wrote.
> >
> > More to the point (to the best of my understanding):
> >
> > poking_init()
> > -------------
> >   1. it gets one random poking address and ensures to have at least 2
> >      consecutive PTEs from the same PMD
> >   2. it then proceeds to map/unmap an address from the first of the 2
> >      consecutive PTEs, so that, later on, there will be no need to
> >      allocate pages, which might fail, if poking from atomic context.
> >   3. at this point, the page tables are populated, for the address that
> >      was obtained at point 1, and this is ok, because the address is fixed
> >
> > write_rare
> > ----------
> >   4. it can happen on any available core / thread at any time, therefore
> >      each of them needs a different address
>
> No?  Each CPU has its own CR3 (eg each CPU might be running a different
> user task).  If you have _one_ address for each allocation, it may or
> may not be mapped on other CPUs at the same time -- you simply don't care.
>
> The writable address can even be a simple formula to calculate from
> the read-only address, you don't have to allocate an address in the
> writable mapping space.
>

Agreed.  I suggest the formula:

writable_address = readable_address - rare_write_offset.  For
starters, rare_write_offset can just be a constant.  If we want to get
fancy later on, it can be randomized.

If we do it like this, then we don't need to modify any pagetables at
all when we do a rare write.  Instead we can set up the mapping at
boot or when we allocate the rare write space, and the actual rare
write code can just switch mms and do the write.


* Re: [PATCH 10/17] prmem: documentation
  2018-11-22 20:53                                       ` Andy Lutomirski
@ 2018-12-04 12:34                                         ` Igor Stoppa
  0 siblings, 0 replies; 140+ messages in thread
From: Igor Stoppa @ 2018-12-04 12:34 UTC (permalink / raw)
  To: Andy Lutomirski, Matthew Wilcox
  Cc: Andrew Lutomirski, Igor Stoppa, Nadav Amit, Kees Cook,
	Peter Zijlstra, Mimi Zohar, Dave Chinner, James Morris,
	Michal Hocko, Kernel Hardening, linux-integrity, LSM List,
	Dave Hansen, Jonathan Corbet, Laura Abbott, Randy Dunlap,
	Mike Rapoport, open list:DOCUMENTATION, LKML, Thomas Gleixner

Hello,
apologies for the delayed answer.
Please find below my reply to both of the last mails in the thread.

On 22/11/2018 22:53, Andy Lutomirski wrote:
> On Thu, Nov 22, 2018 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
>>> I have studied the code involved with Nadav's patchset.
>>> I am perplexed about these sentences you wrote.
>>>
>>> More to the point (to the best of my understanding):
>>>
>>> poking_init()
>>> -------------
>>>    1. it gets one random poking address and ensures to have at least 2
>>>       consecutive PTEs from the same PMD
>>>    2. it then proceeds to map/unmap an address from the first of the 2
>>>       consecutive PTEs, so that, later on, there will be no need to
>>>       allocate pages, which might fail, if poking from atomic context.
>>>    3. at this point, the page tables are populated, for the address that
>>>       was obtained at point 1, and this is ok, because the address is fixed
>>>
>>> write_rare
>>> ----------
>>>    4. it can happen on any available core / thread at any time, therefore
>>>       each of them needs a different address
>>
>> No?  Each CPU has its own CR3 (eg each CPU might be running a different
>> user task).  If you have _one_ address for each allocation, it may or
>> may not be mapped on other CPUs at the same time -- you simply don't care.

Yes, somehow I lost track of that aspect.

>> The writable address can even be a simple formula to calculate from
>> the read-only address, you don't have to allocate an address in the
>> writable mapping space.
>>
> 
> Agreed.  I suggest the formula:
> 
> writable_address = readable_address - rare_write_offset.  For
> starters, rare_write_offset can just be a constant.  If we want to get
> fancy later on, it can be randomized.

ok, I hope I captured it here [1]

> If we do it like this, then we don't need to modify any pagetables at
> all when we do a rare write.  Instead we can set up the mapping at
> boot or when we allocate the rare write space, and the actual rare
> write code can just switch mms and do the write.

I did that. I have little feel for the actual amount of data 
involved, but there is a (probably very remote) chance that the remap 
could fail, at least in the current implementation.

It's a bit different from what I had in mind initially, since I was 
thinking of having a single approach for both statically allocated 
memory (is there a better way to describe it?) and what would be 
provided by the allocator that comes next.

As I wrote, I do not particularly like the way I implemented the 
multiple functions vs. the remapping, but I couldn't figure out a 
better way to do it, so eventually I kept this one, hoping to get some 
advice on how to improve it.

I have not provided an example yet, but IMA has some flags that are 
probably very suitable, since they depend on policy reloading, which can 
happen multiple times, but could be used to disable it.

[1] https://www.openwall.com/lists/kernel-hardening/2018/12/04/3

--
igor


Thread overview: 140+ messages (download: mbox.gz / follow: Atom feed)
2018-10-23 21:34 [RFC v1 PATCH 00/17] prmem: protected memory Igor Stoppa
2018-10-23 21:34 ` [PATCH 01/17] prmem: linker section for static write rare Igor Stoppa
2018-10-23 21:34 ` [PATCH 02/17] prmem: write rare for static allocation Igor Stoppa
2018-10-25  0:24   ` Dave Hansen
2018-10-29 18:03     ` Igor Stoppa
2018-10-26  9:41   ` Peter Zijlstra
2018-10-29 20:01     ` Igor Stoppa
2018-10-23 21:34 ` [PATCH 03/17] prmem: vmalloc support for dynamic allocation Igor Stoppa
2018-10-25  0:26   ` Dave Hansen
2018-10-29 18:07     ` Igor Stoppa
2018-10-23 21:34 ` [PATCH 04/17] prmem: " Igor Stoppa
2018-10-23 21:34 ` [PATCH 05/17] prmem: shorthands for write rare on common types Igor Stoppa
2018-10-25  0:28   ` Dave Hansen
2018-10-29 18:12     ` Igor Stoppa
2018-10-23 21:34 ` [PATCH 06/17] prmem: test cases for memory protection Igor Stoppa
2018-10-24  3:27   ` Randy Dunlap
2018-10-24 14:24     ` Igor Stoppa
2018-10-25 16:43   ` Dave Hansen
2018-10-29 18:16     ` Igor Stoppa
2018-10-23 21:34 ` [PATCH 07/17] prmem: lkdtm tests " Igor Stoppa
2018-10-23 21:34 ` [PATCH 08/17] prmem: struct page: track vmap_area Igor Stoppa
2018-10-24  3:12   ` Matthew Wilcox
2018-10-24 23:01     ` Igor Stoppa
2018-10-25  2:13       ` Matthew Wilcox
2018-10-29 18:21         ` Igor Stoppa
2018-10-23 21:34 ` [PATCH 09/17] prmem: hardened usercopy Igor Stoppa
2018-10-29 11:45   ` Chris von Recklinghausen
2018-10-29 18:24     ` Igor Stoppa
2018-10-23 21:34 ` [PATCH 10/17] prmem: documentation Igor Stoppa
2018-10-24  3:48   ` Randy Dunlap
2018-10-24 14:30     ` Igor Stoppa
2018-10-24 23:04   ` Mike Rapoport
2018-10-29 19:05     ` Igor Stoppa
2018-10-26  9:26   ` Peter Zijlstra
2018-10-26 10:20     ` Matthew Wilcox
2018-10-29 19:28       ` Igor Stoppa
2018-10-26 10:46     ` Kees Cook
2018-10-28 18:31       ` Peter Zijlstra
2018-10-29 21:04         ` Igor Stoppa
2018-10-30 15:26           ` Peter Zijlstra
2018-10-30 16:37             ` Kees Cook
2018-10-30 17:06               ` Andy Lutomirski
2018-10-30 17:58                 ` Matthew Wilcox
2018-10-30 18:03                   ` Dave Hansen
2018-10-31  9:18                     ` Peter Zijlstra
2018-10-30 18:28                   ` Tycho Andersen
2018-10-30 19:20                     ` Matthew Wilcox
2018-10-30 20:43                       ` Igor Stoppa
2018-10-30 21:02                         ` Andy Lutomirski
2018-10-30 21:07                           ` Kees Cook
2018-10-30 21:25                             ` Igor Stoppa
2018-10-30 22:15                           ` Igor Stoppa
2018-10-31 10:11                             ` Peter Zijlstra
2018-10-31 20:38                               ` Andy Lutomirski
2018-10-31 20:53                                 ` Andy Lutomirski
2018-10-31  9:45                           ` Peter Zijlstra
2018-10-30 21:35                         ` Matthew Wilcox
2018-10-30 21:49                           ` Igor Stoppa
2018-10-31  4:41                           ` Andy Lutomirski
2018-10-31  9:08                             ` Igor Stoppa
2018-10-31 19:38                               ` Igor Stoppa
2018-10-31 10:02                             ` Peter Zijlstra
2018-10-31 20:36                               ` Andy Lutomirski
2018-10-31 21:00                                 ` Peter Zijlstra
2018-10-31 22:57                                   ` Andy Lutomirski
2018-10-31 23:10                                     ` Igor Stoppa
2018-10-31 23:19                                       ` Andy Lutomirski
2018-10-31 23:26                                         ` Igor Stoppa
2018-11-01  8:21                                           ` Thomas Gleixner
2018-11-01 15:58                                             ` Igor Stoppa
2018-11-01 17:08                                     ` Peter Zijlstra
2018-10-30 18:51                   ` Andy Lutomirski
2018-10-30 19:14                     ` Kees Cook
2018-10-30 21:25                     ` Matthew Wilcox
2018-10-30 21:55                       ` Igor Stoppa
2018-10-30 22:08                         ` Matthew Wilcox
2018-10-31  9:29                       ` Peter Zijlstra
2018-10-30 23:18                     ` Nadav Amit
2018-10-31  9:08                       ` Peter Zijlstra
2018-11-01 16:31                         ` Nadav Amit
2018-11-02 21:11                           ` Nadav Amit
2018-10-31  9:36                   ` Peter Zijlstra
2018-10-31 11:33                     ` Matthew Wilcox
2018-11-13 14:25                 ` Igor Stoppa
2018-11-13 17:16                   ` Andy Lutomirski
2018-11-13 17:43                     ` Nadav Amit
2018-11-13 17:47                       ` Andy Lutomirski
2018-11-13 18:06                         ` Nadav Amit
2018-11-13 18:31                         ` Igor Stoppa
2018-11-13 18:33                           ` Igor Stoppa
2018-11-13 18:36                             ` Andy Lutomirski
2018-11-13 19:03                               ` Igor Stoppa
2018-11-21 16:34                               ` Igor Stoppa
2018-11-21 17:36                                 ` Nadav Amit
2018-11-21 18:01                                   ` Igor Stoppa
2018-11-21 18:15                                 ` Andy Lutomirski
2018-11-22 19:27                                   ` Igor Stoppa
2018-11-22 20:04                                     ` Matthew Wilcox
2018-11-22 20:53                                       ` Andy Lutomirski
2018-12-04 12:34                                         ` Igor Stoppa
2018-11-13 18:48                           ` Andy Lutomirski
2018-11-13 19:35                             ` Igor Stoppa
2018-11-13 18:26                     ` Igor Stoppa
2018-11-13 18:35                       ` Andy Lutomirski
2018-11-13 19:01                         ` Igor Stoppa
2018-10-31  9:27               ` Igor Stoppa
2018-10-26 11:09     ` Markus Heiser
2018-10-29 19:35       ` Igor Stoppa
2018-10-26 15:05     ` Jonathan Corbet
2018-10-29 19:38       ` Igor Stoppa
2018-10-29 20:35     ` Igor Stoppa
2018-10-23 21:34 ` [PATCH 11/17] prmem: llist: use designated initializer Igor Stoppa
2018-10-23 21:34 ` [PATCH 12/17] prmem: linked list: set alignment Igor Stoppa
2018-10-26  9:31   ` Peter Zijlstra
2018-10-23 21:35 ` [PATCH 13/17] prmem: linked list: disable layout randomization Igor Stoppa
2018-10-24 13:43   ` Alexey Dobriyan
2018-10-29 19:40     ` Igor Stoppa
2018-10-26  9:32   ` Peter Zijlstra
2018-10-26 10:17     ` Matthew Wilcox
2018-10-30 15:39       ` Peter Zijlstra
2018-10-23 21:35 ` [PATCH 14/17] prmem: llist, hlist, both plain and rcu Igor Stoppa
2018-10-24 11:37   ` Mathieu Desnoyers
2018-10-24 14:03     ` Igor Stoppa
2018-10-24 14:56       ` Tycho Andersen
2018-10-24 22:52         ` Igor Stoppa
2018-10-25  8:11           ` Tycho Andersen
2018-10-28  9:52       ` Steven Rostedt
2018-10-29 19:43         ` Igor Stoppa
2018-10-26  9:38   ` Peter Zijlstra
2018-10-23 21:35 ` [PATCH 15/17] prmem: test cases for prlist and prhlist Igor Stoppa
2018-10-23 21:35 ` [PATCH 16/17] prmem: pratomic-long Igor Stoppa
2018-10-25  0:13   ` Peter Zijlstra
2018-10-29 21:17     ` Igor Stoppa
2018-10-30 15:58       ` Peter Zijlstra
2018-10-30 16:28         ` Will Deacon
2018-10-31  9:10           ` Peter Zijlstra
2018-11-01  3:28             ` Kees Cook
2018-10-23 21:35 ` [PATCH 17/17] prmem: ima: turn the measurements list write rare Igor Stoppa
2018-10-24 23:03 ` [RFC v1 PATCH 00/17] prmem: protected memory Dave Chinner
2018-10-29 19:47   ` Igor Stoppa

Archives are clonable: git clone --mirror https://lore.kernel.org/linux-security-module/0 linux-security-module/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-security-module linux-security-module/ https://lore.kernel.org/linux-security-module \
		linux-security-module@vger.kernel.org linux-security-module@archiver.kernel.org
	public-inbox-index linux-security-module


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-security-module


AGPL code for this site: git clone https://public-inbox.org/ public-inbox