* [PATCH RFC v3 0/3] Rlimit for module space
@ 2018-10-19 20:47 ` Rick Edgecombe
  0 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2018-10-19 20:47 UTC (permalink / raw)
  To: kernel-hardening, daniel, keescook, catalin.marinas, will.deacon,
	davem, tglx, mingo, bp, x86, arnd, jeyu, linux-arm-kernel,
	linux-kernel, linux-mips, linux-s390, sparclinux, linux-fsdevel,
	linux-arch, jannh
  Cc: kristen, dave.hansen, arjan, deneen.t.dock, Rick Edgecombe

If the BPF JIT is on, there is no effective limit preventing the entire module
space from being filled with JITed cBPF/eBPF filters. For classic BPF filters
attached with setsockopt(SO_ATTACH_FILTER), there is no memlock rlimit check
limiting the number of insertions, as there is for the bpf() syscall.

This patchset adds a per-user rlimit for module space, as well as a system-wide
limit for BPF JIT. In a previously reviewed patchset, Jann Horn pointed out the
problem that in some cases a user can get access to 65536 UIDs, so the
effective limit cannot be set low enough to stop an attacker while remaining
useful for the general case. A discussed alternative was a system-wide limit
for BPF JIT filters. That resolves the exhaustion and de-randomization
problems much more simply in the non-CONFIG_BPF_JIT_ALWAYS_ON case. If
CONFIG_BPF_JIT_ALWAYS_ON is set, however, BPF insertions will fail once
another user exhausts the BPF JIT limit, so a per-user limit is still needed.
If the subuid facility is disabled for normal users, this should still be
acceptable, because the per-user limit cannot be worked around that way.

The new BPF JIT limit can be set like this:
echo 5000000 > /proc/sys/net/core/bpf_jit_limit

So I *think* this patchset resolves that issue, except for the configuration
where CONFIG_BPF_JIT_ALWAYS_ON is set and subuids are allowed for normal
users. Better module space KASLR is another way to resolve the
de-randomization issue, which would leave only the BPF DoS in that
configuration.

Jann also pointed out that, by purposely fragmenting the module space, an
attacker could make the effectively blocked area much larger. This is also
somewhat unresolved; the impact depends on how large a space is being
allocated. The default limit has been lowered on x86_64 so that at least
typically sized BPF filters cannot be blocked.

If anyone with more experience with subuids/user namespaces has any
suggestions, I'd be glad to hear them. On an Ubuntu machine it did not seem
like an unprivileged user could do this. I am going to keep working on this
and see if I can find a better solution.

Changes since v2:
 - System wide BPF JIT limit (discussion with Jann Horn)
 - Holding reference to user correctly (Jann)
 - Having arch versions of module_alloc (Dave Hansen, Jessica Yu)
 - Shrinking of default limits, to help prevent the limit being worked around
   with fragmentation (Jann)

Changes since v1:
 - Plug in for non-x86
 - Arch specific default values


Rick Edgecombe (3):
  modules: Create arch versions of module alloc/free
  modules: Create rlimit for module space
  bpf: Add system wide BPF JIT limit

 arch/arm/kernel/module.c                |   2 +-
 arch/arm64/kernel/module.c              |   2 +-
 arch/mips/kernel/module.c               |   2 +-
 arch/nds32/kernel/module.c              |   2 +-
 arch/nios2/kernel/module.c              |   4 +-
 arch/parisc/kernel/module.c             |   2 +-
 arch/s390/kernel/module.c               |   2 +-
 arch/sparc/kernel/module.c              |   2 +-
 arch/unicore32/kernel/module.c          |   2 +-
 arch/x86/include/asm/pgtable_32_types.h |   3 +
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 arch/x86/kernel/module.c                |   2 +-
 fs/proc/base.c                          |   1 +
 include/asm-generic/resource.h          |   8 ++
 include/linux/bpf.h                     |   7 ++
 include/linux/filter.h                  |   1 +
 include/linux/sched/user.h              |   4 +
 include/uapi/asm-generic/resource.h     |   3 +-
 kernel/bpf/core.c                       |  22 +++-
 kernel/bpf/inode.c                      |  16 +++
 kernel/module.c                         | 152 +++++++++++++++++++++++-
 net/core/sysctl_net_core.c              |   7 ++
 22 files changed, 233 insertions(+), 15 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 47+ messages in thread



* [PATCH v3 1/3] modules: Create arch versions of module alloc/free
  2018-10-19 20:47 ` Rick Edgecombe
  (?)
@ 2018-10-19 20:47   ` Rick Edgecombe
  -1 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2018-10-19 20:47 UTC (permalink / raw)
  To: kernel-hardening, daniel, keescook, catalin.marinas, will.deacon,
	davem, tglx, mingo, bp, x86, arnd, jeyu, linux-arm-kernel,
	linux-kernel, linux-mips, linux-s390, sparclinux, linux-fsdevel,
	linux-arch, jannh
  Cc: kristen, dave.hansen, arjan, deneen.t.dock, Rick Edgecombe

In preparation for the module space rlimit, create singular cross-platform
module_alloc() and module_memfree() functions that call into arch-specific
implementations.

This has only been tested on x86.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/arm/kernel/module.c       |  2 +-
 arch/arm64/kernel/module.c     |  2 +-
 arch/mips/kernel/module.c      |  2 +-
 arch/nds32/kernel/module.c     |  2 +-
 arch/nios2/kernel/module.c     |  4 ++--
 arch/parisc/kernel/module.c    |  2 +-
 arch/s390/kernel/module.c      |  2 +-
 arch/sparc/kernel/module.c     |  2 +-
 arch/unicore32/kernel/module.c |  2 +-
 arch/x86/kernel/module.c       |  2 +-
 kernel/module.c                | 14 ++++++++++++--
 11 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index 3ff571c2c71c..359838a4bb06 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -38,7 +38,7 @@
 #endif
 
 #ifdef CONFIG_MMU
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	gfp_t gfp_mask = GFP_KERNEL;
 	void *p;
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index f0f27aeefb73..a6891eb2fc16 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -30,7 +30,7 @@
 #include <asm/insn.h>
 #include <asm/sections.h>
 
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	gfp_t gfp_mask = GFP_KERNEL;
 	void *p;
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 491605137b03..e9ee8e7544f9 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -45,7 +45,7 @@ static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
 #ifdef MODULE_START
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
 				GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
diff --git a/arch/nds32/kernel/module.c b/arch/nds32/kernel/module.c
index 1e31829cbc2a..75535daa22a5 100644
--- a/arch/nds32/kernel/module.c
+++ b/arch/nds32/kernel/module.c
@@ -7,7 +7,7 @@
 #include <linux/moduleloader.h>
 #include <asm/pgtable.h>
 
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
 				    GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index e2e3f13f98d5..cd059a8e9a7b 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -28,7 +28,7 @@
  * from 0x80000000 (vmalloc area) to 0xc00000000 (kernel) (kmalloc returns
  * addresses in 0xc0000000)
  */
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	if (size == 0)
 		return NULL;
@@ -36,7 +36,7 @@ void *module_alloc(unsigned long size)
 }
 
 /* Free memory returned from module_alloc */
-void module_memfree(void *module_region)
+void arch_module_memfree(void *module_region)
 {
 	kfree(module_region);
 }
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index b5b3cb00f1fb..72ab3c8b103b 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -213,7 +213,7 @@ static inline int reassemble_22(int as22)
 		((as22 & 0x0003ff) << 3));
 }
 
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	/* using RWX means less protection for modules, but it's
 	 * easier than trying to map the text, data, init_text and
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index d298d3cb46d0..e07c4a9384c0 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -30,7 +30,7 @@
 
 #define PLT_ENTRY_SIZE 20
 
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	if (PAGE_ALIGN(size) > MODULES_LEN)
 		return NULL;
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index df39580f398d..870581ba9205 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -40,7 +40,7 @@ static void *module_map(unsigned long size)
 }
 #endif /* CONFIG_SPARC64 */
 
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	void *ret;
 
diff --git a/arch/unicore32/kernel/module.c b/arch/unicore32/kernel/module.c
index e191b3448bd3..53ea96459d8c 100644
--- a/arch/unicore32/kernel/module.c
+++ b/arch/unicore32/kernel/module.c
@@ -22,7 +22,7 @@
 #include <asm/pgtable.h>
 #include <asm/sections.h>
 
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
 				GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index f58336af095c..032e49180577 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -77,7 +77,7 @@ static unsigned long int get_module_load_offset(void)
 }
 #endif
 
-void *module_alloc(unsigned long size)
+void *arch_module_alloc(unsigned long size)
 {
 	void *p;
 
diff --git a/kernel/module.c b/kernel/module.c
index 6746c85511fe..41c22aba8209 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2110,11 +2110,16 @@ static void free_module_elf(struct module *mod)
 }
 #endif /* CONFIG_LIVEPATCH */
 
-void __weak module_memfree(void *module_region)
+void __weak arch_module_memfree(void *module_region)
 {
 	vfree(module_region);
 }
 
+void module_memfree(void *module_region)
+{
+	arch_module_memfree(module_region);
+}
+
 void __weak module_arch_cleanup(struct module *mod)
 {
 }
@@ -2728,11 +2733,16 @@ static void dynamic_debug_remove(struct module *mod, struct _ddebug *debug)
 		ddebug_remove_module(mod->name);
 }
 
-void * __weak module_alloc(unsigned long size)
+void * __weak arch_module_alloc(unsigned long size)
 {
 	return vmalloc_exec(size);
 }
 
+void *module_alloc(unsigned long size)
+{
+	return arch_module_alloc(size);
+}
+
 #ifdef CONFIG_DEBUG_KMEMLEAK
 static void kmemleak_load_module(const struct module *mod,
 				 const struct load_info *info)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread



* [PATCH v3 2/3] modules: Create rlimit for module space
  2018-10-19 20:47 ` Rick Edgecombe
  (?)
@ 2018-10-19 20:47   ` Rick Edgecombe
  -1 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2018-10-19 20:47 UTC (permalink / raw)
  To: kernel-hardening, daniel, keescook, catalin.marinas, will.deacon,
	davem, tglx, mingo, bp, x86, arnd, jeyu, linux-arm-kernel,
	linux-kernel, linux-mips, linux-s390, sparclinux, linux-fsdevel,
	linux-arch, jannh
  Cc: kristen, dave.hansen, arjan, deneen.t.dock, Rick Edgecombe

This introduces a new rlimit, RLIMIT_MODSPACE, which limits the amount of
module space a user can use. The intention is to limit module space
allocations that may come from unprivileged users inserting e/BPF filters.

Since filters attached to sockets can be passed to other processes via domain
sockets and freed there, the uid of each allocation is now tracked. This way,
if an allocation is freed by a different user, it will not throw off the
accounting.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/pgtable_32_types.h |   3 +
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 fs/proc/base.c                          |   1 +
 include/asm-generic/resource.h          |   8 ++
 include/linux/sched/user.h              |   4 +
 include/uapi/asm-generic/resource.h     |   3 +-
 kernel/module.c                         | 140 +++++++++++++++++++++++-
 7 files changed, 159 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_32_types.h b/arch/x86/include/asm/pgtable_32_types.h
index b0bc0fff5f1f..185e382fa8c3 100644
--- a/arch/x86/include/asm/pgtable_32_types.h
+++ b/arch/x86/include/asm/pgtable_32_types.h
@@ -68,6 +68,9 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
 #define MODULES_END	VMALLOC_END
 #define MODULES_LEN	(MODULES_VADDR - MODULES_END)
 
+/* 32MB, a quarter of the 128MB vmalloc space */
+#define MODSPACE_LIMIT (1 << 25)
+
 #define MAXMEM	(VMALLOC_END - PAGE_OFFSET - __VMALLOC_RESERVE)
 
 #endif /* _ASM_X86_PGTABLE_32_DEFS_H */
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04edd2d58211..39288812be5a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -143,6 +143,8 @@ extern unsigned int ptrs_per_p4d;
 #define MODULES_END		_AC(0xffffffffff000000, UL)
 #define MODULES_LEN		(MODULES_END - MODULES_VADDR)
 
+#define MODSPACE_LIMIT		(MODULES_LEN / 10)
+
 #define ESPFIX_PGD_ENTRY	_AC(-2, UL)
 #define ESPFIX_BASE_ADDR	(ESPFIX_PGD_ENTRY << P4D_SHIFT)
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 7e9f07bf260d..84824f50e9f8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -562,6 +562,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
 	[RLIMIT_NICE] = {"Max nice priority", NULL},
 	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
 	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
+	[RLIMIT_MODSPACE] = {"Max module space", "bytes"},
 };
 
 /* Display limits for a process */
diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 8874f681b056..94c150e3dd12 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -4,6 +4,13 @@
 
 #include <uapi/asm-generic/resource.h>
 
+/*
+ * If the module space rlimit is not defined in an arch specific way, leave
+ * room for 10000 large eBPF filters.
+ */
+#ifndef MODSPACE_LIMIT
+#define MODSPACE_LIMIT (5*PAGE_SIZE*10000)
+#endif
 
 /*
  * boot-time rlimit defaults for the init task:
@@ -26,6 +33,7 @@
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
+	[RLIMIT_MODSPACE]	= {  MODSPACE_LIMIT,  MODSPACE_LIMIT },	\
 }
 
 #endif
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c09c58..4c6d99d066fe 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -44,6 +44,10 @@ struct user_struct {
 	atomic_long_t locked_vm;
 #endif
 
+#ifdef CONFIG_MODULES
+	atomic_long_t module_vm;
+#endif
+
 	/* Miscellaneous per-user rate limit */
 	struct ratelimit_state ratelimit;
 };
diff --git a/include/uapi/asm-generic/resource.h b/include/uapi/asm-generic/resource.h
index f12db7a0da64..3f998340ed30 100644
--- a/include/uapi/asm-generic/resource.h
+++ b/include/uapi/asm-generic/resource.h
@@ -46,7 +46,8 @@
 					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
 #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
-#define RLIM_NLIMITS		16
+#define RLIMIT_MODSPACE		16	/* max module space address usage */
+#define RLIM_NLIMITS		17
 
 /*
  * SuS says limits have to be unsigned.
diff --git a/kernel/module.c b/kernel/module.c
index 41c22aba8209..c26ad50365dd 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2110,6 +2110,134 @@ static void free_module_elf(struct module *mod)
 }
 #endif /* CONFIG_LIVEPATCH */
 
+struct mod_alloc_user {
+	struct rb_node node;
+	unsigned long addr;
+	unsigned long pages;
+	struct user_struct *user;
+};
+
+static struct rb_root alloc_users = RB_ROOT;
+static DEFINE_SPINLOCK(alloc_users_lock);
+
+static unsigned int get_mod_page_cnt(unsigned long size)
+{
+	/* Add one for guard page */
+	return (PAGE_ALIGN(size) >> PAGE_SHIFT) + 1;
+}
+
+void update_mod_rlimit(void *addr, unsigned long size)
+{
+	unsigned long addrl = (unsigned long) addr;
+	struct rb_node **new = &(alloc_users.rb_node), *parent = NULL;
+	struct mod_alloc_user *track = kmalloc(sizeof(struct mod_alloc_user),
+				GFP_KERNEL);
+	unsigned int pages = get_mod_page_cnt(size);
+	struct user_struct *user = get_current_user();
+
+	/*
+	 * If addr is NULL, then we need to reverse the earlier increment that
+	 * would have happened in a check_inc_mod_rlimit call.
+	 */
+	if (!addr) {
+		atomic_long_sub(pages, &user->module_vm);
+		free_uid(user);
+		return;
+	}
+
+	/* Now, add tracking for the uid that allocated this */
+	track->addr = addrl;
+	track->pages = pages;
+	track->user = user;
+
+	spin_lock(&alloc_users_lock);
+
+	while (*new) {
+		struct mod_alloc_user *cur =
+				rb_entry(*new, struct mod_alloc_user, node);
+		parent = *new;
+		if (cur->addr > addrl)
+			new = &(*new)->rb_left;
+		else
+			new = &(*new)->rb_right;
+	}
+
+	rb_link_node(&(track->node), parent, new);
+	rb_insert_color(&(track->node), &alloc_users);
+
+	spin_unlock(&alloc_users_lock);
+}
+
+/* Remove user allocation tracking, return NULL if allocation untracked */
+static struct user_struct *remove_user_alloc(void *addr, unsigned long *pages)
+{
+	struct rb_node *cur_node = alloc_users.rb_node;
+	unsigned long addrl = (unsigned long) addr;
+	struct mod_alloc_user *cur_alloc_user = NULL;
+	struct user_struct *user;
+
+	spin_lock(&alloc_users_lock);
+	while (cur_node) {
+		cur_alloc_user =
+			rb_entry(cur_node, struct mod_alloc_user, node);
+		if (cur_alloc_user->addr > addrl)
+			cur_node = cur_node->rb_left;
+		else if (cur_alloc_user->addr < addrl)
+			cur_node = cur_node->rb_right;
+		else
+			goto found;
+	}
+	spin_unlock(&alloc_users_lock);
+
+	return NULL;
+found:
+	rb_erase(&cur_alloc_user->node, &alloc_users);
+	spin_unlock(&alloc_users_lock);
+
+	user = cur_alloc_user->user;
+	*pages = cur_alloc_user->pages;
+	kfree(cur_alloc_user);
+
+	return user;
+}
+
+int check_inc_mod_rlimit(unsigned long size)
+{
+	struct user_struct *user = get_current_user();
+	unsigned long modspace_pages = rlimit(RLIMIT_MODSPACE) >> PAGE_SHIFT;
+	unsigned long cur_pages = atomic_long_read(&user->module_vm);
+	unsigned long new_pages = get_mod_page_cnt(size);
+
+	if (rlimit(RLIMIT_MODSPACE) != RLIM_INFINITY
+			&& cur_pages + new_pages > modspace_pages) {
+		free_uid(user);
+		return 1;
+	}
+
+	atomic_long_add(new_pages, &user->module_vm);
+
+	if (atomic_long_read(&user->module_vm) > modspace_pages) {
+		atomic_long_sub(new_pages, &user->module_vm);
+		free_uid(user);
+		return 1;
+	}
+
+	free_uid(user);
+	return 0;
+}
+
+void dec_mod_rlimit(void *addr)
+{
+	unsigned long pages;
+	struct user_struct *user = remove_user_alloc(addr, &pages);
+
+	if (!user)
+		return;
+
+	atomic_long_sub(pages, &user->module_vm);
+	free_uid(user);
+}
+
 void __weak arch_module_memfree(void *module_region)
 {
 	vfree(module_region);
@@ -2118,6 +2246,7 @@ void __weak arch_module_memfree(void *module_region)
 void module_memfree(void *module_region)
 {
 	arch_module_memfree(module_region);
+	dec_mod_rlimit(module_region);
 }
 
 void __weak module_arch_cleanup(struct module *mod)
@@ -2740,7 +2869,16 @@ void * __weak arch_module_alloc(unsigned long size)
 
 void *module_alloc(unsigned long size)
 {
-	return arch_module_alloc(size);
+	void *p;
+
+	if (check_inc_mod_rlimit(size))
+		return NULL;
+
+	p = arch_module_alloc(size);
+
+	update_mod_rlimit(p, size);
+
+	return p;
 }
 
 #ifdef CONFIG_DEBUG_KMEMLEAK
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 3/3] bpf: Add system wide BPF JIT limit
  2018-10-19 20:47 ` Rick Edgecombe
  (?)
@ 2018-10-19 20:47   ` Rick Edgecombe
  -1 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2018-10-19 20:47 UTC (permalink / raw)
  To: kernel-hardening, daniel, keescook, catalin.marinas, will.deacon,
	davem, tglx, mingo, bp, x86, arnd, jeyu, linux-arm-kernel,
	linux-kernel, linux-mips, linux-s390, sparclinux, linux-fsdevel,
	linux-arch, jannh
  Cc: kristen, dave.hansen, arjan, deneen.t.dock, Rick Edgecombe

In case of games played with multiple users, also add a system wide limit
(in bytes) for BPF JIT. The default intends to be big enough for 10000 BPF JIT
filters. This cannot help with the DoS in the case of
CONFIG_BPF_JIT_ALWAYS_ON, but it can help with DoS against the module space and
with forcing a module to be loaded at a particular address.

The limit can be set like this:
echo 5000000 > /proc/sys/net/core/bpf_jit_limit

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/bpf.h        |  7 +++++++
 include/linux/filter.h     |  1 +
 kernel/bpf/core.c          | 22 +++++++++++++++++++++-
 kernel/bpf/inode.c         | 16 ++++++++++++++++
 net/core/sysctl_net_core.c |  7 +++++++
 5 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..4d7b729a1fe7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -827,4 +827,11 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+#ifndef MOD_BPF_LIMIT_DEFAULT
+/*
+ * Leave room for 10000 large eBPF filters as default.
+ */
+#define MOD_BPF_LIMIT_DEFAULT (5 * PAGE_SIZE * 10000)
+#endif
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6791a0ac0139..3e91ffc7962b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -854,6 +854,7 @@ bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 extern int bpf_jit_enable;
 extern int bpf_jit_harden;
 extern int bpf_jit_kallsyms;
+extern int bpf_jit_limit;
 
 typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3f5bf1af0826..12c20fa6f04b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -369,6 +369,9 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
 int bpf_jit_harden   __read_mostly;
 int bpf_jit_kallsyms __read_mostly;
+int bpf_jit_limit __read_mostly;
+
+static atomic_long_t module_vm;
 
 static __always_inline void
 bpf_get_prog_addr_region(const struct bpf_prog *prog,
@@ -583,17 +586,31 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     bpf_jit_fill_hole_t bpf_fill_ill_insns)
 {
 	struct bpf_binary_header *hdr;
-	unsigned int size, hole, start;
+	unsigned int size, hole, start, vpages;
 
 	/* Most of BPF filters are really small, but if some of them
 	 * fill a page, allow at least 128 extra bytes to insert a
 	 * random section of illegal instructions.
 	 */
 	size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
+
+	/* Size plus a guard page */
+	vpages = (PAGE_ALIGN(size) >> PAGE_SHIFT) + 1;
+
+	if (atomic_long_read(&module_vm) + vpages > bpf_jit_limit >> PAGE_SHIFT)
+		return NULL;
+
 	hdr = module_alloc(size);
 	if (hdr == NULL)
 		return NULL;
 
+	atomic_long_add(vpages, &module_vm);
+
+	if (atomic_long_read(&module_vm) > bpf_jit_limit >> PAGE_SHIFT) {
+		bpf_jit_binary_free(hdr);
+		return NULL;
+	}
+
 	/* Fill space with illegal/arch-dep instructions. */
 	bpf_fill_ill_insns(hdr, size);
 
@@ -610,7 +627,10 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 
 void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 {
+	/* Size plus the guard page */
+	unsigned int vpages = hdr->pages + 1;
 	module_memfree(hdr);
+	atomic_long_sub(vpages, &module_vm);
 }
 
 /* This symbol is only overridden by archs that have different
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 2ada5e21dfa6..d0a109733294 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -667,10 +667,26 @@ static struct file_system_type bpf_fs_type = {
 	.kill_sb	= kill_litter_super,
 };
 
+#ifdef CONFIG_BPF_JIT
+void set_bpf_jit_limit(void)
+{
+	bpf_jit_limit = MOD_BPF_LIMIT_DEFAULT;
+}
+#else
+void set_bpf_jit_limit(void)
+{
+}
+#endif
+
 static int __init bpf_init(void)
 {
 	int ret;
 
+	/*
+	 * The module space size may not be a compile-time constant, so set the limit here.
+	 */
+	set_bpf_jit_limit();
+
 	ret = sysfs_create_mount_point(fs_kobj, "bpf");
 	if (ret)
 		return ret;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index b1a2c5e38530..6bdf4a3da2b2 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -396,6 +396,13 @@ static struct ctl_table net_core_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "bpf_jit_limit",
+		.data		= &bpf_jit_limit,
+		.maxlen		= sizeof(int),
+		.mode		= 0600,
+		.proc_handler	= proc_dointvec,
+	},
 # endif
 #endif
 	{
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v3 3/3] bpf: Add system wide BPF JIT limit
@ 2018-10-19 20:47   ` Rick Edgecombe
  0 siblings, 0 replies; 47+ messages in thread
From: Rick Edgecombe @ 2018-10-19 20:47 UTC (permalink / raw)
  To: kernel-hardening, daniel, keescook, catalin.marinas, will.deacon,
	davem, tglx, mingo, bp, x86, arnd, jeyu, linux-arm-kernel,
	linux-kernel, linux-mips, linux-s390, sparclinux, linux-fsdevel,
	linux-arch, jannh
  Cc: kristen, dave.hansen, arjan, deneen.t.dock, Rick Edgecombe

In the case of games played with multiple users, also add a system wide limit
(in bytes) for BPF JIT. The default is intended to be big enough for 10000 BPF
JIT filters. This cannot help with the DOS in the case of
CONFIG_BPF_JIT_ALWAYS_ON, but it can help with DOS for module space and
with forcing a module to be loaded at a particular address.

The limit can be set like this:
echo 5000000 > /proc/sys/net/core/bpf_jit_limit

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/bpf.h        |  7 +++++++
 include/linux/filter.h     |  1 +
 kernel/bpf/core.c          | 22 +++++++++++++++++++++-
 kernel/bpf/inode.c         | 16 ++++++++++++++++
 net/core/sysctl_net_core.c |  7 +++++++
 5 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..4d7b729a1fe7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -827,4 +827,11 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+#ifndef MOD_BPF_LIMIT_DEFAULT
+/*
+ * Leave room for 10000 large eBPF filters as default.
+ */
+#define MOD_BPF_LIMIT_DEFAULT (5 * PAGE_SIZE * 10000)
+#endif
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6791a0ac0139..3e91ffc7962b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -854,6 +854,7 @@ bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 extern int bpf_jit_enable;
 extern int bpf_jit_harden;
 extern int bpf_jit_kallsyms;
+extern int bpf_jit_limit;
 
 typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3f5bf1af0826..12c20fa6f04b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -369,6 +369,9 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
 int bpf_jit_harden   __read_mostly;
 int bpf_jit_kallsyms __read_mostly;
+int bpf_jit_limit __read_mostly;
+
+static atomic_long_t module_vm;
 
 static __always_inline void
 bpf_get_prog_addr_region(const struct bpf_prog *prog,
@@ -583,17 +586,31 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     bpf_jit_fill_hole_t bpf_fill_ill_insns)
 {
 	struct bpf_binary_header *hdr;
-	unsigned int size, hole, start;
+	unsigned int size, hole, start, vpages;
 
 	/* Most of BPF filters are really small, but if some of them
 	 * fill a page, allow at least 128 extra bytes to insert a
 	 * random section of illegal instructions.
 	 */
 	size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
+
+	/* Size plus a guard page */
+	vpages = (PAGE_ALIGN(size) >> PAGE_SHIFT) + 1;
+
+	if (atomic_long_read(&module_vm) + vpages > bpf_jit_limit >> PAGE_SHIFT)
+		return NULL;
+
 	hdr = module_alloc(size);
 	if (hdr == NULL)
 		return NULL;
 
+	atomic_long_add(vpages, &module_vm);
+
+	if (atomic_long_read(&module_vm) > bpf_jit_limit >> PAGE_SHIFT) {
+		bpf_jit_binary_free(hdr);
+		return NULL;
+	}
+
 	/* Fill space with illegal/arch-dep instructions. */
 	bpf_fill_ill_insns(hdr, size);
 
@@ -610,7 +627,10 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 
 void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 {
+	/* Size plus the guard page */
+	unsigned int vpages = hdr->pages + 1;
 	module_memfree(hdr);
+	atomic_long_sub(vpages, &module_vm);
 }
 
 /* This symbol is only overridden by archs that have different
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 2ada5e21dfa6..d0a109733294 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -667,10 +667,26 @@ static struct file_system_type bpf_fs_type = {
 	.kill_sb	= kill_litter_super,
 };
 
+#ifdef CONFIG_BPF_JIT
+void set_bpf_jit_limit(void)
+{
+	bpf_jit_limit = MOD_BPF_LIMIT_DEFAULT;
+}
+#else
+void set_bpf_jit_limit(void)
+{
+}
+#endif
+
 static int __init bpf_init(void)
 {
 	int ret;
 
+	/*
+	 * Module space size can be non-compile time constant so set it here.
+	 */
+	set_bpf_jit_limit();
+
 	ret = sysfs_create_mount_point(fs_kobj, "bpf");
 	if (ret)
 		return ret;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index b1a2c5e38530..6bdf4a3da2b2 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -396,6 +396,13 @@ static struct ctl_table net_core_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "bpf_jit_limit",
+		.data		= &bpf_jit_limit,
+		.maxlen		= sizeof(int),
+		.mode		= 0600,
+		.proc_handler	= proc_dointvec,
+	},
 # endif
 #endif
 	{
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
  2018-10-19 20:47 ` Rick Edgecombe
  (?)
  (?)
@ 2018-10-20 17:20   ` Ard Biesheuvel
  -1 siblings, 0 replies; 47+ messages in thread
From: Ard Biesheuvel @ 2018-10-20 17:20 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Kernel Hardening, Daniel Borkmann, Kees Cook, Catalin Marinas,
	Will Deacon, David S. Miller, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, the arch/x86 maintainers, Arnd Bergmann,
	Jessica Yu, linux-arm-kernel, Linux Kernel Mailing List,
	linux-mips, linux-s390, sparclinux, linux-fsdevel, linux-arch,
	Jann Horn, kristen, Dave Hansen, arjan, deneen.t.dock

Hi Rick,

On 19 October 2018 at 22:47, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> If BPF JIT is on, there is no effective limit to prevent filling the entire
> module space with JITed e/BPF filters.

Why do BPF filters use the module space, and does this reason apply to
all architectures?

On arm64, we already support loading plain modules far away from the
core kernel (i.e. out of range for ordinary relative jump/call
instructions), and so I'd expect BPF to be able to deal with this
already as well. So for arm64, I wonder why an ordinary vmalloc_exec()
wouldn't be more appropriate.

So before refactoring the module alloc/free routines to accommodate
BPF, I'd like to take one step back and assess whether it wouldn't be
more appropriate to have a separate bpf_alloc/free API, which could be
totally separate from module alloc/free if the arch permits it.


> For classic BPF filters attached with
> setsockopt SO_ATTACH_FILTER, there is no memlock rlimit check to limit the
> number of insertions like there is for the bpf syscall.
>
> This patch adds a per user rlimit for module space, as well as a system wide
> limit for BPF JIT. In a previously reviewed patchset, Jann Horn pointed out the
> problem that in some cases a user can get access to 65536 UIDs, so the effective
> limit cannot be set low enough to stop an attacker and be useful for the general
> case. A discussed alternative solution was a system wide limit for BPF JIT
> filters. This much more simply resolves the problem of exhaustion and
> de-randomizing in the case of non-CONFIG_BPF_JIT_ALWAYS_ON. If
> CONFIG_BPF_JIT_ALWAYS_ON is on however, BPF insertions will fail if another user
> exhausts the BPF JIT limit. In this case a per user limit is still needed. If
> the subuid facility is disabled for normal users, this should still be ok
> because the higher limit will not be able to be worked around that way.
>
> The new BPF JIT limit can be set like this:
> echo 5000000 > /proc/sys/net/core/bpf_jit_limit
>
> So I *think* this patchset should resolve that issue except for the
> configuration of CONFIG_BPF_JIT_ALWAYS_ON and subuid allowed for normal users.
> Better module space KASLR is another way to resolve the de-randomizing issue,
> and so then you would just be left with the BPF DOS in that configuration.
>
> Jann also pointed out how, with purposely fragmenting the module space, you
> could make the effective module space blockage area much larger. This is also
> somewhat un-resolved. The impact would depend on how big of a space you are
> trying to allocate. The limit has been lowered on x86_64 so that at least
> typical sized BPF filters cannot be blocked.
>
> If anyone with more experience with subuid/user namespaces has any suggestions
> I'd be glad to hear. On an Ubuntu machine it didn't seem like an un-privileged
> user could do this. I am going to keep working on this and see if I can find a
> better solution.
>
> Changes since v2:
>  - System wide BPF JIT limit (discussion with Jann Horn)
>  - Holding reference to user correctly (Jann)
>  - Having arch versions of module_alloc (Dave Hansen, Jessica Yu)
>  - Shrinking of default limits, to help prevent the limit being worked around
>    with fragmentation (Jann)
>
> Changes since v1:
>  - Plug in for non-x86
>  - Arch specific default values
>
>
> Rick Edgecombe (3):
>   modules: Create arch versions of module alloc/free
>   modules: Create rlimit for module space
>   bpf: Add system wide BPF JIT limit
>
>  arch/arm/kernel/module.c                |   2 +-
>  arch/arm64/kernel/module.c              |   2 +-
>  arch/mips/kernel/module.c               |   2 +-
>  arch/nds32/kernel/module.c              |   2 +-
>  arch/nios2/kernel/module.c              |   4 +-
>  arch/parisc/kernel/module.c             |   2 +-
>  arch/s390/kernel/module.c               |   2 +-
>  arch/sparc/kernel/module.c              |   2 +-
>  arch/unicore32/kernel/module.c          |   2 +-
>  arch/x86/include/asm/pgtable_32_types.h |   3 +
>  arch/x86/include/asm/pgtable_64_types.h |   2 +
>  arch/x86/kernel/module.c                |   2 +-
>  fs/proc/base.c                          |   1 +
>  include/asm-generic/resource.h          |   8 ++
>  include/linux/bpf.h                     |   7 ++
>  include/linux/filter.h                  |   1 +
>  include/linux/sched/user.h              |   4 +
>  include/uapi/asm-generic/resource.h     |   3 +-
>  kernel/bpf/core.c                       |  22 +++-
>  kernel/bpf/inode.c                      |  16 +++
>  kernel/module.c                         | 152 +++++++++++++++++++++++-
>  net/core/sysctl_net_core.c              |   7 ++
>  22 files changed, 233 insertions(+), 15 deletions(-)
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
  2018-10-20 17:20   ` Ard Biesheuvel
                       ` (2 preceding siblings ...)
  (?)
@ 2018-10-22 23:06     ` Edgecombe, Rick P
  -1 siblings, 0 replies; 47+ messages in thread
From: Edgecombe, Rick P @ 2018-10-22 23:06 UTC (permalink / raw)
  To: ard.biesheuvel
  Cc: linux-kernel, daniel, keescook, jeyu, linux-arch, arjan, tglx,
	linux-mips, linux-s390, jannh, x86, kristen, Dock, Deneen T,
	catalin.marinas, mingo, will.deacon, kernel-hardening, bp,
	Hansen, Dave, linux-arm-kernel, davem, arnd, linux-fsdevel,
	sparclinux

On Sat, 2018-10-20 at 19:20 +0200, Ard Biesheuvel wrote:
> Hi Rick,
> 
> On 19 October 2018 at 22:47, Rick Edgecombe <rick.p.edgecombe@intel.com>
> wrote:
> > If BPF JIT is on, there is no effective limit to prevent filling the entire
> > module space with JITed e/BPF filters.
> 
> Why do BPF filters use the module space, and does this reason apply to
> all architectures?
> 
> On arm64, we already support loading plain modules far away from the
> core kernel (i.e. out of range for ordinary relative jump/call
> instructions), and so I'd expect BPF to be able to deal with this
> already as well. So for arm64, I wonder why an ordinary vmalloc_exec()
> wouldn't be more appropriate.
AFAIK, it's like you said about relative instruction limits, but also because
some branch predictors don't predict past a certain distance, so performance as
well. Not sure of the reasons for each arch, or if they all apply for BPF JIT.
There seem to be 8 by my count that have a dedicated module space for some reason.

> So before refactoring the module alloc/free routines to accommodate
> BPF, I'd like to take one step back and assess whether it wouldn't be
> more appropriate to have a separate bpf_alloc/free API, which could be
> totally separate from module alloc/free if the arch permits it.
> 
I am not a BPF JIT expert unfortunately, hopefully someone more authoritative
will chime in. I only ran into this because I was trying to increase
randomization for the module space and wanted to find out how many allocations
needed to be supported.

I'd guess though, that BPF JIT is just assuming that there will be some arch
specific constraints about where text can be placed optimally and they would
already be taken into account in the module space allocator.

If there are no constraints for some arch, I'd wonder why not just update its
module_alloc to use the whole space available. What exactly are the constraints
for arm64 for normal modules?

It seems fine to me for architectures to have the option of solving this a
different way. If potentially the rlimit ends up being the best solution for
some architectures though, do you think this refactor (pretty close to just a
name change) is that intrusive?

I guess it could be just a BPF JIT per user limit and not touch module space,
but I thought the module space limit was a more general solution as that is the
actual limited resource.

> > For classic BPF filters attached with
> > setsockopt SO_ATTACH_FILTER, there is no memlock rlimit check to limit the
> > number of insertions like there is for the bpf syscall.
> > 
> > This patch adds a per user rlimit for module space, as well as a system wide
> > limit for BPF JIT. In a previously reviewed patchset, Jann Horn pointed out
> > the
> > problem that in some cases a user can get access to 65536 UIDs, so the
> > effective
> > limit cannot be set low enough to stop an attacker and be useful for the
> > general
> > case. A discussed alternative solution was a system wide limit for BPF JIT
> > filters. This much more simply resolves the problem of exhaustion and
> > de-randomizing in the case of non-CONFIG_BPF_JIT_ALWAYS_ON. If
> > CONFIG_BPF_JIT_ALWAYS_ON is on however, BPF insertions will fail if another
> > user
> > exhausts the BPF JIT limit. In this case a per user limit is still needed.
> > If
> > the subuid facility is disabled for normal users, this should still be ok
> > because the higher limit will not be able to be worked around that way.
> > 
> > The new BPF JIT limit can be set like this:
> > echo 5000000 > /proc/sys/net/core/bpf_jit_limit
> > 
> > So I *think* this patchset should resolve that issue except for the
> > configuration of CONFIG_BPF_JIT_ALWAYS_ON and subuid allowed for normal
> > users.
> > Better module space KASLR is another way to resolve the de-randomizing
> > issue,
> > and so then you would just be left with the BPF DOS in that configuration.
> > 
> > Jann also pointed out how, with purposely fragmenting the module space, you
> > could make the effective module space blockage area much larger. This is
> > also
> > somewhat un-resolved. The impact would depend on how big of a space you are
> > trying to allocate. The limit has been lowered on x86_64 so that at least
> > typical sized BPF filters cannot be blocked.
> > 
> > If anyone with more experience with subuid/user namespaces has any
> > suggestions
> > I'd be glad to hear. On an Ubuntu machine it didn't seem like an
> > unprivileged user could do this. I am going to keep working on this and
> > see if I can find a better solution.
> > 
> > Changes since v2:
> >  - System wide BPF JIT limit (discussion with Jann Horn)
> >  - Holding reference to user correctly (Jann)
> >  - Having arch versions of module_alloc (Dave Hansen, Jessica Yu)
> >  - Shrinking of default limits, to help prevent the limit being worked
> > around
> >    with fragmentation (Jann)
> > 
> > Changes since v1:
> >  - Plug in for non-x86
> >  - Arch specific default values
> > 
> > 
> > Rick Edgecombe (3):
> >   modules: Create arch versions of module alloc/free
> >   modules: Create rlimit for module space
> >   bpf: Add system wide BPF JIT limit
> > 
> >  arch/arm/kernel/module.c                |   2 +-
> >  arch/arm64/kernel/module.c              |   2 +-
> >  arch/mips/kernel/module.c               |   2 +-
> >  arch/nds32/kernel/module.c              |   2 +-
> >  arch/nios2/kernel/module.c              |   4 +-
> >  arch/parisc/kernel/module.c             |   2 +-
> >  arch/s390/kernel/module.c               |   2 +-
> >  arch/sparc/kernel/module.c              |   2 +-
> >  arch/unicore32/kernel/module.c          |   2 +-
> >  arch/x86/include/asm/pgtable_32_types.h |   3 +
> >  arch/x86/include/asm/pgtable_64_types.h |   2 +
> >  arch/x86/kernel/module.c                |   2 +-
> >  fs/proc/base.c                          |   1 +
> >  include/asm-generic/resource.h          |   8 ++
> >  include/linux/bpf.h                     |   7 ++
> >  include/linux/filter.h                  |   1 +
> >  include/linux/sched/user.h              |   4 +
> >  include/uapi/asm-generic/resource.h     |   3 +-
> >  kernel/bpf/core.c                       |  22 +++-
> >  kernel/bpf/inode.c                      |  16 +++
> >  kernel/module.c                         | 152 +++++++++++++++++++++++-
> >  net/core/sysctl_net_core.c              |   7 ++
> >  22 files changed, 233 insertions(+), 15 deletions(-)
> > 
> > --
> > 2.17.1
> > 
Branch predictors may not be able to predict past a certain distance, so there
is a performance reason as well.

^ permalink raw reply	[flat|nested] 47+ messages in thread

> > de-randomizing in the case of non-CONFIG_BPF_JIT_ALWAYS_ON. If
> > CONFIG_BPF_JIT_ALWAYS_ON is on however, BPF insertions will fail if another
> > user
> > exhausts the BPF JIT limit. In this case a per user limit is still needed.
> > If
> > the subuid facility is disabled for normal users, this should still be ok
> > because the higher limit will not be able to be worked around
> > that way.
> > 
> > The new BPF JIT limit can be set like this:
> > echo 5000000 > /proc/sys/net/core/bpf_jit_limit
> > 
> > So I *think* this patchset should resolve that issue except for the
> > configuration of CONFIG_BPF_JIT_ALWAYS_ON and subuid allowed for normal
> > users.
> > Better module space KASLR is another way to resolve the de-randomizing
> > issue,
> > and so then you would just be left with the BPF DOS in that configuration.
> > 
> > Jann also pointed out how, with purposely fragmenting the module space, you
> > could make the effective module space blockage area much larger. This is
> > also
> > somewhat un-resolved. The impact would depend on how big of a space you are
> > trying to allocate. The limit has been lowered on x86_64 so that at least
> > typical sized BPF filters cannot be blocked.
> > 
> > If anyone with more experience with subuid/user namespaces has any
> > suggestions
> > I'd be glad to hear. On an Ubuntu machine it didn't seem like an
> > unprivileged
> > user could do this. I am going to keep working on this and see if I can find
> > a
> > better solution.
> > 
> > Changes since v2:
> >  - System wide BPF JIT limit (discussion with Jann Horn)
> >  - Holding reference to user correctly (Jann)
> >  - Having arch versions of module_alloc (Dave Hansen, Jessica Yu)
> >  - Shrinking of default limits, to help prevent the limit being worked
> > around
> >    with fragmentation (Jann)
> > 
> > Changes since v1:
> >  - Plug in for non-x86
> >  - Arch specific default values
> > 
> > 
> > Rick Edgecombe (3):
> >   modules: Create arch versions of module alloc/free
> >   modules: Create rlimit for module space
> >   bpf: Add system wide BPF JIT limit
> > 
> >  arch/arm/kernel/module.c                |   2 +-
> >  arch/arm64/kernel/module.c              |   2 +-
> >  arch/mips/kernel/module.c               |   2 +-
> >  arch/nds32/kernel/module.c              |   2 +-
> >  arch/nios2/kernel/module.c              |   4 +-
> >  arch/parisc/kernel/module.c             |   2 +-
> >  arch/s390/kernel/module.c               |   2 +-
> >  arch/sparc/kernel/module.c              |   2 +-
> >  arch/unicore32/kernel/module.c          |   2 +-
> >  arch/x86/include/asm/pgtable_32_types.h |   3 +
> >  arch/x86/include/asm/pgtable_64_types.h |   2 +
> >  arch/x86/kernel/module.c                |   2 +-
> >  fs/proc/base.c                          |   1 +
> >  include/asm-generic/resource.h          |   8 ++
> >  include/linux/bpf.h                     |   7 ++
> >  include/linux/filter.h                  |   1 +
> >  include/linux/sched/user.h              |   4 +
> >  include/uapi/asm-generic/resource.h     |   3 +-
> >  kernel/bpf/core.c                       |  22 +++-
> >  kernel/bpf/inode.c                      |  16 +++
> >  kernel/module.c                         | 152 +++++++++++++++++++++++-
> >  net/core/sysctl_net_core.c              |   7 ++
> >  22 files changed, 233 insertions(+), 15 deletions(-)
> > 
> > --
> > 2.17.1
> > 
branching predictors may not be able to predict past a certain distance. So
performance as well.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
  2018-10-22 23:06     ` Edgecombe, Rick P
                         ` (2 preceding siblings ...)
  (?)
@ 2018-10-23 11:54       ` Ard Biesheuvel
  -1 siblings, 0 replies; 47+ messages in thread
From: Ard Biesheuvel @ 2018-10-23 11:54 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, daniel, keescook, jeyu, linux-arch, arjan, tglx,
	linux-mips, linux-s390, jannh, x86, kristen, Dock, Deneen T,
	catalin.marinas, mingo, will.deacon, kernel-hardening, bp,
	Hansen, Dave, linux-arm-kernel, davem, arnd, linux-fsdevel,
	sparclinux

On 22 October 2018 at 20:06, Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> On Sat, 2018-10-20 at 19:20 +0200, Ard Biesheuvel wrote:
>> Hi Rick,
>>
>> On 19 October 2018 at 22:47, Rick Edgecombe <rick.p.edgecombe@intel.com>
>> wrote:
>> > If BPF JIT is on, there is no effective limit to prevent filling the entire
>> > module space with JITed e/BPF filters.
>>
>> Why do BPF filters use the module space, and does this reason apply to
>> all architectures?
>>
>> On arm64, we already support loading plain modules far away from the
>> core kernel (i.e. out of range for ordinary relative jump/call
>> instructions), and so I'd expect BPF to be able to deal with this
>> already as well. So for arm64, I wonder why an ordinary vmalloc_exec()
>> wouldn't be more appropriate.
> AFAIK, it's like you said about relative instruction limits, but also because
> some predictors don't predict past a certain distance. So performance as well.
> Not sure the reasons for each arch, or if they all apply for BPF JIT. There seem
> to be 8 by my count, that have a dedicated module space for some reason.
>
>> So before refactoring the module alloc/free routines to accommodate
>> BPF, I'd like to take one step back and assess whether it wouldn't be
>> more appropriate to have a separate bpf_alloc/free API, which could be
>> totally separate from module alloc/free if the arch permits it.
>>
> I am not a BPF JIT expert unfortunately, hopefully someone more authoritative
> will chime in. I only ran into this because I was trying to increase
> randomization for the module space and wanted to find out how many allocations
> needed to be supported.
>
> I'd guess though, that BPF JIT is just assuming that there will be some arch
> specific constraints about where text can be placed optimally and they would
> already be taken into account in the module space allocator.
>
> If there are no constraints for some arch, I'd wonder why not just update its
> module_alloc to use the whole space available. What exactly are the constraints
> for arm64 for normal modules?
>

Relative branches and the interactions with KAsan.

We just fixed something similar in the kprobes code: it was using
RWX-mapped module memory to store kprobed instructions, and we
replaced that with a simple vmalloc_exec() [and code to remap it
read-only], which was possible given that relative branches are always
emulated by arm64 kprobes.

So it depends on whether BPF code needs to be in relative branching
range from the calling code, and whether the BPF code itself is
emitted using relative branches into other parts of the code.

> It seems fine to me for architectures to have the option of solving this a
> different way. If potentially the rlimit ends up being the best solution for
> some architectures though, do you think this refactor (pretty close to just a
> name change) is that intrusive?
>
> I guess it could be just a BPF JIT per user limit and not touch module space,
> but I thought the module space limit was a more general solution as that is the
> actual limited resource.
>

I think it is wrong to conflate the two things. Limiting the number of
BPF allocations and limiting the number of module allocations are two
separate things, and the fact that BPF reuses module_alloc() out of
convenience does not mean a single rlimit for both is appropriate.


>> > For classic BPF filters attached with
>> > setsockopt SO_ATTACH_FILTER, there is no memlock rlimit check to limit the
>> > number of insertions like there is for the bpf syscall.
>> >
>> > This patch adds a per user rlimit for module space, as well as a system wide
>> > limit for BPF JIT. In a previously reviewed patchset, Jann Horn pointed out
>> > the
>> > problem that in some cases a user can get access to 65536 UIDs, so the
>> > effective
>> > limit cannot be set low enough to stop an attacker and be useful for the
>> > general
>> > case. A discussed alternative solution was a system wide limit for BPF JIT
>> > filters. This much more simply resolves the problem of exhaustion and
>> > de-randomizing in the case of non-CONFIG_BPF_JIT_ALWAYS_ON. If
>> > CONFIG_BPF_JIT_ALWAYS_ON is on however, BPF insertions will fail if another
>> > user
>> > exhausts the BPF JIT limit. In this case a per user limit is still needed.
>> > If
>> > the subuid facility is disabled for normal users, this should still be ok
>> > because the higher limit will not be able to be worked around
>> > that way.
>> >
>> > The new BPF JIT limit can be set like this:
>> > echo 5000000 > /proc/sys/net/core/bpf_jit_limit
>> >
>> > So I *think* this patchset should resolve that issue except for the
>> > configuration of CONFIG_BPF_JIT_ALWAYS_ON and subuid allowed for normal
>> > users.
>> > Better module space KASLR is another way to resolve the de-randomizing
>> > issue,
>> > and so then you would just be left with the BPF DOS in that configuration.
>> >
>> > Jann also pointed out how, with purposely fragmenting the module space, you
>> > could make the effective module space blockage area much larger. This is
>> > also
>> > somewhat un-resolved. The impact would depend on how big of a space you are
>> > trying to allocate. The limit has been lowered on x86_64 so that at least
>> > typical sized BPF filters cannot be blocked.
>> >
>> > If anyone with more experience with subuid/user namespaces has any
>> > suggestions
>> > I'd be glad to hear. On an Ubuntu machine it didn't seem like an
>> > unprivileged
>> > user could do this. I am going to keep working on this and see if I can find
>> > a
>> > better solution.
>> >
>> > Changes since v2:
>> >  - System wide BPF JIT limit (discussion with Jann Horn)
>> >  - Holding reference to user correctly (Jann)
>> >  - Having arch versions of module_alloc (Dave Hansen, Jessica Yu)
>> >  - Shrinking of default limits, to help prevent the limit being worked
>> > around
>> >    with fragmentation (Jann)
>> >
>> > Changes since v1:
>> >  - Plug in for non-x86
>> >  - Arch specific default values
>> >
>> >
>> > Rick Edgecombe (3):
>> >   modules: Create arch versions of module alloc/free
>> >   modules: Create rlimit for module space
>> >   bpf: Add system wide BPF JIT limit
>> >
>> >  arch/arm/kernel/module.c                |   2 +-
>> >  arch/arm64/kernel/module.c              |   2 +-
>> >  arch/mips/kernel/module.c               |   2 +-
>> >  arch/nds32/kernel/module.c              |   2 +-
>> >  arch/nios2/kernel/module.c              |   4 +-
>> >  arch/parisc/kernel/module.c             |   2 +-
>> >  arch/s390/kernel/module.c               |   2 +-
>> >  arch/sparc/kernel/module.c              |   2 +-
>> >  arch/unicore32/kernel/module.c          |   2 +-
>> >  arch/x86/include/asm/pgtable_32_types.h |   3 +
>> >  arch/x86/include/asm/pgtable_64_types.h |   2 +
>> >  arch/x86/kernel/module.c                |   2 +-
>> >  fs/proc/base.c                          |   1 +
>> >  include/asm-generic/resource.h          |   8 ++
>> >  include/linux/bpf.h                     |   7 ++
>> >  include/linux/filter.h                  |   1 +
>> >  include/linux/sched/user.h              |   4 +
>> >  include/uapi/asm-generic/resource.h     |   3 +-
>> >  kernel/bpf/core.c                       |  22 +++-
>> >  kernel/bpf/inode.c                      |  16 +++
>> >  kernel/module.c                         | 152 +++++++++++++++++++++++-
>> >  net/core/sysctl_net_core.c              |   7 ++
>> >  22 files changed, 233 insertions(+), 15 deletions(-)
>> >
>> > --
>> > 2.17.1
>> >
> branching predictors may not be able to predict past a certain distance. So
> performance as well.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
@ 2018-10-23 11:54       ` Ard Biesheuvel
  0 siblings, 0 replies; 47+ messages in thread
From: Ard Biesheuvel @ 2018-10-23 11:54 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, daniel, keescook, jeyu, linux-arch, arjan, tglx,
	linux-mips, linux-s390, jannh, x86, kristen, Dock, Deneen T,
	catalin.marinas, mingo, will.deacon, kernel-hardening, bp,
	Hansen, Dave, linux-arm-kernel, davem, arnd, linux-fsdevel,
	sparclinux

On 22 October 2018 at 20:06, Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> On Sat, 2018-10-20 at 19:20 +0200, Ard Biesheuvel wrote:
>> Hi Rick,
>>
>> On 19 October 2018 at 22:47, Rick Edgecombe <rick.p.edgecombe@intel.com>
>> wrote:
>> > If BPF JIT is on, there is no effective limit to prevent filling the entire
>> > module space with JITed e/BPF filters.
>>
>> Why do BPF filters use the module space, and does this reason apply to
>> all architectures?
>>
>> On arm64, we already support loading plain modules far away from the
>> core kernel (i.e. out of range for ordinary relative jump/calll
>> instructions), and so I'd expect BPF to be able to deal with this
>> already as well. So for arm64, I wonder why an ordinary vmalloc_exec()
>> wouldn't be more appropriate.
> AFAIK, it's like you said about relative instruction limits, but also because
> some predictors don't predict past a certain distance. So performance as well.
> Not sure the reasons for each arch, or if they all apply for BPF JIT. There seem
> to be 8 by my count, that have a dedicated module space for some reason.
>
>> So before refactoring the module alloc/free routines to accommodate
>> BPF, I'd like to take one step back and assess whether it wouldn't be
>> more appropriate to have a separate bpf_alloc/free API, which could be
>> totally separate from module alloc/free if the arch permits it.
>>
> I am not a BPF JIT expert unfortunately, hopefully someone more authoritative
> will chime in. I only ran into this because I was trying to increase
> randomization for the module space and wanted to find out how many allocations
> needed to be supported.
>
> I'd guess though, that BPF JIT is just assuming that there will be some arch
> specific constraints about where text can be placed optimally and they would
> already be taken into account in the module space allocator.
>
> If there are no constraints for some arch, I'd wonder why not just update its
> module_alloc to use the whole space available. What exactly are the constraints
> for arm64 for normal modules?
>

Relative branches and the interaction with KASAN.

We just fixed something similar in the kprobes code: it was using
RWX-mapped module memory to store kprobed instructions, and we
replaced that with a simple vmalloc_exec() [and code to remap it
read-only], which was possible given that relative branches are always
emulated by arm64 kprobes.

So it depends on whether BPF code needs to be in relative branching
range from the calling code, and whether the BPF code itself is
emitted using relative branches into other parts of the code.

> It seems fine to me for architectures to have the option of solving this a
> different way. If potentially the rlimit ends up being the best solution for
> some architectures though, do you think this refactor (pretty close to just a
> name change) is that intrusive?
>
> I guess it could be just a BPF JIT per user limit and not touch module space,
> but I thought the module space limit was a more general solution as that is the
> actual limited resource.
>

I think it is wrong to conflate the two things. Limiting the number of
BPF allocations and limiting the number of module allocations are two
separate things, and the fact that BPF reuses module_alloc() out of
convenience does not mean a single rlimit for both is appropriate.


>> > For classic BPF filters attached with
>> > setsockopt SO_ATTACH_FILTER, there is no memlock rlimit check to limit the
>> > number of insertions like there is for the bpf syscall.
>> >
>> > This patch adds a per user rlimit for module space, as well as a system wide
>> > limit for BPF JIT. In a previously reviewed patchset, Jann Horn pointed out
>> > the
>> > problem that in some cases a user can get access to 65536 UIDs, so the
>> > effective
>> > limit cannot be set low enough to stop an attacker and be useful for the
>> > general
>> > case. A discussed alternative solution was a system wide limit for BPF JIT
>> > filters. This much more simply resolves the problem of exhaustion and
>> > de-randomizing in the case of non-CONFIG_BPF_JIT_ALWAYS_ON. If
>> > CONFIG_BPF_JIT_ALWAYS_ON is on however, BPF insertions will fail if another
>> > user
>> > exhausts the BPF JIT limit. In this case a per user limit is still needed.
>> > If
>> > the subuid facility is disabled for normal users, this should still be ok
>> > because the higher limit will not be able to be worked around that way.
>> >
>> > The new BPF JIT limit can be set like this:
>> > echo 5000000 > /proc/sys/net/core/bpf_jit_limit
>> >
>> > So I *think* this patchset should resolve that issue except for the
>> > configuration of CONFIG_BPF_JIT_ALWAYS_ON and subuid allowed for normal
>> > users.
>> > Better module space KASLR is another way to resolve the de-randomizing
>> > issue,
>> > and so then you would just be left with the BPF DOS in that configuration.
>> >
>> > Jann also pointed out how, with purposely fragmenting the module space, you
>> > could make the effective module space blockage area much larger. This is
>> > also
>> > somewhat un-resolved. The impact would depend on how big of a space you are
>> > trying to allocate. The limit has been lowered on x86_64 so that at least
>> > typical sized BPF filters cannot be blocked.
>> >
>> > If anyone with more experience with subuid/user namespaces has any
>> > suggestions I'd be glad to hear. On an Ubuntu machine it didn't seem like an
>> > unprivileged user could do this. I am going to keep working on this and see
>> > if I can find a better solution.
>> >
>> > Changes since v2:
>> >  - System wide BPF JIT limit (discussion with Jann Horn)
>> >  - Holding reference to user correctly (Jann)
>> >  - Having arch versions of module_alloc (Dave Hansen, Jessica Yu)
>> >  - Shrinking of default limits, to help prevent the limit being worked
>> > around
>> >    with fragmentation (Jann)
>> >
>> > Changes since v1:
>> >  - Plug in for non-x86
>> >  - Arch specific default values
>> >
>> >
>> > Rick Edgecombe (3):
>> >   modules: Create arch versions of module alloc/free
>> >   modules: Create rlimit for module space
>> >   bpf: Add system wide BPF JIT limit
>> >
>> >  arch/arm/kernel/module.c                |   2 +-
>> >  arch/arm64/kernel/module.c              |   2 +-
>> >  arch/mips/kernel/module.c               |   2 +-
>> >  arch/nds32/kernel/module.c              |   2 +-
>> >  arch/nios2/kernel/module.c              |   4 +-
>> >  arch/parisc/kernel/module.c             |   2 +-
>> >  arch/s390/kernel/module.c               |   2 +-
>> >  arch/sparc/kernel/module.c              |   2 +-
>> >  arch/unicore32/kernel/module.c          |   2 +-
>> >  arch/x86/include/asm/pgtable_32_types.h |   3 +
>> >  arch/x86/include/asm/pgtable_64_types.h |   2 +
>> >  arch/x86/kernel/module.c                |   2 +-
>> >  fs/proc/base.c                          |   1 +
>> >  include/asm-generic/resource.h          |   8 ++
>> >  include/linux/bpf.h                     |   7 ++
>> >  include/linux/filter.h                  |   1 +
>> >  include/linux/sched/user.h              |   4 +
>> >  include/uapi/asm-generic/resource.h     |   3 +-
>> >  kernel/bpf/core.c                       |  22 +++-
>> >  kernel/bpf/inode.c                      |  16 +++
>> >  kernel/module.c                         | 152 +++++++++++++++++++++++-
>> >  net/core/sysctl_net_core.c              |   7 ++
>> >  22 files changed, 233 insertions(+), 15 deletions(-)
>> >
>> > --
>> > 2.17.1
>> >

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
  2018-10-23 11:54       ` Ard Biesheuvel
                           ` (2 preceding siblings ...)
  (?)
@ 2018-10-24 15:07         ` Jessica Yu
  -1 siblings, 0 replies; 47+ messages in thread
From: Jessica Yu @ 2018-10-24 15:07 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Edgecombe, Rick P, linux-kernel, daniel, keescook, linux-arch,
	arjan, tglx, linux-mips, linux-s390, jannh, x86, kristen, Dock,
	Deneen T, catalin.marinas, mingo, will.deacon, kernel-hardening,
	bp, Hansen, Dave, linux-arm-kernel, davem, arnd, linux-fsdevel,
	sparclinux

+++ Ard Biesheuvel [23/10/18 08:54 -0300]:
>On 22 October 2018 at 20:06, Edgecombe, Rick P
><rick.p.edgecombe@intel.com> wrote:
>> On Sat, 2018-10-20 at 19:20 +0200, Ard Biesheuvel wrote:
>>> Hi Rick,
>>>
>>> On 19 October 2018 at 22:47, Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> wrote:
>>> > If BPF JIT is on, there is no effective limit to prevent filling the entire
>>> > module space with JITed e/BPF filters.
>>>
>>> Why do BPF filters use the module space, and does this reason apply to
>>> all architectures?
>>>
>>> On arm64, we already support loading plain modules far away from the
>>> core kernel (i.e. out of range for ordinary relative jump/calll
>>> instructions), and so I'd expect BPF to be able to deal with this
>>> already as well. So for arm64, I wonder why an ordinary vmalloc_exec()
>>> wouldn't be more appropriate.
>> AFAIK, it's like you said about relative instruction limits, but also because
>> some predictors don't predict past a certain distance. So performance as well.
>> Not sure the reasons for each arch, or if they all apply for BPF JIT. There seem
>> to be 8 by my count, that have a dedicated module space for some reason.
>>
>>> So before refactoring the module alloc/free routines to accommodate
>>> BPF, I'd like to take one step back and assess whether it wouldn't be
>>> more appropriate to have a separate bpf_alloc/free API, which could be
>>> totally separate from module alloc/free if the arch permits it.
>>>
>> I am not a BPF JIT expert unfortunately, hopefully someone more authoritative
>> will chime in. I only ran into this because I was trying to increase
>> randomization for the module space and wanted to find out how many allocations
>> needed to be supported.
>>
>> I'd guess though, that BPF JIT is just assuming that there will be some arch
>> specific constraints about where text can be placed optimally and they would
>> already be taken into account in the module space allocator.
>>
>> If there are no constraints for some arch, I'd wonder why not just update its
>> module_alloc to use the whole space available. What exactly are the constraints
>> for arm64 for normal modules?
>>
>
>Relative branches and the interactions with KAsan.
>
>We just fixed something similar in the kprobes code: it was using
>RWX-mapped module memory to store kprobed instructions, and we
>replaced that with a simple vmalloc_exec() [and code to remap it
>read-only], which was possible given that relative branches are always
>emulated by arm64 kprobes.
>
>So it depends on whether BPF code needs to be in relative branching
>range from the calling code, and whether the BPF code itself is
>emitted using relative branches into other parts of the code.
>
>> It seems fine to me for architectures to have the option of solving this a
>> different way. If potentially the rlimit ends up being the best solution for
>> some architectures though, do you think this refactor (pretty close to just a
>> name change) is that intrusive?
>>
>> I guess it could be just a BPF JIT per user limit and not touch module space,
>> but I thought the module space limit was a more general solution as that is the
>> actual limited resource.
>>
>
>I think it is wrong to conflate the two things. Limiting the number of
>BPF allocations and the limiting number of module allocations are two
>separate things, and the fact that BPF reuses module_alloc() out of
>convenience does not mean a single rlimit for both is appropriate.

Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
users of module_alloc(), i.e. kprobes, ftrace, and bpf, seem to use it
because it is an easy way to obtain executable kernel memory at runtime
(and, depending on the needs of the architecture, memory that is
additionally reachable via relative branches). The side effect is
that all these users share the "module" memory space, even though this
memory region is not exclusively used by modules (well, personally I
think it technically should be, because seeing module_alloc() usage
outside of the module loader is kind of a misuse of the module API, and
it's confusing for people who don't know the reason behind it).

Right now I'm not sure if it makes sense to impose a blanket limit on
all module_alloc() allocations when the real motivation behind the
rlimit is related to BPF, i.e., to stop unprivileged users from
hogging up all the vmalloc space for modules with JITed BPF filters.
So the rlimit has more to do with limiting the memory usage of BPF
filters than it has to do with modules themselves.

I think Ard's suggestion of having a separate bpf_alloc/free API makes
a lot of sense if we want to keep track of bpf-related allocations
(and then the rlimit would be enforced for those). Maybe part of the
module mapping space could be carved out for bpf filters (e.g. have
BPF_VADDR, BPF_VSIZE, etc. like we have for modules), or we could
continue sharing the region but explicitly define a separate bpf_alloc
API, depending on an architecture's needs. What do people think?

Thanks,

Jessica


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
  2018-10-24 15:07         ` Jessica Yu
                             ` (2 preceding siblings ...)
  (?)
@ 2018-10-24 16:04           ` Daniel Borkmann
  -1 siblings, 0 replies; 47+ messages in thread
From: Daniel Borkmann @ 2018-10-24 16:04 UTC (permalink / raw)
  To: Jessica Yu, Ard Biesheuvel
  Cc: Edgecombe, Rick P, linux-kernel, keescook, linux-arch, arjan,
	tglx, linux-mips, linux-s390, jannh, x86, kristen, Dock,
	Deneen T, catalin.marinas, mingo, will.deacon, kernel-hardening,
	bp, Hansen, Dave, linux-arm-kernel, davem, arnd, linux-fsdevel,
	sparclinux, alexei.starovoitov, netdev

[ +Alexei, netdev ]

On 10/24/2018 05:07 PM, Jessica Yu wrote:
> +++ Ard Biesheuvel [23/10/18 08:54 -0300]:
>> On 22 October 2018 at 20:06, Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
[...]
>> I think it is wrong to conflate the two things. Limiting the number of
>> BPF allocations and limiting the number of module allocations are two
>> separate things, and the fact that BPF reuses module_alloc() out of
>> convenience does not mean a single rlimit for both is appropriate.
> 
> Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
> users of module_alloc() i.e. kprobes, ftrace, bpf, seem to use it
> because it is an easy way to obtain executable kernel memory (and
> depending on the needs of the architecture, being additionally
> reachable via relative branches) during runtime. The side effect is
> that all these users share the "module" memory space, even though this
> memory region is not exclusively used by modules (well, personally I
> think it technically should be, because seeing module_alloc() usage
> outside of the module loader is kind of a misuse of the module API and
> it's confusing for people who don't know the reason behind its usage
> outside of the module loader).
> 
> Right now I'm not sure if it makes sense to impose a blanket limit on
> all module_alloc() allocations when the real motivation behind the
> rlimit is related to BPF, i.e., to stop unprivileged users from
> hogging up all the vmalloc space for modules with JITed BPF filters.
> So the rlimit has more to do with limiting the memory usage of BPF
> filters than it has to do with modules themselves.
> 
> I think Ard's suggestion of having a separate bpf_alloc/free API makes
> a lot of sense if we want to keep track of bpf-related allocations
> (and then the rlimit would be enforced for those). Maybe part of the
> module mapping space could be carved out for bpf filters (e.g. have
> BPF_VADDR, BPF_VSIZE, etc like how we have it for modules), or
> continue sharing the region but explicitly define a separate bpf_alloc
> API, depending on an architecture's needs. What do people think?

Hmm, I think there are several issues mixed up here at the same time, which
is just very confusing, imho:

1) The fact that there are several non-module users of module_alloc(),
as Jessica notes, such as kprobes, ftrace and bpf. While none of them
are modules, they all need to allocate some piece of executable
memory. It's nothing new; this has existed for 7 years now on the BPF
side, since 0a14842f5a3c ("net: filter: Just In Time compiler for
x86-64"); effectively that is even /before/ eBPF existed. Having a
different API for all these users seems to make sense if the goal is
not to interfere with modules themselves. It might also have the
benefit of letting us increase that memory pool if we're hitting
limits at scale, which would not be a concern for normal kernel
modules since usually just a very few of them are needed (unlike
dynamically tracing various kernel parts 24/7 w/ or w/o BPF, running
BPF-seccomp policies, networking BPF policies, etc, which need to
scale w/ application or container deployment; so this is of a much
more dynamic and unpredictable nature).

2) Then there is the rlimit, which proposes to have a "fairer" share
among unprivileged users. I'm not fully sure yet whether an rlimit is
actually a nice usable interface for all this. I'd agree that
something is needed in that regard, but I also tend to like Michal
Hocko's cgroup proposal, iow, having such resource control as part of
the memory cgroup could be something to consider, _especially_ since
it already _exists_ for vmalloc()-backed memory, so this should not be
much different than that. It sounds like 2) comes on top of 1).

3) Last but not least, there's a short term fix which is needed
independently of 1) and 2) and should be done immediately, which is to
account for unprivileged users and restrict them based on a globally
configurable limit such that privileged use keeps functioning; 2)
could then enforce based on that global upper limit, for example. The
pending fix is under https://patchwork.ozlabs.org/patch/987971/ which
we intend to ship to Linus as soon as possible as a short term fix.
Then something like memcg can be considered on top, since it seems
that makes the most sense from a usability point of view.
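The accounting in 3) can be sketched in userspace C as follows. Names and the limit value are hypothetical; the idea is an atomic global charge against a configurable byte limit, with a back-off on failure and a bypass for privileged callers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Globally configurable limit (in bytes) on JIT memory charged to
 * unprivileged users; hypothetical example value. */
static long bpf_jit_limit = 4096 * 10;
static atomic_long bpf_jit_current = 0;

/* Charge 'size' bytes against the global pool. Unprivileged callers
 * are rejected once the limit would be exceeded; privileged callers
 * always succeed so admin use keeps functioning. */
static bool bpf_jit_charge(size_t size, bool capable_admin)
{
	long cur = atomic_fetch_add(&bpf_jit_current, (long)size) + (long)size;

	if (cur > bpf_jit_limit && !capable_admin) {
		atomic_fetch_sub(&bpf_jit_current, (long)size); /* back off */
		return false;
	}
	return true;
}

/* Return the bytes to the pool when the JITed image is freed. */
static void bpf_jit_uncharge(size_t size)
{
	atomic_fetch_sub(&bpf_jit_current, (long)size);
}
```

Every JIT allocation would be bracketed by a charge/uncharge pair, so the global counter tracks live JITed memory rather than cumulative allocations.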

Thanks a lot,
Daniel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
@ 2018-10-24 16:04           ` Daniel Borkmann
  0 siblings, 0 replies; 47+ messages in thread
From: Daniel Borkmann @ 2018-10-24 16:04 UTC (permalink / raw)
  To: Jessica Yu, Ard Biesheuvel
  Cc: Edgecombe, Rick P, linux-kernel, keescook, linux-arch, arjan,
	tglx, linux-mips, linux-s390, jannh, x86, kristen, Dock,
	Deneen T, catalin.marinas, mingo, will.deacon,
	kernel-hardening@lists.openwall.com

[ +Alexei, netdev ]

On 10/24/2018 05:07 PM, Jessica Yu wrote:
> +++ Ard Biesheuvel [23/10/18 08:54 -0300]:
>> On 22 October 2018 at 20:06, Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
[...]
>> I think it is wrong to conflate the two things. Limiting the number of
>> BPF allocations and the limiting number of module allocations are two
>> separate things, and the fact that BPF reuses module_alloc() out of
>> convenience does not mean a single rlimit for both is appropriate.
> 
> Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
> users of module_alloc() i.e. kprobes, ftrace, bpf, seem to use it
> because it is an easy way to obtain executable kernel memory (and
> depending on the needs of the architecture, being additionally
> reachable via relative branches) during runtime. The side effect is
> that all these users share the "module" memory space, even though this
> memory region is not exclusively used by modules (well, personally I
> think it technically should be, because seeing module_alloc() usage
> outside of the module loader is kind of a misuse of the module API and
> it's confusing for people who don't know the reason behind its usage
> outside of the module loader).
> 
> Right now I'm not sure if it makes sense to impose a blanket limit on
> all module_alloc() allocations when the real motivation behind the
> rlimit is related to BPF, i.e., to stop unprivileged users from
> hogging up all the vmalloc space for modules with JITed BPF filters.
> So the rlimit has more to do with limiting the memory usage of BPF
> filters than it has to do with modules themselves.
> 
> I think Ard's suggestion of having a separate bpf_alloc/free API makes
> a lot of sense if we want to keep track of bpf-related allocations
> (and then the rlimit would be enforced for those). Maybe part of the
> module mapping space could be carved out for bpf filters (e.g. have
> BPF_VADDR, BPF_VSIZE, etc like how we have it for modules), or
> continue sharing the region but explicitly define a separate bpf_alloc
> API, depending on an architecture's needs. What do people think?

Hmm, I think here are several issues mixed up at the same time which is just
very confusing, imho:

1) The fact that there are several non-module users of module_alloc()
as Jessica notes such as kprobes, ftrace, bpf, for example. While all of
them are not being modules, they all need to alloc some piece of executable
memory. It's nothing new, this exists for 7 years now since 0a14842f5a3c
("net: filter: Just In Time compiler for x86-64") from BPF side; effectively
that is even /before/ eBPF existed. Having some different API perhaps for all
these users seems to make sense if the goal is not to interfere with modules
themselves. It might also help as a benefit to potentially increase that
memory pool if we're hitting limits at scale which would not be a concern
for normal kernel modules since there's usually just a very few of them
needed (unlike dynamically tracing various kernel parts 24/7 w/ or w/o BPF,
running BPF-seccomp policies, networking BPF policies, etc which need to
scale w/ application or container deployment; so this is of much more
dynamic and unpredictable nature).

2) Then there is rlimit which is proposing to have a "fairer" share among
unprivileged users. I'm not fully sure yet whether rlimit is actually a
nice usable interface for all this. I'd agree that something is needed
on that regard, but I also tend to like Michal Hocko's cgroup proposal,
iow, to have such resource control as part of memory cgroup could be
something to consider _especially_ since it already _exists_ for vmalloc()
backed memory so this should be not much different than that. It sounds
like 2) comes on top of 1).

3) Last but not least, there's a short term fix which is needed independently
of 1) and 2) and should be done immediately which is to account for
unprivileged users and restrict them based on a global configurable
limit such that privileged use keeps functioning, and 2) could enforce
based on the global upper limit, for example. Pending fix is under
https://patchwork.ozlabs.org/patch/987971/ which we intend to ship to
Linus as soon as possible as short term fix. Then something like memcg
can be considered on top since it seems this makes most sense from a
usability point.

Thanks a lot,
Daniel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
@ 2018-10-24 16:04           ` Daniel Borkmann
  0 siblings, 0 replies; 47+ messages in thread
From: Daniel Borkmann @ 2018-10-24 16:04 UTC (permalink / raw)
  To: Jessica Yu, Ard Biesheuvel
  Cc: Edgecombe, Rick P, linux-kernel, keescook, linux-arch, arjan,
	tglx, linux-mips, linux-s390, jannh, x86, kristen, Dock,
	Deneen T, catalin.marinas, mingo, will.deacon, kernel-hardening,
	bp, Hansen, Dave, linux-arm-kernel, davem, arnd, linux-fsdevel,
	sparclinux, alexei.starovoitov, netdev

[ +Alexei, netdev ]

On 10/24/2018 05:07 PM, Jessica Yu wrote:
> +++ Ard Biesheuvel [23/10/18 08:54 -0300]:
>> On 22 October 2018 at 20:06, Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
[...]
>> I think it is wrong to conflate the two things. Limiting the number of
>> BPF allocations and the limiting number of module allocations are two
>> separate things, and the fact that BPF reuses module_alloc() out of
>> convenience does not mean a single rlimit for both is appropriate.
> 
> Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
> users of module_alloc() i.e. kprobes, ftrace, bpf, seem to use it
> because it is an easy way to obtain executable kernel memory (and
> depending on the needs of the architecture, being additionally
> reachable via relative branches) during runtime. The side effect is
> that all these users share the "module" memory space, even though this
> memory region is not exclusively used by modules (well, personally I
> think it technically should be, because seeing module_alloc() usage
> outside of the module loader is kind of a misuse of the module API and
> it's confusing for people who don't know the reason behind its usage
> outside of the module loader).
> 
> Right now I'm not sure if it makes sense to impose a blanket limit on
> all module_alloc() allocations when the real motivation behind the
> rlimit is related to BPF, i.e., to stop unprivileged users from
> hogging up all the vmalloc space for modules with JITed BPF filters.
> So the rlimit has more to do with limiting the memory usage of BPF
> filters than it has to do with modules themselves.
> 
> I think Ard's suggestion of having a separate bpf_alloc/free API makes
> a lot of sense if we want to keep track of bpf-related allocations
> (and then the rlimit would be enforced for those). Maybe part of the
> module mapping space could be carved out for bpf filters (e.g. have
> BPF_VADDR, BPF_VSIZE, etc like how we have it for modules), or
> continue sharing the region but explicitly define a separate bpf_alloc
> API, depending on an architecture's needs. What do people think?

Hmm, I think here are several issues mixed up at the same time which is just
very confusing, imho:

1) The fact that there are several non-module users of module_alloc()
as Jessica notes such as kprobes, ftrace, bpf, for example. While all of
them are not being modules, they all need to alloc some piece of executable
memory. It's nothing new, this exists for 7 years now since 0a14842f5a3c
("net: filter: Just In Time compiler for x86-64") from BPF side; effectively
that is even /before/ eBPF existed. Having some different API perhaps for all
these users seems to make sense if the goal is not to interfere with modules
themselves. It might also help as a benefit to potentially increase that
memory pool if we're hitting limits at scale which would not be a concern
for normal kernel modules since there's usually just a very few of them
needed (unlike dynamically tracing various kernel parts 24/7 w/ or w/o BPF,
running BPF-seccomp policies, networking BPF policies, etc which need to
scale w/ application or container deployment; so this is of much more
dynamic and unpredictable nature).

2) Then there is rlimit which is proposing to have a "fairer" share among
unprivileged users. I'm not fully sure yet whether rlimit is actually a
nice usable interface for all this. I'd agree that something is needed
on that regard, but I also tend to like Michal Hocko's cgroup proposal,
iow, to have such resource control as part of memory cgroup could be
something to consider _especially_ since it already _exists_ for vmalloc()
backed memory so this should be not much different than that. It sounds
like 2) comes on top of 1).

3) Last but not least, there's a short term fix which is needed independently
of 1) and 2) and should be done immediately which is to account for
unprivileged users and restrict them based on a global configurable
limit such that privileged use keeps functioning, and 2) could enforce
based on the global upper limit, for example. Pending fix is under
https://patchwork.ozlabs.org/patch/987971/ which we intend to ship to
Linus as soon as possible as short term fix. Then something like memcg
can be considered on top since it seems this makes most sense from a
usability point.

Thanks a lot,
Daniel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
@ 2018-10-24 16:04           ` Daniel Borkmann
  0 siblings, 0 replies; 47+ messages in thread
From: Daniel Borkmann @ 2018-10-24 16:04 UTC (permalink / raw)
  To: Jessica Yu, Ard Biesheuvel
  Cc: Edgecombe, Rick P, linux-kernel, keescook, linux-arch, arjan,
	tglx, linux-mips, linux-s390, jannh, x86, kristen, Dock,
	Deneen T, catalin.marinas, mingo, will.deacon, kernel-hardening,
	bp, Hansen, Dave, linux-arm-kernel, davem, arnd, linux-fsdevel,
	sparclinux, alexei.starovoitov, netdev

[ +Alexei, netdev ]

On 10/24/2018 05:07 PM, Jessica Yu wrote:
> +++ Ard Biesheuvel [23/10/18 08:54 -0300]:
>> On 22 October 2018 at 20:06, Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
[...]
>> I think it is wrong to conflate the two things. Limiting the number of
>> BPF allocations and the limiting number of module allocations are two
>> separate things, and the fact that BPF reuses module_alloc() out of
>> convenience does not mean a single rlimit for both is appropriate.
> 
> Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
> users of module_alloc() i.e. kprobes, ftrace, bpf, seem to use it
> because it is an easy way to obtain executable kernel memory (and
> depending on the needs of the architecture, being additionally
> reachable via relative branches) during runtime. The side effect is
> that all these users share the "module" memory space, even though this
> memory region is not exclusively used by modules (well, personally I
> think it technically should be, because seeing module_alloc() usage
> outside of the module loader is kind of a misuse of the module API and
> it's confusing for people who don't know the reason behind its usage
> outside of the module loader).
> 
> Right now I'm not sure if it makes sense to impose a blanket limit on
> all module_alloc() allocations when the real motivation behind the
> rlimit is related to BPF, i.e., to stop unprivileged users from
> hogging up all the vmalloc space for modules with JITed BPF filters.
> So the rlimit has more to do with limiting the memory usage of BPF
> filters than it has to do with modules themselves.
> 
> I think Ard's suggestion of having a separate bpf_alloc/free API makes
> a lot of sense if we want to keep track of bpf-related allocations
> (and then the rlimit would be enforced for those). Maybe part of the
> module mapping space could be carved out for bpf filters (e.g. have
> BPF_VADDR, BPF_VSIZE, etc like how we have it for modules), or
> continue sharing the region but explicitly define a separate bpf_alloc
> API, depending on an architecture's needs. What do people think?

Hmm, I think here are several issues mixed up at the same time which is just
very confusing, imho:

1) The fact that there are several non-module users of module_alloc()
as Jessica notes such as kprobes, ftrace, bpf, for example. While all of
them are not being modules, they all need to alloc some piece of executable
memory. It's nothing new, this exists for 7 years now since 0a14842f5a3c
("net: filter: Just In Time compiler for x86-64") from BPF side; effectively
that is even /before/ eBPF existed. Having some different API perhaps for all
these users seems to make sense if the goal is not to interfere with modules
themselves. It might also help as a benefit to potentially increase that
memory pool if we're hitting limits at scale which would not be a concern
for normal kernel modules since there's usually just a very few of them
needed (unlike dynamically tracing various kernel parts 24/7 w/ or w/o BPF,
running BPF-seccomp policies, networking BPF policies, etc which need to
scale w/ application or container deployment; so this is of much more
dynamic and unpredictable nature).

2) Then there is rlimit which is proposing to have a "fairer" share among
unprivileged users. I'm not fully sure yet whether rlimit is actually a
nice usable interface for all this. I'd agree that something is needed
on that regard, but I also tend to like Michal Hocko's cgroup proposal,
iow, to have such resource control as part of memory cgroup could be
something to consider _especially_ since it already _exists_ for vmalloc()
backed memory so this should be not much different than that. It sounds
like 2) comes on top of 1).

3) Last but not least, there's a short term fix which is needed independently
of 1) and 2) and should be done immediately which is to account for
unprivileged users and restrict them based on a global configurable
limit such that privileged use keeps functioning, and 2) could enforce
based on the global upper limit, for example. Pending fix is under
https://patchwork.ozlabs.org/patch/987971/ which we intend to ship to
Linus as soon as possible as short term fix. Then something like memcg
can be considered on top since it seems this makes most sense from a
usability point.

Thanks a lot,
Daniel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH RFC v3 0/3] Rlimit for module space
@ 2018-10-24 16:04           ` Daniel Borkmann
  0 siblings, 0 replies; 47+ messages in thread
From: Daniel Borkmann @ 2018-10-24 16:04 UTC (permalink / raw)
  To: linux-arm-kernel

[ +Alexei, netdev ]

On 10/24/2018 05:07 PM, Jessica Yu wrote:
> +++ Ard Biesheuvel [23/10/18 08:54 -0300]:
>> On 22 October 2018 at 20:06, Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
[...]
>> I think it is wrong to conflate the two things. Limiting the number of
>> BPF allocations and the limiting number of module allocations are two
>> separate things, and the fact that BPF reuses module_alloc() out of
>> convenience does not mean a single rlimit for both is appropriate.
> 
> Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
> users of module_alloc() i.e. kprobes, ftrace, bpf, seem to use it
> because it is an easy way to obtain executable kernel memory (and
> depending on the needs of the architecture, being additionally
> reachable via relative branches) during runtime. The side effect is
> that all these users share the "module" memory space, even though this
> memory region is not exclusively used by modules (well, personally I
> think it technically should be, because seeing module_alloc() usage
> outside of the module loader is kind of a misuse of the module API and
> it's confusing for people who don't know the reason behind its usage
> outside of the module loader).
> 
> Right now I'm not sure if it makes sense to impose a blanket limit on
> all module_alloc() allocations when the real motivation behind the
> rlimit is related to BPF, i.e., to stop unprivileged users from
> hogging up all the vmalloc space for modules with JITed BPF filters.
> So the rlimit has more to do with limiting the memory usage of BPF
> filters than it has to do with modules themselves.
> 
> I think Ard's suggestion of having a separate bpf_alloc/free API makes
> a lot of sense if we want to keep track of bpf-related allocations
> (and then the rlimit would be enforced for those). Maybe part of the
> module mapping space could be carved out for bpf filters (e.g. have
> BPF_VADDR, BPF_VSIZE, etc like how we have it for modules), or
> continue sharing the region but explicitly define a separate bpf_alloc
> API, depending on an architecture's needs. What do people think?

Hmm, I think there are several issues mixed up here at the same time, which
is just very confusing, imho:

1) The fact that there are several non-module users of module_alloc(),
as Jessica notes, such as kprobes, ftrace, and bpf, for example. While none
of them are modules, they all need to allocate some piece of executable
memory. It's nothing new; this has existed for 7 years now, since 0a14842f5a3c
("net: filter: Just In Time compiler for x86-64") on the BPF side; effectively
that is even /before/ eBPF existed. Having some different API for all
these users seems to make sense if the goal is not to interfere with modules
themselves. It might also bring the benefit of potentially increasing that
memory pool if we're hitting limits at scale, which would not be a concern
for normal kernel modules since usually just a very few of them are
needed (unlike dynamically tracing various kernel parts 24/7 w/ or w/o BPF,
running BPF-seccomp policies, networking BPF policies, etc., which need to
scale w/ application or container deployment; so this is of a much more
dynamic and unpredictable nature).

2) Then there is the rlimit, which proposes to have a "fairer" share among
unprivileged users. I'm not fully sure yet whether an rlimit is actually a
nice usable interface for all this. I'd agree that something is needed
in that regard, but I also tend to like Michal Hocko's cgroup proposal,
iow, having such resource control as part of the memory cgroup could be
something to consider, _especially_ since it already _exists_ for vmalloc()
backed memory, so this should not be much different than that. It sounds
like 2) comes on top of 1).

3) Last but not least, there's a short term fix which is needed independently
of 1) and 2) and should be done immediately, which is to account for
unprivileged users and restrict them based on a global configurable
limit such that privileged use keeps functioning; 2) could then enforce
based on the global upper limit, for example. The pending fix is under
https://patchwork.ozlabs.org/patch/987971/ which we intend to ship to
Linus as soon as possible as a short term fix. Then something like memcg
can be considered on top, since it seems this makes the most sense from a
usability point of view.

Thanks a lot,
Daniel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
  2018-10-24 16:04           ` Daniel Borkmann
                               ` (2 preceding siblings ...)
  (?)
@ 2018-10-25  1:01             ` Edgecombe, Rick P
  -1 siblings, 0 replies; 47+ messages in thread
From: Edgecombe, Rick P @ 2018-10-25  1:01 UTC (permalink / raw)
  To: daniel, jeyu, ard.biesheuvel, mhocko
  Cc: linux-fsdevel, linux-kernel, jannh, keescook, arjan, netdev,
	tglx, linux-mips, linux-s390, x86, kristen, Dock, Deneen T,
	catalin.marinas, mingo, will.deacon, kernel-hardening, bp,
	Hansen, Dave, linux-arm-kernel, davem, linux-arch, arnd,
	sparclinux, alexei.starovoitov

On Wed, 2018-10-24 at 18:04 +0200, Daniel Borkmann wrote:
> [ +Alexei, netdev ]
> 
> On 10/24/2018 05:07 PM, Jessica Yu wrote:
> > +++ Ard Biesheuvel [23/10/18 08:54 -0300]:
> > > On 22 October 2018 at 20:06, Edgecombe, Rick P
> > > <rick.p.edgecombe@intel.com> wrote:
> 
> [...]
> > > I think it is wrong to conflate the two things. Limiting the number of
> > > BPF allocations and the limiting number of module allocations are two
> > > separate things, and the fact that BPF reuses module_alloc() out of
> > > convenience does not mean a single rlimit for both is appropriate.
> > 
> > Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
> > users of module_alloc() i.e. kprobes, ftrace, bpf, seem to use it
> > because it is an easy way to obtain executable kernel memory (and
> > depending on the needs of the architecture, being additionally
> > reachable via relative branches) during runtime. The side effect is
> > that all these users share the "module" memory space, even though this
> > memory region is not exclusively used by modules (well, personally I
> > think it technically should be, because seeing module_alloc() usage
> > outside of the module loader is kind of a misuse of the module API and
> > it's confusing for people who don't know the reason behind its usage
> > outside of the module loader).
> > 
> > Right now I'm not sure if it makes sense to impose a blanket limit on
> > all module_alloc() allocations when the real motivation behind the
> > rlimit is related to BPF, i.e., to stop unprivileged users from
> > hogging up all the vmalloc space for modules with JITed BPF filters.
> > So the rlimit has more to do with limiting the memory usage of BPF
> > filters than it has to do with modules themselves.
> > 
> > I think Ard's suggestion of having a separate bpf_alloc/free API makes
> > a lot of sense if we want to keep track of bpf-related allocations
> > (and then the rlimit would be enforced for those). Maybe part of the
> > module mapping space could be carved out for bpf filters (e.g. have
> > BPF_VADDR, BPF_VSIZE, etc like how we have it for modules), or
> > continue sharing the region but explicitly define a separate bpf_alloc
> > API, depending on an architecture's needs. What do people think?
> 
> Hmm, I think here are several issues mixed up at the same time which is just
> very confusing, imho:
> 
> 1) The fact that there are several non-module users of module_alloc()
> as Jessica notes such as kprobes, ftrace, bpf, for example. While all of
> them are not being modules, they all need to alloc some piece of executable
> memory. It's nothing new, this exists for 7 years now since 0a14842f5a3c
> ("net: filter: Just In Time compiler for x86-64") from BPF side; effectively
> that is even /before/ eBPF existed. Having some different API perhaps for all
> these users seems to make sense if the goal is not to interfere with modules
> themselves. It might also help as a benefit to potentially increase that
> memory pool if we're hitting limits at scale which would not be a concern
> for normal kernel modules since there's usually just a very few of them
> needed (unlike dynamically tracing various kernel parts 24/7 w/ or w/o BPF,
> running BPF-seccomp policies, networking BPF policies, etc which need to
> scale w/ application or container deployment; so this is of much more
> dynamic and unpredictable nature).
> 
> 2) Then there is rlimit which is proposing to have a "fairer" share among
> unprivileged users. I'm not fully sure yet whether rlimit is actually a
> nice usable interface for all this. I'd agree that something is needed
> on that regard, but I also tend to like Michal Hocko's cgroup proposal,
> iow, to have such resource control as part of memory cgroup could be
> something to consider _especially_ since it already _exists_ for vmalloc()
> backed memory so this should be not much different than that. It sounds
> like 2) comes on top of 1).
FWIW, cgroups seem like a better solution than rlimit to me too. Maybe you
all already know, but it looks like the cgroups vmalloc charge is done in the
main page allocator and counts against the whole kernel memory limit. A user
may want a higher kernel limit than the module space size, so it seems it
isn't enough by itself and some new limit would need to be added.

As for what the limit should be, I wonder if some of the disagreement is just
from the name "module space".

There is a limited resource of physical memory, so we have limits for it.
There is a limited resource of CPU time, so we have limits for it. If there
is a limited resource of preferred address space for executable code, why not
just continue that trend? If other forms of unprivileged JIT come along, would
it be better to have N separate limits, one for each type? request_module()
probably can't fill the space, but technically there are already 2 unprivileged
users. So IMHO, it's a more forward-looking solution.

If there are some usage/architecture combos that don't need the preferred space
they can allocate in vmalloc and have it not count against the preferred space
limit but still count against the cgroups kernel memory limit.

Another benefit of centralizing the allocation of the "executable memory
preferred space" is KASLR. Right now that randomization is only done in
module_alloc(), and so all users of it get randomized. If they all called
vmalloc() by themselves, they would just use the normal allocator.

Anyway, it seems like either type of limit (BPF JIT or all module space) will
solve the problem equally well today.

> 3) Last but not least, there's a short term fix which is needed independently
> of 1) and 2) and should be done immediately which is to account for
> unprivileged users and restrict them based on a global configurable
> limit such that privileged use keeps functioning, and 2) could enforce
> based on the global upper limit, for example. Pending fix is under
> https://patchwork.ozlabs.org/patch/987971/ which we intend to ship to
> Linus as soon as possible as short term fix. Then something like memcg
> can be considered on top since it seems this makes most sense from a
> usability point.
> 
> Thanks a lot,
> Daniel

^ permalink raw reply	[flat|nested] 47+ messages in thread

> > continue sharing the region but explicitly define a separate bpf_alloc
> > API, depending on an architecture's needs. What do people think?
> 
> Hmm, I think here are several issues mixed up at the same time which is just
> very confusing, imho:
> 
> 1) The fact that there are several non-module users of module_alloc()
> as Jessica notes such as kprobes, ftrace, bpf, for example. While all of
> them are not being modules, they all need to alloc some piece of executable
> memory. It's nothing new, this exists for 7 years now since 0a14842f5a3c
> ("net: filter: Just In Time compiler for x86-64") from BPF side; effectively
> that is even /before/ eBPF existed. Having some different API perhaps for all
> these users seems to make sense if the goal is not to interfere with modules
> themselves. It might also help as a benefit to potentially increase that
> memory pool if we're hitting limits at scale which would not be a concern
> for normal kernel modules since there's usually just a very few of them
> needed (unlike dynamically tracing various kernel parts 24/7 w/ or w/o BPF,
> running BPF-seccomp policies, networking BPF policies, etc which need to
> scale w/ application or container deployment; so this is of much more
> dynamic and unpredictable nature).
> 
> 2) Then there is rlimit which is proposing to have a "fairer" share among
> unprivileged users. I'm not fully sure yet whether rlimit is actually a
> nice usable interface for all this. I'd agree that something is needed
> on that regard, but I also tend to like Michal Hocko's cgroup proposal,
> iow, to have such resource control as part of memory cgroup could be
> something to consider _especially_ since it already _exists_ for vmalloc()
> backed memory so this should be not much different than that. It sounds
> like 2) comes on top of 1).
FWIW, cgroups seem like a better solution than rlimit to me too. Maybe you all
already know, but it looks like the cgroups vmalloc charge is done in the main
page allocator and counts against the whole kernel memory limit. A user may want
a higher kernel memory limit than the module space size, so the cgroup limit isn't
enough by itself and some new limit would still need to be added.

As for what the limit should be, I wonder if some of the disagreement is just
from the name "module space".

There is a limited resource of physical memory, so we have limits for it. There
is a limited resource of CPU time, so we have limits for it. If there is a
limited resource of preferred address space for executable code, why not just
continue that trend? If other forms of unprivileged JIT come along, would it be
better to have N limits, one per type? request_module() probably can't fill the
space, but technically there are already two unprivileged users. So IMHO, a
single limit is a more forward-looking solution.

If there are some usage/architecture combos that don't need the preferred space,
they can allocate with vmalloc and have it not count against the preferred-space
limit, while still counting against the cgroups kernel memory limit.

Another benefit of centralizing the allocation of the "executable memory
preferred space" is KASLR. Right now randomization is only done in
module_alloc(), so all of its users get randomized addresses. If they each
called vmalloc() themselves, they would just use the normal, unrandomized
allocator.

Anyway, it seems like either type of limit (BPF JIT or all module space) will
solve the problem equally well today.

> 3) Last but not least, there's a short term fix which is needed independently
> of 1) and 2) and should be done immediately which is to account for
> unprivileged users and restrict them based on a global configurable
> limit such that privileged use keeps functioning, and 2) could enforce
> based on the global upper limit, for example. Pending fix is under
> https://patchwork.ozlabs.org/patch/987971/ which we intend to ship to
> Linus as soon as possible as short term fix. Then something like memcg
> can be considered on top since it seems this makes most sense from a
> usability point.
> 
> Thanks a lot,
> Daniel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v3 0/3] Rlimit for module space
  2018-10-25  1:01             ` Edgecombe, Rick P
                                 ` (3 preceding siblings ...)
  (?)
@ 2018-10-25 14:18               ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2018-10-25 14:18 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: daniel, jeyu, ard.biesheuvel, linux-fsdevel, linux-kernel, jannh,
	keescook, arjan, netdev, tglx, linux-mips, linux-s390, x86,
	kristen, Dock, Deneen T, catalin.marinas, mingo, will.deacon,
	kernel-hardening, bp, Hansen, Dave, linux-arm-kernel, davem,
	linux-arch, arnd, sparclinux, alexei.starovoitov

On Thu 25-10-18 01:01:44, Edgecombe, Rick P wrote:
[...]
> FWIW, cgroups seems like a better solution than rlimit to me too. Maybe you all
> already know, but it looks like the cgroups vmalloc charge is done in the main
> page allocator and counts against the whole kernel memory limit.

I am not sure I understand what you are saying, but let me clarify that
vmalloc memory is accounted against the memory cgroup associated with the
user context calling vmalloc. All you need to do is add __GFP_ACCOUNT
to the gfp mask. The only challenge here is the charged memory's life
cycle. When does it get deallocated? In the worst case the memory is not
tied to any user context, and as such it doesn't get deallocated by
killing all processes, which could lead to memcg limit exhaustion.
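As a sketch only (kernel-side code, not compilable standalone; the exact
__vmalloc_node_range() signature and module_alloc() internals vary by
architecture and kernel version), charging these allocations to the caller's
memcg would amount to adding the flag at the allocation site:

```c
/* Hypothetical x86-style module_alloc() with memcg accounting added:
 * the only change versus the usual version is __GFP_ACCOUNT in the
 * gfp mask, which charges the pages to the calling task's memcg. */
void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, MODULE_ALIGN,
				    MODULES_VADDR, MODULES_END,
				    GFP_KERNEL | __GFP_ACCOUNT,
				    PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
				    __builtin_return_address(0));
}
```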

> A user may want
> to have a higher kernel limit than the module space size, so it seems it isn't
> enough by itself and some new limit would need to be added.

If there is another restriction on this memory then you are right.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2018-10-25 14:31 UTC | newest]

Thread overview: 47+ messages
2018-10-19 20:47 [PATCH RFC v3 0/3] Rlimit for module space Rick Edgecombe
2018-10-19 20:47 ` [PATCH v3 1/3] modules: Create arch versions of module alloc/free Rick Edgecombe
2018-10-19 20:47 ` [PATCH v3 2/3] modules: Create rlimit for module space Rick Edgecombe
2018-10-19 20:47 ` [PATCH v3 3/3] bpf: Add system wide BPF JIT limit Rick Edgecombe
2018-10-20 17:20 ` [PATCH RFC v3 0/3] Rlimit for module space Ard Biesheuvel
2018-10-22 23:06   ` Edgecombe, Rick P
2018-10-23 11:54     ` Ard Biesheuvel
2018-10-24 15:07       ` Jessica Yu
2018-10-24 16:04         ` Daniel Borkmann
2018-10-25  1:01           ` Edgecombe, Rick P
2018-10-25 14:18             ` Michal Hocko