* [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check
@ 2019-09-11 18:18 Christian Barcenas
2019-09-13 18:48 ` Yonghong Song
2019-09-16 9:26 ` Daniel Borkmann
0 siblings, 2 replies; 5+ messages in thread
From: Christian Barcenas @ 2019-09-11 18:18 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, netdev
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Christian Barcenas, bpf

A process can lock memory addresses into physical RAM explicitly
(via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.

CAP_IPC_LOCK allows a process to exceed these limits, and throughout
the kernel this capability is checked before allowing/denying an attempt
to lock memory regions into RAM.

Because bpf locks its programs and maps into RAM, it should respect
CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
exceeded by a privileged process, which is contrary to documented
RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.

Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Christian Barcenas <christian@cbarcenas.com>
---
kernel/bpf/syscall.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 272071e9112f..e551961f364b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -183,8 +183,9 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 static int bpf_charge_memlock(struct user_struct *user, u32 pages)
 {
 	unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	unsigned long locked = atomic_long_add_return(pages, &user->locked_vm);
 
-	if (atomic_long_add_return(pages, &user->locked_vm) > memlock_limit) {
+	if (locked > memlock_limit && !capable(CAP_IPC_LOCK)) {
 		atomic_long_sub(pages, &user->locked_vm);
 		return -EPERM;
 	}
@@ -1231,7 +1232,7 @@ int __bpf_prog_charge(struct user_struct *user, u32 pages)
 
 	if (user) {
 		user_bufs = atomic_long_add_return(pages, &user->locked_vm);
-		if (user_bufs > memlock_limit) {
+		if (user_bufs > memlock_limit && !capable(CAP_IPC_LOCK)) {
 			atomic_long_sub(pages, &user->locked_vm);
 			return -EPERM;
 		}
--
2.23.0
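
The charge-then-roll-back logic in the two hunks above can be modeled in userspace roughly as follows. This is only a sketch, not kernel code: `struct user_model`, `charge_memlock_model()` and the `has_ipc_lock` flag are hypothetical stand-ins for `struct user_struct`, `bpf_charge_memlock()` and `capable(CAP_IPC_LOCK)`, and C11 atomics stand in for the kernel's `atomic_long_*` helpers.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct user_struct: only the counter we need. */
struct user_model {
	atomic_long locked_vm;	/* pages currently charged to this user */
};

/*
 * Charge 'pages' to 'user'. Fails with -EPERM only when the new total
 * exceeds 'limit_pages' AND the caller lacks the override capability
 * ('has_ipc_lock' models capable(CAP_IPC_LOCK)).
 */
static int charge_memlock_model(struct user_model *user, long pages,
				long limit_pages, bool has_ipc_lock)
{
	/* atomic_fetch_add() returns the old value; add 'pages' for the new total. */
	long locked = atomic_fetch_add(&user->locked_vm, pages) + pages;

	if (locked > limit_pages && !has_ipc_lock) {
		/* Roll back the optimistic charge before reporting failure. */
		atomic_fetch_sub(&user->locked_vm, pages);
		return -EPERM;
	}
	return 0;
}
```

Note that in this model a privileged caller is still accounted in `locked_vm`; only the limit comparison is overridden.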
* Re: [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check
2019-09-11 18:18 [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check Christian Barcenas
@ 2019-09-13 18:48 ` Yonghong Song
2019-09-16 9:26 ` Daniel Borkmann
1 sibling, 0 replies; 5+ messages in thread
From: Yonghong Song @ 2019-09-13 18:48 UTC (permalink / raw)
To: Christian Barcenas, Alexei Starovoitov, Daniel Borkmann, netdev
Cc: Martin Lau, Song Liu, bpf
On 9/11/19 7:18 PM, Christian Barcenas wrote:
> A process can lock memory addresses into physical RAM explicitly
> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
>
> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
> the kernel this capability is checked before allowing/denying an attempt
> to lock memory regions into RAM.
>
> Because bpf locks its programs and maps into RAM, it should respect
> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
> exceeded by a privileged process, which is contrary to documented
> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.
>
> Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
> Signed-off-by: Christian Barcenas <christian@cbarcenas.com>
Acked-by: Yonghong Song <yhs@fb.com>
> ---
> kernel/bpf/syscall.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 272071e9112f..e551961f364b 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -183,8 +183,9 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
> static int bpf_charge_memlock(struct user_struct *user, u32 pages)
> {
> unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + unsigned long locked = atomic_long_add_return(pages, &user->locked_vm);
>
> - if (atomic_long_add_return(pages, &user->locked_vm) > memlock_limit) {
> + if (locked > memlock_limit && !capable(CAP_IPC_LOCK)) {
> atomic_long_sub(pages, &user->locked_vm);
> return -EPERM;
> }
> @@ -1231,7 +1232,7 @@ int __bpf_prog_charge(struct user_struct *user, u32 pages)
>
> if (user) {
> user_bufs = atomic_long_add_return(pages, &user->locked_vm);
> - if (user_bufs > memlock_limit) {
> + if (user_bufs > memlock_limit && !capable(CAP_IPC_LOCK)) {
> atomic_long_sub(pages, &user->locked_vm);
> return -EPERM;
> }
>
* Re: [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check
2019-09-11 18:18 [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check Christian Barcenas
2019-09-13 18:48 ` Yonghong Song
@ 2019-09-16 9:26 ` Daniel Borkmann
2019-09-16 14:09 ` Christian Barcenas
1 sibling, 1 reply; 5+ messages in thread
From: Daniel Borkmann @ 2019-09-16 9:26 UTC (permalink / raw)
To: Christian Barcenas, Alexei Starovoitov, netdev
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, bpf
On 9/11/19 8:18 PM, Christian Barcenas wrote:
> A process can lock memory addresses into physical RAM explicitly
> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
>
> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
> the kernel this capability is checked before allowing/denying an attempt
> to lock memory regions into RAM.
>
> Because bpf locks its programs and maps into RAM, it should respect
> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
> exceeded by a privileged process, which is contrary to documented
> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.
Do you have a link/pointer where this is /clearly/ documented?
Uapi header is not overly clear ...
include/uapi/linux/capability.h says:
/* Allow locking of shared memory segments */
/* Allow mlock and mlockall (which doesn't really have anything to do
with IPC) */
#define CAP_IPC_LOCK 14
[...]
/* Override resource limits. Set resource limits. */
/* Override quota limits. */
/* Override reserved space on ext2 filesystem */
/* Modify data journaling mode on ext3 filesystem (uses journaling
resources) */
/* NOTE: ext2 honors fsuid when checking for resource overrides, so
you can override using fsuid too */
/* Override size restrictions on IPC message queues */
/* Allow more than 64hz interrupts from the real-time clock */
/* Override max number of consoles on console allocation */
/* Override max number of keymaps */
#define CAP_SYS_RESOURCE 24
... but my best guess is you are referring to `man 2 mlock`:
Limits and permissions
In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK)
in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines
a limit on how much memory the process may lock.
Since Linux 2.6.9, no limits are placed on the amount of memory that a
privileged process can lock and the RLIMIT_MEMLOCK soft resource limit
instead defines a limit on how much memory an unprivileged process may lock.
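
A minimal userspace sketch of the post-2.6.9 rule quoted above, under stated assumptions: `memlock_soft_limit()` and `lock_allowed()` are hypothetical helper names, and the `privileged` flag stands in for holding CAP_IPC_LOCK. Only `getrlimit()` and `RLIM_INFINITY` are real interfaces here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/* Read the soft RLIMIT_MEMLOCK in bytes; returns 0 on success, -1 on error. */
static int memlock_soft_limit(rlim_t *out)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	*out = rl.rlim_cur;
	return 0;
}

/*
 * Model of the rule: a privileged process is never limited; an
 * unprivileged one may lock memory only while the running total stays
 * within the soft limit.
 */
static bool lock_allowed(size_t already_locked, size_t request,
			 rlim_t soft_limit, bool privileged)
{
	if (privileged || soft_limit == RLIM_INFINITY)
		return true;
	if (already_locked > soft_limit)
		return false;
	/* Written as a subtraction so the comparison cannot overflow. */
	return request <= soft_limit - already_locked;
}
```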
Thanks,
Daniel
* Re: [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check
2019-09-16 9:26 ` Daniel Borkmann
@ 2019-09-16 14:09 ` Christian Barcenas
2019-09-16 22:19 ` Alexei Starovoitov
0 siblings, 1 reply; 5+ messages in thread
From: Christian Barcenas @ 2019-09-16 14:09 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov, netdev
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, bpf
> On 9/11/19 8:18 PM, Christian Barcenas wrote:
>> A process can lock memory addresses into physical RAM explicitly
>> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
>> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
>>
>> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
>> the kernel this capability is checked before allowing/denying an attempt
>> to lock memory regions into RAM.
>>
>> Because bpf locks its programs and maps into RAM, it should respect
>> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
>> exceeded by a privileged process, which is contrary to documented
>> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.
>
> Do you have a link/pointer where this is /clearly/ documented?
I admit that after submitting this patch, I did re-think the description
and thought maybe I should have described the CAP_IPC_LOCK behavior as
"expected" rather than "documented". :)
> ... but my best guess is you are referring to `man 2 mlock`:
>
> Limits and permissions
>
> In Linux 2.6.8 and earlier, a process must be privileged
> (CAP_IPC_LOCK)
> in order to lock memory and the RLIMIT_MEMLOCK soft resource
> limit defines
> a limit on how much memory the process may lock.
>
> Since Linux 2.6.9, no limits are placed on the amount of
> memory that a
> privileged process can lock and the RLIMIT_MEMLOCK soft resource
> limit
> instead defines a limit on how much memory an unprivileged
> process may lock.
Yes; this is what I was referring to by "documented
RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior."
Unfortunately - AFAICT - this is the most explicit documentation about
CAP_IPC_LOCK's permission set, but it is incomplete.
I believe it can be understood from other references to RLIMIT_MEMLOCK
and CAP_IPC_LOCK throughout the kernel that "locking memory" refers not
only to the mlock/shmctl syscalls, but also to other code sites where
/physical/ memory addresses are allocated for userspace.
After identifying RLIMIT_MEMLOCK checks with
git grep -C3 '[^(get|set)]rlimit(RLIMIT_MEMLOCK'
we find that RLIMIT_MEMLOCK is bypassed - if CAP_IPC_LOCK is held - in
many locations that have nothing to do with the mlock or shm family of
syscalls. From what I can tell, every time RLIMIT_MEMLOCK is referenced
there is a neighboring check for CAP_IPC_LOCK that bypasses the rlimit,
or in some cases memory accounting entirely!
bpf() is currently the only exception to the above, i.e. as far as I can
tell it is the only code that enforces RLIMIT_MEMLOCK but does not honor
CAP_IPC_LOCK.
Selected examples follow:
In net/core/skbuff.c:
if (capable(CAP_IPC_LOCK) || !size)
	return 0;
num_pg = (size >> PAGE_SHIFT) + 2; /* worst case */
max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
user = mmp->user ? : current_user();
do {
	old_pg = atomic_long_read(&user->locked_vm);
	new_pg = old_pg + num_pg;
	if (new_pg > max_pg)
		return -ENOBUFS;
} while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
	 old_pg);
In net/xdp/xdp_umem.c:
if (capable(CAP_IPC_LOCK))
	return 0;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
umem->user = get_uid(current_user());
do {
	old_npgs = atomic_long_read(&umem->user->locked_vm);
	new_npgs = old_npgs + umem->npgs;
	if (new_npgs > lock_limit) {
		free_uid(umem->user);
		umem->user = NULL;
		return -ENOBUFS;
	}
} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
			     new_npgs) != old_npgs);
return 0;
In arch/x86/kvm/svm.c:
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
	pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit);
	return NULL;
}
In drivers/infiniband/core/umem.c (and other sites in Infiniband code):
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
	atomic64_sub(npages, &mm->pinned_vm);
	ret = -ENOMEM;
	goto out;
}
In drivers/vfio/vfio_iommu_type1.c, albeit in an indirect way:
struct vfio_dma {
	bool lock_cap; /* capable(CAP_IPC_LOCK) */
};
// ...
for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
     pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
	// ...
	if (!rsvd && !vfio_find_vpfn(dma, iova)) {
		if (!dma->lock_cap &&
		    current->mm->locked_vm + lock_acct + 1 > limit) {
			put_pfn(pfn, dma->prot);
			pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
				__func__, limit << PAGE_SHIFT);
			ret = -ENOMEM;
			goto unpin_out;
		}
		lock_acct++;
	}
}
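
The skbuff and xdp_umem excerpts above share one shape: return early when the capability is held (skipping accounting entirely), otherwise reserve pages with a compare-and-swap retry loop. A rough userspace model of that shape follows. It is a sketch only: `account_pages_model()` and `has_ipc_lock` are hypothetical names, and C11 `atomic_compare_exchange_weak()` stands in for the kernel's `atomic_long_cmpxchg()`.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Reserve 'num_pg' pages against 'locked_vm' without ever letting the
 * counter exceed 'max_pg'. 'has_ipc_lock' models capable(CAP_IPC_LOCK).
 */
static int account_pages_model(atomic_long *locked_vm, long num_pg,
			       long max_pg, bool has_ipc_lock)
{
	long old_pg, new_pg;

	/* Privileged callers are neither checked nor accounted. */
	if (has_ipc_lock || num_pg == 0)
		return 0;

	old_pg = atomic_load(locked_vm);
	do {
		new_pg = old_pg + num_pg;
		if (new_pg > max_pg)
			return -ENOBUFS;
		/* On failure, old_pg is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(locked_vm, &old_pg, new_pg));

	return 0;
}
```

Unlike an add-then-subtract scheme, the counter in this variant never transiently exceeds the limit, because the new value is published only after the check has passed.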
Best,
Christian
* Re: [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check
2019-09-16 14:09 ` Christian Barcenas
@ 2019-09-16 22:19 ` Alexei Starovoitov
0 siblings, 0 replies; 5+ messages in thread
From: Alexei Starovoitov @ 2019-09-16 22:19 UTC (permalink / raw)
To: Christian Barcenas
Cc: Daniel Borkmann, Alexei Starovoitov, netdev, Martin KaFai Lau,
Song Liu, Yonghong Song, bpf
On Mon, Sep 16, 2019 at 07:09:06AM -0700, Christian Barcenas wrote:
>
> bpf() is currently the only exception to the above, i.e. as far as I can tell
> it is the only code that enforces RLIMIT_MEMLOCK but does not honor
> CAP_IPC_LOCK.
Yes. Unlike other places in the kernel, bpf does not honor CAP_IPC_LOCK,
but we cannot change this anymore. User space is already using rlimit as an
enforcement mechanism. The bpf_rlimit.h hack we use in selftests is not a
universal way of loading bpf progs. If we make such a change, the root user
will become unlimited and rlimit enforcement will break.
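
For context, a loader that wants to avoid hitting the memlock limit typically lifts RLIMIT_MEMLOCK for itself before loading programs. The sketch below is in the spirit of the selftests' bpf_rlimit.h but is not a copy of that header; `raise_memlock_rlimit()` is a hypothetical name, and the fallback to the existing hard limit is an assumption of this example.

```c
#include <sys/resource.h>

/*
 * Try to lift RLIMIT_MEMLOCK for the calling process.
 * Returns 0 if the limit is now unlimited, 1 if only the soft limit
 * could be raised to the existing hard limit, and -1 on error.
 */
static int raise_memlock_rlimit(void)
{
	struct rlimit unlimited = {
		.rlim_cur = RLIM_INFINITY,
		.rlim_max = RLIM_INFINITY,
	};
	struct rlimit cur;

	if (setrlimit(RLIMIT_MEMLOCK, &unlimited) == 0)
		return 0;

	/* Raising the hard limit needs privilege; settle for soft = hard. */
	if (getrlimit(RLIMIT_MEMLOCK, &cur) != 0)
		return -1;
	cur.rlim_cur = cur.rlim_max;
	if (setrlimit(RLIMIT_MEMLOCK, &cur) != 0)
		return -1;
	return 1;
}
```

A deployment that instead sets a deliberately low RLIMIT_MEMLOCK to cap bpf memory for root-owned processes would lose that cap if CAP_IPC_LOCK bypassed the check, which is the concern raised above.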