netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 01/19] mm: Introduce vm_account
       [not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24  6:29   ` Christoph Hellwig
                     ` (2 more replies)
  2023-01-24  5:42 ` [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account Alistair Popple
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 15+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, linuxppc-dev, linux-fpga, linux-rdma,
	virtualization, kvm, netdev, io-uring, bpf, rds-devel,
	linux-kselftest

Kernel drivers that pin pages should account these pages against
either user->locked_vm or mm->pinned_vm and fail the pinning if
RLIMIT_MEMLOCK is exceeded and CAP_IPC_LOCK isn't held.

Currently drivers open-code this accounting and use various methods to
update the atomic variables and check against the limits leading to
various bugs and inconsistencies. To fix this introduce a standard
interface for charging pinned and locked memory. As this involves
taking references on kernel objects such as mm_struct or user_struct
we introduce a new vm_account struct to hold these references. Several
helper functions are then introduced to grab references and check
limits.

As the way these limits are charged and enforced is visible to
userspace we need to be careful not to break existing applications by
charging to different counters. As a result the vm_account functions
support accounting to different counters as required.

A future change will extend this to also account against a cgroup for
pinned pages.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-fpga@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: io-uring@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: bpf@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Cc: linux-kselftest@vger.kernel.org
---
 include/linux/mm_types.h | 87 ++++++++++++++++++++++++++++++++++++++++-
 mm/util.c                | 89 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 176 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9757067..7de2168 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1085,4 +1085,91 @@ enum fault_flag {
 
 typedef unsigned int __bitwise zap_flags_t;
 
+/**
+ * enum vm_account_flags - Determine how pinned/locked memory is accounted.
+ * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
+ * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
+ * @VM_ACCOUNT_USER: Accounnt locked memory to user->locked_vm.
+ *
+ * Determines which statistic pinned/locked memory is accounted
+ * against. All limits will be enforced against RLIMIT_MEMLOCK and the
+ * pins cgroup if CONFIG_CGROUP_PINS is enabled.
+ *
+ * New drivers should use VM_ACCOUNT_TASK. VM_ACCOUNT_USER is used by
+ * pre-existing drivers to maintain existing accounting against
+ * user->locked_mm rather than mm->pinned_mm.
+ *
+ * VM_ACCOUNT_BYPASS may also be specified to bypass rlimit
+ * checks. Typically this is used to cache CAP_IPC_LOCK from when a
+ * driver is first initialised. Note that this does not bypass cgroup
+ * limit checks.
+ */
+enum vm_account_flags {
+	VM_ACCOUNT_TASK = 0,
+	VM_ACCOUNT_BYPASS = 1,
+	VM_ACCOUNT_USER = 2,
+};
+
+struct vm_account {
+	struct task_struct *task;
+	union {
+		struct mm_struct *mm;
+		struct user_struct *user;
+	} a;
+	enum vm_account_flags flags;
+};
+
+/**
+ * vm_account_init - Initialise a new struct vm_account.
+ * @vm_account: pointer to uninitialised vm_account.
+ * @task: task to charge against.
+ * @user: user to charge against. Must be non-NULL for VM_ACCOUNT_USER.
+ * @flags: flags to use when charging to vm_account.
+ *
+ * Initialise a new uninitialiused struct vm_account. Takes references
+ * on the task/mm/user/cgroup as required although callers must ensure
+ * any references passed in remain valid for the duration of this
+ * call.
+ */
+void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
+		struct user_struct *user, enum vm_account_flags flags);
+/**
+ * vm_account_init_current - Initialise a new struct vm_account.
+ * @vm_account: pointer to uninitialised vm_account.
+ *
+ * Helper to initialise a vm_account for the common case of charging
+ * with VM_ACCOUNT_TASK against current.
+ */
+void vm_account_init_current(struct vm_account *vm_account);
+
+/**
+ * vm_account_release - Initialise a new struct vm_account.
+ * @vm_account: pointer to initialised vm_account.
+ *
+ * Drop any object references obtained by vm_account_init(). The
+ * vm_account must not be used after calling this unless reinitialised
+ * with vm_account_init().
+ */
+void vm_account_release(struct vm_account *vm_account);
+
+/**
+ * vm_account_pinned - Charge pinned or locked memory to the vm_account.
+ * @vm_account: pointer to an initialised vm_account.
+ * @npages: number of pages to charge.
+ *
+ * Return: 0 on success, -ENOMEM if a limit would be exceeded.
+ *
+ * Note: All pages must be explicitly uncharged with
+ * vm_unaccount_pinned() prior to releasing the vm_account with
+ * vm_account_release().
+ */
+int vm_account_pinned(struct vm_account *vm_account, unsigned long npages);
+
+/**
+ * vm_unaccount_pinned - Uncharge pinned or locked memory to the vm_account.
+ * @vm_account: pointer to an initialised vm_account.
+ * @npages: number of pages to uncharge.
+ */
+void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages);
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/mm/util.c b/mm/util.c
index b56c92f..af40b1e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -430,6 +430,95 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
 }
 #endif
 
+void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
+		struct user_struct *user, enum vm_account_flags flags)
+{
+	vm_account->task = get_task_struct(task);
+
+	if (flags & VM_ACCOUNT_USER) {
+		vm_account->a.user = get_uid(user);
+	} else {
+		mmgrab(task->mm);
+		vm_account->a.mm = task->mm;
+	}
+
+	vm_account->flags = flags;
+}
+EXPORT_SYMBOL_GPL(vm_account_init);
+
+void vm_account_init_current(struct vm_account *vm_account)
+{
+	vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
+}
+EXPORT_SYMBOL_GPL(vm_account_init_current);
+
+void vm_account_release(struct vm_account *vm_account)
+{
+	put_task_struct(vm_account->task);
+	if (vm_account->flags & VM_ACCOUNT_USER)
+		free_uid(vm_account->a.user);
+	else
+		mmdrop(vm_account->a.mm);
+}
+EXPORT_SYMBOL_GPL(vm_account_release);
+
+/*
+ * Charge pages with an atomic compare and swap. Returns -ENOMEM on
+ * failure, 1 on success and 0 for retry.
+ */
+static int vm_account_cmpxchg(struct vm_account *vm_account,
+				unsigned long npages, unsigned long lock_limit)
+{
+	u64 cur_pages, new_pages;
+
+	if (vm_account->flags & VM_ACCOUNT_USER)
+		cur_pages = atomic_long_read(&vm_account->a.user->locked_vm);
+	else
+		cur_pages = atomic64_read(&vm_account->a.mm->pinned_vm);
+
+	new_pages = cur_pages + npages;
+	if (lock_limit != RLIM_INFINITY && new_pages > lock_limit)
+		return -ENOMEM;
+
+	if (vm_account->flags & VM_ACCOUNT_USER) {
+		return atomic_long_cmpxchg(&vm_account->a.user->locked_vm,
+					   cur_pages, new_pages) == cur_pages;
+	} else {
+		return atomic64_cmpxchg(&vm_account->a.mm->pinned_vm,
+					   cur_pages, new_pages) == cur_pages;
+	}
+}
+
+int vm_account_pinned(struct vm_account *vm_account, unsigned long npages)
+{
+	unsigned long lock_limit = RLIM_INFINITY;
+	int ret;
+
+	if (!(vm_account->flags & VM_ACCOUNT_BYPASS) && !capable(CAP_IPC_LOCK))
+		lock_limit = task_rlimit(vm_account->task,
+					RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	while (true) {
+		ret = vm_account_cmpxchg(vm_account, npages, lock_limit);
+		if (ret > 0)
+			break;
+		else if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vm_account_pinned);
+
+void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages)
+{
+	if (vm_account->flags & VM_ACCOUNT_USER)
+		atomic_long_sub(npages, &vm_account->a.user->locked_vm);
+	else
+		atomic64_sub(npages, &vm_account->a.mm->pinned_vm);
+}
+EXPORT_SYMBOL_GPL(vm_unaccount_pinned);
+
 /**
  * __account_locked_vm - account locked pages to an mm's locked_vm
  * @mm:          mm to account against
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account
       [not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
  2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24  5:55   ` Michael S. Tsirkin
  2023-01-24 14:34   ` Jason Gunthorpe
  2023-01-24  5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
  2023-01-24  5:42 ` [RFC PATCH 11/19] xdp: convert to use vm_account Alistair Popple
  3 siblings, 2 replies; 15+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Michael S. Tsirkin, Jason Wang, kvm,
	virtualization, netdev

Convert vhost to use the new vm_account structure and associated
account_pinned_vm() functions.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/vhost/vdpa.c  |  9 +++++----
 drivers/vhost/vhost.c |  2 ++
 drivers/vhost/vhost.h |  1 +
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ec32f78..a31dd53 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -716,7 +716,7 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 				set_page_dirty_lock(page);
 			unpin_user_page(page);
 		}
-		atomic64_sub(PFN_DOWN(map->size), &dev->mm->pinned_vm);
+		vm_unaccount_pinned(&dev->vm_account, PFN_DOWN(map->size));
 		vhost_vdpa_general_unmap(v, map, asid);
 		vhost_iotlb_map_free(iotlb, map);
 	}
@@ -780,6 +780,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 	u32 asid = iotlb_to_asid(iotlb);
 	int r = 0;
 
+	if (!vdpa->use_va)
+		if (vm_account_pinned(&dev->vm_account, PFN_DOWN(size)))
+			return -ENOMEM;
+
 	r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
 				      pa, perm, opaque);
 	if (r)
@@ -799,9 +803,6 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 		return r;
 	}
 
-	if (!vdpa->use_va)
-		atomic64_add(PFN_DOWN(size), &dev->mm->pinned_vm);
-
 	return 0;
 }
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cbe72bf..5645c26 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -556,6 +556,7 @@ static void vhost_attach_mm(struct vhost_dev *dev)
 		dev->mm = current->mm;
 		mmgrab(dev->mm);
 	}
+	vm_account_init_current(&dev->vm_account);
 }
 
 static void vhost_detach_mm(struct vhost_dev *dev)
@@ -569,6 +570,7 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 		mmdrop(dev->mm);
 
 	dev->mm = NULL;
+	vm_account_release(&dev->vm_account);
 }
 
 /* Caller should have device mutex */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d910910..3a9aed8 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -144,6 +144,7 @@ struct vhost_msg_node {
 struct vhost_dev {
 	struct mm_struct *mm;
 	struct mutex mutex;
+	struct vm_account vm_account;
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 10/19] net: skb: Switch to using vm_account
       [not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
  2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
  2023-01-24  5:42 ` [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24 14:51   ` Jason Gunthorpe
  2023-01-24  5:42 ` [RFC PATCH 11/19] xdp: convert to use vm_account Alistair Popple
  3 siblings, 1 reply; 15+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, netdev, linux-rdma, rds-devel

Switch to using vm_account to charge pinned pages. This will allow a
future change to charge the pinned pages to a cgroup to limit the
overall number of pinned pages in the system.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
---
 include/linux/skbuff.h |  6 ++---
 include/net/sock.h     |  2 ++-
 net/core/skbuff.c      | 47 +++++++++++++++----------------------------
 net/rds/message.c      |  9 +++++---
 4 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4c84924..c956405 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -554,7 +554,6 @@ struct ubuf_info_msgzc {
 	};
 
 	struct mmpin {
-		struct user_struct *user;
 		unsigned int num_pg;
 	} mmp;
 };
@@ -563,8 +562,9 @@ struct ubuf_info_msgzc {
 #define uarg_to_msgzc(ubuf_ptr)	container_of((ubuf_ptr), struct ubuf_info_msgzc, \
 					     ubuf)
 
-int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
-void mm_unaccount_pinned_pages(struct mmpin *mmp);
+int mm_account_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp,
+			size_t size);
+void mm_unaccount_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp);
 
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
diff --git a/include/net/sock.h b/include/net/sock.h
index dcd72e6..bc3a868 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -334,6 +334,7 @@ struct sk_filter;
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
   *	@sk_cgrp_data: cgroup data for this cgroup
+  *	@sk_vm_account: data for pinned memory accounting
   *	@sk_memcg: this socket's memory cgroup association
   *	@sk_write_pending: a write to stream socket waits to start
   *	@sk_state_change: callback to indicate change in the state of the sock
@@ -523,6 +524,7 @@ struct sock {
 	void			*sk_security;
 #endif
 	struct sock_cgroup_data	sk_cgrp_data;
+	struct vm_account       sk_vm_account;
 	struct mem_cgroup	*sk_memcg;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4a0eb55..bed3fc9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1309,42 +1309,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
-int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+int mm_account_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp,
+			    size_t size)
 {
-	unsigned long max_pg, num_pg, new_pg, old_pg;
-	struct user_struct *user;
-
-	if (capable(CAP_IPC_LOCK) || !size)
-		return 0;
+	unsigned int num_pg;
 
 	num_pg = (size >> PAGE_SHIFT) + 2;	/* worst case */
-	max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	user = mmp->user ? : current_user();
+	if (vm_account_pinned(vm_account, num_pg))
+		return -ENOBUFS;
 
-	old_pg = atomic_long_read(&user->locked_vm);
-	do {
-		new_pg = old_pg + num_pg;
-		if (new_pg > max_pg)
-			return -ENOBUFS;
-	} while (!atomic_long_try_cmpxchg(&user->locked_vm, &old_pg, new_pg));
-
-	if (!mmp->user) {
-		mmp->user = get_uid(user);
-		mmp->num_pg = num_pg;
-	} else {
-		mmp->num_pg += num_pg;
-	}
+	mmp->num_pg += num_pg;
 
 	return 0;
 }
 EXPORT_SYMBOL_GPL(mm_account_pinned_pages);
 
-void mm_unaccount_pinned_pages(struct mmpin *mmp)
+void mm_unaccount_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp)
 {
-	if (mmp->user) {
-		atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
-		free_uid(mmp->user);
-	}
+	vm_unaccount_pinned(vm_account, mmp->num_pg);
+	vm_account_release(vm_account);
 }
 EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
 
@@ -1361,9 +1344,12 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 
 	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
 	uarg = (void *)skb->cb;
-	uarg->mmp.user = NULL;
+	uarg->mmp.num_pg = 0;
+	vm_account_init(&sk->sk_vm_account, current,
+			current_user(), VM_ACCOUNT_USER);
 
-	if (mm_account_pinned_pages(&uarg->mmp, size)) {
+	if (mm_account_pinned_pages(&sk->sk_vm_account, &uarg->mmp, size)) {
+		vm_account_release(&sk->sk_vm_account);
 		kfree_skb(skb);
 		return NULL;
 	}
@@ -1416,7 +1402,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 
 		next = (u32)atomic_read(&sk->sk_zckey);
 		if ((u32)(uarg_zc->id + uarg_zc->len) == next) {
-			if (mm_account_pinned_pages(&uarg_zc->mmp, size))
+			if (mm_account_pinned_pages(&sk->sk_vm_account,
+						    &uarg_zc->mmp, size))
 				return NULL;
 			uarg_zc->len++;
 			uarg_zc->bytelen = bytelen;
@@ -1466,7 +1453,7 @@ static void __msg_zerocopy_callback(struct ubuf_info_msgzc *uarg)
 	u32 lo, hi;
 	u16 len;
 
-	mm_unaccount_pinned_pages(&uarg->mmp);
+	mm_unaccount_pinned_pages(&sk->sk_vm_account, &uarg->mmp);
 
 	/* if !len, there was only 1 call, and it was aborted
 	 * so do not queue a completion notification
diff --git a/net/rds/message.c b/net/rds/message.c
index b47e4f0..2138a70 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
 	struct list_head *head;
 	unsigned long flags;
 
-	mm_unaccount_pinned_pages(&znotif->z_mmp);
+	mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
 	q = &rs->rs_zcookie_queue;
 	spin_lock_irqsave(&q->lock, flags);
 	head = &q->zcookie_head;
@@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 	int ret = 0;
 	int length = iov_iter_count(from);
 	struct rds_msg_zcopy_info *info;
+	struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
 
 	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
 
@@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 		return -ENOMEM;
 	INIT_LIST_HEAD(&info->rs_zcookie_next);
 	rm->data.op_mmp_znotifier = &info->znotif;
-	if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
+	vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
+	if (mm_account_pinned_pages(vm_account,
+				    &rm->data.op_mmp_znotifier->z_mmp,
 				    length)) {
 		ret = -ENOMEM;
 		goto err;
@@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 			for (i = 0; i < rm->data.op_nents; i++)
 				put_page(sg_page(&rm->data.op_sg[i]));
 			mmp = &rm->data.op_mmp_znotifier->z_mmp;
-			mm_unaccount_pinned_pages(mmp);
+			mm_unaccount_pinned_pages(vm_account, mmp);
 			ret = -EFAULT;
 			goto err;
 		}
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 11/19] xdp: convert to use vm_account
       [not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
                   ` (2 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  3 siblings, 0 replies; 15+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf

Switch to using the new vm_account struct to charge pinned pages and
enforce the rlimit. This will allow a future change to also charge a
cgroup for limiting the number of pinned pages.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/net/xdp_sock.h |  2 +-
 net/xdp/xdp_umem.c     | 38 +++++++++++++-------------------------
 2 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 3057e1a..b0d3c16 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -25,7 +25,7 @@ struct xdp_umem {
 	u32 chunk_size;
 	u32 chunks;
 	u32 npgs;
-	struct user_struct *user;
+	struct vm_account vm_account;
 	refcount_t users;
 	u8 flags;
 	bool zc;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 4681e8e..4b5fb2f 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -29,12 +29,10 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 	umem->pgs = NULL;
 }
 
-static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
+static void xdp_umem_unaccount_pages(struct xdp_umem *umem, u32 npgs)
 {
-	if (umem->user) {
-		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
-		free_uid(umem->user);
-	}
+	vm_unaccount_pinned(&umem->vm_account, npgs);
+	vm_account_release(&umem->vm_account);
 }
 
 static void xdp_umem_addr_unmap(struct xdp_umem *umem)
@@ -54,13 +52,15 @@ static int xdp_umem_addr_map(struct xdp_umem *umem, struct page **pages,
 
 static void xdp_umem_release(struct xdp_umem *umem)
 {
+	u32 npgs = umem->npgs;
+
 	umem->zc = false;
 	ida_free(&umem_ida, umem->id);
 
 	xdp_umem_addr_unmap(umem);
 	xdp_umem_unpin_pages(umem);
 
-	xdp_umem_unaccount_pages(umem);
+	xdp_umem_unaccount_pages(umem, npgs);
 	kfree(umem);
 }
 
@@ -127,24 +127,13 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 
 static int xdp_umem_account_pages(struct xdp_umem *umem)
 {
-	unsigned long lock_limit, new_npgs, old_npgs;
-
-	if (capable(CAP_IPC_LOCK))
-		return 0;
-
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	umem->user = get_uid(current_user());
+	vm_account_init(&umem->vm_account, current,
+			current_user(), VM_ACCOUNT_USER);
+	if (vm_account_pinned(&umem->vm_account, umem->npgs)) {
+		vm_account_release(&umem->vm_account);
+		return -ENOBUFS;
+	}
 
-	do {
-		old_npgs = atomic_long_read(&umem->user->locked_vm);
-		new_npgs = old_npgs + umem->npgs;
-		if (new_npgs > lock_limit) {
-			free_uid(umem->user);
-			umem->user = NULL;
-			return -ENOBUFS;
-		}
-	} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
-				     new_npgs) != old_npgs);
 	return 0;
 }
 
@@ -204,7 +193,6 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	umem->chunks = chunks;
 	umem->npgs = (u32)npgs;
 	umem->pgs = NULL;
-	umem->user = NULL;
 	umem->flags = mr->flags;
 
 	INIT_LIST_HEAD(&umem->xsk_dma_list);
@@ -227,7 +215,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 out_unpin:
 	xdp_umem_unpin_pages(umem);
 out_account:
-	xdp_umem_unaccount_pages(umem);
+	xdp_umem_unaccount_pages(umem, npgs);
 	return err;
 }
 
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account
  2023-01-24  5:42 ` [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account Alistair Popple
@ 2023-01-24  5:55   ` Michael S. Tsirkin
  2023-01-30 10:43     ` Alistair Popple
  2023-01-24 14:34   ` Jason Gunthorpe
  1 sibling, 1 reply; 15+ messages in thread
From: Michael S. Tsirkin @ 2023-01-24  5:55 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel, Jason Wang, kvm, virtualization,
	netdev

On Tue, Jan 24, 2023 at 04:42:31PM +1100, Alistair Popple wrote:
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ec32f78..a31dd53 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c

...

> @@ -780,6 +780,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>  	u32 asid = iotlb_to_asid(iotlb);
>  	int r = 0;
>  
> +	if (!vdpa->use_va)
> +		if (vm_account_pinned(&dev->vm_account, PFN_DOWN(size)))
> +			return -ENOMEM;
> +
>  	r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
>  				      pa, perm, opaque);
>  	if (r)

I suspect some error handling will have to be reworked then, no?

> -- 
> git-series 0.9.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 01/19] mm: Introduce vm_account
  2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
@ 2023-01-24  6:29   ` Christoph Hellwig
  2023-01-24 14:32   ` Jason Gunthorpe
  2023-01-31 14:00   ` David Hildenbrand
  2 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2023-01-24  6:29 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel, linuxppc-dev, linux-fpga,
	linux-rdma, virtualization, kvm, netdev, io-uring, bpf,
	rds-devel, linux-kselftest

> +/**
> + * vm_account_init - Initialise a new struct vm_account.
> + * @vm_account: pointer to uninitialised vm_account.
> + * @task: task to charge against.
> + * @user: user to charge against. Must be non-NULL for VM_ACCOUNT_USER.
> + * @flags: flags to use when charging to vm_account.
> + *
> + * Initialise a new uninitialiused struct vm_account. Takes references
> + * on the task/mm/user/cgroup as required although callers must ensure
> + * any references passed in remain valid for the duration of this
> + * call.
> + */
> +void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
> +		struct user_struct *user, enum vm_account_flags flags);


kerneldoc comments are supposed to be next to the implementation, and
not the declaration in the header.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 01/19] mm: Introduce vm_account
  2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
  2023-01-24  6:29   ` Christoph Hellwig
@ 2023-01-24 14:32   ` Jason Gunthorpe
  2023-01-30 11:36     ` Alistair Popple
  2023-01-31 14:00   ` David Hildenbrand
  2 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:32 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, linuxppc-dev, linux-fpga, linux-rdma,
	virtualization, kvm, netdev, io-uring, bpf, rds-devel,
	linux-kselftest

On Tue, Jan 24, 2023 at 04:42:30PM +1100, Alistair Popple wrote:
> +/**
> + * enum vm_account_flags - Determine how pinned/locked memory is accounted.
> + * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
> + * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
> + * @VM_ACCOUNT_USER: Accounnt locked memory to user->locked_vm.
> + *
> + * Determines which statistic pinned/locked memory is accounted
> + * against. All limits will be enforced against RLIMIT_MEMLOCK and the
> + * pins cgroup if CONFIG_CGROUP_PINS is enabled.
> + *
> + * New drivers should use VM_ACCOUNT_TASK. VM_ACCOUNT_USER is used by
> + * pre-existing drivers to maintain existing accounting against
> + * user->locked_mm rather than mm->pinned_mm.

I thought the guidance was the opposite of this, it is the newer
places in the kernel that are using VM_ACCOUNT_USER?

I haven't got to the rest of the patches yet, but isn't there also a
mm->pinned_vm vs mm->locked_vm variation in the current drivers as
well?

> +void vm_account_init_current(struct vm_account *vm_account)
> +{
> +	vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
> +}
> +EXPORT_SYMBOL_GPL(vm_account_init_current);

This can probably just be a static inline

You might consider putting all this in some new vm_account.h - given
how rarely it is used? Compile times and all

Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account
  2023-01-24  5:42 ` [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account Alistair Popple
  2023-01-24  5:55   ` Michael S. Tsirkin
@ 2023-01-24 14:34   ` Jason Gunthorpe
  1 sibling, 0 replies; 15+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:34 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Michael S. Tsirkin, Jason Wang, kvm,
	virtualization, netdev

On Tue, Jan 24, 2023 at 04:42:31PM +1100, Alistair Popple wrote:

> @@ -799,9 +803,6 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>  		return r;
>  	}
>  
> -	if (!vdpa->use_va)
> -		atomic64_add(PFN_DOWN(size), &dev->mm->pinned_vm);

Mention in the commit message this fixes a "bug" where vhost didn't
respect the limits

Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
  2023-01-24  5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
@ 2023-01-24 14:51   ` Jason Gunthorpe
  2023-01-30 11:17     ` Alistair Popple
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel

On Tue, Jan 24, 2023 at 04:42:39PM +1100, Alistair Popple wrote:
> diff --git a/include/net/sock.h b/include/net/sock.h
> index dcd72e6..bc3a868 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -334,6 +334,7 @@ struct sk_filter;
>    *	@sk_security: used by security modules
>    *	@sk_mark: generic packet mark
>    *	@sk_cgrp_data: cgroup data for this cgroup
> +  *	@sk_vm_account: data for pinned memory accounting
>    *	@sk_memcg: this socket's memory cgroup association
>    *	@sk_write_pending: a write to stream socket waits to start
>    *	@sk_state_change: callback to indicate change in the state of the sock
> @@ -523,6 +524,7 @@ struct sock {
>  	void			*sk_security;
>  #endif
>  	struct sock_cgroup_data	sk_cgrp_data;
> +	struct vm_account       sk_vm_account;
>  	struct mem_cgroup	*sk_memcg;
>  	void			(*sk_state_change)(struct sock *sk);
>  	void			(*sk_data_ready)(struct sock *sk);

I'm not sure this makes sense in a sock - each sock can be shared with
different proceses..

> diff --git a/net/rds/message.c b/net/rds/message.c
> index b47e4f0..2138a70 100644
> --- a/net/rds/message.c
> +++ b/net/rds/message.c
> @@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>  	struct list_head *head;
>  	unsigned long flags;
>  
> -	mm_unaccount_pinned_pages(&znotif->z_mmp);
> +	mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
>  	q = &rs->rs_zcookie_queue;
>  	spin_lock_irqsave(&q->lock, flags);
>  	head = &q->zcookie_head;
> @@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>  	int ret = 0;
>  	int length = iov_iter_count(from);
>  	struct rds_msg_zcopy_info *info;
> +	struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
>  
>  	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>  
> @@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>  		return -ENOMEM;
>  	INIT_LIST_HEAD(&info->rs_zcookie_next);
>  	rm->data.op_mmp_znotifier = &info->znotif;
> -	if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
> +	vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
> +	if (mm_account_pinned_pages(vm_account,
> +				    &rm->data.op_mmp_znotifier->z_mmp,
>  				    length)) {
>  		ret = -ENOMEM;
>  		goto err;
> @@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>  			for (i = 0; i < rm->data.op_nents; i++)
>  				put_page(sg_page(&rm->data.op_sg[i]));
>  			mmp = &rm->data.op_mmp_znotifier->z_mmp;
> -			mm_unaccount_pinned_pages(mmp);
> +			mm_unaccount_pinned_pages(vm_account, mmp);
>  			ret = -EFAULT;
>  			goto err;
>  		}

I wonder if RDS should just not be doing accounting? Usually things
related to iov_iter are short term and we don't account for them.

But then I don't really know how RDS works, Santos?

Regardless, maybe the vm_account should be stored in the
rds_msg_zcopy_info ?

Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account
  2023-01-24  5:55   ` Michael S. Tsirkin
@ 2023-01-30 10:43     ` Alistair Popple
  0 siblings, 0 replies; 15+ messages in thread
From: Alistair Popple @ 2023-01-30 10:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel, Jason Wang, kvm, virtualization,
	netdev


"Michael S. Tsirkin" <mst@redhat.com> writes:

> On Tue, Jan 24, 2023 at 04:42:31PM +1100, Alistair Popple wrote:
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index ec32f78..a31dd53 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>
> ...
>
>> @@ -780,6 +780,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>>  	u32 asid = iotlb_to_asid(iotlb);
>>  	int r = 0;
>>  
>> +	if (!vdpa->use_va)
>> +		if (vm_account_pinned(&dev->vm_account, PFN_DOWN(size)))
>> +			return -ENOMEM;
>> +
>>  	r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
>>  				      pa, perm, opaque);
>>  	if (r)
>
> I suspect some error handling will have to be reworked then, no?

Thanks. I had meant to go back and double check some of these driver
conversions. Will add something like below:

@@ -787,7 +787,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 	r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
 				      pa, perm, opaque);
 	if (r)
-		return r;
+		goto out_unaccount;
 
 	if (ops->dma_map) {
 		r = ops->dma_map(vdpa, asid, iova, size, pa, perm, opaque);
@@ -798,12 +798,14 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 		r = iommu_map(v->domain, iova, pa, size,
 			      perm_to_iommu_flags(perm));
 	}
-	if (r) {
+	if (r)
 		vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
-		return r;
-	}
 
-	return 0;
+out_unaccount:
+	if (!vdpa->use_va)
+		vm_unaccount_pinned(&dev->vm_account, PFN_DOWN(size));
+
+	return r;
 }
 
 static void vhost_vdpa_unmap(struct vhost_vdpa *v,

>> -- 
>> git-series 0.9.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
  2023-01-24 14:51   ` Jason Gunthorpe
@ 2023-01-30 11:17     ` Alistair Popple
  2023-02-06  4:36       ` Alistair Popple
  0 siblings, 1 reply; 15+ messages in thread
From: Alistair Popple @ 2023-01-30 11:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel


Jason Gunthorpe <jgg@nvidia.com> writes:

> On Tue, Jan 24, 2023 at 04:42:39PM +1100, Alistair Popple wrote:
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index dcd72e6..bc3a868 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -334,6 +334,7 @@ struct sk_filter;
>>    *	@sk_security: used by security modules
>>    *	@sk_mark: generic packet mark
>>    *	@sk_cgrp_data: cgroup data for this cgroup
>> +  *	@sk_vm_account: data for pinned memory accounting
>>    *	@sk_memcg: this socket's memory cgroup association
>>    *	@sk_write_pending: a write to stream socket waits to start
>>    *	@sk_state_change: callback to indicate change in the state of the sock
>> @@ -523,6 +524,7 @@ struct sock {
>>  	void			*sk_security;
>>  #endif
>>  	struct sock_cgroup_data	sk_cgrp_data;
>> +	struct vm_account       sk_vm_account;
>>  	struct mem_cgroup	*sk_memcg;
>>  	void			(*sk_state_change)(struct sock *sk);
>>  	void			(*sk_data_ready)(struct sock *sk);
>
> I'm not sure this makes sense in a sock - each sock can be shared with
> different proceses..

TBH it didn't feel right to me either so was hoping for some
feedback. Will try your suggestion below.

>> diff --git a/net/rds/message.c b/net/rds/message.c
>> index b47e4f0..2138a70 100644
>> --- a/net/rds/message.c
>> +++ b/net/rds/message.c
>> @@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>>  	struct list_head *head;
>>  	unsigned long flags;
>>  
>> -	mm_unaccount_pinned_pages(&znotif->z_mmp);
>> +	mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
>>  	q = &rs->rs_zcookie_queue;
>>  	spin_lock_irqsave(&q->lock, flags);
>>  	head = &q->zcookie_head;
>> @@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>  	int ret = 0;
>>  	int length = iov_iter_count(from);
>>  	struct rds_msg_zcopy_info *info;
>> +	struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
>>  
>>  	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>>  
>> @@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>  		return -ENOMEM;
>>  	INIT_LIST_HEAD(&info->rs_zcookie_next);
>>  	rm->data.op_mmp_znotifier = &info->znotif;
>> -	if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
>> +	vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
>> +	if (mm_account_pinned_pages(vm_account,
>> +				    &rm->data.op_mmp_znotifier->z_mmp,
>>  				    length)) {
>>  		ret = -ENOMEM;
>>  		goto err;
>> @@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>  			for (i = 0; i < rm->data.op_nents; i++)
>>  				put_page(sg_page(&rm->data.op_sg[i]));
>>  			mmp = &rm->data.op_mmp_znotifier->z_mmp;
>> -			mm_unaccount_pinned_pages(mmp);
>> +			mm_unaccount_pinned_pages(vm_account, mmp);
>>  			ret = -EFAULT;
>>  			goto err;
>>  		}
>
> I wonder if RDS should just not be doing accounting? Usually things
> related to iov_iter are short term and we don't account for them.

Yeah, I couldn't easily figure out why these were accounted for in the
first place either.

> But then I don't really know how RDS works, Santos?
>
> Regardless, maybe the vm_account should be stored in the
> rds_msg_zcopy_info ?

On first glance that looks like a better spot. Thanks for the
idea.

> Jason


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 01/19] mm: Introduce vm_account
  2023-01-24 14:32   ` Jason Gunthorpe
@ 2023-01-30 11:36     ` Alistair Popple
  0 siblings, 0 replies; 15+ messages in thread
From: Alistair Popple @ 2023-01-30 11:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, linuxppc-dev, linux-fpga, linux-rdma,
	virtualization, kvm, netdev, io-uring, bpf, rds-devel,
	linux-kselftest


Jason Gunthorpe <jgg@nvidia.com> writes:

> On Tue, Jan 24, 2023 at 04:42:30PM +1100, Alistair Popple wrote:
>> +/**
>> + * enum vm_account_flags - Determine how pinned/locked memory is accounted.
>> + * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
>> + * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
>> + * @VM_ACCOUNT_USER: Accounnt locked memory to user->locked_vm.
>> + *
>> + * Determines which statistic pinned/locked memory is accounted
>> + * against. All limits will be enforced against RLIMIT_MEMLOCK and the
>> + * pins cgroup if CONFIG_CGROUP_PINS is enabled.
>> + *
>> + * New drivers should use VM_ACCOUNT_TASK. VM_ACCOUNT_USER is used by
>> + * pre-existing drivers to maintain existing accounting against
>> + * user->locked_mm rather than mm->pinned_mm.
>
> I thought the guidance was the opposite of this, it is the newer
> places in the kernel that are using VM_ACCOUNT_USER?

I'd just assumed mm->pinned_vm was preferred because that's what most
drivers use. user->locked_mm does seem more sensible though as at least
it's possible to meaningfully enforce some overall limit. Will switch
the flags/comment around to suggest new users use VM_ACCOUNT_USER.

> I haven't got to the rest of the patches yet, but isn't there also a
> mm->pinned_vm vs mm->locked_vm variation in the current drivers as
> well?
>
>> +void vm_account_init_current(struct vm_account *vm_account)
>> +{
>> +	vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
>> +}
>> +EXPORT_SYMBOL_GPL(vm_account_init_current);
>
> This can probably just be a static inline
>
> You might consider putting all this in some new vm_account.h - given
> how rarely it is used? Compile times and all

Works for me.

> Jason


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 01/19] mm: Introduce vm_account
  2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
  2023-01-24  6:29   ` Christoph Hellwig
  2023-01-24 14:32   ` Jason Gunthorpe
@ 2023-01-31 14:00   ` David Hildenbrand
  2 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2023-01-31 14:00 UTC (permalink / raw)
  To: Alistair Popple, linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, linuxppc-dev, linux-fpga, linux-rdma, virtualization,
	kvm, netdev, io-uring, bpf, rds-devel, linux-kselftest

On 24.01.23 06:42, Alistair Popple wrote:
> Kernel drivers that pin pages should account these pages against
> either user->locked_vm or mm->pinned_vm and fail the pinning if
> RLIMIT_MEMLOCK is exceeded and CAP_IPC_LOCK isn't held.
> 
> Currently drivers open-code this accounting and use various methods to
> update the atomic variables and check against the limits leading to
> various bugs and inconsistencies. To fix this introduce a standard
> interface for charging pinned and locked memory. As this involves
> taking references on kernel objects such as mm_struct or user_struct
> we introduce a new vm_account struct to hold these references. Several
> helper functions are then introduced to grab references and check
> limits.
> 
> As the way these limits are charged and enforced is visible to
> userspace we need to be careful not to break existing applications by
> charging to different counters. As a result the vm_account functions
> support accounting to different counters as required.
> 
> A future change will extend this to also account against a cgroup for
> pinned pages.

The term "vm_account" is misleading, no? VM_ACCOUNT is for accounting 
towards the commit limit ....

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
  2023-01-30 11:17     ` Alistair Popple
@ 2023-02-06  4:36       ` Alistair Popple
  2023-02-06 13:14         ` Jason Gunthorpe
  0 siblings, 1 reply; 15+ messages in thread
From: Alistair Popple @ 2023-02-06  4:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel


Alistair Popple <apopple@nvidia.com> writes:

> Jason Gunthorpe <jgg@nvidia.com> writes:
>
>> On Tue, Jan 24, 2023 at 04:42:39PM +1100, Alistair Popple wrote:
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index dcd72e6..bc3a868 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -334,6 +334,7 @@ struct sk_filter;
>>>    *	@sk_security: used by security modules
>>>    *	@sk_mark: generic packet mark
>>>    *	@sk_cgrp_data: cgroup data for this cgroup
>>> +  *	@sk_vm_account: data for pinned memory accounting
>>>    *	@sk_memcg: this socket's memory cgroup association
>>>    *	@sk_write_pending: a write to stream socket waits to start
>>>    *	@sk_state_change: callback to indicate change in the state of the sock
>>> @@ -523,6 +524,7 @@ struct sock {
>>>  	void			*sk_security;
>>>  #endif
>>>  	struct sock_cgroup_data	sk_cgrp_data;
>>> +	struct vm_account       sk_vm_account;
>>>  	struct mem_cgroup	*sk_memcg;
>>>  	void			(*sk_state_change)(struct sock *sk);
>>>  	void			(*sk_data_ready)(struct sock *sk);
>>
>> I'm not sure this makes sense in a sock - each sock can be shared with
>> different proceses..
>
> TBH it didn't feel right to me either so was hoping for some
> feedback. Will try your suggestion below.
>
>>> diff --git a/net/rds/message.c b/net/rds/message.c
>>> index b47e4f0..2138a70 100644
>>> --- a/net/rds/message.c
>>> +++ b/net/rds/message.c
>>> @@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>>>  	struct list_head *head;
>>>  	unsigned long flags;
>>>  
>>> -	mm_unaccount_pinned_pages(&znotif->z_mmp);
>>> +	mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
>>>  	q = &rs->rs_zcookie_queue;
>>>  	spin_lock_irqsave(&q->lock, flags);
>>>  	head = &q->zcookie_head;
>>> @@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>>  	int ret = 0;
>>>  	int length = iov_iter_count(from);
>>>  	struct rds_msg_zcopy_info *info;
>>> +	struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
>>>  
>>>  	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>>>  
>>> @@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>>  		return -ENOMEM;
>>>  	INIT_LIST_HEAD(&info->rs_zcookie_next);
>>>  	rm->data.op_mmp_znotifier = &info->znotif;
>>> -	if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
>>> +	vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
>>> +	if (mm_account_pinned_pages(vm_account,
>>> +				    &rm->data.op_mmp_znotifier->z_mmp,
>>>  				    length)) {
>>>  		ret = -ENOMEM;
>>>  		goto err;
>>> @@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>>  			for (i = 0; i < rm->data.op_nents; i++)
>>>  				put_page(sg_page(&rm->data.op_sg[i]));
>>>  			mmp = &rm->data.op_mmp_znotifier->z_mmp;
>>> -			mm_unaccount_pinned_pages(mmp);
>>> +			mm_unaccount_pinned_pages(vm_account, mmp);
>>>  			ret = -EFAULT;
>>>  			goto err;
>>>  		}
>>
>> I wonder if RDS should just not be doing accounting? Usually things
>> related to iov_iter are short term and we don't account for them.
>
> Yeah, I couldn't easily figure out why these were accounted for in the
> first place either.
>
>> But then I don't really know how RDS works, Santos?
>>
>> Regardless, maybe the vm_account should be stored in the
>> rds_msg_zcopy_info ?
>
> On first glance that looks like a better spot. Thanks for the
> idea.

That works fine for RDS but not for skbuff. We still need a vm_account
in the struct sock or somewhere else for that. For example in
msg_zerocopy_realloc() we only have a struct ubuf_info_msgzc
available. We can't add a struct vm_account field to that because
ultimately it is stored in struct sk_buff->ck[] which is not large
enough to contain ubuf_info_msgzc + vm_account.

I'm not terribly familiar with kernel networking code though, so happy
to hear other suggestions.

>> Jason


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
  2023-02-06  4:36       ` Alistair Popple
@ 2023-02-06 13:14         ` Jason Gunthorpe
  0 siblings, 0 replies; 15+ messages in thread
From: Jason Gunthorpe @ 2023-02-06 13:14 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel

On Mon, Feb 06, 2023 at 03:36:49PM +1100, Alistair Popple wrote:

> >> But then I don't really know how RDS works, Santos?
> >>
> >> Regardless, maybe the vm_account should be stored in the
> >> rds_msg_zcopy_info ?
> >
> > On first glance that looks like a better spot. Thanks for the
> > idea.
> 
> That works fine for RDS but not for skbuff. 

I would definately put the RDS stuff like that..

> We still need a vm_account in the struct sock or somewhere else for
> that. For example in msg_zerocopy_realloc() we only have a struct
> ubuf_info_msgzc available. We can't add a struct vm_account field to
> that because ultimately it is stored in struct sk_buff->ck[] which
> is not large enough to contain ubuf_info_msgzc + vm_account.

Well, AFAICT this is using iov_iter to get the pages and in general
iov_iter - eg as used for O_DIRECT - doesn't charge anything.

If this does somehow allow a userspace to hold pin a page for a long
time then it is already technically wrong because it doesn't use
FOLL_LONGTERM.

Arguably FOLL_LONGTERM should be the key precondition to require
accounting.

So I wonder if it should just be deleted?

Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-02-06 13:14 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
2023-01-24  6:29   ` Christoph Hellwig
2023-01-24 14:32   ` Jason Gunthorpe
2023-01-30 11:36     ` Alistair Popple
2023-01-31 14:00   ` David Hildenbrand
2023-01-24  5:42 ` [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account Alistair Popple
2023-01-24  5:55   ` Michael S. Tsirkin
2023-01-30 10:43     ` Alistair Popple
2023-01-24 14:34   ` Jason Gunthorpe
2023-01-24  5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
2023-01-24 14:51   ` Jason Gunthorpe
2023-01-30 11:17     ` Alistair Popple
2023-02-06  4:36       ` Alistair Popple
2023-02-06 13:14         ` Jason Gunthorpe
2023-01-24  5:42 ` [RFC PATCH 11/19] xdp: convert to use vm_account Alistair Popple

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).