* [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple

Having large amounts of unmovable or unreclaimable memory in a system
can lead to system instability by increasing the likelihood of
encountering out-of-memory conditions. Therefore it is desirable to
limit the amount of memory users can lock or pin.

From userspace such limits can be enforced by setting
RLIMIT_MEMLOCK. However, there is no standard method that drivers and
other in-kernel users can use to check and enforce this limit.

This has led to a large number of inconsistencies in how limits are
enforced. For example some drivers will use mm->locked_vm while others
will use mm->pinned_vm or user->locked_vm. It is therefore possible to
have up to three times RLIMIT_MEMLOCK pinned.

Having pinned memory limited per-task also makes it easy for users to
exceed the limit. For example, memory pinned by a driver with
pin_user_pages() tends to remain pinned after fork. To deal with this
and other issues this series introduces a cgroup for tracking and
limiting the number of pages pinned or locked by tasks in the group.

However the existing behaviour with regard to the rlimit needs to be
maintained. Therefore the lesser of the two limits is
enforced. Furthermore, holding CAP_IPC_LOCK usually bypasses the
rlimit, but this bypass is not allowed for the cgroup.
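
To make that concrete, the charge path behaves roughly like the
following sketch (simplified from patch 01; the cgroup charge helper
name is illustrative only, not the actual function added later in the
series):

	if (!(flags & VM_ACCOUNT_BYPASS) && !capable(CAP_IPC_LOCK))
		lock_limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	/* CAP_IPC_LOCK bypasses the rlimit check... */
	if (lock_limit != RLIM_INFINITY && pinned + npages > lock_limit)
		return -ENOMEM;

	/* ...but the pins cgroup limit is always enforced. */
	if (pins_cgroup_try_charge(npages))	/* illustrative name */
		return -ENOMEM;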

The first part of this series converts existing drivers which
open-code the use of locked_vm/pinned_vm over to a common interface
which manages the refcounts of the associated task/mm/user
structs. This ensures accounting of pages is consistent and makes it
easier to add charging of the cgroup.

The second part of the series adds the cgroup and converts core mm
code such as mlock over to charging the cgroup before finally
introducing some selftests.
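
As a rough before/after illustration of the conversion (the "before"
is the open-coded pattern from the usnic driver converted later in
the series; foo is an illustrative stand-in for a driver's own state):

	/* Before */
	locked = atomic64_add_return(npages, &current->mm->pinned_vm);
	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
		atomic64_sub(npages, &current->mm->pinned_vm);
		return -ENOMEM;
	}

	/* After */
	vm_account_init_current(&foo->vm_account);
	if (vm_account_pinned(&foo->vm_account, npages))
		return -ENOMEM;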

As I don't have access to systems with all the various devices I
haven't been able to test all driver changes. Any help there would be
appreciated.

Alistair Popple (19):
  mm: Introduce vm_account
  drivers/vhost: Convert to use vm_account
  drivers/vdpa: Convert vdpa to use the new vm_account structure
  infiniband/umem: Convert to use vm_account
  RDMA/siw: Convert to use vm_account
  RDMA/usnic: convert to use vm_account
  vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm
  vfio/spapr_tce: Convert accounting to pinned_vm
  io_uring: convert to use vm_account
  net: skb: Switch to using vm_account
  xdp: convert to use vm_account
  kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned()
  fpga: dfl: afu: convert to use vm_account
  mm: Introduce a cgroup for pinned memory
  mm/util: Extend vm_account to charge pages against the pin cgroup
  mm/util: Refactor account_locked_vm
  mm: Convert mmap and mlock to use account_locked_vm
  mm/mmap: Charge locked memory to pins cgroup
  selftests/vm: Add pins-cgroup selftest for mlock/mmap

 MAINTAINERS                              |   8 +-
 arch/powerpc/kvm/book3s_64_vio.c         |  10 +-
 arch/powerpc/mm/book3s64/iommu_api.c     |  29 +--
 drivers/fpga/dfl-afu-dma-region.c        |  11 +-
 drivers/fpga/dfl-afu.h                   |   1 +-
 drivers/infiniband/core/umem.c           |  16 +-
 drivers/infiniband/core/umem_odp.c       |   6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c |  13 +-
 drivers/infiniband/hw/usnic/usnic_uiom.h |   1 +-
 drivers/infiniband/sw/siw/siw.h          |   2 +-
 drivers/infiniband/sw/siw/siw_mem.c      |  20 +--
 drivers/infiniband/sw/siw/siw_verbs.c    |  15 +-
 drivers/vdpa/vdpa_user/vduse_dev.c       |  20 +--
 drivers/vfio/vfio_iommu_spapr_tce.c      |  15 +-
 drivers/vfio/vfio_iommu_type1.c          |  59 +----
 drivers/vhost/vdpa.c                     |   9 +-
 drivers/vhost/vhost.c                    |   2 +-
 drivers/vhost/vhost.h                    |   1 +-
 include/linux/cgroup.h                   |  20 ++-
 include/linux/cgroup_subsys.h            |   4 +-
 include/linux/io_uring_types.h           |   3 +-
 include/linux/kvm_host.h                 |   1 +-
 include/linux/mm.h                       |   5 +-
 include/linux/mm_types.h                 |  88 ++++++++-
 include/linux/skbuff.h                   |   6 +-
 include/net/sock.h                       |   2 +-
 include/net/xdp_sock.h                   |   2 +-
 include/rdma/ib_umem.h                   |   1 +-
 io_uring/io_uring.c                      |  20 +--
 io_uring/notif.c                         |   4 +-
 io_uring/notif.h                         |  10 +-
 io_uring/rsrc.c                          |  38 +---
 io_uring/rsrc.h                          |   9 +-
 mm/Kconfig                               |  11 +-
 mm/Makefile                              |   1 +-
 mm/internal.h                            |   2 +-
 mm/mlock.c                               |  76 +------
 mm/mmap.c                                |  76 +++----
 mm/mremap.c                              |  54 +++--
 mm/pins_cgroup.c                         | 273 ++++++++++++++++++++++++-
 mm/secretmem.c                           |   6 +-
 mm/util.c                                | 196 +++++++++++++++--
 net/core/skbuff.c                        |  47 +---
 net/rds/message.c                        |   9 +-
 net/xdp/xdp_umem.c                       |  38 +--
 tools/testing/selftests/vm/Makefile      |   1 +-
 tools/testing/selftests/vm/pins-cgroup.c | 271 ++++++++++++++++++++++++-
 virt/kvm/kvm_main.c                      |   3 +-
 48 files changed, 1114 insertions(+), 401 deletions(-)
 create mode 100644 mm/pins_cgroup.c
 create mode 100644 tools/testing/selftests/vm/pins-cgroup.c

base-commit: 2241ab53cbb5cdb08a6b2d4688feb13971058f65
-- 
git-series 0.9.1


* [RFC PATCH 01/19] mm: Introduce vm_account
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, linuxppc-dev, linux-fpga, linux-rdma,
	virtualization, kvm, netdev, io-uring, bpf, rds-devel,
	linux-kselftest

Kernel drivers that pin pages should account these pages against
either user->locked_vm or mm->pinned_vm and fail the pinning if
RLIMIT_MEMLOCK is exceeded and CAP_IPC_LOCK isn't held.

Currently drivers open-code this accounting and use various methods to
update the atomic variables and check against the limits, leading to
various bugs and inconsistencies. To fix this, introduce a standard
interface for charging pinned and locked memory. As this involves
taking references on kernel objects such as mm_struct or user_struct,
we introduce a new vm_account struct to hold these references. Several
helper functions are then introduced to grab references and check
limits.

As the way these limits are charged and enforced is visible to
userspace we need to be careful not to break existing applications by
charging to different counters. As a result the vm_account functions
support accounting to different counters as required.

A future change will extend this to also account against a cgroup for
pinned pages.
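
For example, a driver charging pins on behalf of the current task is
expected to use the interface along these lines (a minimal sketch;
real users keep the vm_account in a longer-lived structure):

	struct vm_account vm_account;

	vm_account_init_current(&vm_account);
	if (vm_account_pinned(&vm_account, npages)) {
		vm_account_release(&vm_account);
		return -ENOMEM;
	}

	/* ... pin_user_pages() and use the memory ... */

	vm_unaccount_pinned(&vm_account, npages);
	vm_account_release(&vm_account);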

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-fpga@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: io-uring@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: bpf@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Cc: linux-kselftest@vger.kernel.org
---
 include/linux/mm_types.h | 87 ++++++++++++++++++++++++++++++++++++++++-
 mm/util.c                | 89 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 176 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9757067..7de2168 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1085,4 +1085,91 @@ enum fault_flag {
 
 typedef unsigned int __bitwise zap_flags_t;
 
+/**
+ * enum vm_account_flags - Determine how pinned/locked memory is accounted.
+ * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
+ * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
+ * @VM_ACCOUNT_USER: Account locked memory to user->locked_vm.
+ *
+ * Determines which counter pinned/locked memory is accounted
+ * against. All charges are checked against RLIMIT_MEMLOCK, and also
+ * against the pins cgroup if CONFIG_CGROUP_PINS is enabled.
+ *
+ * New drivers should use VM_ACCOUNT_TASK. VM_ACCOUNT_USER is used by
+ * pre-existing drivers to maintain existing accounting against
+ * user->locked_vm rather than mm->pinned_vm.
+ *
+ * VM_ACCOUNT_BYPASS may also be specified to bypass rlimit
+ * checks. Typically this is used to cache CAP_IPC_LOCK from when a
+ * driver is first initialised. Note that this does not bypass cgroup
+ * limit checks.
+ */
+enum vm_account_flags {
+	VM_ACCOUNT_TASK = 0,
+	VM_ACCOUNT_BYPASS = 1,
+	VM_ACCOUNT_USER = 2,
+};
+
+struct vm_account {
+	struct task_struct *task;
+	union {
+		struct mm_struct *mm;
+		struct user_struct *user;
+	} a;
+	enum vm_account_flags flags;
+};
+
+/**
+ * vm_account_init - Initialise a new struct vm_account.
+ * @vm_account: pointer to uninitialised vm_account.
+ * @task: task to charge against.
+ * @user: user to charge against. Must be non-NULL for VM_ACCOUNT_USER.
+ * @flags: flags to use when charging to vm_account.
+ *
+ * Initialise a new uninitialised struct vm_account. Takes references
+ * on the task/mm/user/cgroup as required although callers must ensure
+ * any references passed in remain valid for the duration of this
+ * call.
+ */
+void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
+		struct user_struct *user, enum vm_account_flags flags);
+/**
+ * vm_account_init_current - Initialise a new struct vm_account.
+ * @vm_account: pointer to uninitialised vm_account.
+ *
+ * Helper to initialise a vm_account for the common case of charging
+ * with VM_ACCOUNT_TASK against current.
+ */
+void vm_account_init_current(struct vm_account *vm_account);
+
+/**
+ * vm_account_release - Release a struct vm_account.
+ * @vm_account: pointer to initialised vm_account.
+ *
+ * Drop any object references obtained by vm_account_init(). The
+ * vm_account must not be used after calling this unless reinitialised
+ * with vm_account_init().
+ */
+void vm_account_release(struct vm_account *vm_account);
+
+/**
+ * vm_account_pinned - Charge pinned or locked memory to the vm_account.
+ * @vm_account: pointer to an initialised vm_account.
+ * @npages: number of pages to charge.
+ *
+ * Return: 0 on success, -ENOMEM if a limit would be exceeded.
+ *
+ * Note: All pages must be explicitly uncharged with
+ * vm_unaccount_pinned() prior to releasing the vm_account with
+ * vm_account_release().
+ */
+int vm_account_pinned(struct vm_account *vm_account, unsigned long npages);
+
+/**
+ * vm_unaccount_pinned - Uncharge pinned or locked memory from the vm_account.
+ * @vm_account: pointer to an initialised vm_account.
+ * @npages: number of pages to uncharge.
+ */
+void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages);
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/mm/util.c b/mm/util.c
index b56c92f..af40b1e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -430,6 +430,95 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
 }
 #endif
 
+void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
+		struct user_struct *user, enum vm_account_flags flags)
+{
+	vm_account->task = get_task_struct(task);
+
+	if (flags & VM_ACCOUNT_USER) {
+		vm_account->a.user = get_uid(user);
+	} else {
+		mmgrab(task->mm);
+		vm_account->a.mm = task->mm;
+	}
+
+	vm_account->flags = flags;
+}
+EXPORT_SYMBOL_GPL(vm_account_init);
+
+void vm_account_init_current(struct vm_account *vm_account)
+{
+	vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
+}
+EXPORT_SYMBOL_GPL(vm_account_init_current);
+
+void vm_account_release(struct vm_account *vm_account)
+{
+	put_task_struct(vm_account->task);
+	if (vm_account->flags & VM_ACCOUNT_USER)
+		free_uid(vm_account->a.user);
+	else
+		mmdrop(vm_account->a.mm);
+}
+EXPORT_SYMBOL_GPL(vm_account_release);
+
+/*
+ * Charge pages with an atomic compare and swap. Returns -ENOMEM on
+ * failure, 1 on success and 0 for retry.
+ */
+static int vm_account_cmpxchg(struct vm_account *vm_account,
+				unsigned long npages, unsigned long lock_limit)
+{
+	u64 cur_pages, new_pages;
+
+	if (vm_account->flags & VM_ACCOUNT_USER)
+		cur_pages = atomic_long_read(&vm_account->a.user->locked_vm);
+	else
+		cur_pages = atomic64_read(&vm_account->a.mm->pinned_vm);
+
+	new_pages = cur_pages + npages;
+	if (lock_limit != RLIM_INFINITY && new_pages > lock_limit)
+		return -ENOMEM;
+
+	if (vm_account->flags & VM_ACCOUNT_USER) {
+		return atomic_long_cmpxchg(&vm_account->a.user->locked_vm,
+					   cur_pages, new_pages) == cur_pages;
+	} else {
+		return atomic64_cmpxchg(&vm_account->a.mm->pinned_vm,
+					   cur_pages, new_pages) == cur_pages;
+	}
+}
+
+int vm_account_pinned(struct vm_account *vm_account, unsigned long npages)
+{
+	unsigned long lock_limit = RLIM_INFINITY;
+	int ret;
+
+	if (!(vm_account->flags & VM_ACCOUNT_BYPASS) && !capable(CAP_IPC_LOCK))
+		lock_limit = task_rlimit(vm_account->task,
+					RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	while (true) {
+		ret = vm_account_cmpxchg(vm_account, npages, lock_limit);
+		if (ret > 0)
+			break;
+		else if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vm_account_pinned);
+
+void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages)
+{
+	if (vm_account->flags & VM_ACCOUNT_USER)
+		atomic_long_sub(npages, &vm_account->a.user->locked_vm);
+	else
+		atomic64_sub(npages, &vm_account->a.mm->pinned_vm);
+}
+EXPORT_SYMBOL_GPL(vm_unaccount_pinned);
+
 /**
  * __account_locked_vm - account locked pages to an mm's locked_vm
  * @mm:          mm to account against
-- 
git-series 0.9.1


* [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Michael S. Tsirkin, Jason Wang, kvm,
	virtualization, netdev

Convert vhost to use the new vm_account structure and associated
vm_account_pinned() functions.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/vhost/vdpa.c  |  9 +++++----
 drivers/vhost/vhost.c |  2 ++
 drivers/vhost/vhost.h |  1 +
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ec32f78..a31dd53 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -716,7 +716,7 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 				set_page_dirty_lock(page);
 			unpin_user_page(page);
 		}
-		atomic64_sub(PFN_DOWN(map->size), &dev->mm->pinned_vm);
+		vm_unaccount_pinned(&dev->vm_account, PFN_DOWN(map->size));
 		vhost_vdpa_general_unmap(v, map, asid);
 		vhost_iotlb_map_free(iotlb, map);
 	}
@@ -780,6 +780,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 	u32 asid = iotlb_to_asid(iotlb);
 	int r = 0;
 
+	if (!vdpa->use_va)
+		if (vm_account_pinned(&dev->vm_account, PFN_DOWN(size)))
+			return -ENOMEM;
+
 	r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
 				      pa, perm, opaque);
 	if (r)
@@ -799,9 +803,6 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 		return r;
 	}
 
-	if (!vdpa->use_va)
-		atomic64_add(PFN_DOWN(size), &dev->mm->pinned_vm);
-
 	return 0;
 }
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cbe72bf..5645c26 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -556,6 +556,7 @@ static void vhost_attach_mm(struct vhost_dev *dev)
 		dev->mm = current->mm;
 		mmgrab(dev->mm);
 	}
+	vm_account_init_current(&dev->vm_account);
 }
 
 static void vhost_detach_mm(struct vhost_dev *dev)
@@ -569,6 +570,7 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 		mmdrop(dev->mm);
 
 	dev->mm = NULL;
+	vm_account_release(&dev->vm_account);
 }
 
 /* Caller should have device mutex */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d910910..3a9aed8 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -144,6 +144,7 @@ struct vhost_msg_node {
 struct vhost_dev {
 	struct mm_struct *mm;
 	struct mutex mutex;
+	struct vm_account vm_account;
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-- 
git-series 0.9.1


* [RFC PATCH 03/19] drivers/vdpa: Convert vdpa to use the new vm_account structure
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Michael S. Tsirkin, Jason Wang,
	virtualization

Convert vdpa to use the new vm_account structure and associated
vm_account_pinned() functions.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0c3b486..bd87b58 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -70,7 +70,7 @@ struct vduse_umem {
 	unsigned long iova;
 	unsigned long npages;
 	struct page **pages;
-	struct mm_struct *mm;
+	struct vm_account vm_account;
 };
 
 struct vduse_dev {
@@ -950,8 +950,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
 	vduse_domain_remove_user_bounce_pages(dev->domain);
 	unpin_user_pages_dirty_lock(dev->umem->pages,
 				    dev->umem->npages, true);
-	atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
-	mmdrop(dev->umem->mm);
+	vm_unaccount_pinned(&dev->umem->vm_account, dev->umem->npages);
 	vfree(dev->umem->pages);
 	kfree(dev->umem);
 	dev->umem = NULL;
@@ -967,7 +966,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 	struct page **page_list = NULL;
 	struct vduse_umem *umem = NULL;
 	long pinned = 0;
-	unsigned long npages, lock_limit;
+	unsigned long npages;
 	int ret;
 
 	if (!dev->domain->bounce_map ||
@@ -990,8 +989,8 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 
 	mmap_read_lock(current->mm);
 
-	lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
-	if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
+	vm_account_init_current(&umem->vm_account);
+	if (vm_account_pinned(&umem->vm_account, npages))
 		goto out;
 
 	pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
@@ -1006,22 +1005,21 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
 	if (ret)
 		goto out;
 
-	atomic64_add(npages, &current->mm->pinned_vm);
-
 	umem->pages = page_list;
 	umem->npages = pinned;
 	umem->iova = iova;
-	umem->mm = current->mm;
-	mmgrab(current->mm);
 
 	dev->umem = umem;
 out:
-	if (ret && pinned > 0)
+	if (ret && pinned > 0) {
 		unpin_user_pages(page_list, pinned);
+		vm_unaccount_pinned(&umem->vm_account, npages);
+	}
 
 	mmap_read_unlock(current->mm);
 unlock:
 	if (ret) {
+		vm_account_release(&umem->vm_account);
 		vfree(page_list);
 		kfree(umem);
 	}
-- 
git-series 0.9.1


* [RFC PATCH 04/19] infiniband/umem: Convert to use vm_account
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Jason Gunthorpe, Leon Romanovsky,
	linux-rdma

Convert the infiniband core umem code to use the vm_account structure
so that pinned pages can be charged to the correct cgroup with
vm_account_pinned().

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/infiniband/core/umem.c     | 16 ++++++----------
 drivers/infiniband/core/umem_odp.c |  6 ++++++
 include/rdma/ib_umem.h             |  1 +
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 755a9c5..479b7f0 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -149,8 +149,6 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 {
 	struct ib_umem *umem;
 	struct page **page_list;
-	unsigned long lock_limit;
-	unsigned long new_pinned;
 	unsigned long cur_base;
 	unsigned long dma_attr = 0;
 	struct mm_struct *mm;
@@ -186,6 +184,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 	umem->writable   = ib_access_writable(access);
 	umem->owning_mm = mm = current->mm;
 	mmgrab(mm);
+	vm_account_init_current(&umem->vm_account);
 
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
 	if (!page_list) {
@@ -199,11 +198,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 		goto out;
 	}
 
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-	new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
-	if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
-		atomic64_sub(npages, &mm->pinned_vm);
+	if (vm_account_pinned(&umem->vm_account, npages)) {
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -248,12 +243,13 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 
 umem_release:
 	__ib_umem_release(device, umem, 0);
-	atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
+	vm_unaccount_pinned(&umem->vm_account, ib_umem_num_pages(umem));
 out:
 	free_page((unsigned long) page_list);
 umem_kfree:
 	if (ret) {
 		mmdrop(umem->owning_mm);
+		vm_account_release(&umem->vm_account);
 		kfree(umem);
 	}
 	return ret ? ERR_PTR(ret) : umem;
@@ -275,8 +271,8 @@ void ib_umem_release(struct ib_umem *umem)
 
 	__ib_umem_release(umem->ibdev, umem, 1);
 
-	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
-	mmdrop(umem->owning_mm);
+	vm_unaccount_pinned(&umem->vm_account, ib_umem_num_pages(umem));
+	vm_account_release(&umem->vm_account);
 	kfree(umem);
 }
 EXPORT_SYMBOL(ib_umem_release);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e9fa22d..4fbca3e 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -130,6 +130,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
 	umem->ibdev = device;
 	umem->writable = ib_access_writable(access);
 	umem->owning_mm = current->mm;
+	vm_account_init_current(&umem->vm_account);
 	umem_odp->is_implicit_odp = 1;
 	umem_odp->page_shift = PAGE_SHIFT;
 
@@ -137,6 +138,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
 	ret = ib_init_umem_odp(umem_odp, NULL);
 	if (ret) {
 		put_pid(umem_odp->tgid);
+		vm_account_release(&umem->vm_account);
 		kfree(umem_odp);
 		return ERR_PTR(ret);
 	}
@@ -179,6 +181,7 @@ ib_umem_odp_alloc_child(struct ib_umem_odp *root, unsigned long addr,
 	umem->address    = addr;
 	umem->writable   = root->umem.writable;
 	umem->owning_mm  = root->umem.owning_mm;
+	umem->vm_account = root->umem.vm_account;
 	odp_data->page_shift = PAGE_SHIFT;
 	odp_data->notifier.ops = ops;
 
@@ -239,6 +242,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,
 	umem_odp->umem.address = addr;
 	umem_odp->umem.writable = ib_access_writable(access);
 	umem_odp->umem.owning_mm = current->mm;
+	vm_account_init_current(&umem_odp->umem.vm_account);
 	umem_odp->notifier.ops = ops;
 
 	umem_odp->page_shift = PAGE_SHIFT;
@@ -255,6 +259,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,
 
 err_put_pid:
 	put_pid(umem_odp->tgid);
+	vm_account_release(&umem_odp->umem.vm_account);
 	kfree(umem_odp);
 	return ERR_PTR(ret);
 }
@@ -278,6 +283,7 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 		kvfree(umem_odp->pfn_list);
 	}
 	put_pid(umem_odp->tgid);
+	vm_account_release(&umem_odp->umem.vm_account);
 	kfree(umem_odp);
 }
 EXPORT_SYMBOL(ib_umem_odp_release);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 92a673c..de51406 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -19,6 +19,7 @@ struct dma_buf_attach_ops;
 struct ib_umem {
 	struct ib_device       *ibdev;
 	struct mm_struct       *owning_mm;
+	struct vm_account vm_account;
 	u64 iova;
 	size_t			length;
 	unsigned long		address;
-- 
git-series 0.9.1


* [RFC PATCH 05/19] RDMA/siw: Convert to use vm_account
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Bernard Metzler, Jason Gunthorpe,
	Leon Romanovsky, linux-rdma

Convert to using a vm_account structure to account pinned memory to
both the mm and the pins cgroup.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Bernard Metzler <bmt@zurich.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/infiniband/sw/siw/siw.h       |  2 +-
 drivers/infiniband/sw/siw/siw_mem.c   | 20 ++++++--------------
 drivers/infiniband/sw/siw/siw_verbs.c | 15 ---------------
 3 files changed, 7 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw.h b/drivers/infiniband/sw/siw/siw.h
index 2f3a9cd..0c4a3ec 100644
--- a/drivers/infiniband/sw/siw/siw.h
+++ b/drivers/infiniband/sw/siw/siw.h
@@ -124,7 +124,7 @@ struct siw_umem {
 	int num_pages;
 	bool writable;
 	u64 fp_addr; /* First page base address */
-	struct mm_struct *owning_mm;
+	struct vm_account vm_account;
 };
 
 struct siw_pble {
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index b2b33dd..9c53fc3 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -68,7 +68,6 @@ static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
 
 void siw_umem_release(struct siw_umem *umem, bool dirty)
 {
-	struct mm_struct *mm_s = umem->owning_mm;
 	int i, num_pages = umem->num_pages;
 
 	for (i = 0; num_pages; i++) {
@@ -79,9 +78,9 @@ void siw_umem_release(struct siw_umem *umem, bool dirty)
 		kfree(umem->page_chunk[i].plist);
 		num_pages -= to_free;
 	}
-	atomic64_sub(umem->num_pages, &mm_s->pinned_vm);
+	vm_unaccount_pinned(&umem->vm_account, umem->num_pages);
+	vm_account_release(&umem->vm_account);
 
-	mmdrop(mm_s);
 	kfree(umem->page_chunk);
 	kfree(umem);
 }
@@ -365,9 +364,7 @@ struct siw_pbl *siw_pbl_alloc(u32 num_buf)
 struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 {
 	struct siw_umem *umem;
-	struct mm_struct *mm_s;
 	u64 first_page_va;
-	unsigned long mlock_limit;
 	unsigned int foll_flags = FOLL_LONGTERM;
 	int num_pages, num_chunks, i, rv = 0;
 
@@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 	if (!umem)
 		return ERR_PTR(-ENOMEM);
 
-	mm_s = current->mm;
-	umem->owning_mm = mm_s;
 	umem->writable = writable;
 
-	mmgrab(mm_s);
+	vm_account_init_current(&umem->vm_account);
 
 	if (writable)
 		foll_flags |= FOLL_WRITE;
 
-	mmap_read_lock(mm_s);
+	mmap_read_lock(current->mm);
 
-	mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-	if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
+	if (vm_account_pinned(&umem->vm_account, num_pages)) {
 		rv = -ENOMEM;
 		goto out_sem_up;
 	}
@@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 				goto out_sem_up;
 
 			umem->num_pages += rv;
-			atomic64_add(rv, &mm_s->pinned_vm);
 			first_page_va += rv * PAGE_SIZE;
 			nents -= rv;
 			got += rv;
@@ -437,7 +429,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 		num_pages -= got;
 	}
 out_sem_up:
-	mmap_read_unlock(mm_s);
+	mmap_read_unlock(current->mm);
 
 	if (rv > 0)
 		return umem;
diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
index 906fde1..8fab009 100644
--- a/drivers/infiniband/sw/siw/siw_verbs.c
+++ b/drivers/infiniband/sw/siw/siw_verbs.c
@@ -1321,8 +1321,6 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
 	struct siw_umem *umem = NULL;
 	struct siw_ureq_reg_mr ureq;
 	struct siw_device *sdev = to_siw_dev(pd->device);
-
-	unsigned long mem_limit = rlimit(RLIMIT_MEMLOCK);
 	int rv;
 
 	siw_dbg_pd(pd, "start: 0x%pK, va: 0x%pK, len: %llu\n",
@@ -1338,19 +1336,6 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
 		rv = -EINVAL;
 		goto err_out;
 	}
-	if (mem_limit != RLIM_INFINITY) {
-		unsigned long num_pages =
-			(PAGE_ALIGN(len + (start & ~PAGE_MASK))) >> PAGE_SHIFT;
-		mem_limit >>= PAGE_SHIFT;
-
-		if (num_pages > mem_limit - current->mm->locked_vm) {
-			siw_dbg_pd(pd, "pages req %lu, max %lu, lock %lu\n",
-				   num_pages, mem_limit,
-				   current->mm->locked_vm);
-			rv = -ENOMEM;
-			goto err_out;
-		}
-	}
 	umem = siw_umem_get(start, len, ib_access_writable(rights));
 	if (IS_ERR(umem)) {
 		rv = PTR_ERR(umem);
-- 
git-series 0.9.1


* [RFC PATCH 06/19] RDMA/usnic: convert to use vm_account
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Christian Benvenuti, Nelson Escobar,
	Jason Gunthorpe, Leon Romanovsky, linux-rdma

Convert to using a vm_account structure to account pinned memory to
both the mm and the pins cgroup.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Christian Benvenuti <benve@cisco.com>
Cc: Nelson Escobar <neescoba@cisco.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/infiniband/hw/usnic/usnic_uiom.c | 13 +++++--------
 drivers/infiniband/hw/usnic/usnic_uiom.h |  1 +
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index c301b3b..250276e 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -89,8 +89,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	struct page **page_list;
 	struct scatterlist *sg;
 	struct usnic_uiom_chunk *chunk;
-	unsigned long locked;
-	unsigned long lock_limit;
 	unsigned long cur_base;
 	unsigned long npages;
 	int ret;
@@ -123,10 +121,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	uiomr->owning_mm = mm = current->mm;
 	mmap_read_lock(mm);
 
-	locked = atomic64_add_return(npages, &current->mm->pinned_vm);
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+	vm_account_init_current(&uiomr->vm_account);
+	if (vm_account_pinned(&uiomr->vm_account, npages)) {
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -178,7 +174,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 out:
 	if (ret < 0) {
 		usnic_uiom_put_pages(chunk_list, 0);
-		atomic64_sub(npages, &current->mm->pinned_vm);
+		vm_unaccount_pinned(&uiomr->vm_account, npages);
+		vm_account_release(&uiomr->vm_account);
 	} else
 		mmgrab(uiomr->owning_mm);
 
@@ -430,7 +427,7 @@ void usnic_uiom_reg_release(struct usnic_uiom_reg *uiomr)
 {
 	__usnic_uiom_reg_release(uiomr->pd, uiomr, 1);
 
-	atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
+	vm_unaccount_pinned(&uiomr->vm_account, usnic_uiom_num_pages(uiomr));
 	__usnic_uiom_release_tail(uiomr);
 }
 
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.h b/drivers/infiniband/hw/usnic/usnic_uiom.h
index 5a9acf9..5c296a7 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.h
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.h
@@ -72,6 +72,7 @@ struct usnic_uiom_reg {
 	struct list_head		chunk_list;
 	struct work_struct		work;
 	struct mm_struct		*owning_mm;
+	struct vm_account               vm_account;
 };
 
 struct usnic_uiom_chunk {
-- 
git-series 0.9.1


* [RFC PATCH 07/19] vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Alex Williamson, Cornelia Huck, kvm

This switches the charge to pinned_vm to be consistent with other
drivers that pin pages with FOLL_LONGTERM. It also allows the use of
the vm_account helper struct, which makes it easier for a future
change to implement cgroup accounting of pinned pages as that
requires a reference to the cgroup to be maintained.
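
The CAP_IPC_LOCK bypass that was previously cached in dma->lock_cap
keeps its sampled-at-map-time semantics by being encoded in the
vm_account flags when the mapping is created, as in the hunk below:

	vm_account_init(&dma->vm_account, dma->task, NULL, VM_ACCOUNT_TASK |
		(capable(CAP_IPC_LOCK) ? VM_ACCOUNT_BYPASS : 0));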

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/vfio/vfio_iommu_type1.c | 59 +++++++++-------------------------
 1 file changed, 16 insertions(+), 43 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 23c24fe..828f6c7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -95,11 +95,11 @@ struct vfio_dma {
 	size_t			size;		/* Map size (bytes) */
 	int			prot;		/* IOMMU_READ/WRITE */
 	bool			iommu_mapped;
-	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
 	bool			vaddr_invalid;
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	struct vm_account       vm_account;
 };
 
 struct vfio_batch {
@@ -412,31 +412,6 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
 	return ret;
 }
 
-static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
-{
-	struct mm_struct *mm;
-	int ret;
-
-	if (!npage)
-		return 0;
-
-	mm = async ? get_task_mm(dma->task) : dma->task->mm;
-	if (!mm)
-		return -ESRCH; /* process exited */
-
-	ret = mmap_write_lock_killable(mm);
-	if (!ret) {
-		ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
-					  dma->lock_cap);
-		mmap_write_unlock(mm);
-	}
-
-	if (async)
-		mmput(mm);
-
-	return ret;
-}
-
 /*
  * Some mappings aren't backed by a struct page, for example an mmap'd
  * MMIO range for our own or another device.  These use a different
@@ -715,16 +690,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 			 * externally pinned pages are already counted against
 			 * the user.
 			 */
-			if (!rsvd && !vfio_find_vpfn(dma, iova)) {
-				if (!dma->lock_cap &&
-				    mm->locked_vm + lock_acct + 1 > limit) {
-					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
-						__func__, limit << PAGE_SHIFT);
-					ret = -ENOMEM;
-					goto unpin_out;
-				}
+			if (!rsvd && !vfio_find_vpfn(dma, iova))
 				lock_acct++;
-			}
 
 			pinned++;
 			npage--;
@@ -744,7 +711,11 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	}
 
 out:
-	ret = vfio_lock_acct(dma, lock_acct, false);
+	if (vm_account_pinned(&dma->vm_account, lock_acct)) {
+		ret = -ENOMEM;
+		lock_acct = 0;
+		pr_warn("%s: RLIMIT_MEMLOCK exceeded\n", __func__);
+	}
 
 unpin_out:
 	if (batch->size == 1 && !batch->offset) {
@@ -759,6 +730,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 				put_pfn(pfn, dma->prot);
 		}
 		vfio_batch_unpin(batch, dma);
+		vm_unaccount_pinned(&dma->vm_account, lock_acct);
 
 		return ret;
 	}
@@ -782,7 +754,7 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 	}
 
 	if (do_accounting)
-		vfio_lock_acct(dma, locked - unlocked, true);
+		vm_unaccount_pinned(&dma->vm_account, locked - unlocked);
 
 	return unlocked;
 }
@@ -805,7 +777,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	ret = 0;
 
 	if (do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
-		ret = vfio_lock_acct(dma, 1, true);
+		ret = vm_account_pinned(&dma->vm_account, 1);
 		if (ret) {
 			put_pfn(*pfn_base, dma->prot);
 			if (ret == -ENOMEM)
@@ -833,7 +805,7 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
 	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
 
 	if (do_accounting)
-		vfio_lock_acct(dma, -unlocked, true);
+		vm_unaccount_pinned(&dma->vm_account, unlocked);
 
 	return unlocked;
 }
@@ -921,7 +893,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn);
 		if (ret) {
 			if (put_pfn(phys_pfn, dma->prot) && do_accounting)
-				vfio_lock_acct(dma, -1, true);
+				vm_unaccount_pinned(&dma->vm_account, 1);
 			goto pin_unwind;
 		}
 
@@ -1162,7 +1134,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	}
 
 	if (do_accounting) {
-		vfio_lock_acct(dma, -unlocked, true);
+		vm_unaccount_pinned(&dma->vm_account, unlocked);
 		return 0;
 	}
 	return unlocked;
@@ -1674,7 +1646,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	 */
 	get_task_struct(current->group_leader);
 	dma->task = current->group_leader;
-	dma->lock_cap = capable(CAP_IPC_LOCK);
+	vm_account_init(&dma->vm_account, dma->task, NULL, VM_ACCOUNT_TASK |
+		(capable(CAP_IPC_LOCK) ? VM_ACCOUNT_BYPASS : 0));
 
 	dma->pfn_list = RB_ROOT;
 
@@ -2398,7 +2371,7 @@ static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
 			if (!is_invalid_reserved_pfn(vpfn->pfn))
 				locked++;
 		}
-		vfio_lock_acct(dma, locked - unlocked, true);
+		vm_unaccount_pinned(&dma->vm_account, locked - unlocked);
 	}
 }
 
-- 
git-series 0.9.1


* [RFC PATCH 08/19] vfio/spapr_tce: Convert accounting to pinned_vm
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Alex Williamson, Cornelia Huck,
	Alexey Kardashevskiy, linuxppc-dev, kvm

Convert from accounting pages against locked_vm to accounting them to
pinned_vm. This allows struct vm_account to be used to track the
mm_struct used to charge the pages. A future change also uses this to
track a cgroup for controlling pinned pages.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
 arch/powerpc/mm/book3s64/iommu_api.c | 29 ++++++++++++++++++-----------
 drivers/vfio/vfio_iommu_spapr_tce.c  | 15 ++++++++++-----
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 7fcfba1..6c57603 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -30,6 +30,7 @@ struct mm_iommu_table_group_mem_t {
 	unsigned long used;
 	atomic64_t mapped;
 	unsigned int pageshift;
+	struct vm_account vm_account;
 	u64 ua;			/* userspace address */
 	u64 entries;		/* number of entries in hpas/hpages[] */
 	/*
@@ -62,20 +63,24 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	unsigned int pageshift;
 	unsigned long entry, chunk;
 
-	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
-		ret = account_locked_vm(mm, entries, true);
-		if (ret)
-			return ret;
-
-		locked_entries = entries;
-	}
-
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem) {
 		ret = -ENOMEM;
 		goto unlock_exit;
 	}
 
+	vm_account_init_current(&mem->vm_account);
+	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
+		ret = vm_account_pinned(&mem->vm_account, entries);
+		if (ret) {
+			vm_account_release(&mem->vm_account);
+			kfree(mem);
+			return ret;
+		}
+
+		locked_entries = entries;
+	}
+
 	if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
 		mem->pageshift = __ffs(dev_hpa | (entries << PAGE_SHIFT));
 		mem->dev_hpa = dev_hpa;
@@ -175,10 +180,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	unpin_user_pages(mem->hpages, pinned);
 
 	vfree(mem->hpas);
-	kfree(mem);
 
 unlock_exit:
-	account_locked_vm(mm, locked_entries, false);
+	vm_unaccount_pinned(&mem->vm_account, locked_entries);
+	vm_account_release(&mem->vm_account);
+	kfree(mem);
 
 	return ret;
 }
@@ -229,6 +235,7 @@ static void mm_iommu_do_free(struct mm_iommu_table_group_mem_t *mem)
 
 	mm_iommu_unpin(mem);
 	vfree(mem->hpas);
+	vm_account_release(&mem->vm_account);
 	kfree(mem);
 }
 
@@ -279,7 +286,7 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
 
-	account_locked_vm(mm, unlock_entries, false);
+	vm_unaccount_pinned(&mem->vm_account, unlock_entries);
 
 	return ret;
 }
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 60a50ce..a58281b 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -67,6 +67,7 @@ struct tce_container {
 	bool def_window_pending;
 	unsigned long locked_pages;
 	struct mm_struct *mm;
+	struct vm_account vm_account;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct list_head group_list;
 	struct list_head prereg_list;
@@ -82,6 +83,7 @@ static long tce_iommu_mm_set(struct tce_container *container)
 	BUG_ON(!current->mm);
 	container->mm = current->mm;
 	mmgrab(container->mm);
+	vm_account_init_current(&container->vm_account);
 
 	return 0;
 }
@@ -291,7 +293,7 @@ static int tce_iommu_enable(struct tce_container *container)
 		return ret;
 
 	locked = table_group->tce32_size >> PAGE_SHIFT;
-	ret = account_locked_vm(container->mm, locked, true);
+	ret = vm_account_pinned(&container->vm_account, locked);
 	if (ret)
 		return ret;
 
@@ -310,7 +312,7 @@ static void tce_iommu_disable(struct tce_container *container)
 	container->enabled = false;
 
 	BUG_ON(!container->mm);
-	account_locked_vm(container->mm, container->locked_pages, false);
+	vm_unaccount_pinned(&container->vm_account, container->locked_pages);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -372,8 +374,10 @@ static void tce_iommu_release(void *iommu_data)
 		WARN_ON(tce_iommu_prereg_free(container, tcemem));
 
 	tce_iommu_disable(container);
-	if (container->mm)
+	if (container->mm) {
 		mmdrop(container->mm);
+		vm_account_release(&container->vm_account);
+	}
 	mutex_destroy(&container->lock);
 
 	kfree(container);
@@ -619,7 +623,8 @@ static long tce_iommu_create_table(struct tce_container *container,
 	if (!table_size)
 		return -EINVAL;
 
-	ret = account_locked_vm(container->mm, table_size >> PAGE_SHIFT, true);
+	ret = vm_account_pinned(&container->vm_account,
+				table_size >> PAGE_SHIFT);
 	if (ret)
 		return ret;
 
@@ -638,7 +643,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	iommu_tce_table_put(tbl);
-	account_locked_vm(container->mm, pages, false);
+	vm_unaccount_pinned(&container->vm_account, pages);
 }
 
 static long tce_iommu_create_window(struct tce_container *container,
-- 
git-series 0.9.1


* [RFC PATCH 09/19] io_uring: convert to use vm_account
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Jens Axboe, Pavel Begunkov, io-uring

Convert io_uring to use vm_account instead of directly charging pages
against the user/mm. Rather than charging pages to both
user->locked_vm and mm->pinned_vm, this will only charge pages to
user->locked_vm.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/io_uring_types.h |  3 +--
 io_uring/io_uring.c            | 20 +++---------------
 io_uring/notif.c               |  4 ++--
 io_uring/notif.h               | 10 +++------
 io_uring/rsrc.c                | 38 +++--------------------------------
 io_uring/rsrc.h                |  9 +--------
 6 files changed, 16 insertions(+), 68 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 128a67a..d81aceb 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -343,8 +343,7 @@ struct io_ring_ctx {
 	struct io_wq_hash		*hash_map;
 
 	/* Only used for accounting purposes */
-	struct user_struct		*user;
-	struct mm_struct		*mm_account;
+	struct vm_account               vm_account;
 
 	/* ctx exit and cancelation */
 	struct llist_head		fallback_llist;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0a4efad..912da4f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2744,15 +2744,11 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 #endif
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
 
-	if (ctx->mm_account) {
-		mmdrop(ctx->mm_account);
-		ctx->mm_account = NULL;
-	}
+	vm_account_release(&ctx->vm_account);
 	io_mem_free(ctx->rings);
 	io_mem_free(ctx->sq_sqes);
 
 	percpu_ref_exit(&ctx->refs);
-	free_uid(ctx->user);
 	io_req_caches_free(ctx);
 	if (ctx->hash_map)
 		io_wq_put_hash(ctx->hash_map);
@@ -3585,8 +3581,9 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 		ctx->syscall_iopoll = 1;
 
 	ctx->compat = in_compat_syscall();
-	if (!capable(CAP_IPC_LOCK))
-		ctx->user = get_uid(current_user());
+	vm_account_init(&ctx->vm_account, current, current_user(),
+			VM_ACCOUNT_USER |
+			(capable(CAP_IPC_LOCK) ? VM_ACCOUNT_BYPASS : 0));
 
 	/*
 	 * For SQPOLL, we just need a wakeup, always. For !SQPOLL, if
@@ -3619,15 +3616,6 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 		goto err;
 	}
 
-	/*
-	 * This is just grabbed for accounting purposes. When a process exits,
-	 * the mm is exited and dropped before the files, hence we need to hang
-	 * on to this mm purely for the purposes of being able to unaccount
-	 * memory (locked/pinned vm). It's not used for anything else.
-	 */
-	mmgrab(current->mm);
-	ctx->mm_account = current->mm;
-
 	ret = io_allocate_scq_urings(ctx, p);
 	if (ret)
 		goto err;
diff --git a/io_uring/notif.c b/io_uring/notif.c
index c4bb793..0f589fa 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -17,8 +17,8 @@ static void io_notif_complete_tw_ext(struct io_kiocb *notif, bool *locked)
 	if (nd->zc_report && (nd->zc_copied || !nd->zc_used))
 		notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;
 
-	if (nd->account_pages && ctx->user) {
-		__io_unaccount_mem(ctx->user, nd->account_pages);
+	if (nd->account_pages) {
+		vm_unaccount_pinned(&ctx->vm_account, nd->account_pages);
 		nd->account_pages = 0;
 	}
 	io_req_task_complete(notif, locked);
diff --git a/io_uring/notif.h b/io_uring/notif.h
index c88c800..e2cb44a 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -43,11 +43,9 @@ static inline int io_notif_account_mem(struct io_kiocb *notif, unsigned len)
 	unsigned nr_pages = (len >> PAGE_SHIFT) + 2;
 	int ret;
 
-	if (ctx->user) {
-		ret = __io_account_mem(ctx->user, nr_pages);
-		if (ret)
-			return ret;
-		nd->account_pages += nr_pages;
-	}
+	ret = __io_account_mem(&ctx->vm_account, nr_pages);
+	if (ret)
+		return ret;
+	nd->account_pages += nr_pages;
 	return 0;
 }
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 18de10c..aa44528 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -42,49 +42,19 @@ void io_rsrc_refs_drop(struct io_ring_ctx *ctx)
 	}
 }
 
-int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
+int __io_account_mem(struct vm_account *vm_account, unsigned long nr_pages)
 {
-	unsigned long page_limit, cur_pages, new_pages;
-
-	if (!nr_pages)
-		return 0;
-
-	/* Don't allow more pages than we can safely lock */
-	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-	cur_pages = atomic_long_read(&user->locked_vm);
-	do {
-		new_pages = cur_pages + nr_pages;
-		if (new_pages > page_limit)
-			return -ENOMEM;
-	} while (!atomic_long_try_cmpxchg(&user->locked_vm,
-					  &cur_pages, new_pages));
-	return 0;
+	return vm_account_pinned(vm_account, nr_pages);
 }
 
 static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
 {
-	if (ctx->user)
-		__io_unaccount_mem(ctx->user, nr_pages);
-
-	if (ctx->mm_account)
-		atomic64_sub(nr_pages, &ctx->mm_account->pinned_vm);
+	vm_unaccount_pinned(&ctx->vm_account, nr_pages);
 }
 
 static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
 {
-	int ret;
-
-	if (ctx->user) {
-		ret = __io_account_mem(ctx->user, nr_pages);
-		if (ret)
-			return ret;
-	}
-
-	if (ctx->mm_account)
-		atomic64_add(nr_pages, &ctx->mm_account->pinned_vm);
-
-	return 0;
+	return vm_account_pinned(&ctx->vm_account, nr_pages);
 }
 
 static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 2b87436..d8833d0 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -167,12 +167,5 @@ static inline u64 *io_get_tag_slot(struct io_rsrc_data *data, unsigned int idx)
 int io_files_update(struct io_kiocb *req, unsigned int issue_flags);
 int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 
-int __io_account_mem(struct user_struct *user, unsigned long nr_pages);
-
-static inline void __io_unaccount_mem(struct user_struct *user,
-				      unsigned long nr_pages)
-{
-	atomic_long_sub(nr_pages, &user->locked_vm);
-}
-
+int __io_account_mem(struct vm_account *vm_account, unsigned long nr_pages);
 #endif
-- 
git-series 0.9.1


* [RFC PATCH 10/19] net: skb: Switch to using vm_account
From: Alistair Popple @ 2023-01-24  5:42 UTC
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, netdev, linux-rdma, rds-devel

Switch to using vm_account to charge pinned pages. This will allow a
future change to charge the pinned pages to a cgroup to limit the
overall number of pinned pages in the system.
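
Callers now pass the owning socket's vm_account alongside the mmpin;
a sketch of the adjusted call, based on the new signature rather than
the exact msg_zerocopy hunk (which is truncated below):

	if (mm_account_pinned_pages(&sk->sk_vm_account, &uarg->mmp, size))
		return NULL;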

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
---
 include/linux/skbuff.h |  6 ++---
 include/net/sock.h     |  2 ++-
 net/core/skbuff.c      | 47 +++++++++++++++----------------------------
 net/rds/message.c      |  9 +++++---
 4 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4c84924..c956405 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -554,7 +554,6 @@ struct ubuf_info_msgzc {
 	};
 
 	struct mmpin {
-		struct user_struct *user;
 		unsigned int num_pg;
 	} mmp;
 };
@@ -563,8 +562,9 @@ struct ubuf_info_msgzc {
 #define uarg_to_msgzc(ubuf_ptr)	container_of((ubuf_ptr), struct ubuf_info_msgzc, \
 					     ubuf)
 
-int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
-void mm_unaccount_pinned_pages(struct mmpin *mmp);
+int mm_account_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp,
+			size_t size);
+void mm_unaccount_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp);
 
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
diff --git a/include/net/sock.h b/include/net/sock.h
index dcd72e6..bc3a868 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -334,6 +334,7 @@ struct sk_filter;
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
   *	@sk_cgrp_data: cgroup data for this cgroup
+  *	@sk_vm_account: data for pinned memory accounting
   *	@sk_memcg: this socket's memory cgroup association
   *	@sk_write_pending: a write to stream socket waits to start
   *	@sk_state_change: callback to indicate change in the state of the sock
@@ -523,6 +524,7 @@ struct sock {
 	void			*sk_security;
 #endif
 	struct sock_cgroup_data	sk_cgrp_data;
+	struct vm_account       sk_vm_account;
 	struct mem_cgroup	*sk_memcg;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4a0eb55..bed3fc9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1309,42 +1309,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
-int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+int mm_account_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp,
+			    size_t size)
 {
-	unsigned long max_pg, num_pg, new_pg, old_pg;
-	struct user_struct *user;
-
-	if (capable(CAP_IPC_LOCK) || !size)
-		return 0;
+	unsigned int num_pg;
 
 	num_pg = (size >> PAGE_SHIFT) + 2;	/* worst case */
-	max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	user = mmp->user ? : current_user();
+	if (vm_account_pinned(vm_account, num_pg))
+		return -ENOBUFS;
 
-	old_pg = atomic_long_read(&user->locked_vm);
-	do {
-		new_pg = old_pg + num_pg;
-		if (new_pg > max_pg)
-			return -ENOBUFS;
-	} while (!atomic_long_try_cmpxchg(&user->locked_vm, &old_pg, new_pg));
-
-	if (!mmp->user) {
-		mmp->user = get_uid(user);
-		mmp->num_pg = num_pg;
-	} else {
-		mmp->num_pg += num_pg;
-	}
+	mmp->num_pg += num_pg;
 
 	return 0;
 }
 EXPORT_SYMBOL_GPL(mm_account_pinned_pages);
 
-void mm_unaccount_pinned_pages(struct mmpin *mmp)
+void mm_unaccount_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp)
 {
-	if (mmp->user) {
-		atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
-		free_uid(mmp->user);
-	}
+	vm_unaccount_pinned(vm_account, mmp->num_pg);
+	vm_account_release(vm_account);
 }
 EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
 
@@ -1361,9 +1344,12 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 
 	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
 	uarg = (void *)skb->cb;
-	uarg->mmp.user = NULL;
+	uarg->mmp.num_pg = 0;
+	vm_account_init(&sk->sk_vm_account, current,
+			current_user(), VM_ACCOUNT_USER);
 
-	if (mm_account_pinned_pages(&uarg->mmp, size)) {
+	if (mm_account_pinned_pages(&sk->sk_vm_account, &uarg->mmp, size)) {
+		vm_account_release(&sk->sk_vm_account);
 		kfree_skb(skb);
 		return NULL;
 	}
@@ -1416,7 +1402,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 
 		next = (u32)atomic_read(&sk->sk_zckey);
 		if ((u32)(uarg_zc->id + uarg_zc->len) == next) {
-			if (mm_account_pinned_pages(&uarg_zc->mmp, size))
+			if (mm_account_pinned_pages(&sk->sk_vm_account,
+						    &uarg_zc->mmp, size))
 				return NULL;
 			uarg_zc->len++;
 			uarg_zc->bytelen = bytelen;
@@ -1466,7 +1453,7 @@ static void __msg_zerocopy_callback(struct ubuf_info_msgzc *uarg)
 	u32 lo, hi;
 	u16 len;
 
-	mm_unaccount_pinned_pages(&uarg->mmp);
+	mm_unaccount_pinned_pages(&sk->sk_vm_account, &uarg->mmp);
 
 	/* if !len, there was only 1 call, and it was aborted
 	 * so do not queue a completion notification
diff --git a/net/rds/message.c b/net/rds/message.c
index b47e4f0..2138a70 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
 	struct list_head *head;
 	unsigned long flags;
 
-	mm_unaccount_pinned_pages(&znotif->z_mmp);
+	mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
 	q = &rs->rs_zcookie_queue;
 	spin_lock_irqsave(&q->lock, flags);
 	head = &q->zcookie_head;
@@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 	int ret = 0;
 	int length = iov_iter_count(from);
 	struct rds_msg_zcopy_info *info;
+	struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
 
 	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
 
@@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 		return -ENOMEM;
 	INIT_LIST_HEAD(&info->rs_zcookie_next);
 	rm->data.op_mmp_znotifier = &info->znotif;
-	if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
+	vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
+	if (mm_account_pinned_pages(vm_account,
+				    &rm->data.op_mmp_znotifier->z_mmp,
 				    length)) {
 		ret = -ENOMEM;
 		goto err;
@@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 			for (i = 0; i < rm->data.op_nents; i++)
 				put_page(sg_page(&rm->data.op_sg[i]));
 			mmp = &rm->data.op_mmp_znotifier->z_mmp;
-			mm_unaccount_pinned_pages(mmp);
+			mm_unaccount_pinned_pages(vm_account, mmp);
 			ret = -EFAULT;
 			goto err;
 		}
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH 11/19] xdp: convert to use vm_account
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (9 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24  5:42 ` [RFC PATCH 12/19] kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned() Alistair Popple
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf

Switch to using the new vm_account struct to charge pinned pages and
enforce the rlimit. This will allow a future change to also charge a
cgroup to limit the number of pinned pages.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/net/xdp_sock.h |  2 +-
 net/xdp/xdp_umem.c     | 38 +++++++++++++-------------------------
 2 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 3057e1a..b0d3c16 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -25,7 +25,7 @@ struct xdp_umem {
 	u32 chunk_size;
 	u32 chunks;
 	u32 npgs;
-	struct user_struct *user;
+	struct vm_account vm_account;
 	refcount_t users;
 	u8 flags;
 	bool zc;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 4681e8e..4b5fb2f 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -29,12 +29,10 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 	umem->pgs = NULL;
 }
 
-static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
+static void xdp_umem_unaccount_pages(struct xdp_umem *umem, u32 npgs)
 {
-	if (umem->user) {
-		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
-		free_uid(umem->user);
-	}
+	vm_unaccount_pinned(&umem->vm_account, npgs);
+	vm_account_release(&umem->vm_account);
 }
 
 static void xdp_umem_addr_unmap(struct xdp_umem *umem)
@@ -54,13 +52,15 @@ static int xdp_umem_addr_map(struct xdp_umem *umem, struct page **pages,
 
 static void xdp_umem_release(struct xdp_umem *umem)
 {
+	u32 npgs = umem->npgs;
+
 	umem->zc = false;
 	ida_free(&umem_ida, umem->id);
 
 	xdp_umem_addr_unmap(umem);
 	xdp_umem_unpin_pages(umem);
 
-	xdp_umem_unaccount_pages(umem);
+	xdp_umem_unaccount_pages(umem, npgs);
 	kfree(umem);
 }
 
@@ -127,24 +127,13 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 
 static int xdp_umem_account_pages(struct xdp_umem *umem)
 {
-	unsigned long lock_limit, new_npgs, old_npgs;
-
-	if (capable(CAP_IPC_LOCK))
-		return 0;
-
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	umem->user = get_uid(current_user());
+	vm_account_init(&umem->vm_account, current,
+			current_user(), VM_ACCOUNT_USER);
+	if (vm_account_pinned(&umem->vm_account, umem->npgs)) {
+		vm_account_release(&umem->vm_account);
+		return -ENOBUFS;
+	}
 
-	do {
-		old_npgs = atomic_long_read(&umem->user->locked_vm);
-		new_npgs = old_npgs + umem->npgs;
-		if (new_npgs > lock_limit) {
-			free_uid(umem->user);
-			umem->user = NULL;
-			return -ENOBUFS;
-		}
-	} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
-				     new_npgs) != old_npgs);
 	return 0;
 }
 
@@ -204,7 +193,6 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	umem->chunks = chunks;
 	umem->npgs = (u32)npgs;
 	umem->pgs = NULL;
-	umem->user = NULL;
 	umem->flags = mr->flags;
 
 	INIT_LIST_HEAD(&umem->xsk_dma_list);
@@ -227,7 +215,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 out_unpin:
 	xdp_umem_unpin_pages(umem);
 out_account:
-	xdp_umem_unaccount_pages(umem);
+	xdp_umem_unaccount_pages(umem, npgs);
 	return err;
 }
 
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH 12/19] kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned()
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (10 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 11/19] xdp: convert to use vm_account Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24  5:42 ` [RFC PATCH 13/19] fpga: dfl: afu: convert to use vm_account Alistair Popple
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Paolo Bonzini, Alexey Kardashevskiy,
	linuxppc-dev, kvm

book3s_64_vio currently accounts for pinned pages with
account_locked_vm(), which charges the pages to mm->locked_vm. To
make this consistent with other drivers, switch to using
vm_account_pinned().
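
This hangs a vm_account off struct kvm itself, so the accounting
lives and dies with the VM rather than with a single task. The
resulting lifetime is roughly (a sketch, not a hunk from this patch):

	/* kvm_create_vm(): account against the creating task */
	vm_account_init_current(&kvm->vm_account);

	/* TCE table create/release charge and uncharge it */
	ret = vm_account_pinned(&kvm->vm_account, kvmppc_stt_pages(npages));
	vm_unaccount_pinned(&kvm->vm_account, kvmppc_stt_pages(npages));

	/* kvm_destroy_vm(): drop the accounting references */
	vm_account_release(&kvm->vm_account);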

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
 arch/powerpc/kvm/book3s_64_vio.c | 10 +++++-----
 include/linux/kvm_host.h         |  1 +
 virt/kvm/kvm_main.c              |  3 +++
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 95e738e..ecd1deb 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -273,8 +273,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 		}
 	}
 
-	account_locked_vm(kvm->mm,
-		kvmppc_stt_pages(kvmppc_tce_pages(stt->size)), false);
+	vm_unaccount_pinned(&kvm->vm_account,
+		kvmppc_stt_pages(kvmppc_tce_pages(stt->size)));
 
 	kvm_put_kvm(stt->kvm);
 
@@ -301,8 +301,8 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 		(args->offset + args->size > (ULLONG_MAX >> args->page_shift)))
 		return -EINVAL;
 
-	npages = kvmppc_tce_pages(args->size);
-	ret = account_locked_vm(mm, kvmppc_stt_pages(npages), true);
+	npages = kvmppc_tce_pages(args->size);
+	ret = vm_account_pinned(&kvm->vm_account, kvmppc_stt_pages(npages));
 	if (ret)
 		return ret;
 
@@ -347,7 +347,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 
 	kfree(stt);
  fail_acct:
-	account_locked_vm(mm, kvmppc_stt_pages(npages), false);
+	vm_unaccount_pinned(&kvm->vm_account, kvmppc_stt_pages(npages));
 	return ret;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b24..25ed390 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -717,6 +717,7 @@ struct kvm {
 	 */
 	struct mutex slots_arch_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
+	struct vm_account vm_account;
 	unsigned long nr_memslot_pages;
 	/* The two memslot sets - active and inactive (per address space) */
 	struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9c60384..770d037 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	KVM_MMU_LOCK_INIT(kvm);
 	mmgrab(current->mm);
 	kvm->mm = current->mm;
+	vm_account_init_current(&kvm->vm_account);
 	kvm_eventfd_init(kvm);
 	mutex_init(&kvm->lock);
 	mutex_init(&kvm->irq_lock);
@@ -1258,6 +1259,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 out_err_no_srcu:
 	kvm_arch_free_vm(kvm);
 	mmdrop(current->mm);
+	vm_account_release(&kvm->vm_account);
 	module_put(kvm_chardev_ops.owner);
 	return ERR_PTR(r);
 }
@@ -1327,6 +1329,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	preempt_notifier_dec();
 	hardware_disable_all();
 	mmdrop(mm);
+	vm_account_release(&kvm->vm_account);
 	module_put(kvm_chardev_ops.owner);
 }
 
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH 13/19] fpga: dfl: afu: convert to use vm_account
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (11 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 12/19] kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned() Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24  5:42 ` [RFC PATCH 14/19] mm: Introduce a cgroup for pinned memory Alistair Popple
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Wu Hao, Tom Rix, Moritz Fischer,
	Xu Yilun, linux-fpga

To charge pinned pages against the pins cgroup, drivers must use the
vm_account_pinned() functions, which require initialisation of a
struct vm_account. Convert the dfl-afu DMA region code to do this and
charge any pins to the pins cgroup.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Wu Hao <hao.wu@intel.com>
Cc: Tom Rix <trix@redhat.com>
Cc: Moritz Fischer <mdf@kernel.org>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: linux-fpga@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/fpga/dfl-afu-dma-region.c | 11 ++++++++---
 drivers/fpga/dfl-afu.h            |  1 +
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c
index 02b60fd..3b99784 100644
--- a/drivers/fpga/dfl-afu-dma-region.c
+++ b/drivers/fpga/dfl-afu-dma-region.c
@@ -38,7 +38,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
 	struct device *dev = &pdata->dev->dev;
 	int ret, pinned;
 
-	ret = account_locked_vm(current->mm, npages, true);
+	ret = vm_account_pinned(&region->vm_account, npages);
 	if (ret)
 		return ret;
 
@@ -67,7 +67,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
 free_pages:
 	kfree(region->pages);
 unlock_vm:
-	account_locked_vm(current->mm, npages, false);
+	vm_unaccount_pinned(&region->vm_account, npages);
 	return ret;
 }
 
@@ -87,7 +87,7 @@ static void afu_dma_unpin_pages(struct dfl_feature_platform_data *pdata,
 
 	unpin_user_pages(region->pages, npages);
 	kfree(region->pages);
-	account_locked_vm(current->mm, npages, false);
+	vm_unaccount_pinned(&region->vm_account, npages);
 
 	dev_dbg(dev, "%ld pages unpinned\n", npages);
 }
@@ -223,6 +223,7 @@ void afu_dma_region_destroy(struct dfl_feature_platform_data *pdata)
 			afu_dma_unpin_pages(pdata, region);
 
 		node = rb_next(node);
+		vm_account_release(&region->vm_account);
 		kfree(region);
 	}
 }
@@ -322,6 +323,8 @@ int afu_dma_map_region(struct dfl_feature_platform_data *pdata,
 	region->user_addr = user_addr;
 	region->length = length;
 
+	vm_account_init_current(&region->vm_account);
+
 	/* Pin the user memory region */
 	ret = afu_dma_pin_pages(pdata, region);
 	if (ret) {
@@ -365,6 +368,7 @@ int afu_dma_map_region(struct dfl_feature_platform_data *pdata,
 unpin_pages:
 	afu_dma_unpin_pages(pdata, region);
 free_region:
+	vm_account_release(&region->vm_account);
 	kfree(region);
 	return ret;
 }
@@ -399,6 +403,7 @@ int afu_dma_unmap_region(struct dfl_feature_platform_data *pdata, u64 iova)
 	dma_unmap_page(dfl_fpga_pdata_to_parent(pdata),
 		       region->iova, region->length, DMA_BIDIRECTIONAL);
 	afu_dma_unpin_pages(pdata, region);
+	vm_account_release(&region->vm_account);
 	kfree(region);
 
 	return 0;
diff --git a/drivers/fpga/dfl-afu.h b/drivers/fpga/dfl-afu.h
index e5020e2..b1554e0 100644
--- a/drivers/fpga/dfl-afu.h
+++ b/drivers/fpga/dfl-afu.h
@@ -51,6 +51,7 @@ struct dfl_afu_mmio_region {
  * @in_use: flag to indicate if this region is in_use.
  */
 struct dfl_afu_dma_region {
+	struct vm_account vm_account;
 	u64 user_addr;
 	u64 length;
 	u64 iova;
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH 14/19] mm: Introduce a cgroup for pinned memory
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (12 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 13/19] fpga: dfl: afu: convert to use vm_account Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-27 21:44   ` Tejun Heo
  2023-01-24  5:42 ` [RFC PATCH 15/19] mm/util: Extend vm_account to charge pages against the pin cgroup Alistair Popple
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Tejun Heo, Zefan Li, Andrew Morton

If too much memory in a system is pinned or locked it can lead to
problems such as performance degradation or, in the worst case,
out-of-memory errors as such memory cannot be moved or paged out.

In order to prevent users without CAP_IPC_LOCK from causing these
issues, the amount of memory that can be pinned is typically limited
by RLIMIT_MEMLOCK. However this is inflexible: limits can't be shared
between tasks, and enforcement is inconsistent between in-kernel
users of pinned memory such as mlock() and device drivers which may
also pin pages with pin_user_pages().

To allow a single limit to be set, introduce a cgroup controller
which can be used to limit the number of pages being pinned by all
tasks in the cgroup.
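
The controller exposes pins.max, pins.current and pins.events on
non-root cgroups. In-kernel users charge and uncharge the hierarchy
with the helpers added below; the expected calling pattern is roughly
(a sketch with locking elided; npages stands for whatever the caller
is pinning):

	struct pins_cgroup *pins_cg = get_pins_cg(current);

	if (pins_try_charge(pins_cg, npages)) {
		/* a limit somewhere up the hierarchy would be exceeded */
		put_pins_cg(pins_cg);
		return -ENOMEM;
	}
	...
	/* once the pages are unpinned */
	pins_uncharge(pins_cg, npages);
	put_pins_cg(pins_cg);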

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 MAINTAINERS                   |   7 +-
 include/linux/cgroup.h        |  20 +++-
 include/linux/cgroup_subsys.h |   4 +-
 mm/Kconfig                    |  11 +-
 mm/Makefile                   |   1 +-
 mm/pins_cgroup.c              | 273 +++++++++++++++++++++++++++++++++++-
 6 files changed, 316 insertions(+)
 create mode 100644 mm/pins_cgroup.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f781f93..f8526e2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5381,6 +5381,13 @@ F:	tools/testing/selftests/cgroup/memcg_protection.m
 F:	tools/testing/selftests/cgroup/test_kmem.c
 F:	tools/testing/selftests/cgroup/test_memcontrol.c
 
+CONTROL GROUP - PINNED AND LOCKED MEMORY
+M:	Alistair Popple <apopple@nvidia.com>
+L:	cgroups@vger.kernel.org
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/pins_cgroup.c
+
 CORETEMP HARDWARE MONITORING DRIVER
 M:	Fenghua Yu <fenghua.yu@intel.com>
 L:	linux-hwmon@vger.kernel.org
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3410aec..440f299 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -857,4 +857,24 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 
 #endif /* CONFIG_CGROUP_BPF */
 
+#ifdef CONFIG_CGROUP_PINS
+
+struct pins_cgroup *get_pins_cg(struct task_struct *task);
+void put_pins_cg(struct pins_cgroup *cg);
+void pins_uncharge(struct pins_cgroup *pins, int num);
+int pins_try_charge(struct pins_cgroup *pins, int num);
+
+#else /* CONFIG_CGROUP_PINS */
+
+static inline struct pins_cgroup *get_pins_cg(struct task_struct *task)
+{
+	return NULL;
+}
+
+static inline void put_pins_cg(struct pins_cgroup *cg) {}
+static inline void pins_uncharge(struct pins_cgroup *pins, int num) {}
+static inline int pins_try_charge(struct pins_cgroup *pins, int num) { return 0; }
+
+#endif /* CONFIG_CGROUP_PINS */
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 4452354..c1b4aab 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
 SUBSYS(misc)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_PINS)
+SUBSYS(pins)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209..7a32b98 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1183,6 +1183,17 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config CGROUP_PINS
+    bool "Cgroup for pinned and locked memory"
+    default y
+
+    help
+      Having too much memory pinned or locked can lead to system
+      instability due to increased likelihood of encountering
+      out-of-memory conditions. Select this option to enable a cgroup
+      which can be used to limit the overall number of pages locked or
+      pinned by drivers.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5..81db189 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_CGROUP_PINS) += pins_cgroup.o
diff --git a/mm/pins_cgroup.c b/mm/pins_cgroup.c
new file mode 100644
index 0000000..cc310d5
--- /dev/null
+++ b/mm/pins_cgroup.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Controller for cgroups limiting number of pages pinned for FOLL_LONGTERM.
+ *
+ * Copyright (C) 2022 Alistair Popple <apopple@nvidia.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/threads.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/sched/task.h>
+
+#define PINS_MAX (-1ULL)
+#define PINS_MAX_STR "max"
+
+struct pins_cgroup {
+	struct cgroup_subsys_state	css;
+
+	atomic64_t			counter;
+	atomic64_t			limit;
+
+	struct cgroup_file		events_file;
+	atomic64_t			events_limit;
+};
+
+static struct pins_cgroup *css_pins(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct pins_cgroup, css);
+}
+
+static struct pins_cgroup *parent_pins(struct pins_cgroup *pins)
+{
+	return css_pins(pins->css.parent);
+}
+
+struct pins_cgroup *get_pins_cg(struct task_struct *task)
+{
+	return css_pins(task_get_css(task, pins_cgrp_id));
+}
+
+void put_pins_cg(struct pins_cgroup *cg)
+{
+	css_put(&cg->css);
+}
+
+static struct cgroup_subsys_state *
+pins_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct pins_cgroup *pins;
+
+	pins = kzalloc(sizeof(struct pins_cgroup), GFP_KERNEL);
+	if (!pins)
+		return ERR_PTR(-ENOMEM);
+
+	atomic64_set(&pins->counter, 0);
+	atomic64_set(&pins->limit, PINS_MAX);
+	atomic64_set(&pins->events_limit, 0);
+	return &pins->css;
+}
+
+static void pins_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_pins(css));
+}
+
+/**
+ * pins_cancel - uncharge the local pin count
+ * @pins: the pin cgroup state
+ * @num: the number of pins to cancel
+ *
+ * This function will WARN if the pin count goes under 0, because such a case is
+ * a bug in the pins controller proper.
+ */
+void pins_cancel(struct pins_cgroup *pins, int num)
+{
+	/*
+	 * A negative count (or overflow for that matter) is invalid,
+	 * and indicates a bug in the `pins` controller proper.
+	 */
+	WARN_ON_ONCE(atomic64_add_negative(-num, &pins->counter));
+}
+
+/**
+ * pins_uncharge - hierarchically uncharge the pin count
+ * @pins: the pin cgroup state
+ * @num: the number of pins to uncharge
+ */
+void pins_uncharge(struct pins_cgroup *pins, int num)
+{
+	struct pins_cgroup *p;
+
+	for (p = pins; parent_pins(p); p = parent_pins(p))
+		pins_cancel(p, num);
+}
+
+/**
+ * pins_charge - hierarchically charge the pin count
+ * @pins: the pin cgroup state
+ * @num: the number of pins to charge
+ *
+ * This function does *not* follow the pin limit set. It cannot fail and the new
+ * pin count may exceed the limit. This is only used for reverting failed
+ * attaches, where there is no other way out than violating the limit.
+ */
+static void pins_charge(struct pins_cgroup *pins, int num)
+{
+	struct pins_cgroup *p;
+
+	for (p = pins; parent_pins(p); p = parent_pins(p))
+		atomic64_add(num, &p->counter);
+}
+
+/**
+ * pins_try_charge - hierarchically try to charge the pin count
+ * @pins: the pin cgroup state
+ * @num: the number of pins to charge
+ *
+ * This function follows the set limit. It will fail if the charge would cause
+ * the new value to exceed the hierarchical limit. Returns 0 if the charge
+ * succeeded, otherwise -EAGAIN.
+ */
+int pins_try_charge(struct pins_cgroup *pins, int num)
+{
+	struct pins_cgroup *p, *q;
+
+	for (p = pins; parent_pins(p); p = parent_pins(p)) {
+		uint64_t new = atomic64_add_return(num, &p->counter);
+		uint64_t limit = atomic64_read(&p->limit);
+
+		if (limit != PINS_MAX && new > limit)
+			goto revert;
+	}
+
+	return 0;
+
+revert:
+	for (q = pins; q != p; q = parent_pins(q))
+		pins_cancel(q, num);
+	pins_cancel(p, num);
+
+	return -EAGAIN;
+}
+
+static int pins_can_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *dst_css;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		struct pins_cgroup *pins = css_pins(dst_css);
+		struct cgroup_subsys_state *old_css;
+		struct pins_cgroup *old_pins;
+
+		old_css = task_css(task, pins_cgrp_id);
+		old_pins = css_pins(old_css);
+
+		pins_charge(pins, task->mm->locked_vm);
+		pins_uncharge(old_pins, task->mm->locked_vm);
+	}
+
+	return 0;
+}
+
+static void pins_cancel_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *dst_css;
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		struct pins_cgroup *pins = css_pins(dst_css);
+		struct cgroup_subsys_state *old_css;
+		struct pins_cgroup *old_pins;
+
+		old_css = task_css(task, pins_cgrp_id);
+		old_pins = css_pins(old_css);
+
+		pins_charge(old_pins, task->mm->locked_vm);
+		pins_uncharge(pins, task->mm->locked_vm);
+	}
+}
+
+
+static ssize_t pins_max_write(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct pins_cgroup *pins = css_pins(css);
+	int64_t limit;
+	int err;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, PINS_MAX_STR)) {
+		limit = PINS_MAX;
+		goto set_limit;
+	}
+
+	err = kstrtoll(buf, 0, &limit);
+	if (err)
+		return err;
+
+	if (limit < 0 || limit >= PINS_MAX)
+		return -EINVAL;
+
+set_limit:
+	/*
+	 * Limit updates don't need to be mutex'd, since it isn't
+	 * critical that any racing fork()s follow the new limit.
+	 */
+	atomic64_set(&pins->limit, limit);
+	return nbytes;
+}
+
+static int pins_max_show(struct seq_file *sf, void *v)
+{
+	struct cgroup_subsys_state *css = seq_css(sf);
+	struct pins_cgroup *pins = css_pins(css);
+	uint64_t limit = atomic64_read(&pins->limit);
+
+	if (limit >= PINS_MAX)
+		seq_printf(sf, "%s\n", PINS_MAX_STR);
+	else
+		seq_printf(sf, "%lld\n", limit);
+
+	return 0;
+}
+
+static s64 pins_current_read(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
+{
+	struct pins_cgroup *pins = css_pins(css);
+
+	return atomic64_read(&pins->counter);
+}
+
+static int pins_events_show(struct seq_file *sf, void *v)
+{
+	struct pins_cgroup *pins = css_pins(seq_css(sf));
+
+	seq_printf(sf, "max %lld\n", (s64)atomic64_read(&pins->events_limit));
+	return 0;
+}
+
+static struct cftype pins_files[] = {
+	{
+		.name = "max",
+		.write = pins_max_write,
+		.seq_show = pins_max_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "current",
+		.read_s64 = pins_current_read,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "events",
+		.seq_show = pins_events_show,
+		.file_offset = offsetof(struct pins_cgroup, events_file),
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys pins_cgrp_subsys = {
+	.css_alloc = pins_css_alloc,
+	.css_free = pins_css_free,
+	.legacy_cftypes = pins_files,
+	.dfl_cftypes = pins_files,
+	.can_attach = pins_can_attach,
+	.cancel_attach = pins_cancel_attach,
+};
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH 15/19] mm/util: Extend vm_account to charge pages against the pin cgroup
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (13 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 14/19] mm: Introduce a cgroup for pinned memory Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24  5:42 ` [RFC PATCH 16/19] mm/util: Refactor account_locked_vm Alistair Popple
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple

The vm_account_pinned() functions currently only account pages
against pinned_vm/locked_vm and enforce limits against
RLIMIT_MEMLOCK. Extend these to also charge pages to, and enforce
limits with, the pins cgroup.

Accounting of pages will fail if either RLIMIT_MEMLOCK or the cgroup
limit is exceeded. Unlike rlimit enforcement, which can be bypassed
if the user has CAP_IPC_LOCK, cgroup limits cannot be bypassed.
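
From a driver's perspective nothing changes except that the charge
can now also fail due to the cgroup limit. For example (a sketch
using the helpers from earlier in this series; npages is whatever
the driver is pinning):

	struct vm_account vm_account;

	vm_account_init_current(&vm_account);
	if (vm_account_pinned(&vm_account, npages)) {
		/* either RLIMIT_MEMLOCK or the pins cgroup limit was hit */
		vm_account_release(&vm_account);
		return -ENOMEM;
	}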

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mm_types.h |  1 +
 mm/util.c                | 22 ++++++++++++++++++----
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7de2168..4adf8dc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1116,6 +1116,7 @@ struct vm_account {
 		struct mm_struct *mm;
 		struct user_struct *user;
 	} a;
+	struct pins_cgroup *pins_cg;
 	enum vm_account_flags flags;
 };
 
diff --git a/mm/util.c b/mm/util.c
index af40b1e..e5fb01a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -442,6 +442,7 @@ void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
 		vm_account->a.mm = task->mm;
 	}
 
+	vm_account->pins_cg = get_pins_cg(task);
 	vm_account->flags = flags;
 }
 EXPORT_SYMBOL_GPL(vm_account_init);
@@ -459,6 +460,7 @@ void vm_account_release(struct vm_account *vm_account)
 		free_uid(vm_account->a.user);
 	else
 		mmdrop(vm_account->a.mm);
+	put_pins_cg(vm_account->pins_cg);
 }
 EXPORT_SYMBOL_GPL(vm_account_release);
 
@@ -489,6 +491,15 @@ static int vm_account_cmpxchg(struct vm_account *vm_account,
 	}
 }
 
+static void vm_unaccount_legacy(struct vm_account *vm_account,
+				unsigned long npages)
+{
+	if (vm_account->flags & VM_ACCOUNT_USER)
+		atomic_long_sub(npages, &vm_account->a.user->locked_vm);
+	else
+		atomic64_sub(npages, &vm_account->a.mm->pinned_vm);
+}
+
 int vm_account_pinned(struct vm_account *vm_account, unsigned long npages)
 {
 	unsigned long lock_limit = RLIM_INFINITY;
@@ -506,16 +517,19 @@ int vm_account_pinned(struct vm_account *vm_account, unsigned long npages)
 			return ret;
 	}
 
+	if (pins_try_charge(vm_account->pins_cg, npages)) {
+		vm_unaccount_legacy(vm_account, npages);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(vm_account_pinned);
 
 void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages)
 {
-	if (vm_account->flags & VM_ACCOUNT_USER)
-		atomic_long_sub(npages, &vm_account->a.user->locked_vm);
-	else
-		atomic64_sub(npages, &vm_account->a.mm->pinned_vm);
+	vm_unaccount_legacy(vm_account, npages);
+	pins_uncharge(vm_account->pins_cg, npages);
 }
 EXPORT_SYMBOL_GPL(vm_unaccount_pinned);
 
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH 16/19] mm/util: Refactor account_locked_vm
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (14 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 15/19] mm/util: Extend vm_account to charge pages against the pin cgroup Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24  5:42 ` [RFC PATCH 17/19] mm: Convert mmap and mlock to use account_locked_vm Alistair Popple
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple

account_locked_vm() takes a flag to indicate if pages are being
accounted or unaccounted for. A flag is also provided to bypass
rlimits. However unaccounting of pages always succeeds, so on that
path the flag to bypass the limits is ignored. These flags make
calling code harder to understand, so refactor the accounting and
unaccounting paths into separate functions.
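
Call sites change as follows (a sketch of the pattern, not a hunk
from this patch):

	/* before: bool flag selects charge vs uncharge */
	ret = account_locked_vm(mm, npages, true);
	...
	account_locked_vm(mm, npages, false);

	/* after: explicit charge and uncharge paths */
	ret = account_locked_vm(mm, npages);
	...
	unaccount_locked_vm(mm, npages);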

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mm.h |  5 +--
 mm/util.c          | 73 +++++++++++++++++++++++++++++++++--------------
 2 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f85716..126b756 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2090,9 +2090,11 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
 
-int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
-int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
+int account_locked_vm(struct mm_struct *mm, unsigned long pages);
+int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
 			struct task_struct *task, bool bypass_rlim);
+void __unaccount_locked_vm(struct mm_struct *mm, unsigned long pages);
+void unaccount_locked_vm(struct mm_struct *mm, unsigned long pages);
 
 struct kvec;
 int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
diff --git a/mm/util.c b/mm/util.c
index e5fb01a..78b060d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -537,7 +537,6 @@ EXPORT_SYMBOL_GPL(vm_unaccount_pinned);
  * __account_locked_vm - account locked pages to an mm's locked_vm
  * @mm:          mm to account against
  * @pages:       number of pages to account
- * @inc:         %true if @pages should be considered positive, %false if not
  * @task:        task used to check RLIMIT_MEMLOCK
  * @bypass_rlim: %true if checking RLIMIT_MEMLOCK should be skipped
  *
@@ -548,7 +547,7 @@ EXPORT_SYMBOL_GPL(vm_unaccount_pinned);
  * * 0       on success
  * * -ENOMEM if RLIMIT_MEMLOCK would be exceeded.
  */
-int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
+int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
 			struct task_struct *task, bool bypass_rlim)
 {
 	unsigned long locked_vm, limit;
@@ -557,33 +556,44 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 	mmap_assert_write_locked(mm);
 
 	locked_vm = mm->locked_vm;
-	if (inc) {
-		if (!bypass_rlim) {
-			limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-			if (locked_vm + pages > limit)
-				ret = -ENOMEM;
-		}
-		if (!ret)
-			mm->locked_vm = locked_vm + pages;
-	} else {
-		WARN_ON_ONCE(pages > locked_vm);
-		mm->locked_vm = locked_vm - pages;
+	if (!bypass_rlim) {
+		limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+		if (locked_vm + pages > limit)
+			ret = -ENOMEM;
 	}
 
-	pr_debug("%s: [%d] caller %ps %c%lu %lu/%lu%s\n", __func__, task->pid,
-		 (void *)_RET_IP_, (inc) ? '+' : '-', pages << PAGE_SHIFT,
-		 locked_vm << PAGE_SHIFT, task_rlimit(task, RLIMIT_MEMLOCK),
-		 ret ? " - exceeded" : "");
+	if (!ret)
+		mm->locked_vm = locked_vm + pages;
+
+	pr_debug("%s: [%d] caller %ps %lu %lu/%lu%s\n", __func__, task->pid,
+		 (void *)_RET_IP_, pages << PAGE_SHIFT, locked_vm << PAGE_SHIFT,
+		task_rlimit(task, RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
 
 	return ret;
 }
 EXPORT_SYMBOL_GPL(__account_locked_vm);
 
 /**
+ * __unaccount_locked_vm - unaccount locked pages from an mm's locked_vm
+ * @mm:          mm to unaccount against
+ * @pages:       number of pages to unaccount
+ *
+ * Assumes @mm is valid and that mmap_lock is held as writer.
+ */
+void __unaccount_locked_vm(struct mm_struct *mm, unsigned long pages)
+{
+	unsigned long locked_vm = mm->locked_vm;
+
+	mmap_assert_write_locked(mm);
+	WARN_ON_ONCE(pages > locked_vm);
+	mm->locked_vm = locked_vm - pages;
+}
+EXPORT_SYMBOL_GPL(__unaccount_locked_vm);
+
+/**
  * account_locked_vm - account locked pages to an mm's locked_vm
  * @mm:          mm to account against, may be NULL
  * @pages:       number of pages to account
- * @inc:         %true if @pages should be considered positive, %false if not
  *
  * Assumes a non-NULL @mm is valid (i.e. at least one reference on it).
  *
@@ -591,7 +601,7 @@ EXPORT_SYMBOL_GPL(__account_locked_vm);
  * * 0       on success, or if mm is NULL
  * * -ENOMEM if RLIMIT_MEMLOCK would be exceeded.
  */
-int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc)
+int account_locked_vm(struct mm_struct *mm, unsigned long pages)
 {
 	int ret;
 
@@ -599,14 +609,35 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc)
 		return 0;
 
 	mmap_write_lock(mm);
-	ret = __account_locked_vm(mm, pages, inc, current,
-				  capable(CAP_IPC_LOCK));
+	ret = __account_locked_vm(mm, pages, current, capable(CAP_IPC_LOCK));
 	mmap_write_unlock(mm);
 
 	return ret;
 }
 EXPORT_SYMBOL_GPL(account_locked_vm);
 
+/**
+ * unaccount_locked_vm - unaccount locked pages from an mm's locked_vm
+ * @mm:          mm to unaccount against, may be NULL
+ * @pages:       number of pages to unaccount
+ *
+ * Assumes a non-NULL @mm is valid (i.e. at least one reference on it).
+ *
+ * Unlike account_locked_vm() this cannot fail: the pages are
+ * unconditionally subtracted from the mm's locked_vm, warning if the
+ * count would go negative.
+ */
+void unaccount_locked_vm(struct mm_struct *mm, unsigned long pages)
+{
+	if (pages == 0 || !mm)
+		return;
+
+	mmap_write_lock(mm);
+	__unaccount_locked_vm(mm, pages);
+	mmap_write_unlock(mm);
+}
+EXPORT_SYMBOL_GPL(unaccount_locked_vm);
+
 unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
 	unsigned long flag, unsigned long pgoff)
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH 17/19] mm: Convert mmap and mlock to use account_locked_vm
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (15 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 16/19] mm/util: Refactor account_locked_vm Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24  5:42 ` [RFC PATCH 18/19] mm/mmap: Charge locked memory to pins cgroup Alistair Popple
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple

A future change introduces a cgroup to control the amount of
locked/pinned memory on the system. To ensure memory locked via mlock
and mmap is accounted for, use the common account_locked_vm()
function.

As cgroups can outlive individual processes, also unaccount the
locked memory during process teardown.

This patch should introduce no user-visible change.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/internal.h  |  2 +-
 mm/mlock.c     | 76 ++++++++++-----------------------------------------
 mm/mmap.c      | 76 +++++++++++++++++++++++++--------------------------
 mm/mremap.c    | 54 ++++++++++++++++++++++++++----------
 mm/secretmem.c |  6 +---
 5 files changed, 95 insertions(+), 119 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8..7c8c3f2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -489,8 +489,6 @@ extern long populate_vma_page_range(struct vm_area_struct *vma,
 extern long faultin_vma_page_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end,
 				   bool write, int *locked);
-extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
-			      unsigned long len);
 /*
  * mlock_vma_page() and munlock_vma_page():
  * should be called with vma's mmap_lock held for read or write,
diff --git a/mm/mlock.c b/mm/mlock.c
index 7032f6d..a97c8c5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -416,6 +416,20 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
+	/*
+	 * Keep track of amount of locked VM.
+	 */
+	nr_pages = (end - start) >> PAGE_SHIFT;
+	if (!(newflags & VM_LOCKED)) {
+		__unaccount_locked_vm(mm, nr_pages);
+	} else if (!(oldflags & VM_LOCKED)) {
+		if (__account_locked_vm(mm, nr_pages, current,
+						capable(CAP_IPC_LOCK))) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
 			  vma->vm_file, pgoff, vma_policy(vma),
@@ -439,16 +453,6 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 success:
 	/*
-	 * Keep track of amount of locked VM.
-	 */
-	nr_pages = (end - start) >> PAGE_SHIFT;
-	if (!(newflags & VM_LOCKED))
-		nr_pages = -nr_pages;
-	else if (oldflags & VM_LOCKED)
-		nr_pages = 0;
-	mm->locked_vm += nr_pages;
-
-	/*
 	 * vm_flags is protected by the mmap_lock held in write mode.
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
@@ -517,42 +521,6 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
 }
 
 /*
- * Go through vma areas and sum size of mlocked
- * vma pages, as return value.
- * Note deferred memory locking case(mlock2(,,MLOCK_ONFAULT)
- * is also counted.
- * Return value: previously mlocked page counts
- */
-static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm,
-		unsigned long start, size_t len)
-{
-	struct vm_area_struct *vma;
-	unsigned long count = 0;
-	unsigned long end;
-	VMA_ITERATOR(vmi, mm, start);
-
-	/* Don't overflow past ULONG_MAX */
-	if (unlikely(ULONG_MAX - len < start))
-		end = ULONG_MAX;
-	else
-		end = start + len;
-
-	for_each_vma_range(vmi, vma, end) {
-		if (vma->vm_flags & VM_LOCKED) {
-			if (start > vma->vm_start)
-				count -= (start - vma->vm_start);
-			if (end < vma->vm_end) {
-				count += end - vma->vm_start;
-				break;
-			}
-			count += vma->vm_end - vma->vm_start;
-		}
-	}
-
-	return count >> PAGE_SHIFT;
-}
-
-/*
  * convert get_user_pages() return value to posix mlock() error
  */
 static int __mlock_posix_error_return(long retval)
@@ -585,21 +553,7 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
 	if (mmap_write_lock_killable(current->mm))
 		return -EINTR;
 
-	locked += current->mm->locked_vm;
-	if ((locked > lock_limit) && (!capable(CAP_IPC_LOCK))) {
-		/*
-		 * It is possible that the regions requested intersect with
-		 * previously mlocked areas, that part area in "mm->locked_vm"
-		 * should not be counted to new mlock increment count. So check
-		 * and adjust locked count if necessary.
-		 */
-		locked -= count_mm_mlocked_page_nr(current->mm,
-				start, len);
-	}
-
-	/* check against resource limits */
-	if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
-		error = apply_vma_lock_flags(start, len, flags);
+	error = apply_vma_lock_flags(start, len, flags);
 
 	mmap_write_unlock(current->mm);
 	if (error)
diff --git a/mm/mmap.c b/mm/mmap.c
index 425a934..2c05582 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -160,7 +160,7 @@ static int check_brk_limits(unsigned long addr, unsigned long len)
 	if (IS_ERR_VALUE(mapped_addr))
 		return mapped_addr;
 
-	return mlock_future_check(current->mm, current->mm->def_flags, len);
+	return 0;
 }
 static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 			 unsigned long newbrk, unsigned long oldbrk,
@@ -1184,23 +1184,6 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
 	return hint;
 }
 
-int mlock_future_check(struct mm_struct *mm, unsigned long flags,
-		       unsigned long len)
-{
-	unsigned long locked, lock_limit;
-
-	/*  mlock MCL_FUTURE? */
-	if (flags & VM_LOCKED) {
-		locked = len >> PAGE_SHIFT;
-		locked += mm->locked_vm;
-		lock_limit = rlimit(RLIMIT_MEMLOCK);
-		lock_limit >>= PAGE_SHIFT;
-		if (locked > lock_limit && !capable(CAP_IPC_LOCK))
-			return -EAGAIN;
-	}
-	return 0;
-}
-
 static inline u64 file_mmap_size_max(struct file *file, struct inode *inode)
 {
 	if (S_ISREG(inode->i_mode))
@@ -1310,9 +1293,6 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (!can_do_mlock())
 			return -EPERM;
 
-	if (mlock_future_check(mm, vm_flags, len))
-		return -EAGAIN;
-
 	if (file) {
 		struct inode *inode = file_inode(file);
 		unsigned long flags_mask;
@@ -1882,22 +1862,27 @@ static int acct_stack_growth(struct vm_area_struct *vma,
 	if (size > rlimit(RLIMIT_STACK))
 		return -ENOMEM;
 
-	/* mlock limit tests */
-	if (mlock_future_check(mm, vma->vm_flags, grow << PAGE_SHIFT))
-		return -ENOMEM;
-
 	/* Check to ensure the stack will not grow into a hugetlb-only region */
 	new_start = (vma->vm_flags & VM_GROWSUP) ? vma->vm_start :
 			vma->vm_end - size;
 	if (is_hugepage_only_range(vma->vm_mm, new_start, size))
 		return -EFAULT;
 
+	/* mlock limit tests */
+	if (vma->vm_flags & VM_LOCKED)
+		if (__account_locked_vm(mm, grow, current,
+						capable(CAP_IPC_LOCK)))
+			return -ENOMEM;
+
 	/*
 	 * Overcommit..  This must be the final test, as it will
 	 * update security statistics.
 	 */
-	if (security_vm_enough_memory_mm(mm, grow))
+	if (security_vm_enough_memory_mm(mm, grow)) {
+		if (vma->vm_flags & VM_LOCKED)
+			__unaccount_locked_vm(mm, grow);
 		return -ENOMEM;
+	}
 
 	return 0;
 }
@@ -1975,8 +1960,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				 * to guard against concurrent vma expansions.
 				 */
 				spin_lock(&mm->page_table_lock);
-				if (vma->vm_flags & VM_LOCKED)
-					mm->locked_vm += grow;
 				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_end = address;
@@ -2056,8 +2039,6 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 				 * to guard against concurrent vma expansions.
 				 */
 				spin_lock(&mm->page_table_lock);
-				if (vma->vm_flags & VM_LOCKED)
-					mm->locked_vm += grow;
 				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_start = address;
@@ -2281,7 +2262,7 @@ static inline int munmap_sidetree(struct vm_area_struct *vma,
 		return -ENOMEM;
 
 	if (vma->vm_flags & VM_LOCKED)
-		vma->vm_mm->locked_vm -= vma_pages(vma);
+		__unaccount_locked_vm(vma->vm_mm, vma_pages(vma));
 
 	return 0;
 }
@@ -2525,6 +2506,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct vm_area_struct *next, *prev, *merge;
 	pgoff_t pglen = len >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned long locked = 0;
 	unsigned long end = addr + len;
 	unsigned long merge_start = addr, merge_end = end;
 	pgoff_t vm_pgoff;
@@ -2560,6 +2542,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_flags |= VM_ACCOUNT;
 	}
 
+	if (vm_flags & VM_LOCKED) {
+		locked = len >> PAGE_SHIFT;
+		if (__account_locked_vm(mm, locked, current,
+						capable(CAP_IPC_LOCK))) {
+			error = -ENOMEM;
+			goto unacct_error;
+		}
+	}
+
 	next = mas_next(&mas, ULONG_MAX);
 	prev = mas_prev(&mas, 0);
 	if (vm_flags & VM_SPECIAL)
@@ -2605,7 +2596,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma = vm_area_alloc(mm);
 	if (!vma) {
 		error = -ENOMEM;
-		goto unacct_error;
+		goto unlock_error;
 	}
 
 	vma->vm_start = addr;
@@ -2725,8 +2716,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 					is_vm_hugetlb_page(vma) ||
 					vma == get_gate_vma(current->mm))
 			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
-		else
-			mm->locked_vm += (len >> PAGE_SHIFT);
 	}
 
 	if (file)
@@ -2759,6 +2748,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		mapping_unmap_writable(file->f_mapping);
 free_vma:
 	vm_area_free(vma);
+unlock_error:
+	if (locked)
+		__unaccount_locked_vm(mm, locked);
 unacct_error:
 	if (charged)
 		vm_unacct_memory(charged);
@@ -2942,8 +2934,13 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 	if (mm->map_count > sysctl_max_map_count)
 		return -ENOMEM;
 
+	if (flags & VM_LOCKED)
+		if (__account_locked_vm(mm, len >> PAGE_SHIFT, current,
+					capable(CAP_IPC_LOCK)))
+			return -ENOMEM;
+
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
-		return -ENOMEM;
+		goto unacct_locked;
 
 	/*
 	 * Expand the existing vma if possible; Note that singular lists do not
@@ -2993,8 +2990,6 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
 	mm->data_vm += len >> PAGE_SHIFT;
-	if (flags & VM_LOCKED)
-		mm->locked_vm += (len >> PAGE_SHIFT);
 	vma->vm_flags |= VM_SOFTDIRTY;
 	validate_mm(mm);
 	return 0;
@@ -3003,6 +2998,9 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 	vm_area_free(vma);
 unacct_fail:
 	vm_unacct_memory(len >> PAGE_SHIFT);
+unacct_locked:
+	if (flags & VM_LOCKED)
+		__unaccount_locked_vm(mm, len >> PAGE_SHIFT);
 	return -ENOMEM;
 }
 
@@ -3064,7 +3061,7 @@ void exit_mmap(struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	unsigned long nr_accounted = 0;
+	unsigned long nr_accounted = 0, nr_locked = 0;
 	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	int count = 0;
 
@@ -3107,6 +3104,8 @@ void exit_mmap(struct mm_struct *mm)
 	do {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
+		if (vma->vm_flags & VM_LOCKED)
+			nr_locked += vma_pages(vma);
 		remove_vma(vma);
 		count++;
 		cond_resched();
@@ -3116,6 +3115,7 @@ void exit_mmap(struct mm_struct *mm)
 
 	trace_exit_mmap(mm);
 	__mt_destroy(&mm->mm_mt);
+	__unaccount_locked_vm(mm, nr_locked);
 	mmap_write_unlock(mm);
 	vm_unacct_memory(nr_accounted);
 }
diff --git a/mm/mremap.c b/mm/mremap.c
index fe587c5..67cc4f3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -574,7 +574,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		bool *locked, unsigned long flags,
 		struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
 {
-	long to_account = new_len - old_len;
+	long to_account = (new_len - old_len) >> PAGE_SHIFT;
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma;
 	unsigned long vm_flags = vma->vm_flags;
@@ -594,7 +594,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		return -ENOMEM;
 
 	if (unlikely(flags & MREMAP_DONTUNMAP))
-		to_account = new_len;
+		to_account = new_len >> PAGE_SHIFT;
 
 	if (vma->vm_ops && vma->vm_ops->may_split) {
 		if (vma->vm_start != old_addr)
@@ -618,16 +618,36 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		return err;
 
 	if (vm_flags & VM_ACCOUNT) {
-		if (security_vm_enough_memory_mm(mm, to_account >> PAGE_SHIFT))
+		if (security_vm_enough_memory_mm(mm, to_account))
 			return -ENOMEM;
 	}
 
+	/*
+	 * MREMAP_DONTUNMAP clears VM_LOCKED on the old vma and
+	 * implies new_len == old_len so no need to account locked
+	 * pages.
+	 */
+	if ((vm_flags & VM_LOCKED) && likely(!(flags & MREMAP_DONTUNMAP))) {
+		if (__account_locked_vm(mm, to_account, current,
+					capable(CAP_IPC_LOCK))) {
+			if (vm_flags & VM_ACCOUNT)
+				vm_unacct_memory(to_account);
+			return -ENOMEM;
+		}
+		*locked = true;
+	}
+
 	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
 	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
 			   &need_rmap_locks);
 	if (!new_vma) {
 		if (vm_flags & VM_ACCOUNT)
-			vm_unacct_memory(to_account >> PAGE_SHIFT);
+			vm_unacct_memory(to_account);
+		if ((vm_flags & VM_LOCKED) &&
+		    likely(!(flags & MREMAP_DONTUNMAP))) {
+			__unaccount_locked_vm(mm, to_account);
+			*locked = false;
+		}
 		return -ENOMEM;
 	}
 
@@ -696,10 +716,11 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 			vma->vm_end == (old_addr + old_len))
 			unlink_anon_vmas(vma);
 
-		/* Because we won't unmap we don't need to touch locked_vm */
 		return new_addr;
 	}
 
+	/* Make sure do_munmap() doesn't unaccount locked pages */
+	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
 	if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
 		/* OOM: unable to split vma, just get accounts right */
 		if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
@@ -707,15 +728,11 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		excess = 0;
 	}
 
-	if (vm_flags & VM_LOCKED) {
-		mm->locked_vm += new_len >> PAGE_SHIFT;
-		*locked = true;
-	}
-
 	mm->hiwater_vm = hiwater_vm;
 
 	/* Restore VM_ACCOUNT if one or two pieces of vma left */
 	if (excess) {
+		vma->vm_flags = vm_flags;
 		vma->vm_flags |= VM_ACCOUNT;
 		if (split)
 			find_vma(mm, vma->vm_end)->vm_flags |= VM_ACCOUNT;
@@ -768,9 +785,6 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
 		return ERR_PTR(-EFAULT);
 
-	if (mlock_future_check(mm, vma->vm_flags, new_len - old_len))
-		return ERR_PTR(-EAGAIN);
-
 	if (!may_expand_vm(mm, vma->vm_flags,
 				(new_len - old_len) >> PAGE_SHIFT))
 		return ERR_PTR(-ENOMEM);
@@ -1026,6 +1040,16 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 				}
 			}
 
+			if (vma->vm_flags & VM_LOCKED) {
+				if (__account_locked_vm(mm, pages, current,
+						capable(CAP_IPC_LOCK))) {
+					if (vma->vm_flags & VM_ACCOUNT)
+						vm_unacct_memory(pages);
+					ret = -ENOMEM;
+					goto out;
+				}
+			}
+
 			/*
 			 * Function vma_merge() is called on the extension we are adding to
 			 * the already existing vma, vma_merge() will merge this extension with
@@ -1038,14 +1062,18 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 					extension_pgoff, vma_policy(vma),
 					vma->vm_userfaultfd_ctx, anon_vma_name(vma));
 			if (!vma) {
+				/*
+				 * TODO: We always unacct memory
+				 * regardless of VM_ACCOUNT flags?
+				 */
 				vm_unacct_memory(pages);
+				__unaccount_locked_vm(mm, pages);
 				ret = -ENOMEM;
 				goto out;
 			}
 
 			vm_stat_account(mm, vma->vm_flags, pages);
 			if (vma->vm_flags & VM_LOCKED) {
-				mm->locked_vm += pages;
 				locked = true;
 				new_addr = addr;
 			}
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 04c3ac9..4515eb4 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -120,13 +120,12 @@ static int secretmem_release(struct inode *inode, struct file *file)
 
 static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	unsigned long len = vma->vm_end - vma->vm_start;
-
 	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
 		return -EINVAL;
 
-	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
-		return -EAGAIN;
+	if (account_locked_vm(vma->vm_mm,
+			      (vma->vm_end - vma->vm_start) >> PAGE_SHIFT))
+		return -ENOMEM;
 
 	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
 	vma->vm_ops = &secretmem_vm_ops;
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH 18/19] mm/mmap: Charge locked memory to pins cgroup
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (16 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 17/19] mm: Convert mmap and mlock to use account_locked_vm Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24  5:42 ` [RFC PATCH 19/19] selftests/vm: Add pins-cgroup selftest for mlock/mmap Alistair Popple
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple

account_locked_vm() is used to account memory to mm->locked_vm. This
adds accounting to the pins cgroup as it behaves similarly and should
be accounted against the same global limit if set.

This means memory must now be unaccounted to the correct cgroup, as the
cgroup typically outlives both the mm and the task. It is assumed that
callers of account_locked_vm() only do accounting against the current
task. Callers that need to do accounting against remote tasks should
use vm_account_pinned() and the associated struct vm_account to hold
references to the cgroup.
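
For example the intended usage is roughly the following (sketch only;
"obj" stands in for whatever driver structure owns the pin):

	/* Accounting against the current task only: */
	if (account_locked_vm(mm, npages))
		return -ENOMEM;

	/* Accounting that may be torn down from another task: */
	vm_account_init_current(&obj->vm_account);
	if (vm_account_pinned(&obj->vm_account, npages))
		return -ENOMEM;
	/* ... later, possibly from a different context: */
	vm_unaccount_pinned(&obj->vm_account, npages);
	vm_account_release(&obj->vm_account);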

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/util.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/mm/util.c b/mm/util.c
index 78b060d..d6159e3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -551,15 +551,24 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
 			struct task_struct *task, bool bypass_rlim)
 {
 	unsigned long locked_vm, limit;
+	struct pins_cgroup *pins_cg = get_pins_cg(task);
 	int ret = 0;
 
 	mmap_assert_write_locked(mm);
 
+	if (pins_cg && pins_try_charge(pins_cg, pages)) {
+		put_pins_cg(pins_cg);
+		return -ENOMEM;
+	}
+
 	locked_vm = mm->locked_vm;
 	if (!bypass_rlim) {
 		limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-		if (locked_vm + pages > limit)
+		if (locked_vm + pages > limit) {
+			if (pins_cg)
+				pins_uncharge(pins_cg, pages);
 			ret = -ENOMEM;
+		}
 	}
 
 	if (!ret)
@@ -569,6 +578,8 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
 		 (void *)_RET_IP_, pages << PAGE_SHIFT, locked_vm << PAGE_SHIFT,
 		task_rlimit(task, RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
 
+	if (pins_cg)
+		put_pins_cg(pins_cg);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(__account_locked_vm);
@@ -584,8 +595,18 @@ void __unaccount_locked_vm(struct mm_struct *mm, unsigned long pages)
 {
 	unsigned long locked_vm = mm->locked_vm;
 
+	/*
+	 * TODO: Convert book3s vio to use pinned vm to ensure
+	 * unaccounting happens to the correct cgroup.
+	 */
+	struct pins_cgroup *pins_cg = get_pins_cg(current);
+
 	mmap_assert_write_locked(mm);
 	WARN_ON_ONCE(pages > locked_vm);
+	if (pins_cg) {
+		pins_uncharge(pins_cg, pages);
+		put_pins_cg(pins_cg);
+	}
 	mm->locked_vm = locked_vm - pages;
 }
 EXPORT_SYMBOL_GPL(__unaccount_locked_vm);
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH 19/19] selftests/vm: Add pins-cgroup selftest for mlock/mmap
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (17 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 18/19] mm/mmap: Charge locked memory to pins cgroup Alistair Popple
@ 2023-01-24  5:42 ` Alistair Popple
  2023-01-24 18:26 ` [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Yosry Ahmed
  2023-01-24 20:12 ` Jason Gunthorpe
  20 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-24  5:42 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, Alistair Popple, Shuah Khan, linux-kselftest

Add some basic tests of mlock/mmap cgroup accounting for pinned
memory.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kselftest@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 MAINTAINERS                              |   1 +-
 tools/testing/selftests/vm/Makefile      |   1 +-
 tools/testing/selftests/vm/pins-cgroup.c | 271 ++++++++++++++++++++++++-
 3 files changed, 273 insertions(+)
 create mode 100644 tools/testing/selftests/vm/pins-cgroup.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f8526e2..4c4eed9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5387,6 +5387,7 @@ L:	cgroups@vger.kernel.org
 L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/pins_cgroup.c
+F:	tools/testing/selftests/vm/pins-cgroup.c
 
 CORETEMP HARDWARE MONITORING DRIVER
 M:	Fenghua Yu <fenghua.yu@intel.com>
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 89c14e4..0653720 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -56,6 +56,7 @@ TEST_GEN_PROGS += soft-dirty
 TEST_GEN_PROGS += split_huge_page_test
 TEST_GEN_FILES += ksm_tests
 TEST_GEN_PROGS += ksm_functional_tests
+TEST_GEN_PROGS += pins-cgroup
 
 ifeq ($(MACHINE),x86_64)
 CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_32bit_program.c -m32)
diff --git a/tools/testing/selftests/vm/pins-cgroup.c b/tools/testing/selftests/vm/pins-cgroup.c
new file mode 100644
index 0000000..c2eabc2
--- /dev/null
+++ b/tools/testing/selftests/vm/pins-cgroup.c
@@ -0,0 +1,271 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include "../kselftest_harness.h"
+#include <fcntl.h>
+#include <assert.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#define CGROUP_TEMP "/sys/fs/cgroup/pins_XXXXXX"
+#define PINS_MAX (-1UL)
+
+FIXTURE(pins_cg)
+{
+	char *cg_path;
+	long page_size;
+};
+
+static char *cgroup_new(void)
+{
+	char *cg;
+
+	cg = malloc(sizeof(CGROUP_TEMP));
+	strcpy(cg, CGROUP_TEMP);
+	if (!mkdtemp(cg)) {
+		perror("Failed to create cgroup");
+		return NULL;
+	}
+
+	return cg;
+}
+
+static int cgroup_add_proc(char *cg, pid_t pid)
+{
+	char *cg_proc;
+	FILE *f;
+	int ret = 0;
+
+	if (asprintf(&cg_proc, "%s/cgroup.procs", cg) < 0)
+		return -1;
+
+	f = fopen(cg_proc, "w");
+	free(cg_proc);
+	if (!f)
+		return -1;
+
+	if (fprintf(f, "%ld\n", (long) pid) < 0)
+		ret = -1;
+
+	fclose(f);
+	return ret;
+}
+
+static int cgroup_set_limit(char *cg, unsigned long limit)
+{
+	char *cg_pins_max;
+	FILE *f;
+	int ret = 0;
+
+	if (asprintf(&cg_pins_max, "%s/pins.max", cg) < 0)
+		return -1;
+
+	f = fopen(cg_pins_max, "w");
+	free(cg_pins_max);
+	if (!f)
+		return -1;
+
+	if (limit != PINS_MAX) {
+		if (fprintf(f, "%lu\n", limit) < 0)
+			ret = -1;
+	} else {
+		if (fprintf(f, "max\n") < 0)
+			ret = -1;
+	}
+
+	fclose(f);
+	return ret;
+}
+
+FIXTURE_SETUP(pins_cg)
+{
+	char *cg_subtree_control;
+	FILE *f;
+
+	if (asprintf(&cg_subtree_control,
+			"/sys/fs/cgroup/cgroup.subtree_control") < 0)
+		return;
+
+	f = fopen(cg_subtree_control, "w");
+	free(cg_subtree_control);
+	if (!f)
+		return;
+
+	fprintf(f, "+pins\n");
+	fclose(f);
+
+	self->cg_path = cgroup_new();
+	self->page_size = sysconf(_SC_PAGE_SIZE);
+}
+
+FIXTURE_TEARDOWN(pins_cg)
+{
+	cgroup_add_proc("/sys/fs/cgroup", getpid());
+
+	rmdir(self->cg_path);
+	free(self->cg_path);
+}
+
+static long cgroup_pins(char *cg)
+{
+	long pin_count;
+	char *cg_pins_current;
+	FILE *f;
+	int ret;
+
+	if (asprintf(&cg_pins_current, "%s/pins.current", cg) < 0)
+		return -1;
+
+	f = fopen(cg_pins_current, "r");
+	if (!f) {
+		printf("Can't open %s\n", cg_pins_current);
+		free(cg_pins_current);
+		return -2;
+	}
+
+	free(cg_pins_current);
+
+	if (fscanf(f, "%ld", &pin_count) != 1)
+		ret = -3;
+	else
+		ret = pin_count;
+
+	fclose(f);
+	return ret;
+}
+
+static int set_rlim_memlock(unsigned long size)
+{
+	struct rlimit rlim_memlock = {
+		.rlim_cur = size,
+		.rlim_max = size,
+	};
+	cap_t cap;
+	cap_value_t capability[1] = { CAP_IPC_LOCK };
+
+	/*
+	 * Many of the rlimit checks are skipped if a process has
+	 * CAP_IPC_LOCK. As this test should be run as root we need to
+	 * explicitly drop it.
+	 */
+	cap = cap_get_proc();
+	if (!cap)
+		return -1;
+	if (cap_set_flag(cap, CAP_EFFECTIVE, 1, capability, CAP_CLEAR))
+		return -1;
+	if (cap_set_proc(cap))
+		return -1;
+	return setrlimit(RLIMIT_MEMLOCK, &rlim_memlock);
+}
+
+TEST_F(pins_cg, basic)
+{
+	pid_t child_pid;
+	long page_size = self->page_size;
+	char *p;
+
+	ASSERT_EQ(cgroup_add_proc(self->cg_path, getpid()), 0);
+	p = mmap(NULL, 32*page_size, PROT_READ | PROT_WRITE,
+		MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(p, MAP_FAILED);
+
+	ASSERT_EQ(cgroup_pins(self->cg_path), 0);
+	memset(p, 0, 16*page_size);
+	ASSERT_EQ(mlock(p, page_size), 0);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 1);
+	ASSERT_EQ(mlock(p + page_size, page_size), 0);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 2);
+	ASSERT_EQ(mlock(p, page_size), 0);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 2);
+	ASSERT_EQ(mlock(p, 4*page_size), 0);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 4);
+	ASSERT_EQ(munlock(p + 2*page_size, 2*page_size), 0);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 2);
+	ASSERT_EQ(cgroup_set_limit(self->cg_path, 8), 0);
+	ASSERT_EQ(mlock(p, 16*page_size), -1);
+	ASSERT_EQ(errno, ENOMEM);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 2);
+	ASSERT_EQ(cgroup_set_limit(self->cg_path, PINS_MAX), 0);
+
+	/* check mremap() a locked region correctly accounts locked pages */
+	ASSERT_EQ(mlock(p, 32*page_size), 0);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 32);
+	p = mremap(p, 32*page_size, 64*page_size, MREMAP_MAYMOVE);
+	ASSERT_NE(p, MAP_FAILED);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 64);
+	ASSERT_EQ(munmap(p + 32*page_size, 32*page_size), 0);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 32);
+	p = mremap(p, 32*page_size, 32*page_size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
+	ASSERT_NE(p, MAP_FAILED);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 32);
+	ASSERT_EQ(munlock(p, 32*page_size), 0);
+
+	/* mremap() a locked region should fail if limit exceeded */
+	ASSERT_EQ(set_rlim_memlock(32*page_size), 0);
+	ASSERT_EQ(mlock(p, 32*page_size), 0);
+	ASSERT_EQ(mremap(p, 32*page_size, 64*page_size, 0), MAP_FAILED);
+	ASSERT_EQ(munlock(p, 32*page_size), 0);
+
+	/* Exceeds rlimit, expected to fail */
+	ASSERT_EQ(set_rlim_memlock(16*page_size), 0);
+	ASSERT_EQ(mlock(p, 32*page_size), -1);
+	ASSERT_EQ(errno, ENOMEM);
+
+	/* mlocked memory isn't inherited over fork() so shouldn't add pins */
+	ASSERT_EQ(mlock(p, 16*page_size), 0);
+	child_pid = fork();
+	if (!child_pid) {
+		ASSERT_EQ(cgroup_pins(self->cg_path), 16);
+		ASSERT_EQ(mlock(p, 16*page_size), 0);
+		ASSERT_EQ(cgroup_pins(self->cg_path), 32);
+		return;
+
+	}
+	waitpid(child_pid, NULL, 0);
+
+	/* check that child exit uncharged the pins */
+	ASSERT_EQ(cgroup_pins(self->cg_path), 16);
+}
+
+TEST_F(pins_cg, mmap)
+{
+	char *p;
+
+	ASSERT_EQ(cgroup_add_proc(self->cg_path, getpid()), 0);
+	p = mmap(NULL, 4*self->page_size, PROT_READ | PROT_WRITE,
+		MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED, -1, 0);
+	ASSERT_NE(p, MAP_FAILED);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 4);
+}
+
+/*
+ * Test moving to a different cgroup.
+ */
+TEST_F(pins_cg, move_cg)
+{
+	char *p, *new_cg;
+
+	ASSERT_EQ(cgroup_add_proc(self->cg_path, getpid()), 0);
+	p = mmap(NULL, 16*self->page_size, PROT_READ | PROT_WRITE,
+		MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(p, MAP_FAILED);
+	memset(p, 0, 16*self->page_size);
+	ASSERT_EQ(mlock(p, 16*self->page_size), 0);
+	ASSERT_EQ(cgroup_pins(self->cg_path), 16);
+	ASSERT_NE(new_cg = cgroup_new(), NULL);
+	ASSERT_EQ(cgroup_add_proc(new_cg, getpid()), 0);
+	ASSERT_EQ(cgroup_pins(new_cg), 16);
+	ASSERT_EQ(cgroup_add_proc(self->cg_path, getpid()), 0);
+	rmdir(new_cg);
+}
+TEST_HARNESS_MAIN
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account
  2023-01-24  5:42 ` [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account Alistair Popple
@ 2023-01-24  5:55   ` Michael S. Tsirkin
  2023-01-30 10:43     ` Alistair Popple
  2023-01-24 14:34   ` Jason Gunthorpe
  1 sibling, 1 reply; 56+ messages in thread
From: Michael S. Tsirkin @ 2023-01-24  5:55 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel, Jason Wang, kvm, virtualization,
	netdev

On Tue, Jan 24, 2023 at 04:42:31PM +1100, Alistair Popple wrote:
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ec32f78..a31dd53 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c

...

> @@ -780,6 +780,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>  	u32 asid = iotlb_to_asid(iotlb);
>  	int r = 0;
>  
> +	if (!vdpa->use_va)
> +		if (vm_account_pinned(&dev->vm_account, PFN_DOWN(size)))
> +			return -ENOMEM;
> +
>  	r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
>  				      pa, perm, opaque);
>  	if (r)

I suspect some error handling will have to be reworked then, no?

> -- 
> git-series 0.9.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 01/19] mm: Introduce vm_account
  2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
@ 2023-01-24  6:29   ` Christoph Hellwig
  2023-01-24 14:32   ` Jason Gunthorpe
  2023-01-31 14:00   ` David Hildenbrand
  2 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2023-01-24  6:29 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel, linuxppc-dev, linux-fpga,
	linux-rdma, virtualization, kvm, netdev, io-uring, bpf,
	rds-devel, linux-kselftest

> +/**
> + * vm_account_init - Initialise a new struct vm_account.
> + * @vm_account: pointer to uninitialised vm_account.
> + * @task: task to charge against.
> + * @user: user to charge against. Must be non-NULL for VM_ACCOUNT_USER.
> + * @flags: flags to use when charging to vm_account.
> + *
> + * Initialise a new uninitialiused struct vm_account. Takes references
> + * on the task/mm/user/cgroup as required although callers must ensure
> + * any references passed in remain valid for the duration of this
> + * call.
> + */
> +void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
> +		struct user_struct *user, enum vm_account_flags flags);


kerneldoc comments are supposed to be next to the implementation, and
not the declaration in the header.
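
I.e. keep only the prototype in the header and put the comment above
the definition in mm/util.c, something like:

	/**
	 * vm_account_init - Initialise a new struct vm_account.
	 * ...
	 */
	void vm_account_init(struct vm_account *vm_account,
			struct task_struct *task, struct user_struct *user,
			enum vm_account_flags flags)
	{
		...
	}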


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 01/19] mm: Introduce vm_account
  2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
  2023-01-24  6:29   ` Christoph Hellwig
@ 2023-01-24 14:32   ` Jason Gunthorpe
  2023-01-30 11:36     ` Alistair Popple
  2023-01-31 14:00   ` David Hildenbrand
  2 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:32 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, linuxppc-dev, linux-fpga, linux-rdma,
	virtualization, kvm, netdev, io-uring, bpf, rds-devel,
	linux-kselftest

On Tue, Jan 24, 2023 at 04:42:30PM +1100, Alistair Popple wrote:
> +/**
> + * enum vm_account_flags - Determine how pinned/locked memory is accounted.
> + * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
> + * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
> + * @VM_ACCOUNT_USER: Accounnt locked memory to user->locked_vm.
> + *
> + * Determines which statistic pinned/locked memory is accounted
> + * against. All limits will be enforced against RLIMIT_MEMLOCK and the
> + * pins cgroup if CONFIG_CGROUP_PINS is enabled.
> + *
> + * New drivers should use VM_ACCOUNT_TASK. VM_ACCOUNT_USER is used by
> + * pre-existing drivers to maintain existing accounting against
> + * user->locked_mm rather than mm->pinned_mm.

I thought the guidance was the opposite of this, it is the newer
places in the kernel that are using VM_ACCOUNT_USER?

I haven't got to the rest of the patches yet, but isn't there also a
mm->pinned_vm vs mm->locked_vm variation in the current drivers as
well?

> +void vm_account_init_current(struct vm_account *vm_account)
> +{
> +	vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
> +}
> +EXPORT_SYMBOL_GPL(vm_account_init_current);

This can probably just be a static inline
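
i.e. something like:

	static inline void vm_account_init_current(struct vm_account *vm_account)
	{
		vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
	}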

You might consider putting all this in some new vm_account.h - given
how rarely it is used? Compile times and all

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account
  2023-01-24  5:42 ` [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account Alistair Popple
  2023-01-24  5:55   ` Michael S. Tsirkin
@ 2023-01-24 14:34   ` Jason Gunthorpe
  1 sibling, 0 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:34 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Michael S. Tsirkin, Jason Wang, kvm,
	virtualization, netdev

On Tue, Jan 24, 2023 at 04:42:31PM +1100, Alistair Popple wrote:

> @@ -799,9 +803,6 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>  		return r;
>  	}
>  
> -	if (!vdpa->use_va)
> -		atomic64_add(PFN_DOWN(size), &dev->mm->pinned_vm);

Mention in the commit message this fixes a "bug" where vhost didn't
respect the limits

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 03/19] drivers/vdpa: Convert vdpa to use the new vm_structure
  2023-01-24  5:42 ` [RFC PATCH 03/19] drivers/vdpa: Convert vdpa to use the new vm_structure Alistair Popple
@ 2023-01-24 14:35   ` Jason Gunthorpe
  0 siblings, 0 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:35 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Michael S. Tsirkin, Jason Wang,
	virtualization

On Tue, Jan 24, 2023 at 04:42:32PM +1100, Alistair Popple wrote:
> @@ -990,8 +989,8 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
>  
>  	mmap_read_lock(current->mm);
>  
> -	lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> -	if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
> +	vm_account_init_current(&umem->vm_account);
> +	if (vm_account_pinned(&umem->vm_account, npages))
>  		goto out;
>  
>  	pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
> @@ -1006,22 +1005,21 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
>  	if (ret)
>  		goto out;
>  
> -	atomic64_add(npages, &current->mm->pinned_vm);

Mention in the commit message that this fixes a bug where vdpa would
race the update of mm->pinned_vm and might go past the limit.

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
  2023-01-24  5:42 ` [RFC PATCH 05/19] RMDA/siw: " Alistair Popple
@ 2023-01-24 14:37   ` Jason Gunthorpe
  2023-01-24 15:22     ` Bernard Metzler
  2023-01-24 15:56     ` Bernard Metzler
  0 siblings, 2 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:37 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Bernard Metzler, Leon Romanovsky,
	linux-rdma

On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:

> @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
>  	if (!umem)
>  		return ERR_PTR(-ENOMEM);
>  
> -	mm_s = current->mm;
> -	umem->owning_mm = mm_s;
>  	umem->writable = writable;
>  
> -	mmgrab(mm_s);
> +	vm_account_init_current(&umem->vm_account);
>  
>  	if (writable)
>  		foll_flags |= FOLL_WRITE;
>  
> -	mmap_read_lock(mm_s);
> +	mmap_read_lock(current->mm);
>  
> -	mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -
> -	if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
> +	if (vm_account_pinned(&umem->vm_account, num_pages)) {
>  		rv = -ENOMEM;
>  		goto out_sem_up;
>  	}
> @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
>  				goto out_sem_up;
>  
>  			umem->num_pages += rv;
> -			atomic64_add(rv, &mm_s->pinned_vm);

Also fixes the race bug

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 06/19] RDMA/usnic: convert to use vm_account
  2023-01-24  5:42 ` [RFC PATCH 06/19] RDMA/usnic: convert " Alistair Popple
@ 2023-01-24 14:41   ` Jason Gunthorpe
  2023-01-30 11:10     ` Alistair Popple
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:41 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Christian Benvenuti, Nelson Escobar,
	Leon Romanovsky, linux-rdma

On Tue, Jan 24, 2023 at 04:42:35PM +1100, Alistair Popple wrote:
> diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
> index c301b3b..250276e 100644
> --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
> +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
> @@ -89,8 +89,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>  	struct page **page_list;
>  	struct scatterlist *sg;
>  	struct usnic_uiom_chunk *chunk;
> -	unsigned long locked;
> -	unsigned long lock_limit;
>  	unsigned long cur_base;
>  	unsigned long npages;
>  	int ret;
> @@ -123,10 +121,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>  	uiomr->owning_mm = mm = current->mm;
>  	mmap_read_lock(mm);
>  
> -	locked = atomic64_add_return(npages, &current->mm->pinned_vm);
> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -
> -	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> +	vm_account_init_current(&uiomr->vm_account);
> +	if (vm_account_pinned(&uiomr->vm_account, npages)) {
>  		ret = -ENOMEM;
>  		goto out;
>  	}

Is this error handling right? This driver tried to avoid the race by
using atomic64_add_return() but it means that the out label undoes the add:


> @@ -178,7 +174,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>  out:
>  	if (ret < 0) {
>  		usnic_uiom_put_pages(chunk_list, 0);
> -		atomic64_sub(npages, &current->mm->pinned_vm);

Here

> +		vm_unaccount_pinned(&uiomr->vm_account, npages);
> +		vm_account_release(&uiomr->vm_account);

But with the new API we shouldn't call vm_unaccount_pinned() if
vm_account_pinned() doesn't succeed?

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 09/19] io_uring: convert to use vm_account
  2023-01-24  5:42 ` [RFC PATCH 09/19] io_uring: convert to use vm_account Alistair Popple
@ 2023-01-24 14:44   ` Jason Gunthorpe
  2023-01-30 11:12     ` Alistair Popple
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:44 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Jens Axboe, Pavel Begunkov, io-uring

On Tue, Jan 24, 2023 at 04:42:38PM +1100, Alistair Popple wrote:
> Convert io_uring to use vm_account instead of directly charging pages
> against the user/mm. Rather than charge pages to both user->locked_vm
> and mm->pinned_vm this will only charge pages to user->locked_vm.

I think this is a mistake in the first patch, the pinned_vm should
still increment (but not checked against the rlimit), though its main
purpose in this mode is for debugging in proc.
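
Rough sketch of what I mean, assuming the vm_account keeps the mm
reference around:

	/* always update pinned_vm so it stays visible in proc: */
	atomic64_add(npages, &vm_account->mm->pinned_vm);

	/* ...but in VM_ACCOUNT_USER mode enforce the rlimit against
	 * user->locked_vm only */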

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
  2023-01-24  5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
@ 2023-01-24 14:51   ` Jason Gunthorpe
  2023-01-30 11:17     ` Alistair Popple
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel

On Tue, Jan 24, 2023 at 04:42:39PM +1100, Alistair Popple wrote:
> diff --git a/include/net/sock.h b/include/net/sock.h
> index dcd72e6..bc3a868 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -334,6 +334,7 @@ struct sk_filter;
>    *	@sk_security: used by security modules
>    *	@sk_mark: generic packet mark
>    *	@sk_cgrp_data: cgroup data for this cgroup
> +  *	@sk_vm_account: data for pinned memory accounting
>    *	@sk_memcg: this socket's memory cgroup association
>    *	@sk_write_pending: a write to stream socket waits to start
>    *	@sk_state_change: callback to indicate change in the state of the sock
> @@ -523,6 +524,7 @@ struct sock {
>  	void			*sk_security;
>  #endif
>  	struct sock_cgroup_data	sk_cgrp_data;
> +	struct vm_account       sk_vm_account;
>  	struct mem_cgroup	*sk_memcg;
>  	void			(*sk_state_change)(struct sock *sk);
>  	void			(*sk_data_ready)(struct sock *sk);

I'm not sure this makes sense in a sock - each sock can be shared with
different processes.

> diff --git a/net/rds/message.c b/net/rds/message.c
> index b47e4f0..2138a70 100644
> --- a/net/rds/message.c
> +++ b/net/rds/message.c
> @@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>  	struct list_head *head;
>  	unsigned long flags;
>  
> -	mm_unaccount_pinned_pages(&znotif->z_mmp);
> +	mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
>  	q = &rs->rs_zcookie_queue;
>  	spin_lock_irqsave(&q->lock, flags);
>  	head = &q->zcookie_head;
> @@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>  	int ret = 0;
>  	int length = iov_iter_count(from);
>  	struct rds_msg_zcopy_info *info;
> +	struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
>  
>  	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>  
> @@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>  		return -ENOMEM;
>  	INIT_LIST_HEAD(&info->rs_zcookie_next);
>  	rm->data.op_mmp_znotifier = &info->znotif;
> -	if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
> +	vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
> +	if (mm_account_pinned_pages(vm_account,
> +				    &rm->data.op_mmp_znotifier->z_mmp,
>  				    length)) {
>  		ret = -ENOMEM;
>  		goto err;
> @@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>  			for (i = 0; i < rm->data.op_nents; i++)
>  				put_page(sg_page(&rm->data.op_sg[i]));
>  			mmp = &rm->data.op_mmp_znotifier->z_mmp;
> -			mm_unaccount_pinned_pages(mmp);
> +			mm_unaccount_pinned_pages(vm_account, mmp);
>  			ret = -EFAULT;
>  			goto err;
>  		}

I wonder if RDS should just not be doing accounting? Usually things
related to iov_iter are short term and we don't account for them.

But then I don't really know how RDS works, Santosh?

Regardless, maybe the vm_account should be stored in the
rds_msg_zcopy_info ?
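
i.e. something like (sketch):

	struct rds_msg_zcopy_info {
		...
		struct rds_znotifier	znotif;
		struct vm_account	vm_account;	/* lives as long as the pin */
	};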

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
  2023-01-24 14:37   ` Jason Gunthorpe
@ 2023-01-24 15:22     ` Bernard Metzler
  2023-01-24 15:56     ` Bernard Metzler
  1 sibling, 0 replies; 56+ messages in thread
From: Bernard Metzler @ 2023-01-24 15:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Leon Romanovsky, linux-rdma



> On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:
> 
> > @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> >  	if (!umem)
> >  		return ERR_PTR(-ENOMEM);
> >
> > -	mm_s = current->mm;
> > -	umem->owning_mm = mm_s;
> >  	umem->writable = writable;
> >
> > -	mmgrab(mm_s);
> > +	vm_account_init_current(&umem->vm_account);
> >
> >  	if (writable)
> >  		foll_flags |= FOLL_WRITE;
> >
> > -	mmap_read_lock(mm_s);
> > +	mmap_read_lock(current->mm);
> >
> > -	mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > -
> > -	if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
> > +	if (vm_account_pinned(&umem->vm_account, num_pages)) {
> >  		rv = -ENOMEM;
> >  		goto out_sem_up;
> >  	}
> > @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> >  				goto out_sem_up;
> >
> >  			umem->num_pages += rv;
> > -			atomic64_add(rv, &mm_s->pinned_vm);
> 
> Also fixes the race bug
> 
True! Should have used atomic64_add_return() in the first place...

Bernard.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
  2023-01-24 14:37   ` Jason Gunthorpe
  2023-01-24 15:22     ` Bernard Metzler
@ 2023-01-24 15:56     ` Bernard Metzler
  2023-01-30 11:34       ` Alistair Popple
  1 sibling, 1 reply; 56+ messages in thread
From: Bernard Metzler @ 2023-01-24 15:56 UTC (permalink / raw)
  To: Jason Gunthorpe, Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Leon Romanovsky, linux-rdma



> On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:
> 
> > @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> >  	if (!umem)
> >  		return ERR_PTR(-ENOMEM);
> >
> > -	mm_s = current->mm;
> > -	umem->owning_mm = mm_s;
> >  	umem->writable = writable;
> >
> > -	mmgrab(mm_s);
> > +	vm_account_init_current(&umem->vm_account);
> >
> >  	if (writable)
> >  		foll_flags |= FOLL_WRITE;
> >
> > -	mmap_read_lock(mm_s);
> > +	mmap_read_lock(current->mm);
> >
> > -	mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > -
> > -	if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
> > +	if (vm_account_pinned(&umem->vm_account, num_pages)) {
> >  		rv = -ENOMEM;
> >  		goto out_sem_up;
> >  	}
> > @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> >  				goto out_sem_up;
> >
> >  			umem->num_pages += rv;
> > -			atomic64_add(rv, &mm_s->pinned_vm);
> 
> Also fixes the race bug

But introduces another one. In that loop, umem->num_pages keeps the
number of pages currently pinned, not the target number. The current
patch uses that umem->num_pages to call vm_unaccount_pinned() in
siw_umem_release(). Bailing out before all pages are pinned would
mess up that accounting during release. Maybe introduce another
parameter to siw_umem_release(), or better have another umem member
'umem->num_pages_accounted' for correct accounting during release.
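
Something like this (sketch, 'num_pages_accounted' being the new
member):

	if (vm_account_pinned(&umem->vm_account, num_pages)) {
		rv = -ENOMEM;
		goto out_sem_up;
	}
	umem->num_pages_accounted = num_pages;

	/* ...and in siw_umem_release(): */
	vm_unaccount_pinned(&umem->vm_account, umem->num_pages_accounted);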

Bernard.
> 
> Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (18 preceding siblings ...)
  2023-01-24  5:42 ` [RFC PATCH 19/19] selftests/vm: Add pins-cgroup selftest for mlock/mmap Alistair Popple
@ 2023-01-24 18:26 ` Yosry Ahmed
  2023-01-31  0:54   ` Alistair Popple
  2023-01-24 20:12 ` Jason Gunthorpe
  20 siblings, 1 reply; 56+ messages in thread
From: Yosry Ahmed @ 2023-01-24 18:26 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel

On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> Having large amounts of unmovable or unreclaimable memory in a system
> can lead to system instability due to increasing the likelihood of
> encountering out-of-memory conditions. Therefore it is desirable to
> limit the amount of memory users can lock or pin.
>
> From userspace such limits can be enforced by setting
> RLIMIT_MEMLOCK. However there is no standard method that drivers and
> other in-kernel users can use to check and enforce this limit.
>
> This has lead to a large number of inconsistencies in how limits are
> enforced. For example some drivers will use mm->locked_mm while others
> will use mm->pinned_mm or user->locked_mm. It is therefore possible to
> have up to three times RLIMIT_MEMLOCKED pinned.
>
> Having pinned memory limited per-task also makes it easy for users to
> exceed the limit. For example drivers that pin memory with
> pin_user_pages() it tends to remain pinned after fork. To deal with
> this and other issues this series introduces a cgroup for tracking and
> limiting the number of pages pinned or locked by tasks in the group.
>
> However the existing behaviour with regards to the rlimit needs to be
> maintained. Therefore the lesser of the two limits is
> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit,
> but this bypass is not allowed for the cgroup.
>
> The first part of this series converts existing drivers which
> open-code the use of locked_mm/pinned_mm over to a common interface
> which manages the refcounts of the associated task/mm/user
> structs. This ensures accounting of pages is consistent and makes it
> easier to add charging of the cgroup.
>
> The second part of the series adds the cgroup and converts core mm
> code such as mlock over to charging the cgroup before finally
> introducing some selftests.


I didn't go through the entire series, so apologies if this was
mentioned somewhere, but do you mind elaborating on why this is added
as a separate cgroup controller rather than an extension of the memory
cgroup controller?

>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
                   ` (19 preceding siblings ...)
  2023-01-24 18:26 ` [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Yosry Ahmed
@ 2023-01-24 20:12 ` Jason Gunthorpe
  2023-01-31 13:57   ` David Hildenbrand
  20 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 20:12 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel

On Tue, Jan 24, 2023 at 04:42:29PM +1100, Alistair Popple wrote:
> Having large amounts of unmovable or unreclaimable memory in a system
> can lead to system instability due to increasing the likelihood of
> encountering out-of-memory conditions. Therefore it is desirable to
> limit the amount of memory users can lock or pin.
> 
> [...]

I'm excited by this series, thanks for making it.

The pin accounting has been a long standing problem and cgroups will
really help!

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 14/19] mm: Introduce a cgroup for pinned memory
  2023-01-24  5:42 ` [RFC PATCH 14/19] mm: Introduce a cgroup for pinned memory Alistair Popple
@ 2023-01-27 21:44   ` Tejun Heo
  2023-01-30 13:20     ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2023-01-27 21:44 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel, Zefan Li, Andrew Morton

On Tue, Jan 24, 2023 at 04:42:43PM +1100, Alistair Popple wrote:
> If too much memory in a system is pinned or locked it can lead to
> problems such as performance degradation or in the worst case
> out-of-memory errors as such memory cannot be moved or paged out.
> 
> In order to prevent users without CAP_IPC_LOCK from causing these
> issues the amount of memory that can be pinned is typically limited by
> RLIMIT_MEMLOCK. However this is inflexible as limits can't be shared
> between tasks and the enforcement of these limits is inconsistent
> between in-kernel users of pinned memory such as mlock() and device
> drivers which may also pin pages with pin_user_pages().
> 
> To allow for a single limit to be set introduce a cgroup controller
> which can be used to limit the number of pages being pinned by all
> tasks in the cgroup.

The use case makes some sense to me but I wonder whether this'd fit a lot
better in memcg rather than being its own controller.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account
  2023-01-24  5:55   ` Michael S. Tsirkin
@ 2023-01-30 10:43     ` Alistair Popple
  0 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-30 10:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel, Jason Wang, kvm, virtualization,
	netdev


"Michael S. Tsirkin" <mst@redhat.com> writes:

> On Tue, Jan 24, 2023 at 04:42:31PM +1100, Alistair Popple wrote:
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index ec32f78..a31dd53 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>
> ...
>
>> @@ -780,6 +780,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>>  	u32 asid = iotlb_to_asid(iotlb);
>>  	int r = 0;
>>  
>> +	if (!vdpa->use_va)
>> +		if (vm_account_pinned(&dev->vm_account, PFN_DOWN(size)))
>> +			return -ENOMEM;
>> +
>>  	r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
>>  				      pa, perm, opaque);
>>  	if (r)
>
> I suspect some error handling will have to be reworked then, no?

Thanks. I had meant to go back and double check some of these driver
conversions. Will add something like below:

@@ -787,7 +787,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 	r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
 				      pa, perm, opaque);
 	if (r)
-		return r;
+		goto out_unaccount;
 
 	if (ops->dma_map) {
 		r = ops->dma_map(vdpa, asid, iova, size, pa, perm, opaque);
@@ -798,12 +798,14 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
 		r = iommu_map(v->domain, iova, pa, size,
 			      perm_to_iommu_flags(perm));
 	}
-	if (r) {
+	if (r)
 		vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
-		return r;
-	}
 
-	return 0;
+out_unaccount:
+	if (!vdpa->use_va)
+		vm_unaccount_pinned(&dev->vm_account, PFN_DOWN(size));
+
+	return r;
 }
 
 static void vhost_vdpa_unmap(struct vhost_vdpa *v,

>> -- 
>> git-series 0.9.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 06/19] RDMA/usnic: convert to use vm_account
  2023-01-24 14:41   ` Jason Gunthorpe
@ 2023-01-30 11:10     ` Alistair Popple
  0 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-30 11:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Christian Benvenuti, Nelson Escobar,
	Leon Romanovsky, linux-rdma


Jason Gunthorpe <jgg@nvidia.com> writes:

> On Tue, Jan 24, 2023 at 04:42:35PM +1100, Alistair Popple wrote:
>> diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
>> index c301b3b..250276e 100644
>> --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
>> +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
>> @@ -89,8 +89,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>>  	struct page **page_list;
>>  	struct scatterlist *sg;
>>  	struct usnic_uiom_chunk *chunk;
>> -	unsigned long locked;
>> -	unsigned long lock_limit;
>>  	unsigned long cur_base;
>>  	unsigned long npages;
>>  	int ret;
>> @@ -123,10 +121,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>>  	uiomr->owning_mm = mm = current->mm;
>>  	mmap_read_lock(mm);
>>  
>> -	locked = atomic64_add_return(npages, &current->mm->pinned_vm);
>> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> -
>> -	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
>> +	vm_account_init_current(&uiomr->vm_account);
>> +	if (vm_account_pinned(&uiomr->vm_account, npages)) {
>>  		ret = -ENOMEM;
>>  		goto out;
>>  	}
>
> Is this error handling right? This driver tried to avoid the race by
> using atomic64_add_return() but it means that the out label undoes the add:
>
>
>> @@ -178,7 +174,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>>  out:
>>  	if (ret < 0) {
>>  		usnic_uiom_put_pages(chunk_list, 0);
>> -		atomic64_sub(npages, &current->mm->pinned_vm);
>
> Here
>
>> +		vm_unaccount_pinned(&uiomr->vm_account, npages);
>> +		vm_account_release(&uiomr->vm_account);
>
> But with the new API we shouldn't call vm_unaccount_pinned() if
> vm_account_pinned() doesn't succeed?

Good point. Will add the following fix:

@@ -123,6 +123,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 
 	vm_account_init_current(&uiomr->vm_account);
 	if (vm_account_pinned(&uiomr->vm_account, npages)) {
+		npages = 0;
 		ret = -ENOMEM;
 		goto out;
 	}

>
> Jason


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 09/19] io_uring: convert to use vm_account
  2023-01-24 14:44   ` Jason Gunthorpe
@ 2023-01-30 11:12     ` Alistair Popple
  2023-01-30 13:21       ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2023-01-30 11:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Jens Axboe, Pavel Begunkov, io-uring


Jason Gunthorpe <jgg@nvidia.com> writes:

> On Tue, Jan 24, 2023 at 04:42:38PM +1100, Alistair Popple wrote:
>> Convert io_uring to use vm_account instead of directly charging pages
>> against the user/mm. Rather than charge pages to both user->locked_vm
>> and mm->pinned_vm this will only charge pages to user->locked_vm.
>
> I think this is a mistake in the first patch, the pinned_vm should
> still increment (but not checked against the rlimit), though its main
> purpose in this mode is for debugging in proc.

Sorry, didn't quite follow - are you saying we should always increment
mm->pinned_vm and only use VM_ACCOUNT_USER vs. TASK to select which one
the rlimit is enforced against?

> Jason


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
  2023-01-24 14:51   ` Jason Gunthorpe
@ 2023-01-30 11:17     ` Alistair Popple
  2023-02-06  4:36       ` Alistair Popple
  0 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2023-01-30 11:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel


Jason Gunthorpe <jgg@nvidia.com> writes:

> On Tue, Jan 24, 2023 at 04:42:39PM +1100, Alistair Popple wrote:
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index dcd72e6..bc3a868 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -334,6 +334,7 @@ struct sk_filter;
>>    *	@sk_security: used by security modules
>>    *	@sk_mark: generic packet mark
>>    *	@sk_cgrp_data: cgroup data for this cgroup
>> +  *	@sk_vm_account: data for pinned memory accounting
>>    *	@sk_memcg: this socket's memory cgroup association
>>    *	@sk_write_pending: a write to stream socket waits to start
>>    *	@sk_state_change: callback to indicate change in the state of the sock
>> @@ -523,6 +524,7 @@ struct sock {
>>  	void			*sk_security;
>>  #endif
>>  	struct sock_cgroup_data	sk_cgrp_data;
>> +	struct vm_account       sk_vm_account;
>>  	struct mem_cgroup	*sk_memcg;
>>  	void			(*sk_state_change)(struct sock *sk);
>>  	void			(*sk_data_ready)(struct sock *sk);
>
> I'm not sure this makes sense in a sock - each sock can be shared with
> different proceses..

TBH it didn't feel right to me either so was hoping for some
feedback. Will try your suggestion below.

>> diff --git a/net/rds/message.c b/net/rds/message.c
>> index b47e4f0..2138a70 100644
>> --- a/net/rds/message.c
>> +++ b/net/rds/message.c
>> @@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>>  	struct list_head *head;
>>  	unsigned long flags;
>>  
>> -	mm_unaccount_pinned_pages(&znotif->z_mmp);
>> +	mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
>>  	q = &rs->rs_zcookie_queue;
>>  	spin_lock_irqsave(&q->lock, flags);
>>  	head = &q->zcookie_head;
>> @@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>  	int ret = 0;
>>  	int length = iov_iter_count(from);
>>  	struct rds_msg_zcopy_info *info;
>> +	struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
>>  
>>  	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>>  
>> @@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>  		return -ENOMEM;
>>  	INIT_LIST_HEAD(&info->rs_zcookie_next);
>>  	rm->data.op_mmp_znotifier = &info->znotif;
>> -	if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
>> +	vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
>> +	if (mm_account_pinned_pages(vm_account,
>> +				    &rm->data.op_mmp_znotifier->z_mmp,
>>  				    length)) {
>>  		ret = -ENOMEM;
>>  		goto err;
>> @@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>  			for (i = 0; i < rm->data.op_nents; i++)
>>  				put_page(sg_page(&rm->data.op_sg[i]));
>>  			mmp = &rm->data.op_mmp_znotifier->z_mmp;
>> -			mm_unaccount_pinned_pages(mmp);
>> +			mm_unaccount_pinned_pages(vm_account, mmp);
>>  			ret = -EFAULT;
>>  			goto err;
>>  		}
>
> I wonder if RDS should just not be doing accounting? Usually things
> related to iov_iter are short term and we don't account for them.

Yeah, I couldn't easily figure out why these were accounted for in the
first place either.

> But then I don't really know how RDS works, Santos?
>
> Regardless, maybe the vm_account should be stored in the
> rds_msg_zcopy_info ?

On first glance that looks like a better spot. Thanks for the
idea.

> Jason


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
  2023-01-24 15:56     ` Bernard Metzler
@ 2023-01-30 11:34       ` Alistair Popple
  2023-01-30 13:27         ` Bernard Metzler
  0 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2023-01-30 11:34 UTC (permalink / raw)
  To: Bernard Metzler
  Cc: Jason Gunthorpe, linux-mm, cgroups, linux-kernel, jhubbard,
	tjmercier, hannes, surenb, mkoutny, daniel, Leon Romanovsky,
	linux-rdma


Bernard Metzler <BMT@zurich.ibm.com> writes:

>> On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:
>> 
>> > @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
>> bool writable)
>> >  	if (!umem)
>> >  		return ERR_PTR(-ENOMEM);
>> >
>> > -	mm_s = current->mm;
>> > -	umem->owning_mm = mm_s;
>> >  	umem->writable = writable;
>> >
>> > -	mmgrab(mm_s);
>> > +	vm_account_init_current(&umem->vm_account);
>> >
>> >  	if (writable)
>> >  		foll_flags |= FOLL_WRITE;
>> >
>> > -	mmap_read_lock(mm_s);
>> > +	mmap_read_lock(current->mm);
>> >
>> > -	mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> > -
>> > -	if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
>> > +	if (vm_account_pinned(&umem->vm_account, num_pages)) {
>> >  		rv = -ENOMEM;
>> >  		goto out_sem_up;
>> >  	}
>> > @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
>> bool writable)
>> >  				goto out_sem_up;
>> >
>> >  			umem->num_pages += rv;
>> > -			atomic64_add(rv, &mm_s->pinned_vm);
>> 
>> Also fixes the race bug
>
> But introduces another one. In that loop, umem->num_pages keeps the
> number of pages currently pinned, not the target number. The current
> patch uses that umem->num_pages to call vm_unaccount_pinned() in
> siw_umem_release(). Bailing out before all pages are pinned would
> mess up that accounting during release. Maybe introduce another
> parameter to siw_umem_release(), or better have another umem member
> 'umem->num_pages_accounted' for correct accounting during release.

Yes, I see the problem, thanks for pointing it out. Will fix for the next
version.
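
Probably something along these lines (untested sketch using the
member name you suggest):

	struct siw_umem {
		...
		int num_pages;			/* pages actually pinned */
		int num_pages_accounted;	/* pages charged via vm_account_pinned() */
		...
	};

with siw_umem_get() only setting num_pages_accounted once
vm_account_pinned() has succeeded, and siw_umem_release() passing it
to vm_unaccount_pinned() instead of num_pages.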

> Bernard.
>> 
>> Jason


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 01/19] mm: Introduce vm_account
  2023-01-24 14:32   ` Jason Gunthorpe
@ 2023-01-30 11:36     ` Alistair Popple
  0 siblings, 0 replies; 56+ messages in thread
From: Alistair Popple @ 2023-01-30 11:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, linuxppc-dev, linux-fpga, linux-rdma,
	virtualization, kvm, netdev, io-uring, bpf, rds-devel,
	linux-kselftest


Jason Gunthorpe <jgg@nvidia.com> writes:

> On Tue, Jan 24, 2023 at 04:42:30PM +1100, Alistair Popple wrote:
>> +/**
>> + * enum vm_account_flags - Determine how pinned/locked memory is accounted.
>> + * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
>> + * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
>> + * @VM_ACCOUNT_USER: Account locked memory to user->locked_vm.
>> + *
>> + * Determines which statistic pinned/locked memory is accounted
>> + * against. All limits will be enforced against RLIMIT_MEMLOCK and the
>> + * pins cgroup if CONFIG_CGROUP_PINS is enabled.
>> + *
>> + * New drivers should use VM_ACCOUNT_TASK. VM_ACCOUNT_USER is used by
>> + * pre-existing drivers to maintain existing accounting against
>> + * user->locked_vm rather than mm->pinned_vm.
>
> I thought the guidance was the opposite of this, it is the newer
> places in the kernel that are using VM_ACCOUNT_USER?

I'd just assumed mm->pinned_vm was preferred because that's what most
drivers use. user->locked_vm does seem more sensible though as at least
it's possible to meaningfully enforce some overall limit. Will switch
the flags/comment around to suggest new users use VM_ACCOUNT_USER.
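
The pattern for new drivers would then look something like the
following sketch. It uses the helpers from this patch; the teardown
assumes a matching vm_account_release() style helper for dropping the
references taken at init time:

	struct vm_account acct;

	vm_account_init(&acct, current, current_user(), VM_ACCOUNT_USER);
	if (vm_account_pinned(&acct, npages)) {
		/* over RLIMIT_MEMLOCK and/or the pins cgroup limit */
		vm_account_release(&acct);
		return -ENOMEM;
	}
	...
	/* on teardown */
	vm_unaccount_pinned(&acct, npages);
	vm_account_release(&acct);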

> I haven't got to the rest of the patches yet, but isn't there also a
> mm->pinned_vm vs mm->locked_vm variation in the current drivers as
> well?
>
>> +void vm_account_init_current(struct vm_account *vm_account)
>> +{
>> +	vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
>> +}
>> +EXPORT_SYMBOL_GPL(vm_account_init_current);
>
> This can probably just be a static inline
>
> You might consider putting all this in some new vm_account.h - given
> how rarely it is used? Compile times and all

Works for me.
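
i.e. something like this in a new include/linux/vm_account.h:

	static inline void vm_account_init_current(struct vm_account *vm_account)
	{
		vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
	}

which also drops the EXPORT_SYMBOL_GPL().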

> Jason


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 14/19] mm: Introduce a cgroup for pinned memory
  2023-01-27 21:44   ` Tejun Heo
@ 2023-01-30 13:20     ` Jason Gunthorpe
  0 siblings, 0 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-30 13:20 UTC (permalink / raw)
  To: Tejun Heo, Daniel P. Berrange, Alex Williamson
  Cc: Alistair Popple, linux-mm, cgroups, linux-kernel, jhubbard,
	tjmercier, hannes, surenb, mkoutny, daniel, Zefan Li,
	Andrew Morton, libvir-list, Laine Stump

On Fri, Jan 27, 2023 at 11:44:19AM -1000, Tejun Heo wrote:
> On Tue, Jan 24, 2023 at 04:42:43PM +1100, Alistair Popple wrote:
> > If too much memory in a system is pinned or locked it can lead to
> > problems such as performance degradation or in the worst case
> > out-of-memory errors as such memory cannot be moved or paged out.
> > 
> > In order to prevent users without CAP_IPC_LOCK from causing these
> > issues the amount of memory that can be pinned is typically limited by
> > RLIMIT_MEMLOCK. However this is inflexible as limits can't be shared
> > between tasks and the enforcement of these limits is inconsistent
> > between in-kernel users of pinned memory such as mlock() and device
> > drivers which may also pin pages with pin_user_pages().
> > 
> > To allow for a single limit to be set introduce a cgroup controller
> > which can be used to limit the number of pages being pinned by all
> > tasks in the cgroup.
> 
> The use case makes some sense to me but I wonder whether this'd fit a lot
> better in memcg rather than being its own controller.

As long as the pinned limitation has its own bucket it is probably
fine? The underlying memory allocations should have already been
charged to the memcg - so we don't want to double account.

Alex and Daniel were looking at this from the qemu/libvirt
perspective, perhaps they have some insight into what they would like to
see?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 09/19] io_uring: convert to use vm_account
  2023-01-30 11:12     ` Alistair Popple
@ 2023-01-30 13:21       ` Jason Gunthorpe
  0 siblings, 0 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-30 13:21 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, Jens Axboe, Pavel Begunkov, io-uring

On Mon, Jan 30, 2023 at 10:12:43PM +1100, Alistair Popple wrote:
> 
> Jason Gunthorpe <jgg@nvidia.com> writes:
> 
> > On Tue, Jan 24, 2023 at 04:42:38PM +1100, Alistair Popple wrote:
> >> Convert io_uring to use vm_account instead of directly charging pages
> >> against the user/mm. Rather than charge pages to both user->locked_vm
> >> and mm->pinned_vm this will only charge pages to user->locked_vm.
> >
> > I think this is a mistake in the first patch, the pinned_vm should
> > still increment (but not checked against the rlimit), though its main
> > purpose in this mode is for debugging in proc.
> 
> Sorry, didn't quite follow - are you saying we should always increment
> mm->pinned_vm and only use VM_ACCOUNT_USER vs. TASK to select which one
> the rlimit is enforced against?

yes, pinned_vm was created primarily as a debugging counter for proc
(it is what /proc/<pid>/status reports as VmPin)

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: Re: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
  2023-01-30 11:34       ` Alistair Popple
@ 2023-01-30 13:27         ` Bernard Metzler
  0 siblings, 0 replies; 56+ messages in thread
From: Bernard Metzler @ 2023-01-30 13:27 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Jason Gunthorpe, linux-mm, cgroups, linux-kernel, jhubbard,
	tjmercier, hannes, surenb, mkoutny, daniel, Leon Romanovsky,
	linux-rdma



> -----Original Message-----
> From: Alistair Popple <apopple@nvidia.com>
> Sent: Monday, 30 January 2023 12:35
> To: Bernard Metzler <BMT@zurich.ibm.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; linux-mm@kvack.org;
> cgroups@vger.kernel.org; linux-kernel@vger.kernel.org; jhubbard@nvidia.com;
> tjmercier@google.com; hannes@cmpxchg.org; surenb@google.com;
> mkoutny@suse.com; daniel@ffwll.ch; Leon Romanovsky <leon@kernel.org>;
> linux-rdma@vger.kernel.org
> Subject: [EXTERNAL] Re: [RFC PATCH 05/19] RMDA/siw: Convert to use
> vm_account
> 
> 
> Bernard Metzler <BMT@zurich.ibm.com> writes:
> 
> >> -----Original Message-----
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Tuesday, 24 January 2023 15:37
> >> To: Alistair Popple <apopple@nvidia.com>
> >> Cc: linux-mm@kvack.org; cgroups@vger.kernel.org; linux-
> >> kernel@vger.kernel.org; jhubbard@nvidia.com; tjmercier@google.com;
> >> hannes@cmpxchg.org; surenb@google.com; mkoutny@suse.com;
> daniel@ffwll.ch;
> >> Bernard Metzler <BMT@zurich.ibm.com>; Leon Romanovsky <leon@kernel.org>;
> >> linux-rdma@vger.kernel.org
> >> Subject: [EXTERNAL] Re: [RFC PATCH 05/19] RMDA/siw: Convert to use
> >> vm_account
> >>
> >> On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:
> >>
> >> > @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64
> len,
> >> bool writable)
> >> >  	if (!umem)
> >> >  		return ERR_PTR(-ENOMEM);
> >> >
> >> > -	mm_s = current->mm;
> >> > -	umem->owning_mm = mm_s;
> >> >  	umem->writable = writable;
> >> >
> >> > -	mmgrab(mm_s);
> >> > +	vm_account_init_current(&umem->vm_account);
> >> >
> >> >  	if (writable)
> >> >  		foll_flags |= FOLL_WRITE;
> >> >
> >> > -	mmap_read_lock(mm_s);
> >> > +	mmap_read_lock(current->mm);
> >> >
> >> > -	mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >> > -
> >> > -	if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
> >> > +	if (vm_account_pinned(&umem->vm_account, num_pages)) {
> >> >  		rv = -ENOMEM;
> >> >  		goto out_sem_up;
> >> >  	}
> >> > @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> >> bool writable)
> >> >  				goto out_sem_up;
> >> >
> >> >  			umem->num_pages += rv;
> >> > -			atomic64_add(rv, &mm_s->pinned_vm);
> >>
> >> Also fixes the race bug
> >
> > But introduces another one. In that loop, umem->num_pages keeps the
> > number of pages currently pinned, not the target number. The current
> > patch uses that umem->num_pages to call vm_unaccount_pinned() in
> > siw_umem_release(). Bailing out before all pages are pinned would
> > mess up that accounting during release. Maybe introduce another
> > parameter to siw_umem_release(), or better have another umem member
> > 'umem->num_pages_accounted' for correct accounting during release.
> 
> Yes, I see the problem, thanks for pointing it out. Will fix for the next
> version.

Thank you! Let me send a patch against the original code
that checks whether not all pages were pinned and fixes
the counter accordingly. Maybe you can go from there?

Thank you,
Bernard.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-24 18:26 ` [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Yosry Ahmed
@ 2023-01-31  0:54   ` Alistair Popple
  2023-01-31  5:14     ` Yosry Ahmed
  0 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2023-01-31  0:54 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel


Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>> Having large amounts of unmovable or unreclaimable memory in a system
>> can lead to system instability due to increasing the likelihood of
>> encountering out-of-memory conditions. Therefore it is desirable to
>> limit the amount of memory users can lock or pin.
>>
>> From userspace such limits can be enforced by setting
>> RLIMIT_MEMLOCK. However there is no standard method that drivers and
>> other in-kernel users can use to check and enforce this limit.
>>
>> This has led to a large number of inconsistencies in how limits are
>> enforced. For example some drivers will use mm->locked_vm while others
>> will use mm->pinned_vm or user->locked_vm. It is therefore possible to
>> have up to three times RLIMIT_MEMLOCK pinned.
>>
>> Having pinned memory limited per-task also makes it easy for users to
>> exceed the limit. For example memory pinned by drivers with
>> pin_user_pages() tends to remain pinned after fork. To deal with
>> this and other issues this series introduces a cgroup for tracking and
>> limiting the number of pages pinned or locked by tasks in the group.
>>
>> However the existing behaviour with regards to the rlimit needs to be
>> maintained. Therefore the lesser of the two limits is
>> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit,
>> but this bypass is not allowed for the cgroup.
>>
>> The first part of this series converts existing drivers which
>> open-code the use of locked_mm/pinned_mm over to a common interface
>> which manages the refcounts of the associated task/mm/user
>> structs. This ensures accounting of pages is consistent and makes it
>> easier to add charging of the cgroup.
>>
>> The second part of the series adds the cgroup and converts core mm
>> code such as mlock over to charging the cgroup before finally
>> introducing some selftests.
>
>
> I didn't go through the entire series, so apologies if this was
> mentioned somewhere, but do you mind elaborating on why this is added
> as a separate cgroup controller rather than an extension of the memory
> cgroup controller?

One of my early prototypes actually did add this to the memcg
controller. However pinned pages fall under their own limit, and we
wanted to always account pages to the cgroup of the task using the
driver rather than say folio_memcg(). So adding it to memcg didn't seem
to have much benefit as we didn't end up using any of the infrastructure
provided by memcg. Hence I thought it was clearer to just add it as its
own controller.
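
In code terms the charge is driven entirely by the task doing the
pinning. Sketching with the vm_account helpers from this series:

	/*
	 * Checked against the rlimit and pins cgroup of current, not
	 * against the memcg the pages themselves are charged to.
	 */
	vm_account_init_current(&vm_account);
	ret = vm_account_pinned(&vm_account, npages);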

 - Alistair
 
>>
>>
>> As I don't have access to systems with all the various devices I
>> haven't been able to test all driver changes. Any help there would be
>> appreciated.
>>
>> Alistair Popple (19):
>>   mm: Introduce vm_account
>>   drivers/vhost: Convert to use vm_account
>>   drivers/vdpa: Convert vdpa to use the new vm_structure
>>   infiniband/umem: Convert to use vm_account
>>   RMDA/siw: Convert to use vm_account
>>   RDMA/usnic: convert to use vm_account
>>   vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm
>>   vfio/spapr_tce: Convert accounting to pinned_vm
>>   io_uring: convert to use vm_account
>>   net: skb: Switch to using vm_account
>>   xdp: convert to use vm_account
>>   kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned()
>>   fpga: dfl: afu: convert to use vm_account
>>   mm: Introduce a cgroup for pinned memory
>>   mm/util: Extend vm_account to charge pages against the pin cgroup
>>   mm/util: Refactor account_locked_vm
>>   mm: Convert mmap and mlock to use account_locked_vm
>>   mm/mmap: Charge locked memory to pins cgroup
>>   selftests/vm: Add pins-cgroup selftest for mlock/mmap
>>
>>  MAINTAINERS                              |   8 +-
>>  arch/powerpc/kvm/book3s_64_vio.c         |  10 +-
>>  arch/powerpc/mm/book3s64/iommu_api.c     |  29 +--
>>  drivers/fpga/dfl-afu-dma-region.c        |  11 +-
>>  drivers/fpga/dfl-afu.h                   |   1 +-
>>  drivers/infiniband/core/umem.c           |  16 +-
>>  drivers/infiniband/core/umem_odp.c       |   6 +-
>>  drivers/infiniband/hw/usnic/usnic_uiom.c |  13 +-
>>  drivers/infiniband/hw/usnic/usnic_uiom.h |   1 +-
>>  drivers/infiniband/sw/siw/siw.h          |   2 +-
>>  drivers/infiniband/sw/siw/siw_mem.c      |  20 +--
>>  drivers/infiniband/sw/siw/siw_verbs.c    |  15 +-
>>  drivers/vdpa/vdpa_user/vduse_dev.c       |  20 +--
>>  drivers/vfio/vfio_iommu_spapr_tce.c      |  15 +-
>>  drivers/vfio/vfio_iommu_type1.c          |  59 +----
>>  drivers/vhost/vdpa.c                     |   9 +-
>>  drivers/vhost/vhost.c                    |   2 +-
>>  drivers/vhost/vhost.h                    |   1 +-
>>  include/linux/cgroup.h                   |  20 ++-
>>  include/linux/cgroup_subsys.h            |   4 +-
>>  include/linux/io_uring_types.h           |   3 +-
>>  include/linux/kvm_host.h                 |   1 +-
>>  include/linux/mm.h                       |   5 +-
>>  include/linux/mm_types.h                 |  88 ++++++++-
>>  include/linux/skbuff.h                   |   6 +-
>>  include/net/sock.h                       |   2 +-
>>  include/net/xdp_sock.h                   |   2 +-
>>  include/rdma/ib_umem.h                   |   1 +-
>>  io_uring/io_uring.c                      |  20 +--
>>  io_uring/notif.c                         |   4 +-
>>  io_uring/notif.h                         |  10 +-
>>  io_uring/rsrc.c                          |  38 +---
>>  io_uring/rsrc.h                          |   9 +-
>>  mm/Kconfig                               |  11 +-
>>  mm/Makefile                              |   1 +-
>>  mm/internal.h                            |   2 +-
>>  mm/mlock.c                               |  76 +------
>>  mm/mmap.c                                |  76 +++----
>>  mm/mremap.c                              |  54 +++--
>>  mm/pins_cgroup.c                         | 273 ++++++++++++++++++++++++-
>>  mm/secretmem.c                           |   6 +-
>>  mm/util.c                                | 196 +++++++++++++++--
>>  net/core/skbuff.c                        |  47 +---
>>  net/rds/message.c                        |   9 +-
>>  net/xdp/xdp_umem.c                       |  38 +--
>>  tools/testing/selftests/vm/Makefile      |   1 +-
>>  tools/testing/selftests/vm/pins-cgroup.c | 271 ++++++++++++++++++++++++-
>>  virt/kvm/kvm_main.c                      |   3 +-
>>  48 files changed, 1114 insertions(+), 401 deletions(-)
>>  create mode 100644 mm/pins_cgroup.c
>>  create mode 100644 tools/testing/selftests/vm/pins-cgroup.c
>>
>> base-commit: 2241ab53cbb5cdb08a6b2d4688feb13971058f65
>> --
>> git-series 0.9.1
>>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-31  0:54   ` Alistair Popple
@ 2023-01-31  5:14     ` Yosry Ahmed
  2023-01-31 11:22       ` Alistair Popple
  0 siblings, 1 reply; 56+ messages in thread
From: Yosry Ahmed @ 2023-01-31  5:14 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel

On Mon, Jan 30, 2023 at 5:07 PM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote:
> >>
> >> Having large amounts of unmovable or unreclaimable memory in a system
> >> can lead to system instability due to increasing the likelihood of
> >> encountering out-of-memory conditions. Therefore it is desirable to
> >> limit the amount of memory users can lock or pin.
> >>
> >> From userspace such limits can be enforced by setting
> >> RLIMIT_MEMLOCK. However there is no standard method that drivers and
> >> other in-kernel users can use to check and enforce this limit.
> >>
> >> This has led to a large number of inconsistencies in how limits are
> >> enforced. For example some drivers will use mm->locked_vm while others
> >> will use mm->pinned_vm or user->locked_vm. It is therefore possible to
> >> have up to three times RLIMIT_MEMLOCK pinned.
> >>
> >> Having pinned memory limited per-task also makes it easy for users to
> >> exceed the limit. For example memory pinned by drivers with
> >> pin_user_pages() tends to remain pinned after fork. To deal with
> >> this and other issues this series introduces a cgroup for tracking and
> >> limiting the number of pages pinned or locked by tasks in the group.
> >>
> >> However the existing behaviour with regards to the rlimit needs to be
> >> maintained. Therefore the lesser of the two limits is
> >> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit,
> >> but this bypass is not allowed for the cgroup.
> >>
> >> The first part of this series converts existing drivers which
> >> open-code the use of locked_mm/pinned_mm over to a common interface
> >> which manages the refcounts of the associated task/mm/user
> >> structs. This ensures accounting of pages is consistent and makes it
> >> easier to add charging of the cgroup.
> >>
> >> The second part of the series adds the cgroup and converts core mm
> >> code such as mlock over to charging the cgroup before finally
> >> introducing some selftests.
> >
> >
> > I didn't go through the entire series, so apologies if this was
> > mentioned somewhere, but do you mind elaborating on why this is added
> > as a separate cgroup controller rather than an extension of the memory
> > cgroup controller?
>
> One of my early prototypes actually did add this to the memcg
> controller. However pinned pages fall under their own limit, and we
> wanted to always account pages to the cgroup of the task using the
> driver rather than say folio_memcg(). So adding it to memcg didn't seem
> to have much benefit as we didn't end up using any of the infrastructure
> provided by memcg. Hence I thought it was clearer to just add it as it's
> own controller.

To clarify, you account and limit pinned memory based on the cgroup of
the process pinning the pages, not based on the cgroup that the pages
are actually charged to? Is my understanding correct?

IOW, you limit the amount of memory that processes in a cgroup can
pin, not the amount of memory charged to a cgroup that can be pinned?

>
>  - Alistair
>
> >>
> >>
> >> As I don't have access to systems with all the various devices I
> >> haven't been able to test all driver changes. Any help there would be
> >> appreciated.
> >>
> >> Alistair Popple (19):
> >>   mm: Introduce vm_account
> >>   drivers/vhost: Convert to use vm_account
> >>   drivers/vdpa: Convert vdpa to use the new vm_structure
> >>   infiniband/umem: Convert to use vm_account
> >>   RMDA/siw: Convert to use vm_account
> >>   RDMA/usnic: convert to use vm_account
> >>   vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm
> >>   vfio/spapr_tce: Convert accounting to pinned_vm
> >>   io_uring: convert to use vm_account
> >>   net: skb: Switch to using vm_account
> >>   xdp: convert to use vm_account
> >>   kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned()
> >>   fpga: dfl: afu: convert to use vm_account
> >>   mm: Introduce a cgroup for pinned memory
> >>   mm/util: Extend vm_account to charge pages against the pin cgroup
> >>   mm/util: Refactor account_locked_vm
> >>   mm: Convert mmap and mlock to use account_locked_vm
> >>   mm/mmap: Charge locked memory to pins cgroup
> >>   selftests/vm: Add pins-cgroup selftest for mlock/mmap
> >>
> >>  MAINTAINERS                              |   8 +-
> >>  arch/powerpc/kvm/book3s_64_vio.c         |  10 +-
> >>  arch/powerpc/mm/book3s64/iommu_api.c     |  29 +--
> >>  drivers/fpga/dfl-afu-dma-region.c        |  11 +-
> >>  drivers/fpga/dfl-afu.h                   |   1 +-
> >>  drivers/infiniband/core/umem.c           |  16 +-
> >>  drivers/infiniband/core/umem_odp.c       |   6 +-
> >>  drivers/infiniband/hw/usnic/usnic_uiom.c |  13 +-
> >>  drivers/infiniband/hw/usnic/usnic_uiom.h |   1 +-
> >>  drivers/infiniband/sw/siw/siw.h          |   2 +-
> >>  drivers/infiniband/sw/siw/siw_mem.c      |  20 +--
> >>  drivers/infiniband/sw/siw/siw_verbs.c    |  15 +-
> >>  drivers/vdpa/vdpa_user/vduse_dev.c       |  20 +--
> >>  drivers/vfio/vfio_iommu_spapr_tce.c      |  15 +-
> >>  drivers/vfio/vfio_iommu_type1.c          |  59 +----
> >>  drivers/vhost/vdpa.c                     |   9 +-
> >>  drivers/vhost/vhost.c                    |   2 +-
> >>  drivers/vhost/vhost.h                    |   1 +-
> >>  include/linux/cgroup.h                   |  20 ++-
> >>  include/linux/cgroup_subsys.h            |   4 +-
> >>  include/linux/io_uring_types.h           |   3 +-
> >>  include/linux/kvm_host.h                 |   1 +-
> >>  include/linux/mm.h                       |   5 +-
> >>  include/linux/mm_types.h                 |  88 ++++++++-
> >>  include/linux/skbuff.h                   |   6 +-
> >>  include/net/sock.h                       |   2 +-
> >>  include/net/xdp_sock.h                   |   2 +-
> >>  include/rdma/ib_umem.h                   |   1 +-
> >>  io_uring/io_uring.c                      |  20 +--
> >>  io_uring/notif.c                         |   4 +-
> >>  io_uring/notif.h                         |  10 +-
> >>  io_uring/rsrc.c                          |  38 +---
> >>  io_uring/rsrc.h                          |   9 +-
> >>  mm/Kconfig                               |  11 +-
> >>  mm/Makefile                              |   1 +-
> >>  mm/internal.h                            |   2 +-
> >>  mm/mlock.c                               |  76 +------
> >>  mm/mmap.c                                |  76 +++----
> >>  mm/mremap.c                              |  54 +++--
> >>  mm/pins_cgroup.c                         | 273 ++++++++++++++++++++++++-
> >>  mm/secretmem.c                           |   6 +-
> >>  mm/util.c                                | 196 +++++++++++++++--
> >>  net/core/skbuff.c                        |  47 +---
> >>  net/rds/message.c                        |   9 +-
> >>  net/xdp/xdp_umem.c                       |  38 +--
> >>  tools/testing/selftests/vm/Makefile      |   1 +-
> >>  tools/testing/selftests/vm/pins-cgroup.c | 271 ++++++++++++++++++++++++-
> >>  virt/kvm/kvm_main.c                      |   3 +-
> >>  48 files changed, 1114 insertions(+), 401 deletions(-)
> >>  create mode 100644 mm/pins_cgroup.c
> >>  create mode 100644 tools/testing/selftests/vm/pins-cgroup.c
> >>
> >> base-commit: 2241ab53cbb5cdb08a6b2d4688feb13971058f65
> >> --
> >> git-series 0.9.1
> >>
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-31  5:14     ` Yosry Ahmed
@ 2023-01-31 11:22       ` Alistair Popple
  2023-01-31 19:49         ` Yosry Ahmed
  0 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2023-01-31 11:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel


Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Jan 30, 2023 at 5:07 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote:
>> >>
>> >> Having large amounts of unmovable or unreclaimable memory in a system
>> >> can lead to system instability due to increasing the likelihood of
>> >> encountering out-of-memory conditions. Therefore it is desirable to
>> >> limit the amount of memory users can lock or pin.
>> >>
>> >> From userspace such limits can be enforced by setting
>> >> RLIMIT_MEMLOCK. However there is no standard method that drivers and
>> >> other in-kernel users can use to check and enforce this limit.
>> >>
>> >> This has led to a large number of inconsistencies in how limits are
>> >> enforced. For example some drivers will use mm->locked_vm while others
>> >> will use mm->pinned_vm or user->locked_vm. It is therefore possible to
>> >> have up to three times RLIMIT_MEMLOCK pinned.
>> >>
>> >> Having pinned memory limited per-task also makes it easy for users to
>> >> exceed the limit. For example memory pinned by drivers with
>> >> pin_user_pages() tends to remain pinned after fork. To deal with
>> >> this and other issues this series introduces a cgroup for tracking and
>> >> limiting the number of pages pinned or locked by tasks in the group.
>> >>
>> >> However the existing behaviour with regards to the rlimit needs to be
>> >> maintained. Therefore the lesser of the two limits is
>> >> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit,
>> >> but this bypass is not allowed for the cgroup.
>> >>
>> >> The first part of this series converts existing drivers which
>> >> open-code the use of locked_mm/pinned_mm over to a common interface
>> >> which manages the refcounts of the associated task/mm/user
>> >> structs. This ensures accounting of pages is consistent and makes it
>> >> easier to add charging of the cgroup.
>> >>
>> >> The second part of the series adds the cgroup and converts core mm
>> >> code such as mlock over to charging the cgroup before finally
>> >> introducing some selftests.
>> >
>> >
>> > I didn't go through the entire series, so apologies if this was
>> > mentioned somewhere, but do you mind elaborating on why this is added
>> > as a separate cgroup controller rather than an extension of the memory
>> > cgroup controller?
>>
>> One of my early prototypes actually did add this to the memcg
>> controller. However pinned pages fall under their own limit, and we
>> wanted to always account pages to the cgroup of the task using the
>> driver rather than say folio_memcg(). So adding it to memcg didn't seem
>> to have much benefit as we didn't end up using any of the infrastructure
>> provided by memcg. Hence I thought it was clearer to just add it as its
>> own controller.
>
> To clarify, you account and limit pinned memory based on the cgroup of
> the process pinning the pages, not based on the cgroup that the pages
> are actually charged to? Is my understanding correct?

That's correct.

> IOW, you limit the amount of memory that processes in a cgroup can
> pin, not the amount of memory charged to a cgroup that can be pinned?

Right, that's a good clarification which I might steal and add to the
cover letter.

>>
>>  - Alistair
>>
>> >>
>> >>
>> >> As I don't have access to systems with all the various devices I
>> >> haven't been able to test all driver changes. Any help there would be
>> >> appreciated.
>> >>
>> >> Alistair Popple (19):
>> >>   mm: Introduce vm_account
>> >>   drivers/vhost: Convert to use vm_account
>> >>   drivers/vdpa: Convert vdpa to use the new vm_structure
>> >>   infiniband/umem: Convert to use vm_account
>> >>   RMDA/siw: Convert to use vm_account
>> >>   RDMA/usnic: convert to use vm_account
>> >>   vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm
>> >>   vfio/spapr_tce: Convert accounting to pinned_vm
>> >>   io_uring: convert to use vm_account
>> >>   net: skb: Switch to using vm_account
>> >>   xdp: convert to use vm_account
>> >>   kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned()
>> >>   fpga: dfl: afu: convert to use vm_account
>> >>   mm: Introduce a cgroup for pinned memory
>> >>   mm/util: Extend vm_account to charge pages against the pin cgroup
>> >>   mm/util: Refactor account_locked_vm
>> >>   mm: Convert mmap and mlock to use account_locked_vm
>> >>   mm/mmap: Charge locked memory to pins cgroup
>> >>   selftests/vm: Add pins-cgroup selftest for mlock/mmap
>> >>
>> >>  MAINTAINERS                              |   8 +-
>> >>  arch/powerpc/kvm/book3s_64_vio.c         |  10 +-
>> >>  arch/powerpc/mm/book3s64/iommu_api.c     |  29 +--
>> >>  drivers/fpga/dfl-afu-dma-region.c        |  11 +-
>> >>  drivers/fpga/dfl-afu.h                   |   1 +-
>> >>  drivers/infiniband/core/umem.c           |  16 +-
>> >>  drivers/infiniband/core/umem_odp.c       |   6 +-
>> >>  drivers/infiniband/hw/usnic/usnic_uiom.c |  13 +-
>> >>  drivers/infiniband/hw/usnic/usnic_uiom.h |   1 +-
>> >>  drivers/infiniband/sw/siw/siw.h          |   2 +-
>> >>  drivers/infiniband/sw/siw/siw_mem.c      |  20 +--
>> >>  drivers/infiniband/sw/siw/siw_verbs.c    |  15 +-
>> >>  drivers/vdpa/vdpa_user/vduse_dev.c       |  20 +--
>> >>  drivers/vfio/vfio_iommu_spapr_tce.c      |  15 +-
>> >>  drivers/vfio/vfio_iommu_type1.c          |  59 +----
>> >>  drivers/vhost/vdpa.c                     |   9 +-
>> >>  drivers/vhost/vhost.c                    |   2 +-
>> >>  drivers/vhost/vhost.h                    |   1 +-
>> >>  include/linux/cgroup.h                   |  20 ++-
>> >>  include/linux/cgroup_subsys.h            |   4 +-
>> >>  include/linux/io_uring_types.h           |   3 +-
>> >>  include/linux/kvm_host.h                 |   1 +-
>> >>  include/linux/mm.h                       |   5 +-
>> >>  include/linux/mm_types.h                 |  88 ++++++++-
>> >>  include/linux/skbuff.h                   |   6 +-
>> >>  include/net/sock.h                       |   2 +-
>> >>  include/net/xdp_sock.h                   |   2 +-
>> >>  include/rdma/ib_umem.h                   |   1 +-
>> >>  io_uring/io_uring.c                      |  20 +--
>> >>  io_uring/notif.c                         |   4 +-
>> >>  io_uring/notif.h                         |  10 +-
>> >>  io_uring/rsrc.c                          |  38 +---
>> >>  io_uring/rsrc.h                          |   9 +-
>> >>  mm/Kconfig                               |  11 +-
>> >>  mm/Makefile                              |   1 +-
>> >>  mm/internal.h                            |   2 +-
>> >>  mm/mlock.c                               |  76 +------
>> >>  mm/mmap.c                                |  76 +++----
>> >>  mm/mremap.c                              |  54 +++--
>> >>  mm/pins_cgroup.c                         | 273 ++++++++++++++++++++++++-
>> >>  mm/secretmem.c                           |   6 +-
>> >>  mm/util.c                                | 196 +++++++++++++++--
>> >>  net/core/skbuff.c                        |  47 +---
>> >>  net/rds/message.c                        |   9 +-
>> >>  net/xdp/xdp_umem.c                       |  38 +--
>> >>  tools/testing/selftests/vm/Makefile      |   1 +-
>> >>  tools/testing/selftests/vm/pins-cgroup.c | 271 ++++++++++++++++++++++++-
>> >>  virt/kvm/kvm_main.c                      |   3 +-
>> >>  48 files changed, 1114 insertions(+), 401 deletions(-)
>> >>  create mode 100644 mm/pins_cgroup.c
>> >>  create mode 100644 tools/testing/selftests/vm/pins-cgroup.c
>> >>
>> >> base-commit: 2241ab53cbb5cdb08a6b2d4688feb13971058f65
>> >> --
>> >> git-series 0.9.1
>> >>
>>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-24 20:12 ` Jason Gunthorpe
@ 2023-01-31 13:57   ` David Hildenbrand
  2023-01-31 14:03     ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2023-01-31 13:57 UTC (permalink / raw)
  To: Jason Gunthorpe, Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel

On 24.01.23 21:12, Jason Gunthorpe wrote:
> On Tue, Jan 24, 2023 at 04:42:29PM +1100, Alistair Popple wrote:
>> Having large amounts of unmovable or unreclaimable memory in a system
>> can lead to system instability due to increasing the likelihood of
>> encountering out-of-memory conditions. Therefore it is desirable to
>> limit the amount of memory users can lock or pin.
>>
>>  From userspace such limits can be enforced by setting
>> RLIMIT_MEMLOCK. However there is no standard method that drivers and
>> other in-kernel users can use to check and enforce this limit.
>>
>> This has led to a large number of inconsistencies in how limits are
>> enforced. For example some drivers will use mm->locked_vm while others
>> will use mm->pinned_vm or user->locked_vm. It is therefore possible to
>> have up to three times RLIMIT_MEMLOCK pinned.
>>
>> Having pinned memory limited per-task also makes it easy for users to
>> exceed the limit. For example memory pinned by drivers with
>> pin_user_pages() tends to remain pinned after fork. To deal with
>> this and other issues this series introduces a cgroup for tracking and
>> limiting the number of pages pinned or locked by tasks in the group.
>>
>> However the existing behaviour with regards to the rlimit needs to be
>> maintained. Therefore the lesser of the two limits is
>> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit,
>> but this bypass is not allowed for the cgroup.
>>
>> The first part of this series converts existing drivers which
>> open-code the use of locked_mm/pinned_mm over to a common interface
>> which manages the refcounts of the associated task/mm/user
>> structs. This ensures accounting of pages is consistent and makes it
>> easier to add charging of the cgroup.
>>
>> The second part of the series adds the cgroup and converts core mm
>> code such as mlock over to charging the cgroup before finally
>> introducing some selftests.
>>
>> As I don't have access to systems with all the various devices I
>> haven't been able to test all driver changes. Any help there would be
>> appreciated.
> 
> I'm excited by this series, thanks for making it.
> 
> The pin accounting has been a long standing problem and cgroups will
> really help!

Indeed. I'm curious how GUP-fast, pinning the same page multiple times, 
and pinning subpages of larger folios are handled :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 01/19] mm: Introduce vm_account
  2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
  2023-01-24  6:29   ` Christoph Hellwig
  2023-01-24 14:32   ` Jason Gunthorpe
@ 2023-01-31 14:00   ` David Hildenbrand
  2 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2023-01-31 14:00 UTC (permalink / raw)
  To: Alistair Popple, linux-mm, cgroups
  Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
	daniel, linuxppc-dev, linux-fpga, linux-rdma, virtualization,
	kvm, netdev, io-uring, bpf, rds-devel, linux-kselftest

On 24.01.23 06:42, Alistair Popple wrote:
> Kernel drivers that pin pages should account these pages against
> either user->locked_vm or mm->pinned_vm and fail the pinning if
> RLIMIT_MEMLOCK is exceeded and CAP_IPC_LOCK isn't held.
> 
> Currently drivers open-code this accounting and use various methods to
> update the atomic variables and check against the limits leading to
> various bugs and inconsistencies. To fix this introduce a standard
> interface for charging pinned and locked memory. As this involves
> taking references on kernel objects such as mm_struct or user_struct
> we introduce a new vm_account struct to hold these references. Several
> helper functions are then introduced to grab references and check
> limits.
> 
> As the way these limits are charged and enforced is visible to
> userspace we need to be careful not to break existing applications by
> charging to different counters. As a result the vm_account functions
> support accounting to different counters as required.
> 
> A future change will extend this to also account against a cgroup for
> pinned pages.

The term "vm_account" is misleading, no? VM_ACCOUNT is for accounting 
towards the commit limit ....

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-31 13:57   ` David Hildenbrand
@ 2023-01-31 14:03     ` Jason Gunthorpe
  2023-01-31 14:06       ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-31 14:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-mm, cgroups, linux-kernel, jhubbard,
	tjmercier, hannes, surenb, mkoutny, daniel

On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote:

> > I'm excited by this series, thanks for making it.
> > 
> > The pin accounting has been a long standing problem and cgroups will
> > really help!
> 
> Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and
> pinning subpages of larger folios are handled :)

The same as today. The pinning is done based on the result from GUP,
and we charge every returned struct page.

So duplicates are counted multiple times, folios are ignored.

Removing duplicate charges would be costly, it would require storage
to keep track of how many times individual pages have been charged to
each cgroup (eg an xarray indexed by PFN of integers in each cgroup).

It doesn't seem worth the cost, IMHO.
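
To make the cost concrete, tracking first-pins would need something
like this on every pin (hypothetical sketch, not part of this series):

	/*
	 * Hypothetical per-cgroup bookkeeping: map PFN -> number of
	 * times charged. Returns 1 if this is the first pin of the
	 * page in this cgroup, 0 for a duplicate, negative errno on
	 * allocation failure.
	 */
	static int pin_charge_once(struct xarray *pin_counts, unsigned long pfn)
	{
		void *entry = xa_load(pin_counts, pfn);
		unsigned long count = entry ? xa_to_value(entry) : 0;
		int err = xa_err(xa_store(pin_counts, pfn,
					  xa_mk_value(count + 1), GFP_KERNEL));

		if (err)
			return err;
		return count == 0;
	}

plus the mirror image on unpin, and locking to keep the map and the
cgroup charge in sync - for every page of every pin.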

We've made a lot of investment now with iommufd to remove the most
annoying sources of duplicated pins so it is much less of a problem in
the qemu context at least.

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-31 14:03     ` Jason Gunthorpe
@ 2023-01-31 14:06       ` David Hildenbrand
  2023-01-31 14:10         ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2023-01-31 14:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, linux-mm, cgroups, linux-kernel, jhubbard,
	tjmercier, hannes, surenb, mkoutny, daniel

On 31.01.23 15:03, Jason Gunthorpe wrote:
> On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote:
> 
>>> I'm excited by this series, thanks for making it.
>>>
>>> The pin accounting has been a long standing problem and cgroups will
>>> really help!
>>
>> Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and
>> pinning subpages of larger folios are handled :)
> 
> The same as today. The pinning is done based on the result from GUP,
> and we charge every returned struct page.
> 
> So duplicates are counted multiple times, folios are ignored.
> 
> Removing duplicate charges would be costly, it would require storage
> to keep track of how many times individual pages have been charged to
> each cgroup (eg an xarray indexed by PFN of integers in each cgroup).
> 
> It doesn't seem worth the cost, IMHO.
> 
> We've made a lot of investment now with iommufd to remove the most
> annoying sources of duplicated pins so it is much less of a problem in
> the qemu context at least.

Wasn't there the discussion regarding using vfio+io_uring+rdma+$whatever 
on a VM and requiring multiple times the VM size as memlock limit? Would 
it be the same now, just that we need multiple times the pin limit?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-31 14:06       ` David Hildenbrand
@ 2023-01-31 14:10         ` Jason Gunthorpe
  2023-01-31 14:15           ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-31 14:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-mm, cgroups, linux-kernel, jhubbard,
	tjmercier, hannes, surenb, mkoutny, daniel

On Tue, Jan 31, 2023 at 03:06:10PM +0100, David Hildenbrand wrote:
> On 31.01.23 15:03, Jason Gunthorpe wrote:
> > On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote:
> > 
> > > > I'm excited by this series, thanks for making it.
> > > > 
> > > > The pin accounting has been a long standing problem and cgroups will
> > > > really help!
> > > 
> > > Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and
> > > pinning subpages of larger folios are handled :)
> > 
> > The same as today. The pinning is done based on the result from GUP,
> > and we charge every returned struct page.
> > 
> > So duplicates are counted multiple times, folios are ignored.
> > 
> > Removing duplicate charges would be costly, it would require storage
> > to keep track of how many times individual pages have been charged to
> > each cgroup (eg an xarray indexed by PFN of integers in each cgroup).
> > 
> > It doesn't seem worth the cost, IMHO.
> > 
> > We've made a lot of investment now with iommufd to remove the most
> > annoying sources of duplicated pins so it is much less of a problem in
> > the qemu context at least.
> 
> Wasn't there the discussion regarding using vfio+io_uring+rdma+$whatever on
> a VM and requiring multiple times the VM size as memlock limit?

Yes, but iommufd gives us some more options to mitigate this.

eg it makes some logical sense to point RDMA at the iommufd page
table that is already pinned when trying to DMA from guest memory, in
this case it could ride on the existing pin.

> Would it be the same now, just that we need multiple times the pin
> limit?

Yes

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-31 14:10         ` Jason Gunthorpe
@ 2023-01-31 14:15           ` David Hildenbrand
  2023-01-31 14:21             ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2023-01-31 14:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, linux-mm, cgroups, linux-kernel, jhubbard,
	tjmercier, hannes, surenb, mkoutny, daniel

On 31.01.23 15:10, Jason Gunthorpe wrote:
> On Tue, Jan 31, 2023 at 03:06:10PM +0100, David Hildenbrand wrote:
>> On 31.01.23 15:03, Jason Gunthorpe wrote:
>>> On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote:
>>>
>>>>> I'm excited by this series, thanks for making it.
>>>>>
>>>>> The pin accounting has been a long standing problem and cgroups will
>>>>> really help!
>>>>
>>>> Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and
>>>> pinning subpages of larger folios are handled :)
>>>
>>> The same as today. The pinning is done based on the result from GUP,
>>> and we charge every returned struct page.
>>>
>>> So duplicates are counted multiple times, folios are ignored.
>>>
>>> Removing duplicate charges would be costly, it would require storage
>>> to keep track of how many times individual pages have been charged to
>>> each cgroup (eg an xarray indexed by PFN of integers in each cgroup).
>>>
>>> It doesn't seem worth the cost, IMHO.
>>>
>>> We've made a lot of investment now with iommufd to remove the most
>>> annoying sources of duplicated pins so it is much less of a problem in
>>> the qemu context at least.
>>
>> Wasn't there the discussion regarding using vfio+io_uring+rdma+$whatever on
>> a VM and requiring multiple times the VM size as memlock limit?
> 
> Yes, but iommufd gives us some more options to mitigate this.
> 
> eg it makes some of logical sense to point RDMA at the iommufd page
> table that is already pinned when trying to DMA from guest memory, in
> this case it could ride on the existing pin.

Right, I suspect some issue is that the address space layout for the 
RDMA device might be completely different. But I'm no expert on IOMMUs 
at all :)

I do understand that at least multiple VFIO containers could benefit by 
only pinning once (IIUC that might have been an issue?).

> 
>> Would it be the same now, just that we need multiple times the pin
>> limit?
> 
> Yes

Okay, thanks.


It's all still a big improvement, because I also asked for TDX 
restrictedmem to be accounted somehow as unmovable.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-31 14:15           ` David Hildenbrand
@ 2023-01-31 14:21             ` Jason Gunthorpe
  0 siblings, 0 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2023-01-31 14:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-mm, cgroups, linux-kernel, jhubbard,
	tjmercier, hannes, surenb, mkoutny, daniel

On Tue, Jan 31, 2023 at 03:15:49PM +0100, David Hildenbrand wrote:
> On 31.01.23 15:10, Jason Gunthorpe wrote:
> > On Tue, Jan 31, 2023 at 03:06:10PM +0100, David Hildenbrand wrote:
> > > On 31.01.23 15:03, Jason Gunthorpe wrote:
> > > > On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote:
> > > > 
> > > > > > I'm excited by this series, thanks for making it.
> > > > > > 
> > > > > > The pin accounting has been a long standing problem and cgroups will
> > > > > > really help!
> > > > > 
> > > > > Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and
> > > > > pinning subpages of larger folios are handled :)
> > > > 
> > > > The same as today. The pinning is done based on the result from GUP,
> > > > and we charge every returned struct page.
> > > > 
> > > > So duplicates are counted multiple times, folios are ignored.
> > > > 
> > > > Removing duplicate charges would be costly, it would require storage
> > > > to keep track of how many times individual pages have been charged to
> > > > each cgroup (eg an xarray indexed by PFN of integers in each cgroup).
> > > > 
> > > > It doesn't seem worth the cost, IMHO.
> > > > 
> > > > We've made a lot of investment now with iommufd to remove the most
> > > > annoying sources of duplicated pins so it is much less of a problem in
> > > > the qemu context at least.
> > > 
> > > Wasn't there the discussion regarding using vfio+io_uring+rdma+$whatever on
> > > a VM and requiring multiple times the VM size as memlock limit?
> > 
> > Yes, but iommufd gives us some more options to mitigate this.
> > 
> > eg it makes some logical sense to point RDMA at the iommufd page
> > table that is already pinned when trying to DMA from guest memory, in
> > this case it could ride on the existing pin.
> 
> Right, I suspect some issue is that the address space layout for the RDMA
> device might be completely different. But I'm no expert on IOMMUs at all :)

Oh it doesn't matter, it is all virtualized so many times..

> I do understand that at least multiple VFIO containers could benefit by only
> pinning once (IIUC that might have been an issue?).

iommufd has fixed this completely.

> It's all still a big improvement, because I also asked for TDX restrictedmem
> to be accounted somehow as unmovable.

Yeah, it is sort of reasonable to think of the CC "secret memory" as
memory that is no different from memory being DMA'd to. The DMA is
just some other vCPU.

I still don't have a clear idea how all this CC memory is going to
actually work. Eventually it has to get into iommufd as well, somehow.

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
  2023-01-31 11:22       ` Alistair Popple
@ 2023-01-31 19:49         ` Yosry Ahmed
  0 siblings, 0 replies; 56+ messages in thread
From: Yosry Ahmed @ 2023-01-31 19:49 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier,
	hannes, surenb, mkoutny, daniel

On Tue, Jan 31, 2023 at 3:24 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Mon, Jan 30, 2023 at 5:07 PM Alistair Popple <apopple@nvidia.com> wrote:
> >>
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote:
> >> >>
> >> >> Having large amounts of unmovable or unreclaimable memory in a system
> >> >> can lead to system instability due to increasing the likelihood of
> >> >> encountering out-of-memory conditions. Therefore it is desirable to
> >> >> limit the amount of memory users can lock or pin.
> >> >>
> >> >> From userspace such limits can be enforced by setting
> >> >> RLIMIT_MEMLOCK. However there is no standard method that drivers and
> >> >> other in-kernel users can use to check and enforce this limit.
> >> >>
> >> >> This has led to a large number of inconsistencies in how limits are
> >> >> enforced. For example some drivers will use mm->locked_vm while others
> >> >> will use mm->pinned_vm or user->locked_vm. It is therefore possible to
> >> >> have up to three times RLIMIT_MEMLOCK pinned.
> >> >>
> >> >> Having pinned memory limited per-task also makes it easy for users to
> >> >> exceed the limit. For example memory pinned by drivers with
> >> >> pin_user_pages() tends to remain pinned after fork. To deal with
> >> >> this and other issues this series introduces a cgroup for tracking and
> >> >> limiting the number of pages pinned or locked by tasks in the group.
> >> >>
> >> >> However the existing behaviour with regards to the rlimit needs to be
> >> >> maintained. Therefore the lesser of the two limits is
> >> >> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit,
> >> >> but this bypass is not allowed for the cgroup.
> >> >>
> >> >> The first part of this series converts existing drivers which
> >> >> open-code the use of locked_mm/pinned_mm over to a common interface
> >> >> which manages the refcounts of the associated task/mm/user
> >> >> structs. This ensures accounting of pages is consistent and makes it
> >> >> easier to add charging of the cgroup.
> >> >>
> >> >> The second part of the series adds the cgroup and converts core mm
> >> >> code such as mlock over to charging the cgroup before finally
> >> >> introducing some selftests.
> >> >
> >> >
> >> > I didn't go through the entire series, so apologies if this was
> >> > mentioned somewhere, but do you mind elaborating on why this is added
> >> > as a separate cgroup controller rather than an extension of the memory
> >> > cgroup controller?
> >>
> >> One of my early prototypes actually did add this to the memcg
> >> controller. However pinned pages fall under their own limit, and we
> >> wanted to always account pages to the cgroup of the task using the
> >> driver rather than say folio_memcg(). So adding it to memcg didn't seem
> >> to have much benefit as we didn't end up using any of the infrastructure
> >> provided by memcg. Hence I thought it was clearer to just add it as its
> >> own controller.
> >
> > To clarify, you account and limit pinned memory based on the cgroup of
> > the process pinning the pages, not based on the cgroup that the pages
> > are actually charged to? Is my understanding correct?
>
> That's correct.

Interesting.

>
> > IOW, you limit the amount of memory that processes in a cgroup can
> > pin, not the amount of memory charged to a cgroup that can be pinned?
>
> Right, that's a good clarification which I might steal and add to the
> cover letter.

Feel free to :)

Please also clarify this in the code/docs. Glancing through the
patches I was asking myself multiple times why this is not
"memory.pinned.[current/max]" or similar.

>
> >>
> >>  - Alistair
> >>
> >> >> [snip: quoted cover letter, patch list and diffstat]
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
  2023-01-30 11:17     ` Alistair Popple
@ 2023-02-06  4:36       ` Alistair Popple
  2023-02-06 13:14         ` Jason Gunthorpe
  0 siblings, 1 reply; 56+ messages in thread
From: Alistair Popple @ 2023-02-06  4:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel


Alistair Popple <apopple@nvidia.com> writes:

> Jason Gunthorpe <jgg@nvidia.com> writes:
>
>> On Tue, Jan 24, 2023 at 04:42:39PM +1100, Alistair Popple wrote:
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index dcd72e6..bc3a868 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -334,6 +334,7 @@ struct sk_filter;
>>>    *	@sk_security: used by security modules
>>>    *	@sk_mark: generic packet mark
>>>    *	@sk_cgrp_data: cgroup data for this cgroup
>>> +  *	@sk_vm_account: data for pinned memory accounting
>>>    *	@sk_memcg: this socket's memory cgroup association
>>>    *	@sk_write_pending: a write to stream socket waits to start
>>>    *	@sk_state_change: callback to indicate change in the state of the sock
>>> @@ -523,6 +524,7 @@ struct sock {
>>>  	void			*sk_security;
>>>  #endif
>>>  	struct sock_cgroup_data	sk_cgrp_data;
>>> +	struct vm_account       sk_vm_account;
>>>  	struct mem_cgroup	*sk_memcg;
>>>  	void			(*sk_state_change)(struct sock *sk);
>>>  	void			(*sk_data_ready)(struct sock *sk);
>>
>> I'm not sure this makes sense in a sock - each sock can be shared with
>> different processes..
>
> TBH it didn't feel right to me either so was hoping for some
> feedback. Will try your suggestion below.
>
>>> diff --git a/net/rds/message.c b/net/rds/message.c
>>> index b47e4f0..2138a70 100644
>>> --- a/net/rds/message.c
>>> +++ b/net/rds/message.c
>>> @@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>>>  	struct list_head *head;
>>>  	unsigned long flags;
>>>  
>>> -	mm_unaccount_pinned_pages(&znotif->z_mmp);
>>> +	mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
>>>  	q = &rs->rs_zcookie_queue;
>>>  	spin_lock_irqsave(&q->lock, flags);
>>>  	head = &q->zcookie_head;
>>> @@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>>  	int ret = 0;
>>>  	int length = iov_iter_count(from);
>>>  	struct rds_msg_zcopy_info *info;
>>> +	struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
>>>  
>>>  	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>>>  
>>> @@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>>  		return -ENOMEM;
>>>  	INIT_LIST_HEAD(&info->rs_zcookie_next);
>>>  	rm->data.op_mmp_znotifier = &info->znotif;
>>> -	if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
>>> +	vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
>>> +	if (mm_account_pinned_pages(vm_account,
>>> +				    &rm->data.op_mmp_znotifier->z_mmp,
>>>  				    length)) {
>>>  		ret = -ENOMEM;
>>>  		goto err;
>>> @@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>>  			for (i = 0; i < rm->data.op_nents; i++)
>>>  				put_page(sg_page(&rm->data.op_sg[i]));
>>>  			mmp = &rm->data.op_mmp_znotifier->z_mmp;
>>> -			mm_unaccount_pinned_pages(mmp);
>>> +			mm_unaccount_pinned_pages(vm_account, mmp);
>>>  			ret = -EFAULT;
>>>  			goto err;
>>>  		}
>>
>> I wonder if RDS should just not be doing accounting? Usually things
>> related to iov_iter are short term and we don't account for them.
>
> Yeah, I couldn't easily figure out why these were accounted for in the
> first place either.
>
>> But then I don't really know how RDS works, Santos?
>>
>> Regardless, maybe the vm_account should be stored in the
>> rds_msg_zcopy_info ?
>
> On first glance that looks like a better spot. Thanks for the
> idea.

That works fine for RDS but not for skbuff. We still need a vm_account
in struct sock, or somewhere else, for that. For example in
msg_zerocopy_realloc() we only have a struct ubuf_info_msgzc
available. We can't add a struct vm_account field to it because it is
ultimately stored in struct sk_buff->cb[], which is not large enough to
hold ubuf_info_msgzc plus a vm_account.

I'm not terribly familiar with kernel networking code though, so happy
to hear other suggestions.
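
For reference, the size constraint is roughly the following (a sketch;
the 48-byte cb[] declaration is from struct sk_buff, and a build-time
check along these lines already exists in the zerocopy code):

	/* skb->cb[] is a fixed scratch area carried in every skb: */
	struct sk_buff {
		/* ... */
		char cb[48] __aligned(8);
		/* ... */
	};

	/*
	 * ubuf_info_msgzc lives inside skb->cb[] and already nearly
	 * fills it, so growing it by a struct vm_account (several
	 * task/user/cgroup pointers) would trip:
	 */
	BUILD_BUG_ON(sizeof(struct ubuf_info_msgzc) >
		     sizeof_field(struct sk_buff, cb));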

>> Jason


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
  2023-02-06  4:36       ` Alistair Popple
@ 2023-02-06 13:14         ` Jason Gunthorpe
  0 siblings, 0 replies; 56+ messages in thread
From: Jason Gunthorpe @ 2023-02-06 13:14 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
	surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel

On Mon, Feb 06, 2023 at 03:36:49PM +1100, Alistair Popple wrote:

> >> But then I don't really know how RDS works, Santos?
> >>
> >> Regardless, maybe the vm_account should be stored in the
> >> rds_msg_zcopy_info ?
> >
> > On first glance that looks like a better spot. Thanks for the
> > idea.
> 
> That works fine for RDS but not for skbuff. 

I would definitely do the RDS stuff like that..

> We still need a vm_account in struct sock, or somewhere else, for
> that. For example in msg_zerocopy_realloc() we only have a struct
> ubuf_info_msgzc available. We can't add a struct vm_account field to
> it because it is ultimately stored in struct sk_buff->cb[], which is
> not large enough to hold ubuf_info_msgzc plus a vm_account.

Well, AFAICT this is using iov_iter to get the pages, and in general
iov_iter users - e.g. O_DIRECT - don't charge anything.

If this does somehow allow userspace to hold a page pinned for a long
time then it is already technically wrong, because it doesn't use
FOLL_LONGTERM.

Arguably FOLL_LONGTERM should be the key precondition to require
accounting.
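
As a sketch of what keying accounting off FOLL_LONGTERM would look like
at a pin site (illustrative only; vm_account_pinned() is from this
series, the rest is an assumption):

	/* Short-lived pin (O_DIRECT-style iov_iter use): no accounting. */
	ret = pin_user_pages_fast(addr, nr_pages, FOLL_WRITE, pages);

	/*
	 * Long-lived, userspace-controlled pin: FOLL_LONGTERM is
	 * required, and under this rule only this case would be
	 * charged.
	 */
	ret = pin_user_pages_fast(addr, nr_pages,
				  FOLL_WRITE | FOLL_LONGTERM, pages);
	if (ret > 0 && vm_account_pinned(&acct, ret)) {
		unpin_user_pages(pages, ret);
		ret = -ENOMEM;
	}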

So I wonder if it should just be deleted?

Jason

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2023-02-06 13:14 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
2023-01-24  5:42 [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
2023-01-24  6:29   ` Christoph Hellwig
2023-01-24 14:32   ` Jason Gunthorpe
2023-01-30 11:36     ` Alistair Popple
2023-01-31 14:00   ` David Hildenbrand
2023-01-24  5:42 ` [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account Alistair Popple
2023-01-24  5:55   ` Michael S. Tsirkin
2023-01-30 10:43     ` Alistair Popple
2023-01-24 14:34   ` Jason Gunthorpe
2023-01-24  5:42 ` [RFC PATCH 03/19] drivers/vdpa: Convert vdpa to use the new vm_structure Alistair Popple
2023-01-24 14:35   ` Jason Gunthorpe
2023-01-24  5:42 ` [RFC PATCH 04/19] infiniband/umem: Convert to use vm_account Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 05/19] RMDA/siw: " Alistair Popple
2023-01-24 14:37   ` Jason Gunthorpe
2023-01-24 15:22     ` Bernard Metzler
2023-01-24 15:56     ` Bernard Metzler
2023-01-30 11:34       ` Alistair Popple
2023-01-30 13:27         ` Bernard Metzler
2023-01-24  5:42 ` [RFC PATCH 06/19] RDMA/usnic: convert " Alistair Popple
2023-01-24 14:41   ` Jason Gunthorpe
2023-01-30 11:10     ` Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 07/19] vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 08/19] vfio/spapr_tce: Convert accounting to pinned_vm Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 09/19] io_uring: convert to use vm_account Alistair Popple
2023-01-24 14:44   ` Jason Gunthorpe
2023-01-30 11:12     ` Alistair Popple
2023-01-30 13:21       ` Jason Gunthorpe
2023-01-24  5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
2023-01-24 14:51   ` Jason Gunthorpe
2023-01-30 11:17     ` Alistair Popple
2023-02-06  4:36       ` Alistair Popple
2023-02-06 13:14         ` Jason Gunthorpe
2023-01-24  5:42 ` [RFC PATCH 11/19] xdp: convert to use vm_account Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 12/19] kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned() Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 13/19] fpga: dfl: afu: convert to use vm_account Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 14/19] mm: Introduce a cgroup for pinned memory Alistair Popple
2023-01-27 21:44   ` Tejun Heo
2023-01-30 13:20     ` Jason Gunthorpe
2023-01-24  5:42 ` [RFC PATCH 15/19] mm/util: Extend vm_account to charge pages against the pin cgroup Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 16/19] mm/util: Refactor account_locked_vm Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 17/19] mm: Convert mmap and mlock to use account_locked_vm Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 18/19] mm/mmap: Charge locked memory to pins cgroup Alistair Popple
2023-01-24  5:42 ` [RFC PATCH 19/19] selftests/vm: Add pins-cgroup selftest for mlock/mmap Alistair Popple
2023-01-24 18:26 ` [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Yosry Ahmed
2023-01-31  0:54   ` Alistair Popple
2023-01-31  5:14     ` Yosry Ahmed
2023-01-31 11:22       ` Alistair Popple
2023-01-31 19:49         ` Yosry Ahmed
2023-01-24 20:12 ` Jason Gunthorpe
2023-01-31 13:57   ` David Hildenbrand
2023-01-31 14:03     ` Jason Gunthorpe
2023-01-31 14:06       ` David Hildenbrand
2023-01-31 14:10         ` Jason Gunthorpe
2023-01-31 14:15           ` David Hildenbrand
2023-01-31 14:21             ` Jason Gunthorpe
