From: Alistair Popple <apopple@nvidia.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>, Michal Hocko <mhocko@suse.com>,
	Yosry Ahmed <yosryahmed@google.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, jhubbard@nvidia.com,
	tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com,
	mkoutny@suse.com, daniel@ffwll.ch,
	"Daniel P . Berrange" <berrange@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Zefan Li <lizefan.x@bytedance.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory
Date: Wed, 22 Feb 2023 22:38:25 +1100
Message-ID: <87o7pmnd0p.fsf@nvidia.com>
In-Reply-To: <Y/UiQmuVwh2eqrfA@nvidia.com>


Jason Gunthorpe <jgg@nvidia.com> writes:

>> the owner and pinner disagreeing with each other. At least
>> conceptually, the solution is rather straight-forward - whoever pins
>> a page should also claim the ownership of it.
>
> If the answer is pinner is owner, then multi-pinners must mean
> multi-owner too. We probably can't block multi-pinner without causing
> uAPI problems.

It seems the problem is how to track multiple pinners of the page. At
the moment memcg ownership is stored in folio->memcg_data which
basically points to the owning struct mem_cgroup *.
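
For reference, folio->memcg_data is an unsigned long carrying that pointer
plus a couple of flag bits in its low bits, roughly like this (paraphrasing
include/linux/memcontrol.h around the time of this series, so treat the
exact values as indicative rather than authoritative):

enum page_memcg_data_flags {
	/* page->memcg_data is a pointer to an objcgs vector */
	MEMCG_DATA_OBJCGS = (1UL << 0),
	/* page has been accounted as a non-slab kernel page */
	MEMCG_DATA_KMEM = (1UL << 1),
	/* the next bit after the last actual flag */
	__NR_MEMCG_DATA_FLAGS = (1UL << 2),
};

/* Low bits reserved for the flags above; the rest is the pointer. */
#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)

Pointer alignment leaves at least one more low bit unused, which is the
spare flag I refer to below.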

For pinning this series already introduces this data structure:

struct vm_account {
	struct task_struct *task;
	struct mm_struct *mm;
	struct user_struct *user;
	enum vm_account_flags flags;
};

We could modify it to something like:

struct vm_account {
       struct list_head possible_pinners;
       struct mem_cgroup *memcg;
       [...]
};

When a page is pinned the first pinner takes ownership and stores its
memcg there, updating memcg_data to point to it. This would require a
new page_memcg_data_flags entry but I think we have one bit left.
Subsequent pinners create a vm_account and add it to the pinners list.
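
A very rough sketch of what claiming ownership at pin time could look
like (MEMCG_DATA_PINNED and vm_account_claim_folio() are made-up names
for illustration only, and locking/refcounting is elided):

#define MEMCG_DATA_PINNED	__NR_MEMCG_DATA_FLAGS

static void vm_account_claim_folio(struct vm_account *acct,
				   struct folio *folio)
{
	if (!(folio->memcg_data & MEMCG_DATA_PINNED)) {
		/* First pinner: take ownership of the folio's charge. */
		acct->memcg = get_mem_cgroup_from_mm(acct->mm);
		folio->memcg_data = (unsigned long)acct | MEMCG_DATA_PINNED;
	} else {
		/* Subsequent pinner: queue as a possible future owner. */
		struct vm_account *owner = (struct vm_account *)
			(folio->memcg_data & ~MEMCG_DATA_FLAGS_MASK);

		list_add_tail(&acct->possible_pinners,
			      &owner->possible_pinners);
	}
}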

When a driver unpins a page we scan the pinners list and assign
ownership to the next driver pinning the page by updating memcg_data and
removing the vm_account from the list.
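
Something like this on the unpin side, assuming the caller is the current
owner (again hypothetical helpers, with the charge transfer between memcgs
and the fallback to normal memcg ownership left out):

static void vm_account_release_folio(struct vm_account *acct,
				     struct folio *folio)
{
	struct vm_account *next;

	if (list_empty(&acct->possible_pinners)) {
		/* Last pinner: fall back to normal memcg ownership (elided). */
		mem_cgroup_put(acct->memcg);
		return;
	}

	/* Hand ownership to the next pinner waiting on the list. */
	next = list_first_entry(&acct->possible_pinners,
				struct vm_account, possible_pinners);
	list_del(&next->possible_pinners);
	folio->memcg_data = (unsigned long)next | MEMCG_DATA_PINNED;
}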

The problem with this approach is that each pinner (i.e. each struct
vm_account) may cover a different subset of pages. Drivers have to store
a list of pinned pages somewhere, so we could either query drivers or
store the list of pinned pages in the vm_account. That seems like a fair
bit of overhead though, and would make unpinning expensive as we'd have
to traverse several lists.

We'd also have to ensure possible owners have enough free memory in
their memcg to accept the charge when another pinner unpins the page.
That could be done by reserving space at pin time.
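
For example the existing page_counter primitives could be used to reserve
against a possible owner's memcg when it joins the pinners list, so the
later handover can't fail (whether charging memcg->memory directly like
this at pin time is acceptable is an open question):

static int vm_account_reserve(struct mem_cgroup *memcg, unsigned long nr_pages)
{
	struct page_counter *fail;

	/* Pre-charge so a future ownership transfer cannot exceed the limit. */
	if (!page_counter_try_charge(&memcg->memory, nr_pages, &fail))
		return -ENOMEM;

	return 0;
}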

And of course it only works for pin_user_pages() - other users don't
always have a list of pages conveniently available, although I suppose
they could walk the rmap, but again that adds overhead. So I'm not sure
it's a great solution, but I figured I'd leave it here in case it
triggers other ideas.

> You are not wrong on any of these remarks, but this looses sight of
> the point - it is take the existing broken RLIMIT scheme and make it
> incrementally better by being the same broken scheme just with
> cgroups.

Right. RLIMIT_MEMLOCK is pretty broken because most uses enforce it
against a specific task so it can be easily bypassed. The aim here was
to make it at least possible to enforce a meaningful limit.

> If we eventually fix everything so memcg can do multi-pinners/owners
> then would it be reasonable to phase out the new pincg at that time?
>
> Jason

