All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>, Yosry Ahmed <yosryahmed@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, jhubbard@nvidia.com,
	tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com,
	mkoutny@suse.com, daniel@ffwll.ch,
	"Daniel P . Berrange" <berrange@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Zefan Li <lizefan.x@bytedance.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory
Date: Wed, 15 Feb 2023 15:07:05 -0400	[thread overview]
Message-ID: <Y+0tWZxMUx/NZ3Ne@nvidia.com> (raw)
In-Reply-To: <Y+0rxoM4w9nilUMZ@dhcp22.suse.cz>

On Wed, Feb 15, 2023 at 08:00:22PM +0100, Michal Hocko wrote:
> On Mon 06-02-23 14:32:37, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Feb 06, 2023 at 07:40:55PM -0400, Jason Gunthorpe wrote:
> > > (a) kind of destroys the point of this as a sandboxing tool
> > > 
> > > It is not so harmful to use memory that someone else has been charged
> > > with allocating.
> > > 
> > > But it is harmful to pin memory if someone else is charged for the
> > > pin. It means it is unpredictable how much memory a sandbox can
> > > actually lock down.
> > > 
> > > Plus we have the double accounting problem, if 1000 processes in
> > > different cgroups open the tmpfs and all pin the memory then cgroup A
> > > will be charged 1000x for the memory and hit its limit, possibly
> > > creating a DOS from less priv to more priv
> > 
> > Let's hear what memcg people think about it. I'm not a fan of disassociating
> > the ownership and locker of the same page but it is true that actively
> > increasing locked consumption on a remote cgroup is awkward too.
> 
> One thing that is not really clear to me is whether those pins do
> actually have any "ownership".

In most cases the ownship traces back to a file descriptor. When the
file is closed the pin goes away.

> The interface itself doesn't talk about
> anything like that and so it seems perfectly fine to unpin from a
> completely different context then pinning. 

Yes, concievably the close of the FD can be in a totally different
process with a different cgroup.

> If there is no enforcement then Tejun is right and relying on memcg
> ownership is likely the only reliable way to use for tracking. The
> downside is sharing obviously but this is the same problem we
> already do deal with with shared pages.

I think this does not work well because the owner in a memcg sense is
unrelated to the file descriptor which is the true owner.

So we can get cases where the pin is charged to the wrong cgroup which
is effectively fatal for sandboxing, IMHO.

> Another thing that is not really clear to me is how the limit is
> actually going to be used in practice. As there is no concept of a
> reclaim for pins then I can imagine that it would be quite easy to
> reach the hard limit and essentially DoS any further use of pins. 

Yes, that is the purpose. It is to sandbox pin users to put some limit
on the effect they have on the full machine.

It replaces the rlimit mess that was doing the same thing.

> Cross cgroup pinning would make it even worse because it could
> become a DoS vector very easily. Practically speaking what tends to
> be a corner case in the memcg limit world would be norm for pin
> based limit.

This is why the cgroup charged for the pin must be tightly linked to
some cgroup that is obviously connected to the creator of the FD
owning the pin.

Jason

WARNING: multiple messages have this Message-ID (diff)
From: Jason Gunthorpe <jgg-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
To: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Alistair Popple <apopple-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jhubbard-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org,
	tjmercier-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org,
	surenb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	mkoutny-IBi9RG/b67k@public.gmane.org,
	daniel-/w4YWyX8dFk@public.gmane.org,
	"Daniel P . Berrange"
	<berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Alex Williamson
	<alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Zefan Li <lizefan.x-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>,
	Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory
Date: Wed, 15 Feb 2023 15:07:05 -0400	[thread overview]
Message-ID: <Y+0tWZxMUx/NZ3Ne@nvidia.com> (raw)
In-Reply-To: <Y+0rxoM4w9nilUMZ-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

On Wed, Feb 15, 2023 at 08:00:22PM +0100, Michal Hocko wrote:
> On Mon 06-02-23 14:32:37, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Feb 06, 2023 at 07:40:55PM -0400, Jason Gunthorpe wrote:
> > > (a) kind of destroys the point of this as a sandboxing tool
> > > 
> > > It is not so harmful to use memory that someone else has been charged
> > > with allocating.
> > > 
> > > But it is harmful to pin memory if someone else is charged for the
> > > pin. It means it is unpredictable how much memory a sandbox can
> > > actually lock down.
> > > 
> > > Plus we have the double accounting problem, if 1000 processes in
> > > different cgroups open the tmpfs and all pin the memory then cgroup A
> > > will be charged 1000x for the memory and hit its limit, possibly
> > > creating a DOS from less priv to more priv
> > 
> > Let's hear what memcg people think about it. I'm not a fan of disassociating
> > the ownership and locker of the same page but it is true that actively
> > increasing locked consumption on a remote cgroup is awkward too.
> 
> One thing that is not really clear to me is whether those pins do
> actually have any "ownership".

In most cases the ownship traces back to a file descriptor. When the
file is closed the pin goes away.

> The interface itself doesn't talk about
> anything like that and so it seems perfectly fine to unpin from a
> completely different context then pinning. 

Yes, concievably the close of the FD can be in a totally different
process with a different cgroup.

> If there is no enforcement then Tejun is right and relying on memcg
> ownership is likely the only reliable way to use for tracking. The
> downside is sharing obviously but this is the same problem we
> already do deal with with shared pages.

I think this does not work well because the owner in a memcg sense is
unrelated to the file descriptor which is the true owner.

So we can get cases where the pin is charged to the wrong cgroup which
is effectively fatal for sandboxing, IMHO.

> Another thing that is not really clear to me is how the limit is
> actually going to be used in practice. As there is no concept of a
> reclaim for pins then I can imagine that it would be quite easy to
> reach the hard limit and essentially DoS any further use of pins. 

Yes, that is the purpose. It is to sandbox pin users to put some limit
on the effect they have on the full machine.

It replaces the rlimit mess that was doing the same thing.

> Cross cgroup pinning would make it even worse because it could
> become a DoS vector very easily. Practically speaking what tends to
> be a corner case in the memcg limit world would be norm for pin
> based limit.

This is why the cgroup charged for the pin must be tightly linked to
some cgroup that is obviously connected to the creator of the FD
owning the pin.

Jason

  reply	other threads:[~2023-02-15 19:07 UTC|newest]

Thread overview: 128+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-06  7:47 [PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Alistair Popple
2023-02-06  7:47 ` Alistair Popple
2023-02-06  7:47 ` [PATCH 01/19] mm: Introduce vm_account Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 02/19] drivers/vhost: Convert to use vm_account Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 03/19] drivers/vdpa: Convert vdpa to use the new vm_structure Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 04/19] infiniband/umem: Convert to use vm_account Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 05/19] RMDA/siw: " Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-12 17:32   ` Bernard Metzler
2023-02-06  7:47 ` [PATCH 06/19] RDMA/usnic: convert " Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 07/19] vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 08/19] vfio/spapr_tce: Convert accounting to pinned_vm Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 09/19] io_uring: convert to use vm_account Alistair Popple
2023-02-06 15:29   ` Jens Axboe
2023-02-06 15:29     ` Jens Axboe
2023-02-07  1:03     ` Alistair Popple
2023-02-07  1:03       ` Alistair Popple
2023-02-07 14:28       ` Jens Axboe
2023-02-07 14:55         ` Jason Gunthorpe
2023-02-07 14:55           ` Jason Gunthorpe
2023-02-07 17:05           ` Jens Axboe
2023-02-07 17:05             ` Jens Axboe
2023-02-13 11:30             ` Alistair Popple
2023-02-13 11:30               ` Alistair Popple
2023-02-06  7:47 ` [PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
2023-02-06  7:47 ` [PATCH 11/19] xdp: convert to use vm_account Alistair Popple
2023-02-06  7:47 ` [PATCH 12/19] kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned() Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 13/19] fpga: dfl: afu: convert to use vm_account Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 14/19] mm: Introduce a cgroup for pinned memory Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06 21:01   ` Yosry Ahmed
2023-02-06 21:01     ` Yosry Ahmed
2023-02-06 21:14   ` Tejun Heo
2023-02-06 21:14     ` Tejun Heo
2023-02-06 22:32     ` Yosry Ahmed
2023-02-06 22:32       ` Yosry Ahmed
2023-02-06 22:36       ` Tejun Heo
2023-02-06 22:39         ` Yosry Ahmed
2023-02-06 22:39           ` Yosry Ahmed
2023-02-06 23:25           ` Tejun Heo
2023-02-06 23:25             ` Tejun Heo
2023-02-06 23:34             ` Yosry Ahmed
2023-02-06 23:34               ` Yosry Ahmed
2023-02-06 23:40             ` Jason Gunthorpe
2023-02-06 23:40               ` Jason Gunthorpe
2023-02-07  0:32               ` Tejun Heo
2023-02-07  0:32                 ` Tejun Heo
2023-02-07 12:19                 ` Jason Gunthorpe
2023-02-07 12:19                   ` Jason Gunthorpe
2023-02-15 19:00                 ` Michal Hocko
2023-02-15 19:00                   ` Michal Hocko
2023-02-15 19:07                   ` Jason Gunthorpe [this message]
2023-02-15 19:07                     ` Jason Gunthorpe
2023-02-16  8:04                     ` Michal Hocko
2023-02-16  8:04                       ` Michal Hocko
2023-02-16 12:45                       ` Jason Gunthorpe
2023-02-16 12:45                         ` Jason Gunthorpe
2023-02-21 16:51                         ` Tejun Heo
2023-02-21 16:51                           ` Tejun Heo
2023-02-21 17:25                           ` Jason Gunthorpe
2023-02-21 17:29                             ` Tejun Heo
2023-02-21 17:29                               ` Tejun Heo
2023-02-21 17:51                               ` Jason Gunthorpe
2023-02-21 17:51                                 ` Jason Gunthorpe
2023-02-21 18:07                                 ` Tejun Heo
2023-02-21 18:07                                   ` Tejun Heo
2023-02-21 19:26                                   ` Jason Gunthorpe
2023-02-21 19:26                                     ` Jason Gunthorpe
2023-02-21 19:45                                     ` Tejun Heo
2023-02-21 19:45                                       ` Tejun Heo
2023-02-21 19:49                                       ` Tejun Heo
2023-02-21 19:49                                         ` Tejun Heo
2023-02-21 19:57                                       ` Jason Gunthorpe
2023-02-22 11:38                                         ` Alistair Popple
2023-02-22 11:38                                           ` Alistair Popple
2023-02-22 12:57                                           ` Jason Gunthorpe
2023-02-22 12:57                                             ` Jason Gunthorpe
2023-02-22 22:59                                             ` Alistair Popple
2023-02-22 22:59                                               ` Alistair Popple
2023-02-23  0:05                                               ` Christoph Hellwig
2023-02-23  0:35                                                 ` Alistair Popple
2023-02-23  0:35                                                   ` Alistair Popple
2023-02-23  1:53                                               ` Jason Gunthorpe
2023-02-23  1:53                                                 ` Jason Gunthorpe
2023-02-23  9:12                                                 ` Daniel P. Berrangé
2023-02-23 17:31                                                   ` Jason Gunthorpe
2023-02-23 17:31                                                     ` Jason Gunthorpe
2023-02-23 17:18                                                 ` T.J. Mercier
2023-02-23 17:28                                                   ` Jason Gunthorpe
2023-02-23 17:28                                                     ` Jason Gunthorpe
2023-02-23 18:03                                                     ` Yosry Ahmed
2023-02-23 18:10                                                       ` Jason Gunthorpe
2023-02-23 18:10                                                         ` Jason Gunthorpe
2023-02-23 18:14                                                         ` Yosry Ahmed
2023-02-23 18:14                                                           ` Yosry Ahmed
2023-02-23 18:15                                                         ` Tejun Heo
2023-02-23 18:17                                                           ` Jason Gunthorpe
2023-02-23 18:17                                                             ` Jason Gunthorpe
2023-02-23 18:22                                                             ` Tejun Heo
2023-02-23 18:22                                                               ` Tejun Heo
2023-02-07  1:00           ` Waiman Long
2023-02-07  1:00             ` Waiman Long
2023-02-07  1:03             ` Tejun Heo
2023-02-07  1:50               ` Alistair Popple
2023-02-07  1:50                 ` Alistair Popple
2023-02-06  7:47 ` [PATCH 15/19] mm/util: Extend vm_account to charge pages against the pin cgroup Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 16/19] mm/util: Refactor account_locked_vm Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 17/19] mm: Convert mmap and mlock to use account_locked_vm Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06  7:47 ` [PATCH 18/19] mm/mmap: Charge locked memory to pins cgroup Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-06 21:12   ` Yosry Ahmed
2023-02-06  7:47 ` [PATCH 19/19] selftests/vm: Add pins-cgroup selftest for mlock/mmap Alistair Popple
2023-02-06  7:47   ` Alistair Popple
2023-02-16 11:01 ` [PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory David Hildenbrand
2023-02-16 11:01   ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y+0tWZxMUx/NZ3Ne@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=apopple@nvidia.com \
    --cc=berrange@redhat.com \
    --cc=cgroups@vger.kernel.org \
    --cc=daniel@ffwll.ch \
    --cc=hannes@cmpxchg.org \
    --cc=jhubbard@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lizefan.x@bytedance.com \
    --cc=mhocko@suse.com \
    --cc=mkoutny@suse.com \
    --cc=surenb@google.com \
    --cc=tj@kernel.org \
    --cc=tjmercier@google.com \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.