From: Michal Hocko <mhocko@suse.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>
Cc: linux-media@vger.kernel.org, linux-kernel@vger.kernel.org,
	intel-gfx@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
	nouveau@lists.freedesktop.org, linux-tegra@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	dri-devel@lists.freedesktop.org
Subject: Re: [RFC] Per file OOM-badness / RSS once more
Date: Fri, 24 Jun 2022 11:59:10 +0200
Message-ID: <YrWK7pwZP3K2vbye@dhcp22.suse.cz>
In-Reply-To: <20220624080444.7619-1-christian.koenig@amd.com>

On Fri 24-06-22 10:04:30, Christian König wrote:
> Hello everyone,
> 
> To summarize the issue I'm trying to address here: Processes can allocate
> resources through a file descriptor without being held responsible for them.
> 
> I'm not explaining all the details again. See here for a more detailed
> description of the problem: https://lwn.net/ml/linux-kernel/20220531100007.174649-1-christian.koenig@amd.com/
> 
> With this iteration I'm trying to address a bunch of the comments Michal Hocko
> (thanks a lot for that) gave, as well as bring in some new ideas.
> 
> Changes made so far:
> 1. Renamed the callback to file_rss(). This is at least a start to better
>    describe what this is all about. I've been going back and forth over the
>    naming here; if you have a better idea please speak up.
> 
> 2. Cleanups, e.g. now providing a helper function in the fs layer to sum up
>    all the pages allocated by the files in a file descriptor table.
> 
> 3. Using the actual number of allocated pages for the shmem implementation
>    instead of just the size. I also tried to ignore shmem files which are part
>    of tmpfs, because that has a separate accounting/limitation approach.
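
Just to make sure I am reading point 2 correctly, below is roughly what I
picture that fd-table helper doing. This is only my own illustrative sketch;
the naming, the locking and the placement of the file_rss() hook in
file_operations are my assumptions, not the actual code from the series:

#include <linux/fdtable.h>
#include <linux/fs.h>
#include <linux/rcupdate.h>

static unsigned long files_rss(struct files_struct *files)
{
	struct fdtable *fdt;
	unsigned long pages = 0;
	unsigned int i;

	rcu_read_lock();
	fdt = files_fdtable(files);
	for (i = 0; i < fdt->max_fds; i++) {
		/* walk every open file and add up whatever it reports */
		struct file *file = rcu_dereference_raw(fdt->fd[i]);

		if (file && file->f_op->file_rss)
			pages += file->f_op->file_rss(file);
	}
	rcu_read_unlock();

	return pages;
}

If that is the rough shape of it, then the interesting part is entirely in
what each file_rss() implementation chooses to report.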

OK, this is better than the original approach, but I am afraid there are
still holes in it. I am not sure your i_count hack is correct, but that is
mostly an implementation detail. The scheme will over-account memory-mapped
files (including memfd); how much that matters will vary from workload to
workload.
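
To spell out the over-accounting case with a purely illustrative userspace
example of my own (not taken from the series): a mapped memfd shows up in
the process RSS through its page tables and, assuming the per-file number
is derived from the pages allocated in the shmem inode, the very same pages
show up again behind the fd, so adding the two together counts that memory
twice.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 256UL << 20;		/* 256 MiB of shmem */
	int fd = memfd_create("mapped-shmem", 0);
	char *p;

	if (fd < 0 || ftruncate(fd, size))
		return 1;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0x55, size);	/* fault all pages in */

	/*
	 * The 256 MiB is now visible twice: as this process's VmRSS (the
	 * pages are mapped into its page tables) and as pages allocated
	 * in the shmem file behind fd.
	 */
	printf("compare VmRSS in /proc/%d/status with the memfd size\n",
	       getpid());
	pause();
	return 0;
}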

In global OOM situations it is very likely that there will be barely any
disk-based page cache left, as it will have been reclaimed by the time the
oom killer is invoked, so that part should be OK. Swap-backed page cache
(shmem and its users) is trickier. It is swap bound, and processes which
map it will get "charged" in the form of swap entries, while those which
rely on read/write will simply escape the sight of the oom killer, no
matter how much memory they own via their shmem-backed fd. This sounds
rather serious to me, and I hope I haven't missed anything subtle that
would keep those pages visible. In any case, this is something to document
very carefully.
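
For completeness, the kind of consumer I have in mind here, again just my
own illustration: it owns a lot of shmem purely through read/write on a
memfd, nothing is ever mapped, so none of it shows up in its page tables
or RSS, and once those pages are pushed out to swap there is little left
that points back at this task.

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t chunk = 1UL << 20;		/* write 1 MiB at a time */
	size_t total = 256UL << 20;		/* 256 MiB of shmem */
	char *buf = malloc(chunk);
	int fd = memfd_create("unmapped-shmem", 0);
	size_t off;

	if (!buf || fd < 0)
		return 1;

	memset(buf, 0x55, chunk);
	for (off = 0; off < total; off += chunk)
		if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk)
			return 1;

	/* The shmem pages live only behind fd, not in this task's RSS. */
	pause();
	return 0;
}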

For the memcg OOM this gets even trickier. Files can be shared among tasks
across memcgs, which is not really straightforward from the userspace POV
because it is not strictly deterministic: first-one-first-charged logic
applies, so a lot can depend on timing. It could also easily mean that a
large part of the in-memory state of the file lies outside the reclaim,
and therefore OOM, scope of the memcg which is hitting its hard limit.
That could result in tasks being killed just because they (co)operate on a
large file outside of their memcg domain. To be honest I am not sure how
big a problem this would be in practice, and the existing behavior has its
own cons, so to me it sounds like trading one set of deficiencies for
another.

As we have discussed previously, there is unlikely to be a great solution
here, but you a) need to document the most prominent downsides so that
people can at least see this is understood and documented behavior, and
b) need to think about the runaway situation with non-mapped shmem
mentioned above and see whether there is something we can do about it.
-- 
Michal Hocko
SUSE Labs


Thread overview: 63+ messages
2022-06-24  8:04 [RFC] Per file OOM-badness / RSS once more Christian König
2022-06-24  8:04 ` [PATCH 01/14] fs: add per file RSS Christian König
2022-06-24  8:04 ` [PATCH 02/14] oom: take per file RSS into account Christian König
2022-06-24  8:04 ` [PATCH 03/14] proc: expose per file RSS Christian König
2022-06-24  8:04 ` [PATCH 04/14] mm: shmem: provide RSS for shmem files Christian König
2022-06-24  8:04 ` [PATCH 05/14] dma-buf: provide file RSS for DMA-buf files Christian König
2022-06-24  8:04 ` [PATCH 06/14] drm/gem: adjust per file RSS on handling buffers Christian König
2022-06-24  8:04 ` [PATCH 07/14] drm/gma500: use drm_file_rss Christian König
2022-06-24  8:04 ` [PATCH 08/14] drm/amdgpu: use drm_file_rss Christian König
2022-06-24  8:04 ` [PATCH 09/14] drm/radeon: use drm_oom_badness Christian König
2022-06-28 10:34   ` Michel Dänzer
2022-06-24  8:04 ` [PATCH 10/14] drm/i915: use drm_file_rss Christian König
2022-06-24  8:04 ` [PATCH 11/14] drm/nouveau: use drm_file_rss Christian König
2022-06-24  8:04 ` [PATCH 12/14] drm/omap: use drm_file_rss Christian König
2022-06-24  8:04 ` [PATCH 13/14] drm/vmwgfx: use drm_file_rss Christian König
2022-06-24  8:04 ` [PATCH 14/14] drm/tegra: use drm_file_rss Christian König
2022-06-24  9:59 ` Michal Hocko [this message]
2022-06-24 10:23 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/14] fs: add per file RSS Patchwork
2022-06-24 10:23 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-06-24 10:47 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
