From: ebiederm@xmission.com (Eric W. Biederman)
To: Jan Kara <jack@suse.cz>
Cc: Nikolay Borisov <kernel@kyup.com>,
	john@johnmccutchan.com, eparis@redhat.com,
	linux-kernel@vger.kernel.org, gorcunov@openvz.org,
	avagin@openvz.org, netdev@vger.kernel.org,
	operations@siteground.com,
	Linux Containers <containers@lists.linux-foundation.org>
Subject: Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
Date: Thu, 02 Jun 2016 11:58:26 -0500
Message-ID: <87bn3jy1cd.fsf@x220.int.ebiederm.org>
In-Reply-To: <20160602074920.GG19636@quack2.suse.cz> (Jan Kara's message of "Thu, 2 Jun 2016 09:49:20 +0200")


Nikolay, please see my question for you at the end.

Jan Kara <jack@suse.cz> writes:

> On Wed 01-06-16 11:00:06, Eric W. Biederman wrote:
>> Cc'd the containers list.
>> 
>> Nikolay Borisov <kernel@kyup.com> writes:
>> 
>> > Currently the inotify instances/watches are accounted in the
>> > user_struct structure. This means that in setups where multiple
>> > users in unprivileged containers map to the same underlying
>> > real user (i.e. user_struct), the inotify limits are shared as
>> > well, which can lead to unpleasantness. This is a problem since
>> > any user inside any of the containers can potentially exhaust
>> > the instance/watch limits, which in turn might prevent certain
>> > services in other containers from starting.
>> 
>> On a high level this is a bit problematic, as it appears to escape the
>> current limits and allows anyone creating a user namespace to have their
>> own fresh set of limits.  Given that anyone should be able to create a
>> user namespace whenever they feel like it, escaping limits is a problem.
>> That, however, is solvable.
>> 
>> A practical question.  What kind of limits are we looking at here?
>> 
>> Are these loose limits for detecting buggy programs that have gone
>> off the rails?
>> 
>> Are these tight limits to ensure multitasking is possible?
>
> The original motivation for these limits is to limit resource usage.  There
> is an in-kernel data structure associated with each notification mark you
> create, and we don't want users to be able to DoS the system by creating
> too many of them.  Thus we limit the number of notification marks for each
> user.  There is also a limit on the number of notification instances -
> those are naturally limited by the number of open file descriptors, but an
> admin may want to limit them further...
>
> So cgroups would probably be the best fit for this, but I'm not sure
> whether that would be overkill...
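
To make the resource being limited concrete: a minimal userspace sketch
of how one loop inside any single container can exhaust the shared
per-user instance limit (/proc/sys/fs/inotify/max_user_instances,
default 128).  Note that inotify_init(2) reports EMFILE both for this
limit and for ordinary fd exhaustion:

  /*
   * Sketch: exhaust the per-user inotify instance limit.  Every
   * container whose users map to the same real uid draws from the
   * same pool, so this starves all of them at once.
   */
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/inotify.h>

  int main(void)
  {
          int n = 0;

          while (inotify_init() >= 0)
                  n++;    /* fds deliberately leaked for the demo */

          printf("failed after %d instances: %s\n", n, strerror(errno));
          return 0;
  }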

There is some level of kernel memory accounting in the memory cgroup.

That said, my experience with cgroups is that while they are good for
some things, the semantics that derive from the userspace API are
problematic.

In the cgroup model, objects in the kernel don't belong to a cgroup;
they belong to a task/process.  Those processes belong to a cgroup.
Processes under the control of a sufficiently privileged parent are
allowed to switch cgroups.  This causes implementation challenges and a
semantic mismatch in a world where things are typically considered to
have an owner.

Right now fsnotify groups (upon which all of the rest of the inotify
accounting is built) belong to a user.  So there is a semantic
mismatch with cgroups right out of the gate.
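
That ownership is literal in the data structures.  Abridged from
include/linux/fsnotify_backend.h in kernels of this era (most fields
elided):

  /* Each inotify group records the user_struct it is charged
   * against; that user_struct is where the instance and watch
   * counts live today. */
  struct fsnotify_group {
          /* ... refcount, ops, notification list, marks ... */
          union {                 /* type-specific private data */
  #ifdef CONFIG_INOTIFY_USER
                  struct inotify_group_private_data {
                          spinlock_t         idr_lock;
                          struct idr         idr;
                          struct user_struct *user;  /* the owner */
                  } inotify_data;
  #endif
                  /* ... fanotify_data ... */
          };
  };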

Given that cgroups have not chosen to account for individual kernel
objects or to offer that level of control, I think it reasonable to
look at other possible solutions, assuming the overhead can be kept
under control.

The implementation of a hierarchical counter in mm/page_counter.c
strongly suggests to me that the overhead can be kept under control.
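
Roughly, the heart of that counter looks like the following (names and
details simplified; the real code also tracks high watermarks):

  /* Hierarchical charging in the style of mm/page_counter.c:
   * one atomic add per ancestor on the charge path, unwound on
   * failure.  A handful of atomics per operation is why the
   * overhead stays small. */
  static int try_charge(struct page_counter *counter,
                        unsigned long nr_pages,
                        struct page_counter **fail)
  {
          struct page_counter *c;

          for (c = counter; c; c = c->parent) {
                  long new = atomic_long_add_return(nr_pages, &c->count);

                  if (new > c->limit) {
                          atomic_long_sub(nr_pages, &c->count);
                          *fail = c;
                          goto unwind;
                  }
          }
          return 0;

  unwind:
          for (c = counter; c != *fail; c = c->parent)
                  atomic_long_sub(nr_pages, &c->count);
          return -ENOMEM;
  }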

And yes, I am thinking of the problem space where the limit comes from
the problem domain: if an application consumes more than the limit, the
application is likely bonkers.  That does prevent a DoS on kernel
memory, but it is different from the problem I have seen cgroups solve.

The problem I have seen cgroups solve looks like this: I have 8GB of
RAM and 3 containers.  Container A can have 4GB, container B can have
1GB, and container C can have 3GB.  Then I know one container won't
push the other containers into swap.

Perhaps that is the difference between a bottom-up and a top-down
approach to coming up with limits.  DoS-prevention limits like the
inotify ones are generally written bottom-up, from the perspective of
"if you have more than X you are crazy", while cgroup limits tend to be
thought about top-down, from a total system management point of view.

So I think there is definitely something to look at.


All of that said, there is definitely a practical question that needs
to be asked.  Nikolay, how did you get into this situation?  A typical
user namespace configuration will set up uid and gid maps with the help
of a privileged program and not map the uid of the user who created the
user namespace, thus avoiding exhausting the limits of the user who
created the container.
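
For instance, this is roughly what a privileged helper such as
newuidmap(1) writes on the container's behalf (the 100000 base is
illustrative; real setups take the range from /etc/subuid):

  /* Sketch: map namespace uids 0-65535 onto a subordinate host
   * range that does NOT include the creating user's own uid, so
   * the container charges a different user_struct. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  static void write_map(pid_t pid, const char *file, const char *map)
  {
          char path[64];
          int fd;

          snprintf(path, sizeof(path), "/proc/%d/%s", (int)pid, file);
          fd = open(path, O_WRONLY);
          if (fd >= 0) {
                  write(fd, map, strlen(map));
                  close(fd);
          }
  }

  /* For a child that has called unshare(CLONE_NEWUSER):
   *   write_map(child, "uid_map", "0 100000 65536\n");
   *   write_map(child, "gid_map", "0 100000 65536\n");
   */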

That makes me personally more worried about escaping the existing
limits than about exhausting the limits of a particular user.
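
The escape is cheap.  Any unprivileged process can hand itself a fresh
user namespace, so a limit that naively resets per namespace resets on
demand.  A minimal sketch:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static void write_file(const char *path, const char *s)
  {
          int fd = open(path, O_WRONLY);

          if (fd >= 0) {
                  write(fd, s, strlen(s));
                  close(fd);
          }
  }

  int main(void)
  {
          char map[32];
          uid_t uid = getuid();   /* capture before unshare() */

          if (unshare(CLONE_NEWUSER) < 0)
                  return 1;
          snprintf(map, sizeof(map), "0 %d 1\n", (int)uid);
          /* "deny" is required before writing gid_map on >= 3.19 */
          write_file("/proc/self/setgroups", "deny\n");
          write_file("/proc/self/uid_map", map);
          /* Naive per-namespace limits would now be fresh here. */
          return 0;
  }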

Eric

