Subject: Re: Failure in sel_netport_sid_slow()
From: Greg <gkubok@gmail.com>
To: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>, selinux@tycho.nsa.gov
Date: Thu, 14 Apr 2016 11:07:35 +0100

On 14 April 2016 at 01:33, Greg <gkubok@gmail.com> wrote:
> On 13 April 2016 at 22:36, Paul Moore <paul@paul-moore.com> wrote:
>
>> On Wed, Apr 13, 2016 at 3:31 PM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
>> > Looking at the sel_netport_sid_slow() code, I don't see why we need to
>> > fail hard if the kzalloc() of the new node fails; we just won't be able
>> > to cache the result but can still obtain and return the SID. Paul, do
>> > you see any reason we can't just move up the call to security_port_sid()
>> > and only fail if that fails? In the common case, that won't require
>> > memory allocation at all (it only needs it if it hasn't previously mapped
>> > the resulting context to a SID). In the same manner, we do not fail
>> > just because we cannot add an entry to the AVC. The same should apply
>> > to the netnode and netif caches.
>>
>> Sounds reasonable to me, I'll post an RFC patch in just a minute ... it
>> will be RFC because it will be completely untested; once I've had a
>> chance to test it I'll merge it.
>>
>> > That said, it seems hard to envision that kzalloc() failing and being
>> > able to complete the bind() operation at all, but maybe it is a
>> > transient failure.
>>
>> Agreed, although it might be that the rest of the allocations in the
>> network stack are done with something other than GFP_ATOMIC, not sure
>> off the top of my head.
>
> It doesn't appear to be a memory constraint. My initial report was from an
> EC2 instance with 4GiB of memory, a "t2.medium" (with plenty left over:
> JupyterHub's footprint is minimal, and overall ~3.1GiB is still spare after
> everything is started). I also just tried on a new "m4.4xlarge" with 64GiB
> of memory, same problem.
>
> It's very difficult to replicate outside of the first run of the 2 Ansible
> roles on a fresh instance. On one instance, I was able to replicate it 4
> more times after rebooting, either by running just the last couple of
> tasks from role 1 and the first task from role 2, or by running equivalent
> commands directly (*). But I can't replicate it anymore (despite more
> reboots, resetting SELinux booleans, etc.), and I'm unable to replicate it
> on any other identical instances. So it's 100% repeatable when I build a
> brand-new instance and run the 2 roles for the first time (without any
> additional waits), but after that it's very hard to trigger.
>
> (*) I ran something like the following, where the first parentheses pair
> corresponds to the last 2 tasks of role 1 and the second pair to the first
> task of role 2; I run them in parallel to simulate the second role starting
> while the first is still finishing, or at least while systemd is still
> starting the service:
>
>   (sss_cache -E && systemctl restart sssd oddjobd; systemctl restart jupyterhub) & (setsebool -P httpd_can_network_connect 1 && setsebool -P httpd_can_network_relay 1)
>
> Initially I thought SSSD might have something to do with this, but I'm not
> sure anymore.
>
> I also added "strace" to the systemd unit definition for JupyterHub and
> spun up new instances, but since "strace" adds quite a lot of overhead and
> therefore time/delay, everything suddenly just works, so unfortunately I
> can't witness the ENOMEM in a trace. For sanity I tried yet another build
> with "strace" removed again, and the issue is back.
>
> If you're changing it to a soft failure, that will save the bind in user
> space and I'm assuming it will silently succeed, but based on the above,
> the root cause that triggers this in the first place might be something
> other than not enough system memory.
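For reference, a minimal sketch of the soft-failure rework described above,
assuming the function keeps its current netport.c shape (lookup under
sel_netport_lock, then a GFP_ATOMIC kzalloc() for the cache node). This is
untested and the details may well differ from Paul's actual RFC patch; the
point is just that security_port_sid() is called first and only its failure
is propagated, while the cache insert becomes best effort:

static int sel_netport_sid_slow(u8 protocol, u16 pnum, u32 *sid)
{
	int ret;
	struct sel_netport *port;
	struct sel_netport *new;

	spin_lock_bh(&sel_netport_lock);
	port = sel_netport_find(protocol, pnum);
	if (port != NULL) {
		*sid = port->psec.sid;
		spin_unlock_bh(&sel_netport_lock);
		return 0;
	}

	/* Resolve the SID first; this is the only failure that matters. */
	ret = security_port_sid(protocol, pnum, sid);
	if (ret != 0)
		goto out;

	/* Caching is best effort: if the allocation fails, skip the cache
	 * insert but still return the SID obtained above. */
	new = kzalloc(sizeof(*new), GFP_ATOMIC);
	if (new != NULL) {
		new->psec.port = pnum;
		new->psec.protocol = protocol;
		new->psec.sid = *sid;
		sel_netport_insert(new);
	}

out:
	spin_unlock_bh(&sel_netport_lock);
	if (unlikely(ret))
		printk(KERN_WARNING
		       "SELinux: failure in sel_netport_sid_slow(),"
		       " unable to determine network port label\n");
	return ret;
}

Presumably the same pattern would carry over to the netnode and netif
caches, per Stephen's note above.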
Just a follow-up. I built 7 more instances, with "strace" in the systemd
unit. 2 ended up fine (app bound and listening, no sel_netport_sid_slow()
failure), while 5 ended with the failure and the app didn't bind.

I compared traces of bad and good runs and everything is the same up to the
point of bind(), which just returns ENOMEM in the failing run. So nothing
new here, but at least I can tell that the user space contexts of failure
and success are pretty much identical, and it all hinges on the bind()
syscall and the kernel (and the inexplicable perceived lack of kernel
memory that must be returned by one of the k*allocs somewhere along the
way).

Thanks,

Greg.