On 14 April 2016 at 01:33, Greg <gkubok@gmail.com> wrote:
On 13 April 2016 at 22:36, Paul Moore <paul@paul-moore.com> wrote:
On Wed, Apr 13, 2016 at 3:31 PM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
> Looking at the sel_netport_sid_slow() code, I don't see why we need to
> fail hard if the kzalloc() of the new node fails; we just won't be able
> to cache the result but can still obtain and return the SID.  Paul, do
> you see any reason we can't just move up the call to security_port_sid()
> and only fail if that fails?  In the common case, that won't require
> memory allocation at all (only needs it if it hasn't previously mapped
> the resulting context to a SID).  In the same manner, we do not fail
> just because we cannot add an entry to the AVC.  The same should apply
> to netnode and netif caches.

Sounds reasonable to me; I'll post an RFC patch in just a minute. It
will be RFC because it will be completely untested; once I've had a
chance to test it I'll merge it.
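
To give an idea of the shape I have in mind, below is a rough, completely
untested sketch. I'm writing it from memory of security/selinux/netport.c,
so treat the field names and the exact warning text as approximations
rather than the final patch:

static int sel_netport_sid_slow(u8 protocol, u16 pnum, u32 *sid)
{
        int ret;
        struct sel_netport *port;
        struct sel_netport *new;

        spin_lock_bh(&sel_netport_lock);
        port = sel_netport_find(protocol, pnum);
        if (port != NULL) {
                *sid = port->psec.sid;
                spin_unlock_bh(&sel_netport_lock);
                return 0;
        }

        /* resolve the SID first; this is the only hard failure */
        ret = security_port_sid(protocol, pnum, sid);
        if (ret != 0)
                goto out;

        /* caching is best effort; a failed allocation just means
         * we don't add a cache entry this time */
        new = kzalloc(sizeof(*new), GFP_ATOMIC);
        if (new) {
                new->psec.port = pnum;
                new->psec.protocol = protocol;
                new->psec.sid = *sid;
                sel_netport_insert(new);
        }

out:
        spin_unlock_bh(&sel_netport_lock);
        if (unlikely(ret))
                pr_warn("SELinux: failure in %s(), unable to determine network port label\n",
                        __func__);
        return ret;
}

The netnode and netif caches would get the same treatment.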

> That said, it seems hard to envision that kzalloc() failing and being
> able to complete the bind() operation at all, but maybe it is a
> transient failure.

Agreed, although it might be that the rest of the allocations in the
network stack are done with something other than GFP_ATOMIC; I'm not
sure off the top of my head.

It doesn't appear to be a memory constraint. My initial report was from an EC2 instance with 4 GiB of memory, a "t2.medium" (with plenty left over: JupyterHub's footprint is minimal, and roughly 3.1 GiB is still free after everything has started). I also just tried on a new "m4.4xlarge" with 64 GiB of memory, same problem.

It's very difficult to replicate outside of the first run of the two Ansible roles on a fresh instance. On one instance I was able to reproduce it four more times after rebooting, either by re-running just the last couple of tasks from role 1 and the first task from role 2, or by running the equivalent commands directly (*). But I can't reproduce it on that instance anymore (despite more reboots, resetting the SELinux booleans, etc.), and I can't reproduce it on any other identical instance either. In short: it's 100% reproducible when I build a brand-new instance and run the two roles for the first time (with no additional waits), but very hard to trigger after that.

(*) I ran something like the following, where the first parenthesized group corresponds to the last two tasks of role 1 and the second to the first task of role 2; running them in parallel simulates the second role starting while the first is still finishing, or at least while systemd is still starting the service:

    (sss_cache -E && systemctl restart sssd oddjobd; systemctl restart jupyterhub) & (setsebool -P httpd_can_network_connect 1 && setsebool -P httpd_can_network_relay 1)

Initially I thought SSSD might have something to do with this, but I'm not sure anymore.

I also added "strace" to the systemd unit definition for JupyterHub and spun up new instances, but since "strace" adds quite a lot of overhead (and therefore delay), everything suddenly just works, so unfortunately I couldn't witness the ENOMEM in a trace. As a sanity check I did yet another build with "strace" removed again, and the issue came back.
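
For reference, the strace hook is just a drop-in override along these lines (the unit path and the ExecStart line are from my setup, so treat them as placeholders):

    # /etc/systemd/system/jupyterhub.service.d/strace.conf
    [Service]
    # the empty ExecStart= clears the original command before wrapping it in strace
    ExecStart=
    ExecStart=/usr/bin/strace -f -tt -o /var/log/jupyterhub.strace /usr/bin/jupyterhub

followed by a "systemctl daemon-reload" and a restart of the service.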

If you change it to a soft failure, I assume that will save the bind() in user space and it will silently succeed, but based on the above, whatever triggers this in the first place might be something other than a simple lack of system memory.

Just a follow-up: I built 7 more instances, with "strace" in the systemd unit. Two ended up fine (the app bound and is listening, no sel_netport_sid_slow failure), while five hit the failure and the app didn't bind.

I compared the traces of bad and good runs: everything is identical up to the bind() call, which simply returns ENOMEM in the failing run. So nothing new here, but at least I can say that the user-space context is practically the same in the failing and successful cases, and everything hinges on the bind() syscall and the kernel (and the inexplicable ENOMEM that must be coming from one of the k*alloc calls somewhere along the way).

Thanks,

Greg.