* [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism @ 2005-12-14 9:12 Sridhar Samudrala 2005-12-14 9:22 ` Andi Kleen 2005-12-14 20:16 ` Jesper Juhl 0 siblings, 2 replies; 46+ messages in thread From: Sridhar Samudrala @ 2005-12-14 9:12 UTC (permalink / raw) To: linux-kernel, netdev This set of patches provides a TCP/IP emergency communication mechanism that could be used to guarantee that high-priority communications over a critical socket succeed even under very low memory conditions that last for a couple of minutes. It uses the critical page pool facility provided by Matt's patches that he posted recently on lkml. http://lkml.org/lkml/2005/12/14/34/index.html This mechanism provides a new socket option SO_CRITICAL that can be used to mark a socket as critical. A critical connection used for emergency communications has to be established and marked as critical before we enter the emergency condition. It uses the __GFP_CRITICAL flag introduced in the critical page pool patches to mark an allocation request as critical, so that it is satisfied from the critical page pool if required. In the send path, this flag is passed with all allocation requests that are made for a critical socket. But in the receive path we do not know whether a packet is critical until we receive it and find the socket that it is destined for. So we treat all the allocation requests in the receive path as critical. The critical page pool patches also introduce a global flag 'system_in_emergency' that is used to indicate an emergency situation (such as a low-memory condition). When this flag is set, any incoming packets that belong to non-critical sockets are dropped as soon as possible in the receive path. This is necessary to prevent incoming non-critical packets from consuming memory from the critical page pool. I would appreciate any feedback or comments on this approach. Thanks Sridhar ^ permalink raw reply [flat|nested] 46+ messages in thread
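The allocation policy described above can be sketched as a small userspace toy model. The names (GFP_CRITICAL, the pool sizes) are illustrative only, not the actual patch code: ordinary memory serves every request, and only requests flagged critical may dip into the pre-allocated reserve.

```c
/* Toy userspace model of the critical page pool behavior described
 * above. Names and pool sizes are illustrative; this is not the
 * kernel patch code. */
#include <assert.h>

#define GFP_CRITICAL 0x1       /* stands in for __GFP_CRITICAL */

int normal_free   = 8;         /* pages anyone may take            */
int critical_free = 4;         /* pre-allocated critical page pool */

/* Allocate one page; returns 1 on success, 0 on failure. */
int toy_alloc_page(int gfp_flags)
{
    if (normal_free > 0) {
        normal_free--;          /* normal memory serves everyone */
        return 1;
    }
    if ((gfp_flags & GFP_CRITICAL) && critical_free > 0) {
        critical_free--;        /* only critical requests may dip
                                 * into the reserved pool */
        return 1;
    }
    return 0;                   /* non-critical request fails */
}
```

Once normal memory is gone, non-critical allocations fail immediately while critical ones keep succeeding until the reserve itself is exhausted.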
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 9:12 [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism Sridhar Samudrala @ 2005-12-14 9:22 ` Andi Kleen 2005-12-14 17:55 ` Sridhar Samudrala 2005-12-14 20:16 ` Jesper Juhl 1 sibling, 1 reply; 46+ messages in thread From: Andi Kleen @ 2005-12-14 9:22 UTC (permalink / raw) To: Sridhar Samudrala; +Cc: linux-kernel, netdev > I would appreciate any feedback or comments on this approach. Maybe I'm missing something but wouldn't you need a separate critical pool (or at least a reservation) for each socket to be safe against deadlocks? Otherwise, if a critical socket needs e.g. 2 pages to finish something and 2 critical sockets are active, they can each steal the last pages from each other and deadlock. -Andi ^ permalink raw reply [flat|nested] 46+ messages in thread
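The deadlock Andi describes can be made concrete with a tiny simulation: one shared reserve of 2 pages and two critical sockets that each need 2 pages to complete an operation. Interleaved, each grabs one page and neither can ever finish. The names and numbers here are illustrative, not from the patches.

```c
/* Sketch of the shared-reserve deadlock: two critical sockets, one
 * 2-page reserve, 2 pages needed per socket to make progress. */
#include <assert.h>

#define RESERVE_PAGES 2
#define PAGES_NEEDED  2

int reserve_free = RESERVE_PAGES;

int grab_page(void)
{
    if (reserve_free > 0) {
        reserve_free--;
        return 1;
    }
    return 0;
}

/* Try to finish: keep grabbing pages until done or the pool is empty.
 * Returns the number of pages now held. */
int try_to_finish(int held)
{
    while (held < PAGES_NEEDED && grab_page())
        held++;
    return held;
}

/* Returns 1 if the interleaving ends in deadlock: both sockets hold a
 * page, neither has enough to proceed, and the pool is empty. */
int demo_deadlock(void)
{
    int a = grab_page();    /* socket A gets its first page */
    int b = grab_page();    /* socket B gets its first page */
    a = try_to_finish(a);   /* pool is now empty...         */
    b = try_to_finish(b);   /* ...so neither can finish     */
    return a < PAGES_NEEDED && b < PAGES_NEEDED && reserve_free == 0;
}
```

A per-socket reservation sized to the socket's worst case avoids this, which is exactly the point of the question.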
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 9:22 ` Andi Kleen @ 2005-12-14 17:55 ` Sridhar Samudrala 2005-12-14 18:41 ` Andi Kleen 2005-12-15 3:39 ` Matt Mackall 0 siblings, 2 replies; 46+ messages in thread From: Sridhar Samudrala @ 2005-12-14 17:55 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, netdev On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote: > > I would appreciate any feedback or comments on this approach. > > Maybe I'm missing something but wouldn't you need a separate critical > pool (or at least a reservation) for each socket to be safe against deadlocks? > > Otherwise, if a critical socket needs e.g. 2 pages to finish something > and 2 critical sockets are active, they can each steal the last pages > from each other and deadlock. Here we are assuming that the pre-allocated critical page pool is big enough to satisfy the requirements of all the critical sockets. In the current critical page pool implementation, there is also a limitation that only order-0 allocations (single page) are supported. I think in the networking send/receive paths, the only place where multi-page allocs are requested is in the drivers if the MTU > PAGESIZE. But I guess the drivers are getting updated to avoid > order-0 allocations. Also during the emergency, we free the memory allocated for non-critical packets as quickly as possible so that it can be re-used for critical allocations. Thanks Sridhar ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 17:55 ` Sridhar Samudrala @ 2005-12-14 18:41 ` Andi Kleen 2005-12-14 19:20 ` David Stevens 2005-12-15 3:39 ` Matt Mackall 1 sibling, 1 reply; 46+ messages in thread From: Andi Kleen @ 2005-12-14 18:41 UTC (permalink / raw) To: Sridhar Samudrala; +Cc: Andi Kleen, linux-kernel, netdev > Here we are assuming that the pre-allocated critical page pool is big enough > to satisfy the requirements of all the critical sockets. That seems like a lot of assumptions. Is it really better than the existing GFP_ATOMIC which works basically the same? It has a lot more users that compete true, but likely the set of GFP_CRITICAL users would grow over time too and it would develop the same problem. I think if you really want to attack this problem and improve over the GFP_ATOMIC "best effort in smaller pool" approach you should probably add real reservations. And then really do a lot of testing to see if it actually helps. -Andi ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 18:41 ` Andi Kleen @ 2005-12-14 19:20 ` David Stevens 0 siblings, 0 replies; 46+ messages in thread From: David Stevens @ 2005-12-14 19:20 UTC (permalink / raw) To: Andi Kleen; +Cc: Andi Kleen, linux-kernel, netdev, netdev-owner, sri > It has a lot > more users that compete true, but likely the set of GFP_CRITICAL users > would grow over time too and it would develop the same problem. No, because the critical set is determined by the user (by setting the socket flag). The receive side has some things marked as "critical" until we have processed enough to check the socket flag, but then they should be released. Those short-lived allocations and frees are more or less 0 net towards the pool. Certainly, it wouldn't work very well if every socket is marked as "critical", but with an adequate pool for the workload, I expect it'll work as advertised (esp. since it'll usually be only one socket associated with swap management that'll be critical). +-DLS ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 17:55 ` Sridhar Samudrala 2005-12-14 18:41 ` Andi Kleen @ 2005-12-15 3:39 ` Matt Mackall 2005-12-15 4:30 ` David S. Miller 1 sibling, 1 reply; 46+ messages in thread From: Matt Mackall @ 2005-12-15 3:39 UTC (permalink / raw) To: Sridhar Samudrala; +Cc: Andi Kleen, linux-kernel, netdev On Wed, Dec 14, 2005 at 09:55:45AM -0800, Sridhar Samudrala wrote: > On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote: > > > I would appreciate any feedback or comments on this approach. > > > > Maybe I'm missing something but wouldn't you need a separate critical > > pool (or at least a reservation) for each socket to be safe against deadlocks? > > > > Otherwise, if a critical socket needs e.g. 2 pages to finish something > > and 2 critical sockets are active, they can each steal the last pages > > from each other and deadlock. > > Here we are assuming that the pre-allocated critical page pool is big enough > to satisfy the requirements of all the critical sockets. Not a good assumption. A system can have anywhere from 1 to 1000 iSCSI connections open and we certainly don't want to preallocate enough room for 1000 connections to make progress when we might only have one in use. I think we need a global receive pool and per-socket send pools. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 3:39 ` Matt Mackall @ 2005-12-15 4:30 ` David S. Miller 2005-12-15 5:02 ` Matt Mackall ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: David S. Miller @ 2005-12-15 4:30 UTC (permalink / raw) To: mpm; +Cc: sri, ak, linux-kernel, netdev From: Matt Mackall <mpm@selenic.com> Date: Wed, 14 Dec 2005 19:39:37 -0800 > I think we need a global receive pool and per-socket send pools. Mind telling everyone how you plan to make use of the global receive pool when the allocation happens in the device driver and we have no idea which socket the packet is destined for? What should be done for non-local packets being routed? The device drivers allocate packets for the entire system, long before we know who the eventually received packets are for. It is fully anonymous memory, and it's easy to design cases where the whole pool can be eaten up by non-local forwarded packets. I truly dislike these patches being discussed because they are a complete hack, and admittedly don't even solve the problem fully. I don't have any concrete better ideas but that doesn't mean this stuff should go into the tree. I think GFP_ATOMIC memory pools are more powerful than they are given credit for. There is nothing preventing the implementation of dynamic GFP_ATOMIC watermarks, and having "critical" socket behavior "kick in" in response to hitting those water marks. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 4:30 ` David S. Miller @ 2005-12-15 5:02 ` Matt Mackall 2005-12-15 5:23 ` David S. Miller 2005-12-15 5:42 ` Andi Kleen 2005-12-15 7:37 ` Sridhar Samudrala 2 siblings, 1 reply; 46+ messages in thread From: Matt Mackall @ 2005-12-15 5:02 UTC (permalink / raw) To: David S. Miller; +Cc: sri, ak, linux-kernel, netdev On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote: > From: Matt Mackall <mpm@selenic.com> > Date: Wed, 14 Dec 2005 19:39:37 -0800 > > > I think we need a global receive pool and per-socket send pools. > > Mind telling everyone how you plan to make use of the global receive > pool when the allocation happens in the device driver and we have no > idea which socket the packet is destined for? What should be done for > non-local packets being routed? The device drivers allocate packets > for the entire system, long before we know who the eventually received > packets are for. It is fully anonymous memory, and it's easy to > design cases where the whole pool can be eaten up by non-local > forwarded packets. There needs to be two rules: iff global memory critical flag is set - allocate from the global critical receive pool on receive - return packet to global pool if not destined for a socket with an attached send mempool I think this will provide the desired behavior, though only probabilistically. That is, we can fill the global receive pool with uninteresting packets such that we're forced to drop critical ACKs, but the boring packets will eventually be discarded as we walk up the stack and we'll eventually have room to receive retried ACKs. > I truly dislike these patches being discussed because they are a > complete hack, and admittedly don't even solve the problem fully. I > don't have any concrete better ideas but that doesn't mean this stuff > should go into the tree. Agreed. 
I'm fairly convinced a full fix is doable, if you make a couple assumptions (limited fragmentation), but will unavoidably be less than pretty as it needs to cross some layers. > I think GFP_ATOMIC memory pools are more powerful than they are given > credit for. There is nothing preventing the implementation of dynamic > GFP_ATOMIC watermarks, and having "critical" socket behavior "kick in" > in response to hitting those water marks. There are two problems with GFP_ATOMIC. The first is that its users don't pre-state their worst-case usage, which means sizing the pool to reliably avoid deadlocks is impossible. The second is that there aren't any guarantees that GFP_ATOMIC allocations are actually critical in the needed-to-make-forward-VM-progress sense or will be returned to the pool in a timely fashion. So I do think we need a distinct pool if we want to tackle this problem. Though it's probably worth mentioning that Linus was rather adamantly against even trying at KS. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 46+ messages in thread
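The two receive-path rules Matt states can be sketched as a toy model: under the global memory-critical flag, receive buffers come from a global critical pool, and a buffer is refunded to that pool immediately if the packet turns out not to be for a critical socket (modeled here as "has an attached send mempool"). All names and sizes below are illustrative, not from the patches.

```c
/* Toy of the global-receive-pool rules: drop non-critical traffic
 * early and give its buffer straight back to the pool. */
#include <assert.h>

int emergency = 1;       /* global memory-critical flag         */
int rx_pool_free = 4;    /* global critical receive pool (bufs) */

struct toy_sock {
    int has_send_mempool;   /* nonzero for a critical socket */
};

int rx_buf_get(void)
{
    if (rx_pool_free > 0) {
        rx_pool_free--;
        return 1;
    }
    return 0;
}

void rx_buf_put(void) { rx_pool_free++; }

/* Driver receive path: returns 1 if the packet is kept for delivery,
 * 0 if it is dropped. */
int receive_packet(struct toy_sock *dst)
{
    if (!rx_buf_get())
        return 0;                         /* pool empty: drop early      */
    if (emergency && !dst->has_send_mempool) {
        rx_buf_put();                     /* non-critical: refund buffer */
        return 0;
    }
    return 1;                             /* critical: keep the buffer   */
}
```

As the discussion that follows shows, this only works probabilistically, and encapsulation (IPSEC, tunnels) breaks the "destined for a critical socket" test entirely.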
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 5:02 ` Matt Mackall @ 2005-12-15 5:23 ` David S. Miller 2005-12-15 5:48 ` Matt Mackall ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: David S. Miller @ 2005-12-15 5:23 UTC (permalink / raw) To: mpm; +Cc: sri, ak, linux-kernel, netdev From: Matt Mackall <mpm@selenic.com> Date: Wed, 14 Dec 2005 21:02:50 -0800 > There needs to be two rules: > > iff global memory critical flag is set > - allocate from the global critical receive pool on receive > - return packet to global pool if not destined for a socket with an > attached send mempool This shuts off a router and/or firewall just because iSCSI or NFS peed in its pants. Not really acceptable. > I think this will provide the desired behavior It's not desirable. What if iSCSI is protected by IPSEC, and the key management daemon has to process a security association expiration and negotiate a new one in order for iSCSI to further communicate with its peer when this memory shortage occurs? It needs to send packets back and forth with the remote key management daemon in order to do this, but since you cut it off with this critical receive pool, the negotiation will never succeed. This stuff won't work. It's not a generic solution and that's why it has more holes than Swiss cheese. :-) ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 5:23 ` David S. Miller @ 2005-12-15 5:48 ` Matt Mackall 2005-12-15 5:53 ` Nick Piggin 2005-12-15 5:56 ` Stephen Hemminger 2 siblings, 0 replies; 46+ messages in thread From: Matt Mackall @ 2005-12-15 5:48 UTC (permalink / raw) To: David S. Miller; +Cc: sri, ak, linux-kernel, netdev On Wed, Dec 14, 2005 at 09:23:09PM -0800, David S. Miller wrote: > From: Matt Mackall <mpm@selenic.com> > Date: Wed, 14 Dec 2005 21:02:50 -0800 > > > There needs to be two rules: > > > > iff global memory critical flag is set > > - allocate from the global critical receive pool on receive > > - return packet to global pool if not destined for a socket with an > > attached send mempool > > This shuts off a router and/or firewall just because iSCSI or NFS peed > in its pants. Not really acceptable. That'll happen now anyway. > > I think this will provide the desired behavior > > It's not desirable. > > What if iSCSI is protected by IPSEC, and the key management daemon has > to process a security association expiration and negotiate a new one > in order for iSCSI to further communicate with its peer when this > memory shortage occurs? It needs to send packets back and forth with > the remote key management daemon in order to do this, but since you > cut it off with this critical receive pool, the negotiation will never > succeed. Ok, encapsulation completely ruins the idea. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 5:23 ` David S. Miller 2005-12-15 5:48 ` Matt Mackall @ 2005-12-15 5:53 ` Nick Piggin 2005-12-15 5:56 ` Stephen Hemminger 2 siblings, 0 replies; 46+ messages in thread From: Nick Piggin @ 2005-12-15 5:53 UTC (permalink / raw) To: David S. Miller; +Cc: mpm, sri, ak, linux-kernel, netdev David S. Miller wrote: > From: Matt Mackall <mpm@selenic.com> > Date: Wed, 14 Dec 2005 21:02:50 -0800 > > >>There needs to be two rules: >> >>iff global memory critical flag is set >>- allocate from the global critical receive pool on receive >>- return packet to global pool if not destined for a socket with an >> attached send mempool > > > This shuts off a router and/or firewall just because iSCSI or NFS peed > in its pants. Not really acceptable. > But that should only happen (shut off a router and/or firewall) in cases where we now completely deadlock and never recover, including shutting off the router and firewall, because they don't have enough memory to recv packets either. > >>I think this will provide the desired behavior > > > It's not desirable. > > What if iSCSI is protected by IPSEC, and the key management daemon has > to process a security association expiration and negotiate a new one > in order for iSCSI to further communicate with its peer when this > memory shortage occurs? It needs to send packets back and forth with > the remote key management daemon in order to do this, but since you > cut it off with this critical receive pool, the negotiation will never > succeed. > I guess IPSEC would be a critical socket too, in that case. Sure there is nothing we can do if the daemon insists on allocating lots of memory... > This stuff won't work. It's not a generic solution and that's > why it has more holes than Swiss cheese. :-) True it will have holes.
I think something that is complementary and would be desirable is to simply limit the amount of in-flight writeout that things like NFS allow (or used to allow; I haven't checked for a while and there were noises about it getting better). -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 5:23 ` David S. Miller 2005-12-15 5:48 ` Matt Mackall 2005-12-15 5:53 ` Nick Piggin @ 2005-12-15 5:56 ` Stephen Hemminger 2005-12-15 8:44 ` David Stevens 2 siblings, 1 reply; 46+ messages in thread From: Stephen Hemminger @ 2005-12-15 5:56 UTC (permalink / raw) To: David S. Miller; +Cc: mpm, sri, ak, linux-kernel, netdev On Wed, 14 Dec 2005 21:23:09 -0800 (PST) "David S. Miller" <davem@davemloft.net> wrote: > From: Matt Mackall <mpm@selenic.com> > Date: Wed, 14 Dec 2005 21:02:50 -0800 > > > There needs to be two rules: > > > > iff global memory critical flag is set > > - allocate from the global critical receive pool on receive > > - return packet to global pool if not destined for a socket with an > > attached send mempool > > This shuts off a router and/or firewall just because iSCSI or NFS peed > in its pants. Not really acceptable. > > > I think this will provide the desired behavior > > It's not desirable. > > What if iSCSI is protected by IPSEC, and the key management daemon has > to process a security association expiration and negotiate a new one > in order for iSCSI to further communicate with its peer when this > memory shortage occurs? It needs to send packets back and forth with > the remote key management daemon in order to do this, but since you > cut it off with this critical receive pool, the negotiation will never > succeed. > > This stuff won't work. It's not a generic solution and that's > why it has more holes than Swiss cheese. :-) Also, all this stuff is just a band-aid because Linux OOM behavior is so fucked up. The VM system just lets the user dig themselves into a huge overcommit, then we get into trying to change every other system to compensate. How about cutting things off earlier, and not falling off the cliff? How about pushing out pages to swap earlier when memory pressure starts to get noticed?
Then you can free those non-dirty pages to make progress. Too many of the VM decisions seem to be made in favor of keep-it-in-memory benchmark situations. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 5:56 ` Stephen Hemminger @ 2005-12-15 8:44 ` David Stevens 2005-12-15 8:58 ` David S. Miller 0 siblings, 1 reply; 46+ messages in thread From: David Stevens @ 2005-12-15 8:44 UTC (permalink / raw) To: Stephen Hemminger Cc: ak, David S. Miller, linux-kernel, mpm, netdev, netdev-owner, sri > Also, all this stuff is just a band aid because linux OOM behavior is so > fucked up. In our internal discussions, characterizing this as "OOM" came up a lot, and I don't think of it as that at all. OOM is exactly what the scheme is trying to avoid! The actual situation we have in mind is a swap device management system in a cluster where a remote system tells you (via socket communication to a user-land management app) that a swap device is going to fail over and it'd be a good idea not to do anything that requires paging out or swapping for a short period of time. The socket communication must work, but the system is not at all out of memory, and the important point is that it never will be if you limit allocations to those things that are required for the critical socket to work (and nothing/little else). Receiver side allocations are unavoidable, because you don't know if you can drop the packet or not until you look at it. Some infrastructure must work. But everything else can fail or succeed based on ordinary churn in ordinary memory pools, until the "in_emergency" condition has passed. The critical socket(s) simply have to be out of the zero-sum game for the rest of the allocations, because those are the (only) path to getting a working swap device again. If you're out of memory without a network mechanism to get you more, this doesn't do anything for you (and it isn't intended to). And if you mark any socket that isn't going to get you failed over or otherwise get you more swap, it isn't going to help you, either. 
It isn't a priority scheme for low-memory, it's a failover mechanism that relies on networking. There are exactly 2 priorities: critical (as in "you might as well crash if these aren't satisfied") and everything else. Doing other, more general things that handle low memory, or OOM, or identified priorities are great, but the problem we're interested in solving here is really just about making socket communication work when the alternative is a completely dead system. I think these patches do that in a reasonable way. A better solution would be great, too, if there is one. :-) +-DLS ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 8:44 ` David Stevens @ 2005-12-15 8:58 ` David S. Miller 2005-12-15 9:27 ` David Stevens 0 siblings, 1 reply; 46+ messages in thread From: David S. Miller @ 2005-12-15 8:58 UTC (permalink / raw) To: dlstevens; +Cc: shemminger, ak, linux-kernel, mpm, netdev, netdev-owner, sri From: David Stevens <dlstevens@us.ibm.com> Date: Thu, 15 Dec 2005 00:44:52 -0800 > In our internal discussions I really wish this hadn't been discussed internally before being implemented. Any such internal discussions are lost completely upon the community that ends up reviewing such a core and invasive patch as this one. > The critical socket(s) simply have to be out of the zero-sum game > for the rest of the allocations, because those are the (only) path to > getting a working swap device again. The core fault of the critical socket idea is that it is painfully simple to create a tree of dependent allocations that makes the critical pool useless. IPSEC and tunnels are simple examples. The idea to mark, for example, the IPSEC key management daemon's sockets as critical is flawed, because the key management daemon could hit a swap page over the iSCSI device. Don't even start with the idea to lock the IPSEC key management daemon into RAM with mlock(). Tunnels are similar, and realistic nesting cases can be shown that make sizing via a special pool simply unfeasible, and what's more there are no sockets involved. Sockets do not exist in an allocation vacuum; they need to talk over routes, and there are therefore many types of auxiliary data associated with sending a packet besides the packet itself. All you need is a routing change of some type and you're going to start burning GFP_ATOMIC allocations on the next packet send. I think making GFP_ATOMIC better would be wise. Alan's ideas harping from the old 2.0.x/2.2.x NFS days could use some consideration as well.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 8:58 ` David S. Miller @ 2005-12-15 9:27 ` David Stevens 0 siblings, 0 replies; 46+ messages in thread From: David Stevens @ 2005-12-15 9:27 UTC (permalink / raw) To: David S. Miller Cc: ak, linux-kernel, mpm, netdev, netdev-owner, shemminger, sri "David S. Miller" <davem@davemloft.net> wrote on 12/15/2005 12:58:05 AM: > From: David Stevens <dlstevens@us.ibm.com> > Date: Thu, 15 Dec 2005 00:44:52 -0800 > > > In our internal discussions > > I really wish this hadn't been discussed internally before being > implemented. Any such internal discussions are lost completely upon > the community that ends up reviewing such a core and invasive patch > such as this one. I think those were more informal and less extensive than the impression I gave you. I mean simply bouncing around incomplete ideas and discussing some of the potential issues before coming up with a prototype solution, which is intended to be the starting point for community discussions (and the KS discussions, too). "OOM" came up immediately (even when naming the problem), and it isn't how I ever saw it. The patches, of course, are intended to NOT be invasive, or any more than they need to be, and they are not "the" solution, but "a" solution. A completely different one that solves the problem is just as good to me. +-DLS ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 4:30 ` David S. Miller 2005-12-15 5:02 ` Matt Mackall @ 2005-12-15 5:42 ` Andi Kleen 2005-12-15 6:06 ` Stephen Hemminger 2005-12-15 7:37 ` Sridhar Samudrala 2 siblings, 1 reply; 46+ messages in thread From: Andi Kleen @ 2005-12-15 5:42 UTC (permalink / raw) To: David S. Miller; +Cc: mpm, sri, ak, linux-kernel, netdev On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote: > From: Matt Mackall <mpm@selenic.com> > Date: Wed, 14 Dec 2005 19:39:37 -0800 > > > I think we need a global receive pool and per-socket send pools. > > Mind telling everyone how you plan to make use of the global receive > pool when the allocation happens in the device driver and we have no > idea which socket the packet is destined for? What should be done for In theory one could use multiple receive queues on an intelligent enough NIC with the NIC distinguishing the sockets. But that would still be a nasty "you need advanced hardware FOO to avoid subtle problem Y" case. Also it would require lots of driver hacking. And most NICs seem to have limits on the size of the socket tables for this, which means you would end up in a "only N sockets supported safely" situation, with N likely being quite small on common hardware. I think the idea of the original poster was that just freeing non-critical packets after a short time again would be good enough, but I'm a bit sceptical on that. > I truly dislike these patches being discussed because they are a > complete hack, and admittedly don't even solve the problem fully. I I agree. > I think GFP_ATOMIC memory pools are more powerful than they are given > credit for. There is nothing preventing the implementation of dynamic Their main problem is that they are used too widely and in a lot of situations that aren't really critical. -Andi ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 5:42 ` Andi Kleen @ 2005-12-15 6:06 ` Stephen Hemminger 0 siblings, 0 replies; 46+ messages in thread From: Stephen Hemminger @ 2005-12-15 6:06 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, mpm, sri, ak, linux-kernel, netdev On Thu, 15 Dec 2005 06:42:45 +0100 Andi Kleen <ak@suse.de> wrote: > On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote: > > From: Matt Mackall <mpm@selenic.com> > > Date: Wed, 14 Dec 2005 19:39:37 -0800 > > > > > I think we need a global receive pool and per-socket send pools. > > > > Mind telling everyone how you plan to make use of the global receive > > pool when the allocation happens in the device driver and we have no > > idea which socket the packet is destined for? What should be done for > > In theory one could use multiple receive queues on an intelligent enough > NIC with the NIC distinguishing the sockets. > > But that would still be a nasty "you need advanced hardware FOO to avoid > subtle problem Y" case. Also it would require lots of driver hacking. > > And most NICs seem to have limits on the size of the socket tables for this, which > means you would end up in a "only N sockets supported safely" situation, > with N likely being quite small on common hardware. > > I think the idea of the original poster was that just freeing non-critical packets > after a short time again would be good enough, but I'm a bit sceptical > on that. > > > I truly dislike these patches being discussed because they are a > > complete hack, and admittedly don't even solve the problem fully. I > > I agree. > > > I think GFP_ATOMIC memory pools are more powerful than they are given > > credit for. There is nothing preventing the implementation of dynamic > > Their main problem is that they are used too widely and in a lot > of situations that aren't really critical.
Most of the use of GFP_ATOMIC is by stuff that could fail but can't sleep waiting for memory. How about adding a GFP_NORMAL for allocations while holding a lock? #define GFP_NORMAL (__GFP_NOMEMALLOC) Then get people to change the unneeded GFP_ATOMICs to GFP_NORMAL in places where the error paths are reasonable. ^ permalink raw reply [flat|nested] 46+ messages in thread
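The policy split behind that suggestion can be sketched in a few lines. The bit values below are made up for illustration; the real flags live in include/linux/gfp.h and their values differ by kernel version. The point is only the semantics: GFP_ATOMIC may dip into the emergency reserves, while a GFP_NORMAL caller neither sleeps nor touches the reserves, so its failures are handled by ordinary error paths.

```c
/* Sketch of the proposed GFP_NORMAL vs. GFP_ATOMIC policy.
 * Bit values are illustrative only, not the kernel's. */
#include <assert.h>

#define __GFP_HIGH       0x01  /* may use emergency reserves   */
#define __GFP_NOMEMALLOC 0x02  /* never use emergency reserves */

#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_NORMAL (__GFP_NOMEMALLOC)

/* Is this allocation allowed to consume the last emergency reserves? */
int may_use_reserves(int gfp_flags)
{
    if (gfp_flags & __GFP_NOMEMALLOC)
        return 0;              /* explicitly barred from the reserves */
    return (gfp_flags & __GFP_HIGH) != 0;
}
```

Converting the "could fail but can't sleep" GFP_ATOMIC callers to such a flag would shrink the set of allocations competing for the reserves down to the ones that genuinely need them.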
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 4:30 ` David S. Miller 2005-12-15 5:02 ` Matt Mackall 2005-12-15 5:42 ` Andi Kleen @ 2005-12-15 7:37 ` Sridhar Samudrala 2005-12-15 8:21 ` David S. Miller 2 siblings, 1 reply; 46+ messages in thread From: Sridhar Samudrala @ 2005-12-15 7:37 UTC (permalink / raw) To: David S. Miller; +Cc: mpm, sri, ak, linux-kernel, netdev On Wed, 14 Dec 2005, David S. Miller wrote: > From: Matt Mackall <mpm@selenic.com> > Date: Wed, 14 Dec 2005 19:39:37 -0800 > > > I think we need a global receive pool and per-socket send pools. > > Mind telling everyone how you plan to make use of the global receive > pool when the allocation happens in the device driver and we have no > idea which socket the packet is destined for? What should be done for > non-local packets being routed? The device drivers allocate packets > for the entire system, long before we know who the eventually received > packets are for. It is fully anonymous memory, and it's easy to > design cases where the whole pool can be eaten up by non-local > forwarded packets. > > I truly dislike these patches being discussed because they are a > complete hack, and admittedly don't even solve the problem fully. I > don't have any concrete better ideas but that doesn't mean this stuff > should go into the tree. > > I think GFP_ATOMIC memory pools are more powerful than they are given > credit for. There is nothing preventing the implementation of dynamic > GFP_ATOMIC watermarks, and having "critical" socket behavior "kick in" > in response to hitting those water marks. Does this mean that you are OK with having a mechanism to mark the sockets as critical and dropping the non-critical packets under emergency, but you do not like having a separate critical page pool? Instead, you seem to be suggesting in_emergency to be set dynamically when we are about to run out of ATOMIC memory. Is this right?
Thanks Sridhar ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 7:37 ` Sridhar Samudrala @ 2005-12-15 8:21 ` David S. Miller 2005-12-15 8:35 ` Arjan van de Ven ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: David S. Miller @ 2005-12-15 8:21 UTC (permalink / raw) To: sri; +Cc: mpm, ak, linux-kernel, netdev From: Sridhar Samudrala <sri@us.ibm.com> Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST) > Instead, you seem to be suggesting in_emergency to be set dynamically > when we are about to run out of ATOMIC memory. Is this right? Not when we run out, but rather when we reach some low water mark, the "critical sockets" would still use GFP_ATOMIC memory but only "critical sockets" would be allowed to do so. But even this has faults: consider the IPSEC scenario I mentioned, and this applies to any kind of encapsulation, actually; even simple tunneling examples can be concocted which make the "critical socket" idea fail. The knee jerk reaction is "mark IPSEC's sockets critical, and mark the tunneling allocations critical, and... and..." well, you have GFP_ATOMIC then, my friend. In short, these "separate page pool" and "critical socket" ideas do not work and we need a different solution. I'm sorry folks spent so much time on them, but they are heavily flawed. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 8:21 ` David S. Miller @ 2005-12-15 8:35 ` Arjan van de Ven 2005-12-15 8:55 ` [RFC] Fine-grained memory priorities and PI Kyle Moffett 2005-12-16 2:09 ` [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism Sridhar Samudrala 2 siblings, 0 replies; 46+ messages in thread From: Arjan van de Ven @ 2005-12-15 8:35 UTC (permalink / raw) To: David S. Miller; +Cc: sri, mpm, ak, linux-kernel, netdev On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote: > From: Sridhar Samudrala <sri@us.ibm.com> > Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST) > > > Instead, you seem to be suggesting in_emergency to be set dynamically > > when we are about to run out of ATOMIC memory. Is this right? > > Not when we run out, but rather when we reach some low water mark, the > "critical sockets" would still use GFP_ATOMIC memory but only > "critical sockets" would be allowed to do so. > > But even this has faults, consider the IPSEC scenerio I mentioned, and > this applies to any kind of encapsulation actually, even simple > tunneling examples can be concocted which make the "critical socket" > idea fail. > > The knee jerk reaction is "mark IPSEC's sockets critical, and mark the > tunneling allocations critical, and... and..." well you have > GFP_ATOMIC then my friend. > > In short, these "seperate page pool" and "critical socket" ideas do > not work and we need a different solution, I'm sorry folks spent so > much time on them, but they are heavily flawed. maybe it should be approached from the other side; having a way to mark connections as low priority (say incoming http connections to your webserver) or as non-critical/expendable would give the "normal" GFP_ATOMIC ones a better chance in case of overload/DDOS etc. It's not going to solve the VM deadlock issue wrt iscsi/nfs; however it might be useful in the "survive slashdot" sense... ^ permalink raw reply [flat|nested] 46+ messages in thread
* [RFC] Fine-grained memory priorities and PI 2005-12-15 8:21 ` David S. Miller 2005-12-15 8:35 ` Arjan van de Ven @ 2005-12-15 8:55 ` Kyle Moffett 2005-12-15 9:04 ` Andi Kleen 2005-12-15 12:45 ` Con Kolivas 2005-12-16 2:09 ` [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism Sridhar Samudrala 2 siblings, 2 replies; 46+ messages in thread From: Kyle Moffett @ 2005-12-15 8:55 UTC (permalink / raw) To: David S. Miller; +Cc: sri, mpm, ak, linux-kernel, netdev On Dec 15, 2005, at 03:21, David S. Miller wrote: > Not when we run out, but rather when we reach some low water mark, > the "critical sockets" would still use GFP_ATOMIC memory but only > "critical sockets" would be allowed to do so. > > But even this has faults, consider the IPSEC scenerio I mentioned, > and this applies to any kind of encapsulation actually, even simple > tunneling examples can be concocted which make the "critical > socket" idea fail. > > The knee jerk reaction is "mark IPSEC's sockets critical, and mark > the tunneling allocations critical, and... and..." well you have > GFP_ATOMIC then my friend. > > In short, these "seperate page pool" and "critical socket" ideas do > not work and we need a different solution, I'm sorry folks spent so > much time on them, but they are heavily flawed. What we really need in the kernel is a more fine-grained memory priority system with PI, similar in concept to what's being done to the scheduler in some of the RT patchsets. Currently we have a very black-and-white memory subsystem; when we go OOM, we just start killing processes until we are no longer OOM. Perhaps we should have some way to pass memory allocation priorities throughout the kernel, including a "this request has X priority", "this request will help free up X pages of RAM", and "drop while dirty under certain OOM to free X memory using this method". The initial benefit would be that OOM handling would become more reliable and less of a special case. 
When we start to run low on free pages, it might be OK to kill the SETI@home process long before we OOM if such action might prevent the OOM. Likewise, you might be able to flag certain file pages as being "less critical", such that the kernel can kill a process and drop its dirty pages for files in /tmp. Or the kernel might do a variety of other things just by failing new allocations with low priority and forcing existing allocations with low priority to go away using preregistered handlers. When processes request memory through any subsystem, their memory priority would be passed through the kernel layers to the allocator, along with any associated information about how to free the memory in a low-memory condition. As a result, I could configure my database to have a much higher priority than SETI@home (or boinc or whatever), so that when the database server wants to fill memory with clean DB cache pages, the kernel will kill SETI@home for its memory, even if we could just leave some DB cache pages unfaulted. Questions? Comments? "This is a terrible idea that should never have seen the light of day"? Both constructive and destructive criticism welcomed! (Just please keep the language clean! :-D) Cheers, Kyle Moffett -- Q: Why do programmers confuse Halloween and Christmas? A: Because OCT 31 == DEC 25. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Fine-grained memory priorities and PI 2005-12-15 8:55 ` [RFC] Fine-grained memory priorities and PI Kyle Moffett @ 2005-12-15 9:04 ` Andi Kleen 2005-12-15 12:51 ` Kyle Moffett 2005-12-15 12:45 ` Con Kolivas 1 sibling, 1 reply; 46+ messages in thread From: Andi Kleen @ 2005-12-15 9:04 UTC (permalink / raw) To: Kyle Moffett; +Cc: David S. Miller, sri, mpm, ak, linux-kernel, netdev > When processes request memory through any subsystem, their memory > priority would be passed through the kernel layers to the allocator, > along with any associated information about how to free the memory in > a low-memory condition. As a result, I could configure my database > to have a much higher priority than SETI@home (or boinc or whatever), > so that when the database server wants to fill memory with clean DB > cache pages, the kernel will kill SETI@home for it's memory, even if > we could just leave some DB cache pages unfaulted. Iirc most of the freeing happens in process context anyways, so process priority information is already available. At least for CPU cost it might even be taken into account during schedules (Freeing can take up quite a lot of CPU time) The problem with GFP_ATOMIC is though that someone else needs to free the memory in advance for you because you cannot do it yourself. (you could call it a kind of "parasite" in the normally very cooperative society of memory allocators ...) That would mess up your scheme too. The priority cannot be expressed because it's more a case of "somewhen someone in the future might need it" > > Questions? Comments? "This is a terrible idea that should never have > seen the light of day"? Both constructive and destructive criticism > welcomed! (Just please keep the language clean! :-D) This won't help for this problem here - even with perfect priorities you could still get into situations where you can't make any progress if progress needs more memory. Only preallocating or prereservation can help you out of that trap. 
-Andi ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Fine-grained memory priorities and PI 2005-12-15 9:04 ` Andi Kleen @ 2005-12-15 12:51 ` Kyle Moffett 2005-12-15 13:31 ` Andi Kleen 0 siblings, 1 reply; 46+ messages in thread From: Kyle Moffett @ 2005-12-15 12:51 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, sri, mpm, linux-kernel, netdev On Dec 15, 2005, at 04:04, Andi Kleen wrote: >> When processes request memory through any subsystem, their memory >> priority would be passed through the kernel layers to the >> allocator, along with any associated information about how to free >> the memory in a low-memory condition. As a result, I could >> configure my database to have a much higher priority than >> SETI@home (or boinc or whatever), so that when the database server >> wants to fill memory with clean DB cache pages, the kernel will >> kill SETI@home for it's memory, even if we could just leave some >> DB cache pages unfaulted. > > Iirc most of the freeing happens in process context anyways, so > process priority information is already available. At least for CPU > cost it might even be taken into account during schedules (Freeing > can take up quite a lot of CPU time) > > The problem with GFP_ATOMIC is though that someone else needs to > free the memory in advance for you because you cannot do it yourself. > > (you could call it a kind of "parasite" in the normally very > cooperative society of memory allocators ...) > > That would mess up your scheme too. The priority cannot be > expressed because it's more a case of > "somewhen someone in the future might need it" Well, that's currently expressed as a reserved pool with watermarks, so with a PI system you would have a single pool with some collection of reservation watermarks with various priorities. I'm not sure what the best data-structure would be, probably some sort of ordered priority tree. 
When allocating or freeing memory, the code would check the watermark data (which has some summary statistics so you don't need to check the whole tree each time); if any of the watermarks are too low with relative priority taken into account, you fail the allocation or move pages into the pool. >> Questions? Comments? "This is a terrible idea that should never >> have seen the light of day"? Both constructive and destructive >> criticism welcomed! (Just please keep the language clean! :-D) > > This won't help for this problem here - even with perfect > priorities you could still get into situations where you can't make > any progress if progress needs more memory. Well the point would be that the priorities could force a more extreme and selective OOM (maybe even dropping dirty pages for noncritical filesystems if necessary!), or handle the situation described with the IPSec daemon and IPSec network traffic (IPSec would inherit the increased memory priority, and when it tries to do networking, its send path and the global receive path would inherit that increased priority as well). Naturally this is all still in the vaporware stage, but I think that if implemented the concept might at least improve the OOM/low-memory situation considerably. Starting to fail allocations for the cluster programs (including their kernel allocations) well before failing them for the swap-fallback tool would help the original poster, and I imagine various tweaked priorities would make true OOM-deadlock far less likely. Cheers, Kyle Moffett -- When you go into court you either want a very, very, very bright line or you want the stomach to outlast the other guy in trench warfare. If both sides are reasonable, you try to stay _out_ of court in the first place. -- Rob Landley ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Fine-grained memory priorities and PI 2005-12-15 12:51 ` Kyle Moffett @ 2005-12-15 13:31 ` Andi Kleen 0 siblings, 0 replies; 46+ messages in thread From: Andi Kleen @ 2005-12-15 13:31 UTC (permalink / raw) To: Kyle Moffett; +Cc: Andi Kleen, David S. Miller, sri, mpm, linux-kernel, netdev > Naturally this is all still in the vaporware stage, but I think that > if implemented the concept might at least improve the OOM/low-memory > situation considerably. Starting to fail allocations for the cluster > programs (including their kernel allocations) well before failing > them for the swap-fallback tool would help the original poster, and I > imagine various tweaked priorities would make true OOM-deadlock far > less likely. The problem is that deadlocks can happen even without anybody running out of virtual memory. The deadlocks GFP_CRITICAL was supposed to handle are deadlocks while swapping out data because the swapping on some devices needs more memory by itself. This happens long before anything is running into a true oom. It's just that the memory cleaning stage cannot make progress anymore. Your proposal isn't addressing this problem at all I think. Handling true OOM is a quite different issue. -Andi ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Fine-grained memory priorities and PI 2005-12-15 8:55 ` [RFC] Fine-grained memory priorities and PI Kyle Moffett 2005-12-15 9:04 ` Andi Kleen @ 2005-12-15 12:45 ` Con Kolivas 2005-12-15 12:58 ` Kyle Moffett 1 sibling, 1 reply; 46+ messages in thread From: Con Kolivas @ 2005-12-15 12:45 UTC (permalink / raw) To: Kyle Moffett; +Cc: David S. Miller, sri, mpm, ak, linux-kernel, netdev On Thursday 15 December 2005 19:55, Kyle Moffett wrote: > On Dec 15, 2005, at 03:21, David S. Miller wrote: > > Not when we run out, but rather when we reach some low water mark, > > the "critical sockets" would still use GFP_ATOMIC memory but only > > "critical sockets" would be allowed to do so. > > > > But even this has faults, consider the IPSEC scenerio I mentioned, > > and this applies to any kind of encapsulation actually, even simple > > tunneling examples can be concocted which make the "critical > > socket" idea fail. > > > > The knee jerk reaction is "mark IPSEC's sockets critical, and mark > > the tunneling allocations critical, and... and..." well you have > > GFP_ATOMIC then my friend. > > > > In short, these "seperate page pool" and "critical socket" ideas do > > not work and we need a different solution, I'm sorry folks spent so > > much time on them, but they are heavily flawed. > > What we really need in the kernel is a more fine-grained memory > priority system with PI, similar in concept to what's being done to > the scheduler in some of the RT patchsets. Currently we have a very > black-and-white memory subsystem; when we go OOM, we just start > killing processes until we are no longer OOM. Perhaps we should have > some way to pass memory allocation priorities throughout the kernel, > including a "this request has X priority", "this request will help > free up X pages of RAM", and "drop while dirty under certain OOM to > free X memory using this method". 
> > The initial benefit would be that OOM handling would become more > reliable and less of a special case. When we start to run low on > free pages, it might be OK to kill the SETI@home process long before > we OOM if such action might prevent the OOM. Likewise, you might be > able to flag certain file pages as being "less critical", such that > the kernel can kill a process and drop its dirty pages for files in / > tmp. Or the kernel might do a variety of other things just by > failing new allocations with low priority and forcing existing > allocations with low priority to go away using preregistered handlers. > > When processes request memory through any subsystem, their memory > priority would be passed through the kernel layers to the allocator, > along with any associated information about how to free the memory in > a low-memory condition. As a result, I could configure my database > to have a much higher priority than SETI@home (or boinc or whatever), > so that when the database server wants to fill memory with clean DB > cache pages, the kernel will kill SETI@home for it's memory, even if > we could just leave some DB cache pages unfaulted. > > Questions? Comments? "This is a terrible idea that should never have > seen the light of day"? Both constructive and destructive criticism > welcomed! (Just please keep the language clean! :-D) I have some basic process-that-called the memory allocator link in the -ck tree already which alters how aggressively memory is reclaimed according to priority. It does not affect out of memory management but that could be added to said algorithm; however I don't see much point at the moment since oom is still an uncommon condition but regular memory allocation is routine. Cheers, Con ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Fine-grained memory priorities and PI 2005-12-15 12:45 ` Con Kolivas @ 2005-12-15 12:58 ` Kyle Moffett 2005-12-15 13:02 ` Con Kolivas 0 siblings, 1 reply; 46+ messages in thread From: Kyle Moffett @ 2005-12-15 12:58 UTC (permalink / raw) To: Con Kolivas; +Cc: David S. Miller, sri, mpm, ak, linux-kernel, netdev On Dec 15, 2005, at 07:45, Con Kolivas wrote: > I have some basic process-that-called the memory allocator link in > the -ck tree already which alters how aggressively memory is > reclaimed according to priority. It does not affect out of memory > management but that could be added to said algorithm; however I > don't see much point at the moment since oom is still an uncommon > condition but regular memory allocation is routine. My thought would be to generalize the two special cases of writeback of dirty pages or dropping of clean pages under memory pressure and OOM to be the same general case. When you are trying to free up pages, it may be permissible to drop dirty mbox pages and kill the postfix process writing them in order to satisfy allocations for the mission-critical database server. (Or maybe it's the other way around). If a large chunk of the allocated pages have priorities and lossless/lossy free functions, then the kernel can be much more flexible and configurable about what to do when running low on RAM. Cheers, Kyle Moffett -- I lost interest in "blade servers" when I found they didn't throw knives at people who weren't supposed to be in your machine room. -- Anthony de Boer ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC] Fine-grained memory priorities and PI 2005-12-15 12:58 ` Kyle Moffett @ 2005-12-15 13:02 ` Con Kolivas 0 siblings, 0 replies; 46+ messages in thread From: Con Kolivas @ 2005-12-15 13:02 UTC (permalink / raw) To: Kyle Moffett; +Cc: David S. Miller, sri, mpm, ak, linux-kernel, netdev On Thursday 15 December 2005 23:58, Kyle Moffett wrote: > On Dec 15, 2005, at 07:45, Con Kolivas wrote: > > I have some basic process-that-called the memory allocator link in > > the -ck tree already which alters how aggressively memory is > > reclaimed according to priority. It does not affect out of memory > > management but that could be added to said algorithm; however I > > don't see much point at the moment since oom is still an uncommon > > condition but regular memory allocation is routine. > > My thought would be to generalize the two special cases of writeback > of dirty pages or dropping of clean pages under memory pressure and > OOM to be the same general case. When you are trying to free up > pages, it may be permissible to drop dirty mbox pages and kill the > postfix process writing them in order to satisfy allocations for the > mission-critical database server. (Or maybe it's the other way > around). If a large chunk of the allocated pages have priorities and > lossless/lossy free functions, then the kernel can be much more > flexible and configurable about what to do when running low on RAM. Indeed the implementation I currently have is lightweight to say the least but I really didn't think bloating struct page was worth it since the memory cost would be prohibitive, but would allow all sorts of priority effects and vm scheduling to be possible. That is, struct page could have an extra entry keeping track of the highest priority of the process that used it and use that to determine further eviction etc. Cheers, Con ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 8:21 ` David S. Miller 2005-12-15 8:35 ` Arjan van de Ven 2005-12-15 8:55 ` [RFC] Fine-grained memory priorities and PI Kyle Moffett @ 2005-12-16 2:09 ` Sridhar Samudrala 2005-12-16 17:48 ` Stephen Hemminger 2 siblings, 1 reply; 46+ messages in thread From: Sridhar Samudrala @ 2005-12-16 2:09 UTC (permalink / raw) To: David S. Miller; +Cc: mpm, ak, linux-kernel, netdev On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote: > From: Sridhar Samudrala <sri@us.ibm.com> > Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST) > > > Instead, you seem to be suggesting in_emergency to be set dynamically > > when we are about to run out of ATOMIC memory. Is this right? > > Not when we run out, but rather when we reach some low water mark, the > "critical sockets" would still use GFP_ATOMIC memory but only > "critical sockets" would be allowed to do so. > > But even this has faults, consider the IPSEC scenerio I mentioned, and > this applies to any kind of encapsulation actually, even simple > tunneling examples can be concocted which make the "critical socket" > idea fail. > > The knee jerk reaction is "mark IPSEC's sockets critical, and mark the > tunneling allocations critical, and... and..." well you have > GFP_ATOMIC then my friend. I would like to mention another reason why we need to have a new GFP_CRITICAL flag for an allocation request. When we are in emergency, even the GFP_KERNEL allocations for a critical socket should not sleep. This is because the swap device may have failed and we would like to communicate this event to a management server over the critical socket so that it can initiate the failover. We are not trying to solve swapping over network problem. It is much simpler. The critical sockets are to be used only to send/receive a few critical messages reliably during a short period of emergency. Thanks Sridhar ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-16 2:09 ` [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism Sridhar Samudrala @ 2005-12-16 17:48 ` Stephen Hemminger 2005-12-16 18:38 ` Sridhar Samudrala 0 siblings, 1 reply; 46+ messages in thread From: Stephen Hemminger @ 2005-12-16 17:48 UTC (permalink / raw) To: Sridhar Samudrala; +Cc: David S. Miller, mpm, ak, linux-kernel, netdev On Thu, 15 Dec 2005 18:09:22 -0800 Sridhar Samudrala <sri@us.ibm.com> wrote: > On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote: > > From: Sridhar Samudrala <sri@us.ibm.com> > > Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST) > > > > > Instead, you seem to be suggesting in_emergency to be set dynamically > > > when we are about to run out of ATOMIC memory. Is this right? > > > > Not when we run out, but rather when we reach some low water mark, the > > "critical sockets" would still use GFP_ATOMIC memory but only > > "critical sockets" would be allowed to do so. > > > > But even this has faults, consider the IPSEC scenerio I mentioned, and > > this applies to any kind of encapsulation actually, even simple > > tunneling examples can be concocted which make the "critical socket" > > idea fail. > > > > The knee jerk reaction is "mark IPSEC's sockets critical, and mark the > > tunneling allocations critical, and... and..." well you have > > GFP_ATOMIC then my friend. > > I would like to mention another reason why we need to have a new > GFP_CRITICAL flag for an allocation request. When we are in emergency, > even the GFP_KERNEL allocations for a critical socket should not > sleep. This is because the swap device may have failed and we would > like to communicate this event to a management server over the > critical socket so that it can initiate the failover. > > We are not trying to solve swapping over network problem. It is much > simpler. 
The critical sockets are to be used only to send/receive > a few critical messages reliably during a short period of emergency. > If it is only one place, why not pre-allocate one "I'm sick now" skb and hold onto it. Any bigger solution seems to snowball into a huge mess. -- Stephen Hemminger <shemminger@osdl.org> OSDL http://developer.osdl.org/~shemminger ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-16 17:48 ` Stephen Hemminger @ 2005-12-16 18:38 ` Sridhar Samudrala 2005-12-21 9:11 ` Pavel Machek 0 siblings, 1 reply; 46+ messages in thread From: Sridhar Samudrala @ 2005-12-16 18:38 UTC (permalink / raw) To: Stephen Hemminger; +Cc: David S. Miller, mpm, ak, linux-kernel, netdev On Fri, 2005-12-16 at 09:48 -0800, Stephen Hemminger wrote: > On Thu, 15 Dec 2005 18:09:22 -0800 > Sridhar Samudrala <sri@us.ibm.com> wrote: > > > On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote: > > > From: Sridhar Samudrala <sri@us.ibm.com> > > > Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST) > > > > > > > Instead, you seem to be suggesting in_emergency to be set dynamically > > > > when we are about to run out of ATOMIC memory. Is this right? > > > > > > Not when we run out, but rather when we reach some low water mark, the > > > "critical sockets" would still use GFP_ATOMIC memory but only > > > "critical sockets" would be allowed to do so. > > > > > > But even this has faults, consider the IPSEC scenerio I mentioned, and > > > this applies to any kind of encapsulation actually, even simple > > > tunneling examples can be concocted which make the "critical socket" > > > idea fail. > > > > > > The knee jerk reaction is "mark IPSEC's sockets critical, and mark the > > > tunneling allocations critical, and... and..." well you have > > > GFP_ATOMIC then my friend. > > > > I would like to mention another reason why we need to have a new > > GFP_CRITICAL flag for an allocation request. When we are in emergency, > > even the GFP_KERNEL allocations for a critical socket should not > > sleep. This is because the swap device may have failed and we would > > like to communicate this event to a management server over the > > critical socket so that it can initiate the failover. > > > > We are not trying to solve swapping over network problem. It is much > > simpler. 
The critical sockets are to be used only to send/receive > > a few critical messages reliably during a short period of emergency. > > > > If it is only one place, why not pre-allocate one "I'm sick now" > skb and hold onto it. Any bigger solution seems to snowball into > a huge mess. But the problem is even sending/receiving a single packet can cause multiple dynamic allocations in the networking path all the way from the sockets layer->transport->ip->driver. To successfully send a packet, we may have to do arp, send acks and create cached routes etc. So my patch tried to identify the allocations that are needed to successfully send/receive packets over a pre-established socket and adds a new flag GFP_CRITICAL to those calls. This doesn't make any difference when we are not in emergency. But when we go into emergency, VM will try to satisfy these allocations from a critical pool if the normal path leads to failure. We go into emergency when some management app detects that a swap device is about to fail (we are not yet in OOM, but will enter OOM soon). In order to avoid entering OOM, we need to send a message over a critical socket to a remote server that can initiate failover and switch to a different swap device. The switchover will happen within 2 minutes after it is initiated. In a cluster environment, the remote server also sends a message to other nodes which are also running the management app so that they also enter emergency. Once we successfully switch to a different swap device, the remote server sends a message to all the nodes and they come out of emergency. During the period of emergency, all other communications can block. But guaranteeing the successful delivery of the critical messages will help in making sure that we do not enter an OOM situation. Thanks Sridhar ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-16 18:38 ` Sridhar Samudrala @ 2005-12-21 9:11 ` Pavel Machek 2005-12-21 9:39 ` David Stevens 0 siblings, 1 reply; 46+ messages in thread From: Pavel Machek @ 2005-12-21 9:11 UTC (permalink / raw) To: Sridhar Samudrala Cc: Stephen Hemminger, David S. Miller, mpm, ak, linux-kernel, netdev Hi! > > If it is only one place, why not pre-allocate one "I'm sick now" > > skb and hold onto it. Any bigger solution seems to snowball into > > a huge mess. > > But the problem is even sending/receiving a single packet can cause > multiple dynamic allocations in the networking path all the way from > the sockets layer->transport->ip->driver. > To successfully send a packet, we may have to do arp, send acks and > create cached routes etc. So my patch tried to identify the allocations > that are needed to succesfully send/receive packets over a pre-established > socket and adds a new flag GFP_CRITICAL to those calls. > This doesn't make any difference when we are not in emergency. But when > we go into emergency, VM will try to satisfy these allocations from a > critical pool if the normal path leads to failure. > > We go into emergency when some management app detects that a swap device > is about to fail(we are not yet in OOM, but will enter OOM soon). In order > to avoid entering OOM, we need to send a message over a critical socket to > a remote server that can initiate failover and switch to a different swap > device. The switchover will happen within 2 minutes after it is initiated. > In a cluster environment, the remote server also sends a message to other > nodes which are also running the management app so that they also enter > emergency. Once we successfully switch to a different swap device, the remote > server sends a message to all the nodes and they come out of emergency. > > During the period of emergency, all other communications can block. 
But > guranteeing the successful delivery of the critical messages will help > in making sure that we do not enter OOM situation. Why not do it the other way? "If you don't hear from me for 2 minutes, do a switchover". Then all you have to do is _not_ to send a packet -- easier to do. Anything else seems overkill. Pavel -- Thanks, Sharp! ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-21 9:11 ` Pavel Machek @ 2005-12-21 9:39 ` David Stevens 0 siblings, 0 replies; 46+ messages in thread From: David Stevens @ 2005-12-21 9:39 UTC (permalink / raw) To: Pavel Machek Cc: ak, David S. Miller, linux-kernel, mpm, netdev, netdev-owner, Stephen Hemminger, sri > Why not do it the other way? "If you don't hear from me for 2 minutes, > do a switchover". Then all you have to do is _not_ to send a packet -- > easier to do. > > Anything else seems overkill. > Pavel Because in some of the scenarios, including ours, it isn't a simple failover to a known alternate device or configuration -- it is reconfiguring dynamically with information received on a socket from a remote machine (while the swap device is unavailable). Limited socket communication without allocating new memory that may not be available is the problem definition. Avoiding the problem in the first place (your solution) is effective if you can do it, of course. The trick is to solve the problem when you can't avoid it. :-) +-DLS ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 9:12 [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism Sridhar Samudrala 2005-12-14 9:22 ` Andi Kleen @ 2005-12-14 20:16 ` Jesper Juhl 2005-12-14 20:25 ` Ben Greear 2005-12-14 20:49 ` James Courtier-Dutton 1 sibling, 2 replies; 46+ messages in thread From: Jesper Juhl @ 2005-12-14 20:16 UTC (permalink / raw) To: Sridhar Samudrala; +Cc: linux-kernel, netdev On 12/14/05, Sridhar Samudrala <sri@us.ibm.com> wrote: > > These set of patches provide a TCP/IP emergency communication mechanism that > could be used to guarantee high priority communications over a critical socket > to succeed even under very low memory conditions that last for a couple of > minutes. It uses the critical page pool facility provided by Matt's patches > that he posted recently on lkml. > http://lkml.org/lkml/2005/12/14/34/index.html > > This mechanism provides a new socket option SO_CRITICAL that can be used to > mark a socket as critical. A critical connection used for emergency So now everyone writing commercial apps for Linux is going to set SO_CRITICAL on sockets in their apps so their apps can "survive better under pressure than the competitors' apps" and clueless programmers all over are going to think "cool, with this I can make my app more important than everyone else's, I'm going to use this". When everyone and his dog starts to set this, what's the point? > communications has to be established and marked as critical before we enter > the emergency condition. > > It uses the __GFP_CRITICAL flag introduced in the critical page pool patches > to indicate an allocation request as critical and should be satisfied from the > critical page pool if required. In the send path, this flag is passed with all > allocation requests that are made for a critical socket. But in the receive > path we do not know if a packet is critical or not until we receive it and > find the socket that it is destined to.
So we treat all the allocation > requests in the receive path as critical. > > The critical page pool patches also introduces a global flag > 'system_in_emergency' that is used to indicate an emergency situation(could be > a low memory condition). When this flag is set any incoming packets that belong > to non-critical sockets are dropped as soon as possible in the receive path. Hmm, so if I fire up an app that has SO_CRITICAL set on a socket and can then somehow put a lot of memory pressure on the machine I can cause traffic on other sockets to be dropped.. hmmm.. sounds like something to play with to create new and interesting DoS attacks... > This is necessary to prevent incoming non-critical packets to consume memory > from critical page pool. > > I would appreciate any feedback or comments on this approach. > To be a little serious, it sounds like something that could be used to cause trouble and something that will lose its usefulness once enough people start using it (for valid or invalid reasons), so what's the point... -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 20:16 ` Jesper Juhl @ 2005-12-14 20:25 ` Ben Greear 2005-12-14 20:49 ` James Courtier-Dutton 1 sibling, 0 replies; 46+ messages in thread From: Ben Greear @ 2005-12-14 20:25 UTC (permalink / raw) To: Jesper Juhl; +Cc: Sridhar Samudrala, linux-kernel, netdev Jesper Juhl wrote: > To be a little serious, it sounds like something that could be used to > cause trouble and something that will lose its usefulness once enough > people start using it (for valid or invalid reasons), so what's the > point... It could easily be a user-configurable option in an application. If DOS is a real concern, only let this work for root users... Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 20:16 ` Jesper Juhl 2005-12-14 20:25 ` Ben Greear @ 2005-12-14 20:49 ` James Courtier-Dutton 2005-12-14 21:55 ` Sridhar Samudrala 2005-12-15 1:54 ` Mitchell Blank Jr 1 sibling, 2 replies; 46+ messages in thread From: James Courtier-Dutton @ 2005-12-14 20:49 UTC (permalink / raw) To: Jesper Juhl; +Cc: Sridhar Samudrala, linux-kernel, netdev Jesper Juhl wrote: > On 12/14/05, Sridhar Samudrala <sri@us.ibm.com> wrote: > >>These set of patches provide a TCP/IP emergency communication mechanism that >>could be used to guarantee high priority communications over a critical socket >>to succeed even under very low memory conditions that last for a couple of >>minutes. It uses the critical page pool facility provided by Matt's patches >>that he posted recently on lkml. >> http://lkml.org/lkml/2005/12/14/34/index.html >> >>This mechanism provides a new socket option SO_CRITICAL that can be used to >>mark a socket as critical. A critical connection used for emergency > > > So now everyone writing commercial apps for Linux are going to set > SO_CRITICAL on sockets in their apps so their apps can "survive better > under pressure than the competitors aps" and clueless programmers all > over are going to think "cool, with this I can make my app more > important than everyone elses, I'm going to use this". When everyone > and his dog starts to set this, what's the point? > > I don't think the initial patches that Matt did were intended for what you are describing. When I had the conversation with Matt at KS, the problem we were trying to solve was "Memory pressure with network attached swap space". I came up with the idea that I think Matt has implemented. Letting the OS choose which are "critical" TCP/IP sessions is fine. But letting an application choose is a recipe for disaster. James ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 20:49 ` James Courtier-Dutton @ 2005-12-14 21:55 ` Sridhar Samudrala 2005-12-14 22:09 ` James Courtier-Dutton 2005-12-15 1:54 ` Mitchell Blank Jr 1 sibling, 1 reply; 46+ messages in thread From: Sridhar Samudrala @ 2005-12-14 21:55 UTC (permalink / raw) To: James Courtier-Dutton; +Cc: Jesper Juhl, linux-kernel, netdev On Wed, 2005-12-14 at 20:49 +0000, James Courtier-Dutton wrote: > Jesper Juhl wrote: > > On 12/14/05, Sridhar Samudrala <sri@us.ibm.com> wrote: > > > >>These set of patches provide a TCP/IP emergency communication mechanism that > >>could be used to guarantee high priority communications over a critical socket > >>to succeed even under very low memory conditions that last for a couple of > >>minutes. It uses the critical page pool facility provided by Matt's patches > >>that he posted recently on lkml. > >> http://lkml.org/lkml/2005/12/14/34/index.html > >> > >>This mechanism provides a new socket option SO_CRITICAL that can be used to > >>mark a socket as critical. A critical connection used for emergency > > > > > > So now everyone writing commercial apps for Linux are going to set > > SO_CRITICAL on sockets in their apps so their apps can "survive better > > under pressure than the competitors aps" and clueless programmers all > > over are going to think "cool, with this I can make my app more > > important than everyone elses, I'm going to use this". When everyone > > and his dog starts to set this, what's the point? > > > > > > I don't think the initial patches that Matt did were intended for what > you are describing. > When I had the conversation with Matt at KS, the problem we were trying > to solve was "Memory pressure with network attached swap space". > I came up with the idea that I think Matt has implemented. > Letting the OS choose which are "critical" TCP/IP sessions is fine. But > letting an application choose is a recipe for disaster. 
We could easily add capable(CAP_NET_ADMIN) check to allow this option to be set only by privileged users. Thanks Sridhar ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 21:55 ` Sridhar Samudrala @ 2005-12-14 22:09 ` James Courtier-Dutton 2005-12-14 22:39 ` Ben Greear 0 siblings, 1 reply; 46+ messages in thread From: James Courtier-Dutton @ 2005-12-14 22:09 UTC (permalink / raw) To: Sridhar Samudrala; +Cc: Jesper Juhl, linux-kernel, netdev Sridhar Samudrala wrote: > On Wed, 2005-12-14 at 20:49 +0000, James Courtier-Dutton wrote: > >>Jesper Juhl wrote: >> >>>On 12/14/05, Sridhar Samudrala <sri@us.ibm.com> wrote: >>> >>> >>>>These set of patches provide a TCP/IP emergency communication mechanism that >>>>could be used to guarantee high priority communications over a critical socket >>>>to succeed even under very low memory conditions that last for a couple of >>>>minutes. It uses the critical page pool facility provided by Matt's patches >>>>that he posted recently on lkml. >>>> http://lkml.org/lkml/2005/12/14/34/index.html >>>> >>>>This mechanism provides a new socket option SO_CRITICAL that can be used to >>>>mark a socket as critical. A critical connection used for emergency >>> >>> >>>So now everyone writing commercial apps for Linux are going to set >>>SO_CRITICAL on sockets in their apps so their apps can "survive better >>>under pressure than the competitors aps" and clueless programmers all >>>over are going to think "cool, with this I can make my app more >>>important than everyone elses, I'm going to use this". When everyone >>>and his dog starts to set this, what's the point? >>> >>> >> >>I don't think the initial patches that Matt did were intended for what >>you are describing. >>When I had the conversation with Matt at KS, the problem we were trying >>to solve was "Memory pressure with network attached swap space". >>I came up with the idea that I think Matt has implemented. >>Letting the OS choose which are "critical" TCP/IP sessions is fine. But >>letting an application choose is a recipe for disaster. 
> > We could easily add capable(CAP_NET_ADMIN) check to allow this option to > be set only by privileged users. > > Thanks > Sridhar > Sridhar, Have you actually thought about what would happen in a real world scenario? There is no real world requirement for this sort of user-land feature. In memory pressure mode, you don't care about user applications. In fact, under memory pressure no user applications are getting scheduled. All you care about is swapping out memory to achieve a net gain in free memory, so that the applications can then run ok again. James ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 22:09 ` James Courtier-Dutton @ 2005-12-14 22:39 ` Ben Greear 2005-12-14 23:42 ` Sridhar Samudrala 0 siblings, 1 reply; 46+ messages in thread From: Ben Greear @ 2005-12-14 22:39 UTC (permalink / raw) To: James Courtier-Dutton Cc: Sridhar Samudrala, Jesper Juhl, linux-kernel, netdev James Courtier-Dutton wrote: > Have you actually thought about what would happen in a real world senario? > There is no real world requirement for this sort of user land feature. > In memory pressure mode, you don't care about user applications. In > fact, under memory pressure no user applications are getting scheduled. > All you care about is swapping out memory to achieve a net gain in free > memory, so that the applications can then run ok again. Low 'ATOMIC' memory is different from the memory that user space typically uses, so just because you can't allocate an SKB does not mean you are swapping out user-space apps. I have an app that can have 2000+ sockets open. I would definitely like to make the management and other important sockets have priority over others in my app... Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 22:39 ` Ben Greear @ 2005-12-14 23:42 ` Sridhar Samudrala 0 siblings, 0 replies; 46+ messages in thread From: Sridhar Samudrala @ 2005-12-14 23:42 UTC (permalink / raw) To: Ben Greear; +Cc: James Courtier-Dutton, Jesper Juhl, linux-kernel, netdev On Wed, 2005-12-14 at 14:39 -0800, Ben Greear wrote: > James Courtier-Dutton wrote: > > > Have you actually thought about what would happen in a real world senario? > > There is no real world requirement for this sort of user land feature. > > In memory pressure mode, you don't care about user applications. In > > fact, under memory pressure no user applications are getting scheduled. > > All you care about is swapping out memory to achieve a net gain in free > > memory, so that the applications can then run ok again. > > Low 'ATOMIC' memory is different from the memory that user space typically > uses, so just because you can't allocate an SKB does not mean you are swapping > out user-space apps. > > I have an app that can have 2000+ sockets open. I would definately like to make > the management and other important sockets have priority over others in my app... The scenario we are trying to address is also a management connection between the nodes of a cluster and a server that manages the swap devices accessible by all the nodes of the cluster. The critical connection is supposed to be used to exchange status notifications of the swap devices so that failover can happen and be propagated to all the nodes as quickly as possible. The management apps will be pinned into memory so that they are not swapped out. As such the traffic that flows over the critical sockets is not high but should not stall even if we run into a memory-constrained situation. That is the reason why we would like to have a pre-allocated critical page pool which could be used when we run out of ATOMIC memory.
Thanks Sridhar ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-14 20:49 ` James Courtier-Dutton 2005-12-14 21:55 ` Sridhar Samudrala @ 2005-12-15 1:54 ` Mitchell Blank Jr 2005-12-15 11:38 ` James Courtier-Dutton 1 sibling, 1 reply; 46+ messages in thread From: Mitchell Blank Jr @ 2005-12-15 1:54 UTC (permalink / raw) To: James Courtier-Dutton Cc: Jesper Juhl, Sridhar Samudrala, linux-kernel, netdev James Courtier-Dutton wrote: > When I had the conversation with Matt at KS, the problem we were trying > to solve was "Memory pressure with network attached swap space". s/swap space/writable filesystems/ You can hit these problems even if you have no swap. Too much of the memory becomes filled with dirty pages needing writeback -- then you lose your NFS server's ARP entry at the wrong moment. If you have a local disk to swap to the machine will recover after a little bit of grinding, otherwise it's all pretty much over. The big problem is that as long as there's network I/O coming in it's likely that pages you free (as the VM gets more and more desperate about dropping the few remaining non-dirty pages) will get used for sockets that AREN'T helping you recover RAM. You really need to be able to tell the whole network stack "we're in really rough shape here; ignore all RX work unless it's going to help me get write ACKs back from my {NFS,iSCSI} server" My understanding is that is what this patchset is trying to accomplish. -Mitch ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 1:54 ` Mitchell Blank Jr @ 2005-12-15 11:38 ` James Courtier-Dutton 2005-12-15 11:47 ` Arjan van de Ven 0 siblings, 1 reply; 46+ messages in thread From: James Courtier-Dutton @ 2005-12-15 11:38 UTC (permalink / raw) To: Mitchell Blank Jr; +Cc: Jesper Juhl, Sridhar Samudrala, linux-kernel, netdev Mitchell Blank Jr wrote: > James Courtier-Dutton wrote: > >>When I had the conversation with Matt at KS, the problem we were trying >>to solve was "Memory pressure with network attached swap space". > > > s/swap space/writable filesystems/ > > You can hit these problems even if you have no swap. Too much of the > memory becomes filled with dirty pages needing writeback -- then you lose > your NFS server's ARP entry at the wrong moment. If you have a local disk > to swap to the machine will recover after a little bit of grinding, otherwise > it's all pretty much over. > > The big problem is that as long as there's network I/O coming in it's > likely that pages you free (as the VM gets more and more desperate about > dropping the few remaining non-dirty pages) will get used for sockets > that AREN'T helping you recover RAM. You really need to be able to tell > the whole network stack "we're in really rough shape here; ignore all RX > work unless it's going to help me get write ACKs back from my {NFS,iSCSI} > server" My understanding is that is what this patchset is trying to > accomplish. > > -Mitch > > You are using the wrong hammer to crack your nut. You should instead approach your problem of why the ARP entry gets lost. For example, you could give a critical priority to your TCP session, but that still won't cure your ARP problem. I would suggest that the best way to cure your arp problem is to increase the time between arp cache refreshes. James ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 11:38 ` James Courtier-Dutton @ 2005-12-15 11:47 ` Arjan van de Ven 2005-12-15 13:00 ` jamal 0 siblings, 1 reply; 46+ messages in thread From: Arjan van de Ven @ 2005-12-15 11:47 UTC (permalink / raw) To: James Courtier-Dutton Cc: Mitchell Blank Jr, Jesper Juhl, Sridhar Samudrala, linux-kernel, netdev > > You are using the wrong hammer to crack your nut. > You should instead approach your problem of why the ARP entry gets lost. > For example, you could give as critical priority to your TCP session, > but that still won't cure your ARP problem. > I would suggest that the best way to cure your arp problem, is to > increase the time between arp cache refreshes. or turn it around entirely: all traffic is considered important unless... and have a bunch of non-critical sockets (like http requests) be marked non-critical. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 11:47 ` Arjan van de Ven @ 2005-12-15 13:00 ` jamal 2005-12-15 13:07 ` Arjan van de Ven 0 siblings, 1 reply; 46+ messages in thread From: jamal @ 2005-12-15 13:00 UTC (permalink / raw) To: Arjan van de Ven Cc: James Courtier-Dutton, Mitchell Blank Jr, Jesper Juhl, Sridhar Samudrala, linux-kernel, netdev On Thu, 2005-15-12 at 12:47 +0100, Arjan van de Ven wrote: > > > > You are using the wrong hammer to crack your nut. > > You should instead approach your problem of why the ARP entry gets lost. > > For example, you could give as critical priority to your TCP session, > > but that still won't cure your ARP problem. > > I would suggest that the best way to cure your arp problem, is to > > increase the time between arp cache refreshes. > > or turn it around entirely: all traffic is considered important > unless... and have a bunch of non-critical sockets (like http requests) > be marked non-critical. The big hole punched by DaveM is that of dependencies: an http tcp connection is tied to ICMP or the IPSEC example given; so you need a lot more intelligence than just what your app is knowledgeable about at its level. You can't really do this shit at the socket level. You need to do it much earlier. At runtime, when lower memory thresholds get crossed, you kick classification of what packets need to be dropped using something along the lines of stateful/connection tracking. When things get better you undo. cheers, jamal ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 13:00 ` jamal @ 2005-12-15 13:07 ` Arjan van de Ven 2005-12-15 13:32 ` jamal 0 siblings, 1 reply; 46+ messages in thread From: Arjan van de Ven @ 2005-12-15 13:07 UTC (permalink / raw) To: hadi Cc: James Courtier-Dutton, Mitchell Blank Jr, Jesper Juhl, Sridhar Samudrala, linux-kernel, netdev On Thu, 2005-12-15 at 08:00 -0500, jamal wrote: > On Thu, 2005-15-12 at 12:47 +0100, Arjan van de Ven wrote: > > > > > > You are using the wrong hammer to crack your nut. > > > You should instead approach your problem of why the ARP entry gets lost. > > > For example, you could give as critical priority to your TCP session, > > > but that still won't cure your ARP problem. > > > I would suggest that the best way to cure your arp problem, is to > > > increase the time between arp cache refreshes. > > > > or turn it around entirely: all traffic is considered important > > unless... and have a bunch of non-critical sockets (like http requests) > > be marked non-critical. > > The big hole punched by DaveM is that of dependencies: a http tcp > connection is tied to ICMP or the IPSEC example given; so you need a lot > more intelligence than just what your app is knowledgeable about at its > level. yeah well sort of. You're right of course, but that also doesn't mean you can't give hints from the other side. Like "data for this socket is NOT critically important". It gets tricky if you only do it for OOM stuff; because then that one ACK packet could cause a LOT of memory to be freed, and as such can be important for the system even if the socket isn't. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism 2005-12-15 13:07 ` Arjan van de Ven @ 2005-12-15 13:32 ` jamal 0 siblings, 0 replies; 46+ messages in thread From: jamal @ 2005-12-15 13:32 UTC (permalink / raw) To: Arjan van de Ven Cc: James Courtier-Dutton, Mitchell Blank Jr, Jesper Juhl, Sridhar Samudrala, linux-kernel, netdev On Thu, 2005-15-12 at 14:07 +0100, Arjan van de Ven wrote: > On Thu, 2005-12-15 at 08:00 -0500, jamal wrote: > > The big hole punched by DaveM is that of dependencies: a http tcp > > connection is tied to ICMP or the IPSEC example given; so you need a lot > > more intelligence than just what your app is knowledgeable about at its > > level. > > yeah well sort of. You're right of course, but that also doesn't mean > you can't give hints from the other side. Like "data for this socked is > NOT critical important". It gets tricky if you only do it for OOM stuff; > because then that one ACK packet could cause a LOT of memory to be > freed, and as such can be important for the system even if the socket > isn't. > true - but that's _just one input_ into a complex policy decision process. The other is clearly VM realizing some type of threshold has been crossed. The output being a policy decision of what to drop - which gets very interesting if one looks at it being as fine grained as "drop ACKS". The fallacy in the proposed solution is that it simplistically ties the decision to VM input and the network level input to sockets; as in the example of sockets doing http requests. Methinks what is needed is something which keeps state and takes input from the sockets and the VM and then runs some algorithm to decide what needs to be the final policy that gets installed at the low level kernel (tc classifier level or hardware). Sockets provide hints that they are critical. The box admin could override what is important. cheers, jamal ^ permalink raw reply [flat|nested] 46+ messages in thread
[parent not found: <5jUjW-8nu-7@gated-at.bofh.it>]
[parent not found: <5jWYp-3K1-19@gated-at.bofh.it>]
[parent not found: <5jXhZ-4kj-19@gated-at.bofh.it>]
* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism [not found] ` <5jXhZ-4kj-19@gated-at.bofh.it> @ 2005-12-16 8:35 ` Bodo Eggert 0 siblings, 0 replies; 46+ messages in thread From: Bodo Eggert @ 2005-12-16 8:35 UTC (permalink / raw) To: David S. Miller, dlstevens, shemminger, ak, linux-kernel, mpm, netdev, netdev-owner, sri David S. Miller <davem@davemloft.net> wrote: > The idea to mark, for example, IPSEC key management daemon's sockets > as critical is flawed, because the key management daemon could hit a > swap page over the iSCSI device. Don't even start with the idea to > lock the IPSEC key management daemon into ram with mlock(). How are you going to swap in the key manager if you need the key manager for doing this? However, I'd prefer a system where you can't dirty more than (e.g.) 80 % of RAM unless you need this to maintain vital system activity and not more than 95 % unless it will help to get more clean RAM. (Like the priority inheritance suggestion from this thread.) I expect this to at least significantly reduce thrashing and give a very good chance of recovering from memory pressure. Of course the implementation won't be easy, especially if userspace applications need to inherit priority from different code paths, but in theory, it can be done. -- I thank GMX for sabotaging the use of my addresses by means of lies spread via SPF. ^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2005-12-21 9:38 UTC | newest] Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-12-14 9:12 [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism Sridhar Samudrala 2005-12-14 9:22 ` Andi Kleen 2005-12-14 17:55 ` Sridhar Samudrala 2005-12-14 18:41 ` Andi Kleen 2005-12-14 19:20 ` David Stevens 2005-12-15 3:39 ` Matt Mackall 2005-12-15 4:30 ` David S. Miller 2005-12-15 5:02 ` Matt Mackall 2005-12-15 5:23 ` David S. Miller 2005-12-15 5:48 ` Matt Mackall 2005-12-15 5:53 ` Nick Piggin 2005-12-15 5:56 ` Stephen Hemminger 2005-12-15 8:44 ` David Stevens 2005-12-15 8:58 ` David S. Miller 2005-12-15 9:27 ` David Stevens 2005-12-15 5:42 ` Andi Kleen 2005-12-15 6:06 ` Stephen Hemminger 2005-12-15 7:37 ` Sridhar Samudrala 2005-12-15 8:21 ` David S. Miller 2005-12-15 8:35 ` Arjan van de Ven 2005-12-15 8:55 ` [RFC] Fine-grained memory priorities and PI Kyle Moffett 2005-12-15 9:04 ` Andi Kleen 2005-12-15 12:51 ` Kyle Moffett 2005-12-15 13:31 ` Andi Kleen 2005-12-15 12:45 ` Con Kolivas 2005-12-15 12:58 ` Kyle Moffett 2005-12-15 13:02 ` Con Kolivas 2005-12-16 2:09 ` [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism Sridhar Samudrala 2005-12-16 17:48 ` Stephen Hemminger 2005-12-16 18:38 ` Sridhar Samudrala 2005-12-21 9:11 ` Pavel Machek 2005-12-21 9:39 ` David Stevens 2005-12-14 20:16 ` Jesper Juhl 2005-12-14 20:25 ` Ben Greear 2005-12-14 20:49 ` James Courtier-Dutton 2005-12-14 21:55 ` Sridhar Samudrala 2005-12-14 22:09 ` James Courtier-Dutton 2005-12-14 22:39 ` Ben Greear 2005-12-14 23:42 ` Sridhar Samudrala 2005-12-15 1:54 ` Mitchell Blank Jr 2005-12-15 11:38 ` James Courtier-Dutton 2005-12-15 11:47 ` Arjan van de Ven 2005-12-15 13:00 ` jamal 2005-12-15 13:07 ` Arjan van de Ven 2005-12-15 13:32 ` jamal [not found] <5jUjW-8nu-7@gated-at.bofh.it> [not found] ` <5jWYp-3K1-19@gated-at.bofh.it> [not found] ` <5jXhZ-4kj-19@gated-at.bofh.it> 2005-12-16 8:35 ` 
Bodo Eggert