* Unix Socket buffer attribution

From: Yannick Koehler @ 2013-01-22  2:01 UTC (permalink / raw)
To: netdev

Hi,

I was pointed to this list; I have a question about the unix domain socket buffer attribution system (sorry about the language, I am trying to do my best).

I believe I have found a problem in how memory is associated with a given socket when using unix domain sockets in datagram mode. The problem may also be present in the other modes, but I have not checked. I am not that familiar with kernel debugging, and got to this level after trying to understand a user-space application situation.

I have a server socket, using a unix domain socket. Many clients connect to this server. If one of the clients stops calling recvfrom() on its socket, all the other clients stop receiving events. This is essentially an observer/subscriber model, where clients do not need to send anything in order to receive.

At the technical level, when my server tries to send data to any of my clients, I get errno EAGAIN (11) from sendto(). I would have expected to get EAGAIN only when calling sendto() with the address of the particular client that stopped calling recvfrom(), but I actually get EAGAIN for all clients, even those which behave properly and empty their receive buffer using recvfrom().

I believe I have tracked down in the kernel why this occurs. When sending, af_unix.c uses sock_alloc_send_skb() and passes as the first parameter the sk variable that holds my server socket. This increases the sk_wmem_alloc counter associated with the server socket. The code then retrieves the socket associated with my destination and places it in the variable "other". It validates that this "other" socket's receive queue isn't full, and then adds the newly created skb to that queue.
But the memory cost is still accounted under the server's sk_wmem_alloc, and only gets cleared when the skb is freed with kfree_skb(), which invokes sock_wfree() and decreases the count. Since one of the clients isn't calling recvfrom(), kfree_skb() is never invoked and the count only increases. When the count reaches or exceeds the socket's send buffer limit (sk_sndbuf), sock_alloc_send_skb() always returns EAGAIN, independently of which destination I am trying to reach.

I believe that the problem is that once we move the skb into the client's receive queue, we need to decrease the sk_wmem_alloc variable of the server socket, since that skb is no longer tied to the server. The code should then account for this memory as part of the sk_rmem_alloc variable on the client's socket. The function skb_set_owner_r(skb, owner) would seem to be the function to do that.

-- 
Yannick Koehler

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Unix Socket buffer attribution

From: Hannes Frederic Sowa @ 2013-01-23  9:59 UTC (permalink / raw)
To: Yannick Koehler; +Cc: netdev

On Mon, Jan 21, 2013 at 09:01:53PM -0500, Yannick Koehler wrote:
> I believe that the problem is that once we move the skb into the
> client's receive queue we need to decrease the sk_wmem_alloc variable
> of the server socket since that skb is no more tied to the server.
> The code should then account for this memory as part of the
> sk_rmem_alloc variable on the client's socket.  The function
> "skb_set_owner_r(skb,owner)" would seem to be the function to do that,
> so it would seem to me.

Your analysis does make sense. Could you cook a patch?

Thanks,

  Hannes
* Re: Unix Socket buffer attribution

From: Yannick Koehler @ 2013-01-23 16:39 UTC (permalink / raw)
To: Yannick Koehler, netdev

This patch should fix an issue where a unix socket buffer remains accounted as part of the sending socket's sndbuf (sk_wmem_alloc) instead of being accounted as part of the receiving socket's rcvbuf (sk_rmem_alloc). This leads to a situation where, if one of the receiving sockets isn't calling recvfrom(), the sending socket can no longer send to any of its listeners, even those which behave properly. This could create a DoS situation when the unix socket is reachable by many users on the same Linux machine.

diff -uprN -X linux-3.6-vanilla/Documentation/dontdiff linux-3.6-vanilla/include/net/af_unix.h linux-3.6/include/net/af_unix.h
--- linux-3.6-vanilla/include/net/af_unix.h	2012-09-30 19:47:46.000000000 -0400
+++ linux-3.6/include/net/af_unix.h	2013-01-23 11:21:35.000000000 -0500
@@ -34,6 +34,8 @@ struct unix_skb_parms {
 #ifdef CONFIG_SECURITY_NETWORK
 	u32			secid;		/* Security ID */
 #endif
+	char			peer_name[UNIX_PATH_MAX];
+	int			peer_namelen;
 };
 
 #define UNIXCB(skb) (*(struct unix_skb_parms *)&((skb)->cb))
diff -uprN -X linux-3.6-vanilla/Documentation/dontdiff linux-3.6-vanilla/net/rxrpc/ar-internal.h linux-3.6/net/rxrpc/ar-internal.h
--- linux-3.6-vanilla/net/rxrpc/ar-internal.h	2012-09-30 19:47:46.000000000 -0400
+++ linux-3.6/net/rxrpc/ar-internal.h	2013-01-23 11:00:43.000000000 -0500
@@ -77,7 +77,7 @@ struct rxrpc_sock {
 
 /*
  * RxRPC socket buffer private variables
- * - max 48 bytes (struct sk_buff::cb)
+ * - max 160 bytes (struct sk_buff::cb)
  */
 struct rxrpc_skb_priv {
 	struct rxrpc_call	*call;		/* call with which associated */
diff -uprN -X linux-3.6-vanilla/Documentation/dontdiff linux-3.6-vanilla/net/unix/af_unix.c linux-3.6/net/unix/af_unix.c
--- linux-3.6-vanilla/net/unix/af_unix.c	2012-09-30 19:47:46.000000000 -0400
+++ linux-3.6/net/unix/af_unix.c	2013-01-23 11:26:00.000000000 -0500
@@ -1425,6 +1425,19 @@ static void maybe_add_creds(struct sk_bu
 	}
 }
 
+static void unix_backup_addr(struct sk_buff *skb, struct sock *sk)
+{
+	struct unix_sock *u = unix_sk(sk);
+
+	if (u->addr) {
+		memcpy(UNIXCB(skb).peer_name, u->addr->name, u->addr->len);
+		UNIXCB(skb).peer_namelen = u->addr->len;
+	} else {
+		UNIXCB(skb).peer_name[0] = 0;
+		UNIXCB(skb).peer_namelen = 0;
+	}
+}
+
 /*
  *	Send AF_UNIX data.
  */
@@ -1579,9 +1592,19 @@ restart:
 		goto restart;
 	}
 
+	if (atomic_read(&other->sk_rmem_alloc) + skb->truesize >=
+	    (unsigned)other->sk_rcvbuf) {
+		err = -EAGAIN;
+		goto out_unlock;
+	}
+
+	/* Backup source address into sk_buff :: cb */
+	unix_backup_addr(skb, sk);
+
 	if (sock_flag(other, SOCK_RCVTSTAMP))
 		__net_timestamp(skb);
 	maybe_add_creds(skb, sock, other);
+	skb_set_owner_r(skb, other);
 	skb_queue_tail(&other->sk_receive_queue, skb);
 	if (max_level > unix_sk(other)->recursion_level)
 		unix_sk(other)->recursion_level = max_level;
@@ -1696,7 +1719,17 @@ static int unix_stream_sendmsg(struct ki
 	    (other->sk_shutdown & RCV_SHUTDOWN))
 		goto pipe_err_free;
 
+	if (atomic_read(&other->sk_rmem_alloc) + skb->truesize >=
+	    (unsigned)other->sk_rcvbuf) {
+		err = -EAGAIN;
+		goto pipe_err_free;
+	}
+
+	/* Backup source address into sk_buff :: cb */
+	unix_backup_addr(skb, sk);
+
 	maybe_add_creds(skb, sock, other);
+	skb_set_owner_r(skb, other);
 	skb_queue_tail(&other->sk_receive_queue, skb);
 	if (max_level > unix_sk(other)->recursion_level)
 		unix_sk(other)->recursion_level = max_level;
@@ -1754,15 +1787,10 @@ static int unix_seqpacket_recvmsg(struct
 	return unix_dgram_recvmsg(iocb, sock, msg, size, flags);
 }
 
-static void unix_copy_addr(struct msghdr *msg, struct sock *sk)
+static void unix_restore_addr(struct msghdr *msg, struct unix_skb_parms *parms)
 {
-	struct unix_sock *u = unix_sk(sk);
-
-	msg->msg_namelen = 0;
-	if (u->addr) {
-		msg->msg_namelen = u->addr->len;
-		memcpy(msg->msg_name, u->addr->name, u->addr->len);
-	}
+	msg->msg_namelen = parms->peer_namelen;
+	memcpy(msg->msg_name, parms->peer_name, parms->peer_namelen);
 }
 
 static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
@@ -1807,7 +1835,7 @@ static int unix_dgram_recvmsg(struct kio
 			      POLLOUT | POLLWRNORM | POLLWRBAND);
 
 	if (msg->msg_name)
-		unix_copy_addr(msg, skb->sk);
+		unix_restore_addr(msg, &(UNIXCB(skb)));
 
 	if (size > skb->len - skip)
 		size = skb->len - skip;
@@ -2007,7 +2035,7 @@ again:
 
 		/* Copy address just once */
 		if (sunaddr) {
-			unix_copy_addr(msg, skb->sk);
+			unix_restore_addr(msg, &(UNIXCB(skb)));
 			sunaddr = NULL;
 		}

2013/1/23 Hannes Frederic Sowa <hannes@stressinduktion.org>:
> On Mon, Jan 21, 2013 at 09:01:53PM -0500, Yannick Koehler wrote:
>> I believe that the problem is that once we move the skb into the
>> client's receive queue we need to decrease the sk_wmem_alloc variable
>> of the server socket since that skb is no more tied to the server.
>> The code should then account for this memory as part of the
>> sk_rmem_alloc variable on the client's socket.  The function
>> "skb_set_owner_r(skb,owner)" would seem to be the function to do that,
>> so it would seem to me.
>
> Your analysis does make sense. Could you cook a patch?
>
> Thanks,
>
>   Hannes

-- 
Yannick Koehler
Courriel: yannick@koehler.name
Blog: http://corbeillepensees.blogspot.com
* Re: Unix Socket buffer attribution

From: Cong Wang @ 2013-01-23 11:42 UTC (permalink / raw)
To: netdev

On Tue, 22 Jan 2013 at 02:01 GMT, Yannick Koehler <yannick@koehler.name> wrote:
>
> I believe that the problem is that once we move the skb into the
> client's receive queue we need to decrease the sk_wmem_alloc variable
> of the server socket since that skb is no more tied to the server.
> The code should then account for this memory as part of the
> sk_rmem_alloc variable on the client's socket.  The function
> "skb_set_owner_r(skb,owner)" would seem to be the function to do that,
> so it would seem to me.

Something like below??

-------->

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0c61236..e273072 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1205,6 +1205,7 @@ restart:
 
 	unix_state_unlock(sk);
 
+	skb_set_owner_r(skb, other);
 	/* take ten and and send info to listening sock */
 	spin_lock(&other->sk_receive_queue.lock);
 	__skb_queue_tail(&other->sk_receive_queue, skb);
@@ -1578,6 +1579,7 @@ restart:
 
 	if (sock_flag(other, SOCK_RCVTSTAMP))
 		__net_timestamp(skb);
+	skb_set_owner_r(skb, other);
 	maybe_add_creds(skb, sock, other);
 	skb_queue_tail(&other->sk_receive_queue, skb);
 	if (max_level > unix_sk(other)->recursion_level)
@@ -1693,6 +1695,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	    (other->sk_shutdown & RCV_SHUTDOWN))
 		goto pipe_err_free;
 
+	skb_set_owner_r(skb, other);
 	maybe_add_creds(skb, sock, other);
 	skb_queue_tail(&other->sk_receive_queue, skb);
 	if (max_level > unix_sk(other)->recursion_level)
* Re: Unix Socket buffer attribution

From: Eric Dumazet @ 2013-01-23 14:26 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev

On Wed, 2013-01-23 at 11:42 +0000, Cong Wang wrote:
> On Tue, 22 Jan 2013 at 02:01 GMT, Yannick Koehler <yannick@koehler.name> wrote:
> >
> > I believe that the problem is that once we move the skb into the
> > client's receive queue we need to decrease the sk_wmem_alloc variable
> > of the server socket since that skb is no more tied to the server.
> > The code should then account for this memory as part of the
> > sk_rmem_alloc variable on the client's socket.  The function
> > "skb_set_owner_r(skb,owner)" would seem to be the function to do that,
> > so it would seem to me.
>
> Something like below??
>
> +	skb_set_owner_r(skb, other);
>  	/* take ten and and send info to listening sock */
>  	spin_lock(&other->sk_receive_queue.lock);
>  	__skb_queue_tail(&other->sk_receive_queue, skb);
> [and the same skb_set_owner_r() call before the two other
> skb_queue_tail() sites in unix_dgram_sendmsg() and
> unix_stream_sendmsg()]

So what prevents a malicious program from DoSing the machine?

The current behavior is on purpose: limited and predictable, with fewer holes. Some applications might depend on the current flow control. Limiting the working set also helps keep CPU caches hot.

If you want to change it, better do a full analysis, because hackers will. It's probably doable, but with a "man unix" change.
* Re: Unix Socket buffer attribution

From: Yannick Koehler @ 2013-01-23 16:36 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Cong Wang, netdev

> So what prevents a malicious program to DOS the machine ?

The recv queue (checked with recvq_full()) and the receiving socket's rcvbuf (a check added in my patch).

Actually, the current situation can easily lead to a DoS. I simply have to write one application that connects to a unix domain socket, has the server send it data, and never calls recvfrom(); voilà, all other consumers of this unix socket application will no longer be able to communicate with it once it has maxed out its sndbuf (the default is 128k, I believe).

I will submit my patch in my next email.

-- 
Yannick Koehler
* Re: Unix Socket buffer attribution

From: Eric Dumazet @ 2013-01-23 16:56 UTC (permalink / raw)
To: Yannick Koehler; +Cc: Cong Wang, netdev

On Wed, 2013-01-23 at 11:36 -0500, Yannick Koehler wrote:
> > So what prevents a malicious program to DOS the machine ?
>
> The recv queue (checked with recvq_full()) and receiving's socket
> rcvbuf (check added in my patch).

Nope. That limit is given in number of messages, and it's the socket backlog. Many machines set up a somaxconn = 1024 limit in order to reasonably listen for TCP connections.

> Actually the current situation can easily lead to a DOS situation.  I
> simply have to write one application that connect to a unix socket
> domain and have it send me data for which I never call recvfrom() and
> voilà, all other consumer of this unix socket application will no more
> be able to communicate with this application once it maxed out it's
> sndbuf, default is 128k I believe.

A single message can consume ~128k. If we allow 1024 messages to be sent, we consume 128 Mbytes per evil socket. Enough to kill many Linux-based devices.

You'll have to add proper limits (SO_RCVBUF), accounting the truesize of all accumulated messages.
* Re: Unix Socket buffer attribution

From: Eric Dumazet @ 2013-01-23 17:13 UTC (permalink / raw)
To: Yannick Koehler; +Cc: Cong Wang, netdev

On Wed, 2013-01-23 at 08:56 -0800, Eric Dumazet wrote:
> You'll have to add proper limits (SO_RCVBUF), accounting the truesize of
> all accumulated messages.

And if you claim to be able to remove DoS attacks, you'll also have to add global limits, at a very minimum (a la /proc/sys/net/ipv4/tcp_mem or /proc/sys/net/ipv4/udp_mem).

It's not an easy problem, unfortunately.
* Re: Unix Socket buffer attribution

From: Yannick Koehler @ 2013-01-23 17:36 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Cong Wang, netdev

Hi Eric,

I am not sure I follow you. I am not changing how sockets work; I am actually making the af_unix socket work like the others, by using the sndbuf/rcvbuf limits. The code I added was taken from netlink.c and sock.c (sock_queue_err_skb). And actually, I am simply adding a limit check, not removing one. The only negative side effect this may have is to allow more buffers in the system at the same time, but the global number of buffers remains checked, as it was (if it was), since I am not changing how buffers get allocated, just how they are accounted. Please check my patch.

2013/1/23 Eric Dumazet <eric.dumazet@gmail.com>:
> On Wed, 2013-01-23 at 08:56 -0800, Eric Dumazet wrote:
>
>> You'll have to add proper limits (SO_RCVBUF), accounting the truesize of
>> all accumulated messages.
>
> And if you claim being able to remove DOS attacks, you'll also have to
> add global limits, at a very minimum.
>
> (a la /proc/sys/net/ipv4/tcp_mem or /proc/sys/net/ipv4/udp_mem)
>
> Its not an easy problem, unfortunately.

-- 
Yannick Koehler
Courriel: yannick@koehler.name
Blog: http://corbeillepensees.blogspot.com
* Re: Unix Socket buffer attribution

From: Yannick Koehler @ 2013-01-23 16:41 UTC (permalink / raw)
To: netdev

I did some more research and found out that netlink and sock_queue_err_skb() do the same trick that I claim is missing in net/unix/af_unix.c. After adding the code, I ran into a problem: the af_unix.c *_recvmsg() functions assume that skb->sk holds the peer socket, not the socket the skb was received on, and they extract the sun_path name from it. Since with datagrams each packet may have a different peer, the only solution I found was to use the skb control block to hold the peer name. The problem is that this cb member is 48 bytes long while sun_path is 108 bytes, so I had to increase it from 48 to 160 bytes.

This obviously increases the cost of the sk_buff struct, so I do not really like this solution. But it at least seems to prove my point: with this change my clients keep working, except the one misbehaving. I am attaching a patch.

I also saw this thread, which seems related: http://www.mail-archive.com/netdev@vger.kernel.org/msg20195.html

Basically, since we now count the buffer size against the socket that receives it, we can safely use this check instead of looking at the receive queue length:

	if (atomic_read(&other->sk_rmem_alloc) + skb->truesize >=
	    (unsigned)other->sk_rcvbuf) {

So anyway, I sent my patch and am awaiting comments on how to improve it.

-- 
Yannick Koehler
* Re: Unix Socket buffer attribution

From: Hannes Frederic Sowa @ 2013-01-23 18:35 UTC (permalink / raw)
To: Yannick Koehler; +Cc: netdev

On Wed, Jan 23, 2013 at 11:41:16AM -0500, Yannick Koehler wrote:
> I did some more research, I found out that netlink and
> sock_queue_err_skb does the same trick that I claim to be missing
> under the net/unix/af_unix.c.  After adding the code, I got a problem
> since af_unix.c "_recvmsg()" functions assume that the skb->sk is
> holding the peer socket not the current one related to the receive
> skb.  It extract the sun_path name from it.  Since with UDP each
> packet may have a different peer, the only solution I found was to use
> the skb control block to hold the peer name.  The problem is that this
> cb member is 48 bytes in length and sun_path is 108 bytes in size.  So
> I had to increase it from 48 to 160 bytes.
>
> This obviously increase the cost of an SKB struct, so I do not really
> like this solution.  But at least it seems to prove my point and now,
> with this I can have my clients working except the one mis-behaving.
> I am attaching a patch.

Yes, you cannot do that. I would try to place a refcounted pointer to the original sk in unix_skb_parms. Perhaps a refcounted unix_address will do the trick, too.
end of thread, other threads:[~2013-01-23 18:35 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-22  2:01 Unix Socket buffer attribution  Yannick Koehler
2013-01-23  9:59 ` Hannes Frederic Sowa
2013-01-23 16:39   ` Yannick Koehler
2013-01-23 11:42 ` Cong Wang
2013-01-23 14:26   ` Eric Dumazet
2013-01-23 16:36     ` Yannick Koehler
2013-01-23 16:56       ` Eric Dumazet
2013-01-23 17:13         ` Eric Dumazet
2013-01-23 17:36           ` Yannick Koehler
2013-01-23 16:41 ` Yannick Koehler
2013-01-23 18:35   ` Hannes Frederic Sowa