All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-02 20:31 [PATCH] phonet: Check input from user before allocating Sasha Levin
@ 2012-04-02 19:00 ` Rémi Denis-Courmont
  2012-04-02 21:38   ` David Miller
  2012-04-02 19:01 ` Rémi Denis-Courmont
  1 sibling, 1 reply; 20+ messages in thread
From: Rémi Denis-Courmont @ 2012-04-02 19:00 UTC (permalink / raw)
  To: Sasha Levin; +Cc: remi.denis-courmont, davem, davej, netdev, linux-kernel

Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
> A phonet packet is limited to USHRT_MAX bytes, this is never checked during
> tx which means that the user can specify any size he wishes, and the kernel
> will attempt to allocate that size.

> 
> In the good case, it'll lead to the following warning, but it may also
> cause the kernel to kick in the OOM and kill a random task on the server.
> 
> [ 8921.744094] WARNING: at mm/page_alloc.c:2255
> __alloc_pages_slowpath+0x65/0x730() [ 8921.749770] Pid: 5081, comm:
> trinity Tainted: G        W    3.4.0-rc1-next-20120402-sasha #46 [
> 8921.756672] Call Trace:
> [ 8921.758185]  [<ffffffff810b2ba7>] warn_slowpath_common+0x87/0xb0
> [ 8921.762868]  [<ffffffff810b2be5>] warn_slowpath_null+0x15/0x20
> [ 8921.765399]  [<ffffffff8117eae5>] __alloc_pages_slowpath+0x65/0x730
> [ 8921.769226]  [<ffffffff81179c8a>] ? zone_watermark_ok+0x1a/0x20
> [ 8921.771686]  [<ffffffff8117d045>] ? get_page_from_freelist+0x625/0x660
> [ 8921.773919]  [<ffffffff8117f3a8>] __alloc_pages_nodemask+0x1f8/0x240
> [ 8921.776248]  [<ffffffff811c03e0>] kmalloc_large_node+0x70/0xc0
> [ 8921.778294]  [<ffffffff811c4bd4>] __kmalloc_node_track_caller+0x34/0x1c0
> [ 8921.780847]  [<ffffffff821b0e3c>] ? sock_alloc_send_pskb+0xbc/0x260
> [ 8921.783179]  [<ffffffff821b3c65>] __alloc_skb+0x75/0x170
> [ 8921.784971]  [<ffffffff821b0e3c>] sock_alloc_send_pskb+0xbc/0x260
> [ 8921.787111]  [<ffffffff821b002e>] ? release_sock+0x7e/0x90
> [ 8921.788973]  [<ffffffff821b0ff0>] sock_alloc_send_skb+0x10/0x20
> [ 8921.791052]  [<ffffffff824cfc20>] pep_sendmsg+0x60/0x380
> [ 8921.792931]  [<ffffffff824cb4a6>] ? pn_socket_bind+0x156/0x180
> [ 8921.794917]  [<ffffffff824cb50f>] ? pn_socket_autobind+0x3f/0x90
> [ 8921.797053]  [<ffffffff824cb63f>] pn_socket_sendmsg+0x4f/0x70
> [ 8921.798992]  [<ffffffff821ab8e7>] sock_aio_write+0x187/0x1b0
> [ 8921.801395]  [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
> [ 8921.803501]  [<ffffffff8111842c>] ? __lock_acquire+0x42c/0x4b0
> [ 8921.805505]  [<ffffffff821ab760>] ? __sock_recv_ts_and_drops+0x140/0x140
> [ 8921.807860]  [<ffffffff811e07cc>] do_sync_readv_writev+0xbc/0x110
> [ 8921.809986]  [<ffffffff811958e7>] ? might_fault+0x97/0xa0
> [ 8921.811998]  [<ffffffff817bd99e>] ? security_file_permission+0x1e/0x90
> [ 8921.814595]  [<ffffffff811e17e2>] do_readv_writev+0xe2/0x1e0
> [ 8921.816702]  [<ffffffff810b8dac>] ? do_setitimer+0x1ac/0x200
> [ 8921.818819]  [<ffffffff810e2ec1>] ? get_parent_ip+0x11/0x50
> [ 8921.820863]  [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
> [ 8921.823318]  [<ffffffff811e1926>] vfs_writev+0x46/0x60
> [ 8921.825219]  [<ffffffff811e1a3f>] sys_writev+0x4f/0xb0
> [ 8921.827127]  [<ffffffff82658039>] system_call_fastpath+0x16/0x1b
> [ 8921.829384] ---[ end trace dffe390f30db9eb7 ]---
> 
> Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> ---
>  net/phonet/pep.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/net/phonet/pep.c b/net/phonet/pep.c
> index 9f60008..caee99e 100644
> --- a/net/phonet/pep.c
> +++ b/net/phonet/pep.c
> @@ -1130,6 +1130,9 @@ static int pep_sendmsg(struct kiocb *iocb, struct
> sock *sk, int flags = msg->msg_flags;
>  	int err, done;
> 
> +	if (len > USHRT_MAX)
> +		return -E2BIG;

I think EMSGSIZE is specified in that case.

> +
>  	if ((msg->msg_flags & ~(MSG_DONTWAIT|MSG_EOR|MSG_NOSIGNAL|
>  				MSG_CMSG_COMPAT)) ||
>  			!(msg->msg_flags & MSG_EOR))

-- 
Rémi Denis-Courmont
http://www.remlab.net/
http://fi.linkedin.com/in/remidenis

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-02 20:31 [PATCH] phonet: Check input from user before allocating Sasha Levin
  2012-04-02 19:00 ` Rémi Denis-Courmont
@ 2012-04-02 19:01 ` Rémi Denis-Courmont
  2012-04-02 21:40   ` David Miller
  1 sibling, 1 reply; 20+ messages in thread
From: Rémi Denis-Courmont @ 2012-04-02 19:01 UTC (permalink / raw)
  To: Sasha Levin; +Cc: remi.denis-courmont, davem, davej, netdev, linux-kernel

Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
> A phonet packet is limited to USHRT_MAX bytes, this is never checked during
> tx which means that the user can specify any size he wishes, and the kernel
> will attempt to allocate that size.

Does this really solve the problem?  I guess 128kb is still possible with 
USHRT_MAX plus skbuff overhead, which might still trigger OOM relatively 
easily once the memory gets sufficiently fragmented.

How does UDP deal with this?

-- 
Rémi Denis-Courmont
http://www.remlab.net/
http://fi.linkedin.com/in/remidenis

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH] phonet: Check input from user before allocating
@ 2012-04-02 20:31 Sasha Levin
  2012-04-02 19:00 ` Rémi Denis-Courmont
  2012-04-02 19:01 ` Rémi Denis-Courmont
  0 siblings, 2 replies; 20+ messages in thread
From: Sasha Levin @ 2012-04-02 20:31 UTC (permalink / raw)
  To: remi.denis-courmont, davem; +Cc: davej, netdev, linux-kernel, Sasha Levin

A phonet packet is limited to USHRT_MAX bytes, this is never checked during
tx which means that the user can specify any size he wishes, and the kernel
will attempt to allocate that size.

In the good case, it'll lead to the following warning, but it may also cause
the kernel to kick in the OOM and kill a random task on the server.

[ 8921.744094] WARNING: at mm/page_alloc.c:2255 __alloc_pages_slowpath+0x65/0x730()
[ 8921.749770] Pid: 5081, comm: trinity Tainted: G        W    3.4.0-rc1-next-20120402-sasha #46
[ 8921.756672] Call Trace:
[ 8921.758185]  [<ffffffff810b2ba7>] warn_slowpath_common+0x87/0xb0
[ 8921.762868]  [<ffffffff810b2be5>] warn_slowpath_null+0x15/0x20
[ 8921.765399]  [<ffffffff8117eae5>] __alloc_pages_slowpath+0x65/0x730
[ 8921.769226]  [<ffffffff81179c8a>] ? zone_watermark_ok+0x1a/0x20
[ 8921.771686]  [<ffffffff8117d045>] ? get_page_from_freelist+0x625/0x660
[ 8921.773919]  [<ffffffff8117f3a8>] __alloc_pages_nodemask+0x1f8/0x240
[ 8921.776248]  [<ffffffff811c03e0>] kmalloc_large_node+0x70/0xc0
[ 8921.778294]  [<ffffffff811c4bd4>] __kmalloc_node_track_caller+0x34/0x1c0
[ 8921.780847]  [<ffffffff821b0e3c>] ? sock_alloc_send_pskb+0xbc/0x260
[ 8921.783179]  [<ffffffff821b3c65>] __alloc_skb+0x75/0x170
[ 8921.784971]  [<ffffffff821b0e3c>] sock_alloc_send_pskb+0xbc/0x260
[ 8921.787111]  [<ffffffff821b002e>] ? release_sock+0x7e/0x90
[ 8921.788973]  [<ffffffff821b0ff0>] sock_alloc_send_skb+0x10/0x20
[ 8921.791052]  [<ffffffff824cfc20>] pep_sendmsg+0x60/0x380
[ 8921.792931]  [<ffffffff824cb4a6>] ? pn_socket_bind+0x156/0x180
[ 8921.794917]  [<ffffffff824cb50f>] ? pn_socket_autobind+0x3f/0x90
[ 8921.797053]  [<ffffffff824cb63f>] pn_socket_sendmsg+0x4f/0x70
[ 8921.798992]  [<ffffffff821ab8e7>] sock_aio_write+0x187/0x1b0
[ 8921.801395]  [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
[ 8921.803501]  [<ffffffff8111842c>] ? __lock_acquire+0x42c/0x4b0
[ 8921.805505]  [<ffffffff821ab760>] ? __sock_recv_ts_and_drops+0x140/0x140
[ 8921.807860]  [<ffffffff811e07cc>] do_sync_readv_writev+0xbc/0x110
[ 8921.809986]  [<ffffffff811958e7>] ? might_fault+0x97/0xa0
[ 8921.811998]  [<ffffffff817bd99e>] ? security_file_permission+0x1e/0x90
[ 8921.814595]  [<ffffffff811e17e2>] do_readv_writev+0xe2/0x1e0
[ 8921.816702]  [<ffffffff810b8dac>] ? do_setitimer+0x1ac/0x200
[ 8921.818819]  [<ffffffff810e2ec1>] ? get_parent_ip+0x11/0x50
[ 8921.820863]  [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
[ 8921.823318]  [<ffffffff811e1926>] vfs_writev+0x46/0x60
[ 8921.825219]  [<ffffffff811e1a3f>] sys_writev+0x4f/0xb0
[ 8921.827127]  [<ffffffff82658039>] system_call_fastpath+0x16/0x1b
[ 8921.829384] ---[ end trace dffe390f30db9eb7 ]---

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
---
 net/phonet/pep.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index 9f60008..caee99e 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -1130,6 +1130,9 @@ static int pep_sendmsg(struct kiocb *iocb, struct sock *sk,
 	int flags = msg->msg_flags;
 	int err, done;
 
+	if (len > USHRT_MAX)
+		return -E2BIG;
+
 	if ((msg->msg_flags & ~(MSG_DONTWAIT|MSG_EOR|MSG_NOSIGNAL|
 				MSG_CMSG_COMPAT)) ||
 			!(msg->msg_flags & MSG_EOR))
-- 
1.7.8.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-02 19:00 ` Rémi Denis-Courmont
@ 2012-04-02 21:38   ` David Miller
  0 siblings, 0 replies; 20+ messages in thread
From: David Miller @ 2012-04-02 21:38 UTC (permalink / raw)
  To: remi; +Cc: levinsasha928, remi.denis-courmont, davej, netdev, linux-kernel

From: "Rémi Denis-Courmont" <remi@remlab.net>
Date: Mon, 2 Apr 2012 22:00:40 +0300

> Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
>> +	if (len > USHRT_MAX)
>> +		return -E2BIG;
> 
> I think EMSGSIZE is specified in that case.

Agreed.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-02 19:01 ` Rémi Denis-Courmont
@ 2012-04-02 21:40   ` David Miller
  2012-04-03  1:53     ` Eric Dumazet
  2012-04-03  6:36       ` Rémi Denis-Courmont
  0 siblings, 2 replies; 20+ messages in thread
From: David Miller @ 2012-04-02 21:40 UTC (permalink / raw)
  To: remi; +Cc: levinsasha928, remi.denis-courmont, davej, netdev, linux-kernel

From: "Rémi Denis-Courmont" <remi@remlab.net>
Date: Mon, 2 Apr 2012 22:01:40 +0300

> Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
>> A phonet packet is limited to USHRT_MAX bytes, this is never checked during
>> tx which means that the user can specify any size he wishes, and the kernel
>> will attempt to allocate that size.
> 
> Does this really solve the problem?  I guess 128kb is still possible with 
> USHRT_MAX plus skbuff overhead, which might still trigger OOM relatively 
> easily once the memory gets sufficiently fragmented.
> 
> How does UDP deal with this?

UDP generates a fragment list of MTU sized SKBs.

Phonet could avoid the large allocations by building page based
SKBs.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-02 21:40   ` David Miller
@ 2012-04-03  1:53     ` Eric Dumazet
  2012-04-03  1:59       ` David Miller
  2012-04-03  6:36       ` Rémi Denis-Courmont
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-04-03  1:53 UTC (permalink / raw)
  To: David Miller
  Cc: remi, levinsasha928, remi.denis-courmont, davej, netdev, linux-kernel

On Mon, 2012-04-02 at 17:40 -0400, David Miller wrote:
> From: "Rémi Denis-Courmont" <remi@remlab.net>
> Date: Mon, 2 Apr 2012 22:01:40 +0300
> 
> > Le lundi 2 avril 2012 23:31:00 Sasha Levin, vous avez écrit :
> >> A phonet packet is limited to USHRT_MAX bytes, this is never checked during
> >> tx which means that the user can specify any size he wishes, and the kernel
> >> will attempt to allocate that size.
> > 
> > Does this really solve the problem?  I guess 128kb is still possible with 
> > USHRT_MAX plus skbuff overhead, which might still trigger OOM relatively 
> > easily once the memory gets sufficiently fragmented.
> > 
> > How does UDP deal with this?
> 
> UDP generates a fragment list of MTU sized SKBs.
> 
> Phonet could avoid the large allocations by building page based
> SKBs.

Not that AF_UNIX does nothing in this respect, it can use order-XX pages
for large datagrams.

(I beleve I sent a patch some time ago to address this point)




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  1:53     ` Eric Dumazet
@ 2012-04-03  1:59       ` David Miller
  2012-04-03  2:15         ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2012-04-03  1:59 UTC (permalink / raw)
  To: eric.dumazet
  Cc: remi, levinsasha928, remi.denis-courmont, davej, netdev, linux-kernel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 03:53:17 +0200

> Not that AF_UNIX does nothing in this respect, it can use order-XX pages
> for large datagrams.
> 
> (I beleve I sent a patch some time ago to address this point)

Yes, on the datagram side it's a problem.

For stream AF_UNIX sockets the allocation is capped at SKB_MAX_ALLOC
which evaluates to an order 2 page.

Overall, AF_UNIX ought to be easy to deal with since all of the
routines that copy data between userspace and SKBs can handle
segmented SKBs and thus most of the work is converting over to
sock_alloc_send_pskb() and setting data_len how we set the normal
length of sock_alloc_skb_skb() currently.

Anyways, feel free to resubmit your patch.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  1:59       ` David Miller
@ 2012-04-03  2:15         ` Eric Dumazet
  2012-04-03  2:23           ` David Miller
  2012-04-03 15:28           ` [PATCH v2 net-next] af_unix: reduce high order page allocations Eric Dumazet
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2012-04-03  2:15 UTC (permalink / raw)
  To: David Miller
  Cc: remi, levinsasha928, remi.denis-courmont, davej, netdev, linux-kernel

On Mon, 2012-04-02 at 21:59 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 03:53:17 +0200
> 
> > Not that AF_UNIX does nothing in this respect, it can use order-XX pages
> > for large datagrams.
> > 
> > (I beleve I sent a patch some time ago to address this point)
> 
> Yes, on the datagram side it's a problem.
> 
> For stream AF_UNIX sockets the allocation is capped at SKB_MAX_ALLOC
> which evaluates to an order 2 page.
> 
> Overall, AF_UNIX ought to be easy to deal with since all of the
> routines that copy data between userspace and SKBs can handle
> segmented SKBs and thus most of the work is converting over to
> sock_alloc_send_pskb() and setting data_len how we set the normal
> length of sock_alloc_skb_skb() currently.
> 
> Anyways, feel free to resubmit your patch.

This was indeed a basic patch, but it probably can lower raw performance
on some apps, (if memory frag is not an issue) so I need to bench it. 


Any idea of a representative benchmark in dgram af_unix ?

http://patchwork.ozlabs.org/patch/114103/

I'll respin it with proper performance resuts.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  2:15         ` Eric Dumazet
@ 2012-04-03  2:23           ` David Miller
  2012-04-03  2:29             ` Eric Dumazet
  2012-04-03  2:29             ` Rick Jones
  2012-04-03 15:28           ` [PATCH v2 net-next] af_unix: reduce high order page allocations Eric Dumazet
  1 sibling, 2 replies; 20+ messages in thread
From: David Miller @ 2012-04-03  2:23 UTC (permalink / raw)
  To: eric.dumazet
  Cc: remi, levinsasha928, remi.denis-courmont, davej, netdev, linux-kernel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 04:15:04 +0200

> Any idea of a representative benchmark in dgram af_unix ?

Maybe you could try to make bw_unix from lmbench use
SOCK_DGRAM? :-)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  2:23           ` David Miller
@ 2012-04-03  2:29             ` Eric Dumazet
  2012-04-03  2:29             ` Rick Jones
  1 sibling, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2012-04-03  2:29 UTC (permalink / raw)
  To: David Miller
  Cc: remi, levinsasha928, remi.denis-courmont, davej, netdev, linux-kernel

On Mon, 2012-04-02 at 22:23 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 04:15:04 +0200
> 
> > Any idea of a representative benchmark in dgram af_unix ?
> 
> Maybe you could try to make bw_unix from lmbench use
> SOCK_DGRAM? :-)
> 

Yes, I also converted hackbench to SOCK_DGRAM and change its DATASIZE
from 100 to 10000 bytes per message.

Thanks



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  2:23           ` David Miller
  2012-04-03  2:29             ` Eric Dumazet
@ 2012-04-03  2:29             ` Rick Jones
  2012-04-03  2:34               ` Eric Dumazet
  1 sibling, 1 reply; 20+ messages in thread
From: Rick Jones @ 2012-04-03  2:29 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, remi, levinsasha928, remi.denis-courmont, davej,
	netdev, linux-kernel

On 04/02/2012 07:23 PM, David Miller wrote:
> From: Eric Dumazet<eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 04:15:04 +0200
>
>> Any idea of a representative benchmark in dgram af_unix ?
>
> Maybe you could try to make bw_unix from lmbench use
> SOCK_DGRAM? :-)

I don't know it isn't entirely bitrotted, but there are streaming and 
datagram AF_UNIX tests in netperf - they require conditional inclusion 
via ./configure --enable-unixdomain:

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#DG_005fSTREAM

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  2:29             ` Rick Jones
@ 2012-04-03  2:34               ` Eric Dumazet
  2012-04-03  2:39                 ` Rick Jones
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-04-03  2:34 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, remi, levinsasha928, remi.denis-courmont, davej,
	netdev, linux-kernel

On Mon, 2012-04-02 at 19:29 -0700, Rick Jones wrote:
> On 04/02/2012 07:23 PM, David Miller wrote:
> > From: Eric Dumazet<eric.dumazet@gmail.com>
> > Date: Tue, 03 Apr 2012 04:15:04 +0200
> >
> >> Any idea of a representative benchmark in dgram af_unix ?
> >
> > Maybe you could try to make bw_unix from lmbench use
> > SOCK_DGRAM? :-)
> 
> I don't know it isn't entirely bitrotted, but there are streaming and 
> datagram AF_UNIX tests in netperf - they require conditional inclusion 
> via ./configure --enable-unixdomain:
> 
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#DG_005fSTREAM
> 

Ah yes of course, I'll try that.

BTW, do you have plans to support vmsplice()/splice() as a way to
provide 0-copy to TCP_STREAM ?

I ask this because I am currently working on fixing splice(pipe ->
socket) performance problem and need a standard benchmark to publish
performance results.

Thanks



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  2:34               ` Eric Dumazet
@ 2012-04-03  2:39                 ` Rick Jones
  2012-04-03  3:14                   ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Rick Jones @ 2012-04-03  2:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, remi, levinsasha928, remi.denis-courmont, davej,
	netdev, linux-kernel

On 04/02/2012 07:34 PM, Eric Dumazet wrote:
> On Mon, 2012-04-02 at 19:29 -0700, Rick Jones wrote:
>> I don't know it isn't entirely bitrotted, but there are streaming and
>> datagram AF_UNIX tests in netperf - they require conditional inclusion
>> via ./configure --enable-unixdomain:
>>
>> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#DG_005fSTREAM
>>
>
> Ah yes of course, I'll try that.
>
> BTW, do you have plans to support vmsplice()/splice() as a way to
> provide 0-copy to TCP_STREAM ?

I'd probably plead "too platform specific" but I've already got some 
platform-specific stuff in there like TCP_INFO so I probably cannot hide 
behind that.  Particularly if I ever do make "native" WSA calls for 
Windows...

Until now I'd not thought of vmsplice/splice though - only sendfile for 
zero copy, which should be there for TCP in the form of the TCP_SENDFILE 
test (unmigrated, so none of the omni output selection is available)

> I ask this because I am currently working on fixing splice(pipe ->
> socket) performance problem and need a standard benchmark to publish
> performance results.

I'll start to ponder, without promises.

rick

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  2:39                 ` Rick Jones
@ 2012-04-03  3:14                   ` Eric Dumazet
  2012-04-03 18:18                     ` Rick Jones
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-04-03  3:14 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, remi, levinsasha928, remi.denis-courmont, davej,
	netdev, linux-kernel

On Mon, 2012-04-02 at 19:39 -0700, Rick Jones wrote:
> On 04/02/2012 07:34 PM, Eric Dumazet wrote:
> > On Mon, 2012-04-02 at 19:29 -0700, Rick Jones wrote:
> >> I don't know it isn't entirely bitrotted, but there are streaming and
> >> datagram AF_UNIX tests in netperf - they require conditional inclusion
> >> via ./configure --enable-unixdomain:
> >>
> >> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#DG_005fSTREAM
> >>
> >
> > Ah yes of course, I'll try that.
> >

It seems netperf has some problems zith AF_UNIX dgram :

socket(PF_FILE, SOCK_DGRAM, 0)          = 4
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [0], 4) = 0
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [2048], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [2288], [4]) = 0
sendto(3, "\0\0\0+\377\377\377\377\0\0\0\0\0\0\10\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0"..., 256, 0, NULL, 0) = 256
select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 979635})
recvfrom(3, "\0\0\0,\0\0\0\0\0\0\10\360\0\0\10\0\0\0\0\0\0\0\0\n\0\0\0\0\0\0\0\0"..., 256, 0, NULL, NULL) = 256
connect(4, {sa_family=AF_FILE, path="/tmp/netpe62HGNM"}, 110) = 0
rt_sigaction(SIGALRM, {0x403171, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f691f026af0}, NULL, 8) = 0
alarm(10)                               = 0
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 2048, 0, NULL, 0) = -1 EMSGSIZE (Message too long)
dup(2)                                  = 5
fcntl(5, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
fstat(5, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 3), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f691fa18000
lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
write(5, "dg_send: data send error: Messag"..., 43dg_send: data send error: Message too long
) = 43
close(5)                                = 0
munmap(0x7f691fa18000, 4096)            = 0
exit_group(1)                           = ?

I guess the SO_SNDBUF/SO_RCVBUF limits are a bit too low ?

Tried this on an old kernel as well.

Linux edumazet-glaptop 2.6.38-13-generic #57~lucid1-Ubuntu SMP Tue Mar 6 20:05:46 UTC 2012 x86_64 GNU/Linux



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-02 21:40   ` David Miller
@ 2012-04-03  6:36       ` Rémi Denis-Courmont
  2012-04-03  6:36       ` Rémi Denis-Courmont
  1 sibling, 0 replies; 20+ messages in thread
From: Rémi Denis-Courmont @ 2012-04-03  6:36 UTC (permalink / raw)
  To: David Miller; +Cc: levinsasha928, davej, netdev, linux-kernel

On Mon, 02 Apr 2012 17:40:06 -0400 (EDT), David Miller

<davem@davemloft.net> wrote:

> UDP generates a fragment list of MTU sized SKBs.

> 

> Phonet could avoid the large allocations by building page based

> SKBs.



Oh right. And Phonet devices don't support scatter/gather, so that I guess

that would merely delay the problem.



Also sendmsg() code would need to be reectored to look up the output

device and then the MTU before allocating the socket buffer. This will only

work if the default MTU is reduced first :/



-- 

Rémi Denis-Courmont

Sent from my collocated server

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
@ 2012-04-03  6:36       ` Rémi Denis-Courmont
  0 siblings, 0 replies; 20+ messages in thread
From: Rémi Denis-Courmont @ 2012-04-03  6:36 UTC (permalink / raw)
  To: David Miller; +Cc: levinsasha928, davej, netdev, linux-kernel

On Mon, 02 Apr 2012 17:40:06 -0400 (EDT), David Miller
<davem@davemloft.net> wrote:
> UDP generates a fragment list of MTU sized SKBs.
> 
> Phonet could avoid the large allocations by building page based
> SKBs.

Oh right. And Phonet devices don't support scatter/gather, so that I guess
that would merely delay the problem.

Also sendmsg() code would need to be reectored to look up the output
device and then the MTU before allocating the socket buffer. This will only
work if the default MTU is reduced first :/

-- 
Rémi Denis-Courmont
Sent from my collocated server

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  6:36       ` Rémi Denis-Courmont
  (?)
@ 2012-04-03  6:38       ` David Miller
  -1 siblings, 0 replies; 20+ messages in thread
From: David Miller @ 2012-04-03  6:38 UTC (permalink / raw)
  To: remi; +Cc: levinsasha928, davej, netdev, linux-kernel

From: Rémi Denis-Courmont <remi@remlab.net>
Date: Tue, 03 Apr 2012 08:36:20 +0200

> Oh right. And Phonet devices don't support scatter/gather, so that I
> guess that would merely delay the problem.

And this lack of SG support in the drivers never materialized largely
because the phonet stack never generated such things.

So yes indeed both sides of this equation will need to be addressed.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 net-next] af_unix: reduce high order page allocations
  2012-04-03  2:15         ` Eric Dumazet
  2012-04-03  2:23           ` David Miller
@ 2012-04-03 15:28           ` Eric Dumazet
  2012-04-03 20:43             ` David Miller
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-04-03 15:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

unix_dgram_sendmsg() currently builds linear skbs, and this can stress
page allocator with high order page allocations. When memory gets
fragmented, this can eventually fail.

We can try to use order-2 allocations for skb head (SKB_MAX_ALLOC) plus
up to 16 page fragments to lower pressure on buddy allocator.

This patch has no effect on messages of less than 16064 bytes.
(on 64bit arches with PAGE_SIZE=4096)

For bigger messages (from 16065 to 81600 bytes), this patch brings
reliability at the expense of performance penalty because of extra pages
allocations.

netperf -t DG_STREAM -T 0,2 -- -m 16064 -s 200000
->4086040 Messages / 10s

netperf -t DG_STREAM -T 0,2 -- -m 16068 -s 200000
->3901747 Messages / 10s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
v2: use SKB_MAX_ALLOC instead of SKB_MAX_ORDER(0, 0) to not slow down
    applications using up to 16000 bytes messages.

 net/unix/af_unix.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index d510353..eadb902 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1442,6 +1442,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	long timeo;
 	struct scm_cookie tmp_scm;
 	int max_level;
+	int data_len = 0;
 
 	if (NULL == siocb->scm)
 		siocb->scm = &tmp_scm;
@@ -1475,7 +1476,13 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (len > sk->sk_sndbuf - 32)
 		goto out;
 
-	skb = sock_alloc_send_skb(sk, len, msg->msg_flags&MSG_DONTWAIT, &err);
+	if (len > SKB_MAX_ALLOC)
+		data_len = min_t(size_t,
+				 len - SKB_MAX_ALLOC,
+				 MAX_SKB_FRAGS * PAGE_SIZE);
+
+	skb = sock_alloc_send_pskb(sk, len - data_len, data_len,
+				   msg->msg_flags & MSG_DONTWAIT, &err);
 	if (skb == NULL)
 		goto out;
 
@@ -1485,8 +1492,10 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	max_level = err + 1;
 	unix_get_secdata(siocb->scm, skb);
 
-	skb_reset_transport_header(skb);
-	err = memcpy_fromiovec(skb_put(skb, len), msg->msg_iov, len);
+	skb_put(skb, len - data_len);
+	skb->data_len = data_len;
+	skb->len = len;
+	err = skb_copy_datagram_from_iovec(skb, 0, msg->msg_iov, 0, len);
 	if (err)
 		goto out_free;
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] phonet: Check input from user before allocating
  2012-04-03  3:14                   ` Eric Dumazet
@ 2012-04-03 18:18                     ` Rick Jones
  0 siblings, 0 replies; 20+ messages in thread
From: Rick Jones @ 2012-04-03 18:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, remi, levinsasha928, remi.denis-courmont, davej,
	netdev, linux-kernel

On 04/02/2012 08:14 PM, Eric Dumazet wrote:
> It seems netperf has some problems zith AF_UNIX dgram :
>
> socket(PF_FILE, SOCK_DGRAM, 0)          = 4
> setsockopt(4, SOL_SOCKET, SO_SNDBUF, [0], 4) = 0
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [2048], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [2288], [4]) = 0
> sendto(3, "\0\0\0+\377\377\377\377\0\0\0\0\0\0\10\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0"..., 256, 0, NULL, 0) = 256
> select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 979635})
> recvfrom(3, "\0\0\0,\0\0\0\0\0\0\10\360\0\0\10\0\0\0\0\0\0\0\0\n\0\0\0\0\0\0\0\0"..., 256, 0, NULL, NULL) = 256
> connect(4, {sa_family=AF_FILE, path="/tmp/netpe62HGNM"}, 110) = 0
> rt_sigaction(SIGALRM, {0x403171, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f691f026af0}, NULL, 8) = 0
> alarm(10)                               = 0
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 2048, 0, NULL, 0) = -1 EMSGSIZE (Message too long)
> dup(2)                                  = 5
> fcntl(5, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
> fstat(5, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 3), ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f691fa18000
> lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
> write(5, "dg_send: data send error: Messag"..., 43dg_send: data send error: Message too long
> ) = 43
> close(5)                                = 0
> munmap(0x7f691fa18000, 4096)            = 0
> exit_group(1)                           = ?
>
> I guess the SO_SNDBUF/SO_RCVBUF limits are a bit too low ?

The AF_UNIX support in netperf is from a time when the code was very 
simplistic - if no socket buffer size specified on the command line, 
take the system default.  If no send size specified, use the SO_SNDBUF 
size, though some limits were applied.

Of course, along the way, there was a change in the value for socket 
buffer size that meant "take the default" - it used to be 0 but then 
became -1.  Something to do with when Windows would decide to do copy 
avoidance - tell Windows the socket buffer size is 0 and it will do copy 
avoidance. The code in src/nettest_unix.c was still initializing to 0 
rather than -1.

I've fixed that in the top-of-trunk:

raj@tardy:~/netperf2_trunk$ src/netperf -t DG_STREAM
DG UNIDIRECTIONAL SEND TEST
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

126976   65507    10.00     1195647      0     811.07
  2288            10.00     1195647            811.07

BTW, the local CPU utilization seems sane, but "remote" (which is 
redundant anyway) seems to be fubar somehow.  More bitrot I suspect.

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 net-next] af_unix: reduce high order page allocations
  2012-04-03 15:28           ` [PATCH v2 net-next] af_unix: reduce high order page allocations Eric Dumazet
@ 2012-04-03 20:43             ` David Miller
  0 siblings, 0 replies; 20+ messages in thread
From: David Miller @ 2012-04-03 20:43 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 17:28:28 +0200

> unix_dgram_sendmsg() currently builds linear skbs, and this can stress
> page allocator with high order page allocations. When memory gets
> fragmented, this can eventually fail.
> 
> We can try to use order-2 allocations for skb head (SKB_MAX_ALLOC) plus
> up to 16 page fragments to lower pressure on buddy allocator.
> 
> This patch has no effect on messages of less than 16064 bytes.
> (on 64bit arches with PAGE_SIZE=4096)
> 
> For bigger messages (from 16065 to 81600 bytes), this patch brings
> reliability at the expense of performance penalty because of extra pages
> allocations.
> 
> netperf -t DG_STREAM -T 0,2 -- -m 16064 -s 200000
> ->4086040 Messages / 10s
> 
> netperf -t DG_STREAM -T 0,2 -- -m 16068 -s 200000
> ->3901747 Messages / 10s
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> v2: use SKB_MAX_ALLOC instead of SKB_MAX_ORDER(0, 0) to not slow down
>     applications using up to 16000 bytes messages.

Looks good, applied to net-next, thanks Eric!

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-04-03 20:43 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-02 20:31 [PATCH] phonet: Check input from user before allocating Sasha Levin
2012-04-02 19:00 ` Rémi Denis-Courmont
2012-04-02 21:38   ` David Miller
2012-04-02 19:01 ` Rémi Denis-Courmont
2012-04-02 21:40   ` David Miller
2012-04-03  1:53     ` Eric Dumazet
2012-04-03  1:59       ` David Miller
2012-04-03  2:15         ` Eric Dumazet
2012-04-03  2:23           ` David Miller
2012-04-03  2:29             ` Eric Dumazet
2012-04-03  2:29             ` Rick Jones
2012-04-03  2:34               ` Eric Dumazet
2012-04-03  2:39                 ` Rick Jones
2012-04-03  3:14                   ` Eric Dumazet
2012-04-03 18:18                     ` Rick Jones
2012-04-03 15:28           ` [PATCH v2 net-next] af_unix: reduce high order page allocations Eric Dumazet
2012-04-03 20:43             ` David Miller
2012-04-03  6:36     ` [PATCH] phonet: Check input from user before allocating Rémi Denis-Courmont
2012-04-03  6:36       ` Rémi Denis-Courmont
2012-04-03  6:38       ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.