From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758339AbcAMPHT (ORCPT ); Wed, 13 Jan 2016 10:07:19 -0500 Received: from out3-smtp.messagingengine.com ([66.111.4.27]:42487 "EHLO out3-smtp.messagingengine.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756584AbcAMPHO (ORCPT ); Wed, 13 Jan 2016 10:07:14 -0500 X-Sasl-enc: +VbKk1HZJ6qNlw1yV81kpZbBSiCBgTwUTzLJCuaMUJ4R 1452697633 Subject: Re: [PATCH] net: add per device sg_max_frags for skb To: Eric Dumazet , Hans Westgaard Ry References: <1452086182-26748-1-git-send-email-hans.westgaard.ry@oracle.com> <063D6719AE5E284EB5DD2968C1650D6D1CCBE5AA@AcuExch.aculab.com> <568F87AC.60405@oracle.com> <568FA1E5.7060204@stressinduktion.org> <569657D8.1020807@oracle.com> Cc: David Laight , "David S. Miller" , Alexey Kuznetsov , James Morris , Hideaki YOSHIFUJI , Patrick McHardy , Alexei Starovoitov , Jiri Pirko , Daniel Borkmann , Nicolas Dichtel , "Eric W. Biederman" , Salam Noureddine , Jarod Wilson , Toshiaki Makita , Julian Anastasov , Ying Xue , Craig Gallek , Mel Gorman , Edward Jee , Julia Lawall , "netdev@vger.kernel.org" , "linux-kernel@vger.kernel.org" , Haakon Bugge , Knut Omang , Wei Lin Guay , Santosh Shilimkar , Yuval Shaia From: Hannes Frederic Sowa Message-ID: <5696681D.4060002@stressinduktion.org> Date: Wed, 13 Jan 2016 16:07:09 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 13.01.2016 15:19, Eric Dumazet wrote: > 1) There are no arch with 1K page sizes. Most certainly, if we had > MAX_SKB_FRAGS=65 some assumptions in the stack would fail. > > 2) TCP stack has coalescing support. write(2) or sendmsg(2) should > append data into the last skb in write queue, and still use 32 KB > frags. > You get pathological skb when using sendpage() or when one thread > writes data into _multiple_ TCP sockets, since TCP stack uses > a per thread 32 KB reserve ( > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5640f7685831e088fe6c2e1f863a6805962f8e81 > ) > > 2) As I said, implementing a limit in TCP stack is not enough. Your > patch is therefore adding complexity for all users, but is not a > general solution. > > GRO, tun device, many things can still cook 'big skbs' > > You need to properly implement a fallback, possibly using > ndo_features_check(), or directly from your ndo_start_xmit() > > 3) We currently have a very dumb way to fallback, forcing a linearize > call, likely to fail if memory is fragmented and skb big. > > You could instead provide a smart helper, trying to reduce the > number of frags in a skb by chosing adjacent frags and > re-allocating/merging them. > > By choosing, I mean trying to pick smallest ones to minimize copy > cost, to get one skb with X less fragment. (X=1 in your case ?) > > I know for example that bnx2x could benefit from such a helper, as > it has a 13 frags limits. > (bnx2x_pkt_req_lin(), called from bnx2x ndo_start_xmit() As I proposed, we could globally (or per netns) limit the maximum , I think this would be okay and could be the best alternative to install slow-paths which could be hit quite constantly. Otherwise, the fallbacks like Eric proposed them are needed. I do not see any other choice. Thanks, Hannes From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hannes Frederic Sowa Subject: Re: [PATCH] net: add per device sg_max_frags for skb Date: Wed, 13 Jan 2016 16:07:09 +0100 Message-ID: <5696681D.4060002@stressinduktion.org> References: <1452086182-26748-1-git-send-email-hans.westgaard.ry@oracle.com> <063D6719AE5E284EB5DD2968C1650D6D1CCBE5AA@AcuExch.aculab.com> <568F87AC.60405@oracle.com> <568FA1E5.7060204@stressinduktion.org> <569657D8.1020807@oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Cc: David Laight , "David S. Miller" , Alexey Kuznetsov , James Morris , Hideaki YOSHIFUJI , Patrick McHardy , Alexei Starovoitov , Jiri Pirko , Daniel Borkmann , Nicolas Dichtel , "Eric W. Biederman" , Salam Noureddine , Jarod Wilson , Toshiaki Makita , Julian Anastasov , Ying Xue , Craig Gallek , Mel Gorman , Edward Jee , Julia Lawall , "netdev@vger.kernel.org" , "linux-kernel@vger.kernel.org" , Hans Westgaard Ry Return-path: In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-Id: netdev.vger.kernel.org On 13.01.2016 15:19, Eric Dumazet wrote: > 1) There are no arch with 1K page sizes. Most certainly, if we had > MAX_SKB_FRAGS=65 some assumptions in the stack would fail. > > 2) TCP stack has coalescing support. write(2) or sendmsg(2) should > append data into the last skb in write queue, and still use 32 KB > frags. > You get pathological skb when using sendpage() or when one thread > writes data into _multiple_ TCP sockets, since TCP stack uses > a per thread 32 KB reserve ( > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5640f7685831e088fe6c2e1f863a6805962f8e81 > ) > > 2) As I said, implementing a limit in TCP stack is not enough. Your > patch is therefore adding complexity for all users, but is not a > general solution. > > GRO, tun device, many things can still cook 'big skbs' > > You need to properly implement a fallback, possibly using > ndo_features_check(), or directly from your ndo_start_xmit() > > 3) We currently have a very dumb way to fallback, forcing a linearize > call, likely to fail if memory is fragmented and skb big. > > You could instead provide a smart helper, trying to reduce the > number of frags in a skb by chosing adjacent frags and > re-allocating/merging them. > > By choosing, I mean trying to pick smallest ones to minimize copy > cost, to get one skb with X less fragment. (X=1 in your case ?) > > I know for example that bnx2x could benefit from such a helper, as > it has a 13 frags limits. > (bnx2x_pkt_req_lin(), called from bnx2x ndo_start_xmit() As I proposed, we could globally (or per netns) limit the maximum , I think this would be okay and could be the best alternative to install slow-paths which could be hit quite constantly. Otherwise, the fallbacks like Eric proposed them are needed. I do not see any other choice. Thanks, Hannes