From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934382AbdDSNLx (ORCPT ); Wed, 19 Apr 2017 09:11:53 -0400 Received: from mail-wr0-f196.google.com ([209.85.128.196]:35171 "EHLO mail-wr0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934263AbdDSNLt (ORCPT ); Wed, 19 Apr 2017 09:11:49 -0400 MIME-Version: 1.0 In-Reply-To: <20170419120114.GB3238@e104818-lin.cambridge.arm.com> References: <10fef112-37f1-0a1b-b5af-435acd032f01@codeaurora.org> <4525901c-45d4-6bd8-eec6-ae92977f16d1@codeaurora.org> <20170406155825.GA7705@e104818-lin.cambridge.arm.com> <08fa98de-760b-15bc-5220-fa449b08c118@codeaurora.org> <725F073F-025B-48B9-9935-24EFEBF2B7DC@caviumnetworks.com> <93d2819a-95b1-6606-74d4-0bc0a64db29e@codeaurora.org> <20170418144839.GF27592@e104818-lin.cambridge.arm.com> <20170419120114.GB3238@e104818-lin.cambridge.arm.com> From: Sunil Kovvuri Date: Wed, 19 Apr 2017 18:41:46 +0530 Message-ID: Subject: Re: [PATCH] Revert "arm64: Increase the max granular size" To: Catalin Marinas Cc: Ganesh Mahendran , open list , "Chalamarla, Tirumalesh" , "open list:ARM/QUALCOMM SUPPORT" , Imran Khan , "linux-arm-kernel@lists.infradead.org" , sgoutham@caviumnetworks.com Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 19, 2017 at 5:31 PM, Catalin Marinas wrote: > On Tue, Apr 18, 2017 at 10:35:02PM +0530, Sunil Kovvuri wrote: >> On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas >> wrote: >> > On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote: >> >> >> >> Do you have an explanation on the performance variation when >> >> >> >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack >> >> >> >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for >> >> >> >> non-coherent DMA?). >> >> >> > >> >> >> > network stack use SKB_DATA_ALIGN to align. >> >> >> > --- >> >> >> > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ >> >> >> > ~(SMP_CACHE_BYTES - 1)) >> >> >> > >> >> >> > #define SMP_CACHE_BYTES L1_CACHE_BYTES >> >> >> > --- >> >> >> > I think this is the reason of performance regression. >> >> >> > >> >> >> >> >> >> Yes this is the reason for performance regression. Due to increases L1 cache alignment the >> >> >> object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to >> >> >> 4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers. >> >> >> >> With what traffic did you check 'skb->truesize' ? Increase from >> >> 2304 to 4352 bytes doesn't seem to be real. I checked with ICMP >> >> pkts with maximum size possible with 1500byte MTU and I don't see >> >> such a bump. If the bump is observed with Iperf sending TCP packets >> >> then I suggest to check if TSO is playing a part over here. >> > >> > I haven't checked truesize but I added some printks to __alloc_skb() (on >> > a Juno platform) and the size argument to this function is 1720 on many >> > occasions. With sizeof(struct skb_shared_info) of 320, the actual data >> > allocation is exactly 2048 when using 64 byte L1_CACHE_SIZE. With a >> > 128 byte cache size, it goes slightly over 2K, hence the 4K slab >> > allocation. >> >> Understood but still in my opinion this '4K slab allocation' cannot be >> considered as an issue with cache line size, there are many network >> drivers out there which do receive buffer or page recycling to >> minimize (sometimes almost to zero) the cost of buffer allocation. > > The slab allocation shouldn't make much difference (unless you are > running on a memory constrained system) but I don't understand how > skb->truesize (which is almost half unused) affects the sk_wmem_alloc > and its interaction with other bits in the network stack (e.g. > tcp_limit_output_bytes). > > However, I do think it's worth investigating further to fully understand > the issue. Absolutely. > >> >The 1720 figure surprised me a bit as well since I was >> > expecting something close to 1500. >> > >> > The thing that worries me is that skb->data may be used as a buffer to >> > DMA into. If that's the case, skb_shared_info is wrongly aligned based >> > on SMP_CACHE_BYTES only and can lead to corruption on a non-DMA-coherent >> > platform. It should really be ARCH_DMA_MINALIGN. >> >> I didn't get this, if you see __alloc_skb() >> >> 229 size = SKB_DATA_ALIGN(size); >> 230 size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); >> >> both DMA buffer and skb_shared_info are aligned to a cacheline separately, >> considering 128byte alignment guarantees 64byte alignment as well, how >> will this lead to corruption ? > > It's the other way around: you align only to 64 byte while running on a > platform with 128 byte cache lines and non-coherent DMA. Okay, I mistook your statement. This is indeed a valid statement. >> And if platform is non-DMA-coherent then again it's the driver which >> should take of coherency by using appropriate map/unmap APIs and >> should avoid any cacheline sharing btw DMA buffer and skb_shared_info. > > The problem is that the streaming DMA API can only work correctly on > cacheline-aligned buffers (because of the cache invalidation it performs > for DMA ops; even with clean&invalidate, the operation isn't always safe > if a cacheline is shared between DMA and CPU buffers). In the skb case, > we could have the data potentially sharing the last addresses of a DMA > buffer with struct skb_shared_info. > > We may be able to get away with SKB_DATA_ALIGN not using > ARCH_DMA_MINALIGN *if* skb_shared_info is *not* written before or during > an inbound DMA transfer (though such tricks are arch specific). > >> > IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue >> > if we go back to 64 byte cache lines. >> >> Yes, Cavium platform is DMA coherent and there is no issue with reverting back >> to 64byte cachelines. But do we want to do this because some platform has a >> performance issue and this is an easy way to solve it. IMHO there seems >> to be many ways to solve performance degradation within the driver itself, and >> if those doesn't work then probably it makes sense to revert this. > > My initial thought was to revert the change because it was causing a > significant performance regression on certain SoC. But given that it > took over a year for people to follow up, it doesn't seem too urgent, so > we should rather try to understand the issue and potential side effects > of moving back to a 64 byte cache line. Yes. Thanks, Sunil. > > -- > Catalin