From mboxrd@z Thu Jan 1 00:00:00 1970
From: Arnd Bergmann
Date: Wed, 6 Oct 2021 09:32:57 +0200
To: Pkshih
Cc: Arnd Bergmann, Kalle Valo, linux-wireless@vger.kernel.org
Subject: Re: [PATCH v6 03/24] rtw89: add core and trx files
In-Reply-To: <5d0cb19ab70348f88ab0924dad5c3e7e@realtek.com>
References: <20210820043538.12424-1-pkshih@realtek.com>
 <20210820043538.12424-4-pkshih@realtek.com>
 <877dewiudp.fsf@codeaurora.org>
 <4b2f1c6b06e040d19b64d07500e0447b@realtek.com>
 <87pmsjgbhp.fsf@codeaurora.org>
 <5d0cb19ab70348f88ab0924dad5c3e7e@realtek.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-wireless@vger.kernel.org

On Wed, Oct 6, 2021 at 3:35 AM Pkshih wrote:
>
> >
> > > Compare the object codes side-by-side, they are almost the same except
> > > for some instructions. I think this is because the inline function
> > > I applied __always_inline to contains only a simple statement.
> >
> > Ok. Did you check the output for the configuration that showed the
> > problem as well, after adding __always_inline? There are certain
> > compile-time options that could cause the code to become unoptimized,
> > e.g. KASAN, in addition to the OPTIMIZE_FOR_SIZE.
>
> To summarize the object code sizes for the combinations:
>
> ccflag           default    -Os
> ==============   =======    ======
> inline           0x1AF      X
> always_inline    0x1AA      0x1A4
>
> With the default ccflags, the difference between inline and always_inline
> is a je/jne instruction for 'if (!desc_info->en_wd_info)'. always_inline
> doesn't affect the part that uses RTW89_SET_TXWD().
>
> Comparing the always_inline row, the default-ccflags case uses movzbl
> (4 bytes), but the -Os case uses mov (3 bytes).
>
> So the object code size is affected by -Os; always_inline doesn't change
> the code itself, but it does change the nearby instruction (je/jne).

Those are the known-good cases, yes.

> I use an Ubuntu kernel that doesn't enable KASAN.
> # CONFIG_KASAN is not set

Ah, so you test using the driver backports package on a distro kernel?
While this may be a good option for your development needs, I think it
is generally a good idea to also be able to test your patches against
the latest mainline or linux-next kernel directly, if only to ensure
that there are no obvious regressions.

> > > > +#define RTW89_SET_TXWD_BODY_WP_OFFSET(txdesc, val) \
> > > > +        RTW89_SET_TXWD(txdesc, val, 0x00, GENMASK(31, 24))
> > > > +#define RTW89_SET_TXWD_BODY_MORE_DATA(txdesc, val) \
> > > > +        RTW89_SET_TXWD(txdesc, val, 0x00, BIT(23))
> > > > +#define RTW89_SET_TXWD_BODY_WD_INFO_EN(txdesc, val) \
> > > > +        RTW89_SET_TXWD(txdesc, val, 0x00, BIT(22))
> > > > +#define RTW89_SET_TXWD_BODY_FW_DL(txdesc, val) \
> > > > +        RTW89_SET_TXWD(txdesc, val, 0x00, BIT(20))
> > > >
> > > > I would personally write this without the wrappers, instead defining
> > > > the bitmask macros as the masks and then open-coding the
> > > > le32p_replace_bits() calls, which I would find more intuitive while
> > > > it avoids the problem with the bitmasks.
> > >
> > > Using these macros lets us address the offset and bit fields quickly.
> > > How about using a macro instead of an inline function? Like,
> > >
> > > #define RTW89_SET_TXWD(txdesc, val, offset, mask) \
> > > do { \
> > >         u32 *txd32 = (u32 *)txdesc; \
> > >         le32p_replace_bits((__le32 *)(txd32 + offset), val, mask); \
> > > } while (0)
> >
> > That would obviously address the immediate bug, but I think
> > using le32p_replace_bits() directly here would actually be
> > more readable, after you define the descriptor layout using
> > a structure with named __le32 members to replace the offset.
>
> I will remove the wrapper and use le32p_replace_bits() directly.
>
> I don't plan to use a structure, because these data contain bit-fields.
> Then I would need to maintain little-/big-endian formats, like
>
> struct foo {
> #if BIG_ENDIAN
>         __le32 msb:1;
>         __le32 rsvd:30;
>         __le32 lsb:1;
> #else
>         __le32 lsb:1;
>         __le32 rsvd:30;
>         __le32 msb:1;
> #endif
> };

Right, bitfields would not work well here, as they are generally not
portable. Using an "#ifdef __BIG_ENDIAN_BITFIELD" check can work, but
as you say this is really ugly.
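For illustration, the portable variant of that example would have to look
something like this (and the bitfield members would also have to be plain
integer types rather than __le32):

#include <asm/byteorder.h>

struct foo {
#if defined(__BIG_ENDIAN_BITFIELD)
        u32 msb:1;
        u32 rsvd:30;
        u32 lsb:1;
#elif defined(__LITTLE_ENDIAN_BITFIELD)
        u32 lsb:1;
        u32 rsvd:30;
        u32 msb:1;
#endif
};

Even then, every store to one of those members is a read-modify-write of
the containing word, which is exactly what we want to avoid for the
descriptor.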
What I was trying to suggest instead is a structure like

struct descriptor {
        __le32 word0;
        __le32 word1;
        __le32 word2;
        __le32 word3;
};

and then building the descriptor like (with proper naming of the fields,
of course)

void fill_descriptor(struct my_device *dev, struct sk_buff *skb,
                     volatile struct descriptor *d)
{
        d->word0 = build_desc_word0(fieldA, fieldB, fieldC, fieldD);
        d->word1 = build_desc_word1(fieldE, fieldF);
        ...
}

where the build_desc_word0() functions are the ones that encode the
actual layout, e.g. using the linux/bitfield.h helpers like

static inline __le32 build_desc_word0(u32 fieldA, u32 fieldB, u32 fieldC,
                                      u32 fieldD)
{
        u32 word = FIELD_PREP(REG_FIELD_A, fieldA) |
                   FIELD_PREP(REG_FIELD_B, fieldB) |
                   FIELD_PREP(REG_FIELD_C, fieldC) |
                   FIELD_PREP(REG_FIELD_D, fieldD);

        return cpu_to_le32(word);
}

Doing it this way has the advantage of keeping the assignments separate,
which makes sure you don't accidentally introduce a read-modify-write
cycle on the descriptor. This should work well on all architectures
using dma_alloc_coherent() buffers.

> > > > Going back one more step, I see that rtw89_core_fill_txdesc()
> > > > manipulates the descriptor fields in-memory, which also seems
> > > > like a bad idea: the descriptor is mapped as cache-coherent,
> > > > so on machines with no coherent DMA (i.e. most ARM or MIPS
> > > > machines), that is uncached memory, and writing the descriptor
> > > > using a series of read-modify-write cycles on uncached memory
> > > > will be awfully slow. Maybe the answer is to just completely
> > > > replace the descriptor access.
> > >
> > > I'll think about whether we can use cached memory with map_single/unmap
> > > for the descriptor. That would improve the performance.
> >
> > Using dma_unmap_single() with its cache flush may not work
> > correctly if the descriptor fields have to be written in a particular
> > order. Usually the last field in a descriptor contains a 'valid'
> > bit that must not be observed by the hardware before the rest
> > is visible. The cache flush however would not guarantee the
> > order of the update.
>
> Is it possible to flush the cache twice? Write the fields other than
> the 'valid' bit, then do a wmb() and the first flush. Then set the
> 'valid' bit and do the second flush.

This could work, but it would be really expensive, since the
dma-mapping API is based on ownership state transitions, so you'd
have to go through dma_sync_single_for_device(),
dma_sync_single_for_cpu(), and another dma_sync_single_for_device().
On machines using swiotlb, those would in turn translate into copy
operations.

> > It would also likely be slower than dma_alloc_coherent() on
> > machines that have cache-coherent PCI, such as most x86.
> >
> > The best way is usually to construct the descriptor one word
> > at a time in registers, and write that word using WRITE_ONCE(),
> > with an explicit dma_wmb() before the final write that makes
> > the descriptor valid.
>
> Thanks for the guideline.
>
> Fortunately, the descriptors of this hardware use a circular ring buffer
> with read/write indexes instead of a 'valid' bit. To issue a packet to
> the hardware, we fill the descriptor, including the address of the skb,
> and then update the write index (a register) to trigger the hardware to
> start the DMA for this packet. So I think it is possible to use
> dma_map_single().
>
> Anyway, I will try both methods later.
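For the write-index scheme you describe, the coherent-memory version of
the submit path could then look roughly like this, reusing the
build_desc_word0() helper from above (ring layout and all other names
are made up):

struct my_ring {
        struct descriptor *desc;        /* from dma_alloc_coherent() */
        void __iomem *doorbell;         /* hardware write-index register */
        u32 write_idx;
        u32 size;
};

static void ring_submit(struct my_ring *ring, dma_addr_t skb_dma, u32 len)
{
        struct descriptor *d = &ring->desc[ring->write_idx];

        /* build each word in a register, one store per word */
        d->word0 = build_desc_word0(len, 0, 0, 0);
        d->word1 = cpu_to_le32(lower_32_bits(skb_dma));
        d->word2 = cpu_to_le32(upper_32_bits(skb_dma));
        d->word3 = 0;

        /* make the descriptor visible before the doorbell update */
        dma_wmb();

        ring->write_idx = (ring->write_idx + 1) % ring->size;
        writel(ring->write_idx, ring->doorbell);
}

Since the hardware only looks at descriptors below the write index, the
dma_wmb()/writel() pair takes over the role of the 'valid' bit here.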
If you end up with the streaming mapping, I would suggest using a
single dma_alloc_noncoherent(), followed by dma_sync_single_* later
on, rather than multiple map/unmap calls that would need to reprogram
the IOMMU. The coherent API as I explained above should be more
efficient though, unless you need to do a lot of reads from the
descriptors.
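A rough sketch of that noncoherent variant (again with made-up names,
and assuming the ring keeps a pointer to the struct device) would be:

        /* done once at ring setup time, in cacheable memory */
        ring->desc = dma_alloc_noncoherent(ring->dev,
                                           ring->size * sizeof(*ring->desc),
                                           &ring->dma_addr, DMA_TO_DEVICE,
                                           GFP_KERNEL);

and then per packet:

        u32 idx = ring->write_idx;

        /* fill the descriptor through the cacheable mapping ... */
        fill_descriptor(mydev, skb, &ring->desc[idx]);

        /* ... and hand exactly this descriptor over to the device */
        dma_sync_single_for_device(ring->dev,
                                   ring->dma_addr + idx * sizeof(*ring->desc),
                                   sizeof(*ring->desc), DMA_TO_DEVICE);

        ring->write_idx = (idx + 1) % ring->size;
        writel(ring->write_idx, ring->doorbell);

If the hardware also writes the descriptors back, the mapping would need
to be DMA_BIDIRECTIONAL, with a dma_sync_single_for_cpu() before reading
them.

        Arnd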