From: Pkshih <pkshih@realtek.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: Kalle Valo <kvalo@codeaurora.org>,
	"linux-wireless@vger.kernel.org" <linux-wireless@vger.kernel.org>
Subject: RE: [PATCH v6 03/24] rtw89: add core and trx files
Date: Wed, 6 Oct 2021 08:19:26 +0000
Message-ID: <3276be03a32a470e8e1b363ab41b2013@realtek.com>
In-Reply-To: <CAK8P3a1rsKZZKMKFTDWgE3usX9gYKJqUvTMxSdEuZrp8BaKdaA@mail.gmail.com>


> -----Original Message-----
> From: Arnd Bergmann <arnd@arndb.de>
> Sent: Wednesday, October 6, 2021 3:33 PM
> To: Pkshih <pkshih@realtek.com>
> Cc: Arnd Bergmann <arnd@arndb.de>; Kalle Valo <kvalo@codeaurora.org>;
> linux-wireless@vger.kernel.org
> Subject: Re: [PATCH v6 03/24] rtw89: add core and trx files
> 
> On Wed, Oct 6, 2021 at 3:35 AM Pkshih <pkshih@realtek.com> wrote:
> > > >
> > > > Comparing the object code side by side, the two builds are almost the
> > > > same except for a few instructions. I think this is because the inline
> > > > function I marked __always_inline contains only a single statement.
> > >
> > > Ok. Did you check the output for the configuration that showed the
> > > problem as well, after adding __always_inline? There are certain
> > > compile-time options that could cause the code to become unoptimized,
> > > e.g. KASAN, in addition to the OPTIMIZE_FOR_SIZE.
> >
> > Here is a summary of the object code size for each combination:
> >
> > ccflag              default           -Os
> > ======              =======           =============
> > inline              0x1AF             X
> > always_inline       0x1AA             0x1A4
> >
> > With the default ccflags, the only difference between inline and
> > always_inline is a je/jne instruction for 'if (!desc_info->en_wd_info)'.
> > always_inline doesn't affect the parts that use RTW89_SET_TXWD().
> >
> > Comparing across the always_inline row, the default ccflags case uses
> > movzbl (4 bytes), but the -Os case uses mov (3 bytes).
> >
> > So -Os affects the object code size. always_inline doesn't change the
> > RTW89_SET_TXWD() code itself, only the nearby je/jne instruction.
> 
> Those are the known-good cases, yes.
> 
> > I use an Ubuntu kernel that doesn't enable KASAN.
> > # CONFIG_KASAN is not set
> 
> Ah, so you test using the driver backports package on a distro
> kernel? While this may be a good option for your development
> needs, I think it is generally a good idea to also be able to test
> your patches against the latest mainline or linux-next kernel
> directly, if only to ensure that there are no obvious regressions.

No, I don't use backports. I use the Ubuntu kernel PPA [1] to upgrade my
kernel regularly, so it is almost always the latest version.

> 
> > > > > +#define RTW89_SET_TXWD_BODY_WP_OFFSET(txdesc, val) \
> > > > > + RTW89_SET_TXWD(txdesc, val, 0x00, GENMASK(31, 24))
> > > > > +#define RTW89_SET_TXWD_BODY_MORE_DATA(txdesc, val) \
> > > > > + RTW89_SET_TXWD(txdesc, val, 0x00, BIT(23))
> > > > > +#define RTW89_SET_TXWD_BODY_WD_INFO_EN(txdesc, val) \
> > > > > + RTW89_SET_TXWD(txdesc, val, 0x00, BIT(22))
> > > > > +#define RTW89_SET_TXWD_BODY_FW_DL(txdesc, val) \
> > > > > + RTW89_SET_TXWD(txdesc, val, 0x00, BIT(20))
> > > > >
> > > > > I would personally write this without the wrappers, instead defining the
> > > > > bitmask macros as the masks and then open-coding the
> > > > > le32p_replace_bits() calls instead, which I would find more
> > > > > intuitive while it avoids the problem with the bitmasks.
> > > >
> > > > Using these macros, we can address offsets and bit fields quickly.
> > > > How about I use a macro instead of an inline function? Like,
> > > >
> > > > #define RTW89_SET_TXWD(txdesc, val, offset, mask) \
> > > > do { \
> > > >         u32 *txd32 = (u32 *)(txdesc); \
> > > >         le32p_replace_bits((__le32 *)(txd32 + (offset)), val, mask); \
> > > > } while (0)
> > >
> > > That would obviously address the immediate bug, but I think
> > > using le32p_replace_bits() directly here would actually be
> > > more readable, after you define the descriptor layout using
> > > a structure with named __le32 members to replace the offset.
> >
> > I will remove the wrapper and use le32p_replace_bits() directly.
> >
> > I don't plan to use a structure, because these data contain bit-fields.
> > I would then need to maintain both little- and big-endian layouts, like
> >
> > struct foo {
> > #ifdef __BIG_ENDIAN_BITFIELD
> >         __le32 msb:1;
> >         __le32 rsvd:30;
> >         __le32 lsb:1;
> > #else
> >         __le32 lsb:1;
> >         __le32 rsvd:30;
> >         __le32 msb:1;
> > #endif
> > };
> 
> Right, bitfields would not work well here, as they are generally not
> portable. Using an "#ifdef __BIG_ENDIAN_BITFIELD" check can
> work, but as you say this is really ugly.
> 
> What I was trying to suggest instead is a structure like
> 
> struct descriptor {
>      __le32 word0;
>      __le32 word1;
>      __le32 word2;
>      __le32 word3;
> };
> 
> And then build the descriptor like (with proper naming of the fields of course)
> 
> void fill_descriptor(struct my_device *dev, struct sk_buff *skb,
>                      volatile struct descriptor *d)
> {
>           d->word0 = build_desc_word0(fieldA, fieldB, fieldC, fieldD);
>           d->word1 = build_desc_word1(fieldE, fieldF);
>           ...
> }
> 
> where the build_desc_word0() functions are the ones that encode the
> actual layout, e.g. using the linux/bitfield.h helpers like
> 
> static inline __le32 build_desc_word0(u32 fieldA, u32 fieldB,
>                                       u32 fieldC, u32 fieldD)
> {
>         u32 word = FIELD_PREP(REG_FIELD_A, fieldA) |
>                    FIELD_PREP(REG_FIELD_B, fieldB) |
>                    FIELD_PREP(REG_FIELD_C, fieldC) |
>                    FIELD_PREP(REG_FIELD_D, fieldD);
>
>         return cpu_to_le32(word);
> }
> 
> Doing it this way has the advantage of keeping the assignment
> separate, which makes sure you don't accidentally introduce
> a read-modify-write cycle on the descriptor. This should work
> well on all architectures using dma_alloc_coherent() buffers.

Got it.
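
For reference, a minimal userspace sketch of the word-at-a-time approach
suggested above. The four masks are the ones from the quoted
RTW89_SET_TXWD_BODY_* macros for dword 0. Everything prefixed sk_ and the
name build_txwd_body0() are hypothetical stand-ins for the kernel's
GENMASK(), BIT(), FIELD_PREP(), cpu_to_le32() and le32_to_cpu(), written so
the snippet compiles outside the kernel. It is not the driver's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel's GENMASK() and BIT() (32-bit only). */
#define SK_GENMASK(h, l)  ((~UINT32_C(0) >> (31 - (h))) & (~UINT32_C(0) << (l)))
#define SK_BIT(n)         (UINT32_C(1) << (n))

/* Masks for TX WD body dword 0, taken from the quoted macros. */
#define TXWD_BODY0_WP_OFFSET   SK_GENMASK(31, 24)
#define TXWD_BODY0_MORE_DATA   SK_BIT(23)
#define TXWD_BODY0_WD_INFO_EN  SK_BIT(22)
#define TXWD_BODY0_FW_DL       SK_BIT(20)

/* Stand-in for FIELD_PREP(): shift val into the field selected by mask. */
static inline uint32_t sk_field_prep(uint32_t mask, uint32_t val)
{
	uint32_t lsb = mask & (~mask + 1);	/* lowest set bit of the mask */

	/* The value must fit in the field (the kernel checks this at build time). */
	assert(mask != 0 && (val & ~(mask / lsb)) == 0);
	return (val * lsb) & mask;
}

/* Stand-in for cpu_to_le32(): lay the bytes out little-endian on any host. */
static inline uint32_t sk_cpu_to_le32(uint32_t v)
{
	uint8_t b[4] = { (uint8_t)v, (uint8_t)(v >> 8),
			 (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
	uint32_t le;

	memcpy(&le, b, sizeof(le));
	return le;
}

/* Stand-in for le32_to_cpu(), used here only to inspect the result. */
static inline uint32_t sk_le32_to_cpu(uint32_t le)
{
	uint8_t b[4];

	memcpy(b, &le, sizeof(b));
	return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	       (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Hypothetical helper: build the whole dword in a register, convert once. */
static inline uint32_t build_txwd_body0(uint32_t wp_offset, uint32_t more_data,
					uint32_t wd_info_en, uint32_t fw_dl)
{
	uint32_t word = sk_field_prep(TXWD_BODY0_WP_OFFSET, wp_offset) |
			sk_field_prep(TXWD_BODY0_MORE_DATA, more_data) |
			sk_field_prep(TXWD_BODY0_WD_INFO_EN, wd_info_en) |
			sk_field_prep(TXWD_BODY0_FW_DL, fw_dl);

	return sk_cpu_to_le32(word);
}
```

For example, sk_le32_to_cpu(build_txwd_body0(0x12, 1, 0, 1)) gives
0x12900000: WP_OFFSET in bits 31:24, MORE_DATA in bit 23 and FW_DL in bit 20.
In the driver the result would be stored to the descriptor word with a single
assignment, so the descriptor memory is never read back.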

> 
> > > > > Going back one more step, I see that that rtw89_core_fill_txdesc()
> > > > > manipulates the descriptor fields in-memory, which also seems
> > > > > like a bad idea: The descriptor is mapped as cache-coherent,
> > > > > so on machines with no coherent DMA (i.e. most ARM or MIPS
> > > > > machines), that is uncached memory, and writing the descriptor
> > > > > using a series of read-modify-write cycles on uncached memory
> > > > > will be awfully slow. Maybe the answer is to just completely
> > > > > replace the descriptor access.
> > > >
> > > > I'll check whether we can use cached memory with
> > > > dma_map_single()/dma_unmap_single() for the descriptor. That would
> > > > improve the performance.
> > >
> > > Using dma_unmap_single() with its cache flush may not work
> > > correctly if the descriptor fields have to be written in a particular
> > > order. Usually the last field in a descriptor contains a 'valid'
> > > bit that must not be observed by the hardware before the rest
> > > is visible. The cache flush however would not guarantee the
> > > order of the update.
> >
> > Is it possible to flush the cache twice? That is, write the fields other
> > than the 'valid' bit, do wmb() and the first flush, then set the 'valid'
> > bit and do the second flush.
> 
> This could work, but it would be really expensive, since the
> dma-mapping API is based on ownership state transitions, so
> you'd have to go through dma_sync_single_for_device(),
> dma_sync_single_for_cpu(), and another
> dma_sync_single_for_device(). On machines using swiotlb,
> those would in turn translate into copy operations.
> 
> > > It would also likely be slower than dma_alloc_coherent() on
> > > machines that have cache-coherent PCI, such as most x86.
> > >
> > > The best way is usually to construct the descriptor one word
> > > at a time in registers, and write that word using WRITE_ONCE(),
> > > with an explicit dma_wmb() before the final write that makes
> > > the descriptor valid.
> > >
> >
> > Thanks for the guidance.
> >
> > Fortunately, this hardware's descriptors use a circular ring buffer with
> > read/write indexes instead of a 'valid' bit. To issue a packet to the
> > hardware, we fill in the descriptor and the address of the skb, and then
> > update the write index (a register) to trigger the hardware to start DMA
> > of this packet. So, I think it is possible to use dma_map_single().
> >
> > Anyway, I will try both methods later.
> 
> If you end up with the streaming mapping, I would suggest using a
> single dma_alloc_noncoherent(), followed by dma_sync_single_*
> later on, rather than multiple map/unmap calls that would need to
> reprogram the IOMMU. The coherent API as I explained above
> should be more efficient though, unless you need to do a lot of
> reads from the descriptors.
> 

OK. I will try dma_alloc_noncoherent() and measure the performance.
But it seems like you have already told me the answer.
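
To make the ordering concrete, here is a rough userspace analogy of the
ring-index scheme described above, assuming a two-word descriptor and a ring
of four entries (both made up for illustration). The release store on wr_idx
stands in for the dma_wmb() plus the write-index register update. In the
driver the index is a device register, not memory, so this only illustrates
the order of operations. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SK_RING_SIZE 4u

/* Hypothetical two-word descriptor; the real layout has more words. */
struct sk_desc {
	uint32_t word0;
	uint32_t word1;
};

struct sk_ring {
	volatile struct sk_desc desc[SK_RING_SIZE];
	_Atomic uint32_t wr_idx;	/* stand-in for the write index register */
	uint32_t rd_idx;		/* stand-in for the hardware read index */
};

/*
 * Post one descriptor. Returns 0 on success, -1 if the ring is full.
 * Each word is fully built by the caller and stored exactly once, so the
 * descriptor memory is never read back (no read-modify-write).
 */
static int sk_ring_post(struct sk_ring *r, uint32_t w0, uint32_t w1)
{
	uint32_t wr = atomic_load_explicit(&r->wr_idx, memory_order_relaxed);
	uint32_t next = (wr + 1) % SK_RING_SIZE;

	assert(wr < SK_RING_SIZE);
	if (next == r->rd_idx)
		return -1;	/* one slot stays empty to tell full from empty */

	r->desc[wr].word0 = w0;
	r->desc[wr].word1 = w1;

	/* Publish: the descriptor stores above are ordered before this store. */
	atomic_store_explicit(&r->wr_idx, next, memory_order_release);
	return 0;
}
```

With a zero-initialized ring, three posts succeed and the fourth returns -1
until the consumer side advances rd_idx.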

Thanks again for your detailed guidance.

[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/

--
Ping-Ke

