linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Linux 5.19-rc8
@ 2022-07-24 20:42 Linus Torvalds
  2022-07-25 16:11 ` Guenter Roeck
  2022-07-25 20:34 ` Build regressions/improvements in v5.19-rc8 Geert Uytterhoeven
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-07-24 20:42 UTC (permalink / raw)
  To: Linux Kernel Mailing List

As already mentioned last week, this release is one of those "extra
week of rc" ones, and here we are, with release candidate #8.

There's nothing really surprising in here - a few smaller fixups for
the retbleed mess as expected, and the usual random one-liners
elsewhere.

The diffstat shows mainly some documentation updates and a couple of
drivers with bigger fixes (eg the i916 GuC firmware thing), and the
networking sysctl data-race annotations.

So it all just makes me go "yeah, I'm happy to have done another rc,
but there is nothing particularly interesting here". Which is all
fine. Shortlog appended for the curious among you.

We'll let this simmer for another week, and please do give it another
round of testing to make this last week count, ok?

                  Linus

---

Aaron Lewis (1):
      KVM: x86: Protect the unused bits in MSR exiting flags

Adam Borowski (1):
      certs: make system keyring depend on x509 parser

Alexandru Elisei (1):
      ASoC: rockchip: i2s: Fix NULL pointer dereference when pinctrl
is not found

Andy Shevchenko (1):
      gpiolib: cdev: Fix kernel doc for struct line

Ben Dooks (1):
      riscv: add as-options for modules with assembly compontents

Ben Hutchings (1):
      x86/speculation: Make all RETbleed mitigations 64-bit only

Biao Huang (3):
      stmmac: dwmac-mediatek: fix clock issue
      net: stmmac: fix pm runtime issue in stmmac_dvr_remove()
      net: stmmac: fix unbalanced ptp clock issue in suspend/resume flow

Biju Das (1):
      spi: spi-rspi: Fix PIO fallback on RZ platforms

Christian König (1):
      drm/ttm: fix locking in vmap/vunmap TTM GEM helpers

Dan Carpenter (1):
      md/raid5: missing error code in setup_conf()

Daniele Ceraolo Spurio (1):
      drm/i915/guc: support v69 in parallel to v70

Dawid Lukwinski (1):
      i40e: Fix erroneous adapter reinitialization during recovery process

Dmitry Osipenko (1):
      drm/scheduler: Don't kill jobs in interrupt context

Dylan Yudaken (2):
      io_uring: fix free of unallocated buffer list
      io_uring: do not recycle buffer in READV

Eric Snowberg (1):
      lockdown: Fix kexec lockdown bypass with ima policy

Flavio Suligoi (1):
      i2c: imx: fix typo in comment

Gavin Shan (1):
      KVM: selftests: Fix target thread to be migrated in rseq_test

Haibo Chen (3):
      gpio: pca953x: only use single read/write for No AI mode
      gpio: pca953x: use the correct range when do regmap sync
      gpio: pca953x: use the correct register address when regcache
sync during init

Hangyu Hua (1):
      xfrm: xfrm_policy: fix a possible double xfrm_pols_put() in
xfrm_bundle_lookup()

Hayes Wang (1):
      r8152: fix a WOL issue

Herve Codina (1):
      clk: lan966x: Fix the lan966x clock gate register address

Horatiu Vultur (7):
      pinctrl: ocelot: Fix pincfg for lan966x
      pinctrl: ocelot: Fix pincfg
      net: lan966x: Fix taking rtnl_lock while holding spin_lock
      net: lan966x: Fix usage of lan966x->mac_lock when entry is added
      net: lan966x: Fix usage of lan966x->mac_lock when entry is removed
      net: lan966x: Fix usage of lan966x->mac_lock inside
lan966x_mac_irq_handler
      net: lan966x: Fix usage of lan966x->mac_lock when used by FDB

Hristo Venev (1):
      be2net: Fix buffer overflow in be_get_module_eeprom

Ido Schimmel (1):
      mlxsw: spectrum_router: Fix IPv4 nexthop gateway indication

Jacky Bai (1):
      MAINTAINERS: Update freescale pin controllers maintainer

Josh Poimboeuf (1):
      lkdtm: Disable return thunks in rodata.c

Junxiao Chang (1):
      net: stmmac: fix dma queue left shift overflow issue

Juri Lelli (1):
      sched/deadline: Fix BUG_ON condition for deboosted tasks

Justin Stitt (1):
      net: ipv4: fix clang -Wformat warnings

Kan Liang (1):
      perf/x86/intel/lbr: Fix unchecked MSR access error on HSW

Kees Cook (1):
      x86/alternative: Report missing return thunk details

Kent Gibson (1):
      selftests: gpio: fix include path to kernel headers for out of tree builds

Khalid Masum (1):
      scripts/gdb: Fix gdb 'lx-symbols' command

Krzysztof Kozlowski (1):
      riscv: dts: align gpio-key node names with dtschema

Kuniyuki Iwashima (46):
      ip: Fix data-races around sysctl_ip_default_ttl.
      ip: Fix data-races around sysctl_ip_no_pmtu_disc.
      ip: Fix data-races around sysctl_ip_fwd_use_pmtu.
      ip: Fix data-races around sysctl_ip_fwd_update_priority.
      ip: Fix data-races around sysctl_ip_nonlocal_bind.
      ip: Fix a data-race around sysctl_ip_autobind_reuse.
      ip: Fix a data-race around sysctl_fwmark_reflect.
      tcp/dccp: Fix a data-race around sysctl_tcp_fwmark_accept.
      tcp: Fix data-races around sysctl_tcp_l3mdev_accept.
      tcp: Fix data-races around sysctl_tcp_mtu_probing.
      tcp: Fix data-races around sysctl_tcp_base_mss.
      tcp: Fix data-races around sysctl_tcp_min_snd_mss.
      tcp: Fix a data-race around sysctl_tcp_mtu_probe_floor.
      tcp: Fix a data-race around sysctl_tcp_probe_threshold.
      tcp: Fix a data-race around sysctl_tcp_probe_interval.
      tcp/udp: Make early_demux back namespacified.
      igmp: Fix data-races around sysctl_igmp_llm_reports.
      igmp: Fix a data-race around sysctl_igmp_max_memberships.
      igmp: Fix data-races around sysctl_igmp_max_msf.
      igmp: Fix data-races around sysctl_igmp_qrv.
      tcp: Fix data-races around keepalive sysctl knobs.
      tcp: Fix data-races around sysctl_tcp_syn(ack)?_retries.
      tcp: Fix data-races around sysctl_tcp_syncookies.
      tcp: Fix data-races around sysctl_tcp_migrate_req.
      tcp: Fix data-races around sysctl_tcp_reordering.
      tcp: Fix data-races around some timeout sysctl knobs.
      tcp: Fix a data-race around sysctl_tcp_notsent_lowat.
      tcp: Fix a data-race around sysctl_tcp_tw_reuse.
      tcp: Fix data-races around sysctl_max_syn_backlog.
      tcp: Fix data-races around sysctl_tcp_fastopen.
      tcp: Fix data-races around sysctl_tcp_fastopen_blackhole_timeout.
      ipv4: Fix a data-race around sysctl_fib_multipath_use_neigh.
      ipv4: Fix data-races around sysctl_fib_multipath_hash_policy.
      ipv4: Fix data-races around sysctl_fib_multipath_hash_fields.
      ip: Fix data-races around sysctl_ip_prot_sock.
      udp: Fix a data-race around sysctl_udp_l3mdev_accept.
      tcp: Fix data-races around sysctl knobs related to SYN option.
      tcp: Fix a data-race around sysctl_tcp_early_retrans.
      tcp: Fix data-races around sysctl_tcp_recovery.
      tcp: Fix a data-race around sysctl_tcp_thin_linear_timeouts.
      tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.
      tcp: Fix a data-race around sysctl_tcp_retrans_collapse.
      tcp: Fix a data-race around sysctl_tcp_stdurg.
      tcp: Fix a data-race around sysctl_tcp_rfc1337.
      tcp: Fix a data-race around sysctl_tcp_abort_on_overflow.
      tcp: Fix data-races around sysctl_tcp_max_reordering.

Lennert Buytenhek (1):
      igc: Reinstate IGC_REMOVED logic and implement it properly

Li Zhengyu (2):
      RISCV: kexec: Fix build error without CONFIG_MODULES
      RISC-V: kexec: Fix build error without CONFIG_KEXEC

Liang He (3):
      net: dsa: microchip: ksz_common: Fix refcount leak bug
      drm/imx/dcss: Add missing of_node_put() in fail path
      can: rcar_canfd: Add missing of_node_put() in rcar_canfd_probe()

Linus Torvalds (4):
      watchqueue: make sure to serialize 'wqueue->defunct' properly
      watch-queue: remove spurious double semicolon
      mmu_gather: fix the CONFIG_MMU_GATHER_NO_RANGE case
      Linux 5.19-rc8

Lorenzo Bianconi (1):
      net: ethernet: mtk_ppe: fix possible NULL pointer dereference in
mtk_flow_get_wdma_info

Luben Tuikov (1):
      drm/amdgpu: Protect the amdgpu_bo_list list with a mutex v2

Maksym Glubokiy (1):
      net: prestera: acl: use proper mask for port selector

Marc Kleine-Budde (2):
      can: mcp251xfd: fix detection of mcp251863
      spi: bcm2835: bcm2835_spi_handle_err(): fix NULL pointer deref
for non DMA transfers

Mario Limonciello (2):
      pinctrl: Don't allow PINCTRL_AMD to be a module
      ACPI: CPPC: Don't require flexible address space if
X86_FEATURE_CPPC is supported

Mark Brown (1):
      ASoC: rockchip-i2s: Undo BCLK pinctrl changes

Matthew Brost (1):
      drm/i915/guc: Support programming the EU priority in the GuC descriptor

Mustafa Ismail (2):
      RDMA/irdma: Do not advertise 1GB page size for x722
      RDMA/irdma: Fix sleep from invalid context BUG

Neeraj Upadhyay (1):
      srcu: Make expedited RCU grace periods block even less frequently

Nícolas F. R. A. Prado (1):
      drm/panel-edp: Fix variable typo when saving hpd absent delay from DT

Oleksij Rempel (2):
      net: dsa: sja1105: silent spi_device_id warnings
      net: dsa: vitesse-vsc73xx: silent spi_device_id warnings

Oliver Upton (1):
      KVM: stats: Fix value for KVM_STATS_UNIT_MAX for boolean stats

Oz Shlomo (1):
      net/sched: cls_api: Fix flow action initialization

Paolo Bonzini (1):
      tools headers UAPI: Sync linux/kvm.h with the kernel sources

Paul E. McKenney (1):
      srcu: Block less aggressively for expedited grace periods

Pawan Gupta (1):
      x86/bugs: Warn when "ibrs" mitigation is selected on Enhanced IBRS parts

Peter Zijlstra (5):
      x86/amd: Use IBPB for firmware calls
      mmu_gather: Remove per arch tlb_{start,end}_vma()
      csky/tlb: Remove tlb_flush() define
      mmu_gather: Let there be one tlb_{start,end}_vma() implementation
      mmu_gather: Force tlb-flush VM_PFNMAP vmas

Piotr Skajewski (1):
      ixgbe: Add locking to prevent panic when setting sriov_numvfs to zero

Przemyslaw Patynowski (4):
      iavf: Fix VLAN_V2 addition/rejection
      iavf: Disallow changing rx/tx-frames and rx/tx-frames-irq
      iavf: Fix handling of dummy receive descriptors
      iavf: Fix missing state logs

Robert Hancock (1):
      i2c: cadence: Change large transfer count reset logic to be unconditional

Sai Krishna Potthuri (1):
      spi: spi-cadence: Fix SPI NO Slave Select macro definition

Sascha Hauer (1):
      mtd: rawnand: gpmi: Set WAIT_FOR_READY timeout based on
program/erase times

Sasha Neftin (2):
      e1000e: Enable GPT clock before sending message to CSME
      Revert "e1000e: Fix possible HW unit hang after an s0ix exit"

Srinivas Neeli (1):
      gpio: gpio-xilinx: Fix integer overflow

Stylon Wang (1):
      drm/amd/display: Fix new dmub notification enabling in DM

Taehee Yoo (8):
      amt: use workqueue for gateway side message handling
      amt: remove unnecessary locks
      amt: use READ_ONCE() in amt module
      amt: add missing regeneration nonce logic in request logic
      amt: drop unexpected advertisement message
      amt: drop unexpected query message
      amt: drop unexpected multicast data
      amt: do not use amt->nr_tunnels outside of lock

Tariq Toukan (1):
      net/tls: Fix race in TLS device down flow

Tom Lendacky (1):
      virt: sev-guest: Pass the appropriate argument type to iounmap()

Tom Rix (1):
      net: ethernet: mtk_eth_soc: fix off by one check of ARRAY_SIZE

Tony Lindgren (1):
      mmc: sdhci-omap: Fix a lockdep warning for PM runtime init

Vadim Pasternak (1):
      i2c: mlxcpld: Fix register setting for 400KHz frequency

Vladimir Oltean (19):
      docs: net: dsa: update probing documentation
      docs: net: dsa: document the shutdown behavior
      docs: net: dsa: rename tag_protocol to get_tag_protocol
      docs: net: dsa: add more info about the other arguments to
get_tag_protocol
      docs: net: dsa: document change_tag_protocol
      docs: net: dsa: document the teardown method
      docs: net: dsa: document port_setup and port_teardown
      docs: net: dsa: document port_fast_age
      docs: net: dsa: remove port_bridge_tx_fwd_offload
      docs: net: dsa: remove port_vlan_dump
      docs: net: dsa: delete port_mdb_dump
      docs: net: dsa: add a section for address databases
      docs: net: dsa: re-explain what port_fdb_dump actually does
      docs: net: dsa: delete misinformation about -EOPNOTSUPP for FDB/MDB/VLAN
      docs: net: dsa: mention that VLANs are now refcounted on shared ports
      pinctrl: armada-37xx: make irq_lock a raw spinlock to avoid
invalid wait context
      pinctrl: armada-37xx: use raw spinlocks for regmap to avoid
invalid wait context
      net: dsa: fix dsa_port_vlan_filtering when global
      net: dsa: fix NULL pointer dereference in dsa_port_reset_vlan_filtering

William Dean (2):
      pinctrl: ralink: Check for null return of devm_kcalloc
      pinctrl: sunplus: Add check for kcalloc

Wong Vee Khee (2):
      net: stmmac: switch to use interrupt for hw crosstimestamping
      net: stmmac: remove redunctant disable xPCS EEE call

Xin Long (1):
      Documentation: fix udp_wmem_min in ip-sysctl.rst

xinhui pan (1):
      drm/amdgpu: Remove one duplicated ef removal

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-24 20:42 Linux 5.19-rc8 Linus Torvalds
@ 2022-07-25 16:11 ` Guenter Roeck
  2022-07-25 17:55   ` Linus Torvalds
  2022-07-25 20:34 ` Build regressions/improvements in v5.19-rc8 Geert Uytterhoeven
  1 sibling, 1 reply; 27+ messages in thread
From: Guenter Roeck @ 2022-07-25 16:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List

On Sun, Jul 24, 2022 at 01:42:18PM -0700, Linus Torvalds wrote:
> As already mentioned last week, this release is one of those "extra
> week of rc" ones, and here we are, with release candidate #8.
> 
> There's nothing really surprising in here - a few smaller fixups for
> the retbleed mess as expected, and the usual random one-liners
> elsewhere.
> 
> The diffstat shows mainly some documentation updates and a couple of
> drivers with bigger fixes (eg the i916 GuC firmware thing), and the
> networking sysctl data-race annotations.
> 
> So it all just makes me go "yeah, I'm happy to have done another rc,
> but there is nothing particularly interesting here". Which is all
> fine. Shortlog appended for the curious among you.
> 
> We'll let this simmer for another week, and please do give it another
> round of testing to make this last week count, ok?
> 

Build results:
	total: 150 pass: 150 fail: 0
Qemu test results:
	total: 489 pass: 489 fail: 0

I still randomly see

BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48

That is not a new problem; I see it with some arm boot tests
whenever KFENCE is enabled. I attached a trace with decoded
symbols below. Maybe someone with more experience than me can
figure out if this is an arm specific problem or a test problem
or a qemu problem.

Guenter

---
test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 11667
==================================================================
BUG: KFENCE: out-of-bounds read in _find_next_bit_le (arch/arm/lib/findbit.S:88)

Out-of-bounds read at 0xef59e000 (4096B right of kfence-#93):
_find_next_bit_le (arch/arm/lib/findbit.S:88)
kfence-#93: 0xef59d000-0xef59dfff, size=4096, cache=kmalloc-4k
allocated by task 1 on cpu 1 at 18.432911s:
test_bitmap_printlist (./include/linux/slab.h:600 lib/test_bitmap.c:452)
test_bitmap_init (lib/test_bitmap.c:883 lib/test_bitmap.c:889)
do_one_initcall (./include/linux/jump_label.h:261 ./include/linux/jump_label.h:271 ./include/trace/events/initcall.h:48 init/main.c:1296)
kernel_init_freeable (init/main.c:1367 init/main.c:1384 init/main.c:1403 init/main.c:1610)
kernel_init (init/main.c:1501)
ret_from_fork (arch/arm/kernel/entry-common.S:149)
 0x0
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.19.0-rc8 #1
Hardware name: Samsung Exynos (Flattened Device Tree)
PC is at _find_next_bit_le (arch/arm/lib/findbit.S:88)
LR is at bitmap_list_string.constprop.0 (lib/vsprintf.c:1246)
pc : lr : psr: 20000113
sp : f082dc70  ip : 00000001  fp : 00000001
r10: 00000000  r9 : 0000002d  r8 : ef59d000
r7 : c0e55514  r6 : c2215000  r5 : 00008000  r4 : 00008000
r3 : 845cac12  r2 : 00008001  r1 : 00008000  r0 : ef59d000
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 4000406a  DAC: 00000051
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.19.0-rc8 #1
Hardware name: Samsung Exynos (Flattened Device Tree)
unwind_backtrace from show_stack (arch/arm/kernel/traps.c:255)
show_stack from dump_stack_lvl (lib/dump_stack.c:107)
dump_stack_lvl from kfence_report_error (mm/kfence/report.c:262)
kfence_report_error from kfence_handle_page_fault (mm/kfence/core.c:1151)
kfence_handle_page_fault from __do_kernel_fault.part.0 (arch/arm/mm/fault.c:143)
__do_kernel_fault.part.0 from do_page_fault (arch/arm/mm/fault.c:380)
do_page_fault from do_DataAbort (arch/arm/mm/fault.c:539)
do_DataAbort from __dabt_svc (arch/arm/kernel/entry-armv.S:214)
Exception stack(0xf082dc20 to 0xf082dc68)
dc20: ef59d000 00008000 00008001 845cac12 00008000 00008000 c2215000 c0e55514
dc40: ef59d000 0000002d 00000000 00000001 00000001 f082dc70 c0715930 c06ff18c
dc60: 20000113 ffffffff
__dabt_svc from _find_next_bit_le (arch/arm/lib/findbit.S:88)
==================================================================
test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
', Time: 15696250

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-25 16:11 ` Guenter Roeck
@ 2022-07-25 17:55   ` Linus Torvalds
  2022-07-25 18:49     ` Linus Torvalds
                       ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Linus Torvalds @ 2022-07-25 17:55 UTC (permalink / raw)
  To: Guenter Roeck, Yury Norov, Dennis Zhou, Russell King - ARM Linux,
	Catalin Marinas
  Cc: Linux Kernel Mailing List

On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <linux@roeck-us.net> wrote:
>
> BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48

Ok, I was hoping somebody more ARMy would look at this, particularly
since there is no call trace beyond the actual fault.

So it shows that it happens in _find_next_bit_le(), but not who called it.

It does show "who allocated the page", and I can see the message that
is printed afterwards, so it comes from that

   static void __init test_bitmap_printlist(void)

function, so I guess we know the call chain:

  test_bitmap_printlist ->
    bitmap_print_to_pagebuf ->
      scnprintf "%*pbl\n" ->
        pointer ->
          bitmap_list_string ->
            for_each_set_bitrange

and I think I see what's wrong in there. That thing does

             (b) = find_next_bit((addr), (size), (e) + 1),      \
             (e) = find_next_zero_bit((addr), (size), (b) + 1))

for the end of the range, and looking at the oops, the instruction
that oopses is

         ldrb    r3, [r0, r2, lsr #3]

where 'r2' is the bit position, and 'r0' is the start of the bitmap.

And:

> r10: 00000000  r9 : 0000002d  r8 : ef59d000
> r7 : c0e55514  r6 : c2215000  r5 : 00008000  r4 : 00008000
> r3 : 845cac12  r2 : 00008001  r1 : 00008000  r0 : ef59d000

Lookie here: r1 contains the size, and r2 is past the end of the size.

So pick your poison: either the bug is in

 (a) the bitmap region iterators shouldn't even ask for past-the-end results

     I've added Dennis Zhou who did that first
bitmap_for_each_set_region() in commit e837dfde15a4 ("bitmap:
genericize percpu bitmap region iterators"), and Yuri Norov who
renamed and moved it to for_each_set_bitrange() in commit ec288a2cf7ca
("bitmap: unify find_bit operations").

or

 (b) the ARM find_next_bit() implementation, which doesn't check
whether the position is past the end

     I've added Russell King (ARM stuff) and Catalin Marinas who
touched that last many many years ago in 8b592783a2e8 ("Thumb-2:
Implement the unified arch/arm/lib functions")

I think it's arguably a little bit of both, but mostly (b).

Note how the genetic find_next_bit() (and _find_next_bit()) does

        if (unlikely(start >= nbits))
                return nbits;

but the arm version of it does not.

I think the fix might be something like this:

  diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
  index b5e8b9ae4c7d..b36ca301892e 100644
  --- a/arch/arm/lib/findbit.S
  +++ b/arch/arm/lib/findbit.S
  @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
   ENTRY(_find_next_bit_le)
                teq     r1, #0
                beq     3b
  +             cmp     r2, r1
  +             bhs     3b
                ands    ip, r2, #7
                beq     1b                      @ If new byte, goto old routine
    ARM(                ldrb    r3, [r0, r2, lsr #3]    )

but my ARM asm is so broken that the above is just really random noise
that may or may not build - much less work.

I'll leave it to Russell &co to have a tested and working patch.

Hmm?

                    Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-25 17:55   ` Linus Torvalds
@ 2022-07-25 18:49     ` Linus Torvalds
  2022-07-25 20:35       ` Yury Norov
  2022-07-25 19:41     ` Yury Norov
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-07-25 18:49 UTC (permalink / raw)
  To: Guenter Roeck, Yury Norov, Dennis Zhou, Russell King - ARM Linux,
	Catalin Marinas
  Cc: Linux Kernel Mailing List

On Mon, Jul 25, 2022 at 10:55 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I think the fix might be something like this:

Hmm. Maybe the fix is to just not have the arm architecture-specific
version at all.

The generic code handles the "small constant size bitmap that fits in
a word" case better than the ARM special case code does.

And the generic code handles the "scan large bitmap" case better than
the ARM code does too, in that it does things a word at a time, while
the ARM special case code does things one byte at a time.

The ARM code does have a few things going for it:

 (a) it's simple

 (b) it has separate routines for the little-endian case

Now, (a) is probably not too strong an argument, because it's arguably
*too* simple, and buggy as a result. And having looked a bit more,
it's not just _find_next_bit_le() that has this bug, it's all the
"next" versions (zero-bit and big-endian).

But (b) is actively better than what the generic "find bit" code has.
The generic code is kind of disgusting in this area, with code like

        if (le)
                tmp = swab(tmp);

in lib/find_bit.c and this is nasty for two reasons:

 (1) on little-endian, the "le" flag is mis-named: it's always zero,
and it never should swab, but the code was taken from some big-endian
generic case

 (2) even on big-endian, that "le" flag is basically a compile-time
constant, but the header file shenanigans have hidden that fact, so it
ends up being a "pass a constant to a function that then has to test
it dynamically" situation

So the generic code is in many ways better than the ARM special case
code, but it has a couple of unfortunate warts too. At least those
unfortunate warts aren't outright *bugs*, they are just ugly.

              Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-25 17:55   ` Linus Torvalds
  2022-07-25 18:49     ` Linus Torvalds
@ 2022-07-25 19:41     ` Yury Norov
  2022-07-26  9:12     ` Russell King (Oracle)
  2022-07-26 17:39     ` Dennis Zhou
  3 siblings, 0 replies; 27+ messages in thread
From: Yury Norov @ 2022-07-25 19:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guenter Roeck, Dennis Zhou, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List

On Mon, Jul 25, 2022 at 10:55:18AM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48
> 
> Ok, I was hoping somebody more ARMy would look at this, particularly
> since there is no call trace beyond the actual fault.
> 
> So it shows that it happens in _find_next_bit_le(), but not who called it.
> 
> It does show "who allocated the page", and I can see the message that
> is printed afterwards, so it comes from that
> 
>    static void __init test_bitmap_printlist(void)
> 
> function, so I guess we know the call chain:
> 
>   test_bitmap_printlist ->
>     bitmap_print_to_pagebuf ->
>       scnprintf "%*pbl\n" ->
>         pointer ->
>           bitmap_list_string ->
>             for_each_set_bitrange
> 
> and I think I see what's wrong in there. That thing does
> 
>              (b) = find_next_bit((addr), (size), (e) + 1),      \
>              (e) = find_next_zero_bit((addr), (size), (b) + 1))
> 
> for the end of the range, and looking at the oops, the instruction
> that oopses is
> 
>          ldrb    r3, [r0, r2, lsr #3]
> 
> where 'r2' is the bit position, and 'r0' is the start of the bitmap.
> 
> And:
> 
> > r10: 00000000  r9 : 0000002d  r8 : ef59d000
> > r7 : c0e55514  r6 : c2215000  r5 : 00008000  r4 : 00008000
> > r3 : 845cac12  r2 : 00008001  r1 : 00008000  r0 : ef59d000
> 
> Lookie here: r1 contains the size, and r2 is past the end of the size.
> 
> So pick your poison: either the bug is in
> 
>  (a) the bitmap region iterators shouldn't even ask for past-the-end results
> 
>      I've added Dennis Zhou who did that first
> bitmap_for_each_set_region() in commit e837dfde15a4 ("bitmap:
> genericize percpu bitmap region iterators"), and Yuri Norov who
> renamed and moved it to for_each_set_bitrange() in commit ec288a2cf7ca
> ("bitmap: unify find_bit operations").
> 
> or
> 
>  (b) the ARM find_next_bit() implementation, which doesn't check
> whether the position is past the end
> 
>      I've added Russell King (ARM stuff) and Catalin Marinas who
> touched that last many many years ago in 8b592783a2e8 ("Thumb-2:
> Implement the unified arch/arm/lib functions")
> 
> I think it's arguably a little bit of both, but mostly (b).
> 
> Note how the genetic find_next_bit() (and _find_next_bit()) does
> 
>         if (unlikely(start >= nbits))
>                 return nbits;
> 
> but the arm version of it does not.

If nbits == 0, a function of this sort shouldn't dereference memory at all.
Consider for example:
        void *memchr(const void *s, int c, size_t n)
        {
                const unsigned char *p = s;
                while (n-- != 0) {
                        if ((unsigned char)c == *p++) {
                                return (void *)(p - 1);
                        }
                }
                return NULL;
         }

In case of find_next_bit(), we shouldn't also dereference memory if
start is out of bonds. That's why there's start >= nbits check at the
very beginning. (We can't pack everything in a nice-looking loop, like
memchr does, because we need to mask 1st word at the beginning.)
 
> I think the fix might be something like this:
> 
>   diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
>   index b5e8b9ae4c7d..b36ca301892e 100644
>   --- a/arch/arm/lib/findbit.S
>   +++ b/arch/arm/lib/findbit.S
>   @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
>    ENTRY(_find_next_bit_le)
>                 teq     r1, #0
>                 beq     3b
>   +             cmp     r2, r1
>   +             bhs     3b
>                 ands    ip, r2, #7
>                 beq     1b                      @ If new byte, goto old routine
>     ARM(                ldrb    r3, [r0, r2, lsr #3]    )

Looking at the ARM implementation... For sure, it's harder to maintain
because it's asm. It hasn't been revisited for long, and I'm not even
sure it's faster than generic code, because it reads memory per-byte
(ldrb), while bitmaps are optimized for per-word operations (ldr). 

It doesn't implement new functions from the API like find_next_and_bit(),
so ARM takes generic code for them, and nobody complains.

I'm looking at this code for quite a long. Now it starts causing troubles.
Maybe it's time to switch ARM to generic bitmap API entirely?

Thanks,
Yury

> but my ARM asm is so broken that the above is just really random noise
> that may or may not build - much less work.
> 
> I'll leave it to Russell &co to have a tested and working patch.
> 
> Hmm?
> 
>                     Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Build regressions/improvements in v5.19-rc8
  2022-07-24 20:42 Linux 5.19-rc8 Linus Torvalds
  2022-07-25 16:11 ` Guenter Roeck
@ 2022-07-25 20:34 ` Geert Uytterhoeven
  2022-07-25 20:39   ` Geert Uytterhoeven
  1 sibling, 1 reply; 27+ messages in thread
From: Geert Uytterhoeven @ 2022-07-25 20:34 UTC (permalink / raw)
  To: linux-kernel

Below is the list of build error/warning regressions/improvements in
v5.19-rc8[1] compared to v5.18[2].

Summarized:
  - build errors: +3/-4
  - build warnings: +11/-11

JFYI, when comparing v5.19-rc8[1] to v5.19-rc7[3], the summaries are:
  - build errors: +1/-5
  - build warnings: +0/-0

Note that there may be false regressions, as some logs are incomplete.
Still, they're build errors/warnings.

Happy fixing! ;-)

Thanks to the linux-next team for providing the build service.

[1] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/e0dccc3b76fb35bb257b4118367a883073d7390e/ (all 135 configs)
[2] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/4b0986a3613c92f4ec1bdc7f60ec66fea135991f/ (131 out of 135 configs)
[3] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/ff6992735ade75aae3e35d16b17da1008d753d28/ (all 135 configs)


*** ERRORS ***

3 error regressions:
  + /kisskb/src/include/ufs/ufshci.h: error: initializer element is not constant:  => 245:36
  + error: relocation truncated to fit: R_SPARC_WDISP22 against `.init.text':  => (.head.text+0x5100), (.head.text+0x5040)
  + error: relocation truncated to fit: R_SPARC_WDISP22 against symbol `leon_smp_cpu_startup' defined in .text section in arch/sparc/kernel/trampoline_32.o:  => (.init.text+0xa4)

4 error improvements:
  - /kisskb/src/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c: error: case label does not reduce to an integer constant: 4917:4 => 
  - /kisskb/src/drivers/scsi/aacraid/commsup.c: error: case label does not reduce to an integer constant: 1983:2 => 
  - error: arch/sparc/kernel/head_32.o: relocation truncated to fit: R_SPARC_WDISP22 against `.init.text': (.head.text+0x5100), (.head.text+0x5040) => 
  - error: arch/sparc/kernel/head_32.o: relocation truncated to fit: R_SPARC_WDISP22 against symbol `leon_smp_cpu_startup' defined in .text section in arch/sparc/kernel/trampoline_32.o: (.init.text+0xa4) => 


*** WARNINGS ***

11 warning regressions:
  + .config: warning: override: ARCH_RV32I changes choice state:  => 3864
  + arch/m68k/configs/multi_defconfig: warning: symbol value 'm' invalid for ZPOOL:  => 61
  + arch/m68k/configs/sun3_defconfig: warning: symbol value 'm' invalid for ZPOOL:  => 37
  + modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47b0): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000:  => N/A
  + modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47c8): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000:  => N/A
  + modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47e0): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000:  => N/A
  + modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47f8): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000:  => N/A
  + modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4810): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000:  => N/A
  + modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4828): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000:  => N/A
  + modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4840): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000:  => N/A
  + modpost: WARNING: modpost: vmlinux.o(.text.unlikely+0x52bc): Section mismatch in reference from the function __trace_event_discard_commit() to the variable .init.data:initcall_level_names:  => N/A

11 warning improvements:
  - .config: warning: override: reassigning to symbol GCC_PLUGIN_RANDSTRUCT: 12253, 12475 => 
  - /kisskb/src/drivers/scsi/mpt3sas/mpt3sas_base.c: warning: array subscript 'Mpi2SasIOUnitPage1_t {aka struct _MPI2_CONFIG_PAGE_SASIOUNIT_1}[0]' is partly outside array bounds of 'unsigned char[20]' [-Warray-bounds]: 5400:40, 5396:40, 5403:43 => 
  - /kisskb/src/drivers/tty/serial/sh-sci.c: warning: unused variable 'sport' [-Wunused-variable]: 2651:26 => 
  - modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4790): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A => 
  - modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47a8): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A => 
  - modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47c0): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A => 
  - modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47d8): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A => 
  - modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x47f0): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A => 
  - modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4808): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A => 
  - modpost: WARNING: modpost: drivers/net/ethernet/qlogic/qede/qede.o(.data+0x4820): Section mismatch in reference from the variable qede_forced_speed_maps to the variable .init.rodata:qede_forced_speed_100000: N/A => 
  - modpost: WARNING: modpost: vmlinux.o(.text.unlikely+0x45d4): Section mismatch in reference from the function __trace_event_discard_commit() to the variable .init.data:initcall_level_names: N/A => 

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-25 18:49     ` Linus Torvalds
@ 2022-07-25 20:35       ` Yury Norov
  2022-07-25 20:40         ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Yury Norov @ 2022-07-25 20:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guenter Roeck, Dennis Zhou, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List

On Mon, Jul 25, 2022 at 11:49:09AM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 10:55 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I think the fix might be something like this:
> 
> Hmm. Maybe the fix is to just not have the arm architecture-specific
> version at all.

I agree (see my other email in the thread). If no objections from ARM
people, I can drop it.

> The generic code handles the "small constant size bitmap that fits in
> a word" case better than the ARM special case code does.
> 
> And the generic code handles the "scan large bitmap" case better than
> the ARM code does too, in that it does things a word at a time, while
> the ARM special case code does things one byte at a time.
> 
> The ARM code does have a few things going for it:
> 
>  (a) it's simple
> 
>  (b) it has separate routines for the little-endian case
> 
> Now, (a) is probably not too strong an argument, because it's arguably
> *too* simple, and buggy as a result. And having looked a bit more,
> it's not just _find_next_bit_le() that has this bug, it's all the
> "next" versions (zero-bit and big-endian).
> 
> But (b) is actively better than what the generic "find bit" code has.
> The generic code is kind of disgusting in this area, with code like
> 
>         if (le)
>                 tmp = swab(tmp);

The patch that adds this is: b78c57135d470 ("lib/find_bit.c: join
_find_next_bit{_le}"), so I did that on purpose.

> in lib/find_bit.c and this is nasty for two reasons:
> 
>  (1) on little-endian, the "le" flag is mis-named: it's always zero,
> and it never should swab, but the code was taken from some big-endian
> generic case

Yes, the "le" is a bad name, and I think it should be changed. Are you OK
with "need_swab"?
 
>  (2) even on big-endian, that "le" flag is basically a compile-time
> constant, but the header file shenanigans have hidden that fact, so it
> ends up being a "pass a constant to a function that then has to test
> it dynamically" situation

I think it's not measurable, at least find_bit_benchmark didn't get worse.
Even if find_next_bit() is invoked many times in a loop, we can expect
that branch predictor would optimize the difference out.
 
> So the generic code is in many ways better than the ARM special case
> code, but it has a couple of unfortunate warts too. At least those
> unfortunate warts aren't outright *bugs*, they are just ugly.

Here we have 2 ugly options - having pairs of almost identical
functions, or passing dummy variables. I decided that copy-pasting is
worse than abusing branch predictor.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Build regressions/improvements in v5.19-rc8
  2022-07-25 20:34 ` Build regressions/improvements in v5.19-rc8 Geert Uytterhoeven
@ 2022-07-25 20:39   ` Geert Uytterhoeven
  0 siblings, 0 replies; 27+ messages in thread
From: Geert Uytterhoeven @ 2022-07-25 20:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-scsi, linuxppc-dev

On Mon, 25 Jul 2022, Geert Uytterhoeven wrote:
> JFYI, when comparing v5.19-rc8[1] to v5.19-rc7[3], the summaries are:
>  - build errors: +1/-5

   + /kisskb/src/include/ufs/ufshci.h: error: initializer element is not constant:  => 245:36

v5.19-rc8/powerpc-gcc5/ppc64_book3e_allmodconfig
v5.19-rc8/powerpc-gcc5/powerpc-allmodconfig
v5.19-rc8/powerpc-gcc5/ppc64le_allmodconfig

Seen before

> [1] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/e0dccc3b76fb35bb257b4118367a883073d7390e/ (all 135 configs)
> [3] http://kisskb.ellerman.id.au/kisskb/branch/linus/head/ff6992735ade75aae3e35d16b17da1008d753d28/ (all 135 configs)

Gr{oetje,eeting}s,

 						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
 							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-25 20:35       ` Yury Norov
@ 2022-07-25 20:40         ` Linus Torvalds
  2022-07-26 15:51           ` Yury Norov
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-07-25 20:40 UTC (permalink / raw)
  To: Yury Norov
  Cc: Guenter Roeck, Dennis Zhou, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List

On Mon, Jul 25, 2022 at 1:35 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> Here we have 2 ugly options - having pairs of almost identical
> functions, or passing dummy variables. I decided that copy-pasting is
> worse than abusing branch predictor.

The thing is, we have solutions for this - just use a single inline
function, and make it *see* the constant.

That way the compiler can DTRT, either generating two copies (with the
constant elided) or depending on the branch predictor.

That's particularly true in a case like this, when there is only *one*
case for the normal situation (ie little-endian). Because let's face
it, big-endian is completely dead and legacy model.

              Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-25 17:55   ` Linus Torvalds
  2022-07-25 18:49     ` Linus Torvalds
  2022-07-25 19:41     ` Yury Norov
@ 2022-07-26  9:12     ` Russell King (Oracle)
  2022-07-26 15:35       ` Yury Norov
  2022-07-28 18:28       ` Russell King (Oracle)
  2022-07-26 17:39     ` Dennis Zhou
  3 siblings, 2 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2022-07-26  9:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guenter Roeck, Yury Norov, Dennis Zhou, Catalin Marinas,
	Linux Kernel Mailing List

On Mon, Jul 25, 2022 at 10:55:18AM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48
> 
> Ok, I was hoping somebody more ARMy would look at this, particularly
> since there is no call trace beyond the actual fault.

First I'm aware of it. Was it reported to linux-arm-kernel? I'm guessing
the report wasn't Cc'd to me - I can't find anything in my mailbox about
it.

> I think the fix might be something like this:
> 
>   diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
>   index b5e8b9ae4c7d..b36ca301892e 100644
>   --- a/arch/arm/lib/findbit.S
>   +++ b/arch/arm/lib/findbit.S
>   @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
>    ENTRY(_find_next_bit_le)
>                 teq     r1, #0
>                 beq     3b
>   +             cmp     r2, r1
>   +             bhs     3b
>                 ands    ip, r2, #7
>                 beq     1b                      @ If new byte, goto old routine
>     ARM(                ldrb    r3, [r0, r2, lsr #3]    )
> 
> but my ARM asm is so broken that the above is just really random noise
> that may or may not build - much less work.
> 
> I'll leave it to Russell &co to have a tested and working patch.

I think it needs a bit more than that, but as you point out in later
emails, the compiler may do a better job for this.

One of the reasons for using byte loads was to avoid problems in the
early days of Linux where these took void pointers and thus could be
misaligned - and using word accesses would have resulted in much
pain. However, that was changed to unsigned long pointers back in
2017, so in theory that should no longer be a concern.

I don't remember why we used void pointers there originally - that's
something which dates back to the 1990s.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26  9:12     ` Russell King (Oracle)
@ 2022-07-26 15:35       ` Yury Norov
  2022-07-28 18:28       ` Russell King (Oracle)
  1 sibling, 0 replies; 27+ messages in thread
From: Yury Norov @ 2022-07-26 15:35 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Linus Torvalds, Guenter Roeck, Dennis Zhou, Catalin Marinas,
	Linux Kernel Mailing List

On Tue, Jul 26, 2022 at 10:12:21AM +0100, Russell King (Oracle) wrote:
> On Mon, Jul 25, 2022 at 10:55:18AM -0700, Linus Torvalds wrote:
> > On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <linux@roeck-us.net> wrote:
> > >
> > > BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48
> > 
> > Ok, I was hoping somebody more ARMy would look at this, particularly
> > since there is no call trace beyond the actual fault.
> 
> First I'm aware of it. Was it reported to linux-arm-kernel? I'm guessing
> the report wasn't Cc'd to me - I can't find anything in my mailbox about
> it.
> 
> > I think the fix might be something like this:
> > 
> >   diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
> >   index b5e8b9ae4c7d..b36ca301892e 100644
> >   --- a/arch/arm/lib/findbit.S
> >   +++ b/arch/arm/lib/findbit.S
> >   @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
> >    ENTRY(_find_next_bit_le)
> >                 teq     r1, #0
> >                 beq     3b
> >   +             cmp     r2, r1
> >   +             bhs     3b
> >                 ands    ip, r2, #7
> >                 beq     1b                      @ If new byte, goto old routine
> >     ARM(                ldrb    r3, [r0, r2, lsr #3]    )
> > 
> > but my ARM asm is so broken that the above is just really random noise
> > that may or may not build - much less work.
> > 
> > I'll leave it to Russell &co to have a tested and working patch.
> 
> I think it needs a bit more than that, but as you point out in later
> emails, the compiler may do a better job for this.
> 
> One of the reasons for using byte loads was to avoid problems in the
> early days of Linux where these took void pointers and thus could be
> misaligned - and using word accesses would have resulted in much
> pain. However, that was changed to unsigned long pointers back in
> 2017, so in theory that should no longer be a concern.
> 
> I don't remember why we used void pointers there originally - that's
> something which dates back to the 1990s.

OK, then I'm sending the patch that switches it to generic code.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-25 20:40         ` Linus Torvalds
@ 2022-07-26 15:51           ` Yury Norov
  0 siblings, 0 replies; 27+ messages in thread
From: Yury Norov @ 2022-07-26 15:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guenter Roeck, Dennis Zhou, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List

On Mon, Jul 25, 2022 at 01:40:58PM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 1:35 PM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > Here we have 2 ugly options - having pairs of almost identical
> > functions, or passing dummy variables. I decided that copy-pasting is
> > worse than abusing branch predictor.
> 
> The thing is, we have solutions for this - just use a single inline
> function, and make it *see* the constant.
> 
> That way the compiler can DTRT, either generating two copies (with the
> constant elided) or depending on the branch predictor.
> 
> That's particularly true in a case like this, when there is only *one*
> case for the normal situation (ie little-endian). Because let's face
> it, big-endian is completely dead and legacy model.

OK, I'll do it for 5.20. Thanks for the hint.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-25 17:55   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2022-07-26  9:12     ` Russell King (Oracle)
@ 2022-07-26 17:39     ` Dennis Zhou
  2022-07-26 17:51       ` Linus Torvalds
  3 siblings, 1 reply; 27+ messages in thread
From: Dennis Zhou @ 2022-07-26 17:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guenter Roeck, Yury Norov, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List

Hello,

On Mon, Jul 25, 2022 at 10:55:18AM -0700, Linus Torvalds wrote:
> On Mon, Jul 25, 2022 at 9:11 AM Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > BUG: KFENCE: out-of-bounds read in _find_next_bit_le+0x10/0x48
> 
> Ok, I was hoping somebody more ARMy would look at this, particularly
> since there is no call trace beyond the actual fault.
> 
> So it shows that it happens in _find_next_bit_le(), but not who called it.
> 
> It does show "who allocated the page", and I can see the message that
> is printed afterwards, so it comes from that
> 
>    static void __init test_bitmap_printlist(void)
> 
> function, so I guess we know the call chain:
> 
>   test_bitmap_printlist ->
>     bitmap_print_to_pagebuf ->
>       scnprintf "%*pbl\n" ->
>         pointer ->
>           bitmap_list_string ->
>             for_each_set_bitrange
> 
> and I think I see what's wrong in there. That thing does
> 
>              (b) = find_next_bit((addr), (size), (e) + 1),      \
>              (e) = find_next_zero_bit((addr), (size), (b) + 1))
> 
> for the end of the range, and looking at the oops, the instruction
> that oopses is
> 
>          ldrb    r3, [r0, r2, lsr #3]
> 
> where 'r2' is the bit position, and 'r0' is the start of the bitmap.
> 
> And:
> 
> > r10: 00000000  r9 : 0000002d  r8 : ef59d000
> > r7 : c0e55514  r6 : c2215000  r5 : 00008000  r4 : 00008000
> > r3 : 845cac12  r2 : 00008001  r1 : 00008000  r0 : ef59d000
> 
> Lookie here: r1 contains the size, and r2 is past the end of the size.
> 
> So pick your poison: either the bug is in
> 
>  (a) the bitmap region iterators shouldn't even ask for past-the-end results
> 
>      I've added Dennis Zhou who did that first
> bitmap_for_each_set_region() in commit e837dfde15a4 ("bitmap:
> genericize percpu bitmap region iterators"), and Yuri Norov who
> renamed and moved it to for_each_set_bitrange() in commit ec288a2cf7ca
> ("bitmap: unify find_bit operations").
> 

It seems like this is mostly taken care of by migrating arm to use the
generic implementations, but I just want to cover our basis here.

Are we okay with adding the contract find_*_bit() operations must handle
asking for past size properly? FWIW, we'd have to modify most of the
iterators in find.h.

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 17:39     ` Dennis Zhou
@ 2022-07-26 17:51       ` Linus Torvalds
  2022-07-26 18:18         ` Yury Norov
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-07-26 17:51 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Guenter Roeck, Yury Norov, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List

On Tue, Jul 26, 2022 at 10:39 AM Dennis Zhou <dennis@kernel.org> wrote:
>
> Are we okay with adding the contract find_*_bit() operations must handle
> asking for past size properly? FWIW, we'd have to modify most of the
> iterators in find.h.

So I think we're ok with it, if only it makes the macros simpler.

I also think we should probably look at the m68k case, because while
that one seems to not have the bug that the arm case had, if we remove
the arm case the m68k code is now the only non-generic case remaining.

And it just makes me go "maybe we should get rid of the whole
'override the generic code' thing entirely?"

I don't think that inlining the first word (like the m68k code does)
is worth it, but it *is* possible that the architecture-specific
functions generate better code for some common cases, so I think this
is a "needs looking at the generated code" and not just a blind
removal.

              Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 17:51       ` Linus Torvalds
@ 2022-07-26 18:18         ` Yury Norov
  2022-07-26 18:36           ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Yury Norov @ 2022-07-26 18:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dennis Zhou, Guenter Roeck, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List, Geert Uytterhoeven,
	linux-m68k

+ Geert Uytterhoeven <geert@linux-m68k.org>
+ linux-m68k@lists.linux-m68k.org

On Tue, Jul 26, 2022 at 10:51:01AM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 10:39 AM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > Are we okay with adding the contract find_*_bit() operations must handle
> > asking for past size properly? FWIW, we'd have to modify most of the
> > iterators in find.h.
> 
> So I think we're ok with it, if only it makes the macros simpler.
> 
> I also think we should probably look at the m68k case, because while
> that one seems to not have the bug that the arm case had, if we remove
> the arm case the m68k code is now the only non-generic case remaining.
> 
> And it just makes me go "maybe we should get rid of the whole
> 'override the generic code' thing entirely?"
> 
> I don't think that inlining the first word (like the m68k code does)
> is worth it, but it *is* possible that the architecture-specific
> functions generate better code for some common cases,

We have find_bit_benchmark to check how it works in practice. Would
be great if someone with access to the hardware can share numbers.

> so I think this
> is a "needs looking at the generated code" and not just a blind
> removal.
> 
>               Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 18:18         ` Yury Norov
@ 2022-07-26 18:36           ` Linus Torvalds
  2022-07-26 19:44             ` Russell King (Oracle)
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-07-26 18:36 UTC (permalink / raw)
  To: Yury Norov
  Cc: Dennis Zhou, Guenter Roeck, Russell King - ARM Linux,
	Catalin Marinas, Linux Kernel Mailing List, Geert Uytterhoeven,
	linux-m68k

On Tue, Jul 26, 2022 at 11:18 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> We have find_bit_benchmark to check how it works in practice. Would
> be great if someone with access to the hardware can share numbers.

Honestly, I doubt benchmarking find_bit in a loop is all that sensible.

These are helper functions that are probably seldom super-hot in
cache, and as with so many of these things, I suspect the cool-I$
numbers are the ones that matter most in real life.

When some filesystem ends up searching for a free block or similar, it
will probably have done other things before that that means that the
L1 I$ has been long flushed, and branch history is quite possibly
entirely gone too.

The same is quite possibly true for the bitmap itself in D$ too.

That said, looking at the x86 code generation (not only because I have
the build environment, but where I can actually read make sense of the
asm), the only thing that looks bad is the conditional bswap.

So the code gen looks fairly good,

                 Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 18:36           ` Linus Torvalds
@ 2022-07-26 19:44             ` Russell King (Oracle)
  2022-07-26 20:20               ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Russell King (Oracle) @ 2022-07-26 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yury Norov, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 11:36:21AM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 11:18 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > We have find_bit_benchmark to check how it works in practice. Would
> > be great if someone with access to the hardware can share numbers.
> 
> Honestly, I doubt benchmarking find_bit in a loop is all that sensible.

Yes, that's what I was thinking - I've never seen it crop up in any of
the perf traces I've seen.

Nevertheless, here's some numbers from a single run of the
find_bit_benchmark module, kernel built with:
arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110

Current native implementation:

[   46.184565]
               Start testing find_bit() with random-filled bitmap
[   46.195127] find_next_bit:                 2440833 ns, 163112 iterations
[   46.204226] find_next_zero_bit:            2372128 ns, 164569 iterations
[   46.213152] find_last_bit:                 2199779 ns, 163112 iterations
[   46.299398] find_first_bit:               79526013 ns,  16234 iterations
[   46.684026] find_first_and_bit:          377912990 ns,  32617 iterations
[   46.692020] find_next_and_bit:             1269071 ns,  73562 iterations
[   46.698745]
               Start testing find_bit() with sparse bitmap
[   46.705711] find_next_bit:                  118652 ns,    656 iterations
[   46.716621] find_next_zero_bit:            4183472 ns, 327025 iterations
[   46.723395] find_last_bit:                   50448 ns,    656 iterations
[   46.762308] find_first_bit:               32190802 ns,    656 iterations
[   46.769093] find_first_and_bit:              52129 ns,      1 iterations
[   46.775882] find_next_and_bit:               62522 ns,      1 iterations

Generic implementation:

[   25.149238]
               Start testing find_bit() with random-filled bitmap
[   25.160002] find_next_bit:                 2640943 ns, 163537 iterations
[   25.169567] find_next_zero_bit:            2838485 ns, 164144 iterations
[   25.178595] find_last_bit:                 2302372 ns, 163538 iterations
[   25.204016] find_first_bit:               18697630 ns,  16373 iterations
[   25.602571] find_first_and_bit:          391841480 ns,  32555 iterations
[   25.610563] find_next_and_bit:             1260306 ns,  73587 iterations
[   25.617295]
               Start testing find_bit() with sparse bitmap
[   25.624222] find_next_bit:                   70289 ns,    656 iterations
[   25.636478] find_next_zero_bit:            5527050 ns, 327025 iterations
[   25.643253] find_last_bit:                   52147 ns,    656 iterations
[   25.657304] find_first_bit:                7328573 ns,    656 iterations
[   25.664087] find_first_and_bit:              48518 ns,      1 iterations
[   25.670871] find_next_and_bit:               59750 ns,      1 iterations

Overall, I would say it's pretty similar (some generic perform
marginally better, some native perform marginally better) with the
exception of find_first_bit() being much better with the generic
implementation, but find_next_zero_bit() being noticably worse.

So, pretty much nothing of any relevance between them, which may
come as a surprise given the byte vs word access differences between
the two implementations.

I suspect the reason behind that may be because the native
implementation code is smaller than the generic implementation,
outweighing the effects of the by-byte rather than by-word. I would
also suspect that, because of the smaller implementation, the native
version performs better in a I$-cool situation than the generic. Lastly,
I would suspect if we fixed the bug in the native version, and converted
it to use word loads, it would probably be better than the generic
version. I haven't anything to base that on other than gut feeling at
the moment, but I can make the changes to the native implementation and
see what effect that has, possibly tomorrow.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 19:44             ` Russell King (Oracle)
@ 2022-07-26 20:20               ` Linus Torvalds
  2022-07-27  0:15                 ` Russell King (Oracle)
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2022-07-26 20:20 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> Overall, I would say it's pretty similar (some generic perform
> marginally better, some native perform marginally better) with the
> exception of find_first_bit() being much better with the generic
> implementation, but find_next_zero_bit() being noticably worse.

The generic _find_first_bit() code is actually sane and simple. It
loops over words until it finds a non-zero one, and then does trivial
calculations on that last word.

That explains why the generic code does so much better than your byte-wise asm.

In contrast, the generic _find_next_bit() I find almost offensively
silly - which in turn explains why your byte-wide asm does better.

I think the generic _find_next_bit() should actually do what the m68k
find_next_bit code does: handle the first special word itself, and
then just call find_first_bit() on the rest of it.

And it should *not* try to handle the dynamic "bswap and/or bit sense
invert" thing at all. That should be just four different (trivial)
cases for the first word.

             Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26 20:20               ` Linus Torvalds
@ 2022-07-27  0:15                 ` Russell King (Oracle)
  2022-07-27  1:33                   ` Yury Norov
  0 siblings, 1 reply; 27+ messages in thread
From: Russell King (Oracle) @ 2022-07-27  0:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yury Norov, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > Overall, I would say it's pretty similar (some generic perform
> > marginally better, some native perform marginally better) with the
> > exception of find_first_bit() being much better with the generic
> > implementation, but find_next_zero_bit() being noticably worse.
> 
> The generic _find_first_bit() code is actually sane and simple. It
> loops over words until it finds a non-zero one, and then does trivial
> calculations on that last word.
> 
> That explains why the generic code does so much better than your byte-wise asm.
> 
> In contrast, the generic _find_next_bit() I find almost offensively
> silly - which in turn explains why your byte-wide asm does better.
> 
> I think the generic _find_next_bit() should actually do what the m68k
> find_next_bit code does: handle the first special word itself, and
> then just call find_first_bit() on the rest of it.
> 
> And it should *not* try to handle the dynamic "bswap and/or bit sense
> invert" thing at all. That should be just four different (trivial)
> cases for the first word.

Here's the results for the native version converted to use word loads:

[   37.319937]
               Start testing find_bit() with random-filled bitmap
[   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
[   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
[   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
[   37.372564] find_first_bit:               17722203 ns,  16370 iterations
[   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
[   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
[   37.752143]
               Start testing find_bit() with sparse bitmap
[   37.759032] find_next_bit:                   41256 ns,    655 iterations
[   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
[   37.776675] find_last_bit:                   48742 ns,    655 iterations
[   37.790961] find_first_bit:                7562371 ns,    655 iterations
[   37.797743] find_first_and_bit:              47366 ns,      1 iterations
[   37.804527] find_next_and_bit:               59924 ns,      1 iterations

which is generally faster than the generic version, with the exception
of the sparse find_first_bit (generic was:
[   25.657304] find_first_bit:                7328573 ns,    656 iterations)

find_next_{,zero_}bit() in the sparse case are quite a bit faster than
the generic code.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-27  0:15                 ` Russell King (Oracle)
@ 2022-07-27  1:33                   ` Yury Norov
  2022-07-27  7:43                     ` Russell King (Oracle)
  2022-07-27  7:46                     ` David Laight
  0 siblings, 2 replies; 27+ messages in thread
From: Yury Norov @ 2022-07-27  1:33 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > >
> > > Overall, I would say it's pretty similar (some generic perform
> > > marginally better, some native perform marginally better) with the
> > > exception of find_first_bit() being much better with the generic
> > > implementation, but find_next_zero_bit() being noticably worse.
> >
> > The generic _find_first_bit() code is actually sane and simple. It
> > loops over words until it finds a non-zero one, and then does trivial
> > calculations on that last word.
> >
> > That explains why the generic code does so much better than your byte-wise asm.
> >
> > In contrast, the generic _find_next_bit() I find almost offensively
> > silly - which in turn explains why your byte-wide asm does better.
> >
> > I think the generic _find_next_bit() should actually do what the m68k
> > find_next_bit code does: handle the first special word itself, and
> > then just call find_first_bit() on the rest of it.
> >
> > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > invert" thing at all. That should be just four different (trivial)
> > cases for the first word.
>
> Here's the results for the native version converted to use word loads:
>
> [   37.319937]
>                Start testing find_bit() with random-filled bitmap
> [   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
> [   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
> [   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
> [   37.372564] find_first_bit:               17722203 ns,  16370 iterations
> [   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
> [   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
> [   37.752143]
>                Start testing find_bit() with sparse bitmap
> [   37.759032] find_next_bit:                   41256 ns,    655 iterations
> [   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
> [   37.776675] find_last_bit:                   48742 ns,    655 iterations
> [   37.790961] find_first_bit:                7562371 ns,    655 iterations
> [   37.797743] find_first_and_bit:              47366 ns,      1 iterations
> [   37.804527] find_next_and_bit:               59924 ns,      1 iterations
>
> which is generally faster than the generic version, with the exception
> of the sparse find_first_bit (generic was:
> [   25.657304] find_first_bit:                7328573 ns,    656 iterations)
>
> find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> the generic code.

Look at find_{first,next}_and_bit results. Those two have no arch version
and in both cases use generic code. In theory they should be equally fast
before and after, but your testing says that generic case is slower even
for them, and the difference is comparable with real arch functions numbers.
It makes me feel like:
 - there's something unrelated, like governor/throttling that affect results;
 - the numbers are identical, taking the dispersion into account.

If the difference really concerns you, I'd suggest running the test
several times
to measure confidence intervals.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-27  1:33                   ` Yury Norov
@ 2022-07-27  7:43                     ` Russell King (Oracle)
  2022-07-30 21:38                       ` Yury Norov
  2022-07-27  7:46                     ` David Laight
  1 sibling, 1 reply; 27+ messages in thread
From: Russell King (Oracle) @ 2022-07-27  7:43 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > Overall, I would say it's pretty similar (some generic perform
> > > > marginally better, some native perform marginally better) with the
> > > > exception of find_first_bit() being much better with the generic
> > > > implementation, but find_next_zero_bit() being noticably worse.
> > >
> > > The generic _find_first_bit() code is actually sane and simple. It
> > > loops over words until it finds a non-zero one, and then does trivial
> > > calculations on that last word.
> > >
> > > That explains why the generic code does so much better than your byte-wise asm.
> > >
> > > In contrast, the generic _find_next_bit() I find almost offensively
> > > silly - which in turn explains why your byte-wide asm does better.
> > >
> > > I think the generic _find_next_bit() should actually do what the m68k
> > > find_next_bit code does: handle the first special word itself, and
> > > then just call find_first_bit() on the rest of it.
> > >
> > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > invert" thing at all. That should be just four different (trivial)
> > > cases for the first word.
> >
> > Here's the results for the native version converted to use word loads:
> >
> > [   37.319937]
> >                Start testing find_bit() with random-filled bitmap
> > [   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
> > [   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
> > [   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
> > [   37.372564] find_first_bit:               17722203 ns,  16370 iterations
> > [   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
> > [   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
> > [   37.752143]
> >                Start testing find_bit() with sparse bitmap
> > [   37.759032] find_next_bit:                   41256 ns,    655 iterations
> > [   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
> > [   37.776675] find_last_bit:                   48742 ns,    655 iterations
> > [   37.790961] find_first_bit:                7562371 ns,    655 iterations
> > [   37.797743] find_first_and_bit:              47366 ns,      1 iterations
> > [   37.804527] find_next_and_bit:               59924 ns,      1 iterations
> >
> > which is generally faster than the generic version, with the exception
> > of the sparse find_first_bit (generic was:
> > [   25.657304] find_first_bit:                7328573 ns,    656 iterations)
> >
> > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > the generic code.
> 
> Look at find_{first,next}_and_bit results. Those two have no arch version
> and in both cases use generic code. In theory they should be equally fast
> before and after, but your testing says that generic case is slower even
> for them, and the difference is comparable with real arch functions numbers.
> It makes me feel like:
>  - there's something unrelated, like governor/throttling that affect results;
>  - the numbers are identical, taking the dispersion into account.
> 
> If the difference really concerns you, I'd suggest running the test
> several times
> to measure confidence intervals.

Given that the benchmark is run against random bitmaps and with
interrupts enabled, there is going to be noise in the results.

Here's the second run:

[26234.429389]
               Start testing find_bit() with random-filled bitmap
[26234.439722] find_next_bit:                 2206687 ns, 164277 iterations
[26234.448664] find_next_zero_bit:            2188368 ns, 163404 iterations
[26234.457612] find_last_bit:                 2223742 ns, 164278 iterations
[26234.482056] find_first_bit:               17720726 ns,  16384 iterations
[26234.859374] find_first_and_bit:          370602019 ns,  32877 iterations
[26234.867379] find_next_and_bit:             1280651 ns,  74091 iterations
[26234.874107]
               Start testing find_bit() with sparse bitmap
[26234.881014] find_next_bit:                   46142 ns,    656 iterations
[26234.891900] find_next_zero_bit:            4158987 ns, 327025 iterations
[26234.898672] find_last_bit:                   49727 ns,    656 iterations
[26234.912504] find_first_bit:                7107862 ns,    656 iterations
[26234.919290] find_first_and_bit:              52092 ns,      1 iterations
[26234.926076] find_next_and_bit:               60856 ns,      1 iterations

And a third run:

[26459.679524]
               Start testing find_bit() with random-filled bitmap
[26459.689871] find_next_bit:                 2199418 ns, 163311 iterations
[26459.698798] find_next_zero_bit:            2181289 ns, 164370 iterations
[26459.707738] find_last_bit:                 2213638 ns, 163311 iterations
[26459.732224] find_first_bit:               17764152 ns,  16429 iterations
[26460.133823] find_first_and_bit:          394886375 ns,  32672 iterations
[26460.141818] find_next_and_bit:             1269693 ns,  73485 iterations
[26460.148545]
               Start testing find_bit() with sparse bitmap
[26460.155433] find_next_bit:                   40753 ns,    653 iterations
[26460.166307] find_next_zero_bit:            4148211 ns, 327028 iterations
[26460.173078] find_last_bit:                   50017 ns,    653 iterations
[26460.187007] find_first_bit:                7205325 ns,    653 iterations
[26460.193790] find_first_and_bit:              49358 ns,      1 iterations
[26460.200577] find_next_and_bit:               62332 ns,      1 iterations

My gut feeling is that yes, there is some variance, but not on an
order that is significant that would allow us to say "there's no
difference".

find_next_bit results for random are: 2222703, 2206687, 2199418,
which is an average of 2209603 and a variance of around 0.5%.
The difference between this and the single generic figure I have
is on the order of 20%.

I'll do the same with find_first_bit for random: 17722203, 17720726,
and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
The difference between this and the single generic figure I have is
on the order of 5%. Not so large, but still quite a big difference
compared to the variance.

find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
7291853. Variance is higher at about 4%. Difference between this and
the generic figure is 0.5%, so this one is not significantly
different.

The best result looks to be find_next_zero_bit for the sparse bitmap
case. The generic code measures 5.5ms, the native code is sitting
around 4.1ms. That's a difference of around 34%, and by just looking
at the range in the figures above we can see this is a significant
result without needing to do the calculations. Similar is true of
find_next_bit for the sparse bitmap.

So, I think the results are significant in most cases and variance
doesn't account for the differences. The only one which isn't is
find_first_bit for the sparse case.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: Linux 5.19-rc8
  2022-07-27  1:33                   ` Yury Norov
  2022-07-27  7:43                     ` Russell King (Oracle)
@ 2022-07-27  7:46                     ` David Laight
  1 sibling, 0 replies; 27+ messages in thread
From: David Laight @ 2022-07-27  7:46 UTC (permalink / raw)
  To: 'Yury Norov', Russell King (Oracle)
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, linux-m68k

From: Yury Norov
> Sent: 27 July 2022 02:34
...
> If the difference really concerns you, I'd suggest running the test
> several times to measure confidence intervals.

Or, do what I've been doing and get an accurate cycle count
for a single call and repeat about 10 times.
The first value is typically slow (loading I$), the
rest typically identical.
They can often even be matched to the expected value!
The fastest value is the one to quote!

On x86 you have to use the perf cycle counter, the tsc
is just useless.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-26  9:12     ` Russell King (Oracle)
  2022-07-26 15:35       ` Yury Norov
@ 2022-07-28 18:28       ` Russell King (Oracle)
  2022-07-29  0:11         ` Guenter Roeck
  1 sibling, 1 reply; 27+ messages in thread
From: Russell King (Oracle) @ 2022-07-28 18:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guenter Roeck, Yury Norov, Dennis Zhou, Catalin Marinas,
	Linux Kernel Mailing List

On Tue, Jul 26, 2022 at 10:12:21AM +0100, Russell King (Oracle) wrote:
> First I'm aware of it. Was it reported to linux-arm-kernel? I'm guessing
> the report wasn't Cc'd to me - I can't find anything in my mailbox about
> it.
> 
> > I think the fix might be something like this:
> > 
> >   diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
> >   index b5e8b9ae4c7d..b36ca301892e 100644
> >   --- a/arch/arm/lib/findbit.S
> >   +++ b/arch/arm/lib/findbit.S
> >   @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
> >    ENTRY(_find_next_bit_le)
> >                 teq     r1, #0
> >                 beq     3b
> >   +             cmp     r2, r1
> >   +             bhs     3b
> >                 ands    ip, r2, #7
> >                 beq     1b                      @ If new byte, goto old routine
> >     ARM(                ldrb    r3, [r0, r2, lsr #3]    )
> > 
> > but my ARM asm is so broken that the above is just really random noise
> > that may or may not build - much less work.
> > 
> > I'll leave it to Russell &co to have a tested and working patch.
> 
> I think it needs a bit more than that, but as you point out in later
> emails, the compiler may do a better job for this.

Okay, I've moved my patch that fixes this (without adding a single line
of code!) to my fixes branch, which I'll ask you to pull in the next
couple of days.

Each of the _find_next_* functions had:

	teq	r1, #0
	beq	3b

at the beginning to catch the case where size == 0. This is now:

	cmp	r2, r1
	bhs	3b

which is the C equivalent of:

	if (offset >= size)
		goto 3b;

where both are unsigned, and nicely covers the case where size == 0 as
before (since if size is 0, the condition is always true irrespective
of the value of offset.)

We can sort out the question of keeping this code or not later, but I
think as this has been spotted as an issue, it's important to get it
fixed.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-28 18:28       ` Russell King (Oracle)
@ 2022-07-29  0:11         ` Guenter Roeck
  0 siblings, 0 replies; 27+ messages in thread
From: Guenter Roeck @ 2022-07-29  0:11 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Linus Torvalds, Yury Norov, Dennis Zhou, Catalin Marinas,
	Linux Kernel Mailing List

On Thu, Jul 28, 2022 at 07:28:22PM +0100, Russell King (Oracle) wrote:
> On Tue, Jul 26, 2022 at 10:12:21AM +0100, Russell King (Oracle) wrote:
> > First I'm aware of it. Was it reported to linux-arm-kernel? I'm guessing
> > the report wasn't Cc'd to me - I can't find anything in my mailbox about
> > it.
> > 
> > > I think the fix might be something like this:
> > > 
> > >   diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
> > >   index b5e8b9ae4c7d..b36ca301892e 100644
> > >   --- a/arch/arm/lib/findbit.S
> > >   +++ b/arch/arm/lib/findbit.S
> > >   @@ -83,6 +83,8 @@ ENDPROC(_find_first_bit_le)
> > >    ENTRY(_find_next_bit_le)
> > >                 teq     r1, #0
> > >                 beq     3b
> > >   +             cmp     r2, r1
> > >   +             bhs     3b
> > >                 ands    ip, r2, #7
> > >                 beq     1b                      @ If new byte, goto old routine
> > >     ARM(                ldrb    r3, [r0, r2, lsr #3]    )
> > > 
> > > but my ARM asm is so broken that the above is just really random noise
> > > that may or may not build - much less work.
> > > 
> > > I'll leave it to Russell &co to have a tested and working patch.
> > 
> > I think it needs a bit more than that, but as you point out in later
> > emails, the compiler may do a better job for this.
> 
> Okay, I've moved my patch that fixes this (without adding a single line
> of code!) to my fixes branch, which I'll ask you to pull in the next
> couple of days.
> 
I downloaded your patch and ran it through my testbed.
With it applied, the problem is no longer seen.
Feel free to add

Tested-by: Guenter Roeck <linux@roeck-us.net>

Thanks,
Guenter

> Each of the _find_next_* functions had:
> 
> 	teq	r1, #0
> 	beq	3b
> 
> at the beginning to catch the case where size == 0. This is now:
> 
> 	cmp	r2, r1
> 	bhs	3b
> 
> which is the C equivalent of:
> 
> 	if (offset >= size)
> 		goto 3b;
> 
> where both are unsigned, and nicely covers the case where size == 0 as
> before (since if size is 0, the condition is always true irrespective
> of the value of offset.)
> 
> We can sort out the question of keeping this code or not later, but I
> think as this has been spotted as an issue, it's important to get it
> fixed.
> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-27  7:43                     ` Russell King (Oracle)
@ 2022-07-30 21:38                       ` Yury Norov
  2022-08-01 15:48                         ` Russell King (Oracle)
  0 siblings, 1 reply; 27+ messages in thread
From: Yury Norov @ 2022-07-30 21:38 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, Alexey Klimov,
	linux-m68k

On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
> On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> > On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > > <linux@armlinux.org.uk> wrote:
> > > > >
> > > > > Overall, I would say it's pretty similar (some generic perform
> > > > > marginally better, some native perform marginally better) with the
> > > > > exception of find_first_bit() being much better with the generic
> > > > > implementation, but find_next_zero_bit() being noticably worse.
> > > >
> > > > The generic _find_first_bit() code is actually sane and simple. It
> > > > loops over words until it finds a non-zero one, and then does trivial
> > > > calculations on that last word.
> > > >
> > > > That explains why the generic code does so much better than your byte-wise asm.
> > > >
> > > > In contrast, the generic _find_next_bit() I find almost offensively
> > > > silly - which in turn explains why your byte-wide asm does better.
> > > >
> > > > I think the generic _find_next_bit() should actually do what the m68k
> > > > find_next_bit code does: handle the first special word itself, and
> > > > then just call find_first_bit() on the rest of it.
> > > >
> > > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > > invert" thing at all. That should be just four different (trivial)
> > > > cases for the first word.
> > >
> > > Here's the results for the native version converted to use word loads:
> > >
> > > [   37.319937]
> > >                Start testing find_bit() with random-filled bitmap
> > > [   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
> > > [   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
> > > [   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
> > > [   37.372564] find_first_bit:               17722203 ns,  16370 iterations
> > > [   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
> > > [   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
> > > [   37.752143]
> > >                Start testing find_bit() with sparse bitmap
> > > [   37.759032] find_next_bit:                   41256 ns,    655 iterations
> > > [   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
> > > [   37.776675] find_last_bit:                   48742 ns,    655 iterations
> > > [   37.790961] find_first_bit:                7562371 ns,    655 iterations
> > > [   37.797743] find_first_and_bit:              47366 ns,      1 iterations
> > > [   37.804527] find_next_and_bit:               59924 ns,      1 iterations
> > >
> > > which is generally faster than the generic version, with the exception
> > > of the sparse find_first_bit (generic was:
> > > [   25.657304] find_first_bit:                7328573 ns,    656 iterations)
> > >
> > > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > > the generic code.
> > 
> > Look at find_{first,next}_and_bit results. Those two have no arch version
> > and in both cases use generic code. In theory they should be equally fast
> > before and after, but your testing says that generic case is slower even
> > for them, and the difference is comparable with real arch functions numbers.
> > It makes me feel like:
> >  - there's something unrelated, like governor/throttling that affect results;
> >  - the numbers are identical, taking the dispersion into account.
> > 
> > If the difference really concerns you, I'd suggest running the test
> > several times
> > to measure confidence intervals.
> 
> Given that the benchmark is run against random bitmaps and with
> interrupts enabled, there is going to be noise in the results.
> 
> Here's the second run:
> 
> [26234.429389]
>                Start testing find_bit() with random-filled bitmap
> [26234.439722] find_next_bit:                 2206687 ns, 164277 iterations
> [26234.448664] find_next_zero_bit:            2188368 ns, 163404 iterations
> [26234.457612] find_last_bit:                 2223742 ns, 164278 iterations
> [26234.482056] find_first_bit:               17720726 ns,  16384 iterations
> [26234.859374] find_first_and_bit:          370602019 ns,  32877 iterations
> [26234.867379] find_next_and_bit:             1280651 ns,  74091 iterations
> [26234.874107]
>                Start testing find_bit() with sparse bitmap
> [26234.881014] find_next_bit:                   46142 ns,    656 iterations
> [26234.891900] find_next_zero_bit:            4158987 ns, 327025 iterations
> [26234.898672] find_last_bit:                   49727 ns,    656 iterations
> [26234.912504] find_first_bit:                7107862 ns,    656 iterations
> [26234.919290] find_first_and_bit:              52092 ns,      1 iterations
> [26234.926076] find_next_and_bit:               60856 ns,      1 iterations
> 
> And a third run:
> 
> [26459.679524]
>                Start testing find_bit() with random-filled bitmap
> [26459.689871] find_next_bit:                 2199418 ns, 163311 iterations
> [26459.698798] find_next_zero_bit:            2181289 ns, 164370 iterations
> [26459.707738] find_last_bit:                 2213638 ns, 163311 iterations
> [26459.732224] find_first_bit:               17764152 ns,  16429 iterations
> [26460.133823] find_first_and_bit:          394886375 ns,  32672 iterations
> [26460.141818] find_next_and_bit:             1269693 ns,  73485 iterations
> [26460.148545]
>                Start testing find_bit() with sparse bitmap
> [26460.155433] find_next_bit:                   40753 ns,    653 iterations
> [26460.166307] find_next_zero_bit:            4148211 ns, 327028 iterations
> [26460.173078] find_last_bit:                   50017 ns,    653 iterations
> [26460.187007] find_first_bit:                7205325 ns,    653 iterations
> [26460.193790] find_first_and_bit:              49358 ns,      1 iterations
> [26460.200577] find_next_and_bit:               62332 ns,      1 iterations
> 
> My gut feeling is that yes, there is some variance, but not on an
> order that is significant that would allow us to say "there's no
> difference".
> 
> find_next_bit results for random are: 2222703, 2206687, 2199418,
> which is an average of 2209603 and a variance of around 0.5%.
> The difference between this and the single generic figure I have
> is on the order of 20%.
> 
> I'll do the same with find_first_bit for random: 17722203, 17720726,
> and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
> The difference between this and the single generic figure I have is
> on the order of 5%. Not so large, but still quite a big difference
> compared to the variance.
> 
> find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
> 7291853. Variance is higher at about 4%. Difference between this and
> the generic figure is 0.5%, so this one is not significantly
> different.
> 
> The best result looks to be find_next_zero_bit for the sparse bitmap
> case. The generic code measures 5.5ms, the native code is sitting
> around 4.1ms. That's a difference of around 34%, and by just looking
> at the range in the figures above we can see this is a significant
> result without needing to do the calculations. Similar is true of
> find_next_bit for the sparse bitmap.
> 
> So, I think the results are significant in most cases and variance
> doesn't account for the differences. The only one which isn't is
> find_first_bit for the sparse case.

Hi Russel,

+ Alexey Klimov <klimov.linux@gmail.com>

This is my testings for native vs generic find_bit operations on a15
and 17.

The raw numbers are collected by Alexey Klimov on Odroid-xu3. All cpu
frequencies were fixed at 1000Mhz. (Thanks a lot!)

For each native/generic @ a15/a7 configuration, the find_bit_benchmark 
was run 5 times, and the results are summarized below:

A15                      Native     Generic       Difference
Dense                        ns          ns       %   sigmas
find_next_bit:          3746929     3231641      14      8.3
find_next_zero_bit:     3935354     3202608      19     10.4
find_last_bit:          3134713     3129717       0      0.1
find_first_bit:        85626542    20498669      76    172.4
find_first_and_bit:   409252997   414820417      -1     -0.2
find_next_and_bit:      1678706     1654420       1      0.4
                                              
Sparse                                        
find_next_bit:          143208        77924      46     29.4
find_next_zero_bit:    6893375      6059177      12     14.3
find_last_bit:           67174        68616      -2     -1.0
find_first_bit:       33689256      8151493      76     47.8
find_first_and_bit:     124758       156974     -26     -1.3
find_next_and_bit:       53391        56716      -6     -0.2

A7                      Native      Generic       Difference
Dense                       ns           ns       %   sigmas
find_next_bit:         4207627      5532764     -31    -14.9
find_next_zero_bit:    4259961      5236880     -23    -10.0
find_last_bit:         4281386      4201025       2      1.5
find_first_bit:      236913620     50970424      78    163.3
find_first_and_bit:  728069762    750580781      -3     -0.7
find_next_and_bit:     2696263      2766077      -3     -0.9

Sparse
find_next_bit:          327241       143776      56     40.7
find_next_zero_bit:    6987249     10235989     -46    -21.8
find_last_bit:           97758        94725       3      0.6
find_first_bit:       94628040     21051964      78     41.8
find_first_and_bit:     248133       241267       3      0.3
find_next_and_bit:      136475       154000     -13     -0.5

The last column is the difference between native and generic code
performance normalized to a standard deviation:
        (mean(native) - mean(generic)) / max(std(native), std(generic))

The results look consistent to me because 'and' subtests that are always
generic differ by less than one sigma.

On A15 generic code is a clear winner. On A7 results are inconsistent
although significant. Maybe it's worth to retest on A7.
 
Regarding the choice between native and generic core - I would prefer
generic version even if it's slightly slower because it is tested and
maintained better. And because the results of the test are at least on
par, to me it's a no-brainer.

Would be really interesting to compare performance of your LDRB->LDR
patch with the generic code using the same procedure.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-07-30 21:38                       ` Yury Norov
@ 2022-08-01 15:48                         ` Russell King (Oracle)
  2022-08-01 15:54                           ` Russell King (Oracle)
  0 siblings, 1 reply; 27+ messages in thread
From: Russell King (Oracle) @ 2022-08-01 15:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, Alexey Klimov,
	linux-m68k

Oh FFS.

I see you decided off your own back to remove the ARM version of the
find_bit functions, with NO agreement from the arch maintainer. This
is not on.


On Sat, Jul 30, 2022 at 02:38:38PM -0700, Yury Norov wrote:
> On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
> > On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
> > > On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
> > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
> > > > > On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
> > > > > <linux@armlinux.org.uk> wrote:
> > > > > >
> > > > > > Overall, I would say it's pretty similar (some generic perform
> > > > > > marginally better, some native perform marginally better) with the
> > > > > > exception of find_first_bit() being much better with the generic
> > > > > > implementation, but find_next_zero_bit() being noticably worse.
> > > > >
> > > > > The generic _find_first_bit() code is actually sane and simple. It
> > > > > loops over words until it finds a non-zero one, and then does trivial
> > > > > calculations on that last word.
> > > > >
> > > > > That explains why the generic code does so much better than your byte-wise asm.
> > > > >
> > > > > In contrast, the generic _find_next_bit() I find almost offensively
> > > > > silly - which in turn explains why your byte-wide asm does better.
> > > > >
> > > > > I think the generic _find_next_bit() should actually do what the m68k
> > > > > find_next_bit code does: handle the first special word itself, and
> > > > > then just call find_first_bit() on the rest of it.
> > > > >
> > > > > And it should *not* try to handle the dynamic "bswap and/or bit sense
> > > > > invert" thing at all. That should be just four different (trivial)
> > > > > cases for the first word.
> > > >
> > > > Here's the results for the native version converted to use word loads:
> > > >
> > > > [   37.319937]
> > > >                Start testing find_bit() with random-filled bitmap
> > > > [   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
> > > > [   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
> > > > [   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
> > > > [   37.372564] find_first_bit:               17722203 ns,  16370 iterations
> > > > [   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
> > > > [   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
> > > > [   37.752143]
> > > >                Start testing find_bit() with sparse bitmap
> > > > [   37.759032] find_next_bit:                   41256 ns,    655 iterations
> > > > [   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
> > > > [   37.776675] find_last_bit:                   48742 ns,    655 iterations
> > > > [   37.790961] find_first_bit:                7562371 ns,    655 iterations
> > > > [   37.797743] find_first_and_bit:              47366 ns,      1 iterations
> > > > [   37.804527] find_next_and_bit:               59924 ns,      1 iterations
> > > >
> > > > which is generally faster than the generic version, with the exception
> > > > of the sparse find_first_bit (generic was:
> > > > [   25.657304] find_first_bit:                7328573 ns,    656 iterations)
> > > >
> > > > find_next_{,zero_}bit() in the sparse case are quite a bit faster than
> > > > the generic code.
> > > 
> > > Look at find_{first,next}_and_bit results. Those two have no arch version
> > > and in both cases use generic code. In theory they should be equally fast
> > > before and after, but your testing says that generic case is slower even
> > > for them, and the difference is comparable with real arch functions numbers.
> > > It makes me feel like:
> > >  - there's something unrelated, like governor/throttling that affect results;
> > >  - the numbers are identical, taking the dispersion into account.
> > > 
> > > If the difference really concerns you, I'd suggest running the test
> > > several times
> > > to measure confidence intervals.
> > 
> > Given that the benchmark is run against random bitmaps and with
> > interrupts enabled, there is going to be noise in the results.
> > 
> > Here's the second run:
> > 
> > [26234.429389]
> >                Start testing find_bit() with random-filled bitmap
> > [26234.439722] find_next_bit:                 2206687 ns, 164277 iterations
> > [26234.448664] find_next_zero_bit:            2188368 ns, 163404 iterations
> > [26234.457612] find_last_bit:                 2223742 ns, 164278 iterations
> > [26234.482056] find_first_bit:               17720726 ns,  16384 iterations
> > [26234.859374] find_first_and_bit:          370602019 ns,  32877 iterations
> > [26234.867379] find_next_and_bit:             1280651 ns,  74091 iterations
> > [26234.874107]
> >                Start testing find_bit() with sparse bitmap
> > [26234.881014] find_next_bit:                   46142 ns,    656 iterations
> > [26234.891900] find_next_zero_bit:            4158987 ns, 327025 iterations
> > [26234.898672] find_last_bit:                   49727 ns,    656 iterations
> > [26234.912504] find_first_bit:                7107862 ns,    656 iterations
> > [26234.919290] find_first_and_bit:              52092 ns,      1 iterations
> > [26234.926076] find_next_and_bit:               60856 ns,      1 iterations
> > 
> > And a third run:
> > 
> > [26459.679524]
> >                Start testing find_bit() with random-filled bitmap
> > [26459.689871] find_next_bit:                 2199418 ns, 163311 iterations
> > [26459.698798] find_next_zero_bit:            2181289 ns, 164370 iterations
> > [26459.707738] find_last_bit:                 2213638 ns, 163311 iterations
> > [26459.732224] find_first_bit:               17764152 ns,  16429 iterations
> > [26460.133823] find_first_and_bit:          394886375 ns,  32672 iterations
> > [26460.141818] find_next_and_bit:             1269693 ns,  73485 iterations
> > [26460.148545]
> >                Start testing find_bit() with sparse bitmap
> > [26460.155433] find_next_bit:                   40753 ns,    653 iterations
> > [26460.166307] find_next_zero_bit:            4148211 ns, 327028 iterations
> > [26460.173078] find_last_bit:                   50017 ns,    653 iterations
> > [26460.187007] find_first_bit:                7205325 ns,    653 iterations
> > [26460.193790] find_first_and_bit:              49358 ns,      1 iterations
> > [26460.200577] find_next_and_bit:               62332 ns,      1 iterations
> > 
> > My gut feeling is that yes, there is some variance, but not on an
> > order that is significant that would allow us to say "there's no
> > difference".
> > 
> > find_next_bit results for random are: 2222703, 2206687, 2199418,
> > which is an average of 2209603 and a variance of around 0.5%.
> > The difference between this and the single generic figure I have
> > is on the order of 20%.
> > 
> > I'll do the same with find_first_bit for random: 17722203, 17720726,
> > and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
> > The difference between this and the single generic figure I have is
> > on the order of 5%. Not so large, but still quite a big difference
> > compared to the variance.
> > 
> > find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
> > 7291853. Variance is higher at about 4%. Difference between this and
> > the generic figure is 0.5%, so this one is not significantly
> > different.
> > 
> > The best result looks to be find_next_zero_bit for the sparse bitmap
> > case. The generic code measures 5.5ms, the native code is sitting
> > around 4.1ms. That's a difference of around 34%, and by just looking
> > at the range in the figures above we can see this is a significant
> > result without needing to do the calculations. Similar is true of
> > find_next_bit for the sparse bitmap.
> > 
> > So, I think the results are significant in most cases and variance
> > doesn't account for the differences. The only one which isn't is
> > find_first_bit for the sparse case.
> 
> Hi Russel,
> 
> + Alexey Klimov <klimov.linux@gmail.com>
> 
> This is my testings for native vs generic find_bit operations on a15
> and 17.
> 
> The raw numbers are collected by Alexey Klimov on Odroid-xu3. All cpu
> frequencies were fixed at 1000Mhz. (Thanks a lot!)
> 
> For each native/generic @ a15/a7 configuration, the find_bit_benchmark 
> was run 5 times, and the results are summarized below:
> 
> A15                      Native     Generic       Difference
> Dense                        ns          ns       %   sigmas
> find_next_bit:          3746929     3231641      14      8.3
> find_next_zero_bit:     3935354     3202608      19     10.4
> find_last_bit:          3134713     3129717       0      0.1
> find_first_bit:        85626542    20498669      76    172.4
> find_first_and_bit:   409252997   414820417      -1     -0.2
> find_next_and_bit:      1678706     1654420       1      0.4
>                                               
> Sparse                                        
> find_next_bit:          143208        77924      46     29.4
> find_next_zero_bit:    6893375      6059177      12     14.3
> find_last_bit:           67174        68616      -2     -1.0
> find_first_bit:       33689256      8151493      76     47.8
> find_first_and_bit:     124758       156974     -26     -1.3
> find_next_and_bit:       53391        56716      -6     -0.2
> 
> A7                      Native      Generic       Difference
> Dense                       ns           ns       %   sigmas
> find_next_bit:         4207627      5532764     -31    -14.9
> find_next_zero_bit:    4259961      5236880     -23    -10.0
> find_last_bit:         4281386      4201025       2      1.5
> find_first_bit:      236913620     50970424      78    163.3
> find_first_and_bit:  728069762    750580781      -3     -0.7
> find_next_and_bit:     2696263      2766077      -3     -0.9
> 
> Sparse
> find_next_bit:          327241       143776      56     40.7
> find_next_zero_bit:    6987249     10235989     -46    -21.8
> find_last_bit:           97758        94725       3      0.6
> find_first_bit:       94628040     21051964      78     41.8
> find_first_and_bit:     248133       241267       3      0.3
> find_next_and_bit:      136475       154000     -13     -0.5
> 
> The last column is the difference between native and generic code
> performance normalized to a standard deviation:
>         (mean(native) - mean(generic)) / max(std(native), std(generic))
> 
> The results look consistent to me because 'and' subtests that are always
> generic differ by less than one sigma.
> 
> On A15 generic code is a clear winner. On A7 results are inconsistent
> although significant. Maybe it's worth to retest on A7.
>  
> Regarding the choice between native and generic core - I would prefer
> generic version even if it's slightly slower because it is tested and
> maintained better. And because the results of the test are at least on
> par, to me it's a no-brainer.
> 
> Would be really interesting to compare performance of your LDRB->LDR
> patch with the generic code using the same procedure.
> 
> Thanks,
> Yury
> 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Linux 5.19-rc8
  2022-08-01 15:48                         ` Russell King (Oracle)
@ 2022-08-01 15:54                           ` Russell King (Oracle)
  0 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2022-08-01 15:54 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Dennis Zhou, Guenter Roeck, Catalin Marinas,
	Linux Kernel Mailing List, Geert Uytterhoeven, Alexey Klimov,
	linux-m68k

On Mon, Aug 01, 2022 at 04:48:21PM +0100, Russell King (Oracle) wrote:
> Oh FFS.
> 
> I see you decided off your own back to remove the ARM version of the
> find_bit functions, with NO agreement from the arch maintainer. This
> is not on.

Sorry, my mistake, I'm getting confused with git over conflicts which
aren't making much sense.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-08-01 15:55 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-24 20:42 Linux 5.19-rc8 Linus Torvalds
2022-07-25 16:11 ` Guenter Roeck
2022-07-25 17:55   ` Linus Torvalds
2022-07-25 18:49     ` Linus Torvalds
2022-07-25 20:35       ` Yury Norov
2022-07-25 20:40         ` Linus Torvalds
2022-07-26 15:51           ` Yury Norov
2022-07-25 19:41     ` Yury Norov
2022-07-26  9:12     ` Russell King (Oracle)
2022-07-26 15:35       ` Yury Norov
2022-07-28 18:28       ` Russell King (Oracle)
2022-07-29  0:11         ` Guenter Roeck
2022-07-26 17:39     ` Dennis Zhou
2022-07-26 17:51       ` Linus Torvalds
2022-07-26 18:18         ` Yury Norov
2022-07-26 18:36           ` Linus Torvalds
2022-07-26 19:44             ` Russell King (Oracle)
2022-07-26 20:20               ` Linus Torvalds
2022-07-27  0:15                 ` Russell King (Oracle)
2022-07-27  1:33                   ` Yury Norov
2022-07-27  7:43                     ` Russell King (Oracle)
2022-07-30 21:38                       ` Yury Norov
2022-08-01 15:48                         ` Russell King (Oracle)
2022-08-01 15:54                           ` Russell King (Oracle)
2022-07-27  7:46                     ` David Laight
2022-07-25 20:34 ` Build regressions/improvements in v5.19-rc8 Geert Uytterhoeven
2022-07-25 20:39   ` Geert Uytterhoeven

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).