Hello,

I have seen a ~3 ms delay in interrupt handling on ARM64. I have traced it down to the raw_spin_lock() call in handle_irq_event() in kernel/irq/handle.c:

irqreturn_t handle_irq_event(struct irq_desc *desc)
{
	irqreturn_t ret;

	desc->istate &= ~IRQS_PENDING;
	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
	raw_spin_unlock(&desc->lock);

	ret = handle_irq_event_percpu(desc);

-->	raw_spin_lock(&desc->lock);
	irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);

	return ret;
}

It took ~3 ms for this raw_spin_lock() to lock. During that time, irq_finalize_oneshot() from kernel/irq/manage.c locked and unlocked the same raw spin lock more than 1000 times:

static void irq_finalize_oneshot(struct irq_desc *desc,
				 struct irqaction *action)
{
	if (!(desc->istate & IRQS_ONESHOT) ||
	    action->handler == irq_forced_secondary_handler)
		return;
again:
	chip_bus_lock(desc);
-->	raw_spin_lock_irq(&desc->lock);

	/*
	 * Implausible though it may be we need to protect us against
	 * the following scenario:
	 *
	 * The thread is faster done than the hard interrupt handler
	 * on the other CPU. If we unmask the irq line then the
	 * interrupt can come in again and masks the line, leaves due
	 * to IRQS_INPROGRESS and the irq line is masked forever.
	 *
	 * This also serializes the state of shared oneshot handlers
	 * versus "desc->threads_oneshot |= action->thread_mask;" in
	 * irq_wake_thread(). See the comment there which explains the
	 * serialization.
	 */
	if (unlikely(irqd_irq_inprogress(&desc->irq_data))) {
-->		raw_spin_unlock_irq(&desc->lock);
		chip_bus_sync_unlock(desc);
		cpu_relax();
		goto again;
	}
	...

I have created a workaround for this problem: after 100 failed locking attempts, the locker backs off by calling cpu_relax() 50 times. See the attached patch 3ms_tx_delay_workaround.patch.

To narrow the problem down, I created a custom kernel module with two threads: one similar to irq_finalize_oneshot() and a second similar to handle_irq_event(). I used the latest Linux 6.3-rc3 with no added patches and confirmed that even there qspinlocks are not fair on my ARM64 board. I copied the qspinlock code into the module twice and put traces only into one thread, the one which takes several ms to lock, i.e. the code originally called from handle_irq_event(). I found out that queued_fetch_set_pending_acquire() takes those 3 ms to finish. On ARM64, queued_fetch_set_pending_acquire() is implemented as atomic_fetch_or_acquire(). My CPU does not support the LSE atomic instructions, and it looks like atomic operations can be quite slow there: the assembler code in arch/arm64/include/asm/atomic_ll_sc.h has a retry loop inside:

#define ATOMIC_FETCH_OP(name, mb, acq, rel, cl, op, asm_op, constraint)	\
static __always_inline int						\
__ll_sc_atomic_fetch_##op##name(int i, atomic_t *v)			\
{									\
	unsigned long tmp;						\
	int val, result;						\
									\
	asm volatile("// atomic_fetch_" #op #name "\n"			\
	"	prfm	pstl1strm, %3\n"				\
	"1:	ld" #acq "xr	%w0, %3\n"				\
	"	" #asm_op "	%w1, %w0, %w4\n"			\
	"	st" #rel "xr	%w2, %w1, %3\n"				\
-->	"	cbnz	%w2, 1b\n"					\
	"	" #mb							\
	: "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (v->counter)	\
	: __stringify(constraint) "r" (i)				\
	: cl);								\
									\
	return result;							\
}

Most importantly, these atomic operations seem to let one CPU dominate the cache line, so that the other CPU is unable to take the lock. That is problematic in combination with the retry loop in irq_finalize_oneshot(). To confirm it, I created a small userspace program which just calls __ll_sc_atomic_fetch_or_acquire() from two threads. See the attached unfair_arm64_asm_atomic_ll_sc_demonstration.tar.gz.
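In case you do not want to open the tarball, the core of the test is essentially the following. This is a simplified sketch of the idea rather than the attached source; the CPU numbers, helper names, and thread pinning are my reconstruction here, and it builds only on AArch64:

/*
 * Two threads on different CPUs hammer the same word with an LL/SC
 * fetch-or, hand-expanded from the _acquire variant of ATOMIC_FETCH_OP
 * above (ldaxr + stxr, no trailing dmb). The evaluation thread times
 * every single call and reports new maxima.
 *
 * Build (on ARM64): gcc -O2 -pthread -o contested contested.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static int counter;

static inline int ll_sc_fetch_or_acquire(int i, int *v)
{
	unsigned long tmp;
	int val, result;

	asm volatile(
	"	prfm	pstl1strm, %3\n"
	"1:	ldaxr	%w0, %3\n"
	"	orr	%w1, %w0, %w4\n"
	"	stxr	%w2, %w1, %3\n"
	"	cbnz	%w2, 1b\n"
	: "=&r" (result), "=&r" (val), "=&r" (tmp), "+Q" (*v)
	: "r" (i)
	: "memory");

	return result;
}

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Plays the role of the irq_finalize_oneshot() retry loop. */
static void *load_thread(void *arg)
{
	pin_to_cpu(0);
	printf("load thread started\n");
	for (;;)
		ll_sc_fetch_or_acquire(1, &counter);
	return NULL;
}

/* Plays the role of the contended raw_spin_lock() in handle_irq_event(). */
static void *evaluation_thread(void *arg)
{
	uint64_t t0, d, max = 0;

	pin_to_cpu(1);
	printf("evaluation thread started\n");
	for (;;) {
		t0 = now_ns();
		ll_sc_fetch_or_acquire(1, &counter);
		d = now_ns() - t0;
		if (d > max) {
			max = d;
			printf("new max duration: %llu ns\n",
			       (unsigned long long)d);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t load, eval;

	pthread_create(&load, NULL, load_thread, NULL);
	pthread_create(&eval, NULL, evaluation_thread, NULL);
	pthread_join(load, NULL);
	return 0;
}

Below you can see that it took 16 ms for one atomic operation: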
# ./contested
load thread started
evaluation thread started
new max duration: 6420 ns
new max duration: 9355 ns
new max duration: 22240 ns
new max duration: 23180 ns
new max duration: 70465 ns
new max duration: 77860 ns
new max duration: 83100 ns
new max duration: 105115 ns
new max duration: 127695 ns
new max duration: 128840 ns
new max duration: 1265595 ns
new max duration: 3713430 ns
new max duration: 3750810 ns
new max duration: 7996020 ns
new max duration: 7998890 ns
new max duration: 7999340 ns
new max duration: 7999490 ns
new max duration: 12000210 ns
new max duration: 15999700 ns
new max duration: 16000000 ns
new max duration: 16000030 ns

So I confirmed that the atomic operations from arch/arm64/include/asm/atomic_ll_sc.h can be quite slow when they are contended from a second CPU.

Do you think it is possible to create a fair qspinlock implementation on top of the atomic instructions available on ARMv8.0 (no LSE atomics) without compromising performance in the uncontended case? For example, ARM64 could have a custom queued_fetch_set_pending_acquire() implementation, the same way x86 has one in arch/x86/include/asm/qspinlock.h.
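For reference, this is roughly what the x86 override looks like; I am quoting it from memory, so please check the tree for the exact version. Instead of a full fetch_or on the whole lock word, it sets just the pending bit with a locked btsl and reads the rest of the word separately:

static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
{
	u32 val;

	val = GEN_BINARY_RMWcc(LOCK_PREFIX "btsl", lock->val.counter, c,
			       "I", _Q_PENDING_BIT) * _Q_PENDING_VAL;
	val |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;

	return val;
}

Is the retry loop in irq_finalize_oneshot() OK together with the current ARM64 cpu_relax() implementation on a processor with no LSE atomic instructions? If I read arch/arm64/include/asm/processor.h correctly, cpu_relax() there is just a YIELD hint, which I understand is effectively a NOP on a non-SMT core like the Cortex-A53, so the retry loop spins at full speed.

I reproduced the real-life TX delay scenario only with the ICSSG network driver (not yet merged to mainline) [1], on kernel 5.10 with patches, CONFIG_PREEMPT_RT, and custom ICSSG firmware on a Texas Instruments AM65x IDK [2] with an ARM Cortex-A53. This custom setup comes with a high interrupt load.

[1] https://lore.kernel.org/all/20220406094358.7895-1-p-mohan@ti.com/
[2] https://www.ti.com/tool/TMDX654IDKEVM

With best regards,

Zdenek Bouska

-- 
Siemens, s.r.o
Siemens Advanta Development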