linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Serge Semin <fancer.lancer@gmail.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Michal Simek <michal.simek@amd.com>,
	 Alexander Stein <alexander.stein@ew.tq-group.com>,
	Tony Luck <tony.luck@intel.com>,
	 James Morse <james.morse@arm.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	 Robert Richter <rric@kernel.org>,
	Dinh Nguyen <dinguyen@kernel.org>,
	 Punnaiah Choudary Kalluri <punnaiah.choudary.kalluri@xilinx.com>,
	Arnd Bergmann <arnd@arndb.de>,
	 Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	linux-arm-kernel@lists.infradead.org, linux-edac@vger.kernel.org,
	 linux-kernel@vger.kernel.org, Sherry Sun <sherry.sun@nxp.com>,
	Borislav Petkov <bp@suse.de>
Subject: Re: [PATCH v5 01/20] EDAC/synopsys: Fix ECC status data and IRQ disable race condition
Date: Thu, 25 Apr 2024 15:52:38 +0300	[thread overview]
Message-ID: <whgp2xx4dv3szezz3bvmgutgazz6kvie3q7rgpr35zqzuzsygk@wppqzusteru4> (raw)
In-Reply-To: <20240421100712.GAZiTlUOm1hrLQvaMi@fat_crate.local>

On Sun, Apr 21, 2024 at 12:07:30PM +0200, Borislav Petkov wrote:
> On Tue, Apr 16, 2024 at 01:06:11PM +0300, Serge Semin wrote:
> > It looks indeed crazy because the method is called enable_intr() and
> > is called in the IRQ handler. Right, re-enabling the IRQ in the handler
> > doesn't look good. But under the hood it was just a way to fix the
> > problem described in the commit you cited. enable_intr() just gets
> > back the IRQ Enable flags cleared a bit before in the
> > zynqmp_get_error_info() method.
> > 
> > The root cause of the problem is that the IRQ status/clear flags:
> > ECCCLR.ecc_corrected_err_clr	(R/W1C)
> > ECCCLR.ecc_uncorrected_err_clr	(R/W1C)
> > ECCCLR.ecc_corr_err_cnt_clr	(R/W1C)
> > ECCCLR.ecc_uncorr_err_cnt_clr	(R/W1C)
> > etc
> > 
> > and the IRQ enable/disable flags (since v3.10a):
> > ECCCLR.ecc_corrected_err_intr_en	(R/W)
> > ECCCLR.ecc_uncorrected_err_intr_en	(R/W)
> > 
> > reside in a single register - ECCCLR (Synopsys has renamed it to
> > ECCCTL since v3.10a due to adding the IRQ En/Dis flags).
> > 
> > Thus any concurrent access to that CSR like "Clear IRQ
> > status/counters" and "IRQ disable/enable" need to be protected from
> > the race condition.
> 
> Ok, let's pick this apart one-by-one. I'll return to the rest you're
> explaining as needed.
> 
> So, can writes to the status/counter bits while writing the *same* bit
> to the IRQ enable/disable bit prevent any race conditions?

No, because the clear and enable/disable bits belong to the same CSR.
While you are writing the clear+same/enable bits, the concurrent IO
may have changed the same/enable bits. Like this:

     IRQ-handler                        |    IRQ-disabler
                                        |
 tmp = clear_sts_bits | enable_irq_bits;|
                                        | ECCCLR = 0; // disable IRQ
 ECCCLR = tmp;                          |
----------------------------------------+--------------------------------------

As a result even though the IRQ-disabler cleared the IRQ-enable bits,
the IRQ-handler got them back to being set. The same will happen if we
get to write the *same* bits in the handler:

     IRQ-handler                        |    IRQ-disabler
                                        |
 tmp = ECCCLR | clear_sts_bits;         |
                                        | ECCCLR = 0; // disable IRQs
 ECCCLR = tmp;                          |
----------------------------------------+--------------------------------------

The last example is almost the same as what happens at the moment and
what I am fixing in this patch. The difference is that there is a
greater number of ECCCLR CSR changes performed in the IRQ-handler
context, which makes the critical section even wider than it could be:

     IRQ-handler                        |    IRQ-disabler
                                        |
zynqmp_get_error_info:                  |
 ECCCLR = clear_sts_bits;               |
 ECCCLR = 0; // actually redundant      |
...                                     | ECCCLR = 0; // disable IRQs
enable_intr:                            |
 ECCCLR = enable_irq_bits;              |
----------------------------------------+--------------------------------------

> 
> Meaning, you only change the status and counter bits and you preserve
> the same value in the IRQ disable/enable bit?

AFAICS this won't help to solve the race condition because writing the
preserved value of the enable/disable bits is the cause of the race
condition. The critical section is in concurrent flushing of different
values to the ECCCLR.*en bits. The only ways to solve that are:
1. prevent the concurrent access
2. serialize the critical section

> 
> IOW, I'm thinking of shadowing that ECCCTL in software so that we update
> it from the shadowed value.

I don't see the shadowing will help to prevent what is happening
unless you know some shadow-register pattern I am not aware of. AFAIR
the shadow register is normally utilized for the cases of:
1. read ops returns an incorrect value or a CSR couldn't be read
2. IO bus is too slow in order to speed-up the RMW-pattern
In any case the shadowed value and the process of the data flushing
would need to be protected with a lock anyway in order to sync the
shadow register content and the actual value written to the CSR.

> 
> Because, AFAIU, the spinlock won't help if you grab it, clear the
> status/counter bits and disable the interrupt in the process. You want
> to only clear the status/counter bits and leave the interrupt enabled.
> 
> Right?

Right, but the spinlock will help. What I need to do deal with two
concurrent operations:
IRQ-handler:  clear the status/counter bits and leave the IRQ enable
              bits as is.
IRQ-disabler: clear the IRQ enable bits
These actions need to be serialized in order to prevent the race
condition.

> 
> IOW, in one single write you do:
> 
> ECCCLR.ecc_corrected_err_clr=1
> ECCCLR.ecc_uncorrected_err_clr=1
> ECCCLR.ecc_corr_err_cnt_clr=1
> ECCCLR.ecc_uncorr_err_cnt_clr=1
> ECCCLR.ecc_corrected_err_intr_en=1
> ECCCLR.ecc_uncorrected_err_intr_en=1
> 
> ?

This won't be help because the concurrent IRQ-disabler could have
already cleared the IRQ enable bits while the IRQ-handler is being
executed and about to write to the ECCCLR register. Like this:

     IRQ-handler                        |    IRQ-disabler
                                        |
 tmp = clear_sts_bits | enable_irq_bits;|
                                        | ECCCLR = 0; // disable IRQ
 ECCCLR = tmp;                          |
----------------------------------------+--------------------------------------

Even if we get to add the spin-lock serializing the ECCCLR writes it
won't solve the problem since the IRQ-disabler critical section could
be executed a bit before the IRQ-handler critical section so the later
one will just re-enable the IRQs disabled by the former one.

Here is what is suggested in my patch to fix the problem:

     IRQ-handler                        |    IRQ-disabler
                                        |
zynqmp_get_error_info:                  |
                                        | lock_irqsave
                                        | ECCCLR = 0; // disable IRQs
                                        | unlock_irqrestore
 lock_irqsave;                          |
 tmp = ECCCLR | clear_sts_bits;         |
 ECCCLR = tmp;                          |
 unlock_irqrestore;                     |
----------------------------------------+--------------------------------------

See, the IRQ-status/counters clearing and IRQ disabling processes are
serialized so the former one wouldn't override the values written by
the later one.

Here is the way it would have looked in case of the shadow-register
implementation:

     IRQ-handler                        |    IRQ-disabler
                                        |
zynqmp_get_error_info:                  |
                                        | lock_irqsave
                                        | shadow_en_bits = 0;
                                        | ECCCLR = shadow_en_bits; // disable IRQs
                                        | unlock_irqrestore
 lock_irqsave;                          |
 tmp = clear_sts_bits | shadow_en_bits; |
 ECCCLR = tmp;                          |
 unlock_irqrestore;                     |
----------------------------------------+--------------------------------------

The shadow-register pattern just prevents one ECCCLR read op. The
shadowed data sync would have needed the serialization anyway. Seeing
the DW DDR uMCTL2 controller CSRs are always memory mapped, I don't
see using the shadow-register CSR would worth being implemented unless
you meant something different.

-Serge(y)

> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

  reply	other threads:[~2024-04-25 12:52 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-22 18:12 [PATCH v5 00/20] EDAC/mc/synopsys: Various fixes and cleanups Serge Semin
2024-02-22 18:12 ` [PATCH v5 01/20] EDAC/synopsys: Fix ECC status data and IRQ disable race condition Serge Semin
2024-04-15 18:36   ` Borislav Petkov
2024-04-16 10:06     ` Serge Semin
2024-04-21 10:07       ` Borislav Petkov
2024-04-25 12:52         ` Serge Semin [this message]
2024-05-06 10:20           ` Borislav Petkov
2024-05-06 11:27             ` Serge Semin
2024-05-06 12:12               ` Borislav Petkov
2024-02-22 18:12 ` [PATCH v5 02/20] EDAC/synopsys: Fix generic device type detection procedure Serge Semin
2024-02-22 18:12 ` [PATCH v5 03/20] EDAC/synopsys: Fix mci->scrub_cap field setting Serge Semin
2024-02-22 18:12 ` [PATCH v5 04/20] EDAC/synopsys: Drop erroneous ADDRMAP4.addrmap_col_b10 parse Serge Semin
2024-02-22 18:12 ` [PATCH v5 05/20] EDAC/synopsys: Fix reading errors count before ECC status Serge Semin
2024-02-22 18:12 ` [PATCH v5 06/20] EDAC/synopsys: Fix misleading IRQ self-cleared quirk flag Serge Semin
2024-02-22 18:12 ` [PATCH v5 07/20] EDAC/synopsys: Use platform device devm ioremap method Serge Semin
2024-02-22 18:12 ` [PATCH v5 08/20] EDAC/synopsys: Drop internal CE and UE counters Serge Semin
2024-02-22 18:12 ` [PATCH v5 09/20] EDAC/synopsys: Drop local to_mci() macro definition Serge Semin
2024-02-22 18:12 ` [PATCH v5 10/20] EDAC/synopsys: Drop struct ecc_error_info.blknr field Serge Semin
2024-02-22 18:12 ` [PATCH v5 11/20] EDAC/synopsys: Shorten out struct ecc_error_info.bankgrpnr field name Serge Semin
2024-02-22 18:12 ` [PATCH v5 12/20] EDAC/synopsys: Drop redundant info from the error messages Serge Semin
2024-02-22 18:12 ` [PATCH v5 13/20] EDAC/mc: Init DIMM labels in MC registration method Serge Semin
2024-02-22 18:12 ` [PATCH v5 14/20] EDAC/mc: Add generic unique MC index allocation procedure Serge Semin
2024-02-22 18:13 ` [PATCH v5 15/20] EDAC/mc: Re-use " Serge Semin
2024-02-22 18:13 ` [PATCH v5 16/20] EDAC/synopsys: Detach Zynq A05 DDRC support to separate driver Serge Semin
2024-02-22 18:13 ` [PATCH v5 17/20] EDAC/synopsys: Drop unused platform-specific setup API Serge Semin
2024-02-22 18:13 ` [PATCH v5 18/20] EDAC/synopsys: Unify CSRs macro declarations Serge Semin
2024-02-22 18:13 ` [PATCH v5 19/20] EDAC/synopsys: Unify struct/macro/function prefixes Serge Semin
2024-02-22 18:13 ` [PATCH v5 20/20] EDAC/synopsys: Convert to using BIT/GENMASK/FIELD_x macros Serge Semin
2024-03-06  5:27 ` [PATCH v5 00/20] EDAC/mc/synopsys: Various fixes and cleanups Shubhrajyoti Datta

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=whgp2xx4dv3szezz3bvmgutgazz6kvie3q7rgpr35zqzuzsygk@wppqzusteru4 \
    --to=fancer.lancer@gmail.com \
    --cc=alexander.stein@ew.tq-group.com \
    --cc=arnd@arndb.de \
    --cc=bp@alien8.de \
    --cc=bp@suse.de \
    --cc=dinguyen@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=james.morse@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@kernel.org \
    --cc=michal.simek@amd.com \
    --cc=punnaiah.choudary.kalluri@xilinx.com \
    --cc=rric@kernel.org \
    --cc=sherry.sun@nxp.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).